## Rainfall Dataset for Uganda

This experiment predicts 10-day (dekadal) rainfall and examines the relationships between the features of the dataset.

This dataset contains dekadal rainfall indicators computed from Climate Hazards Group InfraRed Precipitation satellite imagery with in-situ station data (CHIRPS) version 2, aggregated by subnational administrative units.

Included indicators are (for each dekad):

- 10-day rainfall [mm] (rfh)
- rainfall 1-month rolling aggregation [mm] (r1h)
- rainfall 3-month rolling aggregation [mm] (r3h)
- rainfall long term average [mm] (rfh_avg)
- rainfall 1-month rolling aggregation long term average [mm] (r1h_avg)
- rainfall 3-month rolling aggregation long term average [mm] (r3h_avg)
- rainfall anomaly [%] (rfq)
- rainfall 1-month anomaly [%] (r1q)
- rainfall 3-month anomaly [%] (r3q)
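
How these indicators relate to one another can be sketched in a few lines of pandas. The example assumes that a month spans three dekads (so `r1h` sums the last 3 dekads and `r3h` the last 9) and that an anomaly is the observed value expressed as a percentage of its long-term average. Both are assumptions about how the indicators are defined, not details confirmed by the dataset description, and the rainfall values are made up.

```python
import pandas as pd

# Hypothetical dekadal rainfall [mm] for one administrative unit (made-up values)
toy = pd.DataFrame({
    "rfh": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0],
    "rfh_avg": [20.0, 20.0, 30.0, 50.0, 50.0, 40.0, 70.0, 100.0, 60.0],
})

# Assumption: 1 month = 3 dekads, 3 months = 9 dekads
toy["r1h"] = toy["rfh"].rolling(window=3).sum()
toy["r3h"] = toy["rfh"].rolling(window=9).sum()

# Assumption: anomaly = observed rainfall as a percentage of the long-term average
toy["rfq"] = 100 * toy["rfh"] / toy["rfh_avg"]

print(toy)
```

The first rows of the rolling columns are `NaN` because a full window is not yet available, which may be one source of null values in the real data.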

The administrative units used for aggregation are based on WFP data, and each unit carries a Pcode reference. The number of input pixels used to create each aggregate is provided in the `n_pixels` column.
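
Because each row records the number of contributing pixels, `n_pixels` can serve as a weight when several units are combined into a larger area. The following is a minimal sketch of a pixel-weighted mean; the `region` column and all values are hypothetical, and it assumes every pixel covers roughly the same area.

```python
import pandas as pd

# Hypothetical rows: two regions, each made of units with different pixel counts
units = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "n_pixels": [10, 30, 20, 20],
    "rfh": [40.0, 80.0, 10.0, 30.0],
})

# Pixel-weighted mean per region: sum(rfh * n_pixels) / sum(n_pixels)
weighted_sum = (units["rfh"] * units["n_pixels"]).groupby(units["region"]).sum()
pixel_total = units.groupby("region")["n_pixels"].sum()
regional_rfh = weighted_sum / pixel_total

print(regional_rfh)
```

A plain mean would treat a 10-pixel unit and a 30-pixel unit equally; weighting by pixel count lets larger units contribute in proportion to their size.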

[Reference](https://data.humdata.org/dataset/uga-rainfall-subnational) - Uganda Rainfall dataset (https://data.humdata.org)

## Import necessary modules

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
import numpy as np
from sklearn.preprocessing import StandardScaler

## Import the data

In [2]:
df = pd.read_csv("rainfall_dataset.csv")

## Data Preprocessing

In [3]:
# Find columns where more than 50% of the values are null
null_percentage = df.isnull().mean() * 100
null_cols = null_percentage[null_percentage > 50]
if not null_cols.empty:
    # Drop the mostly-null columns
    df = df.drop(columns=null_cols.index)
    print("Null columns removed:", null_cols.index.tolist())
else:
    print("No null columns with nulls greater than 50%")

# Drop any remaining rows that contain nulls
df = df.dropna()

# Drop preliminary records (rows where 'version' is 'prelim')
df = df[df['version'] != 'prelim']

# Drop the identifier and version columns; reassigning avoids
# modifying a filtered slice in place
df = df.drop(columns=["adm2_id", "ADM2_PCODE", "version"])

No null columns with nulls greater than 50%


## Feature Scaling

In [4]:
# Columns to be scaled: every column except the first
selected_columns = df.columns[1:]
df_selected = df[selected_columns]

# Standardize the selected columns to zero mean and unit variance
scaler = StandardScaler()
df[selected_columns] = scaler.fit_transform(df[selected_columns])
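
Fitting the scaler on the full table means that rows later used for validation influence the scaling statistics. One way to avoid this, sketched below, is to wrap `StandardScaler` and `LinearRegression` in a scikit-learn pipeline so the scaler is refit on the training part of every `KFold` split. The synthetic data, the choice of `rfh` as the target, and the use of R² as the score are assumptions made for illustration, not the notebook's actual setup.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the rainfall table (made-up values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "r1h": rng.uniform(0, 300, size=200),
    "rfh_avg": rng.uniform(0, 100, size=200),
})
# Assumption: rfh is the prediction target; here it is a noisy linear mix
toy["rfh"] = 0.3 * toy["r1h"] + 0.5 * toy["rfh_avg"] + rng.normal(0, 1, size=200)

X = toy[["r1h", "rfh_avg"]]
y = toy["rfh"]

# The pipeline refits the scaler on each training fold only
model = make_pipeline(StandardScaler(), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")

print(scores.mean())
```

In this sketch only the features are scaled and the target stays in millimetres, which keeps predictions directly interpretable.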