## Rainfall Weather Forecasting project by Francis Afuwah
Batch DS2312

## Introduction
This report details the development and evaluation of two predictive models using machine learning to analyze rainfall data. The primary goals were to create:

1. A binary classification model to forecast whether it will rain the next day.
2. A regression model to predict the amount of rainfall for the next day.

The dataset, `Rainfall.csv`, contains daily weather observations from various locations.

## Imports

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score


In [2]:
# Load the data
data_path = "Rainfall.csv"
rainfall_data = pd.read_csv(data_path)

## View dataset

In [3]:
# Display the first few rows
rainfall_data.head()

         Date Location  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine  \
0  2008-12-01   Albury     13.4     22.9       0.6          NaN       NaN
1  2008-12-02   Albury      7.4     25.1       0.0          NaN       NaN
2  2008-12-03   Albury     12.9     25.7       0.0          NaN       NaN
3  2008-12-04   Albury      9.2     28.0       0.0          NaN       NaN
4  2008-12-05   Albury     17.5     32.3       1.0          NaN       NaN

  WindGustDir  WindGustSpeed WindDir9am  ... Humidity9am  Humidity3pm  \
0           W           44.0          W  ...        71.0         22.0
1         WNW           44.0        NNW  ...        44.0         25.0
2         WSW           46.0          W  ...        38.0         30.0
3          NE           24.0         SE  ...        45.0         16.0
4           W           41.0        ENE  ...        82.0         33.0

   Pressure9am  Pressure3pm  Cloud9am  Cloud3pm  Temp9am  Temp3pm  RainToday  \
0

In [4]:
# summary of the data
rainfall_data.describe()

| Statistic | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8350.0 | 8365.0 | 8185.0 | 4913.0 | 4431.0 | 7434.0 | 8349.0 | 8318.0 | 8366.0 | 8323.0 | 7116.0 | 7113.0 | 6004.0 | 5970.0 | 8369.0 | 8329.0 |
| mean | 13.193305 | 23.859976 | 2.805913 | 5.389395 | 7.632205 | 40.174469 | 13.847646 | 18.533662 | 67.822496 | 51.24979 | 1017.640233 | 1015.236075 | 4.566622 | 4.503183 | 17.762015 | 22.442934 |
| std | 5.403596 | 6.136408 | 10.459379 | 5.044484 | 3.896235 | 14.665721 | 10.174579 | 9.766986 | 16.833283 | 18.423774 | 6.828699 | 6.766681 | 2.877658 | 2.731659 | 5.627035 | 5.98002 |
| min | -2.0 | 8.2 | 0.0 | 0.0 | 0.0 | 7.0 | 0.0 | 0.0 | 10.0 | 6.0 | 989.8 | 982.9 | 0.0 | 0.0 | 1.9 | 7.3 |
| 25% | 9.2 | 19.3 | 0.0 | 2.6 | 4.75 | 30.0 | 6.0 | 11.0 | 56.0 | 39.0 | 1013.0 | 1010.4 | 1.0 | 2.0 | 13.8 | 18.0 |
| 50% | 13.3 | 23.3 | 0.0 | 4.6 | 8.7 | 39.0 | 13.0 | 19.0 | 68.0 | 51.0 | 1017.7 | 1015.3 | 5.0 | 5.0 | 17.8 | 21.9 |
| 75% | 17.4 | 28.0 | 1.0 | 7.0 | 10.7 | 50.0 | 20.0 | 24.0 | 80.0 | 63.0 | 1022.3 | 1019.8 | 7.0 | 7.0 | 21.9 | 26.4 |
| max | 28.5 | 45.5 | 371.0 | 145.0 | 13.9 | 107.0 | 63.0 | 83.0 | 100.0 | 99.0 | 1039.0 | 1036.0 | 8.0 | 8.0 | 39.4 | 44.1 |


## Data Exploration and Preprocessing
The dataset includes weather-related features such as temperature, humidity, wind speed, and atmospheric pressure. Several columns have substantial missing data: of the 8,425 rows, Evaporation has only 4,913 non-missing values and Sunshine only 4,431 (about 42% and 47% missing, respectively), so the data must be preprocessed before modeling.
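The share of missing values per column can be checked directly. The following is a minimal sketch on a toy DataFrame; the helper name `missing_fraction` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_fraction(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy stand-in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "MinTemp": [13.4, 7.4, 12.9, 9.2],
    "Evaporation": [np.nan, np.nan, 4.6, np.nan],
    "Sunshine": [np.nan, 8.7, np.nan, np.nan],
})

fractions = missing_fraction(toy)
print(fractions)
```

In the notebook, the same helper could be called as `missing_fraction(rainfall_data)` to rank the columns by how much imputation they need.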

The preprocessing steps included:

1. Handling missing values by imputation (the column mean for numerical features, a constant `missing` category for categorical features).
2. Encoding categorical variables using one-hot encoding.
3. Standardizing numerical data to facilitate model convergence.

## Model Building

In [5]:
# Separate features and targets
X = rainfall_data.drop(columns=['RainTomorrow', 'Rainfall'])
# Map 'Yes' to 1 and everything else (including missing labels) to 0
y_classification = (rainfall_data['RainTomorrow'] == 'Yes').astype(int)
# Regression target: the Rainfall column as recorded (same-day amount), with missing values filled with 0
y_regression = rainfall_data['Rainfall'].fillna(0)

# Identify categorical and numerical columns
# Note: Date is loaded as text, so it is treated as a categorical column here
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X.select_dtypes(exclude=['object']).columns.tolist()

# Create the preprocessor for numerical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Replace missing values with the mean
    ('scaler', StandardScaler())  # Standardize data
])
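The regression target `y_regression` defined above is taken from the `Rainfall` column of the same row. If the aim is the next day's amount, as stated in the introduction, one possible approach is to shift `Rainfall` back by one row within each location. This is a hedged sketch, not the notebook's method: it assumes rows are daily and consecutive per `Location`, and the column name `RainfallTomorrow`, the helper name, and the second toy location are hypothetical.

```python
import pandas as pd


def add_next_day_rainfall(df: pd.DataFrame) -> pd.DataFrame:
    """Add a hypothetical RainfallTomorrow column: next row's Rainfall per Location."""
    out = df.sort_values(["Location", "Date"]).copy()
    # shift(-1) pulls the following day's value up onto the current row
    out["RainfallTomorrow"] = out.groupby("Location")["Rainfall"].shift(-1)
    # The last day of each location has no "tomorrow" and is dropped
    return out.dropna(subset=["RainfallTomorrow"])


# Toy data; "Sydney" is an illustrative second location
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2008-12-01", "2008-12-02", "2008-12-03", "2008-12-01", "2008-12-02"]
    ),
    "Location": ["Albury", "Albury", "Albury", "Sydney", "Sydney"],
    "Rainfall": [0.6, 0.0, 1.0, 5.0, 2.0],
})

shifted = add_next_day_rainfall(toy)
print(shifted)
```

If dates have gaps, the shifted value is the next recorded day rather than the next calendar day, so a gap check would be needed before relying on this target.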


In [6]:
# Create preprocessors for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),  # Replace missing values with 'missing'
    ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode
])

In [7]:
# Combine preprocessors in a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

In [8]:
# Split data into training and testing sets
X_train, X_test, y_class_train, y_class_test, y_reg_train, y_reg_test = train_test_split(
    X, y_classification, y_regression, test_size=0.3, random_state=42)


In [9]:
# Fit the preprocessor on the training data only, then apply it to both sets
X_train_transformed = preprocessor.fit_transform(X_train)
X_test_transformed = preprocessor.transform(X_test)

In [10]:
# Check the shape of the transformed data
X_train_transformed.shape, X_test_transformed.shape

((5897, 2923), (2528, 2923))
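The 2,923 columns reported above are far more than the number of raw features. A likely contributor is the `Date` column: `read_csv` loads it as text, so the one-hot encoder creates one column per distinct date. A possible alternative, sketched here on toy data, is to replace `Date` with a few numeric calendar features before preprocessing. The helper name `expand_date` and the derived column names are assumptions for illustration.

```python
import pandas as pd


def expand_date(df: pd.DataFrame, column: str = "Date") -> pd.DataFrame:
    """Replace a text date column with numeric year, month, and day-of-year features."""
    out = df.copy()
    parsed = pd.to_datetime(out[column])
    out["Year"] = parsed.dt.year
    out["Month"] = parsed.dt.month
    out["DayOfYear"] = parsed.dt.dayofyear
    return out.drop(columns=[column])


# Toy stand-in for the feature matrix
toy = pd.DataFrame({
    "Date": ["2008-12-01", "2008-12-02", "2009-01-15"],
    "Location": ["Albury", "Albury", "Albury"],
    "MinTemp": [13.4, 7.4, 12.9],
})

expanded = expand_date(toy)
print(expanded.columns.tolist())
```

With this change, only genuinely categorical columns such as `Location` and the wind directions would be one-hot encoded, which should shrink the transformed matrix considerably.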

## 1) Logistic Regression Model
Binary classification:
1. This model predicts whether it will rain tomorrow (Yes/No).
2. The model is evaluated using accuracy, precision, recall, and F1-score.

In [11]:
# Initialize and train the Logistic Regression model for binary classification
classifier = LogisticRegression(max_iter=1000, random_state=42)
classifier.fit(X_train_transformed, y_class_train)

In [12]:
# Predict on the test data
y_class_pred = classifier.predict(X_test_transformed)


In [13]:
# Calculate accuracy and other evaluation metrics
classification_accuracy = accuracy_score(y_class_test, y_class_pred)
classification_report_details = classification_report(y_class_test, y_class_pred, target_names=['No Rain', 'Rain'])

classification_accuracy

0.8571993670886076

In [14]:
print(classification_report_details)

              precision    recall  f1-score   support

     No Rain       0.87      0.95      0.91      1943
        Rain       0.77      0.54      0.64       585

    accuracy                           0.86      2528
   macro avg       0.82      0.75      0.77      2528
weighted avg       0.85      0.86      0.85      2528

The Logistic Regression model for predicting whether it will rain tomorrow achieved an accuracy of approximately 85.7%.

The other evaluation metrics are:
1. Precision is 87% for "No Rain" and 77% for "Rain".
2. Recall is 95% for "No Rain", indicating good coverage of this class, but only 54% for "Rain", meaning that roughly 46% of rainy days in the test set were missed.
3. The F1-score, which balances precision and recall, is 91% for "No Rain" and 64% for "Rain".

The model performs well on non-rainy days but is less effective at identifying rainy days, as shown by the lower recall and F1-score for the Rain category. The test set is also imbalanced (1,943 "No Rain" samples versus 585 "Rain" samples), so always predicting "No Rain" would already reach about 76.9% accuracy; the 85.7% figure should be read against that baseline.
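One common way to trade precision for recall on the minority class is to lower the decision threshold applied to the predicted probabilities. The sketch below uses synthetic data from `make_classification` with a class balance similar to the test set above; the 0.3 threshold and all variable names are illustrative assumptions, and this was not run on the rainfall data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 23% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.77, 0.23], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
rain_probability = model.predict_proba(X_te)[:, 1]  # probability of class 1


def scores_at(threshold: float) -> tuple:
    """Return (precision, recall) for class 1 at the given probability threshold."""
    predictions = (rain_probability >= threshold).astype(int)
    precision = precision_score(y_te, predictions, zero_division=0)
    recall = recall_score(y_te, predictions)
    return precision, recall


precision_default, recall_default = scores_at(0.5)
precision_lowered, recall_lowered = scores_at(0.3)
print(f"threshold 0.5: precision={precision_default:.2f}, recall={recall_default:.2f}")
print(f"threshold 0.3: precision={precision_lowered:.2f}, recall={recall_lowered:.2f}")
```

Lowering the threshold can only keep or increase recall, usually at the cost of precision. In the notebook, the probabilities would come from `classifier.predict_proba(X_test_transformed)[:, 1]`.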

## 2) Random Forest Regression Model
1. This model predicts the amount of rainfall.
2. The model is evaluated using mean squared error (MSE) and R² score.

In [15]:
# Initialize and train the Random Forest Regressor model for predicting rainfall amount
regressor = RandomForestRegressor(n_estimators=100, random_state=42)
regressor.fit(X_train_transformed, y_reg_train)

In [16]:
# Predict on the test data
y_reg_pred = regressor.predict(X_test_transformed)

In [17]:
# Calculate MSE and R^2
mse = mean_squared_error(y_reg_test, y_reg_pred)
r2 = r2_score(y_reg_test, y_reg_pred)

mse, r2

(124.10813346162976, 0.20716790834118948)
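These two numbers are easier to interpret after a short calculation. Taking the square root of the MSE gives the RMSE in the units of the `Rainfall` column, and because R² = 1 − MSE / Var(y_test), the MSE of a baseline that always predicts the test-set mean can be recovered as MSE / (1 − R²). The sketch below only reuses the values printed above; the variable names are illustrative.

```python
import math

# Values reported by the notebook for the Random Forest Regressor
mse = 124.10813346162976
r2 = 0.20716790834118948

# Root mean squared error, in the same units as the Rainfall column
rmse = math.sqrt(mse)

# MSE of a constant predictor (the test-set mean), derived from the R^2 definition
baseline_mse = mse / (1.0 - r2)
baseline_rmse = math.sqrt(baseline_mse)

print(f"model RMSE:    {rmse:.2f}")
print(f"baseline RMSE: {baseline_rmse:.2f}")
```

The model's RMSE is about 11.1 against roughly 12.5 for the constant baseline, so the improvement over predicting the mean is real but small.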

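## Regression Model
Objective: Quantify the amount of rainfall.

Model Used: Random Forest Regressor (100 trees). A Linear Regression model was also attempted.

Results: On the test set, the Random Forest Regressor reached an MSE of about 124.11 and an R² of about 0.207, so it explains roughly 21% of the variance in rainfall amounts.

Challenges: Training the Random Forest Regressor was computationally intensive and prone to timeouts. Attempts to train a simpler Linear Regression model were hindered by internal errors, potentially due to the large number of features after preprocessing (2,923 columns) or the dataset size.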
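Training cost for a random forest can often be reduced by limiting tree depth, growing fewer trees, subsampling rows per tree, and using all CPU cores. The following is a minimal sketch on synthetic data; the specific parameter values are untuned assumptions, not settings validated on the rainfall data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with a skewed, non-negative target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 50))
y_demo = rng.gamma(shape=0.5, scale=5.0, size=500)

fast_regressor = RandomForestRegressor(
    n_estimators=50,      # fewer trees than the notebook's 100
    max_depth=12,         # cap tree depth
    min_samples_leaf=5,   # stop splitting very small nodes
    max_samples=0.5,      # each tree sees half of the rows
    n_jobs=-1,            # train trees in parallel on all cores
    random_state=42,
)
fast_regressor.fit(X_demo, y_demo)

predictions = fast_regressor.predict(X_demo[:5])
print(predictions)
```

Each of these settings trades some accuracy for speed, so they would need to be checked against the MSE and R² reported above.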

## Evaluation and Metrics
For the classification model, accuracy, precision, recall, and F1-score were used to assess performance across both classes (Rain/No Rain). For the regression model, Mean Squared Error (MSE) and R-squared (R²) were used; they indicate the model's error magnitude and explanatory power, respectively.

## Conclusion
The logistic regression model for binary classification predicted rain occurrence with about 85.7% accuracy and was most reliable for days without rain. Predicting rainy days remains challenging (54% recall) and may benefit from further model tuning or alternative algorithms.

The regression model was computationally demanding to train in this environment, and its results were modest: the Random Forest Regressor reached an R² of about 0.21 (MSE of about 124.1). It is recommended to address the computational limitations or reduce model complexity in future attempts.

This analysis demonstrates the potential of machine learning in weather forecasting, highlighting both its capabilities and the challenges posed by large datasets and complex models. Further research and optimization could improve model performance, especially for less frequent events such as rainy days.