In [13]:
import pandas as pd

# Load the dataset from your data folder
df = pd.read_csv("data/weatherAUS.csv")

# Check basic shape
print("Dataset shape:", df.shape)
df.head()


Dataset shape: (145460, 23)


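|   | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-12-01 | Albury | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | ... | 71.0 | 22.0 | 1007.7 | 1007.1 | 8.0 | NaN | 16.9 | 21.8 | No | No |
| 1 | 2008-12-02 | Albury | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | ... | 44.0 | 25.0 | 1010.6 | 1007.8 | NaN | NaN | 17.2 | 24.3 | No | No |
| 2 | 2008-12-03 | Albury | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | ... | 38.0 | 30.0 | 1007.6 | 1008.7 | NaN | 2.0 | 21.0 | 23.2 | No | No |
| 3 | 2008-12-04 | Albury | 9.2 | 28.0 | 0.0 | NaN | NaN | NE | 24.0 | SE | ... | 45.0 | 16.0 | 1017.6 | 1012.8 | NaN | NaN | 18.1 | 26.5 | No | No |
| 4 | 2008-12-05 | Albury | 17.5 | 32.3 | 1.0 | NaN | NaN | W | 41.0 | ENE | ... | 82.0 | 33.0 | 1010.8 | 1006.0 | 7.0 | 8.0 | 17.8 | 29.7 | No | No |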


#### Step 1: Check for Missing Values

Understanding which columns have missing data helps determine what should be cleaned or dropped.



In [14]:
# Check how many missing values per column
missing = df.isnull().sum().sort_values(ascending=False)
missing[missing > 0]


Sunshine         69835
Evaporation      62790
Cloud3pm         59358
Cloud9am         55888
Pressure9am      15065
Pressure3pm      15028
WindDir9am       10566
WindGustDir      10326
WindGustSpeed    10263
Humidity3pm       4507
WindDir3pm        4228
Temp3pm           3609
RainTomorrow      3267
Rainfall          3261
RainToday         3261
WindSpeed3pm      3062
Humidity9am       2654
WindSpeed9am      1767
Temp9am           1767
MinTemp           1485
MaxTemp           1261
dtype: int64

#### Step 2: Drop Columns with Excessive Missing Data

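The following features are missing in roughly 38% to 48% of the 145,460 rows. Imputing that many values would introduce substantial noise, and dropping the affected rows instead would shrink the dataset considerably. Therefore, we remove these columns:
- Sunshine (69,835 missing, 48.0%)
- Evaporation (62,790 missing, 43.2%)
- Cloud3pm (59,358 missing, 40.8%)
- Cloud9am (55,888 missing, 38.4%)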

These features are dropped to maintain data quality and avoid introducing noise.
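
The shares behind this decision can be checked directly from the counts in Step 1 and the row count reported by `df.shape`. The snippet below is a minimal sketch; the `missing_counts` dictionary is typed in by hand for illustration and is not part of the original notebook.

```python
# Missing-value counts copied from the Step 1 output; n_rows from df.shape.
n_rows = 145460
missing_counts = {
    "Sunshine": 69835,
    "Evaporation": 62790,
    "Cloud3pm": 59358,
    "Cloud9am": 55888,
    "Pressure9am": 15065,
}

# Share of missing values per column, as a percentage rounded to one decimal.
missing_pct = {col: round(100 * n / n_rows, 1) for col, n in missing_counts.items()}

for col, pct in missing_pct.items():
    print(f"{col:<12} {pct:5.1f}%")
```

On the DataFrame itself, the same shares are available as `df.isnull().mean()`.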


In [15]:
# List of high-missing columns we want to drop
to_drop = ['Sunshine', 'Evaporation', 'Cloud3pm', 'Cloud9am']

# Drop only the ones that exist
df.drop(columns=[col for col in to_drop if col in df.columns], inplace=True)


#### Step 3: Drop Rows with Missing Target

Rows missing the target variable (`RainTomorrow`) cannot be used for training or evaluation and are removed.



In [16]:
df.dropna(subset=['RainTomorrow'], inplace=True)


#### Step 4: Fill Remaining Missing Values

Numerical columns are filled with their mean, and categorical columns with their most frequent value (mode).


In [17]:
# Fill numerical columns
num_cols = df.select_dtypes(include=['float64', 'int64']).columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Fill categorical columns
cat_cols = df.select_dtypes(include=['object']).columns
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])
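
The cell above computes the mean and mode on the full dataset, so rows that later end up in the test set influence the fill values. A minimal sketch of the alternative, learning the fill values from the training rows only, is shown below on a toy frame. The `toy` data, the row-based split, and the variable names are illustrative assumptions, not the notebook's actual pipeline.

```python
import pandas as pd

# Toy data standing in for the weather frame (values are made up).
toy = pd.DataFrame({
    "MinTemp": [10.0, None, 14.0, 18.0, None, 20.0],
    "WindGustDir": ["W", "W", None, "NE", "W", None],
})

# Hypothetical split: first four rows for training, last two for testing.
train, test = toy.iloc[:4].copy(), toy.iloc[4:].copy()

num_cols = train.select_dtypes(include="number").columns
cat_cols = train.columns.difference(num_cols)

# Fill statistics are learned from the training rows only.
num_fill = train[num_cols].mean()
cat_fill = train[cat_cols].mode().iloc[0]
fill_values = {**num_fill.to_dict(), **cat_fill.to_dict()}

# The same training-derived values are applied to both splits.
train = train.fillna(fill_values)
test = test.fillna(fill_values)

print(test)
```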


#### Step 5: Encode Categorical Variables

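The target and the binary feature `RainToday` are mapped to 1 and 0. All other text columns are one-hot encoded. Note that `pd.get_dummies` encodes every remaining text column, so `Date` and `Location` are expanded as well, with one indicator column per distinct value.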


In [18]:
df['RainTomorrow'] = df['RainTomorrow'].map({'Yes': 1, 'No': 0})
df['RainToday'] = df['RainToday'].map({'Yes': 1, 'No': 0})

# One-hot encode remaining categorical variables
df = pd.get_dummies(df, drop_first=True)
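
Because `pd.get_dummies` encodes every remaining text column, a column such as `Date` yields one indicator column per distinct day. The toy sketch below contrasts that with extracting calendar parts first. The `Year` and `Month` columns are assumptions introduced for illustration and are not used in the original notebook.

```python
import pandas as pd

# Toy frame with the same kinds of columns as the weather data (values made up).
toy = pd.DataFrame({
    "Date": ["2008-12-01", "2008-12-02", "2009-01-15", "2009-01-16"],
    "Location": ["Albury", "Albury", "Sydney", "Sydney"],
    "MinTemp": [13.4, 7.4, 18.0, 19.5],
})

# Naive encoding: every distinct date string becomes its own indicator column.
naive = pd.get_dummies(toy, drop_first=True)

# Alternative: derive numeric calendar features, then drop the raw string column.
compact = toy.copy()
dates = pd.to_datetime(compact["Date"])
compact["Year"] = dates.dt.year
compact["Month"] = dates.dt.month
compact = pd.get_dummies(compact.drop(columns="Date"), drop_first=True)

print("Naive columns:  ", list(naive.columns))
print("Compact columns:", list(compact.columns))
```

On the full dataset the difference grows with the number of distinct dates, which is why checking `df.shape` after encoding is a useful sanity check.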


#### Step 6: Split Features and Target

Now that the dataset is clean and encoded, we separate the features (X) from the target variable (y).


In [19]:
X = df.drop('RainTomorrow', axis=1)
y = df['RainTomorrow']


#### Step 7: Train/Test Split

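The data is split 80/20 into training and test sets, stratified on the target so that both sets keep the same class proportions. The first cell below builds a scaled, stratified 10,000-row sample (the same setup is used later for the SVM); the second builds the full split used by the other models.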


In [24]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# First: stratified sampling to reduce size to ~10,000 rows
X_small, _, y_small, _ = train_test_split(X, y, train_size=10000, stratify=y, random_state=42)

# Then scale
scaler = StandardScaler()
X_small_scaled = scaler.fit_transform(X_small)

# Final train/test split
X_train_scaled, X_test_scaled, y_train, y_test = train_test_split(
    X_small_scaled, y_small, test_size=0.2, random_state=42, stratify=y_small
)


In [26]:
# Recreate X and y from df
X = df.drop('RainTomorrow', axis=1)
y = df['RainTomorrow']

from sklearn.model_selection import train_test_split

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
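
To illustrate what `stratify=y` does, the sketch below splits a toy label vector with an 80/20 class ratio. The arrays `X_toy` and `y_toy` are made-up placeholders, not the weather data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a similar imbalance: 80 "No Rain" (0) and 20 "Rain" (1).
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, each split keeps the overall share of the positive class.
print("Rain share overall:", y_toy.mean())
print("Rain share in train:", y_tr.mean())
print("Rain share in test:", y_te.mean())
```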


## 2. Modelling / Classification

In this section, we apply several supervised machine learning algorithms to classify whether it will rain tomorrow. The models are trained on the preprocessed data and evaluated using accuracy, confusion matrix, and classification reports. Three different classifiers are compared to assess performance.



### 2.1 Logistic Regression

Logistic Regression is a simple and interpretable linear model for binary classification. It models the probability of a binary response based on one or more predictor variables.


In [27]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

lr_model = LogisticRegression(max_iter=5000, solver='liblinear')
lr_model.fit(X_train, y_train)

y_pred_lr = lr_model.predict(X_test)

print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_lr))
print("\nClassification Report:\n", classification_report(y_test, y_pred_lr))


Logistic Regression Accuracy: 0.8455993530011604

Confusion Matrix:
 [[20962  1102]
 [ 3289  3086]]

Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.95      0.91     22064
           1       0.74      0.48      0.58      6375

    accuracy                           0.85     28439
   macro avg       0.80      0.72      0.74     28439
weighted avg       0.84      0.85      0.83     28439



### 2.1 Logistic Regression

A logistic regression model was trained to predict whether it will rain tomorrow. The model achieved an accuracy of **84.56%**. It performed well on the "No Rain" class (precision 86%, recall 95%). However, its recall for the "Rain" class was only **48%**: it missed 3,289 of the 6,375 rainy days in the test set. This is consistent with the class imbalance in the dataset, where "No Rain" makes up about 78% of the test set.

**Confusion Matrix Summary:**
- True Negatives (No Rain predicted correctly): 20,962
- False Positives (Rain wrongly predicted): 1,102
- False Negatives (Rain missed): 3,289
- True Positives (Rain predicted correctly): 3,086

**Classification Report Highlights:**
- Precision (Rain): 74%
- Recall (Rain): 48%
- F1-score (Rain): 58%
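
The highlighted figures follow directly from the four confusion-matrix counts. As a short worked check (variable names are illustrative):

```python
# Counts from the Logistic Regression confusion matrix above.
tn, fp, fn, tp = 20962, 1102, 3289, 3086

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted rain days that were rainy
recall = tp / (tp + fn)     # share of actual rain days that were detected
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```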


### 2.2 Random Forest

Random Forest is an ensemble method that builds multiple decision trees and merges them to get a more accurate and stable prediction. It works well on structured/tabular data.


In [28]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Reduce number of trees to speed things up
rf_model = RandomForestClassifier(n_estimators=30, max_depth=15, random_state=42)
rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))
print("\nClassification Report:\n", classification_report(y_test, y_pred_rf))


Random Forest Accuracy: 0.7900066809662787

Confusion Matrix:
 [[22030    34]
 [ 5938   437]]

Classification Report:
               precision    recall  f1-score   support

           0       0.79      1.00      0.88     22064
           1       0.93      0.07      0.13      6375

    accuracy                           0.79     28439
   macro avg       0.86      0.53      0.50     28439
weighted avg       0.82      0.79      0.71     28439



### 2.2 Random Forest Classifier

The Random Forest classifier achieved an accuracy of **79.00%**, only slightly above the 77.6% obtained by always predicting "No Rain". It classified almost every "No Rain" day correctly (recall close to 100%), but it did so by rarely predicting rain at all. The recall for rain was only **7%** (437 of 6,375 rainy days detected), even though the few rain predictions it made were usually right (precision 93%). This is likely due to class imbalance, where the dataset contains far more "No Rain" examples than "Rain".

**Confusion Matrix Summary:**
- True Negatives (No Rain predicted correctly): 22,030
- False Positives (Rain wrongly predicted): 34
- False Negatives (Rain missed): 5,938
- True Positives (Rain predicted correctly): 437

**Classification Report Highlights:**
- Precision (Rain): 93%
- Recall (Rain): 7%
- F1-score (Rain): 13%
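
One common response to this kind of imbalance is class weighting. scikit-learn's `class_weight="balanced"` heuristic sets each weight to `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the test-set supports above, on the assumption that the stratified training split has the same class proportions. It is an illustration only and makes no claim about the recall such weighting would produce here.

```python
# Class supports from the classification report above (0 = No Rain, 1 = Rain).
counts = {0: 22064, 1: 6375}
n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * class_count)
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}

for label, w in weights.items():
    print(f"class {label}: weight {w:.2f}")
```

Passing `class_weight="balanced"` to `RandomForestClassifier` applies the same formula to the training labels, so errors on rainy days would count roughly 3.5 times as much as errors on dry days.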


### 2.3 Support Vector Machine (SVM)

SVM aims to find a hyperplane that best separates the classes in the data. It is effective in high-dimensional spaces but can be slow on large datasets.


In [29]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Sample 10,000 rows while preserving the class proportions (stratified)
X_small, _, y_small, _ = train_test_split(X, y, train_size=10000, stratify=y, random_state=42)

# Scale the reduced set
scaler = StandardScaler()
X_small_scaled = scaler.fit_transform(X_small)

# Split into training and testing sets
X_train_scaled, X_test_scaled, y_train, y_test = train_test_split(
    X_small_scaled, y_small, test_size=0.2, random_state=42, stratify=y_small
)
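
The cell above fits the scaler on all 10,000 sampled rows before splitting, so the test rows contribute to the scaling statistics. A minimal sketch of an alternative is to split first and let a pipeline fit the scaler on the training rows only. The example uses synthetic data from `make_classification`; `X_demo`, `y_demo`, and `svm_pipe` are illustrative names, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the sampled weather data (roughly 78% / 22% classes).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.78, 0.22], random_state=42
)

# Split first, so the test rows are never seen during fitting.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted inside the pipeline, on the training rows only.
svm_pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1, gamma="scale"))
svm_pipe.fit(X_tr, y_tr)

print("Test accuracy:", svm_pipe.score(X_te, y_te))
```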


In [30]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

svm_model = SVC(kernel='rbf', C=1, gamma='scale')
svm_model.fit(X_train_scaled, y_train)

y_pred_svm = svm_model.predict(X_test_scaled)

print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_svm))
print("\nClassification Report:\n", classification_report(y_test, y_pred_svm))


SVM Accuracy: 0.778

Confusion Matrix:
 [[1486   66]
 [ 378   70]]

Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.96      0.87      1552
           1       0.51      0.16      0.24       448

    accuracy                           0.78      2000
   macro avg       0.66      0.56      0.55      2000
weighted avg       0.73      0.78      0.73      2000



### 2.3 Support Vector Machine (SVM)

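A Support Vector Machine was trained on a scaled, stratified 10,000-row sample of the dataset to keep training time manageable. The model reached an accuracy of **77.8%** on its 2,000-row test set, barely above the 77.6% obtained by always predicting "No Rain". It performed strongly on the "No Rain" class but had poor recall for the "Rain" class (only **16%**). The precision for predicting rain was **51%**, meaning about half of its rain predictions were false positives. Overall, the model struggled with class imbalance, similar to the Random Forest. Because it was evaluated on a different, smaller test set, its scores are not directly comparable with those of the other two models.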

**Confusion Matrix Summary:**
- True Negatives (No Rain predicted correctly): 1,486
- False Positives (Rain wrongly predicted): 66
- False Negatives (Rain missed): 378
- True Positives (Rain predicted correctly): 70

**Classification Report Highlights:**
- Precision (Rain): 51%
- Recall (Rain): 16%
- F1-score (Rain): 24%
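
Since accuracy is dominated by the majority class, it helps to compare each model with the trivial baseline of always predicting "No Rain". The sketch below does this using only the confusion-matrix counts reported above; the `results` and `summary` names are illustrative.

```python
# (correct predictions, "No Rain" support, test-set size) from the outputs above.
results = {
    "Logistic Regression": (20962 + 3086, 22064, 28439),
    "Random Forest": (22030 + 437, 22064, 28439),
    "SVM (10,000-row sample)": (1486 + 70, 1552, 2000),
}

summary = {}
for name, (correct, majority, total) in results.items():
    accuracy = correct / total
    baseline = majority / total  # accuracy of always predicting "No Rain"
    summary[name] = (accuracy, baseline)
    print(f"{name:<24} accuracy={accuracy:.3f} baseline={baseline:.3f} "
          f"gain={accuracy - baseline:+.3f}")
```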


## 3. Solution Improvement

This section focuses on improving the classification model's performance using hyperparameter tuning. Random Forest was selected for optimization due to its poor recall on predicting rain. GridSearchCV is applied to find the best combination of parameters and enhance the model’s ability to generalize.



In [32]:
# Recreate the full training dataset for GridSearch
X = df.drop('RainTomorrow', axis=1)
y = df['RainTomorrow']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


In [33]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Create base model
rf = RandomForestClassifier(random_state=42)

# Setup GridSearch
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=3,
    n_jobs=-1,
    scoring='f1',
    verbose=1
)

# Fit GridSearch to data
grid_search.fit(X_train, y_train)


Fitting 3 folds for each of 16 candidates, totalling 48 fits


**GridSearchCV parameters (as displayed)**

| Parameter | Value |
|---|---|
| estimator | RandomForestC...ndom_state=42) |
| param_grid | {'max_depth': [10, 20], 'min_samples_leaf': [1, 2], 'min_samples_split': [2, 5], 'n_estimators': [50, 100]} |
| scoring | 'f1' |
| n_jobs | -1 |
| refit | True |
| cv | 3 |
| verbose | 1 |
| pre_dispatch | '2*n_jobs' |
| error_score | |
| return_train_score | False |

**RandomForestClassifier parameters (as displayed)**

| Parameter | Value |
|---|---|
| n_estimators | 50 |
| criterion | 'gini' |
| max_depth | 20 |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 'sqrt' |
| max_leaf_nodes | |
| min_impurity_decrease | 0.0 |
| bootstrap | True |

In [34]:
print("Best Score:", grid_search.best_score_)
print("Best Params:", grid_search.best_params_)


Best Score: 0.2732372924449166
Best Params: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}


## 3. Model Improvement – Hyperparameter Tuning with GridSearchCV

To improve the Random Forest model’s performance, hyperparameter tuning was conducted using `GridSearchCV`. A parameter grid was defined for key hyperparameters:

- `n_estimators`: [50, 100] – number of trees
- `max_depth`: [10, 20] – maximum depth of each tree
- `min_samples_split`: [2, 5] – minimum samples to split a node
- `min_samples_leaf`: [1, 2] – minimum samples per leaf node

The search was run with 3-fold cross-validation and scored with the **F1-score** of the "Rain" class to balance precision and recall. The best mean cross-validated F1-score was **0.273**.
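
The size of the search follows from the grid: every combination of the listed values is one candidate, and each candidate is fitted once per fold. The sketch below enumerates the combinations with `itertools.product` to show the arithmetic; it mirrors the grid above but is not how `GridSearchCV` is invoked.

```python
from itertools import product

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}
cv_folds = 3

# Every combination of the listed values is one candidate model.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
total_fits = len(candidates) * cv_folds

print("Candidates:", len(candidates))
print("Total fits:", total_fits)
```

This matches the "16 candidates, totalling 48 fits" line in the search log.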

**Best Hyperparameters Found:**
```python
{
    'n_estimators': 50,
    'max_depth': 20,
    'min_samples_split': 2,
    'min_samples_leaf': 1
}
```


In [35]:
best_rf = grid_search.best_estimator_
y_pred_best_rf = best_rf.predict(X_test)

print("Test Accuracy:", f"{accuracy_score(y_test, y_pred_best_rf):.3f}")
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_best_rf))
print("\nClassification Report:\n", classification_report(y_test, y_pred_best_rf))


Test Accuracy: 0.803

Confusion Matrix:
 [[21935   129]
 [ 5466   909]]

Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.99      0.89     22064
           1       0.88      0.14      0.25      6375

    accuracy                           0.80     28439
   macro avg       0.84      0.57      0.57     28439
weighted avg       0.82      0.80      0.74     28439



## 4. Conclusion

The goal of this project was to build a machine learning model that predicts whether it will rain tomorrow in Australia, using historical weather data and classification techniques.

Three supervised machine learning models were evaluated:

- **Logistic Regression**: Achieved the highest overall accuracy (84.6%) and the best rain recall of the three models (48%), with strong precision and recall on the majority class, though it still missed about half of the rainy days.
- **Random Forest**: Showed lower accuracy (79.0%) and detected only 7% of rainy days before tuning; it was further optimized using `GridSearchCV`.
- **Support Vector Machine**: Required downsampling to 10,000 rows due to computational limitations and achieved similar accuracy (77.8%) with limited ability to classify rain cases (16% recall).

After hyperparameter tuning, the optimized Random Forest achieved:
- **Test Accuracy**: 80.3%
- **Improved recall on rain prediction** (from 7% to 14%, with the F1-score rising from 0.13 to 0.25), but it still trailed Logistic Regression (48% recall), with class imbalance a likely factor.

### Key Challenges:
- Class imbalance: about 78% of observations were 'No Rain'
- Missing values in key features
- Limited hardware, which restricted model complexity and grid search scope

### Future Improvements:
- Apply data balancing techniques such as SMOTE or undersampling
- Explore ensemble models and boosting techniques (e.g., XGBoost)
- Include temporal
