# Rainfall Prediction Classifier

**Author:** Gautam Govind

A machine learning project that predicts whether it will rain in the Melbourne area, using historical weather data.

---

## Project Overview

This project builds classification models to predict whether it will rain today based on various meteorological features. The dataset contains daily weather observations from 2008 to 2017 across Australian locations.

---

## Setup

In [None]:
!pip install numpy pandas matplotlib scikit-learn seaborn

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
import seaborn as sns

---

## Data Loading and Exploration

In [None]:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/_0eYOqji3unP1tDNKWZMjg/weatherAUS-2.csv"
df = pd.read_csv(url)
df.head()

**Output:**
```
         Date Location  MinTemp  MaxTemp  Rainfall  ...  Temp9am  Temp3pm RainToday RainTomorrow
0  2008-12-01   Albury     13.4     22.9       0.6  ...     16.9     21.8        No           No
1  2008-12-02   Albury      7.4     25.1       0.0  ...     17.2     24.3        No           No
2  2008-12-03   Albury     12.9     25.7       0.0  ...     21.0     23.2        No           No
3  2008-12-04   Albury      9.2     28.0       0.0  ...     18.1     26.5        No           No
4  2008-12-05   Albury     17.5     32.3       1.0  ...     17.8     29.7        No           No

[5 rows x 23 columns]
```

In [None]:
df.count()

**Output:**
```
Date             145460
Location         145460
MinTemp          143975
MaxTemp          144199
Rainfall         142199
Evaporation       82670
Sunshine          75625
WindGustDir      135134
WindGustSpeed    135197
WindDir9am       134894
WindDir3pm       141232
WindSpeed9am     143693
WindSpeed3pm     142398
Humidity9am      142806
Humidity3pm      140953
Pressure9am      130395
Pressure3pm      130432
Cloud9am          89572
Cloud3pm          86102
Temp9am          143693
Temp3pm          141851
RainToday        142199
RainTomorrow     142193
dtype: int64
```

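**Key Observation:** `Evaporation`, `Sunshine`, `Cloud9am`, and `Cloud3pm` are each missing roughly 38% to 48% of their values, too many to impute reliably, so rows with missing values are dropped in the next step.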
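
To put numbers on this, the missing share of each column can be worked out from the counts above. The sketch below hard-codes a handful of those counts purely for illustration; on the loaded frame, `df.isna().mean()` should give the same quantity directly.

```python
import pandas as pd

# Non-null counts copied from the df.count() output above
total_rows = 145460
non_null = pd.Series({
    "Evaporation": 82670,
    "Sunshine": 75625,
    "Cloud9am": 89572,
    "Cloud3pm": 86102,
    "Humidity3pm": 140953,
})

# Share of missing values per column, as a percentage, largest first
missing_pct = ((1 - non_null / total_rows) * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

`Humidity3pm` is included only as a contrast: it is missing about 3% of its values, while the four sparse columns are missing well over a third.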

---

## Data Cleaning

In [None]:
df = df.dropna()
df.info()

**Output:**
```
<class 'pandas.core.frame.DataFrame'>
Index: 56420 entries, 6049 to 142302
Data columns (total 23 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           56420 non-null  object 
 1   Location       56420 non-null  object 
 2   MinTemp        56420 non-null  float64
 3   MaxTemp        56420 non-null  float64
 4   Rainfall       56420 non-null  float64
 5   Evaporation    56420 non-null  float64
 6   Sunshine       56420 non-null  float64
 7   WindGustDir    56420 non-null  object 
 8   WindGustSpeed  56420 non-null  float64
 9   WindDir9am     56420 non-null  object 
 10  WindDir3pm     56420 non-null  object 
 11  WindSpeed9am   56420 non-null  float64
 12  WindSpeed3pm   56420 non-null  float64
 13  Humidity9am    56420 non-null  float64
 14  Humidity3pm    56420 non-null  float64
 15  Pressure9am    56420 non-null  float64
 16  Pressure3pm    56420 non-null  float64
 17  Cloud9am       56420 non-null  float64
 18  Cloud3pm       56420 non-null  float64
 19  Temp9am        56420 non-null  float64
 20  Temp3pm        56420 non-null  float64
 21  RainToday      56420 non-null  object 
 22  RainTomorrow   56420 non-null  object 
dtypes: float64(16), object(7)
memory usage: 10.3+ MB
```

Dropping rows with missing values leaves 56,420 of the original 145,460 observations (about 39%), which is still enough for modeling.

---

## Addressing Data Leakage

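To make predictions practical, I reframed the problem to predict today's rainfall using data available up to yesterday. Each row pairs one day's observations with the next day's rain label, so after renaming, a row's measurements are read as yesterday's data and the target becomes `RainToday`.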

In [None]:
df = df.rename(columns={
    'RainToday': 'RainYesterday',
    'RainTomorrow': 'RainToday'
})

---

## Geographic Filtering

Focusing on the Melbourne metropolitan area (Melbourne, Melbourne Airport, Watsonia):

In [None]:
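# .copy() avoids a SettingWithCopyWarning when columns are added to the filtered frame later
df = df[df.Location.isin(['Melbourne', 'MelbourneAirport', 'Watsonia'])].copy()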
df.info()

**Output:**
```
<class 'pandas.core.frame.DataFrame'>
Index: 7557 entries, 64191 to 80997
Data columns (total 23 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           7557 non-null   object 
 1   Location       7557 non-null   object 
...
dtypes: float64(16), object(7)
memory usage: 1.4+ MB
```

---

## Feature Engineering: Seasonality

In [None]:
def date_to_season(date):
    month = date.month
    if month in [12, 1, 2]:
        return 'Summer'  # Southern Hemisphere
    elif month in [3, 4, 5]:
        return 'Autumn'
    elif month in [6, 7, 8]:
        return 'Winter'
    else:
        return 'Spring'

df['Date'] = pd.to_datetime(df['Date'])
df['Season'] = df['Date'].apply(date_to_season)
df = df.drop(columns=['Date'])
df.head()

**Output:**
```
               Location  MinTemp  MaxTemp  Rainfall  ...  Temp3pm RainYesterday RainToday  Season
64191  MelbourneAirport     11.2     19.9       0.0  ...     18.1            No       Yes  Summer
64192  MelbourneAirport      7.8     17.8       1.2  ...     15.8           Yes        No  Summer
64193  MelbourneAirport      6.3     21.1       0.0  ...     19.6            No        No  Summer
64194  MelbourneAirport      8.1     29.2       0.0  ...     28.2            No        No  Summer
64195  MelbourneAirport      9.7     29.0       0.0  ...     27.1            No        No  Summer

[5 rows x 23 columns]
```
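
The same month-to-season grouping can also be written without a row-wise `apply`. This is a minimal sketch of an alternative, not the code used in the notebook; `MONTH_TO_SEASON` and the sample dates are hypothetical names and values introduced here.

```python
import pandas as pd

# Month number -> Southern Hemisphere season, same grouping as date_to_season
MONTH_TO_SEASON = {
    12: "Summer", 1: "Summer", 2: "Summer",
    3: "Autumn", 4: "Autumn", 5: "Autumn",
    6: "Winter", 7: "Winter", 8: "Winter",
    9: "Spring", 10: "Spring", 11: "Spring",
}

# Hypothetical sample dates, one per season
dates = pd.to_datetime(pd.Series(["2008-12-01", "2009-04-15", "2009-07-30", "2009-10-05"]))

# Vectorized lookup: extract the month, then map it through the dictionary
seasons = dates.dt.month.map(MONTH_TO_SEASON)
print(seasons.tolist())  # ['Summer', 'Autumn', 'Winter', 'Spring']
```

In the notebook this would read `df['Season'] = df['Date'].dt.month.map(MONTH_TO_SEASON)` once `Date` has been converted with `pd.to_datetime`.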

---

## Target Variable Analysis

In [None]:
X = df.drop(columns=['RainToday'])  # columns= already selects the axis
y = df['RainToday']

y.value_counts()

**Output:**
```
RainToday
No     5766
Yes    1791
Name: count, dtype: int64
```

**Analysis:**
- Rain occurs ~24% of the time (1791/7557)
- Dataset is imbalanced (76% No Rain vs 24% Rain)
- Simply predicting "No Rain" every day would yield 76% accuracy
- This highlights the need for better evaluation metrics beyond accuracy
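
The "always predict No Rain" baseline can be made concrete with scikit-learn's `DummyClassifier`. The sketch below rebuilds the labels from the value counts above rather than loading the data, so `X_demo` and `y_demo` are placeholders and not the notebook's variables.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Labels rebuilt from the value counts above (5766 "No", 1791 "Yes")
y_demo = np.array(["No"] * 5766 + ["Yes"] * 1791)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this baseline

# Always predicts the most frequent class, here "No"
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_base)
baseline_recall = recall_score(y_demo, y_base, pos_label="Yes")
print(f"accuracy={baseline_accuracy:.3f}, rain recall={baseline_recall:.3f}")
```

Any model worth keeping should beat this accuracy while also achieving a rain recall above zero.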

---

## Model Development

### Train-Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
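
Because the split uses `stratify=y`, the share of rainy days should be nearly identical in the training and test sets. A quick check, sketched here on labels rebuilt from the value counts above (the placeholder features carry no meaning):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the value counts above; features are placeholders
y_demo = np.array(["No"] * 5766 + ["Yes"] * 1791)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fraction of "Yes" labels in each partition
train_rain_share = np.mean(y_tr == "Yes")
test_rain_share = np.mean(y_te == "Yes")
print(f"train: {train_rain_share:.3f}, test: {test_rain_share:.3f}")
```

Both shares should sit close to the overall 24% rain rate.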

### Feature Preprocessing

In [None]:
# Detect feature types
numeric_features = X_train.select_dtypes(include=['number']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# Define transformers
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Combine into preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

### Create Pipeline

In [None]:
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

param_grid = {
    'classifier__n_estimators': [50, 100, 150],
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5, 10]
}

---

## Model Training: Random Forest

In [None]:
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='accuracy',
    verbose=2,
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

**Output:**
```
Fitting 5 folds for each of 27 candidates, totalling 135 fits
[CV] END classifier__max_depth=None, classifier__min_samples_split=2, classifier__n_estimators=50; total time=   0.5s
[CV] END classifier__max_depth=None, classifier__min_samples_split=2, classifier__n_estimators=50; total time=   0.4s
...
[135 fits completed]
```
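
The search above selects hyperparameters by accuracy. Given the class imbalance, one alternative is to select by F1 on the rain class instead. The sketch below is an assumption about how that could look, run on synthetic data rather than the weather data; `rain_f1` and `demo_search` are names introduced here. With string labels, the positive class has to be named explicitly via `make_scorer`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (about 24% positives) standing in for X_train, y_train
X_demo, y_num = make_classification(
    n_samples=600, n_features=8, weights=[0.76, 0.24], random_state=42
)
y_demo = np.where(y_num == 1, "Yes", "No")

# String labels need an explicit positive class for F1
rain_f1 = make_scorer(f1_score, pos_label="Yes")

demo_search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=42),
    param_grid={"max_depth": [5, None]},
    cv=3,
    scoring=rain_f1,
)
demo_search.fit(X_demo, y_demo)
print(demo_search.best_params_, round(demo_search.best_score_, 2))
```

In the notebook, the equivalent change would be passing `scoring=rain_f1` to the existing `GridSearchCV` call; the reported results in this document were obtained with `scoring='accuracy'`.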

### Best Parameters

In [None]:
print("\nBest parameters found: ", grid_search.best_params_)
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

**Output:**
```
Best parameters found:  {'classifier__max_depth': 20, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 100}
Best cross-validation score: 0.85
```

### Test Set Performance

In [None]:
test_score = grid_search.score(X_test, y_test)
print("Test set score: {:.2f}".format(test_score))

**Output:**
```
Test set score: 0.84
```

---

## Model Evaluation

### Predictions

In [None]:
y_pred = grid_search.predict(X_test)

### Classification Report

In [None]:
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

**Output:**
```
Classification Report:
              precision    recall  f1-score   support

          No       0.86      0.95      0.90      1154
         Yes       0.75      0.51      0.61       358

    accuracy                           0.84      1512
   macro avg       0.81      0.73      0.76      1512
weighted avg       0.84      0.84      0.83      1512
```

**Key Metrics:**
- **Overall Accuracy:** 84%
- **Precision (Rain):** 75% - When predicting rain, correct 75% of the time
- **Recall (Rain):** 51% - Only catches 51% of actual rainy days ⚠️
- **F1-Score (Rain):** 0.61

### Confusion Matrix

In [None]:
conf_matrix = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix)
disp.plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.show()

**Output:**
[Confusion matrix plot showing True Negatives=1097, False Positives=57, False Negatives=175, True Positives=183]

**Analysis:**
- **True Negatives (1097):** Correctly predicted no rain
- **True Positives (183):** Correctly predicted rain
- **False Negatives (175):** Missed rainy days - 49% miss rate
- **False Positives (57):** False alarms

**True Positive Rate:** 183/(183+175) = **51%** - The model only catches about half of rainy days.
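
The rates quoted in this analysis follow directly from the four counts. A short worked calculation, with the counts typed in by hand from the confusion matrix above:

```python
# Counts read from the confusion matrix above
tn, fp, fn, tp = 1097, 57, 175, 183

true_positive_rate = tp / (tp + fn)    # share of rainy days caught (recall)
false_negative_rate = fn / (tp + fn)   # share of rainy days missed
true_negative_rate = tn / (tn + fp)    # share of dry days correctly called

print(f"TPR={true_positive_rate:.2f}, FNR={false_negative_rate:.2f}, TNR={true_negative_rate:.2f}")
```

In the notebook, `confusion_matrix(y_test, y_pred).ravel()` returns these four counts in the order `tn, fp, fn, tp` for a binary problem.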

---

## Feature Importance Analysis

In [None]:
feature_importances = grid_search.best_estimator_['classifier'].feature_importances_

# Get feature names (note: OneHotEncoder expands categorical features)
feature_names = numeric_features + list(
    grid_search.best_estimator_['preprocessor']
    .named_transformers_['cat']
    .get_feature_names_out(categorical_features)
)

importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

N = 20
top_features = importance_df.head(N)

plt.figure(figsize=(10, 6))
plt.barh(top_features['Feature'], top_features['Importance'], color='skyblue')
plt.gca().invert_yaxis()
plt.title(f'Top {N} Most Important Features')
plt.xlabel('Importance Score')
plt.show()

**Output:**
[Bar chart showing Humidity3pm as most important, followed by Pressure3pm, Cloud9am, Cloud3pm, Sunshine, etc.]

**Top Predictive Features:**
1. **Humidity3pm** - Most important
2. **Pressure3pm** - Atmospheric pressure
3. **Cloud9am / Cloud3pm** - Cloud cover
4. **Sunshine** - Hours of sunshine
5. **RainYesterday** - Previous day's rain

---

## Model Comparison: Logistic Regression

In [None]:
# Update pipeline with Logistic Regression
pipeline.set_params(classifier=LogisticRegression(random_state=42))
grid_search.estimator = pipeline

param_grid = {
    'classifier__solver': ['liblinear'],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__class_weight': [None, 'balanced']
}

grid_search.param_grid = param_grid
grid_search.fit(X_train, y_train)

y_pred_lr = grid_search.predict(X_test)

print(classification_report(y_test, y_pred_lr))

**Output:**
```
              precision    recall  f1-score   support

          No       0.84      0.92      0.88      1154
         Yes       0.67      0.46      0.55       358

    accuracy                           0.82      1512
   macro avg       0.76      0.69      0.71      1512
weighted avg       0.80      0.82      0.81      1512
```

In [None]:
conf_matrix_lr = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(conf_matrix_lr, annot=True, cmap='Blues', fmt='d')
plt.title('Logistic Regression - Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.show()

---

## Final Model Comparison

| Metric | Random Forest | Logistic Regression |
|--------|--------------|---------------------|
| **Accuracy** | 84% | 82% |
| **Recall / True Positive Rate (Rain)** | 51% | 46% |
| **Precision (Rain)** | 75% | 67% |
| **F1-Score (Rain)** | 0.61 | 0.55 |

### Key Findings:

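1. **Random Forest outperforms Logistic Regression** with 2 percentage points higher accuracy and 5 percentage points higher recall on rainy days

2. **Both models struggle with the minority class** - they miss roughly half of rainy days (49% for Random Forest, 54% for Logistic Regression), which is likely linked to the class imbalance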

3. **Random Forest achieves:**
   - 1097 correct "No Rain" predictions (95% of no-rain days)
   - 183 correct "Rain" predictions (51% of rainy days)
   - 84% overall accuracy

4. **Accuracy alone is misleading here** - A model that always predicts "No Rain" would reach 76% accuracy while catching 0% of rainy days, so the 84% result should be read against that baseline

---

## Conclusions

This project built rainfall prediction models reaching 84% test accuracy with Random Forest, compared with a 76% majority-class baseline. However, the analysis revealed important insights:

**Strengths:**
- Strong overall accuracy (84%)
- Good precision when predicting rain (75%)
- Effective feature engineering (seasonality, location filtering)
- Robust preprocessing pipeline

**Challenges:**
- Class imbalance (76% no-rain vs 24% rain)
- Low recall for rainy days (51%)
- High false negative rate (49% of rainy days missed)

**Practical Implications:**
For real-world deployment where a missed rainy day is costly (e.g., outdoor event planning), the current model likely needs better recall despite its good accuracy.

---

## Future Improvements

1. **Address class imbalance:**
   - Apply SMOTE (Synthetic Minority Over-sampling)
   - Use class_weight='balanced' in Random Forest
   - Try ensemble methods with different thresholds

2. **Feature engineering:**
   - Create temporal features (rolling averages, trends)
   - Add pressure/temperature differentials
   - Include day-of-week patterns

3. **Model enhancements:**
   - Try Gradient Boosting (XGBoost, LightGBM)
   - Experiment with ensemble stacking
   - Optimize decision threshold for business needs

4. **Expand scope:**
   - Add more geographic locations with location-specific models
   - Include more recent data
   - Build seasonal sub-models
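
Two of the ideas above, `class_weight='balanced'` and a tuned decision threshold, can be sketched together. This is an illustration on synthetic data under assumed settings, not a result for the weather data; the thresholds 0.5 and 0.3 are arbitrary examples, and class `1` stands in for "Rain".

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 24% positives); 1 stands in for "Rain"
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.76, 0.24], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# class_weight="balanced" reweights classes inversely to their frequency
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
clf.fit(X_tr, y_tr)

# Predicted probability of the positive class (column 1)
rain_proba = clf.predict_proba(X_te)[:, 1]

# Compare recall at the usual 0.5 cut-off and at a lower one
recall_by_threshold = {}
for threshold in (0.5, 0.3):
    y_hat = (rain_proba >= threshold).astype(int)
    recall_by_threshold[threshold] = recall_score(y_te, y_hat)

print(recall_by_threshold)
```

Lowering the threshold can only add positive predictions, so recall never decreases; the cost is more false alarms, which is why the cut-off should be chosen against the actual cost of a missed rainy day.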
