# Binary Rainfall Prediction Using Machine Learning

### Project Overview:

* The goal of this project is to predict whether it will rain on a given day from weather-related features: pressure, temperature, dew point, humidity, cloud cover, sunshine, wind direction, and wind speed. The dataset contains historical weather data, and the target variable is a binary label indicating whether rainfall occurred (1) or not (0). The project covers data exploration, preprocessing, model training, and prediction with a binary classification model.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input, BatchNormalization, Conv1D, MaxPooling1D, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.metrics import AUC

# 1. Data Exploration

In [2]:
# Define file paths for easy access
train_file = '/kaggle/input/playground-series-s5e3/train.csv'
test_file = '/kaggle/input/playground-series-s5e3/test.csv'

train_df = pd.read_csv(train_file)
test_df = pd.read_csv(test_file)

In [3]:
def data_exploration(train_df, test_df):
    
    # Print the first few rows of the training data
    print("Training Data Head:")
    print(train_df.head())
    
    # Print the first few rows of the testing data
    print("\nTesting Data Head:")
    print(test_df.head())
    
    # Print information about the training data
    # (info() prints its report itself and returns None, so it is not wrapped in print())
    print("\nTraining Data Info:")
    train_df.info()
    
    # Print information about the testing data
    print("\nTesting Data Info:")
    test_df.info()
    
    # Print the total number of rows and columns in the training data
    print("\nTraining Data Shape:")
    print(f"Rows: {train_df.shape[0]}, Columns: {train_df.shape[1]}")
    
    # Print the total number of rows and columns in the testing data
    print("\nTesting Data Shape:")
    print(f"Rows: {test_df.shape[0]}, Columns: {test_df.shape[1]}")
    
    # Print statistical summary for the training data
    print("\nTraining Data Statistics:")
    print(train_df.describe())
    
    # Print statistical summary for the testing data
    print("\nTesting Data Statistics:")
    print(test_df.describe())

# Call the function with the loaded DataFrames
data_exploration(train_df, test_df)

Training Data Head:
   id  day  pressure  maxtemp  temparature  mintemp  dewpoint  humidity  \
0   0    1    1017.4     21.2         20.6     19.9      19.4      87.0   
1   1    2    1019.5     16.2         16.9     15.8      15.4      95.0   
2   2    3    1024.1     19.4         16.1     14.6       9.3      75.0   
3   3    4    1013.4     18.1         17.8     16.9      16.8      95.0   
4   4    5    1021.8     21.3         18.4     15.2       9.6      52.0   

   cloud  sunshine  winddirection  windspeed  rainfall  
0   88.0       1.1           60.0       17.2         1  
1   91.0       0.0           50.0       21.9         1  
2   47.0       8.3           70.0       18.1         1  
3   95.0       0.0           60.0       35.6         1  
4   45.0       3.6           40.0       24.8         0  

Testing Data Head:
     id  day  pressure  maxtemp  temparature  mintemp  dewpoint  humidity  \
0  2190    1    1019.5     17.5         15.8     12.7      14.9      96.0   
1  2191    2 

### Observations:

* The dataset consists of weather-related features such as pressure, temperature (max, min, and average), humidity, cloud cover, sunshine, wind direction, wind speed, and rainfall (binary target variable).
* The training data has 2,190 rows and 13 columns. The testing data has the same columns except for the target variable (rainfall).
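
The structural claim above (same columns apart from the target) can be checked programmatically rather than by eye. The following is a minimal sketch on toy frames. `compare_schemas`, `train_toy`, and `test_toy` are hypothetical names that do not appear in the notebook, and the toy values are copied from the heads printed above.

```python
import pandas as pd

# Toy stand-ins for train_df / test_df (hypothetical, a few values from the printed heads)
train_toy = pd.DataFrame({"id": [0, 1], "pressure": [1017.4, 1019.5], "rainfall": [1, 1]})
test_toy = pd.DataFrame({"id": [2190, 2191], "pressure": [1019.5, 1016.5]})

def compare_schemas(train, test, target="rainfall"):
    """Return (columns only in train, columns only in test), ignoring the target."""
    train_cols = set(train.columns) - {target}
    test_cols = set(test.columns)
    return sorted(train_cols - test_cols), sorted(test_cols - train_cols)

only_train, only_test = compare_schemas(train_toy, test_toy)
print("Only in train:", only_train)
print("Only in test:", only_test)
```

Two empty lists mean the frames differ only by the target column. In the notebook, the same call would presumably take `train_df` and `test_df`.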

# 2. Missing Value Detection and Imputation

### In this step, we will:

1. Identify missing values in both the training and testing datasets.
2. Handle missing values by imputing them with appropriate strategies.
3. Create new meaningful features if necessary to improve model prediction.

### Observations:

* From the info() output, we can see that the testing dataset has 1 missing value in the winddirection column.
* The training dataset has no missing values.

### Strategy for Missing Values:
* For numerical columns with missing values, we will impute them with the median value of the column.
* If new meaningful features can be created (e.g., temperature range, dew point spread), we will add them to improve the model.
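
Median imputation as described above can be sketched as follows. This sketch makes one deliberate variation, which is an assumption and not the notebook's behavior: the median is computed on the training column and reused for the test column, whereas the notebook uses the test column's own median. `impute_with_train_median` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: training values from the printed head, plus a hypothetical gap in the test column
train_toy = pd.DataFrame({"winddirection": [60.0, 50.0, 70.0, 60.0, 40.0]})
test_toy = pd.DataFrame({"winddirection": [50.0, np.nan, 40.0, 20.0]})

def impute_with_train_median(train, test, columns):
    """Fill NaNs in both frames with medians computed on the training frame only."""
    train, test = train.copy(), test.copy()  # leave the caller's frames untouched
    for col in columns:
        median = train[col].median()
        train[col] = train[col].fillna(median)
        test[col] = test[col].fillna(median)
    return train, test

_, test_imputed = impute_with_train_median(train_toy, test_toy, ["winddirection"])
print(test_imputed)
```

With a single missing value the choice of median source hardly matters. Reusing training statistics is simply a habit that keeps preprocessing identical at training and prediction time.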

In [4]:
def handle_missing_values_and_feature_engineering(train_df, test_df):
    # Check for missing values in the training data
    print("Missing values in Training Data:")
    print(train_df.isnull().sum())
    
    # Check for missing values in the testing data
    print("\nMissing values in Testing Data:")
    print(test_df.isnull().sum())
    
    # Impute the missing winddirection value in the testing data with the column median.
    # Assigning the result back avoids a chained in-place fillna, which newer pandas versions warn about.
    test_df['winddirection'] = test_df['winddirection'].fillna(test_df['winddirection'].median())
    
    # Verify that the missing values are handled
    print("\nMissing values in Testing Data after imputation:")
    print(test_df.isnull().sum())
    
    # Print the first few rows of both frames (the engineered features are added in the next cell)
    print("\nTraining Data with New Features:")
    print(train_df.head())
    
    print("\nTesting Data with New Features:")
    print(test_df.head())
    
    return train_df, test_df

# Call the function
train_df, test_df = handle_missing_values_and_feature_engineering(train_df, test_df)

Missing values in Training Data:
id               0
day              0
pressure         0
maxtemp          0
temparature      0
mintemp          0
dewpoint         0
humidity         0
cloud            0
sunshine         0
winddirection    0
windspeed        0
rainfall         0
dtype: int64

Missing values in Testing Data:
id               0
day              0
pressure         0
maxtemp          0
temparature      0
mintemp          0
dewpoint         0
humidity         0
cloud            0
sunshine         0
winddirection    1
windspeed        0
dtype: int64

Missing values in Testing Data after imputation:
id               0
day              0
pressure         0
maxtemp          0
temparature      0
mintemp          0
dewpoint         0
humidity         0
cloud            0
sunshine         0
winddirection    0
windspeed        0
dtype: int64

Training Data with New Features:
   id  day  pressure  maxtemp  temparature  mintemp  dewpoint  humidity  \
0   0    1    1017.4     21.2    

In [5]:
# Some new features, defined once and applied to both frames
def add_features(df):
    df['humidity_cloud_interaction'] = df['humidity'] * df['cloud']
    df['humidity_sunshine_interaction'] = df['humidity'] * df['sunshine']
    # 1e-5 guards against division by zero on days without sunshine
    df['cloud_sunshine_ratio'] = df['cloud'] / (df['sunshine'] + 1e-5)
    df['relative_dryness'] = 100 - df['humidity']
    df['sunshine_percentage'] = df['sunshine'] / (df['sunshine'] + df['cloud'] + 1e-5)
    df['weather_index'] = (0.4 * df['humidity']) + (0.3 * df['cloud']) - (0.3 * df['sunshine'])
    return df

train_df = add_features(train_df)
test_df = add_features(test_df)

#### Created new interaction features to capture complex relationships between variables:
* humidity_cloud_interaction: Product of humidity and cloud cover.
* humidity_sunshine_interaction: Product of humidity and sunshine.
* cloud_sunshine_ratio: Ratio of cloud cover to sunshine. When sunshine is 0, the 1e-5 guard in the denominator makes this value very large (for example 9,100,000 for a cloud cover of 91).
* relative_dryness: Complement of humidity (100 - humidity).
* sunshine_percentage: Share of sunshine in the sum of sunshine and cloud cover, expressed as a fraction between 0 and 1.
* weather_index: Weighted combination of humidity (0.4), cloud cover (0.3), and sunshine (-0.3).
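
The `cloud_sunshine_ratio` feature deserves a closer look, because the `1e-5` guard turns every zero-sunshine day into a value in the millions. The sketch below reproduces that effect on three rows from the training head and compares it with one possible bounded alternative (adding 1 to the denominator). The alternative and the column names `ratio_eps` and `ratio_plus_one` are assumptions for illustration, not something the notebook uses.

```python
import pandas as pd

# Cloud and sunshine values from the first three training rows shown above
toy = pd.DataFrame({"cloud": [88.0, 91.0, 47.0], "sunshine": [1.1, 0.0, 8.3]})

# The notebook's formulation: a tiny epsilon only prevents division by zero
toy["ratio_eps"] = toy["cloud"] / (toy["sunshine"] + 1e-5)

# Hypothetical alternative: shifting the denominator by 1 keeps the ratio on the scale of `cloud`
toy["ratio_plus_one"] = toy["cloud"] / (toy["sunshine"] + 1.0)

print(toy)
```

Standardization does not remove such a heavy tail, so a bounded variant or a log transform may be worth trying. Whether either helps this model is untested here.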

In [6]:
train_df.head()

|   | id | day | pressure | maxtemp | temparature | mintemp | dewpoint | humidity | cloud | sunshine | winddirection | windspeed | rainfall | humidity_cloud_interaction | humidity_sunshine_interaction | cloud_sunshine_ratio | relative_dryness | sunshine_percentage | weather_index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1017.4 | 21.2 | 20.6 | 19.9 | 19.4 | 87.0 | 88.0 | 1.1 | 60.0 | 17.2 | 1 | 7656.0 | 95.7 | 79.99927 | 13.0 | 0.012346 | 60.87 |
| 1 | 1 | 2 | 1019.5 | 16.2 | 16.9 | 15.8 | 15.4 | 95.0 | 91.0 | 0.0 | 50.0 | 21.9 | 1 | 8645.0 | 0.0 | 9100000.0 | 5.0 | 0.0 | 65.3 |
| 2 | 2 | 3 | 1024.1 | 19.4 | 16.1 | 14.6 | 9.3 | 75.0 | 47.0 | 8.3 | 70.0 | 18.1 | 1 | 3525.0 | 622.5 | 5.662644 | 25.0 | 0.15009 | 41.61 |
| 3 | 3 | 4 | 1013.4 | 18.1 | 17.8 | 16.9 | 16.8 | 95.0 | 95.0 | 0.0 | 60.0 | 35.6 | 1 | 9025.0 | 0.0 | 9500000.0 | 5.0 | 0.0 | 66.5 |
| 4 | 4 | 5 | 1021.8 | 21.3 | 18.4 | 15.2 | 9.6 | 52.0 | 45.0 | 3.6 | 40.0 | 24.8 | 0 | 2340.0 | 187.2 | 12.49997 | 48.0 | 0.074074 | 33.22 |


In [7]:
test_df.head()

|   | id | day | pressure | maxtemp | temparature | mintemp | dewpoint | humidity | cloud | sunshine | winddirection | windspeed | humidity_cloud_interaction | humidity_sunshine_interaction | cloud_sunshine_ratio | relative_dryness | sunshine_percentage | weather_index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2190 | 1 | 1019.5 | 17.5 | 15.8 | 12.7 | 14.9 | 96.0 | 99.0 | 0.0 | 50.0 | 24.3 | 9504.0 | 0.0 | 9900000.0 | 4.0 | 0.0 | 68.1 |
| 1 | 2191 | 2 | 1016.5 | 17.5 | 16.5 | 15.8 | 15.1 | 97.0 | 99.0 | 0.0 | 50.0 | 35.3 | 9603.0 | 0.0 | 9900000.0 | 3.0 | 0.0 | 68.5 |
| 2 | 2192 | 3 | 1023.9 | 11.2 | 10.4 | 9.4 | 8.9 | 86.0 | 96.0 | 0.0 | 40.0 | 16.9 | 8256.0 | 0.0 | 9600000.0 | 14.0 | 0.0 | 63.2 |
| 3 | 2193 | 4 | 1022.9 | 20.6 | 17.3 | 15.2 | 9.5 | 75.0 | 45.0 | 7.1 | 20.0 | 50.6 | 3375.0 | 532.5 | 6.338019 | 25.0 | 0.136276 | 41.37 |
| 4 | 2194 | 5 | 1022.2 | 16.1 | 13.8 | 6.4 | 4.3 | 68.0 | 49.0 | 9.2 | 20.0 | 19.4 | 3332.0 | 625.6 | 5.326081 | 32.0 | 0.158076 | 39.14 |


# 3. Model Training

In [8]:
# Features and target (columns= already selects the axis, so axis=1 is not needed)
X = train_df.drop(columns=['id', 'rainfall'])
y = train_df['rainfall']
X_test = test_df.drop(columns=['id'])

In [9]:
# Standardize the features (the scaler is fit on all training rows, then applied to the test set)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_test_scaled = scaler.transform(X_test)

# Hold out 20% of the training rows for validation
X_train, X_val, y_train, y_val = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Reshape input for the CNN (adding a channel dimension)
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_val = X_val.reshape((X_val.shape[0], X_val.shape[1], 1))
X_test_scaled = X_test_scaled.reshape((X_test_scaled.shape[0], X_test_scaled.shape[1], 1))
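
The cell above fits `StandardScaler` on all training rows before the split, so the validation rows contribute to the mean and variance later used to scale them. A common alternative, sketched here on random toy data and not what the notebook does, is to split first and fit the scaler on the training portion only. `X_toy`, `y_toy`, and `toy_scaler` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=10.0, scale=3.0, size=(100, 3))  # hypothetical feature matrix
y_toy = rng.integers(0, 2, size=100)                     # hypothetical binary labels

# Split first so the validation rows never influence the scaler statistics
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

toy_scaler = StandardScaler()
X_tr_s = toy_scaler.fit_transform(X_tr)  # statistics come from the training rows only
X_va_s = toy_scaler.transform(X_va)      # validation rows reuse those statistics

# Add the channel dimension expected by Conv1D: (samples, features, 1)
X_tr_s = X_tr_s.reshape((X_tr_s.shape[0], X_tr_s.shape[1], 1))
X_va_s = X_va_s.reshape((X_va_s.shape[0], X_va_s.shape[1], 1))
print(X_tr_s.shape, X_va_s.shape)
```

On a dataset of this size the effect on validation loss is probably small, but the split-then-scale order keeps the validation score an honest estimate.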

- **Feature Scaling:** Standardized numerical features using StandardScaler.
- **Train-Validation Split:** 80-20 split of the training data into training and validation sets.
- **Model Architecture:**
  * Convolutional Neural Network (CNN) with Conv1D, MaxPooling1D, Flatten, Dense, and Dropout layers.
  * ReLU activations in the hidden layers, a sigmoid output unit, and the Adam optimizer.
  * Early stopping to limit overfitting, plus learning-rate reduction when the validation loss plateaus.
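
To see how the Conv1D and MaxPooling1D layers shrink the input, the arithmetic below traces the sequence length through the layer sizes used in the model cell that follows (kernel size 3, pool size 2, filters 32 then 16). It is a plain-Python sketch with hypothetical helper names, and it assumes the Keras defaults of "valid" padding, stride 1 for convolutions, and a pooling stride equal to the pool size.

```python
def conv1d_out(length, kernel_size):
    """Output length of a 'valid' Conv1D with stride 1."""
    return length - kernel_size + 1

def maxpool1d_out(length, pool_size):
    """Output length of a 'valid' MaxPooling1D whose stride equals pool_size."""
    return length // pool_size

# 19 columns after feature engineering, minus id and rainfall
n_features = 17

length = conv1d_out(n_features, 3)   # after Conv1D(32, kernel_size=3)
length = maxpool1d_out(length, 2)    # after MaxPooling1D(2)
length = conv1d_out(length, 3)       # after Conv1D(16, kernel_size=3)
length = maxpool1d_out(length, 2)    # after MaxPooling1D(2)

flattened = length * 16              # Flatten: remaining steps x filters of the last Conv1D
print(f"Sequence length before Flatten: {length}, flattened size: {flattened}")
```

Under these assumptions only two positions survive to the Flatten layer, so the dense head sees 32 values per sample.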

In [10]:
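model = Sequential([
    # Each sample is a sequence of scaled features with a single channel
    Input(shape=(X_train.shape[1], 1)),
    Conv1D(filters=32, kernel_size=3, activation='relu'),
    MaxPooling1D(pool_size=2),
    Conv1D(filters=16, kernel_size=3, activation='relu'),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(32, activation='relu'),
    Dropout(0.3),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')  # outputs a rain probability in [0, 1]
])

optimizer = Adam(learning_rate=0.001)
# Mean squared error on the sigmoid output, with mean absolute error tracked as the metric
model.compile(optimizer=optimizer, loss='mean_squared_error', metrics=['mae'])
# Stop after 20 epochs without val_loss improvement and restore the best weights
early_stopping = EarlyStopping(monitor='val_loss', patience=20, restore_best_weights=True, verbose=1)
# Cut the learning rate by 10x after 10 stagnant epochs, down to a floor of 1e-5
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10, min_lr=1e-5, verbose=1)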


history = model.fit(
    X_train, y_train, 
    epochs=200, batch_size=32, validation_data=(X_val, y_val), 
    callbacks=[early_stopping, reduce_lr], verbose=1
)
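
The compile call above trains the sigmoid output with mean squared error. For a probability output, binary cross-entropy is the more common choice because it penalizes confident mistakes far more heavily. The sketch below compares the two losses on a single badly wrong prediction in plain NumPy. The helper names are hypothetical, and this is an illustration only, not a change to the training run shown here.

```python
import numpy as np

def mse(y_true, p):
    """Mean squared error between labels and predicted probabilities."""
    return float(np.mean((y_true - p) ** 2))

def binary_cross_entropy(y_true, p, eps=1e-7):
    """Binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# One rainy day (label 1) predicted as almost certainly dry
y_true = np.array([1.0])
p_wrong = np.array([0.01])

print(f"MSE: {mse(y_true, p_wrong):.4f}")
print(f"BCE: {binary_cross_entropy(y_true, p_wrong):.4f}")
```

The squared error can never exceed 1 per sample, while the cross-entropy keeps growing as the prediction approaches the wrong extreme. In Keras the corresponding change would be passing `loss='binary_crossentropy'` to `model.compile`; the logged loss values would then no longer be comparable with the training log of this run.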

Epoch 1/200
55/55 ━━━━━━━━━━━━━━━━━━━━ 6s 44ms/step - loss: 0.2402 - mae: 0.4844 - val_loss: 0.1560 - val_mae: 0.3553 - learning_rate: 0.0010
Epoch 2/200
55/55 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.1422 - mae: 0.3232 - val_loss: 0.1216 - val_mae: 0.2399 - learning_rate: 0.0010
Epoch 3/200
55/55 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.1084 - mae: 0.2306 - val_loss: 0.1211 - val_mae: 0.2184 - learning_rate: 0.0010
Epoch 4/200
55/55 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.0980 - mae: 0.2068 - val_loss: 0.1205 - val_mae: 0.2257 - learning_rate: 0.0010
Epoch 5/200
55/55 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.1045 - mae: 0.2167 - val_loss: 0.1214 - val_mae: 0.2146 - learning_rate: 0.0010
Epoch 6/200
55/55 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 0.1032 - mae: 0.2012 - val_loss

In [11]:
test_preds = model.predict(X_test_scaled).flatten()

if np.isnan(test_preds).sum() > 0:
    print(f"Found {np.isnan(test_preds).sum()} NaN values in predictions. Fixing them...")
    test_preds = np.nan_to_num(test_preds)

23/23 ━━━━━━━━━━━━━━━━━━━━ 1s 22ms/step


In [12]:
submission = pd.DataFrame({"id": test_df['id'], "rainfall": test_preds})
submission.to_csv("submission.csv", index=False)
submission

|   | id | rainfall |
|---|---|---|
| 0 | 2190 | 0.994570 |
| 1 | 2191 | 0.994405 |
| 2 | 2192 | 0.976222 |
| 3 | 2193 | 0.188719 |
| 4 | 2194 | 0.082990 |
| ... | ... | ... |
| 725 | 2915 | 0.991637 |
| 726 | 2916 | 0.833734 |
| 727 | 2917 | 0.989235 |
| 728 | 2918 | 0.987771 |
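
Before uploading, a few structural checks on the submission frame can catch mistakes that the NaN check alone would miss. The following is a minimal sketch. `check_submission` and the toy frames are hypothetical, and the expected column names are taken from the table above.

```python
import numpy as np
import pandas as pd

def check_submission(sub, expected_ids):
    """Return a list of problems found; an empty list means the frame looks valid."""
    if list(sub.columns) != ["id", "rainfall"]:
        return [f"unexpected columns: {list(sub.columns)}"]
    problems = []
    if sub["id"].duplicated().any():
        problems.append("duplicate ids")
    if set(sub["id"]) != set(expected_ids):
        problems.append("ids do not match the test set")
    if sub["rainfall"].isna().any():
        problems.append("NaN predictions")
    elif not sub["rainfall"].between(0.0, 1.0).all():
        problems.append("predictions outside [0, 1]")
    return problems

# A valid toy frame (first three rows of the table above) and a deliberately broken one
good = pd.DataFrame({"id": [2190, 2191, 2192], "rainfall": [0.994570, 0.994405, 0.976222]})
bad = pd.DataFrame({"id": [2190, 2191, 2191], "rainfall": [0.9, np.nan, 1.2]})

print(check_submission(good, [2190, 2191, 2192]))
print(check_submission(bad, [2190, 2191, 2192]))
```

In the notebook, the call would presumably be `check_submission(submission, test_df['id'])`.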


* **Predictions on Test Data:** Generated rainfall probability scores from the model's sigmoid output.
* **NaN Check:** Checked the predictions for NaN values; any that appear are replaced with 0 by np.nan_to_num.
* **Submission File:** Created a submission.csv file containing id and the predicted rainfall probability.
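
The notebook monitors validation loss during training but never scores the held-out split with a classification metric. Because the model outputs probabilities, a ranking metric such as ROC AUC is one reasonable option. The sketch below uses scikit-learn's `roc_auc_score` on toy arrays. `y_val_toy` and `val_probs_toy` are hypothetical stand-ins for `y_val` and `model.predict(X_val).flatten()`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical validation labels and predicted rain probabilities
y_val_toy = np.array([1, 1, 0, 0])
val_probs_toy = np.array([0.99, 0.19, 0.08, 0.83])

# ROC AUC is the fraction of (rainy, dry) pairs in which the rainy day gets the higher score
auc = roc_auc_score(y_val_toy, val_probs_toy)
print(f"Validation ROC AUC: {auc:.2f}")
```

Here three of the four rainy/dry pairs are ordered correctly, giving 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.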