## **Objective**  
In this episode the task is to predict the probability of rainfall for each day of the year. 
Submissions are evaluated on the **Area Under the Receiver Operating Characteristic Curve** between the predicted probability and the observed target.

**AUC-ROC** is defined as:

$$
\textrm{AUC} = \sum_{i=1}^{n} ( \textrm{FPR}_i - \textrm{FPR}_{i-1} ) \times \textrm{TPR}_i
$$
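
As a rough illustration of the sum above, the following pure-Python sketch builds the ROC points by lowering the threshold one distinct score at a time and then adds up the rectangles. The helper names `roc_points` and `auc_rectangle`, and the toy labels and scores, are made up for this example and are not part of the competition code.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists, one point per distinct score threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Sort samples from the highest score to the lowest
    order = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(order):
        tp += label
        fp += 1 - label
        # Emit a point only after the last sample of a group of tied scores
        if i + 1 == len(order) or order[i + 1][0] != score:
            tpr.append(tp / n_pos)
            fpr.append(fp / n_neg)
    return fpr, tpr


def auc_rectangle(fpr, tpr):
    """Sum of (FPR_i - FPR_{i-1}) * TPR_i over consecutive ROC points."""
    return sum((fpr[i] - fpr[i - 1]) * tpr[i] for i in range(1, len(fpr)))


# Toy example (hypothetical data): two dry days and two rainy days
fpr, tpr = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_rectangle(fpr, tpr))  # 0.75
```

scikit-learn's `roc_auc_score` integrates with the trapezoidal rule instead; the two agree whenever no positive and negative sample share the same score, as in this toy example.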
 
## **Data Overview**  
The dataset for this competition is generated from a deep learning model trained on the [Rainfall Prediction using Machine Learning](https://www.kaggle.com/datasets/subho117/rainfall-prediction-using-machine-learning) dataset.  

**Key Features:**  
- **Pressure:** Atmospheric pressure recorded daily.  
- **Max Temperature (maxtemp):** Highest temperature recorded in a day.  
- **Temperature:** Average daily temperature.  
- **Min Temperature (mintemp):** Lowest temperature recorded in a day.  
- **Dew Point:** Temperature at which condensation occurs.  
- **Humidity:** Percentage of moisture in the air.  
- **Cloud Cover:** Extent of cloudiness during the day.  
- **Sunshine Duration:** Total hours of sunshine received in a day.  
- **Wind Direction:** Direction from which the wind is blowing.  
- **Wind Speed:** Speed of the wind measured in a given unit.  
- **Day:** Recorded day of the year.  

In [None]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

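from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve, auc
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

sns.set_theme(style="whitegrid", palette="muted")

print('Libraries imported')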

In [None]:
train_data = pd.read_csv('/kaggle/input/playground-series-s5e3/train.csv')
test_data = pd.read_csv('/kaggle/input/playground-series-s5e3/test.csv')
original_data = pd.read_csv('/kaggle/input/rainfall-prediction-using-machine-learning/Rainfall.csv')

print('Data imported')

In [None]:
print("Train Data Preview:")
display(train_data.tail())

print("\nOriginal Data Preview:")
display(original_data.tail())

print("\nTest Data Preview:")
display(test_data.head())

In [None]:
# Removing whitespace in column names
train_data.columns = train_data.columns.str.strip()
test_data.columns = test_data.columns.str.strip()
original_data.columns = original_data.columns.str.strip()

# Correct the misspelled 'temparature' column name in all three datasets
train_data = train_data.rename(columns={'temparature': 'temperature'})
test_data = test_data.rename(columns={'temparature': 'temperature'})
original_data = original_data.rename(columns={'temparature': 'temperature'})

# Converting original dataset 'rainfall' column to binary format
original_data['rainfall'] = (original_data['rainfall'] == 'yes').astype(int)

# Reorder columns in original_data to match train_data
original_data = original_data.reindex(columns=train_data.columns)

In [None]:
print("Train Data Preview:")
display(train_data.tail())

print("\nOriginal Data Preview:")
display(original_data.tail())

print("\nTest Data Preview:")
display(test_data.head())

In [None]:
train_duplicates = train_data.duplicated().sum()
test_duplicates = test_data.duplicated().sum()
original_duplicates = original_data.duplicated().sum()

print(f'Number of duplicate rows in train_data: {train_duplicates}')
print(f'Number of duplicate rows in test_data: {test_duplicates}')
print(f'Number of duplicate rows in original_data: {original_duplicates}')

In [None]:
#Missing and unique values

missing_values_train = pd.DataFrame({'Feature': train_data.columns,
                                     '[TRAIN] Missing Values': train_data.isnull().sum().values})

missing_values_test = pd.DataFrame({'Feature': test_data.columns,
                                     '[TEST] Missing Values': test_data.isnull().sum().values})

missing_values_original = pd.DataFrame({'Feature': original_data.columns,
                                       '[ORIGINAL] Missing Values': original_data.isnull().sum().values})

unique_values_train = pd.DataFrame({'Feature': train_data.columns,
                              '[TRAIN] Unique Values': train_data.nunique().values})

unique_values_test = pd.DataFrame({'Feature': test_data.columns,
                              '[TEST] Unique Values': test_data.nunique().values})

unique_values_original = pd.DataFrame({'Feature': original_data.columns,
                              '[ORIGINAL] Unique Values': original_data.nunique().values})


feature_types = pd.DataFrame({'Feature': train_data.columns,
                              '[TRAIN] DataType': train_data.dtypes})

merged_df = pd.merge(missing_values_train, missing_values_test, on='Feature', how='left')
merged_df = pd.merge(merged_df, missing_values_original, on='Feature', how='left')
merged_df = pd.merge(merged_df, unique_values_train, on='Feature', how='left')
merged_df = pd.merge(merged_df, unique_values_test, on='Feature', how='left')
merged_df = pd.merge(merged_df, unique_values_original, on='Feature', how='left')
merged_df = pd.merge(merged_df, feature_types, on='Feature', how='left')

merged_df

In [None]:
train_data.describe().T

In [None]:
test_data.describe().T

In [None]:
original_data.describe().T

## **Handling Missing Values**

In [None]:
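# Missing values in original_data: fill with the column mean
# (assigning back avoids chained in-place fillna, which newer pandas versions warn about)
original_data['winddirection'] = original_data['winddirection'].fillna(original_data['winddirection'].mean())
original_data['windspeed'] = original_data['windspeed'].fillna(original_data['windspeed'].mean())

# Missing values in test_data
test_data['winddirection'] = test_data['winddirection'].fillna(test_data['winddirection'].mean())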

## **Data Observations**

**Temperature:**
- Max Temp: The mean is approximately 26°C, with a range from about 7°C to 36°C across all datasets.
- Min Temp: The mean is around 22°C, showing moderate temperature variation.
- Temperature Variation: Both max and min temperatures indicate a relatively consistent climate with some seasonal variation.

**Humidity Levels:**
- Average Humidity: Ranges from 80% to 82% across datasets, indicating a humid climate.
- Consistency: Humidity levels are highly consistent, suggesting a stable atmospheric condition.

**Sunshine & Cloud Cover:**
- Cloud Cover: Train and test datasets show higher cloudiness (75-76%) compared to the original dataset (71%), indicating more sunny days in the original data.
- Sunshine Hours: Low average sunshine (3-4 hours), with a maximum of 12.1 hours, suggesting cloudy conditions dominate.

**Wind & Pressure Trends:**
- Pressure Levels: Consistent mean pressure around 1013 hPa across datasets, indicating stable atmospheric conditions.
- Wind Speed & Direction: Mean wind speed is about 21-22 km/h, with extreme values up to 59 km/h. Wind direction varies widely (10° to 350°), indicating diverse wind patterns.
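
Observations like the ones above are easiest to check when the per-feature means sit side by side. The sketch below shows one possible way to build such a table with pandas; `compare_means` is a hypothetical helper, and the two toy frames only stand in for `train_data` and `original_data` (their values are invented).

```python
import pandas as pd


def compare_means(frames):
    """Return one column of per-feature means for each named DataFrame."""
    return pd.concat(
        {name: df.mean(numeric_only=True) for name, df in frames.items()},
        axis=1,
    ).round(2)


# Hypothetical toy frames; in the notebook these would be the real datasets
toy_train = pd.DataFrame({"humidity": [80.0, 84.0], "cloud": [70.0, 80.0]})
toy_original = pd.DataFrame({"humidity": [78.0, 82.0], "cloud": [66.0, 76.0]})

summary = compare_means({"train": toy_train, "original": toy_original})
print(summary)
```

In the notebook, a call such as `compare_means({"train": train_data, "test": test_data, "original": original_data})` would give the same layout for the real data.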

## **EDA**

In [None]:
target_variable = 'rainfall'

numerical_variables = ['winddirection', 'pressure', 'maxtemp', 'temperature', 'mintemp', 'dewpoint', 'humidity', 'cloud', 'sunshine', 'windspeed']
categorical_variables = []


## **Rainfall Distribution**

In [None]:
def plot_target_distribution(data, target_variable, title_suffix=""):
    
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    sns.countplot(y=target_variable, data=data, ax=axes[0])
    axes[0].set_title(f'Distribution of {target_variable} in {title_suffix}')
    axes[0].set_xlabel('Count')
    axes[0].set_ylabel(target_variable)

    for p in axes[0].patches:
        axes[0].annotate(f'{int(p.get_width())}', 
                         (p.get_width(), p.get_y() + p.get_height() / 2), 
                         ha='left', va='center', 
                         fontsize=10)

    axes[0].set_axisbelow(True)
    axes[0].grid(axis='x', linestyle='--', linewidth=0.7)  
    sns.despine(left=True, bottom=True)

    rainfall_counts = data[target_variable].value_counts()
    wedges, texts, autotexts = axes[1].pie(
        rainfall_counts, 
        labels=rainfall_counts.index, 
        autopct='%1.1f%%', 
        startangle=90
    )
    centre_circle = plt.Circle((0, 0), 0.70, fc='white')
    axes[1].add_artist(centre_circle)  # draw the donut hole on the pie axes explicitly
    axes[1].set_title(f'Percentage of {target_variable} Distribution in {title_suffix}')
    axes[1].axis('equal')

    plt.tight_layout()
    plt.show()


plot_target_distribution(train_data, 'rainfall', title_suffix="Train Data")
plot_target_distribution(original_data, 'rainfall', title_suffix="Original Data")

## **Rainfall Distribution Analysis**

**Distribution of Rainfall in Train Data:**
- **Significant Imbalance:** Rainy days are overrepresented: approximately 75.3% of the data points indicate rainfall, while only 24.7% indicate no rain.
- **Implications for Model Training:** This imbalance could pose a challenge for model training, leading to bias towards the majority class.
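
One common way to counter this kind of bias is to weight each class inversely to its frequency. The sketch below reproduces the heuristic that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * count)`. The function name `balanced_class_weights` and the toy label list (built to mimic a roughly 75/25 split) are assumptions made for illustration only.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 753 rainy days (1) and 247 dry days (0)
toy_labels = [1] * 753 + [0] * 247
weights = balanced_class_weights(toy_labels)
print({cls: round(w, 3) for cls, w in weights.items()})
```

The minority class receives the larger weight. Since ROC AUC depends only on how predictions rank rainy days against dry days, reweighting may matter less for this metric than it would for accuracy.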

**Distribution of Rainfall in Original Data:**
- **Imbalance Persists:** The original dataset is also imbalanced: approximately 68% of the data points indicate rainfall, while only 32% indicate no rain.


## **Numerical Variable Distributions Across Datasets**

In [None]:
fig, axes = plt.subplots(len(numerical_variables), 2, figsize=(12, len(numerical_variables) * 4))

for i, feature in enumerate(numerical_variables):
    # Histogram
    sns.histplot(train_data[feature], label='Train Data', bins=20, kde=True, ax=axes[i, 0])
    sns.histplot(test_data[feature], label='Test Data', bins=20, kde=True, ax=axes[i, 0])
    sns.histplot(original_data[feature], label='Original Data', bins=20, kde=True, ax=axes[i, 0])
    axes[i, 0].set_title(f'Histogram of {feature}')
    axes[i, 0].legend()
    axes[i, 0].grid(linestyle='--', linewidth=0.7)

    # Horizontal Boxplot
    sns.boxplot(data=[train_data[feature], test_data[feature], original_data[feature]], 
                orient='h', ax=axes[i, 1])
    axes[i, 1].set_title(f'Horizontal Boxplot of {feature}')
    axes[i, 1].set_yticklabels(['Train Data', 'Test Data', 'Original Data'])
    axes[i, 1].grid(axis='x', linestyle='--', linewidth=0.7)

plt.tight_layout()
plt.show()

## **Numerical Variable Distribution Analysis**

**Day:**
- Train and test data show a relatively uniform distribution, as does the original dataset over a limited range.  
- Train and test data likely cover multiple years.
- The original data covers a shorter time period.

**Pressure:**
- All datasets show a relatively normal distribution centered around a similar mean.
- Outliers are present in all three datasets.  
- Atmospheric pressure is relatively stable across all datasets.

**Maxtemp, Temperature, Mintemp:**
- These temperature-related features have similar distributions, with a relatively normal shape.  
- Temperatures are generally consistent across datasets.

**Dewpoint:**
- The dewpoint distributions are similar across datasets, indicating consistent moisture levels.
- Slight negative skew is present in all the datasets.  
- Moisture patterns are relatively stable.
- Lower dewpoints are less common.

**Humidity:**
- Humidity levels show consistent distributions, with a slight left skew, indicating a trend towards higher humidity.  
- The climate is generally humid.  

**Cloud:**
- Cloud cover shows a bimodal distribution, with peaks at high cloud cover and a secondary peak at lower cloud cover.  
- The climate experiences both clear and overcast conditions.

**Sunshine:**
- Sunshine hours are heavily skewed towards lower values.
- The original dataset has a higher average of sunshine.
- Cloudy conditions are prevalent.

**Winddirection:**
- Wind direction shows a relatively uniform distribution, indicating variability in wind patterns.
- There are a few outliers.  
- Wind patterns are diverse.

**Windspeed:**  
- Wind speed is skewed towards lower values, with a long tail indicating occasional high wind speeds.
- There are a few outliers.  
- Moderate wind speeds are more common, but strong winds occur.
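
The skew statements above can be backed by a number. The following sketch computes the moment-based (Fisher-Pearson) skewness coefficient in plain Python: positive values indicate a long right tail, negative values a long left tail. `sample_skewness` is a hypothetical helper and the lists are toy data. It assumes the input contains at least two distinct values.

```python
def sample_skewness(values):
    """Fisher-Pearson skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance (population form)
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Toy data: symmetric, right-tailed and left-tailed samples
print(sample_skewness([1, 2, 3, 4, 5]))         # symmetric
print(sample_skewness([0, 0, 0, 1, 10]) > 0)    # long right tail
print(sample_skewness([-10, -1, 0, 0, 0]) < 0)  # long left tail
```

pandas' `Series.skew()` applies a bias correction, so its values differ slightly from this coefficient on small samples, but the sign and rough magnitude should agree.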

## **Correlation Analysis of Numerical Variables**

In [None]:
fig, axes = plt.subplots(3, 1, figsize=(10, 18))  
# Train Data Heatmap
sns.heatmap(train_data.corr(), cmap='coolwarm', annot=True, fmt='.2f', ax=axes[0])
axes[0].set_title('Correlation Heatmap - Train Data')

# Original Data Heatmap
sns.heatmap(original_data.corr(), cmap='coolwarm', annot=True, fmt='.2f', ax=axes[1])
axes[1].set_title('Correlation Heatmap - Original Data')

# Test Data Heatmap
sns.heatmap(test_data.corr(), cmap='coolwarm', annot=True, fmt='.2f', ax=axes[2])
axes[2].set_title('Correlation Heatmap - Test Data')

plt.tight_layout()
plt.show()

## **Correlation Analysis Observations**

**General Observations Across All Datasets:**  

**Temperature Correlations:**
- maxtemp, temperature, and mintemp exhibit very strong positive correlations across all datasets. These temperature-related features move together consistently, which points to possible redundancy among them.
- They show strong negative correlations with pressure.
- They show a strong positive correlation with dewpoint.
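
To turn a heatmap reading like this into a list of candidate redundant features, one option is to scan the correlation matrix for pairs above a cutoff. The sketch below is only an illustration: `correlated_pairs`, the 0.8 threshold, and the synthetic columns are all assumptions, not values taken from the competition data.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.8):
    """List feature pairs whose absolute Pearson correlation is at least `threshold`."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair appears once
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, round(float(r), 3)))
    # Strongest relationships first
    return sorted(pairs, key=lambda p: -abs(p[2]))


# Synthetic toy data: mintemp tracks maxtemp closely, windspeed is independent
rng = np.random.default_rng(0)
maxtemp = rng.normal(26, 5, 200)
toy = pd.DataFrame({
    "maxtemp": maxtemp,
    "mintemp": maxtemp - 4 + rng.normal(0, 0.5, 200),
    "windspeed": rng.normal(21, 8, 200),
})
pairs = correlated_pairs(toy)
print(pairs)
```

Applied to `train_data`, a list like this could guide which of the temperature columns to drop or combine.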

**Pressure Correlations:**
- pressure has strong negative correlations with all temperature-related features and dewpoint across all datasets.
- pressure has a moderate positive correlation with windspeed.

**Humidity Correlations:**
- humidity shows a moderate positive correlation with cloud and a moderate negative correlation with sunshine across all datasets.
- humidity has a moderate positive correlation with rainfall.

**Cloud and Sunshine Correlations:**
- cloud and sunshine have a strong negative correlation, indicating that cloudy conditions are associated with less sunshine.
- cloud has a moderate positive correlation with rainfall.

**Wind Direction Correlations:**
- winddirection has a moderate negative correlation with pressure and a moderate positive correlation with the temperature-related columns.

**Rainfall Correlations:**
- rainfall has a moderate positive correlation with humidity and cloud and a moderate negative correlation with sunshine.

**Interesting Thoughts**
- The moderate correlation between winddirection and temperature suggests that certain wind directions may be associated with warmer air masses.

## **KDE Plots for Target and Numerical Variable Relationships**

In [None]:
# KDE plot for Feature-Target Relationship
plt.figure(figsize=(14, 10))
for i, col in enumerate(numerical_variables, 1):
    plt.subplot(3, 4, i)
    sns.kdeplot(train_data[col][train_data['rainfall'] == 1], color='red', label='Rainfall: 1')
    sns.kdeplot(train_data[col][train_data['rainfall'] == 0], color='blue', label='Rainfall: 0')
    plt.title(f'Distribution of {col} by Rainfall')
    plt.legend()
plt.tight_layout()
plt.show()

## **Pair Plots for Numerical Variable Relationships**


In [None]:
#Visualising relationships between variables
sns.pairplot(train_data[numerical_variables + ['rainfall']], hue='rainfall', palette='coolwarm', diag_kind='kde')
plt.show()

## **Transformations of Skewed Data**

In [None]:
log_transformation = ['sunshine', 'windspeed', 'cloud', 'humidity', 'dewpoint']

for col in log_transformation:
    train_data[f'log_{col}'] = np.log1p(train_data[col])
for col in log_transformation:
    original_data[f'log_{col}'] = np.log1p(original_data[col])
for col in log_transformation:
    test_data[f'log_{col}'] = np.log1p(test_data[col])

fig, axes = plt.subplots(len(log_transformation), 2, figsize=(12, len(log_transformation) * 4))

for i, feature in enumerate(log_transformation):
    log_col = f'log_{feature}'  # plot the transformed column rather than the raw feature

    # Histogram
    sns.histplot(train_data[log_col], label='Train Data', bins=20, kde=True, ax=axes[i, 0])
    sns.histplot(test_data[log_col], label='Test Data', bins=20, kde=True, ax=axes[i, 0])
    sns.histplot(original_data[log_col], label='Original Data', bins=20, kde=True, ax=axes[i, 0])
    axes[i, 0].set_title(f'Histogram of {log_col}')
    axes[i, 0].legend()
    axes[i, 0].grid(linestyle='--', linewidth=0.7)

    # Horizontal Boxplot
    sns.boxplot(data=[train_data[log_col], test_data[log_col], original_data[log_col]], 
                orient='h', ax=axes[i, 1])
    axes[i, 1].set_title(f'Horizontal Boxplot of {log_col}')
    axes[i, 1].set_yticklabels(['Train Data', 'Test Data', 'Original Data'])
    axes[i, 1].grid(axis='x', linestyle='--', linewidth=0.7)

plt.tight_layout()
plt.show()

## **Feature Engineering**

In [None]:
def feature_engineering_data(data):
    # Feature Engineering
    data["dew_humidity"] = data["dewpoint"] * data["humidity"] # ***
    data["cloud_windspeed"] = data["cloud"] * data["windspeed"] # ***
    data["cloud_to_humidity"] = data["cloud"] / data["humidity"]
    data["temp_to_sunshine"] = data["sunshine"] / data["temperature"] # *** (sunshine divided by temperature, despite the name)

    
    #data["temp_range"] = data["maxtemp"] - data["mintemp"]
    #data["temp_from_dewpoint"] = data["temperature"] - data["dewpoint"] # **?
    #data["wind_speeddirection"] = data["windspeed"] * data["winddirection"]
    #data['avg_temp'] = (data['maxtemp'] + data['mintemp']) / 2
    #data['cloud_persistence'] = data['cloud'] * data['sunshine']  # If both are low, it means the cloud cover persists.
    #data['pressure_temp_ratio'] = data['pressure'] / (data['temperature'] + 1)  # Avoid division by zero.
    #data['dew_temp_diff'] = data['temperature'] - data['dewpoint']
    #data['dew_humidity_ratio'] = data['dewpoint'] / (data['humidity'] + 1)
    #data['cloud_humidity_plus'] = data['cloud'] + data["humidity"] 
    #data['cloud_humidity_sunshine_plus'] = data['cloud'] + data["humidity"] + data['sunshine']
    #data['cloud_sunshine_*'] = data['cloud'] * data['sunshine']
    data['wind_temp_interaction'] = data['windspeed'] * data['temperature']
    #data['sunshine_wind_interaction'] = data['sunshine'] + data['windspeed'] # *
    #data['cloud_humidity_ratio'] = data['cloud'] + (data['humidity'])  # Avoid division by zero
    #data['pressure_temp_ratio'] = data['pressure'] / (data['temperature'] + 1)  # Avoid division by zero
    #data['cloud_wind_ratio'] = data['cloud'] / (data['windspeed'] + 1)  # Avoid division by zero


    #data['cloud_coverage_rate'] = data['cloud'] / 100  # Normalize to 0-1 range 
    #data['cloud_sun_interaction'] = data['cloud'] * (1 - data['sunshine'])

    
    #data['weather_severity'] = (data['cloud'] * data['humidity']) / (data['pressure'] * (data['sunshine'] + 1))
    data['cloud_sun_ratio'] = data['cloud'] / (data['sunshine'] + 1) # ***
    #data["cloud_sunshine_+"] = data["cloud"] + data["sunshine"]
    #data["cloud_sunshine_-"] = data["cloud"] - data["sunshine"]
    data["dew_humidity/sun"] = data["dewpoint"] * data["humidity"] / (data['sunshine'] + 1)
    data["dew_humidity_+"] = data["dewpoint"] * data["humidity"]  # NOTE: same product as dew_humidity above, despite the '+' suffix
    

    data['humidity_sunshine_*'] = data["humidity"] * data['sunshine']

    data["cloud_humidity/pressure"] = (data["cloud"] * data["humidity"]) / data["pressure"]
    

    # Extract temporal features
    data['month'] = ((data['day'] - 1) // 30 + 1).clip(upper=12)
    data['season'] = data['month'].apply(lambda x: 1 if 3 <= x <= 5  # Spring
                                         else 2 if 6 <= x <= 8  # Summer
                                         else 3 if 9 <= x <= 11  # Autumn
                                         else 0)  # Winter
    # Seasonal trends
    #data['season_temp_trend'] = data['temperature'] * data['season']
    data['season_cloud_trend'] = data['cloud'] * data['season']

    # Seasonal deviation from mean values
    data['season_cloud_deviation'] = data['cloud'] - data.groupby('season')['cloud'].transform('mean')
    data['season_temperature'] = data['temperature'] * data['season']  # Interaction of temperature and season

    # 'month' was only needed to derive 'season'
    data = data.drop(columns=["month"])
    #data['season_temp_trend'] = data['avg_temp'] * data['season']
    #data['season_dewpoint_trend'] = data['dewpoint'] * data['season']
    #data["dew_humidity_with_season"] = data['humidity'] * data['season']

    # Drop the raw columns that are now represented by engineered features
    data = data.drop(columns=["maxtemp", "winddirection","humidity","temperature","pressure","day","season"])

    return data

# Apply to train and test datasets
train_data = feature_engineering_data(train_data)
test_data = feature_engineering_data(test_data)

In [None]:
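fig, ax = plt.subplots(figsize=(16, 12))  # new figure; the earlier `axes` belongs to a previous plot
sns.heatmap(train_data.corr(), cmap='coolwarm', annot=True, fmt='.2f', ax=ax)
ax.set_title('Correlation Heatmap - Train Data')
plt.show()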

## **Model and Feature Analysis**

In [None]:
# Select features and target variable
X = train_data.drop(['rainfall', 'id'], axis=1)
y = train_data['rainfall']
X_test = test_data.drop(['id'], axis=1)

# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_test = scaler.transform(X_test)

In [None]:
models = {
    "Logistic Regression": LogisticRegression(random_state=42,max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42, n_estimators=100),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Support Vector Machine": SVC(probability=True, random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Neural Network": MLPClassifier(random_state=42, max_iter=100, hidden_layer_sizes=(10,)),  # one hidden layer of 10 units
    "XGBoost": XGBClassifier(random_state=42, n_estimators=100, learning_rate=0.05, max_depth=6),
    "CatBoost": CatBoostClassifier(random_state=42, iterations=100, learning_rate=0.14, depth=6, verbose=0)
}

# Train models using StratifiedKFold CV
FOLDS = 13
skf = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=42)
auc_scores = {}
roc_curves = {}

for name, model in models.items():
    oof_preds = np.zeros(len(y))
    
    for train_idx, val_idx in skf.split(X, y):
        X_train, X_val = X[train_idx], X[val_idx]
        # Positional indexing: y is a pandas Series, X is a NumPy array after scaling
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

        # Fit on the training fold only; no early stopping against the validation fold,
        # so the out-of-fold AUC is not tuned on the data it is scored on
        model.fit(X_train, y_train)

        oof_preds[val_idx] = model.predict_proba(X_val)[:, 1]
    
    auc_score = roc_auc_score(y, oof_preds)
    auc_scores[name] = auc_score
    fpr, tpr, _ = roc_curve(y, oof_preds)
    roc_curves[name] = (fpr, tpr, auc_score)
    print(f"{name}: AUC = {auc_score:.4f}")

## **ROC Curve Comparison**

In [None]:
# Plot ROC curves
plt.figure(figsize=(8, 6))
for model_name, (fpr, tpr, auc_score) in roc_curves.items():
    plt.plot(fpr, tpr, label=f"{model_name} (AUC = {auc_score:.4f})")
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve Comparison")
plt.legend()
plt.show()

## **Submission File**

In [None]:
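# Assumption: the "best" model is the one with the highest out-of-fold AUC,
# refit here on the full training set before predicting on the test set
best_name = max(auc_scores, key=auc_scores.get)
best_model = models[best_name]
best_model.fit(X, y)
test_preds = best_model.predict_proba(X_test)[:, 1]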

# Submission
submission = pd.DataFrame({'id': test_data['id'], 'rainfall': test_preds})
submission.to_csv("submission.csv", index=False)
print("Submission file saved as 'submission.csv'.")

In [None]:
submission