# INTELLIHACK 5.0 - TEAM HYPER TUNERS
## Task 1 - Part 1: Weather Prediction Model
  
✅ **Objective**: Build a machine learning model to predict the probability of rain over the next 21 days using historical weather data, with thorough preprocessing, analysis, and optimization.  
**Dataset**: `weather_data.csv` (assumed columns: `date`, `avg_temperature`, `humidity`, `avg_wind_speed`, `cloud_cover`, `pressure`, `rain_or_not`)  
**Approach**:  
- Preprocess data to handle missing values, outliers, and feature engineering.  
- Perform exploratory data analysis (EDA) to uncover patterns.  
- Train and tune LightGBM and ensemble models for prediction.  
- Generate and visualize rain probabilities for the next 21 days.

## Table of Contents
1. [Initial Setup](#initial-setup)  
2. [Loading the Dataset](#step-1-loading-the-dataset)  
3. [Data Preprocessing](#step-2-data-preprocessing)  
4. [Feature Engineering](#step-3-feature-engineering)  
5. [Exploratory Data Analysis](#step-4-exploratory-data-analysis)  
6. [Model Training and Evaluation](#step-5-model-training-and-evaluation)  
7. [Hyperparameter Tuning](#step-6-hyperparameter-tuning)  
8. [Ensemble Model](#step-7-ensemble-model)  
9. [Feature Importance Analysis](#step-8-feature-importance-analysis)  
10. [Rain Probability Prediction](#step-9-rain-probability-prediction)  
11. [Conclusion](#conclusion)

In [None]:
#Importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.metrics import classification_report, roc_auc_score, accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
import lightgbm as lgb
from datetime import timedelta

In [None]:
# Step 1: Load the dataset

# Set random seed for reproducibility
np.random.seed(42)

# Load dataset
try:
    df = pd.read_csv('weather_data.csv') 
    print("Dataset Loaded Successfully. Shape:", df.shape)
    print("First 5 Rows:\n", df.head())
except FileNotFoundError:
    print("Error: File not found. Please provide the correct path to 'weather_data.csv'.")
    raise  # Stop here so later cells do not run without df (exit() would kill the notebook kernel)

✅ **Preprocessing Notes:**  
- Outliers (e.g., negative values) are clipped to realistic ranges (e.g., humidity ≤ 100%).  
- Missing values are filled using linear interpolation for time-series continuity, with forward/backward fill and KNN imputation as backups.  
- The target `rain_or_not` is encoded as binary (1 = Rain, 0 = No Rain).
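
The imputation cascade described above can be illustrated on a toy column. This is a minimal sketch, not the notebook's own preprocessing: the `toy` frame and its values are made up for illustration, and only the order of operations (clip, interpolate, then fill) mirrors the notes.

```python
import numpy as np
import pandas as pd

# Toy humidity readings (hypothetical): a leading gap, an interior gap,
# and one out-of-range value
toy = pd.DataFrame({"humidity": [np.nan, 60.0, np.nan, 80.0, 120.0]})

# 1. Clip to a realistic range; NaN values pass through unchanged
toy["humidity"] = toy["humidity"].clip(lower=0, upper=100)

# 2. Linear interpolation fills the interior gap from its neighbours
toy["humidity"] = toy["humidity"].interpolate(method="linear")

# 3. Forward fill, then backward fill, covers gaps at the edges
#    that interpolation leaves behind (here, the leading NaN)
toy["humidity"] = toy["humidity"].ffill().bfill()

print(toy["humidity"].tolist())  # [60.0, 60.0, 70.0, 80.0, 100.0]
```

KNN imputation only comes into play if values are still missing after these steps, for example when an entire column is empty.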

In [None]:
# Step 2 - --- Preprocessing ---

# Define numeric columns for outlier checks and imputation
numeric_cols = ['avg_temperature', 'humidity', 'avg_wind_speed', 'cloud_cover', 'pressure']

# Handle outliers and incorrect entries
# Note: clipping at 0 assumes this dataset has no legitimately negative values
# (e.g., sub-zero temperatures)
df[numeric_cols] = df[numeric_cols].clip(lower=0)      # Ensure no negative values
df['humidity'] = df['humidity'].clip(upper=100)        # Cap humidity at 100%
df['cloud_cover'] = df['cloud_cover'].clip(upper=100)  # Cap cloud cover at 100%

# Encode target variable
df['rain_or_not'] = df['rain_or_not'].map({'Rain': 1, 'No Rain': 0})

# Check missing values before imputation
print("\nMissing Values Before Preprocessing:\n", df.isnull().sum())

# Step 2.1: Linear interpolation of the numeric features for temporal continuity
# (the date and the binary target are excluded so they are never interpolated)
df[numeric_cols] = df[numeric_cols].interpolate(method='linear')

# Step 2.2: Forward and backward fill as fallback
df = df.ffill().bfill()
print("\nAfter Interpolation and Fill:\n", df.isnull().sum())

# Conditional KNN imputation if NaNs persist
if df[numeric_cols].isnull().sum().sum() > 0:
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(df[numeric_cols])
    imputer = KNNImputer(n_neighbors=4)
    X_imputed = imputer.fit_transform(X_scaled)
    df[numeric_cols] = scaler.inverse_transform(X_imputed)
    print("\nAfter KNN Imputation:\n", df.isnull().sum())

In [None]:
# Reset index after preprocessing
df.reset_index(drop=True, inplace=True)  # Drop old index to avoid duplication

In [None]:
# Summary statistics (the feature list is defined in the feature engineering step,
# once the engineered columns exist)
print("Summary Statistics:\n", df.describe())

In [None]:
# Numeric feature columns (the target is kept separate so it is not
# duplicated when 'rain_or_not' is appended for the correlation matrix)
numeric_cols = ['avg_temperature', 'humidity', 'avg_wind_speed', 'cloud_cover', 'pressure']

# Step 3 - --- Feature Engineering ---

# Convert 'date' column to datetime
df['date'] = pd.to_datetime(df['date'])

# Add engineered features
df['temp_humidity_interaction'] = df['avg_temperature'] * df['humidity']  # Interaction term
df['cloud_pressure_ratio'] = df['cloud_cover'] / (df['pressure'] + 1e-6)  # Avoid division by zero
df['month'] = df['date'].dt.month  # Extract month for seasonality

# Define final feature set and target
features = [
    'avg_temperature', 'humidity', 'avg_wind_speed', 'cloud_cover', 'pressure',
    'temp_humidity_interaction', 'cloud_pressure_ratio', 'month'
]


✅ **EDA Insights:**  
- The correlation matrix highlights strong relationships between `humidity` and `rain_or_not`.  
- Boxplots reveal distinct feature distributions for rainy vs. non-rainy days.  
- Engineered features like `temp_humidity_interaction` may capture combined effects.
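
A quick numeric check can back up the first two insights. The sketch below uses a small made-up `toy` frame (not the real dataset) to show the idea: Pearson correlation against a 0/1 target is the point-biserial correlation, and per-class means summarize what the boxplots show.

```python
import pandas as pd

# Hypothetical toy data: humidity readings with a binary rain label
toy = pd.DataFrame({
    "humidity":    [45.0, 50.0, 55.0, 60.0, 75.0, 80.0, 85.0, 90.0],
    "rain_or_not": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Pearson correlation with a binary target (point-biserial correlation)
corr = toy["humidity"].corr(toy["rain_or_not"])

# Mean humidity per class: the numeric counterpart of the boxplots
group_means = toy.groupby("rain_or_not")["humidity"].mean()

print("Correlation with rain_or_not:", round(corr, 2))
print(group_means)
```

On the real data, the same two lines can be run with `df` in place of `toy` for any column in `features`.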


In [None]:
# Step 4 - --- Exploratory Data Analysis (EDA) ---

# 1. Correlation Matrix (heatmap)
plt.figure(figsize=(12, 8))
sns.heatmap(df[numeric_cols + ['rain_or_not']].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Correlation Matrix of Features and Target")
plt.tight_layout()
plt.show()

# 2. Feature vs Target Analysis
plt.figure(figsize=(12, 8))
for i, col in enumerate(features[:6], 1):  # Plot first 6 features
    plt.subplot(2, 3, i)
    sns.boxplot(x='rain_or_not', y=col, data=df, color='lightgreen')
    plt.title(f'{col} vs Rain')
plt.tight_layout()
plt.show()

# 3. Distribution of Features
plt.figure(figsize=(15, 10))
for i, col in enumerate(features, 1):  # 'features' already includes the engineered columns
    plt.subplot(3, 3, i)
    sns.histplot(df[col], bins=20, kde=True, color='royalblue')
    plt.title(f'Distribution of {col}')
plt.tight_layout()
plt.show()

# 4. Pairplot for Key Features
sns.pairplot(df[['avg_temperature', 'humidity', 'cloud_cover', 'rain_or_not']], hue='rain_or_not')
plt.show()

# 5. Histograms
# Plot histograms for numeric features
numeric_features = df.select_dtypes(include=['float64', 'int64']).columns
for feature in numeric_features:
    plt.figure(figsize=(8, 4))
    sns.histplot(df[feature], kde=True, color='gray')
    plt.title(f'Histogram of {feature}')
    plt.show()

# 6. Scatter Plots
# Scatter plot between avg_temperature and humidity
plt.figure(figsize=(8, 6))
sns.scatterplot(x='avg_temperature', y='humidity', hue='rain_or_not', data=df)  # hue sets the point colors
plt.xlabel('Avg Temperature')
plt.ylabel('Humidity')
plt.title('Scatter Plot of Avg Temperature vs Humidity')
plt.show()


# Summary statistics
print("Summary Statistics:\n", df.describe())

In [None]:
# Step 5 -  --- Model Training and Evaluation ---

# Define features and target variable
X = df[features]
y = df['rain_or_not']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100, stratify=y)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train LightGBM model
lgb_model = lgb.LGBMClassifier(random_state=100)
lgb_model.fit(X_train_scaled, y_train)

# Evaluate LightGBM
train_pred = lgb_model.predict(X_train_scaled)
test_pred = lgb_model.predict(X_test_scaled)
train_prob = lgb_model.predict_proba(X_train_scaled)[:, 1]
test_prob = lgb_model.predict_proba(X_test_scaled)[:, 1]

print("\n--- LightGBM Results: ---")
print("Train Accuracy:", accuracy_score(y_train, train_pred)*100, "%")
print("Test Accuracy:", accuracy_score(y_test, test_pred)*100, "%")
print("Test Classification Report:\n", classification_report(y_test, test_pred))
print("Test ROC AUC:", roc_auc_score(y_test, test_prob)*100)
print("Confusion Matrix:\n", confusion_matrix(y_test, test_pred))

In [None]:
# Step 6 - --- Hyperparameter Tuning for LightGBM ---

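# Define parameter grid for LightGBM.
# Each parameter is fixed to the best value found in an earlier search run,
# so GridSearchCV below only cross-validates and refits this single configuration.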
param_grid = {
    'learning_rate': [0.11197348326497208],
    'n_estimators': [153],
    'num_leaves': [27],
    'max_depth': [8],
    'min_child_samples': [30],
    'subsample': [0.9583523180973899],
    'colsample_bytree': [0.8034866277716478], 
    'reg_alpha': [0.3105330288217051],
    'reg_lambda': [0.10291039503067402]
}

# Initialize LightGBM model
lgb_model = lgb.LGBMClassifier(random_state=100)

# Perform grid search
grid_search = GridSearchCV(estimator=lgb_model, param_grid=param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)


In [None]:
# --- Post-Tuning Analysis and Prediction ---

# Extract the best LightGBM model
best_lgb_model = grid_search.best_estimator_

# Evaluate the tuned model on test data
test_pred_tuned = best_lgb_model.predict(X_test_scaled)
test_prob_tuned = best_lgb_model.predict_proba(X_test_scaled)[:, 1]

In [None]:
# Step 7 - --- Ensemble Method: Combine Tuned LightGBM with Random Forest ---

# Initialize Random Forest model
rf_model = RandomForestClassifier(
    n_estimators=150,         # Moderate number of trees for a small dataset
    max_depth=7,              # Moderate depth to balance complexity
    min_samples_split=5,      # Prevent over-splitting on small data
    min_samples_leaf=2,       # Ensure robust leaves
    max_features='sqrt',      # Use sqrt(n_features) for diversity
    n_jobs=-1,                # Utilize all CPU cores for speed
    random_state=100,         # Same seed as LightGBM for reproducibility
    bootstrap=True            # Train each tree on a bootstrap sample
)

# Create ensemble with soft voting
ensemble_model = VotingClassifier(
    estimators=[('lgb', best_lgb_model), ('rf', rf_model)],
    voting='soft'  # Average probabilities
)

# Fit ensemble on training data
ensemble_model.fit(X_train_scaled, y_train)

# Evaluate ensemble
train_pred_ensemble = ensemble_model.predict(X_train_scaled)
test_pred_ensemble = ensemble_model.predict(X_test_scaled)
test_prob_ensemble = ensemble_model.predict_proba(X_test_scaled)[:, 1]

In [None]:
# --- Results ---

# Display best parameters and score from GridSearchCV
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)


print("\n --- Ensemble (LightGBM + Random Forest) Results: ---")
print("Train Accuracy:", accuracy_score(y_train, train_pred_ensemble)*100,"%")
print("Test Accuracy:", accuracy_score(y_test, test_pred_ensemble)*100,"%")
print("roc_auc_score:", roc_auc_score(y_test, test_prob_ensemble)*100)
#print("Test Classification Report:\n", classification_report(y_test, test_pred_ensemble))
#print("Test ROC AUC:", roc_auc_score(y_test, test_prob_ensemble))
print("Confusion Matrix:\n", confusion_matrix(y_test, test_pred_ensemble))


print("\n --- Tuned LightGBM Results: --- ")
print("Train Accuracy:", accuracy_score(y_train, best_lgb_model.predict(X_train_scaled))*100,"%")
print("Test Accuracy:", accuracy_score(y_test, test_pred_tuned)*100,"%")
print("roc_auc_score:", roc_auc_score(y_test, test_prob_tuned)*100)
#print("Test Classification Report:\n", classification_report(y_test, test_pred_tuned))
#print("Test ROC AUC:", roc_auc_score(y_test, test_prob_tuned))
print("Confusion Matrix:\n", confusion_matrix(y_test, test_pred_tuned))

💭 **Assumptions and Limitations:**  
- **Assumptions**: Historical data is representative of future weather patterns; simulated future data uses historical averages with minor noise.  
- **Limitations**: The model excludes external factors (e.g., geographical influences, extreme weather events) and assumes stationarity in weather trends.
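
One way to probe the first assumption is to evaluate on the most recent days instead of a random split. The sketch below is an illustration only: `chronological_split` is a hypothetical helper that is not part of this notebook, and the `toy` frame is made up.

```python
import pandas as pd

def chronological_split(frame, date_col="date", test_fraction=0.2):
    """Hypothetical helper: hold out the most recent rows as the test set."""
    ordered = frame.sort_values(date_col).reset_index(drop=True)
    n_test = int(round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Toy daily data (hypothetical values)
toy = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=10, freq="D"),
    "humidity": [50, 55, 60, 65, 70, 75, 80, 85, 90, 95],
})

train_part, test_part = chronological_split(toy)

# Every training row precedes every test row, so the evaluation
# mimics predicting days the model has not seen
print("Train ends:", train_part["date"].max().date())
print("Test starts:", test_part["date"].min().date())
```

If accuracy on such a time-ordered holdout is clearly lower than on the stratified random split, the representativeness assumption deserves a closer look.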

In [None]:
# Step 8 - --- Feature Importance Analysis ---

# Use the tuned model: `lgb_model` was re-created in Step 6 and never fitted,
# because GridSearchCV fits clones of the estimator it is given
feature_importance = pd.DataFrame({'Feature': features, 'Importance': best_lgb_model.feature_importances_})
feature_importance = feature_importance.sort_values('Importance', ascending=False)  

plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance, color='red')
plt.title('Feature Importance')
plt.show()

In [None]:
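# Step 9 - --- Probability Output for Next 21 Days ---

# Simulate future data for the next 21 days: recent averages of the base
# weather variables plus small Gaussian noise (a simplification, not a forecast)
last_date = df['date'].max()
future_dates = [last_date + timedelta(days=i) for i in range(1, 22)]
base_cols = ['avg_temperature', 'humidity', 'avg_wind_speed', 'cloud_cover', 'pressure']
future_data = pd.DataFrame({
    col: df[col].tail(21).mean() + np.random.normal(0, df[col].std() / 10, 21)
    for col in base_cols
})

# Derive the engineered features exactly as in training, so that they stay
# consistent with the simulated base values and 'month' is a real calendar month
future_data['temp_humidity_interaction'] = future_data['avg_temperature'] * future_data['humidity']
future_data['cloud_pressure_ratio'] = future_data['cloud_cover'] / (future_data['pressure'] + 1e-6)
future_data['month'] = [d.month for d in future_dates]
future_data_scaled = scaler.transform(future_data[features])

# Predict probabilities with the tuned LightGBM model
# (swap in ensemble_model here to use the ensemble instead)
future_probabilities = best_lgb_model.predict_proba(future_data_scaled)[:, 1]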
future_predictions = pd.DataFrame({
    'Date': future_dates,
    'Rain_Probability': future_probabilities
})

# Add a column for Rain or No Rain based on the threshold
future_predictions['Rain_or_No_Rain'] = future_predictions['Rain_Probability'].apply(lambda x: 'Rain' if x >= 0.5 else 'No Rain')

print("Rain Probabilities for Next 21 Days:\n", future_predictions)


# Visualize predictions
plt.figure(figsize=(10, 6))
plt.plot(future_predictions['Date'], future_predictions['Rain_Probability'], marker='o')
plt.axhline(0.5, color='red', linestyle='--', label='Threshold (0.5)')
plt.title("Rain Probability for Next 21 Days (Tuned LightGBM)")
plt.xlabel("Date")
plt.ylabel("Probability of Rain")
plt.xticks(rotation=45)
plt.legend()
plt.tight_layout()
plt.show()

# Contextual explanation
print("🔴 The probability of rain is predicted for each day. A probability of 0.5 or higher is labelled as rain,\nwhile a lower value is labelled as no rain. The dashed red line marks the 0.5 threshold.")