### 📀 **Capstone Project: AI-Powered Rainfall Prediction for High-Impact Decision Making**

## **1️⃣ Business Understanding & Problem Statement**

### 🌍 **Context & Motivation**
Accurate rainfall prediction is critical in **agriculture, disaster preparedness, and urban planning**. A missed forecast can mean **devastating crop losses, infrastructure failures, or economic disruptions**. Rule-based and simple statistical forecasting methods often fail to capture **complex, non-linear interactions** between meteorological variables.

This project takes a **modern AI-driven approach**, using **machine learning** to build a binary classification model that predicts **rainfall occurrence** from daily meteorological measurements.

### 💪 **Why This Matters**
- **Farmers & Agribusiness**: Optimizing irrigation schedules, reducing crop loss risk.
- **Disaster Management**: Enhancing flood forecasting & emergency preparedness.
- **Urban Infrastructure**: Assisting city planners in drainage & water resource management.

### 🔖 **Project Challenge & Competitive Edge**
- Build a **state-of-the-art predictive model** using real-world historical weather data.
- Ensure the model **outperforms traditional methods** and ranks competitively in **Kaggle’s leaderboard-driven environment**.
- Demonstrate a **scalable, real-world AI solution** with potential deployment applications beyond this competition.

---

## **2️⃣ Project Objectives & Key Performance Indicators (KPIs)**

### 🎯 **Primary Objective**
- Develop a **high-accuracy machine learning model** to predict **rainfall occurrence** (Binary Classification: **Rain = 1, No Rain = 0**).

### 📈 **Secondary Objectives**
1. **Exploratory Data Analysis (EDA)**: Discover underlying weather patterns that influence rainfall.
2. **Feature Engineering**: Enhance the dataset with high-impact variables for model optimization.
3. **Model Selection & Tuning**: Implement and benchmark various **machine learning algorithms**.
4. **Performance Optimization**: Achieve **≥97% accuracy** and secure a **Top 10 Kaggle leaderboard placement**.
5. **Academic & Industry Impact**: Showcase a robust, end-to-end AI workflow for **real-world adoption**.
6. **Reproducibility & Documentation**: Ensure the project is well-documented, easy to replicate, and meets industry best practices.

---

## **3️⃣ Data Understanding & Competitive Dataset Analysis**

### 📚 **Dataset Source & Overview**
This project is based on **Kaggle’s Playground Series - S5E3 competition dataset**, consisting of **historical meteorological data** designed to challenge participants in predictive modeling.

### 🔄 **Dataset Breakdown**
- **Train Dataset (`train.csv`)**: **2,190** samples with **13 columns** (an `id`, 11 input features, and the `rainfall` target).
- **Test Dataset (`test.csv`)**: **730** samples with **12 columns** (excludes the `rainfall` target variable).
- **Submission File (`sample_submission.csv`)**: Kaggle’s submission format for predicted outputs.

### 🎯 **Feature Engineering Considerations**
| **Feature**       | **Description & Significance**  |
|------------------|--------------------------------|
| `day`           | Sequential identifier (potential time-series dependencies). |
| `pressure`      | Atmospheric pressure, influencing rainfall patterns. |
| `maxtemp`      | Maximum recorded temperature, a potential indicator of precipitation likelihood. |
| `temparature`   | Average recorded temperature, linked to evaporation and condensation cycles. The column name is misspelled in the dataset and is renamed to `temperature` during preprocessing. |
| `mintemp`      | Minimum temperature, useful for analyzing dew point variations. |
| `dewpoint`      | Key metric for moisture content in the air. |
| `humidity`      | Relative humidity (%), highly correlated with rainfall probability. |
| `cloud`         | Cloud cover percentage (%), a strong predictor for precipitation. |
| `sunshine`      | Total hours of sunshine, typically inversely related to rainfall. |
| `winddirection` | Wind direction, impacting weather system movements. |
| `windspeed`     | Wind speed, affecting cloud formation and storm intensity. |
| `rainfall`      | **Target Variable** (1 = Rain, 0 = No Rain). |

### 🔬 **Initial Observations & Challenges**
- **All features are numerical**, simplifying preprocessing.
- **Potential Class Imbalance**: May require resampling techniques (e.g., SMOTE, undersampling) if the class distribution turns out to be skewed.
- **Feature Correlation Analysis**: High correlation expected among `humidity`, `dewpoint`, and `cloud`.
- **Outlier Detection**: Potential extreme values in `pressure` and `windspeed`.
- **Missing Values**: 1 missing value in `winddirection` in the test set (the training set has none), which will be imputed.
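
The class balance and correlation checks listed above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's actual EDA: the `demo` frame and the way its columns are generated are assumptions made only so the snippet runs on its own. On the real data, `demo` would be replaced by the loaded training frame.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train.csv (assumption: values are made up for illustration)
rng = np.random.default_rng(0)
n = 200
humidity = rng.uniform(40, 100, n)
demo = pd.DataFrame({
    "humidity": humidity,
    "dewpoint": 0.2 * humidity + rng.normal(0, 1, n),
    "cloud": humidity + rng.normal(0, 10, n),
    "rainfall": (humidity > 60).astype(int),
})

# Class balance: share of rows for each target value
class_share = demo["rainfall"].value_counts(normalize=True)

# Pairwise Pearson correlations among the moisture-related features
corr = demo[["humidity", "dewpoint", "cloud"]].corr()

print(class_share)
print(corr.round(2))
```

If one class dominates `class_share`, that is the signal to consider resampling; strongly correlated pairs in `corr` are candidates for dropping or combining.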

---

### 🚀 **Next Steps & Strategic Roadmap**

✅ **Step 1: Exploratory Data Analysis (EDA)**
- Visualize distributions, relationships, and correlations.
- Identify missing values, feature importance, and outliers.

✅ **Step 2: Feature Engineering & Data Preprocessing**
- Create derived features (e.g., **humidity-temperature index, pressure deltas**).
- Normalize & scale features for improved model performance.
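
One possible shape for the derived features named in Step 2 is sketched below. The formulas are assumptions, since the plan does not define them: `humid_temp_index` is taken here as the humidity fraction times temperature, and `pressure_delta` as the change in pressure from the previous row, which only makes sense if rows are in chronological order. The function name `add_derived_features` is hypothetical. The sample values are the first five training rows shown in this notebook.

```python
import pandas as pd

# First five training rows (values copied from the train.csv preview)
demo = pd.DataFrame({
    "day": [1, 2, 3, 4, 5],
    "pressure": [1017.4, 1019.5, 1024.1, 1013.4, 1021.8],
    "temparature": [20.6, 16.9, 16.1, 17.8, 18.4],
    "humidity": [87.0, 95.0, 75.0, 95.0, 52.0],
})

def add_derived_features(df):
    """Return a copy of df with two illustrative derived columns."""
    out = df.copy()
    # Hypothetical index: relative humidity as a fraction, times temperature
    out["humid_temp_index"] = out["humidity"] / 100.0 * out["temparature"]
    # Row-over-row pressure change; the first row has no predecessor, so use 0
    out["pressure_delta"] = out["pressure"].diff().fillna(0.0)
    return out

features = add_derived_features(demo)
print(features[["day", "humid_temp_index", "pressure_delta"]])
```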

✅ **Step 3: Baseline Model Implementation**
- Train **Logistic Regression, Decision Trees, and Random Forest** as benchmarks.
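
A benchmark loop for these three baselines might look like the sketch below. It runs on synthetic data from `make_classification` rather than the competition data, and the hyperparameters (`max_depth=4`, `n_estimators=50`, 5 folds) are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: stands in for the real features)
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=42)

baselines = {
    # Logistic regression benefits from scaling, so wrap it in a pipeline
    "logistic_regression": Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
baseline_auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in baselines.items()
}

for name, auc in baseline_auc.items():
    print(f"{name}: mean ROC AUC = {auc:.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, which avoids leaking validation statistics into the model.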

✅ **Step 4: Advanced Model Development & Hyperparameter Tuning**
- Implement **XGBoost, LightGBM, and CatBoost**.
- Optimize using **GridSearchCV, Bayesian Optimization, and Optuna**.
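
The tuning step can be illustrated with `GridSearchCV`. To keep the sketch free of extra dependencies, scikit-learn's `GradientBoostingClassifier` is used here as a stand-in for XGBoost, LightGBM, and CatBoost; the grid values and the synthetic data are assumptions, and a real search would likely cover a wider range.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption: stands in for the preprocessed training set)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

# Small illustrative grid: 2 x 2 x 1 = 4 candidate settings
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [2, 3],
    "learning_rate": [0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # rank candidates by cross-validated ROC AUC
    cv=3,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```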

✅ **Step 5: Model Evaluation & Leaderboard Strategy**
- Evaluate with **AUC-ROC, Precision-Recall curves, and Cross-Validation** to guide model selection.
- Apply **ensemble methods such as Stacking and Blending** to improve leaderboard performance.
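
As a rough sketch of the stacking idea, scikit-learn's `StackingClassifier` trains base models and feeds their out-of-fold probabilities into a final estimator. The choice of base models, the synthetic data, and the 75/25 split below are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption: stands in for the preprocessed training set)
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=1)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=1)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",  # pass base-model probabilities to the meta-model
    cv=5,                          # out-of-fold predictions for the meta-model
)
stack.fit(X_train, y_train)

# ROC AUC is computed on predicted probabilities, not hard labels
valid_proba = stack.predict_proba(X_valid)[:, 1]
valid_auc = roc_auc_score(y_valid, valid_proba)
print(f"Validation ROC AUC: {valid_auc:.3f}")
```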

✅ **Step 6: Reproducibility & Documentation**
- **Environment Setup**: Create `requirements.txt` for dependencies.
- **Code Modularity**: Structure notebooks for clarity.
- **README Optimization**: Clearly document project workflow.
- **GitHub Repository Compliance**: Ensure README includes **elevator pitch, dataset details, implementation steps, and model performance**.

✅ **Step 7: Final Submission & Academic Presentation**
- Optimize final model selection and prepare Kaggle submissions.
- Document findings in **Jupyter notebooks & GitHub README** for industry-grade presentation.
- Prepare for **capstone defense** with clear justifications for model choices.

---

### 🏆 **Conclusion: The Road to Kaggle & Academic Excellence**
This project represents a **cutting-edge application of AI in meteorology**, bridging academia and industry by showcasing **practical, high-impact machine learning workflows**. Through rigorous **data exploration, feature engineering, model optimization, and leaderboard analysis**, we aim to achieve a **Top 10 Kaggle ranking** while contributing **meaningful insights to real-world weather forecasting applications**.

🔗 **GitHub Repository (Work in Progress)**: [https://github.com/Otim135/PHASE_5_CAPSTONE_PROJECT]

🚀 **Next Up:** EDA & Feature Engineering! 🔍📊


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [7]:
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

In [9]:
import os
print(os.getcwd())


/Users/mac/Documents/Phase 5 Capstone Project/PHASE_5_CAPSTONE_PROJECT/Notebooks


In [10]:
df_train = pd.read_csv('../Data/train.csv')
df_train.head(10)

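|   | id | day | pressure | maxtemp | temparature | mintemp | dewpoint | humidity | cloud | sunshine | winddirection | windspeed | rainfall |
|---|----|-----|----------|---------|-------------|---------|----------|----------|-------|----------|---------------|-----------|----------|
| 0 | 0 | 1 | 1017.4 | 21.2 | 20.6 | 19.9 | 19.4 | 87.0 | 88.0 | 1.1 | 60.0 | 17.2 | 1 |
| 1 | 1 | 2 | 1019.5 | 16.2 | 16.9 | 15.8 | 15.4 | 95.0 | 91.0 | 0.0 | 50.0 | 21.9 | 1 |
| 2 | 2 | 3 | 1024.1 | 19.4 | 16.1 | 14.6 | 9.3 | 75.0 | 47.0 | 8.3 | 70.0 | 18.1 | 1 |
| 3 | 3 | 4 | 1013.4 | 18.1 | 17.8 | 16.9 | 16.8 | 95.0 | 95.0 | 0.0 | 60.0 | 35.6 | 1 |
| 4 | 4 | 5 | 1021.8 | 21.3 | 18.4 | 15.2 | 9.6 | 52.0 | 45.0 | 3.6 | 40.0 | 24.8 | 0 |
| 5 | 5 | 6 | 1022.7 | 20.6 | 18.6 | 16.5 | 12.5 | 79.0 | 81.0 | 0.0 | 20.0 | 15.7 | 1 |
| 6 | 6 | 7 | 1022.8 | 19.5 | 18.4 | 15.3 | 11.3 | 56.0 | 46.0 | 7.6 | 20.0 | 28.4 | 0 |
| 7 | 7 | 8 | 1019.7 | 15.8 | 13.6 | 12.7 | 11.8 | 96.0 | 100.0 | 0.0 | 50.0 | 52.8 | 1 |
| 8 | 8 | 9 | 1017.4 | 17.6 | 16.5 | 15.6 | 12.5 | 86.0 | 100.0 | 0.0 | 50.0 | 37.5 | 1 |
| 9 | 9 | 10 | 1025.4 | 16.5 | 14.4 | 12.0 | 8.6 | 77.0 | 84.0 | 1.0 | 50.0 | 38.3 | 0 |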


In [12]:
df_test = pd.read_csv('../Data/test.csv')
df_test.head(10)

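|   | id | day | pressure | maxtemp | temparature | mintemp | dewpoint | humidity | cloud | sunshine | winddirection | windspeed |
|---|----|-----|----------|---------|-------------|---------|----------|----------|-------|----------|---------------|-----------|
| 0 | 2190 | 1 | 1019.5 | 17.5 | 15.8 | 12.7 | 14.9 | 96.0 | 99.0 | 0.0 | 50.0 | 24.3 |
| 1 | 2191 | 2 | 1016.5 | 17.5 | 16.5 | 15.8 | 15.1 | 97.0 | 99.0 | 0.0 | 50.0 | 35.3 |
| 2 | 2192 | 3 | 1023.9 | 11.2 | 10.4 | 9.4 | 8.9 | 86.0 | 96.0 | 0.0 | 40.0 | 16.9 |
| 3 | 2193 | 4 | 1022.9 | 20.6 | 17.3 | 15.2 | 9.5 | 75.0 | 45.0 | 7.1 | 20.0 | 50.6 |
| 4 | 2194 | 5 | 1022.2 | 16.1 | 13.8 | 6.4 | 4.3 | 68.0 | 49.0 | 9.2 | 20.0 | 19.4 |
| 5 | 2195 | 6 | 1027.1 | 15.6 | 12.6 | 11.5 | 9.0 | 76.0 | 94.0 | 0.0 | 20.0 | 41.4 |
| 6 | 2196 | 7 | 1022.6 | 15.5 | 13.7 | 10.7 | 11.8 | 79.0 | 95.0 | 0.0 | 20.0 | 43.1 |
| 7 | 2197 | 8 | 1013.5 | 20.5 | 16.2 | 15.2 | 13.1 | 94.0 | 93.0 | 0.2 | 70.0 | 41.3 |
| 8 | 2198 | 9 | 1021.3 | 16.3 | 13.2 | 11.3 | 10.8 | 85.0 | 99.0 | 0.1 | 20.0 | 34.0 |
| 9 | 2199 | 10 | 1026.1 | 10.4 | 8.5 | 7.0 | 3.1 | 69.0 | 88.0 | 0.0 | 20.0 | 26.4 |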


In [13]:
# Check data types
print(df_train.info())

# Summary statistics
print(df_train.describe())

# Check for missing values
print(df_train.isnull().sum())
print(df_test.isnull().sum())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2190 entries, 0 to 2189
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             2190 non-null   int64  
 1   day            2190 non-null   int64  
 2   pressure       2190 non-null   float64
 3   maxtemp        2190 non-null   float64
 4   temparature    2190 non-null   float64
 5   mintemp        2190 non-null   float64
 6   dewpoint       2190 non-null   float64
 7   humidity       2190 non-null   float64
 8   cloud          2190 non-null   float64
 9   sunshine       2190 non-null   float64
 10  winddirection  2190 non-null   float64
 11  windspeed      2190 non-null   float64
 12  rainfall       2190 non-null   int64  
dtypes: float64(10), int64(3)
memory usage: 222.5 KB
None
                id          day     pressure      maxtemp  temparature  \
count  2190.000000  2190.000000  2190.000000  2190.000000  2190.000000   
mean   1094.500000   179.94

In [14]:
# preprocessing.py

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt

def load_dataset(train_path, test_path):
    df_train = pd.read_csv(train_path)
    df_test = pd.read_csv(test_path)
    return df_train, df_test

def rename_columns(df_train, df_test):
    df_train.rename(columns={'temparature': 'temperature'}, inplace=True)
    df_test.rename(columns={'temparature': 'temperature'}, inplace=True)

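def handle_missing_values(df_test):
    # Fill the missing wind direction in the test set with that column's median (modifies df_test in place)
    df_test['winddirection'] = df_test['winddirection'].fillna(df_test['winddirection'].median())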

def create_cyclical_features(df_train, df_test):
    # Map day-of-year onto a circle so that day 365 and day 1 end up close together
    df_train['day_sin'] = np.sin(2 * np.pi * df_train['day'] / 365)
    df_train['day_cos'] = np.cos(2 * np.pi * df_train['day'] / 365)
    df_test['day_sin'] = np.sin(2 * np.pi * df_test['day'] / 365)
    df_test['day_cos'] = np.cos(2 * np.pi * df_test['day'] / 365)
    # The raw day number is no longer needed once the sin/cos pair exists
    df_train.drop(columns=['day'], inplace=True)
    df_test.drop(columns=['day'], inplace=True)

def create_temp_range(df_train, df_test):
    df_train['temp_range'] = df_train['maxtemp'] - df_train['mintemp']
    df_test['temp_range'] = df_test['maxtemp'] - df_test['mintemp']

def detect_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers

def plot_outliers(df, column):
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df[column])
    plt.title(f'Boxplot of {column}')
    plt.show()

def scale_continuous_features(df_train, df_test, continuous_features):
    scaler = MinMaxScaler()
    df_train[continuous_features] = scaler.fit_transform(df_train[continuous_features])
    df_test[continuous_features] = scaler.transform(df_test[continuous_features])
    return scaler

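def encode_categorical(df_train, df_test, column_name, target='rainfall'):
    # One-hot encode the column in both frames
    df_train = pd.get_dummies(df_train, columns=[column_name], prefix=column_name)
    df_test = pd.get_dummies(df_test, columns=[column_name], prefix=column_name)
    # Align test columns to the train feature columns, excluding the target
    # (the original set difference also copied the target into df_test as zeros).
    # Missing dummies are added as 0, test-only dummies are dropped, and order matches train.
    feature_cols = [col for col in df_train.columns if col != target]
    df_test = df_test.reindex(columns=feature_cols, fill_value=0)
    return df_train, df_test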

def run_full_preprocessing(train_path, test_path):
    df_train, df_test = load_dataset(train_path, test_path)
    rename_columns(df_train, df_test)
    handle_missing_values(df_test)
    create_cyclical_features(df_train, df_test)
    create_temp_range(df_train, df_test)

    for col in ['windspeed', 'temperature']:
        print(f"Outliers in {col}:\n", detect_outliers(df_train, col))
        plot_outliers(df_train, col)

    scaler = scale_continuous_features(df_train, df_test, ['windspeed', 'temperature', 'maxtemp', 'mintemp', 'humidity'])
    df_train, df_test = encode_categorical(df_train, df_test, 'winddirection')

    print("Preprocessing complete! Ready for model training 🚀")
    return df_train, df_test, scaler
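
To see why `create_cyclical_features` replaces `day` with a sine/cosine pair, the sketch below measures how far apart two days land after encoding. The helper names `encode_day` and `cyclical_distance` are hypothetical and not part of the preprocessing module; the period of 365 matches the value used above.

```python
import numpy as np

def encode_day(day, period=365):
    """Return the (sin, cos) encoding of a day-of-year value."""
    angle = 2 * np.pi * np.asarray(day, dtype=float) / period
    return np.sin(angle), np.cos(angle)

def cyclical_distance(day_a, day_b, period=365):
    """Euclidean distance between two days in the (sin, cos) plane."""
    sin_a, cos_a = encode_day(day_a, period)
    sin_b, cos_b = encode_day(day_b, period)
    return float(np.hypot(sin_a - sin_b, cos_a - cos_b))

# 31 December and 1 January are 364 apart as raw numbers but adjacent on the circle
print(cyclical_distance(365, 1))
# Days roughly half a year apart sit on opposite sides of the circle
print(cyclical_distance(1, 183))
```

With the raw `day` column, a model would treat day 365 and day 1 as maximally different; the encoded pair keeps them close, which better reflects seasonality.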
