# **Milestone 2: Australian Weather Prediction for Agricultural Optimization**

Farhan Falahaqil Rafi

FTDS-003-BSD

# **Introduction**

## Background:
Australia's diverse climate ranges from tropical in the north to temperate in the south. The agricultural sector depends heavily on weather conditions, particularly temperature, rainfall, and humidity. Reliable weather predictions are crucial for farmers planning sowing, irrigation, and harvesting activities. Recent unpredictable weather patterns linked to climate change have increased the demand for more accurate and localized weather forecasting.


## Objective:
The main objective of this project is to develop a Machine Learning model that can predict weather conditions, specifically rainfall, in various locations across Australia. The predictions will help farmers and agricultural businesses in decision-making processes, optimizing crop yield, and minimizing losses due to unforeseen adverse weather conditions. By providing an accurate forecast, the model aims to contribute towards a more sustainable and efficient agricultural sector in Australia.

## Dataset Overview:
The dataset chosen for this project is 'weatherAUS.csv', which contains about 10 years of daily weather observations from numerous Australian weather stations. The dataset includes features such as date, location, temperature, humidity, pressure, and wind speed, along with the target variable 'RainTomorrow', which indicates whether it rained the next day.

## Approach:
The project will involve several stages, starting with exploratory data analysis to understand the patterns and characteristics of the weather data. Feature engineering will then prepare the data for modeling. Several Supervised Learning models, including but not limited to Decision Trees, Random Forest, and Gradient Boosting, will be trained to predict the likelihood of rain. The models will be compared on appropriate metrics, and the best-performing model will be optimized further through hyperparameter tuning. Finally, the chosen model will be deployed to provide easy access for end-users, primarily farmers and agricultural planners.

## Expected Impact:
The successful completion of this project is expected to provide a robust tool for accurate weather prediction, thereby aiding the agricultural sector in better planning and resource allocation. This will not only increase efficiency and yield but also contribute to the economic stability of the agricultural community and related industries in Australia.

# **Import Libraries**

In [None]:
# Data Manipulation Libraries
import numpy as np
import pandas as pd

# Data Analysis and Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import phik

# Preprocessing and Feature Engineering Libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from feature_engine.outliers import Winsorizer
from scipy.stats import zscore

# Machine Learning Model Libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Model Evaluation Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Hyperparameter Tuning
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Handling Imbalanced Dataset
from imblearn.over_sampling import SMOTE

# Pipeline Libraries
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Model Persistence
import pickle

# Miscellaneous Utilities
import warnings
warnings.filterwarnings('ignore')

# Set seed for reproducibility
np.random.seed(42)

# **Data Loading**

### Size of the Dataset

In [None]:
# load the dataset
df = pd.read_csv('weatherAUS.csv')

# display the first few rows of the dataframe
df.head(10)

In [None]:
# Display the size of the dataset
print("\nSize of the dataset (rows, columns):", df.shape)

The dataset contains 145,460 entries and 23 columns. Each entry corresponds to daily weather observations at various locations across Australia.

### Dataframe Information

In [None]:
# Display a concise summary of the dataframe, including the number of non-null values in each column
print("\nDataframe information:")
df.info()

- **Entries and Columns**: The dataframe consists of 145,460 entries and 23 columns (22 features plus the target 'RainTomorrow').
- **Feature Types**: The features are a mix of numerical (float64) and categorical (object) types.
- **Non-Null Counts**: Some columns have missing values; for instance, 'Sunshine' and 'Evaporation' have a significant number of missing entries.
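
As a quick way to see the numerical/categorical split described above, the columns could be separated by dtype. The sketch below uses a tiny made-up frame (`sample`, a hypothetical stand-in for `df`) rather than the real file, so the resulting lists are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for df; values are made up for illustration
sample = pd.DataFrame({
    "Location": ["Albury", "Sydney", "Darwin"],
    "MinTemp": [13.4, 7.4, np.nan],
    "Rainfall": [0.6, 0.0, 1.2],
    "RainTomorrow": ["No", "No", "Yes"],
})

# Split column names by dtype: numeric columns versus everything else
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

Applied to the full dataframe, the same two calls would give the column lists that later preprocessing steps typically need.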

### Statistical Summary of Numerical Features

In [None]:
# Display statistical summary of numerical features
print("\nStatistical Summary of Numerical Features:")
df.describe()

- The statistical summary includes count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum for numerical features.
- Notable observations include the range of temperatures, rainfall, and wind speed, indicating the climate's variability across different regions and timeframes.

### Checking for Null Values

In [None]:
# Check for the presence of null values
print("\nChecking for null values:")
print(df.isnull().sum())

- Null value counts in each column indicate missing data. For example, 'Evaporation' and 'Sunshine' have the highest number of missing values, followed by 'Cloud9am' and 'Cloud3pm'.
- Handling these missing values will be a crucial step in the data preprocessing stage to ensure the quality of our machine learning model.

### Unique Values in 'RainTomorrow'

In [None]:
# Print the unique values in 'RainTomorrow', our target variable
print("\nUnique values in 'RainTomorrow':")
print(df['RainTomorrow'].unique())

- The target variable 'RainTomorrow' indicates if it rained the next day with 'Yes', 'No', or nan (missing value).
- Understanding the distribution of this variable is essential for predicting rainfall and evaluating model performance.
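
To look at that distribution directly, the class shares can be computed with `value_counts`. This is a minimal sketch on a made-up series (`target` is hypothetical); on the real data it would be `df['RainTomorrow']`.

```python
import numpy as np
import pandas as pd

# Hypothetical target column; the real proportions come from df['RainTomorrow']
target = pd.Series(
    ["No", "No", "Yes", "No", np.nan, "Yes", "No", "No"], name="RainTomorrow"
)

# Share of each class, keeping missing labels visible with dropna=False
class_share = target.value_counts(normalize=True, dropna=False)
print(class_share)

# Rows with a missing target cannot be used for supervised training
n_missing_target = int(target.isna().sum())
print("Rows with missing target:", n_missing_target)
```

A strongly uneven split between 'No' and 'Yes' would be a reason to look beyond accuracy when evaluating models.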

# **Exploratory Data Analysis**

## Data Distribution Analysis

In [None]:
# set the aesthetic style of the plots
sns.set_style('whitegrid')

# plot distribution for each feature in the dataset
for column in df.columns:
    plt.figure(figsize=(8,4))
    if len(df[column].unique()) > 10:
        sns.histplot(df[column], kde=True, color='skyblue')
        plt.title(f'Distribution of {column}')
    else:
        sns.countplot(x=column, data=df, palette='Set2')
        plt.title(f'Count of different classes in {column}')
    plt.show()

In examining the distribution plots for our key weather variables, I've noticed several interesting patterns:

- **MinTemp (Minimum Temperature)**: The distribution is approximately normal but leans a bit towards the cooler side. This makes sense as it reflects a range of minimum temperatures, with a majority clustering near the average value.

- **MaxTemp (Maximum Temperature)**: Similar to MinTemp, this graph is also somewhat bell-shaped but tilts towards warmer temperatures. It's a clear indication of the diverse range of maximum temperatures we see, which is expected given Australia's varied climate.

- **Rainfall**: This one's intriguing as it's heavily right-skewed (positively skewed): most of the mass sits at or near zero, with a long tail towards large values. Days with little to no rainfall are far more common than days with heavy rainfall, which suggests that high rainfall is an exception rather than the norm.

- **Humidity9am**: Here, we see a slight skew to the left, pointing out that higher humidity levels in the morning are more frequently observed.

- **Humidity3pm**: The afternoon humidity tells a different story. The spread is wider, showing more variability in humidity levels later in the day, with a tendency towards drier conditions.

- **Pressure9am and Pressure3pm**: Both these variables exhibit bell-shaped distributions and are centered around similar ranges, indicative of relatively stable atmospheric pressure with minor fluctuations.

- **WindGustSpeed**: This plot is right-skewed, which tells me that days with lower wind gust speeds are much more common, but there are occasional instances of significantly higher speeds.

Through these distributions, I'm getting a clearer picture of our weather patterns. For instance, the skewness in the Rainfall data highlights the rarity of heavy rainfall, while the temperature and pressure distributions suggest a more consistent range of values. This analysis is laying a solid foundation for understanding the climatic behaviors in our dataset.

## Correlation Analysis

In [None]:
# calculating the phi_k correlation matrix
phi_k_correlation = df.phik_matrix()

# setting up the matplotlib figure
plt.figure(figsize=(12, 10))

# drawing the annotated heatmap
sns.heatmap(phi_k_correlation, annot=True, fmt=".2f", linewidths=.5, cmap='coolwarm')

# adding the title
plt.title('Phi_k Correlation Matrix Heatmap')

# show the plot
plt.show()

Analyzing the correlation matrix of our dataset has revealed several interesting relationships between different weather variables, which are crucial for understanding the dynamics of weather patterns in Australia.

- **Temperature Variables (MinTemp, MaxTemp, Temp9am, Temp3pm)**: These variables show a strong positive correlation with each other. This is expected as they are all measures of temperature at different times of the day. The high correlation between MaxTemp and Temp3pm (0.99) is particularly notable, suggesting that the maximum temperature of the day is a good predictor of the temperature in the late afternoon.

- **Humidity Variables (Humidity9am, Humidity3pm)**: There's a significant positive correlation between morning and afternoon humidity levels. However, Humidity3pm shows a stronger correlation with RainTomorrow (0.64) compared to Humidity9am (0.35), indicating that afternoon humidity levels might be a more critical predictor for rain the next day.

- **Pressure Variables (Pressure9am, Pressure3pm)**: These two exhibit a very high positive correlation, which is logical given they represent atmospheric pressure at different times of the day. Both these variables also show a moderate negative correlation with temperature variables, highlighting the inverse relationship between atmospheric pressure and temperature.

- **Wind Variables (WindGustSpeed, WindSpeed9am, WindSpeed3pm)**: WindGustSpeed shows a significant positive correlation with both WindSpeed9am and WindSpeed3pm. This indicates that higher gust speeds are generally accompanied by higher wind speeds throughout the day.

- **Sunshine**: Sunshine shows a strong negative correlation with Cloud9am and Cloud3pm, which is expected as more clouds mean less sunshine. Interestingly, Sunshine also has a significant negative correlation with RainTomorrow (-0.58), suggesting that less sunny days have a higher likelihood of rain the next day.

- **Cloud Variables (Cloud9am, Cloud3pm)**: These two variables are strongly correlated, indicating that cloud cover tends to be consistent throughout the day. Both variables also show a positive correlation with RainTomorrow, with Cloud3pm having a slightly stronger relationship.

Focusing on the target variable, ***'RainTomorrow'***, the correlation analysis provides crucial insights into how various weather features might influence the likelihood of rain the following day. Here's a detailed interpretation:

1. **Temperature Variables (MinTemp, MaxTemp, Temp9am, Temp3pm)**: 
   - These show a negative correlation with 'RainTomorrow', particularly Temp3pm (correlation: -0.26). This suggests that higher temperatures during the day might be associated with a lower likelihood of rain the next day. 

2. **Rainfall & RainToday**: 
   - Both 'Rainfall' and 'RainToday' exhibit a positive correlation with 'RainTomorrow' (correlations: 0.12 and 0.47, respectively). This indicates that rain today, whether measured by the 'Rainfall' amount or flagged by 'RainToday', is associated with a higher probability of rain the next day.

3. **Humidity Variables (Humidity9am, Humidity3pm)**: 
   - Humidity, especially in the afternoon (Humidity3pm), shows a strong positive correlation with 'RainTomorrow' (correlation: 0.64). This highlights humidity's role as a significant predictor, where higher humidity levels might increase the chances of rain the next day.

4. **Pressure Variables (Pressure9am, Pressure3pm)**: 
   - Both morning and afternoon pressures exhibit a negative correlation with 'RainTomorrow' (correlations: -0.33 and -0.31, respectively), suggesting that higher atmospheric pressure could be associated with lower chances of rain.

5. **Wind Variables (WindGustSpeed, WindSpeed9am, WindSpeed3pm)**: 
   - WindGustSpeed shows a modest positive correlation with 'RainTomorrow' (correlation: 0.31), indicating that days with stronger wind gusts might have a slightly increased likelihood of rain.

6. **Sunshine**: 
   - Sunshine has a notable negative correlation with 'RainTomorrow' (correlation: -0.58), implying that fewer hours of sunshine are associated with a higher likelihood of rain the following day.

7. **Cloud Variables (Cloud9am, Cloud3pm)**: 
   - Both Cloud9am and Cloud3pm show positive correlations with 'RainTomorrow' (correlations: 0.42 and 0.52, respectively), with Cloud3pm being more strongly correlated. This suggests that increased cloud cover might be a good predictor of rain.

8. **Evaporation**: 
   - Evaporation has a low negative correlation with 'RainTomorrow' (correlation: -0.04), indicating it might not be a strong predictor for rain.

In summary, the correlation analysis highlights which weather features are most strongly associated with rain on the following day: Humidity3pm, Cloud3pm, Sunshine, and RainToday stand out. Together with the strong interrelationships among the temperature, humidity, pressure, and cloud cover variables, these findings give a clearer picture of the factors associated with rainfall. They also guide feature selection and engineering, for example by flagging highly correlated pairs such as MaxTemp and Temp3pm as potentially redundant inputs.
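
One caveat when reading directions from this analysis: the phi_k coefficient is bounded between 0 and 1 and carries no sign, so whether a relationship is positive or negative has to be checked with a signed coefficient. A minimal sketch of such a check, assuming a 0/1 encoding of the target and using synthetic data (the frame `toy` and its values are made up, not taken from `weatherAUS.csv`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in for the weather data (made-up values)
n = 500
humidity_3pm = rng.uniform(10, 100, size=n)
sunshine = 12 - 0.1 * humidity_3pm + rng.normal(0, 1, size=n)
# Rain is made more likely when humidity is high, purely for illustration
rain_tomorrow = np.where(humidity_3pm + rng.normal(0, 15, size=n) > 70, "Yes", "No")

toy = pd.DataFrame({
    "Humidity3pm": humidity_3pm,
    "Sunshine": sunshine,
    "RainTomorrow": rain_tomorrow,
})

# Encode the target as 0/1 so that a signed coefficient can be computed
toy["RainTomorrowFlag"] = (toy["RainTomorrow"] == "Yes").astype(int)

# Pearson correlation against the 0/1 flag keeps the sign of each relationship
signed_corr = (
    toy[["Humidity3pm", "Sunshine", "RainTomorrowFlag"]]
    .corr()["RainTomorrowFlag"]
    .drop("RainTomorrowFlag")
)
print(signed_corr)
```

The magnitudes from a signed coefficient will generally differ from the phi_k values in the heatmap, since the two measure association differently; only the sign is being borrowed here.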

## Trend Analysis

### Temporal Feature Extraction

In [None]:
# Convert the 'Date' column to datetime format and extract year and month for trend analysis
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

### Analysis and Visualization

In [None]:
# Calculating annual and monthly averages for key variables
annual_trends = df.groupby('Year')[['MinTemp', 'MaxTemp', 'Rainfall', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm']].mean()
monthly_trends = df.groupby('Month')[['MinTemp', 'MaxTemp', 'Rainfall', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm']].mean()

In [None]:
# Displaying annual trends
annual_trends

In [None]:
# Displaying monthly trends
monthly_trends

In [None]:
# Set up the figure and axes for subplots
fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(20, 10))

# Plotting annual trends
annual_trends[['MinTemp', 'MaxTemp']].plot(ax=axes[0,0], title='Annual Avg Temperature')
annual_trends['Rainfall'].plot(ax=axes[0,1], title='Annual Avg Rainfall')
annual_trends[['Humidity9am', 'Humidity3pm']].plot(ax=axes[0,2], title='Annual Avg Humidity')
annual_trends[['Pressure9am', 'Pressure3pm']].plot(ax=axes[0,3], title='Annual Avg Pressure')

# Plotting monthly trends
monthly_trends[['MinTemp', 'MaxTemp']].plot(ax=axes[1,0], title='Monthly Avg Temperature')
monthly_trends['Rainfall'].plot(ax=axes[1,1], title='Monthly Avg Rainfall')
monthly_trends[['Humidity9am', 'Humidity3pm']].plot(ax=axes[1,2], title='Monthly Avg Humidity')
monthly_trends[['Pressure9am', 'Pressure3pm']].plot(ax=axes[1,3], title='Monthly Avg Pressure')

# Enhancing layout
plt.tight_layout()
plt.show()

Reflecting on the annual and monthly trends in our weather dataset, here are my observations:

**Annual Trends:**
- **Temperature (MinTemp and MaxTemp)**: Reviewing both minimum and maximum temperatures over the years, I noticed some fluctuation. However, there's no evident long-term trend indicating a gradual increase or decrease in temperatures throughout the dataset's timeframe.
  
- **Rainfall**: The annual rainfall data also presented variations from year to year. While some years showed higher average rainfall, there wasn't a clear, consistent trend of increasing or decreasing rainfall over the years.

- **Humidity (Morning and Afternoon)**: Analyzing humidity levels for both morning and afternoon, I observed yearly variances. Yet, similar to temperature and rainfall, there wasn't an apparent trend that pointed to a gradual increase or decrease in humidity levels over the years.

- **Pressure (Morning and Afternoon)**: The atmospheric pressure readings, both in the morning and afternoon, exhibited some yearly fluctuations. However, these readings remained relatively consistent, showing no significant long-term changes.
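
The "no long-term trend" reading of the annual averages could be backed by a simple least-squares slope. The sketch below assumes synthetic daily records (`toy`, hypothetical) in place of `df`; it also reports how many observations sit behind each annual mean, since years with partial coverage can distort the picture.

```python
import numpy as np
import pandas as pd

# Hypothetical daily records; in the notebook this would be df with its 'Year' column
rng = np.random.default_rng(0)
dates = pd.date_range("2009-01-01", "2016-12-31", freq="D")
toy = pd.DataFrame({
    "Year": dates.year,
    "MaxTemp": 23 + rng.normal(0, 5, size=len(dates)),
})

# Annual mean plus the number of observations behind each mean
annual = toy.groupby("Year")["MaxTemp"].agg(["mean", "count"])

# Centre the years so the straight-line fit is numerically well behaved
years_centred = annual.index.to_numpy(dtype=float) - annual.index.to_numpy(dtype=float).mean()

# Least-squares slope of the annual mean against the year (degrees per year)
slope, intercept = np.polyfit(years_centred, annual["mean"].to_numpy(), deg=1)
print(annual)
print(f"Estimated trend: {slope:.3f} degrees C per year")
```

A slope close to zero relative to the year-to-year scatter would support the observation that there is no clear directional trend.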

**Monthly Trends:**
- **Temperature**: The monthly temperature data clearly reflects the seasonal cycle, with the highest temperatures in the Southern Hemisphere summer (December to February) and the lowest in winter (June to August). Because these monthly averages pool all years together, they describe the typical seasonal cycle rather than year-by-year behaviour.

- **Rainfall**: Rainfall also showed a distinct seasonal pattern. Certain months consistently experienced more rainfall, indicating a variation between wet and dry seasons throughout the year.

- **Humidity**: Like temperature and rainfall, humidity levels also exhibited seasonal variations. Some months consistently showed higher or lower humidity.

- **Pressure**: The atmospheric pressure varied across different months, displaying a pattern possibly linked to the seasonal changes observed in temperature and rainfall.

In conclusion, while annual trends in temperature, rainfall, humidity, and pressure showed variability, they did not exhibit a clear long-term directional trend. On the other hand, the monthly trends for these variables strongly mirrored the expected seasonal patterns, reaffirming the influence of seasonal cycles on weather patterns.
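
To make the seasonal pattern explicit, the month numbers can be mapped to Southern Hemisphere meteorological seasons. This is only a sketch: `SEASON_BY_MONTH` and `toy_monthly` are hypothetical names, and the temperatures are made up rather than taken from `monthly_trends`.

```python
import pandas as pd

# Southern Hemisphere meteorological seasons keyed by month number
SEASON_BY_MONTH = {
    12: "Summer", 1: "Summer", 2: "Summer",
    3: "Autumn", 4: "Autumn", 5: "Autumn",
    6: "Winter", 7: "Winter", 8: "Winter",
    9: "Spring", 10: "Spring", 11: "Spring",
}

# Hypothetical monthly averages; the notebook's monthly_trends would take their place
toy_monthly = pd.DataFrame(
    {"MaxTemp": [29.5, 28.9, 26.8, 23.7, 20.0, 17.2, 16.7, 18.2, 20.7, 23.1, 25.8, 27.7]},
    index=pd.Index(range(1, 13), name="Month"),
)

# Collapse the twelve monthly means into four (unweighted) seasonal means
toy_monthly["Season"] = toy_monthly.index.map(SEASON_BY_MONTH)
seasonal_means = toy_monthly.groupby("Season")["MaxTemp"].mean()
print(seasonal_means)
```

The same mapping could also be applied to `df['Month']` to create a season column, if a coarser temporal feature is wanted.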

## Regional Analysis

In [None]:
# Grouping data by location and calculating the mean for key variables
location_based_analysis = df.groupby('Location')[['MinTemp', 'MaxTemp', 'Rainfall', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm']].mean()

# groupby already returns locations in alphabetical order; sort_index() makes that explicit
location_based_analysis_sorted = location_based_analysis.sort_index()

location_based_analysis_sorted

Analyzing regional weather data from the dataset, I've drawn several conclusions about the climatic patterns across different locations in Australia:

1. **Temperature Variations (MinTemp and MaxTemp)**:
   - Regions like Alice Springs and Katherine show higher average temperatures, indicative of warmer climates typically found in central and northern Australia.
   - Conversely, locations like Mount Ginini and Canberra exhibit lower average temperatures, reflecting cooler climates often seen in southern regions.

2. **Rainfall Patterns**:
   - Coastal areas like Cairns, Coffs Harbour, and Sydney experience higher average rainfall, aligning with the typical wetter climates of coastal regions.
   - Inland locations such as Woomera and Uluru register lower rainfall averages, which is consistent with the drier conditions usually found in central Australia.

3. **Humidity Levels (Humidity9am and Humidity3pm)**:
   - Coastal regions (e.g., Darwin, Wollongong) tend to exhibit higher humidity levels, both in the morning and afternoon, compared to inland areas like Alice Springs and Woomera where humidity levels are considerably lower.
   - This pattern of higher humidity near the coast and lower humidity inland aligns well with the general climatic differences between these areas.

4. **Atmospheric Pressure (Pressure9am and Pressure3pm)**:
   - Pressure readings are fairly consistent across different locations, with slight variations. For instance, Darwin and Katherine, both in the tropical north, show slightly lower pressure values, which is typical for such regions.
   - Inland areas like Alice Springs and Woomera tend to have slightly higher pressure readings on average, reflecting the atmospheric conditions of drier, inland locations.

5. **Regional Variability**:
   - Each region presents a unique combination of these weather elements, illustrating the diverse climatic conditions across Australia. For example, Darwin shows high temperatures and high rainfall, typical of a tropical climate, whereas Alice Springs exhibits high temperatures but low rainfall, characteristic of a desert climate.
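
Comparisons like the ones above are easier to read when the table is ranked by the variable of interest rather than by name. A minimal sketch with made-up per-location averages (`toy_location_means` is hypothetical and its numbers are not from the dataset):

```python
import pandas as pd

# Hypothetical per-location averages (made-up numbers)
toy_location_means = pd.DataFrame(
    {
        "MaxTemp": [29.5, 23.0, 32.6, 20.9],
        "Rainfall": [5.7, 3.3, 5.1, 1.7],
    },
    index=pd.Index(["Cairns", "Sydney", "Darwin", "Canberra"], name="Location"),
)

# Rank by the variable of interest instead of alphabetically by name
wettest = toy_location_means.sort_values("Rainfall", ascending=False)
hottest = toy_location_means.nlargest(2, "MaxTemp")
print(wettest)
print(hottest)
```

On the real table, `location_based_analysis.nlargest(5, 'Rainfall')` would list the five wettest stations directly.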


### By State

In [None]:
# Creating a dictionary to map cities to their respective states or territories in Australia
australian_cities_states = {
    "Adelaide": "South Australia",
    "Albany": "Western Australia",
    "Albury": "New South Wales",
    "AliceSprings": "Northern Territory",
    "BadgerysCreek": "New South Wales",
    "Ballarat": "Victoria",
    "Bendigo": "Victoria",
    "Brisbane": "Queensland",
    "Cairns": "Queensland",
    "Canberra": "Australian Capital Territory",
    "Cobar": "New South Wales",
    "CoffsHarbour": "New South Wales",
    "Dartmoor": "Victoria",
    "Darwin": "Northern Territory",
    "GoldCoast": "Queensland",
    "Hobart": "Tasmania",
    "Katherine": "Northern Territory",
    "Launceston": "Tasmania",
    "Melbourne": "Victoria",
    "MelbourneAirport": "Victoria",
    "Mildura": "Victoria",
    "Moree": "New South Wales",
    "MountGambier": "South Australia",
    "MountGinini": "Australian Capital Territory",
    "Newcastle": "New South Wales",
    "Nhil": "Victoria",
    "NorahHead": "New South Wales",
    "NorfolkIsland": "External Territory",
    "Nuriootpa": "South Australia",
    "PearceRAAF": "Western Australia",
    "Penrith": "New South Wales",
    "Perth": "Western Australia",
    "PerthAirport": "Western Australia",
    "Portland": "Victoria",
    "Richmond": "New South Wales",
    "Sale": "Victoria",
    "SalmonGums": "Western Australia",
    "Sydney": "New South Wales",
    "SydneyAirport": "New South Wales",
    "Townsville": "Queensland",
    "Tuggeranong": "Australian Capital Territory",
    "Uluru": "Northern Territory",
    "WaggaWagga": "New South Wales",
    "Walpole": "Western Australia",
    "Watsonia": "Victoria",
    "Williamtown": "New South Wales",
    "Witchcliffe": "Western Australia",
    "Wollongong": "New South Wales",
    "Woomera": "South Australia"
}

# Applying the provided mapping to the weather dataset to create a new column 'State'
df['State'] = df['Location'].map(australian_cities_states)

# Now let's perform the location-based analysis grouped by State this time
state_based_analysis = df.groupby('State')[['MinTemp', 'MaxTemp', 'Rainfall', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm']].mean()

state_based_analysis_sorted = state_based_analysis.sort_index()
state_based_analysis_sorted

Analyzing the summarized weather data for different states and territories in Australia, we can draw some interesting conclusions about the climatic variations across the country:

1. **Australian Capital Territory (ACT)**:
   - Exhibits relatively cooler temperatures with an average MinTemp of 5.94°C and MaxTemp of 17.97°C.
   - Rainfall is moderate, and humidity levels are comparatively higher, especially in the morning.

2. **External Territory (Norfolk Island)**:
   - Features a milder climate with an average MinTemp of 16.87°C and MaxTemp of 21.83°C.
   - Higher rainfall and humidity, especially in the afternoon, reflecting its island geography.

3. **New South Wales (NSW)**:
   - Shows warmer temperatures (MinTemp: 12.96°C, MaxTemp: 23.87°C) and moderate rainfall.
   - Humidity levels are moderate, with a notable decrease from morning to afternoon.

4. **Northern Territory (NT)**:
   - Has the highest average maximum temperature among the states and territories (MaxTemp: 31.50°C) and a high average minimum (MinTemp: 18.03°C, second only to Queensland's 18.83°C), consistent with its tropical and desert regions.
   - Lower humidity levels and a significant drop in atmospheric pressure, characteristic of arid, inland areas.

5. **Queensland**:
   - Warm and humid climate, with higher average temperatures (MinTemp: 18.83°C, MaxTemp: 27.77°C) and the highest rainfall among the states.
   - High humidity levels, especially in the afternoon, typical of its tropical and subtropical climate.

6. **South Australia**:
   - Moderate temperatures (MinTemp: 11.05°C, MaxTemp: 22.75°C) with low rainfall.
   - Lower humidity levels, particularly in the afternoon, indicating drier conditions.

7. **Tasmania**:
   - Cooler climate with lower average temperatures (MinTemp: 8.47°C, MaxTemp: 18.40°C).
   - Moderate rainfall and higher humidity levels, reflecting its maritime climate.

8. **Victoria**:
   - Mild temperatures (MinTemp: 9.43°C, MaxTemp: 20.65°C) with moderate rainfall.
   - Higher humidity, especially in the morning, indicative of its temperate climate.

9. **Western Australia**:
   - Diverse climatic conditions with warmer temperatures (MinTemp: 11.82°C, MaxTemp: 23.32°C) and moderate rainfall.
   - Moderate humidity levels with a slight drop from morning to afternoon.


### Conclusion

In conclusion, comparing city-level analysis with state-level aggregation shows the trade-off involved in reducing the number of location categories, which in turn reduces the dimensionality of the encoded dataset.

1. **City-Level Analysis**:
   - Provides detailed, localized climatic information, capturing the unique microclimates and weather patterns specific to each city.
   - Ideal for applications requiring high precision, such as urban planning, local agricultural strategies, and targeted weather forecasting.
   - However, it involves handling a larger dataset with higher dimensionality, which can be more complex and computationally intensive.

2. **State-Level Aggregation**:
   - Offers a broader perspective, encapsulating the overall climate trends of an entire state or territory.
   - Useful for macro-level planning and analysis, like statewide agricultural policies, tourism planning, and regional climate studies.
   - Reduces the complexity and size of the dataset, simplifying the analysis and potentially enhancing the performance and interpretability of machine learning models.
   - However, this approach may overlook local variations and microclimates within each state.

In terms of dimensionality reduction, state-level aggregation is an effective way to streamline the dataset: grouping the 49 locations into 9 states and territories shrinks the number of categories a model has to encode, making the data more manageable and less resource-intensive to analyze. This can benefit machine learning models by lowering the risk of overfitting to individual stations and improving computational efficiency, at the cost of discarding local signal. The choice between city-level detail and state-level aggregation should therefore be guided by the specific requirements of the analysis or application in question. If the goal is to capture nuanced, local weather patterns, city-level data is more appropriate. Conversely, for broader, regional insights or when working with resource constraints, state-level aggregation offers a practical and efficient alternative.
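
The effect on dimensionality is easiest to see after one-hot encoding, where each distinct category becomes its own column. The sketch below uses a handful of hypothetical rows (`toy`); with the full mapping above, the same comparison would be 49 location columns against 9 state columns.

```python
import pandas as pd

# Hypothetical rows; only a few locations are included for illustration
toy = pd.DataFrame({
    "Location": ["Sydney", "Newcastle", "Wollongong", "Darwin", "Katherine", "Hobart"],
    "State": ["New South Wales", "New South Wales", "New South Wales",
              "Northern Territory", "Northern Territory", "Tasmania"],
})

# One indicator column is created per distinct category
location_dummies = pd.get_dummies(toy["Location"], prefix="Location")
state_dummies = pd.get_dummies(toy["State"], prefix="State")

print("Columns from Location:", location_dummies.shape[1])
print("Columns from State:", state_dummies.shape[1])
```

Whether the narrower encoding actually helps a given model is an empirical question that would need to be checked on validation data.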

## Missing Value Analysis

In [None]:
# Performing a missing value analysis on the dataset

# Calculating the total number of missing values for each column
missing_values_total = df.isnull().sum()

# Calculating the percentage of missing values for each column
missing_values_percentage = (df.isnull().sum() / len(df)) * 100

# Combining both total and percentage of missing values into a single DataFrame for better presentation
missing_values_analysis = pd.DataFrame({'Total Missing': missing_values_total, 'Percentage Missing': missing_values_percentage})

missing_values_analysis.sort_values(by='Percentage Missing', ascending=False)

After conducting a missing value analysis on the dataset, I've uncovered some key insights about the data gaps in various weather parameters, which are crucial for my understanding of the dataset's completeness:

1. **Significant Gaps in Specific Parameters**:
   - The data for 'Sunshine', 'Evaporation', 'Cloud3pm', and 'Cloud9am' show notably high percentages of missing entries. Particularly, 'Sunshine' tops the list with nearly half of its data missing. This highlights substantial gaps in these specific weather measurements.

2. **Moderately Missing Data in Certain Features**:
   - Atmospheric pressure readings for 9 am and 3 pm, along with wind-related parameters like 'WindDir9am', 'WindGustDir', and 'WindGustSpeed', exhibit around 7-10% missing data. This level of missingness is moderate and noteworthy.

3. **Relatively Complete Core Temperature and Rainfall Data**:
   - The core temperature readings ('MinTemp' and 'MaxTemp') and basic rainfall data have low missing values, under 3%. This lower rate indicates higher reliability and completeness for these fundamental weather measurements.

4. **Implications for My Modeling Work**:
   - The absence of data in crucial variables such as 'Sunshine' and 'Evaporation' could potentially limit the effectiveness of my predictive models, especially if these factors are significant predictors.

5. **Reflections on Data Collection and Quality**:
   - The pattern of missing data points towards possible issues in the data collection process, especially for parameters like 'Sunshine' and 'Evaporation'. This could be due to missing equipment or inconsistent recording practices at some stations.

6. **Solid Foundation in Date and Location Data**:
   - The absence of missing data in 'Date', 'Location', and 'State' is a positive aspect, ensuring a robust base for any time-series or location-based analyses I plan to conduct.

In conclusion, this analysis of missing values is a critical step in preparing my dataset for more detailed analysis and modeling. It highlights the importance of carefully handling these gaps, perhaps through data imputation techniques or by considering the limitations they impose on my analysis. The strategy I choose to address these missing values will be pivotal and will depend on the specific needs of my analysis and the potential impact on the accuracy and reliability of my findings.
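
As one possible imputation strategy (an assumption for illustration, not necessarily the approach taken later in this notebook), numeric gaps can be filled with the column median and categorical gaps with the most frequent value. The frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps; values are made up for illustration
toy = pd.DataFrame({
    "Sunshine": [9.1, np.nan, 3.4, 7.0, np.nan],
    "WindGustDir": ["W", "NW", None, "W", "SE"],
})

# Numeric gap: fill with the column median, which is robust to skewed values
sunshine_median = toy["Sunshine"].median()
toy["Sunshine"] = toy["Sunshine"].fillna(sunshine_median)

# Categorical gap: fill with the most frequent category
wind_mode = toy["WindGustDir"].mode().iloc[0]
toy["WindGustDir"] = toy["WindGustDir"].fillna(wind_mode)

print(toy)
```

The `SimpleImputer` imported earlier offers comparable `'median'` and `'most_frequent'` strategies. In a modeling workflow the fill values should be computed on the training split only, so that information from the test set does not leak into preprocessing.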

## Outlier Detection

### Skew

In [None]:
# Selecting key columns for outlier analysis
key_columns = ['MinTemp', 'MaxTemp', 'Rainfall', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'WindGustSpeed']

# Calculating the skewness for key numerical columns in the dataset
skewness_analysis = df[key_columns].skew()

skewness_analysis

After reviewing the skewness of key numerical columns in my dataset, I've made the following observations regarding the distribution of these weather parameters:

1. **MinTemp**: The skewness is approximately 0.02, which suggests that the distribution of minimum temperatures is almost symmetric. This indicates a balanced spread of data on either side of the mean.

2. **MaxTemp**: With a skewness of 0.22, the distribution of maximum temperatures is slightly right-skewed (positive skew). This implies a modest concentration of data below the mean, with a longer tail extending towards higher temperatures.

3. **Rainfall**: The skewness value of 9.84 for rainfall indicates a highly right-skewed distribution. This is typical for rainfall data, where most days record little or no rain and a small number of heavy-rainfall days form a long right tail.

4. **Humidity9am**: A skewness of -0.48 reveals a slight left-skewed (negative skew) distribution. This means that morning humidity levels are generally high, with fewer instances of very low humidity.

5. **Humidity3pm**: The skewness around 0.03 suggests a nearly symmetric distribution for afternoon humidity levels, similar to MinTemp.

6. **Pressure9am and Pressure3pm**: Both these pressure readings show a slight left skew, with skewness values of -0.10 and -0.05, respectively. This indicates a slightly higher frequency of days with above-average atmospheric pressure.

7. **WindGustSpeed**: Exhibiting a skewness of 0.87, the distribution is moderately right-skewed. This suggests that lower wind gust speeds are more common, but there are occasional days with significantly higher speeds.

Based on this skewness analysis:

- For nearly symmetric distributions like MinTemp, Humidity3pm, Pressure9am, and Pressure3pm, traditional methods such as the Interquartile Range (IQR) or Z-score are suitable for outlier detection.
- For MaxTemp and WindGustSpeed, which show moderate skewness, I can still employ IQR or Z-score methods but should be mindful of the skewness potentially affecting the analysis.
- Rainfall, being highly skewed, might require a data transformation approach (such as logarithmic transformation) before applying outlier detection methods. Alternatively, I could use non-parametric methods better suited for skewed distributions.
- For Humidity9am, with its slight negative skew, methods like IQR or Z-score can be used, but it's important to consider the skewness in any analysis.
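
The mapping from skewness to outlier method can be sketched as a small helper. This is only an illustration: the function name `suggest_outlier_method` is hypothetical, and the 0.5 and 1.0 cut-offs are a common rule of thumb assumed here, so the grouping may differ slightly from the one described above (for example, MaxTemp falls in the first bucket).

```python
import pandas as pd

def suggest_outlier_method(skew: float) -> str:
    """Map a skewness value to a rough outlier-handling suggestion.

    The 0.5 and 1.0 cut-offs are illustrative assumptions.
    """
    magnitude = abs(skew)
    if magnitude < 0.5:
        return "IQR or Z-score"
    if magnitude < 1.0:
        return "IQR or Z-score (interpret with care)"
    return "transform first (e.g. log) or use a non-parametric method"

# Skewness values as reported in the observations above
reported_skew = pd.Series({
    "MinTemp": 0.02, "MaxTemp": 0.22, "Rainfall": 9.84, "Humidity9am": -0.48,
    "Humidity3pm": 0.03, "Pressure9am": -0.10, "Pressure3pm": -0.05,
    "WindGustSpeed": 0.87,
})

suggestions = reported_skew.apply(suggest_outlier_method)
print(suggestions)
```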

### Outlier Detection

In [None]:
# Performing an outlier analysis on the dataset

def detect_outliers(df, column):
    # Calculating IQR
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1

    # Defining boundaries for outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Identifying outliers
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers

# Selecting key columns for outlier analysis
normal_columns = ['MinTemp', 'MaxTemp', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'WindGustSpeed']

# Applying outlier detection for each key column
outlier_analysis = {}
for col in normal_columns:
    n_outliers = detect_outliers(df, col).shape[0]  # run the detection once per column
    outlier_analysis[col] = {
        "Outliers": n_outliers,
        "Percentage": (n_outliers / df.shape[0]) * 100
    }

outlier_analysis

In [None]:
# Applying a log transformation to the 'Rainfall' column
# log1p computes log(1 + x), which avoids the undefined log(0) on dry days
df['Rainfall_log'] = np.log1p(df['Rainfall'])

# Plotting the original and transformed 'Rainfall' distributions for comparison
plt.figure(figsize=(12, 6))

# Original Rainfall distribution
plt.subplot(1, 2, 1)
sns.histplot(df['Rainfall'], bins=30, kde=True)
plt.title('Original Rainfall Distribution')

# Transformed (Log) Rainfall distribution
plt.subplot(1, 2, 2)
sns.histplot(df['Rainfall_log'], bins=30, kde=True)
plt.title('Log Transformed Rainfall Distribution')

plt.tight_layout()
plt.show()

# Calculating outliers using IQR on the transformed Rainfall data
Q1 = df['Rainfall_log'].quantile(0.25)
Q3 = df['Rainfall_log'].quantile(0.75)
IQR = Q3 - Q1

# Defining boundaries for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identifying outliers
outliers_transformed = df[(df['Rainfall_log'] < lower_bound) | (df['Rainfall_log'] > upper_bound)]

# Number of outliers in the transformed data
num_outliers_transformed = outliers_transformed.shape[0]

num_outliers_transformed, lower_bound, upper_bound

After conducting an outlier analysis on key numerical features in my dataset, here's what I've found:

1. **MinTemp**:
   - Outliers: 54 (0.037% of the data)
   - The low percentage of outliers indicates that minimum temperatures are generally consistent with few extreme values.

2. **MaxTemp**:
   - Outliers: 489 (0.336% of the data)
   - A higher percentage of outliers compared to MinTemp, suggesting more variability in maximum temperatures.

3. **Humidity9am**:
   - Outliers: 1,425 (0.980% of the data)
   - Just under 1% of readings fall outside the IQR bounds, the highest share among the humidity and pressure features, possibly due to geographic or seasonal factors.

4. **Humidity3pm**:
   - Outliers: 0 (0.0% of the data)
   - Surprisingly, no outliers were found in afternoon humidity, indicating a very consistent range of values.

5. **Pressure9am**:
   - Outliers: 1,191 (0.819% of the data)
   - This suggests some variability in morning atmospheric pressure, though less than in humidity readings.

6. **Pressure3pm**:
   - Outliers: 919 (0.632% of the data)
   - Similar to Pressure9am, indicating variability in atmospheric pressure but with slightly fewer outliers.

7. **WindGustSpeed**:
   - Outliers: 3,092 (2.126% of the data)
   - The largest share of outliers among the untransformed features, indicating that unusually high wind gust speeds occur regularly.

8. **Rainfall (Post Log Transformation)**:
   - After applying a log transformation to the Rainfall data and using the IQR method, approximately 14.27% of the observations are considered outliers. This high percentage of outliers, even after transformation, suggests that extreme rainfall events, though relatively rare, are a significant characteristic of the data.

In conclusion, this outlier analysis has provided valuable insights into the variability and distribution of key weather parameters. WindGustSpeed (2.126%) and post-transformation Rainfall (about 14.27%) have the largest shares of outliers, while every other feature stays below 1%, with Humidity9am the highest of those at 0.980%. This understanding is crucial for data preprocessing and ensuring that my models are robust against such variations. For the features with the largest shares of outliers, I might consider additional transformations or robust statistical methods to mitigate their impact.

## Categorical Data Analysis

In [None]:
# Identifying categorical columns in the dataset
categorical_columns = df.select_dtypes(include=['object']).columns

# Counting the number of unique values and their frequency for each categorical column
categorical_analysis = {}
for col in categorical_columns:
    categorical_analysis[col] = df[col].value_counts()

categorical_analysis

## EDA Conclusion

Based on the comprehensive Exploratory Data Analysis (EDA) conducted on the weather dataset, several key findings and patterns have been identified, shaping the direction for the upcoming feature engineering phase. Here’s a conclusion of the EDA and a detailed list of steps to undertake in feature engineering:

### Conclusion of EDA:
1. **Distribution Analysis**: Identified symmetric distributions in some variables like MinTemp and Humidity3pm, and skewness in others like Rainfall and WindGustSpeed, guiding the approach for outlier handling and normalization.
2. **Correlation Analysis**: Revealed significant relationships between certain features, such as humidity levels and rainfall, which are crucial for model feature selection.
3. **Annual and Monthly Trends**: Uncovered seasonal patterns and yearly variations in weather parameters, suggesting the potential use of date-time features to capture these trends.
4. **Regional Variations**: Observed distinct climatic differences across states and cities, highlighting the importance of location-based features in the models.
5. **Categorical Data Insights**: Gained understanding of prevailing wind directions and rainfall occurrences, which are essential for categorical feature encoding.
6. **Missing Value Analysis**: Identified features with high missing values, indicating a need for robust imputation strategies.

### Feature Engineering To-Do List:
* **Outlier Treatment**: 
   - Apply appropriate methods like IQR or Z-score for symmetrically distributed features.
   - Consider transformations or non-parametric methods for highly skewed features like Rainfall.

* **Handling Missing Data**: 
   - Impute missing values using statistical methods (mean, median) or model-based imputation, depending on the nature of the variable.
   - Consider dropping features with excessively high missing values if imputation isn’t feasible.

* **Feature Creation**: 
   - Create new features that might be relevant, such as 'Temperature Range' (MaxTemp - MinTemp) or 'Average Pressure' ((Pressure9am + Pressure3pm) / 2).

* **Date-Time Features**: 
   - Extract year, month, and other relevant date-time components from the 'Date' column to capture seasonal and annual patterns.

* **Feature Selection**: 
   - Utilize correlation analysis and domain knowledge to select relevant features.
   - Consider automated feature selection techniques like Recursive Feature Elimination (RFE) or feature importance from ensemble models.

* **Data Partitioning**: 
   - Split the dataset into training and testing sets in preparation for model training.

* **Feature Transformation**: 
   - Normalize or standardize features with skewness.
   - Apply log transformation to Rainfall to reduce skewness impact.

* **Encoding Categorical Variables**: 
   - Use one-hot encoding or label encoding for categorical variables like WindGustDir, WindDir9am, WindDir3pm, RainToday, and Location.

* **Dimensionality Reduction**:
   - Explore PCA or other dimensionality reduction techniques, especially for high-dimensional data post one-hot encoding.

* **Pipeline Creation**:
    - Develop a preprocessing pipeline that integrates these feature engineering steps to streamline model training and validation processes.

This feature engineering to-do list is aimed at refining and preparing the dataset for the modeling phase. Each step is designed to address specific insights gained from the EDA and to enhance the overall predictive power and robustness of the forthcoming models.

# **Feature Engineering**

## Handling Missing Values

In our dataset, we face a challenge with certain columns exhibiting a high percentage of missing values. Generally, when a column has a substantial amount of missing data, particularly more than 30-50%, it raises concerns about the reliability and usefulness of that variable. Such columns could potentially skew our analysis or model predictions.

In our case, the following columns have a notably high percentage of missing values:
- **Sunshine**: Approximately 48% of its values are missing.
- **Evaporation**: Around 43% of its data is missing.
- **Cloud3pm**: Missing about 41% of its values.
- **Cloud9am**: Has roughly 38% missing values.

Given the significant proportion of missing data in these columns, imputing them would mean estimating a large share of their values, so I will drop them from the dataset.

We will impute the rest based on the datatype.
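
As a rough sketch of how such a threshold could be applied programmatically, the helper below lists the columns whose missing fraction exceeds a cut-off. The function name `columns_above_missing_threshold`, the 30% default, and the toy values are assumptions made for illustration; they are not taken from the weather data.

```python
import numpy as np
import pandas as pd

def columns_above_missing_threshold(frame: pd.DataFrame, threshold: float = 0.30) -> list:
    """Return the columns whose fraction of missing values exceeds `threshold`."""
    missing_fraction = frame.isnull().mean()
    return missing_fraction[missing_fraction > threshold].index.tolist()

# Toy frame standing in for the weather data (values are made up)
toy = pd.DataFrame({
    "Sunshine": [np.nan, np.nan, 7.1, np.nan],             # 75% missing
    "MinTemp": [12.0, 14.5, np.nan, 9.8],                  # 25% missing
    "Location": ["Sydney", "Perth", "Cairns", "Hobart"],   # complete
})

to_drop = columns_above_missing_threshold(toy, threshold=0.30)
print(to_drop)
```

Deriving the list from the data keeps the drop decision consistent if the threshold is revisited later.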

In [None]:
# Dropping features with too many missing values
columns_to_drop = ['Sunshine', 'Evaporation', 'Cloud3pm', 'Cloud9am']
df = df.drop(columns=columns_to_drop)

In [None]:
# Dropping rows where the target variable is missing
# (calling dropna on the column alone would leave df unchanged)
df = df.dropna(subset=['RainTomorrow'])

In [None]:
# Imputing missing values for the remaining columns
# For numerical columns, we use the median
# For categorical columns, we use the mode
for column in df.columns:
    if df[column].dtype == 'object':
        # Impute with mode for categorical columns
        mode = df[column].mode()[0]
        df[column] = df[column].fillna(mode)
    else:
        # Impute with median for numerical columns
        median = df[column].median()
        df[column] = df[column].fillna(median)

In [None]:
# Checking for remaining null values
print(df.isnull().sum().sum())

## Feature Creation

### Temperature Range (TempRange)

This feature represents the daily temperature variation, calculated as the difference between the maximum and minimum temperatures. A larger range might indicate more extreme weather conditions, which could be significant for certain predictions.

In [None]:
df['TempRange'] = df['MaxTemp'] - df['MinTemp']

### Average Pressure (AvgPressure)

By averaging morning and afternoon pressure readings, this feature provides a daily overview of atmospheric pressure. It's useful because extreme or abnormal pressure readings can be indicative of unusual weather patterns.

In [None]:
df['AvgPressure'] = (df['Pressure9am'] + df['Pressure3pm']) / 2


### Humidity Change (HumidityChange)

This measures the change in humidity from morning to afternoon. Significant changes might indicate weather fronts or systems moving through an area, which can be crucial for predicting rainfall or other weather events.

In [None]:
df['HumidityChange'] = df['Humidity3pm'] - df['Humidity9am']

### Wind Speed Change (WindSpeedChange)

If available, calculating the change in wind speed between morning and afternoon can highlight days with increasing or decreasing wind conditions. This could be relevant for understanding weather systems and their impacts.

In [None]:
df['WindSpeedChange'] = df['WindSpeed3pm'] - df['WindSpeed9am']

### Average Temperature (AvgTemp)

This feature calculates the daily average temperature by combining the maximum and minimum temperatures. It offers a balanced view of the day's overall thermal conditions, which can be vital for understanding general climate trends.

In [None]:
df['AvgTemp'] = (df['MaxTemp'] + df['MinTemp']) / 2

### Average Humidity (AvgHumidity)

This feature computes the average humidity level for the day by combining morning and afternoon readings. It provides a more stable measure of the day's overall humidity, smoothing out the usual diurnal variations.

In [None]:
df['AvgHumidity'] = (df['Humidity9am'] + df['Humidity3pm']) / 2

### Average Wind Speed (AvgWindSpeed)

Calculating the average wind speed for the day, based on morning and afternoon measurements, this feature offers a consistent view of the day's wind conditions. It is especially useful in determining the general windiness and its potential impacts on weather events.

In [None]:
if 'WindSpeed9am' in df.columns and 'WindSpeed3pm' in df.columns:
    df['AvgWindSpeed'] = (df['WindSpeed9am'] + df['WindSpeed3pm']) / 2

## Date Extraction

In [None]:
# Date was already converted. Year and Month already extracted.
df['Day'] = df['Date'].dt.day
df['DayOfWeek'] = df['Date'].dt.dayofweek
df['WeekOfYear'] = df['Date'].dt.isocalendar().week
df['IsWeekend'] = df['Date'].dt.dayofweek >= 5

## Feature Selection

### Correlation Analysis

In [None]:
# Calculating Correlations
phi_k_correlation_matrix = df.phik_matrix()

In [None]:
# Extract the correlation of all features with the target variable
correlation_with_target = phi_k_correlation_matrix['RainTomorrow']

# Display the correlation values
print(correlation_with_target)

After conducting a Phi K correlation analysis with the new features, focusing on their relationship with the target variable 'RainTomorrow', we can draw several conclusions:

1. **Significant Correlations**:
   - **Humidity3pm (0.621)**: Shows the highest correlation with 'RainTomorrow'. This indicates that afternoon humidity levels are a strong predictor for rain the next day.
   - **AvgHumidity (0.552)**: The second-highest value. It is lower than Humidity3pm but confirms that overall humidity levels are crucial for rainfall prediction.
   - **RainToday (0.462)** and **Rainfall_log (0.433)**: Both show strong correlations, so whether and how much it rained today carries useful signal for the next day.
   - **TempRange (0.456)**: The temperature range within a day also shows a significant correlation, suggesting that the size of the daily temperature swing is informative about changing weather conditions that may lead to rain.

2. **Moderate Correlations**:
   - **HumidityChange (0.351)**, **WindGustSpeed (0.296)**, **Pressure9am (0.296)**, and **AvgPressure (0.294)**: These features have moderate correlations with 'RainTomorrow', indicating their relevance in predicting rain.

3. **Low Correlations**:
   - **AvgTemp (0.080)** and **AvgWindSpeed (0.138)**: These features have lower correlations, suggesting they might not be as influential in predicting rain as other features.

4. **Negligible Correlations**:
   - **WindSpeedChange (0.028)**, **DayOfWeek (0.010)**, and **IsWeekend (0.002)**: These features show very low correlations, indicating they might not contribute significantly to predicting rain the next day.

### Feature Selection for Modeling:
Based on the correlation analysis, the following features are recommended for inclusion in the predictive model:
- **Humidity3pm**
- **Rainfall_log**
- **RainToday**
- **TempRange**
- **WindGustSpeed**
- **Pressure9am**
- **AvgPressure**
- **HumidityChange**
- **AvgHumidity**

Features like **AvgTemp** and **AvgWindSpeed** could be considered for inclusion, but their lower correlations suggest they might have less predictive power. Conversely, features with negligible correlations might be dropped to simplify the model and potentially improve its performance.

In conclusion, this correlation analysis has helped identify key features that are strongly linked to the likelihood of rain the next day, guiding the feature selection process for our predictive model. It's essential to consider these correlations while balancing model complexity and performance.
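
The selection above can be reproduced with a simple threshold on the reported correlations. This is a minimal sketch: the 0.25 cut-off (`MIN_CORRELATION`) is an assumption chosen so that the result matches the recommended list, not a value stated in the analysis.

```python
import pandas as pd

# Phi K correlations with 'RainTomorrow' as reported in the analysis above
phik_with_target = pd.Series({
    "Humidity3pm": 0.621, "Rainfall_log": 0.433, "RainToday": 0.462,
    "TempRange": 0.456, "WindGustSpeed": 0.296, "Pressure9am": 0.296,
    "AvgPressure": 0.294, "HumidityChange": 0.351, "AvgHumidity": 0.552,
    "AvgTemp": 0.080, "AvgWindSpeed": 0.138, "WindSpeedChange": 0.028,
    "DayOfWeek": 0.010, "IsWeekend": 0.002,
})

MIN_CORRELATION = 0.25  # assumed cut-off, for illustration only

selected = (
    phik_with_target[phik_with_target >= MIN_CORRELATION]
    .sort_values(ascending=False)
    .index.tolist()
)
print(selected)
```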

In [None]:
# Selecting the working dataframe
df = df[['Humidity3pm', 'Rainfall_log', 'RainToday', 'TempRange', 'WindGustSpeed', 'Pressure9am', 'AvgPressure', 'HumidityChange', 'AvgHumidity', 'RainTomorrow']]

## Splitting the Dataset

In [None]:
# split data into features and target variable
X = df.drop('RainTomorrow', axis=1)
y = df['RainTomorrow']

# split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# print the shapes of the resulting sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

## Feature Transformation

### Encoding the target feature

In [None]:
# displaying RainTomorrow
y_train

For the model to understand and classify our target feature, we need to change it into binary.

In [None]:
# Converting 'Yes'/'No' to 1/0 for training and test sets
y_train = y_train.map({'Yes': 1, 'No': 0})
y_test = y_test.map({'Yes': 1, 'No': 0})

In [None]:
# displaying RainTomorrow encoded
y_train

### SMOTE

SMOTE works on numeric features only, so the remaining categorical feature, 'RainToday', needs to be encoded before resampling.

In [None]:
X_train.to_csv('train.csv')

In [None]:
le_rain = LabelEncoder()
X_train['RainToday'] = le_rain.fit_transform(X_train['RainToday'])
X_test['RainToday'] = le_rain.transform(X_test['RainToday'])

In [None]:
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

In [None]:
print("Class distribution:", Counter(y_train))
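
For intuition about what the resampling step does, the sketch below generates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, which is the core idea behind SMOTE. It is a simplified illustration, not the imbalanced-learn implementation; the function name `smote_like_samples` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_samples(minority: np.ndarray, n_new: int, k: int = 5, seed: int = 42) -> np.ndarray:
    """Create synthetic points by interpolating between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    n = len(minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples
    diffs = minority[:, None, :] - minority[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))

    # k nearest neighbours of each sample (column 0 is normally the sample itself)
    neighbours = np.argsort(distances, axis=1)[:, 1:k + 1]

    # Pick a base sample and one of its neighbours for every synthetic point
    base_idx = rng.integers(0, n, size=n_new)
    neighbour_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]

    # Place the new point at a random position on the segment between the two
    gap = rng.random((n_new, 1))
    return minority[base_idx] + gap * (minority[neighbour_idx] - minority[base_idx])

# Four toy minority points at the corners of the unit square
minority_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(minority_points, n_new=6, k=2)
print(synthetic)
```

Because interpolation treats every column as continuous, a 0/1 column such as RainToday can receive fractional values. imbalanced-learn provides `SMOTENC` for datasets that mix numeric and categorical features.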

### Outlier Treatment

In [None]:
# Displaying the data before outlier handling 
X_train.describe()

In [None]:
X_train.skew()

In [None]:
# Columns with skewness beyond ±0.5
cols_winsorizer = ['Rainfall_log', 'RainToday', 'TempRange', 'WindGustSpeed']

# Columns for Z-score treatment
cols_zscore = ['Humidity3pm', 'Pressure9am', 'AvgPressure', 'HumidityChange', 'AvgHumidity']

# Apply Winsorizer
winsorizer = Winsorizer(capping_method='iqr', tail='both', fold=1.5, variables=cols_winsorizer)
X_train[cols_winsorizer] = winsorizer.fit_transform(X_train[cols_winsorizer])

# Apply Z-score
for col in cols_zscore:
    X_train[col] = zscore(X_train[col])
    # Cap values beyond 3 standard deviations
    X_train[col] = np.where(X_train[col] > 3, 3, X_train[col])
    X_train[col] = np.where(X_train[col] < -3, -3, X_train[col])

In [None]:
# Displaying the data after outlier handling
X_train.describe()

**Outlier Handling Analysis with SMOTE Applied**

**Overview:**
Following the application of SMOTE (Synthetic Minority Over-sampling Technique) to balance the class distribution, we reassessed the outlier treatment in our dataset. This step was crucial to ensure the robustness of our weather prediction model, especially considering the changes SMOTE could introduce in feature distributions.

**Detailed Analysis:**

1. **Humidity3pm**: With values standardized (mean ~0, std ~1), the distribution appears normal. The range from -2.68 to 1.98 suggests effective control of extreme values, reflecting typical humidity conditions.

2. **Rainfall_log**: The maximum is now capped at 2.91 by the winsorizer, which moderates the impact of extreme rainfall values and contributes to a more normalized distribution.

3. **TempRange**: The upper cap of 23.13 limits unusually large daily temperature variations.

4. **WindGustSpeed**: The maximum is now capped at 75.90, so extreme gusts no longer dominate the feature's range and the data aligns more closely with common weather scenarios.

5. **Pressure9am**: Adjusted to a standardized scale, with values capped at 3 standard deviations, this feature now portrays a realistic range of atmospheric pressure variations.

6. **AvgPressure**: Similar to Pressure9am, the standardized and capped values offer a more accurate representation of daily pressure fluctuations.

7. **HumidityChange**: The standardized values are now capped to the range -3 to 3, which bounds extreme daily humidity swings.

8. **AvgHumidity**: The adjustment of the minimum value to -3 and the standardization ensure the dataset captures a realistic spread of average daily humidity levels.

**Conclusion:**
The combination of SMOTE and tailored outlier treatment has yielded a training set with balanced classes and bounded feature ranges. The capping of `Rainfall_log`, `TempRange`, and `WindGustSpeed` is particularly noteworthy, since extreme values in these features are now limited to the IQR-based bounds. This gives the model a balanced dataset in which a handful of extreme observations cannot dominate training, which is critical for accurate and reliable weather forecasting.
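
The treatment above is fitted on the training set. A minimal sketch of how bounds learned from the training data could also be applied to the test set, so that both sets share the same scale, is shown below. The helper names (`fit_iqr_caps`, `fit_zscore_params`, `apply_capped_zscore`) and the toy values are hypothetical and only illustrate the idea.

```python
import pandas as pd

def fit_iqr_caps(train_col: pd.Series, fold: float = 1.5) -> tuple:
    """Learn lower and upper IQR caps from the training column."""
    q1, q3 = train_col.quantile(0.25), train_col.quantile(0.75)
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr

def fit_zscore_params(train_col: pd.Series) -> tuple:
    """Learn mean and population standard deviation (ddof=0, as scipy's zscore uses)."""
    return train_col.mean(), train_col.std(ddof=0)

def apply_capped_zscore(col: pd.Series, mean: float, std: float, limit: float = 3.0) -> pd.Series:
    """Standardize with training statistics and cap at +/- `limit` standard deviations."""
    return ((col - mean) / std).clip(-limit, limit)

# Toy data (made-up values)
toy_train = pd.DataFrame({"WindGustSpeed": [30.0, 35.0, 40.0, 45.0, 50.0],
                          "Humidity3pm": [40.0, 50.0, 60.0, 70.0, 80.0]})
toy_test = pd.DataFrame({"WindGustSpeed": [20.0, 120.0],
                         "Humidity3pm": [60.0, 200.0]})

# IQR capping: bounds come from the training set, then clip the test set
lower, upper = fit_iqr_caps(toy_train["WindGustSpeed"])
test_gust_capped = toy_test["WindGustSpeed"].clip(lower, upper)

# Capped Z-score: statistics come from the training set, then transform the test set
mean, std = fit_zscore_params(toy_train["Humidity3pm"])
test_humidity_z = apply_capped_zscore(toy_test["Humidity3pm"], mean, std)

print(test_gust_capped.tolist(), test_humidity_z.tolist())
```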

### Scaling and Encoding

In [None]:
# Define numerical and categorical columns
numerical_cols = ['Humidity3pm', 'Rainfall_log', 'TempRange', 'WindGustSpeed', 'Pressure9am', 'AvgPressure', 'HumidityChange', 'AvgHumidity']
categorical_cols = ['RainToday']

# Create numerical transformer (scaling, imputing)
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Create categorical transformer (imputing, encoding)
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers into a preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define the final model pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Fit the pipeline to our data
pipeline.fit(X_train, y_train)

#### Data Preprocessing Pipeline Description

The pipeline is structured to preprocess both numerical and categorical data efficiently:

1. **Defining Columns**:
   - Separates features into numerical (`numerical_cols`) and categorical (`categorical_cols`) categories.

2. **Numerical Transformer**:
   - **Imputation**: Fills missing values with the median.
   - **Scaling**: Standardizes features using StandardScaler.

3. **Categorical Transformer**:
   - **Imputation**: Fills missing values with the most frequent value.
   - **One-Hot Encoding**: Converts categories into binary columns.

4. **Column Transformer**:
   - Merges the numerical and categorical transformers, applying respective transformations.

5. **Pipeline Integration**:
   - The final pipeline combines preprocessing steps, ready to be applied to the training data.

The pipeline is then fitted to the training data, ensuring consistent preprocessing for model training and future predictions. This structured approach streamlines the data preparation, making the data suitable for effective machine learning model training.
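
The fit-on-train, transform-on-test pattern described above can be illustrated on a tiny example. The data below is made up, and RainToday is kept as text here for readability, whereas in the notebook it has already been label-encoded.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training and test frames (made-up values, with a few gaps)
toy_train = pd.DataFrame({
    "Humidity3pm": [40.0, 55.0, None, 80.0],
    "TempRange": [10.0, 8.0, 12.0, 6.0],
    "RainToday": ["No", "Yes", "No", "No"],
})
toy_test = pd.DataFrame({
    "Humidity3pm": [60.0, None],
    "TempRange": [9.0, 11.0],
    "RainToday": ["Yes", "No"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["Humidity3pm", "TempRange"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["RainToday"]),
])

# Statistics (medians, means, categories) are learned from the training data only
toy_train_out = toy_preprocessor.fit_transform(toy_train)
# The test data is transformed with those same learned statistics
toy_test_out = toy_preprocessor.transform(toy_test)

print(toy_train_out.shape, toy_test_out.shape)
```

Two scaled numerical columns plus two one-hot columns (one per RainToday category) give four output columns for both sets.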

In [None]:
X_train_transformed = pipeline.transform(X_train)
X_test_transformed = pipeline.transform(X_test)

# **Model Definition**

For this classification task, I will be exploring a variety of models to determine which performs best on the dataset. The chosen models are known for their effectiveness in classification problems and include:

1. K-Nearest Neighbors (KNN):
   - **Rationale**: KNN is a simple yet effective algorithm suitable for classification tasks. It's based on feature similarity and can be very useful in scenarios where the decision boundary is irregular.
   
2. Support Vector Machine (SVM):
   - **Rationale**: SVM is known for its ability to handle high-dimensional data and effectiveness in binary classification tasks. It's particularly useful when the classes are separable by a clear margin of separation.

3. Decision Tree:
   - **Rationale**: Decision Trees are intuitive and easy to interpret, making them a good choice for initial exploration. They work by breaking down the data into smaller subsets while incrementally developing a decision tree.

4. Random Forest:
   - **Rationale**: As an ensemble of Decision Trees, Random Forest can yield more robust and accurate predictions. It's effective in reducing overfitting, a common issue with single Decision Trees.

5. XGBoost:
   - **Rationale**: XGBoost is a powerful and efficient implementation of gradient boosting that can handle a variety of data types, distributions, and relationships. It's known for its speed and performance.


**Evaluation Metrics**

Since this is a classification task, the metrics for evaluating model performance will be:

- **Accuracy**: Measures the proportion of correctly predicted instances. It's useful for getting a general sense of how often the model is correct.
- **F1 Score**: Harmonic mean of precision and recall. It's particularly useful when dealing with imbalanced datasets.
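
To make the two metrics concrete, the sketch below computes them from confusion-matrix counts. The function name `classification_metrics` and the counts are hypothetical, chosen only to show how accuracy and F1 can diverge when the classes are imbalanced.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts for an imbalanced problem: 50 rainy days, 150 dry days
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=140)
print(metrics)
```

With these counts the accuracy is 0.85 while the F1 score is about 0.67, because the many correctly predicted dry days lift accuracy but do not enter the F1 calculation.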

**Cross-Validation Strategy**

To compare these models fairly, we'll use K-fold cross-validation. This method involves dividing the dataset into 'K' subsets and iteratively training the model 'K' times, each time using a different subset as the test set and the remaining data as the training set.
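
The splitting logic behind K-fold cross-validation can be sketched in a few lines of NumPy. This is an illustration of the idea only; the generator name `k_fold_indices` is hypothetical, and scikit-learn's own splitters are what `cross_val_score` uses in practice.

```python
import numpy as np

def k_fold_indices(n_samples: int, k: int = 5, seed: int = 42):
    """Yield (train_idx, test_idx) pairs so that each sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        test_idx = folds[i]
        # All remaining folds together form the training set for this round
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("train:", sorted(train_idx.tolist()), "test:", sorted(test_idx.tolist()))
```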

In [None]:
# Define the models
models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "XGBoost": XGBClassifier(tree_method='gpu_hist')
}

# Perform cross-validation
cv_results = {}
for name, model in models.items():
    cv_acc = cross_val_score(model, X_train_transformed, y_train, cv=5, scoring='accuracy').mean()
    cv_results[name] = {'CV Accuracy': cv_acc}

# Display the results
for model, performance in cv_results.items():
    print(f"{model} - Accuracy: {performance['CV Accuracy']:.4f}")

**Updated Cross-Validation Analysis with SMOTE:**

1. **Random Forest**:
   - Accuracy: 0.8850
   - **Analysis**: Random Forest emerges as the top performer post-SMOTE application, with the highest cross-validated accuracy of the five models. This suggests it handles both the majority and minority classes in the resampled dataset well.

2. **XGBoost**:
   - Accuracy: 0.8586
   - **Analysis**: XGBoost, while not the top performer, still shows commendable results. The slight decrease compared to the Random Forest model could be due to its sensitivity to the class balancing done by SMOTE. Nevertheless, it remains a strong candidate.

3. **KNN**:
   - Accuracy: 0.8313
   - **Analysis**: KNN has shown a noticeable improvement post-SMOTE, indicating its enhanced capability to deal with the balanced classes. It's a considerable option, especially for its simplicity and effectiveness.

4. **Decision Tree**:
   - Accuracy: 0.8252
   - **Analysis**: The Decision Tree model has shown decent performance but still lags behind the ensemble methods. Its tendency for overfitting could be a factor in its lower scores compared to Random Forest and XGBoost.

5. **SVM**:
   - Accuracy: 0.7849
   - **Analysis**: SVM's performance has dropped post-SMOTE. This might be due to its nature of being less effective with larger, balanced datasets. It seems less suitable for this specific task.

**Verdict**: The Random Forest model is the recommended choice post-SMOTE. Its ability to efficiently handle the balanced classes makes it the best fit for this dataset. The high accuracy indicates its robustness and reliability in making predictions. Further optimization through hyperparameter tuning can be explored to potentially enhance its performance.
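
The comparison loop above collects accuracy only. A minimal sketch of how accuracy and F1 could be gathered in a single pass with scikit-learn's `cross_validate` follows; it uses a small synthetic dataset from `make_classification` (an assumption for illustration, not the weather data), so the printed scores say nothing about the actual models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic, imbalanced toy data standing in for the real features
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)

scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_toy,
    y_toy,
    cv=5,
    scoring=["accuracy", "f1"],  # one result array per metric
)

mean_accuracy = scores["test_accuracy"].mean()
mean_f1 = scores["test_f1"].mean()
print(f"Accuracy: {mean_accuracy:.4f}, F1: {mean_f1:.4f}")
```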

# **Model Training**

We will now train the Random Forest model, identified as the best model through cross-validation, and evaluate its initial performance. To maximize the model's efficiency, we will use parallel processing by utilizing all available CPU cores.

## Training The Model

Random Forest is known for its effectiveness in handling complex datasets and providing robust results. It works by building multiple decision trees and merging their outputs for more accurate and stable predictions. Although Random Forest does not natively support GPU acceleration in Scikit-Learn, its parallel processing over multiple CPU cores significantly enhances its performance, especially on large datasets. Let's proceed with training and initial evaluation of the Random Forest model.

In [None]:
# Defining the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

# Training the model on the training dataset
rf_model.fit(X_train_transformed, y_train)

## Initial Evaluation on Training and Testing Sets

After training the model, we assess its performance on both the training and test datasets. The evaluation metrics include accuracy, recall, precision, and the count of False Negatives (FN) and False Positives (FP).

In [None]:
# Reusing the rf_model trained in the previous cell

# Predictions on training data
train_preds = rf_model.predict(X_train_transformed)

# Predictions on test data
test_preds = rf_model.predict(X_test_transformed)

# Evaluation metrics for training data
train_accuracy = accuracy_score(y_train, train_preds)
train_recall = recall_score(y_train, train_preds)
train_precision = precision_score(y_train, train_preds)

# Evaluation metrics for test data
test_accuracy = accuracy_score(y_test, test_preds)
test_recall = recall_score(y_test, test_preds)
test_precision = precision_score(y_test, test_preds)

# Confusion matrices for False Negatives (FN) and False Positives (FP) analysis
# Each matrix is computed once; for binary labels ravel() yields (TN, FP, FN, TP)
_, train_fp, train_fn, _ = confusion_matrix(y_train, train_preds).ravel()
_, test_fp, test_fn, _ = confusion_matrix(y_test, test_preds).ravel()

# Displaying the evaluation results
print("Training Metrics:")
print(f"Accuracy: {train_accuracy:.4f}, Recall: {train_recall:.4f}, Precision: {train_precision:.4f}, FN: {train_fn}, FP: {train_fp}")
print("\nTest Metrics:")
print(f"Accuracy: {test_accuracy:.4f}, Recall: {test_recall:.4f}, Precision: {test_precision:.4f}, FN: {test_fn}, FP: {test_fp}")

**Initial Model Evaluation:**

1. **Accuracy**:
   - **Training**: 99.98% 
   - **Test**: 82.48%
   - **Observation**: The model achieves near-perfect accuracy on the training set but shows a drop in the test set, indicating possible overfitting.

2. **Recall**:
   - **Training**: 99.96%
   - **Test**: 55.84%
   - **Observation**: The high recall on the training set drops significantly on the test set. This suggests the model is less effective at identifying positive cases (rain) in unseen data.

3. **Precision**:
   - **Training**: 99.99%
   - **Test**: 61.32%
   - **Observation**: Precision is almost perfect on the training data but decreases in the test set, pointing to a higher rate of false positives in unseen data.

4. **False Negatives (FN) and False Positives (FP)**:
   - **Training**: FN = 37; FP = 6
   - **Test**: FN = 2,835; FP = 2,261
   - **Observation**: The training data shows an exceptionally low number of false negatives and positives, but these numbers increase substantially in the test set, especially false negatives.
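
To make the relationship between these figures explicit, the following minimal sketch derives accuracy, recall, and precision from the four confusion-matrix counts. The label lists are made-up toy values (1 = rain, 0 = no rain), not data from this project, and `binary_metrics` is a hypothetical helper introduced only for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion counts and basic metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # rain predicted, rain observed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # no rain predicted, no rain observed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed rainy day
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(pairs),
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }

# Toy labels: 4 rainy days and 6 dry days (illustrative values only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case accuracy is 0.7 although recall is only 0.5, which shows how a majority of dry days can keep accuracy high even when half of the rainy days are missed.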

**Conclusion:**

- The Random Forest model is overfitting the training data, evidenced by the high performance on the training set and reduced effectiveness on the test set.
- The substantial difference in recall between the training and test sets highlights the need to improve the model's ability to generalize.
- The increase in false negatives and false positives on the test set is a concern, especially in practical applications where accurately predicting rain is crucial.
- The next steps should focus on addressing overfitting, possibly through hyperparameter tuning, pruning, or incorporating regularization techniques.
- Given the discrepancy in performance, further investigation into feature selection and engineering may also be beneficial to improve the model's generalization ability.

The model's current state suggests it is tailored too closely to the training data and requires adjustments to enhance its predictive power on new, unseen data.
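
One way to see how constraining tree growth affects the train/test gap is the minimal sketch below. It uses a synthetic dataset from `make_classification` as a stand-in for the weather data, and the constrained settings (`max_depth=5`, `min_samples_leaf=10`) are arbitrary example values, not tuned results. `train_test_gap` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, noisy binary data (flip_y adds label noise, so memorizing it does not help)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

def train_test_gap(**rf_kwargs):
    """Fit a forest and return (train accuracy, test accuracy, gap)."""
    model = RandomForestClassifier(n_estimators=100, random_state=0, **rf_kwargs)
    model.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    test_acc = accuracy_score(y_te, model.predict(X_te))
    return train_acc, test_acc, train_acc - test_acc

full = train_test_gap()                                         # fully grown trees
constrained = train_test_gap(max_depth=5, min_samples_leaf=10)  # limited tree growth

print(f"Full trees:        train={full[0]:.3f}, test={full[1]:.3f}, gap={full[2]:.3f}")
print(f"Constrained trees: train={constrained[0]:.3f}, test={constrained[1]:.3f}, gap={constrained[2]:.3f}")
```

A smaller gap does not by itself guarantee a higher test score, so the test metrics still have to be compared directly.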

## Hyperparameter Tuning

We tune the Random Forest model's hyperparameters with `RandomizedSearchCV`, which evaluates a random sample of parameter combinations instead of every combination in the search space.

**Key Points:**
- **Parameter Distribution**: Targets key parameters like `n_estimators`, `max_depth`, `min_samples_split`, `min_samples_leaf`, `max_features`, and `bootstrap`.
- **Random Forest Configuration**: Configured for parallel processing (`n_jobs=-1`) to expedite the computation process.
- **RandomizedSearchCV Setup**: Employs 5-fold cross-validation, focusing on accuracy as the scoring criterion, and verbose output for progress tracking.

**Execution:**
RandomizedSearchCV will randomly sample from the defined parameter space and evaluate various parameter combinations. The optimal combination, determined by the highest cross-validated accuracy, will be selected.

**Outcome:**
At the end of this process, we will identify the most effective set of parameters (`best_parameters`) and their associated accuracy score (`best_score`). This fine-tuned model is expected to outperform the initial baseline model in terms of accuracy and overall performance.

In [None]:
# Define the parameter search space for Random Forest
param_dist = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [3, 4, 5, 6, None],
    'min_samples_split': [2, 4, 6, 8],
    'min_samples_leaf': [1, 2, 4],
    # 'auto' was deprecated in scikit-learn 1.1 and removed in 1.3;
    # for classifiers it was equivalent to 'sqrt', so it is dropped here
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False]
}

# Initialize the Random Forest classifier
rf = RandomForestClassifier(n_jobs=-1)

# Initialize RandomizedSearchCV for Random Forest
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_dist,
                                   n_iter=100, scoring='accuracy', cv=5, verbose=3, random_state=42, n_jobs=-1)

# Fit RandomizedSearchCV
random_search.fit(X_train_transformed, y_train)

# Get the best parameters and score
best_parameters = random_search.best_params_
best_score = random_search.best_score_

# Display the best parameters and score
print("Best Parameters:", best_parameters)
print("Best Score:", best_score)

After the initial training of various models, the Random Forest algorithm emerged as the most promising for our classification task. To improve its predictive performance, we applied RandomizedSearchCV for hyperparameter tuning, which samples combinations from a predefined search space and keeps the one with the best cross-validated score.

**Parameters Explored:**
  - `n_estimators`: Number of trees in the forest.
  - `min_samples_split`: Minimum number of samples required to split an internal node.
  - `min_samples_leaf`: Minimum number of samples required to be at a leaf node.
  - `max_features`: Number of features to consider when looking for the best split.
  - `max_depth`: Maximum depth of the trees.
  - `bootstrap`: Whether each tree is built on a bootstrap sample of the training data (`True`) or on the whole training set (`False`).

**Optimization Method:**
  - We employed RandomizedSearchCV with a 5-fold cross-validation strategy. This method randomly samples from the parameter space and provides a broad search of the parameters.
  - Our scoring metric was `accuracy`, aligning with our primary objective of achieving high predictive accuracy.
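
To give a sense of the search budget, the sketch below counts the combinations in the search space listed in this section and compares the number of model fits for an exhaustive grid with the `n_iter=100`, 5-fold randomized search. Sampling with Python's `random` module is only a stand-in for scikit-learn's internal sampler, and `'auto'` is left out of `max_features` because it was an alias of `'sqrt'` for classifiers.

```python
import itertools
import random

# Search space as listed in this section ('auto' omitted: alias of 'sqrt' for classifiers)
param_dist = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [3, 4, 5, 6, None],
    "min_samples_split": [2, 4, 6, 8],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}
n_iter, cv = 100, 5

# Every combination an exhaustive grid search would have to evaluate
names = list(param_dist)
all_combos = [dict(zip(names, values))
              for values in itertools.product(*param_dist.values())]

# A randomized search evaluates only n_iter of them (illustrative sampling)
rng = random.Random(42)
sampled = rng.sample(all_combos, n_iter)

grid_fits = len(all_combos) * cv
random_fits = len(sampled) * cv
print(f"Distinct combinations: {len(all_combos)}")
print(f"Model fits, exhaustive grid: {grid_fits}")
print(f"Model fits, randomized search: {random_fits}")
```

Under these assumptions the randomized search covers roughly a tenth of the distinct combinations, which is the trade-off between runtime and search coverage.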

**Best Parameters Identified:**
  - `n_estimators`: 200
  - `min_samples_split`: 6
  - `min_samples_leaf`: 2
  - `max_features`: 'log2'
  - `max_depth`: None (indicating full growth of trees)
  - `bootstrap`: False

**Best Score Achieved:**
  - The best accuracy score attained with these parameters is approximately **88.66%**.

The search identified a configuration with a cross-validated accuracy of about 88.66% on the training data. Because this figure is not directly comparable to the baseline's test accuracy, we next retrain the Random Forest model with these parameters and assess its performance on the test dataset.

In [None]:
# Hard-code the best parameters so the search does not have to be rerun
best_parameters = {
    'n_estimators': 200,
    'min_samples_split': 6,
    'min_samples_leaf': 2,
    'max_features': 'log2',
    'max_depth': None, # indicating full growth of trees
    'bootstrap': False
}

In [None]:
# Retrain the Random Forest model with optimized hyperparameters
rf_optimized = RandomForestClassifier(**best_parameters, n_jobs=-1)
rf_optimized.fit(X_train_transformed, y_train)

# Predictions on training and test sets
train_preds_optimized = rf_optimized.predict(X_train_transformed)
test_preds_optimized = rf_optimized.predict(X_test_transformed)

# **Model Evaluation**

## Comparative Analysis of Initial and Optimized Random Forest Model

In [None]:
# Evaluation metrics for training data
train_accuracy_opt = accuracy_score(y_train, train_preds_optimized)
train_recall_opt = recall_score(y_train, train_preds_optimized)
train_precision_opt = precision_score(y_train, train_preds_optimized)
train_fn_opt = confusion_matrix(y_train, train_preds_optimized)[1][0]  # Extracting FN count
train_fp_opt = confusion_matrix(y_train, train_preds_optimized)[0][1]  # Extracting FP count

# Evaluation metrics for test data
test_accuracy_opt = accuracy_score(y_test, test_preds_optimized)
test_recall_opt = recall_score(y_test, test_preds_optimized)
test_precision_opt = precision_score(y_test, test_preds_optimized)
test_fn_opt = confusion_matrix(y_test, test_preds_optimized)[1][0]  # Extracting FN count
test_fp_opt = confusion_matrix(y_test, test_preds_optimized)[0][1]  # Extracting FP count

# Display the metrics
print("Optimized Model - Training Metrics:")
print(f"Accuracy: {train_accuracy_opt:.4f}, Recall: {train_recall_opt:.4f}, Precision: {train_precision_opt:.4f}, FN: {train_fn_opt}, FP: {train_fp_opt}")
print("\nOptimized Model - Test Metrics:")
print(f"Accuracy: {test_accuracy_opt:.4f}, Recall: {test_recall_opt:.4f}, Precision: {test_precision_opt:.4f}, FN: {test_fn_opt}, FP: {test_fp_opt}")

After hyperparameter tuning, the optimized Random Forest model demonstrates the following changes in performance compared to the initial model:

#### Training Metrics Comparison:
- **Accuracy**: Decreased from 99.98% to 99.82%.
- **Recall**: Decreased from 99.96% to 99.73%.
- **Precision**: Decreased from 99.99% to 99.91%.
- **False Negatives (FN)**: Increased from 37 to 246.
- **False Positives (FP)**: Increased from 6 to 86.

#### Test Metrics Comparison:
- **Accuracy**: Improved from 82.48% to 82.79%.
- **Recall**: Decreased from 55.84% to 54.52%.
- **Precision**: Increased from 61.32% to 62.63%.
- **False Negatives (FN)**: Increased from 2835 to 2920.
- **False Positives (FP)**: Decreased from 2261 to 2088.

#### Insights:
- The optimized model shows a slight improvement in test accuracy (0.31 percentage points), indicating marginally better generalization to new data.
- The increase in test precision (61.32% to 62.63%) means a larger share of the model's rain predictions are correct.
- The decrease in test recall (55.84% to 54.52%) means the model misses slightly more of the actual rainy days.
- The increase in false negatives and false positives on the training set reflects the tighter constraints on tree growth (`min_samples_split=6`, `min_samples_leaf=2`), which make the trees fit the training data slightly less closely.

#### Conclusion:
- Hyperparameter tuning has led to modest improvements in test accuracy and precision, at the cost of slightly lower recall.
- Overall performance remains stable, but the gap between training accuracy (99.82%) and test accuracy (82.79%) shows that the model still overfits, so there is room for further optimization.
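
The size of these changes can be checked with a few lines of arithmetic. The sketch below uses only the test-set figures reported in this section; the variable names are illustrative.

```python
# Test-set figures as reported in this section (rates in percent)
initial = {"accuracy": 82.48, "recall": 55.84, "precision": 61.32, "fn": 2835, "fp": 2261}
optimized = {"accuracy": 82.79, "recall": 54.52, "precision": 62.63, "fn": 2920, "fp": 2088}

# Change per metric: percentage points for the rates, raw counts for FN and FP
delta = {name: round(optimized[name] - initial[name], 2) for name in initial}

# Total misclassified test days before and after tuning
errors_initial = initial["fn"] + initial["fp"]
errors_optimized = optimized["fn"] + optimized["fp"]

print(delta)
print(f"Errors: {errors_initial} -> {errors_optimized} "
      f"({errors_initial - errors_optimized} fewer)")
```

Tuning trades 85 additional missed rainy days for 173 fewer false alarms, a net reduction of 88 misclassified days.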

## Random Forest Model Evaluation and Analysis

#### Evaluation Metrics Interpretation:
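1. **Accuracy**: Measures the overall correctness of the model. Our model achieves approximately 82.79% accuracy on the test data, meaning it classifies most days correctly, whether rainy or dry. Because rainy days are the minority class, accuracy alone overstates how well rain is detected, which is why recall and precision are reported as well.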
2. **Recall**: This metric is vital, particularly when the cost of false negatives (predicting no rain when it actually rains) is high. Our model has a recall of 54.52%, indicating it identifies over half of the actual rainy days. For weather forecasting, striving for a higher recall would be advantageous to minimize the risk of unexpected rain.
3. **Precision**: At a precision of 62.63%, our model is fairly dependable when predicting rain. This implies that when the model forecasts rain, there's around a 63% probability that it will indeed occur, which is crucial for planning activities that are weather-dependent.

#### Strengths and Weaknesses:
- **Strengths**:
  - The model demonstrates good accuracy and precision, making it a reliable tool for weather forecasting.
  - The Random Forest model effectively captures complex relationships in weather data, which is beneficial for predicting varied weather patterns.
- **Weaknesses**:
  - The model's recall is moderate, indicating it might miss a fair number of rainy days, which can be critical for sectors like agriculture or event planning.
  - The model might not perform as well in predicting rare or extreme weather events due to limitations in the training dataset.

#### Insights from EDA and Further Improvement:
- The EDA highlighted key features influencing rainfall prediction, aiding in effective feature selection and engineering.
- To improve recall without significantly affecting precision, exploring additional data sources or different modeling techniques might be beneficial. Incorporating time-series analysis could also be advantageous, given the temporal nature of weather.
- Advanced ensemble methods or deep learning techniques could provide more nuanced predictions, especially for complex weather patterns.
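
One inexpensive option that this notebook does not explore is lowering the decision threshold applied to the predicted rain probability, which typically raises recall at some cost in precision. The sketch below illustrates the idea with made-up probabilities. In practice they could come from a call such as `rf_optimized.predict_proba(X_test_transformed)[:, 1]`, and both thresholds shown are arbitrary. `recall_precision` is a hypothetical helper.

```python
def recall_precision(y_true, probs, threshold):
    """Recall and precision when rain is predicted for probs >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Made-up labels (1 = rain) and predicted rain probabilities, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.6, 0.4, 0.35, 0.45, 0.3, 0.2, 0.1, 0.05, 0.55]

for threshold in (0.5, 0.3):
    recall, precision = recall_precision(y_true, probs, threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, precision={precision:.2f}")
```

Whether such a shift is worthwhile depends on how costly a missed rainy day is compared with a false alarm for the intended users.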

#### Business Domain Application:
- The model's high precision is useful for industries like agriculture and outdoor event management, assisting in decision-making and risk management.
- However, the moderate recall suggests these sectors should use the model's predictions in conjunction with other information sources or contingency plans to address the risk of unforeseen rainfall.

In summary, the model offers a balanced approach to predicting rainfall, though it should be applied with an understanding of its limitations, especially regarding recall. Continuous improvement and integration with other weather prediction methods can enhance its utility in practical scenarios.

# **Testing the Second Best Algorithm (XGBoost)**

Despite the promising results in cross-validation, our initial choice, the Random Forest model, did not deliver the test-set performance we anticipated. The gap between its cross-validation score (about 88.66%) and its test accuracy (82.79%) led us to explore the second-best algorithm from our initial analysis: XGBoost.

XGBoost, known for its efficiency and effectiveness, has often been the algorithm of choice in various data science competitions and real-world applications. Its ability to handle complex datasets with a mix of categorical and numerical features, as well as its flexibility in tuning, makes it a strong candidate for our task of weather prediction.

In [None]:
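# Defining the model with GPU support
# (these GPU arguments target XGBoost releases before 2.0; from 2.0 onward the
# equivalent setup is tree_method='hist' together with device='cuda')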
xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='logloss', tree_method='gpu_hist', gpu_id=0, predictor='gpu_predictor', n_jobs=-1)

# Training the model on the training dataset
xgb_model.fit(X_train_transformed, y_train)

## Initial Evaluation on Training and Testing Sets

In [None]:
# Predictions on training data
train_preds = xgb_model.predict(X_train_transformed)

# Predictions on test data
test_preds = xgb_model.predict(X_test_transformed)

# Evaluation metrics for training data
train_accuracy = accuracy_score(y_train, train_preds)
train_recall = recall_score(y_train, train_preds)
train_precision = precision_score(y_train, train_preds)

# Evaluation metrics for test data
test_accuracy = accuracy_score(y_test, test_preds)
test_recall = recall_score(y_test, test_preds)
test_precision = precision_score(y_test, test_preds)

# Confusion matrix for False Negatives (FN) and False Positives (FP) analysis
train_fn = confusion_matrix(y_train, train_preds)[1][0]
test_fn = confusion_matrix(y_test, test_preds)[1][0]
train_fp = confusion_matrix(y_train, train_preds)[0][1]
test_fp = confusion_matrix(y_test, test_preds)[0][1]

# Displaying the evaluation results
print("Training Metrics:")
print(f"Accuracy: {train_accuracy:.4f}, Recall: {train_recall:.4f}, Precision: {train_precision:.4f}, FN: {train_fn}, FP: {train_fp}")
print("\nTest Metrics:")
print(f"Accuracy: {test_accuracy:.4f}, Recall: {test_recall:.4f}, Precision: {test_precision:.4f}, FN: {test_fn}, FP: {test_fp}")

**Initial XGBoost Model Evaluation:**

1. **Training Metrics**:
   - **Accuracy**: 88.41% 
   - **Recall**: 84.49%
   - **Precision**: 91.68%
   - **False Negatives**: 14,102
   - **False Positives**: 6,970

2. **Test Metrics**:
   - **Accuracy**: 83.56%
   - **Recall**: 56.46%
   - **Precision**: 64.58%
   - **False Negatives**: 2,795
   - **False Positives**: 1,988

**Analysis:**
- The XGBoost model demonstrates strong performance on the training set with high accuracy, recall, and precision. This indicates its ability to correctly identify both positive and negative classes while maintaining a balance between sensitivity and specificity.
- However, on the test set, there is a notable decrease in performance metrics, particularly in recall and precision. This suggests some overfitting to the training data and a decreased ability to generalize to unseen data.
- The model captures a significant portion of the positive (rainy days) cases in the training set but misses a larger proportion in the test set, as indicated by the recall scores.

**Next Steps:**
- Given the promising results of XGBoost on the training data, there is potential for further optimization. We will proceed with hyperparameter tuning to enhance the model's ability to generalize and improve its performance on the test data.
- The tuning process will focus on parameters that control the model's complexity and learning rate to address overfitting and improve recall and precision on the test data.

The subsequent step involves hyperparameter tuning of the XGBoost model, aiming to strike a better balance between training and test performance, especially in improving recall and precision on the test set.

## Hyperparameter Tuning

In [None]:

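# Define the parameter distributions
# (randint and uniform are assumed to come from scipy.stats)
from scipy.stats import randint, uniform

# randint(low, high) excludes high; uniform(loc, scale) covers [loc, loc + scale]
param_dist = {
    'n_estimators': randint(50, 500),       # integers 50 to 499
    'learning_rate': uniform(0.01, 0.2),    # 0.01 to 0.21
    'max_depth': randint(3, 10),            # integers 3 to 9
    'min_child_weight': randint(1, 6),      # integers 1 to 5
    'gamma': uniform(0, 0.5),               # 0 to 0.5
    'subsample': uniform(0.6, 0.4),         # 0.6 to 1.0
    'colsample_bytree': uniform(0.6, 0.4)   # 0.6 to 1.0
}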

# Initialize the classifier with GPU support
xgb = XGBClassifier(use_label_encoder=False, eval_metric='logloss', tree_method='gpu_hist', gpu_id=0, predictor='gpu_predictor')

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=xgb, param_distributions=param_dist, 
                                   n_iter=100, scoring='accuracy', cv=5, verbose=3, n_jobs=-1, random_state=42)

# Fit RandomizedSearchCV
random_search.fit(X_train_transformed, y_train)

# Get the best parameters and score
best_parameters = random_search.best_params_
best_score = random_search.best_score_

# Display the best parameters and score
print("Best Parameters:", best_parameters)
print("Best Score:", best_score)

Best Parameters: {'colsample_bytree': 0.7737577462041715, 'gamma': 0.17503920384733784, 'learning_rate': 0.139020672406113, 'max_depth': 9, 'min_child_weight': 1, 'n_estimators': 366, 'subsample': 0.7996773519539009}
Best Score: 0.8716502304796834
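
The search space above relies on distribution objects whose arguments are easy to misread. Assuming `randint` and `uniform` were imported from `scipy.stats`, `uniform(loc, scale)` covers `[loc, loc + scale]` rather than `[low, high]`, and `randint(low, high)` excludes `high`. The following sketch draws samples from three of the distributions to check their ranges; the sample size and seed are arbitrary.

```python
from scipy.stats import randint, uniform

# Same definitions as in the search space; 1000 draws each with a fixed seed
samples = {
    "learning_rate": uniform(0.01, 0.2).rvs(size=1000, random_state=0),
    "subsample": uniform(0.6, 0.4).rvs(size=1000, random_state=0),
    "max_depth": randint(3, 10).rvs(size=1000, random_state=0),
}

# Print the observed range of each sampled hyperparameter
for name, values in samples.items():
    print(f"{name}: min={values.min():.3f}, max={values.max():.3f}")
```

The reported best `max_depth` of 9 is the largest value that `randint(3, 10)` can produce, so widening that range might be worth exploring.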

In [None]:
# Extracted optimized parameters
best_parameters = {
    'colsample_bytree': 0.7737577462041715, 
    'gamma': 0.17503920384733784, 
    'learning_rate': 0.139020672406113, 
    'max_depth': 9, 
    'min_child_weight': 1, 
    'n_estimators': 366, 
    'subsample': 0.7996773519539009
}

# Defining the model with GPU support and optimized parameters
xgb_optimised = XGBClassifier(**best_parameters, use_label_encoder=False, eval_metric='logloss', 
                              tree_method='gpu_hist', gpu_id=0, predictor='gpu_predictor', n_jobs=-1)

# Training the model on the training dataset
xgb_optimised.fit(X_train_transformed, y_train)

In [None]:
# Predictions on training data (using the tuned model, not the initial xgb_model)
train_preds = xgb_optimised.predict(X_train_transformed)

# Predictions on test data
test_preds = xgb_optimised.predict(X_test_transformed)

# Evaluation metrics for training data
train_accuracy = accuracy_score(y_train, train_preds)
train_recall = recall_score(y_train, train_preds)
train_precision = precision_score(y_train, train_preds)

# Evaluation metrics for test data
test_accuracy = accuracy_score(y_test, test_preds)
test_recall = recall_score(y_test, test_preds)
test_precision = precision_score(y_test, test_preds)

# Confusion matrix for False Negatives (FN) and False Positives (FP) analysis
train_fn = confusion_matrix(y_train, train_preds)[1][0]
test_fn = confusion_matrix(y_test, test_preds)[1][0]
train_fp = confusion_matrix(y_train, train_preds)[0][1]
test_fp = confusion_matrix(y_test, test_preds)[0][1]

# Displaying the evaluation results
print("Training Metrics:")
print(f"Accuracy: {train_accuracy:.4f}, Recall: {train_recall:.4f}, Precision: {train_precision:.4f}, FN: {train_fn}, FP: {train_fp}")
print("\nTest Metrics:")
print(f"Accuracy: {test_accuracy:.4f}, Recall: {test_recall:.4f}, Precision: {test_precision:.4f}, FN: {test_fn}, FP: {test_fp}")

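The comparison below sets my XGBoost model against the initial Random Forest model. Note that the XGBoost figures listed here match the initial (untuned) XGBoost evaluation exactly, so they should be read as the untuned model's results.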

### XGBoost Model Performance:

#### Training Metrics:
- **Accuracy**: 88.41%
- **Recall**: 84.49%
- **Precision**: 91.68%
- **False Negatives (FN)**: 14,102
- **False Positives (FP)**: 6,970

#### Test Metrics:
- **Accuracy**: 83.56%
- **Recall**: 56.46%
- **Precision**: 64.58%
- **False Negatives (FN)**: 2,795
- **False Positives (FP)**: 1,988

### Random Forest Model Performance (For Reference):

#### Training Metrics:
- **Accuracy**: 99.98%
- **Recall**: 99.96%
- **Precision**: 99.99%
- **False Negatives (FN)**: 37
- **False Positives (FP)**: 6

#### Test Metrics:
- **Accuracy**: 82.48%
- **Recall**: 55.84%
- **Precision**: 61.32%
- **False Negatives (FN)**: 2,835
- **False Positives (FP)**: 2,261

### Comparative Analysis:

- **Training Metrics**:
  - The Random Forest model showed almost perfect training metrics, hinting at overfitting. My XGBoost model's training metrics are lower (88.41% accuracy) and much closer to its test metrics, suggesting less overfitting.
  - The stark contrast between training and test metrics in the Random Forest model emphasized its overfitting issue.

- **Test Metrics**:
  - My XGBoost model surpassed Random Forest in test accuracy (83.56% vs. 82.48%), indicating superior generalization.
  - In terms of recall, XGBoost also slightly improved (56.46% vs. 55.84%), which is crucial for reducing false negatives.
  - XGBoost had a higher precision (64.58% vs. 61.32%), suggesting it's more accurate in predicting positive (rainy) days.
  - XGBoost produces fewer False Negatives (2,795 vs. 2,835) and fewer False Positives (1,988 vs. 2,261), in line with its higher recall and precision.

### Conclusion:
- The XGBoost model appears to offer a more balanced and generalized performance compared to the Random Forest model. Its better accuracy, recall, and precision on the test set indicate that it's more apt for this task, particularly in real-world scenarios where overfitting is a major issue.
- Despite Random Forest showing exceptionally high training metrics, its overfitting to the training data diminishes its effectiveness on new data. Thus, XGBoost emerges as a more reliable choice for predicting rainfall in our context.
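
The difference in overfitting can be quantified as the gap between training and test accuracy. The sketch below uses only the accuracy figures reported in this notebook; the dictionary layout is illustrative.

```python
# Accuracy figures (percent) as reported in this notebook
reported = {
    "Random Forest": {"train": 99.98, "test": 82.48},
    "XGBoost": {"train": 88.41, "test": 83.56},
}

# Generalization gap in percentage points: training accuracy minus test accuracy
gaps = {name: round(acc["train"] - acc["test"], 2) for name, acc in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:.2f} percentage points")
```

The Random Forest gap of 17.50 points is more than three times the XGBoost gap of 4.85 points, which supports the conclusion that XGBoost overfits less.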

#### Evaluation Metrics Interpretation for XGBoost:
1. **Accuracy**: This metric assesses the overall correctness of the model. Our XGBoost model achieves approximately 83.56% accuracy on the test data, a commendable rate indicating its proficiency in predicting rainfall accurately in most instances.
2. **Recall**: Especially crucial in weather forecasting, where missing a rainy day could be costly. Our model's recall of 56.46% means it successfully identifies more than half of the actual rainy days, but there's room for improvement to reduce the chances of unexpected rain.
3. **Precision**: With a precision of 64.58%, about 65% of the model's rain forecasts are correct. A rain forecast is therefore right more often than not, which is useful for weather-dependent activities, but roughly a third of the forecasts are false alarms.

#### Strengths and Weaknesses Compared to Random Forest:
- **Strengths**:
  - XGBoost outperforms Random Forest in test accuracy and precision, making it more reliable for general weather prediction.
  - Handles various data patterns adeptly, crucial for modeling complex weather dynamics.
- **Weaknesses**:
  - Despite a slightly better recall than Random Forest, it's still moderate. This means the model might miss several rainy days, critical for industries reliant on precise weather forecasting.
  - May not capture rare or extreme weather events effectively due to data limitations.

#### Insights and Further Improvement:
- Insights from EDA were instrumental in identifying influential features for rainfall prediction.
- To boost recall, exploring more diverse data, possibly incorporating more extreme weather events, or experimenting with different model configurations could be advantageous.
- Considering temporal analysis or advanced ensemble methods might offer a more nuanced understanding of weather patterns.

#### Business Domain Application:
- Ideal for sectors like agriculture and event planning due to its high precision, aiding in effective and proactive decision-making.
- However, the moderate recall necessitates additional precautions or supplementary information sources to mitigate the risks associated with unpredicted rainfall.

Overall, while XGBoost provides a more balanced and robust approach than Random Forest for predicting rainfall, it's crucial to be mindful of its limitations, particularly in recall. Ongoing improvements and complementary methods can enhance its effectiveness in real-world applications.

# **Model Saving**

In [None]:

import pickle  # harmless if already imported earlier in the notebook

# Save the model to disk
with open('xgboost_optimized_model.pkl', 'wb') as file:
    pickle.dump(xgb_optimised, file)

# Save the pipeline to disk
with open('pipeline.pkl', 'wb') as file:
    pickle.dump(pipeline, file)

# Save the label encoder to disk
with open('lerain.pkl', 'wb') as file:
    pickle.dump(le_rain, file)    

print("Model saved successfully!")
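
For inference, the saved artifacts can be loaded back with `pickle.load`. The sketch below shows the round trip using a plain dictionary as a stand-in for a fitted object such as `xgb_optimised`, and a temporary directory instead of the working directory; both are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

# Hypothetical placeholder for a fitted object such as xgb_optimised
artifact = {"name": "xgboost_optimized_model", "n_estimators": 366}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "xgboost_optimized_model.pkl")

    # Save the object to disk
    with open(path, "wb") as file:
        pickle.dump(artifact, file)

    # Load it back, as an inference script would
    with open(path, "rb") as file:
        restored = pickle.load(file)

print("Round trip preserved the object:", restored == artifact)
```

The same pattern applies to `pipeline.pkl` and `lerain.pkl`. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.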

# **Model Inference**

Check the other notebook, or go to:

https://huggingface.co/spaces/7sugiwa/Milestone_2

# **Conclusion**

This project embarked on developing a Machine Learning model to predict rainfall in Australia, aiming to assist farmers, agricultural businesses, and related sectors in making informed decisions. Throughout this journey, we explored various algorithms, engaged in rigorous feature engineering, and optimized our models to achieve the most accurate predictions possible.

**Key Takeaways:**
1. **Data Preparation and EDA**: Our exploratory data analysis provided valuable insights into the weather patterns and critical features influencing rainfall. This step was crucial in guiding our feature engineering and model selection processes.

2. **Model Selection and Evaluation**: Initially, Random Forest showed promise in cross-validation; however, its performance on the test dataset was not entirely satisfactory. We then shifted our focus to XGBoost, which, with the aid of SMOTE for handling class imbalance, demonstrated better performance, particularly in accuracy and precision.

3. **Hyperparameter Tuning**: Using `RandomizedSearchCV`, we tuned the XGBoost model; the best configuration reached a cross-validated accuracy of about 87.17%. This step aimed to improve the model's generalization to unseen data.

4. **Final Model Performance**: The XGBoost model achieved an accuracy of 83.56%, a recall of 56.46%, and a precision of 64.58% on the test set. While there's room for improvement in recall, the model is a useful tool for rainfall prediction as long as its limitations are kept in mind.

5. **Business Implications**: The model's precision makes it a valuable asset for sectors like agriculture and outdoor event management, facilitating better planning and risk management. However, the moderate recall suggests a need for supplementary measures or additional data sources to account for the model's limitations.

**Future Directions:**
- **Data Enhancement**: Incorporating more diverse and extensive weather data, including rare and extreme events, could further improve the model's accuracy and recall.
- **Advanced Techniques**: Exploring more sophisticated algorithms or ensemble methods could yield better predictions, especially for complex weather scenarios.
- **Real-Time Analysis**: Integrating real-time weather data and moving towards a dynamic, continuous learning model could provide more accurate and timely predictions.

In conclusion, the project has successfully demonstrated the application of Machine Learning in weather forecasting, offering a tool that, while not without its limitations, provides significant value in predicting rainfall. With continuous improvements and integration with other forecasting methods, the model's utility and accuracy can be further enhanced, making it an even more robust tool for various sectors dependent on weather conditions.