## Description

This notebook predicts whether it rains on a given day in Australia, using XGBoost on the Rain in Australia dataset from Kaggle.
- Dataset: https://www.kaggle.com/jsphyg/weather-dataset-rattle-package

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
!ls

README.md      rainAUS.ipynb  weatherAUS.csv


In [3]:
df = pd.read_csv('weatherAUS.csv')

In [4]:
df.head()

,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           145460 non-null  object 
 1   Location       145460 non-null  object 
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object 
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object 
 10  WindDir3pm     141232 non-null  object 
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null

In [6]:
#change date col to datetime data type
df.Date = pd.to_datetime(df.Date)

## Exploratory Analysis

In [7]:
df_tmp = df.copy()

In [8]:
# set Date col as the index 
df_tmp.index = df_tmp.Date

In [9]:
#drop Date col
df_tmp.drop(['Date'], axis=1, inplace=True)

In [10]:
#check
df_tmp.head(2)

Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,WNW,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,WSW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No


In [None]:
# Rainfall time series
fig = plt.figure(figsize=(12,4)) 
sns.lineplot(data=df_tmp['Rainfall'], color='purple', linewidth=1);

In [None]:
# Aggregate to monthly mean rainfall (pandas >= 2.2 deprecates the 'M' alias in favor of 'ME')
df_tmp = df_tmp.resample('M')[['Rainfall']].mean()


In [None]:
# Rainfall time series, monthly means
plt.figure(figsize=(12,4)) 
plt.title("Rainfall in AUS")
sns.lineplot(data=df_tmp['Rainfall'], color='purple', linewidth=1);
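
To make the monthly aggregation concrete, here is a minimal sketch on a tiny made-up series. The dates and values are invented for illustration and are not taken from `weatherAUS.csv`. It uses the `"MS"` (month start) alias rather than the notebook's `'M'`, only because `"MS"` behaves the same across pandas versions; the monthly means are identical and only the bin labels differ.

```python
import pandas as pd

# Toy daily rainfall values (invented for illustration).
toy = pd.DataFrame(
    {"Rainfall": [0.0, 2.0, 4.0, 1.0, 5.0]},
    index=pd.to_datetime(
        ["2008-12-01", "2008-12-15", "2008-12-31", "2009-01-10", "2009-01-20"]
    ),
)

# "MS" labels each bin by the first day of the month; 'M' would label it
# by the last day. The values inside each bin are the same either way.
monthly = toy.resample("MS")[["Rainfall"]].mean()
print(monthly)
```

Each output row is the mean of all daily readings that fall in that calendar month, which is why the monthly plot is much smoother than the daily one.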

Now that we have explored how rainfall varies over the years, I want to go back to `df` and prepare it for XGBoost.

## Cleaning

### Define

Drop unnecessary features (columns).

### Code

In [None]:
df.columns

I want to predict `RainToday`, i.e. our target.

In [None]:
cols = ['Date', 'Location', 'Rainfall', 'RainTomorrow']
df.drop(cols, axis=1, inplace=True)

### Test

In [None]:
df.columns

### Define

Remove columns in which at least **40%** of the values are NaN.

### Code

In [None]:
cols = df.isna().mean(axis=0)
cols

In [None]:
cols_drop = cols[cols >= 0.4]
cols_drop

In [None]:
df.drop(cols_drop.index, axis=1, inplace=True)
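
As a quick check of the thresholding logic, the sketch below applies the same `isna().mean()` rule to a made-up frame. The column names `a`, `b`, `c` and their values are hypothetical. It shows that a column sitting exactly at 40% missing is dropped too, because the comparison is `>=`.

```python
import numpy as np
import pandas as pd

# Toy frame (invented): 20%, 40% and 60% missing values per column.
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, 4.0, 5.0],
        "b": [np.nan, np.nan, 3.0, 4.0, 5.0],
        "c": [np.nan, np.nan, np.nan, 4.0, 5.0],
    }
)

# Fraction of NaNs per column, then drop every column at or above 0.4.
nan_fraction = toy.isna().mean(axis=0)
to_drop = nan_fraction[nan_fraction >= 0.4].index
kept = toy.drop(columns=to_drop)
print(list(kept.columns))
```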

### Test

In [None]:
df.columns

In [None]:
df.head()

In [None]:
df.WindGustDir.nunique(), df.WindDir9am.nunique(), df.WindDir3pm.nunique()

In [None]:
3*16  # one-hot columns from the three wind-direction features, if each has 16 distinct values

In [None]:
15+48

## Separate Features and Target

In [None]:
X = df.drop("RainToday", axis=1)
y = df.RainToday

## Preprocessing

For the categorical features, we will impute the missing values with the mode of the column and encode them with One-Hot encoding:

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

In [None]:
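categorical_pipeline = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="most_frequent")),
        # scikit-learn >= 1.2 names this argument `sparse_output`;
        # older releases call it `sparse`.
        ("oh-encode", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ]
)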
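
To see what this pipeline produces, here is a small sketch on an invented wind-direction column. The values are made up. It keeps the encoder's default sparse output and densifies with `.toarray()`, which is a choice made here only to sidestep the keyword difference between scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column with one missing value (invented).
toy = pd.DataFrame({"WindGustDir": ["W", "W", np.nan, "NE"]}, dtype=object)

pipe = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("oh-encode", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# The NaN is replaced by the mode ("W") before encoding.
encoded = pipe.fit_transform(toy).toarray()
print(pipe.named_steps["oh-encode"].categories_)
print(encoded)
```

The missing row ends up with the same encoding as the `"W"` rows, and each distinct category becomes its own 0/1 column.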

For the numeric features, I will impute missing values with the column mean and apply StandardScaler so that each feature has zero mean and unit variance:

In [None]:
from sklearn.preprocessing import StandardScaler


In [None]:
numeric_pipeline = Pipeline(
    steps=[("impute", SimpleImputer(strategy="mean")), 
           ("scale", StandardScaler())]
)
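
A minimal sketch of the same two steps on a made-up single column (values invented) shows the effect: the NaN is filled with the mean of the observed values, and the scaled column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric column with one missing value (invented).
toy = np.array([[1.0], [np.nan], [3.0], [5.0], [3.0]])

pipe = Pipeline(
    steps=[("impute", SimpleImputer(strategy="mean")), ("scale", StandardScaler())]
)
scaled = pipe.fit_transform(toy)

# The imputed entry equals the column mean, so it maps to 0 after scaling.
print(scaled.ravel())
print(scaled.mean(axis=0), scaled.std(axis=0))
```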

Finally, we will combine the two pipelines with a column transformer. To specify which columns the pipelines are designed for, we should first isolate the categorical and numeric feature names:

In [None]:
cat_cols = X.select_dtypes(exclude="number").columns
num_cols = X.select_dtypes(include="number").columns

Next, we will pass these, along with their corresponding pipelines, to a ColumnTransformer instance:

In [None]:
from sklearn.compose import ColumnTransformer


In [None]:
full_processor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_pipeline, num_cols),
        ("categorical", categorical_pipeline, cat_cols),
    ]
)
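
The sketch below puts the pieces together on a tiny invented frame with one numeric and one categorical column. The values are made up, and `sparse_threshold=0` is an addition here to force a dense result regardless of the encoder's output type. The point is to show how the output width grows: one scaled numeric column plus one column per category.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (invented): one numeric and one categorical feature.
toy = pd.DataFrame(
    {
        "MinTemp": [10.0, np.nan, 14.0],
        "WindGustDir": pd.Series(["W", "NE", np.nan], dtype=object),
    }
)

cat_cols = toy.select_dtypes(exclude="number").columns.tolist()
num_cols = toy.select_dtypes(include="number").columns.tolist()

numeric = Pipeline(
    steps=[("impute", SimpleImputer(strategy="mean")), ("scale", StandardScaler())]
)
categorical = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("oh-encode", OneHotEncoder(handle_unknown="ignore")),
    ]
)

processor = ColumnTransformer(
    transformers=[
        ("numeric", numeric, num_cols),
        ("categorical", categorical, cat_cols),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

out = processor.fit_transform(toy)
print(out.shape)  # 1 numeric column + 2 one-hot columns
```

The transformer output is a plain NumPy array, so the original column names are no longer attached to it.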

In [None]:
y.shape

In [None]:
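# Apply preprocessing
X_processed = full_processor.fit_transform(X)
# Fill missing target values with the most frequent label
y_processed = SimpleImputer(strategy="most_frequent").fit_transform(
    y.values.reshape(-1, 1)
)
# Flatten to 1-D and map "Yes"/"No" to 1/0; recent XGBoost versions expect numeric class labels
y_processed = (y_processed.ravel() == "Yes").astype(int)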

In [None]:
X_processed.shape

In [None]:
type(X_processed)

In [None]:
X.shape

In [None]:
type(X)

## Splitting the data

In [None]:
from sklearn.model_selection import train_test_split


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X_processed, y_processed, stratify=y_processed, random_state=1121218
)
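
To illustrate what `stratify` does, here is a small sketch on invented labels with a 20/80 class imbalance. The arrays and `random_state=0` are made up for the example. With stratification, the train and test parts keep the same class ratio as the full label array.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (invented): 20 positive and 80 negative labels.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 80)

# Default test_size is 0.25, so 75 rows go to train and 25 to test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=0
)

# Share of positives in each part.
print(y_tr.mean(), y_te.mean())
```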

## Applying XGBoost

In [None]:
import xgboost as xgb

In [None]:
xgb_cl = xgb.XGBClassifier()


In [None]:
xgb_cl.fit(X_train, y_train)
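
Once the classifier is fitted, a natural next step is to score it on the held-out split. The sketch below only shows the scoring calls from `sklearn.metrics`; the `y_true` and `y_pred` arrays are invented stand-ins, not results from this notebook. In practice they would be `y_test` and the output of `xgb_cl.predict(X_test)`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Stand-in labels and predictions (invented for illustration only).
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0, 0, 1])

# Fraction of correct predictions.
acc = accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(acc)
print(cm)
```

Because rainy days are the minority class, the confusion matrix is usually more informative than accuracy alone.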


In [None]:
y_processed.shape