# Problem 2 Rain in Australia Data Set
### Created By: Ivor Zalud
***
## The Problem
We want to predict whether it will rain tomorrow in Australia, given a set of daily weather characteristics. The columns in the data set are: Date, Location, MinTemp, MaxTemp, Rainfall, Evaporation, Sunshine, WindGustDir, WindGustSpeed, WindDir9am, WindDir3pm, WindSpeed9am, WindSpeed3pm, Humidity9am, Humidity3pm, Pressure9am, Pressure3pm, Cloud9am, Cloud3pm, Temp9am, Temp3pm, RainToday, RISK_MM, RainTomorrow. RainTomorrow is the target; the remaining columns are candidate features. We drop RISK_MM because it records the amount of rain on the following day, so keeping it would leak the target into the features.
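
To make the leakage concern concrete, here is a minimal sketch on made-up numbers. It assumes that the target is derived from RISK_MM with a 1 mm threshold; the threshold and the toy values are assumptions for illustration, not taken from the data set.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the real data (values are invented for illustration)
toy = pd.DataFrame({"RISK_MM": [0.0, 0.2, 1.4, 7.6, 0.0, 3.1, 0.4, 12.0]})

# Assumed rule: the label is "Yes" when next-day rainfall reaches 1 mm
toy["RainTomorrow"] = (toy["RISK_MM"] >= 1.0).map({True: "Yes", False: "No"})

# A single-split tree that sees only RISK_MM already separates the classes
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(toy[["RISK_MM"]], toy["RainTomorrow"])
leak_accuracy = stump.score(toy[["RISK_MM"]], toy["RainTomorrow"])
print(leak_accuracy)
```

If a column like this stayed in the feature set, the model's score would say little about its ability to forecast rain from the other measurements.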


## The Data
The data is provided by Joe Young on Kaggle [here](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package/version/1). It contains daily weather observations in Australia from 10/31/07 to 6/24/17.

## Methods
* Gradient-Boosted Trees as our multi-class classifier
* A linear regression fitted with Stochastic Gradient Descent


In [35]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# Gradient-Boosted Trees
***
- **Regularization**: scikit-learn illustrates the regularization options well [here](https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regularization.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-regularization-py). We'll use shrinkage (`learning_rate` < 1.0) together with `subsample` < 1.0. Shrinkage scales down the contribution of each tree, and subsampling reduces the variance in a way similar to bagging; combined, they can produce a more accurate model.
- **Loss function**: Deviance (the scikit-learn default), which is the loss function of logistic regression, also known as log loss. Training minimizes this loss.
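
As a rough illustration of the two regularization knobs, the sketch below fits three small models on synthetic data and reports the test log loss for each. The data set, the parameter values, and the names `settings` and `test_loss` are assumptions chosen for the example; the ranking may differ on the weather data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (not the weather data)
X, y = make_classification(n_samples=600, n_features=10,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

# Three regularization settings to compare
settings = {
    "no shrinkage": dict(learning_rate=1.0, subsample=1.0),
    "shrinkage": dict(learning_rate=0.1, subsample=1.0),
    "shrinkage + subsample": dict(learning_rate=0.1, subsample=0.5),
}

test_loss = {}
for name, params in settings.items():
    model = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                       random_state=0, **params)
    model.fit(X_tr, y_tr)
    # Log loss on held-out data: lower is better
    test_loss[name] = log_loss(y_te, model.predict_proba(X_te))
    print(f"{name}: test log loss = {test_loss[name]:.3f}")
```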


## 1. Cleaning and encoding the columns with string data
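- **String Data:** Every column that holds string values must be converted to numbers. We cast these columns to the pandas `category` dtype and replace them with their integer codes via the `.cat.codes` accessor. Missing values in these columns receive the code -1.
- **NaN Data:** We simply fill the remaining NaNs (in the numeric columns) with 0. This is likely a poor strategy, since 0 is a valid reading for some columns and an implausible one for others, such as pressure.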
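
To see how the encoding step treats missing values, here is a small sketch on a toy frame. The column names mirror the data set, but the values are invented. The median fill at the end is one possible alternative to filling with 0, shown as an assumption rather than as the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in both string and numeric columns
toy = pd.DataFrame({
    "WindGustDir": ["W", "NNW", np.nan, "W"],
    "RainToday": ["No", "Yes", "No", np.nan],
    "MinTemp": [13.4, np.nan, 12.9, 9.2],
})

# Same encoding as in the notebook: category dtype, then integer codes
cat_cols = ["WindGustDir", "RainToday"]
encoded = toy.copy()
encoded[cat_cols] = (encoded[cat_cols]
                     .astype("category")
                     .apply(lambda s: s.cat.codes))
print(encoded)  # missing categories show up as -1, not NaN

# Hypothetical alternative to fillna(0): fill numeric gaps with the median
numeric_cols = ["MinTemp"]
imputed = encoded.copy()
imputed[numeric_cols] = imputed[numeric_cols].fillna(
    imputed[numeric_cols].median())
print(imputed)
```

Because the categorical NaNs have already become -1, a later `fillna` only touches the numeric columns.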


In [36]:
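# Load the data and drop RISK_MM, which would leak the target
df = pd.read_csv('Data/weatherAUS.csv')
df = df.drop(columns='RISK_MM')

# Convert the string columns to categoricals, then to integer codes (NaN becomes -1)
cols = ['Date','Location','WindGustDir','WindDir9am','WindDir3pm','RainToday','RainTomorrow']
df[cols] = df[cols].astype('category')
cat_columns = df.select_dtypes(['category']).columns
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)

# Fill the remaining (numeric) NaNs with 0
df = df.fillna(0)
df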


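| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 396 | 2 | 13.4 | 22.9 | 0.6 | 0.0 | 0.0 | 13 | 44.0 | 13 | ... | 71.0 | 22.0 | 1007.7 | 1007.1 | 8.0 | 0.0 | 16.9 | 21.8 | 0 | 0 |
| 1 | 397 | 2 | 7.4 | 25.1 | 0.0 | 0.0 | 0.0 | 14 | 44.0 | 6 | ... | 44.0 | 25.0 | 1010.6 | 1007.8 | 0.0 | 0.0 | 17.2 | 24.3 | 0 | 0 |
| 2 | 398 | 2 | 12.9 | 25.7 | 0.0 | 0.0 | 0.0 | 15 | 46.0 | 13 | ... | 38.0 | 30.0 | 1007.6 | 1008.7 | 0.0 | 2.0 | 21.0 | 23.2 | 0 | 0 |
| 3 | 399 | 2 | 9.2 | 28.0 | 0.0 | 0.0 | 0.0 | 4 | 24.0 | 9 | ... | 45.0 | 16.0 | 1017.6 | 1012.8 | 0.0 | 0.0 | 18.1 | 26.5 | 0 | 0 |
| 4 | 400 | 2 | 17.5 | 32.3 | 1.0 | 0.0 | 0.0 | 13 | 41.0 | 1 | ... | 82.0 | 33.0 | 1010.8 | 1006.0 | 7.0 | 8.0 | 17.8 | 29.7 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145455 | 3431 | 41 | 2.8 | 23.4 | 0.0 | 0.0 | 0.0 | 0 | 31.0 | 9 | ... | 51.0 | 24.0 | 1024.6 | 1020.3 | 0.0 | 0.0 | 10.1 | 22.4 | 0 | 0 |
| 145456 | 3432 | 41 | 3.6 | 25.3 | 0.0 | 0.0 | 0.0 | 6 | 22.0 | 9 | ... | 56.0 | 21.0 | 1023.5 | 1019.1 | 0.0 | 0.0 | 10.9 | 24.5 | 0 | 0 |
| 145457 | 3433 | 41 | 5.4 | 26.9 | 0.0 | 0.0 | 0.0 | 3 | 37.0 | 9 | ... | 53.0 | 24.0 | 1021.0 | 1016.8 | 0.0 | 0.0 | 12.5 | 26.1 | 0 | 0 |
| 145458 | 3434 | 41 | 7.8 | 27.0 | 0.0 | 0.0 | 0.0 | 9 | 28.0 | 10 | ... | 51.0 | 24.0 | 1019.4 | 1016.5 | 3.0 | 2.0 | 15.1 | 26.0 | 0 | 0 |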


## 2. Split the data set into a training and test set
### Also split columns into the x and y variables
### We adopt an 80/20 split for training vs. test
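
The 80/20 split can be sketched on a toy frame as follows. The `random_state` and `stratify` arguments are optional additions that this notebook does not use; they are shown here as assumptions because they make the split reproducible and keep the share of rainy days equal in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 5 of them rainy (values are invented)
toy = pd.DataFrame({
    "Humidity3pm": list(range(20, 100, 4)),
    "RainTomorrow": [0] * 15 + [1] * 5,
})

# Features are every column except the target
x_toy = toy.drop(columns="RainTomorrow")
y_toy = toy["RainTomorrow"]

# 80/20 split; stratify keeps the class ratio, random_state fixes the shuffle
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

print(len(x_tr), len(x_te))
print(y_tr.mean(), y_te.mean())
```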

In [37]:
# Define our independent (x) and dependent (y) variables
data_column_names = [column for column in df.columns if column != 'RainTomorrow']
x = df.loc[:, data_column_names]
y = df.loc[:, 'RainTomorrow']

# Hold out 20% of the rows as the test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
x_train

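| | Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 135360 | 2725 | 17 | -0.6 | 10.8 | 0.0 | 0.0 | 0.0 | 9 | 20.0 | 0 | ... | 7.0 | 94.0 | 54.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.9 | 10.6 | 0 |
| 56017 | 1311 | 5 | 6.5 | 9.9 | 1.0 | 0.0 | 0.0 | 13 | 61.0 | 3 | ... | 41.0 | 98.0 | 72.0 | 1007.4 | 1003.6 | 6.0 | 0.0 | 6.6 | 8.8 | 0 |
| 136111 | 436 | 3 | 21.0 | 37.5 | 0.0 | 14.8 | 12.9 | 10 | 41.0 | 0 | ... | 15.0 | 13.0 | 9.0 | 1006.6 | 1002.6 | 0.0 | 0.0 | 30.3 | 36.1 | 0 |
| 69604 | 2647 | 18 | 0.0 | 0.0 | 0.0 | 1.6 | 9.0 | 10 | 20.0 | 6 | ... | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1 |
| 87662 | 858 | 8 | 23.1 | 33.5 | 0.0 | 6.8 | 10.7 | 4 | 28.0 | 2 | ... | 15.0 | 58.0 | 57.0 | 1011.2 | 1009.0 | 4.0 | 3.0 | 29.9 | 32.5 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 114786 | 593 | 29 | 3.5 | 20.2 | 0.0 | 0.0 | 7.7 | 6 | 26.0 | 1 | ... | 15.0 | 62.0 | 35.0 | 1023.9 | 1020.7 | 0.0 | 1.0 | 12.4 | 19.3 | 0 |
| 137902 | 2227 | 3 | 19.3 | 31.1 | 0.0 | 11.8 | 5.2 | 0 | 43.0 | 2 | ... | 26.0 | 55.0 | 49.0 | 1015.1 | 1011.5 | 4.0 | 7.0 | 27.1 | 28.1 | 0 |
| 1347 | 1743 | 2 | 7.5 | 15.2 | 3.4 | 0.0 | 0.0 | 14 | 46.0 | 13 | ... | 24.0 | 86.0 | 63.0 | 1017.9 | 1017.4 | 8.0 | 8.0 | 9.6 | 14.0 | 1 |
| 57391 | 2685 | 5 | 6.0 | 10.6 | 0.0 | 0.0 | 0.0 | 5 | 37.0 | 3 | ... | 20.0 | 100.0 | 99.0 | 1029.7 | 1027.8 | 8.0 | 7.0 | 7.9 | 10.2 | 0 |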


## 3. Create the Gradient-Boosted Trees model and fit it to the training data

In [38]:
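# Shrinkage (learning_rate) and row subsampling regularize the model.
# Early stopping holds out 10% of the training data and ends training
# once the validation score stops improving for 20 consecutive iterations.
GBT = GradientBoostingClassifier(n_estimators=5000,
                                 learning_rate=0.01,
                                 max_depth=3,
                                 subsample=0.5,
                                 validation_fraction=0.1,
                                 n_iter_no_change=20,
                                 max_features='log2'
                                 ).fit(x_train, y_train)

# Mean accuracy on the test set; printed explicitly because a bare
# expression is only displayed when it is the last line of a cell
print(GBT.score(x_test, y_test))

predictions = GBT.predict(x_test)
print(GBT.feature_names_in_)
print(classification_report(y_test, predictions))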
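
As a complement to the classification report, the sketch below shows two further checks on a synthetic, imbalanced data set: how many boosting stages early stopping actually kept (`n_estimators_`), and the confusion matrix of the test predictions. The data and parameter values are assumptions for the example and are smaller than the ones used above so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 80% negatives (not the weather data)
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = GradientBoostingClassifier(
    n_estimators=500,          # upper bound on the number of stages
    learning_rate=0.05,
    subsample=0.5,
    validation_fraction=0.1,   # held out internally for early stopping
    n_iter_no_change=20,
    random_state=0,
).fit(X_tr, y_tr)

# Number of stages kept after early stopping (at most n_estimators)
print("stages fitted:", model.n_estimators_)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, model.predict(X_te))
print(cm)
```

On imbalanced targets such as rain vs. no rain, the off-diagonal counts often tell more than the overall accuracy does.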

