# Problem 2 Rain in Australia Data Set
### Created By: Ivor Zalud
***
## The Problem
I want to predict whether it will rain tomorrow in Australia given a set of weather characteristics. The columns are: Date, Location, MinTemp, MaxTemp, Rainfall, Evaporation, Sunshine, WindGustDir, WindGustSpeed, WindDir9am, WindDir3pm, WindSpeed9am, WindSpeed3pm, Humidity9am, Humidity3pm, Pressure9am, Pressure3pm, Cloud9am, Cloud3pm, Temp9am, Temp3pm, RainToday, RISK_MM, RainTomorrow. RainTomorrow is the target and the remaining columns are candidate features. We drop RISK_MM because it is the amount of rain recorded for the following day, so it directly encodes RainTomorrow and would leak the target into the features.


## The Data
The data is provided by Joe Young on Kaggle [here](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package/version/1). The set describes daily weather characteristics in Australia between 10/31/07 and 6/24/17.

## Methods
* Gradient-Boosted Trees as our tree-based binary classifier
* Logistic Regression as our linear classifier


In [15]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Gradient-Boosted Trees
***
### - **Regularization**: Scikit-learn outlines the available regularization methods well [here](https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regularization.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-regularization-py). We'll use shrinkage (learning_rate < 1.0) and subsample < 1.0 to produce a more accurate model. Subsampling reduces variance in the same way bagging does.
### - **Loss function**: Deviance (the scikit-learn default, named log_loss in recent releases), which is the same loss function logistic regression uses. Training minimizes this loss.
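To make the loss concrete, here is a minimal sketch of the binomial log loss that the deviance setting is based on, checked against `sklearn.metrics.log_loss`. The helper name `binary_log_loss` and the toy labels and probabilities are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np
from sklearn.metrics import log_loss


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs for extreme probabilities
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


# Hypothetical labels (1 = rain) and predicted rain probabilities
y_true = [0, 0, 1, 1]
p_rain = [0.1, 0.4, 0.35, 0.8]

manual = binary_log_loss(y_true, p_rain)
reference = log_loss(y_true, p_rain)
print(manual, reference)
```

A confident wrong prediction (for example a rain probability near 0 on a rainy day) contributes a very large term, which is why this loss pushes the model toward calibrated probabilities.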


## 1. Cleaning and encoding the columns with string data
- **String Data:** We need to transform each column that has string values to be numeric. We convert each of these columns to a pandas categorical and use its .cat.codes accessor to get integer codes. Missing values receive the code -1.
- **NaN Data:** We fill the NaNs in each numeric column with that column's mean. Columns with a large share of NaN values are dropped completely; our threshold is 30% of the total row count.
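The two cleaning rules above can be sketched on a toy frame. This is only an illustration under assumptions: the three-column `toy` frame and the `MAX_MISSING` constant are made up here, and the notebook itself hard-codes which columns to drop rather than computing them.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the weather data
toy = pd.DataFrame({
    "WindGustDir": pd.Series(["N", "SW", None, "N"], dtype=object),
    "Sunshine": [np.nan, np.nan, 7.5, np.nan],   # 75% missing
    "MinTemp": [7.0, np.nan, 5.2, 15.7],          # 25% missing
})

# Category codes: each distinct string gets an integer, missing values get -1
codes = toy["WindGustDir"].astype("category").cat.codes
print(codes.tolist())

# Fraction of missing values per column, compared against the 30% threshold
MAX_MISSING = 0.30
missing_fraction = toy.isna().mean()
to_drop = missing_fraction[missing_fraction > MAX_MISSING].index.tolist()
print(to_drop)

# Drop the sparse columns, then mean-fill the remaining numeric columns
cleaned = toy.drop(columns=to_drop)
numeric = cleaned.select_dtypes("number").columns
cleaned[numeric] = cleaned[numeric].fillna(cleaned[numeric].mean())
print(cleaned)
```

Because missing strings become -1 rather than NaN, rows with an unknown target can later be removed by filtering on a non-negative code.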


In [16]:
df = pd.read_csv('Data/weatherAUS.csv')
# RISK_MM encodes the target, so it must not be used as a feature
df = df.drop(columns='RISK_MM')

# Encode string columns as integer category codes (missing values become -1)
cols = ['Date','Location','WindGustDir','WindDir9am','WindDir3pm','RainToday','RainTomorrow']
df[cols] = df[cols].astype('category')
cat_columns = df.select_dtypes(['category']).columns
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)

# Drop the columns with too many missing values, then mean-fill the remaining numeric columns
df = df.drop(columns=['Evaporation','Sunshine','Cloud9am','Cloud3pm'])
num_cols = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Temp9am', 'Temp3pm']
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Keep only rows with a known target (code -1 marks a missing RainTomorrow)
df = df[df['RainTomorrow'] >= 0]



## 2. Split the data set into a training and test set
### Also split columns into the x and y variables
### We adopt an 80/20 split for training vs test

In [17]:
# Define our independent and dependent variables
data_column_names = [column for column in df.columns if column not in ['RainTomorrow']]
x = df.loc[:, data_column_names]
y = df.loc[:,'RainTomorrow']



x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2)
x_train

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Temp9am,Temp3pm,RainToday
15291,611,24,7.0,23.221348,0.0,-1,40.03523,7,12,15.0,6.0,62.0,41.0,1017.64994,1015.255889,10.5,16.4,0
111612,428,46,15.1,28.800000,0.0,10,50.00000,9,10,11.0,22.0,65.0,48.0,1012.40000,1011.400000,21.0,27.9,0
48616,3029,9,5.2,21.300000,0.0,14,54.00000,3,14,19.0,26.0,75.0,32.0,1018.60000,1015.700000,14.7,20.3,0
91486,1642,14,15.7,22.500000,0.8,10,72.00000,8,10,28.0,33.0,78.0,78.0,1029.40000,1026.700000,18.9,20.1,0
57666,2960,5,17.0,26.800000,0.0,9,57.00000,10,9,19.0,43.0,65.0,63.0,1018.70000,1019.600000,23.8,23.5,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39250,3148,42,3.9,15.200000,0.2,14,22.00000,14,14,7.0,15.0,93.0,65.0,1021.30000,1019.500000,8.8,14.0,0
73664,2120,25,9.1,31.600000,0.0,14,48.00000,1,7,20.0,17.0,49.0,15.0,1015.30000,1010.000000,18.4,30.4,0
120185,2983,32,12.8,30.600000,0.0,12,33.00000,4,0,13.0,11.0,50.0,27.0,1023.10000,1018.000000,22.8,30.2,0
90092,3288,8,25.7,33.300000,0.8,9,50.00000,10,2,19.0,31.0,72.0,60.0,1011.20000,1008.200000,30.0,32.4,0
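The split above is random on every run, so the reported scores will vary slightly between executions. As a hedged sketch, passing `random_state` makes the split repeatable and `stratify` keeps the rain/no-rain ratio equal in both parts. The toy data, the 22% rain share, and the seed value are assumptions for illustration, not the notebook's settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 days, 22% of them followed by rain
rng = np.random.default_rng(0)
toy_x = pd.DataFrame({"Humidity3pm": rng.uniform(0, 100, size=1000)})
toy_y = pd.Series(np.repeat([0, 1], [780, 220]), name="RainTomorrow")

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x,
    toy_y,
    test_size=0.2,      # same 80/20 split as above
    stratify=toy_y,     # preserve the class ratio in both parts
    random_state=42,    # arbitrary seed so the split is repeatable
)

print(len(x_tr), len(x_te))
print(y_tr.mean(), y_te.mean())
```

With an imbalanced target like this one, stratifying avoids a test set that happens to contain unusually few rainy days.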


## 3. Create the Gradient-boosted Trees Model and fit with the training data
### Hyperparameters
- These were tweaked based on the scikit-learn documentation. I tried a variety of values for each and found this set typically resulted in the best score. If the data set grew substantially, I would reduce n_iter_no_change and n_estimators to keep training time manageable. A grid search would be worthwhile here if we tuned more hyperparameters.
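For reference, a grid search over a few of these hyperparameters could look roughly like the sketch below. It runs on synthetic data from `make_classification` with a deliberately small grid and only 50 trees so that it finishes quickly; the grid values, the f1 scoring choice, and the data are all assumptions, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the weather data
X_toy, y_toy = make_classification(
    n_samples=400, n_features=10, weights=[0.78, 0.22], random_state=0
)

# Small illustrative grid: 2 x 2 x 2 = 8 candidate settings
param_grid = {
    "learning_rate": [0.05, 0.1],
    "subsample": [0.5, 0.8],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_grid,
    scoring="f1",  # score on the positive (rain) class rather than accuracy
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Scoring on f1 instead of accuracy steers the search toward settings that handle the minority class better.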

In [18]:
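# n_estimators is an upper bound: training stops early once the score on the
# held-out validation_fraction fails to improve for n_iter_no_change rounds
GBT = GradientBoostingClassifier(n_estimators=5000,
                                 learning_rate=0.1,
                                 max_depth=3,
                                 subsample=0.5,
                                 validation_fraction=0.1,
                                 n_iter_no_change=20,
                                 max_features='log2'
                                 ).fit(x_train, y_train)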

GBT.score(x_test,y_test)

predictions = GBT.predict(x_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.87      0.95      0.91     22112
           1       0.75      0.52      0.62      6327

    accuracy                           0.86     28439
   macro avg       0.81      0.74      0.76     28439
weighted avg       0.85      0.86      0.85     28439



## Results
The model works sufficiently well, with 0.86 accuracy on the test set. The main weakness is the recall for days with rain (0.52): nearly half of the rainy days are missed. Future studies will want to improve this to greatly increase the overall f1 score.
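The recall figure can be read directly off a confusion matrix. The sketch below uses ten made-up labels rather than the notebook's predictions, so the numbers are illustrative only.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = rain tomorrow, 0 = no rain
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

rain_recall = tp / (tp + fn)      # share of rainy days the model caught
rain_precision = tp / (tp + fp)   # share of rain predictions that were right

print(tn, fp, fn, tp)
print(rain_recall, rain_precision)
```

Low recall corresponds to a large `fn` count: rainy days that were predicted as dry.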

# Logistic Regression

***

## 1. Regularize, run, and fit the model
### We already split the data into the training and test sets, which we'll use here.

### We use scikit-learn's pipeline to standardize the features, since the regularized solver behind LogisticRegression is sensitive to feature scaling.
### -  **Regularization:** L2 - the default for LogisticRegression in scikit-learn. We'll keep L2 since we have a small number of features and do not want to shrink any coefficients to zero.
### -  **Loss Function:** Log loss (cross-entropy) - this is the loss LogisticRegression minimizes, and it penalizes confident wrong predictions heavily.
### - **Hyperparameters:** max_iter: I found ~100 typically yielded the best results for this data set; the cell below sets the cap to 1000, and the solver stops earlier once it converges. As we increase the size of the data set, I may reduce max_iter to shorten run time. However, this time/cost trade-off depends on the user's desired outcome.


In [19]:
reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
reg.fit(x_train, y_train)

reg_predictions = reg.predict(x_test)

print(classification_report(y_test,reg_predictions))


print("Score: " + str(reg.score(x_test, y_test)))


              precision    recall  f1-score   support

           0       0.86      0.95      0.90     22112
           1       0.72      0.46      0.56      6327

    accuracy                           0.84     28439
   macro avg       0.79      0.70      0.73     28439
weighted avg       0.83      0.84      0.83     28439

Score: 0.840254579978199


### Result
The model works adequately. It struggles with recall when predicting whether a day will have rain (0.46). Future studies should improve the recall of the rain label to raise the overall f1 score, for example by increasing the feature set or using a better strategy to fill NaN data. Furthermore, a one-hot encoding strategy for the categorical data may result in an improved model.
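As one possible starting point for that follow-up work, the sketch below combines imputation and one-hot encoding inside a single pipeline. It is an assumption-laden illustration: the eight-row `toy` frame, the choice of median and most-frequent imputation, and the column lists are invented here and have not been evaluated on the real data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature of the weather data with missing values
toy = pd.DataFrame({
    "WindGustDir": pd.Series(["N", "SW", np.nan, "N", "E", "SW", "N", "E"], dtype=object),
    "Humidity3pm": [41.0, 78.0, 63.0, np.nan, 32.0, 90.0, 48.0, 85.0],
    "RainTomorrow": [0, 1, 0, 0, 0, 1, 0, 1],
})
categorical = ["WindGustDir"]
numeric = ["Humidity3pm"]
features = categorical + numeric

preprocess = ColumnTransformer([
    # Fill missing directions with the most common one, then one-hot encode
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")), categorical),
    # Fill missing numbers with the median, then standardize
    ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), numeric),
])

model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
model.fit(toy[features], toy["RainTomorrow"])

# Inspect the encoded design matrix: 3 direction columns + 1 numeric column
encoded = model[:-1].transform(toy[features])
print(encoded.shape)
print(model.predict(toy[features]))
```

One-hot encoding avoids implying an order between wind directions, which integer category codes do when fed to a linear model.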