# Problem 2: Rain in Australia Data Set
### Created By: Ivor Zalud
***
## The Problem
We want to predict whether it will rain tomorrow in Australia given a set of weather characteristics. The data set has the following columns: Date, Location, MinTemp, MaxTemp, Rainfall, Evaporation, Sunshine, WindGustDir, WindGustSpeed, WindDir9am, WindDir3pm, WindSpeed9am, WindSpeed3pm, Humidity9am, Humidity3pm, Pressure9am, Pressure3pm, Cloud9am, Cloud3pm, Temp9am, Temp3pm, RainToday, RISK_MM, RainTomorrow. RainTomorrow is the target; the rest are candidate features. We drop RISK_MM because it records the next day's rainfall in millimetres, so it correlates directly with RainTomorrow and would leak the answer to the model.


## The Data
The data is provided by Joe Young on Kaggle [here](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package/version/1). The set describes daily weather characteristics in Australia from 10/31/07 to 6/24/17.

## Methods
* Gradient-Boosted Trees as our binary classifier
* Logistic Regression as our linear classifier


In [152]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, SGDRegressor
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Gradient-Boosted Trees
***
### - **Regularization**: scikit-learn compares the available regularization strategies [here](https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regularization.html#sphx-glr-auto-examples-ensemble-plot-gradient-boosting-regularization-py). We'll combine shrinkage (learning_rate < 1.0) with subsample < 1.0, which tends to produce a more accurate model because fitting each tree on a random subset of rows reduces variance, much as bagging does.
### - **Loss function**: Deviance (the scikit-learn default, named `log_loss` in recent releases), which is the same loss function that logistic regression uses. Training minimizes this loss.
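
As a rough illustration of the two regularization knobs named above, the sketch below fits the same classifier with and without row subsampling. The data is synthetic (`make_classification`), and the sample size, `n_estimators=100`, and the variable names are assumptions chosen only for the demo, so the printed accuracies say nothing about the weather data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the weather data set)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Same shrinkage in both models; only the subsample fraction differs
settings = {
    "shrinkage only": dict(learning_rate=0.1, subsample=1.0),
    "shrinkage + subsample": dict(learning_rate=0.1, subsample=0.5),
}

demo_scores = {}
for label, params in settings.items():
    model = GradientBoostingClassifier(n_estimators=100, random_state=0, **params)
    model.fit(Xd_train, yd_train)
    demo_scores[label] = model.score(Xd_test, yd_test)
    print(f"{label}: test accuracy = {demo_scores[label]:.3f}")
```

Which setting wins depends on the data, so a comparison like this is best repeated on the real training set before committing to a value.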


## 1. Cleaning and encoding the columns with string data
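- **String Data:** We need to transform each column that has string values to be numeric. We convert these columns to the pandas category dtype and replace them with their integer codes via the `.cat.codes` accessor.
- **NaN Data:** Missing numeric values are filled with the column mean, which is a simple strategy and likely not the best one. Missing string values receive the code -1 from `.cat.codes`. Evaporation, Sunshine, Cloud9am and Cloud3pm are dropped, and rows with a missing RainTomorrow label are removed.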
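
To make the encoding concrete, here is a minimal sketch on a toy frame. The values are made up; only the column names are borrowed from the data set.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; column names borrowed from the data set
toy = pd.DataFrame({
    "WindGustDir": ["N", "SW", np.nan, "N"],
    "MinTemp": [10.0, np.nan, 14.0, 12.0],
})

# Category codes follow the sorted category order; NaN maps to -1
toy["WindGustDir"] = toy["WindGustDir"].astype("category").cat.codes

# Mean imputation uses only the observed values: (10 + 14 + 12) / 3 = 12
toy["MinTemp"] = toy["MinTemp"].fillna(toy["MinTemp"].mean())

print(toy)
```

Note that the codes are arbitrary integers, so a model may read an ordering into them (for example, that "SW" is greater than "N") that does not exist in the original labels.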


In [162]:
df = pd.read_csv('Data/weatherAUS.csv')
df = df.drop(columns='RISK_MM')

# Encode the string columns as integer category codes (missing values become -1)
cols = ['Date','Location','WindGustDir','WindDir9am','WindDir3pm','RainToday','RainTomorrow']
df[cols] = df[cols].astype('category')
cat_columns = df.select_dtypes(['category']).columns
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)

# Drop the columns we do not use, then fill the remaining numeric NaNs with the column mean
df = df.drop(columns=['Evaporation','Sunshine','Cloud9am','Cloud3pm'])
num_cols = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Temp9am', 'Temp3pm']
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Keep only rows with a known RainTomorrow label (code -1 marks a missing value)
df = df[df['RainTomorrow'] >= 0]



## 2. Split the data set into a training and test set
### We also split the columns into the x and y variables
### We adopt an 80/20 split for training vs. test

In [158]:
## Define our independent and dependent variables
data_column_names = [column for column in df.columns if column not in ['RainTomorrow']]
x = df.loc[:, data_column_names]
y = df.loc[:,'RainTomorrow']



x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2)
x_train

|  | Date | Location | MinTemp | MaxTemp | Rainfall | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Temp9am | Temp3pm | RainToday |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 106449 | 1314 | 48 | 6.5 | 15.2 | 0.0 | 11 | 37.00000 | 12 | 8 | 17.0 | 20.0 | 88.0 | 47.0 | 1024.20000 | 1025.900000 | 9.3 | 14.1 | 0 |
| 11582 | 2951 | 11 | 22.9 | 28.9 | 0.4 | 8 | 39.00000 | 11 | 8 | 17.0 | 22.0 | 84.0 | 68.0 | 1016.00000 | 1015.300000 | 24.6 | 27.9 | 0 |
| 67337 | 380 | 18 | 12.0 | 22.7 | 0.0 | 15 | 57.00000 | 12 | 12 | 20.0 | 19.0 | 57.0 | 37.0 | 1016.60000 | 1014.900000 | 13.8 | 19.5 | 0 |
| 6769 | 1147 | 10 | 13.3 | 30.5 | 0.0 | 10 | 31.00000 | 0 | 2 | 15.0 | 9.0 | 33.0 | 18.0 | 1015.80000 | 1013.500000 | 21.8 | 29.1 | 0 |
| 134847 | 2212 | 17 | 14.3 | 24.7 | 0.0 | 7 | 37.00000 | 3 | 7 | 11.0 | 24.0 | 62.0 | 33.0 | 1010.70000 | 1006.700000 | 17.8 | 23.9 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53610 | 1944 | 23 | 3.0 | 10.1 | 0.0 | 11 | 43.00000 | 11 | 12 | 15.0 | 17.0 | 35.0 | 45.0 | 1017.64994 | 1015.255889 | 6.2 | 8.2 | 0 |
| 98285 | 2208 | 0 | 20.6 | 24.5 | 75.2 | 4 | 26.00000 | 4 | 1 | 11.0 | 7.0 | 93.0 | 86.0 | 1003.20000 | 998.800000 | 22.0 | 23.2 | 1 |
| 132097 | 2502 | 15 | 12.0 | 15.8 | 4.2 | 9 | 35.00000 | 2 | 10 | 9.0 | 26.0 | 93.0 | 81.0 | 1011.40000 | 1010.200000 | 13.2 | 14.0 | 1 |
| 64148 | 3393 | 35 | 8.1 | 16.9 | 0.0 | -1 | 40.03523 | 3 | 2 | 6.0 | 17.0 | 96.0 | 68.0 | 1024.60000 | 1022.200000 | 10.4 | 15.7 | 0 |
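
The call to `train_test_split` in this notebook passes no `random_state`, so each run selects different rows. One possible variation, sketched below on made-up labels, fixes the seed and stratifies on the target so that both parts keep the same share of rainy days. The toy arrays and the seed value 42 are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% "no rain" (0), 20% "rain" (1)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(yt_train), len(yt_test))      # 80 20
print(yt_train.mean(), yt_test.mean())  # 0.2 0.2
```

On the real data, the equivalent change would be to pass `stratify=y` and a fixed `random_state` to the existing call.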


## 3. Create the Gradient-boosted Trees Model and fit with the training data

In [155]:
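GBT = GradientBoostingClassifier(
    n_estimators=5000,        # upper bound on the number of boosting stages
    learning_rate=0.1,        # shrinkage
    max_depth=3,
    subsample=0.5,            # fit each tree on a random half of the training rows
    validation_fraction=0.1,  # hold out 10% of the training data for early stopping
    n_iter_no_change=20,      # stop once the validation score stalls for 20 stages
    max_features='log2'
).fit(x_train, y_train)

# Mean accuracy on the test set (not displayed, since it is not the cell's last expression)
GBT.score(x_test, y_test)

predictions = GBT.predict(x_test)
print(GBT.feature_names_in_)
print(classification_report(y_test, predictions))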



['Date' 'Location' 'MinTemp' 'MaxTemp' 'Rainfall' 'WindGustDir'
 'WindGustSpeed' 'WindDir9am' 'WindDir3pm' 'WindSpeed9am' 'WindSpeed3pm'
 'Humidity9am' 'Humidity3pm' 'Pressure9am' 'Pressure3pm' 'Temp9am'
 'Temp3pm' 'RainToday']
              precision    recall  f1-score   support

           0       0.88      0.95      0.91     44149
           1       0.75      0.53      0.62     12729

    accuracy                           0.86     56878
   macro avg       0.81      0.74      0.77     56878
weighted avg       0.85      0.86      0.85     56878
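
To show how the summary rows of the report relate to the per-class rows, the sketch below recomputes three of them from the rounded values printed above. Because the inputs are already rounded to two decimals, this is a consistency check rather than an exact reproduction; the variable names are illustrative.

```python
# Per-class values copied from the gradient-boosted trees report above
precision = {0: 0.88, 1: 0.75}
recall = {0: 0.95, 1: 0.53}
support = {0: 44149, 1: 12729}


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


total = sum(support.values())
f1_rain = f1(precision[1], recall[1])
macro_recall = sum(recall.values()) / len(recall)
weighted_recall = sum(recall[c] * support[c] for c in support) / total

print(f"F1 (rain):       {f1_rain:.2f}")
print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

The support-weighted recall equals the overall accuracy, which is why the report shows 0.86 for both. The macro average weights the two classes equally, so the weak recall on the rain class pulls it down to 0.74.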



# Logistic Regression

***

## 1. Regularize, run, and fit the model
### We already split the data into the training and test sets, which we'll use here.

### We use scikit-learn's pipeline to standardize the features, since regularized logistic regression is sensitive to feature scaling.
### -  **Regularization:** L2, the default for LogisticRegression in scikit-learn. We'll keep L2 since we have a small number of features and do not want to shrink any coefficients to zero.
### -  **Loss Function:** Log loss (cross-entropy), which is what LogisticRegression minimizes. It penalizes confident wrong predictions heavily, so the fit stays sensitive to badly misclassified days.
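
For reference, the data term that scikit-learn's LogisticRegression minimizes is the log loss (cross-entropy); the L2 penalty adds a term proportional to the squared norm of the coefficients, with its strength set by the inverse-regularization parameter `C`. The sketch below computes the log loss by hand and compares it with `sklearn.metrics.log_loss`. The labels and probabilities are made up for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical labels and predicted rain probabilities
y_true = np.array([0, 0, 1, 1])
p_rain = np.array([0.1, 0.4, 0.35, 0.8])

# Mean negative log-likelihood of the true labels
manual = -np.mean(y_true * np.log(p_rain) + (1 - y_true) * np.log(1 - p_rain))

print(manual, log_loss(y_true, p_rain))
```

The third sample (a rainy day predicted at 0.35) contributes the largest single term, which illustrates how the loss concentrates on confidently wrong predictions.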


In [156]:
reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
reg.fit(x_train, y_train)

reg_predictions = reg.predict(x_test)

print(classification_report(y_test,reg_predictions))


print("Score: " + str(reg.score(x_test, y_test)))


              precision    recall  f1-score   support

           0       0.86      0.95      0.90     44149
           1       0.71      0.47      0.57     12729

    accuracy                           0.84     56878
   macro avg       0.79      0.71      0.73     56878
weighted avg       0.83      0.84      0.83     56878

Score: 0.83897113119308


### Result
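Both models work adequately: the gradient-boosted trees reach 0.86 accuracy and logistic regression reaches 0.84. Both struggle with recall on the rain class (0.53 and 0.47, respectively), so many rainy days are missed. Future work should focus on improving recall for the rain label, which would raise the overall F1 score.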
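
One possible way to trade precision for recall, sketched here as an assumption rather than something this notebook evaluated, is to lower the probability threshold at which a day is labeled as rain. The data below is synthetic and imbalanced, and the threshold 0.3 is an arbitrary illustrative value.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (an assumption): roughly 80% class 0, 20% class 1
X_imb, y_imb = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X_imb, y_imb, test_size=0.2, stratify=y_imb, random_state=0
)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(Xi_train, yi_train)

# Probability of the positive ("rain") class for each test row
proba_rain = clf.predict_proba(Xi_test)[:, 1]

recalls = {}
for threshold in (0.5, 0.3):  # 0.3 is an arbitrary illustrative value
    preds = (proba_rain >= threshold).astype(int)
    recalls[threshold] = recall_score(yi_test, preds)
    print(f"threshold={threshold}: recall(rain)={recalls[threshold]:.2f}")
```

Lowering the threshold can only add predicted rain days, so recall never decreases, but precision usually falls. The right threshold depends on how costly a missed rainy day is compared with a false alarm.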