# Rainfall Prediction - Training a Binary Classifier
## Exploratory Data Analysis
### Feature Engineering and Scaling
As usual, we first load the libraries and the data we will be using. We also drop any rows where the target, `RainTomorrow`, is missing.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.options.mode.chained_assignment = None

rainfall = pd.read_csv("weatherAUS_extracted_dates.csv")
rainfall = rainfall[rainfall['RainTomorrow'].notna()]

The next step in preparing our data for use in training and testing a model is to separate out the target variable. We can do this as shown below.

In [2]:
X = rainfall.drop(columns=['RainTomorrow'])
y = rainfall['RainTomorrow']

We can now split up the data into training and testing datasets as follows.

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
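
The split above is random, so it will differ on every run. As a minimal sketch on made-up toy data (the names `toy_X`, `toy_y` and the seed `42` are assumptions, not part of the original notebook), `random_state` makes the split repeatable and `stratify` keeps the class ratio the same in both splits, which can help if the target classes are imbalanced.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows, 20% of them in the positive class
toy_X = pd.DataFrame({"feature": np.arange(100)})
toy_y = pd.Series(["Yes"] * 20 + ["No"] * 80, name="RainTomorrow")

# random_state fixes the shuffle; stratify preserves the Yes/No ratio per split
tr_X, te_X, tr_y, te_y = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

train_rate = (tr_y == "Yes").mean()
test_rate = (te_y == "Yes").mean()
print(len(tr_X), len(te_X), train_rate, test_rate)
```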

To handle the missing values, we can impute them; that is, we fill each gap with a value inferred from the data. For the numerical variables we'll use the median, since the data contains many outliers, and for the categorical variables we'll use the mode. Both statistics are computed on the training set only and then applied to both splits, so nothing from the test set leaks into the preprocessing.

In [4]:
numerical_vars = X.select_dtypes(include=['int64', 'float64']).columns
categorical_vars = X.select_dtypes(include=['object']).columns

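# Use a separate loop variable so the full feature matrix X is not overwritten
for df in [X_train, X_test]:
    for cat_var in categorical_vars:
        # Mode of the training set, applied to both splits
        var_mode = X_train[cat_var].mode()[0]
        # Assign the result back; column-level inplace fillna may not update the frame
        df[cat_var] = df[cat_var].fillna(var_mode)
    for num_var in numerical_vars:
        # Median of the training set, applied to both splits
        var_median = X_train[num_var].median()
        df[num_var] = df[num_var].fillna(var_median)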
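
The same idea can be expressed with scikit-learn's `SimpleImputer`. This is only a sketch of an alternative, not what the notebook itself uses; the toy frames and the column names `Rainfall` and `WindDir` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames (hypothetical values) standing in for X_train / X_test
toy_train = pd.DataFrame({
    "Rainfall": [0.0, 1.0, np.nan, 50.0, 2.0],
    "WindDir": pd.Series(["N", "N", "S", np.nan, "E"], dtype=object),
})
toy_test = pd.DataFrame({
    "Rainfall": [np.nan, 3.0],
    "WindDir": pd.Series([np.nan, "W"], dtype=object),
})

num_imputer = SimpleImputer(strategy="median")         # numerical columns
cat_imputer = SimpleImputer(strategy="most_frequent")  # categorical columns

# Fit on the training split only, then reuse the learned fill values on the test split
toy_train["Rainfall"] = num_imputer.fit_transform(toy_train[["Rainfall"]]).ravel()
toy_test["Rainfall"] = num_imputer.transform(toy_test[["Rainfall"]]).ravel()
toy_train["WindDir"] = cat_imputer.fit_transform(toy_train[["WindDir"]]).ravel()
toy_test["WindDir"] = cat_imputer.transform(toy_test[["WindDir"]]).ravel()

print(toy_test)
```

Fitting on the training frame and only calling `transform` on the test frame mirrors the loop above, where the mode and median always come from `X_train`.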

We can also limit the influence of outliers by capping each numerical variable at the bounds of the interquartile range, that is, at the 25th and 75th percentiles of the training set.

In [5]:
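# Use a separate loop variable so the full feature matrix X is not overwritten
for df in [X_train, X_test]:
    for num_var in numerical_vars:
        # Upper bound: 75th percentile of the training set
        ub = X_train[num_var].quantile(0.75)
        df.loc[df[num_var] > ub, num_var] = ub

        # Lower bound: 25th percentile of the training set
        lb = X_train[num_var].quantile(0.25)
        df.loc[df[num_var] < lb, num_var] = lb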
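
Capping at the quartiles themselves replaces every value outside the middle 50% of the training data. A commonly used, gentler alternative is to cap at Tukey's fences, 1.5 times the interquartile range beyond each quartile. The sketch below shows that variant on made-up numbers; it is an assumption about what one might try, not the method used in this notebook.

```python
import pandas as pd

# Toy series (hypothetical values) standing in for one numerical column
toy_train = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])
toy_test = pd.Series([-50.0, 4.0, 9.0])

# Quartiles and fences are computed on the training data only
q1 = toy_train.quantile(0.25)
q3 = toy_train.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# clip() caps values below `lower` and above `upper` in one call
train_capped = toy_train.clip(lower=lower, upper=upper)
test_capped = toy_test.clip(lower=lower, upper=upper)

print(lower, upper)
print(test_capped.tolist())
```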

Our second-to-last step is to use sklearn's `DictVectorizer` to one-hot encode our categorical variables so that the model can use them. Numerical variables pass through unchanged.

In [6]:
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=False)
train_dict = X_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)
X_test = dv.transform(X_test.to_dict(orient='records'))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
np.save("features.npy", dv.get_feature_names_out())

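Lastly, we need to scale all of the variables to a common range so that variables with large values do not dominate the model. We do this with `MinMaxScaler`, which maps each variable to [0, 1] based on its range in the training set.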

In [7]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
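
Min-max scaling computes `(x - min) / (max - min)` per column, with `min` and `max` taken from the data the scaler was fitted on. The sketch below checks this by hand on made-up numbers (all `toy_*` names are assumptions). It also shows that, in general, test values outside the training range land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-column arrays (hypothetical values)
toy_train = np.array([[0.0], [5.0], [10.0]])
toy_test = np.array([[2.5], [12.0]])

toy_scaler = MinMaxScaler()
train_scaled = toy_scaler.fit_transform(toy_train)  # learns min=0, max=10
test_scaled = toy_scaler.transform(toy_test)        # reuses the training min/max

# The same computation written out by hand
col_min = toy_train.min(axis=0)
col_max = toy_train.max(axis=0)
manual = (toy_test - col_min) / (col_max - col_min)

print(test_scaled.ravel(), manual.ravel())
```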

Finally (this time for real), we can save the datasets we made so we can use them for training and testing the model in the next notebook.

In [8]:
np.savetxt("X_train.csv", X_train, delimiter=',')
np.savetxt("X_test.csv", X_test, delimiter=',')
y_train.to_csv("y_train.csv")
y_test.to_csv("y_test.csv")
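
For reference, files written this way can be read back as sketched below. This is a round trip on made-up data in a temporary directory; the file names `X_toy.csv` and `y_toy.csv` are assumptions, and the next notebook may load its data differently. Note that `to_csv` on a Series also writes the row index, so `index_col=0` is used to restore it.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy data (hypothetical values) standing in for X_train / y_train
toy_X = np.array([[0.0, 1.0], [0.5, 0.25]])
toy_y = pd.Series(["No", "Yes"], index=[17, 3], name="RainTomorrow")

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "X_toy.csv")
    y_path = os.path.join(tmp, "y_toy.csv")

    # Save in the same way as the cell above
    np.savetxt(x_path, toy_X, delimiter=',')
    toy_y.to_csv(y_path)

    # Load the feature matrix as a NumPy array and the target as a Series
    X_loaded = np.loadtxt(x_path, delimiter=',')
    y_loaded = pd.read_csv(y_path, index_col=0)["RainTomorrow"]

print(X_loaded.shape, y_loaded.tolist())
```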