Hi,

In this project I perform an Exploratory Data Analysis (EDA) on the Rainfall Data and try to predict whether it is going to rain in Australia the next day. Although the dataset includes the location of each measurement, my first attempt is a general model that predicts rain without using the location.


***Result of the project: an AUC score of around 85%. This could still be improved with further analysis.***

# Schematic Project Layout:

Import
- Import Data from kaggle

- Split Data into train and test (Timeseriessplit)

EDA
- EDA on the train set with Pandas profiling library

The Pipeline

- Pipeline to process train (and later the test set):

    - Process numerical columns (most columns)

    - Process the Rainfall and Sunshine columns (make the features discrete (categorical), since they have a lot of zeros and outliers)

    - Power transform (log transform) features that are right skewed

    - Feature engineering pipe (for example, take the difference between the min and max temperature on a day for each location)

    - Process categorical columns (one hot encoding)


Specific alterations on the train data to make training easier:

  - Remove outliers? Outliers are not really an issue in these data (except for Rainfall, but this is already fixed by the discretization)

  - Oversample the minority class using SMOTE

      NOTE: Undersampling the majority class is not necessary because the ratio minority/majority is equal to about 1/3.

- Drop NaNs

  - In the target variable

  - In columns that have more than ~40 % NaN's

  - In rows that have more than 50% NaN's


- Cross validation on different estimators

  - Used metric: ROC_AUC

  - Learning Curves

- Pick promising estimators and perform a hyperparameter optimization with HyperOpt.

- Test best estimator on test set



# Import packages and the data

In [None]:
# Installed packages

import pandas as pd

import numpy as np
import matplotlib.pyplot as plt

#EDA library
from pandas_profiling import ProfileReport

from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import  MinMaxScaler, PowerTransformer, OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

#MICE
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer

#PCA
# from sklearn.decomposition import PCA # PCA is not really necessary here

import seaborn as sns

# for oversampling and undersampling
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imblearn_pipeline

#see progress bar in pandas
from tqdm import tqdm # display handy progress bars
tqdm.pandas()


## Read the csv file and split into train/test

In [None]:
#read the csv
df = pd.read_csv("../input/weather-dataset-rattle-package/weatherAUS.csv", parse_dates=['Date'], index_col='Date')

In [None]:
#view the first rows
df.head()

In [None]:
#the last rows
df.tail()

Check the general info of the data

In [None]:
# check the dataset
df.info()

The first thing I notice is that the dataframe is apparently ordered by city instead of chronologically. This would affect our split into train and test sets, since we use a TimeSeriesSplit: we are trying to split the data into different periods in time, with some periods forming the train set and others the test set.

Secondly, the data has a lot of missing values. Fortunately, since the number of rows is quite large, this should not be a major problem.

In [None]:
#sort the df by index instead of by location
df = df.sort_index()
df.head()

The dataframe only has data from Canberra in the first days of the measurements. Does this mean that Canberra might be overrepresented in the data? In the EDA we should check the frequency of each location label.

Let's look at the tail of the data frame.

In [None]:
df.tail()

Now that the data is chronologically ordered, let's split the data into train and test sets using the TimeSeriesSplit.

### Split the data into train and test

In [None]:
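# Make a TimeSeriesSplit on copies of the data
tscv = TimeSeriesSplit(n_splits=5) # TimeSeriesSplit splits the time series data without shuffling
# Each iteration overwrites df_train/df_test, so after the loop only the last split remains:
# the most recent block of rows is the test set and everything before it is the train set
for train_index, test_index in tscv.split(df):
    df_train, df_test = df.iloc[train_index].copy(), df.iloc[test_index].copy()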
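
To make the effect of keeping only the last split concrete, here is a minimal sketch on a toy index of 12 ordered rows (the size 12 and the variable names are assumptions for illustration, not part of the notebook). With `n_splits=5` the last fold uses the final block of rows as the test set and all earlier rows as the train set.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 chronologically ordered toy rows (hypothetical stand-in for the weather data)
toy_rows = np.arange(12).reshape(-1, 1)

toy_tscv = TimeSeriesSplit(n_splits=5)
for toy_train_idx, toy_test_idx in toy_tscv.split(toy_rows):
    pass  # each iteration overwrites the indices; only the last split survives

print("train rows:", toy_train_idx)
print("test rows: ", toy_test_idx)
```

The train indices always precede the test indices, so no future rows leak into the training period.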

# EDA

First, let's do an Exploratory Data Analysis (EDA) on the training data. The pandas-profiling library is an excellent and easy choice to explore the data, so let's use this one.

In [None]:
profile = ProfileReport(df_train, html={"style": {"full_width": True}}, title="Rain Australia") # html attribute optional

In [None]:
profile

The first thing I notice is that some columns have a very high proportion of missing data (more than 35% in at least one column). So the first question that arises is: "Do these columns/features matter in predicting the rain tomorrow?" If they do, dropping them would mean a massive loss of information. On the other hand, imputation would be unreliable, because these values are clearly not missing at random (MAR). The columns that have a lot of missing data are:


    - Evaporation
    - Sunshine
    - Cloud9am
    - Cloud3pm

In theory these features are important: clouds may hang in the sky for a while, and a sunny day is often followed by another sunny day. This holds especially for, for example, a continental climate; the climate near the sea may be more volatile. Lastly, evaporation should (obviously) also be linked to rainfall.

A second thing to mention is that the target column also contains NaNs. Since we do fully supervised classification, the rows with NaNs in the target column should be dropped.

Lastly, rows with more than 50% NaNs should also be dropped, since they are unlikely to be valuable in our analysis.



Let's find out with an EDA on the data after dropping ~40% of the rows: rows with NaNs in the target column, rows with NaNs in the sparsely filled columns, and rows that are mostly empty.

Note: We drop these ~40% of the rows by checking just a few columns, since the missing values in these columns are correlated with each other (according to the heatmap in the EDA).
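
As a small illustration of the row-dropping rules, the sketch below applies `dropna` to a toy frame (the column names `a`, `b`, `c`, `target` and the threshold of 3 are assumptions for illustration). Note that `thresh` is the minimum number of non-NaN values a row needs in order to be kept, not the number of NaNs that is tolerated.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan, 4.0],
    "target": ["No", "Yes", "No", np.nan],
})

kept = (toy
        .dropna(subset=["target"])  # rows without a label cannot be used for supervised training
        .dropna(thresh=3))          # keep only rows with at least 3 non-NaN values

print(kept)
```

Row 2 has a label but is otherwise empty, so it fails the `thresh` rule; row 3 is dropped because its target is missing.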

In [None]:
# drop rows in three steps
df_train_edit = (df_train.copy()
                        .dropna(axis=0, subset=['RainTomorrow']) # drop the rows with no target value (useless in fully supervised training)
                        .dropna(subset=['Cloud9am', 'Evaporation']) # drop the rows with NaNs in columns that have a lot of NaNs
                        .dropna(axis=0, thresh=12)) # keep rows with at least 12 non-NaN values (drops rows that are more than ~50% NaN); recent pandas does not accept 'how' together with 'thresh'
drop_profile = ProfileReport(df_train_edit, html={"style": {"full_width": True}}, title="Rain Australia") # html attribute optional

In [None]:
drop_profile

Since we started with a lot of data, the histogram distributions of the features do not change that much, although we lost some locations and went from 49 locations to 30 locations. Since our goal is to make a general predictor for rain the next day, this is not a huge problem. A problem might be an unbalanced (non-uniform) distribution of the locations, which could cause a biased view of the weather in Australia. Let's look at the histogram of the location column.

In [None]:
plt.figure(figsize=(16,10))
sns.histplot(data=df_train_edit['Location'])
plt.title('Histogram of the location labels')
plt.xticks(rotation=45)
plt.show()

The histogram is quite uniformly distributed, with a few exceptions. I think this is a decent distribution to avoid a bias towards a certain city (despite the fact that we lost 19 cities).

# A complete overview of the EDA

Point 1: Global exploration

- Columns with high % missing values (> 40% missing values):
      - Evaporation
      - Sunshine ( + a lot of zeros)
      - Cloud9am
      - Cloud3pm

**-> Action: drop a massive amount of rows, to include the features in the data analysis**


- Columns with (very) high correlation:
      - Min Temp - Temp 9am
      - Max Temp - Temp 3pm
      - Pressure9am - Pressure3pm

**-> Action: Include in ColumnTransformer for MICE imputation, 1 of each pair is excluded afterwards** *( - another option would be to do a PCA, but feature selection would be more difficult afterwards)*

--------------------------------------------------------

Point 2: The numeric pipelines

The numeric pipeline is divided into four "subpipes".


    The first subpipe
- The first subpipe is the 'general' pipe with normally distributed features.
These features will be imputed using the "multiple imputation by chained equations" technique (MICE). This technique is usually better than a plain median or mean imputation. Afterwards these features will be scaled with a MinMaxScaler so that all values stay positive. (A StandardScaler would also create negative values.)

  More practical info on MICE: https://machinelearningmastery.com/iterative-imputation-for-missing-values-in-machine-learning/ .



    The second subpipe
- The second subpipe is for the right skewed distributions:
      - WindSpeed9am
      - WindSpeed3pm
      - WindGustSpeed
      - Evaporation

  These features will be transformed with a PowerTransformer (which also standardizes them) and afterwards imputed with the median.


    The third subpipe

- The third subpipe converts the features Rainfall and Sunshine into ordinal bins, because these features contain a lot of zeros. Moreover, Rainfall also has a lot of outliers. The pipe applies median imputation before the values are converted to bins.


    The fourth subpipe
- The fourth subpipe will be an imputation pipe for the ordinal features. The pipe will have a 'most frequent' (mode) imputation. 


---------------------------------------------------------------------
Point 3: The categorical pipe

The categorical features will be encoded using a one-hot encoder

------------------------------------------------------------------
Point 4: Feature Engineering pipe

To add additional features, the difference between the wind speed in the morning and in the afternoon is calculated. More wind usually means a change in the weather.
Furthermore, the difference between the min and max temperature on a day and the difference in humidity during the day are calculated.

The pipe consists of a median imputation, a custom transformer that creates the new features and, lastly, a MinMaxScaler to scale these features.
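
As a quick sketch of what the three engineered features look like, here they are computed with pandas on two made-up days (the values are invented for illustration; the notebook's custom transformer works on NumPy arrays inside the pipeline instead).

```python
import pandas as pd

# Two hypothetical days with the six input columns used for feature engineering
day = pd.DataFrame({
    "MinTemp": [10.0, 18.5], "MaxTemp": [24.0, 21.0],
    "Humidity9am": [80.0, 60.0], "Humidity3pm": [45.0, 65.0],
    "WindSpeed9am": [10.0, 30.0], "WindSpeed3pm": [25.0, 20.0],
})

engineered = pd.DataFrame({
    "diff_temp": (day["MaxTemp"] - day["MinTemp"]).abs(),
    "diff_humidity": (day["Humidity9am"] - day["Humidity3pm"]).abs(),
    "diff_windspeed": (day["WindSpeed9am"] - day["WindSpeed3pm"]).abs(),
})
print(engineered)
```

Taking the absolute value means the features capture the size of the change during the day, not its direction.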

-----------------------------------------------------------------
Point 5: The target column

The target variable has a ratio of about 1:3 between "Yes" and "No". A convenient solution for this is to apply the "Synthetic Minority Oversampling Technique" (SMOTE). The minority class will be oversampled to create a more balanced distribution.

NOTE: SMOTE must be performed DURING the cross validation, on the training folds only. If SMOTE is performed once before the cross validation, synthesized samples end up in the validation folds and the scores will be overly optimistic (an overfit on the synthesized data).
Handy sources: 

    https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/


    https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets#Test-Data-with-Logistic-Regression:
---------------------------------------------------------------------
A picture of the whole pipe will be displayed after the code for the pipe
_________________________________________________________________________


# Pipelines for the Columntransformer (Used for Train and test data)

## Custom transformer for feature engineering

Ideas:

- Difference between windspeed 9am and 3pm
- Difference between humidity 9am and 3pm
- Difference Min and Max temp

In [None]:
#Custom transformer to add columns
#This transformer calculates the difference in temp, windspeed and humidity on different time periods

from sklearn.base import BaseEstimator, TransformerMixin

minTemp_ix, maxTemp_ix, humidity_9_ix, humidity_3_ix, windspeed_9_ix, windspeed_3_ix = 0,1,2,3,4,5

class ColumnAdd(BaseEstimator, TransformerMixin):
    def __init__(self): # no *args or **kargs
        pass
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        diff_temp = np.abs(X[:,minTemp_ix] - X[:,maxTemp_ix])
        diff_humidity = np.abs(X[:,humidity_9_ix] - X[:,humidity_3_ix])
        diff_windspeed = np.abs(X[:,windspeed_9_ix] - X[:,windspeed_3_ix])
        
        return np.c_[diff_temp, diff_humidity, diff_windspeed] # transformer returns a np array

## Pipeline to transform numeric columns

To make better imputations than just the mean or median, Multiple Imputation by Chained Equations (MICE) is used. To get the most out of MICE, the highly correlated columns are initially included in the numeric pipeline. These columns are dropped at the final step of the pipeline to avoid multicollinearity during training.

Note:
An alternative to dropping could be a PCA at the end. However, the features become harder to interpret once they are transformed into principal components. Just dropping the correlated features is more convenient.
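
The idea of "impute with a correlated helper column, then drop the helper" can be sketched on synthetic data as follows. This is a toy illustration under assumed names (`drop_first`, `toy_pipe`) and random data, not the notebook's actual pipeline.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# column 0: helper, column 1: strongly correlated with the helper, column 2: unrelated
X_toy = np.hstack([base, base + rng.normal(scale=0.1, size=(200, 1)), rng.normal(size=(200, 1))])
X_toy[::10, 1] = np.nan  # remove every 10th value of column 1

def drop_first(X, n=1):
    """Drop the first n columns (the helper columns)."""
    return X[:, n:]

toy_pipe = Pipeline([
    ("impute", IterativeImputer(initial_strategy="median", random_state=0)),
    ("scale", MinMaxScaler()),
    ("drop_helper", FunctionTransformer(drop_first, kw_args={"n": 1})),
])

X_out = toy_pipe.fit_transform(X_toy)
print(X_out.shape)
```

The helper column contributes to the imputation of column 1 but never reaches the estimator, which is the same design as dropping the first three columns in the numeric pipeline.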

In [None]:
# Function to use in the FunctionTransformer to drop some columns afterwards
def drop_cols(X):
  """
  Drops the first three columns of X to remove multicollinearity (these are the highly correlated columns)

  The columns were initially used to assist with the MICE imputation
  """
  return X[:,3:] #drop the first three columns that are entered in the Columntransformer

In [None]:
# Pipeline for all normally distributed features:

numeric_pipeline = Pipeline([
                             ('Iter_imputer', IterativeImputer(initial_strategy='median')), # since quite a few features are used in this pipe, MICE may be more effective than just the median
                             ('scaler', MinMaxScaler()),
                             ('drop_some_cols', FunctionTransformer(func=drop_cols)) # drops the first three columns to avoid multicollinearity
])

### Pipeline for the Rainfall and Sunshine feature

In [None]:
# Make the values of the rainfall feature discrete >> e.g. "no rain", "little rain", "a lot of rain"
from sklearn.preprocessing import KBinsDiscretizer
# another way to discretize is with a DecisionTreeDiscretiser

descrete_pipe = Pipeline([
                          ('median_imputer',SimpleImputer(strategy='median')),
                          ('discrete', KBinsDiscretizer(strategy='kmeans', encode='ordinal'))
])
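
To show what the k-means binning does to a zero-heavy feature, here is a small sketch on made-up rainfall values. The data and `n_bins=3` are assumptions chosen for readability; the pipeline above uses the default number of bins.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical rainfall values: many dry days plus a spread of wet days
rain = np.array([0.0] * 20 + [5, 10, 15, 25, 30, 35, 45, 50, 55, 60], dtype=float).reshape(-1, 1)

binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
codes = binner.fit_transform(rain).ravel()

print("bin edges:", binner.bin_edges_[0])
print("codes of the wet days:", codes[20:])
```

All dry days land in the lowest bin and the largest values in the highest bin, so a single extreme measurement can no longer dominate the feature.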

### Pipeline for right skewed distributions

In [None]:
# Pipeline for all right-skewed (not normally distributed) features:

not_normal_pipe = Pipeline([
                            ('log_trans', PowerTransformer()), # Yeo-Johnson power transform (the default); it also standardizes the data, and NaNs remain NaNs
                            ('median_imputer',SimpleImputer(strategy='median'))
                            
])
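
The following sketch checks the two properties this pipe relies on, using synthetic log-normal data as an assumed stand-in for a right-skewed feature such as wind speed: the power transform reduces the skew, and NaNs pass through untouched so they can be imputed afterwards. The helper `skewness` is a hypothetical name introduced only for this example.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def skewness(x):
    """Sample skewness, ignoring NaNs."""
    x = x[~np.isnan(x)]
    return float(np.mean((x - x.mean()) ** 3) / x.std() ** 3)

rng = np.random.default_rng(0)
x = rng.lognormal(size=(500, 1))  # right-skewed toy feature
x[0, 0] = np.nan                  # one missing value

x_t = PowerTransformer().fit_transform(x)

skew_before = skewness(x[:, 0])
skew_after = skewness(x_t[:, 0])
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
print("missing value still NaN:", np.isnan(x_t[0, 0]))
```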


### Pipeline for ordinal variables

In [None]:
# Pipeline to process ordinal features

ordinal_pipe = Pipeline([
                         ('impute', SimpleImputer(strategy='most_frequent'))
])

### Pipeline for feature engineering

In [None]:
#Feature engineering pipe (using the custom transformer "ColumnAdd")
feature_eng_pipe = Pipeline([
                             ('median_imputer',SimpleImputer(strategy='median')),
                             ('add_columns', ColumnAdd()),
                             ('scaler', MinMaxScaler())
])

## Pipeline to encode Categorical Columns (one-hot encode)

In [None]:
# pipe for all categorical columns
cat_pipeline = Pipeline([
                         ('one_hot', OneHotEncoder()) # NaNs get a separate column
])
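
A minimal sketch of how the one-hot encoder treats missing categories, on a made-up wind direction column. It assumes a scikit-learn version that accepts NaN as a category (0.24 or later). The `handle_unknown="ignore"` argument is an addition for illustration and is not used in the pipeline above; it makes categories that only appear at prediction time encode as all zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical wind directions with one missing value
wind = np.array([["N"], ["S"], [np.nan], ["N"]], dtype=object)

enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(wind).toarray()
print(enc.categories_)
print(encoded)

# A direction that was never seen during fit becomes an all-zero row
unseen = enc.transform(np.array([["E"]], dtype=object)).toarray()
print(unseen)
```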

## Column Transformer

In [None]:
#Remember the column names that remain in the DataFrame after dropping (to use in the ColumnTransformer)
normal_num_columns = ['Temp9am', 'Temp3pm', 'Pressure9am', 'MinTemp', 'MaxTemp', 'Humidity9am', 'Humidity3pm', 'Pressure3pm'] # 'Temp9am', 'Temp3pm', 'Pressure9am' come first so drop_cols removes them at the end; their correlated partners MinTemp, MaxTemp and Pressure3pm remain
cat_columns = ['WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday'] # Since we want to predict the general weather at first, the Location column is not used

preprocess_pipe = ColumnTransformer([
                               ('numeric', numeric_pipeline, normal_num_columns), # Normally distributed columns
                               ('descrete_pipe', descrete_pipe, ['Rainfall', 'Sunshine']), # pipe to create bins of the rainfall and sunshine features 
                               ('not_normal_pipe', not_normal_pipe, ['WindSpeed9am', 'WindSpeed3pm', 'WindGustSpeed', 'Evaporation']), # not normally distributed columns
                               ('ordinal_pipe', ordinal_pipe, ['Cloud9am', 'Cloud3pm']),
                               ('feature_engineer', feature_eng_pipe, ['MinTemp', 'MaxTemp', 'Humidity9am', 'Humidity3pm', 'WindSpeed9am', 'WindSpeed3pm']), # just try to see the effect
                               ('cat_pipe', cat_pipeline, cat_columns) # categorical columns (one hot encode)

])

### Graphical display of the full preprocessing pipeline

In [None]:
from sklearn import set_config
# from sklearn.utils import estimator_html_repr to save the display of the pipe in html format


set_config(display='diagram') # set display to diagram instead of text

preprocess_pipe

In [None]:
# Process the training data before going into the pipeline
# this variable was already defined in the beginning, but is repeated here for clarity
df_train_edit = (df_train.copy()
                        .dropna(axis=0, subset=['RainTomorrow']) # drop the rows with no target value (useless in fully supervised training)
                        .dropna(subset=['Cloud9am', 'Evaporation']) # drop the rows with NaNs in columns that have a lot of NaNs
                        .dropna(axis=0, thresh=12)) # keep rows with at least 12 non-NaN values (drops rows that are more than ~50% NaN)

X_train = df_train_edit.drop('RainTomorrow', axis=1)
y_train = df_train_edit['RainTomorrow']

X_processed = preprocess_pipe.fit_transform(X_train)
cat_ordinal = LabelEncoder()
y_processed = cat_ordinal.fit_transform(y_train)

# Oversampling the Minority class of the target data (ONLY for df_train!)

An easy way to oversample the minority class in this example (RainTomorrow is "Yes") is to pick random samples from the minority class and add them to the data again. However, this would not add any extra information to the data. A better way is to use the Synthetic Minority Oversampling Technique (SMOTE).

"*SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b*."

  — Page 47, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

Basically it produces new samples that are just slightly different. 
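
The interpolation step from the quote can be sketched in a few lines of NumPy. This is a simplified illustration of the idea only, not the imblearn implementation; the function name `smote_like_samples` and the toy points are assumptions.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority samples and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbours
    new = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(len(X_min))               # random minority instance a
        b = neighbours[a, rng.integers(k)]         # random neighbour b of a
        lam = rng.random()                         # position on the segment a-b
        new[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return new

X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7]])
synthetic = smote_like_samples(X_min, n_new=10)
print(synthetic.round(2))
```

Because every synthetic point is a convex combination of two real minority points, it always lies on the line segment between them.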

A tip from the creators of the technique: first undersample the majority class before you oversample the minority class, because this tends to produce more plausible synthetic samples. However, since we already have a decent ratio (about 1:3), undersampling the majority class is not necessary here.

source:

    https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

In [None]:
oversample = SMOTE()
X_train_sampled, y_train_sampled = oversample.fit_resample(X_processed, y_processed)

In [None]:
plt.hist(y_train_sampled)
plt.show()

# import estimators

In [None]:
#importing estimators
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier # with the default hinge loss, SGD fits a linear SVM
from xgboost import XGBClassifier
#from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.neural_network import MLPClassifier

# import CV
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

# Training the models

Global inspection of the CV scores

In [None]:
# create dict of models
def get_models():
  models = dict()
  models['log'] = LogisticRegression()
  models['knn'] = KNeighborsClassifier()
  models['xgb'] = XGBClassifier()
  models['sgd'] = SGDClassifier()
  models['mlp'] = MLPClassifier()
  return models

In [None]:
def evaluate_model(model, X, y):
  cv = StratifiedKFold()
  scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1, error_score='raise')
  return scores

In [None]:
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():

  # original pipe adjusted to train on the train data with SMOTE (only applicable on the train data)
  imb_pipe = imblearn_pipeline([
                                ('preprocess', preprocess_pipe),
                                ('SMOTE', SMOTE()),
                                (name, model)
  ])
  scores = evaluate_model(imb_pipe, X_train, y_processed)
  results.append(scores)
  names.append(name)
  print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
plt.boxplot(results, labels=names, showmeans=True)
plt.show()

Several models reach about the same AUC score. To explore these models further, learning curves are plotted. In a learning curve we can observe which estimator tends to overfit.

# Plot Learning Curves of promising models

In [None]:
# Adapted from: https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html#sphx-glr-auto-examples-model-selection-plot-learning-curve-py

from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, axes=None, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate 3 plots: the test and training learning curve, the training
    samples vs fit times curve, the fit times vs score curve.

    Parameters
    ----------
    estimator : estimator instance
        An estimator instance implementing `fit` and `predict` methods which
        will be cloned for each validation.

    title : str
        Title for the chart.

    X : array-like of shape (n_samples, n_features)
        Training vector, where ``n_samples`` is the number of samples and
        ``n_features`` is the number of features.

    y : array-like of shape (n_samples) or (n_samples, n_features)
        Target relative to ``X`` for classification or regression;
        None for unsupervised learning.

    axes : array-like of shape (3,), default=None
        Axes to use for plotting the curves.

    ylim : tuple of shape (2,), default=None
        Defines minimum and maximum y-values plotted, e.g. (ymin, ymax).

    cv : int, cross-validation generator or an iterable, default=None
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:

          - None, to use the default 5-fold cross-validation,
          - integer, to specify the number of folds.
          - :term:`CV splitter`,
          - An iterable yielding (train, test) splits as arrays of indices.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : int or None, default=None
        Number of jobs to run in parallel.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

    train_sizes : array-like of shape (n_ticks,)
        Relative or absolute numbers of training examples that will be used to
        generate the learning curve. If the ``dtype`` is float, it is regarded
        as a fraction of the maximum size of the training set (that is
        determined by the selected validation method), i.e. it has to be within
        (0, 1]. Otherwise it is interpreted as absolute sizes of the training
        sets. Note that for classification the number of samples usually have
        to be big enough to contain at least one sample from each class.
        (default: np.linspace(0.1, 1.0, 5))
    """
    if axes is None:
        _, axes = plt.subplots(1, 3, figsize=(20, 5))

    axes[0].set_title(title)
    if ylim is not None:
        axes[0].set_ylim(*ylim)
    axes[0].set_xlabel("Training examples")
    axes[0].set_ylabel("Score")

    train_sizes, train_scores, test_scores, fit_times, _ = \
        learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
                       train_sizes=train_sizes, scoring='roc_auc',
                       return_times=True)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fit_times_mean = np.mean(fit_times, axis=1)
    fit_times_std = np.std(fit_times, axis=1)

    # Plot learning curve
    axes[0].grid()
    axes[0].fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1,
                         color="r")
    axes[0].fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1,
                         color="g")
    axes[0].plot(train_sizes, train_scores_mean, 'o-', color="r",
                 label="Training score")
    axes[0].plot(train_sizes, test_scores_mean, 'o-', color="g",
                 label="Cross-validation score")
    axes[0].legend(loc="best")

    # Plot n_samples vs fit_times
    axes[1].grid()
    axes[1].plot(train_sizes, fit_times_mean, 'o-')
    axes[1].fill_between(train_sizes, fit_times_mean - fit_times_std,
                         fit_times_mean + fit_times_std, alpha=0.1)
    axes[1].set_xlabel("Training examples")
    axes[1].set_ylabel("fit_times")
    axes[1].set_title("Scalability of the model")

    # Plot fit_time vs score
    axes[2].grid()
    axes[2].plot(fit_times_mean, test_scores_mean, 'o-')
    axes[2].fill_between(fit_times_mean, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1)
    axes[2].set_xlabel("fit_times")
    axes[2].set_ylabel("Score")
    axes[2].set_title("Performance of the model")

    return plt

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(20, 20))
title = "Learning Curves XGB"

xgb_pipe = imblearn_pipeline([
                                ('preprocess', preprocess_pipe),
                                ('SMOTE', SMOTE()),
                                ('xgb', XGBClassifier())
])

plot_learning_curve(xgb_pipe, title, X_train, y_processed, axes=axes[:, 0], ylim=(0.8, 0.9),
                    cv=5, n_jobs=-1)


title = "Learning Curves Logistic Regression"
log_pipe = imblearn_pipeline([
                                ('preprocess', preprocess_pipe),
                                ('SMOTE', SMOTE()),
                                ('log', LogisticRegression())
])

plot_learning_curve(log_pipe, title, X_train, y_processed, axes=axes[:, 1], ylim=(0.8, 0.9),
                    cv=5, n_jobs=-1)

title = "Learning curves SGD"
sgd_pipe = imblearn_pipeline([
                                ('preprocess', preprocess_pipe),
                                ('SMOTE', SMOTE()),
                                ('sgd', SGDClassifier())
])

plot_learning_curve(sgd_pipe, title, X_train, y_processed, axes=axes[:, 2], ylim=(0.8, 0.9),
                    cv=5, n_jobs=-1)



plt.show()

So, all estimators are pretty good predictors. But which is the best? To find out, we run a search with the Hyperopt library to find the best estimator with the best hyperparameters. Logistic regression is left out of this search, since it can be expressed in SGD as a hyperparameter (the log loss).

A useful and easy wrapper around Hyperopt for sklearn is hpsklearn. However, this library does not (yet) support ROC AUC as a metric, so we use the original Hyperopt library instead. We could also use SMOTE during the hyperparameter tuning, but I find it more convenient to stick with the original values (without the synthesized ones) and handle the class imbalance with class weights.
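
The class-weight settings that stand in for SMOTE during tuning (`scale_pos_weight` for XGBoost and `class_weight='balanced'` for SGD) can be derived from the label counts. The sketch below uses a toy label vector with a 1:3 ratio, which is an assumption that mirrors the ratio reported for the target.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with a 1:3 ratio of positives (1) to negatives (0)
y_toy = np.array([0] * 6 + [1] * 2)

# XGBoost convention: number of negatives divided by number of positives
scale_pos_weight = (y_toy == 0).sum() / (y_toy == 1).sum()

# scikit-learn's 'balanced' weights: n_samples / (n_classes * count per class)
balanced = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_toy)

print("scale_pos_weight:", scale_pos_weight)
print("balanced class weights:", balanced)
```

Both settings make an error on a rainy day cost about three times as much as an error on a dry day, without adding synthetic rows.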

# Hyperopt for hyperparameter optimization
(With XGBoost and SGD)

In [None]:
from hyperopt import fmin, hp, tpe, Trials, space_eval, STATUS_OK
from hyperopt.pyll import scope
from hyperopt.pyll.stochastic import sample
from functools import partial

## Search spaces

In [None]:
###################################################
##==== XGBoost hyperparameters search space ====##
###################################################
# parameters adopted from hpsklearn (github)
hp_space = {
    
    'clf_type': hp.choice('clf_type', [
                                       
    {'type': 'xgb',
     
      'clf':{
            'max_depth' :scope.int(hp.uniform('max_depth', 1, 11)),
            'learning_rate':hp.loguniform('learning_rate', np.log(0.0001), np.log(0.5)) - 0.0001,
            # 'n_estimators' : scope.int(hp.quniform('n_estimators', 100, 1000, 100)), # just do 100 since more is mostly better, but more cpu expensive
            'n_estimators':100,
            'scale_pos_weight':3, # target data distribution is approx. 1:3
            'gamma': hp.loguniform('gamma', np.log(0.0001), np.log(5)) - 0.0001,
            'min_child_weight': scope.int(hp.loguniform('min_child_weight', np.log(1), np.log(100))),
            'subsample': hp.uniform('subsample', 0.5, 1),
            'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
            'colsample_bylevel':hp.uniform('colsample_bylevel', 0.5, 1),
            'reg_alpha':hp.loguniform('reg_alpha', np.log(0.0001), np.log(1)) - 0.0001,
            'reg_lambda':hp.loguniform('reg_lambda', np.log(1), np.log(4)),
            'n_jobs':-1,
            }},

            #adopted from hpsklearn
      {'type': 'sgd',
          'clf':{
             
            'penalty': hp.pchoice('penalty', [(0.40, 'l2'),(0.35, 'l1'), (0.25, 'elasticnet')]),
            'loss': hp.pchoice('loss',[
            (0.25, 'hinge'),
            (0.25, 'log'), # called 'log_loss' in scikit-learn >= 1.1
            (0.25, 'modified_huber'),
            (0.05, 'squared_hinge'),
            (0.05, 'perceptron'),
            (0.05, 'squared_loss'), # called 'squared_error' in scikit-learn >= 1.0
            (0.05, 'huber'),
            (0.03, 'epsilon_insensitive'),
            (0.02, 'squared_epsilon_insensitive')]),

            'alpha': hp.loguniform('alpha', np.log(1e-6), np.log(1e-1)),
            'l1_ratio': hp.uniform('l1_ratio', 0, 1),
            'epsilon': hp.loguniform('epsilon', np.log(1e-7), np.log(1)),
            'eta0':hp.loguniform('eta0', np.log(1e-5), np.log(1e-1)),
            'power_t':hp.uniform('power_t', 0, 1),
            'class_weight':'balanced',
            'n_jobs':-1}}

    ])}


In [None]:
sample(hp_space['clf_type']['type'])

In [None]:
def f_clf1(hps):
    """
    Constructs estimator
    
    Parameters:
    ----------------
    hps : sample point from search space
    
    Returns:
    ----------------
    model : sklearn.pipeline.Pipeline with hyperparameters set up as per hps
    """


    if hps['clf_type']['type'] == 'xgb':
      model = Pipeline([
                      ('preprocess', preprocess_pipe),
                      ('xgb', XGBClassifier(**hps['clf_type']['clf']))
      ])  # ** unpacks the dictionary into keyword arguments (nested dictionaries would need extra handling)

    elif hps['clf_type']['type'] == 'sgd':
      model = Pipeline([
                      ('preprocess', preprocess_pipe),
                      ('sgd', SGDClassifier(**hps['clf_type']['clf']))
            ])
    
    return model

In [None]:
def f_to_min1(hps, X, y, ncv=5):
    """
    Target function for optimization
    
    Parameters:
    ----------------
    hps : sample point from search space
    X : feature matrix
    y : target array
    ncv : number of folds for cross-validation
    
    Returns:
    ----------------
    : target function value (negative mean cross-val ROC-AUC score)
    """

    estimator = f_clf1(hps)
    cv_res = cross_val_score(estimator, X, y, cv=StratifiedKFold(ncv), 
                             scoring='roc_auc', n_jobs=-1)
    
    return {
        'loss': -cv_res.mean(), # return the negative value because hyperopt minimizes the loss (while we want to maximize the positive AUC)
        'cv_std': cv_res.std(),
        'status': STATUS_OK}

Hints from https://www.kaggle.com/fanvacoolt/tutorial-on-hyperopt 

- There is a rule of thumb that for proper search space exploration you would require around 25 runs per dimension to converge. It is entirely heuristical, but might help to estimate how many steps you would need;

- Always back up your results into Trials object, and possibly pickle it to hard drive. It is very frustrating to lose several hours of optimization due to a power cut or OS freezing;

- Check carefully sign of your target function. Shouldn't a minus be there?

- Don't use full capacity of your model for hyperparameter tuning. This means: you don't need 1000 trees in your XGBoost, nor do you need all 8 million samples of your training set. It is very unlikely that optimization on a reduced set/capacity would yield significantly different results, while computation time would be much lower;

- MOST IMPORTANT: Never substitute proper feature engineering with extensive hyperparameter optimization. Former is much more important - untuned model with great features easily outperforms heavily optimized model without intelligent design. Do some EDA, try things, find out what works, what might work and what is garbage, and then, before you go to sleep - launch optimization for the night. Not the other way around.


In [None]:
trials_clf = Trials()
best_clf1 = fmin(partial(f_to_min1, X=X_train, y=y_processed), hp_space, algo=tpe.suggest, max_evals=10, trials=trials_clf, verbose=1) # the loss does not improve that much after 10 evals

The loss does not improve that much after 10 evals

In [None]:
space_eval(hp_space, best_clf1)

So XGBoost it is.

In [None]:
best_clf1

In [None]:
clf_optimized = f_clf1(space_eval(hp_space, best_clf1)).fit(X_train, y_processed)

In [None]:
# evaluate on the test set
from sklearn.metrics import roc_auc_score

df_test = (df_test.dropna(axis=0, subset=['RainTomorrow'])
                        .dropna(axis=0, thresh=12)) # keep rows with at least 12 non-NaN values

X_test = df_test.drop('RainTomorrow', axis=1)
y_test = df_test['RainTomorrow']

y_test_processed = cat_ordinal.transform(y_test)

y_pred = clf_optimized.predict_proba(X_test)[:,1]

roc_auc_score(y_test_processed, y_pred)

# print(test_score)
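
For readers less familiar with the metric: ROC AUC is computed from the predicted scores or probabilities, not from hard class labels, and equals the fraction of (rainy, dry) pairs in which the rainy day receives the higher score. A tiny sketch with made-up scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities for four days
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, y_score)

# The same number from counting correctly ordered (positive, negative) pairs
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

print(auc, pairwise)
```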

Suggestions:

To further improve the model the following things could be performed:

- Feature selection

- More feature engineering

- Increase the number of estimators

- Ensemble some estimators

- Create a neural network, for example in Keras


Thanks for reading!