# 8. Observations

In this notebook, we check some suggestions received after the official project presentation. We continue to use the [Rain in Australia](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package) dataset, available on Kaggle, which contains about 10 years of daily weather observations from many locations across Australia.

### Index:
1. [Packages required](#1.-Packages-required)
2. [Validation methods](#2.-Validation-methods)
3. [delta% interpretation](#3.-delta%-interpretation)
4. [Conclusion](#4.-Conclusion)

# 1. Packages required

In [None]:
!pip install lightgbm

In [5]:
import os
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# 2. Validation methods

Throughout the project, we have used an "Out-Of-Time" validation, given the temporal nature of the data. However, since we are working with daily measurements, each record can be treated as an independent observation. It is therefore reasonable to expect that validating on a random split of the data will give equivalent results. We check this by building the CART, Random Forest and LightGBM models with a random 80/20 hold-out split, which is what we call cross validation in the rest of this notebook.

In [9]:
# Load the model input data:
weather = pd.read_parquet('../data/04_model_input/master.parquet')

# Select the numeric feature columns, excluding the target:
model_columns = list(set(weather.select_dtypes(include='number').columns) - set(['RainTomorrow']))

# Random 80/20 hold-out split that ignores the date of each record
# (missing values are encoded as -1):
X = weather[model_columns].fillna(-1)
y = weather.RainTomorrow
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
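The split above is a single random hold-out. If k-fold cross validation in the strict sense were wanted instead, it might look like the following sketch. It runs on synthetic data from `make_classification` as a stand-in for the weather features (an assumption, not the project data), and the names `X_demo`, `y_demo` and `cv_summary` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary problem standing in for RainTomorrow (assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.78, 0.22], random_state=0
)

# Five shuffled folds that keep the class proportions of the target.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    DecisionTreeClassifier(max_depth=3, random_state=0),
    X_demo,
    y_demo,
    cv=cv,
    scoring="roc_auc",
    return_train_score=True,
)

# Convert the per-fold AUC into the Gini coefficient used in this notebook.
train_gini = 2 * scores["train_score"] - 1
test_gini = 2 * scores["test_score"] - 1
cv_summary = {
    "Train_Gini": train_gini.mean(),
    "Test_Gini": test_gini.mean(),
    "Test_Gini_std": test_gini.std(),
}
print(cv_summary)
```

Averaging over folds also gives a spread for the test Gini, which a single hold-out split cannot provide.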

First, we generate different decision trees varying the maximum depth allowed:

In [10]:
#Building CART models:
metrics = {}
for max_depth in [1, 3, 5, 10, 15, 20, 30]:
    model = DecisionTreeClassifier(max_depth = max_depth)
    model.fit(X_train, y_train);
    
    train_pred = model.predict_proba(X_train)[:, 1]
    test_pred = model.predict_proba(X_test)[:, 1]

    metrics['DT_'+ str(max_depth)] = {
        'Train_Gini': 2*roc_auc_score(y_train, train_pred)-1,
        'Test_Gini': 2*roc_auc_score(y_test, test_pred)-1
    }

metrics_pd = pd.DataFrame.from_dict(metrics, orient='index',columns=['Train_Gini', 'Test_Gini'])
metrics_pd['delta%'] = 100*(metrics_pd.Test_Gini - metrics_pd.Train_Gini) / metrics_pd.Train_Gini
metrics_pd

Model,Train_Gini,Test_Gini,delta%
DT_1,0.346214,0.349199,0.862177
DT_3,0.573851,0.570802,-0.531304
DT_5,0.652942,0.644036,-1.363879
DT_10,0.760784,0.682462,-10.294895
DT_15,0.898604,0.51568,-42.613262
DT_20,0.979568,0.371941,-62.030033
DT_30,0.999924,0.398351,-60.161868


As we can see, the conclusion is practically the same as with the "Out-Of-Time" validation: the models with maximum depth 3 and 5 are the ones to choose, and now the model with depth 1 is also acceptable. The Gini values obtained for these models are quite similar to those obtained in the CART notebook.
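The two quantities reported in every table, the Gini coefficient and delta%, can be isolated in a pair of small helpers. This is a minimal sketch: the function names `gini` and `delta_pct` are hypothetical, and the toy labels and scores are made up for illustration. The formulas themselves are the ones used in the cells of this notebook.

```python
from sklearn.metrics import roc_auc_score


def gini(y_true, y_score):
    """Gini coefficient derived from the ROC AUC: 2 * AUC - 1."""
    return 2 * roc_auc_score(y_true, y_score) - 1


def delta_pct(train_gini, test_gini):
    """Relative change of the test Gini with respect to the train Gini, in percent."""
    return 100 * (test_gini - train_gini) / train_gini


# Toy example (made-up scores): one of the four positive/negative pairs is misordered.
y_true = [0, 0, 1, 1]
perfect = [0.1, 0.2, 0.8, 0.9]
partial = [0.1, 0.6, 0.4, 0.9]

print(gini(y_true, perfect), gini(y_true, partial))
print(delta_pct(gini(y_true, perfect), gini(y_true, partial)))

# Recomputing delta% for the DT_10 row from its rounded Train_Gini and Test_Gini.
print(delta_pct(0.760784, 0.682462))
```

A negative delta% means the model ranks the test records worse than the training records, which is the usual sign of overfitting.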

Then, we generate Random Forest and LightGBM models varying the number of trees built:

In [16]:
#Building Random Forest models:
metrics = {}
for n_estimators in [1, 3, 5, 10, 15, 20, 30, 50, 100, 200, 500, 1000]:
    model = RandomForestClassifier(n_estimators = n_estimators, max_depth = 5)
    model.fit(X_train, y_train);
    
    train_pred = model.predict_proba(X_train)[:, 1]
    test_pred = model.predict_proba(X_test)[:, 1]

    metrics['RF_'+ str(n_estimators)] = {
        'Train_Gini': 2*roc_auc_score(y_train, train_pred)-1,
        'Test_Gini': 2*roc_auc_score(y_test, test_pred)-1
    }

metrics_pd = pd.DataFrame.from_dict(metrics, orient='index',columns=['Train_Gini', 'Test_Gini'])
metrics_pd['delta%'] = 100*(metrics_pd.Test_Gini - metrics_pd.Train_Gini) / metrics_pd.Train_Gini
metrics_pd

Model,Train_Gini,Test_Gini,delta%
RF_1,0.583297,0.580785,-0.430639
RF_3,0.622282,0.609559,-2.04458
RF_5,0.661855,0.657672,-0.631953
RF_10,0.678603,0.672058,-0.964484
RF_15,0.682151,0.677459,-0.687704
RF_20,0.686987,0.679835,-1.041011
RF_30,0.682475,0.676544,-0.869134
RF_50,0.690508,0.683733,-0.98104
RF_100,0.692939,0.686644,-0.908442
RF_200,0.690666,0.684331,-0.917195


In [17]:
#Building LightGBM models:
metrics = {}
for n_estimators in [1, 3, 5, 10, 15, 20, 30, 50, 100, 200, 500, 1000]:
    model = LGBMClassifier(n_estimators = n_estimators, max_depth = 5)
    model.fit(X_train, y_train);
    
    train_pred = model.predict_proba(X_train)[:, 1]
    test_pred = model.predict_proba(X_test)[:, 1]

    metrics['LGBM_'+ str(n_estimators)] = {
        'Train_Gini': 2*roc_auc_score(y_train, train_pred)-1,
        'Test_Gini': 2*roc_auc_score(y_test, test_pred)-1
    }

metrics_pd = pd.DataFrame.from_dict(metrics, orient='index',columns=['Train_Gini', 'Test_Gini'])
metrics_pd['delta%'] = 100*(metrics_pd.Test_Gini - metrics_pd.Train_Gini) / metrics_pd.Train_Gini
metrics_pd

Model,Train_Gini,Test_Gini,delta%
LGBM_1,0.652921,0.644048,-1.358885
LGBM_3,0.692352,0.685808,-0.945136
LGBM_5,0.698796,0.691258,-1.078697
LGBM_10,0.716791,0.709094,-1.07384
LGBM_15,0.728525,0.72044,-1.109796
LGBM_20,0.737042,0.728487,-1.160768
LGBM_30,0.751131,0.741624,-1.265722
LGBM_50,0.771237,0.759669,-1.499942
LGBM_100,0.795702,0.776631,-2.396778
LGBM_200,0.824733,0.790275,-4.178044


As we can see, the Gini coefficients obtained with cross validation do not differ from the results of previous notebooks. However, delta% is in general slightly smaller in magnitude. This reduction may be due to some relationship between the target and time, although that is only a hypothesis.

In summary, the selected models and the performance obtained are quite similar, but the deviation between the training and test subsets is slightly different.
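One simple way to start probing the hypothesis of a relationship between the target and time would be to look at the rain rate per year. The sketch below uses a randomly generated frame with the column names `Date` and `RainTomorrow` from this notebook. The data itself is synthetic, and treating `Date` as a datetime column is an assumption about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic daily records with a constant rain probability (assumption, not project data).
rng = np.random.default_rng(0)
dates = pd.date_range("2007-01-01", "2017-12-31", freq="D")
demo = pd.DataFrame({
    "Date": dates,
    "RainTomorrow": rng.binomial(1, 0.22, size=len(dates)),
})

# Share of rainy days and number of records per calendar year.
rain_rate_by_year = (
    demo.assign(Year=demo["Date"].dt.year)
        .groupby("Year")["RainTomorrow"]
        .agg(rain_rate="mean", n_days="count")
)
print(rain_rate_by_year)
```

On the real data, a rain rate that drifts from year to year would be one sign that a random split and a time-based split are not measuring the same thing.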

# 3. delta% interpretation

In the project report, we interpret delta% as the deviation between the training and test subsets, and we suppose that this value will increase by the same amount every two years (the length of our test subset). In this section, we analyze whether this claim holds exactly and, if not, why.

First, we split the data into three groups:
* Train: 2007 - 2012
* Test1: 2013 - 2014
* Test2: 2015 - 2017

Then, we build the same models as in the last section using the training data and check delta% on both test sets. If the claim is true, $delta_2$ will be about twice $delta_1$. Let's check it.

In [24]:
# Split by date: train before 2013, test1 = 2013-2014, test2 = 2015 onwards
# (missing values are encoded as -1):
test1_date = '2013-01-01'
test2_date = '2015-01-01'
train = weather[weather.Date < test1_date].fillna(-1)
test = weather[weather.Date >= test1_date].fillna(-1)
test1 = test[test.Date < test2_date]
test2 = test[test.Date >= test2_date]

#Building CART models:
metrics = {}
for max_depth in [1, 3, 5, 10, 15, 20, 30]:
    model = DecisionTreeClassifier(max_depth = max_depth)
    model.fit(train[model_columns], train.RainTomorrow);
    
    train_pred = model.predict_proba(train[model_columns])[:, 1]
    test_pred_1 = model.predict_proba(test1[model_columns])[:, 1]
    test_pred_2 = model.predict_proba(test2[model_columns])[:, 1]

    metrics['DT_'+ str(max_depth)] = {
        'Train_Gini': 2*roc_auc_score(train.RainTomorrow, train_pred)-1,
        'Test1_Gini': 2*roc_auc_score(test1.RainTomorrow, test_pred_1)-1,
        'Test2_Gini': 2*roc_auc_score(test2.RainTomorrow, test_pred_2)-1
    }

metrics_pd = pd.DataFrame.from_dict(metrics, orient='index',columns=['Train_Gini', 'Test1_Gini', 'Test2_Gini'])
metrics_pd['delta%_1'] = 100*(metrics_pd.Test1_Gini - metrics_pd.Train_Gini) / metrics_pd.Train_Gini
metrics_pd['delta%_2'] = 100*(metrics_pd.Test2_Gini - metrics_pd.Train_Gini) / metrics_pd.Train_Gini
metrics_pd

Model,Train_Gini,Test1_Gini,Test2_Gini,delta%_1,delta%_2
DT_1,0.379782,0.366532,0.341015,-3.488659,-10.207481
DT_3,0.592657,0.58256,0.560669,-1.703593,-5.397341
DT_5,0.665618,0.658775,0.627401,-1.02813,-5.741525
DT_10,0.784471,0.659522,0.624405,-15.927867,-20.404392
DT_15,0.931453,0.448049,0.366332,-51.897891,-60.670905
DT_20,0.990921,0.36849,0.345109,-62.813331,-65.172859
DT_30,0.999993,0.391008,0.381837,-60.898938,-61.816017


In [25]:
#Building Random Forest models:
metrics = {}
for n_estimators in [1, 3, 5, 10, 15, 20, 30, 50, 100, 200, 500, 1000]:
    model = RandomForestClassifier(n_estimators = n_estimators, max_depth = 5)
    model.fit(train[model_columns], train.RainTomorrow);
    
    train_pred = model.predict_proba(train[model_columns])[:, 1]
    test_pred_1 = model.predict_proba(test1[model_columns])[:, 1]
    test_pred_2 = model.predict_proba(test2[model_columns])[:, 1]

    metrics['RF_'+ str(n_estimators)] = {
        'Train_Gini': 2*roc_auc_score(train.RainTomorrow, train_pred)-1,
        'Test1_Gini': 2*roc_auc_score(test1.RainTomorrow, test_pred_1)-1,
        'Test2_Gini': 2*roc_auc_score(test2.RainTomorrow, test_pred_2)-1
    }

metrics_pd = pd.DataFrame.from_dict(metrics, orient='index',columns=['Train_Gini', 'Test1_Gini', 'Test2_Gini'])
metrics_pd['delta%_1'] = 100*(metrics_pd.Test1_Gini - metrics_pd.Train_Gini) / metrics_pd.Train_Gini
metrics_pd['delta%_2'] = 100*(metrics_pd.Test2_Gini - metrics_pd.Train_Gini) / metrics_pd.Train_Gini
metrics_pd

Model,Train_Gini,Test1_Gini,Test2_Gini,delta%_1,delta%_2
RF_1,0.520147,0.488581,0.45898,-6.068761,-11.759569
RF_3,0.619372,0.589994,0.570631,-4.743187,-7.869409
RF_5,0.665561,0.654647,0.642283,-1.639832,-3.497467
RF_10,0.696533,0.669743,0.654997,-3.846185,-5.963282
RF_15,0.684245,0.666,0.65502,-2.666407,-4.271048
RF_20,0.695906,0.675316,0.660895,-2.958731,-5.030989
RF_30,0.696993,0.680269,0.668491,-2.399516,-4.089281
RF_50,0.69623,0.680602,0.669001,-2.244714,-3.910912
RF_100,0.700795,0.684006,0.672896,-2.395647,-3.981108
RF_200,0.700758,0.687444,0.672635,-1.900046,-4.01322


In [26]:
#Building LightGBM models:
metrics = {}
for n_estimators in [1, 3, 5, 10, 15, 20, 30, 50, 100, 200, 500, 1000]:
    model = LGBMClassifier(n_estimators = n_estimators, max_depth = 5)
    model.fit(train[model_columns], train.RainTomorrow);
    
    train_pred = model.predict_proba(train[model_columns])[:, 1]
    test_pred_1 = model.predict_proba(test1[model_columns])[:, 1]
    test_pred_2 = model.predict_proba(test2[model_columns])[:, 1]

    metrics['LGBM_'+ str(n_estimators)] = {
        'Train_Gini': 2*roc_auc_score(train.RainTomorrow, train_pred)-1,
        'Test1_Gini': 2*roc_auc_score(test1.RainTomorrow, test_pred_1)-1,
        'Test2_Gini': 2*roc_auc_score(test2.RainTomorrow, test_pred_2)-1
    }

metrics_pd = pd.DataFrame.from_dict(metrics, orient='index',columns=['Train_Gini', 'Test1_Gini', 'Test2_Gini'])
metrics_pd['delta%_1'] = 100*(metrics_pd.Test1_Gini - metrics_pd.Train_Gini) / metrics_pd.Train_Gini
metrics_pd['delta%_2'] = 100*(metrics_pd.Test2_Gini - metrics_pd.Train_Gini) / metrics_pd.Train_Gini
metrics_pd

Model,Train_Gini,Test1_Gini,Test2_Gini,delta%_1,delta%_2
LGBM_1,0.665526,0.658645,0.631214,-1.033882,-5.155604
LGBM_3,0.698234,0.684882,0.670203,-1.912279,-4.014483
LGBM_5,0.704639,0.689854,0.675035,-2.098209,-4.201335
LGBM_10,0.727133,0.708405,0.700353,-2.575655,-3.683001
LGBM_15,0.737453,0.717584,0.708335,-2.694297,-3.94844
LGBM_20,0.746249,0.724707,0.714658,-2.886718,-4.233345
LGBM_30,0.763406,0.736676,0.727937,-3.501456,-4.646072
LGBM_50,0.784869,0.751252,0.741935,-4.283109,-5.470279
LGBM_100,0.813026,0.765601,0.755906,-5.833176,-7.02559
LGBM_200,0.846914,0.774619,0.7648,-8.536274,-9.695592


Having checked this, we can say that assuming $delta_2 \simeq 2\times delta_1$ is too strong a claim. It is true, however, that $|delta_2| > |delta_1|$ in every model, in some cases by a large margin. Thus, we recognize that the initial claim is not exactly true, but the magnitude of delta% does increase when the test data is further in the future.
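The comparison can be made explicit by computing the ratio between both deltas. The sketch below copies a handful of rows from the three tables of this section. The choice of rows is arbitrary, and the names `rows` and `deltas` are hypothetical.

```python
import pandas as pd

# (delta%_1, delta%_2) pairs copied from the CART, Random Forest and LightGBM tables.
rows = {
    "DT_3": (-1.703593, -5.397341),
    "DT_5": (-1.028130, -5.741525),
    "RF_100": (-2.395647, -3.981108),
    "LGBM_100": (-5.833176, -7.025590),
    "LGBM_200": (-8.536274, -9.695592),
}
deltas = pd.DataFrame.from_dict(rows, orient="index", columns=["delta%_1", "delta%_2"])

# A ratio of 2 would match the original claim; abs_increase is the extra loss in points.
deltas["ratio"] = deltas["delta%_2"] / deltas["delta%_1"]
deltas["abs_increase"] = deltas["delta%_2"].abs() - deltas["delta%_1"].abs()
print(deltas.round(2))
```

For these rows the ratio is always above 1 but falls on both sides of 2, which is consistent with the reading that the deviation grows with time without doubling.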

# 4. Conclusion

According to the results obtained, there is no significant difference between using an "Out-Of-Time" validation and cross validation. We only find differences in the delta% values, which are probably due to some relationship between the target and the temporal structure of the data.

On the other hand, we have verified that the further ahead in time the prediction is, the worse the delta% obtained. Nevertheless, the rate of increase stated in the project report is not exact.