# 7  Modeling - selection of the best deep learning models with Keras

<b> Purpose of the action </b> - checking accuracy of prediction on test set using 2 different types of Neural Networks:
- ANN with Dense layers only
- RNN with LSTM layers

<b> Action plan </b>:
- Test 10 different networks of each type
- Use the Keras wrapper for the Scikit-Learn API and ParameterSampler to generate models with random hyperparameters
- Use the training set to fit each model and the validation set to evaluate it
- Select the single best model of each type, retrain it on all data and make predictions on the test set
- Select the 5 best models of each type and create an AveragingNetworkClassifier from them
- Train it on all data and make predictions on the test set
- Create a LargeAveragingClassifier from the previously created AveragingNetworkClassifiers, then make predictions on the test set (all networks are already trained)
- Compare prediction accuracy and other metrics on the test set and save the results for future use

## 7.1 Import necessary libraries and modules

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf 
from sklearn.pipeline import Pipeline
from classifiers import AveragingNetworkClassifier, LargeAveragingClassifier
from preprocessing_pipelines import basic_preprocess_pipeline, ImportantFeaturesSelector
from modeling import show_best_models, build_ann, build_rnn, select_best_networks, Metrics, AnnBuilder, RnnBuilder

## 7.2 Import base data sets

In [2]:
# data sets for selecting best models of each type
train_set = pd.read_csv("./preprocessed_data/processed_base_train_set.csv", index_col=0)
validation_set = pd.read_csv("./preprocessed_data/processed_base_validation_set.csv", index_col=0)

# data sets for final fitting and prediction
train_set_all = pd.read_csv('./preprocessed_data/train_set_stage2.csv', index_col=0)
test_set = pd.read_csv('./preprocessed_data/test_set_stage2.csv', index_col=0)

## 7.3 Split datasets into feature and label sets

In [3]:
# feature and label sets for selecting models
X_train, y_train = np.array(train_set.drop(columns='FTR')), np.array(train_set['FTR'])
X_val, y_val = np.array(validation_set.drop(columns='FTR')), np.array(validation_set['FTR'])

# feature and label sets for final training and prediction
X_train_all, y_train_all = train_set_all.drop(columns='FTR'), np.array(train_set_all['FTR'])
X_test, y_test = test_set.drop(columns='FTR'), np.array(test_set['FTR'])

## 7.4 Create placeholders to hold prediction results

In [4]:
# placeholder to hold prediction results
prediction_metrics = Metrics()

# list to hold the averaging model objects
averaging_models = []

## 7.5 Artificial Neural Network (ANN)

### 7.5.1  Select best models

Select the best models using the Keras wrapper for the Scikit-Learn API and ParameterSampler to generate models with different parameters.

In [5]:
# define params for random grid search
params_grid = {
    'n_hiden_layers': [1, 1, 2, 2, 3],
    'hidden_layer_size' : [64, 32, 16],
    'batch_size' : [8, 16]
}

# add early stopping callback
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_accuracy', 
    verbose=0,
    patience=10,
    mode='max',
    restore_best_weights=True)

# select the 5 best neural networks out of 10 randomly sampled configurations
best_models, best_scoring = select_best_networks(
    build_func=build_ann,
    params_grid=params_grid,
    n_iter=10,
    random_state=23,
    X_train=X_train,
    y_train=y_train,
    X_val=X_val,
    y_val=y_val,
    early_stopping=early_stopping,
    kind='ann',
    epochs=50,
    shuffle=True,
    verbose=1,
    n_best_models=5,
)
# show best selected models
show_best_models(best_models, best_scoring)

KerasClassifier{'n_hiden_layers': 1, 'hidden_layer_size': 64, 'batch_size': 16}
Accuracy score on training set: 0.6842 | Accuracy score on validation set: 0.6697
-------------------------------------------------------------------------------------------------------------------------------
KerasClassifier{'n_hiden_layers': 1, 'hidden_layer_size': 32, 'batch_size': 16}
Accuracy score on training set: 0.7131 | Accuracy score on validation set: 0.703
-------------------------------------------------------------------------------------------------------------------------------
KerasClassifier{'n_hiden_layers': 1, 'hidden_layer_size': 16, 'batch_size': 8}
Accuracy score on training set: 0.7121 | Accuracy score on validation set: 0.703
-------------------------------------------------------------------------------------------------------------------------------
KerasClassifier{'n_hiden_layers': 1, 'hidden_layer_size': 64, 'batch_size': 16}
Accuracy score on training set: 0.7086 | Accuracy sco
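
The early-stopping callback configured above monitors `val_accuracy` in `max` mode with a patience of 10 epochs and restores the best weights. The sketch below is a simplified pure-Python imitation of that stopping rule, not the Keras implementation; the function name `early_stopping_epoch` and the accuracy values are made up for illustration.

```python
def early_stopping_epoch(val_accuracies, patience=10):
    """Return (stop_epoch, best_epoch) for a 'max'-mode monitor.

    Simplified sketch: training stops once the monitored value has not
    improved for `patience` consecutive epochs. `best_epoch` is the epoch
    whose weights would be restored.
    """
    best_value = float('-inf')
    best_epoch = None
    epochs_without_improvement = 0

    for epoch, value in enumerate(val_accuracies):
        if value > best_value:
            # strict improvement: remember this epoch and reset the counter
            best_value, best_epoch = value, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch

    # the monitor never stalled long enough: training runs to the last epoch
    return len(val_accuracies) - 1, best_epoch


# Made-up validation accuracies, for illustration only
history = [0.60, 0.65, 0.70, 0.69, 0.68, 0.70, 0.69]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

In this toy run the accuracy peaks at epoch 2, fails to improve for three epochs, and training stops at epoch 5 with the epoch-2 weights kept.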

### 7.5.2 Extract single models from list

In [6]:
# take the Keras model from the second step of each of the 5 selected pipelines
clf_1, clf_2, clf_3, clf_4, clf_5 = (
    pipe.steps[1][1].model for pipe in best_models[:5, 1]
)

### 7.5.3 Create complete pipelines (with scaling, encoding and feature selection) for each individual classifier

In [7]:
# all base preprocess pipeline and transformers come from module preprocessing_pipelines.py
pipe_clf_1 = Pipeline([ ('preprocess_pipeline', basic_preprocess_pipeline),
                        ('feature_seletion', ImportantFeaturesSelector(clf_1, 'ann') ),
                        ('classification', AnnBuilder(clf_1))
                      ])

pipe_clf_2 = Pipeline([ ('preprocess_pipeline', basic_preprocess_pipeline),
                        ('feature_seletion', ImportantFeaturesSelector(clf_2, 'ann') ),
                        ('classification', AnnBuilder(clf_2))
                      ])

pipe_clf_3 = Pipeline([ ('preprocess_pipeline', basic_preprocess_pipeline),
                        ('feature_seletion', ImportantFeaturesSelector(clf_3, 'ann') ),
                        ('classification', AnnBuilder(clf_3))
                      ])

pipe_clf_4 = Pipeline([ ('preprocess_pipeline', basic_preprocess_pipeline),
                        ('feature_seletion', ImportantFeaturesSelector(clf_4, 'ann') ),
                        ('classification', AnnBuilder(clf_4))
                      ])

pipe_clf_5 = Pipeline([ ('preprocess_pipeline', basic_preprocess_pipeline),
                        ('feature_seletion', ImportantFeaturesSelector(clf_5, 'ann') ),
                        ('classification', AnnBuilder(clf_5))
                      ])

### 7.5.4  Make AveragingNetworkClassifier from the 5 best selected models (pipelines)

In [8]:
avg_clf = AveragingNetworkClassifier(base_estimators=[pipe_clf_1,
                                                      pipe_clf_2,
                                                      pipe_clf_3,
                                                      pipe_clf_4,
                                                      pipe_clf_5],
                                      voting='soft')
# print(avg_clf.base_estimators[0])
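
The idea behind `voting='soft'` can be illustrated with a few lines of NumPy: the class probabilities predicted by the base models are averaged, and the class with the highest mean probability wins. This is a sketch under that assumption; the internals of `AveragingNetworkClassifier` are not shown in this notebook, and `soft_vote` and the probabilities below are hypothetical.

```python
import numpy as np


def soft_vote(probabilities):
    """Average class probabilities over models and pick the most likely class.

    probabilities: array-like of shape (n_models, n_samples, n_classes)
    """
    mean_proba = np.mean(np.asarray(probabilities, dtype=float), axis=0)
    return mean_proba, np.argmax(mean_proba, axis=1)


# Made-up probabilities from three hypothetical models for two samples
probas = [
    [[0.6, 0.4], [0.3, 0.7]],
    [[0.4, 0.6], [0.2, 0.8]],
    [[0.8, 0.2], [0.4, 0.6]],
]
mean_proba, labels = soft_vote(probas)
print(mean_proba)
print(labels)
```

Note that the second model alone would assign the first sample to class 1, but the averaged probabilities favour class 0. Smoothing out such individual disagreements is the motivation for averaging.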

### 7.5.5 Fit single and averaging models on the entire data set 

In [9]:
# train models on all data
pipe_clf_1.fit(X_train_all, y_train_all)
avg_clf.fit(X_train_all, y_train_all)

# give models a name
clf_1_name = 'ANN'
avg_clf_name = f'Averaging{clf_1_name}'
print(clf_1_name, avg_clf_name)

ANN AveragingANN


### 7.5.6 Calculate metrics of prediction and add results to the lists

In [10]:
# add prediction metrics for single classifier to placeholder
prediction_metrics.add_metrics(pipe_clf_1, clf_1_name, X_test, y_test)

# add prediction metrics for averaging classifier to placeholder
prediction_metrics.add_metrics(avg_clf, avg_clf_name, X_test, y_test)

# add the averaging classifier to the list (used later to build the large averaging classifier)
averaging_models.append(avg_clf)
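
The `Metrics` helper comes from the project's `modeling` module and its code is not part of this notebook. As a rough sketch of what such a helper might compute for one model, assuming a binary target and scikit-learn's metric functions, consider the hypothetical `compute_metrics` below (the labels and probabilities are made up).

```python
import numpy as np
from sklearn import metrics as skm


def compute_metrics(y_true, y_proba, threshold=0.5):
    """Binary-classification metrics from positive-class probabilities."""
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba, dtype=float)
    # turn probabilities into hard 0/1 predictions
    y_pred = (y_proba >= threshold).astype(int)
    return {
        'precision_score': skm.precision_score(y_true, y_pred),
        'recall_score': skm.recall_score(y_true, y_pred),
        'f1_score': skm.f1_score(y_true, y_pred),
        # ROC AUC is computed from the probabilities, not the hard labels
        'roc_auc_score': skm.roc_auc_score(y_true, y_proba),
        'accuracy_score': skm.accuracy_score(y_true, y_pred),
    }


# Made-up labels and predicted probabilities, for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_proba = [0.9, 0.4, 0.3, 0.8, 0.6, 0.1]
example_metrics = compute_metrics(y_true, y_proba)
print(example_metrics)
```

The dictionary keys match the column names used for the results table in section 7.8.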

## 7.6 Recurrent Neural Network (RNN)

### 7.6.1  Select best models

Select the best models using the Keras wrapper for the Scikit-Learn API and ParameterSampler to generate models with different parameters.

In [11]:
# define params for random grid search
params_grid = {
    'n_lstm_layers':[1, 2, 2], 
    'lstm_layer_size': [64, 32],
    'n_hiden_layers': [0, 0, 1],
    'hidden_layer_size' : [32, 16],
    'batch_size' : [16, 16, 8]
}

# add early stopping callback
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_accuracy', 
    verbose=0,
    patience=10,
    mode='max',
    restore_best_weights=True)

# select the 5 best neural networks out of 10 randomly sampled configurations
best_models, best_scoring = select_best_networks(
    build_func=build_rnn,
    params_grid=params_grid,
    n_iter=10,
    random_state=23,
    X_train=X_train,
    y_train=y_train,
    X_val=X_val,
    y_val=y_val,
    early_stopping=early_stopping,
    kind='rnn',
    epochs=50,
    shuffle=True,
    verbose=1,
    n_best_models=5,
)
# show best selected models
show_best_models(best_models, best_scoring)

KerasClassifier{'n_lstm_layers': 2, 'n_hiden_layers': 1, 'lstm_layer_size': 32, 'hidden_layer_size': 16, 'batch_size': 16}
Accuracy score on training set: 0.7327 | Accuracy score on validation set: 0.697
-------------------------------------------------------------------------------------------------------------------------------
KerasClassifier{'n_lstm_layers': 2, 'n_hiden_layers': 1, 'lstm_layer_size': 32, 'hidden_layer_size': 32, 'batch_size': 16}
Accuracy score on training set: 0.7131 | Accuracy score on validation set: 0.6879
-------------------------------------------------------------------------------------------------------------------------------
KerasClassifier{'n_lstm_layers': 2, 'n_hiden_layers': 0, 'lstm_layer_size': 32, 'hidden_layer_size': 32, 'batch_size': 16}
Accuracy score on training set: 0.746 | Accuracy score on validation set: 0.6909
-------------------------------------------------------------------------------------------------------------------------------
Ker

### 7.6.2 Extract single models from list

In [12]:
# take the Keras model from the second step of each of the 5 selected pipelines
clf_1, clf_2, clf_3, clf_4, clf_5 = (
    pipe.steps[1][1].model for pipe in best_models[:5, 1]
)

### 7.6.3 Create complete pipelines (with scaling, encoding and feature selection) for each individual classifier

In [13]:
# all base preprocess pipeline and transformers come from module preprocessing_pipelines.py
pipe_clf_1 = Pipeline([ ('preprocess_pipeline', basic_preprocess_pipeline),
                        ('feature_seletion', ImportantFeaturesSelector(clf_1, 'rnn') ),
                        ('classification', RnnBuilder(clf_1))
                      ])

pipe_clf_2 = Pipeline([ ('preprocess_pipeline', basic_preprocess_pipeline),
                        ('feature_seletion', ImportantFeaturesSelector(clf_2, 'rnn') ),
                        ('classification', RnnBuilder(clf_2))
                      ])

pipe_clf_3 = Pipeline([ ('preprocess_pipeline', basic_preprocess_pipeline),
                        ('feature_seletion', ImportantFeaturesSelector(clf_3, 'rnn') ),
                        ('classification', RnnBuilder(clf_3))
                      ])

pipe_clf_4 = Pipeline([ ('preprocess_pipeline', basic_preprocess_pipeline),
                        ('feature_seletion', ImportantFeaturesSelector(clf_4, 'rnn') ),
                        ('classification', RnnBuilder(clf_4))
                      ])

pipe_clf_5 = Pipeline([ ('preprocess_pipeline', basic_preprocess_pipeline),
                        ('feature_seletion', ImportantFeaturesSelector(clf_5, 'rnn') ),
                        ('classification', RnnBuilder(clf_5))
                      ])
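
Keras LSTM layers expect 3D input of shape `(samples, timesteps, features)`, whereas the preprocessed data here is a 2D table. How `RnnBuilder` and `build_rnn` handle this is not shown in the notebook; the sketch below shows one common way to do the conversion, with `to_lstm_input` being a hypothetical helper.

```python
import numpy as np


def to_lstm_input(X, timesteps=1):
    """Reshape (n_samples, n_features) into (n_samples, timesteps, n_features // timesteps)."""
    X = np.asarray(X, dtype='float32')
    n_samples, n_features = X.shape
    if n_features % timesteps != 0:
        raise ValueError('n_features must be divisible by timesteps')
    return X.reshape(n_samples, timesteps, n_features // timesteps)


X_tabular = np.arange(12).reshape(4, 3)  # 4 samples, 3 features
X_sequence = to_lstm_input(X_tabular)    # one timestep carrying all features
print(X_sequence.shape)  # (4, 1, 3)
```

With `timesteps=1` every row is treated as a sequence of length one, which keeps the tabular features intact while satisfying the input shape an LSTM layer requires.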

### 7.6.4  Make AveragingNetworkClassifier from the 5 best selected models (pipelines)

In [14]:
avg_clf = AveragingNetworkClassifier(base_estimators=[pipe_clf_1,
                                                      pipe_clf_2,
                                                      pipe_clf_3,
                                                      pipe_clf_4,
                                                      pipe_clf_5],
                                      voting='soft')
# print(avg_clf.base_estimators[0])

### 7.6.5 Fit single and averaging models on the entire data set 

In [21]:
# train models on all data
pipe_clf_1.fit(X_train_all, y_train_all)
avg_clf.fit(X_train_all, y_train_all)

# give models a name
clf_1_name = 'RNN'
avg_clf_name = f'Averaging{clf_1_name}'
print(clf_1_name, avg_clf_name)

RNN AveragingRNN


### 7.6.6 Calculate metrics of prediction and add results to the lists

In [16]:
# add prediction metrics for single classifier to placeholder
prediction_metrics.add_metrics(pipe_clf_1, clf_1_name, X_test, y_test)

# add prediction metrics for averaging classifier to placeholder
prediction_metrics.add_metrics(avg_clf, avg_clf_name, X_test, y_test)

# add the averaging classifier to the list (used later to build the large averaging classifier)
averaging_models.append(avg_clf)

## 7.7 Merge averaging models into one large averaging model

### 7.7.1 Create the large averaging model

In [17]:
# create model (all base models are already fitted)

# use the averaging classifiers as base models
large_average_network_clf = LargeAveragingClassifier(base_estimators=averaging_models, voting='soft')

# give model a name
large_average_network_clf_name = 'LargeNeuralNetworkAveragingClassifier'
print(large_average_network_clf_name)

LargeNeuralNetworkAveragingClassifier
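
Because the base ensembles are already fitted, the large classifier only needs to combine their predictions. The code of `LargeAveragingClassifier` is not part of this notebook, so the following is only a sketch of the idea, assuming each base ensemble exposes `predict_proba`; `PrefitAveraging` and `ConstantProba` are hypothetical stand-ins.

```python
import numpy as np


class PrefitAveraging:
    """Average predict_proba outputs of already-fitted estimators (no refitting)."""

    def __init__(self, fitted_estimators):
        self.fitted_estimators = fitted_estimators

    def predict_proba(self, X):
        return np.mean([est.predict_proba(X) for est in self.fitted_estimators], axis=0)

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)


class ConstantProba:
    """Stand-in for a fitted ensemble: always returns the same class probabilities."""

    def __init__(self, proba):
        self.proba = np.asarray(proba, dtype=float)

    def predict_proba(self, X):
        return np.tile(self.proba, (len(X), 1))


ensemble = PrefitAveraging([ConstantProba([0.7, 0.3]), ConstantProba([0.4, 0.6])])
X_dummy = np.zeros((2, 3))
print(ensemble.predict_proba(X_dummy))
print(ensemble.predict(X_dummy))
```

When both base ensembles contain the same number of networks, averaging their averaged probabilities gives the same result as averaging all underlying networks directly.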


### 7.7.2 Calculate metrics of prediction and add results to the lists

In [18]:
# add prediction metrics for large averaging classifier to placeholder
prediction_metrics.add_metrics(large_average_network_clf, large_average_network_clf_name, X_test, y_test)

## 7.8 Show all results in one table and save them for future use

In [19]:
# get prediction metric result lists from placeholder
precision_score, recall_score, f1_score, roc_auc_score, accuracy_score = prediction_metrics.get_metrics()

# get model names list from placeholder
models_name = prediction_metrics.get_names()

# create dictionary of results 
results_dict = {'precision_score': precision_score, 
               'recall_score': recall_score, 
               'f1_score': f1_score,
               'roc_auc_score' : roc_auc_score,
               'accuracy_score' : accuracy_score}

results_df = pd.DataFrame(data=results_dict)
results_df.insert(loc=0, column='Model', value=models_name)
results_df

|   | Model | precision_score | recall_score | f1_score | roc_auc_score | accuracy_score |
|---|---|---|---|---|---|---|
| 0 | ANN | 0.657534 | 0.571429 | 0.611465 | 0.746069 | 0.678947 |
| 1 | AveragingANN | 0.693333 | 0.619048 | 0.654088 | 0.750225 | 0.710526 |
| 2 | RNN | 0.630137 | 0.547619 | 0.585987 | 0.702493 | 0.657895 |
| 3 | AveragingRNN | 0.666667 | 0.595238 | 0.628931 | 0.735737 | 0.689474 |
| 4 | LargeNeuralNetworkAveragingClassifier | 0.666667 | 0.571429 | 0.615385 | 0.744609 | 0.684211 |


In [20]:
results_df.to_csv("./results/neural_network_results.csv")
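
For a quick comparison, the saved results can be ranked by any metric. The sketch below rebuilds the table from the values reported in section 7.8 (rather than reading the CSV file, so it runs on its own) and sorts the models by test accuracy; `ranked` is a name introduced here for illustration.

```python
import io

import pandas as pd

# Values copied from the results table in section 7.8
csv_text = """Model,precision_score,recall_score,f1_score,roc_auc_score,accuracy_score
ANN,0.657534,0.571429,0.611465,0.746069,0.678947
AveragingANN,0.693333,0.619048,0.654088,0.750225,0.710526
RNN,0.630137,0.547619,0.585987,0.702493,0.657895
AveragingRNN,0.666667,0.595238,0.628931,0.735737,0.689474
LargeNeuralNetworkAveragingClassifier,0.666667,0.571429,0.615385,0.744609,0.684211
"""

results = pd.read_csv(io.StringIO(csv_text))

# Rank the models from highest to lowest test accuracy
ranked = results.sort_values('accuracy_score', ascending=False).reset_index(drop=True)
print(ranked[['Model', 'accuracy_score']])
```

On these numbers `AveragingANN` ranks first and the single `RNN` last. When loading the file written by `to_csv` above, passing `index_col=0` to `pd.read_csv` avoids an extra `Unnamed: 0` column, since the index is saved as well.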