# 6  Modeling - selection of the best linear and other models

<b> Purpose of the action </b> - checking accuracy of prediction on test set using different types of models:
- LogisticRegression
- LinearSVC
- SVC with RBF
- KNeighborsClassifier

<b> </b>
<b> Action plan </b>:
- Test 20 diffrent models for each type
- Use ParameterSampler to generate different models with random hyperparameters
- Use training set for fitting model and use validation set for model evaluation 
- Select the best 5 models of each type and create one averaging model (Votingclassifier)
- Retrain averaging model on all data (training and validation sets)
- Create one large averaging model using the best one model of each type
- Create another large averaging model from the previously created VotingClassifiers (each model contains the top 5 models of the same type)
- Save this models for use in future
- Compare prediction accuracy and other metrics on test set and save results for future purpose

## 6.1 Import nessesary libraries and modules

In [1]:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier
import pickle
from modeling import make_voting_classifier

## 6.2 Create empty lists for future results

In [2]:
accuracy_score = []
precision_score = []
recall_score = []
f1_score = []
roc_auc_score = []
models_name = []
single_models = []
voting_models = []

## 6.3 Perform ensembling for LogisticRegression

### 6.3.1 Import data dedicated for this model

In [3]:
train_set = pd.read_csv("./preprocessed_data/processed_train_set_lr.csv", index_col=0)
validation_set = pd.read_csv("./preprocessed_data/processed_validation_set_lr.csv", index_col=0)
test_set = pd.read_csv("./preprocessed_data/processed_test_set_lr.csv", index_col=0)

### 6.3.2 Split datasets to feature set and labels set

In [4]:
X_train, y_train = np.array(train_set.drop(columns='FTR')), np.array(train_set['FTR'])
X_val, y_val = np.array(validation_set.drop(columns='FTR')), np.array(validation_set['FTR'])
X_test, y_test = np.array(test_set.drop(columns='FTR')), np.array(test_set['FTR'])

### 6.3.3 Perform averaging ensembling for this model type

Choose the top 5 models from 20 and create one model from them (Voting Classifier) using <b> ParameterSampler </b> for generating random parameters

In [5]:
# define params for random grid search
params_grid={
   'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
   'random_state': [0, 1, 2, 3, 4, 5, 6, 7, 8 ,9, 10],
   'max_iter': [1000]
}

# function build a voting classifier using n best models
voting_clf = make_voting_classifier(estimator=LogisticRegression, 
                                    params_grid=params_grid,
                                    n_iter=20, 
                                    random_state=42,
                                    X_train=X_train, 
                                    y_train=y_train, 
                                    X_val=X_val, 
                                    y_val=y_val, 
                                    verbose=1,
                                    n_best_models=5, 
                                    voting='soft')

voting_clf.fit(X_train, y_train)
# look on the estimators of voting claffier
print(voting_clf.estimators[:,0])

LogisticRegression{'random_state': 4, 'max_iter': 1000, 'C': 0.001}
Accuracy score on training set: 0.6593 | Accuracy score on validation set: 0.6515
--------------------------------------------------------------------------------
LogisticRegression{'random_state': 2, 'max_iter': 1000, 'C': 1}
Accuracy score on training set: 0.6779 | Accuracy score on validation set: 0.6576
--------------------------------------------------------------------------------
LogisticRegression{'random_state': 10, 'max_iter': 1000, 'C': 0.001}
Accuracy score on training set: 0.6593 | Accuracy score on validation set: 0.6515
--------------------------------------------------------------------------------
LogisticRegression{'random_state': 0, 'max_iter': 1000, 'C': 0.001}
Accuracy score on training set: 0.6593 | Accuracy score on validation set: 0.6515
--------------------------------------------------------------------------------
LogisticRegression{'random_state': 1, 'max_iter': 1000, 'C': 10}
Accuracy score

### 6.3.4 Exctract the best single model from voting classifier

In [6]:
# extract single classifier
clf = voting_clf.estimators_[0]
clf

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

### 6.3.5 Refit single and averaging models on the entire data set (training set+validation set)

In [7]:
# merge training and validation sets
X_all = np.concatenate([X_train, X_val], axis=0)
y_all = np.concatenate([y_train, y_val], axis=0)

# retrain models on all data
clf.fit(X_all, y_all)
voting_clf.fit(X_all, y_all)

# give models a name
clf_name = f'{clf.__class__.__name__}'
voting_clf_name = f'Voting_{voting_clf.estimators_[0].__class__.__name__}'
print(clf_name, voting_clf_name)

LogisticRegression Voting_LogisticRegression


### 6.3.6 Calculate metrics of prediction and add results to the lists

In [8]:
# append metrics for single classifier to the list 
accuracy_score.append(metrics.accuracy_score(y_test , clf.predict(X_test)))  
precision_score.append(metrics.precision_score(y_test , clf.predict(X_test)))
recall_score.append(metrics.recall_score(y_test , clf.predict(X_test)))
f1_score.append( metrics.f1_score(y_test , clf.predict(X_test)))
roc_auc_score.append(metrics.roc_auc_score(y_test , clf.predict_proba(X_test)[:,1]))

# append metrics for voting classifier to the list  
accuracy_score.append(metrics.accuracy_score(y_test , voting_clf.predict(X_test)))  
precision_score.append(metrics.precision_score(y_test , voting_clf.predict(X_test)))
recall_score.append(metrics.recall_score(y_test , voting_clf.predict(X_test)))
f1_score.append( metrics.f1_score(y_test , voting_clf.predict(X_test)))
roc_auc_score.append(metrics.roc_auc_score(y_test , voting_clf.predict_proba(X_test)[:,1]))

# add claffiers name to the list (needed for created table with results)
models_name.append(clf_name)
models_name.append(voting_clf_name)

# add classifiers to the lists (to create largest average classifiers)
single_models.append( (clf_name, clf) )
voting_models.append( (voting_clf_name, voting_clf) )

## 6.4 Perform ensembling for LinearSVC

### 6.4.1 Import data dedicated for this model

In [9]:
train_set = pd.read_csv("./preprocessed_data/processed_train_set_linearsvc.csv", index_col=0)
validation_set = pd.read_csv("./preprocessed_data/processed_validation_set_linearsvc.csv", index_col=0)
test_set = pd.read_csv("./preprocessed_data/processed_validation_set_linearsvc.csv", index_col=0)

### 6.4.2 Split datasets to feature set and labels set

In [10]:
X_train, y_train = np.array(train_set.drop(columns='FTR')), np.array(train_set['FTR'])
X_val, y_val = np.array(validation_set.drop(columns='FTR')), np.array(validation_set['FTR'])
X_test, y_test = np.array(test_set.drop(columns='FTR')), np.array(test_set['FTR'])

### 6.4.3 Perform averaging ensembling for this model type

Choose the top 5 models from 20 and create one model from them (Voting Classifier) using <b> ParameterSampler </b> for generating random parameters

In [11]:
# define params for random grid search
params_grid={
   'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
   'random_state': [0, 1, 2, 3, 4, 5, 6, 7, 8 ,9, 10],
   'kernel': ['linear'],
   'probability' : [True],
}

# function build a voting classifier using n best models
voting_clf = make_voting_classifier(estimator=SVC, 
                                    params_grid=params_grid,
                                    n_iter=20, 
                                    random_state=42,
                                    X_train=X_train, 
                                    y_train=y_train, 
                                    X_val=X_val, 
                                    y_val=y_val, 
                                    verbose=1,
                                    n_best_models=5, 
                                    voting='soft')

voting_clf.fit(X_train, y_train)
# look on the estimators of voting claffier
print(voting_clf.estimators[:,0])

SVC{'random_state': 4, 'probability': True, 'kernel': 'linear', 'C': 0.001}
Accuracy score on training set: 0.6525 | Accuracy score on validation set: 0.6455
--------------------------------------------------------------------------------
SVC{'random_state': 2, 'probability': True, 'kernel': 'linear', 'C': 1}
Accuracy score on training set: 0.6832 | Accuracy score on validation set: 0.6788
--------------------------------------------------------------------------------
SVC{'random_state': 10, 'probability': True, 'kernel': 'linear', 'C': 0.001}
Accuracy score on training set: 0.6525 | Accuracy score on validation set: 0.6455
--------------------------------------------------------------------------------
SVC{'random_state': 0, 'probability': True, 'kernel': 'linear', 'C': 0.001}
Accuracy score on training set: 0.6525 | Accuracy score on validation set: 0.6455
--------------------------------------------------------------------------------
SVC{'random_state': 1, 'probability': True, 'ke

### 6.4.4 Exctract the best single model from voting classifier

In [12]:
# extract single classifier
clf = voting_clf.estimators_[0]
clf

SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=True, random_state=1, shrinking=True, tol=0.001,
    verbose=False)

### 6.4.5 Refit single and averaging models on the entire data set (training set+validation set)

In [13]:
# merge training and validation sets
X_all = np.concatenate([X_train, X_val], axis=0)
y_all = np.concatenate([y_train, y_val], axis=0)

# retrain models on all data
clf.fit(X_all, y_all)
voting_clf.fit(X_all, y_all)

# give models a name
clf_name = f'Linear{clf.__class__.__name__}'
voting_clf_name = f'Voting_Linear{voting_clf.estimators_[0].__class__.__name__}'
print(clf_name, voting_clf_name)

LinearSVC Voting_LinearSVC


### 6.4.6 Calculate metrics of prediction and add results to the lists

In [14]:
# append metrics for single classifier to the list 
accuracy_score.append(metrics.accuracy_score(y_test , clf.predict(X_test)))  
precision_score.append(metrics.precision_score(y_test , clf.predict(X_test)))
recall_score.append(metrics.recall_score(y_test , clf.predict(X_test)))
f1_score.append( metrics.f1_score(y_test , clf.predict(X_test)))
roc_auc_score.append(metrics.roc_auc_score(y_test , clf.predict_proba(X_test)[:,1]))

# append metrics for voting classifier to the list  
accuracy_score.append(metrics.accuracy_score(y_test , voting_clf.predict(X_test)))  
precision_score.append(metrics.precision_score(y_test , voting_clf.predict(X_test)))
recall_score.append(metrics.recall_score(y_test , voting_clf.predict(X_test)))
f1_score.append( metrics.f1_score(y_test , voting_clf.predict(X_test)))
roc_auc_score.append(metrics.roc_auc_score(y_test , voting_clf.predict_proba(X_test)[:,1]))

# add claffiers name to the list (needed for created table with results)
models_name.append(clf_name)
models_name.append(voting_clf_name)

# add classifiers to the lists (to create largest average classifiers)
single_models.append( (clf_name, clf) )
voting_models.append( (voting_clf_name, voting_clf) )

## 6.5 Perform ensembling for SVC Classifier with RBF kernel

### 6.5.1 Import data dedicated for this model

In [15]:
train_set = pd.read_csv("./preprocessed_data/processed_train_set_svc_rbf.csv", index_col=0)
validation_set = pd.read_csv("./preprocessed_data/processed_validation_set_svc_rbf.csv", index_col=0)
test_set = pd.read_csv("./preprocessed_data/processed_test_set_svc_rbf.csv", index_col=0)

### 6.5.2 Split datasets to feature set and labels set

In [16]:
X_train, y_train = np.array(train_set.drop(columns='FTR')), np.array(train_set['FTR'])
X_val, y_val = np.array(validation_set.drop(columns='FTR')), np.array(validation_set['FTR'])
X_test, y_test = np.array(test_set.drop(columns='FTR')), np.array(test_set['FTR'])

### 6.5.3 Perform averaging ensembling for this model type

Choose the top 5 models from 20 and create one model from them (Voting Classifier) using <b> ParameterSampler </b> for generating random parameters

In [17]:
# define params for random grid search
params_grid={
   'C':  [10, 100, 1000, 10000, 100000, 1000000],
   'gamma': [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1],
   'random_state': [0, 1, 2, 3, 4, 5, 6, 7, 8 ,9, 10],
   'probability' : [True],
   'max_iter': [1000000]
}

# function build a voting classifier using n best models
voting_clf = make_voting_classifier(estimator=SVC, 
                                    params_grid=params_grid,
                                    n_iter=20, 
                                    random_state=42,
                                    X_train=X_train, 
                                    y_train=y_train, 
                                    X_val=X_val, 
                                    y_val=y_val, 
                                    verbose=1,
                                    n_best_models=5, 
                                    voting='soft')

voting_clf.fit(X_train, y_train)
# look on the estimators of voting claffier
print(voting_clf.estimators[:,0])

SVC{'random_state': 1, 'probability': True, 'max_iter': 1000000, 'gamma': 1e-05, 'C': 100}
Accuracy score on training set: 0.6582 | Accuracy score on validation set: 0.6364
--------------------------------------------------------------------------------
SVC{'random_state': 2, 'probability': True, 'max_iter': 1000000, 'gamma': 1e-05, 'C': 100000}
Accuracy score on training set: 0.681 | Accuracy score on validation set: 0.6697
--------------------------------------------------------------------------------




SVC{'random_state': 8, 'probability': True, 'max_iter': 1000000, 'gamma': 0.01, 'C': 10000}
Accuracy score on training set: 0.7576 | Accuracy score on validation set: 0.6152
--------------------------------------------------------------------------------
SVC{'random_state': 0, 'probability': True, 'max_iter': 1000000, 'gamma': 0.1, 'C': 10}
Accuracy score on training set: 0.7375 | Accuracy score on validation set: 0.6424
--------------------------------------------------------------------------------




SVC{'random_state': 6, 'probability': True, 'max_iter': 1000000, 'gamma': 0.1, 'C': 1000000}
Accuracy score on training set: 1.0 | Accuracy score on validation set: 0.5909
--------------------------------------------------------------------------------
SVC{'random_state': 8, 'probability': True, 'max_iter': 1000000, 'gamma': 1e-05, 'C': 100000}
Accuracy score on training set: 0.681 | Accuracy score on validation set: 0.6697
--------------------------------------------------------------------------------
SVC{'random_state': 9, 'probability': True, 'max_iter': 1000000, 'gamma': 0.001, 'C': 10}
Accuracy score on training set: 0.6685 | Accuracy score on validation set: 0.6485
--------------------------------------------------------------------------------
SVC{'random_state': 0, 'probability': True, 'max_iter': 1000000, 'gamma': 1e-05, 'C': 1000000}
Accuracy score on training set: 0.6793 | Accuracy score on validation set: 0.6879
-------------------------------------------------------------



SVC{'random_state': 3, 'probability': True, 'max_iter': 1000000, 'gamma': 0.1, 'C': 100000}
Accuracy score on training set: 1.0 | Accuracy score on validation set: 0.5909
--------------------------------------------------------------------------------
SVC{'random_state': 3, 'probability': True, 'max_iter': 1000000, 'gamma': 0.001, 'C': 1000}
Accuracy score on training set: 0.682 | Accuracy score on validation set: 0.6667
--------------------------------------------------------------------------------
SVC{'random_state': 1, 'probability': True, 'max_iter': 1000000, 'gamma': 1e-06, 'C': 1000000}
Accuracy score on training set: 0.6781 | Accuracy score on validation set: 0.6606
--------------------------------------------------------------------------------




SVC{'random_state': 3, 'probability': True, 'max_iter': 1000000, 'gamma': 0.01, 'C': 1000000}
Accuracy score on training set: 0.5561 | Accuracy score on validation set: 0.5061
--------------------------------------------------------------------------------
SVC{'random_state': 3, 'probability': True, 'max_iter': 1000000, 'gamma': 0.1, 'C': 100}
Accuracy score on training set: 0.8194 | Accuracy score on validation set: 0.597
--------------------------------------------------------------------------------




SVC{'random_state': 9, 'probability': True, 'max_iter': 1000000, 'gamma': 0.1, 'C': 1000000}
Accuracy score on training set: 1.0 | Accuracy score on validation set: 0.5909
--------------------------------------------------------------------------------
SVC{'random_state': 5, 'probability': True, 'max_iter': 1000000, 'gamma': 1e-06, 'C': 10000}
Accuracy score on training set: 0.668 | Accuracy score on validation set: 0.6455
--------------------------------------------------------------------------------
SVC{'random_state': 5, 'probability': True, 'max_iter': 1000000, 'gamma': 1e-06, 'C': 1000000}
Accuracy score on training set: 0.6781 | Accuracy score on validation set: 0.6606
--------------------------------------------------------------------------------
["SVC{'random_state': 0, 'probability': True, 'max_iter': 1000000, 'gamma': 1e-05, 'C': 1000000}"
 "SVC{'random_state': 6, 'probability': True, 'max_iter': 1000000, 'gamma': 0.01, 'C': 100}"
 "SVC{'random_state': 2, 'probability': Tru

### 6.5.4 Exctract the best single model from voting classifier

In [18]:
# extract single classifier
clf = voting_clf.estimators_[0]
clf

SVC(C=1000000, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=1e-05, kernel='rbf',
    max_iter=1000000, probability=True, random_state=0, shrinking=True,
    tol=0.001, verbose=False)

### 6.5.5 Refit single and averaging models on the entire data set (training set+validation set)

In [19]:
# merge training and validation sets
X_all = np.concatenate([X_train, X_val], axis=0)
y_all = np.concatenate([y_train, y_val], axis=0)

# retrain models on all data
clf.fit(X_all, y_all)
voting_clf.fit(X_all, y_all)

# give models a name
clf_name = f'{clf.__class__.__name__}_RBF'
voting_clf_name = f'Voting_{voting_clf.estimators_[0].__class__.__name__}_RBF'
print(clf_name, voting_clf_name)

SVC_RBF Voting_SVC_RBF


### 6.5.6 Calculate metrics of prediction and add results to the lists

In [20]:
# append metrics for single classifier to the list 
accuracy_score.append(metrics.accuracy_score(y_test , clf.predict(X_test)))  
precision_score.append(metrics.precision_score(y_test , clf.predict(X_test)))
recall_score.append(metrics.recall_score(y_test , clf.predict(X_test)))
f1_score.append( metrics.f1_score(y_test , clf.predict(X_test)))
roc_auc_score.append(metrics.roc_auc_score(y_test , clf.predict_proba(X_test)[:,1]))

# append metrics for voting classifier to the list  
accuracy_score.append(metrics.accuracy_score(y_test , voting_clf.predict(X_test)))  
precision_score.append(metrics.precision_score(y_test , voting_clf.predict(X_test)))
recall_score.append(metrics.recall_score(y_test , voting_clf.predict(X_test)))
f1_score.append( metrics.f1_score(y_test , voting_clf.predict(X_test)))
roc_auc_score.append(metrics.roc_auc_score(y_test , voting_clf.predict_proba(X_test)[:,1]))

# add claffiers name to the list (needed for created table with results)
models_name.append(clf_name)
models_name.append(voting_clf_name)

# add classifiers to the lists (to create largest average classifiers)
single_models.append( (clf_name, clf) )
voting_models.append( (voting_clf_name, voting_clf) )

## 6.6 Perform ensembling for KNeighborsClassifier

### 6.6.1 Import data dedicated for this model

In [22]:
train_set = pd.read_csv("./preprocessed_data/processed_train_set_knn.csv", index_col=0)
validation_set = pd.read_csv("./preprocessed_data/processed_validation_set_knn.csv", index_col=0)
test_set = pd.read_csv("./preprocessed_data/processed_test_set_knn.csv", index_col=0)

### 6.6.2 Split datasets to feature set and labels set

In [23]:
X_train, y_train = np.array(train_set.drop(columns='FTR')), np.array(train_set['FTR'])
X_val, y_val = np.array(validation_set.drop(columns='FTR')), np.array(validation_set['FTR'])
X_test, y_test = np.array(test_set.drop(columns='FTR')), np.array(test_set['FTR'])

### 6.6.3 Perform averaging ensembling for this model type

Choose the top 5 models from 20 and create one model from them (Voting Classifier) using <b> ParameterSampler </b> for generating random parameters

In [24]:
# define params for random grid search
params_grid = {
                  'n_neighbors' : [7, 9, 11, 13, 15, 17, 19, 21],
                  'metric': ['manhattan', 'cosine'],
                  'leaf_size' : [15, 20, 25, 30, 35, 40, 45]
              }

# function build a voting classifier using n best models
voting_clf = make_voting_classifier(estimator=KNeighborsClassifier, 
                                    params_grid=params_grid,
                                    n_iter=20, 
                                    random_state=42,
                                    X_train=X_train, 
                                    y_train=y_train, 
                                    X_val=X_val, 
                                    y_val=y_val, 
                                    verbose=1,
                                    n_best_models=5, 
                                    voting='soft')

voting_clf.fit(X_train, y_train)
# look on the estimators of voting claffier
print(voting_clf.estimators[:,0])

KNeighborsClassifier{'n_neighbors': 7, 'metric': 'cosine', 'leaf_size': 25}
Accuracy score on training set: 0.7274 | Accuracy score on validation set: 0.6606
--------------------------------------------------------------------------------
KNeighborsClassifier{'n_neighbors': 9, 'metric': 'manhattan', 'leaf_size': 35}
Accuracy score on training set: 0.7177 | Accuracy score on validation set: 0.6636
--------------------------------------------------------------------------------
KNeighborsClassifier{'n_neighbors': 15, 'metric': 'manhattan', 'leaf_size': 15}
Accuracy score on training set: 0.7007 | Accuracy score on validation set: 0.6727
--------------------------------------------------------------------------------
KNeighborsClassifier{'n_neighbors': 21, 'metric': 'cosine', 'leaf_size': 25}
Accuracy score on training set: 0.6916 | Accuracy score on validation set: 0.6606
--------------------------------------------------------------------------------
KNeighborsClassifier{'n_neighbors': 

### 6.6.4 Exctract the best single model from voting classifier

In [25]:
# extract single classifier
clf = voting_clf.estimators_[0]
clf

KNeighborsClassifier(algorithm='auto', leaf_size=15, metric='manhattan',
                     metric_params=None, n_jobs=None, n_neighbors=15, p=2,
                     weights='uniform')

### 6.6.5 Refit single and averaging models on the entire data set (training set+validation set)

In [26]:
# merge training and validation sets
X_all = np.concatenate([X_train, X_val], axis=0)
y_all = np.concatenate([y_train, y_val], axis=0)

# retrain models on all data
clf.fit(X_all, y_all)
voting_clf.fit(X_all, y_all)

# give models a name
clf_name = f'{clf.__class__.__name__}'
voting_clf_name = f'Voting_{voting_clf.estimators_[0].__class__.__name__}'
print(clf_name, voting_clf_name)

KNeighborsClassifier Voting_KNeighborsClassifier


### 6.6.6 Calculate metrics of prediction and add results to the lists

In [27]:
# append metrics for single classifier to the list 
accuracy_score.append(metrics.accuracy_score(y_test , clf.predict(X_test)))  
precision_score.append(metrics.precision_score(y_test , clf.predict(X_test)))
recall_score.append(metrics.recall_score(y_test , clf.predict(X_test)))
f1_score.append( metrics.f1_score(y_test , clf.predict(X_test)))
roc_auc_score.append(metrics.roc_auc_score(y_test , clf.predict_proba(X_test)[:,1]))

# append metrics for voting classifier to the list  
accuracy_score.append(metrics.accuracy_score(y_test , voting_clf.predict(X_test)))  
precision_score.append(metrics.precision_score(y_test , voting_clf.predict(X_test)))
recall_score.append(metrics.recall_score(y_test , voting_clf.predict(X_test)))
f1_score.append( metrics.f1_score(y_test , voting_clf.predict(X_test)))
roc_auc_score.append(metrics.roc_auc_score(y_test , voting_clf.predict_proba(X_test)[:,1]))

# add claffiers name to the list (needed for created table with results)
models_name.append(clf_name)
models_name.append(voting_clf_name)

# add classifiers to the lists (to create largest average classifiers)
single_models.append( (clf_name, clf) )
voting_models.append( (voting_clf_name, voting_clf) )

## 6.7 Merge single and voting classifiers in largest models

### 6.7.1 Import data dedicated for this models

In [28]:
train_set = pd.read_csv("./preprocessed_data/processed_categorical_train_set.csv", index_col=0)
validation_set = pd.read_csv("./preprocessed_data/processed_categorical_validation_set.csv", index_col=0)
test_set = pd.read_csv("./preprocessed_data/processed_categorical_test_set.csv", index_col=0)

### 6.7.2 Split datasets to feature set and labels set

In [29]:
X_train, y_train = np.array(train_set.drop(columns='FTR')), np.array(train_set['FTR'])
X_val, y_val = np.array(validation_set.drop(columns='FTR')), np.array(validation_set['FTR'])
X_test, y_test = np.array(test_set.drop(columns='FTR')), np.array(test_set['FTR'])

# merge training and validation sets
X_all = np.concatenate([X_train, X_val], axis=0)
y_all = np.concatenate([y_train, y_val], axis=0)

### 6.7.3 Create new largest voting models and fit them on all data

In [30]:
# create models

# as base models using single classifier
voting_linear_clf = VotingClassifier(estimators=single_models, voting='soft')
# as base models using voting classifier
average_voting_linear_clf = VotingClassifier(estimators=voting_models, voting='soft')

# train models on all data
voting_linear_clf.fit(X_all, y_all)
average_voting_linear_clf.fit(X_all, y_all)

# give models a name
voting_linear_clf_name = 'LinearModelsVotingClassifier'
average_voting_linear_clf_name = 'LinearModelsAveragingVotingClassifier'
print(voting_linear_clf_name, average_voting_linear_clf_name)



LinearModelsVotingClassifier LinearModelsAveragingVotingClassifier


### 6.7.4 Calculate metrics of predictions and add results to the lists

In [31]:
# append metrics for single voting classifier to the lists
accuracy_score.append(metrics.accuracy_score(y_test , voting_linear_clf.predict(X_test)))  
precision_score.append(metrics.precision_score(y_test , voting_linear_clf.predict(X_test)))
recall_score.append(metrics.recall_score(y_test , voting_linear_clf.predict(X_test)))
f1_score.append( metrics.f1_score(y_test , voting_linear_clf.predict(X_test)))
roc_auc_score.append(metrics.roc_auc_score(y_test , voting_linear_clf.predict_proba(X_test)[:,1]))

# append metrics for averaging voting classifier to the lists 
accuracy_score.append(metrics.accuracy_score(y_test , average_voting_linear_clf.predict(X_test)))  
precision_score.append(metrics.precision_score(y_test , average_voting_linear_clf.predict(X_test)))
recall_score.append(metrics.recall_score(y_test , average_voting_linear_clf.predict(X_test)))
f1_score.append( metrics.f1_score(y_test , average_voting_linear_clf.predict(X_test)))
roc_auc_score.append(metrics.roc_auc_score(y_test , average_voting_linear_clf.predict_proba(X_test)[:,1]))


# add claffiers name to the list (needed for created table with results)
models_name.append(voting_linear_clf_name)
models_name.append(average_voting_linear_clf_name)

### 6.7.5 Save models for future purpose

In [32]:
# save single voting model using pickle library
with open(f'./models/{voting_linear_clf_name}.pickle', 'wb') as f:
    # pickle the 'models'using the highest protocol available.
    pickle.dump(voting_linear_clf, f, pickle.HIGHEST_PROTOCOL)
    
# save averaging voting model using pickle library
with open(f'./models/{average_voting_linear_clf_name}.pickle', 'wb') as f:
    # pickle the 'models'using the highest protocol available.
    pickle.dump(average_voting_linear_clf, f, pickle.HIGHEST_PROTOCOL)

## 6.8 Show all result in one table and save it for future purpose

In [33]:
# create dictionary of results 
results_dict = {'precision_score': precision_score, 
               'recall_score': recall_score, 
               'f1_score': f1_score,
               'roc_auc_score' : roc_auc_score,
               'accuracy_score' : accuracy_score}

results_df = pd.DataFrame(data=results_dict)
results_df.insert(loc=0, column='Model', value=models_name)
results_df

Unnamed: 0,Model,precision_score,recall_score,f1_score,roc_auc_score,accuracy_score
0,LogisticRegression,0.608696,0.666667,0.636364,0.713612,0.663158
1,Voting_LogisticRegression,0.6,0.642857,0.62069,0.705863,0.652632
2,LinearSVC,0.707483,0.65,0.677524,0.749449,0.7
3,Voting_LinearSVC,0.697987,0.65,0.673139,0.749779,0.693939
4,SVC_RBF,0.592593,0.761905,0.666667,0.732143,0.663158
5,Voting_SVC_RBF,0.598131,0.761905,0.670157,0.723832,0.668421
6,KNeighborsClassifier,0.568182,0.595238,0.581395,0.676662,0.621053
7,Voting_KNeighborsClassifier,0.568182,0.595238,0.581395,0.676831,0.621053
8,LinearModelsVotingClassifier,0.6,0.642857,0.62069,0.712601,0.652632
9,LinearModelsAveragingVotingClassifier,0.597826,0.654762,0.625,0.711815,0.652632


In [34]:
# save results
results_df.to_csv("./results/linear_models_results.csv")