<h1>Sound Classification - part 2</h1>

In this part we will add new ML models and compare their performance. We will select the 5 most promising ones and check how hyperparameter tuning can improve their performance.<br>
Let's start by importing the required libraries and reading the metadata file and other data saved in part 1.

In [1]:
import pandas as pd
pd.options.display.max_columns = 500
import numpy as np
import pickle
import time

import warnings
warnings.filterwarnings("ignore")

In [2]:
# load saved X (features) and Y (labels) from files
with open('part1_X.pickle', 'rb') as f:
    X = pickle.load(f)
with open('part1_Y.pickle', 'rb') as f:
    Y = pickle.load(f)

# read metadata written to file in part 1
with open('part1_df.pickle', 'rb') as f:
    df = pickle.load(f)

df.head()

Unnamed: 0,slice_file_name,fsID,start,end,salience,fold,classID,class,channel_count,sampling_rate
0,100032-3-0-0.wav,100032,0.0,0.317551,1,5,3,dog_bark,2,44100
1,100263-2-0-117.wav,100263,58.5,62.5,1,5,2,children_playing,2,44100
2,100263-2-0-121.wav,100263,60.5,64.5,1,5,2,children_playing,2,44100
3,100263-2-0-126.wav,100263,63.0,67.0,1,5,2,children_playing,2,44100
4,100263-2-0-137.wav,100263,68.5,72.5,1,5,2,children_playing,2,44100


From now on we split the dataset into a training set and a test set. Fold 10 is held out as the test set; the other folds are used for training and cross-validation. Because we want to keep the original 10-fold data separation, we need to write a custom function for 9-fold cross-validation.

In [3]:
# split X on train and test sets
train_X = X.loc[df['fold']!=10]
train_Y = Y.loc[df['fold']!=10]
test_X = X.loc[df['fold']==10]
test_Y = Y.loc[df['fold']==10]

### create scaled features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# fit_transform() returns numpy.ndarray, if we want to keep indexes and column names, we do it like this
train_X_scaled = pd.DataFrame(scaler.fit_transform(train_X), index=train_X.index, columns=train_X.columns)
# we fit scaler only with train data and transform test data using fitted scaler
test_X_scaled = pd.DataFrame(scaler.transform(test_X), index=test_X.index, columns=test_X.columns)
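
The effect of fitting the scaler on the training data only can be checked on toy data. The sketch below uses synthetic arrays (`toy_train` and `toy_test` are hypothetical, not the project data) and shows that the test set is expressed in training-set units rather than being standardized on its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# hypothetical toy data: the test set is drawn from a slightly shifted distribution
toy_train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
toy_test = rng.normal(loc=6.0, scale=2.0, size=(50, 3))

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # mean and std are learned here
toy_test_scaled = toy_scaler.transform(toy_test)        # the learned statistics are only applied here

# the training set is standardized (up to floating-point error) ...
print(toy_train_scaled.mean(axis=0).round(3), toy_train_scaled.std(axis=0).round(3))
# ... while the test set is shifted and scaled with the training statistics
print(toy_test_scaled.mean(axis=0).round(3))
```

Because no test-set statistics enter the scaler, the held-out fold stays unseen until the final evaluation.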


In [4]:
# create our cross_val function similarly to my_cross_val_using_10_folds from part 1
def my_cross_val_using_9_folds(clf, X, data):
    test_score = []
    train_score = []
    if len(X.index) != len(data.index):
        print("Indexes of X and data are not the same length!!!")
        return test_score, train_score
    if (X.index != data.index).any():
        print("Indexes of X and data are not equal!!!")
        return test_score, train_score
    for i in range(1,10):
        X_train = X.loc[data['fold']!=i]
        Y_train = data.loc[data['fold']!=i]['class']
        X_test = X.loc[data['fold']==i]
        Y_test = data.loc[data['fold']==i]['class']
        clf.fit(X_train, Y_train)
        Y_pred = clf.predict(X_test)
        temp = np.sum(Y_pred==Y_test)/len(Y_test)
        test_score.append(temp)
        Y_pred = clf.predict(X_train)
        temp = np.sum(Y_pred==Y_train)/len(Y_train)
        train_score.append(temp)
    return test_score, train_score
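
For comparison, scikit-learn offers `LeaveOneGroupOut`, which holds out one group at a time and can therefore reproduce this kind of fold-based loop. The sketch below is an illustration on synthetic data (`toy_X`, `toy_y` and `toy_folds` are hypothetical), not a replacement for the function above; it only checks that both approaches give the same validation accuracies.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(0)
# hypothetical toy data: 90 samples, 4 features, 3 classes, folds numbered 1..9
toy_X = rng.normal(size=(90, 4))
toy_y = rng.integers(0, 3, size=90)
toy_folds = np.repeat(np.arange(1, 10), 10)

clf = NearestCentroid()
# each distinct value in `groups` is held out once, in sorted order
logo_scores = cross_val_score(clf, toy_X, toy_y, groups=toy_folds, cv=LeaveOneGroupOut())

# the same loop written by hand, mirroring the structure of my_cross_val_using_9_folds
manual_scores = []
for i in range(1, 10):
    held_out = toy_folds == i
    clf.fit(toy_X[~held_out], toy_y[~held_out])
    manual_scores.append(np.mean(clf.predict(toy_X[held_out]) == toy_y[held_out]))

print(np.allclose(logo_scores, manual_scores))
```

The hand-written function additionally returns the training scores; with scikit-learn that would require `cross_validate(..., return_train_score=True)` instead of `cross_val_score`.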


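We will first evaluate every model with its default hyperparameters on the held-out test set (fold 10). The new cross-validation function will be used later for hyperparameter tuning.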

In [5]:
### create dicts for scores
score_table_unscaled = {}
score_table_scaled = {}

Decision Tree Classifier

In [6]:
### create decision tree classifier and evaluate it
from sklearn.tree import DecisionTreeClassifier
dtc_clf = DecisionTreeClassifier(random_state=44)

# unscaled features
dtc_clf.fit(train_X, train_Y)
Y_pred = dtc_clf.predict(test_X)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_unscaled['Decision Tree'] = score
print('Decision Tree average score - X unscaled: ', score)

# scaled features
dtc_clf.fit(train_X_scaled, train_Y)
Y_pred = dtc_clf.predict(test_X_scaled)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_scaled['Decision Tree'] = score
print('Decision Tree average score - X scaled: ', score)

Decision Tree average score - X unscaled:  0.5005973715651135
Decision Tree average score - X scaled:  0.5005973715651135


K-Nearest Neighbors Classifier

In [7]:
### create KNN classifier and evaluate it
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)

# unscaled features
knn_clf.fit(train_X, train_Y)
Y_pred = knn_clf.predict(test_X)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_unscaled['KNN'] = score
print('KNN average score - X unscaled: ', score)

# scaled features
knn_clf.fit(train_X_scaled, train_Y)
Y_pred = knn_clf.predict(test_X_scaled)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_scaled['KNN'] = score
print('KNN average score - X scaled: ', score)

KNN average score - X unscaled:  0.5244922341696535
KNN average score - X scaled:  0.5770609318996416


Random Forest Classifier

In [8]:
### create Random Forest classifier and evaluate it
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier(n_jobs=-1, random_state=44, n_estimators=500)

# unscaled features
rf_clf.fit(train_X, train_Y)
Y_pred = rf_clf.predict(test_X)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_unscaled['Random Forest'] = score
print('Random Forest average score - X unscaled: ', score)

# scaled features
rf_clf.fit(train_X_scaled, train_Y)
Y_pred = rf_clf.predict(test_X_scaled)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_scaled['Random Forest'] = score
print('Random Forest average score - X scaled: ', score)

Random Forest average score - X unscaled:  0.7574671445639187
Random Forest average score - X scaled:  0.7562724014336918


Support Vector Machines

In [9]:
### create Support Vector Machines classifier and evaluate it
from sklearn.svm import SVC
svc_clf = SVC(random_state=44)

# unscaled features
svc_clf.fit(train_X, train_Y)
Y_pred = svc_clf.predict(test_X)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_unscaled['SVM'] = score
print('SVM average score - X unscaled: ', score)

# scaled features
svc_clf.fit(train_X_scaled, train_Y)
Y_pred = svc_clf.predict(test_X_scaled)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_scaled['SVM'] = score
print('SVM average score - X scaled: ', score)

SVM average score - X unscaled:  0.3727598566308244
SVM average score - X scaled:  0.7706093189964157


Zero Rate Classifier

In [10]:
### create Zero R classifier and evaluate it
from sklearn.dummy import DummyClassifier
dummy_clf = DummyClassifier(strategy="most_frequent")

# unscaled features
dummy_clf.fit(train_X, train_Y)
Y_pred = dummy_clf.predict(test_X)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_unscaled['Zero R'] = score
print('Zero R average score - X unscaled: ', score)

# scaled features
dummy_clf.fit(train_X_scaled, train_Y)
Y_pred = dummy_clf.predict(test_X_scaled)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_scaled['Zero R'] = score
print('Zero R average score - X scaled: ', score)

Zero R average score - X unscaled:  0.1111111111111111
Zero R average score - X scaled:  0.1111111111111111
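
The Zero R baseline ignores the features and always predicts the most frequent class of the training set, so its accuracy equals the share of that class in the test set. A minimal sketch with made-up labels (the arrays below are hypothetical) illustrates this:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# hypothetical toy labels; the features are placeholders because the dummy classifier ignores them
toy_train_y = np.array(['dog_bark'] * 5 + ['children_playing'] * 3)
toy_test_y = np.array(['dog_bark', 'children_playing', 'children_playing', 'children_playing'])
toy_train_X = np.zeros((len(toy_train_y), 1))
toy_test_X = np.zeros((len(toy_test_y), 1))

zero_r = DummyClassifier(strategy="most_frequent").fit(toy_train_X, toy_train_y)
toy_pred = zero_r.predict(toy_test_X)

zero_r_accuracy = np.mean(toy_pred == toy_test_y)
majority_share = np.mean(toy_test_y == 'dog_bark')  # share of the training majority class in the test set
print(zero_r_accuracy, majority_share)
```

Any model that does not clearly beat this number has learned nothing useful from the features.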


Ridge Classifier

In [11]:
### create Ridge classifier and evaluate it
from sklearn.linear_model import RidgeClassifier
ridge_clf = RidgeClassifier(random_state=44)

# unscaled features
ridge_clf.fit(train_X, train_Y)
Y_pred = ridge_clf.predict(test_X)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_unscaled['Ridge'] = score
print('Ridge average score - X unscaled: ', score)

# scaled features
ridge_clf.fit(train_X_scaled, train_Y)
Y_pred = ridge_clf.predict(test_X_scaled)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_scaled['Ridge'] = score
print('Ridge average score - X scaled: ', score)

Ridge average score - X unscaled:  0.6881720430107527
Ridge average score - X scaled:  0.6821983273596177


Logistic Regression Classifier

In [12]:
### create Logistic Regression classifier and evaluate it
from sklearn.linear_model import LogisticRegression
lr_clf = LogisticRegression(random_state=44)

# unscaled features
lr_clf.fit(train_X, train_Y)
Y_pred = lr_clf.predict(test_X)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_unscaled['Logistic Regression'] = score
print('Logistic Regression average score - X unscaled: ', score)

# scaled features
lr_clf.fit(train_X_scaled, train_Y)
Y_pred = lr_clf.predict(test_X_scaled)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_scaled['Logistic Regression'] = score
print('Logistic Regression average score - X scaled: ', score)

Logistic Regression average score - X unscaled:  0.5651135005973715
Logistic Regression average score - X scaled:  0.6953405017921147


Stochastic Gradient Descent

In [13]:
### create SGD classifier and evaluate it
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(n_jobs=-1, random_state=44)

# unscaled features
sgd_clf.fit(train_X, train_Y)
Y_pred = sgd_clf.predict(test_X)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_unscaled['SGD'] = score
print('SGD average score - X unscaled: ', score)

# scaled features
sgd_clf.fit(train_X_scaled, train_Y)
Y_pred = sgd_clf.predict(test_X_scaled)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_scaled['SGD'] = score
print('SGD average score - X scaled: ', score)

SGD average score - X unscaled:  0.4862604540023895
SGD average score - X scaled:  0.5985663082437276


Passive Aggressive Algorithm

In [14]:
### create Passive Aggressive classifier and evaluate it
from sklearn.linear_model import PassiveAggressiveClassifier
pac_clf = PassiveAggressiveClassifier(n_jobs=-1, random_state=44)

# unscaled features
pac_clf.fit(train_X, train_Y)
Y_pred = pac_clf.predict(test_X)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_unscaled['Passive Agressive'] = score
print('Passive Agressive average score - X unscaled: ', score)

# scaled features
pac_clf.fit(train_X_scaled, train_Y)
Y_pred = pac_clf.predict(test_X_scaled)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_scaled['Passive Agressive'] = score
print('Passive Agressive average score - X scaled: ', score)

Passive Agressive average score - X unscaled:  0.5232974910394266
Passive Agressive average score - X scaled:  0.6666666666666666


Nearest Centroid Classifier

In [15]:
### create Nearest Centroid classifier and evaluate it
from sklearn.neighbors import NearestCentroid
nc_clf = NearestCentroid()

# unscaled features
nc_clf.fit(train_X, train_Y)
Y_pred = nc_clf.predict(test_X)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_unscaled['Nearest Centroid'] = score
print('Nearest Centroid average score - X unscaled: ', score)

# scaled features
nc_clf.fit(train_X_scaled, train_Y)
Y_pred = nc_clf.predict(test_X_scaled)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_scaled['Nearest Centroid'] = score
print('Nearest Centroid average score - X scaled: ', score)

Nearest Centroid average score - X unscaled:  0.2855436081242533
Nearest Centroid average score - X scaled:  0.5053763440860215


Multi-Layer Perceptron

In [16]:
### create Multi-Layer Perceptron classifier and evaluate it
from sklearn.neural_network import MLPClassifier
mlp_clf = MLPClassifier(random_state=44)

# unscaled features
mlp_clf.fit(train_X, train_Y)
Y_pred = mlp_clf.predict(test_X)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_unscaled['MLP'] = score
print('MLP average score - X unscaled: ', score)

# scaled features
mlp_clf.fit(train_X_scaled, train_Y)
Y_pred = mlp_clf.predict(test_X_scaled)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_scaled['MLP'] = score
print('MLP average score - X scaled: ', score)

MLP average score - X unscaled:  0.6547192353643967
MLP average score - X scaled:  0.7359617682198327


In [17]:
score_table = pd.concat([pd.Series(score_table_unscaled, name='base_X_unscaled'),
                         pd.Series(score_table_scaled, name='base_X_scaled')], axis=1)
score_table['MAX'] = score_table.max(axis=1)
score_table.to_csv('part2_score_table.csv')
score_table.sort_values(by='MAX', ascending=False)

Model,base_X_unscaled,base_X_scaled,MAX
SVM,0.37276,0.770609,0.770609
Random Forest,0.757467,0.756272,0.757467
MLP,0.654719,0.735962,0.735962
Logistic Regression,0.565114,0.695341,0.695341
Ridge,0.688172,0.682198,0.688172
Passive Agressive,0.523297,0.666667,0.666667
SGD,0.48626,0.598566,0.598566
KNN,0.524492,0.577061,0.577061
Nearest Centroid,0.285544,0.505376,0.505376
Decision Tree,0.500597,0.500597,0.500597


Our top 5 ML classifiers are: Support Vector Machine, Random Forest, Multi-Layer Perceptron, Logistic Regression and Ridge.<br>

We will fine-tune the hyperparameters using the predefined 9-fold split. Instead of sklearn's GridSearchCV we will write our own grid search function built on my_cross_val_using_9_folds. We will need 2 auxiliary functions to generate all possible combinations of the given parameters.
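
To make "all possible combinations" concrete, the sketch below expands a small, made-up grid with scikit-learn's `ParameterGrid`, which accepts the same list-of-dictionaries format as GridSearchCV. The grid `toy_params` is hypothetical and only illustrates the expected shape of the result.

```python
from sklearn.model_selection import ParameterGrid

# hypothetical small grid in the list-of-dicts format used by GridSearchCV
toy_params = [
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': [0.1, 1.0]},
    {'solver': ['saga'], 'C': [0.5]},
]

# every dictionary is expanded separately and the results are concatenated
toy_combinations = list(ParameterGrid(toy_params))
print(len(toy_combinations))   # 1*2*2 + 1*1 = 5 combinations
for combo in toy_combinations:
    print(combo)
```

Each element is a plain dictionary that can be passed to an estimator with `set_params(**combo)`.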

In [18]:
# this function extends a list of parameter dictionaries: it returns a new list containing every combination of
# a dictionary from 'my_list' with a value from 'values' (also a list), stored under the key 'name'
def add_param(name, values, my_list=None):
    # start from a single empty dictionary when there is nothing to extend yet;
    # a None default avoids a shared mutable default argument and leaves the caller's list untouched
    base_list = my_list if my_list else [{}]
    new_list = []
    for l in base_list:
        for value in values:
            temp_my_dict = l.copy()
            temp_my_dict[name]=value
            new_list.append(temp_my_dict)
    return new_list

# this function generates all possible combinations of the parameters given in 'params', which is a list of
# dictionaries following the approach taken by the sklearn class GridSearchCV
def make_list_of_parameters_dictionary(params):
    list_of_parameters = []
    for d in params:
        item_list = []
        for name, values in d.items():
            item_list = add_param(name, values, item_list)
        list_of_parameters.extend(item_list)
    return list_of_parameters
    
from os.path import exists

# our grid search function generates all possible combinations of the parameters given in 'params' and performs cross validation
# for the given classifier set with each parameter combination, returning the best set of parameters according to the 'accuracy' metric
# the returned set is the full set of the model's parameters
def my_grid_search_cross_validation(clf, X, df, params):
    best_score = 0
    best_params = {}
    start = time.time()
    i = 1
    list_of_parameters = make_list_of_parameters_dictionary(params)
    l = len(list_of_parameters)
    df2 = df.loc[df['fold']!=10]
    for p in list_of_parameters:
        clf.set_params(**p)
        test_score, train_score = my_cross_val_using_9_folds(clf, X, df2)
        mean_test_score = np.mean(test_score)
        mean_train_score = np.mean(train_score)
        if mean_test_score > best_score:
            best_params = clf.get_params()
            best_score = mean_test_score
        print('\n',i,'/',l,' duration: ',time.time()-start, ' best score: ',best_score)
        print('Test score:    ', mean_test_score)
        print('Train score:   ', mean_train_score)
        print(p)
        i += 1
        if exists('stop.txt'):
            print('Process interrupted.')
            break
    print('Best score: ', best_score)
    print('Best params: ', best_params)
    return best_params
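
For comparison, scikit-learn can also consume a fixed fold assignment through `PredefinedSplit`, which can be passed as `cv` to `GridSearchCV`. The sketch below runs on synthetic data (`toy_X`, `toy_y`, `toy_folds` and `toy_grid` are hypothetical stand-ins for the project data); it is an illustration of the mechanism and does not reproduce the results in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit

rng = np.random.default_rng(0)
# hypothetical toy data standing in for the scaled training features, labels and fold numbers
toy_X = rng.normal(size=(180, 5))
toy_y = (toy_X[:, 0] + 0.5 * rng.normal(size=180) > 0).astype(int)
toy_folds = np.repeat(np.arange(1, 10), 20)   # folds 1..9, fold 10 assumed to be held out already

# every distinct fold label becomes the validation set exactly once
toy_cv = PredefinedSplit(test_fold=toy_folds)

toy_grid = [{'C': [0.1, 1.0, 10.0], 'class_weight': [None, 'balanced']}]
search = GridSearchCV(LogisticRegression(max_iter=1000), toy_grid, cv=toy_cv,
                      scoring='accuracy', return_train_score=True)
search.fit(toy_X, toy_y)

print(toy_cv.get_n_splits(), search.best_params_, round(search.best_score_, 3))
```

The hand-written function keeps two conveniences that this sketch lacks: progress printing after every combination and the `stop.txt` check that allows interrupting a long search.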

<h3>Tuning Logistic Regression model</h3>
This model performs better with scaled features, which is why we will use X_scaled.<br>
Searching for optimal hyperparameter values is an iterative process. We will try some initial range of parameter values. Based on the results of the first run we will change the range or the step of the values in the next runs.
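
One common way to organize such an iterative search is to start with a coarse, log-spaced grid and then build a finer grid around the best value found. The sketch below is only an illustration of that idea: `refine_log_grid` is a hypothetical helper, and the "best" value of 1.0 is assumed, not a result from this notebook.

```python
import numpy as np

def refine_log_grid(best_value, factor=3.0, num=5):
    """Hypothetical helper: log-spaced candidates centred on the previous best value."""
    return np.geomspace(best_value / factor, best_value * factor, num)

# first run: coarse grid over several orders of magnitude
coarse_C = np.logspace(-3, 2, 6)      # 0.001, 0.01, ..., 100
# suppose the first run picked C = 1.0 (an assumed result, for illustration only)
fine_C = refine_log_grid(1.0)
print(coarse_C)
print(fine_C.round(3))
```

The refined values can then be placed in `param_grid` for the next run, and the step can be narrowed again if the best score keeps moving.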

In [19]:
param_grid = [
    {
        'solver': ['liblinear'],
        'penalty': ['l1'],
#         'tol': [0.01],
#     },
#     {
#             'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
#         'solver': ['liblinear', 'saga'],
#         'penalty': ['l2', 'l1'],
        'tol': [1e-2],
        'C': [0.5],
#         'fit_intercept': [True, False],
        'class_weight': [None, 'balanced'],
#         'max_iter': [100, 80, 120],
#         'l1_ratio': [0.1, 0.5, 0.9] # if penalty = 'elasticnet'
#     },
#     {
#      'penalty': ['elasticnet', 'l1', 'l2', 'none'],
#      'tol': [1e-7, 1e-6, 1e-5],
#      'C': [0.04, 0.05, 0.6],
#      'fit_intercept': [True, False],
#      'class_weight': [None, 'balanced'],
#      'solver': ['saga'],
#      'max_iter': [80, 90, 100]
    }
    ]

# parameters whose values we are sure of are defined here; the model is set with them before the grid search starts
constant_params = {'random_state': 44, 'dual': False, 'multi_class':'auto', 'n_jobs': -1}

In [20]:
lr_clf = LogisticRegression()
lr_clf.set_params(**constant_params)
# logostic_regression_best_params = my_grid_search_cross_validation(lr_clf, train_X_scaled, df, param_grid)

LogisticRegression(n_jobs=-1, random_state=44)

In [21]:
# saved best params
logostic_regression_best_params = {'C': 0.5, 'class_weight': None, 'dual': False, 'fit_intercept': True,
                                   'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'auto',
                                   'n_jobs': -1, 'penalty': 'l1', 'random_state': 44, 'solver': 'liblinear',
                                   'tol': 0.01, 'verbose': 0, 'warm_start': False}

In [22]:
score_table_tuned = {}

lr_clf.set_params(**logostic_regression_best_params)
lr_clf.fit(train_X_scaled, train_Y)
Y_pred = lr_clf.predict(test_X_scaled)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_tuned['Logistic Regression'] = score
print('Logistic Regression average score - X scaled: ', score)

Logistic Regression average score - X scaled:  0.7168458781362007


<h3>Tuning Ridge model</h3>

In [23]:
param_grid = [
    {
        'solver': ['auto'],
        'alpha': [x for x in range(80, 100)],
        'fit_intercept': [True],#, False],
        'class_weight': [None],#, 'balanced'],
        'positive': [False]
#     },
#     {
#         'solver': ['lbfgs'],
#         'alpha': [0.01, 0.1, 1, 10, 100],
#         'fit_intercept': [True, False],
#         'class_weight': [None, 'balanced'],
#         'positive': [True]
    }
    ]
constant_params = {'random_state': 44}

In [24]:
ridge_clf.set_params(**constant_params)
# ridge_best_params = my_grid_search_cross_validation(ridge_clf, train_X, df, param_grid)

RidgeClassifier(random_state=44)

In [25]:
ridge_best_params = {'alpha': 92, 'class_weight': None, 'copy_X': True, 'fit_intercept': True, 'max_iter': None,
                     'normalize': 'deprecated', 'positive': False, 'random_state': 44, 'solver': 'auto', 'tol': 0.001}

In [26]:
ridge_clf.set_params(**ridge_best_params)
ridge_clf.fit(train_X, train_Y)
Y_pred = ridge_clf.predict(test_X)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_tuned['Ridge'] = score
print('Ridge average score - X unscaled: ', score)

Ridge average score - X unscaled:  0.6953405017921147


<h3>Tuning MLP model</h3>

In [27]:
param_grid = [
    {
        'hidden_layer_sizes': [(1000,)],
        'activation': ['relu'], #, 'identity', 'logistic', 'tanh'],
        'alpha': [0.0001],
        'batch_size': [200],
        'learning_rate': ['constant'], #, 'invscaling', 'adaptive'],
        'early_stopping': [False, True]
    }
    ]
constant_params = {'random_state': 44, 'solver': 'adam'}

In [28]:
mlp_clf.set_params(**constant_params)
# mlp_best_params = my_grid_search_cross_validation(mlp_clf, train_X_scaled, df, param_grid)

MLPClassifier(random_state=44)

In [29]:
mlp_best_params = {'activation': 'relu', 'alpha': 0.0001, 'batch_size': 200, 'beta_1': 0.9, 'beta_2': 0.999,
                   'early_stopping': False, 'epsilon': 1e-08, 'hidden_layer_sizes': (1000,), 'learning_rate': 'constant',
                   'learning_rate_init': 0.001, 'max_fun': 15000, 'max_iter': 200, 'momentum': 0.9, 'n_iter_no_change': 10,
                   'nesterovs_momentum': True, 'power_t': 0.5, 'random_state': 44, 'shuffle': True, 'solver': 'adam',
                   'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': False, 'warm_start': False}

In [30]:
mlp_clf.set_params(**mlp_best_params)
mlp_clf.fit(train_X_scaled, train_Y)
Y_pred = mlp_clf.predict(test_X_scaled)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_tuned['MLP'] = score
print('MLP average score - X scaled: ', score)

MLP average score - X scaled:  0.7538829151732378


<h3>Tuning Random Forest classifier model</h3>

In [31]:
param_grid = [
    {
        'n_estimators': [550],
        'criterion': ["entropy"],
        'max_depth': [None],
        'min_samples_split': [2],
        'min_samples_leaf': [2],
        'max_features': ["auto"],
        'max_leaf_nodes': [None],
        'class_weight': ['balanced_subsample']
    },
    {
        'n_estimators': [575, 600], #[x for x in range(50, 1000, 50)],
        'criterion': ['gini', "entropy"],
        'max_depth': [None], #[None, 5, 10, 15, 20, 25, 30]
        'min_samples_split': [2, 3, 4, 5], #, 7, 8, 9, 10]
        'min_samples_leaf': [1, 2, 3], #, 2, 3, 4, 5, 6]
        'max_features': ["auto"], #, None, "log2"]
        'max_leaf_nodes': [None], #, 5, 10, 20, 50, 100]
        'class_weight': ['balanced_subsample']
    }
    ]
constant_params = {'random_state': 44, 'n_jobs': -1}

In [32]:
rf_clf.set_params(**constant_params)
# rf_best_params = my_grid_search_cross_validation(rf_clf, train_X, df, param_grid)

RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=44)

In [33]:
rf_best_params = {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': 'balanced_subsample', 'criterion': 'entropy',
                  'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None,
                  'min_impurity_decrease': 0.0, 'min_samples_leaf': 2, 'min_samples_split': 2,
                  'min_weight_fraction_leaf': 0.0, 'n_estimators': 550, 'n_jobs': -1, 'oob_score': False,
                  'random_state': 44, 'verbose': 0, 'warm_start': False}

In [34]:
rf_clf.set_params(**rf_best_params)
rf_clf.fit(train_X, train_Y)
Y_pred = rf_clf.predict(test_X)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_tuned['Random Forest'] = score
print('Random Forest average score - X unscaled: ', score)

Random Forest average score - X unscaled:  0.7634408602150538


<h3> Tuning SVM Classifier </h3>

In [35]:
param_grid = [
    {
        'C': [5], #[x/10 for x in range(40, 60)] #[3, 4, 5, 6, 7, 8, 9]
        'kernel': ['rbf'], #['linear', 'poly', 'rbf', 'sigmoid']
        'gamma': ['scale', 0.0001, 0.0003, 0.001, 0.002, 0.003, 0.004, 0.005], #, 'auto'], 
        'tol': [1e-2],
        'class_weight': [None], #, 'balanced'],
        'max_iter': [500], #, 1000, 2000, 5000],
        'decision_function_shape': ['ovr'],
    }]

# parameters whose values we are sure of are defined here; the model is set with them before the grid search starts
constant_params = {'random_state': 44}

In [36]:
svc_clf.set_params(**constant_params)
# svc_best_params_all_features = my_grid_search_cross_validation(svc_clf, train_X_scaled, df, param_grid)

SVC(random_state=44)

In [37]:
svc_best_params = {'C': 5, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0,
                   'decision_function_shape': 'ovo', 'degree': 3, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': 500,
                   'probability': False, 'random_state': 44, 'shrinking': True, 'tol': 0.01, 'verbose': False}

In [38]:
svc_clf.set_params(**svc_best_params)
svc_clf.fit(train_X_scaled, train_Y)
Y_pred = svc_clf.predict(test_X_scaled)
score = np.sum(Y_pred == test_Y)/len(test_Y)
score_table_tuned['SVM'] = score
print('SVM average score - X scaled: ', score)

SVM average score - X scaled:  0.7622461170848268


Let's see the results.

In [39]:
# read score table
score_table = pd.read_csv('part2_score_table.csv', index_col=0)

# drop 'MAX' column
score_table.drop(['MAX'], axis=1, inplace=True)

# build a new DataFrame that adds the tuned results
score_table = pd.concat([score_table,
                         pd.Series(score_table_tuned, name='tuned')],
                        axis=1)

# add new column with best result for each model
score_table['MAX'] = score_table.max(axis=1)

# save to file
score_table.to_csv('part2_score_table_tuned.csv')

# show sorted score table
score_table.sort_values(by='MAX', ascending=False)

Model,base_X_unscaled,base_X_scaled,tuned,MAX
SVM,0.37276,0.770609,0.762246,0.770609
Random Forest,0.757467,0.756272,0.763441,0.763441
MLP,0.654719,0.735962,0.753883,0.753883
Logistic Regression,0.565114,0.695341,0.716846,0.716846
Ridge,0.688172,0.682198,0.695341,0.695341
Passive Agressive,0.523297,0.666667,,0.666667
SGD,0.48626,0.598566,,0.598566
KNN,0.524492,0.577061,,0.577061
Nearest Centroid,0.285544,0.505376,,0.505376
Decision Tree,0.500597,0.500597,,0.500597


In general, tuning hyperparameters improves model performance unless we overfit the model. In the case of the SVM model, the tuned hyperparameters fit the training data better, but the model got a slightly worse result on new data (0.762 vs. 0.771 on fold 10). This is called overfitting.
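
A simple way to spot overfitting is to compare the training score with the score on unseen data, which is why the cross-validation function above returns both. The sketch below shows the effect on synthetic data with an unrestricted decision tree; the dataset and the variable names are hypothetical and unrelated to the SVM results in the table.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# hypothetical noisy toy problem: 20% of the labels are flipped at random
toy_X, toy_y = make_classification(n_samples=400, n_features=20, n_informative=5,
                                   flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.5, random_state=0)

toy_scores = {}
for depth in (2, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # store (train accuracy, test accuracy) for a shallow and an unrestricted tree
    toy_scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, round(toy_scores[depth][0], 3), round(toy_scores[depth][1], 3))
```

A large gap between the two numbers for the unrestricted tree indicates that the model has memorized noise in the training data instead of learning a pattern that generalizes.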