### MODEL TRAINING AND HYPERPARAMETER TUNING TO SELECT PARAMETERS FOR THE FINAL MODEL

BASELINE MODEL: SUPPORT VECTOR MACHINE

In [None]:
# Train a support vector classifier with default hyperparameters as the baseline
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

sv_classifier = SVC()
sv_classifier.fit(X_train, y_train)  # fit() returns the estimator itself, not predictions
y_predict_svc = sv_classifier.predict(X_test)
# accuracy_score expects (y_true, y_pred)
final_accuracy_svc = accuracy_score(y_test, y_predict_svc)
print("The test-accuracy of SVM model is ", final_accuracy_svc*100,"%")

The test-accuracy of SVM model is  72.8 %


HYPERPARAMETER TUNING FOR BASELINE SVM MODEL

In [None]:
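from sklearn.model_selection import GridSearchCV

# Search over the kernel type and the regularization strength C.
# Note: SVC uses random_state only when probability=True, so the four
# random_state values below give identical scores for each (kernel, C) pair.
param_grid = {"kernel": ['rbf', 'sigmoid'],
              "C": [0.1, 0.5, 1.0],
              "random_state": [0, 100, 200, 300]}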

In [None]:
grid = GridSearchCV(estimator=sv_classifier, param_grid=param_grid, cv=5,  verbose=3)

In [None]:
grid.fit(X_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV] C=0.1, kernel=rbf, random_state=0 ...............................
[CV] ... C=0.1, kernel=rbf, random_state=0, score=0.760, total=   0.0s
[CV] C=0.1, kernel=rbf, random_state=0 ...............................
[CV] ... C=0.1, kernel=rbf, random_state=0, score=0.760, total=   0.0s
[CV] C=0.1, kernel=rbf, random_state=0 ...............................
[CV] ... C=0.1, kernel=rbf, random_state=0, score=0.760, total=   0.0s
[CV] C=0.1, kernel=rbf, random_state=0 ...............................
[CV] ... C=0.1, kernel=rbf, random_state=0, score=0.760, total=   0.0s
[CV] C=0.1, kernel=rbf, random_state=0 ...............................
[CV] ... C=0.1, kernel=rbf, random_state=0, score=0.767, total=   0.0s
[CV] C=0.1, kernel=rbf, random_state=100 .............................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s


[CV] . C=0.1, kernel=rbf, random_state=100, score=0.760, total=   0.0s
[CV] C=0.1, kernel=rbf, random_state=100 .............................
[CV] . C=0.1, kernel=rbf, random_state=100, score=0.760, total=   0.0s
[CV] C=0.1, kernel=rbf, random_state=100 .............................
[CV] . C=0.1, kernel=rbf, random_state=100, score=0.760, total=   0.0s
[CV] C=0.1, kernel=rbf, random_state=100 .............................
[CV] . C=0.1, kernel=rbf, random_state=100, score=0.760, total=   0.0s
[CV] C=0.1, kernel=rbf, random_state=100 .............................
[CV] . C=0.1, kernel=rbf, random_state=100, score=0.767, total=   0.0s
[CV] C=0.1, kernel=rbf, random_state=200 .............................
[CV] . C=0.1, kernel=rbf, random_state=200, score=0.760, total=   0.0s
[CV] C=0.1, kernel=rbf, random_state=200 .............................
[CV] . C=0.1, kernel=rbf, random_state=200, score=0.760, total=   0.0s
[CV] C=0.1, kernel=rbf, random_state=200 .............................
[CV] .

[Parallel(n_jobs=1)]: Done 120 out of 120 | elapsed:    4.5s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [0.1, 0.5, 1.0], 'kernel': ['rbf', 'sigmoid'],
                         'random_state': [0, 100, 200, 300]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=3)

In [None]:
grid.best_estimator_

SVC(C=0.1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False)

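For the support vector machine classifier, grid search with 5-fold cross-validation selects C=0.1 with the RBF kernel (kernel='rbf') as the best-scoring configuration. The random_state value shown in best_estimator_ is incidental: the fold scores in the log are identical for random_state=0 and random_state=100.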
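
The fit count reported by the grid search follows directly from the shape of the grid. As a minimal sketch using only the standard library (the helper name `count_fits` is hypothetical and not part of scikit-learn), the candidates can be enumerated as the Cartesian product of the parameter lists:

```python
from itertools import product

# Same grid as the SVM search above
param_grid = {
    "kernel": ["rbf", "sigmoid"],
    "C": [0.1, 0.5, 1.0],
    "random_state": [0, 100, 200, 300],
}

def count_fits(grid, n_folds):
    """Return (number of candidates, total model fits) for an exhaustive grid search."""
    candidates = list(product(*grid.values()))
    return len(candidates), len(candidates) * n_folds

n_candidates, n_fits = count_fits(param_grid, n_folds=5)
print(n_candidates, n_fits)  # 24 120

# Dropping random_state leaves only the (kernel, C) pairs
reduced = {k: v for k, v in param_grid.items() if k != "random_state"}
print(count_fits(reduced, n_folds=5))  # (6, 30)
```

Because the logged fold scores appear identical across random_state values, a grid restricted to kernel and C would reach the same selection with 30 fits instead of 120.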

In [None]:
# Refit the baseline model with the parameters selected by the grid search
from sklearn.metrics import accuracy_score

# Evaluate the tuned SVM on the test set
sv_classifier = SVC(kernel='rbf', C=0.1, random_state=0)
sv_classifier.fit(X_train, y_train)
y_predict_svc = sv_classifier.predict(X_test)
# Score the tuned model's own predictions (y_predict_svc), with y_true first
final_accuracy_svc = accuracy_score(y_test, y_predict_svc)
print("The test-accuracy of tuned SVM model for best C value is ", final_accuracy_svc*100,"%")

The test-accuracy of tuned SVM model for best C value is  72.8 %


Tuning does not improve the SVM's test accuracy, which stays at 72.8%, so we try an XGBoost model next.
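
Both models are compared on test accuracy, that is, the fraction of test samples whose predicted label equals the true label. The sketch below reimplements that metric in plain Python to make the definition concrete. The function name `accuracy` and the toy labels are illustrative assumptions; the notebook itself uses `sklearn.metrics.accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty label list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels, made up for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy(y_true, y_pred))  # 0.75
```

Plain accuracy is symmetric in its two arguments, so swapping them does not change the score, but the (y_true, y_pred) order does matter for metrics such as precision and recall.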

XGBOOST MODEL

In [None]:
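from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Train an XGBoost classifier with default hyperparameters
xgb = XGBClassifier()
xgb.fit(X_train, y_train)  # fit() returns the estimator itself, not predictions
y_predict_xgb = xgb.predict(X_test)
final_accuracy_xgb = accuracy_score(y_test, y_predict_xgb)  # (y_true, y_pred)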
print("The test-accuracy of XG Boost model is ", final_accuracy_xgb*100,"%")


The test-accuracy of XG Boost model is  77.2 %


HYPERPARAMETER TUNING FOR XGBOOST MODEL

In [None]:
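# Note: "criterion" is a scikit-learn tree parameter rather than a documented
# XGBoost one; it doubles the grid to 64 candidates and likely has no effect
# on the fitted trees.
param_grid = {"n_estimators": [10, 50, 100, 130],
              "criterion": ['gini', 'entropy'],
              "max_depth": range(2, 10, 1)}

# Create the grid search object; n_jobs=-1 uses all available CPU cores
grid = GridSearchCV(estimator=xgb, param_grid=param_grid, cv=5, verbose=3, n_jobs=-1)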

In [None]:
#finding the best parameters
grid.fit(X_train, y_train)

Fitting 5 folds for each of 64 candidates, totalling 320 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:    2.4s
[Parallel(n_jobs=-1)]: Done 220 tasks      | elapsed:   21.2s
[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed:   33.9s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=0,
                                     learning_rate=0.1, max_delta_step=0,
                                     max_depth=3, min_child_weight=1,
                                     missing=None, n_estimators=100, n_jobs=1,
                                     nthread=None, objective='binary:logistic',
                                     random_state=0, reg_alpha=0, reg_lambda=1,
                                     scale_pos_weight=1, seed=None, silent=None,
                                     subsample=1, verbosity=1),
             iid='deprecated', n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': range(2, 10),
                         'n_estimators': [10, 50, 100, 130]},
      

In [None]:
grid.best_estimator_

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, criterion='gini', gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=2,
              min_child_weight=1, missing=None, n_estimators=10, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

For the XGBoost classifier, grid search with 5-fold cross-validation selects max_depth=2 and n_estimators=10.
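
Beyond `best_estimator_`, a fitted `GridSearchCV` exposes the winning parameters and their mean cross-validated score through `best_params_` and `best_score_`. The following is a minimal sketch on synthetic data, assuming scikit-learn is installed. It swaps in scikit-learn's `GradientBoostingClassifier` for `XGBClassifier` so that it runs without xgboost, and the dataset, the smaller grid, and the `demo_*` variable names are all illustrative rather than taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (NOT the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Reduced grid to keep the demo fast; the stand-in model is an assumption
demo_grid = {"n_estimators": [10, 50], "max_depth": [2, 3]}

demo_search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_grid=demo_grid,
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# Best parameter combination and its mean cross-validated accuracy
print(demo_search.best_params_)
print(round(demo_search.best_score_, 3))

# Number of candidates evaluated (2 x 2 = 4)
print(len(demo_search.cv_results_["params"]))
```

Note that `best_score_` is the mean validation score across the folds of the training data, so it can differ from the accuracy measured on the held-out test set.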

In [None]:
from xgboost import XGBClassifier

# Refit XGBoost with the parameters selected by the grid search
xgb = XGBClassifier(n_estimators=10, max_depth=2)
xgb.fit(X_train, y_train)
y_predict_xgb = xgb.predict(X_test)
final_accuracy = accuracy_score(y_test, y_predict_xgb)  # (y_true, y_pred)
print("The test-accuracy of tuned XG-Boost model is ", final_accuracy*100,"%")

The test-accuracy of tuned XG-Boost model is  80.0 %


With the tuned parameters, the XGBoost classifier's test accuracy rises from 77.2% to 80.0%, whereas tuning left the SVM at 72.8%. XGBoost therefore outperforms the baseline SVM by 7.2 percentage points on the test set.