# Support Vector Machine (SVM) and Model Ensemble {-}

This assignment aims to familiarize you with training and testing a Support Vector Machine (SVM) classification model, and with exploiting the power of model ensemble techniques. The steps are:
- Load the data.
- Analyze the data.
- Remove outliers and clean the data.
- Use GridSearchCV to find the best set of SVM hyperparameters.
- Build, train and evaluate the SVM model.
- Separately build, train and evaluate the other four classifiers (Logistic regression, Naive Bayes, Decision Tree, Random Forest) on the same dataset, then compare their performance with the SVM model's.
- Apply three model ensemble techniques, i.e., Bagging, Boosting and Stacking, to solve the problem, then compare their performance with each other and with that of the individual models. Draw conclusions from what has been observed.

The dataset you will be working on is 'data-breast-cancer.csv'. It contains a diagnosis label and ten numeric tumor attributes used to build a prediction model.

**1. Load the data**

In [None]:
# Load the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

In [None]:
# Load the dataset
df = pd.read_csv("data-breast-cancer.csv")

In [None]:
# Show some data samples
df.head()

,Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean
0,0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871
1,1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667
2,2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999
3,3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744
4,4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883


This dataset is used to predict whether a patient has breast cancer. It contains the diagnosis label and the following features:

- diagnosis: (label) the diagnosis of breast tissues (M = malignant, B = benign).
- radius: mean of distances from center to points on the perimeter.
- texture: standard deviation of gray-scale values.
- perimeter: perimeter of the tumor.
- area: area of the tumor.
- smoothness: local variation in radius lengths.
- compactness: is equal to (perimeter^2 / area - 1.0).
- concavity: severity of concave portions of the contour.
- concave points: number of concave portions of the contour.
- symmetry: symmetry of the tumor shape.
- fractal dimension: "coastline approximation" - 1.



**2. Analyze the data**

In [None]:
# Drop the "Unnamed: 0" column, as it only duplicates the row index and carries no useful information
df.drop(columns=["Unnamed: 0"], inplace=True)
df.head()

,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883


In [None]:
# Shape of data frame
print("Data shape: " + str(df.shape) + "\n")

Data shape: (569, 11)



In [None]:
# Show data information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   diagnosis               569 non-null    object 
 1   radius_mean             569 non-null    float64
 2   texture_mean            569 non-null    float64
 3   perimeter_mean          569 non-null    float64
 4   area_mean               569 non-null    float64
 5   smoothness_mean         569 non-null    float64
 6   compactness_mean        569 non-null    float64
 7   concavity_mean          569 non-null    float64
 8   concave points_mean     569 non-null    float64
 9   symmetry_mean           569 non-null    float64
 10  fractal_dimension_mean  569 non-null    float64
dtypes: float64(10), object(1)
memory usage: 49.0+ KB


In [None]:
# Print out different types of diagnosis
df['diagnosis'].unique()

array(['M', 'B'], dtype=object)

In [None]:
# Transform diagnosis to dummy variables using mapping
# Define mapping
mapping = {'M': 1, 'B': 0}

# Apply mapping to the DataFrame column
df['diagnosis'] = df['diagnosis'].map(mapping)

In [None]:
# Drop duplicate samples
df = df.drop_duplicates(ignore_index=True)
df.shape


(569, 11)

In [None]:
# Describe the dataset
df.describe()

,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,0.372583,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798
std,0.483918,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706
min,0.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996
25%,0.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577
50%,0.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154
75%,1.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612
max,1.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744


**3. Remove outliers and Clean the data**

In [None]:
# Define a function that removes upper-tail outliers: a row is dropped if any listed column is at or above its 98th percentile
def remove_outliers(df, columns):
    df_clean = df.copy()
    for col in columns:
        q = df[col].quantile(0.98)
        df_clean = df_clean[df_clean[col] < q]
    return df_clean

# Specify columns to remove outliers from
columns_to_clean = ['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
                    'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']

# Call the function
df_clean = remove_outliers(df, columns_to_clean)

df_clean.shape

(512, 11)
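
To make the effect of this one-sided quantile filter concrete, here is a minimal sketch on a toy frame. The helper name `remove_upper_outliers` and the toy columns `a` and `b` are made up for illustration. The logic is meant to mirror `remove_outliers`: each threshold is computed on the original frame, and a row is dropped as soon as any column reaches its threshold.

```python
import pandas as pd

def remove_upper_outliers(df, columns, q=0.98):
    """Keep rows that are strictly below the q-th quantile in every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        # Threshold is computed on the unfiltered column, as in remove_outliers
        mask &= df[col] < df[col].quantile(q)
    return df[mask]

# Toy data (hypothetical): column b is column a reversed
toy = pd.DataFrame({"a": range(100), "b": range(99, -1, -1)})
trimmed = remove_upper_outliers(toy, ["a", "b"])
print(toy.shape, "->", trimmed.shape)
```

Because each column can remove a different set of rows, filtering ten columns removes noticeably more than 2% of the data in total, which is consistent with the drop from 569 to 512 rows above. Low-valued outliers are not touched by this filter.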

In [None]:
# Separate data features by removing the data label.
X = df_clean.drop(columns=["diagnosis"])

# Assign data label to variable y
y = df_clean.diagnosis

# Split train/test with a random state
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10, train_size=0.8)

In [None]:
# Show some training samples
X_train.head()

,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean
199,14.45,20.22,94.49,642.7,0.09872,0.1206,0.118,0.0598,0.195,0.06466
284,12.89,15.7,84.08,516.6,0.07818,0.0958,0.1115,0.0339,0.1432,0.05935
198,19.18,22.49,127.5,1148.0,0.08523,0.1428,0.1114,0.06772,0.1767,0.05529
57,14.71,21.59,95.55,656.9,0.1137,0.1365,0.1293,0.08123,0.2027,0.06758
546,10.32,16.35,65.31,324.9,0.09434,0.04994,0.01012,0.005495,0.1885,0.06201
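
The train/test split above is not stratified, so the class ratio in the test set can drift from the ratio in the full data. As a hedged alternative (not what this notebook ran), `train_test_split` accepts a `stratify` argument that preserves the label proportions. The sketch below uses a made-up label vector with a 60/40 class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 60 samples of class 0 and 40 samples of class 1
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=10, stratify=y_toy
)

# With stratification, both parts keep roughly the original 40% share of class 1
print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test: ", y_te.mean())
```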


In [None]:
# Check NAN values in the data
if df_clean.isna().any().any():
    print("There are missing values in the DataFrame.")
else:
    print("No missing values found in the DataFrame.")

No missing values found in the DataFrame.


**4. Use GridSearchCV to find the best set of SVM hyperparameters. Build, train and evaluate the SVM model.**

In [None]:
# Standardize the features with StandardScaler (zero mean, unit variance per feature)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_normalized_train = scaler.fit_transform(X_train)     # Fit and transform the training data
X_normalized_test = scaler.transform(X_test)           # Only transform the test data.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
param_grid = {"C": [0.01, 0.1, 1, 10, 100, 1000],
              "gamma": ["scale", 0.001, 0.005, 0.1]}
gridsearch = GridSearchCV(SVC(), param_grid, cv=10, scoring="f1", verbose=1)     # cv: number of folds in cross validation.

# Run grid search to find the best set of hyper-parameters
gridsearch.fit(X_normalized_train, y_train)


Fitting 10 folds for each of 24 candidates, totalling 240 fits
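
The numbers in this log line follow directly from the grid: 6 values of `C` times 4 values of `gamma` give 24 candidates, and each candidate is fitted once per fold. A small sketch of that arithmetic, using only the standard library:

```python
from itertools import product

param_grid = {"C": [0.01, 0.1, 1, 10, 100, 1000],
              "gamma": ["scale", 0.001, 0.005, 0.1]}
cv_folds = 10

# Enumerate every combination of hyperparameter values
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(candidates)
total_fits = n_candidates * cv_folds

print(n_candidates, "candidates x", cv_folds, "folds =", total_fits, "fits")
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the whole training set after the search.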


In [None]:
# Best set of hyper-parameters
gridsearch.best_params_

{'C': 10, 'gamma': 'scale'}
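
In the search above, the scaler was fitted on the whole training set before cross-validation, so each validation fold has already influenced the scaling statistics. A possible variant, sketched below under stated assumptions, puts the scaler and the SVM into one pipeline so that scaling is refitted inside every fold. The sketch uses scikit-learn's built-in `load_breast_cancer` data (30 features) as a stand-in for 'data-breast-cancer.csv', and a smaller grid to keep it quick; neither choice comes from this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): scikit-learn codes malignant as 0,
# so flip the labels to match this notebook's mapping (malignant = 1)
data = load_breast_cancer()
X_demo = data.data
y_demo = (data.target == 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=10, stratify=y_demo
)

# make_pipeline names the steps after their classes, hence the "svc__" prefix
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_tr, y_tr)

print(search.best_params_)
print("Test F1:", search.score(X_te, y_te))  # score() uses the scoring given above
```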

In [None]:
# Run SVM with the best set of hyper-parameters.
model = SVC(C=gridsearch.best_params_['C'], gamma=gridsearch.best_params_['gamma'])
model.fit(X_normalized_train, y_train)

In [None]:
# Evaluating the accuracy of the model
from sklearn.metrics import accuracy_score
accuracy_svm = accuracy_score(y_test, model.predict(X_normalized_test))
print("SVM Model Accuracy:", accuracy_svm)

# Show evaluation metrics on the test set
from sklearn.metrics import classification_report
print(classification_report(y_test, model.predict(X_normalized_test)))

SVM Model Accuracy: 0.941747572815534
              precision    recall  f1-score   support

           0       0.94      0.97      0.96        67
           1       0.94      0.89      0.91        36

    accuracy                           0.94       103
   macro avg       0.94      0.93      0.94       103
weighted avg       0.94      0.94      0.94       103
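
The figures in this report can be reproduced by hand from confusion-matrix counts. The counts below are not printed by the notebook; they are inferred as the integers consistent with the class supports (67 and 36) and the rounded precision and recall values, so treat them as an assumption.

```python
# Inferred counts for the SVM on the 103 test samples (assumption, see above);
# class 1 (malignant) is treated as the positive class
tn, fp, fn, tp = 65, 2, 4, 32

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)                    # of the predicted malignant, how many are malignant
recall = tp / (tp + fn)                       # of the truly malignant, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under these counts, four malignant cases are predicted as benign, which is the kind of error a screening application usually cares about most.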



**5. Separately build, train and evaluate the other four classifiers (Logistic regression, Naive Bayes, Decision Tree, Random Forest) on the same dataset, then compare their performance with the SVM model's.**

**5.1. Logistic regression**

In [None]:
# Load the libraries
from sklearn.linear_model import LogisticRegression

grid_search = {"C": [0.01, 0.1, 1]}                        # Values of the hyperparameter C we want to try
logmodel = LogisticRegression()                            # Initialize the logistic regression model
logmodel_cv = GridSearchCV(logmodel, grid_search, cv=5)    # Set up GridSearchCV to find the best C with 5-fold cross validation
logmodel_cv.fit(X_normalized_train, y_train)               # Run the grid search on the training data

logmodel = LogisticRegression(C=logmodel_cv.best_params_['C'])  # Initialize Logistic Regression model with the best value of hyper parameter C
logmodel.fit(X_normalized_train, y_train)       # Train the model

# Make prediction on the test data
pred_y = logmodel.predict(X_normalized_test)

# Evaluating the accuracy of the model
accuracy_lr = accuracy_score(y_test, pred_y)
print("Logistic Regression Model Accuracy:", accuracy_lr)

# Show evaluation metrics on the test set
print(classification_report(y_test, pred_y))

Logistic Regression Model Accuracy: 0.8932038834951457
              precision    recall  f1-score   support

           0       0.88      0.97      0.92        67
           1       0.93      0.75      0.83        36

    accuracy                           0.89       103
   macro avg       0.90      0.86      0.88       103
weighted avg       0.90      0.89      0.89       103



**5.2. Naive Bayes**

In [None]:
# Load the libraries
from sklearn.naive_bayes import GaussianNB

# Initialize Gaussian Naive Bayes model
naive_model = GaussianNB()

# Define the values of hyperparameter var_smoothing we want to try
grid_search = {"var_smoothing": [1e-2, 1e-3, 1e-4, 1e-5]}

# Set up GridSearchCV to find the best value of hyperparameter var_smoothing, with 5-fold cross validation
naive_cv=GridSearchCV(naive_model, grid_search, cv=5)

# Train the model using GridSearchCV
naive_cv.fit(X_normalized_train, y_train)

In [None]:
# Initialize Gaussian Naive Bayes model with the best value of hyperparameter var_smoothing
naive_normal = GaussianNB(var_smoothing=naive_cv.best_params_['var_smoothing'])

# Train the model
naive_normal.fit(X_normalized_train, y_train)

# Make prediction on the test data
pred_y = naive_normal.predict(X_normalized_test)

# Evaluating the accuracy of the model
accuracy_nb = accuracy_score(y_test, pred_y)
print("Naive Bayes Model Accuracy:", accuracy_nb)

# Show evaluation metrics on the test set
print(classification_report(y_test, pred_y))


Naive Bayes Model Accuracy: 0.883495145631068
              precision    recall  f1-score   support

           0       0.91      0.91      0.91        67
           1       0.83      0.83      0.83        36

    accuracy                           0.88       103
   macro avg       0.87      0.87      0.87       103
weighted avg       0.88      0.88      0.88       103



**5.3. Decision Tree**

In [None]:
# Check whether the data is imbalanced, then apply SMOTE if necessary

# Before oversampling
print("Before oversampling: " + str(X_train.shape))
print(np.unique(y_train, return_counts=True))                  # Print the number of samples per label; label '0' dominates label '1'

Before oversampling: (409, 10)
(array([0, 1]), array([271, 138]))


In [None]:
# Apply oversampling method for label '1'
from imblearn.over_sampling import SMOTE     # Load the SMOTE library
smote = SMOTE(random_state=5)                # Initialize SMOTE
X_train_oversampling, y_train_oversampling = smote.fit_resample(X_train, y_train)     # Oversample label '1' (minority class) in the training set

In [None]:
# After oversampling
print("After oversampling: " + str(X_train_oversampling.shape))
print(np.unique(y_train_oversampling, return_counts=True))     # Print the number of samples per label; labels '0' and '1' are now balanced.

After oversampling: (542, 10)
(array([0, 1]), array([271, 271]))
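
The 133 extra samples of label '1' are synthetic. The core idea of SMOTE is to pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. The sketch below is a simplified illustration of that idea only; the function `smote_like_sample` and the toy points are hypothetical, and this is not the imbalanced-learn implementation.

```python
import numpy as np

def smote_like_sample(X_minority, k=3, n_new=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))           # pick a random minority sample
        x = X_minority[i]
        dists = np.linalg.norm(X_minority - x, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]     # k nearest neighbours, skipping x itself
        j = rng.choice(neighbours)
        u = rng.random()                            # position along the segment, in [0, 1)
        synthetic.append(x + u * (X_minority[j] - x))
    return np.array(synthetic)

# Toy minority class with two features (made-up values)
X_min = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 2.0], [2.2, 1.9], [1.8, 1.5]])
X_new = smote_like_sample(X_min)
print(X_new)
```

Since every synthetic point lies between two real minority points, oversampling before cross-validation lets near-copies of a training sample end up in the validation fold, which can make the grid-search scores in this section look better than they are.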


In [None]:
# Import the decision tree classifier; GridSearchCV (imported earlier) is used to find the best hyper-parameter set.
from sklearn.tree import DecisionTreeClassifier

params = {"criterion": ["gini", "entropy"],             # Criterion to evaluate the purity.
         "max_depth": [3, 5],                           # Maximum depth of the tree
         "min_samples_split": [4, 8]}                   # Stop splitting condition.

grid_search = GridSearchCV(estimator=DecisionTreeClassifier(), param_grid=params, cv=5)

In [None]:
# Run the search on oversampled training data samples.
grid_search.fit(X_train_oversampling, y_train_oversampling)

In [None]:
# Build a decision tree model from the best set of hyper-parameters found
model_dt = DecisionTreeClassifier(**grid_search.best_params_)

In [None]:
# Train the decision tree model
model_dt.fit(X_train_oversampling, y_train_oversampling)

In [None]:
# Make prediction on the original test set (after training on the over-sampled training set).
pred_y = model_dt.predict(X_test)

# Evaluating the accuracy of the model
accuracy_dt = accuracy_score(y_test, pred_y)
print("Decision Tree Model Accuracy:", accuracy_dt)

# Show evaluation metrics on the test set
print(classification_report(y_test, pred_y))

Decision Tree Model Accuracy: 0.9223300970873787
              precision    recall  f1-score   support

           0       0.95      0.93      0.94        67
           1       0.87      0.92      0.89        36

    accuracy                           0.92       103
   macro avg       0.91      0.92      0.92       103
weighted avg       0.92      0.92      0.92       103



**5.4. Random forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier

params = {"criterion": ["gini", "entropy"],             # Criterion to evaluate the purity.
         "max_depth": [7, 9, 11],                           # Maximum depth of the tree
         "min_samples_split": [8, 12, 16]}                   # Stop splitting condition.

grid_search_rf = GridSearchCV(estimator=RandomForestClassifier(n_estimators=10, n_jobs=10), param_grid=params, cv=5)  # Each forest has 10 trees

# Run the search on oversampled training data samples.
grid_search_rf.fit(X_train_oversampling, y_train_oversampling)     # Train the RandomForest

In [None]:
# Build a Random Forest model from the best set of hyper-parameters found
model_rf = RandomForestClassifier(n_estimators=10, random_state=1, **grid_search_rf.best_params_)     # Initialize the RandomForest

In [None]:
# Train the Random Forest model
model_rf.fit(X_train_oversampling, y_train_oversampling)

In [None]:
# Make prediction on the original test set (after training on the over-sampled training set).
pred_y = model_rf.predict(X_test)

# Evaluating the accuracy of the model
accuracy_rf = accuracy_score(y_test, pred_y)
print("Random Forest Model Accuracy:", accuracy_rf)

# Show evaluation metrics on the test set
print(classification_report(y_test, pred_y))

Random Forest Model Accuracy: 0.9223300970873787
              precision    recall  f1-score   support

           0       0.97      0.91      0.94        67
           1       0.85      0.94      0.89        36

    accuracy                           0.92       103
   macro avg       0.91      0.93      0.92       103
weighted avg       0.93      0.92      0.92       103



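**5.5. Comparison**


*   In terms of accuracy, SVM performs best (0.94). Decision Tree and Random Forest tie for second place (0.92), followed by Logistic Regression (0.89) and Naive Bayes (0.88). Note that the two tree-based models were trained on SMOTE-oversampled, unscaled features, whereas the other three were trained on standardized features without oversampling, so the comparison is not strictly like-for-like.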
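
With only 103 test samples, it helps to translate the accuracies into counts of correctly classified samples. The sketch below does this with the accuracy values printed in sections 4 and 5.

```python
n_test = 103  # size of the test set, taken from the classification reports

reported_accuracy = {
    "SVM": 0.941747572815534,
    "Decision Tree": 0.9223300970873787,
    "Random Forest": 0.9223300970873787,
    "Logistic Regression": 0.8932038834951457,
    "Naive Bayes": 0.883495145631068,
}

# Each accuracy is (number correct) / 103, so rounding recovers the count
correct = {name: round(acc * n_test) for name, acc in reported_accuracy.items()}

for name, n_correct in sorted(correct.items(), key=lambda item: -item[1]):
    print(f"{name:<20} {n_correct}/{n_test} correct")
```

The gap between the SVM and the tree-based models is two test samples, so the ranking at the top should be read with some caution.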



**6. Apply three model ensemble techniques, i.e., Bagging, Boosting and Stacking, to solve the problem, then compare their performance with each other and with that of the individual models. Draw conclusions from what has been observed.**

In [None]:
# Load the libraries
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

**6.1. Bagging with Support Vector Machine (SVM)**

In [None]:
# Creating a Support Vector Machine Classifier as the base estimator
base_svm = SVC(kernel='linear', C=1.0)

# Create a Bagging Classifier with SVM as the base model
bagging_clf = BaggingClassifier(estimator=base_svm, n_estimators=10, max_samples=0.5)

# Training the Bagging Classifier
bagging_clf.fit(X_train, y_train)

# Making predictions on the test set
pred_y = bagging_clf.predict(X_test)

# Evaluating the accuracy
accuracy_bagging = accuracy_score(y_test, pred_y)
print("Bagging Accuracy:", accuracy_bagging)

# Show evaluation metrics on the test set
print(classification_report(y_test, pred_y))

Bagging Accuracy: 0.883495145631068
              precision    recall  f1-score   support

           0       0.90      0.93      0.91        67
           1       0.85      0.81      0.83        36

    accuracy                           0.88       103
   macro avg       0.88      0.87      0.87       103
weighted avg       0.88      0.88      0.88       103
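
With `max_samples=0.5` and the default `bootstrap=True`, each of the 10 SVMs is trained on a sample drawn with replacement that is about half the size of the training set. The sketch below estimates how much of the training set a single base model actually sees. It only simulates the index sampling with NumPy and is not taken from scikit-learn's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train = 409                   # number of training samples in this notebook
n_draw = int(0.5 * n_train)     # samples drawn per base model (max_samples=0.5)

unique_fractions = []
for _ in range(10):             # one bootstrap sample per base estimator
    idx = rng.integers(0, n_train, size=n_draw)   # draw indices with replacement
    unique_fractions.append(len(np.unique(idx)) / n_train)

mean_unique = float(np.mean(unique_fractions))
expected = 1 - (1 - 1 / n_train) ** n_draw        # probability a given sample is drawn at least once

print(f"simulated: {mean_unique:.3f}, expected: {expected:.3f}")
```

Each base model therefore sees well under half of the distinct training samples, which increases diversity between the models but also weakens each one individually.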



**6.2. Boosting with AdaBoost, Gradient Boosting and XGBoost**

In [None]:
# AdaBoost

# Create an AdaBoost Classifier with Decision Tree as the base model
ada_clf = AdaBoostClassifier(estimator = DecisionTreeClassifier(), n_estimators=10)

# Train the AdaBoost Classifier
ada_clf.fit(X_train, y_train)

# Making predictions on the test set
pred_y_ada = ada_clf.predict(X_test)

# Evaluating the accuracy of the model
accuracy_ada = accuracy_score(y_test, pred_y_ada)
print("AdaBoost Accuracy:", accuracy_ada)

# Show evaluation metrics on the test set
print(classification_report(y_test, pred_y_ada))

AdaBoost Accuracy: 0.883495145631068
              precision    recall  f1-score   support

           0       0.97      0.85      0.90        67
           1       0.77      0.94      0.85        36

    accuracy                           0.88       103
   macro avg       0.87      0.90      0.88       103
weighted avg       0.90      0.88      0.89       103



In [None]:
# Gradient Boosting

# Create a Gradient Boosting Classifier, which boosts decision trees
gb_clf = GradientBoostingClassifier(n_estimators=10, learning_rate=0.1)

# Train the Gradient Boosting Classifier
gb_clf.fit(X_train, y_train)

# Making predictions on the test set
pred_y_gb = gb_clf.predict(X_test)

# Evaluating the accuracy of the model
accuracy_gb = accuracy_score(y_test, pred_y_gb)
print("Gradient Boosting Accuracy:", accuracy_gb)

# Show evaluation metrics on the test set
print(classification_report(y_test, pred_y_gb))

Gradient Boosting Accuracy: 0.912621359223301
              precision    recall  f1-score   support

           0       0.95      0.91      0.93        67
           1       0.85      0.92      0.88        36

    accuracy                           0.91       103
   macro avg       0.90      0.91      0.91       103
weighted avg       0.92      0.91      0.91       103



In [None]:
# XGBoost

# Create an XGBoost Classifier
xgb_clf = XGBClassifier(n_estimators=100, learning_rate=0.1)

# Train the XGBoost Classifier
xgb_clf.fit(X_train, y_train)

# Making predictions on the test set
pred_y_xgb = xgb_clf.predict(X_test)

# Evaluating the accuracy of the model
accuracy_xgb = accuracy_score(y_test, pred_y_xgb)
print("XGBoost Accuracy:", accuracy_xgb)

# Show evaluation metrics on the test set
print(classification_report(y_test, pred_y_xgb))

XGBoost Accuracy: 0.9320388349514563
              precision    recall  f1-score   support

           0       0.97      0.93      0.95        67
           1       0.87      0.94      0.91        36

    accuracy                           0.93       103
   macro avg       0.92      0.93      0.93       103
weighted avg       0.93      0.93      0.93       103



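Among the three boosting models, XGBoost achieves the highest accuracy (0.93) and matches or exceeds the other two on per-class precision, recall and f1-score. Note that it was run with 100 estimators, whereas AdaBoost and Gradient Boosting used only 10.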

**6.3. Stacking**

In [None]:
# K-Nearest Neighbor (KNN)
# Train a K-Nearest Neighbor (KNN) model
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()                           # Initialize KNN model.
params_knn = {'n_neighbors': np.arange(1, 25)}         # n_neighbors in KNeighborsClassifier() indicates the number of neighbors K.
knn_gs = GridSearchCV(knn, params_knn, cv=5)           # Initialize GridSearchCV to find an optimal value of K.
knn_gs.fit(X_train, y_train)                           # Fit the grid search on the training set to find the optimal K.

# Best number of neighbors K
knn_best = knn_gs.best_estimator_
print(knn_gs.best_params_)

{'n_neighbors': 4}


In [None]:
# Support Vector Machine (SVM)
# Train a Support Vector Machine (SVM) model
svm = SVC()
params_svm = {"C": [0.1, 1, 10, 100]}

svm_gs = GridSearchCV(svm, params_svm, cv=5)    # Initialize GridSearchCV to find an optimal value of the hyperparameter C.
svm_gs.fit(X_train, y_train)                    # Fit the grid search on the training set to find the optimal C.

# Best value of the hyperparameter C.
svm_best = svm_gs.best_estimator_
print(svm_gs.best_params_)

{'C': 10}


In [None]:
# Random Forest
# Train a Random Forest classifier
rf = RandomForestClassifier()                        # Initialize a Random Forest Classifier.
params_rf = {'n_estimators': [50, 100, 200]}         # n_estimators in RandomForestClassifier(...) sets the number of trees in the forest.
rf_gs = GridSearchCV(rf, params_rf, cv=5)            # Initialize GridSearchCV to find an optimal number of trees.
rf_gs.fit(X_train, y_train)                          # Fit the grid search on the training set to find the optimal number of trees.

# Best number of Trees.
rf_best = rf_gs.best_estimator_
print(rf_gs.best_params_)

{'n_estimators': 100}


In [None]:
# Logistic Regression
# Train a Logistic Regression model
log_reg = LogisticRegression(solver='lbfgs', max_iter=1000)   # Initialize Logistic Regression model.
log_reg.fit(X_train, y_train)                                 # Fit the model to training set.

In [None]:
# Model Testing
# Print accuracy of single models on the test set
print('KNN: {}'.format(knn_best.score(X_test, y_test)))                     # KNN accuracy
print('SVM: {}'.format(svm_best.score(X_test, y_test)))                     # SVM accuracy
print('Random Forest: {}'.format(rf_best.score(X_test, y_test)))            # Random Forest accuracy
print('Logistic Regression: {}'.format(log_reg.score(X_test, y_test)))      # Logistic Regression accuracy

KNN: 0.8932038834951457
SVM: 0.8640776699029126
Random Forest: 0.9029126213592233
Logistic Regression: 0.883495145631068


In [None]:
# Model ensembling

from sklearn.ensemble import VotingClassifier
# Ensemble the four models using hard (majority) voting.
# Note: VotingClassifier combines the base predictions by vote; unlike stacking, it does not train a meta-learner on them.
estimators = [('knn', knn_best), ('svm', svm_best), ('rf', rf_best), ('log_reg', log_reg)]    # Base models in the ensemble
ensemble = VotingClassifier(estimators, voting='hard')                                        # Define how to ensemble them, i.e., hard voting

# Train the model ensemble on the training set, then score it on the test set
ensemble.fit(X_train, y_train)
ensemble.score(X_test, y_test)

0.883495145631068
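
The ensemble above combines the base models by majority vote. Stacking in the stricter sense trains a meta-learner on the cross-validated predictions of the base models. Below is a minimal sketch of that approach with scikit-learn's `StackingClassifier`, under stated assumptions: it uses the built-in `load_breast_cancer` data as a stand-in for the CSV file, wraps the scale-sensitive models in a `StandardScaler` pipeline, and reuses `n_neighbors=4` and `C=10` from the searches above purely as illustrative values. Its result is not comparable to the numbers reported in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): 569 samples, 30 features
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=10, stratify=y_demo
)

# Base models; KNN and SVM are scale-sensitive, so they get their own scaler
base_models = [
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=4))),
    ("svm", make_pipeline(StandardScaler(), SVC(C=10))),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# The meta-learner is trained on out-of-fold predictions of the base models
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_tr, y_tr)
stack_accuracy = stack.score(X_te, y_te)
print("Stacking accuracy:", stack_accuracy)
```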

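Boosting with XGBoost performs best among the ensemble methods, with an accuracy of 0.93, ahead of Bagging (0.88) and the voting ensemble used for Stacking (0.88). However, it still performs slightly worse than the individual SVM model, which reaches an accuracy of 0.94. This runs against the common expectation that a model ensemble achieves higher accuracy than a single model. Possible reasons include limited diversity among the base models, overfitting and data quality, all of which can lower accuracy. In addition, the ensembles here were trained on unscaled features, whereas the standalone SVM used standardized features and tuned hyperparameters, so the comparison is not strictly like-for-like.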