Example from: https://www.datacamp.com/tutorial/svm-classification-scikit-learn-python

First, load the dataset you will use: the breast cancer dataset that ships with scikit-learn.

In [1]:
#Import scikit-learn dataset library
from sklearn import datasets

#Load dataset
cancer = datasets.load_breast_cancer()

Exploring Data

After you have loaded the dataset, you might want to know a little bit more about it. You can check feature and target names.

In [2]:
# Print the names of the 30 features
print("Features: ", cancer.feature_names)

# Print the class labels ('malignant', 'benign')
print("Labels: ", cancer.target_names)

Features:  ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
Labels:  ['malignant' 'benign']
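
Beyond the names, it can help to check how large the dataset is and how the two classes are balanced. The following is a minimal sketch, not part of the original tutorial; the variable names n_samples, n_features, and class_counts are introduced here for illustration only.

```python
# Sketch: inspect the size and class balance of the breast cancer dataset.
import numpy as np
from sklearn import datasets

cancer = datasets.load_breast_cancer()

# cancer.data is a 2-D array: one row per sample, one column per feature
n_samples, n_features = cancer.data.shape
print("Samples:", n_samples, "Features:", n_features)

# Count how many samples carry each label (0 = malignant, 1 = benign)
labels, counts = np.unique(cancer.target, return_counts=True)
class_counts = {str(cancer.target_names[i]): int(c) for i, c in zip(labels, counts)}
print("Class counts:", class_counts)
```

The classes are not evenly balanced, which is worth keeping in mind when you read the accuracy, precision, and recall values later on.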


Splitting Data

To understand model performance, dividing the dataset into a training set and a test set is a good strategy.

Split the dataset with the train_test_split() function. You need to pass three arguments: the features, the target, and the test set size (test_size). The split is random by default; set random_state to a fixed integer to get the same split on every run.

In [3]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
# test_size=0.3 means 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=109
)
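
To see what random_state does in practice, you can run the split twice with the same seed and compare the results. This is a small sketch under that assumption; split_a, split_b, and same_split are illustrative names that do not appear in the original notebook.

```python
# Sketch: a fixed random_state makes train_test_split reproducible.
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()

# Two calls with the same random_state shuffle the rows identically
split_a = train_test_split(cancer.data, cancer.target, test_size=0.3, random_state=109)
split_b = train_test_split(cancer.data, cancer.target, test_size=0.3, random_state=109)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

X_train, X_test, y_train, y_test = split_a
print("Train rows:", X_train.shape[0], "Test rows:", X_test.shape[0])
print("Identical splits:", same_split)
```

With 569 samples and test_size=0.3, the test set holds 171 rows and the training set holds the remaining 398.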

Generating Model

Let's build a support vector machine model. First, import the svm module and create a support vector classifier object by passing kernel='linear' to the SVC() function.

Then, fit your model on the training set using fit() and make predictions on the test set using predict().

In [4]:
#Import svm model
from sklearn import svm

#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel

#Train the model using the training sets
clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)
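
If you are curious what the fitted linear classifier contains, the sketch below inspects a few of its attributes. It repeats the loading and splitting steps so that it runs on its own; it is an illustration, not a step from the original tutorial.

```python
# Sketch: look inside a fitted linear-kernel SVC.
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=109
)

clf = svm.SVC(kernel="linear")
clf.fit(X_train, y_train)

# With a linear kernel the decision boundary is a hyperplane w.x + b = 0;
# coef_ holds w (one weight per feature) and intercept_ holds b
print("Weight vector shape:", clf.coef_.shape)
print("Intercept shape:", clf.intercept_.shape)

# n_support_ gives the number of support vectors found for each class
print("Support vectors per class:", clf.n_support_)
```

Note that coef_ is only available for the linear kernel; other kernels do not expose a weight vector in the original feature space.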

Evaluating the Model

Let's estimate how accurately the classifier predicts whether a tumor is malignant or benign.

Accuracy can be computed by comparing actual test set values and predicted values.

In [5]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model Accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.9649122807017544


You got a classification rate of 96.49%, which is generally considered very good accuracy.
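
Accuracy is simply the share of predictions that match the true labels; with a 171-row test set, a score of 0.9649 corresponds to 165 correct predictions. The sketch below reproduces the calculation by hand on a small set of hypothetical toy labels (y_true and y_hat are made up for illustration) and compares it with accuracy_score.

```python
# Sketch: accuracy computed by hand versus accuracy_score.
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical toy labels, not taken from the breast cancer data
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 1])

# Accuracy = number of matching positions / total number of predictions
manual_accuracy = np.mean(y_true == y_hat)
library_accuracy = accuracy_score(y_true, y_hat)
print(manual_accuracy, library_accuracy)
```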

For further evaluation, you can also check the precision and recall of the model.

In [6]:
# Model Precision: of the samples predicted positive, what fraction is actually positive?
# (by default scikit-learn treats label 1, which is 'benign' in this dataset, as the positive class)
print("Precision:",metrics.precision_score(y_test, y_pred))

# Model Recall: of the actually positive samples, what fraction is predicted positive?
print("Recall:",metrics.recall_score(y_test, y_pred))


Precision: 0.9811320754716981
Recall: 0.9629629629629629
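
The difference between precision and recall is easiest to see by counting true positives, false positives, and false negatives directly. The following sketch uses hypothetical toy labels (not the tutorial's data) and checks the hand calculation against scikit-learn.

```python
# Sketch: precision and recall from raw counts.
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical toy labels (1 = positive class), not taken from the dataset
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_hat = np.array([1, 0, 1, 0, 1, 1, 1, 1])

tp = np.sum((y_true == 1) & (y_hat == 1))  # predicted positive, actually positive
fp = np.sum((y_true == 0) & (y_hat == 1))  # predicted positive, actually negative
fn = np.sum((y_true == 1) & (y_hat == 0))  # predicted negative, actually positive

manual_precision = tp / (tp + fp)  # how trustworthy a positive prediction is
manual_recall = tp / (tp + fn)     # how many actual positives were found

print("Precision:", manual_precision, precision_score(y_true, y_hat))
print("Recall:", manual_recall, recall_score(y_true, y_hat))
```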


In [7]:
#Import the Data as dataframe
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
cancer_data = load_breast_cancer()
df = pd.DataFrame(data = cancer_data.data, columns=cancer_data.feature_names)
df.head()

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [8]:
#Visualize the class balance as a pie chart
import plotly.express as px
# labels holds the distinct target values (0, 1); counts holds how often each occurs
labels, counts = np.unique(cancer_data.target, return_counts=True)
fig = px.pie(values=counts, names=cancer_data.target_names, title="Malignant vs Benign", color_discrete_sequence=['skyblue', 'black'])
fig.show()

In [9]:
#Train Test Split the Data
from sklearn.model_selection import train_test_split

X = cancer_data.data
y = cancer_data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)
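
Passing stratify=y asks train_test_split to keep the class proportions roughly the same in both parts. A quick way to check this, sketched below with the same arguments as the cell above, is to compare the share of benign samples in the full, training, and test labels.

```python
# Sketch: stratify=y preserves the class proportions in both splits.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer_data = load_breast_cancer()
X, y = cancer_data.data, cancer_data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

# Because the labels are 0/1, the mean of y is the share of benign samples
for name, labels in [("full", y), ("train", y_train), ("test", y_test)]:
    print(f"{name:>5}: {labels.mean():.3f} benign")
```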


In [10]:
#Build Model
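from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

# 3-fold splitter for cross-validation; shuffle=True without a random_state
# means the fold assignment changes from run to run
kf = KFold(n_splits=3, shuffle=True)
# A small forest (5 trees) used as the baseline before tuning
model = RandomForestClassifier(n_estimators = 5, random_state=42)
model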
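
To make the role of KFold more concrete, the sketch below iterates over the folds it produces. It uses a stand-in array of 398 row indices (the size of a 70% training split of 569 samples), and random_state=0 is an assumption added here so the folds are repeatable; kf_demo and fold_sizes are illustrative names.

```python
# Sketch: what a 3-fold KFold splitter yields.
import numpy as np
from sklearn.model_selection import KFold

# Stand-in for the 398 training rows; only the row indices matter here
row_ids = np.arange(398)

# random_state=0 is an assumption added so the folds are repeatable
kf_demo = KFold(n_splits=3, shuffle=True, random_state=0)

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kf_demo.split(row_ids), start=1):
    # Each row lands in the validation part of exactly one fold
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"Fold {fold}: {len(train_idx)} train rows, {len(val_idx)} validation rows")
```

Each fold trains on about two thirds of the rows and validates on the remaining third, so every row is used for validation exactly once.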

In [11]:
#Train the model and get predictions
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred

array([0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1,
       1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,
       1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1])

In [12]:
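from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score

cm = confusion_matrix(y_test, y_pred)
# For binary labels, ravel() flattens the matrix in the order tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()
specificity = tn / (tn+fp)  # true negative rate
sensitivity = tp / (tp + fn)  # true positive rate (recall)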
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
print(cm)

[[ 58   6]
 [  5 102]]
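
In scikit-learn's layout, rows of the confusion matrix are the actual classes and columns are the predicted classes. As a worked check, the sketch below plugs the four counts printed above into the standard formulas; the numbers are copied from that output rather than recomputed from a model.

```python
# Sketch: derive the summary metrics from the four confusion-matrix counts.
# Counts copied from the confusion matrix printed above
tn, fp, fn, tp = 58, 6, 5, 102
total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # 160 of 171 predictions are correct
sensitivity = tp / (tp + fn)          # share of actual positives that were found
specificity = tn / (tn + fp)          # share of actual negatives that were found
precision = tp / (tp + fp)            # share of positive predictions that are right
f1 = 2 * tp / (2 * tp + fp + fn)      # harmonic mean of precision and sensitivity

for name, value in [("Accuracy", accuracy), ("Sensitivity", sensitivity),
                    ("Specificity", specificity), ("Precision", precision),
                    ("F1 Score", f1)]:
    print(f"{name}: {value:.6f}")
```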


In [13]:
from statistics import stdev
from sklearn.model_selection import cross_val_score
score = cross_val_score(model, X_train, y_train, cv=kf, scoring='recall')
model_cv_score = score.mean()
model_cv_stdev = stdev(score)
print('Cross Validation Recall scores are: {}'.format(score))
print('Average Cross Validation Recall score: ', model_cv_score)
print('Cross Validation Recall standard deviation: ', model_cv_stdev)

Cross Validation Recall scores are: [0.95180723 0.91764706 0.96341463]
Average Cross Validation Recall score:  0.9442896406285112
Cross Validation Recall standard deviation:  0.023791875461460717


In [14]:
x = [(accuracy, sensitivity, specificity,precision, f1)]

scores = pd.DataFrame(data = x, columns=
                        ['Accuracy',  'Sensitivity', 'Specificity', 'Precision','F1 Score'])
scores.insert(0, 'Random Forest', 'Before tuning hyperparameters')
scores

,Random Forest,Accuracy,Sensitivity,Specificity,Precision,F1 Score
0,Before tuning hyperparameters,0.935673,0.953271,0.90625,0.944444,0.948837


Hyperparameter Tuning Using GridSearchCV

In [15]:
#Use GridSearchCV to find the best hyperparameters
from sklearn.model_selection import GridSearchCV
# Every combination of the values below is evaluated with cross-validation
params = {
    'n_estimators' : [40,50,100,150,200],
    'max_features': [ 'sqrt', 'log2'],
    'max_depth': [4,5,6],
    'min_samples_split': [2,3,4,6],
    'min_samples_leaf': [1,2,3,5],
}

grid_search = GridSearchCV(model, param_grid=params, cv = kf, scoring='recall').fit(X_train, y_train)
grid_search
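
A grid search tries every combination in the grid, so its cost grows quickly. The sketch below uses ParameterGrid to count the candidates for the grid above and multiplies by the three folds; n_candidates, n_folds, and n_fits are illustrative names.

```python
# Sketch: count how many models the grid search has to fit.
from sklearn.model_selection import ParameterGrid

params = {
    'n_estimators': [40, 50, 100, 150, 200],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [4, 5, 6],
    'min_samples_split': [2, 3, 4, 6],
    'min_samples_leaf': [1, 2, 3, 5],
}

# ParameterGrid enumerates the Cartesian product of all value lists
n_candidates = len(ParameterGrid(params))
n_folds = 3  # matches KFold(n_splits=3)
n_fits = n_candidates * n_folds

print("Candidate settings:", n_candidates)
print("Cross-validation fits:", n_fits)
```

On top of these fits, GridSearchCV refits the best setting once on the whole training set when refit is left at its default of True.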

In [16]:
print('Best parameters:', grid_search.best_params_)
print('Best score:', grid_search.best_score_)

Best parameters: {'max_depth': 6, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
Best score: 0.9839816933638444


In [17]:
y_pred = grid_search.predict(X_test)
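
Calling predict() on a fitted GridSearchCV object delegates to the best estimator, which is refit on the full training set by default. The sketch below checks this on a deliberately tiny grid so that it runs quickly; small_grid and search are hypothetical stand-ins, not the grid used in this notebook.

```python
# Sketch: GridSearchCV.predict uses the refitted best estimator.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

# Hypothetical tiny grid, chosen only to keep the example fast
small_grid = {"n_estimators": [10, 20], "max_depth": [3, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    cv=3,
    scoring="recall",
)
search.fit(X_train, y_train)

# With the default refit=True, both calls go through the same fitted model
same_predictions = np.array_equal(
    search.predict(X_test), search.best_estimator_.predict(X_test)
)
print("Best parameters:", search.best_params_)
print("Same predictions:", same_predictions)
```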

In [18]:
grid_cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = grid_cm.ravel()
grid_specificity = tn / (tn+fp)
grid_sensitivity = tp / (tp + fn)
grid_precision = precision_score(y_test, y_pred)
grid_f1 = f1_score(y_test, y_pred)
grid_accuracy = accuracy_score(y_test, y_pred)

print(grid_cm)

[[ 58   6]
 [  4 103]]


In [19]:
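# grid_search is itself an estimator, so each fold reruns the full grid search
# on that fold's training part (nested cross-validation); this can be slow
score2 = cross_val_score(grid_search, X_train, y_train, cv=kf, scoring='recall')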
grid_cv_score = score2.mean()
grid_cv_stdev = stdev(score2)

print('Cross Validation Recall scores are: {}'.format(score2))
print('Average Cross Validation Recall score: ', grid_cv_score)
print('Cross Validation Recall standard deviation: ', grid_cv_stdev)

Cross Validation Recall scores are: [0.95180723 0.9875     0.96551724]
Average Cross Validation Recall score:  0.9682748234316577
Cross Validation Recall standard deviation:  0.0180054622545854


In [20]:
x2 = [(grid_accuracy, grid_sensitivity, grid_specificity,grid_precision, grid_f1)]

grid_scores =  pd.DataFrame(data = x2, columns=
                        ['Accuracy', 'Sensitivity', 'Specificity', 'Precision','F1 Score'])
grid_scores.insert(0, 'Random Forest', 'After tuning hyperparameters')
grid_scores

,Random Forest,Accuracy,Sensitivity,Specificity,Precision,F1 Score
0,After tuning hyperparameters,0.94152,0.962617,0.90625,0.944954,0.953704


In [23]:
predictions = pd.concat([scores, grid_scores], ignore_index=True, sort=False)
# sort_values returns a new DataFrame, so assign the result back
predictions = predictions.sort_values(by=['Accuracy'], ascending=False)
predictions

,Random Forest,Accuracy,Sensitivity,Specificity,Precision,F1 Score
1,After tuning hyperparameters,0.94152,0.962617,0.90625,0.944954,0.953704
0,Before tuning hyperparameters,0.935673,0.953271,0.90625,0.944444,0.948837
