<a href="https://colab.research.google.com/github/aleksanderprofic/Machine-Learning/blob/master/Classification/ModelSelection/tumors_model_selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification model selection for Tumors dataset

### Selecting the best model for this problem from the classification models covered so far:
* Logistic Regression
* K-Nearest Neighbors
* Support Vector Machines
* Naive Bayes
* Decision Trees
* Random Forests
* XGBoost


## Data preprocessing

In [1]:
import pandas as pd
import numpy as np

dataset = pd.read_csv('Tumors.csv')
dataset.head()

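| Unnamed: 0 | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |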


### Extracting dependent and independent variables

In [2]:
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values

### Feature Scaling

In [3]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)

### Splitting dataset into the Training Set and the Test Set 

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
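
One caveat with the cells above: the scaler is fitted on the whole dataset before the split, so statistics from the test rows influence the training features. A minimal sketch of an alternative follows. It assumes synthetic data from `make_classification` in place of `Tumors.csv` and an arbitrary `random_state=0`; the names `X_demo`, `y_demo` and `model` are illustrative and not part of the original notebook. The results reported below were produced with the original cell order, so numbers obtained this way may differ slightly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the nine tumor features (assumption, not the real data).
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

# Split first; stratify keeps the class ratio the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

# The pipeline fits the scaler on the training rows only, then the classifier.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print('Accuracy score: {:.2f}%'.format(test_accuracy * 100))
```

Passing such a pipeline to `cross_val_score` or `GridSearchCV` also refits the scaler inside every fold, which keeps the validation folds out of the scaling statistics.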

## Training and predictions

### Logistic Regression

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

log_regressor = LogisticRegression()
log_regressor.fit(X_train, y_train)
log_y_pred = log_regressor.predict(X_test)

cm = confusion_matrix(y_test, log_y_pred)

print('Accuracy score: {:.2f}%'.format(accuracy_score(y_test, log_y_pred) * 100))
print('Recall score: {:.2f}%'.format(recall_score(y_test, log_y_pred, pos_label=4) * 100))
print(f'Confusion matrix: \n{cm}')

Accuracy score: 97.81%
Recall score: 98.11%
Confusion matrix: 
[[82  2]
 [ 1 52]]
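
As a quick check, both scores can be recomputed by hand from the confusion matrix printed above. This is only a sketch: it assumes the usual coding of this dataset (class 2 is benign, class 4 is malignant), and `cm_demo` is a hand-copied array rather than a variable from the notebook.

```python
import numpy as np

# Confusion matrix copied from the output above: rows are true classes,
# columns are predicted classes, both in sorted label order (2, then 4).
cm_demo = np.array([[82, 2],
                    [1, 52]])

tn, fp = cm_demo[0]   # true class 2: predicted 2, predicted 4
fn, tp = cm_demo[1]   # true class 4: predicted 2, predicted 4

accuracy = (tp + tn) / cm_demo.sum()   # correct predictions / all samples
recall = tp / (tp + fn)                # share of class-4 cases that were detected

print('Accuracy score: {:.2f}%'.format(accuracy * 100))  # 97.81%
print('Recall score: {:.2f}%'.format(recall * 100))      # 98.11%
```

Recall for class 4 is the more important number here, because a false negative means a missed malignant tumor.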


#### Applying k-fold Cross Validation

In [6]:
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator=log_regressor, X=X_train, y=y_train, cv=10)
print('Mean accuracy: {:.2f}%'.format(accuracies.mean() * 100))
print('Standard deviation: {:.2f}%'.format(accuracies.std() * 100))

Mean accuracy: 96.53%
Standard deviation: 2.86%
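
To make clear what `cross_val_score` does with `cv=10`, the loop below reproduces it by hand. It is a sketch on synthetic data (`make_classification` with an arbitrary seed stands in for the tumor features), and the names `manual_scores` and `library_scores` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption, not the real dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)
estimator = LogisticRegression()

# For a classifier and an integer cv, scikit-learn uses stratified folds.
folds = StratifiedKFold(n_splits=10)
manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    fold_model = clone(estimator)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(estimator, X_demo, y_demo, cv=10)

print('Mean accuracy: {:.2f}%'.format(manual_scores.mean() * 100))
print('Standard deviation: {:.2f}%'.format(manual_scores.std() * 100))
```

Each of the ten scores comes from a model that never saw its validation fold, so the mean estimates generalization and the standard deviation shows how much that estimate moves between folds.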


### K-Nearest Neighbors

#### Performing Grid Search to find the best hyperparameter

In [7]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix
from sklearn.model_selection import GridSearchCV

parameters = [{'n_neighbors': [3,4,5,6,7,10]}]
grid_search = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=parameters, scoring='accuracy', n_jobs=-1, cv=10)
grid_search.fit(X_train, y_train)

print('Best mean accuracy: {:.2f}%'.format(grid_search.best_score_ * 100))
print('Standard deviation: {:.2f}%'.format(grid_search.cv_results_['std_test_score'][grid_search.best_index_] * 100))
print(f'Best parameter: {grid_search.best_params_}')

Best mean accuracy: 96.90%
Standard deviation: 2.44%
Best parameter: {'n_neighbors': 7}
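
The next cell retrains the classifier with the winning value by hand. With the default `refit=True`, `GridSearchCV` has already done this, and the fitted model is available as `best_estimator_`. A minimal sketch, assuming synthetic data and an arbitrary seed (the names `search` and `best_knn` are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption, not the real dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

search = GridSearchCV(estimator=KNeighborsClassifier(),
                      param_grid={'n_neighbors': [3, 4, 5, 6, 7, 10]},
                      scoring='accuracy', cv=10)
search.fit(X_tr, y_tr)

# The best configuration is already retrained on the whole training set.
best_knn = search.best_estimator_
demo_pred = best_knn.predict(X_te)

print(f'Best parameter: {search.best_params_}')
```

Using `best_estimator_` avoids copying hyperparameters between cells, which matters more once the grid has several parameters.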


In [8]:
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
knn_y_pred = knn.predict(X_test)

cm = confusion_matrix(y_test, knn_y_pred)

print('Accuracy score: {:.2f}%'.format(accuracy_score(y_test, knn_y_pred) * 100))
print('Recall score: {:.2f}%'.format(recall_score(y_test, knn_y_pred, pos_label=4) * 100))
print(f'Confusion matrix: \n{cm}')

Accuracy score: 96.35%
Recall score: 96.23%
Confusion matrix: 
[[81  3]
 [ 2 51]]


### Support Vector Machines

#### Performing Grid Search to find the best hyperparameters

In [9]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix
from sklearn.model_selection import GridSearchCV

# C must be strictly positive, so 0 is not a valid candidate and is left out.
parameters = [{'C': [0.1, 0.25, 0.5, 0.75, 1], 'kernel': ['rbf'], 'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 'scale']}]
grid_search = GridSearchCV(estimator=SVC(), param_grid=parameters, scoring='accuracy', n_jobs=-1, cv=10)
grid_search.fit(X_train, y_train)

print('Best mean accuracy: {:.2f}%'.format(grid_search.best_score_ * 100))
print('Standard deviation: {:.2f}%'.format(grid_search.cv_results_['std_test_score'][grid_search.best_index_] * 100))
print(f'Best parameter: {grid_search.best_params_}')

Best mean accuracy: 96.90%
Standard deviation: 2.94%
Best parameter: {'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}


In [10]:
svc = SVC(C=1, kernel='rbf', gamma='scale')
svc.fit(X_train, y_train)
svc_y_pred = svc.predict(X_test)

cm = confusion_matrix(y_test, svc_y_pred)

print('Accuracy score: {:.2f}%'.format(accuracy_score(y_test, svc_y_pred) * 100))
print('Recall score: {:.2f}%'.format(recall_score(y_test, svc_y_pred, pos_label=4) * 100))
print(f'Confusion matrix: \n{cm}')

Accuracy score: 97.81%
Recall score: 98.11%
Confusion matrix: 
[[82  2]
 [ 1 52]]


### Naive Bayes

In [11]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

nb = GaussianNB()
nb.fit(X_train, y_train)
nb_y_pred = nb.predict(X_test)

cm = confusion_matrix(y_test, nb_y_pred)

print('Accuracy score: {:.2f}%'.format(accuracy_score(y_test, nb_y_pred) * 100))
print('Recall score: {:.2f}%'.format(recall_score(y_test, nb_y_pred, pos_label=4) * 100))
print(f'Confusion matrix: \n{cm}')

Accuracy score: 94.89%
Recall score: 98.11%
Confusion matrix: 
[[78  6]
 [ 1 52]]


#### Applying k-fold Cross Validation

In [12]:
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator=nb, X=X_train, y=y_train, cv=10)
print('Mean accuracy: {:.2f}%'.format(accuracies.mean() * 100))
print('Standard deviation: {:.2f}%'.format(accuracies.std() * 100))

Mean accuracy: 96.71%
Standard deviation: 2.91%


### Decision Tree

#### Performing Grid Search to find the best hyperparameter

In [13]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix
from sklearn.model_selection import GridSearchCV

parameters = [{'criterion': ['gini', 'entropy']}]
grid_search = GridSearchCV(estimator=DecisionTreeClassifier(), param_grid=parameters, scoring='accuracy', n_jobs=-1, cv=10)
grid_search.fit(X_train, y_train)

print('Best mean accuracy: {:.2f}%'.format(grid_search.best_score_ * 100))
print('Standard deviation: {:.2f}%'.format(grid_search.cv_results_['std_test_score'][grid_search.best_index_] * 100))
print(f'Best parameter: {grid_search.best_params_}')

Best mean accuracy: 94.70%
Standard deviation: 2.88%
Best parameter: {'criterion': 'entropy'}


In [14]:
tree = DecisionTreeClassifier(criterion='entropy')
tree.fit(X_train, y_train)
tree_y_pred = tree.predict(X_test)

cm = confusion_matrix(y_test, tree_y_pred)

print('Accuracy score: {:.2f}%'.format(accuracy_score(y_test, tree_y_pred) * 100))
print('Recall score: {:.2f}%'.format(recall_score(y_test, tree_y_pred, pos_label=4) * 100))
print(f'Confusion matrix: \n{cm}')

Accuracy score: 94.89%
Recall score: 92.45%
Confusion matrix: 
[[81  3]
 [ 4 49]]


### Random Forest

#### Performing Grid Search to find the best hyperparameters

In [15]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix
from sklearn.model_selection import GridSearchCV

parameters = [{'n_estimators': [5, 10, 20, 25, 50, 75, 100, 125, 150, 175, 200], 'criterion': ['gini', 'entropy']}]
grid_search = GridSearchCV(estimator=RandomForestClassifier(), param_grid=parameters, scoring='accuracy', n_jobs=-1, cv=10)
grid_search.fit(X_train, y_train)

print('Best mean accuracy: {:.2f}%'.format(grid_search.best_score_ * 100))
print('Standard deviation: {:.2f}%'.format(grid_search.cv_results_['std_test_score'][grid_search.best_index_] * 100))
print(f'Best parameter: {grid_search.best_params_}')

Best mean accuracy: 97.08%
Standard deviation: 2.19%
Best parameter: {'criterion': 'gini', 'n_estimators': 10}


In [16]:
forest = RandomForestClassifier(criterion='gini', n_estimators=10)
forest.fit(X_train, y_train)
forest_y_pred = forest.predict(X_test)

cm = confusion_matrix(y_test, forest_y_pred)

print('Accuracy score: {:.2f}%'.format(accuracy_score(y_test, forest_y_pred) * 100))
print('Recall score: {:.2f}%'.format(recall_score(y_test, forest_y_pred, pos_label=4) * 100))
print(f'Confusion matrix: \n{cm}')

Accuracy score: 95.62%
Recall score: 94.34%
Confusion matrix: 
[[81  3]
 [ 3 50]]


### XGBoost

#### Performing Grid Search to find the best hyperparameters

In [17]:
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

parameters = [{'n_estimators': [5, 10, 20, 25, 50, 75, 100, 125, 150, 175, 200], 
               'learning_rate': [0.01, 0.03, 0.05, 0.1, 0.3, 0.5],
               'booster': ['gbtree', 'gblinear', 'dart']}]
grid_search = GridSearchCV(estimator=XGBClassifier(), param_grid=parameters, scoring='accuracy', n_jobs=-1, cv=10)
grid_search.fit(X_train, y_train)

print('Best mean accuracy: {:.2f}%'.format(grid_search.best_score_ * 100))
print('Standard deviation: {:.2f}%'.format(grid_search.cv_results_['std_test_score'][grid_search.best_index_] * 100))
print(f'Best parameter: {grid_search.best_params_}')

Best mean accuracy: 96.35%
Standard deviation: 2.93%
Best parameter: {'booster': 'gbtree', 'learning_rate': 0.3, 'n_estimators': 20}
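
Depending on the installed XGBoost version, `XGBClassifier` may reject class labels that do not start at 0 (the labels here are 2 and 4). If that happens, one possible workaround is to encode the labels first. The sketch below assumes scikit-learn's `LabelEncoder` and a small hand-written label array; `y_demo` is illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hand-written example labels using the dataset's two class codes.
y_demo = np.array([2, 2, 4, 2, 4, 4])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_demo)  # maps 2 -> 0 and 4 -> 1
print(y_encoded)                           # [0 0 1 0 1 1]

# Predictions made on encoded labels can be mapped back to 2 / 4.
y_restored = encoder.inverse_transform(y_encoded)
print(y_restored)                          # [2 2 4 2 4 4]
```

When a model is trained on the encoded labels, `recall_score` needs `pos_label=1` instead of `pos_label=4`, unless the predictions are first mapped back with `inverse_transform`.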


In [18]:
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

xgb_classifier = XGBClassifier(booster='gbtree', learning_rate=0.3, n_estimators=20)
xgb_classifier.fit(X_train, y_train)
xgb_y_pred = xgb_classifier.predict(X_test)

cm = confusion_matrix(y_test, xgb_y_pred)

print('Accuracy score: {:.2f}%'.format(accuracy_score(y_test, xgb_y_pred) * 100))
print('Recall score: {:.2f}%'.format(recall_score(y_test, xgb_y_pred, pos_label=4) * 100))
print(f'Confusion matrix: \n{cm}')

Accuracy score: 97.08%
Recall score: 98.11%
Confusion matrix: 
[[81  3]
 [ 1 52]]


The two best models for this problem seem to be:

- **Logistic Regression**
    - Test accuracy: 97.81%
    - Test recall: 98.11%
    - Mean 10-fold cross-validation accuracy on the training set: 96.53%
    - Standard deviation: 2.86%

- **Support Vector Machine**
    - Test accuracy: 97.81%
    - Test recall: 98.11%
    - Mean 10-fold cross-validation accuracy on the training set: 96.90%
    - Standard deviation: 2.94%

XGBoost is worth mentioning as well, as it achieved:
- Test accuracy: 97.08%
- Test recall: 98.11%
- Mean 10-fold cross-validation accuracy on the training set: 96.35%
- Standard deviation: 2.93%
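
The gaps between these three models are small compared with the spread across folds. The sketch below puts the cross-validation figures reported above side by side. The one-standard-deviation comparison is a rough heuristic chosen for illustration, not a formal significance test, and the names `cv_results` and `within_noise` are illustrative.

```python
# Cross-validation results reported above: (mean accuracy %, std %).
cv_results = {
    'Logistic Regression': (96.53, 2.86),
    'Support Vector Machine': (96.90, 2.94),
    'XGBoost': (96.35, 2.93),
}

# Model with the highest mean cross-validation accuracy.
best_name = max(cv_results, key=lambda name: cv_results[name][0])
best_mean, best_std = cv_results[best_name]

within_noise = {}
for name, (mean, std) in cv_results.items():
    gap = best_mean - mean
    # A gap below one standard deviation is weak evidence of a real difference.
    within_noise[name] = gap < best_std
    print(f'{name}: gap to best = {gap:.2f} pp, within one std: {within_noise[name]}')
```

All three gaps are well under one standard deviation, so the ranking could plausibly change with a different train/test split.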

