# **Improving Kyphosis Diagnosis with ML/DL: Classifying Patients as Having Kyphosis or Not**

## **Problem Statement**

Kyphosis is a spinal condition that can have significant impacts on patient health. In this notebook, we aim to develop a machine learning model that classifies patients as having kyphosis or not, based on the features recorded for each patient.
<center>

<img src="images/Kyphosis.png" width="500"/>

</center>

## Dataset Overview

*   The kyphosis dataset has 81 rows and 4 columns:

    1.   Kyphosis: the target, whether kyphosis is present or absent
    2.   Age: the patient's age in months
    3.   Number: the number of vertebrae involved
    4.   Start: the number of the first vertebra operated on

## **Importing Libraries and Loading the dataset**

In [152]:
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

%matplotlib inline

In [153]:
df = pd.read_csv('kyphosis.csv')
df.head()

Unnamed: 0,Kyphosis,Age,Number,Start
0,absent,71,3,5
1,absent,158,3,14
2,present,128,4,5
3,absent,2,5,1
4,absent,1,4,15


### Dataset description

In [154]:
df.describe()

Unnamed: 0,Age,Number,Start
count,81.0,81.0,81.0
mean,83.65,4.05,11.49
std,58.1,1.62,4.88
min,1.0,2.0,1.0
25%,26.0,3.0,9.0
50%,87.0,4.0,13.0
75%,130.0,5.0,16.0
max,206.0,10.0,18.0
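
Because Age is recorded in months, the summary statistics are easier to read in years. The short sketch below converts the values copied from the `df.describe()` output above; the dictionary names are illustrative and not part of the original notebook.

```python
# Age summary statistics in months, copied from the df.describe() output above.
age_months = {"mean": 83.65, "min": 1.0, "median": 87.0, "max": 206.0}

# Convert each statistic to years (12 months per year), rounded for display.
age_years = {name: round(value / 12, 2) for name, value in age_months.items()}

for name, years in age_years.items():
    print(f"{name}: {years} years")
```

The patients are children: the median age is a little over 7 years and the oldest patient is about 17.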


In [155]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81 entries, 0 to 80
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Kyphosis  81 non-null     object
 1   Age       81 non-null     int64 
 2   Number    81 non-null     int64 
 3   Start     81 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 2.7+ KB


In [156]:
# replace the target variable with 0 and 1
df['Kyphosis'] = df['Kyphosis'].map({'absent':0, 'present':1})
df.head()


Unnamed: 0,Kyphosis,Age,Number,Start
0,0,71,3,5
1,0,158,3,14
2,1,128,4,5
3,0,2,5,1
4,0,1,4,15


In [157]:
# check for missing values
df.isnull().sum()

Kyphosis    0
Age         0
Number      0
Start       0
dtype: int64

### Data preprocessing

In [158]:
# move the target column Kyphosis to the end
df = df[['Age', 'Number', 'Start', 'Kyphosis']]
df.head()

Unnamed: 0,Age,Number,Start,Kyphosis
0,71,3,5,0
1,158,3,14,0
2,128,4,5,1
3,2,5,1,0
4,1,4,15,0


#### Visualizing Key Features in the dataset

In [159]:
# visualize pairwise relationships between the features, colored by the target (plotly.express is already imported as px)
fig = px.scatter_matrix(df, dimensions=['Age', 'Number', 'Start'], color='Kyphosis')
fig.show()

In [160]:
#calculate the correlation between the features
correlation = df.corr()
fig = px.imshow(correlation, text_auto=True)
fig.show()


In [161]:
fig = px.scatter_3d(df, x='Age', y='Number', z='Start', color='Kyphosis', color_continuous_scale='Viridis')
fig.show()


In [162]:
# boxplots of numerical features for outlier detection using plotly
fig = px.box(df, x='Age', color='Kyphosis')
fig.show()

fig = px.box(df, x='Number', color='Kyphosis')
fig.show()

fig = px.box(df, x='Start', color='Kyphosis')
fig.show()
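
The box plots above flag outliers visually. A common numeric counterpart is the 1.5 × IQR rule, sketched here in plain Python. This is a minimal illustration under assumptions: the helper `iqr_outliers` and the sample values are hypothetical, and the quartile convention used by `statistics.quantiles` may differ slightly from the one plotly applies when drawing the whiskers.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample of vertebra counts (not the real Number column).
sample_number = [2, 3, 3, 3, 4, 4, 4, 5, 5, 10]
print(iqr_outliers(sample_number))
```

With only 81 rows, flagged points are better inspected than dropped automatically.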

In [163]:
from sklearn.model_selection import train_test_split

X = df.drop('Kyphosis', axis=1)
y = df['Kyphosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
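
Passing `stratify=y` keeps the class ratio roughly equal in both splits. The arithmetic below checks this using only counts reported later in this notebook (the test-set supports of 20 and 5, and the 44 samples per class after SMOTE); the variable names are illustrative.

```python
# Class counts implied by the notebook's outputs (0 = absent, 1 = present):
# the test reports list 20 and 5 samples, and SMOTE balances the training
# set to 44 per class, so the training majority class has 44 samples.
n_total = 81
test_counts = {0: 20, 1: 5}
train_majority = 44

n_test = sum(test_counts.values())   # 25 rows in the test split
n_train = n_total - n_test           # 56 rows in the training split
train_counts = {0: train_majority, 1: n_train - train_majority}

def positive_rate(counts):
    """Share of class 1 among all samples in a split."""
    return counts[1] / (counts[0] + counts[1])

full_counts = {c: train_counts[c] + test_counts[c] for c in (0, 1)}
for name, counts in [("full", full_counts), ("train", train_counts), ("test", test_counts)]:
    print(f"{name}: {counts} positive rate = {positive_rate(counts):.3f}")
```

The positive rate stays close to 0.21 in every split, which also shows how imbalanced the data is: only 17 of the 81 patients have kyphosis.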

In [164]:
# use SMOTE to balance the training set only; the test set is left untouched
from imblearn.over_sampling import SMOTE

def balance_dataset(X, y):
    sm = SMOTE(random_state=42)
    X_res, y_res = sm.fit_resample(X, y)
    return X_res, y_res



X_res, y_res = balance_dataset(X_train, y_train)

# check the balance of the dataset
y_res.value_counts()

0    44
1    44
Name: Kyphosis, dtype: int64
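
To give an intuition for where the extra minority samples come from, here is a simplified sketch of the interpolation step at the core of SMOTE. It is not imbalanced-learn's implementation: the helper `smote_like_sample` and the second row are hypothetical, and the real algorithm first picks the neighbor among the k nearest minority samples.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Create one synthetic point on the segment between two minority samples."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (q - p) for p, q in zip(point, neighbor)]

rng = random.Random(42)
# Two minority-class rows as [Age, Number, Start]; the second is hypothetical.
a = [128.0, 4.0, 5.0]
b = [96.0, 6.0, 3.0]
synthetic = smote_like_sample(a, b, rng)
print(synthetic)
```

Note that the synthetic rows generally have fractional values for Age, Number and Start, even though the original columns are integers.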

In [165]:
# compare many baseline classifiers at once with LazyPredict
from lazypredict.Supervised import LazyClassifier

clf = LazyClassifier(verbose=1, ignore_warnings=False, custom_metric=None)
models, predictions = clf.fit(X_res, X_test, y_res, y_test)

models


 31%|███       | 9/29 [00:00<00:00, 35.09it/s]

{'Model': 'AdaBoostClassifier', 'Accuracy': 0.88, 'Balanced Accuracy': 0.8500000000000001, 'ROC AUC': 0.8500000000000001, 'F1 Score': 0.883916083916084, 'Time taken': 0.07935428619384766}
{'Model': 'BaggingClassifier', 'Accuracy': 0.8, 'Balanced Accuracy': 0.725, 'ROC AUC': 0.7250000000000001, 'F1 Score': 0.8065268065268065, 'Time taken': 0.021566152572631836}
{'Model': 'BernoulliNB', 'Accuracy': 0.76, 'Balanced Accuracy': 0.775, 'ROC AUC': 0.775, 'F1 Score': 0.7809523809523811, 'Time taken': 0.010007619857788086}
{'Model': 'CalibratedClassifierCV', 'Accuracy': 0.76, 'Balanced Accuracy': 0.775, 'ROC AUC': 0.775, 'F1 Score': 0.7809523809523811, 'Time taken': 0.025006771087646484}
CategoricalNB model failed to execute
Negative values in data passed to CategoricalNB (input X)
{'Model': 'DecisionTreeClassifier', 'Accuracy': 0.84, 'Balanced Accuracy': 0.75, 'ROC AUC': 0.7500000000000001, 'F1 Score': 0.84, 'Time taken': 0.00801539421081543}
{'Model': 'DummyClassifier', 'Accuracy': 0.8, 'Bala

 69%|██████▉   | 20/29 [00:00<00:00, 62.83it/s]

{'Model': 'GaussianNB', 'Accuracy': 0.84, 'Balanced Accuracy': 0.825, 'ROC AUC': 0.825, 'F1 Score': 0.8491228070175438, 'Time taken': 0.009002685546875}
{'Model': 'KNeighborsClassifier', 'Accuracy': 0.8, 'Balanced Accuracy': 0.8, 'ROC AUC': 0.8, 'F1 Score': 0.8149688149688149, 'Time taken': 0.01002645492553711}
{'Model': 'LabelPropagation', 'Accuracy': 0.8, 'Balanced Accuracy': 0.8, 'ROC AUC': 0.8, 'F1 Score': 0.8149688149688149, 'Time taken': 0.007001638412475586}
{'Model': 'LabelSpreading', 'Accuracy': 0.8, 'Balanced Accuracy': 0.8, 'ROC AUC': 0.8, 'F1 Score': 0.8149688149688149, 'Time taken': 0.00802755355834961}
{'Model': 'LinearDiscriminantAnalysis', 'Accuracy': 0.76, 'Balanced Accuracy': 0.775, 'ROC AUC': 0.775, 'F1 Score': 0.7809523809523811, 'Time taken': 0.010056734085083008}
{'Model': 'LinearSVC', 'Accuracy': 0.76, 'Balanced Accuracy': 0.775, 'ROC AUC': 0.775, 'F1 Score': 0.7809523809523811, 'Time taken': 0.009316682815551758}
{'Model': 'LogisticRegression', 'Accuracy': 0.76,

100%|██████████| 29/29 [00:00<00:00, 49.10it/s]

{'Model': 'RandomForestClassifier', 'Accuracy': 0.84, 'Balanced Accuracy': 0.825, 'ROC AUC': 0.825, 'F1 Score': 0.8491228070175438, 'Time taken': 0.12188243865966797}
{'Model': 'RidgeClassifier', 'Accuracy': 0.76, 'Balanced Accuracy': 0.775, 'ROC AUC': 0.775, 'F1 Score': 0.7809523809523811, 'Time taken': 0.009002923965454102}
{'Model': 'RidgeClassifierCV', 'Accuracy': 0.76, 'Balanced Accuracy': 0.775, 'ROC AUC': 0.775, 'F1 Score': 0.7809523809523811, 'Time taken': 0.010001420974731445}
{'Model': 'SGDClassifier', 'Accuracy': 0.52, 'Balanced Accuracy': 0.625, 'ROC AUC': 0.625, 'F1 Score': 0.56, 'Time taken': 0.008003711700439453}
{'Model': 'SVC', 'Accuracy': 0.84, 'Balanced Accuracy': 0.825, 'ROC AUC': 0.825, 'F1 Score': 0.8491228070175438, 'Time taken': 0.00800180435180664}
StackingClassifier model failed to execute
__init__() missing 1 required positional argument: 'estimators'
{'Model': 'XGBClassifier', 'Accuracy': 0.88, 'Balanced Accuracy': 0.8500000000000001, 'ROC AUC': 0.8500000000




Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
AdaBoostClassifier,0.88,0.85,0.85,0.88,0.08
XGBClassifier,0.88,0.85,0.85,0.88,0.03
Perceptron,0.88,0.85,0.85,0.88,0.01
SVC,0.84,0.82,0.82,0.85,0.01
RandomForestClassifier,0.84,0.82,0.82,0.85,0.12
GaussianNB,0.84,0.82,0.82,0.85,0.01
LabelPropagation,0.8,0.8,0.8,0.81,0.01
QuadraticDiscriminantAnalysis,0.8,0.8,0.8,0.81,0.01
NearestCentroid,0.8,0.8,0.8,0.81,0.01
LabelSpreading,0.8,0.8,0.8,0.81,0.01


In [166]:
predictions

Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
AdaBoostClassifier,0.88,0.85,0.85,0.88,0.08
XGBClassifier,0.88,0.85,0.85,0.88,0.03
Perceptron,0.88,0.85,0.85,0.88,0.01
SVC,0.84,0.82,0.82,0.85,0.01
RandomForestClassifier,0.84,0.82,0.82,0.85,0.12
GaussianNB,0.84,0.82,0.82,0.85,0.01
LabelPropagation,0.8,0.8,0.8,0.81,0.01
QuadraticDiscriminantAnalysis,0.8,0.8,0.8,0.81,0.01
NearestCentroid,0.8,0.8,0.8,0.81,0.01
LabelSpreading,0.8,0.8,0.8,0.81,0.01


In [167]:

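# fit a random forest, one of the stronger candidates in the LazyPredict comparison
# (AdaBoost, XGBoost and Perceptron scored slightly higher on this split)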
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix


rfc = RandomForestClassifier()
rfc.fit(X_res, y_res)

rfc_pred = rfc.predict(X_test)

print(classification_report(y_test, rfc_pred))
print(confusion_matrix(y_test, rfc_pred))

              precision    recall  f1-score   support

           0       0.94      0.85      0.89        20
           1       0.57      0.80      0.67         5

    accuracy                           0.84        25
   macro avg       0.76      0.82      0.78        25
weighted avg       0.87      0.84      0.85        25

[[17  3]
 [ 1  4]]
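
The figures in the classification report follow directly from the confusion matrix. The sketch below recomputes the class 1 metrics by hand; the helper `binary_metrics` is illustrative and assumes scikit-learn's layout, where rows are actual classes and columns are predicted classes.

```python
def binary_metrics(cm):
    """cm is [[TN, FP], [FN, TP]]: rows = actual class, columns = predicted class."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)  # recall of the negative class
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
    }

# Confusion matrix of the untuned random forest printed above.
metrics = binary_metrics([[17, 3], [1, 4]])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The balanced accuracy of 0.825 matches the value LazyPredict reported for the random forest, and it is a fairer summary than plain accuracy when only 5 of the 25 test patients are positive.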


In [168]:
# implement the hyperparameter optimization
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt')
max_features = ['sqrt', 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
                'max_features': max_features,
                'max_depth': max_depth,
                'min_samples_split': min_samples_split,
                'min_samples_leaf': min_samples_leaf,
                'bootstrap': bootstrap}

rfc_random = RandomizedSearchCV(estimator = rfc, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)

rfc_random.fit(X_res, y_res)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
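
The "300 fits" in the log is simply `n_iter × cv`. The arithmetic below also shows how small a share of the full grid a randomized search with 100 iterations explores; the per-parameter counts are read off the `random_grid` definition above.

```python
import math

# Number of candidate values per hyperparameter in random_grid above.
grid_sizes = {
    "n_estimators": 10,      # np.linspace(200, 2000, num=10)
    "max_features": 2,
    "max_depth": 12,         # 11 depths from 10 to 110, plus None
    "min_samples_split": 3,
    "min_samples_leaf": 3,
    "bootstrap": 2,
}
n_iter, cv = 100, 3

total_combinations = math.prod(grid_sizes.values())
total_fits = n_iter * cv
coverage = n_iter / total_combinations

print(total_combinations, total_fits, f"{coverage:.1%}")
```

An exhaustive grid search over the same space would need 4320 × 3 = 12960 fits, so sampling 100 candidates trades completeness for a much shorter run.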


In [169]:
# print the best parameters, the best score and the best estimator of the model after HPO
print("Best parameters : ",rfc_random.best_params_)
print("Best score : ",rfc_random.best_score_)
print("Best estimator",rfc_random.best_estimator_)

Best parameters :  {'n_estimators': 400, 'min_samples_split': 5, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 30, 'bootstrap': True}
Best score :  0.8869731800766284
Best estimator RandomForestClassifier(max_depth=30, min_samples_split=5, n_estimators=400)


In [170]:
rfc_random_pred = rfc_random.predict(X_test)
print(classification_report(y_test, rfc_random_pred))
print(confusion_matrix(y_test, rfc_random_pred))

              precision    recall  f1-score   support

           0       0.95      0.90      0.92        20
           1       0.67      0.80      0.73         5

    accuracy                           0.88        25
   macro avg       0.81      0.85      0.83        25
weighted avg       0.89      0.88      0.88        25

[[18  2]
 [ 1  4]]
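
With only 5 positive patients in the test set, each one shifts recall by 20 points, so the improvement over the untuned model should be read with caution. As a rough illustration, the sketch below computes a Wilson score interval for the class 1 recall of 4 out of 5. The helper `wilson_interval` is illustrative, and the interval is an approximation, not part of the original analysis.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Recall for class 1: 4 of the 5 positive test patients were found.
low, high = wilson_interval(4, 5)
print(f"recall = 0.80, approx. 95% interval: [{low:.2f}, {high:.2f}]")
```

The interval spans roughly 0.38 to 0.96, which is wide enough that differences of one or two test samples between models are not conclusive on their own.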


In [208]:
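import plotly.figure_factory as ff
from sklearn.metrics import confusion_matrix

x_labels = ['Predicted Negative', 'Predicted Positive']
y_labels = ['Actual Negative', 'Actual Positive']
# compute the matrix from the tuned model's predictions instead of hardcoding it;
# the name cm avoids shadowing sklearn's confusion_matrix function
cm = confusion_matrix(y_test, rfc_random_pred).tolist()  # [[18, 2], [1, 4]] in the run above
colorscale = [[0, '#FFFFFF'], [1, '#4B0082']]

fig = ff.create_annotated_heatmap(
    z=cm,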
    x=x_labels,
    y=y_labels,
    showscale=True,
    colorscale=colorscale,
    reversescale=False,
    font_colors=['#000000', '#FFFFFF'],
)
# Set the title and axis labels
fig.update_layout(
    title='Confusion Matrix : Random Forest Classifier',
    xaxis_title='Predicted Label',
    yaxis_title='True Label',
)
fig.show()


In [207]:
# implement xgboost classifier 
from xgboost import XGBClassifier

xgb = XGBClassifier()
xgb.fit(X_res, y_res)

xgb_pred = xgb.predict(X_test)

print(classification_report(y_test, xgb_pred))
#print(confusion_matrix(y_test, xgb_pred))

              precision    recall  f1-score   support

           0       0.95      0.90      0.92        20
           1       0.67      0.80      0.73         5

    accuracy                           0.88        25
   macro avg       0.81      0.85      0.83        25
weighted avg       0.89      0.88      0.88        25



In [209]:
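import plotly.figure_factory as ff
from sklearn.metrics import confusion_matrix

x_labels = ['Predicted Negative', 'Predicted Positive']
y_labels = ['Actual Negative', 'Actual Positive']
# XGBoost confusion matrix, computed from the predictions instead of hardcoded;
# the name cm avoids shadowing sklearn's confusion_matrix function
cm = confusion_matrix(y_test, xgb_pred).tolist()  # [[18, 2], [1, 4]] in the run above
colorscale = [[0, '#FFFFFF'], [1, '#4B0082']]

fig = ff.create_annotated_heatmap(
    z=cm,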
    x=x_labels,
    y=y_labels,
    showscale=True,
    colorscale=colorscale,
    reversescale=False,
    font_colors=['#000000', '#FFFFFF'],
)
# Set the title and axis labels
fig.update_layout(
    title='Confusion Matrix : XGBoost Classifier',
    xaxis_title='Predicted Label',
    yaxis_title='True Label',
)
fig.show()
