
# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Mini-Project: Dementia prediction using SVM

## Problem Statement

Prediction of Dementia using an SVM model on brain MRI features

## Learning Objectives

At the end of the mini-project, you will be able to:

* perform data exploration, preprocessing and visualization
* implement SVM Classifier on the data
* explore various parameters of SVM classifier and implement OneVsOne classifier
* calculate the metrics and plot the roc_curve

## Information

**About Dementia**

Dementia is a general term for loss of memory and other mental abilities severe enough to interfere with daily life. It is caused by physical changes in the brain. Alzheimer's is the most common type of dementia, but there are many kinds.

**Brain Imaging via magnetic resonance imaging (MRI) and Machine Learning**

* MRI is used for the evaluation of patients with suspected Alzheimer's disease
* MRIs detect both local and generalized shrinkage of brain tissue
* MRI features predict the rate of decline of AD and may guide therapy in the future
* Using machine learning on MRI features could help to automatically and accurately predict the progress of a patient from mild cognitive impairment to dementia

To understand the basics of the MRI technique, refer to [this page](https://case.edu/med/neurology/NR/MRI%20Basics.htm).

## Dataset

The dataset chosen for this mini-project is [OASIS - Longitudinal brain MRI Dataset](https://www.oasis-brains.org/). This dataset consists of a longitudinal MRI collection of 150 subjects aged 60 to 96. Each subject was scanned on two or more visits, separated by at least one year for a total of 373 imaging sessions. For each subject, 3 or 4 individual T1-weighted MRI scans obtained in single scan sessions are included. The subjects are all right-handed and include both men and women. 72 of the subjects were characterized as nondemented throughout the study. 64 of the included subjects were characterized as demented at the time of their initial visits and remained so for subsequent scans, including 51 individuals with mild to moderate Alzheimer’s disease. Another 14 subjects were characterized as nondemented at the time of their initial visit and were subsequently characterized as demented at a later visit.

**Dataset fields:**

* Subject ID - Subject Identification
* MRI ID - MRI Exam Identification
* Group - Target variable with 3 labels ('Nondemented', 'Demented', 'Converted')
* Visit - Visit order
* MR Delay - MR Delay Time (Contrast)
* M/F - Male or Female
* Hand - Unique value 'R'
* MMSE - Mini-Mental State Examination score (range is from 0 = worst to 30 = best)
* CDR - Clinical Dementia Rating (0 = no dementia, 0.5 = very mild AD, 1 = mild AD, 2 = moderate AD)
* Derived anatomic volumes:
  * eTIV - Estimated total intracranial volume, mm3
  * nWBV - Normalized whole-brain volume, expressed as a percent of all voxels in the atlas-masked image that are labeled as gray or white matter by the automated tissue segmentation process
  * ASF - Atlas scaling factor (unitless). A computed scaling factor that transforms native-space brain and skull to the atlas target (i.e., the determinant of the transform matrix)

To learn more about building a machine learning model to predict dementia using SVM, refer to [this article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7408873/).

## Grading = 10 Points

In [None]:
#@title Download the dataset
!wget -qq https://cdn.iisc.talentsprint.com/CDS/MiniProjects/oasis_longitudinal.csv
print("Data downloaded successfully!")

Data downloaded successfully!


### Import required packages

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import math
# sklearn imports
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multiclass import OneVsOneClassifier

### Load the dataset

In [None]:
# YOUR CODE HERE
df = pd.read_csv('https://cdn.iisc.talentsprint.com/CDS/MiniProjects/oasis_longitudinal.csv')

In [None]:
df.head()

Unnamed: 0,Subject ID,MRI ID,Group,Visit,MR Delay,M/F,Hand,Age,EDUC,SES,MMSE,CDR,eTIV,nWBV,ASF
0,OAS2_0001,OAS2_0001_MR1,Nondemented,1,0,M,R,87,14,2.0,27.0,0.0,1987,0.696,0.883
1,OAS2_0001,OAS2_0001_MR2,Nondemented,2,457,M,R,88,14,2.0,30.0,0.0,2004,0.681,0.876
2,OAS2_0002,OAS2_0002_MR1,Demented,1,0,M,R,75,12,,23.0,0.5,1678,0.736,1.046
3,OAS2_0002,OAS2_0002_MR2,Demented,2,560,M,R,76,12,,28.0,0.5,1738,0.713,1.01
4,OAS2_0002,OAS2_0002_MR3,Demented,3,1895,M,R,80,12,,22.0,0.5,1698,0.701,1.034


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 373 entries, 0 to 372
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Subject ID  373 non-null    object 
 1   MRI ID      373 non-null    object 
 2   Group       373 non-null    object 
 3   Visit       373 non-null    int64  
 4   MR Delay    373 non-null    int64  
 5   M/F         373 non-null    object 
 6   Hand        373 non-null    object 
 7   Age         373 non-null    int64  
 8   EDUC        373 non-null    int64  
 9   SES         354 non-null    float64
 10  MMSE        371 non-null    float64
 11  CDR         373 non-null    float64
 12  eTIV        373 non-null    int64  
 13  nWBV        373 non-null    float64
 14  ASF         373 non-null    float64
dtypes: float64(5), int64(5), object(5)
memory usage: 43.8+ KB


In [None]:
df.describe()

Unnamed: 0,Visit,MR Delay,Age,EDUC,SES,MMSE,CDR,eTIV,nWBV,ASF
count,373.0,373.0,373.0,373.0,354.0,371.0,373.0,373.0,373.0,373.0
mean,1.882038,595.104558,77.013405,14.597855,2.460452,27.342318,0.290885,1488.128686,0.729568,1.195461
std,0.922843,635.485118,7.640957,2.876339,1.134005,3.683244,0.374557,176.139286,0.037135,0.138092
min,1.0,0.0,60.0,6.0,1.0,4.0,0.0,1106.0,0.644,0.876
25%,1.0,0.0,71.0,12.0,2.0,27.0,0.0,1357.0,0.7,1.099
50%,2.0,552.0,77.0,15.0,2.0,29.0,0.0,1470.0,0.729,1.194
75%,2.0,873.0,82.0,16.0,3.0,30.0,0.5,1597.0,0.756,1.293
max,5.0,2639.0,98.0,23.0,5.0,30.0,2.0,2004.0,0.837,1.587


### Pre-processing and Data Engineering

#### Remove unwanted columns

In [None]:
# YOUR CODE HERE
df_clean = df.drop(['MRI ID', 'Hand'], axis = 1)

#### Encode categorical features into numeric

In [None]:
# YOUR CODE HERE
# Map the sex column to integers: male = 1, female = 0
df_clean['M/F'] = df_clean['M/F'].map({'M': 1, 'F': 0})

In [None]:
df_clean['Group'].unique()

array(['Nondemented', 'Demented', 'Converted'], dtype=object)
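
`LabelEncoder` is imported above but not used. As an alternative to a manual mapping, the sketch below shows how it would encode the three labels. It uses a small hypothetical Series rather than the real column. Note that `LabelEncoder` assigns codes in alphabetical order, so the numbering differs from the manual mapping used in the next cell.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the 'Group' column, for illustration only
group = pd.Series(['Nondemented', 'Demented', 'Converted', 'Demented'])

encoder = LabelEncoder()
codes = encoder.fit_transform(group)

# classes_ lists the labels in the order of their integer codes
label_to_code = {label: code for code, label in enumerate(encoder.classes_)}
print(label_to_code)
```

`encoder.inverse_transform` turns the integer codes back into the original strings, which is convenient when labelling plots.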

In [None]:
#df_clean['Group'] = df_clean['Group'].replace(['Converted'], ['Demented'])
# Encode the three target labels as integers: Nondemented = 0, Demented = 1, Converted = 2
df_clean['Group'] = df_clean['Group'].map({'Converted': 2, 'Demented': 1, 'Nondemented': 0})
df_clean.describe()

Unnamed: 0,Group,Visit,MR Delay,M/F,Age,EDUC,SES,MMSE,CDR,eTIV,nWBV,ASF
count,373.0,373.0,373.0,373.0,373.0,373.0,354.0,371.0,373.0,373.0,373.0,373.0
mean,0.589812,1.882038,595.104558,0.428954,77.013405,14.597855,2.460452,27.342318,0.290885,1488.128686,0.729568,1.195461
std,0.664461,0.922843,635.485118,0.495592,7.640957,2.876339,1.134005,3.683244,0.374557,176.139286,0.037135,0.138092
min,0.0,1.0,0.0,0.0,60.0,6.0,1.0,4.0,0.0,1106.0,0.644,0.876
25%,0.0,1.0,0.0,0.0,71.0,12.0,2.0,27.0,0.0,1357.0,0.7,1.099
50%,0.0,2.0,552.0,0.0,77.0,15.0,2.0,29.0,0.0,1470.0,0.729,1.194
75%,1.0,2.0,873.0,1.0,82.0,16.0,3.0,30.0,0.5,1597.0,0.756,1.293
max,2.0,5.0,2639.0,1.0,98.0,23.0,5.0,30.0,2.0,2004.0,0.837,1.587


#### Handle the null values by removing or replacing

In [None]:
# YOUR CODE HERE
df_clean = df_clean.dropna()
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 354 entries, 0 to 372
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Subject ID  354 non-null    object 
 1   Group       354 non-null    int64  
 2   Visit       354 non-null    int64  
 3   MR Delay    354 non-null    int64  
 4   M/F         354 non-null    int64  
 5   Age         354 non-null    int64  
 6   EDUC        354 non-null    int64  
 7   SES         354 non-null    float64
 8   MMSE        354 non-null    float64
 9   CDR         354 non-null    float64
 10  eTIV        354 non-null    int64  
 11  nWBV        354 non-null    float64
 12  ASF         354 non-null    float64
dtypes: float64(5), int64(7), object(1)
memory usage: 38.7+ KB
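
Dropping rows discards 19 of the 373 imaging sessions. The heading also allows replacing nulls instead. A hedged sketch of median imputation on a small hypothetical frame (the column names match the dataset, the values are made up) could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the two columns with missing values
demo = pd.DataFrame({
    'SES': [2.0, np.nan, 3.0, 1.0, np.nan],
    'MMSE': [27.0, 30.0, np.nan, 23.0, 28.0],
})

# Replace each missing value with the median of its own column
demo_filled = demo.fillna(demo.median())
print(demo_filled)
```

Whether imputing or dropping is preferable depends on how much the imputed values would distort the `SES` and `MMSE` distributions. With so few missing rows, either choice is defensible.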


#### Identify the features and target and split them into train and test sets

In [None]:
%pip install mlxtend

In [None]:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

In [None]:
# YOUR CODE HERE
# Features: every column except the subject identifier and the target
x = df_clean.drop(columns=['Subject ID', 'Group'])
# Target as a 1-D Series, the shape scikit-learn estimators expect
y = df_clean['Group']

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 100)

In [None]:
# Run feature selection only after the split, so that x_train and y_train are defined
sfs = SFS(LinearRegression(),
           k_features=10,
           forward=True,
           floating=False,
           scoring = 'r2',
           cv = 0)

#Use SFS to select the top 10 features
sfs.fit(x_train, y_train)
print(sfs)

#Create a dataframe for the SFS results
df_SFS_results = pd.DataFrame(sfs.subsets_).transpose()
print("Features", df_SFS_results)

### EDA & Visualization

#### Plot the distribution of all the variables using a histogram

In [None]:
x.head()

In [None]:
# YOUR CODE HERE
sex = df['M/F'].value_counts()
sex.plot(kind='bar')

In [None]:
df_temp = pd.DataFrame()
df_temp['MR Delay'] = x['MR Delay']
df_temp[df_temp['MR Delay'] != 0] = 1
MRD = df_temp['MR Delay'].value_counts()
MRD.plot(kind='bar')

#### Visualize the frequency of Age

In [None]:
# YOUR CODE HERE
age = x['Age'].value_counts()
age.plot(kind='bar')

#### How many people have Alzheimer's? Visualize with an appropriate plot

The same person visits two or more times, so extract the data for a single visit and plot it.

**Hint**: Visit = 1

In [None]:
# YOUR CODE HERE
x_1_visit = df_clean[df_clean['Visit'] == 1]
AD = x_1_visit['Group'].value_counts()
AD.plot(kind='bar')

In [None]:
non_demented = df_clean[df_clean['Group'] == 0]
demented = df_clean[df_clean['Group'] == 1]
converted = df_clean[df_clean['Group'] == 2]

In [None]:
non_demented.describe()

In [None]:
demented.describe()

In [None]:
converted.describe()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="ticks", color_codes=True)
sns.catplot(x="Visit", y="eTIV", hue="Group", kind="swarm", data=df_clean)

In [None]:
sns.catplot(x="Visit", y="MR Delay", hue="Group", kind="swarm", data=df_clean)

In [None]:
sns.catplot(x="Visit", y="CDR", hue="Group", kind="swarm", data=df_clean)

In [None]:
sns.catplot(x="Visit", y="nWBV", hue="Group", kind="swarm", data=df_clean)

#### Calculate the correlation of features and plot the heatmap

In [None]:
# YOUR CODE HERE
# 'Subject ID' is a string column, so restrict the correlation to numeric columns
corrMatrix = df_clean.corr(numeric_only=True)
print(corrMatrix)

In [None]:
plt.figure(figsize=(16, 9))
heatmap = sns.heatmap(corrMatrix, vmin=-1, vmax=1, annot=True, cmap='coolwarm_r')

### Model training and evaluation

**Hint:** SVM model from sklearn

In [None]:
# YOUR CODE HERE
classifier = SVC(kernel='linear', random_state = 100)
classifier.fit(x_train, y_train)

In [None]:
y_pred = classifier.predict(x_test)
y_pred

#### Support vectors of the model

* Find the samples of the dataset which are the support vectors of the model

In [None]:
# YOUR CODE HERE
print(classifier.support_vectors_[:5, :])

In [None]:
print(classifier.support_vectors_.shape)
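
`support_vectors_` holds the feature values only. To find which training samples they are, `support_` gives their row positions in the training data and `n_support_` gives the count per class. A small sketch on synthetic data (generated with `make_classification`, not the OASIS data) shows how the two relate:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data standing in for x_train / y_train
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

svm_demo = SVC(kernel='linear', random_state=0).fit(X_demo, y_demo)

# Positions of the support vectors within the training data
sv_index = svm_demo.support_
# Indexing the training data with those positions reproduces support_vectors_
same = np.allclose(X_demo[sv_index], svm_demo.support_vectors_)

print(sv_index[:5], svm_demo.n_support_, same)
```

With a pandas training frame such as `x_train`, the equivalent lookup would be `x_train.iloc[classifier.support_]`.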

#### Confusion matrix for multi-class classification

* Predict the test and plot the confusion matrix

In [None]:
# YOUR CODE HERE
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
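# The target has three classes here, so the confusion matrix is 3x3;
# annotate each cell with its count instead of hard-coded 2x2 labels
class_names = ['Nondemented', 'Demented', 'Converted']  # encoded as 0, 1, 2
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
sns.heatmap(cm, annot=True, fmt='d', xticklabels=class_names, yticklabels=class_names)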

#### One VS Rest Classifier

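`OneVsRestClassifier` fits one classifier per class, with that class fitted against all the other classes. In addition to its computational efficiency (only `n_classes` classifiers are needed), one advantage of this approach is its interpretability: since each class is represented by a single classifier, it is possible to gain knowledge about the class by inspecting its corresponding classifier. `OneVsRestClassifier` can also be used for multilabel classification.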

* Fit `OneVsRestClassifier` on the data and find the accuracy

Hint: [OneVsRestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html)

In [None]:
# YOUR CODE HERE
clf = OneVsRestClassifier(SVC(kernel='linear')).fit(x_train, y_train)
y_pred_OvR = clf.predict(x_test)
y_pred_OvR
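
The task also asks for the accuracy. A minimal sketch of the full one-vs-rest workflow, assuming synthetic three-class data from `make_classification` in place of the dementia features, might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic three-class data standing in for the dementia features
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(X_tr, y_tr)
ovr_accuracy = accuracy_score(y_te, ovr.predict(X_te))

# One binary classifier is trained per class
print(len(ovr.estimators_), ovr_accuracy)
```

In the notebook itself, the same metric would be `accuracy_score(y_test, y_pred_OvR)`.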

#### One VS One Classifier

This strategy consists of fitting one classifier per class pair. At prediction time, the class which received the most votes is selected.

* Fit `OneVsOneClassifier` on the data and find the accuracy

Hint: [OneVsOneClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsOneClassifier.html)

In [None]:
# YOUR CODE HERE
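
One possible way to approach this cell is sketched below on synthetic three-class data (an assumption, not the OASIS data). With K classes, one-vs-one trains K(K-1)/2 pairwise classifiers, compared with K for one-vs-rest.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Synthetic three-class data standing in for the dementia features
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X_tr, y_tr)
ovo_accuracy = accuracy_score(y_te, ovo.predict(X_te))

# Three classes give 3 * 2 / 2 = 3 pairwise classifiers
n_pairs = len(ovo.estimators_)
print(n_pairs, ovo_accuracy)
```

Each pairwise classifier sees only the samples of its two classes, so the individual fits are smaller than in one-vs-rest even though there can be more of them.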

#### Make it binary classification

As stated in the dataset description, 14 subjects were characterized as nondemented at the time of their initial visit and were subsequently characterized as demented at a later visit. Change `Converted` label into `Demented`.

**Note:** For two-class classification, encode the labels as numbers so that the predictions can be used to plot the roc_curve.

In [None]:
# YOUR CODE HERE
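
A minimal sketch of the relabelling step, using a hypothetical Series of string labels rather than the real column:

```python
import pandas as pd

# Hypothetical labels; in the notebook this would be df['Group']
group = pd.Series(['Nondemented', 'Demented', 'Converted', 'Nondemented'])

# Fold 'Converted' into 'Demented', then encode Demented as 1 and Nondemented as 0
binary_group = group.replace({'Converted': 'Demented'})
binary_target = binary_group.map({'Nondemented': 0, 'Demented': 1})
print(binary_target.tolist())
```

If the target has already been encoded numerically as in the earlier cell (Converted = 2, Demented = 1), the equivalent step is to replace the value 2 with 1.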

#### Compare the performance and predictions of both multi-class and binary classifications

In [None]:
# YOUR CODE HERE

### Classification report and metrics

#### Confusion matrix

A confusion matrix describes the performance of a classification model (or "classifier") on a set of test data for which the true values are known.

In [None]:
# YOUR CODE HERE
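
As an illustration only, the sketch below computes a confusion matrix and a classification report for a binary linear SVM on synthetic data (`make_classification` is an assumption standing in for the binary dementia target):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data standing in for the binary dementia problem
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

y_hat = SVC(kernel='linear').fit(X_tr, y_tr).predict(X_te)

# Rows are true classes, columns are predicted classes
cm_demo = confusion_matrix(y_te, y_hat)
print(cm_demo)
# Precision, recall and F1 per class
print(classification_report(y_te, y_hat))
```

For a binary problem, `cm_demo.ravel()` returns the counts in the order true negatives, false positives, false negatives, true positives.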

#### Plot the ROC Curve

ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis. This means that the top left corner of the plot is the “ideal” point - a false positive rate of zero, and a true positive rate of one. This is not very realistic, but it does mean that a larger area under the curve (AUC) is usually better.

In [None]:
# YOUR CODE HERE
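
A hedged sketch of an ROC plot for a binary linear SVM follows. It assumes synthetic data and uses `decision_function` as the score, since `SVC` does not produce probabilities unless `probability=True` is set.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data standing in for the binary dementia problem
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

svm_demo = SVC(kernel='linear').fit(X_tr, y_tr)

# Signed distances to the separating hyperplane serve as ranking scores
scores = svm_demo.decision_function(X_te)
fpr, tpr, thresholds = roc_curve(y_te, scores)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f'linear SVM (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', label='chance')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
```

The dashed diagonal marks a classifier that ranks at random. Curves that bow toward the top-left corner correspond to higher AUC.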

### Choice of C for SVM

Experiment with the C values given below and plot the ROC curve for each.

In [None]:
c_val = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
# YOUR CODE HERE
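
One way to structure the experiment is sketched below, again on synthetic data (an assumption). It fits one linear SVM per value of C and records the ROC points and AUC. Each stored `(fpr, tpr)` pair can then be passed to `plt.plot` to draw the curves.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

c_val = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]

# Synthetic two-class data standing in for the binary dementia problem
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

roc_by_c = {}
auc_by_c = {}
for c in c_val:
    model = SVC(kernel='linear', C=c).fit(X_tr, y_tr)
    fpr, tpr, _ = roc_curve(y_te, model.decision_function(X_te))
    roc_by_c[c] = (fpr, tpr)
    auc_by_c[c] = auc(fpr, tpr)

for c, value in auc_by_c.items():
    print(f'C={c}: AUC={value:.3f}')
```

Smaller values of C regularize more strongly (wider margin, more margin violations tolerated), while larger values penalize misclassified training points more heavily.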

### Report Analysis

* Compare the performance of the model with various Kernel parameters.
* Discuss the impact of parameter C and gamma on performance.
* Comment on the computational cost of using one-vs-one and one-vs-rest (one-vs-all) to solve multi-class classification with binary classifiers.
* When is a sample/record in the data called a support vector?
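
As a starting point for the kernel comparison, the sketch below scores the four built-in `SVC` kernels with 5-fold cross-validation. It assumes synthetic data and default `C` and `gamma`, so the numbers say nothing about the dementia dataset itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic two-class data, for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

kernel_scores = {}
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    # Mean 5-fold cross-validated accuracy with default C and gamma
    scores = cross_val_score(SVC(kernel=kernel), X_demo, y_demo, cv=5)
    kernel_scores[kernel] = scores.mean()

print(kernel_scores)
```

The same loop can be extended to a grid over `C` and `gamma` (for example with `GridSearchCV`) to study their joint effect.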