# <font face='Comic Sans MF' color='black' size=6><b><center>INTRODUCTION</center></b></font>
<br>

**PROBLEM END GOAL:**
- The goal of the problem is to predict the target variable, called 'Dementia'.
- It takes three values, so let's treat this as a multi-class classification problem.

<br>

**CONTENTS OF THE DATASET:**

This set consists of a longitudinal collection of 150 subjects aged 60 to 96. Each subject was scanned on two or more visits, separated by at least one year for a total of 373 imaging sessions. For each subject, 3 or 4 individual T1-weighted MRI scans obtained in single scan sessions are included. The subjects are all right-handed and include both men and women. 72 of the subjects were characterized as nondemented throughout the study. 64 of the included subjects were characterized as demented at the time of their initial visits and remained so for subsequent scans, including 51 individuals with mild to moderate Alzheimer’s disease. Another 14 subjects were characterized as nondemented at the time of their initial visit and were subsequently characterized as demented at a later visit.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# <font face='Comic Sans MF' color='orchid' size = 6 ><b><center>1. DATA MANIPULATION AND DATA CLEANSING</center></font>

In [None]:
df = pd.read_csv ('/kaggle/input/dementia-prediction-dataset/dementia_dataset.csv')
df.rename ( columns = { 'Group': 'Dementia'}, inplace=True ) # rename
df.rename ( columns = { 'M/F': 'Sex'}, inplace=True ) # rename
df.drop(columns=['Subject ID', 'MRI ID'], inplace=True) # drop
df.head(10).style.set_properties(**{'background-color':'black',
                                     'color': 'orchid'})

In [None]:
print ( "Let's see the values of the `Hand` column:", df.Hand.unique(), '\n' )
print ( 'The only value in this column is R, so we can drop it.' )
df.drop(columns=['Hand'], inplace=True)

In [None]:
# convert all column names to title case (e.g. 'MMSE' -> 'Mmse')
df.columns = [name.title() for name in df.columns]

In [None]:
df.describe().T.style.background_gradient( cmap='tab10')

<font face='hacker' size=3>
A few things stand out in the summary statistics:

- `Age` $\to$ the subjects that make up the dataset have a minimum age of 60 and a maximum of 98.

- `Educ` $\to$ the average level of education is 14.6 years.

In [None]:
df.info()

<font face='Comic Sans MF' color='orchid' size = 4 ><b>Missing value imputation </font>

In [None]:
df.isna().sum()

<font face='hacker' size=3>

There are missing values in two variables: `Ses` and `Mmse`. Let's impute them now.

In [None]:
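# assign the result back instead of calling fillna(inplace=True) on a column,
# which may not update df in recent pandas versions
df['Ses'] = df['Ses'].fillna ( df['Ses'].mode() [0] ) # impute mode
df['Mmse'] = df['Mmse'].fillna ( df['Mmse'].mean() ) # impute mean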
df.isna().sum()
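
To make the two strategies concrete, here is a minimal sketch on a made-up frame: the mode fills the ordinal `Ses` score and the mean fills the continuous `Mmse` score. The values are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "Ses": [2.0, 2.0, 3.0, np.nan, 1.0],
    "Mmse": [30.0, 28.0, np.nan, 26.0, 24.0],
})

# Mode for the ordinal score, mean for the continuous one
toy["Ses"] = toy["Ses"].fillna(toy["Ses"].mode()[0])
toy["Mmse"] = toy["Mmse"].fillna(toy["Mmse"].mean())

print(toy)
```

The mode keeps `Ses` on its original integer scale, while the mean can produce a value that never occurs in the raw `Mmse` scores.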

# <font face='Comic Sans MF' color='lightseagreen' size = 6 ><b><center>2. EXPLORATORY DATA ANALYSIS</center></font>

<font face='Comic Sans MF' color='lightseagreen' size = 4 ><b>Analysis of the target variable </font>

In [None]:
custom_params = {"axes.spines.right": False, "axes.spines.top": False}
sns.set_theme(style="ticks", rc=custom_params, palette='pastel')
fig = plt.figure ( figsize= (8,6) )
ax=sns.countplot(data=df, x='Dementia')
for i in ax.patches:
    ax.text(x=i.get_x()+i.get_width()/2, y=i.get_height()/7, s=f"{np.round(i.get_height()/len(df)*100,0)}%", ha='center', size=20, weight='bold', rotation=360, color='white')
plt.title("Dementia Feature", size=20, weight='bold')
plt.ylabel ( 'Count' )
plt.show()

<font face='hacker' size=3>

The classes in the target variable are **unbalanced**. This has to be addressed later to get a good classifier.
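
The imbalance can also be put into numbers rather than read off the plot. The sketch below uses a made-up label series (not the real counts) and computes each class share plus the ratio between the largest and smallest class.

```python
import pandas as pd

# Hypothetical labels, only to illustrate the check
labels = pd.Series(["Nondemented"] * 5 + ["Demented"] * 4 + ["Converted"] * 1)

# Share of each class in the data
shares = labels.value_counts(normalize=True)
print(shares)

# Ratio between the largest and the smallest class
imbalance_ratio = shares.max() / shares.min()
print(round(imbalance_ratio, 2))
```

On the real data the same two lines would be applied to `df['Dementia']`.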

In [None]:
plt.figure ( figsize= (8,6) )
sns.countplot ( data=df, x='Dementia', hue='Sex' ) # pass the column by name; recent seaborn treats the first positional argument as data
plt.title("Dementia Feature by Sex", size=20, weight='bold')
plt.ylabel ( 'Count' )
plt.xlabel ('', size = 16 )
plt.show()

<font face='Comic Sans MF' color='lightseagreen' size = 4 ><b>Analysis of the other variables </font>

In [None]:
plt.figure ( figsize= (8,6) )
sns.histplot( data=df, x="Age", binwidth=5, kde=True, hue="Dementia" )
plt.xlabel ('Age_bins' )
plt.ylim(0,50)
plt.show()

<font face='hacker' size=3>

The age distribution of converted subjects has a higher average than that of non-demented and demented subjects.

In [None]:
plt.figure ( figsize= (8,6) )
sns.boxplot( data=df,y="Age", x='Sex', showfliers = False )
plt.show ()

In [None]:
plt.figure ( figsize= (8,6) )
sns.countplot ( data=df, x='Educ' ) # pass the column by name; recent seaborn treats the first positional argument as data
plt.ylabel ('Count')
plt.show ()

<font face='hacker' size=3>


Most of the subjects have 12, 16 or 18 years of education.

In [None]:
plt.figure ( figsize= (8,6) )
sns.boxplot(x="Dementia", y="Mmse", data=df, showfliers = False ) # without outliers
plt.ylim(0,35)
plt.show ()

<font face='hacker' size=3>

From the boxplot it is clear that:

- subjects without dementia have `Mmse` values of 30 or nearly 30.

- subjects classified as 'Converted' have values that tend to be slightly lower than 30.

- subjects suffering from dementia generally have lower `Mmse` values, typically around 26.

In [None]:
plt.figure ( figsize= (8,6) )
sns.scatterplot(data=df, x="Cdr", y="Mmse", hue="Dementia", alpha=0.7)
plt.show()

<font face='hacker' size=3>

Looking at the scatter plot, it can be seen that for `Mmse` values $<25$, the subjects are almost all affected by dementia.

<font face='hacker' size=3>

In addition, those without dementia all have a `Cdr` value of 0.
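
Both observations can be checked with a cross-tabulation instead of by eye. Here is a minimal sketch on made-up rows (hypothetical values, same column names as above):

```python
import pandas as pd

# Hypothetical rows, only to illustrate the check
toy = pd.DataFrame({
    "Cdr": [0.0, 0.0, 0.5, 0.5, 1.0, 0.0],
    "Mmse": [30, 29, 27, 22, 18, 28],
    "Dementia": ["Nondemented", "Nondemented", "Converted",
                 "Demented", "Demented", "Converted"],
})

# How many subjects of each group fall on each Cdr value
cdr_by_group = pd.crosstab(toy["Cdr"], toy["Dementia"])
print(cdr_by_group)

# Group membership of the sessions with Mmse below 25
low_mmse = toy.loc[toy["Mmse"] < 25, "Dementia"].value_counts()
print(low_mmse)
```

On the real frame, a `Nondemented` column that is zero everywhere except the `Cdr = 0` row would confirm the second remark.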

<font face='Comic Sans MF' color='lightseagreen' size = 4 ><b>Correlation </font>

In [None]:
corr = df.corr ( numeric_only=True ) # `Dementia` and `Sex` are still strings here, so restrict to numeric columns
plt.figure ( figsize= (8,6) )
sns.heatmap(corr, annot=True, fmt=".2f", linewidths=0.7, cbar = True, cmap='RdBu' )
plt.title ( 'Matrix of correlation', size = 16)
plt.show ()

<font face='hacker' size=3>

- `Mr Delay` and `Visit` have a strong positive correlation. If you are using a parametric model, consider dropping one of the two, since they carry largely the same information.

- `Asf` and `Etiv` have a strong negative correlation.
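
Strongly correlated pairs can also be listed programmatically instead of reading them off the heatmap. The sketch below does this on synthetic columns; the names `a`, `b`, `c` and the 0.8 threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": -a + rng.normal(scale=0.1, size=200),  # built to be strongly negatively correlated with a
    "c": rng.normal(size=200),                  # unrelated column
})

corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (column, column) -> correlation and keep the strong ones
pairs = upper.stack()
strong = pairs[pairs.abs() > 0.8]
print(strong)
```

Filtering on the absolute value catches both strong positive and strong negative pairs.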

# <font face='Comic Sans MF' color='gold' size = 6 ><b><center>3. PRE-PROCESSING</center></font>

<font face='Comic Sans MF' color='gold' size = 4 ><b>Management of categorical data</font>

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder ()
df.Sex = le.fit_transform ( df.Sex.values )
print ( 'Sex:\n0 : %s \n1 : %s\n\n' %(le.classes_[0], le.classes_[1]) )
df.Dementia = le.fit_transform ( df.Dementia.values )
print ( 'Dementia:\n0 : %s \n1 : %s \n2 : %s' %(le.classes_[0], le.classes_[1], le.classes_[2]) )

df.Dementia = df.Dementia.astype('category')
df.Sex = df.Sex.astype('category')
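
`LabelEncoder` assigns codes following the sorted order of the labels, which is why the printout above lists the groups alphabetically. As a small sketch on made-up labels, the mapping can be made explicit and inverted like this:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, only to illustrate the encoding
groups = pd.Series(["Nondemented", "Demented", "Converted", "Demented"])

le = LabelEncoder()
codes = le.fit_transform(groups)

# classes_ holds the labels in sorted order; the position is the code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
print(le.inverse_transform(codes))
```

Keeping such a mapping around helps when reading the confusion matrix later, where only the integer codes appear on the axes.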

In [None]:
df.info()

<font face='Comic Sans MF' color='gold' size = 4 ><b>Split in Train and Test set</font>

In [None]:
from sklearn.model_selection import train_test_split

X, y = df.drop ('Dementia', axis=1).values , df.Dementia.values
X_train, X_test, y_train, y_test = train_test_split ( X, y,
                                                     test_size = 0.2,
                                                     random_state = 1,
                                                     stratify = y)

<font face='Comic Sans MF' color='gold' size = 4 ><b>Re-sampling train set</font>

<font face='hacker' size=3>

Balance the distribution of classes in the target variable with an oversampling of minority classes.

In [None]:
from imblearn.over_sampling import SMOTE

print ('Number of observations in the target variable before oversampling of the minority class:', np.bincount (y_train) )

smt = SMOTE ()
X_train, y_train = smt.fit_resample (X_train, y_train)

print ('\nNumber of observations in the target variable after oversampling of the minority class:', np.bincount (y_train) )
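
For intuition, SMOTE builds each synthetic sample by interpolating between a minority-class point and one of its nearest minority-class neighbours. The sketch below shows that single step in plain NumPy. It is a simplification and not the imbalanced-learn implementation; `synth_sample` is a hypothetical helper name.

```python
import numpy as np

def synth_sample(X_min, i, k, rng):
    """Create one synthetic point from minority sample i (simplified SMOTE step)."""
    # Euclidean distances from sample i to every minority sample
    dist = np.linalg.norm(X_min - X_min[i], axis=1)
    # Indices of the k nearest neighbours, skipping the closest entry (the point itself)
    neighbours = np.argsort(dist)[1:k + 1]
    j = rng.choice(neighbours)
    # Random position on the segment between sample i and neighbour j
    lam = rng.uniform(0.0, 1.0)
    return X_min[i] + lam * (X_min[j] - X_min[i])

rng = np.random.default_rng(0)
# Hypothetical minority-class points
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
new_point = synth_sample(X_min, i=0, k=2, rng=rng)
print(new_point)
```

Because new points always lie between existing minority samples, the oversampled classes are filled in rather than simply duplicated.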

<font face='Comic Sans MF' color='gold' size = 4 ><b>Standardization of features</font>

In [None]:
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform ( X_train )
X_test_std = std_scaler.transform ( X_test )

# <font face='Comic Sans MF' color='indianred' size = 6 ><b><center>4. MODELS</center></font>

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

classifiers = [ LogisticRegression(random_state=42), DecisionTreeClassifier(random_state=42), SVC (random_state=42),
       RandomForestClassifier(random_state=42), GradientBoostingClassifier(random_state=42) ]
models = [ 'Logistic Regression', 'Tree', 'Support vector machine', 'RFC', 'Gradient boost' ]

# report the mean 5-fold cross-validation accuracy of each classifier on the training set
for clf, model in zip(classifiers, models):
  cv_score = cross_val_score (clf, X_train_std, y_train, cv=5).mean()
  print ( f'Cross validation score of {model}: {cv_score:.3f} \n' )
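
One caveat about these scores: the scaler (and SMOTE) were fitted on the whole training set before cross-validation, so every validation fold has already influenced the preprocessing and the scores may be somewhat optimistic. A possible alternative, sketched here on synthetic data from `make_classification` (an assumption, not the dementia data), is to put the scaler inside a scikit-learn `Pipeline` so it is refitted on each training fold. imbalanced-learn ships its own `Pipeline` that accepts samplers such as SMOTE in the same way; that variant is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data, only for illustration
X_toy, y_toy = make_classification(n_samples=300, n_features=8, n_informative=4,
                                   n_classes=3, random_state=0)

# The scaler is fitted inside each CV split, on the training part only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

scores = cross_val_score(pipe, X_toy, y_toy, cv=5)
print(scores.mean())
```

The same pipeline object can be passed to `GridSearchCV`, with parameter names prefixed by the step name (for example `clf__C`).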

<font face='Comic Sans MF' color='indianred' size = 4 ><b>Tuning RFC with cross-validation</font>

In [None]:
from sklearn.model_selection import GridSearchCV

rfc = RandomForestClassifier(n_jobs=-1, random_state=42) 

param_grid = { 
    'n_estimators': [500, 700, 900],
    'min_samples_split': [2,4,6,8,10]
}

gs = GridSearchCV ( estimator = rfc,
                   param_grid = param_grid,
                   scoring = 'accuracy',
                   cv = 5,
                   refit = True,
                   n_jobs = -1
                   )

gs = gs.fit ( X_train_std, y_train )

print ( 'Parameter setting that gave the best results on the hold out data:', gs.best_params_ )

print ( 'Mean cross-validated score of the best_estimator: %.3f' %gs.best_score_ )

gs = gs.best_estimator_

<font face='hacker' size=3>

After hyperparameter tuning, the mean cross-validation score stands at 0.948, up from 0.945. 👍

In [None]:
gs.fit ( X_train_std, y_train )
y_pred = gs.predict ( X_test_std )
print ( 'Accuracy train score: %.4f' %gs.score (X_train_std, y_train) )
print ( 'Accuracy test score: %.4f' %accuracy_score ( y_test, y_pred ) )

<font face='hacker' size=3>

The accuracy of the trained model on the test data is 0.933. A good result, but now let's see what kind of mistakes it makes when it fails to classify subjects.
<br>Let's look at the confusion matrix. 👀

<font face='Comic Sans MF' color='indianred' size = 4 ><b>Confusion matrix</font>

In [None]:
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix (  y_test, y_pred )

print ('Number of records in the test dataset: %d\n' %y_test.shape[0])

fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey=False)
#plot 1
sns.heatmap(conf_matrix,ax=axes[0],annot=True, cmap='Blues', cbar=False, fmt='d')
axes[0].set_xlabel('\nPredicted label', size = 14)
axes[0].set_ylabel('True label\n', size = 14)

# plot 2
sns.heatmap(conf_matrix/np.sum(conf_matrix),ax=axes[1], annot=True, 
            fmt='.2%', cmap='Blues', cbar=False)
axes[1].set_xlabel('\nPredicted label', size = 14)
axes[1].set_ylabel('True label\n', size = 14)
axes[1].yaxis.tick_left()
plt.show()

<font face='hacker' size=3>

The test dataset contains 75 records. The confusion matrix shows that:

- All subjects who do not suffer from dementia are correctly classified.

- All subjects suffering from dementia are also correctly classified.

- The model only makes mistakes on the borderline subjects, that is, those who were not initially classified as demented but became so during the data collection. In particular, 2.67% of the test records (2 of 75) are converted subjects labeled as demented from the beginning, and 4% (3 of 75) are converted subjects labeled as having no dementia from the beginning of the survey to the end.
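
Note that the percentages in the right-hand heatmap are shares of the whole test set, because the matrix is divided by its grand total. Dividing each row by its own total gives the per-class recall instead, which shows how hard the smallest class really is. A minimal sketch with a made-up matrix (the counts are hypothetical, not the results above):

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = true label, columns = predicted label)
cm = np.array([
    [6, 2, 2],
    [0, 40, 0],
    [0, 0, 50],
])

# Share of all records: every cell divided by the grand total
share_of_total = cm / cm.sum()

# Per-class recall: diagonal divided by the row totals
recall = np.diag(cm) / cm.sum(axis=1)

print(np.round(share_of_total, 4))
print(np.round(recall, 2))
```

In this toy case the errors are only 4% of all records, yet the first class is recovered just 60% of the time.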

