# Alzheimer's disease classification

## Import libraries

In [7]:
import torch
import numpy as np
import pandas as pd
from torch import nn
import random
import os

import warnings
warnings.filterwarnings('ignore')

## Load data

The data for this project can be found in this GitHub repository: https://github.com/14thibea/OASIS-1_dataset.git

In [2]:
data = pd.read_csv('C:\\Users\\DELL\\Alzheimer\'s-disease-classification\\OASIS-1_dataset\\tsv_files\\lab_1\\OASIS_BIDS.tsv', sep='\t')

In [3]:
data.head()

Unnamed: 0,participant_id,session_id,alternative_id_1,sex,education_level,age_bl,cdr,diagnosis_bl,laterality,MMS,cdr_global,diagnosis
0,sub-OASIS10001,ses-M00,OAS1_0001_MR1,F,2.0,74,0.0,CN,R,29.0,0.0,CN
1,sub-OASIS10002,ses-M00,OAS1_0002_MR1,F,4.0,55,0.0,CN,R,29.0,0.0,CN
2,sub-OASIS10003,ses-M00,OAS1_0003_MR1,F,4.0,73,0.5,AD,R,27.0,0.5,AD
3,sub-OASIS10004,ses-M00,OAS1_0004_MR1,M,,28,30.0,CN,R,30.0,30.0,CN
4,sub-OASIS10005,ses-M00,OAS1_0005_MR1,M,,18,30.0,CN,R,30.0,30.0,CN


Two labels exist in this dataset:

* CN (Cognitively Normal) for healthy participants.
* AD (Alzheimer's Disease) for patients affected by Alzheimer's disease.


One crucial step before training a neural network is to check the dataset. Are the classes balanced? Are there biases in the dataset that may differentiate the labels?

Here we will focus on the demographics (age, sex and level of education) and two cognitive scores:

* The **MMS** (Mini Mental State), rated from 0 (no correct answer) to 30 (healthy subject)

* The **CDR** (Clinical Dementia Rating), which is 0 if the participant is non-demented and 0.5, 1, 2 or 3 for very mild, mild, moderate and severe dementia, respectively

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 377 entries, 0 to 376
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   participant_id    377 non-null    object 
 1   session_id        377 non-null    object 
 2   alternative_id_1  377 non-null    object 
 3   sex               377 non-null    object 
 4   education_level   197 non-null    float64
 5   age_bl            377 non-null    int64  
 6   cdr               377 non-null    float64
 7   diagnosis_bl      377 non-null    object 
 8   laterality        377 non-null    object 
 9   MMS               377 non-null    float64
 10  cdr_global        377 non-null    float64
 11  diagnosis         377 non-null    object 
dtypes: float64(4), int64(1), object(7)
memory usage: 35.5+ KB


### Missing values

In [5]:
data.isna().sum()

participant_id        0
session_id            0
alternative_id_1      0
sex                   0
education_level     180
age_bl                0
cdr                   0
diagnosis_bl          0
laterality            0
MMS                   0
cdr_global            0
diagnosis             0
dtype: int64
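
Only `education_level` has missing values (180 of 377 rows). Before deciding how to handle them, it can help to look at the missing fraction per column and to check whether missingness differs between diagnosis groups. The sketch below does this on a small hypothetical frame (`toy` and its values are made up for illustration; the notebook itself would use `data`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real notebook would use `data` instead.
toy = pd.DataFrame({
    'diagnosis': ['CN', 'CN', 'AD', 'CN'],
    'education_level': [2.0, np.nan, 4.0, np.nan],
    'age_bl': [74, 28, 73, 18],
})

# Fraction of missing values in each column.
missing_fraction = toy.isna().mean()

# Fraction of missing education_level values within each diagnosis group.
missing_by_diagnosis = toy.education_level.isna().groupby(toy.diagnosis).mean()

print(missing_fraction)
print(missing_by_diagnosis)
```

If missingness is concentrated in one group, imputing or dropping the column could itself introduce a bias between the labels.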

Let's check the number of instances in each class.

In [6]:
data.diagnosis.value_counts()

CN    304
AD     73
Name: diagnosis, dtype: int64

Our dataset contains 304 cognitively normal participants and 73 patients affected by Alzheimer's disease, so it is imbalanced.
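
The degree of imbalance can be quantified from these counts. The sketch below rebuilds a label column with the same class counts (304 CN, 73 AD) and derives inverse-frequency class weights, one common heuristic for compensating imbalance in a loss function. The `labels` series and the weighting formula are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for data.diagnosis, using the counts reported above.
labels = pd.Series(['CN'] * 304 + ['AD'] * 73, name='diagnosis')

counts = labels.value_counts()
proportions = labels.value_counts(normalize=True)

# Inverse-frequency weights: n_samples / (n_classes * class_count).
# This is one common heuristic, assumed here for illustration.
class_weights = len(labels) / (len(counts) * counts)

print(proportions.round(3))
print(class_weights.round(2))
```

With these counts, roughly 81% of the participants are CN and 19% are AD, so the minority class receives the larger weight.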

Next, we study the characteristics of the control participants and of the patients affected by Alzheimer's disease.

In [7]:
data.cdr.value_counts()

30.0    180
0.0     124
0.5      45
1.0      26
2.0       2
Name: cdr, dtype: int64
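
The value 30.0 is not on the CDR scale described earlier (0, 0.5, 1, 2, 3), so it is worth flagging the 180 rows that carry it before CDR is used in any analysis. A minimal sketch of such a check, on a hypothetical toy frame (`toy`, `VALID_CDR` and the participant IDs are names introduced here for illustration):

```python
import pandas as pd

# Valid CDR levels, as listed in the description of the score.
VALID_CDR = [0.0, 0.5, 1.0, 2.0, 3.0]

# Hypothetical toy data mimicking the kind of values seen in the dataset.
toy = pd.DataFrame({
    'participant_id': ['sub-A', 'sub-B', 'sub-C', 'sub-D'],
    'cdr': [0.0, 0.5, 30.0, 30.0],
})

# Rows whose CDR is not one of the valid levels.
off_scale = toy[~toy.cdr.isin(VALID_CDR)]
print(off_scale)
```

Applied to the real data, the same filter would isolate the rows with `cdr == 30.0` so that they can be inspected separately.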

In [8]:
def characteristics_table(df, merged_df):
    """Creates a DataFrame that summarizes the characteristics of the DataFrame df"""
    diagnoses = np.unique(df.diagnosis.values)
    # Column names must match the keys assigned in the loop below;
    # otherwise pandas appends new columns and leaves the declared ones empty.
    population_df = pd.DataFrame(index=diagnoses,
                                columns=['N', 'age', '%sexF', 'education',
                                         'MMS', 'CDR=0', 'CDR=0.5', 'CDR=1', 'CDR=2'])
    # Index both frames by (participant, session) so they can be aligned
    merged_df = merged_df.set_index(['participant_id', 'session_id'], drop=True)
    df = df.set_index(['participant_id', 'session_id'], drop=True)
    sub_merged_df = merged_df.loc[df.index]
    
    for diagnosis in population_df.index.values:
        diagnosis_df = sub_merged_df[df.diagnosis == diagnosis]
        population_df.loc[diagnosis, 'N'] = len(diagnosis_df)
        # Age (mean ± population standard deviation)
        mean_age = np.mean(diagnosis_df.age_bl)
        std_age = np.std(diagnosis_df.age_bl)
        population_df.loc[diagnosis, 'age'] = '%.1f ± %.1f' % (mean_age, std_age)
        # Sex (percentage of female participants)
        population_df.loc[diagnosis, '%sexF'] = round((len(diagnosis_df[diagnosis_df.sex == 'F']) / len(diagnosis_df)) * 100, 1)
        # Education level (missing values are ignored)
        mean_education_level = np.nanmean(diagnosis_df.education_level)
        std_education_level = np.nanstd(diagnosis_df.education_level)
        population_df.loc[diagnosis, 'education'] = '%.1f ± %.1f' % (mean_education_level, std_education_level)
        # MMS
        mean_MMS = np.mean(diagnosis_df.MMS)
        std_MMS = np.std(diagnosis_df.MMS)
        population_df.loc[diagnosis, 'MMS'] = '%.1f ± %.1f' % (mean_MMS, std_MMS)
        # CDR (number of participants at each level)
        for value in ['0', '0.5', '1', '2']:
            population_df.loc[diagnosis, 'CDR=%s' % value] = len(diagnosis_df[diagnosis_df.cdr_global == float(value)])

    return population_df

population_df = characteristics_table(data, data)
population_df

Unnamed: 0,N,age,%sexF,education,MMS,CDR=0,CDR=0.5,CDR=1,CDR=2
AD,73,77.5 ± 7.4,63.0,2.7 ± 1.3,22.7 ± 3.6,0,45,26,2
CN,304,44.0 ± 23.3,62.2,3.5 ± 1.2,29.7 ± 0.6,124,0,0,0


## Data Preprocessing

Because we have only a few images to train the network in this lab session, the preprocessing is extensive. More specifically, the images underwent:
* Non-linear registration
* Segmentation of grey matter
* Conversion to tensor format (.pt)

The preprocessed images all have the same size (121x145x121).
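
A quick sanity check when loading such files is to confirm that each tensor has the expected size. The sketch below saves a random tensor to a temporary `.pt` file and loads it back with `torch.load`; the random data and the file name `example_image.pt` are hypothetical stand-ins, since the actual file layout of the dataset is not shown here.

```python
import os
import tempfile

import torch

EXPECTED_SHAPE = (121, 145, 121)  # size stated for the preprocessed images

# Hypothetical stand-in for one preprocessed image; real files come from the dataset.
fake_image = torch.rand(EXPECTED_SHAPE)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_image.pt')  # hypothetical file name
    torch.save(fake_image, path)
    image = torch.load(path)

# Fail early if an image does not have the expected size.
assert tuple(image.shape) == EXPECTED_SHAPE
print(image.shape, image.dtype)
```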