# FCUL ALS Data Exploration
---

Exploring the ALS dataset from Faculdade de Ciências da Universidade de Lisboa (FCUL), which contains data from over 1000 patients collected in Portugal.

Amyotrophic lateral sclerosis (ALS), also known in the US as Lou Gehrig’s disease and in the UK as motor neuron disease, involves the degeneration and death of the nerve cells in the brain and spinal cord that control voluntary muscle movement. Death typically occurs within 3 to 5 years of diagnosis; only about 25% of patients survive for more than 5 years.

## Importing the necessary packages

In [1]:
import pandas as pd                        # Pandas to handle the data in dataframes
import re                                  # re to do regex searches in string data
import plotly.graph_objs as go             # Plotly for interactive and pretty plots
from datetime import datetime              # datetime to use proper date and time formats
import os                                  # os handles directory/workspace changes
import numpy as np                         # NumPy to handle numeric and NaN operations
from tqdm import tqdm_notebook             # tqdm to track code execution progress
import numbers                             # numbers to check whether data is numeric
import torch                               # PyTorch to create and apply deep learning models
from torch.utils.data.sampler import SubsetRandomSampler
import data_utils as du                    # Data science and machine learning relevant methods

In [2]:
import plotly.io as pio
pio.templates

Templates configuration
-----------------------
    Default template: 'plotly'
    Available templates:
        ['ggplot2', 'seaborn', 'simple_white', 'plotly',
         'plotly_white', 'plotly_dark', 'presentation', 'xgridoff',
         'ygridoff', 'gridon', 'none']

Use Plotly in dark mode:

In [3]:
pio.templates.default = 'plotly_dark'

In [4]:
# Change to parent directory (presumably "Documents")
os.chdir("../../..")
# Path to the CSV dataset files
data_path = 'Datasets/Thesis/FCUL_ALS/'

## Exploring the preprocessed dataset

### Basic stats

In [5]:
ALS_proc_df = pd.read_csv(f'{data_path}dataWithoutDunnoNIV.csv')
ALS_proc_df.head()

| | REF | Gender | BMI | MND familiar history | Age at onset | Disease duration | El Escorial reviewed criteria | UMN vs LMN | Onset form | C9orf72 | ... | SNIP | PhrenMeanLat | PhrenMeanAmpl | CervicalFlex | CervicalExt | NIV | NIV_DATE | firstDate | lastDate | medianDate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 17.901235 | 2.0 | 55.0 | 5.3 | | | 1 | Unknown | ... | | | | 5.0 | 5.0 | 0 | 04/06/2007 | 07/11/2006 | 15/11/2006 | 07/11/2006 |
| 1 | 2 | 1 | 17.901235 | 2.0 | 55.0 | 5.3 | | | 1 | Unknown | ... | | | | 5.0 | 5.0 | 0 | 04/06/2007 | 04/12/2006 | 04/12/2006 | 04/12/2006 |
| 2 | 2 | 1 | 17.901235 | 2.0 | 55.0 | 5.3 | | | 1 | Unknown | ... | | | | 5.0 | 5.0 | 0 | 04/06/2007 | 09/01/2007 | 24/01/2007 | 09/01/2007 |
| 3 | 2 | 1 | 17.901235 | 2.0 | 55.0 | 5.3 | | | 1 | Unknown | ... | | | | 4.0 | 5.0 | 0 | 04/06/2007 | 11/05/2007 | 17/05/2007 | 11/05/2007 |
| 4 | 2 | 1 | 17.901235 | 2.0 | 55.0 | 5.3 | | | 1 | Unknown | ... | | | | 2.0 | 4.0 | 1 | 04/06/2007 | 03/09/2007 | 03/09/2007 | 03/09/2007 |


In [6]:
ALS_proc_df.dtypes

REF                                int64
Gender                             int64
BMI                              float64
MND familiar history             float64
Age at onset                     float64
Disease duration                 float64
El Escorial reviewed criteria     object
UMN vs LMN                        object
Onset form                        object
C9orf72                           object
ALS-FRS                          float64
ALS-FRS-R                        float64
ALS-FRSb                         float64
ALS-FRSsUL                       float64
ALS-FRSsLL                       float64
ALS-FRSr                         float64
R                                float64
P1                               float64
P2                               float64
P3                               float64
P4                               float64
P5                               float64
P6                               float64
P7                               float64
P8              

In [7]:
ALS_proc_df.nunique()

REF                              1110
Gender                              2
BMI                               612
MND familiar history                3
Age at onset                       71
Disease duration                  598
El Escorial reviewed criteria       7
UMN vs LMN                          3
Onset form                          6
C9orf72                             5
ALS-FRS                            41
ALS-FRS-R                          46
ALS-FRSb                           13
ALS-FRSsUL                         13
ALS-FRSsLL                         13
ALS-FRSr                            6
R                                  13
P1                                  5
P2                                  5
P3                                  5
P4                                  5
P5                                  5
P6                                  5
P7                                  5
P8                                  5
P9                                  5
P10         

In [8]:
du.search_explore.dataframe_missing_values(ALS_proc_df)

| | column_name | percent_missing |
|---|---|---|
| REF | REF | 0.0 |
| firstDate | firstDate | 0.0 |
| NIV_DATE | NIV_DATE | 0.0 |
| NIV | NIV | 0.0 |
| lastDate | lastDate | 0.0 |
| C9orf72 | C9orf72 | 0.0 |
| medianDate | medianDate | 0.0 |
| Gender | Gender | 0.0 |
| Age at onset | Age at onset | 0.032321 |
| Disease duration | Disease duration | 0.129282 |


**Comment:** Many relevant features (timestamps, NIV, age, ALSFRS, etc.) have a zero or low percentage of missing values (below 10%), much better than in the PRO-ACT dataset. However, other interesting features (FVC, VC, etc.) have more than half of their values missing.
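
`du.search_explore.dataframe_missing_values` comes from the `data_utils` package imported above, and its implementation is not shown in this notebook. As a rough sketch of the same idea in plain pandas, assuming the helper simply reports the share of null entries per column, something like the following could be used. The function name `percent_missing` and the toy data are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd


def percent_missing(df):
    """Return the percentage of missing values per column, sorted ascending."""
    # isnull() gives a boolean mask; its column-wise mean is the missing fraction
    pct = df.isnull().mean() * 100
    return (pct.rename('percent_missing')
               .rename_axis('column_name')
               .reset_index()
               .sort_values('percent_missing', ignore_index=True))


# Toy data standing in for ALS_proc_df (values are made up for illustration)
toy_df = pd.DataFrame({
    'REF': [1, 1, 2, 2],
    'NIV': [0, 1, 0, 0],
    'FVC': [np.nan, np.nan, np.nan, 80.0],
})

missing_df = percent_missing(toy_df)
print(missing_df)
```

Filtering the result, for instance with `missing_df[missing_df.percent_missing > 50]`, would list the columns that are missing in more than half of the rows.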

In [None]:
ALS_proc_df.describe().transpose()

In [None]:
ALS_proc_df['El Escorial reviewed criteria'].value_counts()

In [None]:
ALS_proc_df['Onset form'].value_counts()

In [None]:
ALS_proc_df['UMN vs LMN'].value_counts()

In [None]:
ALS_proc_df['C9orf72'].value_counts()

In [None]:
ALS_proc_df['SNIP'].value_counts()

In [None]:
ALS_proc_df['1R'].value_counts()

### Plots

In [None]:
ALS_proc_gender_count = ALS_proc_df.groupby('REF').first().Gender.value_counts().to_frame()
# The counts are in the first column, whose name depends on the pandas version
data = [go.Pie(labels=ALS_proc_gender_count.index, values=ALS_proc_gender_count.iloc[:, 0])]
layout = go.Layout(title='Patients Gender Demographics')
fig = go.Figure(data, layout)
fig.show()

In [None]:
ALS_proc_niv_count = ALS_proc_df.NIV.value_counts().to_frame()
# The counts are in the first column, whose name depends on the pandas version
data = [go.Pie(labels=ALS_proc_niv_count.index, values=ALS_proc_niv_count.iloc[:, 0])]
layout = go.Layout(title='Visits where the patient is using NIV')
fig = go.Figure(data, layout)
fig.show()

In [None]:
data = [go.Histogram(x = ALS_proc_df.NIV)]
layout = go.Layout(title='Number of visits where the patient is using NIV.')
fig = go.Figure(data, layout)
fig.show()

In [None]:
# A patient (identified by REF) counts as an NIV user if NIV is 1 in any visit
ALS_proc_patient_niv_count = ALS_proc_df.groupby('REF').NIV.max().value_counts().to_frame()
data = [go.Pie(labels=ALS_proc_patient_niv_count.index, values=ALS_proc_patient_niv_count.iloc[:, 0])]
layout = go.Layout(title='Patients who eventually use NIV')
fig = go.Figure(data, layout)
fig.show()

In [None]:
data = [go.Scatter(
                    x = ALS_proc_df.FVC,
                    y = ALS_proc_df.NIV,
                    mode = 'markers'
                  )]
layout = go.Layout(
                    title='Relation between NIV use and FVC values',
                    xaxis=dict(title='FVC'),
                    yaxis=dict(title='NIV')
                  )
fig = go.Figure(data, layout)
fig.show()

In [None]:
# Average FVC value when NIV is used:
ALS_proc_df[ALS_proc_df.NIV == 1].FVC.mean()

**Comments:** The average FVC when NIV is 1 is lower than the overall average, but the scatter plot doesn't show a very clear dependence between the two variables.
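
Because NIV is binary, a scatter plot collapses into two horizontal bands, so a per-group summary may be easier to read. The sketch below compares FVC statistics between visits with and without NIV and against the overall mean. It runs on made-up toy values, not on the FCUL data; swapping `toy_df` for `ALS_proc_df` is assumed to work since both have `NIV` and `FVC` columns.

```python
import numpy as np
import pandas as pd

# Toy data standing in for ALS_proc_df (values are made up for illustration)
toy_df = pd.DataFrame({
    'NIV': [0, 0, 0, 1, 1, 1],
    'FVC': [95.0, 88.0, np.nan, 70.0, 62.0, 81.0],
})

# Mean, median and number of non-missing FVC values per NIV group
fvc_by_niv = toy_df.groupby('NIV').FVC.agg(['mean', 'median', 'count'])

# Overall mean, ignoring missing values (the pandas default)
overall_mean = toy_df.FVC.mean()

print(fvc_by_niv)
print(f'Overall mean FVC: {overall_mean:.2f}')
```

The `count` column matters here: with more than half of the FVC values missing, each group mean may rest on far fewer visits than the dataset size suggests.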

In [None]:
data = [go.Scatter(
                    x = ALS_proc_df['Disease duration'],
                    y = ALS_proc_df.NIV,
                    mode = 'markers'
                  )]
layout = go.Layout(
                    title='Relation between NIV use and disease duration',
                    xaxis=dict(title='Disease duration'),
                    yaxis=dict(title='NIV')
                  )
fig = go.Figure(data, layout)
fig.show()

In [None]:
# Average disease duration when NIV is used:
ALS_proc_df[ALS_proc_df.NIV == 1]['Disease duration'].mean()

In [None]:
data = [go.Scatter(
                    x = ALS_proc_df['Age at onset'],
                    y = ALS_proc_df.NIV,
                    mode = 'markers'
                  )]
layout = go.Layout(
                    title='Relation between NIV use and age',
                    xaxis=dict(title='Age at onset'),
                    yaxis=dict(title='NIV')
                  )
fig = go.Figure(data, layout)
fig.show()

In [None]:
# Average age at onset when NIV is used:
ALS_proc_df[ALS_proc_df.NIV == 1]['Age at onset'].mean()

In [None]:
ALS_proc_NIV_3R = ALS_proc_df.groupby(['3R', 'NIV']).REF.count().to_frame().reset_index()
data = [go.Bar(
                    x=ALS_proc_NIV_3R[ALS_proc_NIV_3R.NIV == 0]['3R'],
                    y=ALS_proc_NIV_3R[ALS_proc_NIV_3R.NIV == 0]['REF'],
                    name='Not used'
              ),
        go.Bar(
                    x=ALS_proc_NIV_3R[ALS_proc_NIV_3R.NIV == 1]['3R'],
                    y=ALS_proc_NIV_3R[ALS_proc_NIV_3R.NIV == 1]['REF'],
                    name='Using NIV'
        )]
layout = go.Layout(barmode='group')
fig = go.Figure(data=data, layout=layout)
fig.show()

In [None]:
# Average 3R value when NIV is used:
ALS_proc_df[ALS_proc_df.NIV == 1]['3R'].mean()

**Comments:** Clearly, the use of NIV depends strongly on the respiratory symptoms indicated by 3R, as expected.
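
One way to put a number on this dependence is the share of visits using NIV at each 3R value, rather than raw counts, which are dominated by the most frequent 3R scores. The following is a minimal sketch using `pd.crosstab` with row normalization. The toy values are invented for illustration and do not reflect the actual distribution in the FCUL data.

```python
import pandas as pd

# Toy data standing in for ALS_proc_df (values are made up for illustration)
toy_df = pd.DataFrame({
    '3R':  [4, 4, 4, 4, 3, 3, 2, 2, 1, 0],
    'NIV': [0, 0, 0, 1, 0, 1, 1, 1, 1, 1],
})

# normalize='index' turns each row (one 3R value) into proportions that sum to 1
niv_share = pd.crosstab(toy_df['3R'], toy_df['NIV'], normalize='index')
niv_share.columns = ['Not used', 'Using NIV']

print(niv_share)
```

The resulting `niv_share` table can be passed to `go.Bar` in the same way as the grouped counts above, giving bars that are comparable across 3R values.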

## Exploring the raw dataset

In [None]:
ALS_raw_df = pd.read_excel(f'{data_path}TabelaGeralnew_21012019_sem.xlsx')
ALS_raw_df.head()