# FCUL ALS Pre-Training Exploration
---

Exploring the ALS dataset from Faculdade de Ciências da Universidade de Lisboa (FCUL), which contains data from over 1000 patients collected in Portugal.

Just playing around with the cleaned dataframe before inputting it into the machine learning pipeline.

## Importing the necessary packages

In [None]:
import pandas as pd              # Pandas to handle the data in dataframes
import re                        # re to do regex searches in string data
import plotly                    # Plotly for interactive and pretty plots
import plotly.graph_objs as go
from datetime import datetime    # datetime to use proper date and time formats
import os                        # os handles directory/workspace changes
import numpy as np               # NumPy to handle numeric and NaN operations
from tqdm import tqdm_notebook   # tqdm tracks code execution progress
import numbers                   # numbers lets us check if data is numeric
import torch                     # PyTorch to create and apply deep learning models
from torch.utils.data.sampler import SubsetRandomSampler
import data_utils as du          # Data science and machine learning relevant methods

In [None]:
# Change to parent directory (presumably "Documents")
os.chdir("../../..")
# Path to the CSV dataset files
data_path = 'Datasets/Thesis/FCUL_ALS/'

In [None]:
du.set_pandas_library(lib='pandas')

Allow pandas to show more columns and rows:

In [None]:
pd.set_option('display.max_columns', 3000)
pd.set_option('display.max_rows', 3000)

Set the random seed for reproducibility:

In [None]:
du.set_random_seed(42)
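
The internals of `du.set_random_seed` are not shown in this notebook. As a rough sketch of what such a helper usually does, assuming it seeds Python, NumPy and PyTorch (`seed_everything` is a hypothetical name, not part of `data_utils`):

```python
import random

import numpy as np
import torch


def seed_everything(seed):
    # Hypothetical stand-in for du.set_random_seed; the real helper may do more
    random.seed(seed)          # Python's built-in random number generator
    np.random.seed(seed)       # NumPy's global random number generator
    torch.manual_seed(seed)    # PyTorch's random number generator


# Drawing after re-seeding with the same value reproduces the same numbers
seed_everything(42)
first_draw = torch.rand(3)
seed_everything(42)
second_draw = torch.rand(3)
print(torch.equal(first_draw, second_draw))
```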

## Exploring the cleaned dataset

### Loading the data

In [None]:
ALS_proc_df = pd.read_csv(f'{data_path}cleaned/FCUL_ALS_cleaned.csv')
ALS_proc_df.head()

### Basic stuff

In [None]:
ALS_proc_df.dtypes

In [None]:
ALS_proc_df.nunique()

In [None]:
du.search_explore.dataframe_missing_values(ALS_proc_df)
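
For readers without access to `data_utils`, a similar missing-values summary can be sketched in plain pandas. This is only an approximation of what `dataframe_missing_values` might report; the toy dataframe and its `fvc` column are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset; 'fvc' is a hypothetical feature
toy_df = pd.DataFrame({
    'subject_id': [1, 1, 2, 2],
    'ts': [0, 30, 0, 45],
    'fvc': [80.0, np.nan, np.nan, 60.0],
    'niv_label': [0, 1, 0, np.nan],
})

# Count and percentage of missing values per column, most affected first
missing_df = pd.DataFrame({
    'missing_count': toy_df.isnull().sum(),
    'missing_percent': toy_df.isnull().mean() * 100,
}).sort_values('missing_percent', ascending=False)
print(missing_df)
```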

In [None]:
ALS_proc_df.describe().transpose()

### Label analysis

In [None]:
ALS_proc_df.niv_label.value_counts()

How many subjects always have the same label in their time series:

In [None]:
const_label_subj = list()
for subject in ALS_proc_df.subject_id.unique():
    subject_data = ALS_proc_df[ALS_proc_df.subject_id == subject]
    if subject_data.niv_label.min() == subject_data.niv_label.max():
        const_label_subj.append(subject)
const_label_subj

In [None]:
len(const_label_subj)

In [None]:
percent_const_label_subj = (len(const_label_subj) / ALS_proc_df.subject_id.nunique()) * 100
print(f'{percent_const_label_subj}%')
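
The loop above can probably be replaced by a single `groupby`, which tends to be faster on larger dataframes. A minimal sketch on toy data (the values are made up; only the column names `subject_id` and `niv_label` come from the dataset):

```python
import pandas as pd

# Toy data: subjects 1 and 3 keep the same label, subject 2 changes label
toy_df = pd.DataFrame({
    'subject_id': [1, 1, 2, 2, 3],
    'niv_label':  [0, 0, 0, 1, 1],
})

# Number of distinct (non-missing) labels seen for each subject
n_labels_per_subj = toy_df.groupby('subject_id').niv_label.nunique()

# Subjects with exactly one distinct label have a constant label
const_subjects = n_labels_per_subj[n_labels_per_subj == 1].index.tolist()
percent_const = len(const_subjects) / toy_df.subject_id.nunique() * 100
print(const_subjects, f'{percent_const:.1f}%')
```

Like the `min() == max()` check, `nunique()` ignores missing labels, so both approaches should flag the same subjects.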

**Comment:** It's a real bummer that over 60% of the subjects have a constant label value, but it is what it is. I think I would lose more by removing these subjects' data, since they still give the model an idea of what a patient who needs NIV looks like, and vice versa.

### Time / sampling variation

In [None]:
ALS_proc_df['delta_ts'] = ALS_proc_df.groupby('subject_id').ts.diff()
ALS_proc_df.head()

In [None]:
ALS_proc_df.delta_ts.describe()
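
To make the `groupby(...).diff()` step concrete, here is a small sketch on made-up timestamps. The difference is taken within each subject, so every subject's first row gets `NaN` instead of a gap measured against the previous subject:

```python
import pandas as pd

# Toy time series for two subjects; ts is in whatever unit the dataset uses
toy_df = pd.DataFrame({
    'subject_id': [1, 1, 1, 2, 2],
    'ts':         [0, 30, 90, 0, 45],
})

# Time elapsed since the same subject's previous sample
toy_df['delta_ts'] = toy_df.groupby('subject_id').ts.diff()
print(toy_df)
```

One consequence is that `delta_ts.describe()` only summarises the gaps between consecutive samples, since the `NaN` of each subject's first row is skipped.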

## Random exploratory stuff

In [None]:
labels = torch.Tensor([0, 0, 0, 1, 1, 1])
pred = torch.Tensor([1, 0, 0, 0, 1, 1])
correct_pred = pred == labels
correct_pred

In [None]:
torch.masked_select(pred, labels.bool())   # recent PyTorch versions expect a boolean mask

In [None]:
true_pos = int(sum(torch.masked_select(pred, labels.bool())))
true_pos

In [None]:
false_neg = int(sum(torch.masked_select(pred == 0, labels.bool())))
false_neg

In [None]:
true_neg = int(sum(torch.masked_select(pred == 0, (labels == 0))))
true_neg

In [None]:
false_pos = int(sum(torch.masked_select(pred, (labels == 0))))
false_pos
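
The four counts above are the cells of a confusion matrix, so precision, recall and F1 follow directly from them. A minimal sketch using boolean masks on the same toy tensors (it assumes at least one positive prediction and one positive label; otherwise the divisions would fail):

```python
import torch

labels = torch.tensor([0, 0, 0, 1, 1, 1])
pred = torch.tensor([1, 0, 0, 0, 1, 1])

# Confusion matrix cells, counted with element-wise boolean AND
true_pos = int(((pred == 1) & (labels == 1)).sum())
false_pos = int(((pred == 1) & (labels == 0)).sum())
false_neg = int(((pred == 0) & (labels == 1)).sum())
true_neg = int(((pred == 0) & (labels == 0)).sum())

# Standard definitions of the metrics
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
f1 = 2 * precision * recall / (precision + recall)
print(true_pos, false_pos, false_neg, true_neg)
print(precision, recall, f1)
```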

In [None]:
any(metric in ['a', 'b', 'c'] for metric in ['precision', 'recall', 'F1'])

In [None]:
x = 1

In [None]:
'x' in locals()