# Input Module

This module prepares input data for the Machine Learning module. It manipulates CSV files with basic operations:
- Merge
- Handle NaN values
- Drop rows / columns
- Basic operations on data
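
As a rough illustration of the operations listed above, here is a minimal pandas sketch on toy data. The tables and the column names (`id`, `age`, `sex`, `unused`) are made up for the example and do not come from the project's CSV files.

```python
import numpy as np
import pandas as pd

# Two toy tables sharing a key column (all names here are illustrative)
left = pd.DataFrame({'id': [1, 2, 3], 'age': [54.0, np.nan, 71.0]})
right = pd.DataFrame({'id': [1, 2, 4], 'unused': ['a', 'b', 'c'], 'sex': [0, 1, 1]})

# Merge: keep only the ids present in both tables
merged = pd.merge(left, right, how='inner', on='id')

# Handle NaN values: fill missing ages with the mean of the remaining ages
merged['age'] = merged['age'].fillna(merged['age'].mean())

# Drop a column that is not needed downstream
merged = merged.drop(columns=['unused'])

print(merged)
```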

This module takes CSV files as input, some of which may come from the output of the extraction module. It builds a table with a specific layout, called the *master table*, for further processing with the **MEDprofiles** package. For more information, see the [MEDprofiles documentation](https://github.com/MEDomics-UdeS/MEDprofiles).

At the end of the Input module processing, static CSV files are created using the MEDprofiles package: two CSV files per time point, one for the training set and one for the holdout set.

In order to run this notebook, you must have executed the *Extraction* notebook first.

The following CSV files are required for this notebook:

- In the *csv/original_data* folder:
    - *admissions.csv*
    - *patients.csv*
- In the *csv/extracted_features* folder (generated by the Extraction notebook):
    - *chart_events.csv*
    - *lab_events.csv*
    - *procedure_events.csv*
    - *rad_notes.csv*

In [1]:
# Imports
import numpy as np
import os
import pandas as pd
import pickle
import random

os.chdir('../data/MEDprofiles')
from MEDprofiles.src.back import create_classes_from_master_table, instantiate_data_from_master_table

os.chdir('../../src')
import input
import extraction
from patient_list import PATIENT_LIST

%matplotlib qt

## Get data (from original and extracted_features CSV files)

In [2]:
# Set the working directory
os.chdir('../data/csv')

In [3]:
# Read CSV data from original data
df_admissions = pd.read_csv('original_data/admissions.csv')
df_patients = pd.read_csv('original_data/patients.csv')

In [4]:
# Read CSV data from extracted features
df_chart_events = pd.read_csv('extracted_features/chart_events.csv')
df_lab_events = pd.read_csv('extracted_features/lab_events.csv')
df_procedure_events = pd.read_csv('extracted_features/procedure_events.csv')
df_rad_notes = pd.read_csv('extracted_features/rad_notes.csv')

In [5]:
# This cell is here in order to reduce computation time and doesn't illustrate the Extraction module functionalities

# Filter data according to patient list
df_admissions = extraction.filter_dataframe_by_patient(df_admissions, 'subject_id', PATIENT_LIST)
df_patients = extraction.filter_dataframe_by_patient(df_patients, 'subject_id', PATIENT_LIST)
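
The implementation of `extraction.filter_dataframe_by_patient` is not shown in this notebook. A plausible equivalent, assuming it simply keeps the rows whose ID column matches one of the listed patients, could look like the sketch below; `filter_by_patient` is a hypothetical stand-in, not the project's function.

```python
import pandas as pd

def filter_by_patient(df, id_column, patient_ids):
    """Hypothetical stand-in: keep the rows whose id is in patient_ids."""
    return df[df[id_column].isin(patient_ids)].reset_index(drop=True)

# Toy data: patient 10 appears twice, patient 11 is not in the list
df = pd.DataFrame({'subject_id': [10, 11, 12, 10], 'value': [1, 2, 3, 4]})
filtered = filter_by_patient(df, 'subject_id', [10, 12])
print(filtered)
```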

## Perform basic operations on the data

In [6]:
# Convert categorical columns to numerical
input.convert_categorical_column_to_int(df_admissions, 'insurance')
input.convert_categorical_column_to_int(df_admissions, 'language')
input.convert_categorical_column_to_int(df_admissions, 'marital_status')
input.convert_categorical_column_to_int(df_admissions, 'race')
input.convert_categorical_column_to_int(df_patients, 'gender')
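
How `input.convert_categorical_column_to_int` encodes the categories is not visible here. One common way to do this kind of in-place conversion is `pd.factorize`, sketched below under that assumption; `categorical_to_int` is a hypothetical helper and the sample values are invented.

```python
import pandas as pd

def categorical_to_int(df, column):
    """Hypothetical helper: replace each category by an integer code, in place.

    Codes follow the order of first appearance; missing values become -1.
    Returns the code -> category mapping so the encoding can be reversed.
    """
    codes, categories = pd.factorize(df[column])
    df[column] = codes
    return dict(enumerate(categories))

df = pd.DataFrame({'insurance': ['Medicare', 'Other', 'Medicare', 'Medicaid']})
mapping = categorical_to_int(df, 'insurance')
print(df['insurance'].tolist(), mapping)
```

Keeping the returned mapping around makes it possible to translate the integer codes back to labels when interpreting a model.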

In [7]:
# Build the prediction target for binary classification:
# 1 if a date of death ('dod') is recorded for the patient, 0 otherwise
sr_pred = df_patients.set_index('subject_id')['dod'].notna().astype(int)

In [8]:
# Filter data to get only useful information
df_admissions = df_admissions[['subject_id', 'admittime', 'insurance', 'language', 'marital_status', 'race']]
df_patients = df_patients[['subject_id', 'gender']]

In [9]:
# Rename columns (reassign instead of inplace=True, since both dataframes
# are column subsets and in-place edits can trigger SettingWithCopyWarning)
df_admissions = df_admissions.rename(columns={
    'subject_id': 'PatientID',
    'admittime': 'Date',
    'insurance': 'demographic_insurance',
    'language': 'demographic_language',
    'marital_status': 'demographic_marital_status',
    'race': 'demographic_race',
})
df_patients = df_patients.rename(columns={'subject_id': 'PatientID', 'gender': 'demographic_gender'})

## Create the master table

In [10]:
# Merge all the dataframes
df_master = pd.merge(df_admissions, df_patients, how='inner', on=['PatientID'])
df_master = pd.concat([df_master, df_chart_events, df_lab_events, df_procedure_events, df_rad_notes], ignore_index=True)
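
To see what the concatenation step produces, here is a small sketch on toy data. The column `labevent_sodium_max` is an assumed example of an extracted feature; the actual columns depend on the Extraction notebook output.

```python
import pandas as pd

# Toy demographic rows (one per admission) and toy extracted lab features
demographics = pd.DataFrame({
    'PatientID': [1, 2],
    'Date': ['2020-01-01', '2020-02-01'],
    'demographic_gender': [0, 1],
})
lab = pd.DataFrame({
    'PatientID': [1, 1],
    'Date': ['2020-01-02', '2020-01-03'],
    'labevent_sodium_max': [140.0, 138.0],  # assumed column name
})

# Rows are stacked; columns missing from a source table are filled with NaN
master = pd.concat([demographics, lab], ignore_index=True)
print(master)
```

Each row of the result describes one patient at one date, and only the columns coming from the row's source table are filled.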

In [11]:
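# Normalize types, sort by patient and date, and add an empty Time_point column
df_master['Date'] = pd.to_datetime(df_master['Date'])
df_master['PatientID'] = df_master['PatientID'].astype('int')
df_master.sort_values(by=['PatientID', 'Date'], inplace=True)
df_master.insert(2, 'Time_point', np.nan)  # np.NaN was removed in NumPy 2.0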

# Create a type list to add in the master table
# All the attributes are numbers except the date
type_list = ['num' for _ in df_master.columns]
type_list[1] = 'datetime.date'

# Add types row to the dataframe
df_types = pd.DataFrame([type_list], columns=df_master.columns)
df_master = pd.concat([df_types, df_master]).reset_index(drop=True)

# Save the master table as csv file
df_master.to_csv('master_table.csv', index=False)
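
Because the saved file carries a types row directly under the header, reading it back with pandas needs a little care. The sketch below rebuilds a tiny table with the same layout and reloads the data rows by skipping that row; it uses an in-memory buffer instead of a file, and the toy columns are illustrative.

```python
import io

import numpy as np
import pandas as pd

master = pd.DataFrame({
    'PatientID': [1, 1],
    'Date': pd.to_datetime(['2020-01-01', '2020-01-05']),
    'Time_point': [np.nan, np.nan],
    'demographic_gender': [0, 1],
})

# Types row: everything is numeric except the date column
types = ['num'] * len(master.columns)
types[master.columns.get_loc('Date')] = 'datetime.date'
df_types = pd.DataFrame([types], columns=master.columns)
with_types = pd.concat([df_types, master]).reset_index(drop=True)

# Write to an in-memory CSV, then read the data rows back
buffer = io.StringIO()
with_types.to_csv(buffer, index=False)
buffer.seek(0)

# skiprows=[1] skips the types row (file line 0 is the header)
data = pd.read_csv(buffer, skiprows=[1], parse_dates=['Date'])
print(data.dtypes)
```

Looking up the position of `Date` with `get_loc` avoids relying on a hard-coded column index.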

## Process the data with MEDprofiles

In [12]:
# Set the working directory
os.chdir('../MEDprofiles')

In [13]:
# Create MEDclasses from the master table columns
create_classes_from_master_table.main('../csv/master_table.csv', 'MEDclasses')

In [14]:
# Instantiate the master table data as MEDprofiles
instantiate_data_from_master_table.main('../csv/master_table.csv', 'MEDprofiles_bin')

In [15]:
from MEDclasses import *

# Get cohort
with open('MEDprofiles_bin', 'rb') as data_file:
    MEDprofile_list = pickle.load(data_file)
cohort = MEDcohort(list_MEDprofile=MEDprofile_list[:30])
df_cohort = cohort.profile_list_to_df()

In [16]:
from MEDprofiles.src.semi_front.MEDcohortFigure import MEDcohortFigure

# Set the figure
classes_attributes_dict = {
    'demographic': ([], 'compact'),
    'labevent': (['sodium_max', 'sodium_min', 'sodium_trend'], 'complete'),
    'chartevent': (['heart_rate_max', 'heart_rate_min'], 'compact'),
    'nrad': (['attr_0', 'attr_1', 'attr_2'], 'compact'),
}
figure = MEDcohortFigure(classes_attributes_dict, df_cohort)

<div class="alert alert-block alert-info">
<b>Execute the following cells after manipulating the figure.</b></div>

## Split data into training and holdout set

In [17]:
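# Create holdout set: draw 20% of the patients without replacement
# (random.choices samples with replacement and can return duplicate patients)
patient_ids = sorted(set(figure.cohort_df.index))
holdout_patients = random.sample(patient_ids, k=int(0.2 * len(patient_ids)))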
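
When drawing holdout patients, the choice of sampling function matters: `random.choices` samples with replacement, so the same patient can be drawn more than once, while `random.sample` returns distinct elements. The sketch below compares the two on a toy list of IDs; the ID values and the seed are arbitrary.

```python
import random

patient_ids = [str(i) for i in range(100)]
k = int(0.2 * len(patient_ids))  # 20% of the cohort

random.seed(0)  # arbitrary seed, only for repeatability of this demo
with_replacement = random.choices(patient_ids, k=k)
without_replacement = random.sample(patient_ids, k=k)

# Number of distinct patients actually selected by each method
print(len(set(with_replacement)), len(set(without_replacement)))
```

With `random.sample`, the holdout set always contains exactly `k` distinct patients.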

In [18]:
# Split the cohort into training and holdout sets
cohort_df_train = figure.cohort_df[~figure.cohort_df.index.isin(holdout_patients)].copy()
cohort_df_train.insert(0, 'PatientID', cohort_df_train.index)
cohort_df_holdout = figure.cohort_df[figure.cohort_df.index.isin(holdout_patients)].copy()
cohort_df_holdout.insert(0, 'PatientID', cohort_df_holdout.index)

In [19]:
# Split sr_pred into training and holdout sets
sr_pred_train = sr_pred[~sr_pred.index.isin(list(map(int, holdout_patients)))].copy()
sr_pred_holdout = sr_pred[sr_pred.index.isin(list(map(int, holdout_patients)))].copy()
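
The split by index membership can be checked on a toy series. This sketch assumes, as the `map(int, ...)` call suggests, that the holdout IDs may be stored as strings while the target series is indexed by integers; the values are invented.

```python
import pandas as pd

labels = pd.Series([0, 1, 0, 1, 0], index=[1, 2, 3, 4, 5], name='dod')
holdout_ids = ['2', '5']  # assumed to be strings, hence the int conversion

# Boolean mask marking the patients that belong to the holdout set
mask = labels.index.isin(list(map(int, holdout_ids)))
train, holdout = labels[~mask], labels[mask]

# The two sets should be disjoint and cover every patient
print(train.index.tolist(), holdout.index.tolist())
```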

## Save data as CSV static files

In [20]:
# Set the working directory
os.chdir('../csv')

In [21]:
# Generate the static CSV files for the training and holdout sets
input.generate_static_csv_from_df(cohort_df_train, 'static/train_time_point_')
input.generate_static_csv_from_df(cohort_df_holdout, 'static/holdout_time_point_')
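
The body of `input.generate_static_csv_from_df` is not shown in this notebook. Assuming it writes one file per time point by appending the time point to the given prefix, an equivalent could be sketched as follows; `write_static_csvs`, the `Time_point` column in the cohort dataframe, and the choice to drop that column from each file are all assumptions.

```python
import os
import tempfile

import pandas as pd

def write_static_csvs(df, prefix, time_column='Time_point'):
    """Hypothetical stand-in: write one CSV per distinct time point."""
    paths = []
    for time_point, group in df.groupby(time_column):
        path = f'{prefix}{int(time_point)}.csv'
        group.drop(columns=[time_column]).to_csv(path, index=False)
        paths.append(path)
    return paths

cohort = pd.DataFrame({
    'PatientID': [1, 1, 2],
    'Time_point': [0, 1, 0],
    'value': [1.0, 2.0, 3.0],
})

# Write into a temporary directory so the demo leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    paths = write_static_csvs(cohort, os.path.join(tmp, 'train_time_point_'))
    names = [os.path.basename(p) for p in paths]
    first = pd.read_csv(paths[0])

print(names)
```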

## Save prediction data

In [22]:
# Save the prediction targets for the training and holdout sets
sr_pred_train.to_csv('static/pred_train.csv')
sr_pred_holdout.to_csv('static/pred_holdout.csv')