# DATA PREPROCESSING FOR MORTALITY DETECTION

## Objectives

We are going to train two machine learning models to:
1. Predict whether a person will die from COVID-19, taking into account their pre-existing conditions along with other medical information, and
2. If the prediction is positive, estimate how much time the person has left to live.

These two predictive models will support the medical assessment of new patients admitted to hospitals every day, reducing the workload of medical staff and making it easier to identify the patients who require priority care.

The attributes used to train the models are:
- Age
- Whether the patient is hospitalized or an outpatient
- Pre-existing conditions
- Days elapsed between the onset of symptoms and hospital admission
   
Because we will train two models, one for classification and one for regression, we need two different labels. These labels are:

- Whether the person died or not
- Days elapsed between the person's hospital admission and their death

The two data subsets will share the same features, but one subset will carry the first label while the other keeps the remaining one.
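
As a rough illustration of the second (regression) label, the sketch below derives the number of days between admission and death on a toy DataFrame. It assumes the `9999-99-99` sentinel that `FECHA_DEF` uses for surviving patients; the column name `DIAS_HASTA_DEF` and the sample rows are hypothetical and not part of the original dataset.

```python
import pandas as pd

# Toy rows mimicking the dataset layout (values are made up for illustration)
toy = pd.DataFrame({
    'FECHA_INGRESO': ['2020-12-06', '2020-12-29'],
    'FECHA_DEF': ['9999-99-99', '2021-01-02'],
})

admission = pd.to_datetime(toy['FECHA_INGRESO'], format='%Y-%m-%d')
# The sentinel is not a valid date, so errors='coerce' turns it into NaT
death = pd.to_datetime(toy['FECHA_DEF'], format='%Y-%m-%d', errors='coerce')

# Hypothetical regression label: days from admission to death (NaN for survivors)
toy['DIAS_HASTA_DEF'] = (death - admission).dt.days
print(toy)
```

Survivors end up with a missing value, so a regression subset built this way would presumably keep only the rows where the label is defined.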

In [155]:
import numpy as np
import pandas as pd
import os
from datetime import datetime as dt
from dateutil.relativedelta import relativedelta as reldelta
from sklearn.model_selection import train_test_split

In [20]:
SEED = 42

# 1 - Loading Data

In [21]:
EXEC_MONTH = 'dec22'

In [22]:
FILE_PATH = os.path.join('..', 'datasets', f'clean_covid_dataset_{EXEC_MONTH}.csv')

In [23]:
types = {
    'SECTOR': np.int8,
    'ENTIDAD_UM': np.int8,
    'SEXO': np.int8,
    'PAC_HOSPITALIZADO': np.int8,
    'FECHA_INGRESO': 'string',
    'FECHA_SINTOMAS': 'string',
    'FECHA_DEF': 'string',
    'INTUBADO': np.int8,
    'NEUMONIA': np.int8,
    'EDAD': np.int8,
    'EMBARAZO': np.int8,
    'DIABETES': np.int8,
    'EPOC': np.int8,
    'ASMA': np.int8,
    'INMUSUPR': np.int8,
    'HIPERTENSION':np.int8,
    'CARDIOVASCULAR': np.int8,
    'OBESIDAD': np.int8,
    'RENAL_CRONICA': np.int8,
    'TABAQUISMO': np.int8,
    'UCI': np.int8
}

In [24]:
date_cols = ['FECHA_INGRESO', 'FECHA_SINTOMAS']

In [25]:
# 'latin' encoding is used because the file contains accented characters
df = pd.read_csv(FILE_PATH, encoding='latin', dtype=types, index_col=None)

In [26]:
# Datestrings are parsed as datetime objects
for col in date_cols:
    df[col] = pd.to_datetime(df[col], format='%Y-%m-%d')

## General Info

#### What does the data look like?

In [27]:
df.head(3)

Unnamed: 0,SECTOR,ENTIDAD_UM,SEXO,PAC_HOSPITALIZADO,FECHA_INGRESO,FECHA_SINTOMAS,FECHA_DEF,INTUBADO,NEUMONIA,EDAD,...,DIABETES,EPOC,ASMA,INMUSUPR,HIPERTENSION,CARDIOVASCULAR,OBESIDAD,RENAL_CRONICA,TABAQUISMO,UCI
0,12,18,1,0,2022-02-08,2022-02-05,9999-99-99,0,0,24,...,0,0,1,0,0,0,0,0,0,0
1,12,14,1,0,2022-08-09,2022-08-06,9999-99-99,0,0,57,...,0,0,0,0,0,0,0,0,0,0
2,12,9,0,0,2022-01-13,2022-01-10,9999-99-99,0,0,39,...,0,0,0,0,0,0,0,0,0,0


In [28]:
df.tail(3)

Unnamed: 0,SECTOR,ENTIDAD_UM,SEXO,PAC_HOSPITALIZADO,FECHA_INGRESO,FECHA_SINTOMAS,FECHA_DEF,INTUBADO,NEUMONIA,EDAD,...,DIABETES,EPOC,ASMA,INMUSUPR,HIPERTENSION,CARDIOVASCULAR,OBESIDAD,RENAL_CRONICA,TABAQUISMO,UCI
6686125,9,9,1,0,2020-12-06,2020-12-01,9999-99-99,0,0,23,...,0,0,0,0,0,0,0,0,0,0
6686126,9,9,1,0,2020-12-06,2020-11-30,9999-99-99,0,0,42,...,0,0,0,0,0,0,0,0,0,0
6686127,9,9,1,1,2020-12-29,2020-12-25,2021-01-02,0,0,86,...,0,0,0,0,1,0,0,0,0,0


#### Datatypes and memory used

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6686128 entries, 0 to 6686127
Data columns (total 21 columns):
 #   Column             Dtype         
---  ------             -----         
 0   SECTOR             int8          
 1   ENTIDAD_UM         int8          
 2   SEXO               int8          
 3   PAC_HOSPITALIZADO  int8          
 4   FECHA_INGRESO      datetime64[ns]
 5   FECHA_SINTOMAS     datetime64[ns]
 6   FECHA_DEF          string        
 7   INTUBADO           int8          
 8   NEUMONIA           int8          
 9   EDAD               int8          
 10  EMBARAZO           int8          
 11  DIABETES           int8          
 12  EPOC               int8          
 13  ASMA               int8          
 14  INMUSUPR           int8          
 15  HIPERTENSION       int8          
 16  CARDIOVASCULAR     int8          
 17  OBESIDAD           int8          
 18  RENAL_CRONICA      int8          
 19  TABAQUISMO         int8          
 20  UCI                int8 
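
To illustrate why the compact integer types are declared when loading, the following sketch compares the memory footprint of one column stored as `int64` and as `int8`. The column contents are synthetic; only the dtype comparison matters. Note that `int8` covers -128 to 127, so it fits a field such as `EDAD` only while no value exceeds 127.

```python
import numpy as np
import pandas as pd

# Synthetic binary flag column, similar in spirit to DIABETES or ASMA
flags = pd.Series(np.zeros(1_000, dtype=np.int64))

# index=False excludes the index so that only the column data is measured
bytes_int64 = flags.memory_usage(index=False)
bytes_int8 = flags.astype(np.int8).memory_usage(index=False)

print(bytes_int64, bytes_int8)
```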

### Datetime Operations

* Stratified subsampling using datetime (maybe a split per month)
* ~~Grab rows generated before or after a certain date (DONE)~~
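
One possible way to approach the pending item (stratified subsampling with one stratum per month) is sketched below. This is not implemented in the notebook; it assumes pandas 1.1 or later for `DataFrameGroupBy.sample`, and the toy data and the 50% fraction are arbitrary.

```python
import pandas as pd

# Synthetic admissions: 20 in August and 10 in September 2022
toy = pd.DataFrame({
    'FECHA_INGRESO': pd.to_datetime(['2022-08-15'] * 20 + ['2022-09-15'] * 10),
})

# Group rows by calendar month and keep half of each group
month = toy['FECHA_INGRESO'].dt.to_period('M')
sampled = toy.groupby(month).sample(frac=0.5, random_state=42)

# Rows kept per month, keyed by 'YYYY-MM'
per_month = sampled['FECHA_INGRESO'].dt.strftime('%Y-%m').value_counts().to_dict()
print(per_month)
```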

# 2 - Data Selection

### Removing useless instances

#### Removing old instances

Only recent data will be used, so that the models reflect recent changes in COVID-19 mortality. Specifically, the data from the most recent month will be used as test data, while the 3 months before it will serve as training data. At this point, we select the data from the 4 most recent months, apply the preprocessing tasks to it, and finally split it and save it.

In [30]:
# Keeps the rows of a DataFrame whose dates in a given datetime column fall within a date range
def filter_by_dates_in_range(df, col, start_date=None, end_date=None):
    if start_date is None: start_date = '2010-01-01' # default lower bound
    if end_date is None: end_date = '2030-01-01' # default upper bound
    # start_date is inclusive, end_date is exclusive
    idxs = (df[col] >= start_date) & (df[col] < end_date)
    return df[idxs]
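
The inclusive/exclusive behaviour of the range can be checked on a few boundary dates. The sketch below repeats the function so that it runs on its own; the toy dates are made up.

```python
import pandas as pd

def filter_by_dates_in_range(df, col, start_date=None, end_date=None):
    if start_date is None: start_date = '2010-01-01'
    if end_date is None: end_date = '2030-01-01'
    # start_date is inclusive, end_date is exclusive
    idxs = (df[col] >= start_date) & (df[col] < end_date)
    return df[idxs]

# Two dates just outside the range and two just inside it
toy = pd.DataFrame({
    'FECHA_INGRESO': pd.to_datetime(['2022-07-31', '2022-08-01', '2022-11-30', '2022-12-01']),
})
kept = filter_by_dates_in_range(toy, 'FECHA_INGRESO', '2022-08-01', '2022-12-01')
kept_dates = kept['FECHA_INGRESO'].dt.strftime('%Y-%m-%d').tolist()
print(kept_dates)
```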

In [31]:
# Functions that calculate dates. Used when filtering by dates.
def get_first_date_current_month():
    return dt.now().replace(day=1).strftime('%Y-%m-%d')

def get_first_date_next_month():
    return (dt.now() + reldelta(months = 1)).replace(day=1).strftime('%Y-%m-%d')

def get_first_date_x_months_ago(x):
    return (dt.now() - reldelta(months = x)).replace(day=1).strftime('%Y-%m-%d')
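
Because these helpers call `dt.now()`, re-running the notebook in a different month selects a different date window. A minimal sketch of a variant that takes the reference date as a parameter follows; the function name `first_day_of_month` and the execution date are hypothetical.

```python
from datetime import datetime
from dateutil.relativedelta import relativedelta

def first_day_of_month(reference, months_offset=0):
    """Return the first day of the month `months_offset` months away from `reference`."""
    shifted = reference + relativedelta(months=months_offset)
    return shifted.replace(day=1).strftime('%Y-%m-%d')

reference = datetime(2022, 12, 15)  # hypothetical execution date
start_date = first_day_of_month(reference, -4)  # four months back
end_date = first_day_of_month(reference)        # current month
print(start_date, end_date)
```

Pinning the reference date this way would keep the selected window stable across runs.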

In [32]:
start_date = get_first_date_x_months_ago(4)
end_date = get_first_date_current_month()
print(start_date, end_date)

2022-08-01 2022-12-01


In [33]:
df = filter_by_dates_in_range(df, 'FECHA_INGRESO', start_date, end_date)

In [34]:
df.shape

(254079, 21)

#### Removing instances that are not from the YUCATAN peninsula
State codes: 4 - Campeche, 23 - Quintana Roo, 31 - Yucatán

In [45]:
def filter_df_by_state(df, col, states):
    idxs = df[col].isin(states)
    return df[idxs]
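
The notebook defines this filter but does not show a call to it. A possible usage on toy data is sketched below, assuming the state code lives in `ENTIDAD_UM`; the constant `PENINSULA_STATES` and the sample rows are hypothetical. On the real data, such a call would have to run before `ENTIDAD_UM` is dropped.

```python
import pandas as pd

def filter_df_by_state(df, col, states):
    idxs = df[col].isin(states)
    return df[idxs]

# Codes listed above: 4 (Campeche), 23 (Quintana Roo), 31 (Yucatán)
PENINSULA_STATES = [4, 23, 31]

toy = pd.DataFrame({'ENTIDAD_UM': [9, 31, 4, 14, 23], 'EDAD': [23, 42, 86, 57, 39]})
peninsula = filter_df_by_state(toy, 'ENTIDAD_UM', PENINSULA_STATES)
print(peninsula)
```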

### Removing useless fields

Four columns are of no interest to either model: SECTOR, ENTIDAD_UM, INTUBADO and UCI, so they will be removed.

In [46]:
df.drop(['SECTOR', 'ENTIDAD_UM', 'INTUBADO', 'UCI'], axis='columns', inplace=True)

# 3 - Feature Extraction

Some information can be extracted from existing fields so that the ML model to be trained can perform better. Specifically, we need to calculate two features:
* Days elapsed between the date the patient's first symptoms appeared and the date the patient was admitted to the hospital
* Whether the patient died of COVID or not

### Days passed since first symptoms

In [47]:
def get_days_from_symptoms_to_admission(row):
    # Days elapsed between symptom onset and hospital admission
    return (row['FECHA_INGRESO'] - row['FECHA_SINTOMAS']).days

In [48]:
df['DIAS_SINTOMAS'] = df.apply(get_days_from_symptoms_to_admission, axis='columns').astype('int16')

In [49]:
# Checking min and max values of created feature.
# Negative values should not exist
print(df['DIAS_SINTOMAS'].min(), df['DIAS_SINTOMAS'].max())

0 243
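
The row-wise `apply` above can be slow on millions of rows. As an alternative sketch, the same feature can be computed with a vectorized subtraction of the two datetime columns; the toy rows reuse dates shown in the `df.head(3)` output.

```python
import pandas as pd

toy = pd.DataFrame({
    'FECHA_INGRESO': pd.to_datetime(['2022-02-08', '2022-08-09']),
    'FECHA_SINTOMAS': pd.to_datetime(['2022-02-05', '2022-08-06']),
})

# Subtracting two datetime columns yields timedeltas; .dt.days extracts whole days
toy['DIAS_SINTOMAS'] = (toy['FECHA_INGRESO'] - toy['FECHA_SINTOMAS']).dt.days.astype('int16')
print(toy['DIAS_SINTOMAS'].tolist())
```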


In [50]:
# "FECHA_SINTOMAS" is no longer needed; "FECHA_INGRESO" is kept for the train/test split
df.drop(['FECHA_SINTOMAS'], axis='columns', inplace=True)

### Generating Mortality Prediction Label

In [51]:
def is_deceased(x):
    year = x[:4] # the first 4 characters of the date string are the year
    if year == '9999': return 0 # The year 9999 means the person did not die
    else: return 1

In [52]:
df['FALLECIDO'] = df['FECHA_DEF'].apply(is_deceased).astype('int8')
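
The same label can also be produced without a Python-level function call per row. The sketch below applies the year check to the whole column at once; the two toy values come from the sample rows shown earlier.

```python
import pandas as pd

toy = pd.DataFrame({'FECHA_DEF': ['9999-99-99', '2021-01-02']})

# A year of '9999' marks a surviving patient, so any other year means deceased
toy['FALLECIDO'] = toy['FECHA_DEF'].str[:4].ne('9999').astype('int8')
print(toy)
```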

In [53]:
# "FECHA_DEF" can be safely deleted
df.drop(['FECHA_DEF'], axis='columns', inplace=True)

# 4 - Splitting Data

### Splitting Train and Test Data

In [157]:
split_date = get_first_date_x_months_ago(1)
split_date

'2022-11-01'

In [158]:
train_df = filter_by_dates_in_range(df, 'FECHA_INGRESO', end_date = split_date).copy()
test_df = filter_by_dates_in_range(df, 'FECHA_INGRESO', start_date = split_date).copy()

In [159]:
train_df.shape, test_df.shape

((229705, 17), (24374, 17))

### Subsample instances on each split to prevent class imbalance

In [160]:
def get_age_groups(age):
    if age < 1: return 'BEBE'
    if age < 5: return 'INFANTE_PEQ'
    if age < 12: return 'INFANTE'
    if age < 19: return 'ADOLESCENTE'
    if age < 35: return 'ADULTO JOVEN'
    if age < 65: return 'ADULTO'
    else: return 'ADULTO MAYOR'
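
The same age brackets can be expressed with `pd.cut`, which bins a whole column at once. This is a sketch of an equivalent formulation, not the notebook's method; `AGE_BINS` and `AGE_LABELS` are hypothetical names, and `right=False` makes every bin include its lower bound and exclude its upper bound, matching the `<` comparisons above.

```python
import numpy as np
import pandas as pd

# Same boundaries as get_age_groups: each bin is [lower, upper)
AGE_BINS = [-np.inf, 1, 5, 12, 19, 35, 65, np.inf]
AGE_LABELS = ['BEBE', 'INFANTE_PEQ', 'INFANTE', 'ADOLESCENTE',
              'ADULTO JOVEN', 'ADULTO', 'ADULTO MAYOR']

# One age per bracket, chosen at the upper edge of each bin
ages = pd.Series([0, 4, 11, 18, 34, 64, 65])
age_groups = pd.cut(ages, bins=AGE_BINS, labels=AGE_LABELS, right=False).astype(str)
print(age_groups.tolist())
```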

#### Subsample Train set

In [161]:
train_df['FALLECIDO'].value_counts()

0    228371
1      1334
Name: FALLECIDO, dtype: int64

In [163]:
1334/228371

0.005841372153206843

In [164]:
train_df_only_neg = train_df[train_df['FALLECIDO'] == 0]

In [165]:
age_groups_train = train_df_only_neg['EDAD'].apply(get_age_groups)

In [166]:
train_df_only_neg_reduced, _ = train_test_split(train_df_only_neg, train_size=0.005, stratify=age_groups_train,
                                                shuffle=True, random_state=SEED)

In [167]:
idxs_neg_red  = train_df_only_neg_reduced.index.values
idxs_neg_red.shape

(1141,)

In [168]:
idxs_pos = train_df[train_df['FALLECIDO'] == 1].index.values
idxs_pos.shape

(1334,)

In [169]:
filter_idxs = np.concatenate([idxs_neg_red, idxs_pos], axis=0)
filter_idxs.shape

(2475,)

In [170]:
train_df_bal = train_df.loc[filter_idxs]

In [171]:
train_df_bal['FALLECIDO'].value_counts()

1    1334
0    1141
Name: FALLECIDO, dtype: int64

#### Subsample Test set

In [173]:
test_df['FALLECIDO'].value_counts()

0    24288
1       86
Name: FALLECIDO, dtype: int64

In [174]:
100/24288

0.004117259552042161

In [175]:
test_df_only_neg = test_df[test_df['FALLECIDO'] == 0]

In [176]:
age_groups_test = test_df_only_neg['EDAD'].apply(get_age_groups)

In [177]:
test_df_only_neg_reduced, _ = train_test_split(test_df_only_neg, train_size=0.005, stratify=age_groups_test,
                                               shuffle=True, random_state=SEED)

In [178]:
idxs_neg_red  = test_df_only_neg_reduced.index.values
idxs_neg_red.shape

(121,)

In [179]:
idxs_pos = test_df[test_df['FALLECIDO'] == 1].index.values
idxs_pos.shape

(86,)

In [180]:
filter_idxs = np.concatenate([idxs_neg_red, idxs_pos], axis=0)
filter_idxs.shape

(207,)

In [181]:
test_df_bal = test_df.loc[filter_idxs]

In [182]:
test_df_bal['FALLECIDO'].value_counts()

0    121
1     86
Name: FALLECIDO, dtype: int64
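
The train and test subsections repeat the same steps: isolate the negatives, draw a stratified fraction of them, and join them back with all the positives. A sketch of a helper that wraps those steps is shown below; the name `undersample_negatives`, its parameters, and the synthetic data are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def undersample_negatives(df, strata, frac, label_col='FALLECIDO', seed=42):
    """Keep all positives plus a stratified fraction `frac` of the negatives."""
    neg = df[df[label_col] == 0]
    pos = df[df[label_col] == 1]
    # Stratify only on the strata values that belong to the negative rows
    neg_kept, _ = train_test_split(neg, train_size=frac, stratify=strata.loc[neg.index],
                                   shuffle=True, random_state=seed)
    return pd.concat([neg_kept, pos])

# Synthetic data: 1000 negatives and 20 positives, with two age strata
toy = pd.DataFrame({
    'EDAD': [30] * 500 + [70] * 520,
    'FALLECIDO': [0] * 1000 + [1] * 20,
})
strata = toy['EDAD'].apply(lambda age: 'ADULTO MAYOR' if age >= 65 else 'ADULTO')
balanced = undersample_negatives(toy, strata, frac=0.1)
counts = balanced['FALLECIDO'].value_counts().to_dict()
print(counts)
```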

#### Deletion of "FECHA_INGRESO" field

In [184]:
train_df_bal.drop(['FECHA_INGRESO'], axis='columns', inplace=True)
test_df_bal.drop(['FECHA_INGRESO'], axis='columns', inplace=True)

# 5 - Saving Data

Once the data subsets have been generated, all that remains is to save them to two separate files.

In [186]:
TRAIN_DF_PATH = os.path.join('..', 'datasets', 'mort_pred_train.csv')
TEST_DF_PATH = os.path.join('..', 'datasets', 'mort_pred_test.csv')

In [187]:
train_df_bal.to_csv(TRAIN_DF_PATH, index=False)

In [188]:
test_df_bal.to_csv(TEST_DF_PATH, index=False)
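
CSV files do not store column dtypes, so whoever loads these files later will likely want to declare them again. The sketch below round-trips a toy frame through a temporary file to illustrate this; the file name and columns are hypothetical.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({'EDAD': [24, 57], 'FALLECIDO': [0, 1]}).astype('int8')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'mort_pred_toy.csv')  # hypothetical file name
    toy.to_csv(path, index=False)
    # Without the dtype argument, both columns would be read back as int64
    restored = pd.read_csv(path, dtype={'EDAD': 'int8', 'FALLECIDO': 'int8'})

print(restored.dtypes)
```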