# Index
<ol>
    <li><a href="#setup_and_data_download">Setup and data download</a>
    <li><a href="#training_data_visualization">Training data visualization</a>
    <li><a href="#data_cleaning_and_feature_engineering">Feature engineering</a>
    <li><a href="#model_validation">Data preprocessing</a>
    <li><a href="#model_training">Model training</a>
    <li><a href="#model_predictions">Model predictions</a>
</ol>

<br>
<br>
<a id="setup_and_data_download"> </a>
# 1. Setup and data download

<br>
## 1.1 Libraries setup

### Import functionality libraries

In [1]:
from fastai.imports import *
from fastai.transforms import *
import fastai.conv_learner
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *
from fastai.structured import *
from fastai.column_data import *

In [145]:
import os # Create directories, list files
import zipfile # Extract compressed files
import numpy as np # Linear algebra, sorting and selecting
import pandas as pd # Dataframes and csv I/O
import matplotlib.pyplot as plt # Plotting histograms
from collections import Counter # Class for counting purposes
from sklearn.model_selection import KFold
from sklearn.metrics import (classification_report, precision_recall_curve, precision_recall_fscore_support,
                             fbeta_score, roc_auc_score, roc_curve, auc) # Useful metrics for single label classification
import time # Measuring elapsed time
import itertools as it
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier

<br>
## 1.2 Environment setup

### Environment

Overview of the expected directory layout:

```
./input
│      
│
└─── train
│   │   train.csv
│
└─── test
│   │   test.csv
│   
└─── submission
    │   submission.csv
```

### Windows or Unix
Set the separators depending on the OS

In [3]:
OS = "Linux"

In [4]:
if OS == "Windows":
    s = "\\"
elif OS == "Linux":
    s = "/"
else:
    print("Not a valid OS")

### Initialize the environment variables

In [5]:
TRAIN_DIR = "train"
TEST_DIR = "test"
SUBMISSION_DIR = "submission"
INPUT_PATH = f'.{s}input'
TRAIN_PATH = f'{INPUT_PATH}{s}{TRAIN_DIR}'
TEST_PATH = f'{INPUT_PATH}{s}{TEST_DIR}'
SUBMISSION_PATH = f'{INPUT_PATH}{s}{SUBMISSION_DIR}'
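As a hedged alternative to the manual separator, the same paths can be built with os.path.join, which picks the separator of the running OS. This is a minimal sketch that reuses the directory names above; the rest of the notebook keeps relying on `s`.

```python
import os

TRAIN_DIR = "train"
TEST_DIR = "test"
SUBMISSION_DIR = "submission"

# os.path.join inserts the separator of the current OS between the parts
INPUT_PATH = os.path.join(".", "input")
TRAIN_PATH = os.path.join(INPUT_PATH, TRAIN_DIR)
TEST_PATH = os.path.join(INPUT_PATH, TEST_DIR)
SUBMISSION_PATH = os.path.join(INPUT_PATH, SUBMISSION_DIR)

print(TRAIN_PATH)
```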

<br>
## 1.3 Data setup

### Create the input directory

In [6]:
if not os.path.exists(f'{INPUT_PATH}'):
    os.mkdir(f'{INPUT_PATH}')

### Create the train, test and submission directory

In [7]:
if not os.path.exists(TRAIN_PATH): 
    os.mkdir(TRAIN_PATH)
print("Train directory ready")

if not os.path.exists(TEST_PATH): 
    os.mkdir(TEST_PATH)
print("Test directory ready")

if not os.path.exists(SUBMISSION_PATH): 
    os.mkdir(SUBMISSION_PATH)
print("Submission directory ready")

Train directory ready
Test directory ready
Submission directory ready
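The exists-then-mkdir pattern used in this section can also be written with os.makedirs and exist_ok=True, which creates missing parent directories too. The sketch below is an assumption-laden illustration: the helper name `ensure_dirs` is hypothetical, and it runs in a temporary directory so that it does not touch the real `./input` folder.

```python
import os
import tempfile

def ensure_dirs(base, names):
    """Create base/name for every name; existing directories are left untouched."""
    paths = [os.path.join(base, name) for name in names]
    for path in paths:
        os.makedirs(path, exist_ok=True)  # also creates `base` if it is missing
    return paths

# Demo in a throwaway location instead of ./input
base = tempfile.mkdtemp()
created = ensure_dirs(os.path.join(base, "input"), ["train", "test", "submission"])
print(created)
```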


### Extract the data

Extract the data in the train directory and submission directory.<br>
Set the flag to false when already extracted.

In [8]:
extract_data = False

In [9]:
train_csv_files = ["train.csv"]
test_csv_files = ["test.csv"]
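The notebook does not show the extraction step that the `extract_data` flag guards. A minimal sketch of how it could look with the standard zipfile module follows; the archive name `train.zip` and the helper `extract_archive` are hypothetical, and the demo builds its own tiny archive in a temporary directory.

```python
import os
import tempfile
import zipfile

def extract_archive(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()

# Build a throwaway archive so the example is self-contained
work_dir = tempfile.mkdtemp()
zip_path = os.path.join(work_dir, "train.zip")  # hypothetical archive name
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr("train.csv", "MDR,ID\n0,374-1\n")

extract_data = True  # set to False once the files are already extracted
if extract_data:
    extracted = extract_archive(zip_path, os.path.join(work_dir, "train"))
    print(extracted)
```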

### Check environment is correctly initialized

In [10]:
expected_dir = [TRAIN_DIR,  TEST_DIR, SUBMISSION_DIR]
current_dir = os.listdir(INPUT_PATH)
print(current_dir)
if set(expected_dir).issubset(set(current_dir)): print("Everything is correct")

['test_processed_1', 'y', 'train_general_df', 'ID_df', 'y_df', 'train_processed_1', 'submission', 'test_david1', 'test', 'processed_test_df', 'train_david1', 'processed_train_df', 'train', 'test_general_df']
Everything is correct


<br>
<br>
<a id="data_preprocessing"> </a>
# 2. Training data visualization
Create a flag to iterate faster if we have a working model.

In [11]:
visualization = True

<br>
## 2.1 Generate the dataframes

Extract the names of the csv files in the train folder.<br>


In [12]:
train_table_names = [train_table_name[:-4] for train_table_name in os.listdir(TRAIN_PATH) if train_table_name[-4:] == ".csv"]
test_table_names = [test_table_name[:-4] for test_table_name in os.listdir(TEST_PATH) if test_table_name[-4:] == ".csv"]
print(train_table_names)
print(test_table_names)

['train']
['test_challenge']
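The slicing above assumes a lowercase four-character `.csv` suffix. A slightly more defensive variant, sketched here with a hypothetical helper name, uses os.path.splitext and a case-insensitive check:

```python
import os

def csv_table_names(file_names):
    """Return the base names of the CSV files in file_names, sorted alphabetically."""
    return sorted(
        os.path.splitext(name)[0]
        for name in file_names
        if name.lower().endswith(".csv")
    )

# In the notebook the input would be os.listdir(TRAIN_PATH) or os.listdir(TEST_PATH)
names = csv_table_names(["train.csv", "notes.txt", "test_challenge.CSV"])
print(names)
```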


To iterate faster we define a maximum sample_size (to train on the full dataset set it to None).

In [13]:
sample_size = None

Create the dictionary that links each name to the corresponding dataframe.<br>

In [14]:
generate_datasets = True

In [15]:
if generate_datasets:
    train_tables_dict = {train_table_name : pd.read_csv(f'{TRAIN_PATH}{s}{train_table_name}.csv', nrows=sample_size, low_memory=False, encoding= "ISO-8859-1") for train_table_name in train_table_names}
    test_tables_dict = {test_table_name : pd.read_csv(f'{TEST_PATH}{s}{test_table_name}.csv', nrows=sample_size, low_memory=False, encoding= "ISO-8859-1") for test_table_name in test_table_names}

<br><br>
## 2.2 Preliminary dataframes exploration

### Data description

In [229]:
if visualization:
    train_tables_dict["train"].info()
    display(train_tables_dict["train"].columns.values)
    display(train_tables_dict["train"].head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1617 entries, 0 to 1616
Data columns (total 87 columns):
MDR                                         1617 non-null int64
ID                                          1617 non-null object
NHC                                         1617 non-null int64
start_neutropenico                          1617 non-null datetime64[ns]
start_FN                                    1617 non-null datetime64[ns]
days_between                                1617 non-null float64
days_in_hospital                            1617 non-null int64
hospital_stay_w_FN                          1617 non-null int64
prev_hospital_stay                          1617 non-null int64
birth_year                                  1617 non-null datetime64[ns]
Gender                                      1617 non-null object
emergency                                   1617 non-null int64
num_movements                               1617 non-null int64
num_consult                   

array(['MDR', 'ID', 'NHC', 'start_neutropenico', 'start_FN', 'days_between', 'days_in_hospital',
       'hospital_stay_w_FN', 'prev_hospital_stay', 'birth_year', 'Gender', 'emergency', 'num_movements',
       'num_consult', 'share_room_MDR', 'dummy_LAM', 'dummy_others.LL', 'dummy_Cancer.linfoproliferativo',
       'dummy_SMD', 'dummy_LAL', 'dummy_EICH', 'dummy_Leucemia.cronica', 'dummy_SMPC', 'dummy_Cancer.solido',
       'dummy_LMC', 'dummy_TLPT', 'dummy_others.LM', 'dummy_Mieloma.like', 'dummy_LLC', 'antibiotic_count',
       'days_after_anti', 'AMIKACINA_.MG.', 'AMOXICILINA_.MG.', 'AMPICILINA_.MG.', 'AZITROMICINA_VIAL_.MG.',
       'AZTREONAM_.MG.', 'CEFAZOLINA_.MG.', 'CEFIXIMA_.MG.', 'CEFOTAXIMA_.MG.', 'CEFOXITINA_.MG.',
       'CEFTAROLINA_FOSAMIL_.MG.', 'CEFTAZIDIMA_.MG.', 'CEFTIBUTENO_.MG.', 'CEFTOLOZANO_.UND.',
       'CEFTRIAXONA_.MG.', 'CEFUROXIMA.AXETILO_.MG.', 'CIPROFLOXACINO_.MG.', 'CLARITROMICINA_.MG.',
       'CLINDAMICINA_.MG.', 'CLOXACILINA_.MG.', 'COTRIMOXAZOL_FORTE_.

Unnamed: 0,MDR,ID,NHC,start_neutropenico,start_FN,days_between,days_in_hospital,hospital_stay_w_FN,prev_hospital_stay,birth_year,...,Auto_TP,Alo_TP,room_list,mucositis,cito_group_3,cito_group_1,cito_group_2,Past_positive_result_from,had_FN_before,has_FN_since_no_anti
0,0,374-1,404,2007-12-11,2008-01-01,0.0,28,1,3,1941-01-01,...,1,0,E02403,0,1,0,0,Culture,no,yes
1,0,398-1,1897,2007-12-28,2008-01-01,0.0,8,1,6,1935-01-01,...,0,0,G06512,0,0,0,0,NEGATIVE,no,yes
2,0,403-1,556,2008-01-01,2008-01-01,0.0,2,1,1,1980-01-01,...,1,0,E02407,1,0,0,0,NEGATIVE,no,no
3,0,407-1,454,2008-01-05,2008-01-05,0.0,1,1,9,1986-01-01,...,0,0,"G06508, U10102, UHE211",0,0,0,0,NEGATIVE,no,no
4,0,394-1,1615,2007-12-22,2008-01-06,0.0,17,1,5,1943-01-01,...,0,0,"G06502, G08501",0,0,0,0,Culture,no,yes


In [17]:
if visualization:
    display(train_tables_dict["train"]['start_neutropenico'].unique())

array(['2007-12-11', '2007-12-28', '2008-01-01', '2008-01-05', '2007-12-22', '2008-01-08', '2008-01-11',
       '2008-01-12', '2007-12-25', '2008-01-15', '2008-01-17', '2008-01-18', '2007-11-01', '2008-01-19',
       '2008-01-16', '2008-01-10', '2008-01-21', '2008-01-22', '2008-01-24', '2008-01-25', '2008-01-31',
       '2008-01-29', '2008-02-03', '2008-02-05', '2008-02-02', '2008-02-07', '2008-02-10', '2008-02-14',
       '2008-02-17', '2008-02-19', '2008-02-12', '2008-02-24', '2008-02-28', '2008-03-03', '2008-03-05',
       '2008-03-07', '2008-03-11', '2008-03-12', '2008-03-16', '2008-03-19', '2008-03-21', '2008-03-28',
       '2008-03-31', '2008-04-04', '2008-04-07', '2008-04-06', '2008-04-10', '2008-04-11', '2008-03-26',
       '2008-04-15', '2008-04-03', '2008-04-24', '2008-04-17', '2008-04-25', '2008-04-26', '2008-04-22',
       '2008-04-20', '2008-04-27', '2008-05-03', '2008-05-05', '2008-04-14', '2008-05-07', '2008-04-29',
       '2008-05-10', '2008-05-11', '2008-05-13', '2008-

<br>
### Application train dataframe

Let's analyze the elements of the train dataframe:<br>
<ol>
    <li><b>MDR</b>: The TARGET variable, 1 if the patient is multi-drug resistant, 0 if not.<br>
        This is the main index column
    </li><br>
</ol>

<br><br>
# 3. Feature engineering
Create a flag to iterate faster if we have a working model.

In [18]:
feature_engineering = True

<br>
### Train dataframe
The train dataframe contains the main identifier and all the major information needed for the project.

In [227]:
train_tables_dict["train"]['start_neutropenico'].unique()

array(['2007-12-11', '2007-12-28', '2008-01-01', '2008-01-05', '2007-12-22', '2008-01-08', '2008-01-11',
       '2008-01-12', '2007-12-25', '2008-01-15', '2008-01-17', '2008-01-18', '2007-11-01', '2008-01-19',
       '2008-01-16', '2008-01-10', '2008-01-21', '2008-01-22', '2008-01-24', '2008-01-25', '2008-01-31',
       '2008-01-29', '2008-02-03', '2008-02-05', '2008-02-02', '2008-02-07', '2008-02-10', '2008-02-14',
       '2008-02-17', '2008-02-19', '2008-02-12', '2008-02-24', '2008-02-28', '2008-03-03', '2008-03-05',
       '2008-03-07', '2008-03-11', '2008-03-12', '2008-03-16', '2008-03-19', '2008-03-21', '2008-03-28',
       '2008-03-31', '2008-04-04', '2008-04-07', '2008-04-06', '2008-04-10', '2008-04-11', '2008-03-26',
       '2008-04-15', '2008-04-03', '2008-04-24', '2008-04-17', '2008-04-25', '2008-04-26', '2008-04-22',
       '2008-04-20', '2008-04-27', '2008-05-03', '2008-05-05', '2008-04-14', '2008-05-07', '2008-04-29',
       '2008-05-10', '2008-05-11', '2008-05-13', '2008-

### Timestamp normalization
Convert the timestamp strings of all the tables to the same format and to the datetime64 data type.<br>
Pass an explicit format (faster and safer than letting pandas infer it).

In [228]:
train_tables_dict["train"]["start_neutropenico"] =  pd.to_datetime(train_tables_dict["train"]["start_neutropenico"], format='%Y-%m-%d')
train_tables_dict["train"]["start_FN"] =  pd.to_datetime(train_tables_dict["train"]["start_FN"], format='%Y-%m-%d')
train_tables_dict["train"]["birth_year"] =  pd.to_datetime(train_tables_dict["train"]["birth_year"], format='%Y')

In [253]:
test_tables_dict["test_challenge"]["start_neutropenico"] =  pd.to_datetime(test_tables_dict["test_challenge"]["start_neutropenico"], format='%Y-%m-%d')
test_tables_dict["test_challenge"]["start_FN"] =  pd.to_datetime(test_tables_dict["test_challenge"]["start_FN"], format='%Y-%m-%d')
test_tables_dict["test_challenge"]["birth_year"] =  pd.to_datetime(test_tables_dict["test_challenge"]["birth_year"], format='%Y')

In [246]:
train_tables_dict["train"]["neutro_FN_diff"] = (train_tables_dict["train"]["start_FN"] - train_tables_dict["train"]["start_neutropenico"]).astype('timedelta64[h]')

In [252]:
test_tables_dict["test_challenge"]["neutro_FN_diff"] = (test_tables_dict["test_challenge"]["start_FN"] - test_tables_dict["test_challenge"]["start_neutropenico"]).astype('timedelta64[h]')

In [249]:
train_tables_dict["train"]["age"] = (pd.to_datetime('today')-train_tables_dict["train"]["birth_year"]).astype('timedelta64[h]')

In [254]:
test_tables_dict["test_challenge"]["age"] = (pd.to_datetime('today')-test_tables_dict["test_challenge"]["birth_year"]).astype('timedelta64[h]')

In [233]:
train_tables_dict["train"]["neutro_FN_diff"]

0        504.0
1         96.0
2          0.0
3          0.0
4        360.0
5          0.0
6        360.0
7         24.0
8          0.0
9        456.0
10       120.0
11         0.0
12        24.0
13        24.0
14         0.0
15      1896.0
16         0.0
17        96.0
18       240.0
19         0.0
20         0.0
21       648.0
22         0.0
23       216.0
24        24.0
25       840.0
26         0.0
27         0.0
28        72.0
29         0.0
         ...  
1587      24.0
1588     456.0
1589      24.0
1590      72.0
1591       0.0
1592     144.0
1593     192.0
1594       0.0
1595       0.0
1596     288.0
1597     816.0
1598     360.0
1599       0.0
1600     624.0
1601       0.0
1602       0.0
1603       0.0
1604      48.0
1605      96.0
1606       0.0
1607       0.0
1608       0.0
1609     144.0
1610       0.0
1611       0.0
1612       0.0
1613       0.0
1614       0.0
1615      24.0
1616       0.0
Name: neutro_FN_diff, Length: 1617, dtype: float64
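The cells above convert the differences to hours with `.astype('timedelta64[h]')`, which may not behave the same way in recent pandas releases. A hedged sketch of an equivalent computation with `.dt.total_seconds()` is shown below on three made-up rows that mirror the first dates of the table. The `age_at_FN` column is an assumption of this sketch, not part of the notebook: it measures age in years at the FN episode instead of hours elapsed until the day the notebook is run.

```python
import pandas as pd

episodes = pd.DataFrame({
    "start_neutropenico": ["2007-12-11", "2007-12-28", "2008-01-01"],
    "start_FN": ["2008-01-01", "2008-01-01", "2008-01-01"],
    "birth_year": [1941, 1935, 1980],
})
for col in ["start_neutropenico", "start_FN"]:
    episodes[col] = pd.to_datetime(episodes[col], format="%Y-%m-%d")

# Hours between the start of neutropenia and the start of FN
delta = episodes["start_FN"] - episodes["start_neutropenico"]
episodes["neutro_FN_diff"] = delta.dt.total_seconds() / 3600

# Hypothetical alternative to "age": years at the FN episode, independent of the run date
episodes["age_at_FN"] = episodes["start_FN"].dt.year - episodes["birth_year"]

print(episodes[["neutro_FN_diff", "age_at_FN"]])
```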

### Past positive result from
Generate two binary columns, one for blood culture and one for culture.

In [19]:
train_tables_dict["train"].columns

Index(['MDR', 'ID', 'NHC', 'start_neutropenico', 'start_FN', 'days_between',
       'days_in_hospital', 'hospital_stay_w_FN', 'prev_hospital_stay',
       'birth_year', 'Gender', 'emergency', 'num_movements', 'num_consult',
       'share_room_MDR', 'dummy_LAM', 'dummy_others.LL',
       'dummy_Cancer.linfoproliferativo', 'dummy_SMD', 'dummy_LAL',
       'dummy_EICH', 'dummy_Leucemia.cronica', 'dummy_SMPC',
       'dummy_Cancer.solido', 'dummy_LMC', 'dummy_TLPT', 'dummy_others.LM',
       'dummy_Mieloma.like', 'dummy_LLC', 'antibiotic_count',
       'days_after_anti', 'AMIKACINA_.MG.', 'AMOXICILINA_.MG.',
       'AMPICILINA_.MG.', 'AZITROMICINA_VIAL_.MG.', 'AZTREONAM_.MG.',
       'CEFAZOLINA_.MG.', 'CEFIXIMA_.MG.', 'CEFOTAXIMA_.MG.',
       'CEFOXITINA_.MG.', 'CEFTAROLINA_FOSAMIL_.MG.', 'CEFTAZIDIMA_.MG.',
       'CEFTIBUTENO_.MG.', 'CEFTOLOZANO_.UND.', 'CEFTRIAXONA_.MG.',
       'CEFUROXIMA.AXETILO_.MG.', 'CIPROFLOXACINO_.MG.', 'CLARITROMICINA_.MG.',
       'CLINDAMICINA_.MG.', 'CLOXA
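The notebook does not include the cell that builds these two columns. One possible sketch is given below, assuming that `Past_positive_result_from` takes the values seen in the table preview ('Culture', 'Blood culture', 'NEGATIVE'); the new column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Past_positive_result_from": ["Culture", "NEGATIVE", "Blood culture", "NEGATIVE"],
})

# One 0/1 indicator per positive source; "NEGATIVE" rows get 0 in both columns
df["past_positive_culture"] = (df["Past_positive_result_from"] == "Culture").astype("int8")
df["past_positive_blood_culture"] = (df["Past_positive_result_from"] == "Blood culture").astype("int8")

print(df)
```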

### Share room MDR
Replace the value 2, which marks missing data in this column, with NaN.

### Train

In [20]:
train_tables_dict["train"]["share_room_MDR"].replace(2, np.NaN, inplace=True)

### Test

In [21]:
test_tables_dict["test_challenge"]["share_room_MDR"].replace(2, np.NaN, inplace=True)
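Calling `replace(..., inplace=True)` on a selected column can trigger chained-assignment warnings in newer pandas versions. A minimal sketch of the assignment form, on a made-up column and assuming as above that 2 is the missing-value code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"share_room_MDR": [0, 1, 2, 0]})

# Assign the result back instead of modifying the column in place
df["share_room_MDR"] = df["share_room_MDR"].replace(2, np.nan)

print(df["share_room_MDR"].isna().sum())
```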

### Days between
Days since the last febrile neutropenia (FN).
<ul>
    <li> Set NaN to zero</li>
    <li> Create an additional column that indicates whether the patient had febrile neutropenia before</li>
</ul>

### Train

In [22]:
train_tables_dict["train"]["days_between"].replace(np.NaN, 0, inplace=True)

In [23]:
train_tables_dict["train"]["had_FN_before"] = np.where(train_tables_dict["train"]["days_between"]==0, 'no', 'yes')

### Test

In [24]:
test_tables_dict["test_challenge"]["days_between"].replace(np.NaN, 0, inplace=True)

In [25]:
test_tables_dict["test_challenge"]["had_FN_before"] = np.where(test_tables_dict["test_challenge"]["days_between"]==0, 'no', 'yes')

### Days after anti
Days from when the patient stopped taking antibiotics until FN.
<ul>
    <li> Set NaN to 150</li>
    <li> Create an additional column, has_FN_since_no_anti, that is 'no' when the value equals the fill value 150 and 'yes' otherwise</li>
</ul>

### Train

In [26]:
train_tables_dict["train"]["days_after_anti"].replace(np.NaN, 150, inplace=True)

In [27]:
train_tables_dict["train"]["has_FN_since_no_anti"] = np.where(train_tables_dict["train"]["days_after_anti"]==150, 'no', 'yes')

### Test

In [28]:
test_tables_dict["test_challenge"]["days_after_anti"].replace(np.NaN, 150, inplace=True)

In [29]:
test_tables_dict["test_challenge"]["has_FN_since_no_anti"] = np.where(test_tables_dict["test_challenge"]["days_after_anti"]==150, 'no', 'yes')
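The fill-then-compare pattern used for `days_between` and `days_after_anti` flags a row as 'no' whenever it equals the fill value, even if that value was really recorded. A hedged variant that builds the flag from the missing-value mask before filling is sketched below; the helper `fill_and_flag` is hypothetical, and its result differs from the cells above for genuine values equal to the fill value.

```python
import numpy as np
import pandas as pd

def fill_and_flag(df, column, fill_value, flag_column):
    """Return a copy where NaNs in `column` are filled and `flag_column` records 'yes'/'no'."""
    out = df.copy()
    # Build the flag first, while the NaNs are still visible
    out[flag_column] = np.where(out[column].isna(), "no", "yes")
    out[column] = out[column].fillna(fill_value)
    return out

demo = pd.DataFrame({"days_after_anti": [1.0, np.nan, 150.0]})
demo = fill_and_flag(demo, "days_after_anti", 150, "has_FN_since_no_anti")
print(demo)
```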

# Space

In [30]:
train_tables_dict["train"].columns

Index(['MDR', 'ID', 'NHC', 'start_neutropenico', 'start_FN', 'days_between',
       'days_in_hospital', 'hospital_stay_w_FN', 'prev_hospital_stay',
       'birth_year', 'Gender', 'emergency', 'num_movements', 'num_consult',
       'share_room_MDR', 'dummy_LAM', 'dummy_others.LL',
       'dummy_Cancer.linfoproliferativo', 'dummy_SMD', 'dummy_LAL',
       'dummy_EICH', 'dummy_Leucemia.cronica', 'dummy_SMPC',
       'dummy_Cancer.solido', 'dummy_LMC', 'dummy_TLPT', 'dummy_others.LM',
       'dummy_Mieloma.like', 'dummy_LLC', 'antibiotic_count',
       'days_after_anti', 'AMIKACINA_.MG.', 'AMOXICILINA_.MG.',
       'AMPICILINA_.MG.', 'AZITROMICINA_VIAL_.MG.', 'AZTREONAM_.MG.',
       'CEFAZOLINA_.MG.', 'CEFIXIMA_.MG.', 'CEFOTAXIMA_.MG.',
       'CEFOXITINA_.MG.', 'CEFTAROLINA_FOSAMIL_.MG.', 'CEFTAZIDIMA_.MG.',
       'CEFTIBUTENO_.MG.', 'CEFTOLOZANO_.UND.', 'CEFTRIAXONA_.MG.',
       'CEFUROXIMA.AXETILO_.MG.', 'CIPROFLOXACINO_.MG.', 'CLARITROMICINA_.MG.',
       'CLINDAMICINA_.MG.', 'CLOXA

In [223]:
train_tables_dict["train"]['birth_year'].unique()

array([1941, 1935, 1980, 1986, 1943, 1951, 1950, 1973, 1947, 1936, 1934, 1970, 1959, 1978, 1933, 1969, 1975,
       1958, 1960, 1952, 1991, 1949, 1968, 1983, 1948, 1953, 1937, 1940, 1971, 1974, 1965, 1961, 1954, 1944,
       1945, 1942, 1966, 1955, 1922, 1938, 1928, 1985, 1979, 1977, 1927, 1946, 1972, 1988, 1976, 1984, 1931,
       1956, 1957, 1963, 1926, 1992, 1982, 1920, 1964, 1967, 1990, 1923, 1939, 1929, 1925, 1932, 1962, 1989,
       1987, 1981, 1993, 1919, 1930, 1921])

In [32]:
train_tables_dict["train"].groupby('NHC')['ID'].count()

NHC
1       2
8       1
10      2
14      1
15      1
16      1
19      2
22      1
23      2
25      1
26      1
28      1
32      1
33      1
34      1
36      1
38      9
40      1
41      4
44      1
45      1
48      1
49      1
50      2
53      1
55      1
58      2
61      4
63      1
65      6
       ..
1971    3
1972    3
1973    4
1974    2
1975    1
1978    1
1979    1
1980    1
1981    4
1983    1
1984    1
1987    3
1988    1
1992    1
1994    2
1998    3
1999    4
2001    2
2002    6
2003    3
2004    4
2005    3
2006    2
2007    3
2008    1
2009    2
2011    2
2014    2
2015    7
2016    3
Name: ID, Length: 785, dtype: int64

<br><br>
## 3.1 Dataframes merge
Generate the general dataframe that will contain all the relevant information.<br>
To do so we need to find the relations (dates, identifiers) to link all the tables together.<br>
We will start with the most important tables and left merge all the additional information.

### Train dataframe vanilla
Main table without added data

<b>Train dataframe</b>

In [33]:
if feature_engineering:
    train_general_df = train_tables_dict["train"]

In [34]:
train_general_df

Unnamed: 0,MDR,ID,NHC,start_neutropenico,start_FN,days_between,days_in_hospital,hospital_stay_w_FN,prev_hospital_stay,birth_year,...,Auto_TP,Alo_TP,room_list,mucositis,cito_group_3,cito_group_1,cito_group_2,Past_positive_result_from,had_FN_before,has_FN_since_no_anti
0,0,374-1,404,2007-12-11,2008-01-01,0.0,28,1,3,1941,...,1,0,E02403,0,1,0,0,Culture,no,yes
1,0,398-1,1897,2007-12-28,2008-01-01,0.0,8,1,6,1935,...,0,0,G06512,0,0,0,0,NEGATIVE,no,yes
2,0,403-1,556,2008-01-01,2008-01-01,0.0,2,1,1,1980,...,1,0,E02407,1,0,0,0,NEGATIVE,no,no
3,0,407-1,454,2008-01-05,2008-01-05,0.0,1,1,9,1986,...,0,0,"G06508, U10102, UHE211",0,0,0,0,NEGATIVE,no,no
4,0,394-1,1615,2007-12-22,2008-01-06,0.0,17,1,5,1943,...,0,0,"G06502, G08501",0,0,0,0,Culture,no,yes
5,0,392-1,1723,2008-01-08,2008-01-08,0.0,20,1,1,1941,...,0,0,,0,0,0,0,NEGATIVE,no,yes
6,0,396-1,1521,2007-12-28,2008-01-12,0.0,22,1,3,1951,...,0,0,G02409,0,1,0,0,Blood culture,no,yes
7,0,406-1,232,2008-01-11,2008-01-12,0.0,9,1,3,1950,...,1,0,E02406,1,1,1,1,NEGATIVE,no,yes
8,0,415-1,38,2008-01-12,2008-01-12,0.0,2,1,4,1973,...,0,0,G02206,0,0,0,0,NEGATIVE,no,no
9,0,388-1,544,2007-12-25,2008-01-13,0.0,27,1,1,1947,...,1,0,E02401,0,1,0,0,Culture,no,yes


<b>Test dataframe</b>

In [35]:
if feature_engineering:
    test_general_df = test_tables_dict["test_challenge"]

<br>
### Dataframe checkpoint
When working with huge datasets some operations are quite time consuming.<br>
Pandas offers some shortcuts to optimize performance as much as possible.<br>
The to_feather method writes the dataframe to disk in the Feather format, a binary columnar layout close to its in-memory representation, which allows us to load it back rapidly later.

<b>Save</b>

In [36]:
save = True

In [37]:
if save:
    train_general_df.to_feather(f'{INPUT_PATH}{s}train_general_df')
    test_general_df.to_feather(f'{INPUT_PATH}{s}test_general_df')

<b>Load</b>

In [38]:
load = True

In [39]:
if load:
    train_general_df = pd.read_feather(f'{INPUT_PATH}{s}train_general_df')
    test_general_df = pd.read_feather(f'{INPUT_PATH}{s}test_general_df')

<br>
# 4. Data preprocessing

Check the size of our training set.<br>

###  Continuous and categorical features

There are some fundamental differences between categorical and continuous features.
<ul>
    <li><b>Continuous</b> features carry information in their magnitude along a single dimension.</li>
    <li><b>Categorical</b> features carry information by mapping each sample to the particular subset it belongs to.<br>
        The neural net can then learn the multi-dimensional set of sub-features that defines that subset.</li>
</ul>


In our case, due to the large number of categories, we will treat the categories of each dataframe separately (the dependent feature is not included).

### Train dataframe

In [40]:
train_tables_dict["train"].columns

Index(['MDR', 'ID', 'NHC', 'start_neutropenico', 'start_FN', 'days_between',
       'days_in_hospital', 'hospital_stay_w_FN', 'prev_hospital_stay',
       'birth_year', 'Gender', 'emergency', 'num_movements', 'num_consult',
       'share_room_MDR', 'dummy_LAM', 'dummy_others.LL',
       'dummy_Cancer.linfoproliferativo', 'dummy_SMD', 'dummy_LAL',
       'dummy_EICH', 'dummy_Leucemia.cronica', 'dummy_SMPC',
       'dummy_Cancer.solido', 'dummy_LMC', 'dummy_TLPT', 'dummy_others.LM',
       'dummy_Mieloma.like', 'dummy_LLC', 'antibiotic_count',
       'days_after_anti', 'AMIKACINA_.MG.', 'AMOXICILINA_.MG.',
       'AMPICILINA_.MG.', 'AZITROMICINA_VIAL_.MG.', 'AZTREONAM_.MG.',
       'CEFAZOLINA_.MG.', 'CEFIXIMA_.MG.', 'CEFOTAXIMA_.MG.',
       'CEFOXITINA_.MG.', 'CEFTAROLINA_FOSAMIL_.MG.', 'CEFTAZIDIMA_.MG.',
       'CEFTIBUTENO_.MG.', 'CEFTOLOZANO_.UND.', 'CEFTRIAXONA_.MG.',
       'CEFUROXIMA.AXETILO_.MG.', 'CIPROFLOXACINO_.MG.', 'CLARITROMICINA_.MG.',
       'CLINDAMICINA_.MG.', 'CLOXA

## Train features

## EXCLUDED

room_list: List identifier, only 3 cases per dataset, noisy

# HEADS UP
pandas does not like integers as categories; the NHC column does not work well.

### Patient info

IMPORTANT: num_consult considered continuous, num_movements considered continuous, prev_hospital_stay considered continuous <br>
hospital_stay_w_FN considered continuous

In [41]:
train_patient_info_categorical_f = [ 'Gender', 'share_room_MDR', 'emergency', 'had_FN_before', 'Past_positive_result_from']

In [42]:
train_patient_info_continuous_f = ['num_consult', 'num_movements', 'prev_hospital_stay', 'hospital_stay_w_FN', 'days_in_hospital', 'days_between']

### Time

In [250]:
train_time_continuous_f = ["neutro_FN_diff", "age"]

### Diagnoses

In [43]:
train_diagnoses_categorical_f = ['dummy_LAM', 'dummy_others.LL',
       'dummy_Cancer.linfoproliferativo', 'dummy_SMD', 'dummy_LAL',
       'dummy_EICH', 'dummy_Leucemia.cronica', 'dummy_SMPC',
       'dummy_Cancer.solido', 'dummy_LMC', 'dummy_TLPT', 'dummy_others.LM',
       'dummy_Mieloma.like', 'dummy_LLC']

### Antibiotics

IMPORTANT: Antibiotic count considered categorical

In [44]:
train_antibiotics_categorical_f = ['antibiotic_count', 'has_FN_since_no_anti']

In [45]:
train_antibiotics_continuous_f = ['AMIKACINA_.MG.', 'AMOXICILINA_.MG.',
       'AMPICILINA_.MG.', 'AZITROMICINA_VIAL_.MG.', 'AZTREONAM_.MG.',
       'CEFAZOLINA_.MG.', 'CEFIXIMA_.MG.', 'CEFOTAXIMA_.MG.',
       'CEFOXITINA_.MG.', 'CEFTAROLINA_FOSAMIL_.MG.', 'CEFTAZIDIMA_.MG.',
       'CEFTIBUTENO_.MG.', 'CEFTOLOZANO_.UND.', 'CEFTRIAXONA_.MG.',
       'CEFUROXIMA.AXETILO_.MG.', 'CIPROFLOXACINO_.MG.', 'CLARITROMICINA_.MG.',
       'CLINDAMICINA_.MG.', 'CLOXACILINA_.MG.','COTRIMOXAZOL.SULFAMETOXAZOL_.MG.', 'COTRIMOXAZOL.SULFAMETOXAZOL_.UND.',
       'DAPTOMICINA_.MG.', 'DORIPENEM_.MG.', 'DOXICICLINA_.MG.',
       'ERITROMICINA_.MG.', 'ERTAPENEM_.MG.', 'FOSFOMICINA_.MG.',
       'GENTAMICINA_.MG.', 'IMIPENEM_.MG.', 'LEVOFLOXACINO_.MG.',
       'LINEZOLID_.MG.', 'MEROPENEM_.MG.', 'METRONIDAZOL_.MG.',
       'METRONIDAZOL_COMP_.MG.', 'MOXIFLOXACINO_.MG.', 'NORFLOXACINO_.MG.',
       'PIPERACILINA_.MG.', 'RIFABUTINA_.MG.', 'RIFAMPICINA_.MG.',
       'SULFADIAZINA_.MG.', 'TEICOPLANINA_.MG.', 'TIGECICLINA_.MG.',
       'TOBRAMICINA_.MG.', 'TOBRAMICINA_NEB_.MG.', 'VANCOMICINA_.MG.', 'days_after_anti']

### Transplants

In [46]:
train_transplant_categorical_f = ['Auto_TP', 'Alo_TP']

### Mucositis

In [47]:
train_mucositis_categorical_f = ['mucositis']

### Chemotherapy

In [48]:
train_chemo_categorical_f = ['cito_group_1', 'cito_group_2', 'cito_group_3']

### Train categorical merge

In [49]:
train_categorical_f = train_diagnoses_categorical_f + train_patient_info_categorical_f + train_antibiotics_categorical_f + train_transplant_categorical_f + train_mucositis_categorical_f + train_chemo_categorical_f

### Train continuous merge

In [50]:
train_continuous_f = train_antibiotics_continuous_f + train_patient_info_continuous_f

### Final merge

In [51]:
categorical_f = train_categorical_f

continuous_f = train_continuous_f

In [52]:
print(categorical_f)

['dummy_LAM', 'dummy_others.LL', 'dummy_Cancer.linfoproliferativo', 'dummy_SMD', 'dummy_LAL', 'dummy_EICH', 'dummy_Leucemia.cronica', 'dummy_SMPC', 'dummy_Cancer.solido', 'dummy_LMC', 'dummy_TLPT', 'dummy_others.LM', 'dummy_Mieloma.like', 'dummy_LLC', 'Gender', 'share_room_MDR', 'emergency', 'had_FN_before', 'Past_positive_result_from', 'antibiotic_count', 'has_FN_since_no_anti', 'Auto_TP', 'Alo_TP', 'mucositis', 'cito_group_1', 'cito_group_2', 'cito_group_3']


<br>
###  Identifier

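The identifier is the NHC (patient number). It is not unique per row: the groupby on NHC shown earlier lists patients with several episodes.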

In [53]:
identifier_f = 'NHC'

<br>
###  Dependent feature

The dependent feature is 1 for multi-drug resistant patients.

In [54]:
dependant_f = 'MDR'

###  Training set

Create our training set from the general set.

<b>Train dataframe</b>

In [55]:
train_general_df.head()

Unnamed: 0,MDR,ID,NHC,start_neutropenico,start_FN,days_between,days_in_hospital,hospital_stay_w_FN,prev_hospital_stay,birth_year,...,Auto_TP,Alo_TP,room_list,mucositis,cito_group_3,cito_group_1,cito_group_2,Past_positive_result_from,had_FN_before,has_FN_since_no_anti
0,0,374-1,404,2007-12-11,2008-01-01,0.0,28,1,3,1941,...,1,0,E02403,0,1,0,0,Culture,no,yes
1,0,398-1,1897,2007-12-28,2008-01-01,0.0,8,1,6,1935,...,0,0,G06512,0,0,0,0,NEGATIVE,no,yes
2,0,403-1,556,2008-01-01,2008-01-01,0.0,2,1,1,1980,...,1,0,E02407,1,0,0,0,NEGATIVE,no,no
3,0,407-1,454,2008-01-05,2008-01-05,0.0,1,1,9,1986,...,0,0,"G06508, U10102, UHE211",0,0,0,0,NEGATIVE,no,no
4,0,394-1,1615,2007-12-22,2008-01-06,0.0,17,1,5,1943,...,0,0,"G06502, G08501",0,0,0,0,Culture,no,yes


In [56]:
train_df = train_general_df[categorical_f + continuous_f + [dependant_f, identifier_f]].copy() 

In [57]:
test_general_df.head()

Unnamed: 0,ID,NHC,start_neutropenico,start_FN,days_between,days_in_hospital,hospital_stay_w_FN,prev_hospital_stay,birth_year,Gender,...,Auto_TP,Alo_TP,room_list,mucositis,cito_group_3,cito_group_1,cito_group_2,Past_positive_result_from,had_FN_before,has_FN_since_no_anti
0,2106-1,681,2013-04-30,2013-04-30,0.0,12,1,6,1989,female,...,1,0,E02408,1,1,0,0,NEGATIVE,no,yes
1,2104-2,1038,2013-04-18,2013-05-02,5.0,16,2,4,1992,male,...,0,0,"G02405, G06504",1,1,0,0,NEGATIVE,yes,yes
2,2115-1,1761,2013-05-02,2013-05-02,434.0,0,2,4,1941,female,...,0,0,E01506,0,0,0,0,NEGATIVE,yes,no
3,2108-1,1576,2013-05-03,2013-05-03,0.0,9,1,3,1945,male,...,1,0,,0,1,1,1,Blood culture,no,yes
4,2109-1,1090,2013-05-04,2013-05-04,0.0,10,1,0,1946,male,...,0,0,G02404,0,1,0,0,NEGATIVE,no,yes


In [58]:
test_df = test_general_df[categorical_f + continuous_f + [identifier_f]].copy() 

###  Pandas datatype cast

Define the datatypes of our features to either:
<ul>
    <li>"category" for categorical features (duh)</li>
    <li>"float32" for continuous features</li>
</ul>
This changes the internal representation of the values in the pandas dataframe.<br>
The fastai preprocessing function needs the values to be in one of these two types.

<b>Train dataframe</b>

In [59]:
for cat in categorical_f: train_df[cat] = train_df[cat].astype('category').cat.as_ordered()
for cont in continuous_f: train_df[cont] = train_df[cont].astype('float32')

<b>Test dataframe</b>

In [60]:
for cat in categorical_f: test_df[cat] = test_df[cat].astype('category').cat.as_ordered()
for cont in continuous_f: test_df[cont] = test_df[cont].astype('float32')
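Casting the train and test columns to category independently can assign different integer codes to the same label when the two sets do not contain exactly the same values. A hedged sketch of one way to keep the codes aligned is shown below: the test column reuses the categories learned on the train column. This is plain pandas on made-up rows, not the notebook's own code.

```python
import pandas as pd

col = "Past_positive_result_from"
train = pd.DataFrame({col: ["Culture", "NEGATIVE", "Blood culture"]})
test = pd.DataFrame({col: ["NEGATIVE", "Culture", "NEGATIVE"]})

# Train defines the category list (sorted alphabetically by pandas)
train[col] = train[col].astype("category").cat.as_ordered()

# Test reuses the train categories, so each label maps to the same code
test[col] = pd.Categorical(test[col], categories=train[col].cat.categories, ordered=True)

train_codes = train[col].cat.codes.tolist()
test_codes = test[col].cat.codes.tolist()
print(train_codes, test_codes)
```

Labels that appear only in the test set become NaN (code -1) with this approach, which makes unseen categories explicit.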

###  Train sample

Use a percentage of the whole dataset to iterate faster.<br>

In [61]:
train_sample_pct = 1

We use the get_cv_idxs function to get a random set of indexes for our sample.<br>
We set the index as the identifier feature of the patient.

In [62]:
idxs = get_cv_idxs(train_df.shape[0], val_pct=train_sample_pct)
train_sample_df = train_df.iloc[idxs].set_index(identifier_f) # index by the identifier feature
n_train_sample = len(train_sample_df)
print(f'The training sample has {n_train_sample} samples')


The training sample has 1617 samples
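For readers without the fastai helper at hand, the sampling step can be approximated with NumPy alone. The function below is a sketch under that assumption, not fastai's implementation: `sample_indexes` is a hypothetical name, and it returns a seeded random subset of row positions whose size is a fraction of the dataset.

```python
import numpy as np

def sample_indexes(n_rows, sample_pct, seed=42):
    """Return int(sample_pct * n_rows) distinct row positions in random order."""
    rng = np.random.RandomState(seed)  # fixed seed keeps the sample reproducible
    n_sample = int(sample_pct * n_rows)
    return rng.permutation(n_rows)[:n_sample]

full = sample_indexes(1617, 1)       # whole dataset, shuffled
quarter = sample_indexes(1000, 0.25)  # 25% sample
print(len(full), len(quarter))
```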


###  Test index setting and identifier extraction

In [63]:
test_item_id = test_df[identifier_f]

We set the index to identifier feature for the test_df too.

In [64]:
test_df = test_df.set_index(identifier_f) # index by the identifier feature

With the new sorted dataframe we extract the identifier feature column that will be used in the submission file.<br>
Then we drop it from the test_df (it's not used as a categorical variable because it doesn't contain any information).

###  Preprocess sample training set

Use the fastai proc_df function to:
<ul>
    <li> Create a dataframe without the dependent feature.</li>
    <li> Create an array with the dependent feature y.</li>
    <li> Process NaN values and generate a vector of where the original NaNs were</li>
    <li> Normalize the dataframe features (do_scale)</li>
    <li> Return the mapper used to normalize the features so it can be used later for the test set</li>
    </ul>
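To make the steps listed above concrete, here is a rough pandas-only stand-in. It is a sketch under stated assumptions, not fastai's code: the name `simple_proc_df` and the `stats` dictionary are hypothetical, missing numeric values are filled with the train median, and scaling is a plain standardization.

```python
import numpy as np
import pandas as pd

def simple_proc_df(df, y_field=None, stats=None):
    """Split off y, encode categories, fill and flag NaNs, and standardize numeric columns."""
    df = df.copy()
    y = df.pop(y_field).to_numpy() if y_field is not None else None
    fit = stats is None  # fit on train, reuse the stats on test
    if fit:
        stats = {}
    for col in list(df.columns):
        if isinstance(df[col].dtype, pd.CategoricalDtype):
            # pandas uses -1 for missing categories; shift so that 0 means NaN
            df[col] = df[col].cat.codes + 1
            continue
        if fit:
            median = df[col].median()
            filled = df[col].fillna(median)
            stats[col] = {"median": median, "mean": filled.mean(),
                          "std": filled.std(ddof=0), "had_na": bool(df[col].isna().any())}
        s = stats[col]
        if s["had_na"]:
            df[col + "_na"] = df[col].isna()  # remember where the NaNs were
        df[col] = (df[col].fillna(s["median"]) - s["mean"]) / (s["std"] or 1.0)
    return df, y, stats

train = pd.DataFrame({
    "Gender": pd.Categorical(["female", "male", None, "male"]),
    "days_in_hospital": [28.0, 8.0, np.nan, 2.0],
    "MDR": [0, 0, 1, 0],
})
x, y, stats = simple_proc_df(train, "MDR")
print(x)
```

The returned `stats` plays the role of the mapper: passing it back in, as in `simple_proc_df(test, stats=stats)`, applies the train medians and scaling to another dataframe.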
   

<b>Train dataframe</b>

In [65]:
train_sample_df.head()

Unnamed: 0_level_0,dummy_LAM,dummy_others.LL,dummy_Cancer.linfoproliferativo,dummy_SMD,dummy_LAL,dummy_EICH,dummy_Leucemia.cronica,dummy_SMPC,dummy_Cancer.solido,dummy_LMC,...,TOBRAMICINA_NEB_.MG.,VANCOMICINA_.MG.,days_after_anti,num_consult,num_movements,prev_hospital_stay,hospital_stay_w_FN,days_in_hospital,days_between,MDR
NHC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
419,0,0,1,0,0,0,0,0,0,0,...,0.0,0.0,1.0,1.0,0.0,3.0,1.0,8.0,0.0,0
2015,0,0,1,0,0,0,0,0,0,0,...,0.0,0.0,0.0,2.0,1.0,11.0,7.0,7.0,73.0,1
119,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,150.0,0.0,0.0,1.0,1.0,0.0,0.0,0
343,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,150.0,0.0,0.0,4.0,2.0,0.0,192.0,0
122,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,150.0,0.0,0.0,1.0,1.0,0.0,0.0,0


In [66]:
processed_train_df, y, nas, mapper = proc_df(train_sample_df, dependant_f, do_scale=True)

<b>Test dataframe</b>

In [67]:
processed_test_df , _, test_nas, test_mapper = proc_df(test_df, y_fld=None, do_scale=True,
                                                       mapper=mapper, na_dict=nas)

As we can see it has normalized the data and has encoded categories using contiguous integers starting from 0 (NaN).

### Preprocessed dataframes

In [68]:
processed_train_df.columns

Index(['dummy_LAM', 'dummy_others.LL', 'dummy_Cancer.linfoproliferativo',
       'dummy_SMD', 'dummy_LAL', 'dummy_EICH', 'dummy_Leucemia.cronica',
       'dummy_SMPC', 'dummy_Cancer.solido', 'dummy_LMC', 'dummy_TLPT',
       'dummy_others.LM', 'dummy_Mieloma.like', 'dummy_LLC', 'Gender',
       'share_room_MDR', 'emergency', 'had_FN_before',
       'Past_positive_result_from', 'antibiotic_count', 'has_FN_since_no_anti',
       'Auto_TP', 'Alo_TP', 'mucositis', 'cito_group_1', 'cito_group_2',
       'cito_group_3', 'AMIKACINA_.MG.', 'AMOXICILINA_.MG.', 'AMPICILINA_.MG.',
       'AZITROMICINA_VIAL_.MG.', 'AZTREONAM_.MG.', 'CEFAZOLINA_.MG.',
       'CEFIXIMA_.MG.', 'CEFOTAXIMA_.MG.', 'CEFOXITINA_.MG.',
       'CEFTAROLINA_FOSAMIL_.MG.', 'CEFTAZIDIMA_.MG.', 'CEFTIBUTENO_.MG.',
       'CEFTOLOZANO_.UND.', 'CEFTRIAXONA_.MG.', 'CEFUROXIMA.AXETILO_.MG.',
       'CIPROFLOXACINO_.MG.', 'CLARITROMICINA_.MG.', 'CLINDAMICINA_.MG.',
       'CLOXACILINA_.MG.', 'COTRIMOXAZOL.SULFAMETOXAZOL_.MG.',
    

In [69]:
processed_train_df.info()
display(processed_train_df.head())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1617 entries, 419 to 613
Data columns (total 79 columns):
dummy_LAM                            1617 non-null int8
dummy_others.LL                      1617 non-null int8
dummy_Cancer.linfoproliferativo      1617 non-null int8
dummy_SMD                            1617 non-null int8
dummy_LAL                            1617 non-null int8
dummy_EICH                           1617 non-null int8
dummy_Leucemia.cronica               1617 non-null int8
dummy_SMPC                           1617 non-null int8
dummy_Cancer.solido                  1617 non-null int8
dummy_LMC                            1617 non-null int8
dummy_TLPT                           1617 non-null int8
dummy_others.LM                      1617 non-null int8
dummy_Mieloma.like                   1617 non-null int8
dummy_LLC                            1617 non-null int8
Gender                               1617 non-null int8
share_room_MDR                       1617 non-null i

Unnamed: 0_level_0,dummy_LAM,dummy_others.LL,dummy_Cancer.linfoproliferativo,dummy_SMD,dummy_LAL,dummy_EICH,dummy_Leucemia.cronica,dummy_SMPC,dummy_Cancer.solido,dummy_LMC,...,TOBRAMICINA_.MG.,TOBRAMICINA_NEB_.MG.,VANCOMICINA_.MG.,days_after_anti,num_consult,num_movements,prev_hospital_stay,hospital_stay_w_FN,days_in_hospital,days_between
NHC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
419,1,1,2,1,1,1,1,1,1,1,...,-0.044414,0.0,-0.337225,-0.79419,-0.42416,-0.501205,0.053928,-0.614833,-0.273222,-0.394399
2015,1,1,2,1,1,1,1,1,1,1,...,-0.044414,0.0,-0.337225,-0.808586,-0.314614,0.292574,2.75784,4.986206,-0.33143,0.339622
119,1,1,1,1,1,1,1,1,1,1,...,-0.044414,0.0,-0.337225,1.350735,-0.533706,-0.501205,-0.62205,-0.614833,-0.738886,-0.394399
343,1,1,1,1,1,1,1,1,1,1,...,-0.044414,0.0,-0.337225,1.350735,-0.533706,-0.501205,0.391917,0.318674,-0.738886,1.536176
122,1,1,1,1,1,1,1,1,1,1,...,-0.044414,0.0,-0.337225,1.350735,-0.533706,-0.501205,-0.62205,-0.614833,-0.738886,-0.394399


In [70]:
processed_test_df.info()
display(processed_test_df.head())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1618 entries, 681 to 1491
Data columns (total 79 columns):
dummy_LAM                            1618 non-null int8
dummy_others.LL                      1618 non-null int8
dummy_Cancer.linfoproliferativo      1618 non-null int8
dummy_SMD                            1618 non-null int8
dummy_LAL                            1618 non-null int8
dummy_EICH                           1618 non-null int8
dummy_Leucemia.cronica               1618 non-null int8
dummy_SMPC                           1618 non-null int8
dummy_Cancer.solido                  1618 non-null int8
dummy_LMC                            1618 non-null int8
dummy_TLPT                           1618 non-null int8
dummy_others.LM                      1618 non-null int8
dummy_Mieloma.like                   1618 non-null int8
dummy_LLC                            1618 non-null int8
Gender                               1618 non-null int8
share_room_MDR                       1618 non-null 

Unnamed: 0_level_0,dummy_LAM,dummy_others.LL,dummy_Cancer.linfoproliferativo,dummy_SMD,dummy_LAL,dummy_EICH,dummy_Leucemia.cronica,dummy_SMPC,dummy_Cancer.solido,dummy_LMC,...,TOBRAMICINA_.MG.,TOBRAMICINA_NEB_.MG.,VANCOMICINA_.MG.,days_after_anti,num_consult,num_movements,prev_hospital_stay,hospital_stay_w_FN,days_in_hospital,days_between
NHC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
681,1,1,1,1,1,1,1,1,1,1,...,-0.044414,0.0,-0.337225,-0.79419,-0.314614,-0.501205,1.067895,-0.614833,-0.040389,-0.394399
1038,2,1,1,1,1,1,1,1,1,1,...,-0.044414,0.0,-0.337225,-0.808586,-0.095522,0.292574,0.391917,0.318674,0.192443,-0.344124
1761,1,1,1,1,1,1,1,1,1,1,...,-0.044414,0.0,-0.337225,1.350735,-0.533706,-0.501205,0.391917,0.318674,-0.738886,3.969506
1576,1,1,2,1,1,1,1,1,1,1,...,-0.044414,0.0,0.002171,-0.808586,-0.314614,-0.501205,0.053928,-0.614833,-0.215013,-0.394399
1090,2,1,1,1,1,1,1,1,1,1,...,-0.044414,0.0,-0.337225,-0.736609,0.123569,-0.501205,-0.960039,-0.614833,-0.156805,-0.394399


# Preprocessed dataframes to csv

### Train

In [71]:
processed_train_complete = processed_train_df

In [72]:
processed_train_complete['MDR'] = y

In [73]:
processed_train_complete.reset_index(inplace=True)

In [74]:
processed_train_complete.head()

Unnamed: 0,NHC,dummy_LAM,dummy_others.LL,dummy_Cancer.linfoproliferativo,dummy_SMD,dummy_LAL,dummy_EICH,dummy_Leucemia.cronica,dummy_SMPC,dummy_Cancer.solido,...,TOBRAMICINA_NEB_.MG.,VANCOMICINA_.MG.,days_after_anti,num_consult,num_movements,prev_hospital_stay,hospital_stay_w_FN,days_in_hospital,days_between,MDR
0,419,1,1,2,1,1,1,1,1,1,...,0.0,-0.337225,-0.79419,-0.42416,-0.501205,0.053928,-0.614833,-0.273222,-0.394399,0
1,2015,1,1,2,1,1,1,1,1,1,...,0.0,-0.337225,-0.808586,-0.314614,0.292574,2.75784,4.986206,-0.33143,0.339622,1
2,119,1,1,1,1,1,1,1,1,1,...,0.0,-0.337225,1.350735,-0.533706,-0.501205,-0.62205,-0.614833,-0.738886,-0.394399,0
3,343,1,1,1,1,1,1,1,1,1,...,0.0,-0.337225,1.350735,-0.533706,-0.501205,0.391917,0.318674,-0.738886,1.536176,0
4,122,1,1,1,1,1,1,1,1,1,...,0.0,-0.337225,1.350735,-0.533706,-0.501205,-0.62205,-0.614833,-0.738886,-0.394399,0


In [75]:
train_df.to_csv(f'{INPUT_PATH}{s}train_david1')

In [76]:
processed_train_complete.to_csv(f'{INPUT_PATH}{s}train_processed_1')

### Test

In [77]:
processed_test_complete = processed_test_df

In [78]:
processed_test_complete.reset_index(inplace=True)

In [79]:
processed_test_complete.head()

Unnamed: 0,NHC,dummy_LAM,dummy_others.LL,dummy_Cancer.linfoproliferativo,dummy_SMD,dummy_LAL,dummy_EICH,dummy_Leucemia.cronica,dummy_SMPC,dummy_Cancer.solido,...,TOBRAMICINA_.MG.,TOBRAMICINA_NEB_.MG.,VANCOMICINA_.MG.,days_after_anti,num_consult,num_movements,prev_hospital_stay,hospital_stay_w_FN,days_in_hospital,days_between
0,681,1,1,1,1,1,1,1,1,1,...,-0.044414,0.0,-0.337225,-0.79419,-0.314614,-0.501205,1.067895,-0.614833,-0.040389,-0.394399
1,1038,2,1,1,1,1,1,1,1,1,...,-0.044414,0.0,-0.337225,-0.808586,-0.095522,0.292574,0.391917,0.318674,0.192443,-0.344124
2,1761,1,1,1,1,1,1,1,1,1,...,-0.044414,0.0,-0.337225,1.350735,-0.533706,-0.501205,0.391917,0.318674,-0.738886,3.969506
3,1576,1,1,2,1,1,1,1,1,1,...,-0.044414,0.0,0.002171,-0.808586,-0.314614,-0.501205,0.053928,-0.614833,-0.215013,-0.394399
4,1090,2,1,1,1,1,1,1,1,1,...,-0.044414,0.0,-0.337225,-0.736609,0.123569,-0.501205,-0.960039,-0.614833,-0.156805,-0.394399


In [80]:
processed_test_complete.to_csv(f'{INPUT_PATH}{s}test_processed_1')

In [81]:
test_df

Unnamed: 0_level_0,dummy_LAM,dummy_others.LL,dummy_Cancer.linfoproliferativo,dummy_SMD,dummy_LAL,dummy_EICH,dummy_Leucemia.cronica,dummy_SMPC,dummy_Cancer.solido,dummy_LMC,...,TOBRAMICINA_.MG.,TOBRAMICINA_NEB_.MG.,VANCOMICINA_.MG.,days_after_anti,num_consult,num_movements,prev_hospital_stay,hospital_stay_w_FN,days_in_hospital,days_between
NHC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
681,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,1.0,2.0,0.0,6.0,1.0,12.0,0.0
1038,1,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,4.0,1.0,4.0,2.0,16.0,5.0
1761,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,150.0,0.0,0.0,4.0,2.0,0.0,434.0
1576,0,0,1,0,0,0,0,0,0,0,...,0.0,0.0,2000.0,0.0,2.0,0.0,3.0,1.0,9.0,0.0
1090,1,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,5.0,6.0,0.0,0.0,1.0,10.0,0.0
262,0,0,0,0,0,1,0,0,0,0,...,0.0,0.0,6000.0,2.0,10.0,0.0,3.0,2.0,19.0,6.0
1085,0,0,0,1,0,0,0,0,0,0,...,0.0,0.0,2000.0,0.0,3.0,0.0,0.0,1.0,7.0,0.0
1092,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,1.0,2.0,0.0,0.0,1.0,5.0,0.0
864,0,0,1,0,0,0,0,0,0,0,...,0.0,0.0,0.0,8.0,5.0,0.0,5.0,2.0,23.0,305.0
1689,0,0,1,0,0,0,0,0,0,0,...,0.0,0.0,0.0,7.0,0.0,0.0,6.0,1.0,10.0,0.0


In [82]:
test_df.to_csv(f'{INPUT_PATH}{s}test_david1')

In [83]:
y

array([0, 1, 0, ..., 0, 0, 0])

<br>
### Dataframe checkpoint
When working with huge datasets, some operations are quite time-consuming.<br>
Pandas offers some shortcuts to optimize performance as much as possible.<br>
to_feather saves the dataframe to disk in a binary format close to its representation in RAM, which allows us to load it back rapidly later.

In [84]:
ID_df = pd.DataFrame(test_general_df["ID"])

In [85]:
y_df = pd.DataFrame(y, columns=["MDR"])

In [86]:
y_df

Unnamed: 0,MDR
0,0
1,1
2,0
3,0
4,0
5,1
6,0
7,0
8,0
9,0


<b>Save</b>

In [87]:
save = True

In [88]:
if save:
    processed_train_df.to_feather(f'{INPUT_PATH}{s}processed_train_df')
    processed_test_df.to_feather(f'{INPUT_PATH}{s}processed_test_df')
    ID_df.to_feather(f'{INPUT_PATH}{s}ID_df')
    y_df.to_feather(f'{INPUT_PATH}{s}y_df')

<b>Load</b>

In [89]:
load = True

In [94]:
if load:
    processed_train_df = pd.read_feather(f'{INPUT_PATH}{s}processed_train_df')
    processed_test_df = pd.read_feather(f'{INPUT_PATH}{s}processed_test_df')
    ID_df = pd.read_feather(f'{INPUT_PATH}{s}ID_df')
    y_df = pd.read_feather(f'{INPUT_PATH}{s}y_df')

<br>
<br>
<a id="model_training"> </a>
# 5. Model training

In [146]:
forest = RandomForestClassifier(n_estimators=500)

Building a successful model is an iterative process. We shouldn't expect to come up with a magical idea that produces a great model from the start. Also, we shouldn't make decisions based on "gut feelings" or "divine visions".

In [95]:
data = processed_train_df
y = processed_train_df["MDR"]
data.drop(['MDR'], inplace=True, axis=1)

In [92]:
test = processed_test_df

In [None]:
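from sklearn.model_selection import train_test_split

# Hold out 30% of the training data; `data` and `y` are defined in the cells above
X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.3, random_state=0)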

In [93]:
folds = KFold(n_splits=5, shuffle=True, random_state=546790)
oof_preds = np.zeros(data.shape[0])
sub_preds = np.zeros(test.shape[0])



In [221]:
folds = KFold(n_splits=8, shuffle=True, random_state=random.randint(0,10000))
oof_preds = np.zeros(data.shape[0])
sub_preds = np.zeros(test.shape[0])

feature_importance_df = pd.DataFrame()
feats = [f for f in data.columns if f not in [identifier_f]]
for n_fold, (trn_idx, val_idx) in enumerate(folds.split(data)):
    
    trn_x, trn_y = data[feats].iloc[trn_idx], y.iloc[trn_idx]
    val_x, val_y = data[feats].iloc[val_idx], y.iloc[val_idx]
    clf = RandomForestClassifier(n_estimators=1000, criterion='entropy',
                                 max_depth=20,
                                 min_samples_split=4,
                                 min_samples_leaf=6,
                                 max_features=40,
                                 max_leaf_nodes = 2,
                                 bootstrap=True
                             
                                )
    clf.fit(trn_x, trn_y)
    oof_preds[val_idx] = clf.predict_proba(val_x)[:, 1]
    sub_preds += clf.predict_proba(test[feats])[:, 1] / folds.n_splits
    
    
    print(forest.score(val_x,val_y))
    print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(val_y, oof_preds[val_idx])))
print('Full AUC score %.6f' % roc_auc_score(y, oof_preds))

0.9950738916256158
Fold  1 AUC : 0.638258
0.995049504950495
Fold  2 AUC : 0.553828
0.9752475247524752
Fold  3 AUC : 0.776216
0.9752475247524752
Fold  4 AUC : 0.644298
0.995049504950495
Fold  5 AUC : 0.549962
0.9752475247524752
Fold  6 AUC : 0.584407
0.9851485148514851
Fold  7 AUC : 0.532812
0.9900990099009901
Fold  8 AUC : 0.545425
Full AUC score 0.585846


In [212]:
feature_importance_df = pd.DataFrame()

feats = [f for f in data.columns if f not in [identifier_f]]
for n_fold, (trn_idx, val_idx) in enumerate(folds.split(data)):
    
    trn_x, trn_y = data[feats].iloc[trn_idx], y.iloc[trn_idx]
    val_x, val_y = data[feats].iloc[val_idx], y.iloc[val_idx]
    
    clf = LGBMClassifier(
        # n_estimators=1000,
        # num_leaves=20,
        # colsample_bytree=.8,
        # subsample=.8,
        # max_depth=7,
        # reg_alpha=.1,
        # reg_lambda=.1,
        # min_split_gain=.01
        n_estimators=10000,
        learning_rate=0.03,
        num_leaves=4,
        colsample_bytree=.8,
        subsample=.8,
        max_depth=5,
        reg_alpha=.1,
        reg_lambda=.1,
        min_split_gain=.01,
        min_child_weight=2,
        silent=-1,
        verbose=-1,
    )
    clf.fit(trn_x, trn_y, 
            eval_set= [(trn_x, trn_y), (val_x, val_y)], 
            eval_metric='auc', verbose=100, early_stopping_rounds=50  #30
           )
    oof_preds[val_idx] = clf.predict_proba(val_x, num_iteration=clf.best_iteration_)[:, 1]
    sub_preds += clf.predict_proba(test[feats], num_iteration=clf.best_iteration_)[:, 1] / folds.n_splits
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = feats
    fold_importance_df["importance"] = clf.feature_importances_
    fold_importance_df["fold"] = n_fold + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(val_y, oof_preds[val_idx])))
    
print('Full AUC score %.6f' % roc_auc_score(y, oof_preds))
    

Training until validation scores don't improve for 50 rounds.
[100]	training's auc: 0.803077	valid_1's auc: 0.866004
Early stopping, best iteration is:
[53]	training's auc: 0.760831	valid_1's auc: 0.897727
Fold  1 AUC : 0.897727
Training until validation scores don't improve for 50 rounds.
[100]	training's auc: 0.796986	valid_1's auc: 0.730566
Early stopping, best iteration is:
[76]	training's auc: 0.771737	valid_1's auc: 0.750509
Fold  2 AUC : 0.750509
Training until validation scores don't improve for 50 rounds.
[100]	training's auc: 0.828418	valid_1's auc: 0.576471
[200]	training's auc: 0.877285	valid_1's auc: 0.614308
[300]	training's auc: 0.902093	valid_1's auc: 0.637043
[400]	training's auc: 0.921	valid_1's auc: 0.643084
Early stopping, best iteration is:
[359]	training's auc: 0.914185	valid_1's auc: 0.645628
Fold  3 AUC : 0.645628
Training until validation scores don't improve for 50 rounds.
[100]	training's auc: 0.800993	valid_1's auc: 0.67033
Early stopping, best iteration is:

In [144]:
folds = KFold(n_splits=4, shuffle=True, random_state=random.randint(0,10000))
oof_preds = np.zeros(data.shape[0])
sub_preds = np.zeros(test.shape[0])

feature_importance_df = pd.DataFrame()

feats = [f for f in data.columns if f not in [identifier_f]]
for n_fold, (trn_idx, val_idx) in enumerate(folds.split(data)):
    
    trn_x, trn_y = data[feats].iloc[trn_idx], y.iloc[trn_idx]
    val_x, val_y = data[feats].iloc[val_idx], y.iloc[val_idx]
    
    clf = LGBMClassifier(
        # n_estimators=1000,
        # num_leaves=20,
        # colsample_bytree=.8,
        # subsample=.8,
        # max_depth=7,
        # reg_alpha=.1,
        # reg_lambda=.1,
        # min_split_gain=.01
        # overfitting estimators
        min_data_in_bin=1,
        max_bin = 40,
        num_leaves = 25,
        max_depth = 5,
        min_data_in_leaf = 20,
        min_sum_hessian_in_leaf = 1,
        bagging_fraction = 0.8,
        feature_fraction=0.7,
        lambda_l1 = 0.01,
        lambda_l2 = 0.03,
        min_gain_to_split = 0.01,
        n_estimators=10000,
        learning_rate=0.0001,
        silent=-1,
        verbose=-1,
    )
    clf.fit(trn_x, trn_y, 
            eval_set= [(trn_x, trn_y), (val_x, val_y)], 
            eval_metric='auc', verbose=100, early_stopping_rounds=200  #30
           )
    oof_preds[val_idx] = clf.predict_proba(val_x, num_iteration=clf.best_iteration_)[:, 1]
    sub_preds += clf.predict_proba(test[feats], num_iteration=clf.best_iteration_)[:, 1] / folds.n_splits
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = feats
    fold_importance_df["importance"] = clf.feature_importances_
    fold_importance_df["fold"] = n_fold + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(val_y, oof_preds[val_idx])))
    
print('Full AUC score %.6f' % roc_auc_score(y, oof_preds))
    

Training until validation scores don't improve for 200 rounds.
[100]	training's auc: 0.859307	valid_1's auc: 0.653599
[200]	training's auc: 0.860402	valid_1's auc: 0.664364
Early stopping, best iteration is:
[8]	training's auc: 0.864523	valid_1's auc: 0.715376
Fold  1 AUC : 0.715376
Training until validation scores don't improve for 200 rounds.
[100]	training's auc: 0.904683	valid_1's auc: 0.597782
[200]	training's auc: 0.905912	valid_1's auc: 0.618539
[300]	training's auc: 0.903079	valid_1's auc: 0.614265
Early stopping, best iteration is:
[107]	training's auc: 0.906534	valid_1's auc: 0.597171
Fold  2 AUC : 0.597171
Training until validation scores don't improve for 200 rounds.
[100]	training's auc: 0.851325	valid_1's auc: 0.673171
[200]	training's auc: 0.85804	valid_1's auc: 0.678421
Early stopping, best iteration is:
[8]	training's auc: 0.843651	valid_1's auc: 0.691544
Fold  3 AUC : 0.691544
Training until validation scores don't improve for 200 rounds.
[100]	training's auc: 0.86802

In [None]:
folds = KFold(n_splits=4, shuffle=True, random_state=random.randint(0,10000))
oof_preds = np.zeros(data.shape[0])
sub_preds = np.zeros(test.shape[0])
feature_importance_df = pd.DataFrame()

feats = [f for f in data.columns if f not in [identifier_f]]
for n_fold, (trn_idx, val_idx) in enumerate(folds.split(data)):
    
    trn_x, trn_y = data[feats].iloc[trn_idx], y.iloc[trn_idx]
    val_x, val_y = data[feats].iloc[val_idx], y.iloc[val_idx]
    
    clf = LGBMClassifier(
        # n_estimators=1000,
        # num_leaves=20,
        # colsample_bytree=.8,
        # subsample=.8,
        # max_depth=7,
        # reg_alpha=.1,
        # reg_lambda=.1,
        # min_split_gain=.01
        # overfitting estimators
        max_bin = 150,
        num_leaves = 36,
        max_depth = 9,
        min_data_in_leaf = 20,
        min_sum_hessian_in_leaf = 3,
        bagging_fraction = 0.8,
        feature_fraction=0.8,
        lambda_l2 = 0.01,
        min_gain_to_split = 0.01,
        n_estimators=10000,
        learning_rate=0.02,
        colsample_bytree=.8,
        subsample=.8,
        min_child_weight=2,
        silent=-1,
        verbose=-1, 
    )
    clf.fit(trn_x, trn_y, 
            eval_set= [(trn_x, trn_y), (val_x, val_y)], 
            eval_metric='auc', verbose=100, early_stopping_rounds=50  #30
           )
    oof_preds[val_idx] = clf.predict_proba(val_x, num_iteration=clf.best_iteration_)[:, 1]
    sub_preds += clf.predict_proba(test[feats], num_iteration=clf.best_iteration_)[:, 1] / folds.n_splits
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = feats
    fold_importance_df["importance"] = clf.feature_importances_
    fold_importance_df["fold"] = n_fold + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(val_y, oof_preds[val_idx])))
    
print('Full AUC score %.6f' % roc_auc_score(y, oof_preds))
    

In [None]:
clf = LGBMClassifier(
        # n_estimators=1000,
        # num_leaves=20,
        # colsample_bytree=.8,
        # subsample=.8,
        # max_depth=7,
        # reg_alpha=.1,
        # reg_lambda=.1,
        # min_split_gain=.01
        # overfitting estimators
        max_bin = 150,
        num_leaves = 36,
        max_depth = 9,
        min_data_in_leaf = 20,
        min_sum_hessian_in_leaf = 3,
        bagging_fraction = 0.8,
        feature_fraction=0.8,
        lambda_l2 = 0.01,
        min_gain_to_split = 0.01,
        n_estimators=10000,
        learning_rate=0.02,
        colsample_bytree=.8,
        subsample=.8,
        min_child_weight=2,
        silent=-1,
        verbose=-1, 
)

In [None]:
clf = LGBMClassifier(
        # n_estimators=1000,
        # num_leaves=20,
        # colsample_bytree=.8,
        # subsample=.8,
        # max_depth=7,
        # reg_alpha=.1,
        # reg_lambda=.1,
        # min_split_gain=.01
        n_estimators=10000,
        learning_rate=0.03,
        num_leaves=36,
        colsample_bytree=.8,
        subsample=.8,
        max_depth=9,
        reg_alpha=.1,
        reg_lambda=.1,
        min_split_gain=.01,
        min_child_weight=2,
        silent=-1,
        verbose=-1,
    )


In [None]:
clf.fit(trn_x, trn_y, 
            eval_set= [(trn_x, trn_y), (val_x, val_y)], 
            eval_metric='auc', verbose=100, early_stopping_rounds=50  #30
           )
    

In [None]:
oof_preds[val_idx] = clf.predict_proba(val_x, num_iteration=clf.best_iteration_)[:, 1]
sub_preds += clf.predict_proba(test[feats], num_iteration=clf.best_iteration_)[:, 1] / folds.n_splits
    

In [None]:
# Feature importances and AUC for the fold fitted in the cells above
fold_importance_df = pd.DataFrame()
fold_importance_df["feature"] = feats
fold_importance_df["importance"] = clf.feature_importances_
fold_importance_df["fold"] = n_fold + 1
feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(val_y, oof_preds[val_idx])))

print('Full AUC score %.6f' % roc_auc_score(y, oof_preds))

In [None]:
ID_df = pd.DataFrame(test_general_df["ID"])

In [None]:
ID_df.head()

In [None]:
sub_preds

In [None]:
predictions = pd.DataFrame(sub_preds, columns=["MDR"])
predictions.head()


In [None]:
result = pd.concat([ID_df, predictions], axis=1)
result.to_csv("intento5.csv", index=False)
result.head()
# test[['ID', 'MDR']].to_csv('first_submission.csv', index=False)

<br>
<br>
<a id="model_training"> </a>
# 6. Validation

### Cross-validation set probabilities

### Cross-validation set predictions

### Classification report

In [None]:
predictions = oof_preds[val_idx]

In [None]:
real_values = y_df["MDR"].iloc[val_idx]

In [None]:
real_values = y.iloc[val_idx]

In [None]:
real_values.values

In [None]:
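# classification_report expects class labels, so the probabilities are thresholded first
# (the 0.5 cut-off is an assumption)
print(classification_report(real_values.values, (predictions >= 0.5).astype(int)))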

### ROC AUC score

In [None]:
roc_auc_score(real_values.values, predictions)
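As a small worked example of what this score measures, the sketch below uses four made-up labels and probabilities. ROC AUC is the fraction of (positive, negative) pairs in which the positive sample receives the higher score, so it depends only on the ranking of the predictions and needs no decision threshold.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities, made up for the example
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

# Pairs (positive, negative): (0.35, 0.1) ok, (0.35, 0.4) wrong,
# (0.8, 0.1) ok, (0.8, 0.4) ok -> 3 of 4 pairs ranked correctly
auc_raw = roc_auc_score(y_true, y_prob)

# A monotonic transformation keeps the ranking, so the AUC is unchanged
auc_cubed = roc_auc_score(y_true, y_prob ** 3)

print(auc_raw, auc_cubed)  # 0.75 0.75
```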

### Plot ROC AUC

In [None]:
def plot_auc_roc(real_values, predicted_probabilities):
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    fpr["micro"], tpr["micro"], _ = roc_curve(real_values, predicted_probabilities)
    roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
    #Plot of a ROC curve for a specific class
    plt.figure()
    lw = 2
    plt.plot(fpr["micro"], tpr["micro"], color='darkorange',
             lw=lw, label='ROC curve (area = %0.2f)' % roc_auc["micro"])
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

In [None]:
plot_auc_roc(real_values.values, predictions)

In [None]:
roc_auc_score(real_values.values, predictions)

In [None]:
learner[-1]

<br>
<br>
<a id="model_predictions"> </a>
# 6. Model predictions

<br>
## 6.1 Making the test set predictions

Set the is_test flag to True to tell the model to predict on the test set.<br>

In [None]:
test_predictions_array = learner.predict(is_test=True)

In [None]:
test_item_id

This predicts the probabilities for every row of the test set.

In [None]:
np.shape(test_predictions_array)

Check that the output dimension is correct. The number of item ids in the test set should equal the number of predictions.

In [None]:
if np.shape(test_predictions_array)[0] == test_df["item_id"].count():
    print("Output dimension correct")
else:
    print("FATAL ERROR, output dimension doesn't match the number of items")

In [None]:
data.test_ds.cats[0:13]

In [None]:
processed_test_df.head()

<br>
## 6.2 Submission

###  Format

The format of the csv submission file for the challenge: <br><br>
    ID, MDR<br>
    2,0<br>
    5,0<br>
    6,0<br>

###  Item id dataframe
We use the test_item_id dataframe from the sorted test df.

In [None]:
item_id_df = test_item_id
#item_id_df = test_item_id.reset_index().drop(['activation_date'], axis=1)

In [None]:
item_id_df.head()

###  Test predictions dataframe
We use the predictions array to build the dataframe.

In [None]:
# Column named MDR to match the submission format (ID, MDR)
test_predictions_df = pd.DataFrame(test_predictions_array, columns=['MDR'])

In [None]:
#test_predictions_df = test_predictions_df.clip(0,1)

In [None]:
test_predictions_df.head()

###  Submission dataframe
Concatenate both dataframes to generate the submission dataframe.

In [None]:
submission_df = pd.concat([item_id_df, test_predictions_df], axis=1)
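One detail worth checking at this step: `pd.concat` with `axis=1` aligns rows by index label, not by position. The sketch below uses made-up ids and probabilities to show what happens when the id dataframe keeps a non-default index, and how resetting it gives the expected `ID, MDR` layout.

```python
import numpy as np
import pandas as pd

# Toy ids with a leftover, non-default index, and toy predictions
ids = pd.DataFrame({"ID": [2, 5, 6]}, index=[10, 11, 12])
preds = pd.DataFrame(np.array([0.1, 0.7, 0.2]), columns=["MDR"])

# Index labels do not match: the result has 6 rows padded with NaN
misaligned = pd.concat([ids, preds], axis=1)

# Resetting the index makes both frames align row by row
aligned = pd.concat([ids.reset_index(drop=True), preds], axis=1)

csv_text = aligned.to_csv(index=False)
print(csv_text)
```

If the two dataframes already share the same default index, the reset is a no-op and can be dropped.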

In [None]:
submission_df

<br>
### Submission csv file

Define the submission name and path.

In [None]:
submission_filename = "submission3"
submission_path = f"{INPUT_PATH}{s}submission{s}{submission_filename}.csv"

Create the submission file without the index column.

In [None]:
submission_df.to_csv(submission_path,index=False)

Generate a link to a direct download of the submission file.

In [None]:
FileLink(submission_path)

Visualize sample submission.

In [None]:
sample_submission_df = pd.read_csv(f"{INPUT_PATH}{s}submission{s}sample_submission.csv")

In [None]:
sample_submission_df