## Import Libraries

In [1]:
import pandas as pd
import numpy as np

## Authorization for Hospital Admission Data

The AIH dataset contains data on hospital production and services in Brazil. The data used here are the Authorization for Hospital Admission records. This dataset is part of `Brazil’s SIHSUS Hospital Information System`, which manages coordination and payment for Brazil’s public healthcare system (covering around 34% of Brazil’s population). In this application, I use data from 2015 to 2018, which represents 3.5 years of information.

A record in the AIH database is created when a hospital or healthcare unit generates a request for hospitalization. Providers submit demographic and health information about the patient. This request is ultimately approved or rejected. While the patient is in the hospital, the record is updated to also contain information about procedures performed and discharge. 

More information about this data can be found below: 

* [DataSUS Website](http://datasus.saude.gov.br/informacoes-de-saude)
* [AIH Data Fields](https://github.com/IvetteMTapia/Capstone-2_Deep_Learning/blob/master/IT_SIHSUS_1603_DataDict.pdf)

## Create Dictionary of Variable Definitions for Reference

This dictionary holds the type and description of each variable in the dataset.

In [2]:
var_spread_path = ('/Users/ivettetapia 1/Symbolic Link Seagate Drive/Springboard/Capstone 2_Deep_Learning/References/IT_SIHSUS_1603_DataDict.xlsx')

var_df = pd.read_excel(var_spread_path, index_col = 'Field_Name')
var_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 113 entries, UF_ZI to TPDISEC9
Data columns (total 2 columns):
Type of Field    113 non-null object
Description      113 non-null object
dtypes: object(2)
memory usage: 2.6+ KB


In [3]:
var_def_dict = var_df.to_dict(orient = 'index')
var_def_dict

{'UF_ZI': {'Type of Field': 'char(6)', 'Description': 'Municipality Manager'},
 'ANO_CMPT': {'Type of Field': 'char(4)',
  'Description': 'Year of AIH processing, in yyyy format.'},
 'MÊS_CMPT': {'Type of Field': 'char(2)',
  'Description': 'Month of AIH processing, in mm format.'},
 'ESPEC': {'Type of Field': 'char(2)', 'Description': 'Specialty of Bed'},
 'CGC_HOSP': {'Type of Field': 'char(14)',
  'Description': 'CNPJ of the Establishment'},
 'N_AIH': {'Type of Field': 'char(13)', 'Description': 'Number of AIH'},
 'IDENT': {'Type of Field': 'char(1)',
  'Description': 'Identification of the type of AIH'},
 'CEP': {'Type of Field': 'char(8)', 'Description': 'CEP of the patient'},
 'MUNIC_RES': {'Type of Field': 'char(6)',
  'Description': "Municipality of Patient's Residence"},
 'NASC': {'Type of Field': 'char(8)',
  'Description': 'Date of birth of the patient (yyyammdd)'},
 'SEXO': {'Type of Field': 'char(1)', 'Description': 'Sex of patient'},
 'UTI_MES_IN': {'Type of Field': 'nume
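Once built, the dictionary can be queried by field name. The following is a minimal sketch, assuming the `{'Type of Field': ..., 'Description': ...}` layout shown in the output above. The helper `describe_field` and the two-entry stand-in dictionary are illustrative and not part of the original notebook.

```python
# Stand-in with the same layout as var_def_dict (two entries copied from the output above).
var_def_dict_demo = {
    'UF_ZI': {'Type of Field': 'char(6)', 'Description': 'Municipality Manager'},
    'SEXO': {'Type of Field': 'char(1)', 'Description': 'Sex of patient'},
}


def describe_field(definitions, field_name):
    """Return 'NAME (type): description', or a notice if the field is unknown."""
    entry = definitions.get(field_name)
    if entry is None:
        return f"{field_name}: not found in data dictionary"
    return f"{field_name} ({entry['Type of Field']}): {entry['Description']}"


print(describe_field(var_def_dict_demo, 'SEXO'))
```

With the real dictionary, the same call would be `describe_field(var_def_dict, 'SEXO')`.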

## Upload Data

Load the random sample for each year from its CSV file.

In [2]:
%%time

# 2015 Sample

AIH_sample_2015_path = ('/Users/ivettetapia 1/Symbolic Link Seagate Drive/Springboard/Capstone 2_Deep_Learning/Data/Random Sample File/AIH_random_sample_2015.csv')

AIH_sample_2015 = pd.read_csv(AIH_sample_2015_path, 
                              encoding = 'UTF-8', 
                              na_values= ['NaN',' ',''], 
                              low_memory=True)



CPU times: user 1min 31s, sys: 18.2 s, total: 1min 49s
Wall time: 1min 48s


In [3]:
%%time

# 2016 Sample

AIH_sample_2016_path = ('/Users/ivettetapia 1/Symbolic Link Seagate Drive/Springboard/Capstone 2_Deep_Learning/Data/Random Sample File/AIH_random_sample_2016.csv')

AIH_sample_2016 = pd.read_csv(AIH_sample_2016_path, 
                              encoding = 'UTF-8', 
                              na_values= ['NaN',' ',''], 
                              low_memory=True)



CPU times: user 1min 31s, sys: 19 s, total: 1min 50s
Wall time: 1min 49s


In [4]:
%%time

# 2017 Sample

AIH_sample_2017_path = ('/Users/ivettetapia 1/Symbolic Link Seagate Drive/Springboard/Capstone 2_Deep_Learning/Data/Random Sample File/AIH_random_sample_2017.csv')

AIH_sample_2017 = pd.read_csv(AIH_sample_2017_path, 
                              encoding = 'UTF-8', 
                              na_values= ['NaN',' ',''], 
                              low_memory=True)



CPU times: user 1min 31s, sys: 18.7 s, total: 1min 50s
Wall time: 1min 49s


In [5]:
%%time

# 2018 Sample

AIH_sample_2018_path = ('/Users/ivettetapia 1/Symbolic Link Seagate Drive/Springboard/Capstone 2_Deep_Learning/Data/Random Sample File/AIH_random_sample_2018.csv')

AIH_sample_2018 = pd.read_csv(AIH_sample_2018_path, 
                              encoding = 'UTF-8', 
                              na_values= ['NaN',' ',''], 
                              low_memory=True)



CPU times: user 54.8 s, sys: 9.33 s, total: 1min 4s
Wall time: 1min 3s
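The four cells above differ only in the year, so they could be folded into one loop. This is a sketch rather than the notebook's actual code: the helper `load_yearly_samples` is hypothetical, and the demo reads tiny throwaway files instead of the real samples. The `read_csv` options are copied from the cells above.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_yearly_samples(data_dir, years, pattern="AIH_random_sample_{year}.csv"):
    """Read one CSV per year with the same options as the cells above."""
    frames = {}
    for year in years:
        path = Path(data_dir) / pattern.format(year=year)
        frames[year] = pd.read_csv(path,
                                   encoding="UTF-8",
                                   na_values=["NaN", " ", ""],
                                   low_memory=True)
    return frames


# Demo on tiny throwaway files; the real files live in the author's data directory.
with tempfile.TemporaryDirectory() as tmp:
    for year in (2015, 2016):
        Path(tmp, f"AIH_random_sample_{year}.csv").write_text(
            f"ANO_CMPT,SEXO\n{year},1\n{year},3\n", encoding="utf-8")
    samples = load_yearly_samples(tmp, [2015, 2016])

print({year: df.shape for year, df in samples.items()})
```

Keeping the frames in a dictionary keyed by year makes it easy to pass `samples.values()` to `pd.concat` afterwards.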


## Concatenate Samples

Concatenate all the samples into one dataframe.

In [6]:
AIH_sample = pd.concat([AIH_sample_2015,
                        AIH_sample_2016, 
                        AIH_sample_2017,
                        AIH_sample_2018],
                        ignore_index = True)

In [7]:
AIH_sample.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16614830 entries, 0 to 16614829
Data columns (total 113 columns):
UF_ZI         16614830 non-null int64
ANO_CMPT      16614830 non-null int64
MES_CMPT      16614830 non-null int64
ESPEC         16614830 non-null int64
CGC_HOSP      12629682 non-null float64
N_AIH         16614830 non-null int64
IDENT         16614830 non-null int64
CEP           16614830 non-null int64
MUNIC_RES     16614830 non-null int64
NASC          16614830 non-null int64
SEXO          16614830 non-null int64
UTI_MES_IN    16614830 non-null int64
UTI_MES_AN    16614830 non-null int64
UTI_MES_AL    16614830 non-null int64
UTI_MES_TO    16614830 non-null int64
MARCA_UTI     16614830 non-null int64
UTI_INT_IN    16614830 non-null int64
UTI_INT_AN    16614830 non-null int64
UTI_INT_AL    16614830 non-null int64
UTI_INT_TO    16614830 non-null int64
DIAR_ACOM     16614830 non-null int64
QT_DIARIAS    16614830 non-null int64
PROC_SOLIC    16614830 non-null int64
PROC_REA

+ Memory usage of the dataset is 14+ GB.
+ The sample has 16,614,830 observations, about 40% of the 41,648,222 observations available for 2015 - 2018.
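The sampling share quoted above can be checked directly from the two row counts. Both numbers are taken from this notebook; only the variable names are introduced here.

```python
sample_rows = 16_614_830   # rows in the concatenated sample (from the info() output)
total_rows = 41_648_222    # total observations stated for 2015 - 2018

fraction = sample_rows / total_rows
print(f"{fraction:.1%}")   # 39.9%
```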

In [8]:
AIH_sample.head(10)

Unnamed: 0,UF_ZI,ANO_CMPT,MES_CMPT,ESPEC,CGC_HOSP,N_AIH,IDENT,CEP,MUNIC_RES,NASC,...,DIAGSEC9,TPDISEC1,TPDISEC2,TPDISEC3,TPDISEC4,TPDISEC5,TPDISEC6,TPDISEC7,TPDISEC8,TPDISEC9
0,355030,2015,8,3,60922170000000.0,3515115312016,1,4339150,355030,19820321,...,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,530000,2015,8,7,394700000000.0,5315100954273,1,70335900,530010,20150819,...,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,160000,2015,10,2,60975740000000.0,1615100385789,1,68900010,160030,19850903,...,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,150140,2015,3,1,4938437000000.0,1515101116320,1,67010000,150080,20030209,...,,2,0,0,0.0,0.0,0.0,0.0,0.0,0.0
4,310620,2015,12,1,19843930000000.0,3115109220069,1,39725000,315750,19810624,...,,2,0,0,0.0,0.0,0.0,0.0,0.0,0.0
5,314800,2015,8,3,23347960000000.0,3115117350950,1,38770000,313630,19980618,...,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
6,520110,2015,7,1,,5215101323388,1,75083440,520110,19711120,...,,2,0,0,0.0,0.0,0.0,0.0,0.0,0.0
7,230250,2015,12,3,,2315107999820,1,63260000,230250,19361001,...,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
8,355370,2015,11,3,72127210000000.0,3515123985241,1,15900000,355370,19500623,...,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
9,350000,2015,1,2,46374500000000.0,3514124422072,1,6900000,351510,19880824,...,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
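The data dictionary lists identifier fields such as `CEP` as `char(8)`, yet `info()` reports `int64` and the first row above shows a seven-digit `CEP` (4339150), which suggests that a leading zero was dropped during parsing. One possible remedy, sketched here on an illustrative two-row CSV rather than the real files, is to pass `dtype` to `pd.read_csv` so that code-like columns are kept as strings.

```python
import io

import pandas as pd

# Illustrative two-row CSV; the CEP in the first row starts with a zero.
csv_text = "CEP,SEXO\n04339150,1\n70335900,3\n"

as_numbers = pd.read_csv(io.StringIO(csv_text))
as_strings = pd.read_csv(io.StringIO(csv_text), dtype={"CEP": str})

print(as_numbers["CEP"].tolist())  # parsed as integers, leading zero is lost
print(as_strings["CEP"].tolist())  # codes keep all eight characters
```

Whether this matters depends on how the codes are used later. String columns also take more memory than integers, which is a real cost for a sample of this size.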


## NaN Values Represented by Zeroes

In some columns, missing values are represented by zeroes. Replace these zeroes with NaN.

In [9]:
cols = ["TPDISEC1","TPDISEC2","TPDISEC3",
        "TPDISEC4","TPDISEC5","TPDISEC6",
        "TPDISEC7","TPDISEC8","TPDISEC9"]

In [10]:
# 0 and 0.0 compare equal, so one replacement covers both the int and float columns.
AIH_sample[cols] = AIH_sample[cols].replace(0, np.nan)
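The effect of this replacement is easiest to see on a small frame. The sketch below uses made-up values under two of the real column names. It also shows a side effect worth knowing: an integer column is upcast to `float64` so that it can hold NaN.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names come from the dataset.
demo = pd.DataFrame({"TPDISEC1": [0, 2, 0], "TPDISEC4": [0.0, 0.0, 1.0]})
cleaned = demo.replace(0, np.nan)

print(cleaned.dtypes)        # the integer column becomes float64 to hold NaN
print(cleaned.isna().sum())  # missing values per column
```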

In [11]:
AIH_sample.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16614830 entries, 0 to 16614829
Data columns (total 113 columns):
UF_ZI         16614830 non-null int64
ANO_CMPT      16614830 non-null int64
MES_CMPT      16614830 non-null int64
ESPEC         16614830 non-null int64
CGC_HOSP      12629682 non-null float64
N_AIH         16614830 non-null int64
IDENT         16614830 non-null int64
CEP           16614830 non-null int64
MUNIC_RES     16614830 non-null int64
NASC          16614830 non-null int64
SEXO          16614830 non-null int64
UTI_MES_IN    16614830 non-null int64
UTI_MES_AN    16614830 non-null int64
UTI_MES_AL    16614830 non-null int64
UTI_MES_TO    16614830 non-null int64
MARCA_UTI     16614830 non-null int64
UTI_INT_IN    16614830 non-null int64
UTI_INT_AN    16614830 non-null int64
UTI_INT_AL    16614830 non-null int64
UTI_INT_TO    16614830 non-null int64
DIAR_ACOM     16614830 non-null int64
QT_DIARIAS    16614830 non-null int64
PROC_SOLIC    16614830 non-null int64
PROC_REA

## Investigate Columns Described in the Data Dictionary as 'Reset'

In [12]:
cols_2 = ['UTI_MES_IN','UTI_MES_AN','UTI_MES_AL', 'TOT_PT_SP',
          'VAL_SADT','VAL_RN','VAL_ACOMP', 'VAL_ORTP','VAL_SANGUE',
          'VAL_SADTSR','VAL_TRANSP', 'VAL_OBSANG','VAL_PED1AC',
          'UTI_INT_IN','UTI_INT_AN','UTI_INT_AL','RUBRICA']

In [13]:
AIH_sample[cols_2].describe()

Unnamed: 0,UTI_MES_IN,UTI_MES_AN,UTI_MES_AL,TOT_PT_SP,VAL_SADT,VAL_RN,VAL_ACOMP,VAL_ORTP,VAL_SANGUE,VAL_SADTSR,VAL_TRANSP,VAL_OBSANG,VAL_PED1AC,UTI_INT_IN,UTI_INT_AN,UTI_INT_AL,RUBRICA
count,16614830.0,16614830.0,16614830.0,16614830.0,16614830.0,16614830.0,16614830.0,16614830.0,16614830.0,16614830.0,16614830.0,16614830.0,16614830.0,16614830.0,16614830.0,16614830.0,16614830.0
mean,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
std,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


+ These columns are filled entirely with zeroes and do not provide any information.
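Instead of reading the `describe()` table by eye, all-zero columns can be found programmatically. This is a minimal sketch with a hypothetical helper `all_zero_columns` and made-up values. Columns that contain NaN are not reported, because NaN does not compare equal to zero.

```python
import pandas as pd


def all_zero_columns(df):
    """Return names of columns in which every value equals zero."""
    return [col for col in df.columns if df[col].eq(0).all()]


# Made-up values; only the column names come from the dataset.
demo = pd.DataFrame({"VAL_RN": [0.0, 0.0, 0.0],
                     "VAL_TOT": [120.5, 0.0, 48.0],
                     "RUBRICA": [0, 0, 0]})
print(all_zero_columns(demo))
```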

## Drop Columns

Drop columns that are either **empty** (entirely zero) or have more than **20% missing values**.

In [14]:
drop_cols = ['UTI_MES_IN','UTI_MES_AN','UTI_MES_AL', 'TOT_PT_SP',
             'UTI_INT_IN','UTI_INT_AN','UTI_INT_AL','RUBRICA',
             'VAL_SADT','VAL_RN','VAL_ACOMP', 'VAL_ORTP','VAL_SANGUE', 
             'VAL_SADTSR','VAL_TRANSP', 'VAL_OBSANG','VAL_PED1AC',
             'NUM_PROC', 'CPF_AUT','CID_NOTIF','GESTOR_DT', 'CNPJ_MANT', 
             'FAEC_TP','INFEHOSP', 'AUD_JUST','SIS_JUST', 'TPDISEC1',
             'TPDISEC2','TPDISEC3', 'TPDISEC4','TPDISEC5','TPDISEC6',
             'TPDISEC7','TPDISEC8','TPDISEC9','DIAGSEC1','DIAGSEC2',
             'DIAGSEC3','DIAGSEC4','DIAGSEC5','DIAGSEC6','DIAGSEC7',
             'DIAGSEC8','DIAGSEC9']

In [15]:
AIH_sample_drop = AIH_sample.drop(labels = drop_cols, axis = 1)

In [16]:
AIH_sample_drop.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16614830 entries, 0 to 16614829
Data columns (total 69 columns):
UF_ZI         16614830 non-null int64
ANO_CMPT      16614830 non-null int64
MES_CMPT      16614830 non-null int64
ESPEC         16614830 non-null int64
CGC_HOSP      12629682 non-null float64
N_AIH         16614830 non-null int64
IDENT         16614830 non-null int64
CEP           16614830 non-null int64
MUNIC_RES     16614830 non-null int64
NASC          16614830 non-null int64
SEXO          16614830 non-null int64
UTI_MES_TO    16614830 non-null int64
MARCA_UTI     16614830 non-null int64
UTI_INT_TO    16614830 non-null int64
DIAR_ACOM     16614830 non-null int64
QT_DIARIAS    16614830 non-null int64
PROC_SOLIC    16614830 non-null int64
PROC_REA      16614830 non-null int64
VAL_SH        16614830 non-null float64
VAL_SP        16614830 non-null float64
VAL_TOT       16614830 non-null float64
VAL_UTI       16614830 non-null float64
US_TOT        16614830 non-null float64

* 44 columns were dropped; 69 columns remain.
* Memory usage decreased from 14+ GB to 8.5+ GB.
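The drop list above is hard-coded. As a sketch of how the 20% missing-value rule could be applied programmatically, the hypothetical helper below selects columns by their share of NaN values, shown on made-up data. A computed list may differ from the author's: for example, the `info()` output shows `CGC_HOSP` with about 24% missing values (12,629,682 non-null out of 16,614,830), yet it is kept.

```python
import numpy as np
import pandas as pd


def columns_above_missing_share(df, threshold=0.20):
    """Return names of columns whose share of NaN values exceeds threshold."""
    missing_share = df.isna().mean()
    return missing_share[missing_share > threshold].index.tolist()


# Made-up values; only the column names come from the dataset.
demo = pd.DataFrame({"DIAGSEC1": [np.nan, np.nan, "A09", np.nan, np.nan],
                     "INFEHOSP": [np.nan] * 5,
                     "SEXO": [1, 3, 1, 1, 3]})
to_drop = columns_above_missing_share(demo)
print(to_drop)
print(demo.drop(columns=to_drop).columns.tolist())
```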

In [17]:
%%time

# Write the full random sample to CSV. It does not contain empty or sparsely populated columns.

AIH_sample_drop.to_csv('AIH_random_sample_full.csv', index = False,
                       na_rep= 'NaN', encoding='utf-8',
                       chunksize = 50000)

CPU times: user 18min 36s, sys: 57.7 s, total: 19min 34s
Wall time: 20min 2s
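Because the exported file is large, a quick check that every row was written can be done by streaming it back in chunks instead of loading it whole. The sketch below is an assumption about how such a check might look. It uses a small in-memory stand-in written with the same `na_rep` setting, and the helper `count_rows_in_chunks` is hypothetical.

```python
import io

import pandas as pd


def count_rows_in_chunks(source, chunksize=50_000):
    """Count data rows in a CSV by streaming it chunk by chunk."""
    total = 0
    for chunk in pd.read_csv(source, chunksize=chunksize):
        total += len(chunk)
    return total


# Small stand-in for the exported file: write a frame the same way, then re-read it.
demo = pd.DataFrame({"N_AIH": [1, 2, 3, 4, 5],
                     "CGC_HOSP": [1.0, None, 3.0, None, 5.0]})
buffer = io.StringIO()
demo.to_csv(buffer, index=False, na_rep="NaN")
buffer.seek(0)

print(count_rows_in_chunks(buffer, chunksize=2))
```

For the real file, the call would take the path `'AIH_random_sample_full.csv'`, and the result would be compared with the 16,614,830 rows reported by `info()`.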


## End of the Notebook 