## Import Libraries

In [4]:
import pandas as pd
import numpy as np
import random
from random import sample

## Authorization for Hospital Admission Data

The AIH dataset contains data on hospital production and services in Brazil. The data used here is the Authorization for Hospital Admission (AIH). This dataset is part of `Brazil’s SIHSUS Hospital Information System`, which manages coordination and payment for Brazil’s public healthcare system (covers around 34% of Brazil’s population). In this application, I will be using data from 2015 – 2018. This represents 3.5 years of information.

A record in the AIH database is created when a hospital or healthcare unit generates a request for hospitalization. Providers submit demographic and health information about the patient. This request is ultimately approved or rejected. While the patient is in the hospital, the record is updated to also contain information about procedures performed and discharge. 

More information about this data can be found below: 

* [DataSUS Website](http://datasus.saude.gov.br/informacoes-de-saude)
* [AIH Data Fields](https://github.com/IvetteMTapia/Capstone-2_Deep_Learning/blob/master/IT_SIHSUS_1603_DataDict.pdf)


## Data Pre-Processing Information

* Due to the size of the files, the data was extracted to a local machine. The extraction website can be found at: [DataSUS public file download site](http://www2.datasus.gov.br/DATASUS/index.php?area=0901).

* The files are distributed in the .dbc format, which is a compressed version of .dbf files. This format is proprietary and is used by Brazil's IT department to distribute the large files in their database.

* I have already done the pre-processing step of converting the .dbc files to .csv files using R. The R environment has a package written specifically to read and decompress these types of files. You can find the R code used for the conversion [here](https://github.com/IvetteMTapia/Capstone-2_Deep_Learning/blob/master/Convert%20from%20dbc%20to%20CSV.R).

## Create Dictionary of Variable Definitions for Reference

*This dictionary contains the type and description of each dataset variable.*

In [5]:
var_spread_path = ('/Users/ivettetapia 1/Symbolic Link Seagate Drive/Springboard/Capstone 2_Deep_Learning/References/IT_SIHSUS_1603_DataDict.xlsx')

var_df = pd.read_excel(var_spread_path, index_col = 'Field_Name')
var_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 113 entries, UF_ZI to TPDISEC9
Data columns (total 2 columns):
Type of Field    113 non-null object
Description      113 non-null object
dtypes: object(2)
memory usage: 2.6+ KB


In [6]:
var_def_dict = var_df.to_dict(orient = 'index')
var_def_dict

{'UF_ZI': {'Type of Field': 'char(6)', 'Description': 'Municipality Manager'},
 'ANO_CMPT': {'Type of Field': 'char(4)',
  'Description': 'Year of AIH processing, in yyyy format.'},
 'MÊS_CMPT': {'Type of Field': 'char(2)',
  'Description': 'Month of AIH processing, in mm format.'},
 'ESPEC': {'Type of Field': 'char(2)', 'Description': 'Specialty of Bed'},
 'CGC_HOSP': {'Type of Field': 'char(14)',
  'Description': 'CNPJ of the Establishment'},
 'N_AIH': {'Type of Field': 'char(13)', 'Description': 'Number of AIH'},
 'IDENT': {'Type of Field': 'char(1)',
  'Description': 'Identification of the type of AIH'},
 'CEP': {'Type of Field': 'char(8)', 'Description': 'CEP of the patient'},
 'MUNIC_RES': {'Type of Field': 'char(6)',
  'Description': "Municipality of Patient's Residence"},
 'NASC': {'Type of Field': 'char(8)',
  'Description': 'Date of birth of the patient (yyyammdd)'},
 'SEXO': {'Type of Field': 'char(1)', 'Description': 'Sex of patient'},
 'UTI_MES_IN': {'Type of Field': 'nume
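
A minimal sketch of how a nested dictionary shaped like `var_def_dict` could be queried while exploring the data. The helper name `describe_field` and the two-entry excerpt are hypothetical; the entries are copied from the output above. The fallback branch matters because a dictionary key and a CSV column name can differ (the dictionary lists `MÊS_CMPT`, while the CSV column is `MES_CMPT`).

```python
# Hypothetical two-entry excerpt shaped like var_def_dict
# (entries copied from the dictionary output above).
var_def_excerpt = {
    'UF_ZI': {'Type of Field': 'char(6)', 'Description': 'Municipality Manager'},
    'SEXO': {'Type of Field': 'char(1)', 'Description': 'Sex of patient'},
}


def describe_field(definitions, field_name):
    """Return 'NAME (type): description', or a fallback for unknown fields."""
    entry = definitions.get(field_name)
    if entry is None:
        return f'{field_name}: no definition found'
    return f"{field_name} ({entry['Type of Field']}): {entry['Description']}"


print(describe_field(var_def_excerpt, 'SEXO'))
print(describe_field(var_def_excerpt, 'MES_CMPT'))  # not in the excerpt, so the fallback is used
```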

## Sample file

*The full yearly files take much longer to load, so a small sample file is opened first to get a sense of the data before loading the large ones.*

In [2]:
sample_2018_path = ('/Users/ivettetapia 1/Symbolic Link Seagate Drive/Springboard/Capstone 2_Deep_Learning/Data/2- Converted Files (dbc to csv)/AIH_sample_2018.csv')

*Open the small sample file from 2018. Bring in all available columns.*

In [3]:
%%time

sample_2018 = pd.read_csv(sample_2018_path, 
                          encoding = 'UTF-8', 
                          na_values= ['NA',' ',''], 
                          low_memory=True)

CPU times: user 82 ms, sys: 21.1 ms, total: 103 ms
Wall time: 171 ms


In [6]:
sample_2018.info(verbose = True, null_counts = True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4608 entries, 0 to 4607
Data columns (total 113 columns):
UF_ZI         4608 non-null int64
ANO_CMPT      4608 non-null int64
MES_CMPT      4608 non-null int64
ESPEC         4608 non-null int64
CGC_HOSP      4574 non-null float64
N_AIH         4608 non-null int64
IDENT         4608 non-null int64
CEP           4608 non-null int64
MUNIC_RES     4608 non-null int64
NASC          4608 non-null int64
SEXO          4608 non-null int64
UTI_MES_IN    4608 non-null int64
UTI_MES_AN    4608 non-null int64
UTI_MES_AL    4608 non-null int64
UTI_MES_TO    4608 non-null int64
MARCA_UTI     4608 non-null int64
UTI_INT_IN    4608 non-null int64
UTI_INT_AN    4608 non-null int64
UTI_INT_AL    4608 non-null int64
UTI_INT_TO    4608 non-null int64
DIAR_ACOM     4608 non-null int64
QT_DIARIAS    4608 non-null int64
PROC_SOLIC    4608 non-null int64
PROC_REA      4608 non-null int64
VAL_SH        4608 non-null float64
VAL_SP        4608 non-null float64
VA

In [7]:
sample_2018.head()

Unnamed: 0,UF_ZI,ANO_CMPT,MES_CMPT,ESPEC,CGC_HOSP,N_AIH,IDENT,CEP,MUNIC_RES,NASC,...,DIAGSEC9,TPDISEC1,TPDISEC2,TPDISEC3,TPDISEC4,TPDISEC5,TPDISEC6,TPDISEC7,TPDISEC8,TPDISEC9
0,120000,2018,1,3,4034526000000.0,1218100020863,1,69982000,120039,20070713,...,,1,0,0,0,0,0,0,0,0
1,120000,2018,1,3,4034526000000.0,1218100020885,1,69970000,120060,20160606,...,,0,0,0,0,0,0,0,0,0
2,120000,2018,1,3,4034526000000.0,1218100020896,1,69982000,120039,19890214,...,,0,0,0,0,0,0,0,0,0
3,120000,2018,1,3,4034526000000.0,1218100020907,1,69980000,120020,19721206,...,,0,0,0,0,0,0,0,0,0
4,120000,2018,1,3,4034526000000.0,1218100022788,1,69980000,120020,19720809,...,,0,0,0,0,0,0,0,0,0
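
The head above shows an `Unnamed: 0` column (a row index saved in the CSV) and `CGC_HOSP` parsed as a float, even though the data dictionary describes identifier fields such as `CGC_HOSP` and `CEP` as character fields. The sketch below shows one possible way to handle both at read time using `index_col` and `dtype`. The three-column CSV text and its identifier values are made up for illustration, and whether this matters depends on how the identifiers are used later.

```python
import io

import pandas as pd

# Made-up excerpt: an unnamed first column (row index) plus identifier
# columns that contain a leading zero and a missing value.
csv_text = (
    ',UF_ZI,CGC_HOSP,CEP\n'
    '0,120000,04034526000142,69982000\n'
    '1,120000,NA,01310100\n'
)

excerpt = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,                           # use the saved row index instead of "Unnamed: 0"
    dtype={'CGC_HOSP': str, 'CEP': str},   # keep identifiers as text
    na_values=['NA', ' ', ''],
)

print(excerpt.dtypes)
print(excerpt)
```

Reading identifiers as strings keeps leading zeros and avoids the float representation that appears when a numeric column contains missing values.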


## Upload AIH Data to Pandas DataFrames

*Upload the 2015 - 2018 AIH data contained in the pre-processed CSV files.*

> **AIH 2015 Data Upload**

In [7]:
%%time 

#Path to 2015 data in the local machine

aih_2015_path = ('/Users/ivettetapia 1/Symbolic Link Seagate Drive/Springboard/Capstone 2_Deep_Learning/Data/2- Converted Files (dbc to csv)/AIH_RD_2015.csv')

#Read to pandas df

aih_2015 = pd.read_csv(aih_2015_path, 
                       encoding = 'UTF-8', 
                       na_values= ['NA',' ',''], 
                       low_memory=True)



CPU times: user 4min 24s, sys: 1min 2s, total: 5min 26s
Wall time: 5min 50s
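
Each yearly file takes several minutes to load in full. If only a quick check is needed (for example, a row count), one option is to stream the file in chunks with the `chunksize` argument of `pd.read_csv`. The sketch below is not what this notebook does; it uses a tiny in-memory CSV as a stand-in for a yearly file.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for a large yearly CSV (10 data rows).
csv_text = 'UF_ZI,ANO_CMPT\n' + '120000,2015\n' * 10

total_rows = 0
# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
for chunk in pd.read_csv(io.StringIO(csv_text),
                         chunksize=4,
                         na_values=['NA', ' ', '']):
    total_rows += len(chunk)

print('Rows counted in chunks:', total_rows)
```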


In [6]:
# See AIH 2015 data info

aih_2015.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11638853 entries, 0 to 11638852
Data columns (total 113 columns):
UF_ZI         11638853 non-null int64
ANO_CMPT      11638853 non-null int64
MES_CMPT      11638853 non-null int64
ESPEC         11638853 non-null int64
CGC_HOSP      8609798 non-null float64
N_AIH         11638853 non-null int64
IDENT         11638853 non-null int64
CEP           11638853 non-null int64
MUNIC_RES     11638853 non-null int64
NASC          11638853 non-null int64
SEXO          11638853 non-null int64
UTI_MES_IN    11638853 non-null int64
UTI_MES_AN    11638853 non-null int64
UTI_MES_AL    11638853 non-null int64
UTI_MES_TO    11638853 non-null int64
MARCA_UTI     11638853 non-null int64
UTI_INT_IN    11638853 non-null int64
UTI_INT_AN    11638853 non-null int64
UTI_INT_AL    11638853 non-null int64
UTI_INT_TO    11638853 non-null int64
DIAR_ACOM     11638853 non-null int64
QT_DIARIAS    11638853 non-null int64
PROC_SOLIC    11638853 non-null int64
PROC_REA 

In [10]:
# See AIH data 2015 head

aih_2015.head()

Unnamed: 0,UF_ZI,ANO_CMPT,MES_CMPT,ESPEC,CGC_HOSP,N_AIH,IDENT,CEP,MUNIC_RES,NASC,...,DIAGSEC9,TPDISEC1,TPDISEC2,TPDISEC3,TPDISEC4,TPDISEC5,TPDISEC6,TPDISEC7,TPDISEC8,TPDISEC9
0,120000,2015,1,2,4034526000000.0,1215100060867,1,69985000,120042,19950825,...,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,120000,2015,1,3,4034526000000.0,1215100060383,1,69980000,120020,19941127,...,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,120000,2015,1,3,4034526000000.0,1215100060559,1,69980000,120020,19890308,...,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,120000,2015,1,3,4034526000000.0,1215100060658,1,69980000,120020,19980725,...,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
4,120000,2015,1,3,4034526000000.0,1215100060911,1,69980000,120020,19990205,...,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


> **AIH 2016 Data Upload**

In [8]:
%%time 

#Path to 2016 data in the local machine

aih_2016_path = ('/Users/ivettetapia 1/Symbolic Link Seagate Drive/Springboard/Capstone 2_Deep_Learning/Data/2- Converted Files (dbc to csv)/AIH_RD_2016.csv')

#Read to pandas df

aih_2016 = pd.read_csv(aih_2016_path, 
                       encoding = 'UTF-8', 
                       na_values= ['NA',' ',''], 
                       low_memory=True)



CPU times: user 4min 29s, sys: 1min 2s, total: 5min 32s
Wall time: 6min 5s


In [12]:
# See AIH 2016 data info

aih_2016.info(verbose = True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11527712 entries, 0 to 11527711
Data columns (total 113 columns):
UF_ZI         11527712 non-null int64
ANO_CMPT      11527712 non-null int64
MES_CMPT      11527712 non-null int64
ESPEC         11527712 non-null int64
CGC_HOSP      8783440 non-null float64
N_AIH         11527712 non-null int64
IDENT         11527712 non-null int64
CEP           11527712 non-null int64
MUNIC_RES     11527712 non-null int64
NASC          11527712 non-null int64
SEXO          11527712 non-null int64
UTI_MES_IN    11527712 non-null int64
UTI_MES_AN    11527712 non-null int64
UTI_MES_AL    11527712 non-null int64
UTI_MES_TO    11527712 non-null int64
MARCA_UTI     11527712 non-null int64
UTI_INT_IN    11527712 non-null int64
UTI_INT_AN    11527712 non-null int64
UTI_INT_AL    11527712 non-null int64
UTI_INT_TO    11527712 non-null int64
DIAR_ACOM     11527712 non-null int64
QT_DIARIAS    11527712 non-null int64
PROC_SOLIC    11527712 non-null int64
PROC_REA 

In [13]:
# See AIH 2016 head

aih_2016.head()

Unnamed: 0,UF_ZI,ANO_CMPT,MES_CMPT,ESPEC,CGC_HOSP,N_AIH,IDENT,CEP,MUNIC_RES,NASC,...,DIAGSEC9,TPDISEC1,TPDISEC2,TPDISEC3,TPDISEC4,TPDISEC5,TPDISEC6,TPDISEC7,TPDISEC8,TPDISEC9
0,120000,2016,1,3,63602940000000.0,1216100041772,1,69930000,120070,19961208,...,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,120000,2016,1,3,63602940000000.0,1216100043752,1,69900970,120040,20020705,...,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,120000,2016,1,3,63602940000000.0,1216100046282,1,69900970,120040,19600211,...,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,120000,2016,1,3,63602940000000.0,1216100046315,1,69923000,120040,19990304,...,,1,0,0,0.0,0.0,0.0,0.0,0.0,0.0
4,120000,2016,1,3,63602940000000.0,1216100046337,1,69900970,120040,19821121,...,,1,0,0,0.0,0.0,0.0,0.0,0.0,0.0


> **AIH 2017 Data Upload**

In [9]:
%%time 

#Path to 2017 data in the local machine

aih_2017_path = ('/Users/ivettetapia 1/Symbolic Link Seagate Drive/Springboard/Capstone 2_Deep_Learning/Data/2- Converted Files (dbc to csv)/AIH_RD_2017.csv')


#Read to pandas df

aih_2017 = pd.read_csv(aih_2017_path, 
                       encoding = 'UTF-8', 
                       na_values= ['NA',' ',''], 
                       low_memory=True)



CPU times: user 4min 15s, sys: 1min 1s, total: 5min 17s
Wall time: 5min 37s


In [15]:
# See 2017 data info

aih_2017.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11560960 entries, 0 to 11560959
Data columns (total 113 columns):
UF_ZI         11560960 non-null int64
ANO_CMPT      11560960 non-null int64
MES_CMPT      11560960 non-null int64
ESPEC         11560960 non-null int64
CGC_HOSP      8924721 non-null float64
N_AIH         11560960 non-null int64
IDENT         11560960 non-null int64
CEP           11560960 non-null int64
MUNIC_RES     11560960 non-null int64
NASC          11560960 non-null int64
SEXO          11560960 non-null int64
UTI_MES_IN    11560960 non-null int64
UTI_MES_AN    11560960 non-null int64
UTI_MES_AL    11560960 non-null int64
UTI_MES_TO    11560960 non-null int64
MARCA_UTI     11560960 non-null int64
UTI_INT_IN    11560960 non-null int64
UTI_INT_AN    11560960 non-null int64
UTI_INT_AL    11560960 non-null int64
UTI_INT_TO    11560960 non-null int64
DIAR_ACOM     11560960 non-null int64
QT_DIARIAS    11560960 non-null int64
PROC_SOLIC    11560960 non-null int64
PROC_REA 

In [16]:
# See AIH 2017 head

aih_2017.head()

Unnamed: 0,UF_ZI,ANO_CMPT,MES_CMPT,ESPEC,CGC_HOSP,N_AIH,IDENT,CEP,MUNIC_RES,NASC,...,DIAGSEC9,TPDISEC1,TPDISEC2,TPDISEC3,TPDISEC4,TPDISEC5,TPDISEC6,TPDISEC7,TPDISEC8,TPDISEC9
0,120000,2017,1,3,63602940000000.0,1217100020312,1,69932000,120010,19191228,...,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,120000,2017,1,3,63602940000000.0,1217100020367,1,69940000,120050,19321008,...,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,120000,2017,1,3,63602940000000.0,1217100020829,1,69900970,120040,19850828,...,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,120000,2017,1,1,529443000000.0,1217100002745,1,69900970,120040,20030712,...,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
4,120000,2017,1,1,529443000000.0,1217100009202,1,76801000,110020,20020121,...,,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


> **AIH 2018 Data Upload**

In [10]:
%%time

#Path to 2018 data in the local machine

aih_2018_path = ('/Users/ivettetapia 1/Symbolic Link Seagate Drive/Springboard/Capstone 2_Deep_Learning/Data/2- Converted Files (dbc to csv)/AIH_RD_2018.csv')

#Read to pandas df

aih_2018 = pd.read_csv(aih_2018_path, 
                       encoding = 'UTF-8', 
                       na_values= ['NA',' ',''], 
                       low_memory=True)



CPU times: user 2min 17s, sys: 34.3 s, total: 2min 51s
Wall time: 3min


In [17]:
# See AIH 2018 data info

aih_2018.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6809556 entries, 0 to 6809555
Data columns (total 113 columns):
UF_ZI         6809556 non-null int64
ANO_CMPT      6809556 non-null int64
MES_CMPT      6809556 non-null int64
ESPEC         6809556 non-null int64
CGC_HOSP      5261211 non-null float64
N_AIH         6809556 non-null int64
IDENT         6809556 non-null int64
CEP           6809556 non-null int64
MUNIC_RES     6809556 non-null int64
NASC          6809556 non-null int64
SEXO          6809556 non-null int64
UTI_MES_IN    6809556 non-null int64
UTI_MES_AN    6809556 non-null int64
UTI_MES_AL    6809556 non-null int64
UTI_MES_TO    6809556 non-null int64
MARCA_UTI     6809556 non-null int64
UTI_INT_IN    6809556 non-null int64
UTI_INT_AN    6809556 non-null int64
UTI_INT_AL    6809556 non-null int64
UTI_INT_TO    6809556 non-null int64
DIAR_ACOM     6809556 non-null int64
QT_DIARIAS    6809556 non-null int64
PROC_SOLIC    6809556 non-null int64
PROC_REA      6809556 non-null in

In [18]:
# See AIH data 2018 head

aih_2018.head()

Unnamed: 0,UF_ZI,ANO_CMPT,MES_CMPT,ESPEC,CGC_HOSP,N_AIH,IDENT,CEP,MUNIC_RES,NASC,...,DIAGSEC9,TPDISEC1,TPDISEC2,TPDISEC3,TPDISEC4,TPDISEC5,TPDISEC6,TPDISEC7,TPDISEC8,TPDISEC9
0,120000,2018,1,3,4034526000000.0,1218100020863,1,69982000,120039,20070713,...,,1,0,0,0,0,0,0,0,0.0
1,120000,2018,1,3,4034526000000.0,1218100020885,1,69970000,120060,20160606,...,,0,0,0,0,0,0,0,0,0.0
2,120000,2018,1,3,4034526000000.0,1218100020896,1,69982000,120039,19890214,...,,0,0,0,0,0,0,0,0,0.0
3,120000,2018,1,3,4034526000000.0,1218100020907,1,69980000,120020,19721206,...,,0,0,0,0,0,0,0,0,0.0
4,120000,2018,1,3,4034526000000.0,1218100022788,1,69980000,120020,19720809,...,,0,0,0,0,0,0,0,0,0.0


## Create Random Sample for Each Year

*Calculate the proportion of observations that each year's dataset contributes to the total. The idea here is to make each year's random sample proportional. For example, 2018 has about 40% fewer observations than each of the other years, so I do not want to extract the same number of rows from it as from years with far more observations. If I do not take the number of observations in each year into account, I can over-represent or under-represent certain years' observations.*

In [11]:
# Calculate total AIH observations (2015-2018)

total_obs = len(aih_2015) + len(aih_2016) + len(aih_2017) + len(aih_2018)

print('Total Observations (2015-2018): ',total_obs)

Total Observations (2015-2018):  41537081


*A total of 41,537,081 observations.*

In [12]:
# Calculate the proportion each year is of total observations

prop_obs_2015 = len(aih_2015)/total_obs
prop_obs_2016 = len(aih_2016)/total_obs
prop_obs_2017 = len(aih_2017)/total_obs
prop_obs_2018 = len(aih_2018)/total_obs

In [13]:
# Memory usage of each df

mem_2015 = aih_2015.memory_usage().sum()

mem_2016 = aih_2016.memory_usage().sum()

mem_2017 = aih_2017.memory_usage().sum()

mem_2018 = aih_2018.memory_usage().sum()

print('Memory Usage AIH 2015 dataset (bytes): ', mem_2015, 
      '\nMemory Usage AIH 2016 dataset (bytes): ', mem_2016,
      '\nMemory Usage AIH 2017 dataset (bytes): ', mem_2017,
      '\nMemory Usage AIH 2018 dataset (bytes): ', mem_2018)

# Calculate total memory usage in bytes. The target to sample for this project is 5GB of data.

total_mem = mem_2015 + mem_2016 + mem_2017 + mem_2018

print('Total Memory Usage (bytes): ', total_mem)

Memory Usage AIH 2015 dataset (bytes):  10521523192 
Memory Usage AIH 2016 dataset (bytes):  10421051728 
Memory Usage AIH 2017 dataset (bytes):  10451107920 
Memory Usage AIH 2018 dataset (bytes):  6155838704
Total Memory Usage (bytes):  37549521544


*Total memory of the dataset is ~37.5 GB (about 35 GiB).*
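
A small sketch of the unit conversion behind that figure, using the byte total printed above. The helper names are hypothetical. Note that `memory_usage()` without `deep=True` does not count the contents of object (string) columns, so the true footprint may be larger.

```python
total_mem_bytes = 37549521544  # total printed in the output above


def bytes_to_gb(n_bytes):
    """Decimal gigabytes (1 GB = 1000**3 bytes)."""
    return n_bytes / 1000**3


def bytes_to_gib(n_bytes):
    """Binary gibibytes (1 GiB = 1024**3 bytes)."""
    return n_bytes / 1024**3


print(f'{bytes_to_gb(total_mem_bytes):.1f} GB')
print(f'{bytes_to_gib(total_mem_bytes):.1f} GiB')
```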

In [15]:
# Target rows of the sampled dataset (rows needed to get to a data sample of around 5GB)

target_num_rows = np.round(total_obs * 0.40) # Sample 40% of the available data.

print('Target Number of Total Rows: ',target_num_rows )

Target Number of Total Rows:  16614832.0


In [16]:
# Determine number of rows to be sampled from each dataset. 
# The sampling will be proportional to the number of rows each dataset contributes to the total.

aih_2015_sample_size = int(target_num_rows * prop_obs_2015)

aih_2016_sample_size = int(target_num_rows * prop_obs_2016)

aih_2017_sample_size = int(target_num_rows * prop_obs_2017)

aih_2018_sample_size = int(target_num_rows * prop_obs_2018)

# Calculate total number of rows

total_rows = aih_2015_sample_size + aih_2016_sample_size + aih_2017_sample_size + aih_2018_sample_size

print('Sample AIH 2015 Rows: ', aih_2015_sample_size, 
      '\nSample AIH 2016 Rows: ', aih_2016_sample_size, 
      '\nSample AIH 2017 Rows: ', aih_2017_sample_size, 
      '\nSample AIH 2018 Rows: ', aih_2018_sample_size)

print('Total Number of Rows (in sample): ', total_rows)

Sample AIH 2015 Rows:  4655541 
Sample AIH 2016 Rows:  4611084 
Sample AIH 2017 Rows:  4624383 
Sample AIH 2018 Rows:  2723822
Total Number of Rows (in sample):  16614830
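
The allocation above can also be written as one helper. This is a sketch using the row counts reported earlier in this notebook; the function name `proportional_sample_sizes` is hypothetical. Because each per-year size is truncated with `int`, up to one row per year is lost, which is why the sample total (16,614,830) is slightly below the target (16,614,832).

```python
def proportional_sample_sizes(row_counts, fraction):
    """Split a target sample across groups in proportion to their row counts."""
    total = sum(row_counts.values())
    target = round(total * fraction)  # overall number of rows to sample
    # Truncate each share to a whole number of rows, as in the cell above.
    return {year: int(target * (n / total)) for year, n in row_counts.items()}


# Row counts taken from the .info() outputs above.
row_counts = {2015: 11638853, 2016: 11527712, 2017: 11560960, 2018: 6809556}

sizes = proportional_sample_sizes(row_counts, fraction=0.40)
print(sizes)
print('Total rows in sample:', sum(sizes.values()))
```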


> ### Create a random sample of each year's dataset

*Define helper function to sample each year's dataset.*

In [17]:
def random_row_sample(df, sample_rows, seed = 42):
    
    ''' This function creates a random sample of rows (without replacement) from a dataframe. 
    Parameters are as follows: 
    
        - df = Dataframe to be sampled from. Assumed to have the default RangeIndex.
    
        - sample_rows = Number of rows to sample. Must be <= len(df).
        
        - seed = Random seed to be used. Default is 42.'''
    
    # Seed Python's random module so the sample is reproducible
    random.seed(seed)
    
    # Draw row labels without replacement
    rindex =  np.array(sample(range(len(df)), (sample_rows)))
    
    filtered_df = df.loc[rindex]

    return filtered_df
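
For comparison, pandas has a built-in `DataFrame.sample` method that also draws rows without replacement and accepts a seed through `random_state`. The sketch below uses a small made-up dataframe. It is an alternative, not the method used in this notebook, and it would not select the same rows as `random_row_sample` for the same seed.

```python
import pandas as pd

# Small made-up dataframe standing in for one year's data.
toy = pd.DataFrame({'N_AIH': range(100, 110), 'ANO_CMPT': [2018] * 10})

# Draw 3 rows without replacement; random_state makes the draw reproducible.
picked = toy.sample(n=3, random_state=42)
picked_again = toy.sample(n=3, random_state=42)

print(len(picked))
print(picked.equals(picked_again))
```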

**Create samples for each year.**

In [18]:
%%time

# 2015 Sample

random_sample_2015 = random_row_sample(df = aih_2015, sample_rows = aih_2015_sample_size)

CPU times: user 30 s, sys: 19.3 s, total: 49.3 s
Wall time: 59.3 s


In [19]:
%%time

# 2016 Sample

random_sample_2016 = random_row_sample(df = aih_2016, sample_rows = aih_2016_sample_size)

CPU times: user 30.6 s, sys: 19.9 s, total: 50.4 s
Wall time: 1min 5s


In [20]:
%%time

# 2017 Sample

random_sample_2017 = random_row_sample(df = aih_2017, sample_rows = aih_2017_sample_size)

CPU times: user 29.3 s, sys: 18.8 s, total: 48.1 s
Wall time: 1min 1s


In [21]:
%%time

# 2018 Sample

random_sample_2018 = random_row_sample(df = aih_2018, sample_rows = aih_2018_sample_size)

CPU times: user 19.8 s, sys: 13.7 s, total: 33.5 s
Wall time: 56.5 s


In [22]:
# Print num of rows for each dataframe. Should be same as calculated number of rows above.

print('No. Rows 2015: ', len(random_sample_2015))
print('No. Rows 2016: ', len(random_sample_2016))
print('No. Rows 2017: ', len(random_sample_2017))
print('No. Rows 2018: ', len(random_sample_2018))

No. Rows 2015:  4655541
No. Rows 2016:  4611084
No. Rows 2017:  4624383
No. Rows 2018:  2723822


## Save Random Samples as CSV Files

In [26]:
%%time

# 2015 sample to CSV

random_sample_2015.to_csv('AIH_random_sample_2015.csv', index = False, 
                           na_rep= 'NaN', encoding='utf-8', 
                           chunksize = 50000)

CPU times: user 8min 48s, sys: 38.6 s, total: 9min 27s
Wall time: 9min 54s


In [28]:
%%time

# 2016 sample to CSV

random_sample_2016.to_csv('AIH_random_sample_2016.csv', index = False, 
                           na_rep= 'NaN', encoding='utf-8', 
                           chunksize = 50000)

CPU times: user 8min 14s, sys: 35.4 s, total: 8min 50s
Wall time: 9min 15s


In [17]:
%%time

# 2017 sample to CSV

random_sample_2017.to_csv('AIH_random_sample_2017.csv', index = False, 
                           na_rep= 'NaN', encoding='utf-8', 
                           chunksize = 50000)

CPU times: user 8min 40s, sys: 43.9 s, total: 9min 24s
Wall time: 10min 4s


In [19]:
%%time

# 2018 sample to CSV

random_sample_2018.to_csv('AIH_random_sample_2018.csv', index = False, 
                           na_rep= 'NaN', encoding='utf-8', 
                           chunksize = 50000)

CPU times: user 5min 1s, sys: 23.4 s, total: 5min 24s
Wall time: 6min 1s
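
One way to sanity-check a saved sample is to read it back and compare it with the dataframe in memory. The sketch below does this on a tiny made-up dataframe written to a temporary directory, using the same `to_csv` and `read_csv` arguments as above. The file name `toy_sample.csv` is hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Tiny made-up dataframe with one missing value.
toy = pd.DataFrame({'UF_ZI': [120000, 120000],
                    'CGC_HOSP': [4034526000000.0, np.nan]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_sample.csv')

    # Same writer arguments as the cells above.
    toy.to_csv(out_path, index=False, na_rep='NaN',
               encoding='utf-8', chunksize=50000)

    # Same reader arguments as the loading cells earlier in the notebook.
    restored = pd.read_csv(out_path, encoding='UTF-8',
                           na_values=['NA', ' ', ''])

print('Same shape:', restored.shape == toy.shape)
print('Missing CGC_HOSP values after reload:', restored['CGC_HOSP'].isna().sum())
```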


## End of Notebook