# ETL Medical Records

Tokenize free-text fields in medical records so that spelling variants of the same entry collapse into one token.

Template notebook using kardiasclean.

## Part 1: Split data

1. Load data
2. Split long strings into lists of strings
3. Spread each list into multiple rows, repeating the patient id (new DataFrame)

In [1]:
import pandas as pd
from pathlib import Path
from getpass import getpass

import kardiasclean

df = pd.read_csv(Path("./resources/data_clean1.csv")).set_index("patient_id")
df.head()

| patient_id | gender | state | municipality | altitude | age | weight_kg | height_cm | appearance | diagnosis_general | cx_previous | diagnosis_main | date_birth | date_procedure | procedure | rachs | stay_days | expired |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Estado de México | Huixquilucan | 2726 | 3942 | 35.0 | 134.0 | Normal | Ninguno | 0 | Comunicación interauricular, secundum | 2001-08-22 | 2012-04-08 | Reparación de CIA, parche | 1.0 | 2.0 | 0 |
| 1 | 1 | Estado de México | Timilpan | 2741 | 3202 | 18.0 | 117.0 | Desnutrido | Ninguno | 0 | Comunicación interauricular, secundum | 2003-09-19 | 2012-11-08 | Reparación CIA, parche | 1.0 | 2.0 | 0 |
| 2 | 0 | Ciudad de México | Coyoacán | 2240 | 3147 | 22.0 | 120.0 | Normal | Ninguno | 0 | Comunicación interauricular, secundum | 2003-11-21 | 2012-08-18 | Reparación CIA, parche | 1.0 | 2.0 | 0 |
| 3 | 0 | Estado de México | Nezahualcoyotl | 2220 | 4005 | 42.0 | 147.0 | Normal | Ninguno | 0 | Comunicación interauricular, secundum | 2001-10-07 | 2012-08-25 | Reparación CIA, parche | 1.0 | 2.0 | 0 |
| 4 | 0 | Ciudad de México | Alvaro Obregón | 2373 | 5289 | 40.0 | 157.0 | Normal | Ninguno | 0 | Comunicación Interauricular, Secundum | 1997-12-22 | 2012-01-09 | Reparación CIA, parche | 1.0 | 3.0 | 0 |


In [2]:
df['procedure'] = kardiasclean.split_string(df['procedure'], delimiter="+")
df['procedure']

patient_id
0                             [Reparación de CIA, parche]
1                                [Reparación CIA, parche]
2                                [Reparación CIA, parche]
3                                [Reparación CIA, parche]
4                                [Reparación CIA, parche]
                              ...                        
1032    [Procedimiento de Switch arterial, Cierre de c...
1033    [Cierre quirurgico de comunicación interauricu...
1034    [Colocación de tubo valvuldo del ventrículo de...
1035    [Reparación de arco aórtico con técnica de ava...
1037    [Corrección de tetralogía de Fallot, ventricul...
Name: procedure, Length: 1003, dtype: object
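
The split step can be approximated without the library. This is a minimal sketch, not the implementation of `kardiasclean.split_string`: it assumes the function splits each string on the delimiter and trims whitespace around the pieces. The sample strings are hypothetical.

```python
import pandas as pd

# Hypothetical sample values; the real column comes from data_clean1.csv.
procedures = pd.Series(
    ["Reparación de CIA, parche", "Switch arterial + Cierre de CIV"],
    name="procedure",
)

# Split each string on "+" and strip the whitespace around every piece.
split = procedures.apply(lambda text: [part.strip() for part in text.split("+")])
print(split.tolist())
```

A record with no delimiter becomes a one-element list, which matches the single-item lists shown in the output above.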

In [3]:
spread_df = kardiasclean.spread_column(df['procedure'])
print(kardiasclean.get_unique_stats(spread_df))
spread_df[5:10]

                   patient_id   procedure
unique_count      1003.000000  759.000000
percent_of_total     0.603127    0.456404
avg_per_record       1.658026    2.191041


|   | patient_id | procedure |
|---|---|---|
| 5 | 5 | Reparación de CIV, parche |
| 6 | 5 | Reparación de estenosis aórtica subvalvular |
| 7 | 6 | Reparación de CIV, parche |
| 8 | 7 | Reparacion de CIA, parche |
| 9 | 8 | Reparación de CIV, parche |
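
Spreading a column of lists into one row per item is what pandas calls "exploding". The sketch below assumes `kardiasclean.spread_column` behaves roughly like `Series.explode` followed by `reset_index`; that is an assumption based on the output above, not a statement about the library's internals.

```python
import pandas as pd

# Hypothetical lists keyed by patient_id, mirroring patients 4 and 5 above.
lists = pd.Series(
    [
        ["Reparación CIA, parche"],
        ["Reparación de CIV, parche", "Reparación de estenosis aórtica subvalvular"],
    ],
    index=pd.Index([4, 5], name="patient_id"),
    name="procedure",
)

# One row per list item; the patient_id is repeated for multi-item lists.
spread = lists.explode().reset_index()
print(spread)
```

Patient 5 appears twice in the result, once per procedure, which is the same pattern as rows 5 and 6 in the table above.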


## Part 2: Clean and Tokenize Strings

1. Remove accents
2. Remove symbols with regex
3. Remove stopwords
4. Tokenize with soundex

In [4]:
spread_df['procedure'] = kardiasclean.clean_accents(spread_df['procedure'])
print(kardiasclean.get_unique_stats(spread_df))
spread_df.head()

                   patient_id   procedure
unique_count      1003.000000  744.000000
percent_of_total     0.603127    0.447384
avg_per_record       1.658026    2.235215


|   | patient_id | procedure |
|---|---|---|
| 0 | 0 | Reparacion de CIA, parche |
| 1 | 1 | Reparacion CIA, parche |
| 2 | 2 | Reparacion CIA, parche |
| 3 | 3 | Reparacion CIA, parche |
| 4 | 4 | Reparacion CIA, parche |
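
The three statistics printed after each cleaning step are consistent with a simple reading: `unique_count` is the number of distinct values in a column, `percent_of_total` is that count divided by the number of rows, and `avg_per_record` is the inverse ratio. With 1663 rows in the spread dataframe, 1003 / 1663 ≈ 0.603 and 1663 / 744 ≈ 2.235, which matches the output above. The helper below is a hypothetical re-creation under that reading, not the source of `kardiasclean.get_unique_stats`.

```python
import pandas as pd


def unique_stats(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column unique count, share of total rows, and rows per unique value."""
    total = len(frame)
    unique = frame.nunique()
    stats = pd.DataFrame(
        {
            "unique_count": unique,
            "percent_of_total": unique / total,
            "avg_per_record": total / unique,
        }
    )
    # Transpose so that each statistic is a row, as in the notebook output.
    return stats.T


# Hypothetical toy data: 4 rows, 3 patients, 2 distinct procedures.
toy = pd.DataFrame(
    {"patient_id": [0, 1, 1, 2], "procedure": ["a", "a", "b", "a"]}
)
stats = unique_stats(toy)
print(stats)
```

A falling `unique_count` for `procedure` after each step means that more spelling variants have been merged.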


In [5]:
spread_df['procedure'] = kardiasclean.clean_symbols(spread_df['procedure'])
print(kardiasclean.get_unique_stats(spread_df))
spread_df.head()

                   patient_id   procedure
unique_count      1003.000000  714.000000
percent_of_total     0.603127    0.429345
avg_per_record       1.658026    2.329132


|   | patient_id | procedure |
|---|---|---|
| 0 | 0 | Reparacion de CIA parche |
| 1 | 1 | Reparacion CIA parche |
| 2 | 2 | Reparacion CIA parche |
| 3 | 3 | Reparacion CIA parche |
| 4 | 4 | Reparacion CIA parche |
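
The accent and symbol steps can be reproduced with the standard library. This is a sketch under assumptions: accents are removed by Unicode decomposition, and every character other than an ASCII letter, digit, or whitespace counts as a symbol. The exact regex used by `kardiasclean.clean_symbols` may differ, and both function names here are hypothetical.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_symbols(text: str) -> str:
    """Replace non-alphanumeric characters with spaces, then collapse spaces."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


raw = "Reparación de CIA, parche"
no_accents = strip_accents(raw)
no_symbols = strip_symbols(no_accents)
print(no_accents)
print(no_symbols)
```

Accents are stripped first on purpose: if the symbol regex ran on the raw text, it would replace "ó" with a space and split the word.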


In [6]:
spread_df['keywords'] = kardiasclean.clean_stopwords(spread_df['procedure'])
print(kardiasclean.get_unique_stats(spread_df))
spread_df.head()

                   patient_id   procedure    keywords
unique_count      1003.000000  714.000000  666.000000
percent_of_total     0.603127    0.429345    0.400481
avg_per_record       1.658026    2.329132    2.496997


|   | patient_id | procedure | keywords |
|---|---|---|---|
| 0 | 0 | Reparacion de CIA parche | CIA Reparacion parche |
| 1 | 1 | Reparacion CIA parche | CIA Reparacion parche |
| 2 | 2 | Reparacion CIA parche | CIA Reparacion parche |
| 3 | 3 | Reparacion CIA parche | CIA Reparacion parche |
| 4 | 4 | Reparacion CIA parche | CIA Reparacion parche |
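
In the output above, "Reparacion de CIA parche" and "Reparacion CIA parche" both become "CIA Reparacion parche": the stopword "de" is gone and the remaining words appear in sorted order. The sketch below reproduces that behaviour. The stopword list is a hypothetical placeholder, and the sorting is inferred from the output rather than confirmed from the library.

```python
# Hypothetical Spanish stopword list; the library's actual list is not shown here.
STOPWORDS = {"de", "del", "la", "el", "los", "las", "con", "y", "en"}


def to_keywords(text: str) -> str:
    """Drop stopwords and sort the remaining words so word order no longer matters."""
    words = [word for word in text.split() if word.lower() not in STOPWORDS]
    return " ".join(sorted(words))


with_stopword = to_keywords("Reparacion de CIA parche")
without_stopword = to_keywords("Reparacion CIA parche")
print(with_stopword)
print(without_stopword)
```

Python's default string sort places uppercase letters before lowercase ones, which is why "CIA" and "Reparacion" come before "parche".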


In [7]:
spread_df['token'] = kardiasclean.clean_tokenize(spread_df['keywords'])
print(kardiasclean.get_unique_stats(spread_df))
spread_df.head()

                   patient_id   procedure    keywords       token
unique_count      1003.000000  714.000000  666.000000  603.000000
percent_of_total     0.603127    0.429345    0.400481    0.362598
avg_per_record       1.658026    2.329132    2.496997    2.757877


|   | patient_id | procedure | keywords | token |
|---|---|---|---|---|
| 0 | 0 | Reparacion de CIA parche | CIA Reparacion parche | SRPRSNPRX |
| 1 | 1 | Reparacion CIA parche | CIA Reparacion parche | SRPRSNPRX |
| 2 | 2 | Reparacion CIA parche | CIA Reparacion parche | SRPRSNPRX |
| 3 | 3 | Reparacion CIA parche | CIA Reparacion parche | SRPRSNPRX |
| 4 | 4 | Reparacion CIA parche | CIA Reparacion parche | SRPRSNPRX |
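
To illustrate what a phonetic token is, here is a minimal sketch of classic American Soundex for a single word. It is not the algorithm behind `kardiasclean.clean_tokenize`: the tokens above are longer than the four-character codes Soundex produces, so the library evidently uses a different or extended encoding. The point is only that similar-sounding spellings map to the same code.

```python
# Letter groups of classic American Soundex.
GROUPS = [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]
CODES = {letter: digit for letters, digit in GROUPS for letter in letters}


def soundex(word: str) -> str:
    """Return the four-character Soundex code of a word ('' for empty input)."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        # Emit a digit only when it differs from the previous letter's code.
        if code and code != previous:
            encoded += code
        # H and W are transparent: they do not separate equal codes.
        if ch not in "HW":
            previous = code
    # Pad with zeros and truncate to one letter plus three digits.
    return (encoded + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))
print(soundex("Reparacion"), soundex("Reparasion"))
```

Both pairs print identical codes, which is the property the notebook relies on to merge misspelled procedure names.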


## Part 3: Get Unique List

1. Get Unique Values from the spread dataframe
2. Normalize the spread dataframe with the new unique list

In [8]:
list_df = kardiasclean.create_unique_list(spread_df, spread_df['token'])
list_df = list_df.drop(["patient_id", "index"], axis=1)
list_df.head()

|   | procedure | keywords | token |
|---|---|---|---|
| 0 | Reparacion de CIA parche | CIA Reparacion parche | SRPRSNPRX |
| 1 | Reparacion de CIV parche | CIV Reparacion parche | SFRPRSNPRX |
| 2 | Reparacion de estenosis aortica subvalvular | Reparacion aortica estenosis subvalvular | RPRSNRTKSTNSSSPFLFLR |
| 3 | Cierre quirurgico de PCA | Cierre PCA quirurgico | SRPKKRRJK |
| 4 | Reparacion de CIV cierre primario | CIV Reparacion cierre primario | SFRPRSNSRPRMR |
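
Building the unique list amounts to keeping one row per token. The sketch below assumes the first spelling encountered is kept as the canonical one, which agrees with row 0 above ("Reparacion de CIA parche" is the first spelling in the data). It uses plain pandas rather than `kardiasclean.create_unique_list`, and the sample rows are a hypothetical excerpt.

```python
import pandas as pd

# Hypothetical excerpt of the spread dataframe after tokenization.
spread = pd.DataFrame(
    {
        "patient_id": [0, 1, 5],
        "procedure": [
            "Reparacion de CIA parche",
            "Reparacion CIA parche",
            "Reparacion de CIV parche",
        ],
        "keywords": [
            "CIA Reparacion parche",
            "CIA Reparacion parche",
            "CIV Reparacion parche",
        ],
        "token": ["SRPRSNPRX", "SRPRSNPRX", "SFRPRSNPRX"],
    }
)

# Keep the first row seen for each token and drop the patient reference.
unique_list = (
    spread.drop_duplicates(subset="token", keep="first")
    .drop(columns="patient_id")
    .reset_index(drop=True)
)
print(unique_list)
```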


In [9]:
spread_df['procedure'] = kardiasclean.normalize_from_tokens(spread_df['token'], list_df['token'], list_df['procedure'])
spread_df = spread_df.set_index("patient_id")
spread_df.head()

| patient_id | procedure | keywords | token |
|---|---|---|---|
| 0 | Reparacion de CIA parche | CIA Reparacion parche | SRPRSNPRX |
| 1 | Reparacion de CIA parche | CIA Reparacion parche | SRPRSNPRX |
| 2 | Reparacion de CIA parche | CIA Reparacion parche | SRPRSNPRX |
| 3 | Reparacion de CIA parche | CIA Reparacion parche | SRPRSNPRX |
| 4 | Reparacion de CIA parche | CIA Reparacion parche | SRPRSNPRX |
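
Normalization replaces every spelling with the canonical one that shares its token, which is why patients 1 to 4 now read "Reparacion de CIA parche" instead of "Reparacion CIA parche". A minimal sketch of that lookup with plain pandas follows; it assumes a one-to-one mapping from token to canonical name, and the sample data is hypothetical.

```python
import pandas as pd

# Hypothetical unique list: one canonical procedure name per token.
unique_list = pd.DataFrame(
    {
        "token": ["SRPRSNPRX", "SFRPRSNPRX"],
        "procedure": ["Reparacion de CIA parche", "Reparacion de CIV parche"],
    }
)

# Hypothetical spread rows with inconsistent spellings.
spread = pd.DataFrame(
    {
        "procedure": ["Reparacion de CIA parche", "Reparacion CIA parche"],
        "token": ["SRPRSNPRX", "SRPRSNPRX"],
    }
)

# Build a token -> canonical name lookup and apply it to every row.
lookup = dict(zip(unique_list["token"], unique_list["procedure"]))
spread["procedure"] = spread["token"].map(lookup)
print(spread)
```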


## Part 4: Store in SQL

1. Create the database in Postgres first; this notebook does not create it.
2. Rename columns if necessary.
3. Write the tables with pandas in replace mode, so no explicit schema (CREATE TABLE ...) is needed.

In [10]:
password = getpass('Enter database password: ')
host = "kardias-test.cvj7xeynbmtt.us-east-1.rds.amazonaws.com"
pgm = kardiasclean.PostgresManager("kardias", password, host)

In [13]:
# STORE MAIN DATA
df = df.drop(columns=["procedure", "diagnosis_main", "diagnosis_general"])
pgm.create_table("patients", df).count()

patient_id        1003
gender            1003
state             1003
municipality      1003
altitude          1003
age               1003
weight_kg         1003
height_cm         1003
appearance        1003
cx_previous       1003
date_birth        1003
date_procedure    1003
rachs             1003
stay_days         1003
expired           1003
dtype: int64

In [None]:
# STORE LIST DATA
list_df = list_df.set_index("token")
pgm.create_table("surgical_procedures", list_df).count()

token        603
procedure    603
keywords     603
dtype: int64

In [None]:
# STORE SPREAD DATA
spread_df = spread_df.drop(columns=["procedure", "keywords"])
pgm.create_table("surgical_procedures_map", spread_df).count()

patient_id    1663
token         1663
dtype: int64

## DONE!