# FORMATTING & CLEANING THE DATA

This notebook takes the signals and numerics csv files and combines them. 

The signals files contain waveform data sampled at 125 Hz, with about 60k rows each. They come from the downloaded files.

Each file has a number associated with it, representing the person in the study.

The numerics files hold one measurement per second. They are joined to the signal data with an outer join.

The joined data has approximately 60,000 rows.
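
As a rough illustration of that join, the sketch below builds two small synthetic tables (the column names `PLETH` and `HR` and all values are made up for the example, not taken from the real files), assigns each 125 Hz sample to its whole second, and merges with `how='outer'`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the two files (values are invented for illustration).
fs = 125  # signal sampling rate in Hz
signals = pd.DataFrame({'Time [s]': np.arange(3 * fs) / fs})  # 3 seconds of samples
signals['PLETH'] = np.sin(2 * np.pi * signals['Time [s]'])
numerics = pd.DataFrame({'sec': [0, 1, 2, 3], 'HR': [80, 81, 79, 80]})

# Map every 125 Hz sample to the whole second it falls in.
signals['sec'] = np.floor(signals['Time [s]']).astype(int)

# Outer join: each numerics row is repeated for all samples in its second,
# and seconds present in only one table are kept, with NaN in the other's columns.
merged = signals.merge(numerics, on='sec', how='outer')

print(merged.shape)
print(merged.tail(2))
```

Here second 3 exists only in the numerics table, so it survives the outer join as a single row with missing signal values.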

In [83]:
import pandas as pd
import numpy as np
from scipy.stats import skew,kurtosis

This function joins and saves the numeric and signal data for each patient.

In [84]:
def make_dataframe(num):
    # Load the 125 Hz signals and tag each sample with the whole second it falls in.
    signals= pd.read_csv('csv/bidmc_'+str(num)+'_Signals.csv',index_col=0)
    signals['sec'] = signals['Time [s]'].apply(lambda x: int(np.floor(x)))
    # Load the per-second numerics and fill missing values with each column's mean.
    numerics = pd.read_csv('csv/bidmc_'+str(num)+'_Numerics.csv',index_col=0)
    numerics.fillna(numerics.mean(),inplace=True) 
    numerics.rename(columns={'Time [s]':'sec'},inplace=True)
    # Drop the numerics ' RESP' column, which shares its name with a signal column.
    numerics.drop(' RESP',axis=1,inplace=True)
    numerics['sec'] = numerics['sec'].apply(lambda x: int(x))
    signals = signals[[' RESP', ' PLETH', ' V', ' AVR', ' II','sec','Time [s]']]
    # Outer join keeps seconds that appear in either file.
    person = signals.merge(numerics,on='sec',how='outer')
    # Per-second summary statistics of the 125 Hz columns.
    Hz_125_cols = [' RESP', ' PLETH', ' V', ' AVR', ' II']
    Min = person[Hz_125_cols+['sec']].groupby('sec').min()
    Min.columns = [i+'_Min' for i in Min.columns]
    Max = person[Hz_125_cols+['sec']].groupby('sec').max()
    Max.columns = [i+'_Max' for i in Max.columns]
    Mean = person[Hz_125_cols+['sec']].groupby('sec').mean()
    Mean.columns = [i+'_Mean' for i in Mean.columns]
    Kurt = person[Hz_125_cols+['sec']].groupby('sec').agg(lambda x: kurtosis(x))
    Kurt.columns = [i+'_Kurt' for i in Kurt.columns]
    Skw = person[Hz_125_cols+['sec']].groupby('sec').agg(lambda x: skew(x))
    Skw.columns = [i+'_Skw' for i in Skw.columns]
    # Attach the summaries to every row of the matching second and save.
    summary_frames = [Min,Max,Mean,Kurt,Skw]
    one_sec_summary = pd.concat(summary_frames,axis=1).reset_index()
    person = person.merge(one_sec_summary,on='sec',how='outer')
    person.to_csv('csv/person_'+str(num)+'.csv')
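
The five per-second summaries computed in the function could also be produced with a single `agg` call. The following is only a sketch on synthetic data, not the notebook's method: the `toy` frame and the wrapper functions `kurt` and `skw` are hypothetical names introduced here for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import kurtosis, skew

# Synthetic data: two seconds of 125 samples for two signal columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'sec': np.repeat([0, 1], 125),
    ' PLETH': rng.normal(size=250),
    ' II': rng.normal(size=250),
})

# Named wrappers (hypothetical helpers) so pandas labels the columns 'kurt' and 'skw'.
def kurt(x):
    return kurtosis(x)

def skw(x):
    return skew(x)

# One pass over the groups; the result has (column, statistic) MultiIndex columns.
stats = toy.groupby('sec').agg(['min', 'max', 'mean', kurt, skw])

# Flatten to names such as ' PLETH_Min', matching the suffixes used in the function.
stats.columns = [f'{col}_{stat.capitalize()}' for col, stat in stats.columns]
one_sec_summary = stats.reset_index()

print(one_sec_summary.columns.tolist())
```

This keeps the same column suffixes (`_Min`, `_Max`, `_Mean`, `_Kurt`, `_Skw`), but the columns come out grouped by signal rather than by statistic.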

Each patient has a number between 1 and 53. The file names use zero-padded two-digit numbers, so we first build that list of strings.

In [85]:
# Zero-padded patient numbers: '01', '02', ..., '53'
nums = [str(n).zfill(2) for n in range(1,54)]

We process each patient and save the resulting dataframe. If an error occurs, we print the patient's number and skip that patient.

In [86]:
for number in nums:
    try:
        make_dataframe(number)
    except Exception:
        # Processing failed for this patient: report the number and move on.
        print(number)

09
15
30
38
39
41
47
49


Processing failed for only 8 of the 53 patients; their numbers are printed above.
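
The loop reports which patients failed but not why. One possible way to find out is to record the exception for each failure, as in the sketch below. `run_all` and `demo_process` are hypothetical names used only for this illustration, and the failure raised here is artificial, not the actual cause for any patient.

```python
def run_all(ids, process):
    """Call process(pid) for every id and return {pid: error description} for failures."""
    failures = {}
    for pid in ids:
        try:
            process(pid)
        except Exception as exc:
            # Keep going, but remember what went wrong for this id.
            failures[pid] = f'{type(exc).__name__}: {exc}'
    return failures

# Stand-in for make_dataframe: fails on purpose for one id to show the mechanism.
def demo_process(pid):
    if pid == '02':
        raise ValueError('demo failure')

failures = run_all(['01', '02', '03'], demo_process)
print(failures)
```

In the notebook, the same pattern would be called as `run_all(nums, make_dataframe)`.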

In the next cell, the saved per-patient dataframes are loaded and concatenated into one final dataframe covering all patients.

In [None]:
from os import listdir
from os.path import isfile, join
# Collect the per-patient files written by make_dataframe.
onlyfiles = [f for f in listdir('csv') if isfile(join('csv', f))]
person_files = ['csv/'+i for i in onlyfiles if 'person' in i]
files = []
for person in person_files:
    df = pd.read_csv(person,index_col=0)
    files.append(df)

# Stack all patients into one dataframe and drop rows with any missing value.
df = pd.concat(files, axis=0, ignore_index=True)
df.dropna(inplace=True)
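
To make the effect of the last two lines concrete, here is a small sketch on two invented frames (the values and the single ` HR` column are assumptions for the example). It shows that `ignore_index=True` renumbers the rows and that `dropna` removes every row containing at least one missing value.

```python
import numpy as np
import pandas as pd

# Two tiny stand-ins for per-patient frames; one value is missing in the first.
a = pd.DataFrame({'sec': [0, 1], ' HR': [80.0, np.nan]})
b = pd.DataFrame({'sec': [0, 1], ' HR': [72.0, 73.0]})

# Stack the frames; ignore_index=True gives a fresh 0..n-1 index.
combined = pd.concat([a, b], axis=0, ignore_index=True)

# Drop any row with a missing value; the remaining rows keep their index labels.
cleaned = combined.dropna()

print(len(combined), len(cleaned))
```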