# Example workflow for using flosp to import and clean ED data

Move to the parent directory so that flosp can be imported.

In [1]:
%ls

 Volume in drive C is Windows
 Volume Serial Number is EE63-178E

 Directory of C:\Users\bjk1y13\OneDrive - University of Southampton\MH030_HHFT_flow\4_Analysis\flosp\example_data

04/01/2019  11:02    <DIR>          .
04/01/2019  11:02    <DIR>          ..
04/01/2019  11:00    <DIR>          .ipynb_checkpoints
04/01/2019  10:53         2,068,415 example_data.csv
04/01/2019  11:00            28,750 example_flosp_workflow.ipynb
04/01/2019  10:54    <DIR>          example_results
04/01/2019  11:03               126 setup.py
               3 File(s)      2,097,291 bytes
               4 Dir(s)  335,571,881,984 bytes free


In [2]:
%cd ".\.."

C:\Users\bjk1y13\OneDrive - University of Southampton\MH030_HHFT_flow\4_Analysis\flosp


In [4]:
import flosp

## initialise flosp

You must provide the path to the setup file, which you should edit to your project specifics.

In [8]:
from flosp import Interface

In [9]:
eg = Interface()

created_flosp


### Importing ED data

In [5]:
EDdata = flosp.ioED('myhosp','./')

AttributeError: module 'flosp' has no attribute 'ioED'

load raw data

In [3]:
EDdata.load_csv('./example_data.csv') #,nrows=1500) # limit number of rows for quick runtime during dev

----------------------------------------
importing ED csv data to RAW dataframe


look at the dataframe at any stage of processing

In [4]:
EDdata.get_EDraw().head(3)

Unnamed: 0,PSEUDONYMISED_PATIENT_ID,AGE_AT_ARRIVAL,GENDER_NATIONAL_DESCRIPTION,SITE,ARRIVAL_DTTM,ARRIVAL_MODE_NATIONAL_CODE,INITIAL_ASSESSMENT_DTTM,SEEN_FOR_TREATMENT_DTTM,SPECIALTY_REQUEST_TIME,SPECIALTY_REFERRED_TO_DESCRIPTION,ADMISSION_FLAG,ATTENDANCE_CONCLUSION_DTTM,STREAM_LOCAL_CODE
0,2493,50,Female,HOSPITAL1,2013-01-08 19:44:00,2.0,2013-01-07 18:50:00,2013-01-09 21:50:00,,,0,2013-01-10 23:16:00,MIN
1,4822,22,Male,HOSPITAL1,2013-07-03 09:34:00,2.0,2013-07-03 08:57:00,2013-07-02 09:17:00,,,0,2013-07-05 09:02:00,MIN
2,6012,51,Male,HOSPITAL1,2017-08-27 23:33:00,2.0,2017-08-26 01:32:00,2017-08-27 03:12:00,,,0,2017-08-26 03:31:00,MIN


subsampling method: gives a quicker initial runtime while the cleaning steps are being developed (some of the datetime conversions can take a while on larger data sets).

In [5]:
EDdata.small_sample(size=15000)
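A minimal sketch of what a subsampling step could look like in plain pandas. This is not flosp's implementation; the helper `take_small_sample` and the `seed` parameter are hypothetical names used only for illustration.

```python
import pandas as pd


def take_small_sample(df: pd.DataFrame, size: int, seed: int = 0) -> pd.DataFrame:
    """Return at most `size` randomly chosen rows (hypothetical helper, not flosp code)."""
    n = min(size, len(df))  # never ask for more rows than exist
    return df.sample(n=n, random_state=seed).reset_index(drop=True)


# Toy data standing in for the raw ED dataframe.
raw = pd.DataFrame({"age": range(100)})
sample = take_small_sample(raw, size=15)
```

Fixing the seed keeps the subsample reproducible between runs, which helps when comparing cleaning steps during development.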



### mapping columns to standard naming convention

unless your columns already use the required names, rename them with the column mapping method

In [1]:
#### define dictionary for mapping
col_map = {
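# 'PSEUDONYMISED_PATIENT_ID':'dept_patid',  # duplicate key: a dict keeps only the last value, so this entry was silently overridden by 'hosp_patid'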
'PSEUDONYMISED_PATIENT_ID':'hosp_patid',
'AGE_AT_ARRIVAL':'age',
'GENDER_NATIONAL_DESCRIPTION':'gender',
'SITE':'site',
'ARRIVAL_DTTM':'arrive_datetime',
'ARRIVAL_MODE_NATIONAL_CODE':'arrive_mode',
'INITIAL_ASSESSMENT_DTTM':'first_triage_datetime',
'SEEN_FOR_TREATMENT_DTTM':'first_dr_datetime',
'SPECIALTY_REQUEST_TIME':'first_adm_request_time',
'SPECIALTY_REFERRED_TO_DESCRIPTION':'adm_referral_loc',
'ADMISSION_FLAG':'adm_flag',
'ATTENDANCE_CONCLUSION_DTTM':'depart_datetime',
'STREAM_LOCAL_CODE':'stream'
}

In [3]:
len(col_map)

13

In [7]:
EDdata.map_columns(col_map)

----------------------------------------
mapping column names
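For reference, a column mapping of this kind can be sketched with `DataFrame.rename` in plain pandas. This is an illustration under the assumption that the mapping is a simple rename; it is not taken from flosp's source, and the toy dataframe is made up.

```python
import pandas as pd

# Toy dataframe with two of the original column names.
raw = pd.DataFrame({"AGE_AT_ARRIVAL": [50, 22], "SITE": ["HOSPITAL1", "HOSPITAL1"]})
col_map = {"AGE_AT_ARRIVAL": "age", "SITE": "site", "ARRIVAL_DTTM": "arrive_datetime"}

# rename() silently ignores keys that are not present, so list them explicitly.
unmapped = [old for old in col_map if old not in raw.columns]
renamed = raw.rename(columns=col_map)
```

Checking `unmapped` before renaming catches typos in the source column names that would otherwise pass unnoticed.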


the required column names can be found with:

In [8]:
flosp._expected_file_structures.dataRAW_expected_cols.keys()

dict_keys(['hosp_patid', 'age', 'age_group', 'gender', 'arrive_datetime', 'arrive_mode', 'arrive_hour', 'arrive_dayofweek', 'arrive_month', 'arrive_dayofweek_name', 'arrive_date', 'arrive_week', 'first_triage_datetime', 'first_dr_datetime', 'first_adm_request_datetime', 'adm_referral_loc', 'depart_datetime', 'depart_method', 'depart_hour', 'depart_dayofweek', 'depart_week', 'depart_month', 'depart_dayofweek_name', 'depart_date', 'waiting_time', 'breach_flag', 'adm_flag', 'stream', 'minutes_today', 'minutes_tomo', 'breach_datetime', 'arr_triage_wait', 'arr_dr_wait', 'arr_adm_req_wait', 'adm_req_dep_wait', 'dr_adm_req_wait', 'dr_dep_wait'])
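A quick way to compare your own columns against a list like the one above is a pair of list comprehensions. The short lists below are a made-up subset used only to illustrate the comparison.

```python
# Subset of expected names, shortened for the example.
expected_cols = ["hosp_patid", "age", "age_group", "arrive_datetime", "arrive_hour"]
# Columns assumed to be present after mapping (illustrative only).
present_cols = ["hosp_patid", "age", "arrive_datetime", "site"]

# Columns still to be created, and columns the standard does not ask for.
missing = [c for c in expected_cols if c not in present_cols]
extra = [c for c in present_cols if c not in expected_cols]
```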

In [9]:
EDdata.get_EDraw().head(3)

Unnamed: 0,hosp_patid,age,gender,site,arrive_datetime,arrive_mode,first_triage_datetime,first_dr_datetime,first_adm_request_time,adm_referral_loc,adm_flag,depart_datetime,stream
0,2493,50,Female,HOSPITAL1,2013-01-08 19:44:00,2.0,2013-01-07 18:50:00,2013-01-09 21:50:00,,,0,2013-01-10 23:16:00,MIN
1,4822,22,Male,HOSPITAL1,2013-07-03 09:34:00,2.0,2013-07-03 08:57:00,2013-07-02 09:17:00,,,0,2013-07-05 09:02:00,MIN
2,6012,51,Male,HOSPITAL1,2017-08-27 23:33:00,2.0,2017-08-26 01:32:00,2017-08-27 03:12:00,,,0,2017-08-26 03:31:00,MIN


### bespoke user cleaning operations 

if manual edits to the data are required at any stage (e.g. cleaning a spurious datetime), get the data out of the class, edit it, and put it back:

In [10]:
#### get data out
df = EDdata.get_EDraw()

#### make changes to df
# my changes here

#### replace data into EDdata
EDdata.replace_EDraw(df)
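As an illustration of the kind of manual edit that might go in the middle step, the sketch below blanks out departure times that fall before arrival. The rule and the toy values are assumptions made for the example, and the columns are parsed to datetimes up front so the comparison works.

```python
import pandas as pd

# Toy data: the second row departs before it arrives.
df = pd.DataFrame({
    "arrive_datetime": pd.to_datetime(["2013-01-08 19:44", "2013-07-03 09:34"]),
    "depart_datetime": pd.to_datetime(["2013-01-10 23:16", "2013-07-02 09:02"]),
})

# Treat a departure earlier than arrival as spurious and set it to missing.
spurious = df["depart_datetime"] < df["arrive_datetime"]
df.loc[spurious, "depart_datetime"] = pd.NaT
```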

### converting datetimes
#### datetime formats are often non-standard; some attention is needed, but there are built-in methods to help.

convert columns to datetime format (by default, any column with 'datetime' in its name is converted).

In [11]:
EDdata.convert_cols_datetime("%Y/%m/%d %H:%M")



Converting cols to datetime...(may take some time depedning on size of df)...
arrive_datetime...converting
first_triage_datetime...converting
first_dr_datetime...converting
depart_datetime...converting
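The same idea can be sketched in plain pandas: pick every column whose name contains 'datetime' and parse it with an explicit format. This is an assumption about the general approach, not flosp's code; `errors="coerce"` is a choice made here so that unparseable values become `NaT` instead of raising.

```python
import pandas as pd

# Toy data with one valid and one invalid timestamp string.
df = pd.DataFrame({
    "arrive_datetime": ["2013/01/08 19:44", "not a date"],
    "age": [50, 22],
})

fmt = "%Y/%m/%d %H:%M"
datetime_cols = [c for c in df.columns if "datetime" in c]
for col in datetime_cols:
    # Values that do not match the format become NaT.
    df[col] = pd.to_datetime(df[col], format=fmt, errors="coerce")
```

Supplying the format explicitly is usually much faster than letting pandas infer it row by row, which matters on larger data sets.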


create a datetime column from separate time and date columns.

In [12]:
EDdata.create_datetime_from_time('first_adm_request_time','arrive_datetime','first_adm_request_datetime')

----------------------------------------
Create datetime column from: first_adm_request_time & arrive_datetime
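One way to build such a column in plain pandas is to take the date part of the arrival timestamp and join it to the time string. The sketch assumes times are stored as "HH:MM" strings and that the request happens on the arrival date; it does not handle requests made after midnight, and it is not flosp's implementation.

```python
import pandas as pd

# Toy data: one request time present, one missing.
df = pd.DataFrame({
    "arrive_datetime": pd.to_datetime(["2013-01-08 19:44", "2013-07-03 09:34"]),
    "first_adm_request_time": ["21:05", None],
})

# Date of arrival as text, joined to the time string; missing times stay missing.
date_part = df["arrive_datetime"].dt.strftime("%Y-%m-%d")
combined = date_part.str.cat(df["first_adm_request_time"], sep=" ")
df["first_adm_request_datetime"] = pd.to_datetime(
    combined, format="%Y-%m-%d %H:%M", errors="coerce"
)
```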







### automated cleaning

check what else needs doing to get data into normalised format:

In [13]:
EDdata.run_tests()

--------------------
Finding missing columns...
age_group try using:  use make_age_group_column
arrive_hour try using:  make_callender_columns
arrive_dayofweek try using:  make_callender_columns
arrive_month try using:  make_callender_columns
arrive_dayofweek_name try using:  make_callender_columns
arrive_date try using:  make_callender_columns
arrive_week try using:  make_callender_columns
depart_method try using:  
depart_hour try using:  make_callender_columns
depart_dayofweek try using:  make_callender_columns
depart_week try using:  make_callender_columns
depart_month try using:  make_callender_columns
depart_dayofweek_name try using:  make_callender_columns
depart_date try using:  make_callender_columns
waiting_time try using:  
breach_flag try using:  
minutes_today try using:  
minutes_tomo try using:  
breach_datetime try using:  
arr_triage_wait try using:  
arr_dr_wait try using:  
arr_adm_req_wait try using:  
adm_req_dep_wait try using:  
dr_adm_req_wait try using:  
dr_de

#### run as much automated cleaning as possible using: 

In [15]:
EDdata.autoclean()

----------------------------------------
Making callender columns from:arrive_datetime
----------------------------------------
Making callender columns from:depart_datetime


run tests again:

In [16]:
EDdata.run_tests()

--------------------
Finding missing columns...
depart_method try using:  
minutes_today try using:  
minutes_tomo try using:  
--------------------
Finding columns with wrong datatypes...
Col  arrive_mode                is: float64 . Expected any of:  [<class 'object'>]
Col  adm_flag                   is: int64 . Expected any of:  [<class 'int'>]
Col  stream                     is: object . Expected any of:  [<class 'float'>, <class 'int'>, <class 'numpy.int64'>, <class 'str'>]
Col  depart_dayofweek_name      is: object . Expected any of:  [<class 'str'>]
Col  age_group                  is: category . Expected any of:  [<class 'str'>, 'pandas category type']
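Some entries in the datatype report above compare a pandas dtype such as int64 with a Python class such as int. If you want to double-check a column yourself, a check by dtype family is one option. The sketch below uses `pandas.api.types`; the toy dataframe and the `checks` dictionary are assumptions for illustration.

```python
import pandas as pd
from pandas.api import types as ptypes

# Toy data mirroring three of the reported columns.
df = pd.DataFrame({
    "arrive_mode": [2.0, 1.0],
    "adm_flag": [0, 1],
    "stream": pd.Series(["MIN", "MAJ"], dtype=object),
})

# Expected dtype family per column (illustrative choices).
checks = {
    "arrive_mode": ptypes.is_object_dtype,
    "adm_flag": ptypes.is_integer_dtype,
    "stream": ptypes.is_object_dtype,
}
wrong = [col for col, is_ok in checks.items() if not is_ok(df[col])]
```

With this kind of check an int64 column counts as an integer column, so only `arrive_mode` is reported here.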


#### alternatively to autoclean, step through the process using the individual methods:

In [17]:
EDdata.make_callender_columns()

----------------------------------------
Making callender columns from:arrive_datetime
----------------------------------------
Making callender columns from:depart_datetime
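For orientation, calendar columns like those named in the expected column list can be derived with the pandas `.dt` accessor. This is a sketch of the general technique on toy data, not a copy of what `make_callender_columns` does internally.

```python
import pandas as pd

df = pd.DataFrame({
    "arrive_datetime": pd.to_datetime(["2013-01-08 19:44", "2017-08-27 23:33"]),
})

ts = df["arrive_datetime"]
df["arrive_hour"] = ts.dt.hour
df["arrive_dayofweek"] = ts.dt.dayofweek          # Monday is 0, Sunday is 6
df["arrive_dayofweek_name"] = ts.dt.day_name()
df["arrive_month"] = ts.dt.month
df["arrive_week"] = ts.dt.isocalendar().week      # ISO week number
df["arrive_date"] = ts.dt.date
```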


In [18]:
EDdata.make_wait_columns()

In [19]:
EDdata.make_breach_columns()
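A rough sketch of how wait and breach columns could be computed in plain pandas. The units (minutes) and the 240-minute threshold are assumptions made for this example; check flosp's source for the definitions it actually uses.

```python
import pandas as pd

BREACH_MINUTES = 240  # assumption: four-hour threshold, adjust to your definition

df = pd.DataFrame({
    "arrive_datetime": pd.to_datetime(["2013-01-08 19:44", "2013-07-03 09:34"]),
    "depart_datetime": pd.to_datetime(["2013-01-08 21:14", "2013-07-03 15:34"]),
})

# Total time in the department, in minutes.
elapsed = df["depart_datetime"] - df["arrive_datetime"]
df["waiting_time"] = elapsed.dt.total_seconds() / 60

# Flag stays longer than the threshold and record when the threshold was reached.
df["breach_flag"] = (df["waiting_time"] > BREACH_MINUTES).astype(int)
df["breach_datetime"] = df["arrive_datetime"] + pd.Timedelta(minutes=BREACH_MINUTES)
```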

In [20]:
EDdata.make_age_group_column()
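Age bands are commonly built with `pd.cut`, which returns a categorical column. The bin edges and labels below are made up for the example and are not the bands flosp uses.

```python
import pandas as pd

df = pd.DataFrame({"age": [3, 22, 51, 90]})

bins = [0, 18, 65, 120]            # hypothetical band edges
labels = ["0-17", "18-64", "65+"]  # hypothetical band labels

# right=False makes each band include its lower edge and exclude its upper edge.
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)
```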

In [21]:
EDdata.run_tests()

--------------------
Finding missing columns...
depart_method try using:  
minutes_today try using:  
minutes_tomo try using:  
--------------------
Finding columns with wrong datatypes...
Col  arrive_mode                is: float64 . Expected any of:  [<class 'object'>]
Col  adm_flag                   is: int64 . Expected any of:  [<class 'int'>]
Col  stream                     is: object . Expected any of:  [<class 'float'>, <class 'int'>, <class 'numpy.int64'>, <class 'str'>]
Col  depart_dayofweek_name      is: object . Expected any of:  [<class 'str'>]
Col  age_group                  is: category . Expected any of:  [<class 'str'>, 'pandas category type']


## saving
#### ioED (and flosp) will automate the structure of your saving folder in the root dir you provided when you created the EDdata instance of ioED

In [26]:
EDdata.save_path

'./processed/myhosp/'
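The folder layout can be pictured with a small `pathlib` sketch. The layout is inferred from the save messages printed in this notebook, and `build_save_paths` is a hypothetical helper, not part of flosp.

```python
from pathlib import Path


def build_save_paths(root: str, name: str) -> dict:
    """Return the assumed locations of the raw and cleaned pickle files."""
    base = Path(root) / "processed" / name
    return {
        "base": base,
        "raw": base / "RAW" / f"{name}ED.pkl",
        "clean": base / f"{name}ED.pkl",
    }


paths = build_save_paths("./", "myhosp")
```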

#### at any stage, save your data out to come back to later

In [22]:
EDdata.saveRAWasRAW()

----------------------------------------
saved file: ./processed/myhosp/RAW/myhospED.pkl


#### once your data is cleaned, save cleaned data out to .pkl file

In [23]:
EDdata.saveRAWasCLEAN()

----------------------------------------
saved file: ./processed/myhosp/myhospED.pkl


#### load your data back from cleaned file again if you are using ioED:

In [24]:
EDdata.loadPKLasRAW()

----------------------------------------
loaded file: ./processed/myhosp/myhospED.pkl
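If you want to read a saved file without going through ioED, a pickle round trip in plain pandas looks roughly like this. The sketch writes to a temporary directory so it does not touch your project folders; the file name follows the pattern shown in the messages above.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    "hosp_patid": [2493, 4822],
    "arrive_datetime": pd.to_datetime(["2013-01-08 19:44", "2013-07-03 09:34"]),
})

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "processed" / "myhosp" / "myhospED.pkl"
    target.parent.mkdir(parents=True, exist_ok=True)  # create folders if needed
    df.to_pickle(target)
    restored = pd.read_pickle(target)
```

Pickle keeps the column dtypes, including datetimes, so nothing needs to be converted again after loading.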
