## Installing

You can install the module with pip:

```
pip install co-connect-tools
```
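To confirm that the install succeeded, one option is to query the installed version from Python. This is a minimal sketch using only the standard library. The helper name `installed_version` is hypothetical, and the sketch assumes the distribution is registered under the same name that was passed to pip above.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("co-connect-tools"))
```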



## Start the Tool

To start the `ETLTool`, import it from the `coconnect` module we just installed:

In [1]:
import coconnect
etltool = coconnect.ETLTool()

2021-02-15 16:01:52 - ETLTool - INFO - Starting the tool


## Load Inputs

To run the tool, you need to load some input datasets and specify how to map their fields.

The data is loaded into pandas dataframes, which we also use to preview what the input `csv` files look like.

### Source data

The source data is a synthetic set of patient records produced by [OHDSI](http://ohdsi.org/).

_Note: these example data files are stored in `<install_dir>/lib/python3.8/site-packages/coconnect/`, a directory that `ETLTool` searches. For your own files, specify the full path to the inputs._

In [2]:
f_input_data = 'sample_input_data/patients_sample.csv'
etltool.load_input_data([f_input_data])

2021-02-15 16:01:52 - ETLTool - INFO - found the following input tables: ['patients_sample']


Verify which files have been loaded. By default, each input dataset is mapped to a name taken from its path, `/path/<name>.csv`:

In [3]:
etltool.get_input_names()

['patients_sample']
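The naming convention described above can be sketched with the standard library. This is an illustration only: it assumes the name is simply the file name without its directory and extension, and the helper `input_name` is hypothetical rather than part of `coconnect`.

```python
from pathlib import Path


def input_name(path):
    """Derive a table name from a file path by dropping the directory and extension."""
    return Path(path).stem


print(input_name("sample_input_data/patients_sample.csv"))  # patients_sample
```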

Sample three entries to see what this input data looks like. __Note:__ be careful using this method with a large dataset.

In [4]:
df_input = etltool.get_input_df('patients_sample')
df_input.sample(3)

| | ID | BIRTHDATE | DEATHDATE | SSN | DRIVERS | PASSPORT | PREFIX | FIRST | LAST | SUFFIX | MAIDEN | MARITAL | RACE | ETHNICITY | GENDER | BIRTHPLACE | ADDRESS | CITY | STATE | ZIP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 525fdbdc-959e-472f-8986-d1a492c89d45 | 1966-02-11 | | 999-76-3812 | S99987691 | X37872486X | Mrs. | Marvel321 | Turner526 | | Gutkowski940 | M | white | swedish | F | Pembroke | 348 Jenkins Branch Suite 7 | West Bridgewater | Massachusetts | 2379 |
| 1 | b4339f80-9313-4437-8664-cffdca3c5e9a | 2002-10-25 | | 999-93-2628 | S99955993 | | | Cassandra224 | Hessel84 | | | | white | italian | F | Holbrook | 355 Quitzon Run Unit 44 | Palmer Town | Massachusetts | 1009 |
| 13 | b9009ab0-2d91-49a1-86e4-a5160331c51a | 1980-09-27 | | 999-25-1618 | S99935275 | X76909937X | Mr. | Weldon459 | Schroeder447 | | | S | white | irish | M | Wilmington | 1051 Hessel Skyway Apt 11 | Rockland | Massachusetts | 2370 |


### Structural Mapping

Next we use another `csv` file to define how to map different fields in the source data to a [Common Data Model (CDM)](https://www.ohdsi.org/data-standardization/the-common-data-model/).

In this example, the source data (`patients_sample`) is being mapped to the `person` CDM.

There are three rules defined:
1. A straight one-to-one mapping from the `id` field in the source data to the `person_id` field of the `person` CDM.
2. A mapping that applies the operation/function `extract year`.
3. A term mapping, which is defined in the term mapping `csv` file; see the Term Mapping section for more information.


In [5]:
f_structural_mapping = 'sample_input_data/rules1.csv'
etltool.load_structural_mapping(f_structural_mapping) 
etltool.get_structural_mapping_df()

2021-02-15 16:01:52 - ETLTool - INFO - Loaded the structural mapping with 3 rules


| destination_table | rule_id | destination_field | source_table | source_field | term_mapping | coding_system | operation |
|---|---|---|---|---|---|---|---|
| person | 0 | person_id | patients_sample | id | n | user defined | n |
| person | 1 | year_of_birth | patients_sample | birthdate | n | user defined | extract year |
| person | 2 | gender_concept_id | patients_sample | gender | y | user defined | n |
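To make the rules more concrete, here is a rough sketch of how rules like these could be applied to a single source record in plain Python. It does not reproduce the tool's internals, and the values the tool writes may differ. The `operations` registry, the `apply_rules` helper, and the case-insensitive field matching are all assumptions made for illustration. Term mapping is left out, so `gender` passes through unchanged.

```python
# Rules transcribed from the structural mapping table above.
rules = [
    {"destination_field": "person_id", "source_field": "id", "operation": "n"},
    {"destination_field": "year_of_birth", "source_field": "birthdate", "operation": "extract year"},
    {"destination_field": "gender_concept_id", "source_field": "gender", "operation": "n"},
]

# Hypothetical registry of operations; "n" means no operation is applied.
operations = {
    "n": lambda value: value,
    "extract year": lambda value: int(value[:4]),  # assumes YYYY-MM-DD strings
}


def apply_rules(record, rules):
    """Build one destination row from one source record."""
    # Match field names case-insensitively: the rules use lower case,
    # while the sample CSV headers are upper case.
    lowered = {key.lower(): value for key, value in record.items()}
    return {
        rule["destination_field"]: operations[rule["operation"]](lowered[rule["source_field"]])
        for rule in rules
    }


source = {"ID": "b4339f80-9313-4437-8664-cffdca3c5e9a", "BIRTHDATE": "2002-10-25", "GENDER": "F"}
print(apply_rules(source, rules))
```

Keeping the operations in a dictionary keyed by the name used in the rules file means a new operation can be added without changing `apply_rules`.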


#### Testing operations
The second rule uses the operation `extract year`, a default operation defined in `etltool`. Here is a quick example of how it works.

Load the function:

In [6]:
fn_extract_year = etltool.allowed_operations['extract year']
fn_extract_year

<bound method ETLTool.get_year_from_date of <coconnect.etltool.ETLTool object at 0x102f2d9d0>>

For example, take the `BIRTHDATE` column, which looks like:

In [7]:
df_input['BIRTHDATE'].head(4)

0    2002-08-11
1    2002-10-25
2    1990-02-24
3    1966-02-11
Name: BIRTHDATE, dtype: object

The function can be used to extract the year from the date:

In [8]:
fn_extract_year(df_input.head(4),column='BIRTHDATE')

0    2002
1    2002
2    1990
3    1966
Name: BIRTHDATE, dtype: int64
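For readers without the tool to hand, a comparable result can be obtained with pandas alone. This is a sketch of one possible implementation, not the body of `ETLTool.get_year_from_date`; the function name `extract_year` is hypothetical, and it assumes the column holds date strings that `pd.to_datetime` can parse.

```python
import pandas as pd


def extract_year(df, column):
    """Parse a column of date strings and return the year as an integer Series."""
    return pd.to_datetime(df[column]).dt.year


df = pd.DataFrame({"BIRTHDATE": ["2002-08-11", "2002-10-25", "1990-02-24", "1966-02-11"]})
print(extract_year(df, column="BIRTHDATE").tolist())  # [2002, 2002, 1990, 1966]
```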

### Term Mapping

The term mapping is keyed on the structural mapping `rule_id` and tells us how to map a source term to a destination term. For example, if the source term is `M`, the output should be `8507`.

In [9]:
f_term_mapping = 'sample_input_data/rules2.csv'
etltool.load_term_mapping(f_term_mapping)
etltool.get_term_mapping_df()

2021-02-15 16:01:52 - ETLTool - INFO - Loaded the term mapping with 2 rules


| rule_id | source_term | destination_term |
|---|---|---|
| 2 | M | 8507 |
| 2 | F | 8532 |
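The effect of this term mapping can be illustrated with a plain pandas lookup. This is a minimal sketch assuming the mapping for rule 2 is held in a dictionary; it is not how `ETLTool` necessarily applies it, and the variable names are made up for the example.

```python
import pandas as pd

# Term mapping for rule 2, transcribed from the table above.
term_map = {"M": 8507, "F": 8532}

gender = pd.Series(["F", "F", "M"], name="GENDER")

# Series.map looks each value up in the dict; values with no entry become NaN.
gender_concept_id = gender.map(term_map).rename("gender_concept_id")
print(gender_concept_id.tolist())  # [8532, 8532, 8507]
```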


### Run the tool

In [10]:
etltool.run()

2021-02-15 16:01:52 - ETLTool - INFO - Creating an output data folder: ./data/
2021-02-15 16:01:52 - ETLTool - INFO - Destination tables to create... ['person']
2021-02-15 16:01:52 - ETLTool - INFO - Done with tool initialisation...
2021-02-15 16:01:52 - ETLTool - INFO - Starting ETL to CDM
2021-02-15 16:01:52 - ETLTool - INFO - Now running on Table "person"
2021-02-15 16:01:52 - ETLTool - INFO - Loaded the CDM for this table which has the following fields..
2021-02-15 16:01:52 - ETLTool - INFO - The CDM for "person" has 18, you have mapped 3 leaving 15 fields unmapped
2021-02-15 16:01:52 - ETLTool - INFO - Working on person_id
2021-02-15 16:01:52 - ETLTool - INFO - Working on year_of_birth
2021-02-15 16:01:52 - ETL

Finally, we can get the output as a dataframe:

In [11]:
etltool.get_output_df()

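| | person_id | gender_concept_id | year_of_birth | month_of_birth | day_of_birth | birth_datetime | race_concept_id | ethnicity_concept_id | location_id | provider_id | care_site_id | person_source_value | gender_source_value | gender_source_concept_id | race_source_value | race_source_concept_id | ethnicity_source_value | ethnicity_source_concept_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 8532 | 2002 | | | | | | | | | | | | | | | |
| 1 | 1 | 8532 | 2002 | | | | | | | | | | | | | | | |
| 2 | 2 | 8507 | 1990 | | | | | | | | | | | | | | | |
| 3 | 3 | 8532 | 1966 | | | | | | | | | | | | | | | |
| 4 | 4 | 8532 | 1963 | | | | | | | | | | | | | | | |
| 5 | 5 | 8507 | 1984 | | | | | | | | | | | | | | | |
| 6 | 6 | 8532 | 1960 | | | | | | | | | | | | | | | |
| 7 | 7 | 8507 | 1951 | | | | | | | | | | | | | | | |
| 8 | 8 | 8507 | 1970 | | | | | | | | | | | | | | | |
| 9 | 9 | 8507 | 1969 | | | | | | | | | | | | | | | |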
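Most columns in this output are empty because only three of the CDM fields were mapped. One way to list the unmapped columns is to look for those in which every value is missing. The sketch below uses a small stand-in dataframe rather than the real output, and the names `df_output` and `unmapped` are illustrative.

```python
import pandas as pd

# A small stand-in for the output table: three mapped columns plus two unmapped ones.
df_output = pd.DataFrame(
    {
        "person_id": [0, 1, 2],
        "gender_concept_id": [8532, 8532, 8507],
        "year_of_birth": [2002, 2002, 1990],
        "month_of_birth": [None, None, None],
        "day_of_birth": [None, None, None],
    }
)

# Columns in which every value is missing were not filled by any rule.
unmapped = [col for col in df_output.columns if df_output[col].isna().all()]
print(unmapped)  # ['month_of_birth', 'day_of_birth']
```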
