This workbook shows how different types of input data can be manipulated manually and loaded into `pandas` dataframes, which are subsequently used by the `CommonDataModel`.

Importing packages:

In [1]:
import carrot
import glob
import pandas as pd
import os
from sqlalchemy import create_engine

## CSV Files

Create a map from each CSV filename to a `pandas` reader for that file:

__note__: `iterator=True` tells pandas not to read the data into memory straight away, but to set up a `parsers.TextFileReader` instead.
          Specifying `chunksize=<value>` also returns an iterator, allowing for easy looping over chunks of data.

In [2]:
df_map = {
            os.path.basename(x):pd.read_csv(x,iterator=True) 
            for x in glob.glob('../data/part1/*.csv')
         }
df_map

{'Blood_Test.csv': <pandas.io.parsers.readers.TextFileReader at 0x1111bc1c0>,
 'Demographics.csv': <pandas.io.parsers.readers.TextFileReader at 0x1111bc430>,
 'GP_Records.csv': <pandas.io.parsers.readers.TextFileReader at 0x1111bc2e0>,
 'Hospital_Visit.csv': <pandas.io.parsers.readers.TextFileReader at 0x1111bc730>,
 'Serology.csv': <pandas.io.parsers.readers.TextFileReader at 0x1111bc970>,
 'Symptoms.csv': <pandas.io.parsers.readers.TextFileReader at 0x1111bcbe0>,
 'Vaccinations.csv': <pandas.io.parsers.readers.TextFileReader at 0x1111bcdf0>,
 'pks.csv': <pandas.io.parsers.readers.TextFileReader at 0x10d2f7400>}
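The lazy-loading behaviour described in the note can be tried on a small in-memory CSV. The following is a minimal sketch: the data in `csv_text` is made up for illustration and is not taken from the example dataset.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for a CSV file (hypothetical data).
csv_text = (
    "ID,Age,Sex\n"
    "pk1,57,Male\n"
    "pk2,68,Female\n"
    "pk3,78,Female\n"
    "pk4,51,Female\n"
    "pk5,51,Male\n"
)

# iterator=True returns a TextFileReader; rows are parsed only on request.
reader = pd.read_csv(io.StringIO(csv_text), iterator=True)
first_two = reader.get_chunk(2)   # parse the first two rows
next_three = reader.get_chunk(3)  # parse the following three rows

# chunksize=<value> also returns a TextFileReader, which can be looped over.
chunk_lengths = [
    len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]
print(chunk_lengths)
```

Each chunk is an ordinary dataframe, so only one chunk needs to be held in memory at a time.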

Create a `carrot.io.LocalDataCollection` object to store the dataframes:

In [3]:
csv_inputs = carrot.io.LocalDataCollection()
csv_inputs.load_input_dataframe(df_map)
csv_inputs

2022-06-17 15:11:44 - LocalDataCollection - INFO - DataCollection Object Created
2022-06-17 15:11:44 - LocalDataCollection - INFO - Registering  Blood_Test.csv [<carrot.io.common.DataBrick object at 0x10d2f7c70>]
2022-06-17 15:11:44 - LocalDataCollection - INFO - Registering  Demographics.csv [<carrot.io.common.DataBrick object at 0x1111bc190>]
2022-06-17 15:11:44 - LocalDataCollection - INFO - Registering  GP_Records.csv [<carrot.io.common.DataBrick object at 0x1111bc280>]
2022-06-17 15:11:44 - LocalDataCollection - INFO - Registering  Hospital_Visit.csv [<carrot.io.common.DataBrick object at 0x10d2bb2b0>]
2022-06-17 15:11:44 - LocalDataCollection - INFO - Registering  Serology.csv [<carrot.io.common.DataBrick object at 0x10d2f74f0>]
2022-06-17 15:11:44 - LocalDataCollection - INFO - R

<carrot.io.plugins.local.LocalDataCollection at 0x10d2f7700>

Check to see what data has been loaded:

In [4]:
csv_inputs.keys()

dict_keys(['Blood_Test.csv', 'Demographics.csv', 'GP_Records.csv', 'Hospital_Visit.csv', 'Serology.csv', 'Symptoms.csv', 'Vaccinations.csv', 'pks.csv'])

## SQL 

The following shows how these objects can be used to write the csv files from the input collection to a SQL database.

In [5]:
sql_store = carrot.io.SqlDataCollection(connection_string="postgresql://localhost:5432/ExampleCOVID19DataSet",
                                          drop_existing=True)
sql_store

2022-06-17 15:11:44 - SqlDataCollection - INFO - DataCollection Object Created
2022-06-17 15:11:45 - SqlDataCollection - INFO - Engine(postgresql://localhost:5432/ExampleCOVID19DataSet)


<carrot.io.plugins.sql.SqlDataCollection at 0x1111e0520>

Loop over all the inputs, retrieve each loaded dataframe from the input collection, and use the SQL store to write it to the SQL database:

In [6]:
for name in csv_inputs.keys():
    df = csv_inputs[name]
    name = name.split(".")[0]
    sql_store.write(name,df)

2022-06-17 15:11:45 - LocalDataCollection - INFO - Retrieving initial dataframe for 'Blood_Test.csv' for the first time
2022-06-17 15:11:45 - SqlDataCollection - INFO - updating Blood_Test in Engine(postgresql://localhost:5432/ExampleCOVID19DataSet)
2022-06-17 15:11:45 - SqlDataCollection - INFO - finished save to psql
2022-06-17 15:11:45 - LocalDataCollection - INFO - Retrieving initial dataframe for 'Demographics.csv' for the first time
2022-06-17 15:11:45 - SqlDataCollection - INFO - updating Demographics in Engine(postgresql://localhost:5432/ExampleCOVID19DataSet)
2022-06-17 15:11:45 - SqlDataCollection - INFO - finished save to psql
2022-06-17 15:11:45 - LocalDataCollection - INFO - Retrieving initial dataframe for 'GP_Records.csv' for the first time
2022-06-17 15:11:45 - 
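The loop above derives each table name with `name.split(".")[0]`, which works for the file names used here but would truncate a name that contains more than one dot. A slightly more defensive alternative is sketched below; the helper `table_name_from_filename` is hypothetical and not part of `carrot`.

```python
import os

def table_name_from_filename(filename: str) -> str:
    """Strip the directory and only the final extension from a file name."""
    return os.path.splitext(os.path.basename(filename))[0]

# Same result as split(".")[0] for the files in this workbook.
print(table_name_from_filename("Blood_Test.csv"))

# Differs when the stem itself contains a dot (hypothetical file name).
print(table_name_from_filename("../data/part1/cohort.v2.csv"))
print("cohort.v2.csv".split(".")[0])
```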

Now we can use pandas to test the SQL database we created, and load in some filtered data:

In [7]:
connection_string="postgresql://localhost:5432/ExampleCOVID19DataSet"
engine = create_engine(connection_string)

Retrieve a filtered pandas dataframe from the SQL connection

In [8]:
df_demo = pd.read_sql('SELECT * FROM "Demographics" LIMIT 1000;',con=engine)
df_demo

Unnamed: 0,ID,Age,Sex
0,pk1,57.0,Male
1,pk2,68.0,Female
2,pk3,78.0,Female
3,pk4,51.0,Female
4,pk5,51.0,Male
...,...,...,...
995,pk996,76.0,Female
996,pk997,62.0,Male
997,pk998,54.0,Female
998,pk999,63.0,Male


Use a more complex SQL command to filter the Serology table based on information in the demographics table, creating a pandas dataframe object.

In [9]:
sql_command = r'''
SELECT 
    * 
FROM "Serology" 
WHERE "ID" in (
    SELECT 
        "ID" 
    FROM "Demographics" 
    LIMIT 1000
    )
'''
df_serology = pd.read_sql(sql_command,con=engine)
df_serology

Unnamed: 0,ID,Date,IgG
0,pk654,2020-10-03,17.172114692899758
1,pk460,2020-11-02,201.93861878809216
2,pk12,20223-11-08,a10.601377479381105
3,pk987,2021-07-26,11.506250956970998
4,pk700,2021-10-29,2.6594057121417487
...,...,...,...
410,pk190,2022-11-07,51.77573831029082
411,pk890,2022-09-07,57.11515081936336
412,pk51,2022-11-07,15.264660709568151
413,pk263,2019-11-13,26.051354325968106
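The same subquery pattern can be tried without a running PostgreSQL server. The sketch below uses Python's built-in `sqlite3` module with a few made-up rows; the table contents are illustrative only. It also adds an `ORDER BY` to the subquery, because a `LIMIT` without an ordering does not guarantee which rows are returned.

```python
import sqlite3

# In-memory database with two small, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript('''
CREATE TABLE "Demographics" ("ID" TEXT, "Age" REAL, "Sex" TEXT);
CREATE TABLE "Serology" ("ID" TEXT, "Date" TEXT, "IgG" REAL);
INSERT INTO "Demographics" VALUES
    ('pk1', 57, 'Male'), ('pk2', 68, 'Female'), ('pk3', 78, 'Female');
INSERT INTO "Serology" VALUES
    ('pk1', '2020-10-03', 17.2), ('pk3', '2021-07-26', 11.5),
    ('pk2', '2020-11-02', 201.9), ('pk1', '2021-10-29', 2.7);
''')

# Keep only serology rows whose ID is among the first two demographics IDs.
sql_command = '''
SELECT *
FROM "Serology"
WHERE "ID" IN (
    SELECT "ID"
    FROM "Demographics"
    ORDER BY "ID"
    LIMIT 2
)
ORDER BY "Date"
'''
rows = conn.execute(sql_command).fetchall()
for row in rows:
    print(row)
```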


Build a new LocalDataCollection from the dataframes pulled from SQL and loaded in memory:

In [10]:
sql_inputs = carrot.io.LocalDataCollection()
sql_inputs.load_input_dataframe({'Serology.csv':df_serology,'Demographics.csv':df_demo})
sql_inputs

2022-06-17 15:11:46 - LocalDataCollection - INFO - DataCollection Object Created
2022-06-17 15:11:46 - LocalDataCollection - INFO - Registering  Serology.csv [<carrot.io.common.DataBrick object at 0x111650fa0>]
2022-06-17 15:11:46 - LocalDataCollection - INFO - Registering  Demographics.csv [<carrot.io.common.DataBrick object at 0x1112114f0>]


<carrot.io.plugins.local.LocalDataCollection at 0x111650fd0>

Load some rules, then remove any that refer to missing source tables, since we are only dealing with two tables and only want to apply the rules associated with them:

In [11]:
rules = carrot.tools.load_json("../data/rules.json")
rules = carrot.tools.remove_missing_sources_from_rules(rules,sql_inputs.keys())
rules



{'metadata': {'date_created': '2022-02-12T12:22:48.465257',
  'dataset': 'FAILED: ExampleV4'},
 'cdm': {'person': {'MALE 3025': {'birth_datetime': {'source_table': 'Demographics.csv',
     'source_field': 'Age',
     'operations': ['get_datetime_from_age']},
    'gender_concept_id': {'source_table': 'Demographics.csv',
     'source_field': 'Sex',
     'term_mapping': {'Male': 8507}},
    'gender_source_concept_id': {'source_table': 'Demographics.csv',
     'source_field': 'Sex',
     'term_mapping': {'Male': 8507}},
    'gender_source_value': {'source_table': 'Demographics.csv',
     'source_field': 'Sex'},
    'person_id': {'source_table': 'Demographics.csv', 'source_field': 'ID'}},
   'FEMALE 3026': {'birth_datetime': {'source_table': 'Demographics.csv',
     'source_field': 'Age',
     'operations': ['get_datetime_from_age']},
    'gender_concept_id': {'source_table': 'Demographics.csv',
     'source_field': 'Sex',
     'term_mapping': {'Female': 8532}},
    'gender_source_concept_i
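To make the filtering step more concrete, here is a rough sketch of what removing rules for missing source tables could look like on a dictionary shaped like the output above. This is not the implementation of `carrot.tools.remove_missing_sources_from_rules`; the function `keep_rules_for_tables` and the rules in `example_rules` are hypothetical and only mirror the nesting seen in the printed rules.

```python
import copy

def keep_rules_for_tables(rules: dict, available_tables) -> dict:
    """Return a copy of `rules` without objects that need unavailable tables."""
    available = set(available_tables)
    filtered = copy.deepcopy(rules)  # leave the caller's rules untouched
    for cdm_table, objects in list(filtered["cdm"].items()):
        for object_name, fields in list(objects.items()):
            sources = {field["source_table"] for field in fields.values()}
            if not sources <= available:
                del objects[object_name]
        if not objects:  # drop CDM tables left with no objects
            del filtered["cdm"][cdm_table]
    return filtered

# Hypothetical rules, following the nesting shown in the output above.
example_rules = {
    "metadata": {"dataset": "Example"},
    "cdm": {
        "person": {
            "MALE 3025": {
                "person_id": {"source_table": "Demographics.csv", "source_field": "ID"},
            },
        },
        "observation": {
            "Antibody 3027": {
                "person_id": {"source_table": "Serology.csv", "source_field": "ID"},
            },
        },
        "condition_occurrence": {
            "Headache 3028": {
                "person_id": {"source_table": "Symptoms.csv", "source_field": "ID"},
            },
        },
    },
}

filtered = keep_rules_for_tables(example_rules, ["Demographics.csv", "Serology.csv"])
print(sorted(filtered["cdm"]))
```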

Create a common data model object and process it to create CDM tables

In [12]:
cdm = carrot.cdm.CommonDataModel.from_rules(rules,inputs=sql_inputs)
cdm.process()

2022-06-17 15:11:47 - CommonDataModel - INFO - CommonDataModel (5.3.1) created with co-connect-tools version 0.0.0
2022-06-17 15:11:47 - CommonDataModel - INFO - Running with an DataCollection object
2022-06-17 15:11:47 - CommonDataModel - INFO - Turning on automatic cdm column filling
2022-06-17 15:11:47 - CommonDataModel - INFO - Added MALE 3025 of type person
2022-06-17 15:11:47 - CommonDataModel - INFO - Added FEMALE 3026 of type person
2022-06-17 15:11:47 - CommonDataModel - INFO - Added Antibody 3027 of type observation
2022-06-17 15:11:47 - CommonDataModel - INFO - Starting processing in order: ['person', 'observation']
2022-06-17 15:11:47 - CommonDataModel - INFO - Number of objects to process for each table...
{
      "person": 2,
      "observation

could not convert string to float: 'na'
could not convert string to float: 'na'


2022-06-17 15:11:47 - Person - INFO - Mapped person_id
2022-06-17 15:11:47 - Person - INFO - Automatically formatting data columns.
2022-06-17 15:11:47 - Person - INFO - created df (0x111608820)[FEMALE_3026]
2022-06-17 15:11:47 - CommonDataModel - INFO - finished FEMALE 3026 (0x111608820) ... 2/2 completed, 435 rows
2022-06-17 15:11:47 - CommonDataModel - INFO - called save_dateframe but outputs are not defined. save_files: True
2022-06-17 15:11:47 - CommonDataModel - INFO - finalised person on iteration 0 producing 996 rows from 2 tables
2022-06-17 15:11:47 - LocalDataCollection - INFO - Getting next chunk of data
2022-06-17 15:11:47 - LocalDataCollection - INFO - All input files for this object have now been used.
2022-06-17 15:11:47 - LocalD

In [13]:
cdm['person'].dropna(axis=1)

person_id,gender_concept_id,year_of_birth,month_of_birth,day_of_birth,birth_datetime,gender_source_value,gender_source_concept_id
1,8507,1963,7,16,1963-07-16 00:00:00.000000,Male,8507
2,8507,1969,7,14,1969-07-14 00:00:00.000000,Male,8507
3,8507,1956,7,17,1956-07-17 00:00:00.000000,Male,8507
4,8507,1960,7,16,1960-07-16 00:00:00.000000,Male,8507
5,8507,1962,7,16,1962-07-16 00:00:00.000000,Male,8507
...,...,...,...,...,...,...,...
992,8532,1995,7,8,1995-07-08 00:00:00.000000,Female,8532
993,8532,1956,7,17,1956-07-17 00:00:00.000000,Female,8532
994,8532,1944,7,20,1944-07-20 00:00:00.000000,Female,8532
995,8532,1966,7,15,1966-07-15 00:00:00.000000,Female,8532


In [14]:
cdm['observation'].dropna(axis=1)

observation_id,person_id,observation_concept_id,observation_date,observation_datetime,observation_source_value,observation_source_concept_id
1,357,4288455,2020-10-03,2020-10-03 00:00:00.000000,17.172114692899758,4288455
2,258,4288455,2020-11-02,2020-11-02 00:00:00.000000,201.93861878809216,4288455
4,556,4288455,2021-07-26,2021-07-26 00:00:00.000000,11.506250956970998,4288455
5,380,4288455,2021-10-29,2021-10-29 00:00:00.000000,2.6594057121417487,4288455
6,415,4288455,2021-09-07,2021-09-07 00:00:00.000000,40.844873593089126,4288455
...,...,...,...,...,...,...
411,641,4288455,2022-11-07,2022-11-07 00:00:00.000000,51.77573831029082,4288455
412,492,4288455,2022-09-07,2022-09-07 00:00:00.000000,57.11515081936336,4288455
413,31,4288455,2022-11-07,2022-11-07 00:00:00.000000,15.264660709568151,4288455
414,672,4288455,2019-11-13,2019-11-13 00:00:00.000000,26.051354325968106,4288455


## PySpark 

Using `PySpark` we can create a session and a reader to connect to the same SQL database we created above

In [15]:
from pyspark.sql import SparkSession

Define the session:

In [16]:
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars", "/Users/calummacdonald/Downloads/postgresql-42.3.1.jar") \
    .getOrCreate()

Create a reader:

In [17]:
reader = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/ExampleCOVID19DataSet") \
    .option("driver", "org.postgresql.Driver") 
reader

<pyspark.sql.readwriter.DataFrameReader at 0x1119ef910>

Create and load a spark dataframe for the Demographics table, filter it to people under the age of 50, and keep only the first 300 rows:

In [18]:
sdf_demo = reader.option("dbtable", '"Demographics"')\
                 .load()

sdf_demo = sdf_demo.filter(sdf_demo.Age<50).limit(300)
sdf_demo.count()

300

Select the first 100 rows:

In [19]:
sdf_demo_first = sdf_demo.limit(100)

Drop the first 100 rows by subtracting them from the dataframe, then keep the next 100:

In [20]:
sdf_demo = sdf_demo.subtract(sdf_demo_first).limit(100)
sdf_demo.count()

100
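Spark's `subtract` is a set difference (equivalent to `EXCEPT DISTINCT` in SQL), so it removes rows by value rather than by position and collapses duplicates. The plain-Python sketch below imitates that behaviour on a handful of made-up rows. The helper `except_distinct` is hypothetical, and unlike Spark it preserves the input order, which Spark does not guarantee.

```python
def except_distinct(left, right):
    """Rows of `left` that are not in `right`, with duplicates collapsed."""
    excluded = set(right)
    # dict.fromkeys keeps first-seen order while dropping duplicate rows.
    return [row for row in dict.fromkeys(left) if row not in excluded]

# Hypothetical rows; note that ("pk2", 34.0) appears twice.
rows = [("pk1", 12.0), ("pk2", 34.0), ("pk2", 34.0), ("pk3", 45.0), ("pk4", 19.0)]
first_two = rows[:2]

rest = except_distinct(rows, first_two)
print(rest)
```

The second copy of `("pk2", 34.0)` is removed as well, even though it was not among the first two rows by position. With unique rows, as in a table keyed on `ID`, the subtraction behaves like dropping the selected rows.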

Load the serology table, selecting only those rows whose ID is in the already loaded spark dataframe for the demographics:

In [21]:
sdf_serology = reader.option("dbtable", '"Serology"')\
                     .load()

sdf_serology = sdf_serology.join(sdf_demo,
                                 ['ID'])\
                            .select(*sdf_serology.columns)
                         
sdf_serology.count()

52
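Joining on `ID` and then selecting only the serology columns acts as a filter on the serology table, provided each `ID` appears at most once in the demographics dataframe. The pandas sketch below illustrates the equivalence on made-up data; it is an analogy for the Spark join above, not a replacement for it.

```python
import pandas as pd

# Hypothetical data: two people, four serology measurements.
demo = pd.DataFrame(
    {"ID": ["pk1", "pk2"], "Age": [31.0, 44.0], "Sex": ["Male", "Female"]}
)
serology = pd.DataFrame(
    {
        "ID": ["pk1", "pk3", "pk2", "pk1"],
        "Date": ["2020-10-03", "2021-07-26", "2020-11-02", "2021-10-29"],
        "IgG": [17.2, 11.5, 201.9, 2.7],
    }
)

# Inner join on ID, then keep only the serology columns (as in the Spark cell).
joined = serology.merge(demo, on="ID", how="inner")[list(serology.columns)]

# The same rows, expressed directly as a membership filter.
filtered = serology[serology["ID"].isin(demo["ID"])].reset_index(drop=True)

print(joined)
print(filtered)
```

If an `ID` were duplicated in the demographics dataframe, the join would duplicate the matching serology rows, while the membership filter would not.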

Retrieve pandas dataframes from these spark dataframes and put them in a new map.
_note_: we keep the '.csv' suffix in the names because that is how the tables are named in the rules file!

In [22]:
df_map = {
            'Demographics.csv': sdf_demo.select('*').toPandas(),
            'Serology.csv': sdf_serology.select('*').toPandas()
         }

In [23]:
spark_inputs = carrot.io.LocalDataCollection()
spark_inputs.load_input_dataframe(df_map)
spark_inputs

2022-06-17 15:12:05 - LocalDataCollection - INFO - DataCollection Object Created
2022-06-17 15:12:05 - LocalDataCollection - INFO - Registering  Demographics.csv [<carrot.io.common.DataBrick object at 0x11188c910>]
2022-06-17 15:12:05 - LocalDataCollection - INFO - Registering  Serology.csv [<carrot.io.common.DataBrick object at 0x10d214c10>]


<carrot.io.plugins.local.LocalDataCollection at 0x11188c970>

In [24]:
cdm = carrot.cdm.CommonDataModel.from_rules(rules,inputs=spark_inputs)
cdm.process()

2022-06-17 15:12:05 - CommonDataModel - INFO - CommonDataModel (5.3.1) created with co-connect-tools version 0.0.0
2022-06-17 15:12:05 - CommonDataModel - INFO - Running with an DataCollection object
2022-06-17 15:12:05 - CommonDataModel - INFO - Turning on automatic cdm column filling
2022-06-17 15:12:05 - CommonDataModel - INFO - Added MALE 3025 of type person
2022-06-17 15:12:05 - CommonDataModel - INFO - Added FEMALE 3026 of type person
2022-06-17 15:12:05 - CommonDataModel - INFO - Added Antibody 3027 of type observation
2022-06-17 15:12:05 - CommonDataModel - INFO - Starting processing in order: ['person', 'observation']
2022-06-17 15:12:05 - CommonDataModel - INFO - Number of objects to process for each table...
{
      "person": 2,
      "observation

In [25]:
cdm['observation'].dropna(axis=1)

observation_id,person_id,observation_concept_id,observation_date,observation_datetime,observation_source_value,observation_source_concept_id
1,7,4288455,2021-04-12,2021-04-12 00:00:00.000000,67.58837665287089,4288455
2,70,4288455,2020-07-26,2020-07-26 00:00:00.000000,0.6408428671070668,4288455
3,30,4288455,2020-04-08,2020-04-08 00:00:00.000000,7.11704584051039,4288455
4,47,4288455,2020-04-05,2020-04-05 00:00:00.000000,51.60608444799083,4288455
5,14,4288455,2022-11-23,2022-11-23 00:00:00.000000,33.520886653263354,4288455
6,99,4288455,2022-10-21,2022-10-21 00:00:00.000000,30.00234968904614,4288455
7,7,4288455,2020-12-09,2020-12-09 00:00:00.000000,44.98630030598384,4288455
8,45,4288455,2021-06-03,2021-06-03 00:00:00.000000,1.356998868542723,4288455
9,44,4288455,2021-07-29,2021-07-29 00:00:00.000000,4.280139762594507,4288455
10,49,4288455,2021-08-21,2021-08-21 00:00:00.000000,38.4800593377047,4288455


In [26]:
cdm['person'].dropna(axis=1)

person_id,gender_concept_id,year_of_birth,month_of_birth,day_of_birth,birth_datetime,gender_source_value,gender_source_concept_id
1,8507,1972,7,13,1972-07-13 00:00:00.000000,Male,8507
2,8507,1979,7,12,1979-07-12 00:00:00.000000,Male,8507
3,8507,1982,7,11,1982-07-11 00:00:00.000000,Male,8507
4,8507,2012,7,3,2012-07-03 00:00:00.000000,Male,8507
5,8507,1973,7,13,1973-07-13 00:00:00.000000,Male,8507
...,...,...,...,...,...,...,...
96,8532,1972,7,13,1972-07-13 00:00:00.000000,Female,8532
97,8532,1977,7,12,1977-07-12 00:00:00.000000,Female,8532
98,8532,1977,7,12,1977-07-12 00:00:00.000000,Female,8532
99,8532,1996,7,7,1996-07-07 00:00:00.000000,Female,8532
