# Week 03 Assignment Covid

New types of data and new data science technologies enable new research. Examples of such technologies are the ability to combine existing data and the ability to generate synthetic data from existing knowledge. This week's case is based on such research. The data is generated by Synthea's COVID-19 module. The data was constructed using three peer-reviewed publications published in the early stages of the global pandemic, when less was known, along with emerging resources, data, publications, and clinical knowledge. The simulation outputs synthetic Electronic Health Records (EHR), including the daily consumption of Personal Protective Equipment (PPE) and other medical devices and supplies. The data is stored in separate tables to avoid redundancy; as a consequence, the tables need to be combined and reorganized into dataframes for analysis.

Keywords: merge data, subset data, clean data, generate data

You will learn about combining data with pandas and numpy, and you will learn to visualize with bokeh. Concretely, you will preprocess the partly synthetic Covid data into an appropriate format in order to conduct statistical and visual analysis.

Learning objectives:

- Combine multiple data sources for analysis
- Read, inspect, clean, reshape data
- Visualize data using bokeh
- Maintain development environment 
- Apply coding standards and FAIR principles
- Reshape the dataset into a format suitable for visual and statistical analysis
- Use widgets to make the plot interactive 
- Use GIS libraries to plot geographical data

Tutorials about combining data: https://github.com/fenna/BFVM22PROG1/blob/main/tutorials/tutorial_combine_data.ipynb

Study case about combining data: https://github.com/fenna/BFVM22PROG1/blob/main/study_cases/adults_who_binge_drank_in_hot_towns.ipynb


Please add the topics you want to learn about here: https://padlet.com/ffeenstra1/kzh2chaqleq3iovu


Your job is to **visualize the lab values of COVID-19 patients who survived versus those who did not survive**.

The assignment consists of 5 parts:

- [part 1: load the data](#0)
     - [Exercise 1.1](#ex-11)
- [part 2: data wrangling](#1)
     - [Exercise 2.1](#ex-21)
- [part 3: more wrangling](#2)
     - [Exercise 3.1](#ex-31)
- [part 4: plot the data](#3)
     - [Exercise 4.1](#ex-41)
- [part 5: plot patient location](#4)
     - [Exercise 5.1](#ex-51)


Parts 1 and 4 are mandatory; part 5 is optional (bonus).
Note that you may not copy code without citing its source. If you copy code, you must be able to explain it verbally, and you will not receive the full score.


## About the data

The data is generated by Synthea's COVID-19 module. The data was constructed using three peer-reviewed publications published in the early stages of the global pandemic, when less was known, along with emerging resources, data, publications, and clinical knowledge. The simulation outputs synthetic Electronic Health Records (EHR), including the daily consumption of Personal Protective Equipment (PPE) and other medical devices and supplies. For this assignment the `conditions`, `patients`, `observations`, `careplans` and `encounters` tables will be used. The data is stored in separate tables to avoid redundancy; as a consequence, the tables need to be combined and reorganized into dataframes for analysis.

Source: Walonoski J, Klaus S, Granger E, Hall D, Gregorowicz A, Neyarapally G, Watson A, Eastman J. Synthea™ Novel coronavirus (COVID-19) model and synthetic data set. Intelligence-Based Medicine. 2020 Nov;1:100007. https://doi.org/10.1016/j.ibmed.2020.100007

Please <a href = "https://synthetichealth.github.io/synthea-sample-data/downloads/10k_synthea_covid19_csv.zip">download</a> the data

#### Covid Patients
Patients are considered Covid patients if they are identified with `CODE` `840539006`


#### Survivors
Patients who had Covid and tested negative after isolation have an observation with code `94531-1` (SARS-CoV-2 RNA Pnl Resp NAA+probe, the SARS-CoV-2 test) and a value of `Not detected (qualifier value)`. These patients are considered Covid survivors.

#### Non-Survivors
Patients who did not survive Covid have a `DEATHDATE` that is not null.
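
As a rough illustration of the three definitions above, the sketch below derives the id sets with boolean masks. The three small tables and the patient ids (`p1`, `p2`, `p3`) are invented for this example; only the column names and the codes are taken from the dataset description, and the real tables contain many more columns.

```python
import pandas as pd

# Toy tables: the rows are invented, the column names follow the Synthea CSV files
conditions = pd.DataFrame({
    "PATIENT": ["p1", "p2", "p3"],
    "CODE": [840539006, 840539006, 65363002],
})
observations = pd.DataFrame({
    "PATIENT": ["p1", "p2"],
    "CODE": ["94531-1", "94531-1"],
    "VALUE": ["Not detected (qualifier value)", "Detected (qualifier value)"],
})
patients = pd.DataFrame({
    "Id": ["p1", "p2", "p3"],
    "DEATHDATE": [None, "2020-04-01", None],
})

# Covid patients: a condition row with code 840539006
covid_ids = set(conditions.loc[conditions["CODE"] == 840539006, "PATIENT"])

# Survivors: Covid patients with a negative SARS-CoV-2 test
negative_test = (observations["CODE"] == "94531-1") & (
    observations["VALUE"] == "Not detected (qualifier value)"
)
survivor_ids = set(observations.loc[negative_test, "PATIENT"]) & covid_ids

# Non-survivors: Covid patients with a DEATHDATE that is not null
deceased_ids = set(patients.loc[patients["DEATHDATE"].notna(), "Id"]) & covid_ids

print(sorted(covid_ids), sorted(survivor_ids), sorted(deceased_ids))
```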


#### Lab values of COVID-19 patients

Patients are monitored for blood and heart conditions once they are admitted to hospital or are under treatment. The lab values of interest are as follows:

- `48065-7`  Fibrin D-dimer FEU [Mass/volume] in Platelet poor plasma
- `26881-3`   Interleukin 6 [Mass/volume] in Serum or Plasma
- `2276-4` Ferritin [Mass/volume] in Serum or Plasma
- `89579-7` Troponin I.cardiac [Mass/volume] in Serum or Plasma by High sensitivity method
- `731-0` Lymphocytes [#/volume] in Blood by Automated count
- `14804-9` Lactate dehydrogenase [Enzymatic activity/volume] in Serum or Plasma by Lactate to pyruvate reaction


---

<a name='0'></a>
## Part 1: Load the data (20 pt)

Instructions: load the data from the files listed below.
Preferably, do not hard code the data path but read it from a config file. See https://fennaf.gitbook.io/bfvm22prog1/data-processing/configuration-files/yaml

- conditions.csv
- patients.csv
- observations.csv
- careplans.csv
- encounters.csv

Get familiar with the data. Create some meaningful overviews and answer the following questions:

1. How many patients are there?
2. How many Covid patients are there?
3. How many patients have a 'Hospital admission for isolation' encounter?
    
<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
    <ul><li>use a unique dataframe for each file, use a meaningful name</li>
    <li>pandas.read_csv() method can be used to read a csv file</li>
    <li>pandas.DataFrame.head() method is often used to inspect the dataframe</li>
    <li>.unique() returns an array with the unique values of a column; .nunique() returns how many there are</li>
</ul>
</details>
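
The questions above ask for numbers of patients, while the tables hold one row per encounter or condition. The sketch below, on a small invented `encounters` table (the patient ids are made up, the description string is the one used in the dataset), shows how a row count and a distinct-patient count can differ. It is only an illustration of the `.nunique()` hint, not a statement about the real counts.

```python
import pandas as pd

# Invented rows: patient "p1" has two isolation admissions
encounters = pd.DataFrame({
    "PATIENT": ["p1", "p1", "p2", "p3"],
    "DESCRIPTION": [
        "Hospital admission for isolation (procedure)",
        "Hospital admission for isolation (procedure)",
        "Hospital admission for isolation (procedure)",
        "Well child visit (procedure)",
    ],
})

is_isolation = encounters["DESCRIPTION"] == "Hospital admission for isolation (procedure)"

# Number of matching rows (encounters)
n_encounters = int(is_isolation.sum())

# Number of distinct patients among the matching rows
n_patients = encounters.loc[is_isolation, "PATIENT"].nunique()

print(n_encounters, n_patients)
```

If every patient has at most one matching row, both numbers are equal; comparing them is a quick way to find out.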

<a name='ex-11'></a>
### 1.1 Code your solution


In [1]:
import yaml
import pandas as pd
from datetime import datetime
import numpy as np

# Opening all the files
with open("config.yaml", "r") as config_reader:
    files = yaml.safe_load(config_reader)
    condition_file = files['condition_file']
    patients_file = files["patients_file"]
    observations_file = files["observations_file"]
    careplan_file = files["careplan_file"]
    encounters_file = files["encounters_file"]

# Making a dictionary of dataframes
df_dict = {
    "conditions": pd.read_csv(condition_file),
    "patients": pd.read_csv(patients_file),
    "observations": pd.read_csv(observations_file),
    "careplan": pd.read_csv(careplan_file),
    "encounters": pd.read_csv(encounters_file),
}

# Quantity of the patients
print("There are {} patients".format(len(df_dict["patients"]["Id"])))

# Quantity of covid patients
amount_of_covid_pat = df_dict['conditions']["CODE"].value_counts()[840539006]
print(f"There are {amount_of_covid_pat} covid patients")

# Number of 'Hospital admission for isolation' encounters
# (this counts encounter rows; .nunique() on PATIENT would count distinct patients)
hospital_ad_for_isolation = df_dict['encounters']["DESCRIPTION"].value_counts()["Hospital admission for isolation (procedure)"]
print(f"There are {hospital_ad_for_isolation} patients hospitalized for isolation")

# Number of patients with a DEATHDATE
died_patients = df_dict["patients"]["DEATHDATE"].notna().sum()
print(f"{died_patients} have died")


There are 12352 patients
There are 8820 covid patients
There are 1867 patients hospitalized for isolation
2352 have died


In [3]:
# Checking encounters
df_dict['encounters'].head()

Unnamed: 0,Id,START,STOP,PATIENT,ORGANIZATION,PROVIDER,PAYER,ENCOUNTERCLASS,CODE,DESCRIPTION,BASE_ENCOUNTER_COST,TOTAL_CLAIM_COST,PAYER_COVERAGE,REASONCODE,REASONDESCRIPTION
0,d5ee30a9-362f-429e-a87a-ee38d999b0a5,2019-02-16T01:02:32Z,2019-02-16T01:17:32Z,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,5103c940-0c08-392f-95cd-446e0cea042a,e2c226c2-3e1e-3d0b-b997-ce9544c10528,7c4411ce-02f1-39b5-b9ec-dfbea9ad3c1a,outpatient,185345009,Encounter for symptom,129.16,129.16,69.16,65363002.0,Otitis media
1,6a74fdef-2287-44bf-b9e7-18012376faca,2019-08-02T01:02:32Z,2019-08-02T01:32:32Z,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,0b9f3f7c-8ab6-30a5-b3ae-4dc0e0c00cb3,87c33fc5-3fd1-3c52-815a-b89a1623bb3a,7c4411ce-02f1-39b5-b9ec-dfbea9ad3c1a,wellness,410620009,Well child visit (procedure),129.16,129.16,129.16,,
2,8bca6d8a-ab80-4cbf-8abb-46654235f227,2019-10-31T01:02:32Z,2019-10-31T01:17:32Z,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,5103c940-0c08-392f-95cd-446e0cea042a,e2c226c2-3e1e-3d0b-b997-ce9544c10528,7c4411ce-02f1-39b5-b9ec-dfbea9ad3c1a,outpatient,185345009,Encounter for symptom,129.16,129.16,69.16,65363002.0,Otitis media
3,821e57ac-9304-46a9-9f9b-83daf60e9e43,2020-01-31T01:02:32Z,2020-01-31T01:17:32Z,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,0b9f3f7c-8ab6-30a5-b3ae-4dc0e0c00cb3,87c33fc5-3fd1-3c52-815a-b89a1623bb3a,7c4411ce-02f1-39b5-b9ec-dfbea9ad3c1a,wellness,410620009,Well child visit (procedure),129.16,129.16,129.16,,
4,681c380b-3c84-4c55-80a6-db3d9ea12fee,2020-03-02T01:02:32Z,2020-03-02T01:58:32Z,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,fd328395-ab1d-35c6-a2d0-d05a9a79cf11,9c875a09-93e0-39aa-9260-ad264bbdd3fe,7c4411ce-02f1-39b5-b9ec-dfbea9ad3c1a,ambulatory,185345009,Encounter for symptom (procedure),129.16,129.16,69.16,,
5,9aa748b8-3b44-4e34-b7a8-2e56f2ca3ca2,2019-07-08T08:02:25Z,2019-07-08T08:17:25Z,067318a4-db8f-447f-8b6e-f2f61e9baaa5,f18084da-cd20-347a-93a9-f62f748ba19a,46775954-ffef-3608-aa92-9ba974682a16,5059a55e-5d6e-34d1-b6cb-d83d16e57bcf,wellness,410620009,Well child visit (procedure),129.16,129.16,129.16,,
6,df2e9ebd-090c-4fb4-b749-ec6e4cf3e75e,2020-01-06T08:02:25Z,2020-01-06T08:32:25Z,067318a4-db8f-447f-8b6e-f2f61e9baaa5,f18084da-cd20-347a-93a9-f62f748ba19a,46775954-ffef-3608-aa92-9ba974682a16,5059a55e-5d6e-34d1-b6cb-d83d16e57bcf,wellness,410620009,Well child visit (procedure),129.16,129.16,129.16,,
7,adedca64-700b-4fb9-82f1-9cbb658abb73,2020-02-12T08:02:25Z,2020-02-12T09:02:25Z,067318a4-db8f-447f-8b6e-f2f61e9baaa5,3bd5eda0-16da-3ba5-8500-4dfd6ae118b8,bc4a66b7-a2ba-3ad3-af08-2975489d8495,5059a55e-5d6e-34d1-b6cb-d83d16e57bcf,emergency,50849002,Emergency room admission (procedure),129.16,129.16,59.16,,
8,1ea74a77-3ad3-4948-a9cc-3084462035d6,2020-03-13T08:02:25Z,2020-03-13T08:52:25Z,067318a4-db8f-447f-8b6e-f2f61e9baaa5,3bd5eda0-16da-3ba5-8500-4dfd6ae118b8,bc4a66b7-a2ba-3ad3-af08-2975489d8495,5059a55e-5d6e-34d1-b6cb-d83d16e57bcf,ambulatory,185345009,Encounter for symptom (procedure),129.16,129.16,59.16,,
9,e03b96de-5604-4989-a2d5-03a63e041eab,2020-04-28T08:02:25Z,2020-04-28T08:32:25Z,067318a4-db8f-447f-8b6e-f2f61e9baaa5,3bd5eda0-16da-3ba5-8500-4dfd6ae118b8,bc4a66b7-a2ba-3ad3-af08-2975489d8495,5059a55e-5d6e-34d1-b6cb-d83d16e57bcf,ambulatory,185345009,Encounter for symptom,129.16,129.16,59.16,43878008.0,Streptococcal sore throat (disorder)


In [4]:
# Checking patients
df_dict['patients'].head()

Unnamed: 0,Id,BIRTHDATE,DEATHDATE,SSN,DRIVERS,PASSPORT,PREFIX,FIRST,LAST,SUFFIX,...,BIRTHPLACE,ADDRESS,CITY,STATE,COUNTY,ZIP,LAT,LON,HEALTHCARE_EXPENSES,HEALTHCARE_COVERAGE
0,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,2017-08-24,,999-68-6630,,,,Jacinto644,Kris249,,...,Beverly Massachusetts US,888 Hickle Ferry Suite 38,Springfield,Massachusetts,Hampden County,1106.0,42.151961,-72.598959,8446.49,1499.08
1,067318a4-db8f-447f-8b6e-f2f61e9baaa5,2016-08-01,,999-15-5895,,,,Alva958,Krajcik437,,...,Boston Massachusetts US,1048 Skiles Trailer,Walpole,Massachusetts,Norfolk County,2081.0,42.17737,-71.281353,89893.4,1845.72
2,ae9efba3-ddc4-43f9-a781-f72019388548,1992-06-30,,999-27-3385,S99971451,X53218815X,Mr.,Jayson808,Fadel536,,...,Springfield Massachusetts US,1056 Harris Lane Suite 70,Chicopee,Massachusetts,Hampden County,1020.0,42.181642,-72.608842,577445.86,3528.84
3,199c586f-af16-4091-9998-ee4cfc02ee7a,2004-01-09,,999-73-2461,S99956432,,,Jimmie93,Harris789,,...,Worcester Massachusetts US,201 Mitchell Lodge Unit 67,Pembroke,Massachusetts,Plymouth County,,42.075292,-70.757035,336701.72,2705.64
4,353016ea-a0ff-4154-85bb-1cf8b6cedf20,1996-11-15,,999-60-7372,S99917327,X58903159X,Mr.,Gregorio366,Auer97,,...,Patras Achaea GR,1050 Lindgren Extension Apt 38,Boston,Massachusetts,Suffolk County,2135.0,42.352434,-71.02861,484076.34,3043.04


In [2]:
# Renaming the CODE column to CODE_Y
df_dict['observations'] = df_dict['observations'].rename(columns={'CODE': 'CODE_Y'})

# Checking observations
df_dict['observations'].head()

Unnamed: 0,DATE,PATIENT,ENCOUNTER,CODE_Y,DESCRIPTION,VALUE,UNITS,TYPE
0,2019-08-01,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,6a74fdef-2287-44bf-b9e7-18012376faca,8302-2,Body Height,82.7,cm,numeric
1,2019-08-01,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,6a74fdef-2287-44bf-b9e7-18012376faca,72514-3,Pain severity - 0-10 verbal numeric rating [Sc...,2.0,{score},numeric
2,2019-08-01,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,6a74fdef-2287-44bf-b9e7-18012376faca,29463-7,Body Weight,12.6,kg,numeric
3,2019-08-01,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,6a74fdef-2287-44bf-b9e7-18012376faca,77606-2,Weight-for-length Per age and sex,86.1,%,numeric
4,2019-08-01,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,6a74fdef-2287-44bf-b9e7-18012376faca,9843-4,Head Occipital-frontal circumference,46.9,cm,numeric


In [6]:
# Checking careplan
df_dict['careplan'].head(40)

Unnamed: 0,Id,START,STOP,PATIENT,ENCOUNTER,CODE,DESCRIPTION,REASONCODE,REASONDESCRIPTION
0,fea43343-7312-423f-bb82-b2f5ae71a260,2020-03-01,2020-03-01,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,681c380b-3c84-4c55-80a6-db3d9ea12fee,736376001,Infectious disease care plan (record artifact),840544004.0,Suspected COVID-19
1,cbcade35-42bf-4807-8154-3f7f847221e0,2020-03-01,2020-03-30,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,681c380b-3c84-4c55-80a6-db3d9ea12fee,736376001,Infectious disease care plan (record artifact),840539006.0,COVID-19
2,51dd78df-2b01-486a-8b33-1fbcd9cec211,2020-02-12,2020-02-26,067318a4-db8f-447f-8b6e-f2f61e9baaa5,adedca64-700b-4fb9-82f1-9cbb658abb73,91251008,Physical therapy procedure,44465007.0,Sprain of ankle
3,8aa5055b-cddc-4170-9e31-e71e5552502a,2020-03-13,2020-03-13,067318a4-db8f-447f-8b6e-f2f61e9baaa5,1ea74a77-3ad3-4948-a9cc-3084462035d6,736376001,Infectious disease care plan (record artifact),840544004.0,Suspected COVID-19
4,976d369a-2b71-488d-ba20-8674fc272be0,2020-03-13,2020-04-14,067318a4-db8f-447f-8b6e-f2f61e9baaa5,1ea74a77-3ad3-4948-a9cc-3084462035d6,736376001,Infectious disease care plan (record artifact),840539006.0,COVID-19
5,b7f24b6d-1907-4154-9d78-b608bc958d96,2010-08-24,,ae9efba3-ddc4-43f9-a781-f72019388548,07a2f747-fb8c-46b2-9b17-9e79c9ec153f,443402002,Lifestyle education regarding hypertension,59621000.0,Hypertension
6,aade3cba-7529-42c9-9a93-2f12aeaf37d5,2020-03-11,2020-03-11,ae9efba3-ddc4-43f9-a781-f72019388548,eeab7c2d-71ba-4e04-af16-87a01dce7d54,736376001,Infectious disease care plan (record artifact),840544004.0,Suspected COVID-19
7,97c1b0e7-dbf6-4c07-9264-1e934742443c,2020-03-11,2020-04-15,ae9efba3-ddc4-43f9-a781-f72019388548,eeab7c2d-71ba-4e04-af16-87a01dce7d54,736376001,Infectious disease care plan (record artifact),840539006.0,COVID-19
8,178a36f5-eb58-4fe2-8448-717e4af045c2,2020-03-01,2020-03-02,199c586f-af16-4091-9998-ee4cfc02ee7a,8333efdf-f7bf-43bb-b73f-2b663d14c1ad,736376001,Infectious disease care plan (record artifact),840544004.0,Suspected COVID-19
9,cc8ebf57-cfda-445e-8d74-6a0caef899fa,2020-03-02,2020-04-07,199c586f-af16-4091-9998-ee4cfc02ee7a,8333efdf-f7bf-43bb-b73f-2b663d14c1ad,736376001,Infectious disease care plan (record artifact),840539006.0,COVID-19


In [7]:
# Checking conditions
df_dict['conditions'].head()

Unnamed: 0,START,STOP,PATIENT,ENCOUNTER,CODE,DESCRIPTION
0,2019-02-15,2019-08-01,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,d5ee30a9-362f-429e-a87a-ee38d999b0a5,65363002,Otitis media
1,2019-10-30,2020-01-30,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,8bca6d8a-ab80-4cbf-8abb-46654235f227,65363002,Otitis media
2,2020-03-01,2020-03-30,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,681c380b-3c84-4c55-80a6-db3d9ea12fee,386661006,Fever (finding)
3,2020-03-01,2020-03-01,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,681c380b-3c84-4c55-80a6-db3d9ea12fee,840544004,Suspected COVID-19
4,2020-03-01,2020-03-30,f0f3bc8d-ef38-49ce-a2bd-dfdda982b271,681c380b-3c84-4c55-80a6-db3d9ea12fee,840539006,COVID-19


### 1.2 Test your solution
The following function needs to be called. You can use this as a test. There are, however, more meaningful overviews you can create.

In [8]:
def part1(num_pat, num_cov, num_admitted, num_died):
    print(f'There are {num_pat} patients in total')
    print(f'There are {num_cov} covid patients')
    print(f'There are {num_admitted} admitted patients')
    print(f'{num_died} patients died')


### Expected outcome

---

<a name='1'></a>
## Part 2: Data Wrangling: set up the dataframe (30 pt)

In this part we are going to combine data to create a dataframe with values of interest for the lab values analysis. 

We would like a dataframe containing the following information per record (Covid patients only):

- `PATIENT` - the ID of the covid patient
- `days` - the number of days the patient is under observation
- `CODE_Y` - the code of the observation
- `VALUE` - the lab value of the observation

where only the following observation codes need to be selected:

- `48065-7`  Fibrin D-dimer FEU [Mass/volume] in Platelet poor plasma
- `26881-3`   Interleukin 6 [Mass/volume] in Serum or Plasma
- `2276-4` Ferritin [Mass/volume] in Serum or Plasma
- `89579-7` Troponin I.cardiac [Mass/volume] in Serum or Plasma by High sensitivity method
- `731-0` Lymphocytes [#/volume] in Blood by Automated count
- `14804-9` Lactate dehydrogenase [Enzymatic activity/volume] in Serum or Plasma by Lactate to pyruvate reaction

The number of days is not directly available and needs to be calculated by subtracting the condition `START` date from the observation `DATE`.

An example of such a dataframe is given below:

In [9]:
#Possible approach:

#Select all the patients with covid from the conditions table
#Combine conditions table (only covid patients) with the patient table into a covid_patient table
#select only the relevant lab observations from the observations table into a lab_obs table
#merge the covid_patient table with the lab_obs table into a covid_patients_obs table
#clean the covid_patients_obs table (rename columns, select only relevant columns, sort, typecast, add days column)


<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
    <ul><li>you can use pandas.DataFrame.merge() to merge dataframes</li>
    <li>df = df[(df.CODE == condition1) | (df.CODE == condition2)] selects the rows where CODE matches either of two values; each comparison needs its own parentheses</li>
    <li>df.DATE - df.START returns a timedelta in days if DATE and START are both in datetime format</li>
    <li>pd.to_datetime() can be used to typecast to datetime</li>
</ul>
</details>
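
The possible approach and the hints above can be sketched on a tiny example. The two tables below are invented (patient ids, dates and values are made up); the column names and lab codes follow the assignment text, and the names `covid`, `obs`, `lab_obs` and `result` are only illustrative. The observation code column is called `CODE` here; in the real data you may want to rename it first to avoid a name clash when merging.

```python
import pandas as pd

# Invented Covid conditions: one START date per patient
covid = pd.DataFrame({
    "PATIENT": ["p1", "p2"],
    "START": ["2020-03-01", "2020-03-10"],
})

# Invented observations; "8302-2" (Body Height) is not a lab code of interest
obs = pd.DataFrame({
    "PATIENT": ["p1", "p1", "p2", "p2"],
    "DATE": ["2020-03-01", "2020-03-04", "2020-03-12", "2020-03-12"],
    "CODE": ["2276-4", "731-0", "2276-4", "8302-2"],
    "VALUE": ["332.4", "1.1", "525.2", "82.7"],
})

lab_codes = ["48065-7", "26881-3", "2276-4", "89579-7", "731-0", "14804-9"]

# Keep only the relevant lab observations before merging (smaller merge)
lab_obs = obs[obs["CODE"].isin(lab_codes)]

# Inner merge keeps only observations of patients present in the covid table
merged = covid.merge(lab_obs, on="PATIENT", how="inner")

# Typecast to datetime, subtract, and take the whole number of days
merged["days"] = (pd.to_datetime(merged["DATE"]) - pd.to_datetime(merged["START"])).dt.days
merged["VALUE"] = merged["VALUE"].astype(float)

result = merged[["PATIENT", "days", "CODE", "VALUE"]]
print(result)
```

Filtering the observations before the merge is a design choice: it gives the same rows as filtering afterwards, but the intermediate table stays much smaller.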

<a name='ex-21'></a>
### 2.1 Code your solution

In [3]:
# Select all Covid rows from the conditions table
df = df_dict["conditions"].loc[df_dict["conditions"]["CODE"] == 840539006]

# Merge the Covid conditions with the observations of the same patients
df = pd.merge(df, df_dict['observations'], on="PATIENT", how="inner")

# Keep only the lab observations of interest; .copy() avoids a SettingWithCopyWarning below
df = df.loc[df["CODE_Y"].isin(["48065-7", "26881-3", "2276-4", "89579-7", "731-0", "14804-9"])].copy()

# Number of days between the start of the condition and the observation
df["DAYS"] = (pd.to_datetime(df['DATE']) - pd.to_datetime(df["START"].str[:10])) / np.timedelta64(1, 'D')

# Keep only the relevant columns
df = df.loc[:, ["PATIENT", "DAYS", "CODE_Y", "VALUE", "UNITS"]]

# Converting the lab values to float
df["VALUE"] = df["VALUE"].astype("float")

df

Unnamed: 0,PATIENT,DAYS,CODE_Y,VALUE,UNITS
273,f58bf921-cba1-475a-b4f8-dc6fa3b8f89c,0.0,731-0,1.1,10*3/uL
292,f58bf921-cba1-475a-b4f8-dc6fa3b8f89c,0.0,48065-7,0.4,ug/mL
293,f58bf921-cba1-475a-b4f8-dc6fa3b8f89c,0.0,2276-4,332.4,ug/L
294,f58bf921-cba1-475a-b4f8-dc6fa3b8f89c,0.0,89579-7,2.3,pg/mL
295,f58bf921-cba1-475a-b4f8-dc6fa3b8f89c,0.0,14804-9,223.9,U/L
...,...,...,...,...,...
1479602,c9699449-7a8b-400a-8e86-fab6aa7134cb,8.0,731-0,0.9,10*3/uL
1479621,c9699449-7a8b-400a-8e86-fab6aa7134cb,8.0,48065-7,0.5,ug/mL
1479622,c9699449-7a8b-400a-8e86-fab6aa7134cb,8.0,2276-4,525.2,ug/L
1479623,c9699449-7a8b-400a-8e86-fab6aa7134cb,8.0,89579-7,3.0,pg/mL


---

<a name='2'></a>
## Part 3: Data Wrangling, split into survived and not survived (10 pt)

Now that we have the required data, we would like to split it into survived and not survived. First we fetch all the ids of the survived and deceased patients. We can use these ids to select the records of the patients who survived and of the patients who did not survive.

Your job is to split the data into survived and not survived records. There are multiple ways to do this. One way is to use the `.isin()` method.
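
A minimal sketch of the `.isin()` approach is shown here on invented data: the `records` table and both id lists are made up, and `p3` is placed in both lists on purpose to show how an overlap can be detected. Whether the real id lists overlap is something to check yourself; this sketch makes no claim about it.

```python
import pandas as pd

# Invented lab records
records = pd.DataFrame({
    "PATIENT": ["p1", "p1", "p2", "p3"],
    "VALUE": [1.1, 0.9, 0.4, 2.3],
})

survivor_ids = ["p1", "p3"]
deceased_ids = ["p2", "p3"]  # "p3" appears in both lists on purpose

# .isin() returns a boolean mask that selects the matching rows
survived = records[records["PATIENT"].isin(survivor_ids)]
deceased = records[records["PATIENT"].isin(deceased_ids)]

# Ids present in both lists end up in both dataframes
in_both = set(survivor_ids) & set(deceased_ids)

print(len(survived), len(deceased), in_both)
```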

In [4]:
#the following code is given, RUN THIS CELL
#get survived and deceased ids
completed_isolation_patients = df_dict['careplan'][(df_dict['careplan'].CODE == 736376001) & (df_dict['careplan'].STOP.notna()) \
                                          & (df_dict['careplan'].REASONCODE == 840539006)].PATIENT
negative_covid_patient_ids = df_dict['observations'][(df_dict['observations'].CODE_Y == '94531-1') \
                                          & (df_dict['observations'].VALUE == 'Not detected (qualifier value)')].PATIENT.unique()
survivor_ids = np.union1d(completed_isolation_patients, negative_covid_patient_ids)
deceased_ids = df_dict['patients'][df_dict['patients'].DEATHDATE.notna()].Id

<a name='ex-31'></a>
### 3.1 Code your solution

In [5]:
# Making a survived dataframe
survived_df = df.loc[df["PATIENT"].isin(survivor_ids)]
len(survived_df)

57303

In [6]:
# Making a deceased dataframe
deceased_df = df.loc[df["PATIENT"].isin(deceased_ids)]
len(deceased_df)

16793

### 3.2 Test your solution

In [7]:
def test3(survived, died):
    print(f'patients records survived: {survived}, patients records deceased {died}')
#call the test3

test3(len(survived_df), len(deceased_df))

patients records survived: 57303, patients records deceased 16793


#### Expected outcome

---

<a name='3'></a>
## Part 4: Plot the data (20 pt)

Create plots with the lab data, for each code one plot. Separate the survivors and the deceased by color. An example of such a plot is given below. You can create 6 plots in one grid (for each code one plot) or use a widget (for instance a drop down menu widget) to select a lab CODE. Plot on the x-axis the days, on the y-axis the VALUE. Use proper labels, titles and legends.

<img src="../images/week3_plot.png" width="500" height="500"/>

<a name='ex-41'></a>
### 4.1 Code your solution

In [8]:
from bokeh.layouts import gridplot
from bokeh.plotting import figure, show

# Dictionary with the plot title for each lab code
title_dict = {
    "48065-7": "Fibrin D-dimer FEU in Platelet poor plasma",
    "26881-3": "Interleukin 6 in Serum or Plasma",
    "2276-4": "Ferritin in Serum or Plasma",
    "89579-7": "Troponin I.cardiac in Serum or Plasma by High sensitivity method",
    "731-0": "Lymphocytes in Blood by Automated count",
    "14804-9": "Lactate dehydrogenase in Serum or Plasma by Lactate to pyruvate reaction",
}

# Dictionary with the unit for each lab code
unit_dict = {"48065-7": "ug/mL", "26881-3": "pg/mL", "2276-4": "ug/L", "89579-7": "pg/mL", "731-0": "10*3/uL", "14804-9": "U/L"}


def make_plot(code_y):
    """Return a scatter plot of the lab values for one lab code.

    Deceased and survived patients are drawn in different colors.
    """
    # Filter each dataframe with its own mask
    df_deceased_y = deceased_df.loc[deceased_df["CODE_Y"] == code_y]
    df_survived_y = survived_df.loc[survived_df["CODE_Y"] == code_y]

    p = figure(width=400, height=400, title=title_dict[code_y])
    # Star markers (Christmas style); scatter(marker="star") is used because
    # p.star() is deprecated in recent Bokeh releases
    p.scatter(x=df_deceased_y["DAYS"], y=df_deceased_y["VALUE"], marker="star", size=5, color="red", alpha=0.5, legend_label="Deceased")
    p.scatter(x=df_survived_y["DAYS"], y=df_survived_y["VALUE"], marker="star", size=5, color="green", alpha=0.5, legend_label="Survived")

    p.xaxis.axis_label = "Time in days"
    p.yaxis.axis_label = f"Value ({unit_dict[code_y]})"

    return p

# Making a list with one plot per lab code, in the order of title_dict
plots = [make_plot(code) for code in title_dict]

# Creating a gridplot with all six plots
show(gridplot(plots, ncols=2, width=400, height=400))





<a name='4'></a>
## Part 5: Plot the location of the patients (10 pt)

This is a bonus part. Can you plot the patients' locations on a map? See also
https://docs.bokeh.org/en/latest/docs/user_guide/geo.html

You can use a package such as folium or geopandas. You need the latitude and longitude information from the patients table.


<a name='ex-51'></a>
### 5.1 Code your solution

In [12]:
from bokeh.plotting import figure, show
from bokeh.tile_providers import CARTODBPOSITRON, get_provider


def coordinates(latitude, longitude):
    """Convert latitude and longitude in degrees to Web Mercator (x, y) in meters."""
    r_major = 6378137.000  # equatorial radius of the WGS84 ellipsoid in meters
    x = r_major * np.radians(longitude)
    y = r_major * np.log(np.tan(np.pi / 4.0 + np.radians(latitude) / 2.0))
    return (x, y)


# Converting all patient locations at once; the function also works on whole columns
merc_x, merc_y = coordinates(df_dict["patients"]["LAT"], df_dict["patients"]["LON"])

# Creating the geoplot with a tile background.
# Note: bokeh.tile_providers is deprecated in recent Bokeh versions, where
# p.add_tile("CartoDB Positron") can be used instead.
tile_provider = get_provider(CARTODBPOSITRON)
p = figure(x_range=(-8200000, -7700000), y_range=(5200000, 5300000),
           x_axis_type="mercator", y_axis_type="mercator", width=800)
p.add_tile(tile_provider)

# Adding one data point per patient to the geoplot
p.scatter(x=merc_x, y=merc_y, size=3, fill_color="blue", fill_alpha=0.5)

# Show
show(p)

