# Exercise 1: a first illustration of Real-World Data (RWD) analysis

In this first exercise we introduce some basic categories of RWD and we illustrate some of the challenges related to their preprocessing for research.

We initialize the notebook by importing the following libraries:

In [None]:
import pandas as pd
import numpy as np

# Visualization library
import altair as alt
alt.data_transformers.enable('default', max_rows=None)

# Dates management
import datetime

# For the computation of Kaplan-Meier estimates and log-rank tests
import lifelines

# Table of contents

1. [Data Exploration](#data_exploration)  
    1.1 [Patients' identities and demographic data](#patient_ids)  
    1.2 [Administrative data related to patients' pathways](#visits)  
    1.3 [Claim data related to patients' conditions (diagnosis)](#cond)  
    1.4 [Structured medication data](#med)  
2. [Preprocessing](#preprocessing)  
    2.1 [First attempt at visualizing survival curves](#first_kaplan)  
    2.2 [Pre-processing patients' identities and demographic data](#prepro_patient)  
    2.3 [Pre-processing administrative data related to patients' pathways](#prepro_visits)  
    2.4 [Pre-processing claim data](#prepro_cond)  
3. [Statistical analysis](#stat)  
    3.1 [First objective: Are the drugs effective in the overall population?](#stat_first_kaplan)  
    3.2 [Second objective: Sub-population analysis](#stat_second_kaplan)  
4. [Takeaways](#takeaways) 
5. [References](#references)  

<a id="data_exploration"></a>
# 1. Data Exploration

The dataset is available in the */data* folder. The database structure is freely inspired from the AP-HP Clinical Data Warehouse (CDW).

Let's assume that data were extracted from the CDW on **December 1st, 2025**.

<a id="patient_ids"></a>
## 1.1 Patients' identities and demographic data

- Open the *data/df_person.pkl* file using the `pandas.read_pickle()` function.
- Explore the type of each feature of the `df_person` DataFrame with the `.info()` function.
- Check out the first rows of the DataFrame using the `.head()` function.

In [None]:
df_person = pd.read_pickle('/data/df_person.pkl')
#TODO

How many unique patient ids do we have in this database?

TIP: You can use the `DataFrame.col_name.unique()` method on one of the columns.
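
As a minimal sketch of this tip on a toy table (not the exercise data), the snippet below counts distinct identifiers in two equivalent ways. The column name `person_id` is an assumption made for the illustration.

```python
import pandas as pd

# Toy table; `person_id` is an assumed column name for this illustration.
toy_person = pd.DataFrame({"person_id": [101, 102, 102, 103]})

# `.unique()` returns the distinct values; `.nunique()` counts them directly.
n_unique = len(toy_person.person_id.unique())
same_count = toy_person.person_id.nunique()

print(f"{n_unique} unique ids out of {len(toy_person)} rows")
```

Comparing the number of unique ids with the number of rows is a quick way to spot duplicated records.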

In [None]:
print(f"We have {#TODO} unique patient ids in this dataset.")

<a id="visits"></a>
## 1.2 Administrative data related to patients' pathways

Explore the `df_visit` DataFrame that gathers administrative data about patients' hospitalizations.

In [None]:
df_visit = #TODO
#TODO

Compute the number of visits recorded by care site.

TIP: You can use the `DataFrame.col_name.value_counts()` method on the right column.
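
The following sketch shows what `value_counts()` returns on a toy table. The column name `care_site_id` is hypothetical and may differ from the one in `df_visit`.

```python
import pandas as pd

# Toy visits; `care_site_id` is a hypothetical column name.
toy_visit = pd.DataFrame({"care_site_id": ["A", "B", "A", "C", "A", "B"]})

# One row per distinct value, sorted by decreasing frequency.
visits_per_site = toy_visit.care_site_id.value_counts()
print(visits_per_site)
```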

In [None]:
#TODO

<a id="cond"></a>
## 1.3 Claim data related to patients' conditions (diagnosis)

Explore the `df_condition` DataFrame that gathers claim data used by hospital managers for reimbursement purposes.

Discussions with experts allowed us to identify the ICD-10 codes corresponding to the flu virus:
- J09
- J10
- J11

You can look up the meaning of these codes on the following website: https://www.aideaucodage.fr/cim

In [None]:
df_condition = #TODO
#TODO

How many ICD-10 codes can you identify in this dataset ?  

TIP : You can use the `DataFrame.col_name.unique()` function.

In [None]:
print(f"The available ICD-10 codes are the following : {#TODO}")

<a id="med"></a>
## 1.4 Structured medication data

Explore the `df_med` DataFrame, which gathers structured data on medication administration during hospital stays.

In [None]:
df_med = #TODO
#TODO

<a id="preprocessing"></a>
# 2. Preprocessing

Now that we know what categories of data are available, let's process them!

We have defined two helper functions in the *viz.py* script that leverage the [*lifelines*](https://lifelines.readthedocs.io/en/latest/) library to plot Kaplan-Meier estimates and compute log-rank tests for our two objectives:  
1. Evaluate the overall impact of drug A and B
2. Stratify our analysis on age and gender
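
To make the notion of a Kaplan-Meier estimate concrete, here is a hand-rolled sketch of the product-limit formula in plain pandas. It is not the code of *viz.py* and does not use *lifelines*; the function name `kaplan_meier` and the toy durations are assumptions made for the illustration.

```python
import pandas as pd

def kaplan_meier(durations, events):
    """Product-limit survival estimate at each distinct event time.

    durations: follow-up time of each subject.
    events: 1 if the event (death) was observed, 0 if the subject was censored.
    """
    df = pd.DataFrame({"t": durations, "e": events})
    # Per distinct time: number of deaths and number of subjects leaving the risk set.
    grouped = df.groupby("t")["e"].agg(deaths="sum", leaving="count")
    # Subjects still at risk just before each time.
    at_risk = len(df) - grouped["leaving"].cumsum().shift(fill_value=0)
    # Multiply the conditional survival probabilities over time.
    survival = (1 - grouped["deaths"] / at_risk).cumprod()
    # Report the curve only where it drops, i.e. at observed event times.
    return survival[grouped["deaths"] > 0]

toy_survival = kaplan_meier(durations=[2, 3, 3, 5, 8], events=[1, 1, 0, 1, 0])
print(toy_survival)
```

Censored subjects (event 0) never make the curve drop, but they do shrink the number at risk for later times. This is why the end-of-study date matters for the analysis.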

In [None]:
#Import the helper functions
import sys
sys.path.append("../")
from viz import plot_primary_kaplan, plot_secondary_kaplan

<a id="first_kaplan"></a>
## 2.1 First attempt at visualizing survival curves

Let's try to compute the Kaplan-Meier estimates directly!  

First define the end date of the study, which is needed to censor data, using the [*datetime*](https://docs.python.org/3/library/datetime.html) module of the Python standard library.
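
The sketch below illustrates, on toy data, how an end-of-study date is used for censoring: a patient without a recorded death is followed until that date and flagged as censored. The date `toy_end` and the column names `start` and `death` are hypothetical and unrelated to the exercise data.

```python
import datetime
import pandas as pd

# Hypothetical end-of-study date, chosen only for this toy example.
toy_end = datetime.datetime(2024, 1, 31)

toy = pd.DataFrame({
    "start": pd.to_datetime(["2024-01-01", "2024-01-16"]),
    "death": pd.to_datetime(["2024-01-21", None]),  # None becomes NaT (no death recorded)
})

# An observed death is an event; a missing death date is censored at the end of study.
toy["event"] = toy["death"].notna()
toy["duration_days"] = (toy["death"].fillna(pd.Timestamp(toy_end)) - toy["start"]).dt.days

print(toy[["event", "duration_days"]])
```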

In [None]:
t_end_of_study = #TODO

Print the docstring of the `plot_primary_kaplan` function, and use it to plot the first Kaplan-Meier estimates for the whole population with respect to drug administration.

TIP: To print the documentation of a function, you can access its `__doc__` attribute.
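
For instance, on a toy function (`toy_helper` is a made-up name, not part of *viz.py*), the docstring can be retrieved as follows:

```python
def toy_helper(x):
    """Return x doubled (toy function used only to illustrate __doc__)."""
    return 2 * x

# The docstring is stored as a plain string on the function object.
doc = toy_helper.__doc__
print(doc)
```

In a notebook, `help(toy_helper)` or `toy_helper?` display the same information.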

In [None]:
print(#TODO)

In [None]:
plot_primary_kaplan(#TODO)

Do the Kaplan-Meier estimates seem correct?

<a id="prepro_patient"></a>
## 2.2 Pre-processing patients' identities and demographic data

### 2.2.1 Birth dates

Let's explore identities and demographic data. Count the number of missing values for each feature of the `df_person` DataFrame. What do you observe ?

TIP : Use the `DataFrame.isna().sum()` function to compute the number of NA values within each column of the DataFrame.

In [None]:
df_person.isna().sum()

Some dates of birth are missing. Can you identify the source of these missing values?

TIP: Check the impact of `cdm_source` on the missingness of the birth datetime.
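
One possible way to relate missingness to a source system is to count missing values per group. The sketch below does so on a toy table whose values are invented; only the column names `cdm_source` and `birth_datetime` come from the text above.

```python
import pandas as pd

# Toy table with invented values.
toy_person = pd.DataFrame({
    "cdm_source": ["EHR 1", "EHR 1", "EHR 2", "EHR 2"],
    "birth_datetime": pd.to_datetime(["1980-05-01", "1975-02-10", None, "1990-07-23"]),
})

# Boolean mask of missing birth dates, summed within each source.
missing_by_source = (
    toy_person["birth_datetime"].isna().groupby(toy_person["cdm_source"]).sum()
)
print(missing_by_source)
```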

In [None]:
print(f"Number of missing birth datetimes for EHR 1 : {#TODO}")

In [None]:
print(f"Number of missing birth datetimes for EHR 2 : {#TODO}")

What would you suggest to address this bias ?

**Correction**:  
The birth date quality issue can be directly associated with registration in the "EHR 2" software. Although it could introduce a clinical bias, one solution is to discard the data coming from this software. This choice should be grounded in an understanding of the context in which both software systems are used, and clinicians' expertise should be leveraged to confirm or refute the underlying assumption.

Create a `df_person_fix` DataFrame that contains patient information coming from `cdm_source` other than "EHR 2".

TIP: 
You can use the `query` method: 
```python 
DataFrame.query("content of your query")
```

In [None]:
df_person_fix = #TODO
df_person_fix.info()

Now that we have handled the missingness of dates of birth, let's check the plausibility of the available dates. Plot the birth datetime distribution as a bar chart.

Tip 1 : you can convert the birth datetime to a "YYYY-MM" format using the following command : 
```python 
    df_person_fix['birth_date'] = df_person_fix['birth_datetime'].dt.strftime('%Y-%m')
```  
Tip 2 : If your DataFrame contains too many rows, you can use the `pandas.DataFrame.groupby()` function and count the number of `person_id` by `birth_date`.
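
The two tips can be combined as in the following sketch, run here on a toy table with invented dates. The resulting summary table has one row per month and is small enough to be passed to a bar chart.

```python
import pandas as pd

# Toy table with invented birth dates.
toy_person = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "birth_datetime": pd.to_datetime(
        ["1980-05-01", "1980-05-20", "1975-02-10", "1980-06-02"]
    ),
})

# Truncate each datetime to a "YYYY-MM" string, then count patients per month.
toy_person["birth_date"] = toy_person["birth_datetime"].dt.strftime("%Y-%m")
birth_counts = toy_person.groupby("birth_date", as_index=False).person_id.count()
print(birth_counts)
```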

In [None]:
df_person_fix['birth_date'] = #TODO

In [None]:
birth_dates_summary = #TODO

In [None]:
#TODO

What do you see? Does this distribution look plausible?

### 2.2.2 Death dates

Plot the death datetime distribution. Do you observe anything abnormal?

TIP: Use the same steps as for the birth datetime.

In [None]:
df_person_fix['death_date'] = #TODO

In [None]:
death_dates_summary = #TODO

In [None]:
#TODO

What do you observe ?

<a id="prepro_visits"></a>
## 2.3 Pre-processing administrative data related to patients' pathways

We now consider administrative data related to patients' hospitalizations.

Plot the distribution of entrance dates, i.e. the `visit_start_datetime` column of the `df_visit` DataFrame.  

Tip 1: Check for null dates, and convert dates to the "YYYY-MM" format.  
Tip 2: If your DataFrame contains too many rows, you can use the `pandas.DataFrame.groupby()` function and count the number of `person_id` by `visit_start_date`.

In [None]:
df_visit['visit_start_date'] = #TODO

In [None]:
visit_start_dates_summary = #TODO

In [None]:
#TODO

Discard visits whose dates are not plausible (*e.g.* occurring before 01/01/2000), and check the new date distribution.

TIPS: 
- Use `pd.to_datetime("01/01/2000")` to compare the visit start date to 01/01/2000 and create a DataFrame `df_visit_fix` with only visits starting after 01/01/2000.
- Convert the visit start datetime to a "YYYY-MM" format.
- Use the `pandas.DataFrame.groupby()` function and count the number of `person_id` by `visit_start_date`.
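
A minimal sketch of the date filter on a toy table is given below. The values are invented, and the cutoff is written in ISO format to avoid any day/month ambiguity.

```python
import pandas as pd

# Toy visits with one implausible date and one missing date.
toy_visit = pd.DataFrame({
    "visit_occurrence_id": [10, 11, 12],
    "visit_start_datetime": pd.to_datetime(["1970-01-01", "2021-03-04", None]),
})

cutoff = pd.to_datetime("2000-01-01")

# Comparisons with NaT evaluate to False, so rows with a missing date are dropped too.
# `.copy()` makes the result an independent DataFrame that can safely receive new columns.
toy_visit_fix = toy_visit[toy_visit["visit_start_datetime"] >= cutoff].copy()
toy_visit_fix["visit_start_date"] = toy_visit_fix["visit_start_datetime"].dt.strftime("%Y-%m")
print(toy_visit_fix)
```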

In [None]:
df_visit_fix = #TODO

In [None]:
df_visit_fix['visit_start_date'] = #TODO

In [None]:
visit_fix_start_dates_summary = #TODO

In [None]:
#TODO

**WARNING**: Although knowing when a visit took place is crucial to estimate survival functions, do not forget that this selection may once more induce biases. We will evaluate its impact later on in the project.

<a id="prepro_cond"></a>
## 2.4 Pre-processing claim data

We now consider claim data. We want to plot the total number of visits related to flu treatment, and their distribution over time.  

Create a `df_cond_fix` DataFrame that contains only information about the previously selected visits.  

Tip 1: Merge the `df_condition` and `df_visit_fix` DataFrames on their common feature *visit_occurrence_id*.

Tip 2 : we only need the *visit_start_date* and *visit_occurrence_id* from the `df_visit_fix` DataFrame.
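
The merge described in these tips could look like the sketch below, shown on toy tables with invented values. The extra column `care_site_id` is hypothetical and only there to show that unneeded columns are left out.

```python
import pandas as pd

toy_cond = pd.DataFrame({
    "visit_occurrence_id": [10, 11, 11, 12],
    "condition_source_value": ["J10", "J11", "I10", "J09"],
})
toy_visit_fix = pd.DataFrame({
    "visit_occurrence_id": [11, 12],
    "visit_start_date": ["2021-03", "2021-04"],
    "care_site_id": ["A", "B"],  # hypothetical extra column, not needed here
})

# An inner merge keeps only the conditions attached to a retained visit.
toy_cond_fix = toy_cond.merge(
    toy_visit_fix[["visit_occurrence_id", "visit_start_date"]],
    on="visit_occurrence_id",
    how="inner",
)
print(toy_cond_fix)
```

Visit 10 has no match in `toy_visit_fix`, so its condition disappears from the result.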

In [None]:
df_cond_fix = #TODO

In [None]:
df_cond_fix.head()

In [None]:
df_cond_fix["visit_start_date"] = #TODO

Plot the distribution over time of *visit_occurrence_id* counts for each *condition_source_value*.

Tip : you can group the `df_cond_fix` DataFrame by *visit_start_date* and *condition_source_value*, and count the number of *visit_occurrence_id* for each group, using the following command :
```python
 DataFrame.groupby(['key1', 'key2'], as_index=False).visit_occurrence_id.count()
 ```

In [None]:
cond_fix_start_dates_summary = #TODO

In [None]:
cond_fix_start_dates_summary.head()

In [None]:
#TODO

What do you observe ?

Create an `is_epidemic` column in the `df_cond_fix` DataFrame to detect conditions linked to the flu epidemic, using its associated codes (J09, J10, J11).

Tip 1: Create a `list_epidemic_icd10` list gathering the epidemic codes.  
Tip 2: You can use the `df_cond_fix.condition_source_value.apply()` method to test whether each element of the column is in `list_epidemic_icd10`. Its argument can be a function such as:
``` python
lambda x: treatment(x)
```
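
As an illustration on toy data, the flag can be built with `apply` and a lambda, or equivalently with the vectorized `isin` method. The names `toy_codes` and `toy_cond` are made up for this sketch.

```python
import pandas as pd

toy_codes = ["J09", "J10", "J11"]  # flu codes listed in section 1.3
toy_cond = pd.DataFrame({"condition_source_value": ["J10", "I10", "J09", "E11"]})

# Element-wise membership test with a lambda.
toy_cond["is_epidemic"] = toy_cond.condition_source_value.apply(
    lambda code: code in toy_codes
)

# Vectorized equivalent, usually faster on large tables.
is_epidemic_vectorized = toy_cond.condition_source_value.isin(toy_codes)
print(toy_cond)
```

Both approaches rely on exact matching. If the codes in the data carried subdivisions (for example `J10.1`, which is an assumption about the data), comparing only the first three characters would be needed instead.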

In [None]:
list_epidemic_icd10 = #TODO

In [None]:
df_cond_fix["is_epidemic"] = #TODO

Create an `epidemic_cond_summary` DataFrame gathering the number of epidemic vs non-epidemic visits by `condition_start_date`.

In [None]:
epidemic_cond_summary = #TODO

In [None]:
epidemic_cond_summary.head()

Plot the distribution over time of epidemic visits compared to non-epidemic visits.

In [None]:
#TODO

<a id="stat"></a>
# 3. Statistical analysis

Now that we have pre-processed the raw data to correct flawed or missing values and to define research-oriented variables, we can conduct the statistical analysis. Our data are ready for plotting the Kaplan-Meier estimates of the survival curves and performing the log-rank tests.  
We are only interested in epidemic visits, so filter out non-epidemic conditions from the `df_cond_fix` DataFrame, and derive the `df_visit_epidemic` DataFrame from the result.
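
One way to chain these two filters is sketched below on toy tables with invented values: a boolean mask selects the epidemic conditions, and `isin` then keeps the visits that appear among them.

```python
import pandas as pd

toy_cond_fix = pd.DataFrame({
    "visit_occurrence_id": [11, 11, 12, 13],
    "is_epidemic": [True, False, False, True],
})
toy_visit_fix = pd.DataFrame({
    "visit_occurrence_id": [11, 12, 13],
    "person_id": [1, 2, 3],
})

# Keep epidemic conditions only.
toy_cond_epidemic = toy_cond_fix[toy_cond_fix["is_epidemic"]]

# Keep the visits that have at least one epidemic condition.
toy_visit_epidemic = toy_visit_fix[
    toy_visit_fix["visit_occurrence_id"].isin(toy_cond_epidemic["visit_occurrence_id"])
]
print(toy_visit_epidemic)
```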


In [None]:
df_cond_epidemic = #TODO

In [None]:
df_visit_epidemic = #TODO

<a id="stat_first_kaplan"></a>
## 3.1 First objective: Are the drugs effective in the overall population?

Plot the new primary Kaplan-Meier estimates for the whole `df_person_fix` DataFrame with regards to the epidemic conditions, newly fixed visits and drug administration.

In [None]:
plot_primary_kaplan(#TODO)

What do you observe ?

<a id="stat_second_kaplan"></a>
## 3.2 Second objective: Sub-population analysis

To reach our secondary objective, we now conduct the same statistical analysis on sub-populations defined by sex and age, in order to gain better insight into the drugs' efficacy.  

Plot the secondary Kaplan-Meier estimates for the sub-group analysis on **cohort A**:

In [None]:
plot_secondary_kaplan(#TODO)

Plot the secondary Kaplan-Meier estimates for the sub-group analysis on **cohort B**:

In [None]:
plot_secondary_kaplan(#TODO)

What can you conclude from these subgroup analyses?

The analysis presented in this notebook is not representative of a real research study, which usually involves more data transformations and a more elaborate statistical design. In particular, we have not considered the biases that may be induced by discarding missing data, although this is a crucial issue that can be addressed with more advanced statistical methods. This notebook only aims to provide a first illustration of some challenges related to the analysis of Real-World Data stored in hospitals' clinical data warehouses.

In particular, it shows that analysis pipelines suited to Real-World Data studies are complex and multistage. Consolidating the quality of preprocessing pipelines is important to enhance the reliability of the evidence produced from EHR data. This consolidation may be achieved by opening the code of the analysis pipelines to review, and by developing and testing it collaboratively, for instance as part of open-source scientific libraries.

<a id="takeaways"></a>
# 4. Takeaways

- **Real-World Data follows the "no pain, no gain" principle.** Although data may appear simpler to collect than in a randomized controlled trial, reaching meaningful insights requires correcting numerous biases and applying substantial transformations to raw data.
- **Administrative and claim data** contain important information for research although they were not collected for that purpose.
- **The analysis pipelines required to analyse Real-World Data are complex** and rely on numerous transformations that progressively improve data quality and its suitability for research. Sharing the development of analysis pipelines among projects, for instance in scientific libraries, improves the overall efficiency and quality of research.

<a id="references"></a>
# 5. References

- Kohane, Isaac S, Bruce J Aronow, Paul Avillach, Brett K Beaulieu-Jones, Riccardo Bellazzi, Robert L Bradford, Gabriel A Brat, et al. "What Every Reader Should Know About Studies Using Electronic Health Record Data but May Be Afraid to Ask". Journal of Medical Internet Research 23, no. 3 (March 2, 2021): e22219. https://doi.org/10.2196/22219.
- Kaplan, E. L., and Paul Meier. "Nonparametric Estimation from Incomplete Observations". Journal of the American Statistical Association 53, no. 282 (1958): 457-81. https://doi.org/10.2307/2281868.
- Davidson-Pilon, Cameron. lifelines, survival analysis in Python. Zenodo, 2021. https://doi.org/10.5281/zenodo.5745573.
- McCoy, Allison B, Adam Wright, Michael G Kahn, Jason S Shapiro, Elmer Victor Bernstam, and Dean F Sittig. "Matching Identifiers in Electronic Health Records: Implications for Duplicate Records and Patient Safety". BMJ Quality & Safety 22, no. 3 (March 2013): 219-24. https://doi.org/10.1136/bmjqs-2012-001419.
- Wilson, Greg, D. A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H. D. Haddock, et al. "Best Practices for Scientific Computing". Edited by Jonathan A. Eisen. PLoS Biology 12, no. 1 (January 7, 2014): e1001745. https://doi.org/10.1371/journal.pbio.1001745.