# Study Flow Sankey diagram

We will plot the study flow chart as a Sankey diagram.

The diagram itself is rendered by an external tool, so this notebook only prepares the data: an edge list with one row per flow and the columns `From`, `To` and `count`.

<small>**NOTE:** This notebook builds on the output of the `ema_rwd_statistic.py` script to generate the data for the study flow chart Sankey.</small>

First we will import the needed packages:

In [2]:
import numpy as np
import pandas as pd

Next we load the raw study export and the statistics variables, indexing each table by `eu_pas_register_number`:

In [3]:
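def python_name_converter(x):
    # Convert a column name to snake_case; names starting with '$' are kept unchanged.
    # startswith() also handles an empty name, where x[0] would raise an IndexError.
    return x if x.startswith('$') else '_'.join(word.lower() for word in x.split(' '))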

base = '../../output/ema_rwd/ema_rwd_final'

raw = pd.read_excel(f'{base}.xlsx').rename(columns=python_name_converter).set_index('eu_pas_register_number')
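# sheet_name=None returns a dict of all sheets; the unpacking relies on the workbook's sheet order.
variables, variables_due_protocol, variables_due_result = pd.read_excel(f'{base}_statistics_variables.xlsx', sheet_name=None).values()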

variables = variables.set_index('eu_pas_register_number')
variables_due_protocol = variables_due_protocol.set_index('eu_pas_register_number')
variables_due_result = variables_due_result.set_index('eu_pas_register_number')
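
As a quick illustration of the renaming rule used while loading, the sketch below applies `python_name_converter` to a few header names. The headers are assumptions chosen for illustration (only `$UPDATED_state` and the converted name `eu_pas_register_number` appear in this notebook), and the function is repeated so the snippet runs on its own.

```python
def python_name_converter(x):
    # Same rule as in the loading cell: snake_case unless the name starts with '$'.
    return '_'.join([word.lower() for word in x.split(' ')]) if x[0] != '$' else x

# Hypothetical header names, for illustration only.
examples = ['EU PAS Register Number', 'Title', '$UPDATED_state']
converted = [python_name_converter(name) for name in examples]
print(converted)
# → ['eu_pas_register_number', 'title', '$UPDATED_state']
```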

## Cancelled Studies

In [6]:
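# Studies flagged manually as cancelled get the state 'Cancelled'.
raw.loc[raw['$CANCELLED_MANUAL'].fillna(False).astype(bool), '$UPDATED_state'] = 'Cancelled'
# Where a manual override exists, it takes precedence over the updated state.
sankey_df = raw['$UPDATED_state_override'].combine_first(raw['$UPDATED_state']) \
    .rename('To').value_counts().to_frame().reset_index() \
    .assign(From='All')
sankey_df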

|   | To | count | From |
|---|---|---|---|
| 0 | Finalised | 1484 | All |
| 1 | Ongoing | 816 | All |
| 2 | Planned | 400 | All |
| 3 | Cancelled | 60 | All |
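
The order of the two steps in this cell matters: the cancellation flag is written into `$UPDATED_state` first, and `combine_first` then prefers the override column wherever it is filled. The following is a minimal sketch on a made-up four-row frame (the rows are assumptions, only the column names come from the notebook) that shows which value ends up in the final state.

```python
import pandas as pd

# Toy data, invented for illustration; column names mirror the notebook.
toy = pd.DataFrame({
    '$UPDATED_state': ['Finalised', 'Ongoing', 'Planned', 'Ongoing'],
    '$UPDATED_state_override': [None, 'Finalised', None, None],
    '$CANCELLED_MANUAL': [None, None, True, None],
})

# Step 1: manually cancelled studies get the state 'Cancelled'.
cancelled = toy['$CANCELLED_MANUAL'].fillna(False).astype(bool)
toy.loc[cancelled, '$UPDATED_state'] = 'Cancelled'

# Step 2: the override wins wherever it is not missing.
state = toy['$UPDATED_state_override'].combine_first(toy['$UPDATED_state'])
print(state.tolist())
# → ['Finalised', 'Finalised', 'Cancelled', 'Ongoing']
```

One consequence of this order is that a study with both a cancellation flag and an override would be counted under the override, not under `Cancelled`.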


## Due Report and Due Protocol

In [7]:
due_protocols = variables_due_protocol['updated_state'].rename('From').value_counts().to_frame().reset_index().assign(To='Due Protocols')

sankey_df = pd.concat([
    sankey_df, due_protocols
])

due_protocols

|   | From | count | To |
|---|---|---|---|
| 0 | Finalised | 1484 | Due Protocols |
| 1 | Ongoing | 816 | Due Protocols |


In [8]:
# Split the finalised studies by whether their results are due.
due_results = variables.loc[variables['updated_state'].isin(['Finalised']), 'due_result'] \
    .rename('due_result').value_counts().to_frame().reset_index() \
    .assign(
        From='Finalised',
        To=lambda x : np.where(x['due_result'], 'Due Results', 'Not Due Results')
    ) \
    .drop(columns='due_result')

sankey_df = pd.concat([
    sankey_df, due_results
])

due_results

|   | count | From | To |
|---|---|---|---|
| 0 | 1482 | Finalised | Due Results |
| 1 | 2 | Finalised | Not Due Results |
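
The same pattern (count a boolean column, then label the `True` and `False` groups) recurs for several stages of the flow. It could be wrapped in a small helper; the sketch below is one possible shape, not the notebook's own code. The name `boolean_flows` and the toy flags are assumptions.

```python
import numpy as np
import pandas as pd

def boolean_flows(flags, source, true_label, false_label):
    """Count True/False values in `flags` and return them as edges leaving `source`."""
    counts = flags.astype(bool).value_counts()
    return pd.DataFrame({
        'From': source,
        # The index of `counts` holds the boolean values that were counted.
        'To': np.where(counts.index.to_numpy(), true_label, false_label),
        'count': counts.to_numpy(),
    })

# Toy flags, invented for illustration.
flags = pd.Series([True, True, False, True])
edges = boolean_flows(flags, 'Finalised', 'Due Results', 'Not Due Results')
print(edges)
```

The helper returns the columns in a fixed order (`From`, `To`, `count`), which keeps the frames aligned when they are concatenated.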


### Newest study due report / protocol

In [9]:
# Reference time against which the newest dates are compared.
compare_datetime = np.datetime64('2024-02-21T23:20', 'm')
# max() returns the newest date and skips missing values.
newest_report = variables_due_result['final_report_date_actual'].max()
newest_collection = variables_due_protocol['data_collection_date_actual'].max()
print(newest_report)
print(compare_datetime - newest_report)
print(newest_collection)
print(compare_datetime - newest_collection)

2023-12-15 00:00:00
68 days 23:20:00
2024-01-16 00:00:00
36 days 23:20:00
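
The first difference printed above can be reproduced by hand with pandas timestamps. This is a minimal sketch: the reference time and the date `2023-12-15` come from the cell above, while the other two dates are made up to give `max()` something to choose from.

```python
import pandas as pd

reference = pd.Timestamp('2024-02-21 23:20')

# Only 2023-12-15 is taken from the notebook; the other dates are invented.
dates = pd.Series(pd.to_datetime(['2023-11-02', '2023-12-15', '2023-06-30']))

newest = dates.max()
age = reference - newest
print(newest)  # → 2023-12-15 00:00:00
print(age)     # → 68 days 23:20:00
```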


## Export

We can now inspect the combined data and export it as a CSV file:

In [10]:
sankey_df

|   | To | count | From |
|---|---|---|---|
| 0 | Finalised | 1484 | All |
| 1 | Ongoing | 816 | All |
| 2 | Planned | 400 | All |
| 3 | Cancelled | 60 | All |
| 0 | Due Protocols | 1484 | Finalised |
| 1 | Due Protocols | 816 | Ongoing |
| 0 | Due Results | 1482 | Finalised |
| 1 | Not Due Results | 2 | Finalised |


In [11]:
sankey_df.to_csv('study_flow_sankey.csv', index=False)

### Other format

Another tool expects one flow per line in the form `From [count] To` instead of a CSV file, so we also print the data in that format:

In [12]:
# Build one 'From [count] To' line per flow.
extra_df = sankey_df.assign(
    formula=lambda x : x['From'].astype(str) + ' [' + x['count'].astype(str) + '] ' + x['To'].astype(str)
)
print('\n'.join(extra_df['formula'].values))

All [1484] Finalised
All [816] Ongoing
All [400] Planned
All [60] Cancelled
Finalised [1484] Due Protocols
Ongoing [816] Due Protocols
Finalised [1482] Due Results
Finalised [2] Not Due Results
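
To check that the printed lines still carry the same information as the data frame, they can be parsed back into columns. The sketch below is an assumption about how such a check might look; `parse_flow_lines` and the regular expression are hypothetical helpers, and the two input lines are copied from the output above.

```python
import re
import pandas as pd

# One flow per line: source, count in square brackets, target.
FLOW_LINE = re.compile(r'^(?P<From>.+?) \[(?P<count>\d+)\] (?P<To>.+)$')

def parse_flow_lines(text):
    """Parse 'From [count] To' lines into a DataFrame with From, To and count."""
    rows = []
    for line in text.strip().splitlines():
        match = FLOW_LINE.match(line.strip())
        if match is None:
            raise ValueError(f'Not a flow line: {line!r}')
        rows.append({
            'From': match['From'],
            'To': match['To'],
            'count': int(match['count']),
        })
    return pd.DataFrame(rows, columns=['From', 'To', 'count'])

text = """All [1484] Finalised
Finalised [2] Not Due Results"""
parsed = parse_flow_lines(text)
print(parsed)
```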


## Extended Sankey + Export

We can also export an extended version that adds the outcomes as a final stage: whether the due protocols and the due results are actually available.

In [13]:
has_protocols = variables_due_protocol['has_protocol'].rename('has_protocol').value_counts().to_frame().reset_index() \
    .assign(
        From='Due Protocols',
        To=lambda x : np.where(x['has_protocol'], 'Has Protocols', 'No Protocols')
    ) \
    .drop(columns='has_protocol')

has_results = variables_due_result['has_result'].rename('has_result').value_counts().to_frame().reset_index() \
    .assign(
        From='Due Results',
        To=lambda x : np.where(x['has_result'], 'Has Results', 'No Results')
    ) \
    .drop(columns='has_result')

# Build the extended frame in one step so that sankey_df itself stays unchanged.
sankey_extended_df = pd.concat([
    sankey_df, has_protocols, has_results
])

In [14]:
sankey_extended_df

|   | To | count | From |
|---|---|---|---|
| 0 | Finalised | 1484 | All |
| 1 | Ongoing | 816 | All |
| 2 | Planned | 400 | All |
| 3 | Cancelled | 60 | All |
| 0 | Due Protocols | 1484 | Finalised |
| 1 | Due Protocols | 816 | Ongoing |
| 0 | Due Results | 1482 | Finalised |
| 1 | Not Due Results | 2 | Finalised |
| 0 | Has Protocols | 1370 | Due Protocols |
| 1 | No Protocols | 930 | Due Protocols |


In [15]:
sankey_extended_df.to_csv('study_flow_sankey_extended.csv', index=False)
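
Before handing the file to a plotting tool, it can be worth checking that a split node passes on exactly what it receives. The sketch below does this for `Due Protocols`, using the four relevant rows from the table above; `node_balance` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Rows copied from the extended table above (flows into and out of 'Due Protocols').
edges = pd.DataFrame({
    'From': ['Finalised', 'Ongoing', 'Due Protocols', 'Due Protocols'],
    'To': ['Due Protocols', 'Due Protocols', 'Has Protocols', 'No Protocols'],
    'count': [1484, 816, 1370, 930],
})

def node_balance(edges, node):
    """Return the total flow into `node` and the total flow out of it."""
    inflow = edges.loc[edges['To'] == node, 'count'].sum()
    outflow = edges.loc[edges['From'] == node, 'count'].sum()
    return int(inflow), int(outflow)

inflow, outflow = node_balance(edges, 'Due Protocols')
print(inflow, outflow)
# → 2300 2300
```

This check only makes sense for nodes that are split into one following stage. `Finalised`, for example, feeds both `Due Protocols` and the due-results split, so its outflow is counted twice by design.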