# Analysing variables

This notebook was used to analyse:

1. `requested_by_regulator` stratified by `risk_management_plan`, and `has_protocol` for all studies requested by a regulator.<br>
<small>
**NOTE:** We found studies with an RMP of "Not applicable" or "Unspecified" that were still requested by a regulator.
+ We do not know how this is possible or whether it is an error.
+ The lack of documentation makes this unclear.
+ Nevertheless, this variable will be used in the logistic regression models because it was used in previous work.
</small>
2. Median (IQR) time since protocol and final report due
3. Creating `funding_sources_grouped_override` to override `funding_sources_grouped`, where manual mapping is needed

Import needed libraries and data:

In [1]:
import pandas as pd

from itertools import chain

na_values = [
    "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", 
    "1.#IND", "1.#QNAN", "<NA>", "NULL", "NaN", "None", "nan", "null"
    # "N/A",
    # "NA",
    # "n/a",
]

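def python_name_converter(x):
    """Convert e.g. 'Eu Pas Register Number' to 'eu_pas_register_number'; keep '$'-prefixed names as-is."""
    return '_'.join([word.lower() for word in x.split(' ')]) if x[0] != '$' else x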

In [2]:
all_included, due_protocol, due_results = pd.read_excel(
    '../../output/ema_rwd/ema_rwd_final_statistics_variables.xlsx',
    index_col=0,
    sheet_name=None,
    keep_default_na=False,
    na_values=na_values,
    na_filter=True
).values()
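The custom `na_values` list together with `keep_default_na=False` means that strings such as `"N/A"`, `"NA"` and `"n/a"` are kept as text instead of being parsed as missing. The following minimal sketch illustrates this behaviour. It assumes `pd.read_csv` on an in-memory string (which accepts the same three parameters) instead of the Excel file, and the demo rows are made up for illustration.

```python
import io

import pandas as pd

# Same idea as the notebook's list: pandas defaults minus "N/A", "NA", "n/a"
na_values = [
    "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan",
    "1.#IND", "1.#QNAN", "<NA>", "NULL", "NaN", "None", "nan", "null",
]

# Hypothetical demo data, not taken from the real dataset
demo_csv = io.StringIO(
    "id,risk_management_plan\n"
    "1,N/A\n"
    "2,\n"
    "3,NULL\n"
    "4,Not applicable\n"
)

demo = pd.read_csv(
    demo_csv,
    index_col=0,
    keep_default_na=False,  # ignore the pandas default list of NA markers
    na_values=na_values,    # ... and use only our own list
    na_filter=True,
)

# "N/A" survives as a string; the empty cell and "NULL" become missing
is_missing = demo["risk_management_plan"].isna().tolist()
```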

## `requested_by_regulator`

Checking `requested_by_regulator` value counts for studies with RMP `"Not applicable"`:

In [3]:
pd.merge(
    all_included[all_included['risk_management_plan'].eq('Not applicable')]['requested_by_regulator'].value_counts(dropna=False),
    (all_included[all_included['risk_management_plan'].eq('Not applicable')]['requested_by_regulator'].value_counts(normalize=True, dropna=False) * 100).round(1),
    left_index=True, right_index=True
)

| requested_by_regulator | count | proportion (%) |
|---|---|---|
| 0.0 | 1220 | 78.7 |
| 1.0 | 305 | 19.7 |
| NaN | 25 | 1.6 |
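The count and percentage table above is built by calling `value_counts` twice and merging the results, a pattern that recurs in the following cells. As a minimal sketch, the same table could be produced by a small helper. The name `counts_with_percent` and the demo data are hypothetical and not part of the original notebook.

```python
import pandas as pd


def counts_with_percent(series: pd.Series) -> pd.DataFrame:
    """Return absolute counts and rounded percentages, including missing values."""
    frame = series.value_counts(dropna=False).to_frame("count")
    # Derive the percentage from the counts so both columns share one index
    frame["percent"] = (frame["count"] / frame["count"].sum() * 100).round(1)
    return frame


# Hypothetical demo data, not taken from the real dataset
demo = pd.DataFrame({
    "risk_management_plan": ["Not applicable"] * 4 + [None],
    "requested_by_regulator": [0.0, 0.0, 1.0, None, 1.0],
})

mask = demo["risk_management_plan"].eq("Not applicable")
table = counts_with_percent(demo.loc[mask, "requested_by_regulator"])
```

Deriving the percentage from the counts avoids filtering the data twice and keeps both columns aligned by construction.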


Checking `requested_by_regulator` value counts for studies with RMP unspecified (`pd.NA`):

In [4]:
pd.merge(
    all_included[all_included['risk_management_plan'].isna()]['requested_by_regulator'].value_counts(dropna=False),
    (all_included[all_included['risk_management_plan'].isna()]['requested_by_regulator'].value_counts(normalize=True, dropna=False) * 100).round(1),
    left_index=True, right_index=True
)

| requested_by_regulator | count | proportion (%) |
|---|---|---|
| 0.0 | 114 | 86.4 |
| 1.0 | 15 | 11.4 |
| NaN | 3 | 2.3 |


`has_protocol` for all studies requested by a regulator

In [5]:
tmp = pd.merge(
    all_included[all_included['requested_by_regulator'].eq(1)]['has_protocol'].value_counts(dropna=False),
    (all_included[all_included['requested_by_regulator'].eq(1)]['has_protocol'].value_counts(normalize=True, dropna=False) * 100).round(1),
    left_index=True, right_index=True
)

pd.concat([
    tmp,
    tmp.sum().rename('All').to_frame().T.rename_axis('has_protocol', axis='index')
], axis='index')


| has_protocol | count | proportion (%) |
|---|---|---|
| True | 703.0 | 60.9 |
| False | 452.0 | 39.1 |
| All | 1155.0 | 100.0 |


## Median (IQR) time since protocol and final report due

In [6]:
_, bins = pd.qcut(
    due_protocol['data_collection_days_difference'], 
    q=[0, .25, .5, .75, 1.],
    retbins=True
)

result = (bins / 365.25).round(1)

display(
    result,
    f'Median (IQR) time since protocol due: {result[2]} ({result[1]} - {result[3]})'
)

array([ 0. ,  3.6,  6.6,  9.2, 30.1])

'Median (IQR) time since protocol due: 6.6 (3.6 - 9.2)'

In [7]:
_, bins = pd.qcut(
    due_results['data_collection_days_difference'], 
    q=[0, .25, .5, .75, 1.],
    retbins=True
)

result = (bins / 365.25).round(1)

display(
    result,
    f'Median (IQR) time since final report due: {result[2]} ({result[1]} - {result[3]})'
)

array([ 0.5,  4.9,  7.5,  9.9, 27.1])

'Median (IQR) time since final report due: 7.5 (4.9 - 9.9)'
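Both cells above read the quartiles from the bin edges returned by `pd.qcut` and convert days to years. A minimal alternative sketch takes the quartiles directly with `Series.quantile`. The helper name `median_iqr_years` and the demo values are hypothetical, and the sketch assumes that the bin edges of `pd.qcut` coincide with these quantiles, which we have not verified against the real data.

```python
import pandas as pd


def median_iqr_years(days: pd.Series, days_per_year: float = 365.25) -> str:
    """Format the median and interquartile range of a day count in years."""
    quartiles = (days.quantile([0.25, 0.5, 0.75]) / days_per_year).round(1)
    q1, median, q3 = quartiles.tolist()
    return f"{median} ({q1} - {q3})"


# Hypothetical demo data: 0 to 4 years, expressed in days
demo_days = pd.Series([0.0, 365.25, 730.5, 1095.75, 1461.0])
summary = median_iqr_years(demo_days)
```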

## Creating `funding_sources_grouped_override`

First, we define the `funding_sources` values that need manual mapping:

In [8]:
filter_values = [
    # 'EMA', # non-commercial 
    'EMA; Other', # mixed or non-commercial
    # 'EU institutional research programme', # non-commercial
    # 'EU institutional research programme; Non for-profit organisation (e.g. charity); Other; Pharmaceutical company and other private sector\xa0', # mixed
    'EU institutional research programme; Other', # mixed or non-commercial
    # 'EU institutional research programme; Pharmaceutical company and other private sector\xa0', # mixed
    # 'No external funding', # no funding
    # 'Non for-profit organisation (e.g. charity)', # non-commercial 
    'Non for-profit organisation (e.g. charity); Other', # mixed or non-commercial
    # 'Non for-profit organisation (e.g. charity); Other; Pharmaceutical company and other private sector\xa0', # mixed
    # 'Non for-profit organisation (e.g. charity); Pharmaceutical company and other private sector\xa0', # mixed
    'Other', # mixed or commercial or non-commercial
    'Other; Pharmaceutical company and other private sector\xa0', # mixed or commercial
    # 'Pharmaceutical company and other private sector\xa0' # commercial
]
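The selection above appears to follow a simple rule: a value needs manual mapping when it contains `Other` and its remaining components do not already make it "mixed". The sketch below encodes this reading. The rule, the helper `needs_manual_mapping` and the two constant sets are our own assumptions inferred from the inline comments; they are not part of the original pipeline.

```python
# Assumed grouping of the known components, taken from the inline comments above
NON_COMMERCIAL = {
    "EMA",
    "EU institutional research programme",
    "Non for-profit organisation (e.g. charity)",
}
COMMERCIAL = {"Pharmaceutical company and other private sector"}


def needs_manual_mapping(funding_sources: str) -> bool:
    """Return True if 'Other' leaves the sponsor class undetermined."""
    # strip() also removes the trailing non-breaking space (\xa0) in the raw values
    parts = {part.strip() for part in funding_sources.split(";")}
    if "Other" not in parts:
        return False
    # Commercial plus non-commercial components are "mixed" regardless of 'Other'
    already_mixed = bool(parts & NON_COMMERCIAL) and bool(parts & COMMERCIAL)
    return not already_mixed


examples = [
    "EMA; Other",
    "Other",
    "Other; Pharmaceutical company and other private sector\xa0",
    "No external funding",
    "Non for-profit organisation (e.g. charity); Other; Pharmaceutical company and other private sector\xa0",
]
flags = [needs_manual_mapping(value) for value in examples]
```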

We can now use `funding_details` to determine the sponsor class manually. The values of `funding_details` have already been unified and cleaned in `$MATCHED`, so we use these cleaned values to reduce the mapping effort.

In [9]:
sponsor_classification = pd.read_excel(
    '../../output/ema_rwd/ema_rwd_final.xlsx', 
    index_col=0, 
    keep_default_na=False,
    na_values=na_values,
    na_filter=True
).rename(
    columns=python_name_converter
).set_index("eu_pas_register_number")[
    ["funding_sources", "funding_details", "$MATCHED", "$CANCELLED_MANUAL"]
]

# display(sorted(sponsor_classification["funding_sources"].dropna().unique()))

sponsor_classification = sponsor_classification[~sponsor_classification["$CANCELLED_MANUAL"].fillna(0).astype(bool)].drop("$CANCELLED_MANUAL", axis='columns')
sponsor_classification = sponsor_classification[sponsor_classification["funding_sources"].isin(filter_values)]

print(
    'All studies with unclear funding_sources\nNumber of entries: ' +
    str(len(sponsor_classification)) +
    '\nNumber of unique sponsors: ' +
    str(len(sorted(set(chain.from_iterable(
        sponsor_classification['$MATCHED'].str.split("; ").dropna().values
    )))))
)

sponsor_classification = sponsor_classification.merge(all_included[['due_protocol', 'due_result']].any(axis='columns').rename('due'), left_index=True, right_index=True, how='left')

sponsor_classification = sponsor_classification[sponsor_classification['due']].drop("due", axis='columns')

print(
    'Due studies with unclear funding_sources\nNumber of entries: ' +
    str(len(sponsor_classification)) +
    '\nNumber of unique sponsors: ' +
    str(len(sorted(set(chain.from_iterable(
        sponsor_classification['$MATCHED'].str.split("; ").dropna().values
    )))))
)

All studies with unclear funding_sources
Number of entries: 500
Number of unique sponsors: 296
Due studies with unclear funding_sources
Number of entries: 372
Number of unique sponsors: 244
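The unique-sponsor counts above are obtained by splitting `$MATCHED` on `"; "` and flattening the resulting lists. As a minimal sketch on made-up rows, the same set of names can be obtained with `Series.explode` instead of `itertools.chain`.

```python
from itertools import chain

import pandas as pd

# Hypothetical demo rows; the sponsor names are only illustrative
matched = pd.Series(
    ["AbbVie; ViiV Healthcare", "AbbVie", None, "University of Bristol"],
    name="$MATCHED",
)

# Variant used in the notebook: split, drop missing rows, flatten with chain
via_chain = sorted(set(chain.from_iterable(
    matched.str.split("; ").dropna().values
)))

# Variant with explode: one row per sponsor, then unique values
via_explode = sorted(matched.dropna().str.split("; ").explode().unique())
```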


In [10]:
mapping_output = pd.Series(
    sorted(set(chain.from_iterable(
        sponsor_classification['$MATCHED'].str.split("; ").dropna().values
    ))), 
    name='manual'
).to_frame().assign(category=pd.NA)
mapping_output.shape

(244, 2)

In [11]:
mapping_output.to_excel('./sponsor_classification.xlsx', index=False)

We will now import the mapping file after manual classification.

In [15]:
mapping_input = pd.read_excel('./sponsor_classification_manual.xlsx', index_col=0)
mapping_input



| manual | category |
|---|---|
| $NoFunding | No Funding |
| $NotFound/UnclearValue | Unclear |
| Aarhus University (AU) | Non-Commercial |
| Aarne Koskelo Foundation | Non-Commercial |
| AbbVie | Commercial |
| ... | ... |
| University of Auckland | Non-Commercial |
| University of Bristol | Non-Commercial |
| Universitätsklinikum Erlangen | Non-Commercial |
| ViiV Healthcare | Commercial |


Check completeness of mapping file after manual classification.

In [16]:
mapping_input.index.difference(mapping_output.set_index('manual').index)

Index([], dtype='object', name='manual')

In [17]:
mapping_output.set_index('manual').index.difference(mapping_input.index)

Index([], dtype='object', name='manual')
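The two `difference` calls above check both directions separately. A minimal sketch of a single combined check uses `Index.symmetric_difference` and fails loudly if any name is missing or unexpected. The two demo indexes are made up for illustration.

```python
import pandas as pd

# Hypothetical demo indexes standing in for the exported and re-imported names
exported = pd.Index(["AbbVie", "University of Bristol", "ViiV Healthcare"], name="manual")
classified = pd.Index(["ViiV Healthcare", "AbbVie", "University of Bristol"], name="manual")

# Names that appear in exactly one of the two indexes
mismatch = exported.symmetric_difference(classified)

assert mismatch.empty, f"Mapping file is out of sync: {list(mismatch)}"
```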

Map sponsor classifications back to `eu_pas_register_number` (first with duplicates, which will be removed by aggregation):

In [18]:
sponsor_classification_manual = pd.merge(
    sponsor_classification["$MATCHED"].str.split('; ').explode(),
    mapping_input,
    how='left',
    left_on='$MATCHED',
    right_index=True,
    validate='many_to_one'
)
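The `validate='many_to_one'` argument guards against duplicate sponsor names in the mapping file, which would otherwise silently duplicate rows. The sketch below shows both the normal case and the failure case on made-up data. The register numbers `1` and `2` are hypothetical.

```python
import pandas as pd

# Hypothetical demo studies with "; "-separated sponsors
matched = pd.Series(
    ["AbbVie; ViiV Healthcare", "AbbVie"],
    index=pd.Index([1, 2], name="eu_pas_register_number"),
    name="$MATCHED",
)
exploded = matched.str.split("; ").explode()  # one row per (study, sponsor)

mapping = pd.DataFrame(
    {"category": ["Commercial", "Commercial"]},
    index=pd.Index(["AbbVie", "ViiV Healthcare"], name="manual"),
)

merged = pd.merge(
    exploded, mapping,
    how="left", left_on="$MATCHED", right_index=True,
    validate="many_to_one",
)

# A mapping with a repeated sponsor name violates the many-to-one contract
duplicated_mapping = pd.concat([mapping, mapping.iloc[[0]]])
try:
    pd.merge(
        exploded, duplicated_mapping,
        how="left", left_on="$MATCHED", right_index=True,
        validate="many_to_one",
    )
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```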

We will convert any remaining `pd.NA` categories into `No Funding`. The check below shows that no such rows remain:

In [19]:
sponsor_classification_manual[sponsor_classification_manual['category'].isna()]

| eu_pas_register_number | $MATCHED | category |
|---|---|---|


In [20]:
sponsor_classification_manual.loc[sponsor_classification_manual['category'].isna(), 'category'] = 'No Funding'

In [22]:
sponsor_classification_manual['category'].value_counts()

category
Non-Commercial    315
Commercial        163
Unclear            64
No Funding         21
Name: count, dtype: int64

We can now aggregate the per-sponsor categories back to one category per study (`eu_pas_register_number`) with the help of `agg_sponsor_category`:

In [23]:
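def agg_sponsor_category(series: pd.Series):
    """Collapse the sponsor categories of one study into a single category."""
    uniques = set(series)
    if len(uniques) == 0:
        raise ValueError('There should not be an empty entry')
    elif len(uniques) == 1:
        # All sponsors agree (this also covers a lone 'Unclear' or 'No Funding')
        return uniques.pop()
    
    # Ignore the uninformative labels when other labels are present
    uniques = set(series) - frozenset(['No Funding', 'Unclear'])
    if len(uniques) == 0:
        # Only 'No Funding' and 'Unclear' were present
        return 'No Funding'
    elif len(uniques) == 1:
        return uniques.pop()
    else:
        # More than one informative category, e.g. 'Commercial' and 'Non-Commercial'
        return 'Mixed'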
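To make the aggregation rules concrete, the following sketch applies a compact restatement of `agg_sponsor_category` to a few made-up studies. The restated function is intended to behave like the one above, and the register numbers `1` to `5` are hypothetical.

```python
import pandas as pd


def agg_sponsor_category(series: pd.Series) -> str:
    """Compact restatement of the notebook's aggregation rules."""
    uniques = set(series)
    if not uniques:
        raise ValueError("There should not be an empty entry")
    if len(uniques) == 1:
        return uniques.pop()
    informative = uniques - {"No Funding", "Unclear"}
    if not informative:
        return "No Funding"
    return informative.pop() if len(informative) == 1 else "Mixed"


# One row per (study, sponsor); hypothetical register numbers 1 to 5
demo = pd.Series(
    ["Commercial", "Unclear", "Unclear", "Non-Commercial",
     "Commercial", "Non-Commercial", "No Funding", "Unclear"],
    index=pd.Index([1, 2, 3, 3, 4, 4, 5, 5], name="eu_pas_register_number"),
    name="category",
)

per_study = demo.groupby(level=0).agg(agg_sponsor_category)
```

Study `3` illustrates that `Unclear` is dropped as soon as an informative category is available, and study `5` shows that a combination of only `No Funding` and `Unclear` resolves to `No Funding`.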

In [24]:
sponsor_classification_manual = sponsor_classification_manual.groupby(sponsor_classification_manual.index).agg({
    '$MATCHED': lambda x : '; '.join(x.astype(str)) if not x.isna().all() else x,
    'category': agg_sponsor_category
})

sponsor_classification_manual

| eu_pas_register_number | $MATCHED | category |
|---|---|---|
| 1578 | $NotFound/UnclearValue; Agenzia Italiana del F... | Non-Commercial |
| 1587 | $NotFound/UnclearValue; Vall d'Hebron Universi... | Non-Commercial |
| 1620 | Agenzia Italiana del Farmaco (AIFA) | Non-Commercial |
| 1661 | $NotFound/UnclearValue | Unclear |
| 2221 | $NotFound/UnclearValue; VII Framework Programm... | Non-Commercial |
| ... | ... | ... |
| 107696 | Avextra Pharma; Pain Technologies and Clinical... | Commercial |
| 107708 | European Medicines Agency (EMA) | Non-Commercial |
| 108001 | National Institute on Aging (NIA) | Non-Commercial |
| 108167 | $NoFunding | No Funding |


In [25]:
sponsor_classification_manual['category'].value_counts()

category
Non-Commercial    212
Commercial         67
Unclear            36
Mixed              36
No Funding         21
Name: count, dtype: int64

Finally, we will export the results:

In [26]:
pd.read_excel('../../data/ema_rwd/ema_rwd_p_m_gpt_o_v2.xlsx', index_col=0).set_index('Eu Pas Register Number').merge(
    sponsor_classification_manual['category'].rename("Funding Sources Grouped Override"),
    left_index=True, 
    right_index=True,
    how='left'
).reset_index(names='Eu Pas Register Number').to_excel('../../data/ema_rwd/ema_rwd_p_m_gpt_o_s.xlsx', sheet_name='PAS')