# Comparing EU PAS Register and EMA RWD Catalogue data
We compare the scraped EU PAS Register data with the scraped EMA RWD Catalogue data to identify changes made to the `centre` and `funding` fields.

<small>**NOTE:** We used this notebook to decide which columns to use for the sponsor matching.</small>

<small>**NOTE:** `centre_name` and `centre_name_of_investigator` were used previously for the EU PAS Register.</small>

<small>**NOTE:** `funding_details` was used for the EMA RWD Catalogue.</small>

First we import `pandas`:

In [1]:
import pandas as pd

We now read the scraped data of the EU PAS Register and the EMA RWD Catalogue, as well as the centre matching file, and keep only the relevant columns.

In [None]:
eupas = (
    pd.read_csv('./data/scraped_eupas_2024-01-23T22-15-21+00-00.csv')
    .set_index('eu_pas_register_number')
    .filter(like='centre')
    .drop('centre_location', axis='columns')
)
eupas

eu_pas_register_number,centre_name,centre_name_of_investigator,centre_organisation
49988,Eli Lilly and Company,,
11512,,OPRI Pte Ltd,OPRI Pte Ltd
9142,,OPRI Pte Ltd,OPRI Pte Ltd
7678,,OPRI Pte Ltd,OPRI Pte Ltd
104156,,OPRI Pte Ltd,OPRI Pte Ltd
...,...,...,...
6023,,PAREXEL,PAREXEL International Corporation
13040,,OXON,Oxon Epidemiology Ltd.
35147,,OXON,OXON Epidemiology Ltd.
23454,,OXON,OXON Epidemiology Ltd.


In [None]:
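# pandas' default NA strings, except the ones commented out below,
# which are therefore read as literal text rather than as missing values
na_values = [
    "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan",
    "1.#IND", "1.#QNAN", "<NA>", "NULL", "NaN", "None", "nan", "null"
    # "N/A",
    # "NA",
    # "n/a",
]

matching_data = pd.read_excel(
    '../../data/eupas/centre_manual.xlsx',
    sheet_name=None,  # load all sheets as a dict of DataFrames keyed by sheet name
    keep_default_na=False,  # use only the na_values given here
    na_values=na_values,
    na_filter=True
)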
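The effect of `keep_default_na=False` combined with a custom `na_values` list can be checked on a small example. The sketch below is only an illustration: it uses `pd.read_csv` on an in-memory string instead of the Excel file, the two sample rows are hypothetical, and `custom_na` is a shortened stand-in for the list above.

```python
import io

import pandas as pd

# Hypothetical sample standing in for one sheet of the matching file
raw = "original,manual\nNA,Some Centre\nNULL,Other Centre\n"

# Shortened stand-in for the na_values list above ("NA" is deliberately absent)
custom_na = ["", "NULL", "NaN", "nan", "null"]

# Default behaviour: both "NA" and "NULL" are parsed as missing values
default = pd.read_csv(io.StringIO(raw))

# Custom behaviour: only the listed strings are parsed as missing values
custom = pd.read_csv(io.StringIO(raw), keep_default_na=False, na_values=custom_na)

print(default["original"].isna().tolist())
print(custom["original"].isna().tolist())
```

With the custom list, a cell containing the literal text `NA` survives as a string, which presumably is why those entries are commented out above.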

In [None]:
ema_rwd = (
    pd.read_csv('./data/scraped_ema-rwd_2024-02-21T22-22-05+00-00.csv')
    .set_index('eu_pas_register_number')
    .filter(regex='institution|network|funding_details')
)
ema_rwd

eu_pas_register_number,additional_institutions_encepp,additional_institutions_not_encepp,funding_details,lead_institution_encepp,lead_institution_not_encepp,networks_encepp,networks_not_encepp
1587,,,"Ministry of Science & Technology, Vall d'Hebro...",University Hospital Vall d’Hebron (HUVH),,,
3061,,,Boehringer Ingelheim,Brigham and Women's Hospital,,,
8800,,,"Best Care Consulting, MSC-INHPs",Best Care Consulting,,,MSC-INHP
3177,,Multiple centres: 999 centres are involved in ...,Daiichi Sankyo European affiliates and Daiichi...,Institut für Pharmakologie und präventive Medizin,Dr. Bramlage & Dr. Hankowitz Partnerschaft,,
9215,,,National health insurance,Institute of Oncology Slovenia,,,
...,...,...,...,...,...,...,...
2113,,Multiple centres: 9 centres involved in this s...,F. Hoffmann-La Roche Ltd,"Scientific Affairs, Outcome SARL",,,Full list available on request
2572,,,Cancer Research UK,Queen's University Belfast,,,
2221,"Centre for Maternal, Fetal and Infant Research...","University of Groningen (RUG) Netherlands, The...","FP7, Universities",University of Ulster,,European Surveillance of Congenital Anomalies ...,
1661,"Clinical Pharmacology Department, UASP Hospita...","Hospital de Sant Pau Barcelona, Hospital del M...",ICS and FIS,Fundació Institut Català de Farmacologia (FICF),,,Hepatox-TBCGroup


## Deleted and new Studies
We now analyse which studies were deleted or added between the two scrapes:

In [7]:
shared_index = eupas.index.intersection(ema_rwd.index)
print('Removed from EU PAS:', eupas.index.difference(ema_rwd.index).values)
print('New in EMA RWD:', ema_rwd.index.difference(eupas.index).values)
eupas = eupas.filter(items=shared_index, axis='index')
ema_rwd = ema_rwd.filter(items=shared_index, axis='index')

Removed from EU PAS: [12194 23583 29718 30525 35008 35763 39572 45442 47920 49370 50259]
New in EMA RWD: []
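The index set operations used here can be tried in isolation. The following is a minimal sketch with made-up register numbers and values; the names `old`, `new`, `removed`, `added` and `shared` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

# Two toy tables indexed by (made-up) register numbers
old = pd.DataFrame(
    {"centre_name": ["A", "B", "C"]},
    index=pd.Index([101, 102, 103], name="eu_pas_register_number"),
)
new = pd.DataFrame(
    {"lead_institution_encepp": ["B", "C", "D"]},
    index=pd.Index([102, 103, 104], name="eu_pas_register_number"),
)

removed = old.index.difference(new.index)   # only in the older table
added = new.index.difference(old.index)     # only in the newer table
shared = old.index.intersection(new.index)  # present in both tables

# Restrict both tables to the shared studies so they can be compared row by row
old_shared = old.loc[shared]
new_shared = new.loc[shared]

print(list(removed), list(added), list(shared))
```

Here `.loc[shared]` is used for the restriction; the notebook's `filter(items=..., axis='index')` serves the same purpose.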


## Match Centre Names
We also map the EU PAS centre names to their manually matched names using the old matching file.

In [8]:
for field_name in ['centre_name', 'centre_name_of_investigator']:
    # Left-join the manually matched name for this field from its sheet
    eupas = pd.merge(
        eupas.reset_index(),
        matching_data[field_name].loc[:, ['manual', 'original']].rename(
            columns={
                'manual': f'$MATCHED_{field_name}',
                'original': field_name
            }
        ),
        how='left',
        on=field_name,
    ).set_index('eu_pas_register_number')

match_combined_field_name = '$MATCHED_combined_centre_name'

# Join the matched names of each row; missing (non-string) values are skipped
eupas[match_combined_field_name] = (
    eupas.filter(like='$MATCHED')
    .apply(lambda x: ''.join([str(y) for y in x.values if isinstance(y, str)]), axis='columns')
)

# Rows without any match end up as empty strings; mark them as missing
eupas.loc[
    eupas[match_combined_field_name] == '',
    match_combined_field_name
] = pd.NA

# Drop the per-field helper columns and move the combined column to the front
eupas = eupas.drop(['$MATCHED_centre_name_of_investigator', '$MATCHED_centre_name'], axis='columns')
column = eupas.pop(match_combined_field_name)
eupas.insert(0, match_combined_field_name, column)

eupas

eu_pas_register_number,$MATCHED_combined_centre_name,centre_name,centre_name_of_investigator,centre_organisation
49988,Eli Lilly,Eli Lilly and Company,,
11512,Observational and Pragmatic Research Institute...,,OPRI Pte Ltd,OPRI Pte Ltd
9142,Observational and Pragmatic Research Institute...,,OPRI Pte Ltd,OPRI Pte Ltd
7678,Observational and Pragmatic Research Institute...,,OPRI Pte Ltd,OPRI Pte Ltd
104156,Observational and Pragmatic Research Institute...,,OPRI Pte Ltd,OPRI Pte Ltd
...,...,...,...,...
6023,Parexel,,PAREXEL,PAREXEL International Corporation
13040,OXON Epidemiology,,OXON,Oxon Epidemiology Ltd.
35147,OXON Epidemiology,,OXON,OXON Epidemiology Ltd.
23454,OXON Epidemiology,,OXON,OXON Epidemiology Ltd.
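A left merge against a mapping table silently duplicates study rows if the mapping lists the same original name twice. The sketch below shows one possible safeguard, the `validate` argument of `merge`; it is not part of the notebook, and the toy `studies` and `mapping` tables as well as the column name `matched_centre_name` are assumptions made for illustration.

```python
import pandas as pd

# Toy study table; one name has a mapping, one is missing, one is unmapped
studies = pd.DataFrame(
    {"centre_name": ["Eli Lilly and Company", None, "Unmapped Centre"]},
    index=pd.Index([1, 2, 3], name="eu_pas_register_number"),
)

# Toy mapping table with the same 'original'/'manual' layout as the matching sheets
mapping = pd.DataFrame(
    {"original": ["Eli Lilly and Company"], "manual": ["Eli Lilly"]}
)

matched = (
    studies.reset_index()
    .merge(
        mapping.rename(
            columns={"original": "centre_name", "manual": "matched_centre_name"}
        ),
        how="left",
        on="centre_name",
        # Raises a MergeError if an original name occurs more than once in the mapping
        validate="many_to_one",
    )
    .set_index("eu_pas_register_number")
)

print(matched)
```

Studies without a mapping entry keep their row and get a missing value in the matched column, so the number of rows stays unchanged.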


## Field changes
We can now conveniently analyse the changes to the `centre` fields between the two datasets:

In [9]:
merged = pd.merge(eupas, ema_rwd, left_index=True, right_index=True)
merged

eu_pas_register_number,$MATCHED_combined_centre_name,centre_name,centre_name_of_investigator,centre_organisation,additional_institutions_encepp,additional_institutions_not_encepp,funding_details,lead_institution_encepp,lead_institution_not_encepp,networks_encepp,networks_not_encepp
49988,Eli Lilly,Eli Lilly and Company,,,,,Eli Lilly and Company,Eli Lilly and Company,,,
11512,Observational and Pragmatic Research Institute...,,OPRI Pte Ltd,OPRI Pte Ltd,,,Teva,Observational & Pragmatic Research Institute P...,Foundation for the Promotion of Health and Bio...,,
9142,Observational and Pragmatic Research Institute...,,OPRI Pte Ltd,OPRI Pte Ltd,,,Chiesi Ltd,Observational & Pragmatic Research Institute P...,Biogen,,
7678,Observational and Pragmatic Research Institute...,,OPRI Pte Ltd,OPRI Pte Ltd,,,TEVA Ltd,Observational & Pragmatic Research Institute P...,Baxter Healthcare Corporation,,
104156,Observational and Pragmatic Research Institute...,,OPRI Pte Ltd,OPRI Pte Ltd,,,Astra Zeneca,Observational & Pragmatic Research Institute P...,,,
...,...,...,...,...,...,...,...,...,...,...,...
6023,Parexel,,PAREXEL,PAREXEL International Corporation,,,Viatris Specialty LLC,Parexel International,,,
13040,OXON Epidemiology,,OXON,Oxon Epidemiology Ltd.,,,Medical Developments International,OXON Epidemiology,,NIHR Medicines for Children Research Network,
35147,OXON Epidemiology,,OXON,OXON Epidemiology Ltd.,,,"GSK, OXON Epidemiology",OXON Epidemiology,,,
23454,OXON Epidemiology,,OXON,OXON Epidemiology Ltd.,"OXON Epidemiology; Pharmacoepidemiology Group,...",,No funding,OXON Epidemiology,,,
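One simple way to get a first overview of such a merged table is to flag the rows where an old and a new column disagree. The sketch below is not the analysis performed in this notebook; it compares the matched centre name with `lead_institution_encepp` for three rows copied from the output above, and the helper `normalise` and the column `changed` are hypothetical.

```python
import pandas as pd

# Three rows taken from the merged output above
sample = pd.DataFrame(
    {
        "$MATCHED_combined_centre_name": ["OXON Epidemiology", "Parexel", "Eli Lilly"],
        "lead_institution_encepp": [
            "OXON Epidemiology",
            "Parexel International",
            "Eli Lilly and Company",
        ],
    },
    index=pd.Index([13040, 6023, 49988], name="eu_pas_register_number"),
)


def normalise(names: pd.Series) -> pd.Series:
    """Case-fold and trim names so trivial differences are ignored."""
    return names.str.casefold().str.strip()


# True where the normalised names differ (missing values also count as different)
sample["changed"] = normalise(sample["$MATCHED_combined_centre_name"]).ne(
    normalise(sample["lead_institution_encepp"])
)

print(sample["changed"])
```

An exact comparison like this flags every variation in spelling, so rows such as `Parexel` versus `Parexel International` would still need manual review.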


We can also export the data with the following code for easy access:

In [10]:
merged.to_excel('compare_sponsors.xlsx')