# Converting exported and scraped data into a harmonised format for comparison
This notebook provides the code to convert exported `.csv` files from the [HMA-EMA Catalogue of real-world data studies](https://catalogues.ema.europa.eu/search?f%5B0%5D=content_type%3Adarwin_study) and compare them to the scraped `.csv` files (from the same database).

<small>**NOTE:** We used this notebook to identify differences between our own extraction method and EMA's method.</small>

<small>These are the biggest differences:
1. EMA's `.csv` contains **document names**, whereas our version contains the **document URLs**
    + Interestingly, it is not possible to construct the URL from the document name, even though the URLs initially used a simple URL-quoted version of the name
        + This is demonstrated in the `.ipynb` used for document download
    + This is because some URLs contained extra spaces and other small changes
    + The document URLs now also use different base URLs
1. EMA's `.csv` had a bug that cut off text between `<` and `>` (this is likely an HTML/XML parsing error)
1. EMA's `.csv` used `, ` as a delimiter, which causes problems for text that already contains this delimiter
</small>
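
To illustrate the first point, a URL-quoted document name can be produced with the standard library. This is only a sketch: the base URL is a hypothetical placeholder rather than the catalogue's real file path, the `.pdf` extension is assumed, and `guess_document_url` is a name introduced here for illustration. As noted above, such a guess did not reliably match the actual document URLs.

```python
from urllib.parse import quote

# Hypothetical base URL, for illustration only (not the catalogue's real path)
BASE_URL = 'https://example.org/files/'


def guess_document_url(document_name: str) -> str:
    """Build a candidate URL by percent-encoding the document name."""
    return BASE_URL + quote(document_name)


# Document name modelled on an entry in the exported data; the extension is assumed
candidate = guess_document_url('Annex5_DoIForm_EPID Research_1_20170712.pdf')
print(candidate)
```

Any extra space or renamed file on the server side breaks this reconstruction, which is why the scraped version stores the URL directly.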

First we import `pandas`:

In [4]:
import pandas as pd

## Reading, cleaning and transforming exported data

In [None]:
exported = pd.read_csv('./data/exported_ema-rwd_2024-02-21T23-20-00+00-00.csv')
exported

Unnamed: 0,Title,URL LINK,Created,First published,Updated,Published,Moderation,PURI,EU PAS number,Study ID,...,Data characterisation moment,Data characterisation details,Data characterisation details.1,Data characterisation details.2,Data characterisation results (file),Data characterisation results (link),Procedure of data extraction,Procedure of data extraction.1,Procedure of results generation,Procedure of results generation.1
0,An Observational Post-Authorization Safety Stu...,https://catalogues.ema.europa.eu/study/48839,12/02/2024 - 17:31,"Wed, 26/01/2022 - 13:00",21/02/2024 - 20:38,Published,published,https://redirect.ema.europa.eu/resource/48839,EUPAS45362,48839,...,,,,,,,,,,
1,Effectiveness of the Novavax COVID-19 Vaccine ...,https://catalogues.ema.europa.eu/study/105522,12/02/2024 - 17:29,"Mon, 03/07/2023 - 14:00",21/02/2024 - 17:29,Published,published,https://redirect.ema.europa.eu/resource/105522,EUPAS105521,105522,...,,,,,,,,,,
2,Consequences for life of children with in uter...,https://catalogues.ema.europa.eu/study/47386,12/02/2024 - 17:28,"Thu, 13/07/2017 - 14:00",21/02/2024 - 16:30,Published,published,https://redirect.ema.europa.eu/resource/47386,EUPAS19686,47386,...,,,,,,,,,,
3,Determining the prevalence of severe asthma in...,https://catalogues.ema.europa.eu/study/50651,12/02/2024 - 17:35,"Wed, 01/02/2023 - 13:00",21/02/2024 - 16:25,Published,published,https://redirect.ema.europa.eu/resource/50651,EUPAS50650,50651,...,,,,,,,,,,
4,Incidence of oral thrush in COPD patients pres...,https://catalogues.ema.europa.eu/study/14064,12/02/2024 - 17:09,"Fri, 11/03/2016 - 13:00",21/02/2024 - 16:13,Published,published,https://redirect.ema.europa.eu/resource/14064,EUPAS12762,14064,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2755,Exposure to beta-blockers and survival in brea...,https://catalogues.ema.europa.eu/study/2573,12/02/2024 - 17:08,"Wed, 25/04/2012 - 14:00",25/04/2012 - 12:00,Published,published,https://redirect.ema.europa.eu/resource/2573,EUPAS2572,2573,...,,,,,,,,,,
2756,EUROmediCAT: Safety of Medication Use in Pregn...,https://catalogues.ema.europa.eu/study/2222,12/02/2024 - 17:08,"Sun, 29/01/2012 - 13:00",29/01/2012 - 12:00,Published,published,https://redirect.ema.europa.eu/resource/2222,EUPAS2221,2222,...,,,,,,,,,,
2757,Burden of AV GRAft and Fistula Complications i...,https://catalogues.ema.europa.eu/study/2193,12/02/2024 - 17:08,"Fri, 23/09/2011 - 14:00",23/09/2011 - 12:00,Published,published,https://redirect.ema.europa.eu/resource/2193,EUPAS2192,2193,...,,,,,,,,,,
2758,CHARACTERISATION OF PROTEOMIC PROFILES PREDICT...,https://catalogues.ema.europa.eu/study/1662,12/02/2024 - 17:08,"Fri, 29/10/2010 - 14:00",29/10/2010 - 12:00,Published,published,https://redirect.ema.europa.eu/resource/1662,EUPAS1661,1662,...,,,,,,,,,,


We can list all column names with the last line of the code below. Note that `data_source_types` and `data_source_types_other` are not exported by the tool as of $\texttt{21-02-2024}$.

In [None]:
# NOTE: Not included
# data_source_types
# data_source_types_other
print('\n'.join(exported.columns.tolist()))

Title
URL LINK
Created
First published
Updated
Published
Moderation
PURI
EU PAS number
Study ID
DARWIN EU® study
Study countries
Study description
Study status
Institution conducting the study
Institution conducting the study if not in the list
Additional institutions
Additional institutions if not in the list
Networks conducting the study
Additional networks if not in the list
Main contact: First name
Main contact: Last name
Alternate contact: First name
Alternate contact: Last name
Alternate contact: ORCID number
ENCePP partner
Date when funding contract was signed (Planned)
Date when funding contract was signed (Actual)
Study start date (Planned)
Study start date (Actual)
Data analysis start date (Planned)
Data analysis start date (Actual)
Date of interim report, if expected (Planned)
Date of interim report, if expected (Actual)
Date of final study report (Planned)
Date of final study report (Actual)
Source of funding
More details on source of funding
Protocol
Protocol.1
Was the stu

There were no protocol, result or other document URLs at the time of extraction.

In [None]:
exported[['Protocol.1', 'Study report.1', 'Study, other information.1']].value_counts()

Series([], Name: count, dtype: int64)

There were no Summary Results at the time of extraction.

In [None]:
exported['Summary results'].value_counts()

Series([], Name: count, dtype: int64)

There were no alternative study IDs either.

In [None]:
exported[['Study ID, other', 'Study ID, other.1']].value_counts()

Series([], Name: count, dtype: int64)
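
One caveat when reading these checks: `DataFrame.value_counts` drops rows that contain any missing value by default, so an empty result for several columns shows only that no row has *all* of them filled. The toy frame below (invented data, hypothetical column names) sketches the difference and shows a per-column alternative using `notna().sum()`.

```python
import pandas as pd

# Toy frame: each column holds one value, but no row has both columns filled
toy = pd.DataFrame({'doc_a': ['x', None], 'doc_b': [None, 'y']})

# Rows with any missing value are dropped by default, so this result is empty
combined_counts = toy[['doc_a', 'doc_b']].value_counts()

# Per-column count of non-missing values
filled_per_column = toy[['doc_a', 'doc_b']].notna().sum()

print(len(combined_counts))
print(filled_per_column.to_dict())
```

For a single column, as in the `Summary results` check, `value_counts()` on the Series is sufficient.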

Now we can copy the relevant parts of this list to create a rename mapping for the columns:

In [8]:
rename_cols = {
 'Title': 'title',
 'URL LINK': 'url',
 'First published': 'registration_date',
 'Updated': 'update_date',
 'PURI': 'puri',
 'EU PAS number': 'eu_pas_register_number',
 'Study countries': 'countries',
 'Study description': 'description',
 'Study status': 'state',
 'Institution conducting the study': 'lead_institution_encepp',
 'Institution conducting the study if not in the list': 'lead_institution_not_encepp',
 'Additional institutions': 'additional_institutions_encepp',
 'Additional institutions if not in the list': 'additional_institutions_not_encepp',
 'Networks conducting the study': 'networks_encepp',
 'Additional networks if not in the list': 'networks_not_encepp',
 'Date when funding contract was signed (Planned)': 'funding_contract_date_planed',
 'Date when funding contract was signed (Actual)': 'funding_contract_date_actual',
 'Study start date (Planned)': 'data_collection_date_planed',
 'Study start date (Actual)': 'data_collection_date_actual',
 'Data analysis start date (Planned)': 'data_analysis_date_planed',
 'Data analysis start date (Actual)': 'data_analysis_date_actual',
 'Date of interim report, if expected (Planned)': 'iterim_report_date_planed',
 'Date of interim report, if expected (Actual)': 'iterim_report_date_actual',
 'Date of final study report (Planned)': 'final_report_date_planed',
 'Date of final study report (Actual)': 'final_report_date_actual',
 'Source of funding': 'funding_sources',
 'More details on source of funding': 'funding_details',
 'Protocol': 'protocol_document_name',
 'Was the study required by a regulatory body?': 'requested_by_regulator',
 'Is the study required by a Risk Management Plan (RMP)?': 'risk_management_plan',
 'Regulatory procedure number': 'regulatory_procedure_number',
 'Study topic': 'study_topic',
 'Study topic, other': 'study_topic_other',
 'Study type': 'study_type',
 'If ‘Not applicable’, further details on the study type': 'study_type_other',
 'Scope of the study': 'non_interventional_scopes',
 'If ‘other’, further details on the scope of the study': 'non_interventional_scopes_other',
 'Non-interventional study design': 'non_interventional_study_design',
 'Non-interventional study design, other': 'non_interventional_study_design_other',
 'Name of medicine': 'substance_brand_name',
 'Name of medicine, other': 'substance_brand_name_other',
 'Study drug International non-proprietary name (INN) or common name': 'substance_inn',
 'Anatomical Therapeutic Chemical (ATC) code': 'substance_atc',
 'Medicinal condition to be studied': 'medical_conditions',
 'Additional medical condition(s)': 'additional_medical_conditions',
 'Population age groups': 'age_population',
 'Special population of interest': 'special_population',
 'Special population of interest, other' : 'special_population_other',
 'Estimated number of subjects': 'number_of_subjects',
 'Outcomes': 'outcomes',
 'Results tables': 'result_tables_name',
 'Study report': 'result_document_name',
 'Study, other information': 'other_documents_name',
 'Study publications': 'references',
 'Data source(s) ': 'data_sources_registered_with_encepp',
 'Other linked data sources ': 'data_sources_not_registered_with_encepp',
 'Check conformance': 'check_conformance',
 'Check completeness': 'check_completeness',
 'Check stability': 'check_stability',
 'Check logical consistency': 'check_logical_consistency',
 'Data characterisation conducted': 'conducted_data_characterisation'
}

We can now rename the columns and drop the columns missing in the rename map:

In [None]:
harmonised_exported = exported.filter(items=rename_cols.keys()).rename(columns=rename_cols)
harmonised_exported

Unnamed: 0,title,url,registration_date,update_date,puri,eu_pas_register_number,countries,description,state,lead_institution_encepp,...,result_document_name,other_documents_name,references,data_sources_registered_with_encepp,data_sources_not_registered_with_encepp,check_conformance,check_completeness,check_stability,check_logical_consistency,conducted_data_characterisation
0,An Observational Post-Authorization Safety Stu...,https://catalogues.ema.europa.eu/study/48839,"Wed, 26/01/2022 - 13:00",21/02/2024 - 20:38,https://redirect.ema.europa.eu/resource/48839,EUPAS45362,"Italy, Netherlands, Norway, Spain, United Kingdom","Observational Study, Retrospective observation...",Ongoing,University Medical Center Utrecht (UMCU),...,,,,"Clinical Practice Research Datalink, The Infor...","HSD Italy, The Norwegian Health register Norwa...",Unknown,Unknown,Unknown,Unknown,No
1,Effectiveness of the Novavax COVID-19 Vaccine ...,https://catalogues.ema.europa.eu/study/105522,"Mon, 03/07/2023 - 14:00",21/02/2024 - 17:29,https://redirect.ema.europa.eu/resource/105522,EUPAS105521,United States,Observational retrospective comparative effect...,Ongoing,Aetion,...,,,,,HealthVerity United States,Unknown,Unknown,Unknown,Unknown,No
2,Consequences for life of children with in uter...,https://catalogues.ema.europa.eu/study/47386,"Thu, 13/07/2017 - 14:00",21/02/2024 - 16:30,https://redirect.ema.europa.eu/resource/47386,EUPAS19686,Finland,Metformin is used during pregnancy to treat hy...,Finalised,IQVIA,...,,"Annex5_DoIForm_EPID Research_1_20170712, Annex...","https://doi.org/10.1136/bmjdrc-2021-002363, ht...",,Drugs and Pregnancy Finland,Unknown,Unknown,Unknown,Unknown,No
3,Determining the prevalence of severe asthma in...,https://catalogues.ema.europa.eu/study/50651,"Wed, 01/02/2023 - 13:00",21/02/2024 - 16:25,https://redirect.ema.europa.eu/resource/50651,EUPAS50650,United Kingdom,It is a retrospective epidemiological database...,Ongoing,Respiratory Effectiveness Group,...,,,,Optimum Patient Care Research Database,,Unknown,Unknown,Unknown,Unknown,No
4,Incidence of oral thrush in COPD patients pres...,https://catalogues.ema.europa.eu/study/14064,"Fri, 11/03/2016 - 13:00",21/02/2024 - 16:13,https://redirect.ema.europa.eu/resource/14064,EUPAS12762,United Kingdom,"A historical cohort, UK database study in pati...",Finalised,Observational & Pragmatic Research Institute P...,...,,,,Optimum Patient Care Research Database,,Unknown,Unknown,Unknown,Unknown,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2755,Exposure to beta-blockers and survival in brea...,https://catalogues.ema.europa.eu/study/2573,"Wed, 25/04/2012 - 14:00",25/04/2012 - 12:00,https://redirect.ema.europa.eu/resource/2573,EUPAS2572,United Kingdom,New therapeutic strategies are needed to reduc...,Ongoing,Queen's University Belfast,...,,,,Clinical Practice Research Datalink,,Unknown,Unknown,Unknown,Unknown,No
2756,EUROmediCAT: Safety of Medication Use in Pregn...,https://catalogues.ema.europa.eu/study/2222,"Sun, 29/01/2012 - 13:00",29/01/2012 - 12:00,https://redirect.ema.europa.eu/resource/2222,EUPAS2221,"United Kingdom, Switzerland, Poland, Norway, N...",A variety of complementary approaches are need...,Ongoing,University of Ulster,...,,All 17 Research Centres involved in Study (Q2)...,,"Clinical Practice Research Datalink, IADB.nl",Emilia Romagna GPs drug prescription,Unknown,Unknown,Unknown,Unknown,No
2757,Burden of AV GRAft and Fistula Complications i...,https://catalogues.ema.europa.eu/study/2193,"Fri, 23/09/2011 - 14:00",23/09/2011 - 12:00,https://redirect.ema.europa.eu/resource/2193,EUPAS2192,"United States, United Kingdom, Spain, Germany",The purpose of the study is to assess the heal...,Ongoing,Zentrum für Nieren- und Hochdruckkrankheiten,...,,,,,,Unknown,Unknown,Unknown,Unknown,No
2758,CHARACTERISATION OF PROTEOMIC PROFILES PREDICT...,https://catalogues.ema.europa.eu/study/1662,"Fri, 29/10/2010 - 14:00",29/10/2010 - 12:00,https://redirect.ema.europa.eu/resource/1662,EUPAS1661,Spain,Tuberculosis represents nowadays one of the ma...,Ongoing,Fundació Institut Català de Farmacologia (FICF),...,,,,,,Unknown,Unknown,Unknown,Unknown,No
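
The filter-then-rename pattern used above can be sketched on a toy frame. The column values and the `toy_rename` mapping are invented for illustration; only the pandas calls mirror the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    'Title': ['Study A'],
    'URL LINK': ['https://example.org/study/1'],
    'Moderation': ['published'],  # not in the mapping, so it is dropped
})

# 'Not exported' does not exist in the frame and is silently ignored by filter()
toy_rename = {'Title': 'title', 'URL LINK': 'url', 'Not exported': 'missing'}

# filter() keeps only the listed columns that exist; rename() then maps the names
toy_harmonised = toy.filter(items=list(toy_rename)).rename(columns=toy_rename)
print(toy_harmonised.columns.tolist())
```

Because `filter` ignores missing labels, a typo in a mapping key does not raise an error; comparing `len(rename_cols)` with the number of resulting columns is a cheap sanity check.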


At this point, the exported data uses the same naming conventions as the scraped data.

In [None]:
cleaned = harmonised_exported.copy()

We will now transform the columns.

* `eu_pas_register_number` will be the new numeric index

* All dates should be converted into date objects

* Array-like string fields will be split on the delimiter `, `, sorted, and re-joined with the delimiter `; ` used in the scraped data

In [10]:
cleaned['eu_pas_register_number'] = cleaned['eu_pas_register_number'].str[5:].astype(int)
cleaned = cleaned.set_index('eu_pas_register_number').sort_index()

In [11]:
date_cols = cleaned.columns[cleaned.columns.str.contains('date')]
for col in date_cols:
    cleaned[col] = pd.to_datetime(cleaned[col], format='%d/%m/%Y', exact=False)

cleaned.filter(like='date')

eu_pas_register_number,registration_date,update_date,funding_contract_date_planed,funding_contract_date_actual,data_collection_date_planed,data_collection_date_actual,data_analysis_date_planed,data_analysis_date_actual,iterim_report_date_planed,iterim_report_date_actual,final_report_date_planed,final_report_date_actual
1578,2010-10-27,2015-03-30,2010-06-01,2010-06-16,2010-09-15,2010-09-15,2012-01-01,2012-01-02,2011-12-31,2011-12-31,2013-06-01,2014-06-01
1587,2010-10-25,2010-10-25,NaT,2002-10-01,NaT,2002-12-02,NaT,NaT,NaT,NaT,NaT,2003-11-11
1591,2010-10-06,2016-09-09,NaT,2010-08-13,NaT,2010-08-13,NaT,NaT,NaT,NaT,NaT,2010-08-23
1594,2010-10-20,2016-07-19,NaT,2006-12-01,NaT,2007-03-05,NaT,NaT,NaT,NaT,NaT,2010-06-01
1597,2010-10-26,2020-08-24,2010-04-01,2010-05-14,2010-11-22,2010-12-01,NaT,2019-10-02,NaT,NaT,2020-03-31,2020-08-19
...,...,...,...,...,...,...,...,...,...,...,...,...
108481,2024-01-18,2024-01-18,NaT,2022-08-16,NaT,2023-09-22,2024-06-11,NaT,NaT,NaT,2024-08-31,NaT
108728,2024-01-16,2024-01-16,2022-03-15,2022-03-15,2025-05-31,NaT,NaT,NaT,NaT,NaT,2032-03-31,NaT
108847,2024-01-22,2024-01-22,2024-01-31,NaT,2024-03-31,NaT,NaT,NaT,NaT,NaT,2026-04-30,NaT
108904,2024-01-22,2024-02-17,2024-02-29,NaT,2024-02-29,NaT,NaT,NaT,NaT,NaT,2034-12-31,NaT
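
The date columns in the export come in slightly different layouts (with or without a weekday prefix, always with a trailing time). A minimal sketch of why `exact=False` is used, with sample strings modelled on the export: the format only needs to match somewhere in the string, so the surrounding text is ignored and the time of day is discarded.

```python
import pandas as pd

raw_dates = pd.Series(['Wed, 26/01/2022 - 13:00', '21/02/2024 - 20:38', None])

# exact=False allows the format to match anywhere in the string
parsed = pd.to_datetime(raw_dates, format='%d/%m/%Y', exact=False)
print(parsed)
```

Missing values are returned as `NaT`, which matches the `NaT` entries in the table above.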


In [12]:
# We will handle 'countries' separately to fix the problem with the ', ' delimiter appearing in country names
array_fields = ['additional_institutions_encepp', 'age_population', 'data_sources_registered_with_encepp', 
                'funding_sources', 'medical_conditions', 'networks_encepp', 'non_interventional_scopes', 
                'non_interventional_study_design', 'other_documents_name', 'references', 'special_population',
                'study_topic', 'substance_atc', 'substance_brand_name', 'substance_inn']
# Split each field on ', ', sort the values and re-join them with '; ' (missing values are left untouched)
for field in array_fields:
    cleaned[field] = cleaned[field].str.split(', ').apply(lambda x: sorted(x) if isinstance(x, list) else x).str.join('; ')

cleaned.filter(items=array_fields)

eu_pas_register_number,additional_institutions_encepp,age_population,data_sources_registered_with_encepp,funding_sources,medical_conditions,networks_encepp,non_interventional_scopes,non_interventional_study_design,other_documents_name,references,special_population,study_topic,substance_atc,substance_brand_name,substance_inn
1578,Emilia-Romagna Health and Social Agency (ASSR ...,Adults (46 to < 65 years); Adults (65 to < 75 ...,,Other,Chronic obstructive pulmonary disease,,Assessment of risk minimisation measure implem...,Cohort,,https://doi.org/10.3109/15412555.2013.839646,Other,Disease /health condition; Human medicinal pro...,(H02AB) Glucocorticoids; (J01) ANTIBACTERIALS ...,,
1587,,Adults (18 to < 46 years); Adults (46 to < 65 ...,,Other,Crohn's disease; Inflammatory bowel disease,,Other,Cross-sectional,,https://doi.org/10.1097/00054725-200407000-000...,Other,Disease /health condition,,,
1591,,Adults (18 to < 46 years); Adults (46 to < 65 ...,THIN® (The Health Improvement Network®),EMA,,,Other,Cohort,,,,Human medicinal product,,,ROSIGLITAZONE
1594,,Adolescents (12 to < 18 years); Adults (18 to ...,,EMA,,,Assessment of risk minimisation measure implem...,Cohort; Cross-sectional,,https://doi.org/10.2165/11534410-000000000-00000,Hepatic impaired; Immunocompromised; Pregnant ...,Other,,,
1597,,Adolescents (12 to < 18 years); Adults (18 to ...,,Pharmaceutical company and other private sector,Acute myocardial infarction; Ischaemic stroke;...,,Assessment of risk minimisation measure implem...,Cohort; Other,,,Women of childbearing potential not using cont...,Disease /health condition; Human medicinal pro...,,,DROSPIRENONE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
108481,,Adults (18 to < 46 years); Adults (46 to < 65 ...,,Pharmaceutical company and other private sector,,,Effectiveness study (incl. comparative),Other,,,,,(J02AC04) posaconazole,,
108728,,Adults (18 to < 46 years); Adults (46 to < 65 ...,Danish registries (access/analysis),Pharmaceutical company and other private sector,Systemic lupus erythematosus,,Assessment of risk minimisation measure implem...,Cohort,,,Pregnant women,,(L04AA51) anifrolumab,,
108847,University Medical Center Utrecht (UMCU),Adolescents (12 to < 18 years); Adults (18 to ...,Clinical Practice Research Datalink; EpiChron ...,Pharmaceutical company and other private sector,Adverse event following immunisation,Vaccine monitoring Collaboration for Europe (V...,Assessment of risk minimisation measure implem...,Cohort; Other,,,Immunocompromised; Pregnant women,,(J07BX) Other viral vaccines,,
108904,,Adults (18 to < 46 years); Adults (46 to < 65 ...,,Pharmaceutical company and other private sector,,,Assessment of risk minimisation measure implem...,Cohort,,,Hepatic impaired; Pregnant women,,,,
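
The split, sort and join step can be followed on a small example. The values are invented; the chain of calls is the same as in the loop above.

```python
import pandas as pd

toy_field = pd.Series(['Cohort, Case-control', 'Other', None])

normalised = (
    toy_field.str.split(', ')
    # Sort only real lists; missing values pass through unchanged
    .apply(lambda items: sorted(items) if isinstance(items, list) else items)
    .str.join('; ')
)
print(normalised.tolist())
```

Sorting makes the result independent of the order in which the export lists the values, so it can be compared directly with the scraped data.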


## Finding unique dummy values
We can use the code below to find all possible values for a field. These should be the same as in this [document](https://catalogues.ema.europa.eu/system/files/2024-01/Study_Questionnaire_Offline.pdf), but this is not the case as of 21-02-2024.
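
As a small illustration with invented values: `str.get_dummies` creates one indicator column per distinct value, so the column labels are exactly the set of unique values in the field.

```python
import pandas as pd

toy_funding = pd.Series(['EMA; Other', 'Other', None])

# One indicator column per distinct value; the labels are the unique values
unique_values = toy_funding.str.get_dummies('; ').columns.tolist()
print(unique_values)
```

Printing the labels as a Python list also exposes invisible characters, such as the `\xa0` (non-breaking space) that appears in some of the values below.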

In [13]:
cleaned['funding_sources'].str.get_dummies('; ').columns.tolist()

['EMA',
 'EU institutional research programme',
 'No external funding',
 'Non for-profit organisation (e.g. charity)',
 'Other',
 'Pharmaceutical company and other private sector\xa0']

In [14]:
cleaned['study_topic'].str.get_dummies('; ').columns.tolist()

['Disease\xa0/health condition',
 'Human medicinal product',
 'Medical device',
 'Medical procedure',
 'Other']

In [15]:
cleaned['non_interventional_scopes'].str.get_dummies('; ').columns.tolist()

['Assessment of risk minimisation measure implementation or effectiveness',
 'Disease epidemiology',
 'Drug utilisation',
 'Effectiveness study (incl. comparative)',
 'Feasibility analysis',
 'Healthcare resource utilisation',
 'Method development or testing',
 'Other',
 'Patient reported outcomes',
 'Safety study (incl. comparative)',
 'Scoping review (including literature review)',
 'Validation of study variables (exposure outcome covariate)']

In [16]:
cleaned['non_interventional_study_design'].str.get_dummies('; ').columns.tolist()

['Case-control',
 'Case-only',
 'Cluster design',
 'Cohort',
 'Cross-sectional',
 'Ecological',
 'Other',
 'Systematic review and meta-analysis']

In [17]:
cleaned['age_population'].str.get_dummies('; ').columns.tolist()

['Adolescents (12 to < 18 years)',
 'Adults (18 to < 46 years)',
 'Adults (46 to < 65 years)',
 'Adults (65 to < 75 years)',
 'Adults (75 to < 85 years)',
 'Adults (85 years and over)',
 'Children (2 to < 12 years)',
 'Elderly (≥ 65 years)',
 'Infants and toddlers (28 days – 23 months)',
 'Paediatric Population (< 18 years)',
 'Preterm newborn infants (0 – 27 days)',
 'Term newborn infants (0 – 27 days)']

In [18]:
cleaned['special_population'].str.get_dummies('; ').columns.tolist()

['Frail population',
 'Hepatic impaired',
 'Immunocompromised',
 'Other',
 'Pregnant women',
 'Renal impaired',
 'Women of childbearing potential not using contraception',
 'Women of childbearing potential using contraception']

Here is the final DataFrame:

In [20]:
cleaned.sort_index(axis='columns')

eu_pas_register_number,additional_institutions_encepp,additional_institutions_not_encepp,additional_medical_conditions,age_population,check_completeness,check_conformance,check_logical_consistency,check_stability,conducted_data_characterisation,countries,...,study_topic_other,study_type,study_type_other,substance_atc,substance_brand_name,substance_brand_name_other,substance_inn,title,update_date,url
1578,Emilia-Romagna Health and Social Agency (ASSR ...,Epidemiologic Observatory of the Health Direct...,,Adults (46 to < 65 years); Adults (65 to < 75 ...,Unknown,Unknown,Unknown,Unknown,No,Italy,...,,Non-interventional study,,(H02AB) Glucocorticoids; (J01) ANTIBACTERIALS ...,,,,Long-term outcomes and adverse events of thera...,2015-03-30,https://catalogues.ema.europa.eu/study/9133
1587,,,,Adults (18 to < 46 years); Adults (46 to < 65 ...,Unknown,Unknown,Unknown,Unknown,No,Spain,...,,Non-interventional study,,,,,,DEVELOPMENT AND VALIDATION OF A SHORTENED VERS...,2010-10-25,https://catalogues.ema.europa.eu/study/1588
1591,,,,Adults (18 to < 46 years); Adults (46 to < 65 ...,Unknown,Unknown,Unknown,Unknown,No,United Kingdom,...,,Non-interventional study,,,,,ROSIGLITAZONE,Cardiac profile of patients using rosiglitazon...,2016-09-09,https://catalogues.ema.europa.eu/study/15148
1594,,,,Adolescents (12 to < 18 years); Adults (18 to ...,Unknown,Unknown,Unknown,Unknown,No,United Kingdom,...,Signal detection procedure,Non-interventional study,,,,,,Validation of statistical signal detection pro...,2016-07-19,https://catalogues.ema.europa.eu/study/14160
1597,,,,Adolescents (12 to < 18 years); Adults (18 to ...,Unknown,Unknown,Unknown,Unknown,No,"United States, Ukraine, Russian Federation, Ca...",...,,Non-interventional study,,,,,DROSPIRENONE,International Active Surveillance study - Fola...,2020-08-24,https://catalogues.ema.europa.eu/study/36862
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
108481,,"10 centres, Merck Investigational Site, China",Invasive Aspergillosis,Adults (18 to < 46 years); Adults (46 to < 65 ...,Unknown,Unknown,Unknown,Unknown,No,China,...,,Non-interventional study,,(J02AC04) posaconazole,,,,Post Marketing Surveillance of Effectiveness (...,2024-01-18,https://catalogues.ema.europa.eu/study/199010
108728,,,,Adults (18 to < 46 years); Adults (46 to < 65 ...,Unknown,Unknown,Unknown,Unknown,No,"United States, Germany, France, Finland, Denmark",...,,Non-interventional study,,(L04AA51) anifrolumab,,,,A Non-Interventional Multi-Database Post-Autho...,2024-01-16,https://catalogues.ema.europa.eu/study/199011
108847,University Medical Center Utrecht (UMCU),"Teamit Institute S.L.,Fondaziione Penta ONLUS",,Adolescents (12 to < 18 years); Adults (18 to ...,Unknown,Unknown,Unknown,Unknown,No,"United Kingdom, Spain, Norway, Netherlands, Italy",...,,Non-interventional study,,(J07BX) Other viral vaccines,,,,Post-Authorisation Safety Study of Comirnaty O...,2024-01-22,https://catalogues.ema.europa.eu/study/199012
108904,,Multiple centres: 41 centres involved in the s...,"Transthyretin-mediated amyloidosis, Hereditary...",Adults (18 to < 46 years); Adults (46 to < 65 ...,Unknown,Unknown,Unknown,Unknown,No,"Brazil, Bulgaria, Denmark, France, Germany, Is...",...,,Non-interventional study,,,,,,Prospective Observational Study to Monitor and...,2024-02-17,https://catalogues.ema.europa.eu/study/199013


Now we will load the scraped data:

In [21]:
scraped = pd.read_csv('./data/scraped_ema-rwd_2024-02-21T22-22-05+00-00.csv').set_index('eu_pas_register_number').sort_index()
scraped

eu_pas_register_number,additional_institutions_encepp,additional_institutions_not_encepp,additional_medical_conditions,age_population,check_completeness,check_conformance,check_logical_consistency,check_stability,conducted_data_characterisation,countries,...,study_topic_other,study_type,study_type_other,substance_atc,substance_brand_name,substance_brand_name_other,substance_inn,title,update_date,url
1578,Emilia-Romagna Health and Social Agency (ASSR ...,Epidemiologic Observatory of the Health Direct...,,Adults (46 to < 65 years); Adults (65 to < 75 ...,Unknown,Unknown,Unknown,Unknown,No,Italy,...,,Non-interventional study,,ANTIBACTERIALS FOR SYSTEMIC USE; DRUGS FOR OBS...,,,,Long-term outcomes and adverse events of thera...,2015-03-30,https://catalogues.ema.europa.eu/node/1221
1587,,,,Adults (18 to < 46 years); Adults (46 to < 65 ...,Unknown,Unknown,Unknown,Unknown,No,Spain,...,,Non-interventional study,,,,,,DEVELOPMENT AND VALIDATION OF A SHORTENED VERS...,2010-10-25,https://catalogues.ema.europa.eu/node/1146
1591,,,,Adults (18 to < 46 years); Adults (46 to < 65 ...,Unknown,Unknown,Unknown,Unknown,No,United Kingdom,...,,Non-interventional study,,,,,ROSIGLITAZONE,Cardiac profile of patients using rosiglitazon...,2016-09-09,https://catalogues.ema.europa.eu/node/1330
1594,,,,Adolescents (12 to < 18 years); Adults (18 to ...,Unknown,Unknown,Unknown,Unknown,No,United Kingdom,...,Signal detection procedure,Non-interventional study,,,,,,Validation of statistical signal detection pro...,2016-07-19,https://catalogues.ema.europa.eu/node/1304
1597,,,,Adolescents (12 to < 18 years); Adults (18 to ...,Unknown,Unknown,Unknown,Unknown,No,Canada; Russian Federation; Ukraine; United St...,...,,Non-interventional study,,,,,DROSPIRENONE,International Active Surveillance study - Fola...,2020-08-24,https://catalogues.ema.europa.eu/node/2359
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
108481,,"10 centres, Merck Investigational Site, China",Invasive Aspergillosis,Adults (18 to < 46 years); Adults (46 to < 65 ...,Unknown,Unknown,Unknown,Unknown,No,China,...,,Non-interventional study,,posaconazole,,,,Post Marketing Surveillance of Effectiveness (...,2024-01-18,https://catalogues.ema.europa.eu/node/3895
108728,,,,Adults (18 to < 46 years); Adults (46 to < 65 ...,Unknown,Unknown,Unknown,Unknown,No,Denmark; Finland; France; Germany; United States,...,,Non-interventional study,,anifrolumab,,,,A Non-Interventional Multi-Database Post-Autho...,2024-01-16,https://catalogues.ema.europa.eu/node/3896
108847,University Medical Center Utrecht (UMCU),"Teamit Institute S.L.,Fondaziione Penta ONLUS",,Adolescents (12 to < 18 years); Adults (18 to ...,Unknown,Unknown,Unknown,Unknown,No,Italy; Netherlands; Norway; Spain; United Kingdom,...,,Non-interventional study,,Other viral vaccines,,,,Post-Authorisation Safety Study of Comirnaty O...,2024-01-22,https://catalogues.ema.europa.eu/node/3897
108904,,Multiple centres: 41 centres involved in the s...,"Transthyretin-mediated amyloidosis, Hereditary...",Adults (18 to < 46 years); Adults (46 to < 65 ...,Unknown,Unknown,Unknown,Unknown,No,Brazil; Bulgaria; Denmark; France; Germany; Is...,...,,Non-interventional study,,,,,,Prospective Observational Study to Monitor and...,2024-02-17,https://catalogues.ema.europa.eu/node/3898


We can now fix the `countries` field. First we get the list of all country values containing a comma.

In [22]:
comma_countries = [c for c in scraped['countries'].str.get_dummies('; ').columns.values if ',' in c]
comma_countries

['Iran, Islamic Republic of',
 "Korea, Democratic People's Republic of",
 'Korea, Republic of',
 'Moldova, Republic of',
 'Tanzania, United Republic of',
 'Venezuela, Bolivarian Republic of']

Next, we replace the comma in these country names with another character (`|`), then split, sort and join the string, and finally restore the comma.

In [23]:
for c in comma_countries:
    cleaned['countries'] = cleaned['countries'].str.replace(c, c.replace(',', '|'), regex=False)

cleaned['countries'] = cleaned['countries'].str.split(', ').apply(lambda x : sorted(x) if type(x) is list else x).str.join('; ')

for c in comma_countries:
    cleaned['countries'] = cleaned['countries'].str.replace(c.replace(',', '|'), c, regex=False)

cleaned['countries'].str.get_dummies('; ').columns.tolist()[:10]

['Afghanistan',
 'Albania',
 'Algeria',
 'American Samoa',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan']
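
The placeholder trick above can be sketched in plain Python on a single string. The helper name `split_sort_join` and its parameters are introduced here for illustration, and the sketch assumes the placeholder character `|` never occurs in the data itself.

```python
def split_sort_join(value, protected_names, sep_in=', ', sep_out='; ', placeholder='|'):
    """Split on sep_in, sort, and join with sep_out, keeping protected names intact."""
    # Temporarily replace the comma inside names that contain the delimiter
    for name in protected_names:
        value = value.replace(name, name.replace(',', placeholder))
    result = sep_out.join(sorted(value.split(sep_in)))
    # Restore the original comma
    for name in protected_names:
        result = result.replace(name.replace(',', placeholder), name)
    return result


protected = ['Korea, Republic of', 'Moldova, Republic of']
example = split_sort_join('Spain, Korea, Republic of, Austria', protected)
print(example)
```

Without the placeholder, `'Korea, Republic of'` would be split into `'Korea'` and `'Republic of'` and sorted as two separate countries.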

## Comparing the data
We will now compare the data. The DataFrames need to have the same index and columns in order to use `DataFrame.compare`. We will use `intersection` to find the shared index and column values.

In [24]:
shared_cols = cleaned.columns.intersection(scraped.columns)
print('Amount of shared cols: ', len(shared_cols))
print('Amount of cols in cleaned: ', len(cleaned.columns))
print('Amount of cols in scraped: ', len(scraped.columns))
print('Symmetric differences: ', scraped.columns.symmetric_difference(cleaned.columns))

shared_indices = scraped.index.intersection(cleaned.index)
print('Amount of shared indices: ', len(shared_indices))
print('Amount of indices in cleaned: ', len(cleaned.index))
print('Amount of indices in scraped: ', len(scraped.index))
print('Symmetric differences: ', scraped.index.symmetric_difference(cleaned.index))

Amount of shared cols:  56
Amount of cols in cleaned:  60
Amount of cols in scraped:  63
Symmetric differences:  Index(['data_source_types', 'data_source_types_other', 'other_documents_name',
       'other_documents_url', 'pdf_url', 'protocol_document_name',
       'protocol_document_url', 'result_document_name', 'result_document_url',
       'result_tables_name', 'result_tables_url'],
      dtype='object')
Amount of shared indices:  2760
Amount of indices in cleaned:  2760
Amount of indices in scraped:  2760
Symmetric differences:  Index([], dtype='int64', name='eu_pas_register_number')


Now we can compare the DataFrames. We exclude `substance_atc` and `url`, because we already know that they differ: the two methods extract different information for these fields.

In [25]:
# We know that there are differences in substance_atc and url
interesting_cols = shared_cols.difference(['substance_atc', 'url'])
df_c = cleaned.filter(items=interesting_cols).filter(items=shared_indices, axis='index').sort_index().sort_index(axis='columns')
df_s = scraped.filter(items=interesting_cols).filter(items=shared_indices, axis='index').sort_index().sort_index(axis='columns')

comparison = df_c.compare(df_s, result_names=('exported', 'scraped'))
comparison

Unnamed: 0_level_0,additional_institutions_encepp,additional_institutions_encepp,additional_medical_conditions,additional_medical_conditions,data_sources_registered_with_encepp,data_sources_registered_with_encepp,description,description,outcomes,outcomes,references,references,risk_management_plan,risk_management_plan,title,title,update_date,update_date
Unnamed: 0_level_1,exported,scraped,exported,scraped,exported,scraped,exported,scraped,exported,scraped,exported,scraped,exported,scraped,exported,scraped,exported,scraped
eu_pas_register_number,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2
1661,Clinical Pharmacology Department; Institut de ...,"Clinical Pharmacology Department, UASP Hospita...",,,,,,,,,,,,,,,NaT,
1705,London School of Hygiene & Tropical Medicine (...,"Pharmacoepidemiology Group, London School of H...",,,,,,,,,,,,,,,NaT,
1777,Aarhus University Hospital; Department of Clin...,"Department of Clinical Epidemiology, Aarhus Un...",,,,,,,,,,,,,,,NaT,
2181,Bordeaux PharmacoEpi; Hopitaux de Toulouse; Ph...,"Bordeaux PharmacoEpi, University of Bordeaux; ...",,,,,,,,,,,,,,,NaT,
2221,Centre for Maternal; Fetal and Infant Research...,"Centre for Maternal, Fetal and Infant Research...",,,,,,,,,,,,,,,NaT,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
107217,Cegedim Health Data (CHD); Centre for Pharmaco...,Cegedim Health Data (CHD); Department of medic...,,,,,,,,,,,,,,,NaT,
107315,IQVIA; Pharmaco- and Device epidemiology; Univ...,"IQVIA; Pharmaco- and Device epidemiology, Univ...",,,,,,,,,,,,,,,NaT,
107885,Aarhus University Hospital; Centre for Pharmac...,"Department of Clinical Epidemiology, Aarhus Un...",,,,,,,,,,,,,,,NaT,
107932,Fundació Institut Universitari per a la Recerc...,Fundació Institut Universitari per a la Recerc...,,,,,,,,,,,,,,,NaT,
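
To show the align-then-compare pattern in isolation, here is a small sketch on two invented DataFrames. The frame names, columns and values are assumptions for illustration only; `result_names` needs a reasonably recent pandas version.

```python
import pandas as pd

# Two toy frames that overlap only partly in rows and columns (invented data)
left = pd.DataFrame(
    {'title': ['A', 'B', 'C'], 'country': ['NL', 'DE', 'FR'], 'only_left': [1, 2, 3]},
    index=pd.Index([10, 20, 30], name='id'),
)
right = pd.DataFrame(
    {'title': ['A', 'B2', 'D'], 'country': ['NL', 'DE', 'ES'], 'only_right': [1, 2, 3]},
    index=pd.Index([10, 20, 40], name='id'),
)

# Restrict both frames to the labels they have in common
shared_cols = left.columns.intersection(right.columns)
shared_idx = left.index.intersection(right.index)

left_aligned = left.loc[shared_idx, shared_cols].sort_index().sort_index(axis='columns')
right_aligned = right.loc[shared_idx, shared_cols].sort_index().sort_index(axis='columns')

# compare() keeps only the rows and columns that contain at least one difference
diff = left_aligned.compare(right_aligned, result_names=('left', 'right'))
print(diff)
```

Only row `20` and column `title` survive, because that is the only cell that differs among the shared labels. Cells that are equal in a surviving row or column show up as `NaN`, which explains the many empty cells in the comparison above.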


## Exporting
There is a problem when we try to save the data from the exported `.csv` as an `.xlsx` file.

It is caused by characters that `openpyxl` treats as illegal:

In [26]:
from openpyxl.cell.cell import ILLEGAL_CHARACTERS_RE
ILLEGAL_CHARACTERS_RE

re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]', re.UNICODE)

We can find the entries containing these characters...

In [27]:
matches = []
for col in cleaned:
    m = cleaned.loc[cleaned[col].astype(str).str.contains(ILLEGAL_CHARACTERS_RE), col]
    if not m.empty:
        matches.append(m)
matches

[eu_pas_register_number
 6435    Study of Current Standard of Care in the U.S.:...
 Name: title, dtype: object]

...and display the value:

In [28]:
cleaned.loc[6435].title

'Study of Current Standard of Care in the U.S.: Incidence \nof postoperative events and associated costs among non\x02cardiac surgery patients exposed to neuromuscular \nblocking agents in the Cleveland Clinic (2005-2013)'

We can still export the data if we escape these ASCII control characters with the `unicode_escape` codec, like so:

In [29]:
cleaned.reset_index().map(lambda x: x.encode('unicode_escape').decode('utf-8') if isinstance(x, str) and ILLEGAL_CHARACTERS_RE.search(x) else x).to_excel('converted_ema-rwd_2024-02-21.xlsx', sheet_name='PAS')
comparison.map(lambda x: x.encode('unicode_escape').decode('utf-8') if isinstance(x, str) and ILLEGAL_CHARACTERS_RE.search(x) else x).to_excel('dataset_comparison_2024-02-21.xlsx', sheet_name='compare')
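
The effect of the escaping can be checked on a short string. The sketch below copies the pattern printed above into a plain `re` pattern so it runs without `openpyxl`. The helper names and the sample string are made up, and `strip_illegal` is an alternative that this notebook does not use.

```python
import re

# Same pattern as the ILLEGAL_CHARACTERS_RE shown above, copied to stay dependency-free
ILLEGAL = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')


def escape_illegal(value):
    """Escape a string only if it contains an illegal control character."""
    if isinstance(value, str) and ILLEGAL.search(value):
        return value.encode('unicode_escape').decode('utf-8')
    return value


def strip_illegal(value):
    """Alternative (not used in this notebook): drop the illegal characters."""
    return ILLEGAL.sub('', value) if isinstance(value, str) else value


sample = 'non\x02cardiac \nsurgery'  # shortened, invented variant of the title above
escaped = escape_illegal(sample)
stripped = strip_illegal(sample)
print(repr(escaped))
print(repr(stripped))
```

Note that `unicode_escape` rewrites the whole affected string: the legal newline also becomes a literal `\n`, and any non-ASCII characters in that string would be escaped as well. Stripping keeps the rest of the text unchanged but discards the information that a control character was present.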