# Downloading EMA RWD Documents and generating classification files

We will download all study documents in the EMA RWD Catalogue and generate classification files for the report documents.

<small>**NOTE:** The classification files were used to find all studies with abstracts or final reports posted.</small>

First import the needed libraries...

In [2]:
import numpy as np
import pandas as pd
import requests
from tenacity import retry, stop_after_attempt, wait_exponential
from tqdm.notebook import tqdm

from pathlib import Path
from urllib.parse import quote, unquote

cwd = Path.cwd()

...and the datasets (**NOTE:** `scraped` is needed for the correct document URLs):

In [None]:
scraped = pd.read_csv('../database_migration/data/scraped_ema-rwd_2024-02-21T22-22-05+00-00.csv').set_index('eu_pas_register_number').sort_index()

exported = pd.read_excel('../database_migration/converted_ema-rwd_2024-02-21.xlsx').set_index('eu_pas_register_number').sort_index()

na_values = [
    "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", 
    "1.#IND", "1.#QNAN", "<NA>", "NULL", "NaN", "None", "nan", "null"
    # "N/A",
    # "NA",
    # "n/a",
]

def python_name_converter(x):
    # Convert "Some Column" to "some_column"; leave "$..." helper columns untouched
    return x if x.startswith('$') else '_'.join(word.lower() for word in x.split(' '))

cancelled = pd.read_excel(
    '../../data/ema_rwd/ema_rwd_patched_manual_gpt_v3.xlsx', 
    index_col=0, 
    keep_default_na=False,
    na_values=na_values,
    na_filter=True
).rename(
    columns=python_name_converter
).set_index(
    'eu_pas_register_number'
)['$CANCELLED_MANUAL']

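# Despite its name, `cancelled` now holds the index of the studies that are NOT flagged as cancelled
cancelled = cancelled[~cancelled.fillna(False).astype(bool)].index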
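
The effect of combining `keep_default_na=False` with a custom `na_values` list can be illustrated on a small in-memory table. This is only a sketch: the CSV content and column names are made up, `read_csv` stands in for `read_excel` (both accept these two parameters), and the presumed motivation (keeping literal strings such as "NA" instead of turning them into missing values) is an assumption based on the commented-out entries in the list.

```python
import io

import pandas as pd

# Shortened version of the list used above; "NA" and "N/A" are deliberately absent
na_values = ["", "NaN", "null"]

# Hypothetical data: one literal "NA" answer, one empty cell, one regular value
toy_csv = io.StringIO("id,answer\n1,NA\n2,\n3,yes\n")

toy = pd.read_csv(toy_csv, keep_default_na=False, na_values=na_values)

# "NA" survives as text, while the empty cell becomes a missing value
print(toy)
print(toy["answer"].isna().tolist())
```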

## Comparison of scraped and exported URLs

The following code demonstrates the differences between the **scraped** and the **exported** URLs.

**NOTE:** The **exported** data only contains the names of the documents.

We will show that we can't simply reconstruct the document URLs from the document name.
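
As a rough illustration of the reconstruction idea, the sketch below rebuilds a URL from a document name and compares it with a stored URL. The document name, the helper `reconstruct_url`, and the `_0` suffix in the stored URL are hypothetical; they only show one way in which a rebuilt URL could differ from the real one, not what the catalogue actually does.

```python
from urllib.parse import quote

# Base path taken from the reconstruction code in this notebook
BASE = 'https://catalogues.ema.europa.eu/sites/default/files/document_files/'


def reconstruct_url(name):
    """Rebuild a document URL from its name (hypothetical helper)."""
    return f'{BASE}{quote(name)}.pdf'


# Hypothetical document name and hypothetical stored URL
name = 'Study Protocol v1'
stored_url = BASE + 'Study%20Protocol%20v1_0.pdf'

rebuilt_url = reconstruct_url(name)
print(rebuilt_url)
print(rebuilt_url == stored_url)  # False: the name alone does not determine the URL
```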

First we will filter the needed URLs:

In [13]:
scraped_urls = scraped.filter(regex="(protocol|result)_document")
scraped_urls

eu_pas_register_number,protocol_document_url,result_document_url
1578,,
1587,https://catalogues.ema.europa.eu/sites/default...,
1591,,
1594,,
1597,https://catalogues.ema.europa.eu/sites/default...,
...,...,...
108481,https://catalogues.ema.europa.eu/sites/default...,
108728,https://catalogues.ema.europa.eu/sites/default...,
108847,,
108904,,


Now we can rebuild the URLs from the exported document names, compare the two `pd.DataFrame`s, and export the differences:

In [14]:
exported_urls = exported.filter(regex='(protocol|result)_document') \
.rename(columns=lambda x : x.replace('name', 'url')) \
.map(
    lambda x : f'https://catalogues.ema.europa.eu/sites/default/files/document_files/{quote(x)}.pdf' if pd.notna(x) else x
)

exported_urls

eu_pas_register_number,protocol_document_url,result_document_url
1578,,
1587,https://catalogues.ema.europa.eu/sites/default...,
1591,,
1594,,
1597,https://catalogues.ema.europa.eu/sites/default...,
...,...,...
108481,https://catalogues.ema.europa.eu/sites/default...,
108728,https://catalogues.ema.europa.eu/sites/default...,
108847,,
108904,,


In [15]:
exported_urls \
    .sort_index(axis='index') \
    .sort_index(axis='columns') \
.compare(
    scraped_urls \
        .sort_index(axis='index') \
        .sort_index(axis='columns')
).to_excel('compare_document_urls.xlsx')
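
For readers unfamiliar with `DataFrame.compare`, here is a minimal sketch on two toy frames. The index values and URLs are invented. With default arguments, the result keeps only the rows and columns that differ and labels the two sides `self` and `other`.

```python
import pandas as pd

# Two toy frames with identical labels; only one cell differs
left = pd.DataFrame(
    {'protocol_document_url': ['u1', 'u2'], 'result_document_url': [None, 'r2']},
    index=[1, 2],
)
right = pd.DataFrame(
    {'protocol_document_url': ['u1', 'u2x'], 'result_document_url': [None, 'r2']},
    index=[1, 2],
)

# Missing values on both sides count as equal, so only row 2 is reported
diff = left.compare(right)
print(diff)
```

Sorting both axes before the comparison, as done above, matters because `compare` requires identically labelled frames.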

## Metadata
Next we will add metadata flags that describe which documents are available for each study...

In [9]:
meta_df = scraped
meta_df = meta_df.assign(
    has_result_tables = lambda x : x['result_tables_url'].notna(),
    has_result_documents = lambda x : x['result_document_url'].notna(),
    has_result=lambda x : x['has_result_tables'] | x['has_result_documents'],
    has_full_result=lambda x : x['has_result_tables'] & x['has_result_documents'],
    has_protocol=lambda x : x['protocol_document_url'].notna(),
    has_other=lambda x : x['other_documents_url'].notna(),
    has_document=lambda x : x['has_result'] | x['has_protocol'] | x['has_other']
)
meta_df = meta_df.filter(regex='protocol|result|other_documents|has|risk|^url$')
meta_df.head()

eu_pas_register_number,other_documents_url,protocol_document_url,result_document_url,result_tables_url,risk_management_plan,url,has_result_tables,has_result_documents,has_result,has_full_result,has_protocol,has_other,has_document
1578,,,,,,https://catalogues.ema.europa.eu/node/1221,False,False,False,False,False,False,False
1587,,https://catalogues.ema.europa.eu/sites/default...,,,,https://catalogues.ema.europa.eu/node/1146,False,False,False,False,True,False,True
1591,,,,https://catalogues.ema.europa.eu/sites/default...,Not applicable,https://catalogues.ema.europa.eu/node/1330,True,False,True,False,False,False,True
1594,,,,,Not applicable,https://catalogues.ema.europa.eu/node/1304,False,False,False,False,False,False,False
1597,,https://catalogues.ema.europa.eu/sites/default...,,https://catalogues.ema.europa.eu/sites/default...,EU RMP category 3 (required),https://catalogues.ema.europa.eu/node/2359,True,False,True,False,True,False,True
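
The flag logic can be checked on a toy frame. The sketch below uses invented values and only a subset of the flags. It relies on `assign` evaluating its keyword arguments in order, so later lambdas can use columns created by earlier ones.

```python
import pandas as pd

# Toy data: study 1 has both result files, study 2 only a document, study 3 nothing
toy = pd.DataFrame(
    {
        'result_tables_url': ['t', None, None],
        'result_document_url': ['d', 'd', None],
    },
    index=[1, 2, 3],
)

flags = toy.assign(
    has_result_tables=lambda x: x['result_tables_url'].notna(),
    has_result_documents=lambda x: x['result_document_url'].notna(),
    # "any result" is the logical OR, "full result" the logical AND
    has_result=lambda x: x['has_result_tables'] | x['has_result_documents'],
    has_full_result=lambda x: x['has_result_tables'] & x['has_result_documents'],
)
print(flags.filter(regex='^has_'))
```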


## PDF Download
We can now download the PDFs following these steps:

1. Create a folder for RMP Category 1 ("rmp1"), RMP Category 2 ("rmp2") and for the rest of the studies ("rmp_other")

1. Create `pdf_df` with the path information and some metadata

1. Download the documents into the correct folder and rename them (by putting the EU PAS Register number and the document type in the filename)
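
The folder and filename scheme from the steps above can be sketched as two small functions. The helper names `folder_for` and `target_path` are hypothetical and do not appear in the notebook; the category strings and the filename pattern follow the download code.

```python
from pathlib import Path

RMP1 = 'EU RMP category 1 (imposed as condition of marketing authorisation)'
RMP2 = 'EU RMP category 2 (specific obligation of marketing authorisation)'


def folder_for(risk_management_plan):
    """Map a risk_management_plan value to its download folder (hypothetical helper)."""
    if risk_management_plan == RMP1:
        return 'rmp1'
    if risk_management_plan == RMP2:
        return 'rmp2'
    return 'rmp_other'  # every other value, including missing ones


def target_path(root, risk_management_plan, eupas_id, document_type):
    """Build the local path EUPAS<id>_<document type>.pdf inside the right folder."""
    return Path(root) / folder_for(risk_management_plan) / f'EUPAS{eupas_id}_{document_type}.pdf'


print(target_path('.', RMP1, 2165, 'protocol_document').as_posix())
print(target_path('.', 'Not applicable', 1591, 'result_tables').as_posix())
```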

In [None]:
# Step 1
(cwd / 'rmp1').mkdir(exist_ok=True, parents=True)
(cwd / 'rmp2').mkdir(exist_ok=True, parents=True)
(cwd / 'rmp_other').mkdir(exist_ok=True, parents=True)

# Step 2
pdf_df = pd.DataFrame().assign(
    path = pd.NA,
    url = pd.NA,
    last_modified = pd.NA,
    e_tag = pd.NA
)

In [None]:
# Step 3
@retry(reraise=True, stop=stop_after_attempt(8), wait=wait_exponential())
def fetch_url(url, session):
    response = session.get(url)
    response.raise_for_status()  # Raise HTTPError for bad responses (4xx and 5xx)
    return response

def download_pdf(url, pdf_file_path, pdf_df, session, override=False):
    # Skip files that were already downloaded unless override is set
    if not pdf_file_path.is_file() or override:

        response = None
        try:
            response = fetch_url(url, session)
            pdf_file_path.write_bytes(response.content)
        except requests.exceptions.RequestException as e:
            # With reraise=True, tenacity re-raises the last request error
            # (instead of a RetryError) once all attempts are used up
            print(f"Failed to fetch {url}: {e}")
        finally:
            # Log every attempt; a failed download has no headers, so NA is stored
            headers = getattr(response, 'headers', {})
            pdf_df = pd.concat([
                pdf_df,
                pd.DataFrame([[
                    pdf_file_path.name,
                    url,
                    headers.get('Last-Modified', pd.NA),
                    headers.get('ETag', pd.NA)
                ]], columns=pdf_df.columns)
            ], ignore_index=True)

    return pdf_df

with requests.Session() as session:

    for [sub_df, folder_name] in [
        (meta_df[meta_df['risk_management_plan'] == 'EU RMP category 1 (imposed as condition of marketing authorisation)'], 'rmp1'),
        (meta_df[meta_df['risk_management_plan'] == 'EU RMP category 2 (specific obligation of marketing authorisation)'], 'rmp2'),
        (meta_df[~meta_df['risk_management_plan'].isin([
            'EU RMP category 1 (imposed as condition of marketing authorisation)',
            'EU RMP category 2 (specific obligation of marketing authorisation)'
        ])], 'rmp_other'),
    ]:

        for [urls, name] in [
            (sub_df.loc[sub_df['has_protocol'], 'protocol_document_url'], 'protocol_document'),
            (sub_df.loc[sub_df['has_result_tables'], 'result_tables_url'], 'result_tables'),
            (sub_df.loc[sub_df['has_result_documents'], 'result_document_url'], 'result_document')
        ]:
            
            print(folder_name, name)
            for url, eupas_id in tqdm(zip(urls.values, urls.index), total=len(urls)):
                pdf_file_path = cwd / folder_name / f'EUPAS{eupas_id}_{name}.pdf'
                pdf_df = download_pdf(url, pdf_file_path, pdf_df, session)
        
        other_df = sub_df.loc[sub_df['has_other'], 'other_documents_url'].str.split('; ').explode().to_frame()
        other_df['counter'] = other_df.groupby(level=0).cumcount() + 1

        print(folder_name, 'other_document')
        for [url, value], eupas_id in tqdm(zip(other_df.values, other_df.index), total=len(other_df)):
            pdf_file_path = cwd / folder_name / f'EUPAS{eupas_id}_other_document_#{value}.pdf'
            pdf_df = download_pdf(url, pdf_file_path, pdf_df, session)

pdf_df

rmp1 protocol_document


  0%|          | 0/83 [00:00<?, ?it/s]

rmp1 result_tables


  0%|          | 0/62 [00:00<?, ?it/s]

rmp1 result_document


  0%|          | 0/18 [00:00<?, ?it/s]

rmp1 other_document


  0%|          | 0/33 [00:00<?, ?it/s]

rmp2 protocol_document


  0%|          | 0/23 [00:00<?, ?it/s]

rmp2 result_tables


  0%|          | 0/17 [00:00<?, ?it/s]

rmp2 result_document


  0%|          | 0/4 [00:00<?, ?it/s]

rmp2 other_document


  0%|          | 0/11 [00:00<?, ?it/s]

rmp_other protocol_document


  0%|          | 0/1426 [00:00<?, ?it/s]

rmp_other result_tables


  0%|          | 0/950 [00:00<?, ?it/s]

rmp_other result_document


  0%|          | 0/140 [00:00<?, ?it/s]

rmp_other other_document


  0%|          | 0/247 [00:00<?, ?it/s]

,path,url,last_modified,e_tag
0,EUPAS2165_protocol_document.pdf,https://catalogues.ema.europa.eu/sites/default...,"Thu, 29 Feb 2024 07:34:59 GMT","W/""492b6-612804bf842b8"""
1,EUPAS2196_protocol_document.pdf,https://catalogues.ema.europa.eu/sites/default...,"Thu, 29 Feb 2024 07:40:53 GMT","W/""4d1321-61280610d4770"""
2,EUPAS3142_protocol_document.pdf,https://catalogues.ema.europa.eu/sites/default...,"Thu, 29 Feb 2024 07:44:20 GMT","W/""170e9a-612806d67b960"""
3,EUPAS3901_protocol_document.pdf,https://catalogues.ema.europa.eu/sites/default...,"Thu, 29 Feb 2024 07:47:56 GMT","W/""117970-612807a47b2e8"""
4,EUPAS4270_protocol_document.pdf,https://catalogues.ema.europa.eu/sites/default...,"Thu, 29 Feb 2024 07:38:24 GMT","W/""b023f-61280582a5c88"""
...,...,...,...,...
3009,EUPAS105033_other_document_#1.pdf,https://catalogues.ema.europa.eu/sites/default...,"Thu, 29 Feb 2024 07:50:07 GMT","W/""17cd44-61280820b7de8"""
3010,EUPAS105257_other_document_#1.pdf,https://catalogues.ema.europa.eu/sites/default...,"Thu, 29 Feb 2024 07:50:09 GMT","W/""38a4c-612808228ed10"""
3011,EUPAS108167_other_document_#1.pdf,https://catalogues.ema.europa.eu/sites/default...,"Thu, 29 Feb 2024 07:50:24 GMT","W/""55711-61280830dbf30"""
3012,EUPAS108254_other_document_#1.pdf,https://catalogues.ema.europa.eu/sites/default...,"Thu, 29 Feb 2024 07:50:24 GMT","W/""4789d-612808313f120"""
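
The numbering of the "other" documents relies on splitting the `'; '`-separated URL string, exploding it into one row per URL, and counting within each study. A minimal sketch with invented study numbers and `example.org` URLs:

```python
import pandas as pd

# Hypothetical input: study 101 lists two other documents, study 202 lists one
urls = pd.Series(
    {
        101: 'https://example.org/a.pdf; https://example.org/b.pdf',
        202: 'https://example.org/c.pdf',
    },
    name='other_documents_url',
)

# One row per URL; the study number is repeated in the index
other = urls.str.split('; ').explode().to_frame()

# Running number per study, starting at 1
other['counter'] = other.groupby(level=0).cumcount() + 1

file_names = [
    f'EUPAS{eupas_id}_other_document_#{counter}.pdf'
    for eupas_id, counter in zip(other.index, other['counter'])
]
print(file_names)
```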


Save `pdf_df` so that the download metadata can be reloaded later.

Optionally, skip the download step above and load a previously saved `pdf_df` instead.

In [54]:
pdf_df.to_excel('all_documents_download_meta_data.xlsx', sheet_name='meta_data')

In [10]:
pdf_df = pd.read_excel('all_documents_download_meta_data.xlsx', sheet_name='meta_data', index_col=0)
pdf_df.head(3)

,path,url,last_modified,e_tag
0,EUPAS2165_protocol_document.pdf,https://catalogues.ema.europa.eu/sites/default...,"Thu, 29 Feb 2024 07:34:59 GMT","W/""492b6-612804bf842b8"""
1,EUPAS2196_protocol_document.pdf,https://catalogues.ema.europa.eu/sites/default...,"Thu, 29 Feb 2024 07:40:53 GMT","W/""4d1321-61280610d4770"""
2,EUPAS3142_protocol_document.pdf,https://catalogues.ema.europa.eu/sites/default...,"Thu, 29 Feb 2024 07:44:20 GMT","W/""170e9a-612806d67b960"""


We can now prepare several `pd.DataFrame`s, which we will use for analysing and classifying the documents:

In [None]:
analysis_df = pdf_df.assign(
    name=lambda x : x['url'].str.split('/').str[-1].str.replace('.pdf', '', regex=False).apply(unquote),
    eu_pas_register_number=lambda x : x['path'].str.split('_').str[0].str[5:].astype(int),
    uploaded_document_type=lambda x: x['path'].str.split('_').str[1:].str.join('_').str[:-4].str.replace(r'_#\d+', '', regex=True)
).drop(
    ['last_modified', 'e_tag'], 
    axis='columns'
).set_index(
    'eu_pas_register_number'
).merge(
    meta_df['risk_management_plan'], 
    right_index=True, 
    left_index=True
).assign(
    folder_name=lambda x : np.where(
        x['risk_management_plan'] == 'EU RMP category 1 (imposed as condition of marketing authorisation)',
        'rmp1',
        np.where(
            x['risk_management_plan'] == 'EU RMP category 2 (specific obligation of marketing authorisation)',
            'rmp2',
            'rmp_other'
        )
    ),
    path=lambda x : x['folder_name'] + '/' + x['path']
).drop(
    ['folder_name'], axis='columns'
)

# Keep only studies that are not cancelled (`cancelled` holds the index of the non-cancelled studies)
analysis_df = analysis_df.loc[analysis_df.index.intersection(cancelled)]

imposed_rmp_manual_analysis_df = analysis_df[
    analysis_df['risk_management_plan'].isin([
        'EU RMP category 1 (imposed as condition of marketing authorisation)',
        'EU RMP category 2 (specific obligation of marketing authorisation)'
    ]) & 
    analysis_df['uploaded_document_type'].ne('protocol_document')
].drop(
    ['risk_management_plan'], axis='columns'
).assign(
    manual_document_type=pd.NA
)

other_rmp_manual_analysis_df = analysis_df[
    ~(analysis_df['risk_management_plan'].isin([
        'EU RMP category 1 (imposed as condition of marketing authorisation)',
        'EU RMP category 2 (specific obligation of marketing authorisation)'
    ])) & 
    analysis_df['uploaded_document_type'].ne('protocol_document')
].drop(
    ['risk_management_plan'], axis='columns'
).assign(
    manual_document_type=pd.NA
)

analysis_df

eu_pas_register_number,path,url,name,uploaded_document_type,risk_management_plan
1587,rmp_other/EUPAS1587_protocol_document.pdf,https://catalogues.ema.europa.eu/sites/default...,Tesis Maria Jose Alcala,protocol_document,
1591,rmp_other/EUPAS1591_result_tables.pdf,https://catalogues.ema.europa.eu/sites/default...,Report_Rosiglitazone_use,result_tables,Not applicable
1597,rmp_other/EUPAS1597_protocol_document.pdf,https://catalogues.ema.europa.eu/sites/default...,Protocol INAS-FOCUS,protocol_document,EU RMP category 3 (required)
1597,rmp_other/EUPAS1597_result_tables.pdf,https://catalogues.ema.europa.eu/sites/default...,IFOC_FinalStudyReport_Public Version 20200819,result_tables,EU RMP category 3 (required)
1613,rmp_other/EUPAS1613_protocol_document.pdf,https://catalogues.ema.europa.eu/sites/default...,VIPOS_Study Protocol,protocol_document,EU RMP category 3 (required)
...,...,...,...,...,...
108254,rmp_other/EUPAS108254_other_document_#1.pdf,https://catalogues.ema.europa.eu/sites/default...,CEIM_LEGIT_MC_EVCDAO_2019 Modificacion Favorab...,other_document,Not applicable
108260,rmp_other/EUPAS108260_protocol_document.pdf,https://catalogues.ema.europa.eu/sites/default...,LEGIT_COVIDX_EVCDAO_2022 Protocol Multipatholo...,protocol_document,Not applicable
108260,rmp_other/EUPAS108260_other_document_#1.pdf,https://catalogues.ema.europa.eu/sites/default...,CEIm_LEGIT_COVIDX_EVCDAO_2022_TRJON-8abc0f12d8...,other_document,Not applicable
108481,rmp_other/EUPAS108481_protocol_document.pdf,https://catalogues.ema.europa.eu/sites/default...,MK-5592-141-00-v1-Protocol_final-redaction,protocol_document,Not applicable
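
How the three derived columns are parsed from `path` and `url` can be traced on two toy rows. The file paths follow the naming scheme used for the downloads; the `example.org` URLs are invented.

```python
from urllib.parse import unquote

import pandas as pd

toy = pd.DataFrame({
    'path': ['EUPAS2165_protocol_document.pdf', 'EUPAS105033_other_document_#1.pdf'],
    'url': ['https://example.org/files/Study%20Protocol.pdf', 'https://example.org/files/Annex_1.pdf'],
})

parsed = toy.assign(
    # Last URL segment without ".pdf", percent-decoded back to the original name
    name=lambda x: x['url'].str.split('/').str[-1].str.replace('.pdf', '', regex=False).apply(unquote),
    # "EUPAS2165" -> 2165
    eu_pas_register_number=lambda x: x['path'].str.split('_').str[0].str[5:].astype(int),
    # Everything after the first "_", minus ".pdf" and the "_#<n>" counter
    uploaded_document_type=lambda x: x['path'].str.split('_').str[1:].str.join('_').str[:-4].str.replace(r'_#\d+', '', regex=True),
)
print(parsed[['name', 'eu_pas_register_number', 'uploaded_document_type']])
```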


We can also save and load this `pd.DataFrame` for further analysis (see the other notebook for the analysis):

In [105]:
analysis_df.to_excel('all_documents_analysis_data.xlsx', sheet_name='analysis')

In [14]:
analysis_df = pd.read_excel('all_documents_analysis_data.xlsx', sheet_name='analysis', index_col=0)
analysis_df.head(3)

eu_pas_register_number,path,url,name,risk_management_plan
1587,rmp_other/EUPAS1587_protocol_document.pdf,https://catalogues.ema.europa.eu/sites/default...,Tesis Maria Jose Alcala,
1591,rmp_other/EUPAS1591_result_tables.pdf,https://catalogues.ema.europa.eu/sites/default...,Report_Rosiglitazone_use,Not applicable
1597,rmp_other/EUPAS1597_protocol_document.pdf,https://catalogues.ema.europa.eu/sites/default...,Protocol INAS-FOCUS,EU RMP category 3 (required)


We can also save the two other `pd.DataFrame`s for manual classification (see the other notebook for further processing):

In [134]:
imposed_rmp_manual_analysis_df.to_excel('rmp1&2_documents_manual.xlsx', sheet_name='analysis')

In [12]:
other_rmp_manual_analysis_df.to_excel('rmpother_documents_manual.xlsx', sheet_name='analysis')