# Searching EMA RWD Studies in other databases
We will search for EMA RWD Studies in [ClinicalTrials.gov](https://www.clinicaltrials.gov) and [Pubmed](https://pubmed.ncbi.nlm.nih.gov/).

<small>**NOTE:** This notebook was supposed to be used to find other references based on the metadata of each study.</small>

<small>**NOTE:** Not used for final analysis. Out of scope.</small>

First import the needed libraries...

In [1]:
import pandas as pd
from pymed import PubMed
import requests
from tqdm.notebook import tqdm

from urllib.parse import quote_plus

...and the data:

In [2]:
exported = pd.read_excel('../database_migration/converted_ema-rwd_2024-02-21.xlsx').set_index('eu_pas_register_number').sort_index()

We will just focus on the required RMP (category 1 and 2) studies and try to extract the acronym from the title.

**NOTE:** Acronym was once a seperate field in the EU PAS Register study. It has been merged with the title in the new EMA RWD database.

In [None]:
base = exported.assign(
    acronym=lambda x : x['title'].str.extract(r'\((?P<acronym>[^\(\)]+)\)\s*$')
)[exported['risk_management_plan'].isin([
    'EU RMP category 1 (imposed as condition of marketing authorisation)',
    'EU RMP category 2 (specific obligation of marketing authorisation)'
])].filter(regex='title|references|has|^url$|acronym|regulatory_procedure_number')

base

Unnamed: 0_level_0,title,url,regulatory_procedure_number,references,acronym
eu_pas_register_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2165,Post-Authorisation Safety Study of Esbriet® (P...,https://catalogues.ema.europa.eu/study/23388,,,PASSPORT
2181,Study of Acute Liver Transplant: A study of NS...,https://catalogues.ema.europa.eu/study/40864,,https://doi.org/10.1002/pds.3204; https://doi....,SALT-I
2196,Prospective controlled cohort study on the saf...,https://catalogues.ema.europa.eu/study/41500,,,PRO-E2
2857,A Multicenter Cohort Study of the Short and Lo...,https://catalogues.ema.europa.eu/study/35221,,https://doi.org/10.1093/jac/dkw225; https://do...,MYCOS
3142,A Safety and Pharmacokinetic study in Real-lif...,https://catalogues.ema.europa.eu/study/47210,,https://doi.org/10.1007/s40264-019-00821-6,
...,...,...,...,...,...
103852,"Non-interventional, post-authorization efficac...",https://catalogues.ema.europa.eu/study/103853,EMEA/H/C/PSP/S/0098.1,,CA082-1105
103855,"Non-interventional, post-authorization safety ...",https://catalogues.ema.europa.eu/study/103856,EMEA/H/C/PSP/S/0098.1,,JCAR017-BCM-005
105358,A Registry of Patients Treated with Fintepla (...,https://catalogues.ema.europa.eu/study/105359,EMEA/H/C/003933,,TAPESTRY Registry
105422,"A Two-Part, International, Real-World, Observa...",https://catalogues.ema.europa.eu/study/105423,EMEA/H/C/005352/SOB/002,,PTC-AADC-MA-406


## Searching in ClinicalTrials.gov
We will now search the ClinicalTrials.gov database in this order:

1. EU PAS Register Number
1. Title
1. Regulatory Procedure Number (if applicable)
1. Acronym (if applicable)

In [None]:
rest_search_url = "https://clinicaltrials.gov/api/v2/studies?format=json&markupFormat=markdown&countTotal=true&pageSize=15"
id_search_query = '"EUPAS{0}" OR "Eupas{0}" OR "eupas{0}" OR (("EUPAS" OR "Eupas" OR "Eu Pas" OR "EU PAS" OR "EU Pas" OR "Eu PAS") AND "{0}")'

nct_df = pd.DataFrame().assign(
    eu_pas_register_number=pd.NA,
    nct_total_count=pd.NA,
    nct_url=pd.NA,
    nct_id=pd.NA,
    nct_state=pd.NA,
    nct_first_posted=pd.NA,
    nct_title=pd.NA,
    nct_acronym=pd.NA,
    nct_lead_sponsor=pd.NA,
    nct_pmid_references=pd.NA,
    nct_status_code=pd.NA,
    nct_search_term=pd.NA,
)

for [id, title, references, rpn, references, acronym] in tqdm(base.reset_index().values, total=len(base)):

    search_term = pd.NA

    def add_fail_entry(df, status_code, search_term):
        df = pd.concat([ 
            df,
            pd.DataFrame([[
                id, status_code, search_term
            ]], columns=['eu_pas_register_number', 'nct_status_code', 'nct_search_term'])
        ], ignore_index=True)
        return df
    
    search_term = 'eu_pas_register_number'
    response = requests.get(f'{rest_search_url}&query.id={quote_plus(id_search_query.format(id))}')
    if response.status_code != 200:
        nct_df = add_fail_entry(nct_df, response.status_code, search_term)
        continue
    clinical_trials_studies = response.json()

    if not clinical_trials_studies['studies']:
        
        search_term = 'title'
        response = requests.get(f'{rest_search_url}&query.titles={quote_plus(title)}')
        if response.status_code != 200:
            nct_df = add_fail_entry(nct_df, response.status_code, search_term)
            continue
        clinical_trials_studies = response.json()
        
        if not clinical_trials_studies['studies']:

            if pd.notna(rpn):
                search_term = 'regulatory_procedure_number'
                response = requests.get(f'{rest_search_url}&query.id={quote_plus(rpn)}')
                if response.status_code != 200:
                    nct_df = add_fail_entry(nct_df, response.status_code, search_term)
                    continue
                clinical_trials_studies = response.json()
                
            
            if pd.notna(acronym) and not clinical_trials_studies['studies']:
                search_term = 'acronym'
                response = requests.get(f'{rest_search_url}&query.id={quote_plus(acronym)}')
                if response.status_code != 200:
                    nct_df = add_fail_entry(nct_df, response.status_code, search_term)
                    continue
                clinical_trials_studies = response.json()
                
            
    if not clinical_trials_studies['studies']:
        search_term = pd.NA
    
    total_count = clinical_trials_studies['totalCount']

    nct_urls = '; '.join(
        f"https://www.clinicaltrials.gov/study/{study['protocolSection']['identificationModule']['nctId']}"
        for study in clinical_trials_studies['studies']
    )

    nct_ids = '; '.join(
        study['protocolSection']['identificationModule']['nctId']
        for study in clinical_trials_studies['studies']
    )

    nct_states = '; '.join(
        study['protocolSection']['statusModule']['overallStatus']
        for study in clinical_trials_studies['studies']
    )

    nct_first_posted = '; '.join(
        study['protocolSection']['statusModule']['studyFirstPostDateStruct']['date']
        for study in clinical_trials_studies['studies']
    )

    nct_titles = '; '.join(
        study['protocolSection']['identificationModule'].get('officialTitle',
            study['protocolSection']['identificationModule']['briefTitle']
        )
        for study in clinical_trials_studies['studies']
    )

    nct_acronyms = '; '.join(
        study['protocolSection']['identificationModule'].get('acronym', 'n.a.')
        for study in clinical_trials_studies['studies']
    )

    nct_sponsors = '; '.join(
        study['protocolSection']['sponsorCollaboratorsModule']['leadSponsor']['name']
        for study in clinical_trials_studies['studies']
    )

    nct_references = []
    for study in clinical_trials_studies['studies']:
        if study['protocolSection'].get('referencesModule', dict()).get('references'):
            nct_references.append([
                reference['pmid']
                for reference in study['protocolSection']['referencesModule']['references']
                if reference.get('pmid')
            ] or ['n.a.'])
        else:
            nct_references.append(['n.a.'])
    nct_references = '; '.join([', '.join(references) for references in nct_references])

    nct_df = pd.concat([ 
        nct_df,
        pd.DataFrame([[
            id,
            total_count,
            nct_urls,
            nct_ids,
            nct_states,
            nct_first_posted,
            nct_titles,
            nct_acronyms,
            nct_sponsors,
            nct_references,
            response.status_code,
            search_term
        ]], columns=nct_df.columns)
    ], ignore_index=True)

nct_df = nct_df.set_index('eu_pas_register_number')
nct_df

  0%|          | 0/166 [00:00<?, ?it/s]

Unnamed: 0_level_0,nct_total_count,nct_url,nct_id,nct_state,nct_first_posted,nct_title,nct_acronym,nct_lead_sponsor,nct_pmid_references,nct_status_code,nct_search_term
eu_pas_register_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
2165,5,https://www.clinicaltrials.gov/study/NCT062742...,NCT06274294; NCT04716023; NCT03068468; NCT0397...,NOT_YET_RECRUITING; UNKNOWN; TERMINATED; COMPL...,2024-02-23; 2021-01-20; 2017-03-01; 2019-06-07...,"The ""PASSPORT Trial"": Pharmacokinetics, Effica...",PASSPORT; PASSPORT; PASSPORT; PASSPoRT; n.a.,CMC Ambroise Paré; Imperial College London; Bi...,"33676969, 35673354; n.a.; 34736158, 34385707; ...",200,acronym
2181,0,,,,,,,,,200,
2196,1,https://www.clinicaltrials.gov/study/NCT01650168,NCT01650168,COMPLETED,2012-07-26,Prospective Controlled Cohort Study on the Saf...,PRO-E2,"Center for Epidemiology and Health Research, G...",n.a.,200,title
2857,1,https://www.clinicaltrials.gov/study/NCT01686607,NCT01686607,COMPLETED,2012-09-18,A Multicenter Cohort Study of the Short and Lo...,MYCOS,Astellas Pharma Europe B.V.,n.a.,200,title
3142,0,,,,,,,,,200,
...,...,...,...,...,...,...,...,...,...,...,...
103852,0,,,,,,,,,200,
103855,0,,,,,,,,,200,
105358,0,,,,,,,,,200,
105422,0,,,,,,,,,200,


We can save and reload this Dataframe:

In [93]:
nct_df.to_excel('rmp1&2_nct.xlsx', sheet_name='nct')

In [3]:
nct_df = pd.read_excel('rmp1&2_nct.xlsx', sheet_name='nct', index_col='eu_pas_register_number')
nct_df.head()

Unnamed: 0_level_0,nct_total_count,nct_url,nct_id,nct_state,nct_first_posted,nct_title,nct_acronym,nct_lead_sponsor,nct_pmid_references,nct_status_code,nct_search_term
eu_pas_register_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
2165,5.0,https://www.clinicaltrials.gov/study/NCT062742...,NCT06274294; NCT04716023; NCT03068468; NCT0397...,NOT_YET_RECRUITING; UNKNOWN; TERMINATED; COMPL...,2024-02-23; 2021-01-20; 2017-03-01; 2019-06-07...,"The ""PASSPORT Trial"": Pharmacokinetics, Effica...",PASSPORT; PASSPORT; PASSPORT; PASSPoRT; n.a.,CMC Ambroise Paré; Imperial College London; Bi...,"33676969, 35673354; n.a.; 34736158, 34385707; ...",200,acronym
2181,0.0,,,,,,,,,200,
2196,1.0,https://www.clinicaltrials.gov/study/NCT01650168,NCT01650168,COMPLETED,2012-07-26,Prospective Controlled Cohort Study on the Saf...,PRO-E2,"Center for Epidemiology and Health Research, G...",n.a.,200,title
2857,1.0,https://www.clinicaltrials.gov/study/NCT01686607,NCT01686607,COMPLETED,2012-09-18,A Multicenter Cohort Study of the Short and Lo...,MYCOS,Astellas Pharma Europe B.V.,n.a.,200,title
3142,0.0,,,,,,,,,200,


## Searching in Pubmed

We will use the `pymed` library to search for studies with the Pubmed API.

We will search by NCT ID and EU PAS Register number.

In [None]:
pubmed = PubMed(tool="EmaRwdSearcher", email="pedram.ramezani@charite.de")

eupas_id_search_term_template = '((eupas OR "eu pas") AND {id}) OR eupas{id}'
article_info = []

for [id, title, references, rpn, references, acronym] in tqdm(base.reset_index().values, total=len(base)):

    search_term = 'eu_pas_register_number'
    results = pubmed.query(eupas_id_search_term_template.format(id=id), max_results=100)
    if not results:
        search_term = 'title'
        results = pubmed.query(title, max_results=100)
        if not results and rpn:
            search_term = 'regulatory_procedure_number'
            results = pubmed.query(rpn, max_results=100)
    
    article_list = []
    for article in results:
        article_dict = article.toDict()
        article_list.append(article_dict)

    for article in article_list:
        # Sometimes article['pubmed_id'] contains list separated with comma - take first pubmedId in that list - thats article pubmedId
        pubmedId = article['pubmed_id'].partition('\n')[0]
        
        article_info.append({
            'eu_pas_register_number': id,
            'pubmed_id': pubmedId,
            'pubmed_url': f'https://pubmed.ncbi.nlm.nih.gov/{pubmedId}/',
            'pubmed_title': article['title'],
            'pubmed_publication_date': article['publication_date'],
            'pubmed_search_term': search_term
        })

articles_df = pd.DataFrame.from_dict(article_info)

articles_df


  0%|          | 0/166 [00:00<?, ?it/s]

Unnamed: 0,eu_pas_register_number,pubmed_id,pubmed_url,pubmed_title,pubmed_publication_date,pubmed_search_term
0,3142,31069703,https://pubmed.ncbi.nlm.nih.gov/31069703/,Bismuth Concentrations in Patients Treated in ...,2019-05-10,eu_pas_register_number
1,4270,29748252,https://pubmed.ncbi.nlm.nih.gov/29748252/,Data from the US and UK cystic fibrosis regist...,2018-05-12,eu_pas_register_number
2,5812,38888495,https://pubmed.ncbi.nlm.nih.gov/38888495/,Long-term safety of hyaluronidase-facilitated ...,2024-06-18,eu_pas_register_number
3,6942,33964945,https://pubmed.ncbi.nlm.nih.gov/33964945/,Longitudinal study based on a safety registry ...,2021-05-10,eu_pas_register_number
4,9361,32140554,https://pubmed.ncbi.nlm.nih.gov/32140554/,"A European, multicentre, observational, post-a...",2020-03-07,eu_pas_register_number
5,11384,35766393,https://pubmed.ncbi.nlm.nih.gov/35766393/,Use and safety of aprotinin in routine clinica...,2022-06-30,eu_pas_register_number
6,12330,31749061,https://pubmed.ncbi.nlm.nih.gov/31749061/,Comparative Safety Profile of the Fixed-Dose C...,2019-11-22,eu_pas_register_number
7,13514,38459585,https://pubmed.ncbi.nlm.nih.gov/38459585/,Effectiveness of asfotase alfa for treatment o...,2024-03-09,eu_pas_register_number
8,13514,37051203,https://pubmed.ncbi.nlm.nih.gov/37051203/,Impact of muscular symptoms and/or pain on dis...,2023-04-14,eu_pas_register_number
9,16927,34801015,https://pubmed.ncbi.nlm.nih.gov/34801015/,The PARADIGHM (physicians advancing disease kn...,2021-11-22,eu_pas_register_number


We will have to aggregate multiple entries:

In [107]:
articles_df = articles_df.groupby('eu_pas_register_number').agg({
    col: lambda x : '; '.join(x.astype(str))
    for col in articles_df.drop('eu_pas_register_number', axis='columns')
})

articles_df

Unnamed: 0_level_0,pubmed_id,pubmed_url,pubmed_title,pubmed_publication_date,pubmed_search_term
eu_pas_register_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
3142,31069703,https://pubmed.ncbi.nlm.nih.gov/31069703/,Bismuth Concentrations in Patients Treated in ...,2019-05-10,eu_pas_register_number
4270,29748252,https://pubmed.ncbi.nlm.nih.gov/29748252/,Data from the US and UK cystic fibrosis regist...,2018-05-12,eu_pas_register_number
5812,38888495,https://pubmed.ncbi.nlm.nih.gov/38888495/,Long-term safety of hyaluronidase-facilitated ...,2024-06-18,eu_pas_register_number
6942,33964945,https://pubmed.ncbi.nlm.nih.gov/33964945/,Longitudinal study based on a safety registry ...,2021-05-10,eu_pas_register_number
9361,32140554,https://pubmed.ncbi.nlm.nih.gov/32140554/,"A European, multicentre, observational, post-a...",2020-03-07,eu_pas_register_number
11384,35766393,https://pubmed.ncbi.nlm.nih.gov/35766393/,Use and safety of aprotinin in routine clinica...,2022-06-30,eu_pas_register_number
12330,31749061,https://pubmed.ncbi.nlm.nih.gov/31749061/,Comparative Safety Profile of the Fixed-Dose C...,2019-11-22,eu_pas_register_number
13514,38459585; 37051203,https://pubmed.ncbi.nlm.nih.gov/38459585/; htt...,Effectiveness of asfotase alfa for treatment o...,2024-03-09; 2023-04-14,eu_pas_register_number; eu_pas_register_number
16927,34801015,https://pubmed.ncbi.nlm.nih.gov/34801015/,The PARADIGHM (physicians advancing disease kn...,2021-11-22,eu_pas_register_number
31153,36878196,https://pubmed.ncbi.nlm.nih.gov/36878196/,Current Management of Patients with RPE65 Muta...,2023-03-07,eu_pas_register_number


We can now save and reload this Dataframe:

In [108]:
articles_df.to_excel('rmp1&2_pubmed.xlsx', sheet_name='pubmed')

In [4]:
articles_df = pd.read_excel('rmp1&2_pubmed.xlsx', sheet_name='pubmed', index_col='eu_pas_register_number')
articles_df.head()

Unnamed: 0_level_0,pubmed_id,pubmed_url,pubmed_title,pubmed_publication_date,pubmed_search_term
eu_pas_register_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
3142,31069703,https://pubmed.ncbi.nlm.nih.gov/31069703/,Bismuth Concentrations in Patients Treated in ...,2019-05-10,eu_pas_register_number
4270,29748252,https://pubmed.ncbi.nlm.nih.gov/29748252/,Data from the US and UK cystic fibrosis regist...,2018-05-12,eu_pas_register_number
5812,38888495,https://pubmed.ncbi.nlm.nih.gov/38888495/,Long-term safety of hyaluronidase-facilitated ...,2024-06-18,eu_pas_register_number
6942,33964945,https://pubmed.ncbi.nlm.nih.gov/33964945/,Longitudinal study based on a safety registry ...,2021-05-10,eu_pas_register_number
9361,32140554,https://pubmed.ncbi.nlm.nih.gov/32140554/,"A European, multicentre, observational, post-a...",2020-03-07,eu_pas_register_number


## Checking the references
We will now follow  ther reference URLs and see if they references the EU PAS register number:

In [None]:
import re
external_df = base['references'].dropna().str.split('; ').explode().to_frame().assign(
    external_id_found=pd.NA,
    external_status_code=pd.NA
)

headers = {
    'Content-Type': 'application/json',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:127.0) Gecko/20100101 Firefox/127.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US;q=0.7,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br, zstd',
    'Connection': 'keep-alive',
    'Priority': 'u=1'
}

external_data = []
for [id, url, *other] in tqdm(external_df.reset_index().values, total=len(external_df)):
    response = requests.get(url, headers=headers)
    found_match = None
    
    print(id, url, response.status_code, sep=' ', end="\n" if response.status_code == 200 else "\n\n")
    
    if response.status_code == 200:
        found_match = re.search(fr'\b\S*?\s*{id}', response.text, re.IGNORECASE)
        if found_match:
            found_match = found_match.group(0)
        print(found_match or pd.NA, end='\n\n')
    

    external_data.append({
        'eu_pas_register_number': id,
        'external_references': url,
        'external_id_found': found_match or pd.NA,
        'external_status_code': response.status_code,
    })

external_df = pd.DataFrame.from_records(external_data)

external_df = external_df.groupby('eu_pas_register_number').agg({
    col: lambda x : '; '.join(x.astype(str))
    for col in external_df.drop('eu_pas_register_number', axis='columns')
})

external_df

  0%|          | 0/36 [00:00<?, ?it/s]

2181 https://doi.org/10.1002/pds.3204 403

2181 https://doi.org/10.1002/pds.3371 403

2181 https://doi.org/10.1007/s00228-012-1357-8 200
<NA>

2181 https://doi.org/10.1007/s40264-012-0013-7 200
<NA>

2181 https://doi.org/10.1007/s40264-013-0071-5 200
<NA>

2857 https://doi.org/10.1093/jac/dkw225 200
<NA>

2857 https://doi.org/10.1093/jac/dkz396 200
EUPAS2857

3142 https://doi.org/10.1007/s40264-019-00821-6 200
EUPAS3142

3901 https://doi.org/10.1016/j.therap.2020.09.002 200
<NA>

4270 https://doi.org/10.1136/thoraxjnl-2017-210394 200
EUPAS4270

7708 https://doi.org/10.1159/000371798 200
<NA>

7708 https://doi.org/10.1530/EJE-20-0325 200
ENCEPP/SDPP/7708

7708 https://doi.org/10.3389/fendo.2022.812568 200
<NA>

8585 https://doi.org/10.19080/JGWH.2018.09.555762 200
<NA>

9361 https://doi.org/10.1016/S0016-5085(18)31718-9 200
<NA>

11145 https://doi.org/10.1007/s00228-014-1697-7 200
<NA>

13276 https://doi.org/10.1093/jcag/gwab002.214 200
<NA>

13514 https://abstracts.eurospe.org/hrp/0094

Unnamed: 0_level_0,external_references,external_id_found,external_status_code
eu_pas_register_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2181,https://doi.org/10.1002/pds.3204; https://doi....,<NA>; <NA>; <NA>; <NA>; <NA>,403; 403; 200; 200; 200
2857,https://doi.org/10.1093/jac/dkw225; https://do...,<NA>; EUPAS2857,200; 200
3142,https://doi.org/10.1007/s40264-019-00821-6,EUPAS3142,200
3901,https://doi.org/10.1016/j.therap.2020.09.002,,200
4270,https://doi.org/10.1136/thoraxjnl-2017-210394,EUPAS4270,200
7708,https://doi.org/10.1159/000371798; https://doi...,<NA>; ENCEPP/SDPP/7708; <NA>,200; 200; 200
8585,https://doi.org/10.19080/JGWH.2018.09.555762,,200
9361,https://doi.org/10.1016/S0016-5085(18)31718-9,,200
11145,https://doi.org/10.1007/s00228-014-1697-7,,200
13276,https://doi.org/10.1093/jcag/gwab002.214,,200


We can now save and reload this Dataframe:

In [159]:
external_df.to_excel('rmp1&2_external.xlsx', sheet_name='external')

In [5]:
external_df = pd.read_excel('rmp1&2_external.xlsx', sheet_name='external', index_col='eu_pas_register_number')
external_df.head()

Unnamed: 0_level_0,external_references,external_id_found,external_status_code
eu_pas_register_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2181,https://doi.org/10.1002/pds.3204; https://doi....,<NA>; <NA>; <NA>; <NA>; <NA>,403; 403; 200; 200; 200
2857,https://doi.org/10.1093/jac/dkw225; https://do...,<NA>; EUPAS2857,200; 200
3142,https://doi.org/10.1007/s40264-019-00821-6,EUPAS3142,200
3901,https://doi.org/10.1016/j.therap.2020.09.002,,200
4270,https://doi.org/10.1136/thoraxjnl-2017-210394,EUPAS4270,200


## Putting all data together
We will now merge the nct data, the pubmed data and the external references data with the original Dataframe:

In [None]:
references_df = base.merge(
    nct_df,
    how='left', 
    left_index=True,
    right_index=True
).merge(
    articles_df,
    how='left', 
    left_index=True,
    right_index=True
).merge(
    external_df,
    how='left', 
    left_index=True,
    right_index=True
)

references_df.head()

Unnamed: 0_level_0,title,url,regulatory_procedure_number,references,acronym,nct_total_count,nct_url,nct_id,nct_state,nct_first_posted,...,nct_status_code,nct_search_term,pubmed_id,pubmed_url,pubmed_title,pubmed_publication_date,pubmed_search_term,external_references,external_id_found,external_status_code
eu_pas_register_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2165,Post-Authorisation Safety Study of Esbriet® (P...,https://catalogues.ema.europa.eu/study/23388,,,PASSPORT,5.0,https://www.clinicaltrials.gov/study/NCT062742...,NCT06274294; NCT04716023; NCT03068468; NCT0397...,NOT_YET_RECRUITING; UNKNOWN; TERMINATED; COMPL...,2024-02-23; 2021-01-20; 2017-03-01; 2019-06-07...,...,200,acronym,,,,,,,,
2181,Study of Acute Liver Transplant: A study of NS...,https://catalogues.ema.europa.eu/study/40864,,https://doi.org/10.1002/pds.3204; https://doi....,SALT-I,0.0,,,,,...,200,,,,,,,https://doi.org/10.1002/pds.3204; https://doi....,<NA>; <NA>; <NA>; <NA>; <NA>,403; 403; 200; 200; 200
2196,Prospective controlled cohort study on the saf...,https://catalogues.ema.europa.eu/study/41500,,,PRO-E2,1.0,https://www.clinicaltrials.gov/study/NCT01650168,NCT01650168,COMPLETED,2012-07-26,...,200,title,,,,,,,,
2857,A Multicenter Cohort Study of the Short and Lo...,https://catalogues.ema.europa.eu/study/35221,,https://doi.org/10.1093/jac/dkw225; https://do...,MYCOS,1.0,https://www.clinicaltrials.gov/study/NCT01686607,NCT01686607,COMPLETED,2012-09-18,...,200,title,,,,,,https://doi.org/10.1093/jac/dkw225; https://do...,<NA>; EUPAS2857,200; 200
3142,A Safety and Pharmacokinetic study in Real-lif...,https://catalogues.ema.europa.eu/study/47210,,https://doi.org/10.1007/s40264-019-00821-6,,0.0,,,,,...,200,,31069703.0,https://pubmed.ncbi.nlm.nih.gov/31069703/,Bismuth Concentrations in Patients Treated in ...,2019-05-10,eu_pas_register_number,https://doi.org/10.1007/s40264-019-00821-6,EUPAS3142,200


We can now save and reload the combined results:

In [None]:
references_df.to_excel('rmp1&2_all_references.xlsx', sheet_name='references')

In [None]:
references_df = pd.read_excel('rmp1&2_all_references.xlsx', sheet_name='references', index_col='eu_pas_register_number')
references_df.head()

Unnamed: 0_level_0,title,url,regulatory_procedure_number,references,acronym,nct_total_count,nct_url,nct_id,nct_state,nct_first_posted,...,nct_status_code,nct_search_term,pubmed_id,pubmed_url,pubmed_title,pubmed_publication_date,pubmed_search_term,external_references,external_id_found,external_status_code
eu_pas_register_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2165,Post-Authorisation Safety Study of Esbriet® (P...,https://catalogues.ema.europa.eu/study/23388,,,PASSPORT,5.0,https://www.clinicaltrials.gov/study/NCT062742...,NCT06274294; NCT04716023; NCT03068468; NCT0397...,NOT_YET_RECRUITING; UNKNOWN; TERMINATED; COMPL...,2024-02-23; 2021-01-20; 2017-03-01; 2019-06-07...,...,200,acronym,,,,,,,,
2181,Study of Acute Liver Transplant: A study of NS...,https://catalogues.ema.europa.eu/study/40864,,https://doi.org/10.1002/pds.3204; https://doi....,SALT-I,0.0,,,,,...,200,,,,,,,https://doi.org/10.1002/pds.3204; https://doi....,<NA>; <NA>; <NA>; <NA>; <NA>,403; 403; 200; 200; 200
2196,Prospective controlled cohort study on the saf...,https://catalogues.ema.europa.eu/study/41500,,,PRO-E2,1.0,https://www.clinicaltrials.gov/study/NCT01650168,NCT01650168,COMPLETED,2012-07-26,...,200,title,,,,,,,,
2857,A Multicenter Cohort Study of the Short and Lo...,https://catalogues.ema.europa.eu/study/35221,,https://doi.org/10.1093/jac/dkw225; https://do...,MYCOS,1.0,https://www.clinicaltrials.gov/study/NCT01686607,NCT01686607,COMPLETED,2012-09-18,...,200,title,,,,,,https://doi.org/10.1093/jac/dkw225; https://do...,<NA>; EUPAS2857,200; 200
3142,A Safety and Pharmacokinetic study in Real-lif...,https://catalogues.ema.europa.eu/study/47210,,https://doi.org/10.1007/s40264-019-00821-6,,0.0,,,,,...,200,,31069703.0,https://pubmed.ncbi.nlm.nih.gov/31069703/,Bismuth Concentrations in Patients Treated in ...,2019-05-10,eu_pas_register_number,https://doi.org/10.1007/s40264-019-00821-6,EUPAS3142,200


## Experiments

Single NCT ID Query:

In [113]:
import json

nct_id = 'NCT03774914'
response = requests.get(f"https://clinicaltrials.gov/api/v2/studies/{nct_id}?format=json&markupFormat=markdown")

if response.status_code == 200:
    print(json.dumps(response.json(), indent=4))

{
    "protocolSection": {
        "identificationModule": {
            "nctId": "NCT03774914",
            "orgStudyIdInfo": {
                "id": "OBS13436"
            },
            "secondaryIdInfos": [
                {
                    "id": "EU PAS - cat 3"
                }
            ],
            "organization": {
                "fullName": "Sanofi",
                "class": "INDUSTRY"
            },
            "briefTitle": "LEMTRADA Pregnancy Registry in Multiple Sclerosis",
            "officialTitle": "International LEMTRADA Pregnancy Exposure Cohort in Multiple Sclerosis"
        },
        "statusModule": {
            "statusVerifiedDate": "2022-04-21",
            "overallStatus": "TERMINATED",
            "whyStopped": "The sponsor stopped the study due to low recruitment with no safety concerns",
            "expandedAccessInfo": {
                "hasExpandedAccess": false
            },
            "startDateStruct": {
                "date": "2015-09-0

Find all Pubmed entries with the term "eu pas" or "eupas" or "eupas" followed by extra text: 

In [None]:
pubmed = PubMed(tool="EmaRwdSearcher", email="pedram.ramezani@charite.de")

# NOTE: Old identifier? "ENCEPP/*/*"

search_term = 'eupas OR "eu pas" OR eupas*'
results = pubmed.query(search_term, max_results=1000)
article_list = []
article_info = []

for article in results:
    article_dict = article.toDict()
    article_list.append(article_dict)


for article in article_list:
    # Sometimes article['pubmed_id'] contains list separated with comma - take first pubmedId in that list - thats article pubmedId
    pubmedId = article['pubmed_id'].partition('\n')[0]
    
    article_info.append({
        'pubmed_id': pubmedId,
        'pubmed_url': f'https://pubmed.ncbi.nlm.nih.gov/{pubmedId}/',
        'title': article['title'],
        'publication_date': article['publication_date']
    })

all_ema_rwd_articles_df = pd.DataFrame.from_dict(article_info)

all_ema_rwd_articles_df

dict_keys(['pubmed_id', 'title', 'abstract', 'keywords', 'journal', 'publication_date', 'authors', 'methods', 'conclusions', 'results', 'copyrights', 'doi', 'xml'])


Unnamed: 0,pubmed_id,title,doi,publication_date
0,38888495,Long-term safety of hyaluronidase-facilitated ...,10.1080/1750743X.2024.2354091,2024-06-18


In [7]:
analysis_df = exported.assign(
    acronym=lambda x : x['title'].str.extract(r'\((?P<acronym>[^\(\)]+)\)\s*$'),
    has_result_tables = lambda x : x['result_tables_name'].notna(),
    has_result_documents = lambda x : x['result_document_name'].notna(),
    has_result=lambda x : x['has_result_tables'] | x['has_result_documents'],
    has_other=lambda x : x['other_documents_name'].notna(),
)[exported['risk_management_plan'].isin([
    'EU RMP category 1 (imposed as condition of marketing authorisation)',
    'EU RMP category 2 (specific obligation of marketing authorisation)'
])].filter(regex='title|references|has_result$|has_other|^url$')

# Filter all studies without result documents, but with references
analysis_df = analysis_df[~analysis_df['has_result'] & analysis_df['references'].notna()]

analysis_df.loc[:, 'references'] = analysis_df['references'].str.split('; ')

analysis_df = analysis_df.explode('references').assign(
    result_in_reference=pd.NA
)

analysis_df

Unnamed: 0_level_0,title,url,references,has_result,has_other,result_in_reference
eu_pas_register_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2181,Study of Acute Liver Transplant: A study of NS...,https://catalogues.ema.europa.eu/study/40864,https://doi.org/10.1002/pds.3204,False,False,
2181,Study of Acute Liver Transplant: A study of NS...,https://catalogues.ema.europa.eu/study/40864,https://doi.org/10.1002/pds.3371,False,False,
2181,Study of Acute Liver Transplant: A study of NS...,https://catalogues.ema.europa.eu/study/40864,https://doi.org/10.1007/s00228-012-1357-8,False,False,
2181,Study of Acute Liver Transplant: A study of NS...,https://catalogues.ema.europa.eu/study/40864,https://doi.org/10.1007/s40264-012-0013-7,False,False,
2181,Study of Acute Liver Transplant: A study of NS...,https://catalogues.ema.europa.eu/study/40864,https://doi.org/10.1007/s40264-013-0071-5,False,False,
8585,Safety and Incidence of Side Effects in a Coho...,https://catalogues.ema.europa.eu/study/41123,https://doi.org/10.19080/JGWH.2018.09.555762,False,False,
13276,An observational disease and clinical outcomes...,https://catalogues.ema.europa.eu/study/47904,https://doi.org/10.1093/jcag/gwab002.214,False,False,
13514,"AN OBSERVATIONAL, LONGITUDINAL, PROSPECTIVE, L...",https://catalogues.ema.europa.eu/study/47907,https://abstracts.eurospe.org/hrp/0094/hrp0094...,False,False,
13514,"AN OBSERVATIONAL, LONGITUDINAL, PROSPECTIVE, L...",https://catalogues.ema.europa.eu/study/47907,https://doi.org/10.1002/jbmr.4130,False,False,
13514,"AN OBSERVATIONAL, LONGITUDINAL, PROSPECTIVE, L...",https://catalogues.ema.europa.eu/study/47907,https://doi.org/10.1186/s12891-019-2420-8,False,False,


In [8]:
analysis_df.to_excel('rmp1&2_references_manual.xlsx', sheet_name='analysis')