# Objective:

The goal is to assess how many publications result from clinical studies linked to INSERM.

In particular, we want to identify the clinical studies that lead to no publication at all, and try to understand the reasons for this absence.

# Organisation:
- Using the *ClinicalTrials.gov* API, we retrieve the *IDs* (**NCTId**) of the clinical studies sponsored by *INSERM*, *ANRS*, etc.

- From these **NCTIds**, we retrieve from *ClinicalTrials.gov* (*CT*) the **PMIDs** of the publications linked to these studies.  
    These publications are of 2 kinds:
    1. They were uploaded to *CT* by the study authors: `BACKGROUND, RESULT`
    2. They were automatically retrieved from PubMed by *CT*: `DERIVED`
- From these same **NCTIds**, we retrieve from *PubMed* the **PMIDs** of the publications linked to these studies.  
    This generally finds slightly more publications than the automatic matching performed by *CT*.

- For each **NCTId**, we merge all the **PMIDs** returned by *CT* and *PubMed*.

- From this set of **PMIDs**, we retrieve the associated metadata: `title, authors, doi`...

# Extracting NCTIds from ClinicalTrials.gov:

## API v1:

To make it easier to retrieve data through the ClinicalTrials.gov API v1, we use the Python wrapper [pytrials](https://github.com/jvfe/pytrials).

Installing ***pytrials***:
- In Anaconda Navigator, select the same environment as the one running Jupyter
- Launch a PowerShell Prompt in that environment
- Type: `pip install pytrials`

In [1]:
from pytrials.client import ClinicalTrials
import urllib.parse

In [2]:
ct = ClinicalTrials()

### Building the query:

We build the query that will be sent to the ClinicalTrials.gov API.

#### Sponsors:

In [3]:
sponsors = [
    'anrs',
    'inserm',
    'institut national de la santé et de la recherche médicale',
    'french national agency for research on aids and viral hepatitis',
]

In [4]:
sponsors_expr = [f'AREA[LeadSponsorName]{sponsor}' for sponsor in sponsors]

# Add OR keyword
sponsors_expr = ' OR '.join(sponsors_expr)

# Add parenthesis for correct interpretation of OR expression
sponsors_expr = f'({sponsors_expr})'

sponsors_expr

'(AREA[LeadSponsorName]anrs OR AREA[LeadSponsorName]inserm OR AREA[LeadSponsorName]institut national de la santé et de la recherche médicale OR AREA[LeadSponsorName]french national agency for research on aids and viral hepatitis)'

#### Status:

In [5]:
status = 'completed'

In [6]:
status_expr = f'AREA[OverallStatus]{status}'
status_expr

'AREA[OverallStatus]completed'

#### Study completion date on or after 2013:

In [7]:
date_expr = 'AREA[CompletionDate]RANGE[01/01/2013,MAX]'
date_expr

'AREA[CompletionDate]RANGE[01/01/2013,MAX]'

#### Search Expression:

In [8]:
search_expr = ' AND '.join([sponsors_expr, status_expr, date_expr])
search_expr

'(AREA[LeadSponsorName]anrs OR AREA[LeadSponsorName]inserm OR AREA[LeadSponsorName]institut national de la santé et de la recherche médicale OR AREA[LeadSponsorName]french national agency for research on aids and viral hepatitis) AND AREA[OverallStatus]completed AND AREA[CompletionDate]RANGE[01/01/2013,MAX]'
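The steps above (one `AREA[...]` clause per sponsor joined with `OR`, then the status and date clauses joined with `AND`) can be gathered into a single function. This is only a sketch: `build_search_expr` and its parameter names are hypothetical and are not part of `pytrials`.

```python
def build_search_expr(sponsors, status, min_completion_date):
    """Build an API v1 search expression (hypothetical helper, not part of pytrials)."""
    # One AREA[...] clause per sponsor, OR-ed together
    sponsors_clause = ' OR '.join(f'AREA[LeadSponsorName]{s}' for s in sponsors)
    clauses = [
        f'({sponsors_clause})',  # parentheses keep the OR group together
        f'AREA[OverallStatus]{status}',
        f'AREA[CompletionDate]RANGE[{min_completion_date},MAX]',
    ]
    return ' AND '.join(clauses)


example_expr = build_search_expr(['anrs', 'inserm'], 'completed', '01/01/2013')
print(example_expr)
```

Keeping the expression in one function makes it easier to rerun the extraction with another sponsor list or cut-off date.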

#### URL encode: 

In [9]:
search_expr_url_encode = urllib.parse.quote_plus(search_expr)
search_expr_url_encode

'%28AREA%5BLeadSponsorName%5Danrs+OR+AREA%5BLeadSponsorName%5Dinserm+OR+AREA%5BLeadSponsorName%5Dinstitut+national+de+la+sant%C3%A9+et+de+la+recherche+m%C3%A9dicale+OR+AREA%5BLeadSponsorName%5Dfrench+national+agency+for+research+on+aids+and+viral+hepatitis%29+AND+AREA%5BOverallStatus%5Dcompleted+AND+AREA%5BCompletionDate%5DRANGE%5B01%2F01%2F2013%2CMAX%5D'

#### Fields:

The fields we want to retrieve:

In [10]:
fields = [
    'NCTId',
    'BriefTitle',
    'OverallStatus',
    'StudyType',
    'LeadSponsorName',
    'CollaboratorName',
    'OrgStudyId',
    'SecondaryId',
    'StudyFirstPostDate',
    'ReferencePMID',
    'ReferenceCitation',
    'ReferenceType',
]

### Sending the query:

In [11]:
study_fields = ct.get_study_fields(
    search_expr=search_expr_url_encode,
    fields=fields,
    max_studies=1000,
    fmt='csv',
)

In [12]:
print(f'NStudiesReturned: {len(study_fields[1:])}')

NStudiesReturned: 200


### Reading the query result into Pandas:

In [13]:
import pandas as pd

In [14]:
pd.DataFrame.from_records(study_fields[1:], index='Rank', columns=study_fields[0])

Rank,NCTId,BriefTitle,OverallStatus,StudyType,LeadSponsorName,CollaboratorName,OrgStudyId,SecondaryId,StudyFirstPostDate,ReferencePMID,ReferenceCitation,ReferenceType
1,NCT03671291,Missed Opportunities to Pre-exposure Prophylax...,Completed,Interventional,French National Agency for Research on AIDS an...,"University Hospital, Marseille|University Hosp...",ANRS 95041 Missed Opportunity,,"September 14, 2018",,,
2,NCT01463956,Efficacy of PegInterferon-Ribavirin-Boceprevir...,Completed,Interventional,French National Agency for Research on AIDS an...,Merck Sharp & Dohme LLC,2011- 001089 -17,,"November 2, 2011",,,
3,NCT01426243,The Yellow Fever Vaccine Immunity in HIV Infec...,Completed,Interventional,French National Agency for Research on AIDS an...,,2009-014921-17,,"August 31, 2011",30096071,"Colin de Verdiere N, Durier C, Samri A, Meiffr...",derived
4,NCT01269632,Cohort of Young Adults Infected With HIV Since...,Completed,Interventional,French National Agency for Research on AIDS an...,,2009-AO1219-48,,"January 4, 2011",,,
5,NCT01226446,Efficacy of Vitamin D on Top of Pegylated Inte...,Completed,Interventional,French National Agency for Research on AIDS an...,,2010-021967-34,,"October 22, 2010",25987791,"Terrier B, Lapidus N, Pol S, Serfaty L, Ratziu...",derived
...,...,...,...,...,...,...,...,...,...,...,...,...
196,NCT00265642,Evaluation of Irbesartan on Hepatic Fibrosis i...,Completed,Interventional,"ANRS, Emerging Infectious Diseases",Sanofi,2005-006027-37,ANRS HC 19 Fibrosar,"December 15, 2005",,,
197,NCT00116454,Trial for Hepatocellular Carcinoma Adjuvant Tr...,Completed,Interventional,"ANRS, Emerging Infectious Diseases",,2004-003883-31,ANRS HC06 LIPIOCIS,"June 30, 2005",,,
198,NCT05349162,Epicardial vs. Transvenous ICDs in Children,Completed,Observational,Paris Cardiovascular Research Center (Inserm U...,Hôpital Necker-Enfants Malades,134526,,"April 27, 2022",,,
199,NCT04842851,Cardiac Resynchronization Therapy in Congenita...,Completed,Observational,Paris Cardiovascular Research Center (Inserm U...,"Marie Lannelongue Hospital, Le Plessis Robinso...",2219047,,"April 13, 2021",,,


## API v2:

API v1 will no longer be supported [from mid-2024](https://clinicaltrials.gov/data-api/api):

>***Notice to API users:  
>The new ClinicalTrials.gov API, version 2.0 is available. Classic API users are strongly encouraged to switch to the modernized API. We will continue to support the classic API until mid-2024 and are planning blackouts for the spring to help with the transition to the modernized API.***

In addition, API v2 supports a new **"HasResults"** field, which is still rarely used for now but could be useful in the future.

On the other hand, CSV export is limited to a fixed set of fields, listed on this page: https://clinicaltrials.gov/data-api/about-api/csv-download

We therefore have to use the JSON export.

### Building the query:

Since `pytrials` is not compatible with v2, we send the request manually using [Requests](https://requests.readthedocs.io/en/latest/).

#### Format:

In [15]:
format = 'json'

#### Sponsors:

In [16]:
sponsors

['anrs',
 'inserm',
 'institut national de la santé et de la recherche médicale',
 'french national agency for research on aids and viral hepatitis']

In [17]:
sponsors_expr_v2 = ' OR '.join(sponsors)
sponsors_expr_v2 = urllib.parse.quote_plus(sponsors_expr_v2)
sponsors_expr_v2

'anrs+OR+inserm+OR+institut+national+de+la+sant%C3%A9+et+de+la+recherche+m%C3%A9dicale+OR+french+national+agency+for+research+on+aids+and+viral+hepatitis'

#### Overall_status:

In [18]:
overall_status = 'COMPLETED'

#### Fields:

In [19]:
fields_v2 = [
    'NCTId',
    'BriefTitle',
    # 'OfficialTitle',
    'OverallStatus',
    'StudyType',
    'LeadSponsorName',
    'CollaboratorName',
    # 'OrgStudyId',
    # 'SecondaryId',
    'StudyFirstPostDate',
    'StartDate',
    # 'PrimaryCompletionDate',
    'CompletionDate',
    'ReferencePMID',
    'ReferenceCitation',
    'ReferenceType',
    'hasResults',
]
fields_v2

['NCTId',
 'BriefTitle',
 'OverallStatus',
 'StudyType',
 'LeadSponsorName',
 'CollaboratorName',
 'StudyFirstPostDate',
 'StartDate',
 'CompletionDate',
 'ReferencePMID',
 'ReferenceCitation',
 'ReferenceType',
 'hasResults']

In [20]:
fields_expr_v2 = ','.join(fields_v2)
fields_expr_v2 = urllib.parse.quote_plus(fields_expr_v2)
fields_expr_v2

'NCTId%2CBriefTitle%2COverallStatus%2CStudyType%2CLeadSponsorName%2CCollaboratorName%2CStudyFirstPostDate%2CStartDate%2CCompletionDate%2CReferencePMID%2CReferenceCitation%2CReferenceType%2ChasResults'

#### Study completion date on or after 2013:

In [21]:
date = 'AREA[CompletionDate]RANGE[01/01/2013, MAX]'
date_expr = urllib.parse.quote_plus(date)
date_expr

'AREA%5BCompletionDate%5DRANGE%5B01%2F01%2F2013%2C+MAX%5D'

#### Maximum number of results:

In [22]:
count_total = 'true'

In [23]:
page_size = 1000

#### API URL:

In [24]:
query_url = f'https://clinicaltrials.gov/api/v2/studies?format={format}&query.lead={sponsors_expr_v2}&filter.overallStatus={overall_status}&fields={fields_expr_v2}&filter.advanced={date_expr}&countTotal={count_total}&pageSize={page_size}'
query_url

'https://clinicaltrials.gov/api/v2/studies?format=json&query.lead=anrs+OR+inserm+OR+institut+national+de+la+sant%C3%A9+et+de+la+recherche+m%C3%A9dicale+OR+french+national+agency+for+research+on+aids+and+viral+hepatitis&filter.overallStatus=COMPLETED&fields=NCTId%2CBriefTitle%2COverallStatus%2CStudyType%2CLeadSponsorName%2CCollaboratorName%2CStudyFirstPostDate%2CStartDate%2CCompletionDate%2CReferencePMID%2CReferenceCitation%2CReferenceType%2ChasResults&filter.advanced=AREA%5BCompletionDate%5DRANGE%5B01%2F01%2F2013%2C+MAX%5D&countTotal=true&pageSize=1000'
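As an alternative to encoding each parameter by hand, the standard library's `urllib.parse.urlencode` can build the whole query string from a dictionary. The sketch below uses a shortened, illustrative parameter set; `build_query_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://clinicaltrials.gov/api/v2/studies'


def build_query_url(params):
    """Encode every value exactly once and join the pairs with '&' (hypothetical helper)."""
    return f'{BASE_URL}?{urlencode(params)}'


# Shortened, illustrative parameter set (raw, un-encoded values)
example_params = {
    'format': 'json',
    'query.lead': 'anrs OR inserm',
    'filter.overallStatus': 'COMPLETED',
    'fields': ','.join(['NCTId', 'BriefTitle']),
    'filter.advanced': 'AREA[CompletionDate]RANGE[01/01/2013, MAX]',
    'countTotal': 'true',
    'pageSize': 1000,
}
example_url = build_query_url(example_params)

# Decoding the query string gives back the original values
decoded = parse_qs(urlsplit(example_url).query)
print(decoded['filter.advanced'])
```

Working from raw values avoids the risk of encoding a value twice. Requests can also do this encoding itself when the dictionary is passed through the `params` argument of `requests.get`.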

### Sending the query:

In [108]:
import requests

In [109]:
response = requests.get(query_url)
response.raise_for_status()
response

<Response [200]>

In [110]:
print(f'Studies returned: {response.json()["totalCount"]}')

Studies returned: 200
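Here all 200 studies fit within a single page of 1000. If a query ever matched more studies than `pageSize`, the results would be split across pages. To our understanding, API v2 then returns a `nextPageToken` in the response, to be sent back as the `pageToken` parameter. The sketch below shows that loop under this assumption; `get_page` is a hypothetical callable standing in for the HTTP call, so the example runs without network access.

```python
def fetch_all_studies(get_page):
    """Collect studies across pages.

    `get_page(page_token)` is a hypothetical callable returning the decoded JSON
    of one page; with Requests it could wrap `requests.get(...).json()`.
    """
    studies = []
    page_token = None
    while True:
        page = get_page(page_token)
        studies.extend(page.get('studies', []))
        page_token = page.get('nextPageToken')
        if not page_token:  # no token: this was the last page
            return studies


# Stand-in for the API: two made-up pages keyed by page token
fake_pages = {
    None: {'studies': [{'id': 'A'}, {'id': 'B'}], 'nextPageToken': 'tok1'},
    'tok1': {'studies': [{'id': 'C'}]},
}
all_studies = fetch_all_studies(lambda token: fake_pages[token])
print(len(all_studies))
```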


### Processing the returned JSON:

In [111]:
import json

In [112]:
# print(json.dumps(response.json(), indent=2))

***The JSON structure is far too nested to be normalized directly with Pandas, so we flatten it by hand:***

From the JSON, we build an equivalent but much "flatter" dictionary.

In [113]:
# Return None if the list of collaborators is empty; otherwise concatenate its values
# as "collaborator_0 | collaborator_1 | ..."
def concatenate_collaborator_list(collaborator_list):
    if not collaborator_list:
        return None
    return ' | '.join(collaborator_list)

In [114]:
studies_list = []
for study in response.json()['studies']:
    study_dict = {
        'NCTId': study['protocolSection']['identificationModule']['nctId'],
        'BriefTitle': study['protocolSection']['identificationModule']['briefTitle'],
        'LeadSponsorName': study['protocolSection']['sponsorCollaboratorsModule']['leadSponsor']['name'],
        'CollaboratorName': concatenate_collaborator_list(
            [
                c['name']
                for c in (
                    study['protocolSection']['sponsorCollaboratorsModule'].get('collaborators', [])  # can be missing
                )
            ]
        ),
        'OverallStatus': study['protocolSection']['statusModule']['overallStatus'],
        'StudyType': study['protocolSection']['designModule']['studyType'],
        'HasResults': study['hasResults'],
        'StudyFirstPostDate': study['protocolSection']['statusModule']['studyFirstPostDateStruct']['date'],
        'StartDate': study['protocolSection']['statusModule'].get('startDateStruct', {}).get('date', None),  # can be missing
        # 'PrimaryCompletionDate' : study["protocolSection"]["statusModule"].get('primaryCompletionDateStruct', {}).get('date', None), # can be missing
        'CompletionDate': study['protocolSection']['statusModule'].get('completionDateStruct', {}).get('date', None),  # can be missing
        'Reference': study['protocolSection'].get('referencesModule', {}).get('references', []),  # can be missing
    }
    studies_list.append(study_dict)

# print(json.dumps(studies_list, indent=2))
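The chained `.get(..., {})` calls used above for optional keys can be factored into a small path-based accessor. This is a sketch only: `get_nested` is a hypothetical helper, and the toy record below merely imitates the shape of one API v2 entry with made-up values.

```python
def get_nested(mapping, path, default=None):
    """Follow `path` (a sequence of keys) through nested dicts.

    Returns `default` as soon as a key is missing or a non-dict is reached.
    """
    current = mapping
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Toy record shaped like one entry of the response (values are made up)
toy_study = {
    'protocolSection': {
        'identificationModule': {'nctId': 'NCT00000000'},
        'statusModule': {},  # no startDateStruct in this record
    }
}
nct_id = get_nested(toy_study, ['protocolSection', 'identificationModule', 'nctId'])
start_date = get_nested(toy_study, ['protocolSection', 'statusModule', 'startDateStruct', 'date'])
print(nct_id, start_date)
```

With such a helper, mandatory and optional fields are read the same way, and a missing module yields the default instead of raising `KeyError`.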

We check that no NCTId was lost along the way:

In [115]:
print(f'Number of NCTIds: {len(studies_list)}')
assert response.json()['totalCount'] == len(studies_list)

Number of NCTIds: 200


### Import into Pandas

In [116]:
df_ct = pd.json_normalize(data=studies_list)
df_ct

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,Reference
0,NCT02081066,Identification of CETP as a Marker of Atherosc...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2014-03-07,2014-09-25,2020-09,[]
1,NCT05199831,Situational Analysis of HIV-related Disability...,"Programme PAC-CI, Site ANRS-MIE de Côte d'Ivoire","Institute of Research for Development, France ...",COMPLETED,OBSERVATIONAL,False,2022-01-20,2021-02-05,2022-08-10,[]
2,NCT01895920,Viral Biofilms: Hijacking T Cell Extracellular...,"ANRS, Emerging Infectious Diseases",,COMPLETED,INTERVENTIONAL,False,2013-07-11,2013-01,2018-02,[]
3,NCT02116374,Physiopathology Study of the Microbiota Biodiv...,"ANRS, Emerging Infectious Diseases",,COMPLETED,OBSERVATIONAL,False,2014-04-16,2014-12,2016-06,[]
4,NCT02107365,"Therapy With Asunaprevir, Daclatasvir, Ribavir...","ANRS, Emerging Infectious Diseases",Bristol-Myers Squibb,COMPLETED,INTERVENTIONAL,False,2014-04-08,2013-11,2015-04,[]
...,...,...,...,...,...,...,...,...,...,...,...
195,NCT03215732,Cross Sectional Survey on the Burden and Impac...,"ANRS, Emerging Infectious Diseases","Institute of Research for Development, France",COMPLETED,OBSERVATIONAL,False,2017-07-12,2017-10-19,2019-07-31,"[{'pmid': '31320358', 'type': 'DERIVED', 'cita..."
196,NCT02052271,Experimental Therapeutics in Essential Tremor ...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2014-02-03,2014-06-03,2018-03-27,[]
197,NCT00918307,Efficacy and Safety of Varenicline Among HIV-i...,"ANRS, Emerging Infectious Diseases",Pfizer,COMPLETED,INTERVENTIONAL,False,2009-06-11,2009-10,2014-07,"[{'pmid': '34611902', 'type': 'DERIVED', 'cita..."
198,NCT01698411,Study of the Influence of Sleep on Hemodynamic...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2012-10-03,2012-10,2015-02,[]


#### Exploding the "Reference" column:

For each NCTId, the `Reference` column may contain a list of several references.  
If, for example, there are 3 references, we want to end up with 3 rows, each carrying the same NCTId and a single reference.  

*Before*:  
**`NCTId    Reference`**  
`NCT0001, [Ref1, Ref2, Ref3]`   

*After*:  
**`NCTId    Reference`**  
`NCT0001,   Ref1`  
`NCT0001,   Ref2`  
`NCT0001,   Ref3`   
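As a small illustration of this reshaping on made-up data, the sketch below also shows what happens to a study with an empty reference list: it is kept as a single row whose reference is missing.

```python
import pandas as pd

# Made-up data: one study with three references, one with none
toy = pd.DataFrame(
    {
        'NCTId': ['NCT0001', 'NCT0002'],
        'Reference': [['Ref1', 'Ref2', 'Ref3'], []],
    }
)

# One row per (NCTId, reference); an empty list becomes a single missing value
toy_exploded = toy.explode('Reference', ignore_index=True)
print(toy_exploded)
```

This is why studies without any reference remain in the exploded table instead of disappearing.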

In [117]:
df_ct = df_ct.explode('Reference', ignore_index=True)
df_ct.loc[:, ['NCTId', 'Reference']]

Unnamed: 0,NCTId,Reference
0,NCT02081066,
1,NCT05199831,
2,NCT01895920,
3,NCT02116374,
4,NCT02107365,
...,...,...
477,NCT02052271,
478,NCT00918307,"{'pmid': '34611902', 'type': 'DERIVED', 'citat..."
479,NCT00918307,"{'pmid': '29329763', 'type': 'DERIVED', 'citat..."
480,NCT01698411,


For each row, the `Reference` column now contains a dictionary of the form:  
`{  
"pmid": "17545707",  
"type": "BACKGROUND",  
"citation": "...",  
}`

From which we want to extract new columns, using the dictionary keys:  
**`pmid     type        citation`**  
`17545707, BACKGROUND, "..."`

In [118]:
df_ct_references = pd.json_normalize(df_ct.pop('Reference'))
df_ct_references

Unnamed: 0,pmid,type,citation
0,,,
1,,,
2,,,
3,,,
4,,,
...,...,...,...
477,,,
478,34611902,DERIVED,"Hartmann-Boyce J, Theodoulou A, Farley A, Haje..."
479,29329763,DERIVED,"Mercie P, Arsandaux J, Katlama C, Ferret S, Be..."
480,,,


We reassemble the complete DataFrame:

In [119]:
df_ct = df_ct.join(df_ct_references)
df_ct

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,pmid,type,citation
0,NCT02081066,Identification of CETP as a Marker of Atherosc...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2014-03-07,2014-09-25,2020-09,,,
1,NCT05199831,Situational Analysis of HIV-related Disability...,"Programme PAC-CI, Site ANRS-MIE de Côte d'Ivoire","Institute of Research for Development, France ...",COMPLETED,OBSERVATIONAL,False,2022-01-20,2021-02-05,2022-08-10,,,
2,NCT01895920,Viral Biofilms: Hijacking T Cell Extracellular...,"ANRS, Emerging Infectious Diseases",,COMPLETED,INTERVENTIONAL,False,2013-07-11,2013-01,2018-02,,,
3,NCT02116374,Physiopathology Study of the Microbiota Biodiv...,"ANRS, Emerging Infectious Diseases",,COMPLETED,OBSERVATIONAL,False,2014-04-16,2014-12,2016-06,,,
4,NCT02107365,"Therapy With Asunaprevir, Daclatasvir, Ribavir...","ANRS, Emerging Infectious Diseases",Bristol-Myers Squibb,COMPLETED,INTERVENTIONAL,False,2014-04-08,2013-11,2015-04,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
477,NCT02052271,Experimental Therapeutics in Essential Tremor ...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2014-02-03,2014-06-03,2018-03-27,,,
478,NCT00918307,Efficacy and Safety of Varenicline Among HIV-i...,"ANRS, Emerging Infectious Diseases",Pfizer,COMPLETED,INTERVENTIONAL,False,2009-06-11,2009-10,2014-07,34611902,DERIVED,"Hartmann-Boyce J, Theodoulou A, Farley A, Haje..."
479,NCT00918307,Efficacy and Safety of Varenicline Among HIV-i...,"ANRS, Emerging Infectious Diseases",Pfizer,COMPLETED,INTERVENTIONAL,False,2009-06-11,2009-10,2014-07,29329763,DERIVED,"Mercie P, Arsandaux J, Katlama C, Ferret S, Be..."
480,NCT01698411,Study of the Influence of Sleep on Hemodynamic...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2012-10-03,2012-10,2015-02,,,


***Rebuilding the index:***

In [37]:
# df_studies_v2.set_index('NCTId', inplace = True)

***Setting the dtypes:***

In [120]:
df_ct = df_ct.convert_dtypes()
# df_ct = df_ct.astype({"OverallStatus" : 'category', "StudyType" : 'category', "type": 'category',})
df_ct = df_ct.astype({'OverallStatus': 'category', 'StudyType': 'category'})
df_ct.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 482 entries, 0 to 481
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   NCTId               482 non-null    string  
 1   BriefTitle          482 non-null    string  
 2   LeadSponsorName     482 non-null    string  
 3   CollaboratorName    217 non-null    string  
 4   OverallStatus       482 non-null    category
 5   StudyType           482 non-null    category
 6   HasResults          482 non-null    boolean 
 7   StudyFirstPostDate  482 non-null    string  
 8   StartDate           482 non-null    string  
 9   CompletionDate      482 non-null    string  
 10  pmid                364 non-null    string  
 11  type                364 non-null    string  
 12  citation            364 non-null    string  
dtypes: boolean(1), category(2), string(10)
memory usage: 39.9 KB


In [121]:
df_ct

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,pmid,type,citation
0,NCT02081066,Identification of CETP as a Marker of Atherosc...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2014-03-07,2014-09-25,2020-09,,,
1,NCT05199831,Situational Analysis of HIV-related Disability...,"Programme PAC-CI, Site ANRS-MIE de Côte d'Ivoire","Institute of Research for Development, France ...",COMPLETED,OBSERVATIONAL,False,2022-01-20,2021-02-05,2022-08-10,,,
2,NCT01895920,Viral Biofilms: Hijacking T Cell Extracellular...,"ANRS, Emerging Infectious Diseases",,COMPLETED,INTERVENTIONAL,False,2013-07-11,2013-01,2018-02,,,
3,NCT02116374,Physiopathology Study of the Microbiota Biodiv...,"ANRS, Emerging Infectious Diseases",,COMPLETED,OBSERVATIONAL,False,2014-04-16,2014-12,2016-06,,,
4,NCT02107365,"Therapy With Asunaprevir, Daclatasvir, Ribavir...","ANRS, Emerging Infectious Diseases",Bristol-Myers Squibb,COMPLETED,INTERVENTIONAL,False,2014-04-08,2013-11,2015-04,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
477,NCT02052271,Experimental Therapeutics in Essential Tremor ...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2014-02-03,2014-06-03,2018-03-27,,,
478,NCT00918307,Efficacy and Safety of Varenicline Among HIV-i...,"ANRS, Emerging Infectious Diseases",Pfizer,COMPLETED,INTERVENTIONAL,False,2009-06-11,2009-10,2014-07,34611902,DERIVED,"Hartmann-Boyce J, Theodoulou A, Farley A, Haje..."
479,NCT00918307,Efficacy and Safety of Varenicline Among HIV-i...,"ANRS, Emerging Infectious Diseases",Pfizer,COMPLETED,INTERVENTIONAL,False,2009-06-11,2009-10,2014-07,29329763,DERIVED,"Mercie P, Arsandaux J, Katlama C, Ferret S, Be..."
480,NCT01698411,Study of the Influence of Sleep on Hemodynamic...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2012-10-03,2012-10,2015-02,,,


#### Export to CSV:

In [40]:
# df_ct.to_csv('Data/outputs/extract_CT_api_v2.csv', sep=";", encoding='utf-8-sig')
df_ct.to_csv(
    'Data/outputs/extract_CT_api_v2.csv',
    sep=';',
    index=False,
    encoding='utf-8-sig',
)

# PubMed

### Using a key for the PubMed API:

Using a key to access the PubMed API is recommended: it allows up to 10 requests per second.  
Without a key, the limit is 3 requests per second.  

> E-utils users are allowed 3 requests/second without an API key. Create an API key to increase your e-utils limit to 10 requests/second.

**In practice, the PubMed API responds much more slowly (~1 request per second), so the key does not seem to make much difference.**
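To make the two documented limits concrete, here is a minimal client-side throttle that spaces calls by at least `1 / limit` seconds. It is only a sketch: the `Throttle` class and `min_interval` function are hypothetical names, and it is independent of whatever rate limiting the client library may already apply internally.

```python
import time


def min_interval(has_api_key):
    """Smallest delay between two requests that respects the quoted E-utilities limits."""
    requests_per_second = 10 if has_api_key else 3
    return 1.0 / requests_per_second


class Throttle:
    """Hypothetical helper: sleeps just enough to keep calls `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


throttle = Throttle(min_interval(has_api_key=True))
for _ in range(3):
    throttle.wait()  # a real request would be sent here
```

When each request already takes around one second, as observed above, `wait()` never has to sleep, which is consistent with the key making little practical difference.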

To get your key, go to the following page while logged in:
https://account.ncbi.nlm.nih.gov/settings/

Once you have the key, add it to your environment variables with the following command:

**Windows:** 

`setx NCBI_API_KEY "123456"`

**Linux/macOS:**

`export NCBI_API_KEY=123456`

In [41]:
import os

assert os.getenv('NCBI_API_KEY', None) is not None

### Installing metapub:
- In Anaconda Navigator, select the same environment as the one running Jupyter  
- Launch a PowerShell Prompt in that environment  
- Type the following command:  `pip install metapub`

In [42]:
from metapub import PubMedFetcher
from metapub.exceptions import InvalidPMID

fetch = PubMedFetcher(cachedir='./.cache/')

### Retrieving PMIDs via PubMed:

For each NCTId from CT, we retrieve the PMIDs of the associated publications via PubMed:

In [43]:
# Unique list of the NCTIds extracted from ClinicalTrials.gov
nctid_array = df_ct.loc[:, 'NCTId'].unique()

pmids_pubmed_dict = {}
for i, nctid in enumerate(nctid_array):
    # Display the progress on a single line
    print(f'\r{i+1}/{len(nctid_array)}...', end='', flush=True)

    pmids = [pmid for pmid in fetch.pmids_for_query(nctid)]
    pmids_pubmed_dict[nctid] = set(pmids)

200/200...

In [44]:
pmids_pubmed_dict

{'NCT01490489': set(),
 'NCT02542891': {'27488181', '34874888'},
 'NCT04780191': set(),
 'NCT03546127': set(),
 'NCT02078076': set(),
 'NCT02014727': {'28947345'},
 'NCT02486731': set(),
 'NCT05311865': {'36438274', '37795682'},
 'NCT04008927': {'35774932'},
 'NCT01895920': set(),
 'NCT02650427': set(),
 'NCT02273765': {'33667406'},
 'NCT02497274': set(),
 'NCT02100774': {'24808487', '26063065', '27798339'},
 'NCT02592174': {'31755936'},
 'NCT02099474': {'32661003'},
 'NCT04808986': set(),
 'NCT04470648': set(),
 'NCT01207986': set(),
 'NCT03078439': {'38408861'},
 'NCT02573948': set(),
 'NCT01693848': set(),
 'NCT00136630': {'22491195', '24780882'},
 'NCT03459157': {'34048794'},
 'NCT00670839': {'26257021'},
 'NCT04315948': {'32304640',
  '32958495',
  '33264556',
  '34048876',
  '34350582',
  '34473343',
  '34534511',
  '35233617',
  '35512728',
  '36695483',
  '37728045',
  '38552208'},
 'NCT02453048': {'31945015', '32687804'},
 'NCT03446430': set(),
 'NCT02987530': set(),
 'NCT0162

**We want to merge the list of PMIDs just retrieved from PubMed with the list of PMIDs already retrieved via CT.**

We put the CT PMIDs into the same form:

In [45]:
pmids_ct_dict = {}
for nctid in nctid_array:
    pmids = df_ct[df_ct.loc[:, 'NCTId'] == nctid].loc[:, 'pmid'].dropna()
    pmids_ct_dict[nctid] = set(pmids)
pmids_ct_dict

{'NCT01490489': {'16384869', '17531315', '19602057'},
 'NCT02542891': {'27488181', '34874888'},
 'NCT04780191': {'32232101'},
 'NCT03546127': set(),
 'NCT02078076': set(),
 'NCT02014727': {'28947345'},
 'NCT02486731': set(),
 'NCT05311865': set(),
 'NCT04008927': {'22310560',
  '26806260',
  '27168667',
  '27178119',
  '27537841',
  '28004616',
  '29432973',
  '35774932'},
 'NCT01895920': set(),
 'NCT02650427': set(),
 'NCT02273765': {'33667406'},
 'NCT02497274': set(),
 'NCT02100774': {'21957063', '24808487', '26063065', '27798339'},
 'NCT02592174': {'31755936'},
 'NCT02099474': {'32661003'},
 'NCT04808986': set(),
 'NCT04470648': {'31978945',
  '31995857',
  '32046819',
  '32070465',
  '32109013',
  '32224310',
  '32338732',
  '32371096'},
 'NCT01207986': set(),
 'NCT03078439': set(),
 'NCT02573948': {'25540950',
  '26032121',
  '26050614',
  '26075647',
  '26080690',
  '27006257',
  '27178119',
  '28612212',
  '28800503'},
 'NCT01693848': set(),
 'NCT00136630': {'15313841',
  '15365

For a given NCTId, we take the union of the two sets of PMIDs:

In [46]:
pmids_complete_dict = {}
for nctid in nctid_array:
    # The set of PMIDs present in PubMed or in CT (union)
    pmids_complete_dict[nctid] = pmids_pubmed_dict[nctid] | pmids_ct_dict[nctid]
pmids_complete_dict

{'NCT01490489': {'16384869', '17531315', '19602057'},
 'NCT02542891': {'27488181', '34874888'},
 'NCT04780191': {'32232101'},
 'NCT03546127': set(),
 'NCT02078076': set(),
 'NCT02014727': {'28947345'},
 'NCT02486731': set(),
 'NCT05311865': {'36438274', '37795682'},
 'NCT04008927': {'22310560',
  '26806260',
  '27168667',
  '27178119',
  '27537841',
  '28004616',
  '29432973',
  '35774932'},
 'NCT01895920': set(),
 'NCT02650427': set(),
 'NCT02273765': {'33667406'},
 'NCT02497274': set(),
 'NCT02100774': {'21957063', '24808487', '26063065', '27798339'},
 'NCT02592174': {'31755936'},
 'NCT02099474': {'32661003'},
 'NCT04808986': set(),
 'NCT04470648': {'31978945',
  '31995857',
  '32046819',
  '32070465',
  '32109013',
  '32224310',
  '32338732',
  '32371096'},
 'NCT01207986': set(),
 'NCT03078439': {'38408861'},
 'NCT02573948': {'25540950',
  '26032121',
  '26050614',
  '26075647',
  '26080690',
  '27006257',
  '27178119',
  '28612212',
  '28800503'},
 'NCT01693848': set(),
 'NCT001366

#### Checks:

In [47]:
num_pmids_ct = sum((len(v) for v in pmids_ct_dict.values()))
print(f'Total number of publications from CT: {num_pmids_ct}')

Total number of publications from CT: 364


In [48]:
num_pmids_complete = sum((len(v) for v in pmids_complete_dict.values()))
print(f'Total number of publications after querying PubMed: {num_pmids_complete}')

Total number of publications after querying PubMed: 393


In [49]:
pmids_pubmed_only_dict = {}
for nctid in nctid_array:
    # The set of PMIDs present in PubMed only
    pmids_pubmed_only_dict[nctid] = pmids_pubmed_dict[nctid] - pmids_ct_dict[nctid]

In [50]:
num_pmids_pubmed_only = sum((len(v) for v in pmids_pubmed_only_dict.values()))
print(f'Number of new PMIDs found via PubMed: {num_pmids_pubmed_only}')

Number of new PMIDs found via PubMed: 29


In [51]:
assert num_pmids_complete - num_pmids_ct == num_pmids_pubmed_only
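This assertion relies on a set identity that holds for each NCTId: the size of the union minus the size of the CT set equals the size of the "PubMed only" difference. A toy check with made-up PMIDs:

```python
ct_pmids = {'111', '222'}               # made-up PMIDs listed on CT
pubmed_pmids = {'222', '333', '444'}    # made-up PMIDs found on PubMed

merged = pubmed_pmids | ct_pmids        # union
pubmed_only = pubmed_pmids - ct_pmids   # set difference

print(len(merged) - len(ct_pmids), len(pubmed_only))
```

Note that the totals are summed per NCTId, so a PMID attached to two studies (such as `27178119`, listed under both `NCT04008927` and `NCT02573948`) is counted once for each study. The totals therefore count (NCTId, PMID) pairs, not unique publications.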

In [52]:
print('NCTIds with new PMIDs found via PubMed:')
{k: v for k, v in pmids_pubmed_only_dict.items() if v != set()}

NCTIds with new PMIDs found via PubMed:


{'NCT05311865': {'36438274', '37795682'},
 'NCT03078439': {'38408861'},
 'NCT04315948': {'36695483', '38552208'},
 'NCT00640263': {'34425825'},
 'NCT02777229': {'37851566', '38156046'},
 'NCT01703962': {'37668523'},
 'NCT05349162': {'36735263'},
 'NCT01453192': {'30688008'},
 'NCT01473472': {'36601747'},
 'NCT01426243': {'26314624'},
 'NCT02212379': {'31269208'},
 'NCT03870438': {'38484756'},
 'NCT03335995': {'37497675'},
 'NCT02405013': {'36686592'},
 'NCT01801618': {'29662875'},
 'NCT02057796': {'36883573'},
 'NCT03005652': {'38100477'},
 'NCT02481453': {'38273639'},
 'NCT01688453': {'35272723'},
 'NCT04409405': {'38043556'},
 'NCT04288128': {'38419144', '38421662'},
 'NCT02833961': {'36318030'},
 'NCT04392388': {'34293141'},
 'NCT03215732': {'37143029'},
 'NCT01089387': {'26439886'}}

In [53]:
num_nctid_empty_ct = sum((1 for v in pmids_ct_dict.values() if v == set()))
print(f'Number of studies with no PMIDs from CT: {num_nctid_empty_ct}')

Number of studies with no PMIDs from CT: 118


In [54]:
num_nctid_empty_pubmed = sum((1 for v in pmids_complete_dict.values() if v == set()))
print(f'Number of studies with no PMIDs after querying PubMed: {num_nctid_empty_pubmed}')

Number of studies with no PMIDs after querying PubMed: 106


In [55]:
print('NCTIds that had no PMIDs in CT but were enriched via PubMed:')
nctids_previously_empty = {k for k, v in pmids_ct_dict.items() if v == set()} - {k for k, v in pmids_complete_dict.items() if v == set()}
{k: pmids_complete_dict[k] for k in nctids_previously_empty}

NCTIds that had no PMIDs in CT but were enriched via PubMed:


{'NCT03078439': {'38408861'},
 'NCT04392388': {'34293141'},
 'NCT01801618': {'29662875'},
 'NCT01703962': {'37668523'},
 'NCT04409405': {'38043556'},
 'NCT05311865': {'36438274', '37795682'},
 'NCT01453192': {'30688008'},
 'NCT02212379': {'31269208'},
 'NCT02833961': {'36318030'},
 'NCT02405013': {'36686592'},
 'NCT04288128': {'38419144', '38421662'},
 'NCT05349162': {'36735263'}}

In [56]:
len(nctids_previously_empty)

12

In [57]:
assert num_nctid_empty_ct - num_nctid_empty_pubmed == len(nctids_previously_empty)

### Enriching the PMIDs via the PubMed API

Each retrieved PMID is enriched with PubMed data such as the title, the authors, ...:

In [58]:
counter = 0  # To keep track of progress
book_counter = 0  # Records whose type is not 'article' are counted and skipped
book = None  # Last (NCTId, PMID) pair skipped as a non-article, if any
total_publications_list = []

# For each NCTId...
for nctid, pmids in pmids_complete_dict.items():
    pmids_list = []

    # We process each PMID...
    for pmid in pmids:

        # Display the progress on a single line
        print(f'\r{counter+1} / {num_pmids_complete}...', end='', flush=True)

        try:
            # Fetch article details from Pubmed
            article = fetch.article_by_pmid(pmid)

            # We are not interested in records of type 'book'
            # TODO: book special case ?
            if article.pubmed_type == 'article':
                pmids_list.append(
                    {
                        'pmid': pmid,
                        'title': article.title,
                        'authors': article.authors_str.strip(),
                        'doi': article.doi,
                        'year': article.year,
                        'publication_types': list(article.publication_types.values()),
                        'citation': article.citation,
                    }
                )
            else:
                book_counter += 1
                book = (nctid, pmid)

            counter += 1
        except InvalidPMID as e:
            print(f'\n{e}')

    publication_dict = {'NCTId': nctid, 'publications': pmids_list}

    total_publications_list.append(publication_dict)

393 / 393...

In [59]:
print(f"Number of publications with a type other than 'article': {book_counter}")
print(book)

Number of publications with a type other than 'article': 1
('NCT03537196', '27227200')
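
The loop above only remembers the last skipped publication in `book`. A possible variant, sketched here under assumptions, keeps every skipped pair in a list. To stay runnable without network access it uses stand-in objects built with `SimpleNamespace` instead of real PubMed records; `build_record`, `fake_articles` and all the values are hypothetical, and only the attribute names are taken from the loop above:

```python
from types import SimpleNamespace


def build_record(pmid, article):
    """Return a flat dict for an 'article', or None for any other type."""
    if article.pubmed_type != 'article':
        return None
    return {
        'pmid': pmid,
        'title': article.title,
        'authors': article.authors_str.strip(),
        'doi': article.doi,
        'year': article.year,
    }


# Stand-ins for the objects returned by the PubMed fetcher (made-up values)
fake_articles = {
    '111': SimpleNamespace(pubmed_type='article', title='Title A',
                           authors_str=' Doe J; Roe R ', doi='10.0000/a', year='2020'),
    '222': SimpleNamespace(pubmed_type='book', title='Title B',
                           authors_str='Poe E', doi=None, year='2019'),
}

records, skipped = [], []
for pmid, article in fake_articles.items():
    record = build_record(pmid, article)
    if record is None:
        skipped.append(('NCT0001', pmid))  # remember every skipped pair
    else:
        records.append(record)

print(len(records), skipped)
```

With a list, `len(skipped)` plays the role of `book_counter` and nothing is lost when several non-article publications are encountered.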


In [60]:
# print(json.dumps(total_publications_list, indent=2))

In [61]:
# The number of NCTId didn't change
assert len(total_publications_list) == len(studies_list)

### Import into pandas

In [122]:
df_pubmed = pd.DataFrame.from_records(total_publications_list)
df_pubmed

Unnamed: 0,NCTId,publications
0,NCT01490489,"[{'pmid': '19602057', 'title': 'Role of EG-VEG..."
1,NCT02542891,"[{'pmid': '27488181', 'title': 'European COMPA..."
2,NCT04780191,"[{'pmid': '32232101', 'title': 'Non-invasive t..."
3,NCT03546127,[]
4,NCT02078076,[]
...,...,...
195,NCT02346409,[]
196,NCT02976298,[]
197,NCT02101398,[]
198,NCT01089387,"[{'pmid': '26439886', 'title': 'Safety of Intr..."


#### We "explode" the "publications" column:

For each NCTId, the 'publications' column may contain a list of several publications.  
If, for example, a study has 3 publications, we want to end up with 3 rows, each carrying the same NCTId and a single publication.  

*Before*:  
**`NCTId    publications`**  
`NCT0001, [Pub1, Pub2, Pub3]`   

*After*:  
**`NCTId    publications`**  
`NCT0001,   Pub1`  
`NCT0001,   Pub2`  
`NCT0001,   Pub3`   
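
The transformation described above can be tried on a small toy DataFrame. This is only an illustrative sketch with made-up values; it also shows that a study with an empty list keeps one row, with a missing value in 'publications':

```python
import pandas as pd

# Toy data: one study with three publications, one study with none
df = pd.DataFrame({
    'NCTId': ['NCT0001', 'NCT0002'],
    'publications': [['Pub1', 'Pub2', 'Pub3'], []],
})

# One row per publication; ignore_index=True renumbers the rows from 0
exploded = df.explode('publications', ignore_index=True)
print(exploded)
```

Because empty lists are kept as a row with a missing value, the studies without any publication do not disappear from the table.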

In [123]:
df_pubmed = df_pubmed.explode('publications', ignore_index=True)
df_pubmed

Unnamed: 0,NCTId,publications
0,NCT01490489,"{'pmid': '19602057', 'title': 'Role of EG-VEGF..."
1,NCT01490489,"{'pmid': '16384869', 'title': 'Expression and ..."
2,NCT01490489,"{'pmid': '17531315', 'title': 'Placental expre..."
3,NCT02542891,"{'pmid': '27488181', 'title': 'European COMPAR..."
4,NCT02542891,"{'pmid': '34874888', 'title': 'Examining the T..."
...,...,...
493,NCT02101398,
494,NCT01089387,"{'pmid': '26439886', 'title': 'Safety of Intra..."
495,NCT01089387,"{'pmid': '28753830', 'title': 'Intracavernous ..."
496,NCT01089387,"{'pmid': '25974235', 'title': 'Mesenchymal ste..."


We check that PubMed and CT combined yield at least as many publication rows as CT alone:

In [124]:
assert len(df_pubmed) >= len(df_ct)

For each NCTId, the 'publications' column now contains a dictionary of the form:  
`{  
"pmid": "17545707",  
"title": "Haematological ...",  
"authors": "Smith DJ; ...",  
}`

We want to turn the keys of this dictionary into new columns:  
**`pmid      title                 authors`**  
`17545707, "Haematological ...", "Smith DJ; ..."`
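
As a small illustration of this step, the sketch below normalizes a toy series of dictionaries into columns. The values are made up, and replacing missing entries with empty dictionaries first is a defensive choice of this sketch, not something the notebook does:

```python
import pandas as pd

publications = pd.Series([
    {'pmid': '111', 'title': 'Title A', 'authors': 'Doe J; Roe R'},
    None,  # study without any publication
    {'pmid': '222', 'title': 'Title B', 'authors': 'Poe E'},
])

# Replace missing entries with empty dicts so that every row is a mapping
cleaned = [p if isinstance(p, dict) else {} for p in publications]

# Each dictionary key becomes a column; empty dicts give a row of missing values
columns = pd.json_normalize(cleaned)
print(columns)
```

The result has one row per input entry, in the same order, which is what makes it possible to join it back to the NCTId column afterwards.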

In [125]:
df_pubmed_publications = pd.json_normalize(df_pubmed.pop('publications'))
df_pubmed_publications

Unnamed: 0,pmid,title,authors,doi,year,publication_types,citation
0,19602057,Role of EG-VEGF in human placentation: Physiol...,Hoffmann P; Saoudi Y; Benharouga M; Graham CH;...,10.1111/j.1582-4934.2008.00554.x,2009,"[Journal Article, Research Support, Non-U.S. G...","Hoffmann P, et al. Role of EG-VEGF in human pl..."
1,16384869,Expression and oxygen regulation of endocrine ...,Hoffmann P; Feige JJ; Alfaidy N,10.1210/en.2005-0912,2006,"[Journal Article, Research Support, Non-U.S. G...","Hoffmann P, et al. Expression and oxygen regul..."
2,17531315,Placental expression of EG-VEGF and its recept...,Hoffmann P; Feige JJ; Alfaidy N,10.1016/j.placenta.2007.03.008,2007,"[Journal Article, Research Support, Non-U.S. G...","Hoffmann P, et al. Placental expression of EG-..."
3,27488181,European COMPARative Effectiveness research on...,Kleiboer A; Smit J; Bosmans J; Ruwaard J; Ande...,10.1186/s13063-016-1511-1,2016,"[Equivalence Trial, Journal Article, Multicent...","Kleiboer A, et al. European COMPARative Effect..."
4,34874888,Examining the Theoretical Framework of Behavio...,van Genugten CR; Schuurmans J; Hoogendoorn AW;...,10.2196/32007,2021,[Journal Article],"van Genugten CR, et al. Examining the Theoreti..."
...,...,...,...,...,...,...,...
493,,,,,,,
494,26439886,Safety of Intracavernous Bone Marrow-Mononucle...,Yiou R; Hamidou L; Birebent B; Bitari D; Lecor...,10.1016/j.eururo.2015.09.026,2016,"[Journal Article, Research Support, Non-U.S. G...","Yiou R, et al. Safety of Intracavernous Bone M..."
495,28753830,Intracavernous Injections of Bone Marrow Monon...,Yiou R; Hamidou L; Birebent B; Bitari D; Le Co...,10.1016/j.euf.2017.06.009,2017,"[Clinical Trial, Phase I, Clinical Trial, Phas...","Yiou R, et al. Intracavernous Injections of Bo..."
496,25974235,Mesenchymal stem cell therapy for the treatmen...,Khera M; Albersen M; Mulhall JP,10.1111/jsm.12871,2015,[Journal Article],"Khera M, et al. Mesenchymal stem cell therapy ..."


We reassemble the complete DataFrame:

In [126]:
df_pubmed = df_pubmed.join(df_pubmed_publications)
df_pubmed

Unnamed: 0,NCTId,pmid,title,authors,doi,year,publication_types,citation
0,NCT01490489,19602057,Role of EG-VEGF in human placentation: Physiol...,Hoffmann P; Saoudi Y; Benharouga M; Graham CH;...,10.1111/j.1582-4934.2008.00554.x,2009,"[Journal Article, Research Support, Non-U.S. G...","Hoffmann P, et al. Role of EG-VEGF in human pl..."
1,NCT01490489,16384869,Expression and oxygen regulation of endocrine ...,Hoffmann P; Feige JJ; Alfaidy N,10.1210/en.2005-0912,2006,"[Journal Article, Research Support, Non-U.S. G...","Hoffmann P, et al. Expression and oxygen regul..."
2,NCT01490489,17531315,Placental expression of EG-VEGF and its recept...,Hoffmann P; Feige JJ; Alfaidy N,10.1016/j.placenta.2007.03.008,2007,"[Journal Article, Research Support, Non-U.S. G...","Hoffmann P, et al. Placental expression of EG-..."
3,NCT02542891,27488181,European COMPARative Effectiveness research on...,Kleiboer A; Smit J; Bosmans J; Ruwaard J; Ande...,10.1186/s13063-016-1511-1,2016,"[Equivalence Trial, Journal Article, Multicent...","Kleiboer A, et al. European COMPARative Effect..."
4,NCT02542891,34874888,Examining the Theoretical Framework of Behavio...,van Genugten CR; Schuurmans J; Hoogendoorn AW;...,10.2196/32007,2021,[Journal Article],"van Genugten CR, et al. Examining the Theoreti..."
...,...,...,...,...,...,...,...,...
493,NCT02101398,,,,,,,
494,NCT01089387,26439886,Safety of Intracavernous Bone Marrow-Mononucle...,Yiou R; Hamidou L; Birebent B; Bitari D; Lecor...,10.1016/j.eururo.2015.09.026,2016,"[Journal Article, Research Support, Non-U.S. G...","Yiou R, et al. Safety of Intracavernous Bone M..."
495,NCT01089387,28753830,Intracavernous Injections of Bone Marrow Monon...,Yiou R; Hamidou L; Birebent B; Bitari D; Le Co...,10.1016/j.euf.2017.06.009,2017,"[Clinical Trial, Phase I, Clinical Trial, Phas...","Yiou R, et al. Intracavernous Injections of Bo..."
496,NCT01089387,25974235,Mesenchymal stem cell therapy for the treatmen...,Khera M; Albersen M; Mulhall JP,10.1111/jsm.12871,2015,[Journal Article],"Khera M, et al. Mesenchymal stem cell therapy ..."


### Joining the CT and PubMed DataFrames:

In [127]:
df_final = df_ct.merge(df_pubmed, on=['NCTId', 'pmid'], how='right')
df_final

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,pmid,type,citation_x,title,authors,doi,year,publication_types,citation_y
0,NCT01490489,EG-VEGF : Potential Marker of Pre-eclampsia an...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2011-12-13,2011-07-11,2015-03-13,19602057,RESULT,"Hoffmann P, Saoudi Y, Benharouga M, Graham CH,...",Role of EG-VEGF in human placentation: Physiol...,Hoffmann P; Saoudi Y; Benharouga M; Graham CH;...,10.1111/j.1582-4934.2008.00554.x,2009,"[Journal Article, Research Support, Non-U.S. G...","Hoffmann P, et al. Role of EG-VEGF in human pl..."
1,NCT01490489,EG-VEGF : Potential Marker of Pre-eclampsia an...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2011-12-13,2011-07-11,2015-03-13,16384869,RESULT,"Hoffmann P, Feige JJ, Alfaidy N. Expression an...",Expression and oxygen regulation of endocrine ...,Hoffmann P; Feige JJ; Alfaidy N,10.1210/en.2005-0912,2006,"[Journal Article, Research Support, Non-U.S. G...","Hoffmann P, et al. Expression and oxygen regul..."
2,NCT01490489,EG-VEGF : Potential Marker of Pre-eclampsia an...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2011-12-13,2011-07-11,2015-03-13,17531315,RESULT,"Hoffmann P, Feige JJ, Alfaidy N. Placental exp...",Placental expression of EG-VEGF and its recept...,Hoffmann P; Feige JJ; Alfaidy N,10.1016/j.placenta.2007.03.008,2007,"[Journal Article, Research Support, Non-U.S. G...","Hoffmann P, et al. Placental expression of EG-..."
3,NCT02542891,European Comparative Effectiveness Research on...,Institut National de la Santé Et de la Recherc...,European Commission | Fondation FondaMental | ...,COMPLETED,INTERVENTIONAL,False,2015-09-07,2015-09,2018-09-15,27488181,DERIVED,"Kleiboer A, Smit J, Bosmans J, Ruwaard J, Ande...",European COMPARative Effectiveness research on...,Kleiboer A; Smit J; Bosmans J; Ruwaard J; Ande...,10.1186/s13063-016-1511-1,2016,"[Equivalence Trial, Journal Article, Multicent...","Kleiboer A, et al. European COMPARative Effect..."
4,NCT02542891,European Comparative Effectiveness Research on...,Institut National de la Santé Et de la Recherc...,European Commission | Fondation FondaMental | ...,COMPLETED,INTERVENTIONAL,False,2015-09-07,2015-09,2018-09-15,34874888,DERIVED,"van Genugten CR, Schuurmans J, Hoogendoorn AW,...",Examining the Theoretical Framework of Behavio...,van Genugten CR; Schuurmans J; Hoogendoorn AW;...,10.2196/32007,2021,[Journal Article],"van Genugten CR, et al. Examining the Theoreti..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
493,NCT02101398,Study of the Effect of Transcranial Stimulatio...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2014-04-02,2014-10-02,2016-07,,,,,,,,,
494,NCT01089387,,,,,,,,,,26439886,,,Safety of Intracavernous Bone Marrow-Mononucle...,Yiou R; Hamidou L; Birebent B; Bitari D; Lecor...,10.1016/j.eururo.2015.09.026,2016,"[Journal Article, Research Support, Non-U.S. G...","Yiou R, et al. Safety of Intracavernous Bone M..."
495,NCT01089387,Intracavernous Bone Marrow Stem-cell Injection...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2010-03-18,2010-05,2015-03,28753830,DERIVED,"Yiou R, Hamidou L, Birebent B, Bitari D, Le Co...",Intracavernous Injections of Bone Marrow Monon...,Yiou R; Hamidou L; Birebent B; Bitari D; Le Co...,10.1016/j.euf.2017.06.009,2017,"[Clinical Trial, Phase I, Clinical Trial, Phas...","Yiou R, et al. Intracavernous Injections of Bo..."
496,NCT01089387,Intracavernous Bone Marrow Stem-cell Injection...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2010-03-18,2010-05,2015-03,25974235,DERIVED,"Khera M, Albersen M, Mulhall JP. Mesenchymal s...",Mesenchymal stem cell therapy for the treatmen...,Khera M; Albersen M; Mulhall JP,10.1111/jsm.12871,2015,[Journal Article],"Khera M, et al. Mesenchymal stem cell therapy ..."
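
To see which rows of a right join exist only on the PubMed side, pandas offers the `indicator=True` option of `merge`, which adds a `_merge` column. The sketch below uses two toy DataFrames with made-up values; it is an optional aid and not part of the processing done in this notebook:

```python
import pandas as pd

# Toy CT table: two publications already known to CT
df_ct = pd.DataFrame({
    'NCTId': ['NCT0001', 'NCT0001'],
    'pmid': ['111', '222'],
    'BriefTitle': ['Study one', 'Study one'],
})
# Toy PubMed table: the same two publications plus a new one
df_pubmed = pd.DataFrame({
    'NCTId': ['NCT0001', 'NCT0001', 'NCT0001'],
    'pmid': ['111', '222', '333'],
    'title': ['Title A', 'Title B', 'Title C'],
})

# how='right' keeps every PubMed row; '_merge' tells where each row came from
merged = df_ct.merge(df_pubmed, on=['NCTId', 'pmid'], how='right', indicator=True)
pubmed_only = merged['_merge'] == 'right_only'
print(merged.loc[pubmed_only, ['NCTId', 'pmid', 'BriefTitle']])
```

Rows flagged `right_only` have no CT counterpart, so their CT columns such as 'BriefTitle' are empty after the join.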


Both DataFrames had a 'citation' column, so the merge produced `citation_x` and `citation_y`; we drop both:

In [128]:
df_final.drop(['citation_x', 'citation_y'], axis=1, inplace=True)

In [129]:
df_final

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,pmid,type,title,authors,doi,year,publication_types
0,NCT01490489,EG-VEGF : Potential Marker of Pre-eclampsia an...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2011-12-13,2011-07-11,2015-03-13,19602057,RESULT,Role of EG-VEGF in human placentation: Physiol...,Hoffmann P; Saoudi Y; Benharouga M; Graham CH;...,10.1111/j.1582-4934.2008.00554.x,2009,"[Journal Article, Research Support, Non-U.S. G..."
1,NCT01490489,EG-VEGF : Potential Marker of Pre-eclampsia an...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2011-12-13,2011-07-11,2015-03-13,16384869,RESULT,Expression and oxygen regulation of endocrine ...,Hoffmann P; Feige JJ; Alfaidy N,10.1210/en.2005-0912,2006,"[Journal Article, Research Support, Non-U.S. G..."
2,NCT01490489,EG-VEGF : Potential Marker of Pre-eclampsia an...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2011-12-13,2011-07-11,2015-03-13,17531315,RESULT,Placental expression of EG-VEGF and its recept...,Hoffmann P; Feige JJ; Alfaidy N,10.1016/j.placenta.2007.03.008,2007,"[Journal Article, Research Support, Non-U.S. G..."
3,NCT02542891,European Comparative Effectiveness Research on...,Institut National de la Santé Et de la Recherc...,European Commission | Fondation FondaMental | ...,COMPLETED,INTERVENTIONAL,False,2015-09-07,2015-09,2018-09-15,27488181,DERIVED,European COMPARative Effectiveness research on...,Kleiboer A; Smit J; Bosmans J; Ruwaard J; Ande...,10.1186/s13063-016-1511-1,2016,"[Equivalence Trial, Journal Article, Multicent..."
4,NCT02542891,European Comparative Effectiveness Research on...,Institut National de la Santé Et de la Recherc...,European Commission | Fondation FondaMental | ...,COMPLETED,INTERVENTIONAL,False,2015-09-07,2015-09,2018-09-15,34874888,DERIVED,Examining the Theoretical Framework of Behavio...,van Genugten CR; Schuurmans J; Hoogendoorn AW;...,10.2196/32007,2021,[Journal Article]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
493,NCT02101398,Study of the Effect of Transcranial Stimulatio...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2014-04-02,2014-10-02,2016-07,,,,,,,
494,NCT01089387,,,,,,,,,,26439886,,Safety of Intracavernous Bone Marrow-Mononucle...,Yiou R; Hamidou L; Birebent B; Bitari D; Lecor...,10.1016/j.eururo.2015.09.026,2016,"[Journal Article, Research Support, Non-U.S. G..."
495,NCT01089387,Intracavernous Bone Marrow Stem-cell Injection...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2010-03-18,2010-05,2015-03,28753830,DERIVED,Intracavernous Injections of Bone Marrow Monon...,Yiou R; Hamidou L; Birebent B; Bitari D; Le Co...,10.1016/j.euf.2017.06.009,2017,"[Clinical Trial, Phase I, Clinical Trial, Phas..."
496,NCT01089387,Intracavernous Bone Marrow Stem-cell Injection...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2010-03-18,2010-05,2015-03,25974235,DERIVED,Mesenchymal stem cell therapy for the treatmen...,Khera M; Albersen M; Mulhall JP,10.1111/jsm.12871,2015,[Journal Article]


The new PMIDs found only via PubMed have none of the CT-related information filled in: BriefTitle, LeadSponsorName, etc.

In [130]:
# Boolean mask of the rows we need to fill
index_empty_rows = df_final.loc[:, 'BriefTitle'].isna()

# Columns we need to fill
columns_to_fill = [
    'BriefTitle',
    'LeadSponsorName',
    'CollaboratorName',
    'OverallStatus',
    'StudyType',
    'HasResults',
    'StudyFirstPostDate',
    'StartDate',
    'CompletionDate',
    'type',
]

df_final.loc[index_empty_rows, ['NCTId'] + columns_to_fill]

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,type
10,NCT05311865,,,,,,,,,,
11,NCT05311865,,,,,,,,,,
40,NCT03078439,,,,,,,,,,
69,NCT04315948,,,,,,,,,,
81,NCT04315948,,,,,,,,,,
112,NCT00640263,,,,,,,,,,
146,NCT02777229,,,,,,,,,,
148,NCT02777229,,,,,,,,,,
161,NCT01703962,,,,,,,,,,
162,NCT05349162,,,,,,,,,,


We add a 'PUBMED' type for the PMIDs that come from PubMed only:

In [136]:
# We add a 'PUBMED' type to the PMIDs extracted from Pubmed exclusively
df_final.loc[index_empty_rows, 'type'] = 'PUBMED'

In [137]:
df_final.loc[index_empty_rows, ['NCTId'] + columns_to_fill]

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,type
10,NCT05311865,Transmission of Covid-19 During Clubbing Event...,"ANRS, Emerging Infectious Diseases",Cerballiance | Kappa Santé,COMPLETED,INTERVENTIONAL,False,2022-04-05,2021-09-04,2022-02-26,PUBMED
11,NCT05311865,Transmission of Covid-19 During Clubbing Event...,"ANRS, Emerging Infectious Diseases",Cerballiance | Kappa Santé,COMPLETED,INTERVENTIONAL,False,2022-04-05,2021-09-04,2022-02-26,PUBMED
40,NCT03078439,EPIPAGE2 Cohort Study Follow up at Five and a ...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2017-03-13,2016-09-02,2018-01-08,PUBMED
69,NCT04315948,Trial of Treatments for COVID-19 in Hospitaliz...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2020-03-20,2020-03-22,2023-09-25,PUBMED
81,NCT04315948,Trial of Treatments for COVID-19 in Hospitaliz...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2020-03-20,2020-03-22,2023-09-25,PUBMED
112,NCT00640263,Comparison of Efficacy and Safety of Infant Pe...,French National Agency for Research on AIDS an...,European and Developing Countries Clinical Tri...,COMPLETED,INTERVENTIONAL,False,2008-03-21,2009-12,2014-02,PUBMED
146,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,PUBMED
148,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,PUBMED
161,NCT01703962,Non Invasive IDentification of Gliomas With ID...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2012-10-11,2012-03-14,2014-03-20,PUBMED
162,NCT05349162,Epicardial vs. Transvenous ICDs in Children,Paris Cardiovascular Research Center (Inserm U...,Hôpital Necker-Enfants Malades,COMPLETED,OBSERVATIONAL,False,2022-04-27,2003-01-01,2022-04-01,PUBMED


We fill these columns with the information contained in the CT DataFrame:

In [138]:
# NCTIds of empty rows
NCTIds_empty_rows = df_final.loc[index_empty_rows, 'NCTId']

# Columns we wish to copy
columns_to_copy = [
    'BriefTitle',
    'LeadSponsorName',
    'CollaboratorName',
    'OverallStatus',
    'StudyType',
    'HasResults',
    'StudyFirstPostDate',
    'StartDate',
    'CompletionDate',
]

# We copy the missing values from the CT dataframe
for index, nctid in NCTIds_empty_rows.items():
    # For an NCTId, we look in the CT dataframe for the first row with this NCTId
    # and copy the missing columns
    df_final.loc[index, columns_to_copy] = df_ct.loc[
        df_ct.loc[:, 'NCTId'] == nctid, columns_to_copy
    ].iloc[0]
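
The row-by-row copy above could also be written without an explicit loop. The following is a sketch of one possible alternative on toy data, not the method used in this notebook: it builds a lookup table with one row per NCTId and writes all the missing values back in a single assignment. The DataFrames and their values are made up, and `study_info` and `mask` are names introduced for this illustration:

```python
import pandas as pd

df_ct = pd.DataFrame({
    'NCTId': ['NCT0001', 'NCT0001', 'NCT0002'],
    'BriefTitle': ['Study one', 'Study one', 'Study two'],
    'OverallStatus': ['COMPLETED', 'COMPLETED', 'COMPLETED'],
})
df_final = pd.DataFrame({
    'NCTId': ['NCT0001', 'NCT0001', 'NCT0002'],
    'pmid': ['111', '333', '444'],
    'BriefTitle': ['Study one', None, None],
    'OverallStatus': ['COMPLETED', None, None],
})
columns_to_copy = ['BriefTitle', 'OverallStatus']

# One row of study-level information per NCTId (the first occurrence is kept)
study_info = df_ct.drop_duplicates('NCTId').set_index('NCTId')[columns_to_copy]

# Rows whose CT columns are still empty
mask = df_final['BriefTitle'].isna()

# Look up each incomplete row's NCTId and write the values back in one step
df_final.loc[mask, columns_to_copy] = study_info.loc[
    df_final.loc[mask, 'NCTId']
].to_numpy()
print(df_final)
```

Like the loop, this takes the first CT row found for each NCTId, which is safe as long as the study-level columns are identical across all rows of a given study.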

In [140]:
df_final.loc[index_empty_rows, ['NCTId'] + columns_to_fill]

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,type
10,NCT05311865,Transmission of Covid-19 During Clubbing Event...,"ANRS, Emerging Infectious Diseases",Cerballiance | Kappa Santé,COMPLETED,INTERVENTIONAL,False,2022-04-05,2021-09-04,2022-02-26,PUBMED
11,NCT05311865,Transmission of Covid-19 During Clubbing Event...,"ANRS, Emerging Infectious Diseases",Cerballiance | Kappa Santé,COMPLETED,INTERVENTIONAL,False,2022-04-05,2021-09-04,2022-02-26,PUBMED
40,NCT03078439,EPIPAGE2 Cohort Study Follow up at Five and a ...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2017-03-13,2016-09-02,2018-01-08,PUBMED
69,NCT04315948,Trial of Treatments for COVID-19 in Hospitaliz...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2020-03-20,2020-03-22,2023-09-25,PUBMED
81,NCT04315948,Trial of Treatments for COVID-19 in Hospitaliz...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2020-03-20,2020-03-22,2023-09-25,PUBMED
112,NCT00640263,Comparison of Efficacy and Safety of Infant Pe...,French National Agency for Research on AIDS an...,European and Developing Countries Clinical Tri...,COMPLETED,INTERVENTIONAL,False,2008-03-21,2009-12,2014-02,PUBMED
146,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,PUBMED
148,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,PUBMED
161,NCT01703962,Non Invasive IDentification of Gliomas With ID...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2012-10-11,2012-03-14,2014-03-20,PUBMED
162,NCT05349162,Epicardial vs. Transvenous ICDs in Children,Paris Cardiovascular Research Center (Inserm U...,Hôpital Necker-Enfants Malades,COMPLETED,OBSERVATIONAL,False,2022-04-27,2003-01-01,2022-04-01,PUBMED


### Final result:

In [141]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 498 entries, 0 to 497
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   NCTId               498 non-null    object  
 1   BriefTitle          498 non-null    string  
 2   LeadSponsorName     498 non-null    string  
 3   CollaboratorName    224 non-null    string  
 4   OverallStatus       498 non-null    category
 5   StudyType           498 non-null    category
 6   HasResults          498 non-null    boolean 
 7   StudyFirstPostDate  498 non-null    string  
 8   StartDate           498 non-null    string  
 9   CompletionDate      498 non-null    string  
 10  pmid                392 non-null    object  
 11  type                392 non-null    string  
 12  title               392 non-null    object  
 13  authors             392 non-null    object  
 14  doi                 391 non-null    object  
 15  year                392 non-null    obje

In [76]:
df_final = df_final.convert_dtypes()
# df_final = df_final.astype({"OverallStatus" : 'category', "StudyType" : 'category'})
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 498 entries, 0 to 497
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   NCTId               498 non-null    string  
 1   BriefTitle          498 non-null    string  
 2   LeadSponsorName     498 non-null    string  
 3   CollaboratorName    224 non-null    string  
 4   OverallStatus       498 non-null    category
 5   StudyType           498 non-null    category
 6   HasResults          498 non-null    boolean 
 7   StudyFirstPostDate  498 non-null    string  
 8   StartDate           498 non-null    string  
 9   CompletionDate      498 non-null    string  
 10  pmid                392 non-null    string  
 11  type                392 non-null    string  
 12  title               392 non-null    string  
 13  authors             392 non-null    string  
 14  doi                 391 non-null    string  
 15  year                392 non-null    stri

### CSV export:

In [77]:
df_final.to_csv(
    'Data/outputs/extract_df_final.csv',
    sep=';',
    index=False,
    encoding='utf-8-sig',
)
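
When the exported file is read back, two details are worth keeping in mind: list columns such as 'publication_types' are stored as text, and identifiers such as 'pmid' may be parsed as numbers unless a string dtype is requested. The sketch below illustrates both points on a toy DataFrame written to an in-memory buffer; the data is made up, and a real file written with `encoding='utf-8-sig'` would be read with the same `encoding` argument:

```python
import ast
import io

import pandas as pd

df = pd.DataFrame({
    'NCTId': ['NCT0001', 'NCT0002'],
    'pmid': ['111', None],
    'publication_types': [['Journal Article', 'Clinical Trial'], None],
})

# Write to an in-memory buffer with the same separator as the real export
buffer = io.StringIO()
df.to_csv(buffer, sep=';', index=False)
buffer.seek(0)

# Keep PMIDs as strings, then turn the textual lists back into Python lists
reloaded = pd.read_csv(buffer, sep=';', dtype={'pmid': 'string'})
reloaded['publication_types'] = reloaded['publication_types'].apply(
    lambda value: ast.literal_eval(value) if isinstance(value, str) else value
)
print(reloaded)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for parsing the stored lists.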