# Objective:

We want to estimate the number of publications that result from clinical studies linked to INSERM.

In particular, we want to identify the clinical studies that led to no publication and try to understand why.

# Organization:
In this first notebook, the goal is to extract the clinical study data automatically:

- Using the *ClinicalTrial* API, we retrieve the *IDs* (**NCTId**) of the clinical studies that have:
  + *INSERM*, *ANRS*, etc. as sponsors
  + a 'COMPLETED' status
  + a study completion date on or after 01/01/2013

- From these **NCTIds**:
  + we retrieve from *ClinicalTrial* the **PMIDs** of the publications linked to these studies.  
    These publications are of 2 types:
    1. Those uploaded to *CT* by the study authors: `BACKGROUND, RESULT`
    2. Those automatically retrieved from PubMed by *CT*: `DERIVED`
  + we retrieve from *PubMed* the **PMIDs** of the publications linked to these studies.  
    This generally finds slightly more publications than the automatic processing done by *CT*.

- For each **NCTId**, we merge all the **PMIDs** returned by *CT* and *PubMed*.

- From this set of **PMIDs**, we retrieve the related information via *PubMed*: `title, authors, doi`...

- We retrieve the open access (OA) status using the DOI and the Unpaywall API

- We save the result as a CSV file.
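
The open access lookup mentioned in the steps above can be sketched as follows. This is a minimal sketch, not the notebook's actual implementation: it assumes the Unpaywall REST endpoint `https://api.unpaywall.org/v2/{doi}?email=...` and the `is_oa` / `oa_status` response fields (to be checked against the Unpaywall documentation). The helper names, the DOI and the sample record are hypothetical, and no request is sent here.

```python
import urllib.parse

# Assumed base URL of the Unpaywall API (check the official documentation)
UNPAYWALL_BASE = 'https://api.unpaywall.org/v2'


def unpaywall_url(doi, email):
    """Build the lookup URL for one DOI (hypothetical helper)."""
    # Keep '/' unescaped: it separates the DOI prefix from its suffix
    doi_part = urllib.parse.quote(doi, safe='/')
    email_part = urllib.parse.quote(email)
    return f'{UNPAYWALL_BASE}/{doi_part}?email={email_part}'


def extract_oa_status(record):
    """Keep only the open access fields of a decoded JSON record."""
    return {'is_oa': record.get('is_oa'), 'oa_status': record.get('oa_status')}


# Made-up record, shaped like the fields we assume Unpaywall returns
sample_record = {'doi': '10.1000/example', 'is_oa': True, 'oa_status': 'gold'}

print(unpaywall_url('10.1000/example', 'me@example.org'))
print(extract_oa_status(sample_record))
```

Building the URL and parsing the record in separate functions keeps the parsing testable without network access.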

# Extracting the NCTIds from ClinicalTrial:

<span style="color:red">**Obsolete**</span>  
<span style="color:red">**API v1 has been unavailable since mid-2024.**  
**Go directly to [the API v2 section](#API-v2:)**</span>

## API v1:

To make it easier to retrieve data through the ClinicalTrial API v1, we use the Python wrapper [pytrials](https://github.com/jvfe/pytrials)

Install ***pytrials***:
- In Anaconda Navigator, switch to the same environment as the one running Jupyter
- Launch Powershell Prompt in that environment
- Run: `pip install pytrials`

In [1]:
# from pytrials.client import ClinicalTrials
# import urllib.parse

In [2]:
# ct = ClinicalTrials()

### Building the query:

We build the query that will be sent to the ClinicalTrial API

#### Sponsors:

In [3]:
# sponsors = [
#     'anrs',
#     'inserm',
#     'institut national de la santé et de la recherche médicale',
#     'french national agency for research on aids and viral hepatitis',
# ]

In [4]:
# sponsors_expr = [f'AREA[LeadSponsorName]{sponsor}' for sponsor in sponsors]

# # Add OR keyword
# sponsors_expr = ' OR '.join(sponsors_expr)

# # Add parenthesis for correct interpretation of OR expression
# sponsors_expr = f'({sponsors_expr})'

# sponsors_expr

#### Status:

In [5]:
# status = 'completed'

In [6]:
# status_expr = f'AREA[OverallStatus]{status}'
# status_expr

#### Study completion date on or after 01/01/2013:

<span style="color:red">**Update the date if necessary**</span>

In [7]:
# date_expr = 'AREA[CompletionDate]RANGE[01/01/2013,MAX]'
# date_expr

#### Search Expression:

In [8]:
# search_expr = ' AND '.join([sponsors_expr, status_expr, date_expr])
# search_expr

#### URL encode: 

In [9]:
# search_expr_url_encode = urllib.parse.quote_plus(search_expr)
# search_expr_url_encode

#### Fields:

The fields we want to retrieve:

In [10]:
# fields = [
#     'NCTId',
#     'BriefTitle',
#     'OverallStatus',
#     'StudyType',
#     'LeadSponsorName',
#     'CollaboratorName',
#     'OrgStudyId',
#     'SecondaryId',
#     'StudyFirstPostDate',
#     'ReferencePMID',
#     'ReferenceCitation',
#     'ReferenceType',
# ]

### Sending the query:

In [11]:
# study_fields = ct.get_study_fields(
#     search_expr=search_expr_url_encode,
#     fields=fields,
#     max_studies=1000,
#     fmt='csv',
# )

In [12]:
# print(f'NStudiesReturned: {len(study_fields[1:])}')

### Reading the query result into Pandas:

In [13]:
# import pandas as pd

In [14]:
# pd.DataFrame.from_records(study_fields[1:], index='Rank', columns=study_fields[0])

## API v2:

API v1 has no longer been supported since [mid-2024](https://clinicaltrials.gov/data-api/api):

>***Notice to API users:  
>The new ClinicalTrials.gov API, version 2.0 is available. Classic API users are strongly encouraged to switch to the modernized API. We will continue to support the classic API until mid-2024 and are planning blackouts for the spring to help with the transition to the modernized API.***

In addition, API v2 supports a new field, **"HasResults"**, which is rarely filled in for now but could prove useful in the future.

In return, CSV export is limited to a fixed set of fields, listed on this page: https://clinicaltrials.gov/data-api/about-api/csv-download

We therefore have to use the JSON export.

In [15]:
import urllib.parse

### Building the query:

Since `pytrials` is not compatible with v2, we send the request manually using [Requests](https://requests.readthedocs.io/en/latest/)
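
As an alternative to encoding each parameter by hand, the whole query string can be produced in one call to `urllib.parse.urlencode`. The sketch below reuses the parameter names that appear in this notebook's URL (`query.lead`, `filter.overallStatus`, `fields`, `filter.advanced`, `countTotal`, `pageSize`). The function name `build_query_url` is hypothetical and the short sponsor and field lists are for illustration only.

```python
import urllib.parse

BASE_URL = 'https://clinicaltrials.gov/api/v2/studies'


def build_query_url(sponsors, fields, status, min_date, page_size=1000):
    """Assemble the API v2 query URL from plain Python values (hypothetical helper)."""
    params = {
        'format': 'json',
        'query.lead': ' OR '.join(sponsors),
        'filter.overallStatus': status,
        'fields': ','.join(fields),
        'filter.advanced': f'AREA[CompletionDate]RANGE[{min_date}, MAX]',
        'countTotal': 'true',
        'pageSize': page_size,
    }
    # urlencode percent-encodes every value (spaces become '+', like quote_plus)
    return f'{BASE_URL}?{urllib.parse.urlencode(params)}'


url = build_query_url(
    sponsors=['anrs', 'inserm'],
    fields=['NCTId', 'BriefTitle'],
    status='COMPLETED',
    min_date='01/01/2013',
)
print(url)
```

Keeping the parameters in a dictionary makes it harder to forget to encode one of them.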

#### Format:

In [16]:
format = 'json'

#### Sponsors:

In [17]:
sponsors = [
    'anrs',
    'inserm',
    'institut national de la santé et de la recherche médicale',
    'french national agency for research on aids and viral hepatitis',
]

In [18]:
sponsors_expr_v2 = ' OR '.join(sponsors)
sponsors_expr_v2 = urllib.parse.quote_plus(sponsors_expr_v2)
sponsors_expr_v2

'anrs+OR+inserm+OR+institut+national+de+la+sant%C3%A9+et+de+la+recherche+m%C3%A9dicale+OR+french+national+agency+for+research+on+aids+and+viral+hepatitis'

#### Overall_status:

In [19]:
overall_status = 'COMPLETED'

#### Fields:

In [20]:
fields_v2 = [
    'NCTId',
    'BriefTitle',
    # 'OfficialTitle',
    'OverallStatus',
    'StudyType',
    'LeadSponsorName',
    'CollaboratorName',
    # 'OrgStudyId',
    # 'SecondaryId',
    'StudyFirstPostDate',
    'StartDate',
    # 'PrimaryCompletionDate',
    'CompletionDate',
    'ReferencePMID',
    'ReferenceCitation',
    'ReferenceType',
    'hasResults',
]
fields_v2

['NCTId',
 'BriefTitle',
 'OverallStatus',
 'StudyType',
 'LeadSponsorName',
 'CollaboratorName',
 'StudyFirstPostDate',
 'StartDate',
 'CompletionDate',
 'ReferencePMID',
 'ReferenceCitation',
 'ReferenceType',
 'hasResults']

In [21]:
fields_expr_v2 = ','.join(fields_v2)
fields_expr_v2 = urllib.parse.quote_plus(fields_expr_v2)
fields_expr_v2

'NCTId%2CBriefTitle%2COverallStatus%2CStudyType%2CLeadSponsorName%2CCollaboratorName%2CStudyFirstPostDate%2CStartDate%2CCompletionDate%2CReferencePMID%2CReferenceCitation%2CReferenceType%2ChasResults'

#### Study completion date on or after 01/01/2013:

<span style="color:red">**Update the date if necessary**</span>

In [22]:
date = '01/01/2013'
date_expr = urllib.parse.quote_plus(f'AREA[CompletionDate]RANGE[{date}, MAX]')
date_expr

'AREA%5BCompletionDate%5DRANGE%5B01%2F01%2F2013%2C+MAX%5D'

#### Maximum number of results:

In [23]:
count_total = 'true'

In [24]:
page_size = 1000

#### API URL:

In [25]:
# Make sure we don't have more than 1000 results; otherwise we need to handle several pages of results
query_url = f'https://clinicaltrials.gov/api/v2/studies?format={format}&query.lead={sponsors_expr_v2}&filter.overallStatus={overall_status}&fields={fields_expr_v2}&filter.advanced={date_expr}&countTotal={count_total}&pageSize={page_size}'
query_url

'https://clinicaltrials.gov/api/v2/studies?format=json&query.lead=anrs+OR+inserm+OR+institut+national+de+la+sant%C3%A9+et+de+la+recherche+m%C3%A9dicale+OR+french+national+agency+for+research+on+aids+and+viral+hepatitis&filter.overallStatus=COMPLETED&fields=NCTId%2CBriefTitle%2COverallStatus%2CStudyType%2CLeadSponsorName%2CCollaboratorName%2CStudyFirstPostDate%2CStartDate%2CCompletionDate%2CReferencePMID%2CReferenceCitation%2CReferenceType%2ChasResults&filter.advanced=AREA%5BCompletionDate%5DRANGE%5B01%2F01%2F2013%2C+MAX%5D&countTotal=true&pageSize=1000'
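
If the query ever returns more studies than `pageSize`, the results have to be read page by page. The sketch below shows one way to do it, assuming that API v2 returns a `nextPageToken` key while pages remain and accepts it back through a `pageToken` parameter (to be checked against the API documentation). `fetch_all_studies`, `fetch_page` and the fake pages are hypothetical and only serve to run the loop offline.

```python
def fetch_all_studies(fetch_page):
    """Collect the studies of every page.

    `fetch_page(page_token)` is a hypothetical callable that returns the decoded
    JSON of one page; `page_token` is None for the first page.
    """
    studies = []
    page_token = None
    while True:
        page = fetch_page(page_token)
        studies.extend(page.get('studies', []))
        # Assumption: the key is absent from the last page
        page_token = page.get('nextPageToken')
        if not page_token:
            break
    return studies


# Offline stand-in for the API: two pages chained by a token
fake_pages = {
    None: {'studies': [{'id': 'A'}, {'id': 'B'}], 'nextPageToken': 'tok1'},
    'tok1': {'studies': [{'id': 'C'}]},
}

all_studies = fetch_all_studies(lambda token: fake_pages[token])
print(len(all_studies))
```

With Requests, `fetch_page` could be a small function that calls `requests.get(query_url, params={'pageToken': token})` when a token is present and returns `response.json()`.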

### Sending the query:

Install [Requests](https://requests.readthedocs.io/en/latest/) and [Requests_cache](https://requests-cache.readthedocs.io/en/stable/index.html) if they are not already installed:

`python -m pip install requests`  
`python -m pip install requests-cache`

In [26]:
from datetime import timedelta
import requests
import requests_cache

# Add caching to all requests functions
requests_cache.install_cache(
    cache_control = True, # Use Cache-Control response headers for expiration, if available
    expire_after = timedelta(days=1), # Otherwise expire responses after one day
)

In [27]:
response = requests.get(query_url)
response.raise_for_status()
response

CachedResponse(_content=b'{"totalCount":200,"studies":[\n{"protocolSection":{"identificationModule":{"nctId":"NCT02777229","briefTitle":"Efficacy and Safety of a Dolutegravir-based Regimen for the Initial Management of HIV Infected Adults in Resource-limited Settings"},"statusModule":{"overallStatus":"COMPLETED","startDateStruct":{"date":"2016-07"},"completionDateStruct":{"date":"2021-07"},"studyFirstPostDateStruct":{"date":"2016-05-19"}},"sponsorCollaboratorsModule":{"leadSponsor":{"name":"ANRS, Emerging Infectious Diseases"},"collaborators":[{"name":"Institut de Recherche pour le Developpement"},{"name":"UNITAID"}]},"designModule":{"studyType":"INTERVENTIONAL"},"referencesModule":{"references":[{"pmid":"33355914","type":"DERIVED","citation":"Bousmah MA, Nishimwe ML, Tovar-Sanchez T, Lantche Wandji M, Mpoudi-Etame M, Maradan G, Omgba Bassega P, Varloteaux M, Montoyo A, Kouanfack C, Delaporte E, Boyer S; New Antiretroviral and Monitoring Strategies in HIV-infected Adults in Low-Income 

In [28]:
print(f'Studies returned: {response.json()["totalCount"]}')

Studies returned: 200


### Processing the returned JSON:

In [29]:
import json

In [30]:
# print(json.dumps(response.json(), indent=2))

***The JSON structure is far too nested to normalize directly with Pandas, so we flatten it by hand:***

From the JSON, we build an equivalent but much "flatter" dictionary
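
Several modules or keys can be missing from a study record, so every nested lookup needs a fallback. A minimal sketch of the idea, using a hypothetical helper `dig` and a trimmed record shaped like the response above (here without `completionDateStruct`):

```python
def dig(mapping, *keys, default=None):
    """Follow `keys` through nested dicts; return `default` if any level is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Trimmed record for illustration; completionDateStruct is deliberately absent
study = {
    'protocolSection': {
        'identificationModule': {'nctId': 'NCT02777229'},
        'statusModule': {'startDateStruct': {'date': '2016-07'}},
    },
}

flat = {
    'NCTId': dig(study, 'protocolSection', 'identificationModule', 'nctId'),
    'StartDate': dig(study, 'protocolSection', 'statusModule', 'startDateStruct', 'date'),
    'CompletionDate': dig(study, 'protocolSection', 'statusModule', 'completionDateStruct', 'date'),
}
print(flat)
```

This is equivalent to chaining `.get('...', {})` calls, but the path reads as a single list of keys.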

In [31]:
# If the list of collaborators is empty, return None; otherwise concatenate the values of the list
# in the form "collaborator_0 | collaborator_1 | ..."
def concatenate_collaborator_list(collaborator_list):
    if collaborator_list == []:
        return None
    else:
        return ' | '.join(collaborator_list)

In [32]:
studies_list = []
for study in response.json()['studies']:
    study_pro = study['protocolSection']
    study_dict = {
        'NCTId': study_pro['identificationModule']['nctId'],
        'BriefTitle': study_pro['identificationModule']['briefTitle'],
        'LeadSponsorName': study_pro['sponsorCollaboratorsModule']['leadSponsor']['name'],
        'CollaboratorName': concatenate_collaborator_list(
            [
                c['name']
                for c in (
                    study_pro['sponsorCollaboratorsModule'].get('collaborators', [])  # can be missing
                )
            ]
        ),
        'OverallStatus': study_pro['statusModule']['overallStatus'],
        'StudyType': study_pro['designModule']['studyType'],
        'HasResults': study['hasResults'],
        'StudyFirstPostDate': study_pro['statusModule']['studyFirstPostDateStruct']['date'],
        'StartDate': study_pro['statusModule'].get('startDateStruct', {}).get('date', None),  # can be missing
        # 'PrimaryCompletionDate' : study_pro["statusModule"].get('primaryCompletionDateStruct', {}).get('date', None), # can be missing
        'CompletionDate': study_pro['statusModule'].get('completionDateStruct', {}).get('date', None),  # can be missing
        'Reference': study_pro.get('referencesModule', {}).get('references', []),  # can be missing
    }
    studies_list.append(study_dict)

# print(json.dumps(studies_list, indent=2))

We check that no NCTId was lost along the way:

In [33]:
print(f'Nombre de NCTId: {len(studies_list)}')
assert response.json()['totalCount'] == len(studies_list)

Nombre de NCTId: 200


### Importing into Pandas

In [34]:
import pandas as pd

In [35]:
df_ct = pd.json_normalize(data=studies_list)
df_ct

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,Reference
0,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,"[{'pmid': '33355914', 'type': 'DERIVED', 'cita..."
1,NCT01882062,Proof of Concept of an Anaplerotic Study Using...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,True,2013-06-20,2013-05,2013-07,"[{'pmid': '25568297', 'type': 'RESULT', 'citat..."
2,NCT03671291,Missed Opportunities to Pre-exposure Prophylax...,French National Agency for Research on AIDS an...,"University Hospital, Marseille | University Ho...",COMPLETED,INTERVENTIONAL,False,2018-09-14,2019-04-03,2021-10-03,[]
3,NCT05349162,Epicardial vs. Transvenous ICDs in Children,Paris Cardiovascular Research Center (Inserm U...,Hôpital Necker-Enfants Malades,COMPLETED,OBSERVATIONAL,False,2022-04-27,2003-01-01,2022-04-01,[]
4,NCT01703962,Non Invasive IDentification of Gliomas With ID...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2012-10-11,2012-03-14,2014-03-20,[]
...,...,...,...,...,...,...,...,...,...,...,...
195,NCT01688453,Overweight Management and Social Inequalities,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2012-09-19,2012-04,2015-11,"[{'pmid': '34087324', 'type': 'DERIVED', 'cita..."
196,NCT02658253,Trial to Evaluate the Safety and Immunogenicit...,Institut National de la Santé Et de la Recherc...,"EVI Industries, Inc. | Recherche Clinique Pari...",COMPLETED,INTERVENTIONAL,False,2016-01-18,2016-01,2019-02-21,"[{'pmid': '33717176', 'type': 'DERIVED', 'cita..."
197,NCT01842477,Evaluation of Efficacy and Safety of Autologou...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2013-04-29,2013-05,2016-02-05,[]
198,NCT02172677,The Influence of Collective Schemas on Individ...,Institut National de la Santé Et de la Recherc...,La Région Basse-Normandie | Université de Caen...,COMPLETED,INTERVENTIONAL,False,2014-06-24,2014-10,2016-10-14,[]


#### "Exploding" the "Reference" column:

For each NCTId, the Reference column potentially contains a list of several references.  
If, for example, there are 3 references, we want to end up with 3 rows, each carrying the same NCTId and containing a single reference.  

*Before*:  
**`NCTId    Reference`**  
`NCT0001, [Ref1, Ref2, Ref3]`   

*After*:  
**`NCTId    Reference`**  
`NCT0001,   Ref1`  
`NCT0001,   Ref2`  
`NCT0001,   Ref3`   
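
The before/after above can be reproduced on a toy DataFrame. This is a small illustration with made-up values, including a study with an empty list to show that its row is kept with a missing value rather than dropped:

```python
import pandas as pd

# Toy data: one study with three references, one study with none
df = pd.DataFrame(
    {
        'NCTId': ['NCT0001', 'NCT0002'],
        'Reference': [['Ref1', 'Ref2', 'Ref3'], []],
    }
)

# One row per reference; ignore_index=True renumbers the rows from 0
exploded = df.explode('Reference', ignore_index=True)
print(exploded)
```

Keeping the row of a study without references matters here, because those studies are exactly the ones we want to identify later.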

In [36]:
df_ct = df_ct.explode('Reference', ignore_index=True)
df_ct.loc[:, ['NCTId', 'Reference']]

Unnamed: 0,NCTId,Reference
0,NCT02777229,"{'pmid': '33355914', 'type': 'DERIVED', 'citat..."
1,NCT02777229,"{'pmid': '33010241', 'type': 'DERIVED', 'citat..."
2,NCT02777229,"{'pmid': '31339676', 'type': 'DERIVED', 'citat..."
3,NCT01882062,"{'pmid': '25568297', 'type': 'RESULT', 'citati..."
4,NCT03671291,
...,...,...
477,NCT02172677,
478,NCT01037777,"{'pmid': '35264424', 'type': 'DERIVED', 'citat..."
479,NCT01037777,"{'pmid': '32822634', 'type': 'DERIVED', 'citat..."
480,NCT01037777,"{'pmid': '24780882', 'type': 'DERIVED', 'citat..."


In each row, the Reference column now contains a dictionary of the form:  
`{  
"pmid": "17545707",  
"type": "BACKGROUND",  
"citation": "...",  
}`

We want to turn the dictionary keys into new columns:  
**`pmid     type        citation`**  
`17545707, BACKGROUND, "..."`

In [37]:
df_ct_references = pd.json_normalize(df_ct.pop('Reference'))
df_ct_references

Unnamed: 0,pmid,type,citation
0,33355914,DERIVED,"Bousmah MA, Nishimwe ML, Tovar-Sanchez T, Lant..."
1,33010241,DERIVED,"Calmy A, Tovar Sanchez T, Kouanfack C, Mpoudi-..."
2,31339676,DERIVED,"NAMSAL ANRS 12313 Study Group; Kouanfack C, Mp..."
3,25568297,RESULT,"Adanyeguh IM, Rinaldi D, Henry PG, Caillet S, ..."
4,,,
...,...,...,...
477,,,
478,35264424,DERIVED,"Wilke C, Mengel D, Schols L, Hengel H, Rakowic..."
479,32822634,DERIVED,"Jacobi H, du Montcel ST, Romanzetti S, Harmuth..."
480,24780882,DERIVED,"Tezenas du Montcel S, Durr A, Rakowicz M, Nane..."


We reassemble the complete DataFrame:

In [38]:
df_ct = df_ct.join(
    df_ct_references,
    validate = 'one_to_one',
)
df_ct

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,pmid,type,citation
0,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,33355914,DERIVED,"Bousmah MA, Nishimwe ML, Tovar-Sanchez T, Lant..."
1,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,33010241,DERIVED,"Calmy A, Tovar Sanchez T, Kouanfack C, Mpoudi-..."
2,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,31339676,DERIVED,"NAMSAL ANRS 12313 Study Group; Kouanfack C, Mp..."
3,NCT01882062,Proof of Concept of an Anaplerotic Study Using...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,True,2013-06-20,2013-05,2013-07,25568297,RESULT,"Adanyeguh IM, Rinaldi D, Henry PG, Caillet S, ..."
4,NCT03671291,Missed Opportunities to Pre-exposure Prophylax...,French National Agency for Research on AIDS an...,"University Hospital, Marseille | University Ho...",COMPLETED,INTERVENTIONAL,False,2018-09-14,2019-04-03,2021-10-03,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
477,NCT02172677,The Influence of Collective Schemas on Individ...,Institut National de la Santé Et de la Recherc...,La Région Basse-Normandie | Université de Caen...,COMPLETED,INTERVENTIONAL,False,2014-06-24,2014-10,2016-10-14,,,
478,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,35264424,DERIVED,"Wilke C, Mengel D, Schols L, Hengel H, Rakowic..."
479,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,32822634,DERIVED,"Jacobi H, du Montcel ST, Romanzetti S, Harmuth..."
480,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,24780882,DERIVED,"Tezenas du Montcel S, Durr A, Rakowicz M, Nane..."


***Rebuilding the index:***

In [39]:
# df_studies_v2.set_index('NCTId', inplace = True)

***Specifying the types:***

In [40]:
df_ct = df_ct.convert_dtypes()
df_ct = df_ct.astype({"OverallStatus" : 'category', "StudyType" : 'category', "type": 'category',})
# df_ct = df_ct.astype({'OverallStatus': 'category', 'StudyType': 'category'})
df_ct.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 482 entries, 0 to 481
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   NCTId               482 non-null    string  
 1   BriefTitle          482 non-null    string  
 2   LeadSponsorName     482 non-null    string  
 3   CollaboratorName    217 non-null    string  
 4   OverallStatus       482 non-null    category
 5   StudyType           482 non-null    category
 6   HasResults          482 non-null    boolean 
 7   StudyFirstPostDate  482 non-null    string  
 8   StartDate           482 non-null    string  
 9   CompletionDate      482 non-null    string  
 10  pmid                364 non-null    string  
 11  type                364 non-null    category
 12  citation            364 non-null    string  
dtypes: boolean(1), category(3), string(9)
memory usage: 36.7 KB


In [41]:
df_ct

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,pmid,type,citation
0,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,33355914,DERIVED,"Bousmah MA, Nishimwe ML, Tovar-Sanchez T, Lant..."
1,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,33010241,DERIVED,"Calmy A, Tovar Sanchez T, Kouanfack C, Mpoudi-..."
2,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,31339676,DERIVED,"NAMSAL ANRS 12313 Study Group; Kouanfack C, Mp..."
3,NCT01882062,Proof of Concept of an Anaplerotic Study Using...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,True,2013-06-20,2013-05,2013-07,25568297,RESULT,"Adanyeguh IM, Rinaldi D, Henry PG, Caillet S, ..."
4,NCT03671291,Missed Opportunities to Pre-exposure Prophylax...,French National Agency for Research on AIDS an...,"University Hospital, Marseille | University Ho...",COMPLETED,INTERVENTIONAL,False,2018-09-14,2019-04-03,2021-10-03,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
477,NCT02172677,The Influence of Collective Schemas on Individ...,Institut National de la Santé Et de la Recherc...,La Région Basse-Normandie | Université de Caen...,COMPLETED,INTERVENTIONAL,False,2014-06-24,2014-10,2016-10-14,,,
478,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,35264424,DERIVED,"Wilke C, Mengel D, Schols L, Hengel H, Rakowic..."
479,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,32822634,DERIVED,"Jacobi H, du Montcel ST, Romanzetti S, Harmuth..."
480,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,24780882,DERIVED,"Tezenas du Montcel S, Durr A, Rakowicz M, Nane..."


#### CSV export:

In [42]:
# df_ct.to_csv(
#     'Data/outputs/extract_CT_api_v2.csv',
#     sep=';',
#     index=False,
#     encoding='utf-8-sig',
# )

# PubMed

### Using a key for the PubMed API:

Using a key to access the PubMed API is recommended, as it allows up to 10 requests per second.  
Without a key, the limit is 3 requests per second.  

> E-utils users are allowed 3 requests/second without an API key. Create an API key to increase your e-utils limit to 10 requests/second.

**In practice, the PubMed API responds much more slowly than that, so the key does not seem to make much difference.**
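
Should the request rate ever approach the limit, the calls can be spaced out explicitly. Below is a minimal sketch of a client-side throttle. The class `Throttle` is hypothetical and not part of Requests or of the E-utilities; the clock and sleep functions are injectable only so that the logic can be checked without actually waiting.

```python
import time


class Throttle:
    """Space out calls so that consecutive calls are at least 1/rate seconds apart."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


# 10 requests per second with a key, 3 without
throttle = Throttle(rate=10)
for _ in range(3):
    throttle.wait()
    # the call to session.get(...) would go here
```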

To get your key, go to this page while logged in:
https://account.ncbi.nlm.nih.gov/settings/

Once you have the key, add it to your environment variables by running the following command in a terminal:

**Windows :** 

`setx NCBI_API_KEY "123456"`

**Linux/MacOS :**

`export NCBI_API_KEY=123456`
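
Once the variable is set, the key can be read in one place and passed to the API as a regular query parameter. A minimal sketch, assuming the `db`, `retmode`, `term` and `api_key` parameters used in the E-utilities URLs of this notebook; the helper `eutils_params` is hypothetical:

```python
import os


def eutils_params(term, api_key=None):
    """Build the esearch query parameters, adding the API key only when one is available."""
    params = {'db': 'pubmed', 'retmode': 'json', 'term': term}
    # Fall back to the environment variable when no key is given explicitly
    key = api_key if api_key is not None else os.getenv('NCBI_API_KEY')
    if key:
        params['api_key'] = key
    return params


print(eutils_params('NCT02777229', api_key='123456'))
```

The resulting dictionary can be handed to Requests through its `params` argument, for example `session.get(esearch_url, params=eutils_params(nctid))`, which also avoids sending a literal `api_key=None` when the variable is missing.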

In [43]:
import os

assert os.getenv('NCBI_API_KEY', None) is not None

### Retrieving the PMIDs via PubMed:

For each NCTId from CT, we retrieve the PMIDs of the associated publications via PubMed:

In [44]:
# Unique list of the NCTIds extracted from ClinicalTrial
nctids = df_ct.loc[:, 'NCTId'].unique()
num_nctids = len(nctids)
session = requests.Session()

pmids_pubmed_dict = {}
for i, nctid in enumerate(nctids):
    # Display the progress on a single line
    print(f'\r{i+1}/{num_nctids}...', end='', flush=True)

    # Query Pubmed's API
    # Double quotes outside: nesting the same quote type in an f-string requires Python 3.12+
    query_url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&term={nctid}&api_key={os.getenv('NCBI_API_KEY')}"
    response = session.get(query_url)
    response.raise_for_status()
    
    pmids = response.json()['esearchresult']['idlist']
    pmids_pubmed_dict[nctid] = set(pmids)

200/200...

In [45]:
# pmids_pubmed_dict

**We want to merge the list of PMIDs just retrieved from PubMed with the list of PMIDs already retrieved via CT.**

We put the CT PMIDs in the same form:

In [46]:
pmids_ct_dict = {}
for nctid in nctids:
    pmids = df_ct[df_ct.loc[:, 'NCTId'] == nctid].loc[:, 'pmid'].dropna()
    pmids_ct_dict[nctid] = set(pmids)
# pmids_ct_dict

For a given NCTId, we take the union of the two sets of PMIDs:

In [47]:
pmids_complete_dict = {}
for nctid in nctids:
    # Union of the PMIDs found in PubMed and in CT
    pmids_complete_dict[nctid] = pmids_pubmed_dict[nctid] | pmids_ct_dict[nctid]
# pmids_complete_dict

#### Checks:

In [48]:
num_pmids_ct = sum((len(v) for v in pmids_ct_dict.values()))
print(f'Nombre total de publications issus de CT: {num_pmids_ct}')

Nombre total de publications issus de CT: 364


In [49]:
num_pmids_complete = sum((len(v) for v in pmids_complete_dict.values()))
print(f'Nombre total de publications après consultation PubMed: {num_pmids_complete}')

Nombre total de publications après consultation PubMed: 394


In [50]:
pmids_pubmed_only_dict = {}
for nctid in nctids:
    # The set of PMIDs found in PubMed only
    pmids_pubmed_only_dict[nctid] = pmids_pubmed_dict[nctid] - pmids_ct_dict[nctid]

In [51]:
num_pmids_pubmed_only = sum((len(v) for v in pmids_pubmed_only_dict.values()))
print(f'Nombre de nouveaux PMIDs trouvés via Pubmed: {num_pmids_pubmed_only}')

Nombre de nouveaux PMIDs trouvés via Pubmed: 30


In [52]:
assert num_pmids_complete - num_pmids_ct == num_pmids_pubmed_only

In [53]:
print('NCTId des nouveaux PMIDs trouvés via Pubmed:')
{k: v for k, v in pmids_pubmed_only_dict.items() if v != set()}

NCTId des nouveaux PMIDs trouvés via Pubmed:


{'NCT02777229': {'37851566', '38156046'},
 'NCT05349162': {'36735263'},
 'NCT01703962': {'37668523'},
 'NCT01453192': {'30688008'},
 'NCT01473472': {'36601747'},
 'NCT03335995': {'37497675'},
 'NCT02212379': {'31269208'},
 'NCT01426243': {'26314624'},
 'NCT03870438': {'38484756'},
 'NCT00640263': {'34425825'},
 'NCT05311865': {'36438274', '37795682'},
 'NCT03078439': {'38408861'},
 'NCT04315948': {'36695483', '38552208'},
 'NCT03215732': {'37143029'},
 'NCT01089387': {'26439886'},
 'NCT03005652': {'38100477'},
 'NCT02405013': {'36686592'},
 'NCT01801618': {'29662875'},
 'NCT02057796': {'36883573'},
 'NCT02150993': {'38740027'},
 'NCT04392388': {'34293141'},
 'NCT02833961': {'36318030'},
 'NCT04288128': {'38419144', '38421662'},
 'NCT04409405': {'38043556'},
 'NCT02481453': {'38273639'},
 'NCT01688453': {'35272723'}}

In [54]:
num_nctid_empty_ct = sum((1 for v in pmids_ct_dict.values() if v == set()))
print(f"Nombre d'études sans PMIDs issus de CT: {num_nctid_empty_ct}")

Nombre d'études sans PMIDs issus de CT: 118


In [55]:
num_nctid_empty_pubmed = sum((1 for v in pmids_complete_dict.values() if v == set()))
print(f"Nombre d'études sans PMIDs après consultation PubMed: {num_nctid_empty_pubmed}")

Nombre d'études sans PMIDs après consultation PubMed: 105


In [56]:
print("NCTIds qui n'avait aucun PMIDs sous CT, mais enrichis via Pubmed:")
nctids_previously_empty = {k for k, v in pmids_ct_dict.items() if v == set()} - {k for k, v in pmids_complete_dict.items() if v == set()}
{k: pmids_complete_dict[k] for k in nctids_previously_empty}

NCTIds qui n'avait aucun PMIDs sous CT, mais enrichis via Pubmed:


{'NCT01801618': {'29662875'},
 'NCT01703962': {'37668523'},
 'NCT02833961': {'36318030'},
 'NCT02405013': {'36686592'},
 'NCT02150993': {'38740027'},
 'NCT02212379': {'31269208'},
 'NCT04288128': {'38419144', '38421662'},
 'NCT01453192': {'30688008'},
 'NCT04409405': {'38043556'},
 'NCT03078439': {'38408861'},
 'NCT05311865': {'36438274', '37795682'},
 'NCT04392388': {'34293141'},
 'NCT05349162': {'36735263'}}

In [57]:
len(nctids_previously_empty)

13

In [58]:
assert num_nctid_empty_ct - num_nctid_empty_pubmed == len(nctids_previously_empty)

### Enriching the PMIDs via the PubMed API

Each retrieved PMID is enriched with PubMed data such as the title, the authors, etc.:
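
Among the fields returned for a publication, the DOI is not a top-level key: it sits in a list of identifiers, each with an `idtype` and a `value`. The sketch below isolates that lookup in a hypothetical helper `find_doi`, run on a made-up record shaped like the `articleids` entries processed in the cell below:

```python
def find_doi(publication):
    """Return the DOI listed in `articleids`, or None when the record has none."""
    return next(
        (
            entry['value']
            for entry in publication.get('articleids', [])
            if entry.get('idtype') == 'doi'
        ),
        None,
    )


# Made-up record for illustration only
sample_publication = {
    'articleids': [
        {'idtype': 'pubmed', 'value': '12345'},
        {'idtype': 'doi', 'value': '10.1000/example'},
    ],
}

print(find_doi(sample_publication))
print(find_doi({'articleids': [{'idtype': 'pubmed', 'value': '12345'}]}))
```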

In [59]:
# If the list of authors is empty, return None; otherwise concatenate the values of the list
# in the form "author0; author1 ..."
def concatenate_authors_list(authors_list):
    if authors_list == []:
        return None
    else:
        return '; '.join([author['name'] for author in authors_list])

In [60]:
# We build a list of dictionaries, each containing an NCTId and the publications associated with it
# total_publications_list = [{'NCTId': '...', 'publications': [{'pmid': '...', 'title': '...'}, {'pmid': '...', 'title': '...'}]}, {'NCTId': ...}]
total_publications_list = []

books_list = []
counter = 0  # To keep track of progress of pmids
session = requests.Session()

# For each NCTId...
for nctid, pmids in pmids_complete_dict.items():
    # [{'pmid': '...', 'title': '...'}, {'pmid': '...', 'title': '...'}]
    publications_list = []

    # If the set of pmids is not empty
    if (pmids != set()):
        # Query Pubmed's API with several pmids at the same time
        pmids_str = ','.join(pmids)
        # Double quotes outside: nesting the same quote type in an f-string requires Python 3.12+
        query_url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&retmode=json&id={pmids_str}&api_key={os.getenv('NCBI_API_KEY')}"
        response = session.get(query_url)
        response.raise_for_status()
        result = response.json()['result']
    
        # We process each PMID...
        for pmid in result['uids']:
            # Check that the pmid returned by Pubmed was part of the query
            assert (pmid in pmids)
            
            publication = result[pmid]
            
            # Display the progress on a single line
            print(f'\r{counter+1} / {num_pmids_complete}...', end='', flush=True)

            # We only keep records whose doctype is 'citation'; other doctypes (e.g. 'book') are set aside
            # TODO: book special case ?
            if publication['doctype'] == 'citation':
                
                # Find the DOI amongst the different ids
                doi = None
                for id in publication['articleids']:
                    if id['idtype'] == 'doi':
                        doi = id['value']
                        break

                # Add the infos for the publication
                publications_list.append(
                    {
                        'pmid': pmid,
                        'title': publication['title'],
                        'authors' : concatenate_authors_list(publication['authors']),
                        'doi': doi,
                        'date': publication['sortpubdate'],
                        'publication_types': publication['pubtype'],
                    }
                )
            else:
                # doctype other than 'citation': recorded in books_list and otherwise ignored
                books_list.append((nctid, pmid))

            counter += 1

    publication_dict = {'NCTId': nctid, 'publications': publications_list}
    total_publications_list.append(publication_dict)

# total_publications_list

394 / 394...

In [61]:
print(books_list)

[('NCT03537196', '27227200')]


In [62]:
# print(json.dumps(total_publications_list, indent=2))

Publications without a DOI:

In [63]:
# PMID without DOI
for study in total_publications_list:
    for publi in study['publications']:
        if publi['doi'] is None:
            print(f"PMID without DOI: {publi['pmid']}")

PMID without DOI: 19839502


In [64]:
# The number of NCTId didn't change
assert len(total_publications_list) == len(studies_list)

### Import into Pandas

In [65]:
df_pubmed = pd.DataFrame.from_records(total_publications_list)
df_pubmed

Unnamed: 0,NCTId,publications
0,NCT02777229,"[{'pmid': '33355914', 'title': 'Cost-Utility A..."
1,NCT01882062,"[{'pmid': '25568297', 'title': 'Triheptanoin i..."
2,NCT03671291,[]
3,NCT05349162,"[{'pmid': '36735263', 'title': 'Epicardial vs...."
4,NCT01703962,"[{'pmid': '37668523', 'title': 'Neurochemical ..."
...,...,...
195,NCT01688453,"[{'pmid': '34087324', 'title': 'Sociodemograph..."
196,NCT02658253,"[{'pmid': '32032566', 'title': 'PRIMVAC vaccin..."
197,NCT01842477,[]
198,NCT02172677,[]


#### Exploding the "publications" column:

For each NCTId, the 'publications' column may contain a list of several publications.  
If, for example, there are 3 publications, we want to end up with 3 rows, each carrying the same NCTId and holding a single publication.  

*Before*:  
**`NCTId    Reference`**  
`NCT0001, [Pub1, Pub2, Pub3]`   

*After*:  
**`NCTId    Reference`**  
`NCT0001,   Pub1`  
`NCT0001,   Pub2`  
`NCT0001,   Pub3`   
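
As a toy illustration of this reshaping, the sketch below uses made-up data (`df_toy` and its contents are hypothetical, not taken from the study). It also shows what happens to a study with an empty publication list: `explode` keeps it as a single row whose value is missing.

```python
import pandas as pd

# Toy data (hypothetical): one study with three publications, one with none
df_toy = pd.DataFrame(
    {
        'NCTId': ['NCT0001', 'NCT0002'],
        'publications': [['Pub1', 'Pub2', 'Pub3'], []],
    }
)

# One row per list element; an empty list becomes a single row with a missing value
df_exploded = df_toy.explode('publications', ignore_index=True)
print(df_exploded)
```

This is why studies without any publication remain present in the exploded DataFrame, with an empty 'publications' cell.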

In [66]:
df_pubmed = df_pubmed.explode('publications', ignore_index=True)
df_pubmed

Unnamed: 0,NCTId,publications
0,NCT02777229,"{'pmid': '33355914', 'title': 'Cost-Utility An..."
1,NCT02777229,"{'pmid': '33010241', 'title': 'Dolutegravir-ba..."
2,NCT02777229,"{'pmid': '38156046', 'title': 'Durability of t..."
3,NCT02777229,"{'pmid': '37851566', 'title': 'Improvements in..."
4,NCT02777229,"{'pmid': '31339676', 'title': 'Dolutegravir-Ba..."
...,...,...
493,NCT02172677,
494,NCT01037777,"{'pmid': '32822634', 'title': 'Conversion of i..."
495,NCT01037777,"{'pmid': '35264424', 'title': 'Levels of Neuro..."
496,NCT01037777,"{'pmid': '23707147', 'title': 'Biological and ..."


We check that PubMed and CT together yield at least as many publications as CT alone:

In [67]:
assert len(df_pubmed) >= len(df_ct)

In each row, the 'publications' column now contains a dictionary of the form:  
`{  
"pmid": "17545707",  
"title": "Haematological ...",  
"authors": "Smith DJ; ...",  
}`

From which we want to extract new columns, using the dictionary keys as column names:  
**`pmid      title                 authors`**  
`17545707, "Haematological ...", "Smith DJ; ..."`
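
A minimal sketch of this extraction on toy data follows. The second record and the name `df_toy` are hypothetical, and the column is converted to a plain list before normalization, which is a choice made for this example only.

```python
import pandas as pd

# Toy data: the second record is hypothetical
df_toy = pd.DataFrame(
    {
        'NCTId': ['NCT0001', 'NCT0001'],
        'publications': [
            {'pmid': '17545707', 'title': 'Haematological ...', 'authors': 'Smith DJ; ...'},
            {'pmid': '00000001', 'title': 'Hypothetical title', 'authors': 'Doe J'},
        ],
    }
)

# pop() removes the column from df_toy and returns it as a Series
publications = df_toy.pop('publications')

# One column per dictionary key; the result gets a fresh 0..n-1 index
df_columns = pd.json_normalize(publications.tolist())

# Both frames share the same 0..n-1 index, so join aligns them row by row
df_flat = df_toy.join(df_columns, validate='one_to_one')
print(df_flat)
```

The row-by-row alignment relies on both frames having the same default index, which is what `ignore_index=True` guarantees after an `explode`.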

In [68]:
df_pubmed_publications = pd.json_normalize(df_pubmed.pop('publications'))
df_pubmed_publications

Unnamed: 0,pmid,title,authors,doi,date,publication_types
0,33355914,Cost-Utility Analysis of a Dolutegravir-Based ...,Bousmah MA; Nishimwe ML; Tovar-Sanchez T; Lant...,10.1007/s40273-020-00987-3,2021/03/01 00:00,[Journal Article]
1,33010241,Dolutegravir-based and low-dose efavirenz-base...,Calmy A; Tovar Sanchez T; Kouanfack C; Mpoudi-...,10.1016/S2352-3018(20)30238-1,2020/10/01 00:00,"[Journal Article, Multicenter Study, Randomize..."
2,38156046,Durability of the Efficacy and Safety of Dolut...,Mpoudi-Etame M; Tovar Sanchez T; Bousmah MA; O...,10.1093/ofid/ofad582,2023/11/20 00:00,[Journal Article]
3,37851566,Improvements in Patient-Reported Outcomes Foll...,Bousmah MA; Protopopescu C; Mpoudi-Etame M; Om...,10.1097/QAI.0000000000003273,2023/11/01 00:00,"[Randomized Controlled Trial, Journal Article]"
4,31339676,Dolutegravir-Based or Low-Dose Efavirenz-Based...,NAMSAL ANRS 12313 Study Group; Kouanfack C; Mp...,10.1056/NEJMoa1904340,2019/08/29 00:00,"[Journal Article, Multicenter Study, Randomize..."
...,...,...,...,...,...,...
493,,,,,,
494,32822634,Conversion of individuals at risk for spinocer...,Jacobi H; du Montcel ST; Romanzetti S; Harmuth...,10.1016/S1474-4422(20)30235-0,2020/09/01 00:00,"[Clinical Trial, Journal Article, Multicenter ..."
495,35264424,Levels of Neurofilament Light at the Preataxic...,Wilke C; Mengel D; Schöls L; Hengel H; Rakowic...,10.1212/WNL.0000000000200257,2022/05/17 00:00,[Journal Article]
496,23707147,Biological and clinical characteristics of ind...,Jacobi H; Reetz K; du Montcel ST; Bauer P; Mar...,10.1016/S1474-4422(13)70104-2,2013/07/01 00:00,[Journal Article]


We reassemble the complete DataFrame:

In [69]:
df_pubmed = df_pubmed.join(
    df_pubmed_publications,
    validate = 'one_to_one',
)
df_pubmed

Unnamed: 0,NCTId,pmid,title,authors,doi,date,publication_types
0,NCT02777229,33355914,Cost-Utility Analysis of a Dolutegravir-Based ...,Bousmah MA; Nishimwe ML; Tovar-Sanchez T; Lant...,10.1007/s40273-020-00987-3,2021/03/01 00:00,[Journal Article]
1,NCT02777229,33010241,Dolutegravir-based and low-dose efavirenz-base...,Calmy A; Tovar Sanchez T; Kouanfack C; Mpoudi-...,10.1016/S2352-3018(20)30238-1,2020/10/01 00:00,"[Journal Article, Multicenter Study, Randomize..."
2,NCT02777229,38156046,Durability of the Efficacy and Safety of Dolut...,Mpoudi-Etame M; Tovar Sanchez T; Bousmah MA; O...,10.1093/ofid/ofad582,2023/11/20 00:00,[Journal Article]
3,NCT02777229,37851566,Improvements in Patient-Reported Outcomes Foll...,Bousmah MA; Protopopescu C; Mpoudi-Etame M; Om...,10.1097/QAI.0000000000003273,2023/11/01 00:00,"[Randomized Controlled Trial, Journal Article]"
4,NCT02777229,31339676,Dolutegravir-Based or Low-Dose Efavirenz-Based...,NAMSAL ANRS 12313 Study Group; Kouanfack C; Mp...,10.1056/NEJMoa1904340,2019/08/29 00:00,"[Journal Article, Multicenter Study, Randomize..."
...,...,...,...,...,...,...,...
493,NCT02172677,,,,,,
494,NCT01037777,32822634,Conversion of individuals at risk for spinocer...,Jacobi H; du Montcel ST; Romanzetti S; Harmuth...,10.1016/S1474-4422(20)30235-0,2020/09/01 00:00,"[Clinical Trial, Journal Article, Multicenter ..."
495,NCT01037777,35264424,Levels of Neurofilament Light at the Preataxic...,Wilke C; Mengel D; Schöls L; Hengel H; Rakowic...,10.1212/WNL.0000000000200257,2022/05/17 00:00,[Journal Article]
496,NCT01037777,23707147,Biological and clinical characteristics of ind...,Jacobi H; Reetz K; du Montcel ST; Bauer P; Mar...,10.1016/S1474-4422(13)70104-2,2013/07/01 00:00,[Journal Article]


### Joining the CT and PubMed DataFrames:

In [70]:
df_final = df_ct.merge(
    df_pubmed, 
    on = ['NCTId', 'pmid'], 
    how = 'right',
    validate = 'one_to_one',
)

# Remove 'citation' column
df_final = df_final.drop(columns='citation')

df_final

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,pmid,type,title,authors,doi,date,publication_types
0,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,33355914,DERIVED,Cost-Utility Analysis of a Dolutegravir-Based ...,Bousmah MA; Nishimwe ML; Tovar-Sanchez T; Lant...,10.1007/s40273-020-00987-3,2021/03/01 00:00,[Journal Article]
1,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,33010241,DERIVED,Dolutegravir-based and low-dose efavirenz-base...,Calmy A; Tovar Sanchez T; Kouanfack C; Mpoudi-...,10.1016/S2352-3018(20)30238-1,2020/10/01 00:00,"[Journal Article, Multicenter Study, Randomize..."
2,NCT02777229,,,,,,,,,,38156046,,Durability of the Efficacy and Safety of Dolut...,Mpoudi-Etame M; Tovar Sanchez T; Bousmah MA; O...,10.1093/ofid/ofad582,2023/11/20 00:00,[Journal Article]
3,NCT02777229,,,,,,,,,,37851566,,Improvements in Patient-Reported Outcomes Foll...,Bousmah MA; Protopopescu C; Mpoudi-Etame M; Om...,10.1097/QAI.0000000000003273,2023/11/01 00:00,"[Randomized Controlled Trial, Journal Article]"
4,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,31339676,DERIVED,Dolutegravir-Based or Low-Dose Efavirenz-Based...,NAMSAL ANRS 12313 Study Group; Kouanfack C; Mp...,10.1056/NEJMoa1904340,2019/08/29 00:00,"[Journal Article, Multicenter Study, Randomize..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
493,NCT02172677,The Influence of Collective Schemas on Individ...,Institut National de la Santé Et de la Recherc...,La Région Basse-Normandie | Université de Caen...,COMPLETED,INTERVENTIONAL,False,2014-06-24,2014-10,2016-10-14,,,,,,,
494,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,32822634,DERIVED,Conversion of individuals at risk for spinocer...,Jacobi H; du Montcel ST; Romanzetti S; Harmuth...,10.1016/S1474-4422(20)30235-0,2020/09/01 00:00,"[Clinical Trial, Journal Article, Multicenter ..."
495,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,35264424,DERIVED,Levels of Neurofilament Light at the Preataxic...,Wilke C; Mengel D; Schöls L; Hengel H; Rakowic...,10.1212/WNL.0000000000200257,2022/05/17 00:00,[Journal Article]
496,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,23707147,DERIVED,Biological and clinical characteristics of ind...,Jacobi H; Reetz K; du Montcel ST; Bauer P; Mar...,10.1016/S1474-4422(13)70104-2,2013/07/01 00:00,[Journal Article]


The new PMIDs found only via PubMed have none of the associated CT information: BriefTitle, LeadSponsorName, etc.
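
One possible way to make these PubMed-only rows explicit is the `indicator` option of `merge`, which the notebook itself does not use. The sketch below runs on hypothetical toy frames (`df_ct_toy`, `df_pubmed_toy`) and is only meant to illustrate the behaviour of a right join.

```python
import pandas as pd

# Toy frames (hypothetical): CT knows one publication, PubMed knows two
df_ct_toy = pd.DataFrame(
    {'NCTId': ['NCT0001'], 'pmid': ['111'], 'BriefTitle': ['Toy study']}
)
df_pubmed_toy = pd.DataFrame(
    {
        'NCTId': ['NCT0001', 'NCT0001'],
        'pmid': ['111', '222'],
        'title': ['Paper A', 'Paper B'],
    }
)

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
df_merged = df_ct_toy.merge(
    df_pubmed_toy,
    on=['NCTId', 'pmid'],
    how='right',
    indicator=True,
)

# Rows present only on the right side come from PubMed exclusively
pubmed_only = df_merged['_merge'] == 'right_only'
print(df_merged.loc[pubmed_only])
```

Rows flagged `right_only` have no CT columns filled in, which is the situation described above.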

In [71]:
# Index of empty rows we need to fill
index_empty_rows = df_final.loc[:, 'BriefTitle'].isna()

# Columns we need to fill
columns_to_fill = [
    'BriefTitle',
    'LeadSponsorName',
    'CollaboratorName',
    'OverallStatus',
    'StudyType',
    'HasResults',
    'StudyFirstPostDate',
    'StartDate',
    'CompletionDate',
    'type',
]

df_final.loc[index_empty_rows, ['NCTId'] + columns_to_fill].head(10)

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,type
2,NCT02777229,,,,,,,,,,
3,NCT02777229,,,,,,,,,,
7,NCT05349162,,,,,,,,,,
8,NCT01703962,,,,,,,,,,
15,NCT01453192,,,,,,,,,,
52,NCT01473472,,,,,,,,,,
61,NCT03335995,,,,,,,,,,
91,NCT02212379,,,,,,,,,,
94,NCT01426243,,,,,,,,,,
143,NCT03870438,,,,,,,,,,


We add a 'PUBMED' type for the PMIDs that come from PubMed only:

In [72]:
# Add new 'PUBMED' category
df_final['type'] = df_final['type'].cat.add_categories('PUBMED')

# We add a 'PUBMED' type to the PMIDs extracted from Pubmed exclusively
df_final.loc[index_empty_rows, 'type'] = 'PUBMED'

In [73]:
df_final.loc[index_empty_rows, ['NCTId'] + columns_to_fill].head(10)

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,type
2,NCT02777229,,,,,,,,,,PUBMED
3,NCT02777229,,,,,,,,,,PUBMED
7,NCT05349162,,,,,,,,,,PUBMED
8,NCT01703962,,,,,,,,,,PUBMED
15,NCT01453192,,,,,,,,,,PUBMED
52,NCT01473472,,,,,,,,,,PUBMED
61,NCT03335995,,,,,,,,,,PUBMED
91,NCT02212379,,,,,,,,,,PUBMED
94,NCT01426243,,,,,,,,,,PUBMED
143,NCT03870438,,,,,,,,,,PUBMED


We fill these columns with the information contained in the CT DataFrame:

In [74]:
# NCTIds of empty rows
NCTIds_empty_rows = df_final.loc[index_empty_rows, 'NCTId']

# Columns we wish to copy
columns_to_copy = [
    'BriefTitle',
    'LeadSponsorName',
    'CollaboratorName',
    'OverallStatus',
    'StudyType',
    'HasResults',
    'StudyFirstPostDate',
    'StartDate',
    'CompletionDate',
]

# We copy the missing values from the CT dataframe
for index, nctid in NCTIds_empty_rows.items():
    # For each NCTId, we look in the CT dataframe for the first row with this NCTId
    # and copy the missing columns
    df_final.loc[index, columns_to_copy] = df_ct.loc[
        df_ct.loc[:, 'NCTId'] == nctid, columns_to_copy
    ].iloc[0]
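
The loop above copies the values one row at a time. As a hedged alternative, the same fill could be sketched with a per-study lookup table and `Series.map`. The frames and the name `lookup` below are hypothetical, and category or nullable dtypes in the real data may need extra care.

```python
import pandas as pd

# Toy frames (hypothetical): the second and third rows lack their CT information
df_ct_toy = pd.DataFrame(
    {
        'NCTId': ['NCT0001', 'NCT0001', 'NCT0002'],
        'BriefTitle': ['Study A', 'Study A', 'Study B'],
    }
)
df_final_toy = pd.DataFrame(
    {
        'NCTId': ['NCT0001', 'NCT0001', 'NCT0002'],
        'BriefTitle': ['Study A', None, None],
    }
)

# One row per study, indexed by NCTId (first occurrence kept)
lookup = df_ct_toy.drop_duplicates('NCTId').set_index('NCTId')

# For each column, fill the gaps with the value looked up from the study's NCTId
for column in ['BriefTitle']:
    looked_up = df_final_toy['NCTId'].map(lookup[column])
    df_final_toy[column] = df_final_toy[column].fillna(looked_up)

print(df_final_toy)
```

Existing values are left untouched because `fillna` only replaces missing entries.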

In [75]:
df_final.loc[index_empty_rows, ['NCTId'] + columns_to_fill].head(10)

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,type
2,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,PUBMED
3,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,PUBMED
7,NCT05349162,Epicardial vs. Transvenous ICDs in Children,Paris Cardiovascular Research Center (Inserm U...,Hôpital Necker-Enfants Malades,COMPLETED,OBSERVATIONAL,False,2022-04-27,2003-01-01,2022-04-01,PUBMED
8,NCT01703962,Non Invasive IDentification of Gliomas With ID...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2012-10-11,2012-03-14,2014-03-20,PUBMED
15,NCT01453192,Renal Transplantation and Raltegravir in HIV-I...,"ANRS, Emerging Infectious Diseases",Merck Sharp & Dohme LLC,COMPLETED,INTERVENTIONAL,False,2011-10-17,2011-12,2015-11,PUBMED
52,NCT01473472,On Demand Antiretroviral Pre-exposure Prophyla...,"ANRS, Emerging Infectious Diseases",,COMPLETED,INTERVENTIONAL,False,2011-11-17,2012-01,2016-12-15,PUBMED
61,NCT03335995,Stroke Prognosis in Intensive CarE,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2017-11-08,2017-10-18,2020-11-18,PUBMED
91,NCT02212379,Capacity of the Dual Combination Raltegravir/E...,"ANRS, Emerging Infectious Diseases",Merck Sharp & Dohme LLC | Janssen-Cilag Ltd.,COMPLETED,INTERVENTIONAL,True,2014-08-08,2015-01,2018-04,PUBMED
94,NCT01426243,The Yellow Fever Vaccine Immunity in HIV Infec...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2011-08-31,2011-07,2017-12,PUBMED
143,NCT03870438,Prevention of Mother-to-child Transmission of ...,"ANRS, Emerging Infectious Diseases","University Teaching Hospital, Lusaka, Zambia |...",COMPLETED,INTERVENTIONAL,False,2019-03-12,2019-12-14,2022-10-31,PUBMED


# Checking the Open Access status via the Unpaywall API:

Keep in mind that the DOI may be missing:

In [76]:
# df_from_excel.loc[df_from_excel.loc[:, 'doi'].isna()]
mask_doi_na = (df_final.loc[:, 'pmid'].notna() & df_final.loc[:, 'doi'].isna())
df_final.loc[mask_doi_na, ['NCTId', 'pmid', 'type', 'title', 'doi']]	

Unnamed: 0,NCTId,pmid,type,title,doi
358,NCT03537196,19839502,BACKGROUND,High prevalence of Hepatitis C virus genotype ...,


#### Sending the request:

Using the Unpaywall API requires including your email address in the request:

> **Authentication**  
> Requests must include your email as a parameter at the end of the URL, like this: api.unpaywall.org/my/request?email=YOUR_EMAIL.

To avoid spreading the email address too widely, rather than hard-coding it in the notebook, we export it as an environment variable, as was done for the PubMed API key.

Add `EMAIL_ADDRESS` to the environment variables with the following command in a terminal:

**Windows:** 

`setx EMAIL_ADDRESS "unpaywall_01@example.com"`

**Linux/macOS:**

`export EMAIL_ADDRESS=unpaywall_01@example.com`
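
As a minimal sketch of how the variable can then be used, the hypothetical helper below (`build_unpaywall_url` is not part of the notebook) assembles the query URL with the standard library, percent-encoding the DOI and the email. The fallback address is the placeholder used above, for illustration only.

```python
import os
from urllib.parse import quote, urlencode


def build_unpaywall_url(doi: str, email: str) -> str:
    """Hypothetical helper: build an Unpaywall v2 query URL for a DOI."""
    # Percent-encode the DOI but keep '/' so the prefix/suffix separator survives
    encoded_doi = quote(doi, safe='/')
    # urlencode takes care of characters such as '@' in the email address
    query = urlencode({'email': email})
    return f'https://api.unpaywall.org/v2/{encoded_doi}?{query}'


# Read the address from the environment; the fallback is only a placeholder
email = os.getenv('EMAIL_ADDRESS', 'unpaywall_01@example.com')
print(build_unpaywall_url('10.1093/ofid/ofad582', email))
```

Encoding the DOI is a precaution: some DOIs contain characters such as parentheses that are safer to escape in a URL path.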

In [77]:
assert os.getenv('EMAIL_ADDRESS', None) is not None

In [78]:
# counter = 0  # To keep track of progress
dois = df_final.loc[:, 'doi']
num_dois = len(dois)
session = requests.Session()

# TODO: Use dictionary {'NCTId': ..., 'doi': ..., 'is_oa': ...} instead of relying on the list order ?
is_oa_list = []
for i, doi in enumerate(dois):
    # Display the progress on a single line
    print(f'\r{i+1} / {num_dois}...', end='', flush=True)

    # Test if DOI is NaN
    if pd.isna(doi):
        # Add to the list (NaN, NaN)
        is_oa_list.append((doi, doi))
    else:
        # Query Unpaywall's API
        
        # TODO: remove example address
        query_url = f"https://api.unpaywall.org/v2/{doi}?email={os.getenv('EMAIL_ADDRESS')}"
        response = session.get(query_url)
        response.raise_for_status()
    
        # Add to the list ('doi', 'is_oa')
        is_oa_list.append((doi, response.json()['is_oa']))
# is_oa_list

498 / 498...

In [79]:
df_is_oa = pd.DataFrame(is_oa_list, columns=['doi', 'is_oa'])
df_is_oa

Unnamed: 0,doi,is_oa
0,10.1007/s40273-020-00987-3,True
1,10.1016/S2352-3018(20)30238-1,False
2,10.1093/ofid/ofad582,True
3,10.1097/QAI.0000000000003273,False
4,10.1056/NEJMoa1904340,True
...,...,...
493,,
494,10.1016/S1474-4422(20)30235-0,True
495,10.1212/WNL.0000000000200257,True
496,10.1016/S1474-4422(13)70104-2,False


In [80]:
df_with_oa_status = df_final.join(
    df_is_oa.loc[:, 'is_oa'].astype('boolean'),
    validate = 'one_to_one',
)
# df_with_oa_status.loc[:, ['NCTId', 'pmid', 'doi', 'is_oa']]
df_with_oa_status

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,pmid,type,title,authors,doi,date,publication_types,is_oa
0,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,33355914,DERIVED,Cost-Utility Analysis of a Dolutegravir-Based ...,Bousmah MA; Nishimwe ML; Tovar-Sanchez T; Lant...,10.1007/s40273-020-00987-3,2021/03/01 00:00,[Journal Article],True
1,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,33010241,DERIVED,Dolutegravir-based and low-dose efavirenz-base...,Calmy A; Tovar Sanchez T; Kouanfack C; Mpoudi-...,10.1016/S2352-3018(20)30238-1,2020/10/01 00:00,"[Journal Article, Multicenter Study, Randomize...",False
2,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,38156046,PUBMED,Durability of the Efficacy and Safety of Dolut...,Mpoudi-Etame M; Tovar Sanchez T; Bousmah MA; O...,10.1093/ofid/ofad582,2023/11/20 00:00,[Journal Article],True
3,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,37851566,PUBMED,Improvements in Patient-Reported Outcomes Foll...,Bousmah MA; Protopopescu C; Mpoudi-Etame M; Om...,10.1097/QAI.0000000000003273,2023/11/01 00:00,"[Randomized Controlled Trial, Journal Article]",False
4,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,31339676,DERIVED,Dolutegravir-Based or Low-Dose Efavirenz-Based...,NAMSAL ANRS 12313 Study Group; Kouanfack C; Mp...,10.1056/NEJMoa1904340,2019/08/29 00:00,"[Journal Article, Multicenter Study, Randomize...",True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
493,NCT02172677,The Influence of Collective Schemas on Individ...,Institut National de la Santé Et de la Recherc...,La Région Basse-Normandie | Université de Caen...,COMPLETED,INTERVENTIONAL,False,2014-06-24,2014-10,2016-10-14,,,,,,,,
494,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,32822634,DERIVED,Conversion of individuals at risk for spinocer...,Jacobi H; du Montcel ST; Romanzetti S; Harmuth...,10.1016/S1474-4422(20)30235-0,2020/09/01 00:00,"[Clinical Trial, Journal Article, Multicenter ...",True
495,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,35264424,DERIVED,Levels of Neurofilament Light at the Preataxic...,Wilke C; Mengel D; Schöls L; Hengel H; Rakowic...,10.1212/WNL.0000000000200257,2022/05/17 00:00,[Journal Article],True
496,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,23707147,DERIVED,Biological and clinical characteristics of ind...,Jacobi H; Reetz K; du Montcel ST; Bauer P; Mar...,10.1016/S1474-4422(13)70104-2,2013/07/01 00:00,[Journal Article],False


### Final result:

In [81]:
df_with_oa_status.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 498 entries, 0 to 497
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   NCTId               498 non-null    object  
 1   BriefTitle          498 non-null    string  
 2   LeadSponsorName     498 non-null    string  
 3   CollaboratorName    224 non-null    string  
 4   OverallStatus       498 non-null    category
 5   StudyType           498 non-null    category
 6   HasResults          498 non-null    boolean 
 7   StudyFirstPostDate  498 non-null    string  
 8   StartDate           498 non-null    string  
 9   CompletionDate      498 non-null    string  
 10  pmid                393 non-null    object  
 11  type                393 non-null    category
 12  title               393 non-null    object  
 13  authors             393 non-null    object  
 14  doi                 392 non-null    object  
 15  date                393 non-null    obje

# Export to CSV:

In [82]:
df_with_oa_status.to_csv(
    'Data/outputs/df_extract.csv',
    sep=';',
    index=False,
    encoding='utf-8-sig',
)