# Objective:

We want to assess how many publications result from clinical trials linked to INSERM.

In particular, we want to identify the clinical trials that led to no publication and try to understand why.

# Organization:
In this first notebook, the goal is to extract clinical trial data automatically:

- Using the *ClinicalTrials.gov* (*CT*) API, we retrieve the *IDs* (**NCTId**) of the clinical trials sponsored by *INSERM*, *ANRS*, etc.

- From these **NCTIds**, we retrieve from *CT* the **PMIDs** of the publications linked to these trials.  
    These publications are of 2 types:
    1. Uploaded to *CT* by the trial's authors: `BACKGROUND, RESULT`
    2. Automatically retrieved from PubMed by *CT*: `DERIVED`
- From these same **NCTIds**, we retrieve from *PubMed* the **PMIDs** of the publications linked to these trials.  
    This generally finds slightly more publications than the automatic matching performed by *CT*. 

- For each **NCTId**, we merge all the **PMIDs** returned by *CT* and *PubMed*.

- From this set of **PMIDs**, we retrieve the associated information: `title, authors, doi`...

- We save the result as a CSV file.

# Extracting NCTIds from ClinicalTrials.gov:

## <span style="color:red">Obsolete</span>
<span style="color:red">**API v1 has been unavailable since mid-2024.**  
**Go straight to [the API v2 section](#API-v2:)**</span>

## API v1:

To make it easier to retrieve data through the ClinicalTrials.gov API v1, we use the Python wrapper [pytrials](https://github.com/jvfe/pytrials).

Install ***pytrials***:
- In Anaconda Navigator, select the same environment as the one running Jupyter
- Launch Powershell Prompt in that environment
- Type: `pip install pytrials`

In [1]:
# from pytrials.client import ClinicalTrials
# import urllib.parse

In [2]:
# ct = ClinicalTrials()

### Building the query:

We build the query that will be sent to the ClinicalTrials.gov API.

#### Sponsors:

In [3]:
# sponsors = [
#     'anrs',
#     'inserm',
#     'institut national de la santé et de la recherche médicale',
#     'french national agency for research on aids and viral hepatitis',
# ]

In [4]:
# sponsors_expr = [f'AREA[LeadSponsorName]{sponsor}' for sponsor in sponsors]

# # Add OR keyword
# sponsors_expr = ' OR '.join(sponsors_expr)

# # Add parenthesis for correct interpretation of OR expression
# sponsors_expr = f'({sponsors_expr})'

# sponsors_expr

#### Status:

In [5]:
# status = 'completed'

In [6]:
# status_expr = f'AREA[OverallStatus]{status}'
# status_expr

#### Study completion date in 2013 or later:

<span style="color:red">**Update the date if needed**</span>

In [7]:
# date_expr = 'AREA[CompletionDate]RANGE[01/01/2013,MAX]'
# date_expr

#### Search Expression:

In [8]:
# search_expr = ' AND '.join([sponsors_expr, status_expr, date_expr])
# search_expr

#### URL encode: 

In [9]:
# search_expr_url_encode = urllib.parse.quote_plus(search_expr)
# search_expr_url_encode

#### Fields:

The fields we want to retrieve:

In [10]:
# fields = [
#     'NCTId',
#     'BriefTitle',
#     'OverallStatus',
#     'StudyType',
#     'LeadSponsorName',
#     'CollaboratorName',
#     'OrgStudyId',
#     'SecondaryId',
#     'StudyFirstPostDate',
#     'ReferencePMID',
#     'ReferenceCitation',
#     'ReferenceType',
# ]

### Sending the request:

In [11]:
# study_fields = ct.get_study_fields(
#     search_expr=search_expr_url_encode,
#     fields=fields,
#     max_studies=1000,
#     fmt='csv',
# )

In [12]:
# print(f'NStudiesReturned: {len(study_fields[1:])}')

### Reading the query result into Pandas:

In [13]:
# import pandas as pd

In [14]:
# pd.DataFrame.from_records(study_fields[1:], index='Rank', columns=study_fields[0])

## API v2:

API v1 is supported only until [mid-2024](https://clinicaltrials.gov/data-api/api):

>***Notice to API users:  
>The new ClinicalTrials.gov API, version 2.0 is available. Classic API users are strongly encouraged to switch to the modernized API. We will continue to support the classic API until mid-2024 and are planning blackouts for the spring to help with the transition to the modernized API.***

In addition, API v2 supports a new field, **"HasResults"**, which is rarely used so far but could prove useful later.

In return, CSV export is limited to a fixed set of fields, listed on this page: https://clinicaltrials.gov/data-api/about-api/csv-download

We therefore have to use the JSON export.

In [15]:
import urllib.parse

### Building the query:

Since `pytrials` is not compatible with v2, we send the request manually using [Requests](https://requests.readthedocs.io/en/latest/) 

#### Format:

In [16]:
format = 'json'

#### Sponsors:

In [17]:
sponsors = [
    'anrs',
    'inserm',
    'institut national de la santé et de la recherche médicale',
    'french national agency for research on aids and viral hepatitis',
]

In [18]:
sponsors_expr_v2 = ' OR '.join(sponsors)
sponsors_expr_v2 = urllib.parse.quote_plus(sponsors_expr_v2)
sponsors_expr_v2

'anrs+OR+inserm+OR+institut+national+de+la+sant%C3%A9+et+de+la+recherche+m%C3%A9dicale+OR+french+national+agency+for+research+on+aids+and+viral+hepatitis'

#### Overall_status:

In [19]:
overall_status = 'COMPLETED'

#### Fields:

In [20]:
fields_v2 = [
    'NCTId',
    'BriefTitle',
    # 'OfficialTitle',
    'OverallStatus',
    'StudyType',
    'LeadSponsorName',
    'CollaboratorName',
    # 'OrgStudyId',
    # 'SecondaryId',
    'StudyFirstPostDate',
    'StartDate',
    # 'PrimaryCompletionDate',
    'CompletionDate',
    'ReferencePMID',
    'ReferenceCitation',
    'ReferenceType',
    'hasResults',
]
fields_v2

['NCTId',
 'BriefTitle',
 'OverallStatus',
 'StudyType',
 'LeadSponsorName',
 'CollaboratorName',
 'StudyFirstPostDate',
 'StartDate',
 'CompletionDate',
 'ReferencePMID',
 'ReferenceCitation',
 'ReferenceType',
 'hasResults']

In [21]:
fields_expr_v2 = ','.join(fields_v2)
fields_expr_v2 = urllib.parse.quote_plus(fields_expr_v2)
fields_expr_v2

'NCTId%2CBriefTitle%2COverallStatus%2CStudyType%2CLeadSponsorName%2CCollaboratorName%2CStudyFirstPostDate%2CStartDate%2CCompletionDate%2CReferencePMID%2CReferenceCitation%2CReferenceType%2ChasResults'

#### Study completion date in 2013 or later:

<span style="color:red">**Update the date if needed**</span>

In [22]:
date = '01/01/2013'
date_expr = urllib.parse.quote_plus(f'AREA[CompletionDate]RANGE[{date}, MAX]')
date_expr

'AREA%5BCompletionDate%5DRANGE%5B01%2F01%2F2013%2C+MAX%5D'

#### Maximum number of results:

In [23]:
count_total = 'true'

In [24]:
page_size = 1000
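With 200 matching studies, a single page of 1000 is enough here. If a future query matched more studies than `page_size`, the remaining pages would have to be requested one by one. The sketch below shows one way to do that, assuming the v2 API returns a `nextPageToken` field and accepts it back as the `pageToken` parameter, as described in its documentation. The names `fetch_all_studies` and `make_fetch_page` are hypothetical helpers, not part of this notebook.

```python
def fetch_all_studies(fetch_page):
    """Collect the studies of every page.

    `fetch_page(page_token)` is a hypothetical callable returning the decoded
    JSON of one page; it receives None for the first page.
    """
    studies = []
    page_token = None
    while True:
        payload = fetch_page(page_token)
        studies.extend(payload.get('studies', []))
        # Assumption: the API omits 'nextPageToken' on the last page.
        page_token = payload.get('nextPageToken')
        if not page_token:
            return studies


def make_fetch_page(query_url):
    """Build a `fetch_page` callable around Requests for a given query URL."""
    import requests  # imported lazily so fetch_all_studies stays dependency-free

    def fetch_page(page_token):
        # Requests appends these parameters to the existing query string.
        params = {'pageToken': page_token} if page_token else {}
        response = requests.get(query_url, params=params, timeout=30)
        response.raise_for_status()
        return response.json()

    return fetch_page
```

It could then be called as `fetch_all_studies(make_fetch_page(query_url))` once `query_url` has been built.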

#### API URL:

In [25]:
query_url = f'https://clinicaltrials.gov/api/v2/studies?format={format}&query.lead={sponsors_expr_v2}&filter.overallStatus={overall_status}&fields={fields_expr_v2}&filter.advanced={date_expr}&countTotal={count_total}&pageSize={page_size}'
query_url

'https://clinicaltrials.gov/api/v2/studies?format=json&query.lead=anrs+OR+inserm+OR+institut+national+de+la+sant%C3%A9+et+de+la+recherche+m%C3%A9dicale+OR+french+national+agency+for+research+on+aids+and+viral+hepatitis&filter.overallStatus=COMPLETED&fields=NCTId%2CBriefTitle%2COverallStatus%2CStudyType%2CLeadSponsorName%2CCollaboratorName%2CStudyFirstPostDate%2CStartDate%2CCompletionDate%2CReferencePMID%2CReferenceCitation%2CReferenceType%2ChasResults&filter.advanced=AREA%5BCompletionDate%5DRANGE%5B01%2F01%2F2013%2C+MAX%5D&countTotal=true&pageSize=1000'
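As an alternative to encoding each piece by hand, the whole query string can be produced in one step with `urllib.parse.urlencode`, which applies `quote_plus` to every value by default. This is only a sketch: `build_query_url` and its parameter names are hypothetical, while the API parameters themselves are the ones used in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://clinicaltrials.gov/api/v2/studies'


def build_query_url(sponsors, fields, completed_since, page_size=1000):
    """Assemble the v2 query URL from raw (unencoded) values."""
    params = {
        'format': 'json',
        'query.lead': ' OR '.join(sponsors),
        'filter.overallStatus': 'COMPLETED',
        'fields': ','.join(fields),
        'filter.advanced': f'AREA[CompletionDate]RANGE[{completed_since}, MAX]',
        'countTotal': 'true',
        'pageSize': page_size,
    }
    # urlencode percent-encodes each value and joins the pairs with '&'
    return f'{BASE_URL}?{urlencode(params)}'
```

Keeping the raw values in a dictionary makes it easier to change one filter without touching the encoding logic.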

### Sending the request:

In [26]:
import requests

In [27]:
response = requests.get(query_url)
response.raise_for_status()
response

<Response [200]>

In [28]:
print(f'Studies returned: {response.json()["totalCount"]}')

Studies returned: 200


### Processing the returned JSON:

In [29]:
import json

In [30]:
# print(json.dumps(response.json(), indent=2))

***The JSON structure is far too nested to normalize directly with Pandas, so we flatten it by hand:***

From the JSON, we build an equivalent but much "flatter" dictionary for each study.

In [31]:
# Return None when the collaborator list is empty; otherwise join the names
# as "collaborator_0 | collaborator_1 | ..."
def concatenate_collaborator_list(collaborator_list):
    if not collaborator_list:
        return None
    return ' | '.join(collaborator_list)

In [32]:
studies_list = []
for study in response.json()['studies']:
    study_dict = {
        'NCTId': study['protocolSection']['identificationModule']['nctId'],
        'BriefTitle': study['protocolSection']['identificationModule']['briefTitle'],
        'LeadSponsorName': study['protocolSection']['sponsorCollaboratorsModule']['leadSponsor']['name'],
        'CollaboratorName': concatenate_collaborator_list(
            [
                c['name']
                for c in (
                    study['protocolSection']['sponsorCollaboratorsModule'].get('collaborators', [])  # can be missing
                )
            ]
        ),
        'OverallStatus': study['protocolSection']['statusModule']['overallStatus'],
        'StudyType': study['protocolSection']['designModule']['studyType'],
        'HasResults': study['hasResults'],
        'StudyFirstPostDate': study['protocolSection']['statusModule']['studyFirstPostDateStruct']['date'],
        'StartDate': study['protocolSection']['statusModule'].get('startDateStruct', {}).get('date', None),  # can be missing
        # 'PrimaryCompletionDate' : study["protocolSection"]["statusModule"].get('primaryCompletionDateStruct', {}).get('date', None), # can be missing
        'CompletionDate': study['protocolSection']['statusModule'].get('completionDateStruct', {}).get('date', None),  # can be missing
        'Reference': study['protocolSection'].get('referencesModule', {}).get('references', []),  # can be missing
    }
    studies_list.append(study_dict)

# print(json.dumps(studies_list, indent=2))
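The chained `.get(..., {})` calls above guard against optional modules. A small helper can express the same idea for any depth. The sketch below is one possible way to do it; `dig` is a hypothetical name, and the sample dictionary only mimics the structure used in the loop above.

```python
def dig(mapping, *keys, default=None):
    """Follow `keys` through nested dictionaries, returning `default`
    as soon as one level is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Toy study with the same nesting as the API response (values are made up)
study = {
    'protocolSection': {
        'statusModule': {'startDateStruct': {'date': '2019-04-03'}},
    },
}

start_date = dig(study, 'protocolSection', 'statusModule', 'startDateStruct', 'date')
references = dig(study, 'protocolSection', 'referencesModule', 'references', default=[])
```

Here `start_date` holds the nested date, and `references` falls back to an empty list because the toy study has no `referencesModule`.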

We check that no NCTId was lost along the way:

In [33]:
print(f'Nombre de NCTId: {len(studies_list)}')
assert response.json()['totalCount'] == len(studies_list)

Nombre de NCTId: 200


### Import into Pandas

In [34]:
import pandas as pd

In [35]:
df_ct = pd.json_normalize(data=studies_list)
df_ct

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,Reference
0,NCT03671291,Missed Opportunities to Pre-exposure Prophylax...,French National Agency for Research on AIDS an...,"University Hospital, Marseille | University Ho...",COMPLETED,INTERVENTIONAL,False,2018-09-14,2019-04-03,2021-10-03,[]
1,NCT00323804,Interest of Ribavirin in the Maintenance Treat...,"ANRS, Emerging Infectious Diseases",Merck Sharp & Dohme LLC | Rennes University Ho...,COMPLETED,INTERVENTIONAL,False,2006-05-10,2006-05,2013-03,[]
2,NCT02267304,"Double Blind Randomized, Monocentric, Cross-ov...",Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2014-10-17,2013-10-30,2016-08-30,[]
3,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,"[{'pmid': '33355914', 'type': 'DERIVED', 'cita..."
4,NCT01066962,Study of Darunavir/r + Tenofovir/Emtricitabine...,"ANRS, Emerging Infectious Diseases",NEAT - European AIDS Treatment Network,COMPLETED,INTERVENTIONAL,False,2010-02-10,2010-08,2013-10,"[{'pmid': '33794182', 'type': 'DERIVED', 'cita..."
...,...,...,...,...,...,...,...,...,...,...,...
195,NCT04120415,A HIV Vaccine Trial in Individuals Who Started...,"ANRS, Emerging Infectious Diseases",EuroVacc Foundation | European AIDS Treatment ...,COMPLETED,INTERVENTIONAL,False,2019-10-09,2022-06-21,2023-07-12,[]
196,NCT01269632,Cohort of Young Adults Infected With HIV Since...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2011-01-04,2010-06,2018-12-27,[]
197,NCT02102737,Comparison of A New Technique of Measure of th...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2014-04-03,2014-05-13,2018-03,[]
198,NCT01842477,Evaluation of Efficacy and Safety of Autologou...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2013-04-29,2013-05,2016-02-05,[]


#### "Exploding" the "Reference" column:

For each NCTId, the Reference column may contain a list of several references.  
If, for example, there are 3 references, we want to end up with 3 rows, each carrying the same NCTId and a single reference.  

*Before*:  
**`NCTId    Reference`**  
`NCT0001, [Ref1, Ref2, Ref3]`   

*After*:  
**`NCTId    Reference`**  
`NCT0001,   Ref1`  
`NCT0001,   Ref2`  
`NCT0001,   Ref3`   

In [36]:
df_ct = df_ct.explode('Reference', ignore_index=True)
df_ct.loc[:, ['NCTId', 'Reference']]

Unnamed: 0,NCTId,Reference
0,NCT03671291,
1,NCT00323804,
2,NCT02267304,
3,NCT02777229,"{'pmid': '33355914', 'type': 'DERIVED', 'citat..."
4,NCT02777229,"{'pmid': '33010241', 'type': 'DERIVED', 'citat..."
...,...,...
477,NCT01842477,
478,NCT01037777,"{'pmid': '35264424', 'type': 'DERIVED', 'citat..."
479,NCT01037777,"{'pmid': '32822634', 'type': 'DERIVED', 'citat..."
480,NCT01037777,"{'pmid': '24780882', 'type': 'DERIVED', 'citat..."


On each row, the Reference column now contains a dictionary of the form:  
`{  
"pmid": "17545707",  
"type": "BACKGROUND",  
"citation": "...",  
}`

We want to turn its keys into new columns:  
**`pmid     type        citation`**  
`17545707, BACKGROUND, "..."`

In [37]:
df_ct_references = pd.json_normalize(df_ct.pop('Reference'))
df_ct_references

Unnamed: 0,pmid,type,citation
0,,,
1,,,
2,,,
3,33355914,DERIVED,"Bousmah MA, Nishimwe ML, Tovar-Sanchez T, Lant..."
4,33010241,DERIVED,"Calmy A, Tovar Sanchez T, Kouanfack C, Mpoudi-..."
...,...,...,...
477,,,
478,35264424,DERIVED,"Wilke C, Mengel D, Schols L, Hengel H, Rakowic..."
479,32822634,DERIVED,"Jacobi H, du Montcel ST, Romanzetti S, Harmuth..."
480,24780882,DERIVED,"Tezenas du Montcel S, Durr A, Rakowicz M, Nane..."


We reassemble the complete DataFrame:

In [38]:
df_ct = df_ct.join(df_ct_references)
df_ct

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,pmid,type,citation
0,NCT03671291,Missed Opportunities to Pre-exposure Prophylax...,French National Agency for Research on AIDS an...,"University Hospital, Marseille | University Ho...",COMPLETED,INTERVENTIONAL,False,2018-09-14,2019-04-03,2021-10-03,,,
1,NCT00323804,Interest of Ribavirin in the Maintenance Treat...,"ANRS, Emerging Infectious Diseases",Merck Sharp & Dohme LLC | Rennes University Ho...,COMPLETED,INTERVENTIONAL,False,2006-05-10,2006-05,2013-03,,,
2,NCT02267304,"Double Blind Randomized, Monocentric, Cross-ov...",Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2014-10-17,2013-10-30,2016-08-30,,,
3,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,33355914,DERIVED,"Bousmah MA, Nishimwe ML, Tovar-Sanchez T, Lant..."
4,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,33010241,DERIVED,"Calmy A, Tovar Sanchez T, Kouanfack C, Mpoudi-..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
477,NCT01842477,Evaluation of Efficacy and Safety of Autologou...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2013-04-29,2013-05,2016-02-05,,,
478,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,35264424,DERIVED,"Wilke C, Mengel D, Schols L, Hengel H, Rakowic..."
479,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,32822634,DERIVED,"Jacobi H, du Montcel ST, Romanzetti S, Harmuth..."
480,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,24780882,DERIVED,"Tezenas du Montcel S, Durr A, Rakowicz M, Nane..."
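The three steps above (explode, normalize, join) can be tried on toy data. This is a minimal sketch with made-up identifiers; unlike the notebook, it replaces the missing values produced by `explode` with empty dictionaries before calling `pd.json_normalize`, which is a defensive choice rather than a requirement.

```python
import pandas as pd

# Two toy studies: one with two references, one with none
df = pd.DataFrame(
    {
        'NCTId': ['NCT0001', 'NCT0002'],
        'Reference': [
            [{'pmid': '111', 'type': 'DERIVED'}, {'pmid': '222', 'type': 'RESULT'}],
            [],
        ],
    }
)

# 1. One row per reference; an empty list becomes a single row with a missing value
exploded = df.explode('Reference', ignore_index=True)

# 2. Turn each dictionary into columns (missing values become empty dictionaries)
reference_dicts = [ref if isinstance(ref, dict) else {} for ref in exploded.pop('Reference')]
reference_columns = pd.json_normalize(reference_dicts)

# 3. Re-attach the new columns; both frames share the same 0..n-1 index
flat = exploded.join(reference_columns)
```

The study without references is kept as a single row whose `pmid` and `type` are missing, which is what later allows trials without publications to be counted.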


***Rebuild the index:***

In [39]:
# df_ct.set_index('NCTId', inplace=True)

***Set the dtypes:***

In [40]:
df_ct = df_ct.convert_dtypes()
# df_ct = df_ct.astype({"OverallStatus" : 'category', "StudyType" : 'category', "type": 'category',})
df_ct = df_ct.astype({'OverallStatus': 'category', 'StudyType': 'category'})
df_ct.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 482 entries, 0 to 481
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   NCTId               482 non-null    string  
 1   BriefTitle          482 non-null    string  
 2   LeadSponsorName     482 non-null    string  
 3   CollaboratorName    217 non-null    string  
 4   OverallStatus       482 non-null    category
 5   StudyType           482 non-null    category
 6   HasResults          482 non-null    boolean 
 7   StudyFirstPostDate  482 non-null    string  
 8   StartDate           482 non-null    string  
 9   CompletionDate      482 non-null    string  
 10  pmid                364 non-null    string  
 11  type                364 non-null    string  
 12  citation            364 non-null    string  
dtypes: boolean(1), category(2), string(10)
memory usage: 39.9 KB
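The three date columns remain typed as `string`: the registry reports dates at mixed precision (for example `2006-05` next to `2019-04-03`). If real dates were needed downstream, one option would be a small parser like the sketch below. `parse_partial_date` is a hypothetical helper, and defaulting a missing month or day to 1 is an assumption that may not suit every analysis.

```python
from datetime import date


def parse_partial_date(value):
    """Parse 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD' into a date.

    Missing components default to 1 (assumption: start of the period).
    """
    if value is None:
        return None
    parts = [int(part) for part in value.split('-')]
    parts += [1] * (3 - len(parts))  # pad month and/or day
    return date(*parts)
```

For instance, `parse_partial_date('2006-05')` yields 1 May 2006, while a full date is returned unchanged.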


In [41]:
df_ct

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,pmid,type,citation
0,NCT03671291,Missed Opportunities to Pre-exposure Prophylax...,French National Agency for Research on AIDS an...,"University Hospital, Marseille | University Ho...",COMPLETED,INTERVENTIONAL,False,2018-09-14,2019-04-03,2021-10-03,,,
1,NCT00323804,Interest of Ribavirin in the Maintenance Treat...,"ANRS, Emerging Infectious Diseases",Merck Sharp & Dohme LLC | Rennes University Ho...,COMPLETED,INTERVENTIONAL,False,2006-05-10,2006-05,2013-03,,,
2,NCT02267304,"Double Blind Randomized, Monocentric, Cross-ov...",Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2014-10-17,2013-10-30,2016-08-30,,,
3,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,33355914,DERIVED,"Bousmah MA, Nishimwe ML, Tovar-Sanchez T, Lant..."
4,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,33010241,DERIVED,"Calmy A, Tovar Sanchez T, Kouanfack C, Mpoudi-..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
477,NCT01842477,Evaluation of Efficacy and Safety of Autologou...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2013-04-29,2013-05,2016-02-05,,,
478,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,35264424,DERIVED,"Wilke C, Mengel D, Schols L, Hengel H, Rakowic..."
479,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,32822634,DERIVED,"Jacobi H, du Montcel ST, Romanzetti S, Harmuth..."
480,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,24780882,DERIVED,"Tezenas du Montcel S, Durr A, Rakowicz M, Nane..."


#### Export to CSV:

In [42]:
# df_ct.to_csv(
#     'Data/outputs/extract_CT_api_v2.csv',
#     sep=';',
#     index=False,
#     encoding='utf-8-sig',
# )

# PubMed

### Using a key for the PubMed API:

Using an API key for the PubMed API is recommended: it raises the limit to 10 requests per second.  
Without a key, the limit is 3 requests per second.  

> E-utils users are allowed 3 requests/second without an API key. Create an API key to increase your e-utils limit to 10 requests/second.

**In practice, the PubMed API is much slower than that (~1 request per second), so the key does not seem to make much difference.**

To get a key, log in and go to:
https://account.ncbi.nlm.nih.gov/settings/

Once you have the key, add it to your environment variables with the following command:

**Windows:** 

`setx NCBI_API_KEY "123456"`

**Linux/macOS:**

`export NCBI_API_KEY=123456`

In [43]:
import os

assert os.getenv('NCBI_API_KEY', None) is not None
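The limits quoted above translate into a minimum delay between two requests. The sketch below shows one way to enforce such a delay when calling the API directly; it is not needed if the client library already paces its requests. `ncbi_min_interval` and `Throttle` are hypothetical names, not part of `metapub`.

```python
import time


def ncbi_min_interval(api_key):
    """Seconds between requests: 10 requests/second with a key, 3 without."""
    return 1 / 10 if api_key else 1 / 3


class Throttle:
    """Block so that successive calls to wait() are at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable for testing
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A typical use would be to create `Throttle(ncbi_min_interval(os.getenv('NCBI_API_KEY')))` once and call its `wait()` method before each request.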

### Install metapub:
- In Anaconda Navigator, select the same environment as the one running Jupyter  
- Launch Powershell Prompt in that environment  
- Type the following command:  `pip install metapub`

In [44]:
from metapub import PubMedFetcher
from metapub.exceptions import InvalidPMID

fetch = PubMedFetcher(cachedir='./.cache/')

### Retrieving PMIDs via PubMed:

For each NCTId from CT, we retrieve the PMIDs of the associated publications via PubMed:

In [45]:
# Unique list of the NCTIds extracted from ClinicalTrials.gov
nctid_array = df_ct.loc[:, 'NCTId'].unique()

pmids_pubmed_dict = {}
for i, nctid in enumerate(nctid_array):
    # Display the progress on a single line
    print(f'\r{i+1}/{len(nctid_array)}...', end='', flush=True)

    pmids = [pmid for pmid in fetch.pmids_for_query(nctid)]
    pmids_pubmed_dict[nctid] = set(pmids)

200/200...

In [46]:
# pmids_pubmed_dict

**We want to merge the PMIDs just retrieved from PubMed with those already retrieved via CT.**

We put the CT PMIDs into the same form:

In [47]:
pmids_ct_dict = {}
for nctid in nctid_array:
    pmids = df_ct[df_ct.loc[:, 'NCTId'] == nctid].loc[:, 'pmid'].dropna()
    pmids_ct_dict[nctid] = set(pmids)
# pmids_ct_dict

For a given NCTId, we take the union of the two sets of PMIDs:

In [48]:
pmids_complete_dict = {}
for nctid in nctid_array:
    # Union of the PMIDs found in PubMed and in CT
    pmids_complete_dict[nctid] = pmids_pubmed_dict[nctid] | pmids_ct_dict[nctid]
# pmids_complete_dict
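The merge relies on two set operations: union (`|`) for the combined list and difference (`-`) for what PubMed adds. The sketch below reproduces both on made-up identifiers; `merge_pmids` is a hypothetical helper that groups the two steps.

```python
def merge_pmids(ct_pmids, pubmed_pmids):
    """Return (merged, pubmed_only) dictionaries keyed by NCTId.

    Both inputs map an NCTId to a set of PMIDs; a missing NCTId is
    treated as an empty set.
    """
    nctids = ct_pmids.keys() | pubmed_pmids.keys()
    merged = {}
    pubmed_only = {}
    for nctid in nctids:
        from_ct = ct_pmids.get(nctid, set())
        from_pubmed = pubmed_pmids.get(nctid, set())
        merged[nctid] = from_ct | from_pubmed       # found in either source
        pubmed_only[nctid] = from_pubmed - from_ct  # found in PubMed but not in CT
    return merged, pubmed_only


# Made-up identifiers, for illustration only
ct = {'NCT0001': {'111', '222'}, 'NCT0002': set()}
pubmed = {'NCT0001': {'222', '333'}, 'NCT0002': {'444'}}
merged, pubmed_only = merge_pmids(ct, pubmed)
```

Because sets ignore duplicates, a PMID reported by both sources is counted once per NCTId, so the size of the union equals the number of CT PMIDs plus the number of PubMed-only PMIDs.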

#### Checks:

In [49]:
num_pmids_ct = sum((len(v) for v in pmids_ct_dict.values()))
print(f'Nombre total de publications issus de CT: {num_pmids_ct}')

Nombre total de publications issus de CT: 364


In [50]:
num_pmids_complete = sum((len(v) for v in pmids_complete_dict.values()))
print(f'Nombre total de publications après consultation PubMed: {num_pmids_complete}')

Nombre total de publications après consultation PubMed: 393


In [51]:
pmids_pubmed_only_dict = {}
for nctid in nctid_array:
    # PMIDs found in PubMed only
    pmids_pubmed_only_dict[nctid] = pmids_pubmed_dict[nctid] - pmids_ct_dict[nctid]

In [52]:
num_pmids_pubmed_only = sum((len(v) for v in pmids_pubmed_only_dict.values()))
print(f'Nombre de nouveaux PMIDs trouvés via Pubmed: {num_pmids_pubmed_only}')

Nombre de nouveaux PMIDs trouvés via Pubmed: 29


In [53]:
assert num_pmids_complete - num_pmids_ct == num_pmids_pubmed_only

In [54]:
print('NCTId des nouveaux PMIDs trouvés via Pubmed:')
{k: v for k, v in pmids_pubmed_only_dict.items() if v != set()}

NCTId des nouveaux PMIDs trouvés via Pubmed:


{'NCT02777229': {'37851566', '38156046'},
 'NCT01703962': {'37668523'},
 'NCT05349162': {'36735263'},
 'NCT01453192': {'30688008'},
 'NCT01473472': {'36601747'},
 'NCT03335995': {'37497675'},
 'NCT05311865': {'36438274', '37795682'},
 'NCT04315948': {'36695483', '38552208'},
 'NCT02405013': {'36686592'},
 'NCT01801618': {'29662875'},
 'NCT02212379': {'31269208'},
 'NCT03870438': {'38484756'},
 'NCT01426243': {'26314624'},
 'NCT01089387': {'26439886'},
 'NCT03215732': {'37143029'},
 'NCT00640263': {'34425825'},
 'NCT03078439': {'38408861'},
 'NCT02057796': {'36883573'},
 'NCT02833961': {'36318030'},
 'NCT04409405': {'38043556'},
 'NCT01688453': {'35272723'},
 'NCT02481453': {'38273639'},
 'NCT04392388': {'34293141'},
 'NCT04288128': {'38419144', '38421662'},
 'NCT03005652': {'38100477'}}

In [55]:
num_nctid_empty_ct = sum((1 for v in pmids_ct_dict.values() if v == set()))
print(f"Nombre d'études sans PMIDs issus de CT: {num_nctid_empty_ct}")

Nombre d'études sans PMIDs issus de CT: 118


In [56]:
num_nctid_empty_pubmed = sum((1 for v in pmids_complete_dict.values() if v == set()))
print(f"Nombre d'études sans PMIDs après consultation PubMed: {num_nctid_empty_pubmed}")

Nombre d'études sans PMIDs après consultation PubMed: 106


In [57]:
print("NCTIds qui n'avait aucun PMIDs sous CT, mais enrichis via Pubmed:")
nctids_previously_empty = {k for k, v in pmids_ct_dict.items() if v == set()} - {k for k, v in pmids_complete_dict.items() if v == set()}
{k: pmids_complete_dict[k] for k in nctids_previously_empty}

NCTIds qui n'avait aucun PMIDs sous CT, mais enrichis via Pubmed:


{'NCT04392388': {'34293141'},
 'NCT02405013': {'36686592'},
 'NCT05349162': {'36735263'},
 'NCT01453192': {'30688008'},
 'NCT04409405': {'38043556'},
 'NCT05311865': {'36438274', '37795682'},
 'NCT04288128': {'38419144', '38421662'},
 'NCT03078439': {'38408861'},
 'NCT02833961': {'36318030'},
 'NCT01703962': {'37668523'},
 'NCT01801618': {'29662875'},
 'NCT02212379': {'31269208'}}

In [58]:
len(nctids_previously_empty)

12

In [59]:
assert num_nctid_empty_ct - num_nctid_empty_pubmed == len(nctids_previously_empty)

### Enriching the PMIDs via the PubMed API

Each retrieved PMID is enriched with PubMed data such as the title, the authors, etc.:

In [60]:
counter = 0  # To keep track of progress
book_counter = 0  # Records whose type is not 'article' are counted and skipped
book = None  # Last skipped (NCTId, PMID) pair; stays None when nothing is skipped
total_publications_list = []

# For each NCTId...
for nctid, pmids in pmids_complete_dict.items():
    pmids_list = []

    # We process each PMID...
    for pmid in pmids:

        # Display the progress on a single line
        print(f'\r{counter+1} / {num_pmids_complete}...', end='', flush=True)

        try:
            # Fetch article details from Pubmed
            article = fetch.article_by_pmid(pmid)

            # We are not interested in records whose type is not 'article' (e.g. 'book')
            # TODO: book special case ?
            if article.pubmed_type == 'article':
                pmids_list.append(
                    {
                        'pmid': pmid,
                        'title': article.title,
                        'authors': article.authors_str.strip(),
                        'doi': article.doi,
                        'year': article.year,
                        'publication_types': list(article.publication_types.values()),
                        'citation': article.citation,
                    }
                )
            else:
                book_counter += 1
                book = (nctid, pmid)

            counter += 1
        except InvalidPMID as e:
            print(f'\n{e}')

    publication_dict = {'NCTId': nctid, 'publications': pmids_list}

    total_publications_list.append(publication_dict)

393 / 393...

In [61]:
print(f'Nb de publications avec un type différent d"article": {book_counter}')
print(book)

Nb de publications avec un type différent d"article": 1
('NCT03537196', '27227200')
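The loop above remembers only the last skipped record in `book`. If several non-article records were encountered, collecting them in a list would keep them all for later inspection. The sketch below illustrates the idea; `enrich_pmids` and its `fetch_article` argument are hypothetical, with `fetch_article` standing in for `fetch.article_by_pmid`, and only a few of the fields are copied.

```python
def enrich_pmids(pmids_by_nctid, fetch_article):
    """Return (records, skipped).

    `records` holds one dictionary per article; `skipped` holds one
    (NCTId, PMID, type) tuple per record that is not an article.
    """
    records = []
    skipped = []
    for nctid, pmids in pmids_by_nctid.items():
        for pmid in sorted(pmids):  # sorted for a reproducible order
            article = fetch_article(pmid)
            if article.pubmed_type == 'article':
                records.append({'NCTId': nctid, 'pmid': pmid, 'title': article.title})
            else:
                skipped.append((nctid, pmid, article.pubmed_type))
    return records, skipped
```

Passing the fetch function as an argument also makes the loop testable without network access.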


In [62]:
# print(json.dumps(total_publications_list, indent=2))

In [63]:
# The number of NCTId didn't change
assert len(total_publications_list) == len(studies_list)

### Import into Pandas

In [64]:
df_pubmed = pd.DataFrame.from_records(total_publications_list)
df_pubmed

Unnamed: 0,NCTId,publications
0,NCT03671291,[]
1,NCT00323804,[]
2,NCT02267304,[]
3,NCT02777229,"[{'pmid': '38156046', 'title': 'Durability of ..."
4,NCT01066962,"[{'pmid': '26520926', 'title': 'Bone mineral d..."
...,...,...
195,NCT04120415,[]
196,NCT01269632,[]
197,NCT02102737,[]
198,NCT01842477,[]


#### "Exploding" the "publications" column:

For each NCTId, the 'publications' column may contain a list of several publications.  
If, for example, there are 3 publications, we want to end up with 3 rows, each carrying the same NCTId and a single publication.  

*Before*:  
**`NCTId    publications`**  
`NCT0001, [Pub1, Pub2, Pub3]`   

*After*:  
**`NCTId    publications`**  
`NCT0001,   Pub1`  
`NCT0001,   Pub2`  
`NCT0001,   Pub3`   

In [65]:
df_pubmed = df_pubmed.explode('publications', ignore_index=True)
df_pubmed

Unnamed: 0,NCTId,publications
0,NCT03671291,
1,NCT00323804,
2,NCT02267304,
3,NCT02777229,"{'pmid': '38156046', 'title': 'Durability of t..."
4,NCT02777229,"{'pmid': '33010241', 'title': 'Dolutegravir-ba..."
...,...,...
493,NCT01842477,
494,NCT01037777,"{'pmid': '24780882', 'title': 'Prediction of t..."
495,NCT01037777,"{'pmid': '35264424', 'title': 'Levels of Neuro..."
496,NCT01037777,"{'pmid': '32822634', 'title': 'Conversion of i..."


We check that PubMed + CT yields at least as many rows as CT alone:

In [66]:
assert len(df_pubmed) >= len(df_ct)

On each row, the 'publications' column now contains a dictionary of the form:  
`{  
"pmid": "17545707",  
"title": "Haematological ...",  
"authors": "Smith DJ; ...",  
}`

We want to turn its keys into new columns:  
**`pmid      title                 authors`**  
`17545707, "Haematological ...", "Smith DJ; ..."`

In [67]:
df_pubmed_publications = pd.json_normalize(df_pubmed.pop('publications'))
df_pubmed_publications

Unnamed: 0,pmid,title,authors,doi,year,publication_types,citation
0,,,,,,,
1,,,,,,,
2,,,,,,,
3,38156046,Durability of the Efficacy and Safety of Dolut...,Mpoudi-Etame M; Tovar Sanchez T; Bousmah MA; O...,10.1093/ofid/ofad582,2023,[Journal Article],"Mpoudi-Etame M, et al. Durability of the Effic..."
4,33010241,Dolutegravir-based and low-dose efavirenz-base...,Calmy A; Tovar Sanchez T; Kouanfack C; Mpoudi-...,10.1016/S2352-3018(20)30238-1,2020,"[Clinical Trial, Phase III, Journal Article, M...","Calmy A, et al. Dolutegravir-based and low-dos..."
...,...,...,...,...,...,...,...
493,,,,,,,
494,24780882,Prediction of the age at onset in spinocerebel...,Tezenas du Montcel S; Durr A; Rakowicz M; Nane...,10.1136/jmedgenet-2013-102200,2014,"[Journal Article, Research Support, Non-U.S. G...","Tezenas du Montcel S, et al. Prediction of the..."
495,35264424,Levels of Neurofilament Light at the Preataxic...,Wilke C; Mengel D; Schöls L; Hengel H; Rakowic...,10.1212/WNL.0000000000200257,2022,"[Journal Article, Research Support, Non-U.S. G...","Wilke C, et al. Levels of Neurofilament Light ..."
496,32822634,Conversion of individuals at risk for spinocer...,Jacobi H; du Montcel ST; Romanzetti S; Harmuth...,10.1016/S1474-4422(20)30235-0,2020,"[Clinical Trial, Journal Article, Multicenter ...","Jacobi H, et al. Conversion of individuals at ..."


We reassemble the complete DataFrame:

In [68]:
df_pubmed = df_pubmed.join(df_pubmed_publications)
df_pubmed

Unnamed: 0,NCTId,pmid,title,authors,doi,year,publication_types,citation
0,NCT03671291,,,,,,,
1,NCT00323804,,,,,,,
2,NCT02267304,,,,,,,
3,NCT02777229,38156046,Durability of the Efficacy and Safety of Dolut...,Mpoudi-Etame M; Tovar Sanchez T; Bousmah MA; O...,10.1093/ofid/ofad582,2023,[Journal Article],"Mpoudi-Etame M, et al. Durability of the Effic..."
4,NCT02777229,33010241,Dolutegravir-based and low-dose efavirenz-base...,Calmy A; Tovar Sanchez T; Kouanfack C; Mpoudi-...,10.1016/S2352-3018(20)30238-1,2020,"[Clinical Trial, Phase III, Journal Article, M...","Calmy A, et al. Dolutegravir-based and low-dos..."
...,...,...,...,...,...,...,...,...
493,NCT01842477,,,,,,,
494,NCT01037777,24780882,Prediction of the age at onset in spinocerebel...,Tezenas du Montcel S; Durr A; Rakowicz M; Nane...,10.1136/jmedgenet-2013-102200,2014,"[Journal Article, Research Support, Non-U.S. G...","Tezenas du Montcel S, et al. Prediction of the..."
495,NCT01037777,35264424,Levels of Neurofilament Light at the Preataxic...,Wilke C; Mengel D; Schöls L; Hengel H; Rakowic...,10.1212/WNL.0000000000200257,2022,"[Journal Article, Research Support, Non-U.S. G...","Wilke C, et al. Levels of Neurofilament Light ..."
496,NCT01037777,32822634,Conversion of individuals at risk for spinocer...,Jacobi H; du Montcel ST; Romanzetti S; Harmuth...,10.1016/S1474-4422(20)30235-0,2020,"[Clinical Trial, Journal Article, Multicenter ...","Jacobi H, et al. Conversion of individuals at ..."
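
Because `DataFrame.join` aligns rows on the index, the pop / normalize / join steps only line up if the normalized frame carries the same index as the original. The sketch below illustrates this on toy data (the `df_demo` frame, its NCTIds and PMIDs are invented for illustration, not taken from the extraction); it resets the index explicitly rather than relying on both frames happening to share a `0..n-1` index.

```python
import pandas as pd

# Toy data (hypothetical): one row per (NCTId, publication).
# An empty dict stands for "no publication found".
df_demo = pd.DataFrame(
    {
        'NCTId': ['NCT00000001', 'NCT00000002', 'NCT00000002'],
        'publications': [
            {},
            {'pmid': '111', 'title': 'First paper'},
            {'pmid': '222', 'title': 'Second paper'},
        ],
    },
    index=[10, 11, 12],  # deliberately not 0..n-1
)

# Expand the dicts into columns; the result gets a fresh 0..n-1 index
pubs = pd.json_normalize(df_demo.pop('publications').tolist())

# Re-attach the original index so that join() pairs the right rows
pubs.index = df_demo.index

df_demo = df_demo.join(pubs)
print(df_demo)
```

If the index were left as is, `join` would match labels 10, 11, 12 against 0, 1, 2 and every publication column would come back empty.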


### Joining the CT and PubMed DataFrames:

In [69]:
df_final = df_ct.merge(df_pubmed, on=['NCTId', 'pmid'], how='right')
df_final

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,pmid,type,citation_x,title,authors,doi,year,publication_types,citation_y
0,NCT03671291,Missed Opportunities to Pre-exposure Prophylax...,French National Agency for Research on AIDS an...,"University Hospital, Marseille | University Ho...",COMPLETED,INTERVENTIONAL,False,2018-09-14,2019-04-03,2021-10-03,,,,,,,,,
1,NCT00323804,Interest of Ribavirin in the Maintenance Treat...,"ANRS, Emerging Infectious Diseases",Merck Sharp & Dohme LLC | Rennes University Ho...,COMPLETED,INTERVENTIONAL,False,2006-05-10,2006-05,2013-03,,,,,,,,,
2,NCT02267304,"Double Blind Randomized, Monocentric, Cross-ov...",Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2014-10-17,2013-10-30,2016-08-30,,,,,,,,,
3,NCT02777229,,,,,,,,,,38156046,,,Durability of the Efficacy and Safety of Dolut...,Mpoudi-Etame M; Tovar Sanchez T; Bousmah MA; O...,10.1093/ofid/ofad582,2023,[Journal Article],"Mpoudi-Etame M, et al. Durability of the Effic..."
4,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,33010241,DERIVED,"Calmy A, Tovar Sanchez T, Kouanfack C, Mpoudi-...",Dolutegravir-based and low-dose efavirenz-base...,Calmy A; Tovar Sanchez T; Kouanfack C; Mpoudi-...,10.1016/S2352-3018(20)30238-1,2020,"[Clinical Trial, Phase III, Journal Article, M...","Calmy A, et al. Dolutegravir-based and low-dos..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
493,NCT01842477,Evaluation of Efficacy and Safety of Autologou...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2013-04-29,2013-05,2016-02-05,,,,,,,,,
494,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,24780882,DERIVED,"Tezenas du Montcel S, Durr A, Rakowicz M, Nane...",Prediction of the age at onset in spinocerebel...,Tezenas du Montcel S; Durr A; Rakowicz M; Nane...,10.1136/jmedgenet-2013-102200,2014,"[Journal Article, Research Support, Non-U.S. G...","Tezenas du Montcel S, et al. Prediction of the..."
495,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,35264424,DERIVED,"Wilke C, Mengel D, Schols L, Hengel H, Rakowic...",Levels of Neurofilament Light at the Preataxic...,Wilke C; Mengel D; Schöls L; Hengel H; Rakowic...,10.1212/WNL.0000000000200257,2022,"[Journal Article, Research Support, Non-U.S. G...","Wilke C, et al. Levels of Neurofilament Light ..."
496,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,32822634,DERIVED,"Jacobi H, du Montcel ST, Romanzetti S, Harmuth...",Conversion of individuals at risk for spinocer...,Jacobi H; du Montcel ST; Romanzetti S; Harmuth...,10.1016/S1474-4422(20)30235-0,2020,"[Clinical Trial, Journal Article, Multicenter ...","Jacobi H, et al. Conversion of individuals at ..."
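
A right join keeps every row of the PubMed frame; rows whose `(NCTId, pmid)` pair is unknown to CT come back with empty CT columns. As a possible alternative to testing `BriefTitle` for missing values, `merge` accepts `indicator=True`, which adds a `_merge` column saying where each row came from. A minimal sketch on toy frames (`ct`, `pubmed` and their contents are invented for illustration):

```python
import pandas as pd

# Toy stand-ins (hypothetical) for the CT and PubMed frames
ct = pd.DataFrame({
    'NCTId': ['NCT00000001', 'NCT00000002'],
    'BriefTitle': ['Study A', 'Study B'],
    'pmid': [None, '111'],
    'type': [None, 'DERIVED'],
})
pubmed = pd.DataFrame({
    'NCTId': ['NCT00000001', 'NCT00000002', 'NCT00000002'],
    'pmid': [None, '111', '222'],
    'title': [None, 'First paper', 'Second paper'],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = ct.merge(pubmed, on=['NCTId', 'pmid'], how='right', indicator=True)

# Rows present only in the PubMed frame
pubmed_only = merged['_merge'] == 'right_only'
print(merged.loc[pubmed_only, ['NCTId', 'pmid', 'title']])
```

Note that pandas matches null keys against each other, which is why a study without any publication (null `pmid` on both sides) still counts as present in both frames.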


We drop the two 'citation' columns (`citation_x` from CT and `citation_y` from PubMed):

In [70]:
df_final.drop(['citation_x', 'citation_y'], axis=1, inplace=True)

In [71]:
df_final

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,pmid,type,title,authors,doi,year,publication_types
0,NCT03671291,Missed Opportunities to Pre-exposure Prophylax...,French National Agency for Research on AIDS an...,"University Hospital, Marseille | University Ho...",COMPLETED,INTERVENTIONAL,False,2018-09-14,2019-04-03,2021-10-03,,,,,,,
1,NCT00323804,Interest of Ribavirin in the Maintenance Treat...,"ANRS, Emerging Infectious Diseases",Merck Sharp & Dohme LLC | Rennes University Ho...,COMPLETED,INTERVENTIONAL,False,2006-05-10,2006-05,2013-03,,,,,,,
2,NCT02267304,"Double Blind Randomized, Monocentric, Cross-ov...",Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2014-10-17,2013-10-30,2016-08-30,,,,,,,
3,NCT02777229,,,,,,,,,,38156046,,Durability of the Efficacy and Safety of Dolut...,Mpoudi-Etame M; Tovar Sanchez T; Bousmah MA; O...,10.1093/ofid/ofad582,2023,[Journal Article]
4,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,33010241,DERIVED,Dolutegravir-based and low-dose efavirenz-base...,Calmy A; Tovar Sanchez T; Kouanfack C; Mpoudi-...,10.1016/S2352-3018(20)30238-1,2020,"[Clinical Trial, Phase III, Journal Article, M..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
493,NCT01842477,Evaluation of Efficacy and Safety of Autologou...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2013-04-29,2013-05,2016-02-05,,,,,,,
494,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,24780882,DERIVED,Prediction of the age at onset in spinocerebel...,Tezenas du Montcel S; Durr A; Rakowicz M; Nane...,10.1136/jmedgenet-2013-102200,2014,"[Journal Article, Research Support, Non-U.S. G..."
495,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,35264424,DERIVED,Levels of Neurofilament Light at the Preataxic...,Wilke C; Mengel D; Schöls L; Hengel H; Rakowic...,10.1212/WNL.0000000000200257,2022,"[Journal Article, Research Support, Non-U.S. G..."
496,NCT01037777,RISCA : Prospective Study of Individuals at Ri...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2009-12-23,2009-05-07,2017-12-14,32822634,DERIVED,Conversion of individuals at risk for spinocer...,Jacobi H; du Montcel ST; Romanzetti S; Harmuth...,10.1016/S1474-4422(20)30235-0,2020,"[Clinical Trial, Journal Article, Multicenter ..."


The new PMIDs found only via PubMed have none of the CT-related information filled in (BriefTitle, LeadSponsorName, etc.):

In [72]:
# Index of empty rows we need to fill
index_empty_rows = df_final.loc[:, 'BriefTitle'].isna()

# Columns we need to fill
columns_to_fill = [
    'BriefTitle',
    'LeadSponsorName',
    'CollaboratorName',
    'OverallStatus',
    'StudyType',
    'HasResults',
    'StudyFirstPostDate',
    'StartDate',
    'CompletionDate',
    'type',
]

df_final.loc[index_empty_rows, ['NCTId'] + columns_to_fill].head(10)

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,type
3,NCT02777229,,,,,,,,,,
7,NCT02777229,,,,,,,,,,
13,NCT01703962,,,,,,,,,,
15,NCT05349162,,,,,,,,,,
30,NCT01453192,,,,,,,,,,
34,NCT01473472,,,,,,,,,,
40,NCT03335995,,,,,,,,,,
63,NCT05311865,,,,,,,,,,
64,NCT05311865,,,,,,,,,,
124,NCT04315948,,,,,,,,,,


We add a 'PUBMED' type for the PMIDs that come from PubMed only:

In [73]:
# We add a 'PUBMED' type to the PMIDs extracted from Pubmed exclusively
df_final.loc[index_empty_rows, 'type'] = 'PUBMED'

In [74]:
df_final.loc[index_empty_rows, ['NCTId'] + columns_to_fill].head(10)

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,type
3,NCT02777229,,,,,,,,,,PUBMED
7,NCT02777229,,,,,,,,,,PUBMED
13,NCT01703962,,,,,,,,,,PUBMED
15,NCT05349162,,,,,,,,,,PUBMED
30,NCT01453192,,,,,,,,,,PUBMED
34,NCT01473472,,,,,,,,,,PUBMED
40,NCT03335995,,,,,,,,,,PUBMED
63,NCT05311865,,,,,,,,,,PUBMED
64,NCT05311865,,,,,,,,,,PUBMED
124,NCT04315948,,,,,,,,,,PUBMED


We fill these columns with the information contained in the CT DataFrame:

In [75]:
# NCTIds of empty rows
NCTIds_empty_rows = df_final.loc[index_empty_rows, 'NCTId']

# Columns we wish to copy
columns_to_copy = [
    'BriefTitle',
    'LeadSponsorName',
    'CollaboratorName',
    'OverallStatus',
    'StudyType',
    'HasResults',
    'StudyFirstPostDate',
    'StartDate',
    'CompletionDate',
]

# We copy the missing values from the CT dataframe
for index, nctid in NCTIds_empty_rows.items():
    # For each NCTId, we look in the CT dataframe for the first row with this NCTId
    # and copy the missing columns
    df_final.loc[index, columns_to_copy] = df_ct.loc[
        df_ct.loc[:, 'NCTId'] == nctid, columns_to_copy
    ].iloc[0]

In [76]:
df_final.loc[index_empty_rows, ['NCTId'] + columns_to_fill].head(10)

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,type
3,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,PUBMED
7,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,PUBMED
13,NCT01703962,Non Invasive IDentification of Gliomas With ID...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2012-10-11,2012-03-14,2014-03-20,PUBMED
15,NCT05349162,Epicardial vs. Transvenous ICDs in Children,Paris Cardiovascular Research Center (Inserm U...,Hôpital Necker-Enfants Malades,COMPLETED,OBSERVATIONAL,False,2022-04-27,2003-01-01,2022-04-01,PUBMED
30,NCT01453192,Renal Transplantation and Raltegravir in HIV-I...,"ANRS, Emerging Infectious Diseases",Merck Sharp & Dohme LLC,COMPLETED,INTERVENTIONAL,False,2011-10-17,2011-12,2015-11,PUBMED
34,NCT01473472,On Demand Antiretroviral Pre-exposure Prophyla...,"ANRS, Emerging Infectious Diseases",,COMPLETED,INTERVENTIONAL,False,2011-11-17,2012-01,2016-12-15,PUBMED
40,NCT03335995,Stroke Prognosis in Intensive CarE,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2017-11-08,2017-10-18,2020-11-18,PUBMED
63,NCT05311865,Transmission of Covid-19 During Clubbing Event...,"ANRS, Emerging Infectious Diseases",Cerballiance | Kappa Santé,COMPLETED,INTERVENTIONAL,False,2022-04-05,2021-09-04,2022-02-26,PUBMED
64,NCT05311865,Transmission of Covid-19 During Clubbing Event...,"ANRS, Emerging Infectious Diseases",Cerballiance | Kappa Santé,COMPLETED,INTERVENTIONAL,False,2022-04-05,2021-09-04,2022-02-26,PUBMED
124,NCT04315948,Trial of Treatments for COVID-19 in Hospitaliz...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2020-03-20,2020-03-22,2023-09-25,PUBMED
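
The row-by-row copy above can also be written without an explicit loop over rows. The sketch below is one possible variant, not the notebook's method: it builds a lookup table with one row per NCTId and uses `Series.map` to fetch the values. The frames `ct` and `final` and their contents are invented for illustration.

```python
import pandas as pd

# Toy stand-ins (hypothetical) for df_ct and df_final
ct = pd.DataFrame({
    'NCTId': ['NCT00000001', 'NCT00000001', 'NCT00000002'],
    'BriefTitle': ['Study A', 'Study A', 'Study B'],
    'OverallStatus': ['COMPLETED', 'COMPLETED', 'RECRUITING'],
})
final = pd.DataFrame({
    'NCTId': ['NCT00000001', 'NCT00000002', 'NCT00000002'],
    'BriefTitle': ['Study A', 'Study B', None],
    'OverallStatus': ['COMPLETED', 'RECRUITING', None],
    'pmid': ['111', '222', '333'],
})

columns_to_copy = ['BriefTitle', 'OverallStatus']
empty_rows = final['BriefTitle'].isna()

# One reference row per study: the first occurrence of each NCTId
lookup = ct.drop_duplicates('NCTId').set_index('NCTId')

# For each column, translate the NCTId of the empty rows into the CT value
for column in columns_to_copy:
    final.loc[empty_rows, column] = final.loc[empty_rows, 'NCTId'].map(lookup[column])

print(final)
```

As with the loop, this takes the first CT row for each NCTId; an NCTId absent from the lookup would simply stay empty instead of raising an error.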


### Final result:

In [77]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 498 entries, 0 to 497
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   NCTId               498 non-null    object  
 1   BriefTitle          498 non-null    string  
 2   LeadSponsorName     498 non-null    string  
 3   CollaboratorName    224 non-null    string  
 4   OverallStatus       498 non-null    category
 5   StudyType           498 non-null    category
 6   HasResults          498 non-null    boolean 
 7   StudyFirstPostDate  498 non-null    string  
 8   StartDate           498 non-null    string  
 9   CompletionDate      498 non-null    string  
 10  pmid                392 non-null    object  
 11  type                392 non-null    string  
 12  title               392 non-null    object  
 13  authors             392 non-null    object  
 14  doi                 391 non-null    object  
 15  year                392 non-null    obje

In [78]:
df_final = df_final.convert_dtypes()
# df_final = df_final.astype({"OverallStatus" : 'category', "StudyType" : 'category'})
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 498 entries, 0 to 497
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   NCTId               498 non-null    string  
 1   BriefTitle          498 non-null    string  
 2   LeadSponsorName     498 non-null    string  
 3   CollaboratorName    224 non-null    string  
 4   OverallStatus       498 non-null    category
 5   StudyType           498 non-null    category
 6   HasResults          498 non-null    boolean 
 7   StudyFirstPostDate  498 non-null    string  
 8   StartDate           498 non-null    string  
 9   CompletionDate      498 non-null    string  
 10  pmid                392 non-null    string  
 11  type                392 non-null    string  
 12  title               392 non-null    string  
 13  authors             392 non-null    string  
 14  doi                 391 non-null    string  
 15  year                392 non-null    stri
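
For reference, `convert_dtypes` replaces `object` columns with the best matching nullable dtype and leaves columns that already have a dedicated dtype, such as `category`, untouched. A small sketch on an invented frame (`df_types` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical) with object columns and one categorical column
df_types = pd.DataFrame({
    'pmid': pd.Series(['111', np.nan], dtype=object),
    'HasResults': pd.Series([True, np.nan], dtype=object),
    'OverallStatus': pd.Series(['COMPLETED', 'COMPLETED'], dtype='category'),
})

converted = df_types.convert_dtypes()

# object columns become nullable dtypes, the categorical column is kept
print(converted.dtypes)
```

Values stay as they are: a year stored as text remains text after the conversion and needs an explicit cast if numeric operations are required.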

### CSV export:

In [79]:
df_final.to_csv(
    'Data/outputs/df_extract.csv',
    sep=';',
    index=False,
    encoding='utf-8-sig',
)
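
To check that the file can be read back with the same settings, a round trip along these lines may help. This is a sketch on an invented frame written to a temporary directory (`df_out` and its contents are hypothetical, and the real output path is not touched); `dtype='string'` is an assumption made here so that identifiers are not parsed as numbers.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame (hypothetical); the title contains the separator and an accent
df_out = pd.DataFrame({
    'NCTId': ['NCT00000001', 'NCT00000002'],
    'pmid': ['00123', None],
    'title': ['Étude; pilote', None],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / 'df_extract.csv'

    # Same options as the export: ';' separator, no index, UTF-8 with BOM
    df_out.to_csv(csv_path, sep=';', index=False, encoding='utf-8-sig')

    # Read everything back as text so that PMIDs keep any leading zeros
    df_back = pd.read_csv(csv_path, sep=';', encoding='utf-8-sig', dtype='string')

print(df_back)
```

Fields containing the separator are quoted automatically on export, and the `utf-8-sig` encoding writes a byte order mark that helps spreadsheet software detect UTF-8. List-valued columns such as `publication_types` are written as their text representation, so they come back as plain strings.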