# Objective:

The goal is to estimate the number of publications that result from clinical studies linked to INSERM.

In particular, we want to identify the clinical studies that did not lead to any publication and try to understand why.

# Organization:
- Using the *ClinicalTrials.gov* (*CT*) API, we retrieve the *IDs* (**NCTId**) of the clinical studies sponsored by *INSERM*, *ANRS*, etc.

- From these **NCTIds**, we retrieve from *CT* the **PMIDs** of the publications linked to these studies.  
    These publications are of 2 types:
    1. They were uploaded to *CT* by the study authors: `BACKGROUND, RESULT`
    2. They were automatically retrieved from PubMed by *CT*: `DERIVED`
- From these same **NCTIds**, we retrieve from *PubMed* the **PMIDs** of the publications linked to these studies.  
    This generally returns slightly more publications than the automatic matching performed by *CT*.

- For each **NCTId**, we merge the sets of **PMIDs** returned by *CT* and *PubMed*.

- From this set of **PMIDs**, we retrieve the related information: `title, authors, doi`...

# Extracting NCTIds from ClinicalTrials.gov:

## API v1:

To make it easier to retrieve data through the ClinicalTrials.gov API v1, we use the Python wrapper [pytrials](https://github.com/jvfe/pytrials).

Installing ***pytrials***:
- In Anaconda Navigator, select the same environment as the one running Jupyter
- Launch Powershell Prompt in that environment
- Type: `pip install pytrials`

In [1]:
from pytrials.client import ClinicalTrials
import urllib.parse

In [2]:
ct = ClinicalTrials()

### Building the query:

We build the query that will be sent to the ClinicalTrials.gov API.

#### Sponsors:

In [3]:
sponsors = [
    'anrs',
    'inserm',
    'institut national de la santé et de la recherche médicale',
    'french national agency for research on aids and viral hepatitis',
]

In [4]:
sponsors_expr = [f'AREA[LeadSponsorName]{sponsor}' for sponsor in sponsors]

# Add OR keyword
sponsors_expr = ' OR '.join(sponsors_expr)

# Add parenthesis for correct interpretation of OR expression
sponsors_expr = f'({sponsors_expr})'

sponsors_expr

'(AREA[LeadSponsorName]anrs OR AREA[LeadSponsorName]inserm OR AREA[LeadSponsorName]institut national de la santé et de la recherche médicale OR AREA[LeadSponsorName]french national agency for research on aids and viral hepatitis)'

#### Status:

In [5]:
status = 'completed'

In [6]:
status_expr = f'AREA[OverallStatus]{status}'
status_expr

'AREA[OverallStatus]completed'

#### Search Expression:

In [7]:
search_expr = ' AND '.join([sponsors_expr, status_expr])
search_expr

'(AREA[LeadSponsorName]anrs OR AREA[LeadSponsorName]inserm OR AREA[LeadSponsorName]institut national de la santé et de la recherche médicale OR AREA[LeadSponsorName]french national agency for research on aids and viral hepatitis) AND AREA[OverallStatus]completed'
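
The cells above can be gathered into a single helper, which makes it easier to rerun the query with another sponsor list or status. This is only a sketch: `build_search_expr` is a hypothetical name, not part of pytrials, and it assumes the same `AREA[...]` syntax as the cells above.

```python
def build_search_expr(sponsors, status):
    """Build a v1 search expression from sponsor names and an overall status."""
    # One AREA[LeadSponsorName] clause per sponsor, OR-ed together
    sponsor_clauses = ' OR '.join(f'AREA[LeadSponsorName]{s}' for s in sponsors)
    # Parentheses keep the OR group together before AND-ing the status filter
    return f'({sponsor_clauses}) AND AREA[OverallStatus]{status}'


expr = build_search_expr(['anrs', 'inserm'], 'completed')
print(expr)
```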

#### URL encode: 

In [8]:
search_expr_url_encode = urllib.parse.quote_plus(search_expr)
search_expr_url_encode

'%28AREA%5BLeadSponsorName%5Danrs+OR+AREA%5BLeadSponsorName%5Dinserm+OR+AREA%5BLeadSponsorName%5Dinstitut+national+de+la+sant%C3%A9+et+de+la+recherche+m%C3%A9dicale+OR+AREA%5BLeadSponsorName%5Dfrench+national+agency+for+research+on+aids+and+viral+hepatitis%29+AND+AREA%5BOverallStatus%5Dcompleted'
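
As a quick sanity check of the encoding step, the sketch below encodes a short fragment of the expression and decodes it back. It uses only the standard library; the fragment itself is just an illustrative sample.

```python
import urllib.parse

raw = 'AREA[LeadSponsorName]institut national de la santé'

# quote_plus percent-encodes brackets and accented characters (as UTF-8)
# and turns spaces into '+'
encoded = urllib.parse.quote_plus(raw)

# unquote_plus reverses the transformation
decoded = urllib.parse.unquote_plus(encoded)

print(encoded)
print(decoded == raw)
```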

#### Fields:

The fields we want to retrieve:

In [9]:
fields = [
    'NCTId',
    'BriefTitle',
    'OverallStatus',
    'StudyType',
    'LeadSponsorName',
    'CollaboratorName',
    'OrgStudyId',
    'SecondaryId',
    'StudyFirstPostDate',
    'ReferencePMID',
    'ReferenceCitation',
    'ReferenceType',
]

### Sending the query:

In [10]:
study_fields = ct.get_study_fields(
    search_expr=search_expr_url_encode,
    fields=fields,
    max_studies=1000,
    fmt='csv',
)

In [11]:
print(f'NStudiesReturned: {len(study_fields[1:])}')

NStudiesReturned: 289


### Reading the query result into Pandas:

In [12]:
import pandas as pd

In [13]:
pd.DataFrame.from_records(
    study_fields[1:], index='Rank', columns=study_fields[0]
)

Rank,NCTId,BriefTitle,OverallStatus,StudyType,LeadSponsorName,CollaboratorName,OrgStudyId,SecondaryId,StudyFirstPostDate,ReferencePMID,ReferenceCitation,ReferenceType
1,NCT03671291,Missed Opportunities to Pre-exposure Prophylax...,Completed,Interventional,French National Agency for Research on AIDS an...,"University Hospital, Marseille|University Hosp...",ANRS 95041 Missed Opportunity,,"September 14, 2018",,,
2,NCT01494961,Couple-oriented Prenatal HIV Counseling in Low...,Completed,Interventional,French National Agency for Research on AIDS an...,Elizabeth Glaser Pediatric AIDS Foundation,ANRS 12127 Prenahtest,,"December 19, 2011",20403152|21857289|34329355|29178852|23343912,"Orne-Gliemann J, Tchendjou PT, Miric M, Gadgil...",result|result|derived|derived|derived
3,NCT01463956,Efficacy of PegInterferon-Ribavirin-Boceprevir...,Completed,Interventional,French National Agency for Research on AIDS an...,Merck Sharp & Dohme LLC,2011- 001089 -17,,"November 2, 2011",,,
4,NCT01426243,The Yellow Fever Vaccine Immunity in HIV Infec...,Completed,Interventional,French National Agency for Research on AIDS an...,,2009-014921-17,,"August 31, 2011",30096071,"Colin de Verdiere N, Durier C, Samri A, Meiffr...",derived
5,NCT01413152,Residual Risk Assessment Of HIV Transmission,Completed,Interventional,French National Agency for Research on AIDS an...,,AOO 370-41,,"August 10, 2011",,,
...,...,...,...,...,...,...,...,...,...,...,...,...
285,NCT00265642,Evaluation of Irbesartan on Hepatic Fibrosis i...,Completed,Interventional,"ANRS, Emerging Infectious Diseases",Sanofi,2005-006027-37,ANRS HC 19 Fibrosar,"December 15, 2005",,,
286,NCT00116454,Trial for Hepatocellular Carcinoma Adjuvant Tr...,Completed,Interventional,"ANRS, Emerging Infectious Diseases",,2004-003883-31,ANRS HC06 LIPIOCIS,"June 30, 2005",,,
287,NCT05349162,Epicardial vs. Transvenous ICDs in Children,Completed,Observational,Paris Cardiovascular Research Center (Inserm U...,Hôpital Necker-Enfants Malades,134526,,"April 27, 2022",,,
288,NCT04842851,Cardiac Resynchronization Therapy in Congenita...,Completed,Observational,Paris Cardiovascular Research Center (Inserm U...,"Marie Lannelongue Hospital, Le Plessis Robinso...",2219047,,"April 13, 2021",,,
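
In this v1 output, `ReferencePMID` and `ReferenceType` are pipe-separated strings. A minimal sketch of how they could be paired is shown below. `split_references` is a hypothetical helper, and it assumes both fields list the references in the same order and in equal number, which is not guaranteed by anything shown here.

```python
def split_references(pmid_field, type_field):
    """Pair each PMID with its reference type from two pipe-separated strings."""
    if not pmid_field:
        return []
    pmids = pmid_field.split('|')
    types = type_field.split('|')
    # Assumption: one type per PMID, in the same order
    if len(pmids) != len(types):
        raise ValueError('PMID and type counts differ; cannot pair them safely')
    return list(zip(pmids, types))


# Values taken from the NCT01494961 row above
refs = split_references(
    '20403152|21857289|34329355|29178852|23343912',
    'result|result|derived|derived|derived',
)
print(refs)
```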


## API v2:

API v1 will no longer be supported [from mid-2024](https://clinicaltrials.gov/data-api/api):

>***Notice to API users:  
>The new ClinicalTrials.gov API, version 2.0 is available. Classic API users are strongly encouraged to switch to the modernized API. We will continue to support the classic API until mid-2024 and are planning blackouts for the spring to help with the transition to the modernized API.***

In addition, API v2 supports a new **"HasResults"** field, which is rarely used for now but could prove useful in the future.

On the other hand, CSV export is limited to a fixed set of fields, listed on this page: https://clinicaltrials.gov/data-api/about-api/csv-download

We therefore have to use the JSON export.

### Building the query:

Since `pytrials` is not compatible with v2, we send the request manually using [Requests](https://requests.readthedocs.io/en/latest/).

#### Format:

In [14]:
format = 'json'

#### Sponsors:

In [15]:
sponsors

['anrs',
 'inserm',
 'institut national de la santé et de la recherche médicale',
 'french national agency for research on aids and viral hepatitis']

In [16]:
sponsors_expr_v2 = ' OR '.join(sponsors)
sponsors_expr_v2 = urllib.parse.quote_plus(sponsors_expr_v2)
sponsors_expr_v2

'anrs+OR+inserm+OR+institut+national+de+la+sant%C3%A9+et+de+la+recherche+m%C3%A9dicale+OR+french+national+agency+for+research+on+aids+and+viral+hepatitis'

#### Overall_status:

In [17]:
overall_status = 'COMPLETED'

#### Fields:

In [18]:
fields_v2 = [
    'NCTId',
    'BriefTitle',
    # 'OfficialTitle',
    'OverallStatus',
    'StudyType',
    'LeadSponsorName',
    'CollaboratorName',
    # 'OrgStudyId',
    # 'SecondaryId',
    'StudyFirstPostDate',
    'StartDate',
    # 'PrimaryCompletionDate',
    'CompletionDate',
    'ReferencePMID',
    'ReferenceCitation',
    'ReferenceType',
    'hasResults',
]
fields_v2

['NCTId',
 'BriefTitle',
 'OverallStatus',
 'StudyType',
 'LeadSponsorName',
 'CollaboratorName',
 'StudyFirstPostDate',
 'StartDate',
 'CompletionDate',
 'ReferencePMID',
 'ReferenceCitation',
 'ReferenceType',
 'hasResults']

In [19]:
fields_expr_v2 = ','.join(fields_v2)
fields_expr_v2 = urllib.parse.quote_plus(fields_expr_v2)
fields_expr_v2

'NCTId%2CBriefTitle%2COverallStatus%2CStudyType%2CLeadSponsorName%2CCollaboratorName%2CStudyFirstPostDate%2CStartDate%2CCompletionDate%2CReferencePMID%2CReferenceCitation%2CReferenceType%2ChasResults'

#### Maximum number of results:

In [20]:
count_total = 'true'

In [21]:
page_size = 1000

#### API URL:

In [22]:
query_url = f'https://clinicaltrials.gov/api/v2/studies?format={format}&query.lead={sponsors_expr_v2}&filter.overallStatus={overall_status}&fields={fields_expr_v2}&countTotal={count_total}&pageSize={page_size}'
query_url

'https://clinicaltrials.gov/api/v2/studies?format=json&query.lead=anrs+OR+inserm+OR+institut+national+de+la+sant%C3%A9+et+de+la+recherche+m%C3%A9dicale+OR+french+national+agency+for+research+on+aids+and+viral+hepatitis&filter.overallStatus=COMPLETED&fields=NCTId%2CBriefTitle%2COverallStatus%2CStudyType%2CLeadSponsorName%2CCollaboratorName%2CStudyFirstPostDate%2CStartDate%2CCompletionDate%2CReferencePMID%2CReferenceCitation%2CReferenceType%2ChasResults&countTotal=true&pageSize=1000'
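
The same URL can also be assembled from a dictionary of parameters, letting the standard library handle the encoding instead of calling `quote_plus` on each part. This is a sketch only: `build_query_url` and `BASE_URL` are hypothetical names, and the parameter names are simply those used in the URL above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://clinicaltrials.gov/api/v2/studies'


def build_query_url(sponsors, overall_status, fields, page_size=1000):
    """Assemble the v2 query URL from its parts."""
    params = {
        'format': 'json',
        'query.lead': ' OR '.join(sponsors),
        'filter.overallStatus': overall_status,
        'fields': ','.join(fields),
        'countTotal': 'true',
        'pageSize': page_size,
    }
    # urlencode applies quote_plus to every key and value by default
    return f'{BASE_URL}?{urlencode(params)}'


url = build_query_url(['anrs', 'inserm'], 'COMPLETED', ['NCTId', 'BriefTitle'])
print(url)
```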

### Sending the query:

In [23]:
import requests

In [24]:
response = requests.get(query_url)
response.raise_for_status()
response

<Response [200]>

In [25]:
print(f'Studies returned: {response.json()["totalCount"]}')

Studies returned: 289
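
Here the 289 studies fit in a single page of 1000. If the sponsor list grew beyond one page, the results would have to be collected page by page. The sketch below shows one possible loop. It assumes that the v2 response carries a `nextPageToken` key when more pages exist and that the token is sent back as a `pageToken` parameter; this should be checked against the API documentation. `fetch_page` is a hypothetical callable, replaced here by an offline stand-in.

```python
def fetch_all_studies(fetch_page):
    """Collect studies across pages.

    `fetch_page(page_token)` must return the decoded JSON of one page,
    e.g. by calling requests.get(...) with the token (assumption).
    """
    studies = []
    page_token = None
    while True:
        page = fetch_page(page_token)
        studies.extend(page.get('studies', []))
        # Assumption: the key is absent on the last page
        page_token = page.get('nextPageToken')
        if not page_token:
            return studies


# Offline stand-in for the API: two fake pages
fake_pages = {
    None: {'studies': [{'id': 1}, {'id': 2}], 'nextPageToken': 'abc'},
    'abc': {'studies': [{'id': 3}]},
}
all_studies = fetch_all_studies(lambda token: fake_pages[token])
print(len(all_studies))
```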


### Processing the returned JSON:

In [26]:
import json

In [27]:
print(json.dumps(response.json(), indent=2))

{
  "totalCount": 289,
  "studies": [
    {
      "protocolSection": {
        "identificationModule": {
          "nctId": "NCT02014727",
          "briefTitle": "Safety and Immunogenicity of Recombinant Pichia Pastoris AMA1-DiCo Candidate Malaria Vaccine With GLA-SE and Alhydrogel \u00ae as Adjuvant in Healthy Malaria Non-Exposed European and Malaria Exposed African Adults"
        },
        "statusModule": {
          "overallStatus": "COMPLETED",
          "startDateStruct": {
            "date": "2014-01"
          },
          "completionDateStruct": {
            "date": "2015-07"
          },
          "studyFirstPostDateStruct": {
            "date": "2013-12-18"
          }
        },
        "sponsorCollaboratorsModule": {
          "leadSponsor": {
            "name": "Institut National de la Sant\u00e9 Et de la Recherche M\u00e9dicale, France"
          },
          "collaborators": [
            {
              "name": "EVI Industries, Inc."
            },
            {


***The JSON structure is far too nested to normalize directly with Pandas, so we flatten it by hand:***

From the JSON we build an equivalent but much "flatter" dictionary.

In [28]:
# Return None if the collaborator list is empty; otherwise join its values
# as "collaborator_0 | collaborator_1 | ..."
def concatenate_collaborator_list(collaborator_list):
    if not collaborator_list:
        return None
    return ' | '.join(collaborator_list)

In [29]:
studies_list = []
for study in response.json()['studies']:
    study_dict = {
        'NCTId': study['protocolSection']['identificationModule']['nctId'],
        'BriefTitle': study['protocolSection']['identificationModule']['briefTitle'],
        'LeadSponsorName': study['protocolSection']['sponsorCollaboratorsModule']['leadSponsor']['name'],
        'CollaboratorName': concatenate_collaborator_list(
            [
                c['name']
                for c in (
                    study['protocolSection']['sponsorCollaboratorsModule'].get('collaborators', [])  # can be missing
                )
            ]
        ),
        'OverallStatus': study['protocolSection']['statusModule']['overallStatus'],
        'StudyType': study['protocolSection']['designModule']['studyType'],
        'HasResults': study['hasResults'],
        'StudyFirstPostDate': study['protocolSection']['statusModule']['studyFirstPostDateStruct']['date'],
        'StartDate': study['protocolSection']['statusModule'].get('startDateStruct', {}).get('date', None),  # can be missing
        # 'PrimaryCompletionDate' : study["protocolSection"]["statusModule"].get('primaryCompletionDateStruct', {}).get('date', None), # can be missing
        'CompletionDate': study['protocolSection']['statusModule'].get('completionDateStruct', {}).get('date', None),  # can be missing
        'Reference': study['protocolSection'].get('referencesModule', {}).get('references', []),  # can be missing
    }
    studies_list.append(study_dict)

print(json.dumps(studies_list, indent=2))

[
  {
    "NCTId": "NCT02014727",
    "BriefTitle": "Safety and Immunogenicity of Recombinant Pichia Pastoris AMA1-DiCo Candidate Malaria Vaccine With GLA-SE and Alhydrogel \u00ae as Adjuvant in Healthy Malaria Non-Exposed European and Malaria Exposed African Adults",
    "LeadSponsorName": "Institut National de la Sant\u00e9 Et de la Recherche M\u00e9dicale, France",
    "CollaboratorName": "EVI Industries, Inc. | BPRC | Recherche Clinique Paris Descartes Necker Cochin Sainte Anne | Centre national de recherche et de formation sur le paludisme",
    "OverallStatus": "COMPLETED",
    "StudyType": "INTERVENTIONAL",
    "HasResults": false,
    "StudyFirstPostDate": "2013-12-18",
    "StartDate": "2014-01",
    "CompletionDate": "2015-07",
    "Reference": [
      {
        "pmid": "28947345",
        "type": "DERIVED",
        "citation": "Sirima SB, Durier C, Kara L, Houard S, Gansane A, Loulergue P, Bahuaud M, Benhamouda N, Nebie I, Faber B, Remarque E, Launay O; AMA1-DiCo Study Group

We check that no NCTId was lost along the way:

In [30]:
print(f'Number of NCTIds: {len(studies_list)}')
assert response.json()['totalCount'] == len(studies_list)

Number of NCTIds: 289
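
The flattening step chains several `.get(..., {})` calls for keys that may be missing. A small helper can express the same idea in one place. This is a sketch: `get_nested` is a hypothetical helper, and the sample study below is a trimmed-down version of the JSON shown earlier.

```python
def get_nested(mapping, *keys, default=None):
    """Walk nested dictionaries; return `default` as soon as a key is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


study = {
    'protocolSection': {
        'statusModule': {
            'overallStatus': 'COMPLETED',
            'completionDateStruct': {'date': '2015-07'},
        }
    }
}

# 'startDateStruct' is absent from this sample, so the default is returned
start = get_nested(study, 'protocolSection', 'statusModule', 'startDateStruct', 'date')
completion = get_nested(study, 'protocolSection', 'statusModule', 'completionDateStruct', 'date')
print(start, completion)
```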


### Import into Pandas

In [31]:
df_ct = pd.json_normalize(data=studies_list)
df_ct

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,Reference
0,NCT02014727,Safety and Immunogenicity of Recombinant Pichi...,Institut National de la Santé Et de la Recherc...,"EVI Industries, Inc. | BPRC | Recherche Cliniq...",COMPLETED,INTERVENTIONAL,False,2013-12-18,2014-01,2015-07,"[{'pmid': '28947345', 'type': 'DERIVED', 'cita..."
1,NCT00117494,Rosuvastatin Versus Pravastatin in HIV Patient...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2005-07-07,2005-10,2007-06,"[{'pmid': '22739396', 'type': 'DERIVED', 'cita..."
2,NCT00536627,Efficacy and Tolerance of Naked DNA Vaccine in...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2007-09-28,2008-01,2010-11,"[{'pmid': '16310901', 'type': 'BACKGROUND', 'c..."
3,NCT01033760,Optimisation of Primary HIV1 Infection Treatme...,"ANRS, Emerging Infectious Diseases",Gilead Sciences | Merck Sharp & Dohme LLC | Pf...,COMPLETED,INTERVENTIONAL,False,2009-12-16,2010-04,2013-12,"[{'pmid': '28708873', 'type': 'DERIVED', 'cita..."
4,NCT05349162,Epicardial vs. Transvenous ICDs in Children,Paris Cardiovascular Research Center (Inserm U...,Hôpital Necker-Enfants Malades,COMPLETED,OBSERVATIONAL,False,2022-04-27,2003-01-01,2022-04-01,[]
...,...,...,...,...,...,...,...,...,...,...,...
284,NCT01880151,Neuroelectrical Biomarkers for Alzheimer's Dis...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2013-06-18,2013-07-12,2019-01,[]
285,NCT02658253,Trial to Evaluate the Safety and Immunogenicit...,Institut National de la Santé Et de la Recherc...,"EVI Industries, Inc. | Recherche Clinique Pari...",COMPLETED,INTERVENTIONAL,False,2016-01-18,2016-01,2019-02-21,"[{'pmid': '33717176', 'type': 'DERIVED', 'cita..."
286,NCT02027051,Etude génétique Des Arméniens,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2014-01-03,2014-01,2017-01,[]
287,NCT00118677,Long-Term Supervised Treatment Interruption in...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2005-07-12,2003-02,2007-05,"[{'pmid': '20657770', 'type': 'DERIVED', 'cita..."


#### "Exploding" the "Reference" column:

For each NCTId, the Reference column may contain a list of several references.  
If, for example, there are 3 references, we want to end up with 3 rows, each indexed by the same NCTId and containing a single reference.  

*Before*:  
**`NCTId    Reference`**  
`NCT0001, [Ref1, Ref2, Ref3]`   

*After*:  
**`NCTId    Reference`**  
`NCT0001,   Ref1`  
`NCT0001,   Ref2`  
`NCT0001,   Ref3`   
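
The before/after illustration can be reproduced on a toy DataFrame. The sketch below uses made-up identifiers (`NCT0001`, `NCT0002`) and also shows what happens to a study with an empty reference list: it is kept as a single row with a missing value.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        'NCTId': ['NCT0001', 'NCT0002'],
        'Reference': [['Ref1', 'Ref2', 'Ref3'], []],
    }
)

# One row per list element; an empty list yields one row with NaN
exploded = toy.explode('Reference', ignore_index=True)
print(exploded)
```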

In [32]:
df_ct = df_ct.explode('Reference', ignore_index=True)
df_ct

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,Reference
0,NCT02014727,Safety and Immunogenicity of Recombinant Pichi...,Institut National de la Santé Et de la Recherc...,"EVI Industries, Inc. | BPRC | Recherche Cliniq...",COMPLETED,INTERVENTIONAL,False,2013-12-18,2014-01,2015-07,"{'pmid': '28947345', 'type': 'DERIVED', 'citat..."
1,NCT00117494,Rosuvastatin Versus Pravastatin in HIV Patient...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2005-07-07,2005-10,2007-06,"{'pmid': '22739396', 'type': 'DERIVED', 'citat..."
2,NCT00117494,Rosuvastatin Versus Pravastatin in HIV Patient...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2005-07-07,2005-10,2007-06,"{'pmid': '22318219', 'type': 'DERIVED', 'citat..."
3,NCT00536627,Efficacy and Tolerance of Naked DNA Vaccine in...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2007-09-28,2008-01,2010-11,"{'pmid': '16310901', 'type': 'BACKGROUND', 'ci..."
4,NCT00536627,Efficacy and Tolerance of Naked DNA Vaccine in...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2007-09-28,2008-01,2010-11,"{'pmid': '15382173', 'type': 'BACKGROUND', 'ci..."
...,...,...,...,...,...,...,...,...,...,...,...
686,NCT02658253,Trial to Evaluate the Safety and Immunogenicit...,Institut National de la Santé Et de la Recherc...,"EVI Industries, Inc. | Recherche Clinique Pari...",COMPLETED,INTERVENTIONAL,False,2016-01-18,2016-01,2019-02-21,"{'pmid': '33717176', 'type': 'DERIVED', 'citat..."
687,NCT02658253,Trial to Evaluate the Safety and Immunogenicit...,Institut National de la Santé Et de la Recherc...,"EVI Industries, Inc. | Recherche Clinique Pari...",COMPLETED,INTERVENTIONAL,False,2016-01-18,2016-01,2019-02-21,"{'pmid': '32032566', 'type': 'DERIVED', 'citat..."
688,NCT02027051,Etude génétique Des Arméniens,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2014-01-03,2014-01,2017-01,
689,NCT00118677,Long-Term Supervised Treatment Interruption in...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2005-07-12,2003-02,2007-05,"{'pmid': '20657770', 'type': 'DERIVED', 'citat..."


For each row, the Reference column now contains a dictionary of the form:  
`{  
"pmid": "17545707",  
"type": "BACKGROUND",  
"citation": "...",  
}`

From it we want to extract new columns, using the dictionary keys:  
**`pmid     type        citation`**  
`17545707, BACKGROUND, "..."`
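
Rows without any reference hold a missing value rather than a dictionary. The sketch below shows one way to turn such a column into `pmid`, `type` and `citation` columns while replacing missing entries with an empty dictionary first. It is an alternative illustration on toy data, not the method used in this notebook.

```python
import pandas as pd

references = pd.Series(
    [
        {'pmid': '28947345', 'type': 'DERIVED', 'citation': '...'},
        None,  # study without any reference
        {'pmid': '16310901', 'type': 'BACKGROUND', 'citation': '...'},
    ]
)

# Replace anything that is not a dictionary with an empty one,
# so that the row is kept and filled with missing values
rows = [ref if isinstance(ref, dict) else {} for ref in references]
reference_columns = pd.DataFrame(rows, columns=['pmid', 'type', 'citation'])
print(reference_columns)
```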

In [33]:
df_ct_references = pd.json_normalize(df_ct.pop('Reference'))
df_ct_references

Unnamed: 0,pmid,type,citation
0,28947345,DERIVED,"Sirima SB, Durier C, Kara L, Houard S, Gansane..."
1,22739396,DERIVED,"Bittar R, Giral P, Aslangul E, Assoumou L, Val..."
2,22318219,DERIVED,"Bittar R, Giral P, Aslangul E, Assoumou L, Val..."
3,16310901,BACKGROUND,"Mancini-Bourgine M, Fontaine H, Brechot C, Pol..."
4,15382173,BACKGROUND,"Mancini-Bourgine M, Fontaine H, Scott-Algara D..."
...,...,...,...
686,33717176,DERIVED,"Gamain B, Chene A, Viebig NK, Tuikue Ndam N, N..."
687,32032566,DERIVED,"Sirima SB, Richert L, Chene A, Konate AT, Camp..."
688,,,
689,20657770,DERIVED,"Weiss L, Piketty C, Assoumou L, Didier C, Cacc..."


We reassemble the complete DataFrame:

In [34]:
df_ct = df_ct.join(df_ct_references)
df_ct

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,pmid,type,citation
0,NCT02014727,Safety and Immunogenicity of Recombinant Pichi...,Institut National de la Santé Et de la Recherc...,"EVI Industries, Inc. | BPRC | Recherche Cliniq...",COMPLETED,INTERVENTIONAL,False,2013-12-18,2014-01,2015-07,28947345,DERIVED,"Sirima SB, Durier C, Kara L, Houard S, Gansane..."
1,NCT00117494,Rosuvastatin Versus Pravastatin in HIV Patient...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2005-07-07,2005-10,2007-06,22739396,DERIVED,"Bittar R, Giral P, Aslangul E, Assoumou L, Val..."
2,NCT00117494,Rosuvastatin Versus Pravastatin in HIV Patient...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2005-07-07,2005-10,2007-06,22318219,DERIVED,"Bittar R, Giral P, Aslangul E, Assoumou L, Val..."
3,NCT00536627,Efficacy and Tolerance of Naked DNA Vaccine in...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2007-09-28,2008-01,2010-11,16310901,BACKGROUND,"Mancini-Bourgine M, Fontaine H, Brechot C, Pol..."
4,NCT00536627,Efficacy and Tolerance of Naked DNA Vaccine in...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2007-09-28,2008-01,2010-11,15382173,BACKGROUND,"Mancini-Bourgine M, Fontaine H, Scott-Algara D..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
686,NCT02658253,Trial to Evaluate the Safety and Immunogenicit...,Institut National de la Santé Et de la Recherc...,"EVI Industries, Inc. | Recherche Clinique Pari...",COMPLETED,INTERVENTIONAL,False,2016-01-18,2016-01,2019-02-21,33717176,DERIVED,"Gamain B, Chene A, Viebig NK, Tuikue Ndam N, N..."
687,NCT02658253,Trial to Evaluate the Safety and Immunogenicit...,Institut National de la Santé Et de la Recherc...,"EVI Industries, Inc. | Recherche Clinique Pari...",COMPLETED,INTERVENTIONAL,False,2016-01-18,2016-01,2019-02-21,32032566,DERIVED,"Sirima SB, Richert L, Chene A, Konate AT, Camp..."
688,NCT02027051,Etude génétique Des Arméniens,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2014-01-03,2014-01,2017-01,,,
689,NCT00118677,Long-Term Supervised Treatment Interruption in...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2005-07-12,2003-02,2007-05,20657770,DERIVED,"Weiss L, Piketty C, Assoumou L, Didier C, Cacc..."


***Rebuilding the index:***

In [35]:
# df_studies_v2.set_index('NCTId', inplace = True)

***Specifying the types:***

In [36]:
df_ct = df_ct.convert_dtypes()
# df_ct = df_ct.astype({"OverallStatus" : 'category', "StudyType" : 'category', "type": 'category',})
df_ct = df_ct.astype({'OverallStatus': 'category', 'StudyType': 'category'})
df_ct.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 691 entries, 0 to 690
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   NCTId               691 non-null    string  
 1   BriefTitle          691 non-null    string  
 2   LeadSponsorName     691 non-null    string  
 3   CollaboratorName    350 non-null    string  
 4   OverallStatus       691 non-null    category
 5   StudyType           691 non-null    category
 6   HasResults          691 non-null    boolean 
 7   StudyFirstPostDate  691 non-null    string  
 8   StartDate           691 non-null    string  
 9   CompletionDate      690 non-null    string  
 10  pmid                541 non-null    string  
 11  type                541 non-null    string  
 12  citation            541 non-null    string  
dtypes: boolean(1), category(2), string(10)
memory usage: 57.0 KB
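
The three date columns stay as strings, and their precision varies (`2014-01` versus `2013-12-18`). If real dates are needed later, a sketch such as the one below could be used. `parse_partial_date` is a hypothetical helper, and mapping month-only values to the first day of the month is an arbitrary assumption.

```python
from datetime import date


def parse_partial_date(value):
    """Parse 'YYYY-MM-DD' or 'YYYY-MM' into a date; anything else gives None."""
    if not isinstance(value, str) or not value:
        return None
    parts = [int(part) for part in value.split('-')]
    if len(parts) == 2:
        parts.append(1)  # assumption: month-only dates map to the 1st
    return date(*parts)


print(parse_partial_date('2014-01'))
print(parse_partial_date('2013-12-18'))
```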


In [37]:
df_ct

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,pmid,type,citation
0,NCT02014727,Safety and Immunogenicity of Recombinant Pichi...,Institut National de la Santé Et de la Recherc...,"EVI Industries, Inc. | BPRC | Recherche Cliniq...",COMPLETED,INTERVENTIONAL,False,2013-12-18,2014-01,2015-07,28947345,DERIVED,"Sirima SB, Durier C, Kara L, Houard S, Gansane..."
1,NCT00117494,Rosuvastatin Versus Pravastatin in HIV Patient...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2005-07-07,2005-10,2007-06,22739396,DERIVED,"Bittar R, Giral P, Aslangul E, Assoumou L, Val..."
2,NCT00117494,Rosuvastatin Versus Pravastatin in HIV Patient...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2005-07-07,2005-10,2007-06,22318219,DERIVED,"Bittar R, Giral P, Aslangul E, Assoumou L, Val..."
3,NCT00536627,Efficacy and Tolerance of Naked DNA Vaccine in...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2007-09-28,2008-01,2010-11,16310901,BACKGROUND,"Mancini-Bourgine M, Fontaine H, Brechot C, Pol..."
4,NCT00536627,Efficacy and Tolerance of Naked DNA Vaccine in...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2007-09-28,2008-01,2010-11,15382173,BACKGROUND,"Mancini-Bourgine M, Fontaine H, Scott-Algara D..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
686,NCT02658253,Trial to Evaluate the Safety and Immunogenicit...,Institut National de la Santé Et de la Recherc...,"EVI Industries, Inc. | Recherche Clinique Pari...",COMPLETED,INTERVENTIONAL,False,2016-01-18,2016-01,2019-02-21,33717176,DERIVED,"Gamain B, Chene A, Viebig NK, Tuikue Ndam N, N..."
687,NCT02658253,Trial to Evaluate the Safety and Immunogenicit...,Institut National de la Santé Et de la Recherc...,"EVI Industries, Inc. | Recherche Clinique Pari...",COMPLETED,INTERVENTIONAL,False,2016-01-18,2016-01,2019-02-21,32032566,DERIVED,"Sirima SB, Richert L, Chene A, Konate AT, Camp..."
688,NCT02027051,Etude génétique Des Arméniens,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2014-01-03,2014-01,2017-01,,,
689,NCT00118677,Long-Term Supervised Treatment Interruption in...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2005-07-12,2003-02,2007-05,20657770,DERIVED,"Weiss L, Piketty C, Assoumou L, Didier C, Cacc..."


#### Export to CSV:

In [38]:
# df_ct.to_csv('Data/outputs/extract_CT_api_v2.csv', sep=";", encoding='utf-8-sig')
df_ct.to_csv(
    'Data/outputs/extract_CT_api_v2.csv',
    sep=';',
    index=False,
    encoding='utf-8-sig',
)
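
`to_csv` does not create missing folders, so the export fails if `Data/outputs` does not exist yet. A minimal sketch of a guard is given below. `ensure_parent_dir` is a hypothetical helper, and the example writes to a temporary folder with the standard `csv` module rather than touching the real output path.

```python
import csv
import tempfile
from pathlib import Path


def ensure_parent_dir(path):
    """Create the parent folders of `path` if needed and return it as a Path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(Path(tmp) / 'Data' / 'outputs' / 'extract_CT_api_v2.csv')
    # Same separator and encoding as the export above
    with target.open('w', newline='', encoding='utf-8-sig') as handle:
        csv.writer(handle, delimiter=';').writerow(['NCTId', 'pmid'])
    created = target.exists()

print(created)
```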

# PubMed

### Using a key for the PubMed API:

Using a key to access the PubMed API is recommended: it allows up to 10 requests per second.  
Without a key, the limit is 3 requests per second.  

> E-utils users are allowed 3 requests/second without an API key. Create an API key to increase your e-utils limit to 10 requests/second.

**In practice, since the PubMed API is much slower (~1 request per second), this does not seem to make much difference.**
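
Should the request rate ever need to be capped explicitly, a simple limiter could be called before each query. The sketch below is a generic illustration: `RateLimiter` is a hypothetical class, and whether metapub already throttles its own requests has not been checked here.

```python
import time


class RateLimiter:
    """Block so that calls to wait() are at least 1/max_per_second apart."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


# 3 requests per second is the documented limit without an API key
limiter = RateLimiter(max_per_second=3)
limiter.wait()  # the first call never sleeps
```

The clock and sleep functions are injectable so that the limiter can be tested without actually waiting.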

To get your key, go to the following page while logged in:
https://account.ncbi.nlm.nih.gov/settings/

Once you have the key, add it to your environment variables with the following command:

**Windows:** 

`setx NCBI_API_KEY "123456"`

**Linux/MacOS:**

`export NCBI_API_KEY=123456`

In [39]:
import os

assert os.getenv('NCBI_API_KEY', None) is not None
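
A bare `assert` gives little information when the variable is missing. The sketch below shows a slightly more explicit check. `get_ncbi_api_key` is a hypothetical helper, and the example reads from a plain dictionary so that it runs regardless of the local environment.

```python
import os


def get_ncbi_api_key(environ=os.environ):
    """Return the NCBI API key, or raise with an actionable message."""
    key = environ.get('NCBI_API_KEY', '').strip()
    if not key:
        raise RuntimeError(
            'NCBI_API_KEY is not set; define it and restart Jupyter '
            'so that the kernel sees the new environment variable.'
        )
    return key


# Illustration with a stand-in environment (the value is a placeholder)
print(get_ncbi_api_key({'NCBI_API_KEY': ' 123456 '}))
```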

### Installing metapub:
- In Anaconda Navigator, select the same environment as the one running Jupyter  
- Launch Powershell Prompt in that environment  
- Type the following command: `pip install metapub`

In [40]:
from metapub import PubMedFetcher
from metapub.exceptions import InvalidPMID

fetch = PubMedFetcher(cachedir='./.cache/')

### Retrieving PMIDs via PubMed:

For each NCTId from CT, we retrieve the PMIDs of the associated publications via PubMed:

In [41]:
# Unique list of the NCTIds extracted from ClinicalTrials.gov
nctid_array = df_ct.loc[:, 'NCTId'].unique()

pmids_pubmed_dict = {}
for i, nctid in enumerate(nctid_array):
    # Display the progress on a single line
    print(f'\r{i+1}/{len(nctid_array)}...', end='', flush=True)

    # Query PubMed with the NCTId and keep the matching PMIDs as a set
    pmids_pubmed_dict[nctid] = set(fetch.pmids_for_query(nctid))

289/289...

In [42]:
pmids_pubmed_dict

{'NCT02014727': {'28947345'},
 'NCT00117494': {'22318219', '22739396'},
 'NCT00536627': {'24394187', '24555998'},
 'NCT01033760': {'25701561', '28708873'},
 'NCT05349162': {'36735263'},
 'NCT01164462': set(),
 'NCT02777229': {'31339676', '33010241', '33355914', '37851566', '38156046'},
 'NCT00480792': {'21486976', '27064975'},
 'NCT01095744': set(),
 'NCT01489592': set(),
 'NCT01453192': {'30688008'},
 'NCT00460382': set(),
 'NCT02486731': set(),
 'NCT02116374': set(),
 'NCT02497274': set(),
 'NCT01022476': set(),
 'NCT00528060': {'23459496'},
 'NCT02078076': set(),
 'NCT04008927': {'35774932'},
 'NCT01331876': set(),
 'NCT00820820': set(),
 'NCT02273765': {'33667406'},
 'NCT02107365': set(),
 'NCT00113282': set(),
 'NCT01055873': {'22427678'},
 'NCT05199831': set(),
 'NCT01359774': set(),
 'NCT02099474': {'32661003'},
 'NCT03546127': set(),
 'NCT04315948': {'32304640',
  '32958495',
  '33264556',
  '34048876',
  '34350582',
  '34473343',
  '34534511',
  '35233617',
  '35512728',
  '36

**We want to merge the list of PMIDs just retrieved from PubMed with the list of PMIDs already retrieved via CT.**

We put the CT PMIDs in the same form:

In [43]:
pmids_ct_dict = {}
for nctid in nctid_array:
    pmids = df_ct[df_ct.loc[:, 'NCTId'] == nctid].loc[:, 'pmid'].dropna()
    pmids_ct_dict[nctid] = set(pmids)
pmids_ct_dict

{'NCT02014727': {'28947345'},
 'NCT00117494': {'22318219', '22739396'},
 'NCT00536627': {'15382173', '16310901', '24394187', '24555998'},
 'NCT01033760': {'25701561', '28708873'},
 'NCT05349162': set(),
 'NCT01164462': set(),
 'NCT02777229': {'31339676', '33010241', '33355914'},
 'NCT00480792': {'21486976', '27064975'},
 'NCT01095744': set(),
 'NCT01489592': {'15501961', '15974953', '16946298', '18559556', '20169331'},
 'NCT01453192': set(),
 'NCT00460382': set(),
 'NCT02486731': set(),
 'NCT02116374': set(),
 'NCT02497274': set(),
 'NCT01022476': set(),
 'NCT00528060': {'23459496'},
 'NCT02078076': set(),
 'NCT04008927': {'22310560',
  '26806260',
  '27168667',
  '27178119',
  '27537841',
  '28004616',
  '29432973',
  '35774932'},
 'NCT01331876': set(),
 'NCT00820820': set(),
 'NCT02273765': {'33667406'},
 'NCT02107365': set(),
 'NCT00113282': set(),
 'NCT01055873': {'22427678'},
 'NCT05199831': set(),
 'NCT01359774': {'17653274', '21285522', '21320997'},
 'NCT02099474': {'32661003'},
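
The cell above filters the DataFrame once per NCTId. The same dictionary of sets can be built in a single pass over (NCTId, PMID) pairs, as sketched below on a few values taken from this notebook. `group_pmids` is a hypothetical helper, and `None` stands in for a missing PMID.

```python
def group_pmids(pairs):
    """Group PMIDs by NCTId; studies without any PMID get an empty set."""
    grouped = {}
    for nctid, pmid in pairs:
        # setdefault registers the study even when it has no PMID
        bucket = grouped.setdefault(nctid, set())
        if pmid is not None:
            bucket.add(pmid)
    return grouped


pairs = [
    ('NCT00117494', '22739396'),
    ('NCT00117494', '22318219'),
    ('NCT02027051', None),
]
grouped = group_pmids(pairs)
print(grouped)
```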

For a given NCTId, we take the union of the two sets of PMIDs:

In [44]:
pmids_complete_dict = {}
for nctid in nctid_array:
    # All PMIDs found in PubMed or in CT
    pmids_complete_dict[nctid] = pmids_pubmed_dict[nctid] | pmids_ct_dict[nctid]
pmids_complete_dict

{'NCT02014727': {'28947345'},
 'NCT00117494': {'22318219', '22739396'},
 'NCT00536627': {'15382173', '16310901', '24394187', '24555998'},
 'NCT01033760': {'25701561', '28708873'},
 'NCT05349162': {'36735263'},
 'NCT01164462': set(),
 'NCT02777229': {'31339676', '33010241', '33355914', '37851566', '38156046'},
 'NCT00480792': {'21486976', '27064975'},
 'NCT01095744': set(),
 'NCT01489592': {'15501961', '15974953', '16946298', '18559556', '20169331'},
 'NCT01453192': {'30688008'},
 'NCT00460382': set(),
 'NCT02486731': set(),
 'NCT02116374': set(),
 'NCT02497274': set(),
 'NCT01022476': set(),
 'NCT00528060': {'23459496'},
 'NCT02078076': set(),
 'NCT04008927': {'22310560',
  '26806260',
  '27168667',
  '27178119',
  '27537841',
  '28004616',
  '29432973',
  '35774932'},
 'NCT01331876': set(),
 'NCT00820820': set(),
 'NCT02273765': {'33667406'},
 'NCT02107365': set(),
 'NCT00113282': set(),
 'NCT01055873': {'22427678'},
 'NCT05199831': set(),
 'NCT01359774': {'17653274', '21285522', '213

#### Checks:

In [45]:
num_pmids_ct = sum((len(v) for v in pmids_ct_dict.values()))
print(f'Total number of publications from CT: {num_pmids_ct}')

Total number of publications from CT: 541


In [46]:
num_pmids_complete = sum((len(v) for v in pmids_complete_dict.values()))
print(f'Total number of publications after querying PubMed: {num_pmids_complete}')

Total number of publications after querying PubMed: 569


In [47]:
pmids_pubmed_only_dict = {}
for nctid in nctid_array:
    # PMIDs found in PubMed only
    pmids_pubmed_only_dict[nctid] = pmids_pubmed_dict[nctid] - pmids_ct_dict[nctid]

In [48]:
num_pmids_pubmed_only = sum((len(v) for v in pmids_pubmed_only_dict.values()))
print(f'Number of new PMIDs found via PubMed: {num_pmids_pubmed_only}')

Number of new PMIDs found via PubMed: 28


In [49]:
assert num_pmids_complete - num_pmids_ct == num_pmids_pubmed_only
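
The assertion above relies on a set identity: for each study, the size of the union minus the size of the CT set equals the size of the PubMed-only difference. A minimal sketch on made-up data (all `*_toy` names are illustrative):

```python
# Toy dictionaries in the same shape as pmids_ct_dict / pmids_pubmed_dict
ct_toy = {'NCT0001': {'1', '2'}, 'NCT0002': set()}
pubmed_toy = {'NCT0001': {'2', '3'}, 'NCT0002': {'4'}}

# Union and difference, study by study
complete_toy = {k: ct_toy[k] | pubmed_toy[k] for k in ct_toy}
pubmed_only_toy = {k: pubmed_toy[k] - ct_toy[k] for k in ct_toy}


def count(pmids_by_study):
    """Sum of the per-study set sizes."""
    return sum(len(v) for v in pmids_by_study.values())


# |A ∪ B| - |A| == |B - A| holds per study, hence also for the sums
assert count(complete_toy) - count(ct_toy) == count(pubmed_only_toy)
print(count(ct_toy), count(complete_toy), count(pubmed_only_toy))
```

Note that these totals are sums over studies, so a PMID attached to two different NCTIds is counted twice.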

In [50]:
print('NCTIds with new PMIDs found via PubMed:')
{k: v for k, v in pmids_pubmed_only_dict.items() if v}

NCTIds with new PMIDs found via PubMed:


{'NCT05349162': {'36735263'},
 'NCT02777229': {'37851566', '38156046'},
 'NCT01453192': {'30688008'},
 'NCT04315948': {'36695483'},
 'NCT05311865': {'36438274', '37795682'},
 'NCT00640263': {'34425825'},
 'NCT01801618': {'29662875'},
 'NCT02405013': {'36686592'},
 'NCT01703962': {'37668523'},
 'NCT01473472': {'36601747'},
 'NCT02212379': {'31269208'},
 'NCT03335995': {'37497675'},
 'NCT01426243': {'26314624'},
 'NCT03870438': {'38484756'},
 'NCT03078439': {'38408861'},
 'NCT02057796': {'36883573'},
 'NCT04392388': {'34293141'},
 'NCT01089387': {'26439886'},
 'NCT02833961': {'36318030'},
 'NCT02481453': {'38273639'},
 'NCT04409405': {'38043556'},
 'NCT04288128': {'38419144', '38421662'},
 'NCT03215732': {'37143029'},
 'NCT03005652': {'38100477'},
 'NCT01688453': {'35272723'}}

In [51]:
num_nctid_empty_ct = sum((1 for v in pmids_ct_dict.values() if v == set()))
print(f"Number of studies without PMIDs from CT: {num_nctid_empty_ct}")

Number of studies without PMIDs from CT: 150


In [52]:
num_nctid_empty_pubmed = sum((1 for v in pmids_complete_dict.values() if v == set()))
print(f"Number of studies without PMIDs after querying PubMed: {num_nctid_empty_pubmed}")

Number of studies without PMIDs after querying PubMed: 138


In [53]:
print("NCTIds that had no PMIDs in CT but gained some via PubMed:")
nctids_previously_empty = {k for k, v in pmids_ct_dict.items() if v == set()} - {k for k, v in pmids_complete_dict.items() if v == set()}
{k: pmids_complete_dict[k] for k in nctids_previously_empty}

NCTIds that had no PMIDs in CT but gained some via PubMed:


{'NCT02405013': {'36686592'},
 'NCT05349162': {'36735263'},
 'NCT04288128': {'38419144', '38421662'},
 'NCT01453192': {'30688008'},
 'NCT01801618': {'29662875'},
 'NCT02212379': {'31269208'},
 'NCT01703962': {'37668523'},
 'NCT03078439': {'38408861'},
 'NCT04409405': {'38043556'},
 'NCT02833961': {'36318030'},
 'NCT04392388': {'34293141'},
 'NCT05311865': {'36438274', '37795682'}}

In [54]:
len(nctids_previously_empty)

12

In [55]:
assert num_nctid_empty_ct - num_nctid_empty_pubmed == len(nctids_previously_empty)

### Enriching the PMIDs via the PubMed API

Each retrieved PMID is enriched with PubMed metadata such as the title, the authors, etc.:

In [56]:
counter = 0   # To keep track of progress
book_counter = 0   # Publications whose type is not 'article' are ignored
book = None   # Last ignored (NCTId, PMID) pair, if any
total_publications_list = []

# For each NCTId...
for nctid, pmids in pmids_complete_dict.items():
    pmids_list = []

    # We process each PMID...
    for pmid in pmids:

        # Display the progress on a single line
        print(f'\r{counter+1} / {num_pmids_complete}...', end='', flush=True)

        try:
            # Fetch article details from Pubmed
            article = fetch.article_by_pmid(pmid)

            # We are not interested in publications of type 'book'
            # TODO: book special case ?
            if article.pubmed_type == 'article':
                pmids_list.append(
                    {
                        'pmid': pmid,
                        'title': article.title,
                        'authors': article.authors_str.strip(),
                        'doi': article.doi,
                        'year': article.year,
                        'publication_types': list(article.publication_types.values()),
                        'citation': article.citation,
                    }
                )
            else:
                book_counter += 1
                book = (nctid, pmid)

            counter += 1
        except InvalidPMID as e:
            print(f'\n{e}')

    publication_dict = {'NCTId': nctid, 'publications': pmids_list}

    total_publications_list.append(publication_dict)

569 / 569...

In [57]:
print(f"Number of publications with a type other than 'article': {book_counter}")
print(book)

Number of publications with a type other than 'article': 1
('NCT03537196', '27227200')
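
The loop above only remembers the last ignored publication in `book`. If several were ever ignored, one option would be to keep them all in a list. The sketch below is an assumption-laden illustration: `split_by_type` is a hypothetical helper, and the records are simple stand-in objects exposing only the `pubmed_type` attribute used in the notebook, not real fetched articles.

```python
from types import SimpleNamespace


def split_by_type(records):
    """Separate 'article' records from the others, keeping every ignored one."""
    articles, ignored = [], []
    for nctid, pmid, record in records:
        if record.pubmed_type == 'article':
            articles.append((nctid, pmid))
        else:
            # Keep the type as well, to help decide later how to handle it
            ignored.append((nctid, pmid, record.pubmed_type))
    return articles, ignored


# Hypothetical stand-ins for fetched records
records_toy = [
    ('NCT0001', '111', SimpleNamespace(pubmed_type='article')),
    ('NCT0001', '222', SimpleNamespace(pubmed_type='book')),
    ('NCT0002', '333', SimpleNamespace(pubmed_type='book')),
]

articles_toy, ignored_toy = split_by_type(records_toy)
print(len(ignored_toy), ignored_toy)
```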


In [58]:
print(json.dumps(total_publications_list, indent=2))

[
  {
    "NCTId": "NCT02014727",
    "publications": [
      {
        "pmid": "28947345",
        "title": "Safety and immunogenicity of a recombinant Plasmodium falciparum AMA1-DiCo malaria vaccine adjuvanted with GLA-SE or Alhydrogel\u00ae in European and African adults: A phase 1a/1b, randomized, double-blind multi-centre trial.",
        "authors": "Sirima SB; Durier C; Kara L; Houard S; Gansane A; Loulergue P; Bahuaud M; Benhamouda N; Nebi\u00e9 I; Faber B; Remarque E; Launay O; AMA1-DiCo Study Group",
        "doi": "10.1016/j.vaccine.2017.09.027",
        "year": "2017",
        "publication_types": [
          "Clinical Trial, Phase I",
          "Journal Article",
          "Randomized Controlled Trial",
          "Research Support, Non-U.S. Gov't"
        ],
        "citation": "Sirima SB, et al. Safety and immunogenicity of a recombinant Plasmodium falciparum AMA1-DiCo malaria vaccine adjuvanted with GLA-SE or Alhydrogel\u00ae in European and African adults: A phase 1a/1b,

In [59]:
# The number of NCTId didn't change
assert len(total_publications_list) == len(studies_list)

### Import into pandas

In [60]:
df_pubmed = pd.DataFrame.from_records(total_publications_list)
df_pubmed

Unnamed: 0,NCTId,publications
0,NCT02014727,"[{'pmid': '28947345', 'title': 'Safety and imm..."
1,NCT00117494,"[{'pmid': '22739396', 'title': 'Effects of ros..."
2,NCT00536627,"[{'pmid': '24394187', 'title': 'Immunological ..."
3,NCT01033760,"[{'pmid': '25701561', 'title': 'Intensive five..."
4,NCT05349162,"[{'pmid': '36735263', 'title': 'Epicardial vs...."
...,...,...
284,NCT01880151,[]
285,NCT02658253,"[{'pmid': '33717176', 'title': 'Progress and I..."
286,NCT02027051,[]
287,NCT00118677,"[{'pmid': '20657770', 'title': 'Relationship b..."


#### "Exploding" the "publications" column:

For each NCTId, the 'publications' column potentially contains a list of several publications.  
If, for example, there are 3 publications, we want to end up with 3 rows, each carrying the same NCTId and containing a single publication.  

*Before*:  
**`NCTId    publications`**  
`NCT0001, [Pub1, Pub2, Pub3]`   

*After*:  
**`NCTId    publications`**  
`NCT0001,   Pub1`  
`NCT0001,   Pub2`  
`NCT0001,   Pub3`   

In [61]:
df_pubmed = df_pubmed.explode('publications', ignore_index=True)
df_pubmed

Unnamed: 0,NCTId,publications
0,NCT02014727,"{'pmid': '28947345', 'title': 'Safety and immu..."
1,NCT00117494,"{'pmid': '22739396', 'title': 'Effects of rosu..."
2,NCT00117494,"{'pmid': '22318219', 'title': 'Determinants of..."
3,NCT00536627,"{'pmid': '24394187', 'title': 'Immunological a..."
4,NCT00536627,"{'pmid': '15382173', 'title': 'Induction or ex..."
...,...,...
701,NCT02658253,"{'pmid': '33717176', 'title': 'Progress and In..."
702,NCT02658253,"{'pmid': '32032566', 'title': 'PRIMVAC vaccine..."
703,NCT02027051,
704,NCT00118677,"{'pmid': '20657770', 'title': 'Relationship be..."
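
To make the behaviour of `explode` concrete, here is a small sketch on toy data (the values are made up). It shows in particular that a study with an empty list is not dropped: it yields one row with a missing value, as for NCT02027051 above.

```python
import pandas as pd

df_toy = pd.DataFrame(
    {
        'NCTId': ['NCT0001', 'NCT0002'],
        'publications': [[{'pmid': '1'}, {'pmid': '2'}], []],
    }
)

# One row per list element; ignore_index=True renumbers the rows from 0
exploded_toy = df_toy.explode('publications', ignore_index=True)
print(exploded_toy)
```

Since every study keeps at least one row, the exploded table has as many rows as there are publications plus one row per study without publication.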


We check that PubMed + CT yields at least as many rows as CT alone:

In [62]:
assert len(df_pubmed) >= len(df_ct)

Each row of the 'publications' column now contains a dictionary of the form:  
`{  
"pmid": "17545707",  
"title": "Haematological ...",  
"authors": "Smith DJ; ...",  
}`

We want to turn its keys into new columns:  
**`pmid      title                 authors`**  
`17545707, "Haematological ...", "Smith DJ; ..."`

In [63]:
df_pubmed_publications = pd.json_normalize(df_pubmed.pop('publications'))
df_pubmed_publications

Unnamed: 0,pmid,title,authors,doi,year,publication_types,citation
0,28947345,Safety and immunogenicity of a recombinant Pla...,Sirima SB; Durier C; Kara L; Houard S; Gansane...,10.1016/j.vaccine.2017.09.027,2017,"[Clinical Trial, Phase I, Journal Article, Ran...","Sirima SB, et al. Safety and immunogenicity of..."
1,22739396,Effects of rosuvastatin versus pravastatin on ...,Bittar R; Giral P; Aslangul E; Assoumou L; Val...,10.1097/QAD.0b013e328357063c,2012,"[Comparative Study, Journal Article, Multicent...","Bittar R, et al. Effects of rosuvastatin versu..."
2,22318219,Determinants of low-density lipoprotein partic...,Bittar R; Giral P; Aslangul E; Assoumou L; Val...,10.3851/IMP2065,2012,"[Journal Article, Research Support, Non-U.S. G...","Bittar R, et al. Determinants of low-density l..."
3,24394187,Immunological and antiviral responses after th...,Godon O; Fontaine H; Kahi S; Meritet JF; Scott...,10.1038/mt.2013.274,2014,"[Clinical Trial, Phase I, Clinical Trial, Phas...","Godon O, et al. Immunological and antiviral re..."
4,15382173,Induction or expansion of T-cell responses by ...,Mancini-Bourgine M; Fontaine H; Scott-Algara D...,10.1002/hep.20408,2004,"[Clinical Trial, Clinical Trial, Phase I, Jour...","Mancini-Bourgine M, et al. Induction or expans..."
...,...,...,...,...,...,...,...
701,33717176,Progress and Insights Toward an Effective Plac...,Gamain B; Chêne A; Viebig NK; Tuikue Ndam N; N...,10.3389/fimmu.2021.634508,2021,"[Journal Article, Research Support, Non-U.S. G...","Gamain B, et al. Progress and Insights Toward ..."
702,32032566,PRIMVAC vaccine adjuvanted with Alhydrogel or ...,Sirima SB; Richert L; Chêne A; Konate AT; Camp...,10.1016/S1473-3099(19)30739-X,2020,"[Clinical Trial, Phase I, Journal Article, Ran...","Sirima SB, et al. PRIMVAC vaccine adjuvanted w..."
703,,,,,,,
704,20657770,Relationship between regulatory T cells and im...,Weiss L; Piketty C; Assoumou L; Didier C; Cacc...,10.1371/journal.pone.0011659,2010,"[Clinical Trial, Journal Article, Multicenter ...","Weiss L, et al. Relationship between regulator..."


We reassemble the complete DataFrame:

In [64]:
df_pubmed = df_pubmed.join(df_pubmed_publications)
df_pubmed

Unnamed: 0,NCTId,pmid,title,authors,doi,year,publication_types,citation
0,NCT02014727,28947345,Safety and immunogenicity of a recombinant Pla...,Sirima SB; Durier C; Kara L; Houard S; Gansane...,10.1016/j.vaccine.2017.09.027,2017,"[Clinical Trial, Phase I, Journal Article, Ran...","Sirima SB, et al. Safety and immunogenicity of..."
1,NCT00117494,22739396,Effects of rosuvastatin versus pravastatin on ...,Bittar R; Giral P; Aslangul E; Assoumou L; Val...,10.1097/QAD.0b013e328357063c,2012,"[Comparative Study, Journal Article, Multicent...","Bittar R, et al. Effects of rosuvastatin versu..."
2,NCT00117494,22318219,Determinants of low-density lipoprotein partic...,Bittar R; Giral P; Aslangul E; Assoumou L; Val...,10.3851/IMP2065,2012,"[Journal Article, Research Support, Non-U.S. G...","Bittar R, et al. Determinants of low-density l..."
3,NCT00536627,24394187,Immunological and antiviral responses after th...,Godon O; Fontaine H; Kahi S; Meritet JF; Scott...,10.1038/mt.2013.274,2014,"[Clinical Trial, Phase I, Clinical Trial, Phas...","Godon O, et al. Immunological and antiviral re..."
4,NCT00536627,15382173,Induction or expansion of T-cell responses by ...,Mancini-Bourgine M; Fontaine H; Scott-Algara D...,10.1002/hep.20408,2004,"[Clinical Trial, Clinical Trial, Phase I, Jour...","Mancini-Bourgine M, et al. Induction or expans..."
...,...,...,...,...,...,...,...,...
701,NCT02658253,33717176,Progress and Insights Toward an Effective Plac...,Gamain B; Chêne A; Viebig NK; Tuikue Ndam N; N...,10.3389/fimmu.2021.634508,2021,"[Journal Article, Research Support, Non-U.S. G...","Gamain B, et al. Progress and Insights Toward ..."
702,NCT02658253,32032566,PRIMVAC vaccine adjuvanted with Alhydrogel or ...,Sirima SB; Richert L; Chêne A; Konate AT; Camp...,10.1016/S1473-3099(19)30739-X,2020,"[Clinical Trial, Phase I, Journal Article, Ran...","Sirima SB, et al. PRIMVAC vaccine adjuvanted w..."
703,NCT02027051,,,,,,,
704,NCT00118677,20657770,Relationship between regulatory T cells and im...,Weiss L; Piketty C; Assoumou L; Didier C; Cacc...,10.1371/journal.pone.0011659,2010,"[Clinical Trial, Journal Article, Multicenter ...","Weiss L, et al. Relationship between regulator..."
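
The normalize-then-join step works because both tables share the same default row index. A minimal sketch on toy data (names and values are illustrative; replacing missing entries with an empty dict is a defensive choice of this sketch, not something the notebook does):

```python
import pandas as pd

df_toy = pd.DataFrame(
    {
        'NCTId': ['NCT0001', 'NCT0001', 'NCT0002'],
        'publications': [
            {'pmid': '111', 'title': 'First paper'},
            {'pmid': '222', 'title': 'Second paper'},
            None,  # study without publication
        ],
    }
)

# Replace missing entries with an empty dict so that every record is a mapping
records = [p if isinstance(p, dict) else {} for p in df_toy.pop('publications')]
df_details_toy = pd.json_normalize(records)

# join() aligns on the index: both frames use the default 0..n-1 index here
df_flat_toy = df_toy.join(df_details_toy)
print(df_flat_toy)
```

If the exploded table had kept its original (duplicated) index instead of a fresh one, this index-based join would no longer pair rows one to one.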


### Joining the CT and PubMed DataFrames:

In [65]:
df_final = df_ct.merge(df_pubmed, on=['NCTId', 'pmid'], how='right')
df_final

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,pmid,type,citation_x,title,authors,doi,year,publication_types,citation_y
0,NCT02014727,Safety and Immunogenicity of Recombinant Pichi...,Institut National de la Santé Et de la Recherc...,"EVI Industries, Inc. | BPRC | Recherche Cliniq...",COMPLETED,INTERVENTIONAL,False,2013-12-18,2014-01,2015-07,28947345,DERIVED,"Sirima SB, Durier C, Kara L, Houard S, Gansane...",Safety and immunogenicity of a recombinant Pla...,Sirima SB; Durier C; Kara L; Houard S; Gansane...,10.1016/j.vaccine.2017.09.027,2017,"[Clinical Trial, Phase I, Journal Article, Ran...","Sirima SB, et al. Safety and immunogenicity of..."
1,NCT00117494,Rosuvastatin Versus Pravastatin in HIV Patient...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2005-07-07,2005-10,2007-06,22739396,DERIVED,"Bittar R, Giral P, Aslangul E, Assoumou L, Val...",Effects of rosuvastatin versus pravastatin on ...,Bittar R; Giral P; Aslangul E; Assoumou L; Val...,10.1097/QAD.0b013e328357063c,2012,"[Comparative Study, Journal Article, Multicent...","Bittar R, et al. Effects of rosuvastatin versu..."
2,NCT00117494,Rosuvastatin Versus Pravastatin in HIV Patient...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2005-07-07,2005-10,2007-06,22318219,DERIVED,"Bittar R, Giral P, Aslangul E, Assoumou L, Val...",Determinants of low-density lipoprotein partic...,Bittar R; Giral P; Aslangul E; Assoumou L; Val...,10.3851/IMP2065,2012,"[Journal Article, Research Support, Non-U.S. G...","Bittar R, et al. Determinants of low-density l..."
3,NCT00536627,Efficacy and Tolerance of Naked DNA Vaccine in...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2007-09-28,2008-01,2010-11,24394187,DERIVED,"Godon O, Fontaine H, Kahi S, Meritet JF, Scott...",Immunological and antiviral responses after th...,Godon O; Fontaine H; Kahi S; Meritet JF; Scott...,10.1038/mt.2013.274,2014,"[Clinical Trial, Phase I, Clinical Trial, Phas...","Godon O, et al. Immunological and antiviral re..."
4,NCT00536627,Efficacy and Tolerance of Naked DNA Vaccine in...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2007-09-28,2008-01,2010-11,15382173,BACKGROUND,"Mancini-Bourgine M, Fontaine H, Scott-Algara D...",Induction or expansion of T-cell responses by ...,Mancini-Bourgine M; Fontaine H; Scott-Algara D...,10.1002/hep.20408,2004,"[Clinical Trial, Clinical Trial, Phase I, Jour...","Mancini-Bourgine M, et al. Induction or expans..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
701,NCT02658253,Trial to Evaluate the Safety and Immunogenicit...,Institut National de la Santé Et de la Recherc...,"EVI Industries, Inc. | Recherche Clinique Pari...",COMPLETED,INTERVENTIONAL,False,2016-01-18,2016-01,2019-02-21,33717176,DERIVED,"Gamain B, Chene A, Viebig NK, Tuikue Ndam N, N...",Progress and Insights Toward an Effective Plac...,Gamain B; Chêne A; Viebig NK; Tuikue Ndam N; N...,10.3389/fimmu.2021.634508,2021,"[Journal Article, Research Support, Non-U.S. G...","Gamain B, et al. Progress and Insights Toward ..."
702,NCT02658253,Trial to Evaluate the Safety and Immunogenicit...,Institut National de la Santé Et de la Recherc...,"EVI Industries, Inc. | Recherche Clinique Pari...",COMPLETED,INTERVENTIONAL,False,2016-01-18,2016-01,2019-02-21,32032566,DERIVED,"Sirima SB, Richert L, Chene A, Konate AT, Camp...",PRIMVAC vaccine adjuvanted with Alhydrogel or ...,Sirima SB; Richert L; Chêne A; Konate AT; Camp...,10.1016/S1473-3099(19)30739-X,2020,"[Clinical Trial, Phase I, Journal Article, Ran...","Sirima SB, et al. PRIMVAC vaccine adjuvanted w..."
703,NCT02027051,Etude génétique Des Arméniens,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2014-01-03,2014-01,2017-01,,,,,,,,,
704,NCT00118677,Long-Term Supervised Treatment Interruption in...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2005-07-12,2003-02,2007-05,20657770,DERIVED,"Weiss L, Piketty C, Assoumou L, Didier C, Cacc...",Relationship between regulatory T cells and im...,Weiss L; Piketty C; Assoumou L; Didier C; Cacc...,10.1371/journal.pone.0011659,2010,"[Clinical Trial, Journal Article, Multicenter ...","Weiss L, et al. Relationship between regulator..."
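
Rows that exist only in the PubMed table come out of this right join with empty CT columns. One way to flag them explicitly, rather than inferring them from missing values, is the `indicator` option of `merge`. A minimal sketch on toy data (table contents are made up):

```python
import pandas as pd

ct_toy = pd.DataFrame(
    {
        'NCTId': ['NCT0001', 'NCT0002'],
        'pmid': ['111', '333'],
        'BriefTitle': ['Study A', 'Study B'],
    }
)
pubmed_toy = pd.DataFrame(
    {
        'NCTId': ['NCT0001', 'NCT0001', 'NCT0002'],
        'pmid': ['111', '222', '333'],
        'title': ['Paper 1', 'Paper 2', 'Paper 3'],
    }
)

# indicator=True adds a '_merge' column: 'both' or 'right_only' for a right join
merged_toy = ct_toy.merge(
    pubmed_toy, on=['NCTId', 'pmid'], how='right', indicator=True
)
is_pubmed_only = merged_toy['_merge'] == 'right_only'
print(merged_toy[is_pubmed_only])
```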


Dropping the 'citation' columns:

In [66]:
df_final.drop(['citation_x', 'citation_y'], axis=1, inplace=True)

In [67]:
df_final

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,pmid,type,title,authors,doi,year,publication_types
0,NCT02014727,Safety and Immunogenicity of Recombinant Pichi...,Institut National de la Santé Et de la Recherc...,"EVI Industries, Inc. | BPRC | Recherche Cliniq...",COMPLETED,INTERVENTIONAL,False,2013-12-18,2014-01,2015-07,28947345,DERIVED,Safety and immunogenicity of a recombinant Pla...,Sirima SB; Durier C; Kara L; Houard S; Gansane...,10.1016/j.vaccine.2017.09.027,2017,"[Clinical Trial, Phase I, Journal Article, Ran..."
1,NCT00117494,Rosuvastatin Versus Pravastatin in HIV Patient...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2005-07-07,2005-10,2007-06,22739396,DERIVED,Effects of rosuvastatin versus pravastatin on ...,Bittar R; Giral P; Aslangul E; Assoumou L; Val...,10.1097/QAD.0b013e328357063c,2012,"[Comparative Study, Journal Article, Multicent..."
2,NCT00117494,Rosuvastatin Versus Pravastatin in HIV Patient...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2005-07-07,2005-10,2007-06,22318219,DERIVED,Determinants of low-density lipoprotein partic...,Bittar R; Giral P; Aslangul E; Assoumou L; Val...,10.3851/IMP2065,2012,"[Journal Article, Research Support, Non-U.S. G..."
3,NCT00536627,Efficacy and Tolerance of Naked DNA Vaccine in...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2007-09-28,2008-01,2010-11,24394187,DERIVED,Immunological and antiviral responses after th...,Godon O; Fontaine H; Kahi S; Meritet JF; Scott...,10.1038/mt.2013.274,2014,"[Clinical Trial, Phase I, Clinical Trial, Phas..."
4,NCT00536627,Efficacy and Tolerance of Naked DNA Vaccine in...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2007-09-28,2008-01,2010-11,15382173,BACKGROUND,Induction or expansion of T-cell responses by ...,Mancini-Bourgine M; Fontaine H; Scott-Algara D...,10.1002/hep.20408,2004,"[Clinical Trial, Clinical Trial, Phase I, Jour..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
701,NCT02658253,Trial to Evaluate the Safety and Immunogenicit...,Institut National de la Santé Et de la Recherc...,"EVI Industries, Inc. | Recherche Clinique Pari...",COMPLETED,INTERVENTIONAL,False,2016-01-18,2016-01,2019-02-21,33717176,DERIVED,Progress and Insights Toward an Effective Plac...,Gamain B; Chêne A; Viebig NK; Tuikue Ndam N; N...,10.3389/fimmu.2021.634508,2021,"[Journal Article, Research Support, Non-U.S. G..."
702,NCT02658253,Trial to Evaluate the Safety and Immunogenicit...,Institut National de la Santé Et de la Recherc...,"EVI Industries, Inc. | Recherche Clinique Pari...",COMPLETED,INTERVENTIONAL,False,2016-01-18,2016-01,2019-02-21,32032566,DERIVED,PRIMVAC vaccine adjuvanted with Alhydrogel or ...,Sirima SB; Richert L; Chêne A; Konate AT; Camp...,10.1016/S1473-3099(19)30739-X,2020,"[Clinical Trial, Phase I, Journal Article, Ran..."
703,NCT02027051,Etude génétique Des Arméniens,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2014-01-03,2014-01,2017-01,,,,,,,
704,NCT00118677,Long-Term Supervised Treatment Interruption in...,French National Agency for Research on AIDS an...,,COMPLETED,INTERVENTIONAL,False,2005-07-12,2003-02,2007-05,20657770,DERIVED,Relationship between regulatory T cells and im...,Weiss L; Piketty C; Assoumou L; Didier C; Cacc...,10.1371/journal.pone.0011659,2010,"[Clinical Trial, Journal Article, Multicenter ..."


The new PMIDs found via PubMed have none of the associated CT information: BriefTitle, LeadSponsorName, etc.


In [68]:
# Index of empty rows we need to fill
index_empty_rows = df_final.loc[:, 'BriefTitle'].isna()

df_final[index_empty_rows]

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,pmid,type,title,authors,doi,year,publication_types
9,NCT05349162,,,,,,,,,,36735263,,Epicardial vs. transvenous implantable cardiov...,Le Bos PA; Pontailler M; Maltret A; Kraiche D;...,10.1093/europace/euad015,2023,"[Clinical Trial, Journal Article]"
11,NCT02777229,,,,,,,,,,37851566,,Improvements in Patient-Reported Outcomes Foll...,Bousmah MA; Protopopescu C; Mpoudi-Etame M; Om...,10.1097/QAI.0000000000003273,2023,"[Randomized Controlled Trial, Journal Article,..."
14,NCT02777229,,,,,,,,,,38156046,,Durability of the Efficacy and Safety of Dolut...,Mpoudi-Etame M; Tovar Sanchez T; Bousmah MA; O...,10.1093/ofid/ofad582,2023,[Journal Article]
24,NCT01453192,,,,,,,,,,30688008,,Low incidence of acute rejection within 6 mont...,Matignon M; Lelièvre JD; Lahiani A; Abbassi K;...,10.1111/hiv.12700,2019,"[Journal Article, Multicenter Study, Research ..."
60,NCT04315948,,,,,,,,,,36695483,,Remdesivir for the treatment of COVID-19.,Grundeis F; Ansems K; Dahms K; Thieme V; Metze...,10.1002/14651858.CD014962.pub2,2023,"[Journal Article, Review, Research Support, No..."
74,NCT05311865,,,,,,,,,,37795682,,A randomised controlled trial to study the tra...,Luong Nguyen LB; Goupil de Bouillé J; Menant L...,10.1093/cid/ciad603,2023,[Journal Article]
75,NCT05311865,,,,,,,,,,36438274,,Transmission of SARS-CoV-2 during indoor clubb...,Goupil de Bouillé J; Luong Nguyen LB; Crépey P...,10.3389/fpubh.2022.981213,2022,"[Clinical Trial Protocol, Journal Article, Res..."
103,NCT00640263,,,,,,,,,,34425825,,The prevalence and socio-behavioural and clini...,Birungi N; Fadnes LT; Engebretsen IMS; Tumwine...,10.1186/s12955-021-01844-3,2021,[Journal Article]
129,NCT01801618,,,,,,,,,,29662875,,Positive Virological Outcomes of HIV-Infected ...,Ségéral O; Nerrienet E; Neth S; Spire B; Khol ...,10.3389/fpubh.2018.00063,2018,[Journal Article]
130,NCT02405013,,,,,,,,,,36686592,,Patient-reported outcomes with direct-acting a...,Marcellin F; Mourad A; Lemoine M; Kouanfack C;...,10.1016/j.jhepr.2022.100665,2023,[Journal Article]


We add a 'PUBMED' type for the PMIDs that come from PubMed only:

In [69]:
# We add a 'PUBMED' type to the PMIDs extracted from Pubmed exclusively
df_final.loc[index_empty_rows, 'type'] = 'PUBMED'

In [70]:
df_final[index_empty_rows]

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,pmid,type,title,authors,doi,year,publication_types
9,NCT05349162,,,,,,,,,,36735263,PUBMED,Epicardial vs. transvenous implantable cardiov...,Le Bos PA; Pontailler M; Maltret A; Kraiche D;...,10.1093/europace/euad015,2023,"[Clinical Trial, Journal Article]"
11,NCT02777229,,,,,,,,,,37851566,PUBMED,Improvements in Patient-Reported Outcomes Foll...,Bousmah MA; Protopopescu C; Mpoudi-Etame M; Om...,10.1097/QAI.0000000000003273,2023,"[Randomized Controlled Trial, Journal Article,..."
14,NCT02777229,,,,,,,,,,38156046,PUBMED,Durability of the Efficacy and Safety of Dolut...,Mpoudi-Etame M; Tovar Sanchez T; Bousmah MA; O...,10.1093/ofid/ofad582,2023,[Journal Article]
24,NCT01453192,,,,,,,,,,30688008,PUBMED,Low incidence of acute rejection within 6 mont...,Matignon M; Lelièvre JD; Lahiani A; Abbassi K;...,10.1111/hiv.12700,2019,"[Journal Article, Multicenter Study, Research ..."
60,NCT04315948,,,,,,,,,,36695483,PUBMED,Remdesivir for the treatment of COVID-19.,Grundeis F; Ansems K; Dahms K; Thieme V; Metze...,10.1002/14651858.CD014962.pub2,2023,"[Journal Article, Review, Research Support, No..."
74,NCT05311865,,,,,,,,,,37795682,PUBMED,A randomised controlled trial to study the tra...,Luong Nguyen LB; Goupil de Bouillé J; Menant L...,10.1093/cid/ciad603,2023,[Journal Article]
75,NCT05311865,,,,,,,,,,36438274,PUBMED,Transmission of SARS-CoV-2 during indoor clubb...,Goupil de Bouillé J; Luong Nguyen LB; Crépey P...,10.3389/fpubh.2022.981213,2022,"[Clinical Trial Protocol, Journal Article, Res..."
103,NCT00640263,,,,,,,,,,34425825,PUBMED,The prevalence and socio-behavioural and clini...,Birungi N; Fadnes LT; Engebretsen IMS; Tumwine...,10.1186/s12955-021-01844-3,2021,[Journal Article]
129,NCT01801618,,,,,,,,,,29662875,PUBMED,Positive Virological Outcomes of HIV-Infected ...,Ségéral O; Nerrienet E; Neth S; Spire B; Khol ...,10.3389/fpubh.2018.00063,2018,[Journal Article]
130,NCT02405013,,,,,,,,,,36686592,PUBMED,Patient-reported outcomes with direct-acting a...,Marcellin F; Mourad A; Lemoine M; Kouanfack C;...,10.1016/j.jhepr.2022.100665,2023,[Journal Article]


We fill these columns with the information contained in the CT DataFrame:

In [71]:
# NCTIds of empty rows
NCTIds_empty_rows = df_final.loc[index_empty_rows, 'NCTId']
NCTIds_empty_rows

# Columns we wish to copy
columns_to_copy = [
    'BriefTitle',
    'LeadSponsorName',
    'CollaboratorName',
    'OverallStatus',
    'StudyType',
    'HasResults',
    'StudyFirstPostDate',
    'StartDate',
    'CompletionDate',
]

# We copy the missing values from the CT dataframe
for index, nctid in NCTIds_empty_rows.items():
    # For each NCTId, take the first row of the CT dataframe with this NCTId
    # and copy the missing columns
    df_final.loc[index, columns_to_copy] = df_ct.loc[
        df_ct.loc[:, 'NCTId'] == nctid, columns_to_copy
    ].iloc[0]

In [72]:
df_final[index_empty_rows]

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,StartDate,CompletionDate,pmid,type,title,authors,doi,year,publication_types
9,NCT05349162,Epicardial vs. Transvenous ICDs in Children,Paris Cardiovascular Research Center (Inserm U...,Hôpital Necker-Enfants Malades,COMPLETED,OBSERVATIONAL,False,2022-04-27,2003-01-01,2022-04-01,36735263,PUBMED,Epicardial vs. transvenous implantable cardiov...,Le Bos PA; Pontailler M; Maltret A; Kraiche D;...,10.1093/europace/euad015,2023,"[Clinical Trial, Journal Article]"
11,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,37851566,PUBMED,Improvements in Patient-Reported Outcomes Foll...,Bousmah MA; Protopopescu C; Mpoudi-Etame M; Om...,10.1097/QAI.0000000000003273,2023,"[Randomized Controlled Trial, Journal Article,..."
14,NCT02777229,Efficacy and Safety of a Dolutegravir-based Re...,"ANRS, Emerging Infectious Diseases",Institut de Recherche pour le Developpement | ...,COMPLETED,INTERVENTIONAL,False,2016-05-19,2016-07,2021-07,38156046,PUBMED,Durability of the Efficacy and Safety of Dolut...,Mpoudi-Etame M; Tovar Sanchez T; Bousmah MA; O...,10.1093/ofid/ofad582,2023,[Journal Article]
24,NCT01453192,Renal Transplantation and Raltegravir in HIV-I...,"ANRS, Emerging Infectious Diseases",Merck Sharp & Dohme LLC,COMPLETED,INTERVENTIONAL,False,2011-10-17,2011-12,2015-11,30688008,PUBMED,Low incidence of acute rejection within 6 mont...,Matignon M; Lelièvre JD; Lahiani A; Abbassi K;...,10.1111/hiv.12700,2019,"[Journal Article, Multicenter Study, Research ..."
60,NCT04315948,Trial of Treatments for COVID-19 in Hospitaliz...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2020-03-20,2020-03-22,2023-09-25,36695483,PUBMED,Remdesivir for the treatment of COVID-19.,Grundeis F; Ansems K; Dahms K; Thieme V; Metze...,10.1002/14651858.CD014962.pub2,2023,"[Journal Article, Review, Research Support, No..."
74,NCT05311865,Transmission of Covid-19 During Clubbing Event...,"ANRS, Emerging Infectious Diseases",Cerballiance | Kappa Santé,COMPLETED,INTERVENTIONAL,False,2022-04-05,2021-09-04,2022-02-26,37795682,PUBMED,A randomised controlled trial to study the tra...,Luong Nguyen LB; Goupil de Bouillé J; Menant L...,10.1093/cid/ciad603,2023,[Journal Article]
75,NCT05311865,Transmission of Covid-19 During Clubbing Event...,"ANRS, Emerging Infectious Diseases",Cerballiance | Kappa Santé,COMPLETED,INTERVENTIONAL,False,2022-04-05,2021-09-04,2022-02-26,36438274,PUBMED,Transmission of SARS-CoV-2 during indoor clubb...,Goupil de Bouillé J; Luong Nguyen LB; Crépey P...,10.3389/fpubh.2022.981213,2022,"[Clinical Trial Protocol, Journal Article, Res..."
103,NCT00640263,Comparison of Efficacy and Safety of Infant Pe...,French National Agency for Research on AIDS an...,European and Developing Countries Clinical Tri...,COMPLETED,INTERVENTIONAL,False,2008-03-21,2009-12,2014-02,34425825,PUBMED,The prevalence and socio-behavioural and clini...,Birungi N; Fadnes LT; Engebretsen IMS; Tumwine...,10.1186/s12955-021-01844-3,2021,[Journal Article]
129,NCT01801618,National Evaluation of PI-based 2nd Line Effic...,"ANRS, Emerging Infectious Diseases",,COMPLETED,OBSERVATIONAL,False,2013-03-01,2013-02,2014-12,29662875,PUBMED,Positive Virological Outcomes of HIV-Infected ...,Ségéral O; Nerrienet E; Neth S; Spire B; Khol ...,10.3389/fpubh.2018.00063,2018,[Journal Article]
130,NCT02405013,"Feasibility, Tolerance and Efficacy of Interfe...","ANRS, Emerging Infectious Diseases",,COMPLETED,INTERVENTIONAL,False,2015-04-01,2015-10,2017-11,36686592,PUBMED,Patient-reported outcomes with direct-acting a...,Marcellin F; Mourad A; Lemoine M; Kouanfack C;...,10.1016/j.jhepr.2022.100665,2023,[Journal Article]
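
The row-by-row copy above can also be written as a single lookup against a one-row-per-study table. The following is a sketch on toy data under that assumption (the `*_toy` names, columns and values are illustrative, and only two of the study-level columns are shown):

```python
import pandas as pd

columns_toy = ['BriefTitle', 'OverallStatus']

df_ct_toy = pd.DataFrame(
    {
        'NCTId': ['NCT0001', 'NCT0001', 'NCT0002'],
        'BriefTitle': ['Study A', 'Study A', 'Study B'],
        'OverallStatus': ['COMPLETED', 'COMPLETED', 'RECRUITING'],
    }
)
df_final_toy = pd.DataFrame(
    {
        'NCTId': ['NCT0001', 'NCT0002', 'NCT0002'],
        'BriefTitle': ['Study A', None, None],
        'OverallStatus': ['COMPLETED', None, None],
        'pmid': ['111', '333', '444'],
    }
)

# One row per study, indexed by NCTId (first occurrence wins, as in the loop)
studies_toy = df_ct_toy.drop_duplicates('NCTId').set_index('NCTId')[columns_toy]

# Rows whose study-level columns are still empty
mask = df_final_toy['BriefTitle'].isna()
nctids_to_fill = df_final_toy.loc[mask, 'NCTId'].to_list()

# Look up the study values in row order and write them back in one assignment
df_final_toy.loc[mask, columns_toy] = studies_toy.loc[nctids_to_fill].to_numpy()
print(df_final_toy)
```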


### Final result:

In [73]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 706 entries, 0 to 705
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   NCTId               706 non-null    object  
 1   BriefTitle          706 non-null    string  
 2   LeadSponsorName     706 non-null    string  
 3   CollaboratorName    357 non-null    string  
 4   OverallStatus       706 non-null    category
 5   StudyType           706 non-null    category
 6   HasResults          706 non-null    boolean 
 7   StudyFirstPostDate  706 non-null    string  
 8   StartDate           706 non-null    string  
 9   CompletionDate      705 non-null    string  
 10  pmid                568 non-null    object  
 11  type                568 non-null    string  
 12  title               568 non-null    object  
 13  authors             568 non-null    object  
 14  doi                 563 non-null    object  
 15  year                568 non-null    obje

In [74]:
df_final = df_final.convert_dtypes()
# df_final = df_final.astype({"OverallStatus" : 'category', "StudyType" : 'category'})
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 706 entries, 0 to 705
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   NCTId               706 non-null    string  
 1   BriefTitle          706 non-null    string  
 2   LeadSponsorName     706 non-null    string  
 3   CollaboratorName    357 non-null    string  
 4   OverallStatus       706 non-null    category
 5   StudyType           706 non-null    category
 6   HasResults          706 non-null    boolean 
 7   StudyFirstPostDate  706 non-null    string  
 8   StartDate           706 non-null    string  
 9   CompletionDate      705 non-null    string  
 10  pmid                568 non-null    string  
 11  type                568 non-null    string  
 12  title               568 non-null    string  
 13  authors             568 non-null    string  
 14  doi                 563 non-null    string  
 15  year                568 non-null    stri

### Export to CSV:

In [75]:
df_final.to_csv(
    'Data/outputs/extract_df_final.csv',
    sep=';',
    index=False,
    encoding='utf-8-sig',
)
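
When the exported file is read back, the reader must use the same separator and encoding. A minimal round-trip sketch on toy data, written to a temporary directory (the file name and values are made up), which also forces `pmid` to stay a string column:

```python
import os
import tempfile

import pandas as pd

df_toy = pd.DataFrame(
    {
        'NCTId': ['NCT0001', 'NCT0002'],
        'pmid': ['28947345', None],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'extract_toy.csv')
    df_toy.to_csv(path, sep=';', index=False, encoding='utf-8-sig')

    # Without an explicit dtype, the missing value would turn 'pmid' into floats
    df_back = pd.read_csv(
        path, sep=';', encoding='utf-8-sig', dtype={'pmid': 'string'}
    )

print(df_back.dtypes)
```

List-valued columns such as `publication_types` are written to CSV as their text representation, so they come back as plain strings and would need to be parsed again if the lists are required.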