# Extracting NCTIds from ClinicalTrials.gov:

## API v1:

To make it easier to retrieve data through the ClinicalTrials.gov API v1, we use the Python wrapper [pytrials](https://github.com/jvfe/pytrials).

Install ***pytrials***:
- In Conda Navigator, switch to the same environment as the one running Jupyter
- Launch Powershell Prompt in that environment
- Run: `pip install pytrials`

In [1]:
from pytrials.client import ClinicalTrials
import urllib.parse

ModuleNotFoundError: No module named 'pytrials'

In [None]:
ct = ClinicalTrials()

### Building the query:

We build the query that will be sent to the ClinicalTrials.gov API.

#### Sponsors:

In [None]:
sponsors = [
    "anrs",
    "inserm",
    "institut national de la santé et de la recherche médicale",
    "french national agency for research on aids and viral hepatitis",
]

In [None]:
sponsors_expr = [f"AREA[LeadSponsorName]{sponsor}" for sponsor in sponsors]

# Add OR keyword
sponsors_expr = " OR ".join(sponsors_expr)

# Add parentheses so the OR expression is interpreted correctly
sponsors_expr = f"({sponsors_expr})"

sponsors_expr

#### Status:

In [None]:
status = "completed"

In [None]:
status_expr = f"AREA[OverallStatus]{status}"
status_expr

#### Search Expression:

In [None]:
search_expr = " AND ".join([sponsors_expr, status_expr])
search_expr
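
The steps above can be gathered into a single helper. This is only a minimal sketch: the function name `build_search_expr` is hypothetical, and it assumes the `AREA[...]` operators used in the cells above.

```python
def build_search_expr(sponsors, status):
    """Build a classic-API search expression (hypothetical helper)."""
    # One AREA[...] clause per sponsor, OR-ed together
    sponsors_clause = " OR ".join(f"AREA[LeadSponsorName]{s}" for s in sponsors)
    # Restrict the search to a single overall status
    status_clause = f"AREA[OverallStatus]{status}"
    # Parentheses keep the OR group together before applying AND
    return f"({sponsors_clause}) AND {status_clause}"


expr = build_search_expr(["anrs", "inserm"], "completed")
print(expr)
```

Without the parentheses, the `AND` would presumably bind only to the last sponsor clause, which is why the OR group is wrapped.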

#### URL encode: 

In [None]:
search_expr_url_encode = urllib.parse.quote_plus(search_expr)
search_expr_url_encode

#### Fields:

The fields we want to retrieve:

In [None]:
fields = [
    "NCTId",
    "BriefTitle",
    "OverallStatus",
    "StudyType",
    "LeadSponsorName",
    "CollaboratorName",
    "OrgStudyId",
    "SecondaryId",
    "StudyFirstPostDate",
    "ReferencePMID",
    "ReferenceCitation",
    "ReferenceType",
]

### Sending the request:

In [None]:
study_fields = ct.get_study_fields(
    search_expr=search_expr_url_encode,
    fields=fields,
    max_studies=1000,
    fmt="csv",
)

In [None]:
print(f"NStudiesReturned: {len(study_fields[1:])}")

### Loading the query result into Pandas:

In [None]:
import pandas as pd

In [None]:
pd.DataFrame.from_records(study_fields[1:], index="Rank", columns=study_fields[0])
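
To illustrate what the call above does, here is a small offline sketch. The rows are made up; it assumes, as the cell above does, that the first row holds the column names and that one of them is `Rank`.

```python
import pandas as pd

# Made-up rows in the assumed shape: header row first, then one row per study
rows = [
    ["Rank", "NCTId", "OverallStatus"],
    ["1", "NCT00000001", "Completed"],
    ["2", "NCT00000002", "Completed"],
]

# rows[0] provides the column names, rows[1:] the data; "Rank" becomes the index
df = pd.DataFrame.from_records(rows[1:], columns=rows[0], index="Rank")
print(df)
```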

## API v2:

API v1 will no longer be supported as of mid-2024:

>***Notice to API users:  
>The new ClinicalTrials.gov API, version 2.0 is available. Classic API users are strongly encouraged to switch to the modernized API. We will continue to support the classic API until mid-2024 and are planning blackouts for the spring to help with the transition to the modernized API.***

In addition, API v2 supports a new field, **"HasResults"**, which is rarely used for now but could be useful in the future.

In return, CSV export is limited to a fixed set of fields, listed on this page: https://clinicaltrials.gov/data-api/about-api/csv-download

We therefore have to use the JSON export.

### Building the query:

Since `pytrials` is not compatible with v2, we send the request manually using [Requests](https://requests.readthedocs.io/en/latest/).

#### Format:

In [None]:
format = "json"

#### Sponsors:

In [None]:
sponsors

In [None]:
sponsors_expr_v2 = " OR ".join(sponsors)
sponsors_expr_v2 = urllib.parse.quote_plus(sponsors_expr_v2)
sponsors_expr_v2

#### Overall_status:

In [None]:
overall_status = "COMPLETED"

#### Fields:

In [None]:
fields_v2 = [
    'NCTId',
    'BriefTitle',
    'OverallStatus',
    'StudyType',
    'LeadSponsorName',
    'CollaboratorName',
    # 'OrgStudyId',
    # 'SecondaryId',
    'StudyFirstPostDate',
    'ReferencePMID',
    'ReferenceCitation',
    'ReferenceType',
    'hasResults'
]
fields_v2

In [None]:
fields_expr_v2 = ",".join(fields_v2)
fields_expr_v2 = urllib.parse.quote_plus(fields_expr_v2)
fields_expr_v2

#### Number of results:

In [None]:
count_total = "true"

In [None]:
page_size = 1000

#### API URL:

In [None]:
query_url = f"https://clinicaltrials.gov/api/v2/studies?format={format}&query.lead={sponsors_expr_v2}&filter.overallStatus={overall_status}&fields={fields_expr_v2}&countTotal={count_total}&pageSize={page_size}"
query_url
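
As an alternative to encoding each piece by hand, the query string can be assembled from a dictionary. This is a sketch using only the standard library; the helper name `build_v2_url` is hypothetical, and the parameter names are the ones already used in the URL above.

```python
from urllib.parse import urlencode


def build_v2_url(sponsors, overall_status, fields, page_size=1000):
    """Assemble the v2 query URL from plain Python values (hypothetical helper)."""
    params = {
        "format": "json",
        "query.lead": " OR ".join(sponsors),
        "filter.overallStatus": overall_status,
        "fields": ",".join(fields),
        "countTotal": "true",
        "pageSize": page_size,
    }
    # urlencode applies quote_plus to every value, so no manual encoding is needed
    return "https://clinicaltrials.gov/api/v2/studies?" + urlencode(params)


url = build_v2_url(["anrs", "inserm"], "COMPLETED", ["NCTId", "BriefTitle"])
print(url)
```

Keeping the raw values in a dictionary makes it harder to encode a value twice or to forget one.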

### Sending the request:

In [None]:
import requests

In [None]:
response = requests.get(query_url)
response.raise_for_status()
response

In [None]:
print(f'Studies returned: {response.json()["totalCount"]}')
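
If the total count ever exceeds the page size, a single request will not return every study. The sketch below shows one way to walk through pages. It assumes that the v2 response carries a `nextPageToken` field and that the next request accepts a `pageToken` parameter; both names should be checked against the official API documentation. The network call is replaced by a stand-in so the example runs offline.

```python
def iter_studies(get_json, params):
    """Yield every study across pages; get_json(params) returns one decoded page."""
    params = dict(params)  # do not mutate the caller's dict
    while True:
        page = get_json(params)
        yield from page.get("studies", [])
        token = page.get("nextPageToken")  # assumed field name
        if not token:
            break
        params["pageToken"] = token  # assumed parameter name


# Stand-in for the network call, keyed by the page token of the request
fake_pages = {
    None: {"studies": [{"id": 1}, {"id": 2}], "nextPageToken": "abc"},
    "abc": {"studies": [{"id": 3}]},
}
studies = list(iter_studies(lambda p: fake_pages[p.get("pageToken")], {"pageSize": 2}))
print(len(studies))
```

With Requests, `get_json` could be a small function that calls `requests.get` with the parameters and returns `response.json()`.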

### Processing the returned JSON:

In [None]:
import json

In [None]:
print(json.dumps(response.json(), indent=2))

***The JSON structure is far too nested to normalize directly with Pandas, so we flatten it by hand:***

From the JSON we build an equivalent but much "flatter" dictionary.

In [None]:
# Return None if the list of collaborators is empty; otherwise join the values
# as "collaborator_0 | collaborator_1 | ..."
def concatenate_collaborator_list(collaborator_list):
    if not collaborator_list:
        return None
    return " | ".join(collaborator_list)

In [None]:
studies_list = []
for study in response.json()["studies"]:
    study_dict = {
        'NCTId' : study["protocolSection"]["identificationModule"]["nctId"],
        'BriefTitle' : study["protocolSection"]["identificationModule"]["briefTitle"],
        'LeadSponsorName' : study["protocolSection"]["sponsorCollaboratorsModule"]["leadSponsor"]["name"],
        "CollaboratorName": concatenate_collaborator_list(
            [c["name"] for c in (
                    study["protocolSection"]["sponsorCollaboratorsModule"].get("collaborators", [])  # can be missing
                )
            ]
        ),    
        'OverallStatus' : study["protocolSection"]["statusModule"]["overallStatus"],
        'StudyType' : study["protocolSection"]["designModule"]["studyType"],    
        'HasResults' : study["hasResults"],
        'StudyFirstPostDate' : study["protocolSection"]["statusModule"]["studyFirstPostDateStruct"]["date"],
        'Reference' : study["protocolSection"].get("referencesModule", {}).get("references"), # can be missing       
    }
    studies_list.append(study_dict)
print(json.dumps(studies_list, indent=2))
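
Several modules in the response can be missing, and chained `[...]` lookups raise `KeyError` in that case. A small helper can make every lookup tolerant. This is a sketch: the name `dig` is hypothetical and the sample study is made up.

```python
def dig(mapping, *keys, default=None):
    """Follow keys through nested dicts; return default if any level is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up study with no designModule and no collaborators
study = {
    "protocolSection": {
        "identificationModule": {"nctId": "NCT00000001"},
        "sponsorCollaboratorsModule": {"leadSponsor": {"name": "Example"}},
    },
    "hasResults": False,
}
flat = {
    "NCTId": dig(study, "protocolSection", "identificationModule", "nctId"),
    "StudyType": dig(study, "protocolSection", "designModule", "studyType"),
    "Collaborators": dig(
        study, "protocolSection", "sponsorCollaboratorsModule", "collaborators", default=[]
    ),
}
print(flat)
```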

We check that no NCTId was lost along the way:

In [None]:
print(f"Number of NCTIds: {len(studies_list)}")
assert(response.json()["totalCount"] == len(studies_list))

### Import into Pandas

In [None]:
df_ct = pd.json_normalize(data = studies_list)
df_ct

#### "Exploding" the "Reference" column:

For each NCTId, the Reference column may contain a list of several references.  
If, for example, there are 3 references, we want to end up with 3 rows, each carrying the same NCTId and a single reference.  

*Before*:  
**`NCTId    Reference`**  
`NCT0001, [Ref1, Ref2, Ref3]`   

*After*:  
**`NCTId    Reference`**  
`NCT0001,   Ref1`  
`NCT0001,   Ref2`  
`NCT0001,   Ref3`   

In [None]:
df_ct = df_ct.explode('Reference', ignore_index = True)
df_ct
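
A toy example of `explode`, on made-up data, shows the effect described above, including what happens to a study with an empty list.

```python
import pandas as pd

toy = pd.DataFrame({
    "NCTId": ["NCT0001", "NCT0002"],
    "Reference": [["Ref1", "Ref2", "Ref3"], []],
})

# Each list element gets its own row; an empty list becomes a single NaN row
exploded = toy.explode("Reference", ignore_index=True)
print(exploded)
```

Because an empty list still yields one row, studies without any reference are kept in the result.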

For each NCTId, the Reference column now contains a dictionary of the form:  
`{  
"pmid": "17545707",  
"type": "BACKGROUND",  
"citation": "...",  
}`

From which we want to extract new columns using the dictionary keys:  
**`pmid     type        citation`**  
`17545707, BACKGROUND, "..."`

In [None]:
df_ct_references = pd.json_normalize(df_ct.pop('Reference'))
df_ct_references

We reassemble the complete DataFrame:

In [None]:
df_ct = df_ct.join(df_ct_references)
df_ct
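
The pop, normalize, and join sequence can be tried on a toy frame. The data below is made up, and the sketch assumes both frames share the default 0..n-1 index, which is what `explode(..., ignore_index=True)` produces.

```python
import pandas as pd

toy = pd.DataFrame({
    "NCTId": ["NCT0001", "NCT0001"],
    "Reference": [
        {"pmid": "111", "type": "BACKGROUND", "citation": "A"},
        {"pmid": "222", "type": "RESULT", "citation": "B"},
    ],
})

# pop() removes the column and returns it; json_normalize turns dict keys into columns
refs = pd.json_normalize(toy.pop("Reference").tolist())

# join() aligns on the index, so row i of refs lands next to row i of toy
toy = toy.join(refs)
print(toy)
```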

***Rebuilding the index:***

In [None]:
# df_studies_v2.set_index('NCTId', inplace = True)

***Setting the dtypes:***

In [None]:
df_ct = df_ct.convert_dtypes()
# df_ct = df_ct.astype({"OverallStatus" : 'category', "StudyType" : 'category', "type": 'category',})
df_ct = df_ct.astype({"OverallStatus" : 'category', "StudyType" : 'category'})
df_ct.info()

In [338]:
df_ct.head(10)

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,pmid,type,citation
0,NCT01895920,Viral Biofilms: Hijacking T Cell Extracellular...,"ANRS, Emerging Infectious Diseases",,COMPLETED,INTERVENTIONAL,False,2013-07-11,,,
1,NCT02486731,Hormonal Sensitivity in Patients With Noonan a...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2015-07-01,,,
2,NCT02116374,Physiopathology Study of the Microbiota Biodiv...,"ANRS, Emerging Infectious Diseases",,COMPLETED,OBSERVATIONAL,False,2014-04-16,,,
3,NCT03537196,DRug Use & Infections in ViEtnam - Hepatitis C...,"ANRS, Emerging Infectious Diseases",,COMPLETED,INTERVENTIONAL,False,2018-05-25,25086286.0,BACKGROUND,"Gower E, Estes C, Blach S, Razavi-Shearer K, R..."
4,NCT03537196,DRug Use & Infections in ViEtnam - Hepatitis C...,"ANRS, Emerging Infectious Diseases",,COMPLETED,INTERVENTIONAL,False,2018-05-25,23172780.0,BACKGROUND,"Mohd Hanafiah K, Groeger J, Flaxman AD, Wiersm..."
5,NCT03537196,DRug Use & Infections in ViEtnam - Hepatitis C...,"ANRS, Emerging Infectious Diseases",,COMPLETED,INTERVENTIONAL,False,2018-05-25,22828983.0,BACKGROUND,"Sereno L, Mesquita F, Kato M, Jacka D, Nguyen ..."
6,NCT03537196,DRug Use & Infections in ViEtnam - Hepatitis C...,"ANRS, Emerging Infectious Diseases",,COMPLETED,INTERVENTIONAL,False,2018-05-25,20041309.0,BACKGROUND,"Clatts MC, Colon-Lopez V, Giang LM, Goldsamt L..."
7,NCT03537196,DRug Use & Infections in ViEtnam - Hepatitis C...,"ANRS, Emerging Infectious Diseases",,COMPLETED,INTERVENTIONAL,False,2018-05-25,22098550.0,BACKGROUND,"Gish RG, Bui TD, Nguyen CT, Nguyen DT, Tran HV..."
8,NCT03537196,DRug Use & Infections in ViEtnam - Hepatitis C...,"ANRS, Emerging Infectious Diseases",,COMPLETED,INTERVENTIONAL,False,2018-05-25,20196807.0,BACKGROUND,"Kallman JB, Tran S, Arsalla A, Haddad D, Stepa..."
9,NCT03537196,DRug Use & Infections in ViEtnam - Hepatitis C...,"ANRS, Emerging Infectious Diseases",,COMPLETED,INTERVENTIONAL,False,2018-05-25,19839502.0,BACKGROUND,"Pham DA, Leuangwutiwong P, Jittmittraphap A, L..."


#### Export to CSV:

In [263]:
# df_studies_v2.to_csv('Data/outputs/extract_CT_api_v2.csv', sep=";", encoding='utf-8-sig')
df_ct.to_csv('Data/outputs/extract_CT_api_v2.csv', sep=";", index=False, encoding='utf-8-sig')

# PubMed

### Using a key for the PubMed API: 

Using a key to access the PubMed API is recommended; it allows up to 10 requests per second.  
Without a key, the limit is 3 requests per second.  

> E-utils users are allowed 3 requests/second without an API key. Create an API key to increase your e-utils limit to 10 requests/second.

**In practice, since the PubMed API is much slower (~1 request per second), this does not seem to make much difference.**

To get your key, go to this page while logged in:
https://account.ncbi.nlm.nih.gov/settings/

Once you have the key, add it to your environment variables with the following command:

**Windows :** 

`setx NCBI_API_KEY "123456"`

**Linux/MacOS :**

`export NCBI_API_KEY=123456`

In [264]:
import os

assert(os.getenv('NCBI_API_KEY', None) is not None)
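
The rate limits quoted above translate into a minimum delay between requests. The sketch below derives that delay from whether a key is present; the helper name `min_request_interval` is hypothetical.

```python
import os


def min_request_interval(api_key):
    """Seconds to wait between E-utils calls, based on the limits quoted above."""
    requests_per_second = 10 if api_key else 3
    return 1 / requests_per_second


api_key = os.getenv("NCBI_API_KEY")  # None when the variable is not set
print(min_request_interval(api_key))
```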

### Installing metapub:
- In Conda Navigator, switch to the same environment as the one running Jupyter  
- Launch Powershell Prompt in that environment  
- Run the following command:  `pip install metapub`

In [69]:
from metapub import PubMedFetcher
from metapub.exceptions import InvalidPMID

fetch = PubMedFetcher(cachedir='./.cache/')

#### Retrieving PMIDs via PubMed:

For each NCTId from CT, we retrieve the PMIDs of the associated publications via PubMed:

In [70]:
# Unique list of the NCTIds extracted from ClinicalTrials.gov
nctid_array = df_ct.loc[:, "NCTId"].unique()

pmids_pubmed_dict = {}
for i, nctid in enumerate(nctid_array):
    # Display the progress on a single line
    print(f"\r{i+1}/{len(nctid_array)}...", end="", flush=True)
    
    pmids = [pmid for pmid in fetch.pmids_for_query(nctid)]
    pmids_pubmed_dict[nctid] = set(pmids)

286/286...

In [71]:
pmids_pubmed_dict

{'NCT01895920': set(),
 'NCT02486731': set(),
 'NCT02116374': set(),
 'NCT03537196': {'33208326'},
 'NCT01899196': set(),
 'NCT04631224': set(),
 'NCT01579435': set(),
 'NCT03235258': set(),
 'NCT00383734': {'23432777'},
 'NCT00901524': {'24905490'},
 'NCT01493934': set(),
 'NCT03078439': {'38408861'},
 'NCT04808986': set(),
 'NCT01207986': set(),
 'NCT02626286': {'33753460', '34022820', '34352834'},
 'NCT02527096': set(),
 'NCT00121758': {'20625264', '23759749'},
 'NCT04684758': set(),
 'NCT02057796': {'32558469', '36883573'},
 'NCT03071458': set(),
 'NCT02192658': {'26846895'},
 'NCT01713335': set(),
 'NCT00158522': set(),
 'NCT00640263': {'23039034',
  '26603917',
  '27895016',
  '28469697',
  '30814028',
  '31633158',
  '31994250',
  '32067040',
  '32228542',
  '32991450',
  '34425825'},
 'NCT02428816': set(),
 'NCT00122616': {'23800784'},
 'NCT00196586': set(),
 'NCT00148863': {'23190183'},
 'NCT03150290': set(),
 'NCT01514890': {'23669289', '24704719', '25556540'},
 'NCT03652090'

**We want to merge the list of PMIDs just retrieved from PubMed with the list of PMIDs already retrieved via CT.**

We put the CT PMIDs in the same form:

In [72]:
pmids_ct_dict = {}
for nctid in nctid_array:
    pmids = df_ct[df_ct.loc[:, "NCTId"] == nctid].loc[:, "pmid"].dropna()
    pmids_ct_dict[nctid] = set(pmids)
pmids_ct_dict

{'NCT01895920': set(),
 'NCT02486731': set(),
 'NCT02116374': set(),
 'NCT03537196': {'19839502',
  '20041309',
  '20196807',
  '20572071',
  '22098550',
  '22564041',
  '22828983',
  '23172780',
  '23553643',
  '23675659',
  '23728143',
  '23884064',
  '25086286',
  '25245939',
  '25920094',
  '26298331',
  '27148964',
  '27178119',
  '27227200',
  '27349488',
  '27427455',
  '27667367',
  '33208326'},
 'NCT01899196': set(),
 'NCT04631224': set(),
 'NCT01579435': set(),
 'NCT03235258': {'17712765',
  '18614870',
  '19194272',
  '20536367',
  '21511330',
  '21971357',
  '22198788',
  '22427678',
  '24253249',
  '24705410',
  '24945880',
  '26762993'},
 'NCT00383734': {'23432777'},
 'NCT00901524': {'24905490'},
 'NCT01493934': set(),
 'NCT03078439': set(),
 'NCT04808986': set(),
 'NCT01207986': set(),
 'NCT02626286': {'33753460', '34022820', '34352834'},
 'NCT02527096': set(),
 'NCT00121758': {'20625264', '23759749'},
 'NCT04684758': set(),
 'NCT02057796': {'32558469'},
 'NCT03071458': 

For a given NCTId, we take the union of the two sets of PMIDs:

In [75]:
pmids_complete_dict = {}
for nctid in nctid_array:
    # All PMIDs present in PubMed or in CT (union)
    pmids_complete_dict[nctid] = pmids_pubmed_dict[nctid] | pmids_ct_dict[nctid]
pmids_complete_dict

{'NCT01895920': set(),
 'NCT02486731': set(),
 'NCT02116374': set(),
 'NCT03537196': {'19839502',
  '20041309',
  '20196807',
  '20572071',
  '22098550',
  '22564041',
  '22828983',
  '23172780',
  '23553643',
  '23675659',
  '23728143',
  '23884064',
  '25086286',
  '25245939',
  '25920094',
  '26298331',
  '27148964',
  '27178119',
  '27227200',
  '27349488',
  '27427455',
  '27667367',
  '33208326'},
 'NCT01899196': set(),
 'NCT04631224': set(),
 'NCT01579435': set(),
 'NCT03235258': {'17712765',
  '18614870',
  '19194272',
  '20536367',
  '21511330',
  '21971357',
  '22198788',
  '22427678',
  '24253249',
  '24705410',
  '24945880',
  '26762993'},
 'NCT00383734': {'23432777'},
 'NCT00901524': {'24905490'},
 'NCT01493934': set(),
 'NCT03078439': {'38408861'},
 'NCT04808986': set(),
 'NCT01207986': set(),
 'NCT02626286': {'33753460', '34022820', '34352834'},
 'NCT02527096': set(),
 'NCT00121758': {'20625264', '23759749'},
 'NCT04684758': set(),
 'NCT02057796': {'32558469', '36883573'

#### Checks:

In [150]:
num_pmids_ct = sum((len(v) for v in pmids_ct_dict.values()))
print(f"Total number of publications from CT: {num_pmids_ct}")

Total number of publications from CT: 521


In [166]:
num_pmids_complete = sum((len(v) for v in pmids_complete_dict.values()))
print(f"Total number of publications after querying PubMed: {num_pmids_complete}")

Total number of publications after querying PubMed: 548


In [173]:
pmids_pubmed_only_dict = {}
for nctid in nctid_array:
    # PMIDs present in PubMed only
    pmids_pubmed_only_dict[nctid] = pmids_pubmed_dict[nctid] - pmids_ct_dict[nctid]

In [185]:
num_pmids_pubmed_only = sum((len(v) for v in pmids_pubmed_only_dict.values()))
print(f"Number of new PMIDs found via PubMed: {num_pmids_pubmed_only}")

Number of new PMIDs found via PubMed: 27


In [182]:
assert (num_pmids_complete - num_pmids_ct == num_pmids_pubmed_only)
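
The assertion above holds because, for each study, the CT set and the PubMed-only set are disjoint and together make up the union, so the sizes add up (here 521 + 27 = 548). A toy check on made-up PMIDs:

```python
ct = {"A": {"1", "2"}, "B": set()}       # made-up PMIDs from CT
pubmed = {"A": {"2", "3"}, "B": {"4"}}   # made-up PMIDs from PubMed

complete = {k: pubmed[k] | ct[k] for k in ct}      # union per study
pubmed_only = {k: pubmed[k] - ct[k] for k in ct}   # difference per study


def total(d):
    """Total number of PMIDs across all studies."""
    return sum(len(v) for v in d.values())


# |P ∪ C| - |C| == |P - C| for every study, hence also for the totals
assert total(complete) - total(ct) == total(pubmed_only)
print(total(ct), total(complete), total(pubmed_only))
```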

In [186]:
print("NCTIds with new PMIDs found via PubMed:")
{k: v for k, v in pmids_pubmed_only_dict.items() if v != set()}

NCTIds with new PMIDs found via PubMed:


{'NCT03078439': {'38408861'},
 'NCT02057796': {'36883573'},
 'NCT00640263': {'34425825'},
 'NCT02212379': {'31269208'},
 'NCT01473472': {'36601747'},
 'NCT03335995': {'37497675'},
 'NCT01426243': {'26314624'},
 'NCT03005652': {'38100477'},
 'NCT03215732': {'37143029'},
 'NCT02405013': {'36686592'},
 'NCT01089387': {'26439886'},
 'NCT05311865': {'36438274', '37795682'},
 'NCT04315948': {'36695483'},
 'NCT05349162': {'36735263'},
 'NCT01453192': {'30688008'},
 'NCT02777229': {'37851566', '38156046'},
 'NCT01703962': {'37668523'},
 'NCT01801618': {'29662875'},
 'NCT04392388': {'34293141'},
 'NCT02481453': {'38273639'},
 'NCT01688453': {'35272723'},
 'NCT04288128': {'38419144', '38421662'},
 'NCT02833961': {'36318030'},
 'NCT04409405': {'38043556'}}

In [203]:
num_nctid_empty_ct = sum((1 for v in pmids_ct_dict.values() if v == set()))
print(f"Number of studies without PMIDs from CT: {num_nctid_empty_ct}")

Number of studies without PMIDs from CT: 148


In [204]:
num_nctid_empty_pubmed = sum((1 for v in pmids_complete_dict.values() if v == set()))
print(f"Number of studies without PMIDs after querying PubMed: {num_nctid_empty_pubmed}")

Number of studies without PMIDs after querying PubMed: 136


In [202]:
print("NCTIds that had no PMIDs in CT but were enriched via PubMed:")
nctids_previously_empty = {k for k, v in pmids_ct_dict.items() if v == set()} - {k for k, v in pmids_complete_dict.items() if v == set()}
{k : pmids_complete_dict[k] for k in nctids_previously_empty}

NCTIds that had no PMIDs in CT but were enriched via PubMed:


{'NCT02405013': {'36686592'},
 'NCT04392388': {'34293141'},
 'NCT01703962': {'37668523'},
 'NCT02833961': {'36318030'},
 'NCT01801618': {'29662875'},
 'NCT01453192': {'30688008'},
 'NCT02212379': {'31269208'},
 'NCT05311865': {'36438274', '37795682'},
 'NCT04409405': {'38043556'},
 'NCT05349162': {'36735263'},
 'NCT03078439': {'38408861'},
 'NCT04288128': {'38419144', '38421662'}}

In [241]:
len(nctids_previously_empty)

12

In [201]:
assert (num_nctid_empty_ct - num_nctid_empty_pubmed == len(nctids_previously_empty))

### Enriching the PMIDs via the PubMed API

Each retrieved PMID is enriched with PubMed data such as the title, the authors, ...:

In [265]:
counter = 0 # To keep track of progress
book_counter = 0 # type different from 'article', get ignored
total_publications_list = []

# For each NCTId...
for nctid, pmids in pmids_complete_dict.items():
    pmids_list = []
    
    # We process each PMID...
    for pmid in pmids:
        
        # Display the progress on a single line
        print(f'\r{counter+1} / {num_pmids_complete}...', end='', flush=True)

        try:
            # Fetch article details from Pubmed
            article = fetch.article_by_pmid(pmid)

            # We are not interested in articles with type 'book'
            # TODO: book special case ?
            if article.pubmed_type == 'article':
                pmids_list.append(
                    {
                        'pmid': pmid,
                        'title': article.title,
                        'authors': article.authors_str.strip(),
                        'doi': article.doi,
                        'year': article.year,
                        'publication_types': list(
                            article.publication_types.values()
                        ),
                        'citation': article.citation,
                    }
                )
            else:
                book_counter += 1
                book = (nctid, pmid)
                
            counter += 1
        except InvalidPMID as e:
            print(f'\n{e}')

    publication_dict = {'NCTId': nctid, 'publications': pmids_list}

    total_publications_list.append(publication_dict)

548 / 548...
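
The loop above remembers only the last ignored record in `book`, and `book` stays undefined if nothing is ignored. One possible variant keeps every ignored `(NCTId, PMID)` pair in a list. This is a sketch only: `split_by_type` is a hypothetical name, and the PubMed call is replaced by a stand-in exposing just the `pubmed_type` and `title` attributes used above, so it runs offline.

```python
from types import SimpleNamespace


def split_by_type(pmids_by_nctid, fetch_article):
    """Return (kept, skipped): articles per NCTId, and every ignored (nctid, pmid)."""
    kept, skipped = {}, []
    for nctid, pmids in pmids_by_nctid.items():
        kept[nctid] = []
        for pmid in sorted(pmids):  # sorted for a reproducible order
            record = fetch_article(pmid)
            if record.pubmed_type == "article":
                kept[nctid].append({"pmid": pmid, "title": record.title})
            else:
                skipped.append((nctid, pmid))
    return kept, skipped


# Stand-in for fetch.article_by_pmid, with made-up records
fake_db = {
    "1": SimpleNamespace(pubmed_type="article", title="T1"),
    "2": SimpleNamespace(pubmed_type="book", title="T2"),
}
kept, skipped = split_by_type({"NCT0001": {"1", "2"}}, fake_db.__getitem__)
print(skipped)
```

The number of ignored records is then simply `len(skipped)`.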

In [307]:
print(f'Number of publications with a type other than "article": {book_counter}')
print(book)

Number of publications with a type other than "article": 1
('NCT03537196', '27227200')


In [278]:
print(json.dumps(total_publications_list, indent=2))

[
  {
    "NCTId": "NCT01895920",
    "publications": []
  },
  {
    "NCTId": "NCT02486731",
    "publications": []
  },
  {
    "NCTId": "NCT02116374",
    "publications": []
  },
  {
    "NCTId": "NCT03537196",
    "publications": [
      {
        "pmid": "26298331",
        "title": "Hepatitis C virus (HCV) disease progression in people who inject drugs (PWID): A systematic review and meta-analysis.",
        "authors": "Smith DJ; Combellick J; Jordan AE; Hagan H",
        "doi": "10.1016/j.drugpo.2015.07.004",
        "year": "2015",
        "publication_types": [
          "Journal Article",
          "Meta-Analysis",
          "Research Support, N.I.H., Extramural",
          "Review",
          "Systematic Review"
        ],
        "citation": "Smith DJ, et al. Hepatitis C virus (HCV) disease progression in people who inject drugs (PWID): A systematic review and meta-analysis. Hepatitis C virus (HCV) disease progression in people who inject drugs (PWID): A systematic review a

In [339]:
# The number of NCTId didn't change
assert(len(total_publications_list) == len(studies_list))

### Import into Pandas

In [340]:
df_pubmed = pd.DataFrame.from_records(total_publications_list)
df_pubmed

Unnamed: 0,NCTId,publications
0,NCT01895920,[]
1,NCT02486731,[]
2,NCT02116374,[]
3,NCT03537196,"[{'pmid': '26298331', 'title': 'Hepatitis C vi..."
4,NCT01899196,[]
...,...,...
281,NCT00158483,"[{'pmid': '19530940', 'title': 'Impact of acyc..."
282,NCT00301561,"[{'pmid': '21831714', 'title': 'Monitoring of ..."
283,NCT01075061,[]
284,NCT00158405,"[{'pmid': '16152755', 'title': 'Haematological..."


#### "Exploding" the "publications" column:

For each NCTId, the 'publications' column may contain a list of several publications.  
If, for example, there are 3 publications, we want to end up with 3 rows, each carrying the same NCTId and a single publication.  

*Before*:  
**`NCTId    publications`**  
`NCT0001, [Pub1, Pub2, Pub3]`   

*After*:  
**`NCTId    publications`**  
`NCT0001,   Pub1`  
`NCT0001,   Pub2`  
`NCT0001,   Pub3`   

In [341]:
df_pubmed = df_pubmed.explode('publications', ignore_index = True)
df_pubmed

Unnamed: 0,NCTId,publications
0,NCT01895920,
1,NCT02486731,
2,NCT02116374,
3,NCT03537196,"{'pmid': '26298331', 'title': 'Hepatitis C vir..."
4,NCT03537196,"{'pmid': '23172780', 'title': 'Global epidemio..."
...,...,...
678,NCT01075061,
679,NCT00158405,"{'pmid': '16152755', 'title': 'Haematological ..."
680,NCT00158405,"{'pmid': '16782488', 'title': 'CD4-guided stru..."
681,NCT00158405,"{'pmid': '18986246', 'title': 'Two-months-off,..."


We check that PubMed + CT yields at least as many rows as CT alone:

In [342]:
assert(len(df_pubmed) >= len(df_ct))

For each NCTId, the 'publications' column now contains a dictionary of the form:  
`{  
"pmid": "17545707",  
"title": "Haematological ...",  
"authors": "Smith DJ; ...",  
}`

From which we want to extract new columns using the dictionary keys:  
**`pmid      title                 authors`**  
`17545707, "Haematological ...", "Smith DJ; ..."`

In [343]:
df_pubmed_publications = pd.json_normalize(df_pubmed.pop('publications'))
df_pubmed_publications

Unnamed: 0,pmid,title,authors,doi,year,publication_types,citation
0,,,,,,,
1,,,,,,,
2,,,,,,,
3,26298331,Hepatitis C virus (HCV) disease progression in...,Smith DJ; Combellick J; Jordan AE; Hagan H,10.1016/j.drugpo.2015.07.004,2015,"[Journal Article, Meta-Analysis, Research Supp...","Smith DJ, et al. Hepatitis C virus (HCV) disea..."
4,23172780,Global epidemiology of hepatitis C virus infec...,Mohd Hanafiah K; Groeger J; Flaxman AD; Wiersm...,10.1002/hep.26141,2013,"[Journal Article, Meta-Analysis, Review, Syste...","Mohd Hanafiah K, et al. Global epidemiology of..."
...,...,...,...,...,...,...,...
678,,,,,,,
679,16152755,Haematological changes in adults receiving a z...,Moh R; Danel C; Sorho S; Sauvageot D; Anzian A...,10.1177/135965350501000510,2005,"[Clinical Trial, Journal Article, Randomized C...","Moh R, et al. Haematological changes in adults..."
680,16782488,CD4-guided structured antiretroviral treatment...,Danel C; Moh R; Minga A; Anzian A; Ba-Gomis O;...,10.1016/S0140-6736(06)68887-9,2006,"[Journal Article, Multicenter Study, Randomize...","Danel C, et al. CD4-guided structured antiretr..."
681,18986246,"Two-months-off, four-months-on antiretroviral ...",Danel C; Moh R; Chaix ML; Gabillard D; Gnokoro...,10.1086/595298,2009,"[Journal Article, Multicenter Study, Randomize...","Danel C, et al. Two-months-off, four-months-on..."


We reassemble the complete DataFrame:

In [344]:
df_pubmed = df_pubmed.join(df_pubmed_publications)
df_pubmed

Unnamed: 0,NCTId,pmid,title,authors,doi,year,publication_types,citation
0,NCT01895920,,,,,,,
1,NCT02486731,,,,,,,
2,NCT02116374,,,,,,,
3,NCT03537196,26298331,Hepatitis C virus (HCV) disease progression in...,Smith DJ; Combellick J; Jordan AE; Hagan H,10.1016/j.drugpo.2015.07.004,2015,"[Journal Article, Meta-Analysis, Research Supp...","Smith DJ, et al. Hepatitis C virus (HCV) disea..."
4,NCT03537196,23172780,Global epidemiology of hepatitis C virus infec...,Mohd Hanafiah K; Groeger J; Flaxman AD; Wiersm...,10.1002/hep.26141,2013,"[Journal Article, Meta-Analysis, Review, Syste...","Mohd Hanafiah K, et al. Global epidemiology of..."
...,...,...,...,...,...,...,...,...
678,NCT01075061,,,,,,,
679,NCT00158405,16152755,Haematological changes in adults receiving a z...,Moh R; Danel C; Sorho S; Sauvageot D; Anzian A...,10.1177/135965350501000510,2005,"[Clinical Trial, Journal Article, Randomized C...","Moh R, et al. Haematological changes in adults..."
680,NCT00158405,16782488,CD4-guided structured antiretroviral treatment...,Danel C; Moh R; Minga A; Anzian A; Ba-Gomis O;...,10.1016/S0140-6736(06)68887-9,2006,"[Journal Article, Multicenter Study, Randomize...","Danel C, et al. CD4-guided structured antiretr..."
681,NCT00158405,18986246,"Two-months-off, four-months-on antiretroviral ...",Danel C; Moh R; Chaix ML; Gabillard D; Gnokoro...,10.1086/595298,2009,"[Journal Article, Multicenter Study, Randomize...","Danel C, et al. Two-months-off, four-months-on..."


### Joining the CT and PubMed DataFrames:

In [370]:
df_final = df_ct.merge(df_pubmed, on = ['NCTId', 'pmid'], how='right')
df_final

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,pmid,type,citation_x,title,authors,doi,year,publication_types,citation_y
0,NCT01895920,Viral Biofilms: Hijacking T Cell Extracellular...,"ANRS, Emerging Infectious Diseases",,COMPLETED,INTERVENTIONAL,False,2013-07-11,,,,,,,,,
1,NCT02486731,Hormonal Sensitivity in Patients With Noonan a...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2015-07-01,,,,,,,,,
2,NCT02116374,Physiopathology Study of the Microbiota Biodiv...,"ANRS, Emerging Infectious Diseases",,COMPLETED,OBSERVATIONAL,False,2014-04-16,,,,,,,,,
3,NCT03537196,DRug Use & Infections in ViEtnam - Hepatitis C...,"ANRS, Emerging Infectious Diseases",,COMPLETED,INTERVENTIONAL,False,2018-05-25,26298331,BACKGROUND,"Smith DJ, Combellick J, Jordan AE, Hagan H. He...",Hepatitis C virus (HCV) disease progression in...,Smith DJ; Combellick J; Jordan AE; Hagan H,10.1016/j.drugpo.2015.07.004,2015,"[Journal Article, Meta-Analysis, Research Supp...","Smith DJ, et al. Hepatitis C virus (HCV) disea..."
4,NCT03537196,DRug Use & Infections in ViEtnam - Hepatitis C...,"ANRS, Emerging Infectious Diseases",,COMPLETED,INTERVENTIONAL,False,2018-05-25,23172780,BACKGROUND,"Mohd Hanafiah K, Groeger J, Flaxman AD, Wiersm...",Global epidemiology of hepatitis C virus infec...,Mohd Hanafiah K; Groeger J; Flaxman AD; Wiersm...,10.1002/hep.26141,2013,"[Journal Article, Meta-Analysis, Review, Syste...","Mohd Hanafiah K, et al. Global epidemiology of..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
678,NCT01075061,Study of a Large Family With Congenital Mirror...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2010-02-24,,,,,,,,,
679,NCT00158405,Randomised Trial of Structured Treatment Inter...,French National Agency for Research on AIDS an...,Bristol-Myers Squibb,COMPLETED,INTERVENTIONAL,False,2005-09-12,16152755,BACKGROUND,"Moh R, Danel C, Sorho S, Sauvageot D, Anzian A...",Haematological changes in adults receiving a z...,Moh R; Danel C; Sorho S; Sauvageot D; Anzian A...,10.1177/135965350501000510,2005,"[Clinical Trial, Journal Article, Randomized C...","Moh R, et al. Haematological changes in adults..."
680,NCT00158405,Randomised Trial of Structured Treatment Inter...,French National Agency for Research on AIDS an...,Bristol-Myers Squibb,COMPLETED,INTERVENTIONAL,False,2005-09-12,16782488,RESULT,"Danel C, Moh R, Minga A, Anzian A, Ba-Gomis O,...",CD4-guided structured antiretroviral treatment...,Danel C; Moh R; Minga A; Anzian A; Ba-Gomis O;...,10.1016/S0140-6736(06)68887-9,2006,"[Journal Article, Multicenter Study, Randomize...","Danel C, et al. CD4-guided structured antiretr..."
681,NCT00158405,Randomised Trial of Structured Treatment Inter...,French National Agency for Research on AIDS an...,Bristol-Myers Squibb,COMPLETED,INTERVENTIONAL,False,2005-09-12,18986246,RESULT,"Danel C, Moh R, Chaix ML, Gabillard D, Gnokoro...","Two-months-off, four-months-on antiretroviral ...",Danel C; Moh R; Chaix ML; Gabillard D; Gnokoro...,10.1086/595298,2009,"[Journal Article, Multicenter Study, Randomize...","Danel C, et al. Two-months-off, four-months-on..."


Dropping the 'citation' columns:

In [371]:
df_final.drop(['citation_x', 'citation_y'], axis = 1, inplace = True)

In [487]:
df_final

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,pmid,type,title,authors,doi,year,publication_types
0,NCT01895920,Viral Biofilms: Hijacking T Cell Extracellular...,"ANRS, Emerging Infectious Diseases",,COMPLETED,INTERVENTIONAL,False,2013-07-11,,,,,,,
1,NCT02486731,Hormonal Sensitivity in Patients With Noonan a...,Institut National de la Santé Et de la Recherc...,,COMPLETED,OBSERVATIONAL,False,2015-07-01,,,,,,,
2,NCT02116374,Physiopathology Study of the Microbiota Biodiv...,"ANRS, Emerging Infectious Diseases",,COMPLETED,OBSERVATIONAL,False,2014-04-16,,,,,,,
3,NCT03537196,DRug Use & Infections in ViEtnam - Hepatitis C...,"ANRS, Emerging Infectious Diseases",,COMPLETED,INTERVENTIONAL,False,2018-05-25,26298331,BACKGROUND,Hepatitis C virus (HCV) disease progression in...,Smith DJ; Combellick J; Jordan AE; Hagan H,10.1016/j.drugpo.2015.07.004,2015,"[Journal Article, Meta-Analysis, Research Supp..."
4,NCT03537196,DRug Use & Infections in ViEtnam - Hepatitis C...,"ANRS, Emerging Infectious Diseases",,COMPLETED,INTERVENTIONAL,False,2018-05-25,23172780,BACKGROUND,Global epidemiology of hepatitis C virus infec...,Mohd Hanafiah K; Groeger J; Flaxman AD; Wiersm...,10.1002/hep.26141,2013,"[Journal Article, Meta-Analysis, Review, Syste..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
678,NCT01075061,Study of a Large Family With Congenital Mirror...,Institut National de la Santé Et de la Recherc...,,COMPLETED,INTERVENTIONAL,False,2010-02-24,,,,,,,
679,NCT00158405,Randomised Trial of Structured Treatment Inter...,French National Agency for Research on AIDS an...,Bristol-Myers Squibb,COMPLETED,INTERVENTIONAL,False,2005-09-12,16152755,BACKGROUND,Haematological changes in adults receiving a z...,Moh R; Danel C; Sorho S; Sauvageot D; Anzian A...,10.1177/135965350501000510,2005,"[Clinical Trial, Journal Article, Randomized C..."
680,NCT00158405,Randomised Trial of Structured Treatment Inter...,French National Agency for Research on AIDS an...,Bristol-Myers Squibb,COMPLETED,INTERVENTIONAL,False,2005-09-12,16782488,RESULT,CD4-guided structured antiretroviral treatment...,Danel C; Moh R; Minga A; Anzian A; Ba-Gomis O;...,10.1016/S0140-6736(06)68887-9,2006,"[Journal Article, Multicenter Study, Randomize..."
681,NCT00158405,Randomised Trial of Structured Treatment Inter...,French National Agency for Research on AIDS an...,Bristol-Myers Squibb,COMPLETED,INTERVENTIONAL,False,2005-09-12,18986246,RESULT,"Two-months-off, four-months-on antiretroviral ...",Danel C; Moh R; Chaix ML; Gabillard D; Gnokoro...,10.1086/595298,2009,"[Journal Article, Multicenter Study, Randomize..."


The new PMIDs found via PubMed have none of the associated ClinicalTrials fields filled in (BriefTitle, LeadSponsorName, etc.):


In [382]:
df_final[df_final.loc[:, 'BriefTitle'].isna()]

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,pmid,type,title,authors,doi,year,publication_types
43,NCT03078439,,,,,,,,38408861,PUBMED,Birth weight and head circumference discordanc...,Guellec I; Brunet A; Lapillonne A; Taine M; To...,10.1136/archdischild-2023-326336,2024,[Journal Article]
54,NCT02057796,,,,,,,,36883573,PUBMED,High performance of systematic combined urine ...,Bonnet M; Gabillard D; Domoua S; Muzoora C; Me...,10.1093/cid/ciad125,2023,[Journal Article]
70,NCT00640263,,,,,,,,34425825,PUBMED,The prevalence and socio-behavioural and clini...,Birungi N; Fadnes LT; Engebretsen IMS; Tumwine...,10.1186/s12955-021-01844-3,2021,[Journal Article]
88,NCT02212379,,,,,,,,31269208,PUBMED,Dual therapy combining raltegravir with etravi...,Katlama C; Assoumou L; Valantin MA; Soulié C; ...,10.1093/jac/dkz224,2019,"[Journal Article, Multicenter Study, Research ..."
124,NCT01473472,,,,,,,,36601747,PUBMED,Hepatitis A and B vaccine uptake and immunisat...,Le Turnier P; Charreau I; Gabassi A; Carette D...,10.1136/sextrans-2022-055634,2023,"[Clinical Trial, Research Support, Non-U.S. Go..."
143,NCT03335995,,,,,,,,37497675,PUBMED,One-Year Outcomes in Patients With Acute Strok...,Sonneville R; Mazighi M; Collet M; Gayat E; De...,10.1161/STROKEAHA.123.042910,2023,"[Multicenter Study, Journal Article, Research ..."
161,NCT01426243,,,,,,,,26314624,PUBMED,Molecular characterization of the 17D-204 yell...,Salmona M; Gazaignes S; Mercier-Delarue S; Gar...,10.1016/j.vaccine.2015.08.055,2015,"[Journal Article, Research Support, Non-U.S. G..."
277,NCT03005652,,,,,,,,38100477,PUBMED,Effects of a mindfulness-based intervention an...,Schlosser M; Demnitz-King H; Barnhofer T; Coll...,10.1371/journal.pone.0295175,2023,"[Randomized Controlled Trial, Journal Article]"
283,NCT03215732,,,,,,,,37143029,PUBMED,Hepatitis B prevention and treatment needs in ...,Djaogol T; Périères L; Marcellin F; Diouf A; C...,10.1186/s12889-023-15710-y,2023,"[Journal Article, Research Support, Non-U.S. G..."
284,NCT02405013,,,,,,,,36686592,PUBMED,Patient-reported outcomes with direct-acting a...,Marcellin F; Mourad A; Lemoine M; Kouanfack C;...,10.1016/j.jhepr.2022.100665,2023,[Journal Article]


We fill these columns with the information contained in the CT DataFrame:

In [413]:
# Boolean mask of the rows whose ClinicalTrials fields are empty and need filling
change_index = df_final.loc[:, 'BriefTitle'].isna()

# Columns we wish to copy
columns_to_copy = [
    'BriefTitle',
    'LeadSponsorName',
    'CollaboratorName',
    'OverallStatus',
    'StudyType',
    'HasResults',
    'StudyFirstPostDate',
]

# We copy the data from the CT dataframe to the final dataframe
df_final.loc[change_index, columns_to_copy] = df_ct.loc[change_index, columns_to_copy]
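
The assignment above relies on `df_ct` and `df_final` sharing the same row labels. As a minimal sketch, assuming that alignment is not guaranteed, the same fill can be keyed on `NCTId` instead. The two small frames and their values are made up for illustration, and `ct_lookup` is a hypothetical helper name that does not appear in the notebook.

```python
import pandas as pd

# Toy stand-ins for df_ct and df_final (illustrative values only)
df_ct = pd.DataFrame({
    "NCTId": ["NCT00000001", "NCT00000002"],
    "BriefTitle": ["Study A", "Study B"],
    "OverallStatus": ["COMPLETED", "COMPLETED"],
})
df_final = pd.DataFrame({
    "NCTId": ["NCT00000001", "NCT00000002", "NCT00000002"],
    "BriefTitle": ["Study A", "Study B", None],
    "OverallStatus": ["COMPLETED", "COMPLETED", None],
    "pmid": [None, "111", "222"],
})

columns_to_copy = ["BriefTitle", "OverallStatus"]

# Rows added from PubMed have no ClinicalTrials metadata yet
change_index = df_final["BriefTitle"].isna()

# Keep one row per NCTId and index by NCTId so each lookup is unambiguous
ct_lookup = df_ct.drop_duplicates("NCTId").set_index("NCTId")

for column in columns_to_copy:
    # Map the NCTId of each incomplete row to the value stored in df_ct
    df_final.loc[change_index, column] = (
        df_final.loc[change_index, "NCTId"].map(ct_lookup[column])
    )

print(df_final)
```

Keying on `NCTId` keeps the fill correct even if one of the frames has been sorted, filtered, or re-indexed in between.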

We add a 'PUBMED' type for the PMIDs that come from PubMed only:

In [417]:
# We add a 'PUBMED' type to the PMIDs extracted from Pubmed exclusively
df_final.loc[change_index, 'type'] = 'PUBMED'

### Final result:

In [438]:
df_final.loc[df_final['HasResults'] == True, 'NCTId'].unique()

<StringArray>
['NCT01605890', 'NCT00928187', 'NCT02573948', 'NCT02453048', 'NCT01882062']
Length: 5, dtype: string

In [444]:
df_final[df_final.loc[:, 'NCTId'].isin(['NCT01605890', 'NCT00928187', 'NCT02573948', 'NCT02453048', 'NCT01882062'])]

Unnamed: 0,NCTId,BriefTitle,LeadSponsorName,CollaboratorName,OverallStatus,StudyType,HasResults,StudyFirstPostDate,pmid,type,title,authors,doi,year,publication_types
90,NCT01605890,Trial Evaluating a First Line Combination Ther...,"ANRS, Emerging Infectious Diseases",Gilead Sciences | Merck Sharp & Dohme LLC,COMPLETED,INTERVENTIONAL,True,2012-05-25,29590335,DERIVED,First-line Raltegravir/Emtricitabine/Tenofovir...,Matheron S; Descamps D; Gallien S; Besseghir A...,10.1093/cid/ciy245,2018,"[Clinical Trial, Phase II, Journal Article, Mu..."
300,NCT00928187,Evaluation of Three Strategies of Second-line ...,"ANRS, Emerging Infectious Diseases",Gilead Sciences | Janssen Pharmaceutica,COMPLETED,INTERVENTIONAL,True,2009-06-25,31273686,DERIVED,Cost-Effectiveness of Three Alternative Booste...,Boyer S; Nishimwe ML; Sagaon-Teyssier L; March...,10.1007/s41669-019-0157-9,2020,[Journal Article]
404,NCT02573948,Feasibility of Interventions on People Who Inj...,"ANRS, Emerging Infectious Diseases",National Institute on Drug Abuse (NIDA),COMPLETED,OBSERVATIONAL,True,2015-10-12,27006257,RESULT,Prospects for ending the HIV epidemic among pe...,Des Jarlais DC; Thi Huong D; Thi Hai Oanh K; K...,10.1016/j.drugpo.2016.02.021,2016,"[Journal Article, Research Support, Non-U.S. G..."
405,NCT02573948,Feasibility of Interventions on People Who Inj...,"ANRS, Emerging Infectious Diseases",National Institute on Drug Abuse (NIDA),COMPLETED,OBSERVATIONAL,True,2015-10-12,25540950,BACKGROUND,"AIDS, people who use drugs, and altruism: refl...",Des Jarlais DC,10.3109/10826084.2015.978185,2015,[Journal Article]
406,NCT02573948,Feasibility of Interventions on People Who Inj...,"ANRS, Emerging Infectious Diseases",National Institute on Drug Abuse (NIDA),COMPLETED,OBSERVATIONAL,True,2015-10-12,28800503,RESULT,"Intravenous heroin use in Haiphong, Vietnam: N...",Michel L; Des Jarlais DC; Duong Thi H; Khuat T...,10.1016/j.drugalcdep.2017.07.004,2017,[Journal Article]
407,NCT02573948,Feasibility of Interventions on People Who Inj...,"ANRS, Emerging Infectious Diseases",National Institute on Drug Abuse (NIDA),COMPLETED,OBSERVATIONAL,True,2015-10-12,26050614,BACKGROUND,An international perspective on using opioid s...,Perlman DC; Jordan AE; Uuskula A; Huong DT; Ma...,10.1016/j.drugpo.2015.04.015,2015,"[Journal Article, Research Support, N.I.H., Ex..."
408,NCT02573948,Feasibility of Interventions on People Who Inj...,"ANRS, Emerging Infectious Diseases",National Institute on Drug Abuse (NIDA),COMPLETED,OBSERVATIONAL,True,2015-10-12,26075647,BACKGROUND,Can HIV and Hepatitis C Virus Infection be Eli...,Perlman DC; Des Jarlais DC; Feelemyer J,10.1080/10550887.2015.1059111,2015,"[Journal Article, Research Support, N.I.H., Ex..."
409,NCT02573948,Feasibility of Interventions on People Who Inj...,"ANRS, Emerging Infectious Diseases",National Institute on Drug Abuse (NIDA),COMPLETED,OBSERVATIONAL,True,2015-10-12,28612212,RESULT,Risk Behaviors for HIV and HCV Infection Among...,Duong HT; Jarlais DD; Khuat OHT; Arasteh K; Fe...,10.1007/s10461-017-1814-6,2018,[Journal Article]
410,NCT02573948,Feasibility of Interventions on People Who Inj...,"ANRS, Emerging Infectious Diseases",National Institute on Drug Abuse (NIDA),COMPLETED,OBSERVATIONAL,True,2015-10-12,26032121,BACKGROUND,Design and baseline findings of a large-scale ...,Hatzakis A; Sypsa V; Paraskevis D; Nikolopoulo...,10.1111/add.12999,2015,"[Journal Article, Research Support, N.I.H., Ex..."
411,NCT02573948,Feasibility of Interventions on People Who Inj...,"ANRS, Emerging Infectious Diseases",National Institute on Drug Abuse (NIDA),COMPLETED,OBSERVATIONAL,True,2015-10-12,27178119,RESULT,Integrated respondent-driven sampling and peer...,Des Jarlais D; Duong HT; Pham Minh K; Khuat OH...,10.1080/09540121.2016.1178698,2016,[Journal Article]
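
The identifiers in the second cell above are copied by hand from the output of the first one. A possible sketch, on made-up data, reuses the computed array directly so the two cells cannot drift apart. The names `has_results`, `nct_with_results` and `subset` are assumptions introduced for the example.

```python
import pandas as pd

# Toy stand-in for df_final (illustrative values only)
df_final = pd.DataFrame({
    "NCTId": ["NCT00000001", "NCT00000001", "NCT00000002", "NCT00000003"],
    "HasResults": pd.array([True, True, False, pd.NA], dtype="boolean"),
    "pmid": ["111", "222", "333", None],
})

# Treat a missing HasResults as "no results posted" so the mask has no NA
has_results = df_final["HasResults"].fillna(False)

# Distinct studies flagged as having results
nct_with_results = df_final.loc[has_results, "NCTId"].unique()

# All rows (one per linked publication) belonging to those studies
subset = df_final[df_final["NCTId"].isin(nct_with_results)]
print(subset)
```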


In [424]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 683 entries, 0 to 682
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   NCTId               683 non-null    object  
 1   BriefTitle          683 non-null    string  
 2   LeadSponsorName     683 non-null    string  
 3   CollaboratorName    357 non-null    string  
 4   OverallStatus       683 non-null    category
 5   StudyType           683 non-null    category
 6   HasResults          683 non-null    boolean 
 7   StudyFirstPostDate  683 non-null    string  
 8   pmid                547 non-null    object  
 9   type                547 non-null    string  
 10  title               547 non-null    object  
 11  authors             547 non-null    object  
 12  doi                 542 non-null    object  
 13  year                547 non-null    object  
 14  publication_types   547 non-null    object  
dtypes: boolean(1), category(2), object(7), s

In [425]:
df_final = df_final.convert_dtypes()
# df_final = df_final.astype({"OverallStatus" : 'category', "StudyType" : 'category'})
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 683 entries, 0 to 682
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   NCTId               683 non-null    string  
 1   BriefTitle          683 non-null    string  
 2   LeadSponsorName     683 non-null    string  
 3   CollaboratorName    357 non-null    string  
 4   OverallStatus       683 non-null    category
 5   StudyType           683 non-null    category
 6   HasResults          683 non-null    boolean 
 7   StudyFirstPostDate  683 non-null    string  
 8   pmid                547 non-null    string  
 9   type                547 non-null    string  
 10  title               547 non-null    string  
 11  authors             547 non-null    string  
 12  doi                 542 non-null    string  
 13  year                547 non-null    string  
 14  publication_types   547 non-null    object  
dtypes: boolean(1), category(2), object(1), s
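
In the output above, `publication_types` stays `object` after `convert_dtypes()`, presumably because each cell holds a Python list rather than a scalar, while `year` becomes a string. A minimal sketch on made-up data reproduces this behaviour and shows one possible way to turn `year` into a nullable integer. The conversion of `year` is an assumption added for illustration, not a step of the notebook.

```python
import pandas as pd

# Toy frame mixing strings, missing values, lists and a categorical column
df = pd.DataFrame({
    "NCTId": ["NCT00000001", "NCT00000002"],
    "year": ["2015", None],
    "publication_types": [["Journal Article"], None],
})
df["OverallStatus"] = pd.Series(["COMPLETED", "COMPLETED"], dtype="category")

converted = df.convert_dtypes()
print(converted.dtypes)

# Optional extra step: parse the year strings into a nullable integer column
year = pd.to_numeric(converted["year"], errors="coerce").astype("Int64")
print(year)
```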

### CSV export:

In [488]:
df_final.to_csv('Data/outputs/extract_df_final.csv', sep=";", index=False, encoding='utf-8-sig')
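
Since `publication_types` holds lists, `to_csv` writes each of them as its Python representation (for example `['Journal Article']`), which comes back as a plain string on reload. The sketch below, on made-up data and with an in-memory buffer instead of the real output path, joins each list with `" | "` before exporting. That separator is an assumption, chosen to mirror the one already visible in `CollaboratorName`.

```python
import io

import pandas as pd

# Toy stand-in for df_final (illustrative values only)
df = pd.DataFrame({
    "NCTId": ["NCT00000001", "NCT00000002"],
    "publication_types": [["Journal Article", "Review"], None],
})

export = df.copy()
# Join each list into one delimited string; leave missing values untouched
export["publication_types"] = export["publication_types"].map(
    lambda types: " | ".join(types) if isinstance(types, list) else types
)

# Write to an in-memory buffer with the same separator as the notebook
buffer = io.StringIO()
export.to_csv(buffer, sep=";", index=False)

# Read the data back to check that it survives the round trip
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep=";", dtype="string")
print(reloaded)
```

On reload, `reloaded["publication_types"].str.split(" | ", regex=False)` would recover the lists if they are needed again.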