# PubMed Article Scraping 

PubMed is a database of medical research articles collected by The US National Library of Medicine in collaboration with the National Institutes of Health. The database is built upon the Entrez system of information retrieval and holds titles, authors, and absracts of millions of research articles. PubMed also has an open-access domain named 'PubMed Central', where articles with licenses which adhere to open-science principles are stored. The PDFs of such articles are freely available and can be scraped using the Entrez API. 

Due to the complex nature of research publishing and licenses, the API is not overly effective and often needs complimented by the manual scraping of articles. This became the case for 40% of the articles in this project as it was not absracts that were being used but instead the discussions and conclusions.  

### Install dependencies

In [1]:
#pubmeds API package 
!pip install biopython



In [2]:
#Dependencies 
from Bio import Entrez
import pandas as pd

### Load in the large cleaned datafile 

In [3]:
#Load data 
df = pd.read_csv("Full_cohort_meta.csv")

In [4]:
df[:5]

Unnamed: 0.1,Unnamed: 0,Authors,Author Full Names,Article Title,Source Title,Language,Document Type,Author Keywords,Keywords Plus,Abstract,...,WoS Categories,Research Areas,IDS Number,UT (Unique WOS ID),Pubmed Id,Open Access Designations,Highly Cited Status,Hot Paper Status,Date of Export,Cohort
0,1,"Griffiths, RR; Johnson, MW; Carducci, MA; Umbr...","Griffiths, Roland R.; Johnson, Matthew W.; Car...",Psilocybin produces substantial and sustained ...,JOURNAL OF PSYCHOPHARMACOLOGY,English,Article,Psilocybin; hallucinogen; cancer; anxiety; dep...,QUALITY-OF-LIFE; MYSTICAL EXPERIENCE QUESTIONN...,"Cancer patients often develop chronic, clinica...",...,Clinical Neurology; Neurosciences; Pharmacolog...,Neurosciences & Neurology; Pharmacology & Phar...,EE9AC,WOS:000389917000003,27909165.0,"hybrid, Green Published",Y,N,21/12/21,TNC
1,2,"Grob, CS; Danforth, AL; Chopra, GS; Hagerty, M...","Grob, Charles S.; Danforth, Alicia L.; Chopra,...",Pilot Study of Psilocybin Treatment for Anxiet...,ARCHIVES OF GENERAL PSYCHIATRY,English,Article,,PSYCHOTHERAPY,Context: Researchers conducted extensive inves...,...,Psychiatry,Psychiatry,702WY,WOS:000285927800014,20819978.0,Bronze,Y,N,21/12/21,TNC
2,3,"Nichols, DE","Nichols, David E.",Psychedelics,PHARMACOLOGICAL REVIEWS,English,Review,,LYSERGIC-ACID DIETHYLAMIDE; SEROTONIN 5-HT2A R...,Psychedelics (serotonergic hallucinogens) are ...,...,Pharmacology & Pharmacy,Pharmacology & Pharmacy,DI8WO,WOS:000373783300002,26841800.0,"Bronze, Green Published",Y,N,21/12/21,TNC
3,4,"Carhart-Harris, RL; Bolstridge, M; Rucker, J; ...","Carhart-Harris, Robin L.; Bolstridge, Mark; Ru...",Psilocybin with psychological support for trea...,LANCET PSYCHIATRY,English,Article,,LYSERGIC-ACID DIETHYLAMIDE; 5-HT2A RECEPTOR; LSD,Background Psilocybin is a serotonin receptor ...,...,Psychiatry,Psychiatry,DU5RP,WOS:000382269300024,27210031.0,"Green Published, Green Submitted, hybrid",Y,N,21/12/21,TNC
4,5,"Carhart-Harris, RL; Erritzoe, D; Williams, T; ...","Carhart-Harris, Robin L.; Erritzoe, David; Wil...",Neural correlates of the psychedelic state as ...,PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCE...,English,Article,default mode network; hallucinogens; serotonin...,MEDIAL PREFRONTAL CORTEX; MYSTICAL-TYPE EXPERI...,Psychedelic drugs have a long history of use i...,...,Multidisciplinary Sciences,Science & Technology - Other Topics,887KC,WOS:000299925000068,22308440.0,"Bronze, Green Published",Y,N,21/12/21,TNC


### Extract the PubMed ids 

These are used to identify the articles within Pubmed Central - much like a doi number. 

In [6]:
#Get list of pubmed IDs (as strings)
pubmed_id_list = df["Pubmed Id"].tolist()
print(len(pubmed_id_list))
print(type(pubmed_id_list))
print(pubmed_id_list[:5])

print(len(pubmed_id_list))
print(type(pubmed_id_list[0]))

84
<class 'list'>
[27909165.0, 20819978.0, 26841800.0, 27210031.0, 22308440.0]
84
<class 'float'>


___Clean up the entries___

In [7]:
#Remove .0 from end of each entry 
cleaned_pubmed_ids = []

for ID in pubmed_id_list:
    ID = str(ID)
    if ID.endswith('.0'):
        ID = ID.removesuffix('.0')
    cleaned_pubmed_ids.append(ID)

print(len(cleaned_pubmed_ids))
print(type(cleaned_pubmed_ids))
print(cleaned_pubmed_ids[:5])

84
<class 'list'>
['27909165', '20819978', '26841800', '27210031', '22308440']


### Fetch information from PubMed 

In [8]:
#Function to connect to pubmed 
def fetch_details(id_list):
    ids = ','.join(id_list)
    Entrez.email = 'orla.mallon@icloud.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    return results

In [9]:
papers = fetch_details(cleaned_pubmed_ids)

In [10]:
# demonstrate on a small subsection of the data 
papers_subset = cleaned_pubmed_ids[:10]
papers_subset = fetch_details(papers_subset)

___Print out the article text___

In [11]:
for i, paper in enumerate(papers_subset['PubmedArticle']):
    print("{}) {}".format(i+1, paper['MedlineCitation']['Article']['ArticleTitle']))

1) Psilocybin produces substantial and sustained decreases in depression and anxiety in patients with life-threatening cancer: A randomized double-blind trial.
2) Pilot study of psilocybin treatment for anxiety in patients with advanced-stage cancer.
3) Psychedelics.
4) Psilocybin with psychological support for treatment-resistant depression: an open-label feasibility study.
5) Neural correlates of the psychedelic state as determined by fMRI studies with psilocybin.
6) Rapid and sustained symptom reduction following psilocybin treatment for anxiety and depression in patients with life-threatening cancer: a randomized controlled trial.
7) The entropic brain: a theory of conscious states informed by neuroimaging research with psychedelic drugs.
8) Antidepressant effects of a single dose of ayahuasca in patients with recurrent depression: a preliminary report.
9) Serotonin and brain function: a tale of two receptors.
10) Psychedelics Promote Structural and Functional Neural Plasticity.


___Extract metadata___

In [12]:
# --- Extract relevant meta info --- 
subset_metadata = json.dumps(papers_subset['PubmedArticle'][0], indent=2)
print(subset_metadata)

{
  "MedlineCitation": {
    "OtherAbstract": [],
    "CitationSubset": [
      "IM"
    ],
    "GeneralNote": [],
    "OtherID": [],
    "KeywordList": [
      [
        "Psilocybin",
        "anxiety",
        "cancer",
        "depression",
        "hallucinogen",
        "mystical experience",
        "symptom remission"
      ]
    ],
    "SpaceFlightMission": [],
    "PMID": "27909165",
    "DateCompleted": {
      "Year": "2017",
      "Month": "12",
      "Day": "27"
    },
    "DateRevised": {
      "Year": "2018",
      "Month": "11",
      "Day": "13"
    },
    "Article": {
      "ArticleDate": [],
      "ELocationID": [],
      "Language": [
        "eng"
      ],
      "Journal": {
        "ISSN": "1461-7285",
        "JournalIssue": {
          "Volume": "30",
          "Issue": "12",
          "PubDate": {
            "Year": "2016",
            "Month": "12"
          }
        },
        "Title": "Journal of psychopharmacology (Oxford, England)",
        "ISOAbbreviat

### Extract the Full Texts 

The Entrez package is good for getting the meta-data and abstracts, but it doesn't cope well with retrieving the full body text which we need here. Thus we can employ the requests package to gather the information from the page, due to PubMed central's Open Access HTML scraper ablities. You can read more about these here: https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/ 

___Demonstrating on one article___

In [15]:
# --- Printing one article ---
import requests

URL = "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/28401522/unicode"
page = requests.get(URL)

page.json()

{'source': 'PMC',
 'date': '20210104',
 'key': 'pmc.key',
 'infons': {},
 'documents': [{'id': '6707356',
   'infons': {'license': 'author_manuscript'},
   'passages': [{'offset': 0,
     'infons': {'article-id_doi': '10.1007/7854_2017_474',
      'article-id_manuscript': 'NIHMS1044542',
      'article-id_pmc': '6707356',
      'article-id_pmid': '28401522',
      'fpage': '393',
      'kwd': 'psilocybin hallucinogens meditation mystical experiences neural model default mode network medial prefrontal cortex posterior cingulate angular gyrus inferior parietal lobule',
      'license': '\n          This file is available for text mining. It may also be used consistent with the principles of fair use under the copyright law.\n        ',
      'lpage': '430',
      'name_0': 'surname:Barrett;given-names:Frederick S.',
      'name_1': 'surname:Griffiths;given-names:Roland R.',
      'section_type': 'TITLE',
      'type': 'front',
      'volume': '36',
      'year': '2019'},
     'text': 'Cl

In [16]:
# --- Inspect the articles information --- 
print(page.headers)

{'Date': 'Thu, 13 Jan 2022 09:24:45 GMT', 'Server': 'Apache', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Referrer-Policy': 'origin-when-cross-origin', 'Content-Security-Policy': 'upgrade-insecure-requests', 'Access-Control-Allow-Methods': 'POST, GET, PUT, OPTIONS, PATCH, DELETE', 'Access-Control-Allow-Origin': '', 'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Headers': 'Origin,X-Accept-Charset,X-Accept,Content-Type,X-Requested-With,NCBI-SID,NCBI-PHID', 'Content-Type': 'application/json', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'X-UA-Compatible': 'IE=Edge', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '38508', 'Keep-Alive': 'timeout=1, max=10', 'Connection': 'Keep-Alive'}


In [17]:
print(len(page.headers))

17


### Extract a list of PubMed search urls 

Here we will construct the URL strings to send through to the requests function for extracting the full articles

In [19]:
URL_list = []

for pubmed_id in cleaned_pubmed_ids:
    start = "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/"
    id_string = pubmed_id
    end = "/unicode"
    full_string = start + id_test + end
    URL_list.append(full_string)

print(f"There are {len(URL_list)} URLs extracted.")
print(f"Here is an example of the first 3 {URL_list[:3]}") 

There are 84 URLs extracted.
Here is an example of the first 3 ['https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/27909165/unicode', 'https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/27909165/unicode', 'https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/27909165/unicode']


___Next, we'll loop these through to the requests package__

In [20]:
# --- First we'll check how many can be retrieved ---

successful = 0
unsuccessful = 0 

for URL in URL_list:
    page = requests.get(URL) 
    if page.status_code == 200:
        successful = successful + 1
    elif page.status_code == 404:
        unsuccessful = unsuccessful + 1

print(f"There were {successful} articles successfully accessed.")
print(f"There were {unsuccessful} articles unsuccessfully accessed.")

84
0


Great, we can see that all our articles returned a successsful response. This means we can access their data using the API, but it doesn't necessarily mean we can extract the text. 

In [21]:
# Next we'll extract the text into a list
texts = []

for URL in URL_list:
    page = requests.get(URL)
    text = page.json()
    texts.append(text)

In [22]:
len(texts)

84

In [56]:
# -- Inspect a random article --
texts[5]

{'source': 'PMC',
 'date': '20201223',
 'key': 'pmc.key',
 'infons': {},
 'documents': [{'id': '5367557',
   'infons': {'license': 'CC BY'},
   'passages': [{'offset': 0,
     'infons': {'article-id_doi': '10.1177/0269881116675513',
      'article-id_pmc': '5367557',
      'article-id_pmid': '27909165',
      'article-id_publisher-id': '10.1177_0269881116675513',
      'fpage': '1181',
      'issue': '12',
      'kwd': 'Psilocybin hallucinogen cancer anxiety depression symptom remission mystical experience',
      'license': 'This article is distributed under the terms of the Creative Commons Attribution 3.0 License (http://www.creativecommons.org/licenses/by/3.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access page (https://us.sagepub.com/en-us/nam/open-access-at-sage).',
      'lpage': '1197',
      'name_0': 'surname:Griffiths;given-names:Roland R',
      'n

### Inspect the information within the above printed article

In [24]:
article_exploring = texts[5] 
print(type(article_exploring))

<class 'dict'>


___Inspect each of the values___

In [37]:
values_view = article_exploring.values()
value_iterator = iter(values_view)
first_value = next(value_iterator)
second_value = next(value_iterator)
third_value = next(value_iterator)
forth_value = next(value_iterator)
fifth_value = next(value_iterator)

In [38]:
print(first_value)
print(second_value)
print(third_value)
print(forth_value)
print(fifth_value)

PMC
20201223
pmc.key
{}
[{'id': '5367557', 'infons': {'license': 'CC BY'}, 'passages': [{'offset': 0, 'infons': {'article-id_doi': '10.1177/0269881116675513', 'article-id_pmc': '5367557', 'article-id_pmid': '27909165', 'article-id_publisher-id': '10.1177_0269881116675513', 'fpage': '1181', 'issue': '12', 'kwd': 'Psilocybin hallucinogen cancer anxiety depression symptom remission mystical experience', 'license': 'This article is distributed under the terms of the Creative Commons Attribution 3.0 License (http://www.creativecommons.org/licenses/by/3.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access page (https://us.sagepub.com/en-us/nam/open-access-at-sage).', 'lpage': '1197', 'name_0': 'surname:Griffiths;given-names:Roland R', 'name_1': 'surname:Johnson;given-names:Matthew W', 'name_2': 'surname:Carducci;given-names:Michael A', 'name_3': 'surname:Umbricht;given

### Save the extracted texts as JSON file 

In [57]:
# -- import json -- 
import json

# create json object from dictionary
json = json.dumps(texts)

# open file for writing, "w" 
f = open("extracted_texts.json","w")

# write json object to file
f.write(json)

# close file
f.close()