Purpose of this notebook is to get full text articles from PMC

In [2]:
# imports
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import re
import requests

# import nltk
# from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')

#from spellchecker import SpellChecker

## Reading in data

In [3]:
# reading in the publication list
excel_file = 'data/bkg registry nelists.xlsx'
pubs = pd.read_excel(excel_file, sheet_name='Publication')
# check
pubs.head()

Unnamed: 0,name,doi,arxiv,pmid,publicationDate,title,authors,abstract,keywords,meshTerms
0,Sun2014,10.1039/C4IB00122B,,25133803.0,2014-08-18,The integrated disease network,Kai Sun and Natalie Buchan and Chris Larminie ...,"The growing body of transcriptomic, proteomic,...","""phenotype, crohn's disease, heterogeneity, ge...","['Computational Biology / methods', 'Databases..."
1,Ernst2015,10.1186/s12859-015-0549-5,,25971816.0,2015-05-14,KnowLife: a versatile approach for constructin...,Patrick Ernst and Amy Siu and Gerhard Weikum,Background: Biomedical knowledge bases (KB's) ...,"['Biomedical text mining','knowledge base','re...","['Biomedical Research', 'Humans', 'Information..."
2,Himmelstein2015,10.1371/journal.pcbi.1004259,,26158728.0,2015-07-09,Heterogeneous Network Edge Prediction: A Data ...,Daniel Himmelstein and Serio Baranzini,The first decade of Genome Wide Association St...,,Algorithms\nAnimals\nChromosome Mapping / meth...
3,Himmelstein2017,10.7554/eLife.26726,10.1101/087619v3,28936969.0,2017-09-22,Systematic integration of biomedical knowledge...,Daniel Scott Himmelstein and Antoine Lizee and...,The ability to computationally predict whether...,,Computational Biology / methods*\nDrug Discove...
4,Martinez2015,10.1016/j.artmed.2014.11.003,,25704113.0,2015-01-13,DrugNet: Network-based drug–disease prioritiza...,Víctor Martínez and Carmen Navarro and Carlos ...,Objective: Computational drug repositioning ca...,Data integration; Disease networks; Drug repos...,Area Under Curve\nComputational Biology*\nComp...


In [9]:
pubs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87 entries, 0 to 86
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   name             87 non-null     object        
 1   doi              86 non-null     object        
 2   arxiv            11 non-null     object        
 3   pmid             76 non-null     float64       
 4   publicationDate  87 non-null     datetime64[ns]
 5   title            87 non-null     object        
 6   authors          86 non-null     object        
 7   abstract         87 non-null     object        
 8   keywords         45 non-null     object        
 9   meshTerms        65 non-null     object        
dtypes: datetime64[ns](1), float64(1), object(8)
memory usage: 6.9+ KB


## Trying to get full text

In [91]:
def annotate_pmids(pmids):
    # Define the PubTator API URL
    pubtator_url = "https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocjson"

    # Join the PMIDs into a comma-separated string
    pmid_str = ','.join(pmids)

    # Send a GET request to the API with the list of PMIDs
    response = requests.get(pubtator_url, params={"pmids": pmid_str, "full": True})

    # Check if the request was successful
    if response.status_code == 200:
        with open(f"PMC4448285.xml", "w", encoding="utf-8") as f:
            f.write(response.text)
        print(f"Downloaded full text XML as 4448285.xml")
    # if response.status_code == 200:
        #return response.json()  # Return the response in JSON format
    else:
        raise Exception(f"Error {response.status_code}: {response.text}")

# Example usage
pmids = ["25133803", "25971816"]
annotations = annotate_pmids(pmids)

# Display the annotated results
import pprint
pprint.pprint(annotations)

ConnectionError: HTTPSConnectionPool(host='www.ncbi.nlm.nih.gov', port=443): Max retries exceeded with url: /research/pubtator-api/publications/export/biocjson?pmids=25133803%2C25971816&full=True (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x131505090>: Failed to resolve 'www.ncbi.nlm.nih.gov' ([Errno 8] nodename nor servname provided, or not known)"))

In [77]:
pmcid = "25133803" 
url = f'https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_xml/{pmcid}/unicode'

response = requests.get(url)

In [78]:
with open('PMC4448285.xml', "wb") as f:
    f.write(response.content)

In [92]:
import xml.etree.ElementTree as ET

def extract_full_text_from_bioc(xml_file):
    # Parse the XML file
    tree = ET.parse(xml_file)
    root = tree.getroot()

    # Initialize a list to store text passages
    full_text = []

    # Iterate through passages in the document
    for document in root.findall(".//document"):
        for passage in document.findall("passage"):
            # Extract the text content
            text = passage.find("text").text
            if text:
                full_text.append(text.strip())

    # Join all passages into a single string with new lines
    #return "\n\n".join(full_text)
    return full_text


In [93]:

# Example usage
xml_file = "PMC4448285.xml"  # Replace with your BioC XML file
full_text = extract_full_text_from_bioc(xml_file)

# Print or save the full text
print(full_text)


['KnowLife: a versatile approach for constructing a large knowledge graph for biomedical sciences', 'Background', "Biomedical knowledge bases (KB's) have become important assets in life sciences. Prior work on KB construction has three major limitations. First, most biomedical KBs are manually built and curated, and cannot keep up with the rate at which new findings are published. Second, for automatic information extraction (IE), the text genre of choice has been scientific publications, neglecting sources like health portals and online communities. Third, most prior work on IE has focused on the molecular level or chemogenomics only, like protein-protein interactions or gene-drug relationships, or solely address highly specific topics such as drug effects.", 'Results', 'We address these three limitations by a versatile and scalable approach to automatic KB construction. Using a small number of seed facts for distant supervision of pattern-based extraction, we harvest a huge number of

In [94]:
len(full_text)

171

In [87]:
import xml.etree.ElementTree as ET

def extract_selected_sections_from_bioc(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()
    full_text = []

    # Define the section types we want to keep
    sections_to_keep = ["ABSTRACT", "INTRO"]

    for document in root.findall(".//document"):
        for passage in document.findall("passage"):
            # Get the section type from the infon key="section_type"
            section_type = passage.find("infon[@key='section_type']").text if passage.find("infon[@key='section_type']") else None
            
            # Check if the section type is one of the desired sections
            if section_type in sections_to_keep:
                text = passage.find("text").text
                if text:
                    full_text.append(text.strip())

    return "\n\n".join(full_text)


In [89]:

# Example usage
xml_file = "PMC4448285.xml"  # Path to your BioC XML file
selected_text = extract_selected_sections_from_bioc(xml_file)

# Print the selected text
print(selected_text)





In [86]:

# Example usage
xml_file = "PMC4448285.xml"  # Replace with your BioC XML file
sections = ["ABSTRACT", "introduction"]  # Sections you want to keep
selected_text = extract_selected_sections_from_bioc(xml_file, sections)

# Print or save the selected text
print(selected_text)





In [13]:
# Base URL for PMC OAI-PMH
pmc_url = f"https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi?verb=GetRecord&metadataPrefix=pmc&identifier=pmcid:PMC4448285"

# Make the request to get full text XML
response = requests.get(pmc_url)

if response.status_code == 200:
    with open(f"PMC4448285.xml", "w", encoding="utf-8") as f:
        f.write(response.text)
    print(f"Downloaded full text XML as 4448285.xml")
else:
    print("Failed to retrieve the full text.")


Downloaded full text XML as 4448285.xml


testing out pubtator

In [95]:
def annotate_pmids(pmids):
    # Define the PubTator API URL
    pubtator_url = "https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocjson"

    # Join the PMIDs into a comma-separated string
    pmid_str = ','.join(pmids)

    # Send a GET request to the API with the list of PMIDs
    response = requests.get(pubtator_url, params={"pmids": pmid_str, "full": True})

    # Check if the request was successful
    if response.status_code == 200:
        return response.json()  # Return the response in JSON format
    else:
        raise Exception(f"Error {response.status_code}: {response.text}")


In [105]:
all_pmids = pubs['pmid'].dropna().astype(int).astype(str).to_list()
# Example usage
pmids = all_pmids[:15]

#pmids = ["25133803", "25971816"]
annotations = annotate_pmids(pmids)

# Display the annotated results
import pprint
pprint.pprint(annotations)


{'PubTator3': [{'_id': '25971816|PMC4448285',
                'authors': ['Ernst P', 'Siu A', 'Weikum G'],
                'date': '2015-05-14T00:00:00Z',
                'id': '4448285',
                'infons': {},
                'journal': 'BMC Bioinformatics',
                'meta': {},
                'passages': [{'annotations': [],
                              'infons': {'article-id_doi': '10.1186/s12859-015-0549-5',
                                         'article-id_pmc': '4448285',
                                         'article-id_pmid': '25971816',
                                         'article-id_publisher-id': '549',
                                         'elocation-id': '157',
                                         'kwd': 'Biomedical text mining '
                                                'Knowledge base Relation '
                                                'extraction',
                                         'license': 'This is an Open Access 

In [122]:
annotations['PubTator3'][3]['passages'][0]['infons']['section_type']

'TITLE'

In [125]:
[(i, x['infons']['section_type'], x['text']) for i, x in enumerate(annotations['PubTator3'][3]['passages'])]

[(0,
  'TITLE',
  'A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information'),
 (1,
  'ABSTRACT',
  'The emergence of large-scale genomic, chemical and pharmacological data provides new opportunities for drug discovery and repositioning. In this work, we develop a computational pipeline, called DTINet, to predict novel drug-target interactions from a constructed heterogeneous network, which integrates diverse drug-related information. DTINet focuses on learning a low-dimensional vector representation of features, which accurately explains the topological properties of individual nodes in the heterogeneous network, and then makes prediction based on these representations via a vector space projection scheme. DTINet achieves substantial performance improvement over other state-of-the-art methods for drug-target interaction prediction. Moreover, we experimentally validate the novel interactions between three 

In [81]:
def annotate_pmids(pmids):
    # Define the PubTator API URL
    pubtator_url = "https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocxml"

    # Join the PMIDs into a comma-separated string
    pmid_str = ','.join(pmids)

    # Send a GET request to the API with the list of PMIDs
    response = requests.get(pubtator_url, params={"pmids": pmid_str, "full": True})

    # Check if the request was successful
    if response.status_code == 200:
        with open(f"PMC4448285.xml", "w", encoding="utf-8") as f:
            f.write(response.text)
        print(f"Downloaded full text XML as 4448285.xml")
    # if response.status_code == 200:
        #return response.json()  # Return the response in JSON format
    else:
        raise Exception(f"Error {response.status_code}: {response.text}")

# Example usage
pmids = ["25133803", "25971816"]
annotations = annotate_pmids(pmids)

# Display the annotated results
import pprint
pprint.pprint(annotations)


Downloaded full text XML as 4448285.xml
None
