The purpose of this notebook is to retrieve full-text articles from PMC.

In [17]:
# imports
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import os, json
import requests
import time

import sys
sys.path.append('../pubmed_rag')

from bioc import get_biocjson, passages_to_df

# import nltk
# from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')

#from spellchecker import SpellChecker

In [9]:
df = pd.read_csv('data/test_pmid_list.csv', header=None)
df.head()

pmids = df[0].astype(str).to_list()

### Current status 

Getting full texts via biocjson from [Pubtator](https://www.ncbi.nlm.nih.gov/research/pubtator3/api) instead of [BioC API for PMC](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/)

Why?
- I currently have only PMIDs, and Pubtator automatically maps each PMID to its PMCID and returns the full text if it is available; otherwise it returns the abstract.
- The BioC API for PMC did not work with PMIDs, even though it claims to support them.

BioC files
- it looks like each 'passage' is a sentence
- the passages have annotations
- more relevant to us, each passage carries metadata, such as the section it comes from
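
To make the passage layout concrete, here is a minimal sketch that walks a hand-written BioC-style dictionary and pulls out the section, text, and annotation count of each passage. The sample record and the `section_type` infon key are assumptions for illustration; the actual Pubtator response may nest or name its fields differently.

```python
# A hand-written, BioC-style record (illustrative only, not real API output).
sample_doc = {
    "passages": [
        {
            "infons": {"section_type": "TITLE"},
            "text": "An example title",
            "annotations": [],
        },
        {
            "infons": {"section_type": "ABSTRACT"},
            "text": "An example abstract sentence.",
            "annotations": [{"text": "example", "infons": {"type": "Gene"}}],
        },
    ]
}


def iter_passages(doc):
    """Yield (section, text, n_annotations) for each passage in a document."""
    for passage in doc.get("passages", []):
        infons = passage.get("infons", {})
        # 'section_type' is an assumed key; fall back to 'unknown' if absent.
        section = infons.get("section_type", "unknown").lower()
        yield section, passage.get("text", ""), len(passage.get("annotations", []))


rows = list(iter_passages(sample_doc))
for section, text, n_ann in rows:
    print(section, "|", text, "|", n_ann)
```

Keeping the section next to each piece of text is what later allows filtering by section or collapsing sentences into sections.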


Steps
- config
    - provide list of pmids 
    - output path
- retrieve biocjson (full text or abstract) and save to output path
- for each JSON file, parse out the text and section of every passage
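
The steps above can be sketched roughly as follows. This is not the code in `bioc.py`: `retrieve_all`, `parse_saved`, and `fake_fetch` are hypothetical names, the record layout is simplified, and the fetcher is a stand-in so the sketch runs without network access.

```python
import json
import tempfile
from pathlib import Path


def retrieve_all(pmids, out_dir, fetch):
    """Fetch one record per PMID with `fetch` and save each as JSON.

    `fetch` is a hypothetical callable: pmid -> dict, or None when nothing
    is available for that PMID.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for pmid in pmids:
        record = fetch(pmid)
        if record is None:
            continue  # skip PMIDs with neither full text nor abstract
        path = out_dir / f"{pmid}.json"
        path.write_text(json.dumps(record), encoding="utf-8")
        saved.append(path)
    return saved


def parse_saved(path):
    """Return (section, text) pairs from one saved record."""
    doc = json.loads(Path(path).read_text(encoding="utf-8"))
    return [(p["section"], p["text"]) for p in doc["passages"]]


def fake_fetch(pmid):
    """Stand-in fetcher (assumption): PMID '0' simulates a missing record."""
    if pmid == "0":
        return None
    return {"pmid": pmid, "passages": [{"section": "title", "text": f"Title of {pmid}"}]}


# config: list of PMIDs plus an output path
config = {"pmids": ["1", "0", "2"], "out_dir": tempfile.mkdtemp()}
saved_paths = retrieve_all(config["pmids"], config["out_dir"], fake_fetch)
parsed = [parse_saved(p) for p in saved_paths]
print(parsed)
```

Passing the fetcher in as an argument keeps the save-and-parse logic testable offline; in the notebook the real retrieval is done by `get_biocjson`.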

Vector database of sentences or sections?
Option 1:
- embed each sentence using [PubMedBERT, now called BiomedBERT](https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext)
- keeping metadata

Option 2:
- collapse the sentences into one section 
- embed each section using same model
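
A small sketch of how the text units would differ between the two options, using a toy dataframe with made-up rows whose column names mirror the ones used in this notebook. The embedding step itself is left out; either set of units would be passed to the same model.

```python
import pandas as pd

# Toy data (made up) with the same column names as the parsed passages.
sentences = pd.DataFrame(
    {
        "index": [0, 1, 2, 3],
        "section": ["title", "abstract", "abstract", "intro"],
        "sentence": ["A title.", "First point.", "Second point.", "Some background."],
        "pmid": ["1", "1", "1", "1"],
    }
)

# Option 1: one unit per sentence, metadata kept as-is.
sentence_units = sentences.rename(columns={"sentence": "text"})

# Option 2: one unit per (pmid, section), sentences joined in document order.
section_units = (
    sentences.sort_values("index")
    .groupby(["pmid", "section"], sort=False)["sentence"]
    .agg(" ".join)
    .reset_index()
    .rename(columns={"sentence": "text"})
)

print(len(sentence_units), "sentence units;", len(section_units), "section units")
```

Option 1 yields more, shorter units with finer-grained metadata; Option 2 yields fewer, longer units, which may exceed the model's input length for long sections.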




In [18]:
# Example usage
sample = pmids[:15]


In [19]:
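for each in sample:

    # retrieve the biocjson for this pmid
    result = get_biocjson(each, 'biocjson')

    # parse the passages only when something was returned
    if result is not None:
        passages_to_df(result, 'biocjson')

    # pause between requests
    time.sleep(2)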

In [68]:
# cleaning?
# read in example
df_test = pd.read_csv('biocjson/df_26158728.csv')
# lowercase and strip section names
df_test['section'] = df_test['section'].str.lower().str.strip()
# store pmids as strings
df_test['pmid'] = df_test['pmid'].astype(str)
# parse dates
df_test['date'] = pd.to_datetime(df_test['date'])
# strip whitespace from sentences, just in case
df_test['sentence'] = df_test['sentence'].str.strip()

# characters that count as sentence-ending punctuation
punctuations = ('!', ',', '.', '?', '"', "'")
# for now, append a '.' to any sentence that does not end with one of these
df_test['sentence'] = np.where(
    df_test['sentence'].str.endswith(punctuations),
    df_test['sentence'],
    df_test['sentence'] + '.',
)

# check
df_test.head()

Unnamed: 0,index,section,sentence,pmid,pmcid,date,authors,journal
0,0,title,Heterogeneous Network Edge Prediction: A Data ...,26158728,PMC4497619,2015-07-09 00:00:00+00:00,Himmelstein DS and Baranzini SE,PLoS Comput Biol
1,1,abstract,The first decade of Genome Wide Association St...,26158728,PMC4497619,2015-07-09 00:00:00+00:00,Himmelstein DS and Baranzini SE,PLoS Comput Biol
2,2,abstract,Author Summary.,26158728,PMC4497619,2015-07-09 00:00:00+00:00,Himmelstein DS and Baranzini SE,PLoS Comput Biol
3,3,abstract,"For complex human diseases, identifying the ge...",26158728,PMC4497619,2015-07-09 00:00:00+00:00,Himmelstein DS and Baranzini SE,PLoS Comput Biol
4,4,intro,Introduction.,26158728,PMC4497619,2015-07-09 00:00:00+00:00,Himmelstein DS and Baranzini SE,PLoS Comput Biol


In [69]:
# which sections to keep? 
keep_sections = ['title', 'abstract', 'intro', 'results', 'discuss', 'methods']

# filter 
df_filtered = df_test[df_test['section'].isin(keep_sections)]

In [50]:
# rows whose sentence is a single word
df_filtered[df_filtered['sentence'].str.split(' ').str.len() == 1]

Unnamed: 0,index,section,sentence,pmid,pmcid,date,authors,journal
4,4,intro,Introduction,26158728,PMC4497619,2015-07-09 00:00:00+00:00,Himmelstein DS and Baranzini SE,PLoS Comput Biol
11,11,results,Results,26158728,PMC4497619,2015-07-09 00:00:00+00:00,Himmelstein DS and Baranzini SE,PLoS Comput Biol
60,60,discuss,Discussion,26158728,PMC4497619,2015-07-09 00:00:00+00:00,Himmelstein DS and Baranzini SE,PLoS Comput Biol
66,66,methods,Methods,26158728,PMC4497619,2015-07-09 00:00:00+00:00,Himmelstein DS and Baranzini SE,PLoS Comput Biol
73,73,methods,Nodes,26158728,PMC4497619,2015-07-09 00:00:00+00:00,Himmelstein DS and Baranzini SE,PLoS Comput Biol
75,75,methods,Associations,26158728,PMC4497619,2015-07-09 00:00:00+00:00,Himmelstein DS and Baranzini SE,PLoS Comput Biol
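
The single-word rows above look like section or subsection headings rather than real sentences. If they should be excluded before embedding, one possible filter is sketched below on a toy dataframe. Treating every one-word sentence as a heading is an assumption that may also drop legitimate one-word sentences.

```python
import pandas as pd

# Toy data (made up) mixing heading-like rows with real sentences.
df_example = pd.DataFrame(
    {
        "section": ["intro", "intro", "methods", "methods"],
        "sentence": [
            "Introduction.",
            "Genes are associated with diseases.",
            "Nodes.",
            "We built a network.",
        ],
    }
)

# True where the sentence consists of exactly one whitespace-separated token.
is_single_word = df_example["sentence"].str.split().str.len() == 1

# Keep only the rows that have more than one token.
df_body = df_example[~is_single_word]
print(df_body["sentence"].tolist())
```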


In [71]:
# if collapsing to embed by section?

# groupby yields (section name, group dataframe) pairs
for section, section_rows in df_filtered.groupby('section'):

    # keep the sentences in document order
    section_rows = section_rows.sort_values(by='index', ascending=True).copy()

    # join all sentences of the section into one text
    section_rows['text'] = ' '.join(section_rows['sentence'])

    # keep one row per section (drop_duplicates has no axis argument)
    section_rows = section_rows.drop_duplicates(subset=['text', 'section'])

Heterogeneous Network Edge Prediction: A Data Integration Approach to Prioritize Disease-Associated Genes.
The first decade of Genome Wide Association Studies (GWAS) has uncovered a wealth of disease-associated variants. Two important derivations will be the translation of this information into a multiscale understanding of pathogenic variants and leveraging existing data to increase the power of existing and future studies through prioritization. We explore edge prediction on heterogeneous networks:graphs with multiple node and edge types:for accomplishing both tasks. First we constructed a network with 18 node types:genes, diseases, tissues, pathophysiologies, and 14 MSigDB (molecular signatures database) collections:and 19 edge types from high-throughput publicly-available resources. From this network composed of 40,343 nodes and 1,608,168 edges, we extracted features that describe the topology between specific genes and diseases. Next, we trained a model from GWAS associations and 

In [None]:
section_rows.drop

In [81]:
list(df_filtered.groupby('section'))

[('abstract',
     index   section                                           sentence  \
  1      1  abstract  The first decade of Genome Wide Association St...   
  2      2  abstract                                    Author Summary.   
  3      3  abstract  For complex human diseases, identifying the ge...   
  
         pmid       pmcid                      date  \
  1  26158728  PMC4497619 2015-07-09 00:00:00+00:00   
  2  26158728  PMC4497619 2015-07-09 00:00:00+00:00   
  3  26158728  PMC4497619 2015-07-09 00:00:00+00:00   
  
                             authors           journal  
  1  Himmelstein DS and Baranzini SE  PLoS Comput Biol  
  2  Himmelstein DS and Baranzini SE  PLoS Comput Biol  
  3  Himmelstein DS and Baranzini SE  PLoS Comput Biol  ),
 ('discuss',
      index  section                                           sentence  \
  60     60  discuss                                        Discussion.   
  61     61  discuss  In this work, we developed a framework to pre

In [81]:
def annotate_pmids(pmids):
    """Download BioC XML for a list of PMIDs from the PubTator API and save it to disk.

    Returns None, because the JSON return below is commented out.
    """
    # Define the PubTator API URL
    pubtator_url = "https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocxml"

    # Join the PMIDs into a comma-separated string
    pmid_str = ','.join(pmids)

    # Send a GET request to the API with the list of PMIDs
    response = requests.get(pubtator_url, params={"pmids": pmid_str, "full": True})

    # Check if the request was successful
    if response.status_code == 200:
        # NOTE: the output file name is hardcoded, whatever PMIDs are passed in,
        # and the message below omits the 'PMC' prefix of the file actually written
        with open("PMC4448285.xml", "w", encoding="utf-8") as f:
            f.write(response.text)
        print("Downloaded full text XML as 4448285.xml")
    # if response.status_code == 200:
        #return response.json()  # Return the response in JSON format
    else:
        raise Exception(f"Error {response.status_code}: {response.text}")

# Example usage (note: this overwrites the pmids list loaded earlier)
pmids = ["25133803", "25971816"]
# annotate_pmids writes the XML to disk and returns None
annotations = annotate_pmids(pmids)

# Display the result (None, since nothing is returned)
import pprint
pprint.pprint(annotations)


Downloaded full text XML as 4448285.xml
None


In [None]:
import xml.etree.ElementTree as ET

def extract_full_text_from_bioc(xml_file):
    """Return the text of every passage in a BioC XML file as a list of strings."""
    # Parse the XML file
    tree = ET.parse(xml_file)
    root = tree.getroot()

    # Initialize a list to store text passages
    full_text = []

    # Iterate through passages in the document
    for document in root.findall(".//document"):
        for passage in document.findall("passage"):
            # findtext returns None when a passage has no <text> child,
            # so passages without text are skipped instead of raising
            text = passage.findtext("text")
            if text:
                full_text.append(text.strip())

    # Return one string per passage (use "\n\n".join(full_text) for a single string)
    return full_text
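
A quick way to check this kind of extraction without downloading anything is to run the same logic on a small in-memory XML string. The sample below is hand-written and only mimics the collection/document/passage nesting; `passages_from_string` is a hypothetical variant that parses a string instead of a file.

```python
import xml.etree.ElementTree as ET

# Hand-written sample (illustrative only); the last passage has no <text> child.
sample_xml = """<collection>
  <document>
    <id>0</id>
    <passage><infon key="type">title</infon><text> An example title </text></passage>
    <passage><infon key="type">abstract</infon><text>An example abstract.</text></passage>
    <passage><infon key="type">fig</infon></passage>
  </document>
</collection>"""


def passages_from_string(xml_string):
    """Return the stripped text of every passage that has non-empty text."""
    root = ET.fromstring(xml_string)
    texts = []
    for passage in root.iter("passage"):
        # findtext returns None if the passage has no <text> element
        text = passage.findtext("text")
        if text and text.strip():
            texts.append(text.strip())
    return texts


passages = passages_from_string(sample_xml)
print(passages)
```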
