The purpose of this notebook is to retrieve full-text articles from PMC.

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# from sklearn.manifold import TSNE
# from sklearn.preprocessing import LabelEncoder
import os
import sys
import time
sys.path.append('../pubmed_rag')
from utils import get_chunks
from bioc import (
    collapse_sections,
    get_smaller_texts, 
    get_biocjson,
    passages_to_df
)


In [3]:
df = pd.read_csv('data/test_pmid_list.csv', header=None)
df.head()

pmids = df[0].astype(str).to_list()
#pmids

### Current status 

Full texts are retrieved as biocjson from [PubTator](https://www.ncbi.nlm.nih.gov/research/pubtator3/api) instead of the [BioC API for PMC](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/).

Why?
- I currently only have PMIDs. PubTator automatically maps each PMID to its PMCID and returns the full text if it is available; otherwise it returns the abstract.
- The BioC API for PMC does not work with PMIDs in practice, even though it claims to support them.
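
As a rough illustration of the PMID-based lookup described above, here is a minimal sketch using only the standard library. The endpoint URL and the `full` parameter are assumptions based on the PubTator3 API page and should be checked there before use. `build_pubtator_url` and `fetch_biocjson_sketch` are hypothetical names, not the `get_biocjson` helper imported in this notebook.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Assumed export endpoint; verify against the PubTator3 API documentation.
PUBTATOR_EXPORT = (
    "https://www.ncbi.nlm.nih.gov/research/pubtator3-api/"
    "publications/export/biocjson"
)


def build_pubtator_url(pmids, full=True):
    """Build an export URL for one or more PMIDs (hypothetical helper)."""
    params = {"pmids": ",".join(str(p) for p in pmids)}
    if full:
        # Assumed flag asking for full text when a PMC version exists.
        params["full"] = "true"
    # Keep commas unescaped so the PMID list stays readable.
    return PUBTATOR_EXPORT + "?" + urllib.parse.urlencode(params, safe=",")


def fetch_biocjson_sketch(pmid, timeout=30):
    """Return the parsed JSON response, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_pubtator_url([pmid]), timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, ValueError):
        return None
```

Returning `None` on failure mirrors the `if result is not None` check used later in this notebook.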

BioC files
- Each 'passage' appears to be a sentence.
- Passages carry annotations.
- Relevant to us, each passage also has metadata, such as the section it comes from.
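
To make the passage structure concrete, here is a small sketch that flattens a hand-written document into one row per passage. The example document is invented for illustration, and the `section_type` key under `infons` is an assumption about the response layout. `passages_to_rows` is a hypothetical stand-in, not the `passages_to_df` helper used in this notebook.

```python
# Hand-written example in the general shape of a BioC JSON document.
# The values are invented and the infons keys are assumptions.
example_doc = {
    "id": "12345",
    "passages": [
        {
            "infons": {"section_type": "TITLE", "type": "front"},
            "offset": 0,
            "text": "An example title",
            "annotations": [],
        },
        {
            "infons": {"section_type": "INTRO", "type": "paragraph"},
            "offset": 17,
            "text": "BRCA1 is mentioned in this passage.",
            "annotations": [
                {"id": "1", "infons": {"type": "Gene"}, "text": "BRCA1"}
            ],
        },
    ],
}


def passages_to_rows(doc):
    """Flatten one document into a list of dicts, one per passage."""
    rows = []
    for passage in doc.get("passages", []):
        infons = passage.get("infons", {})
        rows.append(
            {
                "pmid": doc.get("id"),
                # Fall back to the generic type if no section is recorded.
                "section": infons.get("section_type", infons.get("type", "unknown")),
                "sentence": passage.get("text", ""),
                "n_annotations": len(passage.get("annotations", [])),
            }
        )
    return rows


rows = passages_to_rows(example_doc)
```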


Steps
- config
    - provide list of pmids 
    - output path
- retrieve the biocjson (full text or abstract) for each pmid and save it to the output path
- for each JSON file, parse out the text and section
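
The steps above could be sketched as follows. This is only an outline under stated assumptions: `RetrievalConfig` and `save_biocjson` are hypothetical names, and the fetch function is passed in so that any retrieval helper (or a fake one, as in the demo) can be used.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class RetrievalConfig:
    """Hypothetical config holding the two inputs listed in the steps."""
    pmids: list
    output_dir: str


def save_biocjson(config, fetch):
    """Fetch each PMID with `fetch` (pmid -> dict or None) and save it as JSON."""
    out = Path(config.output_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for pmid in config.pmids:
        result = fetch(pmid)
        if result is None:
            # Nothing retrieved for this PMID; skip it.
            continue
        path = out / f"{pmid}.json"
        path.write_text(json.dumps(result), encoding="utf-8")
        saved.append(path)
    return saved


def fake_fetch(pmid):
    """Stand-in fetcher for the demo; pretends PMID 222 is unavailable."""
    return None if pmid == "222" else {"id": pmid, "passages": []}


with tempfile.TemporaryDirectory() as tmp:
    config = RetrievalConfig(pmids=["111", "222"], output_dir=tmp)
    saved_names = [p.name for p in save_biocjson(config, fake_fetch)]
```

Saving the raw JSON first keeps retrieval separate from parsing, so the parsing step can be rerun without calling the API again.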

Vector database of sentences or sections?
Option 1:
- embed each sentence using [PubMedBERT, now called BiomedBERT](https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext)
- keep the metadata

Option 2:
- collapse the sentences into one section 
- embed each section using same model
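
The two options differ only in what counts as one record to embed. Here is a minimal sketch of both granularities in plain Python, with invented example rows. The function names are hypothetical, and the embedding step itself is left out.

```python
from collections import defaultdict

# Invented example rows: one dict per sentence, with its metadata.
sentences = [
    {"pmid": "1", "section": "abstract", "sentence": "First sentence."},
    {"pmid": "1", "section": "abstract", "sentence": "Second sentence."},
    {"pmid": "1", "section": "results", "sentence": "A result."},
]


def sentence_records(rows):
    """Option 1: one record per sentence, metadata kept as-is."""
    return [
        {"pmid": r["pmid"], "section": r["section"], "text": r["sentence"]}
        for r in rows
    ]


def section_records(rows):
    """Option 2: join the sentences of each (pmid, section) into one text."""
    grouped = defaultdict(list)
    for r in rows:
        grouped[(r["pmid"], r["section"])].append(r["sentence"])
    return [
        {"pmid": pmid, "section": section, "text": " ".join(texts)}
        for (pmid, section), texts in grouped.items()
    ]


by_sentence = sentence_records(sentences)
by_section = section_records(sentences)
```

Option 1 yields more, shorter records. Option 2 yields fewer, longer ones, which may need to be split again to fit the model's input limit.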




In [8]:
# Example usage
sample = pmids[:15]
max_tokens = 350

In [19]:
# for each pmid
for pmid in sample:
    
    result = get_biocjson(pmid, 'biocjson')

    if result is not None:
        df_test = passages_to_df(result, 'biocjson')

        # basic cleaning
        # lower-case and strip section names
        df_test['section'] = df_test['section'].str.lower().str.strip()
        # cast pmids to str
        df_test['pmid'] = df_test['pmid'].astype(str)
        df_test['date'] = pd.to_datetime(df_test['date'])
        # strip whitespace from sentences as well
        df_test['sentence'] = df_test['sentence'].str.strip()
        punctuations = ('!', ',', '.', '?', '"', "'")
        # for now, append a period to sentences that lack closing punctuation
        df_test['sentence'] = np.where(
            df_test['sentence'].str.endswith(punctuations), 
            df_test['sentence'], 
            df_test['sentence']+'.'
        )
        # which sections to keep? 
        keep_sections = ['title', 'abstract', 'intro', 'results', 'discuss', 'methods']
        # filter 
        df_filtered = df_test[df_test['section'].isin(keep_sections)]

        # grouping by section
        collapsed = collapse_sections(df_filtered, 'biocjson')
        # split each section into smaller texts (bounded by max_tokens)
        collapsed['text'] = collapsed['text'].apply(
            lambda section: get_smaller_texts(section, max_tokens)
        )
        # one row per smaller text
        exploded = collapsed.explode('text')
    # pause between requests to avoid overloading the API
    time.sleep(2)
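
`get_smaller_texts` is imported from the local `bioc` module, and its implementation is not shown in this notebook. As a rough idea of what splitting by a token budget can look like, here is a minimal sketch that treats whitespace-separated words as tokens. That is an assumption: the real helper may count model tokens instead. `split_by_token_budget` is a hypothetical name.

```python
def split_by_token_budget(text, max_tokens):
    """Split text into consecutive pieces of at most max_tokens words."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    # Assumption: whitespace-separated words stand in for tokens.
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


pieces = split_by_token_budget("a b c d e", 2)
```

Because each section becomes a list of pieces, `DataFrame.explode` can then turn every piece into its own row.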