# Paper Filtering Criterias
1. Paper must be original research article and (not a review, poster or preprint) ---**Programatically Solvable**---
  - Include Publication Type Field in Search Query

2. Paper must have an AD focus.This means: ---**Optimized Search Query + LLM Task**---
  - Either having a population of AD patients or
  - Looking at Alzheimer disease specific biomarkers  


3. Human sample size must be over 50 ---**LLM Task**---
  - Extract *Abstract* and *Methods* sections from calls to BioC API
  - Feed extracted data into LLM that has been prompt engineered to search and filter papers based on the desired criterias.


4. Must be looking at a protein. ---**Optimized Search Query + LLM Task**---
  - Amyloid β
  - Tau
  - Amyloid Precursor Protein
  - Presenilin-1 and Presenilin-2
  - Apolipoprotein E (ApoE)




5. Fluid samples like CSF, blood from animal models? ---**Optimized Search Query + LLM Task**---
  - How is this different from clinical models?


6. "Blood" vs "Blood Pressure" ---**Optimized Search Query + LLM Task**---
  - How is blood being used in the paper? Is the paper using blood pressure as a biomarker or are actual blood samples being taken for biomarkers.


7. Papers from 2024 and onward. ---**Optimized Search Query**---


## LLM Task

One way that we can have the LLM do filtering is by extracting the abstract and maybe methodology sections through PubMed's BioC API.
Therefore we don't have to worry about going through the downloaded articles, unzip their parent folders and parse through the PDF, we could just get the Abstract and Methodolgies returned to us in plain text format and feed it into the LLM.

I'm able to get those fields returned from the API for several articles so this seems promising.

# Optimizing Search Queries
**All the queries we're making are getting translated by PubMed into a different string that tries to capture our original query string that is then used to query the database.** Because of this papers that might not be relevant to the researchers is being returned.

### Solutions?
**Is there any way we can have more control over the translated string?**


- Using MeSH Headings in the Query | [A description of the MeSH hierarchy](https://www.nlm.nih.gov/mesh/intro_trees.html)

  - Automatic Explosion causes MeSH terms that are not included in the query, both original and translated, to be included in the search results. This can broaden the returned papers which might not be relevant. So a solution to this is to either select MeSH terms that when exploded still included terms that are relevant to us or by writing our query with terms that won't cause explosion, *i.e.* leaf terms that don't have any children- refer to picture below for an example.

- Using MeSH **subheadings** in the Query

  - Subheadings act as qualifiers for MeSH Headings [List of MeSH qualifiers](https://www.nlm.nih.gov/mesh/subhierarchy.html)
  - Like MeSH headings, subheadings also explode

- Limiting the use of Automatic Term Mapping?
  - PubMed's search engine maps entry terms, **i.e.** terms with the field ["All Fields"], to a common MeSH heading depending on their similarity.
  - We can limit this by using **Search Field** tags


#Recording Relevant MeSH Headings

- Chemicals and Drugs Category
  - Macromolecular Substances
  - Amino Acids, Peptides, and Proteins
  - Biological Factors
    - Biomarkers
    - Blood Coagulation Factor Inhibitors
    - Blood Coagulation Factors
- Diseases Category
  - Nervous System Diseases
- Anatomy Category
  - Nervous System
  - Cells
- Psychiatry and Psychology Category
  - Psychological Phenomena
  - Mental Disroders
- Pharmacological Actions Category

# Ideas

- How crazy would it be to have the LLM generate the query string itslef?

# Filtering papers using Ollama at Oceanus

In [None]:
!pip install ollama
!pip install tiktoken

## Using BioC API to return full articles from PMC in JSON format

# Experimenting with Ollama for filtering

* Design prompt for filtering
* Pick Models
* Test classification

## Retreiving Papers
* Extract title or article using OCR or just type it in
* Request BioC API to retreive full article from Pubmed Central

In [8]:
from collections import defaultdict

article_info = defaultdict(str)

In [9]:
import requests
import json
import time

with open("valid.txt", 'r') as f1:
    with open('valid_articles.txt', 'r') as f2:
        for id, line in zip(f1, f2):
            article_info[id[:-1]] = {'class': 1, "title": line[:-1]}
        f2.close()
    f1.close()
with open("invalid.txt", 'r') as f1:
    with open('invalid_articles.txt', 'r') as f2:
        for id, line in zip(f1, f2):
            article_info[id[:-1]] = {'class': 0, "title": line[:-1]}
        f2.close()
    f1.close()


In [None]:
from urllib.parse import quote
def search_article_ids(query, max_articles_per_query, api_key = "5209f5d2ecb2ce377d8c20c5b5a08fd46f09"):
    """
    Search articles matching the query, return their ID
    """
    search_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    all_article_ids = []


    for start in range(0, max_articles_per_query):
        params = {
            "db": "pmc",
            "term": f'{query}',
            "retmode": "json",
            "api_key": api_key,
            "retstart": start,
            "retmax": 1
        }
        response = requests.get(search_url, params=params)
        try:
          if response.status_code == 200:
              data = response.json()
              article_ids = data["esearchresult"]["idlist"]
              if not article_ids:
                  break  # No more articles found
              all_article_ids.extend(article_ids)
              time.sleep(.5)  # To respect PubMed's rate limit
          else:
            # While response status code is 500 then increase wait time and retry
              print(f"Failed to search article IDs at {start}: {response.status_code}\n{query}\nTrying again")
              time.sleep(3)
              while response.status_code == 500:
                params = {
                  "db": "pmc",
                  "term": query,
                  "retmode": "json",
                  "api_key": api_key,
                  "retstart": start,
                  "retmax": 100
                }
                response = requests.get(search_url, params=params)
                if response.status_code == 200:
                  data = response.json()
                  article_ids = data["esearchresult"]["idlist"]
                if not article_ids:
                  print(f"No more articles found at {start}")
                  break  # No more articles found
                all_article_ids.extend(article_ids)
          all_article_ids.extend(article_ids)
        except KeyError as e:
          print(f"Key Error found at {start}")
          continue
        except json.decoder.JSONDecodeError as e:
          print(f"JSON Decode Error found at {start}")
          continue
              # break
    return all_article_ids

for title in titles:
    title['pmc_id'] = search_article_ids(title['title'], 1)[0]


In [6]:
valid_ids = []
invalid_ids = []

with open('valid.txt', 'r') as f:
    valid_ids = [line.strip() for line in f]

with open('invalid.txt', 'r') as f:
    invalid_ids = [line.strip() for line in f]

In [7]:
def fetch_full_text(article_id):
    response = requests.get(f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{article_id}/unicode")

    summary = {
        'Abstract': [],
        'Method': []
    }

    if response.text == f'No record can be found for the input: pmc{article_id[3:]}':
        print(f"No record found for article {article_id}.")

    else:
        try:
            data = json.loads(response.text[1:-1])
            for text in data['documents'][0]['passages'][:]:
                if text['infons']['section_type'] == 'ABSTRACT':
                    summary['Abstract'].append(text['text'])
                if text['infons']['section_type'] == 'METHODS':
                    summary['Method'].append(text['text'])
        except json.JSONDecodeError as e:
            print("Json error for string: ", response.text)
    
    return summary

In [5]:
for id in valid_ids:
    article_info[id]['summary'] = fetch_full_text(id)
    
for id in invalid_ids:
    article_info[id]['summary'] = fetch_full_text(id)

Json error for string:  [{"bioctype": "BioCCollection", "source": "PMC", "date": "20241021", "key": "pmc.key", "version": "1.0", "infons": {}, "documents": [{"bioctype": "BioCDocument", "id": "PMC11485411", "infons": {"license": "CC BY"}, "passages": [{"bioctype": "BioCPassage", "offset": 0, "infons": {"alt-title": "BLUM et\u00a0al.\n", "article-id_doi": "10.1002/alz.14169", "article-id_pmc": "PMC11485411", "article-id_pmid": "39099181", "article-id_publisher-id": "ALZ14169", "fn": "David Blum, Susanna Schraen\u2010Maschke, and Olivier Hanon co\u2010directed this work.", "fpage": "6948", "issue": "10", "kwd": "Alzheimer's disease caffeine CSF biomarkers memory mild cognitive impairment", "license": "This is an open access article under the terms of the http://creativecommons.org/licenses/by/4.0/ License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.", "lpage": "6959", "name_0": "surname:Blum;given-names:David", "name_1": "

TypeError: 'str' object does not support item assignment

In [148]:
print(article_info)
print('hey')

defaultdict(<class 'str'>, {'PMC7649343': {'class': 1, 'title': 'Blood and cerebrospinal fluid neurofilament light differentially detect neurodegeneration in early Alzheimer’s disease', 'summary': {'Abstract': ['Cerebrospinal fluid (CSF) neurofilament light (NfL) concentration has reproducibly been shown to reflect neurodegeneration in brain disorders, including Alzheimer’s disease (AD). NfL concentration in blood correlates with the corresponding CSF levels, but few studies have directly compared the reliability of these 2 markers in sporadic AD. Herein, we measured plasma and CSF concentrations of NfL in 478 cognitively unimpaired (CU) subjects, 227 patients with mild cognitive impairment, and 113 patients with AD dementia. We found that the concentration of NfL in CSF, but not in plasma, was increased in response to Aβ pathology in CU subjects. Both CSF and plasma NfL concentrations were increased in patients with mild cognitive impairment and AD dementia. Furthermore, only NfL in C

KeyboardInterrupt: 

In [None]:
not_found_counter = 0
full_texts = []
over_context_window = 0
for id in pmc_id_list:
        response = requests.get(f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/PMC{id}/unicode")
        # print(response, id)
        if response.text == f'No record can be found for the input: pmc{id}':
            not_found_counter += 1
            print(f"No record found for article {id}.\Didn't find {not_found_counter} articles")
            continue
        else:
            try:
                data = json.loads(response.text[1:-1].strip())
                # print(data['documents'][0]['passages'][1:][1]['infons']['section_type'])
                total_len = 0
                total_len_tokenized = 0
                for text in data['documents'][0]['passages'][:]:
                    if text != None:
                        # print(text['text'])
                        total_len += len(text['text'])
                        # total_len_tokenized += len(tokenizer.encode(text['text']))
                        # print(tokenizer.encode(text['text']))
                # if data['documents'][0]['passages'][1:][1]['infons']['section_type'] == 'ABSTRACT' or data['documents'][0]['passages'][1:][1]['infons']['section_type'] == 'METHOD':
                print(total_len)
                # break
                # print(total_len_tokenized)
                if total_len_tokenized > 120000:
                     over_context_window += 1
                     print(total_len_tokenized, ' ', id, ' --- ', over_context_window)
                # break
            except json.JSONDecodeError as e:
                 print("Json error for string: ", response.text)



In [None]:
list_models_path = "http://oceanus.cs.unlv.edu:11434/api/tags"
res = requests.get(path)

In [6]:
import requests

path = ""

params = {
    "model": "llama3.2:1b",
    "prompt": "What's the definition of a random variable?",
    # "stream": "false",
    "format": "json"
}

res = requests.post(path, json=params)

In [None]:
response = client.chat(model="llama3.2:1b", messages=[{'role':'user', 'content': 'What is the definition of a random variable?',},])