# Paper Filtering Criterias
1. Paper must be original research article and (not a review, poster or preprint) ---**Programatically Solvable**---
  - Include Publication Type Field in Search Query

2. Paper must have an AD focus.This means: ---**Optimized Search Query + LLM Task**---
  - Either having a population of AD patients or
  - Looking at Alzheimer disease specific biomarkers  


3. Human sample size must be over 50 ---**LLM Task**---
  - Extract *Abstract* and *Methods* sections from calls to BioC API
  - Feed extracted data into LLM that has been prompt engineered to search and filter papers based on the desired criterias.


4. Must be looking at a protein. ---**Optimized Search Query + LLM Task**---
  - Amyloid Î²
  - Tau
  - Amyloid Precursor Protein
  - Presenilin-1 and Presenilin-2
  - Apolipoprotein E (ApoE)




5. Fluid samples like CSF, blood from animal models? ---**Optimized Search Query + LLM Task**---
  - How is this different from clinical models?


6. "Blood" vs "Blood Pressure" ---**Optimized Search Query + LLM Task**---
  - How is blood being used in the paper? Is the paper using blood pressure as a biomarker or are actual blood samples being taken for biomarkers.


7. Papers from 2024 and onward. ---**Optimized Search Query**---


## LLM Task

One way that we can have the LLM do filtering is by extracting the abstract and maybe methodology sections through PubMed's BioC API.
Therefore we don't have to worry about going through the downloaded articles, unzip their parent folders and parse through the PDF, we could just get the Abstract and Methodolgies returned to us in plain text format and feed it into the LLM.

I'm able to get those fields returned from the API for several articles so this seems promising.

# Optimizing Search Queries
**All the queries we're making are getting translated by PubMed into a different string that tries to capture our original query string that is then used to query the database.** Because of this papers that might not be relevant to the researchers is being returned.

### Solutions?
**Is there any way we can have more control over the translated string?**


- Using MeSH Headings in the Query | [A description of the MeSH hierarchy](https://www.nlm.nih.gov/mesh/intro_trees.html)

  - Automatic Explosion causes MeSH terms that are not included in the query, both original and translated, to be included in the search results. This can broaden the returned papers which might not be relevant. So a solution to this is to either select MeSH terms that when exploded still included terms that are relevant to us or by writing our query with terms that won't cause explosion, *i.e.* leaf terms that don't have any children- refer to picture below for an example.

- Using MeSH **subheadings** in the Query

  - Subheadings act as qualifiers for MeSH Headings [List of MeSH qualifiers](https://www.nlm.nih.gov/mesh/subhierarchy.html)
  - Like MeSH headings, subheadings also explode

- Limiting the use of Automatic Term Mapping?
  - PubMed's search engine maps entry terms, **i.e.** terms with the field ["All Fields"], to a common MeSH heading depending on their similarity.
  - We can limit this by using **Search Field** tags


#Recording Relevant MeSH Headings

- Chemicals and Drugs Category
  - Macromolecular Substances
  - Amino Acids, Peptides, and Proteins
  - Biological Factors
    - Biomarkers
    - Blood Coagulation Factor Inhibitors
    - Blood Coagulation Factors
- Diseases Category
  - Nervous System Diseases
- Anatomy Category
  - Nervous System
  - Cells
- Psychiatry and Psychology Category
  - Psychological Phenomena
  - Mental Disroders
- Pharmacological Actions Category

# Ideas

- How crazy would it be to have the LLM generate the query string itslef?

# Filtering papers using Ollama at Oceanus

In [None]:
!pip install ollama
!pip install tiktoken

## Using BioC API to return full articles from PMC in JSON format

# Experimenting with Ollama for filtering

* Design prompt for filtering
* Pick Models
* Test classification

## Retreiving Papers
* Extract title or article using OCR or just type it in
* Request BioC API to retreive full article from Pubmed Central

In [71]:
from collections import defaultdict

article_info = defaultdict(str)

In [72]:
import requests
import json
import time

with open("valid.txt", 'r') as f1:
    with open('valid_articles.txt', 'r') as f2:
        for id, line in zip(f1, f2):
            article_info[id[:-1]] = {'class': 1, "title": line[:-1]}
        f2.close()
    f1.close()
with open("invalid.txt", 'r') as f1:
    with open('invalid_articles.txt', 'r') as f2:
        for id, line in zip(f1, f2):
            article_info[id[:-1]] = {'class': 0, "title": line[:-1]}
        f2.close()
    f1.close()


In [None]:
from urllib.parse import quote
def search_article_ids(query, max_articles_per_query, api_key = "5209f5d2ecb2ce377d8c20c5b5a08fd46f09"):
    """
    Search articles matching the query, return their ID
    """
    search_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    all_article_ids = []


    for start in range(0, max_articles_per_query):
        params = {
            "db": "pmc",
            "term": f'{query}',
            "retmode": "json",
            "api_key": api_key,
            "retstart": start,
            "retmax": 1
        }
        response = requests.get(search_url, params=params)
        try:
          if response.status_code == 200:
              data = response.json()
              article_ids = data["esearchresult"]["idlist"]
              if not article_ids:
                  break  # No more articles found
              all_article_ids.extend(article_ids)
              time.sleep(.5)  # To respect PubMed's rate limit
          else:
            # While response status code is 500 then increase wait time and retry
              print(f"Failed to search article IDs at {start}: {response.status_code}\n{query}\nTrying again")
              time.sleep(3)
              while response.status_code == 500:
                params = {
                  "db": "pmc",
                  "term": query,
                  "retmode": "json",
                  "api_key": api_key,
                  "retstart": start,
                  "retmax": 100
                }
                response = requests.get(search_url, params=params)
                if response.status_code == 200:
                  data = response.json()
                  article_ids = data["esearchresult"]["idlist"]
                if not article_ids:
                  print(f"No more articles found at {start}")
                  break  # No more articles found
                all_article_ids.extend(article_ids)
          all_article_ids.extend(article_ids)
        except KeyError as e:
          print(f"Key Error found at {start}")
          continue
        except json.decoder.JSONDecodeError as e:
          print(f"JSON Decode Error found at {start}")
          continue
              # break
    return all_article_ids

for title in titles:
    title['pmc_id'] = search_article_ids(title['title'], 1)[0]


In [73]:
valid_ids = []
invalid_ids = []

with open('valid.txt', 'r') as f:
    valid_ids = [line.strip() for line in f]

with open('invalid.txt', 'r') as f:
    invalid_ids = [line.strip() for line in f]

In [74]:
def fetch_full_text(article_id):
    response = requests.get(f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{article_id}/unicode")

    summary = {
        'Abstract': [],
        'Method': []
    }

    if response.text == f'No record can be found for the input: pmc{article_id[3:]}':
        print(f"No record found for article {article_id}.")

    else:
        try:
            data = json.loads(response.text[1:-1])
            for text in data['documents'][0]['passages'][:]:
                if text['infons']['section_type'] == 'ABSTRACT':
                    summary['Abstract'].append(text['text'])
                if text['infons']['section_type'] == 'METHODS':
                    summary['Method'].append(text['text'])
        except json.JSONDecodeError as e:
            print("Json error for string: ", response.text)
    
    return summary

In [75]:
for id in valid_ids:
    print(id)
    article_info[id]['summary'] = fetch_full_text(id)
    
for id in invalid_ids:
    article_info[id]['summary'] = fetch_full_text(id)

PMC7649343
PMC10612408
PMC11193202
PMC10794000
PMC11485411
Json error for string:  [{"bioctype": "BioCCollection", "source": "PMC", "date": "20241021", "key": "pmc.key", "version": "1.0", "infons": {}, "documents": [{"bioctype": "BioCDocument", "id": "PMC11485411", "infons": {"license": "CC BY"}, "passages": [{"bioctype": "BioCPassage", "offset": 0, "infons": {"alt-title": "BLUM et\u00a0al.\n", "article-id_doi": "10.1002/alz.14169", "article-id_pmc": "PMC11485411", "article-id_pmid": "39099181", "article-id_publisher-id": "ALZ14169", "fn": "David Blum, Susanna Schraen\u2010Maschke, and Olivier Hanon co\u2010directed this work.", "fpage": "6948", "issue": "10", "kwd": "Alzheimer's disease caffeine CSF biomarkers memory mild cognitive impairment", "license": "This is an open access article under the terms of the http://creativecommons.org/licenses/by/4.0/ License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.", "lpage": "695

TypeError: 'str' object does not support item assignment

# Models to test

* Llama3.2
    * 1B
* Phi3
* Gemma

In [None]:
list_models_path = "http://oceanus.cs.unlv.edu:11434/api/tags"
res = requests.get(list_models_path)
res.json()

In [150]:
custom_template = """
<|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2025

You are going to be performing classifications on research articles regarding Alzheimerâs disease. Below are the guidelines on how 
to judge an article as being relevant or not. Please return a single Yes or No in your response.

## 1. Papers Must Be Original Research Articles
    * Metadata Filtering: Use metadata to identify papers labeled "original research" and exclude reviews, perspectives, posters, or preprints.
    * Keyword Identification: Scan sections for phrases like "data were collected" or "we conducted" to confirm original research. Flag terms like "this review explores" or "we summarize" for exclusion.
    * Preprint/Poster Exclusion: Exclude papers from platforms like "arXiv," "bioRxiv," or those labeled "poster presented."
    * Automated Categorization: Use metadata and text analysis to classify papers. Only include those strongly aligned with "original research."
## 2. Papers must have an AD focus, including a population of AD patients (at risk/MCI/AD) and/or looking at Alzheimerâs disease specific biomarkers (amyloid) (Many papers investigating neurodegenerative diseases will mention Alzheimerâs, even if it is not the focus)  
    * Criteria for Selection:
        * AD-Focused: Include papers explicitly studying Alzheimerâs disease (AD) topics like diagnosis, treatment, biomarkers (e.g., amyloid), or pathology.
        * AD Patients: Papers involving AD populations (at risk, MCI, or diagnosed) are relevant, even if AD is not the central focus.
        * Subset Context: Categorize papers focusing broadly on neurodegeneration with AD patients as "AD-relevant," unless biomarkers or pathology are specific to AD.
        * Biomarker Specificity: Prioritize studies addressing AD biomarkers (e.g., amyloid, tau) over general neurodegeneration markers.
    * LLM Utilization:
        * Identify keywords like "Alzheimerâs," "amyloid," or "neurodegeneration."
        * Differentiate papers as "AD-focused" or "AD-relevant" based on biomarker and patient population content.
        * This ensures inclusion of research directly tied to AD while recognizing broader studies with AD relevance.
## 3.  Human Sample Size Must Be Over 50
    * Criteria for Inclusion:
        * Stated Sample Size: Include papers explicitly reporting a sample size of 50+ for AD patients (at risk, MCI, or diagnosed).
        * Missing Information: Exclude papers without specific sample size details unless other critical criteria (e.g., strong AD focus or biomarker analysis) are met.
        * LLM Filtering: Use the LLM to identify terms like "n =", "sample size," or "number of participants." Flag unclear cases for secondary review if warranted.
        * This ensures the inclusion of robust studies with sufficient participant data while focusing on transparent methodologies.
## 4.  Must be looking at a protein (no genes, transcripts, or fragments)
    * Training Dataset: Compile papers clearly categorized as either protein-focused or gene/transcript-focused. This will provide the LLM with concrete examples to distinguish between the two categories, even without an exhaustive list of proteins.
    * Keyword Filtering: Use terms like "protein," "amyloid," "tau," or specific AD-related proteins (e.g., "beta-amyloid") to identify relevant studies. Exclude papers mentioning "gene," "RNA," "transcription," or "fragment" as indicators of a non-protein focus.
    * Contextual Pattern Recognition: Train the LLM to go beyond keywords and recognize context-specific usage, such as identifying molecular mechanisms tied to proteins versus genetic or transcriptomic functions.
    * Iterative Refinement: Review the LLM's classifications periodically. Add misclassified papers as additional examples to improve its ability to recognize complex distinctions, like protein interactions or separating fragments from full proteins.
    * Manual Review for Ambiguities: Flag papers with ambiguous terms (e.g., "molecular markers") for manual checks. Continued refinement of the LLM will reduce reliance on manual reviews over time.
    * This expanded approach ensures accurate identification of protein-specific studies while maintaining flexibility for edge cases.
## 5. Include Fluids from Non-Clinical Models (Exclude Tissue Samples)
    * Fluid Criteria: Focus on animal studies using fluids like cerebrospinal fluid (CSF), blood, serum, or plasma. These fluids often contain biomarkers relevant to AD research.
    * Exclusion of Tissue Samples: Exclude studies involving tissue samples (e.g., brain slices, biopsy samples) using keywords like "tissue," "histology," or "brain slice."
    * Sample Paper Review: Provide example papers to train the LLM on acceptable fluid-based studies versus tissue-based ones, using phrases like âCSF collectionâ or âserum analysis.â
    * Iterative Refinement: Regularly review and refine the LLMâs ability to distinguish cases where both fluids and tissues are mentioned, ensuring focus remains on fluid-based biomarkers.
    * LLM Training: Build context-based learning to differentiate between fluid-borne biomarkers and structural tissue analyses.
6. Exclude âBlood Pressureâ When Analyzing âBloodâ
    * Keyword Exclusion: Identify "blood" as relevant but exclude papers mentioning "blood pressure" (e.g., "blood pressure measurement" or "high blood pressure").
    * Contextual Filtering: Train the LLM to differentiate between "blood" used in biomarker sampling (e.g., "serum analysis") and circulatory assessments like "blood pressure."
    * Pattern Recognition: Address indirect references, such as "hypertension study" or "vascular health," by training the LLM with examples of these patterns for exclusion.
    * Confidence Scoring: Assign confidence levels for "blood" contexts. Automatically exclude high-confidence blood pressure-related papers and flag ambiguous cases for manual review.
    * This ensures focus on biomarker-related studies while excluding those centered on blood pressure or circulatory metrics.

"""

In [151]:
print(article_info["PMC7649343"]['summary'].keys())
prompt_1 = f"""
Here's a summary of an article that I want you to classify. Please respond with a yes or a no.

-------------------------------------------------------------------
Title
{article_info["PMC7649343"]['title']}
-------------------------------------------------------------------

Abstract
{article_info["PMC7649343"]['summary']['Abstract']}
-------------------------------------------------------------------

Method
{article_info["PMC7649343"]['summary']['Method']}
"""

dict_keys(['Abstract', 'Method'])


In [146]:
print(prompt_1)


Here's a summary of an article that I want you to classify. Please respond with a yes or a no.

-------------------------------------------------------------------
Title
Blood and cerebrospinal fluid neurofilament light differentially detect neurodegeneration in early Alzheimerâs disease
-------------------------------------------------------------------

Abstract
['Cerebrospinal fluid (CSF) neurofilament light (NfL) concentration has reproducibly been shown to reflect neurodegeneration in brain disorders, including Alzheimerâs disease (AD). NfL concentration in blood correlates with the corresponding CSF levels, but few studies have directly compared the reliability of these 2 markers in sporadic AD. Herein, we measured plasma and CSF concentrations of NfL in 478 cognitively unimpaired (CU) subjects, 227 patients with mild cognitive impairment, and 113 patients with AD dementia. We found that the concentration of NfL in CSF, but not in plasma, was increased in response to AÎ² pat

In [152]:
import requests

generate_path = "http://oceanus.cs.unlv.edu:11434/api/generate"
# generate_path = "http://127.0.0.1:11434/api/generate"
models = ["custom-llama3.2:latest","llama3.2:1b", "llama3.2:3b","phi3.5:3.8b", "llama3.1:70b"]

params = {
    "model": models[4],
    "system": "Do not summarize the articles. Classify them as being relevant or not.",
    # "template": custom_template,
    "system": custom_template,
    "prompt": prompt_1,
    # "prompt": "What's your knowledge cutoff date?",
    "stream": False
}

generate_response = requests.post(generate_path, json=params)
generate_response = generate_response.json()

In [119]:
generate_response.keys()

dict_keys(['model', 'created_at', 'response', 'done', 'done_reason', 'context', 'total_duration', 'load_duration', 'prompt_eval_count', 'prompt_eval_duration', 'eval_count', 'eval_duration'])

In [153]:
print(generate_response['response'])

This text appears to be a methods section from a research article in the field of neuroscience or neurology. The authors describe the procedures used to analyze cerebrospinal fluid (CSF) and plasma samples from humans and mice, as well as imaging techniques used to visualize amyloid plaques in mouse brains.

Here are the specific methods described:

1. Human sample collection: CSF and plasma samples were collected from human subjects participating in the BioFINDER study.
2. Mouse sample collection: CSF and serum samples were collected from 5ÃFAD mice, a transgenic model of Alzheimer's disease, and non-transgenic (non-tg) littermates.
3. ELISA assay: A sandwich ELISA method was used to measure neurofilament light chain (NfL) concentrations in human CSF and plasma samples.
4. Simoa NfL assay: A Simoa NfL assay was used to measure NfL concentrations in human plasma samples and mouse CSF and serum samples.
5. Euroimmun immunoassay: This assay was used to measure amyloid-Î² (AÎ²) 40, AÎ²42

In [89]:
print(generate_response['response'])

This is a research paper that appears to be about the detection of Alzheimer's disease (AD) biomarkers, specifically Amyloid Beta (AÎ²), Tau protein, and N-Fold (NfL) in cerebrospinal fluid (CSF) and plasma from mice.

Here are some key points that summarize the main findings:

**Biomarkers:**

1. The authors measured CSF and plasma AÎ²40, AÎ²42, phosphorylated tau, and NfL concentrations using various methods.
2. They found correlations between these biomarkers and age, sex, and cognitive decline in mice.

**Age and Biomarker Correlations:**

1. The authors examined if there were age-related changes in CSF and plasma biomarker levels.
2. Their results showed that AÎ²40, AÎ²42, phosphorylated tau, and NfL concentrations increased with age in 5ÃFAD (a mouse model of AD) mice.

**Relationship between Biomarkers:**

1. The authors investigated the relationship between AÎ²40, AÎ²42, phosphorylated tau, and NfL concentrations.
2. They found that there was a positive correlation between CSF

In [139]:
# show_model = "http://oceanus.cs.unlv.edu:11434/api/show"
show_model = "http://127.0.0.1:11434/api/show"
params = {
    # "name": "llama3.2:1b",
    # "name": "phi3.5:3.8b"
    "name": "custom-llama3.2:latest"
}

res = requests.post(show_model, json=params)
model_info = res.json()
# model_info

In [140]:
print(model_info['template'])

<|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023

You are going to be performing classifications on research articles regarding Alzheimerâs disease. Below are the guidelines on how 
to judge an article as being relevant or not. Please return a single Yes or No in your response.

## 1. Papers Must Be Original Research Articles
    * Metadata Filtering: Use metadata to identify papers labeled "original research" and exclude reviews, perspectives, posters, or preprints.
    * Keyword Identification: Scan sections for phrases like "data were collected" or "we conducted" to confirm original research. Flag terms like "this review explores" or "we summarize" for exclusion.
    * Preprint/Poster Exclusion: Exclude papers from platforms like "arXiv," "bioRxiv," or those labeled "poster presented."
    * Automated Categorization: Use metadata and text analysis to classify papers. Only include those strongly aligned with "original research."
## 2. Papers must ha

In [136]:
print(model_info['template'])

custom_template_1.txt


In [None]:
!pip install ollama

In [2]:
from ollama import Client
client = Client(host = 'http://oceanus.cs.unlv.edu:11434')
response = client.chat(model="llama3.2:1b", messages=[{'role':'user', 'content': 'What is the definition of a random variable?',},])

In [143]:
!pip install openai

9978.64s - pydevd: Sending message related to process being replaced timed-out after 5 seconds


Collecting openai
  Downloading openai-1.54.4-py3-none-any.whl (389 kB)
[2K     [90mâââââââââââââââââââââââââââââââââââââââ[0m [32m389.6/389.6 KB[0m [31m2.3 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
Collecting tqdm>4
  Downloading tqdm-4.67.0-py3-none-any.whl (78 kB)
[2K     [90mââââââââââââââââââââââââââââââââââââââââ[0m [32m78.6/78.6 KB[0m [31m7.5 MB/s[0m eta [36m0:00:00[0m
Collecting pydantic<3,>=1.9.0
  Downloading pydantic-2.9.2-py3-none-any.whl (434 kB)
[2K     [90mââââââââââââââââââââââââââââââââââââââ[0m [32m434.9/434.9 KB[0m [31m14.4 MB/s[0m eta [36m0:00:00[0m
Collecting distro<2,>=1.7.0
  Downloading distro-1.9.0-py3-none-any.whl (20 kB)
Collecting jiter<1,>=0.4.0
  Downloading jiter-0.7.1-cp310-cp310-manylinux_2_17_

In [None]:
import openai as openAI 
import os

client = openAI()
response = client.chat.completions.create({
    messages: [{ role: 'system', content: ''}] 
})
client = openAI.api_key(os.getenv("OPENAI_API_KEY"))
print(client)

ModuleNotFoundError: No module named 'OpenAI'