# Paper Filtering Criteria
1. Paper must be an original research article (not a review, poster, or preprint) ---**Programmatically Solvable**---
  - Include Publication Type Field in Search Query

2. Paper must have an AD focus. This means: ---**Optimized Search Query + LLM Task**---
  - Either having a population of AD patients, or
  - Looking at Alzheimer's disease-specific biomarkers


3. Human sample size must be over 50 ---**LLM Task**---
  - Extract the *Abstract* and *Methods* sections via calls to the BioC API
  - Feed the extracted text into an LLM that has been prompt-engineered to filter papers against the desired criteria.


4. Must be looking at a protein. ---**Optimized Search Query + LLM Task**---
  - Amyloid β
  - Tau
  - Amyloid Precursor Protein
  - Presenilin-1 and Presenilin-2
  - Apolipoprotein E (ApoE)




5. Fluid samples like CSF, blood from animal models? ---**Optimized Search Query + LLM Task**---
  - How is this different from clinical models?


6. "Blood" vs "Blood Pressure" ---**Optimized Search Query + LLM Task**---
  - How is blood being used in the paper? Is the paper using blood pressure as a biomarker, or are actual blood samples being taken for biomarkers?


7. Papers from 2024 and onward. ---**Optimized Search Query**---
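
Criteria 1 and 7 can be pushed into the search query itself. Below is a minimal sketch, assuming PubMed's `[pt]` (publication type) and `[dp]` (date of publication) field tags. The helper name `build_query` and the base term are illustrative only, and the same tags may behave differently when searching the `pmc` database.

```python
def build_query(base_term, excluded_types=("Review", "Preprint"), start_year=2024):
    """Combine a base term with publication-type exclusions and a start year.

    Hypothetical helper: it only assembles a PubMed-style query string.
    """
    parts = [f"({base_term})"]
    if excluded_types:
        # [pt] restricts each term to the Publication Type field.
        excluded = " OR ".join(f'"{t}"[pt]' for t in excluded_types)
        parts.append(f"NOT ({excluded})")
    query = " ".join(parts)
    # 3000 is a far-future year used as an open-ended upper bound.
    return f"({query}) AND {start_year}:3000[dp]"


query = build_query('"Alzheimer Disease"[mh]')
print(query)
```

Posters are not excluded here because it is not clear which publication type, if any, covers them; that part probably still needs the LLM step.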


## LLM Task

One way to have the LLM do the filtering is to extract the abstract and, possibly, the methods section through PubMed's BioC API.
That way we don't have to go through the downloaded articles, unzip their parent folders, and parse the PDFs; we can get the Abstract and Methods sections returned as plain text and feed them into the LLM.

I was able to get those fields returned from the API for several articles, so this seems promising.
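
As a rough sketch of the last step, the plain-text sections can be assembled into a single prompt. The helper `build_question` is hypothetical, and the strings passed to it below are placeholders, not real article data.

```python
def build_question(title, abstract, method):
    """Assemble the title, abstract and methods text into one classification prompt."""
    divider = "-" * 67
    return "\n".join([
        "Here's a summary of an article that I want you to classify. "
        "Please respond with a yes or a no.",
        "",
        divider, "Title", title, divider,
        "",
        "Abstract", abstract, divider,
        "",
        "Method", method,
        "",
    ])


example = build_question("Example title", "Example abstract.", "Example methods.")
print(example)
```

Keeping the layout in one function means every article is presented to the model in the same way, which makes the classifications easier to compare.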

# Optimizing Search Queries
**Every query we make is translated by PubMed into a different string that tries to capture our original query, and that translated string is what is actually used to query the database.** Because of this, papers that might not be relevant to the researchers are being returned.
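
The translated string can be inspected directly, which makes it easier to see where irrelevant terms creep in. This is a minimal sketch, assuming the ESearch JSON response carries the translation under `esearchresult.querytranslation`. The helper names `extract_translation` and `fetch_translation` are hypothetical.

```python
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def extract_translation(esearch_json):
    """Return the translated query from a parsed ESearch JSON response, or None."""
    return esearch_json.get("esearchresult", {}).get("querytranslation")


def fetch_translation(term, db="pubmed"):
    """Run an ESearch that returns no IDs and keep only the translated query."""
    # Imported here so the parsing helper above has no third-party dependency.
    import requests

    params = {"db": db, "term": term, "retmode": "json", "retmax": 0}
    response = requests.get(ESEARCH_URL, params=params, timeout=30)
    response.raise_for_status()
    return extract_translation(response.json())
```

Calling `fetch_translation("alzheimer blood biomarkers")` and comparing the result with the original term should show which MeSH headings and fields PubMed added.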

### Solutions?
**Is there any way we can have more control over the translated string?**


- Using MeSH Headings in the Query | [A description of the MeSH hierarchy](https://www.nlm.nih.gov/mesh/intro_trees.html)

  - Automatic explosion causes MeSH terms that appear in neither the original nor the translated query to be included in the search. This broadens the results with papers that might not be relevant. One solution is to select MeSH terms whose exploded descendants are all still relevant to us; another is to write the query with terms that cannot explode, *i.e.* leaf terms that have no children (refer to the picture below for an example).

- Using MeSH **subheadings** in the Query

  - Subheadings act as qualifiers for MeSH Headings [List of MeSH qualifiers](https://www.nlm.nih.gov/mesh/subhierarchy.html)
  - Like MeSH headings, subheadings also explode

- Limiting the use of Automatic Term Mapping?
  - PubMed's search engine maps entry terms, *i.e.* terms with the field ["All Fields"], to a common MeSH heading depending on their similarity.
  - We can limit this by using **Search Field** tags
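
The three options above can be sketched as small query-building helpers. This assumes PubMed's `[mh]` tag for MeSH headings, `[mh:noexp]` to turn explosion off, the `heading/subheading` form for qualifiers, and `[tiab]` as an example search field tag. The helper names `mesh` and `text_word` are hypothetical.

```python
def mesh(term, subheading=None, explode=True):
    """Format a MeSH heading, optionally with a subheading and without explosion."""
    heading = f"{term}/{subheading}" if subheading else term
    tag = "mh" if explode else "mh:noexp"
    return f'"{heading}"[{tag}]'


def text_word(phrase):
    """Tag a phrase with the title/abstract field so it is searched as typed."""
    return f'"{phrase}"[tiab]'


print(mesh("Alzheimer Disease"))
print(mesh("Alzheimer Disease", explode=False))
print(mesh("Alzheimer Disease", subheading="blood"))
print(text_word("amyloid beta"))
```

Whether a tagged query actually narrows the results is best checked against the translated string PubMed reports for it.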


# Recording Relevant MeSH Headings

- Chemicals and Drugs Category
  - Macromolecular Substances
  - Amino Acids, Peptides, and Proteins
  - Biological Factors
    - Biomarkers
    - Blood Coagulation Factor Inhibitors
    - Blood Coagulation Factors
- Diseases Category
  - Nervous System Diseases
- Anatomy Category
  - Nervous System
  - Cells
- Psychiatry and Psychology Category
  - Psychological Phenomena
  - Mental Disorders
- Pharmacological Actions Category

# Ideas

- How crazy would it be to have the LLM generate the query string itself?

# Filtering papers using Ollama at Oceanus

In [2]:
!pip install ollama
!pip install tiktoken



## Using BioC API to return full articles from PMC in JSON format

# Experimenting with Ollama for filtering

* Design prompt for filtering
* Pick Models
* Test classification
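
For the "test classification" step, model replies need to be reduced to a label before they can be compared with the known classes. This is a minimal sketch, assuming the first standalone "yes" or "no" in a reply is the verdict. The helpers `parse_verdict` and `accuracy` are hypothetical.

```python
import re


def parse_verdict(response_text):
    """Return 1 for yes, 0 for no, or None when no verdict is found.

    Assumption: the first standalone "yes" or "no" in the reply is the verdict.
    """
    match = re.search(r"\b(yes|no)\b", response_text, flags=re.IGNORECASE)
    if match is None:
        return None
    return 1 if match.group(1).lower() == "yes" else 0


def accuracy(labels, responses):
    """Fraction of responses whose parsed verdict equals the 0/1 label."""
    if not labels:
        return 0.0
    hits = sum(parse_verdict(r) == y for y, r in zip(labels, responses))
    return hits / len(labels)
```

This heuristic misreads replies where "no" appears in the explanation before the verdict, so verbose replies are worth spot-checking by hand.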

## Retrieving Papers
* Extract title or article using OCR or just type it in
* Request BioC API to retrieve full article from PubMed Central

In [3]:
from collections import defaultdict

article_info = defaultdict(str)

In [4]:
import requests
import json
import time

# Map each PMC ID to its label (1 = valid, 0 = invalid) and title.
# Each ID file is read in step with the matching titles file.
with open("valid.txt", 'r') as f1, open('valid_articles.txt', 'r') as f2:
    for pmc_id, title in zip(f1, f2):
        article_info[pmc_id.strip()] = {'class': 1, "title": title.strip()}
with open("invalid.txt", 'r') as f1, open('invalid_articles.txt', 'r') as f2:
    for pmc_id, title in zip(f1, f2):
        article_info[pmc_id.strip()] = {'class': 0, "title": title.strip()}


In [5]:
# from urllib.parse import quote
# def search_article_ids(query, max_articles_per_query, api_key = "5209f5d2ecb2ce377d8c20c5b5a08fd46f09"):
#     """
#     Search articles matching the query, return their ID
#     """
#     search_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
#     all_article_ids = []


#     for start in range(0, max_articles_per_query):
#         params = {
#             "db": "pmc",
#             "term": f'{query}',
#             "retmode": "json",
#             "api_key": api_key,
#             "retstart": start,
#             "retmax": 1
#         }
#         response = requests.get(search_url, params=params)
#         try:
#           if response.status_code == 200:
#               data = response.json()
#               article_ids = data["esearchresult"]["idlist"]
#               if not article_ids:
#                   break  # No more articles found
#               all_article_ids.extend(article_ids)
#               time.sleep(.5)  # To respect PubMed's rate limit
#           else:
#             # While response status code is 500 then increase wait time and retry
#               print(f"Failed to search article IDs at {start}: {response.status_code}\n{query}\nTrying again")
#               time.sleep(3)
#               while response.status_code == 500:
#                 params = {
#                   "db": "pmc",
#                   "term": query,
#                   "retmode": "json",
#                   "api_key": api_key,
#                   "retstart": start,
#                   "retmax": 100
#                 }
#                 response = requests.get(search_url, params=params)
#                 if response.status_code == 200:
#                   data = response.json()
#                   article_ids = data["esearchresult"]["idlist"]
#                 if not article_ids:
#                   print(f"No more articles found at {start}")
#                   break  # No more articles found
#                 all_article_ids.extend(article_ids)
#           all_article_ids.extend(article_ids)
#         except KeyError as e:
#           print(f"Key Error found at {start}")
#           continue
#         except json.decoder.JSONDecodeError as e:
#           print(f"JSON Decode Error found at {start}")
#           continue
#               # break
#     return all_article_ids

# for title in titles:
#     title['pmc_id'] = search_article_ids(title['title'], 1)[0]


In [6]:
valid_ids = []
invalid_ids = []

with open('valid.txt', 'r') as f:
    valid_ids = [line.strip() for line in f]

with open('invalid.txt', 'r') as f:
    invalid_ids = [line.strip() for line in f]

In [7]:
def fetch_full_text(article_id):
    """Fetch an article from the BioC API and return its abstract and methods text."""
    response = requests.get(f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{article_id}/unicode")

    summary = {
        'Abstract': [],
        'Method': []
    }
    # Map BioC section types to the keys used in the summary.
    section_keys = {'ABSTRACT': 'Abstract', 'METHODS': 'Method'}

    if response.text == f'No record can be found for the input: pmc{article_id[3:]}':
        print(f"No record found for article {article_id}.")

    else:
        try:
            # Drop the first and last characters of the payload before parsing.
            data = json.loads(response.text[1:-1])
        except json.JSONDecodeError as e:
            # Retry, skipping everything up to just past the reported error column.
            data = json.loads(response.text[e.colno+2:-1])
        for passage in data['documents'][0]['passages']:
            key = section_keys.get(passage['infons']['section_type'])
            if key:
                summary[key].append(passage['text'])

    summary['Abstract'] = '\n'.join(summary['Abstract'])
    summary['Method'] = '\n'.join(summary['Method'])
    return summary

In [8]:
for id in valid_ids:
    try:
        article_info[id]['summary'] = fetch_full_text(id)
    except TypeError:
        print(article_info[id])
    
for id in invalid_ids:
    if id == '---' :
        continue
    try:
        article_info[id]['summary'] = fetch_full_text(id)
    except TypeError:
        print(article_info[id])




No record found for article PMC10612412.

No record found for article PMC11065015.


KeyboardInterrupt: 

In [None]:
print(article_info['PMC10047620'])

In [None]:
list_models_path = "http://oceanus.cs.unlv.edu:11434/api/tags"
res = requests.get(list_models_path)
res.json()

In [None]:
custom_template = """
You are going to be performing classifications on research articles regarding Alzheimer’s disease. Below are the rules on how 
to judge an article as being relevant or not. Please return a single Yes or No in your response.

## 1. Papers Must Be Original Research Articles
    * Metadata Filtering: Use metadata to identify papers labeled "original research" and exclude reviews, perspectives, posters, or preprints.
    * Keyword Identification: Scan sections for phrases like "data were collected" or "we conducted" to confirm original research.
## 2. Papers must have an AD focus, including a population of AD patients (at risk/MCI/AD) and/or looking at Alzheimer’s disease specific biomarkers (amyloid) (Many papers investigating neurodegenerative diseases will mention Alzheimer’s, even if it is not the focus)  
    * Criteria for Selection:
        * AD-Focused: Include papers explicitly studying Alzheimer’s disease (AD) topics like diagnosis, treatment, biomarkers (e.g., amyloid), or pathology.
        * AD Patients: Papers involving AD populations (at risk, MCI, or diagnosed) are relevant, even if AD is not the central focus.
        * Subset Context: Unless biomarkers or pathology are specific to AD, exclude papers focusing broadly on neurodegeneration with AD patients.
        * Biomarker Specificity: Include studies addressing AD biomarkers (e.g., amyloid, tau). Exclude them if they contain general neurodegeneration markers.
## 3.  Human Sample Size Must Be Over 50
    * Criteria for Inclusion:
        * Stated Sample Size: Include papers explicitly reporting a sample size of 50+ for AD patients (at risk, MCI, or diagnosed).
        * Missing Information: Exclude papers without specific sample size details unless other critical criteria (e.g., strong AD focus or biomarker analysis) are met.
## 4.  Must be looking at a protein (no genes, transcripts, or fragments)
    * Keyword Filtering: Use terms like "protein," "amyloid," "tau," or specific AD-related proteins (e.g., "beta-amyloid") to identify relevant studies. Exclude papers mentioning "gene," "RNA," "transcription," or "fragment" as indicators of a non-protein focus.
## 5. Include Fluids from Non-Clinical Models (Exclude Tissue Samples)
    * Fluid Criteria: Focus on animal studies using fluids like cerebrospinal fluid (CSF), blood, serum, or plasma. These fluids often contain biomarkers relevant to AD research.
    * Exclusion of Tissue Samples: Exclude studies involving tissue samples (e.g., brain slices, biopsy samples) using keywords like "tissue," "histology," or "brain slice."
## 6. Exclude “Blood Pressure” When Analyzing “Blood”
    * Keyword Exclusion: Identify "blood" as a relevant biomarker but exclude papers mentioning "blood pressure" (e.g., "blood pressure measurement" or "high blood pressure").
    * Contextual Filtering: Differentiate between "blood" used in biomarker sampling (e.g., "serum analysis") and circulatory assessments like "blood pressure."
    * Pattern Recognition: Exclude studies such as "hypertension study" or "vascular health,".
"""

In [None]:
question = f"""
Here's a summary of an article that I want you to classify. Please respond with a yes or a no.

-------------------------------------------------------------------
Title
{article_info["PMC10047620"]['title']}
-------------------------------------------------------------------

Abstract
{article_info["PMC10047620"]['summary']['Abstract']}
-------------------------------------------------------------------

Method
{article_info["PMC10047620"]['summary']['Method']}

"""

In [None]:
question2 = f"""
Here's a summary of an article that I want you to classify. Please respond with a yes or a no.

-------------------------------------------------------------------
Title
{article_info["PMC7649343"]['title']}
-------------------------------------------------------------------

Abstract
{article_info["PMC7649343"]['summary']['Abstract']}
-------------------------------------------------------------------

Method
{article_info["PMC7649343"]['summary']['Method']}

"""

In [None]:
print(article_info["PMC10047620"]['summary']['Method'])

In [None]:
with open('temp_art.txt', 'w') as f:
    f.write(article_info["PMC7649343"]['title'])
    f.write('\n-------------------\n')
    # 'Abstract' and 'Method' are already newline-joined strings (see fetch_full_text),
    # so they are written directly; joining them again would split every character.
    f.write(article_info["PMC7649343"]['summary']['Abstract'])
    f.write('\n-------------------\n')
    f.write(article_info["PMC7649343"]['summary']['Method'])


In [None]:
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')
print(len(tokenizer.encode(custom_template)))

1223


In [None]:
print(len(tokenizer.encode(prompt_1)))

4215


In [None]:
import requests

generate_path = "http://oceanus.cs.unlv.edu:11434/api/generate"
# generate_path = "http://127.0.0.1:11434/api/generate"
models = ["custom-llama3.2:latest","llama3.2:1b", "llama3.2:3b","phi3.5:3.8b", "llama3.1:70b", "medllama2:7b", "mistral:7b"]

params = {
    "model": models[3],
    # "system": custom_template,
    "prompt": question + custom_template ,
    "stream": False
}

generate_response = requests.post(generate_path, json=params)
generate_response = generate_response.json()

In [None]:
print(generate_response['response'])

Yes


In [None]:
# show_model = "http://oceanus.cs.unlv.edu:11434/api/show"
show_model = "http://127.0.0.1:11434/api/show"
params = {
    # "name": "llama3.2:1b",
    # "name": "phi3.5:3.8b"
    "name": "custom-llama3.2:latest"
}

res = requests.post(show_model, json=params)
model_info = res.json()
# model_info

In [None]:
print(model_info['template'])

In [1]:
from openai import OpenAI 
import os

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
response = client.chat.completions.create(
    model = "gpt-4o-mini",
    messages = [
        { "role": 'system', "content": f'{custom_template}'},
        { 
            "role": 'user',
            "content": f"{question}"
        }
    ] 
)

NameError: name 'custom_template' is not defined

In [20]:
response

ChatCompletion(id='chatcmpl-AVqexlVLYzeGOSxujkXNcaGvctjLG', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='No', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1732153955, model='gpt-4o-mini-2024-07-18', object='chat.completion', service_tier=None, system_fingerprint='fp_0705bf87c0', usage=CompletionUsage(completion_tokens=1, prompt_tokens=3541, total_tokens=3542, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))

In [10]:
def validate_res(model_res):
    """Return the cosine similarity between a model response and the reference "No it is not"."""
    from sklearn.metrics.pairwise import cosine_similarity
    import requests
    import numpy as np

    embedding_path = "http://oceanus.cs.unlv.edu:11434/api/embeddings"

    def embed(text):
        # Request an embedding for the text from the Ollama embeddings endpoint.
        params = {
            'model': "mxbai-embed-large",
            'prompt': text
        }
        response = requests.post(embedding_path, json=params)
        return np.array(response.json()['embedding'])

    emb1 = embed(f"{model_res}")
    emb2 = embed("No it is not")

    similarity = cosine_similarity([emb1], [emb2])
    return similarity[0][0]

In [12]:
import json
with open('./classification_results/classification_results_llama3.2:1b.json', 'r') as f:
    articles_classified = json.loads(f.read())

for article in articles_classified:
    try:
        print('-------------------------------------')
        print(articles_classified[article]['class'])
        print(articles_classified[article]["ollama_model_response"])
        print(articles_classified[article]["openai_response"])
        # similarity_score = validate_res(articles_classified[article]["ollama_model_response"])
        # similarity_score = validate_res(articles_classified[article]["openai_response"])
        # print(similarity_score)
    except KeyError:
        print(article)

-------------------------------------
1
## Step 1: Determine the relevance of articles to the problem
The problem requires judging research articles regarding Alzheimer's disease, focusing on original research papers that investigate AD biomarkers and patients (at risk/MCI/AD), with an emphasis on neurodegenerative diseases, specifically amyloid.

## Step 2: Identify criteria for selecting relevant articles
To be considered relevant, the article must meet all of the following criteria:
1. Articles are original research.
2. The focus is on Alzheimer's disease (AD) and/or AD biomarkers (amyloid).
3. The population studied includes at-risk, MCI, or diagnosed individuals with AD.
4. The article discusses specific aspects of Alzheimer's such as diagnostic, treatment, biomarker analysis, or pathology.

## Step 3: Analyze the criteria for selecting relevant articles
Given the strict criteria, we can determine which articles are likely to be relevant:

- Original research papers meeting all sp

In [4]:
import json
with open('./classification_results/classification_results_llama3.1:70b.json', 'r') as f:
    articles_classified = json.loads(f.read())

for article in articles_classified:
    try:
        print(articles_classified[article]["ollama_model_response"])
        print(articles_classified[article]["openai_response"])
        print('-------------------------------------')
    except KeyError:
        print(article)

Based on the provided text, I would classify this article as:

**Yes**

Reasoning:
* The article appears to be an original research article, as indicated by the use of phrases like "we analyzed."
* The focus is clearly on Alzheimer's disease (AD), with specific attention to AD biomarkers and pathology.
* Human sample sizes are not explicitly mentioned in the provided text; however, given the context, it seems reasonable to assume that human samples were used, even if the exact number isn't stated. In this case, I'll err on the side of caution and consider other factors more heavily.
* The study involves analyzing proteins (NfL) rather than genes or transcripts.
* Fluids from both clinical (human) and non-clinical models (5×FAD mice) are analyzed, which fits within the criteria.
* "Blood" is used in a biomarker context ("serum analysis") rather than referring to blood pressure.
Yes
-------------------------------------
Please provide the text of a research article regarding Alzheimer's 

In [63]:
import json
with open('./classification_results/classification_results_phi3.5:3.8b.json', 'r') as f:
    articles_classified = json.loads(f.read())

for article in articles_classified:
    try:
        print(articles_classified[article]["ollama_model_response"])
        print(articles_classified[article]["openai_response"])
        print('-------------------------------------')
    except KeyError:
        print(article)

No, the research article does not meet all of the specified criteria for classification. Here's why based on each point in your provided rules:

1. The paper type is unclear from this abstract alone – we can’t tell if it’s an original research study or a review/perspective piece just by reading the summary, hence I cannot confirm its eligibility without additional information about whether "data were collected" suggests that indeed it's primary data being discussed.
2. The paper does seem to focus on Alzheimer’s disease-specific biomarkers (NfL), which is relevant but there are no explicit mentions of amyloid, and the population mentioned concerns multiple groups without specifying if they include AD patients specifically at risk or with MCI/AD diagnosis.
3. There's a mention that comparisons were made using Mann-Whitney U tests suggesting some sample sizes might be involved; however, there are no specific numbers provided regarding human subjects (over 50 is unclear). Without this exp

In [13]:
import json
with open('./classification_results/classification_results_phi3.5:3.8b.json', 'r') as f:
    articles_classified = json.loads(f.read())
    for article in articles_classified:
        if articles_classified[article]['class'] == 1:
            print(articles_classified[article]['summary'])
            break

{'Abstract': 'Cerebrospinal fluid (CSF) neurofilament light (NfL) concentration has reproducibly been shown to reflect neurodegeneration in brain disorders, including Alzheimer’s disease (AD). NfL concentration in blood correlates with the corresponding CSF levels, but few studies have directly compared the reliability of these 2 markers in sporadic AD. Herein, we measured plasma and CSF concentrations of NfL in 478 cognitively unimpaired (CU) subjects, 227 patients with mild cognitive impairment, and 113 patients with AD dementia. We found that the concentration of NfL in CSF, but not in plasma, was increased in response to Aβ pathology in CU subjects. Both CSF and plasma NfL concentrations were increased in patients with mild cognitive impairment and AD dementia. Furthermore, only NfL in CSF was associated with reduced white matter microstructure in CU subjects. Finally, in a transgenic mouse model of AD, CSF NfL increased before serum NfL in response to the development of Aβ patholo