# Paper Filtering Criterias
1. Paper must be original research article and (not a review, poster or preprint) ---**Programatically Solvable**---
  - Include Publication Type Field in Search Query

2. Paper must have an AD focus.This means: ---**Optimized Search Query + LLM Task**---
  - Either having a population of AD patients or
  - Looking at Alzheimer disease specific biomarkers  


3. Human sample size must be over 50 ---**LLM Task**---
  - Extract *Abstract* and *Methods* sections from calls to BioC API
  - Feed extracted data into LLM that has been prompt engineered to search and filter papers based on the desired criterias.


4. Must be looking at a protein. ---**Optimized Search Query + LLM Task**---
  - Amyloid β
  - Tau
  - Amyloid Precursor Protein
  - Presenilin-1 and Presenilin-2
  - Apolipoprotein E (ApoE)




5. Fluid samples like CSF, blood from animal models? ---**Optimized Search Query + LLM Task**---
  - How is this different from clinical models?


6. "Blood" vs "Blood Pressure" ---**Optimized Search Query + LLM Task**---
  - How is blood being used in the paper? Is the paper using blood pressure as a biomarker or are actual blood samples being taken for biomarkers.


7. Papers from 2024 and onward. ---**Optimized Search Query**---


## LLM Task

One way that we can have the LLM do filtering is by extracting the abstract and maybe methodology sections through PubMed's BioC API.
Therefore we don't have to worry about going through the downloaded articles, unzip their parent folders and parse through the PDF, we could just get the Abstract and Methodolgies returned to us in plain text format and feed it into the LLM.

I'm able to get those fields returned from the API for several articles so this seems promising.

# Optimizing Search Queries
**All the queries we're making are getting translated by PubMed into a different string that tries to capture our original query string that is then used to query the database.** Because of this papers that might not be relevant to the researchers is being returned.

### Solutions?
**Is there any way we can have more control over the translated string?**


- Using MeSH Headings in the Query | [A description of the MeSH hierarchy](https://www.nlm.nih.gov/mesh/intro_trees.html)

  - Automatic Explosion causes MeSH terms that are not included in the query, both original and translated, to be included in the search results. This can broaden the returned papers which might not be relevant. So a solution to this is to either select MeSH terms that when exploded still included terms that are relevant to us or by writing our query with terms that won't cause explosion, *i.e.* leaf terms that don't have any children- refer to picture below for an example.

- Using MeSH **subheadings** in the Query

  - Subheadings act as qualifiers for MeSH Headings [List of MeSH qualifiers](https://www.nlm.nih.gov/mesh/subhierarchy.html)
  - Like MeSH headings, subheadings also explode

- Limiting the use of Automatic Term Mapping?
  - PubMed's search engine maps entry terms, **i.e.** terms with the field ["All Fields"], to a common MeSH heading depending on their similarity.
  - We can limit this by using **Search Field** tags


#Recording Relevant MeSH Headings

- Chemicals and Drugs Category
  - Macromolecular Substances
  - Amino Acids, Peptides, and Proteins
  - Biological Factors
    - Biomarkers
    - Blood Coagulation Factor Inhibitors
    - Blood Coagulation Factors
- Diseases Category
  - Nervous System Diseases
- Anatomy Category
  - Nervous System
  - Cells
- Psychiatry and Psychology Category
  - Psychological Phenomena
  - Mental Disroders
- Pharmacological Actions Category

# Ideas

- How crazy would it be to have the LLM generate the query string itslef?

# Filtering papers using Ollama at Oceanus

In [1]:
!pip install ollama
# !pip install tiktoken

Collecting ollama
  Using cached ollama-0.3.3-py3-none-any.whl (10 kB)
Installing collected packages: ollama
Successfully installed ollama-0.3.3


In [2]:
import requests
import json
from google.colab import drive

drive.mount('/content/drive')
base_path = '/content/drive/MyDrive/Research'

ModuleNotFoundError: No module named 'requests'

In [None]:
pmc_id_list = []
with open(f'{base_path}/Code/uid_list3.txt', 'r') as f:
    line = f.read()
    pmc_id_list = line.split('\n')

In [None]:
!pip install tiktoken

# Getting a tokenizer library to count how many tokens are in an article

In [None]:
import tiktoken

# Getting the tokenizer for gpt-40
# tokenizer = tiktoken.get_encoding("cl100k_base")

# or using tiktoken.encoding_for_model
tokenizer = tiktoken.encoding_for_model('gpt-4o')


## Using BioC API to return full articles from PMC in JSON format

In [None]:
import json

not_found_counter = 0
full_texts = []
over_context_window = 0
for id in pmc_id_list:
        response = requests.get(f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/PMC{id}/unicode")
        # print(response, id)
        if response.text == f'No record can be found for the input: pmc{id}':
            not_found_counter += 1
            print(f"No record found for article {id}.\Didn't find {not_found_counter} articles")
            continue
        else:
            try:
                data = json.loads(response.text[1:-1].strip())
                # print(data['documents'][0]['passages'][1:][1]['infons']['section_type'])
                total_len = 0
                total_len_tokenized = 0
                for text in data['documents'][0]['passages'][:]:
                    if text != None:
                        # print(text['text'])
                        total_len += len(text['text'])
                        # total_len_tokenized += len(tokenizer.encode(text['text']))
                        # print(tokenizer.encode(text['text']))
                # if data['documents'][0]['passages'][1:][1]['infons']['section_type'] == 'ABSTRACT' or data['documents'][0]['passages'][1:][1]['infons']['section_type'] == 'METHOD':
                print(total_len)
                # break
                # print(total_len_tokenized)
                if total_len_tokenized > 120000:
                     over_context_window += 1
                     print(total_len_tokenized, ' ', id, ' --- ', over_context_window)
                # break
            except json.JSONDecodeError as e:
                 print("Json error for string: ", response.text)

# Experimenting with Ollama for filtering

* Design prompt for filtering
* Pick Models
* Test classification


### Oceanus endpoint for Ollama
* https://arnav.cs.unlv.edu/11434

#### Generate method
* POST /api/generate

In [6]:
import requests

path = ""

params = {
    "model": "llama3.2:1b",
    "prompt": "What's the definition of a random variable?",
    # "stream": "false",
    "format": "json"
}

res = requests.post(path, json=params)

In [18]:
!pip install ollama

Collecting ollama
  Downloading ollama-0.3.3-py3-none-any.whl.metadata (3.8 kB)
Downloading ollama-0.3.3-py3-none-any.whl (10 kB)
Installing collected packages: ollama
Successfully installed ollama-0.3.3


In [10]:
from ollama import Client
client = Client(host='')
# response = client.chat(model="llama3.2:1b", messages=[{'role':'user', 'content': 'What is the definition of a random variable?',},])

In [12]:
response = client.chat(model="llama3.2:1b", messages=[{'role':'user', 'content': 'What is the definition of a random variable?',},])

ConnectError: [Errno -2] Name or service not known

In [4]:
list_models_path = "http://oceanus.cs.unlv.edu:11434/api/tags"
res = requests.get(path)

ConnectionError: HTTPConnectionPool(host='oceanus.cs.unlv.edu', port=11434): Max retries exceeded with url: /api/generate (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7e8f7533e140>: Failed to resolve 'oceanus.cs.unlv.edu' ([Errno -2] Name or service not known)"))