<a href="https://colab.research.google.com/github/GaryM02/fyp_repo/blob/main/Colab%20Notebooks/fyp_model_exp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# connect drive
from google.colab import drive
import os
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
os.chdir('/content/drive/My Drive/Colab Notebooks/PredictiveAnalytics')

## Retrieval Pipeline (Putting It All Together)
1. Embed Articles: Generate embeddings for the input article and your source collection using a pretrained transformer.
2. Initial Similarity Search: Use cosine similarity to retrieve top candidate articles from the source collection based on embeddings.
3. Topic Matching: Filter the candidate set based on topical similarity to ensure the retrieved articles match the theme of the input.
4. Entity and Keyword Matching: Apply additional filtering based on entities and keywords to focus on articles with specific content overlap.
5. Credibility Scoring: If required, use anomaly detection to prioritize articles with higher credibility or alignment with known reliable sources.

# 1. Embed Input Article

The embeddings stored in the parquet file were generated with `sentence-transformers/all-MiniLM-L6-v2`, so the input article is embedded with the same model to keep the vectors comparable.

In [None]:
import pandas as pd
import pyarrow as pa
import torch
from transformers import AutoTokenizer, AutoModel
import gc

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
import pprint

In [None]:
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

In [None]:
def process_input(user_input, model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """
    Generates a text embedding for the abstract of an input article.
    Args:
        user_input (dict): The input article; must contain an "abstract" key with the text to embed.
        model_name (str, optional): The name of the pre-trained model to use.
                                     Defaults to "sentence-transformers/all-MiniLM-L6-v2".
    Returns:
        numpy.ndarray: The embedding as a NumPy array of shape (1, hidden_size).
    """
    # Reuse the tokenizer and model loaded earlier; load them only if they are missing
    global tokenizer, model
    if globals().get("tokenizer") is None or globals().get("model") is None:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModel.from_pretrained(model_name)
        model = model.to("cuda" if torch.cuda.is_available() else "cpu")
    # Always switch to inference mode, including for a model loaded in an earlier cell
    model.eval()

    inputs = tokenizer(user_input["abstract"], padding=True, truncation=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Mean-pool over token positions; a single text has no padding, so every position is a real token
        embeddings = model(**inputs).last_hidden_state.mean(dim=1).cpu().numpy()

    return embeddings
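
`process_input` averages `last_hidden_state` over every token position. That is fine for a single abstract, because no padding is added. If several abstracts were embedded in one batch, the padded positions would be averaged in as well. The following is a minimal sketch of mask-weighted mean pooling, using NumPy arrays in place of the model output; the function name `masked_mean_pool` and the toy arrays are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: (batch, seq_len, hidden) array
    attention_mask:   (batch, seq_len) array with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against division by zero
    return summed / counts

# Toy batch (hypothetical values): the second sequence has one padded position
tokens = np.array([
    [[1.0, 1.0], [3.0, 3.0]],
    [[2.0, 4.0], [100.0, 100.0]],  # last position is padding
])
mask = np.array([[1, 1], [1, 0]])

pooled = masked_mean_pool(tokens, mask)
```

With a real model, `token_embeddings` would come from `last_hidden_state` and `attention_mask` from the tokenizer output.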

In [None]:
user_input = {"title": "Novel Treatment for Alzheimer's Disease", "abstract": "A significant influence of age on CBF and metabolism in patients with dementia was not found."}

In [None]:
embeddings = process_input(user_input)

In [None]:
print(embeddings.shape)

(1, 384)


# 2. Initial Similarity Search

In [None]:
def retrieve_top_articles(embeddings, top_k=5):
    """
    Retrieves the top k candidate articles from the source collection based on cosine similarity.

    Args:
        embeddings (np.ndarray): Embeddings of the input article.
        top_k (int): The number of top articles to retrieve.

    Returns:
        pd.DataFrame: DataFrame of the top k articles.
    """

    # Load the source collection
    source_collection = pd.read_parquet("Data/pubmed/parquet/pubmed_with_embeddings.parquet")

    # Calculate cosine similarity between input and source collection embeddings
    source_embeddings = np.stack(source_collection['embedding'].values)
    similarities = cosine_similarity(embeddings, source_embeddings)

    # Get indices of top k similar articles
    top_indices = np.argsort(similarities[0])[::-1][:top_k]

    # Return top k articles
    return source_collection.iloc[top_indices]
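
The ranking step in `retrieve_top_articles` can also be written directly in NumPy. The sketch below assumes non-zero embedding vectors and uses a toy corpus; `top_k_cosine` is a hypothetical helper, not a function from the notebook. It normalises both sides so that a dot product gives cosine similarity, then uses `np.argpartition` to avoid sorting the whole collection.

```python
import numpy as np

def top_k_cosine(query, corpus, top_k=5):
    """Return (indices, scores) of the top_k corpus rows most similar to query."""
    query = np.asarray(query, dtype=float).reshape(-1)
    corpus = np.asarray(corpus, dtype=float)
    # After normalisation the dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q
    k = min(top_k, len(scores))
    # argpartition selects the k best rows without a full sort
    best = np.argpartition(-scores, k - 1)[:k]
    order = best[np.argsort(-scores[best])]
    return order, scores[order]

# Toy corpus (hypothetical values) standing in for the source embeddings
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
indices, scores = top_k_cosine(np.array([1.0, 0.1]), corpus, top_k=2)
```

The returned indices could be passed to `source_collection.iloc[...]` in the same way as `top_indices` above.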

In [None]:
top_articles = retrieve_top_articles(embeddings.reshape(1, -1))

In [None]:
for index, row in top_articles.iterrows():
    print(row['abstract'])

The purpose of this retrospective study was to investigate how the blood flow and oxidative metabolism of the brain was changed in dementia and the influence of the age factor. Cerebral blood flow (CBF) was measured in 115 patients aged from 40 to 83 years by means of the Kety-Schmidt technique with the modification of Bernsmeier and Siemons. The cerebral metabolic rates of oxygen and CO2 were determined by the van Slyke method and by gaschromatography respectively and of glucose and lactate by standard enzymatic methods. All cases of dementia due to head injuries, cerebral infections, cerebral infarctions, exogenous or endogenous intoxications or circulatory diseases were excluded from this study, but no classification of the dementias was made. Statistical calculations were carried out by means of the analysis of variance for a two-way design. Cerebral blood flow did not show a normal distribution curve but was at least triphasic; CBF in demented patients was either lower than normal

# 3. Topic Matching
Investigating HDBSCAN, KMeans and BERTopic for unsupervised topic matching.

In [None]:
!pip install bertopic

In [None]:
import hdbscan
from bertopic import BERTopic
from sklearn.cluster import KMeans

Load the source embeddings used to fit the clustering models.

In [None]:
source_collection = pd.read_parquet("Data/pubmed/parquet/pubmed_with_embeddings.parquet")
source_embeddings = np.stack(source_collection['embedding'].values)

In [None]:
clusterer = hdbscan.HDBSCAN(min_cluster_size=5, min_samples=2)  # Parameters may need tuning
labels = clusterer.fit_predict(source_embeddings)
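
The clustering above labels only the source collection, and HDBSCAN marks points it treats as noise with the label -1. To filter candidates by topic, the input article needs a topic as well. One possible approach, sketched here under the assumption that cluster centroids are a reasonable summary of each topic, is to assign the input to the cluster with the most similar centroid and keep the source rows with that label. The helper names and toy data are hypothetical.

```python
import numpy as np

def cluster_centroids(embeddings, labels):
    """Mean embedding per cluster, skipping the noise label -1."""
    return {
        int(label): embeddings[labels == label].mean(axis=0)
        for label in np.unique(labels)
        if label != -1
    }

def assign_topic(query, centroids):
    """Return the cluster label whose centroid is most cosine-similar to query."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(centroids, key=lambda label: cosine(query, centroids[label]))

# Toy data (hypothetical) standing in for source_embeddings and the HDBSCAN labels
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [5.0, 5.0]])
toy_labels = np.array([0, 0, 1, 1, -1])

centroids = cluster_centroids(toy_embeddings, toy_labels)
topic = assign_topic(np.array([0.8, 0.2]), centroids)
same_topic_rows = np.flatnonzero(toy_labels == topic)  # rows sharing the input's topic
```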

# 4. Entity and Keyword Matching
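
This step is described in the pipeline overview as filtering on entity and keyword overlap. The notebook does not yet specify a method, so the following is only a sketch: it treats lowercase word tokens as keywords and scores candidates by Jaccard overlap. The stopword list, the `0.3` threshold, and the example texts are assumptions chosen for illustration; a fuller version would likely use a named-entity recogniser.

```python
import re

# Small illustrative stopword list (assumption); a real pipeline would use a fuller one
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to", "was", "with", "not"}

def keywords(text):
    """Lowercase word tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def jaccard(a, b):
    """Jaccard overlap between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

query_kw = keywords("Influence of age on CBF in patients with dementia")
candidates = [
    "Cerebral blood flow (CBF) in dementia and the influence of age",
    "Glucose uptake in skeletal muscle",
]
overlaps = [jaccard(query_kw, keywords(c)) for c in candidates]
# Keep candidates whose overlap reaches the (assumed) threshold
kept = [c for c, score in zip(candidates, overlaps) if score >= 0.3]
```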

# 5. Credibility Scoring
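
The pipeline overview mentions anomaly detection to favour articles aligned with known reliable sources, but no method is chosen yet. As a simple stand-in, the sketch below scores each candidate by its cosine similarity to the centroid of a reference set and flags low scorers. The function name, the reference set, and the `0.5` cutoff are all assumptions for illustration, and a dedicated anomaly detector may be more appropriate in practice.

```python
import numpy as np

def credibility_scores(candidate_embeddings, reference_embeddings):
    """Cosine similarity of each candidate to the centroid of a reference set."""
    centroid = reference_embeddings.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    norms = np.linalg.norm(candidate_embeddings, axis=1, keepdims=True)
    return (candidate_embeddings / norms) @ centroid

# Toy embeddings (hypothetical): a reference set of trusted articles and two candidates
reference = np.array([[1.0, 0.0], [0.9, 0.1], [0.95, 0.05]])
candidates = np.array([[1.0, 0.05], [0.0, 1.0]])

scores = credibility_scores(candidates, reference)
ranked = np.argsort(-scores)              # most reference-like first
flagged = np.flatnonzero(scores < 0.5)    # assumed cutoff for "anomalous"
```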