# IUST Computer Engineering Department 🏫
## Introduction to Natural Language Processing 📚 (The Final Project)
### Course Instructor: Dr. Marzieh Davoodabadi Farahani 👩‍🏫
### Project Teaching Assistant: Erfan Moosavi Monazzah (tel: @ErfanMoosavi2000) 📞
-------------------------------------------------------------------------------<br>
The objective of this project is to acquaint you with the fundamentals of Retrieval Augmented Generation (RAG). Be sure to explore various options and address challenges in a creative manner. 🎯

**Project Guidelines** 📝
- Avoid cheating at all costs. If a set of submissions is found to be [plagiarized](https://translate.google.as/?sl=en&tl=fa&text=Very%20hard%20word%2C%20I%20know%2C%20here%27s%20the%20meaning%3A%0Aplagiarized&op=translate), only one will be randomly chosen for grading. The others will fail the project. ❌
- You are allowed to use any document, article, paper, or video as a resource for writing your code, provided you include a link to the material used. 📖
- The use of Large Language Models (LLMs), ChatBots, and Copilots is encouraged. If you utilize any of these tools, make sure to attach the chat history that led you to the answer to your question, or the code, to this .ipynb document. (You must provide the entire chat, not just the final answer or your initial prompt.) 💻
- You may not submit any additional documents, files, etc., along with this document. Only solutions, codes, explanations, etc., in this document will be graded. 📄
- You are required to implement everything (except the Language Modeling parts) from scratch. The use of libraries like langchain, llama_index, etc., is not permitted for this purpose. 🚫
- Please adhere to the code guidelines provided throughout the documents. 📝 I’ve spent time in a library 📚 crafting all of this, so if you overlook them, you’ll lose the points allocated for that section. ❌
- We need to use GPUs for this assignment, so don't forget to turn on GPU usage for your notebook session.

-------------------------------------------------------------------------------<br>
# Alright, let's get started. 🚀

## What is RAG? 🤔
We've all used ChatGPT and experienced moments when it starts to generate content that is often incorrect or unrelated to our query. Do you know why this happens? These Large Language Models (LLMs) are not magical entities; they are simply models trained on a vast amount of text. 📚 You could even consider a significant portion of the internet. However, this is not all the data available in the world, because data is not a static concept. You yourself generate some data every day through your use of the Internet, Social Media, and so on. 🌐💻📱

So, no matter how much data you use to train your LLM, you always end up encountering new data. This is one of the reasons behind the famous ChatGPT response that tells you it only knows things up to a certain date. 📅 These models also tend to hallucinate: they provide incorrect answers in a very convincing manner. 🎭

On the other hand, we have retrieval techniques. Don't worry if that sounds complicated: you already use retrieval on a daily basis. (It isn't actually easy, and you may need to take a course to familiarize yourself with these concepts 😅, but that's not necessary for this project.) You can think of search engines (like Google, for example) as a complex form of information retrieval. 🔍

So, one day, people came up with this idea that it would be cool if ChatGPT could search Google for us, read the articles for us, summarize what it read, and tell us that. 📖 So, this is not exactly what RAG is, but it's something similar. We have a corpus (a large amount of data) and a query (what a user typed as input). Now, we search through this corpus using techniques related to vectors and vector databases, and find the most similar items in our corpus to the query. Then, we pass these items to an LLM and ask for a structured, well-formatted, user-friendly output. 📈📊

## I'm Interested in the Technical Details, What Should I Read? 📚🔍
- I strongly recommend reading the [original RAG paper](https://arxiv.org/abs/2005.11401). If you need help understanding the paper or have any questions about it, feel free to reach out to me via Telegram or find me on the second floor of the department in the NLP lab on Sundays and Tuesdays. 📖
- There appears to be a [comprehensive 2.5-hour course](https://www.freecodecamp.org/news/mastering-rag-from-scratch/) available. I haven't personally watched it, but if you find a better one, let me know so I can update this document. 🎥
- Here is [an article](https://www.smashingmagazine.com/2024/01/guide-retrieval-augmented-generation-language-models/) that explains the concepts very well. Initially, I wanted to use this article as the basis for this project, but unfortunately, the llama_index library used in the article seems to be outdated, so most of the code would need to be rewritten. On second thought, I found it more useful to focus on core concepts rather than learning specific libraries. You might want to check out some libraries like langchain or llama_index which provide a lot of tools for RAG. (But not for this project) 📝💡
- Don't hesitate to use Google or ask chatbots about any new concepts and terms. Search engine-aware chatbots like Microsoft Copilot provide links for each part of their answers, which is useful if you want to delve deeper into that part. 🌐🤖
- Lastly, we have [the article](https://learnbybuilding.ai/tutorials/rag-from-scratch) that serves as the foundation for this project. 📚🔍

# Learn
First, we’re going to go through a simple RAG implementation. It’s going to be similar to the article, except for the (LLM) part. For that, I’m going to use Hugging Face. 🤗 I’ll also try to explain the code in simple terms, but feel free to read the article if you prefer their writing style.

## Let's Install the Necessary Libraries 📚🔧
Did you know that using the `--quiet` or `-q` option with the `pip install` command minimizes the output displayed on your screen? 🖥️ This can make your terminal less cluttered. Also, using `-U` will upgrade the libraries if they were previously installed. This is particularly useful for certain libraries like `transformers` that are frequently updated. 🔄

In [None]:
!pip install -U accelerate transformers --quiet

## Gather a Corpus 📚
Technically, a corpus refers to a large and structured set of texts. However, for the sake of our discussion, let’s consider our collection as a “corpus”, even though it might not be large in the traditional sense. 😉

In [None]:
corpus_of_documents = [
    "Take a leisurely walk in the park and enjoy the fresh air.",
    "Visit a local museum and discover something new.",
    "Attend a live music concert and feel the rhythm.",
    "Go for a hike and admire the natural scenery.",
    "Have a picnic with friends and share some laughs.",
    "Explore a new cuisine by dining at an ethnic restaurant.",
    "Take a yoga class and stretch your body and mind.",
    "Join a local sports league and enjoy some friendly competition.",
    "Attend a workshop or lecture on a topic you're interested in.",
    "Visit an amusement park and ride the roller coasters."
]

## Create a Retriever 🕵️‍♂️
Now, we’re going to create a simple retriever. The role of the retriever is to compare the user’s query with a large corpus of text and find those that are most similar in context. (You know what context is by now, don’t you? 😊 If you’ve forgotten, refer back to your initial lectures). For now, let’s say we want to find similar text based on simple similarity metrics. The code is straightforward, and I have faith in you, chief! Dive into the code. 👨‍💻

In [None]:
def jaccard_similarity(query, document):
    query = query.lower().split(" ")
    document = document.lower().split(" ")
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection)/len(union)

Hey, you may want to look at the Wikipedia page for [Jaccard Similarity](https://en.wikipedia.org/wiki/Jaccard_index).
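To make the metric concrete: the Jaccard similarity of two token sets is the size of their intersection divided by the size of their union. The sketch below is not part of the original walkthrough. `tokenize` and `jaccard` are hypothetical helper names, and the punctuation stripping is an added assumption, included only to show how much plain whitespace splitting can miss.

```python
import re


def tokenize(text, strip_punctuation=False):
    """Lowercase the text and split it into a set of tokens."""
    text = text.lower()
    if strip_punctuation:
        # Keep only runs of letters, digits and apostrophes.
        return set(re.findall(r"[a-z0-9']+", text))
    return set(text.split(" "))


def jaccard(a, b):
    """Return |a & b| / |a | b|, or 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


query = "I enjoy the scenery"
document = "Go for a hike and admire the natural scenery."

# Whitespace splitting keeps the trailing period, so "scenery." != "scenery".
raw_score = jaccard(tokenize(query), tokenize(document))
# With punctuation stripped, "scenery" matches as well.
clean_score = jaccard(tokenize(query, True), tokenize(document, True))

print(f"raw: {raw_score:.3f}, cleaned: {clean_score:.3f}")
```

With whitespace splitting only "the" is shared (1 of 12 distinct tokens); after stripping punctuation, "scenery" matches too (2 of 11).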

In [None]:
def return_response(query, corpus):
    # Score every document against the query and return the best match.
    similarities = []
    for doc in corpus:
        similarity = jaccard_similarity(query, doc)
        similarities.append(similarity)
    return corpus[similarities.index(max(similarities))]

## Create a Generator 🖥️
Now, we’re going to create a generator. This will help us compile the information retrieved into a well-structured and user-friendly text.

OK, let's say that in some scenario we ask a user what they like to do, and their answer is this:

In [None]:
user_input = "I like to hike"

Now, using the retriever, I find the activity that best fits this user.

In [None]:
relevant_document = return_response(user_input, corpus_of_documents)
print(relevant_document)

Go for a hike and admire the natural scenery.


The answer seems good enough, but we can do better, yeah?

Let’s import a Language Model. I’m going to try out Microsoft Phi-3 because it recently hit the market, and I haven’t had a chance to try it for myself yet. So, I’m seizing this opportunity to do so! 😊👨‍💻

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

Downloading the model is going to take a while; use this time to rest your eyes for a bit. 😊👀💤

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")


In [None]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

Now we try to get the LLM to become our generator. We simply place the retrieved information and user query in the following prompt and ask the model for well formatted text.

In [None]:
prompt = """You are a bot that makes recommendations for activities. Try to be helpful recommender system.
This is the recommended activity: {relevant_document}
The user input is: {user_input}
Compile a recommendation to the user based on the recommended activity and the user input."""

In [None]:
prompt = prompt.replace("{relevant_document}", relevant_document).replace("{user_input}", user_input)
print(prompt)

You are a bot that makes recommendations for activities. Try to be helpful recommender system.
This is the recommended activity: Go for a hike and admire the natural scenery.
The user input is: I like to hike
Compile a recommendation to the user based on the recommended activity and the user input.


In [None]:
messages = [
    {"role": "user", "content": prompt},
]

Here's the augmented generated text

In [None]:
output = pipe(messages, **generation_args)
print(output[0]['generated_text'])

Based on your interest in hiking and our recommended activity, I suggest you embark on a scenic hike in a beautiful natural environment. This will not only allow you to enjoy the physical benefits of hiking but also provide a wonderful opportunity to admire breathtaking landscapes, observe diverse flora and fauna, and experience the tranquility of nature. Don't forget to bring along essentials like water, snacks, and appropriate hiking gear for a safe and enjoyable adventure!


## Very Cool, but Not Perfect! 😎👌
Alright, you’ve just seen a very basic example of RAG. However, there are some issues present. The corpus is small, and the documents in the corpus are short sentences, which causes the Language Model (LM) to generate some text on its own. 📚🤖

Also, our retriever is not very efficient and it may encounter bugs in some cases. For instance, even when users specify that they are not interested in a certain activity, the retriever might still bring up that activity for them. 🐜🔍
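The negation problem is easy to reproduce. The following is a minimal sketch under stated assumptions: it reuses the Jaccard idea from the Learn section on three documents from the toy corpus, and `retrieve` is a hypothetical helper name rather than a function from the walkthrough.

```python
def jaccard_similarity(query, document):
    query_tokens = set(query.lower().split(" "))
    document_tokens = set(document.lower().split(" "))
    union = query_tokens | document_tokens
    return len(query_tokens & document_tokens) / len(union)


def retrieve(query, corpus):
    """Return the corpus document with the highest Jaccard similarity."""
    return max(corpus, key=lambda doc: jaccard_similarity(query, doc))


# Three documents taken from the toy corpus above.
corpus = [
    "Visit a local museum and discover something new.",
    "Go for a hike and admire the natural scenery.",
    "Take a yoga class and stretch your body and mind.",
]

liked = retrieve("I like to hike", corpus)
disliked = retrieve("I do not like to hike", corpus)

# Both queries share only the token "hike" with the corpus, so word
# overlap alone cannot tell a preference from its negation.
print(liked == disliked)
```

Both queries retrieve the hiking document, because the words "do not" never enter the comparison in any meaningful way.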

So, in this project, you’re going to address some of these issues. The rest of this document consists of some empty cells and tips for you on how to fill them with code. Let’s get coding! 👨‍💻🚀

# The Project

## Ready the environment and import libraries

In [None]:
!pip install -U accelerate sentence-transformers transformers datasets --quiet

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from sentence_transformers import SentenceTransformer, util
from datasets import load_dataset, Dataset
import pandas as pd
import numpy as np
import re
import torch
from tqdm import tqdm
from huggingface_hub import notebook_login

In [None]:
# Load the Drive helper and mount
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from transformers import logging
logging.set_verbosity_error()

In [None]:
SIMILARITY_SCORES_OUTPUT = '/content/drive/MyDrive/NLP/Projects/Project1/outputs/similarity_scores.csv'
# DATA_NUM = 100000
OUTPUT_RETREIVER_DATSET = 'drive/MyDrive/NLP/Projects/Project1/outputs/dataset.csv'

MODEL_NAMES = [
    'NousResearch/Meta-Llama-3-8B-Instruct',
    'Qwen/Qwen2-7B-Instruct',
    'mistralai/Mistral-7B-Instruct-v0.3'
]

FILE_LIST = [f'/content/drive/MyDrive/NLP/Projects/Project1/outputs/{model_name.split("/")[0]}.csv' for model_name in MODEL_NAMES]

RELATED_DOCS_OUTPUT = '/content/drive/MyDrive/NLP/Projects/Project1/outputs/related_docs.csv'


## Determine Your Task 🎯
What do you aim to implement with RAG? A recommender system? 🎁 A chatbot for a website’s FAQ? 💬 A medical advisor? 🩺 Or perhaps something else entirely?

Specify your objective in this cell.

In [None]:
task_title = "Medical Advisor"
url_for_more_information = "http://bioasq.org/"

print(f"My task is: {task_title}")
print(f'For more information see: {url_for_more_information}')

My task is: Medical Advisor
For more information see: http://bioasq.org/


## 🧐 Find or gather a corpus
Remember the fake corpus? 📚 It’s time to switch things up and use something real. 🌐 You need to use a dataset from [Hugging Face Datasets](https://huggingface.co/datasets) for this project. 🚀 Don’t use files that live outside of this notebook; it should be able to run on its own without depending on anything external. 💻👍


In [None]:
from datasets import load_dataset

train_data = load_dataset("ruslanmv/ai-medical-chatbot", split='train')

In [None]:
train_data.num_rows

256916

In [None]:
df = pd.DataFrame(train_data)
df = df[["Description", "Doctor"]].rename(columns={"Description": "question", "Doctor": "answer"})
df = df[~df.apply(lambda row: row.astype(str).str.contains('-->').any(), axis=1)]
# df = df[:DATA_NUM]
df.insert(0, 'id', df.index)
df = df.reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235899 entries, 0 to 235898
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   id        235899 non-null  int64 
 1   question  235899 non-null  object
 2   answer    235899 non-null  object
dtypes: int64(1), object(2)
memory usage: 5.4+ MB


In [None]:
# Clean the 'answer' column: strip leading greetings and collapse whitespace
df['answer'] = df['answer'].str.replace('^Hi. ', '', regex=True)
df['answer'] = df['answer'].str.replace('^Hi ', '', regex=True)
df['answer'] = df['answer'].str.replace('^Hello. ', '', regex=True)
df['answer'] = df['answer'].str.replace('^Hello there ', '', regex=True)
df['answer'] = df['answer'].apply(lambda x: re.sub(r'\s+', ' ', x.strip()))
# df['question'] = df['question'].str.replace('^Q. ', '', regex=True)
# df['question'] = df['question'].apply(lambda x: re.sub(r'\s+', ' ', x.strip()))
df.head()

| | id | question | answer |
|---|---|---|---|
| 0 | 1 | Q. What should I do to reduce my weight gained... | You have really done well with the hypothyroid... |
| 1 | 2 | Q. I have started to get lots of acne on my fa... | there Acne has multifactorial etiology. Only a... |
| 2 | 3 | Q. Why do I have uncomfortable feeling between... | The popping and discomfort what you felt is ei... |
| 3 | 4 | Q. My symptoms after intercourse threatns me e... | The HIV test uses a finger prick blood sample,... |
| 4 | 5 | Q. I had a surgery which ended up with some fa... | If you are saying it is already six months sin... |


## 📝 Create some queries
I want you to create 20 queries related to your task. You can use any Language Model you want for this matter, or if you’re feeling strong 💪 and have the time, write it yourself. 🖊️

You need to create a Hugging Face account, format your 20 queries into the accepted dataset format for Hugging Face 🤗 and push it to your Hugging Face account. Be sure to make it public and use it for the evaluation task. 👀

In [None]:
medical_queries = [
    "What are the symptoms of acute thyroiditis in teenagers?",
    "Is a PAP test effective for screening prostatitis symptoms?",
    "How can I treat acne on my lower back after an oil massage?",
    "Is obsessive behavior about spanking a symptom of Asperger's syndrome in children?",
    "What are natural remedies for erectile dysfunction?",
    "Why do I experience rectal bleeding after eating spicy food?",
    "What are the potential causes of frequent diarrhea?",
    "How effective are Nano-Leo capsules for treating erectile dysfunction?",
    "What are the risk factors for colon cancer in young adults?",
    "Can high blood pressure medication affect erectile function?",
    "What are the symptoms of HIV exposure?",
    "How long after potential HIV exposure should I get tested?",
    "What causes styes in the eye and how can they be treated?",
    "Are skin tags a sign of a more serious condition?",
    "What causes lightheadedness when walking or looking at a computer screen?",
    "How can I lower my triglyceride levels naturally?",
    "What are non-invasive procedures to detect prostate cancer?",
    "How accurate are PSA tests in detecting prostate cancer?",
    "What causes chronic eye allergies and irritation?",
    "Are there alternatives to steroid eye drops for treating eye allergies?"
]

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
import numpy as np

# IDs run from 1 to len(medical_queries) inclusive, one per query
medical_queries_dataset = Dataset.from_dict({'query_id': np.arange(1, len(medical_queries) + 1), 'query': medical_queries})

medical_queries_dataset.push_to_hub('kamyar-mroadian/medical_queries')

CommitInfo(commit_url='https://huggingface.co/datasets/kamyar-mroadian/medical_queries/commit/fdf05eb1b04b2e6c094c9eab273be765fcde33a2', commit_message='Upload dataset', commit_description='', oid='fdf05eb1b04b2e6c094c9eab273be765fcde33a2', pr_url=None, pr_revision=None, pr_num=None)

This part was done using ChatGPT:

https://www.perplexity.ai/search/generate-20-queries-with-respe-TLcKrjwgQ82Q9Q1w4yJwng

## 🛠️ Create a Retriever
To create your retriever, you need to use an encoder model. Something like BERT? Nah, BERT is so yesterday. Find something new and shiny! ✨ The basic idea is to encode every document (sentence) in your corpus into a vector space using the same encoder. Then, encode the user query into that same space. With some similarity metrics like dot product, you can find the most similar document to the user’s input and retrieve it. 🎯 You can train your own encoder if you have enough data and resources, 💪 or you can use one of those [ready-made on Hugging Face](https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=trending), like these ones.
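The retrieval step described above can be sketched without any model at all. This is a minimal illustration, assuming small made-up vectors in place of real encoder output; `cosine_similarity` and `top_k` are hypothetical helper names. Cosine similarity is simply the dot product of the two vectors after each has been scaled to unit length, so it ignores vector magnitude.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Convention for zero vectors; avoids division by zero.
    return dot / (norm_u * norm_v)


def top_k(query_vector, document_vectors, k=2):
    """Return the k (doc_id, score) pairs with the highest cosine similarity."""
    scored = (
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in document_vectors.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Made-up 3-dimensional vectors standing in for real encoder output.
document_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.6, 0.8, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query_vector = [0.8, 0.6, 0.0]

results = top_k(query_vector, document_vectors, k=2)
best_id, best_score = results[0]
print(results)
```

Returning the top k documents instead of a single one is a design choice, not a requirement of the project: it gives the generator more context at the cost of a longer prompt.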

In [None]:
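def measure_similarity(query, doc_ids, document_embeddings):
    """Return the id and cosine similarity of the document closest to the query."""
    # Relies on the global `retriever_model`, which is defined in a later cell.
    query_embedding = retriever_model.encode([query])[0]
    query_norm = np.linalg.norm(query_embedding)
    max_i = -1
    max_value = -np.inf
    # Linear scan over all documents, keeping the best score seen so far.
    for i, emb in zip(doc_ids, document_embeddings):
        doc_norm = np.linalg.norm(emb)
        cosine_sim = np.dot(query_embedding, emb) / (doc_norm * query_norm)
        if cosine_sim > max_value:
            max_i = i
            max_value = cosine_sim
    return max_i, max_value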

In [None]:
def return_response(query, corpus):
    # `corpus` is a DataFrame with 'id', 'answer' and 'answer_embedding' columns.
    doc_id, similarity = measure_similarity(query, corpus['id'], corpus['answer_embedding'])
    return corpus[corpus['id'] == doc_id]['answer'].values[0]

In [None]:
from sentence_transformers import SentenceTransformer

retriever_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# embeddings = model.encode(sentences)
# print(embeddings)

In [None]:
# sentences = list(df['answer'])
# embeddings = retriever_model.encode(sentences, show_progress_bar=True)
# embeddings_list = embeddings.tolist()
# df['answer_embedding'] = embeddings_list
# df

In [None]:
sentences = list(df['answer'])
pool = retriever_model.start_multi_process_pool()
embeddings = retriever_model.encode_multi_process(sentences, pool)
retriever_model.stop_multi_process_pool(pool)
embeddings_list = embeddings.tolist()
df['answer_embedding'] = embeddings_list
# df.to_csv(OUTPUT_RETREIVER_DATSET, index=False)
df.head()

| | id | question | answer | answer_embedding |
|---|---|---|---|---|
| 0 | 1 | Q. What should I do to reduce my weight gained... | You have really done well with the hypothyroid... | [0.05510440468788147, -0.0012084447080269456, ... |
| 1 | 2 | Q. I have started to get lots of acne on my fa... | there Acne has multifactorial etiology. Only a... | [-0.0547616109251976, -0.020722554996609688, 0... |
| 2 | 3 | Q. Why do I have uncomfortable feeling between... | The popping and discomfort what you felt is ei... | [0.0065601966343820095, -0.08333427459001541, ... |
| 3 | 4 | Q. My symptoms after intercourse threatns me e... | The HIV test uses a finger prick blood sample,... | [-0.013021341525018215, 0.04355882480740547, 0... |
| 4 | 5 | Q. I had a surgery which ended up with some fa... | If you are saying it is already six months sin... | [0.06730877608060837, -0.006252889521420002, 0... |


## 🎛️ Create a Generator
For this part, I practically handed you the whole code on a silver platter. 🍽️ But since we know you’re an explorer at heart and love trying new things, you can’t use the model I previously used. 😈 You have to try 3 different generators and compare them based on the quality of their answers. 🧪📊 [These might come in handy](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending).

In [None]:
!mkdir outputs

In [None]:
def get_model_and_tokenizer(model_name):
    # Load the causal LM on the GPU in bfloat16, together with its tokenizer.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="cuda",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return model, tokenizer

In [None]:
def get_message(relevant_document, user_input, has_system=True):
    prompt = """
    You are an AI medical advisor trained on up-to-date, evidence-based medical information.
    Your role is to provide helpful guidance to healthcare professionals, but not to diagnose or
    prescribe treatments. Please respond to the{include_following}query in a clear, concise manner using
    medical terminology appropriate for healthcare professionals:\n
    {user_input}
    In your response:\n
    1- Consider this related response: {relevant_document}\n
    2- Summarize the key points related to the query\n
    3- Mention potential differential diagnoses if applicable\n
    4- Suggest appropriate next steps or areas for further investigation\n
    Please use a professional yet approachable tone, and structure your response with appropriate
    headers and bullet points for readability. If you are unsure about any aspect of the query,
    state so clearly and suggest consulting with a specialist or reviewing additional medical literature.
    """
    prompt = prompt.replace("{relevant_document}", relevant_document)
    if has_system:
        prompt = prompt.replace("{include_following}", ' ')
        prompt = prompt.replace("{user_input}", '')
        messages = [
            {'role': 'system', "content": prompt},
            {"role": "user", "content": user_input},
        ]
    else:
        prompt = prompt.replace("{include_following}", ' following ')
        prompt = prompt.replace("{user_input}", user_input)
        messages = [
            {"role": "user", "content": prompt},
        ]
    return messages

In [None]:
def get_model_outputs(pipe, generation_args, queries, docs, csv_filename, has_system=True, special_eos=False):
    if special_eos:
        terminators = [
            pipe.tokenizer.eos_token_id,
            pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
        ]
        generation_args['eos_token_id'] = terminators
    else:
        generation_args['eos_token_id'] = pipe.tokenizer.eos_token_id
    generation_args['bos_token_id'] = pipe.tokenizer.bos_token_id

    messages = [get_message(doc, query, has_system) for doc, query in zip(docs, queries)]

    prompts = [pipe.tokenizer.apply_chat_template(
        message,
        tokenize=False,
        add_generation_prompt=True
    ) for message in messages]

    prompt_df = pd.DataFrame({'prompt': prompts})

    tqdm.pandas()
    outputs = prompt_df['prompt'].progress_apply(
        lambda x: pipe(x, **generation_args)
    )

    generated_advices = [output[0]["generated_text"] for output in outputs]

    # Create a DataFrame to hold the outputs
    df = pd.DataFrame({
        "Query": queries,
        "Document": docs,
        "Generated Output": generated_advices
    })

    # Save the DataFrame to a CSV file
    df.to_csv(csv_filename, index=False)

    return generated_advices

## 📊 Evaluate the results
Here, you’ve got to put those 3 models to the test. Use the 20 queries you’ve created on each of the 3 models. Now you’ll have 20 tuples, each containing five items: user input, selected document, and 3 responses from three different models. Use a judge model on each tuple to select the best answer. 🥇 The judge model can be any language model accessible on the internet, whether you find one on Hugging Face or use one through an API. 🌐 Finally, calculate the score for each model, which is how many times the judge picked that model. 🏆
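The final scoring step can be sketched as a simple tally. This is a minimal sketch, not the required implementation: `score_models`, the placeholder model names and the judge picks are all hypothetical, and how you obtain the judge's picks is left open.

```python
from collections import Counter


def score_models(judge_picks, model_names):
    """Count how many times the judge picked each model.

    judge_picks holds one model name per query: the judge's choice.
    Models that were never picked still appear with a score of 0.
    """
    counts = Counter(judge_picks)
    return {name: counts.get(name, 0) for name in model_names}


# Hypothetical judge decisions for five queries (placeholders, not real results).
model_names = ["model_a", "model_b", "model_c"]
judge_picks = ["model_a", "model_c", "model_a", "model_a", "model_c"]

scores = score_models(judge_picks, model_names)
winner = max(scores, key=scores.get)
print(scores, winner)
```

Listing every model explicitly keeps a model with zero wins in the final table, which a bare `Counter` would silently omit.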

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
ds = load_dataset('kamyar-mroadian/medical_queries', split='train')
queries = ds.to_pandas()['query']

The following cell was run in advance and its results were saved.

In [None]:
## Uncomment to run the retriever and get the related document for each query
# tqdm.pandas() # Enable the tqdm progress bar
# docs = queries.progress_apply(lambda x: return_response(x, df))
## If you have new queries, uncomment this line as well
# docs.to_csv(RELATED_DOCS_OUTPUT, index=False)

100%|██████████| 20/20 [04:47<00:00, 14.38s/it]


In [None]:
docs = pd.read_csv(RELATED_DOCS_OUTPUT)
docs = docs['query']

Using the following cell, you can experiment with all 3 models defined above.

In [None]:
# Change this line to experiment with the other models
MODEL_INDEX = 0
model_in_use = MODEL_NAMES[MODEL_INDEX]
file_name = FILE_LIST[MODEL_INDEX]
# generator_model, generator_tokenizer = get_model_and_tokenizer(model_in_use)

generator_pipe = pipeline(
    "text-generation",
    model=model_in_use,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
    trust_remote_code=True
)

generation_args = {
    "max_new_tokens": 256,
    "return_full_text": False,
    "do_sample": False,
    "top_p": None,
}

In [None]:
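# Chat-format flags for the selected model
if MODEL_INDEX == 0:
    special_eos = True
    has_system = True
elif MODEL_INDEX == 1:  # elif, so that index 0 does not fall through to the else branch
    special_eos = False
    has_system = True
else:
    special_eos = False
    has_system = False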
generator_outputs = get_model_outputs(generator_pipe, generation_args, queries.values, docs.values, file_name, has_system=has_system, special_eos=special_eos)

100%|██████████| 20/20 [1:16:40<00:00, 230.02s/it]


The evaluator model is:
`Alibaba-NLP/gte-large-en-v1.5`

I wanted to choose a medium-sized or small-sized model. The candidates were the model above and `sentence-transformers/all-mpnet-base-v2`.

After a little research and with the help of ChatGPT, I finally chose the `Alibaba-NLP/gte-large-en-v1.5` model.

https://www.perplexity.ai/search/compare-the-performance-of-the-qpgMRHj4SjKNgNnU18_q3Q
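
For intuition, here is a minimal sketch of the cosine similarity that the evaluation code below computes between embeddings (there it is done with `util.pytorch_cos_sim`). This sketch uses plain Python and made-up 3-dimensional vectors instead of real model embeddings; the function name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of the project code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings" (made up; real ones have many more dimensions)
query_vec = [1.0, 0.0, 1.0]
output_vec = [1.0, 0.0, 1.0]      # points in the same direction as the query
unrelated_vec = [0.0, 1.0, 0.0]   # orthogonal to the query

same = cosine_similarity(query_vec, output_vec)
different = cosine_similarity(query_vec, unrelated_vec)
print(same, different)
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, so a higher value means the generated output is closer in meaning to the text it is compared with.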

In [None]:
from sentence_transformers import SentenceTransformer, util  # util provides pytorch_cos_sim, used below

evaluator_model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)

In [None]:
def read_model_outputs(file_list):
    """Load the saved outputs of each model (one CSV file per model)."""
    data_frames = [pd.read_csv(file) for file in file_list]
    return data_frames

def compute_similarity(model, queries, documents, outputs):
    """Return per-row cosine similarities of each output to its query and to its document."""
    # Encode each list in a single batch; row i of every tensor belongs to sample i
    query_embeddings = model.encode(queries, convert_to_tensor=True)
    document_embeddings = model.encode(documents, convert_to_tensor=True)
    output_embeddings = model.encode(outputs, convert_to_tensor=True)

    query_similarities = [util.pytorch_cos_sim(query_embeddings[i], output_embeddings[i]).item() for i in range(len(queries))]
    document_similarities = [util.pytorch_cos_sim(document_embeddings[i], output_embeddings[i]).item() for i in range(len(documents))]

    return query_similarities, document_similarities

def save_similarity_scores(evaluator_model, file_list, csv_filename, model_names):
    data_frames = read_model_outputs(file_list)
    similarity_results = []

    for i, df in enumerate(data_frames):
        queries = df['Query'].tolist()
        documents = df['Document'].tolist()
        outputs = df['Generated Output'].tolist()

        query_sim, doc_sim = compute_similarity(evaluator_model, queries, documents, outputs)

        df['Query Similarity'] = query_sim
        df['Document Similarity'] = doc_sim
        df['Average Similarity'] = (df['Query Similarity'] + df['Document Similarity']) / 2
        df['Model'] = f'{model_names[i]}'

        similarity_results.append(df)

    combined_df = pd.concat(similarity_results, ignore_index=True)

    # Rank models for each query based on all three similarity scores
    combined_df['Query Score'] = combined_df.groupby('Query')['Query Similarity'].rank(ascending=True, method='min')
    combined_df['Document Score'] = combined_df.groupby('Query')['Document Similarity'].rank(ascending=True, method='min')
    combined_df['Average Score'] = combined_df.groupby('Query')['Average Similarity'].rank(ascending=True, method='min')

    # Calculate total rank for each model per query
    combined_df['Total Score'] = combined_df['Query Score'] + combined_df['Document Score'] + combined_df['Average Score']

    # Calculate overall rank for each model
    model_ranks = combined_df.groupby('Model')['Total Score'].sum().reset_index()
    model_ranks['Overall Rank'] = model_ranks['Total Score'].rank(ascending=False, method='min')
    model_ranks = model_ranks.sort_values('Overall Rank', ascending=True).reset_index(drop=True)

    # Add overall rank to the combined dataframe
    combined_df = combined_df.merge(model_ranks[['Model', 'Overall Rank']], on='Model')

    combined_df.to_csv(csv_filename, index=False)

    return combined_df, model_ranks

combined_df, model_ranks = save_similarity_scores(evaluator_model, FILE_LIST, SIMILARITY_SCORES_OUTPUT, MODEL_NAMES)

print("Model Rankings:")
model_ranks

Model Rankings:


| | Model | Total Score | Overall Rank |
|---|---|---|---|
| 0 | mistralai/Mistral-7B-Instruct-v0.3 | 139.0 | 1.0 |
| 1 | NousResearch/Meta-Llama-3-8B-Instruct | 126.0 | 2.0 |
| 2 | Qwen/Qwen2-7B-Instruct | 95.0 | 3.0 |


Here's how the score of each model is calculated:

For each query, I ranked the models on each of the following scores:
- Query Similarity (how close the model's output is to the query)
- Document Similarity (how close the model's output is to the relevant document of the query)
- Average Similarity (the average of the two scores above)

The model with the highest similarity gets the highest score (so 3 is assigned to the best of the three models and 1 to the worst). Finally, all of these scores are added up over all queries to get the Total Score of each model. The model with the highest Total Score is ranked first.
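
The scoring scheme described above can be sketched in plain Python on toy data. This is only an illustration under assumptions: the helper names `min_rank` and `total_scores`, the models `A`, `B`, `C`, and all similarity values are made up, and the project code itself does the ranking with pandas `groupby(...).rank(ascending=True, method='min')`.

```python
from collections import defaultdict


def min_rank(values):
    """Ascending rank; tied values share the lowest rank (like pandas method='min')."""
    return [1 + sum(other < v for other in values) for v in values]


def total_scores(similarities):
    """similarities: {query: {model: (query_sim, doc_sim)}} -> {model: total score}."""
    totals = defaultdict(int)
    for per_model in similarities.values():
        models = list(per_model)
        query_sims = [per_model[m][0] for m in models]
        doc_sims = [per_model[m][1] for m in models]
        avg_sims = [(q + d) / 2 for q, d in zip(query_sims, doc_sims)]
        # Each of the three similarity columns is ranked separately per query
        for column in (query_sims, doc_sims, avg_sims):
            for model, rank in zip(models, min_rank(column)):
                totals[model] += rank
    return dict(totals)


# Made-up similarities for two queries and three hypothetical models
toy = {
    "q1": {"A": (0.90, 0.60), "B": (0.80, 0.75), "C": (0.70, 0.50)},
    "q2": {"A": (0.60, 0.90), "B": (0.95, 0.80), "C": (0.70, 0.70)},
}
scores = total_scores(toy)
best = max(scores, key=scores.get)
print(scores, best)
```

With three models and no ties, every query hands out 3 × (1 + 2 + 3) = 18 points, so 20 queries give 360 points in total, which matches 139 + 126 + 95 in the ranking table above.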

As you can see, the mistralai/Mistral-7B-Instruct-v0.3 model is ranked first. Also, here's the ranking of the models by the time they took to generate outputs (fastest first):

1.   mistralai/Mistral-7B-Instruct-v0.3
2.   Qwen/Qwen2-7B-Instruct
3.   NousResearch/Meta-Llama-3-8B-Instruct

In [None]:
print('Final result all in one view:')

combined_df

Final result all in one view:


| | Query | Document | Generated Output | Query Similarity | Document Similarity | Average Similarity | Model | Query Score | Document Score | Average Score | Total Score | Overall Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What are the symptoms of acute thyroiditis in ... | HelloAs you mentioned that your daughter is ha... | `**Query Response**\n\n**Symptoms of Acute Thyr...` | 0.905146 | 0.658204 | 0.781675 | NousResearch/Meta-Llama-3-8B-Instruct | 3.0 | 2.0 | 3.0 | 8.0 | 2.0 |
| 1 | Is a PAP test effective for screening prostati... | UTI and taking antibiotics such as amoxicilin ... | `**Query Response:**\n\n**Is a PAP test effecti...` | 0.941089 | 0.732424 | 0.836757 | NousResearch/Meta-Llama-3-8B-Instruct | 3.0 | 2.0 | 2.0 | 7.0 | 2.0 |
| 2 | How can I treat acne on my lower back after an... | hi there. acne of the back is difficult to get... | `**Treatment of Acne on the Lower Back after an...` | 0.878518 | 0.70821 | 0.793364 | NousResearch/Meta-Llama-3-8B-Instruct | 3.0 | 2.0 | 2.0 | 7.0 | 2.0 |
| 3 | Is obsessive behavior about spanking a symptom... | Hello and Welcome to ‘Ask A Doctor’ service. I... | `**Response to Query: Obsessive Behavior about ...` | 0.91412 | 0.612802 | 0.763461 | NousResearch/Meta-Llama-3-8B-Instruct | 3.0 | 3.0 | 3.0 | 9.0 | 2.0 |
| 4 | What are natural remedies for erectile dysfunc... | Unfortunately many treatment options for erect... | `**Natural Remedies for Erectile Dysfunction**\...` | 0.82302 | 0.763225 | 0.793123 | NousResearch/Meta-Llama-3-8B-Instruct | 1.0 | 1.0 | 1.0 | 3.0 | 2.0 |
| 5 | Why do I experience rectal bleeding after eati... | HelloYour concern is wright.Spicy food is some... | `**Query Response**\n\n**Summary of Query**\n\n...` | 0.816868 | 0.605446 | 0.711157 | NousResearch/Meta-Llama-3-8B-Instruct | 1.0 | 1.0 | 1.0 | 3.0 | 2.0 |
| 6 | What are the potential causes of frequent diar... | hi causes are -generic -heredity -protein or v... | `**Potential Causes of Frequent Diarrhea**\n\n*...` | 0.881011 | 0.766057 | 0.823534 | NousResearch/Meta-Llama-3-8B-Instruct | 2.0 | 2.0 | 3.0 | 7.0 | 2.0 |
| 7 | How effective are Nano-Leo capsules for treati... | I can understand your concern. Nano Leo consis... | `**Response to Query: Effectiveness of Nano-Leo...` | 0.885237 | 0.86177 | 0.873503 | NousResearch/Meta-Llama-3-8B-Instruct | 3.0 | 2.0 | 3.0 | 8.0 | 2.0 |
| 8 | What are the risk factors for colon cancer in ... | HelloOf course, colon cancer has to be conside... | `**Risk Factors for Colon Cancer in Young Adult...` | 0.878745 | 0.656774 | 0.767759 | NousResearch/Meta-Llama-3-8B-Instruct | 1.0 | 2.0 | 2.0 | 5.0 | 2.0 |
| 9 | Can high blood pressure medication affect erec... | hiii.welcome to our site.usually erection occu... | `**Query Response: Can high blood pressure medi...` | 0.894501 | 0.831951 | 0.863226 | NousResearch/Meta-Llama-3-8B-Instruct | 3.0 | 2.0 | 2.0 | 7.0 | 2.0 |


The code for comparing the models was written with the help of ChatGPT (for the pandas manipulation).

https://www.perplexity.ai/search/edit-this-code-such-that-we-ca-8kMN90iVQ1yudkL9bDXQeA

### Now that I'm writing this message, it's 3 in the morning and I'm tired as fox. So I hope you've learned something from this project and someday you use what you've learned here in a real-case scenario. Good Luck! ✌️