# IUST Computer Engineering Department 🏫
## Introduction to Natural Language Processing 📚 (The Final Project)
### Course Instructor: Dr. Marzieh Davoodabadi Farahani 👩‍🏫
### Project Teaching Assistant: Erfan Moosavi Monazzah (tel: @ErfanMoosavi2000) 📞
-------------------------------------------------------------------------------<br>
The objective of this project is to acquaint you with the fundamentals of Retrieval Augmented Generation (RAG). Be sure to explore various options and address challenges in a creative manner. 🎯

**Project Guidelines** 📝
- Avoid cheating at all costs. If a set of submissions is found to be [plagiarized](https://translate.google.as/?sl=en&tl=fa&text=Very%20hard%20word%2C%20I%20know%2C%20here%27s%20the%20meaning%3A%0Aplagiarized&op=translate), only one will be randomly chosen for grading. The others will fail the project. ❌
- You are allowed to use any document, article, paper, or video as a resource for writing your code, provided you include a link to the material used. 📖
- The use of Large Language Models (LLMs), ChatBots, and Copilots is encouraged. If you utilize any of these tools, make sure to attach the chat history that led you to the answer to your question, or the code, to this .ipynb document. (You must provide the entire chat, not just the final answer or your initial prompt.) 💻
- You may not submit any additional documents, files, etc., along with this document. Only solutions, codes, explanations, etc., in this document will be graded. 📄
- You are required to implement everything (except the Language Modeling parts) from scratch. The use of libraries like langchain, llama_index, etc., is not permitted for this purpose. 🚫
- Please adhere to the code guidelines provided throughout the documents. 📝 I’ve spent time in a library 📚 crafting all of this, so if you overlook them, you’ll lose the points allocated for that section. ❌
- We need to use GPUs for this assignment, so don't forget to turn on GPU usage for your notebook session.

-------------------------------------------------------------------------------<br>
# Alright, let's get started. 🚀

## What is RAG? 🤔
We've all used ChatGPT and experienced moments when it starts to generate content that is often incorrect or unrelated to our query. Do you know why this happens? These Large Language Models (LLMs) are not magical entities; they are simply models trained on a vast amount of text. 📚 You could even consider a significant portion of the internet. However, this is not all the data available in the world, because data is not a static concept. You yourself generate some data every day through your use of the Internet, Social Media, and so on. 🌐💻📱

So, no matter how much data you use to train your LLM, you always end up encountering new data. This is one of the reasons behind the famous ChatGPT response that tells you it only knows things up to a certain date. 📅 These models also tend to hallucinate, meaning they provide incorrect answers in a very convincing manner. 🎭

On the other hand, we have retrieval techniques. Don't worry if this sounds complicated. The field really isn't easy, and you may need to take a course to familiarize yourself with its concepts 😅 (that's not necessary for this project), but you already use retrieval on a daily basis. You can think of Search Engines (like Google, for example) as a complex form of information retrieval. 🔍

So, one day, people came up with this idea that it would be cool if ChatGPT could search Google for us, read the articles for us, summarize what it read, and tell us that. 📖 So, this is not exactly what RAG is, but it's something similar. We have a corpus (a large amount of data) and a query (what a user typed as input). Now, we search through this corpus using techniques related to vectors and vector databases, and find the most similar items in our corpus to the query. Then, we pass these items to an LLM and ask for a structured, well-formatted, user-friendly output. 📈📊

## I'm Interested in the Technical Details, What Should I Read? 📚🔍
- I strongly recommend reading the [original RAG paper](https://arxiv.org/abs/2005.11401). If you need help understanding the paper or have any questions about it, feel free to reach out to me via Telegram or find me on the second floor of the department in the NLP lab on Sundays and Tuesdays. 📖
- There appears to be a [comprehensive 2.5-hour course](https://www.freecodecamp.org/news/mastering-rag-from-scratch/) available. I haven't personally watched it, but if you find a better one, let me know so I can update this document. 🎥
- Here is [an article](https://www.smashingmagazine.com/2024/01/guide-retrieval-augmented-generation-language-models/) that explains the concepts very well. Initially, I wanted to use this article as the basis for this project, but unfortunately, the llama_index library used in the article seems to be outdated, so most of the code would need to be rewritten. On second thought, I found it more useful to focus on core concepts rather than learning specific libraries. You might want to check out some libraries like langchain or llama_index which provide a lot of tools for RAG. (But not for this project) 📝💡
- Don't hesitate to use Google or ask chatbots about any new concepts and terms. If you use search engine-aware chatbots like Microsoft Copilot, they provide links for each part of their answers, which is useful if you want to delve deeper into that part. 🌐🤖
- Lastly, we have [the article](https://learnbybuilding.ai/tutorials/rag-from-scratch) that serves as the foundation for this project. 📚🔍

# Learn
First, we’re going to go through a simple RAG implementation. It’s going to be similar to the article, except for the (LLM) part. For that, I’m going to use Hugging Face. 🤗 I’ll also try to explain the code in simple terms, but feel free to read the article if you prefer their writing style.

## Let's Install the Necessary Libraries 📚🔧
Did you know that using the `--quiet` or `-q` option with the `pip install` command minimizes the output displayed on your screen? 🖥️ This can make your terminal less cluttered. Also, using `-U` will upgrade the libraries if they were previously installed. This is particularly useful for certain libraries like `transformers` that are frequently updated. 🔄

In [None]:
!pip install -U accelerate transformers --quiet

## Gather a Corpus 📚
Technically, a corpus refers to a large and structured set of texts. However, for the sake of our discussion, let’s consider our collection as a “corpus”, even though it might not be large in the traditional sense. 😉

In [None]:
corpus_of_documents = [
    "Take a leisurely walk in the park and enjoy the fresh air.",
    "Visit a local museum and discover something new.",
    "Attend a live music concert and feel the rhythm.",
    "Go for a hike and admire the natural scenery.",
    "Have a picnic with friends and share some laughs.",
    "Explore a new cuisine by dining at an ethnic restaurant.",
    "Take a yoga class and stretch your body and mind.",
    "Join a local sports league and enjoy some friendly competition.",
    "Attend a workshop or lecture on a topic you're interested in.",
    "Visit an amusement park and ride the roller coasters."
]

## Create a Retriever 🕵️‍♂️
Now, we’re going to create a simple retriever. The role of the retriever is to compare the user’s query with a large corpus of text and find those that are most similar in context. (You know what context is by now, don’t you? 😊 If you’ve forgotten, refer back to your initial lectures). For now, let’s say we want to find similar text based on simple similarity metrics. The code is straightforward, and I have faith in you, chief! Dive into the code. 👨‍💻

In [None]:
def jaccard_similarity(query, document):
    query = query.lower().split(" ")
    document = document.lower().split(" ")
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection)/len(union)

Hey, you may want to look at the Wikipedia page for [Jaccard Similarity](https://en.wikipedia.org/wiki/Jaccard_index).
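
To make the metric concrete, here is a small worked example. It is only a sketch: it re-declares the function so it can run on its own, and it uses the hiking sentence from the toy corpus. The Jaccard similarity of two token sets is the size of their intersection divided by the size of their union. Note that splitting on spaces leaves punctuation attached, so `scenery.` counts as a token of its own.

```python
def jaccard_similarity(query, document):
    # Same logic as the retriever cell: lowercase, split on spaces, compare sets
    query_tokens = set(query.lower().split(" "))
    document_tokens = set(document.lower().split(" "))
    intersection = query_tokens & document_tokens
    union = query_tokens | document_tokens
    return len(intersection) / len(union)

query = "I like to hike"
document = "Go for a hike and admire the natural scenery."

# Shared tokens: {"hike"} -> 1
# Distinct tokens overall: 4 (query) + 9 (document) - 1 (shared) = 12
score = jaccard_similarity(query, document)
print(round(score, 4))  # 1/12, about 0.0833
```

Even for the best-matching document the score is low, because the metric only rewards exact word overlap.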

In [None]:
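def return_response(query, corpus):
    # Score every document against the query that was passed in
    # (the original read the global `user_input` instead of `query`)
    similarities = []
    for doc in corpus:
        similarity = jaccard_similarity(query, doc)
        similarities.append(similarity)
    # Return the best match from the corpus argument, not a global list
    return corpus[similarities.index(max(similarities))]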

## Create a Generator 🖥️
Now, we’re going to create a generator. This will help us compile the information retrieved into a well-structured and user-friendly text.

OK, let's say that in a scenario we ask the user what they like to do, and their answer is this:

In [None]:
user_input = "I like to hike"

Now, using the retriever, I find the activity that best fits this user.

In [None]:
relevant_document = return_response(user_input, corpus_of_documents)
print(relevant_document)

Go for a hike and admire the natural scenery.


The answer seems good enough, but we can do better, yeah?

Let’s import a Language Model. I’m going to try out Microsoft Phi-3 because it recently hit the market, and I haven’t had a chance to try it for myself yet. So, I’m seizing this opportunity to do so! 😊👨‍💻

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

Downloading the model is going to take a while, so use this time to rest your eyes for a bit. 😊👀💤

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")


In [None]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

Now we try to get the LLM to become our generator. We simply place the retrieved information and the user query in the following prompt and ask the model for well-formatted text.

In [None]:
prompt = """You are a bot that makes recommendations for activities. Try to be helpful recommender system.
This is the recommended activity: {relevant_document}
The user input is: {user_input}
Compile a recommendation to the user based on the recommended activity and the user input."""

In [None]:
prompt = prompt.replace("{relevant_document}", relevant_document).replace("{user_input}", user_input)
print(prompt)

You are a bot that makes recommendations for activities. Try to be helpful recommender system.
This is the recommended activity: Go for a hike and admire the natural scenery.
The user input is: I like to hike
Compile a recommendation to the user based on the recommended activity and the user input.


In [None]:
messages = [
    {"role": "user", "content": prompt},
]

Here's the augmented generated text

In [None]:
output = pipe(messages, **generation_args)
print(output[0]['generated_text'])

Based on your interest in hiking and our recommended activity, I suggest you embark on a scenic hike in a beautiful natural environment. This will not only allow you to enjoy the physical benefits of hiking but also provide a wonderful opportunity to admire breathtaking landscapes, observe diverse flora and fauna, and experience the tranquility of nature. Don't forget to bring along essentials like water, snacks, and appropriate hiking gear for a safe and enjoyable adventure!


## Very Cool, but Not Perfect! 😎👌
Alright, you’ve just seen a very basic example of RAG. However, there are some issues present. The corpus is small, and the documents in the corpus are short sentences, which causes the Language Model (LM) to generate some text on its own. 📚🤖

Also, our retriever is not very robust: it only counts shared words and has no notion of meaning, so it can return the wrong document. For instance, even when users specify that they are not interested in a certain activity, the retriever might still bring up that activity for them. 🐜🔍
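
The negation problem is easy to reproduce. The sketch below is self-contained and uses a three-sentence subset of the toy corpus; `best_match` is a hypothetical helper name introduced only for this illustration.

```python
def jaccard_similarity(query, document):
    # Word-overlap score, same idea as the retriever in the Learn section
    query_tokens = set(query.lower().split(" "))
    document_tokens = set(document.lower().split(" "))
    return len(query_tokens & document_tokens) / len(query_tokens | document_tokens)

def best_match(query, corpus):
    # Hypothetical helper: pick the document with the highest Jaccard score
    return max(corpus, key=lambda doc: jaccard_similarity(query, doc))

# A three-document subset of the toy corpus used earlier
small_corpus = [
    "Visit a local museum and discover something new.",
    "Go for a hike and admire the natural scenery.",
    "Take a yoga class and stretch your body and mind.",
]

liked = best_match("I like to hike", small_corpus)
disliked = best_match("I don't like to hike", small_corpus)
print(liked == disliked)  # both queries retrieve the hiking document
```

The word "hike" is the only overlap in both cases, so the extra "don't" changes the score slightly but not the winner.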

So, in this project, you’re going to address some of these issues. The rest of this document consists of some empty cells and tips for you on how to fill them with code. Let’s get coding! 👨‍💻🚀

# The Project

In [1]:
# Import necessary libraries
!pip install -U accelerate transformers huggingface_hub datasets sentence-transformers --quiet
# !pip install -U bitsandbytes

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from datasets import load_dataset, Dataset
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm
import numpy as np
import pandas as pd

## Determine Your Task 🎯
What do you aim to implement with RAG? A recommender system? 🎁 A chatbot for a website’s FAQ? 💬 A medical advisor? 🩺 Or perhaps something else entirely?

Specify your objective in this cell.

In [2]:
# Determine Your Task
# Task Determination: I've set the task as "A chatbot for FAQ (frequently asked questions)"
task_title = "A chatbot for FAQ" # frequently asked questions (FAQ)
url_for_more_information = "https://en.wikipedia.org/wiki/Chatbot"

print(f"My task is: {task_title}")
print(f'For more information see: {url_for_more_information}')

My task is: A chatbot for FAQ
For more information see: https://en.wikipedia.org/wiki/Chatbot


## 🧐 Find or gather a corpus
Remember the fake corpus? 📚 It’s time to switch things up and use something real. 🌐 You need to use a dataset from  [huggingface datasets](https://huggingface.co/datasets) for this project. 🚀 Don’t use files that are outside of this notebook, this notebook should be able to run on its own without depending on anything external. 💻👍


In [3]:
# Find or gather a corpus
# Corpus: I've used the SQuAD v2 dataset from Hugging Face and kept only the unique contexts of the dataset.

# dataset = load_dataset("squad", split="train") # 87599 -> unique: 18891
dataset = load_dataset("squad_v2", split="train") # 130319 -> unique: 19029

corpus = dataset["context"]
corpus = list(set(corpus))

print(f"Corpus size: {len(corpus)}")
print("First few documents:")
for doc in corpus[:5]:
    print(doc[:100] + "...")



Corpus size: 19029
First few documents:
There is no natural source for green food colorings which has been approved by the US Food and Drug ...
Many of the world's largest media conglomerates are also based in the city. Manhattan contained over...
In the political realm, historians debate whether Napoleon was "an enlightened despot who laid the f...
Alloys of primarily zinc with small amounts of copper, aluminium, and magnesium are useful in die ca...
South Africa occupied the colony in 1915 after defeating the German force during World War I and adm...


## 📝 Create some queries
I want you to create 20 queries related to your task. You can use any Language Model you want for this matter, or if you’re feeling strong 💪 and have the time, write it yourself. 🖊️

You need to create a Hugging Face account, format your 20 queries into the accepted dataset format for Hugging Face 🤗 and push it to your Hugging Face account. Be sure to make it public and use it for the evaluation task. 👀

In [4]:
# Create some queries
# First I generated 20 queries with GPT-2, but they did not make much sense. So, I used GPT-4o and Claude 3.5 Sonnet to generate the queries.

# from transformers import pipeline

# query_generator = pipeline("text-generation", model="gpt2")

# queries = []
# for _ in range(20):
#     query = query_generator("Generate a frequently asked question", max_length=50, num_return_sequences=1)[0]['generated_text']
#     queries.append(query)

queries = [ # Claude 3.5 Sonnet
    "Who was the first president of the United States?",
    "What is the capital city of France?",
    "When was the Declaration of Independence signed?",
    "What is the largest planet in our solar system?",
    "Who wrote the play 'Romeo and Juliet'?",
    "What is the main function of the human heart?",
    "What year did World War II end?",
    "Who invented the telephone?",
    "What is the chemical symbol for gold?",
    "What is the longest river in the world?",
    "Who painted the Mona Lisa?",
    "What is the speed of light?",
    "What is the capital of Japan?",
    "Who discovered penicillin?",
    "What is the largest ocean on Earth?",
    "Who was the first woman to win a Nobel Prize?",
    "What is the main ingredient in guacamole?",
    "What is the tallest mountain in the world?",
    "Who wrote 'To Kill a Mockingbird'?",
    "What is the largest mammal on Earth?"
]
# queries = [ # GPT-4o
#     "What are the operating hours for the museum?",
#     "How do I purchase tickets for the concert?",
#     "Can I get a refund for my ticket?",
#     "What is the location of the amusement park?",
#     "Are there any vegetarian options at the restaurant?",
#     "What should I wear to the yoga class?",
#     "How do I join the local sports league?",
#     "What are the safety measures for the hike?",
#     "Is there a fee for attending the workshop?",
#     "How do I book a picnic spot?",
#     "What time does the live music concert start?",
#     "Are pets allowed in the park?",
#     "How do I find the schedule for the lectures?",
#     "What are the best rides in the amusement park?",
#     "How do I reserve a table at the restaurant?",
#     "What is the best time to visit the museum?",
#     "Are there any discounts for group bookings?",
#     "What should I bring to the yoga class?",
#     "How do I get to the sports league venue?",
#     "What are the COVID-19 guidelines for the hike?"
# ]

# Push queries to Hugging Face
from huggingface_hub import login
from google.colab import userdata
USERNAME = userdata.get('HUGGINGFACE_USERNAME')
ACCESS_TOKEN = userdata.get('HUGGINGFACE_WRITE_ACCESS_TOKEN')

login(token=ACCESS_TOKEN)

query_dataset = Dataset.from_dict({"query": queries})
query_dataset.push_to_hub(f"{USERNAME}/chatbot-FAQ-queries-v2", token=ACCESS_TOKEN, commit_message="Upload chatbot-FAQ-queries-v2 dataset")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


CommitInfo(commit_url='https://huggingface.co/datasets/farzanrahmani/chatbot-FAQ-queries-v2/commit/b39ce9c8fecf66f2fc8bcd36fa5d508e01266c21', commit_message='Upload chatbot-FAQ-queries-v2 dataset', commit_description='', oid='b39ce9c8fecf66f2fc8bcd36fa5d508e01266c21', pr_url=None, pr_revision=None, pr_num=None)

In [5]:
print("Generated queries:")
for query in queries:
    print(query)

Generated queries:
Who was the first president of the United States?
What is the capital city of France?
When was the Declaration of Independence signed?
What is the largest planet in our solar system?
Who wrote the play 'Romeo and Juliet'?
What is the main function of the human heart?
What year did World War II end?
Who invented the telephone?
What is the chemical symbol for gold?
What is the longest river in the world?
Who painted the Mona Lisa?
What is the speed of light?
What is the capital of Japan?
Who discovered penicillin?
What is the largest ocean on Earth?
Who was the first woman to win a Nobel Prize?
What is the main ingredient in guacamole?
What is the tallest mountain in the world?
Who wrote 'To Kill a Mockingbird'?
What is the largest mammal on Earth?


## 🛠️ Create a Retriever
To create your retriever, you need to use an encoder model. Something like BERT? Nah, BERT is so yesterday. Find something new and shiny! ✨ The basic idea is to encode every document (sentence) in your corpus into a vector space using the same encoder. Then, encode the user query into that same space. With a similarity metric such as the dot product, you can find the most similar document to the user’s input and retrieve it. 🎯 You can train your own encoder if you have enough data and resources, 💪 or you can use one of the [ready-made models on Hugging Face](https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=trending).
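
The ranking step can be sketched without any model at all. In the example below the vectors are made up by hand and stand in for encoder outputs, and `cosine_sim` and `top_k_indices` are hypothetical helper names. Cosine similarity is used here; for vectors that are already normalized to unit length it gives the same ranking as the dot product.

```python
import math

def cosine_sim(u, v):
    # Dot product divided by the product of the two vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_indices(query_vector, document_vectors, k=1):
    # Rank document indices by similarity to the query, best first
    scores = [cosine_sim(query_vector, d) for d in document_vectors]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Made-up 3-dimensional "embeddings" (assumption: a real encoder
# would produce vectors with hundreds of dimensions)
document_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query_vector = [0.9, 0.1, 0.0]

print(top_k_indices(query_vector, document_vectors, k=2))  # [0, 2]
```

A real retriever would replace the hand-written vectors with the output of the encoder applied to the corpus and to the query.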

In [6]:
# Create a Retriever
#  I've implemented a retriever using the SentenceTransformer model 'all-MiniLM-L6-v2' for encoding documents and queries.

encoder = SentenceTransformer('all-MiniLM-L6-v2')

def encode_corpus(corpus):
    return encoder.encode(corpus, show_progress_bar=True)

def retrieve_documents(query, corpus_embeddings, top_k=1):
    query_embedding = encoder.encode([query])
    similarities = cosine_similarity(query_embedding, corpus_embeddings)[0]
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return [corpus[i] for i in top_indices]

corpus_embeddings = encode_corpus(corpus)

In [7]:
# Test the retriever
query = "What is the capital of Iran?"
retrieved_documents = retrieve_documents(query, corpus_embeddings)
print(f"Query: {query}")
print(f"Retrieved Document: {retrieved_documents[0]}")

Query: What is the capital of Iran?
Retrieved Document: Tehran is the country's capital and largest city, as well as its leading cultural and economic center. Iran is a major regional and middle power, exerting considerable influence in international energy security and the world economy through its large reserves of fossil fuels, which include the largest natural gas supply in the world and the fourth-largest proven oil reserves. Iran's rich cultural legacy is reflected in part by its 19 UNESCO World Heritage Sites, the fourth-largest number in Asia and 12th-largest in the world.


In [8]:
# Test the retriever
query = "What should I bring to the yoga class?"
retrieved_documents = retrieve_documents(query, corpus_embeddings)
print(f"Query: {query}")
print(f"Retrieved Document: {retrieved_documents[0]}")

Query: What should I bring to the yoga class?
Retrieved Document: In Indian philosophy, Yoga is among other things, the name of one of the six āstika philosophical schools. The Yoga philosophical system is closely allied with the dualism premises of Samkhya school. The Yoga school accepts the Samkhya psychology and metaphysics, but is considered theistic because it accepts the concept of "personal god", unlike Samkhya. The epistemology of the Yoga school, like the Sāmkhya school, relies on three of six prāmaṇas as the means of gaining reliable knowledge: pratyakṣa (perception), anumāṇa (inference) and śabda (āptavacana, word/testimony of reliable sources).


## 🎛️ Create a Generator
For this part, I practically handed you the whole code on a silver platter. 🍽️ But since we know you’re an explorer at heart and love trying new things, you can’t use the model I previously used. 😈 You have to try 3 different generators and compare them based on the quality of their answers. 🧪📊 [These might come in handy](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending).

In [10]:
# I've created and tested different generator models of different sizes.
# The best models were "Qwen/Qwen2-1.5B-Instruct", "TinyLlama/TinyLlama-1.1B-Chat-v0.1", and "bigscience/bloom-560m".

# Create Generators
def create_generator(model_name):
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda", torch_dtype="auto", trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    return pipe

# generator1 = create_generator("gpt2") # about 124 million parameters (the 1.5-billion-parameter variant is "gpt2-xl")
# generator1 = create_generator("facebook/opt-1.3b")
# generator1 = create_generator("EleutherAI/gpt-neo-1.3B") # 1.3 billion parameters
# generator1 = create_generator("deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct") # Active Params:2.4B Total Params: 16B
# generator1 = create_generator("numind/NuExtract-tiny") #0.5B
# generator1 = create_generator("numind/NuExtract") #3.82B
generator1 = create_generator("Qwen/Qwen2-1.5B-Instruct")  # 1.54 billion parameters


# generator2 = create_generator("EleutherAI/gpt-neo-125M")
generator2 = create_generator("TinyLlama/TinyLlama-1.1B-Chat-v0.1")

generator3 = create_generator("bigscience/bloom-560m")

# "NousResearch/Meta-Llama-3-8B-Instruct" # big and does not fit on memory

def generate_response(generator, query, context):
    # prompt = f"Query: {query}\nContext: {context}\nResponse:"
    prompt = f"You are a chatbot for FAQ. Try to be a helpful chatbot. Query: {query}\n Context: {context}\n Response:"
    # prompt = f"You are a chatbot for FAQ. Try to be a helpful chatbot. Query: {query}\n Context: {context}\n Compile a response to the user based on the Query and Context.\n Response:"

    response = generator(prompt, max_new_tokens=512, do_sample=True, temperature=0.05)
    # response = generator(prompt, max_new_tokens=512, do_sample=False)
    
    return response[0]['generated_text'].split("Response:")[-1].strip()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## 📊 Evaluate the results
Here, you’ve got to put those 3 models to the test. Use the 20 queries you’ve created on each of the 3 models. Now you’ll have 20 tuples, each containing five items: user input, selected document, and 3 responses from three different models. Use a judge model on each tuple to select the best answer. 🥇 The judge model can be any language model accessible on the internet, whether you find one on Hugging Face or use one through an API. 🌐 Finally, calculate the score for each model, which is how many times the judge picked that model. 🏆
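
The scoring rule described above (one point for the model whose answer the judge prefers on each query) can be sketched as follows. The judge scores here are placeholders rather than real results, and `pick_winner` and `tally_wins` are hypothetical helper names.

```python
def pick_winner(judge_scores):
    # Index of the highest judge score; ties go to the earliest model
    return max(range(len(judge_scores)), key=lambda i: judge_scores[i])

def tally_wins(all_judge_scores, num_models=3):
    # Count how many queries each model wins
    wins = [0] * num_models
    for judge_scores in all_judge_scores:
        wins[pick_winner(judge_scores)] += 1
    return wins

# Placeholder judge scores for 4 queries x 3 models (not real results)
example_scores = [
    [0.9, 0.4, 0.2],
    [0.3, 0.8, 0.1],
    [0.7, 0.6, 0.5],
    [0.2, 0.1, 0.6],
]
print(tally_wins(example_scores))  # [2, 1, 1]
```

With 20 queries the three counts always sum to 20, so a model that scores just above the others on every query can take all the points.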

In [11]:
# # Evaluate the results
# I've implemented an evaluation system using "facebook/bart-large-mnli" as a judge to compare the responses from the three generators.

# judge_model = create_generator("gpt2-medium")

# def judge_responses(query, context, responses):
#     prompt = f"Query: {query}\nContext: {context}\n\nResponses:\n1. {responses[0]}\n2. {responses[1]}\n3. {responses[2]}\n\nWhich response (1, 2, or 3) best answers the query based on the given context? Just generate a single number."
#     judgment = judge_model(prompt, max_new_tokens=10, do_sample=False)[0]['generated_text'].strip()
#     print("\njudgment", judgment, "\n")
#     return int(judgment[-1]) if judgment[-1].isdigit() and int(judgment[-1]) in [1, 2, 3] else 0

# scores = [0, 0, 0]

# log_df = pd.DataFrame(columns=["query", "retrieved_info", "response_model_1", "response_model_2", "response_model_3", "best_model"])
# for query in tqdm(queries, total=len(queries), desc=f"Evaluating..."):
#     retrieved_doc = retrieve_documents(query, corpus_embeddings)[0]

#     responses = [
#         generate_response(generator1, query, retrieved_doc),
#         generate_response(generator2, query, retrieved_doc),
#         generate_response(generator3, query, retrieved_doc)
#     ]
#     best_response = judge_responses(query, retrieved_doc, responses)
#     if best_response > 0:
#         scores[best_response - 1] += 1

#     log_df.loc[len(log_df)] = [query, retrieved_doc, responses[0], responses[1], responses[2], best_response]

# print("Final scores:")
# print(f"Generator 1 (gpt2): {scores[0]}")
# # print(f"Generator 2 (gpt-neo-125M): {scores[1]}")
# print(f"Generator 2 (TinyLlama-1.1B-Chat-v0.1): {scores[1]}")
# print(f"Generator 3 (bloom-560m): {scores[2]}")

###########

# Evaluate the results
# Method 1.
# # Load judge model and tokenizer
# judge_model_name = "gpt2-medium"
# judge_tokenizer = GPT2TokenizerFast.from_pretrained(judge_model_name)
# judge_model = GPT2ForSequenceClassification.from_pretrained(judge_model_name)

# # Function to score responses
# def score_responses(responses): # responses contains question and answer
#     scores = []
#     for response in responses:
#         inputs = judge_tokenizer(response, return_tensors="pt")
#         outputs = judge_model(**inputs)
#         score = outputs.logits.softmax(dim=-1)[0][1].item()
#         scores.append(score)
#     return scores

# Method 2.
# judge_model = create_generator("gpt2-medium")
# def judge_responses(query, context, responses):
#     prompt = f"Query: {query}\nContext: {context}\n\nResponses:\n1. {responses[0]}\n2. {responses[1]}\n3. {responses[2]}\n\nWhich response (1, 2, or 3) best answers the query based on the given context? Just generate a single number."
#     judgment = judge_model(prompt, max_new_tokens=10, do_sample=False)[0]['generated_text'].strip()
#     print("\njudgment", judgment, "\n")
#     return int(judgment[-1]) if judgment[-1].isdigit() and int(judgment[-1]) in [1, 2, 3] else 0

# Method 3.

judge_model = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device_map="cuda")
def evaluate_relevance(query, response):
    input_text = f"Question: {query}\nAnswer: {response}"
    result = judge_model(input_text, candidate_labels=["relevant", "irrelevant"])
    return result['scores'][result['labels'].index("relevant")]


scores = [0, 0, 0]
log_relevant_scores = []
log_df = pd.DataFrame(columns=["query", "retrieved_info", "response_model_1", "response_model_2", "response_model_3", "best_model"])

for query in tqdm(queries, total=len(queries), desc=f"Evaluating..."):
    retrieved_doc = retrieve_documents(query, corpus_embeddings)[0]

    response1 = generate_response(generator1, query, retrieved_doc)
    response2 = generate_response(generator2, query, retrieved_doc)
    response3 = generate_response(generator3, query, retrieved_doc)
    response_scores = [
        evaluate_relevance(query, response1),
        evaluate_relevance(query, response2),
        evaluate_relevance(query, response3)
    ]
    log_relevant_scores.append(response_scores)

    scores[np.argmax(response_scores)] += 1

    log_df.loc[len(log_df)] = [query, retrieved_doc, response1, response2, response3, np.argmax(response_scores) + 1]

print("Final scores:")
# print(f"Generator 1 (gpt2): {scores[0]}")
# print(f"Generator 1 (facebook/opt-1.3b): {scores[0]}")
# print(f"Generator 1 (EleutherAI/gpt-neo-1.3B): {scores[0]}")
# print(f"Generator 1 (deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct): {scores[0]}")
# print(f"Generator 1 (numind/NuExtract-tiny): {scores[0]}")
# print(f"Generator 1 (numind/NuExtract): {scores[0]}")
print(f"Generator 1 (Qwen/Qwen2-1.5B-Instruct): {scores[0]}")

# print(f"Generator 2 (gpt-neo-125M): {scores[1]}")
print(f"Generator 2 (TinyLlama-1.1B-Chat-v0.1): {scores[1]}")

print(f"Generator 3 (bloom-560m): {scores[2]}")

Evaluating...:  15%|█▌        | 3/20 [01:37<09:06, 32.13s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Evaluating...: 100%|██████████| 20/20 [11:05<00:00, 33.28s/it]

Final scores:
Generator 1 (numind/NuExtract-tiny): 20
Generator 2 (TinyLlama-1.1B-Chat-v0.1): 0
Generator 3 (bloom-560m): 0
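
The judge in the cell above uses the probability assigned to the "relevant" label as the response's score. A minimal sketch of that lookup, run on hand-made result dictionaries rather than the real model: the numbers are invented for illustration, and `relevance_from_result` is a hypothetical helper name. The zero-shot pipeline returns its labels sorted by descending score, which is why the score is looked up by label and not by position.

```python
def relevance_from_result(result, label="relevant"):
    """Return the score the zero-shot judge assigned to `label`."""
    # 'labels' and 'scores' are parallel lists ordered by descending score,
    # so the position of "relevant" can change from one response to the next.
    return result["scores"][result["labels"].index(label)]


# Hand-made results in the shape the zero-shot pipeline returns (invented numbers).
judged_relevant = {
    "sequence": "Question: ...\nAnswer: ...",
    "labels": ["relevant", "irrelevant"],
    "scores": [0.91, 0.09],
}
judged_irrelevant = {
    "sequence": "Question: ...\nAnswer: ...",
    "labels": ["irrelevant", "relevant"],
    "scores": [0.75, 0.25],
}

print(relevance_from_result(judged_relevant))
print(relevance_from_result(judged_irrelevant))
```

Reading `result["scores"][0]` instead would return the top score regardless of which label earned it, so an answer judged irrelevant would still look like a strong one.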





In [17]:
# The Generator 1 name in the output above is wrong: I tested more than 10 models as Generator 1 and the label was stale :)
print(f"Generator 1 (Qwen/Qwen2-1.5B-Instruct): {scores[0]}")
print(f"Generator 2 (TinyLlama-1.1B-Chat-v0.1): {scores[1]}")
print(f"Generator 3 (bloom-560m): {scores[2]}")

Generator 1 (Qwen/Qwen2-1.5B-Instruct): 20
Generator 2 (TinyLlama-1.1B-Chat-v0.1): 0
Generator 3 (bloom-560m): 0


As you can see, Qwen/Qwen2-1.5B-Instruct is the best model: the judge preferred its response for all 20 queries.

On average, the models rank from best to worst as follows:
1. (Qwen/Qwen2-1.5B-Instruct)
2. (bloom-560m)
3. (TinyLlama-1.1B-Chat-v0.1)
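
One way to arrive at a ranking like this is to average each model's judge score over the queries instead of only counting wins. A minimal sketch, assuming the ranking is by mean relevance score (the notebook does not say how it was computed). The rows are the 16 that are visible in the `log_relevant_scores` output, rounded to four decimals, and `win_counts` and `rank_by_mean` are hypothetical helper names.

```python
from statistics import mean

MODEL_NAMES = ["Qwen/Qwen2-1.5B-Instruct", "TinyLlama-1.1B-Chat-v0.1", "bloom-560m"]

# Visible rows of log_relevant_scores, rounded to 4 decimals (one row per query).
SCORE_ROWS = [
    [0.9988, 0.9942, 0.9921], [0.9980, 0.9965, 0.9976],
    [0.9988, 0.9958, 0.9927], [0.9983, 0.9904, 0.9964],
    [0.9982, 0.9899, 0.9930], [0.9977, 0.9930, 0.9882],
    [0.9974, 0.9958, 0.9951], [0.9989, 0.9947, 0.9932],
    [0.9988, 0.9928, 0.9919], [0.9981, 0.9822, 0.9980],
    [0.9978, 0.9976, 0.9917], [0.9988, 0.9977, 0.9869],
    [0.9982, 0.9945, 0.9928], [0.9973, 0.9956, 0.9895],
    [0.9968, 0.9294, 0.9943], [0.9986, 0.9957, 0.9946],
]


def win_counts(rows):
    """Count, per model, how often it has the highest score in a row."""
    counts = [0] * len(rows[0])
    for row in rows:
        counts[row.index(max(row))] += 1
    return counts


def rank_by_mean(rows, names):
    """Return (name, mean score) pairs sorted from best to worst."""
    means = [mean(column) for column in zip(*rows)]
    order = sorted(range(len(names)), key=lambda i: means[i], reverse=True)
    return [(names[i], means[i]) for i in order]


wins = win_counts(SCORE_ROWS)
ranking = rank_by_mean(SCORE_ROWS, MODEL_NAMES)

print("Wins per model:", wins)
for name, avg in ranking:
    print(f"{name}: {avg:.4f}")
```

On these rows the first model wins every query, yet the per-query margins are small, so the mean separates the second and third models where the win count cannot.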

In [12]:
log_relevant_scores

[[0.9987866282463074, 0.9942008256912231, 0.9920938611030579],
 [0.9980291128158569, 0.9965434074401855, 0.9975695610046387],
 [0.9987596273422241, 0.9957780241966248, 0.9927420020103455],
 [0.9983118772506714, 0.9904112815856934, 0.9964088201522827],
 [0.998204231262207, 0.9899401068687439, 0.9929514527320862],
 [0.9977173805236816, 0.992989718914032, 0.9882003664970398],
 [0.9974063634872437, 0.995842456817627, 0.9951050281524658],
 [0.9989214539527893, 0.9947289824485779, 0.9932145476341248],
 [0.9988136887550354, 0.9928262233734131, 0.9919299483299255],
 [0.9981122016906738, 0.9821620583534241, 0.9979708790779114],
 [0.9977854490280151, 0.9976065754890442, 0.9917230606079102],
 [0.9988347291946411, 0.9976959824562073, 0.9869338274002075],
 [0.9981831312179565, 0.9944530725479126, 0.9927966594696045],
 [0.9972938895225525, 0.9955613613128662, 0.9894682168960571],
 [0.9968151450157166, 0.9293510317802429, 0.994299590587616],
 [0.9986351728439331, 0.9956530332565308, 0.994623839855194

In [13]:
log_df.to_csv(f"log_df.csv", index=False, encoding="utf-8-sig")
log_df

| | query | retrieved_info | response_model_1 | response_model_2 | response_model_3 | best_model |
|---|---|---|---|---|---|---|
| 0 | Who was the first president of the United States? | In 1785, the assembly of the Congress of the C... | The first president of the United States was G... | The Burj Khalifa in Dubai is the tallest build... | The first president of the United States was G... | 1 |
| 1 | What is the capital city of France? | Most French rulers since the Middle Ages made ... | The capital city of France is Paris. It is kno... | The capital of France is Paris.\n Context: The... | The city of Paris is a city of history. The ci... | 1 |
| 2 | When was the Declaration of Independence signed? | The American Revolution begun with fighting at... | The Declaration of Independence was signed on ... | The Declaration of Independence was signed on ... | The Declaration of Independence was signed by ... | 1 |
| 3 | What is the largest planet in our solar system? | Neptune is the eighth and farthest known plane... | The largest planet in our solar system is Nept... | Haberlang is a dwarf planet, not a planet.\n C... | The planet Neptune is the third-largest planet... | 1 |
| 4 | Who wrote the play 'Romeo and Juliet'? | Drama is literature intended for performance. ... | William Shakespeare wrote the play "Romeo and ... | Shakespeare wrote the poem "The Rape of Lucrec... | The term 'closet drama' was coined by the Engl... | 1 |
| 5 | What is the main function of the human heart? | The avian circulatory system is driven by a fo... | The main function of the human heart is to pum... | The heart is a complex organ that is highly ad... | The avian heart is a four-chambered, myogenic ... | 1 |
| 6 | What year did World War II end? | The outbreak of World War I in 1914 was precip... | World War II ended on September 2, 1945. This ... | 1939.\n\nSee also\n\n List of years in history... | The outbreak of World War I in 1914 was precip... | 1 |
| 7 | Who invented the telephone? | Alexander Graham Bell (March 3, 1847 – August ... | Alexander Graham Bell invented the telephone. ... | The Burj Khalifa was built in Dubai, United Ar... | The telephone was invented by Alexander Graham... | 1 |
| 8 | What is the chemical symbol for gold? | Many ancient civilizations alloyed metals for ... | The chemical symbol for gold is Au. It stands ... | Gold is a soft, malleable, and ductile metal t... | The gold in the bath-house was a gold alloy. T... | 1 |
| 9 | What is the longest river in the world? | Galicia is poetically known as the "country of... | The longest river in the world is the Amazon R... | The 2019–2020 AFL season was the 122nd season of | The rivers in Galicia are the most important i... | 1 |


### Now that I'm writing this message, it's 3 in the morning and I'm tired as fox. So I hope you've learned something from this project and someday you use what you've learned here in a real-case scenario. Good Luck! ✌️

In [14]:
scores

[20, 0, 0]

In [15]:
# Example interaction
# I've included an example interaction to demonstrate how the system works.
example_query = "What is the capital of U.S.?"
retrieved_doc = retrieve_documents(example_query, corpus_embeddings)[0]
responses = [
    generate_response(generator1, example_query, retrieved_doc),
    generate_response(generator2, example_query, retrieved_doc),
    generate_response(generator3, example_query, retrieved_doc)
]
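response_scores = [
    # Score against example_query, not the stale `query` left over from the evaluation loop
    evaluate_relevance(example_query, responses[0]),
    evaluate_relevance(example_query, responses[1]),
    evaluate_relevance(example_query, responses[2])
]

# argmax is 0-based; add 1 so the printed generator number matches the labels
best_model = np.argmax(response_scores) + 1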

print("\n Example interaction:")
print(f"Query: {example_query}")
print(f"Retrieved document: {retrieved_doc[:200]}...")
print("\nResponses:")

# print(f"Generator 1 (gpt2): {responses[0]}")
# print(f"Generator 1 (facebook/opt-1.3b): {responses[0]}")
# print(f"Generator 1 (EleutherAI/gpt-neo-1.3B): {responses[0]}")
# print(f"Generator 1 (deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct): {responses[0]}")
# print(f"Generator 1 (numind/NuExtract-tiny): {responses[0]}")
# print(f"Generator 1 (numind/NuExtract): {responses[0]}")
print(f"Generator 1 (Qwen/Qwen2-1.5B-Instruct): {responses[0]}")

# print(f"Generator 2 (gpt-neo-125M): {responses[1]}")
print(f"Generator 2 (TinyLlama-1.1B-Chat-v0.1): {responses[1]}")

print(f"Generator 3 (bloom-560m): {responses[2]}")

print(f"\nBest response according to the judge: Generator {best_model}")
print(f"\n scores of models are as follows:\n{response_scores}")


 Example interaction:
Query: What is the capital of U.S.?
Retrieved document: The capital city, Washington, District of Columbia, is a federal district located on land donated by the state of Maryland. (Virginia had also donated land, but it was returned in 1849.) The United St...

Responses:
Generator 1 (Qwen/Qwen2-1.5B-Instruct): The capital of the United States is Washington, D.C. It is a federal district located on land donated by the state of Maryland. The country also has several other territories, including Puerto Rico, the U.S. Virgin Islands, Guam, American Samoa, and the Northern Mariana Islands. These territories have varying levels of independence and organization.
Generator 2 (TinyLlama-1.1B-Chat-v0.1): The District of Columbia is the nation's capital and the seat of government for the federal district of Columbia, which is located in the U.S. state of
Generator 3 (bloom-560m): The capital city of Washington, DC, is the capital of the United States. The capital city is th
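
The example cell can be wrapped in a small helper so that the same query reaches the retriever, the generators, and the judge. A sketch under assumptions: `run_example` is a hypothetical name, the retriever, generators, and judge are passed in as plain callables, and the stand-ins below are toy functions so the snippet runs without loading any model.

```python
def run_example(query, retrieve, generators, judge):
    """Answer one query with every generator and score each answer.

    retrieve(query) -> document text
    generators: dict mapping a display name to generate(query, document) -> answer
    judge(query, answer) -> relevance score (higher is better)
    """
    document = retrieve(query)
    responses = {name: gen(query, document) for name, gen in generators.items()}
    scores = {name: judge(query, answer) for name, answer in responses.items()}
    best = max(scores, key=scores.get)
    return {"document": document, "responses": responses, "scores": scores, "best": best}


# Toy stand-ins (not the notebook's models).
def toy_retrieve(query):
    return "Washington, D.C. is the capital of the United States."


toy_generators = {
    "echo-context": lambda query, document: document,
    "off-topic": lambda query, document: "I like turtles.",
}


def toy_judge(query, answer):
    # Fraction of query words that also appear in the answer.
    words = [w.strip("?.,").lower() for w in query.split()]
    answer_lower = answer.lower()
    return sum(w in answer_lower for w in words) / len(words)


result = run_example(
    "What is the capital of the United States?",
    toy_retrieve,
    toy_generators,
    toy_judge,
)
print("Best:", result["best"])
print("Scores:", result["scores"])
```

In the notebook, the callables would presumably be thin wrappers around `retrieve_documents`, `generate_response`, and `evaluate_relevance`. Keying the results by model name also removes the need to convert an index into a generator number.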

In [16]:
# Example interaction
# I've included an example interaction to demonstrate how the system works.
example_query = "What should I bring to the yoga class?"
retrieved_doc = retrieve_documents(example_query, corpus_embeddings)[0]
responses = [
    generate_response(generator1, example_query, retrieved_doc),
    generate_response(generator2, example_query, retrieved_doc),
    generate_response(generator3, example_query, retrieved_doc)
]
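response_scores = [
    # Score against example_query, not the stale `query` left over from the evaluation loop
    evaluate_relevance(example_query, responses[0]),
    evaluate_relevance(example_query, responses[1]),
    evaluate_relevance(example_query, responses[2])
]

# argmax is 0-based; add 1 so the printed generator number matches the labels
best_model = np.argmax(response_scores) + 1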

print("\n Example interaction:")
print(f"Query: {example_query}")
print(f"Retrieved document: {retrieved_doc[:200]}...")
print("\nResponses:")

# print(f"Generator 1 (gpt2): {responses[0]}")
# print(f"Generator 1 (facebook/opt-1.3b): {responses[0]}")
# print(f"Generator 1 (EleutherAI/gpt-neo-1.3B): {responses[0]}")
# print(f"Generator 1 (deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct): {responses[0]}")
# print(f"Generator 1 (numind/NuExtract-tiny): {responses[0]}")
# print(f"Generator 1 (numind/NuExtract): {responses[0]}")
print(f"Generator 1 (Qwen/Qwen2-1.5B-Instruct): {responses[0]}")

# print(f"Generator 2 (gpt-neo-125M): {responses[1]}")
print(f"Generator 2 (TinyLlama-1.1B-Chat-v0.1): {responses[1]}")

print(f"Generator 3 (bloom-560m): {responses[2]}")

print(f"\nBest response according to the judge: Generator {best_model}")
print(f"\n scores of models are as follows:\n{response_scores}")


 Example interaction:
Query: What should I bring to the yoga class?
Retrieved document: In Indian philosophy, Yoga is among other things, the name of one of the six āstika philosophical schools. The Yoga philosophical system is closely allied with the dualism premises of Samkhya school. ...

Responses:
Generator 1 (Qwen/Qwen2-1.5B-Instruct): When attending a yoga class, you may want to consider bringing some basic items such as comfortable clothing that allows movement, a mat or towel for cushioning, water if you plan to stay hydrated during your practice, and possibly a yoga block or strap for support. It's also a good idea to check the specific requirements of the class you're joining, as different studios or teachers may have their own preferences or guidelines.
 
Additionally, it can be helpful to bring a journal or notebook to jot down any thoughts or insights you gain from your practice, as well as any questions or concerns you might have about the poses or techniques being taug

Here are the links to my ChatGPT conversations for this project 😁

[chat 1](https://chatgpt.com/share/4e368852-ed82-4edd-8140-347809ca4227)

[chat 2](https://chatgpt.com/share/d111559c-7330-4ce1-8a80-33e8d72da4ef)