# IUST Computer Engineering Department 🏫
## Introduction to Natural Language Processing 📚 (The Final Project)
### Course Instructor: Dr. Marzieh Davoodabadi Farahani 👩‍🏫
### Project Teaching Assistant: Erfan Moosavi Monazzah (tel: @ErfanMoosavi2000) 📞
-------------------------------------------------------------------------------<br>
The objective of this project is to acquaint you with the fundamentals of Retrieval Augmented Generation (RAG). Be sure to explore various options and address challenges in a creative manner. 🎯

**Project Guidelines** 📝
- Avoid cheating at all costs. If a set of submissions is found to be [plagiarized](https://translate.google.as/?sl=en&tl=fa&text=Very%20hard%20word%2C%20I%20know%2C%20here%27s%20the%20meaning%3A%0Aplagiarized&op=translate), only one will be randomly chosen for grading. The others will fail the project. ❌
- You are allowed to use any document, article, paper, or video as a resource for writing your code, provided you include a link to the material used. 📖
- The use of Language Learning Models (LLMs), ChatBots, and Copilots is encouraged. If you utilize any of these tools, make sure to attach the chat history that led you to the answer to your question, or the code, to this .ipynb document. (You must provide the entire chat, not just the final answer or your initial prompt.) 💻
- You may not submit any additional documents, files, etc., along with this document. Only solutions, codes, explanations, etc., in this document will be graded. 📄
- You are required to implement everything (except the Language Modeling parts) from scratch. The use of libraries like langchain, llama_index, etc., is not permitted for this purpose. 🚫
- Please adhere to the code guidelines provided throughout the documents. 📝 I’ve spent time in a library 📚 crafting all of this, so if you overlook them, you’ll lose the points allocated for that section. ❌
- We need to use GPUs for this assignment, don't forget to turn on GPU usage for your notebook session.

-------------------------------------------------------------------------------<br>
# Alright, let's get started. 🚀

## What is RAG? 🤔
We've all used ChatGPT and experienced moments when it starts to generate content that is often incorrect or unrelated to our query. Do you know why this happens? These Large Language Models (LLMs) are not magical entities; they are simply models trained on a vast amount of text. 📚 You could even consider a significant portion of the internet. However, this is not all the data available in the world, because data is not a static concept. You yourself generate some data every day through your use of the Internet, Social Media, and so on. 🌐💻📱

So, no matter how much data you use to train your LLM, you always end up encountering new data. This is one of the reasons behind the famous ChatGPT response that tells you it only knows things up to a certain date. 📅 Also, these models tend to hallucinate too. It means they provide incorrect answers but in a very convincing manner. 🎭

On the other hand, we have retrieval techniques. Don't worry if it sounds complicated (it actually isn't easy, you may need to take a course to familiarize yourself with these concepts 😅, but that's not necessary for this project), but you use it on a daily basis. You can think of Search Engines (like Google, for example) as a complex form of information retrieval. 🔍

So, one day, people came up with this idea that it would be cool if ChatGPT could search Google for us, read the articles for us, summarize what it read, and tell us that. 📖 So, this is not exactly what RAG is, but it's something similar. We have a corpus (a large amount of data) and a query (what a user typed as input). Now, we search through this corpus using techniques related to vectors and vector databases, and find the most similar items in our corpus to the query. Then, we pass these items to an LLM and ask for a structured, well-formatted, user-friendly output. 📈📊

## I'm Interested in the Technical Details, What Should I Read? 📚🔍
- I strongly recommend reading the [original RAG paper](https://arxiv.org/abs/2005.11401). If you need help understanding the paper or have any questions about it, feel free to reach out to me via Telegram or find me on the second floor of the department in the NLP lab on Sundays and Tuesdays. 📖
- There appears to be a [comprehensive 2.5-hour course](https://www.freecodecamp.org/news/mastering-rag-from-scratch/) available. I haven't personally watched it, but if you find a better one, let me know so I can update this document. 🎥
- Here is [an article](https://www.smashingmagazine.com/2024/01/guide-retrieval-augmented-generation-language-models/) that explains the concepts very well. Initially, I wanted to use this article as the basis for this project, but unfortunately, the llama_index library used in the article seems to be outdated, so most of the code would need to be rewritten. On second thought, I found it more useful to focus on core concepts rather than learning specific libraries. You might want to check out some libraries like langchain or llama_index which provide a lot of tools for RAG. (But not for this project) 📝💡
- Don't hesitate to use Google, ask chatbots about any new concepts and terms. If you use search engine-aware chatbots like Microsoft Copilot, they provide links for each part of their answers which is useful if you want to delve deeper into that part. 🌐🤖
- Lastly, we have [the article](https://learnbybuilding.ai/tutorials/rag-from-scratch) that serves as the foundation for this project. 📚🔍

# Learn
First, we’re going to go through a simple RAG implementation. It’s going to be similar to the article, except for the (LLM) part. For that, I’m going to use Hugging Face. 🤗 I’ll also try to explain the code in simple terms, but feel free to read the article if you prefer their writing style.

## Let's Install the Necessary Libraries 📚🔧
Did you know that using the `--quiet` or `-q` option with the `pip install` command minimizes the output displayed on your screen? 🖥️ This can make your terminal less cluttered. Also, using `-U` will upgrade the libraries if they were previously installed. This is particularly useful for certain libraries like `transformers` that are frequently updated. 🔄

In [5]:
!pip install -U accelerate transformers datasets --quiet
!pip install -i https://pypi.org/simple/ bitsandbytes
!pip install flash_attn
!pip install quanto
!pip install -U "huggingface_hub[cli]"
!pip install -q openai tenacity

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m314.0/314.0 kB[0m [31m2.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.3/9.3 MB[0m [31m40.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m547.8/547.8 kB[0m [31m54.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m40.8/40.8 MB[0m [31m17.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m17.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m64.9/64.9 kB[0m [31m9.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m25.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m17.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━

In [2]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

## Gather a Corpus 📚
Technically, a corpus refers to a large and structured set of texts. However, for the sake of our discussion, let’s consider our collection as a “corpus”, even though it might not be large in the traditional sense. 😉

In [None]:
corpus_of_documents = [
    "Take a leisurely walk in the park and enjoy the fresh air.",
    "Visit a local museum and discover something new.",
    "Attend a live music concert and feel the rhythm.",
    "Go for a hike and admire the natural scenery.",
    "Have a picnic with friends and share some laughs.",
    "Explore a new cuisine by dining at an ethnic restaurant.",
    "Take a yoga class and stretch your body and mind.",
    "Join a local sports league and enjoy some friendly competition.",
    "Attend a workshop or lecture on a topic you're interested in.",
    "Visit an amusement park and ride the roller coasters."
]

## Create a Retriever 🕵️‍♂️
Now, we’re going to create a simple retriever. The role of the retriever is to compare the user’s query with a large corpus of text and find those that are most similar in context. (You know what context is by now, don’t you? 😊 If you’ve forgotten, refer back to your initial lectures). For now, let’s say we want to find similar text based on simple similarity metrics. The code is straightforward, and I have faith in you, chief! Dive into the code. 👨‍💻

In [None]:
def jaccard_similarity(query, document):
    query = query.lower().split(" ")
    document = document.lower().split(" ")
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection)/len(union)

Hey, you may want to look at wikipedia page for [Jaccard Similarity](https://en.wikipedia.org/wiki/Jaccard_index).

In [None]:
def return_response(query, corpus):
    similarities = []
    for doc in corpus:
        similarity = jaccard_similarity(user_input, doc)
        similarities.append(similarity)
    return corpus_of_documents[similarities.index(max(similarities))]

## Create a Generator 🖥️
Now, we’re going to create a generator. This will help us compile the information retrieved into a well-structured and user-friendly text.

OK, let's say in a senario, we ask user what they like to do, the their answer is this:

In [None]:
user_input = "I like to hike"

Now by using the retrieval model I find this activity that best fits this user.

In [None]:
relevant_document = return_response(user_input, corpus_of_documents)
print(relevant_document)

Go for a hike and admire the natural scenery.


The answer seems good enough, but we can do better, yeah?

Let’s import a Language Model. I’m going to try out Microsoft Phi-3 because it recently hit the market, and I haven’t had a chance to try it for myself yet. So, I’m seizing this opportunity to do so! 😊👨‍💻

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

Downloading the model gonna take a while, use this time to rest your eyes for a bit. 😊👀💤

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

config.json:   0%|          | 0.00/3.35k [00:00<?, ?B/s]

configuration_phi3.py:   0%|          | 0.00/10.4k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-128k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_phi3.py:   0%|          | 0.00/73.8k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-128k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors.index.json:   0%|          | 0.00/16.3k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.67G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/172 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/3.18k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/568 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

Now we try to get the LLM to become our generator. We simply place the retrieved information and user query in the following prompt and ask the model for well formatted text.

In [None]:
prompt = """You are a bot that makes recommendations for activities. Try to be helpful recommender system.
This is the recommended activity: {relevant_document}
The user input is: {user_input}
Compile a recommendation to the user based on the recommended activity and the user input."""

In [None]:
prompt = prompt.replace("{relevant_document}", relevant_document).replace("{user_input}", user_input)
print(prompt)

You are a bot that makes recommendations for activities. Try to be helpful recommender system.
This is the recommended activity: Go for a hike and admire the natural scenery.
The user input is: I like to hike
Compile a recommendation to the user based on the recommended activity and the user input.


In [None]:
messages = [
    {"role": "user", "content": prompt},
]

Here's the augmented generated text

In [None]:
output = pipe(messages, **generation_args)
print(output[0]['generated_text'])

Based on your interest in hiking and our recommended activity, I suggest you embark on a scenic hike in a beautiful natural environment. This will not only allow you to enjoy the physical benefits of hiking but also provide a wonderful opportunity to admire breathtaking landscapes, observe diverse flora and fauna, and experience the tranquility of nature. Don't forget to bring along essentials like water, snacks, and appropriate hiking gear for a safe and enjoyable adventure!


## Very Cool, but Not Perfect! 😎👌
Alright, you’ve just seen a very basic example of RAG. However, there are some issues present. The corpus is small, and the documents in the corpus are short sentences, which causes the Language Model (LM) to generate some text on its own. 📚🤖

Also, our retriever is not very efficient and it may encounter bugs in some cases. For instance, even when users specify that they are not interested in a certain activity, the retriever might still bring up that activity for them. 🐜🔍

So, in this project, you’re going to address some of these issues. The rest of this document consists of some empty cells and tips for you on how to fill them with code. Let’s get coding! 👨‍💻🚀

# The Project

## Determine Your Task 🎯
What do you aim to implement with RAG? A recommender system? 🎁 A chatbot for a website’s FAQ? 💬 A medical advisor? 🩺 Or perhaps something else entirely?

Specify your objective in this cell.

In [1]:
task_title = "Psychology Recommendation"
url_for_more_information = "https://huggingface.co/datasets/jkhedri/psychology-dataset?row=9"

print(f"My task is: {task_title}")
print(f'For more information see: {url_for_more_information}')

My task is: Psychology Recommendation
For more information see: https://huggingface.co/datasets/jkhedri/psychology-dataset?row=9


## 🧐 Find or gather a corpus
Remember the fake corpus? 📚 It’s time to switch things up and use something real. 🌐 You need to use a dataset from  [huggingface datasets](https://huggingface.co/datasets) for this project. 🚀 Don’t use files that are outside of this notebook, this notebook should be able to run on its own without depending on anything external. 💻👍


In [2]:
from datasets import load_dataset

dataset = load_dataset("jkhedri/psychology-dataset")
train_ds = dataset['train']
train_ds

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Dataset({
    features: ['question', 'response_j', 'response_k'],
    num_rows: 9846
})

In [3]:
corpus_of_documents = train_ds.to_pandas()[['question', 'response_j']]
corpus_of_documents.head()

Unnamed: 0,question,response_j
0,I'm feeling really anxious lately and I don't ...,"It's common to feel anxious at times, and ther..."
1,I think my partner may be cheating on me. What...,It's understandable to feel worried and suspic...
2,I'm feeling really overwhelmed with work and s...,It sounds like you're going through a difficul...
3,I'm having trouble sleeping and I'm constantly...,It's important to talk to your doctor about an...
4,"I've been feeling really anxious lately, and I...",It's common to feel anxious without knowing th...


## 📝 Create some queries
I want you to create 20 queries related to your task. You can use any Language Model you want for this matter, or if you’re feeling strong 💪 and have the time, write it yourself. 🖊️

You need to create a Hugging Face account, format your 20 queries into the accepted dataset format for Hugging Face 🤗 and push it to your Hugging Face account. Be sure to make it public and use it for the evaluation task. 👀

In [4]:
from datasets import Dataset

psychological_issues_solutions = {
    "issue_1": {
        "question": "I constantly feel anxious in social situations and avoid going out because of it.",
        "response": "Social anxiety can be challenging. Let's explore gradual exposure to social settings combined with relaxation techniques and cognitive behavioral therapy to help you manage and reduce your anxiety."
    },
    "issue_2": {
        "question": "I'm struggling with low self-esteem and often feel worthless.",
        "response": "It's important to recognize your strengths and achievements. We can work on positive affirmations and self-compassion exercises to help improve your self-esteem over time."
    },
    "issue_3": {
        "question": "I have trouble sleeping at night and often feel tired during the day.",
        "response": "Establishing a consistent sleep routine and creating a restful environment can improve sleep quality. We can also explore relaxation techniques and limit screen time before bed."
    },
    "issue_4": {
        "question": "I feel really isolated and lonely, even when I'm around people.",
        "response": "Loneliness can affect anyone. Building meaningful connections through shared interests and seeking support from a counselor can help you feel more connected and supported."
    },
    "issue_5": {
        "question": "I often find myself procrastinating and can't seem to get anything done.",
        "response": "Procrastination can be addressed by breaking tasks into smaller, manageable steps and using time management techniques. Let's develop a plan to help you stay focused and motivated."
    },
    "issue_6": {
        "question": "I feel constantly sad and have lost interest in activities I used to enjoy.",
        "response": "These feelings might be symptoms of depression. Seeking professional help and discussing therapy options, such as cognitive behavioral therapy, can be beneficial in addressing these symptoms."
    },
    "issue_7": {
        "question": "I have frequent mood swings and often feel out of control emotionally.",
        "response": "Mood swings can be challenging to manage. Identifying triggers and practicing mindfulness or emotional regulation strategies can help stabilize your mood over time."
    },
    "issue_8": {
        "question": "I worry excessively about things that might never happen.",
        "response": "Excessive worry can be managed with cognitive-behavioral techniques and mindfulness practices. Let's work on challenging irrational thoughts and focusing on the present moment."
    },
    "issue_9": {
        "question": "I have difficulty trusting others and often feel paranoid.",
        "response": "Building trust takes time. Exploring the root causes of your mistrust in therapy and practicing open communication can gradually help improve your relationships with others."
    },
    "issue_10": {
        "question": "I often feel overwhelmed by my emotions and don't know how to handle them.",
        "response": "Emotional overwhelm can be addressed by developing healthy coping mechanisms. Techniques like deep breathing, journaling, or talking to a trusted friend can help you manage your emotions more effectively."
    },
    "issue_11": {
        "question": "I feel like I'm not good enough and constantly compare myself to others.",
        "response": "Comparing yourself to others can be detrimental. Focusing on your unique qualities and practicing self-acceptance can help build a healthier self-image. Let's work on strategies to reinforce your self-worth."
    },
    "issue_12": {
        "question": "I have trouble expressing my feelings and often keep things bottled up inside.",
        "response": "Expressing feelings can be difficult. Journaling, practicing assertive communication, and seeking therapy can help you learn to express your emotions in a healthy way."
    },
    "issue_13": {
        "question": "I feel overwhelmed by my responsibilities and don't know where to start.",
        "response": "Breaking down tasks into smaller, manageable steps and prioritizing them can make responsibilities feel less overwhelming. Let's create a plan that helps you tackle your tasks more effectively."
    },
    "issue_14": {
        "question": "I have difficulty forming and maintaining relationships.",
        "response": "Building relationships takes effort and communication. Developing social skills and seeking support from a therapist can help you form and maintain healthier connections with others."
    },
    "issue_15": {
        "question": "I often feel guilty and blame myself for things that go wrong.",
        "response": "Excessive guilt can be harmful. Exploring the root causes of your guilt in therapy and practicing self-forgiveness can help you develop a more balanced perspective."
    },
    "issue_16": {
        "question": "I feel like I'm stuck in life and don't know how to move forward.",
        "response": "Feeling stuck can be frustrating. Setting achievable goals and exploring new interests or opportunities can help reignite your motivation and sense of purpose."
    },
    "issue_17": {
        "question": "I often feel angry and irritable for no apparent reason.",
        "response": "Unexplained anger can be managed by identifying underlying triggers and practicing relaxation techniques. Therapy can also help you explore and address the root causes of your anger."
    },
    "issue_18": {
        "question": "I struggle with perfectionism and fear making mistakes.",
        "response": "Perfectionism can be paralyzing. Accepting that mistakes are part of growth and focusing on progress rather than perfection can help you overcome these fears. Let's work on developing a healthier mindset."
    },
    "issue_19": {
        "question": "I feel disconnected from my own identity and don't know who I am.",
        "response": "Exploring your values, interests, and passions can help you reconnect with your identity. Therapy can provide a safe space to explore and understand yourself better."
    },
    "issue_20": {
        "question": "I often feel overwhelmed by negative thoughts and can't seem to escape them.",
        "response": "Negative thoughts can be persistent. Cognitive-behavioral therapy and mindfulness practices can help you challenge and reframe these thoughts, leading to a more positive mindset."
    }
}

# Convert to a list of dictionaries
data = {"question": [value["question"] for key, value in psychological_issues_solutions.items()], "response": [value["response"] for key, value in psychological_issues_solutions.items()]}

ds = Dataset.from_dict(data)
ds

Dataset({
    features: ['question', 'response'],
    num_rows: 20
})

In [None]:
!pip install huggingface_hub
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: write).
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your ter

In [None]:
ds.push_to_hub("mohammadhabp/psychological_issues")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/mohammadhabp/psychological_issues/commit/797ee55a0837704e607696112f0be882b52c693b', commit_message='Upload dataset', commit_description='', oid='797ee55a0837704e607696112f0be882b52c693b', pr_url=None, pr_revision=None, pr_num=None)

## 🛠️ Create a Retriever
To create your retriever, you need to use an encoder model. Something like BERT? Nah, BERT is so yesterday. Find something new and shiny! ✨ The basic idea is to encode every document (sentence) in your corpus into a vector space using the same encoder. Then, encode the user query into that same space. With some similarity metrics like dot product, you can find the most similar document to the user’s input and retrieve it. 🎯 You can train your own encoder if you have enough data and resources, 💪 or you can use one of those [ready-made on Hugging Face](https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=trending), like these ones.

In [None]:
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM, pipeline
import torch
import torch.nn.functional as F
from tqdm import tqdm

device = torch.device('cuda' if torch.cuda.is_available else 'cpu')

encoder_model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L12-v2").to(device)
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L12-v2")

In [6]:
def get_embeddings(texts):
    inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=512).to(device)

    with torch.no_grad():
        outputs = encoder_model(**inputs)
        hidden_states = outputs.last_hidden_state

    embeddings = hidden_states.mean(dim=1)
    return embeddings


def cosine_similarity(emb1, emb2):
    cos_sim = F.cosine_similarity(emb1, emb2)
    return cos_sim.item()


def return_response(query, corpus):
  similarities = []
  query_embedding = get_embeddings([query])
  for doc in tqdm(corpus):
      doc_embedding = get_embeddings([doc])
      similarity = cosine_similarity(query_embedding, doc_embedding)
      similarities.append(similarity)
  return corpus[similarities.index(max(similarities))], max(similarities)

In [7]:
corpus = corpus_of_documents['question'].tolist()

In [None]:
import pandas as pd

sim_doc_to_query = pd.DataFrame(columns=['query', 'relevant_document', 'similarity_score'])
for i, q in enumerate(ds['question']):
  print('query', i+1)
  relevant_document, sim_score = return_response(q, corpus)
  sim_doc_to_query.loc[len(sim_doc_to_query.index)] = [q, relevant_document, sim_score]

In [None]:
sim_doc_to_query.to_csv('sim_doc_to_query.csv')
sim_doc_to_query

Unnamed: 0,query,relevant_document,similarity_score
0,I constantly feel anxious in social situations...,I'm feeling really anxious about social situat...,0.868375
1,I'm struggling with low self-esteem and often ...,I'm struggling with feelings of worthlessness ...,0.959566
2,I have trouble sleeping at night and often fee...,I'm having trouble sleeping at night and feeli...,0.961264
3,"I feel really isolated and lonely, even when I...",I'm feeling really lonely and isolated. I don'...,0.900065
4,I often find myself procrastinating and can't ...,I'm struggling with procrastination and can't ...,0.935328
5,I feel constantly sad and have lost interest i...,I have been feeling really down lately and hav...,0.865077
6,I have frequent mood swings and often feel out...,I'm having trouble with my mood swings and I'm...,0.822804
7,I worry excessively about things that might ne...,I'm constantly worried about the future and wh...,0.738143
8,I have difficulty trusting others and often fe...,I'm having trouble trusting others.,0.817516
9,I often feel overwhelmed by my emotions and do...,I'm feeling really overwhelmed with my emotion...,0.939762


In [8]:
import pandas as pd

sim_doc_to_query = pd.read_csv('sim_doc_to_query.csv')
sim_doc_to_query.head()

Unnamed: 0.1,Unnamed: 0,query,relevant_document,similarity_score
0,0,I constantly feel anxious in social situations...,I'm feeling really anxious about social situat...,0.868375
1,1,I'm struggling with low self-esteem and often ...,I'm struggling with feelings of worthlessness ...,0.959566
2,2,I have trouble sleeping at night and often fee...,I'm having trouble sleeping at night and feeli...,0.961264
3,3,"I feel really isolated and lonely, even when I...",I'm feeling really lonely and isolated. I don'...,0.900065
4,4,I often find myself procrastinating and can't ...,I'm struggling with procrastination and can't ...,0.935328


## 🎛️ Create a Generator
For this part, I practically handed you the whole code on a silver platter. 🍽️ But since we know you’re an explorer at heart and love trying new things, you can’t use the model I previously used. 😈 You have to try 3 different generators and compare them based on the quality of their answers. 🧪📊 [These might come in handy](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending).

In [9]:
class RAGModel:
  def __init__(self, checkpoint, corpus, *kwargs, **args):
    self.model = AutoModelForCausalLM.from_pretrained(checkpoint, **args).to(device)
    self.tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    self.pipe = pipeline(
          "text-generation",
          model=self.model,
          tokenizer=self.tokenizer,
      )

    self.generation_args = {
          "max_new_tokens": 250,
          "return_full_text": False,
          "temperature": 0.0,
          "do_sample": False,
      }

    self.corpus = corpus

    self.prompt = """You are a bot that makes advices for psychological issues that people face with them. Try to be helpful therapist system.
    The user issue is: {user_issue}
    This is the recommended solution to the issue: {relevant_document_response}
    Compile a recommendation to the user based on the recommended solution and the user issue."""


  def get_relevant_doc(self, query):
    for i, row in sim_doc_to_query.iterrows():
      if row['query'] == query:
        return row['relevant_document']
    raise ValueError(f"No relevant document found for query: {query}")


  def get_relevant_document_response(self, relevant_doc):
    for i, row in corpus_of_documents.iterrows():
      if row['question'] == relevant_doc:
        return row['response_j']
    raise  ValueError(f"No response found for query: {relevant_doc}")


  def inference(self, user_issue):
    relevant_document = self.get_relevant_doc(user_issue)
    relevant_document_response = self.get_relevant_document_response(relevant_document)
    prompt = self.prompt.replace("{relevant_document_response}", relevant_document_response).replace("{user_issue}", user_issue)
    messages = [
        {"role": "user", "content": prompt},
    ]
    output = self.pipe(messages, **self.generation_args)
    return output[0]['generated_text']


  def save_inferences(self, queries):
    inferences = pd.DataFrame(columns=['query', 'relevant_document', 'relevant_document_response', 'model_inference'])
    for i, q in tqdm(enumerate(queries)):
      relevant_document = self.get_relevant_doc(q)
      relevant_document_response = self.get_relevant_document_response(relevant_document)
      inference = self.inference(q)
      inferences.loc[len(inferences.index)] = [q, relevant_document, relevant_document_response, inference]

    return inferences

# First generator

In [None]:
first_generator_model = RAGModel("microsoft/Phi-3-mini-4k-instruct", corpus=corpus, device_map="cuda", torch_dtype="auto", trust_remote_code=True)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
queries = ds['question']
inferences = first_generator_model.save_inferences(queries)
inferences.to_csv('Phi-3-mini-4k-instruct_inferences.csv')

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
10it [01:35,  8.90s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
20it [03:01,  9.06s/it]


# Second generator

In [None]:
second_generator_model = RAGModel("TinyLlama/TinyLlama-1.1B-Chat-v1.0", corpus=corpus, torch_dtype=torch.bfloat16, device_map="auto")

config.json:   0%|          | 0.00/608 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.29k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/551 [00:00<?, ?B/s]

In [None]:
queries = ds['question']
inferences = second_generator_model.save_inferences(queries)
inferences.to_csv('TinyLlama-1.1B-Chat-v1.0_inferences.csv')

8it [01:08,  8.72s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
20it [02:47,  8.37s/it]


# Third generator

In [11]:
from transformers import QuantoConfig

quantization_config = QuantoConfig(weights="int8")
third_generator_model = RAGModel("mistralai/Mistral-7B-Instruct-v0.2", corpus=corpus, quantization_config= quantization_config, low_cpu_mem_usage=True, device_map='auto')

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
queries = ds['question']
inferences = third_generator_model.save_inferences(queries)
inferences.to_csv('Mistral-7B-Instruct-v0.2_inferences.csv')

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
10it [16:50, 93.00s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-

## 📊 Evaluate the results
Here, you’ve got to put those 3 models to the test. Use the 20 queries you’ve created on each of the 3 models. Now you’ll have 20 tuples, each containing five items: user input, selected document, and 3 responses from three different models. Use a judge model on each tuple to select the best answer. 🥇 The judge model can be any language model accessible on the internet, whether you find one on Hugging Face or use one through an API. 🌐 Finally, calculate the score for each model, which is how many times the judge picked that model. 🏆

In [2]:
import pandas as pd

first_generator_inferences = pd.read_csv('Phi-3-mini-4k-instruct_inferences.csv')
second_generator_inferences = pd.read_csv('TinyLlama-1.1B-Chat-v1.0_inferences.csv')
third_generator_inferences = pd.read_csv('Mistral-7B-Instruct-v0.2_inferences.csv')

In [3]:
model_inferences = pd.DataFrame(columns=['query', 'advice', 'first_model_inference', 'second_model_inference', 'third_model_inference'])

for i in range(20):
  first_generator_inference = first_generator_inferences.iloc[i]
  second_generator_inference = second_generator_inferences.iloc[i]
  third_generator_inference = third_generator_inferences.iloc[i]

  query = first_generator_inference['query']
  advice = first_generator_inference['relevant_document_response']

  model_inferences.loc[len(model_inferences.index)] = [
      query,
      advice,
      first_generator_inference['model_inference'],
      second_generator_inference['model_inference'],
      third_generator_inference['model_inference']
  ]

model_inferences.head()

Unnamed: 0,query,advice,first_model_inference,second_model_inference,third_model_inference
0,I constantly feel anxious in social situations...,It may be helpful to gradually expose yourself...,"Dear user,\n\nI understand that you are exper...",To the user:\n\nI understand that you feel anx...,Based on the user issue of constantly feeling...
1,I'm struggling with low self-esteem and often ...,"Low self-esteem can be challenging, but it's i...",I'm sorry to hear that you're struggling with...,"Bot: Hi there, I'm a therapist system that spe...",Based on the user issue of struggling with lo...
2,I have trouble sleeping at night and often fee...,"Insomnia can be caused by many factors, such a...","Dear user,\n\nI understand that you are exper...",As a bot that makes advices for psychological ...,Based on the user issue and the recommended s...
3,"I feel really isolated and lonely, even when I...","Feeling lonely and isolated can be difficult, ...",I'm sorry to hear that you're feeling isolate...,"I am not a human being, but I can provide a re...",Based on the user issue and the recommended s...
4,I often find myself procrastinating and can't ...,"It's important to break tasks into smaller, ma...",I understand that procrastination can be a ch...,Bot: Hi there! I'm a therapist system that can...,"Based on the issue you've shared with me, it ..."


In [6]:
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_fixed
import time

@retry(stop=stop_after_attempt(3), wait=wait_fixed(20))
def gpt_inference(client, prompt):
  messages = []
  messages.append(
      {"role": "system", "content": "You are a system to judge the answers of 3 models for psychological issues"}
  )
  messages.append(
      {"role": "user", "content": prompt},
  )

  chat = client.chat.completions.create(
        messages=messages,
        model="gpt-3.5-turbo",
        temperature = 0.0
    )
  response = chat.choices[0].message.content
  return response

In [14]:
PROMPT = '''
  3 models are asked to advise for a psychological issue based on the provided solution.
  psychological issue: <issue>
  solution: <solution>
  first model inference: <first_model>
  second model inference: <second_model>
  third model inference: <third_model>

  your task is to judge the inferences of the models based on the psychological issue and its solution and say which one is the best.
  please respond just a number without any further detail. if the first model is the best respond 0, else if the second model is the best return 1, and if the third model is the best return 2. <SEP>
'''


In [21]:
from google.colab import userdata


def evaluate(results):
  client = OpenAI(
    api_key = userdata.get('GPT_API_KEY')
  )

  evaluations = [0, 0, 0]
  for i, row in results.iterrows():
    prompt = PROMPT.replace('<issue>', row['query']).replace('<solution>', row['advice']).replace('<first_model>', row['first_model_inference']).replace('<second_model>', row['second_model_inference']).replace('<third_model>', row['third_model_inference'])
    answer = gpt_inference(client, prompt)
    evaluations[int(answer)] += 1

    time.sleep(20)

  return evaluations

In [22]:
evaluation = evaluate(model_inferences)
evaluation

[2, 4, 14]

In [23]:
eval_dic = {'model': ['Phi-3-mini-4k-instruct', 'TinyLlama-1.1B-Chat-v1.0', 'Mistral-7B-Instruct-v0.2'],
            'evaluation': [evaluation[0]/sum(evaluation), evaluation[1]/sum(evaluation), evaluation[2]/sum(evaluation)]}

eval_df = pd.DataFrame(eval_dic)
eval_df

Unnamed: 0,model,evaluation
0,Phi-3-mini-4k-instruct,0.1
1,TinyLlama-1.1B-Chat-v1.0,0.2
2,Mistral-7B-Instruct-v0.2,0.7
