In [1]:
!pip install transformers sentence-transformers faiss-cpu datasets

Collecting sentence-transformers
  Downloading sentence_transformers-2.7.0-py3-none-any.whl (171 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)
Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux20

Installing the required libraries

In [2]:
import torch
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import CrossEncoder

* The SentenceTransformer class from the sentence-transformers library is used for embedding sentences into dense vectors.
* The transformers library by Hugging Face offers a range of pre-trained models and tools for NLP tasks. AutoTokenizer and AutoModelForCausalLM are used for tokenizing input text and generating responses.

In [3]:
text = [
    "The capital of India is New Delhi.",
    "The capital of the USA is Washington, D.C.",
    "The capital of England is London.",
    "The capital of Australia is Canberra.",
    "Paris is the capital city of France.",
    "Berlin is the capital city of Germany.",
    "The capital of Japan is Tokyo.",
    "Ottawa is the capital of Canada.",
    "Mumbai is the capital of Maharashtra state in India.",
    "Bhopal is the capital of the state of Madhya Pradesh."
]

Example text for retrieval

In [4]:
# loaded a sentence transformer model
transformer_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

# encoded the text
text_embeddings = transformer_model.encode(text, convert_to_tensor=True)


* Here I have used the "__multi-qa-MiniLM-L6-cos-v1__" model to transform sentences into embedding vectors.<br>
* Then I encoded the text with __convert_to_tensor=True__ so that the embeddings are returned as __PyTorch tensors__ instead of NumPy arrays.

In [5]:
# dimension of the embedding vectors
dim = text_embeddings.shape[1]
# initialize a flat (exact search) FAISS index that uses L2 distance
index = faiss.IndexFlatL2(dim)
# add the embeddings to the index (FAISS expects float32 NumPy arrays)
index.add(text_embeddings.cpu().detach().numpy())
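
A note on the index type: `IndexFlatL2` ranks by squared Euclidean distance, while the model name suggests it was tuned for cosine similarity. Assuming the embeddings are unit length (an assumption about this model, worth checking on the real vectors), both give the same ranking, because for unit vectors the squared L2 distance equals 2 - 2 * cosine. A minimal sketch with made-up 3-dimensional vectors:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    # For unit vectors the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# Toy "embeddings", made up for illustration only.
query = normalize([1.0, 2.0, 0.5])
docs = [normalize(v) for v in ([1.0, 2.1, 0.4], [0.2, -1.0, 3.0], [2.0, 0.1, 0.1])]

for d in docs:
    # Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
    assert abs(squared_l2(query, d) - (2 - 2 * cosine(query, d))) < 1e-9

# Smallest distance first vs. largest cosine first.
by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
by_cos = sorted(range(len(docs)), key=lambda i: -cosine(query, docs[i]))
print(by_l2 == by_cos)
```

If the embeddings were not normalized, the two rankings could differ, and an inner-product index on normalized vectors would be the closer match to cosine similarity.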

In [6]:
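def retrieved_text(query, top_k=3):
    # Embed the query and reshape it to (1, dim), the 2-D shape FAISS expects
    ques_embedding = transformer_model.encode(query, convert_to_tensor=True)
    ques_embedding = ques_embedding.cpu().detach().numpy().reshape(1, -1)
    # indices[0] holds the positions of the top_k nearest texts for this query
    distances, indices = index.search(ques_embedding, top_k)
    # Local name differs from the function name to avoid shadowing it
    docs = [text[idx] for idx in indices[0]]
    return docs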

This function retrieves the most relevant texts from the text list provided above.<br>
This is important because it fetches the stored facts that the answer shown to the user is built on.<br>
I set top_k=3 so that it fetches the 3 most similar texts.
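
To make the search step concrete, here is a rough pure-Python sketch of what an exact (flat) L2 search does: compute the squared distance from the query to every stored vector and keep the `top_k` smallest. The helper `brute_force_search` and the 2-dimensional vectors are hypothetical and only for illustration; the notebook itself uses `index.search`, which returns 2-D arrays with one row per query (hence `indices[0]`).

```python
def brute_force_search(query_vec, doc_vecs, top_k=3):
    """Return (distances, indices) of the top_k closest vectors by squared L2."""
    scored = [
        (sum((q - d) ** 2 for q, d in zip(query_vec, vec)), i)
        for i, vec in enumerate(doc_vecs)
    ]
    scored.sort()  # smallest distance first
    top = scored[:top_k]
    return [dist for dist, _ in top], [i for _, i in top]

# Toy 2-dimensional vectors, made up for illustration.
toy_docs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = brute_force_search([0.9, 0.1], toy_docs, top_k=2)
print(indices)    # positions of the two nearest vectors
print(distances)  # their squared L2 distances
```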

In [7]:
# load the model and tokenizer
model_name = "bigscience/bloom-1b7"  # I switched to this model to improve the accuracy of the answers.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)





In [8]:
# Load the cross-encoder once instead of on every call
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def refine_context(query, retrieved_docs):
    # Score each (query, document) pair with the cross-encoder
    pairs = [[query, doc] for doc in retrieved_docs]
    scores = cross_encoder.predict(pairs)
    # Sort the documents by score, highest first
    ranked_docs = [doc for doc, score in sorted(zip(retrieved_docs, scores), key=lambda x: x[1], reverse=True)]
    return " ".join(ranked_docs[:1])  # Use the most relevant text

This part of my code was provided by ChatGPT. I told it that my answers were not accurate and contained unnecessary info, and it gave me this code.<br>
My answers before this change:<br>
Bhopal is the capital of<br>
Bhopal is the capital of the state of Madhya Pradesh.. It is located in the western part of India. It has a population of about 1.5 million people. The city is also known as the birthplace of Mahatma Gandhi.<br>
Mumbai is the captial of<br>
Mumbai is the captial of the world. It is a place where you will find the most beautiful beaches, the best restaurants, and the largest shopping mall in the city. If you are looking for the perfect place to spend your vacation, then you
<br>Captial of India<br>
Captial of India and the Government of the United Kingdom of Great Britain and Northern Ireland, in cooperation with the Department of Foreign Affairs and Trade, the Economic and Social Commission for Western Asia (ESCWA), the World Trade Organization (WTO) and
<br>The capital of England is
<br>The capital of England is London..London is the largest city in the world. It has a population of over 2.5 million people. The city is also known as the capital city of the United Kingdom.
<br>Paris is the capital of
<br>Paris is the capital of  the Republic of Moldova and the largest city in the country. It is located on the border of Romania and Bulgaria. The city has a population of about 1.5 million. Moldova is a member of the European Union and
<br>Capital of Japan
<br>The capital of Japan is Tokyo.. Tokyo is the largest city in the world. It has a population of more than 2.5 million people. The city is also known as the “City of Love.”
Tokyo is one of the most beautiful
<br>Tell me about Berlin
<br>Tell me about Berlin.
<br>- I don't know.
<br>I don't even know where it is.
<br>But I do know that Berlin is the place where I want to be.
<br>And that is why I am here.
<br>I'm here to tell you that I love you
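
The reranking idea in `refine_context` can be sketched without downloading any model: score every (query, text) pair, sort by score, and keep the best one. In the sketch below, `rerank` and `word_overlap` are hypothetical helpers; `word_overlap` is only a crude stand-in for the scores that the cross-encoder's `predict` returns.

```python
def rerank(query, docs, score_fn, keep=1):
    """Sort docs by score_fn(query, doc), highest first, and keep the best ones."""
    ranked = sorted(docs, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:keep]

def word_overlap(query, doc):
    # Hypothetical stand-in scorer: counts the lowercase words both texts share.
    # The notebook uses cross-encoder scores instead.
    strip = ".,?!"
    q_words = {w.strip(strip) for w in query.lower().split()}
    d_words = {w.strip(strip) for w in doc.lower().split()}
    return len(q_words & d_words)

candidates = [
    "The capital of India is New Delhi.",
    "The capital of Japan is Tokyo.",
    "Mumbai is the capital of Maharashtra state in India.",
]
best = rerank("Capital of Japan", candidates, word_overlap)
print(best)
```

A real cross-encoder reads the query and the text together, so it can judge relevance far better than word counting; the sort-and-keep logic around it stays the same.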

In [9]:
def rag_generate_text(ques, max_length=75):
    # Retrieving the relevant text
    retrieved_docs = retrieved_text(ques, top_k=3)  # Retrieve the 3 most similar texts

    # Refine context
    context = refine_context(ques, retrieved_docs)

    # Tokenizing the context (named `inputs` to avoid shadowing the built-in `input`)
    inputs = tokenizer(context, return_tensors="pt")

    # Generating an answer with beam search
    output = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=max_length,
        num_return_sequences=1,
        num_beams=5,
        no_repeat_ngram_size=2,
        early_stopping=True
    )

    # Decoding the generated token ids back to text, dropping special tokens
    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    return answer

This rag_generate_text function generates a short, relevant answer to a question by combining text retrieval and text generation.
  * max_length: Maximum length, in tokens, of the output sequence (the tokens of the input count toward it).
  * num_return_sequences: Number of sequences returned (val = 1).
  * num_beams: Number of beams for beam search, which usually improves the quality of the generated text.
  * no_repeat_ngram_size: Prevents any n-gram of the specified size from appearing twice (val = 2).

I used both text retrieval and text generation so that my chatbot gives some extra info about a particular thing (though that info can be wrong as well).
<br>
max_length = 75 means the output has at most 75 tokens (not letters), including the tokens of the retrieved context.<br>
You can lower it to reduce the runtime of the function.
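
Note that the function passes only the retrieved context to the model, so the model simply continues that sentence and never sees the question. One possible variation, sketched under assumptions, is to build a prompt that contains both. The helper `build_prompt` and its template are hypothetical, and whether this particular model follows such a template well is untested here.

```python
def build_prompt(question, context):
    """Combine the retrieved context and the user's question into one prompt string."""
    return (
        "Answer the question using only the context.\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt("Capital of Japan", "The capital of Japan is Tokyo.")
print(prompt)
```

Because `max_length` counts the prompt tokens too, a longer prompt leaves less room for the answer; the `max_new_tokens` argument of `generate` bounds only the newly generated tokens and may be the more convenient knob in that case.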


In [10]:
questions = ["Bhopal is the capital of",
             "Mumbai is the captial of",
             "Captial of India",
             "The capital of England is",
             "Paris is the capital of ",
             "Capital of Japan",
             "Tell me about Berlin"]

for ques in questions:
  answer = rag_generate_text(ques)
  print(ques)
  print(answer)



Bhopal is the capital of
Bhopal is the capital of the state of Madhya Pradesh. It is also the largest city in India and the second largest in the country after Delhi. The city is located at the confluence of two major rivers, the Indus and Yamuna, and is surrounded by hills and mountains.
The city has a population of over 1.5 million people, making it the
Mumbai is the captial of
Mumbai is the capital of Maharashtra state in India. It is located on the banks of the river Ganges. The city is known for its beautiful architecture, rich culture, and vibrant nightlife. Mumbai is also known as the ‘City of Dreams’ as it is home to many famous landmarks such as Taj Mahal, Fort Kochi, The Great Wall of China,
Captial of India
The capital of India is New Delhi. It is the most populous city in the country, with a population of over 1.5 million. The city is also home to the Indian Parliament, the Supreme Court, and many other government offices. Delhi is a cosmopolitan metropolis that is known fo

Here is the analysis part of my chatbot. <br>
1) It answers the questions correctly when the answer is directly present in the text I provided.<br>
2) The model adds some wrong info after the retrieved sentence, which is not good and should be improved.<br>
3) I guess that if I increase the data set, the chance of getting a wrong answer will be much lower.<br>
4) It also gives the same output for 2 different questions, which must be avoided.<br>

In [11]:
#asking a random question
ques ="Who is the winner if IPL 2023"
answer = rag_generate_text(ques)
print(ques)
print(answer)



Who is the winner if IPL 2023
The capital of India is New Delhi. It is the most populous city in the country, with a population of over 1.5 million. The city is also home to the Indian Parliament, the Supreme Court, and many other government offices. Delhi is a cosmopolitan metropolis that is known for its vibrant nightlife, world-famous restaurants, shopping malls,


5) When asked a random question that the text does not cover, it gives a wrong answer: it still returns the nearest text, here the one about New Delhi.
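
One possible way to handle such out-of-scope questions is to look at the distances that the FAISS search returns and refuse to answer when even the nearest text is too far away. The sketch below is only an illustration: `filter_by_distance` and `answer_or_refuse` are hypothetical helpers, the distances are made up, and the cut-off of 1.0 is an assumption that would need tuning on real queries.

```python
def filter_by_distance(docs, distances, max_distance):
    """Keep only the docs whose distance to the query is at most max_distance."""
    return [doc for doc, dist in zip(docs, distances) if dist <= max_distance]

def answer_or_refuse(docs, distances, max_distance=1.0):
    # max_distance=1.0 is a hypothetical cut-off, not a tuned value.
    relevant = filter_by_distance(docs, distances, max_distance)
    if not relevant:
        return "Sorry, I don't have information about that."
    return relevant[0]

docs = ["The capital of India is New Delhi.", "The capital of Japan is Tokyo."]
in_scope = answer_or_refuse(docs, [0.4, 1.3])      # made-up distances, one close match
out_of_scope = answer_or_refuse(docs, [1.6, 1.7])  # made-up distances, nothing close
print(in_scope)
print(out_of_scope)
```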

__Challenges Faced:__<br>
1) I did not know enough about how to build the chatbot, and I could not find good resources on the internet to start with.<br>
2) I tried using the GPT API, but it did not work in my case.<br>
3) Choosing a pre-trained model trained on a good dataset.<br>
4) I was unable to implement RAG and the attention mechanism properly.<br>


__Resources Used:__<br>
1) ChatGPT : It helped me a lot with figuring out errors and suggesting good pre-trained models. The model I am currently using was suggested by it.<br>
2) Numerous GitHub Repos : I used them only for ideas (no part of my code was derived from them).<br>
3) YouTube Videos : I used them to understand __RAG__ , in particular the IBM videos : <br>
* _url1_ : (https://www.youtube.com/watch?v=XctooiH0moI)
* _url2_ : (https://www.youtube.com/watch?v=qppV3n3YlF8)
* _url3_ : (https://www.youtube.com/watch?v=T-D1OfcDW1M)

**Improvements that can be done:**<br>
* Using a better pre-trained model.<br>
* Using a better dataset (text). The dataset I used is very small; even a slightly larger one would give better results.<br>
* Better implementation of **RAG** and the **attention mechanism**.<br>
* Training the model. I do not have a good idea of how to do it, but training (fine-tuning) the model could make its responses more accurate.<br>
* Checking the info generated by the chatbot against the internet or another source.<br>
* Returning only complete sentences and not text that is cut off midway. I was unable to implement this.
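
For the last point, a simple post-processing step might be enough: cut the generated text after the last sentence-ending punctuation mark. The helper `trim_to_last_sentence` below is a hypothetical sketch, not part of the notebook. It treats '.', '!' or '?' as a sentence end only when followed by whitespace or the end of the text, so decimals such as "1.5" are not split, though abbreviations can still fool it.

```python
import re

def trim_to_last_sentence(text):
    """Cut text after its last sentence end; return it unchanged if none is found."""
    # A sentence end is '.', '!' or '?' followed by whitespace or the end of the text.
    matches = list(re.finditer(r"[.!?](?=\s|$)", text))
    if not matches:
        return text
    return text[: matches[-1].end()]

raw = (
    "The capital of India is New Delhi. "
    "The city has a population of over 1.5 million people, making it the"
)
trimmed = trim_to_last_sentence(raw)
print(trimmed)
```

It could be applied to the decoded answer before returning it from `rag_generate_text`.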