# RAG Implementation

This notebook contains the code for the implementation of the Retrieval-Augmented Generation (RAG) feature of the World Bank IDEAS Chatbot (WiChat).

## Step 1: Data Ingestion
Our data sources are 79 .txt files scraped from the World Bank IDEAS Project website and the project's social media accounts. The social media content is publicly available, so no permission was needed to use it. Only pages that the website's sitemap allows to be crawled were scraped. View the scraping of the data here.

LangChain as the framework

HuggingFace (all-MiniLM-L6-v2) for embeddings

Falcon 7B Instruct as the LLM

FAISS as the vector store (knowledge base)

In [1]:
#move to the folder where the scraped data is on your device
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/Data/scraped_data

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Data/scraped_data


In [2]:
# install dependencies
#!pip install -r requirements.txt
#!pip install -qU langchain
#!pip install -qU transformers
#!pip install -qU langchain-community
#!pip install -qU unstructured
#!pip install -qU sentence-transformers
#!pip install faiss-gpu-cu12
#!pip install -qU bitsandbytes

In [3]:
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline, AutoTokenizer, AutoModelForQuestionAnswering, AutoModelForCausalLM, BitsAndBytesConfig
import torch


In [4]:
# ensure the notebook is in the same folder as the data files
# load all txt files
loader1 = DirectoryLoader(r'/content/drive/MyDrive/Data/scraped_data', glob = '*.txt', show_progress = True)

In [5]:
# get content of txt files
docs = loader1.load()
len(docs)

100%|██████████| 79/79 [00:10<00:00,  7.80it/s]


79

In [6]:
# split the documents into chunks of up to 1000 characters with a 150-character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 150)
data = text_splitter.split_documents(docs)
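
To make the two splitter parameters concrete, the sketch below shows fixed-size chunking with overlap in plain Python. It is a simplification and not the LangChain implementation: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line breaks, so its chunks are usually shorter than `chunk_size`. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 150) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "x" * 2500  # stand-in for one scraped document
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
```

Each chunk starts 850 characters after the previous one, so the last 150 characters of one chunk are repeated at the start of the next. This keeps sentences that straddle a boundary retrievable from at least one chunk.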

In [7]:
!nvcc --version
!nvidia-smi # check that a GPU is attached and CUDA is available

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
/bin/bash: line 1: nvidia-smi: command not found
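
The `nvidia-smi: command not found` line above suggests that no GPU is attached to this runtime, even though the CUDA toolkit is installed. A minimal sketch of a programmatic check follows. `detect_device` is a hypothetical helper, and the fallback of looking for `nvidia-smi` on the PATH is only a heuristic for environments where PyTorch is not installed.

```python
import shutil


def detect_device() -> str:
    """Return "cuda" when a CUDA GPU appears to be usable, otherwise "cpu"."""
    try:
        import torch  # preferred check: ask PyTorch directly
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        # heuristic fallback: the NVIDIA driver ships the nvidia-smi tool
        return "cuda" if shutil.which("nvidia-smi") else "cpu"


device = detect_device()
print(f"Running on: {device}")
```

The returned string could be used as the `device` value in the `model_kwargs` passed to the embedding model.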


In [8]:
# embed data sources
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}

embeddings = HuggingFaceEmbeddings(
    model_name = "sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs = model_kwargs,
    encode_kwargs = encode_kwargs
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
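
The `normalize_embeddings` flag controls whether each vector is scaled to unit length. The sketch below uses plain Python and made-up 3-dimensional vectors (real all-MiniLM-L6-v2 embeddings have 384 dimensions) to show why this can matter: once vectors are normalized, their dot product equals their cosine similarity.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [6.0, 8.0, 0.0]  # same direction as a, twice the length
c = [0.0, 0.0, 5.0]  # orthogonal to a

print("cosine(a, b):", cosine_similarity(a, b))
print("dot of normalized a and b:", dot(l2_normalize(a), l2_normalize(b)))
print("cosine(a, c):", cosine_similarity(a, c))
```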


In [9]:
# vector store
db = FAISS.from_documents(data, embeddings)

In [10]:
# test the retrieval: similarity_search returns the chunks closest to the question, best match first
question = "What is the Worldbank Ideas Project?"
search_docs = db.similarity_search(question)
print(search_docs[1].page_content)  # index 1 is the second-ranked chunk

In a message to the opening ceremony of the workshop today read on his behalf by the National Project Coordinator, Mrs Blessing Ogwu, Education Minister,Adamu Adamu,told stakeholders that the essence of the IDEAS project is to address the current deficiencies in the education system that have made a large number of school leavers unemployed,urging Nigerian youths to take full advantage of the opportunities offered by the project.

According to the Minister,an estimated 40 Technical colleges in the country, alongside the private sector will benefit from the project which also has a technical teacher training component.

The Minister revealed that a Two Hundred Million Dollar credit facility from the world Bank has been approved for the project which implementation will span over a five year period.
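
Conceptually, `similarity_search` embeds the question and returns the stored chunks whose vectors are closest to it. The brute-force sketch below illustrates that ranking with toy 2-dimensional vectors and Euclidean distance. The chunk labels and vectors are invented for illustration, and FAISS uses optimized index structures rather than this loop.

```python
import math


def top_k(query_vec, store, k=4):
    """Return the k (text, distance) pairs closest to query_vec, nearest first."""
    scored = [(text, math.dist(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# toy "vector store": (chunk text, embedding) pairs, invented for illustration
store = [
    ("chunk about an unrelated topic", [0.0, 1.0]),
    ("chunk about technical colleges", [0.9, 0.1]),
    ("chunk about project funding", [0.7, 0.3]),
]
query_vec = [1.0, 0.0]

results = top_k(query_vec, store, k=2)
for text, distance in results:
    print(f"{distance:.3f}  {text}")
```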


In [11]:
# define the LLM used for text generation - Falcon 7B Instruct
model_name = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# 8-bit quantization through bitsandbytes requires a CUDA GPU
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

# Falcon is a decoder-only generative model, so it is loaded with a causal LM head
model1 = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quantization_config, torch_dtype=torch.bfloat16)
#model2 = AutoModelForCausalLM.from_pretrained(model_name, load_in_8bit = True, torch_dtype=torch.bfloat16)

ERROR:bitsandbytes.cextension:Could not load bitsandbytes native library: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /usr/local/lib/python3.11/dist-packages/bitsandbytes/libbitsandbytes_cpu.so)
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/bitsandbytes/cextension.py", line 85, in <module>
    lib = get_native_library()
          ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/bitsandbytes/cextension.py", line 72, in get_native_library
    dll = ct.cdll.LoadLibrary(str(binary_path))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/ctypes/__init__.py", line 454, in LoadLibrary
    return self._dlltype(name)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/ctypes/__init__.py", line 376, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4

RuntimeError: CUDA is required but not available for bitsandbytes. Please consider installing the multi-platform enabled version of bitsandbytes, which is currently a work in progress. Please check currently supported platforms and installation instructions at https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend
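
8-bit loading is typically used to reduce memory. As a rough back-of-the-envelope sketch, the weights alone need about (number of parameters) * (bytes per parameter). The figure of 7 billion parameters is an approximation taken from the model name, and activations and other overheads are ignored. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights only, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # assumption: roughly 7 billion parameters
for label, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8)]:
    print(f"{label:>8}: ~{weight_memory_gb(n_params, bits):.0f} GB")
```

Under these assumptions, 8-bit weights take about a quarter of the float32 footprint, but the error above shows that the bitsandbytes route is only available when a CUDA GPU is attached.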

In [None]:
# inspect the default generation settings that ship with the model
generation_config = model1.generation_config
print(generation_config)

In [None]:
# pipeline for text generation; initialise separate pipelines for other tasks like speech-to-text, translation, etc.
"""text_generator = pipeline(
    "text-generation",
    model=model2,
    tokenizer=tokenizer,
    torch_dtype = torch.bfloat16,
    device_map = "auto",
    max_length=500,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    return_full_text=False,
    eos_token_id=tokenizer.eos_token_id
)
"""


In [None]:
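# Falcon is a generative (decoder-only) model, so the matching pipeline task is
# "text-generation"; this assumes model1 was loaded with AutoModelForCausalLM.
# The model is already loaded, so torch_dtype and device_map are not repeated here.
question_answering = pipeline(
    task="text-generation",
    model=model1,
    tokenizer=tokenizer,
    max_new_tokens=1000,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    return_full_text=False,
)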

In [None]:
# wrap the transformers pipeline (not the bare model) so LangChain can use it as an LLM
llm1 = HuggingFacePipeline(pipeline=question_answering)
# instantiate the text generator llm pipeline
#llm2 = HuggingFacePipeline(pipeline = text_generator, model_kwargs = {'temperature':0, "max_length": 1000})

In [None]:
# chatbot initial prompt template
template = """You are the official AI chatbot for the World Bank IDEAS Project, and you follow instructions extremely well. You will answer user queries about the World Bank IDEAS Project with specific details, based on {context}. Please be respectful and friendly, and ensure each user has a pleasant experience and conversation with you. Please be truthful and give direct answers.

### User:
{query}

### WiChat: """
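
The template has two placeholders, `{context}` and `{query}`. A quick way to confirm which variables a template expects is to parse it with the standard library, as sketched below on a shortened copy of the prompt (the shortened wording is only for illustration). Any other curly braces in a template string would also be read as placeholders, which is worth keeping in mind before treating already-filled text as a template again.

```python
from string import Formatter

short_template = """You are the official AI chatbot for the World Bank IDEAS Project. Answer based on {context}.

### User:
{query}

### WiChat: """

# Formatter().parse yields (literal_text, field_name, format_spec, conversion) tuples
placeholders = [field for _, field, _, _ in Formatter().parse(short_template) if field]
print(placeholders)

filled = short_template.format(context="<retrieved chunks>", query="What is the IDEAS project?")
print(filled)
```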


In [None]:

query = "What is World Bank IDEAS Project?"
query_embedded = embeddings.embed_query(query)


In [None]:
# retrieve related text
retrieved_docs = db.similarity_search(query=query, k=5)
retrieved_docs_text = [doc.page_content for doc in retrieved_docs]  # only the text of the documents is needed
context = "\nExtracted documents:\n"
# separate the documents with a newline so one chunk does not run into the next header
context += "\n".join(f"Document {i}:::\n{doc}" for i, doc in enumerate(retrieved_docs_text))

final_prompt = template.format(query=query, context=context)
print(final_prompt)
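
With `k=5` chunks of up to 1000 characters each, the context can reach about 5000 characters before the question and instructions are added. If the prompt needs to stay within a size budget, one option is to add chunks in ranked order until the budget is used up. The sketch below assumes a hypothetical character budget (`max_chars`) and a hypothetical helper `build_context`; a token-based limit computed with the model's tokenizer would be more accurate.

```python
def build_context(chunks: list[str], max_chars: int = 4000) -> str:
    """Join ranked chunks into one context string without exceeding max_chars."""
    header = "\nExtracted documents:\n"
    parts = []
    used = len(header)
    for i, chunk in enumerate(chunks):
        entry = f"Document {i}:::\n{chunk}\n"
        if used + len(entry) > max_chars:
            break  # stop at the first chunk that does not fit
        parts.append(entry)
        used += len(entry)
    return header + "".join(parts)


# stand-ins for five retrieved chunks of 1000 characters each
ranked_chunks = ["a" * 1000, "b" * 1000, "c" * 1000, "d" * 1000, "e" * 1000]
limited_context = build_context(ranked_chunks, max_chars=4000)
print(len(limited_context), limited_context.count(":::"))
```

Stopping at the first chunk that does not fit keeps the highest-ranked chunks and drops the tail, which is usually the least relevant part of the retrieval.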

In [None]:
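from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# build the prompt from the raw template so {context} and {query} remain input variables
prompt = PromptTemplate.from_template(template)

llm_chain = LLMChain(prompt=prompt, llm=llm1)

# pass the retrieved context and the question as text, not the query embedding
print(llm_chain.run(context=context, query=query))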

In [None]:
# retriever
#retriever = db.as_retriever()

In [None]:

# 1. Audio to Text
#speech_to_text = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h") # Replace model with your model
#audio_file = "path/to/your/audio.wav" # Replace with your file
#text = speech_to_text(audio_file)["text"]

# 2. Translation
#translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr") # Replace with the correct language codes and model
#translated_text = translator(text)[0]['translation_text']

#print("Original Text:", text)
#print("Translated Text:", translated_text)
