# Retrieval-Augmented Generation (RAG) Chatbot 🗨



## Phase 3: Preprocess and Chunk the Text

In [None]:
import pandas as pd
import re
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import sent_tokenize

# Download the necessary NLTK data
nltk.download('punkt')
nltk.download('punkt_tab') # Download punkt_tab data

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
# --- Load your data ---
df = pd.read_csv("egyptian_history.csv")  # Change to your filename

In [None]:
df

Unnamed: 0,paragraph,source
0,"Egypt is a country in North Africa, on the Med...",https://www.worldhistory.org/egypt/
1,Memphis was the first capital of Egypt and a f...,https://www.worldhistory.org/egypt/
2,One of the reasons for the enduring popularity...,https://www.worldhistory.org/egypt/
3,"To the Egyptians, life on earth was only one a...",https://www.worldhistory.org/egypt/
4,Egypt has a long history which goes back far b...,https://www.worldhistory.org/egypt/
...,...,...
542,"During the 2020–2021 Tigray War, Egypt was als...",https://en.wikipedia.org/wiki/History_of_moder...
543,"In 332 BC, Alexander III of Macedon conquered ...",https://en.wikipedia.org/wiki/History_of_ancie...
544,Following Alexander's death in Babylon in 323 ...,https://en.wikipedia.org/wiki/History_of_ancie...
545,The later Ptolemies took on Egyptian tradition...,https://en.wikipedia.org/wiki/History_of_ancie...


In [None]:
# Clean each paragraph by removing HTML tags and extra spaces
def clean_text(text):
    # Remove HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()

    # Remove extra whitespace
    text = re.sub(r"\s+", " ", text)

    # Strip leading and trailing spaces
    text = text.strip()

    return text

df["cleaned_paragraph"] = df["paragraph"].apply(clean_text)
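
To make the effect of the cleaning step concrete, here is a minimal sketch that applies the same two operations (drop tags, collapse whitespace) to one sample string. It is a stand-in, not the notebook's `clean_text`: it uses the standard library's `html.parser` instead of BeautifulSoup so it runs without extra packages, and the names `_TextExtractor` and `clean_text_stdlib` and the sample string are hypothetical. It also returns an empty string for non-string input (for example a NaN from an empty CSV cell), which is an assumption about how missing paragraphs should be handled.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_text_stdlib(text):
    # Hypothetical stand-in for clean_text: strip tags, collapse whitespace.
    if not isinstance(text, str):
        return ""  # assumption: treat missing values (e.g. NaN) as empty text
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    joined = "".join(parser.parts)
    # Collapse runs of spaces, tabs and newlines, then trim both ends.
    return re.sub(r"\s+", " ", joined).strip()


sample = "<p>Memphis   was the <b>first</b> capital\n of Egypt &amp; a trade center.</p>"
print(clean_text_stdlib(sample))
```

Tag handling in BeautifulSoup and in this sketch can differ on malformed markup, so the two should not be assumed to give identical results on every input.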


In [None]:
# Show 5 samples of cleaned text
df["cleaned_paragraph"].head()

Unnamed: 0,cleaned_paragraph
0,"Egypt is a country in North Africa, on the Med..."
1,Memphis was the first capital of Egypt and a f...
2,One of the reasons for the enduring popularity...
3,"To the Egyptians, life on earth was only one a..."
4,Egypt has a long history which goes back far b...


In [None]:
# --- Combine all paragraphs into one large string ---
full_text = " ".join(df["cleaned_paragraph"].tolist())

In [None]:
def chunk_text(text, max_words=300):
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)

    chunks = []
    current_chunk = []
    current_word_count = 0

    for sentence in sentences:
        sentence_word_count = len(sentence.split())

        # Check if adding the current sentence exceeds the max word limit
        if current_word_count + sentence_word_count <= max_words:
            current_chunk.append(sentence)
            current_word_count += sentence_word_count
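        else:
            # Adding the sentence would exceed the limit: store the current chunk
            # and start a new one. The guard avoids emitting an empty chunk when
            # the very first sentence is already longer than max_words.
            if current_chunk:
                chunks.append(" ".join(current_chunk))
            current_chunk = [sentence]
            current_word_count = sentence_word_count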

    # Add the last chunk
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

# Example usage
chunks = chunk_text(full_text, max_words=300)

# --- Save chunks to CSV for Phase 4 ---
chunks_df = pd.DataFrame({"chunk": chunks})
chunks_df.to_csv("cleaned_chunks.csv", index=False)

print(f"Done! {len(chunks)} chunks saved to 'cleaned_chunks.csv'.")

Done! 172 chunks saved to 'cleaned_chunks.csv'.
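
The chunker above cuts cleanly at sentence boundaries, so a fact that spans two neighbouring sentences can end up split across two chunks. One common variation, sketched here as an assumption rather than as part of this notebook, repeats the last sentence or two of each chunk at the start of the next. The function name `chunk_sentences_with_overlap` and the `overlap_sentences` parameter are hypothetical, and the function takes an already tokenized list of sentences (for example the output of `sent_tokenize`) so that it runs without NLTK.

```python
def chunk_sentences_with_overlap(sentences, max_words=300, overlap_sentences=1):
    """Group sentences into chunks of roughly max_words words, repeating the
    last overlap_sentences sentences of each chunk at the start of the next."""
    chunks = []
    current = []
    current_words = 0

    for sentence in sentences:
        n_words = len(sentence.split())
        if current and current_words + n_words > max_words:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            current_words = sum(len(s.split()) for s in current)
        current.append(sentence)
        current_words += n_words

    if current:
        chunks.append(" ".join(current))
    return chunks


demo_sentences = ["a b c.", "d e f.", "g h i.", "j k l."]
demo_chunks = chunk_sentences_with_overlap(demo_sentences, max_words=6, overlap_sentences=1)
for chunk in demo_chunks:
    print(chunk)
```

Because the carried-over sentences count toward the next chunk, `max_words` acts as a soft limit here: a chunk can exceed it when the overlap plus one new sentence is longer than the limit.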


In [None]:
# Show the first 3 chunks
for i, chunk in enumerate(chunks[:3]):
    print(f"\n--- Chunk {i+1} ---\n{chunk[:500]}...\n")  # Only show first 500 characters


--- Chunk 1 ---
Egypt is a country in North Africa, on the Mediterranean Sea, and is home to one of the oldest civilizations on earth. The name 'Egypt' comes from the Greek Aegyptos which was the Greek pronunciation of the ancient Egyptian name 'Hwt-Ka-Ptah' ("Mansion of the Spirit of Ptah"), originally the name of the city of Memphis. Memphis was the first capital of Egypt and a famous religious and trade center; its high status is attested to by the Greeks alluding to the entire country by that name. To the a...


--- Chunk 2 ---
Although ancient Egypt in popular culture is often associated with death and mortuary rites, something even in these speaks to people across the ages of what it means to be a human being and the power and purpose of remembrance. To the Egyptians, life on earth was only one aspect of an eternal journey. The soul was immortal and was only inhabiting a body on this physical plane for a short time. At death, one would meet with judgment in the Hall of Truth and

## Phase 4: Embed the Chunks

In [None]:
!pip install sentence-transformers


In [None]:
from sentence_transformers import SentenceTransformer
import pandas as pd
import numpy as np

In [None]:
# Load your cleaned chunks
chunks_df = pd.read_csv("cleaned_chunks.csv")
chunks = chunks_df["chunk"].tolist()

In [None]:
# Load the embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")


In [None]:
# Generate embeddings
print("Generating embeddings...")
embeddings = model.encode(chunks, show_progress_bar=True)

Generating embeddings...



In [None]:
# Convert to NumPy array and save
embeddings_array = np.array(embeddings)
np.save("chunk_embeddings.npy", embeddings_array)

In [None]:
# print the embeddings
print(embeddings_array)

[[-0.02090334  0.10395747 -0.01440946 ...  0.04607908 -0.00542887
  -0.02366966]
 [-0.03085236  0.12589885 -0.04189853 ...  0.03677839  0.00358248
  -0.00612462]
 [ 0.01523023  0.0584539  -0.08671261 ...  0.03210965  0.0560926
  -0.05054455]
 ...
 [-0.13346834  0.03887247  0.0145149  ...  0.04768401  0.0088409
  -0.05814388]
 [-0.04132102  0.05651832 -0.04685583 ...  0.00702925  0.01924037
  -0.09578288]
 [-0.04014845  0.03723944 -0.02267091 ...  0.00607035  0.03374572
  -0.07846491]]


In [None]:
# Save the mapping from each chunk to its row in the embeddings array
chunks_df["embedding_index"] = range(len(chunks_df))
chunks_df.to_csv("chunks_with_index.csv", index=False)

## Phase 5: Create a Vector Store

In [None]:
!pip install faiss-cpu



In [None]:
import faiss
import numpy as np
import pandas as pd

In [None]:
# Load chunk embeddings
embeddings = np.load("chunk_embeddings.npy")

In [None]:
# Dimension of the vectors (must match the embedding size)
embedding_dim = embeddings.shape[1]
embedding_dim

384

In [None]:
# Create a flat (exact) FAISS index that ranks by L2 distance.
# IndexFlatIP ranks by inner product instead, which equals cosine
# similarity only if the vectors are L2-normalized first.
index = faiss.IndexFlatL2(embedding_dim)
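
The choice between L2 distance and cosine similarity matters less than it may seem once vectors have unit length: for unit vectors, squared L2 distance equals `2 - 2 * cosine`, so both produce the same ranking. The sketch below checks this in plain Python on three made-up vectors; the helper names and the sample values are illustrative assumptions, not part of the notebook. In practice, unit-length embeddings can be obtained with `faiss.normalize_L2` or by passing `normalize_embeddings=True` to `SentenceTransformer.encode`.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up vectors standing in for a query embedding and three chunk embeddings.
query_vec = normalize([1.0, 2.0, 0.5])
doc_vecs = [normalize(v) for v in ([1.0, 1.9, 0.4], [-1.0, 0.2, 3.0], [0.5, 0.5, 0.5])]

# Rank by descending cosine similarity and by ascending squared L2 distance.
by_cosine = sorted(range(len(doc_vecs)), key=lambda i: -dot(query_vec, doc_vecs[i]))
by_l2 = sorted(range(len(doc_vecs)), key=lambda i: squared_l2(query_vec, doc_vecs[i]))

print("cosine ranking:", by_cosine)
print("L2 ranking:    ", by_l2)
```

Without normalization the two rankings can disagree, because L2 distance is also sensitive to vector length.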

In [None]:
# Add embeddings to the index
index.add(embeddings)

In [None]:
# Save the index
faiss.write_index(index, "chunk_faiss_index.idx")

print(f"FAISS index created and saved. Total vectors: {index.ntotal}")

FAISS index created and saved. Total vectors: 172


In [None]:
# Load the FAISS index
index = faiss.read_index("chunk_faiss_index.idx")

index

<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x7b97d8b8eb80> >

## Phase 6: The RAG System

In [None]:
# Retrieve the chunks most relevant to a query
def retrieve_chunks(query, top_k=3):
    # Embed the query with the same model used for the chunks
    query_embedding = model.encode([query])
    # Search the index for the top_k nearest chunk embeddings
    distances, indices = index.search(query_embedding, k=top_k)

    # FAISS pads the result with -1 when fewer than top_k vectors exist
    return "\n".join(chunks[i] for i in indices[0] if i != -1)
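
For intuition about what the search call returns, here is a brute-force sketch of an exact nearest-neighbour lookup in plain Python. It is not FAISS code: `flat_l2_search` is a hypothetical name, the vectors are made up, and the function handles a single query. Like `IndexFlatL2`, it compares squared Euclidean distances and returns the smallest ones first, together with the positions of the matching vectors.

```python
import heapq


def flat_l2_search(query, vectors, top_k=3):
    """Return (distances, indices) of the top_k vectors closest to query,
    using squared L2 distance and an exhaustive scan."""
    scored = (
        (sum((q - x) ** 2 for q, x in zip(query, vec)), i)
        for i, vec in enumerate(vectors)
    )
    best = heapq.nsmallest(top_k, scored)  # sorted by distance, then index
    distances = [d for d, _ in best]
    indices = [i for _, i in best]
    return distances, indices


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_distances, toy_indices = flat_l2_search([0.9, 0.1], toy_vectors, top_k=2)
print(toy_indices, toy_distances)
```

The returned indices play the same role as `indices[0]` in `retrieve_chunks`: each one is a position in the list of chunks, which is why the chunks and the embeddings must stay in the same order.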

In [None]:
!pip install together



In [None]:
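import os
import time

from together import Together

# Initialize the client used to send prompts to models hosted on Together.ai.
# Read the API key from the environment rather than hard-coding it in the notebook;
# a key that has been committed or shared should be revoked and regenerated.
client = Together(api_key=os.environ["TOGETHER_API_KEY"])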

def rag_chat(query, context):
    prompt = f"""Answer the question based on the context below. If the answer is not in the context, say "I don't know."

Context:
{context}

Question:
{query}

Answer:""" # How the model will behave

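    response = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,  # Higher values give more varied answers
        max_tokens=500,   # Upper bound on generated tokens; prompt and answer must fit the model's context window
        stream=True,
    )

    # Print the answer as it streams in and collect the pieces so the
    # full text can also be returned to the caller.
    parts = []
    for chunk in response:
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if piece:  # Some stream events carry no text
            print(piece, end="", flush=True)
            parts.append(piece)
            time.sleep(0.2)  # Slow the output for a typewriter effect
    print()

    return "".join(parts)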
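
The prompt in `rag_chat` pastes the retrieved text in as one block. A possible refinement, sketched here under assumptions, is to build the prompt in a separate function that numbers each chunk and stops adding chunks once a word budget is reached, so the context stays bounded when `top_k` grows. The name `build_prompt`, the `max_context_words` parameter and its default are hypothetical; the instruction wording is copied from the prompt above.

```python
def build_prompt(query, retrieved_chunks, max_context_words=900):
    """Assemble a RAG prompt from a list of chunks, keeping whole chunks
    in order until the word budget would be exceeded."""
    kept = []
    used = 0
    for chunk in retrieved_chunks:
        n_words = len(chunk.split())
        # Always keep the first chunk; drop later ones that overflow the budget.
        if kept and used + n_words > max_context_words:
            break
        kept.append(chunk)
        used += n_words

    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(kept, start=1))
    return (
        "Answer the question based on the context below. "
        'If the answer is not in the context, say "I don\'t know."\n\n'
        f"Context:\n{context}\n\nQuestion:\n{query}\n\nAnswer:"
    )


demo_prompt = build_prompt(
    "Who was Cleopatra?",
    ["a b c", "d e", "f g h i"],
    max_context_words=5,
)
print(demo_prompt)
```

Counting words is only a rough proxy for tokens, so the budget should be chosen conservatively relative to the model's actual limit.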

## Sample Questions with Answers

In [None]:
query = 'Who was Cleopatra?'
context = retrieve_chunks(query)

answer = rag_chat(query, context)
answer

'Cleopatra was an Egyptian queen of the Ptolemaic dynasty, famous in history and drama as the lover of Julius Caesar and later as the wife of Mark Antony. She was the last queen of the Macedonian dynasty that ruled Egypt between the death of Alexander the Great in 323 bce and its annexation by Rome in 30 bce.'

In [None]:
query = 'Who was Nefertiti?'
context = retrieve_chunks(query)

answer = rag_chat(query, context)
answer

'Nefertiti was the wife of the Egyptian pharaoh Akhenaten, and possibly the daughter of Ay, although this claim is not substantiated. She was a member of the royal court at Thebes, an adherent of the cult of Aten, and the mother of six daughters. Her life and influence are not well-documented, but it is known that she was deeply devoted to her husband and was likely a supportive figure in his religious and cultural changes.'

In [None]:
query = 'Who built the Great Pyramids of Giza?'
context = retrieve_chunks(query)

rag_chat(query, context)

'The Great Pyramids of Giza were built by three later 4th-dynasty monarchs: Khufu, Khafre, and Menkaure.'

In [None]:
query = 'Who is Memphis?'
context = retrieve_chunks(query)

answer = rag_chat(query, context)
answer

'Memphis is not a person, but rather a city and the ancient capital of Egypt. It is also referred to as the modern name of a city, which is a Greek version of the Egyptian Men-nefer, and another geographic term for it is Hut-ka-Ptah, meaning "mansion of the ka of Ptah".'

In [None]:
query = 'What is the Pharaoh?'
context = retrieve_chunks(query)

answer = rag_chat(query, context)
answer

"The Pharaoh was the supreme ruler of the people in ancient Egypt, considered a god on earth and the intermediary between the gods and the people. As such, the Pharaoh held multiple roles, including 'High Priest of Every Temple' and 'Lord of the Two Lands', with duties such as building temples, officiating at religious ceremonies, making laws, owning land, collecting taxes, and defending the country."

In [None]:
query = 'What are the sons and daughters of Nefertiti?'
context = retrieve_chunks(query)

rag_chat(query, context)

"The sons of Akhenaten (Nefertiti's husband) with his lesser wife Kiya are Tutankhamun and possibly Smenkhkare. The daughters of Akhenaten and Nefertiti are Meritaten, Meketaten, Ankhesenpaaten, Nefernefruaten-tasherit, Neferneferure, and Setepenre. It is also mentioned that Akhenaten may have had children with his daughters Meritaten and Ankhesenpaaten, but this is disputed."

In [None]:
query = 'What is the Sphinx?'
context = retrieve_chunks(query)

rag_chat(query, context)

'The Sphinx is a giant recumbent lion with the head of a man, located in the midst of an ancient plateau, specifically on the Giza plateau in Egypt.'

In [None]:
query = 'Talk about Egypt'
context = retrieve_chunks(query)

rag_chat(query, context)

"Egypt is one of the world's oldest civilizations, with a history dating back to around 3150 BC when it was unified by King Narmer. It has been ruled by various powers, including the Persians, Greeks, Romans, and Islamic caliphates, before joining the Ottoman Empire in 1517. Later, it was controlled by Britain in the late 19th century and became a republic in 1953. Today, Egypt is led by Abdel Fattah el-Sisi.In ancient times, Egypt was a time of great wealth and power, with notable pharaohs such as Hatshepsut, a rare female pharaoh, and Thutmose III, who expanded Egypt's army and was a successful military leader. Other notable pharaohs include Amenhotep III, who built extensively at the temple of Karnak, and Akhenaten, who introduced monotheistic worship of the god Aten.Egyptian civilization has had a significant impact on the world, with its iconography and beliefs influencing Christianity and many other aspects of modern culture. The country's rich history and cultural heritage conti

In [None]:
query = 'State a list of Pharaohs'
context = retrieve_chunks(query)

rag_chat(query, context)

"I don't know. The context does not provide a comprehensive list of Pharaohs. It mentions some Pharaohs, such as Hatshepsut, Akhenaten, Tutankhamun, and Nectanebo II, but it does not provide a complete list of all Pharaohs. Additionally, it mentions some dynasties and their rulers, but the information is incomplete and unclear in some cases."

In [None]:
query = 'How was the new kingdom of Egypt?'
context = retrieve_chunks(query)

rag_chat(query, context)

'The New Kingdom of Egypt was the era of Imperial Egypt when it became an empire, following the disunity of the Second Intermediate Period and preceding the dissolution of the central government at the start of the Third Intermediate Period. It lasted from approximately 1570 to 1069 BCE.'

In [None]:
query = 'Talk about Alexander the Great'
context = retrieve_chunks(query)

rag_chat(query, context)

'Alexander the Great conquered Egypt in 332 BC with little resistance from the Persians. He visited Memphis and went on a pilgrimage to the oracle of Amun at the Siwa Oasis. The oracle declared him the son of Amun. He conciliated the Egyptians by showing respect for their religion but appointed Greeks to senior posts in the country and founded a new Greek city, Alexandria, to be the new capital. Early in 331 BC, he led his forces away to Phoenicia, never returning to Egypt. After his death in 323 BC, one of his closest companions, Ptolemy, was appointed to rule Egypt and eventually established himself as the ruler, founding the Ptolemaic dynasty that ruled Egypt for nearly 300 years.'

In [None]:
query = 'What are the great monuments of Egypt?'
context = retrieve_chunks(query)

rag_chat(query, context)

'The great monuments of Egypt include the Pyramids of Giza (the Great Pyramid of Khufu, the pyramid of Khafre, and the pyramid of Menkaure), the Great Sphinx of Giza, the Step Pyramid at Saqqara, and the sun temples.'

In [None]:
query = "Who was the most recognizable queen of ancient Egypt?"
context = retrieve_chunks(query)

rag_chat(query, context)

'Nefertiti'

In [None]:
query = "Who is Nefertiti's father? Talk about him"
context = retrieve_chunks(query)

rag_chat(query, context)

"It appears that Nefertiti was the daughter of Ay, but this claim is far from substantiated. Ay was a tutor to the young Amenhotep IV (later Akhenaten) and held other duties, but nothing is known of his lesser wife, who might have been Nefertiti's mother. Ay's wife, Tiye (or Tey), is referred to as Nefertiti's wet nurse, not her mother."

In [None]:
query = "How was Egypt in the modern era?"
context = retrieve_chunks(query)

rag_chat(query, context)

'In the modern era, Egypt was controlled by Britain in the late 19th century, then it became a republic in 1953. After several political transitions, Abdel Fattah el-Sisi currently leads the country. Additionally, prior to becoming a republic, Egypt was part of the Ottoman Empire in 1517 and was later ruled by various other empires, including Persian, Greek, and Roman, as well as Islamic rule.'

In [None]:
query = "What was the death in ancient egypt?"
context = retrieve_chunks(query)

rag_chat(query, context)

'In ancient Egypt, death was not an end, but a transition to an eternal life. The soul was believed to be immortal and would meet with judgment in the Hall of Truth. If justified, the individual would move on to an eternal paradise known as The Field of Reeds, which was a mirror image of their life on earth, where they could live peacefully with loved ones, including pets, in a familiar environment.'

In [None]:
query = "What was the language of ancient Egypt?"
context = retrieve_chunks(query)

rag_chat(query, context)

"The language of ancient Egypt is not explicitly stated in the context, but it is mentioned that the ancient Egyptian name for the city of Memphis was 'Hwt-Ka-Ptah' and the country was known as 'Kemet', which means 'Black Land'. Additionally, it is mentioned that the Coptic language was spoken until the 17th century and remains a liturgical language today, and that Arabic culture replaced the Greek and Coptic languages and cultures."

In [None]:
query = "What were the reasons behind the abrupt disappearance of Nefertiti?"
context = retrieve_chunks(query)

rag_chat(query, context)

"There have been many theories offered to explain Nefertiti's abrupt disappearance, but none of them can be substantiated except possibly the fourth theory, which is also uncertain. The other theories are: 1. Akhenaten deserted Nefertiti because he had a male heir in Tutankhamun, but this is unlikely since he already had a male heir.2. Nefertiti left the cult of Aten, but there is no evidence to support this.3. The throne name of Akhenaten's successor is not the same as hers, but this is not true since the throne name of Smenkhkare is virtually identical to that of Akhenaten's coregent, now convincingly identified as Nefertiti.The fourth theory, known as the Nefertiti-as-Smenkhkare theory, suggests that Nefertiti may have taken the throne name Smenkhkare, but this is also uncertain."

In [None]:
query = "How was the religious life of ancient egypt?"
context = retrieve_chunks(query)

rag_chat(query, context)

"The religious life of ancient Egypt was centered around the concept of an eternal journey, where the soul was immortal and would meet with judgment in the Hall of Truth after death. If justified, one would move on to an eternal paradise known as The Field of Reeds, which was a mirror image of one's life on earth. The Egyptians believed in living in accordance with the will of the gods to achieve this eternal life. They also believed in the power of remembrance and the importance of preserving the body through mummification to ensure entry into the afterlife. The gods played a significant role in helping people in the afterlife, and the Egyptians filled tombs with items that would be needed in the afterlife, such as food, games, and clothing. Additionally, many of the iconography and beliefs of Egyptian religion were later incorporated into Christianity."

In [None]:
query = "Talk about gods of ancient egypt"
context = retrieve_chunks(query)

rag_chat(query, context)

'In ancient Egypt, the gods played a significant role in the lives of the people. The gods were believed to have given the people everything and had set the king over them as the one best-equipped to understand and implement their will. Some of the notable gods mentioned include Horus, who defeated the forces of chaos and restored order, Osiris, the god of the dead, Ra, the sun god, and Amun, whose priests held great power at Thebes. Additionally, there was Aten, a sun deity, who was the center of a monotheistic cult that was favored by Akhenaten. The gods were believed to guide the pharaoh in his role as ruler and to have given him the power to rule. The pharaoh, as the intermediary between the gods and the people, was responsible for building great temples and monuments to celebrate his own achievements and to pay homage to the gods. The gods were also believed to have a significant role in the afterlife, where they would judge the deceased in the Hall of Truth and determine their wo

In [None]:
query = "Who is Horus?"
context = retrieve_chunks(query)

rag_chat(query, context)

"Horus is the god who had defeated the forces of chaos and restored order. In Egyptian mythology, he is the son of Osiris and Isis, and is associated with the pharaoh in life. According to the myth, Horus avenged his father's murder by defeating his uncle Set, illustrating the triumph of order over chaos."

In [None]:
query = "Who is Horus?"
context = retrieve_chunks(query)

rag_chat(query, context)

Horus is the god who had defeated the forces of chaos and restored order. In Egyptian mythology, he is the son of Osiris and Isis, and he avenged his father's murder by defeating his uncle Set. The pharaoh was associated with Horus in life.
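
Every sample cell above repeats the same two calls: retrieve the context, then generate an answer. A small wrapper can run a whole list of questions and keep the results together for later review. This is a sketch under assumptions: `ask` and `run_questions` are hypothetical names, and the two stand-in functions exist only so the example runs without a model or an index.

```python
def ask(query, retrieve, generate, top_k=3):
    """Retrieve context for a query, then generate an answer from it."""
    context = retrieve(query, top_k)
    return generate(query, context)


def run_questions(questions, retrieve, generate, top_k=3):
    """Answer each question in turn and keep queries and answers together."""
    return [
        {"query": q, "answer": ask(q, retrieve, generate, top_k)}
        for q in questions
    ]


# Stand-in components so the sketch runs without a model or an index.
def fake_retrieve(query, top_k):
    return f"(top {top_k} chunks for: {query})"


def fake_generate(query, context):
    return f"Answer to '{query}' using {context}"


results = run_questions(
    ["Who was Cleopatra?", "What is the Sphinx?"],
    fake_retrieve,
    fake_generate,
)
for row in results:
    print(row["query"], "->", row["answer"])
```

In the notebook, `retrieve_chunks` and `rag_chat` could be passed in place of the stand-ins; the answers are only collected if `rag_chat` returns the generated text rather than just printing it.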