<a href="https://colab.research.google.com/github/Aman-Gautam1/Chat_bot-with-Rag-and-Mysql-Storage/blob/main/data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fetching Wikipedia content using LangChain

In [None]:
!pip install langchain wikipedia



In [None]:
!pip install langchain-community



In [None]:
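# WikipediaLoader lives in the langchain-community package installed above
from langchain_community.document_loaders import WikipediaLoader

# Fetch the Wikipedia pages matching the query; each page becomes one Document
loader = WikipediaLoader(query='History of India')
docs = loader.load()
docs[0]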

Document(metadata={'title': 'History of India', 'summary': "Anatomically modern humans first arrived on the Indian subcontinent between 73,000 and 55,000 years ago. The earliest known human remains in South Asia date to 30,000 years ago. Sedentariness began in South Asia around 7000 BCE; by 4500 BCE, settled life had spread, and gradually evolved into the Indus Valley Civilisation, one of three early cradles of civilisation in the Old World, which flourished between 2500 BCE and 1900 BCE in present-day Pakistan and north-western India. Early in the second millennium BCE, persistent drought caused the population of the Indus Valley to scatter from large urban centres to villages. Indo-Aryan tribes moved into the Punjab from Central Asia in several waves of migration. The Vedic Period of the Vedic people in northern India (1500–500 BCE) was marked by the composition of their extensive collections of hymns (Vedas). The social structure was loosely stratified via the varna system, incorpor

# Chunk and preprocess the text using `RecursiveCharacterTextSplitter`



In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Sizes are measured in characters; consecutive chunks share up to 70 characters
splitter = RecursiveCharacterTextSplitter(
    chunk_size=250,
    chunk_overlap=70
)
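To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is a simplification, not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on separators such as paragraphs and spaces), and the helper name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=250, chunk_overlap=70):
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; it ignores separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# 600 characters of repeating alphabet as stand-in text
sample = "".join(chr(ord("a") + i % 26) for i in range(600))
pieces = sliding_window_chunks(sample)
print([len(p) for p in pieces])            # [250, 250, 240]
print(pieces[0][-70:] == pieces[1][:70])   # True: neighbours share 70 characters
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.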

In [None]:
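# Whitespace-normalised copy of the first page, for inspection only;
# the splitter below works on the original `docs`, not on `clean_text`.
clean_text = docs[0].page_content.strip()
clean_text = " ".join(clean_text.split())

# Split every loaded page into overlapping chunks
chunks = splitter.split_documents(docs)
print(f"total chunks are {len(chunks)}")
chunks[0]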

total chunks are 595


Document(metadata={'title': 'History of India', 'summary': "Anatomically modern humans first arrived on the Indian subcontinent between 73,000 and 55,000 years ago. The earliest known human remains in South Asia date to 30,000 years ago. Sedentariness began in South Asia around 7000 BCE; by 4500 BCE, settled life had spread, and gradually evolved into the Indus Valley Civilisation, one of three early cradles of civilisation in the Old World, which flourished between 2500 BCE and 1900 BCE in present-day Pakistan and north-western India. Early in the second millennium BCE, persistent drought caused the population of the Indus Valley to scatter from large urban centres to villages. Indo-Aryan tribes moved into the Punjab from Central Asia in several waves of migration. The Vedic Period of the Vedic people in northern India (1500–500 BCE) was marked by the composition of their extensive collections of hymns (Vedas). The social structure was loosely stratified via the varna system, incorpor

# Embedding and vector store

In [None]:
!pip install chromadb sentence-transformers



In [None]:
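import chromadb
from sentence_transformers import SentenceTransformer

# In-memory Chroma client: the collection is lost when the session ends
client = chromadb.Client()
# get_or_create_collection avoids an error when this cell is re-run
collection = client.get_or_create_collection("Indianhistory")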

In [None]:
text_chunks = [chunk.page_content for chunk in chunks]
text_chunks[0]

'Anatomically modern humans first arrived on the Indian subcontinent between 73,000 and 55,000 years ago. The earliest known human remains in South Asia date to 30,000 years ago. Sedentariness began in South Asia around 7000 BCE; by 4500 BCE, settled'

In [None]:
model1 = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model1.encode(text_chunks).tolist()



In [None]:
collection.add(
    ids=[str(i) for i in range(len(text_chunks))],  # Unique IDs for each document
    documents=text_chunks,
    embeddings=embedding
)
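For larger corpora it can help to add records in smaller batches instead of one call. The sketch below only prepares the batches and does not touch Chroma; the helper name `iter_batches` and the batch size of 100 are assumptions for illustration, and whether batching is needed at all depends on the Chroma version and the size of the collection.

```python
def iter_batches(documents, embeddings, batch_size=100):
    """Yield dicts of aligned ids, documents and embeddings.

    Hypothetical helper: ids are the string positions of the documents,
    matching the str(i) scheme used in the cell above.
    """
    if len(documents) != len(embeddings):
        raise ValueError("documents and embeddings must have the same length")
    for start in range(0, len(documents), batch_size):
        end = min(start + batch_size, len(documents))
        yield {
            "ids": [str(i) for i in range(start, end)],
            "documents": documents[start:end],
            "embeddings": embeddings[start:end],
        }


# Toy data standing in for text_chunks and embedding
docs_demo = [f"chunk {i}" for i in range(5)]
embs_demo = [[float(i), 0.0] for i in range(5)]
batches = list(iter_batches(docs_demo, embs_demo, batch_size=2))
print([b["ids"] for b in batches])  # [['0', '1'], ['2', '3'], ['4']]
```

Each yielded dict has the same keyword names as the `collection.add` call above, so in the notebook a batch could be passed as `collection.add(**batch)`.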


# Retrieval from the vector database

In [None]:
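def retrival_top_k(query, k=4):
    """Return the k stored chunks closest to the query embedding."""
    # Encode the query as a one-item batch and convert it to plain lists,
    # the same format used when the chunk embeddings were added
    query_embedding = model1.encode([query]).tolist()

    results = collection.query(
        query_embeddings=query_embedding,
        n_results=k  # Number of results to retrieve
    )

    # results['documents'] is a list of lists: one list of chunks per query
    top_k_chunks = results['documents']

    return top_k_chunks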
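Conceptually, the query above ranks stored vectors by closeness to the query vector. The following is a minimal brute-force sketch of that idea using cosine similarity in plain Python. It is an illustration only: the collection uses its own index and configured distance metric, which may differ from cosine, and the names `cosine_similarity` and `top_k_by_cosine` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query_vec, doc_vecs, docs, k=4):
    """Return the k documents whose vectors are most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, v), d) for v, d in zip(doc_vecs, docs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return [d for _, d in scored[:k]]


# Toy two-dimensional "embeddings" standing in for real model output
toy_docs = ["ashoka", "mughal", "indus"]
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k_by_cosine([1.0, 0.1], toy_vecs, toy_docs, k=2))  # ['ashoka', 'indus']
```

Note that nearest-neighbour search always returns k results, even when nothing in the collection is actually relevant to the query.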


# Answer Generation

In [None]:
!pip install transformers



In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "t5-small"  # Lightweight T5 model
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)



In [None]:
def generate_answer(query, retrieved_chunks):
    """Generate an answer from the query and the retrieved context chunks."""
    # retrieved_chunks is a list of lists (one list per query), so flatten it
    retrieved_chunks = [item for sublist in retrieved_chunks for item in sublist]
    context = " ".join(retrieved_chunks)
    input_text = f"question: {query} context: {context}"
    # Inputs longer than 500 tokens are truncated
    inputs = tokenizer(input_text, return_tensors="pt", max_length=500, truncation=True)

    outputs = model.generate(**inputs, max_length=550)
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer
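The prompt assembly inside `generate_answer` can be looked at in isolation. Below is a minimal sketch that flattens the nested retrieval result and caps the context at a word budget before building the `question: ... context: ...` string. The helper name `build_prompt` and the `max_words` parameter are assumptions for illustration; a word count is only a rough proxy for tokens, and in the notebook the tokenizer's `truncation=True` is what actually enforces the limit.

```python
def build_prompt(query, retrieved_chunks, max_words=300):
    """Flatten nested chunks and build a 'question: ... context: ...' prompt.

    Hypothetical helper: max_words is a rough word budget, not a token limit.
    """
    # One inner list per query, so flatten to a single list of chunk strings
    flat = [chunk for group in retrieved_chunks for chunk in group]
    context_words = " ".join(flat).split()
    context = " ".join(context_words[:max_words])  # keep the earliest words
    return f"question: {query} context: {context}"


demo = build_prompt(
    "Who was Ashoka",
    [["Ashoka ruled the Maurya Empire.", "He promoted Buddhism."]],
    max_words=6,
)
print(demo)  # question: Who was Ashoka context: Ashoka ruled the Maurya Empire. He
```

Because the earliest words are kept, the chunks ranked highest by the retriever survive truncation and the lowest-ranked ones are dropped first.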

In [None]:
query = 'Salamn Khan  deer'  # query string kept exactly as it was run
top_k_chunks = retrival_top_k(query)

# The result holds one list of chunks per query, so iterate over the first (only) list
for i, chunk in enumerate(top_k_chunks[0]):
    print(f"Chunk {i+1}: {chunk}")

Chunk 1: the Himalayas, the Uttarapatha. He also built 1700 'serais' where two horses were always kept for the despatch of the Royal Mail. Akbar introduced camels in addition to the horses and runners.
Chunk 2: were depicted in ancient Indian reliefs. Another copper seal from Mohenjo Daro shows a horned hunter holding a composite bow.
Chunk 3: According to tradition, in the Deer Park in Sarnath near Vārāṇasī in northern India, Buddha set in motion the Wheel of Dharma by delivering his first sermon to the group of five companions with whom he had previously sought liberation. They,
Chunk 4: post system. This was expanded into the dak chowkis, a horse and foot runner service, by Alauddin Khalji in 1296. Sher Shah Suri (1541–1545) replaced runners with horses for conveyance of messages along the northern Indian high road, today known as


In [None]:
query = "Who was Ashoka"

# Get the top-k chunks (from the retrival_top_k function)
top_k_chunks = retrival_top_k(query)

# Generate the answer
answer = generate_answer(query, top_k_chunks)

print("Generated Answer:")
print(answer)

Generated Answer:
Ashoka (c. 268–232 BCE) the three Tamil dynasties of Chola, Chera and Pandya were ruling the south


In [None]:
import pickle
with open("chunks.pkl", "wb") as f:
    pickle.dump(chunks, f)

with open("embeddings.pkl", "wb") as f:
    pickle.dump(embedding, f)
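To reuse the saved files in a later session, they can be read back with `pickle.load`. The sketch below shows a save and load round trip on toy data in a temporary directory; the helper names `save_pickle` and `load_pickle` and the file name `embeddings_demo.pkl` are hypothetical, not part of the notebook.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Serialise obj to path in binary mode."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Read back an object written by save_pickle."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip on toy data inside a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings_demo.pkl")
    demo_embeddings = [[0.1, 0.2], [0.3, 0.4]]
    save_pickle(demo_embeddings, path)
    restored = load_pickle(path)

print(restored == demo_embeddings)  # True
```

Unpickling can execute arbitrary code, so only load pickle files that come from a trusted source.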

In [None]:
!pip install pipreqs



In [None]:
!pipreqs . --force


INFO: Not scanning for jupyter notebooks.
INFO: Successfully saved requirements file in ./requirements.txt
