# **Capstone 1: AI Chatbot for Answer Evaluation**




# **1. Install required libraries**



*   `PyPDF2`: Used to extract text from PDF files.
*   `faiss-cpu`: A library for efficient similarity search using vector-based approaches (e.g., nearest neighbor search).
*   `transformers`: A popular library for handling pre-trained models (like GPT, T5, etc.).
*   `sentence-transformers`: Provides pre-trained models for generating sentence embeddings.
*   `langchain-community`: Community-maintained LangChain integrations with third-party tools, such as the FAISS vector store used here for indexing and retrieval.
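
Because these libraries change between releases, it can be useful to record which versions are actually installed. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


required = ["PyPDF2", "faiss-cpu", "transformers",
            "sentence-transformers", "langchain-community"]
for name, ver in installed_versions(required).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```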















In [None]:
# Install required libraries
!pip install PyPDF2 faiss-cpu transformers sentence-transformers
!pip install langchain-community

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0.post1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Downloading faiss_cpu-1.9.0.post1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
Installing collected packages: PyPDF2, faiss-cpu
Successfully installed PyPDF2-3.0.1 faiss-cpu-1.9.0.post1
Collecting langchain-community
  Downloading langchain_community-0.3.15-py3-none-any.whl.metadata (2.9 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting httpx-sse<0.5.0,>=

# **2. Import required modules**

*   `PdfReader` from `PyPDF2` to read and extract text from the PDF file.
*   `AutoTokenizer`, `AutoModel`, and `AutoModelForSeq2SeqLM` from `transformers` for working with pre-trained transformer models (including tokenization and inference).
*   `torch`: PyTorch is used for model computations and managing tensors.
*   `numpy`: For numerical operations; here, `np.mean` averages the similarity scores when the rating is computed.
*   `CharacterTextSplitter` from `langchain.text_splitter`: For splitting the raw extracted text into smaller, manageable chunks.
*   `FAISS` from `langchain.vectorstores`: For building a vector database for efficient similarity search.
*   `drive` from `google.colab`: For mounting Google Drive to access files (in this case, the PDF file).
*   `HuggingFaceEmbeddings` from `langchain.embeddings`: Loads a sentence-embedding model by name and exposes it through LangChain's embeddings interface, so that the FAISS index can embed queries at search time.
*   `cosine_similarity` from `sklearn.metrics.pairwise`: Computes the cosine similarity between vectors, that is, the cosine of the angle between two non-zero vectors in a multidimensional space. The value lies between -1 and 1.
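
To make the cosine similarity definition concrete, here is a minimal sketch in plain Python. The notebook itself uses scikit-learn's `cosine_similarity`; the `cosine` function below is a hypothetical stand-in that only illustrates the formula (dot product divided by the product of the vector lengths).

```python
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [2.0, 0.0]))   # same direction: 1.0
print(cosine([1.0, 0.0], [0.0, 3.0]))   # orthogonal: 0.0
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite direction: -1.0
```

Only the direction of the vectors matters: scaling either vector by a positive factor leaves the result unchanged.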












In [None]:
# Import required modules
from PyPDF2 import PdfReader
from transformers import AutoTokenizer, AutoModel, AutoModelForSeq2SeqLM
import torch
import numpy as np
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from google.colab import drive
from langchain.embeddings import HuggingFaceEmbeddings
from sklearn.metrics.pairwise import cosine_similarity

# **3. Mount Google Drive and load PDF**
This section mounts Google Drive in the Colab runtime and sets the path to the target PDF file, `UNIT-7 Optimizing E-commerce Systems NEW1.pdf`.

In [None]:
# Mount Google Drive to access the PDF file
drive.mount('/content/drive')

# Path to the PDF file in Google Drive
pdf_path = '/content/drive/MyDrive/UNIT-7 Optimizing E-commerce Systems NEW1.pdf'

Mounted at /content/drive


# **4. Extract text from the PDF**



*   The `PdfReader` is used to read the PDF file.
*   Then, we loop through each page of the PDF and extract its text using `extract_text()`; pages that yield no text are skipped.
*   All the extracted text is combined into the `raw_text` variable.
*   Finally, the length of the extracted text is printed along with a preview of the first 1000 characters.
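
Because `raw_text += text` adds no separator, the last line of one page runs directly into the first line of the next; the `SystemSearch Engine Optimization` fragment in the preview appears to be such a join. One possible variation, sketched here under that assumption, is to join the page texts with a newline. `join_page_texts` is a hypothetical helper, not part of the original notebook, and it takes the strings (or `None`) that `extract_text()` returns.

```python
def join_page_texts(page_texts, separator="\n"):
    """Join per-page text with a separator, skipping pages with no text."""
    kept = [t.strip() for t in page_texts if t and t.strip()]
    return separator.join(kept)


# Stand-ins for the values returned by page.extract_text()
pages = ["Outline:\nSearch Engine Optimization", None, "", "On Page SEO"]
print(join_page_texts(pages))
```

In the notebook this would correspond to `join_page_texts(page.extract_text() for page in reader.pages)`. Using it would change the extracted text length and the chunk boundaries reported in the outputs.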







In [None]:
# Extract text from the PDF
reader = PdfReader(pdf_path)

raw_text = ''
for page in reader.pages:
    text = page.extract_text()
    if text:
        raw_text += text

print(f"Extracted Text Length: {len(raw_text)}")
print(raw_text[:1000])

Extracted Text Length: 41510
Outline:
❑Search Engine Optimization, 
❑Working mechanism of Search Engines, 
❑On Page SEO, Off Page SEO, Page Ranks, 
❑Using Google Analytics, Social Media Analytics, 
❑Recommendation Systems: Collaborative, Content Based, 
❑Use of Recommendation Systems in E -commerceUNIT 7: Optimizing E -Commerce Systems
12/28/2024 1UNIT -7 Optimizing E -Comemrce SystemSearch Engine Optimization (SEO)
12/28/2024 UNIT -7 Optimizing E -Comemrce System2The process of maximizing the number of visitors to a particular 
website by ensuring that the site appears high on the list of results 
returned by a search engine .
"the key to getting more traffic lies in integrating content with search 
engine optimization and social media marketing"
SEO stands for Search Engine Optimization and helps search engines 
understand your website’s content and connect it with users by 
delivering relevant, valuable results based on their search queries.
The goal of SEO is to rank on the fir

# **5. Split text into manageable chunks**



*   Here, the text is split into smaller chunks for better processing.
*   The `CharacterTextSplitter` is used to split the `raw_text`:
    *   `separator="\n"`: The text is split at newline characters.
    *   `chunk_size=500`: Each chunk contains at most 500 characters.
    *   `chunk_overlap=50`: Consecutive chunks overlap by up to 50 characters, which helps maintain context between chunks.
    *   `length_function=len`: Chunk length is measured in characters.
*   The resulting chunks are stored in `texts`, and the number of chunks and a sample chunk are printed.
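
The following simplified sketch illustrates the idea behind separator-based splitting with overlap: lines are accumulated until adding another would exceed `chunk_size`, and the trailing lines that fit within `chunk_overlap` are carried into the next chunk. It is not LangChain's implementation, and edge cases (such as a single line longer than `chunk_size`) may be handled differently there; `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(text, separator="\n", chunk_size=500, chunk_overlap=50):
    """Greedily pack separator-delimited pieces into overlapping chunks."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # Close the current chunk
            chunks.append(separator.join(current))
            # Carry over trailing pieces that fit within the overlap budget
            carried = []
            for prev in reversed(current):
                if len(separator.join([prev] + carried)) > chunk_overlap:
                    break
                carried.insert(0, prev)
            current = carried
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


demo = "aaaa\nbbbb\ncccc\ndddd"
for chunk in split_with_overlap(demo, chunk_size=10, chunk_overlap=4):
    print(repr(chunk))
```

With `chunk_size=10` and `chunk_overlap=4`, each chunk holds two four-character lines, and the last line of one chunk is repeated as the first line of the next.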



In [None]:
# Split the text into manageable chunks
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)

print(f"Number of text chunks: {len(texts)}")
print(f"Sample chunk:\n{texts[0]}")

Number of text chunks: 91
Sample chunk:
Outline:
❑Search Engine Optimization, 
❑Working mechanism of Search Engines, 
❑On Page SEO, Off Page SEO, Page Ranks, 
❑Using Google Analytics, Social Media Analytics, 
❑Recommendation Systems: Collaborative, Content Based, 
❑Use of Recommendation Systems in E -commerceUNIT 7: Optimizing E -Commerce Systems
12/28/2024 1UNIT -7 Optimizing E -Comemrce SystemSearch Engine Optimization (SEO)


# **6. Load a Hugging Face model for embeddings**

Here, we load a pre-trained model from Hugging Face for generating sentence embeddings.


*   `sentence-transformers/all-MiniLM-L6-v2` is a lightweight transformer model designed to produce dense vector embeddings for sentences.
*   `AutoTokenizer` and `AutoModel` are used to load the tokenizer and the model respectively.





In [None]:
# Load a Hugging Face model for embeddings
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



# **7. Generate embeddings for text chunks**


*   This function takes the texts (split chunks of the PDF) and generates embeddings for each chunk using the pre-trained model.
*   The `tokenizer` converts each chunk into token IDs (as input for the model).
*   The model generates outputs, and we use `last_hidden_state` to get the embeddings for the input text. The embeddings are averaged (mean pooling) across all tokens in the chunk.
*   The `torch.no_grad()` context ensures that gradients are not calculated, saving memory and computation.
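
One detail of mean pooling deserves care. If the input is padded, for instance with `padding="max_length"`, a plain mean over all positions also averages the padding positions, which dilutes the embedding. A common remedy is to weight each position by the attention mask. The sketch below shows the arithmetic on plain Python lists; `masked_mean_pool` is a hypothetical helper, and a real implementation would operate on tensors.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        return total  # no real tokens: return the zero vector
    return [t / count for t in total]


# Two real tokens followed by one padding position
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
print(masked_mean_pool(tokens, mask))       # [2.0, 3.0]
print(masked_mean_pool(tokens, [1, 1, 1]))  # unmasked mean is pulled toward the padding
```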



In [None]:
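# Generate embeddings for text chunks
def get_embeddings(texts):
    embeddings = []
    for text in texts:
        # Each chunk is encoded on its own, so no padding is needed; padding to
        # max_length would add pad tokens that distort an unmasked mean
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        # Mean pooling weighted by the attention mask, so only real tokens count
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        pooled = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        # L2-normalize so chunk vectors are on the same scale as the query vectors
        # (assumes the query embedder also normalizes its output)
        pooled = torch.nn.functional.normalize(pooled, p=2, dim=1)
        embeddings.append(pooled.squeeze(0).numpy())
    return embeddings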

embeddings = get_embeddings(texts)

# **8. Create a FAISS index**


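*   `FAISS.from_embeddings()` creates a `FAISS` index from pairs of text chunks and their precomputed embeddings.
*   The `HuggingFaceEmbeddings` object passed as the second argument is not used to embed the chunks; the index keeps it to embed queries at search time.
*   This allows us to perform fast similarity search on the text chunks based on their embeddings.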
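
Conceptually, an exact similarity search compares the query vector with every stored vector and returns the closest ones. FAISS does this far more efficiently and also offers approximate indexes; the brute-force sketch below only illustrates the idea. It assumes Euclidean distance as the metric, and `nearest_chunks` is a hypothetical helper.

```python
import math


def nearest_chunks(query_vec, chunk_vecs, k=5):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = [(i, math.dist(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


stored = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
for index, distance in nearest_chunks([0.9, 0.0], stored, k=2):
    print(index, round(distance, 3))
```

The returned indices would then be used to look up the corresponding text chunks.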


In [None]:
# Create a FAISS index
text_embedding_pairs = list(zip(texts, embeddings))

embedding_function = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
)

# Pass the list of tuples and the embedding function to from_embeddings()
faiss_index = FAISS.from_embeddings(text_embedding_pairs, embedding_function)


# **9. Save the FAISS index for future use**
The `FAISS` index is saved to the Google Drive directory `/content/drive/MyDrive/faiss_index`, so that it can be loaded and reused later without regenerating the embeddings.

In [None]:
# Save the FAISS index for future use
faiss_index.save_local('/content/drive/MyDrive/faiss_index')

# **10. Load a pre-trained model for answering questions**
This loads `google/flan-t5-large` from Hugging Face, an instruction-tuned sequence-to-sequence model. It is not a dedicated question-answering model, but it can generate an answer to a question from the context supplied in its prompt.

In [None]:
# Load a pre-trained model for answering questions
qa_model_name = "google/flan-t5-large"
qa_tokenizer = AutoTokenizer.from_pretrained(qa_model_name)
qa_model = AutoModelForSeq2SeqLM.from_pretrained(qa_model_name)


# **11. Generate an answer to the question and rate it**

*   `relevant_chunks = faiss_index.similarity_search(question, k=5)`, searches a FAISS index for the top 5 relevant chunks of text matching the given question.
*   `context = " ".join([chunk.page_content for chunk in relevant_chunks])`, combines the text content (page_content) of the retrieved chunks into a single context string.
*   `input_text = f"Question: {question}\nContext: {context}"`, formats the question and context into a text structure suitable for the QA model.
*   `with torch.no_grad():`, disables gradient tracking while the model generates, which saves memory and computation during inference.
*   `answer = qa_tokenizer.decode(outputs[0], skip_special_tokens=True)`, decodes the model's output back into human-readable text.
*   `question_embedding = embedding_function.embed_query(question)`, generates a vector representation of the question.
*   `chunk_embeddings = [embedding_function.embed_query(chunk.page_content) for chunk in relevant_chunks]`, creates vector representations for the content of the retrieved chunks.
*   `similarity_scores = cosine_similarity([question_embedding], chunk_embeddings).flatten()`, computes cosine similarity between the question embedding and chunk embeddings.
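*   `rating = min(max(int(np.mean(similarity_scores) * 5), 1), 5)`, averages the similarity scores, multiplies the mean by 5, truncates it to an integer, and clamps the result to the range 1–5. Note that this rating measures how similar the question is to the retrieved chunks, not whether the generated answer is correct.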
*   `return answer, rating`, returns the generated answer and its corresponding rating.
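
The rating arithmetic can be checked in isolation. The sketch below reproduces the same formula on made-up similarity scores (the values are illustrative, not taken from the notebook's run); `rating_from_scores` is a hypothetical name for the one-line expression used in the function.

```python
def rating_from_scores(similarity_scores):
    """Mean similarity scaled by 5, truncated to an int, clamped to 1..5."""
    mean_score = sum(similarity_scores) / len(similarity_scores)
    return min(max(int(mean_score * 5), 1), 5)


for scores in ([0.25, 0.35], [0.5, 0.6], [0.9, 0.95]):
    print(scores, "->", rating_from_scores(scores))
```

Because `int()` truncates, any mean similarity below 0.4 yields a rating of 1, and a rating of 5 requires a mean of 1.0. This may be why both questions in the sample session are rated 1.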





In [None]:
# Function for answer generation and rating
def chatbot_question_answer(question):

    # Retrieve top 5 relevant chunks
    relevant_chunks = faiss_index.similarity_search(question, k=5)

    # Extract page content from relevant chunks
    context = " ".join([chunk.page_content for chunk in relevant_chunks])

    # Prepare the QA model input
    input_text = f"Question: {question}\nContext: {context}"
    inputs = qa_tokenizer(input_text, return_tensors="pt", truncation=True, max_length=1024)

    # Generate the answer
    with torch.no_grad():
        outputs = qa_model.generate(**inputs, max_length=300, num_beams=5, early_stopping=True)
    answer = qa_tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Calculate rating based on similarity
    question_embedding = embedding_function.embed_query(question)

    # Extract page content from chunks before embedding
    chunk_embeddings = [embedding_function.embed_query(chunk.page_content) for chunk in relevant_chunks]
    similarity_scores = cosine_similarity([question_embedding], chunk_embeddings).flatten()
    rating = min(max(int(np.mean(similarity_scores) * 5), 1), 5)

    return answer, rating

In [None]:
# Test the chatbot
while True:
    question = input("Ask a question (or type 'exit' to quit): ")
    if question.lower() == 'exit':
        break
    answer, rating = chatbot_question_answer(question)
    print(f"Answer: {answer}")
    print(f"Rating (1-5): {rating}")

Ask a question (or type 'exit' to quit): SEO
Answer: SEO is the foundation of holistic marketing, where everything your company does matters. Once you understand what your users want, you can then implement that knowledge across your:
Rating (1-5): 1
Ask a question (or type 'exit' to quit): conclude the pdf
Answer: Cont.. 12/28/2024 UNIT -7 Optimizing E -Comemrce System8
Rating (1-5): 1
Ask a question (or type 'exit' to quit): exit


# **Conclusion:**
The overall process extracts text from a PDF, splits it into chunks, generates embeddings, and indexes the chunks with FAISS. Users can then ask questions, which the chatbot answers from the context of the most relevant chunks retrieved from the index. Each answer is returned with a 1–5 rating derived from the similarity between the question and those chunks.