# **Capstone 1: AI Chatbot**




# **1. Install required libraries**



*   `PyPDF2`: Used to extract text from PDF files.
*   `faiss-cpu`: A library for efficient similarity search using vector-based approaches (e.g., nearest neighbor search).
*   `transformers`: A popular library for handling pre-trained models (like GPT, T5, etc.).
*   `sentence-transformers`: Provides pre-trained models for generating sentence embeddings.
*   `langchain-community`: Community-maintained LangChain integrations, including the FAISS vector store and the Hugging Face embedding wrapper used for indexing and retrieval.















In [None]:
# Install required libraries
!pip install PyPDF2 faiss-cpu transformers sentence-transformers langchain-community


# **2. Import required modules**

*   `PdfReader` from `PyPDF2` to read and extract text from the PDF file.
*   `AutoTokenizer`, `AutoModel`, and `AutoModelForSeq2SeqLM` from `transformers` for working with pre-trained transformer models (including tokenization and inference).
*   `torch`: PyTorch is used for model computations and managing tensors.
*   `numpy`: For numerical operations; it is imported here but not called directly in this notebook.
*   `CharacterTextSplitter` from `langchain`: For splitting the raw extracted text into smaller, manageable chunks.
*   `FAISS`: For building a vector database for efficient similarity search.
*   `drive` from `google.colab`: For mounting Google Drive to access files (in this case, the PDF file).
*   `HuggingFaceEmbeddings`: Wraps the embedding model and its tokenizer in a single object that the FAISS index uses to embed queries.
*   `cosine_similarity` from `sklearn.metrics.pairwise`: Computes the cosine similarity between vectors, that is, the cosine of the angle between two non-zero vectors in a multidimensional space, giving a value between -1 and 1.
*   `random`: For picking a random text chunk when generating questions.
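
To make the cosine similarity definition concrete, here is a minimal pure-Python sketch. The helper name `cosine` and the toy vectors are illustrative assumptions; the notebook itself relies on `sklearn`'s `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors chosen to hit the three reference points of the range.
same_direction = cosine([1.0, 2.0], [2.0, 4.0])    # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])        # perpendicular vectors
opposite = cosine([1.0, 2.0], [-1.0, -2.0])        # opposite directions
```

Parallel vectors score 1, perpendicular vectors score 0, and opposite vectors score -1, regardless of vector length.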













In [None]:
# Import required modules
from PyPDF2 import PdfReader
from transformers import AutoTokenizer, AutoModel, AutoModelForSeq2SeqLM
import torch
import numpy as np
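from langchain.text_splitter import CharacterTextSplitter
# Import from langchain_community (installed above); the langchain.* paths for these classes are deprecated
from langchain_community.vectorstores import FAISS
from google.colab import drive
from langchain_community.embeddings import HuggingFaceEmbeddings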
from sklearn.metrics.pairwise import cosine_similarity
import random

# **3. Mount Google Drive and load PDF**
This section mounts Google Drive in the Colab runtime and sets the path to the target PDF file (`UNIT-7 Optimizing E-commerce Systems NEW1.pdf`).

In [None]:
# Mount Google Drive to access the PDF file
drive.mount('/content/drive')

# Path to the PDF file in Google Drive
pdf_path = '/content/drive/MyDrive/UNIT-7 Optimizing E-commerce Systems NEW1.pdf'

Mounted at /content/drive


# **4. Extract text from the PDF**



*   The `PdfReader` is used to read the PDF file.
*   Then, we loop through each page of the PDF and extract the text using `extract_text()`.
*   All the extracted text is combined into the `raw_text` variable.
*   Finally, the length of the extracted text is printed along with a preview of the first 1000 characters.
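
The extraction loop appends each page's text directly to the previous one, which may be one reason the preview shows words from adjacent slides running together. As a hedged alternative, the sketch below joins pages with a newline. `FakePage` and `join_pages` are hypothetical names used only for illustration; `FakePage` stands in for a PyPDF2 page so the example runs without a PDF.

```python
class FakePage:
    """Hypothetical stand-in for a PyPDF2 page, used only for illustration."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text

def join_pages(pages, separator="\n"):
    """Collect non-empty page text and join it with a separator."""
    parts = []
    for page in pages:
        text = page.extract_text()
        if text:  # skip pages with no extractable text
            parts.append(text)
    return separator.join(parts)

pages = [FakePage("Outline"), FakePage(""), FakePage("UNIT 7")]
raw_text_demo = join_pages(pages)
```

Collecting the parts in a list and joining once also avoids repeated string concatenation on long documents.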







In [None]:
# Extract text from the PDF
reader = PdfReader(pdf_path)

raw_text = ''
for page in reader.pages:
    text = page.extract_text()
    if text:
        raw_text += text

print(f"Extracted Text Length: {len(raw_text)}")
print(raw_text[:1000])

Extracted Text Length: 41510
Outline:
❑Search Engine Optimization, 
❑Working mechanism of Search Engines, 
❑On Page SEO, Off Page SEO, Page Ranks, 
❑Using Google Analytics, Social Media Analytics, 
❑Recommendation Systems: Collaborative, Content Based, 
❑Use of Recommendation Systems in E -commerceUNIT 7: Optimizing E -Commerce Systems
12/28/2024 1UNIT -7 Optimizing E -Comemrce SystemSearch Engine Optimization (SEO)
12/28/2024 UNIT -7 Optimizing E -Comemrce System2The process of maximizing the number of visitors to a particular 
website by ensuring that the site appears high on the list of results 
returned by a search engine .
"the key to getting more traffic lies in integrating content with search 
engine optimization and social media marketing"
SEO stands for Search Engine Optimization and helps search engines 
understand your website’s content and connect it with users by 
delivering relevant, valuable results based on their search queries.
The goal of SEO is to rank on the fir

# **5. Split text into manageable chunks**



*   Here, the text is split into smaller chunks for better processing.
*   The `CharacterTextSplitter` is used to split the `raw_text`:
    *   `separator="\n"`: Chunks are split at newline characters.
    *   `chunk_size=500`: Each chunk contains at most 500 characters.
    *   `chunk_overlap=50`: Consecutive chunks share up to 50 characters, which helps maintain context between chunks.
*   The resulting chunks are stored in `texts`, and the number of chunks and a sample chunk are printed.
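
To show how `chunk_size` and `chunk_overlap` interact, here is a simplified sliding-window sketch. It is not LangChain's algorithm: `CharacterTextSplitter` first splits on the separator and then merges whole pieces, whereas this hypothetical `sliding_chunks` helper cuts at fixed character offsets.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# Small sizes make the overlap easy to see.
demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
# demo == ["abcd", "defg", "ghij"]
```

Each window starts with the last character of the previous one, which is the role `chunk_overlap` plays at a larger scale.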



In [None]:
# Split the text into manageable chunks
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)

print(f"Number of text chunks: {len(texts)}")
print(f"Sample chunk:\n{texts[0]}")

Number of text chunks: 91
Sample chunk:
Outline:
❑Search Engine Optimization, 
❑Working mechanism of Search Engines, 
❑On Page SEO, Off Page SEO, Page Ranks, 
❑Using Google Analytics, Social Media Analytics, 
❑Recommendation Systems: Collaborative, Content Based, 
❑Use of Recommendation Systems in E -commerceUNIT 7: Optimizing E -Commerce Systems
12/28/2024 1UNIT -7 Optimizing E -Comemrce SystemSearch Engine Optimization (SEO)


# **6. Load a Hugging Face model for embeddings**

Here, we load a pre-trained model from Hugging Face for generating sentence embeddings.


*   `sentence-transformers/all-MiniLM-L6-v2` is a lightweight transformer model designed to produce dense vector embeddings for sentences.
*   `AutoTokenizer` and `AutoModel` are used to load the tokenizer and the model respectively.





In [None]:
# Load a Hugging Face model for embeddings
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)


# **7. Generate embeddings for text chunks**


*   The `get_embeddings` function takes `texts` (the chunks split from the PDF) and generates an embedding for each chunk using the pre-trained model.
*   The `tokenizer` converts each chunk into token IDs (as input for the model).
*   The model generates outputs, and we use `last_hidden_state` to get the per-token embeddings for the input text. These are averaged (mean pooling) across the tokens to produce one vector per chunk.
*   The `torch.no_grad()` context ensures that gradients are not calculated, saving memory and computation.
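
Mean pooling is sensitive to padding: if a sequence is padded, the padded positions should be left out of the average. The dependency-free sketch below illustrates the idea with plain lists. The `mean_pool` helper and the toy values are assumptions for illustration, and real padding positions would hold arbitrary model outputs rather than zeros.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention mask entry is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    return [value / count for value in total]

# Two real tokens followed by two padding positions.
token_vectors = [[1.0, 3.0], [3.0, 5.0], [0.0, 0.0], [0.0, 0.0]]
attention_mask = [1, 1, 0, 0]

masked = mean_pool(token_vectors, attention_mask)   # averages 2 real tokens
unmasked = mean_pool(token_vectors, [1, 1, 1, 1])   # averages all 4 positions
```

Here `masked` is `[2.0, 4.0]` while `unmasked` is `[1.0, 2.0]`, so including padding positions shifts the pooled vector.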



In [None]:
# Generate embeddings for text chunks
def get_embeddings(texts):
    embeddings = []
    for text in texts:
        # Encode one chunk at a time without padding, so the mean below
        # averages only real tokens rather than padding positions.
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        # Use mean pooling to create a single vector for each chunk
        embedding = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
        embeddings.append(embedding)
    return embeddings

embeddings = get_embeddings(texts)

# **8. Create a FAISS index**


*   `FAISS.from_embeddings()` creates a `FAISS` index from the embeddings and text chunks.
*   This allows us to perform fast similarity search on the text chunks based on their embeddings.
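
Conceptually, a similarity search ranks the stored vectors by their distance to the query vector and returns the closest texts. The brute-force sketch below shows that idea in plain Python. It assumes squared L2 distance and uses hypothetical helper names and toy 2-dimensional vectors; the index actually built may use a different metric and much faster data structures.

```python
def l2_squared(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def top_k(query_vector, pairs, k=2):
    """Return the k texts whose vectors are closest to query_vector."""
    ranked = sorted(pairs, key=lambda pair: l2_squared(query_vector, pair[1]))
    return [text for text, _ in ranked[:k]]

# (text, embedding) pairs, mirroring the structure passed to the index.
pairs = [
    ("on-page SEO", [1.0, 0.0]),
    ("page rank", [0.9, 0.1]),
    ("recommendation systems", [0.0, 1.0]),
]
nearest = top_k([1.0, 0.05], pairs, k=2)
```

With these toy vectors the two closest texts are "on-page SEO" and "page rank", in that order.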


In [None]:
# Create a FAISS index
text_embedding_pairs = list(zip(texts, embeddings))

embedding_function = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
)

# Pass the list of tuples and the embedding function to from_embeddings()
faiss_index = FAISS.from_embeddings(text_embedding_pairs, embedding_function)


# **9. Save the FAISS index for future use**
The `FAISS` index is saved to the Google Drive folder `/content/drive/MyDrive/faiss_index`, so it can be loaded and reused later without regenerating the embeddings.

In [None]:
# Save the FAISS index for future use
faiss_index.save_local('/content/drive/MyDrive/faiss_index')
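
For completeness, a later session could reload the saved index along these lines. This is a sketch rather than code from the notebook: `load_saved_index` is a hypothetical helper, it assumes the `langchain_community` FAISS wrapper, and it is only defined here, not run.

```python
def load_saved_index(folder, embedding_function):
    """Reload a FAISS index previously written with save_local (sketch only)."""
    # Imported lazily so that defining this helper does not require the package.
    from langchain_community.vectorstores import FAISS

    return FAISS.load_local(
        folder,
        embedding_function,
        # The stored metadata is pickled, so enable this only for files you created.
        allow_dangerous_deserialization=True,
    )
```

The same embedding function that was used to build the index should be passed in, so that queries are embedded consistently.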

# **10. Load a pre-trained model for answering questions**
This loads `google/flan-t5-large` from Hugging Face, an instruction-tuned sequence-to-sequence (T5) model. It is not a dedicated question-answering model, but it can generate answers when the prompt supplies a question together with its context.

In [None]:
# Load a pre-trained model for answering questions
qa_model_name = "google/flan-t5-large"
qa_tokenizer = AutoTokenizer.from_pretrained(qa_model_name)
qa_model = AutoModelForSeq2SeqLM.from_pretrained(qa_model_name)


# **11. Functions for Contextual Question Answering and Answer Evaluation**



In [None]:
# Function for answering a user query using the PDF content
def generate_answer_from_context(query):
    # Use FAISS to retrieve relevant chunks based on the query
    relevant_chunks = faiss_index.similarity_search(query, k=7)

    # Extract page content from relevant chunks
    context = " ".join([chunk.page_content for chunk in relevant_chunks])

    # Prepare the input for the QA model
    input_text = f"Question: {query}\nContext: {context}\nPlease provide a detailed answer covering all relevant aspects of the question."
    inputs = qa_tokenizer(input_text, return_tensors="pt", truncation=True, max_length=1024)

    # Generate the answer using the model
    with torch.no_grad():
        outputs = qa_model.generate(**inputs, max_length=600, num_beams=5, early_stopping=True)

    chatbot_answer = qa_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return chatbot_answer.strip()

*   The function `generate_answer_from_context(query)` takes a user query as input.
*   It uses `faiss_index` to search for the top 7 relevant chunks based on the query.
*   The relevant chunks are combined into a single string, forming the context.
*   It formats the input as `Question: {query}\nContext: {context}\n` followed by an instruction to give a detailed answer, and tokenizes it with `qa_tokenizer`, truncating the input to at most 1024 tokens.
*   The tokenized input is passed to the QA model (`qa_model`) to generate an answer, using beam search with 5 beams.
*   The model output is decoded back into text, skipping special tokens and stripping surrounding whitespace.
*   The cleaned answer is returned to the user.
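
Because the instruction comes after the context, truncating a long input can cut the instruction off. One possible alternative, sketched below under stated assumptions, puts the instruction first and caps the context length. The `build_prompt` helper, its instruction wording, and the `max_context_chars` budget are all hypothetical and not part of the notebook.

```python
def build_prompt(query, chunks, max_context_chars=2000):
    """Build a prompt with the instruction first and a length-capped context."""
    instruction = "Answer the question using only the context."
    # Cap the context by characters (an assumed, rough proxy for token count).
    context = " ".join(chunks)[:max_context_chars]
    return f"{instruction}\nQuestion: {query}\nContext: {context}"

prompt = build_prompt(
    "What is SEO?",
    ["SEO stands for Search Engine Optimization.", "It helps search engines."],
    max_context_chars=20,
)
```

With a budget of 20 characters the context is cut to "SEO stands for Searc", while the instruction and question stay intact at the start of the prompt.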




In [None]:
# Function to generate a relevant question based on the content of the PDF
def generate_question_from_context():
    # Randomly select a chunk from the PDF for generating a more diverse question
    random_chunk = random.choice(texts)
    prompt = f"Generate a relevant question based on the following content:\n{random_chunk[:500]}"

    # Tokenize the input text
    inputs = qa_tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024, padding=True)

    # Generate a question using the model
    with torch.no_grad():
        outputs = qa_model.generate(**inputs, max_length=50, num_beams=5, early_stopping=True)

    # Decode and return the generated question
    generated_question = qa_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_question.strip()

*   The function `generate_question_from_context()` generates a relevant question based on the content from a PDF.
*   It randomly selects a chunk of text from `texts` to introduce diversity in the generated question.
*   The prompt is formatted to instruct the model to generate a question based on the selected text chunk.
*   The input is tokenized using `qa_tokenizer`, with truncation, padding, and a max length of 1024 tokens.
*   The tokenized input is passed to the QA model (`qa_model`) to generate a question, using beam search for better results.
*   The output from the model is decoded into text and cleaned by removing special tokens or extra spaces.
*   The generated question is returned to the user.
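
Random chunk selection makes each run different, which can make results hard to reproduce. A minimal sketch of seeded selection follows; `pick_chunk` and its `seed` parameter are hypothetical additions, not part of the notebook.

```python
import random

def pick_chunk(chunks, seed=None):
    """Pick one chunk; a fixed seed makes the choice repeatable."""
    rng = random.Random(seed)  # local generator, leaves the global state untouched
    return rng.choice(chunks)

chunks = ["chunk A", "chunk B", "chunk C"]
first = pick_chunk(chunks, seed=42)
second = pick_chunk(chunks, seed=42)  # same seed, same chunk as `first`
```

Leaving `seed` as `None` keeps the original behaviour of a different chunk on each call.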




In [None]:
# Function to rate the answer provided by the user
def rate_answer(user_answer, correct_answer):
    # Tokenize both answers without padding, so that mean pooling
    # averages only real tokens rather than padding positions
    user_inputs = tokenizer(user_answer, return_tensors="pt", truncation=True, max_length=512)
    correct_inputs = tokenizer(correct_answer, return_tensors="pt", truncation=True, max_length=512)

    # Get embeddings for both answers
    with torch.no_grad():
        user_embedding = model(**user_inputs).last_hidden_state.mean(dim=1).squeeze().numpy()
        correct_embedding = model(**correct_inputs).last_hidden_state.mean(dim=1).squeeze().numpy()

    # Calculate cosine similarity
    similarity = cosine_similarity([user_embedding], [correct_embedding])[0][0]
    return similarity

*   The function `rate_answer(user_answer, correct_answer)` compares the user's answer with the correct answer.
*   Both the user's answer and the correct answer are tokenized using the `tokenizer`, with truncation to a max length of 512 tokens.
*   The embeddings (numerical representations) for both answers are generated by passing the tokenized inputs through the model.
*   The embeddings are averaged across the sequence dimension and converted to NumPy arrays.
*   The cosine similarity between the user's answer embedding and the correct answer embedding is computed using `cosine_similarity`.
*   The calculated similarity score is returned, indicating how close the user's answer is to the correct answer.


# **12. Chatbot Interaction**

*   The loop continuously asks for user input.
*   If the user types "exit", "stop", or "quit", the loop breaks and the chatbot exits with a goodbye message.
*   If the prompt contains both "generate" and "question", it generates a question based on the PDF content and displays it.
*   If the prompt contains "rate my answer", the system generates a question, asks the user for their answer, and compares it with a reference answer that the model generates from the PDF content.
*   The cosine similarity between the user's answer and the reference answer is calculated and displayed as a score (at most 1; higher means more similar).
*   Based on the similarity score, feedback is provided: excellent (>=0.8), good (>=0.6), or needs improvement (<0.6).
*   For any other prompt, the system generates and displays an answer based on the PDF content.
*   Input validation could be added for empty answers or non-specific prompts to improve usability.
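
The routing and feedback rules above can be factored into two small pure functions, which also makes room for the input validation mentioned in the last bullet. This is a sketch under assumptions: the names `route` and `feedback` and the `"empty"` case are hypothetical additions, while the keywords and thresholds follow the rules listed above.

```python
EXIT_WORDS = {"exit", "stop", "quit"}

def route(prompt):
    """Decide which action a prompt should trigger."""
    text = prompt.strip().lower()
    if not text:
        return "empty"  # assumed extra case: nothing was typed
    if text in EXIT_WORDS:
        return "exit"
    if "generate" in text and "question" in text:
        return "generate_question"
    if "rate my answer" in text:
        return "rate_answer"
    return "answer"

def feedback(score):
    """Map a similarity score to the feedback tiers described above."""
    if score >= 0.8:
        return "excellent"
    if score >= 0.6:
        return "good"
    return "needs improvement"
```

For example, `route("generate 1 question")` returns `"generate_question"` and `feedback(0.65)` returns `"good"`. Stripping whitespace before matching also means that input such as `" quit "` still exits.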


In [None]:
# Main interaction loop
while True:
    user_prompt = input("\nEnter your prompt : ")

    # Exit the loop
    if user_prompt.lower() in ['exit', 'stop', 'quit']:
        print("Exiting the chatbot. Goodbye!")
        break

    # Check if the user prompt asks to generate a question from the PDF
    if "generate" in user_prompt.lower() and "question" in user_prompt.lower():
        # Generate and display a question based on the PDF content
        generated_question = generate_question_from_context()
        print(f"Chatbot Response: Generated Question: {generated_question}")

    # If the user wants to rate their answer
    elif "rate my answer" in user_prompt.lower():
        # Generate and display a question
        question = generate_question_from_context()
        print(f"Chatbot Response: Generated Question: {question}")

        # Get the user's answer
        user_answer = input("Enter your answer: ")

        # Generate the correct answer from the PDF
        correct_answer = generate_answer_from_context(question)

        # Rate the user's answer
        similarity_score = rate_answer(user_answer, correct_answer)
        print(f"Similarity Score: {similarity_score:.2f}")
        if similarity_score >= 0.8:
            print("Your answer is excellent!")
        elif similarity_score >= 0.6:
            print("Your answer is good, but could be more detailed.")
        else:
            print("Your answer needs improvement.")

    else:
        # Answer the user's question based on the PDF content
        response = generate_answer_from_context(user_prompt)
        print(f"Chatbot Response: {response}")


Enter your prompt : generate 1 question
Chatbot Response: Generated Question: What are the benefits of using Google Search Console?

Enter your prompt : What are the benefits of using Google Search Console?
Chatbot Response: provides rankings and traffic reports for top keywords and pages, and can help identify and fix on -site technical issues.

Enter your prompt : generate a question
Chatbot Response: Generated Question: What are the main benefits of customer behaviour analysis system?

Enter your prompt : What are the main benefits of customer behaviour analysis system?
Chatbot Response: This filtering method uses item features to recommend other items similar to what the user likes and also based on their previous actions or explicit feedback.

Enter your prompt : rate my answer
Chatbot Response: Generated Question: What is a good metric to measure?
Enter your answer: A good metric for optimizing an e-commerce system is Conversion Rate, which measures the percentage of visitors wh

# **Conclusion:**

This notebook uses Google Colab and libraries such as `PyPDF2`, `faiss-cpu`, `transformers`, and `langchain` to extract and process content from a PDF, store it in a FAISS index, and interact with users through a chatbot. The chatbot can generate questions, answer user queries based on the PDF's content, and rate answers using cosine similarity. By embedding text and using pre-trained models, the system supports question answering and answer evaluation over a PDF document, which makes it useful for document analysis and interactive learning.