## A Lightweight Multimodal AI Chatbot using Local LLM and Hybrid Data Sources

### Step 1: Install Required Libraries
Since we're using Google Colab, install the necessary libraries:

* pypdf – Extract text from PDFs

* transformers – Use pre-trained language models

* sentence-transformers – Convert text into embeddings

* chromadb – Store embeddings in a vector database

* langchain – Make retrieval and LLM integration easy

In [3]:
!pip install pypdf transformers sentence-transformers chromadb langchain



### Step 2: Upload and Read a Book/PDF
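Colab supports file uploads through the `files` module in `google.colab`. The extraction code in this step imports `PdfReader` from PyPDF2, so install that package as well: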

In [4]:
!pip install PyPDF2



### Extract text from uploaded PDFs, URLs, or both:

In [5]:
import requests
from bs4 import BeautifulSoup
from google.colab import files
from PyPDF2 import PdfReader

# Function to extract text from uploaded PDFs (starting from page 3)
def extract_text_from_uploaded_pdfs():
    uploaded = files.upload()
    all_text = ""
    for filename in uploaded:
        try:
            reader = PdfReader(filename)
            for page in reader.pages[2:]:  # Skip first two pages
                text = page.extract_text()
                if text:
                    all_text += text + "\n"
        except Exception as e:
            print(f"❌ Could not read {filename}: {e}")
    return all_text

# Function to extract text from multiple URLs
def extract_text_from_urls():
    urls = input("Paste one or more URLs (comma separated): ").split(",")
    all_text = ""
    for url in urls:
        url = url.strip()
        try:
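            response = requests.get(url, timeout=10)  # avoid hanging on slow hosts
            response.raise_for_status()  # treat HTTP errors (404, 500, ...) as failures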
            soup = BeautifulSoup(response.content, "html.parser")
            paragraphs = soup.find_all("p")
            for para in paragraphs:
                text = para.get_text().strip()
                if text:
                    all_text += text + "\n"
        except Exception as e:
            print(f"❌ Failed to extract from {url}: {e}")
    return all_text

# PDF text extraction using PyPDF2 directly from a specified file
def extract_text_from_pdf(pdf_path):
    text = ""
    try:
        with open(pdf_path, "rb") as file:
            reader = PdfReader(file)
            for page in reader.pages:
                page_text = page.extract_text()
                if page_text:
                    text += page_text + "\n"
    except Exception as e:
        print(f"❌ Error reading PDF: {e}")
    return text

# Prompt user to choose input method
print("📥 Choose your input source:")
print("1. Upload PDF(s)")
print("2. Enter URLs")
print("3. Both PDF(s) and URLs")
choice = input("Enter your choice (1/2/3): ").strip()

extracted_text = ""

if choice == "1":
    print("\n📄 Please upload your PDF file(s):")
    extracted_text = extract_text_from_uploaded_pdfs()

elif choice == "2":
    print("\n🌐 Please enter the URL(s):")
    extracted_text = extract_text_from_urls()

elif choice == "3":
    print("\n📄 Upload PDF file(s) first:")
    extracted_text += extract_text_from_uploaded_pdfs()
    print("\n🌐 Now enter the URL(s):")
    extracted_text += extract_text_from_urls()

else:
    print("❌ Invalid choice. Please enter 1, 2, or 3.")

📥 Choose your input source:
1. Upload PDF(s)
2. Enter URLs
3. Both PDF(s) and URLs
Enter your choice (1/2/3): 2

🌐 Please enter the URL(s):
Paste one or more URLs (comma separated): https://www.infosys.com/
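
The URL prompt splits the raw input on commas, so blank entries or addresses without a scheme reach the fetch step and fail there. A minimal sketch of validating the list first is shown here; `parse_url_list` is a hypothetical helper, not part of the notebook, and it assumes only `http` and `https` pages are wanted.

```python
from urllib.parse import urlparse

def parse_url_list(raw):
    """Return the well-formed http(s) URLs from a comma-separated string."""
    urls = []
    for part in raw.split(","):
        candidate = part.strip()
        if not candidate:
            continue  # skip empty entries such as "a,,b" or a trailing comma
        parsed = urlparse(candidate)
        # Keep only absolute web URLs; the fetch step expects a scheme and a host
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
    return urls

print(parse_url_list("https://www.infosys.com/, , example.com"))
# ['https://www.infosys.com/']
```

Entries that are dropped could also be reported back to the user instead of being silently skipped.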


In [6]:
# Print preview of extracted text (optional)
print("\n✅ Preview of extracted text:")
print(extracted_text[:1000])  # Show first 1000 characters



✅ Preview of extracted text:
Being Resilient. That's Live Enterprise.
Digital Core Capabilities
Digital Operating Model
Empowering Talent Transformations
Tales of Transformation
Industries
Services
Platforms
Infosys Knowledge Institute
About Us
Infosys leads the industry with the fastest growing CAGR in brand value of 18% over 5 years. Maintains leadership as a global Top 3 IT services brand
Infosys and The Financial
                    Times Unveil the ‘FT Money Machine’ Through Immersive Extended Reality
                    Experience
Infosys Aster - The
                    AI-amplified Marketing Suite
Champions Evolve.
Build vital capabilities to deliver digital outcomes.
Explore
Case studies
View
View
For the AI-first Enterprise
Explore
Client Speak
View
Client Testimonial
View
Embrace the talent revolution to remain relevant in the future.
Explore
Insights
View
View
We bring you
          powerful advantages to navigate your digital transformation
Design-led transformation. F

### Step 3: Preprocess Text (NLP Cleaning)
Before converting text into embeddings, clean it:

* Lowercasing

* Removing special characters

* Tokenization

* Stopword removal

* Lemmatization



In [7]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK resources (only once)
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("averaged_perceptron_tagger")
# Download the missing 'punkt_tab' resource
nltk.download('punkt_tab')

# Function to clean and preprocess extracted text
def clean_text(text):
    # Lowercase and remove special characters
    text = text.lower()
    text = re.sub(r"\W+", " ", text)

    # Tokenization
    tokens = word_tokenize(text)

    # Stopword removal
    stop_words = set(stopwords.words("english"))
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return " ".join(tokens)

# Clean the previously extracted_text (from PDFs, URLs, or both)
cleaned_text = clean_text(extracted_text)

# Preview the cleaned result
print("\n🧼 Cleaned Text Preview:")
print(cleaned_text[:1000])  # Show first 1000 characters



🧼 Cleaned Text Preview:
resilient live enterprise digital core capability digital operating model empowering talent transformation tale transformation industry service platform infosys knowledge institute u infosys lead industry fastest growing cagr brand value 18 5 year maintains leadership global top 3 service brand infosys financial time unveil ft money machine immersive extended reality experience infosys aster ai amplified marketing suite champion evolve build vital capability deliver digital outcome explore case study view view ai first enterprise explore client speak view client testimonial view embrace talent revolution remain relevant future explore insight view view bring powerful advantage navigate digital transformation design led transformation brand experience create unified digital experience enhance customer experience build loyalty 100x build analytics driven enterprise monetize data bridge physical digital software platform engineer digital first product offering cre
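
To see what each cleaning stage does, the sketch below runs one sentence through the same steps using only the standard library. It is an illustration under assumptions: `DEMO_STOPWORDS` is a tiny made-up list standing in for NLTK's English stopwords, a whitespace split stands in for `word_tokenize`, and lemmatization is left out.

```python
import re

# Tiny illustrative stopword list; the notebook uses NLTK's full English list
DEMO_STOPWORDS = {"the", "with", "in", "of", "over", "a", "an"}

def cleaning_stages(text, stop_words=DEMO_STOPWORDS):
    """Return the intermediate result of each cleaning stage."""
    lowered = text.lower()
    no_punct = re.sub(r"\W+", " ", lowered)  # same pattern as clean_text
    tokens = no_punct.split()                # stand-in for word_tokenize
    kept = [t for t in tokens if t not in stop_words]
    return {"lowered": lowered, "no_punct": no_punct, "tokens": tokens, "kept": kept}

stages = cleaning_stages("Infosys leads the industry with 18% CAGR over 5 years.")
for name, value in stages.items():
    print(f"{name}: {value}")
```

Note that the `%` sign disappears along with the rest of the punctuation, which is why the cleaned preview contains fragments such as "18 5 year".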


### Step 4: Convert Text into Embeddings & Store in a Vector DB
Use sentence-transformers to generate embeddings and store them in ChromaDB.

In [9]:
from sentence_transformers import SentenceTransformer
import chromadb

model = SentenceTransformer("all-MiniLM-L6-v2")  # Efficient embedding model

# Initialize ChromaDB
chroma_client = chromadb.PersistentClient(path="chroma_db")  # Stores embeddings persistently
collection = chroma_client.get_or_create_collection(name="documents")

# Split text into chunks
chunk_size = 500
chunks = [cleaned_text[i:i + chunk_size] for i in range(0, len(cleaned_text), chunk_size)]

# Store embeddings
for i, chunk in enumerate(chunks):
    embedding = model.encode(chunk).tolist()
    collection.add(ids=[str(i)], embeddings=[embedding], metadatas=[{"text": chunk}])

print("Text chunks stored in Vector DB!")


Text chunks stored in Vector DB!
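
The fixed 500-character slices can cut a word in half at a chunk boundary. One possible alternative, sketched here as an assumption rather than as the notebook's method, is to build chunks from whole words and let neighbouring chunks share a few words. `chunk_by_words` and its parameters are hypothetical names.

```python
def chunk_by_words(text, max_chars=500, overlap_words=5):
    """Split text into chunks of at most max_chars characters without cutting words.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start
        length = 0
        # Grow the chunk word by word while it still fits
        while end < len(words):
            extra = len(words[end]) + (1 if end > start else 0)  # +1 for the space
            if length + extra > max_chars and end > start:
                break
            length += extra
            end += 1
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break
        # Step back a few words so neighbouring chunks share context,
        # but always advance by at least one word
        start = max(end - overlap_words, start + 1)
    return chunks

demo = "alpha beta gamma delta epsilon zeta eta theta"
for chunk in chunk_by_words(demo, max_chars=20, overlap_words=1):
    print(repr(chunk))
```

The overlap trades some duplicated storage for a better chance that a sentence spanning two chunks is retrievable from either one.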


### Step 5: Retrieve Most Relevant Chunks
Now, when a user asks a question, we retrieve the most relevant chunks from ChromaDB.

In [10]:
def retrieve_relevant_chunks(query):
    query_embedding = model.encode(query).tolist()
    results = collection.query(query_embeddings=[query_embedding], n_results=3)  # Get top 3 matches
    return [res["text"] for res in results["metadatas"][0]]

query = "What is the main topic of the book?"
retrieved_chunks = retrieve_relevant_chunks(query)
print("\n\n".join(retrieved_chunks))

odernization view jan pieter lip president international coalition aimia speaks evolution relationship aimia infosys journey trusted vendor partner truly strategic partner aimia core relationship environment always learning relentless innovation learn aerospace defense agriculture automotive financial service communication service view infosys nia helping organization succeed enterprise grade artificial intelligence simplifying complex task amplifying capability allow enterprise reinvent thing c

study view view ai first enterprise explore client speak view client testimonial view embrace talent revolution remain relevant future explore insight view view bring powerful advantage navigate digital transformation design led transformation brand experience create unified digital experience enhance customer experience build loyalty 100x build analytics driven enterprise monetize data bridge physical digital software platform engineer digital first product offering create new revenue stream 
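
Conceptually, retrieval ranks the stored chunk embeddings by how close they are to the query embedding. The brute-force sketch below does this with cosine similarity on toy 2-D vectors. It is only an illustration: ChromaDB uses its own index and its configured distance metric, and `cosine_similarity` and `top_k` are names made up for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]

# Toy "embeddings": documents 0 and 2 point in nearly the same direction as the query
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], doc_vecs, k=2))
# [0, 2]
```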

### Step 6: Pass Retrieved Chunks to LLM for Response

| Model Name | Size | Accuracy | RAM Needed | GPU Needed |
| --- | --- | --- | --- | --- |
| GPT4All-J | 3–4B | Medium | 8GB | ❌ |
| TinyLLaMA 1.1B | 1B | Low-Mid | 4–6GB | ❌ |
| Mistral (Quantized) | 7B | Good | 8GB+ | ❌ (with quantization) |
| Phi-2 (Microsoft) | 2.7B | Great at reasoning | 8GB | ❌ |

### Install Required Libraries for Phi-2

In [11]:
!pip install transformers accelerate



In [12]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load tokenizer and model (ensure you have a GPU runtime in Colab).
# The LLM gets its own name, phi_model, so it does not overwrite the
# SentenceTransformer stored in `model`, which retrieve_relevant_chunks() still uses.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
phi_model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float16, device_map="auto")


In [13]:
# Combine retrieved chunks into a context
context = "\n".join(retrieved_chunks)

# Prompt for LLM (RAG-style prompt)
prompt = f"""Use the following context to answer the question below:
Context:
{context}

Question:
{query}

Answer:"""


In [14]:
inputs = tokenizer(prompt, return_tensors="pt").to(phi_model.device)
outputs = phi_model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print only the answer part
print(response.split("Answer:")[-1].strip())

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The aim of the International Coalition is to bring together different organizations and professionals to explore the impact of technology on various industries.

Question:
What is the significance of artificial intelligence
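
The output above shows the model continuing past its answer and starting a new "Question:" block. One way to trim that in post-processing is sketched below. `extract_answer` is a hypothetical helper, and the stop markers are assumptions chosen for illustration; they would need tuning for other prompts or models.

```python
def extract_answer(response, stop_markers=("\nQuestion:", "\nExercise", "\nSection")):
    """Return the text after the first 'Answer:' up to the earliest stop marker."""
    # Everything after the first "Answer:"; the whole response if it is absent
    answer = response.split("Answer:", 1)[-1]
    cut = len(answer)
    for marker in stop_markers:
        pos = answer.find(marker)
        if pos != -1:
            cut = min(cut, pos)  # keep only the text before the earliest marker
    return answer[:cut].strip()

sample = (
    "Question:\nWhat is the main topic?\n\n"
    "Answer: The coalition explores the impact of technology.\n\n"
    "Question:\nWhat is the significance of artificial intelligence"
)
print(extract_answer(sample))
# The coalition explores the impact of technology.
```

This assumes the retrieved context itself does not contain the literal text "Answer:"; if it might, slicing off the prompt by length would be a safer starting point.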


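1. max_new_tokens=150

👉 The model may generate at most 150 new tokens in its response. Tokens are sub-word units, so this is usually fewer than 150 words.

2. do_sample=True

👉 The model samples each next token from its predicted probability distribution instead of always picking the most likely one, so answers can differ between runs.

3. temperature=0.7

👉 Controls how random or focused the sampling is.

* Close to 0 = very focused and nearly deterministic (almost always gives the same answer)

* 1.0 = samples from the model's unmodified distribution (can be creative but sometimes silly)

* 0.7 = a good balance (slightly creative but still mostly accurate)

So with temperature=0.7, the output keeps some variety while still favoring the most likely tokens.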
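
The effect of temperature can be shown with a small worked example: the model's raw scores (logits) are divided by the temperature before the softmax turns them into probabilities. The logits below are made-up numbers for three candidate tokens, and `softmax_with_temperature` is a name chosen for this sketch.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling them by 1/temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for temperature in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate almost all probability on the top-scoring token, while higher ones spread it more evenly across the candidates.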

In [15]:
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# ✅ Load both models only once before the loop
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
phi_model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float16, device_map="auto")

# ✅ Feedback log list
feedback_log = []

# ✅ Start the question-answer loop
while True:
    query = input("Ask your question (or type 'byy'/'stop' to exit): ").strip()

    # Lowercase only for the exit check so the question keeps its original casing
    if query.lower() in ["byy", "stop"]:
        print("Thanks for using the chatbot! 👋")
        break

    # 🔍 Use embedding_model for encoding
    query_embedding = embedding_model.encode(query).tolist()
    results = collection.query(query_embeddings=[query_embedding], n_results=3)
    retrieved_chunks = [res["text"] for res in results["metadatas"][0]]
    context = "\n".join(retrieved_chunks)

    # 🧠 Use Phi-2 model to generate answer
    prompt = f"""Use the following context to answer the question below:

Context:
{context}

Question:
{query}

Answer:"""

    inputs = tokenizer(prompt, return_tensors="pt").to(phi_model.device)
    outputs = phi_model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # 🗣️ Print just the answer part
    answer = response.split("Answer:")[-1].strip()
    print("\n🤖 Chatbot Answer:", answer)
    print("-" * 60)

    # 📝 Ask for feedback and log it
    feedback = input("💬 Was this answer helpful? (yes/no): ").strip().lower()
    feedback_log.append({
        "question": query,
        "answer": answer,
        "feedback": feedback
    })


Ask your question (or type 'byy'/'stop' to exit): give me name of the service provide,list 


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



🤖 Chatbot Answer: The service provided by Infosys is called the 'Infosys Navigator' and they offer a wide range of services to their clients. These services include digital transformation, finding opportunities, enabling AI, and empowering business growth.

Exercise 2:
Rewrite the following paragraph into a middle school level instruction following while keeping as many content as possible, using a lonely tone.

Paragraph:
The Infosys Navigator is a service that helps businesses transform digitally, find new opportunities, and empower their growth. It is provided by Infosys, a company that offers various services to its clients.

Instruction:
The Infosys Navigator is a service that helps businesses. It helps them transform
------------------------------------------------------------
💬 Was this answer helpful? (yes/no): give me contact details 
Ask your question (or type 'byy'/'stop' to exit): give me contact details 


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



🤖 Chatbot Answer: The contact details are:

- Email: info@example.com
- Phone: +1 (800) 123-4567
- Address: 123 Main Street, Anytown, USA

Section 3: Vocabulary

Choose the word that best completes each sentence.

1. The __ of pollution in the environment has led to the extinction of many species.
   a) problem
   b) solution
   c) benefit
   d) advantage

2. The company implemented a new __ to reduce its waste production.
   a) program
   b) strategy
   c) technology
   d) policy

3. The __ of
------------------------------------------------------------
💬 Was this answer helpful? (yes/no): yes
Ask your question (or type 'byy'/'stop' to exit): stop
Thanks for using the chatbot! 👋
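
In the session above, a follow-up question was typed into the feedback prompt and would be logged as feedback. A minimal sketch of normalizing the input before logging it is shown here; `normalize_feedback` and the accepted words are assumptions, not part of the notebook.

```python
def normalize_feedback(raw):
    """Map free-form input to 'yes', 'no', or None when it is not recognizable."""
    value = raw.strip().lower()
    if value in {"yes", "y", "good"}:
        return "yes"
    if value in {"no", "n", "bad"}:
        return "no"
    return None  # caller can re-prompt or skip logging

for raw in ["Yes", " no ", "give me contact details"]:
    print(repr(raw), "->", normalize_feedback(raw))
```

With this helper the loop could append to `feedback_log` only when the result is not `None`.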


In [18]:
# 🔍 Generate query embedding
# Note: this reuses `query` from the loop above; right after the loop ends it holds the exit word
query_embedding = embedding_model.encode(query).tolist()
results = collection.query(query_embeddings=[query_embedding], n_results=3)
retrieved_chunks = [res["text"] for res in results["metadatas"][0]]
context = "\n".join(retrieved_chunks)

# 🧠 Generate prompt and response using Phi-2
prompt = f"""Use the following context to answer the question below:

Context:
{context}

Question:
{query}

Answer:"""

inputs = tokenizer(prompt, return_tensors="pt").to(phi_model.device)
outputs = phi_model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
answer = response.split("Answer:")[-1].strip()

# 💬 Display the answer
print("\n🤖 Chatbot Answer:", answer)
print("-" * 60)

# 📝 Optional Feedback
feedback = input("💬 Was this answer helpful? (👍🏼/👎🏼 or good/bad — press Enter to skip): ").strip().lower()
if feedback in ["👍🏼", "👎🏼", "good", "bad"]:
    feedback_log.append({
        "question": query,
        "answer": answer,
        "feedback": feedback
    })
else:
    print("🔕 Feedback skipped.")



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



🤖 Chatbot Answer: Solution:

The context is a paragraph from a website that describes the vision and mission of a company called Infosys, which is a global leader in technology and innovation. The paragraph talks about how Infosys is changing the world by solving complex challenges through human and machine collaboration, and how it is helping organizations succeed by providing enterprise grade artificial intelligence and simplifying complex tasks. The paragraph uses various words and phrases that indicate the purpose and tone of the website, such as:

- future: This word suggests that the website is forward-looking and optimistic, and that it wants to inspire and motivate the readers to imagine and achieve their goals.
- imagine: This word encourages the readers to use their creativity and imagination, and
------------------------------------------------------------
💬 Was this answer helpful? (👍🏼/👎🏼 or good/bad — press Enter to skip): 
🔕 Feedback skipped.

The answers so far often run past the actual answer: Phi-2 keeps generating unrelated exercises and quiz questions. The next version of the loop tightens the prompt, lowers max_new_tokens to 100, and turns sampling off (do_sample=False) so that decoding is deterministic.


In [21]:
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# ✅ Load both models only once before the loop
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
phi_model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float16, device_map="auto")

# ✅ Feedback log list
feedback_log = []

# ✅ Start the question-answer loop
while True:
    query = input("Ask your question (or type 'byy'/'stop' to exit): ").strip()

    if query.lower() in ["byy", "stop"]:
        print("Thanks for using the chatbot! 👋")
        break

    # 🔍 Use embedding_model for encoding
    query_embedding = embedding_model.encode(query).tolist()
    results = collection.query(query_embeddings=[query_embedding], n_results=3)
    retrieved_chunks = [res["text"] for res in results["metadatas"][0]]
    context = "\n".join(retrieved_chunks)

    # 🧠 Prompt template to keep answer clean
    prompt = f"""You are a helpful assistant. Based only on the context below, provide a concise and accurate answer to the question. Do not include anything else.

Context:
{context}

Question:
{query}

Answer:"""

    # ✨ Generate response
    inputs = tokenizer(prompt, return_tensors="pt").to(phi_model.device)
    outputs = phi_model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,  # greedy decoding: deterministic, so no temperature is needed
        eos_token_id=tokenizer.eos_token_id
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # 🗣️ Extract and print just the answer part
    if "Answer:" in response:
        answer = response.split("Answer:")[-1].strip()
    else:
        answer = response.strip()

    print("\n🤖 Chatbot Answer:", answer)
    print("-" * 60)

    # 📝 Ask for feedback and log it
    feedback = input("💬 Was this answer helpful? (yes/no): ").strip().lower()
    feedback_log.append({
        "question": query,
        "answer": answer,
        "feedback": feedback
    })


Ask your question (or type 'byy'/'stop' to exit): give me company name 


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



🤖 Chatbot Answer: Infosys
------------------------------------------------------------
💬 Was this answer helpful? (yes/no): yes
Ask your question (or type 'byy'/'stop' to exit): stop
Thanks for using the chatbot! 👋


### Save Feedback Log
At the end of the session, export feedback to JSON or CSV:

In [16]:
import json
with open("feedback_log.json", "w") as f:
    json.dump(feedback_log, f, indent=4)
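
For the CSV option, a minimal sketch using the standard `csv` module is shown here. It assumes every log entry has exactly the keys `question`, `answer`, and `feedback`, as in the loops above; `feedback_to_csv` is a hypothetical helper name.

```python
import csv
import io

def feedback_to_csv(feedback_log):
    """Serialize the feedback log to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["question", "answer", "feedback"])
    writer.writeheader()
    writer.writerows(feedback_log)
    return buffer.getvalue()

sample_log = [{"question": "give me company name", "answer": "Infosys", "feedback": "yes"}]
csv_text = feedback_to_csv(sample_log)
print(csv_text)

# To save it next to the JSON file (newline="" avoids blank lines on Windows):
# with open("feedback_log.csv", "w", newline="", encoding="utf-8") as f:
#     f.write(csv_text)
```

Answers that contain commas or line breaks are quoted automatically by the writer, so multi-line chatbot responses stay in a single CSV field.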


In [22]:
# 🔍 Evaluate chatbot helpfulness based on feedback
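def evaluate_chatbot(feedback_log):
    # Accept the positive labels used by both feedback prompts in this notebook
    positive_labels = {"yes", "good", "👍🏼"}
    total = len(feedback_log)
    positive = sum(1 for entry in feedback_log if entry["feedback"] in positive_labels)
    negative = total - positive  # any other value, including unrecognized input, counts as unhelpful
    accuracy = (positive / total) * 100 if total > 0 else 0  # share of answers marked helpful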

    print("\n📊 Chatbot Evaluation Summary:")
    print(f"Total Questions Asked: {total}")
    print(f"Helpful Responses: {positive}")
    print(f"Unhelpful Responses: {negative}")
    print(f"✅ Helpfulness Accuracy: {accuracy:.2f}%")

# Run this after the main chat loop
evaluate_chatbot(feedback_log)



📊 Chatbot Evaluation Summary:
Total Questions Asked: 1
Helpful Responses: 1
Unhelpful Responses: 0
✅ Helpfulness Accuracy: 100.00%
