<a href="https://colab.research.google.com/github/Indusree21/Internal_Docs_QA_Agent/blob/main/Internal_Docs_QA_Agent_Hack.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install essential libraries:
# - langchain for building LLM apps,
# - faiss-cpu for fast vector similarity search,
# - sentence-transformers for generating text embeddings
!pip install -q langchain faiss-cpu sentence-transformers

In [None]:
# Create sample internal documents
import os
import textwrap

# Create a folder to store documents
os.makedirs("data/docs", exist_ok=True)

# Create sample refund policy document
# (dedent removes the code indentation so it does not end up in the file)
with open("data/docs/refund_policy.txt", "w", encoding="utf-8") as f:
    f.write(textwrap.dedent("""
    Refund Policy:
    Customers can request a refund within 30 days of purchase.
    To initiate a refund, contact support@company.com with your order ID.
    Refunds take 5–7 business days to process.
    """).strip() + "\n")

# Create sample leave policy document
with open("data/docs/leave_policy.txt", "w", encoding="utf-8") as f:
    f.write(textwrap.dedent("""
    Leave Policy:
    Employees are entitled to 20 paid leaves per year.
    Leave requests must be submitted at least 3 days in advance.
    Use the HR portal to apply for leave.
    """).strip() + "\n")

# Create sample design request guide
with open("data/docs/design_request.txt", "w", encoding="utf-8") as f:
    f.write(textwrap.dedent("""
    Design Asset Request:
    To request a design asset, fill out the form on the intranet.
    Requests are handled by the Creative team within 48 hours.
    Mention dimensions, format, and usage clearly.
    """).strip() + "\n")

print("✅ Sample documents created.")


✅ Sample documents created.


In [None]:
# Install the latest version of langchain-community, which includes document loaders and integrations
!pip install -U langchain-community


In [None]:
# Load and split documents (prepare them for retrieval)
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
import os

# Load every .txt file in the folder (sorted so the order is reproducible)
docs = []
for filename in sorted(os.listdir("data/docs")):
    if not filename.endswith(".txt"):
        continue  # skip stray entries such as .ipynb_checkpoints
    loader = TextLoader(os.path.join("data/docs", filename), encoding="utf-8")
    docs.extend(loader.load())

# Split long texts into chunks of roughly 300 characters with a 30-character overlap
# (smaller chunks make retrieval more precise)
text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=30)
documents = text_splitter.split_documents(docs)

print(f"✅ Loaded and split {len(documents)} document chunks.")



✅ Loaded and split 3 document chunks.
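
Each sample file is well under 300 characters, so each one comes through as a single chunk, which is why three chunks are reported. To see what `chunk_size` and `chunk_overlap` mean on a longer text, here is a minimal sketch. It is a plain sliding window over characters, not the algorithm `CharacterTextSplitter` uses (that one splits on a separator first and then merges the pieces), and the helper name `chunk_text` is made up for illustration.

```python
def chunk_text(text: str, chunk_size: int = 300, chunk_overlap: int = 30) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; not part of LangChain.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "x" * 700
print([len(c) for c in chunk_text(sample)])  # [300, 300, 160]
```

Consecutive windows share their last and first 30 characters, so a sentence that straddles a boundary still appears whole in at least one chunk most of the time.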


In [None]:
# Convert documents to embedding vectors
# (imported from langchain_community; the old langchain.* paths are deprecated)
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

# Load the embedding model from Hugging Face (converts text into numeric vectors)
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Build a searchable vector database
vectorstore = FAISS.from_documents(documents, embedding)

print("✅ Vector database created.")


✅ Vector database created.
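
Conceptually, a vector database answers a query by embedding it and returning the stored vectors that lie closest to it. The toy sketch below shows that idea in plain Python, with hand-made three-dimensional vectors standing in for real embeddings (all-MiniLM-L6-v2 produces 384-dimensional ones). The function names and the numbers are invented for illustration; in the notebook itself the lookup would presumably go through the vector store, for example `vectorstore.similarity_search(query, k=2)`.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec, index, k=1):
    """Return the labels of the k vectors in `index` closest to `query_vec`.

    `index` is a list of (label, vector) pairs. This is a brute-force scan,
    shown only to illustrate the idea of nearest-neighbour search.
    """
    ranked = sorted(index, key=lambda item: l2_distance(query_vec, item[1]))
    return [label for label, _ in ranked[:k]]


# Made-up vectors: each document "points" mostly along one axis.
toy_index = [
    ("refund_policy", [0.9, 0.1, 0.0]),
    ("leave_policy", [0.1, 0.9, 0.0]),
    ("design_request", [0.0, 0.1, 0.9]),
]

query = [0.8, 0.2, 0.1]  # pretend embedding of a refund-related question
print(nearest(query, toy_index, k=2))  # ['refund_policy', 'leave_policy']
```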


In [None]:
# Install Hugging Face libraries:
# - transformers for using pre-trained LLMs
# - accelerate for optimized model inference on different hardware (CPU/GPU)
!pip install -q transformers accelerate

In [None]:
# Use a local model such as flan-t5-base
from transformers import pipeline

# Download the model from the Hugging Face Hub (first run only) and run it locally
qa_pipeline = pipeline("text2text-generation", model="google/flan-t5-base")

print("✅ Local model loaded.")


Device set to use cpu


✅ Local model loaded.
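
With a vector store and a local model both loaded, one common way to connect them is to paste the retrieved chunks into a prompt and ask the model to answer from that context only. The sketch below covers just the prompt-building step in plain Python. The helper name `build_prompt` and the prompt wording are assumptions for illustration, not something this notebook defines.

```python
def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine retrieved text chunks and a question into one prompt string.

    Hypothetical helper; the wording of the instruction is an assumption.
    """
    context = "\n\n".join(chunk.strip() for chunk in context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "How many days do customers have to request a refund?",
    ["Customers can request a refund within 30 days of purchase."],
)
print(prompt)

# In the notebook, the prompt could then be passed to the loaded pipeline, e.g.:
# answer = qa_pipeline(prompt, max_new_tokens=64)[0]["generated_text"]
```

Keeping the prompt construction in its own function makes it easy to inspect exactly what the model sees before any generation happens.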
