In [None]:
pip install streamlit

In [None]:
import os   # lets us interact with the operating system, e.g. create/delete folders, read files, check file paths
import io   # input/output; lets us treat in-memory data as if it were a file
import re   # regular expressions; helps find or replace patterns in text
import json  # works with JSON data (a common format for storing dict-like data)
import pandas as pd  # pandas: data-analysis library that works with tables (rows and columns), like Excel
import numpy as np  # numpy: numerical library for arrays, matrices, and math operations
import time  # lets us work with time (pause the program, measure how long something takes)
import streamlit as st  # streamlit: used to build web apps and dashboards for data science and ML
from typing import List, Dict, Tuple, Any  # type hints (no effect at runtime); they document the data types a function expects and returns

Example of type hinting:

def add_numbers(numbers: List[int]) -> int:
   return sum(numbers)

List[int] means the input is a list of integers.
-> int means the function returns an integer.

In [None]:
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import faiss
from pypdf import PdfReader
from docx import Document as DocxDocument

NLP & AI models
1.from sentence_transformers import SentenceTransformer
   Library for creating sentence embeddings (turns a sentence into a numeric vector); useful for semantic search, recommendation, and clustering.
   Example:
   "I love cats" → [0.12, -0.55, 0.88, ...] (vector).

2.from transformers import pipeline
   Hugging Face Transformers library.
   pipeline is a shortcut to use pre-trained AI models easily.
   Example tasks: text summarization, translation, sentiment analysis, question answering.
   Example:
   from transformers import pipeline
   summarizer = pipeline("summarization")
   print(summarizer("I love tom"))

Searching Similar Data
1.import faiss
  FAISS (by Facebook/Meta) is for fast similarity search.
  Helps you search quickly through millions of vectors (like sentence embeddings).
  Example: find the most similar sentence/document to a query.

Reading Files
1.from pypdf import PdfReader
  Library to read text from PDF files.
  Example: load a PDF, get number of pages, extract text.

2.from docx import Document as DocxDocument
  Library to read and write Word documents (.docx).
  Example: open a .docx file and read paragraphs, or create a new Word file.


In short:

SentenceTransformer → turns sentences into numbers (vectors).
pipeline → quick way to use AI models (summarize, translate, classify).
faiss → finds similar vectors really fast (used for search).
PdfReader → extracts text from PDFs.
DocxDocument → reads/writes Word docs.

In [None]:
pip install faiss-cpu

In [None]:
pip install pypdf

In [None]:
pip install python-docx

In [None]:
#clean & chunk text
def clean_text(text:str)->str:
  text=re.sub(r"\s+"," ",text)
  return text.strip()

What it does step by step:

def clean_text(text: str) -> str:
Defines a function called clean_text.
It takes some text as input (text).
-> str means it will return a string.

re.sub(r"\s+", " ", text)
Uses regular expressions (re).
\s+ means "one or more spaces, tabs, or newlines".
Replaces them with a single space " ".
Example: "Hello World\n\nHow are you?" → "Hello World How are you?".

text.strip()
Removes extra spaces at the start and end of the text.
Example: " Hello World " → "Hello World".
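
The two steps above can be tried on a small string. This is a minimal, self-contained sketch that repeats the clean_text definition so it runs on its own; the sample string is made up for illustration.

```python
import re

def clean_text(text: str) -> str:
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r"\s+", " ", text)
    # Drop leading and trailing whitespace
    return text.strip()

raw = "  Hello   World\n\nHow\tare you?  "
cleaned = clean_text(raw)
print(cleaned)  # Hello World How are you?
```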

In [None]:
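def chunk_text(text:str, chunk_size:int=800, overlap:int=120) -> List[str]:
  """
  Character-based chunking (simple and robust).
  A chunk size of ~800 characters works well for small models like FLAN-T5.
  """
  if overlap >= chunk_size:
    raise ValueError("overlap must be smaller than chunk_size")
  text = clean_text(text)
  chunks = []
  start = 0
  n = len(text)
  while start < n:
    end = min(start + chunk_size, n)
    chunks.append(text[start:end])
    if end == n:
      break  # last chunk reached; without this check the loop would never end
    start = end - overlap  # step back so consecutive chunks share `overlap` characters
  return chunks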

What it does

This function splits long text into smaller pieces (chunks).
Inputs
text: str → the text you want to split.
chunk_size: int = 800 → how many characters per chunk (default 800).
overlap: int = 120 → how many characters overlap between chunks (default 120).
Returns: a list of text chunks (List[str]).

text = clean_text(text)
Cleans the text (removes extra spaces/newlines).

Initialize variables
chunks = [] → empty list to store the text pieces.
start = 0 → where we start cutting.
n = len(text) → total number of characters in text.

Loop until we reach the end of the text
end = min(start + chunk_size, n)
→ defines where the chunk ends.
chunk = text[start:end]
→ takes a slice of the text.
chunks.append(chunk)
→ saves the chunk in the list.

Overlap handling
start = end - overlap
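→ moves the start back by `overlap` characters (120 by default) so the next chunk overlaps with the previous one.
The overlap matters because a sentence cut at a chunk boundary would otherwise be split across two chunks, and neither chunk alone would carry its full meaning.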

Return chunks
After looping, we return the list of all chunks.
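
To see the overlap in action, here is a small sketch with tiny numbers (chunk_size=10, overlap=3) so the shared characters are easy to spot. The helper name chunk_with_overlap is made up for this illustration, and it assumes the loop stops as soon as the end of the text is reached.

```python
from typing import List

def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break  # final chunk reached
        start = end - overlap  # next chunk re-reads the last `overlap` characters
    return chunks

demo_chunks = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the one before it ("hij", "opq", "vwx").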

In [None]:
#file loader(PDF,DOCX,CSV,TXT)

In [None]:
def load_txt(file_bytes: bytes)->str:
  return file_bytes.decode("utf-8",errors="ignore")

What it does

file_bytes: bytes
The function takes a file that’s been read in bytes form (raw computer data).
Example: when you upload a .txt file, it is often read as bytes first.

.decode("utf-8", errors="ignore")
Converts those bytes into a human-readable string using the utf-8 text format (the most common text encoding).
errors="ignore" means: if some bytes are not valid UTF-8, they are silently dropped instead of raising a UnicodeDecodeError.

Returns
A string version of the file’s contents.
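
The effect of errors="ignore" is easiest to see next to the alternatives. The sketch below uses a made-up byte string that contains one invalid byte (\xff) and compares strict decoding, "ignore", and "replace".

```python
raw_bytes = b"caf\xc3\xa9\xff menu"   # \xff is not valid UTF-8

# Default (strict) decoding raises on the invalid byte
try:
    raw_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw_bytes.decode("utf-8", errors="ignore")    # invalid byte is dropped
replaced = raw_bytes.decode("utf-8", errors="replace")  # invalid byte becomes U+FFFD

print(strict_failed)  # True
print(ignored)        # café menu
print(replaced)       # same text with a U+FFFD marker after "café"
```

"ignore" keeps the app from crashing, but it hides the fact that data was lost; "replace" leaves a visible marker instead.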

In [None]:
def load_pdf(file_bytes: bytes)->str:
  with io.BytesIO(file_bytes) as fb:
    reader = PdfReader(fb)
    texts=[]
    for page in reader.pages:
      try:
        t=page.extract_text() or ""
      except Exception :
        t=""
      if t:
        texts.append(t)
  return "\n".join(texts)

What it does

This function takes a PDF file in bytes and extracts all the text inside it.
Step by step:
with io.BytesIO(file_bytes) as fb:
Turns the raw file bytes into a file-like object (fb) so that PdfReader can read it (like opening a real file).

reader = PdfReader(fb)
Creates a PdfReader object to work with the PDF.

texts = []
An empty list to collect text from each page.

Loop through each page
for page in reader.pages:
Goes page by page inside the PDF.

Extract text safely
try:
    t = page.extract_text() or ""
except Exception:
    t = ""
page.extract_text() → tries to pull text from the page.
or "" → if it returns None, replace with an empty string.
except → if extraction raises an error on a page, that page is skipped. Scanned, image-only pages usually yield no text at all, so they end up skipped as well.

Add text if available
if t:
    texts.append(t)
Saves the text from that page into the list.

Return joined text
return "\n".join(texts)
Combines all page texts into one big string, separated by newlines.

In [None]:
def load_docx(file_bytes: bytes)->str:
  with io.BytesIO(file_bytes) as fb:
    doc=DocxDocument(fb)
  return "\n".join(p.text for p in doc.paragraphs)

This function takes a Word file (.docx) in bytes and extracts all the text inside it.

Step by step:
with io.BytesIO(file_bytes) as fb:
Turns the raw file bytes into a file-like object (fb), so Python can treat it like an actual .docx file.

doc = DocxDocument(fb)
Opens the Word document using the python-docx library.

doc.paragraphs
Gives you a list of all the paragraphs in the Word document.
Each p is a paragraph object, and p.text is its text content.

"\n".join(p.text for p in doc.paragraphs)
Goes through each paragraph (p.text) and joins them into one big string, separated by newlines.

In [None]:
def load_csv(file_bytes: bytes)->str:
  with io.BytesIO(file_bytes) as fb:
    df=pd.read_csv(fb)
    #converting to readable FAQ-like table text
  return df.to_csv(index=False)

This function takes a CSV file in bytes and converts it into a text format (CSV string).

Step by step:
with io.BytesIO(file_bytes) as fb:
Converts the raw bytes into a file-like object (fb) so pandas can read it.

df = pd.read_csv(fb)
Reads the CSV file into a pandas DataFrame (like an Excel table in Python).

Comment: # converting to readable FAQ-like table text
The comment suggests the intent was to convert the data into a readable text format (like Q&A style), but this code simply writes the data back out as CSV text.

return df.to_csv(index=False)
Converts the DataFrame back into a CSV string, without the row index.
Basically, it gives you the CSV contents as plain text.
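
If a more readable, FAQ-like layout is wanted, one possible approach is to turn each row into "column: value" pairs. The sketch below is only an illustration of that idea, not what the notebook does: it uses the standard-library csv module instead of pandas, and the helper name csv_bytes_to_lines and the sample data are made up.

```python
import csv
import io

def csv_bytes_to_lines(file_bytes: bytes) -> str:
    """Turn each CSV row into a 'column: value' line (hypothetical helper)."""
    text = file_bytes.decode("utf-8", errors="ignore")
    reader = csv.DictReader(io.StringIO(text))
    lines = []
    for row in reader:
        # Join the cells of one row as "header: value" pairs
        lines.append(" | ".join(f"{col}: {val}" for col, val in row.items()))
    return "\n".join(lines)

sample = b"question,answer\nWhat is FAISS?,A similarity search library\n"
faq_text = csv_bytes_to_lines(sample)
print(faq_text)
# question: What is FAISS? | answer: A similarity search library
```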

In [None]:
def read_any(file)->Tuple[str,str]:
  name=file.name.lower()
  content=file.read()
  if name.endswith(".pdf"):
    return "pdf",load_pdf(content)
  elif name.endswith(".docx"):
    return "docx",load_docx(content)
  elif name.endswith(".csv"):
    return "csv",load_csv(content)
  elif name.endswith(".txt"):
    return "txt",load_txt(content)
  else:
    raise ValueError("Unsupported file type. Please upload PDF, DOCX, TXT, or CSV.")

This function can read different types of files (PDF, DOCX, CSV, TXT) using the right loader function.

Step by step:
name = file.name.lower()
Gets the file name (like "example.PDF").
Converts it to lowercase so it’s easier to check extensions (e.g., "pdf" vs "PDF").

content = file.read()
Reads the file content into memory (as bytes).

Check the file type using extension
If the file ends with .pdf → call load_pdf(content)
If .docx → call load_docx(content)
If .csv → call load_csv(content)
If .txt → call load_txt(content)

Return a tuple
First item = file type (like "pdf", "docx", etc.).
Second item = the extracted text from the file.

Unsupported file
If the file type isn’t recognized, it raises an error telling the user only PDF, DOCX, TXT, or CSV are allowed.
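
The same dispatch can be tried without a real upload. The sketch below assumes that `file` is any object with a .name attribute and a .read() method, and uses an in-memory io.BytesIO as a stand-in. It registers only a TXT loader in a lookup table to stay self-contained; the names LOADERS and read_by_extension are made up for this illustration.

```python
import io
from typing import Callable, Dict, Tuple

def load_txt(file_bytes: bytes) -> str:
    return file_bytes.decode("utf-8", errors="ignore")

# Only the TXT loader is registered here to keep the sketch self-contained
LOADERS: Dict[str, Callable[[bytes], str]] = {".txt": load_txt}

def read_by_extension(file) -> Tuple[str, str]:
    name = file.name.lower()
    for ext, loader in LOADERS.items():
        if name.endswith(ext):
            return ext.lstrip("."), loader(file.read())
    raise ValueError("Unsupported file type.")

fake_upload = io.BytesIO(b"hello from a text file")
fake_upload.name = "Notes.TXT"   # stand-in for an uploaded file's name

kind, text = read_by_extension(fake_upload)
print(kind, "->", text)  # txt -> hello from a text file
```

A lookup table like this does the same job as the if/elif chain; adding a new file type then means adding one entry instead of one more branch.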

In [None]:
#Embedding + FAISS handling
@st.cache_resource
def get_embedder():
  return SentenceTransformer("all-MiniLM-L6-v2")

What it does

@st.cache_resource
This is a Streamlit decorator.
It tells Streamlit:
“Run this function only once and cache (remember) the result.”
So if you call get_embedder() many times in the app, it won’t reload the model every time (saves time).

def get_embedder():
Defines a function named get_embedder.

SentenceTransformer("all-MiniLM-L6-v2")
Loads a pretrained embedding model from sentence-transformers.
Model name: all-MiniLM-L6-v2 (a small, fast, but good model for embeddings).
This model converts sentences into embeddings (numeric vectors).
Example: "Hello world" → [0.23, -0.51, 0.88, ...]

return SentenceTransformer(...)
The function returns the model object.
So now, whenever you need embeddings, you just call:
embedder = get_embedder()
vectors = embedder.encode(["hi sam", "how u doing?"])

summary:
This function loads a text embedding model once, caches it, and reuses it.
The model converts sentences → vectors (embeddings).
These embeddings are later used with FAISS for similarity search.
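
The "load once, reuse afterwards" idea can be shown without Streamlit or a real model. The sketch below uses functools.lru_cache from the standard library purely as an analogy; it is not how st.cache_resource is implemented, and get_fake_model is a made-up stand-in for an expensive model load.

```python
from functools import lru_cache

load_count = 0  # counts how often the expensive load really runs

@lru_cache(maxsize=None)
def get_fake_model() -> dict:
    """Stand-in for an expensive model load (hypothetical)."""
    global load_count
    load_count += 1
    return {"name": "fake-embedder"}

first = get_fake_model()
second = get_fake_model()

print(load_count)       # 1
print(first is second)  # True
```

The second call returns the very same object without running the function body again, which is the behaviour the decorator on get_embedder() is there to provide.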

In [None]:
def build_or_load_index(
    embedder:SentenceTransformer,
    storage_dir:str="storage"
) ->Tuple[faiss.IndexFlatL2,List[Dict[str,Any]]]:
    os.makedirs(storage_dir,exist_ok=True)
    index_path=os.path.join(storage_dir,"faiss.index")
    meta_path=os.path.join(storage_dir,"meta.npy")

    # Reuse a previously saved index and its metadata if both files exist
    if os.path.exists(index_path) and os.path.exists(meta_path):
        index=faiss.read_index(index_path)
        metadata=np.load(meta_path,allow_pickle=True).tolist()
        return index,metadata

    # Otherwise start with an empty index; 384 is the embedding size of all-MiniLM-L6-v2
    index=faiss.IndexFlatL2(384)
    metadata: List[Dict[str,Any]]=[]
    return index,metadata


This function either loads an existing FAISS index (and metadata) from disk or, if none exists, creates a new empty index.

Step by step:
Function definition
Takes in an embedder (the SentenceTransformer model).
storage_dir="storage" → default folder where index files are saved.
Returns:
a FAISS index (faiss.IndexFlatL2)
metadata (list of dictionaries with extra info about each chunk).

Paths for storage
index_path = os.path.join(storage_dir, "faiss.index")
meta_path  = os.path.join(storage_dir, "meta.npy")

One file for the FAISS index.
One file for the metadata.

Check if files exist
if os.path.exists(index_path) and os.path.exists(meta_path):
    index = faiss.read_index(index_path)        # load FAISS index
    metadata = np.load(meta_path, allow_pickle=True).tolist()  # load metadata
    return index, metadata
If both files are present → load them from disk and return.

If no files found → create a new index
index = faiss.IndexFlatL2(384)
metadata: List[Dict[str, Any]] = []
return index, metadata
Creates a new FAISS index with vectors of size 384 (because "all-MiniLM-L6-v2" embeddings are 384-dimensional).
Starts with an empty metadata list.

Example usage
embedder = get_embedder()
index, metadata = build_or_load_index(embedder)

print(index.ntotal)   # number of vectors stored
print(len(metadata))  # number of metadata entries


If nothing exists yet → you get an empty FAISS index.
If files exist → it resumes from saved data.

In simple terms:
This function makes sure you always have a FAISS index ready:
If you’ve already built and saved one → it loads it.
If not → it creates a new, empty one.

What is a FAISS index?

FAISS = Facebook AI Similarity Search.
A FAISS index is like a special database designed to store vectors (embeddings) and let you quickly find the ones that are most similar to a query.

Think of it as:
You turn sentences into number vectors using SentenceTransformer.
You put those vectors inside a FAISS index.
Later, when you give a new query (also turned into a vector), FAISS searches the index and finds the closest vectors (i.e., most similar sentences).
So instead of searching text directly, you’re searching based on semantic meaning.

What is faiss.IndexFlatL2?

FAISS provides different types of indexes.
IndexFlatL2 means:
Flat → It stores all vectors directly (no fancy compression, no approximation).
L2 → It measures similarity using L2 distance (Euclidean distance).
That’s just the usual "straight-line distance" in multi-dimensional space.

summary:
FAISS index = a smart database for embeddings.
IndexFlatL2 = the simplest type of FAISS index, which stores vectors and compares them using Euclidean distance to find similar ones.

Later, you can switch to more advanced FAISS indexes (such as IndexIVFFlat or IndexHNSWFlat), which trade a little accuracy for much faster searches when you have millions of vectors.
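
What a flat L2 index does can be sketched in plain Python: compare the query with every stored vector and keep the k closest. This is only an illustration of the idea, not FAISS code; the function name flat_l2_search and the toy vectors are made up. It works with squared distances, which is also what IndexFlatL2 reports, so no square root is taken.

```python
from typing import List, Tuple

def flat_l2_search(
    stored: List[List[float]], query: List[float], k: int
) -> Tuple[List[float], List[int]]:
    """Compare the query with every stored vector and return the k nearest."""
    scored = []
    for position, vec in enumerate(stored):
        # Squared Euclidean distance between the query and one stored vector
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((dist, position))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

stored_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
distances, indices = flat_l2_search(stored_vectors, query=[1.0, 2.0], k=2)
print(indices)    # [1, 0]
print(distances)  # [1.0, 5.0]
```

Because every stored vector is checked, the result is exact, but the work grows with the number of vectors. That is the cost the approximate index types avoid.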

In [None]:
def persist_index(index: faiss.IndexFlatL2, metadata: List[Dict[str, Any]], storage_dir: str = "storage"):
    os.makedirs(storage_dir, exist_ok=True)
    faiss.write_index(index, os.path.join(storage_dir, "faiss.index"))
    np.save(os.path.join(storage_dir, "meta.npy"), np.array(metadata, dtype=object), allow_pickle=True)

This function saves (persists) the FAISS index and metadata to disk so that you can use them later without rebuilding from scratch.

Step by step

Create storage folder if missing
os.makedirs(storage_dir, exist_ok=True)
Makes sure the folder exists (e.g., "storage").
If it already exists → no error (because of exist_ok=True).

Save the FAISS index
faiss.write_index(index, os.path.join(storage_dir, "faiss.index"))
Writes the FAISS index (all the vectors) to a file called faiss.index.

Save the metadata
np.save(os.path.join(storage_dir, "meta.npy"), np.array(metadata, dtype=object), allow_pickle=True)
Converts metadata (Python list of dictionaries) into a NumPy array.
Saves it into meta.npy.
allow_pickle=True allows saving Python objects (like dictionaries).

In simple terms
Think of it like saving your progress in a game:
The FAISS index is like the game’s world data (your embeddings).
The metadata is like the notes telling which vector belongs to which text.
This function writes both into files so next time you run the program, you can just load them back instead of starting over.
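
Since the metadata is just a list of dictionaries with strings and numbers, it could also be stored as JSON. The sketch below shows that alternative as a round trip through a temporary folder. It is not what persist_index does; the helper names and the file name meta.json are assumptions made for this example.

```python
import json
import os
import tempfile
from typing import Any, Dict, List

def save_metadata_json(metadata: List[Dict[str, Any]], path: str) -> None:
    # Write the list of dictionaries as human-readable JSON
    with open(path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, ensure_ascii=False)

def load_metadata_json(path: str) -> List[Dict[str, Any]]:
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

example_meta = [{"source": "notes.pdf", "chunk_id": 0, "text": "Artificial Intelligence is..."}]

with tempfile.TemporaryDirectory() as tmp_dir:
    meta_file = os.path.join(tmp_dir, "meta.json")
    save_metadata_json(example_meta, meta_file)
    restored = load_metadata_json(meta_file)

print(restored == example_meta)  # True
```

A JSON file can be opened and inspected in any text editor, and loading it does not require allow_pickle=True.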

In [None]:
def add_texts_to_index(
    texts: List[str],
    source_name: str,
    embedder: SentenceTransformer,
    index: faiss.IndexFlatL2,
    metadata: List[Dict[str,Any]]
):
    if not texts:
      return
    vectors = embedder.encode(texts,convert_to_numpy=True,normalize_embeddings=False)
    index.add(vectors)
    for i,t in enumerate(texts):
      metadata.append({
          "source": source_name,
          "chunk_id": i,
          "text": t
      })

This function takes text chunks, turns them into embeddings, stores them in FAISS, and records metadata.

Step by step

Check if texts exist
if not texts:
    return
If there are no texts, just stop.

Convert texts → vectors
vectors = embedder.encode(texts, convert_to_numpy=True, normalize_embeddings=False)
Uses the SentenceTransformer model to convert each text chunk into a numeric vector (embedding).
Example: "Hello world" → [0.12, -0.34, 0.56, ...]

Add vectors to FAISS index
index.add(vectors)
Stores those embeddings into the FAISS index for similarity search.

Save metadata for each chunk
for i, t in enumerate(texts):
    metadata.append({
        "source": source_name,
        "chunk_id": i,
        "text": t
    })
Keeps track of:
source_name → which file the text came from
chunk_id → position number of the chunk
text → the actual chunk of text

summary:
This function is like filing documents in a library:
First, it turns the text into a special code (vector) the computer can understand.
Then it stores the code in FAISS (like putting it in a drawer).
At the same time, it writes a note card (metadata) saying “this code belongs to file X, chunk Y, with this text.”
So later, when you search FAISS for similar text, you can also look up the original chunk and where it came from.

In [None]:
#Retriever
def retriever(query: str, embedder,index,metadata,k: int=3):
  if index.ntotal == 0:
    return []
  qvec = embedder.encode([query],convert_to_numpy=True)
  D, I = index.search(qvec, k=min(k, index.ntotal))
  results=[]
  for idx in I[0]:
    md = metadata[idx]
    results.append(md)
  return results

This function takes a user’s query, finds the most similar text chunks in the FAISS index, and returns their metadata.

Step by step

Check if index is empty
if index.ntotal == 0:
    return []
If no data has been added yet, just return an empty list.

Convert query → embedding
qvec = embedder.encode([query], convert_to_numpy=True)
Turns your search question into a vector (same way we did with the text chunks).
Example: "What is AI?" → [0.22, -0.15, 0.87, ...]

Search FAISS for nearest neighbors
D, I = index.search(qvec, k=min(k, index.ntotal))
index.search finds the k most similar vectors in FAISS.
D = distances (how close each result is)
I = indices (positions of matching chunks in the metadata list)

Collect metadata for results

for idx in I[0]:
    md = metadata[idx]
    results.append(md)


Uses the indices I to fetch the corresponding metadata (original text, source file, chunk ID).
Builds a list of result dictionaries.
Return the matches
return results
Final output is a list of the most relevant text chunks.

summary:
This function is like asking the library a question:
You phrase your question (query).
The system translates it into the same “special code” as the stored chunks.
FAISS finds the chunks that look closest to your query in vector space.
It then gives you the original text + source info (from metadata).

Example:
results = retriever("What is AI?", embedder, index, metadata, k=2)

Might return:

[
  {"source": "notes.pdf", "chunk_id": 0, "text": "Artificial Intelligence is..."},
  {"source": "ai_book.docx", "chunk_id": 3, "text": "AI means creating machines..."}
]
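
The function above discards the distances in D. If they are useful (for example, to show how close each match is), they can be attached to the results. The sketch below uses plain lists as stand-ins for D[0] and I[0]; the helper name attach_scores and the numbers are made up. It also skips -1 entries, which FAISS uses as a placeholder when it returns fewer than k results.

```python
from typing import Any, Dict, List

def attach_scores(
    distances_row: List[float],
    indices_row: List[int],
    metadata: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Pair each hit with its distance; skip -1 placeholders (hypothetical helper)."""
    results = []
    for dist, idx in zip(distances_row, indices_row):
        if idx < 0:
            continue  # placeholder for a missing result
        hit = dict(metadata[idx])   # copy so the stored metadata is not modified
        hit["distance"] = float(dist)
        results.append(hit)
    return results

toy_metadata = [
    {"source": "notes.pdf", "chunk_id": 0, "text": "Artificial Intelligence is..."},
    {"source": "ai_book.docx", "chunk_id": 3, "text": "AI means creating machines..."},
]
# Stand-ins for D[0] and I[0] as returned by index.search
hits = attach_scores([0.42, 0.97, 0.0], [1, 0, -1], toy_metadata)
for hit in hits:
    print(hit["source"], hit["distance"])
# ai_book.docx 0.42
# notes.pdf 0.97
```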
