<a href="https://colab.research.google.com/github/AmulyaT29/ML-Projects/blob/main/RAG_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
!pip install llama-index
!pip install llama-index-readers-file
!pip install llama-index-embeddings-huggingface
!pip install llama-index-vector-stores-faiss
!pip install sentence-transformers faiss-cpu
!pip install transformers accelerate
!pip install pypdf  # for PDF reading




In [5]:
!pip install llama-index-llms-huggingface



In [6]:
!pip install pymupdf



In [7]:
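# Read the token from the environment (or prompt for it) instead of hard-coding a secret.
import os
from getpass import getpass
HF_TOKEN = os.environ.get("HF_TOKEN") or getpass("Hugging Face token: ")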

In [3]:
import os
from llama_index.core import VectorStoreIndex, Settings, Document
from llama_index.readers.file import PyMuPDFReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.faiss import FaissVectorStore
from llama_index.core.storage.storage_context import StorageContext
from llama_index.llms.huggingface import HuggingFaceLLM
from transformers import AutoTokenizer, AutoModelForCausalLM
import faiss
import torch




In [4]:
# Step 1: Load PDF
reader = PyMuPDFReader()
documents = reader.load(file_path="9.pdf")
# Step 2: Setup embedding model
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
Settings.embed_model = embed_model

In [5]:
# Step 3: Create FAISS index
dimension = len(embed_model.get_text_embedding("example text"))
faiss_index = faiss.IndexFlatL2(dimension)
vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
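
What a flat L2 index does at query time can be sketched in plain Python. This is only an illustration of exhaustive nearest-neighbour search by squared Euclidean distance, not FAISS's implementation; `flat_l2_search` and the toy vectors are hypothetical names and data introduced here.

```python
def flat_l2_search(stored, query, k):
    """Return the k nearest stored vectors as (squared_distance, position) pairs."""
    scored = []
    for position, vector in enumerate(stored):
        if len(vector) != len(query):
            raise ValueError("all vectors must share the index dimension")
        # Squared Euclidean distance between the stored vector and the query.
        squared_distance = sum((a - b) ** 2 for a, b in zip(vector, query))
        scored.append((squared_distance, position))
    scored.sort()  # smallest distance first
    return scored[:k]


# Toy 2-dimensional "embeddings"; real embeddings have `dimension` components.
stored_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
nearest = flat_l2_search(stored_vectors, [0.9, 1.2], k=2)
print(nearest)
```

Because every stored vector is compared with the query, the cost grows linearly with the number of chunks, which is usually acceptable for a single PDF.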

In [11]:
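# Step 4 (large-model option): Load Mistral-7B-Instruct.
# At about 7B parameters the float16 weights alone need roughly 14 GB, so skip
# this cell on smaller hardware and use the lightweight model in the next cell.
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
# Assumption: the token is only needed if the model repo requires authentication.
tokenizer = AutoTokenizer.from_pretrained(model_name, token=HF_TOKEN)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # Will use GPU if available, else CPU
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    token=HF_TOKEN,
)
llm = HuggingFaceLLM(
    context_window=4096, max_new_tokens=512, tokenizer=tokenizer, model=model
)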
Settings.llm = llm  # Set the local LLM





In [6]:
# Step 4 (lightweight alternative): Load TinyLlama, a 1.1B-parameter chat model that fits on modest hardware
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # Will use GPU if available, else CPU
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
)

# Optional: reuse the EOS token for padding to silence the missing-pad-token warning
tokenizer.pad_token = tokenizer.eos_token

llm = HuggingFaceLLM(
    context_window=2048,         # TinyLlama's maximum context length is 2048 tokens
    max_new_tokens=256,
    tokenizer=tokenizer,
    model=model
)

Settings.llm = llm
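
The choice between the two model-loading cells above mostly comes down to memory. A rough back-of-the-envelope estimate, assuming nominal parameter counts (1.1B and 7B) and counting only the weights, not activations or the KV cache, might look like this; `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
# Bytes used to store one parameter for each dtype used in this notebook.
BYTES_PER_PARAM = {"float16": 2, "float32": 4}


def weight_memory_gb(num_params, dtype):
    """Approximate size of the model weights in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Nominal parameter counts (assumptions; the exact counts differ slightly).
tinyllama_fp16 = weight_memory_gb(1.1e9, "float16")
tinyllama_fp32 = weight_memory_gb(1.1e9, "float32")
mistral_fp16 = weight_memory_gb(7e9, "float16")

print(f"TinyLlama fp16: ~{tinyllama_fp16:.1f} GB")
print(f"TinyLlama fp32: ~{tinyllama_fp32:.1f} GB")
print(f"Mistral-7B fp16: ~{mistral_fp16:.1f} GB")
```

Falling back to `torch.float32` on CPU doubles these figures, which is one reason the smaller model is the safer default without a GPU.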


In [7]:
# Step 5: Build the index
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
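
Building the index splits each document into smaller chunks and embeds every chunk before storing it in the vector store. The sketch below shows the general idea with a deliberately simple word-based splitter with overlap. It is a stand-in for illustration, not the splitter LlamaIndex actually uses, and `chunk_words` and its parameters are hypothetical.

```python
def chunk_words(text, chunk_size, overlap):
    """Split text into chunks of up to chunk_size words, sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = "attention lets each token gather information from earlier tokens"
chunks = chunk_words(sample, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap keeps a sentence that straddles a boundary partly visible in both neighbouring chunks, at the cost of embedding some words twice.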


In [8]:
# Step 6: Query the system
query_engine = index.as_query_engine()
query = "What is attention?"
response = query_engine.query(query)

In [9]:
print(response)

Attention is a way to compute a vector representation for a token at a
particular layer of a transformer, by selectively attending to and integrating
information from prior tokens at the previous layer. Attention takes an input
representation xi corresponding to the input token at position i, and a context
window of prior inputs x1..xi−1, and produces an output ai.
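
Behind a query like the one above, the engine retrieves the chunks closest to the question and passes them to the LLM together with the question. A minimal sketch of that prompt-assembly step follows. The template wording and the `build_rag_prompt` helper are hypothetical, not the prompt LlamaIndex uses internally.

```python
def build_rag_prompt(question, retrieved_chunks):
    """Combine retrieved text chunks and the user question into one prompt."""
    # Number the chunks so the model (and the reader) can tell them apart.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for text retrieved from the PDF.
example_chunks = ["First retrieved passage.", "Second retrieved passage."]
prompt = build_rag_prompt("What is attention?", example_chunks)
print(prompt)
```

In LlamaIndex the chunks actually used for an answer can usually be inspected through `response.source_nodes`, which helps when checking whether an answer is grounded in the PDF.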


In [11]:
query = "What is feed forward layer?"
response = query_engine.query(query)
print(response)


Feed forward layer is a kind of layer in a neural network that is used to pass the input to the next layer. It is a non-linear transformation that is applied to the input before the output is computed. The output of the feed forward layer is then passed to the next layer.
