**Install Dependencies**

In [1]:
# !pip install langchain langchain-community langchain-core faiss-cpu chromadb pypdf sentence-transformers
# !pip install transformers accelerate bitsandbytes

This cell installs the libraries the notebook needs (uncomment the lines to run them):

langchain, langchain-community, langchain-core → framework for chaining components.

faiss-cpu / chromadb → vector stores for similarity search (only FAISS is used below).

pypdf → loads PDF documents.

sentence-transformers → embedding models.

transformers, accelerate, bitsandbytes → load and run TinyLlama efficiently on GPU.
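After installing, it can help to confirm that each package is importable. The following is a minimal sketch using only the standard library. `missing_modules` is a hypothetical helper, not part of the notebook, and the list uses the usual import names, which differ from the pip names in a few cases (for example, `faiss-cpu` is imported as `faiss`).

```python
import importlib.util

# Import names for the packages installed above. Some differ from the pip name,
# e.g. faiss-cpu -> faiss and sentence-transformers -> sentence_transformers.
REQUIRED_MODULES = [
    "langchain", "langchain_community", "langchain_core", "faiss", "chromadb",
    "pypdf", "sentence_transformers", "transformers", "accelerate", "bitsandbytes",
]

def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED_MODULES)
print("Missing:", missing if missing else "none")
```

If anything is reported missing, re-run the corresponding `pip install` line and restart the runtime.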

**Load TinyLlama Model**

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

`torch_dtype` is deprecated! Use `dtype` instead!


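Loads the TinyLlama 1.1B Chat model from Hugging Face.

The tokenizer converts text → tokens (integer IDs).

The model generates output tokens from the input tokens; the tokenizer decodes them back to text.

torch_dtype=torch.float16 loads the weights in half precision, roughly halving memory use compared with float32, and device_map="auto" places the model on a GPU when one is available. As the warning above notes, newer transformers versions prefer dtype over torch_dtype.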
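To see why half precision matters, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It assumes about 1.1 billion parameters (taken from the "1.1B" in the model name) and ignores activations, the KV cache, and framework overhead, so treat the numbers as a lower bound rather than a measurement.

```python
# Rough weight-memory estimate; ignores activations, KV cache, and overhead.
NUM_PARAMS = 1.1e9  # assumption: approximate count implied by the "1.1B" model name

BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weight_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{weight_memory_gb(NUM_PARAMS, dtype):.1f} GB")
```

Under these assumptions the weights take roughly 4.4 GB in float32 and 2.2 GB in float16.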

**Wrap Model in LangChain LLM Interface**

In [3]:
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True
)
llm = HuggingFacePipeline(pipeline=pipe)

Device set to use cpu


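Wraps the Hugging Face model in a LangChain-compatible LLM object.

pipeline() defines how the model generates text: max_new_tokens=256 caps the length of the answer, and do_sample=True with temperature=0.7 adds controlled randomness.

The "Device set to use cpu" line in the output indicates that the pipeline is running on the CPU in this session, so generation will be slower than on a GPU.

llm now behaves like any LangChain LLM, so you can plug it into chains.

Think of this as the bridge connecting TinyLlama ↔ LangChain.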
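The effect of temperature can be illustrated with a small, self-contained sketch. This is not the pipeline's internal code; it only shows the standard idea of dividing scores by the temperature before applying softmax. The scores in `toy_logits` are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (1.0, 0.7):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

With temperature below 1.0 the most likely token receives a larger share of the probability, so sampling becomes less random.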

**Load and Split Documents**

In [24]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("/content/FineTuningLLMs.pdf")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
splits = splitter.split_documents(docs)

PyPDFLoader reads the PDF into LangChain Document objects, one per page.

Each Document has:

page_content → the text,

metadata → file/page info.

RecursiveCharacterTextSplitter breaks long documents into smaller chunks (at most 300 characters each, with up to 50 characters of overlap between neighbouring chunks).

These chunks will later be embedded & stored in the vector database.

Splitting improves retrieval: each small chunk covers a narrower topic, so similarity search returns more focused matches.
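The idea of chunk size and overlap can be sketched with a plain sliding window. This is a simplified stand-in, not the algorithm RecursiveCharacterTextSplitter uses (that class prefers to break on separators such as paragraphs and newlines). `chunk_text` is a hypothetical helper written for this illustration.

```python
def chunk_text(text, chunk_size=300, chunk_overlap=50):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# 700 placeholder characters standing in for the text of a page.
sample = "".join(chr(ord("a") + i % 26) for i in range(700))
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # → [300, 300, 200]
```

The last 50 characters of each chunk repeat at the start of the next one, so a sentence that straddles a boundary still appears whole in at least one chunk.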

**Embeddings + Vector Store**

In [25]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = FAISS.from_documents(splits, embeddings)
retriever = db.as_retriever(search_kwargs={"k": 3})

HuggingFaceEmbeddings converts text → dense vectors (384 dimensions for all-MiniLM-L6-v2).

FAISS stores these vectors and allows semantic similarity search.

db.as_retriever(search_kwargs={"k": 3}) returns the 3 most similar chunks for any user question.

This is your knowledge base: it stores the document chunks and retrieves the relevant ones as context.
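Retrieving the top k chunks amounts to a nearest-neighbour search over vectors. The sketch below does this by brute force with Euclidean distance on made-up 3-dimensional vectors. It is only meant to show the idea; `top_k` is a hypothetical helper, and the actual FAISS index is far more efficient.

```python
import math

def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k vectors closest to query_vec by Euclidean distance."""
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    return [idx for _, idx in sorted(distances)[:k]]

# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.0, 1.0],
    [0.7, 0.3, 0.0],
]
query_vec = [1.0, 0.0, 0.0]
print(top_k(query_vec, doc_vecs, k=3))  # → [0, 2, 4]
```

The three vectors pointing in nearly the same direction as the query are returned, closest first.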

**Prompt Template**

In [26]:
from langchain.prompts import PromptTemplate

template = """
Use the context below to answer the question **strictly using only the given context**.

Context:
{context}

Question:
{question}

Answer:
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"],
)

Defines a prompt template with placeholders {context} and {question}. At run time, {context} is filled with the retrieved chunks and {question} with the user's query.
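Filling the placeholders works much like ordinary Python string formatting. The sketch below uses plain `str.format` on the same template text to show what the model eventually receives. `fill_prompt` is a hypothetical helper, and the context string is a shortened example.

```python
template = """
Use the context below to answer the question **strictly using only the given context**.

Context:
{context}

Question:
{question}

Answer:
"""

def fill_prompt(context, question):
    """Substitute the two placeholders, similar in spirit to formatting a PromptTemplate."""
    return template.format(context=context, question=question)

filled = fill_prompt(
    context="SGD updates parameters using a single or few data points at each iteration.",
    question="What is SGD?",
)
print(filled)
```

The filled text ends with "Answer:", which cues the model to continue with its answer.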

**Build the RAG Chain (Runnable)**

In [27]:
from langchain_core.runnables import RunnableMap, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

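def combine_docs_with_sources(docs):
    """Join retrieved documents into one string, labelling each with its source and page."""
    combined = ""
    for i, d in enumerate(docs, start=1):
        # Fall back to a generic label when the metadata has no "source" entry.
        source = d.metadata.get("source", f"doc_{i}")
        page = d.metadata.get("page", "")
        combined += f"Document {i} (source: {source}, page: {page}):\n{d.page_content}\n\n"
    return combined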

rag_chain = (
    RunnableMap({
        "context": retriever | combine_docs_with_sources,
        "question": RunnablePassthrough(),
    })
    | prompt
    | llm
    | StrOutputParser()
)

RunnableMap runs each of its entries on the same input and returns a dict with one key per entry:

context → the retriever finds relevant docs, and combine_docs_with_sources converts them to a single text block.

question → the input question passes through as-is.

The output of this map is fed into:

The prompt → builds the final text for the LLM.

The llm → generates an answer.

StrOutputParser() → returns the result as a plain string.
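The data flow through the chain can be mimicked with ordinary functions. Everything in this sketch is a hypothetical stand-in (`fake_retriever`, `fake_llm`, and so on) and none of it is LangChain code. It only shows the order of the steps and the shape of the data passed between them.

```python
def fake_retriever(question):
    """Hypothetical stand-in for the retriever: returns fixed text chunks."""
    return ["chunk about SGD", "chunk about mini-batch gradient descent"]

def join_chunks(chunks):
    """Combine the retrieved chunks into one context string."""
    return "\n\n".join(chunks)

def build_inputs(question):
    """Mimics the map step: each key is computed from the same input question."""
    return {"context": join_chunks(fake_retriever(question)), "question": question}

def format_prompt(inputs):
    """Mimics the prompt step: turns the dict into the final prompt text."""
    return f"Context:\n{inputs['context']}\n\nQuestion:\n{inputs['question']}\n\nAnswer:\n"

def fake_llm(prompt_text):
    """Hypothetical stand-in for the model: echoes the prompt and appends a placeholder."""
    return prompt_text + "<model output would appear here>"

def run_chain(question):
    # Same order as the chain above: map -> prompt -> llm -> plain string.
    return fake_llm(format_prompt(build_inputs(question)))

print(run_chain("What is SGD?"))
```

The `|` operator in the real chain expresses this same left-to-right composition.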

**Ask Questions**

In [28]:
query = "What is SGD?"
result = rag_chain.invoke(query)
print(result)


Use the context below to answer the question **strictly using only the given context**.

Context:
Document 1 (source: /content/FineTuningLLMs.pdf, page: 29):
5.4.2 Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) is a variant of Gradient Descent that focuses on reducing computation
per iteration.
How it Works: SGD updates parameters using a single or few data points at each iteration, intro-

Document 2 (source: /content/FineTuningLLMs.pdf, page: 30):
When to Use: SGD is ideal for large datasets, incremental learning scenarios, and real-time learning
environments where computational resources are limited.
5.4.3 Mini-batch Gradient Descent
Mini-batch Gradient Descent combines the efficiency of SGD and the stability of batch Gradient Descent,

Document 3 (source: /content/FineTuningLLMs.pdf, page: 29):
• Sensitive to the choice of learning rate.
When to Use: Gradient Descent is best used for small datasets where gradient computation is
cheap and simplicity and clarity

Sends a question to the RAG pipeline.

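The chain retrieves relevant document chunks, injects them into the prompt, and generates a context-aware answer using TinyLlama. Note that the printed output above begins with the filled prompt (the instruction and the retrieved context): in this run the pipeline returns the prompt followed by the generated text.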
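Since the printed result includes the prompt, one simple way to isolate the answer is to keep only the text after the final "Answer:" marker from the template. This is a minimal sketch under that assumption. `extract_answer` is a hypothetical helper, and it will cut in the wrong place if the model itself writes "Answer:" again in its output.

```python
def extract_answer(full_text, marker="Answer:"):
    """Return the text after the last occurrence of marker, or the full text if absent."""
    _, found, tail = full_text.rpartition(marker)
    return tail.strip() if found else full_text.strip()

# Hypothetical echoed output; the answer text is a placeholder, not real model output.
echoed = "Context:\n...\n\nQuestion:\nWhat is SGD?\n\nAnswer:\n<generated answer>"
print(extract_answer(echoed))  # → <generated answer>
```

In the notebook this would be applied as `extract_answer(result)` after `rag_chain.invoke(query)`.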