<a href="https://colab.research.google.com/github/Hritikrai55/Llama-FineTuning/blob/main/Llama_Finetune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Project Goal

This notebook demonstrates the process of fine-tuning a powerful Large Language Model (LLM), Llama 3, using LoRA (Low-Rank Adaptation) on custom PDF data. In the context of the EdTech industry, this approach can be applied to:

- **Create specialized AI tutors:** Fine-tune on educational materials (textbooks, lectures, articles) to build AI assistants capable of explaining complex concepts, answering student questions, and providing personalized learning support based on specific curriculum content.
- **Develop content summarization tools:** Enable quick summarization of lengthy educational documents, research papers, or articles for students and educators, improving information retrieval and study efficiency.
- **Build knowledge retrieval systems:** Create chatbots or search interfaces that can answer specific questions by extracting relevant information from a large repository of educational resources.
- **Automate assessment generation:** Potentially assist in generating quizzes, questions, or study guides based on the fine-tuned material.

By fine-tuning open-source models like Llama 3 on domain-specific educational data, we can create more accurate, relevant, and cost-effective AI solutions for various applications within the EdTech landscape.


# 🦙 Llama 3 LoRA Fine-Tuning on PDFs + Gradio Implementation

Steps:

1. Environment setup (CUDA PyTorch, libs)
2. Hugging Face login (for gated Llama 3 weights)
3. PDF extraction → text
4. Chunking → instruction-style dataset (JSONL)
5. Tokenization
6. **LoRA (PEFT) fine-tuning** in 4-bit with `bitsandbytes`
7. Load fine-tuned adapters for inference
8. **Gradio UI** to chat with your fine-tuned model

> **Model**: `meta-llama/Meta-Llama-3-8B-Instruct` (requires accepting Meta & HF license and having a HF token).


## 1) Installing Required Libraries

In [1]:
# Core libraries
!pip install -q transformers accelerate datasets peft bitsandbytes safetensors sentencepiece tokenizers
!pip install -q pymupdf tqdm gradio
# Optional for synthetic labels
!pip install -q sentence-transformers


Installs the necessary Python libraries for the notebook, including `transformers`, `accelerate`, `datasets`, `peft`, `bitsandbytes`, `safetensors`, `sentencepiece`, `tokenizers` for model handling and training, `pymupdf` for PDF processing, `tqdm` for progress bars, and `gradio` for the user interface. `sentence-transformers` is included for potential synthetic data generation (although not used in this specific flow).

## 2) Hugging Face login (required for Llama 3)

In [2]:

# You must accept Meta Llama 3 license on the HF model page and have a HF token with access.
# One-time per environment:
from huggingface_hub import login
import os

# Option A: paste your token interactively (more secure)
login()

# Option B: set an env var once, then login with it (uncomment and replace YOUR_TOKEN)
# os.environ["HUGGINGFACE_HUB_TOKEN"] = "YOUR_TOKEN"
# login(token=os.environ["HUGGINGFACE_HUB_TOKEN"])

# print("If you haven't logged in yet, run login() above.")


Handles the Hugging Face login process, which is required to access the Llama 3 model weights. It imports the `login` function from `huggingface_hub` and provides two options for authentication: interactive login (recommended for security) or using a pre-set environment variable for the token.

## 3) Configuration

In [3]:

from pathlib import Path

# Base model (Llama 3 8B Instruct). Requires HF access.
MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"

# Data paths
PDF_DIR = Path("pdfs")
TEXT_DIR = Path("texts")
DATA_JSONL = Path("dataset.jsonl")

# Chunking
CHUNK_SIZE_WORDS = 400
CHUNK_OVERLAP = 50

# Training
OUTPUT_DIR = Path("lora_out")
EPOCHS = 20
LEARNING_RATE = 2e-4
PER_DEVICE_BATCH_SIZE = 1
GRAD_ACCUM = 8
MAX_LENGTH = 512

print("Configured.")


Configured.


Defines various configuration parameters used throughout the notebook, such as the base model name (`MODEL_NAME`), paths for input PDFs, extracted text, and the generated dataset (`PDF_DIR`, `TEXT_DIR`, `DATA_JSONL`), chunking parameters (`CHUNK_SIZE_WORDS`, `CHUNK_OVERLAP`), and training parameters (`OUTPUT_DIR`, `EPOCHS`, `LEARNING_RATE`, `PER_DEVICE_BATCH_SIZE`, `GRAD_ACCUM`, `MAX_LENGTH`).
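To make the training parameters concrete, the sketch below works out the effective batch size and the number of optimizer steps they imply. The helper `training_schedule` is hypothetical (it is not part of the notebook or of `transformers`), and it assumes that a trailing partial accumulation group still triggers an optimizer step. The example count of 18 is the one this run reports in the chunking step; under that assumption the result is consistent with the last logged training step being 60.

```python
import math

def training_schedule(num_examples, per_device_batch_size, grad_accum, epochs, num_devices=1):
    """Hypothetical helper: return (effective_batch_size, steps_per_epoch, total_steps)."""
    # Examples consumed per optimizer update.
    effective_batch = per_device_batch_size * grad_accum * num_devices
    # Micro-batches each device sees per epoch.
    micro_batches = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # Assumption: a trailing partial accumulation group still counts as one update.
    steps_per_epoch = max(1, math.ceil(micro_batches / grad_accum))
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs

# Values from the configuration cell, with the 18 examples produced in this run.
schedule = training_schedule(num_examples=18, per_device_batch_size=1, grad_accum=8, epochs=20)
print(schedule)  # (8, 3, 60)
```

With only 18 examples, 20 epochs means every chunk is seen 20 times, so it is worth watching for memorization rather than generalization.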

## 4) PDF → text extraction

In [4]:

import fitz, re
from tqdm import tqdm

PDF_DIR.mkdir(exist_ok=True)
TEXT_DIR.mkdir(exist_ok=True)

def extract_text_from_pdf(path: Path) -> str:
    doc = fitz.open(str(path))
    pages = [p.get_text("text") for p in doc]
    text = "\n".join(pages)
    text = re.sub(r'\n{2,}', '\n', text)
    return text.strip()

pdf_files = list(PDF_DIR.glob("*.pdf"))
if not pdf_files:
    print("⚠️ Put your PDFs into the 'pdfs/' folder and re-run this cell.")
else:
    for pdf in tqdm(pdf_files, desc="Extracting PDFs"):
        txt = extract_text_from_pdf(pdf)
        (TEXT_DIR / f"{pdf.stem}.txt").write_text(txt, encoding="utf-8")
    print("✅ Text files saved to:", TEXT_DIR)


Extracting PDFs: 100%|██████████| 1/1 [00:00<00:00,  6.52it/s]

✅ Text files saved to: texts





Extracts text content from PDF files located in the specified `PDF_DIR`. It uses the `fitz` library (part of `pymupdf`) to open and read each PDF page, then joins the page texts, and uses a regular expression to clean up excessive newlines. The extracted text for each PDF is saved as a `.txt` file in the `TEXT_DIR`.
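As a small illustration of the newline cleanup, the sketch below applies the same regular expression to a made-up string. The sample text and the gentler `\n{3,}` variant are illustrative assumptions, not part of the notebook.

```python
import re

# Made-up sample with a heading, a wrapped paragraph, and blank lines.
raw = "Chapter 1\n\n\nFirst paragraph line one.\nline two.\n\nSecond paragraph."

# Same pattern as extract_text_from_pdf: every run of 2+ newlines becomes one.
cleaned = re.sub(r'\n{2,}', '\n', raw).strip()
print(cleaned.split("\n"))

# Alternative (not what the notebook does): keep one blank line between paragraphs.
kept = re.sub(r'\n{3,}', '\n\n', raw).strip()
print(kept.split("\n\n"))
```

The notebook's pattern removes paragraph boundaries entirely. That is harmless for this pipeline, because the chunking step splits on any whitespace, but the alternative may be preferable if chunks should later respect paragraphs.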

## 5) Chunking & dataset creation

In [5]:

import json

def chunk_text(text, chunk_size=CHUNK_SIZE_WORDS, overlap=CHUNK_OVERLAP):
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk_words = words[i: i + chunk_size]
        if not chunk_words:
            break
        chunks.append(" ".join(chunk_words))
        i += max(1, chunk_size - overlap)
    return chunks

dataset = []
text_files = list(TEXT_DIR.glob("*.txt"))
if not text_files:
    print("⚠️ No .txt files found in 'texts/'. Make sure previous step ran.")
else:
    for tf in text_files:
        text = tf.read_text(encoding="utf-8")
        chunks = chunk_text(text)
        for c in chunks:
            dataset.append({
                "instruction": "Summarize the following passage in 2–4 concise bullet points.",
                "input": c,
                # Left empty here: fill manually or synthetically before training
                # (this notebook does not include a generation step for it).
                "output": ""
            })

    with DATA_JSONL.open("w", encoding="utf-8") as f:
        for ex in dataset:
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")

    print(f"✅ Created {len(dataset)} examples -> {DATA_JSONL}")


✅ Created 18 examples -> dataset.jsonl


Chunks the extracted text and builds the instruction-style dataset in JSONL format. The `chunk_text` function splits the text into chunks of `CHUNK_SIZE_WORDS` words, with consecutive chunks sharing `CHUNK_OVERLAP` words. The cell then iterates through the text files and turns each chunk into the "input" of a summarization instruction. The "output" field is left blank, so it has to be filled manually or by a synthetic generation step before training if the model is to have target summaries to learn from. The resulting dataset is saved to the `DATA_JSONL` file.
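The sliding window can be checked on synthetic input. The sketch below is a compact reimplementation of the same logic (a list comprehension instead of the `while` loop) run on 1,000 placeholder words; the placeholder words are an assumption made for illustration.

```python
def chunk_text(text, chunk_size=400, overlap=50):
    """Sliding window over words; the window start advances by chunk_size - overlap."""
    words = text.split()
    step = max(1, chunk_size - overlap)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

# 1,000 placeholder words: w0, w1, ..., w999.
text = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_text(text)

sizes = [len(c.split()) for c in chunks]
starts = [c.split()[0] for c in chunks]
print(sizes)   # [400, 400, 300]
print(starts)  # ['w0', 'w350', 'w700']

# The last 50 words of one chunk are the first 50 words of the next.
print(chunks[0].split()[-50:] == chunks[1].split()[:50])  # True
```

Because the window advances by 350 words, the final chunk can be much shorter than the others, and for some text lengths it consists entirely of words already covered by the previous chunk.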

## 6) Load dataset & tokenizer

In [6]:

from datasets import load_dataset
from transformers import AutoTokenizer

data_path = str(DATA_JSONL)
print("Using dataset:", data_path)

ds = load_dataset("json", data_files=data_path, split="train")

def format_example(example):
    instr = example.get("instruction","")
    inp = example.get("input","")
    out = example.get("output","")
    if inp:
        prompt = f"### Instruction:\n{instr}\n\n### Input:\n{inp}\n\n### Response:\n{out}"
    else:
        prompt = f"### Instruction:\n{instr}\n\n### Response:\n{out}"
    return {"text": prompt}

ds = ds.map(format_example)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True, trust_remote_code=True)
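# Note: a newly added pad token gets an id beyond the base model's embedding table.
# If padding is ever fed to the model (e.g. batch size > 1), resize the embeddings with
# model.resize_token_embeddings(len(tokenizer)) or reuse an existing token as the pad token.
if tokenizer.pad_token_id is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})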
print("Tokenizer ready. Pad token id:", tokenizer.pad_token_id)


Using dataset: dataset.jsonl


Tokenizer ready. Pad token id: 128256


Loads the dataset created in the previous step using the `load_dataset` function from the `datasets` library. It also loads the tokenizer for the chosen Llama 3 model using `AutoTokenizer`. A `format_example` function is defined to structure the data into the required prompt format for instruction tuning. The dataset is then mapped using this function, and a padding token is added to the tokenizer if it doesn't exist.
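To see what the formatted training text looks like, the sketch below runs the same `format_example` logic on two hand-written records: one with a filled `output` and one with the empty `output` that the chunking step writes. Both records are made up for illustration.

```python
def format_example(example):
    """Same prompt template as the notebook's format_example."""
    instr = example.get("instruction", "")
    inp = example.get("input", "")
    out = example.get("output", "")
    if inp:
        prompt = f"### Instruction:\n{instr}\n\n### Input:\n{inp}\n\n### Response:\n{out}"
    else:
        prompt = f"### Instruction:\n{instr}\n\n### Response:\n{out}"
    return {"text": prompt}

# Hypothetical records for illustration only.
filled = {"instruction": "Summarize.", "input": "Some passage.", "output": "- point one\n- point two"}
empty = {"instruction": "Summarize.", "input": "Some passage.", "output": ""}

filled_text = format_example(filled)["text"]
empty_text = format_example(empty)["text"]

print(repr(filled_text[-40:]))
print(repr(empty_text[-20:]))
```

With an empty `output`, the training text stops right after `### Response:`, so the example contains the instruction and the passage but no summary for the model to imitate.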

## 7) Tokenize

In [7]:
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=MAX_LENGTH)

ds_tokenized = ds.map(tokenize, batched=True, remove_columns=ds.column_names)
print(ds_tokenized)


Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 18
})


Tokenizes the formatted dataset. The `tokenize` function uses the loaded tokenizer to convert the text prompts into token IDs, truncating sequences longer than `MAX_LENGTH`. The `map` function applies this tokenization in batches and removes all original columns, leaving only `input_ids` and `attention_mask`.
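The effect of truncation on this prompt layout can be sketched with a toy stand-in. The `toy_tokenize` function below is hypothetical and splits on whitespace; it is not the Llama 3 tokenizer. It only mimics the default behaviour of keeping the first `max_length` tokens and dropping the rest from the end.

```python
def toy_tokenize(text, max_length):
    """Whitespace stand-in for a real tokenizer: keep only the first max_length tokens."""
    return text.split()[:max_length]

# A miniature prompt in the notebook's layout, with a 20-word "passage".
passage = " ".join(f"word{i}" for i in range(20))
prompt = (
    "### Instruction: Summarize the passage. "
    f"### Input: {passage} "
    "### Response: - a short summary"
)

kept = toy_tokenize(prompt, max_length=12)
print(kept)
print("Response:" in kept)  # False: the response section was cut off
```

Since the response sits at the end of the prompt, it is the first part to be lost when a sequence is too long. A 400-word chunk plus the template may well exceed 512 real tokens, so it is worth comparing tokenized lengths against `MAX_LENGTH` before training.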

## 8) LoRA fine-tuning (4-bit, bitsandbytes)

In [8]:
import torch
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

torch.backends.cuda.matmul.allow_tf32 = True

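# Note: `load_in_4bit=True` still works here but is deprecated (see the warning in the output).
# Newer transformers versions expect quantization_config=BitsAndBytesConfig(load_in_4bit=True).
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    load_in_4bit=True,
    device_map="auto",
    trust_remote_code=True
)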

# Prepare for k-bit training and LoRA
model = prepare_model_for_kbit_training(model)

# Llama-family common target modules (q_proj, v_proj); adjust if needed.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj","v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir=str(OUTPUT_DIR),
    per_device_train_batch_size=PER_DEVICE_BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUM,
    num_train_epochs=EPOCHS,
    learning_rate=LEARNING_RATE,
    fp16=True,  # use bf16=True if your GPU supports it and drivers allow
    logging_steps=10,
    save_strategy="epoch",
    report_to="none",
    remove_unused_columns=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_tokenized,
    data_collator=data_collator,
    tokenizer=tokenizer
)

trainer.train()
model.save_pretrained(str(OUTPUT_DIR))
print("✅ LoRA adapters saved to:", OUTPUT_DIR)


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.

| Step | Training Loss |
|------|---------------|
| 10 | 1.9007 |
| 20 | 1.4464 |
| 30 | 1.1557 |
| 40 | 0.9606 |
| 50 | 0.7914 |
| 60 | 0.7574 |



✅ LoRA adapters saved to: lora_out


Sets up and runs the LoRA fine-tuning process on the tokenized dataset. The base Llama 3 model is loaded in 4-bit precision using `bitsandbytes` for memory efficiency and then prepared for k-bit training. `LoraConfig` configures the adapters (rank 8, alpha 32, dropout 0.05), targeting the `q_proj` and `v_proj` modules. `DataCollatorForLanguageModeling` with `mlm=False` prepares the batches for causal language modeling. `TrainingArguments` configures the training process (epochs, learning rate, batch size, gradient accumulation, etc.). Finally, a `Trainer` is initialized and its `train` method starts the fine-tuning. The logged training loss falls from 1.9007 at step 10 to 0.7574 at step 60, and the trained LoRA adapters are saved to the `OUTPUT_DIR`.
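To give a sense of how small the adapters are, the sketch below counts the trainable LoRA parameters for this configuration. The layer dimensions are assumptions taken from the published Llama 3 8B configuration (hidden size 4096, 32 layers, and a 1024-wide `v_proj` output from 8 key/value heads of dimension 128); they are not stated in the notebook. The helper `lora_params` is hypothetical.

```python
def lora_params(in_features, out_features, r):
    """Parameters in one LoRA pair: A is (r x in_features), B is (out_features x r)."""
    return r * (in_features + out_features)

# Assumed Llama 3 8B dimensions (not stated in the notebook).
HIDDEN = 4096       # q_proj maps 4096 -> 4096
KV_DIM = 1024       # v_proj maps 4096 -> 1024 (8 KV heads x 128)
NUM_LAYERS = 32

# Values from the notebook's LoraConfig.
R = 8
LORA_ALPHA = 32

per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(per_layer)  # 106496
print(total)      # 3407872
print(scaling)    # 4.0
```

Under these assumptions, roughly 3.4 million parameters are trained, a small fraction of a percent of the roughly 8 billion parameters in the base model, which stays frozen.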

## 9) Load fine-tuned adapters and test generation

In [9]:

from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(MODEL_NAME, load_in_4bit=True, device_map="auto", trust_remote_code=True)
model = PeftModel.from_pretrained(base, str(OUTPUT_DIR))

def generate_response(prompt, max_new_tokens=200, temperature=0.7, top_p=0.9):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.inference_mode():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            pad_token_id=tokenizer.pad_token_id
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

test_prompt = (
    "### Instruction:\nSummarize the following passage in 2–4 bullet points.\n\n"
    "### Input:\nArtificial Intelligence is transforming education by enabling personalized learning paths, "
    "adaptive assessments, and intelligent tutoring systems. Institutions must address data privacy and fairness.\n\n"
    "### Response:\n"
)
print(generate_response(test_prompt))


### Instruction:
Summarize the following passage in 2–4 bullet points.

### Input:
Artificial Intelligence is transforming education by enabling personalized learning paths, adaptive assessments, and intelligent tutoring systems. Institutions must address data privacy and fairness.

### Response:
Artificial Intelligence is transforming education by enabling personalized learning paths, adaptive assessments, and intelligent tutoring systems. Institutions must address data privacy and fairness. <EOP> 4 Bullet Points <EOP> ####### Instruction: Summarize the following passage in 2–4 bullet points. ### Input: Artificial Intelligence is transforming education by enabling personalized learning paths, adaptive assessments, and intelligent tutoring systems. Institutions must address data privacy and fairness. ### Response: Artificial Intelligence is transforming education by enabling personalized learning paths, adaptive assessments, and intelligent tutoring systems. Institutions must address d

Loads the fine-tuned LoRA adapters and demonstrates how to use the model for inference. The base model is loaded again, and the saved LoRA adapters are applied on top of it using `PeftModel.from_pretrained`. The `generate_response` function takes a prompt, tokenizes it, and uses the fine-tuned model to generate a response with the specified generation parameters (max new tokens, temperature, top_p). Because `generate` returns the prompt tokens followed by the continuation, the decoded string begins with the prompt itself. In the test run above, the text after `### Response:` repeats the input passage and then restarts the prompt template instead of producing bullet points.
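Since the decoded string contains the prompt and, in the run above, a repeated template, a small post-processing step can isolate the answer. The sketch below is one possible approach using plain string operations. The helper `extract_response` and the choice of stop markers are assumptions, not part of the notebook.

```python
RESPONSE_MARKER = "### Response:"
STOP_MARKERS = ("### Instruction:", "### Input:")

def extract_response(decoded: str) -> str:
    """Hypothetical helper: return only the text generated after the first response marker."""
    # Drop the prompt: keep what follows the first "### Response:".
    _, found, tail = decoded.partition(RESPONSE_MARKER)
    if not found:
        tail = decoded
    # Cut at the earliest marker that signals the template starting over.
    cut = len(tail)
    for marker in STOP_MARKERS:
        idx = tail.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    # Remove stray '#' characters left by a malformed marker, then whitespace.
    return tail[:cut].rstrip("#").strip()

# Made-up decoded output in the notebook's prompt format.
decoded = (
    "### Instruction:\nSummarize.\n\n### Input:\nSome passage.\n\n"
    "### Response:\n- point A\n- point B\n\n####### Instruction:\nSummarize again."
)
answer = extract_response(decoded)
print(answer)
```

This relies on the markers not appearing in the user's own text. A more robust alternative is to slice off the prompt tokens from `outputs[0]` before decoding, using the length of the tokenized input.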

## 10) Gradio interface (chat-style)

In [10]:

import gradio as gr

SYSTEM_INSTR = "You are a helpful assistant fine-tuned to summarize and answer based on provided input."

def chat_fn(user_input, temperature=0.7, top_p=0.9, max_new_tokens=256):
    prompt = (
        f"### Instruction:\n{SYSTEM_INSTR}\n\n"
        f"### Input:\n{user_input}\n\n"
        f"### Response:\n"
    )
    return generate_response(prompt, max_new_tokens=int(max_new_tokens), temperature=float(temperature), top_p=float(top_p))

with gr.Blocks() as demo:
    gr.Markdown("## 🦙 Llama 3 LoRA — Demo")
    gr.Markdown("Enter a passage or question. The model will respond based on its fine-tuned behavior.")
    with gr.Row():
        inp = gr.Textbox(label="Your input", lines=8, placeholder="Paste a passage or ask a question...")
    with gr.Row():
        temperature = gr.Slider(0.0, 1.5, value=0.7, step=0.05, label="Temperature")
        top_p = gr.Slider(0.1, 1.0, value=0.9, step=0.05, label="Top-p")
        max_new = gr.Slider(32, 1024, value=256, step=32, label="Max new tokens")
    out = gr.Textbox(label="Model output", lines=12)
    btn = gr.Button("Generate")
    btn.click(chat_fn, inputs=[inp, temperature, top_p, max_new], outputs=[out])

demo.launch(share=False)


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Note: opening Chrome Inspector may crash demo inside Colab notebooks.
* To create a public link, set `share=True` in `launch()`.


Sets up a Gradio web interface for testing the fine-tuned model interactively in a chat-like manner. The `chat_fn` function takes the user input and generation parameters, formats the prompt with a system instruction, and calls `generate_response` to get the model's output. The interface is built using `gr.Blocks`, with input and output textboxes, sliders for the generation parameters, and a button that triggers `chat_fn`. The interface is launched locally, without a public share link (`share=False`).

## Summary and Conclusion


*   **Large Language Model (LLM):** A type of artificial intelligence model trained on vast amounts of text data to understand and generate human-like text. Llama 3 is the specific LLM used here.
*   **Fine-tuning:** The process of taking a pre-trained model and training it further on a smaller, specific dataset to adapt it to a particular task or domain.
*   **LoRA (Low-Rank Adaptation):** A parameter-efficient fine-tuning technique that significantly reduces the number of trainable parameters, making fine-tuning large models more computationally feasible.
*   **bitsandbytes:** A library that enables quantization techniques (like 4-bit) to reduce the memory footprint and computational requirements of large models.
*   **4-bit quantization:** A method of representing model weights using only 4 bits, leading to substantial memory savings.
*   **PEFT (Parameter-Efficient Fine-Tuning):** A general term for techniques like LoRA that allow for efficient fine-tuning of large models.
*   **Hugging Face Transformers:** A popular library providing access to pre-trained models, tokenizers, and tools for fine-tuning.
*   **Datasets library:** A library for easily loading, processing, and managing datasets for machine learning.
*   **Tokenization:** The process of converting text into numerical tokens that can be processed by a model.
*   **JSONL:** A newline-delimited JSON format often used for storing datasets.
*   **Gradio:** A library for building interactive web interfaces for machine learning models.
*   **CUDA:** A parallel computing platform and API model developed by NVIDIA for general computing on graphical processing units (GPUs).
*   **PyTorch:** An open-source machine learning framework.
*   **DataCollatorForLanguageModeling:** A utility in the `transformers` library for preparing batches of data for language modeling tasks.
*   **Trainer:** A high-level class in the `transformers` library for simplifying the training of models.

**Result After Fine-tuning:**

The fine-tuning run produced LoRA adapters saved to the `./lora_out` directory, which are loaded on top of the base model as a `PeftModel`. The training prompts follow the instruction/input/response format, but the `output` field of every example was left empty, so the model was never shown target summaries. The test generation in the notebook reflects this: instead of bullet points, the model repeats the input passage and then re-emits the prompt template. Filling `output` with real summaries, manually or synthetically, before training is the most likely fix if the adapters are to learn the summarization task. The Gradio interface allows interactive testing of the model's output with new inputs.