## Step 1: Document Ingestion and Content Extraction
The first step is to extract the content from the PDFs. A library such as PyMuPDF (fitz) or pdfplumber works well here, since both can extract text along with layout information. I read 3 PDFs, each almost 5 pages long.
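
Text pulled out of a PDF usually still carries layout artifacts, such as words hyphenated across line breaks and hard-wrapped lines. The following is a minimal cleanup sketch that could be applied to each extracted string before chunking. The `clean_extracted_text` helper and its rules are assumptions made for illustration; they are not part of PyMuPDF or of the pipeline in this notebook.

```python
import re

def clean_extracted_text(text: str) -> str:
    """Normalize common PDF-extraction artifacts (illustrative rules only)."""
    # Join words hyphenated across a line break: "implemen-\ntation" -> "implementation"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Protect paragraph breaks (blank lines) with a placeholder character
    text = re.sub(r"\n\s*\n", "\u0000", text)
    # Unwrap the remaining hard line breaks inside a paragraph
    text = re.sub(r"\s*\n\s*", " ", text)
    # Collapse runs of spaces and tabs
    text = re.sub(r"[ \t]+", " ", text)
    # Restore paragraph breaks
    return text.replace("\u0000", "\n\n").strip()
```

One known limitation of this sketch: a genuine hyphenated compound that happens to break at a line end is merged into a single word as well.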


In [None]:
# Install required libraries
!pip install -q PyMuPDF transformers accelerate bitsandbytes sentence-transformers datasets peft

# Install Unsloth for faster fine-tuning
!pip install -q --upgrade pip
!pip install -q --upgrade --no-deps unsloth[colab-new]

import fitz # PyMuPDF
import os
import json
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from datasets import Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

In [None]:
def extract_text_from_pdf(pdf_path):
    """
    Extracts the plain text of every page in a PDF file.
    """
    # The context manager closes the file once all pages are read
    with fitz.open(pdf_path) as document:
        return "".join(page.get_text() for page in document)

# Define the paths to the PDFs
pdf_paths = [
    '/content/sample_data/Deepfile Task/1.pdf',
    '/content/sample_data/Deepfile Task/2.pdf',
    '/content/sample_data/Deepfile Task/3.pdf',
    '/content/sample_data/Deepfile Task/4.pdf'
]

corpus = [extract_text_from_pdf(p) for p in pdf_paths]

## Step 2: Content Segmentation and Instruction Dataset Generation
Here the raw text is segmented into meaningful chunks, and each chunk is turned into a question-answer pair.

Segmentation: Split the large text blobs into smaller, manageable chunks (e.g., 5-10 sentences each). The code below uses a simpler character-based split: 2000-character windows that advance by 1500 characters, so neighbouring chunks overlap by 500.

Instruction Generation: For each chunk, a stronger model such as GPT-3.5 or a local LLM (Mistral-7B-Instruct in the code below) generates a question and a corresponding answer based on the chunk's content. The prompt is something like: "Generate a question and a detailed answer based only on the following text: [chunk]."
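
The sentence-based segmentation described above (5-10 sentences per chunk) can be sketched with the standard library alone. The helper names `split_into_sentences` and `chunk_by_sentences` are hypothetical, and the regex splitter is deliberately naive: it breaks after `.`, `!` or `?` followed by whitespace, so abbreviations such as "e.g." will be split incorrectly.

```python
import re

def split_into_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def chunk_by_sentences(text: str, max_sentences: int = 8) -> list[str]:
    """Group consecutive sentences into chunks of at most `max_sentences`."""
    sentences = split_into_sentences(text)
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]
```

For example, `chunk_by_sentences(combined_text, max_sentences=8)` would give chunks that end on sentence boundaries, which tends to make the generated questions easier to answer from a single chunk than fixed character windows.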


In [None]:
def generate_instruction_dataset(text):
    """
    Generates an instruction dataset (ShareGPT/Alpaca style) from raw text.
    This uses a pre-trained model to act as a "teacher" to generate the dataset.
    """
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-Instruct-v0.2",
        torch_dtype=torch.float16,
        # Quantization settings go in a config object; the bare load_in_4bit argument is deprecated
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    )

    dataset = []

    # Simple character-based chunking: 2000-char windows, 1500-char stride (500-char overlap)
    chunks = [text[i:i + 2000] for i in range(0, len(text), 1500)]

    for i, chunk in enumerate(chunks):
        if len(chunk) < 50: continue # Skip short chunks

        prompt = f"""
        Below is a passage from a document. Generate a question and a corresponding answer that can be derived *solely* from the provided text.

        Passage: {chunk}

        Format the response as a JSON object with 'instruction' and 'output' keys.
        """

        inputs = tokenizer(prompt, return_tensors="pt", max_length=2048, truncation=True).to("cuda")
        outputs = model.generate(**inputs, max_new_tokens=256)
        # Decode only the newly generated tokens, so the echoed prompt is not parsed as output
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        response = tokenizer.decode(new_tokens, skip_special_tokens=True)

        # Simple parsing to get the JSON part
        try:
            # Takes the body of a json code fence if present, otherwise the whole response;
            # a more robust JSON parser may be needed
            qa_pair = json.loads(response.split("```json")[-1].split("```")[0].strip())
            dataset.append({
                "instruction": qa_pair["instruction"],
                "output": qa_pair["output"],
                "input": ""
            })
        except Exception as e:
            print(f"Failed to parse JSON for chunk {i}: {e}")
            continue

    return dataset

# Example usage (assuming 'corpus' from Step 1 is ready)
# combined_text = " ".join(corpus)
# instruction_dataset = generate_instruction_dataset(combined_text)
# with open("instruction_data.json", "w") as f:
#     json.dump(instruction_dataset, f, indent=4)
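
The JSON parsing in the function above depends on the model wrapping its answer in a json code fence. A somewhat more tolerant approach is to scan the response for the first substring that parses as a JSON object. The sketch below uses only the standard library; `extract_json_object` is a hypothetical helper, not something the notebook already defines.

```python
import json

def extract_json_object(response: str):
    """Return the first parseable JSON object found in `response`, or None."""
    decoder = json.JSONDecoder()
    start = response.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text
            obj, _ = decoder.raw_decode(response, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Try again from the next opening brace
        start = response.find("{", start + 1)
    return None
```

Inside the loop over chunks, `qa_pair = extract_json_object(response)` followed by a check for `None` and for the `instruction` and `output` keys could replace the split-based parsing.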

## Step 3: Fine-Tuning a Small Open Model

Now fine-tune a small model such as Gemma or Llama. I use QLoRA (Quantized Low-Rank Adaptation): the base model is loaded in 4-bit precision and only small low-rank adapter matrices are trained, which keeps both memory use and training time low. The training itself runs through Unsloth, a library that speeds up this process further.
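
To see why LoRA is cheap, it helps to count the trainable parameters. For a weight matrix with input size d_in and output size d_out, a rank-r adapter adds two matrices, A (r x d_in) and B (d_out x r), for r * (d_in + d_out) parameters in total. The sketch below applies this to the rank and target modules used in the fine-tuning cell of this step. The Gemma 2B layer dimensions are not stated in this notebook and are assumed here from the published model configuration.

```python
# Assumed Gemma 2B dimensions (not stated in this notebook)
HIDDEN = 2048
INTERMEDIATE = 16384
NUM_LAYERS = 18
Q_DIM = 8 * 256    # 8 query heads x head_dim 256
KV_DIM = 1 * 256   # 1 key/value head (multi-query attention)
RANK = 16          # LoRA rank r, as in the fine-tuning cell

# (input_dim, output_dim) of each adapted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, Q_DIM),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (Q_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 19,611,648
```

Under these assumptions the total agrees with the "Trainable parameters = 19,611,648" figure that Unsloth reports in the training output further down, which is under 1% of the full model.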

In [None]:
!pip install -q unsloth



In [None]:
# Sample fine-tuning script with Unsloth

from unsloth import FastLanguageModel
import torch
from datasets import Dataset
from transformers import TrainingArguments, Trainer

# Load a smaller, more memory-efficient model
max_seq_length = 1024
dtype = None
load_in_4bit = True

model_name = "unsloth/gemma-2b-it"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

# The Gemma tokenizer already ships with its own chat template
# (<start_of_turn>user ... <end_of_turn>), so no override is needed here.

# Configure LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing=True,
    random_state=3407,
    max_seq_length=max_seq_length,
)

# Prepare the dataset for training (placeholder examples; swap in the instruction dataset from Step 2)
dummy_data = [
    {"instruction": "What is the capital of India?", "output": "Delhi."},
    {"instruction": "What is the capital of Germany?", "output": "Berlin."}
]
dataset = Dataset.from_list(dummy_data)

# Define a function that formats AND tokenizes the data
def formatting_and_tokenizing_func(examples):
    prompts = []
    # Jinja chat template in Gemma's turn format; the assistant role is called "model".
    # <bos> is left out because the tokenizer call below adds it by default.
    gemma_template = "{% for message in messages %}{% if message['role'] == 'user' %}<start_of_turn>user\n{{ message['content'] }}<end_of_turn>\n{% elif message['role'] == 'assistant' %}<start_of_turn>model\n{{ message['content'] }}<end_of_turn>\n{% endif %}{% endfor %}"

    for instruction, output in zip(examples['instruction'], examples['output']):
        # Apply the chat template to create a single text string
        formatted_text = tokenizer.apply_chat_template([
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": output}
        ], tokenize=False, add_generation_prompt=False, chat_template=gemma_template)
        prompts.append(formatted_text)

    # Now, tokenize the entire batch
    tokenized_output = tokenizer(
        prompts,
        padding="longest",
        truncation=True,
        max_length=max_seq_length,
    )

    # For causal language modeling the labels mirror input_ids;
    # padding positions are set to -100 so the loss ignores them
    tokenized_output['labels'] = [
        [token if keep else -100 for token, keep in zip(ids, mask)]
        for ids, mask in zip(tokenized_output['input_ids'], tokenized_output['attention_mask'])
    ]

    return tokenized_output

# Use the new function to map and tokenize the dataset
tokenized_dataset = dataset.map(formatting_and_tokenizing_func, batched=True)

# Train the model
trainer = Trainer(
    model=model,
    train_dataset=tokenized_dataset, # Use the correctly tokenized dataset
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        warmup_steps=5,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        output_dir="outputs",
        optim="adamw_8bit",
        seed=3407,
        remove_unused_columns=False, # Keep every column; the dataset is already tokenized

        # This line to disable WandB logging
        report_to="none",
    ),
    data_collator=None,
)

# Start training
trainer.train()

# Save the model
model.save_pretrained_merged("finetuned_model_merged", tokenizer=tokenizer)

==((====))==  Unsloth 2025.8.9: Fast Gemma patching. Transformers: 4.55.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2 | Num Epochs = 3 | Total steps = 3
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 8 x 1) = 8
 "-____-"     Trainable parameters = 19,611,648 of 2,525,784,064 (0.78% trained)


Step,Training Loss
1,2.4365
2,2.4365
3,2.325



In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Pass quantization settings as a config object; the bare load_in_4bit argument is deprecated
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

# Load the base model
base_model_tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-2b-it")
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/gemma-2b-it",
    torch_dtype=torch.float16,
    quantization_config=bnb_config,
)

# Load the fine-tuned model (the merged model saved in the previous step)
finetuned_model = AutoModelForCausalLM.from_pretrained(
    "finetuned_model_merged",
    torch_dtype=torch.float16,
    quantization_config=bnb_config,
)

# Example evaluation prompts
evaluation_prompts = [
    "What is the key benefit of using Unsloth?",
    "According to the text, what is a common challenge with large language models?"
]

def generate_response(model, tokenizer, prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print("--- Base Model Responses ---")
for prompt in evaluation_prompts:
    response = generate_response(base_model, base_model_tokenizer, prompt)
    print(f"Prompt: {prompt}\nResponse: {response}\n")

print("\n--- Fine-Tuned Model Responses ---")
for prompt in evaluation_prompts:
    response = generate_response(finetuned_model, base_model_tokenizer, prompt)
    print(f"Prompt: {prompt}\nResponse: {response}\n")

--- Base Model Responses ---
Prompt: What is the key benefit of using Unsloth?
Response: What is the key benefit of using Unsloth?

Unsloth is a powerful tool for automating tasks and managing your workflow. It offers several key benefits:

**1. Automation:** Unsloth can automate repetitive tasks, saving you time and effort. This includes tasks like:

* Creating and sending emails
* Updating spreadsheets
* Creating reports
* Managing social media accounts
* And much more

**2. Workflow management:** Unsloth helps you manage your workflow by connecting different tasks and activities. This allows you to see the big picture and ensure that all your efforts are contributing to your goals.

**3. Data integration:** Unsloth can integrate with various data sources, allowing

Prompt: According to the text, what is a common challenge with large language models?
Response: According to the text, what is a common challenge with large language models?

The text does not specify a common challenge w

## Method 2 - Using RAG Pipeline
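
The pipeline in this section embeds every chunk, stores the vectors in a flat FAISS index, and retrieves the chunks closest to the query embedding. As a toy illustration of what an exhaustive L2 search computes, the sketch below measures the squared distance from the query to every stored vector and keeps the k smallest. The two-dimensional vectors and the `top_k_l2` helper are made up for this example; the notebook itself relies on `faiss.IndexFlatL2`.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k_l2(query, vectors, k=3):
    """Return (distance, index) pairs for the k vectors nearest to the query."""
    scored = sorted((squared_l2(query, v), i) for i, v in enumerate(vectors))
    return scored[:k]

# Three toy "chunk embeddings" and one query
chunk_vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
nearest = top_k_l2([1.0, 0.0], chunk_vectors, k=2)
print([index for _, index in nearest])  # [1, 2]
```

The indices returned play the same role as the `I` array that `index.search` returns in the code below: they point back into the list of text chunks.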

In [None]:
from huggingface_hub import login
login()


In [None]:
from google.colab import userdata
import os
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")

In [None]:
# Install required libraries
!pip install -q PyMuPDF transformers accelerate bitsandbytes sentence-transformers faiss-cpu

import fitz
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# --- Step 1: Document Ingestion and Content Embedding ---

def extract_text_from_pdf(pdf_path):
    """Extracts text from a PDF file."""
    text = ""
    try:
        with fitz.open(pdf_path) as document:
            for page in document:
                text += page.get_text()
    except Exception as e:
        print(f"Error extracting text from {pdf_path}: {e}")
    return text

def chunk_text(text, chunk_size=500, overlap=50):
    """Splits a large text into smaller chunks."""
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i:i + chunk_size])
    return chunks

# Define the paths to the PDFs
pdf_paths = [
    '/content/sample_data/Deepfile Task/1.pdf',
    '/content/sample_data/Deepfile Task/2.pdf',
    '/content/sample_data/Deepfile Task/3.pdf',
    '/content/sample_data/Deepfile Task/4.pdf'
]

# Extract and chunk all text from the PDFs
corpus_chunks = []
for p_path in pdf_paths:
    text = extract_text_from_pdf(p_path)
    corpus_chunks.extend(chunk_text(text))

# Load a Sentence Transformer model for creating embeddings
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
corpus_embeddings = embedding_model.encode(corpus_chunks, convert_to_tensor=True)

# Create a FAISS index for efficient similarity search
corpus_embeddings_np = corpus_embeddings.cpu().numpy()
d = corpus_embeddings_np.shape[1]
index = faiss.IndexFlatL2(d)
index.add(corpus_embeddings_np)

# --- Step 2: Retrieval-Augmented Generation (RAG) ---

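# Load a powerful, pre-trained LLM for generation
from transformers import BitsAndBytesConfig

llm_model_name = "mistralai/Mistral-7B-Instruct-v0.2"
llm_tokenizer = AutoTokenizer.from_pretrained(llm_model_name)
llm_model = AutoModelForCausalLM.from_pretrained(
    llm_model_name,
    torch_dtype=torch.float16,
    # Quantization settings go in a config object; the bare load_in_4bit argument is deprecated
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)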

def retrieve_and_generate(query, top_k=3):
    """Retrieves context from documents and generates a response."""
    # Create embedding for the user's query
    query_embedding = embedding_model.encode(query, convert_to_tensor=True).cpu().numpy()

    # Search the FAISS index for the top-k most similar chunks
    D, I = index.search(query_embedding.reshape(1, -1), top_k)
    # FAISS pads the result with -1 when the index holds fewer than top_k vectors
    retrieved_chunks = [corpus_chunks[i] for i in I[0] if i != -1]

    # Construct an augmented prompt with the retrieved context
    context = "\n\n".join(retrieved_chunks)

    # This prompt tells the LLM to use the provided context to answer the question
    prompt = f"""
    Based on the following context, answer the question accurately and concisely.

    Context:
    {context}

    Question:
    {query}

    Answer:
    """

    # Generate a response using the augmented prompt
    inputs = llm_tokenizer(prompt, return_tensors="pt", max_length=2048, truncation=True).to("cuda")
    outputs = llm_model.generate(
        **inputs,
        max_new_tokens=256,
        num_return_sequences=1,
        pad_token_id=llm_tokenizer.eos_token_id,  # set explicitly to avoid the pad_token_id warning
    )

    # Decode only the newly generated tokens, i.e. the text that follows "Answer:"
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return llm_tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# --- Step 3: Evaluation ---

# Your evaluation prompts
evaluation_prompts = [
    "What is a common challenge with large language models mentioned in the text?",
    "What is the key benefit of using LSA in text research?",
]

print("--- RAG Pipeline Responses ---")
for prompt in evaluation_prompts:
    response = retrieve_and_generate(prompt)
    print(f"Prompt: {prompt}\nResponse: {response}\n")

--- RAG Pipeline Responses ---
Prompt: What is a common challenge with large language models mentioned in the text?
Response: A common challenge with large language models is their potential to generate inaccurate, misleading, or harmful information due to their ability to learn from vast amounts of data, including potentially biased or incorrect information. This issue was highlighted in a study by Zhang et al. [27] and was further discussed in a radiology article by Shen et al. [28]. Milano et al. [29] also addressed this concern in the context of AI-generated text.

Prompt: What is the key benefit of using LSA in text research?
Response: The key benefit of using LSA in text research is its ability to match semantically at a level that allows for analyses previously done only through hand-coding, such as determining from which text a subject learned information, measuring the coherence and comprehensibility of texts, and grading the quality of information cited in an essay.

