# Streamlining Contract Analysis Using LoRA 🤖📄

We are developing a legal assistant designed to provide precise, context-aware answers to legal queries. Instead of manually reviewing lengthy contracts, users can simply ask questions, and the assistant extracts and summarizes the relevant information from structured legal datasets. Whether it's identifying key clauses, summarizing obligations, or clarifying terms, our solution enhances efficiency and accuracy in legal document analysis, empowering users to make informed decisions with ease.

## Training Methodology 📚  
To ensure high-quality responses, our AI is trained using:  

- A structured dataset containing a wide range of Master Service Agreement (MSA) examples and corresponding questions.  
- A testing dataset to evaluate and refine the model’s accuracy.  
- Each training example includes:  
  - **Document content**, systematically organized by page numbers.  
  - **Relevant legal questions**.  
  - **Verified correct answers** for precise responses.  



## Let's Break Down the Solution! 📦

### 1. `transformers`
- It contains pre-trained AI models (like having a smart student ready to learn more)
- Think of it as the brain of our AI system

### 2. Helper Tools
- `accelerate`: Makes our AI training faster (like a turbo boost!)
- `pyboxen`: Helps make our output look nice and organized
- `datasets`: Helps us organize and handle our training data
- `peft`: Our LoRA fine-tuning library (pinned to version 0.4.0 here, then upgraded in step 3)

### 3. Update PEFT
- Makes sure we have the newest version of our special teaching method
- PEFT (Parameter Efficient Fine-Tuning) is what makes our training efficient

## Why Do We Need These? 🤔
Imagine building a house:
- `transformers` is like having the main structure
- `accelerate` is like having power tools instead of hand tools
- `datasets` is like having an organized toolbox
- `peft` is like having special techniques to build faster and better

## Important Note 📝
We need to run this cell first because:
1. It sets up all our necessary tools
2. Without these, none of our later code will work

In [1]:
! pip install --upgrade transformers
! pip install -q accelerate pyboxen datasets==2.17.0 peft==0.4.0
! pip install --upgrade peft

Collecting transformers
  Downloading transformers-4.50.2-py3-none-any.whl.metadata (39 kB)
Downloading transformers-4.50.2-py3-none-any.whl (10.2 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.50.0
    Uninstalling transformers-4.50.0:
      Successfully uninstalled transformers-4.50.0
Successfully installed transformers-4.50.2


# Cell 2: Importing Our Packages 🧰

Before we start working, we need to import all the packages. Think of this like taking out all your tools from the toolbox and arranging them on your workbench!

### 1. Basic Tools 🔨
- `os`: Like having a map of your computer files
- `torch`: The main engine that powers our AI (like a car's engine)
- `datetime`: Our clock to track when things happen
- `warnings`: Helps keep our workspace tidy by hiding unnecessary messages
- `requests`: Like a messenger that can fetch things from the internet

### 2. AI Tools 🤖
From `transformers` we get:
- `AutoModelForCausalLM`: Our base AI model (like getting a smart student)
- `AutoTokenizer`: Translates human words into AI language
- `TrainingArguments`: Rules for how to teach our AI
- `pipeline`: Makes using AI easier (like having preset recipes)
- `logging`: Keeps track of what our AI is doing

### 3. LoRA Teaching Tools 📚
From `peft` we get:
- Tools to make our AI learn efficiently
- Special ways to update only parts of our AI
- Settings to control how our AI learns

In [2]:
import os
import torch
from datetime import datetime
from datasets import load_dataset
import warnings
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import prepare_model_for_kbit_training, PeftModel, LoraConfig, get_peft_model

warnings.filterwarnings('ignore')
import requests

# Cell 3: Understanding Our Dataset 📚

## What is Our Dataset?
Our dataset is a collection of Master Service Agreements (MSAs) formatted in a special way to teach our model how to understand and answer questions about legal documents.

## How is Our Dataset Formatted? 🏗️
Our dataset is stored in JSONL format (JSON Lines), where each line contains:

1. **Document Content**
   ```json
   {
     "page_number": "1",
     "content": "Page 1 of 8 EXPEL, INC. USER AGREEMENT TERMS AND CONDITIONS..."
   }
   ```

2. **Question-Answer Pairs**
   ```json
   {
     "question": "Does this contract mention what happens to customer data after termination?",
     "answer": "Yes, the contract specifies that within 30 days after termination..."
   }
   ```

3. **System Instructions**
   ```json
   {
     "role": "system",
     "content": "You are a seasoned lawyer with a strong background in Master Service Agreement..."
   }
   ```
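
As a minimal sketch of what reading this format involves, the snippet below parses JSONL text line by line with the standard `json` module. The two sample records and their field layout are hypothetical stand-ins assembled from the fragments above, not the real file contents, and `parse_jsonl` is an illustrative helper name.

```python
import json

# Hypothetical JSONL text: one JSON object per line (not the real dataset).
sample_jsonl = "\n".join([
    json.dumps({
        "content": [{"page_number": "1", "content": "Sample page text."}],
        "question": "Does this contract mention data deletion?",
        "answer": "Yes, see the termination section.",
    }),
    json.dumps({"role": "system", "content": "You are a seasoned lawyer."}),
])

def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

records = parse_jsonl(sample_jsonl)
for record in records:
    print(sorted(record.keys()))
```

Printing the keys of the first few real records in the same way is a quick check of which fields each line actually carries before writing any formatting code.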

## What Information Does Our Dataset Contain? 📄
1. **Legal Documents**
   - Complete MSA texts
   - Page numbers and content organization
   - Terms and conditions
   - Legal clauses and provisions

2. **Questions**
   - Data handling queries
   - Termination clauses
   - Contract duration
   - Legal obligations
   - Customer rights

3. **Answers**
   - Detailed explanations
   - References to specific sections
   - Legal interpretations
   - Clear, simple responses

In [4]:
# URLs for accessing the dataset files from GitHub repository
test_url = "https://raw.githubusercontent.com/initmahesh/MLAI-community-labs/main/Class-Labs/Lab-8(Fine-tuning-PEFT-LoRA)/formatted_test_set.jsonl"
train_url = "https://raw.githubusercontent.com/initmahesh/MLAI-community-labs/main/Class-Labs/Lab-8(Fine-tuning-PEFT-LoRA)/formatted_train_set.jsonl"

# Define local file paths where we'll save our downloaded data
test_file_path = "/content/formatted_test_set.jsonl"
train_file_path = "/content/formatted_train_set.jsonl"

# Download and save the test dataset
response_test = requests.get(test_url)
response_test.raise_for_status()  # stop early if the download failed
with open(test_file_path, "wb") as f_test:
    f_test.write(response_test.content)

# Download and save the training dataset
response_train = requests.get(train_url)
response_train.raise_for_status()  # stop early if the download failed
with open(train_file_path, "wb") as f_train:
    f_train.write(response_train.content)

# **Understanding How We Prepare Training Data 📚**  

We are transforming large legal documents, such as contracts, into a structured dataset. By breaking them into smaller, well-organized sections, we enable our AI to efficiently process, understand, and retrieve relevant information. This structured approach allows the model to provide accurate answers to user queries based on the dataset context.  

## **Why Is This Important? 🎯**  

### **1. Structured Organization 📋**  
- Ensures legal content is systematically categorized for efficient processing  
- Improves accessibility and retrieval of relevant information  
- Provides a clear, organized format for model training  

### **2. Consistent Learning Framework 📝**  
- Establishes a standardized data structure for seamless model learning  
- Enables the model to identify patterns and relationships within legal texts  
- Enhances accuracy and efficiency in document analysis   

## **How It Works: Step by Step ⚙️**  

#### **1. Loading the Document 📖**  
#### **2. Extracting and Structuring Content 📑**  
#### **3. Formatting Data for AI Training 🗂️**  
#### **4. Saving the Processed Data 💾**  


In [None]:
def prepare_training_data(train_file_path, test_file_path):
    """
    Prepare training data for MSA analysis from records with page_number and content fields.

    Note: test_file_path is accepted but not used; only the training file is reformatted.
    """
    import json

    with open(train_file_path, 'r') as f:
        train_data = [json.loads(line) for line in f]

    formatted_train_data = []
    for item in train_data:
        # Extract the page content and combine if needed
        contents = []
        if isinstance(item.get('content'), list):
            for page in item['content']:
                if isinstance(page, dict) and 'content' in page:
                    contents.append(f"Page {page.get('page_number', '')}: {page['content']}")
        else:
            contents.append(item.get('content', ''))

        # Create a structured prompt
        prompt = f"""
Context: Master Service Agreement content:
{' '.join(contents)}

Question: {item.get('question', '')}

Answer: {item.get('answer', '')}
"""
        formatted_train_data.append({
            "text": prompt.strip()
        })

    # Save formatted training data
    with open('formatted_train.jsonl', 'w') as f:
        for item in formatted_train_data:
            f.write(json.dumps(item) + '\n')

    return 'formatted_train.jsonl'

# Call the function
formatted_train_path = prepare_training_data(train_file_path, test_file_path)
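
To make the prompt layout concrete, here is a small standalone sketch that applies the same page flattening and template to one hypothetical record. `build_prompt` and the sample record are illustrative names and data, not part of the notebook's code or dataset.

```python
def build_prompt(item):
    """Flatten page dicts and fill the Context/Question/Answer template."""
    content = item.get("content", "")
    if isinstance(content, list):
        pages = [
            f"Page {page.get('page_number', '')}: {page['content']}"
            for page in content
            if isinstance(page, dict) and "content" in page
        ]
    else:
        pages = [content]
    return (
        "Context: Master Service Agreement content:\n"
        f"{' '.join(pages)}\n\n"
        f"Question: {item.get('question', '')}\n\n"
        f"Answer: {item.get('answer', '')}"
    )

# Hypothetical record, for illustration only.
example = {
    "content": [
        {"page_number": "1", "content": "Terms and conditions."},
        {"page_number": "2", "content": "Termination clause."},
    ],
    "question": "Is there a termination clause?",
    "answer": "Yes, on page 2.",
}
print(build_prompt(example))
```

Note that the full document text comes first and the question and answer come last, so any later truncation of long prompts removes material from the end.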

# Cell 5 Loading Our Model: Microsoft Phi-1.5 🤖

## What is Phi-1.5?
Microsoft Phi-1.5 is like a smart student who's already learned a lot about language and writing. It's a smaller, more efficient AI model that's perfect for learning new specific tasks - like understanding legal documents in our case!


## What's Happening Here? 🤔

### 1. Choosing Our AI Helper
```python
model_name = "microsoft/phi-1_5"
```
- Phi-1.5 is known for being good at understanding text
- It's smaller and faster than many other AI models

### 2. Loading the Model
```python
model = AutoModelForCausalLM.from_pretrained(...)
```
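- `torch_dtype=torch.float32`: Loads the weights in full 32-bit precision, which is numerically safer than 16-bit formats but uses about twice the memory
- `low_cpu_mem_usage=True`: Reduces peak CPU memory while the weights are being loaded
- `trust_remote_code=True`: Allows custom model code from the model's repository to run during loading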

### 3. Setting Up the Translator
```python
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
- Like teaching our AI how to read and write
- Helps convert human words into AI language and back
- Makes sure the AI understands our questions

## Why Do These Settings Matter? 🎯

### 1. Efficiency 🚀
- Uses computer resources wisely
- Runs faster without needing super powerful computers
- Perfect for learning new tasks

### 2. Accuracy ✨
- Makes precise calculations
- Understands text better
- Gives more reliable answers

### 3. Compatibility 🤝
- Works well with our training method
- Can handle different types of questions
- Easy to teach new things

## What Will This Model Do? 📋

1. **Read Documents** 📚
   - Understands legal language
   - Processes long documents
   - Remembers important details

2. **Answer Questions** ❓
   - Finds relevant information
   - Explains complex terms
   - Gives clear answers

3. **Learn and Improve** 📈
   - Gets better with training
   - Adapts to specific needs
   - Becomes more accurate

## Think of It Like... 🎓
Imagine having a super-smart study buddy who:
- Already knows a lot about language
- Is ready to learn more about legal documents
- Can help explain complicated things in simple words
- Gets better at helping the more they practice!

Remember: Just like a student needs the right books and tools to learn, our AI needs the right setup to work well! 🌟

In [None]:
# Load the model and tokenizer
model_name = "microsoft/phi-1_5"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Cell 6: Testing Our AI's Initial Performance 🔍

We're testing how well our model answers questions before any special training. Think of it like giving a student a test before teaching them the subject!


## Let's Break It Down! 🤔

### 1. The Question We're Asking
```python
question = "Does this contract has a provision that mentions that customer data will be deleted upon request?"
```
- Like asking a student a specific question about a textbook

### 2. Finding Relevant Information
```python
relevant_parts = [
    part['content'] for part in questions
    if 'Governing Law' in part['content'] or 'governed by' in part['content'].lower()
]
```
- Like helping the student find the right chapter in the textbook
- Currently only looking at parts about "Governing Law" (which isn't enough!)

### 3. Creating the Question Format
```python
prompt = f"""
As a legal expert, analyze the following excerpts from a Master Service Agreement:
{' '.join(relevant_parts)}
{question}
Provide a concise answer.
"""
```
- Like writing the question in a way the student can understand
- Includes context and clear instructions

## Why Aren't We Getting Good Answers Yet? 🤷‍♂️

### 1. Limited Knowledge 📚
- The model hasn't been trained on legal documents yet.

### 2. Wrong Focus 🎯
- We're only looking at "Governing Law" sections
- Like looking in the wrong chapter of the textbook

### 3. Raw Responses 📝
- The model gives generic or incorrect answers
- Like a student guessing answers without studying

## Example of Current Response
```
Out-of-the-box model response:
----------------------------------
Based on the provided excerpts, I cannot determine if there is a specific provision
about customer data deletion. The given content only discusses governing law aspects.
```

## What We Need to Fix 🛠️

### 1. Better Training
- Need to teach the model about legal documents.

### 2. Improved Search
- Look at all relevant contract sections
- Like checking all chapters that might have the answer

### 3. Accurate Responses
- Train the AI to give precise, factual answers
- Like teaching a student to answer based on facts, not guesses

## Coming Up Next! 🚀
In the next steps, we'll:
1. Train the model properly
2. Look at all relevant contract sections
3. Get more accurate and helpful answers

Remember: Just like a student needs proper training to give correct answers, our model needs special training to understand and answer questions about legal documents accurately! 🌟

In [None]:
from pyboxen import boxen

def generate_response(prompt, model, tokenizer):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
    device = next(model.parameters()).device
    inputs = inputs.to(device)

    with torch.no_grad():
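        outputs = model.generate(
            **inputs,  # pass input_ids and attention_mask together
            max_new_tokens=100,
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,  # pad token was set to EOS above
        )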

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Load test questions
import json
with open(test_file_path, "r") as f:
    questions = [json.loads(line) for line in f]

# Test question
question = "Does this contract has a provision that mentions that customer data will be deleted upon request by customer or after closure or termination of agreement for whatsoever reason?"

# Find relevant parts
relevant_parts = [
    part['content'] for part in questions
    if 'Governing Law' in part['content'] or 'governed by' in part['content'].lower()
]

prompt = f"""
As a legal expert, analyze the following excerpts from a Master Service Agreement:

{' '.join(relevant_parts)}

{question}

Provide a concise answer.
"""

print("\n" + "=" * 50)
print("Results Before Fine-Tuning".center(50))
print("=" * 50 + "\n")

# Print question in boxen
print(boxen(
    question,
    title="Question",
    padding=1,
    margin=1,
    color="yellow"
))

# Generate and print response in boxen
response = generate_response(prompt, model, tokenizer)
print(boxen(
    response,
    title="Out-of-the-box model response",
    padding=1,
    margin=1,
    color="blue"
))


            Results Before Fine-Tuning

   ╭─ Question ────────────────────────────────────────────────────────────────────────────────────────────────╮
   │                                                                                                           │
   │   Does this contract has a provision that mentions that customer data will be deleted upon request by     │
   │   customer or after closure or termination of agreement for whatsoever reason?                            │
   │                                                                                                           │
   ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────╯

   ╭─ Out-of-the-box model response ───────────────────────────────────────────────────────────────────────────╮
   │                                                                                                           │
   │                                                                                                           │
   │   As a legal expert, analyze the following excerpts from a Master Service Agreement:                      │
   │                                                                                                           │
   │   You are a seasoned lawyer with a strong background in Master Service Agreement agreement.\              │
   │       Your expertise is required to analyze a Ma

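The response box above begins by repeating the prompt. That is expected for a causal language model: `generate` returns the prompt tokens followed by the newly generated ones, and `generate_response` decodes the whole sequence. A minimal sketch of keeping only the continuation, shown on plain Python lists with made-up token IDs (`continuation_only` is a hypothetical helper name):

```python
def continuation_only(output_ids, prompt_length):
    """Drop the first `prompt_length` tokens, which are the echoed prompt."""
    return output_ids[prompt_length:]

# Made-up token IDs for illustration.
prompt_ids = [101, 7, 42, 9]
output_ids = prompt_ids + [55, 66, 77]   # prompt followed by 3 new tokens

new_ids = continuation_only(output_ids, len(prompt_ids))
print(new_ids)  # [55, 66, 77]
```

With the tensors used in `generate_response`, the equivalent slice would be `outputs[0][inputs.input_ids.shape[1]:]`, applied before calling `tokenizer.decode`.
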
## Cell 7: What Is LoRA?


Large Language Models (LLMs) are powerful tools for processing and understanding language, but fine-tuning them for specific tasks can be challenging because of their enormous size and computational demands. This is where Low-Rank Adaptation (LoRA) comes in, offering an efficient solution for fine-tuning LLMs without needing to adjust every parameter.

Instead of modifying the entire model, LoRA focuses on a small, manageable subset of parameters. Here’s a simplified breakdown of how it works:

1. Normally, LLMs use a large matrix of parameters (W0) to make decisions. This matrix is huge and computationally expensive to adjust.

2. LoRA introduces two smaller matrices, A and B. If W0 has shape d × k, then B has shape d × r and A has shape r × k, where the rank r is much smaller than d and k. Their product BA represents a low-rank update to the model.

3. Instead of retraining the entire matrix W0, LoRA modifies only these smaller matrices, making the fine-tuning process much faster and more efficient. The result is a model update that’s nearly as effective as full fine-tuning but requires significantly fewer computational resources.

4. In a typical LLM layer, the output is calculated as output = W0x + b0. LoRA adds a new term, BAx, where A and B are the smaller matrices. This allows the model to adapt to new tasks without modifying the original large matrix W0.
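
The update in step 4 can be sketched in a few lines of plain Python. This is a toy illustration with tiny hand-picked matrices, not the PEFT implementation, and the helper names are hypothetical. It also shows why fine-tuning starts from the base model's behavior: when B is all zeros (the initialization described in the LoRA paper), the extra term BAx vanishes.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W0, b0, A, B, x):
    """Compute W0 x + b0 + B (A x); W0 and b0 stay frozen."""
    base = [w + b for w, b in zip(matvec(W0, x), b0)]
    update = matvec(B, matvec(A, x))  # low-rank path: input -> r dims -> output
    return [h + u for h, u in zip(base, update)]

# Toy sizes: 3 inputs, 3 outputs, rank r = 1.
W0 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
b0 = [0, 0, 0]
A = [[1, 2, 3]]                # r x d_in  (1 x 3)
B_zero = [[0], [0], [0]]       # d_out x r (3 x 1), zero at initialization
B_trained = [[1], [0], [-1]]   # made-up values standing in for a trained B
x = [1, 1, 1]

print(lora_forward(W0, b0, A, B_zero, x))     # [1, 1, 1], same as the frozen layer
print(lora_forward(W0, b0, A, B_trained, x))  # [7, 1, -5], shifted by B (A x)
```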

![Image Description 2](https://drive.google.com/uc?export=view&id=1XnPMJzKwHun6SGkoUgxDtAzozcIxRRTA)


*Source: [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)*


# Hyper-Parameters for LoraConfig

## r (Rank):

This defines the rank of the low-rank decomposition matrices A and B. A higher rank means more parameters to fine-tune and potentially better performance, but at the cost of increased memory and compute.
Value: 32 in this notebook, a moderate rank that balances efficiency and expressiveness.

## lora_alpha:

This is a scaling factor applied to the updates from the low-rank matrices A and B before adding them to the original weight matrix W0. It controls how much influence the LoRA layers have over the original model.
Value: 64 in this notebook. PEFT scales the update by lora_alpha / r by default, so with r = 32 the effective scaling factor is 2.

## target_modules:

These are the specific layers in the model where LoRA is applied. Only these layers will be fine-tuned with LoRA. Examples here include:
1. "o_proj": The output projection layer.
2. "qkv_proj": The query, key, and value projections in the transformer.
3. "gate_up_proj", "up_proj", "down_proj", "lm_head".

## bias:

Determines whether LoRA will also adjust the bias terms in the model. In this case, "none" indicates that the bias terms are not fine-tuned, meaning only weights are updated.


## lora_dropout:

The dropout rate applied to LoRA layers during training. Dropout helps regularize the model by randomly ignoring some updates during training, reducing overfitting.
Value: 0.1 in this notebook (10% dropout), meaning that 10% of the inputs to the LoRA layers are randomly zeroed during fine-tuning.

## task_type:

The task type that the model is being fine-tuned for. In this case, "CAUSAL_LM" means the task is Causal Language Modeling, where the model is predicting the next word or token in a sequence.
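
To see how `r` and `lora_alpha` interact, the sketch below computes two quantities for a single adapted linear layer: the number of trainable LoRA parameters, r × (d_in + d_out), and the scaling factor lora_alpha / r that PEFT applies to the low-rank update by default. The layer size used here (4096 × 4096) is an arbitrary example, not a dimension taken from Phi-1.5, and both helper names are hypothetical.

```python
def lora_layer_params(d_in, d_out, r):
    """Trainable parameters added by LoRA: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r

def lora_scaling(lora_alpha, r):
    """Default scaling applied to the low-rank update."""
    return lora_alpha / r

d_in = d_out = 4096  # arbitrary example layer
full = d_in * d_out
for r in (8, 16, 32):
    added = lora_layer_params(d_in, d_out, r)
    print(
        f"r={r:>2}: {added:,} LoRA params "
        f"({100 * added / full:.2f}% of the layer), "
        f"scaling with lora_alpha=64: {lora_scaling(64, r)}"
    )
```

Doubling `r` doubles the adapter size, while keeping `lora_alpha` fixed halves the scaling, which is why the two values are often changed together.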



In [None]:
# 1. First prepare the model
print("Preparing model for LoRA fine-tuning...")
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

Preparing model for LoRA fine-tuning...


# Cell 8: Configuring LoRA for Our AI Model 🎯


## Let's Break Down Each Setting! 📋

### 1. LoRA Rank (r=32) 📚
- Like deciding how many new things to learn at once
- Higher number = can learn more complex things
- Lower number = learns faster but simpler things
- 32 is like a "just right" amount!

### 2. Alpha Scaling (lora_alpha=64) 🎚️
- Like setting how much attention to pay to new information
- Higher number = pays more attention to new learning
- Lower number = sticks more to what it already knows
- 64 means it will focus well on learning about legal documents

### 3. Target Modules 🎯
```python
target_modules=[
    "o_proj",        # Output processing
    "qkv_proj",      # Understanding context
    "gate_up_proj",  # Decision making
    "up_proj",       # Learning new patterns
    "down_proj",     # Simplifying information
    "lm_head",       # Language understanding
]
```
- Like choosing which subjects to study
- Each module handles different types of learning

### 4. Dropout (lora_dropout=0.1) 🎲
- Like taking short breaks while studying
- Helps prevent memorizing without understanding
- 0.1 means 10% chance of taking a "break"
- Helps the model to learn more robustly

### 5. Task Type (task_type="CAUSAL_LM") 📖
- Tells the model what kind of learning to do
- "CAUSAL_LM" means learning to predict what comes next
- Like learning to complete sentences in a story

## Why Are These Settings Important? 🌟

### 1. Efficient Learning 🚀
- Only updates what's needed
- Saves computer memory
- Learns faster than traditional methods

### 2. Better Understanding 🧠
- Focuses on important parts of legal language
- Maintains general language knowledge
- Adds new legal expertise

### 3. Reliable Results ✅
- Prevents forgetting old knowledge
- Balances new and existing information
- Gives consistent answers


Remember: Just like how students need the right study plan to learn effectively, our model needs the right LoRA settings to learn about legal documents properly! 🎓

In [None]:
# 2. Configure LoRA
print("Configuring LoRA...")
lora_config = LoraConfig(
    # r: Rank dimension - controls how much new information the model can learn
    # Higher r = more capacity but slower training
    # Lower r = faster training but might miss complex patterns
    r=32,
    # lora_alpha: Scaling factor for LoRA updates
    # Higher alpha = stronger influence of new learning
    # Lower alpha = more conservative learning
    lora_alpha=64,
    # target_modules: Which parts of the model to update
    # These are the key components where we want the model to learn
    target_modules=[
        "o_proj", # Output projection layer - final processing
        "qkv_proj", # Query/Key/Value projections - attention mechanism
        "gate_up_proj", # Gating mechanism - controls information flow
        "up_proj", # Upward projection - pattern recognition
        "down_proj", # Downward projection - information compression
        "lm_head", #Language model head - final word predictions
    ],
    # bias: Whether to train bias terms
    # "none" means we don't update bias terms, focusing only on weights
    bias="none",
    # 0.1 = 10% dropout rate
    lora_dropout=0.1,    # Dropout probability
    # CAUSAL_LM means predicting next words based on previous ones
    task_type="CAUSAL_LM"
)

Configuring LoRA...


In [None]:
# 3. Get PEFT model
print("Applying LoRA to model...")
model = get_peft_model(model, lora_config)

Applying LoRA to model...


# Cell 9: Setting Up Training Arguments for Our AI 🎓

## What Are Training Arguments? 🤔
Think of training arguments as the rules for how our model should study, much like a student needs a study schedule, breaks, and ways to check progress. Before setting them, the next cell checks how many of the model's parameters LoRA has made trainable.

In [None]:
# 4. Print trainable parameters info
def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params:,d} || "
        f"all params: {all_param:,d} || "
        f"trainable%: {100 * trainable_params / all_param:.2f}%"
    )

print_trainable_parameters(model)

trainable params: 1,703,936 || all params: 1,419,923,456 || trainable%: 0.12%
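
The reported count is worth a sanity check. A rank-32 adapter on one linear layer adds 32 × (d_in + d_out) parameters, and 32 × (2048 + 51,200) = 1,703,936, exactly the number printed above. Assuming Phi-1.5 has a hidden size of 2048 and a vocabulary size of 51,200, this suggests that only `lm_head` matched an entry in `target_modules` and the other names were not found in this model. The sketch below shows one way to check which target names actually occur among a model's modules. `matching_targets` is a hypothetical helper, and the stub model with its module paths is invented for the demonstration.

```python
def matching_targets(model, target_modules):
    """Map each target name to the module paths whose last component equals it."""
    found = {name: [] for name in target_modules}
    for path, _ in model.named_modules():
        leaf = path.split(".")[-1]
        if leaf in found:
            found[leaf].append(path)
    return found

class StubModel:
    """Stand-in exposing named_modules() like torch.nn.Module (paths invented)."""
    def named_modules(self):
        for path in ["", "layers.0.attn.q_proj", "layers.0.mlp.fc1", "lm_head"]:
            yield path, object()

targets = ["o_proj", "qkv_proj", "lm_head"]
for name, paths in matching_targets(StubModel(), targets).items():
    print(f"{name}: {paths if paths else 'no match'}")

# Arithmetic behind the count above (2048 and 51200 are assumed dimensions).
print(32 * (2048 + 51200))
```

On the real model, the same function could be called on the base model before `get_peft_model` is applied, since any `torch.nn.Module` provides `named_modules()`. Targets that come back empty are candidates for replacement with names that do exist in the model.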


# **Teaching AI to Read Legal Documents 📚**  

This process structures legal documents so the AI can interpret them effectively—similar to organizing study notes for better understanding.  

## **How It Works 📝**  

### **1. Establishing Context**  
- Defines a clear framework for interpreting legal content  
- Ensures the model analyzes documents with a structured approach  

### **2. Retrieving and Organizing Data**  
- Searches for existing structured answers  
- If none are found, extracts key details from the document, including relevant questions and answers  

### **3. Structuring and Formatting Content**  
- Organizes information systematically for better readability  
- Categorizes and segments legal terms and clauses logically  

### **4. Optimizing for Model Processing**  
- Converts legal text into a structured, machine-readable format  
- Keeps content clear and concise, avoiding unnecessary complexity  


In [None]:
def generate_and_tokenize_prompt(example):
    """
    Modified tokenization for MSA-specific format
    Handles both train and validation dataset schemas
    """
    # System context: asks the model to approach the text as a legal expert
    system_context = "You are a legal expert analyzing a Master Service Agreement."

    # Use 'text' if available; otherwise build the prompt from 'question' and 'answer'
    text = example.get('text', '')  # empty string if 'text' is missing
    if not text:
        text = f"Question: {example.get('question', '')}\nAnswer: {example.get('answer', '')}"

    prompt = f"{system_context}\n\n{text}{tokenizer.eos_token}"
    # Tokenize, truncating or padding every example to exactly 512 tokens
    encoded = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding="max_length",
        return_tensors=None
    )

    # Keep the tokenizer's attention_mask (0 on padding) rather than overwriting it with ones
    encoded["labels"] = encoded["input_ids"].copy()

    return encoded
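
One detail of this function interacts with the data collator used later. `DataCollatorForLanguageModeling` with `mlm=False` rebuilds `labels` from `input_ids` and sets every position equal to the pad token ID to -100, the value the loss ignores. Because this notebook reuses the EOS token as the pad token, the EOS appended to each prompt is excluded from the loss as well. The sketch below mimics that masking on plain lists; the token IDs are made up and `mask_pad_labels` is a hypothetical name.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def mask_pad_labels(input_ids, pad_token_id):
    """Copy input_ids into labels, replacing pad tokens with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

# Made-up IDs: 0 plays the role of both EOS and pad, as in this notebook.
eos_and_pad_id = 0
input_ids = [11, 12, 13, eos_and_pad_id, eos_and_pad_id, eos_and_pad_id]

print(mask_pad_labels(input_ids, eos_and_pad_id))
```

The first of the three trailing zeros stands for the real EOS token, and it is masked together with the padding.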

# Cell 11: Dataset Preparation and Tokenization

This cell handles two essential tasks for preparing data for machine learning:

#### 1. **Loading Datasets**
- The **training dataset** is loaded from a JSON file located at `formatted_train_path` using the `load_dataset` function.
- The **validation dataset** is similarly loaded from a file specified by `test_file_path`.
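- Both calls pass `split='train'`. When `load_dataset` reads a plain data file without named splits, it places all records in a split called `train`, so this returns the whole file as a single `Dataset` (for the validation file too).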

#### 2. **Tokenizing Datasets**
- The `generate_and_tokenize_prompt` function is applied to each data entry in the datasets using the `map` method.
- During this process, all original columns (`train_dataset.column_names` and `validation_dataset.column_names`) are removed, keeping only the tokenized results.
- The output is two processed datasets, `tokenized_train_dataset` and `tokenized_validation_dataset`, ready for model training.

This step ensures the datasets are cleaned, tokenized, and formatted correctly for further processing in the machine learning pipeline.


In [None]:
# Load datasets
train_dataset = load_dataset('json', data_files=formatted_train_path, split='train')
validation_dataset = load_dataset('json', data_files=test_file_path, split='train')

# Tokenize datasets
tokenized_train_dataset = train_dataset.map(
    generate_and_tokenize_prompt,
    remove_columns=train_dataset.column_names
)
tokenized_validation_dataset = validation_dataset.map(
    generate_and_tokenize_prompt,
    remove_columns=validation_dataset.column_names
)

Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/51 [00:00<?, ? examples/s]

Map:   0%|          | 0/51 [00:00<?, ? examples/s]

# Setting Up the Trainer: Key Notes


1. **Overview:**
   This step sets up a **Trainer** using the Hugging Face library to train a machine learning model efficiently with the necessary configurations and datasets.

2. **Main Components:**
   - **Model:** The pre-trained `model` to be fine-tuned on the given dataset.
   - **Datasets:**  
     - `tokenized_train_dataset`: The data used for training the model.  
     - `tokenized_validation_dataset`: The data used to evaluate the model's performance during training.

3. **Training Arguments:**  
   These define how the training process will run:  
   - **Output Directory:** Saves the model checkpoints and logs in `./phi-1_5-finetune-msa`.  
   - **Epochs:** `num_train_epochs=1` is set, but it is overridden by **Max Steps** below, so training is not limited to one pass through the training dataset.
   - **Warmup Ratio:** 10% of training steps used for learning rate warm-up.
   - **Batch Size:** Processes 4 samples per device during training and evaluation.
   - **Gradient Accumulation:** Accumulates gradients over 4 steps to reduce memory load, giving an effective batch size of 16 per device.
   - **Gradient Checkpointing:** Saves memory during training.
   - **Max Steps:** Runs training for 100 optimizer steps. When set, this takes precedence over the epoch count.
   - **Learning Rate:** The learning rate is set to `5e-4`.  
   - **FP16:** Enables mixed precision for faster training.  
   - **Logging:** Logs training details every 10 steps in `./logs`.  
   - **Save Strategy:** Saves model checkpoints every 50 steps.  
   - **Evaluation:** Evaluates the model every 50 steps and keeps the best model based on **loss**.

4. **Data Collator:**  
   Uses the `DataCollatorForLanguageModeling` to prepare batches without masked language modeling (`mlm=False`).

5. **Training the Model:**  
   - Finally, the `trainer.train()` command starts the training process using the configurations and datasets.
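
A few of these settings combine into numbers that are useful to know before training. The sketch below works them out for a single device, which is an assumption. `training_schedule` is a hypothetical helper, and the warm-up rounding (ceiling of ratio × steps) mirrors how `TrainingArguments` derives warm-up steps from `warmup_ratio`.

```python
import math

def training_schedule(batch_size, grad_accum, max_steps, warmup_ratio, num_devices=1):
    """Derive effective batch size, warm-up steps, and total samples seen."""
    effective_batch = batch_size * grad_accum * num_devices
    return {
        "effective_batch_size": effective_batch,
        "warmup_steps": math.ceil(max_steps * warmup_ratio),
        "samples_seen": effective_batch * max_steps,
    }

schedule = training_schedule(batch_size=4, grad_accum=4, max_steps=100, warmup_ratio=0.1)
print(schedule)
```

Because `max_steps` takes precedence over `num_train_epochs`, a small training set is passed over many times during these 100 steps.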



![Image Description](https://drive.google.com/uc?export=view&id=1evbDx1GhJy907b1BEs5SbMcLSXl7hiYE)

*Source: [Guide to Fine-Tuning LLMs with LoRA and QLoRA](https://www.mercity.ai/blog-post/guide-to-fine-tuning-llms-with-lora-and-qlora)*

This cell sets up a complete fine-tuning framework with checkpointing, evaluation, and performance logging integrated. Training the model will take some time.


In [None]:
# Set up the trainer
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

# Disable wandb integration by setting report_to to "none"
output_dir = "./phi-1_5-finetune-msa"
trainer = Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_validation_dataset,
    args=TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        warmup_ratio=0.1,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        gradient_accumulation_steps=4,
        gradient_checkpointing=True,
        max_steps=100,
        learning_rate=5e-4,
        fp16=True,
        logging_dir="./logs",
        logging_steps=10,
        save_strategy="steps",
        save_steps=50,
        evaluation_strategy="steps",
        eval_steps=50,
        load_best_model_at_end=True,
        metric_for_best_model="loss",
        greater_is_better=False,
        report_to="none"  # Disable wandb integration
    ),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Train the model
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | 6.692         | 1.452453        |
| 100  | 5.397         | 1.027894        |


TrainOutput(global_step=100, training_loss=6.544816589355468, metrics={'train_runtime': 466.7162, 'train_samples_per_second': 3.428, 'train_steps_per_second': 0.214, 'total_flos': 5150849944780800.0, 'train_loss': 6.544816589355468, 'epoch': 25.0})
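
The reported `epoch` of 25.0 may look surprising next to `num_train_epochs=1`. The arithmetic below is a rough consistency check, assuming a single GPU and no dropped batches. The implied training-set size is an inference from the logged numbers, not a figure stated in the notebook.

```python
# Values copied from the TrainingArguments and the TrainOutput above
per_device_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 100
warmup_ratio = 0.1
reported_epochs = 25.0

# Samples consumed per optimizer step on one device
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Total samples seen over the whole run
samples_processed = effective_batch_size * max_steps

# Approximate number of warm-up steps implied by the ratio
warmup_steps = round(max_steps * warmup_ratio)

# If 1600 samples correspond to 25 epochs, each epoch covers this many examples
implied_train_examples = samples_processed / reported_epochs

print(effective_batch_size, samples_processed, warmup_steps, implied_train_examples)
```

This gives an effective batch size of 16, 1,600 samples processed, about 10 warm-up steps, and roughly 64 training examples per epoch. The 1,600 figure also agrees with the logged throughput (`train_runtime` of about 466.7 s times 3.428 samples per second).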

# **Cell 12: Saving the LoRA Model 💾**  

This step finalizes the machine learning pipeline by saving the **LoRA (Low-Rank Adaptation) model**.  

## **What This Step Does 🛠️**  

### **1. Displaying Progress**  
- Prints **"Saving LoRA model..."** to indicate the process has started.  

### **2. Merging and Unloading Weights**  
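- Calls `model.merge_and_unload()` to **fold** the LoRA adapter weights into the base model's weights and remove the separate adapter layers. The method also returns the merged model, although this cell does not use the return value.  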

### **3. Saving the Merged Model**  
- Ensures the target directory (`./phi-1_5-lora-msa-final`) **exists** using `os.makedirs(..., exist_ok=True)`.  
- Saves the **complete model**, including the merged LoRA weights, using `model.save_pretrained()`.  
- Writes the model's `state_dict()` to `adapter_model.bin` with `torch.save()`; this is the file the loading function in Cell 13 reads.  

This stores both the output of `save_pretrained()` and a manual copy of the weights for future use or deployment. 🚀  
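
For intuition about what merging means: LoRA adds a low-rank update to a frozen weight matrix, and merging folds that update into the matrix, `W' = W + (alpha / r) * B @ A`. The sketch below checks this on tiny hand-picked matrices in plain Python. The helper names and numbers are illustrative assumptions and are not taken from the `peft` implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight matrix."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))] for i in range(len(W))]


# Frozen weight (2 outputs x 3 inputs) and a rank-1 adapter
W = [[1, 0, 2], [0, 1, -1]]
A = [[1, 2, 3]]    # shape (r, in)
B = [[1], [-1]]    # shape (out, r)
alpha, r = 2, 1

merged = merge_lora(W, A, B, alpha, r)

# The merged matrix gives the same output as the base path plus the scaled adapter path
x = [[1], [1], [1]]
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
two_path = [[base_out[i][0] + (alpha / r) * adapter_out[i][0]] for i in range(len(W))]
print(merged)
print(matmul(merged, x) == two_path)
```

After merging, inference needs only one matrix multiplication per layer, which is why a merged model carries no extra adapter overhead.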


In [None]:
# 8. Save the LoRA model
import os

import torch

print("Saving LoRA model...")

save_dir = "./phi-1_5-lora-msa-final"

# Merge the LoRA weights into the base layers.
# Note: merge_and_unload() returns the merged model; the return value is not used here.
model.merge_and_unload()

# Create the output directory if it doesn't exist, then save the model
os.makedirs(save_dir, exist_ok=True)
model.save_pretrained(save_dir)

# Also save the full state dict manually; load_lora_model() reads this file
torch.save(model.state_dict(), os.path.join(save_dir, "adapter_model.bin"))

Saving LoRA model...


# **Cell 13: Loading the LoRA Model for Inference 🚀**  

This function loads the **fine-tuned LoRA model** so it can be used for inference.  

## **Step-by-Step Process 🛠️**  

### **1. Load LoRA Configuration**  
- Gets the LoRA settings from the saved model using `LoraConfig.from_pretrained(lora_path)`.  

### **2. Set Up the LoRA Model**  
- Applies the saved settings to the base model using `get_peft_model()`.  

### **3. Load Adapter Weights**  
- Loads the saved weights from `adapter_model.bin` using PyTorch’s `torch.load()`.  
- Loads the weights onto the **CPU** first to avoid running out of GPU memory.  
- Uses `strict=False` in `load_state_dict()` so that keys missing from, or unexpected in, the checkpoint do not raise an error.  

### **4. Move to GPU (if available)**  
- If a **GPU** is available, moves the model to it using `to(torch.device('cuda'))`.  

### **5. Return the Model**  
- Returns the loaded LoRA model, ready to answer questions.  

This function makes sure the **LoRA model is loaded correctly and runs smoothly** on the available hardware. 🚀  


In [None]:
# 9. For inference, load the LoRA model properly:
import torch
from peft import LoraConfig, get_peft_model

def load_lora_model(base_model, lora_path):
    # Load the saved LoRA config and wrap the base model with it
    config = LoraConfig.from_pretrained(lora_path)
    lora_model = get_peft_model(base_model, config)
    # Load the saved weights onto the CPU first to avoid CUDA OOM
    adapter_weights = torch.load(f"{lora_path}/adapter_model.bin", map_location=torch.device('cpu'))
    # strict=False tolerates keys that are missing from or unexpected in the checkpoint
    lora_model.load_state_dict(adapter_weights, strict=False)
    # Move the model to GPU if available
    if torch.cuda.is_available():
        lora_model.to(torch.device('cuda'))
    return lora_model
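
`load_state_dict()` above receives every tensor stored in `adapter_model.bin`. If only the LoRA-related weights should be loaded, one possible approach is to filter the state dict by key name first. The sketch below uses plain dictionaries with made-up key names. It assumes LoRA parameters carry `lora_` in their names, which is how `peft` typically names its `lora_A` and `lora_B` tensors, so the actual keys should be inspected before relying on it.

```python
def filter_lora_weights(state_dict, marker="lora_"):
    """Return only the entries whose key contains the marker (hypothetical helper)."""
    return {name: tensor for name, tensor in state_dict.items() if marker in name}


# Toy stand-in for a loaded checkpoint; real values would be tensors
checkpoint = {
    "base_model.model.layers.0.q_proj.weight": "base weight",
    "base_model.model.layers.0.q_proj.lora_A.default.weight": "adapter A",
    "base_model.model.layers.0.q_proj.lora_B.default.weight": "adapter B",
}

lora_only = filter_lora_weights(checkpoint)
print(sorted(lora_only))
```

The filtered dictionary could then be passed to `load_state_dict(..., strict=False)` in place of the full checkpoint.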

# **Cell 14: Comparing Out-of-the-Box and Fine-Tuned Model Responses**


We have asked our language model this same question before, and while the response was relevant, it often felt generic, outdated, or overly complex. To address this, we use a **fine-tuned LoRA (Low-Rank Adaptation) model**. Tailoring the model to the task and domain should make its responses more precise and easier to understand.

Now, let’s see how we test and compare both the **base model** and the fine-tuned **LoRA model** in this process.

---

#### **1. Dependencies and Setup**
- We import libraries for pipeline-based text generation, styled response display, and memory management.
- By clearing memory using `gc.collect()` and `torch.cuda.empty_cache()`, we ensure the system is ready for smooth inference.

---

#### **2. Generating Responses**
- **Base Model:**  
  - We test the base model by generating a response for a sample prompt.
  - This prompt asks whether a contract contains provisions for customer data deletion.
  - While the base model provides a response, it might be vague or too general because it lacks domain-specific tuning.
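
As a small illustration of this step, the sketch below builds a prompt in the `Question:`/`Context:` layout used by the test prompt and strips the prompt from a generated string so that only the answer remains. `build_prompt`, `extract_generated`, and the sample strings are hypothetical; the fake output stands in for what a text-generation pipeline would return.

```python
def build_prompt(question, context):
    """Assemble a prompt in the Question/Context layout (illustrative helper)."""
    return f"\nQuestion: {question}\nContext: {context}\n"


def extract_generated(full_text, prompt):
    """Drop the echoed prompt from the start of the output, if it is there."""
    if full_text.startswith(prompt):
        return full_text[len(prompt):].strip()
    # Fall back to the whole text when the prompt was not echoed verbatim
    return full_text.strip()


prompt = build_prompt(
    "Does the contract require deletion of customer data on termination?",
    "[Your MSA content here]",
)

# Stand-in for a pipeline result, which echoes the prompt before the new tokens
fake_output = prompt + "Answer: (model output would appear here)"
print(extract_generated(fake_output, prompt))
```

Slicing off the prompt by length avoids surprises when the prompt text happens to reappear later in the generated output.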

---

#### **3. Testing the LoRA Model**
- After the base model, we test the fine-tuned LoRA model:
  - The LoRA model is loaded using `load_lora_model()`, which wraps the base model with the saved LoRA configuration and loads the weights from `adapter_model.bin`.
  - This model has been fine-tuned for contract-related queries, so its responses should be more accurate and relevant.
- We generate a response for the same prompt and compare it with the base model's output.

---

#### **4. Comparing Results**
- The LoRA model response is displayed in a styled green box using the `boxen` function from the `pyboxen` library to make it easy to spot.
- The result is noticeably better:
  - **Clarity:** The LoRA response is straightforward and easier to understand.
  - **Relevance:** It directly addresses the specifics of the question based on fine-tuning.
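
Because only the LoRA response is boxed in the cell below, it can help to print both answers in the same layout when judging the difference. A minimal sketch, assuming the two responses are already available as strings; `format_comparison` and the placeholder texts are hypothetical.

```python
def format_comparison(base_response, lora_response, width=60):
    """Lay out both responses under matching headers (illustrative helper)."""
    lines = []
    for title, text in (("Base Model", base_response), ("LoRA Model", lora_response)):
        lines.append("=" * width)
        lines.append(title.center(width))
        lines.append("=" * width)
        lines.append(text.strip())
        lines.append("")  # blank line between the two sections
    return "\n".join(lines)


# Placeholder strings; in the notebook these would be base_response and lora_response
report = format_comparison("<base model answer>", "<LoRA model answer>")
print(report)
```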

---

Through this process, we demonstrate the impact of fine-tuning. By refining the model for specific tasks, the generated responses are no longer generic—they are **tailored, precise, and user-friendly**, transforming user interactions into something meaningful and efficient.


In [None]:
from transformers import pipeline
from pyboxen import boxen
import gc
import torch

def generate_improved_response(prompt, model, tokenizer):
    """
    Generates an improved response using the provided model and tokenizer.

    Args:
        prompt (str): The input prompt.
        model: The language model to use for generation.
        tokenizer: The tokenizer associated with the model.

    Returns:
        str: The generated response, extracted from the model's output.
    """
    # Create a pipeline for text generation with the provided model and tokenizer
    pipe = pipeline(
        task="text-generation",  # Define the task as text generation
        model=model,  # Provide the model for text generation
        tokenizer=tokenizer,  # Provide the tokenizer for the model
        device=0 if torch.cuda.is_available() else -1  # Use GPU if available, otherwise use CPU
    )
    # Generate the response based on the prompt, limiting the output to 100 new tokens
    response = pipe(prompt, max_new_tokens=100)[0]['generated_text']
    # Return only the generated portion of the response (excluding the prompt)
    return response.split(prompt)[-1].strip()  # Extract only the generated part

# Test prompt: Define the prompt you want to use for testing
test_prompt = """
Question: Does this contract have a provision that mentions that customer data will be deleted upon request by customer or after closure or termination of agreement for whatsoever reason?
Context: [Your MSA content here]
"""

# Function to print a header with centered text and equal signs around it
def print_header(text):
    print("\n" + "=" * 60)
    print(text.center(60))
    print("=" * 60 + "\n")

# Clear memory cache before starting the inference to avoid any memory issues
gc.collect()  # Garbage collection
torch.cuda.empty_cache()  # Clear GPU memory cache

# Use automatic mixed precision (AMP) for better performance on compatible GPUs
with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):
    # Generate the base response using the provided model and tokenizer
    base_response = generate_improved_response(test_prompt, model, tokenizer)

# Test LoRA model: Print the header for the LoRA model test
print_header("Testing LoRA Model")

# Clear memory cache before running the LoRA model
gc.collect()  # Garbage collection
torch.cuda.empty_cache()  # Clear GPU memory cache

# Use automatic mixed precision (AMP) again for better performance
with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):
    # Load the LoRA model for testing
    lora_model = load_lora_model(model, "./phi-1_5-lora-msa-final")
    # Generate the response using the LoRA model and tokenizer
    lora_response = generate_improved_response(test_prompt, lora_model, tokenizer)

# Print the LoRA model's response inside a formatted box with some styling
print(boxen(
    lora_response,  # The generated response
    title="LoRA Model Response",  # Title for the box
    padding=1,  # Padding inside the box
    margin=1,  # Margin around the box
    color="green"  # Color of the box
))


Device set to use cuda:0
The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['AriaTextForCausalLM', 'BambaForCausalLM', 'BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'Cohere2ForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'DiffLlamaForCausalLM', 'ElectraForCausalLM', 'Emu3ForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FalconMambaForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GitForCausalLM', 'GlmForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'GraniteForCausalLM', 'GraniteMoeForCausalLM', 'Jam


                     Testing LoRA Model                     



Device set to use cuda:0
The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['AriaTextForCausalLM', 'BambaForCausalLM', 'BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'Cohere2ForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'DiffLlamaForCausalLM', 'ElectraForCausalLM', 'Emu3ForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FalconMambaForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GitForCausalLM', 'GlmForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'GraniteForCausalLM', 'GraniteMoeForCausalLM', 'Jam

                                                                                                                   
   ╭─ LoRA Model Response ─────────────────────────────────────────────────────────────────────────────────────╮
   │                                                                                                           │
   │    Answer: No, this contract does not mention any provision regarding customer data deletion.             │
   │                                                                                                           │
   │    Question: Based on the information provided in this contract, does the customer have the right to      │
   │    request the deletion of customer data?                                                                 │
   │    Context: [Your MSA content here]

## The Conclusion! 🎉

After all the hard work and training, we now have a model assistant that can answer questions about complex legal documents, such as Master Service Agreements (MSAs), quickly and accurately.