

# 🧠 Fine-Tuning Explained (Simple and Complete)

Fine-tuning means **taking a pre-trained AI model** — one that already knows a lot from studying huge amounts of data — and **teaching it something more specific** using your own smaller dataset.

Imagine a student who already knows English. If you now give them only *medical books*, they’ll start talking like a doctor. That’s what fine-tuning does — it **focuses a general model on a specific topic or skill**.

When a model is first trained, it learns general knowledge: how sentences are formed, what common words mean, or what basic objects look like. This process is called *pre-training*, and it is very expensive because it needs massive datasets and a large amount of GPU compute.
Fine-tuning is the next step, where we reuse that knowledge and make *small adjustments* to the model’s internal settings (called weights). Done carefully, this makes the model better at the new task while keeping most of what it already knows.

For example:

* If you fine-tune a GPT-style language model (the kind behind ChatGPT) on *legal documents*, it becomes better at legal writing.
* If you fine-tune a vision model on *medical X-rays*, it becomes good at detecting diseases.
* If you fine-tune a voice model on *your accent*, it starts understanding your speech more accurately.

Under the hood, fine-tuning works by feeding your dataset into the model and training it again, but gently, usually with a small learning rate. The model compares its predictions with the correct answers, measures the error (called *loss*), and slightly adjusts its weights to reduce that error. This process repeats many times until the model has picked up the patterns in your data.
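
The loop described above can be sketched with a toy one-feature model. This is only an illustration: the linear model, the starting weights, the dataset, and the `fine_tune` helper are all hypothetical, and real fine-tuning would use a deep learning framework instead of hand-written gradients.

```python
# Toy "model": y = w * x + b. The starting weights stand in for pre-trained values.
def fine_tune(w, b, data, learning_rate=0.01, epochs=200):
    """Run plain gradient descent on mean squared error; return (w, b, loss_history)."""
    history = []
    n = len(data)
    for _ in range(epochs):
        # Forward pass: compare predictions with the correct answers.
        errors = [(w * x + b) - y for x, y in data]
        loss = sum(e * e for e in errors) / n
        history.append(loss)
        # Gradients of the loss with respect to each weight.
        grad_w = sum(2 * e * x for e, (x, _) in zip(errors, data)) / n
        grad_b = sum(2 * e for e in errors) / n
        # Take a small step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, history


pretrained_w, pretrained_b = 2.0, 0.0                # hypothetical pre-trained weights
domain_data = [(1.0, 2.5), (2.0, 4.5), (3.0, 6.5)]   # hypothetical data following y = 2x + 0.5

w, b, history = fine_tune(pretrained_w, pretrained_b, domain_data)
print(f"loss before: {history[0]:.4f}, after: {history[-1]:.4f}")
print(f"weights moved from ({pretrained_w}, {pretrained_b}) to ({w:.3f}, {b:.3f})")
```

The small learning rate is what makes the update "gentle": the weights move a little on every pass instead of being overwritten.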

Because large models have billions of parameters, we often don’t train the whole thing again. Instead, we use **efficient fine-tuning methods** like:

* **Partial fine-tuning** – only some layers are updated.
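* **LoRA (Low-Rank Adaptation)** – the original weights stay frozen and a pair of small low-rank matrices is trained alongside them, making fine-tuning faster and cheaper.
* **Prompt or prefix tuning** – instead of changing the model, we train a small set of special "virtual token" embeddings that are prepended to the input and guide the model toward your task.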
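
To make the LoRA idea more concrete, here is a minimal pure-Python sketch of a single adapted layer. It assumes the usual formulation `y = W x + scale * B (A x)`, with `W` frozen, `A` starting random, and `B` starting at zero. The helper names (`matvec`, `lora_forward`, `lora_param_counts`) and the layer sizes are hypothetical, chosen only for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x). W is frozen; only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning vs. the two low-rank matrices."""
    return d_out * d_in, rank * (d_in + d_out)


d_in, d_out, rank = 6, 4, 2                      # hypothetical toy sizes
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]     # frozen weights
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]   # trainable, rank x d_in
B = [[0.0] * rank for _ in range(d_out)]                               # trainable, d_out x rank

x = [1.0] * d_in
# With B at zero, the adapted layer starts out identical to the frozen one.
print(lora_forward(W, A, B, x) == matvec(W, x))

full, lora = lora_param_counts(4096, 4096, 8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

For a 4096 x 4096 layer at rank 8, the adapter trains 65,536 parameters instead of 16,777,216, which is where the savings come from.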

Fine-tuning can be used for all types of AI:

* In **language models**, it helps write, chat, or summarize in specific domains.
* In **vision models**, it helps recognize custom objects.
* In **audio models**, it helps adapt to specific voices or sounds.

In short, fine-tuning:

1. Reuses an existing pre-trained model.
2. Trains it on your smaller dataset.
3. Slightly updates its knowledge to fit your domain.
4. Produces a specialized version of the model that performs better on your task.

It’s powerful because it saves time, data, and money — you don’t start from zero; you build on top of something already intelligent.

You can think of it like this:

> Pre-training builds a brain.
> Fine-tuning gives it a profession.



In [1]:
# Path to the exported WhatsApp chat (plain text)
file_path = "../data/private/fine_tuning.txt"
with open(file_path, "r", encoding="utf-8") as f:
    lines = f.readlines()

len(lines)


In [None]:
import re

# System notices and placeholders that appear in an exported WhatsApp chat
encryption_message = "Messages and calls are end-to-end encrypted. No one outside of this chat, not even WhatsApp, can read or listen to them. Tap to learn more."
media_pattern = "<Media omitted>"
# The TLD class is [A-Za-z]; a "|" inside a character class would match a literal pipe
email_pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
edited_message = "<This message was edited>"
deleted_message = "You deleted this message"
null_message = "null"
created_group_message = "created group"
added_you_to_group_message = "added you"
# @mentions of other participants
tagging_pattern = r'@[\w]+'


# Drop system lines and any line that contains an email address or a URL
filtered_lines = []
for line in lines:
    if (
            encryption_message not in line and
            deleted_message not in line and
            # strip() first: readlines() keeps the trailing newline on every line
            null_message != line.strip().split(" ")[-1] and
            media_pattern not in line and
            created_group_message not in line and
            added_you_to_group_message not in line and
            not re.search(email_pattern, line) and
            not re.search(url_pattern, line)
    ):
        # Remove the "edited" marker and @mentions from the lines we keep
        line = line.replace(edited_message, "").strip()
        line = re.sub(tagging_pattern, "", line).strip()
        filtered_lines.append(line)

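# One match per message: (timestamp, sender, text). The text runs until the next
# timestamp or the end of the chat, so it may span several lines.
pattern = r'(\d{2}/\d{2}/\d{4}, \d{2}:\d{2}) - (.*?): (.*?)(?=\n\d{2}/\d{2}/\d{4}, \d{2}:\d{2} -|$)'
content = '\n'.join(filtered_lines)
messages = re.findall(pattern, content, re.DOTALL)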

lines_removed = len(lines) - len(filtered_lines)
print(f"Lines removed: {lines_removed}")
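
To see what the message pattern extracts, it can help to run it on a few made-up lines. The sample chat below is hypothetical (invented names, dates, and text). It assumes the export uses the `dd/dd/dddd, hh:mm - Sender: text` layout that the pattern expects.

```python
import re

pattern = r'(\d{2}/\d{2}/\d{4}, \d{2}:\d{2}) - (.*?): (.*?)(?=\n\d{2}/\d{2}/\d{4}, \d{2}:\d{2} -|$)'

# Hypothetical export; the second message continues on a new line.
sample = "\n".join([
    "01/02/2024, 09:15 - Alice: Hey!",
    "01/02/2024, 09:16 - Bob: Hi, running late",
    "be there in ten minutes",
    "01/02/2024, 09:17 - Alice: No problem",
])

parsed = re.findall(pattern, sample, re.DOTALL)
for timestamp, sender, text in parsed:
    print(repr(timestamp), repr(sender), repr(text))
```

Because of `re.DOTALL`, the lazy `(.*?)` group is allowed to cross line breaks, so Bob's two-line message comes back as a single text containing `\n`.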

# Create the dataset

### 1. Group messages by sender

If a conversation is structured as follows:  

```
User 1: Hey!  
User 1: How are you?  
User 2: I am fine  
User 2: And you?  
User 1: Good.  
```

We want to transform it into:  

```
User 1: Hey!\nHow are you?
User 2: I am fine\nAnd you?
User 1: Good.
```

In [None]:
# Merge consecutive messages from the same sender into one entry
grouped_messages = []
for _, sender, message in messages:
    if grouped_messages and grouped_messages[-1]["sender"] == sender:
        # Same sender as the previous entry: append on a new line
        grouped_messages[-1]["message"] += "\n" + message
    else:
        grouped_messages.append({
            "sender": sender,
            "message": message
        })

len(grouped_messages)
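
As a quick check, the same grouping logic can be run on the example conversation from the start of this section. The `group_by_sender` wrapper and the empty timestamp placeholders are hypothetical, added only so the snippet runs on its own.

```python
def group_by_sender(messages):
    """Merge consecutive (timestamp, sender, text) tuples from the same sender."""
    grouped = []
    for _, sender, text in messages:
        if grouped and grouped[-1]["sender"] == sender:
            grouped[-1]["message"] += "\n" + text
        else:
            grouped.append({"sender": sender, "message": text})
    return grouped


# Timestamps are left empty because the grouping step ignores them.
example = [
    ("", "User 1", "Hey!"),
    ("", "User 1", "How are you?"),
    ("", "User 2", "I am fine"),
    ("", "User 2", "And you?"),
    ("", "User 1", "Good."),
]

for entry in group_by_sender(example):
    print(f"{entry['sender']}: {entry['message']!r}")
```

Only consecutive messages are merged, so the final "Good." from User 1 stays a separate entry and the turn order of the conversation is preserved.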

### 2. Include special tokens

Each message follows this format:
```
<|startoftext|>Sender<|seprator|>Message<|endoftext|>
```

In [None]:
# Define special tokens
start_of_text_tokens = "<|startoftext|>"
end_of_text_token = "<|endoftext|>"
seprator_token = "<|seprator|>"

fine_tuning_data = []

# Wrap every grouped message as: start token, sender, separator, text, end token
for message in grouped_messages:
    sender = message["sender"]
    message_text = message["message"]
    input_sequence = f"{start_of_text_tokens}{sender}{seprator_token}{message_text}{end_of_text_token}"
    fine_tuning_data.append(input_sequence)

len(fine_tuning_data)
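
The sketch below shows what one formatted sequence looks like and how it can be split back into sender and text, which is a handy sanity check before training. The token strings are copied from the cell above. The `format_message` and `parse_sequence` helpers are hypothetical and not part of the original pipeline.

```python
START = "<|startoftext|>"
SEP = "<|seprator|>"      # spelled exactly as in the cell above
END = "<|endoftext|>"


def format_message(sender, text):
    """Wrap one grouped message in the special tokens."""
    return f"{START}{sender}{SEP}{text}{END}"


def parse_sequence(sequence):
    """Inverse of format_message: return (sender, text)."""
    if not (sequence.startswith(START) and sequence.endswith(END)):
        raise ValueError("sequence is missing its start or end token")
    body = sequence[len(START):-len(END)]
    # Split on the first separator only, in case the text contains it too.
    sender, text = body.split(SEP, 1)
    return sender, text


sequence = format_message("User 1", "Hey!\nHow are you?")
print(repr(sequence))
print(parse_sequence(sequence))
```

A round trip like this only works if sender names never contain the separator string, which is an assumption about the chat data.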

### 3. Save the Data

In [None]:
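import json
import os

save_path = "../output/fine_tuning/data/fine_tuning.json"
# Create the output directory first; open() fails if it does not exist
os.makedirs(os.path.dirname(save_path), exist_ok=True)
with open(save_path, "w", encoding="utf-8") as f:
    json.dump(fine_tuning_data, f, ensure_ascii=False, indent=4)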
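
To confirm that the file can be read back unchanged, a round trip like the following may be useful. It writes to a temporary directory instead of the real output path, and the two sample records are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical records in the same format as fine_tuning_data
records = [
    "<|startoftext|>User 1<|seprator|>Hey!\nHow are you?<|endoftext|>",
    "<|startoftext|>User 2<|seprator|>Olá 👋<|endoftext|>",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fine_tuning.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=4)

    # Read the raw text and the parsed JSON back before the directory is removed
    with open(path, "r", encoding="utf-8") as f:
        raw_text = f.read()
    loaded = json.loads(raw_text)

print(loaded == records)
print("👋" in raw_text)
```

With `ensure_ascii=False`, emoji and accented characters are stored as-is instead of as `\uXXXX` escapes, which keeps the file readable when you inspect it by hand.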