In [None]:
# Installs Unsloth, Xformers (Flash Attention) and all other packages
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

### Background

I saw a reel on Instagram in which an AI enthusiast created an AI clone of himself to talk to his girlfriend (certainly, I won't do that... xd) using [RAG](https://youtu.be/YVWxbHJakgg?feature=shared) (Retrieval-Augmented Generation) and the GPT-3.5 Turbo API. It kind of worked, but it had major privacy issues: sending personal chats to ChatGPT could result in those chats being used by OpenAI to train its models. This led me to think: what if I fine-tuned a pre-existing model like Llama or Mixtral on my personal WhatsApp chat history? It would be cool to have a model that talks the way I do on WhatsApp, primarily in Hinglish (Hindi + English).

[Fine-tuning a large language model (LLM)](https://youtu.be/YVWxbHJakgg?feature=shared) requires a good understanding of LLMs in general. The major challenge with fine-tuning big models, such as a 7B parameter model, is that it requires a minimum of 32GB of GPU RAM, which is costly and not available in free-tier GPU compute services like Colab or Kaggle. So, I had to find a way around this limitation.

Another important aspect is that the fine-tuning results heavily depend on the quality and size of the dataset used. Converting raw WhatsApp chat data into a usable dataset is challenging but worth pursuing.

Let's see how it looks in reality and how it's being carried out.

### Important

We are using a free instance of Google Colab to fine-tune our model (`Llama3`), making it **totally free**.

For chatting with our fine-tuned model, we will use [Ollama](https://ollama.com/) locally. It is very lightweight, needs only **8GB** of free RAM on your laptop/PC, and works without any **GPU**.

**Keep in mind that your chat data is completely safe; it is not being sent to anyone.**


### Data Prep
To extract chat history from your WhatsApp chats, follow these steps:

1. Open your WhatsApp application.
2. Go to the chat from which you want to extract the chat history.
3. Click on the three dots in the top right corner of the screen.
4. Click on `More` then click on `Export Chat`.
5. Select `Without media`.
6. Save it locally or send it to saved messages on Telegram so you can later download it on your Telegram desktop.
7. Repeat these steps for all of your chats. The more chat data you have, the better the results will be.

Each export generates a `.zip` file. You don't have to extract it.




#### Upload Exported Chat files to Colab runtime
Now, locate your exported chat zip files and upload them to the Colab runtime. Follow these steps to upload files to Google Colab:

1. Click on the Files icon on the left side of the screen (as shown in the image attached below).
2. Click on the upload button. It will open the File Explorer. Choose the exported chat zip files (you can select multiple files at once).
    * Wait until your files are uploaded. The upload process bar will display at the bottom left corner of the screen.
    * Once files are uploaded successfully, they will appear in the Files tab of Google Colab.
<img src="https://github.com/Eviltr0N/Make-AI-Clone-of-Yourself/raw/main/img/file_download.png">

##### Keep in Mind:

* Export chat history only for meaningful conversations. Before exporting a chat, consider whether it adds value to the data or if it is just a short conversation.
* If you think you don’t want the AI to learn from a specific chat, then don’t export it.
* Currently, it only supports individual chats, so please do not export group chats.
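
If you are not sure whether an export is a group chat, a quick check is to count the distinct sender names in it. The snippet below is a minimal sketch, not part of the original pipeline: `count_senders` and the sample text are made up for illustration, and it assumes the `date, time - Sender: message` line layout, which can vary with your phone's locale.

```python
import re

# Assumed line layout: "12/01/24, 9:15 pm - Sender: message".
# The timestamp format depends on the phone's locale, so adjust if needed.
LINE_START = re.compile(
    r"^\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}\s*(?:AM|PM|am|pm)? - ([^:]+): ",
    re.MULTILINE,
)

def count_senders(chat_text: str) -> int:
    """Return the number of distinct sender names found in an exported chat."""
    return len(set(LINE_START.findall(chat_text)))

# Hypothetical sample export with two participants.
sample = (
    "12/01/24, 9:15 pm - Asha: Kal milte hai?\n"
    "12/01/24, 9:16 pm - Ravi: Haan pakka\n"
    "12/01/24, 9:17 pm - Asha: Done\n"
)

n = count_senders(sample)
kind = "individual chat" if n <= 2 else "group chat"
print(f"{n} senders, looks like an {kind}" if n <= 2 else f"{n} senders, looks like a {kind}")
```

More than two senders suggests a group chat, which should be left out.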


### Data Filtering

The exported data contains many irregularities such as `<Media omitted>`, `This message was deleted` and timestamps of messages. We need to remove these and convert the whole chat history into a `Prompt: Response` format so it can be used to fine-tune the model. To extract messages from the data, I used `regex`. Additionally, I filtered out any links and emails from the chat data for obvious privacy reasons.
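
To make the extraction step concrete, here is a small sketch that runs the same regex used in `msg_filter_basic` on a made-up three-line export and drops the `<Media omitted>` entry. The sample text and the names `pairs` and `kept` are illustrative only; real exports may use a different timestamp format.

```python
import re

# Same pattern as in msg_filter_basic: capture (sender, message) up to the next timestamp.
pt = r' - ([^:]+): (.*?)(?=\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}\s*(?:AM|PM|am|pm)? - |$)'

# Hypothetical export snippet.
sample = (
    "12/01/24, 9:15 pm - Asha: Kal milte hai?\n"
    "12/01/24, 9:16 pm - Ravi: <Media omitted>\n"
    "12/01/24, 9:17 pm - Ravi: Haan pakka\n"
)

# re.DOTALL lets a message span several lines until the next timestamp.
pairs = [(sender, text.strip()) for sender, text in re.findall(pt, sample, re.DOTALL)]

# Drop WhatsApp placeholders, as the basic filter does.
kept = [(sender, text) for sender, text in pairs if "<Media omitted>" not in text]

for sender, text in kept:
    print(f"{sender}: {text}")
```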

**Now, please edit the below list of `filler_words`**. These words may vary from person to person. Some examples are `Ok`, `Yup`, `Hmm`, `Han`. We need to remove these from the dataset because if we don't, the fine-tuned model will primarily learn these words and, in most cases, respond with them. For example, if you ask, "Where are you going?" the model might respond with something like "Ok" or "Hmm."
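
If you are unsure which filler words you use most, one rough way to find candidates is to count your very short messages. This is only a sketch with made-up data; `filler_candidates` is a hypothetical helper and not part of the notebook's pipeline.

```python
from collections import Counter

def filler_candidates(messages, max_words=1, top_n=5):
    """Return the most common very short messages, which are likely filler replies."""
    short = [m.strip().lower() for m in messages if 0 < len(m.split()) <= max_words]
    return Counter(short).most_common(top_n)

# Hypothetical list of your own messages.
my_messages = ["Ok", "Hmm", "ok", "Kal milte hai", "Hmm", "ok", "Acha", "Pagal ho gya hai kya"]

for word, count in filler_candidates(my_messages):
    print(word, count)
```

Note that the filter in this notebook compares messages to `filler_words` exactly and case-sensitively, so add every spelling you actually type (for example both `Ok` and `ok`).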

To run the whole notebook, click "*Runtime*" and then "*Run all*" on a **free** Tesla T4 Google Colab instance!

In [2]:
filler_words = ["Ok", "Okay", "Yup", "Hmm"]
# Add or remove words from this list based on your personal usage.

chat_dir = "./"

In [3]:
import re
import os
import shutil
import csv

In [4]:
class Wh_Chat_Processor:
    def __init__(self):
        pass
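    def open_chat_file(self, chat_dir, filename):
        # The contact's name is taken from the export's file name
        self.sender_name = filename.replace("WhatsApp Chat with ", "").replace(".txt", "")
        # Read as UTF-8 explicitly so emoji and Hindi text load the same way everywhere
        with open(os.path.join(chat_dir, filename), encoding="utf-8") as f:
            chat_text = f.read()
        return chat_text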

    def msg_filter_basic(self, chat_text):
        filtered = []
        pt = r' - ([^:]+): (.*?)(?=\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}\s*(?:AM|PM|am|pm)? - |$)'
        msgs = re.findall(pt, chat_text, re.DOTALL)
        for msg in msgs:
            line = msg[1]
            wh_default_filter = "Tap to learn more." in line or "<Media omitted>" in line
            website_filter = "https://" in line or "http://" in line
            # Match any e-mail address, not only Gmail ones
            mail_filter = re.search(r"\S+@\S+\.\S+", line) is not None
            deleted_msg_filter = "This message was deleted" in line or "You deleted this message" in line or "<This message was edited>" in line or "(file attached)" in line

            if not (wh_default_filter or website_filter or mail_filter or deleted_msg_filter):
                filtered.append(msg)
        return filtered

    def process_chat(self, chat_data):
        merged_lines = []
        current_sender = None
        current_message = {}
        for line in chat_data:
            if not line:
                continue
            parts = line
            if len(parts) == 2:
                sender, message = parts
                if current_sender is None:
                    current_sender = sender
                    current_message[current_sender] = [message.strip()]
                elif sender == current_sender:
                    current_message[current_sender].append(message.strip())
                else:
                    merged_lines.append(current_message)
                    current_sender = sender
                    current_message = {current_sender: [message.strip()]}
            else:
                if current_sender:
                    current_message[current_sender][-1] += " " + line.strip()
        if current_sender:
            merged_lines.append(current_message)
        keys = set()
        for line in merged_lines:
            # print(line)
            for key in line.keys():
                if key != self.sender_name:
                    keys.add(key)
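        if not keys:
            # Happens when no sender differs from the name in the file name
            raise ValueError("Could not detect your own name in this chat export.")
        # sorted() makes the choice deterministic if more than one name is found
        self.my_name = sorted(keys)[0]
        print(sorted(keys))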
        return merged_lines

    def advance_filter(self, merged_chat_data):
        filtered_data=[]
        sender = ""
        me = ""
        chk = 1
        CD = merged_chat_data
        for ind, x in enumerate(CD):
            if x.get(self.sender_name) != None :
                if len(x[self.sender_name]) == 1 and ( x[self.sender_name][0] in filler_words or len(x[self.sender_name][0]) ==1 ):
                    continue
                if len(CD[ind][self.sender_name]) > 1:
                    for y in range(0,len(CD[ind][self.sender_name])):
                        if y+1 != len(CD[ind][self.sender_name]):
                            sender += CD[ind][self.sender_name][y] + "\n"
                        else:
                            sender += CD[ind][self.sender_name][y]
                else:
                    sender += CD[ind][self.sender_name][0]
            elif x.get(self.my_name) != None and len(sender) > 1:
                if len(CD[ind][self.my_name]) > 1:
                    for y in range(0,len(CD[ind][self.my_name])):
                        if y+1 != len(CD[ind][self.my_name]):
                            me += CD[ind][self.my_name][y] + "\n"
                        else:
                            me += CD[ind][self.my_name][y]
                else:
                    me += CD[ind][self.my_name][0]
            else:
                continue
            if chk ==1:
                chk+=1
            elif chk ==2:
                filtered_data.append([sender, me])
                sender = ""
                me=""
                chk=1
            else:
                pass
        return filtered_data


In [32]:
with open("all_chat_data.csv", "w") as f:
    f.write("Prompt,Response"+ "\n")

for file in os.listdir(os.path.join(chat_dir)):
    if file.endswith('.zip'):
        full_path = os.path.join(chat_dir, file)
        shutil.unpack_archive(full_path, chat_dir)

In [None]:
for file in os.listdir(chat_dir):
    if not file.endswith('.txt'):
        continue
    print("Processing: ", file)
    # Use a fresh processor per chat so names from one chat do not leak into the next
    processor = Wh_Chat_Processor()
    chat_d = processor.open_chat_file(chat_dir, file)
    basic_f = processor.msg_filter_basic(chat_d)
    chat_ps = processor.process_chat(basic_f)
    filtered_data = processor.advance_filter(chat_ps)
    # newline="" lets the csv module control line endings inside quoted messages
    with open("all_chat_data.csv", "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(filtered_data)
print("Successfully Processed all the chats... Generated CSV File of chats is saved in Current directory with the name 'all_chat_data.csv'")
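
Before moving on, it can help to sanity-check the generated CSV. The sketch below is an optional extra, not part of the original notebook: `summarize_pairs` is a hypothetical helper, and the demo writes a tiny throwaway file so it runs on its own. In the notebook you would pass `"all_chat_data.csv"` instead.

```python
import csv
import os
import tempfile

def summarize_pairs(csv_path):
    """Return (row count, average prompt length, average response length) in characters."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    n = len(rows)
    avg_prompt = sum(len(r["Prompt"]) for r in rows) / n if n else 0.0
    avg_response = sum(len(r["Response"]) for r in rows) / n if n else 0.0
    return n, avg_prompt, avg_response

# Demo on a small temporary file with the same Prompt,Response header.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Prompt", "Response"])
    writer.writerow(["Kal milte hai?", "Haan pakka"])
    writer.writerow(["Kahan ho", "Ghar pe"])

count, avg_p, avg_r = summarize_pairs(demo_path)
print(f"{count} pairs, avg prompt {avg_p:.1f} chars, avg response {avg_r:.1f} chars")
```

A very small row count or very short average responses is a hint that more chats or a longer `filler_words` list may be needed.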

### Model Fine-Tuning
As we discussed earlier, fine-tuning a model of this size (7B to 8B parameters) in the usual way is not possible with just the 16GB of GPU memory on a free Colab T4. To make it fit, we will use a technique known as [Quantization](https://huggingface.co/docs/optimum/en/concept_guides/quantization). Specifically, we will use 4-bit quantization.
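
A back-of-the-envelope calculation shows why quantization helps. The sketch below only counts the memory needed to hold the weights, assuming roughly 8 billion parameters; gradients, optimizer state, and activations add more on top, and real 4-bit formats store a little extra for scaling factors.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 8e9  # assumption: roughly 8 billion parameters

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: about {weight_memory_gb(N_PARAMS, bits):.0f} GB for the weights alone")
```

At 16 bits the weights alone already fill a 16GB GPU, while at 4 bits they leave room for the adapters and training state.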

I am using [Unsloth](https://github.com/unslothai/unsloth) for the rest of the processes, such as quantization and training the model. Unsloth has very good documentation and requires less VRAM to fine-tune the model.

Check out - [Unsloth's Github](https://github.com/unslothai/unsloth)

This notebook uses the `Llama-3` format for conversation-style finetunes. Our WhatsApp data is converted into ShareGPT-style conversations, the same layout used by the [Open Assistant conversations](https://huggingface.co/datasets/philschmid/guanaco-sharegpt-style) dataset.

* Unsloth supports Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes, etc.
* Unsloth supports 16-bit LoRA or 4-bit QLoRA. Both are 2x faster.


For fine-tuning, I am using **`Llama3` 8B Instruct** as the base model. You can use other models such as `Mixtral` and `Gemma`. I have tried Mixtral as well, but it doesn't perform as well as `Llama3`.

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-v0.3-bnb-4bit",      # New Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",           # Llama-3 15 trillion tokens model 2x faster!
    "unsloth/llama-3-8b-Instruct-bnb-4bit",
    "unsloth/llama-3-70b-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",        # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",             # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!
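
To get a feel for how few parameters LoRA actually trains, the count can be worked out by hand: for a weight matrix of shape `d_in x d_out`, LoRA adds two small matrices with `r * (d_in + d_out)` parameters in total. The sketch below assumes the published Llama-3 8B shapes (hidden size 4096, MLP size 14336, key/value projection size 1024, 32 layers) and rank 16; these numbers are assumptions written in by hand, not read from the loaded model.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds an (d_in x r) and an (r x d_out) matrix to each adapted layer."""
    return r * (d_in + d_out)

# Assumed Llama-3 8B shapes; check model.config if you use a different base model.
HIDDEN, MLP, KV, LAYERS, R = 4096, 14336, 1024, 32, 16

shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in shapes.values())
total = per_layer * LAYERS
print(f"{total:,} trainable LoRA parameters ({total / 8e9:.2%} of roughly 8B)")
```

With these assumed shapes the adapters come to about half a percent of the model, so a higher rank would be needed to reach the 1 to 10% range mentioned above.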

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Let's prepare the dataset from the filtered WhatsApp chat data.

In [25]:
import pandas as pd
from datasets import Dataset, load_dataset
from unsloth.chat_templates import get_chat_template

In [26]:
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3",  # Use the desired chat template
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"}
)

# Define the formatting function
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
    return {"text": texts}


In [40]:
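df = pd.read_csv("all_chat_data.csv")
conversations = []
for idx, row in df.iterrows():
    try:
        conversation = [
            {'from': 'human', 'value': str(row['Prompt'])},
            # 'gpt' is the assistant label declared in the get_chat_template mapping above
            {'from': 'gpt', 'value': str(row['Response'])}
        ]
        conversations.append(conversation)
    except Exception:
        # Report the offending row instead of silently swallowing every error
        print(idx, row)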


dataset = Dataset.from_dict({"conversations": conversations})
dataset = dataset.map(formatting_prompts_func, batched=True)

Map:   0%|          | 0/2157 [00:00<?, ? examples/s]

Let's see how the `Llama-3` format works by printing the element at index 5.

In [None]:
dataset[5]["conversations"]

In [None]:
print(dataset[5]["text"])


### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer).
* I am doing `1 epoch` to speed things up, but you can set `num_train_epochs` to 2 or 3. Just experiment with it.
* If you have a large dataset, just go for `1 full epoch`. Do not train for more than 3 or 4 epochs if the `training loss` is not **decreasing**.

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs=1,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
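
To estimate how long training will run, it helps to know roughly how many weight updates these settings produce. The sketch below is a rough calculation only; `approx_optimizer_steps` is a hypothetical helper, the trainer's own rounding may differ by a step, and 2157 is just an example dataset size (use `len(dataset)` for yours).

```python
import math

def approx_optimizer_steps(n_examples, per_device_batch, grad_accum, epochs=1, n_devices=1):
    """Return (effective batch size, approximate number of weight updates)."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps = math.ceil(n_examples / effective_batch) * epochs
    return effective_batch, steps

# Example values matching the settings in this notebook: batch 2, accumulation 4, 1 epoch.
effective_batch, steps = approx_optimizer_steps(2157, per_device_batch=2, grad_accum=4)
print(f"effective batch size {effective_batch}, about {steps} optimizer steps")
```

Gradient accumulation is what lets a small per-device batch behave like a larger one without using more GPU memory.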

In [None]:
trainer_stats = trainer.train()

<a name="Inference"></a>
### Inference
Let's run the model! Since we're using `Llama-3`, use `apply_chat_template` with `add_generation_prompt` set to `True` for inference.

In [50]:
from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)
text_streamer = TextStreamer(tokenizer)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"from": "human", "value": "Pagal ho gya hai kya"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

output = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Pagal ho gya hai kya<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Ha<|eot_id|>
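
The transcript above shows what the `Llama-3` chat format looks like once it is rendered. The function below is a hand-written illustration that rebuilds the same string from a ShareGPT-style message list; it is not Unsloth's actual template code, and `llama3_prompt` is a hypothetical name.

```python
def llama3_prompt(messages, add_generation_prompt=True):
    """Build a Llama-3 style prompt string from ShareGPT-style messages."""
    role_names = {"human": "user", "gpt": "assistant"}
    text = "<|begin_of_text|>"
    for msg in messages:
        role = role_names.get(msg["from"], msg["from"])
        text += f"<|start_header_id|>{role}<|end_header_id|>\n\n{msg['value'].strip()}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model writes the reply rather than another user turn.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

prompt = llama3_prompt([{"from": "human", "value": "Pagal ho gya hai kya"}])
print(prompt)
```

This is why `add_generation_prompt` matters at inference time: without the trailing assistant header, the model has no cue that it is its turn to answer.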


The `TextStreamer` passed to `generate` above prints the reply token by token, so you can watch the generation as it happens instead of waiting for the whole response.

<a name="Save"></a>
### Saving finetuned model (Lora Adapters Only)
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [51]:
model.save_pretrained("lora_model") # Local saving


### GGUF Conversion (For Ollama)
**To use our finetuned model on our PC/laptop, we will use [Ollama](https://ollama.com/)**. To use the model with `Ollama`, we have to save it in `GGUF` format.

Unsloth supports quantization methods such as `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on the Unsloth [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):

* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
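
To see roughly how large the exported file will be, the size can be estimated from the block layout of the quantization format. The sketch below covers `q8_0` only, assuming it stores blocks of 32 one-byte weights plus one 16-bit scale, and assuming roughly 8 billion parameters; the actual file also contains metadata and may keep a few tensors in other formats, so treat the result as an estimate.

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate file size in GB (1 GB = 1e9 bytes) for a given bits-per-weight."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumed q8_0 block: 32 int8 weights (32 bytes) + one 16-bit scale (2 bytes).
BLOCK_WEIGHTS = 32
BLOCK_BYTES = 32 + 2
Q8_0_BITS = BLOCK_BYTES * 8 / BLOCK_WEIGHTS  # bits per weight

N_PARAMS = 8e9  # assumption: roughly 8 billion parameters
estimate = gguf_size_gb(N_PARAMS, Q8_0_BITS)
print(f"q8_0: {Q8_0_BITS} bits per weight, about {estimate:.1f} GB")
```

The 4-bit and 5-bit methods in the list produce correspondingly smaller files, at some cost in quality.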

In [None]:
# Save to 8bit Q8_0
if True: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

Once the conversion is complete, the model `unsloth.Q8_0.gguf` is saved in the `model/` folder. You need to download this `GGUF` file along with the `Modelfile`. You can download them directly from the Files section of Google Colab, or copy them to your Google Drive and download them from there. If your internet connection is slow like mine, the Google Drive method is best because the file is large (approximately 8GB). Here are the steps for both methods.

### Direct Download via Colab
1. Click on the files section in Google Colab.
2. Locate the `model` folder, then expand it by clicking the arrow to the left of the folder name.
3. For each of the files `unsloth.Q8_0.gguf` and `Modelfile`, hover the mouse cursor over the filename.
4. Click on the three dots, then select Download.



![Download_IMG](https://github.com/Eviltr0N/Make-AI-Clone-of-Yourself/raw/main/img/file_download.png)

### Download via Google Drive
Before using this method, make sure you have at least 8GB of free space in your Google Drive. Otherwise, this will not work.

1. Run the cell below to mount your Google Drive in Colab.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

2. Run this cell to copy the model into your Google Drive.

In [None]:
!mkdir -p /content/drive/MyDrive/finetuned_model  # -p: no error if the folder already exists
!cp /content/model/unsloth.Q8_0.gguf /content/drive/MyDrive/finetuned_model/
!cp /content/model/Modelfile /content/drive/MyDrive/finetuned_model/
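
Because the model file is large, it is worth confirming that the copy finished completely. The sketch below is an optional alternative to the `cp` commands, not part of the original notebook: `copy_and_verify` is a hypothetical helper that copies a file and compares sizes. The demo uses temporary files so it runs anywhere; in Colab you would pass paths such as `/content/model/unsloth.Q8_0.gguf` and your Drive folder.

```python
import os
import shutil
import tempfile

def copy_and_verify(src_path, dst_dir):
    """Copy a file into dst_dir and check that the copy has the same size."""
    os.makedirs(dst_dir, exist_ok=True)
    dst_path = shutil.copy2(src_path, dst_dir)
    if os.path.getsize(src_path) != os.path.getsize(dst_path):
        raise IOError(f"Incomplete copy: {dst_path}")
    return dst_path

# Demo with a small throwaway file standing in for the GGUF model.
work_dir = tempfile.mkdtemp()
src = os.path.join(work_dir, "demo.gguf")
with open(src, "wb") as f:
    f.write(b"\x00" * 1024)

copied = copy_and_verify(src, os.path.join(work_dir, "finetuned_model"))
print("Copied to", copied)
```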

### Using Finetuned Model With Ollama and Whatsapp
Now follow [this guide on my GitHub](https://github.com/Eviltr0N/Make-AI-Clone-of-Yourself?tab=readme-ov-file#loading-the-model-into-ollama) to chat with your finetuned model.