## MLX Apple Silicon LLM Finetuning

To fine-tune Mistral 7B, the dataset needs to be formatted as `jsonl` (JSON Lines), with one record per line in this form:

`{"text":"<s>[INST] Instruction[/INST] Model answer</s>[INST] Follow-up instruction[/INST]"}`

I'm going to reuse the QA dataset I made for the AutoGen Discord RAG in a previous project.

**Useful Links:**
- [Building your training data for fine-tuning](https://apeatling.com/articles/part-2-building-your-training-data-for-fine-tuning/)
- [Documentation for the JSON Lines text file format](https://jsonlines.org/)
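As a minimal sketch of the record format above, the snippet below builds one multi-turn training string and serialises it as a single JSON Lines record. The helper name `format_conversation` and the `(instruction, answer)` tuple layout are illustrative assumptions, not part of any library; the spacing follows the template line verbatim.

```python
import json


def format_conversation(turns):
    """Build one training string from (instruction, answer) pairs.

    Hypothetical helper. An answer of None leaves the last instruction
    open, as in the template's trailing follow-up instruction.
    """
    parts = ["<s>"]
    for instruction, answer in turns:
        parts.append(f"[INST] {instruction}[/INST]")
        if answer is not None:
            # Each completed answer is closed with the end-of-sequence tag
            parts.append(f" {answer}</s>")
    return "".join(parts)


example = format_conversation([
    ("Instruction", "Model answer"),
    ("Follow-up instruction", None),
])

# One JSON object per line is all the JSON Lines format requires
line = json.dumps({"text": example})
print(line)
```

Because `json.dumps` escapes any newline inside the string, a record stays on one physical line even when an answer spans several lines.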

In [1]:
import re
import json
import pandas as pd
from pathlib import Path

### Read QA pairs into a DataFrame

In [2]:
def parse_qa(text):
    """Parse the text to extract questions and answers."""
    qa_pairs = re.findall(r'Question: (.*?)\nAnswer: (.*?)\n', text, re.DOTALL)
    return qa_pairs

def create_dataframe(filepath):
    """Read the text file and create a pandas DataFrame."""
    with open(filepath, 'r') as file:
        content = file.read()

    qa_pairs = parse_qa(content)
    df = pd.DataFrame(qa_pairs, columns=['Question', 'Answer'])
    return df


data_path = "./data/22112023_qa.txt"
qa_df = create_dataframe(data_path)
qa_df

| | Question | Answer |
|---|---|---|
| 0 | How can I handle an invalid URL error when usi... | To fix an invalid URL error, ensure you're usi... |
| 1 | How should I approach feeding a local image in... | When you want to feed a local image into the M... |
| 2 | How do I use the `--pre` flag in pip? | Use the `--pre` flag in pip to include pre-rel... |
| 3 | What do you do if you're charged for input tok... | You could modify the logic to terminate the op... |
| 4 | How can I install a package from a pre-release... | To install pre-release versions of a package t... |
| ... | ... | ... |
| 882 | What is the current state of support for integ... | According to a user's statement, Llama2 and mo... |
| 883 | How does one decide between using Autogen and ... | The provided text does not give a specific ans... |
| 884 | Are the developers of Autogen and Semantic Ker... | The text implies that there was some awareness... |
| 885 | Can I use any AI model, like Llama2, or am I l... | There is no clear answer provided in the text;... |
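The pattern in `parse_qa` requires a newline after every answer, so a final pair with no trailing newline is skipped, and an answer spanning several lines is cut at its first line break. The sketch below shows one possible, more lenient variant. It assumes each pair begins with `Question: ` and that the answer follows on a line starting with `Answer: `; the real file layout may differ, and `parse_qa_lenient` is a hypothetical name.

```python
import re

# Pattern used in the notebook, kept here for comparison
ORIGINAL_PATTERN = re.compile(r"Question: (.*?)\nAnswer: (.*?)\n", re.DOTALL)

# Lenient variant: an answer runs until the next "Question: " line
# or the end of the text, whichever comes first
LENIENT_PATTERN = re.compile(
    r"Question: (.*?)\nAnswer: (.*?)(?=\nQuestion: |\Z)", re.DOTALL
)


def parse_qa_lenient(text):
    """Return (question, answer) pairs with surrounding whitespace removed."""
    return [(q.strip(), a.strip()) for q, a in LENIENT_PATTERN.findall(text)]


# Small made-up sample: the last answer has no trailing newline
sample = "Question: A?\nAnswer: B.\n\nQuestion: C?\nAnswer: D."

original_pairs = ORIGINAL_PATTERN.findall(sample)
lenient_pairs = parse_qa_lenient(sample)

print(len(original_pairs), len(lenient_pairs))
```

Comparing the two counts on the real file is a quick way to check whether any pairs are being dropped.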


### Format data as JSON Lines

In [3]:
def format_for_mistral(row):
    # Formatting the question-answer pair with the required Mistral format
    return f"<s>[INST] {row['Question']} [/INST] {row['Answer']}</s>"

# Apply the formatting function to each row of the DataFrame
formatted_data = qa_df.apply(format_for_mistral, axis=1)
formatted_data

0      <s>[INST] How can I handle an invalid URL erro...
1      <s>[INST] How should I approach feeding a loca...
2      <s>[INST] How do I use the `--pre` flag in pip...
3      <s>[INST] What do you do if you're charged for...
4      <s>[INST] How can I install a package from a p...
                             ...                        
882    <s>[INST] What is the current state of support...
883    <s>[INST] How does one decide between using Au...
884    <s>[INST] Are the developers of Autogen and Se...
885    <s>[INST] Can I use any AI model, like Llama2,...
886    <s>[INST] What should I do if I have ideas for...
Length: 887, dtype: object

In [4]:
# Save the formatted data to a JSONL file (one JSON object per line)
with open('./data/instructions.jsonl', 'w') as file:
    for entry in formatted_data:
        json.dump({"text": entry}, file)
        file.write('\n')
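Before training, it can be worth confirming that the file written above really parses line by line. The following is a minimal sketch of such a check; `read_jsonl` is a hypothetical helper, and the demo writes to a temporary file so that it runs without the notebook's data.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Load every record of a JSONL file, failing on malformed lines."""
    records = []
    with open(path, "r", encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # raises json.JSONDecodeError if invalid
            if "text" not in record:
                raise ValueError(f"line {line_no}: missing 'text' key")
            records.append(record)
    return records


# Round-trip demo with made-up entries, one containing a newline
demo_entries = [
    "<s>[INST] Q1 [/INST] A1</s>",
    "<s>[INST] Q2 [/INST] first line\nsecond line</s>",
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    with open(demo_path, "w", encoding="utf-8") as fh:
        for entry in demo_entries:
            json.dump({"text": entry}, fh)
            fh.write("\n")
    demo_records = read_jsonl(demo_path)

print(len(demo_records))
```

The same function can be pointed at the notebook's output file to confirm that the record count matches the length of `formatted_data`.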


### Split data into train and validation sets

In [5]:
def create_valid_file(formatted_data):
    # Calculate 20% of the total lines for validation
    twenty_percent = int(len(formatted_data) * 0.2)
    validation_lines = formatted_data[:twenty_percent]
    training_lines = formatted_data[twenty_percent:]

    with open('./data/train.jsonl', 'w') as file:
        for entry in training_lines:
            json.dump({"text": entry}, file)
            file.write('\n')

    with open('./data/valid.jsonl', 'w') as file:
        for entry in validation_lines:
            json.dump({"text": entry}, file)
            file.write('\n')

create_valid_file(formatted_data)
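`create_valid_file` takes the first 20% of rows as the validation set, in file order. If the source file is grouped by topic, the validation set may not be representative. One possible alternative, sketched here under that assumption, is to shuffle with a fixed seed before slicing. The function name `shuffled_split` and its parameters are illustrative.

```python
import random


def shuffled_split(entries, valid_fraction=0.2, seed=0):
    """Return (train, valid) lists after a reproducible shuffle."""
    items = list(entries)
    rng = random.Random(seed)  # local RNG, leaves global random state alone
    rng.shuffle(items)
    n_valid = int(len(items) * valid_fraction)
    return items[n_valid:], items[:n_valid]


# Made-up entries standing in for the formatted strings
demo_entries = [f"example {i}" for i in range(10)]
train_entries, valid_entries = shuffled_split(demo_entries)

print(len(train_entries), len(valid_entries))
```

With the 887 formatted rows above, a 20% split gives 177 validation and 710 training examples under either approach.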


### Finetuning with MLX

- Install MLX: `pip install mlx`
- I copied the LoRA example code from `mlx-examples` into the `./models` folder.
- The `adapters.npz` file is written to the directory where the command is run.
- [LoRA - MLX Documentation](https://github.com/ml-explore/mlx-examples/tree/main/lora)



```bash
python lora.py --train --model mistralai/Mistral-7B-Instruct-v0.2 --data ./data/ --batch-size 2 --lora-layers 8 --iters 1000

. . .

Iter 900: Train loss 2.438, It/sec 0.464, Tokens/sec 69.842
Iter 900: Saved adapter weights to adapters.npz.
Iter 910: Train loss 2.326, It/sec 0.146, Tokens/sec 20.943
Iter 920: Train loss 2.328, It/sec 0.310, Tokens/sec 43.160
Iter 930: Train loss 2.280, It/sec 0.232, Tokens/sec 33.151
Iter 940: Train loss 2.454, It/sec 0.588, Tokens/sec 76.523
Iter 950: Train loss 2.238, It/sec 0.579, Tokens/sec 79.892
Iter 960: Train loss 2.457, It/sec 0.515, Tokens/sec 76.947
Iter 970: Train loss 2.419, It/sec 0.563, Tokens/sec 81.282
Iter 980: Train loss 2.276, It/sec 0.511, Tokens/sec 76.844
Iter 990: Train loss 2.332, It/sec 0.532, Tokens/sec 72.286
Iter 1000: Train loss 2.536, It/sec 0.498, Tokens/sec 73.649
Iter 1000: Val loss 2.187, Val took 82.108s
Iter 1000: Saved adapter weights to adapters.npz.
```
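The per-iteration train loss in the log above is noisy, so averaging over a window gives a clearer picture than any single line. The sketch below parses log lines of the shape shown above into records; the regular expression is written against this excerpt only and is an assumption about the log format, not a documented interface.

```python
import re
from statistics import mean

# Matches lines such as:
# "Iter 900: Train loss 2.438, It/sec 0.464, Tokens/sec 69.842"
TRAIN_LINE = re.compile(
    r"Iter (\d+): Train loss ([\d.]+), It/sec ([\d.]+), Tokens/sec ([\d.]+)"
)


def parse_train_log(log_text):
    """Extract the training lines of a log into a list of dicts."""
    rows = []
    for m in TRAIN_LINE.finditer(log_text):
        rows.append({
            "iter": int(m.group(1)),
            "train_loss": float(m.group(2)),
            "it_per_sec": float(m.group(3)),
            "tokens_per_sec": float(m.group(4)),
        })
    return rows


# A few lines copied from the log above
log_excerpt = """\
Iter 900: Train loss 2.438, It/sec 0.464, Tokens/sec 69.842
Iter 900: Saved adapter weights to adapters.npz.
Iter 910: Train loss 2.326, It/sec 0.146, Tokens/sec 20.943
Iter 1000: Val loss 2.187, Val took 82.108s
"""

rows = parse_train_log(log_excerpt)
avg_loss = mean(r["train_loss"] for r in rows)
print(f"{len(rows)} training lines, mean train loss {avg_loss:.3f}")
```

Lines that report validation loss or adapter saves do not match the pattern and are skipped.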


### Comparing Mistral-7B-Instruct-v0.2 to our Finetuned Version

First, we query the regular Mistral model to see what kind of response it gives.<br>
Then we ask the fine-tuned model the same question to check the effectiveness of the fine-tune.

```bash
python lora.py --model mistralai/Mistral-7B-Instruct-v0.2 --max-tokens 1000 --prompt "What are some ways you can use open sourced models with Autogen?"
```

**Response:**
```
Open sourced models can be used with Autogen for different purposes, such as improving realism, adding new features, and extending the coverage of different use cases. This is done by deploying the open sourced models as plugins or custom controls.

Split the answer into some possible applications:

1. Improving realism:
   - Open sourced models can be used to enhance existing features within Autogen.
   - Examples include adding additional textures using new models to pinpoint locations like parking lots, gates, fences, or even trees.

2. Extending feature coverage:
   - Open sourced models can be used to enhance the range of Autogen's capabilities by adding new types to the library.
   - For example, adding open sourced models of power stations, wind turbines, or antennas.

3. Custom controls or applications:
   - Scripts can be written to interact with the models, creating new custom controls or applications.
   - For example, using open sourced LIDAR data to create new methods of terrain modeling or even using open sourced satellite data to generate accurate 3D models of the terrain.


Remember that you would need to check the license of the open sourced model before importing to Autogen.
==========
```