# README

This file contains notes on the tutorial [learn.deeplearning.ai fine-tune-llm](https://learn.deeplearning.ai/courses/finetuning-large-language-models).

# Install & Import

In [37]:
import os
import lamini
import textwrap
from llama import BasicModelRunner
from dotenv import load_dotenv
import jsonlines
import itertools
import pandas as pd
from pprint import pprint

import datasets
from datasets import load_dataset

from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

load_dotenv()

True

In [38]:
SUPPORTED_MODELS_BY_LAMINI = [
    'EleutherAI/pythia-410m',
    'EleutherAI/pythia-70m',
    'hf-internal-testing/tiny-random-gpt2',
    'meta-llama/Llama-2-13b-chat-hf',
    'meta-llama/Llama-2-7b-chat-hf',
    'meta-llama/Llama-2-7b-hf',
    'meta-llama/Meta-Llama-3-8B-Instruct',
    'microsoft/phi-2',
    'microsoft/Phi-3-mini-4k-instruct',
    'mistralai/Mistral-7B-Instruct-v0.1',
    'mistralai/Mistral-7B-Instruct-v0.2',
    'Qwen/Qwen2-7B-Instruct'
]

# Why finetune

In [22]:
# API endpoint URL: a literal value, so it is assigned directly rather than looked up with os.getenv()
lamini.api_url = "http://jupyter-api-proxy.internal.dlai/rev-proxy/lamini"
lamini.api_key = os.getenv("LAMINI_API_KEY")

## Non-finetuned model

In [6]:
non_finetuned = BasicModelRunner("meta-llama/Llama-2-7b-hf")
non_finetuned

<lamini.runners.basic_model_runner.BasicModelRunner at 0x7808422679d0>

In [11]:
non_finetuned_output = non_finetuned("Tell me how to train my dog to sit")

wrapped_text = textwrap.fill(non_finetuned_output, width=80)
print(wrapped_text)

. Tell me how to train my dog to sit. I have a 10 month old puppy and I want to
train him to sit. I have tried the treat method and the verbal command method. I
have tried both and he just doesn't seem to get it. I have tried to get him to
sit by putting my hand on his back and pushing him down. I have tried to get him
to sit by putting my hand on his back and pushing him down. I have tried to get
him to sit by putting my hand on his back and pushing him down. I have tried to
get him to sit by putting my hand on his back and pushing him down. I have tried
to get him to sit by putting my hand on his back and pushing him down. I have
tried to get him to sit by putting my hand on his back and pushing him down. I
have tried to get him to sit by putting my hand on his back and pushing him
down. I have tried to get him to sit by putting my hand on his back and pushing
him down. I have tried to get him to sit by putting my hand on his back and
pushing him down. I have tried to get him to sit 

The response above is poor: the model repeats the same sentence many times.
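
One rough way to put a number on this repetition is to split the output into sentences and count duplicates. The snippet below is only a sketch: `repeated_sentences` is a hypothetical helper (not part of lamini), the sample text is a shortened stand-in for the output above, and splitting on periods is a crude assumption about sentence boundaries.

```python
from collections import Counter

def repeated_sentences(text, min_count=2):
    """Return {sentence: count} for sentences that appear at least min_count times."""
    # Crude split on periods; good enough to spot verbatim loops.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    counts = Counter(sentences)
    return {s: c for s, c in counts.items() if c >= min_count}

# Shortened stand-in for the base model output (assumption, not the real string).
sample = (
    "I have tried both. I have tried to get him to sit. "
    "I have tried to get him to sit. I have tried to get him to sit."
)
print(repeated_sentences(sample))
```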

## Compare to finetuned models 


In [13]:
finetuned_model = BasicModelRunner("meta-llama/Llama-2-7b-chat-hf")

In [16]:
finetuned_output = finetuned_model("Tell me how to train my dog to sit")

wrapped_text = textwrap.fill(finetuned_output)
print(wrapped_text)

on command. Training a dog to sit on command is a basic obedience
command that can be achieved with patience, consistency, and positive
reinforcement. Here's a step-by-step guide on how to train your dog to
sit on command:  1. Choose a quiet and distraction-free area: Find a
quiet area with minimal distractions where your dog can focus on you.
2. Have treats ready: Choose your dog's favorite treats and have them
ready to use as rewards. 3. Stand in front of your dog: Stand in front
of your dog and hold a treat close to their nose. 4. Move the treat up
and back: Slowly move the treat up and back, towards your dog's tail,
while saying "sit" in a calm and clear voice. 5. Dog will sit: As you
move the treat, your dog will naturally sit down to follow the treat.
The moment their bottom touches the ground, say "good sit" and give
them the treat. 6. Repeat the process: Repeat steps 3-5 several times,
so your dog starts to associate the command "sit" with the action of
sitting down. 7. Gradual

In [17]:
print(finetuned_model("[INST]Tell me how to train my dog to sit[/INST]"))

 Training your dog to sit is a basic obedience command that can be achieved with patience, consistency, and positive reinforcement. Here's a step-by-step guide on how to train your dog to sit:

1. Choose a quiet and distraction-free area: Find a quiet area with no distractions where your dog can focus on you.
2. Have treats ready: Choose your dog's favorite treats and have them ready to use as rewards.
3. Stand in front of your dog: Stand in front of your dog and hold a treat close to their nose.
4. Move the treat up and back: Slowly move the treat up and back, towards your dog's tail, while saying "sit" in a calm and clear voice.
5. Dog will sit: As you move the treat, your dog will naturally sit down to follow the treat. The moment their bottom touches the ground, say "good sit" and give them the treat.
6. Repeat the process: Repeat steps 3-5 several times, so your dog starts to associate the command "sit" with the action of sitting down.
7. Gradually phase out the treats: As your do

In [18]:
print(non_finetuned("[INST]Tell me how to train my dog to sit[/INST]"))


[INST]Tell me how to train my dog to sit[/INST]
[INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to si
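
The two cells above wrap the prompt in `[INST]...[/INST]` tags by hand. The chat model responds to them with a cleaner answer, while the base model only echoes them back. A minimal sketch of a helper that applies the same tags is shown here; `to_inst_prompt` is a hypothetical name, and it mirrors only the tag usage seen in this README, not the full Llama 2 chat template.

```python
def to_inst_prompt(user_message):
    """Wrap a user message in the [INST] tags used in the cells above."""
    return f"[INST]{user_message.strip()}[/INST]"

prompt = to_inst_prompt("Tell me how to train my dog to sit")
print(prompt)
```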

# Finetuning data: compare to pretraining and basic preparation

Or: where does finetuning fit in?


Note: EleutherAI open-sourced a dataset named `pile`, but it is not currently available on Hugging Face. Instead, they refer to another dataset, [bigcode/the-stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2), which holds 1.68 TB of data scraped from different sources on the internet. The cell below loads `c4` as the pretraining example.

In [5]:
#pretrained_dataset = load_dataset("EleutherAI/pile", split="train", streaming=True)

pretrained_dataset = load_dataset("c4", "en", split="train", streaming=True)
pretrained_dataset



IterableDataset({
    features: ['text', 'timestamp', 'url'],
    n_shards: 1024
})

In [9]:
n = 5
print("Pretrained dataset:")
top_n = itertools.islice(pretrained_dataset, n)
for i in top_n:
  print(i)

print("================================================")
print(textwrap.fill(i['text']))

Pretrained dataset:
{'text': 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.', 'timestamp': '2019-04-25T12:57:54Z', 'url': 'https://klyq.com/beginners-bbq-class-taking-place-in-missoula/'}
{'text': 'Discussion in \'Mac OS X Lion (10.7)\' started by axboi87, Jan 20, 2012.\nI\'ve got a 500gb inter
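
With `streaming=True` the dataset is an iterable that yields records lazily, which is why `itertools.islice` can take the first `n` examples without loading the whole corpus. The sketch below shows the same pattern without any download; `fake_stream` is a made-up stand-in for the streamed dataset, not a `datasets` API.

```python
import itertools

def fake_stream():
    """Stand-in for a streaming dataset: yields records lazily, one at a time."""
    i = 0
    while True:  # endless on purpose, like a corpus too large to load whole
        yield {"text": f"document {i}"}
        i += 1

# islice stops after 3 records, so the endless generator is never exhausted.
first_three = list(itertools.islice(fake_stream(), 3))
print(first_three)
```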

### Contrast with company finetuning dataset you will be using

In [11]:
filename = "data/lamini_docs.csv"
instruction_dataset_df = pd.read_csv(filename)
instruction_dataset_df

| Unnamed: 0 | question | answer |
|---|---|---|
| 0 | What are the different types of documents avai... | Lamini has documentation on Getting Started, A... |
| 1 | What is the recommended way to set up and conf... | Lamini can be downloaded as a python package a... |
| 2 | How can I find the specific documentation I ne... | You can ask this model about documentation, wh... |
| 3 | Does the documentation include explanations of... | Our documentation provides both real-world and... |
| 4 | Does the documentation provide information abo... | External dependencies and libraries are all av... |
| ... | ... | ... |
| 1395 | What is Lamini and what is its collaboration w... | Lamini is a library that simplifies the proces... |
| 1396 | How does Lamini simplify the process of access... | Lamini simplifies data access in Databricks by... |
| 1397 | What are some of the key features provided by ... | Lamini automatically manages the infrastructur... |
| 1398 | How does Lamini ensure data privacy during the... | During the training process, Lamini ensures da... |


### Various ways of formatting your data

In [13]:
examples = instruction_dataset_df.to_dict()
text = examples["question"][0] + examples["answer"][0]
print(textwrap.fill(text))

What are the different types of documents available in the repository
(e.g., installation guide, API documentation, developer's
guide)?Lamini has documentation on Getting Started, Authentication,
Question Answer Model, Python Library, Batching, Error Handling,
Advanced topics, and class documentation on LLM Engine available at
https://lamini-ai.github.io/.


The cell below shows how to handle a dataset whose columns come with different labels.

In [16]:
if "question" in examples and "answer" in examples:
  text = examples["question"][0] + examples["answer"][0]
elif "instruction" in examples and "response" in examples:
  text = examples["instruction"][0] + examples["response"][0]
elif "input" in examples and "output" in examples:
  text = examples["input"][0] + examples["output"][0]
else:
  text = examples["text"][0]

print(textwrap.fill(text))

What are the different types of documents available in the repository
(e.g., installation guide, API documentation, developer's
guide)?Lamini has documentation on Getting Started, Authentication,
Question Answer Model, Python Library, Batching, Error Handling,
Advanced topics, and class documentation on LLM Engine available at
https://lamini-ai.github.io/.
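
The `if`/`elif` chain above can also be written as a loop over candidate column pairs, which makes it easier to add more label conventions later. This is a sketch under assumptions: `COLUMN_PAIRS` and `first_example_text` are hypothetical names, and `demo` is a made-up record in the same shape that `DataFrame.to_dict()` returns.

```python
# Candidate (prompt column, completion column) pairs, in priority order.
COLUMN_PAIRS = [
    ("question", "answer"),
    ("instruction", "response"),
    ("input", "output"),
]

def first_example_text(examples):
    """Concatenate the first prompt/completion pair, falling back to 'text'."""
    for prompt_col, completion_col in COLUMN_PAIRS:
        if prompt_col in examples and completion_col in examples:
            return examples[prompt_col][0] + examples[completion_col][0]
    return examples["text"][0]

# Made-up record using the instruction/response labels.
demo = {"instruction": {0: "Say hi."}, "response": {0: " Hi!"}}
print(first_example_text(demo))
```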


In [19]:
prompt_template_qa = """### Question:
{question}

### Answer:
{answer}"""
print(prompt_template_qa)

### Question:
{question}

### Answer:
{answer}


In [24]:
question = examples["question"][0]
answer = examples["answer"][0]

text_with_prompt_template = prompt_template_qa.format(question=question, answer=answer)
print(text_with_prompt_template)

### Question:
What are the different types of documents available in the repository (e.g., installation guide, API documentation, developer's guide)?

### Answer:
Lamini has documentation on Getting Started, Authentication, Question Answer Model, Python Library, Batching, Error Handling, Advanced topics, and class documentation on LLM Engine available at https://lamini-ai.github.io/.


In [27]:
prompt_template_q = """### Question:
{question}

### Answer:"""

print(prompt_template_q)

### Question:
{question}

### Answer:


In [30]:
num_examples = len(examples["question"])
finetuning_dataset_text_only = []
finetuning_dataset_question_answer = []
for i in range(num_examples):
  question = examples["question"][i]
  answer = examples["answer"][i]

  text_with_prompt_template_qa = prompt_template_qa.format(question=question, answer=answer)
  finetuning_dataset_text_only.append({"text": text_with_prompt_template_qa})

  text_with_prompt_template_q = prompt_template_q.format(question=question)
  finetuning_dataset_question_answer.append({"question": text_with_prompt_template_q, "answer": answer})

In [31]:
pprint(finetuning_dataset_text_only[0])

{'text': '### Question:\n'
         'What are the different types of documents available in the '
         "repository (e.g., installation guide, API documentation, developer's "
         'guide)?\n'
         '\n'
         '### Answer:\n'
         'Lamini has documentation on Getting Started, Authentication, '
         'Question Answer Model, Python Library, Batching, Error Handling, '
         'Advanced topics, and class documentation on LLM Engine available at '
         'https://lamini-ai.github.io/.'}


In [32]:
pprint(finetuning_dataset_question_answer[0])

{'answer': 'Lamini has documentation on Getting Started, Authentication, '
           'Question Answer Model, Python Library, Batching, Error Handling, '
           'Advanced topics, and class documentation on LLM Engine available '
           'at https://lamini-ai.github.io/.',
 'question': '### Question:\n'
             'What are the different types of documents available in the '
             'repository (e.g., installation guide, API documentation, '
             "developer's guide)?\n"
             '\n'
             '### Answer:'}


### Common ways of storing your data

In [35]:
with jsonlines.open('data/lamini_docs_processed.jsonl', 'w') as writer:
    writer.write_all(finetuning_dataset_question_answer)
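
A JSON Lines file stores one JSON object per line, so it can be read back record by record. The sketch below shows the round trip with only the standard library `json` module instead of `jsonlines`; the two records and the temporary file path are placeholders invented for illustration, not rows from `lamini_docs`.

```python
import json
import os
import tempfile

# Placeholder records in the same question/answer shape (invented for this sketch).
records = [
    {"question": "### Question:\nFirst placeholder?\n\n### Answer:", "answer": "First."},
    {"question": "### Question:\nSecond placeholder?\n\n### Answer:", "answer": "Second."},
]

path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")

# Write: one JSON object per line (json.dumps escapes embedded newlines).
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)
```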

In [34]:
finetuning_dataset_name = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_name)
print(finetuning_dataset)

Downloading readme: 100%|██████████| 577/577 [00:00<00:00, 6.69MB/s]
Downloading data: 100%|██████████| 615k/615k [00:04<00:00, 139kB/s]
Downloading data: 100%|██████████| 83.7k/83.7k [00:01<00:00, 51.0kB/s]
Generating train split: 100%|██████████| 1260/1260 [00:00<00:00, 280020.30 examples/s]
Generating test split: 100%|██████████| 140/140 [00:00<00:00, 108439.99 examples/s]

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})





# Instruction-tuning

### Load instruction tuned dataset

In [40]:
instruction_tuned_dataset = load_dataset("tatsu-lab/alpaca", split="train", streaming=True)
instruction_tuned_dataset

IterableDataset({
    features: ['instruction', 'input', 'output', 'text'],
    n_shards: 1
})

In [41]:
m = 5
print("Instruction-tuned dataset:")
top_m = list(itertools.islice(instruction_tuned_dataset, m))
for j in top_m:
  print(j)

Instruction-tuned dataset:
{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}
{'instruction': 'What are the three primary colors?', 'input': '', 'output': 'The three primary colors are red, blue, and yellow.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat are the three p

### Two prompt templates


The prompt structure below is the Alpaca prompt structure. The `unsloth` finetuning process uses the same format.

In [44]:
prompt_template_with_input = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""

prompt_template_without_input = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""

### Hydrate prompts (add data to prompts)

In [47]:
processed_data = []
for j in top_m:
  if not j["input"]:
    processed_prompt = prompt_template_without_input.format(instruction=j["instruction"])
  else:
    processed_prompt = prompt_template_with_input.format(instruction=j["instruction"], input=j["input"])

  processed_data.append({"input": processed_prompt, "output": j["output"]})

pprint(processed_data[0])

{'input': 'Below is an instruction that describes a task. Write a response '
          'that appropriately completes the request.\n'
          '\n'
          '### Instruction:\n'
          'Give three tips for staying healthy.\n'
          '\n'
          '### Response:',
 'output': '1.Eat a balanced diet and make sure to include plenty of fruits '
           'and vegetables. \n'
           '2. Exercise regularly to keep your body active and strong. \n'
           '3. Get enough sleep and maintain a consistent sleep schedule.'}


### Save data to jsonl

In [48]:
with jsonlines.open('data/alpaca_processed.jsonl', 'w') as writer:
    writer.write_all(processed_data)

### Compare non-instruction-tuned vs. instruction-tuned models

In [49]:
dataset_path_hf = "lamini/alpaca"
dataset_hf = load_dataset(dataset_path_hf)
print(dataset_hf)

Downloading readme: 100%|██████████| 388/388 [00:00<00:00, 4.06MB/s]
Downloading data: 100%|██████████| 12.7M/12.7M [00:02<00:00, 6.17MB/s]
Generating train split: 100%|██████████| 52002/52002 [00:00<00:00, 1071641.16 examples/s]

DatasetDict({
    train: Dataset({
        features: ['input', 'output'],
        num_rows: 52002
    })
})





In [52]:
non_instruct_model = BasicModelRunner("meta-llama/Llama-2-7b-hf")
non_instruct_output = non_instruct_model("Tell me how to train my dog to sit")
print("Not instruction-tuned output (Llama 2 Base):")
print(textwrap.fill(non_instruct_output))

Not instruction-tuned output (Llama 2 Base):
. Tell me how to train my dog to sit. I have a 10 month old puppy and
I want to train him to sit. I have tried the treat method and the
verbal command method. I have tried both and he just doesn't seem to
get it. I have tried to get him to sit by putting my hand on his back
and pushing him down. I have tried to get him to sit by putting my
hand on his back and pushing him down. I have tried to get him to sit
by putting my hand on his back and pushing him down. I have tried to
get him to sit by putting my hand on his back and pushing him down. I
have tried to get him to sit by putting my hand on his back and
pushing him down. I have tried to get him to sit by putting my hand on
his back and pushing him down. I have tried to get him to sit by
putting my hand on his back and pushing him down. I have tried to get
him to sit by putting my hand on his back and pushing him down. I have
tried to get him to sit by putting my hand on his back and push

The output above is poor: the same sentence is repeated over and over.

In [53]:
instruct_model = BasicModelRunner("meta-llama/Llama-2-7b-chat-hf")
instruct_output = instruct_model("Tell me how to train my dog to sit")
print("Instruction-tuned output (Llama 2): ", instruct_output)

Instruction-tuned output (Llama 2):  on command.
Training a dog to sit on command is a basic obedience command that can be achieved with patience, consistency, and positive reinforcement. Here's a step-by-step guide on how to train your dog to sit on command:

1. Choose a quiet and distraction-free area: Find a quiet area with minimal distractions where your dog can focus on you.
2. Have treats ready: Choose your dog's favorite treats and have them ready to use as rewards.
3. Stand in front of your dog: Stand in front of your dog and hold a treat close to their nose.
4. Move the treat up and back: Slowly move the treat up and back, towards your dog's tail, while saying "sit" in a calm and clear voice.
5. Dog will sit: As you move the treat, your dog will naturally sit down to follow the treat. The moment their bottom touches the ground, say "good sit" and give them the treat.
6. Repeat the process: Repeat steps 3-5 several times, so your dog starts to associate the command "sit" with t

### Try smaller models

Running these commands requires a GPU or more CPU resources.
```python
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
```

**Inference**

```python
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
  # Tokenize the prompt, truncating it to at most max_input_tokens tokens
  input_ids = tokenizer.encode(
          text,
          return_tensors="pt",
          truncation=True,
          max_length=max_input_tokens
  )

  # Generate on whichever device the model lives on.
  # Note: max_length counts the prompt tokens plus the generated tokens.
  device = model.device
  generated_tokens_with_prompt = model.generate(
    input_ids=input_ids.to(device),
    max_length=max_output_tokens
  )

  # Decode the token ids back into text
  generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

  # Strip the prompt so only the newly generated text is returned
  generated_text_answer = generated_text_with_prompt[0][len(text):]

  return generated_text_answer
```



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


ImportError: 
AutoModelForCausalLM requires the PyTorch library but it was not found in your environment. Checkout the instructions on the
installation page: https://pytorch.org/get-started/locally/ and follow the ones that match your environment.
Please note that you may need to restart your runtime after installation.
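
The `ImportError` above means PyTorch was not installed in the environment where the cell ran. A small pre-check can surface this before a model is loaded. The sketch below uses only the standard library; `is_installed` is a hypothetical helper name.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the top-level package can be found, without importing it."""
    return importlib.util.find_spec(package_name) is not None

# Report whether the packages needed by the cells above are available.
for name in ("torch", "transformers"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If `torch` is reported as missing, follow the installation page linked in the error message and restart the runtime.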
