# README

This file contains notes on the tutorial [learn.deeplearning.ai fine-tune-llm](https://learn.deeplearning.ai/courses/finetuning-large-language-models).

# Install & Import

In [1]:
import os
import lamini
import textwrap
from llama import BasicModelRunner
from dotenv import load_dotenv
import jsonlines
import itertools
import pandas as pd
from pprint import pprint

import datasets
from datasets import load_dataset, Dataset

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM
from huggingface_hub import login

load_dotenv()


True

In [2]:
login(os.environ.get("HF_TOKEN"))

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/softsuave/.cache/huggingface/token
Login successful


In [None]:
SUPPORTED_MODELS_BY_LAMINI = [
    'EleutherAI/pythia-410m',
    'EleutherAI/pythia-70m',
    'hf-internal-testing/tiny-random-gpt2',
    'meta-llama/Llama-2-13b-chat-hf',
    'meta-llama/Llama-2-7b-chat-hf',
    'meta-llama/Llama-2-7b-hf',
    'meta-llama/Meta-Llama-3-8B-Instruct',
    'microsoft/phi-2',
    'microsoft/Phi-3-mini-4k-instruct',
    'mistralai/Mistral-7B-Instruct-v0.1',
    'mistralai/Mistral-7B-Instruct-v0.2',
    'Qwen/Qwen2-7B-Instruct'
]

# Why finetune

In [22]:
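# NOTE: os.getenv() looks up an environment variable by name; passing a URL
# as the name returns None, so this line leaves api_url unset.
lamini.api_url = os.getenv("http://jupyter-api-proxy.internal.dlai/rev-proxy/lamini")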
lamini.api_key = os.getenv("LAMINI_API_KEY")

## Non-finetuned model

In [6]:
non_finetuned = BasicModelRunner("meta-llama/Llama-2-7b-hf")
non_finetuned

<lamini.runners.basic_model_runner.BasicModelRunner at 0x7808422679d0>

In [11]:
non_finetuned_output = non_finetuned("Tell me how to train my dog to sit")

wrapped_text = textwrap.fill(non_finetuned_output, width=80)
print(wrapped_text)

. Tell me how to train my dog to sit. I have a 10 month old puppy and I want to
train him to sit. I have tried the treat method and the verbal command method. I
have tried both and he just doesn't seem to get it. I have tried to get him to
sit by putting my hand on his back and pushing him down. I have tried to get him
to sit by putting my hand on his back and pushing him down. I have tried to get
him to sit by putting my hand on his back and pushing him down. I have tried to
get him to sit by putting my hand on his back and pushing him down. I have tried
to get him to sit by putting my hand on his back and pushing him down. I have
tried to get him to sit by putting my hand on his back and pushing him down. I
have tried to get him to sit by putting my hand on his back and pushing him
down. I have tried to get him to sit by putting my hand on his back and pushing
him down. I have tried to get him to sit by putting my hand on his back and
pushing him down. I have tried to get him to sit 

The response above is not good: it repeats the same sentence many times.

## Compare to finetuned models 


In [13]:
finetuned_model = BasicModelRunner("meta-llama/Llama-2-7b-chat-hf")

In [16]:
finetuned_output = finetuned_model("Tell me how to train my dog to sit")

wrapped_text = textwrap.fill(finetuned_output)
print(wrapped_text)

on command. Training a dog to sit on command is a basic obedience
command that can be achieved with patience, consistency, and positive
reinforcement. Here's a step-by-step guide on how to train your dog to
sit on command:  1. Choose a quiet and distraction-free area: Find a
quiet area with minimal distractions where your dog can focus on you.
2. Have treats ready: Choose your dog's favorite treats and have them
ready to use as rewards. 3. Stand in front of your dog: Stand in front
of your dog and hold a treat close to their nose. 4. Move the treat up
and back: Slowly move the treat up and back, towards your dog's tail,
while saying "sit" in a calm and clear voice. 5. Dog will sit: As you
move the treat, your dog will naturally sit down to follow the treat.
The moment their bottom touches the ground, say "good sit" and give
them the treat. 6. Repeat the process: Repeat steps 3-5 several times,
so your dog starts to associate the command "sit" with the action of
sitting down. 7. Gradual

In [17]:
print(finetuned_model("[INST]Tell me how to train my dog to sit[/INST]"))

 Training your dog to sit is a basic obedience command that can be achieved with patience, consistency, and positive reinforcement. Here's a step-by-step guide on how to train your dog to sit:

1. Choose a quiet and distraction-free area: Find a quiet area with no distractions where your dog can focus on you.
2. Have treats ready: Choose your dog's favorite treats and have them ready to use as rewards.
3. Stand in front of your dog: Stand in front of your dog and hold a treat close to their nose.
4. Move the treat up and back: Slowly move the treat up and back, towards your dog's tail, while saying "sit" in a calm and clear voice.
5. Dog will sit: As you move the treat, your dog will naturally sit down to follow the treat. The moment their bottom touches the ground, say "good sit" and give them the treat.
6. Repeat the process: Repeat steps 3-5 several times, so your dog starts to associate the command "sit" with the action of sitting down.
7. Gradually phase out the treats: As your do

In [18]:
print(non_finetuned("[INST]Tell me how to train my dog to sit[/INST]"))


[INST]Tell me how to train my dog to sit[/INST]
[INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to sit[/INST] [INST]Tell me how to train my dog to si

# Finetuning data: compare to pretraining and basic preparation

Or: `Where does finetuning fit in?`

Note: EleutherAI open-sourced a dataset named `pile`, but it is not currently available on Hugging Face. Instead, another dataset is referenced: [bigcode/the-stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2), which holds 1.68 TB of data scraped from various sources on the internet. The cell below streams `c4` as the pretraining example.

In [5]:
#pretrained_dataset = load_dataset("EleutherAI/pile", split="train", streaming=True)

pretrained_dataset = load_dataset("c4", "en", split="train", streaming=True)
pretrained_dataset



IterableDataset({
    features: ['text', 'timestamp', 'url'],
    n_shards: 1024
})

In [9]:
n = 5
print("Pretrained dataset:")
top_n = itertools.islice(pretrained_dataset, n)
for i in top_n:
  print(i)

print("================================================")
print(textwrap.fill(i['text']))

Pretrained dataset:
{'text': 'Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.', 'timestamp': '2019-04-25T12:57:54Z', 'url': 'https://klyq.com/beginners-bbq-class-taking-place-in-missoula/'}
{'text': 'Discussion in \'Mac OS X Lion (10.7)\' started by axboi87, Jan 20, 2012.\nI\'ve got a 500gb inter

### Contrast with company finetuning dataset you will be using

In [11]:
filename = "data/lamini_docs.csv"
instruction_dataset_df = pd.read_csv(filename)
instruction_dataset_df

Unnamed: 0,question,answer
0,What are the different types of documents avai...,"Lamini has documentation on Getting Started, A..."
1,What is the recommended way to set up and conf...,Lamini can be downloaded as a python package a...
2,How can I find the specific documentation I ne...,"You can ask this model about documentation, wh..."
3,Does the documentation include explanations of...,Our documentation provides both real-world and...
4,Does the documentation provide information abo...,External dependencies and libraries are all av...
...,...,...
1395,What is Lamini and what is its collaboration w...,Lamini is a library that simplifies the proces...
1396,How does Lamini simplify the process of access...,Lamini simplifies data access in Databricks by...
1397,What are some of the key features provided by ...,Lamini automatically manages the infrastructur...
1398,How does Lamini ensure data privacy during the...,"During the training process, Lamini ensures da..."


### Various ways of formatting your data

In [13]:
examples = instruction_dataset_df.to_dict()
text = examples["question"][0] + examples["answer"][0]
print(textwrap.fill(text))

What are the different types of documents available in the repository
(e.g., installation guide, API documentation, developer's
guide)?Lamini has documentation on Getting Started, Authentication,
Question Answer Model, Python Library, Batching, Error Handling,
Advanced topics, and class documentation on LLM Engine available at
https://lamini-ai.github.io/.


The cell below is an example of how to handle datasets whose columns use different label names.

In [16]:
if "question" in examples and "answer" in examples:
  text = examples["question"][0] + examples["answer"][0]
elif "instruction" in examples and "response" in examples:
  text = examples["instruction"][0] + examples["response"][0]
elif "input" in examples and "output" in examples:
  text = examples["input"][0] + examples["output"][0]
else:
  text = examples["text"][0]

print(textwrap.fill(text))

What are the different types of documents available in the repository
(e.g., installation guide, API documentation, developer's
guide)?Lamini has documentation on Getting Started, Authentication,
Question Answer Model, Python Library, Batching, Error Handling,
Advanced topics, and class documentation on LLM Engine available at
https://lamini-ai.github.io/.
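
The branching above can also be written as a small lookup over known column pairs. This is only a sketch: the helper name `pick_text_fields` and the `KNOWN_PAIRS` list are hypothetical, and the toy dictionary stands in for the output of `instruction_dataset_df.to_dict()`.

```python
# Hypothetical helper; the key pairs mirror the if/elif chain above.
KNOWN_PAIRS = [
    ("question", "answer"),
    ("instruction", "response"),
    ("input", "output"),
]


def pick_text_fields(examples, index=0):
    """Return the concatenated prompt and response text for one row."""
    for prompt_key, response_key in KNOWN_PAIRS:
        if prompt_key in examples and response_key in examples:
            return examples[prompt_key][index] + examples[response_key][index]
    # Fall back to a single free-text column.
    return examples["text"][index]


# Toy data in the same column -> {row index -> value} shape as DataFrame.to_dict().
toy_examples = {"instruction": {0: "Say hi."}, "response": {0: " Hi!"}}
print(pick_text_fields(toy_examples))
```

Adding support for a new label scheme then only means appending one tuple to `KNOWN_PAIRS`.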


In [19]:
prompt_template_qa = """### Question:
{question}

### Answer:
{answer}"""
print(prompt_template_qa)

### Question:
{question}

### Answer:
{answer}


In [24]:
question = examples["question"][0]
answer = examples["answer"][0]

text_with_prompt_template = prompt_template_qa.format(question=question, answer=answer)
print(text_with_prompt_template)

### Question:
What are the different types of documents available in the repository (e.g., installation guide, API documentation, developer's guide)?

### Answer:
Lamini has documentation on Getting Started, Authentication, Question Answer Model, Python Library, Batching, Error Handling, Advanced topics, and class documentation on LLM Engine available at https://lamini-ai.github.io/.


In [27]:
prompt_template_q = """### Question:
{question}

### Answer:"""

print(prompt_template_q)

### Question:
{question}

### Answer:


In [30]:
num_examples = len(examples["question"])
finetuning_dataset_text_only = []
finetuning_dataset_question_answer = []
for i in range(num_examples):
  question = examples["question"][i]
  answer = examples["answer"][i]

  text_with_prompt_template_qa = prompt_template_qa.format(question=question, answer=answer)
  finetuning_dataset_text_only.append({"text": text_with_prompt_template_qa})

  text_with_prompt_template_q = prompt_template_q.format(question=question)
  finetuning_dataset_question_answer.append({"question": text_with_prompt_template_q, "answer": answer})

In [31]:
pprint(finetuning_dataset_text_only[0])

{'text': '### Question:\n'
         'What are the different types of documents available in the '
         "repository (e.g., installation guide, API documentation, developer's "
         'guide)?\n'
         '\n'
         '### Answer:\n'
         'Lamini has documentation on Getting Started, Authentication, '
         'Question Answer Model, Python Library, Batching, Error Handling, '
         'Advanced topics, and class documentation on LLM Engine available at '
         'https://lamini-ai.github.io/.'}


In [32]:
pprint(finetuning_dataset_question_answer[0])

{'answer': 'Lamini has documentation on Getting Started, Authentication, '
           'Question Answer Model, Python Library, Batching, Error Handling, '
           'Advanced topics, and class documentation on LLM Engine available '
           'at https://lamini-ai.github.io/.',
 'question': '### Question:\n'
             'What are the different types of documents available in the '
             'repository (e.g., installation guide, API documentation, '
             "developer's guide)?\n"
             '\n'
             '### Answer:'}


### Common ways of storing your data

In [35]:
with jsonlines.open('data/lamini_docs_processed.jsonl', 'w') as writer:
    writer.write_all(finetuning_dataset_question_answer)
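
For reference, JSON Lines stores one JSON object per line. The sketch below writes and reads that format with only the standard library `json` module. The toy records and the temporary file path are assumptions used for illustration; they stand in for the real dataset and for `data/lamini_docs_processed.jsonl`.

```python
import json
import os
import tempfile

# Toy records in the same {"question", "answer"} shape used above.
records = [
    {"question": "### Question:\nWhat is Lamini?\n\n### Answer:", "answer": "A toy answer."},
    {"question": "### Question:\nWhat is JSONL?\n\n### Answer:", "answer": "One JSON object per line."},
]

# Assumption: a temporary file stands in for the real output path.
path = os.path.join(tempfile.mkdtemp(), "toy.jsonl")

# Write: one JSON document per line (newlines inside strings are escaped).
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read: parse each non-empty line back into a dict.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), loaded[0]["answer"])
```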

In [34]:
finetuning_dataset_name = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_name)
print(finetuning_dataset)

Downloading readme: 100%|██████████| 577/577 [00:00<00:00, 6.69MB/s]
Downloading data: 100%|██████████| 615k/615k [00:04<00:00, 139kB/s]
Downloading data: 100%|██████████| 83.7k/83.7k [00:01<00:00, 51.0kB/s]
Generating train split: 100%|██████████| 1260/1260 [00:00<00:00, 280020.30 examples/s]
Generating test split: 100%|██████████| 140/140 [00:00<00:00, 108439.99 examples/s]

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})





# Instruction-tuning

### Load instruction tuned dataset

In [40]:
instruction_tuned_dataset = load_dataset("tatsu-lab/alpaca", split="train", streaming=True)
instruction_tuned_dataset

IterableDataset({
    features: ['instruction', 'input', 'output', 'text'],
    n_shards: 1
})

In [41]:
m = 5
print("Instruction-tuned dataset:")
top_m = list(itertools.islice(instruction_tuned_dataset, m))
for j in top_m:
  print(j)

Instruction-tuned dataset:
{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}
{'instruction': 'What are the three primary colors?', 'input': '', 'output': 'The three primary colors are red, blue, and yellow.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat are the three p

### Two prompt templates


These templates follow the Alpaca prompt structure. The same format is also used in the `unsloth` finetuning process.

In [44]:
prompt_template_with_input = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""

prompt_template_without_input = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""

### Hydrate prompts (add data to prompts)

In [47]:
processed_data = []
for j in top_m:
  if not j["input"]:
    processed_prompt = prompt_template_without_input.format(instruction=j["instruction"])
  else:
    processed_prompt = prompt_template_with_input.format(instruction=j["instruction"], input=j["input"])

  processed_data.append({"input": processed_prompt, "output": j["output"]})

pprint(processed_data[0])

{'input': 'Below is an instruction that describes a task. Write a response '
          'that appropriately completes the request.\n'
          '\n'
          '### Instruction:\n'
          'Give three tips for staying healthy.\n'
          '\n'
          '### Response:',
 'output': '1.Eat a balanced diet and make sure to include plenty of fruits '
           'and vegetables. \n'
           '2. Exercise regularly to keep your body active and strong. \n'
           '3. Get enough sleep and maintain a consistent sleep schedule.'}
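
The hydration step can be wrapped in a reusable function. The following is a minimal sketch: the function name `hydrate` and the two toy rows are hypothetical, while the template wording copies the two Alpaca templates defined earlier.

```python
# The two Alpaca-style templates, repeated here so the sketch is self-contained.
TEMPLATE_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
TEMPLATE_WITHOUT_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"
)


def hydrate(example):
    """Fill the matching template and return an {"input", "output"} record."""
    if example.get("input"):
        prompt = TEMPLATE_WITH_INPUT.format(
            instruction=example["instruction"], input=example["input"]
        )
    else:
        prompt = TEMPLATE_WITHOUT_INPUT.format(instruction=example["instruction"])
    return {"input": prompt, "output": example["output"]}


# Toy rows (hypothetical), one with and one without an input field.
rows = [
    {"instruction": "Give three tips for staying healthy.", "input": "", "output": "Eat well."},
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
]
hydrated = [hydrate(row) for row in rows]
print(hydrated[1]["input"])
```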


### Save data to jsonl

In [48]:
with jsonlines.open('data/alpaca_processed.jsonl', 'w') as writer:
    writer.write_all(processed_data)

### Compare non-instruction-tuned vs. instruction-tuned models

In [49]:
dataset_path_hf = "lamini/alpaca"
dataset_hf = load_dataset(dataset_path_hf)
print(dataset_hf)

Downloading readme: 100%|██████████| 388/388 [00:00<00:00, 4.06MB/s]
Downloading data: 100%|██████████| 12.7M/12.7M [00:02<00:00, 6.17MB/s]
Generating train split: 100%|██████████| 52002/52002 [00:00<00:00, 1071641.16 examples/s]

DatasetDict({
    train: Dataset({
        features: ['input', 'output'],
        num_rows: 52002
    })
})





In [52]:
non_instruct_model = BasicModelRunner("meta-llama/Llama-2-7b-hf")
non_instruct_output = non_instruct_model("Tell me how to train my dog to sit")
print("Not instruction-tuned output (Llama 2 Base):")
print(textwrap.fill(non_instruct_output))

Not instruction-tuned output (Llama 2 Base):
. Tell me how to train my dog to sit. I have a 10 month old puppy and
I want to train him to sit. I have tried the treat method and the
verbal command method. I have tried both and he just doesn't seem to
get it. I have tried to get him to sit by putting my hand on his back
and pushing him down. I have tried to get him to sit by putting my
hand on his back and pushing him down. I have tried to get him to sit
by putting my hand on his back and pushing him down. I have tried to
get him to sit by putting my hand on his back and pushing him down. I
have tried to get him to sit by putting my hand on his back and
pushing him down. I have tried to get him to sit by putting my hand on
his back and pushing him down. I have tried to get him to sit by
putting my hand on his back and pushing him down. I have tried to get
him to sit by putting my hand on his back and pushing him down. I have
tried to get him to sit by putting my hand on his back and push

The output above is not good: the same sentence is repeated over and over.

In [53]:
instruct_model = BasicModelRunner("meta-llama/Llama-2-7b-chat-hf")
instruct_output = instruct_model("Tell me how to train my dog to sit")
print("Instruction-tuned output (Llama 2): ", instruct_output)

Instruction-tuned output (Llama 2):  on command.
Training a dog to sit on command is a basic obedience command that can be achieved with patience, consistency, and positive reinforcement. Here's a step-by-step guide on how to train your dog to sit on command:

1. Choose a quiet and distraction-free area: Find a quiet area with minimal distractions where your dog can focus on you.
2. Have treats ready: Choose your dog's favorite treats and have them ready to use as rewards.
3. Stand in front of your dog: Stand in front of your dog and hold a treat close to their nose.
4. Move the treat up and back: Slowly move the treat up and back, towards your dog's tail, while saying "sit" in a calm and clear voice.
5. Dog will sit: As you move the treat, your dog will naturally sit down to follow the treat. The moment their bottom touches the ground, say "good sit" and give them the treat.
6. Repeat the process: Repeat steps 3-5 several times, so your dog starts to associate the command "sit" with t

### Try smaller models

In [None]:
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

In [4]:
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
  # Tokenize
  input_ids = tokenizer.encode(
          text,
          return_tensors="pt",
          truncation=True,
          max_length=max_input_tokens
  )

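  # Generate
  # Note: max_length counts prompt tokens plus generated tokens,
  # so max_output_tokens caps the total sequence length.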
  device = model.device
  generated_tokens_with_prompt = model.generate(
    input_ids=input_ids.to(device),
    max_length=max_output_tokens
  )

  # Decode
  generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

  # Strip the prompt
  generated_text_answer = generated_text_with_prompt[0][len(text):]

  return generated_text_answer

In [5]:
finetuning_dataset_path = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_path)
print(finetuning_dataset)

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})


In [6]:
test_sample = finetuning_dataset["test"][0]
print(test_sample)

print(inference(test_sample["question"], model, tokenizer))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


{'question': 'Can Lamini generate technical documentation or user manuals for software projects?', 'answer': 'Yes, Lamini can generate technical documentation and user manuals for software projects. It uses natural language generation techniques to create clear and concise documentation that is easy to understand for both technical and non-technical users. This can save developers a significant amount of time and effort in creating documentation, allowing them to focus on other aspects of their projects.', 'input_ids': [5804, 418, 4988, 74, 6635, 7681, 10097, 390, 2608, 11595, 84, 323, 3694, 6493, 32, 4374, 13, 418, 4988, 74, 476, 6635, 7681, 10097, 285, 2608, 11595, 84, 323, 3694, 6493, 15, 733, 4648, 3626, 3448, 5978, 5609, 281, 2794, 2590, 285, 44003, 10097, 326, 310, 3477, 281, 2096, 323, 1097, 7681, 285, 1327, 14, 48746, 4212, 15, 831, 476, 5321, 12259, 247, 1534, 2408, 273, 673, 285, 3434, 275, 6153, 10097, 13, 6941, 731, 281, 2770, 327, 643, 7794, 273, 616, 6493, 15], 'attention

#### Compare to finetuned small model

In [7]:
instruction_model = AutoModelForCausalLM.from_pretrained("lamini/lamini_docs_finetuned")

In [9]:
print(textwrap.fill(inference(test_sample["question"], instruction_model, tokenizer)))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Yes, Lamini can generate technical documentation or user manuals for
software projects. This can be achieved by providing a prompt for a
specific technical question or question to the LLM Engine, or by
providing a prompt for a specific technical question or question.
Additionally, Lamini can be trained on specific technical questions or
questions to help users understand the process and provide feedback to
the LLM Engine. Additionally, Lamini


### Conclusion

The instruction-finetuned model performs well: it answers the test question directly.

# Data Preparation

`How to prepare the data for training?` That is what this section covers.

### Rules

**Better**
* Higher Quality
* Diversity
* Real
* More

**Worse**
* Lower Quality
* Homogeneity
* Generated
* Less

You have to provide high-quality data: if you feed the model garbage data, you will get garbage responses.

`Diversity` refers to the inclusion of a wide range of different elements or variations within a dataset. Diverse data is varied and contains multiple perspectives, attributes, or types of information. This diversity helps in creating more robust, comprehensive, and unbiased models or analyses.

`Homogeneity` refers to the quality or state of being uniform, similar, or composed of similar elements. In the context of data, homogeneity means that the data points are very similar to each other and lack variety or diversity. This can lead to biased models and analyses that may not generalize well to different situations or populations.

The dataset we prepare should favor `Diversity` over `Homogeneity`: a dataset whose examples keep repeating the same information is not good for finetuning.
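
One rough way to spot a homogeneous dataset is to measure how many examples survive exact-duplicate removal. The sketch below is only a heuristic under stated assumptions: the function name `unique_ratio` and the toy questions are hypothetical, and it only catches duplicates that differ in case or whitespace, not paraphrases.

```python
def unique_ratio(texts):
    """Fraction of examples left after removing case/whitespace duplicates."""
    normalized = [" ".join(text.lower().split()) for text in texts]
    return len(set(normalized)) / len(normalized) if normalized else 0.0


# Toy datasets (hypothetical) illustrating the two extremes.
homogeneous = [
    "How do I install Lamini?",
    "how do I install  lamini?",
    "How do I install Lamini?",
]
diverse = [
    "How do I install Lamini?",
    "Can I finetune on my own data?",
    "What models are supported?",
]

print(unique_ratio(homogeneous))  # close to 0 means many repeats
print(unique_ratio(diverse))      # 1.0 means no exact repeats
```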

### Steps to prepare your data

1. **Collect Instruction-Response Pairs:**
   - **Description:** Gather pairs of instructions (questions or prompts) and their corresponding responses (answers or outputs).
   - **Example:** If you're building a chatbot, an instruction-response pair could be:
     - Instruction: "What is the weather like today?"
     - Response: "The weather today is sunny with a high of 25 degrees Celsius."

2. **Concatenate Pairs (add prompt template if applicable):**
   - **Description:** Combine the instruction and response into a single string. If you have a specific template for how prompts should be structured, apply it here.
   - **Example:** If your prompt template is "Q: {instruction} A: {response}", the concatenated pair would be:
     - "Q: What is the weather like today? A: The weather today is sunny with a high of 25 degrees Celsius."

3. **Tokenize: pad, truncate:**
   - **Description:** Convert the concatenated strings into tokens that can be processed by a machine learning model. Padding and truncation are used to ensure all token sequences are the same length.
     - **Tokenization:** Splitting text into individual words, subwords, or characters and converting them into numerical representations (tokens).
     - **Padding:** Adding extra tokens (usually zeros) to shorter sequences so that all sequences in a batch have the same length.
     - **Truncation:** Cutting off longer sequences to ensure they fit within a specified maximum length.
   - **Example:** 
     - Input: "Q: What is the weather like today? A: The weather today is sunny with a high of 25 degrees Celsius."
     - Tokenized: [101, 136, 102, ...] (numbers representing the words/subwords)
     - Padded: [101, 136, 102, ..., 0, 0, 0] (extra zeros added to reach the maximum length)
     - Truncated: [101, 136, 102, ..., 256] (sequence cut off at the maximum length)

4. **Split into Train/Test:**
   - **Description:** Divide your dataset into two parts: one for training the model (train set) and one for evaluating its performance (test set).
     - **Train Set:** Used to train the model, allowing it to learn patterns from the data.
     - **Test Set:** Used to assess the model's performance on unseen data, providing an estimate of how well it will generalize to new inputs.
   - **Example:** If you have 1,000 instruction-response pairs, you might split them into 800 pairs for training and 200 pairs for testing.

By following these steps, you ensure that your data is properly formatted and ready for training a machine learning or natural language processing model. This preparation helps in building models that can understand and generate accurate responses based on given instructions.
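
The four steps can be sketched end to end in plain Python. This is an illustration under loud assumptions: the toy pairs are made up, `toy_tokenize` is a whitespace tokenizer with a vocabulary built on the fly (real tokenizers use subword vocabularies), and the pad ID of 0, the fixed length of 8, and the 80/20 split are arbitrary choices.

```python
import random

# Step 1: collect instruction-response pairs (toy data).
pairs = [(f"Question {i}?", f"Answer {i}.") for i in range(10)]

# Step 2: concatenate each pair with a prompt template.
template = "Q: {instruction} A: {response}"
texts = [template.format(instruction=q, response=a) for q, a in pairs]

# Step 3: tokenize, then pad or truncate to a fixed length.
PAD_ID = 0
vocab = {}


def toy_tokenize(text):
    """Map each whitespace-separated word to an integer ID starting at 1."""
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]


def pad_or_truncate(ids, max_length):
    """Cut sequences that are too long, right-pad those that are too short."""
    ids = ids[:max_length]
    return ids + [PAD_ID] * (max_length - len(ids))


max_length = 8
encoded = [pad_or_truncate(toy_tokenize(text), max_length) for text in texts]

# Step 4: shuffle with a fixed seed and split into train/test sets.
rng = random.Random(0)
rng.shuffle(encoded)
split = int(0.8 * len(encoded))
train, test = encoded[:split], encoded[split:]

print(len(train), len(test))
```

Shuffling before the split keeps the test set from being just the tail of the original file, which matters when the source data is ordered by topic.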

In [4]:
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [47]:
text = "Hi, how are you?"

### input_ids & attention_mask

In [48]:
encoded = tokenizer(text)#["input_ids"]
encoded

{'input_ids': [12764, 13, 849, 403, 368, 32], 'attention_mask': [1, 1, 1, 1, 1, 1]}

The output you are seeing is the result of tokenizing the text "Hi, how are you?" using a tokenizer. Here's an explanation of what each part of the output means:

#### Explanation of Tokenizer Output

```python
{'input_ids': [12764, 13, 849, 403, 368, 32], 'attention_mask': [1, 1, 1, 1, 1, 1]}
```

1. **input_ids:**
   - **Description:** This is a list of token IDs that represent the input text. Each token ID corresponds to a specific word or subword in the tokenizer's vocabulary.
   - **Example:** (this tokenizer attaches the leading space to the word that follows it)
     - "Hi" -> 12764
     - "," -> 13
     - " how" -> 849
     - " are" -> 403
     - " you" -> 368
     - "?" -> 32

2. **attention_mask:**
   - **Description:** This is a list of binary values (1s and 0s) that indicate which tokens should be attended to (1) and which should be ignored (0). In this case, all tokens are valid, so they are all marked with 1.
   - **Example:** 
     - [1, 1, 1, 1, 1, 1] means all tokens in `input_ids` are to be attended to by the model.

#### Detailed Breakdown

- **Tokenization Process:**
  - The tokenizer breaks down the input text "Hi, how are you?" into smaller units called tokens. Each token is then mapped to a unique identifier (token ID) in the tokenizer's vocabulary.
  - For example, in the vocabulary of the tokenizer you are using, "Hi" might be represented by the ID 12764, and so on for the other tokens.

- **Creating the Attention Mask:**
  - The attention mask is used by the model to focus on the actual tokens in the sequence and ignore any padding tokens (if present). Since your input text does not include padding, all values in the attention mask are 1.

#### Why This is Useful

Tokenizing text into numerical representations is a crucial step in preparing data for natural language processing (NLP) models. These numerical representations allow models to process and understand text data effectively.

If you need to further process or modify the tokenized output (e.g., padding, truncation), you can use additional parameters in the tokenizer's method. Here's an example with padding and truncation:

```python
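text = "Hi, how are you?"
# Padding needs a pad token; if the tokenizer does not define one, reuse the EOS token.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
encoded = tokenizer(text, padding='max_length', truncation=True, max_length=10)
encoded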
```

This will ensure that the output is padded to a maximum length of 10 tokens and truncated if necessary. The resulting `encoded` dictionary will also include the padding and updated attention mask.
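
To make the relationship between padding and the attention mask concrete, here is a plain-Python sketch that does not call the tokenizer. It assumes right padding and a pad ID of 0 (the EOS token ID reported in the generation warnings earlier); the helper name `pad_with_mask` is hypothetical.

```python
def pad_with_mask(input_ids, max_length, pad_id):
    """Truncate or right-pad to max_length and build the matching attention mask."""
    ids = input_ids[:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


# Token IDs for "Hi, how are you?" taken from the tokenizer output above.
ids = [12764, 13, 849, 403, 368, 32]
padded = pad_with_mask(ids, max_length=10, pad_id=0)
print(padded["input_ids"])
print(padded["attention_mask"])
```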

In [49]:
encoded_text = encoded["input_ids"]
encoded_text

[12764, 13, 849, 403, 368, 32]

In [46]:
decoded_text = tokenizer.decode(encoded_text)
print("Decoded tokens back into text: ", decoded_text)

Decoded tokens back into text:  Hi, how are you?<|endoftext|>


### Tokenize multiple texts at once

In [22]:
list_texts = ["Hi, how are you?", "I'm good", "Yes"]
encoded_texts = tokenizer(list_texts)
print("Encoded several texts: ", encoded_texts["input_ids"])

Encoded several texts:  [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175], [4374]]


In [34]:
type(encoded_texts["input_ids"])

list

In [40]:
decoded_text = "\n".join([tokenizer.decode(text_) for text_ in encoded_texts["input_ids"]])
print("Decoded tokens back into text:\n", decoded_text)

Decoded tokens back into text:
 Hi, how are you?
I'm good
Yes


### Padding and truncation

In [52]:
tokenizer.eos_token

'<|endoftext|>'

Some tokenizers, including the one used here, have no padding token by default, so a common practice is to reuse the `eos_token` (end-of-sequence token) as the `pad_token`. The usual reasons are:

#### Why Use `eos_token` as `pad_token`

1. **Consistency in Training and Inference:**
   - Using the same token for padding and indicating the end of a sequence ensures that the model treats both scenarios consistently. During training, this can help the model learn to handle padded inputs similarly to how it handles the end of a sequence, potentially improving its performance during inference.

2. **Model Requirements:**
   - Some models are designed to use the `eos_token` to signal the end of meaningful input. By using the `eos_token` as the `pad_token`, you make sure the model interprets padded tokens in a way that aligns with its architecture and training objectives.

3. **Simplifying Post-Processing:**
   - When decoding sequences, having a single token that serves both as an end-of-sequence indicator and as padding can simplify the process of cleaning up the output. This reduces the need for separate handling of `pad_token` and `eos_token` in the generated sequences.

4. **Avoiding Unnecessary Tokens:**
   - Introducing a separate `pad_token` might require additional handling in the tokenizer and model, especially if the model wasn't originally designed with a distinct padding mechanism in mind. Using the `eos_token` as the `pad_token` can avoid the complexity and potential issues associated with managing an extra token.

#### Example

Here's how you might set the `pad_token` to the `eos_token`:

```python
tokenizer.pad_token = tokenizer.eos_token
```

#### Practical Considerations

While using the `eos_token` as the `pad_token` can be beneficial, it's important to consider the specific requirements and architecture of your model:

- **Model Compatibility:** Check that padded positions are excluded through the attention mask, so the model does not treat padding as a genuine end-of-sequence signal.
- **Training Data:** Make sure your training data is prepared in a way that the model can distinguish between actual end-of-sequence tokens and padding tokens if needed.
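
One common way to handle the training-data point above is to exclude padded positions from the loss. The sketch below assumes the convention used by PyTorch's cross-entropy loss and Hugging Face models, where a label of -100 is ignored. The helper name `mask_padding_labels` and the example IDs are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token_id if mask == 1 else IGNORE_INDEX
        for token_id, mask in zip(input_ids, attention_mask)
    ]


# A real end-of-sequence token (ID 0) followed by two padding tokens (also ID 0).
example_ids = [4374, 0, 0, 0]
example_mask = [1, 1, 0, 0]
example_labels = mask_padding_labels(example_ids, example_mask)
print(example_labels)  # [4374, 0, -100, -100]
```

The attention mask, not the token ID, is what separates the genuine end-of-sequence token from the padding that shares its ID.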

#### Example in Code

Here's a full example illustrating the use of `eos_token` as `pad_token`:

```python
from transformers import AutoTokenizer

# Load a tokenizer (example: GPT-2)
tokenizer = AutoTokenizer.from_pretrained('gpt2')

# Set the pad token to be the same as the eos token
tokenizer.pad_token = tokenizer.eos_token

# Example text
text = "Hi, how are you?"

# Tokenize with padding and truncation
encoded = tokenizer(text, padding='max_length', truncation=True, max_length=10)
print(encoded)

# Output will show the input_ids and attention_mask with padding token as eos token
```

In this example, when you inspect the `encoded` dictionary, you will see that the padding token IDs are the same as the `eos_token` ID, ensuring consistent treatment of end-of-sequence and padding in your model.

In [54]:
tokenizer.pad_token = tokenizer.eos_token  # pad_token is None by default, so it must be set before padding

In [57]:
encoded_texts_longest = tokenizer(list_texts, padding=True)
print("Using padding: ", encoded_texts_longest["input_ids"])

Using padding:  [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175, 0, 0, 0], [4374, 0, 0, 0, 0, 0]]
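
With `padding=True`, every sequence in the batch is padded on the right to the length of the longest one. Here is a minimal pure-Python sketch of that behavior, assuming pad ID 0 as in the output above. The helper `pad_to_longest` is hypothetical and not part of `transformers`.

```python
def pad_to_longest(batch, pad_id):
    """Right-pad every sequence to the length of the longest sequence."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


batch = [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175], [4374]]
padded_batch = pad_to_longest(batch, pad_id=0)
print(padded_batch)
# [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175, 0, 0, 0], [4374, 0, 0, 0, 0, 0]]
```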


In [58]:
encoded_texts_truncation = tokenizer(list_texts, max_length=3, truncation=True)
print("Using truncation: ", encoded_texts_truncation["input_ids"])

Using truncation:  [[12764, 13, 849], [42, 1353, 1175], [4374]]


Truncation discards every token beyond `max_length`, so data is lost when `max_length` is set too low. Decoding the truncated IDs shows what remains:

In [61]:
decoded_text = "\n".join([tokenizer.decode(text_) for text_ in encoded_texts_truncation["input_ids"]])
print("Decoded text: ", decoded_text)


Decoded text:  Hi, how
I'm good
Yes


**Truncation on the left side**

In [62]:
tokenizer.truncation_side = "left"
encoded_texts_truncation_left = tokenizer(list_texts, max_length=3, truncation=True)
print("Using left-side truncation: ", encoded_texts_truncation_left["input_ids"])

Using left-side truncation:  [[403, 368, 32], [42, 1353, 1175], [4374]]


In [64]:
decoded_text = "\n".join([tokenizer.decode(text_) for text_ in encoded_texts_truncation_left["input_ids"]])
print("Decoded text: ", decoded_text)

Decoded text:   are you?
I'm good
Yes
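
The difference between right-side and left-side truncation comes down to which end of the token list is kept. Here is a small sketch with a hypothetical helper `truncate`; it imitates the behavior shown above and is not the library's implementation.

```python
def truncate(input_ids, max_length, side="right"):
    """Keep at most max_length tokens, dropping from the given side."""
    if len(input_ids) <= max_length:
        return list(input_ids)
    if side == "left":
        # Drop tokens from the start and keep the end of the sequence.
        return input_ids[len(input_ids) - max_length:]
    # Drop tokens from the end and keep the start of the sequence.
    return input_ids[:max_length]


sample_ids = [12764, 13, 849, 403, 368, 32]  # "Hi, how are you?"
print(truncate(sample_ids, 3, side="right"))  # [12764, 13, 849]
print(truncate(sample_ids, 3, side="left"))   # [403, 368, 32]
```

Left-side truncation is useful when the end of the text matters most, for example when the answer follows the question in a prompt.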


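`tokenizer.truncation_side` is still `"left"` at this point, so truncation keeps the last three tokens of each text before padding is applied.

In [65]: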
encoded_texts_both = tokenizer(list_texts, max_length=3, truncation=True, padding=True)
print("Using both padding and truncation: ", encoded_texts_both["input_ids"])

Using both padding and truncation:  [[403, 368, 32], [42, 1353, 1175], [4374, 0, 0]]


In [67]:
decoded_text = "\n".join([tokenizer.decode(text_) for text_ in encoded_texts_both["input_ids"]])
print("Decoded text: ", decoded_text)

Decoded text:   are you?
I'm good
Yes<|endoftext|><|endoftext|>


### Prepare instruction dataset

In [68]:
filename = "data/lamini_docs.csv"

In [71]:
instruction_dataset_df = pd.read_csv(filename)
examples = instruction_dataset_df.to_dict()

pprint(examples)

{'answer': {0: 'Lamini has documentation on Getting Started, Authentication, '
               'Question Answer Model, Python Library, Batching, Error '
               'Handling, Advanced topics, and class documentation on LLM '
               'Engine available at https://lamini-ai.github.io/.',
            1: 'Lamini can be downloaded as a python package and used in any '
               'codebase that uses python. Additionally, we provide a language '
               'agnostic REST API. We’ve seen users develop and train models '
               'in a notebook environment, and then switch over to a REST API '
               'to integrate with their production environment.',
            2: 'You can ask this model about documentation, which is trained '
               'on our publicly available docs and source code, or you can go '
               'to https://lamini-ai.github.io/.',
            3: 'Our documentation provides both real-world and toy examples of '
               'how one migh

In [74]:
if "question" in examples and "answer" in examples:
  text = examples["question"][0] + examples["answer"][0]
elif "instruction" in examples and "response" in examples:
  text = examples["instruction"][0] + examples["response"][0]
elif "input" in examples and "output" in examples:
  text = examples["input"][0] + examples["output"][0]
else:
  text = examples["text"][0]
  
print(textwrap.fill(text))

What are the different types of documents available in the repository
(e.g., installation guide, API documentation, developer's
guide)?Lamini has documentation on Getting Started, Authentication,
Question Answer Model, Python Library, Batching, Error Handling,
Advanced topics, and class documentation on LLM Engine available at
https://lamini-ai.github.io/.


In [75]:
prompt_template = """### Question:
{question}

### Answer:"""

In [76]:
num_examples = len(examples["question"])
finetuning_dataset = []
for i in range(num_examples):
  question = examples["question"][i]
  answer = examples["answer"][i]
  text_with_prompt_template = prompt_template.format(question=question)
  finetuning_dataset.append({"question": text_with_prompt_template, "answer": answer})

from pprint import pprint
print("One datapoint in the finetuning dataset:")
pprint(finetuning_dataset[0])

One datapoint in the finetuning dataset:
{'answer': 'Lamini has documentation on Getting Started, Authentication, '
           'Question Answer Model, Python Library, Batching, Error Handling, '
           'Advanced topics, and class documentation on LLM Engine available '
           'at https://lamini-ai.github.io/.',
 'question': '### Question:\n'
             'What are the different types of documents available in the '
             'repository (e.g., installation guide, API documentation, '
             "developer's guide)?\n"
             '\n'
             '### Answer:'}


### Tokenize a single example

In [80]:
print(finetuning_dataset[0]["question"])
print(finetuning_dataset[0]["answer"])

### Question:
What are the different types of documents available in the repository (e.g., installation guide, API documentation, developer's guide)?

### Answer:
Lamini has documentation on Getting Started, Authentication, Question Answer Model, Python Library, Batching, Error Handling, Advanced topics, and class documentation on LLM Engine available at https://lamini-ai.github.io/.


In [98]:
text = finetuning_dataset[0]["question"] + finetuning_dataset[0]["answer"]
tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    padding=True
)
print(tokenized_inputs["input_ids"])

[[ 4118 19782    27   187  1276   403   253  1027  3510   273  7177  2130
    275   253 18491   313    70    15    72   904 12692  7102    13  8990
  10097    13 13722   434  7102  6177   187   187  4118 37741    27    45
   4988    74   556 10097   327 27669 11075   264    13  5271 23058    13
  19782 37741 10031    13 13814 11397    13   378 16464    13 11759 10535
   1981    13 21798 12989    13   285   966 10097   327 21708    46 10797
   2130   387  5987  1358    77  4988    74    14  2284    15  7280    15
    900 14206]]


In [87]:
decoded_text = tokenizer.decode(tokenized_inputs["input_ids"][0])
print(decoded_text)

### Question:
What are the different types of documents available in the repository (e.g., installation guide, API documentation, developer's guide)?

### Answer:Lamini has documentation on Getting Started, Authentication, Question Answer Model, Python Library, Batching, Error Handling, Advanced topics, and class documentation on LLM Engine available at https://lamini-ai.github.io/.


In [94]:
tokenized_inputs["input_ids"]

array([[ 4118, 19782,    27,   187,  1276,   403,   253,  1027,  3510,
          273,  7177,  2130,   275,   253, 18491,   313,    70,    15,
           72,   904, 12692,  7102,    13,  8990, 10097,    13, 13722,
          434,  7102,  6177,   187,   187,  4118, 37741,    27,    45,
         4988,    74,   556, 10097,   327, 27669, 11075,   264,    13,
         5271, 23058,    13, 19782, 37741, 10031,    13, 13814, 11397,
           13,   378, 16464,    13, 11759, 10535,  1981,    13, 21798,
        12989,    13,   285,   966, 10097,   327, 21708,    46, 10797,
         2130,   387,  5987,  1358,    77,  4988,    74,    14,  2284,
           15,  7280,    15,   900, 14206]])

In [107]:
max_length = 2048
max_length = min(
    tokenized_inputs["input_ids"].shape[1],
    max_length,
)

In [110]:
tokenized_inputs = tokenizer(
    text,
    return_tensors="np",
    truncation=True,
    max_length=max_length
)
tokenized_inputs["input_ids"]

array([[ 4118, 19782,    27,   187,  1276,   403,   253,  1027,  3510,
          273,  7177,  2130,   275,   253, 18491,   313,    70,    15,
           72,   904, 12692,  7102,    13,  8990, 10097,    13, 13722,
          434,  7102,  6177,   187,   187,  4118, 37741,    27,    45,
         4988,    74,   556, 10097,   327, 27669, 11075,   264,    13,
         5271, 23058,    13, 19782, 37741, 10031,    13, 13814, 11397,
           13,   378, 16464,    13, 11759, 10535,  1981,    13, 21798,
        12989,    13,   285,   966, 10097,   327, 21708,    46, 10797,
         2130,   387,  5987,  1358,    77,  4988,    74,    14,  2284,
           15,  7280,    15,   900, 14206]])

### Tokenize the instruction dataset

In [137]:
def tokenize_function(examples):
    # With batched=True and batch_size=1, each column is a one-element list.
    # A missing column raises a KeyError instead of being silently printed.
    if "question" in examples and "answer" in examples:
        text = str(examples["question"][0]) + str(examples["answer"][0])
    elif "input" in examples and "output" in examples:
        text = examples["input"][0] + examples["output"][0]
    else:
        text = examples["text"][0]

    # The tokenizer has no pad token by default, so reuse the EOS token.
    tokenizer.pad_token = tokenizer.eos_token
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        padding=True,
    )

    # Cap the sequence length at 2048 tokens.
    max_length = min(
        tokenized_inputs["input_ids"].shape[1],
        2048
    )
    # Truncate from the left so the end of the text (the answer) is kept.
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=max_length
    )

    return tokenized_inputs

In [139]:

finetuning_dataset_loaded = load_dataset('csv', data_files='data/lamini_docs.csv', split="train")


tokenized_dataset = finetuning_dataset_loaded.map(
    tokenize_function,
    batched=True,
    batch_size=1,
    drop_last_batch=True
)

print(tokenized_dataset)

Map: 100%|██████████| 1400/1400 [00:01<00:00, 1244.18 examples/s]

Dataset({
    features: ['question', 'answer', 'input_ids', 'attention_mask'],
    num_rows: 1400
})
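
`tokenize_function` indexes `[0]` because, with `batched=True`, `map` passes each batch as a dictionary of column lists, and `batch_size=1` makes each list one element long. The sketch below imitates that batching in plain Python. `iter_batches` and `toy_columns` are hypothetical names, and this is not how `datasets` is implemented internally.

```python
def iter_batches(columns, batch_size, drop_last_batch=False):
    """Yield column-wise slices, the shape a batched map function receives."""
    num_rows = len(next(iter(columns.values())))
    for start in range(0, num_rows, batch_size):
        stop = start + batch_size
        if drop_last_batch and stop > num_rows:
            break  # skip an incomplete final batch
        yield {name: values[start:stop] for name, values in columns.items()}


toy_columns = {"question": ["Q1", "Q2", "Q3"], "answer": ["A1", "A2", "A3"]}
texts = [
    batch["question"][0] + batch["answer"][0]
    for batch in iter_batches(toy_columns, batch_size=1)
]
print(texts)  # ['Q1A1', 'Q2A2', 'Q3A3']
```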





In [143]:
tokenized_dataset = tokenized_dataset.add_column("labels", tokenized_dataset["input_ids"])

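ValueError: The table can't have duplicated columns but columns ['labels'] are duplicated.

This error comes from running the cell a second time: the `labels` column had already been added by the first run, as the next cell confirms.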

In [144]:
tokenized_dataset

Dataset({
    features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 1400
})
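
The `labels` column is a copy of `input_ids` because a causal language model is trained to predict each next token from the preceding ones. Hugging Face causal LM models shift the labels by one position internally when computing the loss. Here is a small sketch of the resulting input/target pairs, using a hypothetical helper `next_token_pairs`:

```python
def next_token_pairs(input_ids):
    """Pair each input token with the token the model should predict next."""
    labels = list(input_ids)  # labels start as a plain copy of input_ids
    return list(zip(input_ids[:-1], labels[1:]))


pairs = next_token_pairs([12764, 13, 849, 403])
print(pairs)  # [(12764, 13), (13, 849), (849, 403)]
```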

### Prepare test/train splits


In [145]:
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, shuffle=True, seed=123)
print(split_dataset)

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})
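
The split sizes above follow from `test_size=0.1` applied to 1400 rows. The sketch below reproduces the sizes with a seeded shuffle in plain Python. `split_indices` is a hypothetical helper, and its shuffled order will not match the one produced by `train_test_split`, which uses its own random number generator.

```python
import random


def split_indices(num_rows, test_size, seed):
    """Shuffle row indices with a fixed seed and split off a test fraction."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    num_test = int(round(num_rows * test_size))  # simplified rounding rule
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = split_indices(1400, test_size=0.1, seed=123)
print(len(train_idx), len(test_idx))  # 1260 140
```

Fixing the seed makes the split reproducible, so the same rows land in the test set on every run.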


### Some datasets for you to try

In [146]:
finetuning_dataset_path = "lamini/lamini_docs"
finetuning_dataset = datasets.load_dataset(finetuning_dataset_path)
print(finetuning_dataset)

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})


In [147]:
taylor_swift_dataset = "lamini/taylor_swift"
bts_dataset = "lamini/bts"
open_llms = "lamini/open_llms"

In [148]:
dataset_swiftie = datasets.load_dataset(taylor_swift_dataset)
print(dataset_swiftie["train"][1])

Downloading readme: 100%|██████████| 573/573 [00:00<00:00, 6.43MB/s]
Downloading data: 100%|██████████| 257k/257k [00:01<00:00, 137kB/s]
Downloading data: 100%|██████████| 46.3k/46.3k [00:01<00:00, 34.9kB/s]
Generating train split: 100%|██████████| 783/783 [00:00<00:00, 66277.98 examples/s]
Generating test split: 100%|██████████| 87/87 [00:00<00:00, 58328.72 examples/s]

{'question': 'What is the most popular Taylor Swift song among millennials? How does this song relate to the millennial generation? What is the significance of this song in the millennial culture?', 'answer': 'Taylor Swift\'s "Shake It Off" is the most popular song among millennials. This song relates to the millennial generation as it is an anthem of self-acceptance and embracing one\'s individuality. The song\'s message of not letting others bring you down and to just dance it off resonates with the millennial culture, which is often characterized by a strong sense of individuality and a rejection of societal norms. Additionally, the song\'s upbeat and catchy melody makes it a perfect fit for the millennial generation, which is known for its love of pop music.', 'input_ids': [1276, 310, 253, 954, 4633, 11276, 24619, 4498, 2190, 24933, 8075, 32, 1359, 1057, 436, 4498, 14588, 281, 253, 24933, 451, 5978, 32, 1737, 310, 253, 8453, 273, 436, 4498, 275, 253, 24933, 451, 4466, 32, 37979, 24




# Training Process