In this tutorial, I'll demonstrate how to fine-tune a pretrained model to perform complex tasks such as reasoning and text completion.

In [22]:
# Getting started with Lamini
import itertools
from pprint import pprint

import datasets
import jsonlines
import lamini
import pandas as pd
import torch
from datasets import load_dataset
# AutoTokenizer was imported twice in the original cell; one import is enough
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

In [23]:

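import os

# Read the API key from an environment variable instead of hard-coding it in the notebook.
# LAMINI_API_KEY is an assumed variable name; set it before starting the notebook.
lamini.api_key = os.environ["LAMINI_API_KEY"]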

In [24]:
llm_nf = lamini.Lamini("meta-llama/Llama-2-7b-hf")

In [25]:
# The base (non-fine-tuned) model performs very poorly on these prompts
print(llm_nf.generate("Tell me how to train my dog to sit"))
print("XXXXXXXXXX")
print(llm_nf.generate("What do you think of mars"))
print("XXXXXXXXXX")
print(llm_nf.generate("""Agent: I'm here to help you with your Amazon deliver order.
Customer: I didn't get my item
Agent: I'm sorry to hear that. Which item was it?
Customer: the blanket
Agent:"""))

.
Tell me how to train my dog to sit. I have a 10 month old puppy and I want to train him to sit. I have tried the treat method and he just sits there and looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" command and he just looks at me like I am crazy. I have tried the "sit" c

In [26]:
# Let's compare with a chat model that someone else has already fine-tuned
llm_f = lamini.Lamini("meta-llama/Llama-2-7b-chat-hf")

In [27]:
# Far better results than the non-fine-tuned model
print(llm_f.generate("[INST]Tell me how to train my dog to sit[/INST]"))
print("XXXXXXXXXX")
print(llm_f.generate("What do you think of the planet mars"))
print("XXXXXXXXXX")
print(llm_f.generate("""Agent: I'm here to help you with your Amazon deliver order.
Customer: I didn't get my item
Agent: I'm sorry to hear that. Which item was it?
Customer: the blanket
Agent:"""))

 Training your dog to sit is a basic obedience command that can be achieved with patience, consistency, and positive reinforcement. Here's a step-by-step guide on how to train your dog to sit:

1. Choose a quiet and distraction-free area: Find a quiet area with no distractions where your dog can focus on you.
2. Have treats ready: Choose your dog's favorite treats and have them ready to use as rewards.
3. Stand in front of your dog: Stand in front of your dog and hold a treat close to their nose.
4. Move the treat up and back: Slowly move the treat up and back, towards your dog's tail, while saying "sit" in a calm and clear voice.
5. Dog will sit: As you move the treat, your dog will naturally sit down to follow the treat. The moment their bottom touches the ground, say "good sit" and give them the treat.
6. Repeat the process: Repeat steps 3-5 several times, so your dog starts to associate the command "sit" with the action of sitting down.
7. Gradually phase out the treats: As your do
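
The first prompt above wraps the instruction in [INST] ... [/INST], which is the turn marker used by the Llama 2 chat models. As a minimal sketch, the wrapping can be done by a small helper. The function name format_llama2_chat is hypothetical, the system-prompt layout follows Meta's published Llama 2 chat format as I understand it, and whether Lamini adds any wrapping of its own on the server side is not something this notebook shows.

```python
def format_llama2_chat(user_message, system_prompt=None):
    """Wrap a single user turn in Llama 2 chat markers.

    Hypothetical helper for illustration. The beginning-of-sequence token is
    normally added by the tokenizer, so it is not included here.
    """
    if system_prompt:
        return (
            f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )
    return f"[INST] {user_message} [/INST]"


prompt = format_llama2_chat("Tell me how to train my dog to sit")
print(prompt)
```

The resulting string can be passed to llm_f.generate in place of the hand-written prompt.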

In [28]:
# Let's explore the dataset that we want to fine-tune our model on.
# Here we use the Alpaca dataset from tatsu-lab.
# Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine.
# This instruction data can be used for instruction tuning, which makes a language model follow instructions better.
pretrained_dataset = load_dataset("tatsu-lab/alpaca", split="train", streaming=True)

In [29]:
print(pretrained_dataset.column_names)

['instruction', 'input', 'output', 'text']


Here the 'text' column is the default prompt template provided by the makers of the dataset, but we will build our own prompt, since we like to work more :)

In [30]:
count=0
for _ in pretrained_dataset:
    count+=1
print(count)

52002
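
With streaming=True the dataset is consumed as an iterable rather than loaded into memory, which is why the rows are counted with a loop. The sketch below shows the same counting and "take the first n" pattern on a stand-in object. FakeStream is a hypothetical class written only for this illustration; it mimics an iterable that can be iterated more than once and is not part of the datasets library.

```python
import itertools


class FakeStream:
    """Hypothetical stand-in for a streamed dataset: rows are produced lazily."""

    def __init__(self, n_rows):
        self.n_rows = n_rows

    def __iter__(self):
        # Each call starts a fresh pass over the rows.
        for i in range(self.n_rows):
            yield {"instruction": f"task {i}", "input": "", "output": f"answer {i}"}


stream = FakeStream(5)

# Count rows without materializing them in a list.
row_count = sum(1 for _ in stream)

# Take only the first three rows; the rest are never generated.
top_n = list(itertools.islice(stream, 3))

print(row_count)
print(top_n[0])
```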


In [31]:
n = 3
print("Pretrained dataset:")
top_n = list(itertools.islice(pretrained_dataset, n))
for i in top_n:
  print(i)
  print()
# Each record essentially consists of an instruction and a response

Pretrained dataset:
{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}

{'instruction': 'What are the three primary colors?', 'input': '', 'output': 'The three primary colors are red, blue, and yellow.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat are the three primary

Below are the prompt templates that we can use, one for data points that have an input and one for those that do not.

We will build each prompt according to our use case.

In [32]:
prompt_template_with_input = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""

In [33]:
prompt_template_without_input = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""

In [34]:
processed_data = []
for j in top_n:
  if j["input"]:  # the data point has an input, so use the template that includes it
    processed_prompt = prompt_template_with_input.format(instruction=j["instruction"], input=j["input"])
  else:  # no input, so use the template that works on the instruction alone
    processed_prompt = prompt_template_without_input.format(instruction=j["instruction"])

  processed_data.append({"input": processed_prompt, "output": j["output"]})


In [35]:
# Each record holds the prompt the model receives (input) and the response it is expected to produce (output).
# This is what goes into the model.
pprint(processed_data[1])

{'input': 'Below is an instruction that describes a task. Write a response '
          'that appropriately completes the request.\n'
          '\n'
          '### Instruction:\n'
          'What are the three primary colors?\n'
          '\n'
          '### Response:',
 'output': 'The three primary colors are red, blue, and yellow.'}
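
The template selection above can also be packaged as a reusable function, which makes it easy to apply to the whole dataset and to check on a couple of hand-written records. This is a minimal sketch: the names build_prompt and to_training_pair are hypothetical, and the templates are copied from the cells above so the block runs on its own.

```python
PROMPT_WITH_INPUT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""

PROMPT_WITHOUT_INPUT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""


def build_prompt(example):
    """Pick the template based on whether the record has a non-empty input."""
    if example.get("input"):
        return PROMPT_WITH_INPUT.format(
            instruction=example["instruction"], input=example["input"]
        )
    return PROMPT_WITHOUT_INPUT.format(instruction=example["instruction"])


def to_training_pair(example):
    """Return the {"input", "output"} record used for fine-tuning."""
    return {"input": build_prompt(example), "output": example["output"]}


# Hand-written records for illustration only
no_input = {"instruction": "What are the three primary colors?", "input": "", "output": "Red, blue, and yellow."}
with_input = {"instruction": "Sort the numbers.", "input": "3, 1, 2", "output": "1, 2, 3"}

print(to_training_pair(no_input)["input"])
print(to_training_pair(with_input)["input"])
```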


Save the data to JSONL

In [36]:
with jsonlines.open('alpaca_processed.jsonl', 'w') as writer:
    writer.write_all(processed_data)
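
JSONL simply stores one JSON object per line, so a saved file can be checked by reading it back. The sketch below does a round trip using only the standard library json module instead of jsonlines, on two made-up records written to a temporary directory. The helper names write_jsonl and read_jsonl are hypothetical.

```python
import json
import os
import tempfile

# Made-up records in the same {"input", "output"} shape as processed_data
records = [
    {"input": "Prompt A", "output": "Answer A"},
    {"input": "Prompt B", "output": "Answer B"},
]


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.jsonl")
    write_jsonl(path, records)
    loaded = read_jsonl(path)

print(len(loaded), "records read back")
```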

In [37]:
# lamini/alpaca and tatsu-lab/alpaca contain the same 52,002 data points.
# lamini/alpaca is just the already-structured form, with input/output pairs.
dataset_path_hf = "lamini/alpaca"
dataset_hf = load_dataset(dataset_path_hf)
pprint(dataset_hf)

DatasetDict({
    train: Dataset({
        features: ['input', 'output'],
        num_rows: 52002
    })
})


Example demonstrating how to run a single example through a model at inference time.

In [59]:
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")



In [60]:
# These are much larger models and take a long time to download and run.
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

Below is an inference function that shows how a single data point is passed to a model, whether fine-tuned or not.
This is not training code; it only shows how the model handles an input at inference time.

In [64]:
def inference(text, model, tokenizer, max_input_tokens=1000, max_output_tokens=100):
  # Tokenize
  input_ids = tokenizer.encode(
          text,
          return_tensors="pt",  # return PyTorch tensors
          truncation=True,
          max_length=max_input_tokens
  )

  # Generate
  device = model.device
  generated_tokens_with_prompt = model.generate(
    input_ids=input_ids.to(device),
    max_length=max_output_tokens
  )

  # Decode
  generated_text_with_prompt = tokenizer.batch_decode(generated_tokens_with_prompt, skip_special_tokens=True)

  # Strip the prompt so that only the newly generated text is returned
  generated_text_answer = generated_text_with_prompt[0][len(text):]

  return generated_text_answer
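
When the function above is called, transformers may warn that the attention mask and pad token id were not set. A possible variant that addresses this is sketched below: it calls the tokenizer directly (which returns an attention mask along with the input ids), passes pad_token_id explicitly, and uses max_new_tokens so the limit applies to generated tokens only rather than to prompt plus output. The name inference_with_mask is hypothetical, and this is one way to do it under those assumptions, not the notebook's original method.

```python
def inference_with_mask(text, model, tokenizer, max_input_tokens=1000, max_new_tokens=100):
    # Calling the tokenizer directly returns both input_ids and attention_mask
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens,
    )
    input_ids = inputs["input_ids"].to(model.device)
    attention_mask = inputs["attention_mask"].to(model.device)

    # max_new_tokens limits only the generated part, not prompt + output
    output_ids = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )

    # Keep only the tokens generated after the prompt, then decode them
    new_tokens = output_ids[0][input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

It takes the same model and tokenizer objects as inference, for example inference_with_mask(test_sample["question"], model, tokenizer).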

In [65]:
finetuning_dataset_path = "lamini/lamini_docs"
finetuning_dataset = load_dataset(finetuning_dataset_path)
print(finetuning_dataset)

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})


In [66]:
test_sample = finetuning_dataset["test"][0]
print(test_sample)

pprint(inference(test_sample["question"], model, tokenizer))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


{'question': 'Can Lamini generate technical documentation or user manuals for software projects?', 'answer': 'Yes, Lamini can generate technical documentation and user manuals for software projects. It uses natural language generation techniques to create clear and concise documentation that is easy to understand for both technical and non-technical users. This can save developers a significant amount of time and effort in creating documentation, allowing them to focus on other aspects of their projects.', 'input_ids': [5804, 418, 4988, 74, 6635, 7681, 10097, 390, 2608, 11595, 84, 323, 3694, 6493, 32, 4374, 13, 418, 4988, 74, 476, 6635, 7681, 10097, 285, 2608, 11595, 84, 323, 3694, 6493, 15, 733, 4648, 3626, 3448, 5978, 5609, 281, 2794, 2590, 285, 44003, 10097, 326, 310, 3477, 281, 2096, 323, 1097, 7681, 285, 1327, 14, 48746, 4212, 15, 831, 476, 5321, 12259, 247, 1534, 2408, 273, 673, 285, 3434, 275, 6153, 10097, 13, 6941, 731, 281, 2770, 327, 643, 7794, 273, 616, 6493, 15], 'attention

In [43]:
instruction_model = AutoModelForCausalLM.from_pretrained("lamini/lamini_docs_finetuned")

In [45]:
pprint(inference(test_sample["question"], instruction_model, tokenizer))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


('Yes, Lamini can generate technical documentation or user manuals for '
 'software projects. This can be achieved by providing a prompt for a specific '
 'technical question or question to the LLM Engine, or by providing a prompt '
 'for a specific technical question or question. Additionally, Lamini can be '
 'trained on specific technical questions or questions to help users '
 'understand the process and provide feedback to the LLM Engine. Additionally, '
 'Lamini')


In [53]:
# Use a separate name so the Hugging Face model loaded earlier is not overwritten
llm_small = lamini.Lamini("EleutherAI/pythia-70m")

In [55]:
pprint(llm_small.generate("How to train my dog?"))

('\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able to train my dog.\n"
 '\n'
 'A:\n'
 '\n'
 "I'm not sure if I'm going to be able t