In [1]:
!pip install datasets
!pip install transformers -U
!pip install accelerate -U
!pip install trl

Collecting trl
  Downloading trl-0.26.2-py3-none-any.whl.metadata (11 kB)
Downloading trl-0.26.2-py3-none-any.whl (518 kB)
Installing collected packages: trl
Successfully installed trl-0.26.2


# Because the cell below prints cuda, PyTorch code in this Google Colab session can run on the GPU instead of the CPU.

*   Any tensor or model you send to the device (with .to(device)) is stored in GPU memory

*   Training (including LLM fine-tuning) then runs with CUDA acceleration

*   Computations are much faster than on the CPU, which is essential for the deep learning and GenAI work that AI/data roles involve

In [2]:
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [3]:
print(device)


cuda


# This dataset will allow us to fine-tune GPT-2 into a model that can perform Q&A (a chatbot)

GPT-2 is still in its pretrained state. It is not as advanced as ChatGPT: right now you can't hold a conversation with it, because it is only a text-completion model.

In [4]:
from datasets import load_dataset
DATASET_NAME = "mlabonne/guanaco-llama2-1k"
dataset = load_dataset(DATASET_NAME)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [5]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1000
    })
})


In [6]:
training_dataset = dataset['train']
print(training_dataset)

Dataset({
    features: ['text'],
    num_rows: 1000
})


The dataset is multilingual, so the model will also learn from languages other than English.

In [7]:
training_dataset[0]

{'text': '<s>[INST] Me gradué hace poco de la carrera de medicina ¿Me podrías aconsejar para conseguir rápidamente un puesto de trabajo? [/INST] Esto vale tanto para médicos como para cualquier otra profesión tras finalizar los estudios aniversarios y mi consejo sería preguntar a cuántas personas haya conocido mejor. En este caso, mi primera opción sería hablar con otros profesionales médicos, echar currículos en hospitales y cualquier centro de salud. En paralelo, trabajaría por mejorar mi marca personal como médico mediante un blog o formas digitales de comunicación como los vídeos. Y, para mejorar las posibilidades de encontrar trabajo, también participaría en congresos y encuentros para conseguir más contactos. Y, además de todo lo anterior, seguiría estudiando para presentarme a las oposiciones y ejercer la medicina en el sector público de mi país. </s>'}

Each example is formatted as a question-and-answer pair: the question sits between [INST] and [/INST], and the answer follows it.
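To make the question-and-answer layout concrete, here is a minimal sketch that splits one example into its two halves and wraps a new question in the same template. The template is inferred from the samples printed in this notebook, the helper names `split_example` and `build_prompt` are hypothetical, and the sample string is made up for illustration. It only handles a single question followed by a single answer.

```python
import re

# Assumed template, inferred from the printed samples:
# "<s>[INST] question [/INST] answer </s>"
PATTERN = re.compile(r"<s>\[INST\]\s*(.*?)\s*\[/INST\]\s*(.*?)\s*</s>", re.DOTALL)

def split_example(text):
    """Return (question, answer) for a single-turn example, or None if it does not match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)

def build_prompt(question):
    """Wrap a new question in the same template, leaving the answer open for the model."""
    return f"<s>[INST] {question} [/INST]"

# Made-up sample, not taken from the dataset
sample = "<s>[INST] What is a token? [/INST] A token is a small piece of text. </s>"
question, answer = split_example(sample)
print(question)
print(answer)
print(build_prompt("how are you"))
```

A prompt built this way matches what the model sees during fine-tuning, which may help when testing the fine-tuned model later.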

In [8]:
training_dataset[11]

{'text': '<s>[INST] write me a 1000 words essay about deez nuts. [/INST] The Deez Nuts meme first gained popularity in 2015 on the social media platform Vine. The video featured a young man named Rodney Bullard, who recorded himself asking people if they had heard of a particular rapper. When they responded that they had not, he would respond with the phrase "Deez Nuts" and film their reactions. The video quickly went viral, and the phrase became a popular meme. \n\nSince then, Deez Nuts has been used in a variety of contexts to interrupt conversations, derail discussions, or simply add humor to a situation. It has been used in internet memes, in popular music, and even in politics. In the 2016 US presidential election, a 15-year-old boy named Brady Olson registered as an independent candidate under the name Deez Nuts. He gained some traction in the polls and even made appearances on national news programs.\n\nThe Deez Nuts meme has had a significant impact on popular culture. It has b

In [9]:
training_dataset[:5]

{'text': ['<s>[INST] Me gradué hace poco de la carrera de medicina ¿Me podrías aconsejar para conseguir rápidamente un puesto de trabajo? [/INST] Esto vale tanto para médicos como para cualquier otra profesión tras finalizar los estudios aniversarios y mi consejo sería preguntar a cuántas personas haya conocido mejor. En este caso, mi primera opción sería hablar con otros profesionales médicos, echar currículos en hospitales y cualquier centro de salud. En paralelo, trabajaría por mejorar mi marca personal como médico mediante un blog o formas digitales de comunicación como los vídeos. Y, para mejorar las posibilidades de encontrar trabajo, también participaría en congresos y encuentros para conseguir más contactos. Y, además de todo lo anterior, seguiría estudiando para presentarme a las oposiciones y ejercer la medicina en el sector público de mi país. </s>',
  '<s>[INST] Самый великий человек из всех живших на планете? [/INST] Для начала нужно выбрать критерии величия человека. Обыч

# When I want to fine-tune a model in the future, I will use a block of code similar to this

Explanation of the code below:

Block 1: Loads a pretrained DistilGPT-2 from the Transformers ecosystem onto the best available accelerator, and configures its tokenizer so sequences can be padded

Block 2: Takes the model's text generation settings and updates them so it knows when to stop, how much to generate, and how to sample the next token with creativity and filtering rules

*   distilgpt2 is already pretrained (the model weights exist), so we download the model in its pretrained state
*   EOS token = a special marker that tells the model the text has finished
*   Padding adds filler tokens so that all text inputs have the same length and the model can process them in batches without breaking
*   GPT-2 and chatbots are a type of language model called a causal LM because they generate one token at a time. "Causal" because each token is predicted only from the tokens that came before it:
1.   My
2.   My name
3.   My name is
4.   My name is Nathan
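The padding idea from the list above can be sketched in plain Python. This is only an illustration of what right-padding does, not the tokenizer's actual implementation: the helper `pad_batch` and the token ids in `batch` are hypothetical, while 50256 is the EOS id that this notebook reuses as the pad id. The attention mask marks which positions hold real tokens (1) and which hold filler (0).

```python
PAD_ID = 50256  # GPT-2's end-of-text id, reused as the pad id in this notebook

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad lists of token ids to a common length and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        filler = longest - len(seq)
        input_ids.append(seq + [pad_id] * filler)              # filler goes at the end
        attention_mask.append([1] * len(seq) + [0] * filler)   # 0 = ignore this position
    return input_ids, attention_mask

# Hypothetical token ids, not real GPT-2 encodings
batch = [[11, 22, 33, 44], [55, 66]]
ids, mask = pad_batch(batch)
print(ids)   # [[11, 22, 33, 44], [55, 66, 50256, 50256]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Because the pad id and the EOS id are the same number here, the mask is the only way to tell filler apart from a real end-of-text token.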





In [10]:
MODEL_NAME = "distilgpt2"
import transformers  # now I can load Hugging Face models
from transformers import AutoModelForCausalLM  # loads any causal LM by name
from transformers import AutoTokenizer  # loads the matching tokenizer by name


# Downloads and loads the pretrained LLM; device_map="auto" places it on the GPU when one is available
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")
# The model is called repeatedly with a growing input (My, My name, My name is, ...).
# Caching the results for earlier tokens means they are not recomputed on every call.
model.config.use_cache = True
# trust_remote_code=True allows custom tokenizer code shipped by a model's creator to run;
# distilgpt2 uses the standard GPT-2 tokenizer, so the flag is not expected to matter here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token, so reuse EOS
tokenizer.padding_side = 'right'  # add filler at the end of the sequence, not the beginning
tokenizer.pad_token_id = tokenizer.eos_token_id

generation_configuration = model.generation_config  # reference to the model's current generation settings so I can edit them
generation_configuration.pad_token_id = tokenizer.eos_token_id  # sets the padding token to be EOS
generation_configuration.eos_token_id = tokenizer.eos_token_id  # the token that means "stop generating here"
# The settings below control how we sample from the model
# Upper limit on newly generated tokens. GPT-2's context window is 1024 tokens,
# so prompt plus generated tokens must stay within that
generation_configuration.max_new_tokens = 1024
generation_configuration.do_sample = True  # sample from the distribution instead of always taking the most likely token
generation_configuration.temperature = 0.7  # logits are divided by the temperature: lower = blander, higher = more varied
generation_configuration.top_p = 0.9  # keep only the most likely tokens whose cumulative probability reaches 0.9
generation_configuration.top_k = 20  # keep only the 20 most likely tokens

# Example of next-token scores to sample from: [0.2, 0.23, 0.79, 0.5,..., 0.04, 0.16]


# We are ready to generate text now

This function takes a text prompt, converts it into model-readable tokens on your GPU, generates a continuation using sampling, decodes it back into text, and prints the result.
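Before the function itself, here is a rough sketch of what "generating with sampling" means under the settings configured earlier (temperature 0.7, top-k 20, top-p 0.9). It is a simplified plain-Python illustration, not the Transformers implementation: the function name `sample_next_token` and the scores in `toy_logits` are hypothetical, and the real library works on tensors over the whole vocabulary.

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_k=20, top_p=0.9, rng=random):
    """Pick one token index from raw scores using temperature, top-k and top-p filtering."""
    # Temperature: divide every score before the softmax (lower = sharper distribution)
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    weights = [math.exp(score - peak) for score in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-k: keep only the k most likely tokens
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    top_total = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalised cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index] / top_total
        if cumulative >= top_p:
            break

    # Draw one of the surviving tokens in proportion to its probability
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical scores for a 5-token vocabulary
rng = random.Random(0)
draws = [sample_next_token(toy_logits, rng=rng) for _ in range(1000)]
print({token: draws.count(token) for token in sorted(set(draws))})
```

With these toy scores the two most likely tokens already cover more than 90% of the probability mass, so the remaining tokens are filtered out and never drawn.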

In [11]:
def generate(prompt):
  # Encode the string as integer token IDs the model can read, returned as a PyTorch tensor on the chosen device.
  # add_special_tokens=True lets the tokenizer add any special tokens it defines by default; no padding is applied here.
  encoded = tokenizer.encode(
      prompt,
      add_special_tokens=True,
      return_tensors="pt"
      ).to(device)
  # Sample a continuation so responses are more varied; repetition_penalty=2.0 discourages repeated tokens.
  # No attention_mask is passed, which is why generate() warns about it on the first call.
  out = model.generate(
      input_ids=encoded,
      repetition_penalty=2.0,
      do_sample=True
      )
  # Decode the generated token IDs back into text we can read
  string_decoded = tokenizer.decode(out[0].tolist(), clean_up_tokenization_spaces=True)
  print(string_decoded)
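The repetition_penalty=2.0 argument used in generate can be pictured with a small sketch. This is a simplified illustration of the usual rule (scores of tokens that already appeared are pushed down), not the library's code: the helper `apply_repetition_penalty` and the toy numbers are hypothetical, and the real implementation works on tensors and also counts the tokens of the prompt.

```python
def apply_repetition_penalty(logits, seen_ids, penalty=2.0):
    """Make tokens that already appeared less likely by lowering their scores."""
    adjusted = list(logits)
    for token_id in set(seen_ids):
        score = adjusted[token_id]
        # Positive scores are divided and negative scores multiplied, so both move downward
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

toy_logits = [4.0, 1.0, -2.0, 0.5]  # hypothetical scores for a 4-token vocabulary
already_seen = [0, 2, 0]            # tokens 0 and 2 already appeared in the text
print(apply_repetition_penalty(toy_logits, already_seen))
# [2.0, 1.0, -4.0, 0.5]
```

A penalty of 2.0 is fairly strong, which may be one reason the samples below drift between topics rather than repeat themselves.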

In [12]:
generate('this is')

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


this is a very cool product and I would love to hear your feedback!
So far, it has been my goal for the past few weeks. My first game was an old-school strategy RPG with lots of fun elements in mind but also some really interesting gameplay mechanics that were not part or parcel of this project (I hope you enjoyed them!). The story takes place during World War II where all four main heroes have their own army on board together - they are determined by one another's interests as well so there will be many different types available throughout each campaign which could make things even more challenging if we decided at any time between now and September 1st 2017:<|endoftext|>


In [13]:
generate('how are you')

how are you going to be in the first place?※<|endoftext|>



If I ask ChatGPT "how are you" right now, it answers something like: "I am good, but I am an LLM with no feelings..."

Right now GPT-2 is just completing the sentence and generating more text.

Now we need to fine-tune our model so it functions as a chatbot.

In [14]:
from trl import SFTConfig, SFTTrainer

# set up your training arguments
training_args = SFTConfig(
    gradient_accumulation_steps=1,  # update the weights after every batch
    num_train_epochs=1,  # how many passes over the entire dataset; fine-tuning does not need many
    learning_rate=2e-4,
    fp16=True,  # mixed-precision training
    output_dir="logs",
    lr_scheduler_type="cosine",  # vary the learning rate during training, decaying along a cosine curve
    warmup_ratio=0.05,  # ramp the learning rate up over the first 5% of steps
    group_by_length=True,  # batch examples of similar length together to reduce padding
    max_length=512,  # truncate tokenized examples to 512 tokens
)

# initialize the SFTTrainer
trainer = SFTTrainer(
    model=model,
    train_dataset=training_dataset,
    processing_class=tokenizer,
    args=training_args,
)

Token indices sequence length is longer than the specified maximum sequence length for this model (1229 > 1024). Running this sequence through the model will result in indexing errors

The model is already on multiple devices. Skipping the move to device specified in `args`.
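The learning-rate settings in the training arguments (learning_rate=2e-4, lr_scheduler_type="cosine", warmup_ratio=0.05) can be pictured with a small sketch. It assumes the common "linear warmup, then cosine decay to zero" shape, a total of 125 optimizer steps (the step count this run reports after training), and warmup steps rounded up from the ratio. The function `cosine_lr` is hypothetical and the trainer's exact schedule may differ in detail.

```python
import math

def cosine_lr(step, total_steps=125, peak_lr=2e-4, warmup_ratio=0.05):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumption: ratio rounded up
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Start of warmup, mid-warmup, peak, halfway through the decay, final step
for step in (0, 3, 7, 66, 125):
    print(step, f"{cosine_lr(step):.2e}")
```

Under these assumptions the rate climbs to its peak within the first 7 steps and then falls smoothly, so the largest updates happen early and the last steps only make small adjustments.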


In [15]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 50256}.


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss |
|------|---------------|
| 10   | 3.3094        |
| 20   | 3.3219        |
| 30   | 3.6554        |
| 40   | 3.3347        |
| 50   | 3.2645        |
| 60   | 3.4352        |
| 70   | 3.2958        |
| 80   | 3.1416        |
| 90   | 3.2986        |
| 100  | 3.1562        |




TrainOutput(global_step=125, training_loss=3.325929244995117, metrics={'train_runtime': 307.0327, 'train_samples_per_second': 3.257, 'train_steps_per_second': 0.407, 'total_flos': 94701699661824.0, 'train_loss': 3.325929244995117, 'entropy': 3.4376644611358644, 'num_tokens': 363618.0, 'mean_token_accuracy': 0.3801665484905243, 'epoch': 1.0})
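A few quick checks help when reading this TrainOutput. The sketch below uses only numbers printed above and treats the reported training loss as an average cross-entropy in natural-log units, which is an assumption about how to interpret it. Perplexity is then simply the exponential of that loss.

```python
import math

num_examples = 1000             # rows in the training split
global_steps = 125              # global_step from TrainOutput
train_loss = 3.325929244995117  # training_loss from TrainOutput
runtime_seconds = 307.0327      # train_runtime from TrainOutput

examples_per_step = num_examples / global_steps
perplexity = math.exp(train_loss)
samples_per_second = num_examples / runtime_seconds

print(f"examples per optimizer step: {examples_per_step:.0f}")
print(f"perplexity: {perplexity:.1f}")
print(f"samples per second: {samples_per_second:.3f}")
```

Eight examples per step is consistent with a batch size of 8 and gradient_accumulation_steps=1 on one GPU, and a perplexity of roughly 28 after a single epoch suggests the model is still far from confident about the next token, which fits the weak reply in the next cell.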

In [16]:
generate('how are you')

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


how are you?
 (). for that<|endoftext|>


