**Install required libraries**

In [1]:
%pip install transformers datasets torch accelerate



**Load tokenizer & model**

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "distilgpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)





**Load dataset**

In [3]:
from datasets import load_dataset

dataset = load_dataset("text", data_files={"train": "data.txt"})


Generating train split: 0 examples [00:00, ? examples/s]
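The `text` loader reads a plain-text file and, by default, turns each line into one example with a single `text` column. The notebook does not show the contents of `data.txt`, so the sketch below only illustrates one plausible layout. The file name `sample_data.txt` and the dialogue lines are hypothetical, and the `User: ... Bot: ...` format is an assumption inferred from the prompt used at the end of the notebook.

```python
from pathlib import Path

# Hypothetical training lines; the real data.txt is not shown in this notebook.
samples = [
    "User: What is AI? Bot: AI is the simulation of human intelligence by machines.",
    "User: What is Python? Bot: Python is a general-purpose programming language.",
]

# Hypothetical file name, chosen so an existing data.txt is not overwritten.
path = Path("sample_data.txt")
path.write_text("\n".join(samples) + "\n", encoding="utf-8")

# Count the non-empty lines, one per intended training example.
lines = [ln for ln in path.read_text(encoding="utf-8").splitlines() if ln.strip()]
print(len(lines))
```

Keeping each exchange on a single line means one line maps to one training example, which makes the example count easy to check against the progress output.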

**Tokenize dataset**

In [5]:
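# GPT-2 models ship without a pad token, so reuse the end-of-text token for padding.
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(example):
    # Truncate or pad every line to exactly 128 tokens so batches have a fixed shape.
    return tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=128
    )

# batched=True passes a batch of lines to the tokenizer in one call.
tokenized_dataset = dataset.map(tokenize_function, batched=True)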

Map:   0%|          | 0/77 [00:00<?, ? examples/s]
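To make the effect of `truncation=True` and `padding="max_length"` concrete, here is a toy sketch on plain lists of token ids. It is not the tokenizer's implementation. The helper `pad_or_truncate` is hypothetical, and it assumes right-side padding with GPT-2's end-of-text id (50256) as the pad id, as set up in the cell above.

```python
EOS_ID = 50256  # GPT-2's end-of-text token id, reused here as the pad id

def pad_or_truncate(ids, max_length=128, pad_id=EOS_ID):
    """Toy version of truncation=True plus padding="max_length" (right padding)."""
    ids = ids[:max_length]                         # drop tokens beyond max_length
    n_pad = max_length - len(ids)                  # how many pad tokens are needed
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, attention_mask

input_ids, mask = pad_or_truncate([10, 20, 30], max_length=5)
print(input_ids)  # [10, 20, 30, 50256, 50256]
print(mask)       # [1, 1, 1, 0, 0]
```

With short lines and `max_length=128`, most positions in each example are padding. That is harmless here but wasteful on larger datasets, where padding per batch is usually preferred.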

**Data collator**

In [6]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)
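With `mlm=False` the collator prepares causal language modeling batches: it copies `input_ids` into `labels` and replaces pad positions with `-100`, the value the loss ignores. The one-token shift between inputs and targets happens inside the model. The sketch below mimics only the label step on a plain list. The helper `make_causal_lm_labels` is hypothetical and is not the library's code.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips
PAD_ID = 50256       # pad id in this notebook, identical to GPT-2's EOS id

def make_causal_lm_labels(input_ids, pad_id=PAD_ID):
    """Copy the inputs as labels, masking pad positions out of the loss."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

labels = make_causal_lm_labels([10, 20, 30, 50256, 50256])
print(labels)  # [10, 20, 30, -100, -100]
```

Because the pad token is the EOS token here, any genuine EOS token is masked as well, so the model gets no training signal for ending a reply. This may contribute to generations that run on instead of stopping.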


**Training arguments**

In [7]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./chatbot-model",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    save_steps=500,
    save_total_limit=2,
    logging_steps=50,
    report_to="none"
)


**Trainer**

In [8]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    data_collator=data_collator,
)


**Start fine-tuning**

In [9]:
trainer.train()

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss |
|------|---------------|
| 50   | 3.0242        |
| 100  | 2.3799        |
| 150  | 1.8412        |
| 200  | 1.2293        |


TrainOutput(global_step=231, training_loss=2.0107169873786694, metrics={'train_runtime': 477.5915, 'train_samples_per_second': 0.484, 'train_steps_per_second': 0.484, 'total_flos': 7544943673344.0, 'train_loss': 2.0107169873786694, 'epoch': 3.0})
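The reported `global_step=231` and the four logged rows follow from the settings used in this notebook. The arithmetic below is a quick check, assuming a single device and no gradient accumulation (the defaults). The variable names are only illustrative.

```python
import math

num_examples = 77    # from the Map progress line
batch_size = 1       # per_device_train_batch_size
epochs = 3           # num_train_epochs
logging_steps = 50
save_steps = 500

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Steps at which the logging and step-based saving intervals fire.
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))
checkpoints_at = list(range(save_steps, total_steps + 1, save_steps))

print(total_steps)     # 231
print(logged_at)       # [50, 100, 150, 200]
print(checkpoints_at)  # []
```

Since training ends at step 231, the `save_steps=500` interval never fires on its own. A smaller value would be needed to get intermediate checkpoints on a dataset of this size.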

**Save model and tokenizer**

In [10]:
trainer.save_model("./chatbot-model")
tokenizer.save_pretrained("./chatbot-model")


('./chatbot-model/tokenizer_config.json',
 './chatbot-model/special_tokens_map.json',
 './chatbot-model/vocab.json',
 './chatbot-model/merges.txt',
 './chatbot-model/added_tokens.json',
 './chatbot-model/tokenizer.json')

**Test the chatbot**

In [11]:
from transformers import pipeline

chatbot = pipeline(
    "text-generation",
    model="./chatbot-model",
    tokenizer="./chatbot-model"
)

chatbot("User: What is AI?\nBot:", max_length=100)


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Both `max_new_tokens` (=256) and `max_length`(=100) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'User: What is AI?\nBot: AI is a data science process that uses machine learning and deep learning to predict intelligent and intelligent behavior. It is used to help humans learn from concepts, patterns, and patterns of thought. It is a popular programming language used to help humans understand and understand language. It is used in AI applications like text processing, AI and AI applications like AI applications.\n\nA Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Learning Machine Lea