<a href="https://colab.research.google.com/github/CrisLeaf/chatbot/blob/master/text_generator_train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install `transformers` library

In [1]:
!pip install transformers==4.15.0

Collecting transformers==4.15.0
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
     |████████████████████████████████| 3.4 MB 5.1 MB/s 
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
     |████████████████████████████████| 895 kB 59.4 MB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
     |████████████████████████████████| 67 kB 6.9 MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
     |████████████████████████████████| 3.3 MB 59.3 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
     |████████████████████████████████| 596 kB 71.1 MB/s 
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, tr

In [2]:
!nvidia-smi

Mon Jan 17 16:09:26 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Prepare Dataset

The dataset consists of sentences gathered from different sources related to Tolkien's universe.

In [3]:
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [4]:
import pandas as pd
df = pd.read_csv("/content/gdrive/My Drive/data-science/chatbot/data.txt", 
                 sep=".", header=None).transpose()
df.columns = ["text"]
df["text"] = df["text"].apply(lambda x: str(x) + ". ")
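A plain-Python sketch of what the cell above aims for: split the raw text on periods and put a trailing `". "` back on each sentence. The helper name `split_sentences` and the sample string are made up for illustration. Unlike the pandas version, this sketch also strips surrounding whitespace and drops empty pieces.

```python
def split_sentences(raw_text):
    """Split raw text on periods and restore a trailing '. ' on each sentence."""
    pieces = [piece.strip() for piece in raw_text.split(".")]
    # Skip empty pieces, e.g. the one produced after the final period.
    return [piece + ". " for piece in pieces if piece]


# Hypothetical sample text, not taken from the real dataset.
sample = "Uno. Dos. Tres."
sentences = split_sentences(sample)
print(sentences)
```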

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39160 entries, 0 to 39159
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    39160 non-null  object
dtypes: object(1)
memory usage: 611.9+ KB


In [6]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2, random_state=159)
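A rough standard-library sketch of a seeded 80/20 split, to show where the train and test sizes come from. The helper `split_indices` is hypothetical, and its shuffle does not reproduce the row order that `train_test_split` picks for the same seed; only the sizes are comparable.

```python
import math
import random


def split_indices(n_total, test_percent=20, seed=159):
    """Shuffle row indices with a fixed seed and cut off the test share."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    # The test share is rounded up; the remaining rows form the training set.
    n_test = math.ceil(n_total * test_percent / 100)
    return indices[n_test:], indices[:n_test]


# 39160 is the number of rows reported by df.info() above.
train_idx, test_idx = split_indices(39160)
print(len(train_idx), len(test_idx))
```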

In [7]:
# Concatenate all training sentences into one string.
data = "".join(train["text"])

In [8]:
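def build_text_files(data_column, destination_path):
    # Concatenate every sentence in the column and write the result to a single text file.
    # UTF-8 is set explicitly so accented Spanish characters are written consistently.
    with open(destination_path, "w", encoding="utf-8") as f:
        f.write("".join(data_column))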

build_text_files(train["text"], "train_dataset.txt")
build_text_files(test["text"], "test_dataset.txt")

print("Train dataset length: " + str(len(train)))
print("Test dataset length: " + str(len(test)))

Train dataset length: 31328
Test dataset length: 7832


# Load the Model and the Trainer

## Load

Load the pretrained Spanish GPT-2 model and its tokenizer from the [Hugging Face model hub](https://huggingface.co/models).

In [9]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("DeepESP/gpt2-spanish")
model = AutoModelForCausalLM.from_pretrained("DeepESP/gpt2-spanish")

Convert the training and test files into the dataset format the model pipeline expects.

In [10]:
from transformers import TextDataset, DataCollatorForLanguageModeling

def load_dataset(train_path, test_path, tokenizer):
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_path,
        block_size=128
    )

    test_dataset = TextDataset(
        tokenizer=tokenizer, 
        file_path=test_path,
        block_size=128
    )

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    return train_dataset, test_dataset, data_collator

train_dataset, test_dataset, data_collator = load_dataset("train_dataset.txt", "test_dataset.txt", 
                                                          tokenizer)
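A minimal sketch of the chunking that `block_size=128` implies, assuming `TextDataset` tokenizes the whole file into one long sequence and cuts it into consecutive fixed-size blocks, discarding a final incomplete block. The function `chunk_token_ids` and the fake ids are illustrative only, not the library's code.

```python
def chunk_token_ids(token_ids, block_size=128):
    """Cut a flat list of token ids into consecutive blocks of block_size.

    A trailing remainder shorter than block_size is dropped.
    """
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]


# Stand-in for real tokenizer output: 300 fake token ids.
fake_ids = list(range(300))
blocks = chunk_token_ids(fake_ids, block_size=128)
print(len(blocks), [len(block) for block in blocks])
```

Under this assumption, each training example is one 128-token block rather than one sentence, which is why the number of examples seen by the trainer differs from the number of rows in the DataFrame.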



## Training Arguments

Define the training arguments.

In [11]:
from transformers import Trainer, TrainingArguments

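training_args = TrainingArguments(
    output_dir="/content/gdrive/My Drive/data-science/chatbot/text-generator",
    overwrite_output_dir=True,
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    # eval_steps only takes effect when evaluation_strategy="steps";
    # with the default ("no"), no evaluation runs during training.
    eval_steps=500,
    save_steps=500,    # write a checkpoint every 500 optimizer steps
    warmup_steps=500,  # warm up the learning rate over the first 500 steps
    prediction_loss_only=True
)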

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)
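To illustrate what `warmup_steps=500` does, here is a small sketch of a linear warmup followed by a linear decay. It assumes the `TrainingArguments` defaults that the cell above does not override (`learning_rate=5e-5` and a linear scheduler) and takes the total of 5340 optimization steps from the training log further down. The function `linear_schedule_lr` is a hypothetical helper, not a library call.

```python
def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=5340):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at the last step.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 2920, 5340):
    print(step, linear_schedule_lr(step))
```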

# Train and Save the Model

Train the model. The training loss is logged and a checkpoint is written to Google Drive every 500 steps.

In [12]:
trainer.train()

***** Running training *****
  Num examples = 8544
  Num Epochs = 10
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 5340



Saving model checkpoint to /content/gdrive/My Drive/data-science/chatbot/text-generator/checkpoint-500
Configuration saved in /content/gdrive/My Drive/data-science/chatbot/text-generator/checkpoint-500/config.json
Model weights saved in /content/gdrive/My Drive/data-science/chatbot/text-generator/checkpoint-500/pytorch_model.bin
Saving model checkpoint to /content/gdrive/My Drive/data-science/chatbot/text-generator/checkpoint-1000
Configuration saved in /content/gdrive/My Drive/data-science/chatbot/text-generator/checkpoint-1000/config.json
Model weights saved in /content/gdrive/My Drive/data-science/chatbot/text-generator/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to /content/gdrive/My Drive/data-science/chatbot/text-generator/checkpoint-1500
Configuration saved in /content/gdrive/My Drive/data-science/chatbot/text-generator/checkpoint-1500/config.json
Model weights saved in /content/gdrive/My Drive/data-science/chatbot/text-generator/checkpoint-1500/pytorch_model.bin
S

Step,Training Loss
500,3.9854
1000,3.4324
1500,3.1947
2000,3.0189
2500,2.8664
3000,2.7507
3500,2.6484
4000,2.5591
4500,2.4943
5000,2.449


Saving model checkpoint to /content/gdrive/My Drive/data-science/chatbot/text-generator/checkpoint-2500
Configuration saved in /content/gdrive/My Drive/data-science/chatbot/text-generator/checkpoint-2500/config.json
Model weights saved in /content/gdrive/My Drive/data-science/chatbot/text-generator/checkpoint-2500/pytorch_model.bin
Saving model checkpoint to /content/gdrive/My Drive/data-science/chatbot/text-generator/checkpoint-3000
Configuration saved in /content/gdrive/My Drive/data-science/chatbot/text-generator/checkpoint-3000/config.json
Model weights saved in /content/gdrive/My Drive/data-science/chatbot/text-generator/checkpoint-3000/pytorch_model.bin
Saving model checkpoint to /content/gdrive/My Drive/data-science/chatbot/text-generator/checkpoint-3500
Configuration saved in /content/gdrive/My Drive/data-science/chatbot/text-generator/checkpoint-3500/config.json
Model weights saved in /content/gdrive/My Drive/data-science/chatbot/text-generator/checkpoint-3500/pytorch_model.bi

TrainOutput(global_step=5340, training_loss=2.9068375376726356, metrics={'train_runtime': 3240.1993, 'train_samples_per_second': 26.369, 'train_steps_per_second': 1.648, 'total_flos': 5581197803520000.0, 'train_loss': 2.9068375376726356, 'epoch': 10.0})
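The numbers in the log above can be checked by hand. The sketch below recomputes the step count and throughput from the reported example count, batch size, epochs and runtime, assuming a single device and no gradient accumulation (which matches the "Total train batch size" and "Gradient Accumulation steps" lines in the log).

```python
import math

# Values copied from the training log and TrainOutput above.
num_examples = 8544
batch_size = 16
num_epochs = 10
train_runtime = 3240.1993  # seconds

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```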

Save the fine-tuned model weights and the tokenizer files.

In [13]:
trainer.save_model()
tokenizer.save_pretrained("/content/gdrive/My Drive/data-science/chatbot/text-generator")

Saving model checkpoint to /content/gdrive/My Drive/data-science/chatbot/text-generator
Configuration saved in /content/gdrive/My Drive/data-science/chatbot/text-generator/config.json
Model weights saved in /content/gdrive/My Drive/data-science/chatbot/text-generator/pytorch_model.bin
tokenizer config file saved in /content/gdrive/My Drive/data-science/chatbot/text-generator/tokenizer_config.json
Special tokens file saved in /content/gdrive/My Drive/data-science/chatbot/text-generator/special_tokens_map.json


('/content/gdrive/My Drive/data-science/chatbot/text-generator/tokenizer_config.json',
 '/content/gdrive/My Drive/data-science/chatbot/text-generator/special_tokens_map.json',
 '/content/gdrive/My Drive/data-science/chatbot/text-generator/vocab.json',
 '/content/gdrive/My Drive/data-science/chatbot/text-generator/merges.txt',
 '/content/gdrive/My Drive/data-science/chatbot/text-generator/added_tokens.json',
 '/content/gdrive/My Drive/data-science/chatbot/text-generator/tokenizer.json')

# Test the Model

Reload the tokenizer and load the model with the fine-tuned weights.

In [21]:
test_tokenizer = AutoTokenizer.from_pretrained("DeepESP/gpt2-spanish")
# GPT-2 has no dedicated padding token, so reuse the end-of-sequence token id for padding.
test_model = AutoModelForCausalLM.from_pretrained("/content/gdrive/My Drive/data-science/chatbot/text-generator",
                                                  pad_token_id=test_tokenizer.eos_token_id)

loading configuration file https://huggingface.co/DeepESP/gpt2-spanish/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/80f30f43ed5e5159d1924e70134443c131f059187f562435a148107dbf002fec.05892ce96bf9f5c8bfde40968186593b5d0feec27999103dbf7ea1d3ca7d11e1
Model config GPT2Config {
  "_name_or_path": "DeepESP/gpt2-spanish",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "c

Test with a simple sentence.

In [24]:
test_text = "Un día, el mago despertó y vió por la ventana como"
inputs = test_tokenizer.encode(test_text, return_tensors="pt")
# Sample with temperature 0.7 and nucleus (top-p) filtering at 0.9; top_k=0 disables top-k filtering.
outputs = test_model.generate(inputs, max_length=100, do_sample=True, top_k=0,
                              temperature=0.7, no_repeat_ngram_size=2, top_p=0.9)
test_tokenizer.decode(outputs[0], skip_special_tokens=True)

'Un día, el mago despertó y vió por la ventana como si fuera de día y miró el cielo.  Por fin los elfos despertaron, y los hombres despertaron de pronto, porque el sol había salido y las sombras de la noche se alargaban. En la tierra media, la hierba era verde y verde, pero las estrellas brillaban en el este. El aire tenía un olor dulce, como de árboles recién plantados y raíces maduras, a flores frescas y a manzanas frescas. A medida que'
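For intuition about the `temperature=0.7` and `top_p=0.9` settings used above, here is a simplified standard-library sketch of temperature scaling followed by nucleus (top-p) sampling for a single step. The function `sample_next_token` and the toy logits are illustrative assumptions. The real `generate` call works on tensors, also applies `no_repeat_ngram_size`, and may differ in how it handles the cutoff boundary.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_p=0.9, rng=random):
    """Pick one token index using temperature scaling and nucleus (top-p) filtering."""
    # Temperatures below 1 sharpen the distribution; above 1 they flatten it.
    scaled = [logit / temperature for logit in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]
    # Keep the most likely tokens until their cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    # Sample among the kept tokens in proportion to their probabilities.
    weights = [probs[index] for index in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# Toy logits over a four-token vocabulary (made up for illustration).
toy_logits = [2.0, 1.5, 0.5, -1.0]
print(sample_next_token(toy_logits, rng=random.Random(0)))
```

Lowering `top_p` shrinks the set of candidate tokens, and lowering `temperature` concentrates probability on the top candidates, so both make the output more conservative.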