**Author**: Todd Goldfarb

**Contact**: tcgoldfarb@gmail.com

**Date**: 3/30/2023

This is the **Cadet-Tiny** model, the first NLP seq2seq model I've ever made, intended for use on edge devices (such as a Raspberry Pi). To save on computing costs and reduce the environmental footprint, I start from Google's pretrained T5-small and fine-tune it on the SODA dataset from AllenAI.

Inspired by Cosmo-3B, we will use SODA's contextual narrative (n), the perspective/speaker instruction (i), and the dialogue context (c), which is made up of the previous conversation utterances concatenated with the < TURN > indicator. We will separate n, i, and c with < SEP >.

(link to the SODA and COSMO paper: https://arxiv.org/pdf/2212.10465.pdf)

(HuggingFace link to the t5-small, the backbone model https://huggingface.co/t5-small)

(HuggingFace link to SODA, the dataset https://huggingface.co/datasets/allenai/soda)

(Special thanks to Hyunwoo Kim for providing personal insight on how to train the model in alignment with how COSMO was trained.)

**Install packages and dependencies.**

In [None]:
!pip install datasets transformers


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting dill<0.3.7,>=0.3.0
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting huggingface-hub<1.0.0,>=0.11.0
  Downl

**Load our dataset, SODA.**

In [None]:
from datasets import load_dataset

raw_SODA = load_dataset("allenai/soda")

Downloading and preparing dataset parquet/allenai--soda to /root/.cache/huggingface/datasets/allenai___parquet/allenai--soda-354e990899ae2f4a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/allenai___parquet/allenai--soda-354e990899ae2f4a/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.



**There are many labels (columns) here. T5 has no hard limit on input or output length, but I am going to abide by a 512-token limit. The COSMO-3B, according to the research paper, takes in "*...situation descriptions, along with dialogue history, and generates a next utterance according to a given role.*".**

They use:
- contextual narrative (**narrative label**)
- the perspective/speaker instruction (**we have to build this from PersonX and PersonY, so we only use the parts of the dataset that have 2 partners**). I'm sure we could use monologues and 3-person dialogues, but there's no need to in a one-on-one conversational model.
- the dialogue (**dialogue label, which we have to parse into sequenced turns**).

Before we can cut the labels, we need to filter out every conversation that:

a. Has only PersonX

b. Includes PersonZ

After that, we can cut the labels down to what we need.



In [None]:
# A filtering function that keeps two-person conversations only,
# dropping solo (PersonX only) and three-person (PersonZ present) ones.
# Dataset.filter calls it once per example (row).
def duo_conversations_only(example):
  has_PersonX = example["PersonX"] is not None and example["PersonX"] != ""
  has_PersonY = example["PersonY"] is not None and example["PersonY"] != ""
  has_PersonZ = example["PersonZ"] is not None and example["PersonZ"] != ""

  return has_PersonX and has_PersonY and not has_PersonZ

duo_SODA = raw_SODA.filter(duo_conversations_only)

duo_SODA["train"]

Dataset({
    features: ['head', 'relation', 'tail', 'literal', 'narrative', 'dialogue', 'speakers', 'PersonX', 'PersonY', 'PersonZ', 'original_index', 'split', 'head_answer', 'pmi_head_answer', 'relation_tail_answer', 'pmi_relation_tail_answer'],
    num_rows: 348572
})

Now that we have filtered out everything but two-person conversations, let's cut out the labels that aren't necessary (head, relation, tail, literal, PersonZ, original_index, split, head_answer, pmi_head_answer, pmi_relation_tail_answer) and keep only the ones we use: narrative, PersonX, PersonY, and dialogue.

In [None]:
raw_SODA_train = duo_SODA["train"]
raw_SODA_valid = duo_SODA["validation"]
raw_SODA_test = duo_SODA["test"]

# TRAIN SET
narrative_train = raw_SODA_train["narrative"]
PersonX_train = raw_SODA_train["PersonX"]
PersonY_train = raw_SODA_train["PersonY"]
dialogue_train = raw_SODA_train["dialogue"]

# TEST SET
narrative_test = raw_SODA_test["narrative"]
PersonX_test = raw_SODA_test["PersonX"]
PersonY_test = raw_SODA_test["PersonY"]
dialogue_test = raw_SODA_test["dialogue"]

# VALIDATION SET
narrative_valid = raw_SODA_valid["narrative"]
PersonX_valid = raw_SODA_valid["PersonX"]
PersonY_valid = raw_SODA_valid["PersonY"]
dialogue_valid = raw_SODA_valid["dialogue"]

The T5 input format has three values:

"prefix" = **"dialogue: "**

"input text" = **IMAGINE TAG** < SEP > **NARRATIVE** < SEP > **FIRST CAPTION IN DIALOGUE** < TURN >

"target text" = **SECOND CAPTION IN DIALOGUE**

At inference time, we will append a < TURN > to the end of each model output before it goes back into the dialogue context.
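To make the format above concrete, here is a minimal sketch that assembles one input string from its parts. The helper name `build_input`, the constants, and the example names and sentences are all made up for illustration; only the layout (prefix, imagine tag, narrative, then utterances each followed by a turn marker) comes from the description above.

```python
SEP = " <SEP> "    # separates the imagine tag, the narrative, and the dialogue context
TURN = " <TURN> "  # closes every utterance in the dialogue context

def build_input(imagine_tag, narrative, history):
    """Join prefix, imagine tag, narrative, and dialogue history into one string."""
    # Every utterance in the history is followed by a turn marker.
    context = "".join(utterance + TURN for utterance in history)
    return "dialogue: " + imagine_tag + SEP + narrative + SEP + context

# Hypothetical example values, not taken from SODA.
example = build_input(
    "You are Sam talking to Alex.",
    "Alex asks Sam for help moving.",
    ["Hey Sam, are you free on Saturday?"],
)
print(example)
```

The same helper would also cover inference: append the model's reply to `history` and call it again, so the reply picks up its turn marker automatically.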

What is the imagine tag? It is the speaker instruction that tells the model who it is and who it is talking to. The function below fills in the names.

In [None]:
def imagineTagMaker(PersonX, PersonY):
  imagineTag = "You are " + PersonY + " talking to " + PersonX +"."
  return imagineTag

# TRAINING SET
imagine_train = []
for i in range(len(PersonX_train)):
  imagine_train.append(imagineTagMaker(PersonX_train[i], PersonY_train[i]))

# TEST SET
imagine_test = []
for i in range(len(PersonX_test)):
  imagine_test.append(imagineTagMaker(PersonX_test[i], PersonY_test[i]))

# VALIDATION SET
imagine_valid = []
for i in range(len(PersonX_valid)):
  imagine_valid.append(imagineTagMaker(PersonX_valid[i], PersonY_valid[i]))

Now we need to tokenize our data, using AutoTokenizer.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small", model_max_length=512)

Due to some errors, it seems we can't tokenize the whole dataset at once, so we'll tokenize it one example at a time.

We need to build each input string, tokenize it, then append it to a list. To make the most of the duo-only filtered dataset, we'll feed the model the first round of dialogue and then each consecutive round with the earlier turns as context (the plan allows up to 3 rounds; the cell below builds the first two).
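As a sketch of the "first round, then each consecutive round" idea, the helper below splits one dialogue into (history, response) pairs. The name `dialogue_rounds` and the `max_rounds` parameter are my own, and it assumes the two speakers strictly alternate, so the response always sits at an odd index. It also skips rounds a short dialogue can't fill instead of raising an IndexError.

```python
def dialogue_rounds(dialogue, max_rounds=3):
    """Split a list of utterances into (history, response) training pairs.

    Round k uses utterances 0 .. 2k as history and utterance 2k + 1 as the response.
    """
    pairs = []
    for k in range(max_rounds):
        response_index = 2 * k + 1
        # Stop once the dialogue is too short to supply a response for this round.
        if response_index >= len(dialogue):
            break
        pairs.append((dialogue[:response_index], dialogue[response_index]))
    return pairs

# Toy dialogue with five utterances: only two full rounds are available.
toy = ["hi", "hello", "how are you?", "good, you?", "fine"]
for history, response in dialogue_rounds(toy):
    print(history, "->", response)
```

With `max_rounds=2` this yields the same two windows that are written out by hand in the next cell.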


In [None]:
def createInputSet(imagineSet, narrativeSet, dialogueSet):
  model_inputs = []

  # FOR FIRST ROUND OF DIALOGUE
  for i in range(len(imagineSet)):
    # THIS IS OUR: dialogue: Imagine you are... <SEP> Narrative <SEP> TALKING <TURN>
    rawString = "dialogue: " + imagineSet[i] + " <SEP> " + narrativeSet[i] + " <SEP> " + dialogueSet[i][0] + " <TURN> "
    tokenizedString = tokenizer(rawString, padding='max_length', truncation=True, max_length=512)
    with tokenizer.as_target_tokenizer():
      # This is the response to the dialogue[i][0] above.
      tokenizedResponse = tokenizer(dialogueSet[i][1], padding='max_length', truncation=True, max_length=512)
    tokenizedString["labels"] = tokenizedResponse["input_ids"]
    model_inputs.append(tokenizedString)

  # FOR SECOND ROUND OF DIALOGUE
  for i in range(len(imagineSet)):
    rawString = "dialogue: " + imagineSet[i] + " <SEP> " + narrativeSet[i] + " <SEP> " + dialogueSet[i][0] + " <TURN> " + dialogueSet[i][1] + " <TURN> " + dialogueSet[i][2] + " <TURN> "
    tokenizedString = tokenizer(rawString, padding='max_length', truncation=True, max_length=512)
    with tokenizer.as_target_tokenizer():
      # This is the response to dialogueSet[i][2] above.
      tokenizedResponse = tokenizer(dialogueSet[i][3], padding='max_length', truncation=True, max_length=512)
    tokenizedString["labels"] = tokenizedResponse["input_ids"]
    model_inputs.append(tokenizedString)

  return model_inputs

model_train = createInputSet(imagine_train, narrative_train, dialogue_train)
model_eval = createInputSet(imagine_valid, narrative_valid, dialogue_valid)
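One thing worth knowing about the labels built above: they are padded to 512 tokens with the pad token, so the loss is also computed over the padding positions. A common alternative in HuggingFace seq2seq fine-tuning is to replace the pad ids in the labels with -100, which the cross-entropy loss ignores. The sketch below shows that replacement. It is not what was used for the run reported in this notebook, the helper name is my own, and the token ids in the example are arbitrary; 0 is the pad token id of the T5 tokenizer.

```python
def mask_label_padding(label_ids, pad_token_id=0, ignore_index=-100):
    """Replace padding ids in a label sequence so the loss skips those positions."""
    return [ignore_index if token_id == pad_token_id else token_id
            for token_id in label_ids]

# Arbitrary token ids followed by the end-of-sequence id (1) and two pad ids (0).
padded_labels = [37, 1782, 1, 0, 0]
print(mask_label_padding(padded_labels))
```

Applied to `tokenizedResponse["input_ids"]` before it is stored as `labels`, this would change the reported loss values, since padding positions would no longer contribute to the average.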



Now we have tokenized inputs and tokenized target responses (the labels the model learns to produce).
I think it's time to train! Let's import our model.

In [None]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

We need to configure the Seq2SeqTrainingArguments for HuggingFace's Seq2SeqTrainer class. Thank you HuggingFace for making AI so easy!

In [None]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq

args = Seq2SeqTrainingArguments(
    "Cadet-Tiny",  # output_dir: where checkpoints are written
    evaluation_strategy = "epoch",
    learning_rate = 0.001,
    per_device_train_batch_size = 40,
    per_device_eval_batch_size = 40,
    weight_decay = 0.01,
    save_total_limit = 15,
    num_train_epochs = 3,
    predict_with_generate = True,
    fp16 = True,
)

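# Note: this collator is built but never passed to Seq2SeqTrainer below, so the
# Trainer uses its default collator. That works here because every example is
# already padded to 512 tokens.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)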

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=model_train,
    eval_dataset=model_eval,
)

Now that we have our data preprocessed, model loaded, and training arguments set up, we can finally train the model.

In [None]:
trainer.train()



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 0.0583 | 0.052288 |
| 2 | 0.052 | 0.047958 |
| 3 | 0.0496 | 0.046132 |


TrainOutput(global_step=52287, training_loss=0.05774725275756292, metrics={'train_runtime': 15994.368, 'train_samples_per_second': 130.761, 'train_steps_per_second': 3.269, 'total_flos': 2.830581745361879e+17, 'train_loss': 0.05774725275756292, 'epoch': 3.0})
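As a sanity check, the step count in the output above can be reproduced from numbers already in this notebook. The sketch assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; the variable names are mine.

```python
import math

train_rows = 348572        # rows in duo_SODA["train"] after filtering
examples_per_row = 2       # createInputSet builds a first-round and a second-round example
batch_size = 40            # per_device_train_batch_size
epochs = 3                 # num_train_epochs

examples = train_rows * examples_per_row
steps_per_epoch = math.ceil(examples / batch_size)
total_steps = steps_per_epoch * epochs
print(examples, steps_per_epoch, total_steps)  # 697144 17429 52287

# Runtime reported by the Trainer, converted from seconds to hours.
train_hours = 15994.368 / 3600
print(round(train_hours, 2))  # 4.44
```

The result matches `global_step=52287`, which confirms that every filtered training row contributed both of its examples.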