<a href="https://colab.research.google.com/github/ChrisNavoczynski/AD470-Team-MPC-SP2022/blob/main/homer_simpson.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Homer Simpson Chatbot**

* Peter Torres - Business Problem Definition, Data Ingestion, Data CleanUp, Exploratory Data Analysis
* Chris Navoczynski - Model Training & Building
* Mona Mohamed - Evaluation and Hugging Face Chatbot deployment

## *Business problem definition*

- We chose Homer Simpson, a character familiar to audiences around the globe, to encourage people to communicate.
- A chatbot that talks like Homer makes the conversation more entertaining, which helps people open up and engage more.


## *Data Ingestion*

Gathered transcripts of [Simpsons](https://www.simpsonsarchive.com/episodes.html) episodes and combined them into a single text file.

## *Data Cleanup*

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
import os
os.chdir('/content/drive/My Drive/AD470-MPC-Group1-SP2022/Project_3')

In [None]:
import re
import pandas as pd

In [None]:
pattern = r'([a-zA-Z\s]+):(.+)'
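
To see what the pattern keeps and what it drops, here is a minimal sketch that applies it to three sample lines. The helper name `parse_line` and the second and third sample strings are illustrative assumptions, not taken from the transcript file. Note that a bracketed timing marker containing a colon is also treated as dialogue, which would explain the `Time,6:27]` row that shows up in `simpsons.csv`.

```python
import re

pattern = r'([a-zA-Z\s]+):(.+)'

def parse_line(raw):
    """Return (speaker, text) for a 'Name: text' line, or None if no match."""
    match = re.findall(pattern, raw)
    if not match:
        return None
    name, text = match[0]
    return name.strip(), text.strip()

samples = [
    "Homer: But sir, I...",
    "[End of Act One.  Time: 6:27]",    # illustrative timing marker
    "(The family sits on the couch.)",  # illustrative line without a colon
]
parsed = [parse_line(s) for s in samples]
for raw, result in zip(samples, parsed):
    print(repr(raw), "->", result)
```

Lines without a `Name:` prefix are skipped entirely, so stage directions that stand alone never reach the DataFrame.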

In [None]:
data = {
    'name': [],
    'line': []
}

In [None]:
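with open('simpsons.txt', 'rt') as file:
  for raw_line in file:
    # Keep only lines shaped like "Name: dialogue"
    match = re.findall(pattern, raw_line)
    if match:
      # Use separate names so the loop variable is not overwritten
      name, text = match[0]
      data['name'].append(name.strip())
      data['line'].append(text.strip())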

In [None]:
df = pd.DataFrame(data)

In [None]:
df.head()

Unnamed: 0,name,line
0,Homer,[meekly raises his hand]
1,Supervisor,"I might have known it was you, Simpson."
2,Homer,"But sir, I..."
3,Supervisor,"I don't want to hear about it Simpson, your fi..."
4,Terry,"[waving] Hi, Daddy!"


In [None]:
sum(df['name'] == 'Homer')

701

In [None]:
len(df)

2591

In [None]:
df.to_csv('simpsons.csv', index=False)

In [None]:
!pip -q install transformers


## *Exploratory Data Analysis*

In [None]:
# imports required:

import glob
import logging
import os
import pickle
import random
import re
import shutil
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm.notebook import tqdm, trange

from pathlib import Path

from transformers import (
    MODEL_WITH_LM_HEAD_MAPPING,
    WEIGHTS_NAME,
    AdamW,
    AutoConfig,
    PreTrainedModel,
    PreTrainedTokenizer,
    get_linear_schedule_with_warmup,
)

try: 
  from torch.utils.tensorboard import SummaryWriter
except ImportError:
  from tensorboardX import SummaryWriter 

In [None]:
!head simpsons.csv

name,line
Homer,[meekly raises his hand]
Supervisor,"I might have known it was you, Simpson."
Homer,"But sir, I..."
Supervisor,"I don't want to hear about it Simpson, your fired!"
Terry,"[waving]  Hi, Daddy!"
Time,6:27]
Lisa,Here's a good job at the fireworks factory.
Homer,"Those perfectionists, forget it."
Lisa,"How about this, a supervising technician at the toxic waste dump."


In [None]:
data = pd.read_csv('simpsons.csv', sep=',')

In [None]:
data.sample(10)

Unnamed: 0,name,line
1632,Homer,"My son, a genius!? How does it happen?"
2224,Botz,Don't turn your back on that boy for a second.
1934,Homer,Grrrrrrrrrr!
1557,Ned,"[picking up gifts] Well, this one's mine, and ..."
84,Miss Allbright,Today's topic will be Hell.
1628,Bart,"Oh, like you're reading my mind, man."
1213,Homer,"Well, here I am, right on time. I don't see B..."
1917,Bart,Bring it on home now!
623,Bart,"I'm scared, Lisa."
2335,Lisa,Twelfth hole.


In [None]:
len(data)

2591

In [None]:
sum(data['name'] == 'Homer')

701
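
As a rough sanity check on the counts above, the sketch below tallies speakers for the nine rows printed by `!head simpsons.csv` and computes Homer's share of the full dataset from the two totals reported above (701 of 2591). It uses `collections.Counter` on a hand-copied list rather than the DataFrame, so it runs without the CSV file.

```python
from collections import Counter

# Speaker names copied from the first nine rows of simpsons.csv shown above
names = ["Homer", "Supervisor", "Homer", "Supervisor", "Terry",
         "Time", "Lisa", "Homer", "Lisa"]
counts = Counter(names)
print(counts.most_common(3))

# Homer's share of all parsed lines, from the totals reported above
homer_share = 701 / 2591
print(f"Homer speaks {homer_share:.1%} of the lines")
```

`Time` is not a character; it appears to come from timing markers in the transcripts, so rows like it may be worth filtering out before training.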

## *Model Training*

In [None]:
#CHARACTER_NAME = '   Homer'
CHARACTER_NAME = 'Homer'

In [None]:
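contexted = []

# Number of preceding lines kept as context for each response
windSize = 7

for i in data[data.name == CHARACTER_NAME].index:
  # Skip lines that do not have a full window of context before them
  if i < windSize:
    continue
  row = []
  prev = i - 1 - windSize
  # Walk backwards: the response first, then the windSize lines before it
  for j in range(i, prev, -1):
    row.append(data.line[j])
  contexted.append(row)

# Build the frame once, after the loop, instead of on every iteration
columns = ['response', 'context']
columns = columns + ['context/' + str(i) for i in range(windSize - 1)]

df = pd.DataFrame.from_records(contexted, columns=columns)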
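
The windowing logic can be checked on a toy example. This is a minimal sketch using plain lists instead of the DataFrame; the ten numbered lines, the alternating speakers, and the window of 3 are all made up for illustration. Each row holds the character's response first, followed by the preceding lines from newest to oldest.

```python
# Toy data: ten numbered lines, with Homer speaking on the odd indices
lines = [f"line {k}" for k in range(10)]
speakers = ["Bart", "Homer"] * 5
window = 3  # the notebook uses 7

rows = []
for i, speaker in enumerate(speakers):
    # Skip other speakers and lines without a full window before them
    if speaker != "Homer" or i < window:
        continue
    # Response at index i, then the `window` lines before it, newest first
    rows.append([lines[j] for j in range(i, i - window - 1, -1)])

columns = ["response", "context"] + [f"context/{k}" for k in range(window - 1)]
for row in rows:
    print(dict(zip(columns, row)))
```

Every row has `window + 1` entries, which is why the column list is `response`, `context`, and `window - 1` numbered context columns.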

In [None]:
df.sample(7)

Unnamed: 0,response,context,context/0,context/1,context/2,context/3,context/4,context/5
517,The den! Great idea! [heads into the den. B...,Did you check the den?,"Oh Homer, you'd lose your head if it weren't s...",Where the hell are my keys? Who stole my keys...,Moaning Lisa,[pegs Homer in the face with another balloon],You! Up in the tree! The tall grey-haired ki...,"[giggles] Heh heh, got him!"
117,"Yeah, sure, for you, a baby's all fun and games.","Oh, cool! We can race 'em!","Did you hear that, Maggie? Another baby in th...","You're a machine, Homer!","Whoa, awright! Way to go! [exchange high fives]",[fierce internal struggle manifests itself in ...,"Is Mom going to have another baby, Dad?",I smell a bun in the oven...
484,[slaps his forehead] Oh!,"Drink, my friends and don't be lonely.","[runs past chasing the kids, sees Marge] Huh?","He's here with me, my one and only.","Hey, Marge, and pour the wine!",Drink the drink that I have made.,"Hey, Marge, and pour the wine!","Here we sit, enjoying the shade."
434,"Hello, Flanders.","Oh ho ho, Simpson, it's you.",[bumps into Ned. Their respective armfuls of ...,"[hands over the list she was holding] Well, s...",I want to do the Christmas shopping this year!,[rubs his hand],Well... I...,Yes?
227,[resolutely] Something I should have done a _l...,"Well, Pop, what are you going to do?",it has the decision letters from all the,"[triumphant] Yes! Take _that_, Bitterman.",[growls; a bra falls on his head],"Lighten up, Bitterman...that youngster will ma...","Corey?! Don't worry, Mr. President, I --",Hey!
61,"All right, time for a family meeting. [shuts ...",11:45],"Hi, friends, I'm Dr. Marvin Monroe. Does this...",Why don't you <both shut up!>,Shut up! [little kid enters the bedroom],Shut up!,Oh shut up!,No <you> shut up!
665,Nothing.,"What's the matter, boy?",I reached Step One: She knew I existed.,"It was a jailhouse romance, man!",So it was love a first sight!,Six days!,It was worth it!,Five days!


### Split Data into Training and Validation Sets

In [None]:
train_df, val_df = train_test_split(df, test_size=0.1)
train_df.head()

Unnamed: 0,response,context,context/0,context/1,context/2,context/3,context/4,context/5
89,What a movie! And that blonde cutie! Does sh...,"No, just give the Great Unwashed a pair of ove...",[gasp],I took in a movie. An appalling little piece ...,"Ugh. Well, Smithers, don't you know how to pa...","Oh, he's my Yorkshire terrier, sir. He's kind...",Who the devil is Hercules?,"Well, I caught up on my laundry, wrote a lette..."
103,"[angrily] What's <your> problem, boy?",Maybe you ate a clove.,Hm. It's actually more of a honey glaze.,"Tastes so bitter, it's like ashes in my mouth...","How are you enjoying your ham, Homey?","Look, I get enough admiration and respect at w...","[reading the plaque] For heroic competence, f...",[laughing at an Itchy and Scratchy cartoon]
519,Grrrrrrrrrr!,"No, I'm talking about your breakfast. [laughs]",You know where my keys are?,Warm.,The den! Great idea! [heads into the den. B...,Did you check the den?,"Oh Homer, you'd lose your head if it weren't s...",Where the hell are my keys? Who stole my keys...
363,"Wait a minute, Race. Wait a minute...wait!","With that hatch open, we'll burn up on re-entr...","Homer, you broke the handle.",Wait a minute...this unkempt youngster might j...,"[menacing] Quiet, you --","Ants, huh? We had quite a severe ant problem ...","Oh my God, the ants are shorting out our navig...",[with forced cheer] Who wants ginger snaps?
634,Aah!,Simpson!,"If I wasn't so spineless, I'd march into Mr. B...","Mm, better let him rest up a while first.",Is he well enough for me to start mothering hi...,How does a nice little girl like you know a bi...,"Excuse me, Mr. Hutz. Are you a shyster?",Pfft. Doctors. Doctors are idiots!


In [None]:
val_df.head()

Unnamed: 0,response,context,context/0,context/1,context/2,context/3,context/4,context/5
541,[looks around] Where's the video boxing?,"Yeah, right. [gives him the quarters]",Give me some quarters... I'm doing my laundry.,"Uh, hold on. [to everyone in the bar] Uh, Jo...","Jock, last name Strap.",Who?,Is Jock there?,"Yeah, Moe's Tavern, Moe speaking."
685,"I was afraid you'd cancel our date, so I staye...","I also said I hated you, and we haven't even t...",You said you'd go the prom with me.,That's what you get when you don't put out.,Marge's dates get homelier all the time.,Ladies pinch. Whores use rouge.,Couldn't we use just rouge for this?,"If you pinch your cheeks, they'll glow."
642,[thinks to himself] She's been your wife for ...,[thinks] Moe. Wish he'd shut up.,"Oh, you're better off. Rich people aren't hap...",[thinks] Just mouth polite nothings. [out loud...,Some celery with cream cheese on it?,"[thinks] No, I don't want any string beans eit...",Some string beans?,"[thinks] Yeah, a million dollars' worth, you ..."
28,And did you pay for those clothes you're wearing?,No.,"Oh. Look at this way, when you had breakfast ...",But everybody does it.,"Well, DUH.","Well, in Sunday School, we learned that steali...","[sotto voce] Oh, great...","Dad, why is the world such a cesspool of corru..."
497,"I appreciate that, honey,","Homer, couldn't we pawn my engagement ring ins...","Aw come on, Dad, anything but that!",No Dad! Please don't pawn the TV!,To save this family we're gonna have to make t...,"Boxing, Lisa, boxing. There's a world of diff...",You're sending us to a doctor who advertises o...,The fat guy on TV?
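
The sizes of the two splits, and the number of batches they produce, can be checked with a little arithmetic. This is a sketch under two assumptions: that `train_test_split` rounds the validation share up when only `test_size` is given, and that the batch size is 4 with incomplete final batches dropped, as configured later in the notebook. The total of 699 examples is taken from the training log (629 + 70).

```python
import math

n_examples = 629 + 70   # example counts reported in the training log
test_size = 0.1
batch_size = 4          # per_gpu_train_batch_size on a single GPU
epochs = 10             # num_train_epochs

# Assumed rounding: validation share rounded up, training set gets the rest
n_val = math.ceil(test_size * n_examples)
n_train = n_examples - n_val

# drop_last=True discards the final incomplete batch
steps_per_epoch = n_train // batch_size
total_steps = steps_per_epoch * epochs
eval_batches = n_val // batch_size

print(n_train, n_val, steps_per_epoch, total_steps, eval_batches)
```

Because the split is not seeded, the rows that land in each set change from run to run; passing a fixed `random_state` to `train_test_split` would make the split repeatable.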


In [None]:
# Create a dataset for the model
def construct_conv(row, tokenizer, eos=True):
  flatten = lambda l: [item for sublist in l for item in sublist]
  convo = list(reversed([tokenizer.encode(x) + 
                         [tokenizer.eos_token_id] for x in row]))
  convo = flatten(convo) 
  return convo


class Conversation(Dataset):
  def __init__(self, tokenizer: PreTrainedTokenizer, args, df, block_size=512):

      block_size = block_size - (tokenizer.model_max_length - tokenizer.max_len_single_sentence)

      directory = args.cache_dir
      cached_features_file = os.path.join(
          directory, args.model_type + "_cached_lm_" + str(block_size)
      )

      if os.path.exists(cached_features_file) and not args.overwrite_cache:
          logger.info("Loading features from cached file %s", cached_features_file)
          with open(cached_features_file, "rb") as handle:
                self.examples = pickle.load(handle)
      else:
          logger.info("Creating features from dataset file at %s", directory)

          self.examples = []
          for _, row in df.iterrows():
              conv = construct_conv(row, tokenizer)
              self.examples.append(conv)

          logger.info("Saving features into cached file %s", cached_features_file)
          with open(cached_features_file, "wb") as handle:
              pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)

  def __len__(self):
      return len(self.examples)
  def __getitem__(self, item):
      return torch.tensor(self.examples[item], dtype=torch.long)
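
The ordering produced by `construct_conv` is easy to get backwards, so here is a small sketch of it. `ToyTokenizer` is a hypothetical whitespace tokenizer standing in for the real one; only its `encode` method and `eos_token_id` attribute mimic the interface used above. Each row is stored response first, so reversing it puts the oldest context line first and the response last, with an end-of-sequence id after every turn.

```python
class ToyTokenizer:
    """Hypothetical stand-in: assigns each new word the next integer id."""
    eos_token_id = 0

    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        return [self.vocab.setdefault(w, len(self.vocab) + 1) for w in text.split()]


def construct_conv(row, tokenizer):
    # Encode each turn, append EOS, then reverse so the oldest turn comes first
    convo = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
    return [tok for turn in convo for tok in turn]


# Row layout as in the DataFrame: response, context, context/0
row = ["D'oh!", "Did you check the den?", "Where are my keys?"]
ids = construct_conv(row, ToyTokenizer())
print(ids)
```

The response ends up at the tail of the sequence, which is the position the language model learns to predict from the context before it.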


In [None]:
# Cache and Store Data Checkpoints
def load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False):
    return Conversation(tokenizer, args, df_val if evaluate else df_trn)


def set_seed(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)


def _sorted_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> List[str]:
    ordering_and_checkpoint_path = []

    glob_checkpoints = glob.glob(os.path.join(args.output_dir, "{}-*".format(checkpoint_prefix)))

    for path in glob_checkpoints:
        if use_mtime:
            ordering_and_checkpoint_path.append((os.path.getmtime(path), path))
        else:
            regex_match = re.match(".*{}-([0-9]+)".format(checkpoint_prefix), path)
            if regex_match and regex_match.groups():
                ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))

    checkpoints_sorted = sorted(ordering_and_checkpoint_path)
    checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
    return checkpoints_sorted


def _rotate_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> None:
    if not args.save_total_limit:
        return
    if args.save_total_limit <= 0:
        return

    # Check if we should delete older checkpoint(s)
    checkpoints_sorted = _sorted_checkpoints(args, checkpoint_prefix, use_mtime)
    if len(checkpoints_sorted) <= args.save_total_limit:
        return

    number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - args.save_total_limit)
    checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]
    for checkpoint in checkpoints_to_be_deleted:
        logger.info("Deleting older checkpoint [{}] due to args.save_total_limit".format(checkpoint))
        shutil.rmtree(checkpoint)
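
The reason `_sorted_checkpoints` extracts the step number with a regular expression is that plain string sorting misorders checkpoint names. The sketch below shows the difference on three hypothetical directory names; `sort_checkpoints` is a simplified illustration that does not touch the file system.

```python
import re

def sort_checkpoints(paths, prefix="checkpoint"):
    """Order checkpoint paths by the step number in their name."""
    keyed = []
    for path in paths:
        m = re.match(r".*{}-([0-9]+)".format(prefix), path)
        if m:
            keyed.append((int(m.group(1)), path))
    return [p for _, p in sorted(keyed)]

# Hypothetical checkpoint directories
paths = [
    "output-small/checkpoint-3500",
    "output-small/checkpoint-10500",
    "output-small/checkpoint-7000",
]
print(sorted(paths))            # string order puts 10500 first
ordered = sort_checkpoints(paths)
print(ordered)                  # numeric order: 3500, 7000, 10500

# Rotation keeps the newest `keep` checkpoints and marks the rest for deletion
keep = 2
to_delete = ordered[:max(0, len(ordered) - keep)]
print(to_delete)
```

In this notebook `save_total_limit` is `None`, so `_rotate_checkpoints` returns immediately and nothing is deleted.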

### Build the Model

In [None]:
from transformers import AutoModelWithLMHead, AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelWithLMHead.from_pretrained("microsoft/DialoGPT-small")


In [None]:
# Configurations
logger = logging.getLogger(__name__)

MODEL_CONFIG_CLASSES = list(MODEL_WITH_LM_HEAD_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)

In [None]:
# The Args class allows easy conversion of the script to a notebook
class Args():
    def __init__(self):
        self.output_dir = 'output-small'
        self.model_type = 'gpt2'
        self.model_name_or_path = 'microsoft/DialoGPT-small'
        self.config_name = 'microsoft/DialoGPT-small'
        self.tokenizer_name = 'microsoft/DialoGPT-small'
        self.cache_dir = 'cached'
        self.block_size = 512
        self.do_train = True
        self.do_eval = True
        self.evaluate_during_training = False
        self.per_gpu_train_batch_size = 4
        self.per_gpu_eval_batch_size = 4
        self.gradient_accumulation_steps = 1
        self.learning_rate = 5e-5
        self.weight_decay = 0.0
        self.adam_epsilon = 1e-8
        self.max_grad_norm = 1.0
        self.num_train_epochs = 10
        self.max_steps = -1
        self.warmup_steps = 0
        self.logging_steps = 1000
        self.save_steps = 3500
        self.save_total_limit = None
        self.eval_all_checkpoints = False
        self.no_cuda = False
        self.overwrite_output_dir = True
        self.overwrite_cache = True
        self.should_continue = False
        self.seed = 42
        self.local_rank = -1
        self.fp16 = False
        self.fp16_opt_level = 'O1'

args = Args()
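
One possible alternative, not used in this notebook, is a dataclass. The sketch below carries only a handful of the fields above; its main benefit is a readable `repr`, whereas logging a plain `Args` instance prints only the object address. The class name `ArgsSketch` is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ArgsSketch:
    # A subset of the settings above, with the same default values
    output_dir: str = "output-small"
    model_name_or_path: str = "microsoft/DialoGPT-small"
    per_gpu_train_batch_size: int = 4
    learning_rate: float = 5e-5
    num_train_epochs: int = 10
    seed: int = 42

# Override a single field without editing the class
sketch_args = ArgsSketch(num_train_epochs=3)
print(sketch_args)
```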

### Train and Evaluation




In [None]:
def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:
    """ Train the model """
    if args.local_rank in [-1, 0]:
        tb_writer = SummaryWriter()

    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
    train_dataloader = DataLoader(
        train_dataset, sampler=train_sampler, batch_size=args.train_batch_size, collate_fn=collate, drop_last = True
    )

    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
    else:
        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    model = model.module if hasattr(model, "module") else model  # Take care of distributed/parallel training
    model.resize_token_embeddings(len(tokenizer))
    # add_special_tokens_(model, tokenizer)

    # Prepare optimizer and schedule (linear warmup and decay)
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
    )

    # Check if saved optimizer or scheduler states exist
    if (
        args.model_name_or_path
        and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt"))
        and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt"))
    ):
        # Load in optimizer and scheduler states
        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))

    if args.fp16:
        try:
            from apex import amp
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)

    # multi-gpu training (should be after apex fp16 initialization)
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Distributed training (should be after apex fp16 initialization)
    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
        )

    # Train!
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
    logger.info(
        "  Total train batch size (w. parallel, distributed & accumulation) = %d",
        args.train_batch_size
        * args.gradient_accumulation_steps
        * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
    )
    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info("  Total optimization steps = %d", t_total)

    global_step = 0
    epochs_trained = 0
    steps_trained_in_current_epoch = 0
    # Check if continuing training from a checkpoint
    if args.model_name_or_path and os.path.exists(args.model_name_or_path):
        try:
            # set global_step to global_step of last saved checkpoint from model path
            checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
            global_step = int(checkpoint_suffix)
            epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
            steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)

            logger.info("  Continuing training from checkpoint, will skip to saved global_step")
            logger.info("  Continuing training from epoch %d", epochs_trained)
            logger.info("  Continuing training from global step %d", global_step)
            logger.info("  Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)
        except ValueError:
            logger.info("  Starting fine-tuning.")

    tr_loss, logging_loss = 0.0, 0.0

    model.zero_grad()
    train_iterator = trange(
        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]
    )
    set_seed(args)  # Added here for reproducibility
    for _ in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
        for step, batch in enumerate(epoch_iterator):

            # Skip past any already trained steps if resuming training
            if steps_trained_in_current_epoch > 0:
                steps_trained_in_current_epoch -= 1
                continue 

            inputs, labels = (batch, batch)
            if inputs.shape[1] > 1024: continue
            inputs = inputs.to(args.device)
            labels = labels.to(args.device)
            model.train()
            outputs = model(inputs, labels=labels)
            loss = outputs[0]  # model outputs are always tuple in transformers (see doc)

            if args.n_gpu > 1:
                loss = loss.mean()  # mean() to average on multi-gpu parallel training
            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()

            tr_loss += loss.item()
            if (step + 1) % args.gradient_accumulation_steps == 0:
                if args.fp16:
                    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
                else:
                    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
                optimizer.step()
                scheduler.step()  # Update learning rate schedule
                model.zero_grad()
                global_step += 1

                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
                    # Log metrics
                    if (
                        args.local_rank == -1 and args.evaluate_during_training
                    ):  # Only evaluate when single GPU otherwise metrics may not average well
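                        # NOTE: evaluate() also requires df_trn and df_val, so this call
                        # raises a TypeError if evaluate_during_training is set to True.
                        results = evaluate(args, model, tokenizer)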
                        for key, value in results.items():
                            tb_writer.add_scalar("eval_{}".format(key), value, global_step)
                    tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
                    tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
                    logging_loss = tr_loss

                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
                    checkpoint_prefix = "checkpoint"
                    # Save model checkpoint
                    output_dir = os.path.join(args.output_dir, "{}-{}".format(checkpoint_prefix, global_step))
                    os.makedirs(output_dir, exist_ok=True)
                    model_to_save = (
                        model.module if hasattr(model, "module") else model
                    )  # Take care of distributed/parallel training
                    model_to_save.save_pretrained(output_dir)
                    tokenizer.save_pretrained(output_dir)

                    torch.save(args, os.path.join(output_dir, "training_args.bin"))
                    logger.info("Saving model checkpoint to %s", output_dir)

                    _rotate_checkpoints(args, checkpoint_prefix)

                    torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
                    torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
                    logger.info("Saving optimizer and scheduler states to %s", output_dir)

            if args.max_steps > 0 and global_step > args.max_steps:
                epoch_iterator.close()
                break
        if args.max_steps > 0 and global_step > args.max_steps:
            train_iterator.close()
            break

    if args.local_rank in [-1, 0]:
        tb_writer.close()

    return global_step, tr_loss / global_step               

# Evaluate the model on the validation set

def evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, df_trn, df_val, prefix="") -> Dict:
    # Evaluation results are written to the training output directory
    eval_output_dir = args.output_dir

    eval_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=True)
    os.makedirs(eval_output_dir, exist_ok=True)
    args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
    # Note that DistributedSampler samples randomly

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(
        eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate, drop_last = True
    )

    # multi-gpu evaluate
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)
 
    # Evaluation!
    logger.info("***** Running evaluation {} *****".format(prefix))
    logger.info("  Num examples = %d", len(eval_dataset))
    logger.info("  Batch size = %d", args.eval_batch_size)
    eval_loss = 0.0
    nb_eval_steps = 0
    model.eval()

    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        inputs, labels = (batch, batch)
        inputs = inputs.to(args.device)
        labels = labels.to(args.device)

        with torch.no_grad():
            outputs = model(inputs, labels=labels)
            lm_loss = outputs[0]
            eval_loss += lm_loss.mean().item()
        nb_eval_steps += 1

    eval_loss = eval_loss / nb_eval_steps
    perplexity = torch.exp(torch.tensor(eval_loss))

    result = {"perplexity": perplexity}

    output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
    with open(output_eval_file, "w") as writer:
        logger.info("***** Eval results {} *****".format(prefix))
        for key in sorted(result.keys()):
            logger.info("  %s = %s", key, str(result[key]))
            writer.write("%s = %s\n" % (key, str(result[key])))

    return result
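
The perplexity returned by `evaluate` is the exponential of the mean per-batch loss. The sketch below reproduces that relationship in plain Python; the three batch losses are made-up numbers, and the last line simply inverts the formula for the perplexity of 3.8048 reported in the run log.

```python
import math

def perplexity(batch_losses):
    """exp of the average cross-entropy loss over the evaluation batches."""
    return math.exp(sum(batch_losses) / len(batch_losses))

# Hypothetical per-batch losses, for illustration only
losses = [1.2, 1.4, 1.5]
print(perplexity(losses))

# Inverting the formula: mean evaluation loss implied by a perplexity of 3.8048
implied_loss = math.log(3.8048)
print(implied_loss)
```

A perplexity of 1.0 would mean the model assigns probability 1 to every token; lower values indicate a better fit to the validation conversations.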


In [None]:
# Main runner

def main(df_trn, df_val):
    args = Args()
    
    if args.should_continue:
        sorted_checkpoints = _sorted_checkpoints(args)
        if len(sorted_checkpoints) == 0:
            raise ValueError("Used --should_continue but no checkpoint was found in --output_dir.")
        else:
            args.model_name_or_path = sorted_checkpoints[-1]

    if (
        os.path.exists(args.output_dir)
        and os.listdir(args.output_dir)
        and args.do_train
        and not args.overwrite_output_dir
        and not args.should_continue
    ):
        raise ValueError(
            "Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
                args.output_dir
            )
        )

    # Setup CUDA, GPU & distributed training
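    # Fall back to the CPU when no GPU is available or when no_cuda is set
    use_cuda = torch.cuda.is_available() and not args.no_cuda
    device = torch.device("cuda" if use_cuda else "cpu")
    args.n_gpu = torch.cuda.device_count() if use_cuda else 0
    args.device = device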

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
    )
    logger.warning(
        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
        args.local_rank,
        device,
        args.n_gpu,
        bool(args.local_rank != -1),
        args.fp16,
    )

    # Set seed
    set_seed(args)

    config = AutoConfig.from_pretrained(args.config_name, cache_dir=args.cache_dir)
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
    model = AutoModelWithLMHead.from_pretrained(
        args.model_name_or_path,
        from_tf=False,
        config=config,
        cache_dir=args.cache_dir,
    )
    model.to(args.device)
    
    logger.info("Training/evaluation parameters %s", args)

    # Training
    if args.do_train:
        train_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False)

        global_step, tr_loss = train(args, train_dataset, model, tokenizer)
        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

    # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
    if args.do_train:
        # Create output directory if needed
        os.makedirs(args.output_dir, exist_ok=True)

        logger.info("Saving model checkpoint to %s", args.output_dir)
        # Save a trained model, configuration and tokenizer using `save_pretrained()`.
        # They can then be reloaded using `from_pretrained()`
        model_to_save = (
            model.module if hasattr(model, "module") else model
        )  # Take care of distributed/parallel training
        model_to_save.save_pretrained(args.output_dir)
        tokenizer.save_pretrained(args.output_dir)

        # Good practice: save your training arguments together with the trained model
        torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

        # Load a trained model and vocabulary that you have fine-tuned
        model = AutoModelWithLMHead.from_pretrained(args.output_dir)
        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
        model.to(args.device)

    # Evaluation
    results = {}
    if args.do_eval and args.local_rank in [-1, 0]:
        checkpoints = [args.output_dir]
        if args.eval_all_checkpoints:
            checkpoints = list(
                os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
            )
            logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
        logger.info("Evaluate the following checkpoints: %s", checkpoints)
        for checkpoint in checkpoints:
            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
            prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""

            model = AutoModelWithLMHead.from_pretrained(checkpoint)
            model.to(args.device)
            result = evaluate(args, model, tokenizer, df_trn, df_val, prefix=prefix)
            result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
            results.update(result)

    return results


### Run the main function

In [None]:
main(train_df, val_df)

05/26/2022 23:34:24 - INFO - __main__ -   Training/evaluation parameters <__main__.Args object at 0x7f7e13b1fe90>
05/26/2022 23:34:24 - INFO - __main__ -   Creating features from dataset file at cached
05/26/2022 23:34:25 - INFO - __main__ -   Saving features into cached file cached/gpt2_cached_lm_512
05/26/2022 23:34:25 - INFO - __main__ -   ***** Running training *****
05/26/2022 23:34:25 - INFO - __main__ -     Num examples = 629
05/26/2022 23:34:25 - INFO - __main__ -     Num Epochs = 10
05/26/2022 23:34:25 - INFO - __main__ -     Instantaneous batch size per GPU = 4
05/26/2022 23:34:25 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 4
05/26/2022 23:34:25 - INFO - __main__ -     Gradient Accumulation steps = 1
05/26/2022 23:34:25 - INFO - __main__ -     Total optimization steps = 1570


Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Iteration:   0%|          | 0/157 [00:00<?, ?it/s]

05/26/2022 23:39:15 - INFO - __main__ -    global_step = 1570, average loss = 1.8578902337201841
05/26/2022 23:39:15 - INFO - __main__ -   Saving model checkpoint to output-small
05/26/2022 23:39:27 - INFO - __main__ -   Evaluate the following checkpoints: ['output-small']
05/26/2022 23:39:29 - INFO - __main__ -   Creating features from dataset file at cached
05/26/2022 23:39:29 - INFO - __main__ -   Saving features into cached file cached/gpt2_cached_lm_512
05/26/2022 23:39:29 - INFO - __main__ -   ***** Running evaluation  *****
05/26/2022 23:39:29 - INFO - __main__ -     Num examples = 70
05/26/2022 23:39:29 - INFO - __main__ -     Batch size = 4


Evaluating:   0%|          | 0/17 [00:00<?, ?it/s]

05/26/2022 23:39:30 - INFO - __main__ -   ***** Eval results  *****
05/26/2022 23:39:30 - INFO - __main__ -     perplexity = tensor(3.8048)


{'perplexity_': tensor(3.8048)}
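
The numbers in the log can be checked by hand. The sketch below assumes that incomplete final batches are dropped (consistent with the 157 iterations per epoch and 17 evaluation batches shown above) and that perplexity is the exponential of the mean evaluation cross-entropy loss, which is the usual definition. The `evaluate` function is not shown in this section, so the second point is an assumption. The function names are hypothetical.

```python
import math


def optimization_steps(num_examples: int, batch_size: int, epochs: int,
                       grad_accum: int = 1) -> int:
    """Total optimizer updates, assuming incomplete final batches are dropped."""
    batches_per_epoch = num_examples // batch_size
    return (batches_per_epoch // grad_accum) * epochs


def perplexity(mean_loss: float) -> float:
    """Perplexity as exp of the mean cross-entropy loss (standard definition)."""
    return math.exp(mean_loss)


train_steps = optimization_steps(629, 4, 10)   # 629 // 4 = 157 batches, 10 epochs
eval_batches = 70 // 4                         # matches the 0/17 progress bar
implied_eval_loss = math.log(3.8048)           # loss implied by the reported perplexity

print(train_steps)   # 1570
print(eval_batches)  # 17
print(round(implied_eval_loss, 3))
```

Note that the `average loss = 1.8579` in the log is the training loss averaged over all steps, so it is not expected to equal the loss implied by the validation perplexity.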

### Load the Trained Model

In [None]:
#from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-small')
model = AutoModelWithLMHead.from_pretrained('output-small')



In [None]:
# Let's chat for 3 turns
for step in range(3):
    # encode the new user input, add the eos_token and return a PyTorch tensor
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')
    # print(new_user_input_ids)

    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # generate a response while limiting the total sequence (chat history plus reply) to 200 tokens
    chat_history_ids = model.generate(
        bot_input_ids, max_length=200,
        pad_token_id=tokenizer.eos_token_id,  
        no_repeat_ngram_size=3,       
        do_sample=True, 
        top_k=100, 
        top_p=0.7,
        temperature=0.8
    )
    
    # pretty print the last output tokens from the bot
    print("HomerBot: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

>> User:What is going on today?
HomerBot: What's the matter, boy?
>> User:Nothing is going on
HomerBot: Oh, I'm sorry, Miss Bouvier, but I've -- I've never expelled anyone before.
>> User:Yeah thats not good
HomerBot: You ever known a college boy to be a good guy?
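
The chat loop relies on two pieces of bookkeeping: slicing off the prompt to print only the new reply, and feeding the whole history back in on the next turn. Because `max_length=200` bounds the prompt and the reply together, a long conversation eventually leaves no room for a reply. The sketch below uses plain Python lists in place of tensors to show both ideas. `split_reply` and `truncate_history` are hypothetical helpers that are not part of the notebook, and the token ids are made up apart from 50256, the GPT-2 end-of-text id.

```python
EOS = 50256  # GPT-2 end-of-text token id


def split_reply(generated_ids: list, prompt_len: int) -> list:
    """Return only the tokens produced after the prompt."""
    return generated_ids[prompt_len:]


def truncate_history(history_ids: list, max_history: int) -> list:
    """Keep only the most recent max_history tokens of the conversation."""
    if max_history <= 0:
        return []
    return history_ids[-max_history:]


prompt = [101, 102, EOS]               # made-up ids for one user turn
generated = prompt + [7, 8, 9, EOS]    # the model output echoes the prompt first
reply = split_reply(generated, len(prompt))
recent = truncate_history(list(range(10)), 4)

print(reply)   # [7, 8, 9, 50256]
print(recent)  # [6, 7, 8, 9]
```

With tensors the same slicing is written along the last dimension, as in `chat_history_ids[:, bot_input_ids.shape[-1]:]` in the cell above.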


## *Model Evaluation*

## *Push Model to Hugging Face*

In [None]:

os.chdir('/content/')

In [None]:
!pip install huggingface_hub

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!huggingface-cli login


        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

        To login, `huggingface_hub` now requires a token generated from https://huggingface.co/settings/tokens .
        (Deprecated, will be removed in v0.3.0) To login with username and password instead, interrupt with Ctrl+C.
        
Token: 
Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on y

In [None]:
!huggingface-cli repo create DialoGPT-small-homersimpsonbot

git version 2.17.1
Error: unknown flag: --version

Sorry, no usage text found for "git-lfs"

You are about to create HomerChatbot/DialoGPT-small-homersimpsonbot
Proceed? [Y/n] Y

Your repo now lives at:
  https://huggingface.co/HomerChatbot/DialoGPT-small-homersimpsonbot

You can clone it locally with the command below, and commit/push as usual.

  git clone https://huggingface.co/HomerChatbot/DialoGPT-small-homersimpsonbot



In [None]:
!sudo apt-get install git-lfs 

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.3.4-1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 42 not upgraded.


In [None]:
# Access token redacted; supply it through an environment variable (HF_TOKEN is an assumed name)
!git clone https://HomerChatbot:$HF_TOKEN@huggingface.co/HomerChatbot/DialoGPT-small-homersimpsonbot

Cloning into 'DialoGPT-small-homersimpsonbot'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
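
To keep an access token out of the notebook source, the clone URL can be assembled at run time from an environment variable. This is a minimal sketch, not the method used in the notebook: the `clone_url` helper and the `HF_TOKEN` variable name are assumptions, and the URL layout simply follows the `git clone` command above.

```python
import os
from urllib.parse import quote


def clone_url(user: str, repo: str, token_env: str = "HF_TOKEN") -> str:
    """Build an authenticated clone URL without hard-coding the token."""
    token = os.environ.get(token_env)
    if not token:
        raise RuntimeError(f"Set the {token_env} environment variable first")
    # Percent-encode the credentials so special characters cannot break the URL.
    credentials = f"{quote(user, safe='')}:{quote(token, safe='')}"
    return f"https://{credentials}@huggingface.co/{user}/{repo}"


# Usage (assumes HF_TOKEN has been set, for example from a secrets manager):
#   import subprocess
#   subprocess.run(["git", "clone",
#                   clone_url("HomerChatbot", "DialoGPT-small-homersimpsonbot")],
#                  check=True)
```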


In [None]:
!ls "/content/drive/My Drive/AD470-MPC-Group1-SP2022/Project_3/output-small"

config.json	  pytorch_model.bin	   tokenizer.json
eval_results.txt  special_tokens_map.json  training_args.bin
merges.txt	  tokenizer_config.json    vocab.json


In [None]:
!mv /content/drive/My\ Drive/AD470-MPC-Group1-SP2022/Project_3/output-small/* DialoGPT-small-homersimpsonbot/
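
Note that `mv` removes the trained files from Drive, so a later cell that reloads the model from `output-small` would no longer find them. One alternative, sketched here under that assumption, is to copy the files instead. `copy_model_files` is a hypothetical helper, and the demonstration runs against temporary directories rather than the real paths.

```python
import shutil
import tempfile
from pathlib import Path


def copy_model_files(src_dir, dst_dir) -> list:
    """Copy every regular file from src_dir into dst_dir, leaving the source intact."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.iterdir()):
        if path.is_file():
            shutil.copy2(path, dst / path.name)  # copy2 also preserves timestamps
            copied.append(path.name)
    return copied


# Demonstration with throwaway directories standing in for Drive and the repo clone.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "output-small"
    src.mkdir()
    (src / "config.json").write_text("{}")
    copied = copy_model_files(src, Path(tmp) / "repo")
    source_still_there = (src / "config.json").exists()

print(copied, source_still_there)  # ['config.json'] True
```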

In [None]:
os.chdir('DialoGPT-small-homersimpsonbot')

In [None]:
retval = os.getcwd()
print(retval)

/content/DialoGPT-small-homersimpsonbot


In [None]:
!ls

config.json	  pytorch_model.bin	   tokenizer.json
eval_results.txt  special_tokens_map.json  training_args.bin
merges.txt	  tokenizer_config.json    vocab.json


In [None]:
!git lfs install

Updated git hooks.
Git LFS initialized.


In [None]:
!ls
!pwd

config.json	  pytorch_model.bin	   tokenizer.json
eval_results.txt  special_tokens_map.json  training_args.bin
merges.txt	  tokenizer_config.json    vocab.json
/content/DialoGPT-small-homersimpsonbot


In [None]:
!git status 

On branch main
Your branch is up to date with 'origin/main'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	config.json
	eval_results.txt
	merges.txt
	pytorch_model.bin
	special_tokens_map.json
	tokenizer.json
	tokenizer_config.json
	training_args.bin
	vocab.json

nothing added to commit but untracked files present (use "git add" to track)


In [None]:
!git add . 

In [None]:
!git config --global user.email "monaatabani@gmail.com"

!git config --global user.name "HomerChatbot"
!git commit -m "HomerChatbot uploaded"


[main 084559c] HomerChatbot uploaded
 9 files changed, 150353 insertions(+)
 create mode 100644 config.json
 create mode 100644 eval_results.txt
 create mode 100644 merges.txt
 create mode 100644 pytorch_model.bin
 create mode 100644 special_tokens_map.json
 create mode 100644 tokenizer.json
 create mode 100644 tokenizer_config.json
 create mode 100644 training_args.bin
 create mode 100644 vocab.json


In [None]:
!git push

Git LFS: (2 of 2 files) 486.75 MB / 486.75 MB
Counting objects: 11, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (11/11), 1.09 MiB | 945.00 KiB/s, done.
Total 11 (delta 0), reused 0 (delta 0)
remote: Enforcing permissions...
remote: Allowed refs: all
To https://huggingface.co/HomerChatbot/DialoGPT-small-homersimpsonbot
   94d0357..084559c  main -> main
