# Fine-tune a DialoGPT model (Generative Pre-Trained Transformer)

Adapted from the notebook in [this Medium post](https://towardsdatascience.com/make-your-own-rick-sanchez-bot-with-transformers-and-dialogpt-fine-tuning-f85e6d1f4e30?gi=e4a72d1510f0).

DialoGPT was proposed in *DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation* by Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, Bill Dolan. It is a GPT-2 model trained on 147M conversation-like exchanges extracted from Reddit.

The abstract from the paper is the following:

We present a large, tunable neural conversational response generation model, DialoGPT (dialogue generative pre-trained transformer). Trained on 147M conversation-like exchanges extracted from Reddit comment chains over a period spanning from 2005 through 2017, DialoGPT extends the Hugging Face PyTorch transformer to attain a performance close to human both in terms of automatic and human evaluation in single-turn dialogue settings. We show that conversational systems that leverage DialoGPT generate more relevant, contentful and context-consistent responses than strong baseline systems. The pre-trained model and training pipeline are publicly released to facilitate research into neural response generation and the development of more intelligent open-domain dialogue systems.

## Setup

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [None]:
!pip -q install transformers # Hugging Face library that provides pretrained transformer models such as DialoGPT (a decoder-only GPT-2 model)

In [None]:
import os # os module provides function to interact with the OS
os.chdir("/content/drive/My Drive/Batman Chat_Bot") # setting up the path

In [None]:
# all the imports

import glob # short for global is used to return all file paths that match a specific pattern
import logging # a toolbox that provides insights ; keeps detailed log of everything
import os
import pickle # serializes Python objects (list, dict, etc.) into byte streams; this is also called pickling, flattening or marshalling
import random # used to generate random numbers
import re # to handle regular expressions (one that we used in parse script)
import shutil # high-level file operations such as copying, moving and removing files and directory trees
from typing import Dict, List, Tuple

import numpy as np # mathematical and logical operations on arrays can be performed
import pandas as pd # used for data analysis

from sklearn.model_selection import train_test_split # Split arrays or matrices into random train and test subsets

from torch.nn.utils.rnn import pad_sequence # pads a list of variable-length sequences to the length of the longest one. By default the padding value 0 is appended at the end of each shorter sequence
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
# DataLoader: Combines a dataset and a sampler, and provides an iterable over the given dataset
# RandomSampler: A Sampler that returns random indices (A Sampler is an object that yields an index with which to access a dataset)
# SequentialSampler: A Sampler that returns indices sequentially.
from torch.utils.data.distributed import DistributedSampler # splits the training data among the processes in distributed training; DistributedDataParallel then averages the gradients on the backward pass
from tqdm.notebook import tqdm, trange # create progress bars
# tqdm: used for creating Progress Meters or Progress Bars. tqdm got its name from the Arabic name taqaddum which means 'progress'.
from pathlib import Path # pathlib: which provides an object API for working with files and directories

from transformers import (
    MODEL_WITH_LM_HEAD_MAPPING, # mapping from config classes to the PyTorch model classes that have a language modeling (LM) head
    WEIGHTS_NAME,
    AdamW, # Adam optimizer with decoupled weight decay; this transformers implementation is deprecated in favor of torch.optim.AdamW
    AutoConfig,
    # In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you are supplying to the from_pretrained() method. 
    # AutoClasses are here to do this job for you so that you automatically retrieve the relevant model given the name/path to the pretrained weights/config/vocabulary.
    # Instantiating one of AutoConfig, AutoModel, and AutoTokenizer will directly create a class of the relevant architecture.
    PreTrainedModel, # PreTrainedModel takes care of storing the configuration of the models and handles methods for loading, downloading and saving models
    PreTrainedTokenizer, # A tokenizer is in charge of preparing the inputs for a model.
    get_linear_schedule_with_warmup, # Create a schedule with a learning rate that decreases linearly from the initial lr set in the optimizer to 0, after a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer.
)


try:
    from torch.utils.tensorboard import SummaryWriter # The SummaryWriter class is your main entry to log data for consumption and visualization by TensorBoard
except ImportError:
    from tensorboardX import SummaryWriter
    # The SummaryWriter class provides a high-level API to create an event file in a given directory and add summaries and events to it.
    # The class updates the file contents asynchronously.
    # This allows a training program to call methods to add data to the file directly from the training loop, without slowing down training.

In [None]:
data = pd.read_csv('Batman.csv') # Read the csv file

In [None]:
data.sample(6) # check the csv file

Unnamed: 0,name,line
11,Batman,"I figured you could use my help, Selina."
94,Harley Quinn,Batter up!
49,Batman,There's the gun. It looks like it's being con...
36,Batman,"The room is secure, you're safe now!"
41,Batman,"After what happened at the asylum, I thought ..."
116,l,I have heard this words a hundred times.Let u...


In [None]:
len(data) # dataset size

143

In [None]:
sum(data.name == 'Batman') # main character dataset size

63

In [None]:
CHARACTER_NAME = 'Batman'

We will convert this dataset in a way that every response row will contain n previous responses as a context. For our purposes, seven previous responses will be enough.

In [None]:
contexted = [] # rows of the context data frame: the line our character is speaking plus the n lines directly preceding it

# context window of size 7
n = 7

for i in data[data.name == CHARACTER_NAME].index:
  if i < n:
    continue
  row = []
  prev = i - 1 - n # we additionally subtract 1, so row will contain current response and 7 previous responses
  for j in range(i, prev, -1):
    row.append(data.line[j])
  contexted.append(row)

columns = ['response', 'context'] 
columns = columns + ['context/' + str(i) for i in range(n - 1)]

df = pd.DataFrame.from_records(contexted, columns=columns)

A context data frame is useful here because we are creating a conversational chatbot and we want to generate each response based on the conversation context.
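
To make the windowing concrete, here is a minimal sketch of the same idea on a toy transcript, using plain Python lists instead of pandas. The transcript, the helper name `build_context_rows`, and the window size of 2 are assumptions chosen for illustration; they are not part of the notebook.

```python
# Toy transcript as (speaker, line) pairs; hypothetical data for illustration only.
lines = [
    ("Alfred", "line 0"),
    ("Batman", "line 1"),
    ("Joker", "line 2"),
    ("Batman", "line 3"),
    ("Alfred", "line 4"),
    ("Batman", "line 5"),
]

def build_context_rows(lines, character, n):
    """Return rows of [response, previous line, ..., n-th previous line]."""
    rows = []
    for i, (speaker, _) in enumerate(lines):
        if speaker != character or i < n:
            continue  # skip other speakers and lines without n predecessors
        # Walk backwards from the response to the n-th previous line.
        rows.append([lines[j][1] for j in range(i, i - n - 1, -1)])
    return rows

rows = build_context_rows(lines, "Batman", n=2)
for row in rows:
    print(row)
```

Each row starts with the character's response and then lists the context from newest to oldest, which matches the column order `response, context, context/0, ...` used above.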

In [None]:
df.sample(6)

Unnamed: 0,response,context,context/0,context/1,context/2,context/3,context/4,context/5
35,I need to get out of here. Someone needs to s...,What the hell are you doing?,"Wait here, doctor.",At least they won't be getting in. We're safe...,"Get that door sealed up nice-n-tight, boys? W...",What's happening? How does she know what we'r...,"Uh, uh Doc! Isn't that information supposed t...","Yeah, do you know what he's talking about? He..."
7,I'm at the church. It looks like Harley Quinn...,"Nine lives, remember?",It was Joker. You're not safe here. No one is.,This place is dangerous. I like it. You expec...,"See you soon, Bats!",What the hell?,The ex-District Attorney here said something ...,"Twinkle, twinkle, little bat. Watch me kill y..."
9,"They don't know where I am. Good, let's keep ...","Don't worry, Alfred, Quinn never was too smar...",That dreadful woman is no doubt setting a tra...,I'm at the church. It looks like Harley Quinn...,"Nine lives, remember?",It was Joker. You're not safe here. No one is.,This place is dangerous. I like it. You expec...,"See you soon, Bats!"
0,If there's one person in Arkham City who know...,No it does not. Mr. Dent's predilection for a...,That doesn't sound good.,Understood.,Stand down. Let Two-Face have his fun.,Affirmative. Target is being held by Dent. We...,Is she in danger?,"All units, this is AIR TYGER 4. We have confi..."
15,"If I get behind those two without being seen,...",You'll be OK. Wait here and don't make a noise.,That idiot thinks he's safe in the confession...,You're safe. Stay quiet.,He's got a hostage. I can glide to the scaffo...,"Four thugs, all armed, two hostages. This is ...","They don't know where I am. Good, let's keep ...","Don't worry, Alfred, Quinn never was too smar..."
22,"Alfred, I've got a lock on the signal used to...","Well, look who it is. I haven't seen you sinc...",There's the gun. It looks like it's being con...,Will do. Hope you find the doc. Her name's St...,"If she's alive, I'll find her. You concentrat...","Wait! I mean, sorry, Batman, I forgot: Couple...",Thanks. I'll check it out.,That crazy bitch busted in a couple of hours ...


In [None]:
trn_df, val_df = train_test_split(df, test_size=0.1) # split the data set into a training set (90%) and a held-out validation set (10%)
trn_df.head()

Unnamed: 0,response,context,context/0,context/1,context/2,context/3,context/4,context/5
55,There's always a choice.,Why? You would never do it.You left me no cho...,You didn't need to...,Problem solved.,TALIA NO!,But you've already got the cure. (Talia the s...,"(Laughs) Now you want to talk? Too late, Batm...",Let's Just talk about this.
52,But you've already got the cure. (Talia the s...,"(Laughs) Now you want to talk? Too late, Batm...",Let's Just talk about this.,Hurry up and take your seat Batman. The shows...,Then may the spirits be kind. (sighs),Like always.,Of course not. I just... I just want you to b...,Are you trying to talk me out of this?
1,Taking out the thug with the gun is the key. ...,If there's one person in Arkham City who know...,No it does not. Mr. Dent's predilection for a...,That doesn't sound good.,Understood.,Stand down. Let Two-Face have his fun.,Affirmative. Target is being held by Dent. We...,Is she in danger?
14,You'll be OK. Wait here and don't make a noise.,That idiot thinks he's safe in the confession...,You're safe. Stay quiet.,He's got a hostage. I can glide to the scaffo...,"Four thugs, all armed, two hostages. This is ...","They don't know where I am. Good, let's keep ...","Don't worry, Alfred, Quinn never was too smar...",That dreadful woman is no doubt setting a tra...
12,You're safe. Stay quiet.,He's got a hostage. I can glide to the scaffo...,"Four thugs, all armed, two hostages. This is ...","They don't know where I am. Good, let's keep ...","Don't worry, Alfred, Quinn never was too smar...",That dreadful woman is no doubt setting a tra...,I'm at the church. It looks like Harley Quinn...,"Nine lives, remember?"


We hold out a validation set because we do not want the model to overfit. An overfitted model simply memorizes lines from the dataset and repeats them back verbatim, whereas we want the conversation to be more organic. So we train the model only on the training set and evaluate it on the held-out set.

Now we will convert our dataset into a format suitable for our model. Basically we will concatenate the responses of each row into one string, adding a special ‘end of string’ token between responses so the model can tell where each response ends.
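
The concatenation can be sketched without downloading a real tokenizer. `ToyTokenizer` below is a hypothetical stand-in that assigns one id per word and reserves id 0 for the end-of-string token; the real GPT-2 tokenizer uses byte-pair encoding and a different end-of-string id.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer: one id per whitespace-separated word."""
    eos_token_id = 0

    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        # Assign ids starting at 1 so that 0 stays reserved for the end-of-string token.
        return [self.vocab.setdefault(word, len(self.vocab) + 1) for word in text.split()]

def flatten_conversation(row, tokenizer):
    """Encode each utterance, append EOS, and join them oldest-first."""
    encoded = [tokenizer.encode(text) + [tokenizer.eos_token_id] for text in row]
    # Rows are stored as [response, most recent context, ..., oldest context],
    # so reversing puts the conversation in chronological order.
    return [token for utterance in reversed(encoded) for token in utterance]

row = ["I am Batman", "Who are you", "Stop right there"]
tokens = flatten_conversation(row, ToyTokenizer())
print(tokens)
```

The response ends up last in the flattened sequence, so the model learns to produce it after reading the earlier turns.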

In [None]:
# create dataset suitable for our model

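def construct_conv(row, tokenizer, eos=True):
    # Encode every utterance in the row and terminate it with the end-of-string token.
    # Rows are ordered [response, newest context, ..., oldest context], so they are reversed
    # to put the conversation in chronological order. (`eos` is currently unused.)
    flatten = lambda l: [item for sublist in l for item in sublist]
    conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
    conv = flatten(conv)
    return conv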

class ConversationDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, args, df, block_size=512):

        block_size = block_size - (tokenizer.model_max_length - tokenizer.max_len_single_sentence)

        directory = args.cache_dir
        cached_features_file = os.path.join(
            directory, args.model_type + "_cached_lm_" + str(block_size)
        )

        if os.path.exists(cached_features_file) and not args.overwrite_cache:
            logger.info("Loading features from cached file %s", cached_features_file)
            with open(cached_features_file, "rb") as handle:
                self.examples = pickle.load(handle)
        else:
            logger.info("Creating features from dataset file at %s", directory)

            self.examples = []
            for _, row in df.iterrows():
                conv = construct_conv(row, tokenizer)
                self.examples.append(conv)

            logger.info("Saving features into cached file %s", cached_features_file)
            with open(cached_features_file, "wb") as handle:
                pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, item):
        return torch.tensor(self.examples[item], dtype=torch.long)
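
The dataset returns token sequences of different lengths, so they must be padded before they can be stacked into a batch; the training code does this with `pad_sequence(..., batch_first=True)`. The sketch below imitates that behavior on plain Python lists. `pad_right` is a hypothetical helper written for illustration, not a PyTorch function.

```python
def pad_right(sequences, padding_value=0):
    """Pad every sequence on the right to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [padding_value] * (max_len - len(seq)) for seq in sequences]

batch = pad_right([[5, 6, 7], [8], [9, 10]])
print(batch)
```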

In [None]:
# Caching and storing of data/checkpoints

def load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False):
    return ConversationDataset(tokenizer, args, df_val if evaluate else df_trn)


def set_seed(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)


def _sorted_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> List[str]:
    ordering_and_checkpoint_path = []

    glob_checkpoints = glob.glob(os.path.join(args.output_dir, "{}-*".format(checkpoint_prefix)))

    for path in glob_checkpoints:
        if use_mtime:
            ordering_and_checkpoint_path.append((os.path.getmtime(path), path))
        else:
            regex_match = re.match(".*{}-([0-9]+)".format(checkpoint_prefix), path)
            if regex_match and regex_match.groups():
                ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))

    checkpoints_sorted = sorted(ordering_and_checkpoint_path)
    checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
    return checkpoints_sorted


def _rotate_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> None:
    if not args.save_total_limit:
        return
    if args.save_total_limit <= 0:
        return

    # Check if we should delete older checkpoint(s)
    checkpoints_sorted = _sorted_checkpoints(args, checkpoint_prefix, use_mtime)
    if len(checkpoints_sorted) <= args.save_total_limit:
        return

    number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - args.save_total_limit)
    checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]
    for checkpoint in checkpoints_to_be_deleted:
        logger.info("Deleting older checkpoint [{}] due to args.save_total_limit".format(checkpoint))
        shutil.rmtree(checkpoint)
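
The rotation logic can be tried out without touching the file system. The following sketch assumes checkpoint directories named `<prefix>-<step>` as above; `checkpoints_to_delete` and the example paths are hypothetical and only return the paths that would be removed.

```python
import re

def checkpoints_to_delete(paths, save_total_limit, prefix="checkpoint"):
    """Return the oldest checkpoint paths that exceed save_total_limit."""
    if not save_total_limit or save_total_limit <= 0:
        return []
    numbered = []
    for path in paths:
        match = re.match(r".*{}-([0-9]+)".format(prefix), path)
        if match:
            numbered.append((int(match.group(1)), path))
    # Sort by step number, not by string, so checkpoint-1000 comes after checkpoint-500.
    ordered = [path for _, path in sorted(numbered)]
    excess = max(0, len(ordered) - save_total_limit)
    return ordered[:excess]

paths = ["out/checkpoint-1000", "out/checkpoint-500", "out/checkpoint-1500"]
print(checkpoints_to_delete(paths, save_total_limit=2))
```

Sorting on the parsed integer matters: a plain string sort would place `checkpoint-1000` before `checkpoint-500` and delete the wrong directory.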

## Build Model

In [None]:
from transformers import AutoModelWithLMHead, AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")  # We build on Microsoft's pre-trained DialoGPT; "medium" refers to the number of parameters in the model
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")  # AutoModelWithLMHead is deprecated; AutoModelForCausalLM is its replacement for causal language models

In [None]:
"""
Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
using a masked language modeling (MLM) loss.
"""

# Configs
logger = logging.getLogger(__name__)

MODEL_CONFIG_CLASSES = list(MODEL_WITH_LM_HEAD_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)

In [None]:
# Args to allow for easy conversion of python script to notebook
class Args():
    def __init__(self):
        self.output_dir = 'output-medium' # Directory where model checkpoints and evaluation results are written
        self.model_type = 'gpt2' # String identifying the model architecture (DialoGPT is based on GPT-2); used in the name of the cached features file
        self.model_name_or_path = 'microsoft/DialoGPT-medium' # Path to existing transformers model or name of transformer model to be used
        self.config_name = 'microsoft/DialoGPT-medium' # Config of model used
        self.tokenizer_name = 'microsoft/DialoGPT-medium' # Tokenizer used to process data for training the model. It usually has same name as model_name_or_path
        self.cache_dir = 'cached' # Path to cache files. It helps to save time when re-running code.
        self.block_size = 512 # It refers to the windows size that is moved across the text file. Set to -1 to use maximum allowed length.
        self.do_train = True # Whether to run training or not. I set this parameter to True because I want to train the model on my custom dataset.
        self.do_eval = True # Whether to run evaluation on the evaluation files or not. I set it to True since I have test data file and I want to evaluate how well the model trains
        self.evaluate_during_training = False # Whether to run evaluation during training at each logging step or not.
        self.per_gpu_train_batch_size = 4 # this is the no. of training examples that the model will see in the batch before it updates its gradient
        self.per_gpu_eval_batch_size = 4
        self.gradient_accumulation_steps = 1 # Number of updates steps to accumulate the gradients for, before performing a backward/update pass. defaults to 1
        self.learning_rate = 5e-5 # The initial learning rate for Adam. Default is 5e-5
        self.weight_decay = 0.0 # The weight decay to apply (if not zero). Default is 0.
        self.adam_epsilon = 1e-8 # Small constant added to the denominator of the Adam update for numerical stability
        '''
        Epsilon also acts as a "bias" on the adaptive learning rates.
        The step size for each parameter is derived partly from a universal fixed value and partly from the learning rate implied by that parameter's own trajectory/history.
        A larger epsilon means more bias: less variance in learning rate between parameters, so the optimizer behaves more like SGD.
        A smaller epsilon allows larger variance between learning rates, making the algorithm more "adaptive" on a per-parameter basis.
        Reinforcement learning algorithms therefore tend to use larger epsilons: the target keeps changing, so each parameter's optimal learning rate changes as well,
        and fitting a parameter too tightly to its current trajectory leads to suboptimal optimization as the agent learns.
        When the target is stable, a small epsilon can speed up optimization because the update is tailored to each parameter's specific needs.
        '''
        self.max_grad_norm = 1.0 # Maximum gradient norm (for gradient clipping)
        self.num_train_epochs = 10 # Number of training epochs, i.e. how many times the model cycles through the training set; more epochs can help as long as the model does not start to overfit
        self.max_steps = -1 # optional, defaults to -1 – If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs.
        self.warmup_steps = 0 # optional, defaults to 0 – Number of steps used for a linear warmup from 0 to learning_rate
        self.logging_steps = 1000 #  Number of update steps between two logs. How often to show logs
        self.save_steps = 3500 # Number of update steps between two checkpoint saves
        self.save_total_limit = None # If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in output_dir
        self.eval_all_checkpoints = False
        self.no_cuda = False
        self.overwrite_output_dir = True
        self.overwrite_cache = True
        self.should_continue = False
        self.seed = 42 #  Random seed for initialization.
        self.local_rank = -1 # defaults to -1 – During distributed training, the rank of the process.
        self.fp16 = False # Whether to use 16-bit (mixed) precision training (through NVIDIA apex) instead of 32-bit training.
        self.fp16_opt_level = 'O1' # For fp16 training, apex AMP optimization level selected in [‘O0’, ‘O1’, ‘O2’, and ‘O3’].
        '''
        (Automatic Mixed Precision), a tool to enable Tensor Core-accelerated training in only 3 lines of Python.
        Amp allows users to easily experiment with different pure and mixed precision modes. Commonly-used default modes are chosen by selecting an “optimization level” or opt_level; 
        each opt_level establishes a set of properties that govern Amp’s implementation of pure or mixed precision training.
        O1: Mixed Precision (recommended for typical use)
        '''
args = Args()
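
To see how these settings interact, the sketch below recomputes the total number of optimization steps and the learning-rate multiplier of a linear warmup and decay schedule in plain Python. The dataset size of 50 examples is a made-up number, and both helper functions are illustrative re-implementations under that assumption rather than the library code itself.

```python
def total_optimization_steps(num_examples, batch_size, grad_accum_steps, epochs):
    """Count optimizer updates when only full batches are used (drop_last=True)."""
    batches_per_epoch = num_examples // batch_size
    return batches_per_epoch // grad_accum_steps * epochs

def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp from 0 to 1, then linear decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Hypothetical dataset size; batch size, accumulation and epochs follow Args above.
t_total = total_optimization_steps(num_examples=50, batch_size=4, grad_accum_steps=1, epochs=10)
print(t_total)

base_lr = 5e-5
for step in (0, t_total // 2, t_total):
    print(step, base_lr * linear_warmup_decay(step, warmup_steps=0, total_steps=t_total))
```

With `warmup_steps = 0` the learning rate starts at its full value and falls linearly to zero by the last step. Note that with so few steps, `logging_steps = 1000` and `save_steps = 3500` would never be reached during training.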

## Train and Evaluate

The remaining cells take the context data frame we created, train the model on it, and save the result to a folder.

In [None]:
def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:
    """ Train the model """
    if args.local_rank in [-1, 0]:
        tb_writer = SummaryWriter()

    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
    train_dataloader = DataLoader(
        train_dataset, sampler=train_sampler, batch_size=args.train_batch_size, collate_fn=collate, drop_last = True
    )

    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
    else:
        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    model = model.module if hasattr(model, "module") else model  # Take care of distributed/parallel training
    model.resize_token_embeddings(len(tokenizer))
    # add_special_tokens_(model, tokenizer)


    # Prepare optimizer and schedule (linear warmup and decay)
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
    )

    # Check if saved optimizer or scheduler states exist
    if (
        args.model_name_or_path
        and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt"))
        and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt"))
    ):
        # Load in optimizer and scheduler states
        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))

    if args.fp16:
        try:
            from apex import amp
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)

    # multi-gpu training (should be after apex fp16 initialization)
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Distributed training (should be after apex fp16 initialization)
    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
        )

    # Train!
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
    logger.info(
        "  Total train batch size (w. parallel, distributed & accumulation) = %d",
        args.train_batch_size
        * args.gradient_accumulation_steps
        * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
    )
    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info("  Total optimization steps = %d", t_total)

    global_step = 0
    epochs_trained = 0
    steps_trained_in_current_epoch = 0
    # Check if continuing training from a checkpoint
    if args.model_name_or_path and os.path.exists(args.model_name_or_path):
        try:
            # set global_step to global_step of last saved checkpoint from model path
            checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
            global_step = int(checkpoint_suffix)
            epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
            steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)

            logger.info("  Continuing training from checkpoint, will skip to saved global_step")
            logger.info("  Continuing training from epoch %d", epochs_trained)
            logger.info("  Continuing training from global step %d", global_step)
            logger.info("  Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)
        except ValueError:
            logger.info("  Starting fine-tuning.")

    tr_loss, logging_loss = 0.0, 0.0

    model.zero_grad()
    train_iterator = trange(
        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]
    )
    set_seed(args)  # Added here for reproducibility
    for _ in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
        for step, batch in enumerate(epoch_iterator):

            # Skip past any already trained steps if resuming training
            if steps_trained_in_current_epoch > 0:
                steps_trained_in_current_epoch -= 1
                continue

            inputs, labels = (batch, batch)
            if inputs.shape[1] > 1024: continue
            inputs = inputs.to(args.device)
            labels = labels.to(args.device)
            model.train()
            outputs = model(inputs, labels=labels)
            loss = outputs[0]  # model outputs are always tuple in transformers (see doc)

            if args.n_gpu > 1:
                loss = loss.mean()  # mean() to average on multi-gpu parallel training
            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()

            tr_loss += loss.item()
            if (step + 1) % args.gradient_accumulation_steps == 0:
                if args.fp16:
                    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
                else:
                    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
                optimizer.step()
                scheduler.step()  # Update learning rate schedule
                model.zero_grad()
                global_step += 1

                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
                    # Log metrics
                    # NOTE: evaluate() as defined in this notebook also requires df_trn and df_val,
                    # so the call below raises a TypeError when evaluate_during_training is True.
                    if (
                        args.local_rank == -1 and args.evaluate_during_training
                    ):  # Only evaluate when single GPU otherwise metrics may not average well
                        results = evaluate(args, model, tokenizer)
                        for key, value in results.items():
                            tb_writer.add_scalar("eval_{}".format(key), value, global_step)
                    tb_writer.add_scalar("lr", scheduler.get_last_lr()[0], global_step)
                    tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
                    logging_loss = tr_loss

                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
                    checkpoint_prefix = "checkpoint"
                    # Save model checkpoint
                    output_dir = os.path.join(args.output_dir, "{}-{}".format(checkpoint_prefix, global_step))
                    os.makedirs(output_dir, exist_ok=True)
                    model_to_save = (
                        model.module if hasattr(model, "module") else model
                    )  # Take care of distributed/parallel training
                    model_to_save.save_pretrained(output_dir)
                    tokenizer.save_pretrained(output_dir)

                    torch.save(args, os.path.join(output_dir, "training_args.bin"))
                    logger.info("Saving model checkpoint to %s", output_dir)

                    _rotate_checkpoints(args, checkpoint_prefix)

                    torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
                    torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
                    logger.info("Saving optimizer and scheduler states to %s", output_dir)

            if args.max_steps > 0 and global_step > args.max_steps:
                epoch_iterator.close()
                break
        if args.max_steps > 0 and global_step > args.max_steps:
            train_iterator.close()
            break

    if args.local_rank in [-1, 0]:
        tb_writer.close()

    return global_step, tr_loss / global_step

# Evaluate the model on the validation set

def evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, df_trn, df_val, prefix="") -> Dict:
    # Evaluation results are written to the output directory
    eval_output_dir = args.output_dir

    eval_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=True)
    os.makedirs(eval_output_dir, exist_ok=True)
    args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
    # Note that DistributedSampler samples randomly

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(
        eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate, drop_last=True
    )

    # multi-gpu evaluate
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Eval!
    logger.info("***** Running evaluation {} *****".format(prefix))
    logger.info("  Num examples = %d", len(eval_dataset))
    logger.info("  Batch size = %d", args.eval_batch_size)
    eval_loss = 0.0
    nb_eval_steps = 0
    model.eval()

    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        inputs, labels = (batch, batch)
        inputs = inputs.to(args.device)
        labels = labels.to(args.device)

        with torch.no_grad():
            outputs = model(inputs, labels=labels)
            lm_loss = outputs[0]
            eval_loss += lm_loss.mean().item()
        nb_eval_steps += 1

    eval_loss = eval_loss / nb_eval_steps
    perplexity = torch.exp(torch.tensor(eval_loss))

    result = {"perplexity": perplexity}

    output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
    with open(output_eval_file, "w") as writer:
        logger.info("***** Eval results {} *****".format(prefix))
        for key in sorted(result.keys()):
            logger.info("  %s = %s", key, str(result[key]))
            writer.write("%s = %s\n" % (key, str(result[key])))

    return result

In [None]:
# Main runner

def main(df_trn, df_val):
    args = Args()
    
    if args.should_continue:
        sorted_checkpoints = _sorted_checkpoints(args)
        if len(sorted_checkpoints) == 0:
            raise ValueError("Used --should_continue but no checkpoint was found in --output_dir.")
        else:
            args.model_name_or_path = sorted_checkpoints[-1]

    if (
        os.path.exists(args.output_dir)
        and os.listdir(args.output_dir)
        and args.do_train
        and not args.overwrite_output_dir
        and not args.should_continue
    ):
        raise ValueError(
            "Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
                args.output_dir
            )
        )

    # Setup CUDA, GPU & distributed training
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU when no GPU is available
    args.n_gpu = torch.cuda.device_count()
    args.device = device

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
    )
    logger.warning(
        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
        args.local_rank,
        device,
        args.n_gpu,
        bool(args.local_rank != -1),
        args.fp16,
    )

    # Set seed
    set_seed(args)

    config = AutoConfig.from_pretrained(args.config_name, cache_dir=args.cache_dir)
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
    model = AutoModelWithLMHead.from_pretrained(
        args.model_name_or_path,
        from_tf=False,
        config=config,
        cache_dir=args.cache_dir,
    )
    model.to(args.device)
    
    logger.info("Training/evaluation parameters %s", args)

    # Training
    if args.do_train:
        train_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False)

        global_step, tr_loss = train(args, train_dataset, model, tokenizer)
        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

    # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
    if args.do_train:
        # Create output directory if needed
        os.makedirs(args.output_dir, exist_ok=True)

        logger.info("Saving model checkpoint to %s", args.output_dir)
        # Save a trained model, configuration and tokenizer using `save_pretrained()`.
        # They can then be reloaded using `from_pretrained()`
        model_to_save = (
            model.module if hasattr(model, "module") else model
        )  # Take care of distributed/parallel training
        model_to_save.save_pretrained(args.output_dir)
        tokenizer.save_pretrained(args.output_dir)

        # Good practice: save your training arguments together with the trained model
        torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

        # Load a trained model and vocabulary that you have fine-tuned
        model = AutoModelWithLMHead.from_pretrained(args.output_dir)
        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
        model.to(args.device)

    # Evaluation
    results = {}
    if args.do_eval and args.local_rank in [-1, 0]:
        checkpoints = [args.output_dir]
        if args.eval_all_checkpoints:
            checkpoints = list(
                os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
            )
            logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
        logger.info("Evaluate the following checkpoints: %s", checkpoints)
        for checkpoint in checkpoints:
            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
            prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""

            model = AutoModelWithLMHead.from_pretrained(checkpoint)
            model.to(args.device)
            result = evaluate(args, model, tokenizer, df_trn, df_val, prefix=prefix)
            result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
            results.update(result)

    return results

## Run the Main Function

In [None]:
main(trn_df, val_df)

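Perplexity measures how uncertain ("confused") the model is when predicting the next token; lower is better. It is computed as the exponential of the average evaluation loss and is returned here as a tensor. Our dataset is quite small, so a high perplexity is to be expected.

To decrease the perplexity, we might need to train for more epochs.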
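
As a rough illustration of how `evaluate` turns losses into perplexity, here is a minimal sketch in plain Python. The function name `perplexity_from_losses` and the sample loss values are hypothetical and used only for illustration; the notebook itself does the same arithmetic with `torch.exp` on the mean evaluation loss.

```python
import math
from typing import Sequence


def perplexity_from_losses(batch_losses: Sequence[float]) -> float:
    """Return exp(mean loss), mirroring the arithmetic in evaluate().

    batch_losses holds one average cross-entropy loss per evaluation batch.
    """
    if not batch_losses:
        raise ValueError("need at least one batch loss")
    mean_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(mean_loss)


# Hypothetical per-batch losses, for illustration only
example_losses = [2.0, 3.0, 4.0]
print(perplexity_from_losses(example_losses))  # exp(3.0)
```

Because perplexity grows exponentially with the loss, a small drop in average loss from extra epochs can show up as a large drop in perplexity.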

## Load the Trained Model

In [None]:
tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-medium')
model = AutoModelWithLMHead.from_pretrained('output-medium')

In [None]:
# Let's chat for 4 lines
for step in range(4):
    # encode the new user input, append the eos_token, and return a PyTorch tensor
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')
    # print(new_user_input_ids)

    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # generate a response; max_length caps the chat history plus the reply at 200 tokens
    chat_history_ids = model.generate(
        bot_input_ids, max_length=200,
        pad_token_id=tokenizer.eos_token_id,  
        no_repeat_ngram_size=3,       
        do_sample=True, 
        top_k=100, 
        top_p=0.7,
        temperature=0.8
    )
    
    # pretty-print the last output tokens from the bot
    print("BatmanBot: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))
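
The chat loop above appends every turn to `chat_history_ids`, so the prompt keeps growing while `max_length=200` caps the prompt and reply together. One possible way to keep the history within a budget is to drop the oldest turns first. The sketch below is not part of the original notebook: the helper name `truncate_history` and the token budget are assumptions, and it works on plain lists of token ids so that it runs without a model.

```python
from typing import List


def truncate_history(token_ids: List[int], eos_id: int, max_tokens: int) -> List[int]:
    """Drop the oldest turns until the history fits in max_tokens.

    A turn is a run of token ids ending with eos_id. The newest turn is
    always kept, even if it alone exceeds the budget.
    """
    # Split the flat history into turns, each terminated by eos_id
    turns: List[List[int]] = []
    current: List[int] = []
    for tok in token_ids:
        current.append(tok)
        if tok == eos_id:
            turns.append(current)
            current = []
    if current:  # trailing tokens without a closing eos_id
        turns.append(current)

    # Remove turns from the front while the total is over budget
    while len(turns) > 1 and sum(len(t) for t in turns) > max_tokens:
        turns.pop(0)

    return [tok for turn in turns for tok in turn]


# Toy history with eos_id = 0 (hypothetical token ids)
history = [1, 2, 0, 3, 4, 5, 0, 6, 0]
print(truncate_history(history, eos_id=0, max_tokens=6))
```

In the loop above, one might call it as `truncate_history(chat_history_ids[0].tolist(), tokenizer.eos_token_id, 150)` and wrap the result back into a tensor with `torch.tensor([...])` before concatenating the next user input; the budget of 150 is an arbitrary illustrative value.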

## Push Model to Hugging Face

In [None]:
os.chdir('/content/') # changing the directory 

In [None]:
!pip install huggingface_hub

In [None]:
!huggingface-cli login # log in to our Hugging Face account

In [None]:
!huggingface-cli repo create DialoGPT-medium-BatmanBot # creating a repo

In [None]:
!sudo apt-get install git-lfs # lfs: large file storage ; this will allow us to push and pull our models

In [None]:
!cat /root/.huggingface/token # our generated token

In [None]:
!git clone https://huggingface.co/TejasARathod/DialoGPT-medium-BatmanBot

In [None]:
!mv /content/drive/My\ Drive/Batman\ Chat_Bot/output-medium/* DialoGPT-medium-BatmanBot/

In [None]:
os.chdir('DialoGPT-medium-BatmanBot')

In [None]:
!git lfs install

In [None]:
!ls

In [None]:
!pwd # print out the working dir

In [None]:
!git status # show which files still need to be added

In [None]:
!git add .

In [None]:
!git config --global user.email "tejasrathod709@gmail.com"
# Tip: using the same email as your huggingface.co account will link your commits to your profile
!git config --global user.name "TejasARathod"

In [None]:
!git commit -m "Initial commit"

In [None]:
!git push https://TejasARathod:********************@huggingface.co/TejasARathod/DialoGPT-medium-BatmanBot

## All Done!