# DialoGPT fine-tuning Chatbot


## Set up drive

In [1]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [2]:
! pip -q install transformers



In [3]:
import os
os.chdir("/content/drive/My Drive/Colab")

In [4]:
# cd Colab/
!ls

cached	  GoogleNews-vectors-negative300.bin  gpt2-results.txt	 report2.gdoc
data-eng  GPT2-finetune			      nltk_data		 runs
data.zip  gpt2-results-conversation.txt       output-small-save


## First dialogue with DialoGPT

All experiments run in Google Colab, whose resources are enough to train the small DialoGPT model. The cells above connect to Google Drive, install the necessary modules, and move to the folder where all our data will be stored.

Before any fine-tuning, let's try chatting with the pretrained DialoGPT model.

In [5]:
from transformers import AutoModelWithLMHead, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelWithLMHead.from_pretrained("microsoft/DialoGPT-small")


In [6]:
# Let's chat for 5 lines
for step in range(5):
    # encode the new user input, append the eos_token, and return a PyTorch tensor
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')

    # append the new user input tokens to the chat history (there is no history on the first turn)
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # generate a response while limiting the total chat history to 1000 tokens
    chat_history_ids = model.generate(
        bot_input_ids, max_length=1000,
        pad_token_id=tokenizer.eos_token_id
    )

    # pretty-print the newly generated tokens, i.e. everything after the input
    print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

>> User:hello doctor
DialoGPT: I'm not your friend, doc.
>> User:i have cold symptoms
DialoGPT: I have cold symptoms
>> User:what is this
DialoGPT: I have cold symptoms
>> User:oh no
DialoGPT: oh no
>> User:bye
DialoGPT: oh no
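
The slice `chat_history_ids[:, bot_input_ids.shape[-1]:]` in the loop above is what separates the bot's reply from the text it was given: `generate` returns the input tokens followed by the newly generated ones. Here is a minimal sketch of that idea with plain Python lists. The token ids and the `fake_generate` helper are made up for illustration and stand in for the real model.

```python
# Hypothetical token ids; 0 plays the role of the eos token here.
EOS = 0

def fake_generate(input_ids, reply_ids):
    """Stand-in for model.generate: returns the input followed by a reply and eos."""
    return input_ids + reply_ids + [EOS]

history = []  # no history before the first turn
turns = [([11, 12], [21, 22, 23]), ([13], [24])]  # (user tokens, canned bot tokens)

replies = []
for user_ids, canned_reply in turns:
    bot_input = history + user_ids + [EOS]   # history plus the new user turn
    history = fake_generate(bot_input, canned_reply)
    reply = history[len(bot_input):]         # keep only the newly generated tokens
    replies.append(reply)

print(replies)
print(history)
```

Because the whole history is fed back in on every turn, the input grows with each exchange, which is why the loop caps the total length with `max_length`.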


## Model initial configuration

To start, we need a basic configuration and a dataset.
The configuration and training scripts are mostly based on this [script](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) from Hugging Face and a great [tutorial](https://nathancooper.io/i-am-a-nerd/chatbot/deep-learning/gpt2/2020/05/12/chatbot-part-1.html) by Nathan Cooper.

In [7]:
"""
Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
using a masked language modeling (MLM) loss.
"""

import glob
import logging
import os
import pickle
import random
import re
import shutil
from typing import Dict, List, Tuple

import pandas as pd
import numpy as np
import torch

from sklearn.model_selection import train_test_split

from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from tqdm.notebook import tqdm, trange

from pathlib import Path

from transformers import (
    MODEL_WITH_LM_HEAD_MAPPING,
    WEIGHTS_NAME,
    AdamW,
    AutoConfig,
    AutoModelWithLMHead,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
    get_linear_schedule_with_warmup,
)


try:
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    from tensorboardX import SummaryWriter

# Configs
logger = logging.getLogger(__name__)

MODEL_CONFIG_CLASSES = list(MODEL_WITH_LM_HEAD_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)

In [8]:
# Args to allow for easy conversion of python script to notebook
class Args():
    def __init__(self):
        self.output_dir = 'output-small-save'
        self.model_type = 'gpt2'
        self.model_name_or_path = 'microsoft/DialoGPT-small'
        self.config_name = 'microsoft/DialoGPT-small'
        self.tokenizer_name = 'microsoft/DialoGPT-small'
        self.cache_dir = 'cached'
        self.block_size = 512
        self.do_train = True
        self.do_eval = True
        self.evaluate_during_training = False
        self.per_gpu_train_batch_size = 4
        self.per_gpu_eval_batch_size = 4
        self.gradient_accumulation_steps = 1
        self.learning_rate = 5e-5
        self.weight_decay = 0.0
        self.adam_epsilon = 1e-8
        self.max_grad_norm = 1.0
        self.num_train_epochs = 3
        self.max_steps = -1
        self.warmup_steps = 0
        self.logging_steps = 1000
        self.save_steps = 3500
        self.save_total_limit = None
        self.eval_all_checkpoints = False
        self.no_cuda = False
        self.overwrite_output_dir = True
        self.overwrite_cache = True
        self.should_continue = False
        self.seed = 42
        self.local_rank = -1
        self.fp16 = False
        self.fp16_opt_level = 'O1'

args = Args()

## Prepare Covid dataset

In [9]:
import json

f_train = open("data-eng/train_data.json")
train_data = json.load(f_train)
f_train.close()
# print(len(train_data))


f_validate = open("data-eng/validate_data.json")
validate_data = json.load(f_validate)
f_validate.close()
# print(len(validate_data))


In [10]:
train_contexted = []

# Each record is [query, response]; store it as [response, context]
for i in range(len(train_data)):
  row = []
  row.append(train_data[i][1])
  row.append(train_data[i][0])
  train_contexted.append(row)

In [11]:
validate_contexted = []

for i in range(len(validate_data)):
  row = []
  row.append(validate_data[i][1])
  row.append(validate_data[i][0])
  validate_contexted.append(row)  

In [12]:
columns = ['response', 'context'] 
columns = columns + ['context/'+str(i) for i in range(0)]
columns

['response', 'context']

In [13]:
len(train_contexted)
trn_df = pd.DataFrame.from_records(train_contexted, columns=columns)
trn_df.head(5)

Unnamed: 0,response,context
0,"Hello, I understand your concern. I just have ...","Hello doctor, I get a cough for the last few d..."
1,"Hello, I can understand your concern.In my opi...","Hello doctor, I am suffering from coughing, th..."
2,Hello. Anxiety can manifest itself in physical...,"Hello doctor,I am a 23-year-old man. I have an..."
3,"Hello,please answer the following:Any travel h...","Hello doctor,Last night I was getting chills, ..."
4,Hello and welcome to Ask A Doctor service.I ha...,"Hi, I am Chaitanya, 27 years old. I use to swi..."


In [14]:
len(validate_contexted)
val_df = pd.DataFrame.from_records(validate_contexted, columns=columns)
val_df.head(5)

Unnamed: 0,response,context
0,Corona-virus. At 33 you may not need testing. ...,I have a constant cough and my chest has now b...
1,Less likely. Recommended to stay 6 feet apart....,If someone has carona virus and iam passing by...
2,"Test Please stay at home, rest, drink fluids...",I am concerned that I’m showing symptoms of co...
3,Death. At your age the risk of death is the fo...,What are my chances of becoming seriously ill ...
4,Unknown but low Based on current data it is ...,Nervous about coronavirus. I am 26 years old a...


## Augmented dataset

This step is entirely optional and may degrade the results.

I suggest skipping it on the first try and coming back only if you want to explore NLP data augmentation, which can be unreliable.

In [None]:
augmented_link = 'https://raw.githubusercontent.com/chophilip21/covid_dialogue/main/augmented.csv' #augmented version

dataset = pd.read_csv(augmented_link, names = ['input_text', 'target_text'], header=0)

In [None]:
train_df, test_df = train_test_split(dataset, test_size=0.2)
valid_df, test_df = train_test_split(test_df, test_size=0.5)

In [None]:
valid_df
train_df

Unnamed: 0,input_text,target_text
6,"Hello doctor,\n","Hi there, how can we help today?\n"
1505,"Hi, got pneumonia, 1month ago. had hemoptasis,...","Hello,Go for vaccinations given below:- flu va..."
1566,Son is 9 years underpin old. He has strep and ...,Thanks for your question on Healthcare Magic.I...
1864,I got Princip back esos from Israel last Monda...,Brief opinion: Covid test two weeks of self...
1964,If you feel VicRoads not quite well 3 or 4 eve...,Brief opinion: Viral syndrome You're symptom...
...,...,...
80,My mother was admitted to St. Jude s last Wedn...,"Hello,To be honest, there is a chance of cross..."
271,How soon after being infected can someone tran...,Brief opinion: Looking at the data people that...
1294,I ' ve been in liaison with a possible covid -...,Stay home. Family members should stay home and...
240,What can I use against coronavirus if I run ou...,Brief opinion: Water and soap is the best. Try...


## Dataset setup

Now we will convert our dataset into a format suitable for our model. Basically, we concatenate the context and the response of each row into one sequence, adding the special end-of-string token after each of them so the model can tell where each utterance ends.
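
To make the target format concrete, here is a small sketch that mirrors what `construct_conv` in the next cell does, using a toy word-level tokenizer in place of the real one. The vocabulary, `toy_encode`, `build_example`, and the eos id are assumptions for illustration only.

```python
# Toy stand-ins for the real tokenizer (assumptions for illustration).
EOS_ID = 0
vocab = {}

def toy_encode(text):
    """Map each whitespace-separated word to a small integer id."""
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]

def build_example(row):
    """row is [response, context]; emit context + eos + response + eos."""
    encoded = [toy_encode(text) + [EOS_ID] for text in row]
    ordered = list(reversed(encoded))  # context first, response last
    return [token for turn in ordered for token in turn]

row = ["stay home and rest", "i have a cough"]  # [response, context]
example = build_example(row)
print(example)
```

The reversal matters because the data frame stores the response in the first column, while the model should see the context before the response it is learning to produce.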

In [15]:
def construct_conv(row, tokenizer, eos = True):
    flatten = lambda l: [item for sublist in l for item in sublist]
    conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
    conv = flatten(conv)
    return conv

class ConversationDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, args, df, block_size=512):

        block_size = block_size - (tokenizer.model_max_length - tokenizer.max_len_single_sentence)

        directory = args.cache_dir
        cached_features_file = os.path.join(
            directory, args.model_type + "_cached_lm_" + str(block_size)
        )

        if os.path.exists(cached_features_file) and not args.overwrite_cache:
            logger.info("Loading features from cached file %s", cached_features_file)
            with open(cached_features_file, "rb") as handle:
                self.examples = pickle.load(handle)
        else:
            logger.info("Creating features from dataset file at %s", directory)

            self.examples = []
            for _, row in df.iterrows():
                conv = construct_conv(row, tokenizer)
                self.examples.append(conv)

            logger.info("Saving features into cached file %s", cached_features_file)
            with open(cached_features_file, "wb") as handle:
                pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, item):
        return torch.tensor(self.examples[item], dtype=torch.long)

In [16]:
# Caching and storing of data/checkpoints

def load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False):
    return ConversationDataset(tokenizer, args, df_val if evaluate else df_trn)


def set_seed(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)


def _sorted_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> List[str]:
    ordering_and_checkpoint_path = []

    glob_checkpoints = glob.glob(os.path.join(args.output_dir, "{}-*".format(checkpoint_prefix)))

    for path in glob_checkpoints:
        if use_mtime:
            ordering_and_checkpoint_path.append((os.path.getmtime(path), path))
        else:
            regex_match = re.match(".*{}-([0-9]+)".format(checkpoint_prefix), path)
            if regex_match and regex_match.groups():
                ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))

    checkpoints_sorted = sorted(ordering_and_checkpoint_path)
    checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
    return checkpoints_sorted


def _rotate_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> None:
    if not args.save_total_limit:
        return
    if args.save_total_limit <= 0:
        return

    # Check if we should delete older checkpoint(s)
    checkpoints_sorted = _sorted_checkpoints(args, checkpoint_prefix, use_mtime)
    if len(checkpoints_sorted) <= args.save_total_limit:
        return

    number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - args.save_total_limit)
    checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]
    for checkpoint in checkpoints_to_be_deleted:
        logger.info("Deleting older checkpoint [{}] due to args.save_total_limit".format(checkpoint))
        shutil.rmtree(checkpoint)
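
`_sorted_checkpoints` extracts the step number with a regular expression and sorts on it as an integer instead of sorting the paths as strings. The sketch below shows why that matters and which checkpoints a rotation limit would select for deletion. The directory names and the helpers `sort_by_step` and `to_delete` are made up for illustration.

```python
import re

# Hypothetical checkpoint directories, as save_steps=3500 might produce them.
paths = ["out/checkpoint-7000", "out/checkpoint-10500", "out/checkpoint-3500"]

def sort_by_step(paths, prefix="checkpoint"):
    """Order checkpoint paths by the integer step in their name."""
    keyed = []
    for path in paths:
        match = re.match(r".*{}-([0-9]+)".format(prefix), path)
        if match:
            keyed.append((int(match.group(1)), path))
    return [path for _, path in sorted(keyed)]

def to_delete(paths, save_total_limit):
    """Oldest checkpoints beyond the limit, in the spirit of _rotate_checkpoints."""
    ordered = sort_by_step(paths)
    return ordered[:max(0, len(ordered) - save_total_limit)]

print(sorted(paths))        # plain string sort puts step 10500 first
print(sort_by_step(paths))  # numeric sort keeps training order
print(to_delete(paths, save_total_limit=2))
```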

In [17]:
trn_df

Unnamed: 0,response,context
0,"Hello, I understand your concern. I just have ...","Hello doctor, I get a cough for the last few d..."
1,"Hello, I can understand your concern.In my opi...","Hello doctor, I am suffering from coughing, th..."
2,Hello. Anxiety can manifest itself in physical...,"Hello doctor,I am a 23-year-old man. I have an..."
3,"Hello,please answer the following:Any travel h...","Hello doctor,Last night I was getting chills, ..."
4,Hello and welcome to Ask A Doctor service.I ha...,"Hi, I am Chaitanya, 27 years old. I use to swi..."
...,...,...
478,Quarantine You should be in self quarantine ...,Girlfriend has coronavirus. With bad symptoms....
479,Fever with body ache Hello & welcome to Heal...,Scratching sore thoat and slight chest irrita...
480,"Cough,phlegm. At this time your symptoms are c...",Associated with phlegm and mucus?
481,Correct. Every bit helps.,Should one also cover eyes in addition to cove...


## Training and Evaluating

Training our model takes quite a lot of code, but don't worry: everything should work as is. The main thing is to give the model the dataset in the right format.
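
Part of the right format is batching: conversations have different lengths, so the `collate` function inside `train` pads each batch to the length of its longest example. The sketch below mimics that behavior with plain lists instead of `pad_sequence`. The helper name `pad_batch` and the sample ids are assumptions for illustration.

```python
def pad_batch(examples, padding_value=0):
    """Right-pad every example to the length of the longest one in the batch."""
    longest = max(len(example) for example in examples)
    return [example + [padding_value] * (longest - len(example)) for example in examples]

batch = pad_batch([[5, 6, 7], [8], [9, 10]])
print(batch)
```

After padding, every row has the same length, so the batch can be stacked into a single rectangular tensor.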


In [18]:
def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:
    """ Train the model """
    if args.local_rank in [-1, 0]:
        tb_writer = SummaryWriter()

    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
    train_dataloader = DataLoader(
        train_dataset, sampler=train_sampler, batch_size=args.train_batch_size, collate_fn=collate, drop_last = True
    )

    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
    else:
        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    model = model.module if hasattr(model, "module") else model  # Take care of distributed/parallel training
    model.resize_token_embeddings(len(tokenizer))
    # add_special_tokens_(model, tokenizer)


    # Prepare optimizer and schedule (linear warmup and decay)
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
    )

    # Check if saved optimizer or scheduler states exist
    if (
        args.model_name_or_path
        and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt"))
        and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt"))
    ):
        # Load in optimizer and scheduler states
        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))

    if args.fp16:
        try:
            from apex import amp
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)

    # multi-gpu training (should be after apex fp16 initialization)
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Distributed training (should be after apex fp16 initialization)
    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
        )

    # Train!
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
    logger.info(
        "  Total train batch size (w. parallel, distributed & accumulation) = %d",
        args.train_batch_size
        * args.gradient_accumulation_steps
        * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
    )
    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info("  Total optimization steps = %d", t_total)

    global_step = 0
    epochs_trained = 0
    steps_trained_in_current_epoch = 0
    # Check if continuing training from a checkpoint
    if args.model_name_or_path and os.path.exists(args.model_name_or_path):
        try:
            # set global_step to the global_step of the last saved checkpoint from the model path
            checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
            global_step = int(checkpoint_suffix)
            epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
            steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)

            logger.info("  Continuing training from checkpoint, will skip to saved global_step")
            logger.info("  Continuing training from epoch %d", epochs_trained)
            logger.info("  Continuing training from global step %d", global_step)
            logger.info("  Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)
        except ValueError:
            logger.info("  Starting fine-tuning.")

    tr_loss, logging_loss = 0.0, 0.0

    model.zero_grad()
    train_iterator = trange(
        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]
    )
    set_seed(args)  # Added here for reproducibility
    for _ in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
        for step, batch in enumerate(epoch_iterator):

            # Skip past any already trained steps if resuming training
            if steps_trained_in_current_epoch > 0:
                steps_trained_in_current_epoch -= 1
                continue

            inputs, labels = (batch, batch)
            if inputs.shape[1] > 1024: continue
            inputs = inputs.to(args.device)
            labels = labels.to(args.device)
            model.train()
            outputs = model(inputs, labels=labels)
            loss = outputs[0]  # model outputs are always tuple in transformers (see doc)

            if args.n_gpu > 1:
                loss = loss.mean()  # mean() to average on multi-gpu parallel training
            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()

            tr_loss += loss.item()
            if (step + 1) % args.gradient_accumulation_steps == 0:
                if args.fp16:
                    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
                else:
                    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
                optimizer.step()
                scheduler.step()  # Update learning rate schedule
                model.zero_grad()
                global_step += 1

                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
                    # Log metrics
                    if (
                        args.local_rank == -1 and args.evaluate_during_training
                    ):  # Only evaluate when single GPU otherwise metrics may not average well
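                        # NOTE: evaluate() as defined below also expects df_trn and df_val,
                        # so this call fails if evaluate_during_training is set to True
                        results = evaluate(args, model, tokenizer)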
                        for key, value in results.items():
                            tb_writer.add_scalar("eval_{}".format(key), value, global_step)
                    tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
                    tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
                    logging_loss = tr_loss

                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
                    checkpoint_prefix = "checkpoint"
                    # Save model checkpoint
                    output_dir = os.path.join(args.output_dir, "{}-{}".format(checkpoint_prefix, global_step))
                    os.makedirs(output_dir, exist_ok=True)
                    model_to_save = (
                        model.module if hasattr(model, "module") else model
                    )  # Take care of distributed/parallel training
                    model_to_save.save_pretrained(output_dir)
                    tokenizer.save_pretrained(output_dir)

                    torch.save(args, os.path.join(output_dir, "training_args.bin"))
                    logger.info("Saving model checkpoint to %s", output_dir)

                    _rotate_checkpoints(args, checkpoint_prefix)

                    torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
                    torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
                    logger.info("Saving optimizer and scheduler states to %s", output_dir)

            if args.max_steps > 0 and global_step > args.max_steps:
                epoch_iterator.close()
                break
        if args.max_steps > 0 and global_step > args.max_steps:
            train_iterator.close()
            break

    if args.local_rank in [-1, 0]:
        tb_writer.close()

    return global_step, tr_loss / global_step

# Evaluate a model on the validation set

def evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, df_trn, df_val, prefix="") -> Dict:
    # Evaluation results are written under the output directory
    eval_output_dir = args.output_dir

    eval_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=True)
    os.makedirs(eval_output_dir, exist_ok=True)
    args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
    # Note that DistributedSampler samples randomly

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(
        eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate, drop_last = True
    )

    # multi-gpu evaluate
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Eval!
    logger.info("***** Running evaluation {} *****".format(prefix))
    logger.info("  Num examples = %d", len(eval_dataset))
    logger.info("  Batch size = %d", args.eval_batch_size)
    eval_loss = 0.0
    nb_eval_steps = 0
    model.eval()

    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        inputs, labels = (batch, batch)
        inputs = inputs.to(args.device)
        labels = labels.to(args.device)

        with torch.no_grad():
            outputs = model(inputs, labels=labels)
            lm_loss = outputs[0]
            eval_loss += lm_loss.mean().item()
        nb_eval_steps += 1

    eval_loss = eval_loss / nb_eval_steps
    perplexity = torch.exp(torch.tensor(eval_loss))

    result = {"perplexity": perplexity}

    output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
    with open(output_eval_file, "w") as writer:
        logger.info("***** Eval results {} *****".format(prefix))
        for key in sorted(result.keys()):
            logger.info("  %s = %s", key, str(result[key]))
            writer.write("%s = %s\n" % (key, str(result[key])))

    return result
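
The optimizer in `train` is paired with `get_linear_schedule_with_warmup`. As a rough sketch of the shape of that schedule (not the library's code), the learning-rate multiplier rises linearly during warmup and then decays linearly to zero. `lr_multiplier` is a hypothetical helper written for this illustration.

```python
def lr_multiplier(step, warmup_steps, total_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

base_lr = 5e-5  # learning_rate from Args
for step in (0, 180, 360):
    # warmup_steps=0 as in Args; 360 total steps is the figure from the training log
    print(step, base_lr * lr_multiplier(step, warmup_steps=0, total_steps=360))
```

With `warmup_steps = 0`, as configured in `Args`, training starts at the full learning rate and decays from the first step.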

In [19]:
# Main runner

def main(df_trn, df_val):
    args = Args()
    
    if args.should_continue:
        sorted_checkpoints = _sorted_checkpoints(args)
        if len(sorted_checkpoints) == 0:
            raise ValueError("Used --should_continue but no checkpoint was found in --output_dir.")
        else:
            args.model_name_or_path = sorted_checkpoints[-1]

    if (
        os.path.exists(args.output_dir)
        and os.listdir(args.output_dir)
        and args.do_train
        and not args.overwrite_output_dir
        and not args.should_continue
    ):
        raise ValueError(
            "Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
                args.output_dir
            )
        )

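    # Setup CUDA, GPU & distributed training
    # Fall back to CPU when CUDA is unavailable or disabled via args.no_cuda
    use_cuda = torch.cuda.is_available() and not args.no_cuda
    device = torch.device("cuda" if use_cuda else "cpu")
    args.n_gpu = torch.cuda.device_count() if use_cuda else 0
    args.device = device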

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
    )
    logger.warning(
        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
        args.local_rank,
        device,
        args.n_gpu,
        bool(args.local_rank != -1),
        args.fp16,
    )

    # Set seed
    set_seed(args)

    config = AutoConfig.from_pretrained(args.config_name, cache_dir=args.cache_dir)
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
    model = AutoModelWithLMHead.from_pretrained(
        args.model_name_or_path,
        from_tf=False,
        config=config,
        cache_dir=args.cache_dir,
    )
    model.to(args.device)
    
    logger.info("Training/evaluation parameters %s", args)

    # Training
    if args.do_train:
        train_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False)

        global_step, tr_loss = train(args, train_dataset, model, tokenizer)
        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

    # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
    if args.do_train:
        # Create output directory if needed
        os.makedirs(args.output_dir, exist_ok=True)

        logger.info("Saving model checkpoint to %s", args.output_dir)
        # Save a trained model, configuration and tokenizer using `save_pretrained()`.
        # They can then be reloaded using `from_pretrained()`
        model_to_save = (
            model.module if hasattr(model, "module") else model
        )  # Take care of distributed/parallel training
        model_to_save.save_pretrained(args.output_dir)
        tokenizer.save_pretrained(args.output_dir)

        # Good practice: save your training arguments together with the trained model
        torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

        # Load a trained model and vocabulary that you have fine-tuned
        model = AutoModelWithLMHead.from_pretrained(args.output_dir)
        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
        model.to(args.device)

    # Evaluation
    results = {}
    if args.do_eval and args.local_rank in [-1, 0]:
        checkpoints = [args.output_dir]
        if args.eval_all_checkpoints:
            checkpoints = list(
                os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
            )
            logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
        logger.info("Evaluate the following checkpoints: %s", checkpoints)
        for checkpoint in checkpoints:
            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
            prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""

            model = AutoModelWithLMHead.from_pretrained(checkpoint)
            model.to(args.device)
            result = evaluate(args, model, tokenizer, df_trn, df_val, prefix=prefix)
            result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
            results.update(result)

    return results

## Start Fine-Tuning!

It is time to train our model!
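
Before launching, it can help to estimate how many optimizer steps the run will take. Here is a minimal sketch, assuming the 483 training examples reported in the training log and the batch settings from `Args`. Because the data loader uses `drop_last=True`, the final incomplete batch of each epoch is discarded. The helper name is hypothetical.

```python
def total_optimization_steps(num_examples, batch_size, grad_accum_steps, num_epochs):
    """Mirror the t_total computation in train() for a loader with drop_last=True."""
    batches_per_epoch = num_examples // batch_size  # incomplete last batch is dropped
    return batches_per_epoch // grad_accum_steps * num_epochs

steps = total_optimization_steps(
    num_examples=483, batch_size=4, grad_accum_steps=1, num_epochs=3
)
print(steps)  # 360
```

Since `save_steps` is 3500 in `Args`, a run of this length never reaches an intermediate checkpoint; only the final model is saved.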


In [20]:
main(trn_df, val_df)

01/12/2021 20:55:35 - INFO - __main__ -   Training/evaluation parameters <__main__.Args object at 0x7faede4e5c50>
01/12/2021 20:55:35 - INFO - __main__ -   Creating features from dataset file at cached
01/12/2021 20:55:35 - INFO - __main__ -   Saving features into cached file cached/gpt2_cached_lm_512
01/12/2021 20:55:36 - INFO - __main__ -   ***** Running training *****
01/12/2021 20:55:36 - INFO - __main__ -     Num examples = 483
01/12/2021 20:55:36 - INFO - __main__ -     Num Epochs = 3
01/12/2021 20:55:36 - INFO - __main__ -     Instantaneous batch size per GPU = 4
01/12/2021 20:55:36 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 4
01/12/2021 20:55:36 - INFO - __main__ -     Gradient Accumulation steps = 1
01/12/2021 20:55:36 - INFO - __main__ -     Total optimization steps = 360


01/12/2021 20:57:24 - INFO - __main__ -    global_step = 360, average loss = 2.844376615352101
01/12/2021 20:57:24 - INFO - __main__ -   Saving model checkpoint to output-small-save






01/12/2021 20:57:42 - INFO - __main__ -   Evaluate the following checkpoints: ['output-small-save']
01/12/2021 20:57:48 - INFO - __main__ -   Creating features from dataset file at cached
01/12/2021 20:57:48 - INFO - __main__ -   Saving features into cached file cached/gpt2_cached_lm_512
01/12/2021 20:57:48 - INFO - __main__ -   ***** Running evaluation  *****
01/12/2021 20:57:48 - INFO - __main__ -     Num examples = 60
01/12/2021 20:57:48 - INFO - __main__ -     Batch size = 4






01/12/2021 20:57:50 - INFO - __main__ -   ***** Eval results  *****
01/12/2021 20:57:50 - INFO - __main__ -     perplexity = tensor(14.2759)


{'perplexity_': tensor(14.2759)}
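
Assuming the evaluation follows the usual definition, perplexity is the exponential of the average per-token cross-entropy loss, so a perplexity of about 14.28 corresponds to an evaluation loss of roughly 2.66 nats. A minimal sketch of the conversion (both helper names are illustrative):

```python
import math


def perplexity_from_loss(mean_cross_entropy):
    # Perplexity is exp(average negative log-likelihood per token).
    return math.exp(mean_cross_entropy)


def loss_from_perplexity(perplexity):
    # Inverse of the conversion above.
    return math.log(perplexity)


eval_loss = loss_from_perplexity(14.2759)
print(eval_loss, perplexity_from_loss(eval_loss))
```

A lower perplexity means the model assigns higher probability to the held-out responses.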

## Generate test results


In [None]:
import json

# Each record in the test set is a [query, response] pair.
with open("data-eng/test_data.json") as f_test:
    test_data = json.load(f_test)

test_query = [pair[0] for pair in test_data]
test_response = [pair[1] for pair in test_data]

print(len(test_response))
print(len(test_query))


61
61


In [None]:
test_chatbot = []

# Load the tokenizer and the fine-tuned model once, outside the loop
tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-small')
model = AutoModelWithLMHead.from_pretrained('output-small-save')

for i in range(len(test_query)):
  # encode the query and append the eos_token
  bot_input_ids = tokenizer.encode(test_query[i] + tokenizer.eos_token, return_tensors='pt')
  print("Patient: {} \n".format(test_query[i]))
  print("Reference:  {} \n".format(test_response[i]))

  # generate a response, limiting prompt plus response to 100 tokens
  chat_history_ids = model.generate(
      bot_input_ids, max_length=100,
      pad_token_id=tokenizer.eos_token_id,
      no_repeat_ngram_size=3,
      do_sample=True,
      top_k=10,
      top_p=0.7,
      temperature=0.8
  )

  # decode only the newly generated tokens (everything after the prompt)
  prediction = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)
  print("Predict: {} \n\n".format(prediction))
  test_chatbot.append(prediction)

print(len(test_chatbot))


12/09/2020 05:24:55 - INFO - filelock -   Lock 140523183608328 acquired on /root/.cache/huggingface/transformers/0cbdd50f204f3ddbaa452e976340a5725f0b5ddb201704058c87e14d9679e070.e6898db50ba3aa698f0f652e876a1e4bd813321dea3e22b776f9a3c39d36aaab.lock



12/09/2020 05:24:55 - INFO - filelock -   Lock 140523183608328 released on /root/.cache/huggingface/transformers/0cbdd50f204f3ddbaa452e976340a5725f0b5ddb201704058c87e14d9679e070.e6898db50ba3aa698f0f652e876a1e4bd813321dea3e22b776f9a3c39d36aaab.lock





12/09/2020 05:24:55 - INFO - filelock -   Lock 140523169989240 acquired on /root/.cache/huggingface/transformers/3cf340c89a43b5e6f31c4cd609fc2fc92f3d7aafdf6c8987e2ea9e02cb78b4e2.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f.lock



12/09/2020 05:24:56 - INFO - filelock -   Lock 140523169989240 released on /root/.cache/huggingface/transformers/3cf340c89a43b5e6f31c4cd609fc2fc92f3d7aafdf6c8987e2ea9e02cb78b4e2.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f.lock





12/09/2020 05:24:56 - INFO - filelock -   Lock 140523173146920 acquired on /root/.cache/huggingface/transformers/4e3f74e7c741909c4d1b48a23febe75c1be66a20c2b98cf7db4b8b10f12dc10c.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b.lock



12/09/2020 05:24:57 - INFO - filelock -   Lock 140523173146920 released on /root/.cache/huggingface/transformers/4e3f74e7c741909c4d1b48a23febe75c1be66a20c2b98cf7db4b8b10f12dc10c.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b.lock







Patient: I have all the symptoms except fever, I went to Medicross and Dr said I can get tested if I want to I'm not sure if I should. She gave me antibiotics Klacid XL 500mg, she said I can take it if I feel worse I'm worried it will make immune system bad? 

Reference:  I have all the symptoms except fever, I went to Medicross and Dr said I can get tested if I want to I'm not sure if I should. She gave me antibiotics Klacid XL 500mg, she said I can take it if I feel worse I'm worried it will make immune system bad? 

Predict: It depends on your age and what type of infection you have. If you have a history of pneumonia, you might need to take a test to find out if it is contagious. If 


Patient: I have pain/discomfort in my lungs. I don't experience simultaneous on both lungs and it not always at the hame position. I don't have a head nor do I have high temperature. I sneeze and cough maybe once a day. Do I have corona, should I get tested? 

Reference:  I have pain/discomfort in my

In [None]:
with open('gpt2-results.txt', 'w') as f:
    for i in test_chatbot:
        f.write('Chatbot: %s\n\n' % i)

In [None]:
with open('gpt2-results-conversation.txt', 'w') as f:
    for i in range(len(test_chatbot)):
        f.write("--------------------example %s--------------------\n" %(i+1))
        f.write('Patient: %s\n' % test_query[i])
        f.write('Reference: %s\n' % test_response[i])
        f.write('Predict: %s\n' % test_chatbot[i])
        f.write('\n')

## Metrics

In [None]:
pip install "nltk==3.4.5"

Collecting nltk==3.4.5
  Downloading https://files.pythonhosted.org/packages/f6/1d/d925cfb4f324ede997f6d47bea4d9babba51b49e87a767c170b77005889d/nltk-3.4.5.zip (1.5MB)

In [None]:
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from nltk.translate.nist_score import sentence_nist


In [None]:
def get_metrics(pred, target):
    """Average BLEU-2, BLEU-4, METEOR, NIST-2 and NIST-4 over all turns.

    pred and target are parallel lists of utterances. nltk's sentence_bleu and
    sentence_nist expect each utterance as a list of tokens; a raw string is
    iterated character by character, so its n-grams are character n-grams.
    """
    smooth = SmoothingFunction().method1
    turns = len(target)
    bleu_2 = 0
    bleu_4 = 0
    meteor = 0
    nist_2 = 0
    nist_4 = 0
    for index in range(turns):
        pred_utt = pred[index]
        target_utt = target[index]
        # the usable n-gram order is bounded by the shorter utterance
        min_len = min(len(pred_utt), len(target_utt))
        lens = min(min_len, 4)
        if lens == 0:
            # an empty utterance adds 0 to every sum but still counts as a turn
            continue
        if lens >= 4:
            bleu_4_utt = sentence_bleu([target_utt], pred_utt, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
            nist_4_utt = sentence_nist([target_utt], pred_utt, 4)
        else:
            bleu_4_utt = 0
            nist_4_utt = 0
        if lens >= 2:
            bleu_2_utt = sentence_bleu([target_utt], pred_utt, weights=(0.5, 0.5), smoothing_function=smooth)
            nist_2_utt = sentence_nist([target_utt], pred_utt, 2)
        else:
            bleu_2_utt = 0
            nist_2_utt = 0

        bleu_2 += bleu_2_utt
        bleu_4 += bleu_4_utt
        # meteor_score in nltk 3.4.5 takes plain strings, so join the elements with spaces
        meteor += meteor_score([" ".join(target_utt)], " ".join(pred_utt))
        nist_2 += nist_2_utt
        nist_4 += nist_4_utt

    bleu_2 /= turns
    bleu_4 /= turns
    meteor /= turns
    nist_2 /= turns
    nist_4 /= turns
    return bleu_2, bleu_4, meteor, nist_2, nist_4


In [None]:
mkdir nltk_data

In [None]:
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
# bleu_2, bleu_4, meteor, nist_2, nist_4 = get_metrics(pred_token, target_token)
bleu_2, bleu_4, meteor, nist_2, nist_4 = get_metrics(test_chatbot, test_response)


In [None]:
bleu_2, bleu_4, meteor, nist_2, nist_4

(0.10910625032198501,
 0.046969635749153485,
 0.12674155698173506,
 0.49653946717741193,
 0.512562750423553)
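
BLEU is built on clipped n-gram precision, and the result depends on what counts as a token. The sketch below is a simplified stand-in, not nltk's implementation (no brevity penalty, smoothing, or geometric mean), and the two sentences are made-up examples. It shows that the same pair scores differently when compared as word lists and as raw strings, since a string is iterated character by character.

```python
from collections import Counter


def ngrams(seq, n):
    # Works for both lists of words and plain strings (characters).
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def clipped_precision(reference, hypothesis, n):
    # Fraction of hypothesis n-grams that also occur in the reference,
    # counting each n-gram at most as often as the reference contains it.
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return overlap / total


# Hypothetical example sentences, not taken from the test set.
reference = "you should get tested"
hypothesis = "you should be fine"

word_level = clipped_precision(reference.split(), hypothesis.split(), 2)
char_level = clipped_precision(reference, hypothesis, 2)
print(word_level, char_level)
```

Splitting both strings into words (for example with `str.split`) before scoring gives word-level n-grams.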

## Interactive Chat

Several decoding methods can be used to generate responses, such as top-k and nucleus (top-p) sampling. You can find more details about them in this [Hugging Face blog post](https://huggingface.co/blog/how-to-generate).
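
To make the sampling parameters used in the next cell concrete, here is a simplified pure-Python sketch of how `temperature`, `top_k` and `top_p` reshape a next-token distribution. It is an illustration under stated assumptions (toy logits, one common ordering of the steps), not the code path that `model.generate` actually runs, and `filtered_distribution` is a hypothetical helper.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filtered_distribution(logits, temperature=0.8, top_k=10, top_p=0.7):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    probs = softmax([scaled[i] for i in ranked])
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, prob in zip(ranked, probs):
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    total = sum(prob for _, prob in kept)
    return {index: prob / total for index, prob in kept}


# Toy logits for a four-token vocabulary (made-up values).
toy = filtered_distribution([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.7)
print(toy)
```

The next token is then sampled from the surviving entries, which is why lower `top_k` and `top_p` values make replies more conservative.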

In [None]:
tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-small')
model = AutoModelWithLMHead.from_pretrained('output-small-save')

# Chat for a single turn (increase the range for a longer conversation)
for step in range(1):
    # encode the new user input, add the eos_token and return a tensor in Pytorch
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')
    # print(new_user_input_ids)

    # append the new user input tokens to the chat history
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # generate a response while limiting the total chat history to 1000 tokens
    chat_history_ids = model.generate(
        bot_input_ids, max_length=1000,
        pad_token_id=tokenizer.eos_token_id,  
        no_repeat_ngram_size=3,       
        do_sample=True, 
        top_k=10, 
        top_p=0.7,
        temperature = 0.8
    )
    
    # pretty print the last output tokens from the bot
    print("Chatbot: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))



>> User:Can coronavirus symptoms be mild forsome people versus severe ? for example, couldit just involve being very fatigued, low gradefever for a few days and not the extreme symp-toms?   or  is  it  always  a  full  blown  cold  andstruggle to breathe?
Chatbot: It depends on your    severity,   and   type of cold.   If you are having a mild fever and or dry cough, then   should be treated.  If   is not having a cold, then you should be fine.  Would you like to video chat with me?
