# 🧠 Building a SheldonBot: A Big Bang Theory Chatbot with NLP & DialoGPT

I'm a huge fan of *The Big Bang Theory* and Natural Language Processing (NLP). After watching the final episode of Sheldon and Amy’s Nobel Prize journey, I had an epiphany—it wasn’t just coincidence. I had to combine my love for the show and my skills in NLP and Machine Learning to bring my own **SheldonBot** to life (for now, in chatbot form).

![Sheldon thinking deeply](https://media0.giphy.com/media/v1.Y2lkPTc5MGI3NjExZWZuZzl4Ym5zNHMxeDgyam9qOGVmZHBqM2M3ZGtpYzJzMXFycjBocCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/f79OYWh5uwIfK/giphy.gif)

---

The road to creating *virtual Sheldon* wasn’t easy. I experimented with several models and approaches, but at first SheldonBot was as socially reluctant as the real Sheldon. His first reply to “Hi” was:

> **"You're in my spot."**

(For context, that was the **only phrase** he knew at the beginning!)

But I didn't give up. Thanks to the powerful 🤗 Transformers library and fine-tuning Microsoft's DialoGPT model using the **Big Bang Theory Dialogues** dataset from Kaggle, I finally had my breakthrough. SheldonBot and I started having quirky, delightful conversations.

---

## 💬 Sample Conversation

```plaintext
User: Hey, Sheldon, want to build a quantum computer together?  
SheldonBot: Only if I get to name it “Quanteen.”  

User: That's actually pretty clever  
SheldonBot: Of course it is. I have an IQ of 187.  

User: Can we use it to travel through time?  
SheldonBot: Time travel is theoretically possible but not advisable. Just ask Howard after his zero-gravity toilet incident.  

User: I want to visit the Renaissance  
SheldonBot: Fascinating choice. I’d suggest packing Purell and avoiding Galileo’s house—he’s touchy about the heliocentric model.
```


## A bit of theory

### 🤖 How to Create a Virtual Sheldon Using NLP and Transformers

In this article, I’ll show you how to create a virtual character whose responses are based on the transcript of my favorite science-themed sitcom — *The Big Bang Theory*. You’re welcome to use characters or shows of your own choice, but for this tutorial, I’ll walk you through how I turned Sheldon Cooper into a chatbot.

I also added the code to convert a simple text file of dialogues into a format that the model understands — no need for complex data labeling.

---

To accomplish this, we’ll use the amazing [Transformers](https://huggingface.co/transformers/) library by Hugging Face. This library hosts the latest state-of-the-art NLP models, including:

- [BERT](https://huggingface.co/transformers/model_doc/bert.html)  
- [XLNet](https://huggingface.co/transformers/model_doc/xlnet.html)  
- [GPT-2](https://huggingface.co/transformers/model_doc/gpt2.html)  

If you're curious about how transformers work, check out Jay Alammar’s beautifully illustrated [article on transformers](http://jalammar.github.io/illustrated-transformer/).

![Transformer model](http://jalammar.github.io/images/t/transformer_resideual_layer_norm_3.png)  
Image from [jalammar.github.io](http://jalammar.github.io/illustrated-transformer/)

---

## 🎯 Why DialoGPT?

Microsoft’s [DialoGPT](https://huggingface.co/transformers/model_doc/dialogpt.html) is available on the Hugging Face model hub. DialoGPT is a variant of GPT-2, trained on 147 million conversation-like exchanges extracted from Reddit comment threads. You can learn more about GPT-2 in [this article](http://jalammar.github.io/illustrated-gpt2/).

DialoGPT is particularly well-suited for building chatbots based on TV scripts. It understands dialogue structure and can maintain context across multiple turns — even in its small version. This makes it ideal for bringing a character like Sheldon to life in a *Big Bang Theory*–style conversation.

---

> *Fun fact:* When I first trained SheldonBot, the only thing it would reply with was “You're in my spot.” But with a bit of fine-tuning, things got *Bazinga*-level good.

---

Next, we’ll go step-by-step through:
- Loading your own dialogue file (we'll use a Big Bang Theory script from Kaggle)
- Formatting it properly for training
- Fine-tuning DialoGPT
- Talking with your virtual Sheldon

Ready to build your own sitcom AI? Let’s do this! 🧠💬


## First dialogue with DialoGPT

We will run all our experiments in Google Colab or Kaggle; their resources are enough to train the small DialoGPT model.

In [1]:
! pip -q install transformers

Try to chat with DialoGPT without fine-tuning.

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the pretrained small DialoGPT checkpoint and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

prompts = ["Hello, how are you?", "What can you do?"]
chat_history_ids = None

for step in range(2):
    new_user_input_ids = tokenizer.encode(prompts[step] + tokenizer.eos_token, return_tensors='pt')
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    chat_history_ids = model.generate(
        bot_input_ids, max_length=1000,
        pad_token_id=tokenizer.eos_token_id
    )

    response = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)
    print(f"DialoGPT: {response}")


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


DialoGPT: Hi, I'm a guy.
DialoGPT: I can do this.


![alt text](https://media3.giphy.com/media/v1.Y2lkPTc5MGI3NjExNGdrbWkzYmFld3V5MXkzZ3YzcjE1NzVkcjN6Ymkyemc1b2hqbjdzaSZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/xT0xehbY7qJnF4xv8Y/giphy.gif)

Image from [Giphy](https://giphy.com/)

Those aren't great answers! We will fix that with fine-tuning.
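
To make the loop above less opaque: DialoGPT sees the whole conversation as one flat token sequence in which every turn ends with the end-of-text token, and the reply is whatever the model appends after the prompt. The sketch below mimics that bookkeeping with plain Python lists. The token ids, the `EOS` value and `fake_generate` are made up for illustration; they only stand in for the tokenizer and `model.generate`.

```python
EOS = 0  # hypothetical end-of-text id; the real value comes from tokenizer.eos_token_id


def fake_generate(prompt_ids, reply_ids):
    """Stand-in for model.generate: returns the prompt followed by a reply and EOS."""
    return prompt_ids + reply_ids + [EOS]


def chat_turn(history, user_ids, reply_ids):
    """Append one user turn, 'generate', and split the reply off the prompt."""
    prompt = history + user_ids + [EOS]       # plays the role of bot_input_ids
    full = fake_generate(prompt, reply_ids)   # plays the role of chat_history_ids
    reply = full[len(prompt):]                # everything after the prompt
    return full, reply


history = []
history, reply1 = chat_turn(history, [11, 12], [21, 22, 23])
history, reply2 = chat_turn(history, [13], [24])

print(reply1)   # reply tokens of the first turn, including the trailing EOS
print(history)  # both user turns and both replies in one flat sequence
```

In the real loop, `skip_special_tokens=True` drops the trailing end-of-text token when the reply is decoded back to text.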

## Model initial configuration

Let's train our own SheldonBot. To start, we need a basic configuration and a dataset.
The configuration and training scripts are mostly based on this [script](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) from Hugging Face and a great [tutorial](https://nathancooper.io/i-am-a-nerd/chatbot/deep-learning/gpt2/2020/05/12/chatbot-part-1.html) from Nathan Cooper.

In [4]:
"""
Fine-tuning Hugging Face models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
GPT/GPT-2 use causal language modeling (CLM); BERT/RoBERTa use masked language modeling (MLM).
"""

import os
import re
import glob
import pickle
import random
import shutil
import logging
from pathlib import Path
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd
import torch
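from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler  # used by train() when local_rank != -1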
from torch.nn.utils.rnn import pad_sequence
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm, trange
from torch.optim import AdamW
from transformers import (
    WEIGHTS_NAME,
    AutoConfig,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
    AutoModelForCausalLM,      # For GPT, GPT-2
    AutoModelForMaskedLM       # For BERT, RoBERTa
)


# TensorBoard support
try:
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    from tensorboardX import SummaryWriter

# Setup logging
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Define supported model types
MODEL_TYPES = (
    "gpt2",         # causal LM
    "bert",         # masked LM
    "roberta",      # masked LM
    "distilbert",   # masked LM
    "openai-gpt",   # causal LM
    "xlnet",        # permutation LM
    "ctrl",         # conditional LM
    "transfo-xl",   # transformer-XL LM
    "xlm",          # multilingual LM
)


In [5]:
# Args to allow for easy conversion of Python script to notebook
class Args():
    def __init__(self):
        self.output_dir = 'output-small'
        self.model_type = 'gpt2'
        self.model_name_or_path = 'microsoft/DialoGPT-small'
        self.config_name = 'microsoft/DialoGPT-small'
        self.tokenizer_name = 'microsoft/DialoGPT-small'
        self.cache_dir = 'cached'
        self.block_size = 512
        self.do_train = True
        self.do_eval = True
        self.evaluate_during_training = False
        self.per_gpu_train_batch_size = 4
        self.per_gpu_eval_batch_size = 4
        self.gradient_accumulation_steps = 1
        self.learning_rate = 5e-5
        self.weight_decay = 0.0
        self.adam_epsilon = 1e-8
        self.max_grad_norm = 1.0
        self.num_train_epochs = 3
        self.max_steps = -1
        self.warmup_steps = 0
        self.logging_steps = 1000
        self.save_steps = 3500
        self.save_total_limit = None
        self.eval_all_checkpoints = False
        self.no_cuda = False
        self.overwrite_output_dir = True
        self.overwrite_cache = True
        self.should_continue = False
        self.seed = 42
        self.local_rank = -1
        self.fp16 = False
        self.fp16_opt_level = 'O1'

args = Args()
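
A few of these settings interact, so here is a rough sketch of how they turn into a number of optimizer steps and a learning-rate curve. The function names are mine, and the figure of 48,959 training rows is an assumption (about 90% of the rows we build in the next section). The schedule follows the linear warmup-then-decay rule that `get_linear_schedule_with_warmup` applies.

```python
def total_steps(n_examples, batch_size, grad_accum, epochs):
    """Optimizer steps for a run; the incomplete final batch is dropped (drop_last=True)."""
    batches_per_epoch = n_examples // batch_size
    return batches_per_epoch // grad_accum * epochs


def linear_lr(step, base_lr, warmup_steps, t_total):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (t_total - step) / max(1, t_total - warmup_steps))


# Values taken from Args: batch size 4, no accumulation, 3 epochs, lr 5e-5, no warmup.
# n_examples is an assumed training-set size, not a value from Args.
t_total = total_steps(n_examples=48959, batch_size=4, grad_accum=1, epochs=3)
print(t_total)
print(linear_lr(0, 5e-5, 0, t_total))             # full learning rate at the start
print(linear_lr(t_total // 2, 5e-5, 0, t_total))  # roughly half of it midway
print(linear_lr(t_total, 5e-5, 0, t_total))       # zero at the end
```

The effective batch size is `per_gpu_train_batch_size * max(1, n_gpu) * gradient_accumulation_steps`, so on a single GPU with these defaults each optimizer step sees 4 conversations.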

## Prepare Dataset

### 🧠 The Big Bang Theory Bot - Dataset Setup

### Get the dataset from Kaggle
You can use the code from the dataset source on Kaggle: [big-bang-theory-tv-show](https://www.kaggle.com/code/lydia70/big-bang-theory-tv-show)

In [6]:
import kagglehub
the_big_bang_theory_series_transcript_path = kagglehub.dataset_download('mitramir5/the-big-bang-theory-series-transcript')
print('Data source import complete.')

Data source import complete.


In [7]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/the-big-bang-theory-series-transcript/sentences_sentiment_dicts.pkl
/kaggle/input/the-big-bang-theory-series-transcript/1_10_seasons_tbbt.csv


In [8]:
#import dataset
path = '/kaggle/input/the-big-bang-theory-series-transcript/1_10_seasons_tbbt.csv'
df = pd.read_csv(path)
df.head()

| Unnamed: 0 | episode_name | dialogue | person_scene |
|---|---|---|---|
| 0 | Series 01 Episode 01 – Pilot Episode | A corridor at a sperm bank. | Scene |
| 1 | Series 01 Episode 01 – Pilot Episode | So if a photon is directed through a plane wi... | Sheldon |
| 2 | Series 01 Episode 01 – Pilot Episode | Agreed, what’s your point? | Leonard |
| 3 | Series 01 Episode 01 – Pilot Episode | There’s no point, I just think it’s a good id... | Sheldon |
| 4 | Series 01 Episode 01 – Pilot Episode | Excuse me? | Leonard |


We will convert this dataset so that every response row contains the **n** previous responses as context. For our purposes, seven previous responses will be enough.

In [9]:
contexted = []
n = 7
for i in range(n, len(df['dialogue'])):
  row = []
  prev = i - 1 - n # we additionally subtract 1, so the row will contain the current response and 7 previous responses
  for j in range(i, prev, -1):
    row.append(df['dialogue'][j])
  contexted.append(row)

In [10]:
len(contexted)

54399
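
To see what the windowing produces, here is the same loop on a toy list instead of the DataFrame. The `lines` list and the `build_rows` helper are illustrative names, not part of the notebook. Each row holds the current line first, followed by its predecessors from newest to oldest, and a transcript with `len(lines)` entries yields `len(lines) - n` rows.

```python
def build_rows(lines, n):
    """Return [response, newest context, ..., oldest context] for every line with n predecessors."""
    rows = []
    for i in range(n, len(lines)):
        # Walk backwards from the current line to the n-th line before it.
        rows.append([lines[j] for j in range(i, i - n - 1, -1)])
    return rows


lines = ["A", "B", "C", "D", "E"]
rows = build_rows(lines, n=2)
for row in rows:
    print(row)
```

Note that the window is built from the `dialogue` column alone, so scene descriptions and lines from every character, not only Sheldon, end up as responses and context.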

In [11]:
columns = ['response', 'context']
columns = columns + ['context/'+str(i) for i in range(n-1)]
columns

['response',
 'context',
 'context/0',
 'context/1',
 'context/2',
 'context/3',
 'context/4',
 'context/5']

In [12]:
df = pd.DataFrame.from_records(contexted, columns=columns)
df.head(5)

|  | response | context | context/0 | context/1 | context/2 | context/3 | context/4 | context/5 |
|---|---|---|---|---|---|---|---|---|
| 0 | Can I help you? | One across is Aegean, eight down is Nabakov, ... | Hang on. | Excuse me? | There’s no point, I just think it’s a good id... | Agreed, what’s your point? | So if a photon is directed through a plane wi... | A corridor at a sperm bank. |
| 1 | Yes. Um, is this the High IQ sperm bank? | Can I help you? | One across is Aegean, eight down is Nabakov, ... | Hang on. | Excuse me? | There’s no point, I just think it’s a good id... | Agreed, what’s your point? | So if a photon is directed through a plane wi... |
| 2 | If you have to ask, maybe you shouldn’t be here. | Yes. Um, is this the High IQ sperm bank? | Can I help you? | One across is Aegean, eight down is Nabakov, ... | Hang on. | Excuse me? | There’s no point, I just think it’s a good id... | Agreed, what’s your point? |
| 3 | I think this is the place. | If you have to ask, maybe you shouldn’t be here. | Yes. Um, is this the High IQ sperm bank? | Can I help you? | One across is Aegean, eight down is Nabakov, ... | Hang on. | Excuse me? | There’s no point, I just think it’s a good id... |
| 4 | Fill these out. | I think this is the place. | If you have to ask, maybe you shouldn’t be here. | Yes. Um, is this the High IQ sperm bank? | Can I help you? | One across is Aegean, eight down is Nabakov, ... | Hang on. | Excuse me? |


Split the dataset into training and validation parts.

In [13]:
trn_df, val_df = train_test_split(df, test_size = 0.1)
trn_df.head()

|  | response | context | context/0 | context/1 | context/2 | context/3 | context/4 | context/5 |
|---|---|---|---|---|---|---|---|---|
| 54296 | I made a play for her and she shot me down. | Yeah, she’s definitely going after Sheldon. | Do you really think there’s reason to worry? | The apartment. | You kept walking. I think you did. | I could have made her very happy. | Dr. Cooper, over here. | She’s clearly having a working lunch and pref... |
| 19777 | All right. Hang on. | Todd Zarnecki was mean. | How come? | No. We failed in our noble quest. | So did you at least get Sheldon’s fake stuff ... | This one’s funny, Leonard. How come you could... | Well, doesn’t matter if she gets it, as long ... | Yeah, she doesn’t really understand the whole... |
| 33659 | No. Now, Dr. Hofstadter. Can you walk us thro... | Now, isn’t there something you’d like to say ... | It’s all right. | Thank you. Ira, if I may, I’d like to apologi... | Thanks. | I’m Ira Flatow, and this is Science Friday. I... | The radio studio. | You should probably go. |
| 14383 | What? What are you doing with, what? | Oh, Penny, excellent. I have a question about... | Hey, Sheldon. | The lobby | Whoa, whoa, whoa. You heard the man. Where’s ... | I can’t believe they let him into Canada. | I can’t believe he’s friends with Elizabeth P... | When I’ve seen two consecutive negative throa... |
| 3620 | Oh, alright, this is Missy, Missy this is Leo... | Sheldon, are you going to introduce us? | Nobody ever expects me, sometimes you just lo... | How can you be late, I wasn’t expecting you a... | Sorry I’m late, I’m working on a project that... | Buddy. | Oh, hey buddy. | Thank you for coming by. (He rises from his d... |
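
`train_test_split` shuffles the rows and holds out 10% of them. Because no `random_state` is passed, the split differs from run to run. For readers who want to see the mechanics, here is a minimal standard-library sketch of a seeded split. It is not sklearn's implementation; the `split_rows` name, the seed and the rounding rule are assumptions made for illustration.

```python
import math
import random


def split_rows(rows, test_fraction=0.1, seed=42):
    """Shuffle a copy of rows with a fixed seed and hold out a fraction for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_val:], shuffled[:n_val]


train_rows, val_rows = split_rows(range(20))
print(len(train_rows), len(val_rows))
```

Fixing the seed makes the two parts reproducible, which helps when comparing perplexity between runs.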


Now we will convert our dataset into a format suitable for our model. Basically, we concatenate the responses of each row into one sequence, adding the special end-of-sequence token (`eos_token`) after every response so the model can tell where each response ends.

In [14]:
from transformers import PreTrainedTokenizer

def construct_conv(row, tokenizer, eos=True):
    flatten = lambda l: [item for sublist in l for item in sublist]
    conv = list(
        reversed([
            tokenizer.encode(str(x)) + [tokenizer.eos_token_id]
            for x in row if pd.notna(x)
        ])
    )
    return flatten(conv)


class ConversationDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, args, df, block_size=512):

        block_size = block_size - (tokenizer.model_max_length - tokenizer.max_len_single_sentence)

        directory = args.cache_dir
        cached_features_file = os.path.join(
            directory, args.model_type + "_cached_lm_" + str(block_size)
        )

        if os.path.exists(cached_features_file) and not args.overwrite_cache:
            logger.info("Loading features from cached file %s", cached_features_file)
            with open(cached_features_file, "rb") as handle:
                self.examples = pickle.load(handle)
        else:
            logger.info("Creating features from dataset file at %s", directory)

            self.examples = []
            for _, row in df.iterrows():
                conv = construct_conv(row, tokenizer)
                self.examples.append(conv)

            logger.info("Saving features into cached file %s", cached_features_file)
            with open(cached_features_file, "wb") as handle:
                pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, item):
        return torch.tensor(self.examples[item], dtype=torch.long)
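
The effect of `construct_conv` is easiest to see with a toy tokenizer. The sketch below uses a made-up `ToyTokenizer` (one id per word, id 0 as end-of-text) in place of the real DialoGPT tokenizer, and a plain list in place of a DataFrame row. The row arrives newest-first (`response`, `context`, `context/0`, ...), so it is reversed to put the oldest turn first and the response last.

```python
class ToyTokenizer:
    """Hypothetical stand-in for the real tokenizer: one integer id per distinct word."""

    eos_token_id = 0

    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        # Assign ids 1, 2, 3, ... in order of first appearance.
        return [self.vocab.setdefault(word, len(self.vocab) + 1) for word in text.split()]


def flatten_conversation(row, tokenizer):
    """Same idea as construct_conv: oldest turn first, each turn followed by EOS."""
    turns = [tokenizer.encode(text) + [tokenizer.eos_token_id] for text in row]
    return [token for turn in reversed(turns) for token in turn]


row = ["Fill these out.", "I think this is the place.", "Can I help you?"]  # newest first
print(flatten_conversation(row, ToyTokenizer()))
```

Unlike `construct_conv`, this toy version does not skip missing values; the real function filters them with `pd.notna`.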

In [15]:
# Caching and storing of data/checkpoints

def load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False):
    return ConversationDataset(tokenizer, args, df_val if evaluate else df_trn)


def set_seed(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)


def _sorted_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> List[str]:
    ordering_and_checkpoint_path = []

    glob_checkpoints = glob.glob(os.path.join(args.output_dir, "{}-*".format(checkpoint_prefix)))

    for path in glob_checkpoints:
        if use_mtime:
            ordering_and_checkpoint_path.append((os.path.getmtime(path), path))
        else:
            regex_match = re.match(".*{}-([0-9]+)".format(checkpoint_prefix), path)
            if regex_match and regex_match.groups():
                ordering_and_checkpoint_path.append((int(regex_match.groups()[0]), path))

    checkpoints_sorted = sorted(ordering_and_checkpoint_path)
    checkpoints_sorted = [checkpoint[1] for checkpoint in checkpoints_sorted]
    return checkpoints_sorted


def _rotate_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False) -> None:
    if not args.save_total_limit:
        return
    if args.save_total_limit <= 0:
        return

    # Check if we should delete older checkpoint(s)
    checkpoints_sorted = _sorted_checkpoints(args, checkpoint_prefix, use_mtime)
    if len(checkpoints_sorted) <= args.save_total_limit:
        return

    number_of_checkpoints_to_delete = max(0, len(checkpoints_sorted) - args.save_total_limit)
    checkpoints_to_be_deleted = checkpoints_sorted[:number_of_checkpoints_to_delete]
    for checkpoint in checkpoints_to_be_deleted:
        logger.info("Deleting older checkpoint [{}] due to args.save_total_limit".format(checkpoint))
        shutil.rmtree(checkpoint)
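
`_sorted_checkpoints` pulls the step number out of each directory name because sorting the names as strings would put `checkpoint-10500` before `checkpoint-3500`. Here is a small standard-library sketch of the same idea. The directory names are invented, and `to_delete` only returns the candidates instead of removing anything from disk.

```python
import re


def sort_by_step(paths, prefix="checkpoint"):
    """Order checkpoint paths by the integer step in their name."""
    numbered = []
    for path in paths:
        match = re.match(r".*{}-([0-9]+)".format(prefix), path)
        if match:
            numbered.append((int(match.group(1)), path))
    return [path for _, path in sorted(numbered)]


def to_delete(paths, save_total_limit):
    """Return the oldest checkpoints that exceed the limit (none if the limit is unset)."""
    if not save_total_limit or save_total_limit <= 0:
        return []
    ordered = sort_by_step(paths)
    return ordered[: max(0, len(ordered) - save_total_limit)]


paths = [
    "output-small/checkpoint-10500",
    "output-small/checkpoint-3500",
    "output-small/checkpoint-7000",
]
print(sort_by_step(paths))
print(to_delete(paths, save_total_limit=2))
```

With `save_total_limit = None`, as in our `Args`, nothing is ever deleted, so keep an eye on disk space during long runs.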

## Training and Evaluating

There is quite a lot of code needed to train our model, but don’t worry: everything should work as is. The main thing is to give the model the dataset in the right format.

![alt text](https://media1.giphy.com/media/v1.Y2lkPTc5MGI3NjExN2Vsenc3b2M2dHhmeXNyY25qNmkzOXoxdGNhcjJhd2JzaGhtaXJtMSZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/29SqSyXlyO6WI/giphy.gif)

Image from [Giphy](https://giphy.com/)

In [16]:
from transformers import PreTrainedModel, PreTrainedTokenizer
def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:
    """ Train the model """
    if args.local_rank in [-1, 0]:
        tb_writer = SummaryWriter()

    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)

    def collate(examples: List[torch.Tensor]):
        # Change is here, use tokenizer.pad_token_id instead of tokenizer._pad_token
        if tokenizer.pad_token_id is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
    train_dataloader = DataLoader(
        train_dataset, sampler=train_sampler, batch_size=args.train_batch_size, collate_fn=collate, drop_last = True
    )

    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
    else:
        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    model = model.module if hasattr(model, "module") else model  # Take care of distributed/parallel training
    model.resize_token_embeddings(len(tokenizer))
    # add_special_tokens_(model, tokenizer)


    # Prepare optimizer and schedule (linear warmup and decay)
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
    )

    # Check if saved optimizer or scheduler states exist
    if (
        args.model_name_or_path
        and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt"))
        and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt"))
    ):
        # Load in optimizer and scheduler states
        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))

    if args.fp16:
        try:
            from apex import amp
        except ImportError:
            raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)

    # multi-gpu training (should be after apex fp16 initialization)
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Distributed training (should be after apex fp16 initialization)
    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
        )

    # Train!
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
    logger.info(
        "  Total train batch size (w. parallel, distributed & accumulation) = %d",
        args.train_batch_size
        * args.gradient_accumulation_steps
        * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
    )
    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info("  Total optimization steps = %d", t_total)

    global_step = 0
    epochs_trained = 0
    steps_trained_in_current_epoch = 0
    # Check if continuing training from a checkpoint
    if args.model_name_or_path and os.path.exists(args.model_name_or_path):
        try:
            # set global_step to global_step of last saved checkpoint from model path
            checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
            global_step = int(checkpoint_suffix)
            epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
            steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)

            logger.info("  Continuing training from checkpoint, will skip to saved global_step")
            logger.info("  Continuing training from epoch %d", epochs_trained)
            logger.info("  Continuing training from global step %d", global_step)
            logger.info("  Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)
        except ValueError:
            logger.info("  Starting fine-tuning.")

    tr_loss, logging_loss = 0.0, 0.0

    model.zero_grad()
    train_iterator = trange(
        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]
    )
    set_seed(args)  # Added here for reproducibility
    for _ in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
        for step, batch in enumerate(epoch_iterator):

            # Skip past any already trained steps if resuming training
            if steps_trained_in_current_epoch > 0:
                steps_trained_in_current_epoch -= 1
                continue

            inputs, labels = (batch, batch)
            if inputs.shape[1] > 1024: continue
            inputs = inputs.to(args.device)
            labels = labels.to(args.device)
            model.train()
            outputs = model(inputs, labels=labels)
            loss = outputs[0]  # model outputs are always tuple in transformers (see doc)

            if args.n_gpu > 1:
                loss = loss.mean()  # mean() to average on multi-gpu parallel training
            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:
                    scaled_loss.backward()
            else:
                loss.backward()

            tr_loss += loss.item()
            if (step + 1) % args.gradient_accumulation_steps == 0:
                if args.fp16:
                    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
                else:
                    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
                optimizer.step()
                scheduler.step()  # Update learning rate schedule
                model.zero_grad()
                global_step += 1

                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
                    # Log metrics
                    if (
                        args.local_rank == -1 and args.evaluate_during_training
                    ):  # Only evaluate when single GPU otherwise metrics may not average well
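                        # evaluate() also requires df_trn and df_val; pass them through train() before enabling evaluate_during_training
                        results = evaluate(args, model, tokenizer)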
                        for key, value in results.items():
                            tb_writer.add_scalar("eval_{}".format(key), value, global_step)
                    tb_writer.add_scalar("lr", scheduler.get_last_lr()[0], global_step)
                    tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
                    logging_loss = tr_loss

                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
                    checkpoint_prefix = "checkpoint"
                    # Save model checkpoint
                    output_dir = os.path.join(args.output_dir, "{}-{}".format(checkpoint_prefix, global_step))
                    os.makedirs(output_dir, exist_ok=True)
                    model_to_save = (
                        model.module if hasattr(model, "module") else model
                    )  # Take care of distributed/parallel training
                    model_to_save.save_pretrained(output_dir)
                    tokenizer.save_pretrained(output_dir)

                    torch.save(args, os.path.join(output_dir, "training_args.bin"))
                    logger.info("Saving model checkpoint to %s", output_dir)

                    _rotate_checkpoints(args, checkpoint_prefix)

                    torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
                    torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
                    logger.info("Saving optimizer and scheduler states to %s", output_dir)

            if args.max_steps > 0 and global_step > args.max_steps:
                epoch_iterator.close()
                break
        if args.max_steps > 0 and global_step > args.max_steps:
            train_iterator.close()
            break

    if args.local_rank in [-1, 0]:
        tb_writer.close()

    return global_step, tr_loss / global_step

# Evaluation of some model

def evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, df_trn, df_val, prefix="") -> Dict:
    # Evaluate on the validation split and write the results to args.output_dir
    eval_output_dir = args.output_dir

    eval_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=True)
    os.makedirs(eval_output_dir, exist_ok=True)
    args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
    # Note that DistributedSampler samples randomly

    def collate(examples: List[torch.Tensor]):
        # Change is here, use tokenizer.pad_token_id instead of tokenizer._pad_token
        if tokenizer.pad_token_id is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(
        eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate, drop_last = True
    )

    # multi-gpu evaluate
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Eval!
    logger.info("***** Running evaluation {} *****".format(prefix))
    logger.info("  Num examples = %d", len(eval_dataset))
    logger.info("  Batch size = %d", args.eval_batch_size)
    eval_loss = 0.0
    nb_eval_steps = 0
    model.eval()

    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        inputs, labels = (batch, batch)
        inputs = inputs.to(args.device)
        labels = labels.to(args.device)

        with torch.no_grad():
            outputs = model(inputs, labels=labels)
            lm_loss = outputs[0]
            eval_loss += lm_loss.mean().item()
        nb_eval_steps += 1

    eval_loss = eval_loss / nb_eval_steps
    perplexity = torch.exp(torch.tensor(eval_loss))

    result = {"perplexity": perplexity}

    output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
    with open(output_eval_file, "w") as writer:
        logger.info("***** Eval results {} *****".format(prefix))
        for key in sorted(result.keys()):
            logger.info("  %s = %s", key, str(result[key]))
            writer.write("%s = %s\n" % (key, str(result[key])))

    return result
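
The number `evaluate` reports is perplexity: the exponential of the average cross-entropy loss. The sketch below reproduces that arithmetic on made-up batch losses. Like the function above, it averages per-batch losses, which matches a per-token average only when every batch contains the same number of tokens.

```python
import math


def perplexity(batch_losses):
    """exp of the mean loss, mirroring eval_loss / nb_eval_steps followed by exp."""
    mean_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(mean_loss)


# Made-up losses: a model that is, on average, as uncertain as a uniform
# choice among 4 tokens has a loss of ln(4) per token.
print(perplexity([math.log(4), math.log(4)]))
```

Lower is better: a perplexity of 1 would mean the model predicts every next token with certainty.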

In [17]:
import os
import torch
import logging
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM  # Updated import

# Main runner
def main(df_trn, df_val):
    args = Args()

    # Handle checkpoints if continuing training
    if args.should_continue:
        sorted_checkpoints = _sorted_checkpoints(args)
        if len(sorted_checkpoints) == 0:
            raise ValueError("Used --should_continue but no checkpoint was found in --output_dir.")
        else:
            args.model_name_or_path = sorted_checkpoints[-1]

    # Check output directory for training continuation
    if (
        os.path.exists(args.output_dir)
        and os.listdir(args.output_dir)
        and args.do_train
        and not args.overwrite_output_dir
        and not args.should_continue
    ):
        raise ValueError(
            "Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
                args.output_dir
            )
        )

    # Setup CUDA, GPU & distributed training
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Updated for Kaggle compatibility
    args.n_gpu = torch.cuda.device_count() if torch.cuda.is_available() else 0
    args.device = device

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
    )
    logger.warning(
        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
        args.local_rank,
        device,
        args.n_gpu,
        bool(args.local_rank != -1),
        args.fp16,
    )

    # Set seed
    set_seed(args)

    # Load model and tokenizer with updated class
    config = AutoConfig.from_pretrained(args.config_name, cache_dir=args.cache_dir)
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
    model = AutoModelForCausalLM.from_pretrained(  # Updated class for language model compatibility
        args.model_name_or_path,
        from_tf=False,
        config=config,
        cache_dir=args.cache_dir,
    )
    model.to(args.device)

    logger.info("Training/evaluation parameters %s", args)

    # Training
    if args.do_train:
        train_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False)

        global_step, tr_loss = train(args, train_dataset, model, tokenizer)
        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

    # Save the model and tokenizer
    if args.do_train:
        os.makedirs(args.output_dir, exist_ok=True)
        logger.info("Saving model checkpoint to %s", args.output_dir)
        model.save_pretrained(args.output_dir)
        tokenizer.save_pretrained(args.output_dir)
        torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

        # Reload the trained model for evaluation
        model = AutoModelForCausalLM.from_pretrained(args.output_dir)
        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
        model.to(args.device)

    # Evaluation
    results = {}
    if args.do_eval and args.local_rank in [-1, 0]:
        checkpoints = [args.output_dir]
        if args.eval_all_checkpoints:
            checkpoints = list(
                os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
            )
            logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
        logger.info("Evaluate the following checkpoints: %s", checkpoints)
        for checkpoint in checkpoints:
            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
            prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""

            model = AutoModelForCausalLM.from_pretrained(checkpoint)  # Updated class
            model.to(args.device)
            result = evaluate(args, model, tokenizer, df_trn, df_val, prefix=prefix)
            result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
            results.update(result)

    return results


It is time to train our model!

![alt text](https://media4.giphy.com/media/v1.Y2lkPTc5MGI3NjExbm44c3I5OWJxaGgzZTkwYzZ2cDJ4N2FuZnpkcTI1YW4zeDJ6YW44dCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/3ohs83cvmud7ThYTzq/giphy.gif)

Image from [Giphy](https://giphy.com/)

In [18]:
main(trn_df, val_df)

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Iteration:   0%|          | 0/6119 [00:00<?, ?it/s]

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.

Evaluating:   0%|          | 0/680 [00:00<?, ?it/s]



{'perplexity_': tensor(3.3414)}
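
The perplexity reported above is the exponential of the average per-batch loss accumulated in `evaluate`. Here is a minimal sketch of that arithmetic in plain Python; the helper name `perplexity_from_losses` and the loss values are made up for illustration and are not taken from the training run.

```python
import math


def perplexity_from_losses(batch_losses):
    """Return exp(mean loss), mirroring the last steps of evaluate()."""
    if not batch_losses:
        raise ValueError("need at least one batch loss")
    mean_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(mean_loss)


# Hypothetical per-batch cross-entropy losses, for illustration only.
example_losses = [1.25, 1.10, 1.30]
example_ppl = perplexity_from_losses(example_losses)
print(f"perplexity = {example_ppl:.4f}")
```

Read the other way around, a perplexity of about 3.34 corresponds to a mean loss of roughly 1.21 (the natural log of 3.34). Lower is better.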

## Chatting with Sheldon

The model is ready, so it's time to chat with Sheldon. But don't forget that Sheldon can be rude. I warned you!

Responses can be generated with a variety of decoding methods, such as greedy search, beam search, and top-k or top-p sampling. The Hugging Face post [How to generate text](https://huggingface.co/blog/how-to-generate) covers them in detail.
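
The chat cell in this section samples with `top_k=100`, `top_p=0.7`, and `temperature=0.8`. To give a feel for what those knobs do, here is a simplified sketch of temperature scaling followed by top-k and then top-p (nucleus) filtering on a toy distribution. It is an illustration only, not the library's implementation; the function names and the toy logits are assumptions.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose renormalised cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    kept_total = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p / kept_total
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {index: p / norm for index, p in kept}


def sample_token(logits, top_k=100, top_p=0.7, temperature=0.8, rng=random):
    """Sample one token index from the filtered distribution."""
    candidates = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    indices = list(candidates)
    weights = [candidates[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
print(filter_top_k_top_p(softmax(toy_logits, temperature=0.8), top_k=3, top_p=0.7))
print(sample_token(toy_logits))
```

Smaller `top_p` or `top_k` values leave fewer candidate tokens, which makes replies more predictable; larger values make them more varied.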

![alt text](https://media1.giphy.com/media/v1.Y2lkPTc5MGI3NjExMW9tbGp5dXZsMGYxMHB4dHQwdncyMmJjczRuZGJwNjY2amNrbHBmbiZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/l2R01AMRh4g9XEKZO/giphy.gif)

Image from [Giphy](https://giphy.com/)

In [19]:
from transformers import AutoTokenizer, AutoModelForCausalLM  # AutoModelWithLMHead is deprecated
import torch

tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-small')
model = AutoModelForCausalLM.from_pretrained('output-small')  # directory with the fine-tuned model

# Predefined list of user messages
user_inputs = [
    "Hey, who are you?",
    "What can you do?",
    "Tell me a joke.",
    "Do you know Rick and Morty?",
    "Bye!"
]

chat_history_ids = None

for step, user_input in enumerate(user_inputs):
    # Encode the user message and append the end-of-sequence token
    new_user_input_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors='pt')

    # Append the new message to the chat history (if there is one)
    bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # Generate a reply by sampling; max_length counts the history plus the reply
    chat_history_ids = model.generate(
        bot_input_ids, max_length=200,
        pad_token_id=tokenizer.eos_token_id,
        no_repeat_ngram_size=3,
        do_sample=True,
        top_k=100,
        top_p=0.7,
        temperature=0.8
    )

    # Decode only the newly generated tokens
    response = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)
    print(f"SheldonBot: {response}")




SheldonBot:  I’m Zack.
SheldonBot:  Well, I‘m an actress.
SheldonBot:  You know, I know a guy.
SheldonBot: !!!
SheldonBot: !
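
The loop above keeps appending every turn to `chat_history_ids`, while `max_length=200` caps the history and the reply together, so a long conversation leaves less and less room for an answer. One possible mitigation, which is not part of the original notebook, is to keep only the most recent turns. The sketch below works on plain lists of token ids; `keep_last_turns` is a hypothetical helper, and the end-of-text id 50256 is an assumption (in practice, read it from `tokenizer.eos_token_id`).

```python
def keep_last_turns(token_ids, eos_id, max_turns):
    """Keep only the last `max_turns` EOS-terminated turns of a flat token list."""
    if max_turns <= 0:
        return []
    turns, current = [], []
    for token in token_ids:
        current.append(token)
        if token == eos_id:
            turns.append(current)
            current = []
    if current:  # trailing tokens without a closing EOS
        turns.append(current)
    return [token for turn in turns[-max_turns:] for token in turn]


EOS = 50256  # assumed end-of-text id; use tokenizer.eos_token_id in practice
history = [10, 11, EOS, 20, EOS, 30, 31, 32, EOS]  # three made-up turns
print(keep_last_turns(history, EOS, max_turns=2))
```

With tensors, the same idea could be applied by converting `chat_history_ids[0].tolist()`, trimming, and wrapping the result back into a tensor before the next `torch.cat`.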


Here are a few more dialogues that show how much our bot sounds like Sheldon now.

![alt text](https://media2.giphy.com/media/v1.Y2lkPTc5MGI3NjExMjBicDAwaGgybTYwZW9oajgweXB2Z3Z3MDNiYmNhMmJrcG11dXljNyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/3osxY6y28AlWROLzqw/giphy.gif)

Image from [Giphy](https://giphy.com/)


## Conclusion

![alt text](https://media2.giphy.com/media/v1.Y2lkPTc5MGI3NjExaWZiaGY2OWo1MTRyb2FjMXM5b2JucGlyMGY1cmR6a29kdzNldGV1eSZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/3ohs4nAq7JaOydmxO0/giphy.gif)

Image from [Giphy](https://giphy.com/)

Congratulations! Our virtual Sheldon is alive (almost)! By fine-tuning a pretrained model on a small dataset, we created a virtual character we can have interesting conversations with.

With the same approach you can create many other virtual characters from an arbitrary dialogue dataset (just a CSV file with one line of dialogue per row).
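
As a rough illustration of that data format, the sketch below reads a one-line-per-row CSV and pairs each line with the lines that precede it, giving (context, response) examples for fine-tuning. The column name `line`, the helper `build_examples`, and the sample dialogue are hypothetical placeholders, so adapt them to your own dataset.

```python
import csv
import io


def build_examples(lines, n_context=2):
    """Pair each line with the n_context lines that precede it."""
    examples = []
    for i in range(n_context, len(lines)):
        examples.append({
            "context": lines[i - n_context:i],
            "response": lines[i],
        })
    return examples


# Stand-in for a CSV file on disk; the "line" column name is an assumption.
csv_text = "line\nHello.\nHow are you?\nFine thanks.\nGlad to hear it.\n"
reader = csv.DictReader(io.StringIO(csv_text))
dialogue_lines = [row["line"] for row in reader]

for example in build_examples(dialogue_lines, n_context=2):
    print(example)
```

A larger `n_context` gives the model more conversational history per example, at the cost of longer input sequences.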