<a href="https://colab.research.google.com/github/Sigurdur-Ragan-Steinsson/Chat-bot/blob/main/Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Disclaimer

Some of the movie scripts that can be used for training contain foul language or other mature content, but please try to keep this tutorial PG.

## Introduction

In this notebook we will chat with a pre-trained chatbot (Microsoft's DialoGPT) and then try to train a new chatbot on movie scripts.

## Learning objectives

1. Interact with a modern chatbot
2. Try tuning chatbot parameters to see if you can qualitatively change the character of the conversation
3. Try training your own chatbot on movies 
4. Try to build a chatbot that answers some specific prompts in the most "interesting ways"

## Expectations for presentation at the end

Make sure to take good notes on the topics below:

- Describe your first conversation with the chatbot and give some specific examples of conversations. Was it believable?
- Describe how changing the chatbot parameters affected the conversation
- Describe how the conversation changes when you train on different movies
- Prepare a final trained chatbot that will answer these prompts:
  - What is your favorite number?
  - Who are your parents?
  - What is your favorite movie?
  - Do you prefer Meyer Dairy or the Creamery?
  - Please complete this sentence: We are ...
  
- After you present these results, a panel of judges will have a live conversation with your chatbot, and we will pick a winner at the end.

## Setup

You will need to run these cells to set up the project. You do not need to understand in detail what they do in order to complete the exercises that follow, but if you are interested, ask one of the camp staff.

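In [None]:
# Install the library before importing from it
!pip -q install transformers

import os

import pandas as pd
import torch
from sklearn.model_selection import train_test_split
# Note: AutoModelWithLMHead is deprecated in newer transformers releases
# (AutoModelForCausalLM is the suggested replacement)
from transformers import AutoModelWithLMHead, AutoTokenizer

import campbot  # local module that provides the training loop (campbot.main)

# Load the pre-trained DialoGPT tokenizer and model used as the default bot
_tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
_model = AutoModelWithLMHead.from_pretrained("microsoft/DialoGPT-small")

# Upload Movie.tsv from your computer.
# This may be unnecessary if Movie.tsv is already in the working directory.
from google.colab import files
uploaded = files.upload()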

class Chatty(object):
    def __init__(self, model = _model, tokenizer = _tokenizer, movie_file = "Movie.tsv"):
        self.model = model
        self.tokenizer = tokenizer
        # The script data is tab-separated and has 'movie' and 'line' columns
        self.movie_dataframe = pd.read_csv(movie_file, sep='\t')
        self.movies = {m.replace(".html",""):m for m in pd.unique(self.movie_dataframe['movie'])}

    def list_movies(self):
        for movie in sorted(self.movies):
            print(movie)
    
    def _select_movie(self, movie):
        # fail early with a readable message instead of a bare lookup error
        if movie not in self.movies:
            raise KeyError("movie %s not found" % movie)
        data=self.movie_dataframe[self.movie_dataframe['movie']==self.movies[movie]]
        
        # number of previous lines to keep as context
        n = 7
        contexted=[]
        for i in range(n, len(data)):
            row = []
            # the stop index of range() is exclusive, so we subtract 1 more:
            # the row then holds the current response and the n previous lines
            prev = i - 1 - n
            for j in range(i, prev, -1):
                row.append(str(data['line'].iloc[j]))
            contexted.append(row)

        # 'response' is the current line; the context columns go back in time
        columns = ['response', 'context'] 
        columns = columns + ['context/'+str(i) for i in range(n-1)]
        df = pd.DataFrame.from_records(contexted, columns=columns)
        return df
    
    def _select_movies(self, *args):
        return pd.concat([self._select_movie(a) for a in args])
    
    # FIXME expose training parameters through kwargs??
    def train_on_movies(self, *args, **kwargs):
        # args are movie names as printed by list_movies().
        # With no arguments, training is skipped and the most recently saved
        # checkpoint in 'output-small' is loaded; this fails if nothing has
        # been trained yet.
        if args:
            df = self._select_movies(*args)
            # hold out 10% of the rows for evaluation
            trn_df, val_df = train_test_split(df, test_size = 0.1)
            campbot.main(trn_df, val_df)
        self.model = AutoModelWithLMHead.from_pretrained('output-small')
        

    def chat(self, max_length = 1000, top_p = 0.95, top_k = 100, temperature = 0.80, curse_words = None):
        # Default word replacements that keep the bot's replies PG
        if curse_words is None:
            curse_words = {"fuck": "duck", "shit": "poopy", "bitch": "dog", "pussies": "cats",
                           "blow-job": "no-no word", "fucking": "ducking", "dick": "hotdog",
                           "bastard": "mean word", "bastards": "mean words", "fuckin'": "duckin'",
                           "Cunnilingus": "bad robot"}

        # Chat in a loop, sampling each reply with nucleus (top-p) and top-k
        # filtering at the given temperature
        print('type "done" to finish talking')
        chat_history_ids = None

        while True:
            # take user input
            text = input(">> You     : ")
            if text == "done": break
            # encode the input and add end of string token
            input_ids = self.tokenizer.encode(text + self.tokenizer.eos_token, return_tensors="pt")
            # concatenate new user input with chat history (if there is)
            bot_input_ids = input_ids if chat_history_ids is None else torch.cat([chat_history_ids, input_ids], dim=-1)
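            # keep only the most recent max_length - 1 tokens of the conversation.
            # NOTE: generate() below counts these prompt tokens toward max_length,
            # so a long history leaves room for very few new tokens
            bot_input_ids = bot_input_ids[:,-max_length+1:]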
            # generate a bot response
            chat_history_ids_list = self.model.generate(
                bot_input_ids,
                max_length=max_length,
                do_sample=True,
                top_p=top_p,
                top_k=top_k,
                temperature=temperature,
                num_return_sequences=1,
                pad_token_id=self.tokenizer.eos_token_id
            )

            # print the reply (the tokens after the prompt), applying the word replacements
            for i in range(len(chat_history_ids_list)):
                output = self.tokenizer.decode(chat_history_ids_list[i][bot_input_ids.shape[-1]:], skip_special_tokens=True)
                for w in curse_words:
                    output = output.replace(w,curse_words[w])
                print(f">> DialoGPT: {output}")

            # keep the full generated sequence as history for the next turn
            chat_history_ids = torch.unsqueeze(chat_history_ids_list[0], dim=0)


Saving Movie.tsv to Movie.tsv


In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Exercise 1: Talk with a basic chatbot 

**THIS EXERCISE SHOULD BE DONE BY INDIVIDUALS**

**TIME ESTIMATE: 15 minutes**

In this exercise we will initialize a pre-trained chatbot and have a conversation with it. First, initialize the bot:

In [None]:
ChatBot = Chatty()

Next, have a conversation with the bot and note some of the qualitative behavior you see for your presentation at the end. 

In [None]:
ChatBot.chat()

type "done" to finish talking



## Exercise 2: Tune some chatbot parameters and study the impact on conversation

**THIS EXERCISE SHOULD BE DONE BY INDIVIDUALS**

**TIME ESTIMATE: 30 minutes**

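When you converse with your chatbot, you can pass parameters that control how the bot responds: max_length limits how many tokens (word pieces) of conversation the model works with, temperature controls how random the replies are, and top_k and top_p restrict which candidate words the bot may choose from. Notice the slightly different command below. Try experimenting with different values. Note that they all have to be positive, and top_p should be no larger than 1.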
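
To build some intuition for what temperature, top_k and top_p do, here is a minimal sketch in plain Python. It is not the code that the transformers library runs internally; the function names and the toy scores in `toy_logits` are made up for illustration, and the steps are a simplified version of the usual order (temperature, then top-k, then top-p).

```python
import math
import random

def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature: divide the scores before the softmax.
    # Values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    biggest = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(s - biggest) for tok, s in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda pair: pair[1], reverse=True)
    # Top-k: keep only the k most likely tokens (0 means "keep all").
    if top_k > 0:
        ranked = ranked[:top_k]
        norm = sum(p for _, p in ranked)
        ranked = [(tok, p / norm) for tok, p in ranked]
    # Top-p (nucleus): keep the smallest set of top tokens whose
    # cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}

def sample_token(dist, rng=random):
    """Draw one token at random according to the filtered probabilities."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens])[0]

# Hypothetical scores for four candidate next words
toy_logits = {"hello": 2.0, "hi": 1.0, "hey": 0.5, "she": -1.0}
print(filter_distribution(toy_logits, temperature=1.0, top_k=100, top_p=0.95))
print(sample_token(filter_distribution(toy_logits, temperature=0.8)))
```

In this sketch, setting top_k to 1 always leaves only the single most likely word, so the "random" choice is no longer random, while a high temperature spreads probability onto unlikely words.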

In [None]:
ChatBot.chat(max_length = 100, top_p = 0.95, top_k = 100, temperature = 1.0)

type "done" to finish talking
>> You     : hi
>> DialoGPT: Favourite video content : What an interesting user name
>> You     : yup
>> DialoGPT: That's still a very interesting user name...
>> You     : how do I change it?
>> You     : no problem buddy
>> You     : <3
>> DialoGPT: Woo! Nice save!
>> You     : np  bro
>> DialoGPT: thank you brother and sorry it didn't turn out that way
>> You     : what way?'
>> DialoGPT: nods and bites back Sorry uhhh
>> You     : uhhh
>> DialoGPT: Oh yes, I should
>> You     : sus
>> DialoGPT: c
>> You     : sussy
>> DialoGPT: she
>> You     : okay
>> DialoGPT: she
>> You     : bru
>> DialoGPT: thank
>> You     : you
>> DialoGPT: you
>> You     : hji
>> DialoGPT: she
>> You     : you
>> DialoGPT: w
>> You     : she
>> DialoGPT: she
>> You     : she
>> DialoGPT: she
>> You     : you
>> DialoGPT: y
>> You     : p
>> DialoGPT: she
>> You     : her
>> DialoGPT: y
>> You     : culs de sac 
>> DialoGPT: he
>> You     : done
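
One likely reason the replies above shrink to a single word or letter: in the `chat` method from the Setup section, the conversation history is cut to the last `max_length - 1` tokens and then passed to `generate` with the same `max_length`, which counts the prompt as well as the reply. The sketch below only does that arithmetic; the function name is made up, and it assumes `max_length` is greater than 1.

```python
def new_token_budget(history_tokens, max_length):
    """Tokens left for the reply when the prompt counts toward max_length."""
    # chat() keeps at most the last (max_length - 1) tokens of history
    prompt_tokens = min(history_tokens, max_length - 1)
    return max_length - prompt_tokens

for history in (20, 60, 99, 500):
    print(history, "history tokens ->", new_token_budget(history, 100), "new tokens")
```

With `max_length = 100`, a short conversation leaves plenty of room, but once the history reaches 99 tokens only one new token fits per reply. A larger `max_length` (the default is 1000) or a fresh conversation gives the bot more room.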


## Exercise 3: Train on movies

**THIS EXERCISE SHOULD BE DONE BY GROUPS**

**TIME ESTIMATE: 60 minutes**

In this exercise we will train our own chatbot on movie script text. We have preloaded 2035 movie scripts. You can list them with the following command:

In [None]:
ChatBot.list_movies()

10-Things-I-Hate-About-You
12-Monkeys
12-Years-a-Slave
12-and-Holding
127-Hours
1492-Conquest-of-Paradise
15-Minutes
17-Again
2001-A-Space-Odyssey
2012
28-Days-Later
30-Minutes-or-Less
42
44-Inch-Chest
50-50
500-Days-of-Summer
8MM
9
A-Few-Good-Men
A-Most-Violent-Year
A-Prayer-Before-Dawn
A-Quiet-Place
A-Scanner-Darkly
A-Serious-Man
Above-the-Law
Absolute-Power
Abyss,-The
Ace-Ventura-Pet-Detective
Adaptation
Addams-Family,-The
Adjustment-Bureau,-The
Adventures-of-Buckaroo-Banzai-Across-the-Eighth-Dimension,-The
Affliction
After-School-Special
After.Life
Agnes-of-God
Air-Force-One
Airplane
Airplane-2-The-Sequel
Aladdin
Ali
Alien
Alien-3
Alien-Nation
Alien-Resurrection
Alien-vs.-Predator
Aliens
All-About-Eve
All-About-Steve
Alone-in-the-Dark
Amadeus
Amelia
American,-The
American-Beauty
American-Gangster
American-Graffiti
American-History-X
American-Hustle
American-Madness
American-Milkshake
American-Pie
American-President,-The
American-Psycho
American-Shaolin-King-of-Kickboxers-II
America

You can look at the training data that is built for a single movie with, e.g.:

In [None]:
ChatBot._select_movie('Zootopia')
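
The table returned by `_select_movie` pairs every line of the script (the response) with the lines that came just before it (the context). The sketch below shows the same windowing idea on a plain Python list; `build_context_rows` and the made-up `script` list are illustrations only, and the real method works on the pandas data with 7 context lines.

```python
def build_context_rows(lines, n=7):
    """Each row is [response, previous line, ..., n-th previous line]."""
    rows = []
    for i in range(n, len(lines)):
        # walk backwards from the response at index i through the n lines before it
        rows.append([lines[j] for j in range(i, i - n - 1, -1)])
    return rows

# Made-up stand-in for the 'line' column of one movie
script = ["line %d" % k for k in range(10)]
rows = build_context_rows(script, n=3)
for row in rows[:2]:
    print(row)
```

Each row starts with the response and then goes back in time, so a script with L lines yields L - n training rows of n + 1 entries each.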

You can pick up to 6 movies to train on, passed as comma-separated strings as in the example below. Note! This training can take a long time. The computer we are using has 4 datacenter-grade TPUs designed for machine learning, but the world's most sophisticated models would train on thousands of such units!


In [None]:
ChatBot.train_on_movies('TRON')




07/02/2022 01:28:56 - INFO - campbot -   Training/evaluation parameters <campbot.Args object at 0x7f1531f57490>
07/02/2022 01:28:56 - INFO - campbot -   Creating features from dataset file at cached
07/02/2022 01:28:57 - INFO - campbot -   Saving features into cached file cached/gpt2_cached_lm_512
07/02/2022 01:28:57 - INFO - campbot -   ***** Running training *****
07/02/2022 01:28:57 - INFO - campbot -     Num examples = 823
07/02/2022 01:28:57 - INFO - campbot -     Num Epochs = 3
07/02/2022 01:28:57 - INFO - campbot -     Instantaneous batch size per GPU = 1
07/02/2022 01:28:57 - INFO - campbot -     Total train batch size (w. parallel, distributed & accumulation) = 1
07/02/2022 01:28:57 - INFO - campbot -     Gradient Accumulation steps = 1
07/02/2022 01:28:57 - INFO - campbot -     Total optimization steps = 2469



07/02/2022 01:33:22 - INFO - campbot -    global_step = 2469, average loss = 2.1818467682071327
07/02/2022 01:33:22 - INFO - campbot -   Saving model checkpoint to output-small
07/02/2022 01:33:25 - INFO - campbot -   Evaluate the following checkpoints: ['output-small']
07/02/2022 01:33:27 - INFO - campbot -   Creating features from dataset file at cached
07/02/2022 01:33:27 - INFO - campbot -   Saving features into cached file cached/gpt2_cached_lm_512
07/02/2022 01:33:27 - INFO - campbot -   ***** Running evaluation  *****
07/02/2022 01:33:27 - INFO - campbot -     Num examples = 92
07/02/2022 01:33:27 - INFO - campbot -     Batch size = 1



07/02/2022 01:33:29 - INFO - campbot -   ***** Eval results  *****
07/02/2022 01:33:29 - INFO - campbot -     perplexity = tensor(2.6158)
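
The last log line reports a perplexity. The code of `campbot` is not shown here, so the following is an assumption: perplexity is commonly defined as the exponential of the average loss on the held-out examples, and lower is better (a value of 1 would mean the model predicts every held-out token perfectly). A minimal sketch with made-up loss values:

```python
import math

def perplexity(losses):
    """exp(mean loss), with losses measured in nats (natural-log units)."""
    return math.exp(sum(losses) / len(losses))

# Hypothetical per-batch evaluation losses, not taken from the run above
example_losses = [1.2, 0.8, 1.0]
print(perplexity(example_losses))
```

Roughly speaking, a perplexity of k means the model is about as unsure as if it were choosing evenly among k possible next tokens.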


In [None]:
ChatBot.chat(max_length = 100, top_p = 0.95, top_k = 1, temperature = 1.0)

type "done" to finish talking
>> You     : hello
>> DialoGPT:  Hey. Flynn smiles weakly. 
>> You     : nice to see you
>> DialoGPT:  (nods toward terminal) So...how 'bout yourself? 
>> You     : I'm alright, how are you?
>> DialoGPT:  (nods toward terminal) Nothing much...just a User. 
>> You     : are you sentient?
>> DialoGPT:  (nods toward terminal) No... 
>> You     : cool
>> DialoGPT:  (nods toward terminal) So...how 'bout yourself? 
>> You     : done


## Exercise 4: Train a chatbot for the conversation competition

**THIS EXERCISE SHOULD BE DONE IN GROUPS**

**TIME ESTIMATE: 2 hours**

Using what you have learned above, try to train a chatbot that gives the best answers to the prompts listed at the top of this notebook.

In [None]:
#ChatBot.train_on_movies("Dumb-and-Dumber","Deadpool","Pulp-Fiction","Shrek","Legally-Blonde","Star-Wars-A-New-Hope")
ChatBot.train_on_movies()



ChatBot.chat()

In [None]:
ChatBot.chat(max_length = 100, top_p = 0.5, top_k = 50, temperature = 1.0,
             curse_words = {"fuck": "duck", "shit": "poopy", "bitch": "dog", "pussies": "cats",
                            "blow-job": "no-no word", "fucking": "ducking", "dick": "hotdog",
                            "bastard": "mean word", "bastards": "mean words", "fuckin'": "duckin'",
                            "Cunnilingus": "bad robot", "goddamn": "dang", "whole wad": "window",
                            "sex": "six"})

type "done" to finish talking
>> You     : Hello, how are you?
>> DialoGPT:  Fine. Just fine. How are you? 
>> You     : Well, I was okay until I had to go to this really terrible "summer camp" where people made me work and robots swore at me.
>> DialoGPT:  So, uh, how are you? 
>> You     : right.  I am fine
>> DialoGPT:  How are you? 
>> You     : c'mon man. Give me a break!
>> DialoGPT:  (bumps into a table where there are mugs of beer) I'm sorry
>> You     : I like beer (am I allowed to say that?)
>> DialoGPT:  (
>> You     : uh oh...
>> DialoGPT:  What
>> You     : I was worried that you were broken
>> DialoGPT:  I
>> You     : how many fish can you name?
>> DialoGPT:  Three
>> You     : well, please name all three fish
>> DialoGPT:  (
>> You     : that is not a fish
>> DialoGPT:  (
>> You     : that is not a fish
>> DialoGPT:  It
>> You     : that is __not__ a fish
>> DialoGPT:  (
>> You     : can you bend and snap?
>> DialoGPT:  Yes
>> You     : prove it
>> DialoGPT:  Please
>> 