<a href="https://colab.research.google.com/github/Ciph3r007/Bengali-OCR/blob/main/ChatBot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Install JAX.
!pip install --upgrade jax
!pip install --upgrade jaxlib
!pip install --upgrade trax

# Make sure the Colab Runtime is set to Accelerator: TPU.
import requests
import os
if 'TPU_DRIVER_MODE' not in globals():
  url = 'http://' + os.environ['COLAB_TPU_ADDR'].split(':')[0] + ':8475/requestversion/tpu_driver0.1-dev20191206'
  resp = requests.post(url)
  TPU_DRIVER_MODE = 1

# The following is required to use TPU Driver as JAX's backend.
from jax.config import config
config.FLAGS.jax_xla_backend = "tpu_driver"
config.FLAGS.jax_backend_target = "grpc://" + os.environ['COLAB_TPU_ADDR']
print(config.FLAGS.jax_backend_target)

Requirement already up-to-date: jax in /usr/local/lib/python3.7/dist-packages (0.2.12)
Requirement already up-to-date: jaxlib in /usr/local/lib/python3.7/dist-packages (0.1.65+cuda110)
Collecting trax
  Downloading https://files.pythonhosted.org/packages/42/51/305b839f51d53abb393777f743e497d27bb341478f3fdec4d6ddaccc9fb5/trax-1.3.7-py2.py3-none-any.whl (521kB)
Collecting tensorflow-text
  Downloading https://files.pythonhosted.org/packages/b6/c0/c0fed4301f592c3b56638ae7292612c17d91a43891ba1aaf9636d535beae/tensorflow_text-2.4.3-cp37-cp37m-manylinux1_x86_64.whl (3.4MB)
Collecting funcsigs
  Downloading https://files.pythonhosted.org/packages/69/cb/f5be453359271714c01b9bd06126eaf2e368f1fddfff30818754b5ac2328/funcsigs-1.0.2-py2.py3-none-any.whl
Collecting t5
  Downloading https://files.pythonhosted.org/packages/d0/e4/e2dc66207464795aafecc5c8cef9a35b5c9a

# Chatbot

- [1:   Dataset](#1)
- [2:   Preprocessing](#2)
    - [2.1:   Creating input pipeline](#2.1)
- [3:   Model Training](#3)
- [4:   Testing](#4)


<a name="1"></a>
# 1. The MultiWoz dataset

Installation and importing

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd /content/drive/My\ Drive/colab_data/chatbot/
!ls

/content/drive/My Drive/colab_data/chatbot
cbot.jpg  model        Reformer.jpg	ReversibleDecoder.png
data	  __pycache__  reversible2.PNG	w4_unittest.py


In [4]:
import json
import random
import numpy as np
from termcolor import colored

import trax   
from trax import layers as tl
from trax.supervised import training
!pip list | grep trax

trax                          1.3.7                


Dataset INFO

In [5]:
with open('data/README') as file:
    print(file.read())

#####################################################
#####################################################
#  Copyright Cambridge Dialogue Systems Group, 2018 #
#####################################################
#####################################################

Dataset contains the following files:
1. data.json: the woz dialogue dataset, which contains the conversation  users and wizards, as well as a set of coarse labels for each user turn. This file contains both system and user dialogue acts annotated at the turn level. Files with multi-domain dialogues have "MUL" in their names. Single domain dialogues have either "SNG" or "WOZ" in their names.
2. restaurant_db.json: the Cambridge restaurant database file, containing restaurants in the Cambridge UK area and a set of attributes.
3. attraction_db.json: the Cambridge attraction database file, contining attractions in the Cambridge UK area and a set of attributes.
4. hotel_db.json: the Cambridge hotel database file, containing

Declaring some CONSTANTS to be used later

In [6]:
DATA_FILE = 'data.json'
DATA_DIR = './data'
DIALOGUE_DB = {}

VOCAB_FILE = 'en_32k.subword'
VOCAB_DIR = 'data/vocabs'

N_LAYERS = 6
TRAIN_STEPS = 500
LOAD_MODEL = False
TRAIN = True

Loading the MultiWoz dataset from json

In [7]:
def load_json(directory, file):
    # Use a separate name for the handle so the `file` argument is not shadowed.
    with open(f'{directory}/{file}') as f:
        db = json.load(f)
    return db
    
DIALOGUE_DB = load_json(DATA_DIR, DATA_FILE)

In [8]:
print(f'The number of dialogues is: {len(DIALOGUE_DB)}')

The number of dialogues is: 10438


`DIALOGUE_DB` is a dictionary keyed by dialogue filename. Multi-domain dialogues have "MUL" in their filenames, while single-domain dialogues have either "SNG" or "WOZ".

In [9]:
print(list(DIALOGUE_DB.keys())[0:7]) 

['SNG01856.json', 'SNG0129.json', 'PMUL1635.json', 'MUL2168.json', 'SNG0073.json', 'SNG01445.json', 'MUL2105.json']
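
To make the naming convention concrete, here is a small sketch that tallies dialogues by type from their filenames. The helper `dialogue_kind` is hypothetical (it is not part of this notebook), and the sample keys are the seven printed above; on the full dataset one would pass `DIALOGUE_DB.keys()` instead.

```python
from collections import Counter

def dialogue_kind(filename):
    """Hypothetical helper: classify a dialogue by its filename."""
    # Per the dataset README, multi-domain files have "MUL" in their names;
    # everything else ("SNG" or "WOZ") is treated as single-domain.
    return 'multi-domain' if 'MUL' in filename else 'single-domain'

sample_keys = ['SNG01856.json', 'SNG0129.json', 'PMUL1635.json', 'MUL2168.json',
               'SNG0073.json', 'SNG01445.json', 'MUL2105.json']

kind_counts = Counter(dialogue_kind(key) for key in sample_keys)
print(kind_counts)
```

Note that the substring test also matches the "PMUL" prefix, which is why `PMUL1635.json` is counted as multi-domain here.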


In [10]:
# get keys of the fifth file in the list above
print(DIALOGUE_DB['SNG0073.json'].keys())

dict_keys(['goal', 'log'])


Here `goal` points to a dictionary describing the objectives of the conversation, while `log` is a list of turns; the utterance of each turn is stored under its `text` key.

In [11]:
DIALOGUE_DB['SNG0073.json']['goal']

{'attraction': {},
 'hospital': {},
 'hotel': {},
 'message': ["You want to book a <span class='emphasis'>taxi</span>. The taxi should go to <span class='emphasis'>pizza hut fen ditton</span> and should depart from <span class='emphasis'>saint john's college</span>",
  "The taxi should <span class='emphasis'>leave after 17:15</span>",
  "Make sure you get <span class='emphasis'>car type</span> and <span class='emphasis'>contact number</span>"],
 'police': {},
 'restaurant': {},
 'taxi': {'fail_info': {},
  'info': {'departure': "saint john's college",
   'destination': 'pizza hut fen ditton',
   'leaveAt': '17:15'},
  'reqt': ['car type', 'phone']},
 'train': {}}
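
The `message` entries above embed HTML `<span class='emphasis'>` markup. If plain text is wanted, a minimal sketch along these lines could strip it. The helper `strip_tags` is hypothetical, and the simple regular expression assumes the messages contain only well-formed tags like the ones shown.

```python
import re

def strip_tags(message):
    """Hypothetical helper: drop HTML tags such as <span class='emphasis'>."""
    # Remove anything between '<' and the next '>'; the text in between tags is kept.
    return re.sub(r'<[^>]+>', '', message)

sample_message = "The taxi should <span class='emphasis'>leave after 17:15</span>"
plain_message = strip_tags(sample_message)
print(plain_message)
```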

In [12]:
DIALOGUE_DB['SNG0073.json']['log'][0]

{'metadata': {},
 'text': "I would like a taxi from Saint John's college to Pizza Hut Fen Ditton."}

The conversation goes back and forth between two people.

In [13]:
print(' Person 1: ', DIALOGUE_DB['SNG0073.json']['log'][0]['text'])
print(' Person 2: ',DIALOGUE_DB['SNG0073.json']['log'][1]['text'])

 Person 1:  I would like a taxi from Saint John's college to Pizza Hut Fen Ditton.
 Person 2:  What time do you want to leave and what time do you want to arrive by?


In [14]:
def get_conversation(file, data_db):
    result = ''
    len_msg_log = len(data_db[file]['log'])
    delimiter_1 = ' Person 1: '
    delimiter_2 = ' Person 2: '
    
    logs = data_db[file]['log']
    
    for i in range(len_msg_log):
        cur_log = logs[i]['text']
        
        if i % 2 == 0:
            result += delimiter_1
        else:
            result += delimiter_2
            
        result += cur_log

    return result

In [15]:
file = 'SNG01856.json'
conversation = get_conversation(file, DIALOGUE_DB)

print(conversation)

 Person 1: am looking for a place to to stay that has cheap price range it should be in a type of hotel Person 2: Okay, do you have a specific area you want to stay in? Person 1: no, i just need to make sure it's cheap. oh, and i need parking Person 2: I found 1 cheap hotel for you that includes parking. Do you like me to book it? Person 1: Yes, please. 6 people 3 nights starting on tuesday. Person 2: I am sorry but I wasn't able to book that for you for Tuesday. Is there another day you would like to stay or perhaps a shorter stay? Person 1: how about only 2 nights. Person 2: Booking was successful.
Reference number is : 7GAWK763. Anything else I can do for you? Person 1: No, that will be all. Good bye. Person 2: Thank you for using our services.


Prettifier function using termcolor

In [16]:
def print_conversation(conversation):
    
    delimiter_1 = 'Person 1: '
    delimiter_2 = 'Person 2: '
    
    split_list_d1 = conversation.split(delimiter_1)
    
    for sublist in split_list_d1[1:]:
        split_list_d2 = sublist.split(delimiter_2)
        print(colored(f'Person 1: {split_list_d2[0]}', 'red'))
        
        if len(split_list_d2) > 1:
            print(colored(f'Person 2: {split_list_d2[1]}', 'green'))

            
print_conversation(conversation)

Person 1: am looking for a place to to stay that has cheap price range it should be in a type of hotel
Person 2: Okay, do you have a specific area you want to stay in?
Person 1: no, i just need to make sure it's cheap. oh, and i need parking
Person 2: I found 1 cheap hotel for you that includes parking. Do you like me to book it?
Person 1: Yes, please. 6 people 3 nights starting on tuesday.
Person 2: I am sorry but I wasn't able to book that for you for Tuesday. Is there another day you would like to stay or perhaps a shorter stay?
Person 1: how about only 2 nights.
Person 2: Booking was successful.
Reference number is : 7GAWK763. Anything else I can do for you?
Person 1: No, that will be all. Good bye.
Person 2: Thank you for using our services.


<a name="2"></a>
# 2. Preprocessing

In [17]:
all_files = DIALOGUE_DB.keys()
untokenized_data = []

for file in all_files:
    result = get_conversation(file, DIALOGUE_DB)
    untokenized_data.append(result)

print(untokenized_data[0])

 Person 1: am looking for a place to to stay that has cheap price range it should be in a type of hotel Person 2: Okay, do you have a specific area you want to stay in? Person 1: no, i just need to make sure it's cheap. oh, and i need parking Person 2: I found 1 cheap hotel for you that includes parking. Do you like me to book it? Person 1: Yes, please. 6 people 3 nights starting on tuesday. Person 2: I am sorry but I wasn't able to book that for you for Tuesday. Is there another day you would like to stay or perhaps a shorter stay? Person 1: how about only 2 nights. Person 2: Booking was successful.
Reference number is : 7GAWK763. Anything else I can do for you? Person 1: No, that will be all. Good bye. Person 2: Thank you for using our services.


Splitting the list into train and eval sets.

In [18]:
random.shuffle(untokenized_data)
cut_off = int(len(untokenized_data) * .05)
train_data, eval_data = untokenized_data[:-cut_off], untokenized_data[-cut_off:]

print(f'number of conversations in the data set: {len(untokenized_data)}')
print(f'number of conversations in train set: {len(train_data)}')
print(f'number of conversations in eval set: {len(eval_data)}')

number of conversations in the data set: 10438
number of conversations in train set: 9917
number of conversations in eval set: 521
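
The split above depends on the global random state, so it changes from run to run. As a sketch of one way to make it repeatable, the hypothetical helper `split_train_eval` below shuffles a copy with a local, seeded generator; the seed value and the toy data are assumptions for illustration only.

```python
import random

def split_train_eval(items, eval_fraction=0.05, seed=0):
    """Hypothetical helper: shuffle a copy of `items` and hold out the tail for eval."""
    shuffled = list(items)
    # A local RNG keeps the split repeatable and leaves the global state untouched.
    random.Random(seed).shuffle(shuffled)
    cut_off = int(len(shuffled) * eval_fraction)
    n_train = len(shuffled) - cut_off
    return shuffled[:n_train], shuffled[n_train:]

# Toy stand-in with the same number of items as the dataset above.
toy_data = [f'conversation {i}' for i in range(10438)]
toy_train, toy_eval = split_train_eval(toy_data)
print(len(toy_train), len(toy_eval))
```

Slicing by `n_train` rather than by `-cut_off` also behaves sensibly when the eval fraction rounds down to zero items.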


<a name="2.1"></a>
## Creating input pipeline

In [19]:
def stream(data):
    while True:
        d = random.choice(data)
        yield (d, d)
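
Because this is a language-modeling setup, each example serves as both input and target, which is why the generator yields `(d, d)`. The sketch below repeats the generator so that it runs on its own and uses `itertools.islice` to take a few items from the otherwise endless stream; the toy conversations are made up for illustration.

```python
import random
from itertools import islice

def stream(data):
    # Same generator as in the cell above: endlessly yields (input, target) pairs.
    while True:
        d = random.choice(data)
        yield (d, d)

toy_conversations = [' Person 1: Hi Person 2: Hello',
                     ' Person 1: Bye Person 2: Goodbye']

# islice takes a finite number of items from the infinite generator.
first_pairs = list(islice(stream(toy_conversations), 3))
for inp, target in first_pairs:
    print(inp == target)
```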

Let's define our data pipeline for tokenizing and batching the data. We also filter out sequences longer than 2048 tokens and bucket the remaining ones by length when batching.

In [31]:
data_pipeline = trax.data.Serial(
    trax.data.Shuffle(),
    trax.data.Tokenize(vocab_dir=VOCAB_DIR, vocab_file=VOCAB_FILE),
    trax.data.FilterByLength(2048),
    trax.data.BucketByLength(boundaries=[128, 256, 512, 1024],
                             batch_sizes=[1024, 512, 256, 128, 64]),
    trax.data.AddLossWeights(id_to_mask=0)
)

train_stream = data_pipeline(stream(train_data))
eval_stream = data_pipeline(stream(eval_data))
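
To illustrate how the bucketing parameters relate to each other, here is a plain-Python sketch. The helper `pick_bucket` is hypothetical, and it rests on an assumption about the bucketing rule: an example goes to the first bucket whose boundary is at least its length, and anything longer than the last boundary goes to the final bucket. That is why `batch_sizes` has one more entry than `boundaries`.

```python
from bisect import bisect_left

BOUNDARIES = [128, 256, 512, 1024]
BATCH_SIZES = [1024, 512, 256, 128, 64]  # one more entry than BOUNDARIES

def pick_bucket(length, boundaries=BOUNDARIES, batch_sizes=BATCH_SIZES):
    """Hypothetical helper: return (bucket index, batch size) for a token length."""
    # Assumption: first bucket whose boundary is >= length; longer examples
    # than the last boundary fall into the final bucket.
    idx = bisect_left(boundaries, length)
    return idx, batch_sizes[idx]

for length in (100, 128, 300, 2000):
    print(length, pick_bucket(length))
```

The design intent is that shorter sequences are grouped into larger batches and longer sequences into smaller ones, which keeps the number of tokens per batch roughly comparable.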

Peek into the train stream.

In [21]:
# the stream generators will yield (input, target, mask_weights).
inp, _, _ = next(train_stream)
print("input shape: ", inp.shape)
print(trax.data.detokenize(inp[0], vocab_dir=VOCAB_DIR, vocab_file=VOCAB_FILE))

input shape:  (64, 512)
 Person 1: I'm looking for places to eat in the North part of town. Person 2: its called city stop restaurant, serves european food and address is Cambridge City Football Club Milton Road Chesterton Person 1: Is it a cheap restaurant? Person 2: No, it is expensive. Person 1: I need a cheap place, please. Person 2: Royal spice is cheap and looks great. Person 1: Thanks, will you please book a table for 6 people on saturday at 12:45? Person 2: Booking was successful. The table will be reserved for 15 minutes.
Reference number is : Q91F26L3. Person 1: Perfect! Thank you for all of your help. Person 2: You're more than welcome. May I do anything else for you today? Person 1: Wait, I might want to change my mind about that restaurant. Are there any that serve food from Corsica? Person 2: No, there are none fitting that description, sir. Sorry about that. Person 1: I figured, my wife asks me odd random question sometimes, had to check.   I am all set, thanks. Person 2
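
The third element of each batch is the loss-weight mask. As a rough sketch of what `AddLossWeights(id_to_mask=0)` is assumed to produce, the hypothetical helper below gives weight 1.0 to real tokens and 0.0 to padding (token id 0), so that padded positions do not contribute to the loss. The toy token ids are invented for illustration.

```python
import numpy as np

def loss_weights(targets, id_to_mask=0):
    """Hypothetical helper: 1.0 for real tokens, 0.0 for the masked (padding) id."""
    return (targets != id_to_mask).astype(np.float32)

# Two toy sequences padded with 0 to a common length of 5.
toy_targets = np.array([[5, 8, 2, 0, 0],
                        [7, 3, 9, 4, 1]])
toy_weights = loss_weights(toy_targets)
print(toy_weights)
```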

<a name="3"></a>
# 3. Model Training

In [22]:
def ReformerLM(vocab_size=33000, n_layers=2, mode='train', attention_type=tl.SelfAttention):
    model = trax.models.reformer.ReformerLM(
        vocab_size=vocab_size,
        n_layers=n_layers,
        mode=mode,
        attention_type=attention_type
    )
    
    return model

In [23]:
temp_model = ReformerLM(mode='train')
print(str(temp_model))

del temp_model 

Serial[
  Serial[
    ShiftRight(1)
  ]
  Embedding_33000_512
  Dropout
  PositionalEncoding
  Dup_out2
  ReversibleSerial_in2_out2[
    ReversibleHalfResidual_in2_out2[
      Serial[
        LayerNorm
      ]
      SelfAttention
    ]
    ReversibleSwap_in2_out2
    ReversibleHalfResidual_in2_out2[
      Serial[
        LayerNorm
        Dense_2048
        Dropout
        Serial[
          FastGelu
        ]
        Dense_512
        Dropout
      ]
    ]
    ReversibleSwap_in2_out2
    ReversibleHalfResidual_in2_out2[
      Serial[
        LayerNorm
      ]
      SelfAttention
    ]
    ReversibleSwap_in2_out2
    ReversibleHalfResidual_in2_out2[
      Serial[
        LayerNorm
        Dense_2048
        Dropout
        Serial[
          FastGelu
        ]
        Dense_512
        Dropout
      ]
    ]
    ReversibleSwap_in2_out2
  ]
  Concatenate_in2
  LayerNorm
  Dropout
  Serial[
    Dense_33000
  ]
]


In [32]:
def training_loop(ReformerLM, train_gen, eval_gen, n_layers=2, output_dir="./model/"):
    lr_schedule = trax.lr.warmup_and_rsqrt_decay(n_warmup_steps=1000, max_value=0.008)
    
    train_task = training.TrainTask(
        labeled_data=train_gen,
        loss_layer=tl.WeightedCategoryCrossEntropy(),
        optimizer=trax.optimizers.Adam(0.01),
        lr_schedule=lr_schedule,
        n_steps_per_checkpoint=50
    )
    
    eval_task = training.EvalTask(
        labeled_data=eval_gen,
        metrics=[tl.WeightedCategoryCrossEntropy(), tl.WeightedCategoryAccuracy()]
    )
    
    loop = training.Loop(model=ReformerLM(n_layers=n_layers),
                         tasks=[train_task],
                         eval_tasks=[eval_task],
                         output_dir=output_dir)
    
    return loop
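
For intuition about the learning-rate schedule, the sketch below implements a generic linear-warmup, inverse-square-root-decay curve with the same `n_warmup_steps` and `max_value` as above. This is an assumed textbook form, not a copy of the Trax implementation, whose exact formula may differ; `warmup_rsqrt_lr` is a hypothetical name.

```python
import math

def warmup_rsqrt_lr(step, n_warmup_steps=1000, max_value=0.008):
    """Illustrative schedule: linear warmup, then inverse-square-root decay.

    Assumed generic form; the exact Trax formula may differ.
    """
    if step < n_warmup_steps:
        # Warmup: grow linearly from near zero up to max_value.
        return max_value * step / n_warmup_steps
    # Decay: shrink proportionally to 1 / sqrt(step).
    return max_value * math.sqrt(n_warmup_steps / step)

for step in (1, 500, 700, 1000, 4000):
    print(step, warmup_rsqrt_lr(step))
```

Under this reading, a run of only a few hundred steps stays inside the warmup phase, so the learning rate is still rising when training stops.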

Training the model

In [33]:
if not LOAD_MODEL:
  # Remove any old checkpoint so that training starts from scratch.
  !rm -f model/model.pkl.gz

loop = training_loop(ReformerLM, train_stream, eval_stream, n_layers=N_LAYERS)

if LOAD_MODEL:
  loop.model.init_from_file('model/model.pkl.gz')

if TRAIN:
  loop.run(TRAIN_STEPS)


Step      1: Total number of trainable weights: 70673640
Step      1: Ran 1 train steps in 110.10 secs
Step      1: train WeightedCategoryCrossEntropy |  10.41821575
Step      1: eval  WeightedCategoryCrossEntropy |  10.40265846
Step      1: eval      WeightedCategoryAccuracy |  0.00001045

Step     50: Ran 49 train steps in 285.97 secs
Step     50: train WeightedCategoryCrossEntropy |  7.41175556
Step     50: eval  WeightedCategoryCrossEntropy |  5.57908297
Step     50: eval      WeightedCategoryAccuracy |  0.06284281

Step    100: Ran 50 train steps in 142.69 secs
Step    100: train WeightedCategoryCrossEntropy |  5.55420208
Step    100: eval  WeightedCategoryCrossEntropy |  5.57832050
Step    100: eval      WeightedCategoryAccuracy |  0.06384477

Step    150: Ran 50 train steps in 137.62 secs
Step    150: train WeightedCategoryCrossEntropy |  5.42753839
Step    150: eval  WeightedCategoryCrossEntropy |  5.08133507
Step    150: eval      WeightedCategoryAccuracy |  0.14760922

Step 

In [34]:
loop.run(100)


Step    550: Ran 50 train steps in 144.24 secs
Step    550: train WeightedCategoryCrossEntropy |  2.26461148
Step    550: eval  WeightedCategoryCrossEntropy |  2.14184332
Step    550: eval      WeightedCategoryAccuracy |  0.53136408

Step    600: Ran 50 train steps in 139.03 secs
Step    600: train WeightedCategoryCrossEntropy |  2.10417056
Step    600: eval  WeightedCategoryCrossEntropy |  2.11950707
Step    600: eval      WeightedCategoryAccuracy |  0.52626514


In [44]:
loop.run(100)


Step    650: Ran 50 train steps in 132.58 secs
Step    650: train WeightedCategoryCrossEntropy |  1.98943090
Step    650: eval  WeightedCategoryCrossEntropy |  1.98838365
Step    650: eval      WeightedCategoryAccuracy |  0.54585487

Step    700: Ran 50 train steps in 139.27 secs
Step    700: train WeightedCategoryCrossEntropy |  1.87878287
Step    700: eval  WeightedCategoryCrossEntropy |  1.88107276
Step    700: eval      WeightedCategoryAccuracy |  0.56508166
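
One way to sanity-check the logged losses, assuming the cross-entropy is measured in nats: a model that guesses uniformly over a 33000-token vocabulary would score `ln(33000)`, which is close to the loss of about 10.4 reported at step 1. The sketch below computes that baseline and converts the step-700 eval loss into a perplexity.

```python
import math

VOCAB_SIZE = 33000  # same value as the vocab_size default used for ReformerLM

# Cross-entropy of a uniform guess over the vocabulary, in nats.
uniform_loss = math.log(VOCAB_SIZE)
print(f'uniform-guess loss: {uniform_loss:.4f}')

# Perplexity is exp(loss); the step-700 eval loss is copied from the log above.
step_700_eval_loss = 1.88107276
step_700_perplexity = math.exp(step_700_eval_loss)
print(f'step-700 eval perplexity: {step_700_perplexity:.2f}')
```

Read this way, the perplexity is the effective number of tokens the model is choosing among at each position, which drops from the full vocabulary at initialization to a single-digit figure after 700 steps.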


<a name="4"></a>
# 4. Testing

In [49]:
def attention(*args, **kwargs):
    # number of input positions to remember in a cache when doing fast inference. 
    kwargs['predict_mem_len'] = 1024
    # number of input elements to drop once the fast inference input cache fills up.
    kwargs['predict_drop_len'] = 128
    # return the attention layer with the parameters defined above
    return tl.SelfAttention(*args, **kwargs)

# Getting the model with new attention for prediction
model = ReformerLM(
    vocab_size=33000,
    n_layers=N_LAYERS,
    mode='predict',
    attention_type=attention,
)

In [50]:
# TRAX needs the model to be initialized with this shape
shape11 = trax.shapes.ShapeDtype((1, 1), dtype=np.int32)
model.init(shape11)

# Loading weights from the trained model
model.weights = loop.eval_model.weights

# saving the starting state for each new dialogue prediction
STARTING_STATE = model.state

In [51]:
str(model) == str(loop.eval_model)

True

Utility functions

In [52]:
def tokenize(sentence, vocab_file, vocab_dir):
    return list(trax.data.tokenize(iter([sentence]), vocab_file=vocab_file, vocab_dir=vocab_dir))[0]

def detokenize(tokens, vocab_file, vocab_dir):
    return trax.data.detokenize(tokens, vocab_file=vocab_file, vocab_dir=vocab_dir)

In [53]:
def ReformerLM_output_gen(ReformerLM, start_sentence, vocab_file, vocab_dir, temperature):
    input_tokens = tokenize(start_sentence, vocab_file, vocab_dir)
    input_tokens_with_batch = input_tokens[None]
    
    # Using the autoregressive_sample_stream function from trax
    output_gen = trax.supervised.decoding.autoregressive_sample_stream( 
        model=ReformerLM,
        inputs=input_tokens_with_batch,
        temperature=temperature
    )
    
    return output_gen
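
The `temperature` argument controls how random the sampling is. The sketch below shows the common softmax-with-temperature view of this idea; it is illustrative only, since the sampling routine inside `autoregressive_sample_stream` is not shown in this notebook and may be implemented differently. The function name and the toy logits are assumptions.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Illustrative only: turn logits into probabilities at a given temperature."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

toy_logits = [2.0, 1.0, 0.5]
probs_t1 = softmax_with_temperature(toy_logits, 1.0)
probs_t02 = softmax_with_temperature(toy_logits, 0.2)
print(probs_t1.round(3))
print(probs_t02.round(3))
```

Lower temperatures concentrate probability on the most likely token, so sampling becomes close to greedy decoding; higher temperatures spread it out and give more varied output.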

In [57]:
def generate_dialogue(ReformerLM, model_state, start_sentence, vocab_file, vocab_dir, max_len, temperature):
    delimiter_1 = 'Person 1: ' 
    delimiter_2 = 'Person 2: '
    sentence = ''
    counter = 0
    
    result = [tokenize(': ', vocab_file=vocab_file, vocab_dir=vocab_dir)]
    
    ReformerLM.state = model_state
    
    # Use the vocab_file / vocab_dir arguments rather than the global constants.
    output = ReformerLM_output_gen(ReformerLM, start_sentence, vocab_file=vocab_file, vocab_dir=vocab_dir, temperature=temperature)
    
    print(colored(start_sentence.split(delimiter_2)[0].strip(), 'green'))
    
    for o in output:
        
        result.append(o)
        
        sentence = detokenize(np.concatenate(result, axis=0), vocab_file=vocab_file, vocab_dir=vocab_dir)
        
        if sentence.endswith(delimiter_1):
            sentence = sentence.split(delimiter_1)[0]
            print(colored(f'{delimiter_2}{sentence}', 'red'))
            sentence = ''
            result.clear()
        
        elif sentence.endswith(delimiter_2):
            sentence = sentence.split(delimiter_2)[0]
            print(colored(f'{delimiter_1}{sentence}', 'green'))
            sentence = ''
            result.clear()

        counter += 1
        
        if counter > max_len:
            break    



In [58]:
sample_sentence = ' Person 1: Are there theatres in town? Person 2: '
generate_dialogue(ReformerLM=model, model_state=STARTING_STATE, start_sentence=sample_sentence, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR, max_len=120, temperature=0.2)

Person 1: Are there theatres in town?
Person 2: : There are 13 attractions in the centre of town. Do you have a preference?
Person 1: I'd like to go to go to go to go to go to go to go to go to go to go to go to go.
Person 2: There are many options for you. Is there a specific area you would like to visit?
Person 1: I'd like to go to go to go to go to the theatre.
Person 2: There are many museums in the centre. Do you have a preference?


In [59]:
sample_sentence = ' Person 1: Is there a hospital nearby? Person 2: '
generate_dialogue(ReformerLM=model, model_state=STARTING_STATE, start_sentence=sample_sentence, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR, max_len=120, temperature=0.2)

Person 1: Is there a hospital nearby?
Person 2: : The address is Hills Rd, Cambridge, Cambridge, Cambridge, Cambridge, Cambridge, Cambridge, CB20QQ. Is there anything else I can help you with?
Person 1: No, I'm looking for a train leaving on Saturday.
Person 2: There are 202 trains leaving on Friday on Friday. Where will you be departing from?
Person 1: I'm leaving from Cambridge and going to Cambridge on Monday.
Person 2: There are 202 trains leaving at 05:17. Would you like me to book a ticket?


In [60]:
sample_sentence = ' Person 1: Can you book a taxi? Person 2: '
generate_dialogue(ReformerLM=model, model_state=STARTING_STATE, start_sentence=sample_sentence, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR, max_len=120, temperature=0.2)

Person 1: Can you book a taxi?
Person 2: : Sure! Where are you departing from?
Person 1: I'm going to go to Cambridge.
Person 2: I can help narrow down with that. Where are you departing from?
Person 1: I'd like to leave after 15:15.
Person 2: I have a yellow volkswagen. The contact number is 076749756.
Person 1: Thank you so much for your help.
