# **Chatbot - NLP and Deep Learning**

For this project, we will use the PyTorch library.

---

## 1. Theory and NLP concepts

This section covers tokenization, stemming, and the bag-of-words representation.

First, we collect every word from every pattern into a single array (the "all words" array).

- **Bag of words:** For each pattern, we create an array with the same size as the all-words array. For each word of the all-words array, we put a 1 at its position if the word appears in the pattern, and a 0 otherwise.
- **Tokenization:** Splitting a string into meaningful units (e.g. words, punctuation characters, numbers).
- **Stemming:** Generating the root form of a word. It is a crude heuristic that chops the ends off words.
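
As a small worked example of the bag-of-words idea, the sketch below uses plain Python lists and the same toy sentence and vocabulary as the docstring of `bag_of_words` in the next section. The helper name `bag_of_words_sketch` is made up for this illustration, and no stemming is applied here.

```python
def bag_of_words_sketch(pattern_tokens, all_words):
    """Return a 0/1 list with one slot per entry of all_words."""
    present = set(pattern_tokens)
    return [1 if word in present else 0 for word in all_words]


pattern = ["hello", "how", "are", "you"]
all_words = ["hi", "hello", "I", "you", "bye", "thank", "cool"]

bag = bag_of_words_sketch(pattern, all_words)
print(bag)  # [0, 1, 0, 1, 0, 0, 0]
```

Only "hello" and "you" occur in both lists, so only their positions are set to 1.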

### **Whole NLP pre-processing pipeline**

We start with the whole sentence and tokenize it. We then lowercase and stem each token, and drop the punctuation characters. Finally, we compute the bag of words from the resulting array.
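
The pipeline can be sketched end to end with the standard library only. This is an illustration under assumptions, not the implementation used in this project: `simple_tokenize` is a regex stand-in for a real tokenizer, `naive_stem` is a deliberately crude stand-in for a real stemmer, and the vocabulary is made up.

```python
import re

PUNCTUATION = {"?", "!", ".", ","}


def simple_tokenize(sentence):
    # Runs of letters, digits and apostrophes, or single punctuation characters.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", sentence)


def naive_stem(word):
    # Crude stand-in for a real stemmer: lowercase, then strip one common suffix.
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(sentence):
    tokens = simple_tokenize(sentence)                   # 1. tokenize
    stems = [naive_stem(t) for t in tokens]              # 2. lowercase and stem
    return [s for s in stems if s not in PUNCTUATION]    # 3. drop punctuation


vocabulary = ["hello", "help", "order", "thank", "you"]  # hypothetical all-words array
tokens = preprocess("Hello, thanks for helping!")
bag = [1 if w in tokens else 0 for w in vocabulary]      # 4. bag of words

print(tokens)  # ['hello', 'thank', 'for', 'help']
print(bag)     # [1, 1, 0, 1, 0]
```

Note that "for" survives pre-processing but is not in the vocabulary, so it has no slot in the bag.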

---

## 2. Create training data

We build the training data with the help of NLTK (Natural Language Toolkit), a free natural language processing library that provides the tokenizer and the stemmer.

In [1]:
import nltk

# Download a package from nltk
nltk.download('punkt')   # package with a pre-trained tokenizer
# Note: recent NLTK releases may require nltk.download('punkt_tab') instead

# Stemming: reduce a word to its root form
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

import numpy as np

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader


# Tokenize: split a sentence into a list of words
def tokenize(sentence):
    return nltk.word_tokenize(sentence)

def stem(word):
    return stemmer.stem(word.lower())

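def bag_of_words(tokenized_sentence, all_words):
    """
    Return a bag-of-words vector: 1.0 for each known word present in the sentence, else 0.0.

    Example:
    tokenized_sentence = ["hello", "how", "are", "you"]
    all_words = ["hi", "hello", "I", "you", "bye", "thank", "cool"]
    bag       = [  0,      1,      0,    1,      0,      0,       0]
    """
    # Stem each word of the sentence; a set makes the membership test fast
    sentence_words = {stem(w) for w in tokenized_sentence}

    # Start from all zeros and set 1.0 where a vocabulary word occurs in the sentence
    bag = np.zeros(len(all_words), dtype=np.float32)
    for idx, w in enumerate(all_words):
        if w in sentence_words:
            bag[idx] = 1.0

    return bag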

In [None]:
import json

with open('intents.json', 'r') as f:
    intents = json.load(f)

all_words = []
tags = []
xy = []

# Loop through each sentence in our intents patterns
for intent in intents['intents']:   #key: intents, value: list of intents
    tag = intent['tag']             #key: tag, value: intent
    tags.append(tag)
    # Loop through each pattern in the patterns
    for pattern in intent['patterns']:
        # Tokenize each word in the sentence
        w = tokenize(pattern)
        # Add to our words list (not append, because we don't want a list of lists)
        all_words.extend(w)
        # Add to xy pair
        # pattern and tag for each pattern
        xy.append((w, tag))

# Stem and lowercase each word, skipping punctuation
ignore_words = ['?', '!', '.', ',']
all_words = [stem(w) for w in all_words if w not in ignore_words]

# Sort all words and remove duplicates
all_words = sorted(set(all_words))
# Sort tags and remove duplicates
tags = sorted(set(tags))

# Create training data
X_train = [] # bag of words for each pattern
y_train = [] # class index (position of the tag in tags) for each pattern

for (pattern_sentence, tag) in xy:
    # X: bag of words for each pattern
    bag = bag_of_words(pattern_sentence, all_words)
    X_train.append(bag)
    # y: class index of the tag (CrossEntropyLoss takes class indices, so no one-hot encoding is needed)
    label = tags.index(tag)
    y_train.append(label)

# Convert to numpy arrays
X_train = np.array(X_train)
y_train = np.array(y_train, dtype=np.int64)  # CrossEntropyLoss needs int64 (long) class indices
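
To see why the labels are stored as plain class indices, here is a sketch in pure Python of what a cross-entropy loss over class indices computes: a log-softmax over the raw scores, followed by the negative log-likelihood of the target class. It is an illustration only, not PyTorch's implementation; the tags and scores are made up.

```python
import math


def cross_entropy(logits, target_index):
    # Numerically stable log-sum-exp, then negative log-likelihood of the target.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


tags = sorted({"greeting", "goodbye", "thanks"})  # hypothetical tags
label = tags.index("greeting")                    # class index, as in y_train
logits = [0.2, 2.5, -1.0]                         # made-up raw scores, one per tag

loss = cross_entropy(logits, label)
print(label, loss)
```

The loss is small when the score at the target index is the largest one, and grows as the model favours other classes.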

In [None]:
class ChatDataset(Dataset):
    def __init__(self):
        self.n_samples = len(X_train)
        self.x_data = X_train
        self.y_data = y_train

    # Dataset[idx]
    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]
    
    # len(Dataset)
    def __len__(self):
        return self.n_samples
    
# Hyperparameters
batch_size = 8
# hidden_size = 8
# output_size = len(tags)
# input_size = len(X_train[0])
# learning_rate = 0.001
# num_epochs = 1000

dataset = ChatDataset()
# Data loader which takes the dataset, shuffles it, and creates batches
# (if the worker processes fail to start in a notebook, e.g. on Windows, try num_workers=0)
train_loader = DataLoader(dataset=dataset, batch_size=batch_size, shuffle=True, num_workers=2)

# # Neural network
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
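
As a rough picture of what shuffling and batching mean here, the sketch below does it by hand in pure Python. It is not how `DataLoader` is implemented (which also collates samples into tensors and can use worker processes); `iterate_batches` and the sample data are made up for this illustration.

```python
import random


def iterate_batches(samples, batch_size, shuffle=True, seed=None):
    # Shuffle the indices, then cut them into consecutive chunks of batch_size.
    indices = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]


# Made-up (bag_of_words, label) pairs standing in for the dataset
samples = [([0.0, 1.0], i % 3) for i in range(20)]

batches = list(iterate_batches(samples, batch_size=8, seed=0))
print([len(b) for b in batches])  # [8, 8, 4]
```

With 20 samples and a batch size of 8, the last batch holds the 4 remaining samples.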


---

## 3. PyTorch model and training

---

## 4. Save and load model and implement the chat