# Testing Multi-Class Classification

The models I have tried so far for the chatbot have struggled with the input dataset, because it is very small and there are many different ways to write the same sentence. Accuracy has been very low as a result, and it does not help that there are about 50 different categories (classes) and only about 350 entries. I have already tested Sentence Similarity as well, with interesting results. A major drawback of sentence similarity is that it would require a model for each class and would likely still not be the most accurate. I am going to test Multi-Class Classification with a well-made pre-built model to hopefully achieve a higher accuracy.

## The Model
For this I found the model distilbert-base-uncased. BERT is a very well-known and well-made model, and DistilBERT is a smaller, distilled version of it. I am hoping that, despite my limited data and high number of classes, it will be able to achieve a higher accuracy than previous models.

This model can be found here:

https://huggingface.co/distilbert-base-uncased?text=The+goal+of+life+is+%5BMASK%5D.

I relied heavily on a guide to do this in TensorFlow rather than PyTorch, since TensorFlow is what I wanted to use for this project.

This guide can be found here:

https://www.sunnyville.ai/fine-tuning-distilbert-multi-class-text-classification-using-transformers-and-tensorflow/



## Prepping Our Libraries/Classes

In [None]:
# Installing the latest version of transformers
!pip install git+https://github.com/huggingface/transformers.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-hflp2hep
  Running command git clone -q https://github.com/huggingface/transformers.git /tmp/pip-req-build-hflp2hep
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
     |████████████████████████████████| 163 kB 5.1 MB/s 
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
     |████████████████████████████████| 7.6 MB 49.8 MB/s 
Building wheels for collected packages: transformers
  Building wheel for transformers (PEP 517) ...

In [None]:
# Importing our needed libraries
from transformers import DistilBertTokenizer
from transformers import TFDistilBertForSequenceClassification
import tensorflow as tf
import pandas as pd

## Data Processing
Here we will load my inputs and category dataset from GitHub.

This dataset can be found here:

https://github.com/BridgetteBXP13/CS-4395.001---Human-Language-Technologies/blob/main/Chatbot/Data/Inputs.csv

In [None]:
# Import our data
url = 'https://raw.githubusercontent.com/BridgetteBXP13/CS-4395.001---Human-Language-Technologies/main/Chatbot/Data/Inputs.csv'
df = pd.read_csv(url)
print("\nOur loaded dataframe:\n")
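# Note: without parentheses, df.head is the bound method itself, so the output
# shows its repr (the whole frame); df.head() would show only the first five rows
df.head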


Our loaded dataframe:



<bound method NDFrame.head of                                                  Input        Category
0                                Snakes are aggressive        Behavior
1                                 Are they aggressive         Behavior
2    Snakes will not bite unless you try to approac...        Behavior
3                              Do snakes like to bite?        Behavior
4                                  Snakes chase people        Behavior
..                                                 ...             ...
325                             Snakes are emotionless           Brain
326                               Can snakes grow hair            Body
327                                    Snakes are mean        Behavior
328                            Why should snakes exist  Snake Benefits
329                                Are snakes any good  Snake Benefits

[330 rows x 2 columns]>

In [None]:
# Display our categories and their counts
print("\nNumber of Categories:", len(df.Category.unique()))
print("\nOur Categories:")
print(df.Category.unique())
print("\nCounts of each Category:")
print(df.Category.value_counts())


Number of Categories: 51

Our Categories:
['Behavior' 'Body' 'Brain' 'Bye' 'Creators' 'Crossbreeding'
 'Dangerous Snakes' 'Deaf' 'Definition' 'Diamond' 'Dislocate Jaws' 'Eat'
 'Endangered' 'Escape' 'Evil' 'Eyesight' 'Fear' 'Flying snakes' 'Generic'
 'Greeting' 'Infared' 'Kill Snakes' 'Lay Eggs' 'Legless Lizard' 'Legs'
 'Live Forever' 'Lizards Discourage' 'Misunderstand' 'Mother Snake'
 'Music' 'Musk' 'Name' 'Pairs' 'Pet Snakes' 'Poisonous' 'Pupils' 'Purpose'
 'Rattle' 'Scared' 'Size' 'Slimy' 'Smell' 'Snake Attraction'
 'Snake Benefits' 'Snake Bite' 'Suffication' 'Swim' 'Tails' 'Topics'
 'Understand' 'Venomous']

Counts of each Category:
Pet Snakes            17
Body                  16
Behavior              12
Eat                   11
Understand            10
Eyesight              10
Venomous              10
Pairs                  8
Brain                  8
Misunderstand          8
Generic                8
Legs                   8
Snake Benefits         8
Kill Snakes            7
Disl

In [None]:
# Adding a column of category codes (cat codes)
df['encoded_cat'] = df.Category.astype('category').cat.codes
print("\nOur new dataframe with cat codes:\n")
df.head()


Our new dataframe with cat codes:



'Greeting'

In [None]:
# Converting Our Data to Lists
data_input_texts = df.Input.to_list()
data_cats = df.encoded_cat.to_list()
print("\nThe first five of our input text list:")
print(data_input_texts[:5])
print("\nThe first five cat codes for the input text list:")
print(data_cats[:5])


The first five of our input text list:
['Snakes are aggressive', 'Are they aggressive ', 'Snakes will not bite unless you try to approach/handle them.', 'Do snakes like to bite?', 'Snakes chase people']

The first five cat codes for the input text list:
[0, 0, 0, 0, 0]
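
The cat codes above come from pandas, which, for plain string labels, numbers the unique category names in sorted order. The sketch below mimics that idea with the standard library only, so that a code-to-name lookup can be built once instead of being typed out by hand. The short `labels` list is made-up sample data, and `name_to_code` / `code_to_name` are illustrative names that do not appear in the notebook.

```python
# Made-up sample labels (not the real dataset)
labels = ['Behavior', 'Body', 'Brain', 'Behavior', 'Venomous']

# Sorted unique names play the role of the pandas categories
categories = sorted(set(labels))

# Forward and reverse lookups between names and integer codes
name_to_code = {name: code for code, name in enumerate(categories)}
code_to_name = dict(enumerate(categories))

# Encode every label, then decode one code back to its name
encoded = [name_to_code[label] for label in labels]
print(encoded)
print(code_to_name[encoded[-1]])
```

In the notebook itself, something like `dict(enumerate(df.Category.astype('category').cat.categories))` should give the same kind of lookup directly from the dataframe.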


In [None]:
# Splitting Into Train, Test, & Validation
from sklearn.model_selection import train_test_split

# Split Train & Validate
train_inputs, val_inputs, train_cats, val_cats = train_test_split(data_input_texts, data_cats, test_size=0.2, random_state=1234)
# Split Train & Test
train_inputs, test_inputs, train_cats, test_cats = train_test_split(train_inputs, train_cats, test_size=.1, random_state=1234)
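
To see roughly how many rows end up in each split, here is a small standard-library sketch of the arithmetic. It assumes that scikit-learn rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the training side, and it uses the 330 rows reported when the dataframe was loaded. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test side up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# First split: hold out 20% of the 330 rows for validation
n_rest, n_val = split_sizes(330, 0.2)

# Second split: hold out 10% of what is left for testing
n_train, n_test = split_sizes(n_rest, 0.1)

print(n_train, n_val, n_test)  # 237 66 27
```

With about 51 categories, 237 training rows works out to fewer than five examples per category on average, which is worth keeping in mind when reading the accuracy figures.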

## Tensorflow

### Creating Model and Prepping Data

In [None]:
# Tokenizing and Encoding our input text
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(train_inputs, truncation=True, padding=True)
val_encodings = tokenizer(val_inputs, truncation=True, padding=True)

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]
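
The `truncation=True, padding=True` arguments make every sequence in a batch the same length. The sketch below shows that idea on plain Python lists: cut sequences that are too long, pad the rest up to the longest one, and record which positions are real tokens in an attention mask. The token ids are made up, `pad_batch` is a hypothetical helper rather than a tokenizer method, and the pad id of 0 and maximum length of 512 are assumptions about this checkpoint.

```python
def pad_batch(token_id_lists, pad_id=0, max_length=512):
    """Truncate to max_length, then pad every sequence to the longest one."""
    truncated = [ids[:max_length] for ids in token_id_lists]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up token id sequences of different lengths
batch = pad_batch([[101, 7, 102], [101, 7, 8, 9, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the training and validation texts are tokenized in separate calls, each set is padded to its own longest sentence.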

In [None]:
# Creating our Tensorflow Dataset Objects
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_cats
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_cats
))

### Fine Tuning the Model

In [None]:
# Building & Compiling the Model
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=51)

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.hf_compute_loss, metrics=['accuracy'])

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['activation_13', 'vocab_layer_norm', 'vocab_transform', 'vocab_projector']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'dropout_99', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

AttributeError: ignored
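
To get a feel for the numbers this loss produces, here is a standard-library sketch of sparse categorical cross-entropy computed from raw logits. That this is the loss the model applies for a multi-class head is my understanding rather than something shown in this notebook, so treat it as an assumption. `sparse_softmax_cross_entropy` is a hypothetical helper name.

```python
import math

def sparse_softmax_cross_entropy(logits, label):
    """Cross-entropy for one example, given raw logits and an integer class label."""
    # Subtract the max before exponentiating for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

# An untrained 51-class head that scores every class equally
uniform_loss = sparse_softmax_cross_entropy([0.0] * 51, label=3)
print(round(uniform_loss, 2))  # 3.93, i.e. ln(51)

# A confident, correct prediction gives a loss close to zero
confident_logits = [0.0] * 51
confident_logits[3] = 20.0
confident_loss = sparse_softmax_cross_entropy(confident_logits, label=3)
print(confident_loss < 0.001)  # True
```

Under that assumption, a loss near 3.9 in the first epoch would simply mean the new classification layer is still guessing uniformly across the 51 classes.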

In [None]:
# Training & Fine Tuning the Model
# The datasets are already batched, so batch_size is not passed to fit() as well
model.fit(train_dataset.shuffle(1000).batch(32), epochs=100,
          validation_data=val_dataset.shuffle(1000).batch(32))

Epoch 1/100
...
Epoch 78

<keras.callbacks.History at 0x7fe833689c90>

In [None]:
# Save our Model
model.save_pretrained("models/")
tokenizer.save_pretrained("tokenizers/")


('tokenizers/tokenizer_config.json',
 'tokenizers/special_tokens_map.json',
 'tokenizers/vocab.txt',
 'tokenizers/added_tokens.json')

In [None]:
# Loading (if needed)
loaded_tokenizer = DistilBertTokenizer.from_pretrained("tokenizers/")
loaded_model = TFDistilBertForSequenceClassification.from_pretrained("models/")

Some layers from the model checkpoint at models/ were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at models/ and are newly initialized: ['dropout_79']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Making Predictions

In [None]:
# Remember our test_inputs from earlier!
# Category names in cat-code order (the sorted order pandas assigned earlier)
cat_names = [
    'Behavior', 'Body', 'Brain', 'Bye', 'Creators', 'Crossbreeding',
    'Dangerous Snakes', 'Deaf', 'Definition', 'Diamond', 'Dislocate Jaws',
    'Eat', 'Endangered', 'Escape', 'Evil', 'Eyesight', 'Fear',
    'Flying Snakes', 'Generic', 'Greeting', 'Infared', 'Kill Snakes',
    'Lay Eggs', 'Legless Lizard', 'Legs', 'Live Forever',
    'Lizards Discourage', 'Misunderstand', 'Mother Snake', 'Music', 'Musk',
    'Name', 'Pairs', 'Pet Snakes', 'Poisonous', 'Pupils', 'Purpose',
    'Rattle', 'Scared', 'Size', 'Slimy', 'Smell', 'Snake Attraction',
    'Snake Benefits', 'Snake Bite', 'Suffication', 'Swim', 'Tails',
    'Topics', 'Understand', 'Venomous',
]

# We will loop through each test input and print it along with the predicted category
for test_input in test_inputs:
  predict_input = loaded_tokenizer.encode(test_input,
                                 truncation=True,
                                 padding=True,
                                 return_tensors="tf")

  print("\nInput: ", test_input)
  # The first element of the model output holds the logits, one score per category
  output = loaded_model(predict_input)[0]

  # The index of the largest logit is the predicted cat code
  prediction_cat_code = tf.argmax(output, axis=1).numpy()[0]

  # Look up the category name for that cat code
  pred_cat = cat_names[prediction_cat_code]

  print("Predicted Category: " + str(prediction_cat_code) + ": " + pred_cat)

# This count was tallied by hand from the printed predictions
print("Our Model got 20/27 Correct, that gives us an accuracy of 74% on our test data!")
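
The accuracy figure in the final print statement was counted by hand. As a sketch of how the same number could be computed, the helper below compares predicted cat codes with the true ones. The two short lists are made-up toy values, not the real predictions, and `accuracy` is a hypothetical helper name. In the notebook, the predicted codes would need to be collected inside the loop and compared against `test_cats`.

```python
def accuracy(predicted_codes, true_codes):
    """Return (correct, total, fraction correct) for two equal-length lists of codes."""
    if len(predicted_codes) != len(true_codes):
        raise ValueError("predicted_codes and true_codes must have the same length")
    correct = sum(p == t for p, t in zip(predicted_codes, true_codes))
    total = len(true_codes)
    return correct, total, correct / total

# Toy example: three of the four predictions match
correct, total, fraction = accuracy([0, 1, 2, 2], [0, 1, 1, 2])
print(f"{correct}/{total} correct ({fraction:.0%})")  # 3/4 correct (75%)

# The hand-counted result from this notebook, expressed the same way
reported = f"{20 / 27:.0%}"
print(reported)  # 74%
```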


Input:  Snakes can sting with their tails
Predicted Category: 47: Tails

Input:  Lizards without legs are snakes
Predicted Category: 23: Legless Lizard

Input:  Can snakes give live birth?
Predicted Category: 28: Mother Snake

Input:  Should we kill snakes
Predicted Category: 21: Kill Snakes

Input:  I see
Predicted Category: 3: Bye

Input:  Name?
Predicted Category: 31: Name

Input:  Snakes cannot strike underwater?
Predicted Category: 17: Flying Snakes

Input:  is it true that when rattlesnakes are babies that they let out all or most of their venom on their prey?
Predicted Category: 37: Rattle

Input:  Do snakes ever blink
Predicted Category: 15: Eyesight

Input:  Are there flying snakes
Predicted Category: 1: Body

Input:  Lizards have legs and snakes don't
Predicted Category: 24: Legs

Input:  Snakes don't have ears
Predicted Category: 1: Body

Input:  Why should snakes exist
Predicted Category: 43: Snake Benefits

Input:  They stink
Predicted Category: 30: Musk

Input:  Snakes d