***Natural Language Processing Model for Healthcare Assistant Chat Bot - Multiclass Classification***

This notebook contains a model that predicts medical condition categories from user-entered statements. The outline below shows the basic structure of the code and guides the reader through the stages of machine learning. Enjoy!

- Stage 1, Pre-processing (Exploration, Cleaning, Preprocessing, Embedding, Merging)
- Stage 2, Training (Defining Target Variables and Input Values(X and Y), Splitting Training and Testing Samples, Identification of Classes and Parameters, Fitting model architecture)
- Stage 3, Testing and Evaluation (Model Accuracy, Epoch statistics)
- Stage 4, Optimisation (Feature Engineering, Hyperparameter Tuning)
- Stage 5, Saving and Loading (Model Completion, Packaging)
- Stage 6, Inference Testing (Testing new samples and input data structure)

This structure is loose: we return to earlier stages to optimise the dataset and the model.

**Stage 1 - Exploration**

In [1]:
import pandas as pd
df = pd.read_csv("herodataset.csv", encoding="ISO-8859-1")
data = df.sample(n=1000, random_state=42) 
print(data.head())

                                                question  \
10547  What are the genetic changes related to leukoe...   
15995         What to do for Primary Biliary Cirrhosis ?   
15841           Who is at risk for Fecal Incontinence? ?   
8994   What is (are) Pervasive Developmental Disorders ?   
2518           What are the symptoms of Crome syndrome ?   

                                                  answer source  \
10547  LBSL is caused by mutations in the DARS2 gene,...    GHR   
15995  A healthy diet is important in all stages of c...  NIDDK   
15841  Nearly 18 million U.S. adultsabout one in 12ha...  NIDDK   
8994   The diagnostic category of pervasive developme...  NINDS   
2518   What are the signs and symptoms of Crome syndr...   GARD   

                                              focus_area  
10547  leukoencephalopathy with brainstem and spinal ...  
15995                          Primary Biliary Cirrhosis  
15841                                 Fecal Incontinence  


We import pandas and load the dataset, then print the first five rows of a 1,000-row sample to check that the import worked.

In [2]:
data.describe()

| | question | answer | source | focus_area |
|---|---|---|---|---|
| count | 1000 | 1000 | 1000 | 997 |
| unique | 986 | 973 | 9 | 854 |
| top | What are the treatments for Colorectal Cancer ? | This condition is inherited in an autosomal re... | GARD | Colorectal Cancer |
| freq | 3 | 22 | 337 | 6 |



We explore the sample here. Because the full dataset is large, it is easier to test with 1,000 samples first.

There are 854 unique focus areas. Since the focus area is the y label the model predicts, we will need to reduce this number.

Colorectal Cancer is the most frequent focus area in this sample, which suggests that cancer-related conditions may be over-represented and could bias the model. In future feature engineering we should consider rebalancing the classes.
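
A quick way to quantify the imbalance noted above is to count how often each focus area appears. The sketch below uses a small made-up list (the values are illustrative, not taken from `herodataset.csv`) and only the standard library; on the real dataframe, `data['focus_area'].value_counts()` should give the same kind of summary.

```python
from collections import Counter

# Hypothetical focus areas standing in for data['focus_area']
focus_areas = [
    "Colorectal Cancer", "Colorectal Cancer", "Colorectal Cancer",
    "Fecal Incontinence", "Crome syndrome", "Breast Cancer",
]

counts = Counter(focus_areas)
total = len(focus_areas)

# Share of the sample taken by each focus area, most frequent first
for area, n in counts.most_common():
    print(f"{area}: {n} ({n / total:.0%})")

# Number of distinct classes the model would have to predict
n_classes = len(counts)
print("unique focus areas:", n_classes)
```

If one class takes a much larger share than the rest, that is a hint to rebalance before training.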

**Stage 1 - Cleaning**

In [4]:
duplicate_answers = data[data.duplicated(subset='answer', keep=False)]
print("\nDuplicate Answers:")
print(duplicate_answers['answer'].value_counts())


Duplicate Answers:
answer
This condition is inherited in an autosomal recessive pattern, which means both copies of the gene in each cell have mutations. The parents of an individual with an autosomal recessive condition each carry one copy of the mutated gene, but they typically do not show signs and symptoms of the condition.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              

Several answers are repeated within our dataset. The focus area may contain duplicates, because multiple answers can fall into the same focus area, but the answers themselves should all be unique.

In [5]:
data_unique = data.drop_duplicates()
print(data_unique)

                                                question  \
10547  What are the genetic changes related to leukoe...   
15995         What to do for Primary Biliary Cirrhosis ?   
15841           Who is at risk for Fecal Incontinence? ?   
8994   What is (are) Pervasive Developmental Disorders ?   
2518           What are the symptoms of Crome syndrome ?   
...                                                  ...   
6702      What is (are) Pseudohypoaldosteronism type 2 ?   
10291  How many people are affected by ataxia with vi...   
6829                 What is (are) Trisomy 2 mosaicism ?   
5467   What are the symptoms of Familial erythema nod...   
5952   What causes Primary melanoma of the central ne...   

                                                  answer source  \
10547  LBSL is caused by mutations in the DARS2 gene,...    GHR   
15995  A healthy diet is important in all stages of c...  NIDDK   
15841  Nearly 18 million U.S. adultsabout one in 12ha...  NIDDK   
8994   The 

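We remove the duplicates. Note that `drop_duplicates()` called without a `subset` only drops rows that are identical in every column; passing `subset='answer'` would drop repeated answers specifically.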
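
To make the behaviour of `drop_duplicates` concrete, here is a minimal sketch on a toy dataframe (the rows are invented for illustration). It compares the default call with one restricted to the `answer` column.

```python
import pandas as pd

# Toy frame: two rows share an answer but differ in the question
toy = pd.DataFrame({
    "question": ["Is A inherited ?", "Is B inherited ?", "What is C ?"],
    "answer": ["Autosomal recessive.", "Autosomal recessive.", "C is rare."],
})

# No subset: rows must match in every column, so nothing is dropped
all_cols = toy.drop_duplicates()

# subset='answer': keeps only the first row for each distinct answer
by_answer = toy.drop_duplicates(subset="answer")

print(len(all_cols), len(by_answer))
```

Whether repeated answers should really be dropped is a modelling choice, since the same generic answer can legitimately belong to several focus areas.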

In [6]:
data = data.dropna(subset=['focus_area'])

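We also remove the NaN values in focus_area. Note that the next cell rebuilds `data` from `data_unique`, so this step is overwritten there; the rows with a missing focus area appear to be removed later, in In [10], when rows without a mapped label are dropped.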

In [7]:
#drop rows containing #NAME?
data = data_unique[data_unique.apply(lambda row: "#NAME?" not in row.values, axis=1)]

#drop the source and question columns
data = data.drop(columns=['source', 'question'])


print(data)

                                                  answer  \
10547  LBSL is caused by mutations in the DARS2 gene,...   
15995  A healthy diet is important in all stages of c...   
15841  Nearly 18 million U.S. adultsabout one in 12ha...   
8994   The diagnostic category of pervasive developme...   
2518   What are the signs and symptoms of Crome syndr...   
...                                                  ...   
6702   Psuedohypoaldosteronism type 2 is an inborn er...   
10291  Ataxia with vitamin E deficiency is a rare con...   
6829   Trisomy 2 mosaicism is a rare chromosome condi...   
5467   What are the signs and symptoms of Familial er...   
5952   What causes primary melanoma of the central ne...   

                                              focus_area  
10547  leukoencephalopathy with brainstem and spinal ...  
15995                          Primary Biliary Cirrhosis  
15841                                 Fecal Incontinence  
8994                   Pervasive Developmen

We drop all rows that contain #NAME?, as we can safely assume the data is incomplete for those samples. We also remove the source and question columns, as the model will not use them.

**Stage 1 - Preprocessing**

In [8]:
import nltk
from nltk.tokenize import word_tokenize
import string
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\powri\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\powri\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\powri\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [9]:
#find one occurrence of each focus area
focus_areas = data['focus_area'].unique()

#convert to a list
unique_focus_list = [str(value) for value in focus_areas.tolist()]

#save in same directory
file_path = "unique_focus_areas.txt"

#write the unique focus areas to the text file
with open(file_path, 'w') as file:
    for value in unique_focus_list:
        file.write(value + '\n')

Earlier we noted that 854 unique focus areas are too many and need to be condensed. As a workaround, we condense them into just 12 classes:

- create a map of all unique focus areas
- map each focus area to a "category"
- use the new categories to merge the answers mapped to unique focus areas.

Below is an example of the process.


***example... answer: "cannot sleep for more than 5 hours", focus_area: "insomnia", label: "20", category: "Psychiatric Condition", compressed label: "11".***

- This example shows that the statement "cannot sleep for more than 5 hours" is now assigned to category label 11 rather than focus-area label 20. With far fewer classes to choose from, the model has a better chance of predicting the right one.
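
The two-step mapping in the example can be sketched with plain dictionaries. The focus areas and categories below are a hypothetical subset chosen for illustration; the real mapping lives in `categories.csv`.

```python
# Hypothetical focus areas and categories, for illustration only
focus_to_category = {
    "insomnia": "Psychiatric Condition",
    "Illness Anxiety Disorder": "Psychiatric Condition",
    "Machado-Joseph Disease": "Neurological Disorders",
}

# Give each distinct category one integer, in order of first appearance
category_to_int = {}
for category in focus_to_category.values():
    category_to_int.setdefault(category, len(category_to_int))

def compressed_label(focus_area):
    """Look up the category for a focus area, then that category's integer."""
    return category_to_int[focus_to_category[focus_area]]

print(compressed_label("insomnia"))
print(compressed_label("Illness Anxiety Disorder"))
print(compressed_label("Machado-Joseph Disease"))
```

Two different focus areas in the same category end up with the same compressed label, which is the whole point of the condensing step.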

In [10]:
#load txt file
file_path = "unique_focus_areas.txt"
with open(file_path, 'r') as file:
    unique_focus_areas = [line.strip() for line in file]

#map focus_areas to a number
label_map = {focus_area: label for label, focus_area in enumerate(unique_focus_areas)}

#merge with data
data['label'] = data['focus_area'].map(label_map)
data.dropna(subset=['label'], inplace=True)
data['label'] = data['label'].astype(int)
print(data['label'])

10547      0
15995      1
15841      2
8994       3
2518       4
        ... 
6702     849
10291    850
6829     851
5467     852
5952     853
Name: label, Length: 994, dtype: int32


First, we assign an integer to each focus area in the long list so that it can be identified.

In [11]:
def preprocess_text(text, max_seq_length):
    #lowercasing
    text = text.lower()

    #formatting
    text = text.translate(str.maketrans('', '', string.punctuation + string.digits))

    #tokenizing
    tokens = word_tokenize(text)

    #truncate or pad to max_seq_length (50 here) for modularity
    padded_tokens = tokens[:max_seq_length] + ['<PAD>'] * (max_seq_length - len(tokens))

    #concatenate back into sentence
    processed_text = ' '.join(padded_tokens)

    return processed_text, tokens

max_seq_length = 50

#apply to answers
data['answer'], data['answer_tokens'] = zip(*data['answer'].apply(lambda x: preprocess_text(x, max_seq_length)))

print(data['answer'])

print(data['answer_tokens'].value_counts())

10547    lbsl is caused by mutations in the dars gene w...
15995    a healthy diet is important in all stages of c...
15841    nearly million us adultsabout one in have feca...
8994     the diagnostic category of pervasive developme...
2518     what are the signs and symptoms of crome syndr...
                               ...                        
6702     psuedohypoaldosteronism type is an inborn erro...
10291    ataxia with vitamin e deficiency is a rare con...
6829     trisomy mosaicism is a rare chromosome conditi...
5467     what are the signs and symptoms of familial er...
5952     what causes primary melanoma of the central ne...
Name: answer, Length: 994, dtype: object
answer_tokens
[this, condition, is, inherited, in, an, autosomal, recessive, pattern, which, means, both, copies, of, the, gene, in, each, cell, have, mutations, the, parents, of, an, individual, with, an, autosomal, recessive, condition, each, carry, one, copy, of, the, mutated, gene, but, they, typically, d

Second, we convert the answers into fixed-length token sequences ready for embedding. The embedding step then turns these tokens into numeric vectors the model can train on.
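
The cleaning and padding logic can be tried in isolation. This is a minimal sketch, assuming `str.split()` as a stand-in for NLTK's `word_tokenize` so it runs without any downloads; `pad_or_truncate` is a hypothetical helper name, not a function from the notebook.

```python
import string

def pad_or_truncate(tokens, max_seq_length, pad_token="<PAD>"):
    """Return exactly max_seq_length tokens: cut long inputs, pad short ones."""
    kept = tokens[:max_seq_length]
    return kept + [pad_token] * (max_seq_length - len(kept))

text = "Cannot sleep for more than 5 hours!"
# Lowercase, then strip punctuation and digits as in preprocess_text
cleaned = text.lower().translate(
    str.maketrans("", "", string.punctuation + string.digits))
# str.split() stands in for nltk's word_tokenize in this sketch
tokens = cleaned.split()

print(tokens)
print(pad_or_truncate(tokens, 8))
print(pad_or_truncate(tokens, 3))
```

Short inputs gain `<PAD>` tokens at the end and long inputs are cut, so every sequence has the same length.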

In [12]:
import gensim.downloader as api
import numpy as np

#word2vec embedding model for nlp
word_vectors = api.load("word2vec-google-news-300")

#embed tokens made in answer column
def embed_tokens(tokens):
    #embeddings need an empty array to fill 
    embeddings = []
    for token in tokens:
        #deal with out of vocab tokens
        if token in word_vectors.key_to_index:
            embeddings.append(word_vectors.get_vector(token))
        else:
            #using a 0 vector for out of vocab tokens
            embeddings.append(np.zeros(word_vectors.vector_size))
    return np.array(embeddings)

# embed the answer tokens
data['embeddings'] = data['answer_tokens'].apply(embed_tokens)

Using a pretrained embedding model, Word2Vec, we map each token to a vector that captures its semantic meaning. This lets the model pick up on related words across the answer column.

In [13]:
grouped_answers = data.groupby('label')['answer_tokens'].apply(list).reset_index(name='grouped_answers')
print(grouped_answers)

     label                                    grouped_answers
0        0  [[lbsl, is, caused, by, mutations, in, the, da...
1        1  [[a, healthy, diet, is, important, in, all, st...
2        2  [[nearly, million, us, adultsabout, one, in, h...
3        3  [[the, diagnostic, category, of, pervasive, de...
4        4  [[what, are, the, signs, and, symptoms, of, cr...
..     ...                                                ...
848    849  [[psuedohypoaldosteronism, type, is, an, inbor...
849    850  [[ataxia, with, vitamin, e, deficiency, is, a,...
850    851  [[trisomy, mosaicism, is, a, rare, chromosome,...
851    852  [[what, are, the, signs, and, symptoms, of, fa...
852    853  [[what, causes, primary, melanoma, of, the, ce...

[853 rows x 2 columns]


This gives a dataframe with unique labels and their corresponding answer tokens. Separating it from the main dataframe protects the main data when we merge later; it is good practice to test in a separate structure.

In [14]:
#AI Assistance
def compute_group_embeddings(grouped_answers, word_vectors, embedding_dim):
    data = []
    for index, row in grouped_answers.iterrows():
        label = row['label']
        answers = row['grouped_answers']
        for group in answers:
            answers_embeddings = []
            for token in group:
                if token in word_vectors.key_to_index:
                    # Get the embedding of the token
                    embedding = word_vectors[token]
                    answers_embeddings.append(embedding)
            if answers_embeddings:
                # Aggregate the embeddings (e.g., average)
                aggregated_embedding = np.mean(answers_embeddings, axis=0)
                data.append({'label': label, 'embedding': aggregated_embedding})
    return pd.DataFrame(data)

We used AI assistance here to move smoothly from answer tokens to embeddings. The function iterates through the rows, looks up each token in Word2Vec (skipping out-of-vocabulary tokens), and averages the token vectors into a single embedding per answer. It still uses Word2Vec, ensuring compatibility. Building the embeddings in a new dataframe also lets us validate them against the original dataframe.
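
The averaging step is easier to see on a tiny example. The sketch below uses a hypothetical 3-dimensional vocabulary in place of the 300-dimensional Word2Vec model, and `mean_embedding` is an illustrative helper rather than the notebook's function. Unlike `compute_group_embeddings`, which skips answers with no known tokens, it returns a zero vector in that case.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for the 300-d Word2Vec model
toy_vectors = {
    "cannot": np.array([1.0, 0.0, 0.0]),
    "sleep": np.array([0.0, 1.0, 0.0]),
}

def mean_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

print(mean_embedding(["cannot", "sleep", "zzz"], toy_vectors, 3))
print(mean_embedding(["zzz"], toy_vectors, 3))
```

Averaging gives one fixed-size vector per answer regardless of its length, at the cost of discarding word order.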

In [15]:
embedding_dim = 300 

group_embeddings_df = compute_group_embeddings(grouped_answers, word_vectors, embedding_dim)

This cell calls the function above with the embedding dimensionality to create the new dataframe.

In [16]:
#read the csv for categories
categories_df = pd.read_csv('categories.csv')

#map categories to their integer value
category_to_int = {category: idx for idx, category in enumerate(categories_df['Category'].unique())}

#map the categories in the df according to their integer value for reference
categories_df['label'] = categories_df['Category'].map(category_to_int)

#create an empty column called embedding for merging
categories_df['embedding'] = [[] for _ in range(len(categories_df))]

print(categories_df)

                                             Condition  \
0    Leukoencephalopathy with brainstem and spinal ...   
1                                Wallenberg's Syndrome   
2                               Machado-Joseph Disease   
3                                       Menkes Disease   
4           Myoclonic epilepsy myopathy sensory ataxia   
..                                                 ...   
251                           Illness Anxiety Disorder   
252                                Conversion Disorder   
253                                Factitious Disorder   
254              Intrahepatic cholestasis of pregnancy   
255                                   Trichotillomania   

                   Category  label embedding  
0    Neurological Disorders      0        []  
1    Neurological Disorders      0        []  
2    Neurological Disorders      0        []  
3    Neurological Disorders      0        []  
4    Neurological Disorders      0        []  
..                   

With the help of an AI tool, I assigned every focus area in the original dataframe to a condensed "category" and stored the result in a CSV file. Loading that file into this notebook lets me map the categories to integers and create an empty embedding column, so that the two dataframes can be merged into one containing both the embedded answer and the condensed category label.

In [17]:
#merging both dataframes into one condensed form
merged_df = pd.merge(group_embeddings_df, categories_df, left_on='label', right_on='label')


print(merged_df)

     label                                        embedding_x  \
0        0  [0.008707007, 0.06964839, 0.04323038, 0.008368...   
1        0  [0.008707007, 0.06964839, 0.04323038, 0.008368...   
2        0  [0.008707007, 0.06964839, 0.04323038, 0.008368...   
3        0  [0.008707007, 0.06964839, 0.04323038, 0.008368...   
4        0  [0.008707007, 0.06964839, 0.04323038, 0.008368...   
..     ...                                                ...   
327     11  [0.045772173, 0.038873628, -0.007002669, 0.081...   
328     11  [0.045772173, 0.038873628, -0.007002669, 0.081...   
329     11  [0.045772173, 0.038873628, -0.007002669, 0.081...   
330     11  [0.045772173, 0.038873628, -0.007002669, 0.081...   
331     11  [0.045772173, 0.038873628, -0.007002669, 0.081...   

                                             Condition  \
0    Leukoencephalopathy with brainstem and spinal ...   
1                                Wallenberg's Syndrome   
2                               Machado-Josep

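Finally, we merge the two custom dataframes on their shared `label` column to fill in the condensed categories. This lets us train the model with 12 categories instead of 854. One caveat: `label` in `group_embeddings_df` is the focus-area index from In [10], whereas `label` in `categories_df` is the category index from In [16], so the join pairs two different numbering schemes. The output above shows the same embedding repeated for every condition in a category, so joining on the condition name instead may be worth testing.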
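
Because both dataframes carry a column called `label` but number different things (focus areas in one, categories in the other), it may be safer to join on the condition name. The following is a minimal sketch under that assumption, using invented miniature frames; the `key` column and the case normalisation are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames
answers = pd.DataFrame({
    "focus_area": ["Menkes Disease", "Trichotillomania"],
    "embedding": [[0.1, 0.2], [0.3, 0.4]],
})
categories = pd.DataFrame({
    "Condition": ["Menkes Disease", "Trichotillomania"],
    "Category": ["Neurological Disorders", "Psychiatric Condition"],
})

# Normalise case and whitespace so names match despite formatting differences
answers["key"] = answers["focus_area"].str.strip().str.lower()
categories["key"] = categories["Condition"].str.strip().str.lower()

# Inner join on the condition name: one row per answer, with its own category
merged = answers.merge(categories[["key", "Category"]], on="key", how="inner")
print(merged[["focus_area", "Category"]])
```

With this join each answer embedding keeps its own row and picks up the category of its own condition, rather than being repeated across every condition in a category.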

**NOTE** When I first built this model, I used 854 categories without merging, which resulted in 0.03% accuracy. The task was too complex for the model architecture, so condensing the categories was the solution I came up with.

In [18]:
# remove redundant columns
merged_df.drop(columns=['embedding_y', 'Condition'], inplace=True)

# rename for correct naming conventions
merged_df.rename(columns={'embedding_x': 'embedding'}, inplace=True)

print(merged_df)

     label                                          embedding  \
0        0  [0.008707007, 0.06964839, 0.04323038, 0.008368...   
1        0  [0.008707007, 0.06964839, 0.04323038, 0.008368...   
2        0  [0.008707007, 0.06964839, 0.04323038, 0.008368...   
3        0  [0.008707007, 0.06964839, 0.04323038, 0.008368...   
4        0  [0.008707007, 0.06964839, 0.04323038, 0.008368...   
..     ...                                                ...   
327     11  [0.045772173, 0.038873628, -0.007002669, 0.081...   
328     11  [0.045772173, 0.038873628, -0.007002669, 0.081...   
329     11  [0.045772173, 0.038873628, -0.007002669, 0.081...   
330     11  [0.045772173, 0.038873628, -0.007002669, 0.081...   
331     11  [0.045772173, 0.038873628, -0.007002669, 0.081...   

                   Category  
0    Neurological Disorders  
1    Neurological Disorders  
2    Neurological Disorders  
3    Neurological Disorders  
4    Neurological Disorders  
..                      ...  
327   Psy

This concludes the preprocessing: we now have a dataframe containing only the data we need, in the correct format for ML.

***STAGE 2 - Training***

In [19]:
from sklearn.model_selection import train_test_split
import numpy as np

In [20]:
X = np.array(merged_df['embedding'].to_list())
y = merged_df['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

We define X as the embedded answers and y as the condensed labels. The training/testing split is 80% to 20%, a common choice when training models.

In [21]:
# a numpy array for X_train
X_train_reshaped = np.array(X_train)
#the shape needs to be (n_samples, 300)
X_train_reshaped = X_train_reshaped.reshape(-1, 300)
# a numpy array with only one dimension.
y_train = np.ravel(y_train)

print(X_train.shape)
print(X_train_reshaped.dtype)
print(y_train.shape)
print(y_train.dtype)

(265, 300)
float32
(265,)
int64


X_train needs to be a 2-dimensional array of shape (n_samples, 300): one row per sample (265 here) and one column per embedding dimension. y_train should be a 1-dimensional array holding one integer per sample, the condensed label. Our model expects float32 input, so X_train needs no conversion, and integer labels are acceptable for y_train.
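
The shape requirements can be checked on a small stand-in array. This sketch uses zeros in place of real embeddings and a made-up set of labels; only the shapes matter here.

```python
import numpy as np

# Hypothetical batch: 4 samples, each a 300-dimensional mean embedding
X = np.zeros((4, 300), dtype=np.float32)
y = np.array([[0], [3], [11], [3]])  # labels stored as a column

# One row per sample, one column per embedding dimension
print(X.shape)

# ravel flattens the labels to one integer per sample
y_flat = np.ravel(y)
print(y_flat.shape)

# A single new sample must still be 2-D before it is passed to the model
single = X[0].reshape(1, -1)
print(single.shape)
```

The last step matters at inference time: a lone 300-element vector has to be reshaped to (1, 300) so that it looks like a batch of one.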

In [22]:
#labels start at 0 so we add one to get the number of classes
max_classes_train = max(y_train) + 1 
max_classes_test = max(y_test) + 1 
max_classes = max(max_classes_train, max_classes_test)
print("max classes = ", max_classes)

max classes =  12


When training on 854 categories, I ran into an issue where the number of classes did not match the maximum label. I left this code in to confirm that the problem does not recur after condensing.
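
The class count matters because the one-hot encoding used in the next stage needs one column per class. Below is a minimal NumPy sketch of that relationship, with made-up labels; it is comparable in spirit to what `to_categorical` produces, not a copy of its implementation.

```python
import numpy as np

y = np.array([0, 3, 11, 3])        # hypothetical integer labels
num_classes = int(y.max()) + 1     # labels start at 0, so add one

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype=np.float32)[y]

print(num_classes)
print(one_hot.shape)
print(one_hot.argmax(axis=1))
```

If `num_classes` were smaller than the largest label plus one, the indexing would fail, which is the kind of mismatch the check above guards against.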

In [23]:
print(X_train)

[[-0.01074219  0.08727417 -0.01113892 ...  0.03592529  0.05208588
  -0.03996735]
 [ 0.02872908  0.02673568 -0.0241175  ... -0.01285652  0.06061326
   0.03037371]
 [ 0.02922597  0.05597113 -0.00212209 ...  0.02803556  0.0894025
  -0.00630267]
 ...
 [ 0.02898509  0.01586095  0.01171126 ...  0.00290871  0.10405369
  -0.02113635]
 [ 0.02922597  0.05597113 -0.00212209 ...  0.02803556  0.0894025
  -0.00630267]
 [ 0.02898509  0.01586095  0.01171126 ...  0.00290871  0.10405369
  -0.02113635]]


The final input variable before training and evaluation.

***Stage 3 and 4 - Testing and Evaluation, with Optimisation***

In [24]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.preprocessing.sequence import pad_sequences

#used for categorical classification
y_train_one_hot = to_categorical(y_train, num_classes=max_classes)
y_test_one_hot = to_categorical(y_test, num_classes=max_classes)

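#force input to correct length
#note: pad_sequences returns dtype int32 by default; pass dtype='float32' to keep float embeddings intact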
max_input_length = 300
X_train_padded = pad_sequences(X_train, maxlen=max_input_length, padding='post')
X_test_padded = pad_sequences(X_test, maxlen=max_input_length, padding='post')

#model architecture and parameters
model = Sequential([
    Dense(128, activation='relu', input_shape=(max_input_length,)),
    Dense(128, activation='relu'),
    Dense(max_classes, activation='softmax')
])

#using adam optimiser and categorical crossentropy for loss function
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='categorical_crossentropy', 
              metrics=['accuracy'])  #we will measure success with accuracy

#fit the model and train using defined architecture and 50 epochs
history = model.fit(X_train_padded, y_train_one_hot, epochs=50, batch_size=32, validation_split=0.2)

#evaluation
test_loss, test_acc = model.evaluate(X_test_padded, y_test_one_hot)
print('Test accuracy:', test_acc)


Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
Test accuracy: 0.08955223858356476


- First we convert y to one-hot encoding, so that each categorical label becomes a vector of 0s with a single 1.
- We pad or truncate X to force any input to the specified length for model fitting.
- Using a sequential model, we stack three dense layers: two hidden layers with ReLU activations and a softmax output layer. The input shape matches the padded X.
- Compiling the model with the Adam optimiser and a categorical cross-entropy loss lets training adjust the weights and biases to reduce the gap between the predicted probabilities and the true labels.
- We fit the model on the padded input and the one-hot labels for 50 epochs, with a batch size of 32 and 20% of the training data held out for validation.
- We evaluate with accuracy. The value printed above, 0.0896, is a fraction: about 9% on this run, not 89%.
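
One thing worth checking in the padding step: according to the Keras documentation, `pad_sequences` returns an array of `dtype='int32'` unless another `dtype` is passed. The sketch below uses plain NumPy casting as a stand-in (an assumption, not a call into Keras) to show what an integer cast does to values in the range of these embeddings. It also computes the accuracy expected from random guessing over 12 classes, for comparison with the test accuracy.

```python
import numpy as np

# Hypothetical embedding values in the same range as the X_train output
embedding = np.array([-0.01074219, 0.08727417, 0.03592529], dtype=np.float32)

# Casting to int32 truncates towards zero, so every value becomes 0
as_int = embedding.astype("int32")
print(as_int)

# Keeping float32 leaves the values intact
as_float = embedding.astype("float32")
print(as_float)

# Accuracy expected from uniform random guessing over 12 classes
chance = 1 / 12
print(f"{chance:.3f}")
```

If the padded inputs turn out to be all zeros, the network has nothing to learn from, and accuracy close to the chance level would be the expected symptom.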

**Although you cannot see it here, I have tested this model with many different parameters, epoch counts, architectures and a completely different dataset. Here is a list of the accuracies I have received:**

- 89% (Final score)
- 95.6% (Overfitted with heavy bias)
- 81% (Score without hyperparameter tuning)
- 10% (Using best parameters on a huge dataset of  16,500 samples)
- 0.03% (Using 854 categories to classify with)

In [25]:
model.save("hero_model")

INFO:tensorflow:Assets written to: hero_model\assets



Saving the model

In [26]:
import tensorflow as tf

# Load the saved model
loaded_model = tf.keras.models.load_model("hero_model")

# Convert the model to TFLite format
converter = tf.lite.TFLiteConverter.from_keras_model(loaded_model)
tflite_model = converter.convert()

# Save the TFLite model to a file
with open("hero_model.tflite", "wb") as f:
    f.write(tflite_model)

INFO:tensorflow:Assets written to: C:\Users\powri\AppData\Local\Temp\tmp8u3ngka6\assets



We convert this model into a TensorFlow Lite model in the hope of running inference in an Android Studio application. However, Android Studio has deprecated its use of Word2Vec, and the file needed to run it locally is 11GB, meaning a typical phone cannot run my model. In any case, there is a live demonstration of the model's predictive capabilities in the Flask app, app.py.

***Final Thoughts***

This notebook has used a large corpus to prepare, train, test and package a model ready for inference. We used an embedding model suitable for NLP and a Keras neural network to train and test on our data. With a reported peak accuracy of 89% in earlier runs I would call this model a success, although the run recorded above prints a test accuracy of about 9%.

However, there are many regrets and issues that could be examined and improved upon if the model were to be realised.

- The data is vast and complicated, limiting the capabilities of a simple model.
- NLP models are difficult to port to mobile development; a more efficient way to map our input variable would have allowed us to follow through.
- The categorisation is too broad for real-life situations and could prove unhelpful.
- Bias is a big problem with this model, and further feature engineering would be very valuable to its success.
- Extra features could be introduced, such as BERT modelling to bring openNLP to the model, allowing a conversion to a "question-answer" type model, which would be more useful for release.

Even though there are many improvements that could be made, this model still provides a good example of training a model on a large dataset for NLP with multiclass classification. It demonstrates skills in advanced areas of machine learning, building on the fundamentals learnt previously.

**The deployment of this model is demonstrated in app.py in this project folder**

**References**

OpenAI. (n.d.). ChatGPT. https://openai.com/chatgpt

Brownlee, J. (2020, August 19). How to connect model input data with predictions for machine learning. MachineLearningMastery.com. https://machinelearningmastery.com/how-to-connect-model-input-data-with-predictions-for-machine-learning/

Panchal, S. (2021, August 27). Sarcasm detection using word embeddings in Android. Medium. https://towardsdatascience.com/sarcasm-detection-using-word-embeddings-in-android-999a791d676a

Pennington, J. (n.d.). GloVe: Global vectors for word representation. https://nlp.stanford.edu/projects/glove/

Text classification with Android: TensorFlow Lite. TensorFlow. (n.d.). https://www.tensorflow.org/lite/android/tutorials/text_classification

Word2vec: Text: TensorFlow. TensorFlow. (2024, March 23). https://www.tensorflow.org/text/tutorials/word2vec

Your business wants a spreadsheet. Give it to them. Row Zero - The World’s Most Powerful Spreadsheet. (n.d.). https://rowzero.io/
