# Keras NLP: Text Classification
In this notebook, I use the Keras machine learning library to classify the text of Amazon fashion product reviews by sentiment.

## Imports

In [176]:
# loads the libraries used in this notebook
import tensorflow
import tensorflow_text
from tensorflow import feature_column
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split


In [9]:
! pip install keras-nlp tensorflow==2.11 tensorflow-text==2.11 --upgrade -q
os.environ["KERAS_BACKEND"] = "torch"  # I will be using torch as it is a dependency in our environment.yml
import keras_nlp
import keras

## The Data
The dataset is the Amazon fashion reviews dataset from Julian McAuley's database at UC San Diego.

In [350]:
fashion_reviews = pd.read_csv('../../../data/text_data/combined_preprocessed.csv')

In [411]:
fashion_reviews.head()

Unnamed: 0,overall,reviewerID,asin,reviewText,summary,unixReviewTime,vote,Color1,Color2,Color3,...,Size,unixReviewTime.1,reviewTimeYear,reviewTimeMonth,reviewTimeWeek,reviewTimeDay,reviewTimeDayofweek,reviewTimeDayofyear,reviewTimeElapsed,sentiment
0,5.0,ALJ66O1Y6SLHA,B000K2PJ4K,Great product and price!,Five Stars,1441325000.0,0,Blue,Orange,missing,...,Big Boys,1441325000.0,2015.0,9.0,36.0,4.0,4.0,247.0,1441325000.0,1
1,5.0,ALJ66O1Y6SLHA,B000K2PJ4K,Great product and price!,Five Stars,1441325000.0,0,Black (37467610),Red,White,...,Big Boys,1441325000.0,2015.0,9.0,36.0,4.0,4.0,247.0,1441325000.0,1
2,5.0,ALJ66O1Y6SLHA,B000K2PJ4K,Great product and price!,Five Stars,1441325000.0,0,Blue,Gray Logo,missing,...,Big Boys,1441325000.0,2015.0,9.0,36.0,4.0,4.0,247.0,1441325000.0,1
3,5.0,ALJ66O1Y6SLHA,B000K2PJ4K,Great product and price!,Five Stars,1441325000.0,0,Blue (37867638-99),Yellow,missing,...,Big Boys,1441325000.0,2015.0,9.0,36.0,4.0,4.0,247.0,1441325000.0,1
4,5.0,ALJ66O1Y6SLHA,B000K2PJ4K,Great product and price!,Five Stars,1441325000.0,0,Blue,Pink,missing,...,Big Boys,1441325000.0,2015.0,9.0,36.0,4.0,4.0,247.0,1441325000.0,1


## The Keras Text Classification Model

### Inference with a pretrained classifier


In Keras, a pre-trained model packaged for a specific job is called a "task": a pre-trained backbone model fitted with task-specific layers on top. The goal here is to classify the reviewText feature by its overall sentiment using the pre-trained, inference-ready BERT sentiment model from Keras, "bert_tiny_en_uncased_sst2".

In [373]:
reviewText_pretrained_classifier = keras_nlp.models.BertClassifier.from_preset("bert_tiny_en_uncased_sst2")



In [407]:
test_input = fashion_reviews.loc[0,'reviewText']
test_input

'Great product and price!'

In [408]:
predictions = reviewText_pretrained_classifier.predict(fashion_reviews.loc[:,'reviewText']) # will build the model

97/97 ━━━━━━━━━━━━━━━━━━━━ 52s 524ms/step


In [409]:
print(predictions.shape)
predictions[0] # assuming index 0 is a negative sentiment and index 1 is a positive sentiment

(3079, 2)


array([-2.555856 ,  2.5678942], dtype=float32)
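The two values per row are unnormalized scores (logits), not probabilities. A minimal NumPy sketch of turning them into class probabilities with a softmax, assuming, as the comment in the cell above does, that index 0 is negative and index 1 is positive. `softmax` is a helper written for this illustration, not part of the notebook:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over the last axis (illustrative helper)."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# The logits printed above for the first review.
first_logits = np.array([-2.555856, 2.5678942], dtype=np.float32)
first_probs = softmax(first_logits)
print(first_probs)  # probabilities for [negative, positive] under the assumption above
```

The larger logit always gets the larger probability, so the softmax changes how confident a prediction looks, not which class wins.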

In [413]:
type(predictions)

numpy.ndarray

In [412]:
def convert_prediction(prediction):
  """
  Converts a model prediction to a binary 0/1 if index 1 is greater.

  Args:
    prediction: A NumPy array containing the model prediction.

  Returns:
    A binary value (0 or 1) based on the prediction.
  """

  if prediction[1] > prediction[0]:
    return 1
  else:
    return 0
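
The same rule can be applied to the whole prediction array at once. A small sketch using `np.argmax`; `convert_predictions` is a hypothetical vectorized counterpart to the function above, and the second and third rows of the example are made up:

```python
import numpy as np

def convert_predictions(predictions):
    """Vectorized variant: 1 where index 1 is strictly larger, else 0."""
    predictions = np.asarray(predictions)
    # argmax returns the first maximum, so a tie maps to 0,
    # matching the if/else in convert_prediction.
    return np.argmax(predictions, axis=1)

example = np.array([
    [-2.555856, 2.5678942],  # first row printed above
    [1.2, -0.7],             # made-up row
    [0.3, 0.3],              # made-up tie
])
labels = convert_predictions(example)
print(labels.tolist())
```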

In [416]:
sentiment_preds = [convert_prediction(pred) for pred in predictions]
fashion_reviews['pretrained_sentiment_inference'] = sentiment_preds

In [422]:
fashion_reviews.loc[:, ['pretrained_sentiment_inference', 'reviewText']].sample(5)

Unnamed: 0,pretrained_sentiment_inference,reviewText
2790,1,BEST sneakers I've ever purchased!!!!
1687,0,"Bought these shoes with HIIT in mind, but not ..."
50,1,Did not fit well. Was not comfortable. Switche...
1634,0,"Great shoe! Outside arch is kind of high, but ..."
2471,0,Ugh... way to large - is this for a man?
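
A handful of sampled rows is hard to judge by eye. One way to put a number on the predictions is to compare them with the star ratings. The sketch below does this on a made-up frame, assuming that ratings of 4 and above count as positive; `toy` and `star_sentiment` are illustrative names:

```python
import pandas as pd

# Toy stand-in for fashion_reviews; the values are made up for illustration.
toy = pd.DataFrame({
    "overall": [5.0, 2.0, 4.0, 1.0],
    "pretrained_sentiment_inference": [1, 0, 0, 1],
})

# Star-derived label: 4 or 5 stars counts as positive (an assumption).
toy["star_sentiment"] = (toy["overall"] >= 4).astype(int)

# Share of rows where the model and the star-derived label agree.
agreement = (toy["pretrained_sentiment_inference"] == toy["star_sentiment"]).mean()
print(f"agreement: {agreement:.2f}")
```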


### Finetuning via Modules
This preset is not inference-ready: it must be fine-tuned, i.e. a classification head is trained on top of the pre-trained backbone.
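
As a rough picture of what the head does: the backbone turns each review into a fixed-size vector, and the head is a small dense layer that maps that vector to one score per class. A NumPy sketch with random stand-in values; the hidden size of 128 and all variable names are assumptions for illustration, not values read from the preset:

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, hidden_dim, num_classes = 4, 128, 2  # hidden_dim is an assumption

# Stand-in for the backbone's pooled output: one vector per review.
pooled = rng.normal(size=(batch_size, hidden_dim))

# The classification head: a freshly initialized dense layer.
# Its weights start out random, which is why the model needs
# fine-tuning before its predictions mean anything.
head_weights = rng.normal(scale=0.02, size=(hidden_dim, num_classes))
head_bias = np.zeros(num_classes)

logits = pooled @ head_weights + head_bias
print(logits.shape)  # one row per review, one column per class
```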

In [425]:
reviewText_pretrained_classifier_from_modules = keras_nlp.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased",
    num_classes=2
)



In [429]:
# sentiment is the target label for fine-tuning
# the dataset has star ratings ('overall'), so a binary sentiment can be derived from them,
# assuming anything >= 4 stars is a positive review (1) and everything else is negative (0)
fashion_reviews['sentiment'] = fashion_reviews['overall'].copy().astype(int)
fashion_reviews['sentiment'] = fashion_reviews.sentiment.apply(lambda x: 1 if x >= 4 else 0)

text_vars = ['reviewText', 'sentiment']

fashion_reviews.loc[:, text_vars].sample(5)

fashion_reviews['reviewText'] = fashion_reviews['reviewText'].fillna(' ')

fashion_reviews['sentiment'] = fashion_reviews['sentiment'].fillna(0)

print(fashion_reviews.loc[:,text_vars].isna().sum())

X = fashion_reviews.loc[:, 'reviewText'].astype(str)
y = fashion_reviews.loc[:, 'sentiment']
X_train, X_test , y_train, y_test = train_test_split(X, y , test_size = 0.20)

X_train = X_train.apply(lambda x: x.decode('utf-8') if isinstance(x, bytes) else x)
X_test = X_test.apply(lambda x: x.decode('utf-8') if isinstance(x, bytes) else x)

X_train_tfdata = tensorflow.data.Dataset.from_tensor_slices(X_train.values)
X_test_tfdata = tensorflow.data.Dataset.from_tensor_slices(X_test.values)

# pair each review text with its integer label (0/1); the labels stay as integers, no one-hot encoding is applied
train_ds = tensorflow.data.Dataset.zip((X_train_tfdata, tensorflow.data.Dataset.from_tensor_slices(y_train)))
test_ds = tensorflow.data.Dataset.zip((X_test_tfdata, tensorflow.data.Dataset.from_tensor_slices(y_test)))

reviewText    0
sentiment     0
dtype: int64
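
The `head()` output earlier shows the same review text repeated across several rows that differ only in the color columns. With a random split, copies of one review can land in both the training and the test set. A sketch of dropping repeated texts before splitting, on a made-up frame; whether deduplication suits this dataset is an assumption, not something the notebook does:

```python
import pandas as pd

# Toy stand-in for fashion_reviews; the values are made up for illustration.
toy = pd.DataFrame({
    "reviewText": ["Great product and price!"] * 3 + ["Did not fit well."],
    "overall": [5.0, 5.0, 5.0, 2.0],
})

# Keep the first occurrence of each distinct review text.
deduplicated = toy.drop_duplicates(subset="reviewText").reset_index(drop=True)
print(len(toy), "->", len(deduplicated))
```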


In [434]:
train_ds.batch(1).take(1).get_single_element()

(<tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Super.'], dtype=object)>,
 <tf.Tensor: shape=(1,), dtype=int64, numpy=array([1])>)

In [435]:
reviewText_pretrained_classifier_from_modules.fit(
    train_ds.batch(batch_size=10),
    validation_data=test_ds.batch(batch_size=10),
    epochs=1,
)

247/247 ━━━━━━━━━━━━━━━━━━━━ 139s 561ms/step - loss: 0.5105 - sparse_categorical_accuracy: 0.8154 - val_loss: 0.2898 - val_sparse_categorical_accuracy: 0.8539


<keras_core.src.callbacks.history.History at 0x7fee4216a340>
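
The metric name `sparse_categorical_accuracy` in the log indicates that the labels are plain integers (0 or 1) compared against the index of the largest logit. A NumPy sketch of how such a loss and accuracy could be computed, assuming the loss is a sparse categorical cross-entropy on raw logits; the function name and the example values are made up:

```python
import numpy as np

def sparse_categorical_metrics(logits, labels):
    """Mean cross-entropy and accuracy for integer labels and raw logits."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    # Log-softmax, computed stably by subtracting the row maximum.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the true class for each row.
    loss = -log_probs[np.arange(len(labels)), labels].mean()
    accuracy = (logits.argmax(axis=1) == labels).mean()
    return loss, accuracy

# Made-up logits and labels for three reviews.
loss, accuracy = sparse_categorical_metrics(
    [[-2.0, 2.0], [1.5, -1.5], [0.2, 0.1]],
    [1, 0, 1],
)
print(f"loss={loss:.4f} accuracy={accuracy:.4f}")
```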

In [438]:
finetuned_module_preds = reviewText_pretrained_classifier_from_modules.predict(test_ds.batch(batch_size=10))

62/62 ━━━━━━━━━━━━━━━━━━━━ 8s 110ms/step


In [448]:
X_test_eval = X_test.copy()

In [443]:
module_sentiment_preds = [convert_prediction(pred) for pred in finetuned_module_preds]

In [450]:
pretrained_module_test = pd.DataFrame({'reviewText': X_test, 'pretrained_module_sentiment_inference': module_sentiment_preds})

In [457]:
pretrained_module_test.sample(5)

Unnamed: 0,reviewText,pretrained_module_sentiment_inference
2727,"These shoes are extremely comfortable, and fit...",1
2436,Love the color and fit. I use them to work on ...,1
42,"Was terribly disappointed, the pants were way ...",0
1539,Nice looking and fit nice,1
2952,These are as far as comfort goes the most comf...,1
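
A random sample of rows gives only a feel for the predictions. A sketch of summarizing them against held-out labels with a 2x2 confusion matrix; `y_true` and `y_pred` below are made-up stand-ins for `y_test` and `module_sentiment_preds`, and the helper is hypothetical:

```python
import numpy as np

def confusion_matrix_2x2(y_true, y_pred):
    """Rows are true classes (0, 1); columns are predicted classes (0, 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    matrix = np.zeros((2, 2), dtype=int)
    # Count each (true, predicted) pair.
    np.add.at(matrix, (y_true, y_pred), 1)
    return matrix

y_true = [1, 1, 0, 1, 0, 1]  # made-up labels
y_pred = [1, 0, 0, 1, 1, 1]  # made-up predictions
cm = confusion_matrix_2x2(y_true, y_pred)
print(cm)
```

The off-diagonal cells separate the two kinds of mistakes (negative reviews predicted positive, and the reverse), which a single accuracy number hides.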


### Finetuning via Preprocessing
Pre-processing can be done separately, before fitting or predicting. This may be useful for large datasets or multi-epoch training, where repeating the pre-processing on the fly costs time and memory: the data is pre-processed once, cached, and reused.
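
To make the pre-processing step concrete, here is a toy sketch of turning a review into fixed-length model inputs. It is not the real BERT preprocessor: it splits on whitespace instead of using WordPiece, and the vocabulary and the function `toy_preprocess` are made up. The output keys mirror the kind of features a BERT model consumes:

```python
def toy_preprocess(text, vocab, sequence_length):
    """Toy stand-in for a BERT preprocessor (whitespace split, not WordPiece)."""
    # Leave room for the [CLS] and [SEP] markers when truncating.
    words = text.lower().split()[: sequence_length - 2]
    tokens = ["[CLS]"] + words + ["[SEP]"]
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    pad = sequence_length - len(ids)
    return {
        "token_ids": ids + [vocab["[PAD]"]] * pad,          # padded to fixed length
        "padding_mask": [True] * len(ids) + [False] * pad,  # True marks real tokens
        "segment_ids": [0] * sequence_length,               # single-sentence input
    }

# Hypothetical miniature vocabulary.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "great": 4, "product": 5}
features = toy_preprocess("Great product and price!", vocab, sequence_length=8)
print(features["token_ids"])
```

Because every example comes out with the same length, the results can be cached and batched without any further text handling during training.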

In [459]:
preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
    "bert_tiny_en_uncased",
    sequence_length=512,
)

train_cached = (
    train_ds.map(preprocessor, tensorflow.data.AUTOTUNE).cache().prefetch(tensorflow.data.AUTOTUNE)
)
test_cached = (
    test_ds.map(preprocessor, tensorflow.data.AUTOTUNE).cache().prefetch(tensorflow.data.AUTOTUNE)
)

reviewText_pretrained_classifier_via_preprocessing = keras_nlp.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased", preprocessor=None, num_classes=2
)
reviewText_pretrained_classifier_via_preprocessing.fit(
    train_cached.batch(10),
    validation_data=test_cached.batch(10),
    epochs=3,
)



Epoch 1/3


2024-03-12 00:50:00.500247: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


247/247 ━━━━━━━━━━━━━━━━━━━━ 188s 760ms/step - loss: 0.5001 - sparse_categorical_accuracy: 0.8155 - val_loss: 0.2880 - val_sparse_categorical_accuracy: 0.8620
Epoch 2/3
247/247 ━━━━━━━━━━━━━━━━━━━━ 232s 939ms/step - loss: 0.2381 - sparse_categorical_accuracy: 0.9179 - val_loss: 0.0659 - val_sparse_categorical_accuracy: 0.9919
Epoch 3/3
247/247 ━━━━━━━━━━━━━━━━━━━━ 227s 918ms/step - loss: 0.0750 - sparse_categorical_accuracy: 0.9814 - val_loss: 0.0188 - val_sparse_categorical_accuracy: 0.9968


<keras_core.src.callbacks.history.History at 0x7fee4b102b50>

In [461]:
finetuned_preprocessing_preds = reviewText_pretrained_classifier_via_preprocessing.predict(test_cached.batch(batch_size=10))

62/62 ━━━━━━━━━━━━━━━━━━━━ 8s 116ms/step


In [462]:
preprocessing_sentiment_preds = [convert_prediction(pred) for pred in finetuned_preprocessing_preds]

In [463]:
pretrained_preprocessing_test = pd.DataFrame({'reviewText': X_test, 'pretrained_preprocessing_sentiment_inference': preprocessing_sentiment_preds})

In [465]:
pretrained_preprocessing_test.sample(10)

Unnamed: 0,reviewText,pretrained_preprocessing_sentiment_inference
2081,"They fit great, look great, are quite comforta...",1
191,I love the shoe and it fit as expected the pho...,1
2587,Light-weight comfy shoes.,1
539,The fit is as expected,1
19,We have used these inserts for years. They pr...,1
916,"Not sure why, but my mid section of my foot st...",0
2492,Love them,1
305,they are very comfortable feel like you have n...,1
2489,"They fit as expected and perfect for training,...",1
654,I got the impression it's cushiony and comfy b...,0


## Conclusion

Keras' text classification workflow requires a few extra pre-processing steps, but fitting a fine-tuned classification head on top of a pretrained backbone can significantly improve the performance of the model. The different fine-tuning methods also increase the flexibility and utility of the base model.