# Deep Learning Course

## Deep Language Learning

### Dataset: Yelp review [(Source: Hugging Face)](https://huggingface.co/datasets/Yelp/yelp_review_full)

**Implementation of Various Natural Language Processing Models for Text Classification**

Several NLP text representations (Binary Bag of Words, Frequency Bag of Words, TF-IDF, and Word Embedding) are implemented for binary classification of the Yelp Reviews dataset.

The dataset is first pre-processed to fit the task; the models are then implemented.

Binary Bag of Words, Frequency Bag of Words, and TF-IDF are implemented with both ngrams = 1 and ngrams = 2, while Word Embedding is implemented both from scratch and using pre-trained word embeddings [(GloVe 6B 50d)](https://nlp.stanford.edu/projects/glove/).

Finally, the models are trained and evaluated on a test set to compare their accuracy.
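
As a minimal illustration of the difference between the binary and the frequency representation, the sketch below encodes a toy sentence against a tiny hand-picked vocabulary in plain Python. The vocabulary, the sentence, and the helper names are made up for illustration; the notebook itself relies on `tf.keras.layers.TextVectorization`.

```python
from collections import Counter

# Hypothetical toy vocabulary; the notebook learns its own from the training reviews.
vocabulary = ["good", "bad", "food", "service"]

def binary_bow(text, vocab):
    """1 if the vocabulary word occurs in the text, else 0."""
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocab]

def frequency_bow(text, vocab):
    """Number of times each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

review = "good food good service"
print(binary_bow(review, vocabulary))     # [1, 0, 1, 1]
print(frequency_bow(review, vocabulary))  # [2, 0, 1, 1]
```

The two vectors differ only where a word is repeated, which is the information the frequency variant adds.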


In [None]:
# Libraries used

from sklearn.model_selection import train_test_split 
from datasets import concatenate_datasets
from datasets import load_dataset
import tensorflow as tf
import urllib.request
import numpy as np
import os


The **Yelp Review** dataset contains textual reviews of businesses from Yelp, each with a rating from 1 to 5 stars.

The dataset is binarized: reviews with 1 or 2 stars are treated as negative and reviews with 4 or 5 stars as positive. 3-star reviews are considered neutral and are excluded.

This leaves two classes.

| Label | Sentiment |
|-------|-----------|
| 0     | Negative  |
| 1     | Positive  |
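
The mapping can be sketched on its own, without downloading the dataset. The sketch assumes that the dataset stores the star rating as a zero-based `label` (0 for 1 star up to 4 for 5 stars), which is what the thresholds in the pre-processing cell rely on; `stars_to_binary` is a hypothetical helper name.

```python
def stars_to_binary(label):
    """Map a zero-based star label (0..4) to 0 = negative, 1 = positive, -1 = neutral."""
    if label <= 1:      # 1 or 2 stars
        return 0
    if label >= 3:      # 4 or 5 stars
        return 1
    return -1           # 3 stars, filtered out later

for label in range(5):
    print(f"{label + 1} star(s) -> {stars_to_binary(label)}")
```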


In [None]:
# Pre-processing

yelp = load_dataset("Yelp/yelp_review_full")

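def binarize(example):
    # Dataset labels are zero-based: 0 = 1 star, ..., 4 = 5 stars
    if example["label"] <= 1:      # 1-2 stars -> negative
        return {"binary_label": 0}
    elif example["label"] >= 3:    # 4-5 stars -> positive
        return {"binary_label": 1}
    else:                          # 3 stars -> neutral, filtered out below
        return {"binary_label": -1}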


yelp_binary = yelp.map(binarize)

yelp_train_full = yelp_binary["train"].filter(lambda example: example["binary_label"] != -1)
yelp_test = yelp_binary["test"].filter(lambda example: example["binary_label"] != -1)

negative = yelp_train_full.filter(lambda example: example["binary_label"] == 0)
positive = yelp_train_full.filter(lambda example: example["binary_label"] == 1)

negative_subset = negative.select(range(50000))
positive_subset = positive.select(range(50000))

yelp_train_val = concatenate_datasets([negative_subset, positive_subset])

yelp_train_val = yelp_train_val.shuffle(seed=21)

x_train_val = np.array(yelp_train_val["text"])
y_train_val = np.array(yelp_train_val["binary_label"])

x_test = np.array(yelp_test["text"])
y_test = np.array(yelp_test["binary_label"])

x_train, x_val, y_train, y_val = train_test_split(x_train_val, y_train_val, test_size=0.2, random_state=21)

y_train = tf.cast(y_train, tf.float32)
y_val = tf.cast(y_val, tf.float32)
y_test = tf.cast(y_test, tf.float32)

batch_size = 32
max_tokens = 20000

train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(batch_size)
val_ds = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(batch_size)
test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(batch_size)

print("Final Dimension:")
print(f"Train: {len(x_train)}")
print(f"Validation: {len(x_val)}")
print(f"Test: {len(x_test)}")

Final Dimension:
Train: 80000
Validation: 20000
Test: 40000
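
These sizes follow from the pre-processing choices. A quick check, assuming the test split holds 10,000 reviews for each of the five star ratings (an assumption about the dataset, consistent with the 40,000 printed above):

```python
per_class_subset = 50_000                 # reviews selected per class for training/validation
train_val_total = 2 * per_class_subset
val_size = train_val_total * 20 // 100    # test_size=0.2 in train_test_split
train_size = train_val_total - val_size

test_per_star = 10_000                    # assumption: balanced test split
test_size = test_per_star * 4             # 3-star reviews are dropped

print(train_size, val_size, test_size)    # 80000 20000 40000
```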


In [3]:
# Binary Bags of Words (ngrams 1)

BBoW_text_vectorization = tf.keras.layers.TextVectorization(name="TextVectorization_Binary_BOW",
                                                            max_tokens=max_tokens,
                                                            output_mode="multi_hot")

BBoW_text_vectorization.adapt(x_train)

BBoW_model = tf.keras.Sequential(name="BINARY_BOW", layers=[
    tf.keras.Input(shape=(1,), dtype=tf.string),
    
    BBoW_text_vectorization,
    
    tf.keras.layers.Dense(name="Dense",
                          units=16, 
                          activation="relu"),
    
    tf.keras.layers.Dropout(name="Dropout",
                            rate=0.5),
    
    tf.keras.layers.Dense(name="Output",
                          units=1, 
                          activation="sigmoid")
])

optimizer_BBoW = tf.keras.optimizers.Adam(learning_rate=0.001)

BBoW_model.compile(optimizer=optimizer_BBoW,
                   loss="binary_crossentropy",
                   metrics=["accuracy"]
)

BBoW_model.summary()   

Model: "BINARY_BOW"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 TextVectorization_Binary_BO  (None, 20000)            0         
 W (TextVectorization)                                           
                                                                 
 Dense (Dense)               (None, 16)                320016    
                                                                 
 Dropout (Dropout)           (None, 16)                0         
                                                                 
 Output (Dense)              (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
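
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A short sketch (`dense_params` is a hypothetical helper):

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(20_000, 16)  # multi-hot vector of size max_tokens -> 16 units
output = dense_params(16, 1)
print(hidden, output, hidden + output)  # 320016 17 320033
```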


In [4]:
# Binary Bags of Words (ngrams 2)

BBoW2_text_vectorization = tf.keras.layers.TextVectorization(name="TextVectorization_Binary_BOW",
                                                             max_tokens=max_tokens,
                                                             output_mode="multi_hot",
                                                             ngrams=2)

BBoW2_text_vectorization.adapt(x_train)

BBoW2_model = tf.keras.Sequential(name="BINARY_BOW_ngrams2", layers=[
    tf.keras.Input(shape=(1,), dtype=tf.string),
    
    BBoW2_text_vectorization,
    
    tf.keras.layers.Dense(name="Dense",
                          units=16, 
                          activation="relu"),
    
    tf.keras.layers.Dropout(name="Dropout",
                            rate=0.5),
    
    tf.keras.layers.Dense(name="Output",
                          units=1, 
                          activation="sigmoid")
])

optimizer_BBoW2 = tf.keras.optimizers.legacy.Adam(learning_rate=0.001)

BBoW2_model.compile(optimizer=optimizer_BBoW2,
                    loss="binary_crossentropy",
                    metrics=["accuracy"]
)

BBoW2_model.summary()   

Model: "BINARY_BOW_ngrams2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 TextVectorization_Binary_BO  (None, 20000)            0         
 W (TextVectorization)                                           
                                                                 
 Dense (Dense)               (None, 16)                320016    
                                                                 
 Dropout (Dropout)           (None, 16)                0         
                                                                 
 Output (Dense)              (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
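
The parameter count is identical to the unigram model because the vocabulary is still capped at `max_tokens`; only the content of the vocabulary changes. As far as the Keras documentation describes it, an integer `ngrams=2` produces n-grams up to length 2, so unigrams and bigrams compete for the same 20,000 slots. A minimal sketch of that token expansion (the helper name `ngrams_up_to` is hypothetical, and whitespace splitting stands in for the layer's own standardization):

```python
def ngrams_up_to(text, n):
    """Return all word n-grams of length 1..n, grouped by n-gram size."""
    tokens = text.lower().split()
    grams = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams

print(ngrams_up_to("not good at all", 2))
# ['not', 'good', 'at', 'all', 'not good', 'good at', 'at all']
```

Bigrams such as "not good" capture negation that a unigram model sees only as two unrelated words.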


In [5]:
# Frequency Bag of Words (ngrams 1)

FBoW_text_vectorization = tf.keras.layers.TextVectorization(name="TextVectorization_Frequency_BOW",
                                                            max_tokens=max_tokens,
                                                            output_mode="count")

FBoW_text_vectorization.adapt(x_train)

FBoW_model = tf.keras.Sequential(name="FREQUENCY_BOW", layers=[
    tf.keras.Input(shape=(1,), dtype=tf.string),
    
    FBoW_text_vectorization,
    
    tf.keras.layers.BatchNormalization(name="Normalization"),
    
    tf.keras.layers.Dense(name="Dense",
                          units=16, 
                          activation="relu"),
    
    tf.keras.layers.Dropout(name="Dropout",
                            rate=0.5),
    
    tf.keras.layers.Dense(name="Output",
                          units=1, 
                          activation="sigmoid")
])

optimizer_FboW = tf.keras.optimizers.legacy.Adam(learning_rate=0.001)

FBoW_model.compile(
    optimizer=optimizer_FboW,
    loss="binary_crossentropy",
    metrics=["accuracy"]
)

FBoW_model.summary()   

Model: "FREQUENCY_BOW"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 TextVectorization_Frequency  (None, 20000)            0         
 _BOW (TextVectorization)                                        
                                                                 
 Normalization (BatchNormali  (None, 20000)            80000     
 zation)                                                         
                                                                 
 Dense (Dense)               (None, 16)                320016    
                                                                 
 Dropout (Dropout)           (None, 16)                0         
                                                                 
 Output (Dense)              (None, 1)                 17        
                                                                 
Total params: 400,033
Trainable params: 360,033
Non-trainable params: 40,000
_________________________________________________________________
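
The normalization layer accounts for the gap between total and trainable parameters. Batch normalization keeps four values per feature: a learned scale and shift, plus a moving mean and a moving variance that are updated during training but not by gradient descent. A small check of the counts in the summary:

```python
features = 20_000                     # max_tokens
bn_trainable = 2 * features           # gamma (scale) and beta (shift)
bn_non_trainable = 2 * features       # moving mean and moving variance
dense_hidden = features * 16 + 16
dense_output = 16 * 1 + 1

total = bn_trainable + bn_non_trainable + dense_hidden + dense_output
trainable = bn_trainable + dense_hidden + dense_output
print(total, trainable, total - trainable)  # 400033 360033 40000
```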

In [6]:
# Frequency Bag of Words (ngrams 2)

FBoW2_text_vectorization = tf.keras.layers.TextVectorization(name="TextVectorization_Frequency_BOW",
                                                             max_tokens=max_tokens,
                                                             output_mode="count",
                                                             ngrams=2)

FBoW2_text_vectorization.adapt(x_train)

FBoW2_model = tf.keras.Sequential(name="FREQUENCY_BOW_ngrams2", layers=[
    tf.keras.Input(shape=(1,), dtype=tf.string),
    
    FBoW2_text_vectorization,
    
    tf.keras.layers.BatchNormalization(name="Normalization"),
    
    tf.keras.layers.Dense(name="Dense",
                          units=16, 
                          activation="relu"),
    
    tf.keras.layers.Dropout(name="Dropout",
                            rate=0.5),
    
    tf.keras.layers.Dense(name="Output",
                          units=1, 
                          activation="sigmoid")
])

optimizer_FBoW2 = tf.keras.optimizers.legacy.Adam(learning_rate=0.001)

FBoW2_model.compile(
    optimizer=optimizer_FBoW2,
    loss="binary_crossentropy",
    metrics=["accuracy"]
)

FBoW2_model.summary()   

Model: "FREQUENCY_BOW_ngrams2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 TextVectorization_Frequency  (None, 20000)            0         
 _BOW (TextVectorization)                                        
                                                                 
 Normalization (BatchNormali  (None, 20000)            80000     
 zation)                                                         
                                                                 
 Dense (Dense)               (None, 16)                320016    
                                                                 
 Dropout (Dropout)           (None, 16)                0         
                                                                 
 Output (Dense)              (None, 1)                 17        
                                                                 
Total params: 400,033
Trainable params: 360,033
Non-trainable params: 40,000
_________________________________________________________________

In [7]:
# TF-IDF (ngrams 1)

TFIDF_text_vectorization = tf.keras.layers.TextVectorization(name="TextVectorization_TF_IDF",
                                                             max_tokens=max_tokens,
                                                             output_mode="tf_idf")

TFIDF_text_vectorization.adapt(x_train)

TFIDF_model = tf.keras.Sequential(name="TF_IDF", layers=[
    tf.keras.Input(shape=(1,), dtype=tf.string),
    
    TFIDF_text_vectorization,
    
    tf.keras.layers.BatchNormalization(name="Normalization"),
    
    tf.keras.layers.Dense(name="Dense",
                          units=16, 
                          activation="relu"),
    
    tf.keras.layers.Dropout(name="Dropout",
                            rate=0.5),
    
    tf.keras.layers.Dense(name="Output",
                          units=1, 
                          activation="sigmoid")
])

optimizer_TFIDF = tf.keras.optimizers.legacy.Adam(learning_rate=0.001)

TFIDF_model.compile(
    optimizer=optimizer_TFIDF,
    loss="binary_crossentropy",
    metrics=["accuracy"]
)

TFIDF_model.summary()   

Model: "TF_IDF"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 TextVectorization_TF_IDF (T  (None, 20000)            1         
 extVectorization)                                               
                                                                 
 Normalization (BatchNormali  (None, 20000)            80000     
 zation)                                                         
                                                                 
 Dense (Dense)               (None, 16)                320016    
                                                                 
 Dropout (Dropout)           (None, 16)                0         
                                                                 
 Output (Dense)              (None, 1)                 17        
                                                                 
Total params: 400,034
Trainable params: 360,033
Non-trainable params: 40,001
_________________________________________________________________
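
For reference, the idea behind TF-IDF weighting can be sketched in plain Python: a term's count in a document is scaled by how rare the term is across the corpus. The smoothed formula `log(1 + N / (1 + df))` used here is one common variant and an assumption of this sketch; the weights computed by `TextVectorization` may differ in detail. The toy corpus is made up.

```python
import math
from collections import Counter

corpus = [
    "the food was good",
    "the service was bad",
    "the food was bad",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency (assumed variant)."""
    return math.log(1 + n_docs / (1 + doc_freq[term]))

def tf_idf(doc):
    """Term count times inverse document frequency, per term."""
    counts = Counter(doc)
    return {term: count * idf(term) for term, count in counts.items()}

weights = tf_idf(docs[0])
# "the" appears in every document, so it weighs less than the rarer "good".
print(round(weights["the"], 3), round(weights["good"], 3))
```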

In [8]:
# TF-IDF (ngrams 2)

TFIDF2_text_vectorization = tf.keras.layers.TextVectorization(name="TextVectorization_TF_IDF",
                                                              max_tokens=max_tokens,
                                                              output_mode="tf_idf",
                                                              ngrams=2)

TFIDF2_text_vectorization.adapt(x_train)

TFIDF2_model = tf.keras.Sequential(name="TF_IDF_ngrams2", layers=[
    tf.keras.Input(shape=(1,), dtype=tf.string),
    
    TFIDF2_text_vectorization,
    
    tf.keras.layers.BatchNormalization(name="Normalization"),
    
    tf.keras.layers.Dense(name="Dense",
                          units=16, 
                          activation="relu"),
    
    tf.keras.layers.Dropout(name="Dropout",
                            rate=0.5),
    
    tf.keras.layers.Dense(name="Output",
                          units=1, 
                          activation="sigmoid")
])

optimizer_TFIDF2 = tf.keras.optimizers.legacy.Adam(learning_rate=0.001)

TFIDF2_model.compile(
    optimizer=optimizer_TFIDF2,
    loss="binary_crossentropy",
    metrics=["accuracy"]
)

TFIDF2_model.summary()   

Model: "TF_IDF_ngrams2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 TextVectorization_TF_IDF (T  (None, 20000)            1         
 extVectorization)                                               
                                                                 
 Normalization (BatchNormali  (None, 20000)            80000     
 zation)                                                         
                                                                 
 Dense (Dense)               (None, 16)                320016    
                                                                 
 Dropout (Dropout)           (None, 16)                0         
                                                                 
 Output (Dense)              (None, 1)                 17        
                                                                 
Total params: 400,034
Trainable params: 360,033
Non-trainable params: 40,001
_________________________________________________________________

In [9]:
max_length = 350  # 95th percentile of review length

WordEmbedding_text_vectorization = tf.keras.layers.TextVectorization(max_tokens=max_tokens,
                                                                     output_sequence_length=max_length,
                                                                     output_mode="int")

WordEmbedding_text_vectorization.adapt(x_train)

preprocessed_train_ds = train_ds.map(lambda x, y: (WordEmbedding_text_vectorization(x), y))

preprocessed_val_ds = val_ds.map(lambda x, y: (WordEmbedding_text_vectorization(x), y))

preprocessed_test_ds = test_ds.map(lambda x, y: (WordEmbedding_text_vectorization(x), y))

WordEmbedding_model = tf.keras.Sequential(name="Word_Embedding", layers=[
    tf.keras.Input(shape=(max_length,), 
                   dtype=tf.int64),
    
    tf.keras.layers.Embedding(name="Embedding",
                              input_dim=max_tokens,
                              output_dim=128),
    
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=32),
                                  name="LSTM"),
    
    tf.keras.layers.Dropout(name="Dropout",
                            rate=0.5),
    
    tf.keras.layers.Dense(name="Output",
                          units=1, 
                          activation="sigmoid")
])

optimizer_WE = tf.keras.optimizers.legacy.Adam(learning_rate=0.001)

WordEmbedding_model.compile(
    optimizer=optimizer_WE,
    loss="binary_crossentropy",
    metrics=["accuracy"]
)

WordEmbedding_model.summary()

Model: "Word_Embedding"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 Embedding (Embedding)       (None, 350, 128)          2560000   
                                                                 
 LSTM (Bidirectional)        (None, 64)                41216     
                                                                 
 Dropout (Dropout)           (None, 64)                0         
                                                                 
 Output (Dense)              (None, 1)                 65        
                                                                 
Total params: 2,601,281
Trainable params: 2,601,281
Non-trainable params: 0
_________________________________________________________________
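
The `max_length = 350` setting is annotated as a 95th percentile. The notebook does not show how that value was obtained; one plausible way, sketched here with the nearest-rank method on made-up token counts, is:

```python
def percentile_nearest_rank(values, pct):
    """Smallest value such that at least pct% of the data is <= it (pct is an integer)."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100   # integer ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]

# Hypothetical review lengths in tokens (not taken from the Yelp data).
lengths = [12, 40, 55, 80, 95, 120, 150, 210, 340, 900]
print(percentile_nearest_rank(lengths, 95))  # 900
print(percentile_nearest_rank(lengths, 90))  # 340
```

With the real data, `lengths` would hold the token count of each training review. Sequences longer than `max_length` are truncated and shorter ones are padded by `output_sequence_length`.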


In [None]:
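import zipfile
if not os.path.exists("glove.6B.50d.txt"):
    print("Downloading GloVe 6B (zip archive)...")
    url = "https://nlp.stanford.edu/data/glove.6B.zip"  # assumption: vectors are published as one zip archive
    urllib.request.urlretrieve(url, "glove.6B.zip")
    with zipfile.ZipFile("glove.6B.zip") as archive:
        archive.extract("glove.6B.50d.txt")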
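
Each line of the GloVe text file holds a word followed by its vector components, separated by spaces. A minimal sketch of parsing one such line (the sample line is invented and shortened to four dimensions; the real file has 50 values per word, and `parse_glove_line` is a hypothetical helper):

```python
def parse_glove_line(line):
    """Split 'word v1 v2 ...' into the word and a list of floats."""
    word, coefs = line.rstrip().split(maxsplit=1)
    return word, [float(value) for value in coefs.split()]

sample = "pizza 0.12 -0.40 0.33 0.05\n"  # hypothetical values
word, vector = parse_glove_line(sample)
print(word, len(vector))  # pizza 4
```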

In [None]:
max_length = 350  # 95th percentile of review length

WordEmbedding_GloVe_text_vectorization = tf.keras.layers.TextVectorization(max_tokens=max_tokens,
                                                                           output_sequence_length=max_length,
                                                                           output_mode="int")

WordEmbedding_GloVe_text_vectorization.adapt(x_train)

vocab = WordEmbedding_GloVe_text_vectorization.get_vocabulary()
word_index = {word: i for i, word in enumerate(vocab)}

glove_file = "glove.6B.50d.txt"  

embeddings_index = {}
with open(glove_file, encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

embedding_dim = 50  
embedding_matrix = np.zeros((max_tokens, embedding_dim))

for word, i in word_index.items():
    if i < max_tokens:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector


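# Note: these names are shared with the from-scratch cell. Both vectorizers use the same
# settings and are adapted on the same data, so they should yield the same token ids.
preprocessed_train_ds = train_ds.map(lambda x, y: (WordEmbedding_GloVe_text_vectorization(x), y))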

preprocessed_val_ds = val_ds.map(lambda x, y: (WordEmbedding_GloVe_text_vectorization(x), y))

preprocessed_test_ds = test_ds.map(lambda x, y: (WordEmbedding_GloVe_text_vectorization(x), y))

WordEmbedding_GloVe_model = tf.keras.Sequential(name="Word_Embedding_GloVe", layers=[
    tf.keras.Input(shape=(max_length,), 
                   dtype=tf.int64),
    
    tf.keras.layers.Embedding(name="Embedding",
                              input_dim=max_tokens,
                              output_dim=embedding_dim,
                              embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
                              trainable=False,
                              mask_zero=True),
    
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=32),
                                  name="LSTM"),
    
    tf.keras.layers.Dropout(name="Dropout",
                            rate=0.5),
    
    tf.keras.layers.Dense(name="Output",
                          units=1, 
                          activation="sigmoid")
])

optimizer_WEGloVe = tf.keras.optimizers.legacy.Adam(learning_rate=0.001)

WordEmbedding_GloVe_model.compile(
    optimizer=optimizer_WEGloVe,
    loss="binary_crossentropy",
    metrics=["accuracy"]
)

WordEmbedding_GloVe_model.summary()

Model: "Word_Embedding_GloVe"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 Embedding (Embedding)       (None, 350, 50)           1000000   
                                                                 
 LSTM (Bidirectional)        (None, 64)                21248     
                                                                 
 Dropout (Dropout)           (None, 64)                0         
                                                                 
 Output (Dense)              (None, 1)                 65        
                                                                 
Total params: 1,021,313
Trainable params: 21,313
Non-trainable params: 1,000,000
_________________________________________________________________
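
The parameter counts of the two embedding models can be reproduced by hand. An LSTM has four gates, each with an input kernel, a recurrent kernel, and a bias; the bidirectional wrapper doubles that. The sketch below assumes the standard Keras LSTM parameterization, and the helper names are hypothetical:

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def bilstm_params(input_dim, units):
    """Forward and backward LSTM together."""
    return 2 * lstm_params(input_dim, units)

vocab_size, units = 20_000, 32

for name, emb_dim in [("from scratch", 128), ("GloVe 50d", 50)]:
    embedding = vocab_size * emb_dim
    recurrent = bilstm_params(emb_dim, units)
    output = 2 * units + 1          # Dense(1) on the concatenated 64-d state
    print(name, embedding, recurrent, output)
# from scratch 2560000 41216 65
# GloVe 50d 1000000 21248 65
```

Freezing the GloVe embedding leaves only 21,248 + 65 = 21,313 trainable parameters, which matches the summary.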


In [11]:
# Fitting and Evaluation

early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", 
                                                  patience = 2,
                                                  restore_best_weights=True)

bow_data = (train_ds, val_ds, test_ds)
embedding_data = (preprocessed_train_ds, preprocessed_val_ds, preprocessed_test_ds)

# Each entry maps a model name to (model, (train, val, test))
experiments = {
    "BBoW": (BBoW_model, bow_data),
    "BBoW2": (BBoW2_model, bow_data),
    "FBoW": (FBoW_model, bow_data),
    "FBoW2": (FBoW2_model, bow_data),
    "TFIDF": (TFIDF_model, bow_data),
    "TFIDF2": (TFIDF2_model, bow_data),
    "WordEmbedding": (WordEmbedding_model, embedding_data),
    "WordEmbedding_GloVe": (WordEmbedding_GloVe_model, embedding_data),
}

models = list(experiments)

results = {}

for model_name, (model, (train, val, test)) in experiments.items():
    print(f"{model_name} fit:")
    model.fit(train,
              epochs=10,
              validation_data=val,
              callbacks=[early_stopping])

    test_loss, test_accuracy = model.evaluate(test)

    results[model_name] = (test_loss, test_accuracy)

BBoW fit:
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
BBoW2 fit:
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
FBoW fit:
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
FBoW2 fit:
Epoch 1/10
Epoch 2/10
Epoch 3/10
TFIDF fit:
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
TFIDF2 fit:
Epoch 1/10
Epoch 2/10
Epoch 3/10
WordEmbedding fit:
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
WordEmbedding_GloVe fit:
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
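
The logs show training ending before epoch 10 for most models because of the early-stopping callback. Its behaviour with `patience=2` can be sketched as follows; the validation accuracies are invented for illustration, and the function is a simplified stand-in for `tf.keras.callbacks.EarlyStopping`, not its actual implementation.

```python
def early_stopping_epoch(val_accuracies, patience):
    """Return (epoch at which training stops, best epoch), both 1-based."""
    best_value, best_epoch, wait = float("-inf"), 0, 0
    for epoch, value in enumerate(val_accuracies, start=1):
        if value > best_value:
            best_value, best_epoch, wait = value, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_accuracies), best_epoch

# Hypothetical validation accuracies for ten epochs.
history = [0.90, 0.92, 0.915, 0.918, 0.91, 0.90, 0.90, 0.90, 0.90, 0.90]
stopped, best = early_stopping_epoch(history, patience=2)
print(stopped, best)  # 4 2
```

With `restore_best_weights=True`, the weights from the best epoch (epoch 2 in this made-up run) are the ones evaluated on the test set.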


In [12]:
# Results

print(f"{'Model':^20} | {'Test Loss':^10} | {'Test Accuracy':^10}")
print("-" * 46)  

sorted_models = sorted(models, 
                       key=lambda x: results[x][1], 
                       reverse=True)

for model in sorted_models:
    print(f"{model:^20} | {results[model][0]:^10.4f} | {results[model][1]:^10.4f}")

       Model         | Test Loss  | Test Accuracy
----------------------------------------------
       BBoW2         |   0.1838   |   0.9341  
   WordEmbedding     |   0.2271   |   0.9246  
        BBoW         |   0.2057   |   0.9212  
WordEmbedding_GloVe  |   0.2266   |   0.9117  
       TFIDF2        |   0.3539   |   0.9017  
       FBoW2         |   0.3475   |   0.8941  
        FBoW         |   0.4109   |   0.8899  
       TFIDF         |   1.0670   |   0.8341  


Setting ngrams=2 improves test accuracy for all three bag-of-words families (Binary Bag of Words, Frequency Bag of Words, and TF-IDF).
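
The size of the improvement differs considerably between families. A short computation on the accuracies reported in the results table:

```python
# Test accuracies copied from the results table above.
accuracy = {
    "BBoW": 0.9212, "BBoW2": 0.9341,
    "FBoW": 0.8899, "FBoW2": 0.8941,
    "TFIDF": 0.8341, "TFIDF2": 0.9017,
}

for family in ("BBoW", "FBoW", "TFIDF"):
    gain = (accuracy[family + "2"] - accuracy[family]) * 100
    print(f"{family}: +{gain:.2f} percentage points")
# BBoW: +1.29 percentage points
# FBoW: +0.42 percentage points
# TFIDF: +6.76 percentage points
```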

Between the two word-embedding models, the one trained from scratch is slightly more accurate (0.9246 vs. 0.9117).

The TF-IDF model with ngrams=1 performs worst: its accuracy is 10 percentage points below the best model (0.8341 vs. 0.9341 for Binary Bag of Words with ngrams=2), and its loss is also the highest (1.0670).
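
A loss above 1 alongside roughly 83% accuracy suggests that this model makes some very confident wrong predictions, since binary cross-entropy penalizes those heavily. A minimal sketch of the per-example loss with invented probabilities:

```python
import math

def binary_cross_entropy(y_true, p_pred):
    """Per-example binary cross-entropy for a label in {0, 1}."""
    return -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))

# Hypothetical predicted probabilities for a positive review (label 1).
for p in (0.9, 0.4, 0.01):
    print(f"p={p:<5} loss={binary_cross_entropy(1, p):.3f}")
# p=0.9   loss=0.105
# p=0.4   loss=0.916
# p=0.01  loss=4.605
```

A single prediction of 0.01 for a positive review costs more than forty times as much as a prediction of 0.9, so a few such errors can dominate the average loss.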

Overall, apart from TF-IDF with ngrams=1, all models reach a test accuracy of roughly 89% or higher.