# Text Mining Project Work (Team 7)

**Toxic Comment Classification with Naive Bayes and DistilBERT**

_Prof. Gianluca Moro, Prof. Giacomo Frisoni – DISI, University of Bologna_

name.surname@unibo.it


**Bologna Business School** - Alma Mater Studiorum Università di Bologna

## Instructions
- The provided exercises must be executed by the students of Team 7.
- At the end, the file must contain all the required results (as code cell outputs) along with all the commands necessary to reproduce them.
- The function of every command or group of related commands must be documented clearly and concisely.
- The submission deadline is March 18th, 2024.
- When finished, one team member will send the notebook file (having .ipynb extension) via mail (using your BBS email account) to the teacher (giacomo.frisoni@unibo.it) indicating “[BBS Teamwork] Your last names” as subject, also keeping an own copy of the file for safety.
- You are allowed to consult the teaching material and to search the Web for quick reference.
- If still in doubt about anything, ask the teacher.
- It is severely NOT allowed to communicate with other teams. Ask the teacher for any clarification about the exercises.
- Each correctly developed point counts 2/30.

## Setup

Run the following cells to import some necessary packages and download all the needed files.

In [1]:
import os
from urllib.request import urlretrieve
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.linear_model import LogisticRegression

In [2]:
def download(file, url):
    if not os.path.exists(file):
        urlretrieve(url, file)

In [3]:
download("reviews-electronics.csv.gz", "https://www.dropbox.com/s/5aidj1ns3wiuchi/reviews-electronics.csv.gz?dl=1")
download("reviews-home.csv.gz", "https://www.dropbox.com/s/9dlvc0nntibibk3/reviews-home.csv.gz?dl=1")
download("reviews-books.csv.gz", "https://www.dropbox.com/s/otbdd2u7x9ylzku/reviews-books.csv.gz?dl=1")

In [4]:
download("positive-words.txt", "https://www.dropbox.com/s/pmju477pv8ayzho/positive-words.txt?dl=1")
download("negative-words.txt", "https://www.dropbox.com/s/yy4l1ezlrsar8cf/negative-words.txt?dl=1")

## Exercises

**1)** Load the dataset from the file `reviews-electronics.csv.gz` into a new dataframe named `reviews_A`. Next, import the dataset from the file `reviews-home.csv.gz` into a new dataframe named `reviews_B`, and from the file `reviews-books.csv.gz` into a new dataframe named `reviews_C`. Lastly, extract the opinion word lists from the files `positive-words.txt` and `negative-words.txt`, assigning them to two new variables `pos_words` and `neg_words` respectively.

In [5]:
# Load dataset from reviews-electronics.csv.gz into reviews_A dataframe
reviews_A = pd.read_csv('reviews-electronics.csv.gz', sep='\t', engine='python')

# Import dataset from reviews-home.csv.gz into reviews_B dataframe
reviews_B = pd.read_csv('reviews-home.csv.gz', sep='\t', engine='python')

# Import dataset from reviews-books.csv.gz into reviews_C dataframe
reviews_C = pd.read_csv('reviews-books.csv.gz', sep='\t', engine='python')

# Extract opinion word lists from positive-words.txt and negative-words.txt
def extract_opinion_words(file):
    with open(file, 'r', encoding='latin1') as f:
        words = f.read().splitlines()
    return words

pos_words = extract_opinion_words('positive-words.txt')
neg_words = extract_opinion_words('negative-words.txt')

**2)** Print the first five rows of the three `reviews_X` datasets. Then, print their cardinality and the distribution of the `label` feature.

In [6]:
# Print the first five rows of reviews_A dataset
print("First five rows of reviews_A:")
print(reviews_A.head(5))

# Print the first five rows of reviews_B dataset
print("\nFirst five rows of reviews_B:")
print(reviews_B.head(5))

# Print the first five rows of reviews_C dataset
print("\nFirst five rows of reviews_C:")
print(reviews_C.head(5))

# Print the cardinality of reviews_A dataset
print("\nCardinality of reviews_A:", len(reviews_A))

# Print the cardinality of reviews_B dataset
print("\nCardinality of reviews_B:", len(reviews_B))

# Print the cardinality of reviews_C dataset
print("\nCardinality of reviews_C:", len(reviews_C))

# Print the distribution of the label feature in reviews_A dataset
print("\nDistribution of label feature in reviews_A:")
print(reviews_A['label'].value_counts())

# Print the distribution of the label feature in reviews_B dataspad_sequenceset
print("\nDistribution of label feature in reviews_B:")
print(reviews_B['label'].value_counts())

# Print the distribution of the label feature in reviews_C dataset
print("\nDistribution of label feature in reviews_C:")
print(reviews_C['label'].value_counts())

First five rows of reviews_A:
  label                                               text
0   pos  We got this GPS for my husband who is an (OTR)...
1   neg  I'm a professional OTR truck driver, and I bou...
2   pos  This adaptor is real easy to setup and use rig...
3   neg  I've had mine for a year and here's what we go...
4   pos  This product really works great but I found th...

First five rows of reviews_B:
  label                                               text
0   pos  This book is a must have if you get a Zoku (wh...
1   neg  I did sloppy shopping. This machine is exactly...
2   pos  This book is so beautifully illustrated and ea...
3   neg  If you type the wrong word in, it &#34;might&#...
4   pos  This beautifully illustrated book featuring te...

First five rows of reviews_C:
  label                                               text
0   pos  This is one my must have books. It is a master...
1   neg  As Amin Rihani described his own friend Gibran...
2   pos  This book prov

**3)** Divide the `reviews_A` dataset into a train set and a test set by choosing the initial half of reviews for training and the latter half for testing.

In [7]:
from sklearn.model_selection import train_test_split

# Divide the reviews_A dataset into train and test sets
train_reviews_A, test_reviews_A = train_test_split(reviews_A, test_size=0.5, random_state=42)

# Print the sizes of the train and test sets
print("Size of train set: ", len(train_reviews_A))
print("Size of test set: ", len(test_reviews_A))

Size of train set:  5000
Size of test set:  5000


**4)** Classify the reviews in the `reviews_A` test set by first assigning to each instance a score equal to the sum of scores of known words within it. Then, return `"pos"` for reviews with a positive score, and `"neg"` for reviews with a negative or neutral score.

Score each word:
 - -1 if it is found in negative words list
 - -2 if it is found in negative word list and it is preceded by the word "very"
 - +1 if it is found in positive words list
 - +2 if it is found in positive word list and it is preceded by the word "very"

Start with the setup of NLTK and the definition of the scoring function.
Then, apply the function to all the instances in the `reviews_A` test set.
Finally, compare the obtained labels with the known ones and compute the accuracy as the ratio of matches.

In [8]:
import nltk
from nltk.tokenize import word_tokenize

# Setup NLTK
nltk.download('punkt')

# Define scoring function
def score_review(review):
    score = 0
    words = word_tokenize(review.lower())
    for i in range(len(words)):
        if words[i] in neg_words:
            if i > 0 and words[i-1] == 'very':
                score -= 2
            else:
                score -= 1
        elif words[i] in pos_words:
            if i > 0 and words[i-1] == 'very':
                score += 2
            else:
                score += 1
    return score

# Apply scoring function to reviews_A test set
predicted_labels = []
for review in reviews_A['text']:
    score = score_review(review)
    if score > 0:
        predicted_labels.append('pos')
    else:
        predicted_labels.append('neg')

# Compare predicted labels with known labels and compute accuracy
matches = sum(predicted_labels == reviews_A['label'])
accuracy = matches / len(reviews_A)

print("Accuracy: ", accuracy)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Accuracy:  0.7204


**5)** Create a pipeline including a `CountVectorizer` to convert reviews into word count vectors (excluding words that appear in less than 3 documents) and a `LogisticRegression` model.

In [9]:
# Create the pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(min_df=3)),
    ('classifier', LogisticRegression())
])

**6)** Train the model on all `reviews_B` data.

In [10]:
pipeline.fit(reviews_B['text'], reviews_B['label'])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


**7)** Evaluate the model on the `reviews_A` test set.

In [11]:
predicted_labels = pipeline.predict(reviews_A['text'])

# Compare predicted labels with known labels and compute accuracy
matches = sum(predicted_labels == reviews_A['label'])
accuracy = matches / len(reviews_A)

print("Accuracy: ", accuracy)

Accuracy:  0.8498


**8)** Create a new pipeline as above, but replacing the `CountVectorizer` with a `TfidfVectorizer`, and the `LogisticRegression` model with a `MultinomialNB` one.

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create the new pipeline
new_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(min_df=3)),
    ('classifier', MultinomialNB())
])

**9)** Train the new pipeline on all the `reviews_B` data and evaluate the resulting model on the `reviews_A` test set.

In [13]:
# Train the new pipeline on all the reviews_B data
new_pipeline.fit(reviews_B['text'], reviews_B['label'])

# Evaluate the resulting model on the reviews_A test set
predicted_labels = new_pipeline.predict(reviews_A['text'])

# Compare predicted labels with known labels and compute accuracy
matches = sum(predicted_labels == reviews_A['label'])
accuracy = matches / len(reviews_A)

print("Accuracy: ", accuracy)

Accuracy:  0.7254


**10)** Repeat points (8) and (9) but set the `ngram_range` parameter of the `TfidfVectorizer` to use only bigrams.

In [14]:
# Create the new pipeline with bigram n-grams
new_pipeline_bigram = Pipeline([
    ('vectorizer', TfidfVectorizer(min_df=3, ngram_range=(2, 2))),
    ('classifier', MultinomialNB())
])

# Train the new pipeline on all the reviews_B data
new_pipeline_bigram.fit(reviews_B['text'], reviews_B['label'])

# Evaluate the resulting model on the reviews_A test set
predicted_labels_bigram = new_pipeline_bigram.predict(reviews_A['text'])

# Compare predicted labels with known labels and compute accuracy
matches = sum(predicted_labels_bigram == reviews_A['label'])
accuracy = matches / len(reviews_A)

print("Accuracy: ", accuracy)

Accuracy:  0.8478


**11)** Repeat the evaluation of the three models above, this time on the `reviews_C` data.

In [15]:
# Evaluate the first pipeline on the reviews_C dataset
predicted_labels_first = pipeline.predict(reviews_C['text'])

# Compare predicted labels with known labels and compute accuracy
matches = sum(predicted_labels_first == reviews_A['label'])
accuracy = matches / len(reviews_A)

print("Accuracy for first model: ", accuracy)

# Evaluate the second pipeline on the reviews_C dataset
predicted_labels_second = new_pipeline.predict(reviews_C['text'])

# Compare predicted labels with known labels and compute accuracy
matches = sum(predicted_labels_second == reviews_A['label'])
accuracy = matches / len(reviews_A)

print("Accuracy for second model: ", accuracy)

# Evaluate the third pipeline on the reviews_C dataset
predicted_labels_third = new_pipeline_bigram.predict(reviews_C['text'])

# Compare predicted labels with known labels and compute accuracy
matches = sum(predicted_labels_third == reviews_A['label'])
accuracy = matches / len(reviews_A)

print("Accuracy for third model: ", accuracy)

Accuracy for first model:  0.6997
Accuracy for second model:  0.7199
Accuracy for third model:  0.7243


**12)** Tokenize the `reviews_A` train instances and use them to build a 300-dimensional Word2Vec vector space using a window size equals to 5 and excluding all the terms that appear less than 7 times.

In [16]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# Tokenize the reviews_A train instances
tokenized_reviews = [word_tokenize(review) for review in train_reviews_A['text']]

# Build the Word2Vec model
word2vec_model = Word2Vec(tokenized_reviews, vector_size=300, window=5, min_count=7)

**13)** Convert the tokenized training reviews into a list of lists of term indices in the Word2Vec model, leaving out terms not present in the model.

In [17]:
term_indices = []
for review in tokenized_reviews:
    indices = [word2vec_model.wv.key_to_index[word] for word in review if word in word2vec_model.wv.key_to_index] # The "if" statement filters out words that are not present in the word2vec model's vocabulary, ensuring that only valid words are used to retrieve their corresponding indices.
    term_indices.append(indices)

**14)** Make all index sequences of the same length (250 words for each review), trimming longer sequences to that size and padding shorter sequences with zero values.

In [18]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Set the maximum sequence length
max_sequence_length = 250

# Pad sequences to the maximum length
padded_sequences = pad_sequences(term_indices, maxlen=max_sequence_length, padding='post', truncating='post', value=0)
# padding=” post”: add the zeros at the end of the sequence to make the samples in the same size
# truncating=”post” setting this truncating parameter as post means that when a sentence exceeds the number of maximum words drop the last words in the sentence instead of the default setting which drops the words from the beginning of the sentence.

**15)** Train an LSTM or GRU neural network of your choice on the training sequences defined above. Finally, assess the neural network on `reviews_A` test reviews. Try to maximize the accuracy on test data.

In [19]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from numpy.random import seed

# Setting the seed
seed(99)
tf.random.set_seed(99)

# Convert labels to numerical values
label_encoder = LabelEncoder() # Using LabelEncoder to convert categorical data into numerical format.
encoded_labels = label_encoder.fit_transform(train_reviews_A['label'])

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(padded_sequences, encoded_labels, test_size=0.2, random_state=42)

# Define the LSTM model
lstm_model = Sequential() # Sequential model is used to create a linear stack of layers for building the neural network architecture.
lstm_model.add(Embedding(len(word2vec_model.wv.key_to_index), 300, input_length=max_sequence_length, weights=[word2vec_model.wv.vectors], trainable=False))
lstm_model.add(LSTM(64)) # LSTM is chosen here over GRU for potential advantages in capturing long-term dependencies in sequential data.
lstm_model.add(Dense(1, activation='sigmoid')) # Using sigmoid activation function and a single-node dense layer for binary classification task.

# Compile the model
lstm_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) # Binary cross-entropy is used as the loss function because it is suitable for binary classification tasks, effectively measuring the difference between predicted and true binary outcomes.

# Train the model
lstm_model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32) # Validation data is used to evaluate the model's performance on unseen data during training, helping to monitor for overfitting and providing insights into generalization capability.

# Evaluate the model on the test set
test_sequences = [word_tokenize(review) for review in reviews_A['text']]
test_term_indices = []
for review in test_sequences:
    indices = [word2vec_model.wv.key_to_index[word] for word in review if word in word2vec_model.wv.key_to_index] # The "if" statement filters out words that are not present in the word2vec model's vocabulary, ensuring that only valid words are used to retrieve their corresponding indices.
    test_term_indices.append(indices)

padded_test_sequences = pad_sequences(test_term_indices, maxlen=max_sequence_length, padding='post', truncating='post', value=0)
encoded_test_labels = label_encoder.transform(reviews_A['label'])
loss, accuracy = lstm_model.evaluate(padded_test_sequences, encoded_test_labels)

# After experimentation, lowering the number of hidden layers in the LSTM from 128 to 64 and setting the seed to 99 can improve accuracy up to 0.8
print("Accuracy: ", accuracy)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Accuracy:  0.8230999708175659
