# Introduction
This notebook presents the development of two text sentiment classifiers: one leveraging Word2Vec embeddings and the other based on a Convolutional Neural Network (CNN) architecture.

## About the Dataset
The project uses a dataset containing consumer reviews of products purchased from e-commerce platforms. The dataset is publicly available on [Kaggle](https://www.kaggle.com/datasets/fredericods/ptbr-sentiment-analysis-datasets).

This dataset is composed of several CSV files. With the exception of the concatenated.csv file, which combines all the other files, each file represents data from a different e-commerce platform. Among the available columns, the following will be used in this notebook:
* review_text: the text containing the consumer's review of the product;
* polarity: indicates whether the review is positive (value 1) or negative (value 0).

## CNNs for Text Analysis
Although originally designed for image analysis, Convolutional Neural Networks (CNNs) have proven useful for Natural Language Processing (NLP), as discussed in the books *Deep Learning for Natural Language Processing* (RAAIJMAKERS, 2019) and *Deep Learning for NLP and Speech Recognition* (UDAY KAMATH et al., 2019).
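
As a rough illustration of how a convolution applies to text, the sketch below slides a single filter over a sequence of token embeddings and then takes the maximum response. It is a simplified NumPy version with made-up random values. The helper `conv1d_valid` and the toy shapes are assumptions for illustration only; the model defined later in this notebook uses Keras `Conv1D` layers with `padding='same'` instead.

```python
import numpy as np

def conv1d_valid(embeddings, kernel):
    """Slide a kernel of shape (width, emb_dim) over a (n_tokens, emb_dim) matrix."""
    width = kernel.shape[0]
    n_windows = embeddings.shape[0] - width + 1
    # Each output value summarises one window of `width` consecutive tokens.
    return np.array([np.sum(embeddings[i:i + width] * kernel) for i in range(n_windows)])

# Hypothetical toy input: 5 tokens, each with a 4-dimensional embedding.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 4))
kernel = rng.normal(size=(3, 4))  # one filter spanning 3 consecutive tokens

feature_map = conv1d_valid(embeddings, kernel)
pooled = feature_map.max()  # max pooling keeps the strongest filter response
print(feature_map.shape)    # (3,)
```

Each filter acts like a learned detector for short token patterns, which is why stacking several of them can capture sentiment-bearing phrases.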

# Install libraries


In [None]:
!pip install gensim
!pip install spacy
!pip install pycaret==3.0.0rc8 # This version is pinned because earlier PyCaret releases are incompatible with gensim and spaCy
!pip install tqdm
!pip install pyspark


# Downloading and Loading the Word2Vec Model

In [None]:
!mkdir -p models/word2vec
!curl -o cbow_s300.zip "http://143.107.183.175:22980/download.php?file=embeddings/word2vec/cbow_s300.zip"
!mv cbow_s300.zip models/word2vec
!cd models/word2vec ; unzip -o -e  cbow_s300.zip

In [None]:
import gensim
import os
def load_word2vec_model():
    # Convert the text-format vectors to gensim's native format once;
    # later runs load the native file directly.
    if not os.path.isfile('models/word2vec/cbow_s300'):
        model=gensim.models.KeyedVectors.load_word2vec_format('models/word2vec/cbow_s300.txt')
        model.save('models/word2vec/cbow_s300')
    model_word2vec=gensim.models.KeyedVectors.load('models/word2vec/cbow_s300')
    return model_word2vec

model_word2vec=load_word2vec_model()

# Loading the Dataset



In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
!mkdir data
!cp /content/drive/MyDrive/Colab\ Notebooks/braz_port_sent_ana_data/archive.zip data
!cd data ; unzip -o -e archive.zip

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
def get_dataset():
    df=pd.read_csv("data/concatenated.csv")
    df=df[['review_text','polarity']].dropna()
    data = list(zip(df['review_text'], df['polarity']))
    dataset={ 'all_dataset':data,'max_rating':df['polarity'].max()}
    return dataset

def get_dataset_of_train_and_test():
    data=get_dataset()
    train, test = train_test_split(data['all_dataset'], test_size=0.3, random_state=0)
    dataset_train_test={'train':train,'test':test,'max_rating':data['max_rating']}
    return dataset_train_test


In [None]:
from pyspark.sql import SparkSession
import pyspark
from pyspark.sql.functions import col, udf
import pyspark.sql.functions as F
from pyspark.sql.functions import expr
from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.mllib.evaluation import RegressionMetrics
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.mllib.tree import RandomForest
from pyspark.ml.regression import RandomForestRegressor
import os
os.environ["SPARK_LOCAL_DIRS"]='spark_temp'

MAX_MEMORY = "10g"

spark = SparkSession.builder.appName('sparkdf341')\
        .config("spark.executor.memory", MAX_MEMORY)\
        .config("spark.driver.memory", MAX_MEMORY)\
        .config("spark.local.dir",os.environ["SPARK_LOCAL_DIRS"])\
        .getOrCreate()

#spark.sparkContext.setLogLevel('ERROR')


# Creating a Sentiment Classifier with Word2Vec
## Creating a Dataset with the Word2Vec and Polarity Columns



In [None]:
import numpy as np
from functools import partial
import shutil
import pathlib
import tqdm
import pandas as pd
def text_to_word2vec_sum(word2vec_model,text):
    # Sum the embeddings of every in-vocabulary token; out-of-vocabulary tokens are skipped.
    # np.float was removed in NumPy 1.24, so np.float64 is used instead.
    word_vec=np.zeros(word2vec_model.vector_size,dtype=np.float64)
    words=gensim.utils.simple_preprocess(str(text),min_len=1,max_len=25, deacc=False)
    for word in words:
        if(word in word2vec_model):
            word_vec=word2vec_model[word]+word_vec
    return word_vec.tolist()


dataset=get_dataset()

def apply_wordtovec_to_dataset(dataset,model_word2vec):
    shutil.rmtree("data/dataset_word2vec",ignore_errors=True)
    pathlib.Path("data/dataset_word2vec").mkdir(parents=True, exist_ok=True)
    buffer_to_save_df_in_dict={'polarity':[],'word2vec':[]}
    number_of_subdf=0
    for row in tqdm.tqdm(dataset):
        array=text_to_word2vec_sum(model_word2vec,row[0])
        buffer_to_save_df_in_dict['word2vec'].append(array)
        buffer_to_save_df_in_dict['polarity'].append(row[1])
        if(len(buffer_to_save_df_in_dict['polarity'])>1000):
            sub_df_name="data/dataset_word2vec/"+str(number_of_subdf)+'.parquet'
            pd.DataFrame.from_dict(buffer_to_save_df_in_dict).to_parquet(sub_df_name)
            number_of_subdf=number_of_subdf+1
            buffer_to_save_df_in_dict={'polarity':[],'word2vec':[]}

    if(len(buffer_to_save_df_in_dict['polarity'])>0):
        sub_df_name="data/dataset_word2vec/"+str(number_of_subdf)+'.parquet'
        pd.DataFrame.from_dict(buffer_to_save_df_in_dict).to_parquet(sub_df_name)


apply_wordtovec_to_dataset(dataset['all_dataset'],model_word2vec)



### Explanation of the Word2Vec Column Creation


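The Word2Vec column is derived from a preprocessing operation applied to the product review text, referred to here as *Word2Vec sum*. The review is tokenized, each token found in the model's vocabulary is replaced by its embedding vector, and these vectors are added element-wise. Tokens missing from the vocabulary are ignored, so every review becomes a single fixed-length vector regardless of how many words it contains.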
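
The operation can be sketched on a toy vocabulary as follows. The 3-dimensional vectors, the tokens, and the helper `sum_word_vectors` are hypothetical and exist only for illustration; the notebook itself uses the vectors of the `cbow_s300` model loaded with gensim.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the real model.
toy_vectors = {
    "produto": np.array([1.0, 0.0, 2.0]),
    "otimo": np.array([0.5, 1.0, -1.0]),
}

def sum_word_vectors(tokens, vectors, dim=3):
    total = np.zeros(dim, dtype=np.float64)
    for token in tokens:
        if token in vectors:  # out-of-vocabulary tokens are skipped
            total += vectors[token]
    return total

# "xyz" is not in the toy vocabulary, so it does not contribute to the sum.
review_vector = sum_word_vectors(["produto", "otimo", "xyz"], toy_vectors)
print(review_vector.tolist())  # [1.5, 1.0, 1.0]
```

Because the vectors are summed rather than averaged, longer reviews tend to produce vectors with larger magnitude, and word order is discarded.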

### Loading the Word2Vec Dataset

In [None]:
df_word2vec=spark.read.parquet('data/dataset_word2vec')

## Creating a Sentiment Classifier for the Word2Vec Dataset
### Using PyCaret to Determine the Best Classification Algorithm
Load a sample of the Word2Vec dataset into a Pandas DataFrame.

In [None]:
import tqdm
def pass_spark_df_word2vec_to_pandas_df(spark_df):
    df_with_dataset_word2vec_in_dict={'polarity':[]}
    firstIteration=True
    with tqdm.tqdm(total=spark_df.count()) as pbar:
        for iterator in spark_df.collect():
            word2vec=iterator['word2vec']
            if(firstIteration):
                for i in range(len(word2vec)):
                    df_with_dataset_word2vec_in_dict['col'+str(i)]=[]
                firstIteration=False

            for i in range(len(word2vec)):
                df_with_dataset_word2vec_in_dict['col'+str(i)].append(word2vec[i])
            df_with_dataset_word2vec_in_dict['polarity'].append(iterator['polarity'])
            pbar.update(1)
    return pd.DataFrame.from_dict(df_with_dataset_word2vec_in_dict)

df_word2vec_pd=pass_spark_df_word2vec_to_pandas_df(df_word2vec.sample(0.04))
df_word2vec_pd



Apply PyCaret to the sample

In [None]:
from pycaret.classification import *
clf1 = setup(data = df_word2vec_pd, target = 'polarity')
compare_models(exclude=['knn','gbc'])

Based on the PyCaret comparison, the best classification algorithm is the *Light Gradient Boosting Machine* (LightGBM). The closest equivalent in PySpark's ML package is the gradient-boosted trees classifier (`GBTClassifier`). PySpark will be used to create the final machine learning model, as the full dataset exceeds the available RAM.

### Creating the Final Classifier Using the PySpark Machine Learning Package

In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors,VectorUDT
convert_array_to_vector_udf= udf(lambda x:  Vectors.dense(list(x)),VectorUDT())
df_word2vec_mlib=df_word2vec.withColumn('word2vec_vector',convert_array_to_vector_udf(col('word2vec')))

df_train, df_test = df_word2vec_mlib.randomSplit(weights=[0.8,0.2], seed=200)

In [None]:
from pyspark.ml.classification import GBTClassifier

gbc=GBTClassifier(featuresCol='word2vec_vector', labelCol='polarity')
model = gbc.fit(df_train)


In [None]:
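# mode('overwrite') lets this cell be re-run without failing on existing output.
gb_predictions = model.transform(df_test)
gb_predictions.write.mode('overwrite').parquet('predictions_wordvec_test.parquet')
gb_predictions_train = model.transform(df_train)
gb_predictions_train.write.mode('overwrite').parquet('predictions_wordvec_train.parquet')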


In [None]:
# Reload the saved predictions without overwriting df_word2vec.
gb_predictions=spark.read.parquet('predictions_wordvec_test.parquet')
gb_predictions_train=spark.read.parquet('predictions_wordvec_train.parquet')

#### Analyzing Prediction Quality on the Test Set







In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from sklearn.metrics import confusion_matrix
import pandas as pd

evaluator = BinaryClassificationEvaluator(labelCol='polarity',rawPredictionCol='probability')
auroc = evaluator.evaluate(gb_predictions, {evaluator.metricName: "areaUnderROC"})
print("Area under ROC Curve: {:.4f}".format(auroc))

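# Collect both columns in a single pass so labels and predictions stay aligned row by row.
rows = gb_predictions.select('polarity', 'prediction').collect()
y_true = [int(row['polarity']) for row in rows]
y_pred = [int(row['prediction']) for row in rows]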


Confusion Matrix

In [None]:
pd.DataFrame(confusion_matrix(y_true,y_pred),columns=[['predicted','predicted'],['Polarity 0','Polarity 1']], index=[['actual','actual'],['Polarity 0','Polarity 1']])


The model shows a reasonably good AUC. However, the confusion matrix indicates a high number of false positives. This suggests raising the decision threshold, so that only reviews with a high predicted probability of being positive are classified as positive.
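
The effect of a stricter decision threshold can be sketched as follows. The labels, probabilities, and helper names are hypothetical and are not taken from the model above; the point is only to show how false positives trade off against false negatives as the threshold rises.

```python
def classify(positive_probabilities, threshold):
    # Label a review positive only if its probability reaches the threshold.
    return [1 if p >= threshold else 0 for p in positive_probabilities]

def count_errors(y_true, y_pred):
    false_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return false_positives, false_negatives

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [0, 0, 0, 1, 1, 1]
p_pos = [0.20, 0.55, 0.65, 0.60, 0.80, 0.95]

for threshold in (0.5, 0.7):
    fp, fn = count_errors(y_true, classify(p_pos, threshold))
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

In this toy data, moving the threshold from 0.5 to 0.7 removes both false positives at the cost of one false negative. A suitable value for the real model would have to be chosen on validation data.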

# Creating a Sentiment Classifier Using Neural Networks
## Creating the Data Generator

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import GRU
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import GlobalMaxPooling1D
from tensorflow.keras.layers import MaxPool1D
from tensorflow.keras.layers import Embedding, Input,Concatenate
from tensorflow.keras.utils import Sequence
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
import random
import tqdm

class DataGenerator(tf.keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, df,max_ratings,batch_size,vocab_size,nwords, shuffle=True):
        self.__df = df
        self.__batch_size = batch_size
        self.__shuffle = shuffle
        self.__vocab_size=vocab_size
        self.__nwords=nwords
        self.__max_ratings=max_ratings

    def __len__(self):
        return int(np.floor(len(self.__df) / self.__batch_size))

    def get_vocab_size(self):
        return int(self.__vocab_size)

    def get_nwords(self):
        return int(self.__nwords)

    def get_nclasses(self):
        return int(self.__max_ratings)+1

    def __getitem__(self, index):
        df = self.__df[index*self.__batch_size:(index+1)*self.__batch_size]
        x=np.zeros((len(df),self.__nwords),dtype=np.int32)
        y=np.zeros((len(df),int(self.__max_ratings)+1),dtype=np.float64)  # np.float was removed in NumPy 1.24
        for i in range(len(df)):
            row=df[i]
            text=row[0]
            text_np = self.__data_generation(text)
            x[i,:]=text_np
            y[i,int(row[1])]=1
        return x, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        if self.__shuffle == True:
            random.shuffle(self.__df)

    def __data_generation(self, text):

        text_enc=one_hot(text,self.__vocab_size)
        padded_reviews = pad_sequences([text_enc],maxlen=self.get_nwords(),padding='post')[0]
        return padded_reviews




## Define model

In [None]:
def create_model_cnn(datagen:DataGenerator):

    number_of_classes=datagen.get_nclasses()
    deep_inputs = Input(shape=(datagen.get_nwords(),))
    emb=Embedding(input_dim=datagen.get_vocab_size(),output_dim=128)(deep_inputs)


    conv1=Conv1D(32,3,padding='same',strides=1,activation='relu')(emb)
    maxPool1=MaxPool1D(pool_size=2,strides=2)(conv1)
    drop1=Dropout(0.2)(maxPool1)

    conv2=Conv1D(32,2,padding='same',strides=1,activation='relu')(drop1)
    maxPool2=MaxPool1D(pool_size=2,strides=2)(conv2)
    drop2=Dropout(0.2)(maxPool2)

    conv3=Conv1D(32,2,padding='same',strides=1,activation='relu')(drop2)
    maxPool3=MaxPool1D(pool_size=2,strides=2)(conv3)
    drop3=Dropout(0.2)(maxPool3)

    conv4=Conv1D(32,2,padding='same',strides=1,activation='relu')(drop3)
    maxPool4=MaxPool1D(pool_size=2,strides=2)(conv4)
    drop4=Dropout(0.2)(maxPool4)

    conv5=Conv1D(32,2,padding='same',strides=1,activation='relu')(drop4)
    maxPool5=MaxPool1D(pool_size=2,strides=2)(conv5)
    drop5=Dropout(0.2)(maxPool5)

    conv6=Conv1D(32,2,padding='same',strides=1,activation='relu')(drop5)
    maxPool6=MaxPool1D(pool_size=2,strides=2)(conv6)
    drop6=Dropout(0.2)(maxPool6)

    fla=Flatten()(drop6)
    den1=Dense(2000, activation='relu')(fla)
    drop_den1=Dropout(0.2)(den1)
    den2=Dense(2000, activation='relu')(drop_den1)
    drop_den2=Dropout(0.2)(den2)
    out=Dense(number_of_classes, activation='softmax')(drop_den2)
    model=Model(inputs=deep_inputs, outputs=out)
    # Adam optimizer with a small learning rate
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
    # The data generator yields one-hot label vectors, so categorical
    # (not sparse categorical) cross-entropy is the matching loss.

    model.compile(loss=tf.keras.losses.CategoricalCrossentropy(),
                   optimizer=optimizer,
                   metrics=['accuracy'])


    return model


## Creating a Classifier Using a CNN

In [None]:
batch_size=64
vocab_size = 50000
datasets=get_dataset_of_train_and_test()
datagen_train=DataGenerator(datasets['train'],max_ratings=datasets['max_rating'],batch_size=batch_size,vocab_size= vocab_size,nwords=256)
datagen_test=DataGenerator(datasets['test'],max_ratings=datasets['max_rating'],batch_size=batch_size,vocab_size= vocab_size,nwords=256)
model = create_model_cnn(datagen_train)

In [None]:
number_of_epochs = 5
history = model.fit(datagen_train,epochs=number_of_epochs,validation_data=datagen_test)

### Analyzing Prediction Quality on the Test Set

AUC

In [None]:
from sklearn.metrics import roc_auc_score
y_test=[]
y_test_prob=[]

for i in tqdm.tqdm(range(len(datagen_test))):
    x,y=datagen_test[i]
    prob=model.predict(x,verbose=0)
    for j in range(len(prob)):
        # Column 1 holds the one-hot flag / predicted probability of the positive class.
        y_test.append(int(y[j][1]))
        y_test_prob.append(prob[j,1])
roc_auc_score(y_test, y_test_prob)

Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

# Classify a review as positive when its predicted probability exceeds 0.5.
y_pred=[1 if p>0.5 else 0 for p in y_test_prob]

pd.DataFrame(confusion_matrix(y_test,y_pred),columns=[['predicted','predicted'],['Polarity 0','Polarity 1']], index=[['actual','actual'],['Polarity 0','Polarity 1']])

The model achieves an AUC close to 1, which is the maximum value. The confusion matrix corresponds to a good accuracy (93%) along with strong precision (94%) and recall (97%) values.
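
For reference, accuracy, precision, and recall follow directly from the four cells of a binary confusion matrix. The sketch below uses hypothetical counts, not the counts obtained in this notebook, and the helper name is an assumption made for illustration.

```python
def metrics_from_confusion_matrix(tn, fp, fn, tp):
    # Accuracy: share of all reviews classified correctly.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp)
    # Recall: share of true positives that the model found.
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Hypothetical counts for illustration only.
accuracy, precision, recall = metrics_from_confusion_matrix(tn=40, fp=10, fn=5, tp=45)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

The function assumes at least one predicted positive and one actual positive; otherwise the divisions would need a guard against zero.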

# Conclusion

This notebook compares two approaches for creating a sentiment classifier for text data. The first approach used gradient boosting to build a classifier over review texts encoded with Word2Vec sum. The second approach employed a CNN. Based on the results, the second approach demonstrated better performance.

For future work, it would be interesting to assess the computational cost of both models when making predictions. This is particularly important when deploying machine learning algorithms in production environments.
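
One possible starting point for such a comparison is to time repeated prediction calls on the same batch. The sketch below is only an assumption about how this could be done. `dummy_predict` is a stand-in, and in practice it would be replaced by a function wrapping either model's prediction call.

```python
import time

def mean_latency_seconds(predict_fn, batch, repeats=10):
    predict_fn(batch)  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    for _ in range(repeats):
        predict_fn(batch)
    return (time.perf_counter() - start) / repeats

# Stand-in predictor used only to make the sketch runnable.
def dummy_predict(batch):
    return [len(text) % 2 for text in batch]

latency = mean_latency_seconds(dummy_predict, ["bom produto", "nao gostei"])
print(f"{latency:.6f} s per batch")
```

A fair comparison would also include preprocessing time, since the Word2Vec approach has to compute the summed vector before the classifier runs.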

# Bibliographic references

RAAIJMAKERS, S. Deep Learning for Natural Language Processing. [s.l.] Manning Publications Company, 2019.

UDAY KAMATH et al. Deep Learning for NLP and Speech Recognition. Cham: Springer International Publishing, 2019.