# Product Review Classifier



## Methodology

I chose a Bidirectional LSTM for this task. Because the order of the words in a review affects its meaning, I expected a model that reads the text both left to right and right to left to achieve higher accuracy than one that reads in a single direction. The network consists of an embedding layer, followed by the Bidirectional LSTM, followed by a single sigmoid unit for classification, trained with binary cross-entropy loss.

I first needed a way of parsing the corpus to construct a training set. I did not expect that feeding entire reviews, each made up of multiple sentences, directly into the model would work well, as the input sequences would be very long. This is a problem because:

1. With longer sequences we can begin to encounter the vanishing gradient problem, even though LSTMs help to mitigate it
2. Longer sequences mean the network is unrolled over more time steps, leading to longer training times
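
The first point can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that the gradient is multiplied by the same scalar factor at every time step on its way back through the sequence; the function name `gradient_scale` and the factor 0.9 are assumptions, and a real LSTM has a different, data-dependent factor at each step.

```python
def gradient_scale(per_step_factor: float, num_steps: int) -> float:
    """Scale applied to a gradient after flowing back through num_steps steps,
    assuming every step multiplies it by the same scalar factor."""
    return per_step_factor ** num_steps


# Compare a sentence-length sequence with review-length sequences
for steps in (10, 50, 200):
    print(steps, gradient_scale(0.9, steps))
```

Any factor below 1 shrinks the gradient exponentially with sequence length, which is why shorter inputs are easier to train on.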

In addition, reviews can be difficult to assign a single label to, as they often contain both positive and negative elements as well as unannotated sentences, making it ambiguous how best to classify them.

Thus, to circumvent these issues, I decided to process the corpus on a sentence-by-sentence basis. Each annotated sentence is given a label of 1 if the opinion is positive and 0 if the opinion is negative (which is why I use a sigmoid layer for classification). The idea is that anyone wanting to use this system to classify longer texts, such as whole reviews, could simply aggregate the scores for each sentence, then threshold the result.
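
A minimal sketch of that aggregation step is shown here. It assumes the per-sentence sigmoid scores have already been computed, and it picks the mean as the aggregate and 0.5 as the threshold; both choices, and the name `classify_review`, are assumptions rather than part of the notebook.

```python
from statistics import mean


def classify_review(sentence_scores, threshold=0.5):
    """Return 1 (positive) or 0 (negative) for a whole review,
    given the sigmoid score of each of its sentences."""
    if not sentence_scores:
        raise ValueError("need at least one sentence score")
    # Aggregate by averaging, then threshold the aggregate
    return 1 if mean(sentence_scores) >= threshold else 0


# Hypothetical scores for a three-sentence review
example_scores = [0.9, 0.8, 0.3]
print(classify_review(example_scores))
```

Other aggregates, such as a majority vote over thresholded sentences, would fit the same interface.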

In [None]:
import nltk, string, re, random, statistics, math
nltk.download('punkt')
nltk.download('stopwords')
import numpy as np
from nltk.corpus import PlaintextCorpusReader
from nltk.tokenize import word_tokenize
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
def read_data(corpus_root: str, file_pattern: str):
    """
    Read files from a directory and then append the raw data of each file onto a list
    """
    wordlists = PlaintextCorpusReader(corpus_root, file_pattern)
    fileids = wordlists.fileids()

    data = []
    valid_fileids = []

    # Constructing list of tuples in the form (docid, raw doc contents)
    for f_id in fileids:
        if f_id != "README.txt":
            data.append((f_id, wordlists.raw(f_id)))
            valid_fileids.append(f_id)
    
    return data, valid_fileids

def clean(sentence: str):
    """
    Clean the input string
    """
    # Tokenize the document
    sentence_to_add = word_tokenize(sentence)

    # Then, casefolding the document by lowercasing everything
    sentence_to_add = [word.lower() for word in sentence_to_add]

    # Next, removing stop words
    stopwords = set(nltk.corpus.stopwords.words('english'))
    sentence_to_add = [w for w in sentence_to_add if w not in stopwords]

    # In addition, removing punctuation
    punc_table = str.maketrans('', '', string.punctuation)
    sentence_to_add = [w.translate(punc_table) for w in sentence_to_add]

    # The translation above turns punctuation-only tokens into empty strings,
    # so filter those out of the word list
    sentence_to_add = [w for w in sentence_to_add if w]
    
    # Removing tokens that start with a digit, as numbers do not convey meaning
    pattern = r"\d+"
    sentence_to_add = [w for w in sentence_to_add if not re.match(pattern, w)]
    
    return " ".join(sentence_to_add)
    
def preprocess(data: list):
    """
    Pre-process the corpus and construct the training set
    list of (docid, raw text) -> (padded X, Y, vocab_size)
    """
    X = []
    Y = []
    vocabulary = set()
    for docid, text in data:
        # First split on [t] to extract reviews for doc
        reviews = re.split(r'\[t\]', text)
        
        # Next, process each review
        for review in reviews:
            # Split on newline char to retrieve sentences
            sentences = review.split('\n')
            
            # For each sentence, store class in Y if the information is present
            # i.e. [+1] or [-3] otherwise ignore it
            # 1 -> Positive Class
            # 0 -> Negative Class
            for sentence in sentences:
                
                # By splitting on ##, we can divide the sentence into its label and its
                # content
                label_and_content = sentence.split('##')
                
                # Then we must decide what class to assign each sentence
                if len(label_and_content) == 2:
                    label = label_and_content[0]
                    content = label_and_content[1]
                    
                    # Depending on + or -, we also store an appropriate label
                    # If both are present, the label is determined by the sum of the
                    # opinion values
                    if '+' in label and '-' in label:
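                        # Sum the signed opinion strengths, e.g. [+2] and [-1] give +1
                        # ('|' must not appear in the character class, or it would match a literal pipe)
                        opinions = re.findall(r'[+-]\d', label)
                        net_opinion = sum(int(opinion) for opinion in opinions)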
                        
                        if net_opinion > 0:
                            Y.append(1)
                        else:
                            Y.append(0)
                    elif '+' in label:
                        Y.append(1)
                    elif '-' in label:
                        Y.append(0)
                    else:
                        continue
                    
                    # Clean the sentence then store
                    sentence_to_add = clean(content)
                    X.append(sentence_to_add)
                    
                    # We also maintain a vocabulary of the words in the corpus
                    sentence_to_add = sentence_to_add.split(" ")
                    vocabulary.update(sentence_to_add)
    
    # Remove empty tokens left behind after processing
    # discard() does not raise if no empty token is present
    vocabulary.discard('')
    
    while '' in X:
        ind_of_empty = X.index('')
        X.remove('')
        Y.pop(ind_of_empty)
    
    # Map each word to an integer id for sentence-to-sequence conversion.
    # Ids start at 1 because pad_sequences pads with 0; sorting keeps the
    # mapping stable between runs
    word_to_id = {word: i for i, word in enumerate(sorted(vocabulary), start=1)}
    # Add 1 so the Embedding layer's input_dim also covers the padding id 0
    vocab_size = len(word_to_id) + 1
    
    # Now that the labelled sentences in the corpus have been cleaned and stored, we must now
    # vectorize each sentence
    # This is done by replacing each word in the sentence with its id in the vocabulary
    for i in range(len(X)):
        tokens = X[i].split(" ")
        X[i] = [word_to_id[token] for token in tokens]
    
    # Finally, pad the data to ensure uniform dimensions
    # Data matrix is padded to the length of the longest sentence
    max_length = 0
    for example in X:
        max_length = max(max_length, len(example))
    
    padding_type = 'post'
    
    X_padded = pad_sequences(X, maxlen=max_length, padding=padding_type)
    
    return np.array(X_padded), np.array(Y), vocab_size

In [None]:
# Corpus root = product_review location - set appropriately
corpus_root = r"C:\Users\halor\Desktop\NLP_ProductReviewClassifier\product_reviews"
file_pattern = r".*\.txt"

data, fileids = read_data(corpus_root, file_pattern)
X, Y, vocab_size = preprocess(data)
embedding_dim = 100

In [None]:
def compile_model():
    model = Sequential()
    model.add(Embedding(vocab_size, embedding_dim))
    model.add(Bidirectional(LSTM(100)))
    model.add(Dense(1,activation='sigmoid'))
    optimizer = Adam(learning_rate=0.0001)
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

model = compile_model()
# summary() prints the table itself and returns None, so it is not wrapped in print()
model.summary()
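
As a rough cross-check on what the model summary reports, the parameter count of the three layers can be worked out by hand. The sketch below uses the standard parameter formula for an LSTM layer with biases; the helper names `lstm_params` and `model_params` and the example vocabulary size of 5000 are illustrative assumptions, not values taken from the corpus.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Trainable parameters in one LSTM layer with biases."""
    # Four gates (input, forget, cell candidate, output), each with an
    # input kernel, a recurrent kernel and a bias vector
    return 4 * (units * (input_dim + units) + units)


def model_params(vocab_size: int, embedding_dim: int = 100, units: int = 100) -> int:
    """Total parameters for Embedding -> Bidirectional(LSTM) -> Dense(1)."""
    embedding = vocab_size * embedding_dim
    # The bidirectional wrapper holds one forward and one backward LSTM
    bilstm = 2 * lstm_params(embedding_dim, units)
    # The dense layer sees the concatenated forward and backward states
    dense = 2 * units + 1
    return embedding + bilstm + dense


# Hypothetical vocabulary size, for illustration only
print(model_params(vocab_size=5000))
```

With these sizes the embedding layer dominates the count, so the total depends mostly on the size of the vocabulary.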

## Analysis

To examine the model’s performance, I used stratified K-Fold cross-validation. With appropriate hyperparameter selection the model performs well considering the small amount of training data (approx. 2k sentences), achieving an average accuracy of around 72% across 5 folds. The hyperparameters tuned were the learning rate and the number of epochs; the best results were achieved with a learning rate of 0.0001 and 13 to 15 epochs.
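
The notebook does not record how the candidate values were searched, so the following is only one way such tuning could be organised: a small grid search over learning rates and epoch counts. The function `grid_search` and the callable `evaluate` (which would run the cross-validation and return the mean accuracy) are hypothetical names, and `placeholder_evaluate` is a stand-in so the sketch runs on its own; its scores are not real measurements.

```python
from itertools import product


def grid_search(evaluate, learning_rates, epoch_counts):
    """Try every (learning rate, epochs) pair and return the best one
    as a tuple (learning_rate, epochs, accuracy)."""
    best = None
    for learning_rate, epochs in product(learning_rates, epoch_counts):
        accuracy = evaluate(learning_rate, epochs)
        if best is None or accuracy > best[2]:
            best = (learning_rate, epochs, accuracy)
    return best


def placeholder_evaluate(learning_rate, epochs):
    # Stand-in score that simply prefers 3 epochs; NOT a real result.
    # In practice this would train and cross-validate the model.
    return -abs(epochs - 3)


print(grid_search(placeholder_evaluate, [0.001, 0.0001], [1, 3, 5]))
```

In the notebook's setting, `evaluate` would wrap the K-Fold loop and return the average accuracy across folds.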

As previously mentioned, I believe the success of this model can be attributed to its ability to learn the relationships between the tokens in each sentence not just from left to right but also from right to left. This is supported by the fact that the same experiment with a standard (unidirectional) LSTM only achieved an average accuracy of around 60%.

In [None]:
# KFold validation, with k = 5
num_folds = 5
kfold = StratifiedKFold(n_splits=num_folds, shuffle=True)

# The results across all iterations of KFold are compiled to determine the average accuracy
# and the standard deviation
results = []
count = 0
num_epochs = 13
for train, test in kfold.split(X, Y):
    trainX, testX = X[train], X[test]
    trainY, testY = Y[train], Y[test]
    
    # Recompiling the model to reset the weights, then training on the new split
    model = compile_model()
    history = model.fit(trainX, trainY, validation_data=(testX, testY), epochs=num_epochs, batch_size=64)
    
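    # Take the validation accuracy of the final epoch; older Keras versions
    # name the metric 'val_acc' rather than 'val_accuracy'
    try:
        accuracy = history.history['val_accuracy'][-1]
    except KeyError:
        accuracy = history.history['val_acc'][-1]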
        
    print('Iteration: #{} Accuracy: {}%'.format(count + 1, accuracy * 100))
    results.append(accuracy)
    count += 1

print('\nAverage accuracy: {}%'.format(statistics.mean(results) * 100))
print('Standard deviation: {}'.format(statistics.stdev(results)))