# NLP - Assignment 1 - Q1, Q2, Q3

Amaya Syed - 190805496

In [11]:
import csv
import re
import os

import numpy as np
import pandas as pd  

from sklearn.svm import LinearSVC
from nltk.classify import SklearnClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

from random import shuffle
from random import seed
from random import randrange
from collections import Counter

# Question 1

The aim of this assignment is to classify a corpus of 21 000 Amazon reviews, equally distributed between "real" or "fake" reviews. where fake is defined as being reviews written for a product with the intent of receiving financial compensation for doing so. The reviews have been previously analysed and manually labelled by the company itself. The corpus also contains information regarding each reviews: rating, verified purchase, product category, product ID, product title, review title.

This task is therefore fundamentally one of deception detection and detection by an NLP algorithm of the stylistc differences that might arise between spontaneous reviews left by consummers and reviews that are written motivated by a potential monetary gain.

We will first implement the functions detailed in Question 1, 2, and 3 of this assignement and then build on them to try better our classification accuracy by using the features provided with the reviews and which might potentially give us better insight into the differences between real and fake reviews. 

The very first task for an NLP task is to preprocess the input text in such a way as to make it usable for the machine learning algorithm of our choice. To do so one must first load the data (*cf.* **loadData** function), in this case a tab delimitated text file, with utf8 coding. The data is loaded review by review, with all the features associated to said review kept in a list. The **parseReview** function then accepts as input each review (also indicated as *line*) and keep only the review ID as an integer, labels as real or fake and review text as string, passes the label names to real or fake, as well as passing the review ID from string to integer. Each element of the list is then assigned to a separate variable: *Id*, *Text* and *Label* and returned by **parseReview** within **loadData**. Each triple so returned is then appended to the *rawData* variable and kept. 

In [12]:
# Question 1:

# a) Convert line from input file into an id/text/label tuple


def str_to_int(element):
    
    """
    Auxiliary function which takes an element of a list and tries to pass it from string to integer. If it can't it returns the original element.'
    """
    try:
        element = int(element) # trying to convert string to integer
    except ValueError: # if Value Error is raised because the string can't be passed to integer, then pass and return original input
        pass
    return element


def parseReview(line):
    
    """
    Returns a triple of an integer, a string containing the review, and a string indicating the label.
  
    """
    indices = [0, 1, 8] # create a list of indices to keep in the review list
    label_dict = {'__label1__':'fake', '__label2__':'real'} # creating a dictionary assigning labels to more descriptive terms. 
            
    review = [line[index] for index in indices] # 2) keep id, label, text using the indices
    review = [label_dict.get(x, x) for x in review] # 3) passing the label names to real or fake
    review = [str_to_int(x) for x in review] # 4) pass id to int using the conv function defined above
    Id, Label, Text = review # 5) assign each element of the list to a separate variable
        
    return Id, Text, Label


The next step is to tokenise the text review, which we will do in the **preProcess** function, which accepts as input an element of the *Text* list of *rawData* and is called from inside the **spliData** function. First space is added between words and punctionion, then the text is split using white space as the divider and tokens are all passed to lower case and returned as list of strings, where each token is an element. No other preprocessing, such as lemmatization, stop-word removal or punctuation removal was attempted at this stage. 

In [13]:
        
# b) TEXT PREPROCESSING AND FEATURE VECTORIZATION        

# Input: a string of one  review
def preProcess(text):
    
    order = 1
    # Simple tokenisation
    text = re.sub(r"(\w)([.,;:!?'\"\)])", r"\1 \2", text) # add a space between word and the punctuation in between square bracket
    text = re.sub(r"([.,;:!?'\"\(])(\w)", r"\1 \2", text) # add a space between punctuation and word
     
    tokens = re.split(r"\s+", text) # divide the text into tokens using white space as the divider
    tokens = [t.lower() for t in tokens] # pass all words to lower case
    #tokens = ['<s>'] * (order-1) + tokens + ['</s>'] # add beginning to pad for order > 2 and ending sequence for each review. 
    
    return tokens


# Question 2

The tokenised text of a review can now be passed as input to the **toFeatureVector** function, which initialises a global dictionary (*featureDict*) and a local dictionary (*token_frequency*) of tokens present in the reviews. We then go over all the elements in the token list and check if they're in the local and global dictionary: if they aren't we add them as dictionary keys with a frequency of one, whilst if they are, we just add one to that count. We therefore obtain a local dictionary with the tokens counts for a given review and a global dictionary with the token counts for all reviews. We return the local token frequency dictionary within **splitData**, which associates it to its given label and splits the data into a training and testing sets using the lenght of *rawData* and a chosen percentage input by the user. 

In [14]:
featureDict = {} # A global dictionary of features

def toFeatureVector(tokens):
    ""    
    ""
    token_frequency = {} # creating a local dictionary for token frequency in the review 

    for token in tokens: # for each word in the review
    
        if token not in token_frequency.keys(): 
            token_frequency[token] = 1  # if the word is not within the dict we add it as key = 1
        else:
            token_frequency[token] += 1  # if its already in the dic, add one
            
        if token not in featureDict.keys(): # same for global dictionary
             featureDict[token] = 1
        else:
            featureDict[token] += 1
        
    return token_frequency


In [15]:
# load data from a file and append it to the rawData

def loadData(path, Text=None):
    
    with open(path, encoding='utf8') as f: # open file
        
        reader = csv.reader(f, delimiter='\t')
        next(reader)
        
        for line in reader: # each line corresponds to a review and its associated features
            
            (Id, Text, Label) = parseReview(line) # keep the id, text, label triple for a review
            
            rawData.append((Id, Text, Label)) # keep the triple for all reviews
            


def splitData(percentage):
    
    # A method to split the data between trainData and testData 
    
    dataSamples = len(rawData) #lenght of the dataset
    halfOfData = int(len(rawData)/2)
    trainingSamples = int((percentage*dataSamples)/2) # size of training set based on percentage chosen by user. 
    
    for (_, Text, Label) in rawData[:trainingSamples] + rawData[halfOfData:halfOfData+trainingSamples]:
        trainData.append((toFeatureVector(preProcess(Text)),Label))
        
    for (_, Text, Label) in rawData[trainingSamples:halfOfData] + rawData[halfOfData+trainingSamples:]:
        testData.append((toFeatureVector(preProcess(Text)),Label))           


In [16]:
# TRAINING AND VALIDATING OUR CLASSIFIER
def trainClassifier(trainData):
    print("Training Classifier...")
    pipeline =  Pipeline([('svc', LinearSVC(max_iter=3500))])
    return SklearnClassifier(pipeline).train(trainData)

# Question 3

Now that our data is organised as a dictionary of token frequencies for each review, we can use it to train our Support Vector Machine (SVM) classifier on the training set. We must first define the **crossValidate** function which will take our training data and successively split it into a training and a hold out set. The size of the hold out set will depend on the number of folds we wish to divide our data into and is a parameter of the function. The cross validation will then loop over the data creating a foldsize split for the test data, which moves across the data length and a training set composed of all the data except the hold out set. The classifier is training on the training set and then its predictive abilities tested on the hold out set - measuring precision, recall and fscore. We therefore retrain our classifier for each different split and obtain different accuracy results on the testing set: the accuracy results are then presented as the average of the results obtained in each split.

In [17]:
def crossValidate(trainData, folds):

    shuffle(trainData) # reshuffle dataset as currently its ordered into fake, then real reviews
    cv_results = [] # initiate list to keep cv results for each fold
    foldSize = int(len(trainData) / folds) # initiate fold size by dividing training data by # folds.
    
    for i in range(0, len(trainData), foldSize): # data goes from i to lenght of data in foldsize jumps.
        
        cv_test = trainData[i : i + foldSize] # from i to i + foldsize -> for 1st loop test data will be 0 to i_foldsize, 2nd loop from i_foldize to i_(foldsize*2), therefore going over the whole training data in foldsize chunks. 
        cv_train  = trainData[0 : i] + trainData[i + foldSize : ] # for 1st loop from 0 to 0 + i_foldsize, 2nd loop from 0 to i_foldsize + i_(foldsize*2) to end, 3rd loop 0 to i_(foldsize*2) + i_(foldsize*3) to end, etc
        
        classifier = trainClassifier(cv_train)  # train the SVM classifier
        testTrue = [t[1] for t in cv_test]   # get the ground-truth labels from the data
        testPred = predictLabels(cv_test, classifier)  # classify the test data to get predicted labels
        precision, recall, fscore, _ = precision_recall_fscore_support(testTrue, testPred, average='weighted') # evaluate
        cv_results.append((precision, recall, fscore)) # append results obtained for each training set 
    cv_results = (np.mean(np.array(cv_results), axis = 0)) # average the cv results for precision, recall and fscore
   # print(cv_results)
    print("Done training!")
    print("Precision %f\nRecall: %f\nF Score:%f" % (cv_results[0], cv_results[:1], cv_results[2]))   
    
    return cv_results

In [18]:
# PREDICTING LABELS GIVEN A CLASSIFIER

def predictLabels(reviewSamples, classifier):
    return classifier.classify_many(map(lambda t: t[0], reviewSamples))

The MAIN part of the code is below and calls all the functions defined above to train the SVM classifier using cross validation to check the consistency of our findings.

In [19]:
# MAIN

folds = 10
# loading reviews
# initialize global lists that will be appended to by the methods below
rawData = []          # the filtered data from the dataset file (should be 21000 samples)
trainData = []        # the pre-processed training data as a percentage of the total dataset (currently 80%, or 16800 samples)
testData = []         # the pre-processed test data as a percentage of the total dataset (currently 20%, or 4200 samples)

# the output classes
fakeLabel = 'fake'
realLabel = 'real'

# references to the data files
reviewPath = 'amazon_reviews.txt'

# Do the actual stuff (i.e. call the functions we've made)
# We parse the dataset and put it in a raw data list
print("Now %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Preparing the dataset...",sep='\n')
loadData(reviewPath) 

# We split the raw dataset into a set of training data and a set of test data (80/20)
# You do the cross validation on the 80% (training data)
# We print the number of training samples and the number of features before the split
print("Now %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Preparing training and test data...",sep='\n')

splitData(0.8)

# We print the number of training samples and the number of features after the split
print("After split, %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Training Samples: ", len(trainData), "Features: ", len(featureDict), sep='\n')

# QUESTION 3 - Make sure there is a function call here to the
# crossValidate function on the training set to get your results

crossValidate(trainData, folds)



Now 0 rawData, 0 trainData, 0 testData
Preparing the dataset...
Now 21000 rawData, 0 trainData, 0 testData
Preparing training and test data...
After split, 21000 rawData, 16800 trainData, 4200 testData
Training Samples: 
16800
Features: 
43900
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Done training!
Precision 0.618166
Recall: 0.618166
F Score:0.617615


array([0.61816613, 0.6177381 , 0.61761458])

We obtain results which correspond to a 10 % increase above that expected by always naively predicting the most frequent outcome, as we know there is no class imbalance between labels and the minimum accuracy obtained should be 50%. There is therefore still a significant amount of work to be done to obtain higher accuracy results. 

# Evaluate on test set

Now that we have checked our accuracy results are robust by using cross-validation and that the functions defined all work and intermesh as they should, we will train our classifier on all the *trainData* available and test in on the *testData* obtained from **splitData**. We present the results in terms of precision, recall and fscore, as above. We broadly obtain similar results to the ones obtained by cross-validation, although they are one point lower. 

In [20]:
# Finally, check the accuracy of your classifier by training on all the tranin data
# and testing on the test set
# Will only work once all functions are complete
functions_complete = True  # set to True once you're happy with your methods for cross val
if functions_complete:
    print(testData[0])   # have a look at the first test data instance
    classifier = trainClassifier(trainData)  # train the classifier
    testTrue = [t[1] for t in testData]   # get the ground-truth labels from the data
    testPred = predictLabels(testData, classifier)  #Â classify the test data to get predicted labels
    finalScores = precision_recall_fscore_support(testTrue, testPred, average='weighted') # evaluate
    print("Done training!")
    print("Precision: %f\nRecall: %f\nF Score:%f" % finalScores[:3])


({'this': 2, 'assortment': 1, 'is': 1, 'really': 1, 'hershey': 1, "'": 1, 's': 1, 'at': 1, 'their': 1, 'best': 1, '.': 2, 'the': 2, 'little': 1, 'ones': 1, 'are': 1, 'always': 1, 'excited': 1, 'whenever': 1, 'holidays': 1, 'come': 1, 'because': 1, 'of': 1}, 'fake')
Training Classifier...
Done training!
Precision: 0.603607
Recall: 0.603571
F Score:0.603537
