# NLP - Assignment 1 - Q4

Amaya Syed - 190805496

In [1]:
import csv
import os
import numpy as np
import pandas as pd
from random import shuffle

import nltk
from nltk.corpus import stopwords
from sklearn.svm import LinearSVC
from nltk.classify import SklearnClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.pipeline import Pipeline
from nltk.stem import WordNetLemmatizer 

# Question 4

Now that we have established a low benchmark with which to compare accuracy results, we can start trying to incorporate different elements into the function defined in part 1 to study their effect on overall accuracy.

In [2]:
def str_to_int(element):
    
    """
    Auxiliary function that tries to convert an element of a list from string to integer. If the conversion fails, it returns the original element.
    """
    try:
        element = int(element) # trying to convert string to integer
    except ValueError: # if Value Error is raised because the string can't be passed to integer, then pass and return original input
        pass
    return element


def parseReview(line):
    
    """
    Returns a triple of an integer, a string containing the review, and a string indicating the label.
  
    """
    indices = [0, 1, 8] # create a list of indices to keep in the review list
    label_dict = {'__label1__':'fake', '__label2__':'real'} # creating a dictionary assigning labels to more descriptive terms. 
            
    review = [line[index] for index in indices] # 2) keep id, label, text using the indices
    review = [label_dict.get(x, x) for x in review] # 3) passing the label names to real or fake
    review = [str_to_int(x) for x in review] # 4) pass id to int using the conv function defined above
    Id, Label, Text = review # 5) assign each element of the list to a separate variable
        
    return Id, Text, Label


The first thing we can do to improve the accuracy is to improve the preprocessing. As mentioned earlier, possible strategies for this are lemmatization, stop-word removal or punctuation removal. We implement all three of these strategies:

- Lemmatisation is the technique whereby we convert a word to its base form, which reduces the sparseness of the word counts. We implement lemmatisation by using the WordNetLemmatizer class from NLTK and downloading the known lemmatisations of English words. 

- Stop word removal consists of removing connective words from the text, thereby removing frequent words that carry little meaning. This also reduces sparseness. 

- Punctuation removal, as the name says, is simply stripping the text of any punctuation marks, which reduces sparseness and removes potentially unimportant information.

A caveat to the above descriptions is that because the task is fake review detection, removing this information might actually hurt our classification as small stylistic differences might be the main differences between a fake and a real review, at least if one is solely looking at the review text itself.

In addition to this we can also add bigram or even trigram tokens to our unigram tokens, which might help detect patterns of speech between reviews. 
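
As a rough illustration of the options described above, the sketch below strips punctuation, optionally drops stop words, and appends bigrams to the unigrams. It uses only the standard library: `simple_preprocess` and the tiny `STOP_WORDS` set are hypothetical stand-ins introduced here (NLTK's stop word list is much larger), and lemmatisation is left out entirely.

```python
import string

# Hypothetical, deliberately tiny stop word list for illustration only;
# NLTK's stopwords.words('english') is much larger.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "at"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def simple_preprocess(text, remove_stop_words=False, add_bigrams=True):
    """Lowercase, strip punctuation, optionally drop stop words, add bigrams."""
    tokens = []
    for raw in text.split():
        token = raw.lower().translate(_PUNCT_TABLE)
        if not token:  # the token consisted only of punctuation
            continue
        if remove_stop_words and token in STOP_WORDS:
            continue
        tokens.append(token)

    if add_bigrams:
        # Pair each token with its successor to form the bigrams.
        bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
        return bigrams + tokens
    return tokens


print(simple_preprocess("The chocolate is great!"))
```

Keeping each step behind a flag makes it easy to switch one strategy on or off and rerun the cross-validation to see its effect in isolation.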

In [3]:
# b) TEXT PREPROCESSING AND FEATURE VECTORIZATION        

# Input: a string of one  review
def preProcess(text):
    
    # we can try lemmatizing the words and check effect on accuracy - lemmatizing converts a word to its base form.
    
    # Initialising the WordNet lemmatizer
    #lemmatizer = WordNetLemmatizer()

    tokens = []
    output = []  
    
    #stop_words = set(stopwords.words('english')) # removing stop words actually decreases accuracy slightly
    
    for token in text.split():
        
        #if token not in stop_words: # stop word removal
        
        #tokens.append(lemmatizer.lemmatize(token.lower())) # lemmatization of the tokens
        tokens.append(token.lower())
        
    # build the bigrams once, after all tokens have been collected, then append the unigrams
    output = [' '.join(bigram) for bigram in nltk.bigrams(tokens)] + tokens
    # adding trigrams doesn't do anything for the accuracy. Using bigrams and unigrams adds a marginal increment in accuracy, but is slower to run. 
        
    return output

In our **toFeatureVector** function we have implemented a simple count of token frequency for each review. A more sophisticated version would use a different weighting scheme, such as term frequency–inverse document frequency (TF-IDF), amongst other potential schemes. However, we were not able to implement any of these schemes in a way that made sense and did not take an extremely long time to compute. We therefore decided to continue with count data. 
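
For reference, a TF-IDF weighting can be layered on top of count dictionaries with two passes over the corpus. The sketch below is a minimal, standard-library illustration and not the notebook's method: `tf_idf_vectors` is a hypothetical helper, and the smoothed idf formula used is only one common variant. In a real experiment the document frequencies should be computed on the training data only.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenised_docs):
    """Turn a list of token lists into a list of {token: tf-idf weight} dicts."""
    n_docs = len(tokenised_docs)

    # First pass: in how many documents does each token appear?
    doc_freq = Counter()
    for tokens in tokenised_docs:
        doc_freq.update(set(tokens))

    # Second pass: weight each raw count by a smoothed inverse document frequency.
    vectors = []
    for tokens in tokenised_docs:
        counts = Counter(tokens)
        vectors.append({
            token: count * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1)
            for token, count in counts.items()
        })
    return vectors


docs = [["great", "product"], ["great", "great", "price"]]
for vector in tf_idf_vectors(docs):
    print(vector)
```

Both passes are linear in the total number of tokens, so the cost should be comparable to the plain counting already done in **toFeatureVector**.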

In [4]:
featureDict = {}

def toFeatureVector(tokens):
    
    token_frequency = {} # creating a local dictionary for token frequency in the review
    
    for token in tokens: # for each token in the review
    
        # count the token within this review (unseen tokens start from 0)
        token_frequency[token] = token_frequency.get(token, 0) + 1
    
        # also count it in the global feature dictionary
        featureDict[token] = featureDict.get(token, 0) + 1
                
    return token_frequency


In [5]:
# load data from a file and append it to the rawData


def loadData(path, Text=None):
    
    with open(path, encoding='utf8') as f: # open file
        
        reader = csv.reader(f, delimiter='\t')
        next(reader)
        
        for line in reader: # each line corresponds to a review and its associated features
            
            (Id, Text, Label) = parseReview(line) # keep the id, text, label triple for a review
            
            rawData.append((Id, Text, Label)) # keep the triple for all reviews
            

def splitData(percentage):
    
    # A method to split the data between trainData and testData 
    
    dataSamples = len(rawData) # length of the dataset
    halfOfData = int(len(rawData)/2)
    trainingSamples = int((percentage*dataSamples)/2) # size of training set based on percentage chosen by user. 
    
    for (_, Text, Label) in rawData[:trainingSamples] + rawData[halfOfData:halfOfData+trainingSamples]:
        trainData.append((toFeatureVector(preProcess(Text)),Label))
        
    for (_, Text, Label) in rawData[trainingSamples:halfOfData] + rawData[halfOfData+trainingSamples:]:
        testData.append((toFeatureVector(preProcess(Text)),Label))           


To continue tweaking the accuracy we can change some parameter values of the SVM classifier. In our case, after running the classifier with different values of *C* we settled on a value of **0.001**. The *C* parameter essentially acts as a regularisation parameter which controls the tradeoff between minimising training error and minimising the norm of the weights, where a small value of *C* means more regularisation (smaller weights). In practical terms it means the hyperplane dividing the data will have a larger margin if the value of *C* is smaller. 
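
The role of *C* can be read off the objective that a linear SVM minimises. The form below is a sketch assuming an L2 penalty and the squared hinge loss, which are the `LinearSVC` defaults; the symbols $w$, $b$, $x_i$ and $y_i \in \{-1, +1\}$ are notation introduced here rather than taken from the code:

$$
\min_{w,\,b}\; \frac{1}{2}\lVert w \rVert^2 \;+\; C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\,(w^\top x_i + b)\bigr)^2
$$

With a small $C$ the second term (the training error) carries little weight, so the optimiser favours a small $\lVert w \rVert$, which corresponds to a wider margin of $2 / \lVert w \rVert$.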

In [6]:
# TRAINING AND VALIDATING OUR CLASSIFIER
def trainClassifier(trainData):
    print("Training Classifier...")
    pipeline =  Pipeline([('svc', LinearSVC(max_iter=3000, C = 0.001))])
    return SklearnClassifier(pipeline).train(trainData)

In [7]:
def crossValidate(trainData, folds):

    shuffle(trainData) # reshuffle dataset as currently its ordered into fake, then real reviews
    cv_results = [] # initiate list to keep cv results for each fold
    foldSize = int(len(trainData) / folds) # initiate fold size by dividing training data by # folds.
    
    for i in range(0, len(trainData), foldSize): # i steps from 0 to the length of the data in foldSize jumps
        
        cv_test = trainData[i : i + foldSize] # the held-out fold: items i to i + foldSize, so successive loops cover the whole training data in foldSize chunks
        cv_train  = trainData[0 : i] + trainData[i + foldSize : ] # everything outside the held-out fold: items before i plus items from i + foldSize to the end
        
        classifier = trainClassifier(cv_train)  # train the SVM classifier
        testTrue = [t[1] for t in cv_test]   # get the ground-truth labels from the data
        testPred = predictLabels(cv_test, classifier)  # classify the test data to get predicted labels
        precision, recall, fscore, _ = precision_recall_fscore_support(testTrue, testPred, average='weighted') # evaluate
        cv_results.append((precision, recall, fscore)) # append results obtained for each training set 
    cv_results = (np.mean(np.array(cv_results), axis = 0)) # average the cv results for precision, recall and fscore
   # print(cv_results)
    print("Done training!")
    print("Precision: %f\nRecall: %f\nF Score:%f" % (cv_results[0], cv_results[1], cv_results[2]))   
    
    return cv_results
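
The fold slicing used in `crossValidate` can be isolated in a small helper. The sketch below is a hypothetical alternative (`k_fold_slices` is not part of the notebook) that always yields exactly `folds` train/test pairs, even when the data size is not a multiple of `folds`, by giving the first few folds one extra item each.

```python
def k_fold_slices(data, folds):
    """Yield (train, test) pairs covering `data` in exactly `folds` chunks."""
    base, remainder = divmod(len(data), folds)
    start = 0
    for fold in range(folds):
        # The first `remainder` folds get one extra item each.
        size = base + (1 if fold < remainder else 0)
        end = start + size
        test = data[start:end]
        train = data[:start] + data[end:]
        yield train, test
        start = end


for train, test in k_fold_slices(list(range(7)), 3):
    print(test, train)
```

With 16800 training samples and 10 folds the remainder is zero, so this would produce the same folds as the loop above; the difference only shows up for sizes that do not divide evenly.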

In [8]:
# PREDICTING LABELS GIVEN A CLASSIFIER

def predictLabels(reviewSamples, classifier):
    return classifier.classify_many(map(lambda t: t[0], reviewSamples))

The MAIN part of the code is below and calls all the functions defined above to train the SVM classifier using cross validation to check the consistency of our findings.

In [9]:
# MAIN

folds = 10
# loading reviews
# initialize global lists that will be appended to by the methods below
rawData = []          # the filtered data from the dataset file (should be 21000 samples)
trainData = []        # the pre-processed training data as a percentage of the total dataset (currently 80%, or 16800 samples)
testData = []         # the pre-processed test data as a percentage of the total dataset (currently 20%, or 4200 samples)

# the output classes
fakeLabel = 'fake'
realLabel = 'real'

# references to the data files
reviewPath = 'amazon_reviews.txt'

# Do the actual stuff (i.e. call the functions we've made)
# We parse the dataset and put it in a raw data list
print("Now %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Preparing the dataset...",sep='\n')
loadData(reviewPath) 

# We split the raw dataset into a set of training data and a set of test data (80/20)
# You do the cross validation on the 80% (training data)
# We print the number of training samples and the number of features before the split
print("Now %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Preparing training and test data...",sep='\n')

splitData(0.8)

# We print the number of training samples and the number of features after the split
print("After split, %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Training Samples: ", len(trainData), "Features: ", len(featureDict), sep='\n')

# QUESTION 3 - Make sure there is a function call here to the
# crossValidate function on the training set to get your results

crossValidate(trainData, folds)



Now 0 rawData, 0 trainData, 0 testData
Preparing the dataset...
Now 21000 rawData, 0 trainData, 0 testData
Preparing training and test data...
After split, 21000 rawData, 16800 trainData, 4200 testData
Training Samples: 
16800
Features: 
592379
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Done training!
Precision: 0.652254
Recall: 0.651131
F Score:0.650863


array([0.6522536 , 0.65113095, 0.6508634 ])

We obtain results roughly 15 percentage points above what would be expected from always naively predicting the most frequent outcome: we know there is no class imbalance between labels, so the minimum accuracy obtained should be 50%. There is therefore still a significant amount of work to be done to obtain higher accuracy results. 
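
The 50% figure is the accuracy of a majority-class baseline on balanced labels. A minimal sketch of that check, using a hypothetical helper `majority_baseline_accuracy` and toy labels rather than the real dataset:

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


toy_labels = ["fake"] * 5 + ["real"] * 5  # balanced toy labels
print(majority_baseline_accuracy(toy_labels))
```

Running the same helper on the labels in `rawData` would confirm the baseline for the actual dataset.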

# Evaluate on test set

Now that we have checked our accuracy results are robust by using cross-validation and that the functions defined all work and intermesh as they should, we will train our classifier on all the *trainData* available and test it on the *testData* obtained from **splitData**. We present the results in terms of precision, recall and F score, as above. We obtain broadly similar results to the ones obtained by cross-validation, although they are about two points lower. 
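
To make the reported numbers easier to interpret, the sketch below computes support-weighted precision, recall and F score by hand. `weighted_prf` is a hypothetical helper written for illustration; it is intended to mirror what `average='weighted'` does in simple cases, but that equivalence is an assumption and has not been checked against the library here.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted average of per-class precision, recall and F1."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = fscore = 0.0
    for label, n_true in support.items():
        # True positives and number of predictions for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        p_cls = tp / n_pred if n_pred else 0.0
        r_cls = tp / n_true
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if (p_cls + r_cls) else 0.0
        # Each class contributes in proportion to its share of the true labels.
        weight = n_true / total
        precision += weight * p_cls
        recall += weight * r_cls
        fscore += weight * f_cls
    return precision, recall, fscore


y_true = ["fake", "fake", "real", "real"]
y_pred = ["fake", "real", "real", "real"]
print(weighted_prf(y_true, y_pred))
```

Note that support-weighted recall is the same quantity as plain accuracy, since each class's recall is multiplied by its share of the true labels.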

**1- Testing on changes in preprocessing**

First we check the effect on accuracy of changes in the preprocessing, namely lemmatization, stop word removal and adding bigrams.

- Lemmatization + unigrams:
    - Precision: 0.588811
    - Recall: 0.588810
    - F Score: 0.588808

- Lemmatization + stop word removal + unigrams:
    - Precision: 0.577639
    - Recall: 0.577619
    - F Score: 0.577591

- Lemmatization + bigrams + unigrams:
    - Precision: 0.603337
    - Recall: 0.603333
    - F Score: 0.603330

- Lemmatization + stop word removal + bigrams + unigrams:
    - Precision: 0.605477
    - Recall: 0.605476
    - F Score: 0.605475

The changes in preprocessing have not affected the accuracy in any significant way. Going forward we will use bigrams and unigrams, but no lemmatization or stop word removal. 

- Preprocessing: just bigrams and unigrams, SVC with C = 0.001:
    - Precision: 0.628835
    - Recall: 0.628333
    - F Score: 0.627971

In [10]:
# Finally, check the accuracy of your classifier by training on all the training data
# and testing on the test set
# Will only work once all functions are complete
functions_complete = True  # set to True once you're happy with your methods for cross val
if functions_complete:
    print(testData[0])   # have a look at the first test data instance
    classifier = trainClassifier(trainData)  # train the classifier
    testTrue = [t[1] for t in testData]   # get the ground-truth labels from the data
    testPred = predictLabels(testData, classifier)  # classify the test data to get predicted labels
    finalScores = precision_recall_fscore_support(testTrue, testPred, average='weighted') # evaluate
    print("Done training!")
    print("Precision: %f\nRecall: %f\nF Score:%f" % finalScores[:3])


({'this assortment': 1, 'assortment is': 1, 'is really': 1, "really hershey's": 1, "hershey's at": 1, 'at their': 1, 'their best.': 1, 'best. the': 1, 'the little': 1, 'little ones': 1, 'ones are': 1, 'are always': 1, 'always excited': 1, 'excited whenever': 1, 'whenever the': 1, 'the holidays': 1, 'holidays come': 1, 'come because': 1, 'because of': 1, 'of this.': 1, 'this': 1, 'assortment': 1, 'is': 1, 'really': 1, "hershey's": 1, 'at': 1, 'their': 1, 'best.': 1, 'the': 2, 'little': 1, 'ones': 1, 'are': 1, 'always': 1, 'excited': 1, 'whenever': 1, 'holidays': 1, 'come': 1, 'because': 1, 'of': 1, 'this.': 1}, 'fake')
Training Classifier...
Done training!
Precision: 0.628835
Recall: 0.628333
F Score:0.627971
