# NLP Assignement 1 - Q5

Amaya Syed - 190805496

In [15]:
import csv
import os
import numpy as np
import pandas as pd
from random import shuffle

import nltk
from sklearn.svm import LinearSVC
from nltk.classify import SklearnClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.pipeline import Pipeline
from nltk.stem import WordNetLemmatizer 

To continue improving accuracy the next step is to add some of the other features available in the corpus. First we will load the data and take a look at the features available. We will also check, at a qualitative level, if there are any obvious class imbalances which might guide us in selecting appropriate features for the classifier.  

In [16]:
# loading data using pandas and taking a look at the first few lines
data = pd.read_csv("amazon_reviews.txt", delimiter = "\t")
data.head()

Unnamed: 0,DOC_ID,LABEL,RATING,VERIFIED_PURCHASE,PRODUCT_CATEGORY,PRODUCT_ID,PRODUCT_TITLE,REVIEW_TITLE,REVIEW_TEXT
0,1,__label1__,4,N,PC,B00008NG7N,"Targus PAUK10U Ultra Mini USB Keypad, Black",useful,"When least you think so, this product will sav..."
1,2,__label1__,4,Y,Wireless,B00LH0Y3NM,Note 3 Battery : Stalion Strength Replacement ...,New era for batteries,Lithium batteries are something new introduced...
2,3,__label1__,3,N,Baby,B000I5UZ1Q,"Fisher-Price Papasan Cradle Swing, Starlight",doesn't swing very well.,I purchased this swing for my baby. She is 6 m...
3,4,__label1__,4,N,Office Products,B003822IRA,Casio MS-80B Standard Function Desktop Calculator,Great computing!,I was looking for an inexpensive desk calcolat...
4,5,__label1__,4,N,Beauty,B00PWSAXAM,Shine Whitening - Zero Peroxide Teeth Whitenin...,Only use twice a week,I only use it twice a week and the results are...


In [17]:
# checking for inbalances in the dataset for fake or real reviews, in procduct category, rating, verified purchase and review length
label_prod_counts = data.groupby(data["LABEL"]).PRODUCT_CATEGORY.value_counts()
label_rating_counts = data.groupby(data["LABEL"]).RATING.value_counts()
rating_label_counts = data.groupby(data["RATING"]).LABEL.value_counts()
verified_purchase_label_counts = data.groupby(data["LABEL"]).VERIFIED_PURCHASE.value_counts()
rating_vp_counts = data.groupby(data["RATING"]).VERIFIED_PURCHASE.value_counts()

data['review_length'] = data['REVIEW_TEXT'].apply(len)
data.groupby(["LABEL"]).review_length.mean()

# label_prod_counts - 350 reviews in each product category. 
# label_rating_counts - roughly same number of reviews for each rating 
# verified_purchase_label_counts - WAY MORE real reviews in verified purchase. 
# review_length is slighlty longer for fake reviews 

LABEL
__label1__    316.550000
__label2__    428.102857
Name: review_length, dtype: float64

The imbalance in the verified purchase vs label is the only one to really jump out here, although it seems the review length is slightly longer in fake reviews, which makes sense if one considers the spontaneous nature of a real review as compared to a planned fake review. We will therefore concentrate on adding as features whether the review was made after a *Verified Purchase*. We will also add the review length and finally, although we have seen there is a balanced number of fake and real reviews for the different ratings, we will also add *Rating* as a category - as one would intuitively expect there might be a difference between reviews based on their ratings if they are real or fake. To add these features we will modify the **paseReview**, **toFeatureVector**, **loadData** and **splitData** slightly. 

In **paseReview** review we will now keep the elements 2 and 3 of the input list, corresponding to the Rating and Verified purchase status of the product reviewed. We then return (Id, Text, Label, Rating, Verified_purchase) and assign these to new variables within **loadData**, as well as append them to *rawData*. 

In [18]:
def str_to_int(element):
    
    """
    Auxiliary function which takes an element of a list and tries to pass it from string to integer. If it can't it returns the original element.'
    """
    try:
        element = int(element) # trying to convert string to integer
    except ValueError: # if Value Error is raised because the string can't be passed to integer, then pass and return original input
        pass
    return element


def parseReview(line):
    
    """
    Returns a triple of an integer, a string containing the review, and a string indicating the label.
  
    """
    indices = [0, 1, 2, 3, 8] # here we add Rating and Verified purchase in addition to id, text and label
    label_dict = {'__label1__':'fake', '__label2__':'real'} # creating a dictionary assigning labels to more descriptive terms. 
            
    review = [line[index] for index in indices] # 2) keep id, label, text using the indices
    review = [label_dict.get(x, x) for x in review] # 3) passing the label names to real or fake
    review = [str_to_int(x) for x in review] # 4) pass id to int using the conv function defined above
    Id, Label, Rating, Verified_purchase, Text = review # 5) assign each element of the list to a separate variable
        
    return Id, Text, Label, Rating, Verified_purchase

In [19]:
# b) TEXT PREPROCESSING AND FEATURE VECTORIZATION        

# Input: a string of one  review
def preProcess(text):
    
    tokens = []
    output = []  
    
    for token in text.split():
        
        tokens.append(token.lower())
        
        output = [' '.join(token) for token in nltk.bigrams(tokens)] + tokens
        
    return output

The main difference in the code is in **toFeatureVector**, as we need to modify the dictionary in such as way that we now also have keys which reflect the verified purchase status and the rating assigned to the product. To do so we simply input *Rating* and *Verified_purchase* as parameters of the function and create keys for them within our local dictionary *token_frequency*. We then return the dictionary with these new entry keys for each review. The function **splitData** is modified slightly to add the new parameters inputed into **toFeatureVector** and then to append the *trainData* and *testData* properly. After having done that we simply train our model as before. We see the testing accuracy obtained at the end of this notebook.

In [20]:
featureDict = {}

def toFeatureVector(tokens, Rating, Verified_purchase):
    
    token_frequency = {} # creating a dictionary to store token frequency

# Rating
 
    token_frequency["Rating"] = Rating # creating a key to store the rating given to the product

# Verified_Purchase

    if Verified_purchase == "N":  # creating a key for Verified purchase and storing N as 0 and Y as 1
        token_frequency["Verified Purchase"] = 0
    else:
        token_frequency["Verified Purchase"] = 1
        
# length review

    token_frequency['Length Review'] = len(tokens) # creating a key for the number of tokens in the review. This counts bigrams as well. 

# Text is counted as it was before

    for token in tokens: # for each word in the review
    
        if token not in token_frequency.keys(): 
            token_frequency[token] = 1  # if the word is not within the dic we add it as key = 1
        else:
            token_frequency[token] += 1  # if its already in the dic, we just add one
        
        if token not in featureDict.keys():
            featureDict[token] = 1
        else:
            featureDict[token] += 1
        
    return token_frequency


In [21]:
def loadData(path, Text=None):
    
    with open(path, encoding='utf8') as f:
        
        reader = csv.reader(f, delimiter='\t')
        next(reader)
        
        for line in reader:
            
            (Id, Text, Label, Rating, Verified_purchase) = parseReview(line)
            
            rawData.append((Id, Text, Label, Rating, Verified_purchase))
          
def splitData(percentage):
    
    # A method to split the data between trainData and testData 
    
    dataSamples = len(rawData)
    halfOfData = int(len(rawData)/2)
    trainingSamples = int((percentage*dataSamples)/2)
    
    for (_, Text, Label, Rating, Verified_purchase) in rawData[:trainingSamples] + rawData[halfOfData:halfOfData+trainingSamples]:
        trainData.append((toFeatureVector(preProcess(Text), Rating, Verified_purchase), Label))
        
    for (_, Text, Label, Rating, Verified_purchase) in rawData[trainingSamples:halfOfData] + rawData[halfOfData+trainingSamples:]:
        testData.append((toFeatureVector(preProcess(Text), Rating, Verified_purchase), Label))


In [22]:
def crossValidate(trainData, folds):

    shuffle(trainData) # reshuffle dataset as currently its ordered into fake, then real reviews
    cv_results = [] # initiate list to keep cv results for each fold
    foldSize = int(len(trainData) / folds) # initiate fold size by dividing training data by # folds.

    # DESCRIBE YOUR METHOD IN WORD
    
    for i in range(0, len(trainData), foldSize): # data goes from i to lenght of data in foldsize jumps.
        
        cv_test = trainData[i : i + foldSize]
        cv_train  = trainData[0 : i] + trainData[i + foldSize : ]
        
        classifier = trainClassifier(cv_train)  # train the classifier
        testTrue = [t[1] for t in cv_test]   # get the ground-truth labels from the data
        testPred = predictLabels(cv_test, classifier)  # classify the test data to get predicted labels
        precision, recall, fscore, _ = precision_recall_fscore_support(testTrue, testPred, average='weighted') # evaluate
        cv_results.append((precision, recall, fscore))
    cv_results = (np.mean(np.array(cv_results), axis = 0))
   # print(cv_results)
    print("Done training!")
    print("Precision %f\nRecall: %f\nF Score:%f" % (cv_results[0], cv_results[:1], cv_results[2]))   
        
    
    return cv_results


In [23]:
def predictLabels(reviewSamples, classifier):
    return classifier.classify_many(map(lambda t: t[0], reviewSamples))

In [24]:
# TRAINING AND VALIDATING OUR CLASSIFIER
def trainClassifier(trainData):
    print("Training Classifier...")
    pipeline =  Pipeline([('svc', LinearSVC(C=0.001, max_iter = 3500))])
    return SklearnClassifier(pipeline).train(trainData)

In [25]:
# MAIN

folds = 10
# loading reviews
# initialize global lists that will be appended to by the methods below
rawData = []          # the filtered data from the dataset file (should be 21000 samples)
trainData = []        # the pre-processed training data as a percentage of the total dataset (currently 80%, or 16800 samples)
testData = []         # the pre-processed test data as a percentage of the total dataset (currently 20%, or 4200 samples)

# the output classes
fakeLabel = 'fake'
realLabel = 'real'

# references to the data files
reviewPath = 'amazon_reviews.txt'

# Do the actual stuff (i.e. call the functions we've made)
# We parse the dataset and put it in a raw data list
print("Now %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Preparing the dataset...",sep='\n')

loadData(reviewPath) 

# We split the raw dataset into a set of training data and a set of test data (80/20)
# You do the cross validation on the 80% (training data)
# We print the number of training samples and the number of features before the split
print("Now %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Preparing training and test data...",sep='\n')

splitData(0.8)

# We print the number of training samples and the number of features after the split
print("After split, %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Training Samples: ", len(trainData), "Features: ", len(featureDict), sep='\n')

# QUESTION 3 - Make sure there is a function call here to the
# crossValidate function on the training set to get your results

crossValidate(trainData, folds)


Now 0 rawData, 0 trainData, 0 testData
Preparing the dataset...
Now 21000 rawData, 0 trainData, 0 testData
Preparing training and test data...
After split, 21000 rawData, 16800 trainData, 4200 testData
Training Samples: 
16800
Features: 
592379
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Training Classifier...
Done training!
Precision 0.815087
Recall: 0.815087
F Score:0.810187


array([0.81508737, 0.81083333, 0.81018722])

In [26]:
# # Evaluate on test set

# Finally, check the accuracy of your classifier by training on all the tranin data
# and testing on the test set
# Will only work once all functions are complete
functions_complete = True  # set to True once you're happy with your methods for cross val
if functions_complete:
    print(testData[0])   # have a look at the first test data instance
    classifier = trainClassifier(trainData)  # train the classifier
    testTrue = [t[1] for t in testData]   # get the ground-truth labels from the data
    testPred = predictLabels(testData, classifier)  #Â classify the test data to get predicted labels
    finalScores = precision_recall_fscore_support(testTrue, testPred, average='weighted') # evaluate
    print("Done training!")
    print("Precision: %f\nRecall: %f\nF Score:%f" % finalScores[:3])

({'Rating': 5, 'Verified Purchase': 0, 'Length Review': 41, 'this assortment': 1, 'assortment is': 1, 'is really': 1, "really hershey's": 1, "hershey's at": 1, 'at their': 1, 'their best.': 1, 'best. the': 1, 'the little': 1, 'little ones': 1, 'ones are': 1, 'are always': 1, 'always excited': 1, 'excited whenever': 1, 'whenever the': 1, 'the holidays': 1, 'holidays come': 1, 'come because': 1, 'because of': 1, 'of this.': 1, 'this': 1, 'assortment': 1, 'is': 1, 'really': 1, "hershey's": 1, 'at': 1, 'their': 1, 'best.': 1, 'the': 2, 'little': 1, 'ones': 1, 'are': 1, 'always': 1, 'excited': 1, 'whenever': 1, 'holidays': 1, 'come': 1, 'because': 1, 'of': 1, 'this.': 1}, 'fake')
Training Classifier...
Done training!
Precision: 0.823328
Recall: 0.818333
F Score:0.817629


**Testing the feature set**

- Review text, verified purchase:


    - Precision: 0.822767
    - Recall: 0.817857
    - F Score:0.817162

- Review text, rating:


    - Precision: 0.606135
    - Recall: 0.605952
    - F Score:0.605783

- Review text, verified purchase, rating:


    - Precision: 0.822487
    - Recall: 0.817619
    - F Score:0.816928

- Review text, review lenght:


    - Precision: 0.635665
    - Recall: 0.635000
    - F Score:0.634552

- Review text, verified purchase, review lenght:


    - Precision: 0.822922
    - Recall: 0.817857
    - F Score:0.817140

- Review text, rating, verified purchase and review lenght:


    - Precision: 0.823328
    - Recall: 0.818333
    - F Score:0.817629

We can see from the combination of vectors that the feature which most contributes to the increasing testing accuracy is *Verified purchase*, which increases our Precision score from 0.63 to to 0.82 - a nearly 20 points improvement reached by only adding one feature. Clearly this feature is given a lot of importance when the specialists at amazon manually curate the reviews and decide which are real and which are fake. Regarding other features, review length alone only gave us an increase of 1 point in overall accuracy, whilst rating alone actually resulted in decrease accuracy. The combination of these three features gave a very marginal improvement in accuracy compared to using *Verified purchase* alone - reaching scores of **Precision: 0.823, Recall: 0.818, F-Score:0.818**. 

As explained in part 2 of this assignment, to further improve this score, some kind of weighting scheme for the tokens in the *token_frequency* dictionary would be an important avenue to explore, but a good strategy to do this wasn't found. 