## [Abstract]

In this notebook, I recreate the work of Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. The main goal was to check whether the same machine learning methods, implemented with modern libraries, achieve better performance in sentiment classification.
I conclude that modern libraries can give slightly better performance than the tools used by Pang's team.





## [Project Description]
As the corpus, I used a cleaned dataset consisting of 690 positive and 690 negative preprocessed movie reviews. Like the authors, I did not apply stemming or stoplists.
Moreover, I decided not to test the top 2633 unigrams or the unigram position features, because these approaches did not give good results.

## [Libraries]

In [0]:
import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn import model_selection, naive_bayes, svm
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np


nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('all')


True

## [Data Loading]
The data was downloaded from the Cornell University site. Because of its different encoding, I converted it into a single CSV file and uploaded it to my GitHub.
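The conversion script itself is not part of this notebook. A minimal sketch of how such a file could be assembled is shown below. It assumes a hypothetical layout with one review per text file under `pos/` and `neg/` folders and Latin-1 encoded sources; both are assumptions for illustration, not facts about the Cornell archive, and `build_corpus` is a made-up helper name.

```python
import tempfile
from pathlib import Path

import pandas as pd


def build_corpus(root, encoding="latin-1"):
    """Collect reviews from <root>/pos and <root>/neg into one DataFrame."""
    rows = []
    for label in ("pos", "neg"):
        for path in sorted(Path(root, label).glob("*.txt")):
            text = path.read_text(encoding=encoding)
            # Collapse newlines and tabs so each review fits on one TSV row.
            rows.append({"document": " ".join(text.split()), "label": label})
    return pd.DataFrame(rows, columns=["document", "label"])


# Tiny self-contained demo on a temporary directory (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    for label, text in (("pos", "a great\tfilm .\n"), ("neg", "a dull film .\n")):
        folder = Path(tmp, label)
        folder.mkdir()
        (folder / "review_0.txt").write_text(text, encoding="latin-1")
    demo = build_corpus(tmp)
    out_path = Path(tmp, "corpus.csv")
    # Same tab separator as the read_csv call in this notebook.
    demo.to_csv(out_path, sep="\t", index=False)
    reloaded = pd.read_csv(out_path, sep="\t")

print(reloaded)
```

Writing with `sep="\t"` keeps the file readable by the `pd.read_csv(..., sep='\t')` call used in the next cell.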

In [0]:
corpus = pd.read_csv("https://raw.githubusercontent.com/DamianWiatrzyk/PangDatasetCorpus/master/corpus.csv", sep='\t')

## [Data Exploration] 
A quick overview of the corpus.

In [0]:
corpus.shape

(1380, 2)

In [0]:
corpus

                                               document label
0     the farrelly brothers' third film , after dumb...   pos
1     more movie views by jamey hughton at : http : ...   pos
2     if chris farley had strapped some fake mutton-...   pos
3     in zoolander , the world's most successful , i...   pos
4     if the current trends of hollywood filmmaking ...   pos
5     for more reviews and movie trailers , visit ht...   pos
6     director : george armitage cast : john cusack ...   pos
7     written by : peter wang and shirley sunstarrin...   pos
8     if you¹ve been paying attention to the media f...   pos
9     director : john sayles || screenplay : john sa...   pos
10    starring liam neeson , ewan mcgregor , jake ll...   pos
11    one never quite knows what one is going to get...   pos
12    an entertaining 2 hours awaits the audience in...   pos
13    made ( 2001 ) . starring jon favreau , vince v...   pos
14    director :   steven soderbergh

## [Data Preprocessing and Vectorizers]
This step was already done by Pang's team, so I could skip it and go straight to preparing the vectorizers. Each vectorizer corresponds to a different bag-of-features tokenization.

In [0]:
#returns pos_tagged unigrams
def unigram_POS_tokens(docs):
  tokens = nltk.word_tokenize(docs)
  unigram_POS = nltk.pos_tag(tokens)
  return list(unigram_POS)

In [0]:
#returns only pos_tagged adjectives
ADJECTIVE_TAGS = {'JJ', 'JJR', 'JJS'}

def unigram_adjectives(docs):
    tokens = nltk.word_tokenize(docs)
    # keep only tokens tagged as adjectives (base, comparative, superlative)
    return [word for word, tag in nltk.pos_tag(tokens) if tag in ADJECTIVE_TAGS]
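To make the two POS-based tokenizers easier to picture, here is a small sketch that works on a hand-tagged sample instead of calling `nltk.pos_tag`, so it runs without any NLTK data. The sample sentence, its tags, and the helper names `word_tag_tokens` and `adjective_tokens` are assumptions for illustration. Joining word and tag into one string is an alternative to the `(word, tag)` tuples that `unigram_POS_tokens` returns, not what this notebook does.

```python
# Hand-tagged sample standing in for the output of nltk.pos_tag.
# Assumption: Penn Treebank tags, as emitted by NLTK's default English tagger.
tagged = [("a", "DT"), ("truly", "RB"), ("great", "JJ"), ("film", "NN"),
          ("with", "IN"), ("the", "DT"), ("best", "JJS"), ("cast", "NN")]

ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}


def word_tag_tokens(pairs):
    """Turn (word, tag) pairs into 'word_TAG' strings (hypothetical helper)."""
    return [f"{word}_{tag}" for word, tag in pairs]


def adjective_tokens(pairs):
    """Keep only the words tagged as adjectives (hypothetical helper)."""
    return [word for word, tag in pairs if tag in ADJECTIVE_TAGS]


pos_tokens = word_tag_tokens(tagged)
adjectives = adjective_tokens(tagged)
print(pos_tokens)
print(adjectives)
```

With string tokens, the same word used with two different tags still ends up as two distinct features, which is the point of the unigrams+POS setting.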

In [0]:
def unigrams_frequency_vectorization(docs):
  vectorizer = CountVectorizer(ngram_range=(1, 1), binary=False)
  # tokenize and build vocab
  vectorizer.fit(docs)
  # encode document
  vector = vectorizer.transform(docs)
  # summarize encoded vector
  print('shape: ', vector.shape)
  return vector

In [0]:
def unigrams_presence_vectorization(docs):
  vectorizer = CountVectorizer(ngram_range=(1, 1), binary=True)
  # tokenize and build vocab
  vectorizer.fit(docs)
  # encode document
  vector = vectorizer.transform(docs)
  # summarize encoded vector
  print('shape: ', vector.shape)
  return vector
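The only difference between the frequency and presence vectorizers is the `binary` flag. A toy sketch (the two sentences are made up for illustration) shows what that flag changes in the resulting matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["good good good movie", "bad movie"]

freq = CountVectorizer(binary=False).fit(toy_docs)
pres = CountVectorizer(binary=True).fit(toy_docs)

# Columns follow the alphabetically sorted vocabulary.
vocab = sorted(freq.vocabulary_)
freq_matrix = freq.transform(toy_docs).toarray()
pres_matrix = pres.transform(toy_docs).toarray()

print(vocab)
print(freq_matrix)  # "good" is counted three times in the first document
print(pres_matrix)  # every non-zero count is clipped to 1
```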

In [0]:
def bigrams_presence_vectorization(docs):
  vectorizer = CountVectorizer(ngram_range=(2, 2), binary=True)
  # tokenize and build vocab
  vectorizer.fit(docs)
  # encode document
  vector = vectorizer.transform(docs)
  # summarize encoded vector
  print('shape: ', vector.shape)
  return vector

In [0]:
def unigrams_and_bigrams_presence_vectorization(docs):
  vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
  # tokenize and build vocab
  vectorizer.fit(docs)
  # encode document
  vector = vectorizer.transform(docs)
  # summarize encoded vector
  print('shape: ', vector.shape)
  return vector
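The `ngram_range` argument decides which n-grams become features. The sketch below, on a made-up sentence, lists the features for the three ranges used in this notebook. Note that with scikit-learn's default `token_pattern`, tokens of a single character (such as "a") and punctuation are dropped before n-grams are built, so the features are not necessarily identical to the ones used in the paper.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_doc = ["not a good movie"]


def features_for(ngram_range):
    """Return the sorted feature names for a given ngram_range."""
    vectorizer = CountVectorizer(ngram_range=ngram_range, binary=True)
    vectorizer.fit(toy_doc)
    return sorted(vectorizer.vocabulary_)


unigrams = features_for((1, 1))
bigrams = features_for((2, 2))
both = features_for((1, 2))

print(unigrams)
print(bigrams)
print(both)
```

Because "a" is removed first, "not" and "good" become neighbours and form the bigram "not good".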

In [0]:
def POStagged_unigrams_presence_vectorization(docs):
  vectorizer = CountVectorizer(ngram_range=(1, 1), binary=True, tokenizer=unigram_POS_tokens)
  # tokenize and build vocab
  vectorizer.fit(docs)
  # encode document
  vector = vectorizer.transform(docs)
  # summarize encoded vector
  print('shape: ', vector.shape)
  return vector

In [0]:
def POStagged_ajectives_presence_vectorization(docs):
  vectorizer = CountVectorizer(ngram_range=(1, 1), binary=True, tokenizer=unigram_adjectives)
  # tokenize and build vocab
  vectorizer.fit(docs)
  # encode document
  vector = vectorizer.transform(docs)
  # summarize encoded vector
  print('shape: ', vector.shape)
  return vector

## [Classifiers]

In [0]:
#used classifiers
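multinomialNaiveBayes = MultinomialNB()
# logistic regression plays the role of the MaxEnt classifier
logReg = LogisticRegression(random_state=0, solver='lbfgs',multi_class='multinomial', max_iter=500)
# note: this rebinds the name `svm` from the sklearn module to the estimator,
# so re-running this cell requires re-running the imports first
svm = svm.SVC(kernel='linear', C=1)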

## [Training Model]
In the next steps, I vectorize the data according to the case being examined. The reported accuracies are averages over three-fold cross-validation.
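In the cells that follow, the vocabulary is built from the whole corpus before cross-validation. For plain counts this only determines which columns exist, but a variant that refits the vectorizer inside every fold can be sketched with a scikit-learn pipeline. The toy documents and labels below are made up, and this is an alternative setup, not the procedure used for the numbers in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up miniature corpus: 1 = positive, 0 = negative.
toy_docs = ["great film", "wonderful acting", "loved this great movie",
            "terrible film", "awful acting", "hated this terrible movie"]
toy_labels = [1, 1, 1, 0, 0, 0]

# The vectorizer is refit on the training part of every fold.
pipeline = make_pipeline(CountVectorizer(binary=True), MultinomialNB())

# For a classifier and an integer cv, scikit-learn uses stratified folds.
scores = cross_val_score(pipeline, toy_docs, toy_labels, cv=3)
print("Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))
```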

## Accuracies for unigrams (freq.)
Please note that this case uses feature-count vectors. MaxEnt is skipped here, because the MaxEnt classifier only reflects the presence or absence of a feature rather than directly incorporating feature frequency.
All further cases use binary vectors.

In [0]:
data=unigrams_frequency_vectorization(corpus['document'])
encoder = LabelEncoder()
target = encoder.fit_transform(corpus['label'])

shape:  (1380, 34989)


In [0]:
scoresNB = cross_val_score(multinomialNaiveBayes, data, target, cv=3)
print("Accuracy for NB: %0.3f (+/- %0.3f)" % (scoresNB.mean(), scoresNB.std() * 2))

Accuracy for NB: 0.791 (+/- 0.026)
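The value after "+/-" in these printouts is twice the standard deviation of the per-fold accuracies, not a formal confidence interval. A small sketch with hypothetical fold scores (not taken from any run in this notebook) shows the arithmetic:

```python
import numpy as np

# Hypothetical per-fold accuracies for a three-fold run.
fold_scores = np.array([0.80, 0.78, 0.79])

mean_accuracy = fold_scores.mean()
# NumPy's default is the population standard deviation (ddof=0).
spread = 2 * fold_scores.std()

print("Accuracy: %0.3f (+/- %0.3f)" % (mean_accuracy, spread))
```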


In [0]:
scoresSVM = cross_val_score(svm, data, target, cv=3)
print("Accuracy for SVM: %0.3f (+/- %0.3f)" % (scoresSVM.mean(), scoresSVM.std() * 2))

Accuracy for SVM: 0.794 (+/- 0.008)


In [0]:
unigrams_frequency_acc={'Features:' : 'unigrams(freq.)', 'NB' : [scoresNB.mean()], 'MaxEnt': [np.nan], 'SVM' : [scoresSVM.mean()]}

In [0]:
unigrams_frequency_accDF=pd.DataFrame(unigrams_frequency_acc)

In [0]:
accuracies = pd.DataFrame(unigrams_frequency_accDF, columns=['Features:', 'NB', 'MaxEnt','SVM'])

## Accuracies for unigrams (pres.)

In [0]:
data=unigrams_presence_vectorization(corpus['document'])

shape:  (1380, 34989)


In [0]:
scoresNB = cross_val_score(multinomialNaiveBayes, data, target, cv=3)
print("Accuracy for NB: %0.3f (+/- %0.3f)" % (scoresNB.mean(), scoresNB.std() * 2))

Accuracy for NB: 0.810 (+/- 0.041)


In [0]:
scoresMaxEnt = cross_val_score(logReg, data, target, cv=3)
print("Accuracy for MaxEnt: %0.3f (+/- %0.3f)" % (scoresMaxEnt.mean(), scoresMaxEnt.std() * 2))

Accuracy for MaxEnt: 0.838 (+/- 0.011)


In [0]:
scoresSVM = cross_val_score(svm, data, target, cv=3)
print("Accuracy for SVM: %0.3f (+/- %0.3f)" % (scoresSVM.mean(), scoresSVM.std() * 2))

Accuracy for SVM: 0.831 (+/- 0.026)


In [0]:
unigrams_presence_acc={'Features:' : 'unigrams(pres.)', 'NB' : [scoresNB.mean()], 'MaxEnt': [scoresMaxEnt.mean()], 'SVM' : [scoresSVM.mean()]}

In [0]:
unigrams_presence_accDF=pd.DataFrame(unigrams_presence_acc)

In [0]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
accuracies=pd.concat([accuracies, unigrams_presence_accDF], ignore_index=True)

## Accuracies for unigrams and bigrams (pres.)

In [0]:
data=unigrams_and_bigrams_presence_vectorization(corpus['document'])

shape:  (1380, 412424)


In [0]:
scoresNB = cross_val_score(multinomialNaiveBayes, data, target, cv=3)
print("Accuracy for NB: %0.3f (+/- %0.3f)" % (scoresNB.mean(), scoresNB.std() * 2))

Accuracy for NB: 0.838 (+/- 0.033)


In [0]:
scoresMaxEnt = cross_val_score(logReg, data, target, cv=3)
print("Accuracy for MaxEnt: %0.3f (+/- %0.3f)" % (scoresMaxEnt.mean(), scoresMaxEnt.std() * 2))

Accuracy for MaxEnt: 0.834 (+/- 0.009)


In [0]:
scoresSVM = cross_val_score(svm, data, target, cv=3)
print("Accuracy for SVM: %0.3f (+/- %0.3f)" % (scoresSVM.mean(), scoresSVM.std() * 2))

Accuracy for SVM: 0.831 (+/- 0.023)


In [0]:
unigrams_and_bigrams_presence_acc={'Features:' : 'unigrams and bigrams(pres.) ', 'NB' : [scoresNB.mean()], 'MaxEnt': [scoresMaxEnt.mean()], 'SVM' : [scoresSVM.mean()]}

In [0]:
unigrams_and_bigrams_presence_accDF=pd.DataFrame(unigrams_and_bigrams_presence_acc)

In [0]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
accuracies=pd.concat([accuracies, unigrams_and_bigrams_presence_accDF], ignore_index=True)

## Accuracies for bigrams (pres.)

In [0]:
data=bigrams_presence_vectorization(corpus['document'])

shape:  (1380, 377435)


In [0]:
scoresNB = cross_val_score(multinomialNaiveBayes, data, target, cv=3)
print("Accuracy for NB: %0.3f (+/- %0.3f)" % (scoresNB.mean(), scoresNB.std() * 2))

Accuracy for NB: 0.825 (+/- 0.009)


In [0]:
scoresMaxEnt = cross_val_score(logReg, data, target, cv=3)
print("Accuracy for MaxEnt: %0.3f (+/- %0.3f)" % (scoresMaxEnt.mean(), scoresMaxEnt.std() * 2))

Accuracy for MaxEnt: 0.804 (+/- 0.020)


In [0]:
scoresSVM = cross_val_score(svm, data, target, cv=3)
print("Accuracy for SVM: %0.3f (+/- %0.3f)" % (scoresSVM.mean(), scoresSVM.std() * 2))

Accuracy for SVM: 0.801 (+/- 0.014)


In [0]:
bigrams_presence_acc={'Features:' : 'bigrams(pres.)', 'NB' : [scoresNB.mean()], 'MaxEnt': [scoresMaxEnt.mean()], 'SVM' : [scoresSVM.mean()]}

In [0]:
bigrams_presence_accDF=pd.DataFrame(bigrams_presence_acc)

In [0]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
accuracies=pd.concat([accuracies, bigrams_presence_accDF], ignore_index=True)

## Accuracies for unigrams with POS tagging (pres.)

In [0]:
data=POStagged_unigrams_presence_vectorization(corpus['document'])

shape:  (1380, 57378)


In [0]:
scoresNB = cross_val_score(multinomialNaiveBayes, data, target, cv=3)
print("Accuracy for NB: %0.3f (+/- %0.3f)" % (scoresNB.mean(), scoresNB.std() * 2))

Accuracy for NB: 0.811 (+/- 0.039)


In [0]:
scoresMaxEnt = cross_val_score(logReg, data, target, cv=3)
print("Accuracy for MaxEnt: %0.3f (+/- %0.3f)" % (scoresMaxEnt.mean(), scoresMaxEnt.std() * 2))

Accuracy for MaxEnt: 0.831 (+/- 0.024)


In [0]:
scoresSVM = cross_val_score(svm, data, target, cv=3)
print("Accuracy for SVM: %0.3f (+/- %0.3f)" % (scoresSVM.mean(), scoresSVM.std() * 2))

Accuracy for SVM: 0.819 (+/- 0.030)


In [0]:
POStagged_unigrams_presence_acc={'Features:' : 'unigrams+POS', 'NB' : [scoresNB.mean()], 'MaxEnt': [scoresMaxEnt.mean()], 'SVM' : [scoresSVM.mean()]}

In [0]:
POStagged_unigrams_presence_accDF=pd.DataFrame(POStagged_unigrams_presence_acc)

In [0]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
accuracies=pd.concat([accuracies, POStagged_unigrams_presence_accDF], ignore_index=True)

## Accuracies for adjectives (pres.)

In [0]:
data=POStagged_ajectives_presence_vectorization(corpus['document'])

shape:  (1380, 13101)


In [0]:
scoresNB = cross_val_score(multinomialNaiveBayes, data, target, cv=3)
print("Accuracy for NB: %0.3f (+/- %0.3f)" % (scoresNB.mean(), scoresNB.std() * 2))

Accuracy for NB: 0.785 (+/- 0.028)


In [0]:
scoresMaxEnt = cross_val_score(logReg, data, target, cv=3)
print("Accuracy for MaxEnt: %0.3f (+/- %0.3f)" % (scoresMaxEnt.mean(), scoresMaxEnt.std() * 2))

Accuracy for MaxEnt: 0.772 (+/- 0.020)


In [0]:
scoresSVM = cross_val_score(svm, data, target, cv=3)
print("Accuracy for SVM: %0.3f (+/- %0.3f)" % (scoresSVM.mean(), scoresSVM.std() * 2))

Accuracy for SVM: 0.743 (+/- 0.026)


In [0]:
POStagged_ajectives_presence_acc={'Features:' : 'adjectives', 'NB' : [scoresNB.mean()], 'MaxEnt': [scoresMaxEnt.mean()], 'SVM' : [scoresSVM.mean()]}

In [0]:
POStagged_ajectives_presence_accDF=pd.DataFrame(POStagged_ajectives_presence_acc)

In [0]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
accuracies=pd.concat([accuracies, POStagged_ajectives_presence_accDF], ignore_index=True)

## [Results]

In [0]:
#accuracies (in percent) quoted from the paper; scalars instead of one-element lists
pangs_acc=[{'Features:' : 'unigrams(freq.)', 'NB' : 79.0, 'MaxEnt': np.nan, 'SVM' : 73.0},
           {'Features:' : 'unigrams(pres.)', 'NB' : 81.0, 'MaxEnt': 80.2, 'SVM' : 82.9},
           {'Features:' : 'unigrams and bigrams', 'NB' : 80.7, 'MaxEnt': 80.7, 'SVM' : 82.8},
           {'Features:' : 'bigrams', 'NB' : 77.3, 'MaxEnt': 77.5, 'SVM' : 76.5},
           {'Features:' : 'unigrams+POS', 'NB' : 81.3, 'MaxEnt': 80.3, 'SVM' : 82.0},
           {'Features:' : 'adjectives', 'NB' : 76.6, 'MaxEnt': 77.6, 'SVM' : 75.3}]

pangsDF = pd.DataFrame(pangs_acc, columns=['Features:', 'NB', 'MaxEnt','SVM'])

In [0]:
#my results (fractions)
accuracies

                      Features:        NB    MaxEnt       SVM
0               unigrams(freq.)  0.790580       NaN  0.794203
1               unigrams(pres.)  0.810145  0.838406  0.831159
2  unigrams and bigrams(pres.)   0.838406  0.834058  0.831159
3                bigrams(pres.)  0.824638  0.803623  0.800725
4                  unigrams+POS  0.810870  0.831159  0.818841
5                    adjectives  0.784783  0.771739  0.743478

In [0]:
#Pang's results (percent)
pangsDF

              Features:    NB  MaxEnt   SVM
0       unigrams(freq.)  79.0     NaN  73.0
1       unigrams(pres.)  81.0    80.2  82.9
2  unigrams and bigrams  80.7    80.7  82.8
3               bigrams  77.3    77.5  76.5
4          unigrams+POS  81.3    80.3  82.0
5            adjectives  76.6    77.6  75.3
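The two tables use different scales (fractions for my results, percentages for the paper's figures as quoted here), so a direct difference is easier to read after converting to one unit. The sketch below copies the numbers by hand from the two outputs above; the variable names are arbitrary.

```python
import numpy as np
import pandas as pd

features = ["unigrams(freq.)", "unigrams(pres.)", "unigrams and bigrams(pres.)",
            "bigrams(pres.)", "unigrams+POS", "adjectives"]

# My accuracies, converted from fractions to percent.
mine = pd.DataFrame({
    "NB":     [0.790580, 0.810145, 0.838406, 0.824638, 0.810870, 0.784783],
    "MaxEnt": [np.nan,   0.838406, 0.834058, 0.803623, 0.831159, 0.771739],
    "SVM":    [0.794203, 0.831159, 0.831159, 0.800725, 0.818841, 0.743478],
}, index=features) * 100

# Accuracies attributed to the paper in this notebook (already in percent).
pang = pd.DataFrame({
    "NB":     [79.0,   81.0, 80.7, 77.3, 81.3, 76.6],
    "MaxEnt": [np.nan, 80.2, 80.7, 77.5, 80.3, 77.6],
    "SVM":    [73.0,   82.9, 82.8, 76.5, 82.0, 75.3],
}, index=features)

# Positive values mean this notebook scored higher (percentage points).
diff = (mine - pang).round(1)
print(diff)
```

Most entries are positive, while a few (for example SVM on adjectives) fall slightly below the quoted figures.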

## [Final Words]
The results obtained with modern libraries are slightly better in most cases, and the differences between accuracies for the different methods are smaller. Using more folds also gives better performance, but I stayed with three folds so that the results can be compared with those from the paper.
In this case, binary (presence) vectors work better than frequency vectors, and intuitive approaches such as using only adjectives give noticeably worse results in sentiment classification.
This notebook mainly covers the presence of features in a document, but there are also techniques such as TF-IDF, which weights terms, and word2vec, which captures word context.
In the future, I intend to check these approaches as well.
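As a pointer for that follow-up, here is a minimal sketch of how TF-IDF weighting differs from presence features, using scikit-learn's `TfidfVectorizer` on two made-up documents. It only illustrates the weighting and was not evaluated on the corpus in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["good movie", "bad movie"]

# Defaults: smoothed inverse document frequency, L2-normalised rows.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(toy_docs).toarray()

col = tfidf.vocabulary_  # maps each term to its column index
good_weight = weights[0, col["good"]]
movie_weight = weights[0, col["movie"]]

# "movie" occurs in both documents, so it is weighted lower than "good".
print(good_weight, movie_weight)
```

A term that appears in every document carries little information about sentiment, which is exactly what the lower weight expresses.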

## [References]
[Pang, Lee, Vaithyanathan. Thumbs up? Sentiment Classification using Machine Learning Techniques, 2002.](http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf)