## Sentiment Analysis using Scikit-learn (SVM vs Naive Bayes)

This Notebooks with conduct sentiment-analysis using publicly available data dump [Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/). We will compare performance of two Machine Learning techniques, [Support Vector Machines (SVM)](http://scikit-learn.org/stable/modules/svm.html) and [Naive Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html) to see which one gives a more accurate _sentiment polarity_ (positive or negative)

### Data sources
**Training set:** [polarity dataset v1.1 (2.2Mb)](http://www.cs.cornell.edu/people/pabo/movie-review-data/mix20_rand700_tokens_0211.tar.gz) _includes a README_ – approximately 700 positive and 700 negative processed reviews. Released November 2002. This alternative version was created by Nathan Treloar, who removed a few non-English/incomplete reviews and changing some of the labels (judging some polarities to be different from the original author's rating). The complete list of changes made to v1.1 can be found in diff.txt.

**Testing set:** [polarity dataset v0.9 (2.8Mb)](http://www.cs.cornell.edu/people/pabo/movie-review-data/mix20_rand700_tokens.zip) _includes a README_ – 700 positive and 700 negative processed reviews. Introduced in Pang/Lee/Vaithyanathan EMNLP 2002. Released July 2002.

### Download
Download and extract the above linked files preferably on the same directory as your iPython Notebook.

### Define Methods

Define two methods:

  * `list_textfiles()` - Retrieves all the text files in a given directory
  * `read_txt()` - Reading the contents of a given text file

In [1]:
from os import listdir
import codecs

def list_textfiles(directory):
    """Return a list of filenames ending in '.txt'
    directory - name of folder to look into
    """
    textfiles = []
    for filename in listdir(directory):
        if filename.endswith(".txt"):
            textfiles.append(directory + "/" + filename)
    return textfiles


def read_txt(filename):
    """Return text content of a given file
    filename - name of file to open and read
    """
    try:
        with codecs.open(filename,'r',encoding='utf-8', errors='ignore') as f :
            text = f.read()
    except EnvironmentError: # parent of IOError, OSError *and* WindowsError where available
        print("Oops! Couldn't read the given file")
    return text

### Load data

Import the training datasets into lists

In [2]:
# import training data
filenames_pos = list_textfiles("movieReview_data/tokens/pos")
filenames_neg = list_textfiles("movieReview_data/tokens/neg")

# create two lists, one to store review text and one stores the polarity
data_train = []
data_labels_train = []

for f in filenames_pos:
    data_train.append(read_txt(f))
    data_labels_train.append('pos')

for f in filenames_neg:
    data_train.append(read_txt(f))
    data_labels_train.append('neg')

### Vectorization
Next, we initialize a [sckit-learn](http://scikit-learn.org/stable/index.html) vector with the [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class. Because the data could be in any format, we’ll set lowercase to False and exclude common words such as “the” or “and”. This vectorizer will transform our data into vectors of features. 

`min_df` - Used to set a threshold for the vocabulary that can be ignored. Giving a float of 0.5 implies getting the proportion of documents, giving an integer implies absolute count of documents.

`max_df` - Used to set a threshold for vocabulary terms to ignore that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts.

Once the TfidfVectorizer class is initialized, we fit it onto the data above and convert it to an array for easy usage.

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=5,max_df=0.8, sublinear_tf=True,use_idf=True)
features_train = vectorizer.fit_transform(data_train)

## Import Classifier
### SVM
Using Support vector machines (SVM), we classify the training data using the polarity labels and create a model for testing.

In [4]:
from sklearn import svm
clf = svm.SVC()
# train svm model
clf.fit(features_train, data_labels_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

#### Evaluation

Load the dataset for testing. Import the dataset into lists.

In [5]:
from sklearn import metrics
import numpy as np;

# import test data
filenames_pos_test = list_textfiles("mix20_rand700_tokens/tokens/pos")
filenames_neg_test = list_textfiles("mix20_rand700_tokens/tokens/neg")

# create two lists, one to store review text and one stores the polarity
data_test = []
data_labels_test = []

for f in filenames_pos:
    data_test.append(read_txt(f))
    data_labels_test.append('pos')

for f in filenames_neg:
    data_test.append(read_txt(f))
    data_labels_test.append('neg')

# vectorize
features_test = vectorizer.fit_transform(data_test)

#features_nd_test = features_test.toarray() # for easy usage

#### Scores

Obtain the accuracy scores 

In [6]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

predicted = clf.predict(features_test)

# print accuracy score
print("Accuracy score of SVM model: {}\n".format(accuracy_score(data_labels_test,predicted)))
# print evaluation report showing: precision, recall, f1-score, support
print(classification_report(data_labels_test, predicted))

Accuracy score of SVM model: 0.9385714285714286

             precision    recall  f1-score   support

        neg       0.97      0.91      0.94       700
        pos       0.91      0.97      0.94       700

avg / total       0.94      0.94      0.94      1400



### Naive Bayes

We perform the same steps using Naive Bayes [MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) class. 

In [7]:
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
mnb.fit(features_train, data_labels_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

#### Scores

In [8]:
from sklearn.metrics import classification_report

mnb_predict = mnb.predict(features_test)

print("Accuracy score of Naive Bayes model: {}\n".format(accuracy_score(data_labels_test,mnb_predict)))
print(classification_report(data_labels_test, mnb_predict))

Accuracy score of Naive Bayes model: 0.965

             precision    recall  f1-score   support

        neg       0.96      0.97      0.97       700
        pos       0.97      0.96      0.96       700

avg / total       0.97      0.96      0.96      1400



### Dumping model

joblib.dump(value, filename, compress=0, protocol=None, cache_size=None)

In [9]:
from sklearn.externals import joblib
joblib.dump(mnb,'sentNB.model')

['sentNB.model']

### Question:

Why Naive Bayes performed much better than SVM in this prediction?