# Classification

Perform sentiment analysis on a corpus of book reviews from Amazon.

<b> Exercise 1.1 - Generative Classifiers: Naïve Bayes </b>
<br>
In this exercise we will use the Amazon sentiment analysis data (Blitzer et al., 2007), where the goal is to classify text documents as expressing a positive or negative sentiment (i.e., a classification problem with two classes). We are going to focus on book reviews.

In [1]:
import lxmls.readers.sentiment_reader as srs
scr = srs.SentimentCorpus("books")

In [2]:
from __future__ import division
import numpy as np
print "My training instances have the shape", np.shape(scr.train_X), "and are the following:\n", scr.train_X
print "\n Their targets/outputs belong to the classes", np.unique(scr.train_y), "and are the following:\n", scr.train_y

print "\n My test instances have the shape", np.shape(scr.test_X)

print '\nSo, each instance (row) represents a document as a bag of words (each column holds the frequency of that word in that document). Each document is classified as expressing either a positive or a negative sentiment (1 or 0, respectively).'

My training instances have the shape (1600, 13989) and are the following:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

 Their targets/outputs belong to the classes [0 1] and are the following:
[[0]
 [1]
 [0]
 ...
 [1]
 [0]
 [1]]

 My test instances have the shape (400, 13989)

So, each instance (row) represents a document as a bag of words (each column holds the frequency of that word in that document). Each document is classified as expressing either a positive or a negative sentiment (1 or 0, respectively).
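To make the bag-of-words representation concrete, here is a minimal sketch that builds such a count matrix by hand. The three-document corpus and the names `docs`, `vocab`, and `word_to_col` are made up for illustration; the real matrix comes from `srs.SentimentCorpus("books")`, whose preprocessing may differ.

```python
import numpy as np

# Hypothetical toy corpus, not taken from the Amazon data.
docs = ["good book good story", "bad book", "good plot bad ending"]

# Vocabulary: one column per unique word, in sorted order.
vocab = sorted(set(word for doc in docs for word in doc.split()))
word_to_col = {word: col for col, word in enumerate(vocab)}

# One row per document, one column per word; entries are word counts.
X = np.zeros((len(docs), len(vocab)))
for row, doc in enumerate(docs):
    for word in doc.split():
        X[row, word_to_col[word]] += 1

print(vocab)
print(X)
```

Each row sums to the length of its document, and most entries are zero, which is also what the mostly-zero `scr.train_X` printed above looks like.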


< 1. Implement the Naive Bayes algorithm. Open the file multinomial_naive_bayes.py, which is inside the classifiers folder. In the MultinomialNaiveBayes class you will find the train method. We have already placed some code in that file to help you get started.

In [3]:
# This is the code that I implemented in the train method.
# What was given (inside train, x and y are scr.train_X and scr.train_y):
x = scr.train_X
y = scr.train_y
# n_docs = no. of documents
# n_words = no. of unique words
n_docs, n_words = x.shape
# classes = a list of possible classes
classes = np.unique(y)
# n_classes = no. of classes
n_classes = classes.shape[0]

# initialization of the prior and likelihood variables
prior = np.zeros(n_classes)
likelihood = np.zeros((n_words, n_classes))

# My solution:
for i in range(n_classes):
    # rows of x whose label is classes[i] (y is a column vector)
    docs_of_class = x[np.where(y == classes[i])[0]]

    # prior = fraction of training documents in this class
    # (true division, thanks to the __future__ import in the earlier cell)
    prior[i] = len(docs_of_class) / n_docs

    # total count of each word over the documents of this class
    freq_of_each_word = docs_of_class.sum(0)  # [freq_w1, freq_w2, ...]
    # likelihood = relative frequency of each word within the class
    likelihood[:, i] = freq_of_each_word / freq_of_each_word.sum()
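The same estimates can be checked by hand on a tiny dataset. The sketch below wraps the logic in a standalone function; `train_counts` and the toy arrays are hypothetical names and data for illustration, not part of the lxmls toolkit.

```python
import numpy as np

def train_counts(x, y):
    """Estimate class priors and per-class word likelihoods without smoothing."""
    y = np.asarray(y).ravel()
    classes = np.unique(y)
    n_docs, n_words = x.shape
    prior = np.zeros(len(classes))
    likelihood = np.zeros((n_words, len(classes)))
    for i, c in enumerate(classes):
        docs_of_class = x[y == c]
        # fraction of documents that belong to class c
        prior[i] = docs_of_class.shape[0] / float(n_docs)
        # relative frequency of each word within class c
        word_counts = docs_of_class.sum(axis=0)
        likelihood[:, i] = word_counts / word_counts.sum()
    return prior, likelihood

# Toy data: 4 documents, 3 words, labels as a column vector like scr.train_y.
x = np.array([[2., 0., 1.],
              [1., 0., 0.],
              [0., 3., 1.],
              [0., 1., 1.]])
y = np.array([[0], [0], [1], [1]])

prior, likelihood = train_counts(x, y)
print(prior)
print(likelihood)
```

Each column of `likelihood` sums to one. In this toy data the second word never occurs in class 0 and the first word never occurs in class 1, so both get a likelihood of exactly zero.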

< 2. After implementing it, run Naive Bayes with the multinomial model on the Amazon dataset (sentiment classification) and report results for both training and testing:

In [4]:
smoothing=False 

import lxmls.classifiers.multinomial_naive_bayes as mnbb
mnb = mnbb.MultinomialNaiveBayes()
params_nb_sc = mnb.train(scr.train_X,scr.train_y, smoothing)
y_pred_train = mnb.test(scr.train_X,params_nb_sc)
acc_train = mnb.evaluate(scr.train_y, y_pred_train)
y_pred_test = mnb.test(scr.test_X,params_nb_sc)
acc_test = mnb.evaluate(scr.test_y, y_pred_test)
print "Multinomial Naive Bayes Amazon Sentiment Accuracy train: %f test: %f"%(acc_train,acc_test)

  params[1:, i] = np.nan_to_num(np.log(likelihood[:, i]))


Multinomial Naive Bayes Amazon Sentiment Accuracy train: 0.987500 test: 0.635000
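For reference, a multinomial Naive Bayes prediction can be sketched as an argmax over log-scores, with accuracy as the fraction of correct labels. The functions `predict` and `accuracy` below are hypothetical stand-ins written for this note; they are not the toolkit's `mnb.test` and `mnb.evaluate`, whose internals may differ. The toy likelihoods are strictly positive so that every logarithm is finite.

```python
import numpy as np

def predict(x, prior, likelihood):
    """Return the index of the highest-scoring class for each row of x."""
    # Score of class c for a document: log P(c) + sum_w count(w) * log P(w | c)
    scores = np.log(prior) + x.dot(np.log(likelihood))
    return np.argmax(scores, axis=1)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return np.mean(np.asarray(y_true).ravel() == np.asarray(y_pred).ravel())

# Made-up parameters: 2 classes, 3 words (rows are words, columns are classes).
prior = np.array([0.5, 0.5])
likelihood = np.array([[0.7, 0.1],
                       [0.1, 0.6],
                       [0.2, 0.3]])

x = np.array([[2., 0., 1.],
              [0., 2., 1.],
              [1., 1., 0.]])
y_true = np.array([[0], [1], [1]])

y_pred = predict(x, prior, likelihood)
print(y_pred, accuracy(y_true, y_pred))
```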


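< 3. Observe that words that were not observed at training time cause problems at test time. Why? To solve this problem, apply a simple add-one smoothing technique.

A word that never occurs with a class in the training data gets a likelihood of zero, and its log-likelihood is minus infinity. The idea is to add one to every word count, changing the likelihood to: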
    likelihood[:,i]=(1+freq_of_each_word)/ (freq_of_each_word.sum()+n_words)
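A small numeric check of the smoothed formula, using made-up counts for a single class (the array `word_counts` is hypothetical):

```python
import numpy as np

# Hypothetical word counts for one class; the second word never occurs.
word_counts = np.array([3., 0., 1.])
n_words = word_counts.shape[0]

unsmoothed = word_counts / word_counts.sum()
smoothed = (1 + word_counts) / (word_counts.sum() + n_words)

# log(0) is -inf; silence the divide-by-zero warning to show the value.
with np.errstate(divide="ignore"):
    log_unsmoothed = np.log(unsmoothed)
log_smoothed = np.log(smoothed)

print(unsmoothed, log_unsmoothed)
print(smoothed, log_smoothed)
```

Without smoothing, the unseen word has probability zero and a log-likelihood of minus infinity. With add-one smoothing, every word keeps a small positive probability and the distribution still sums to one, because the denominator grows by `n_words`.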

In [5]:
smoothing=True 

import lxmls.classifiers.multinomial_naive_bayes as mnbb
mnb = mnbb.MultinomialNaiveBayes()
params_nb_sc = mnb.train(scr.train_X,scr.train_y,smoothing)
y_pred_train = mnb.test(scr.train_X,params_nb_sc)
acc_train = mnb.evaluate(scr.train_y, y_pred_train)
y_pred_test = mnb.test(scr.test_X,params_nb_sc)
acc_test = mnb.evaluate(scr.test_y, y_pred_test)
print "Multinomial Naive Bayes Amazon Sentiment Accuracy train: %f test: %f"%(acc_train,acc_test)

Multinomial Naive Bayes Amazon Sentiment Accuracy train: 0.974375 test: 0.840000


<b> Exercise 1.2 </b>
<br>
We provide an implementation of the perceptron algorithm in the class Perceptron (file perceptron.py).

< 1. Run the following commands to generate a simple dataset similar to the one plotted in Figure 1.1:

In [6]:
import lxmls.readers.simple_data_set as sds
sd = sds.SimpleDataSet(nr_examples=100, g1 = [[-1,-1],1], g2 = [[1,1],1], balance=0.5, split=[0.5,0,0.5])

< 2. Run the perceptron algorithm on the simple dataset previously generated and report its train and test set accuracy:

In [7]:
import lxmls.classifiers.perceptron as percc
perc = percc.Perceptron()
params_perc_sd = perc.train(sd.train_X,sd.train_y)
y_pred_train = perc.test(sd.train_X,params_perc_sd)
acc_train = perc.evaluate(sd.train_y, y_pred_train)
y_pred_test = perc.test(sd.test_X,params_perc_sd)
acc_test = perc.evaluate(sd.test_y, y_pred_test)
print "Perceptron Simple Dataset Accuracy train: %f test: %f"%(acc_train,acc_test)

Rounds: 0 Accuracy: 0.840000
Rounds: 1 Accuracy: 0.900000
Rounds: 2 Accuracy: 0.940000
Rounds: 3 Accuracy: 0.940000
Rounds: 4 Accuracy: 0.860000
Rounds: 5 Accuracy: 0.940000
Rounds: 6 Accuracy: 0.920000
Rounds: 7 Accuracy: 0.920000
Rounds: 8 Accuracy: 0.920000
Rounds: 9 Accuracy: 0.920000
Perceptron Simple Dataset Accuracy train: 0.940000 test: 0.840000
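For intuition about what happens in each round, here is a minimal multiclass perceptron sketch. It is an illustration written for this note, not the code in perceptron.py: the names `train_perceptron` and `predict`, the bias handling, the shuffling, and the absence of weight averaging are all assumptions. It also assumes the class labels are the integers 0, 1, and so on.

```python
import numpy as np

def train_perceptron(x, y, n_rounds=10, seed=0):
    """Multiclass perceptron; returns weights of shape (n_features + 1, n_classes)."""
    rng = np.random.RandomState(seed)
    y = np.asarray(y).ravel()
    n_examples = x.shape[0]
    n_classes = np.unique(y).shape[0]
    # Prepend a constant feature so the first weight row acts as a bias.
    x_bias = np.hstack([np.ones((n_examples, 1)), x])
    w = np.zeros((x_bias.shape[1], n_classes))
    for _ in range(n_rounds):
        for i in rng.permutation(n_examples):
            y_hat = np.argmax(x_bias[i].dot(w))
            if y_hat != y[i]:
                # On a mistake, move the true class towards the example
                # and the wrongly predicted class away from it.
                w[:, y[i]] += x_bias[i]
                w[:, y_hat] -= x_bias[i]
    return w

def predict(x, w):
    x_bias = np.hstack([np.ones((x.shape[0], 1)), x])
    return np.argmax(x_bias.dot(w), axis=1)

# Two well-separated Gaussian blobs (made-up data, not SimpleDataSet).
rng = np.random.RandomState(1)
x = np.vstack([rng.randn(50, 2) * 0.3 + [-2, -2],
               rng.randn(50, 2) * 0.3 + [2, 2]])
y = np.array([0] * 50 + [1] * 50)

w = train_perceptron(x, y)
acc = np.mean(predict(x, w) == y)
print(acc)
```

Weights change only on mistakes, which is consistent with the training accuracy above moving up and down between rounds when the classes overlap.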


< 3. Plot the decision boundary found:

In [10]:
#import matplotlib
fig,axis = sd.plot_data()
fig,axis = sd.add_line(fig,axis,params_perc_sd,"Perceptron","blue")

[[-1.69314718 -1.69314718]
 [-1.          1.        ]
 [-1.          1.        ]]
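As a sketch of how a boundary line can be derived from a two-class linear model: if the parameters are stored with one bias row followed by one row per feature, and one column per class (an assumption suggested by the 3x2 array printed above, not confirmed from the toolkit), the boundary is the set of points where both class scores are equal. The function `boundary_x2` and the weights below are hypothetical.

```python
import numpy as np

def boundary_x2(w, x1):
    """x2 coordinate of the two-class decision boundary at each x1.

    w has shape (3, 2): rows are [bias, weight for x1, weight for x2],
    columns are the two classes. Requires a nonzero x2 weight difference.
    """
    d = w[:, 0] - w[:, 1]  # weights of the score difference
    # Solve d[0] + d[1] * x1 + d[2] * x2 = 0 for x2.
    return -(d[0] + d[1] * x1) / d[2]

# Hypothetical weights, not the ones learned above.
w = np.array([[0.5, -0.5],
              [-1.0, 1.0],
              [-2.0, 2.0]])

x1 = np.array([-1.0, 0.0, 1.0])
x2 = boundary_x2(w, x1)
print(x2)

# Sanity check: both classes score the same on every boundary point.
points = np.column_stack([np.ones_like(x1), x1, x2])
scores = points.dot(w)
print(scores)
```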


< 4. Run the perceptron algorithm on the Amazon dataset.

In [9]:
params_perc_scr = perc.train(scr.train_X,scr.train_y)
y_pred_train = perc.test(scr.train_X,params_perc_scr)
acc_train = perc.evaluate(scr.train_y, y_pred_train)
y_pred_test = perc.test(scr.test_X,params_perc_scr)
acc_test = perc.evaluate(scr.test_y, y_pred_test)
print "Perceptron Amazon Sentiment Accuracy train: %f test: %f"%(acc_train,acc_test)

Rounds: 0 Accuracy: 0.870000
Rounds: 1 Accuracy: 0.940000
Rounds: 2 Accuracy: 0.979375
Rounds: 3 Accuracy: 0.965625
Rounds: 4 Accuracy: 0.989375
Rounds: 5 Accuracy: 0.996250
Rounds: 6 Accuracy: 0.995000
Rounds: 7 Accuracy: 0.999375
Rounds: 8 Accuracy: 0.996250
Rounds: 9 Accuracy: 0.998125
Perceptron Amazon Sentiment Accuracy train: 0.998750 test: 0.825000
