# Sentiment analysis using logistic regression

The Sentiment data set consists of 3000 sentences which come from reviews on `imdb.com`, `amazon.com`, and `yelp.com`. Each sentence is labeled according to whether it comes from a positive review or negative review.

We will use `logistic regression` to learn a classifier from this data.

## Set up notebook, load and preprocess data

First, some standard imports.

In [10]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

%matplotlib inline
import string
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14)

The data set consists of 3000 sentences, each labeled '1' (if it came from a positive review) or '0' (if it came from a negative review). We will change the negative review label to '-1'.

In [2]:
## Read in the data set
with open("full_set.txt") as f:
    content = f.readlines()

## Remove leading and trailing white space
content = [x.strip() for x in content]

## Separate the sentences from the labels
sentences = [x.split("\t")[0] for x in content]
labels = [x.split("\t")[1] for x in content]

## Transform the labels from {0, 1} to {-1, +1}
y = np.array(labels, dtype='int8')
y = 2*y - 1
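
As a quick sanity check of the relabeling, here is a minimal sketch on a few made-up label strings (the names `toy_labels` and `toy_y` are hypothetical and not part of the notebook). It uses plain Python lists rather than NumPy, but the arithmetic is the same.

```python
# Toy labels as they would appear in the file (strings '0' and '1').
toy_labels = ['1', '0', '0', '1']

# Parse to integers, then map 0 -> -1 and 1 -> +1 via y = 2*y - 1.
toy_y = [2 * int(label) - 1 for label in toy_labels]

print(toy_y)  # [1, -1, -1, 1]
```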

### Preprocessing the text data

To transform this prediction problem into one amenable to linear classification, we will first need to preprocess the text data. We will do four transformations:

1. Remove punctuation and numbers.
2. Transform all words to lower-case.
3. Remove _stop words_.
4. Convert the sentences into vectors, using a bag-of-words representation.

We begin with the first two steps.

In [3]:
## full_remove takes a string x and a list of characters removal_list
## and returns x with all the characters in removal_list replaced by ' '
def full_remove(x, removal_list):
    for w in removal_list:
        x = x.replace(w, ' ')
    return x

## Remove digits
digits = [str(x) for x in range(10)]
digit_less = [full_remove(x, digits) for x in sentences]

## Remove punctuation
punc_less = [full_remove(x, list(string.punctuation)) for x in digit_less]

## Make everything lower-case
sents_lower = [x.lower() for x in punc_less]
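
The same two steps can also be done in a single pass over each sentence. The following is a minimal sketch, not the notebook's method: it swaps the repeated `replace` calls for `str.translate`, and both the helper name `clean_sentence` and the example sentence are made up for illustration.

```python
import string

def clean_sentence(sentence):
    """Replace digits and punctuation with spaces, then lower-case."""
    # Map every digit and punctuation character to a single space.
    table = str.maketrans({ch: ' ' for ch in string.digits + string.punctuation})
    return sentence.translate(table).lower()

example = "Great phone, works 100% of the time!"
cleaned = clean_sentence(example)
print(cleaned.split())  # ['great', 'phone', 'works', 'of', 'the', 'time']
```

Replacing characters with spaces rather than deleting them keeps words such as "well-made" from being fused into a single token.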

### Stop words

Stop words are words that are filtered out because they are believed to contain no useful information for the task at hand. These usually include articles such as 'a' and 'the', pronouns such as 'i' and 'they', and prepositions such as 'to' and 'from'. Here we use the English stop word list that ships with NLTK, which is by no means comprehensive. Feel free to use something different; for instance, larger lists can easily be found on the web.

Run these two lines just once, and download `stopwords` from the interactive menu:

* `import nltk`
* `nltk.download()`

In [5]:
## Define our stop words
from nltk.corpus import stopwords
stop_set = set(stopwords.words('english'))

## Remove stop words
sents_split = [x.split() for x in sents_lower] # example - ['the', 'mic', 'is', 'great']
sents_processed = [" ".join(list(filter(lambda a: a not in stop_set, x))) for x in sents_split] # example - 'mic great'
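
To see the filtering step in isolation, here is a small sketch that does not need NLTK. The stop set `toy_stop_set` and the helper `remove_stop_words` are hypothetical names used only for this illustration; the real cell above uses the NLTK list.

```python
# Tiny, hypothetical stop word set used only for this illustration.
toy_stop_set = {'the', 'is', 'a', 'to', 'i'}

def remove_stop_words(sentence, stop_set):
    """Drop every whitespace-separated token that appears in stop_set."""
    return " ".join(word for word in sentence.split() if word not in stop_set)

print(remove_stop_words('the mic is great', toy_stop_set))  # mic great
```

Storing the stop words in a `set` keeps each membership test fast even when the list is large.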

Let us see how the sentences look so far.

In [6]:
sents_processed[0:10]

['way plug us unless go converter',
 'good case excellent value',
 'great jawbone',
 'tied charger conversations lasting minutes major problems',
 'mic great',
 'jiggle plug get line right get decent volume',
 'several dozen several hundred contacts imagine fun sending one one',
 'razr owner must',
 'needless say wasted money',
 'waste money time']

### Bag of words

In order to use linear classifiers on our data set, we need to transform our textual data into numeric data. The classical way to do this is known as the `bag of words` representation. 

In this representation, each word is thought of as corresponding to a number in `{1, 2, ..., V}` where `V` is the size of our vocabulary. And each sentence is represented as a V-dimensional vector $x$, where $x_i$ is the number of times that word $i$ occurs in the sentence.

To do this transformation, we will make use of the `CountVectorizer` class in `scikit-learn`. We will cap the number of features at 4500, meaning a word will make it into our vocabulary only if it is one of the 4500 most common words in the corpus. This is often a useful step as it can weed out spelling mistakes and words which occur too infrequently to be useful.
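
To make the representation concrete, the following is a minimal pure-Python sketch of a bag-of-words transform with a vocabulary cap. It is an illustration under simplifying assumptions, not a reimplementation of `CountVectorizer`: the function name `bag_of_words` is made up, tokens are split on whitespace only, ties at the cap may be broken differently, and (unlike `CountVectorizer`'s default token pattern) single-character tokens are kept. Columns are indexed from 0 rather than 1.

```python
from collections import Counter

def bag_of_words(sentences, max_features=None):
    """Return (vocab, matrix) where matrix[i][j] counts vocab[j] in sentences[i]."""
    tokenized = [s.split() for s in sentences]
    # Corpus-wide word frequencies, used to keep only the most common words.
    totals = Counter(word for tokens in tokenized for word in tokens)
    kept = [word for word, _ in totals.most_common(max_features)]
    vocab = sorted(kept)  # alphabetical column order
    index = {word: j for j, word in enumerate(vocab)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for word in tokens:
            if word in index:  # words outside the capped vocabulary are ignored
                row[index[word]] += 1
        matrix.append(row)
    return vocab, matrix

bow_vocab, bow_matrix = bag_of_words(['mic great', 'great case great value'])
print(bow_vocab)   # ['case', 'great', 'mic', 'value']
print(bow_matrix)  # [[0, 1, 1, 0], [1, 2, 0, 1]]
```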

In [7]:
# See Sebastian Raschka - Python Machine Learning, page 236/237 for reference
from sklearn.feature_extraction.text import CountVectorizer

## Transform to bag of words representation.
vectorizer = CountVectorizer(analyzer = "word", tokenizer = None, preprocessor = None, stop_words = None, max_features = 4500)
data_features = vectorizer.fit_transform(sents_processed)

data_mat = data_features.toarray()

### Training / test split

Finally, we split the data into a training set of 2500 sentences and a test set of 500 sentences (of which 250 are positive and 250 negative).

In [13]:
## Split the data into training and test sets
np.random.seed(0)
test_inds = np.append(np.random.choice((np.where(y==-1))[0], 250, replace=False), np.random.choice((np.where(y==1))[0], 250, replace=False))
train_inds = list(set(range(len(labels))) - set(test_inds))

train_data = data_mat[train_inds,]
train_labels = y[train_inds]

test_data = data_mat[test_inds,]
test_labels = y[test_inds]

print("Train data: ", train_data.shape)
print("Test data: ", test_data.shape)

Train data:  (2500, 4500)
Test data:  (500, 4500)
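
The idea behind this split (draw a fixed number of test examples from each class, then train on everything else) can be sketched without NumPy. The helper `balanced_split` and the toy labels below are hypothetical, and the standard-library `random` module stands in for `np.random.choice`, so the selected indices will not match those of the cell above.

```python
import random

def balanced_split(labels, n_test_per_class, seed=0):
    """Return (train_inds, test_inds) with n_test_per_class test items per label."""
    rng = random.Random(seed)
    test_inds = []
    for cls in sorted(set(labels)):
        members = [i for i, label in enumerate(labels) if label == cls]
        # Sample without replacement from this class only.
        test_inds.extend(rng.sample(members, n_test_per_class))
    test_set = set(test_inds)
    # Everything not chosen for testing goes into the training set.
    train_inds = [i for i in range(len(labels)) if i not in test_set]
    return train_inds, test_inds

split_labels = [-1, 1] * 10  # 10 negatives, 10 positives
toy_train, toy_test = balanced_split(split_labels, n_test_per_class=3)
print(len(toy_train), len(toy_test))  # 14 6
```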


## Fitting a logistic regression model to the training data

We could implement our own logistic regression solver using stochastic gradient descent, but fortunately, there is already one built into `scikit-learn`.

Due to the randomness in the SGD procedure, different runs can yield slightly different solutions (and thus different error values).
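
For intuition about what such a solver does, here is a minimal sketch of logistic regression trained by SGD, assuming labels in {-1, +1} and the per-example loss $\log(1 + e^{-y(w \cdot x + b)})$. It is a toy illustration, not `scikit-learn`'s implementation: the names `sgd_logistic` and `predict`, the fixed learning rate, the epoch count, and the four-row data set are all assumptions made for this example.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def sgd_logistic(X, y, lr=0.1, epochs=50, seed=0):
    """Fit (w, b) by SGD on the loss log(1 + exp(-y * (w.x + b))), y in {-1, +1}."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit examples in a fresh random order each epoch
        for i in order:
            z = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            # The loss gradient w.r.t. z is -y * sigmoid(-y * z); step against it.
            g = y[i] * sigmoid(-y[i] * z)
            w = [wj + lr * g * xj for wj, xj in zip(w, X[i])]
            b += lr * g
    return w, b

def predict(X, w, b):
    """Predict +1 when w.x + b >= 0 and -1 otherwise."""
    return [1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1 for x in X]

# Toy bag-of-words rows over the hypothetical vocabulary ['bad', 'great'].
X_toy = [[0, 1], [0, 2], [1, 0], [2, 0]]
y_toy = [1, 1, -1, -1]
w_toy, b_toy = sgd_logistic(X_toy, y_toy)
print(w_toy, b_toy)
print(predict(X_toy, w_toy, b_toy))
```

Changing `seed` changes the order in which examples are visited, which is the source of the run-to-run variation mentioned above.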

In [14]:
from sklearn.linear_model import SGDClassifier

## Fit logistic classifier on training data
## (newer scikit-learn releases spell these options loss="log_loss" and penalty=None)
clf = SGDClassifier(loss="log", penalty="none")
clf.fit(train_data, train_labels)

## Pull out the parameters (w,b) of the logistic regression model
w = clf.coef_[0,:]
b = clf.intercept_

## Get predictions on training and test data
preds_train = clf.predict(train_data)
preds_test = clf.predict(test_data)

## Compute errors
errs_train = np.sum(preds_train != train_labels)
errs_test = np.sum(preds_test != test_labels)

print("Training error: ", float(errs_train)/len(train_labels))
print("Test error: ", float(errs_test)/len(test_labels))

## Compute accuracy
print('Test Accuracy: %.3f' % clf.score(test_data, test_labels))

Training error:  0.0276
Test error:  0.182
Test Accuracy: 0.818


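The results show that our model predicts whether a sentence comes from a positive or a negative review ``with about 82 percent accuracy`` on the test set. The gap between the training error (about 2.8 percent) and the test error (18.2 percent) suggests that this unregularized model overfits the training data.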

## Words with large influence

Finally, we attempt to partially interpret the logistic regression model.

Which words are most important in deciding whether a sentence is positive? As a first approximation, we simply take the words whose coefficients in `w` have the largest positive values.

Likewise, we look at the words whose coefficients in `w` have the most negative values, and we think of these as influential in negative predictions.
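
The selection can be sketched on a toy example in plain Python. The vocabulary, weights, and helper name `top_words` below are made up for illustration; `sorted` over indices plays the role of `np.argsort`.

```python
def top_words(vocab, weights, k):
    """Return (most_positive, most_negative) word lists of length k (assumes k >= 1)."""
    # Indices sorted by ascending weight, like np.argsort.
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    most_negative = [vocab[i] for i in order[:k]]
    # The last k indices hold the largest weights; reverse to list the largest first.
    most_positive = [vocab[i] for i in order[-k:][::-1]]
    return most_positive, most_negative

toy_vocab = ['awful', 'fine', 'great', 'phone', 'poor']
toy_w = [-2.1, 0.3, 1.8, 0.0, -1.4]
pos, neg = top_words(toy_vocab, toy_w, k=2)
print(pos)  # ['great', 'fine']
print(neg)  # ['awful', 'poor']
```

Note that a slice of the form `[-k:]` includes the final (largest) entry, whereas a slice that ends at `-1` omits it.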

In [16]:
## Convert vocabulary into a list:
vocab = np.array([z[0] for z in sorted(vectorizer.vocabulary_.items(), key=lambda x:x[1])])

## Get indices of sorting w
inds = np.argsort(w) # parameters (w,b) of the logistic regression model

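## Words with large positive values
## (note: this slice yields 48 indices and skips the largest coefficient, inds[-1];
##  inds[-50:] would give the top 50)
pos_inds = inds[-49:-1]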
print("Highly positive words: \n ")
print([str(x) for x in list(vocab[pos_inds])])
print(" \n ")

## Words with large negative values
neg_inds = inds[0:50]
print("Highly negative words: \n  ")
print([str(x) for x in list(vocab[neg_inds])])

Highly positive words: 
 
['huston', 'white', 'world', 'blackberry', 'ladies', 'order', 'phenomenal', 'flavorful', 'happy', 'brilliant', 'soul', 'feature', 'assure', 'motorola', 'gem', 'slim', 'comfortable', 'amount', 'without', 'bowl', 'fantastic', 'sturdy', 'performance', 'exactly', 'keyboard', 'pleased', 'hand', 'great', 'cool', 'mouth', 'clear', 'data', 'wonderful', 'massive', 'perfect', 'liked', 'love', 'joy', 'works', 'delicious', 'inside', 'art', 'enjoyed', 'nice', 'incredible', 'awesome', 'loved', 'interesting']
 
 
Highly negative words: 
  
['poor', 'beep', 'pm', 'avoid', 'bland', 'ok', 'unfortunately', 'mediocre', 'fly', 'stupid', 'slow', 'make', 'wasted', 'sucks', 'storyline', 'racial', 'rude', 'worst', 'junk', 'dirty', 'fat', 'lacks', 'wife', 'waste', 'script', 'disappointment', 'cheap', 'failed', 'att', 'mistake', 'luck', 'guess', 'empty', 'joke', 'flat', 'cheesy', 'bye', 'fails', 'none', 'ridiculous', 'puppets', 'plug', 'average', 'garbage', 'crap', 'card', 'cover', 'cli