# Intro to Sentiment Analysis
[Original article here](https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184)
[Data from here](https://github.com/aaronkub/machine-learning-examples/blob/master/imdb-sentiment-analysis/movie_data.tar.gz)

## Import Libraries

In [85]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore') #I like to live dangerously

## Import Data

In [12]:
#each line of the text files holds one review
with open('../../../data/movie_data/full_train.txt','r') as f:
    reviews_train = [line.strip() for line in f]

with open('../../../data/movie_data/full_test.txt','r') as f:
    reviews_test = [line.strip() for line in f]

## Clean and Preprocess
We will now use Regex (REGular EXpression) functions in Python to do our cleaning. Being comfortable with Regex is an absolute must for text mining. 

The `re.compile()` function takes a regular expression pattern (the crazy sequence of characters) and returns a compiled pattern object that we can reuse for matching and substitution.

In [70]:
import re

#REPLACE_NO_SPACE matches any single character in the set (mostly punctuation);
#we will replace each match with an empty string
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")

#REPLACE_WITH_SPACE matches doubled <br /> tags, hyphens and forward slashes;
#we will replace each match with a space
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

Take a moment to do some pattern matching of your own: connect the pattern passed to `re.compile()` with what is removed from the strings below.

In [72]:
print(REPLACE_NO_SPACE.sub("","!!?'H()el?l?o!"))
print(REPLACE_WITH_SPACE.sub(" ",'Hello-to/you<br /><br />too!'))

Hello
Hello to you too!


In [76]:
def preprocess_reviews(reviews):
    #accept a single string by wrapping it in a list so the code below works either way
    if isinstance(reviews,str):
        reviews = [reviews]
    
    #line.lower() turns each review into all lower case before punctuation is stripped
    reviews = [REPLACE_NO_SPACE.sub("",line.lower()) for line in reviews]
    #HTML line breaks, hyphens and slashes become a single space
    reviews = [REPLACE_WITH_SPACE.sub(" ",line) for line in reviews]

    return reviews

We can now preprocess our text to remove the punctuation and unwanted HTML artifacts.

Let's see an example of this preprocessing technique at work.

In [77]:
print('Paragraph without preprocessing\n')
print(reviews_train[0])

Paragraph without preprocessing

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!


In [78]:
print('Paragraph with preprocessing\n')
print(preprocess_reviews(reviews_train[0]))

Paragraph with preprocessing

['bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell highs satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled  at  high a classic line inspector im here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isnt']


As you can see, the 'noisy' text that we wanted to remove is gone: the review is now all lower case and the punctuation has been stripped.

In [79]:
train = preprocess_reviews(reviews_train)
test = preprocess_reviews(reviews_test)

## Vectorization
We will now need to convert each review to a numeric representation, which as we know is the process of vectorization.

There are other steps we could perform before vectorization (such as normalization and lemmatization) to improve our corpus, but let's naively move forward and perhaps see the benefits of those steps later on.

Here, we will pass in: 
```python
binary=True
```
which will return a very large matrix with **one column for each unique word** in the training corpus and **one row for each review**. In our case, the full dataset contains 50K reviews (25K for training and 25K for testing), and a 1 in a row indicates the presence of that word in that review. This binary representation is loosely referred to as **one hot encoding**.

In [80]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(binary=True)
cv.fit(train)
X = cv.transform(train)
X_test = cv.transform(test)
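
To make the shape of this matrix concrete, here is a minimal sketch on a two-review toy corpus. The names `toy_corpus`, `toy_cv` and `toy_matrix` are made up for illustration and are not part of the notebook's pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-review corpus, used only for illustration
toy_corpus = ["good good movie", "bad movie"]

toy_cv = CountVectorizer(binary=True)
toy_matrix = toy_cv.fit_transform(toy_corpus)

# vocabulary_ maps each word to its column index
columns = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(columns)               # ['bad', 'good', 'movie']
print(toy_matrix.toarray())  # [[0 1 1]
                             #  [1 0 1]]
```

Note that "good" appears twice in the first review but is still recorded as 1, because `binary=True` tracks presence rather than counts.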

## Build Classifier

Now that our dataset is in a format suitable for modeling we can start building a classifier. We can use **Logistic Regression** as a good baseline model: linear models are easy to interpret and tend to work well on sparse datasets like this one. They also train quickly, which suits the large binary matrices we just created.

**Note**: the targets/labels will be the same for the train and test sets as both datasets are structured the same. The first 12.5K are positive reviews and the last 12.5K are negative reviews.

In [87]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

target = [1 if i < 12500 else 0 for i in range(25000)]

X_train, X_val, y_train, y_val = train_test_split(X,target,train_size=0.75)
#we will include 75% of our data in the train set and 25% in the validation set
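
The split above is random, so validation accuracies will shift slightly from run to run. A minimal sketch, assuming we wanted a repeatable split with both classes equally represented, is shown here on toy data. The seed value `42` and the `toy_` names are arbitrary choices for illustration, not something the notebook uses.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data laid out like ours: first half positive, second half negative
toy_X = [[i] for i in range(100)]
toy_y = [1 if i < 50 else 0 for i in range(100)]

# random_state fixes the shuffle; stratify keeps the class balance in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    toy_X, toy_y, train_size=0.75, random_state=42, stratify=toy_y)

print(len(X_tr), len(X_va))  # 75 25
```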

In [88]:
#quick HP sweep
for c in [0.01, 0.05, 0.025, 0.5, 1]:
    lr = LogisticRegression(C=c) #C is the inverse of regularization strength
    lr.fit(X_train,y_train)
    
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, lr.predict(X_val))))

Accuracy for C=0.01: 0.87152
Accuracy for C=0.05: 0.88384
Accuracy for C=0.025: 0.88016
Accuracy for C=0.5: 0.87632
Accuracy for C=1: 0.8744


The value of C that gives the highest validation accuracy is 0.05.
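
Rather than reading the winner off by eye, we could pick it programmatically. The sketch below copies the accuracies printed by the sweep into a dictionary (the name `val_accuracy` is ours) and takes the key with the largest value.

```python
# Validation accuracies copied from the sweep output above
val_accuracy = {
    0.01: 0.87152,
    0.05: 0.88384,
    0.025: 0.88016,
    0.5: 0.87632,
    1: 0.8744,
}

# max() over the keys, ranked by their accuracy
best_c = max(val_accuracy, key=val_accuracy.get)
print(best_c, val_accuracy[best_c])  # 0.05 0.88384
```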

## Train Final Model
We can now train a model with the entire training set and evaluate on the test set we've reserved.

In [89]:
#target is reused for the test set because it has the same label layout as the train set
final_model = LogisticRegression(C=0.05)
final_model.fit(X, target)
print ("Final Accuracy: %s" 
       % accuracy_score(target, final_model.predict(X_test)))

Final Accuracy: 0.88152


## Thoughts
Above we've chosen to move quickly to the modeling stage for some basic results rather than dive into the nitty-gritty of improving our accuracy. Let's do a post-mortem analysis here.

In [97]:
#Here we make a dictionary with the key being the word and the corresponding 
#value being the coefficient for that variable in the linear model
#(get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
feature_to_coef = {
    word: coef for word, coef in zip(cv.get_feature_names_out(),final_model.coef_[0])
}

Now let's sort through and look at the 5 most discriminating words for both positive and negative reviews. These correspond to the largest and smallest coefficients respectively.

In [98]:
for best_positive in sorted(
    feature_to_coef.items(), 
    key=lambda x: x[1], 
    reverse=True)[:5]:
    print (best_positive)

('excellent', 0.9292549121503528)
('perfect', 0.7907005783795896)
('great', 0.67453235464691)
('amazing', 0.6127039931007847)
('superb', 0.6019368001642376)


In [99]:
for best_negative in sorted(
    feature_to_coef.items(), 
    key=lambda x: x[1])[:5]:
    print (best_negative)

('worst', -1.3645958972261922)
('waste', -1.1664242065789645)
('awful', -1.0324189439735652)
('poorly', -0.8752018765502437)
('boring', -0.8563543421846104)


It's nice to see that the model has learned sensible associations between these words and positive or negative reviews.
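
A natural next step is scoring a review the model has never seen. In this notebook that would presumably mean running the text through `preprocess_reviews`, then `cv.transform`, then `final_model.predict`. The sketch below shows the same idea on a self-contained toy corpus, so every `toy_` name and the example sentences are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus, not taken from the IMDB data
toy_reviews = ["great movie", "excellent film", "great acting",
               "awful movie", "boring film", "awful acting"]
toy_labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

toy_cv = CountVectorizer(binary=True)
toy_model = LogisticRegression()
toy_model.fit(toy_cv.fit_transform(toy_reviews), toy_labels)

# Use transform (not fit_transform) so the columns match the training vocabulary;
# words the vectorizer never saw, such as "story", are simply ignored
new_X = toy_cv.transform(["an excellent and great story"])
prediction = toy_model.predict(new_X)[0]
print(prediction)  # 1
```

The key design point is that the vectorizer is fitted once, on the training text, and only reused afterwards. Refitting it on new text would change the columns and make the learned coefficients meaningless.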