## Sentiment Analysis with Python 

<hr>

**[Classifying IMDb Movie Reviews](https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184)**

### Step1: Read into Python

In [1]:
path = "../../data/input/Sentiment_Analysis/movie_data"

In [2]:
reviews_train = []
for line in open(path+'/full_train.txt', 'r'):
    reviews_train.append(line.strip())

In [8]:
reviews_train[1]

'Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. Most people think of the homeless as just a lost cause while worrying about things such as racism, the war on Iraq, pressuring kids to succeed, technology, the elections, inflation, or worrying if they\'ll be next to end up on the streets.<br /><br />But what if you were given a bet to live on the streets for a month without the luxuries you once had from a home, the entertainment sets, a bathroom, pictures on the wall, a computer, and everything you once treasure to see what it\'s like to be homeless? That is Goddard Bolt\'s lesson.<br /><br />Mel Brooks (who directs) who stars as Bolt plays a rich man who has everything in the world until deciding to make a bet with a sissy rival (Jeffery Tambor) to see if he can live in the streets for thirty days withou

In [4]:
reviews_test = []
for line in open(path+'/full_test.txt', 'r'):
    reviews_test.append(line.strip())

In [9]:
reviews_test[1]

'Actor turned director Bill Paxton follows up his promising debut, the Gothic-horror "Frailty", with this family friendly sports drama about the 1913 U.S. Open where a young American caddy rises from his humble background to play against his Bristish idol in what was dubbed as "The Greatest Game Ever Played." I\'m no fan of golf, and these scrappy underdog sports flicks are a dime a dozen (most recently done to grand effect with "Miracle" and "Cinderella Man"), but some how this film was enthralling all the same.<br /><br />The film starts with some creative opening credits (imagine a Disneyfied version of the animated opening credits of HBO\'s "Carnivale" and "Rome"), but lumbers along slowly for its first by-the-numbers hour. Once the action moves to the U.S. Open things pick up very well. Paxton does a nice job and shows a knack for effective directorial flourishes (I loved the rain-soaked montage of the action on day two of the open) that propel the plot further or add some unexpec

### Step2: Clean and Preprocess

We will do very basic text processing like removing punctuation and HTML tags and making everything lower-case.

**Note:** Understanding and being able to use regular expressions is a prerequisite for doing any Natural Language Processing task. If you’re unfamiliar with them perhaps start here: [Regex Tutorial](https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285)

In [10]:
import re

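# Raw strings keep regex escapes such as \s and \[ from being read as (invalid) Python string escapes
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")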

def preprocess_reviews(reviews):
    reviews = [REPLACE_NO_SPACE.sub("", line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(" ", line) for line in reviews]
    
    return reviews

reviews_train_clean = preprocess_reviews(reviews_train)
reviews_test_clean = preprocess_reviews(reviews_test)

In [11]:
reviews_train_clean[1]

'homelessness or houselessness as george carlin stated has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school work or vote for the matter most people think of the homeless as just a lost cause while worrying about things such as racism the war on iraq pressuring kids to succeed technology the elections inflation or worrying if theyll be next to end up on the streets but what if you were given a bet to live on the streets for a month without the luxuries you once had from a home the entertainment sets a bathroom pictures on the wall a computer and everything you once treasure to see what its like to be homeless that is goddard bolts lesson mel brooks who directs who stars as bolt plays a rich man who has everything in the world until deciding to make a bet with a sissy rival jeffery tambor to see if he can live in the streets for thirty days without the luxuries if bolt succeeds he can do what he w
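To see what each substitution pass contributes, here is a minimal sketch that applies the same two patterns to one short string. The sample text is made up for illustration, and the names `NO_SPACE`, `WITH_SPACE`, `step1` and `step2` are hypothetical, not part of the notebook.

```python
import re

# Same two patterns as in the cell above, written as raw strings.
NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

# Hypothetical review text, made up for illustration.
sample = "I loved it!<br /><br />Well-acted, 10/10."

# Pass 1: lower-case the text and delete punctuation outright.
step1 = NO_SPACE.sub("", sample.lower())
# Pass 2: turn doubled <br /> tags, hyphens and slashes into spaces.
step2 = WITH_SPACE.sub(" ", step1)

print(step1)
print(step2)
```

Deleting apostrophes rather than replacing them with a space is why "they'll" becomes "theyll" in the cleaned review above, while hyphenated words are split into separate tokens.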

### Step3: Vectorization

The simplest form of vectorization is to create one very large matrix with one column for every unique word in the corpus (the article's corpus is all 50k reviews; the vectorizer below learns its vocabulary from the 25k training reviews). Each review then becomes one row of 0s and 1s, where a 1 means that the word corresponding to that column appears in the review.
Because any single review uses only a small fraction of the vocabulary, each row of the matrix is very sparse (mostly zeros). This process is also known as **one hot encoding**.

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(binary=True)
cv.fit(reviews_train_clean)
X = cv.transform(reviews_train_clean)
X_test = cv.transform(reviews_test_clean)

### Step4: Build Classifier

Now that we’ve transformed our dataset into a format suitable for modeling, we can start building a classifier.
Logistic Regression is a good baseline model for us to use for several reasons:

1. Linear models are easy to interpret,
2. Linear models tend to perform well on sparse datasets like this one,
3. They learn very fast compared to other algorithms.

**Note:** The targets/labels we use will be the same for training and testing because both datasets are structured the same, where the first 12.5k are positive and the last 12.5k are negative.

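**About the hyperparameter C:** it adjusts the regularization. In scikit-learn, `C` is the inverse of the regularization strength, so smaller values regularize more strongly.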
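To see the effect of `C` in isolation, here is a minimal sketch on synthetic data. The dataset from `make_classification` is an assumption that stands in for the review matrix, and the names `X_demo`, `y_demo` and `coef_norms` are hypothetical. Because `C` is the inverse of the regularization strength, a small `C` should pull the coefficients toward zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the review matrix (illustration only).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, random_state=0
)

coef_norms = {}
for c in [0.01, 100]:
    model = LogisticRegression(C=c, max_iter=5000)
    model.fit(X_demo, y_demo)
    # The L2 norm of the weights summarizes how strongly they were shrunk.
    coef_norms[c] = float(np.linalg.norm(model.coef_))
    print("C=%s: coefficient norm %.3f" % (c, coef_norms[c]))
```

The loop over `C` values in the next cells is searching for the point where the model is regularized enough to generalize but not so much that it underfits.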

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [14]:
target = [1 if i < 12500 else 0 for i in range(25000)]

In [15]:
X_train, X_val, y_train, y_val = train_test_split(
    X, target, train_size = 0.75
)



In [64]:
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    
    lr = LogisticRegression(C=c)
    lr.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, lr.predict(X_val))))
    

Accuracy for C=0.01: 0.87248
Accuracy for C=0.05: 0.88272
Accuracy for C=0.25: 0.88048
Accuracy for C=0.5: 0.87824
Accuracy for C=1: 0.87568


### Step5: Train Final Model

Now that we’ve found the optimal value for C, we should train a model using the entire training set and evaluate our accuracy on the 25k test reviews.

In [65]:
final_model = LogisticRegression(C=0.05)
final_model.fit(X, target)
print ("Final Accuracy: %s" 
       % accuracy_score(target, final_model.predict(X_test)))


Final Accuracy: 0.88152


As a sanity check, let’s look at the 5 most discriminating words for both positive and negative reviews. 
We’ll do this by looking at the largest and smallest coefficients, respectively.

In [67]:
final_model.intercept_

array([0.15757332])

In [68]:
final_model.coef_

array([[ 1.49484201e-03, -3.52897272e-06, -3.71951510e-03, ...,
         2.90177224e-04, -2.69632878e-02, -6.63970496e-03]])

In [66]:
feature_to_coef = {
    word: coef for word, coef in zip(
        # get_feature_names() was removed in scikit-learn 1.2
        cv.get_feature_names_out(), final_model.coef_[0]
    )
}

In [69]:
for best_positive in sorted(
    feature_to_coef.items(), 
    key=lambda x: x[1], 
    reverse=True)[:5]:
    print (best_positive)
    

('excellent', 0.9292549002494034)
('perfect', 0.7907005736625977)
('great', 0.6745323581303191)
('amazing', 0.612703981446081)
('superb', 0.6019367936694553)


In [72]:
for best_negative in sorted(
    feature_to_coef.items(), 
    key=lambda x: x[1])[:5]:
    print (best_negative)
    

('worst', -1.3645958618890326)
('waste', -1.166424181103741)
('awful', -1.032418905297706)
('poorly', -0.8752018666767407)
('boring', -0.8563543336107031)


### What to Do Next

<hr>

1. **Text Processing**: Stemming/Lemmatizing to convert different forms of each word into one.
2. **n-grams**: Instead of just single-word tokens (1-gram/unigram) we can also include word pairs.
3. **Representations**: Instead of simple, binary vectors we can use word counts or TF-IDF to transform those counts.
4. **Algorithms**: In addition to Logistic Regression, we’ll see how Support Vector Machines perform.
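As a rough preview of items 2 to 4 in the list above, the sketch below chains TF-IDF features over unigrams and bigrams into a linear Support Vector Machine. The four toy reviews, their labels, and the `toy_` names are hypothetical. This is one plausible way to combine the ideas, not the configuration the article goes on to use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy reviews and labels (1 = positive, 0 = negative).
toy_docs = [
    "great excellent film",
    "wonderful great acting",
    "awful boring plot",
    "terrible waste boring",
]
toy_labels = [1, 1, 0, 0]

# Unigrams and bigrams, weighted by TF-IDF, fed to a linear SVM.
toy_pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LinearSVC(random_state=0),
)
toy_pipeline.fit(toy_docs, toy_labels)

toy_predictions = toy_pipeline.predict(["great film", "boring waste"]).tolist()
print(toy_predictions)
```

Wrapping the vectorizer and classifier in one pipeline means raw strings go in at both `fit` and `predict` time, so the test reviews are always transformed with the vocabulary learned from the training reviews.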

## Sentiment Analysis with Python (Part 2)

<hr>

**[Improving a Movie Review Sentiment Classifier](https://towardsdatascience.com/sentiment-analysis-with-python-part-2-4f71e7bde59a)**

### Enhance1: Text Processing

We can clean things up further by removing stop words and normalizing the text.
To make these transformations we’ll use libraries from the [Natural Language Toolkit](https://www.nltk.org/) (NLTK).

#### Removing Stop Words

Stop words are the very common words like ‘if’, ‘but’, ‘we’, ‘he’, ‘she’, and ‘they’. 
We can usually remove these words without changing the semantics of a text, and doing so often (but not always) improves the performance of a model.
Removing these stop words becomes a lot more useful when we start using longer word sequences as model features (see n-grams below).

In [73]:
from nltk.corpus import stopwords

In [77]:
# import nltk

In [78]:
# nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/duanle/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [79]:
english_stop_words = stopwords.words('english')

In [80]:
def remove_stop_words(corpus):
    removed_stop_words = []
    for review in corpus:
        removed_stop_words.append(
            ' '.join([word for word in review.split() 
                      if word not in english_stop_words])
        )
    return removed_stop_words

In [81]:
no_stop_words = remove_stop_words(reviews_train_clean)

In [82]:
reviews_train_clean[2]

'brilliant over acting by lesley ann warren best dramatic hobo lady i have ever seen and love scenes in clothes warehouse are second to none the corn on face is a classic as good as anything in blazing saddles the take on lawyers is also superb after being accused of being a turncoat selling out his boss and being dishonest the lawyer of pepto bolt shrugs indifferently im a lawyer he says three funny words jeffrey tambor a favorite from the later larry sanders show is fantastic here too as a mad millionaire who wants to crush the ghetto his character is more malevolent than usual the hospital scene and the scene where the homeless invade a demolition site are all time classics look for the legs scene and the two big diggers fighting one bleeds this movie gets better each time i see it which is quite often'

In [83]:
no_stop_words[2]

'brilliant acting lesley ann warren best dramatic hobo lady ever seen love scenes clothes warehouse second none corn face classic good anything blazing saddles take lawyers also superb accused turncoat selling boss dishonest lawyer pepto bolt shrugs indifferently im lawyer says three funny words jeffrey tambor favorite later larry sanders show fantastic mad millionaire wants crush ghetto character malevolent usual hospital scene scene homeless invade demolition site time classics look legs scene two big diggers fighting one bleeds movie gets better time see quite often'

#### Normalization

A common next step in text preprocessing is to normalize the words in your corpus by trying to convert all of the different forms of a given word into one. 
Two methods that exist for this are _Stemming_ and _Lemmatization_.

##### Stemming

Stemming is considered to be the cruder, more brute-force approach to normalization (although this doesn’t necessarily mean that it will perform worse).
There are several algorithms, but in general they all use basic rules to chop off the ends of words.

NLTK has several stemming algorithm implementations. We’ll use the Porter stemmer here but you can explore all of the options with examples here: [NLTK Stemmers](http://www.nltk.org/howto/stem.html)
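Before stemming whole reviews, it can help to look at a handful of single words. The sketch below runs the Porter stemmer on five words that occur in the sample review used in this section. The names `words` and `stems` are hypothetical.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the sample review shown in this section.
words = ["acting", "scenes", "lady", "dramatic", "funny"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

Stems such as "ladi" and "dramat" are not dictionary words. That is fine for a bag-of-words model, because the only requirement is that related forms map to the same token.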

In [84]:
def get_stemmed_text(corpus):
    from nltk.stem.porter import PorterStemmer
    stemmer = PorterStemmer()
    return [' '.join([stemmer.stem(word) for word in review.split()]) for review in corpus]

In [85]:
stemmed_reviews = get_stemmed_text(reviews_train_clean)

In [87]:
stemmed_reviews[2]

'brilliant over act by lesley ann warren best dramat hobo ladi i have ever seen and love scene in cloth warehous are second to none the corn on face is a classic as good as anyth in blaze saddl the take on lawyer is also superb after be accus of be a turncoat sell out hi boss and be dishonest the lawyer of pepto bolt shrug indiffer im a lawyer he say three funni word jeffrey tambor a favorit from the later larri sander show is fantast here too as a mad millionair who want to crush the ghetto hi charact is more malevol than usual the hospit scene and the scene where the homeless invad a demolit site are all time classic look for the leg scene and the two big digger fight one bleed thi movi get better each time i see it which is quit often'

##### Lemmatization

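Lemmatization works by identifying the part-of-speech of a given word and then applying more complex rules to transform the word into its true root. Note that NLTK’s `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, so the helper below, which calls it without one, mainly normalizes noun forms.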

In [88]:
def get_lemmatized_text(corpus):
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    return [' '.join([lemmatizer.lemmatize(word) for word in review.split()]) for review in corpus]

In [90]:
# nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/duanle/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [91]:
lemmatized_reviews = get_lemmatized_text(reviews_train_clean)

### Enhance2: n-grams

We can potentially add more predictive power to our model by adding two or three word sequences (bigrams or trigrams) as well. 
The scikit-learn library makes this really easy to play around with. Just use the `ngram_range` argument with any of the ‘Vectorizer’ classes.

In [92]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [93]:
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
ngram_vectorizer.fit(reviews_train_clean)
X = ngram_vectorizer.transform(reviews_train_clean)
X_test = ngram_vectorizer.transform(reviews_test_clean)

X_train, X_val, y_train, y_val = train_test_split(
    X, target, train_size = 0.75
)



In [94]:
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    
    lr = LogisticRegression(C=c)
    lr.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, lr.predict(X_val))))
    

Accuracy for C=0.01: 0.8776
Accuracy for C=0.05: 0.88576
Accuracy for C=0.25: 0.88816
Accuracy for C=0.5: 0.88816
Accuracy for C=1: 0.88816


In [95]:
final_ngram = LogisticRegression(C=0.5)
final_ngram.fit(X, target)
print ("Final Accuracy: %s" 
       % accuracy_score(target, final_ngram.predict(X_test)))


Final Accuracy: 0.8976


Getting pretty close to 90%! So, simply considering 2-word sequences in addition to single words increased our accuracy by more than 1.6 percentage points.