# IMDb Sentiment Analysis

The dataset is split evenly: 25k reviews for training and 25k for testing your classifier. Each set is balanced, with 12.5k positive and 12.5k negative reviews.

IMDb lets users rate movies on a scale from 1 to 10. To label the reviews, the curator of the data marked anything with ≤ 4 stars as negative and anything with ≥ 7 stars as positive. Reviews with 5 or 6 stars were left out.
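The labeling rule above can be written as a small function. This is only an illustrative sketch: the name `label_from_rating` is hypothetical and is not part of the dataset or of any library.

```python
from typing import Optional


def label_from_rating(stars: int) -> Optional[str]:
    """Map a 1-10 star rating to a sentiment label, or None if the review is excluded."""
    if not 1 <= stars <= 10:
        raise ValueError("rating must be between 1 and 10")
    if stars <= 4:
        return "negative"
    if stars >= 7:
        return "positive"
    return None  # 5 and 6 stars are left out of the dataset


# Label every possible rating to see where the cut-offs fall.
labels = {stars: label_from_rating(stars) for stars in range(1, 11)}
print(labels)
```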

**Import the required libraries**

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd
import os
import re

**Load Data**

In [2]:
# Read one review per line; the context manager closes each file afterwards.
with open('full_train.txt', 'r', encoding="utf8") as f:
    reviews_train = [line.strip() for line in f]

with open('full_test.txt', 'r', encoding="utf8") as f:
    reviews_test = [line.strip() for line in f]

**See one of the elements in the list**

In [3]:
reviews_train[2]

'Brilliant over-acting by Lesley Ann Warren. Best dramatic hobo lady I have ever seen, and love scenes in clothes warehouse are second to none. The corn on face is a classic, as good as anything in Blazing Saddles. The take on lawyers is also superb. After being accused of being a turncoat, selling out his boss, and being dishonest the lawyer of Pepto Bolt shrugs indifferently "I\'m a lawyer" he says. Three funny words. Jeffrey Tambor, a favorite from the later Larry Sanders show, is fantastic here too as a mad millionaire who wants to crush the ghetto. His character is more malevolent than usual. The hospital scene, and the scene where the homeless invade a demolition site, are all-time classics. Look for the legs scene and the two big diggers fighting (one bleeds). This movie gets better each time I see it (which is quite often).'

The raw review text is fairly messy, so we need to clean it up before doing any analysis.

**Use regular expressions to remove non-text characters and HTML tags**

In [4]:
import re

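# Punctuation and digits are deleted outright (raw strings avoid invalid-escape warnings).
REPLACE_NO_SPACE = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])|(\d+)")
# Paired HTML line breaks (<br /><br />), hyphens, and slashes are replaced with a space.
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")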
NO_SPACE = ""
SPACE = " "

def preprocess_reviews(reviews):
    
    reviews = [REPLACE_NO_SPACE.sub(NO_SPACE, line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(SPACE, line) for line in reviews]
    
    return reviews

reviews_train_clean = preprocess_reviews(reviews_train)
reviews_test_clean = preprocess_reviews(reviews_test)

In [5]:
reviews_train_clean[2]

'brilliant over acting by lesley ann warren best dramatic hobo lady i have ever seen and love scenes in clothes warehouse are second to none the corn on face is a classic as good as anything in blazing saddles the take on lawyers is also superb after being accused of being a turncoat selling out his boss and being dishonest the lawyer of pepto bolt shrugs indifferently im a lawyer he says three funny words jeffrey tambor a favorite from the later larry sanders show is fantastic here too as a mad millionaire who wants to crush the ghetto his character is more malevolent than usual the hospital scene and the scene where the homeless invade a demolition site are all time classics look for the legs scene and the two big diggers fighting one bleeds this movie gets better each time i see it which is quite often'
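To see what each of the two patterns does, here is a minimal sketch that applies the same cleaning steps to a single made-up string. The sample text and the `clean_review` helper are hypothetical; the patterns are the ones defined above, written as raw strings.

```python
import re

# Same patterns as in the notebook, written as raw strings.
REPLACE_NO_SPACE = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\')|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])|(\d+)")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")


def clean_review(text: str) -> str:
    """Lowercase, delete punctuation and digits, then turn breaks, hyphens, and slashes into spaces."""
    text = REPLACE_NO_SPACE.sub("", text.lower())
    return REPLACE_WITH_SPACE.sub(" ", text)


# Hypothetical sample review, not taken from the dataset.
sample = "Great film!<br /><br />Well-acted; I'd give it 10/10 stars."
cleaned = clean_review(sample)
print(cleaned.split())
```

The cleaned string can contain runs of several spaces (for example where `10/10` used to be). That is harmless here because scikit-learn's default tokenizer ignores whitespace.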

**Vectorization**

For this data to make sense to our machine learning algorithm, we need to convert each review to a numeric representation, a process called vectorization.

The simplest form of this is to create one very large matrix with one column for every unique word in your corpus (where the corpus is all 50k reviews in our case). Each review then becomes one row of 0s and 1s, where a 1 means that the word corresponding to that column appears in that review. Each row of the matrix is therefore very sparse (mostly zeros). This binary bag-of-words representation is sometimes loosely called one-hot encoding, and scikit-learn's *CountVectorizer* with `binary=True` produces it. Note that the cells below use *TfidfVectorizer* instead, which replaces the 0/1 entries with TF-IDF weights but yields a matrix of the same shape.
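As a small illustration of the 0/1 matrix described above, the following sketch vectorizes a made-up three-review corpus with `CountVectorizer(binary=True)`. The toy reviews and variable names are assumptions for demonstration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the cleaned reviews.
toy_reviews = ["great movie", "boring movie", "great great cast"]

# binary=True records presence or absence rather than how often a word occurs.
binary_vectorizer = CountVectorizer(binary=True)
toy_matrix = binary_vectorizer.fit_transform(toy_reviews)

# Recover the column order from the learned word-to-column mapping.
vocabulary = sorted(binary_vectorizer.vocabulary_, key=binary_vectorizer.vocabulary_.get)
print(vocabulary)
print(toy_matrix.toarray())
```

Each row corresponds to one review and each column to one vocabulary word. The repeated "great" in the third review still produces a single 1.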

In [6]:
vectorizer = TfidfVectorizer()

In [7]:
# Note: X holds the training text and Y the test text; neither holds labels.
X = reviews_train_clean
Y = reviews_test_clean

In [8]:
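# Learn the vocabulary and IDF weights from the training reviews only.
X_vec = vectorizer.fit_transform(X)
# Reuse that vocabulary for the test reviews so both matrices share the same columns.
Y_vec = vectorizer.transform(Y)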
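A model trained on one matrix can only score another matrix with the same columns, so held-out text should go through `transform` rather than `fit_transform`. The sketch below, on a made-up corpus, shows how the column counts differ when a vectorizer is fitted again on the test text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy texts, for illustration only.
train_texts = ["great movie", "boring movie"]
test_texts = ["great cast", "boring plot twist"]

shared = TfidfVectorizer()
train_matrix = shared.fit_transform(train_texts)  # learns the training vocabulary
test_matrix = shared.transform(test_texts)        # maps test text onto the same columns

refit = TfidfVectorizer()
refit_matrix = refit.fit_transform(test_texts)    # learns a different vocabulary

print(train_matrix.shape, test_matrix.shape, refit_matrix.shape)
```

Words that appear only in the test text ("cast", "plot", "twist") are dropped by `transform`, which is the expected behavior for unseen vocabulary.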

In [31]:
# print(vectorizer.get_feature_names_out())  # get_feature_names() in older scikit-learn releases
print(X_vec.shape)
# print(Y_vec)

(25000, 90860)


**Modeling**

Use logistic regression to build a classifier: (1) linear models are easy to interpret, (2) they tend to perform well on sparse datasets like this one, and (3) they train very fast compared to other algorithms.
Test models with C values of [0.01, 0.05, 0.25, 0.5, 1] to see which value of C is best, and calculate the accuracy for each. In scikit-learn, `C` is the inverse of the regularization strength, so smaller values regularize more strongly.

In [10]:
#   Accuracy for C=0.01: 0.87472
#   Accuracy for C=0.05: 0.88368
#   Accuracy for C=0.25: 0.88016
#   Accuracy for C=0.5: 0.87808
#   Accuracy for C=1: 0.87648

#   Final Accuracy: 0.88128
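The source lists the target accuracies but not the loop that produces them. One way to run such a sweep is sketched below on a made-up corpus, assuming the accuracy for each C is measured on a held-out validation split. The toy reviews, labels, and variable names are all hypothetical, and the printed numbers will not match the ones above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for the cleaned IMDb reviews.
toy_reviews = ["great film", "excellent acting", "superb cast", "amazing story",
               "awful film", "boring acting", "worst cast", "poorly told story"] * 5
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

# Hold out 25% of the reviews for validation, keeping the classes balanced.
train_texts, val_texts, y_tr, y_val = train_test_split(
    toy_reviews, toy_labels, train_size=0.75, stratify=toy_labels, random_state=0)

# Fit the vectorizer on the training part only, then reuse it for validation.
toy_vectorizer = TfidfVectorizer()
X_tr = toy_vectorizer.fit_transform(train_texts)
X_val = toy_vectorizer.transform(val_texts)

accuracy_by_c = {}
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    model = LogisticRegression(C=c)
    model.fit(X_tr, y_tr)
    accuracy_by_c[c] = accuracy_score(y_val, model.predict(X_val))
    print("Accuracy for C=%s: %s" % (c, accuracy_by_c[c]))

# Pick the C with the highest validation accuracy (ties go to the first one tried).
best_c = max(accuracy_by_c, key=accuracy_by_c.get)
print("Best C:", best_c)
```

After choosing C this way, a final model would typically be refitted on the full training set and scored once on the test set.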

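In [34]:
from sklearn.model_selection import train_test_split

# ASSUMPTION (not shown in the source): the first 12,500 reviews in
# full_train.txt are positive and the last 12,500 are negative.
target = [1 if i < 12500 else 0 for i in range(25000)]

# Split rows (reviews), not columns (features), into training and validation sets.
X_vec_train, X_vec_val, y_train, y_val = train_test_split(X_vec, target, train_size=0.75)

In [27]:
c = 0.01

In [28]:
clf = LogisticRegression(C = c)

In [37]:
# fit() expects a feature matrix and a 1-D label array. Passing X_vec as the labels
# raised "ValueError: bad input shape (25000, 90860)".
clf.fit(X_vec_train, y_train)
y_pred = clf.predict(X_vec_val)
print(accuracy_score(y_val, y_pred))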

**Feature importance**


Let’s look at the 5 most discriminating words for both positive and negative reviews. We’ll do this by looking at the largest and smallest coefficients, respectively.

('excellent', 0.9288812243499887)
('perfect', 0.7934641130707091)
('great', 0.6750409142310883)
('amazing', 0.6160397978360739)
('superb', 0.6063967683158226)
('worst', -1.367978439214473)
('waste', -1.1684450936321602)
('awful', -1.0277001161980686)
('poorly', -0.8748317728678134)
('boring', -0.8587249635418243)
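A listing like the one above can be produced by pairing each vocabulary word with its coefficient in the fitted model and sorting. The sketch below does this on a made-up corpus. The toy data and variable names are assumptions, and the coefficients it prints are unrelated to the values shown above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus: three positive reviews followed by three negative ones.
toy_reviews = ["great excellent film", "excellent superb cast", "great superb story",
               "awful boring film", "boring worst cast", "awful worst story"]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression(C=1)
toy_model.fit(toy_vectorizer.fit_transform(toy_reviews), toy_labels)

# Pair each vocabulary word with its learned coefficient.
word_to_coef = {word: toy_model.coef_[0][index]
                for word, index in toy_vectorizer.vocabulary_.items()}

# Largest coefficients push toward the positive class, smallest toward the negative class.
ranked = sorted(word_to_coef.items(), key=lambda pair: pair[1], reverse=True)
for word, coef in ranked[:3]:
    print((word, coef))
for word, coef in ranked[::-1][:3]:
    print((word, coef))
```

Words that occur equally often in both classes ("film", "cast", "story" here) end up with coefficients near zero, which is why they do not appear at either end of the ranking.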
