In [172]:
import pandas as pd
from textblob import TextBlob
from textblob.classifiers import NaiveBayesClassifier

## Part 1: Using the TextBlob Sentiment Analyzer

### Import the movie review data as a data frame and ensure that the data is loaded properly.

In [176]:
movrev_df = pd.read_csv("Data/labeledTrainData.tsv", sep='\t')
movrev_df.head()

        id  sentiment                                             review
0   5814_8          1  With all this stuff going down at the moment w...
1   2381_9          1  \The Classic War of the Worlds\" by Timothy Hi...
2   7759_3          0  The film starts with a manager (Nicholas Bell)...
3   3630_4          0  It must be assumed that those who praised this...
4   9495_8          1  Superbly trashy and wondrously unpretentious 8...


### How many of each positive and negative reviews are there?

In [177]:
PosRevs = sum(movrev_df['sentiment'] == 1)  # Counting the number of Positive Reviews
NegRevs = sum(movrev_df['sentiment'] == 0)  # Counting the number of Negative Reviews

# Printing the Number of positive and negative reviews
print("No. of Positive Reviews are:", PosRevs)      
print("No. of Negative Reviews are:", NegRevs)

No. of Positive Reviews are: 12500
No. of Negative Reviews are: 12500


### Use TextBlob to classify each movie review as positive or negative. Assume that a polarity score greater than or equal to zero is a positive sentiment and less than 0 is a negative sentiment.

In [178]:
movrev_df['TB_SentAnalysis'] = movrev_df['review'].apply(lambda review: TextBlob(review).sentiment.polarity)    # Adding a column for TextBlob Sentiment analysis

print("No. of positive reviews using TextBlob Analysis is", sum(movrev_df['TB_SentAnalysis'] >= 0))             # Sums of positive and negative reviews.
print("No. of Negative reviews using TextBlob Analysis is", sum(movrev_df['TB_SentAnalysis'] < 0))

No. of positive reviews using TextBlob Analysis is 19017
No. of Negative reviews using TextBlob Analysis is 5983


### Check the accuracy of this model. Is this model better than random guessing?

In [179]:
print("Accurate positive sentiment prediction by TextBlob:", sum((movrev_df['sentiment'] > 0) & (movrev_df['TB_SentAnalysis'] >= 0)))        # Calculating Accurate positive and negative sentiment predictions
print("Accurate negative sentiment prediction by TextBlob:", sum((movrev_df['sentiment'] <= 0) & (movrev_df['TB_SentAnalysis'] < 0)))

Accurate positive sentiment prediction by TextBlob: 11824
Accurate negative sentiment prediction by TextBlob: 5307


Total number of correct sentiment predictions using TextBlob: 11824+5307 = 17131

Total number of reviews: 25000

Accuracy of this model = (17131/25000)*100 = 68.524%

The accuracy of this model is about 68.5%. This is better than random guessing, which would be expected to reach only about 50% on this balanced data set (12,500 positive and 12,500 negative reviews).
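The same accuracy figure can be computed directly from the label and polarity columns instead of by hand. The following is a minimal sketch on a toy frame: the five rows and the `predicted` column name are made up for illustration, and only the thresholding rule (polarity >= 0 is positive) comes from the step above.

```python
import pandas as pd

# Toy stand-in for movrev_df; the values are illustrative, not from the data set
toy = pd.DataFrame({
    "sentiment": [1, 1, 0, 0, 1],
    "TB_SentAnalysis": [0.30, -0.10, 0.05, -0.40, 0.00],
})

# Polarity >= 0 counts as positive (1), anything below 0 as negative (0)
toy["predicted"] = (toy["TB_SentAnalysis"] >= 0).astype(int)

# Accuracy is the fraction of rows where the prediction matches the label
accuracy = (toy["predicted"] == toy["sentiment"]).mean()
print(f"Accuracy: {accuracy:.1%}")
```

Applied to the real data frame, the same two lines should reproduce the 17131/25000 ratio computed above.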

### For up to five points extra credit, use another prebuilt text sentiment analyzer, e.g., VADER, and repeat steps (3) and (4).

In [180]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
movrev_df['V_SentAnalysis'] = movrev_df['review'].apply(lambda review: analyzer.polarity_scores(review).get('compound'))     # Adding a column for Vader sentiment analysis

print("No. of positive reviews using VADER is:", sum(movrev_df['V_SentAnalysis'] >= 0))  # Sums of positive and negative reviews using Vader
print("No. of negative reviews using VADER is:", sum(movrev_df['V_SentAnalysis'] < 0))

print("\nAccurate positive sentiment prediction by VADER :", sum((movrev_df['sentiment'] > 0) & (movrev_df['V_SentAnalysis'] >= 0)))       # Calculating the accuracy
print("Accurate negative sentiment prediction by VADER :", sum((movrev_df['sentiment'] <= 0) & (movrev_df['V_SentAnalysis'] < 0)))

No. of positive reviews using VADER is: 16611
No. of negative reviews using VADER is: 8389

Accurate positive sentiment prediction by VADER : 10731
Accurate negative sentiment prediction by VADER : 6620


Total number of correct sentiment predictions using VADER: 10731+6620 = 17351

Total number of reviews: 25000

Accuracy of this model = (17351/25000)*100 = 69.404%

The accuracy of the VADER model is about 69.4%, slightly higher than TextBlob's 68.5%. This is also better than random guessing, which would be expected to reach only about 50% on this balanced data set.
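Overall accuracy hides how differently the two classes are handled. As a rough sketch, the correct-prediction counts printed above can be split by class, using the fact that each class has 12,500 reviews. The dictionary layout and the names `correct`, `results`, and `n_per_class` are just for illustration.

```python
# Correct-prediction counts reported above; each class has 12,500 reviews
n_per_class = 12500
correct = {
    "TextBlob": {"pos": 11824, "neg": 5307},
    "VADER": {"pos": 10731, "neg": 6620},
}

results = {}
for name, c in correct.items():
    results[name] = {
        # Share of truly positive / truly negative reviews classified correctly
        "pos_recall": c["pos"] / n_per_class,
        "neg_recall": c["neg"] / n_per_class,
        # Overall accuracy across both classes
        "accuracy": (c["pos"] + c["neg"]) / (2 * n_per_class),
    }
    r = results[name]
    print(f"{name}: pos {r['pos_recall']:.1%}, neg {r['neg_recall']:.1%}, "
          f"overall {r['accuracy']:.1%}")
```

With these counts, TextBlob gets about 94.6% of the positive reviews right but only about 42.5% of the negative ones, while VADER gets about 85.8% and 53.0%. Both analyzers label far more than half of the reviews as positive, which is consistent with the `>= 0` threshold leaning toward the positive class.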

## Part 2: Prepping Text for a Custom Model

### Convert all text to lowercase letters.

In [198]:
movrev_df.review = movrev_df.review.str.lower()           # Converting review column text to lowercase.

### Remove punctuation and special characters from the text.

In [199]:
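import string

# Build a translation table that deletes every character in string.punctuation, then apply it to each review
movrev_df.review = movrev_df.review.str.translate(str.maketrans('', '', string.punctuation))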
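To see what the translation table does, here is a small sketch on a single made-up string (the sample sentence is illustrative only). It uses the same `str.maketrans` call as the cell above, on a plain Python string rather than a pandas column.

```python
import string

# Translation table that deletes every ASCII punctuation character
table = str.maketrans("", "", string.punctuation)

sample = "Superbly trashy, and wondrously un-pretentious!"
cleaned = sample.lower().translate(table)
print(cleaned)

# Apostrophes are deleted too, so contractions collapse into one token
contraction = "don't".translate(table)
print(contraction)
```

Two things are worth keeping in mind. `string.punctuation` covers only ASCII punctuation, so characters such as curly quotes are left in place. And because apostrophes are removed, a contraction like "don't" becomes "dont", which may no longer match a stop-word list that stores contractions with their apostrophes.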

### Remove stop words.

In [200]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))    # A set gives fast membership tests; the filtered result is unchanged

movrev_df.review = movrev_df.review.apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))          # Removing the stop words

### Apply NLTK’s PorterStemmer.

In [201]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()            # Creating the stemmer

movrev_df.review = movrev_df.review.apply(lambda x: ' '.join([porter.stem(word) for word in x.split()]))         #Applying the stemmer
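As a quick illustration of what the stemmer does to individual words, the sketch below runs the same `PorterStemmer` on a short made-up string (the sample words are not taken from the data set).

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

# Stem each whitespace-separated token and join the results back together
sample = "movies running acting"
stemmed = " ".join(porter.stem(word) for word in sample.split())
print(stemmed)
```

Stems such as "movi" are not dictionary words. That is expected, since the goal is only to map related word forms to a common token before counting.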

### Create a bag-of-words matrix from your stemmed text (output from (4)), where each row is a word-count vector for a single movie review (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook). Display the dimensions of your bag-of-words matrix. The number of rows in this matrix should be the same as the number of rows in your original data frame.

In [202]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
bag_of_words = count.fit_transform(movrev_df.review)

bag_of_words                           #Size of bag_of_words

<25000x91437 sparse matrix of type '<class 'numpy.int64'>'
	with 2398626 stored elements in Compressed Sparse Row format>

### Create a term frequency-inverse document frequency (tf-idf) matrix from your stemmed text, for your movie reviews (see section 6.9 in the Machine Learning with Python Cookbook). Display the dimensions of your tf-idf matrix. These dimensions should be the same as your bag-of-words matrix.

In [203]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_review = vectorizer.fit_transform(movrev_df.review)    # Create the tf-idf matrix

print(tfidf_review.shape)             # Dimensions of the tf-idf matrix

(25000, 91437)


The dimensions of my tf-idf matrix are the same as the dimensions of the bag-of-words matrix: 25,000 rows (one per review) and 91,437 columns (one per distinct token).
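The matching dimensions are expected: with default settings, both vectorizers tokenize the text the same way and build the same vocabulary, and tf-idf only reweights the counts. A minimal sketch on three made-up, already-stemmed documents (the `docs` list is illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three toy "reviews" that have already been lowercased and stemmed
docs = ["great movi love act", "bad movi wast time", "love stori great cast"]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()

counts = count_vec.fit_transform(docs)   # raw word counts
tfidf = tfidf_vec.fit_transform(docs)    # reweighted, row-normalized values

# Same vocabulary, so same shape and same pattern of non-zero entries
same_vocab = count_vec.vocabulary_ == tfidf_vec.vocabulary_
print(counts.shape, tfidf.shape, same_vocab)
```

If either vectorizer were given different options (for example `min_df`, `stop_words`, or `ngram_range`), the vocabularies and therefore the column counts could diverge.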