## Part 1: Using the TextBlob Sentiment Analyzer

### 1. Import the movie review data as a data frame and ensure that the data is loaded properly.

In [1]:
import pandas as pd
from textblob import TextBlob

In [2]:
#Load the dataset as a Pandas data frame.
labeled_train_data_df = pd.read_csv('labeledTrainData.tsv', sep='\t') 

print(labeled_train_data_df.shape)
labeled_train_data_df.head()

(25000, 3)


Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


### 2. How many positive and negative reviews are there?

In [3]:
labeled_train_data_df.groupby(['sentiment'])['sentiment'].count().sort_index()
# 50% are positive and 50% are negative reviews.

sentiment
0    12500
1    12500
Name: sentiment, dtype: int64

### 3. Use TextBlob to classify each movie review as positive or negative. 

In [4]:
# Treat a polarity score >= 0 as positive sentiment (1) and a score < 0 as negative sentiment (0).
# Run TextBlob once per review and read both scores from the same result.
sentiments = labeled_train_data_df['review'].apply(lambda review: TextBlob(review).sentiment)

labeled_train_data_df['subjectivity'] = sentiments.apply(lambda s: s.subjectivity)
labeled_train_data_df['polarity'] = sentiments.apply(lambda s: s.polarity)

labeled_train_data_df['analysis'] = labeled_train_data_df['polarity'].apply(lambda x: 1 if x >= 0 else 0)

labeled_train_data_df.groupby(['analysis'])['analysis'].count().sort_index()

analysis
0     5983
1    19017
Name: analysis, dtype: int64

### 4. Check the accuracy of this model. Is this model better than random guessing?

In [5]:
# Compare the true labels (sentiment) with the TextBlob predictions (analysis).
from sklearn.metrics import accuracy_score

accuracy_score(labeled_train_data_df['sentiment'], labeled_train_data_df['analysis'])

# The classes are balanced (12,500 reviews each), so random guessing would be right about 50% of the time.
# An accuracy of roughly 68.5% is therefore better than random guessing.

0.68524
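
To put this score in context, the chance baseline can be simulated directly. The sketch below is only an illustration: it uses synthetic labels that mirror the 12,500/12,500 split reported in step 2 rather than the real data frame, and the names `true_labels`, `random_guesses`, and `baseline_accuracy` are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the simulation is repeatable

# Synthetic labels with the same class balance as the dataset (12,500 each).
true_labels = [0] * 12500 + [1] * 12500

# A "model" that flips a fair coin for every review.
random_guesses = [random.randint(0, 1) for _ in true_labels]

# Accuracy is the fraction of reviews where the guess matches the label.
matches = sum(t == g for t, g in zip(true_labels, random_guesses))
baseline_accuracy = matches / len(true_labels)

print(f"random-guess accuracy: {baseline_accuracy:.3f}")
print(f"TextBlob accuracy:     {0.68524:.3f}")
```

With 25,000 balanced labels the simulated baseline should land very close to 0.5, well below the TextBlob score.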

### 5. For up to five points extra credit, use another prebuilt text sentiment analyzer, e.g., VADER, and repeat steps (3) and (4).

In [6]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# The compound score is a normalised summary of the review's sentiment, ranging from -1 to 1.
labeled_train_data_df['Vader_polarity'] = labeled_train_data_df['review'].apply(
    lambda review: analyzer.polarity_scores(review)['compound'])

In [7]:
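# VADER's suggested cutoff: a compound score >= 0.05 counts as positive.
# Neutral scores (between -0.05 and 0.05) are grouped with the negative class here.
labeled_train_data_df['Vader_analysis'] = labeled_train_data_df['Vader_polarity'].apply(lambda x: 1 if x >= 0.05 else 0)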

labeled_train_data_df.groupby(['Vader_analysis'])['Vader_analysis'].count().sort_index()

Vader_analysis
0     8493
1    16507
Name: Vader_analysis, dtype: int64

In [8]:
accuracy_score(labeled_train_data_df['sentiment'],labeled_train_data_df['Vader_analysis'])

0.69556

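TextBlob reaches an accuracy of 68.5%, while VADER reaches 69.6%, about one percentage point higher. Both beat the 50% expected from random guessing on this balanced dataset.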
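
The accuracy figures hide how each model errs. Because the class totals, the predicted totals, and the accuracies are all printed above, the four confusion-matrix cells can be recovered by arithmetic alone. The helper `confusion_from_totals` is a hypothetical name, not part of any library, and the sketch assumes the printed counts and accuracies are exact.

```python
def confusion_from_totals(n_pos, n_neg, pred_pos, accuracy):
    """Recover (TP, FP, FN, TN) from class totals, predicted positives, and accuracy."""
    total = n_pos + n_neg
    correct = round(accuracy * total)  # TP + TN
    # pred_pos = TP + FP and n_neg = TN + FP, so pred_pos - n_neg = TP - TN.
    tp = (correct + pred_pos - n_neg) // 2
    tn = correct - tp
    fp = pred_pos - tp
    fn = n_pos - tp
    return tp, fp, fn, tn

# Predicted-positive counts and accuracies as printed in the cells above.
for name, pred_pos, acc in [("TextBlob", 19017, 0.68524), ("VADER", 16507, 0.69556)]:
    tp, fp, fn, tn = confusion_from_totals(12500, 12500, pred_pos, acc)
    print(f"{name}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"recall(pos)={tp / 12500:.3f} recall(neg)={tn / 12500:.3f}")
```

Under these assumptions both analyzers recover most positive reviews but miss many negative ones: TextBlob labels 7,193 of the 12,500 negative reviews as positive, and VADER 5,809.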

## Part 2: Prepping Text for a Custom Model

In [9]:
# Reload the dataset so that Part 2 starts from the original columns, without those added in Part 1.
labeled_train_data_df = pd.read_csv('labeledTrainData.tsv', sep='\t')

### 1. Convert all text to lowercase letters.    

In [10]:
# Lowercase the review text, the only free-text column in the data frame.
labeled_train_data_df['review'] = labeled_train_data_df['review'].str.lower()

### 2. Remove punctuation and special characters from the text.

In [11]:
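# Remove punctuation
import string

# Build the translation table once instead of once per review.
punctuation_table = str.maketrans('', '', string.punctuation)
labeled_train_data_df.review = labeled_train_data_df.review.apply(lambda review: review.translate(punctuation_table))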

In [12]:
# Remove special characters: replace every run of characters other than letters and digits with a single space.
labeled_train_data_df.review = labeled_train_data_df.review.str.replace(r"[^a-zA-Z0-9]+", " ", regex=True)
labeled_train_data_df.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,with all this stuff going down at the moment w...
1,2381_9,1,the classic war of the worlds by timothy hines...
2,7759_3,0,the film starts with a manager nicholas bell g...
3,3630_4,0,it must be assumed that those who praised this...
4,9495_8,1,superbly trashy and wondrously unpretentious 8...
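
One side effect of this order of operations is that HTML line breaks in the raw reviews (`<br />`) lose their angle brackets before anything treats them as markup, so `br` can fuse with a neighbouring word; tokens such as `mkaybr` and `br` appear in the vocabulary printed in step 6. A possible remedy, sketched below with only the standard library, is to replace tags with a space before stripping punctuation. The function `clean_review` and the sample string are illustrative assumptions, not part of the original notebook.

```python
import re
import string

PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def clean_review(text):
    """Lowercase, drop HTML tags, strip punctuation, and collapse other symbols."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)        # replace tags such as <br /> with a space
    text = text.translate(PUNCTUATION_TABLE)    # remove punctuation characters
    text = re.sub(r"[^a-z0-9]+", " ", text)     # collapse everything else to one space
    return text.strip()

# Made-up sample in the style of the raw reviews.
sample = "Drugs are bad, m'kay.<br /><br />Visually impressive!"
print(clean_review(sample))
# Without the tag-removal line, the same string would become
# "drugs are bad mkaybr br visually impressive".
```

Applying this to the data frame would change the vocabulary size reported later, so the numbers in steps 5 and 6 would no longer match.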


### 3. Remove stop words.

In [13]:
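# Load the NLTK English stop word list (requires the 'stopwords' corpus, e.g. via nltk.download('stopwords')).
from nltk.corpus import stopwords
# Use a separate name so the imported module is not overwritten; a set makes membership tests fast.
stop_words = set(stopwords.words('english'))

# Remove stop words
labeled_train_data_df.review = labeled_train_data_df.review.apply(
    lambda review: ' '.join([word for word in review.split() if word not in stop_words]))
labeled_train_data_df.head()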

Unnamed: 0,id,sentiment,review
0,5814_8,1,stuff going moment mj ive started listening mu...
1,2381_9,1,classic war worlds timothy hines entertaining ...
2,7759_3,0,film starts manager nicholas bell giving welco...
3,3630_4,0,must assumed praised film greatest filmed oper...
4,9495_8,1,superbly trashy wondrously unpretentious 80s e...
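
Because punctuation was stripped in step 2, contractions reach this step without their apostrophes, while NLTK's list stores forms such as `don't` and `you've` with them, so `dont` and `youve` no longer match. Row 0 above still contains `ive`, for example, although whether `i've` itself is listed may depend on the NLTK version. One way to close the gap, sketched here, is to pass the stop list through the same punctuation stripping as the text. To stay runnable without the NLTK corpus, the sketch uses a tiny hand-picked subset (`sample_stop_words`, an assumption for illustration) in place of `stopwords.words('english')`.

```python
import string

# Hand-picked subset standing in for stopwords.words('english') (illustrative only).
sample_stop_words = ["i", "the", "don't", "you've", "is"]

punctuation_table = str.maketrans('', '', string.punctuation)

# Normalise the stop list the same way the reviews were normalised in step 2.
normalised_stop_words = {word.translate(punctuation_table) for word in sample_stop_words}

def remove_stop_words(text, stop_words):
    """Keep only the whitespace-separated tokens that are not in stop_words."""
    return ' '.join(word for word in text.split() if word not in stop_words)

review = "i dont think youve seen the film"
print(remove_stop_words(review, set(sample_stop_words)))  # contractions slip through
print(remove_stop_words(review, normalised_stop_words))   # contractions are removed
```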


### 4. Apply NLTK’s PorterStemmer.

In [14]:
from nltk.stem.porter import PorterStemmer

# Create stemmer
porter = PorterStemmer()
# Apply stemmer
labeled_train_data_df.review = labeled_train_data_df.review.apply(lambda review:' '.join([porter.stem(word) for word in review.split()]))
labeled_train_data_df.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,stuff go moment mj ive start listen music watc...
1,2381_9,1,classic war world timothi hine entertain film ...
2,7759_3,0,film start manag nichola bell give welcom inve...
3,3630_4,0,must assum prais film greatest film opera ever...
4,9495_8,1,superbl trashi wondrous unpretenti 80 exploit ...
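
Stemming 25,000 reviews calls `porter.stem` once per token, yet reviews reuse the same words many times over. A possible optimisation is to memoise the stemmer so that each distinct word is stemmed only once. The helpers `cached_stem` and `stem_text` are hypothetical names, and the sketch assumes NLTK is installed (the Porter stemmer itself needs no corpus download).

```python
from functools import lru_cache
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

@lru_cache(maxsize=None)
def cached_stem(word):
    """Stem a word, remembering the result for later occurrences."""
    return porter.stem(word)

def stem_text(text):
    """Stem every whitespace-separated token in text."""
    return ' '.join(cached_stem(word) for word in text.split())

# Words taken from row 0 of the stop-word-filtered output above.
print(stem_text("stuff going moment started listening music"))
print(cached_stem.cache_info())  # hits grow as words repeat across reviews
```

The column could then be updated with `labeled_train_data_df.review.apply(stem_text)`, which should give the same stems as the cell above.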


### 5. Create a bag-of-words matrix from your stemmed text (output from (4)), where each row is a word-count vector for a single movie review (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook). Display the dimensions of your bag-of-words matrix. The number of rows in this matrix should be the same as the number of rows in your original data frame.

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

# Create the bag of words feature matrix
count = CountVectorizer()
bag_of_words = count.fit_transform(labeled_train_data_df.review)
# Show feature matrix
bag_of_words 

<25000x91908 sparse matrix of type '<class 'numpy.int64'>'
	with 2439277 stored elements in Compressed Sparse Row format>

In [16]:
bag_of_words.shape

(25000, 91908)
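
To make the shape of this matrix concrete, here is a toy version with three made-up, already-stemmed "reviews" (the corpus `toy_reviews` is an assumption for illustration). Each row is one document and each column one vocabulary term, so the row count always equals the number of documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three tiny stand-in documents (illustrative, not taken from the dataset).
toy_reviews = [
    "film start manag",
    "classic war world film film",
    "stuff go moment",
]

toy_count = CountVectorizer()
toy_bow = toy_count.fit_transform(toy_reviews)

# Terms in column order, recovered from the term -> column index mapping.
terms = sorted(toy_count.vocabulary_, key=toy_count.vocabulary_.get)

print(toy_bow.shape)      # (documents, distinct terms)
print(terms)
print(toy_bow.toarray())  # a dense view is only sensible at this toy size
```

The full matrix is kept sparse because, per the output above, only 2,439,277 of its 25,000 × 91,908 cells are non-zero.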

### 6. Create a term frequency-inverse document frequency (tf-idf) matrix from your stemmed text, for your movie reviews (see section 6.9 in the Machine Learning with Python Cookbook). Display the dimensions of your tf-idf matrix. These dimensions should be the same as your bag-of-words matrix.

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
 
# Create the tf-idf feature matrix
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(labeled_train_data_df.review)
# Show the learned vocabulary (term -> column index); the tf-idf values themselves are stored in feature_matrix.
tfidf.vocabulary_

{'stuff': 77888,
 'go': 33711,
 'moment': 52832,
 'mj': 52606,
 'ive': 42173,
 'start': 76735,
 'listen': 47296,
 'music': 54427,
 'watch': 88063,
 'odd': 57487,
 'documentari': 23379,
 'wiz': 89827,
 'moonwalk': 53151,
 'mayb': 50395,
 'want': 87812,
 'get': 33084,
 'certain': 14835,
 'insight': 41100,
 'guy': 35396,
 'thought': 81392,
 'realli': 66251,
 'cool': 18413,
 'eighti': 25631,
 'make': 49128,
 'mind': 52047,
 'whether': 89047,
 'guilti': 35198,
 'innoc': 41023,
 'part': 60219,
 'biographi': 10152,
 'featur': 29070,
 'film': 29583,
 'rememb': 67058,
 'see': 71248,
 'cinema': 16169,
 'origin': 58683,
 'releas': 66940,
 'subtl': 78171,
 'messag': 51411,
 'feel': 29122,
 'toward': 82790,
 'press': 63711,
 'also': 4576,
 'obviou': 57383,
 'drug': 24490,
 'bad': 7840,
 'mkaybr': 52611,
 'br': 11731,
 'visual': 87252,
 'impress': 40339,
 'cours': 18939,
 'michael': 51627,
 'jackson': 42257,
 'unless': 85363,
 'remot': 67096,
 'like': 46993,
 'anyway': 5816,
 'hate': 36549,
 'find':

In [18]:
feature_matrix.shape

(25000, 91908)
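
The dimensions match because, under default settings, `TfidfVectorizer` tokenises exactly as `CountVectorizer` does; only the cell values differ. The sketch below checks this on a made-up corpus (`toy_reviews` is an illustrative assumption) and also confirms that, with the default `norm='l2'`, every row has unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three tiny stand-in documents (illustrative, not taken from the dataset).
toy_reviews = [
    "film start manag",
    "classic war world film film",
    "stuff go moment",
]

toy_counts = CountVectorizer().fit_transform(toy_reviews)
toy_tfidf = TfidfVectorizer().fit_transform(toy_reviews)

# Same vocabulary, hence the same shape.
print(toy_counts.shape, toy_tfidf.shape)

# Rows are L2-normalised by default, so each has Euclidean length 1.
row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)
print(row_norms)
```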