In [None]:
# Sentiment Analysis Assignment
# DSC 550
# Week 3
# Data Mining Assignment Week 3
# David Berberena
# 3/31/2024

# Program Start

## Part 1: Using the TextBlob Sentiment Analyzer

### 1. Import the movie review data as a data frame and ensure that the data is loaded properly.

In [9]:
# The libraries needed for Part 1 are Pandas (data importation/manipulation) and textblob (sentiment analyzer).

import pandas as pd
from textblob import TextBlob

# To import the labeled training dataset properly, I need to add the delimiter argument present in the read_csv() function 
# and set it to '\t' to properly read in TSV (tab separated files) files. 

movie_data = pd.read_csv('labeledTrainData.tsv', delimiter='\t')

# The head() function will be employed here to verify that the data has been loaded in properly.

movie_data.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


### 2. How many of each positive and negative reviews are there?

In [12]:
# Looking at the sentiment variable column allows me to see two observations: 1 (I am assuming that this means positive) 
# and 0 (I am assuming this to be negative). I will use the value.counts() function to count each unique value and total 
# them and print the results by calling the unique values.

review_counts = movie_data['sentiment'].value_counts()

print('Total number of positive reviews:', review_counts[1])
print('Total number of negative reviews:', review_counts[0])

Total number of positive reviews: 12500
Total number of negative reviews: 12500


### 3. Use TextBlob to classify each movie review as positive or negative. Assume that a polarity score greater than or equal to zero is a positive sentiment and less than 0 is a negative sentiment.

In [22]:
# In order to use the TextBlob() function on every review in the movie_data data frame, I placed the needed code in a 
# function that I will apply using the apply() function. The classify_review_sentiment() function I have created takes a 
# review as an argument, converts that review into a string (if it has not already been done), and runs the TextBlob() 
# function on that review. The TextBlob() function calculates the polarity metric, which is called through dot notation. 
# The polarity of the review is then classified as either positive or negative based on its value with the help of the 
# if-else conditional statement.

def classify_review_sentiment(review):
    review_analysis = TextBlob(str(review))
    polarity = review_analysis.sentiment.polarity
    if polarity >= 0:
        return 'Positive'
    else:
        return 'Negative'
    
# The classify_review_sentiment() function is applied here to the review column in the movie_data data frame, which the 
# results overwriting the previous sentiment column. 

movie_data['new_sentiment'] = movie_data['review'].apply(classify_review_sentiment)

# The head() function is meant to show the changes made to the data frame.

movie_data.head()

Unnamed: 0,id,sentiment,review,new_sentiment
0,5814_8,Positive,With all this stuff going down at the moment w...,Positive
1,2381_9,Positive,"\The Classic War of the Worlds\"" by Timothy Hi...",Positive
2,7759_3,Negative,The film starts with a manager (Nicholas Bell)...,Negative
3,3630_4,Positive,It must be assumed that those who praised this...,Positive
4,9495_8,Negative,Superbly trashy and wondrously unpretentious 8...,Negative


In [23]:
# Now to realize the number of positive and negative reviews found by TextBlob, I will again use the same value.counts() 
# function and print out how many reviews were positive and how many were negative. 

new_review_counts = movie_data['new_sentiment'].value_counts()

print('Total number of new positive reviews:', new_review_counts['Positive'])
print('Total number of new negative reviews:', new_review_counts['Negative'])

Total number of new positive reviews: 19017
Total number of new negative reviews: 5983


### 4. Check the accuracy of this model. Is this model better than random guessing?

In [27]:
# To check the accuracy of the sentiment analysis, I did some research on how to do this and came across the accuracy_score 
# metric in the sci-kit learn library in Python. This automated way to compute the accuracy is less complex than manually 
# computing accuracy, so I will employ the accuracy_score() function.

# First I must import the sci-kit learn library and directly access the accuracy_score function within the metrics module.

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(movie_data['sentiment'], movie_data['new_sentiment'])

print('Accuracy:', accuracy)

Accuracy: 1.0


In [None]:
# Looking at the accuracy of the sentiment analysis in comparison to the random guessing that the dataset had already done 
# with the sentiment observations prior to the sentiment analysis, the TextBlob analysis does a much better job at 
# classifying the reviews, as there is a large disparity between the number of positive and negative reviews. The random 
# guessing had the reviews split 50/50, yet the sentiment analysis shows that almost 80 percent of the reviews were 
# positive and only about 20 percent were classified as negative.

## Part 2: Prepping Text for a Custom Model

### 1. Convert all text to lowercase letters.

In [49]:
# The string formatting str.lower() function has the ability to make all text lowercase, so I will be using this against 
# the review variable.

movie_data['review'] = movie_data['review'].str.lower()

# The head() function is meant to show the changes made to the data frame.

movie_data.head()

Unnamed: 0,id,sentiment,review,new_sentiment
0,5814_8,1,with all this stuff going down at the moment w...,Positive
1,2381_9,1,the classic war of the worlds by timothy hines...,Positive
2,7759_3,0,the film starts with a manager nicholas bell g...,Negative
3,3630_4,0,it must be assumed that those who praised this...,Positive
4,9495_8,1,superbly trashy and wondrously unpretentious 8...,Negative


### 2. Remove punctuation and special characters from the text.

In [50]:
# This step involves applying a lambda function that checks each review (converted into a string value since it has not 
# been done yet) for alphanumeric characters (with the isalnum() function) and space characters 
# (using the isspace() function) and keeps them (removing the remaining punctuation and special characters by default). The 
# lambda function finally joins each review back together at the whitespaces left by the exclusion of the unwanted 
# characters

movie_data['review'] = movie_data['review'].apply(
    lambda review: ''.join(str(character) for character in review if character.isalnum() or character.isspace()))

# The head() function is meant to show the changes made to the data frame.

movie_data.head()

Unnamed: 0,id,sentiment,review,new_sentiment
0,5814_8,1,with all this stuff going down at the moment w...,Positive
1,2381_9,1,the classic war of the worlds by timothy hines...,Positive
2,7759_3,0,the film starts with a manager nicholas bell g...,Negative
3,3630_4,0,it must be assumed that those who praised this...,Positive
4,9495_8,1,superbly trashy and wondrously unpretentious 8...,Negative


### 3. Remove stop words.

In [54]:
# To remove stopwords, I will need to import the NLTK library to access the stopwords function.

import nltk
from nltk.corpus import stopwords

# To have access to the stopwords in the English language, I used the words() function in stopwords and specified english 
# as the language whose stopwords I wish to use and stored those words into a set variable to call on in my function to 
# remove these stopwords from the review variable. 

stop_words = set(stopwords.words('english'))

# The delete_stopwords() function I made is similar to the lambda function that removed the special characters and 
# punctuation, except that I encased the code in a recallable function. The function works to split the review into 
# individual words, analyze them to identify the stopwords and keep the words that are not stopwords, then join the words 
# to keep with the join() function where the stopwords were deleted. The joined review is returned at the end. 

def delete_stopwords(review):
    kept_words = [word for word in review.split() if word not in stop_words]
    return ' '.join(kept_words)

# The above function is applied to each review in the review variable with the apply() function.

movie_data['review'] = movie_data['review'].apply(delete_stopwords)

# The head() function is meant to show the changes made to the data frame.

movie_data.head()

Unnamed: 0,id,sentiment,review,new_sentiment
0,5814_8,1,stuff going moment mj ive started listening mu...,Positive
1,2381_9,1,classic war worlds timothy hines entertaining ...,Positive
2,7759_3,0,film starts manager nicholas bell giving welco...,Negative
3,3630_4,0,must assumed praised film greatest filmed oper...,Positive
4,9495_8,1,superbly trashy wondrously unpretentious 80s e...,Negative


### 4. Apply NLTK’s PorterStemmer.

In [55]:
# To gain access to PorterStemmer, I must import it from the NLTK library.

from nltk.stem import PorterStemmer

# The function generated here follows the same logic as the previous function, with the PorterStemmer() function iterating 
# over every word in the review with the split() function, then the stemmed words are joined back together at the place 
# where they were stemmed.

def Porter_Stemming(review):
    stemmed_words = [PorterStemmer().stem(word) for word in review.split()]
    return ' '.join(stemmed_words)

# The above function is applied to each review in the review variable with the apply() function.

movie_data['review'] = movie_data['review'].apply(Porter_Stemming)

# The head() function is meant to show the changes made to the data frame.

movie_data.head()

Unnamed: 0,id,sentiment,review,new_sentiment
0,5814_8,1,stuff go moment mj ive start listen music watc...,Positive
1,2381_9,1,classic war world timothi hine entertain film ...,Positive
2,7759_3,0,film start manag nichola bell give welcom inve...,Negative
3,3630_4,0,must assum prais film greatest film opera ever...,Positive
4,9495_8,1,superbl trashi wondrous unpretenti 80 exploit ...,Negative


### 5. Create a bag-of-words matrix from your stemmed text (output from (4)), where each row is a word-count vector for a single movie review (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook). Display the dimensions of your bag-of-words matrix. The number of rows in this matrix should be the same as the number of rows in your original data frame.

In [59]:
# To manage the task of creating a bag-of-words matrix, I need to access the CountVectorizer function in Sci-kitlearn.

from sklearn.feature_extraction.text import CountVectorizer

# To create the bag-of-words matrix, I simply used the CountVectorizer's fit_transform function on the review variable.

bow_matrix = CountVectorizer().fit_transform(movie_data['review'])

# The shape() function is used to directly call the dimensions of the matrix that has been created.

print("The dimensions of the movie review bag-of-words matrix are", bow_matrix.shape)

The dimensions of the movie review bag-of-words matrix are (25000, 92399)


### 6. Create a term frequency-inverse document frequency (tf-idf) matrix from your stemmed text, for your movie reviews (see section 6.9 in the Machine Learning with Python Cookbook). Display the dimensions of your tf-idf matrix. These dimensions should be the same as your bag-of-words matrix.

In [60]:
# To craft a term frequency-inverse document frequency matrix, Sci-kitlearn also has a TfidfVectorizer function that needs 
# to be imported and utilized.

from sklearn.feature_extraction.text import TfidfVectorizer

# The TF-IDF matrix can be made by calling the TfidfVectorizer's fit_transform function and subjecting the review variable 
# to it.

tfidf_matrix = TfidfVectorizer().fit_transform(movie_data['review'])

# The shape() function is used to directly call the dimensions of the matrix that has been created.

print("The dimensions of the movie review TF-IDF matrix are", tfidf_matrix.shape)

The dimensions of the movie review TF-IDF matrix are (25000, 92399)
