<center>

## DSC550-T303 Data Mining
## Week 3: Text/Sentiment Analysis, Categorical Data, and Dates/Times
## Exercise: Sentiment Analysis and Preprocessing Text
### Karthika Vellingiri
### 13 Dec 2024

</center>

### **<span style="color:blue">Download the labeled training dataset from this link: Bag of Words Meets Bags of Popcorn.</span>** 

#### **<span style="color:blue">Part 1: Using the TextBlob Sentiment Analyzer </span>**

**1. Import the movie review data as a data frame and ensure that the data is loaded properly.**

**2. How many positive and how many negative reviews are there?**

**3. Use TextBlob to classify each movie review as positive or negative. Assume that a polarity score greater than or equal to zero is a positive sentiment and less than 0 is a negative sentiment.**

**4. Check the accuracy of this model. Is this model better than random guessing?**

**5. For up to five points extra credit, use another prebuilt text sentiment analyzer, e.g., VADER, and repeat steps (3) and (4).**
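
The decision rule in step 3 and the accuracy check in step 4 can be sketched without any sentiment library. The polarity scores and labels below are made-up illustrative values, not output from TextBlob or VADER, and `classify` and `accuracy` are hypothetical helper names.

```python
from collections import Counter

def classify(polarity):
    """Map a polarity score to a label: >= 0 is positive (1), < 0 is negative (0)."""
    return 1 if polarity >= 0 else 0

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up polarity scores and true labels, for illustration only
polarities = [0.35, -0.10, 0.00, 0.20, -0.40, 0.05]
labels = [1, 0, 1, 0, 0, 1]

predictions = [classify(p) for p in polarities]
model_accuracy = accuracy(labels, predictions)

# Baseline: always predict the most common class
majority_share = Counter(labels).most_common(1)[0][1] / len(labels)

print(predictions)
print(model_accuracy, majority_share)
```

Note that a score of exactly 0 counts as positive under this rule, so neutral reviews inflate the positive predictions. When the two classes are balanced, the majority-class baseline is 0.5, the same as uniform random guessing, so a model has to exceed 0.5 to be useful.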


In [22]:
import pandas as pd
from textblob import TextBlob
from sklearn.metrics import accuracy_score
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import os
import nltk
from IPython.display import display, HTML

# Load dataset from the current working directory
cwd = os.getcwd()  # Get the current working directory
file_path = os.path.join(cwd, "labeledTrainData.tsv")  # Define the file path

# Load the dataset into a pandas DataFrame
# data = pd.read_csv(file_path, delimiter='\t', nrows=5)
data = pd.read_csv(file_path, delimiter='\t')

# Download the VADER lexicon without any output
nltk.download('vader_lexicon', quiet=True)

# Set up display options to allow full display of the review column
pd.set_option('display.max_colwidth', 150)

# Handle missing values: Drop rows with missing sentiment or review data
data.dropna(subset=['review', 'sentiment'], inplace=True)

# Display basic information about the dataset
display(HTML("<h3>Data Loaded and Cleaned:</h3>"))
display(data.head())  # Display the first five rows

# Checking for missing values after cleaning
display(HTML("<h4>Missing Values After Cleaning:</h4>"))
missing_values = data.isnull().sum()
formatted_missing_values = "\n".join([f"{col}: {missing_values[col]}" for col in missing_values.index])
display(HTML(f"<pre>{formatted_missing_values}</pre>"))

# Count the number of positive and negative reviews in the dataset
sentiment_counts = data['sentiment'].value_counts()
formatted_sentiment = f"sentiment\nPositive    {sentiment_counts.get(1, 0)}\nNegative    {sentiment_counts.get(0, 0)}"
display(HTML("<h4>Count of Positive and Negative Reviews:</h4>"))
display(HTML(f"<pre>{formatted_sentiment}</pre>"))

# Using TextBlob for sentiment classification with polarity score
def classify_sentiment_textblob(review):
    # Create a TextBlob object for the review and get the polarity score
    blob = TextBlob(review)
    polarity = blob.sentiment.polarity

    # If the polarity score is greater than or equal to 0, it's positive, otherwise negative
    sentiment = "Positive" if polarity >= 0 else "Negative"
    return sentiment, polarity

# Apply the classification function to each review in the 'review' column
data[['TextBlob_Sentiment', 'Polarity_Score']] = data['review'].apply(lambda review: pd.Series(classify_sentiment_textblob(review)))

# Calculate accuracy of TextBlob sentiment analysis by comparing with actual sentiment values
textblob_accuracy = accuracy_score(data['sentiment'].map({1: 'Positive', 0: 'Negative'}), data['TextBlob_Sentiment'])

# Using VADER for sentiment classification
sid = SentimentIntensityAnalyzer()

def classify_sentiment_vader(review):
    # Get the compound sentiment score from VADER
    score = sid.polarity_scores(review)['compound']
    return "Positive" if score >= 0 else "Negative"

# Apply VADER sentiment analysis to each review in the 'review' column
data['VADER_Sentiment'] = data['review'].apply(classify_sentiment_vader)

# Calculate accuracy of VADER sentiment analysis by comparing with actual sentiment values
vader_accuracy = accuracy_score(data['sentiment'].map({1: 'Positive', 0: 'Negative'}), data['VADER_Sentiment'])

# Baseline accuracy: always predict the majority class (equals 0.5, the same as
# uniform random guessing, when the two classes are balanced)
random_guess_accuracy = data['sentiment'].value_counts(normalize=True).max()

# Display comparison of actual vs. predicted sentiments for the first 10 rows
display(HTML("<h4>Comparison of Actual and Predicted Sentiments (First 10 Rows):</h4>"))
display(data[['review', 'sentiment', 'TextBlob_Sentiment', 'VADER_Sentiment']].head(10))  # Show a subset of data for readability

# Display accuracy of both models and random guessing
display(HTML(f"<b><br>TextBlob Sentiment Analysis Accuracy:</b> {textblob_accuracy}"))
display(HTML(f"<b><br>VADER Sentiment Analysis Accuracy:</b> {vader_accuracy}"))
display(HTML(f"<b><br>Random Guessing Accuracy:</b> {random_guess_accuracy}"))


# Final Summary: Is TextBlob better than random guessing?
if textblob_accuracy > random_guess_accuracy:
    display(HTML("<br><span style='color:green;'>TextBlob performs better than random guessing!</span>"))
else:
    display(HTML("<br><span style='color:red;'>TextBlob does not perform better than random guessing.</span>"))

# Final Summary: Is VADER better than random guessing?
if vader_accuracy > random_guess_accuracy:
    display(HTML("<br><span style='color:green;'>VADER performs better than random guessing!</span>"))
else:
    display(HTML("<br><span style='color:red;'>VADER does not perform better than random guessing.</span>"))



|   | id | sentiment | review |
|---|----|-----------|--------|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recrea... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal ani... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this film (\the greatest filmed opera ever,\" didn't I read somewhere?) either don't care for opera, don... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 80's exploitation, hooray! The pre-credits opening sequences somewhat give the false impression that ... |


|   | review | sentiment | TextBlob_Sentiment | VADER_Sentiment |
|---|--------|-----------|--------------------|-----------------|
| 0 | With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The... | 1 | Positive | Negative |
| 1 | \The Classic War of the Worlds\" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recrea... | 1 | Positive | Positive |
| 2 | The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal ani... | 0 | Negative | Negative |
| 3 | It must be assumed that those who praised this film (\the greatest filmed opera ever,\" didn't I read somewhere?) either don't care for opera, don... | 0 | Positive | Negative |
| 4 | Superbly trashy and wondrously unpretentious 80's exploitation, hooray! The pre-credits opening sequences somewhat give the false impression that ... | 1 | Negative | Positive |
| 5 | I dont know why people think this is such a bad movie. Its got a pretty good plot, some good action, and the change of location for Harry does not... | 1 | Positive | Positive |
| 6 | This movie could have been very good, but comes up way short. Cheesy special effects and so-so acting. I could have looked past that if the story ... | 0 | Negative | Negative |
| 7 | I watched this video at a friend's house. I'm glad I did not waste money buying this one. The video cover has a scene from the 1975 movie Capricor... | 0 | Positive | Negative |
| 8 | A friend of mine bought this film for £1, and even then it was grossly overpriced. Despite featuring big names such as Adam Sandler, Billy Bob Tho... | 0 | Positive | Positive |
| 9 | <br /><br />This movie is full of references. Like \Mad Max II\", \"The wild one\" and many others. The ladybug´s face it´s a clear reference (or ... | 1 | Positive | Positive |
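
The stray backslashes in rows 1, 3, and 9 of the output above suggest that default quote handling altered the backslash-escaped quotation marks in the raw file. One possible remedy, assumed here rather than confirmed against this notebook's data, is to disable quote processing when reading the file (in pandas, `quoting=3`, which is the value of `csv.QUOTE_NONE`). The sketch below swaps in the standard-library `csv` module for pandas and uses a made-up one-row TSV to show the difference; `read_reviews` is a hypothetical helper.

```python
import csv
import io

# Made-up one-row TSV whose review field contains backslash-escaped quotes
raw_field = '"\\"The Classic\\" is a very entertaining film"'
tsv_text = "id\tsentiment\treview\n0000_1\t1\t" + raw_field + "\n"

def read_reviews(text, quoting):
    """Return the review column parsed with the given csv quoting mode."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting)
    next(reader)  # skip the header row
    return [row[2] for row in reader]

default_parse = read_reviews(tsv_text, csv.QUOTE_MINIMAL)  # csv module default
literal_parse = read_reviews(tsv_text, csv.QUOTE_NONE)     # no quote handling

print(default_parse[0])
print(literal_parse[0])
```

With quote handling disabled, the field comes back exactly as written in the file, including its surrounding quotation marks, so those would still need to be stripped during preprocessing.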


#### **<span style="color:blue">Part 2: To train your own text classifier, the text must first be converted into a suitable form. The following steps outline a procedure for doing this with the movie review text.</span>**

1. **Convert all text to lowercase letters.**
2. **Remove punctuation and special characters from the text.**
3. **Remove stop words.**
4. **Apply NLTK’s PorterStemmer.**
5. **Create a bag-of-words matrix from your stemmed text** (output from (4)), where each row is a word-count vector for a single movie review (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook). Display the dimensions of your bag-of-words matrix. The number of rows in this matrix should be the same as the number of rows in your original data frame.
6. **Create a term frequency-inverse document frequency (tf-idf) matrix from your stemmed text** for your movie reviews (see section 6.9 in the Machine Learning with Python Cookbook). Display the dimensions of your tf-idf matrix. These dimensions should be the same as your bag-of-words matrix.
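
Steps 1 through 3 and step 5 can be sketched in plain Python on two made-up reviews. The stop-word list here is a tiny hypothetical stand-in for NLTK's English list, stemming (step 4) is skipped to keep the sketch dependency-free, and `preprocess` and `bag_of_words` are illustrative names rather than library functions.

```python
import string

# Tiny hypothetical stop-word list (NLTK's English list is much longer)
STOP_WORDS = {"the", "a", "was", "and", "is", "this"}

def preprocess(text):
    """Lowercase, strip ASCII punctuation, and drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in STOP_WORDS]

def bag_of_words(token_lists):
    """Build a sorted vocabulary and one word-count row per document."""
    vocabulary = sorted({word for tokens in token_lists for word in tokens})
    rows = [[tokens.count(word) for word in vocabulary] for tokens in token_lists]
    return vocabulary, rows

# Made-up reviews, for illustration only
reviews = ["The plot was great, and the acting was great!",
           "This movie is a bore."]

tokens = [preprocess(review) for review in reviews]
vocabulary, matrix = bag_of_words(tokens)

print(vocabulary)
print(matrix)
print((len(matrix), len(vocabulary)))  # (documents, vocabulary terms)
```

Each row of `matrix` corresponds to one review and each column to one vocabulary term, which is why the row count of a bag-of-words matrix should equal the number of reviews.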

In [24]:
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import warnings

warnings.filterwarnings('ignore', category=UserWarning, module='nltk')

# Step 1: Load the data
# Assuming the dataset is loaded already into the DataFrame 'data'
# data = pd.read_csv('path_to_your_file.csv')  # Load the data from a file

# Step 2: Convert all text to lowercase
data['cleaned_review'] = data['review'].str.lower()

# Step 3: Remove punctuation and special characters
def remove_punctuation(text):
    return ''.join([char for char in text if char not in string.punctuation])

data['cleaned_review'] = data['cleaned_review'].apply(remove_punctuation)

# Step 4: Remove stop words
nltk.download('stopwords', quiet=True)  # Download stopwords without any output
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in stop_words])

data['cleaned_review'] = data['cleaned_review'].apply(remove_stopwords)

# Step 5: Apply NLTK’s PorterStemmer
stemmer = PorterStemmer()

def apply_stemming(text):
    return ' '.join([stemmer.stem(word) for word in text.split()])

data['cleaned_review'] = data['cleaned_review'].apply(apply_stemming)

# Step 6: Create a Bag-of-Words (BoW) matrix
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(data['cleaned_review'])

# Display the dimensions of the BoW matrix
print("\nBag-of-Words matrix dimensions:", bow_matrix.shape)

# Step 7: Create a Term Frequency-Inverse Document Frequency (tf-idf) matrix
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(data['cleaned_review'])

# Display the dimensions of the tf-idf matrix
print("\nTF-IDF matrix dimensions:", tfidf_matrix.shape)

# Show the first 5 vocabulary terms (column names) of each matrix
print("\nFirst 5 terms in the BoW matrix:", vectorizer.get_feature_names_out()[:5])
print("\nFirst 5 terms in the TF-IDF matrix:", tfidf_vectorizer.get_feature_names_out()[:5])



Bag-of-Words matrix dimensions: (25000, 92379)

TF-IDF matrix dimensions: (25000, 92379)

First 5 terms in the BoW matrix: ['00' '000' '0000000000001' '000001' '00000110']

First 5 terms in the TF-IDF matrix: ['00' '000' '0000000000001' '000001' '00000110']
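
Both matrices have the same shape because tf-idf only reweights the counts; it does not change the vocabulary. The following is a minimal sketch of that weighting on a made-up count matrix, assuming scikit-learn's default settings (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The function name `tfidf` is illustrative.

```python
import math

def tfidf(count_rows):
    """Reweight a dense count matrix with smoothed idf, then L2-normalize rows."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of documents containing each term
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = [[row[j] * idf[j] for j in range(n_terms)] for row in count_rows]
    result = []
    for row in weighted:
        norm = math.sqrt(sum(v * v for v in row))
        result.append([v / norm if norm else 0.0 for v in row])
    return result

# Made-up counts: 2 documents by 3 vocabulary terms
counts = [[1, 2, 0],
          [1, 0, 1]]
weights = tfidf(counts)

print(weights)
print((len(weights), len(weights[0])))  # same shape as the count matrix
```

In the second row, both terms occur once, but the term found in only one document receives the larger weight. The digit-only terms at the head of the printed vocabulary appear first because feature names are returned in sorted order, and digits sort before letters.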
