### Sentiment Analysis Experiment using "moviereviews2.tsv" Dataset
1. Check if the dataset contains empty values, unnecessary spaces, etc.
2. Based on the selected dataset, determine the accuracy of sentiment analysis using VADER and TextBlob lexicons. Also, calculate the precision, recall, and F1-score of VADER and TextBlob models.
3. Create a machine learning sentiment classification model that automatically detects positive and negative sentiments in text.
   Steps:
   - Split the dataset into training and testing sets.
   - Extract numerical features (choose features at your discretion).
   - Train the selected model (K-nearest neighbors - KNN).
   - Evaluate the model's accuracy. Also, calculate the precision, recall, and F1-score of the KNN model.

#### moviereviews2 Dataset Preparation

In [5]:
import pandas as pd
import numpy as np
import spacy
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from textblob import TextBlob
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Load the dataset
data = pd.read_csv('moviereviews2.tsv', sep='\t')

# Check for empty values
print("Empty values:")
print(data.isnull().sum())

# Display label counts
print("\nLabel Counts:")
print(data['label'].value_counts())

# Show total number of reviews
print("\nTotal Reviews:", len(data))


# Remove rows with null values in the 'review' column
fixed_data = data.dropna(subset=['review'])

print("\n  ///////")

# Verify that null reviews have been removed
print("\nEmpty values:")
print(fixed_data.isnull().sum())
print("Label Counts:")
print(fixed_data['label'].value_counts())
print("Total Reviews:", len(fixed_data))


# Check for reviews that consist only of whitespace characters
whitespace_check = fixed_data['review'].apply(lambda x: x.isspace())
print("\nWhitespace Check:")
print(whitespace_check.value_counts())

fixed_data.head()


Empty values:
label      0
review    20
dtype: int64

Label Counts:
label
pos    3000
neg    3000
Name: count, dtype: int64

Total Reviews: 6000

  ///////

Empty values:
label     0
review    0
dtype: int64
Label Counts:
label
pos    2990
neg    2990
Name: count, dtype: int64
Total Reviews: 5980

Whitespace Check:
review
False    5980
Name: count, dtype: int64


| | label | review |
|---|---|---|
| 0 | pos | I loved this movie and will watch it again. Or... |
| 1 | pos | A warm, touching movie that has a fantasy-like... |
| 2 | pos | I was not expecting the powerful filmmaking ex... |
| 3 | neg | This so-called "documentary" tries to tell tha... |
| 4 | pos | This show has been my escape from reality for ... |
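
The whitespace check above only flags reviews made up entirely of whitespace. A review can still carry leading, trailing, or doubled spaces. The following is a minimal sketch, not part of the original notebook, of how such cases could be detected and cleaned. `normalize_whitespace` and `has_extra_whitespace` are hypothetical helper names, and the sample strings are made up for illustration.

```python
def normalize_whitespace(text: str) -> str:
    """Trim both ends and collapse every run of whitespace into one space."""
    return " ".join(text.split())


def has_extra_whitespace(text: str) -> bool:
    """Return True if normalizing the text would change it."""
    return text != normalize_whitespace(text)


# Made-up sample strings, not taken from the dataset
samples = ["  great film ", "bad\t\tplot", "fine as is"]
flags = [has_extra_whitespace(s) for s in samples]
cleaned = [normalize_whitespace(s) for s in samples]

print(flags)
print(cleaned)
```

On the real data, the same helper could be applied with `fixed_data['review'].apply(normalize_whitespace)`.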


#### VADER Sentiment Analysis

In [2]:

vader_data = fixed_data.copy()

# Create the analyzer once and reuse it for every review
sid = SentimentIntensityAnalyzer()
vader_data['scores'] = vader_data['review'].apply(sid.polarity_scores)
vader_data['compound'] = vader_data['scores'].apply(lambda scores: scores['compound'])

# Map the compound score to a label: >= 0.05 is 'pos', <= -0.05 is 'neg', anything in between is 'neu'.
# The dataset only has 'pos' and 'neg' labels, so every 'neu' prediction counts as an error.
vader_data['comp_score'] = 'pos'
vader_data.loc[vader_data['compound'] <= -0.05, 'comp_score'] = 'neg'
vader_data.loc[(vader_data['compound'] > -0.05) & (vader_data['compound'] < 0.05), 'comp_score'] = 'neu'

vader_data.head()

| | label | review | scores | compound | comp_score |
|---|---|---|---|---|---|
| 0 | pos | I loved this movie and will watch it again. Or... | {'neg': 0.062, 'neu': 0.695, 'pos': 0.243, 'co... | 0.872 | pos |
| 1 | pos | A warm, touching movie that has a fantasy-like... | {'neg': 0.033, 'neu': 0.783, 'pos': 0.184, 'co... | 0.9549 | pos |
| 2 | pos | I was not expecting the powerful filmmaking ex... | {'neg': 0.097, 'neu': 0.795, 'pos': 0.108, 'co... | 0.7201 | pos |
| 3 | neg | This so-called "documentary" tries to tell tha... | {'neg': 0.116, 'neu': 0.832, 'pos': 0.052, 'co... | -0.9821 | neg |
| 4 | pos | This show has been my escape from reality for ... | {'neg': 0.028, 'neu': 0.769, 'pos': 0.203, 'co... | 0.9935 | pos |
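
The labelling rule in the cell above can be written as a small standalone function, which makes the threshold boundaries easier to test. The sketch below reproduces that three-way rule and, next to it, shows a binary variant with no neutral band. The binary variant is an assumption and is not what this notebook uses. It is included only because the dataset has no neutral label. Both function names are hypothetical.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Three-way rule with the same boundaries as the notebook cell."""
    if compound <= -threshold:
        return "neg"
    if compound < threshold:
        return "neu"
    return "pos"


def binary_label_from_compound(compound: float) -> str:
    """Assumed alternative: no neutral band, a score of zero counts as positive."""
    return "pos" if compound >= 0 else "neg"


# The first two scores come from the table above; the rest probe the boundaries
scores = [0.872, -0.9821, 0.0, 0.03, -0.05, 0.05]
three_way = [label_from_compound(s) for s in scores]
binary = [binary_label_from_compound(s) for s in scores]

print(three_way)
print(binary)
```

Whether the binary rule would improve the reported metrics on this dataset is not something the notebook measures, so it would need to be checked rather than assumed.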


##### Calculating accuracy, precision, recall, and F1-score with VADER

In [3]:

acc_score = accuracy_score(vader_data['label'], vader_data['comp_score'])
pre_score = precision_score(vader_data['label'], vader_data['comp_score'], average='weighted')
rec_score = recall_score(vader_data['label'], vader_data['comp_score'], average='weighted', zero_division=1)
f_score = f1_score(vader_data['label'], vader_data['comp_score'], average='weighted')

print(f"Vader Accuracy: {acc_score}, Precision: {pre_score}, Recall: {rec_score}, F1-score: {f_score}")


Vader Accuracy: 0.7277591973244147, Precision: 0.7574199030583763, Recall: 0.7277591973244147, F1-score: 0.7245674966067225
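
In the output above, recall and accuracy are identical. That follows from weighted averaging: each class's recall is weighted by its share of the true labels, so the sum reduces to correct predictions divided by total predictions. The pure-Python sketch below illustrates the calculation on made-up labels. `weighted_scores` is a hypothetical helper, and treating an undefined ratio as 0.0 is an assumption of this sketch (scikit-learn exposes that choice through `zero_division`).

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return (precision, recall, f1), each averaged over classes by support."""
    support = Counter(y_true)
    labels = sorted(set(y_true) | set(y_pred))
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = support[label]
        # Undefined ratios are treated as 0.0 in this sketch
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / actual if actual else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = actual / total  # a class with no true samples gets weight 0
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1


# Made-up labels; one prediction is 'neu', which never occurs in y_true
y_true = ["pos", "pos", "pos", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "neu", "neg", "pos", "neg"]

precision, recall, f1 = weighted_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```

Because 'neu' has no true samples, its weight is zero. It still lowers the scores indirectly, since the review predicted as 'neu' is a missed 'pos'.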


#### TextBlob Sentiment Analysis

In [8]:

nlp = spacy.load("en_core_web_sm")

def get_textblob_score(text):
    analysis = TextBlob(text)
    if analysis.sentiment.polarity > 0:
        return 'pos'
    else:
        return 'neg'

# Work on a copy so the new columns are not written into fixed_data
TextBlob_data = fixed_data.copy()
TextBlob_data.loc[:, 'polarity'] = TextBlob_data['review'].apply(lambda x: TextBlob(x).sentiment.polarity)
TextBlob_data.loc[:, 'subjectivity'] = TextBlob_data['review'].apply(lambda x: TextBlob(x).sentiment.subjectivity)
TextBlob_data.loc[:, 'comp_score'] = TextBlob_data['review'].apply(get_textblob_score)

TextBlob_data.head()

| | label | review | polarity | subjectivity | comp_score |
|---|---|---|---|---|---|
| 0 | pos | I loved this movie and will watch it again. Or... | 0.615 | 0.57 | pos |
| 1 | pos | A warm, touching movie that has a fantasy-like... | 0.373333 | 0.598889 | pos |
| 2 | pos | I was not expecting the powerful filmmaking ex... | 0.100889 | 0.375302 | pos |
| 3 | neg | This so-called "documentary" tries to tell tha... | 0.039785 | 0.43629 | pos |
| 4 | pos | This show has been my escape from reality for ... | 0.202484 | 0.534265 | pos |


##### Calculating accuracy, precision, recall, and F1-score with TextBlob

In [9]:

acc_score_textblob = accuracy_score(TextBlob_data['label'], TextBlob_data['comp_score'])
pre_score_textblob = precision_score(TextBlob_data['label'], TextBlob_data['comp_score'], average='weighted')
rec_score_textblob = recall_score(TextBlob_data['label'], TextBlob_data['comp_score'], average='weighted', zero_division=1)
f_score_textblob = f1_score(TextBlob_data['label'], TextBlob_data['comp_score'], average='weighted')

print(f"TextBlob accuracy_score: {acc_score_textblob}, precision_score: {pre_score_textblob}, recall_score: {rec_score_textblob}, f1_score: {f_score_textblob}")

TextBlob accuracy_score: 0.7389632107023412, precision_score: 0.7926858907605404, recall_score: 0.7389632107023412, f1_score: 0.7264087489832409
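
Weighted averages hide which class the errors fall in. In the TextBlob table above, for example, the negative review in row 3 has a small positive polarity and is labelled 'pos'. Counting (true, predicted) pairs shows whether mistakes like that are concentrated in one class. The sketch below is a minimal illustration on made-up labels, not a result from this dataset. `confusion_counts` is a hypothetical helper.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


# Made-up labels for illustration only
y_true = ["pos", "pos", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "pos", "pos", "neg"]

counts = confusion_counts(y_true, y_pred)
for (true_label, pred_label), n in sorted(counts.items()):
    print(f"true={true_label} predicted={pred_label}: {n}")
```

On the notebook's data, the same function could be called with `TextBlob_data['label']` and `TextBlob_data['comp_score']`.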


#### Data Splitting

In [10]:

X = fixed_data['review']
Y = fixed_data['label']
# Hold out 25% of the reviews for testing; the fixed seed makes the split reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

print(X_train.shape)
print(Y_train.shape)

(4485,)
(4485,)
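
The training shape printed above follows from the split fraction. The arithmetic below is a small sketch that assumes scikit-learn rounds the test count up when `test_size` is a fraction and assigns the remainder to training. With these numbers the product is a whole number, so the rounding rule does not affect the result.

```python
import math

n_samples = 5980   # reviews remaining after rows with null reviews were dropped
test_size = 0.25   # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)  # assumed rounding rule
n_train = n_samples - n_test

print(n_train, n_test)
```

The split in this notebook is not stratified. `train_test_split` also accepts a `stratify` argument, which could be set to `Y` to keep the pos/neg ratio the same in both parts.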


#### Model Training and Evaluation

In [12]:

# Pipeline: TF-IDF features feed a KNN classifier with default settings (n_neighbors=5)
knn = Pipeline([('tfidf', TfidfVectorizer()), ('clf', KNeighborsClassifier())])

knn.fit(X_train, Y_train)
predictions = knn.predict(X_test)
print(metrics.accuracy_score(Y_test, predictions))

0.8187290969899665
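
To make the pipeline's behaviour more concrete, the sketch below reimplements the idea in pure Python: turn each review into a vector, measure Euclidean distance to every training review, and take a majority vote among the nearest ones. It is a simplified illustration, not the scikit-learn implementation. It uses L2-normalised word counts in place of TF-IDF weights and `k=3` instead of the default of 5. The function names and the toy reviews are all made up.

```python
import math
from collections import Counter


def vectorize(text):
    """L2-normalised word counts (a stand-in for TF-IDF weighting)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {term: c / norm for term, c in counts.items()}


def distance(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def knn_predict(train_texts, train_labels, text, k=3):
    """Label text by majority vote among its k nearest training reviews."""
    query = vectorize(text)
    ranked = sorted(
        zip(train_texts, train_labels),
        key=lambda pair: distance(vectorize(pair[0]), query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set, invented for illustration
train_texts = [
    "a wonderful touching film",
    "wonderful acting and a great story",
    "great fun from start to finish",
    "a dull boring film",
    "boring plot and terrible acting",
    "terrible waste of time",
]
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

first = knn_predict(train_texts, train_labels, "a great wonderful story")
second = knn_predict(train_texts, train_labels, "dull and terrible")
print(first, second)
```

KNN stores the training vectors instead of fitting parameters, so prediction cost grows with the size of the training set.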
