# Week 10

**Sentiment Analysis Questions**

1.	What is sentiment analysis?

  a.	Sentiment analysis is a method for determining whether a piece of text expresses a positive, negative, or neutral opinion.

2.	Where can we get reviews?

  a.	Amazon product reviews, Rotten Tomatoes, and IMDb.

3.	Give me examples of the importance of sentiment.

  a.	One example is gauging how well a company’s product or service is performing. Another is identifying negatively perceived brands, products, and services so that they can be fixed or updated.

4.	What are the sentiment analysis steps?

  a.	Pre-processing: remove stop words, punctuation, and numbers; convert all characters to lowercase; tokenise the text; and apply stemming to reduce words to their base (stem) form.

  b.	Feature extraction: use n-grams to capture sequences of n adjacent tokens, parts-of-speech tagging to attach part-of-speech metadata to each token, term frequency to measure how often each term occurs in a single text/document, and inverse document frequency to down-weight terms that appear in many texts/documents so that more distinctive terms stand out.

  c.	Lexicon matching: compare the text to a lexicon of positive and negative words to estimate its overall sentiment. For example, a sentence with more positive than negative words is classed as positive.

  d.	Model selection and training.

  e.	Evaluation: use evaluation metrics such as accuracy, precision, recall, and F1 score to assess the effectiveness of the chosen model and its predictions.

5.	How do we know if we have a good model?

  a.	The evaluation metrics give a good indication of the model’s performance. Consistently high accuracy, precision, recall, and F1 score on held-out test data indicate that the model’s predictions are reliable.
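The pre-processing and feature extraction steps from question 4 can be sketched in plain Python. This is a minimal sketch rather than the approach used in the notebook further down: the stop-word list and the three reviews are invented for illustration, stemming is left out because it needs a stemmer such as NLTK's `PorterStemmer`, and the IDF formula used here, log(N / df), is only one common variant.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists such as NLTK's are much longer).
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "of", "it", "this"}


def preprocess(text):
    """Lowercase, drop punctuation and digits, tokenise, and remove stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_frequency(tokens):
    """Relative frequency of each term within a single document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequency(docs):
    """log(N / df) per term, where df is the number of documents containing the term."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Hypothetical reviews, invented for illustration.
reviews = [
    "The plot was great and the acting was great!",
    "The plot was dull.",
    "Great soundtrack, dull ending.",
]

docs = [preprocess(review) for review in reviews]
idf = inverse_document_frequency(docs)
tf = term_frequency(docs[0])
tfidf = {term: tf[term] * idf[term] for term in tf}

print(docs[0])             # tokens of the first review after pre-processing
print(ngrams(docs[0], 2))  # bigrams of the first review
print(tfidf)               # TF-IDF weight of each term in the first review
```

In the first review, "acting" ends up with a higher TF-IDF weight than "great" even though "great" occurs twice, because "acting" appears in only one of the three documents.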


**Implement Sentiment Analysis on Sentences using Lexicon**

Sentence 1 has a neutral sentiment with 3 positives and 3 negatives.

Sentence 2 has a negative sentiment with 2 positives and 3 negatives.

Sentence 3 has a positive sentiment with 4 positives and 2 negatives.

Sentence 4 has a positive sentiment with 4 positives and 3 negatives.
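The counting rule behind these results can be sketched as follows. The original sentences and lexicon are not reproduced in these notes, so the mini-lexicon and the example sentence below are hypothetical; only the positive/negative counts for the four sentences come from the results above.

```python
import re

# Hypothetical mini-lexicon (an assumption; the lexicon used in the exercise is not shown here).
POSITIVE = {"good", "great", "love", "excellent", "enjoyable"}
NEGATIVE = {"bad", "boring", "hate", "poor", "awful"}


def label_from_counts(pos, neg):
    """The larger count decides the sentiment; a tie is treated as neutral."""
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


def lexicon_sentiment(sentence):
    """Count lexicon hits in a sentence and return (label, positives, negatives)."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    return label_from_counts(pos, neg), pos, neg


# Invented example sentence: two positive hits and one negative hit.
print(lexicon_sentiment("Great cast and a good script, but the ending was awful."))

# Counts reported above for sentences 1 to 4, as (positives, negatives).
reported_counts = [(3, 3), (2, 3), (4, 2), (4, 3)]
for number, (pos, neg) in enumerate(reported_counts, start=1):
    print(f"Sentence {number}: {label_from_counts(pos, neg)}")
```

A simple count like this ignores negation ("not good") and word intensity, which is one reason a trained classifier is used in the next section.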

**Implementation Example in Python**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import nltk

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Download stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [2]:
# Download IMDb sentiment datasets from Kaggle
import kagglehub

path = kagglehub.dataset_download("columbine/imdb-dataset-sentiment-analysis-in-csv-format")

print("Path to dataset files:", path)

import os

# Check name of datasets
print(os.listdir(path))

Downloading from https://www.kaggle.com/api/v1/datasets/download/columbine/imdb-dataset-sentiment-analysis-in-csv-format?dataset_version_number=1...


100%|██████████| 25.7M/25.7M [00:01<00:00, 21.8MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/columbine/imdb-dataset-sentiment-analysis-in-csv-format/versions/1
['Train.csv', 'Valid.csv', 'Test.csv']


In [3]:
# Load datasets into data frames
train_dataset = pd.read_csv(os.path.join(path, 'Train.csv'))
test_dataset = pd.read_csv(os.path.join(path, 'Test.csv'))

display(train_dataset.head())
display(test_dataset.head())

Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,0
1,"When I put this movie in my DVD player, and sa...",0
2,Why do people who do not know what a particula...,0
3,Even though I have great interest in Biblical ...,0
4,Im a die hard Dads Army fan and nothing will e...,1


Unnamed: 0,text,label
0,I always wrote this series off as being a comp...,0
1,1st watched 12/7/2002 - 3 out of 10(Dir-Steve ...,0
2,This movie was so poorly written and directed ...,0
3,The most interesting thing about Miryang (Secr...,1
4,"when i first read about ""berlin am meer"" i did...",0


In [4]:
# Split training and testing datasets into x features and y labels
x_train = train_dataset['text']
y_train = train_dataset['label']

x_test = test_dataset['text']
y_test = test_dataset['label']

In [5]:
# Print first review
x_train[0]

'I grew up (b. 1965) watching and loving the Thunderbirds. All my mates at school watched. We played "Thunderbirds" before school, during lunch and after school. We all wanted to be Virgil or Scott. No one wanted to be Alan. Counting down from 5 became an art form. I took my children to see the movie hoping they would get a glimpse of what I loved as a child. How bitterly disappointing. The only high point was the snappy theme tune. Not that it could compare with the original score of the Thunderbirds. Thankfully early Saturday mornings one television channel still plays reruns of the series Gerry Anderson and his wife created. Jonatha Frakes should hand in his directors chair, his version was completely hopeless. A waste of film. Utter rubbish. A CGI remake may be acceptable but replacing marionettes with Homo sapiens subsp. sapiens was a huge error of judgment.'

In [6]:
# Vectorize the text data using CountVectorizer
vectorizer = CountVectorizer(stop_words = 'english')

x_train_counts = vectorizer.fit_transform(x_train)
x_test_counts = vectorizer.transform(x_test)

In [7]:
# Train a Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(x_train_counts, y_train)

In [8]:
# Make predictions on the test set
predictions = classifier.predict(x_test_counts)

In [9]:
label_names = ['Negative', 'Positive']

# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)
class_report = classification_report(y_test, predictions, target_names = label_names)

In [10]:
# Print results
print(f"Accuracy: {accuracy:.2f}")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)

Accuracy: 0.86

Confusion Matrix:
[[2192  303]
 [ 383 2122]]

Classification Report:
              precision    recall  f1-score   support

    Negative       0.85      0.88      0.86      2495
    Positive       0.88      0.85      0.86      2505

    accuracy                           0.86      5000
   macro avg       0.86      0.86      0.86      5000
weighted avg       0.86      0.86      0.86      5000
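
As a cross-check, the figures in the classification report can be recomputed by hand from the confusion matrix. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, and that label 0 is Negative and label 1 is Positive, as in `label_names`. The function name `per_class_metrics` is made up for this example.

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels (scikit-learn's convention).
conf_matrix = [[2192, 303],
               [383, 2122]]
label_names = ["Negative", "Positive"]

total = sum(sum(row) for row in conf_matrix)
correct = sum(conf_matrix[i][i] for i in range(len(conf_matrix)))
accuracy = correct / total


def per_class_metrics(matrix, k):
    """Return (precision, recall, F1) for class k of a square confusion matrix."""
    true_positives = matrix[k][k]
    predicted_as_k = sum(row[k] for row in matrix)  # column sum
    actually_k = sum(matrix[k])                     # row sum
    precision = true_positives / predicted_as_k
    recall = true_positives / actually_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


print(f"Accuracy: {accuracy:.2f}")
for k, name in enumerate(label_names):
    precision, recall, f1 = per_class_metrics(conf_matrix, k)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The printed values round to the same accuracy, precision, recall, and F1 figures shown in the report: 4,314 of the 5,000 test reviews sit on the diagonal, and the two off-diagonal cells (303 and 383) are the misclassified reviews.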



**Reflection**

This week, I learned how to perform lexicon-based sentiment analysis by counting the positive and negative words in a sentence. Whichever count was larger determined the sentiment of the sentence, with a tie treated as neutral.

I also learned the steps involved in sentiment analysis: pre-processing, feature extraction, lexicon matching, model training, and evaluation. The pre-processing techniques I covered were removing stop words and punctuation, tokenization, and stemming. The feature extraction techniques were capturing sequences of adjacent words using n-grams, adding part-of-speech information using POS tagging, and identifying words that are frequent within a document but distinctive across multiple documents by combining term frequency with inverse document frequency.

Lastly, I implemented a sentiment classifier using multinomial Naive Bayes. I chose an IMDb sentiment dataset from Kaggle. The training and testing texts were vectorized into word counts, with English stop words removed. The model was then trained on the vectorized training dataset, and predictions were then made on the vectorized testing dataset. The model was evaluated using accuracy, a confusion matrix, and a classification report. Overall, the model performed well, with an accuracy of 86% and far more correct than incorrect predictions (4,314 versus 686 out of 5,000). The precision, recall, and F1 scores were all between 0.85 and 0.88 and closely matched between the two classes, indicating that the model predicts negative and positive reviews about equally well.