<a href="https://colab.research.google.com/github/JericCantos/SentencePolarity/blob/main/notebooks/Sentence_Polarity_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

Using the [Sentence Polarity dataset](https://www.kaggle.com/datasets/nltkdata/sentence-polarity), train and evaluate different sentiment classifier models that predict whether a given review leans **positive** or **negative**.

The project goes through:
1. Data Preprocessing and Cleaning
2. Feature Extraction
3. Model Training and Evaluation with Cross-Validation
4. Hyperparameter Tuning

# Setup

## Import Libraries

In [5]:
import numpy as np
import pandas as pd
import random
import nltk
import re

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the NLTK resources used below
nltk.download('punkt_tab')  # Punkt tokenizer data used by word_tokenize
nltk.download('stopwords')  # List of common stop words in English
nltk.download('punkt')  # Pre-trained tokenizer models
nltk.download('wordnet')  # WordNet lemmatizer dataset

# Libraries for text feature extraction and model training
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Libraries for model evaluation
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.model_selection import KFold, cross_val_score

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Load Dataset

### Kaggle

Before running the next cell, create two Colab secrets named KAGGLE_USERNAME and KAGGLE_KEY. You can get both values by generating an API Token from Kaggle (Settings -> API).

In [2]:
import os
import json
from google.colab import userdata

KAGGLE_USERNAME = userdata.get('KAGGLE_USERNAME')
KAGGLE_KEY = userdata.get('KAGGLE_KEY')

# Create kaggle.json from your secrets
os.makedirs(os.path.expanduser("~/.kaggle"), exist_ok=True)
with open(os.path.expanduser("~/.kaggle/kaggle.json"), "w") as f:
    json.dump({
        "username": KAGGLE_USERNAME,
        "key": KAGGLE_KEY
    }, f)

!chmod 600 ~/.kaggle/kaggle.json


In [3]:
!kaggle datasets download -d nltkdata/sentence-polarity

Dataset URL: https://www.kaggle.com/datasets/nltkdata/sentence-polarity
License(s): other
Downloading sentence-polarity.zip to /content
  0% 0.00/488k [00:00<?, ?B/s]
100% 488k/488k [00:00<00:00, 617MB/s]


In [4]:
!unzip sentence-polarity.zip

Archive:  sentence-polarity.zip
  inflating: sentence_polarity/README.txt  
  inflating: sentence_polarity/rt-polarity.neg  
  inflating: sentence_polarity/rt-polarity.pos  


### Read Datafiles

In [9]:
# Read the positive and negative sentiment files (one sentence per line)
df_sent_pos = pd.read_csv('sentence_polarity/rt-polarity.pos',
                          sep='\t', header=None)  # Positive sentiment sentences
df_sent_neg = pd.read_csv('sentence_polarity/rt-polarity.neg',
                          sep='\t', header=None)  # Negative sentiment sentences

print(df_sent_pos.shape)
print(df_sent_neg.shape)
df_sent_pos.head()

(5331, 1)
(5331, 1)


Unnamed: 0,0
0,the rock is destined to be the 21st century's ...
1,"the gorgeously elaborate continuation of "" the..."
2,effective but too-tepid biopic
3,if you sometimes like to go to the movies to h...
4,"emerges as something rare , an issue movie tha..."


In [10]:
# Rename the column to 'sentence'
df_sent_pos.rename(columns={0: "sentence"}, inplace=True)
df_sent_neg.rename(columns={0: "sentence"}, inplace=True)
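
Because each file holds one sentence per line, the files can also be read without a CSV parser, which avoids any special handling of tabs or quote characters. The following is a minimal sketch, not the notebook's method: `read_sentences` is a hypothetical helper, and the `latin-1` default is an assumption about the file encoding that should be checked against the actual data. The demo writes a throwaway file so the snippet runs on its own.

```python
import os
import tempfile

def read_sentences(path, encoding="latin-1"):
    """Return one stripped sentence per non-empty line of a text file.

    The default encoding is an assumption; adjust it to match the data.
    """
    with open(path, encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]

# Demo on a temporary file standing in for rt-polarity.pos
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pos")
    with open(demo_path, "w", encoding="latin-1") as f:
        f.write('effective but too-tepid biopic\n" quoted " start , then more text\n')
    demo_sentences = read_sentences(demo_path)

print(len(demo_sentences))  # 2
```

The resulting list can be wrapped with `pd.DataFrame({"sentence": demo_sentences})` if a DataFrame is still wanted downstream.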

# Data Preprocessing

In [11]:
# Define the preprocessing function
def preprocess_text(sentences):
    # Convert each sentence to lowercase
    sentences = [sentence.lower() for sentence in sentences]

    # Remove punctuation using regex
    sentences = [re.sub(r"[^\w\s]", "", sentence) for sentence in sentences]

    # Remove extra whitespace between words
    sentences = [" ".join(sentence.split()) for sentence in sentences]

    # Tokenize sentences into words
    sentences = [word_tokenize(sentence) for sentence in sentences]

    # Remove stop words
    stop_words = set(stopwords.words('english'))  # Load English stop words
    filtered_sentences = []
    for sentence in sentences:
        filtered_sentence = [word for word in sentence if word not in stop_words]
        filtered_sentences.append(filtered_sentence)

    # Lemmatize words
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentences = []
    for sentence in filtered_sentences:
        lemmatized_sentence = [lemmatizer.lemmatize(word) for word in sentence]
        lemmatized_sentences.append(lemmatized_sentence)

    return [' '.join(sentence) for sentence in lemmatized_sentences]
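
The first three steps of `preprocess_text` rely only on `str` methods and `re`, so their effect can be checked without any NLTK data. The sketch below uses a hypothetical helper, `clean_sentence`, which is not part of the notebook, and a made-up sample sentence.

```python
import re

def clean_sentence(sentence):
    """Hypothetical helper mirroring the first three steps of preprocess_text."""
    sentence = sentence.lower()                  # lowercase
    sentence = re.sub(r"[^\w\s]", "", sentence)  # drop punctuation
    return " ".join(sentence.split())            # collapse extra whitespace

sample = "Effective ,   but too-tepid biopic !"
cleaned = clean_sentence(sample)
print(cleaned)  # effective but tootepid biopic
```

Note that the hyphen is deleted rather than replaced with a space, so "too-tepid" becomes the single token "tootepid". If hyphenated words should be split instead, substituting a space in the `re.sub` call would be one option.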

In [12]:
# Preprocess the sentences
pos_preprocessed_sentences = preprocess_text(df_sent_pos['sentence'])
neg_preprocessed_sentences = preprocess_text(df_sent_neg['sentence'])

# Print the first preprocessed negative sentence
print(neg_preprocessed_sentences[0])

simplistic silly tedious


In [14]:
# Combine preprocessed positive and negative sentences
sentences = pos_preprocessed_sentences + neg_preprocessed_sentences

In [15]:
# Create a list for all labels
polarities = []
polarities.extend([1] * len(df_sent_pos))  # Label positive sentences as 1
polarities.extend([0] * len(df_sent_neg))  # Label negative sentences as 0

In [18]:
# Combine sentences and labels into a single list
combined = list(zip(sentences, polarities))

# Shuffle the combined list
random.shuffle(combined)

# Split the shuffled list back into sentences and labels
sentences[:], polarities[:] = zip(*combined)
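
The shuffle above draws from the global `random` state without a seed, so the resulting order (and therefore the train-test split) changes from run to run. A minimal sketch of a reproducible variant follows, assuming a fixed seed is acceptable; the helper name `shuffle_pairs` and the seed value 42 are illustrative and not part of the notebook.

```python
import random

def shuffle_pairs(sentences, labels, seed=42):
    """Shuffle sentences and labels together, keeping each pair aligned."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    combined = list(zip(sentences, labels))
    rng.shuffle(combined)
    shuffled_sentences, shuffled_labels = zip(*combined)
    return list(shuffled_sentences), list(shuffled_labels)

demo_sentences = ["good film", "bad film", "great cast", "dull plot"]
demo_labels = [1, 0, 1, 0]

s1, l1 = shuffle_pairs(demo_sentences, demo_labels)
s2, l2 = shuffle_pairs(demo_sentences, demo_labels)
print(s1 == s2 and l1 == l2)  # True: same seed, same order
```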

# Train-Test Split

In [19]:
# Define train-test split ratio
train_test_ratio = 0.8

# Calculate the size of the training set
train_set_size = int(train_test_ratio * len(sentences))

# Split data into training and test sets
X_train, X_test = sentences[:train_set_size], sentences[train_set_size:]
y_train, y_test = polarities[:train_set_size], polarities[train_set_size:]

# Print sizes of training and test sets
print("Size of training set:", len(X_train))
print("Size of test set:", len(X_test))

Size of training set: 8529
Size of test set: 2133
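
These sizes follow from the arithmetic: 5331 + 5331 = 10662 sentences, `int(0.8 * 10662)` truncates 8529.6 to 8529, and the remaining 2133 go to the test set. Since the split is a plain slice of the shuffled list, the two classes are only approximately balanced in each part. The sketch below reproduces the sizes and shows one way to inspect the class counts; `split_by_ratio` is a hypothetical helper, and the alternating label list is a stand-in for the real shuffled labels.

```python
from collections import Counter

def split_by_ratio(items, ratio):
    """Slice a list into a leading part of size int(ratio * len) and the rest."""
    cut = int(ratio * len(items))
    return items[:cut], items[cut:]

# Stand-in for the shuffled polarities: 5331 positives and 5331 negatives
labels = [1, 0] * 5331

train, test = split_by_ratio(labels, 0.8)
print(len(train), len(test))  # 8529 2133

# Class counts per split; on real data these depend on the shuffle
print(Counter(train))
print(Counter(test))
```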
