### Step 1: Importing Libraries

In [7]:
# Importing necessary libraries
import numpy as np  # for numerical operations
import pandas as pd  # for data manipulation
import random  # for shuffling the data
import nltk
import re  # for handling regular expressions

from nltk.stem import WordNetLemmatizer  # for lemmatizing words
from nltk.corpus import stopwords  # for stop word removal
from nltk.tokenize import word_tokenize  # for tokenizing sentences into words
nltk.download('punkt_tab')  # Downloads the 'punkt' tokenizer table used for tokenization of text into sentences or words

# Downloading necessary NLTK resources
nltk.download('stopwords')  # List of common stop words in English
nltk.download('punkt')  # Pre-trained tokenizer models
nltk.download('wordnet')  # WordNet lemmatizer dataset

# Libraries for text feature extraction and model training
from sklearn.feature_extraction.text import TfidfVectorizer  # Convert text into numerical features (TF-IDF)
from sklearn.linear_model import LogisticRegression  # Logistic regression for classification
from sklearn.svm import LinearSVC  # Support Vector Machines for classification

# Libraries for model evaluation
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix  # For model evaluation metrics
from sklearn.model_selection import KFold, cross_val_score  # For cross-validation

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/dianaterraza/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/dianaterraza/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/dianaterraza/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/dianaterraza/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Step 2: Load & Prepare Dataset

The Sentence Polarity dataset contains 5,331 positive and 5,331 negative sentences. We'll load this dataset and prepare it for analysis.


In [8]:
%pip install kagglehub -q

Note: you may need to restart the kernel to use updated packages.


In [9]:

import kagglehub

# Download the Sentence Polarity dataset and keep the returned local path
path = kagglehub.dataset_download('nltkdata/sentence-polarity')

print("Path to dataset files:", path)

Path to dataset files: /Users/dianaterraza/.cache/kagglehub/datasets/nltkdata/sentence-polarity/versions/1

### Step 3: Read the Data Files


In [10]:
# Read the positive and negative sentiment files
df_sent_pos = pd.read_csv('/Users/dianaterraza/Desktop/NLP/Data/sentence_polarity 2/rt-polarity.pos', sep='\t', header=None)  # Positive sentiment sentences
df_sent_neg = pd.read_csv('/Users/dianaterraza/Desktop/NLP/Data/sentence_polarity 2/rt-polarity.neg', sep='\t', header=None)  # Negative sentiment sentences

# Display the first few rows of the positive dataset to understand its structure
print(df_sent_pos.head())

                                                   0
0  the rock is destined to be the 21st century's ...
1  the gorgeously elaborate continuation of " the...
2                     effective but too-tepid biopic
3  if you sometimes like to go to the movies to h...
4  emerges as something rare , an issue movie tha...


### Step 4: Rename Columns

In [11]:
# Rename the column to 'sentence'
df_sent_pos.rename(columns={0: "sentence"}, inplace=True)
df_sent_neg.rename(columns={0: "sentence"}, inplace=True)

inplace=True ensures the changes are applied directly to the dataframes.

### Step 5: Data Preprocessing

The preprocessing function applies the following steps in order:

1. Converts text to lowercase.
2. Removes punctuation using regular expressions.
3. Removes extra whitespace.
4. Tokenizes sentences into words.
5. Removes stop words.
6. Lemmatizes words.

In [12]:
# Define the preprocessing function
def preprocess_text(sentences):
    # Convert all tokens to lowercase
    sentences = [sentence.lower() for sentence in sentences]

    # Remove punctuation using regex
    sentences = [re.sub(r"[^\w\s]", "", sentence) for sentence in sentences]

    # Remove extra whitespaces between words
    sentences = [" ".join(sentence.split()) for sentence in sentences]

    # Tokenize sentences into words
    sentences = [word_tokenize(sentence) for sentence in sentences]

    # Remove stop words
    stop_words = set(stopwords.words('english'))  # Load English stop words
    filtered_sentences = []
    for sentence in sentences:
        filtered_sentence = [word for word in sentence if word not in stop_words]
        filtered_sentences.append(filtered_sentence)

    # Lemmatize words
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentences = []
    for sentence in filtered_sentences:
        lemmatized_sentence = [lemmatizer.lemmatize(word) for word in sentence]
        lemmatized_sentences.append(lemmatized_sentence)

    return [' '.join(sentence) for sentence in lemmatized_sentences]

* Each preprocessing step cleans the text data and prepares it for model training.
* Stop words like "the", "is", and "and" are removed to focus on meaningful words.
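* Lemmatization reduces words to their base form (e.g., "movies" → "movie"). Because lemmatize() is called here without a part-of-speech tag, WordNetLemmatizer treats every word as a noun, so a verb form such as "running" is left unchanged unless pos="v" is passed.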
* Note what follows the return keyword: [' '.join(sentence) for sentence in lemmatized_sentences] joins the separate tokens of each sentence back into a single string. For example, ['simplistic', 'silly', 'tedious'] becomes simplistic silly tedious. In other words, each list of tokens is turned back into one sentence in the form of a string.
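
To make the first three steps and the final join concrete, here is a minimal sketch on a made-up sentence (invented for illustration, not taken from the dataset). It uses str.split() as a stand-in for word_tokenize so that it runs without any NLTK data; that substitution is an assumption of the sketch, not what the function above does.

```python
import re

# Hypothetical input sentence, invented for illustration.
raw = "Effective, but   TOO-tepid biopic!"

lowered = raw.lower()                        # step 1: lowercase
no_punct = re.sub(r"[^\w\s]", "", lowered)   # step 2: strip punctuation
collapsed = " ".join(no_punct.split())       # step 3: collapse extra whitespace

tokens = collapsed.split()    # stand-in for word_tokenize (assumption)
rejoined = " ".join(tokens)   # same join as in the return statement

print(tokens)
print(rejoined)
```

Note that stripping the hyphen fuses "too-tepid" into the single token "tootepid". This is a side effect of removing punctuation before tokenizing.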

### Step 6: Apply Preprocessing

In [13]:
# Preprocess the sentences
pos_preprocessed_sentences = preprocess_text(df_sent_pos['sentence'])
neg_preprocessed_sentences = preprocess_text(df_sent_neg['sentence'])

# Print the first preprocessed negative sentence
print(neg_preprocessed_sentences[0])

simplistic silly tedious


### Step 7: Combine Datasets

We merge the positive and negative sentences into a single list called sentences, with the positive sentences first.

In [14]:
# Combine preprocessed positive and negative sentences
sentences = pos_preprocessed_sentences + neg_preprocessed_sentences

### Step 8: Create Labels

Labels (also called targets) distinguish positive and negative sentences. Positive sentences are labeled as 1, and negative ones as 0.

In [15]:
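# Create a list for all labels, in the same order as the combined sentences list
polarities = []
polarities.extend([1] * len(df_sent_pos))  # Positive sentences come first: label 1
polarities.extend([0] * len(df_sent_neg))  # Negative sentences follow: label 0

The labels must be added in the same order in which the sentences were combined (positive first, then negative), so that each entry in polarities lines up with the corresponding entry in sentences. Equal lengths alone do not guarantee correct labels.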
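
One way to check the alignment between sentences and labels is to pair them up and inspect the result. The sketch below uses small toy lists (hypothetical data, not the real dataset) and the same "positives first" ordering as the combined list.

```python
# Toy stand-ins for the preprocessed lists (hypothetical data).
pos_sentences = ["rare issue movie", "gorgeously elaborate continuation"]
neg_sentences = ["simplistic silly tedious", "dull plot", "flat acting"]

# Positives come first in the combined list, so their labels must too.
toy_sentences = pos_sentences + neg_sentences
toy_polarities = [1] * len(pos_sentences) + [0] * len(neg_sentences)

# Equal length is necessary, but it does not prove the order is right.
assert len(toy_sentences) == len(toy_polarities)

# Pair each sentence with its label to inspect the alignment directly.
labeled = list(zip(toy_sentences, toy_polarities))
for sentence, label in labeled:
    print(label, sentence)
```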

### Step 9: Shuffle Data

In [16]:
# Combine sentences and labels into a single list
combined = list(zip(sentences, polarities))

# Shuffle the combined list
random.shuffle(combined)

# Split the shuffled list back into sentences and labels
sentences[:], polarities[:] = zip(*combined)
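
Because random.shuffle produces a different order on every run, the later train/test split and the reported scores will vary slightly between runs. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable (the seed value 42 and the toy lists are arbitrary choices for illustration):

```python
import random

# Toy data standing in for the real sentences and labels.
toy_sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
toy_polarities = [1, 1, 1, 0, 0, 0]
original_pairs = set(zip(toy_sentences, toy_polarities))

# A dedicated, seeded generator gives the same shuffle on every run.
rng = random.Random(42)
combined = list(zip(toy_sentences, toy_polarities))
rng.shuffle(combined)

# Unzip back into two lists; each sentence keeps its own label.
shuffled_sentences, shuffled_polarities = (list(t) for t in zip(*combined))

# Shuffling changes the order, never the sentence-label pairing.
assert set(zip(shuffled_sentences, shuffled_polarities)) == original_pairs
```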

### Step 10: Split Dataset

We’ll split the data into training and test sets, using 80% for training and 20% for testing.

In [17]:
# Define train-test split ratio
train_test_ratio = 0.8

# Calculate the size of the training set
train_set_size = int(train_test_ratio * len(sentences))

# Split data into training and test sets
X_train, X_test = sentences[:train_set_size], sentences[train_set_size:]
y_train, y_test = polarities[:train_set_size], polarities[train_set_size:]

# Print sizes of training and test sets
print("Size of training set:", len(X_train))
print("Size of test set:", len(X_test))

Size of training set: 8529
Size of test set: 2133
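
The sizes above follow directly from the arithmetic: 80% of 10,662 sentences is 8,529.6, which int() truncates to 8,529, leaving 2,133 for the test set. The sketch below reproduces that calculation and also shows, on a toy label list (hypothetical data), how a plain slice can be checked for class balance; split_at_ratio is a helper name introduced here for illustration.

```python
from collections import Counter

def split_at_ratio(items, ratio):
    """Split a list into a leading part of size int(ratio * n) and the rest."""
    cut = int(ratio * len(items))
    return items[:cut], items[cut:]

# Sizes for the full dataset: 5,331 positive + 5,331 negative sentences.
total = 5331 + 5331
train_size = int(0.8 * total)   # 8529.6 is truncated to 8529
test_size = total - train_size
print(train_size, test_size)

# Toy labels to show how class balance per split can be inspected.
toy_labels = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
train_labels, test_labels = split_at_ratio(toy_labels, 0.8)
print(Counter(train_labels), Counter(test_labels))
```

Since the split is a plain slice of shuffled data rather than a stratified split, the two classes are only approximately balanced within each part.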


## Vectorizing Texts, Training Models & Evaluating Their Performance

### 1. Transforming Text into Features

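Classifiers work on numbers, not raw text, so each sentence first has to be converted into a numerical feature vector.

TF-IDF stands for Term Frequency-Inverse Document Frequency. It weights a term by how often it occurs in a document and downweights terms that occur in many documents.

The TF-IDF Vectorizer is a widely used tool for this purpose. It transforms sentences into a sparse matrix where each row corresponds to a document (sentence) and each column represents a term (word or token).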

Here, we’ll use the TfidfVectorizer out of the box with its default configuration. By default, this includes:
* Only unigrams (individual terms/tokens) as features.
* A maximum document frequency of 1.0 (no terms are excluded based on frequency).
* L2 normalization applied to each resulting feature vector (row).
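
To see what these weights look like, here is a hand-rolled sketch of TF-IDF on three toy documents. It follows the smoothed formula that, to my understanding, scikit-learn uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The function name tfidf_rows and the whitespace tokenization are simplifications introduced here; the real TfidfVectorizer tokenizes differently and returns a sparse matrix.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)                        # raw term frequency
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm for w in weights])      # L2 normalization
    return vocab, rows

docs = ["good movie", "bad movie", "good good plot"]
vocab, rows = tfidf_rows(docs)

print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

In the second document, "bad" gets a higher weight than "movie" because "movie" appears in two of the three documents while "bad" appears in only one.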

In [18]:
# Import TF-IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the vectorizer with default parameters
tfidf_vectorizer = TfidfVectorizer()

# Transform the training data into a TF-IDF matrix
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Check the number of samples and features
num_samples, num_features = X_train_tfidf.shape
print("#Samples: {}, #Features: {}".format(num_samples, num_features))

#Samples: 8529, #Features: 16486


### 2. Training the Classifier

Logistic Regression is a simple yet effective algorithm for binary classification tasks, such as predicting sentiment polarity.

In [19]:
# Import the Logistic Regression model
from sklearn.linear_model import LogisticRegression

# Train the Logistic Regression classifier
logistic_regression_classifier = LogisticRegression().fit(X_train_tfidf, y_train)
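
Conceptually, a trained logistic regression model computes a weighted sum of the feature values and passes it through the sigmoid function to obtain the probability of the positive class. The sketch below illustrates that prediction step only; the weights, bias, and feature values are made up for illustration and are not taken from the trained model, and predict_proba_positive is a hypothetical helper name.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_positive(features, weights, bias):
    """Probability of class 1 for one feature vector."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

# Hypothetical learned parameters for a 3-term vocabulary.
weights = [2.0, -1.5, 0.3]
bias = -0.1

# Hypothetical TF-IDF-like feature values for one sentence.
features = [0.7, 0.0, 0.7]

p = predict_proba_positive(features, weights, bias)
label = 1 if p >= 0.5 else 0   # threshold at 0.5
print(round(p, 3), label)
```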

### 3. Evaluating the Classifier

After training the model, we need to assess its performance on unseen test data. Evaluation involves transforming the test data into the same TF-IDF format as the training data, making predictions, and calculating key metrics.

### 3.1 Transform Test Data

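The general rule is that .fit_transform() is used only on the training data, and .transform() on any other data. The vocabulary and IDF weights are learned from the training set alone; reusing them keeps the test features in the same columns as the training features and prevents information from the test set from leaking into the model.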
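
The difference between fitting and transforming can be illustrated with a toy bag-of-words vectorizer. ToyCountVectorizer is a hypothetical class written for this sketch, not part of scikit-learn; it only mimics the fit / transform / fit_transform pattern with plain word counts.

```python
class ToyCountVectorizer:
    """Tiny bag-of-words vectorizer, for illustration only."""

    def fit(self, docs):
        # Learn the vocabulary (column layout) from the given documents.
        terms = sorted({tok for doc in docs for tok in doc.split()})
        self.vocabulary_ = {term: i for i, term in enumerate(terms)}
        return self

    def transform(self, docs):
        # Count words using the already learned vocabulary.
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for tok in doc.split():
                idx = self.vocabulary_.get(tok)
                if idx is not None:   # words unseen during fit are ignored
                    row[idx] += 1
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)

train_docs = ["good movie", "bad movie"]
test_docs = ["good plot"]   # "plot" never appears in the training data

vectorizer = ToyCountVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learns 3 columns
test_matrix = vectorizer.transform(test_docs)        # reuses those columns

print(vectorizer.vocabulary_)
print(train_matrix)
print(test_matrix)
```

The test row has the same three columns as the training rows, and the unseen word "plot" simply contributes nothing.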

In [20]:
# Transform the test data into TF-IDF format
X_test_tfidf = tfidf_vectorizer.transform(X_test)

### 3.2 Predict Sentiment

In [21]:
# Predict polarities for the test data
y_pred = logistic_regression_classifier.predict(X_test_tfidf)

### 3.3 Generate Evaluation Report

We use the classification_report() function to generate precision, recall, and F1-score metrics for each class, as well as the overall accuracy.

In [22]:
# Import evaluation metrics
from sklearn.metrics import classification_report, accuracy_score

# Generate and display the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.75      0.75      0.75      1061
           1       0.75      0.75      0.75      1072

    accuracy                           0.75      2133
   macro avg       0.75      0.75      0.75      2133
weighted avg       0.75      0.75      0.75      2133



### 4. Interpreting the Results
Let’s break down the key metrics:
* Precision: Of the sentences the model assigned to a given class, how many actually belong to it? For example, the precision for class 0 (negative sentiment) is 0.75, meaning 75% of the sentences predicted as negative were actually negative.

* Recall: Of the sentences that actually belong to a given class, how many did the model find? For class 1 (positive sentiment), the recall is 0.75, meaning the model correctly identified 75% of all positive sentences.

* F1-score: This is the harmonic mean of precision and recall, providing a balanced measure. In this example, both classes have an F1-score of 0.75.
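
The three metrics can be computed by hand from the counts of true positives, false positives, and false negatives for a class. The sketch below does this on a small toy example (hypothetical labels and predictions, chosen so that the numbers come out at 0.75); class_metrics is a helper name introduced here, not a scikit-learn function.

```python
def class_metrics(y_true, y_pred, target):
    """Precision, recall, and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy ground truth and predictions (hypothetical data).
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 0, 1, 0]

for cls in (0, 1):
    print(cls, class_metrics(toy_true, toy_pred, cls))
```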

### Is 75% a Good F1-Score?
On a balanced binary task like this one, a classifier that guesses at random would achieve an F1-score of about 50%.
Our model’s F1-score of 75% is well above that baseline, so it has clearly learned something from the data. However, one in four test sentences is still misclassified, which leaves room for improvement.
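
The 50% baseline can be checked with a quick simulation: generate balanced labels, guess each one with a coin flip, and compute the F1-score for the positive class. This is a sketch under the assumption of perfectly balanced classes; the sample size and seed are arbitrary.

```python
import random

rng = random.Random(0)   # arbitrary seed for reproducibility
n = 10_000

y_true = [i % 2 for i in range(n)]               # perfectly balanced labels
y_rand = [rng.randint(0, 1) for _ in range(n)]   # coin-flip predictions

pairs = list(zip(y_true, y_rand))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_random = 2 * precision * recall / (precision + recall)

print(round(f1_random, 2))
```

The result should land close to 0.5, which is the reference point against which the model's 0.75 is compared.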