Import Libraries and Download Resources

In [19]:
# Libraries
import nltk
import random
import numpy as np 
import pandas as pd 

# ML Libraries
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Movie reviews dataset
from nltk.corpus import movie_reviews
nltk.download("movie_reviews")

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Nikolai\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

## Implement Basic Text Classification

Load and Prepare Movie Reviews Dataset

In [20]:
# Load and Shuffle Movie Reviews Dataset
# Each document is a tuple of (words, category)
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

# Prepare Data
doc_text = [" ".join(doc) for doc, _ in documents] # Extract and join words
doc_labels = [label for _, label in documents] # Extract labels

# Create DataFrame for easier manipulation
review_df = pd.DataFrame({
    "text": doc_text,
    "label": doc_labels
})

In [21]:
# Display Dataset Overview and Sample Reviews
print(f"Dataset Overview:\n{review_df['label'].value_counts()}") # Single quotes inside the f-string keep this valid before Python 3.12
print(f"\nSample Reviews:\n{review_df.head()}")

Dataset Overview:
label
pos    1000
neg    1000
Name: count, dtype: int64

Sample Reviews:
                                                text label
0  i remember hearing about this film when it fir...   pos
1  the keen wisdom of an elderly bank robber , th...   pos
2  wild things is a way to steam up an otherwise ...   neg
3  star wars : ? episode i -- the phantom menace ...   neg
4  weir is well - respected in hollywood for turn...   pos


Convert Reviews into Suitable ML Format

In [22]:
# Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(
    review_df["text"], # Extract text for training
    review_df["label"], # Extract labels for training
    test_size = 0.2, # 20% of data for testing
    random_state = 42 # Ensure reproducibility
)

Use CountVectorizer to Convert Text to Features

In [23]:
# Initialize CountVectorizer
count_vectorizer = CountVectorizer(max_features = 5000, stop_words = "english") # Convert text to features

# Fit and transform training data, then transform test data
X_train_counts = count_vectorizer.fit_transform(X_train) # Fit and transform training data
X_test_counts = count_vectorizer.transform(X_test) # Transform test data using the same vectorizer

In [24]:
# Print number of features and the first 10 feature names (sorted alphabetically, not ranked by frequency)
print(f"Number of features:\n{len(count_vectorizer.get_feature_names_out())}")
print(f"\nTop features:\n{count_vectorizer.get_feature_names_out()[:10]}")

Number of features:
5000

Top features:
['000' '10' '100' '11' '12' '13' '13th' '14' '15' '16']


Train Naive Bayes Classifier and Evaluate its Performance

In [25]:
nb_clf = MultinomialNB() # Initialize Naive Bayes classifier
nb_clf.fit(X_train_counts, y_train) # Train the classifier
y_pred = nb_clf.predict(X_test_counts) # Predict on test data

In [26]:
# Print detailed performance metrics and overall accuracy
print(f"Classification Report: \n{classification_report(y_test, y_pred)}")
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.4f}")

Classification Report: 
              precision    recall  f1-score   support

         neg       0.77      0.86      0.81       189
         pos       0.86      0.76      0.81       211

    accuracy                           0.81       400
   macro avg       0.81      0.81      0.81       400
weighted avg       0.82      0.81      0.81       400


Accuracy: 0.8100


## Understanding CountVectorizer

1. Explain the role of CountVectorizer in text classification. How does it prepare text data for machine learning? 
- CountVectorizer is a feature extraction tool that transforms raw text documents into a numerical format that machine learning models can work with. It first tokenizes the text, lowercasing it and breaking it into individual words, and builds a vocabulary from the training documents. It then counts how often each vocabulary word occurs in a document, producing a "bag-of-words" representation in which word order is discarded. Each document becomes a numerical vector whose values are the raw counts of each vocabulary word. In this notebook the vocabulary is limited to the 5,000 most frequent terms after English stop words are removed.
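
The bag-of-words idea described above can be sketched in plain Python. This is an illustration only, not scikit-learn's implementation: the helper `bag_of_words` and the two toy documents are hypothetical, and the regular expression is chosen to mirror CountVectorizer's default token pattern (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of raw text documents."""
    # Lowercase, then keep tokens made of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # Vocabulary: every distinct token, sorted alphabetically.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # One position per vocabulary term, holding its raw count.
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

# Hypothetical toy documents
toy_docs = ["A great, great film", "A dull film"]
vocab, vectors = bag_of_words(toy_docs)
print(vocab)    # ['dull', 'film', 'great']
print(vectors)  # [[0, 1, 2], [1, 1, 0]]
```

The single-character token "a" is dropped by the token pattern, and the vocabulary is sorted alphabetically, which is also why the notebook's feature list begins with numeric tokens such as '000' and '10'.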

2. What is the purpose of train_test_split? Why is it important in machine learning evaluation? 

- The train_test_split function divides a dataset into separate training and testing subsets. Doing so is essential for evaluating a model's ability to generalize to new, unseen data. If you trained and tested a model on the same data, it might simply memorize the examples, which leads to overfitting and overly optimistic performance metrics. By training the model on one portion of the data and evaluating it on a separate portion, train_test_split provides a more realistic and reliable estimate of how the model will perform in real-world scenarios.
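
To make the memorization risk concrete, here is a minimal sketch using only the standard library. Everything in it is hypothetical: the 200 synthetic "reviews" have randomly assigned labels, so there is nothing real to learn, and the "model" is just a lookup table of its training examples.

```python
import random

rng = random.Random(0)

# Hypothetical data: unique texts with labels assigned at random.
data = [(f"review {i}", rng.choice(["pos", "neg"])) for i in range(200)]

# Hold out 20% of the shuffled data, as the notebook does with test_size = 0.2.
rng.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# A "model" that only memorizes its training examples.
memory = dict(train)

def predict(text):
    # Unseen text falls back to a fixed guess.
    return memory.get(text, "pos")

def accuracy(rows):
    return sum(predict(text) == label for text, label in rows) / len(rows)

print(f"train accuracy: {accuracy(train):.2f}")
print(f"test accuracy:  {accuracy(test):.2f}")
```

Scored on its own training data, the lookup table looks perfect even though it has learned nothing; only the held-out score reveals that.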

3. Analyze the accuracy calculation in the Naive Bayes classifier. What does the score tell us about model performance?
- The accuracy score of the Naive Bayes classifier measures the proportion of test reviews that were labeled correctly. The full dataset is balanced (1,000 positive and 1,000 negative reviews) and the test split is close to balanced (189 negative, 211 positive), so accuracy is a reasonable summary of overall performance here. An accuracy of 81% means the model correctly classified 324 of the 400 test reviews. The per-class metrics show that the errors are not symmetric, however: recall is 0.86 for negative reviews but 0.76 for positive reviews, so the model misses positive reviews more often than negative ones.
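
The report's figures can be cross-checked by hand. The confusion counts below are not printed by the notebook; they are inferred here as integer counts consistent with the rounded precision, recall, and support values in the Naive Bayes report, so treat them as an assumption.

```python
# Inferred confusion counts (an assumption, not notebook output).
neg_correct, neg_as_pos = 163, 26   # 189 reviews whose true label is neg
pos_correct, pos_as_neg = 161, 50   # 211 reviews whose true label is pos

total = neg_correct + neg_as_pos + pos_correct + pos_as_neg

# Accuracy: correct predictions over all predictions.
accuracy = (neg_correct + pos_correct) / total

# Recall: correct predictions for a class over all true members of that class.
recall_neg = neg_correct / (neg_correct + neg_as_pos)
recall_pos = pos_correct / (pos_correct + pos_as_neg)

# Precision: correct predictions for a class over everything predicted as that class.
precision_neg = neg_correct / (neg_correct + pos_as_neg)
precision_pos = pos_correct / (pos_correct + neg_as_pos)

print(f"accuracy:  {accuracy:.4f}")
print(f"recall:    neg={recall_neg:.2f} pos={recall_pos:.2f}")
print(f"precision: neg={precision_neg:.2f} pos={precision_pos:.2f}")
```

Under these counts, roughly twice as many positive reviews are mislabeled as negative (50) as the reverse (26), which is what the gap between the two recall values reflects.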

## Advanced Feature Engineering

Modify Code to use TfidfVectorizer

In [27]:
# Initialize TF-IDF Vectorizer
tf_count_vectorizer = TfidfVectorizer(
    max_features = 5000, # Limit to 5000 features
    min_df = 5, # Minimum document frequency of 5
    max_df = 0.7, # Maximum document frequency of 70%
    stop_words = "english" # Remove common English stop words
)

# Fit and transform training data, then transform test data
X_train_tfidf = tf_count_vectorizer.fit_transform(X_train) # Fit and transform training data
X_test_tfidf = tf_count_vectorizer.transform(X_test) # Transform test data using the same vectorizer

In [28]:
# Print number of features and the first 10 feature names (sorted alphabetically, not ranked by TF-IDF weight)
print(f"Number of features:\n{len(tf_count_vectorizer.get_feature_names_out())}")
print(f"\nTop features:\n{tf_count_vectorizer.get_feature_names_out()[:10]}")

Number of features:
5000

Top features:
['000' '10' '100' '101' '11' '12' '13' '13th' '14' '15']


In [29]:
# Train Naive Bayes Classifier and evaluate its performance
nb_clf_tfidf = MultinomialNB() # Initialize Naive Bayes classifier
nb_clf_tfidf.fit(X_train_tfidf, y_train) # Train the classifier
y_pred_tfidf = nb_clf_tfidf.predict(X_test_tfidf) # Predict on test data

In [30]:
# Print detailed performance metrics and overall accuracy
print(f"Classification Report: \n{classification_report(y_test, y_pred_tfidf)}")
print(f"\nAccuracy: {accuracy_score(y_test, y_pred_tfidf):.4f}")

Classification Report: 
              precision    recall  f1-score   support

         neg       0.75      0.87      0.81       189
         pos       0.86      0.74      0.80       211

    accuracy                           0.80       400
   macro avg       0.81      0.81      0.80       400
weighted avg       0.81      0.80      0.80       400


Accuracy: 0.8025


Compare Accuracy Scores

The Naive Bayes classifier achieved slightly higher accuracy with CountVectorizer (81.00%) than with TfidfVectorizer (80.25%). On a 400-review test set that gap amounts to 3 reviews (324 vs. 321 correct), so it should not be over-interpreted from a single split. Both approaches yielded similar macro-averaged F1 scores (0.81 vs. 0.80), with CountVectorizer having a slight edge in recall for the positive class (0.76 vs. 0.74). While the difference is small, CountVectorizer showed marginally more balanced performance across both sentiment classes in this binary classification task.

Why One May Perform Better than the Other

- CountVectorizer may perform better in this instance because it retains the full weight of frequently occurring words, many of which are strong sentiment indicators. 

- TfidfVectorizer down-weights common words across the dataset. While this is useful for tasks like topic modeling to highlight unique words, it can inadvertently reduce the impact of these high-frequency sentiment indicators.

- Since the multinomial Naive Bayes algorithm models each class through its word counts, CountVectorizer's direct count-based approach aligns more closely with the model's underlying assumptions, which may explain the slightly better performance in this specific scenario. Note that the TfidfVectorizer here also sets min_df and max_df, so the two runs differ in vocabulary filtering as well as in weighting.
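
The down-weighting effect can be illustrated with the smoothed inverse document frequency formula that scikit-learn documents for its TF-IDF transformer (with the default smooth_idf=True). The helper name and the two document frequencies below are hypothetical, chosen only to contrast a common word with a rare one; 1,600 is the size of the 80% training split.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 1600  # training reviews (80% of 2,000)

# Hypothetical document frequencies, not measured from the corpus.
examples = [("common word", 1000), ("rare word", 10)]

for term, doc_freq in examples:
    print(f"{term}: appears in {doc_freq} docs, idf = {smoothed_idf(n_docs, doc_freq):.2f}")
```

A word that appears in most reviews receives an IDF close to 1, while a rare word receives a much larger multiplier. This sketch covers only the IDF factor; by default TfidfVectorizer also L2-normalizes each document vector.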

## Alternative Methods

Implement and compare an SVM classifier

Replace Naive Bayes with LinearSVC

In [31]:
# Train SVM with CountVectorizer features
svm_clf = LinearSVC(max_iter = 10000) # Initialize SVM classifier
svm_clf.fit(X_train_counts, y_train) # Train the classifier
y_pred_svm = svm_clf.predict(X_test_counts) # Predict on test data

In [32]:
# Print detailed performance metrics and overall accuracy
print(f"Classification Report: \n{classification_report(y_test, y_pred_svm)}")
print(f"\nAccuracy: {accuracy_score(y_test, y_pred_svm):.4f}")

Classification Report: 
              precision    recall  f1-score   support

         neg       0.81      0.85      0.83       189
         pos       0.86      0.82      0.84       211

    accuracy                           0.83       400
   macro avg       0.83      0.84      0.83       400
weighted avg       0.84      0.83      0.84       400


Accuracy: 0.8350


Compare Performance Metrics between Naive Bayes and SVM

- The SVM classifier achieved an accuracy of 83.5%, outperforming the Naive Bayes model's 81.0%.

- The SVM model demonstrated strong performance, achieving a balanced F1-score across both positive and negative sentiment classes.

- For negative reviews, the model had an F1-score of 0.83, while for positive reviews, it achieved an F1-score of 0.84. The macro-averaged F1-score of 0.83 further confirms the model's consistent and reliable performance across both classes.

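- Overall, the SVM model was more effective than the Naive Bayes classifier on this test split. One plausible reason is that LinearSVC learns its feature weights jointly to maximize the margin between classes, whereas Naive Bayes treats each word as independent evidence. The notebook does not test this explanation directly.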

Discuss Trade-offs

- The SVM had better accuracy and a slightly better balance between precision and recall, but SVMs generally take longer to train and may need careful hyperparameter tuning. For larger datasets, you may also need to increase the max_iter parameter (set to 10000 here) so that the solver converges.

- Naive Bayes trains quickly and is easier to interpret, which makes it well suited to rapid prototyping or settings with limited resources. Its simplicity and speed can be a significant advantage in many scenarios.

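- LinearSVC is still a linear model, but it learns all feature weights jointly rather than assuming that words are independent given the class, as Naive Bayes does. That flexibility could justify its extra computational cost when higher accuracy is needed.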

- Naive Bayes is more suited when speed and simplicity are required.
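
The hyperparameter tuning mentioned above could look something like the following sketch, which searches over the regularization strength C of LinearSVC with cross-validation. It is not part of the original notebook: the toy corpus is hypothetical and stands in for X_train and y_train, and the candidate values for C are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for X_train / y_train.
texts = ["great film", "wonderful acting", "loved it", "brilliant plot",
         "terrible film", "awful acting", "hated it", "boring plot"] * 5
labels = (["pos"] * 4 + ["neg"] * 4) * 5

# Putting the vectorizer inside the pipeline means it is refit on each
# training fold, so no vocabulary leaks in from the validation fold.
pipeline = make_pipeline(CountVectorizer(), LinearSVC(max_iter=10000))

# make_pipeline names each step after its lowercased class name.
param_grid = {"linearsvc__C": [0.01, 0.1, 1.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

On the real data, the same pattern would be fit on X_train and y_train, leaving the test set untouched until the final evaluation. The extra fits are part of the training cost that the trade-off discussion refers to.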