# TFIDF-Naive Bayes Model for Twitter Sentiment Analysis
Author: Marcus KWAN TH

Last updated: 2025-11-15

## 1. TF-IDF Embedding Implementation
### Vectorize the pre-processed data with TF-IDF and load it into PySpark DataFrames.
Libraries: Scikit-learn, PySpark

In [1]:
"""
These lines:
1. Prevent PySpark from using a different Python interpreter than the notebook kernel.
2. Add the project root to sys.path so that util.preprocessing can be imported;
   without this, the import fails when run from Jupyter Notebook.
"""

import sys, os
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

# Add the root folder to sys.path before importing util package
root_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
if root_path not in sys.path:
    sys.path.append(root_path)

from util.preprocessing import load_and_preprocess_data

In [2]:
# Import all necessary libraries for TF-IDF embedding
from sklearn.feature_extraction.text import TfidfVectorizer
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

In [None]:
"""
Before running the following code, zip the util folder by running
    zip -r util.zip util
from the root directory.
"""

# Initialize Spark Session
ss  = SparkSession.builder \
        .appName("Marcus TF-IDF Naive-Bayes Model") \
        .getOrCreate()

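# Add util.zip to the PySpark context so the util package is available to the workers.
# addPyFile returns None, so its result is not assigned to a variable.
ss.sparkContext.addPyFile("../util.zip")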

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/11/15 18:04:03 WARN Utils: Your hostname, Marcuss-MacBook-Air.local, resolves to a loopback address: 127.0.0.1; using 192.168.0.216 instead (on interface en0)
25/11/15 18:04:03 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

In [4]:
# Initialize variables:
col_name = "tf-idf vectors"

In [5]:
# Load and preprocess training and testing data from the util package
train_df = load_and_preprocess_data('../Twitter_data/traindata7.csv')
test_df = load_and_preprocess_data('../Twitter_data/testdata7.csv')

25/11/15 18:04:05 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [6]:
# Convert Spark DataFrames to Pandas for TF-IDF processing
train_pandas = train_df.toPandas()
test_pandas = test_df.toPandas()

                                                                                

In [7]:
# Extract documents and labels from training data
train_documents = train_pandas.iloc[:, 0].astype(str).tolist()
train_labels = train_pandas.iloc[:, 1].tolist()

# Extract documents and labels from testing data
test_documents = test_pandas.iloc[:, 0].astype(str).tolist()
test_labels = test_pandas.iloc[:, 1].tolist()

print(f"Number of training samples: {len(train_documents)}")
print(f"Number of testing samples: {len(test_documents)}\n")

print(f"Training 1: {train_documents[0]}")
print(f"Testing 2: {test_documents[0]}")

Number of training samples: 596
Number of testing samples: 397

Training 1: wishin i could go to home depot to buy shit to build shit
Testing 2: cold war black ops zombie be a damn hype!!!!!!


In [8]:
# Fit the TF-IDF vectorizer on the training data (learns the vocabulary and IDF weights)
vectorizer = TfidfVectorizer(min_df=4, max_df=0.95, ngram_range=(1, 2))
train_tfidf_matrix = vectorizer.fit_transform(train_documents)

# Transform the testing data with the vocabulary and IDF weights fitted on the training data
test_tfidf_matrix = vectorizer.transform(test_documents)
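
The fit/transform split above matters because the vocabulary and IDF weights are learned from the training documents only, and the test documents are then projected onto them. The following is a minimal pure-Python sketch of that idea, assuming whitespace-tokenized input and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 normalization, which are the `TfidfVectorizer` defaults. The helper names `fit_idf` and `transform_doc` and the toy documents are illustrative only; `min_df`, `max_df`, and n-grams are left out.

```python
import math
from collections import Counter

def fit_idf(tokenized_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

def transform_doc(tokens, idf):
    """Weight term counts by IDF and L2-normalize; unseen terms are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Toy data (hypothetical, not taken from the Twitter dataset)
train = [["good", "game"], ["bad", "game"], ["good", "day"]]
idf = fit_idf(train)

# "zombie" never occurred in training, so it contributes nothing to the test vector
test_vec = transform_doc(["good", "zombie", "game"], idf)
print(test_vec)
```

Because out-of-vocabulary terms are silently ignored, a short tweet made up mostly of unseen words ends up with a nearly empty vector, which is one plausible contributor to a train/test gap on a small corpus.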

In [9]:
# Convert sparse (doc-term) matrices to dense arrays
train_tfidf_dense = train_tfidf_matrix.toarray()
test_tfidf_dense = test_tfidf_matrix.toarray()

print(f"TF-IDF matrix shape for (1) Train: {train_tfidf_dense.shape}, (2) Test: {test_tfidf_dense.shape}")
print(f"Vocabulary size: {len(vectorizer.get_feature_names_out())}")

TF-IDF matrix shape for (1) Train: (596, 673), (2) Test: (397, 673)
Vocabulary size: 673


In [10]:
# Create PySpark DataFrames for training and testing data for later stages
train_spark_df = ss.createDataFrame(
    [(Vectors.dense(vec), int(lbl)) for vec, lbl in zip(train_tfidf_dense, train_labels)],
    [col_name, "label"]
)

test_spark_df = ss.createDataFrame(
    [(Vectors.dense(vec), int(lbl)) for vec, lbl in zip(test_tfidf_dense, test_labels)],
    [col_name, "label"]
)
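
The dense conversion above is manageable at 596 x 673, but most TF-IDF entries are zero, so a sparse representation may be preferable for larger vocabularies. The sketch below shows how one row of a CSR matrix (SciPy exposes the `indptr`, `indices`, and `data` arrays used here) maps to the `(size, indices, values)` triple that `pyspark.ml.linalg.Vectors.sparse` accepts. The helper name `csr_row_to_sparse_args` and the toy arrays are assumptions for illustration; plain lists stand in for the SciPy arrays so the snippet runs without extra dependencies.

```python
def csr_row_to_sparse_args(indptr, indices, data, row, n_cols):
    """Return (size, indices, values) for one row of a CSR matrix."""
    start, end = indptr[row], indptr[row + 1]
    # Sort by column index so the indices are strictly increasing
    pairs = sorted(zip(indices[start:end], data[start:end]))
    return n_cols, [int(i) for i, _ in pairs], [float(v) for _, v in pairs]

# Toy CSR arrays for a 2 x 5 matrix:
#   row 0 -> {1: 0.6, 3: 0.8}
#   row 1 -> {0: 1.0}
indptr = [0, 2, 3]
indices = [3, 1, 0]
data = [0.8, 0.6, 1.0]

size, idx, vals = csr_row_to_sparse_args(indptr, indices, data, 0, 5)
print(size, idx, vals)
# In the notebook this triple could be passed as Vectors.sparse(size, idx, vals)
```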

## 2. Naive Bayes Data Analytic Model Implementation
### Sentiment classification with a Naive Bayes model in PySpark.
Libraries: PySpark (Naive Bayes classification)

In [11]:
# Import necessary libraries for Naive Bayes classification
from pyspark.ml.classification import NaiveBayes

In [12]:
# Train the Naive Bayes model on training data
nb = NaiveBayes(featuresCol=col_name, labelCol="label", modelType="multinomial")
model = nb.fit(train_spark_df)

                                                                                

## 3. Simple Evaluation
### Examine the performance of the TF-IDF and Naive Bayes combination using training and testing loss and accuracy.
Libraries: PySpark (UDF and evaluation), NumPy

In [13]:
# Import necessary libraries for evaluation
import numpy as np
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [14]:
# Function to extract probability of the true class
def get_prob(probability, label):
    return float(probability[int(label)])
get_prob_udf = udf(get_prob, DoubleType())

In [15]:
"""
Loss Evaluation
"""
# Set selectExpr statement
statement = "mean(log(true_prob)) as log_loss"

# Training loss
train_predictions = model.transform(train_spark_df)
train_predictions = train_predictions.withColumn("true_prob", get_prob_udf(col("probability"), col("label")))
train_loss = -train_predictions.selectExpr(statement).collect()[0]["log_loss"]

# Testing loss
test_predictions = model.transform(test_spark_df)
test_predictions = test_predictions.withColumn("true_prob", get_prob_udf(col("probability"), col("label")))
test_loss = -test_predictions.selectExpr(statement).collect()[0]["log_loss"]
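
The loss computed above is the mean negative log of the probability assigned to the true class. As a cross-check outside Spark, the same quantity can be sketched in plain Python. The function name `mean_log_loss`, the clipping constant `eps`, and the toy probabilities are assumptions for illustration; the clipping only guards against taking the log of a zero probability.

```python
import math

def mean_log_loss(probabilities, labels, eps=1e-15):
    """Mean negative log-probability of the true class."""
    total = 0.0
    for probs, label in zip(probabilities, labels):
        # Clip so that a zero probability does not raise a math domain error
        p_true = min(max(probs[int(label)], eps), 1.0)
        total += -math.log(p_true)
    return total / len(labels)

# Toy example: two samples, three classes (hypothetical values)
probs = [[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]]
labels = [0, 1]
loss = mean_log_loss(probs, labels)
print(f"{loss:.4f}")
```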

25/11/15 18:04:11 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
                                                                                

In [16]:
"""
Accuracy Evaluation
"""
# Evaluate the accuracy using PySpark
accuracy_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy"
)

train_accuracy = accuracy_evaluator.evaluate(train_predictions)
test_accuracy = accuracy_evaluator.evaluate(test_predictions)

# Result
print(f"Training Loss: {train_loss:.4f}")
print(f"Testing Loss: {test_loss:.4f}\n")
print(f"Train set accuracy = {train_accuracy:.4f}")
print(f"Test set accuracy = {test_accuracy:.4f}")

Training Loss: 0.9433
Testing Loss: 1.2211

Train set accuracy = 0.7550
Test set accuracy = 0.5063
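
To put these numbers in context, it can help to compare them with trivial baselines: a classifier that outputs uniform probabilities over K classes has a log loss of ln(K), and one that always predicts the most frequent class reaches that class's share as accuracy. The sketch below assumes four classes, since predictions up to 3.0 appear in the output further down; the actual class count and label distribution are not stated in this notebook, so `n_classes` and the toy label list are assumptions.

```python
import math
from collections import Counter

def uniform_log_loss(n_classes):
    """Log loss of a model that assigns probability 1/K to every class."""
    return math.log(n_classes)

def majority_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

n_classes = 4        # assumption: labels 0..3
test_loss = 1.2211   # testing loss reported above
baseline = uniform_log_loss(n_classes)

print(f"Uniform-guess loss: {baseline:.4f}, model test loss: {test_loss:.4f}")
print(f"Majority baseline on toy labels: {majority_accuracy([0, 0, 1, 2]):.2f}")
```

Under that assumption the test loss sits below the uniform baseline, but not by a wide margin, which is consistent with the fairly flat probabilities in the sample predictions.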


## 4. Preliminary Result

### Control:
**1. TfidfVectorizer(min_df=2, max_df=0.95)**

**2. modelType="multinomial"**

### Recorded results (one change from the control at a time):
1. Base

Train loss: 0.9160, accuracy: 78.52%; Testing loss: 1.2395, accuracy: 47.61%.

2. min_df = 4

Train loss: 0.9732, accuracy: 73.32%; Testing loss: 1.2277, accuracy: 48.36%.

3. max_df = 0.75

Train loss: 0.9160, accuracy: 78.52%; Testing loss: 1.2395, accuracy: 47.61%.

4. ngram_range=(1,2)

Train loss: 0.8619, accuracy: 82.72%; Testing loss: 1.2345, accuracy: 48.61%.

5. modelType="complement"

Train loss: 1.0443, accuracy: 89.60%; Testing loss: 1.2878, accuracy: 47.61%.

### Current best result:
1. min_df = 4, ngram_range=(1,2)

Train loss: 0.9433, accuracy: 75.50%; Testing loss: 1.2211, accuracy: 50.63%.
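
The recorded runs can also be compared by their train/test accuracy gap, which gives a rough indication of overfitting. The snippet below simply restates the accuracies listed in this section in a dictionary and ranks them; the variable names are illustrative.

```python
# (train accuracy, test accuracy) copied from the results above
results = {
    "base": (0.7852, 0.4761),
    "min_df=4": (0.7332, 0.4836),
    "max_df=0.75": (0.7852, 0.4761),
    "ngram_range=(1,2)": (0.8272, 0.4861),
    'modelType="complement"': (0.8960, 0.4761),
    "min_df=4 + ngram_range=(1,2)": (0.7550, 0.5063),
}

# Gap between training and testing accuracy for each configuration
gaps = {name: round(train - test, 4) for name, (train, test) in results.items()}

best = max(results, key=lambda name: results[name][1])
smallest_gap = min(gaps, key=gaps.get)

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:32s} test acc = {results[name][1]:.4f}, gap = {gap:.4f}")
```

In these runs the configuration with the highest test accuracy also has the smallest gap, while the complement model has the largest gap without any test improvement over the base.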

In [17]:
# Show a few prediction results with probabilities
test_predictions.select(col_name, "label", "prediction", "probability").show(10)

+--------------------+-----+----------+--------------------+
|      tf-idf vectors|label|prediction|         probability|
+--------------------+-----+----------+--------------------+
|[0.0,0.0,0.0,0.0,...|    0|       0.0|[0.50812020215687...|
|[0.0,0.0,0.0,0.0,...|    0|       0.0|[0.30836712661241...|
|[0.0,0.0,0.0,0.0,...|    0|       0.0|[0.48166239626815...|
|[0.0,0.0,0.0,0.0,...|    0|       0.0|[0.37653776171522...|
|[0.0,0.0,0.0,0.0,...|    0|       0.0|[0.37741271967632...|
|[0.0,0.0,0.0,0.0,...|    0|       0.0|[0.31971914512349...|
|[0.0,0.0,0.0,0.0,...|    0|       0.0|[0.42827943363745...|
|[0.0,0.0,0.0,0.0,...|    0|       0.0|[0.39138085905693...|
|[0.0,0.0,0.0,0.0,...|    0|       3.0|[0.25094732599449...|
|[0.0,0.0,0.0,0.0,...|    0|       2.0|[0.29007920052226...|
+--------------------+-----+----------+--------------------+
only showing top 10 rows


## Appendix:
### Methods to Optimize TF-IDF and Naive Bayes Models

**TF-IDF Optimization:**
- **Tune Parameters:**
  - Adjust `max_features`, `min_df`, `max_df` in `TfidfVectorizer` to control vocabulary size and filter rare/common terms.
  - Try different `ngram_range` values (e.g., `(1,2)` for unigrams and bigrams).
  - Experiment with and without `stop_words='english'`.
- **Text Preprocessing:**
  - Normalize text (lowercase, remove punctuation, stemming/lemmatization).
  - Remove or correct misspellings and special characters.
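
As an illustration of the preprocessing ideas above, here is a small normalization sketch using only the standard library. Whether `util.preprocessing` already performs some of these steps is not shown in this notebook, so the function `normalize_tweet` and its rules are assumptions rather than the project's actual pipeline.

```python
import re

def normalize_tweet(text):
    """Lowercase, drop URLs, squeeze repeated characters, strip punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)   # "sooooo" -> "soo", "!!!!!!" -> "!!"
    text = re.sub(r"[^a-z0-9\s]", " ", text)     # remove punctuation and special characters
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace

# Sample taken from the first testing document printed earlier
cleaned = normalize_tweet("cold war black ops zombie be a damn hype!!!!!!")
print(cleaned)
```

Removing punctuation also discards signals such as repeated exclamation marks, which can carry sentiment, so it is worth checking the effect on accuracy rather than assuming it helps.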

**Naive Bayes Optimization:**
- **Model Type:**
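  - Try different `modelType` options: `multinomial`, `bernoulli`, `complement`. Note that `bernoulli` expects binary (0/1) features, so TF-IDF values would need to be binarized first.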
- **Class Imbalance:**
  - If classes are imbalanced, consider resampling or adjusting class weights.
- **Feature Selection:**
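  - Remove low-importance features (e.g., chi-squared selection). Dimensionality reduction such as PCA can produce negative feature values, which multinomial Naive Bayes does not accept, so it needs extra care here.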
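
For the class-imbalance point, one common option is to weight each sample inversely to its class frequency. The sketch below computes such weights in plain Python; the helper name `balanced_class_weights` and the toy labels are illustrative. In PySpark the resulting per-row weight could be stored in a column and passed to `NaiveBayes` through its `weightCol` parameter, which is not done in this notebook.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}

# Toy labels (hypothetical): class 0 is over-represented
toy_labels = [0, 0, 0, 1, 2, 2]
weights = balanced_class_weights(toy_labels)
row_weights = [weights[y] for y in toy_labels]  # one weight per training row
print(weights)
```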

**General Approaches:**
- **Cross-Validation:**
  - Use cross-validation to tune hyperparameters and avoid overfitting.
- **Ensemble Methods:**
  - Combine predictions from multiple models (e.g., voting, stacking).
- **Error Analysis:**
  - Analyze misclassified samples to improve preprocessing or feature engineering.
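
A simple starting point for error analysis is a confusion count per (label, prediction) pair together with per-class recall. The sketch below uses the ten rows shown in the prediction sample above (all labelled 0, with predictions 0.0 eight times, then 3.0 and 2.0); the function names are illustrative, and in the notebook the inputs would come from collecting the `label` and `prediction` columns.

```python
from collections import Counter

def confusion_counts(labels, predictions):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter((int(y), int(p)) for y, p in zip(labels, predictions))

def per_class_recall(labels, predictions):
    """Fraction of each true class that was predicted correctly."""
    counts = confusion_counts(labels, predictions)
    totals = Counter(int(y) for y in labels)
    return {c: counts[(c, c)] / totals[c] for c in totals}

# The ten rows displayed by test_predictions.show(10)
labels = [0] * 10
predictions = [0.0] * 8 + [3.0, 2.0]

counts = confusion_counts(labels, predictions)
recall = per_class_recall(labels, predictions)
print(counts)
print(recall)
```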

**Example: Tuning TF-IDF and Naive Bayes**
```python
# Example: Try bigrams and different min_df
vectorizer = TfidfVectorizer(ngram_range=(1,2), min_df=4, max_df=0.75)
train_tfidf_matrix = vectorizer.fit_transform(train_documents)
test_tfidf_matrix = vectorizer.transform(test_documents)

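# Example: Try different Naive Bayes model types (e.g., "multinomial" or "complement").
# Rebuild train_spark_df from the new TF-IDF matrix before fitting.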
nb = NaiveBayes(featuresCol=col_name, labelCol="label", modelType="multinomial")
model = nb.fit(train_spark_df)
```
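
For the cross-validation suggestion, the key detail with TF-IDF is that the vectorizer should be refit on the training part of every fold, so that validation documents never influence the vocabulary or the IDF weights. The following sketch only generates the fold indices in plain Python; the function name `k_fold_indices`, the seed, and `k=5` are assumptions, and the vectorizer and model fitting inside the loop are left as comments.

```python
import random

def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_indices, validation_indices) for k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# 596 is the number of training samples reported earlier
splits = list(k_fold_indices(596, k=5))
for fold_no, (train_idx, val_idx) in enumerate(splits, start=1):
    # Here one would fit TfidfVectorizer on the train_idx documents only,
    # transform the val_idx documents, train Naive Bayes, and record the score.
    print(f"Fold {fold_no}: {len(train_idx)} train / {len(val_idx)} validation")
```

With only 596 training samples, single train/test comparisons like those in section 4 can shift by a few points between splits, so averaging over folds gives a steadier basis for choosing `min_df` or `ngram_range`.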
