# TF-IDF with Naive Bayes Model (Document-Based)
### Implementation: scikit-learn's TF-IDF and **PySpark-based** Naive Bayes models
Author: Marcus KWAN TH

Last updated: 2025-11-17

## 1. Prerequisite: Data extraction from `util`

In [None]:
"""
The purpose of these lines is to:
1. Prevent PySpark from using a different Python interpreter.
2. Add the root path to sys.path so that the runtime can import util.preprocessing; otherwise an import error occurs.

Based on testing with Jupyter Notebook.
"""

import sys, os
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

# Add the root folder to sys.path before importing util package
root_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
if root_path not in sys.path:
    sys.path.append(root_path)

# Import Spark and the preprocessing helper from the util package
from pyspark.sql import SparkSession
from util.preprocessing import load_and_preprocess_data

In [None]:
"""
Before running the following code, zip the util folder using these commands:
1. cd (to the root directory)
2. zip -r util.zip util

This will create a util.zip file in the root directory.
This step is necessary for PySpark to access the util package during distributed processing.
"""

# Initialize Spark Session
ss  = SparkSession.builder \
        .appName("Marcus TF-IDF Naive-Bayes Model") \
        .getOrCreate()

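# Add util.zip to the PySpark context so executors can import the util package.
# addPyFile() returns None, so its result is not assigned to a variable.
ss.sparkContext.addPyFile("../util.zip")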

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/11/17 14:06:11 WARN Utils: Your hostname, Marcuss-MacBook-Air.local, resolves to a loopback address: 127.0.0.1; using 10.11.97.189 instead (on interface en0)
25/11/17 14:06:11 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/17 14:06:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/11/17 14:06:12 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/11/17 14:06:12 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


In [3]:
# Initialize variables:
col_name = "tf-idf vectors"
col_label = "label"

In [4]:
# Load and preprocess training and testing data from the util package
# Then convert to Pandas DataFrame for TF-IDF vectorization
train_df = load_and_preprocess_data('../Twitter_data/traindata7.csv').toPandas()
test_df = load_and_preprocess_data('../Twitter_data/testdata7.csv').toPandas()

25/11/17 14:06:13 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.
                                                                                

In [5]:
# Extract documents and labels from training data
train_documents = train_df.iloc[:, 0].astype(str).tolist()
train_labels = train_df.iloc[:, 1].tolist()

# Extract documents and labels from testing data
test_documents = test_df.iloc[:, 0].astype(str).tolist()
test_labels = test_df.iloc[:, 1].tolist()

print(f"Number of training samples: {len(train_documents)}")
print(f"Number of testing samples: {len(test_documents)}\n")

print(f"Training 1: {train_documents[0]}")
print(f"Testing 2: {test_documents[0]}")

Number of training samples: 596
Number of testing samples: 397

Training 1: wishin i could go to home depot to buy shit to build shit
Testing 2: cold war black ops zombie be a damn hype!!!!!!


## 2. TF-IDF Embedding Implementation
### Vectorize the pre-processed data by documents.
Libraries: Scikit-learn

In [6]:
# Import the libraries needed for document vectorization
from pyspark.ml.linalg import Vectors
from sklearn.feature_extraction.text import TfidfVectorizer

In [7]:
# Apply TF-IDF Vectorizer fit on training data
vectorizer = TfidfVectorizer(min_df=4, max_df=0.95, ngram_range=(1,2), use_idf=True)
train_tfidf_matrix = vectorizer.fit_transform(train_documents)

# Apply TF-IDF Vectorizer on testing data
test_tfidf_matrix = vectorizer.transform(test_documents)

#### In a nutshell,

- **`fit()`**: Learns the vocabulary and IDF weights from the **training data** and returns the fitted vectorizer itself (`sklearn.feature_extraction.text.TfidfVectorizer`).

- **`transform()`**: Uses the already fitted vectorizer to transform **validation/test data** (returns a SciPy sparse CSR matrix, `scipy.sparse.csr_matrix`).

- **`fit_transform()`**: Fits on and transforms the **training data** in one step, equivalent to `fit()` followed by `transform()` (returns a SciPy sparse CSR matrix).

Source: https://stackoverflow.com/questions/53027864/what-is-the-difference-between-tfidfvectorizer-fit-transfrom-and-tfidf-transform
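
To make the distinction concrete, the following is a minimal pure-Python sketch of what fitting learns (a vocabulary and per-term IDF weights) and what transforming then applies. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which corresponds to scikit-learn's documented defaults. The function names `fit_idf` and `transform_tfidf` are hypothetical, and tokenisation is a plain whitespace split rather than the vectorizer's own tokenizer.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    vocab = sorted(doc_freq)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    return vocab, idf

def transform_tfidf(docs, vocab, idf):
    """Apply the learned weights; terms unseen during fitting are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row] if norm > 0 else row)
    return rows

# Toy documents, for illustration only
train = ["good game", "bad game", "good good day"]
test = ["good movie"]  # "movie" never appeared in training

vocab, idf = fit_idf(train)                      # fit: training data only
train_rows = transform_tfidf(train, vocab, idf)  # fit, then transform
test_rows = transform_tfidf(test, vocab, idf)    # transform: reuse learned weights
```

The test row has exactly as many columns as the training rows, and the unseen word "movie" contributes nothing. This is why the vectorizer is fitted on the training data only and then reused for the test data.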

## 3. Save the TF-IDF-vectorized documents to DataFrames.
Libraries: PySpark

In [8]:
# Convert sparse (doc-term) matrices to dense arrays
train_tfidf_dense = train_tfidf_matrix.toarray()
test_tfidf_dense = test_tfidf_matrix.toarray()

# Check the shape of the resulting TF-IDF matrices
print(f"TF-IDF matrix shape for (1) Train: {train_tfidf_dense.shape}, (2) Test: {test_tfidf_dense.shape}")
print(f"Dimension size: {len(vectorizer.get_feature_names_out())}")

TF-IDF matrix shape for (1) Train: (596, 673), (2) Test: (397, 673)
Dimension size: 673


In [9]:
# Create PySpark DataFrames for training and testing data for later stages
train_spark_df = ss.createDataFrame(
    [(Vectors.dense(vec), int(lbl)) for vec, lbl in zip(train_tfidf_dense, train_labels)],
    [col_name, col_label]
)

test_spark_df = ss.createDataFrame(
    [(Vectors.dense(vec), int(lbl)) for vec, lbl in zip(test_tfidf_dense, test_labels)],
    [col_name, col_label]
)
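
The cells above materialise every row as a dense vector, which is fine at this size (673 columns). Since most entries are zero, a sparse representation is an alternative for larger vocabularies. The sketch below, using a hypothetical helper `dense_row_to_sparse_parts`, shows the three parts a sparse vector needs; the example row is made up for illustration.

```python
def dense_row_to_sparse_parts(row):
    """Split a dense row into (size, indices, values), keeping only non-zero entries."""
    indices = [i for i, v in enumerate(row) if v != 0.0]
    values = [float(row[i]) for i in indices]
    return len(row), indices, values

# Hypothetical TF-IDF row, for illustration only
row = [0.0, 0.0, 0.62, 0.0, 0.78, 0.0]
size, indices, values = dense_row_to_sparse_parts(row)

# With PySpark available, these parts could be passed as
# Vectors.sparse(size, indices, values) in place of Vectors.dense(row).
```

The indices are produced in increasing order, which is what `Vectors.sparse` expects.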

In [10]:
print("Sample training data:")
train_spark_df.show(5, truncate=50)

Sample training data:
+--------------------------------------------------+-----+
|                                    tf-idf vectors|label|
+--------------------------------------------------+-----+
|[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0....|    0|
|[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0....|    0|
|[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0....|    0|
|[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0....|    0|
|[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0....|    0|
+--------------------------------------------------+-----+
only showing top 5 rows


## 4. Naive Bayes Data Analytic Model Implementation
### Sentiment classification with a Naive Bayes model in PySpark.
Libraries: PySpark (Naive Bayes classification)

In [11]:
# Import necessary libraries for Naive Bayes classification
from pyspark.ml.classification import NaiveBayes

In [12]:
# Train the Naive Bayes model on training data
nb = NaiveBayes(featuresCol=col_name, labelCol=col_label, modelType="multinomial", smoothing=1.0)
model = nb.fit(train_spark_df)

#### Note:
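- `bernoulli` **cannot** be used as the `modelType` here: it expects binary (0/1) feature values, whereas TF-IDF features are continuous weights. (It is not restricted to binary classification.)
- Varying `smoothing` over the tested range [0.0, 1.0] did not appear to improve the model's performance.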
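
As a rough illustration of what the `smoothing` parameter does, here is a pure-Python sketch of the textbook multinomial Naive Bayes estimate with additive smoothing. It is not Spark's internal code; the function names and the toy data are hypothetical. Smoothing adds a constant to every per-class feature total so that a term never seen in a class does not receive zero probability.

```python
import math

def train_multinomial_nb(rows, labels, num_classes, smoothing=1.0):
    """Return (log_priors, log_likelihoods) for non-negative feature rows."""
    num_features = len(rows[0])
    class_counts = [0] * num_classes
    feature_sums = [[0.0] * num_features for _ in range(num_classes)]
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for j, value in enumerate(row):
            feature_sums[label][j] += value
    total = len(rows)
    log_priors = [
        math.log((class_counts[c] + smoothing) / (total + smoothing * num_classes))
        for c in range(num_classes)
    ]
    log_likelihoods = []
    for c in range(num_classes):
        denom = sum(feature_sums[c]) + smoothing * num_features
        log_likelihoods.append(
            [math.log((feature_sums[c][j] + smoothing) / denom) for j in range(num_features)]
        )
    return log_priors, log_likelihoods

def predict_proba(row, log_priors, log_likelihoods):
    """Class probabilities via a numerically stable softmax over log scores."""
    scores = [
        prior + sum(v * w for v, w in zip(row, weights))
        for prior, weights in zip(log_priors, log_likelihoods)
    ]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Toy TF-IDF-like rows (hypothetical): feature 0 is typical of class 0
rows = [[0.9, 0.1, 0.0], [0.8, 0.0, 0.2], [0.0, 0.7, 0.3], [0.1, 0.9, 0.0]]
labels = [0, 0, 1, 1]
priors, likelihoods = train_multinomial_nb(rows, labels, num_classes=2)
probs = predict_proba([1.0, 0.0, 0.0], priors, likelihoods)
```

Because the likelihoods are logarithms of smoothed feature totals, the inputs must be non-negative, which TF-IDF weights are.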

## 5. Simple Evaluation
### Examine the performance of the TF-IDF + Naive Bayes combination using training and testing log loss and accuracy.
Libraries: PySpark

In [13]:
# Import necessary libraries for evaluation
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [14]:
# Function to extract the probability assigned to the true class of each sample
def get_prob(probability, label):
    return float(probability[int(label)])
get_prob_udf = udf(get_prob, DoubleType())

In [15]:
# Training predictions
train_predictions = model.transform(train_spark_df)
train_predictions = train_predictions.withColumn(
    "true_prob", get_prob_udf(col("probability"), col(col_label))
)

# Testing predictions
test_predictions = model.transform(test_spark_df)
test_predictions = test_predictions.withColumn(
    "true_prob", get_prob_udf(col("probability"), col(col_label))
)

In [16]:
""" 
Accuracy and loss Evaluation
"""
# Evaluators of the accuracy and loss using PySpark
accuracy_evaluator = MulticlassClassificationEvaluator(
    labelCol=col_label, predictionCol="prediction", metricName="accuracy"
)
loss_evaluator = MulticlassClassificationEvaluator(
    labelCol=col_label, predictionCol="prediction", metricName="logLoss"
)

train_loss = loss_evaluator.evaluate(train_predictions)
test_loss = loss_evaluator.evaluate(test_predictions)
train_accuracy = accuracy_evaluator.evaluate(train_predictions)
test_accuracy = accuracy_evaluator.evaluate(test_predictions)

# Result
print(f"Train accuracy: {train_accuracy:.4f}, Train log loss: {train_loss:.4f}")
print(f"Test  accuracy: {test_accuracy:.4f}, Test  log loss: {test_loss:.4f}")

25/11/17 14:06:18 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS


Train accuracy: 0.7550, Train log loss: 0.9433
Test  accuracy: 0.5063, Test  log loss: 1.2211
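
The `true_prob` column computed earlier relates directly to log loss: log loss is the mean negative logarithm of the probability assigned to the correct class. The sketch below is a plain-Python version of that definition. The helper name, the clipping value `eps`, and the example probabilities are assumptions for illustration, and Spark's evaluator may differ in small details.

```python
import math

def log_loss_from_true_probs(true_probs, eps=1e-15):
    """Mean negative log of the probability assigned to the correct class."""
    # Clip to avoid log(0) when the model gives the true class zero probability
    clipped = [min(max(p, eps), 1.0 - eps) for p in true_probs]
    return -sum(math.log(p) for p in clipped) / len(clipped)

# Hypothetical probabilities, for illustration only (not the notebook's data)
example_true_probs = [0.5, 0.25, 0.25]
example_loss = log_loss_from_true_probs(example_true_probs)

# A model that always spreads probability evenly over k classes scores ln(k)
uniform_loss_4_classes = log_loss_from_true_probs([0.25] * 10)
```

On the notebook's DataFrames, the input list could be collected with `[r["true_prob"] for r in test_predictions.select("true_prob").collect()]`. Comparing a model's log loss with `ln(k)` for `k` classes gives a quick sense of how far it is from uniform guessing.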


## Preliminary Result

### Control:
**1. TfidfVectorizer(min_df=2, max_df=0.95)**

**2. modelType="multinomial"**

### Recorded loss and accuracy results:
1. Control

Train loss: 0.9160, accuracy: 78.52%; Testing loss: 1.2395, accuracy: 47.61%.

2. min_df = 4

Train loss: 0.9732, accuracy: 73.32%; Testing loss: 1.2277, accuracy: 48.36%.

3. min_df = 8

Train loss: 1.0532, accuracy: 64.43%; Testing loss: 1.2383, accuracy: 47.36%.

4. max_df = 0.75

Train loss: 0.9160, accuracy: 78.52%; Testing loss: 1.2395, accuracy: 47.61%.

5. max_df = 0.45

Train loss: 0.9135, accuracy: 79.36%; Testing loss: 1.2402, accuracy: 47.61%.

6. ngram_range=(1,2)

Train loss: 0.8619, accuracy: 82.72%; Testing loss: 1.2345, accuracy: 48.61%.

7. ngram_range=(1,3)

Train loss: 0.8559, accuracy: 82.89%; Testing loss: 1.2367, accuracy: 47.10%.

8. modelType="complement"

Train loss: 1.0443, accuracy: 89.60%; Testing loss: 1.2878, accuracy: 47.61%.

### Current best result:
1. min_df = 4, max_df = 0.95, ngram_range=(1,2)

Train loss: 0.9433, accuracy: 75.50%; Testing loss: 1.2211, accuracy: 50.63%.
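
To compare the runs at a glance, the short sketch below ranks the configurations listed above by test accuracy and prints the gap between train and test accuracy. The numbers are copied from the results above; the variable names are illustrative.

```python
# (configuration, train accuracy %, test accuracy %) copied from the results above
runs = [
    ("control (min_df=2, max_df=0.95)", 78.52, 47.61),
    ("min_df=4", 73.32, 48.36),
    ("min_df=8", 64.43, 47.36),
    ("max_df=0.75", 78.52, 47.61),
    ("max_df=0.45", 79.36, 47.61),
    ("ngram_range=(1,2)", 82.72, 48.61),
    ("ngram_range=(1,3)", 82.89, 47.10),
    ('modelType="complement"', 89.60, 47.61),
    ("min_df=4, max_df=0.95, ngram_range=(1,2)", 75.50, 50.63),
]

# Sort by test accuracy (descending) and report the train/test gap for each run
ranked = sorted(runs, key=lambda r: r[2], reverse=True)
for name, train_acc, test_acc in ranked:
    gap = train_acc - test_acc
    print(f"{name:45s} test={test_acc:5.2f}%  gap={gap:5.2f} pts")

best_name = ranked[0][0]
```

In every run the train accuracy exceeds the test accuracy by roughly 17 to 42 percentage points, which suggests the model overfits the training set regardless of these settings.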

In [17]:
# Show a few prediction results with probabilities
test_predictions.select(col_name, "label", "prediction", "probability").show(10)

+--------------------+-----+----------+--------------------+
|      tf-idf vectors|label|prediction|         probability|
+--------------------+-----+----------+--------------------+
|[0.0,0.0,0.0,0.0,...|    0|       0.0|[0.50812020215687...|
|[0.0,0.0,0.0,0.0,...|    0|       0.0|[0.30836712661241...|
|[0.0,0.0,0.0,0.0,...|    0|       0.0|[0.48166239626815...|
|[0.0,0.0,0.0,0.0,...|    0|       0.0|[0.37653776171522...|
|[0.0,0.0,0.0,0.0,...|    0|       0.0|[0.37741271967632...|
|[0.0,0.0,0.0,0.0,...|    0|       0.0|[0.31971914512349...|
|[0.0,0.0,0.0,0.0,...|    0|       0.0|[0.42827943363745...|
|[0.0,0.0,0.0,0.0,...|    0|       0.0|[0.39138085905693...|
|[0.0,0.0,0.0,0.0,...|    0|       3.0|[0.25094732599449...|
|[0.0,0.0,0.0,0.0,...|    0|       2.0|[0.29007920052226...|
+--------------------+-----+----------+--------------------+
only showing top 10 rows


In [18]:
ss.stop()

## Appendix:
### Methods to Optimize TF-IDF and Naive Bayes Models

**TF-IDF Optimization:**
- **Tune Parameters:**
  - Adjust `max_features`, `min_df`, `max_df` in `TfidfVectorizer` to control vocabulary size and filter rare/common terms.
  - Try different `ngram_range` values (e.g., `(1,2)` for unigrams and bigrams).
  - Experiment with and without `stop_words='english'`.
- **Text Preprocessing:**
  - Normalize text (lowercase, remove punctuation, stemming/lemmatization).
  - Remove or correct misspellings and special characters.

**Naive Bayes Optimization:**
- **Model Type:**
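  - Try different `modelType` options: `multinomial` and `complement` work directly on TF-IDF features; `bernoulli` requires binary (0/1) features, so the vectors would need to be binarized first.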
- **Class Imbalance:**
  - If classes are imbalanced, consider resampling or adjusting class weights.
- **Feature Selection:**
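  - Remove low-importance features (e.g., chi-squared feature selection). Dimensionality reduction such as PCA can produce negative values, which multinomial Naive Bayes does not accept, so it would need rescaling or a different classifier.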

**General Approaches:**
- **Cross-Validation:**
  - Use cross-validation to tune hyperparameters and avoid overfitting.
- **Ensemble Methods:**
  - Combine predictions from multiple models (e.g., voting, stacking).
- **Error Analysis:**
  - Analyze misclassified samples to improve preprocessing or feature engineering.

**Example: Tuning TF-IDF and Naive Bayes**
```python
# Example: Try bigrams and different min_df
vectorizer = TfidfVectorizer(ngram_range=(1,2), min_df=4, max_df=0.75)
train_tfidf_matrix = vectorizer.fit_transform(train_documents)
test_tfidf_matrix = vectorizer.transform(test_documents)

# Example: Try a different Naive Bayes model type
# (rebuild train_spark_df from the new TF-IDF matrix before fitting)
nb = NaiveBayes(featuresCol=col_name, labelCol=col_label, modelType="complement")
model = nb.fit(train_spark_df)
```
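
For the cross-validation suggestion, a minimal pure-Python sketch of stratified k-fold splitting is given below. The function name `stratified_k_fold` and the toy labels are hypothetical. In practice, a library implementation would normally be preferred; this only shows the idea.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k=5, seed=42):
    """Yield (train_indices, validation_indices) with class proportions roughly preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)  # deal indices round-robin across folds
    for i in range(k):
        validation = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, validation

labels = [0, 1, 2, 3] * 5  # hypothetical labels, 5 samples per class
splits = list(stratified_k_fold(labels, k=5))
```

When this is combined with TF-IDF, the vectorizer should be refitted on each fold's training indices only, so that vocabulary and IDF weights from the validation documents do not leak into training.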
