# TF-IDF Embedding Implementation
### Vectorize the pre-processed data with TF-IDF and load it into PySpark DataFrames.
Libraries: Scikit-learn, PySpark

Author: Marcus KWAN TH

Last updated: 2025-11-14

In [1]:
"""
These lines:
1. Prevent PySpark from using a different Python interpreter than the notebook kernel.
2. Add the project root to sys.path so that util.preprocessing can be imported;
   without this the import fails (based on testing in Jupyter Notebook).
"""

import sys, os
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

# Add the root folder to sys.path before importing util package
root_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
if root_path not in sys.path:
    sys.path.append(root_path)

from util.preprocessing import load_and_preprocess_data

In [2]:
# Import all necessary libraries for TF-IDF embedding
from sklearn.feature_extraction.text import TfidfVectorizer
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

In [3]:
"""
Before running the following code, zip the util folder by running this
command from the root directory:
    zip -r util.zip util
"""

# Initialize Spark Session
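ss = SparkSession.builder \
        .appName("Marcus TF-IDF") \
        .getOrCreate()

# Add util.zip to the PySpark context so that executors can import the util package.
# addPyFile returns None, so its result is not assigned.
ss.sparkContext.addPyFile("../util.zip")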

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/11/14 23:52:59 WARN Utils: Your hostname, Marcuss-MacBook-Air.local, resolves to a loopback address: 127.0.0.1; using 192.168.0.216 instead (on interface en0)
25/11/14 23:52:59 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

In [4]:
# Load and preprocess training and testing data from the util package
train_df = load_and_preprocess_data('../Twitter_data/traindata7.csv')
test_df = load_and_preprocess_data('../Twitter_data/testdata7.csv')

25/11/14 23:53:01 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [5]:
# Convert Spark DataFrames to Pandas for TF-IDF processing
train_pandas = train_df.toPandas()
test_pandas = test_df.toPandas()

                                                                                

In [6]:
# Extract documents and labels from training data
train_documents = train_pandas.iloc[:, 0].astype(str).tolist()
train_labels = train_pandas.iloc[:, 1].tolist()

# Extract documents and labels from testing data
test_documents = test_pandas.iloc[:, 0].astype(str).tolist()
test_labels = test_pandas.iloc[:, 1].tolist()

print(f"Training samples: {len(train_documents)}")
print(f"Testing samples: {len(test_documents)}")

Training samples: 596
Testing samples: 397


In [7]:
# Fit the TF-IDF vectorizer on the training data (learns the vocabulary and IDF weights)
vectorizer = TfidfVectorizer(max_features=1000, min_df=2, max_df=0.95)
train_tfidf_matrix = vectorizer.fit_transform(train_documents)

# Transform the testing data with the vocabulary and IDF weights learned from training
test_tfidf_matrix = vectorizer.transform(test_documents)
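
The fit/transform split above matters because the vocabulary and IDF weights come from the training documents only. The following is a minimal pure-Python sketch of that idea, not scikit-learn's implementation. It assumes whitespace tokenization, the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization, and it ignores `max_features`, `min_df` and `max_df`. The names `fit_idf`, `transform`, `toy_train` and `toy_test` are hypothetical.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    vocab = sorted(doc_freq)
    # Smoothed IDF (assumed formula): ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    return vocab, idf

def transform(docs, vocab, idf):
    """Map documents to L2-normalized TF-IDF rows; terms outside vocab are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row] if norm > 0 else row)
    return rows

# Hypothetical toy corpus, not taken from the Twitter data
toy_train = ["good movie", "bad movie", "good good day"]
toy_test = ["good film"]  # "film" never appears in the training documents

vocab, idf = fit_idf(toy_train)
train_rows = transform(toy_train, vocab, idf)
test_rows = transform(toy_test, vocab, idf)

print(vocab)
print(len(train_rows[0]), len(test_rows[0]))
```

The test row has the same width as the training rows, and the unseen term "film" contributes nothing, which is why the test matrix in this notebook also ends up with 1000 columns.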

In [8]:
# Convert sparse (doc-term) matrices to dense arrays
train_tfidf_dense = train_tfidf_matrix.toarray()
test_tfidf_dense = test_tfidf_matrix.toarray()

print(f"TF-IDF matrix shape for (1) Train: {train_tfidf_dense.shape}, (2) Test: {test_tfidf_dense.shape}")
print(f"Vocabulary size: {len(vectorizer.get_feature_names_out())}")

TF-IDF matrix shape for (1) Train: (596, 1000), (2) Test: (397, 1000)
Vocabulary size: 1000


In [9]:
# Create PySpark DataFrames for training and testing data for later stages
train_spark_df = ss.createDataFrame(
    [(Vectors.dense(vec), int(lbl)) for vec, lbl in zip(train_tfidf_dense, train_labels)],
    ["tf-idf", "label"]
)

test_spark_df = ss.createDataFrame(
    [(Vectors.dense(vec), int(lbl)) for vec, lbl in zip(test_tfidf_dense, test_labels)],
    ["tf-idf", "label"]
)
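
The rows printed later are mostly zeros, so a sparse layout may be worth considering if the vocabulary grows. As a hedged sketch, the hypothetical helper `to_sparse_triple` below extracts the `(size, indices, values)` triple from a dense row. A triple in that form is what `Vectors.sparse(size, indices, values)` accepts, but the Spark call itself is left out so that the snippet runs without a Spark installation.

```python
def to_sparse_triple(dense_row):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense_row) if v != 0.0]
    values = [float(dense_row[i]) for i in indices]
    return len(dense_row), indices, values

# Hypothetical toy row standing in for one TF-IDF vector
toy_row = [0.0, 0.0, 0.7, 0.0, 0.3, 0.0]
size, indices, values = to_sparse_triple(toy_row)
print(size, indices, values)
```

Whether this is worth doing for 1000 features and a few hundred rows is a judgment call; the dense version used in this notebook is simpler.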

# Naive Bayes Data Analytic Model Implementation
### Sentiment classification with a Naive Bayes model in PySpark.
Libraries: PySpark (classification and evaluation)

Author: Marcus KWAN TH

Last updated: 2025-11-14

In [10]:
# Import necessary libraries for Naive Bayes classification
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [11]:
# Train the Naive Bayes model on training data
nb = NaiveBayes(featuresCol="tf-idf", labelCol="label", modelType="multinomial")
model = nb.fit(train_spark_df)  # Default hyperparameters; no tuning yet.

# Generate predictions on the testing data with the fitted model
predictions = model.transform(test_spark_df)
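
To make the fit and transform steps above less of a black box, here is a minimal pure-Python sketch of the textbook multinomial Naive Bayes formulation with additive smoothing, applied to non-negative, fractional feature values such as TF-IDF weights. It is an illustration under stated assumptions, not Spark's internal code: the prior is left unsmoothed, and `fit_multinomial_nb`, `predict` and the toy data are hypothetical.

```python
import math

def fit_multinomial_nb(rows, labels, smoothing=1.0):
    """Estimate log priors and smoothed log feature weights for each class."""
    classes = sorted(set(labels))
    n_features = len(rows[0])
    log_prior, log_theta = {}, {}
    for c in classes:
        class_rows = [r for r, y in zip(rows, labels) if y == c]
        log_prior[c] = math.log(len(class_rows) / len(rows))
        # Sum each feature over the class, then add the smoothing constant
        totals = [sum(r[j] for r in class_rows) + smoothing for j in range(n_features)]
        denom = sum(totals)
        log_theta[c] = [math.log(t / denom) for t in totals]
    return log_prior, log_theta

def predict(row, log_prior, log_theta):
    """Pick the class with the highest log prior plus weighted log likelihood."""
    scores = {
        c: log_prior[c] + sum(x * w for x, w in zip(row, log_theta[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy data: two features ("good", "bad"), labels 1 = positive, 0 = negative
toy_rows = [[1.0, 0.0], [0.8, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [1, 1, 0, 0]

log_prior, log_theta = fit_multinomial_nb(toy_rows, toy_labels)
print(predict([0.9, 0.0], log_prior, log_theta))
print(predict([0.0, 0.9], log_prior, log_theta))
```

The smoothing constant keeps the weight of a term that never occurs in a class away from zero, so a single unseen term cannot rule that class out.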

                                                                                

In [12]:
# Show the first 5 predictions with their class probabilities
predictions.select("tf-idf", "label", "prediction", "probability").show(5)

25/11/14 23:53:07 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS


+--------------------+-----+----------+--------------------+
|              tf-idf|label|prediction|         probability|
+--------------------+-----+----------+--------------------+
|[0.0,0.0,0.0,0.0,...|    0|       0.0|[0.47656605906150...|
|[0.0,0.0,0.0,0.0,...|    0|       1.0|[0.30388306807292...|
|[0.0,0.0,0.0,0.0,...|    0|       1.0|[0.37299478086017...|
|[0.0,0.0,0.0,0.0,...|    0|       0.0|[0.40872759652428...|
|[0.0,0.0,0.0,0.0,...|    0|       0.0|[0.37958369959263...|
+--------------------+-----+----------+--------------------+
only showing top 5 rows


In [13]:
# Evaluate test-set accuracy with PySpark's evaluator (accuracy only, for now)
accuracy_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy"
)
accuracy = accuracy_evaluator.evaluate(predictions)
print(f"Test set accuracy = {accuracy:.4f}")

Test set accuracy = 0.4761
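
A single accuracy figure is easier to interpret next to a baseline and a per-class breakdown. The sketch below shows one way to compute both in plain Python. The helpers `accuracy`, `majority_baseline` and `confusion_counts` are hypothetical, and the toy labels are invented for illustration rather than taken from this notebook's data.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def majority_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return accuracy(test_labels, [majority] * len(test_labels))

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))

# Hypothetical toy labels, not the notebook's results
toy_train_labels = [0, 0, 0, 1, 2]
toy_true = [0, 0, 1, 2, 2]
toy_pred = [0, 1, 1, 2, 0]

print(f"accuracy = {accuracy(toy_true, toy_pred):.2f}")
print(f"majority baseline = {majority_baseline(toy_train_labels, toy_true):.2f}")
print(confusion_counts(toy_true, toy_pred))
```

In the notebook, the same comparison could be made by passing `train_labels`, `test_labels` and the collected `prediction` column to these helpers.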
