# TF-IDF model for word-embedding (Requested)
### Implementation: scikit-learn's TF-IDF
Author: Marcus KWAN TH

Last updated: 2025-11-17

## 1. Prerequisite: Data extraction from `util`

In [None]:
"""
The purpose of this lines is to:
1. Prevent PySpark from using a different Python interpreter
2. Adding the root path to the sys context for the runtime to properly import util.preprocessing, otherwise, error will occur.

Based on the testing with Jupyter Notebook.
"""

import sys, os
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

# Add the root folder to sys.path before importing util package
root_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
if root_path not in sys.path:
    sys.path.append(root_path)

# Import all necessary libraries for word-vectorization
from pyspark.sql import SparkSession
from util.preprocessing import load_and_preprocess_data

In [None]:
# Initialize Spark Session
ss  = SparkSession.builder \
        .appName("Marcus TF-IDF Word Dimension Builder") \
        .getOrCreate()

# Add util.zip to the PySpark context (addPyFile returns None, so its result is not assigned)
ss.sparkContext.addPyFile("../util.zip")

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/11/17 14:09:57 WARN Utils: Your hostname, Marcuss-MacBook-Air.local, resolves to a loopback address: 127.0.0.1; using 10.11.97.189 instead (on interface en0)
25/11/17 14:09:57 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/17 14:09:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/11/17 14:09:57 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [3]:
# Load and preprocess training and testing data from the util package
train_df = load_and_preprocess_data('../Twitter_data/traindata7.csv')
test_df = load_and_preprocess_data('../Twitter_data/testdata7.csv')

# Convert Spark DataFrames to Pandas for TF-IDF processing
train_pandas = train_df.toPandas()
test_pandas = test_df.toPandas()

# Extract documents and labels from training data
train_documents = train_pandas.iloc[:, 0].astype(str).tolist()
train_labels = train_pandas.iloc[:, 1].tolist()

# Extract documents and labels from testing data
test_documents = test_pandas.iloc[:, 0].astype(str).tolist()
test_labels = test_pandas.iloc[:, 1].tolist()

print(f"Number of training samples: {len(train_documents)}")
print(f"Number of testing samples: {len(test_documents)}\n")
print(f"Training 1: {train_documents[0]}")
print(f"Testing 2: {test_documents[0]}")

25/11/17 14:09:58 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.
                                                                                

Number of training samples: 596
Number of testing samples: 397

Training 1: wishin i could go to home depot to buy shit to build shit
Testing 2: cold war black ops zombie be a damn hype!!!!!!


## 2. TF-IDF Embedding Implementation
### Vectorize the pre-processed documents over the word vocabulary using TF-IDF.
Libraries: Scikit-learn

In [4]:
# Import all necessary libraries for word-vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
# Combine all documents for TF-IDF fitting
all_documents = train_documents + test_documents

# Fit TF-IDF on all documents and get the document-term matrix
vectorizer = TfidfVectorizer()
all_vocab_vectorizer = vectorizer.fit_transform(all_documents)
all_vocab = vectorizer.get_feature_names_out()

# Fit TF-IDF on train documents and get the document-term matrix
vectorizer = TfidfVectorizer()
train_vocab_vectorizer = vectorizer.fit_transform(train_documents)
train_vocab = vectorizer.get_feature_names_out()

# Fit TF-IDF on testing documents and get the document-term matrix
vectorizer = TfidfVectorizer()
test_vocab_vectorizer = vectorizer.fit_transform(test_documents)
test_vocab = vectorizer.get_feature_names_out()

## 3. Save the TF-IDF-vectorized words to dataframes
Libraries: Pandas

In [6]:
# Import all necessary libraries for saving data
import pandas as pd

In [7]:
# Check the shape to ensure all words are present
print(f"All docs: Vocabulary size: {len(all_vocab)}")
print(f"All docs: Matrix shape: {all_vocab_vectorizer.shape}\n")
print(f"Train docs: Vocabulary size: {len(train_vocab)}")
print(f"Train docs: Matrix shape: {train_vocab_vectorizer.shape}\n")
print(f"Test docs: Vocabulary size: {len(test_vocab)}")
print(f"Test docs: Matrix shape: {test_vocab_vectorizer.shape}")

All docs: Vocabulary size: 4046
All docs: Matrix shape: (993, 4046)

Train docs: Vocabulary size: 2969
Train docs: Matrix shape: (596, 2969)

Test docs: Vocabulary size: 2209
Test docs: Matrix shape: (397, 2209)


In [8]:
# Convert the TF-IDF matrix to an array format and create a pandas DataFrame
all_df_tfidf = pd.DataFrame(all_vocab_vectorizer.toarray(), columns=all_vocab)
train_df_tfidf = pd.DataFrame(train_vocab_vectorizer.toarray(), columns=train_vocab)
test_df_tfidf = pd.DataFrame(test_vocab_vectorizer.toarray(), columns=test_vocab)

print("For all words: \nTF-IDF matrix: Columns are words, rows are dimensions\n")
print(all_df_tfidf)

For all words: 
TF-IDF matrix: Columns are words, rows are dimensions

      00  000  00303   01   06   07   08   09  0ezqmlg9ik  0wn3frahg  ...  \
0    0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0         0.0        0.0  ...   
1    0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0         0.0        0.0  ...   
2    0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0         0.0        0.0  ...   
3    0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0         0.0        0.0  ...   
4    0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0         0.0        0.0  ...   
..   ...  ...    ...  ...  ...  ...  ...  ...         ...        ...  ...   
988  0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0         0.0        0.0  ...   
989  0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0         0.0        0.0  ...   
990  0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0         0.0        0.0  ...   
991  0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0         0.0        0.0  ...   
992  0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0         0.0        0.0  ...   

    

## 4. Export the TF-IDF vectors to .txt files

In [9]:
'''
ALL Documents Word Vectors
'''
# Export each word and its TF-IDF vector (all documents) to a text file.
output_path = os.path.join(root_path, "model5", "output", "1_tfidf_word_vectors.txt")

with open(output_path, "w", encoding="utf-8") as f:

    for word in all_df_tfidf.columns:
        vec = all_df_tfidf[word].to_list()
        vec_str = ", ".join(f"{v:.6f}" for v in vec) # Format numbers with 6 dp; adjust if necessary.
        f.write(f"{word} [{vec_str}]\n")

print(f"Wrote {len(all_df_tfidf.columns)} word TF-IDF vectors to {output_path}")

Wrote 4046 word TF-IDF vectors to /Users/MarcussPC/Desktop/Temp/DSAI4205_Project/twitter-sentiment-analysis/model5/output/1_tfidf_word_vectors.txt


In [10]:
'''
TRAIN Documents Word Vectors
'''
# Export each word and its TF-IDF vector (train documents) to a text file.
output_path = os.path.join(root_path, "model5", "output", "2_train_docs_tfidf_word_vectors.txt")

with open(output_path, "w", encoding="utf-8") as f:

    for word in train_df_tfidf.columns:
        vec = train_df_tfidf[word].to_list()
        vec_str = ", ".join(f"{v:.6f}" for v in vec) # Format numbers with 6 dp; adjust if necessary.
        f.write(f"{word} [{vec_str}]\n")

print(f"Wrote {len(train_df_tfidf.columns)} word TF-IDF vectors to {output_path}")

Wrote 2969 word TF-IDF vectors to /Users/MarcussPC/Desktop/Temp/DSAI4205_Project/twitter-sentiment-analysis/model5/output/2_train_docs_tfidf_word_vectors.txt


In [11]:
'''
TEST Documents Word Vectors
'''
# Export each word and its TF-IDF vector (test documents) to a text file.
output_path = os.path.join(root_path, "model5", "output", "3_test_docs_tfidf_word_vectors.txt")

with open(output_path, "w", encoding="utf-8") as f:

    for word in test_df_tfidf.columns:
        vec = test_df_tfidf[word].to_list()
        vec_str = ", ".join(f"{v:.6f}" for v in vec) # Format numbers with 6 dp; adjust if necessary.
        f.write(f"{word} [{vec_str}]\n")

print(f"Wrote {len(test_df_tfidf.columns)} word TF-IDF vectors to {output_path}")

Wrote 2209 word TF-IDF vectors to /Users/MarcussPC/Desktop/Temp/DSAI4205_Project/twitter-sentiment-analysis/model5/output/3_test_docs_tfidf_word_vectors.txt
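
The three export cells repeat the same loop, so they could be folded into one helper, and a matching reader makes it easy to confirm the `word [v1, v2, ...]` format parses back correctly. The sketch below is one possible shape for that. `write_word_vectors` and `read_word_vectors` are hypothetical names that do not exist in the project, and the toy DataFrame stands in for the real TF-IDF frames.

```python
import os
import tempfile

import pandas as pd


def write_word_vectors(df_tfidf, path, precision=6):
    """Write one line per word: `word [v1, v2, ...]` (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for word in df_tfidf.columns:
            values = df_tfidf[word].to_list()
            vec_str = ", ".join(f"{v:.{precision}f}" for v in values)
            f.write(f"{word} [{vec_str}]\n")
    return len(df_tfidf.columns)


def read_word_vectors(path):
    """Parse the file written above into a {word: [floats]} dict (hypothetical helper)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Vocabulary words contain no spaces, so the first " [" separates word and vector
            word, _, rest = line.rstrip("\n").partition(" [")
            vectors[word] = [float(v) for v in rest.rstrip("]").split(", ")]
    return vectors


# Toy stand-in for all_df_tfidf / train_df_tfidf / test_df_tfidf
toy = pd.DataFrame({"good": [0.5, 0.0], "day": [0.25, 1.0]})

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_word_vectors.txt")
    n_written = write_word_vectors(toy, toy_path)
    loaded = read_word_vectors(toy_path)

print(n_written, loaded)
```

Values are rounded to `precision` decimal places on write, so a round trip reproduces the original weights only to that precision.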


In [12]:
ss.stop()