# Sidestory: Word-Vectorization Builder with TF-IDF
### Extracting every word's vector representation and storing it in a text file
Author: Marcus KWAN TH

Last updated: 2025-11-15

## 1. Prerequisites (same as the main program code)

In [None]:
import sys, os
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

# Add the root folder to sys.path before importing util package
root_path = os.path.abspath(os.path.join(os.getcwd(), '..'))
if root_path not in sys.path:
    sys.path.append(root_path)

from util.preprocessing import load_and_preprocess_data

# Import all necessary libraries for word vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
from pyspark.sql import SparkSession
import pandas as pd


In [2]:
# Initialize Spark Session
ss  = SparkSession.builder \
        .appName("Marcus TF-IDF Word Dimension Builder") \
        .getOrCreate()

# Add util.zip to the PySpark context (addPyFile returns None, so its result is not assigned)
ss.sparkContext.addPyFile("../util.zip")

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/15 20:19:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
# Load and preprocess training and testing data from the util package
train_df = load_and_preprocess_data('../Twitter_data/traindata7.csv')
test_df = load_and_preprocess_data('../Twitter_data/testdata7.csv')

# Convert Spark DataFrames to Pandas for TF-IDF processing
train_pandas = train_df.toPandas()
test_pandas = test_df.toPandas()

# Extract documents and labels from training data
train_documents = train_pandas.iloc[:, 0].astype(str).tolist()
train_labels = train_pandas.iloc[:, 1].tolist()

# Extract documents and labels from testing data
test_documents = test_pandas.iloc[:, 0].astype(str).tolist()
test_labels = test_pandas.iloc[:, 1].tolist()

print(f"Number of training samples: {len(train_documents)}")
print(f"Number of testing samples: {len(test_documents)}\n")
print(f"Training 1: {train_documents[0]}")
print(f"Testing 2: {test_documents[0]}")

25/11/15 20:19:38 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.
                                                                                

Number of training samples: 596
Number of testing samples: 397

Training 1: wishin i could go to home depot to buy shit to build shit
Testing 2: cold war black ops zombie be a damn hype!!!!!!


## 2. Perform TF-IDF word vectorization, save the result to a DataFrame, and export it locally

In [None]:
# Combine all documents for TF-IDF fitting
all_documents = train_documents + test_documents

# Fit TF-IDF on all documents and get the document-term matrix
vectorizer = TfidfVectorizer()
all_vocab_vectorizer = vectorizer.fit_transform(all_documents)
all_vocab = vectorizer.get_feature_names_out()

# Check the shape to ensure all words are present
print(f"Vocabulary size: {len(all_vocab)}")
print(f"Matrix shape: {all_vocab_vectorizer.shape}")

Vocabulary size: 4046
Matrix shape: (993, 4046)
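
To make the idea of a "word vector" concrete, here is a minimal sketch on a made-up three-document corpus (the `toy_*` names and sentences are illustrative and not part of the project data). Each column of the document-term matrix is one word's vector across documents. The sketch also checks the fitted IDF against the smoothed formula `ln((1 + n) / (1 + df)) + 1`, which scikit-learn applies under its default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only
toy_documents = ["cold war zombie", "zombie hype", "home depot"]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_documents)

# vocabulary_ maps each word to its column index in the matrix
zombie_col = toy_vectorizer.vocabulary_["zombie"]

# The word's vector has one TF-IDF value per document
zombie_vector = toy_matrix[:, zombie_col].toarray().ravel()

# Default (smoothed) IDF: ln((1 + n_docs) / (1 + doc_freq)) + 1
n_docs = len(toy_documents)
doc_freq = 2  # "zombie" appears in two of the three documents
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(f"Matrix shape: {toy_matrix.shape}")
print(f"'zombie' vector: {zombie_vector}")
print(f"IDF matches formula: {np.isclose(toy_vectorizer.idf_[zombie_col], expected_idf)}")
```

The vector is zero for the third document, which does not contain the word. The two non-zero values differ because each document row is L2-normalized by default, so the same word weighs more in a shorter document.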


In [None]:
# Convert the TF-IDF matrix to an array format and create a pandas DataFrame
df_tfidf = pd.DataFrame(all_vocab_vectorizer.toarray(), columns=all_vocab)

print("TF-IDF matrix: Columns are words, rows are documents\n")
print(df_tfidf)

TF-IDF matrix: Columns are words, rows are documents

      00  000  00303   01   06   07   08   09  0ezqmlg9ik  0wn3frahg  ...  \
0    0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0         0.0        0.0  ...   
1    0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0         0.0        0.0  ...   
2    0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0         0.0        0.0  ...   
3    0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0         0.0        0.0  ...   
4    0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0         0.0        0.0  ...   
..   ...  ...    ...  ...  ...  ...  ...  ...         ...        ...  ...   
988  0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0         0.0        0.0  ...   
989  0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0         0.0        0.0  ...   
990  0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0         0.0        0.0  ...   
991  0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0         0.0        0.0  ...   
992  0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0         0.0        0.0  ...   

     zelda  zero  zio
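
The dense `toarray()` conversion is fine at this size (993 x 4046), but for a larger corpus it may be preferable to keep the matrix sparse. One possible sketch, using a made-up corpus and a hypothetical helper `word_vector`, transposes the sparse matrix so that each row is a word and densifies only the row that is requested:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only
sample_documents = ["zelda be a damn hype", "zero hype", "build a shed"]

sample_vectorizer = TfidfVectorizer()
doc_term = sample_vectorizer.fit_transform(sample_documents)

# Transpose so rows are words and columns are documents;
# converting to CSR keeps row slicing cheap
term_doc = doc_term.T.tocsr()

def word_vector(word):
    """Return the dense TF-IDF vector of one word across all documents."""
    row_index = sample_vectorizer.vocabulary_[word]
    return term_doc[row_index].toarray().ravel()

hype_vector = word_vector("hype")
print(hype_vector)
```

Only one row is converted to a dense array at a time, so memory use scales with the number of documents rather than with the full vocabulary.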

In [None]:
# Export each word and its TF-IDF vector (one value per document) to a text file.
output_path = os.path.join(root_path, "model5", "output", "tfidf_word_vectors.txt")
os.makedirs(os.path.dirname(output_path), exist_ok=True)  # Create the output folder if it is missing

with open(output_path, "w", encoding="utf-8") as f:

    for word in df_tfidf.columns:
        vec = df_tfidf[word].to_list()
        vec_str = ", ".join(f"{v:.6f}" for v in vec) # Format numbers with 6 dp; adjust if necessary.
        f.write(f"{word} [{vec_str}]\n")

print(f"Wrote {len(df_tfidf.columns)} word vectors (each length {len(df_tfidf)}) to: {output_path}")

Wrote 4046 word vectors (each length 993) to: /Users/MarcussPC/Desktop/Temp/DSAI4205_Project/twitter-sentiment-analysis/model5/output/tfidf_word_vectors.txt
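
To check the export, the file can be read back. The sketch below assumes the exact line format written by the cell above (`word [v1, v2, ...]`). The helpers `parse_word_vector_line` and `load_word_vectors` are hypothetical names, and the demo writes a small temporary file rather than touching the real output.

```python
import os
import tempfile

def parse_word_vector_line(line):
    """Split one 'word [v1, v2, ...]' line into (word, list of floats)."""
    # Tokens produced by TfidfVectorizer's default pattern contain no spaces,
    # so the first " [" reliably separates the word from its values
    word, _, rest = line.rstrip("\n").partition(" [")
    values = rest.rstrip("]")
    vector = [float(v) for v in values.split(", ")] if values else []
    return word, vector

def load_word_vectors(path):
    """Read the whole export into a dict mapping word -> vector."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # Skip empty lines
                word, vector = parse_word_vector_line(line)
                vectors[word] = vector
    return vectors

# Round-trip demo on a temporary file with made-up values
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_vectors.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("zelda [0.000000, 0.250000]\n")
        f.write("zero [0.500000, 0.000000]\n")
    demo_vectors = load_word_vectors(demo_path)

print(demo_vectors)
```

Because the export rounds each value to 6 decimal places, vectors loaded this way are approximations of the original TF-IDF values.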
