#### Import libraries

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import joblib
import scipy.sparse

#### Read the data and show its shape

In [2]:
train_df = pd.read_csv("D:/Projects/Sentiment_Analysis/data/processed/train_clean.csv")
dev_df   = pd.read_csv("D:/Projects/Sentiment_Analysis/data/processed/dev_clean.csv")
test_df  = pd.read_csv("D:/Projects/Sentiment_Analysis/data/processed/test_clean.csv")

print("Train:", train_df.shape)
print("Dev:", dev_df.shape)
print("Test:", test_df.shape)

train_df.head()

Train: (11426, 3)
Dev: (3166, 3)
Test: (3166, 3)


Unnamed: 0,sentence,sentiment,topic
0,slide giao trinh day du,2,1
1,nhiet tinh giang day gan gui voi sinh vien,2,0
2,di hoc day du full diem chuyen can,0,1
3,chua ap dung cong thong va cac thiet bi ho tro...,0,0
4,thay giang bai co nhieu bai tap vi du tren lop,2,0


#### Prepare data (use only the sentence column)

In [3]:
X_train = train_df["sentence"].astype(str)
X_dev   = dev_df["sentence"].astype(str)
X_test  = test_df["sentence"].astype(str)

1. Take the "sentence" column as the input text.

2. .astype(str) casts the values to strings, which is the input type TfidfVectorizer expects.

3. X_train is used to fit the TF-IDF vectorizer; X_dev and X_test are only transformed with the fitted vectorizer.
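The casting step can be sketched on a toy column that contains a missing value. The data below is hypothetical, and filling missing values with an empty string is an assumption rather than something the notebook does. In many pandas versions `astype(str)` alone turns a missing value into the literal text "nan", which the vectorizer would then treat as a word, so filling first is a cautious option.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (hypothetical data, not from the project files).
sentences = pd.Series(["slide giao trinh day du", np.nan])

# Replace missing values with an empty string, then cast to str.
# An empty string contributes no tokens to the TF-IDF matrix.
cleaned = sentences.fillna("").astype(str)

print(cleaned.tolist())
```

Whether this matters depends on the cleaned CSV files: if they contain no missing sentences, `astype(str)` on its own is enough.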

#### Create TF-IDF vectorizer and learn from train data

In [4]:
vectorizer = TfidfVectorizer(
    max_features=5000,   # limit number of features
    ngram_range=(1, 2),  # use both unigrams and bigrams
)

X_train_tfidf = vectorizer.fit_transform(X_train)
X_dev_tfidf   = vectorizer.transform(X_dev)
X_test_tfidf  = vectorizer.transform(X_test)

1. **TfidfVectorizer**: converts each sentence into a numeric vector whose weights reflect how important each term is in the text (TF-IDF).

2. **max_features=5000**: keeps at most the 5000 words/n-grams with the highest frequency across the training corpus, so the matrix does not grow too large.

3. **ngram_range=(1, 2)**:

    * 1 → unigram (a single word, e.g. “tốt”, "good")

    * 2 → bigram (a two-word phrase, e.g. “rất tốt”, "very good")

4. **vectorizer.fit_transform(X_train)** → learns the vocabulary and IDF weights from the training set and builds the TF-IDF matrix for train.

5. **vectorizer.transform(X_dev/X_test)** → builds the TF-IDF matrices for dev/test using the vocabulary learned from train, which avoids data leakage.
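The fit/transform split described above can be sketched on a toy corpus. The sentences and variable names here are hypothetical; only `TfidfVectorizer` and `ngram_range=(1, 2)` come from the notebook. The sketch shows that bigrams enter the vocabulary and that a word seen only in the dev text is ignored rather than added.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data in the same unaccented style as the dataset.
train_texts = ["giang day rat tot", "giang day nhiet tinh"]
dev_texts = ["thay day rat tot"]  # "thay" never appears in the training texts

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))

# fit_transform learns the vocabulary and IDF weights from train only.
train_matrix = toy_vectorizer.fit_transform(train_texts)

# transform reuses the learned vocabulary; unseen terms are dropped.
dev_matrix = toy_vectorizer.transform(dev_texts)

vocab = toy_vectorizer.vocabulary_  # dict: term -> column index
print(sorted(vocab))
print(train_matrix.shape, dev_matrix.shape)
```

Both matrices have the same number of columns because the column layout is fixed by the training vocabulary, which is what lets a model trained on the train matrix score the dev and test matrices.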

#### Check the matrix shapes

In [5]:
print("Train TF-IDF shape:", X_train_tfidf.shape)
print("Dev TF-IDF shape:", X_dev_tfidf.shape)
print("Test TF-IDF shape:", X_test_tfidf.shape)

Train TF-IDF shape: (11426, 5000)
Dev TF-IDF shape: (3166, 5000)
Test TF-IDF shape: (3166, 5000)
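Beyond the shape, it can be useful to check how sparse a TF-IDF matrix is, since most sentences contain only a few of the vocabulary terms. The following is a minimal sketch on two hypothetical sentences; the same `nnz` and `shape` attributes are available on the sparse matrices returned by `fit_transform` and `transform`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data; the real matrices would be X_train_tfidf and so on.
texts = [
    "slide giao trinh day du",
    "di hoc day du full diem chuyen can",
]

matrix = TfidfVectorizer().fit_transform(texts)

n_rows, n_cols = matrix.shape
# nnz is the number of stored non-zero entries in the sparse matrix.
density = matrix.nnz / (n_rows * n_cols)

print("Shape:", matrix.shape)
print("Non-zero entries:", matrix.nnz)
print("Density:", round(density, 3))
```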


#### Save vectorizer & TF-IDF matrix (for training)

In [7]:
# Save the fitted vectorizer so new text can be transformed later without refitting
# (the file name tfidf_vectorizer.pkl is an assumption)
joblib.dump(vectorizer, "D:/Projects/Sentiment_Analysis/models/TF-IDF/tfidf_vectorizer.pkl")

# Save the sentiment labels for each split
joblib.dump(train_df["sentiment"].values, "D:/Projects/Sentiment_Analysis/models/TF-IDF/y_train.pkl")
joblib.dump(dev_df["sentiment"].values, "D:/Projects/Sentiment_Analysis/models/TF-IDF/y_dev.pkl")
joblib.dump(test_df["sentiment"].values, "D:/Projects/Sentiment_Analysis/models/TF-IDF/y_test.pkl")

# Save the TF-IDF matrices in compressed sparse (.npz) format (optional)
scipy.sparse.save_npz("D:/Projects/Sentiment_Analysis/data/features/TF_IDF/X_train_tfidf.npz", X_train_tfidf)
scipy.sparse.save_npz("D:/Projects/Sentiment_Analysis/data/features/TF_IDF/X_dev_tfidf.npz", X_dev_tfidf)
scipy.sparse.save_npz("D:/Projects/Sentiment_Analysis/data/features/TF_IDF/X_test_tfidf.npz", X_test_tfidf)

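1. Save the **vectorizer** so new data can be transformed without refitting.

2. Save the **TF-IDF matrices** in compressed sparse format so they can be reused when training/evaluating models without recomputing them; the sparse format keeps memory and disk usage low.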
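A later training notebook would need to read these artifacts back. The sketch below shows one way to do that round trip with `joblib.load` and `scipy.sparse.load_npz`. It uses a temporary directory and hypothetical file names and toy sentences instead of the project paths, so it can run anywhere.

```python
import os
import tempfile

import joblib
import scipy.sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the training sentences.
texts = ["slide giao trinh day du", "nhiet tinh giang day"]
vec = TfidfVectorizer(ngram_range=(1, 2))
matrix = vec.fit_transform(texts)

with tempfile.TemporaryDirectory() as tmp_dir:
    # File names are assumptions chosen for this sketch.
    vec_path = os.path.join(tmp_dir, "vectorizer.pkl")
    mat_path = os.path.join(tmp_dir, "X_train_tfidf.npz")

    # Save both artifacts, then load them back from disk.
    joblib.dump(vec, vec_path)
    scipy.sparse.save_npz(mat_path, matrix)

    loaded_vec = joblib.load(vec_path)
    loaded_matrix = scipy.sparse.load_npz(mat_path)

# The reloaded vectorizer transforms new text without being refitted.
new_matrix = loaded_vec.transform(["giang day nhiet tinh"])

# The reloaded matrix has no entries that differ from the original.
same = (loaded_matrix != matrix).nnz == 0
print(same, new_matrix.shape)
```

Files written with `joblib.dump` are pickles, so they should only be loaded from a trusted source and ideally with the same scikit-learn version that created them.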