# Dataset

## Download dataset
The Vietnamese Students' Feedback Corpus (UIT-VSFC) is a resource of over 16,000 sentences that are human-annotated for two tasks: sentiment-based and topic-based classification [1].

[1] Kiet Van Nguyen, Vu Duc Nguyen, Phu Xuan-Vinh Nguyen, Tham Thi-Hong Truong, Ngan Luu-Thuy Nguyen. UIT-VSFC: Vietnamese Students' Feedback Corpus for Sentiment Analysis. 2018 10th International Conference on Knowledge and Systems Engineering (KSE 2018), November 1-3, 2018, Ho Chi Minh City, Vietnam.

In [22]:
!pip install datasets

Defaulting to user installation because normal site-packages is not writeable


In [23]:
from datasets import load_dataset

dataset = load_dataset("uitnlp/vietnamese_students_feedback")

## Interacting with the downloaded data

In [24]:
train_set = dataset['train']
train_set

Dataset({
    features: ['sentence', 'sentiment', 'topic'],
    num_rows: 11426
})

In [25]:
train_set[0]

{'sentence': 'slide giáo trình đầy đủ .', 'sentiment': 2, 'topic': 1}

In [26]:
len(train_set)

11426

## Split a sentence

In [27]:
# Read a sentence
example_word_list = train_set[0]['sentence']
example_word_list

'slide giáo trình đầy đủ .'

In [28]:
# Split sentence word-by-word
example_word_list.split()

['slide', 'giáo', 'trình', 'đầy', 'đủ', '.']

In [29]:
# Join the split words back into one full sentence
sentence = " ".join(example_word_list.split())
sentence

'slide giáo trình đầy đủ .'

In [30]:
# Get the first 10 sentences to process
sentence_list = [train_set[idx]['sentence'] for idx in range(10)]
sentence_list

['slide giáo trình đầy đủ .',
 'nhiệt tình giảng dạy , gần gũi với sinh viên .',
 'đi học đầy đủ full điểm chuyên cần .',
 'chưa áp dụng công nghệ thông tin và các thiết bị hỗ trợ cho việc giảng dạy .',
 'thầy giảng bài hay , có nhiều bài tập ví dụ ngay trên lớp .',
 'giảng viên đảm bảo thời gian lên lớp , tích cực trả lời câu hỏi của sinh viên , thường xuyên đặt câu hỏi cho sinh viên .',
 'em sẽ nợ môn này , nhưng em sẽ học lại ở các học kỳ kế tiếp .',
 'thời lượng học quá dài , không đảm bảo tiếp thu hiệu quả .',
 'nội dung môn học có phần thiếu trọng tâm , hầu như là chung chung , khái quát khiến sinh viên rất khó nắm được nội dung môn học .',
 'cần nói rõ hơn bằng cách trình bày lên bảng thay vì nhìn vào slide .']

# Text processing

## N-grams
- N-grams are contiguous sequences of $n$ items (words, symbols, or other tokens) in a document. For example, a 2-gram (bigram) is a pair of neighboring tokens.
- We can generate n-grams, and apply many other text preprocessing algorithms, with the [`nltk`](https://www.nltk.org/) library.
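
To make the definition concrete, here is a minimal pure-Python sketch that slides a window of size $n$ over whitespace-separated tokens. The helper name `word_ngrams` is introduced here for illustration only and is not part of `nltk`.

```python
from typing import List


def word_ngrams(sentence: str, n: int) -> List[str]:
    """Return all contiguous n-token windows of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    # A sentence with fewer than n tokens yields an empty list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(word_ngrams("This document is the second document.", 2))
```

Because the split is on whitespace only, punctuation stays attached to the neighboring word (for example `document.`).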

In [31]:
example_sentence = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

In [32]:
from nltk import ngrams
import numpy as np

num_of_grams = np.arange(1, 4, 1) # Test 3 n-grams

print("Original sentence:", example_sentence[1])
print("==="*5)

for gram in num_of_grams:
    splitted_sentence = ngrams(example_sentence[1].split(), int(gram))
    print(f"{gram}-gram: ",end ='')
    n_grams_list = [' '.join(grams) for grams in splitted_sentence]
    print(n_grams_list)
    print()

Original sentence: This document is the second document.
1-gram: ['This', 'document', 'is', 'the', 'second', 'document.']

2-gram: ['This document', 'document is', 'is the', 'the second', 'second document.']

3-gram: ['This document is', 'document is the', 'is the second', 'the second document.']



## Extract features with n-grams

In [33]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [34]:
count_vectorize_model = CountVectorizer(ngram_range = (1, 1))
n_grams_feature_vector = count_vectorize_model.fit_transform(example_sentence).toarray()
word_frequency = pd.DataFrame(data = n_grams_feature_vector, columns = count_vectorize_model.get_feature_names_out())
word_frequency.T

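| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| and | 0 | 0 | 1 | 0 |
| document | 1 | 2 | 0 | 1 |
| first | 1 | 0 | 0 | 1 |
| is | 1 | 1 | 1 | 1 |
| one | 0 | 0 | 1 | 0 |
| second | 0 | 1 | 0 | 0 |
| the | 1 | 1 | 1 | 1 |
| third | 0 | 0 | 1 | 0 |
| this | 1 | 1 | 1 | 1 |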


In [35]:
count_vectorize_model = CountVectorizer(ngram_range = (1, 1))

n_grams_feature_vector = count_vectorize_model.fit_transform([sentence_list[5]]).toarray()

word_frequency = pd.DataFrame(data = n_grams_feature_vector, columns = count_vectorize_model.get_feature_names_out())

print('Example sentence:', sentence_list[5])
word_frequency

Example sentence: giảng viên đảm bảo thời gian lên lớp , tích cực trả lời câu hỏi của sinh viên , thường xuyên đặt câu hỏi cho sinh viên .


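| | bảo | cho | câu | của | cực | gian | giảng | hỏi | lên | lớp | lời | sinh | thường | thời | trả | tích | viên | xuyên | đảm | đặt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 |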


In [36]:
count_vectorize_model = CountVectorizer(ngram_range = (1, 2))

n_grams_feature_vector = count_vectorize_model.fit_transform(example_sentence).toarray()

word_frequency = pd.DataFrame(data = n_grams_feature_vector, columns = count_vectorize_model.get_feature_names_out())
word_frequency.T

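| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| and | 0 | 0 | 1 | 0 |
| and this | 0 | 0 | 1 | 0 |
| document | 1 | 2 | 0 | 1 |
| document is | 0 | 1 | 0 | 0 |
| first | 1 | 0 | 0 | 1 |
| first document | 1 | 0 | 0 | 1 |
| is | 1 | 1 | 1 | 1 |
| is the | 1 | 1 | 1 | 0 |
| is this | 0 | 0 | 0 | 1 |
| one | 0 | 0 | 1 | 0 |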


## Problem set 1
Based on the UIT-VSFC dataset and the information above:
- Create an $n$-gram word frequency table for any $n$ of your choice.
- With $n=1$ and $n=2$, what is the most frequent term in the dataset?
- With $n=1$ and $n=2$, what is the rarest term in the dataset?
- What are the limitations of this data processing flow? How can we overcome them?

### Retrieve all sentences within the dataset

In [37]:
from typing import List

def get_all_sentences(dataset) -> List[str]:
    """
    Collect all sentences of a subset into a list of strings.

    Args:
    dataset -- A subset (i.e., train/valid/test) of the UIT-VSFC dataset

    Returns:
    A list of all sentences in the given subset of UIT-VSFC.
    """

    list_all_sentence: List[str] = []

    ### YOUR CODE STARTS HERE
    # Each example is a dict whose 'sentence' field is already a full string
    for example in dataset:
        list_all_sentence.append(example['sentence'])

    ### YOUR CODE ENDS HERE

    return list_all_sentence

In [38]:
list_all_sentence: list = get_all_sentences(train_set)
print(f"#sentences within the dataset: {len(list_all_sentence)}")
print(f"Example sentence: {list_all_sentence[1]}")

#sentences within the dataset: 11426
Example sentence: nhiệt tình giảng dạy , gần gũi với sinh viên .


### Build the word frequency table

In [39]:
def n_gram_word_frequency(sentence_list: list,
                          n: int) -> pd.DataFrame:
    """
    Build a word frequency table based on n-grams.

    Args:
    sentence_list (list) -- The sentences used to construct the table
    n (int) -- The n-gram size passed to the vectorizer

    Returns:
    A dataframe with one row per sentence and one column per n-gram, holding the n-gram counts
    """

    ### YOUR CODE STARTS HERE

    count_vectorize_model = CountVectorizer(ngram_range = (n, n))
    n_grams_feature_vector = count_vectorize_model.fit_transform(sentence_list).toarray()
    word_frequency_table = pd.DataFrame(data = n_grams_feature_vector, columns = count_vectorize_model.get_feature_names_out())

    ### YOUR CODE ENDS HERE

    return word_frequency_table

In [None]:
# Construct the word frequency table
# 1-gram
word_frequency_table = n_gram_word_frequency(sentence_list=list_all_sentence,
                                             n=1)
print(word_frequency_table)

# Print the most frequent and the rarest word across all sentences
sum_word_frequency_table = word_frequency_table.sum(axis=0).sort_values(ascending=False)
print(f"Most frequent word: {sum_word_frequency_table.index[0]}, appears {sum_word_frequency_table.iloc[0]} times")
print(f"Rarest word: {sum_word_frequency_table.index[-1]}, appears {sum_word_frequency_table.iloc[-1]} times")

print(sum_word_frequency_table[:10])
print(sum_word_frequency_table[-10:])

# 2-gram
word_frequency_table = n_gram_word_frequency(sentence_list=list_all_sentence,
                                             n=2)
print(word_frequency_table)

# Print the most frequent and the rarest 2-gram across all sentences
sum_word_frequency_table = word_frequency_table.sum(axis=0).sort_values(ascending=False)
print(f"Most frequent 2-gram: {sum_word_frequency_table.index[0]}, appears {sum_word_frequency_table.iloc[0]} times")
print(f"Rarest 2-gram: {sum_word_frequency_table.index[-1]}, appears {sum_word_frequency_table.iloc[-1]} times")

print(sum_word_frequency_table[:10])
print(sum_word_frequency_table[-10:])

       10  100  10h  10h30  11  11doubledot55  11h30  11h55  12  \
0       0    0    0      0   0              0      0      0   0   
1       0    0    0      0   0              0      0      0   0   
2       0    0    0      0   0              0      0      0   0   
3       0    0    0      0   0              0      0      0   0   
4       0    0    0      0   0              0      0      0   0   
...    ..  ...  ...    ...  ..            ...    ...    ...  ..   
11421   0    0    0      0   0              0      0      0   0   
11422   0    0    0      0   0              0      0      0   0   
11423   0    0    0      0   0              0      0      0   0   
11424   0    0    0      0   0              0      0      0   0   
11425   0    0    0      0   0              0      0      0   0   

       12doubledot00  ...  ấy  ẩn  ắt  ốc  ồn  ổn  ủa  ủng  ức  ứng  
0                  0  ...   0   0   0   0   0   0   0    0   0    0  
1                  0  ...   0   0   0   0   0   0   0  

In [None]:
# What are the limitations of this data processing flow ? How can we overcome those ?
# 1. The dataset is not large enough to train a good model
# 2. The dataset is not balanced enough to train a good model
# 3. The dataset is not diverse enough to train a good model
# 4. The dataset is not clean enough to train a good model
# 5. The dataset is not representative enough to train a good model
# 6. The dataset is not annotated enough to train a good model
# 7. The dataset is not labeled enough to train a good model
# 8. The dataset is not structured enough to train a good model


You should comment your answer to problem 1 here with sufficient explanations, including your implementation and reasoning.

- Stopwords add noise to the results
- Synonyms cannot be distinguished
- Spelling errors are not handled
- Context is not taken into account

## Stopwords

In [21]:
# Retrieve the stopword dictionary
import wget

url = "https://raw.githubusercontent.com/stopwords/vietnamese-stopwords/master/vietnamese-stopwords.txt"
filename = wget.download(url)
print(f"\nDownloaded file: {filename}")

Downloaded file: vietnamese-stopwords (1).txt


In [22]:
# Observe stopwords list
with open('vietnamese-stopwords.txt', 'r', encoding='utf-8') as f:
    vietnamese_stopword = f.read().split('\n') # One stop word (or phrase) per line
print(f"#Number of stop words: {len(vietnamese_stopword)}")

#Number of stop words: 1942


In [23]:
# Stop words example
for sentence in vietnamese_stopword[:10]:
    print(sentence)

a lô
a ha
ai
ai ai
ai nấy
ai đó
alô
amen
anh
anh ấy
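
One way such a list could be used is to drop stop words before counting. The sketch below is a minimal illustration under stated assumptions: the three-word `stopwords` set is a hypothetical stand-in for `vietnamese_stopword`, the helper name `count_without_stopwords` is made up for this example, and only single-token stop words are handled (multi-word entries such as `a lô` would need phrase matching).

```python
from collections import Counter

# Hypothetical mini stop list; in the notebook it would be set(vietnamese_stopword)
stopwords = {"của", "cho", "và"}


def count_without_stopwords(sentences, stopwords):
    """Count whitespace-separated tokens, skipping any token in the stop list."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tok for tok in sentence.lower().split() if tok not in stopwords)
    return counts


sample = [
    "giảng viên trả lời câu hỏi của sinh viên",
    "đặt câu hỏi cho sinh viên",
]
counts = count_without_stopwords(sample, stopwords)
print(counts["viên"], "của" in counts)
```

Converting the list to a `set` keeps each membership check constant-time, which matters when the stop list has close to two thousand entries.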


## Term frequency - Inverse document frequency (TF-IDF)


### TF
Term frequency (TF) measures how often a given term appears in a document, relative to the length of that document:

$$
tf(t, d) = \frac{f(t,d)}{T}
$$
where $f(t,d)$ is the number of occurrences of the term $t$ in the document $d$, and $T$ is the total number of words in that document.
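
The formula above can be checked by hand with a few lines of standard-library Python. This is only a sketch: the helper name `term_frequency` is hypothetical, and the tokenization (lowercasing, keeping tokens of two or more word characters) is an assumption chosen to resemble the vectorizer's default behavior.

```python
import re
from collections import Counter
from typing import Dict


def term_frequency(document: str) -> Dict[str, float]:
    """Relative frequency f(t, d) / T of each term in one document."""
    # Assumption: lowercase, then keep tokens of two or more word characters
    tokens = re.findall(r"\b\w\w+\b", document.lower())
    total = len(tokens)  # T, the number of words in the document
    return {term: count / total for term, count in Counter(tokens).items()}


tf = term_frequency("This document is the second document.")
print(tf["document"])  # 2 occurrences out of 6 tokens
```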

In [24]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Declare the TF vectorizer
tf_vectorizer = TfidfVectorizer(ngram_range=(1, 1),
                                use_idf=False, # only using TF
                                norm='l1')     # divide each count by the document length

# Fit the vocabulary and vectorize the corpus in one step
tf_vectorized = tf_vectorizer.fit_transform(corpus)

# Keep the first document only
tf_output = tf_vectorized[0]

# Build TF table
words_tf_idf = pd.DataFrame(tf_output.T.todense(), index=tf_vectorizer.get_feature_names_out(), columns=['tf'])
words_tf_idf

| | tf |
|---|---|
| and | 0.0 |
| document | 0.2 |
| first | 0.2 |
| is | 0.2 |
| one | 0.0 |
| second | 0.0 |
| the | 0.2 |
| third | 0.0 |
| this | 0.2 |


### IDF

Inverse document frequency (IDF) measures how informative a term is. When computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times yet carry little meaning. We therefore weigh down the frequent terms and scale up the rare ones:

$$
idf(t) = \log\left(\frac{N}{df(t)}\right) + 1
$$
where $N$ is the number of documents in the document set, $df(t)$ is the number of documents that contain the term $t$, and $\log$ is the natural logarithm.
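
As a worked check of the formula above, the sketch below computes IDF by hand for the four-sentence toy corpus used in this notebook. The helper name `inverse_document_frequency` is hypothetical, and the tokenization rule is an assumption meant to resemble the vectorizer's default.

```python
import math
import re
from typing import Dict, List


def inverse_document_frequency(corpus: List[str]) -> Dict[str, float]:
    """idf(t) = ln(N / df(t)) + 1 for every term t in the corpus."""
    # Assumption: lowercase, then keep tokens of two or more word characters
    tokenized = [set(re.findall(r"\b\w\w+\b", doc.lower())) for doc in corpus]
    n_docs = len(tokenized)
    vocabulary = set().union(*tokenized)
    return {
        term: math.log(n_docs / sum(term in doc for doc in tokenized)) + 1
        for term in vocabulary
    }


toy_corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
idf_values = inverse_document_frequency(toy_corpus)
print(round(idf_values["document"], 6), round(idf_values["and"], 6))
```

A term that occurs in every document, such as `the`, gets the minimum value of 1, while a term found in a single document gets the maximum.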

In [25]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Configure settings for IDF vectorize
tf_idf_vectorizer = TfidfVectorizer(ngram_range=(1, 1),
                                    smooth_idf=False,
                                    use_idf=True,
                                    norm=None)

tf_idf_vectorizer.fit_transform(corpus)

# Retrieve only idf information
idf_vectorizer = tf_idf_vectorizer.idf_

# Join idf values into the previous dataframe
words_tf_idf['idf'] = idf_vectorizer

# Show dataframe with ascending values of idf
words_tf_idf.sort_values(by=['idf'])

| | tf | idf |
|---|---|---|
| is | 0.2 | 1.0 |
| the | 0.2 | 1.0 |
| this | 0.2 | 1.0 |
| document | 0.2 | 1.287682 |
| first | 0.2 | 1.693147 |
| and | 0.0 | 2.386294 |
| second | 0.0 | 2.386294 |
| one | 0.0 | 2.386294 |
| third | 0.0 | 2.386294 |


### TF-IDF

TF-IDF is a score assigned to every word in every document of the dataset. For a given word, the score increases with each occurrence of the word in the document, and decreases as the word appears in more documents of the corpus.

$$
\text{tf-idf}(t, d) = tf(t, d) \times idf(t)
$$

In [26]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

tf_idf_vectorizer = TfidfVectorizer(ngram_range=(1, 1),
                                    smooth_idf=False,
                                    use_idf=True,
                                    norm='l1')

tf_idf_vectorizer.fit_transform(corpus)

tf_idf_vectorized = tf_idf_vectorizer.transform(corpus)

tf_idf_output = tf_idf_vectorized[0]
words_tf_idf['tf-idf'] = tf_idf_output.T.todense()

words_tf_idf.sort_values(by=['tf-idf'])

| | tf | idf | tf-idf |
|---|---|---|---|
| and | 0.0 | 2.386294 | 0.0 |
| third | 0.0 | 2.386294 | 0.0 |
| second | 0.0 | 2.386294 | 0.0 |
| one | 0.0 | 2.386294 | 0.0 |
| is | 0.2 | 1.0 | 0.167201 |
| the | 0.2 | 1.0 | 0.167201 |
| this | 0.2 | 1.0 | 0.167201 |
| document | 0.2 | 1.287682 | 0.215302 |
| first | 0.2 | 1.693147 | 0.283096 |
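
Note that the tf-idf column above is not simply the tf column times the idf column: with `norm='l1'`, the weights of each document are rescaled after the multiplication so that they sum to 1. The sketch below reproduces the first document's scores by hand under that reading. The helper names `tokenize`, `idf` and `tfidf_l1` are hypothetical, and the tokenization rule is an assumption meant to resemble the vectorizer's default.

```python
import math
import re
from collections import Counter

toy_corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


def tokenize(document):
    # Assumption: lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", document.lower())


tokenized = [tokenize(doc) for doc in toy_corpus]
n_docs = len(tokenized)


def idf(term):
    # Unsmoothed IDF: ln(N / df(t)) + 1
    doc_freq = sum(term in tokens for tokens in tokenized)
    return math.log(n_docs / doc_freq) + 1


def tfidf_l1(tokens):
    # Raw count times IDF, then rescale so the document's weights sum to 1
    raw = {term: count * idf(term) for term, count in Counter(tokens).items()}
    total = sum(raw.values())
    return {term: weight / total for term, weight in raw.items()}


scores = tfidf_l1(tokenized[0])
print({term: round(score, 6) for term, score in scores.items()})
```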


## Problem set 2
Based on problem 1 and the instructions on TF, IDF, and TF-IDF:
- (2a) Build the tf-idf table for the UIT-VSFC dataset with $n$-gram = 1 and $n$-gram = 2.
- (2b) Change a few hyperparameters of `TfidfVectorizer` (`smooth_idf`, `sublinear_tf` and `norm`) from problem 2a (*you can browse this [link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to find the correct parameters to pass*). Explain how the results differ after modifying the hyperparameters.
- (2c) Which words have the lowest and the highest tf-idf values? Do they differ from the $n$-gram results?
- (2d) Which limitations of $n$-grams does TF-IDF overcome?
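
As background for (2b), the sketch below spells out what the three hyperparameters do to a single weight, following the formulas given in the scikit-learn documentation as I understand them. The function names `idf`, `scale_tf` and `normalize` are hypothetical helpers, not scikit-learn API.

```python
import math


def idf(n_docs, doc_freq, smooth_idf):
    # smooth_idf adds 1 to numerator and denominator, as if one extra
    # document contained every term
    if smooth_idf:
        return math.log((1 + n_docs) / (1 + doc_freq)) + 1
    return math.log(n_docs / doc_freq) + 1


def scale_tf(count, sublinear_tf):
    # sublinear_tf replaces a raw count tf with 1 + ln(tf)
    if sublinear_tf and count > 0:
        return 1 + math.log(count)
    return float(count)


def normalize(weights, norm):
    # 'l1' makes absolute values sum to 1; 'l2' makes squares sum to 1
    if norm == "l1":
        size = sum(abs(w) for w in weights)
    elif norm == "l2":
        size = math.sqrt(sum(w * w for w in weights))
    else:
        return list(weights)
    return [w / size for w in weights] if size else list(weights)


print(idf(4, 3, smooth_idf=False), idf(4, 3, smooth_idf=True))
print(scale_tf(2, sublinear_tf=False), scale_tf(2, sublinear_tf=True))
print(normalize([3.0, 4.0], "l1"), normalize([3.0, 4.0], "l2"))
```

Smoothing slightly lowers the IDF of every term, and sublinear scaling dampens the effect of a term repeated many times in one sentence.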

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

# (2a) Build the tf-idf table for the UIT-VSFC dataset with $n$-gram = 1 and $n$-gram = 2
# Note: ngram_range=(1, 2) keeps both unigrams and bigrams; use (2, 2) for bigrams only
tfidf_vectorizer_1gram = TfidfVectorizer(ngram_range=(1, 1))
tfidf_vectorizer_2gram = TfidfVectorizer(ngram_range=(1, 2))

tfidf_1gram = tfidf_vectorizer_1gram.fit_transform(list_all_sentence)
tfidf_2gram = tfidf_vectorizer_2gram.fit_transform(list_all_sentence)

tfidf_1gram_df = pd.DataFrame(tfidf_1gram.toarray(), columns=tfidf_vectorizer_1gram.get_feature_names_out())
tfidf_2gram_df = pd.DataFrame(tfidf_2gram.toarray(), columns=tfidf_vectorizer_2gram.get_feature_names_out())

print("TF-IDF with 1-gram:")
print(tfidf_1gram_df)
print("\nTF-IDF with 2-gram:")
print(tfidf_2gram_df)

TF-IDF with 1-gram:
        10  100  10h  10h30   11  11doubledot55  11h30  11h55   12  \
0      0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0   
1      0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0   
2      0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0   
3      0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0   
4      0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0   
...    ...  ...  ...    ...  ...            ...    ...    ...  ...   
11421  0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0   
11422  0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0   
11423  0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0   
11424  0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0   
11425  0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0   

       12doubledot00  ...   ấy   ẩn   ắt   ốc   ồn   ổn   ủa  ủng   ức  ứng  
0                0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0 

In [35]:
# (2b) Change a few hyperparameters in the TfidfVectorizer function
# 1-gram
tfidf_vectorizer_1gram = TfidfVectorizer(ngram_range=(1, 1), smooth_idf=False, sublinear_tf=True, norm='l1')
tfidf_1gram = tfidf_vectorizer_1gram.fit_transform(list_all_sentence)
tfidf_1gram_df = pd.DataFrame(tfidf_1gram.toarray(), columns=tfidf_vectorizer_1gram.get_feature_names_out())
print("TF-IDF with 1-gram (sublinear_tf=True, smooth_idf=False, norm='l1'):")
print(tfidf_1gram_df)
# 2-gram (ngram_range=(1, 2) keeps unigrams alongside bigrams)
tfidf_vectorizer_2gram = TfidfVectorizer(ngram_range=(1, 2), smooth_idf=False, sublinear_tf=True, norm='l1')
tfidf_2gram = tfidf_vectorizer_2gram.fit_transform(list_all_sentence)
tfidf_2gram_df = pd.DataFrame(tfidf_2gram.toarray(), columns=tfidf_vectorizer_2gram.get_feature_names_out())
print("\nTF-IDF with 2-gram (sublinear_tf=True, smooth_idf=False, norm='l1'):")
print(tfidf_2gram_df)

TF-IDF with 1-gram (sublinear_tf=True, smooth_idf=False, norm='l1'):
        10  100  10h  10h30   11  11doubledot55  11h30  11h55   12  \
0      0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0   
1      0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0   
2      0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0   
3      0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0   
4      0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0   
...    ...  ...  ...    ...  ...            ...    ...    ...  ...   
11421  0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0   
11422  0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0   
11423  0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0   
11424  0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0   
11425  0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0   

       12doubledot00  ...   ấy   ẩn   ắt   ốc   ồn   ổn   ủa  ủng   ức  ứng  
0           

In [37]:
# (2c) Find words with the lowest and highest tf-idf values
# Note: the column-wise minimum is 0.0 for every term that is absent from at least one sentence,
# so the "lowest" ranking cannot separate terms; the maximum is the largest per-sentence score.
# 1-gram
lowest_1gram = tfidf_1gram_df.min(axis=0).sort_values(ascending=True)
highest_1gram = tfidf_1gram_df.max(axis=0).sort_values(ascending=False)
print("Lowest 1-gram TF-IDF values:")
print(lowest_1gram)
print("\nHighest 1-gram TF-IDF values:")
print(highest_1gram)
# 2-gram
lowest_2gram = tfidf_2gram_df.min(axis=0).sort_values(ascending=True)
highest_2gram = tfidf_2gram_df.max(axis=0).sort_values(ascending=False)
print("\nLowest 2-gram TF-IDF values:")
print(lowest_2gram)
print("\nHighest 2-gram TF-IDF values:")
print(highest_2gram)

Lowest 1-gram TF-IDF values:
ấm     0.0
ảo     0.0
ảnh    0.0
ướt    0.0
ước    0.0
      ... 
ủa     0.0
ủng    0.0
ức     0.0
ứng    0.0
10     0.0
Length: 2459, dtype: float64

Highest 1-gram TF-IDF values:
dễ       1.000000
chán     1.000000
không    1.000000
xinh     1.000000
hết      1.000000
           ...   
tá       0.023056
lõng     0.016674
ùn       0.016674
vạ       0.016674
gạo      0.016674
Length: 2459, dtype: float64

Lowest 2-gram TF-IDF values:
ứng đúng     0.0
ứng đáp      0.0
ứng yêu      0.0
ứng tốt      0.0
ứng nhưng    0.0
            ... 
100 là       0.0
100 cách     0.0
ứng đầy      0.0
ứng đủ       0.0
10           0.0
Length: 33843, dtype: float64

Highest 2-gram TF-IDF values:
tệ            1.000000
tốt           1.000000
giỏi          1.000000
everything    1.000000
dễ            1.000000
                ...   
gạo một       0.005771
tới bị        0.005771
lõng          0.005771
lõng mơ       0.005771
cả tên        0.005771
Length: 33843, dtype: float64


In [None]:
# (2d) Which limitations of $n$-grams does TF-IDF overcome?
# TF-IDF overcomes a limitation of raw n-gram counts by weighting each term by its importance across the entire corpus.
# It assigns higher weights to terms that are informative and rare across the corpus, thus reducing the impact of common words that may not carry significant meaning.
# TF-IDF also helps to mitigate the issue of high-dimensionality associated with n


In [30]:
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Load the UIT-VSFC data
dataset = load_dataset("uitnlp/vietnamese_students_feedback")
sentences = [data["sentence"].lower() for data in dataset["train"]]  # Convert to lowercase

# Compute TF-IDF
def compute_tfidf(sentences, ngram_range):
    vectorizer = TfidfVectorizer(ngram_range=ngram_range)  # n-gram (1,1) or (2,2)
    tfidf_matrix = vectorizer.fit_transform(sentences)  # TF-IDF matrix
    feature_names = vectorizer.get_feature_names_out()  # List of words/n-grams
    return pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)  # Convert to a DataFrame

# Build the TF-IDF tables for n=1 (unigram) and n=2 (bigram)
tfidf_unigram = compute_tfidf(sentences, (1,1))
tfidf_bigram = compute_tfidf(sentences, (2,2))

# Show part of the TF-IDF tables
print("TF-IDF with n=1 (Unigram):")
print(tfidf_unigram.head())

print("\nTF-IDF with n=2 (Bigram):")
print(tfidf_bigram.head())


TF-IDF with n=1 (Unigram):
    10  100  10h  10h30   11  11doubledot55  11h30  11h55   12  12doubledot00  \
0  0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0            0.0   
1  0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0            0.0   
2  0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0            0.0   
3  0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0            0.0   
4  0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0            0.0   

   ...   ấy   ẩn   ắt   ốc   ồn   ổn   ủa  ủng   ức  ứng  
0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
1  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
2  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
3  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
4  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  

[5 rows x 2459 columns]

TF-IDF with n=2 (Bigram):
   10 50  10 bài  10 fraction  10 kiến  10 luôn  10 mấy  10 mới  10 người  \
0    

In [32]:
# Experiment with different hyperparameters
vectorizer_modified = TfidfVectorizer(ngram_range=(1,1), smooth_idf=False, sublinear_tf=True, norm="l2")
tfidf_matrix_modified = vectorizer_modified.fit_transform(sentences)
feature_names_modified = vectorizer_modified.get_feature_names_out()

# Convert the result to a DataFrame
tfidf_df_modified = pd.DataFrame(tfidf_matrix_modified.toarray(), columns=feature_names_modified)

print("TF-IDF after changing the parameters:")
print(tfidf_df_modified.head())


TF-IDF after changing the parameters:
    10  100  10h  10h30   11  11doubledot55  11h30  11h55   12  12doubledot00  \
0  0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0            0.0   
1  0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0            0.0   
2  0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0            0.0   
3  0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0            0.0   
4  0.0  0.0  0.0    0.0  0.0            0.0    0.0    0.0  0.0            0.0   

   ...   ấy   ẩn   ắt   ốc   ồn   ổn   ủa  ủng   ức  ứng  
0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
1  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
2  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
3  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  
4  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  

[5 rows x 2459 columns]


In [33]:
# Sum the TF-IDF values of each word/n-gram over all sentences
unigram_tfidf_sum = tfidf_unigram.sum().sort_values(ascending=False)
bigram_tfidf_sum = tfidf_bigram.sum().sort_values(ascending=False)

# Find the terms with the highest and the lowest summed TF-IDF
most_important_unigram = unigram_tfidf_sum.idxmax(), unigram_tfidf_sum.max()
least_important_unigram = unigram_tfidf_sum.idxmin(), unigram_tfidf_sum.min()

most_important_bigram = bigram_tfidf_sum.idxmax(), bigram_tfidf_sum.max()
least_important_bigram = bigram_tfidf_sum.idxmin(), bigram_tfidf_sum.min()

print("Term with the highest TF-IDF (Unigram):", most_important_unigram)
print("Term with the lowest TF-IDF (Unigram):", least_important_unigram)

print("\nTerm with the highest TF-IDF (Bigram):", most_important_bigram)
print("Term with the lowest TF-IDF (Bigram):", least_important_bigram)


Term with the highest TF-IDF (Unigram): ('viên', np.float64(659.3241372203528))
Term with the lowest TF-IDF (Unigram): ('vạ', np.float64(0.120081257215534))

Term with the highest TF-IDF (Bigram): ('nhiệt tình', np.float64(327.39125586167046))
Term with the lowest TF-IDF (Bigram): ('lạc lõng', np.float64(0.0952086904461525))
Từ có TF-IDF thấp nhất (Bigram): ('lạc lõng', np.float64(0.0952086904461525))
