# Dataset

## Download dataset
The Vietnamese Students' Feedback Corpus (UIT-VSFC) is a resource consisting of over 16,000 sentences that are human-annotated for two tasks: sentiment-based and topic-based classification.

[1] Kiet Van Nguyen, Vu Duc Nguyen, Phu Xuan-Vinh Nguyen, Tham Thi-Hong Truong, Ngan Luu-Thuy Nguyen, UIT-VSFC: Vietnamese Students' Feedback Corpus for Sentiment Analysis, 2018 10th International Conference on Knowledge and Systems Engineering (KSE 2018), November 1-3, 2018, Ho Chi Minh City, Vietnam.

In [None]:
!pip install datasets

Defaulting to user installation because normal site-packages is not writeable


In [3]:
from datasets import load_dataset

dataset = load_dataset("uitnlp/vietnamese_students_feedback")

## Interacting with the downloaded data

In [4]:
train_set = dataset['train']
train_set

Dataset({
    features: ['sentence', 'sentiment', 'topic'],
    num_rows: 11426
})

In [5]:
train_set[0]

{'sentence': 'slide giáo trình đầy đủ .', 'sentiment': 2, 'topic': 1}

In [6]:
len(train_set)

11426

## Split a sentence

In [7]:
# Read a sentence
example_word_list = train_set[0]['sentence']
example_word_list

'slide giáo trình đầy đủ .'

In [8]:
# Split sentence word-by-word
example_word_list.split()

['slide', 'giáo', 'trình', 'đầy', 'đủ', '.']

In [9]:
# Join the words back into 1 full sentence
# (the sample is a string, so split it into words first, then join them with spaces)
sentence = " ".join(example_word_list.split())
sentence

'slide giáo trình đầy đủ .'

In [10]:
# Get 10 sentences to process
# (each sample already stores its sentence as one string)
sentence_list = [train_set[idx]['sentence'] for idx in range(10)]
sentence_list

['slide giáo trình đầy đủ .',
 'nhiệt tình giảng dạy , gần gũi với sinh viên .',
 'đi học đầy đủ full điểm chuyên cần .',
 'chưa áp dụng công nghệ thông tin và các thiết bị hỗ trợ cho việc giảng dạy .',
 'thầy giảng bài hay , có nhiều bài tập ví dụ ngay trên lớp .',
 'giảng viên đảm bảo thời gian lên lớp , tích cực trả lời câu hỏi của sinh viên , thường xuyên đặt câu hỏi cho sinh viên .',
 'em sẽ nợ môn này , nhưng em sẽ học lại ở các học kỳ kế tiếp .',
 'thời lượng học quá dài , không đảm bảo tiếp thu hiệu quả .',
 'nội dung môn học có phần thiếu trọng tâm , hầu như là chung chung , khái quát khiến sinh viên rất khó nắm được nội dung môn học .',
 'cần nói rõ hơn bằng cách trình bày lên bảng thay vì nhìn vào slide .']

# Text processing

## N-grams
- N-grams are contiguous sequences of $n$ items (words, symbols, or tokens) in a document. In technical terms, they are the runs of neighboring items in a document.
- We can build n-grams, and apply many other text preprocessing algorithms, with the [`nltk`](https://www.nltk.org/) library.
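
To make the definition concrete, here is a minimal sketch that builds word n-grams with plain Python, without `nltk`. The helper name `word_ngrams` is hypothetical, and whitespace tokenization is an assumption made for illustration.

```python
from typing import List


def word_ngrams(sentence: str, n: int) -> List[str]:
    """Return all contiguous runs of n whitespace-separated tokens."""
    tokens = sentence.split()
    # A sentence with k tokens has k - n + 1 n-grams (none if k < n).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = word_ngrams("This document is the second document.", 2)
print(bigrams)
```

Sliding a window of width `n` over the token list is all an n-gram extractor needs to do; punctuation stays attached to words here because only whitespace is used to split.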

In [11]:
example_sentence = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

In [12]:
from nltk import ngrams
import numpy as np

num_of_grams = np.arange(1, 4, 1) # Test 3 n-grams

print("Original sentence:", example_sentence[1])
print("==="*5)

for gram in num_of_grams:
    splitted_sentence = ngrams(example_sentence[1].split(), int(gram))
    print(f"{gram}-gram: ",end ='')
    n_grams_list = [' '.join(grams) for grams in splitted_sentence]
    print(n_grams_list)
    print()

Original sentence: This document is the second document.
1-gram: ['This', 'document', 'is', 'the', 'second', 'document.']

2-gram: ['This document', 'document is', 'is the', 'the second', 'second document.']

3-gram: ['This document is', 'document is the', 'is the second', 'the second document.']



## Extract features with n-grams

In [13]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [14]:
count_vectorize_model = CountVectorizer(ngram_range = (1, 1))
n_grams_feature_vector = count_vectorize_model.fit_transform(example_sentence).toarray()
word_frequency = pd.DataFrame(data = n_grams_feature_vector, columns = count_vectorize_model.get_feature_names_out())
word_frequency.T

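| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| and | 0 | 0 | 1 | 0 |
| document | 1 | 2 | 0 | 1 |
| first | 1 | 0 | 0 | 1 |
| is | 1 | 1 | 1 | 1 |
| one | 0 | 0 | 1 | 0 |
| second | 0 | 1 | 0 | 0 |
| the | 1 | 1 | 1 | 1 |
| third | 0 | 0 | 1 | 0 |
| this | 1 | 1 | 1 | 1 |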
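
The counts for a single sentence in this table can be reproduced by hand. The sketch below assumes `CountVectorizer`'s default settings (lowercasing, and a token pattern that keeps only tokens of two or more word characters); the helper `count_tokens` is a hypothetical name, not part of scikit-learn.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default `token_pattern`.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_tokens(sentence: str) -> Counter:
    """Lowercase the sentence, tokenize it, and count each token."""
    return Counter(TOKEN_PATTERN.findall(sentence.lower()))


# Second example sentence (column 1 of the table)
counts = count_tokens("This document is the second document.")
print(sorted(counts.items()))
```

Under these assumptions, punctuation marks and single-character tokens never become columns, which is worth keeping in mind when reading the frequency tables for the Vietnamese data.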


In [15]:
count_vectorize_model = CountVectorizer(ngram_range = (1, 1))

n_grams_feature_vector = count_vectorize_model.fit_transform([sentence_list[5]]).toarray()

word_frequency = pd.DataFrame(data = n_grams_feature_vector, columns = count_vectorize_model.get_feature_names_out())

print('Example sentence:', sentence_list[5])
word_frequency

Example sentence: giảng viên đảm bảo thời gian lên lớp , tích cực trả lời câu hỏi của sinh viên , thường xuyên đặt câu hỏi cho sinh viên .


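| | bảo | cho | câu | của | cực | gian | giảng | hỏi | lên | lớp | lời | sinh | thường | thời | trả | tích | viên | xuyên | đảm | đặt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 |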


In [16]:
count_vectorize_model = CountVectorizer(ngram_range = (1, 2))

n_grams_feature_vector = count_vectorize_model.fit_transform(example_sentence).toarray()

word_frequency = pd.DataFrame(data = n_grams_feature_vector, columns = count_vectorize_model.get_feature_names_out())
word_frequency.T

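| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| and | 0 | 0 | 1 | 0 |
| and this | 0 | 0 | 1 | 0 |
| document | 1 | 2 | 0 | 1 |
| document is | 0 | 1 | 0 | 0 |
| first | 1 | 0 | 0 | 1 |
| first document | 1 | 0 | 0 | 1 |
| is | 1 | 1 | 1 | 1 |
| is the | 1 | 1 | 1 | 0 |
| is this | 0 | 0 | 0 | 1 |
| one | 0 | 0 | 1 | 0 |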


## Problem set 1
Based on the UIT-VSFC dataset and the material above:
- Create an $n$-gram word frequency table, where $n$ can be any value you choose.
- With $n=1$ and $n=2$, what is the most frequent word in the dataset?
- With $n=1$ and $n=2$, what is the rarest word in the dataset?
- What are the limitations of this data processing flow? How can we overcome them?


### Retrieve all sentences within the dataset

In [17]:
from typing import List

def get_all_sentences(dataset) -> List[str]:
    """
    Function to get all sentences and store them into a list of strings

    Args:
    dataset -- The subset (i.e., train/valid/test) in UIT-VSFC dataset

    Returns:
    A list of all sentences in a subset data of the UIT-VSFC.
    """

    list_all_sentence: list = []

    ### YOUR CODE STARTS HERE
    for idx in range(len(dataset)):
        # Each sample stores its sentence as a single string, so append it directly
        list_all_sentence.append(dataset[idx]['sentence'])

    ### YOUR CODE ENDS HERE

    return list_all_sentence

In [18]:
list_all_sentence: list = get_all_sentences(train_set)
print(f"#sentences within the dataset: {len(list_all_sentence)}")
print(f"Example sentence: {list_all_sentence[0]}")

#sentences within the dataset: 11426
Example sentence: slide giáo trình đầy đủ .


### Build the word frequency table

In [19]:
def n_gram_word_frequency(sentence_list: list,
                          n: int) -> pd.DataFrame:
    """
    Build a word frequency table based on n-grams.

    Args:
    sentence_list (list) -- All sentences used to build the table
    n (int) -- The n-gram size passed to the vectorizer

    Returns:
    A DataFrame with one column per n-gram and one row per sentence, holding the n-gram counts
    """

    ### YOUR CODE STARTS HERE

    count_vectorize_model = CountVectorizer(ngram_range = (n, n))
    n_grams_feature_vector = count_vectorize_model.fit_transform(sentence_list).toarray()
    word_frequency_table = pd.DataFrame(data = n_grams_feature_vector, columns = count_vectorize_model.get_feature_names_out())

    ### YOUR CODE ENDS HERE

    return word_frequency_table

In [20]:
# Construct the table of word frequency
# 1-gram
word_frequency_table_1ngram = n_gram_word_frequency(sentence_list=list_all_sentence,
                                             n=1)
word_frequency_1ngram = word_frequency_table_1ngram.sum(axis=0).sort_values(ascending=False).to_frame('Frequency')

print(word_frequency_1ngram[:10])
print("Most frequent word:\n", word_frequency_1ngram.index[0])
print(word_frequency_1ngram[-10:])
print("Rarest word:\n", word_frequency_1ngram.index[-1])

       Frequency
viên        4803
giảng       3711
dạy         3156
thầy        3095
sinh        3082
học         2940
bài         2336
tình        2266
không       2177
và          2068
Most frequent word:
 viên
               Frequency
dọa                    1
đếm                    1
đế                     1
gán                    1
ấm                     1
ướt                    1
ức                     1
đống                   1
đốn                    1
11doubledot55          1
Rarest word:
 11doubledot55


In [21]:
# 2-gram
word_frequency_table_2ngram = n_gram_word_frequency(sentence_list=list_all_sentence,
                                             n=2)
word_frequency_2ngram = word_frequency_table_2ngram.sum(axis=0).sort_values(ascending=False).to_frame('Frequency')

print(word_frequency_2ngram[:10])
print("Most frequent word:\n", word_frequency_2ngram.index[0])
print(word_frequency_2ngram[-10:])
print("Rarest word:\n", word_frequency_2ngram.index[-1])

            Frequency
sinh viên        2698
nhiệt tình       1848
giảng viên       1610
bài tập          1057
dễ hiểu          1004
giảng dạy         956
kiến thức         904
thực hành         877
môn học           688
cho sinh          656
Most frequent word:
 sinh viên
                  Frequency
100 cách                  1
100 là                    1
100 người                 1
100 tự                    1
10h mới                   1
10h30 nhưng               1
11 thì                    1
11doubledot55 pm          1
11h30 nghỉ                1
trên google               1
Rarest word:
 trên google


## You should comment your answer to problem 1 here with sufficient explanations, including your implementation and reasoning.

- With n = 1, the most frequent word is "viên", which appears 4,803 times, and the rarest word reported is "11doubledot55", which appears once. However, many other words also appear only once, e.g. "đốn", "đống", "ức".
- With n = 2, the most frequent term is "sinh viên", which appears 2,698 times, and the rarest term reported is "trên google", which appears once. Again, many other terms appear only once, e.g. "11h30 nghỉ", "11doubledot55 pm", "11 thì".
- One limitation of this data processing flow is that stopwords account for a large share of the counts, which reduces the apparent importance of the remaining words even though those are the ones we need.
- Misspelling is also a problem because it spreads the frequency of a word over several variants. E.g., "sinhviên" and "sinh viên" are counted separately, as if they had different meanings.
- The computational cost becomes large when $n$ is large, and the table takes much longer to compute.
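
One way to soften the cost issue in the last point is to accumulate n-gram totals directly instead of building a dense sentences × vocabulary table. The following is a rough sketch using only the standard library; `count_ngrams` is a hypothetical helper, and its whitespace tokenization differs from `CountVectorizer`'s, so its counts would not match the tables above exactly.

```python
from collections import Counter
from typing import Iterable


def count_ngrams(sentences: Iterable[str], n: int) -> Counter:
    """Accumulate corpus-wide n-gram counts one sentence at a time."""
    totals: Counter = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        totals.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return totals


# Two sentences taken from the training set shown earlier
toy_corpus = [
    "slide giáo trình đầy đủ .",
    "nhiệt tình giảng dạy , gần gũi với sinh viên .",
]
unigram_totals = count_ngrams(toy_corpus, 1)
bigram_totals = count_ngrams(toy_corpus, 2)
print(unigram_totals.most_common(3))
print(bigram_totals["sinh viên"])
```

Memory use here scales with the number of distinct n-grams rather than with sentences × vocabulary, at the price of losing the per-sentence breakdown.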

## Stopwords

In [22]:
# Retrieve the stopword dictionary
# (urllib replaces the `wget` shell command, which is not available on every platform)
from urllib.request import urlretrieve

stopword_url = "https://raw.githubusercontent.com/stopwords/vietnamese-stopwords/master/vietnamese-stopwords.txt"
_ = urlretrieve(stopword_url, "vietnamese-stopwords.txt")


In [24]:
# Observe stopwords list
vietnamese_stopword = open('vietnamese-stopwords.txt', 'r', encoding='utf-8').read()
vietnamese_stopword = vietnamese_stopword.split('\n') # Separate lines by lines
print(f"#Number of stop words: {len(vietnamese_stopword)}")

#Number of stop words: 1942


In [25]:
# Stop words example
for sentence in vietnamese_stopword[:10]:
    print(sentence)

a lô
a ha
ai
ai ai
ai nấy
ai đó
alô
amen
anh
anh ấy


## Term frequency - Inverse document frequency (TF-IDF)


### TF
Term frequency (TF) measures how often a given term appears in a document, relative to the length of that document:

$$
tf(t, d) = \frac{f(t,d)}{T}
$$
where $f(t,d)$ is the number of times the term $t$ occurs in the document $d$, and $T$ is the total number of terms in that document.
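
As a quick check of the formula, the sketch below computes TF by hand for the sentence "This is the first document." Lowercasing and stripping the final period are simplifying assumptions for this illustration.

```python
from collections import Counter

tokens = "This is the first document.".lower().rstrip(".").split()
total_terms = len(tokens)  # T in the formula

# f(t, d) / T for every term t that occurs in the document
tf = {term: count / total_terms for term, count in Counter(tokens).items()}
print(tf)
```

Every term occurs once in a five-term sentence, so each gets a TF of 0.2.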

In [26]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Declare the TF vectorizer
tf_vectorizer = TfidfVectorizer(ngram_range=(1, 1),
                                use_idf=False, # TF only, no IDF weighting
                                norm='l1')     # divide counts by the document length

# Fit on the corpus and vectorize it in one step
tf_vectorized = tf_vectorizer.fit_transform(corpus)

# Keep the vector of the first document
tf_output = tf_vectorized[0]

# Build TF table
words_tf_idf = pd.DataFrame(tf_output.T.todense(), index=tf_vectorizer.get_feature_names_out(), columns=['tf'])
words_tf_idf

| | tf |
|---|---|
| and | 0.0 |
| document | 0.2 |
| first | 0.2 |
| is | 0.2 |
| one | 0.0 |
| second | 0.0 |
| the | 0.2 |
| third | 0.0 |
| this | 0.2 |


### IDF

Inverse document frequency (IDF) measures how important a term is. When computing TF, all terms are treated as equally important. However, certain terms such as "is", "of", and "that" may appear many times yet carry little information. We therefore need to weigh down the frequent terms while scaling up the rare ones.

$$
idf(t) = \log\left(\frac{N}{df(t)}\right) + 1
$$
where $N$ is the number of documents in the document set and $df(t)$ is the number of documents that contain the term $t$.
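
For instance, in a four-document corpus, a term found in three documents gets an IDF of about 1.29, while a term found in only one document gets about 2.39. Below is a small sketch that takes the logarithm as the natural log, an assumption that matches scikit-learn's unsmoothed IDF; the function name `idf` is chosen for illustration.

```python
import math


def idf(num_documents: int, documents_with_term: int) -> float:
    """Unsmoothed inverse document frequency: ln(N / df) + 1."""
    return math.log(num_documents / documents_with_term) + 1


print(round(idf(4, 4), 6))  # term present in every document
print(round(idf(4, 3), 6))  # term present in three documents
print(round(idf(4, 1), 6))  # term present in a single document
```

The trailing `+ 1` keeps the score of a term that occurs in every document at 1 instead of 0, so such terms are down-weighted rather than ignored.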

In [27]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Configure settings for IDF vectorize
tf_idf_vectorizer = TfidfVectorizer(ngram_range=(1, 1),
                                    smooth_idf=False,
                                    use_idf=True,
                                    norm=None)

tf_idf_vectorizer.fit_transform(corpus)

# Retrieve only idf information
idf_vectorizer = tf_idf_vectorizer.idf_

# Join idf values into the previous dataframe
words_tf_idf['idf'] = idf_vectorizer

# Show dataframe with ascending values of idf
words_tf_idf.sort_values(by=['idf'])

| | tf | idf |
|---|---|---|
| is | 0.2 | 1.0 |
| the | 0.2 | 1.0 |
| this | 0.2 | 1.0 |
| document | 0.2 | 1.287682 |
| first | 0.2 | 1.693147 |
| and | 0.0 | 2.386294 |
| second | 0.0 | 2.386294 |
| one | 0.0 | 2.386294 |
| third | 0.0 | 2.386294 |


### TF-IDF

TF-IDF is a score assigned to every term in every document of the dataset. For a given term, the score increases with each occurrence of the term in the document, and decreases as the term appears in more of the other documents.

$$
\text{tf-idf}(t, d) = tf(t, d) \times idf(t)
$$
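
When the scores are normalized per document, the multiplication is followed by a rescaling step. The sketch below works this through for the sentence "This is the first document." with L1 normalization, reusing the IDF values computed in the IDF section; treating the raw term count as the TF input is an assumption about how the vectorizer proceeds.

```python
# Raw term counts for "This is the first document." and the IDF values from the IDF section
counts = {"this": 1, "is": 1, "the": 1, "first": 1, "document": 1}
idf = {"this": 1.0, "is": 1.0, "the": 1.0, "first": 1.693147, "document": 1.287682}

# Step 1: multiply each term count by the term's inverse document frequency
raw_scores = {term: counts[term] * idf[term] for term in counts}

# Step 2: L1 normalization, so the scores of one document sum to 1
l1_norm = sum(abs(score) for score in raw_scores.values())
tf_idf = {term: score / l1_norm for term, score in raw_scores.items()}

for term, score in tf_idf.items():
    print(f"{term}: {score:.6f}")
```

Because the rescaling happens after the multiplication, a normalized score is generally not equal to the normalized TF times the IDF.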

In [28]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

tf_idf_vectorizer = TfidfVectorizer(ngram_range=(1, 1),
                                    smooth_idf=False,
                                    use_idf=True,
                                    norm='l1')

# Fit on the corpus and vectorize it in one step
tf_idf_vectorized = tf_idf_vectorizer.fit_transform(corpus)

# Keep the vector of the first document
tf_idf_output = tf_idf_vectorized[0]
words_tf_idf['tf-idf'] = tf_idf_output.T.todense()

words_tf_idf.sort_values(by=['tf-idf'])

| | tf | idf | tf-idf |
|---|---|---|---|
| and | 0.0 | 2.386294 | 0.0 |
| third | 0.0 | 2.386294 | 0.0 |
| second | 0.0 | 2.386294 | 0.0 |
| one | 0.0 | 2.386294 | 0.0 |
| is | 0.2 | 1.0 | 0.167201 |
| the | 0.2 | 1.0 | 0.167201 |
| this | 0.2 | 1.0 | 0.167201 |
| document | 0.2 | 1.287682 | 0.215302 |
| first | 0.2 | 1.693147 | 0.283096 |


### Problem set 2
Based on Problem 1 and the material on TF, IDF, and TF-IDF:
- (2a) Build the tf-idf table for the UIT-VSFC dataset with $n$-gram = 1 and $n$-gram = 2.
- (2b) Change a few hyperparameters of `TfidfVectorizer` (`smooth_idf`, `sublinear_tf` and `norm`) from problem 2a (*you can browse this [link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to find the correct parameters to pass*). Explain how the results differ after modifying the hyperparameters.
- (2c) Which words have the lowest and the highest tf-idf values? Do they differ from the $n$-gram results?
- (2d) Which limitations of $n$-grams does TF-IDF overcome?

## 2a. Build the tf-idf table for the UIT-VSFC dataset with n-gram = 1 and n-gram = 2

### 1-gram

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# 1-gram
tfidf_vectorizer_1gram = TfidfVectorizer(ngram_range=(1, 1))
tfidf_1gram = tfidf_vectorizer_1gram.fit_transform(list_all_sentence)
tfidf_1gram_df = pd.DataFrame(tfidf_1gram.toarray(), columns=tfidf_vectorizer_1gram.get_feature_names_out())

print("TF-IDF with 1-gram:")
tfidf_1gram_df.head(10)

TF-IDF with 1-gram:


Unnamed: 0,10,100,10h,10h30,11,11doubledot55,11h30,11h55,12,12doubledot00,...,ấy,ẩn,ắt,ốc,ồn,ổn,ủa,ủng,ức,ứng
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 2-gram

In [None]:
# 2-gram
tfidf_vectorizer_2gram = TfidfVectorizer(ngram_range=(2, 2))
tfidf_2gram = tfidf_vectorizer_2gram.fit_transform(list_all_sentence)
tfidf_2gram_df = pd.DataFrame(tfidf_2gram.toarray(), columns=tfidf_vectorizer_2gram.get_feature_names_out())

print("TF-IDF with 2-gram:")
tfidf_2gram_df.head(10)

TF-IDF with 2-gram:


Unnamed: 0,10 50,10 bài,10 fraction,10 kiến,10 luôn,10 mấy,10 mới,10 người,10 năm,10 phút,...,ứng kịp,ứng nhu,ứng nhưng,ứng tốt,ứng yêu,ứng đáp,ứng đúng,ứng được,ứng đầy,ứng đủ
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 2b. Change a few hyperparameters in the TfidfVectorizer function

### 1-gram

In [32]:
# 1-gram
new_tfidf_vectorizer_1gram = TfidfVectorizer(ngram_range=(1, 1), smooth_idf=False, sublinear_tf=True, norm='l1')
new_tfidf_1gram = new_tfidf_vectorizer_1gram.fit_transform(list_all_sentence)
new_tfidf_1gram_df = pd.DataFrame(new_tfidf_1gram.toarray(), columns=new_tfidf_vectorizer_1gram.get_feature_names_out())
new_tfidf_sort_1gram = new_tfidf_1gram_df.sum(axis=0).sort_values(ascending=False).to_frame('TF-IDF')
tfidf_sort_1gram = tfidf_1gram_df.sum(axis=0).sort_values(ascending=False).to_frame('TF-IDF')

print("TF-IDF with 1-gram:")
print(tfidf_sort_1gram.head(5))
print(tfidf_sort_1gram[-5:])
print("TF-IDF with 1-gram (new hyperparameters):")
print(new_tfidf_sort_1gram.head(5))
print(new_tfidf_sort_1gram[-5:])

TF-IDF with 1-gram:
           TF-IDF
viên   659.324137
giảng  658.334312
dạy    616.080713
tình   558.548969
thầy   552.767907
        TF-IDF
tá    0.125334
vạ    0.120081
gạo   0.120081
ùn    0.120081
lõng  0.120081
TF-IDF with 1-gram (new hyperparameters):
           TF-IDF
dạy    230.719080
giảng  230.481666
tình   211.496246
viên   205.576416
thầy   198.167823
        TF-IDF
hỏa   0.023056
lõng  0.016674
gạo   0.016674
ùn    0.016674
vạ    0.016674


### 2-gram

In [33]:
# 2-gram
new_tfidf_vectorizer_2gram = TfidfVectorizer(ngram_range=(2, 2), smooth_idf=False, sublinear_tf=True, norm='l1')
new_tfidf_2gram = new_tfidf_vectorizer_2gram.fit_transform(list_all_sentence)
new_tfidf_2gram_df = pd.DataFrame(new_tfidf_2gram.toarray(), columns=new_tfidf_vectorizer_2gram.get_feature_names_out())
new_tfidf_sort_2gram = new_tfidf_2gram_df.sum(axis=0).sort_values(ascending=False).to_frame('TF-IDF')
tfidf_sort_2gram = tfidf_2gram_df.sum(axis=0).sort_values(ascending=False).to_frame('TF-IDF')

print("TF-IDF with 2-gram:")
print(tfidf_sort_2gram.head(5))
print(tfidf_sort_2gram[-5:])
print("TF-IDF with 2-gram (new hyperparameters):")
print(new_tfidf_sort_2gram.head(5))
print(new_tfidf_sort_2gram[-5:])

TF-IDF with 2-gram:
                TF-IDF
nhiệt tình  327.391256
sinh viên   266.909184
giảng viên  234.135877
dễ hiểu     211.014566
giảng dạy   179.107510
            TF-IDF
sức và    0.095209
cả tên    0.095209
lõng mơ   0.095209
buộc thì  0.095209
vạ gạo    0.095209
TF-IDF with 2-gram (new hyperparameters):
                TF-IDF
nhiệt tình  134.238251
dễ hiểu      84.110376
giảng viên   82.846095
sinh viên    78.990046
giảng dạy    67.601236
              TF-IDF
buổi thông  0.008826
mềm hệ      0.008826
như phần    0.008826
một vạ      0.008826
gạo một     0.008826


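### Explain the differences in the results after modifying the hyperparameters.
- Changing `smooth_idf` from `True` (default) to `False` removes the add-one smoothing of the document counts. Without smoothing, the IDF is $\ln(N / df(t)) + 1$ instead of $\ln((1 + N) / (1 + df(t))) + 1$, which slightly increases the weight of rare terms rather than reducing it.
- Changing `sublinear_tf` from `False` (default) to `True` replaces the raw term count $tf$ with $1 + \ln(tf)$. This reduces the influence of a word that appears many times in one document and helps balance its influence across documents.
- Changing `norm` from `l2` (default) to `l1` makes the TF-IDF values of each document sum to 1 instead of scaling them to unit Euclidean length. Individual values become smaller, which largely explains why the summed scores drop (for example, "viên" goes from about 659 to about 206), and it may affect how documents are compared.
- The ranking changes as well: "dạy" replaces "viên" as the top 1-gram, while "nhiệt tình" remains the top 2-gram.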
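
The effect of the first two options can be written out as plain formulas. The sketch below follows the definitions given in the scikit-learn documentation (natural logarithm, add-one smoothing applied to both document counts); the function names are illustrative and not part of the library.

```python
import math


def idf(num_documents: int, doc_freq: int, smooth: bool) -> float:
    """IDF with or without add-one smoothing of the document counts."""
    if smooth:
        return math.log((1 + num_documents) / (1 + doc_freq)) + 1
    return math.log(num_documents / doc_freq) + 1


def term_frequency(count: int, sublinear: bool) -> float:
    """Raw count, or 1 + ln(count) when sublinear scaling is on."""
    if sublinear and count > 0:
        return 1 + math.log(count)
    return float(count)


# A term that appears in 1 of 4 documents
print(round(idf(4, 1, smooth=True), 6), round(idf(4, 1, smooth=False), 6))
# A term repeated 3 times inside one document
print(term_frequency(3, sublinear=False), round(term_frequency(3, sublinear=True), 6))
```

On this toy input, smoothing lowers the IDF of the rare term, and sublinear scaling shrinks a count of 3 to roughly 2.1.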

## 2c. Find words with the lowest and highest tf-idf values

### 1-gram

In [34]:
# 1-gram
print("N-grams:")
print("Lowest 1-gram:", word_frequency_1ngram.index[-1])
print("Highest 1-gram:", word_frequency_1ngram.index[0])
print("TF-IDF with 1-gram:")
print("Lowest 1-gram TF-IDF:", tfidf_sort_1gram.index[-1])
print("Highest 1-gram TF-IDF:", tfidf_sort_1gram.index[0])
print("TF-IDF with 1-gram (new hyperparameters):")
print("Lowest 1-gram TF-IDF:", new_tfidf_sort_1gram.index[-1])
print("Highest 1-gram TF-IDF:", new_tfidf_sort_1gram.index[0])

N-grams:
Lowest 1-gram: 11doubledot55
Highest 1-gram: viên
TF-IDF with 1-gram:
Lowest 1-gram TF-IDF: lõng
Highest 1-gram TF-IDF: viên
TF-IDF with 1-gram (new hyperparameters):
Lowest 1-gram TF-IDF: vạ
Highest 1-gram TF-IDF: dạy


In [35]:
# 2-gram
print("N-grams:")
print("Lowest 2-gram:", word_frequency_2ngram.index[-1])
print("Highest 2-gram:", word_frequency_2ngram.index[0])
print("TF-IDF with 2-gram:")
print("Lowest 2-gram TF-IDF:", tfidf_sort_2gram.index[-1])
print("Highest 2-gram TF-IDF:", tfidf_sort_2gram.index[0])
print("TF-IDF with 2-gram (new hyperparameters):")
print("Lowest 2-gram TF-IDF:", new_tfidf_sort_2gram.index[-1])
print("Highest 2-gram TF-IDF:", new_tfidf_sort_2gram.index[0])

N-grams:
Lowest 2-gram: trên google
Highest 2-gram: sinh viên
TF-IDF with 2-gram:
Lowest 2-gram TF-IDF: vạ gạo
Highest 2-gram TF-IDF: nhiệt tình
TF-IDF with 2-gram (new hyperparameters):
Lowest 2-gram TF-IDF: gạo một
Highest 2-gram TF-IDF: nhiệt tình


## 2d. Which limitations of n-grams does TF-IDF overcome?
- TF-IDF mitigates the stopword problem of raw n-gram counts.
- TF-IDF reduces the weight of words that are frequent across the whole dataset.
- TF-IDF separates the important words from the rest of the table, so the useful data can be reduced to a smaller set.
- TF-IDF evaluates how important each word is within the dataset.