# Homework Lab 2: Text Preprocessing with Vietnamese
**Overview:** In this exercise, we will build a text preprocessing program for Vietnamese.

Import the necessary libraries. Note that we are using the underthesea library for Vietnamese tokenization. To install it, follow the instructions below. ([link](https://github.com/undertheseanlp/underthesea))

In [14]:
!pip install underthesea



In [15]:
import os
import glob
import codecs
import sys
import re
from underthesea import word_tokenize

## Question 1: Create a Corpus and Survey the Data

The data in this section is partially extracted from the [VNTC](https://github.com/duyvuleo/VNTC) dataset. VNTC is a Vietnamese news dataset covering various topics. In this section, we will only process the science topic from VNTC. We will create a corpus from both the train and test directories. Complete the following program:

- Write `sentences_list` to a file named `dataset_name + ".txt"` (here `VNTC_khoahoc.txt`), with each document on its own line.
- Check how many documents are in the corpus.


In [16]:
!mkdir -p VNTC_khoahoc

train_url = "https://github.com/duyvuleo/VNTC/raw/master/Data/10Topics/Ver1.1/Train_Full.rar"
test_url = "https://github.com/duyvuleo/VNTC/raw/master/Data/10Topics/Ver1.1/Test_Full.rar"
!wget {train_url}
!wget {test_url}

!unrar x Train_Full.rar VNTC_khoahoc/
!unrar x Test_Full.rar VNTC_khoahoc/

--2026-01-26 09:34:59--  https://github.com/duyvuleo/VNTC/raw/master/Data/10Topics/Ver1.1/Train_Full.rar
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/duyvuleo/VNTC/master/Data/10Topics/Ver1.1/Train_Full.rar [following]
--2026-01-26 09:34:59--  https://raw.githubusercontent.com/duyvuleo/VNTC/master/Data/10Topics/Ver1.1/Train_Full.rar
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 49152721 (47M) [application/octet-stream]
Saving to: ‘Train_Full.rar.2’


2026-01-26 09:35:00 (199 MB/s) - ‘Train_Full.rar.2’ saved [49152721/49152721]

--2026-01-26 09:35:00--  https://github.com/duyvuleo/VNTC/raw/m
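
In the log above, `wget` saves the archive as `Train_Full.rar.2`, which suggests that earlier copies were already present in the working directory, while `unrar` is still given `Train_Full.rar`. One way to avoid piling up duplicate downloads is to skip the request when the target file already exists. The sketch below uses only the standard library; `download_once` is a hypothetical helper and is not part of the original notebook.

```python
import os
from urllib.request import urlretrieve


def download_once(url, dest):
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if the file was already there.
    (Hypothetical helper for illustration.)
    """
    if os.path.exists(dest):
        return False
    urlretrieve(url, dest)
    return True


# Example call (commented out so the sketch runs without network access):
# download_once(train_url, "Train_Full.rar")
```

Re-running the cell then leaves the existing archive in place, so the file passed to `unrar` is always the one named `Train_Full.rar`.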

In [17]:
dataset_name = "VNTC_khoahoc"

path = ['./VNTC_khoahoc/Train_Full/', './VNTC_khoahoc/Test_Full/']

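# os.listdir returns entries in arbitrary order, so sort before comparing
train_labels = sorted(os.listdir(path[0]))
test_labels = sorted(os.listdir(path[1]))
if train_labels == test_labels:
    folder_list = [train_labels, test_labels]
    print("train labels = test labels")
else:
    print("train labels differ from test labels")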

doc_num = 0
sentences_list = []
meta_data_list = []
for i in range(2):
    # Only the science topic ("Khoa hoc") is used in this exercise
    folder_path = os.path.join(path[i], "Khoa hoc")
    if os.path.exists(folder_path):
        for file_name in glob.glob(os.path.join(folder_path, '*.txt')):
            # Read the raw bytes; the context manager closes the file
            with open(file_name, 'rb') as f:
                # Decode the UTF-16 bytes into a string and flatten line breaks
                file_content = f.read().decode("utf-16").replace("\r\n", " ")
            sentences_list.append(file_content.strip())
            # Count the number of documents
            doc_num += 1

#### YOUR CODE HERE ####
with open(dataset_name + ".txt", "w", encoding="utf-8") as f:
  for doc in sentences_list:
    f.write(doc + "\n")

# Report the number of documents in the corpus
print(f"Number documents in the corpus: {doc_num}")

#### END YOUR CODE #####

train labels = test labels
Number documents in the corpus: 3916
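
The one-document-per-line format only holds if no document contains a line break of its own. The cell above replaces `"\r\n"`, so a file that uses a bare `"\n"` would be split across several lines. Below is a minimal sketch of a stricter approach that collapses every run of whitespace before writing. The `flatten` helper, the sample documents, and the temporary file name are assumptions made for illustration.

```python
import os
import tempfile


def flatten(doc):
    """Collapse every run of whitespace (CRLF, LF, tabs, spaces) into one space."""
    return " ".join(doc.split())


# Hypothetical sample documents with different line-break styles
docs = ["Dòng một\r\nDòng hai", "Chỉ một dòng", "Dòng A\nDòng B"]

demo_path = os.path.join(tempfile.mkdtemp(), "corpus_demo.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(flatten(doc) + "\n")

# Read the file back: the number of lines should equal the number of documents
with open(demo_path, encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

print(len(lines) == len(docs))
```

Comparing the line count of the written file with `doc_num` is a cheap check that the corpus file can be read back document by document.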


## Question 2: Write Preprocessing Functions







### Question 2.1: Write a Function to Clean Text
Hint:
- The text should only retain the following characters: aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬbBcCdDđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆfFgGhHiIìÌỉỈĩĨíÍịỊjJkKlLmMnNoOòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộỘơƠờỜởỞỡỠớỚợỢpPqQrRsStTuUùÙủỦũŨúÚụỤưƯừỪửỬữỮứỨựỰvVwWxXyYỳỲỷỶỹỸýÝỵỴzZ0-9(),!?'"
- Then strip leading and trailing whitespace from the result.

In [18]:
def clean_str(string):
    #### YOUR CODE HERE ####
    regex = r"[^aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬbBcCdDđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆfFgGhHiIìÌỉỈĩĨíÍịỊjJkKlLmMnNoOòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộỘơƠờỜởỞỡỠớỚợỢpPqQrRsStTuUùÙủỦũŨúÚụỤưƯừỪửỬữỮứỨựỰvVwWxXyYỳỲỷỶỹỸýÝỵỴzZ0-9(),!?'\"]"
    return re.sub(regex, ' ', string).strip()
    #### END YOUR CODE #####
# print(clean_str("abĂbÂ!!!!*&"))
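
A character whitelist like the one above matches precomposed Vietnamese letters. If the input arrives in decomposed Unicode form (a base letter followed by combining marks), the combining marks are not in the allowed set and would be replaced by spaces. The sketch below illustrates the effect and a possible remedy using `unicodedata.normalize("NFC", ...)`. The reduced pattern `ALLOWED_DEMO` and the helper `keep_allowed` are assumptions standing in for the full regex in `clean_str`.

```python
import re
import unicodedata

# Reduced stand-in for the whitelist: ASCII letters, space, and the
# precomposed letters U+1EBF (ế) and U+1EC7 (ệ)
ALLOWED_DEMO = re.compile("[^a-zA-Z\u1ebf\u1ec7 ]")


def keep_allowed(text):
    """Replace every character outside the demo whitelist with a space."""
    return ALLOWED_DEMO.sub(" ", text)


composed = "ti\u1ebfng Vi\u1ec7t"                     # "tiếng Việt", precomposed
decomposed = unicodedata.normalize("NFD", composed)   # base letters + combining marks

print(len(composed), len(decomposed))
print(keep_allowed(decomposed) == composed)           # diacritics were stripped

# Normalizing to NFC first restores the precomposed letters
restored = unicodedata.normalize("NFC", decomposed)
print(keep_allowed(restored) == composed)
```

Whether this matters for the VNTC files depends on how they were encoded; calling `unicodedata.normalize("NFC", string)` at the top of `clean_str` is a low-cost safeguard either way.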

### Question 2.2: Write a Function to Convert Text to Lowercase

In [19]:
# make all text lowercase
def text_lowercase(string):
    #### YOUR CODE HERE ###
    # str -> str
    return string.lower()
    #### END YOUR CODE #####
# print(text_lowercase(clean_str("aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬbBcCdDđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆfFgGhHiIìÌỉỈĩĨíÍịỊjJkKlLmMnNoOòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộỘơƠờỜởỞỡỠớỚợỢpPqQrRsStTuUùÙủỦũŨúÚụỤưƯừỪửỬữỮứỨựỰvVwWxXyYỳỲỷỶỹỸýÝỵỴzZ0-9(),!?'\"]")))

### Question 2.3: Tokenize Words
Hint: Use the `word_tokenize()` function imported above with two parameters: `strings` and `format="text"`.


In [20]:
def tokenize(strings):
    #### YOUR CODE HERE ####
    # str -> str: with format="text" the result is a single string
    return word_tokenize(strings, format="text")
    #### END YOUR CODE #####
# print(tokenize("anh chả là Nhật ký SEA Games biết ngày 21/8: Ánh Viên thắng giòn giã bài cái ở vòng loại."))
# help(word_tokenize)
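
With `format="text"`, the tokenizer output in this notebook is a single string in which the syllables of a multi-syllable word are joined by underscores. A minimal sketch of turning that string into a token list and back into readable words follows; the sample string is an assumption modeled on the preprocessed output of this notebook, so the example runs without `underthesea` installed.

```python
# Sample string in the underscore-joined format (assumed to resemble
# the output of word_tokenize(..., format="text"))
tokenized = "đôi giày thể_hiện tính_cách"

# Each whitespace-separated item is one word
tokens = tokenized.split()

# Replace underscores with spaces to display the original syllables
readable = [token.replace("_", " ") for token in tokens]

print(tokens)
print(readable)
```

Keeping the underscore form in the saved corpus lets downstream tools that split on whitespace treat each multi-syllable word as one token.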

### Question 2.4: Remove Stop Words
To remove stop words, we use the Vietnamese stop word list `vietnamese-stopwords.txt`, which the cell below downloads from GitHub. Complete the following program:
- Check each word in the text (`strings`). If a word is not in the stop word list, add it to `doc_words`.


In [21]:
from urllib.request import urlopen

url = "https://raw.githubusercontent.com/stopwords/vietnamese-stopwords/master/vietnamese-stopwords.txt"
raw_data = urlopen(url).read().decode("utf-8")
# One stop word per line; strip stray whitespace and skip empty lines
STOPWORDS_SET = {line.strip() for line in raw_data.splitlines() if line.strip()}

def remove_stopwords(strings):
    #### YOUR CODE HERE ####
    # split() without arguments also discards empty tokens from repeated spaces
    doc_words = [word for word in strings.split() if word not in STOPWORDS_SET]
    return " ".join(doc_words)
    #### END YOUR CODE #####
# print(remove_stopwords(tokenize(text_lowercase(clean_str("anh chả là Nhật ký SEA Games biết ngày 21/8: Ánh Viên thắng giòn giã bài cái ở vòng loại.")))))
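
One detail worth checking is how multi-syllable stop words are written. After tokenization, a word such as `bất_cứ` carries an underscore, so a stop word entry stored with a space (`bất cứ`) can never match it. The sketch below shows the mismatch and one possible fix. It assumes that the stop word list stores multi-syllable entries with spaces; the mini list, the sample sentence, and the `drop` helper are hypothetical.

```python
# Hypothetical mini stop word list; assumption: multi-syllable entries use spaces
raw_stopwords = ["là", "bất cứ", "một chút"]

plain = set(raw_stopwords)
# Same list with spaces replaced by underscores, matching the tokenizer format
underscored = {word.replace(" ", "_") for word in raw_stopwords}

# Sample text in the underscore-joined tokenizer format
tokenized = "bất_cứ ai là người một_chút"


def drop(text, stopwords):
    """Remove every whitespace-separated token that appears in stopwords."""
    return " ".join(word for word in text.split() if word not in stopwords)


print(drop(tokenized, plain))        # only the single-syllable stop word is removed
print(drop(tokenized, underscored))  # multi-syllable stop words are removed too
```

If the downloaded list does use spaces, converting it once when building `STOPWORDS_SET` would make multi-syllable stop words effective.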

### Question 2.5: Build a Preprocessing Function
Hint: Call the functions `clean_str`, `text_lowercase`, `tokenize`, and `remove_stopwords` in order, then return the result from the function.


In [22]:
def text_preprocessing(strings):
    #### YOUR CODE HERE ####
    cleaned_strings = clean_str(strings)
    lowercased_strings = text_lowercase(cleaned_strings)
    tokenized_strings = tokenize(lowercased_strings)
    sw_removed_strings = remove_stopwords(tokenized_strings)
    result = sw_removed_strings

    return result
    #### END YOUR CODE #####
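
Calling the steps in a fixed order can also be expressed as a small pipeline, where each function receives the output of the previous one. The sketch below is one way to write this; `run_pipeline` is a hypothetical helper, and the demo steps are simple stand-ins for `clean_str`, `text_lowercase`, `tokenize`, and `remove_stopwords`.

```python
from functools import reduce


def run_pipeline(text, steps):
    """Apply each step to the result of the previous one, left to right."""
    return reduce(lambda acc, step: step(acc), steps, text)


# Stand-in steps (assumptions): trim, lowercase, collapse repeated spaces
demo_steps = [str.strip, str.lower, lambda s: " ".join(s.split())]

print(run_pipeline("  Xin   CHÀO  ", demo_steps))
```

The order matters: if the stop word list is lowercase, lowercasing has to happen before stop word removal, and tokenization has to happen before it as well so that multi-syllable words are compared as single tokens.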

## Question 3: Perform Preprocessing
Now, we will read the corpus from the file created in Question 1. After that, we will call the preprocessing function for each document in the corpus.

Hint: Call the `text_preprocessing()` function with `doc_content` as the input parameter and save the result in the variable `temp1`.


In [23]:
#### YOUR CODE HERE ####
clean_docs = []
with open(dataset_name + ".txt", "r", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        doc_content = line.strip()
        if not doc_content:
            continue  # skip blank lines

        try:
            processed = text_preprocessing(doc_content)
        except Exception as exc:
            # Report the failure instead of silently dropping the document
            print(f"Skipping line {line_no}: {exc}")
            continue

        if processed:
            clean_docs.append(processed)
#### END YOUR CODE #####

print("\nlength of clean_docs = ", len(clean_docs))
print('clean_docs[0]:\n' + clean_docs[0])


length of clean_docs =  3916
clean_docs[0]:
đôi giày thể_hiện tính_cách meghan_cleary , tác_giả sách tương_hợp hoàn_hảo , " bất_cứ phụ_tùng trang_phục , đôi giày tiết_lộ trạng_thái tinh_thần phụ_nữ " tìm_hiểu ý_nghĩa đôi giày ưng_ý kiểu giày lê_đế phẳng giỏi ngoại_giao , quan_tâm , chăm_sóc , thường_xuyên xoa_dịu , dàn hòa bất_đồng bạn_bè_bạn óc sáng_tạo nghiêm_túc kiểu giày cao_gót nhọn phối_hợp quyến_rũ truyền_thống hiện_đại , đầy_đủ sức_mạnh phụ_nữ tự_tin giày_hở gót năng_nổ , xông_xáo thực_sự , thường_xuyên thoăn_thoắt công_sở bữa tiệc hơi nghịch_ngợm một_chút , đánh_giá giày vải quyến_rũ điềm_đạm , người_yêu trò_chuyện thông_minh , quan_sát nhanh_nhẹn phát_triển bản_thân ngừng hoạt_động liên_tục hiểu_biết âm_nhạc , điện_ảnh giày sống tình có_lý , sợ đương_đầu vấn_đề gia_đình công_sở bạn_bè yêu quý_vẻ bình_dị , hài_hước
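
Once the documents are preprocessed, a quick frequency count helps to survey the result, for example to spot frequent tokens that the stop word list missed. The sketch below uses `collections.Counter`; the three sample documents are assumptions standing in for `clean_docs`.

```python
from collections import Counter

# Hypothetical stand-in for clean_docs
sample_docs = [
    "đôi giày thể_hiện tính_cách",
    "đôi giày quyến_rũ",
    "giày vải",
]

# Count every whitespace-separated token across all documents
counts = Counter(token for doc in sample_docs for token in doc.split())

print(counts.most_common(2))
print("vocabulary size:", len(counts))
```

Running the same count on the full `clean_docs` list gives the vocabulary size and the most frequent tokens of the corpus.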


## Question 4: Save Preprocessed Data
Hint: Save the preprocessed data to a file named `dataset_name + '.clean.txt'`, where each document is written on a separate line.


In [26]:
#### YOUR CODE HERE ####
filename = dataset_name + ".clean.txt"

with open(filename, "w", encoding="utf-8") as f:
    for doc in clean_docs:
        f.write(doc + "\n")
#### END YOUR CODE #####