# Homework Lab 2: Text Preprocessing with Vietnamese
**Overview:** In this exercise, we will build a text preprocessing program for Vietnamese.

Import the necessary libraries. We use the underthesea library for Vietnamese word tokenization; to install it, run the cell below. ([link](https://github.com/undertheseanlp/underthesea))

In [1]:
%pip install underthesea


In [2]:
import os
import glob
import codecs
import sys
import re
from underthesea import word_tokenize

## Question 1: Create a Corpus and Survey the Data

The data in this section is partially extracted from the [VNTC](https://github.com/duyvuleo/VNTC) dataset. VNTC is a Vietnamese news dataset covering various topics. In this section, we will only process the science topic from VNTC. We will create a corpus from both the train and test directories. Complete the following program:

- Write `sentences_list` to a file named `dataset_name.txt`, with each element as a document on a separate line.
- Check how many documents are in the corpus.


In [3]:
# Dataset https://www.kaggle.com/code/tientrungcao/vntc-text-classification/output
%cd /kaggle/input/vntc-text-classification
dataset_name = "VNTC_khoahoc"
path = ['./Train_Full/', './Test_Full/']

# os.listdir() returns entries in arbitrary order, so sort before comparing
if sorted(os.listdir(path[0])) == sorted(os.listdir(path[1])):
    folder_list = [os.listdir(path[0]), os.listdir(path[1])]
    print("train labels = test labels")
else:
    print("train labels differ from test labels")

doc_num = 0
sentences_list = []
meta_data_list = []
for i in range(2):
    folder_path = path[i] + 'Khoa hoc'
    for file_name in glob.glob(os.path.join(folder_path, '*.txt')):
        # Read the raw bytes of the file; the with-block closes it automatically
        with open(file_name, 'rb') as f:
            # Decode the UTF-16 bytes into text and flatten Windows line breaks
            file_content = f.read().decode("utf-16").replace("\r\n", " ")
        sentences_list.append(file_content.strip())
        # Count the number of documents
        doc_num += 1

#### YOUR CODE HERE ####
# Number of documents
print('Number of documents =', doc_num)

%cd /kaggle/working/

# Write one document per line to VNTC_khoahoc.txt
with open(dataset_name + '.txt', 'w', encoding='utf-8') as file:
    for sentence in sentences_list:
        file.write(sentence + '\n')

#### END YOUR CODE #####

/kaggle/input/vntc-text-classification
train labels = test labels
Number of documents = 3916
/kaggle/working
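
The loop above replaces only `\r\n` pairs, so a file containing a bare `\n` would spread one document over several lines of the output file. Below is a minimal sketch of a stricter variant, assuming the same UTF-16 input. The helper name `read_document` and the temporary sample file are hypothetical and exist only for illustration.

```python
import os
import tempfile


def read_document(file_name):
    """Read a UTF-16 file and return its text on a single line."""
    with open(file_name, "rb") as f:
        text = f.read().decode("utf-16")
    # str.split() with no argument splits on any run of whitespace,
    # including \r\n, bare \n and tabs, so the result has no line breaks.
    return " ".join(text.split())


# Build a small UTF-16 sample file to exercise the helper.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "wb") as f:
        f.write("Khoa học\r\nvà đời sống\nhôm nay".encode("utf-16"))
    document = read_document(sample_path)

print(document)
```

Collapsing all whitespace this way also removes the need for a separate `strip()` call, at the cost of discarding any intentional spacing inside a document.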


## Question 2: Write Preprocessing Functions







### Question 2.1: Write a Function to Clean Text
Hint:
- The text should only retain the following characters: aAàÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬbBcCdDđĐeEèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆfFgGhHiIìÌỉỈĩĨíÍịỊjJkKlLmMnNoOòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộỘơƠờỜởỞỡỠớỚợỢpPqQrRsStTuUùÙủỦũŨúÚụỤưƯừỪửỬữỮứỨựỰvVwWxXyYỳỲỷỶỹỸýÝỵỴzZ0-9(),!?\'\
- Then trim the whitespace in the input text.

In [4]:
def clean_str(string):
    #### YOUR CODE HERE ####
    allowed_chars =  r"a-zA-Z0-9\(\),!?\'\\àÀảẢãÃáÁạẠăĂằẰẳẲẵẴắẮặẶâÂầẦẩẨẫẪấẤậẬđĐèÈẻẺẽẼéÉẹẸêÊềỀểỂễỄếẾệỆìÌỉỈĩĨíÍịỊòÒỏỎõÕóÓọỌôÔồỒổỔỗỖốỐộỘơƠờỜởỞỡỠớỚợỢùÙủỦũŨúÚụỤưƯừỪửỬữỮứỨựỰỳỲỷỶỹỸýÝỵỴ"
    cleaned_text = re.sub(f"[^ {allowed_chars}]", " ", string)
    cleaned_text = re.sub(r"\s+", " ", cleaned_text)
    return cleaned_text.strip()
    #### END YOUR CODE #####
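
The allowed-character list is made of precomposed Vietnamese letters (for example `ế` as a single code point). If some input arrived in decomposed form (a base letter followed by combining marks), the regex would replace the marks with spaces and silently drop the diacritics. The sketch below shows one possible guard, assuming such input can occur in your data. `normalize_nfc` is a hypothetical helper, and the character class is a shortened stand-in for the full list.

```python
import re
import unicodedata


def normalize_nfc(string):
    """Compose base letters and combining marks into single code points."""
    return unicodedata.normalize("NFC", string)


# Shortened stand-in for the full whitelist: ASCII letters, digits and "ế" (U+1EBF).
pattern = re.compile("[^ a-zA-Z0-9\u1ebf]")

# Force the decomposed form: "e" followed by two combining marks.
decomposed = unicodedata.normalize("NFD", "ti\u1ebfng")

stripped = pattern.sub(" ", decomposed)              # combining marks become spaces
kept = pattern.sub(" ", normalize_nfc(decomposed))   # the composed letter survives

print(repr(stripped))
print(repr(kept))
```

Calling a normalization step like this at the start of `clean_str` would make the whitelist behave the same for both forms.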

### Question 2.2: Write a Function to Convert Text to Lowercase

In [5]:
def text_lowercase(string):
    #### YOUR CODE HERE ####
    return string.lower()
    #### END YOUR CODE #####

### Question 2.3: Tokenize Words
Hint: Use the `word_tokenize()` function imported above with two parameters: `strings` and `format="text"`.


In [6]:
def tokenize(strings):
    #### YOUR CODE HERE ####
    return word_tokenize(strings, format="text")
    #### END YOUR CODE #####
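
With `format="text"`, `word_tokenize` returns one string in which the syllables of a multi-syllable word are joined by underscores, as in `chiến_thắng` in the output of Question 3. The sketch below does not call underthesea. It only illustrates how a list of segmented words maps to that text form, using a hypothetical helper `to_text_format` and a hand-written segmentation.

```python
def to_text_format(tokens):
    """Join the syllables of each token with "_" and the tokens with spaces."""
    return " ".join(token.replace(" ", "_") for token in tokens)


# Hand-written segmentation for illustration, not actual library output.
tokens = ["chiến thắng", "hay là", "chết", "ở", "olympic", "cổ đại"]
text = to_text_format(tokens)
print(text)
```

Because each word becomes a single whitespace-free token, later steps can recover the words with a plain `split()`.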

### Question 2.4: Remove Stop Words
To remove stop words, we use a list of Vietnamese stop words stored in the file `./vietnamese-stopwords.txt`. Complete the following program:
- Check each word in the text (`strings`). If a word is not in the stop words list, add it to `doc_words`.


In [7]:
# Save stop words to list
!git clone https://github.com/stopwords/vietnamese-stopwords.git
with open('./vietnamese-stopwords/vietnamese-stopwords.txt', 'r', encoding='utf-8') as f:
    stop_words = f.read().splitlines()

Cloning into 'vietnamese-stopwords'...
remote: Enumerating objects: 95, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 95 (delta 3), reused 0 (delta 0), pack-reused 81 (from 1)
Receiving objects: 100% (95/95), 40.25 KiB | 3.10 MiB/s, done.
Resolving deltas: 100% (31/31), done.


In [8]:
def remove_stopwords(strings):
    #### YOUR CODE HERE ####
    words = strings.split()
    # Keep only the words that do not appear in the stop word list
    doc_words = [word for word in words if word not in stop_words]
    return doc_words
    #### END YOUR CODE #####
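
Two details of this step may be worth checking. A membership test against a list scans the list for every word, whereas a set gives average constant-time lookup. Also, the tokenized text joins syllables with underscores, so a multi-syllable stop word stored with spaces can never equal a single token. The sketch below addresses both, assuming the stop word file stores such entries with spaces (not verified here). `build_stopword_set`, `remove_stopwords_fast`, and the sample entries are hypothetical.

```python
def build_stopword_set(lines):
    """Return a set of stop words with inner spaces replaced by underscores."""
    return {line.strip().replace(" ", "_") for line in lines if line.strip()}


def remove_stopwords_fast(text, stopword_set):
    """Keep the tokens of an underscore-joined text that are not stop words."""
    return [word for word in text.split() if word not in stopword_set]


# Hypothetical sample entries standing in for the stop word file.
sample_stop_words = ["là", "và", "hay là"]
stopword_set = build_stopword_set(sample_stop_words)

filtered = remove_stopwords_fast("chiến_thắng hay_là chết và vinh_quang", stopword_set)
print(filtered)
```

Whether to normalize the stop words this way depends on the intended behavior of the exercise; it changes which tokens are removed compared with the list-based version.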

### Question 2.5: Build a Preprocessing Function
Hint: Call the functions `clean_str`, `text_lowercase`, `tokenize`, and `remove_stopwords` in order, then return the result from the function.


In [9]:
def text_preprocessing(strings):
    #### YOUR CODE HERE ####
    text = clean_str(strings)
    text = text_lowercase(text)
    text = tokenize(text)
    text = remove_stopwords(text)
    return ' '.join(text)
    #### END YOUR CODE #####

## Question 3: Perform Preprocessing
Now, we will read the corpus from the file created in Question 1. After that, we will call the preprocessing function for each document in the corpus.

Hint: Call the `text_preprocessing()` function with `doc_content` as the input parameter and save the result in the variable `temp1`.


In [10]:
#### YOUR CODE HERE ####
clean_docs = []
with open(dataset_name + '.txt', 'r', encoding='utf-8') as f:
    for doc_content in f:
        # Preprocess each document (one document per line)
        temp1 = text_preprocessing(doc_content)
        clean_docs.append(temp1)
#### END YOUR CODE #####

print("Length of clean_docs = ", len(clean_docs))
print('Clean_docs[0]:\n' + clean_docs[0])

Length of clean_docs =  3916
Clean_docs[0]:
chiến_thắng hay_là chết olympic cổ_đại ? 564 công_nguyên , lực_sĩ arrichion phigaleia , vô_địch olympic môn pankration kết_hợp đấm bốc vật trao vòng_nguyệt quế vinh_quang tử_vong tranh_tài giành vương_miện olympic 3 , arrichion đối_thủ bóp_cổ không_thể thoát gọng kìm_kinh_hoàng , arrichion tóm cổ_chân đối_thủ vặn gãy đau_đớn , đối_phương khuất_phục , cổ arrichion thắt chặt , ta công_bố chiến_thắng , arrichion trút hơi thở cuối_cùng mặc_dù chết arrichion xảy tình_huống bi_kịch , câu_chuyện vận_động_viên olympic từ_bỏ mạng sống giành chiến_thắng hề hiếm hy_lạp cổ_đại môn pankration bạo_lực , bóp_cổ , bẻ ngón đấm hạ_bộ phép , vận_động_viên tổn_thương nặng_nề hầu_hết chết vết_thương trận đấu kết_thúc có_điều kỳ_olympic cổ_đại dấy niềm khát_khao chiến_thắng mãnh_liệt vận_động_viên bỏ_mạng thất_bại ? truyền_thuyết , olympic bắt_đầu 776 công_nguyên , môn thi duy_nhất chạy_đua nước_rút 192 mét diễn khu đền thờ_thần zeus olympia hồi , vận_động_viên ch

## Question 4: Save Preprocessed Data
Hint: Save the preprocessed data to a file named `dataset_name + '.clean.txt'`, where each document is written on a separate line.


In [11]:
#### YOUR CODE HERE ####
# Write one preprocessed document per line
with open(dataset_name + '.clean.txt', 'w', encoding='utf-8') as file:
    for doc in clean_docs:
        file.write(doc + '\n')
#### END YOUR CODE #####
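
A quick way to check the saved file is to read it back and confirm that it still holds one document per line. The sketch below does this on a temporary file with a hypothetical stand-in list `sample_docs`; the same comparison could be applied to `clean_docs` and the `.clean.txt` file.

```python
import os
import tempfile

# Hypothetical stand-in for clean_docs; the last document is empty on purpose.
sample_docs = ["chiến_thắng olympic", "khoa_học đời_sống", ""]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "sample.clean.txt")
    with open(out_path, "w", encoding="utf-8") as f:
        for doc in sample_docs:
            f.write(doc + "\n")
    with open(out_path, "r", encoding="utf-8") as f:
        # rstrip("\n") removes only the line terminator, so empty documents are kept.
        loaded_docs = [line.rstrip("\n") for line in f]

print(len(loaded_docs) == len(sample_docs))
```

Documents that become empty after stop word removal still occupy a line, which keeps line numbers aligned with the original corpus.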

In [12]:
from IPython.display import FileLink
FileLink('VNTC_khoahoc.txt')

In [13]:
FileLink('VNTC_khoahoc.clean.txt')