- Topic modeling is an unsupervised learning process that automatically extracts topics or issues from a document collection by analyzing the word patterns in the text.
- Cluster-based topic modeling is a methodology that combines embeddings from a language model with a clustering architecture.
- In this project, we build the KRNewsBERT language model and then design a KRNewsBERT-based topic modeling pipeline using parallel clustering and semantic embedding techniques.
- The Korean National Assembly provided news articles, Twitter posts, and online community data related to major legislation in Korea.
- Korean NLI and STS datasets were additionally collected for fine-tuning the language model.
To summarize, the 『Parallel Clustering based News article Topic Modeling』 process we designed consists of the following four steps. Steps 1 and 2 pre-train and fine-tune the language model to create KRNewsBERT.
- STEP 1) Unsupervised training (TSDAE): The language model (KRNewsBERT) learns the context of the given news articles and is adapted to the domain through unsupervised training with the TSDAE method.
- STEP 2) Supervised training (NLI and STS): The language model (KRNewsBERT) is fine-tuned on Korean NLI and STS datasets so that it can distinguish degrees of similarity between sentences or documents.
- STEP 3) Parallel Clustering: A clustering method we designed with a focus on speed and stability.
- STEP 4) Keyword Extraction: Important words are extracted from each clustered group using the c-TF-IDF (class-based term frequency, inverse document frequency) weighting; a sketch of this weighting follows the list.
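
For reference, the class-based TF-IDF named in Step 4 can be written down as a small function. The sketch below follows the common c-TF-IDF formulation popularized by BERTopic (per-cluster term frequency scaled by the log of the average cluster size over each term's total frequency); the exact variant used in this project may differ.

```python
import numpy as np

def c_tf_idf(term_counts: np.ndarray) -> np.ndarray:
    """Class-based TF-IDF scores (BERTopic-style variant, shown for illustration).

    term_counts: (n_clusters, n_terms) matrix where row c holds the raw counts
    of each vocabulary term over all articles assigned to cluster c.
    Returns a matrix of the same shape with one score per term and cluster.
    """
    # Term frequency normalized within each cluster (the "class").
    tf = term_counts / term_counts.sum(axis=1, keepdims=True)
    # Inverse document frequency where a whole cluster plays the role of a document:
    # average number of words per cluster divided by each term's total frequency.
    avg_words_per_cluster = term_counts.sum() / term_counts.shape[0]
    idf = np.log(1 + avg_words_per_cluster / term_counts.sum(axis=0))
    return tf * idf

# Toy example: 2 clusters over a 3-word vocabulary.
counts = np.array([[10, 0, 2],
                   [1, 8, 2]], dtype=float)
scores = c_tf_idf(counts)
top_terms_per_cluster = scores.argsort(axis=1)[:, ::-1]  # term indices ranked high-to-low
```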
- TSDAE consists of the following three steps:
- TSDAE introduces noise to input sequences by deleting or swapping tokens.
- These damaged sentences are encoded by the transformer model into sentence embeddings.
- A separate decoder network then attempts to reconstruct the original input from this sentence embedding.
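
As a concrete illustration of this step, the sketch below follows the standard sentence-transformers TSDAE recipe. The base checkpoint ("klue/bert-base"), the output path, and the hyperparameters are assumptions for illustration only; the configuration actually used to build KRNewsBERT may differ.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import DenoisingAutoEncoderDataset

# Unlabeled news sentences (placeholder; the project uses the National Assembly news corpus).
train_sentences = ["국회는 오늘 본회의를 열었다.", "해당 법안은 상임위원회에 계류 중이다."]

# Build a SentenceTransformer encoder from a Korean BERT checkpoint (assumed name).
word_embedding_model = models.Transformer("klue/bert-base")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# DenoisingAutoEncoderDataset corrupts each sentence (token deletion) to create the noisy input.
train_dataset = DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# The loss attaches a decoder that must reconstruct the original sentence from the embedding.
train_loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path="klue/bert-base", tie_encoder_decoder=True
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    show_progress_bar=True,
)
model.save("output/krnewsbert-tsdae")
```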
- High-performance text clustering and topic modeling require a language model that produces high-quality semantic embeddings.
- Accordingly, two types of supervised learning are additionally performed, Natural Language Inference (NLI) and Semantic Textual Similarity (STS), which refine the embedding space so that similar sentences lie close together (see the sketch below).
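
A minimal sketch of the two supervised stages, again with sentence-transformers: multiple-negatives ranking over NLI triplets, then cosine-similarity regression over scored STS pairs. The dataset hints (KorNLI/KorSTS), placeholder sentences, paths, and hyperparameters are assumptions, not the project's exact setup.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Continue from the TSDAE-adapted model saved in the previous sketch.
model = SentenceTransformer("output/krnewsbert-tsdae")

# --- Stage 1: NLI ----------------------------------------------------------
# Each example is (anchor, entailment, contradiction); the loss pulls the entailed
# sentence toward the anchor and pushes the contradiction and in-batch negatives away.
nli_examples = [
    InputExample(texts=["전제 문장", "전제를 함의하는 문장", "전제와 모순되는 문장"]),
    # ... triplets built from a Korean NLI corpus (e.g. KorNLI)
]
nli_loader = DataLoader(nli_examples, batch_size=32, shuffle=True)
nli_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(nli_loader, nli_loss)], epochs=1, warmup_steps=100)

# --- Stage 2: STS ----------------------------------------------------------
# STS pairs carry a similarity score rescaled to [0, 1]; CosineSimilarityLoss
# regresses the cosine of the two embeddings onto that score.
sts_examples = [
    InputExample(texts=["문장 A", "문장 B"], label=0.8),
    # ... pairs from a Korean STS corpus (e.g. KorSTS)
]
sts_loader = DataLoader(sts_examples, batch_size=16, shuffle=True)
sts_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(sts_loader, sts_loss)], epochs=4, warmup_steps=100)

model.save("output/krnewsbert")
```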
- Clustering-based topic modeling uses a clustering framework over contextualized semantic embeddings to discover topics.
- We develop a simple cluster-based topic modeling method focused on speed.
- The parallel clustering method is applied to the news article embeddings to group semantically similar articles.
- Each cluster is regarded as a topic, and the model then selects representative words from each cluster through the class-based TF-IDF formula (a sketch of the clustering step follows this list).
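
The parallel clustering algorithm itself is only characterized above by its goals (speed and stability), so the following is a hypothetical stand-in rather than the authors' method: articles are embedded with the fine-tuned model and each embedding is greedily assigned to its most similar existing centroid under a cosine-similarity threshold, starting a new cluster otherwise. The model path, threshold, and greedy scheme are illustrative assumptions; the costly embedding step is the natural place to parallelize (e.g. sentence-transformers' multi-process encoding).

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def greedy_cluster(embeddings: np.ndarray, threshold: float = 0.6) -> list[int]:
    """Single-pass centroid clustering (illustrative stand-in, not the paper's algorithm)."""
    # L2-normalize so a dot product equals cosine similarity.
    embs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroids: list[np.ndarray] = []
    sizes: list[int] = []
    labels: list[int] = []
    for vec in embs:
        if centroids:
            sims = np.stack(centroids) @ vec            # cosine similarity to every centroid
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                # Running-mean update of the matched centroid, then renormalize.
                centroids[best] = (centroids[best] * sizes[best] + vec) / (sizes[best] + 1)
                centroids[best] /= np.linalg.norm(centroids[best])
                sizes[best] += 1
                labels.append(best)
                continue
        centroids.append(vec)
        sizes.append(1)
        labels.append(len(centroids) - 1)
    return labels

# Encoding dominates the runtime; sentence-transformers can batch it on GPU and
# spread it over multiple processes with start_multi_process_pool/encode_multi_process.
model = SentenceTransformer("output/krnewsbert")        # path from the training sketches above
articles = ["뉴스 기사 본문 1 ...", "뉴스 기사 본문 2 ..."]
embeddings = np.asarray(model.encode(articles, batch_size=64, show_progress_bar=True))
labels = greedy_cluster(embeddings)
# Each label is treated as one topic; representative words per topic can then be
# scored with the c-TF-IDF weighting sketched after the step list above.
```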
- Experimental results demonstrate that our parallel clustering is faster and yields more coherent clusters of text embeddings than other widely used clustering methods.
- Seoul National University NLP Labs
- Navy Lee