#**Topic Modeling with BERT**

BERTopic is an unsupervised topic modeling technique that uses transformer embeddings and c-TF-IDF to group documents into clusters and describe each cluster by its most representative words.

##**1. Architecture**

![BERTopic algorithm overview](https://maartengr.github.io/BERTopic/tutorial/algorithm/algorithm.png)
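
The last stage of the pipeline, c-TF-IDF, can be sketched in a few lines. This is only an illustration, assuming the class-based formula described in the BERTopic documentation: the normalized term frequency inside a cluster multiplied by log(1 + A / f_t), where A is the average number of words per cluster and f_t is the frequency of the term across all clusters. The function name `c_tf_idf` and the toy clusters are hypothetical; the library's own implementation works on sparse matrices and differs in detail.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score words per cluster; `clusters` maps a cluster id to its documents."""
    # Join each cluster's documents into one bag of words.
    counts = {cid: Counter(" ".join(docs).lower().split())
              for cid, docs in clusters.items()}
    # f_t: frequency of each term across all clusters.
    total = Counter()
    for bag in counts.values():
        total.update(bag)
    # A: average number of words per cluster.
    avg_words = sum(total.values()) / len(counts)
    scores = {}
    for cid, bag in counts.items():
        n_words = sum(bag.values())
        scores[cid] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in bag.items()
        }
    return scores


toy_clusters = {  # hypothetical data, for illustration only
    0: ["the album has ten songs", "the song was recorded live"],
    1: ["the storm became a hurricane", "the hurricane winds were strong"],
}
scores = c_tf_idf(toy_clusters)
top_word = max(scores[1], key=scores[1].get)
print(top_word)
```

A word that is frequent inside one cluster but rare elsewhere gets the highest score, while a word such as "the" that appears in every cluster is penalized by the logarithmic term.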

<br>

##**2. Environment**





Install BERTopic together with the NLTK and gensim libraries used for preprocessing.

In [None]:
#@title Install BERTopic, nltk, gensim
!pip install bertopic
!pip install nltk
!pip install gensim

In [None]:
#@title Import BERTopic, nltk, gensim and other libraries
from bertopic import BERTopic
import logging
import pathlib
import re

import matplotlib.pyplot as plt
import nltk
from gensim import models, corpora
from gensim.models import Phrases, LdaModel
from gensim.parsing.preprocessing import remove_stopwords, preprocess_string
from gensim.test.utils import datapath
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer

# Resources needed by the lemmatizer, tokenizer and POS tagger
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')


Download the English train and test files, taken from an LDA demo repository.

In [None]:
#@title Download dataset for train and test
!wget https://github.com/hoavt-54/LDA-Demo/raw/master/data/wiki.train.tokens
!wget https://github.com/hoavt-54/LDA-Demo/raw/master/data/wiki.test.tokens

Define helper functions for processing the data

In [79]:
#@title Import utils.py
STOPW = {"unk", "<unk>"}
wnl = WordNetLemmatizer()


def extract_pos(doc, tag=("NN",)):
    """Keep only the tokens whose POS tag is in `tag` (by default `NN`, singular or mass nouns)."""
    text = word_tokenize(doc)
    return " ".join([t[0] for t in pos_tag(text) if t[1] in tag])


def extract_bigrams(docs, bigram_model=None):
    """Extract bigram features before the POS filtering
    to keep interesting patterns.
    """
    list_tokens = [lemmatize(remove_stopw(doc.lower())).split() for doc in docs]
    if not bigram_model:
        bigram_model = Phrases(list_tokens, min_count=12, max_vocab_size=50000, threshold=3)
    return (bigram_model, [[b for b in bigram_model[tks] if "_" in b] for tks in list_tokens])


def lemmatize(doc):
    return " ".join([wnl.lemmatize(token) for token in doc.split()])


def remove_non_word(string):
    pattern = "^[\\w-]+$"
    return " ".join([token for token in string.split() if re.match(pattern, token)])


def remove_stopw(string):
    string = remove_stopwords(string)
    return " ".join([token for token in string.split() if token not in STOPW])


def read_text_file(fn="test.txt"):
    with open(f"data/{fn}") as f:
        return f.read()
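
To see what the token filter in `remove_non_word` keeps and drops, here is a small standalone sketch. It reuses the same regular expression; the helper name `keep_word_tokens` and the sample sentence (assembled from fragments of the dataset) are only for illustration.

```python
import re

TOKEN_PATTERN = r"^[\w-]+$"  # same pattern as remove_non_word above


def keep_word_tokens(text):
    """Keep only tokens made of word characters and hyphens."""
    return " ".join(t for t in text.split() if re.match(TOKEN_PATTERN, t))


sample = "Barker 's art , a Christian @-@ themed book <unk> St. Edmund"
cleaned = keep_word_tokens(sample)
print(cleaned)  # Barker art a Christian themed book Edmund
```

Punctuation tokens, the `@-@` hyphen marker and `<unk>` are dropped, but so is any token that contains a period or an apostrophe, such as `St.` and `'s`.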


Convert dataset to list of documents

In [None]:
#@title dataset to documents
def read_wiki(dataset="wiki.test.tokens"):
    """In this dataset, documents are separated by their titles, which
        have the format `= Title here =`. We split the dataset file
        into chunks of text at the titles.
    """
    
    article = []
    begin_pattern = "^\\s*= [^=]"  # Ex. = Robert <unk> = 
    with open(dataset) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if re.match(begin_pattern, line):
                if len(article) > 0:
                    # A new title closes the previous article. The title line
                    # itself is not kept, so only the first article starts
                    # with its title.
                    out = " ".join(article)
                    article = []
                    yield out
                else:
                    article.append(line)
            else:
                article.append(line)
        yield " ".join(article)
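
The title pattern used by `read_wiki` can be checked in isolation. The sketch below assumes that section headings inside an article are written with doubled equals signs (`= = Gameplay = =`), which is how this kind of tokenized Wikipedia dump usually marks them; that sample line is an assumption and not taken from the notebook.

```python
import re

BEGIN_PATTERN = r"^\s*= [^=]"  # same pattern as in read_wiki

lines = [
    "= Valkyria Chronicles III =",        # article title (from the dataset)
    "= = Gameplay = =",                   # assumed section-heading format
    "Released in January 2011 in Japan",  # ordinary text
]
is_title = [bool(re.match(BEGIN_PATTERN, line)) for line in lines]
print(is_title)  # [True, False, False]
```

Only a single `=` followed by a space and a non-`=` character starts a new document, so section headings stay inside the article they belong to.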

Clean data

In [86]:
#@title Function for preprocessing data
def preprocess(data, bigrams=None):
    """Clean the documents: drop non-word tokens, keep nouns only,
    remove stopwords and lemmatize.
    """
    data = [remove_non_word(doc) for doc in data]
    print(f"number of docs: {len(data)}")
    # Bigrams are only used by the commented-out code below.
    bigrams, bi_data = extract_bigrams(data, bigrams)
    data = [extract_pos(doc) for doc in data]
    # Drop documents that are too short or too long (length in characters).
    data = [doc for doc in data if 40 < len(doc) < 60000]
    data = [remove_stopw(doc.lower()) for doc in data]
    data = [lemmatize(doc) for doc in data]

    return data

    # list_of_tokens = [doc.split() for doc in data]
    
    # # concat bigrams
    # for idx, tokens in enumerate(list_of_tokens):
    #     tokens.extend(bi_data[idx])

    # return (list_of_tokens, bigrams)

<br>

##**3. Running**

###**3.1. Default parameters**

Load dataset from file

In [136]:
data_train = read_wiki("wiki.train.tokens")
docs = list(data_train)

Example from train dataset

In [137]:
docs[2]

'<unk> Mary Barker ( 28 June 1895 – 16 February 1973 ) was an English illustrator best known for a series of fantasy illustrations depicting fairies and flowers . Barker \'s art education began in girlhood with correspondence courses and instruction at the Croydon School of Art . Her earliest professional work included <unk> cards and juvenile magazine illustrations , and her first book , Flower Fairies of the Spring , was published in 1923 . Similar books were published in the following decades . Barker was a devout Anglican , and donated her artworks to Christian fundraisers and missionary organizations . She produced a few Christian @-@ themed books such as The Children ’ s Book of Hymns and , in collaboration with her sister Dorothy , He Leadeth Me . She designed a stained glass window for St. Edmund \'s Church , Pitlake , and her painting of the Christ Child , The Darling of the World Has Come , was purchased by Queen Mary . Barker was equally proficient in <unk> , pen and ink , o

Initialize a new model with default parameters (the data is not preprocessed)

In [184]:
topic_model_default = BERTopic(language="english", calculate_probabilities=True, verbose=True)

Fit model with train dataset

In [185]:
topics_default, probs_default = topic_model_default.fit_transform(docs)


2021-07-28 07:23:33,555 - BERTopic - Transformed documents to Embeddings





2021-07-28 07:23:38,808 - BERTopic - Reduced dimensionality with UMAP
2021-07-28 07:23:38,868 - BERTopic - Clustered UMAP embeddings with HDBSCAN


Number of documents assigned to each topic

In [140]:
topic_model_default.get_topic_info()

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 178 | -1_was_as_his_with |
| 1 | 0 | 76 | 0_he_his_season_first |
| 2 | 1 | 73 | 1_song_album_music_songs |
| 3 | 2 | 55 | 2_episode_her_she_he |
| 4 | 3 | 49 | 3_were_ships_ship_guns |
| 5 | 4 | 43 | 4_route_highway_road_ny |
| 6 | 5 | 42 | 5_species_with_have_has |
| 7 | 6 | 33 | 6_his_he_king_john |
| 8 | 7 | 31 | 7_storm_hurricane_winds_cyclone |
| 10 | 8 | 25 | 8_australian_air_aircraft_war |
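
The counts in this table are a tally of the topic labels returned by `fit_transform`. Topic `-1` is the label BERTopic gives to documents that the clustering step left unassigned (outliers). As a rough sketch of the same tally outside the library, assuming a non-empty list of integer labels (the helper `topic_sizes` and the toy labels are hypothetical):

```python
from collections import Counter


def topic_sizes(topics):
    """Return (topic, count) pairs, largest first, plus the outlier share."""
    counts = Counter(topics)
    outlier_share = counts.get(-1, 0) / len(topics)
    return counts.most_common(), outlier_share


toy_topics = [-1, 0, 0, 1, -1, 0, 2, 1]  # hypothetical assignments
sizes, outlier_share = topic_sizes(toy_topics)
print(sizes, outlier_share)
```

A large outlier share is a hint that many documents did not fit any cluster well.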


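*We examine a specific topic, in this case topic 2. Topic IDs can change each time the model is refit, so the ID used here need not match the table above.*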

In [202]:
topic_model_default.get_topic(2)

[('album', 0.05475474984094957),
 ('song', 0.05247642175644389),
 ('songs', 0.024933615515251344),
 ('dylan', 0.024013155143533627),
 ('carey', 0.018255159079420036),
 ('recording', 0.015016510724405492),
 ('recorded', 0.013111084734772216),
 ('lyrics', 0.012383771178515367),
 ('vocals', 0.011050798453406452),
 ('albums', 0.011016638009398463)]

**The top words (album, song, lyrics, vocals) indicate that this topic is about music.**

In [144]:
topic_model_default.visualize_topics()

In [190]:
topic_model_default.visualize_barchart(top_n_topics=10)

**In this scenario we did not remove stopwords (he, she, the, is, ...) or unknown tokens (`<unk>`), and this degrades the results: several topic names are dominated by function words such as "was", "his" and "he".**

<br>

###**3.2. Remove Stopwords and Unknown Tokens**

The function below filters each document to remove stopwords and the unknown token `<unk>`

In [167]:
def remove_words(documents):
  new_documents = []
  for doc in documents:
    new_doc = remove_stopwords(doc)          
    new_doc = new_doc.replace("<unk>", "")  

    new_documents.append(new_doc)

  return new_documents
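
Replacing `<unk>` with an empty string leaves a double space wherever the token stood. This is mostly cosmetic, but if clean text is wanted, a variant along the following lines collapses the leftover whitespace. The helper name `strip_unknown` is hypothetical and the sample is shortened from the dataset.

```python
def strip_unknown(doc, marker="<unk>"):
    """Remove the marker and collapse the whitespace it leaves behind."""
    return " ".join(doc.replace(marker, " ").split())


sample = "<unk> Mary Barker was proficient in <unk> , pen and ink"
print(strip_unknown(sample))  # Mary Barker was proficient in , pen and ink
```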

In [165]:
data_train = read_wiki("wiki.train.tokens")
docs = list(data_train)

In [168]:
docs = remove_words(docs)

Repeat the steps above to retrain the model

In [170]:
topic_model_word_removed = BERTopic(language="english", calculate_probabilities=True, verbose=True)

In [171]:
topics_word_removed, probs_word_removed = topic_model_word_removed.fit_transform(docs)


2021-07-28 07:15:45,646 - BERTopic - Transformed documents to Embeddings





2021-07-28 07:15:50,890 - BERTopic - Reduced dimensionality with UMAP
2021-07-28 07:15:50,953 - BERTopic - Clustered UMAP embeddings with HDBSCAN


Topic information after removing the words

In [172]:
topic_model_word_removed.get_topic_info()

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 127 | -1_scientology_century_war_church |
| 1 | 0 | 78 | 0_game_won_championship_win |
| 2 | 1 | 76 | 1_album_song_songs_dylan |
| 3 | 2 | 57 | 2_route_river_road_highway |
| 4 | 3 | 55 | 3_ships_ship_brigade_infantry |
| 5 | 4 | 51 | 4_episode_series_rachel_character |
| 6 | 5 | 43 | 5_species_birds_genus_kakapo |
| 7 | 6 | 38 | 6_church_haifa_cathedral_built |
| 8 | 7 | 34 | 7_poem_book_poetry_poems |
| 9 | 8 | 33 | 8_storm_hurricane_winds_cyclone |


Visualize the topics after removing the words

In [174]:
topic_model_word_removed.visualize_topics()

In [180]:
topic_model_word_removed.get_topic(7)

[('poem', 0.02895436100562492),
 ('book', 0.018500695773743425),
 ('poetry', 0.01472020873773619),
 ('poems', 0.013517322674881864),
 ('keats', 0.012975981985054199),
 ('tennyson', 0.01148870810750142),
 ('ulysses', 0.010148164663459767),
 ('wrote', 0.01002148070860962),
 ('poet', 0.009764157199366906),
 ('nightingale', 0.009246899630430634)]
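
The `Name` column of `get_topic_info()` appears to be the topic ID followed by the four highest-scoring words. The sketch below rebuilds such a label from a `get_topic`-style list; the helper `topic_label` is hypothetical and the scores are rounded from the output above.

```python
def topic_label(topic_id, word_scores, n_words=4):
    """Join the topic id and the top-scoring words with underscores."""
    ranked = sorted(word_scores, key=lambda pair: pair[1], reverse=True)
    return "_".join([str(topic_id)] + [word for word, _ in ranked[:n_words]])


# First rows of the get_topic(7) output, scores rounded.
poetry_words = [("poem", 0.0290), ("book", 0.0185),
                ("poetry", 0.0147), ("poems", 0.0135), ("keats", 0.0130)]
label = topic_label(7, poetry_words)
print(label)  # 7_poem_book_poetry_poems
```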

In [191]:
topic_model_word_removed.visualize_barchart(top_n_topics=10)

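**After removing stopwords and unknown tokens while keeping the word order of each document, the result is better: the topic names now consist of content words (game, album, route) rather than function words.**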

<br>

###**3.3. Multilingual**

The dataset also contains text in other languages (Japanese, ...).
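
One quick way to find which documents contain Japanese script is to look at the Unicode names of their characters. This is a rough heuristic sketch: the helper `has_japanese_script` is hypothetical, and CJK ideographs are shared with Chinese, so a match does not prove the text is Japanese.

```python
import unicodedata

JAPANESE_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA")


def has_japanese_script(text):
    """True if any character's Unicode name marks it as kanji or kana."""
    return any(
        unicodedata.name(ch, "").startswith(JAPANESE_PREFIXES) for ch in text
    )


samples = ["Senjō no Valkyria 3", "戦場のヴァルキュリア3"]
flags = [has_japanese_script(s) for s in samples]
print(flags)  # [False, True]
```

Romanized text such as "Senjō" uses Latin letters with diacritics and is not flagged.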

In [192]:
data_train = read_wiki("wiki.train.tokens")
docs = list(data_train)
docs[0]

'= Valkyria Chronicles III = Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . <unk> the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " . The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . Character de

In [193]:
def remove_words(documents):
  new_documents = []
  for doc in documents:
    new_doc = remove_stopwords(doc)          
    new_doc = new_doc.replace("<unk>", "")  

    new_documents.append(new_doc)

  return new_documents

In [194]:
docs = remove_words(docs)

Repeat the steps above to retrain the model, this time with `language="multilingual"`

In [195]:
topic_model_multilanguage = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True)

In [197]:
topics_multilanguage, probs_multilanguage = topic_model_multilanguage.fit_transform(docs)


2021-07-28 07:30:45,631 - BERTopic - Transformed documents to Embeddings





2021-07-28 07:30:50,826 - BERTopic - Reduced dimensionality with UMAP
2021-07-28 07:30:50,888 - BERTopic - Clustered UMAP embeddings with HDBSCAN


In [198]:
topic_model_multilanguage.get_topic_info()

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 103 | -1_team_league_game_won |
| 1 | 0 | 98 | 0_war_government_ireland_military |
| 2 | 1 | 74 | 1_album_song_music_songs |
| 3 | 2 | 69 | 2_church_century_temple_haifa |
| 4 | 3 | 58 | 3_route_river_highway_road |
| 5 | 4 | 58 | 4_episode_television_season_rachel |
| 6 | 5 | 47 | 5_species_birds_genus_proteins |
| 7 | 6 | 36 | 6_game_hero_guitar_games |
| 8 | 7 | 34 | 7_king_nixon_election_died |
| 9 | 8 | 32 | 8_storm_hurricane_winds_cyclone |


In [199]:
topic_model_multilanguage.visualize_barchart(top_n_topics=10)