# Table of Contents

* [Setup Environment](#section1)
* [Functions](#section2)
* [Topic Modeling](#section3)
* [Results](#section4)
    * [1950-1959](#section_4_1)
        * [Extracting Topics](#section_4_1_1)
        * [Visualizations](#section_4_1_2)
    * [1960-1969](#section_4_2)
        * [Extracting Topics](#section_4_2_1)
        * [Visualizations](#section_4_2_2)
    * [1970-1979](#section_4_3)
        * [Extracting Topics](#section_4_3_1)
        * [Visualizations](#section_4_3_2)
    * [1980-1989](#section_4_4)
        * [Extracting Topics](#section_4_4_1)
        * [Visualizations](#section_4_4_2)
    * [1990-1999](#section_4_5)
        * [Extracting Topics](#section_4_5_1)
        * [Visualizations](#section_4_5_2)
    * [2000-2009](#section_4_6)
        * [Extracting Topics](#section_4_6_1)
        * [Visualizations](#section_4_6_2)
    * [2010-2019](#section_4_7)
        * [Extracting Topics](#section_4_7_1)
        * [Visualizations](#section_4_7_2)
    * [2020-2029](#section_4_8)
        * [Extracting Topics](#section_4_8_1)
        * [Visualizations](#section_4_8_2)

# Setup Environment <a class=anchor id=section1></a>

In [2]:
%%capture
!apt-get update
!apt-get install --reinstall build-essential --yes

In [3]:
%%capture
!pip install git+https://github.com/MaartenGr/BERTopic.git@407fd4fdf2e05e80019c1c217972bf3314a41040
!pip install farm-haystack
!pip install spacy
!pip install gensim
!pip install sagemaker_pyspark
!python -m spacy download en_core_web_sm

In [None]:
import re
import glob
import spacy
import gensim
import pickle
import logging
import pyspark
import pynndescent
import pandas as pd
import plotly.io as pio

from umap import UMAP
from bertopic import BERTopic
from nltk.corpus import stopwords
from haystack.nodes import PreProcessor
from gensim.utils import simple_preprocess
from nltk.corpus import PlaintextCorpusReader
from haystack.utils import convert_files_to_docs

pio.renderers.default = 'iframe'
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
pynndescent.rp_trees.FlatTree.__module__ = "pynndescent.rp_trees"
logging.getLogger("haystack.utils.preprocessing").setLevel(logging.ERROR)

In [None]:
import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('punkt')  # tokenizer data used by word_tokenize
nltk.download('wordnet')
nltk.download('omw-1.4')
token_pattern = re.compile(r"(?u)\b\w\w+\b")

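class LemmaTokenizer:
    """Tokenize a document with NLTK and lemmatize each token that is kept."""

    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, doc):
        # Keep tokens of two or more characters that start with a lowercase
        # letter and open with a run of at least two word characters.
        return [
            self.wnl.lemmatize(t)
            for t in word_tokenize(doc)
            if len(t) >= 2 and re.match("[a-z].*", t) and token_pattern.match(t)
        ]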
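
The topic names later in this notebook contain fragments such as `wa` and `ha`. A likely cause is the order of operations: `CountVectorizer` applies its stop-word list to the tokens returned by the custom tokenizer, and the lemmatizer appears to turn "was" and "has" into forms that are no longer on that list. The sketch below imitates this interaction with a deliberately crude stand-in; `toy_lemmatize` and the small `STOP_WORDS` set are hypothetical and are not NLTK or scikit-learn code.

```python
import re

# Hypothetical miniature stop-word list standing in for the "english" list.
STOP_WORDS = {"was", "has", "the", "and"}
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def toy_lemmatize(token):
    """Crude stand-in for a noun lemmatizer: strip one trailing 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 2 else token


def lemmatize_then_filter(text):
    """Mimic the notebook's order: lemmatize first, filter stop words second."""
    tokens = [toy_lemmatize(t) for t in TOKEN_PATTERN.findall(text.lower())]
    return [t for t in tokens if t not in STOP_WORDS]


def filter_then_lemmatize(text):
    """Alternative order: drop stop words before lemmatizing."""
    tokens = [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in STOP_WORDS]
    return [toy_lemmatize(t) for t in tokens]


sentence = "The union was strong and has members"
print(lemmatize_then_filter(sentence))
print(filter_then_lemmatize(sentence))
```

In the first ordering the truncated forms slip past the stop-word list; in the second they never appear. Extending the stop-word list with the lemmatized forms would be another way to get the same effect.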

# Functions <a class=anchor id=section2></a>

In [6]:
def pre_process(source):
    """Load every file under `source` and split it into chunks of up to 500 words."""
    all_docs = convert_files_to_docs(dir_path=source)
    preprocessor = PreProcessor(
        clean_empty_lines=True,
        clean_whitespace=True,
        clean_header_footer=False,
        split_by="word",
        split_length=500,
        split_respect_sentence_boundary=True,
    )
    processed_docs = preprocessor.process(all_docs)
    print(f"Number of input files: {len(all_docs)}\nNumber of output files: {len(processed_docs)}")
    # Keep only the text of each chunk, since the model is trained on plain strings.
    return [item.content for item in processed_docs]
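
To make the splitting settings above more concrete, here is a simplified sketch of word-based chunking that keeps sentences intact. It is not Haystack's implementation: the `split_by_words` helper and its naive sentence splitter are assumptions made for illustration only.

```python
import re


def split_by_words(text, split_length=500):
    """Greedily pack whole sentences into chunks of at most `split_length` words.

    A single sentence longer than the limit is kept whole in this sketch.
    """
    # Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and current_len + n_words > split_length:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six seven. Eight nine. Ten."
print(split_by_words(text, split_length=5))
```

With a limit of five words, the last two short sentences share a chunk while the first two each get their own. This is the same reason a single input file can yield several output documents.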

In [7]:
def training(docs):
    """Fit a BERTopic model on `docs` and return the model, topic ids, and probabilities."""
    vectorizer_model = CountVectorizer(stop_words="english", tokenizer=LemmaTokenizer())
    # Fix the random state of the UMAP model so that repeated runs are reproducible.
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=1)
    topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True,
                           nr_topics="auto", vectorizer_model=vectorizer_model, umap_model=umap_model)
    topics, probs = topic_model.fit_transform(docs)
    return topic_model, topics, probs

# Topic Modeling <a class=anchor id=section3></a>

In [23]:
sources = ['CLEANSED/1950-1959', 'CLEANSED/1960-1969', 'CLEANSED/1970-1979', 'CLEANSED/1980-1989', 
          'CLEANSED/1990-1999', 'CLEANSED/2000-2009', 'CLEANSED/2010-2019', 'CLEANSED/2020-2029']
topic_models, topics, probs = list(), list(), list()

In [24]:
for source in sources:
    docs = pre_process(source)
    topic_model, topic, prob = training(docs)
    topic_models.append(topic_model)
    topics.append(topic)
    probs.append(prob)

100%|██████████| 1052/1052 [00:01<00:00, 600.04docs/s]


Number of input files: 1052
Number of output files: 2171


2022-06-01 17:54:01,095 - BERTopic - Transformed documents to Embeddings
2022-06-01 17:54:11,515 - BERTopic - Reduced dimensionality
2022-06-01 17:54:11,807 - BERTopic - Clustered reduced embeddings
2022-06-01 17:54:41,153 - BERTopic - Reduced number of topics from 48 to 30
100%|██████████| 2900/2900 [00:04<00:00, 720.22docs/s]


Number of input files: 2900
Number of output files: 5096


2022-06-01 17:55:18,634 - BERTopic - Transformed documents to Embeddings
2022-06-01 17:55:49,905 - BERTopic - Reduced dimensionality
2022-06-01 17:55:51,457 - BERTopic - Clustered reduced embeddings
2022-06-01 17:56:56,236 - BERTopic - Reduced number of topics from 86 to 33
100%|██████████| 1258/1258 [00:04<00:00, 314.05docs/s]


Number of input files: 1258
Number of output files: 3592


2022-06-01 17:57:21,524 - BERTopic - Transformed documents to Embeddings
2022-06-01 17:57:42,059 - BERTopic - Reduced dimensionality
2022-06-01 17:57:42,966 - BERTopic - Clustered reduced embeddings
2022-06-01 17:58:35,246 - BERTopic - Reduced number of topics from 71 to 34
100%|██████████| 1771/1771 [00:02<00:00, 694.82docs/s]


Number of input files: 1771
Number of output files: 4032


2022-06-01 17:59:02,904 - BERTopic - Transformed documents to Embeddings
2022-06-01 17:59:27,091 - BERTopic - Reduced dimensionality
2022-06-01 17:59:28,387 - BERTopic - Clustered reduced embeddings
2022-06-01 18:00:23,446 - BERTopic - Reduced number of topics from 88 to 63
100%|██████████| 1520/1520 [00:02<00:00, 731.18docs/s]


Number of input files: 1520
Number of output files: 3447


2022-06-01 18:00:46,403 - BERTopic - Transformed documents to Embeddings
2022-06-01 18:01:06,563 - BERTopic - Reduced dimensionality
2022-06-01 18:01:07,964 - BERTopic - Clustered reduced embeddings
2022-06-01 18:01:53,722 - BERTopic - Reduced number of topics from 101 to 70
100%|██████████| 2010/2010 [00:03<00:00, 650.70docs/s]


Number of input files: 2010
Number of output files: 4989


2022-06-01 18:02:26,829 - BERTopic - Transformed documents to Embeddings
2022-06-01 18:02:40,435 - BERTopic - Reduced dimensionality
2022-06-01 18:02:44,072 - BERTopic - Clustered reduced embeddings
2022-06-01 18:03:53,705 - BERTopic - Reduced number of topics from 140 to 110
100%|██████████| 2852/2852 [00:05<00:00, 514.80docs/s]


Number of input files: 2852
Number of output files: 8442


2022-06-01 18:04:48,240 - BERTopic - Transformed documents to Embeddings
2022-06-01 18:05:10,564 - BERTopic - Reduced dimensionality
2022-06-01 18:05:27,690 - BERTopic - Clustered reduced embeddings
2022-06-01 18:07:32,566 - BERTopic - Reduced number of topics from 214 to 149
100%|██████████| 412/412 [00:01<00:00, 395.33docs/s]


Number of input files: 412
Number of output files: 1536


2022-06-01 18:07:42,518 - BERTopic - Transformed documents to Embeddings
2022-06-01 18:07:48,898 - BERTopic - Reduced dimensionality
2022-06-01 18:07:49,112 - BERTopic - Clustered reduced embeddings
2022-06-01 18:08:12,278 - BERTopic - Reduced number of topics from 48 to 31
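
The log above can be condensed into a per-decade summary. The short sketch below uses only the figures printed in that output; the `runs` list and the derived field names are introduced here for illustration.

```python
# (decade, input files, output chunks, topics before reduction, topics after)
# All figures are copied from the log output above.
runs = [
    ("1950-1959", 1052, 2171, 48, 30),
    ("1960-1969", 2900, 5096, 86, 33),
    ("1970-1979", 1258, 3592, 71, 34),
    ("1980-1989", 1771, 4032, 88, 63),
    ("1990-1999", 1520, 3447, 101, 70),
    ("2000-2009", 2010, 4989, 140, 110),
    ("2010-2019", 2852, 8442, 214, 149),
    ("2020-2029", 412, 1536, 48, 31),
]

summary = {}
for decade, n_files, n_chunks, before, after in runs:
    summary[decade] = {
        "chunks_per_file": round(n_chunks / n_files, 2),
        "topics_merged": before - after,
    }

for decade, stats in summary.items():
    print(f"{decade}: {stats['chunks_per_file']:.2f} chunks/file, "
          f"{stats['topics_merged']} topics merged")
```

The chunks-per-file ratio hints at how article length varies between decades, which is worth keeping in mind when comparing topic counts across periods.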


# Results <a class=anchor id=section4></a>

## 1950-1959 <a class=anchor id=section_4_1></a>

In [34]:
topic_model = topic_models[0]

### Extracting Topics <a class=anchor id=section_4_1_1></a>

After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents. 

In [35]:
freq = topic_model.get_topic_info()
freq.head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 742 | `-1_ha_wa_new_year` |
| 1 | 0 | 395 | `0_union_automation_company_permission` |
| 2 | 1 | 160 | `1_fund_share_stock_cent` |
| 3 | 2 | 147 | `2_world_president_people_u` |
| 4 | 3 | 115 | `3_soviet_production_wa_industry` |


> -1 refers to all outliers and should typically be ignored.
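
As a rough illustration of how much of the decade ends up in the outlier topic, the sketch below reuses the counts from the table above and the number of output documents logged for 1950-1959. It assumes that every document is assigned to exactly one topic; the variable names are introduced here for illustration.

```python
# Top rows of the 1950-1959 topic table above, as (topic id, document count).
topic_counts = [(-1, 742), (0, 395), (1, 160), (2, 147), (3, 115)]
total_docs = 2171  # "Number of output files" logged for CLEANSED/1950-1959

outlier_count = sum(count for topic, count in topic_counts if topic == -1)
real_topics = [(topic, count) for topic, count in topic_counts if topic != -1]

print(f"Outlier share: {outlier_count / total_docs:.1%}")
print(real_topics)
```

On the DataFrame itself, the equivalent filter would be `freq[freq.Topic != -1]`.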

Next, let's take a look at the most frequent topic that was generated:

In [36]:
topic_model.get_topic(0)  # Select the most frequent topic

[('union', 0.02258703508946951),
 ('automation', 0.01894151059480157),
 ('company', 0.016066923744205434),
 ('permission', 0.015684745954614554),
 ('industry', 0.015603473422202746),
 ('said', 0.015173536410500088),
 ('machine', 0.014700665517533169),
 ('new', 0.014677331760689442),
 ('worker', 0.013759861054540554),
 ('wa', 0.013452695000182507)]

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP, which is why `training` fixes UMAP's `random_state`.

### Visualizations <a class=anchor id=section_4_1_2></a>

In [37]:
topic_model.visualize_topics()

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [38]:
topic_model.visualize_barchart(top_n_topics=6)
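
For readers unfamiliar with c-TF-IDF: BERTopic's documentation describes it as the frequency of a word within a topic, multiplied by log(1 + A / f_x), where f_x is the frequency of the word across all topics and A is the average number of words per topic. The following is a minimal pure-Python sketch of that weighting on made-up token lists. The `c_tf_idf` helper and the `topic_docs` data are illustrative assumptions, not BERTopic code.

```python
import math
from collections import Counter


def c_tf_idf(topic_docs):
    """Return {topic: {word: score}} using a class-based TF-IDF weighting.

    `topic_docs` maps a topic id to the list of tokens from all of its
    documents joined together.
    """
    counts = {topic: Counter(tokens) for topic, tokens in topic_docs.items()}
    # f_x: frequency of each word across all topics.
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for topic, counter in counts.items():
        n_words = sum(counter.values())
        scores[topic] = {
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in counter.items()
        }
    return scores


# Hypothetical toy data: two topics with five tokens each.
topic_docs = {
    0: ["union", "worker", "union", "company", "new"],
    1: ["stock", "share", "stock", "fund", "new"],
}
scores = c_tf_idf(topic_docs)
for topic, words in scores.items():
    ranked = sorted(words, key=words.get, reverse=True)
    print(topic, ranked[:2])
```

A word such as "new", which occurs in both toy topics, receives a lower weight than words concentrated in a single topic. That is the behavior the bar charts rely on.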

The topics that were created can be hierarchically reduced. In order to understand the potential hierarchical structure of the topics, we create clusters and visualize how they relate to one another.

In [39]:
topic_model.visualize_hierarchy(top_n_topics=50)

Having generated topic embeddings through both c-TF-IDF and embeddings, we can create a similarity matrix by computing the cosine similarity between each pair of topic embeddings. The result is a matrix indicating how similar the topics are to each other.

In [40]:
topic_model.visualize_heatmap(width=1000, height=1000)
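
The similarity matrix behind such a heatmap can be sketched in a few lines. The example below is a plain-Python illustration on hypothetical three-dimensional vectors; the helper names are introduced here, and the real topic embeddings have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities as a list of rows."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical 3-dimensional topic embeddings, for illustration only.
topic_embeddings = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
matrix = similarity_matrix(topic_embeddings)
for row in matrix:
    print([round(value, 2) for value in row])
```

The matrix is symmetric with ones on the diagonal, so only the cells on one side of the diagonal carry distinct information.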

Topics are represented by a number of words, starting with the most representative word. Each word has a c-TF-IDF score: the higher the score, the more representative the word is of the topic. Since the topic words are sorted by their c-TF-IDF score, the scores decline with each word that is added. At some point, adding words to the topic representation only marginally increases the total c-TF-IDF score and no longer benefits the representation.

To visualize this effect, we can plot the c-TF-IDF scores of each topic against the term rank of each word. The x-axis shows the term rank, where the word with the highest c-TF-IDF score has rank 1, and the y-axis shows the c-TF-IDF scores. The resulting plot shows how the c-TF-IDF score declines as words are added to the topic representation, and it lets you use the elbow method to select a suitable number of words per topic.

In [41]:
topic_model.visualize_term_rank()
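
One common way to locate an elbow numerically is to pick the point that lies furthest below the straight line joining the first and last scores. The sketch below applies that heuristic to the scores printed for topic 0 of 1950-1959. The `elbow_rank` helper is an assumption made for illustration and is not part of BERTopic.

```python
# c-TF-IDF scores of topic 0 for 1950-1959, copied from the output above
# (rounded to six decimals), ordered by term rank 1..10.
scores = [0.022587, 0.018942, 0.016067, 0.015685, 0.015603,
          0.015174, 0.014701, 0.014677, 0.013760, 0.013453]


def elbow_rank(values):
    """Return the 1-based rank lying furthest below the first-to-last chord."""
    n = len(values)
    slope = (values[-1] - values[0]) / (n - 1)
    # Vertical gap between the straight line and the actual score at each rank.
    gaps = [values[0] + slope * i - value for i, value in enumerate(values)]
    return gaps.index(max(gaps)) + 1


print(elbow_rank(scores))
```

Other elbow criteria exist and may pick a different rank, so the result is best treated as a starting point for inspecting the plot rather than a definitive cut-off.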

## 1960-1969 <a class=anchor id=section_4_2></a>

In [42]:
topic_model = topic_models[1]

### Extracting Topics <a class=anchor id=section_4_2_1></a>

After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents. 

In [43]:
freq = topic_model.get_topic_info()
freq.head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | 0 | 1890 | `0_new_union_mr_wa` |
| 1 | -1 | 1654 | `-1_wa_new_permission_ha` |
| 2 | 1 | 483 | `1_ship_maritime_union_marine` |
| 3 | 2 | 109 | `2_soviet_communist_united_state` |
| 4 | 3 | 98 | `3_theater_play_film_cartoon` |


> -1 refers to all outliers and should typically be ignored.

Next, let's take a look at the most frequent topic that was generated:

In [44]:
topic_model.get_topic(0)  # Select the most frequent topic

[('new', 0.015746051089954482),
 ('union', 0.014909770940270569),
 ('mr', 0.014369321028334037),
 ('wa', 0.014252653768070507),
 ('said', 0.014228811267299472),
 ('year', 0.01225456942312672),
 ('ha', 0.011896365191061553),
 ('permission', 0.011881308145662411),
 ('york', 0.01159832099349294),
 ('newspaper', 0.011227202973038928)]

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP, which is why `training` fixes UMAP's `random_state`.

### Visualizations <a class=anchor id=section_4_2_2></a>

In [45]:
topic_model.visualize_topics()

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [46]:
topic_model.visualize_barchart(top_n_topics=6)

The topics that were created can be hierarchically reduced. In order to understand the potential hierarchical structure of the topics, we create clusters and visualize how they relate to one another.

In [47]:
topic_model.visualize_hierarchy(top_n_topics=50)

Having generated topic embeddings through both c-TF-IDF and embeddings, we can create a similarity matrix by computing the cosine similarity between each pair of topic embeddings. The result is a matrix indicating how similar the topics are to each other.

In [48]:
topic_model.visualize_heatmap(width=1000, height=1000)

Topics are represented by a number of words, starting with the most representative word. Each word has a c-TF-IDF score: the higher the score, the more representative the word is of the topic. Since the topic words are sorted by their c-TF-IDF score, the scores decline with each word that is added. At some point, adding words to the topic representation only marginally increases the total c-TF-IDF score and no longer benefits the representation.

To visualize this effect, we can plot the c-TF-IDF scores of each topic against the term rank of each word. The x-axis shows the term rank, where the word with the highest c-TF-IDF score has rank 1, and the y-axis shows the c-TF-IDF scores. The resulting plot shows how the c-TF-IDF score declines as words are added to the topic representation, and it lets you use the elbow method to select a suitable number of words per topic.

In [49]:
topic_model.visualize_term_rank()

## 1970-1979 <a class=anchor id=section_4_3></a>

In [50]:
topic_model = topic_models[2]

### Extracting Topics <a class=anchor id=section_4_3_1></a>

After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents. 

In [51]:
freq = topic_model.get_topic_info()
freq.head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | 0 | 1912 | `0_st_net_income_earns` |
| 1 | -1 | 787 | `-1_wa_said_mr_ha` |
| 2 | 1 | 91 | `1_president_vice_mr_executive` |
| 3 | 2 | 66 | `2_port_ship_container_longshoreman` |
| 4 | 3 | 48 | `3_train_transit_railroad_passenger` |


> -1 refers to all outliers and should typically be ignored.

Next, let's take a look at the most frequent topic that was generated:

In [52]:
topic_model.get_topic(0)  # Select the most frequent topic

[('st', 0.018714139505333335),
 ('net', 0.01696200512427951),
 ('income', 0.016897261090038168),
 ('earns', 0.016686334081319223),
 ('shr', 0.014074489738208887),
 ('sale', 0.014041445756215016),
 ('new', 0.011130992020859438),
 ('share', 0.010725065026655196),
 ('qtr', 0.010432867793512973),
 ('mr', 0.009990037509544723)]

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP, which is why `training` fixes UMAP's `random_state`.

### Visualizations <a class=anchor id=section_4_3_2></a>

In [53]:
topic_model.visualize_topics()

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [54]:
topic_model.visualize_barchart(top_n_topics=6)

The topics that were created can be hierarchically reduced. In order to understand the potential hierarchical structure of the topics, we create clusters and visualize how they relate to one another.

In [55]:
topic_model.visualize_hierarchy(top_n_topics=50)

Having generated topic embeddings through both c-TF-IDF and embeddings, we can create a similarity matrix by computing the cosine similarity between each pair of topic embeddings. The result is a matrix indicating how similar the topics are to each other.

In [56]:
topic_model.visualize_heatmap(width=1000, height=1000)

Topics are represented by a number of words, starting with the most representative word. Each word has a c-TF-IDF score: the higher the score, the more representative the word is of the topic. Since the topic words are sorted by their c-TF-IDF score, the scores decline with each word that is added. At some point, adding words to the topic representation only marginally increases the total c-TF-IDF score and no longer benefits the representation.

To visualize this effect, we can plot the c-TF-IDF scores of each topic against the term rank of each word. The x-axis shows the term rank, where the word with the highest c-TF-IDF score has rank 1, and the y-axis shows the c-TF-IDF scores. The resulting plot shows how the c-TF-IDF score declines as words are added to the topic representation, and it lets you use the elbow method to select a suitable number of words per topic.

In [57]:
topic_model.visualize_term_rank()

## 1980-1989 <a class=anchor id=section_4_4></a>

In [58]:
topic_model = topic_models[3]

### Extracting Topics <a class=anchor id=section_4_4_1></a>

After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents. 

In [59]:
freq = topic_model.get_topic_info()
freq.head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 1184 | `-1_said_company_wa_ha` |
| 1 | 0 | 551 | `0_worker_union_japanese_car` |
| 2 | 1 | 173 | `1_exchange_trading_stock_market` |
| 3 | 2 | 168 | `2_bank_banking_customer_loan` |
| 4 | 3 | 129 | `3_president_vice_corp_named` |


> -1 refers to all outliers and should typically be ignored.

Next, let's take a look at the most frequent topic that was generated:

In [60]:
topic_model.get_topic(0)  # Select the most frequent topic

[('worker', 0.016937502732839067),
 ('union', 0.015982487771387072),
 ('japanese', 0.01530474674053234),
 ('car', 0.014110388755043961),
 ('job', 0.01317621178715519),
 ('company', 0.012707022032547824),
 ('gm', 0.012231676441400616),
 ('plant', 0.011406453260414985),
 ('said', 0.011028626266792421),
 ('auto', 0.010616661844398948)]

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP, which is why `training` fixes UMAP's `random_state`.

### Visualizations <a class=anchor id=section_4_4_2></a>

In [61]:
topic_model.visualize_topics()

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [62]:
topic_model.visualize_barchart(top_n_topics=6)

The topics that were created can be hierarchically reduced. In order to understand the potential hierarchical structure of the topics, we create clusters and visualize how they relate to one another.

In [63]:
topic_model.visualize_hierarchy(top_n_topics=50)

Having generated topic embeddings through both c-TF-IDF and embeddings, we can create a similarity matrix by computing the cosine similarity between each pair of topic embeddings. The result is a matrix indicating how similar the topics are to each other.

In [64]:
topic_model.visualize_heatmap(width=1000, height=1000)

Topics are represented by a number of words, starting with the most representative word. Each word has a c-TF-IDF score: the higher the score, the more representative the word is of the topic. Since the topic words are sorted by their c-TF-IDF score, the scores decline with each word that is added. At some point, adding words to the topic representation only marginally increases the total c-TF-IDF score and no longer benefits the representation.

To visualize this effect, we can plot the c-TF-IDF scores of each topic against the term rank of each word. The x-axis shows the term rank, where the word with the highest c-TF-IDF score has rank 1, and the y-axis shows the c-TF-IDF scores. The resulting plot shows how the c-TF-IDF score declines as words are added to the topic representation, and it lets you use the elbow method to select a suitable number of words per topic.

In [65]:
topic_model.visualize_term_rank()

## 1990-1999 <a class=anchor id=section_4_5></a>

In [66]:
topic_model = topic_models[4]

### Extracting Topics <a class=anchor id=section_4_5_1></a>

After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents. 

In [67]:
freq = topic_model.get_topic_info()
freq.head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 783 | `-1_said_company_mr_ha` |
| 1 | 0 | 301 | `0_stock_share_exchange_million` |
| 2 | 1 | 214 | `1_job_worker_union_labor` |
| 3 | 2 | 146 | `2_faa_pilot_controller_airline` |
| 4 | 3 | 129 | `3_available___` |


> -1 refers to all outliers and should typically be ignored.

Next, let's take a look at the most frequent topic that was generated:

In [68]:
topic_model.get_topic(0)  # Select the most frequent topic

[('stock', 0.02957507036574429),
 ('share', 0.026690717752955343),
 ('exchange', 0.02542833258089495),
 ('million', 0.0210914236907901),
 ('trading', 0.018356041064508073),
 ('market', 0.01784634522560513),
 ('company', 0.013508741355036012),
 ('corp', 0.012111536512518437),
 ('percent', 0.0115641389977353),
 ('initial', 0.011409430921860817)]

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP, which is why `training` fixes UMAP's `random_state`.

### Visualizations <a class=anchor id=section_4_5_2></a>

In [69]:
topic_model.visualize_topics()

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [70]:
topic_model.visualize_barchart(top_n_topics=6)

The topics that were created can be hierarchically reduced. In order to understand the potential hierarchical structure of the topics, we create clusters and visualize how they relate to one another.

In [71]:
topic_model.visualize_hierarchy(top_n_topics=50)

Having generated topic embeddings through both c-TF-IDF and embeddings, we can create a similarity matrix by computing the cosine similarity between each pair of topic embeddings. The result is a matrix indicating how similar the topics are to each other.

In [72]:
topic_model.visualize_heatmap(width=1000, height=1000)

Topics are represented by a number of words, starting with the most representative word. Each word has a c-TF-IDF score: the higher the score, the more representative the word is of the topic. Since the topic words are sorted by their c-TF-IDF score, the scores decline with each word that is added. At some point, adding words to the topic representation only marginally increases the total c-TF-IDF score and no longer benefits the representation.

To visualize this effect, we can plot the c-TF-IDF scores of each topic against the term rank of each word. The x-axis shows the term rank, where the word with the highest c-TF-IDF score has rank 1, and the y-axis shows the c-TF-IDF scores. The resulting plot shows how the c-TF-IDF score declines as words are added to the topic representation, and it lets you use the elbow method to select a suitable number of words per topic.

In [73]:
topic_model.visualize_term_rank()

## 2000-2009 <a class=anchor id=section_4_6></a>

In [74]:
topic_model = topic_models[5]

### Extracting Topics <a class=anchor id=section_4_6_1></a>

After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents. 

In [75]:
freq = topic_model.get_topic_info()
freq.head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 987 | `-1_said_company_ha_year` |
| 1 | 0 | 443 | `0_exchange_stock_trading_share` |
| 2 | 1 | 260 | `1_home_house_light_control` |
| 3 | 2 | 202 | `2_fbi_al_intelligence_wa` |
| 4 | 3 | 175 | `3_airline_pilot_plane_passenger` |


> -1 refers to all outliers and should typically be ignored.

Next, let's take a look at the most frequent topic that was generated:

In [76]:
topic_model.get_topic(0)  # Select the most frequent topic

[('exchange', 0.015822666399611295),
 ('stock', 0.01319733188898422),
 ('trading', 0.01273277773264471),
 ('share', 0.008451472840802426),
 ('company', 0.008074358251292503),
 ('fund', 0.007975094370507788),
 ('million', 0.007290083820686794),
 ('customer', 0.007286718768915403),
 ('market', 0.007112919355901209),
 ('board', 0.007018075686678832)]

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP, which is why `training` fixes UMAP's `random_state`.

### Visualizations <a class=anchor id=section_4_6_2></a>

In [77]:
topic_model.visualize_topics()

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [78]:
topic_model.visualize_barchart(top_n_topics=6)

The topics that were created can be hierarchically reduced. In order to understand the potential hierarchical structure of the topics, we create clusters and visualize how they relate to one another.

In [79]:
topic_model.visualize_hierarchy(top_n_topics=50)

Having generated topic embeddings through both c-TF-IDF and embeddings, we can create a similarity matrix by computing the cosine similarity between each pair of topic embeddings. The result is a matrix indicating how similar the topics are to each other.

In [80]:
topic_model.visualize_heatmap(width=1000, height=1000)

Topics are represented by a number of words, starting with the most representative word. Each word has a c-TF-IDF score: the higher the score, the more representative the word is of the topic. Since the topic words are sorted by their c-TF-IDF score, the scores decline with each word that is added. At some point, adding words to the topic representation only marginally increases the total c-TF-IDF score and no longer benefits the representation.

To visualize this effect, we can plot the c-TF-IDF scores of each topic against the term rank of each word. The x-axis shows the term rank, where the word with the highest c-TF-IDF score has rank 1, and the y-axis shows the c-TF-IDF scores. The resulting plot shows how the c-TF-IDF score declines as words are added to the topic representation, and it lets you use the elbow method to select a suitable number of words per topic.

In [81]:
topic_model.visualize_term_rank()

## 2010-2019 <a class=anchor id=section_4_7></a>

In [82]:
topic_model = topic_models[6]

### Extracting Topics <a class=anchor id=section_4_7_1></a>

After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents. 

In [83]:
freq = topic_model.get_topic_info()
freq.head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 2267 | `-1_mr_wa_said_ha` |
| 1 | 0 | 573 | `0_robot_machine_ai_job` |
| 2 | 1 | 250 | `1_home_light_smart_data` |
| 3 | 2 | 239 | `2_amazon_store_grocery_warehouse` |
| 4 | 3 | 224 | `3_trading_bank_stock_market` |


> -1 refers to all outliers and should typically be ignored.

Next, let's take a look at the most frequent topic that was generated:

In [84]:
topic_model.get_topic(0)  # Select the most frequent topic

[('robot', 0.014084814105966613),
 ('machine', 0.008648981793178285),
 ('ai', 0.008541941307799046),
 ('job', 0.008028270892704815),
 ('student', 0.007908039271858994),
 ('college', 0.007875337218539597),
 ('china', 0.007698299797308915),
 ('human', 0.007531851787424907),
 ('robotics', 0.007194358880961464),
 ('chinese', 0.007176531265016953)]

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP, which is why `training` fixes UMAP's `random_state`.

### Visualizations <a class=anchor id=section_4_7_2></a>

In [85]:
topic_model.visualize_topics()

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [86]:
topic_model.visualize_barchart(top_n_topics=6)

The topics that were created can be hierarchically reduced. In order to understand the potential hierarchical structure of the topics, we create clusters and visualize how they relate to one another.

In [87]:
topic_model.visualize_hierarchy(top_n_topics=50)

Having generated topic embeddings through both c-TF-IDF and embeddings, we can create a similarity matrix by computing the cosine similarity between each pair of topic embeddings. The result is a matrix indicating how similar the topics are to each other.

In [88]:
topic_model.visualize_heatmap(width=1000, height=1000)

Topics are represented by a number of words, starting with the most representative word. Each word has a c-TF-IDF score: the higher the score, the more representative the word is of the topic. Since the topic words are sorted by their c-TF-IDF score, the scores decline with each word that is added. At some point, adding words to the topic representation only marginally increases the total c-TF-IDF score and no longer benefits the representation.

To visualize this effect, we can plot the c-TF-IDF scores of each topic against the term rank of each word. The x-axis shows the term rank, where the word with the highest c-TF-IDF score has rank 1, and the y-axis shows the c-TF-IDF scores. The resulting plot shows how the c-TF-IDF score declines as words are added to the topic representation, and it lets you use the elbow method to select a suitable number of words per topic.

In [89]:
topic_model.visualize_term_rank()

## 2020-2029 <a class=anchor id=section_4_8></a>

In [90]:
topic_model = topic_models[7]

### Extracting Topics <a class=anchor id=section_4_8_1></a>

After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents. 

In [91]:
freq = topic_model.get_topic_info()
freq.head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 374 | `-1_said_wa_ha_company` |
| 1 | 0 | 353 | `0_mr_wa_ha_worker` |
| 2 | 1 | 73 | `1_family_state_death_case` |
| 3 | 2 | 69 | `2_yang_mr_campaign_wa` |
| 4 | 3 | 62 | `3_facebook_app_ad_people` |


> -1 refers to all outliers and should typically be ignored.

Next, let's take a look at the most frequent topic that was generated:

In [92]:
topic_model.get_topic(0)  # Select the most frequent topic

[('mr', 0.01577201417581109),
 ('wa', 0.014903646200986236),
 ('ha', 0.013648622843988177),
 ('worker', 0.01311366927186684),
 ('china', 0.012913799858483007),
 ('job', 0.012315253477634389),
 ('state', 0.012279857167462655),
 ('economy', 0.012097470344112616),
 ('biden', 0.011932502697408521),
 ('said', 0.011401015943806238)]

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP, which is why `training` fixes UMAP's `random_state`.

### Visualizations <a class=anchor id=section_4_8_2></a>

In [93]:
topic_model.visualize_topics()

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [94]:
topic_model.visualize_barchart(top_n_topics=6)

The topics that were created can be hierarchically reduced. In order to understand the potential hierarchical structure of the topics, we create clusters and visualize how they relate to one another.

In [95]:
topic_model.visualize_hierarchy(top_n_topics=50)

Having generated topic embeddings through both c-TF-IDF and embeddings, we can create a similarity matrix by computing the cosine similarity between each pair of topic embeddings. The result is a matrix indicating how similar the topics are to each other.

In [96]:
topic_model.visualize_heatmap(width=1000, height=1000)

Topics are represented by a number of words, starting with the most representative word. Each word has a c-TF-IDF score: the higher the score, the more representative the word is of the topic. Since the topic words are sorted by their c-TF-IDF score, the scores decline with each word that is added. At some point, adding words to the topic representation only marginally increases the total c-TF-IDF score and no longer benefits the representation.

To visualize this effect, we can plot the c-TF-IDF scores of each topic against the term rank of each word. The x-axis shows the term rank, where the word with the highest c-TF-IDF score has rank 1, and the y-axis shows the c-TF-IDF scores. The resulting plot shows how the c-TF-IDF score declines as words are added to the topic representation, and it lets you use the elbow method to select a suitable number of words per topic.

In [97]:
topic_model.visualize_term_rank()