# Table of Contents

* [Setup Environment](#section1)
* [Data](#section2)
* [Basic Topic Model](#section3)
* [Extracting Topics](#section4)
* [Topics over Time](#section5)
* [Visualize Topics over Time](#section6)

# Setup Environment <a class=anchor id=section1></a>

In [3]:
%%capture
!apt-get update
!apt-get install --reinstall build-essential --yes

In [4]:
%%capture
!pip install git+https://github.com/MaartenGr/BERTopic.git@407fd4fdf2e05e80019c1c217972bf3314a41040
!pip install farm-haystack
!pip install spacy
!pip install gensim
!pip install sagemaker_pyspark
!python -m spacy download en_core_web_sm

In [4]:
import re
import pickle
import logging
import pandas as pd
import plotly.io as pio

from umap import UMAP
from bertopic import BERTopic
from haystack.nodes import PreProcessor
from nltk.corpus import PlaintextCorpusReader
from haystack.utils import convert_files_to_docs

pio.renderers.default='iframe'
logging.getLogger("haystack.utils.preprocessing").setLevel(logging.ERROR)

INFO - haystack.modeling.model.optimization -  apex not found, won't use it. See https://nvidia.github.io/apex/
ERROR - root -  Failed to import 'magic' (from 'python-magic' and 'python-magic-bin' on Windows). FileTypeClassifier will not perform mimetype detection on extensionless files. Please make sure the necessary OS libraries are installed if you need this functionality.


In [5]:
import nltk

from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('wordnet')
nltk.download('omw-1.4')
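# Same token pattern that CountVectorizer uses by default: two or more word characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")

class LemmaTokenizer:
    """Tokenize a document with NLTK and lemmatize every kept token with WordNet."""

    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, doc):
        # Keep tokens of at least two characters that start with a lowercase letter
        return [
            self.wnl.lemmatize(t)
            for t in word_tokenize(doc)
            if len(t) >= 2 and re.match("[a-z]", t) and token_pattern.match(t)
        ]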

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


# Data <a class=anchor id=section2></a>

In [1]:
!rm -rf `find -type d -name .ipynb_checkpoints`

In [2]:
sources = ['CLEANSED/1950-1959', 'CLEANSED/1960-1969', 'CLEANSED/1970-1979', 'CLEANSED/1980-1989', 
          'CLEANSED/1990-1999', 'CLEANSED/2000-2009', 'CLEANSED/2010-2019', 'CLEANSED/2020-2029']

In [8]:
docs = []
timestamps = []

# The preprocessing settings are identical for every decade, so build the preprocessor once
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=500,
    split_respect_sentence_boundary=True,
)

for source in sources:
    decade = source.split("/")[1]  # e.g. "1950-1959"
    print(f"Processing {decade}")
    all_docs = convert_files_to_docs(dir_path=source)
    processed_docs = preprocessor.process(all_docs)
    print(f"Number of input files: {len(all_docs)}\nNumber of output files: {len(processed_docs)}")
    docs.extend(item.content for item in processed_docs)
    # Every chunk is stamped with the last year of its decade, e.g. "1959"
    timestamps.extend([decade.split("-")[1]] * len(processed_docs))

Processing 1950-1959


100%|██████████| 1051/1051 [00:01<00:00, 794.79docs/s]


Number of input files: 1051
Number of output files: 2168
Processing 1960-1969


100%|██████████| 2900/2900 [00:02<00:00, 971.53docs/s] 


Number of input files: 2900
Number of output files: 5093
Processing 1970-1979


100%|██████████| 1258/1258 [00:03<00:00, 341.76docs/s]


Number of input files: 1258
Number of output files: 3589
Processing 1980-1989


100%|██████████| 1771/1771 [00:01<00:00, 1003.97docs/s]


Number of input files: 1771
Number of output files: 4032
Processing 1990-1999


100%|██████████| 1520/1520 [00:01<00:00, 1065.39docs/s]


Number of input files: 1520
Number of output files: 3447
Processing 2000-2009


100%|██████████| 2010/2010 [00:02<00:00, 925.38docs/s]


Number of input files: 2010
Number of output files: 4989
Processing 2010-2019


100%|██████████| 2852/2852 [00:03<00:00, 791.48docs/s]


Number of input files: 2852
Number of output files: 8442
Processing 2020-2029


100%|██████████| 412/412 [00:00<00:00, 543.05docs/s]

Number of input files: 412
Number of output files: 1536





In [9]:
print(len(docs))
print(len(timestamps))

33296
33296
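
As a quick sanity check, the per-decade chunk counts printed above should add up to the total number of documents. The sketch below copies those counts by hand from the output; the name `chunks_per_decade` is only illustrative and is not part of the notebook.

```python
# Number of output chunks per decade, copied from the preprocessing output above.
chunks_per_decade = {
    "1950-1959": 2168,
    "1960-1969": 5093,
    "1970-1979": 3589,
    "1980-1989": 4032,
    "1990-1999": 3447,
    "2000-2009": 4989,
    "2010-2019": 8442,
    "2020-2029": 1536,
}

# The total should equal len(docs) and len(timestamps).
total_chunks = sum(chunks_per_decade.values())
print(total_chunks)

# Each chunk is labelled with the end year of its decade, so the
# timestamps list contains only these eight distinct values.
timestamp_labels = [decade.split("-")[1] for decade in chunks_per_decade]
print(timestamp_labels)
```

Because every chunk inherits the label of its folder, the temporal resolution of the later analysis is one point per decade.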


# Basic Topic Model <a class=anchor id=section3></a>

To perform Dynamic Topic Modeling with BERTopic, we first need to create a basic topic model using all articles. The temporal aspect is ignored for now, as we are only interested in the topics that reside in those articles.

In [10]:
# Lemmatize tokens and drop English stop words when building the topic representations
vectorizer_model = CountVectorizer(stop_words="english", tokenizer=LemmaTokenizer())
# Set the random state in the UMAP model to prevent stochastic behavior
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=1)
# nr_topics="auto" lets BERTopic reduce the number of topics automatically after clustering
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True,
                       nr_topics="auto", vectorizer_model=vectorizer_model, umap_model=umap_model)
topics, probs = topic_model.fit_transform(docs)

2022-06-02 06:30:36,328 - BERTopic - Transformed documents to Embeddings
2022-06-02 06:31:38,492 - BERTopic - Reduced dimensionality
2022-06-02 06:42:35,189 - BERTopic - Clustered reduced embeddings
2022-06-02 06:50:05,851 - BERTopic - Reduced number of topics from 487 to 362


In [11]:
topic_model.save("bert_dtm")


Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.



# Extracting Topics <a class=anchor id=section4></a>

We can then extract the most frequent topics:

In [12]:
freq = topic_model.get_topic_info()
freq.head(10)

|   | Topic | Count | Name |
|---|-------|-------|------|
| 0 | -1 | 11520 | -1_wa_said_job_company |
| 1 | 0 | 786 | 0_exchange_trading_stock_market |
| 2 | 1 | 719 | 1_ship_port_cargo_longshoreman |
| 3 | 2 | 608 | 2_pilot_airline_plane_flight |
| 4 | 3 | 600 | 3_gm_steel_car_ford |
| 5 | 4 | 577 | 4_trump_republican_democratic_democrat |
| 6 | 5 | 516 | 5_hospital_patient_health_doctor |
| 7 | 6 | 428 | 6_car_selfdriving_vehicle_driver |
| 8 | 7 | 343 | 7_ibm_software_microsoft_computer |
| 9 | 8 | 341 | 8_student_school_education_college |


> -1 refers to all outliers and should typically be ignored. 
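
To get a feel for how much of the corpus ends up in the outlier topic, the sketch below recomputes its share from the counts shown above. The `topic_counts` dictionary is a hand-copied stand-in for the first rows of `freq`, not output produced by the model.

```python
# Topic -> document count, copied from the first rows of the table above.
topic_counts = {-1: 11520, 0: 786, 1: 719, 2: 608, 3: 600}
total_docs = 33296  # len(docs) reported earlier

# Share of documents that were not assigned to any topic.
outlier_share = topic_counts[-1] / total_docs
print(f"Outliers: {outlier_share:.1%} of all documents")

# Drop the outlier topic before ranking the remaining topics by size.
real_topics = {topic: count for topic, count in topic_counts.items() if topic != -1}
largest_topic = max(real_topics, key=real_topics.get)
print(f"Largest real topic: {largest_topic}")
```

On the actual DataFrame, the same filter can be written as `freq[freq["Topic"] != -1]`.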

Next, let's take a look at one of the frequent topics that was generated:

In [13]:
topic_nr = freq.iloc[4]["Topic"]  # Row 4 of freq holds topic 3, one of the most frequent topics
topic_model.get_topic(topic_nr)   # Returns the top words of that topic with their c-TF-IDF scores

[('gm', 0.017653390217818362),
 ('steel', 0.013759767702444145),
 ('car', 0.012598195314218553),
 ('ford', 0.01249254212854796),
 ('auto', 0.010309827941324125),
 ('motor', 0.009743642008408592),
 ('plant', 0.009568833052719059),
 ('japanese', 0.008050828379352693),
 ('toyota', 0.007534728555264),
 ('assembly', 0.006684561486707328)]

We can visualize the basic topics that were created with the Intertopic Distance Map. This allows us to judge visually whether the basic topics are sufficient before proceeding to create the topics over time.

In [28]:
topic_model.visualize_topics()

# Topics over Time <a class=anchor id=section5></a>

Before we start with the Dynamic Topic Modeling step, it is important that you are satisfied with the topics that were created previously. We are going to be using those specific topics as a base for Dynamic Topic Modeling. Thus, this step will essentially show you how the topics that were defined previously have evolved over time. 

There are a few important parameters that you should take note of, namely:
* `docs`
  * These are the articles that we are using
* `topics`
  * The topics that we have created before
* `timestamps`
  * The timestamp of each article/document
* `global_tuning`
  * Whether to average the topic representation of a topic at time *t* with its global topic representation
* `evolution_tuning`
  * Whether to average the topic representation of a topic at time *t* with the topic representation of that topic at time *t-1*
* `nr_bins`
  * The number of bins to put our timestamps into. It is computationally inefficient to extract the topics at thousands of different timestamps. Therefore, it is advised to keep this value below 20. 
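
To make the two tuning options more concrete, here is a minimal sketch of the averaging they describe, using plain Python lists in place of the model's topic-term weights. The vocabulary, the weights, and the helper `average` are made up for illustration. This is not BERTopic's internal code, and the library may normalize or order these steps differently.

```python
def average(a, b):
    """Element-wise mean of two equally long weight vectors."""
    return [(x + y) / 2 for x, y in zip(a, b)]

# Hypothetical weights of one topic over the words ["gm", "steel", "toyota"].
global_repr = [0.5, 0.25, 0.25]  # representation fitted on all documents
repr_t0 = [0.25, 0.75, 0.0]      # representation at the first time bin
repr_t1 = [0.75, 0.0, 0.25]      # representation at the second time bin

# evolution_tuning: pull the representation at time t towards the one at t-1.
evolution_tuned_t1 = average(repr_t1, repr_t0)

# global_tuning: pull the representation at time t towards the global one.
global_tuned_t1 = average(repr_t1, global_repr)

print(evolution_tuned_t1)
print(global_tuned_t1)
```

In this toy example, "steel" keeps a non-zero weight in the second bin under both options even though its raw weight there is zero, which illustrates the smoothing effect of the averaging.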


In [15]:
print(len(docs))
print(len(topics))
print(len(timestamps))

33296
33296
33296


In [16]:
topics_over_time = topic_model.topics_over_time(docs=docs, 
                                                topics=topics, 
                                                timestamps=timestamps, 
                                                global_tuning=True, 
                                                evolution_tuning=True, 
                                                nr_bins=8)

8it [02:35, 19.38s/it]


In [17]:
topics_over_time

|   | Topic | Words | Frequency | Timestamp |
|---|-------|-------|-----------|-----------|
| 0 | -1 | permission, reproduced, copyright, reproduction, prohibited | 971 | 1958-12-06 10:22:04.800 |
| 1 | 0 | exchange, stock, margin, walston, trading | 13 | 1958-12-06 10:22:04.800 |
| 2 | 1 | cargo, ship, port, longshoreman, container | 79 | 1958-12-06 10:22:04.800 |
| 3 | 2 | airline, airport, traffic, pilot, plane | 16 | 1958-12-06 10:22:04.800 |
| 4 | 3 | steel, ford, mcdonald, union, gm | 58 | 1958-12-06 10:22:04.800 |
| ... | ... | ... | ... | ... |
| 1577 | 335 | helsinki, browne, carbon, kalasatama, neighborhood | 3 | 2020-04-02 00:00:00.000 |
| 1578 | 343 | smoke, alarm, nest, detector, smart | 1 | 2020-04-02 00:00:00.000 |
| 1579 | 353 | resister, nettie, aunt, gwen, novel | 10 | 2020-04-02 00:00:00.000 |
| 1580 | 357 | medicaid, higherincome, enrolled, benefit, requirement | 7 | 2020-04-02 00:00:00.000 |
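
The `Timestamp` values above are bin edges rather than dates of individual articles. They are consistent with equal-width binning of the eight decade labels (parsed as 1 January of each year), with the lowest edge moved down by 0.1% of the total range, as documented for `pandas.cut`. The sketch below reproduces that arithmetic with the standard library only. The helper name `left_bin_edges` is hypothetical, and that BERTopic bins the timestamps exactly this way is an assumption based on the output.

```python
from datetime import datetime

def left_bin_edges(start, end, nr_bins):
    """Left edges of nr_bins equal-width bins between start and end.

    The first edge is lowered by 0.1% of the range so that the minimum
    value falls inside the first bin (assumed to mirror pandas.cut).
    """
    span = end - start
    edges = [start + span * i / nr_bins for i in range(nr_bins)]
    edges[0] = start - span / 1000
    return edges

# Smallest and largest decade labels, "1959" and "2029", parsed as dates.
edges = left_bin_edges(datetime(1959, 1, 1), datetime(2029, 1, 1), nr_bins=8)
print(edges[0])   # left edge of the first bin
print(edges[-1])  # left edge of the last bin
```

Under these assumptions, the first and last edges match the `1958-12-06 10:22:04.800` and `2020-04-02 00:00:00.000` values in the table.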


# Visualize Topics over Time <a class=anchor id=section6></a>

Now that we have created `topics_over_time`, we need to visualize the topics, since the added temporal dimension makes them harder to inspect directly.

To do so, we are going to visualize the distribution of topics over time based on their frequency. Doing so allows us to see how the topics have evolved over time. Make sure to hover over any point to see how the topic representation at time *t* differs from the global topic representation.

In [27]:
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=50, normalize_frequency=False)