Structure:

1. [x] Read data
2. [ ] Try LDA
3. [ ] Try BERTopic:
    - [x] with a TF-IDF vectorizer as the embedding model
    - [x] with a transformer model: per-year topic modeling is probably not worthwhile, as there are too few articles per year
    - [ ] show topics for all years; find a compact form for showing all clusters


In [1]:
cd ..

/Users/andreiaksionov/Study/Machine_Learning/semantic_search/Weaviate-demo


In [76]:
import os
import sys
from collections import Counter

import numpy as np
import pandas as pd
import seaborn as sns
import spacy
from bs4 import BeautifulSoup
from matplotlib import pyplot as plt
from omegaconf import OmegaConf
from bertopic import BERTopic
from sklearn.feature_extraction.text import TfidfVectorizer
from umap import UMAP
from tqdm import tqdm

pd.set_option("display.max_colwidth", 100)

# ROOT = os.path.realpath("../")
# if ROOT not in sys.path:
#     sys.path.append(ROOT)

from src import config


# 1. Read data

Read the articles from a CSV file and display the first one.

In [47]:
data = pd.read_csv(config.data.raw)
data.head(1)


Unnamed: 0,title,url,published_at,author,publisher,short_description,keywords,header_image,raw_description,description,scraped_at
0,Santoli’s Wednesday market notes: Could September’s stock shakeout tee up strength for the fourt...,https://www.cnbc.com/2021/09/29/santolis-wednesday-market-notes-could-septembers-stock-shakeout-...,2021-09-29T17:09:39+0000,Michael Santoli,CNBC,"This is the daily notebook of Mike Santoli, CNBC's senior markets commentator, with ideas about ...","cnbc, Premium, Articles, Investment strategy, Markets, Investing, PRO Home, CNBC Pro, Pro: Santo...",https://image.cnbcfm.com/api/v1/image/106949602-1632934577499-FINTECH_ETF_9-29.jpg?v=1632934691,"<div class=""group""><p><em>This is the daily notebook of Mike Santoli, CNBC's senior markets comm...","This is the daily notebook of Mike Santoli, CNBC's senior markets commentator, with ideas about ...",2021-10-30 14:11:23.709372


In [48]:
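# Strip HTML markup from titles; name the parser explicitly to avoid bs4's parser-guessing warning
data["title"] = data["title"].apply(lambda x: BeautifulSoup(x, "html.parser").text)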

# 2. LDA

In [117]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA
from nltk.corpus import stopwords

import nltk
nltk.download("stopwords")

In [125]:
corpus = data["description"].dropna().tolist()

In [138]:
count_vect = TfidfVectorizer(stop_words=stopwords.words('english'), lowercase=True, ngram_range=(2, 3), min_df=5)
x_counts = count_vect.fit_transform(corpus)

In [139]:
dimension = 25
lda = LDA(n_components = dimension)
lda_array = lda.fit_transform(x_counts)
lda_array

array([[0.04      , 0.04      , 0.04      , ..., 0.04      , 0.04      ,
        0.04      ],
       [0.00854759, 0.00854759, 0.00854759, ..., 0.00854759, 0.00854759,
        0.00854759],
       [0.00793419, 0.62943028, 0.00793419, ..., 0.00793419, 0.00793419,
        0.00793419],
       ...,
       [0.00843864, 0.00843864, 0.00843864, ..., 0.00843864, 0.00843864,
        0.00843864],
       [0.00764065, 0.00764065, 0.00764065, ..., 0.00764065, 0.00764065,
        0.00764065],
       [0.00471075, 0.00471075, 0.00471075, ..., 0.00471075, 0.00471075,
        0.00471075]])
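
The first row above is exactly 0.04 in every column, which is 1/25 for 25 components. A flat distribution like that usually means the document contributed no n-grams to the vocabulary (for example, because everything was filtered out by `min_df=5`). A minimal sketch, using plain lists in place of `lda_array` and a hypothetical helper named `dominant_topics`, of how one might pick the dominant topic per document while flagging such uninformative rows:

```python
import math


def dominant_topics(doc_topic, tol=1e-6):
    """Return (topic_index, weight) per row, or None for flat rows.

    `doc_topic` is a list of rows shaped like the output of lda.fit_transform.
    A row counts as flat when every weight is within `tol` of 1 / n_topics.
    """
    result = []
    for row in doc_topic:
        uniform = 1.0 / len(row)
        if all(math.isclose(w, uniform, abs_tol=tol) for w in row):
            result.append(None)  # no topic stands out for this document
            continue
        best = max(range(len(row)), key=row.__getitem__)
        result.append((best, row[best]))
    return result


# Toy rows with 4 topics (hypothetical numbers, not taken from the data above)
rows = [
    [0.25, 0.25, 0.25, 0.25],  # flat: no usable terms
    [0.05, 0.85, 0.05, 0.05],  # topic 1 dominates
]
print(dominant_topics(rows))
```

Counting the `None` entries gives a quick estimate of how many descriptions the vectorizer settings leave without any usable n-gram.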

In [142]:
# Top five n-grams per topic, ranked by their weight in lda.components_
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
features = count_vect.get_feature_names_out()
important_words = [features[np.argsort(topic)[::-1][:5]].tolist() for topic in lda.components_]
important_words[:2]

[['thomsonreuters com',
  'net keywords',
  'reuters net',
  'com reuters',
  'reuters net keywords'],
 ['third quarter',
  'wall street',
  'dow jones',
  'dow jones industrial',
  'jones industrial']]

***

# 3. BERTopic

## 3.1. BERTopic with TF-IDF as vectorizer

In [25]:
X = data["description"].dropna()

In [26]:
# Create TF-IDF sparse matrix
vectorizer = TfidfVectorizer(min_df=1, ngram_range=(1, 1))
embeddings = vectorizer.fit_transform(X)
embeddings

<593x19446 sparse matrix of type '<class 'numpy.float64'>'
	with 146800 stored elements in Compressed Sparse Row format>

In [27]:
# Model
model = BERTopic(language="english", verbose=True, calculate_probabilities=True)
topics, probabilities = model.fit_transform(X, embeddings)

2022-03-24 16:36:15,673 - BERTopic - Reduced dimensionality with UMAP
2022-03-24 16:36:15,714 - BERTopic - Clustered UMAP embeddings with HDBSCAN


In [28]:
model.visualize_distribution(probabilities[0])

In [32]:
model.get_topic(1)

[('owns', 0.18416864793514212),
 ('is', 0.08840593076215217),
 ('the', 0.08454213205396005),
 ('long', 0.07613590230790047),
 ('partners', 0.07559552411502671),
 ('of', 0.06700680827319175),
 ('to', 0.0568647317979676),
 ('and', 0.05677751606481321),
 ('short', 0.05595622057821746),
 ('investment', 0.055637613123345225)]

In [33]:
model.get_topic_freq()

Unnamed: 0,Topic,Count
0,0,560
1,1,33


## 3.2. BERTopic with transformer as vectorizer

In [212]:
# X = data["description"].dropna().reset_index(drop=True)
X = data["title"]

In [213]:
# Model
umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric="cosine",
    random_state=42,
)
model = BERTopic(
    language="english",
    verbose=True,
    calculate_probabilities=True,
    umap_model=umap_model,
    n_gram_range=(2, 4),
    nr_topics="auto",
    min_topic_size=15,
)
topics, probabilities = model.fit_transform(X)


Batches: 100%|██████████| 20/20 [00:04<00:00,  4.95it/s]
2022-03-24 19:05:43,722 - BERTopic - Transformed documents to Embeddings
2022-03-24 19:05:45,629 - BERTopic - Reduced dimensionality with UMAP
2022-03-24 19:05:45,669 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2022-03-24 19:05:46,155 - BERTopic - Reduced number of topics from 7 to 7


In [214]:
model.get_topic_freq()

Unnamed: 0,Topic,Count
0,-1,301
1,0,143
2,1,73
3,2,42
4,3,28
5,4,23
6,5,15


In [215]:
clusters_df = pd.DataFrame(columns=["Cluster #", "Cluster name", "Topics", "Number of topics"])

for cluster_index, cluster in model.get_topics().items():
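    # skip the outlier cluster
    if cluster_index == -1:
        continue

    # name the cluster after its highest-scoring topic word
    cluster_name = cluster[0][0]
    # use a separate name so the `topics` list returned by fit_transform is not overwritten
    topic_words = ", ".join(x[0] for x in cluster)
    # note: despite the column name, this is the number of documents in the cluster
    number_of_topics = model.get_topic_freq(cluster_index)

    clusters_df.loc[len(clusters_df)] = cluster_index, cluster_name, topic_words, number_of_topics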

clusters_df.style.hide_index()

Cluster #,Cluster name,Topics,Number of topics
0,stocks making,"stocks making, the fed, making the biggest moves, making the biggest, the biggest moves, the biggest, making the, biggest moves, stocks making the biggest, stocks making the",143
1,white house,"white house, donald trump, plans to, new york, coronavirus cases, likely to, york gov, for retirement, fully vaccinated, new york gov",73
2,the brexit,"the brexit, the euro, central bank, euro zone, vs dollar, vs yen, on the, pm tsipras says, postelection crisis euro zone, properties european",42
3,to open,"to open, video game, to be, 2011 surge, pct in january fliers, open at walt disney, overheating sears is, on big retailers big, park in, park in malaysia",28
4,ceo pay,"ceo pay, due to, out of, 10 million, on protests equal, nobel says ceo ton, not much, officer standish names raman, of ceo oneal was, of ceo",23
5,saudi arabia,"saudi arabia, oil prices, 75 billion, prices will rise analyst, saudi arabia withdraws overseas, reports us, package supports development, package supports development of, relations not, pushes prices higher saudi",15


In [147]:
for cluster_index, cluster in model.get_topics().items():
    # skipping cluster with outliers
    if cluster_index == -1:
        continue

    print(f"Cluster # {cluster_index}")
    _df = pd.DataFrame(cluster, columns=["Topic", "Score"])
    display(_df)
    print(" - " * 50)


Cluster # 0


Unnamed: 0,Topic,Score
0,white house,0.013815
1,donald trump,0.009041
2,plans to,0.008618
3,government shutdown,0.006426
4,coronavirus cases,0.006426
5,new york gov,0.006426
6,tax cuts,0.006426
7,york gov,0.006426
8,fully vaccinated,0.006426
9,likely to,0.006426


 -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
Cluster # 1


Unnamed: 0,Topic,Score
0,to open,0.013249
1,to deliver,0.009417
2,video game,0.009417
3,cctv script,0.009417
4,to be,0.008098
5,overheating sears is,0.005208
6,pfizer no ifs ands,0.005208
7,pharmacy is,0.005208
8,pharmacy is complementary,0.005208
9,pharmacy is complementary not,0.005208


 -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
Cluster # 2


Unnamed: 0,Topic,Score
0,euro zone,0.010931
1,the brexit,0.010931
2,the euro,0.010931
3,central bank,0.010931
4,vs dollar,0.010931
5,vs yen,0.010931
6,on the,0.008839
7,plan if talks,0.006045
8,plan if talks fail,0.006045
9,president luxembourgs pm,0.006045


 -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
Cluster # 3


Unnamed: 0,Topic,Score
0,stocks making,0.030073
1,making the biggest,0.030073
2,the biggest moves,0.030073
3,stocks making the,0.030073
4,biggest moves,0.030073
5,making the,0.030073
6,the biggest,0.030073
7,making the biggest moves,0.030073
8,stocks making the biggest,0.030073
9,early movers,0.025012


 -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
Cluster # 4


Unnamed: 0,Topic,Score
0,due to,0.017757
1,ceo pay,0.017757
2,out of,0.01527
3,10 million,0.009821
4,organizational shuffle ceo,0.009821
5,oneal was,0.009821
6,of ceo oneal was,0.009821
7,of global fixed,0.009821
8,of firming seen,0.009821
9,of firming seen ge,0.009821


 -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
Cluster # 5


Unnamed: 0,Topic,Score
0,halftime report,0.028622
1,lightning round,0.026847
2,2four strong,0.015829
3,problem with,0.015829
4,shakes up spacex,0.015829
5,presented with oneofakind babe,0.015829
6,rocks midtown,0.015829
7,rock bottom utah halftime,0.015829
8,rock bottom utah,0.015829
9,rock bottom,0.015829


 -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
Cluster # 6


Unnamed: 0,Topic,Score
0,saudi arabia,0.025896
1,stocks close higher,0.02429
2,european stocks close higher,0.02429
3,close higher,0.02429
4,oil prices,0.02429
5,stocks close,0.023151
6,european stocks,0.023151
7,european stocks close,0.023151
8,production capacity will return,0.014322
9,production capacity will,0.014322


 -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
Cluster # 7


Unnamed: 0,Topic,Score
0,net profit,0.035955
1,morgan stanley,0.035955
2,off ratings cut analysts,0.019885
3,peloton shares rise as,0.019885
4,q2 net profit rises,0.019885
5,profit rises 36,0.019885
6,q2 net,0.019885
7,profit rises 36 beats,0.019885
8,pharmaceutical companies just,0.019885
9,off ratings,0.019885


 -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
Cluster # 8


Unnamed: 0,Topic,Score
0,hedge funds,0.029003
1,2007 25,0.01604
2,more danger likely,0.01604
3,on volatility spook,0.01604
4,put their money where,0.01604
5,more danger,0.01604
6,on the money,0.01604
7,on pace to plow,0.01604
8,on pace to,0.01604
9,on pace,0.01604


 -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
Cluster # 9


Unnamed: 0,Topic,Score
0,for retirement,0.035955
1,about social,0.019885
2,rule how to,0.019885
3,strategy as more firms,0.019885
4,seekers things to,0.019885
5,strategy as,0.019885
6,stem cells in the,0.019885
7,stop saving for,0.019885
8,stop saving,0.019885
9,should temporarily,0.019885


 -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
Cluster # 10


Unnamed: 0,Topic,Score
0,jim cramer,0.03537
1,12 laggards,0.019561
2,microsoft spun,0.019561
3,money cramers stocks,0.019561
4,on good holiday season,0.019561
5,oil prices cramers,0.019561
6,on good,0.019561
7,oil prices cramers game,0.019561
8,more volatility jim,0.019561
9,left for dead,0.019561


 -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
Cluster # 11


Unnamed: 0,Topic,Score
0,the fed,0.044893
1,fed minutes,0.034804
2,over the map,0.019248
3,sign of,0.019248
4,second look at fed,0.019248
5,sees fiscal boost,0.019248
6,sees fiscal,0.019248
7,rally street to take,0.019248
8,slow rate hikes,0.019248
9,over the,0.019248


 -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 


### With split by year

In [49]:
data["published_at"] = pd.to_datetime(data["published_at"], utc=True)
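
Since each year gets its own model in this section, it is worth checking first how many titles each year actually contains; with roughly 600 articles spread over 16 years, many years will be small. A minimal standard-library sketch, assuming timestamps in the `published_at` string format shown in section 1; `titles_per_year` and the `MIN_TITLES` cut-off are hypothetical names and values chosen for illustration:

```python
from collections import Counter
from datetime import datetime


def titles_per_year(timestamps):
    """Count items per calendar year from strings like '2021-09-29T17:09:39+0000'."""
    years = (datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S%z").year for ts in timestamps)
    return Counter(years)


# Hypothetical cut-off, mirroring the min_topic_size=15 used earlier
MIN_TITLES = 15

sample = [
    "2021-09-29T17:09:39+0000",
    "2021-01-05T08:00:00+0000",
    "2007-03-12T12:30:00+0000",
]
counts = titles_per_year(sample)
too_small = sorted(year for year, n in counts.items() if n < MIN_TITLES)
print(counts)
print(too_small)
```

On the real DataFrame the same counts should be available through `data["published_at"].dt.year.value_counts()`.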

In [101]:
# Model
umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric="cosine",
    random_state=42,
)
model = BERTopic(
    language="english",
    verbose=False,
    # calculate_probabilities=True,
    # umap_model=umap_model,
    # n_gram_range=(2, 4),
    # nr_topics=5,
)


In [83]:
year_range = range(data["published_at"].dt.year.min(), data["published_at"].dt.year.max() + 1)

for year in tqdm(year_range):
    titles = data.loc[data["published_at"].dt.year == year, "title"].reset_index(drop=True)
    # skip years with too few titles to cluster
    if len(titles) < 5:
        continue

    model.fit(titles)
    # print(year, len(titles))
    break  # stop after the first year with enough titles

  0%|          | 0/16 [00:00<?, ?it/s]2022-03-24 17:55:59,458 - BERTopic - Transformed documents to Embeddings
2022-03-24 17:56:00,188 - BERTopic - Reduced dimensionality with UMAP
2022-03-24 17:56:00,194 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2022-03-24 17:56:00,231 - BERTopic - Reduced number of topics from 1 to 1
  6%|▋         | 1/16 [00:00<00:14,  1.03it/s]


In [87]:
topics_df = pd.DataFrame(columns=["year", *[f"Topic_{x}" for x in range(1, 6)]])
topics_df

Unnamed: 0,year,Topic_1,Topic_2,Topic_3,Topic_4,Topic_5


In [90]:
model.get_topics()[-1]

[('the word on', 0.01871518871760522),
 ('the word', 0.01871518871760522),
 ('word on', 0.01871518871760522),
 ('new robert kennedy in', 0.01048752440512453),
 ('oil bill', 0.01048752440512453),
 ('new robert', 0.01048752440512453),
 ('net profit explosion rocks', 0.01048752440512453),
 ('net profit explosion', 0.01048752440512453),
 ('net profit', 0.01048752440512453),
 ('nation will air on', 0.01048752440512453)]

In [91]:
def prepare_topics(topics: dict) -> list:
    """Return the topic words of every cluster except the outlier cluster (-1)."""
    topics_result = []

    for cluster_index, cluster in topics.items():
        if cluster_index == -1:
            continue

        # keep only the topic names from the (topic_name, score) pairs
        _topics = [x[0] for x in cluster]

        topics_result.append(_topics)

    return topics_result


prepare_topics(model.get_topics())

[]
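
The empty list means the model found no cluster besides the outlier cluster `-1`, so nothing is left after filtering. For years that do produce clusters, one compact form is a single row per year holding the top word of each of the first five clusters, matching the `Topic_1` to `Topic_5` columns of `topics_df` above. A rough sketch under those assumptions; `year_row` is a hypothetical helper and the sample topics are illustrative:

```python
def year_row(year, topics, n_topics=5):
    """Build one table row: the year plus the top word of each of the first n_topics clusters."""
    row = {"year": year}
    for i in range(n_topics):
        has_cluster = i < len(topics) and len(topics[i]) > 0
        # label each cluster by its highest-scoring word; pad missing clusters with None
        row[f"Topic_{i + 1}"] = topics[i][0] if has_cluster else None
    return row


# Illustrative input in the shape returned by prepare_topics()
sample_topics = [["euro", "jobs", "spain"], ["market", "markets", "forecasts"]]
print(year_row(2012, sample_topics))
print(year_row(2007, []))
```

Each returned dict could then be appended to `topics_df`, for example with `topics_df.loc[len(topics_df)] = row`.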

In [102]:
year_range = range(data["published_at"].dt.year.min(), data["published_at"].dt.year.max() + 1)

for year in tqdm(year_range):
    titles = data.loc[data["published_at"].dt.year == year, "title"].reset_index(drop=True)
    # skip years with too few titles to cluster
    if len(titles) < 5:
        continue

    model.fit(titles)

    topics = prepare_topics(model.get_topics())

    print("-" * 100)
    print(f"year: {year}")
    print()
    print(model.get_topics())
    print()
    print(topics)
    print("-" * 100)

  0%|          | 0/16 [00:00<?, ?it/s]2022-03-24 18:10:08,602 - BERTopic - Transformed documents to Embeddings
2022-03-24 18:10:10,460 - BERTopic - Reduced dimensionality with UMAP
2022-03-24 18:10:10,465 - BERTopic - Clustered UMAP embeddings with HDBSCAN
 12%|█▎        | 2/16 [00:11<01:22,  5.91s/it]2022-03-24 18:10:10,691 - BERTopic - Transformed documents to Embeddings


----------------------------------------------------------------------------------------------------
year: 2007

{-1: [('in', 0.10382990695467838), ('on', 0.09083544225830015), ('the', 0.09083544225830015), ('for', 0.09083544225830015), ('with', 0.07690899358441898), ('profit', 0.06180177616451508), ('investments', 0.04509117377807157), ('no', 0.04509117377807157), ('is', 0.04509117377807157), ('subprime', 0.04509117377807157)]}

[]
----------------------------------------------------------------------------------------------------


2022-03-24 18:10:12,545 - BERTopic - Reduced dimensionality with UMAP
2022-03-24 18:10:12,550 - BERTopic - Clustered UMAP embeddings with HDBSCAN
 19%|█▉        | 3/16 [00:13<00:56,  4.31s/it]2022-03-24 18:10:12,706 - BERTopic - Transformed documents to Embeddings


----------------------------------------------------------------------------------------------------
year: 2008

{-1: [('the', 0.12341498185834088), ('for', 0.09678360440818062), ('to', 0.09678360440818062), ('help', 0.06601401719618526), ('of', 0.06601401719618526), ('trade', 0.06601401719618526), ('up', 0.06601401719618526), ('bring', 0.04824472219562629), ('from', 0.04824472219562629), ('global', 0.04824472219562629)]}

[]
----------------------------------------------------------------------------------------------------


2022-03-24 18:10:14,536 - BERTopic - Reduced dimensionality with UMAP
2022-03-24 18:10:14,540 - BERTopic - Clustered UMAP embeddings with HDBSCAN
 25%|██▌       | 4/16 [00:15<00:41,  3.45s/it]2022-03-24 18:10:14,760 - BERTopic - Transformed documents to Embeddings


----------------------------------------------------------------------------------------------------
year: 2009

{-1: [('to', 0.12026056535665978), ('the', 0.08280497874839567), ('in', 0.08280497874839567), ('for', 0.06089542681487593), ('of', 0.06089542681487593), ('report', 0.06089542681487593), ('portfolio', 0.06089542681487593), ('halftime', 0.06089542681487593), ('up', 0.06089542681487593), ('us', 0.06089542681487593)]}

[]
----------------------------------------------------------------------------------------------------


2022-03-24 18:10:17,039 - BERTopic - Reduced dimensionality with UMAP
2022-03-24 18:10:17,043 - BERTopic - Clustered UMAP embeddings with HDBSCAN
 31%|███▏      | 5/16 [00:18<00:34,  3.12s/it]

----------------------------------------------------------------------------------------------------
year: 2010

{-1: [('in', 0.07923178457133968), ('to', 0.07923178457133968), ('of', 0.06370746393016614), ('the', 0.06370746393016614), ('halftime', 0.046516870565536286), ('for', 0.046516870565536286), ('new', 0.046516870565536286), ('about', 0.046516870565536286), ('stocks', 0.046516870565536286), ('ahead', 0.046516870565536286)]}

[]
----------------------------------------------------------------------------------------------------


2022-03-24 18:10:17,290 - BERTopic - Transformed documents to Embeddings
2022-03-24 18:10:19,139 - BERTopic - Reduced dimensionality with UMAP
2022-03-24 18:10:19,144 - BERTopic - Clustered UMAP embeddings with HDBSCAN
 38%|███▊      | 6/16 [00:20<00:27,  2.78s/it]

----------------------------------------------------------------------------------------------------
year: 2011

{-1: [('the', 0.10030327535122845), ('to', 0.09092866823260955), ('in', 0.08107958954397138), ('big', 0.059589714460394315), ('stocks', 0.059589714460394315), ('for', 0.059589714460394315), ('with', 0.059589714460394315), ('on', 0.059589714460394315), ('top', 0.04765323935941024), ('move', 0.04765323935941024)]}

[]
----------------------------------------------------------------------------------------------------


2022-03-24 18:10:19,674 - BERTopic - Transformed documents to Embeddings
2022-03-24 18:10:21,515 - BERTopic - Reduced dimensionality with UMAP
2022-03-24 18:10:21,522 - BERTopic - Clustered UMAP embeddings with HDBSCAN
 44%|████▍     | 7/16 [00:22<00:23,  2.66s/it]

----------------------------------------------------------------------------------------------------
year: 2012

{-1: [('to', 0.08396814782561146), ('of', 0.06872316089707725), ('with', 0.047658080389398484), ('and', 0.04295197556067328), ('in', 0.036711277668762796), ('the', 0.03546846494485502), ('announces', 0.03541281766678363), ('by', 0.03541281766678363), ('new', 0.03308508238960397), ('on', 0.031287636902221805)], 0: [('euro', 0.1294783159785823), ('to', 0.11226510606720945), ('jobs', 0.06407524083418141), ('rises', 0.06407524083418141), ('spain', 0.06407524083418141), ('an', 0.06407524083418141), ('offshore', 0.06407524083418141), ('trade', 0.06407524083418141), ('gas', 0.058619856908734815), ('oil', 0.058619856908734815)], 1: [('market', 0.10879328893878784), ('markets', 0.08159496670409087), ('the', 0.06560467755847339), ('forecasts', 0.06364230001773424), ('and', 0.06355731518775304), ('is', 0.058223776794486606), ('update', 0.05439664446939392), ('us', 0.05144138499689621),

2022-03-24 18:10:21,825 - BERTopic - Transformed documents to Embeddings
2022-03-24 18:10:23,704 - BERTopic - Reduced dimensionality with UMAP
2022-03-24 18:10:23,709 - BERTopic - Clustered UMAP embeddings with HDBSCAN
 50%|█████     | 8/16 [00:25<00:20,  2.50s/it]

----------------------------------------------------------------------------------------------------
year: 2013

{-1: [('the', 0.11585514181740224), ('to', 0.07963425346822008), ('for', 0.07963425346822008), ('of', 0.0693956705461373), ('in', 0.0693956705461373), ('what', 0.046764215571226724), ('market', 0.046764215571226724), ('ecb', 0.033902270903340706), ('after', 0.033902270903340706), ('minister', 0.033902270903340706)]}

[]
----------------------------------------------------------------------------------------------------


2022-03-24 18:10:24,071 - BERTopic - Transformed documents to Embeddings
2022-03-24 18:10:26,344 - BERTopic - Reduced dimensionality with UMAP
2022-03-24 18:10:26,350 - BERTopic - Clustered UMAP embeddings with HDBSCAN
 56%|█████▋    | 9/16 [00:27<00:17,  2.55s/it]

----------------------------------------------------------------------------------------------------
year: 2014

{-1: [('to', 0.13181269721485842), ('more', 0.06723848059743669), ('the', 0.06723848059743669), ('on', 0.06723848059743669), ('in', 0.06723848059743669), ('for', 0.06723848059743669), ('movers', 0.0584663670051549), ('of', 0.0584663670051549), ('us', 0.0584663670051549), ('is', 0.03918928102117917)]}

[]
----------------------------------------------------------------------------------------------------


2022-03-24 18:10:26,739 - BERTopic - Transformed documents to Embeddings
2022-03-24 18:10:28,599 - BERTopic - Reduced dimensionality with UMAP
2022-03-24 18:10:28,605 - BERTopic - Clustered UMAP embeddings with HDBSCAN
 62%|██████▎   | 10/16 [00:29<00:14,  2.46s/it]

----------------------------------------------------------------------------------------------------
year: 2015

{-1: [('to', 0.0977856418773759), ('in', 0.08430687093481283), ('the', 0.0771910222603887), ('of', 0.06978105490049859), ('on', 0.062033060356583535), ('how', 0.062033060356583535), ('for', 0.05388830009508047), ('china', 0.045264113294687054), ('is', 0.045264113294687054), ('stocks', 0.045264113294687054)]}

[]
----------------------------------------------------------------------------------------------------


2022-03-24 18:10:29,059 - BERTopic - Transformed documents to Embeddings
2022-03-24 18:10:30,892 - BERTopic - Reduced dimensionality with UMAP
2022-03-24 18:10:30,899 - BERTopic - Clustered UMAP embeddings with HDBSCAN
 69%|██████▉   | 11/16 [00:32<00:12,  2.42s/it]

----------------------------------------------------------------------------------------------------
year: 2016

{-1: [('to', 0.12891104471675874), ('on', 0.09052264847070639), ('the', 0.060348432313804254), ('brexit', 0.05913266520213232), ('of', 0.049287236844841376), ('in', 0.04712436992092386), ('outlook', 0.04330237295085275), ('are', 0.04330237295085275), ('may', 0.04330237295085275), ('out', 0.03942177680142154)], 0: [('to', 0.06672280203932739), ('in', 0.06504268143189491), ('stocks', 0.06121270870170481), ('bank', 0.06121270870170481), ('the', 0.0468534311180038), ('hike', 0.044825571999375216), ('rate', 0.044825571999375216), ('cnbc', 0.044825571999375216), ('fed', 0.044825571999375216), ('buy', 0.044825571999375216)], 1: [('trump', 0.14708515460167143), ('donald', 0.11073935483308416), ('just', 0.0810935347988697), ('our', 0.0810935347988697), ('promises', 0.0810935347988697), ('his', 0.07382623655538943), ('that', 0.07382623655538943), ('of', 0.06153436842446862), ('to', 0.

2022-03-24 18:10:31,381 - BERTopic - Transformed documents to Embeddings
2022-03-24 18:10:33,222 - BERTopic - Reduced dimensionality with UMAP
2022-03-24 18:10:33,227 - BERTopic - Clustered UMAP embeddings with HDBSCAN
 75%|███████▌  | 12/16 [00:34<00:09,  2.38s/it]

----------------------------------------------------------------------------------------------------
year: 2017

{-1: [('to', 0.10926586129582751), ('says', 0.07291498355057316), ('the', 0.06258026318934623), ('in', 0.06258026318934623), ('is', 0.057155334498074785), ('on', 0.05152958031456979), ('us', 0.05152958031456979), ('as', 0.05152958031456979), ('trump', 0.0456733762393331), ('investors', 0.0456733762393331)]}

[]
----------------------------------------------------------------------------------------------------


2022-03-24 18:10:33,539 - BERTopic - Transformed documents to Embeddings
2022-03-24 18:10:35,349 - BERTopic - Reduced dimensionality with UMAP
2022-03-24 18:10:35,354 - BERTopic - Clustered UMAP embeddings with HDBSCAN
 81%|████████▏ | 13/16 [00:36<00:06,  2.30s/it]

----------------------------------------------------------------------------------------------------
year: 2018

{-1: [('to', 0.1070732447936461), ('for', 0.07666907586987982), ('of', 0.06822635583879905), ('with', 0.0593359659616087), ('in', 0.0593359659616087), ('the', 0.0499042980362594), ('could', 0.0499042980362594), ('on', 0.0499042980362594), ('and', 0.0499042980362594), ('trump', 0.0499042980362594)]}

[]
----------------------------------------------------------------------------------------------------


2022-03-24 18:10:35,842 - BERTopic - Transformed documents to Embeddings
2022-03-24 18:10:38,155 - BERTopic - Reduced dimensionality with UMAP
2022-03-24 18:10:38,160 - BERTopic - Clustered UMAP embeddings with HDBSCAN
 88%|████████▊ | 14/16 [00:39<00:04,  2.45s/it]

----------------------------------------------------------------------------------------------------
year: 2019

{-1: [('to', 0.12650790913512378), ('the', 0.09049181602967112), ('in', 0.07830081157963284), ('of', 0.060629231098658226), ('stocks', 0.055895328959804966), ('on', 0.055895328959804966), ('says', 0.05100723147422886), ('and', 0.05100723147422886), ('is', 0.05100723147422886), ('its', 0.04068328551824408)]}

[]
----------------------------------------------------------------------------------------------------


2022-03-24 18:10:38,610 - BERTopic - Transformed documents to Embeddings
2022-03-24 18:10:40,451 - BERTopic - Reduced dimensionality with UMAP
2022-03-24 18:10:40,456 - BERTopic - Clustered UMAP embeddings with HDBSCAN
 94%|█████████▍| 15/16 [00:41<00:02,  2.41s/it]

----------------------------------------------------------------------------------------------------
year: 2020

{-1: [('to', 0.11714500314946491), ('the', 0.0824668247495687), ('in', 0.07766623166461337), ('as', 0.07273408836145277), ('for', 0.06242282993136949), ('of', 0.06242282993136949), ('coronavirus', 0.06242282993136949), ('says', 0.05701045135446367), ('new', 0.05139788872778774), ('is', 0.05139788872778774)]}

[]
----------------------------------------------------------------------------------------------------


2022-03-24 18:10:40,765 - BERTopic - Transformed documents to Embeddings
2022-03-24 18:10:42,615 - BERTopic - Reduced dimensionality with UMAP
2022-03-24 18:10:42,620 - BERTopic - Clustered UMAP embeddings with HDBSCAN
100%|██████████| 16/16 [00:43<00:00,  2.75s/it]

----------------------------------------------------------------------------------------------------
year: 2021

{-1: [('the', 0.10815704291057748), ('to', 0.10091811638825), ('as', 0.0689519893283495), ('is', 0.059974878142740495), ('in', 0.059974878142740495), ('you', 0.05044911593515956), ('for', 0.05044911593515956), ('says', 0.05044911593515956), ('stock', 0.05044911593515956), ('and', 0.05044911593515956)]}

[]
----------------------------------------------------------------------------------------------------



