# Topic Modeling with BERT

## Implementation attempts

**Reference:** https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6

**Requirements:** `py -3.8 -m pip install umap-learn hdbscan sentence_transformers`

## Using our own training data

In [1]:
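import pandas as pd
from sklearn.utils import shuffle  # only needed for the optional subsampling below

# Load the training data and drop the two 'Unnamed' columns
df = pd.read_csv('data/TCCSocialMediaData_train.csv')
df = df.drop(['Unnamed: 0', 'Unnamed: 0.1'], axis=1)
# Keep only rows that have a headline
df = df[df['headline'].notna()]
# Optional: keep a random half of the data
# df = shuffle(df)
# df = df[0:int(len(df) / 2)]

df = df.reset_index(drop=True)
display(df.head())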

Unnamed: 0,date,headline,message,link,domain,rating,orientation,sourceEchochamber
0,2020-10-27 16:21:29,"The balls on this guy, huh?",,https://www.tmz.com/2020/10/27/arizona-racist-...,tmz.com,N,,Liberal
1,2020-08-20 14:34:31,Lindell has come under fire for promoting pote...,En Serio?,https://www.forbes.com/sites/andrewsolender/20...,forbes.com,T,,Liberal
2,2020-11-17 15:35:49,Biden has told aides that he's concerned that ...,Wary ?? Wary ?? More like he is worried that t...,https://www.msn.com/en-us/news/politics/presid...,msn.com,T,,Conservative
3,2021-03-03 03:19:18,The response was a remarkable moment at a pivo...,,https://www.nbcnews.com/politics/elections/sup...,nbcnews.com,T,,Liberal
4,2020-08-30 16:26:09,Astronaut Jeanette Epps is the first Black wom...,,https://newsone.com/4005134/nasa-astronaut-jea...,newsone.com,T,Slightly Left,Liberal


## Same but using BERT Large

- Complete library on the Hugging Face Hub: https://huggingface.co/sentence-transformers
- BERT Large for sentence embeddings: https://huggingface.co/sentence-transformers/bert-large-nli-mean-tokens
- Encoding time: roughly 40 min

In [None]:
from sentence_transformers import SentenceTransformer

# Smaller alternative model:
# model = SentenceTransformer('distilbert-base-nli-mean-tokens')
model = SentenceTransformer('bert-large-nli-mean-tokens')

# Headlines to embed, one string per document
data = df[df['headline'].notna()]['headline'].tolist()
print(len(data))
embeddings = model.encode(data, show_progress_bar=True)
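
Because encoding takes a long time on this data set, it may be worth caching the embeddings on disk so that later cells can be rerun without re-encoding. The following is only a minimal sketch and not part of the original notebook: the helper name `load_or_compute` and the file name `embeddings.pkl` are hypothetical, and it relies on `pickle` from the standard library, which can also serialize the NumPy array returned by `model.encode`.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute_fn):
    """Return the object cached at cache_path, or compute, store and return it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    result = compute_fn()
    with open(cache_path, "wb") as fh:
        pickle.dump(result, fh)
    return result

# Demo with a cheap stand-in for the expensive encoding step.
calls = []

def fake_encode():
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]

demo_path = os.path.join(tempfile.mkdtemp(), "embeddings.pkl")
first = load_or_compute(demo_path, fake_encode)   # computes and writes the cache
second = load_or_compute(demo_path, fake_encode)  # reads the cache, no recomputation
print(first == second, len(calls))
```

In the notebook, the same helper could be called as `embeddings = load_or_compute('embeddings.pkl', lambda: model.encode(data, show_progress_bar=True))`.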

### Dimension Reduction with UMAP

The reference article reduces the dimensionality to 5 while keeping the size of the local neighborhood at 15. The cell below instead uses `n_neighbors=5` and `n_components=50`. You can play around with these values to optimize topic creation. Note that too low a dimensionality loses information, while too high a dimensionality leads to poorer clustering results.
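
To try several settings in a systematic way, the parameter combinations can be enumerated up front. This is a hedged sketch: the helper `umap_param_grid` and the candidate values are illustrative assumptions, not settings taken from the notebook or the reference article.

```python
from itertools import product

def umap_param_grid(neighbors=(5, 15, 30), components=(5, 50)):
    """Return one keyword dict per (n_neighbors, n_components) combination."""
    return [
        {"n_neighbors": n, "n_components": c, "metric": "cosine"}
        for n, c in product(neighbors, components)
    ]

grid = umap_param_grid()
print(len(grid))
for params in grid:
    print(params)
```

Each dictionary can then be unpacked into the reducer, for example `umap.UMAP(**params).fit_transform(embeddings)`, and the resulting clusterings compared.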

In [None]:
import umap
umap_embeddings = umap.UMAP(n_neighbors=5, 
                            n_components=50, 
                            metric='cosine').fit_transform(embeddings)

In [None]:
print(type(umap_embeddings))
print(umap_embeddings[0])
len(umap_embeddings)

### Cluster the documents with HDBSCAN

Cluster the documents with HDBSCAN, a density-based algorithm that works quite well with UMAP, since UMAP preserves a lot of local structure even in a lower-dimensional space. Moreover, HDBSCAN does not force every data point into a cluster: points that fit no cluster are labeled as outliers.

https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html
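
HDBSCAN marks outliers with the label `-1`, so a quick summary of the label array shows how much of the data ended up unclustered. The helper below is a small sketch with a hypothetical name (`summarize_labels`) and toy labels; it uses only the standard library.

```python
from collections import Counter

def summarize_labels(labels):
    """Count clusters and outliers (label -1) in a sequence of cluster labels."""
    counts = Counter(int(label) for label in labels)
    n_outliers = counts.pop(-1, 0)
    total = n_outliers + sum(counts.values())
    return {
        "n_clusters": len(counts),
        "n_outliers": n_outliers,
        "outlier_fraction": n_outliers / total if total else 0.0,
    }

toy_labels = [-1, 0, 0, 1, -1, 1, 1, -1]
summary = summarize_labels(toy_labels)
print(summary)
```

After fitting, the same function could be applied to the real labels with `summarize_labels(cluster.labels_)`.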


In [None]:
import hdbscan
cluster = hdbscan.HDBSCAN(min_cluster_size=50,
                          metric='euclidean',                      
                          cluster_selection_method='eom').fit(umap_embeddings)

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
# Prepare data
umap_data = umap.UMAP(n_neighbors=5, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
result = pd.DataFrame(umap_data, columns=['x', 'y'])
result['labels'] = cluster.labels_

# Visualize clusters
fig, ax = plt.subplots(figsize=(20, 10))
outliers = result.loc[result.labels == -1, :]
clustered = result.loc[result.labels != -1, :]
plt.scatter(outliers.x, outliers.y, color='#BDBDBD', s=0.05)
plt.scatter(clustered.x, clustered.y, c=clustered.labels, s=0.05, cmap='hsv_r')
plt.colorbar()

In [None]:
fig.savefig('cluster_5_15_15.png')

### Topic Creation with c-TF-IDF


In [None]:
docs_df

In [11]:
docs_df = pd.DataFrame(data, columns=["Doc"])
docs_df['Topic'] = cluster.labels_
docs_df['Doc_ID'] = range(len(docs_df))
docs_df.to_csv('docs_df.csv')
docs_per_topic = docs_df.groupby(['Topic'], as_index = False).agg({'Doc': ' '.join})
docs_per_topic.to_csv('docs_per_topic.csv')
docs_per_topic

Unnamed: 0,Topic,Doc
0,-1,"The balls on this guy, huh? Lindell has come u..."
1,0,"ZeroHedge - On a long enough timeline, the sur..."
2,1,PolitiFact is a fact-checking website that rat...
3,2,😳 😐 🌊🤜🏻🚪🗳 😨 😡🤬🤬🤬 😳😳😳 🦠🕵🏻‍♂️\n\n 👀👀 😯🤯 🤔... \n\...
4,3,Wow... Wow... Wow... Wow... Wow... Just wow......
...,...,...
504,503,"There's no single domestic terrorism statute, ..."
505,504,A narrow victory for either side does not fund...
506,505,"In America, the two political parties don’t co..."
507,506,Get them out of there! Confirmed! Stand Down o...


In [14]:
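print(len(docs_per_topic))
# Exclude topic 0
docs_per_topic = docs_per_topic[docs_per_topic['Topic'] != 0]
docs_per_topic = docs_per_topic.reset_index(drop=True)
print(len(docs_per_topic))
# Keep only topics whose text contains at least one Latin letter; topics made up
# solely of emoji or symbols would likely yield an empty row of counts in c_tf_idf
docs_per_topic = docs_per_topic[docs_per_topic['Doc'].str.contains("[a-zA-Z]").fillna(False)]
print(len(docs_per_topic))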

508
508
503


In [16]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

#np.seterr(divide='ignore', invalid='ignore')

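def c_tf_idf(documents, m, ngram_range=(1, 1)):
    """Class-based TF-IDF: each entry of `documents` is the joined text of one topic; m is the number of original documents."""
    # Term counts per topic, shape (n_topics, n_terms)
    count = CountVectorizer(ngram_range=ngram_range, stop_words="english").fit(documents)
    t = count.transform(documents).toarray()

    # Total number of counted terms in each topic
    w = t.sum(axis=1)

    # Term frequency within each topic, shape (n_terms, n_topics)
    tf = np.divide(t.T, w)
    # Total count of each term across all topics
    sum_t = t.sum(axis=0)
    idf = np.log(np.divide(m, sum_t)).reshape(-1, 1)
    tf_idf = np.multiply(tf, idf)

    return tf_idf, count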
  
tf_idf, count = c_tf_idf(docs_per_topic.Doc.values, m=len(data))
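
Written out, the function above computes the following score for a term $t$ in the joined text of topic $c$. The notation is an assumption introduced here for readability: $n_{t,c}$ stands for the entry of the count matrix `t`, $w_c$ for the entry of `w`, and $m$ for the number of original documents (`len(data)`).

```latex
\begin{aligned}
w_c &= \sum_{t} n_{t,c} \\
\mathrm{tf}_{t,c} &= \frac{n_{t,c}}{w_c} \\
\mathrm{idf}_{t} &= \log \frac{m}{\sum_{c} n_{t,c}} \\
\text{c-TF-IDF}_{t,c} &= \mathrm{tf}_{t,c} \cdot \mathrm{idf}_{t}
\end{aligned}
```

The returned array has one row per term and one column per topic, which is why it is transposed before the top words are extracted.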

### Topic Representation
In order to create a topic representation, we take the top 20 words per topic based on their c-TF-IDF scores. The higher the score, the more representative the word should be of its topic, as the score is a proxy for information density.
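
The idea can be illustrated on a toy example. The sketch below is a simplified stand-in rather than the notebook's implementation: it tokenizes on whitespace, removes no stop words, and uses only the standard library. The function names, the toy texts, and the value `m=8` are all made up for illustration.

```python
import math
from collections import Counter

def toy_c_tf_idf(topic_docs, m):
    """Score each term per topic as (count / topic length) * log(m / total count)."""
    counts = {topic: Counter(text.lower().split()) for topic, text in topic_docs.items()}
    totals = Counter()
    for term_counts in counts.values():
        totals.update(term_counts)
    scores = {}
    for topic, term_counts in counts.items():
        topic_length = sum(term_counts.values())
        scores[topic] = {
            term: (n / topic_length) * math.log(m / totals[term])
            for term, n in term_counts.items()
        }
    return scores

def top_n(scores, n):
    """Return the n highest-scoring (term, score) pairs per topic, best first."""
    return {
        topic: sorted(term_scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
        for topic, term_scores in scores.items()
    }

# Each value is the joined text of all documents assigned to that topic.
toy_topics = {
    0: "election trump election biden",
    1: "game star game baseball",
}
toy_scores = toy_c_tf_idf(toy_topics, m=8)
toy_top = top_n(toy_scores, n=2)
print(toy_top)
```

In this toy data, "election" and "game" rank first in their topics because each makes up half of its topic's text.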

Topic -1 refers to all documents that were not assigned to any topic. The great thing about HDBSCAN is that not all documents are forced into a cluster: if no cluster can be found for a document, it is simply treated as an outlier.

In the `topic_sizes` output below, topics 361, 457, 138, and 169 are the largest clusters. To view the words belonging to those topics, we can simply look them up in the dictionary `top_n_words`:

In [18]:
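def extract_top_n_words_per_topic(tf_idf, count, docs_per_topic, n):
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
    words = count.get_feature_names_out().tolist()
    labels = list(docs_per_topic.Topic)
    # One row per topic, one column per term
    tf_idf_transposed = tf_idf.T
    # Column indices of the n highest-scoring terms in each row (ascending order)
    indices = tf_idf_transposed.argsort()[:, -n:]
    # Reverse each list so the best-scoring word comes first
    top_n_words = {label: [(words[j], tf_idf_transposed[i][j]) for j in indices[i]][::-1] for i, label in enumerate(labels)}
    return top_n_words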

def extract_topic_sizes(df):
    topic_sizes = (df.groupby(['Topic'])
                     .Doc
                     .count()
                     .reset_index()
                     .rename({"Topic": "Topic", "Doc": "Size"}, axis='columns')
                     .sort_values("Size", ascending=False))
    return topic_sizes

top_n_words = extract_top_n_words_per_topic(tf_idf, count, docs_per_topic, n=20)
topic_sizes = extract_topic_sizes(docs_df)
topic_sizes.to_csv('topic_sizes.csv')
topic_sizes.head(10)

Unnamed: 0,Topic,Size
0,-1,347704
362,361,4247
458,457,3446
139,138,3154
170,169,2338
175,174,2106
12,11,2053
386,385,1980
383,382,1900
503,502,1618


In [19]:
topic_sizes.sort_values(['Size'])

Unnamed: 0,Topic,Size
320,319,50
360,359,51
291,290,51
155,154,51
504,503,51
...,...,...
170,169,2338
139,138,3154
458,457,3446
362,361,4247


In [20]:
topic_sizes['topic top words']=topic_sizes['Topic'].map(top_n_words)

def word_extract(x):
    # Topics missing from top_n_words map to NaN (a float); flag them with the sentinel 1
    if isinstance(x, float):
        return 1
    return [word for word, _score in x]

topic_sizes['words'] = topic_sizes['topic top words'].apply(word_extract)
topic_sizes

Unnamed: 0,Topic,Size,topic top words,words
0,-1,347704,"[(trump, 0.021263067193352816), (president, 0....","[trump, president, biden, election, said, joe,..."
362,361,4247,"[(twitter, 0.09277363693910848), (facebook, 0....","[twitter, facebook, social, media, account, tr..."
458,457,3446,"[(good, 0.07022461672120524), (great, 0.042902...","[good, great, right, think, know, time, love, ..."
139,138,3154,"[(https, 0.1580439995285021), (com, 0.15198126...","[https, com, www, fbclid, 2020, msn, politics,..."
170,169,2338,"[(russian, 0.10859609065953478), (russia, 0.09...","[russian, russia, intelligence, putin, vladimi..."
...,...,...,...,...
121,120,51,"[(game, 0.2633899598384372), (star, 0.24340667...","[game, star, baseball, league, atlanta, major,..."
152,151,51,"[(democratic, 0.12098058076373801), (joe, 0.08...","[democratic, joe, biden, seals, presidential, ..."
482,481,51,"[(accuses, 0.14427005906294704), (giuliani, 0....","[accuses, giuliani, disinformation, lawyer, ca..."
209,208,51,"[(dejoy, 0.16734225254893453), (louis, 0.13729...","[dejoy, louis, reimbursed, postmaster, donate,..."


In [21]:
topic_sizes[topic_sizes['words']==1]

Unnamed: 0,Topic,Size,topic top words,words
56,55,400,,1
1,0,382,,1
59,58,258,,1
57,56,104,,1
60,59,87,,1
49,48,82,,1


In [22]:
topic_sizes=topic_sizes[topic_sizes['words']!=1]

In [23]:
topic_sizes.to_csv('topic_sizes.csv')
topic_sizes

Unnamed: 0,Topic,Size,topic top words,words
0,-1,347704,"[(trump, 0.021263067193352816), (president, 0....","[trump, president, biden, election, said, joe,..."
362,361,4247,"[(twitter, 0.09277363693910848), (facebook, 0....","[twitter, facebook, social, media, account, tr..."
458,457,3446,"[(good, 0.07022461672120524), (great, 0.042902...","[good, great, right, think, know, time, love, ..."
139,138,3154,"[(https, 0.1580439995285021), (com, 0.15198126...","[https, com, www, fbclid, 2020, msn, politics,..."
170,169,2338,"[(russian, 0.10859609065953478), (russia, 0.09...","[russian, russia, intelligence, putin, vladimi..."
...,...,...,...,...
121,120,51,"[(game, 0.2633899598384372), (star, 0.24340667...","[game, star, baseball, league, atlanta, major,..."
152,151,51,"[(democratic, 0.12098058076373801), (joe, 0.08...","[democratic, joe, biden, seals, presidential, ..."
482,481,51,"[(accuses, 0.14427005906294704), (giuliani, 0....","[accuses, giuliani, disinformation, lawyer, ca..."
209,208,51,"[(dejoy, 0.16734225254893453), (louis, 0.13729...","[dejoy, louis, reimbursed, postmaster, donate,..."


In [24]:
len(topic_sizes['Topic'].tolist())


503

In [25]:
top_n_words[5][:10]

[('dml', 0.9937668157626008),
 ('offers', 0.8782998848366768),
 ('app', 0.8664674442343197),
 ('news', 0.7845449377223067),
 ('reporting', 0.7579552627671509),
 ('best', 0.7245379919728183),
 ('𝘆𝗼𝘂𝗿', 0.0),
 ('eratic', 0.0),
 ('erased', 0.0),
 ('eraser', 0.0)]
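
The trailing entries with a score of 0.0 most likely appear because fewer than ten words have a nonzero score for this topic, so the list is padded with unrelated vocabulary terms. A small filter can drop them. This is a sketch with a hypothetical helper name (`drop_zero_scores`), and the sample list is an abbreviated copy of the output above.

```python
def drop_zero_scores(word_scores):
    """Keep only the (word, score) pairs whose score is strictly positive."""
    return [(word, score) for word, score in word_scores if score > 0]

sample = [
    ("dml", 0.9937668157626008),
    ("offers", 0.8782998848366768),
    ("best", 0.7245379919728183),
    ("eratic", 0.0),
    ("erased", 0.0),
]
filtered = drop_zero_scores(sample)
print(filtered)
```

In the notebook this could be applied as `drop_zero_scores(top_n_words[5])`.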