In [1]:
import tweepy
import configparser
import requests     # For sending HTTP requests to the API
import os           # For saving access tokens and for file management when creating and adding to the dataset
import json         # For dealing with json responses we receive from the API
import pandas as pd # For displaying the data after
import csv          # For saving the response data in CSV format
import datetime     # For parsing the dates received from twitter in readable formats
import dateutil.parser
import unicodedata
import time         # To add wait time between requests
import sqlite3
import re
import twitter
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance,PartOfSpeech
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN
from nltk.corpus import stopwords
from sklearn.model_selection import RandomizedSearchCV
from flair.embeddings import TransformerDocumentEmbeddings
import numpy as np
from twitter import *
from functools import partial
from collections import Counter
import nltk
import string
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

In [2]:
df_tweets_preprocessed = pd.read_pickle('../data/df_tweets_preprocessed.pkl')

### 6 Parts in BERTopic Modelling:
1. Embedding Model
2. Dimensionality Reduction
3. Clustering
4. Vectorizer
5. c-TF-IDF
6. Fine-Tune Topics

#### EMBEDDING MODEL
BERTopic starts by transforming the input documents into numerical representations. There are many ways to do this, but a sentence-transformers model ("all-MiniLM-L6-v2") is the usual choice because it captures the semantic similarity between documents well.

However, no single embedding model is perfect, and a different one may suit your use case better.
BERTopic's modularity means we can choose any embedding model to convert documents into numerical representations, and cluster essentially any data that can be embedded. When new state-of-the-art pre-trained embedding models are released, BERTopic can use them, so it improves as new models appear.
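
To make "semantic similarity" concrete: once documents are embedded, closeness between them is commonly measured with cosine similarity. The following is a minimal sketch using made-up 4-dimensional vectors (real "all-MiniLM-L6-v2" embeddings have 384 dimensions). The vectors, the example sentences, and the helper name `cosine_similarity` are illustrative assumptions, not output from an actual model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for illustration only.
tweet_refund_1 = [0.9, 0.1, 0.0, 0.2]   # e.g. "I want my money back"
tweet_refund_2 = [0.8, 0.2, 0.1, 0.3]   # e.g. "please refund my order"
tweet_weather  = [0.0, 0.1, 0.9, 0.1]   # e.g. "lovely sunny day"

sim_related = cosine_similarity(tweet_refund_1, tweet_refund_2)
sim_unrelated = cosine_similarity(tweet_refund_1, tweet_weather)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

Documents about the same subject should score close to 1, unrelated ones close to 0; the later clustering steps rely on this structure.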

#### UMAP & its Hyperparameters
UMAP is a dimensionality reduction technique. In BERTopic, it reduces the dimensionality of the document embeddings to something HDBSCAN can cluster well.
Why HDBSCAN rather than another clustering model? Because it is density-based, it determines the number of clusters automatically, whereas k-means, for example, requires trial and error to find the right number.
1. **n_neighbors**: the number of neighboring sample points used when making the manifold approximation. Increasing this value often results in larger clusters being created.
2. **n_components**: the dimensionality of the embeddings after reduction. BERTopic defaults to 5, which reduces dimensionality as much as possible while trying to keep as much information as possible in the resulting embeddings. Lowering or increasing this value affects the quality of the embeddings, but its largest effect is on the performance of HDBSCAN. If the value is increased too much, HDBSCAN will have a hard time clustering the high-dimensional embeddings. If you do want to increase it, I would advise using an HDBSCAN metric that works well with high-dimensional data.
3. **metric**: the method used to compute distances in the high-dimensional space. BERTopic's default is cosine, as we are dealing with high-dimensional data.


#### HDBSCAN & Its Hyperparameters
After reducing the embeddings with UMAP, we use HDBSCAN to group similar documents into clusters. Like UMAP, HDBSCAN has many parameters that can be tuned to improve cluster quality.
1. **min_cluster_size**: arguably the most important parameter in HDBSCAN. It controls the minimum size of a cluster and thereby the number of clusters generated. BERTopic sets it to 10 by default. Increasing this value results in fewer but larger clusters, whereas decreasing it produces more micro clusters. Typically, I would advise increasing this value rather than decreasing it.
2. **min_samples**: defaults to min_cluster_size and controls the number of outliers generated. Setting it significantly lower than min_cluster_size may reduce the amount of noise. Note that outliers are to be expected, and forcing the output to have none may not represent the data properly.
3. **metric**: as with UMAP, this is used to calculate distances. Here we use euclidean because, after dimensionality reduction, the data is low-dimensional and little optimization is necessary. However, if you increase n_components in UMAP, it is advisable to look into metrics that work with high-dimensional data.
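
The effect of min_samples can be seen through HDBSCAN's notion of core distance: the distance from a point to its k-th nearest neighbour, with k = min_samples. A small group of points looks dense when k is small and sparse when k exceeds the group's size, which is why lowering min_samples tends to produce fewer outliers. Below is a minimal sketch on made-up 1-D data; `core_distance` is a hypothetical helper, and the hdbscan library's own neighbour counting may differ slightly from this simplification.

```python
def core_distance(points, index, k):
    """Distance from points[index] to its k-th nearest other point."""
    target = points[index]
    distances = sorted(abs(target - p) for i, p in enumerate(points) if i != index)
    return distances[k - 1]

# Toy 1-D data: a dense group near 0 and an isolated pair near 10.
points = [0.0, 0.1, 0.2, 0.3, 0.4, 10.0, 10.1]

for k in (1, 3):
    in_group = core_distance(points, 0, k)   # a point inside the dense group
    in_pair = core_distance(points, 5, k)    # a point in the isolated pair
    print(f"min_samples={k}: group={in_group:.1f}, pair={in_pair:.1f}")
```

With k = 1 the pair looks as dense as the main group; with k = 3 its core distance jumps to the distance of the far-away group, so it is much more likely to be treated as noise.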

#### Vectorizers 
In topic modeling, the quality of the topic representations is key to interpreting the topics, communicating results, and understanding patterns, so the representations need to fit your use case.
In practice, there is no single correct way of creating topic representations. Some use cases call for higher-order n-grams, whereas others focus on single words with stop words removed.
1. **ngram_range**: decides how many tokens each entry in a topic representation may contain. For example, a topic may contain single words such as game and team (length 1), but it can also make sense to include phrases such as hockey league (length 2).
2. **stop_words**: some topics contain stop words such as he or the. We typically want to keep these out of the topic representations because they add no information about the topic.
3. **min_df**: typically an integer giving the minimum number of documents a word must appear in before it is added to the representation. If we have a million documents and a certain word appears only once across all of them, it is highly unlikely to be representative of a topic. The c-TF-IDF calculation typically keeps such a word out of the topic representation anyway, but with millions of documents, keeping every rare word still leads to a very large topic-term matrix.
4. **max_features**: similar to min_df, this selects the top n most frequent words to be used in the topic representation. Setting it to 10_000, for example, creates a topic-term matrix with 10_000 terms. This lets you control the size of the topic-term matrix directly, without having to fiddle with min_df.
5. **tokenizer**: the default tokenizer in CountVectorizer works well for western languages but fails to tokenize some non-western languages, such as Chinese. For those, a custom tokenizer can be passed instead.
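
As a rough picture of what these parameters do, here is a pure-Python sketch that mimics ngram_range, stop_words, min_df and max_features on three made-up documents. It is not how CountVectorizer is implemented (scikit-learn has its own tokenization and ordering rules); the function `build_vocabulary` and the sample texts are hypothetical.

```python
from collections import Counter

def build_vocabulary(docs, ngram_range=(1, 1), stop_words=(), min_df=1, max_features=None):
    """Return the n-grams that survive the filters, most frequent first."""
    stop = set(stop_words)
    doc_freq = Counter()    # number of documents containing each n-gram
    term_freq = Counter()   # total occurrences across all documents
    low, high = ngram_range
    for doc in docs:
        # Stop words are dropped before n-grams are formed.
        tokens = [t for t in doc.lower().split() if t not in stop]
        grams = [
            " ".join(tokens[i:i + n])
            for n in range(low, high + 1)
            for i in range(len(tokens) - n + 1)
        ]
        term_freq.update(grams)
        doc_freq.update(set(grams))
    kept = [g for g in term_freq if doc_freq[g] >= min_df]
    kept.sort(key=lambda g: (-term_freq[g], g))
    return kept[:max_features]

docs = [
    "the hockey league starts today",
    "he joined the hockey league",
    "the team won the game",
]
vocab = build_vocabulary(docs, ngram_range=(1, 2), stop_words=["the", "he"], min_df=2)
print(vocab)  # ['hockey', 'hockey league', 'league']
```

Only terms found in at least two documents survive, and the bigram "hockey league" appears alongside its unigrams because ngram_range covers lengths 1 and 2. Passing max_features=2 would cut the list to its first two entries.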

In [3]:
from sklearn.metrics import silhouette_score
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from bertopic.representation import TextGeneration
from bertopic.representation import KeyBERTInspired
from bertopic.representation import MaximalMarginalRelevance
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from transformers import pipeline, AutoModel
from bertopic.representation import OpenAI
from hdbscan import HDBSCAN
import numpy as np

# my_model = BERTopic.load("/content/drive/MyDrive/Colab_Notebooks/Data/twitter_bert_model")

# Build the pipeline with the current parameter settings
stopwords_list      = list(stopwords.words('english')) + ['http', 'https', 'amp', 'com', 'gtgtgt', 'please', 'send', 'dm']
vectorizer_model    = CountVectorizer(min_df=5,
                                      ngram_range=(1,2),
                                      stop_words=stopwords_list)
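# NOTE: BERTopic documents sentence-transformers models, model-name strings and
# Hugging Face `pipeline` objects as embedding backends; a bare AutoModel may not be
# picked up as intended, so verify which model actually produces the embeddings.
embedding_model     = AutoModel.from_pretrained('roberta-base')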
# `metric` is not set, so umap-learn's own default (euclidean) applies here
umap_model          = UMAP(n_neighbors=15,
                           n_components=7,
                           min_dist=0.1,
                           random_state=42)
hdbscan_model       = HDBSCAN(min_cluster_size=100,
                              min_samples=40,
                              gen_min_span_tree=True,
                              prediction_data=True)
ctfidf = ClassTfidfTransformer(reduce_frequent_words=True)

representation_model = KeyBERTInspired()

model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    ctfidf_model=ctfidf,
    top_n_words=10,
    min_topic_size=100,  # not used when a custom hdbscan_model is passed; min_cluster_size governs
    language='english',
    calculate_probabilities=True,
    verbose=True,
    nr_topics=50
    )

# Fit the BERTopic model
topics, probs = model.fit_transform(df_tweets_preprocessed['text_preprocessed'])

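# Calculate the silhouette score on the topic-probability matrix, using the raw
# HDBSCAN labels: outliers (label -1) count as their own group, and the
# nr_topics reduction is not reflected in these labels
silhouette_avg = silhouette_score(probs, hdbscan_model.labels_)

print(silhouette_avg)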

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Batches: 100%|██████████| 1137/1

0.19616470366850333
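
For context on the number above: the silhouette coefficient compares, for each point, the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b), as (b - a) / max(a, b), and averages this over all points. Values near 1 indicate compact, well-separated clusters; values near 0 indicate overlap. The cell above uses the topic-probability matrix as the feature space; one alternative is to score the reduced embeddings with outliers (label -1) removed first. The pure-Python sketch below shows that computation on made-up 2-D points. The data and the helper `silhouette` are hypothetical; in practice `sklearn.metrics.silhouette_score` does the same job.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data: two tight groups plus one point labelled as an outlier (-1).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
labels = [0, 0, 0, 1, 1, 1, -1]

# Drop outliers before scoring so they are not treated as a cluster.
kept = [(p, lab) for p, lab in zip(points, labels) if lab != -1]
score = silhouette([p for p, _ in kept], [lab for _, lab in kept])
print(round(score, 3))
```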


In [4]:
# Save the fitted model using pickle serialization
model.save("../model/localBERT", serialization="pickle")

