# Exploration
Initial exploration of the BERTopic package for my B.Sc. thesis project.

## Main goals of exploration: 
- Looking at computational cost (is it feasible to run on a CPU?)
- Exploring flexibility (how much info can I get out?)
- Familiarizing myself with the API 
    
## Initial Results: 
- Default parameters work pretty badly with small-ish data
- Relatively easy to use

In [12]:
import random
import numpy as np
import pandas as pd
from umap import UMAP
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups
from typing import List, Tuple, Union, Mapping, Any

### Helper functions

In [59]:
def weighted_mean(X, weights):
    """Weighted mean of the rows of X (num_rows, num_dims); returns shape (num_dims,)."""
    return np.dot(X.T, weights) / np.sum(weights)


def get_unique_topics(topic_model):
    topic_info = topic_model.get_topic_info()
    return topic_info["Topic"].unique()


def find_centroid(embeddings: np.ndarray, topics: np.ndarray, probs: np.ndarray, target_topic: int):
    """
    Arguments:
        embeddings: 2d array with dimensions (num_documents, num_dimensions)
        topics: np.array of length num_documents with the assigned topic of each document
        probs: np.array of length num_documents with the probability of the assigned topic
        target_topic: the topic we want to find the centroid for
    Returns:
        The probability-weighted centroid of the cluster, with shape (num_dimensions,)
    """
    # Filtering the embeddings
    filtered_embeddings = embeddings[topics == target_topic, :]
    filtered_probs = probs[topics == target_topic]

    # Calculating the centroid
    return weighted_mean(filtered_embeddings, filtered_probs)
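
To make the weighting concrete, here is a small toy check of the probability-weighted centroid. The 2-d points and probabilities are made up for illustration (they are not BERTopic output), and `weighted_mean` is repeated so the snippet runs on its own.

```python
import numpy as np


def weighted_mean(X, weights):
    # Same helper as above, repeated so the snippet is self-contained
    return np.dot(X.T, weights) / np.sum(weights)


# Three made-up 2-d "embeddings" and made-up topic probabilities
toy_embeddings = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]])
toy_probs = np.array([0.5, 0.25, 0.25])

toy_centroid = weighted_mean(toy_embeddings, toy_probs)
plain_centroid = toy_embeddings.mean(axis=0)

# The weighted centroid sits closer to the first point (highest probability)
# than the plain, unweighted mean does
print("weighted:  ", toy_centroid)
print("unweighted:", plain_centroid)
```

Note that if all weights for a topic are zero, the division yields NaN, so it may be worth checking for that case before trusting a centroid.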

### Getting example data

In [2]:
# Getting the Data #
print("fetching the data")
docs = fetch_20newsgroups(subset='test',  remove=('headers', 'footers', 'quotes'))['data']
docs = random.sample(docs, 1000)

fetching the data


### Initializing model
`paraphrase-MiniLM-L6-v2` was the fastest model I could find in sentence_transformers.

In [13]:
print("loading model")
topic_model = BERTopic() 
sentence_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
print("creating embeddings")
embeddings = sentence_model.encode(docs, show_progress_bar=True)
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)
print("all done!")
print(topic_model.get_topic_info())

loading model
creating embeddings
all done!
   Topic  Count                        Name
0      0    965             0_the_to_of_and
1      1     35  1_me_subscribe_mistake_yes


### Finding the centroids example
Below I'll find the centroid of topic 1, weighting each document by the probability of its assigned topic. It will be based on the full embeddings (for now).

In [33]:
# Filtering the embeddings
filtered_embeddings = embeddings[np.array(topics) == 1, :]
filtered_probs = probs[np.array(topics) == 1]

# Calculating the centroid
centroid = weighted_mean(filtered_embeddings, filtered_probs)
# We expect shape (n,), with n being the dimensionality of the embeddings
assert centroid.shape == (filtered_embeddings.shape[1],)
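
As an extra check on the formula itself, the manual dot-product version can be compared against `np.average` with its `weights` argument. This is only a sketch on random stand-in data: `fake_embeddings` and `fake_probs` are hypothetical placeholders for `filtered_embeddings` and `filtered_probs`, and the sizes (35 documents, 384 dimensions) are just chosen to resemble the run above.

```python
import numpy as np

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(35, 384))  # stand-in for filtered_embeddings
fake_probs = rng.uniform(0.1, 1.0, size=35)   # stand-in for filtered_probs

# Manual version, same expression as weighted_mean
manual = np.dot(fake_embeddings.T, fake_probs) / np.sum(fake_probs)

# NumPy's built-in weighted average over the document axis
reference = np.average(fake_embeddings, axis=0, weights=fake_probs)

np.testing.assert_allclose(manual, reference)
```

If the two agree, `np.average` could replace the helper, though that is a matter of taste.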

### Finding all centroids (non-optimized)
The above method seems to work fairly well. I'll calculate the centroids for all topics below (in a slow-ish for loop).

In [69]:
unique_topics = get_unique_topics(topic_model)
# Centroids need dimensions (number of topics, embedding dimensionality)
centroids = np.zeros((len(unique_topics), embeddings.shape[1]))
topics = np.array(topics)
for i, topic in enumerate(unique_topics):
    centroids[i, :] = find_centroid(embeddings, topics, probs, topic)

# Row i belongs to unique_topics[i], so look up the row of topic 1 instead of assuming it is row 1
topic_1_row = np.where(unique_topics == 1)[0][0]
np.testing.assert_array_equal(centroids[topic_1_row, :], find_centroid(embeddings, topics, probs, 1))
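
The loop above could probably be replaced by a single matrix product. The following is a rough sketch of one way to do that, not something I have run on the real embeddings: `all_centroids` is a hypothetical helper name, and the toy arrays are made up. Rows follow the sorted order of `np.unique`, which may or may not match the order of `get_topic_info()`.

```python
import numpy as np


def all_centroids(embeddings, topics, probs):
    """Probability-weighted centroid per topic, without a Python loop."""
    unique_topics, inverse = np.unique(topics, return_inverse=True)
    # weights[i, k] = probs[i] if document i is assigned to topic k, else 0
    weights = np.zeros((len(topics), len(unique_topics)))
    weights[np.arange(len(topics)), inverse] = probs
    # Weighted sums per topic, shape (num_topics, num_dims)
    sums = weights.T @ embeddings
    # Total weight per topic; a total of zero would give NaN for that row
    totals = weights.sum(axis=0)
    return unique_topics, sums / totals[:, None]


# Made-up toy data: four 2-d documents in two topics
toy_embeddings = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [1.0, 1.0]])
toy_topics = np.array([0, 0, 1, 1])
toy_probs = np.array([0.5, 0.5, 0.25, 0.75])

toy_unique, toy_centroids = all_centroids(toy_embeddings, toy_topics, toy_probs)
print(toy_unique)
print(toy_centroids)
```

The trade-off is memory: the weight matrix has one column per topic, which is fine for 1000 documents but may need a sparse matrix for larger corpora.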

### Sanity checking distance calculations
