# Overview

In this notebook, we perform topic modeling with BERTopic. Topic modeling is the process of discovering topics in a collection of documents. It is what we did in the notebook [Generate Word Clouds by Using Specific Topic](https://www.kaggle.com/code/aisuko/generate-word-clouds-by-using-specific-topic).

In [1]:
%%capture
!pip install bertopic==0.16.0
!pip install datasets==2.17.0


In [2]:
import os

os.environ['DATASET_NAME']='CShorten/ML-ArXiv-Papers'

# Loading datasets

We will use a dataset containing abstracts and metadata from [ArXiv](https://huggingface.co/datasets/arxiv_dataset).

In [3]:
from datasets import load_dataset

dataset=load_dataset(os.getenv('DATASET_NAME'))['train']
dataset


Dataset({
    features: ['Unnamed: 0.1', 'Unnamed: 0', 'title', 'abstract'],
    num_rows: 117592
})

In [4]:
# extract abstracts to train on and corresponding titles
abstracts=dataset['abstract']
titles=dataset['title']

abstracts[0]

'  The problem of statistical learning is to construct a predictor of a random\nvariable $Y$ as a function of a related random variable $X$ on the basis of an\ni.i.d. training sample from the joint distribution of $(X,Y)$. Allowable\npredictors are drawn from some specified class, and the goal is to approach\nasymptotically the performance (expected loss) of the best predictor in the\nclass. We consider the setting in which one has perfect observation of the\n$X$-part of the sample, while the $Y$-part has to be communicated at some\nfinite bit rate. The encoding of the $Y$-values is allowed to depend on the\n$X$-values. Under suitable regularity conditions on the admissible predictors,\nthe underlying family of probability distributions and the loss function, we\ngive an information-theoretic characterization of achievable predictor\nperformance in terms of conditional distortion-rate functions. The ideas are\nillustrated on the example of nonparametric regression in Gaussian noise.\n'

## Tip - Sentence Splitter

Whenever we have large documents, we typically want to split them up into either paragraphs or sentences. A nice way to do so is with NLTK's sentence splitter, which is nothing more than:

```python
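import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # tokenizer data; newer NLTK releases may ask for 'punkt_tab'

# split every abstract into sentences, then flatten into one list
sentences=[sent_tokenize(abstract) for abstract in abstracts]
sentences=[sentence for doc in sentences for sentence in doc]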
```

# Pipeline of BERTopic

Before we start topic modeling, it is good to know the pipeline of BERTopic. BERTopic can be viewed as a sequence of steps that creates its topic representations.

Here is the process:

![https://maartengr.github.io/BERTopic/algorithm/default.svg](https://maartengr.github.io/BERTopic/algorithm/default.svg)

We can adapt the pipeline to the current state of the art with respect to each individual step:

![https://maartengr.github.io/BERTopic/algorithm/modularity.svg](https://maartengr.github.io/BERTopic/algorithm/modularity.svg)

# Pre-calculate Embeddings

We are going to execute the first step of the BERTopic pipeline which is `embeddings`. If you want to compute embeddings with multiple GPUs, check [Computing Embeddings Streaming](https://www.kaggle.com/code/aisuko/computing-embeddings-streaming) and [Computing Embeddings with Multi GPUs](https://www.kaggle.com/code/aisuko/computing-embeddings-with-multi-gpus).

In [6]:
from sentence_transformers import SentenceTransformer

encoder=SentenceTransformer('all-MiniLM-L6-v2').to('cuda')
encoder.max_seq_length=256
encoder


SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  (2): Normalize()
)

In [22]:
corpus_embeddings=encoder.encode(abstracts, show_progress_bar=True)
len(corpus_embeddings)

Batches:   0%|          | 0/3675 [00:00<?, ?it/s]

117592
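The `Batches` progress bar counts batches, not documents. As a quick sanity check, and assuming `encode` was left at its default `batch_size` of 32 (the notebook does not set it explicitly), the number of batches can be reproduced with a little arithmetic:

```python
import math

num_documents = 117592  # number of abstracts in the dataset loaded above
batch_size = 32         # assumption: the default batch size of SentenceTransformer.encode

# The last batch is only partially filled, so round up
num_batches = math.ceil(num_documents / batch_size)
print(num_batches)
```

This matches the 3675 batches reported by the progress bar. A larger `batch_size` reduces the number of batches at the cost of more GPU memory per step.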

# Preventing Stochastic Behavior

We generally use a dimensionality reduction algorithm to reduce the size of the embeddings. This is done to counter the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) to a certain degree. By default, this is done with `UMAP`, which is an incredible algorithm for reducing dimensional space. However, `UMAP` is stochastic by default, so it produces different results each time you run it. To prevent that, we need to set the `random_state` of the model before passing it to BERTopic.
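To see why fixing a seed matters, here is a minimal sketch that uses only Python's standard library. The function `noisy_projection` is a hypothetical stand-in for any stochastic algorithm; it is not UMAP and only illustrates the effect of seeding:

```python
import random

def noisy_projection(values, seed=None):
    """Toy stand-in for a stochastic algorithm: adds random jitter to each value."""
    rng = random.Random(seed)  # seed=None draws the seed from system entropy
    return [v + rng.uniform(-0.1, 0.1) for v in values]

data = [0.5, 1.5, 2.5]

# The same seed gives identical output on every run
run_a = noisy_projection(data, seed=42)
run_b = noisy_projection(data, seed=42)
print(run_a == run_b)  # True
```

Without a seed, two calls would almost certainly differ, which is the behavior that `random_state` switches off.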

In [11]:
from umap import UMAP

umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
umap_model



# Controlling Number of Topics

The `nr_topics` parameter controls the number of topics by merging topics *after* they have been created, which makes it possible to end up with a fixed number of topics. However, it is advised to control the number of topics through the cluster model, which is `HDBSCAN` by default. `HDBSCAN` has a parameter, `min_cluster_size`, that indirectly controls the number of topics that will be created (BERTopic exposes the same setting as `min_topic_size` when no custom cluster model is passed).

A higher `min_cluster_size` will generate fewer topics and a lower `min_cluster_size` will generate more topics. Here, we go with `min_cluster_size=150`, which yields 116 topics plus the outlier topic `-1` (see the output of `get_topic_info` in the Training section).
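The direction of this effect can be illustrated with a toy size filter. This is deliberately not HDBSCAN, which builds a density hierarchy internally; the helper `surviving_topics` and the candidate labels are hypothetical and only show that a larger minimum size leaves fewer groups standing:

```python
from collections import Counter

def surviving_topics(labels, min_size):
    """Keep groups with at least `min_size` members; smaller groups become outliers (-1)."""
    sizes = Counter(labels)
    kept = {label for label, size in sizes.items() if size >= min_size}
    relabeled = [label if label in kept else -1 for label in labels]
    return kept, relabeled

# Hypothetical candidate grouping of 12 documents
candidate_labels = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 0]

for min_size in (2, 3, 5):
    kept, _ = surviving_topics(candidate_labels, min_size)
    print(min_size, sorted(kept))
```

Documents from groups that fall below the threshold end up with the label `-1`, which mirrors the outlier topic in BERTopic's output.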

In [14]:
from hdbscan import HDBSCAN

hdbscan_model=HDBSCAN(min_cluster_size=150, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
hdbscan_model

# Improving Default Representation

The default representation of topics is calculated through c-TF-IDF. c-TF-IDF, in turn, is powered by the `CountVectorizer`, which converts text into tokens. Using the `CountVectorizer`, we can do a number of things:
* Remove stopwords
* Ignore infrequent words
* Increase the n-gram range

In other words, we can preprocess the topic representations after documents are assigned to topics. This will not influence the clustering process in any way. Here we will ignore English stopwords and infrequent words. Moreover, by increasing the n-gram range we will consider topic representations that are made up of one or two words.
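To make these three options concrete, here is a simplified, standard-library sketch of what they do to a vocabulary. It only mimics the idea: the whitespace tokenizer, the tiny stop-word set, and the example documents are assumptions for illustration, and scikit-learn's real tokenization differs in its details.

```python
from collections import Counter

STOP_WORDS = {"the", "of", "a", "is", "for", "and", "in"}  # tiny illustrative list

def ngrams(text, ngram_range=(1, 2)):
    """Lowercase, drop stop words, then emit all n-grams in the given range."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    low, high = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]

docs = [
    "the policy gradient method",
    "a policy gradient for the agent",
    "speech recognition is hard",
]

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(ngrams(doc)))

# min_df=2 keeps only terms that occur in at least two documents
vocabulary = sorted(term for term, df in doc_freq.items() if df >= 2)
print(vocabulary)
```

Only `gradient`, `policy`, and the bigram `policy gradient` survive: stop words are gone, terms seen in a single document are dropped, and two-word phrases become candidates.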

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model=CountVectorizer(stop_words='english', min_df=2, ngram_range=(1,2))
vectorizer_model
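The term counts from the vectorizer are then weighted with c-TF-IDF. A minimal sketch of the weighting follows, assuming the commonly documented form: the term frequency within a topic multiplied by `log(1 + A / f_t)`, where `A` is the average number of words per topic and `f_t` is the frequency of the term across all topics. The two toy topics are hypothetical, and BERTopic's own implementation applies additional normalization.

```python
import math
from collections import Counter

# Hypothetical topics, each represented by all of its documents joined together
topic_docs = {
    0: "policy reward agent policy learning",
    1: "speech audio speaker speech learning",
}

term_counts = {topic: Counter(text.split()) for topic, text in topic_docs.items()}

# f_t: frequency of each term across all topics
total_counts = Counter()
for counts in term_counts.values():
    total_counts.update(counts)

# A: average number of words per topic
avg_words = sum(total_counts.values()) / len(term_counts)

def c_tf_idf(topic):
    """Score = term frequency in the topic * log(1 + A / frequency across topics)."""
    return {
        term: tf * math.log(1 + avg_words / total_counts[term])
        for term, tf in term_counts[topic].items()
    }

scores = c_tf_idf(0)
top_terms = sorted(scores, key=scores.get, reverse=True)
print(top_terms)
```

The word `learning` appears in both topics, so it receives the lowest score in topic 0, while `policy` is frequent in topic 0 and absent from topic 1, so it ranks first.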

# Additional Representations

In [17]:
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, PartOfSpeech

keybert_model=KeyBERTInspired()

pos_model=PartOfSpeech('en_core_web_sm')

mmr_model=MaximalMarginalRelevance(diversity=0.3)

In [18]:
representation_model={
    'KeyBERT':keybert_model,
    'MMR':mmr_model,
    'POS':pos_model
}
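The `diversity` parameter of `MaximalMarginalRelevance` trades relevance against redundancy. The following is a minimal sketch of greedy MMR selection under stated assumptions: the 2-d vectors are made up, the helper names are hypothetical, and BERTopic's implementation works on real word and topic embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mmr(topic_vec, candidates, top_n=2, diversity=0.3):
    """Greedy MMR: reward relevance to the topic, penalize similarity to picked words."""
    relevance = {w: cosine(vec, topic_vec) for w, vec in candidates.items()}
    selected = [max(relevance, key=relevance.get)]  # start with the most relevant word
    while len(selected) < min(top_n, len(candidates)):
        remaining = [w for w in candidates if w not in selected]

        def score(word):
            redundancy = max(cosine(candidates[word], candidates[s]) for s in selected)
            return (1 - diversity) * relevance[word] - diversity * redundancy

        selected.append(max(remaining, key=score))
    return selected

# Hypothetical 2-d embeddings; "speech" and "speeches" are near-duplicates
topic_vec = [1.0, 0.0]
candidates = {
    "speech": [1.0, 0.05],
    "speeches": [1.0, 0.1],
    "audio": [0.6, 0.8],
}

print(mmr(topic_vec, candidates, diversity=0.0))
print(mmr(topic_vec, candidates, diversity=0.7))
```

With `diversity=0.0` the two near-duplicates are picked because only relevance counts. With `diversity=0.7` the second pick switches to `audio`, which is less relevant but adds new information.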

# Training

In [23]:
from bertopic import BERTopic

topic_model=BERTopic(
    embedding_model=encoder,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    
    # hyperparameters
    top_n_words=10,
    verbose=True
)

topics, probs=topic_model.fit_transform(abstracts, corpus_embeddings)

2024-02-13 10:45:08,291 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
  warn(f"n_jobs value {self.n_jobs} overridden to 1 by setting random_state. Use no seed for parallelism.")
2024-02-13 10:48:01,231 - BERTopic - Dimensionality - Completed ✓
2024-02-13 10:48:01,234 - BERTopic - Cluster - Start clustering the reduced embeddings
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

In [24]:
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | KeyBERT | MMR | POS | Representative_Docs |
|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 38401 | -1_data_learning_model_models | [data, learning, model, models, based, trainin... | [deep learning, classification, deep, machine ... | [data, learning, model, models, based, trainin... | [data, learning, model, models, training, meth... | [ Clinical decision support using deep neural... |
| 1 | 0 | 5743 | 0_policy_reinforcement_reinforcement learning_rl | [policy, reinforcement, reinforcement learning... | [policy gradient, deep reinforcement, reinforc... | [policy, reinforcement, reinforcement learning... | [policy, reinforcement, learning, reward, agen... | [ Reinforcement learning (RL) algorithms have... |
| 2 | 1 | 3482 | 1_speech_audio_speaker_music | [speech, audio, speaker, music, acoustic, asr,... | [automatic speech, speech enhancement, speech ... | [speech, audio, speaker, music, acoustic, asr,... | [speech, audio, speaker, music, acoustic, reco... | [In this paper, we propose a novel unsupervise... |
| 3 | 2 | 3309 | 2_3d_object_video_segmentation | [3d, object, video, segmentation, image, objec... | [3d, point cloud, point clouds, semantic segme... | [3d, object, video, segmentation, image, objec... | [object, video, segmentation, image, objects, ... | [ Given two consecutive RGB-D images, we prop... |
| 4 | 3 | 2044 | 3_user_recommendation_items_item | [user, recommendation, items, item, recommende... | [collaborative filtering, recommendation syste... | [user, recommendation, items, item, recommende... | [user, recommendation, items, item, recommende... | [ Capturing the temporal dynamics of user pre... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113 | 112 | 162 | 112_metric_metric learning_distance_distance m... | [metric, metric learning, distance, distance m... | [metric learning, learning metric, existing me... | [metric, metric learning, distance, distance m... | [metric, metric learning, distance, similarity... | [ Distance metric learning aims to learn from... |
| 114 | 113 | 160 | 113_class_imbalanced_classification_imbalance | [class, imbalanced, classification, imbalance,... | [imbalanced datasets, class imbalance, class i... | [class, imbalanced, classification, imbalance,... | [class, imbalanced, classification, imbalance,... | [ Class-imbalance refers to classification pr... |
| 115 | 114 | 157 | 114_clustering_deep clustering_deep_cluster | [clustering, deep clustering, deep, cluster, u... | [deep clustering, unsupervised clustering, dee... | [clustering, deep clustering, deep, cluster, u... | [clustering, deep clustering, deep, cluster, u... | [ Recently, deep clustering, which is able to... |
| 116 | 115 | 154 | 115_view_multi view_views_multi | [view, multi view, views, multi, clustering, v... | [view clustering, view learning, view classifi... | [view, multi view, views, multi, clustering, v... | [view, views, multi, clustering, view clusteri... | [ Multi-view clustering has attracted much at... |


To get all representations for a single topic, we simply run the following:

In [25]:
topic_model.get_topic(1, full=True)

{'Main': [('speech', 0.030980777560791747),
  ('audio', 0.019195372105819192),
  ('speaker', 0.01656129893941675),
  ('music', 0.01246907585980965),
  ('acoustic', 0.00979335516004357),
  ('asr', 0.009399745930121677),
  ('recognition', 0.009217889498832661),
  ('speech recognition', 0.00819144632140227),
  ('voice', 0.007604926436734665),
  ('model', 0.007508728321366143)],
 'KeyBERT': [('automatic speech', 0.5871144),
  ('speech enhancement', 0.5601559),
  ('speech recognition', 0.5465496),
  ('speaker verification', 0.47161502),
  ('voice', 0.4673625),
  ('speech', 0.38206068),
  ('speaker', 0.37639505),
  ('audio', 0.3757063),
  ('recognition asr', 0.36790693),
  ('utterance', 0.34125164)],
 'MMR': [('speech', 0.030980777560791747),
  ('audio', 0.019195372105819192),
  ('speaker', 0.01656129893941675),
  ('music', 0.01246907585980965),
  ('acoustic', 0.00979335516004357),
  ('asr', 0.009399745930121677),
  ('recognition', 0.009217889498832661),
  ('speech recognition', 0.0081914463

## Tip - Parameters

If you would like to return the topic-document probability matrix, then it is advised to use `calculate_probabilities=True`. Do note that this can significantly slow down training. To speed it up, use cuML's HDBSCAN instead. You could also approximate the topic-document probability matrix with `.approximate_distribution`, which is discussed later.

# (Custom) Labels

The default label of each topic is the topic number followed by its top words, joined with underscores (for example, `1_speech_audio_speaker_music`). This, of course, might not be the best label that you can think of for a certain topic. Instead, we can use `.set_topic_labels` to manually label all or certain topics. We can also use `.set_topic_labels` to apply one of the other topic representations that we created before, like `KeyBERTInspired`.

In [26]:
# Label the topics yourself (topic 1 covers speech/audio and topic 2 covers 3D vision in the table above)
topic_model.set_topic_labels({1:'Speech and Audio', 2:'3D Vision'})

# or use one of the other topic representations, like KeyBERTInspired
keybert_topic_labels={topic: ' | '.join(list(zip(*values))[0][:3]) for topic, values in topic_model.topic_aspects_['KeyBERT'].items()}
topic_model.set_topic_labels(keybert_topic_labels)

topic_model

<bertopic._bertopic.BERTopic at 0x7a60ebc6a7a0>
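The dictionary comprehension in the cell above is fairly dense. Here is the same logic unrolled on a small, hypothetical stand-in for `topic_model.topic_aspects_['KeyBERT']`; the words echo the output shown earlier, but the scores are made up for illustration:

```python
# Hypothetical stand-in for topic_model.topic_aspects_['KeyBERT']:
# each topic maps to a list of (word, score) pairs, best first
keybert_aspect = {
    0: [("policy gradient", 0.61), ("deep reinforcement", 0.58), ("reward", 0.40), ("agent", 0.35)],
    1: [("automatic speech", 0.59), ("speech enhancement", 0.56), ("voice", 0.47), ("audio", 0.38)],
}

labels = {}
for topic, values in keybert_aspect.items():
    words, scores = zip(*values)           # unzip the pairs into two tuples
    labels[topic] = " | ".join(words[:3])  # keep the three best words

print(labels[1])
```

`zip(*values)` turns a list of pairs into one tuple of words and one tuple of scores, so `[0][:3]` in the original expression simply selects the first three words.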

Now that we have set the updated topic labels, we can access them through the many functions used throughout BERTopic. Most notably, we can show the updated labels in visualizations with the `custom_labels=True` parameter. We can also see that `.get_topic_info` now includes the column `CustomName`, which holds the custom label that we just created for each topic.

In [29]:
topic_model.get_topic_info()

| | Topic | Count | Name | CustomName | Representation | KeyBERT | MMR | POS | Representative_Docs |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 38401 | -1_data_learning_model_models | deep learning \| classification \| deep | [data, learning, model, models, based, trainin... | [deep learning, classification, deep, machine ... | [data, learning, model, models, based, trainin... | [data, learning, model, models, training, meth... | [ Clinical decision support using deep neural... |
| 1 | 0 | 5743 | 0_policy_reinforcement_reinforcement learning_rl | policy gradient \| deep reinforcement \| reinfor... | [policy, reinforcement, reinforcement learning... | [policy gradient, deep reinforcement, reinforc... | [policy, reinforcement, reinforcement learning... | [policy, reinforcement, learning, reward, agen... | [ Reinforcement learning (RL) algorithms have... |
| 2 | 1 | 3482 | 1_speech_audio_speaker_music | automatic speech \| speech enhancement \| speech... | [speech, audio, speaker, music, acoustic, asr,... | [automatic speech, speech enhancement, speech ... | [speech, audio, speaker, music, acoustic, asr,... | [speech, audio, speaker, music, acoustic, reco... | [In this paper, we propose a novel unsupervise... |
| 3 | 2 | 3309 | 2_3d_object_video_segmentation | 3d \| point cloud \| point clouds | [3d, object, video, segmentation, image, objec... | [3d, point cloud, point clouds, semantic segme... | [3d, object, video, segmentation, image, objec... | [object, video, segmentation, image, objects, ... | [ Given two consecutive RGB-D images, we prop... |
| 4 | 3 | 2044 | 3_user_recommendation_items_item | collaborative filtering \| recommendation syste... | [user, recommendation, items, item, recommende... | [collaborative filtering, recommendation syste... | [user, recommendation, items, item, recommende... | [user, recommendation, items, item, recommende... | [ Capturing the temporal dynamics of user pre... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113 | 112 | 162 | 112_metric_metric learning_distance_distance m... | metric learning \| learning metric \| existing m... | [metric, metric learning, distance, distance m... | [metric learning, learning metric, existing me... | [metric, metric learning, distance, distance m... | [metric, metric learning, distance, similarity... | [ Distance metric learning aims to learn from... |
| 114 | 113 | 160 | 113_class_imbalanced_classification_imbalance | imbalanced datasets \| class imbalance \| class ... | [class, imbalanced, classification, imbalance,... | [imbalanced datasets, class imbalance, class i... | [class, imbalanced, classification, imbalance,... | [class, imbalanced, classification, imbalance,... | [ Class-imbalance refers to classification pr... |
| 115 | 114 | 157 | 114_clustering_deep clustering_deep_cluster | deep clustering \| unsupervised clustering \| de... | [clustering, deep clustering, deep, cluster, u... | [deep clustering, unsupervised clustering, dee... | [clustering, deep clustering, deep, cluster, u... | [clustering, deep clustering, deep, cluster, u... | [ Recently, deep clustering, which is able to... |
| 116 | 115 | 154 | 115_view_multi view_views_multi | view clustering \| view learning \| view classif... | [view, multi view, views, multi, clustering, v... | [view clustering, view learning, view classifi... | [view, multi view, views, multi, clustering, v... | [view, views, multi, clustering, view clusteri... | [ Multi-view clustering has attracted much at... |


# Topic-Document Distribution

If using `calculate_probabilities=True` is not possible, then we can approximate the topic-document distributions using `.approximate_distribution`. It is a fast and flexible method for creating different topic-document distributions.

In [30]:
# `topic_distr` contains the distribution of topics in each document
topic_distr, _ =topic_model.approximate_distribution(abstracts, window=8, stride=4)

100%|██████████| 118/118 [02:44<00:00,  1.40s/it]
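The `window` and `stride` arguments describe a sliding window over the tokens of each document. As a rough sketch of that idea (the helper `token_sets` is hypothetical, and BERTopic's own handling of document boundaries may differ), the token sets for `window=8, stride=4` could be produced like this:

```python
def token_sets(tokens, window=8, stride=4):
    """Slide a window of `window` tokens over the document, `stride` tokens at a time.

    Simplification: a trailing remainder shorter than the stride is ignored.
    """
    if len(tokens) <= window:
        return [tokens]
    return [tokens[start:start + window] for start in range(0, len(tokens) - window + 1, stride)]

# A toy document of 12 placeholder tokens
tokens = [f"t{i}" for i in range(12)]

sets = token_sets(tokens)
for chunk in sets:
    print(chunk)
```

The two windows overlap by four tokens (`t4` to `t7`). Each token set is compared against every topic, and the resulting similarities are combined into the distribution for the whole document.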


Next, let's take a look at a specific abstract and see how the topic distribution was extracted:

In [31]:
abstract_id=10
print(abstracts[abstract_id])

  Speaker identification is a powerful, non-invasive and in-expensive biometric
technique. The recognition accuracy, however, deteriorates when noise levels
affect a specific band of frequency. In this paper, we present a sub-band based
speaker identification that intends to improve the live testing performance.
Each frequency sub-band is processed and classified independently. We also
compare the linear and non-linear merging techniques for the sub-bands
recognizer. Support vector machines and Gaussian Mixture models are the
non-linear merging techniques that are investigated. Results showed that the
sub-band based method used with linear merging techniques enormously improved
the performance of the speaker identification over the performance of wide-band
recognizers when tested live. A live testing improvement of 9.78% was achieved



## Visualization

Visualize the topic-document distribution for a single document

In [33]:
topic_model.visualize_distribution(topic_distr[abstract_id])

Visualize the same distribution, this time with the custom labels:

In [34]:
topic_model.visualize_distribution(topic_distr[abstract_id], custom_labels=True)

It seems to have extracted a number of relevant topics and shows the distribution of these topics across the abstract. We can go one step further and visualize them on a token level:

In [35]:
# calculate the topic distributions on a token-level
topic_distr, topic_token_distr=topic_model.approximate_distribution(abstracts[abstract_id], calculate_tokens=True)

# visualize the token-level distributions
df=topic_model.visualize_approximate_distribution(abstracts[abstract_id], topic_token_distr[0])
df

100%|██████████| 1/1 [00:00<00:00,  5.60it/s]

Styler.applymap has been deprecated. Use Styler.map instead.



Unnamed: 0,Speaker,identification,is,powerful,non,invasive,and,in,expensive,biometric,technique,The,recognition,accuracy,however,deteriorates,when,noise,levels,affect,specific,band,of,frequency,In,this,paper,we,present,sub,band.1,based,speaker,identification.1,that,intends,to,improve,the,live,testing,performance,Each,frequency.1,sub.1,band.2,is.1,processed,and.1,classified,independently,We,also,compare,the.1,linear,and.2,non.1,linear.1,merging,techniques,for,the.2,sub.2,bands,recognizer,Support,vector,machines,and.3,Gaussian,Mixture,models,are,the.3,non.2,linear.2,merging.1,techniques.1,that.1,are.1,investigated,Results,showed,that.2,the.4,sub.3,band.3,based.1,method,used,with,linear.3,merging.2,techniques.2,enormously,improved,the.5,performance.1,of.1,the.6,speaker.1,identification.2,over,the.7,performance.2,of.2,wide,band.4,recognizers,when.1,tested,live.1,live.2,testing.1,improvement,of.3,78,was,achieved
1_speech_audio_speaker_music,0.113,0.113,0.113,0.113,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.101,0.101,0.101,0.101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.11,0.229,0.348,0.467,0.356,0.238,0.119,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18_gaussian_variational_inference_gp,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.112,0.224,0.224,0.224,0.112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
63_emotion_emotion recognition_recognition_facial,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.113,0.113,0.113,0.113,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
92_mixture_em_mixtures_gaussian,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.243,0.69,1.137,1.54,1.297,0.85,0.402,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
93_svm_support vector_support_vector,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.24,0.576,1.005,1.16,0.92,0.584,0.155,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
94_handwritten_text_character_recognition,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.116,0.116,0.116,0.116,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Tip - use_embedding_model

By default, we compare the c-TF-IDF calculations between the token sets and all topics. Due to its bag-of-words representation, this is quite fast. However, we might want to use the selected `embedding_model` instead to do this comparison. Do note that, due to the many token sets, it is often computationally quite a bit slower:

```python
topic_distr, _ = topic_model.approximate_distribution(abstracts, use_embedding_model=True)
```