# Overview

Thanks to its modular design, BERTopic can also be used on large datasets such as `Cohere/wikipedia-22-12` (more than 1,000,000 documents) if we swap some of its internal algorithms for ones that scale better. We can also enable GPU-accelerated machine learning.

In [1]:
%%capture --no-stderr
# !pip install bertopic==0.16.0

!pip install sentence-transformers==2.3.1
!pip install datasets==2.17.0

Sometimes you may get the error `NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968`. If so, run the following code:

In [None]:
# import _locale

# _locale._getdefaultlocale = (lambda *args: ['en_US', 'utf8'])
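
If that override has no effect, a commonly used alternative in hosted notebooks is to patch `locale.getpreferredencoding` directly. The sketch below assumes the error is raised by a check on that function's return value; treat it as a session-level workaround, not a proper locale fix.

```python
import locale

# Inspect what the runtime currently reports (for example 'ANSI_X3.4-1968' or 'UTF-8').
print(locale.getpreferredencoding())

# Assumption: the failing check calls locale.getpreferredencoding().
# Overriding it makes that check see UTF-8 for the rest of the session.
locale.getpreferredencoding = lambda do_setlocale=True: "UTF-8"

print(locale.getpreferredencoding())
```

The override only lasts for the current Python process, so it has to be re-run after a kernel restart.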

# Loading the dataset

We are going to load Wikipedia texts. Fortunately, Cohere has created a dataset that is split by paragraph, which keeps each text within the embedding model's token limit. Let's load 1 million texts from Wikipedia and see if we can extract topics from them.

In [None]:
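from datasets import load_dataset

# Quick test: load only the first 10,000 rows of the English split.
data = load_dataset('Cohere/wikipedia-22-12', 'en', split='train[:10000]')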
# docs=[doc['text'] for doc in data if doc['id']!='1_000']
# print(len(docs))

In [None]:
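from itertools import islice

from datasets import load_dataset
from sentence_transformers import SentenceTransformer

if __name__ == '__main__':
    lang = 'en'
    # Stream the dataset so it is not downloaded in full before use.
    data = load_dataset('Cohere/wikipedia-22-12', lang, split='train', streaming=True)
    # Stop after the first 1 million paragraphs instead of reading the whole stream.
    docs = [doc['text'] for doc in islice(data, 1_000_000)]
    print(len(docs))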
    
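    encoder = SentenceTransformer('all-MiniLM-L6-v2')

    # Start one worker process per available GPU (CPU workers if no GPU is found).
    pool = encoder.start_multi_process_pool()

    # Encode the documents in chunks distributed across the worker pool.
    emb = encoder.encode_multi_process(docs, pool)
    print(emb.shape)

    # Shut down the worker processes.
    encoder.stop_multi_process_pool(pool)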
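
When a dataset is streamed, the number of documents can be capped by slicing the iterator, so the rest of the stream is never read. A minimal sketch with `itertools.islice`; `fake_stream` and `max_docs` are hypothetical names standing in for the streamed dataset and the desired document count:

```python
from itertools import islice


def fake_stream():
    """Hypothetical stand-in for a streamed dataset: yields dicts with a 'text' field."""
    i = 0
    while True:  # endless, like a very large stream
        yield {"id": i, "text": f"paragraph {i}"}
        i += 1


max_docs = 5  # would be e.g. 1_000_000 for the Wikipedia run

# islice stops pulling from the generator once max_docs rows have been taken.
docs = [row["text"] for row in islice(fake_stream(), max_docs)]
print(len(docs))
```

Because `islice` works on any iterable, the same pattern should apply to a dataset loaded with `streaming=True`.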

# Basic Example

This example shows the minimum steps necessary for training a BERTopic model on large datasets. Note, though, that memory errors are still possible when tweaking parameters. After this section, we cover some tips for further reducing memory use and making training more efficient.


## Embeddings

Next, we are going to pre-calculate the embeddings that serve as input to our BERTopic model. We do this because computing embeddings can take quite some time. If we pre-calculate and save them, we can skip this step while iterating on our model.
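
The compute-once, reuse-later idea can be sketched as a small caching helper. This is only an illustration under stated assumptions: `load_or_compute` and `fake_encode` are hypothetical names, pickle is one of several possible storage formats, and `fake_encode` stands in for the slow embedding call.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute_fn):
    """Return the object cached at `path`, computing and saving it on a cache miss."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(path, "wb") as f:
        pickle.dump(result, f, protocol=pickle.HIGHEST_PROTOCOL)
    return result


calls = []  # records how often the expensive step actually runs


def fake_encode():
    """Hypothetical stand-in for the slow embedding computation."""
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.pkl")
first = load_or_compute(cache_path, fake_encode)   # computes and writes the cache
second = load_or_compute(cache_path, fake_encode)  # reads the cache, no recompute
print(first == second, len(calls))
```

On later runs only the cached file is read, so changes to the topic model do not trigger a new embedding pass.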

In [None]:
# from sentence_transformers import SentenceTransformer

# encoder=SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# encoder.max_seq_length=256
# encoder.to('cuda')
# encoder

In [None]:
# embeddings=encoder.encode(docs, show_progress_bar=True)
# embeddings.shape

To compute embeddings on multiple GPUs and normalize them, see [Distribution compute of Quora questions embeddings](https://www.kaggle.com/code/aisuko/distribution-compute-of-quora-questions-embeddings).
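
Assuming normalization here means scaling each embedding to unit L2 length (so that dot products equal cosine similarities), the operation itself can be illustrated on toy vectors. `l2_normalize` is a hypothetical helper written in plain Python for clarity; for real embedding matrices, a vectorized version or the `normalize_embeddings=True` option of `SentenceTransformer.encode` would be the practical choice.

```python
import math


def l2_normalize(rows):
    """Scale each vector to unit L2 length; all-zero vectors are returned unchanged."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm for x in row] if norm > 0 else list(row))
    return normalized


vectors = [[3.0, 4.0], [0.0, 0.0], [1.0, 1.0]]
unit = l2_normalize(vectors)

# Every non-zero row now has length 1 (up to floating-point error).
for row in unit:
    print(row, math.sqrt(sum(x * x for x in row)))
```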

In [None]:
import pandas as pd

# `embeddings` is the 2-D array produced by the encoding step above
# (named `emb` in the multi-process example).
embedding_df = pd.DataFrame(embeddings)
embedding_df.to_csv('embedding_of_wikipedia_22_12_without_normalization.csv')