[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AnFreTh/STREAM/blob/main/docs/notebooks/examples.ipynb)
[![Open On GitHub](https://img.shields.io/badge/Open-on%20GitHub-blue?logo=GitHub)](https://github.com/AnFreTh/STREAM/blob/main/docs/notebooks/examples.ipynb)

# Examples

**Note**: Make sure the required `nltk` resources are downloaded. If they are not, run the following:
```python
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
```

In [1]:
# Uncomment the line below if running in Colab;
# the package needs to be installed for the notebook to run.

# ! pip install -U stream_topic

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
from stream_topic.models import KmeansTM
from stream_topic.utils import TMDataset

Optimize model parameters via Bayesian optimization

In [4]:
dataset = TMDataset()
dataset.fetch_dataset("BBC_News")
dataset.preprocess(model_type="KmeansTM")

2024-08-09 15:33:16.644 | INFO     | stream_topic.utils.dataset:fetch_dataset:118 - Fetching dataset: BBC_News
2024-08-09 15:33:17.193 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:331 - Downloading dataset from github
2024-08-09 15:33:17.848 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:333 - Dataset downloaded successfully at ~/stream_topic_data/
2024-08-09 15:33:18.133 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:361 - Downloading dataset info from github
2024-08-09 15:33:18.324 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:363 - Dataset info downloaded successfully at ~/stream_topic_data/
Preprocessing d

In [5]:
model = KmeansTM()
output = model.optimize_and_fit(dataset, n_trials=10, max_topics=20, min_topics=3)

[I 2024-08-09 15:33:29,603] A new study created in memory with name: no-name-882315ac-44ed-4d90-9fc1-cff18636e26d
2024-08-09 15:33:29.606 | INFO     | stream_topic.models.KmeansTM:fit:206 - --- Training KmeansTM topic model ---
2024-08-09 15:33:30.201 | INFO     | stream_topic.models.abstract_helper_models.base:prepare_embeddings:215 - --- Loading precomputed paraphrase-MiniLM-L3-v2 embeddings ---
2024-08-09 15:33:30.285 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:302 - Downloading embeddings from github
2024-08-09 15:33:31.073 | INFO     | stream_topic.utils.data_downloader:load_custom_dataset_from_url:304 - Embeddings  downloaded successfully at ~/stream_topic_data/
2024-08-09 15:33:31.083 | INFO     | stream_topic.models.abs

In [6]:
topics = model.get_topics()
print(len(topics))

7
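
The topic count printed above lies inside the `min_topics`/`max_topics` range passed to `optimize_and_fit`. For intuition about how `n_trials`, `min_topics`, and `max_topics` might interact, here is a minimal sketch of a hyperparameter search loop. It is not the library's implementation: it uses plain random search as a stand-in for Bayesian optimization, and `toy_score` and `random_search` are hypothetical names invented for this illustration.

```python
import random


def toy_score(n_topics: int) -> float:
    """Hypothetical stand-in for a topic-quality score (higher is better)."""
    return -abs(n_topics - 7)


def random_search(score_fn, min_topics, max_topics, n_trials, seed=0):
    """Try n_trials random topic counts in [min_topics, max_topics]; keep the best."""
    rng = random.Random(seed)
    best_n, best_score = None, float("-inf")
    for _ in range(n_trials):
        n_topics = rng.randint(min_topics, max_topics)  # both bounds inclusive
        score = score_fn(n_topics)
        if score > best_score:
            best_n, best_score = n_topics, score
    return best_n, best_score


best_n, best_score = random_search(toy_score, min_topics=3, max_topics=20, n_trials=10)
print(best_n, best_score)
```

A Bayesian optimizer differs from this loop in one respect: it uses the results of earlier trials to choose the next candidate instead of sampling blindly.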


## Evaluate

In [8]:
from stream_topic.metrics import NPMI, ISIM

metric = NPMI(dataset)
metric.score(topics)

0.19318
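
To make the number above easier to interpret, the following sketch computes normalized pointwise mutual information (NPMI) for word pairs from document-level co-occurrence counts and averages it over a topic's word pairs. It illustrates the general NPMI formula only. How `stream_topic.metrics.NPMI` counts co-occurrences and aggregates over topics is not shown in this notebook, so the helper names (`npmi_pair`, `topic_npmi`) and the toy corpus are assumptions.

```python
import math
from itertools import combinations


def npmi_pair(w1, w2, docs):
    """NPMI of two words, using the fraction of documents containing them."""
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0:
        return -1.0  # the words never co-occur: conventional lower bound
    if p12 == 1:
        return 0.0  # 0/0 in the formula; treated as 0 in this sketch
    return math.log(p12 / (p1 * p2)) / -math.log(p12)


def topic_npmi(topic, docs):
    """Average NPMI over all word pairs of one topic (needs at least two words)."""
    pairs = list(combinations(topic, 2))
    return sum(npmi_pair(a, b, docs) for a, b in pairs) / len(pairs)


# Toy corpus: each document is a set of tokens (illustrative data only).
docs = [
    {"match", "goal", "team"},
    {"match", "goal", "coach"},
    {"market", "shares", "profit"},
    {"market", "profit", "bank"},
]

score_coherent = topic_npmi(["match", "goal"], docs)   # always appear together
score_mixed = topic_npmi(["match", "market"], docs)    # never appear together
print(score_coherent, score_mixed)
```

Values range from -1 (the words never co-occur) to 1 (they only occur together), so a higher average indicates a more coherent topic.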

In [9]:
isim_metric = ISIM()
isim_metric.score(topics)

0.18481285870075226
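
The notebook does not define ISIM. Assuming it stands for intruder similarity, that is, the average cosine similarity between a topic's top words and an intruder word taken from another topic, the idea can be sketched with toy vectors as below. The two-dimensional embeddings and the helper names (`cosine`, `intruder_similarity`) are hypothetical; consult the `stream_topic.metrics` documentation for the exact definition and the embedding model used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def intruder_similarity(topic_words, intruder, embeddings):
    """Mean cosine similarity between each topic word and the intruder word."""
    sims = [cosine(embeddings[w], embeddings[intruder]) for w in topic_words]
    return sum(sims) / len(sims)


# Toy 2-d "embeddings" (illustrative only, not real model output).
embeddings = {
    "match": [1.0, 0.0],
    "goal": [1.0, 0.0],
    "market": [0.0, 1.0],
    "team": [1.0, 1.0],
}

low = intruder_similarity(["match", "goal"], "market", embeddings)  # unrelated intruder
high = intruder_similarity(["match", "goal"], "team", embeddings)   # related intruder
print(low, high)
```

Under this assumption, a lower score means the intruder is easier to tell apart from the topic's own words.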