<a href="https://colab.research.google.com/github/Jonathanpro/myaiblog/blob/master/_notebooks/2021-07-22-Topic_Modelling_with_Bert_Topic_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP - Topic Modelling with BERTopic - Inference
> A tutorial on inference and visualization for topic modelling with Hugging Face and BERTopic.

- toc: true 
- badges: true
- comments: false
- categories: [jupyter, NLP, GPU]
- image: images/chart-preview.png

# Enabling the GPU

Start this notebook with a GPU attached:

- Navigate to Edit → Notebook Settings
- Select GPU from the Hardware Accelerator drop-down
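
To confirm that the runtime really has a GPU attached, a quick check like the one below can help. It is only a sketch: it assumes the NVIDIA driver tool `nvidia-smi` is on the `PATH` (as it normally is on Colab GPU runtimes), and the helper name `gpu_visible` is made up for this example.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` exists and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA driver tools found, so assume there is no GPU.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # `nvidia-smi -L` prints one line per device, each starting with "GPU <index>:".
    return result.returncode == 0 and "GPU " in result.stdout


print("GPU available:", gpu_visible())
```

If this prints `False`, revisit the notebook settings before running the inference cells.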

# **Requirements**
After installing BERTopic, click "Restart runtime" so that the newly installed packages are loaded.

In [None]:
!pip install bertopic

Collecting bertopic
  Downloading bertopic-0.8.1-py2.py3-none-any.whl (53 kB)
Collecting plotly<4.14.3,>=4.7.0
  Downloading plotly-4.14.2-py2.py3-none-any.whl (13.2 MB)
Collecting umap-learn>=0.5.0
  Downloading umap-learn-0.5.1.tar.gz (80 kB)
Collecting hdbscan>=0.8.27
  Downloading hdbscan-0.8.27.tar.gz (6.4 MB)
  Installing build dependencies ... done
  Getting requir

In [None]:
!pip install numba --upgrade

Collecting numba
  Downloading numba-0.53.1-cp37-cp37m-manylinux2014_x86_64.whl (3.4 MB)
Collecting llvmlite<0.37,>=0.36.0rc1
  Downloading llvmlite-0.36.0-cp37-cp37m-manylinux2010_x86_64.whl (25.3 MB)
Installing collected packages: llvmlite, numba
  Attempting uninstall: llvmlite
    Found existing installation: llvmlite 0.34.0
    Uninstalling llvmlite-0.34.0:
      Successfully uninstalled llvmlite-0.34.0
  Attempting uninstall: numba
    Found existing installation: numba 0.51.2
    Uninstalling numba-0.51.2:
      Successfully uninstalled numba-0.51.2
Successfully installed llvmlite-0.36.0 numba-0.53.1
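
After restarting the runtime, it can be useful to confirm which versions are actually installed. The sketch below uses the standard-library `importlib.metadata` module, which assumes Python 3.8 or newer (the log above comes from a Python 3.7 runtime, where the `importlib_metadata` backport would be needed instead). The helper name `installed_version` is an illustration, not part of any library.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages touched by the two install cells above.
for name in ("bertopic", "numba", "llvmlite"):
    print(name, installed_version(name))
```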


# **Load data**
Create an API token (`kaggle.json`) in your Kaggle account settings and upload it with the cell below.

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
    print(f'User uploaded file "{fn}" with length {len(uploaded[fn])} bytes')

# Then move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json
User uploaded file "kaggle.json" with length 62 bytes
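
Before authenticating, one could sanity-check the uploaded token. A Kaggle API token is a small JSON file with a `username` and a `key` field. The sketch below only verifies that both are present and non-empty; it does not contact Kaggle, and the function name `check_kaggle_credentials` is hypothetical.

```python
import json
from pathlib import Path


def check_kaggle_credentials(path) -> bool:
    """Return True if `path` is a JSON object with non-empty "username" and "key" strings."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing or unreadable file, or content that is not valid JSON.
        return False
    if not isinstance(data, dict):
        return False
    return all(
        isinstance(data.get(field), str) and bool(data[field])
        for field in ("username", "key")
    )


# Location used by the cell above.
print(check_kaggle_credentials(Path.home() / ".kaggle" / "kaggle.json"))
```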


In [None]:
import kaggle

In [None]:
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

In [None]:
api.dataset_download_file('pqbsbk/german-news-dataset',
                          file_name='data.csv')

True

In [None]:
!unzip data.csv.zip

Archive:  data.csv.zip
  inflating: data.csv                
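
Outside Colab, where the `!unzip` shell command may not be available, the same step can be done with Python's standard `zipfile` module. This is a minimal sketch; the helper `extract_archive` is an illustrative name, and the archive name is taken from the cell above.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir="."):
    """Extract every member of a zip archive and return the member names."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Only runs if the download from the previous cell is present.
if Path("data.csv.zip").exists():
    print(extract_archive("data.csv.zip"))
```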


In [None]:
!ls

data.csv  data.csv.zip	sample_data


In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("data.csv")

In [None]:
# Keep only rows that actually contain article text.
df = df[df["text"].notnull()]

# **Load trained model**

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
!ls /content/gdrive/MyDrive

'Colab Notebooks'	   NER_01_030_2021_07_08.bin
 MC_01_03_29_06_2021.bin   ner.bin
 MC_02_06_07_02_21.bin	   tm_bert_topic
 MC_03_09_07_05_21.bin	   tm_bert_topic_2021_07_19__12_54_00
 multi_class.bin	   tm_bert_topic_2021_07_19__14_33_09


In [None]:
from bertopic import BERTopic
topic_model = BERTopic.load("/content/gdrive/MyDrive/tm_bert_topic_2021_07_19__14_33_09")

# **Model inference**

In [None]:
# Assign a topic to each of the first 25,000 articles.
topics, probs = topic_model.transform(list(df["text"])[:25000])

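
The `topics` list holds one topic id per document, with `-1` marking outliers that BERTopic could not assign. The sketch below shows one possible way to feed documents in smaller batches and to summarise the assignments afterwards. Both helpers (`batched`, `topic_sizes`) are hypothetical names, the batch size in the usage comment is an arbitrary choice, and the ids in the demo call are toy values, not results from this model.

```python
from collections import Counter


def batched(items, batch_size):
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def topic_sizes(topics):
    """Return (topic id, document count) pairs sorted by size, plus the outlier count."""
    counts = Counter(topics)
    outliers = counts.pop(-1, 0)  # -1 is the outlier topic
    return counts.most_common(), outliers


# Hypothetical usage with the model and DataFrame loaded above:
# topics = []
# for batch in batched(list(df["text"])[:25000], 5000):
#     batch_topics, _ = topic_model.transform(batch)
#     topics.extend(batch_topics)

# Toy ids for illustration only.
ranked, outliers = topic_sizes([3, -1, 3, 7, -1, 3])
print(ranked, outliers)  # [(3, 3), (7, 1)] 2
```

Batching mainly keeps memory use predictable on smaller runtimes. Whether it changes throughput depends on the embedding model and the available GPU.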






# **Visualization**

## Visualize Topics

In [None]:
topic_model.visualize_topics()



## Visualize Topic Hierarchy

In [None]:
topic_model.visualize_hierarchy(top_n_topics=50)

## Visualize Terms

In [None]:
topic_model.visualize_barchart(top_n_topics=5)

## Visualize Topic Similarity

In [None]:
topic_model.visualize_heatmap(n_clusters=20, width=1000, height=1000)

## Visualize Term Score Decline

In [None]:
topic_model.visualize_term_rank()

# **Search Topics**

In [None]:
similar_topics, similarity = topic_model.find_topics("iran", top_n=5)
similar_topics

[376, 370, 294, 112, 118]

In [None]:
topic_model.get_topic(370)

[('iran', 0.045999810020958744),
 ('atomabkommen', 0.021954321831109038),
 ('sicherheitsrat', 0.02192809918196278),
 ('sanktionen', 0.020971347590174532),
 ('iranische', 0.012191527878246075),
 ('atomabkommens', 0.011997523481121063),
 ('israel', 0.011565215339411601),
 ('abkommen', 0.010138687569399864),
 ('atomwaffen', 0.008668697396711857),
 ('embargos', 0.006550777015760913)]
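
Numeric topic ids are hard to read, so it can help to turn the `(word, score)` pairs returned by `get_topic` into a short label. The following is a minimal sketch: `topic_label` is a hypothetical helper, and the sample data is the first four terms of the output shown above.

```python
def topic_label(words_with_scores, n_words=3):
    """Join the `n_words` highest-scoring terms of a topic into a short label."""
    ranked = sorted(words_with_scores, key=lambda pair: pair[1], reverse=True)
    return ", ".join(word for word, _ in ranked[:n_words])


# First four terms of topic_model.get_topic(370), copied from the output above.
topic_370 = [
    ("iran", 0.045999810020958744),
    ("atomabkommen", 0.021954321831109038),
    ("sicherheitsrat", 0.02192809918196278),
    ("sanktionen", 0.020971347590174532),
]

print(topic_label(topic_370))  # iran, atomabkommen, sicherheitsrat

# Hypothetical usage with the search results from above:
# for topic_id, score in zip(similar_topics, similarity):
#     print(topic_id, round(score, 3), topic_label(topic_model.get_topic(topic_id)))
```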