# Topic Modelling of 2021 Kaggle Notebook Titles

## Using BERTopic

**Install the library `bertopic`**

* A `numpy` version conflict came up during installation, hence installing with `--no-build-isolation` and `--no-binary :all:`

In [1]:
!pip install bertopic --no-build-isolation --no-binary :all: 

Collecting bertopic
  Downloading bertopic-0.9.4.tar.gz (47 kB)
  Preparing metadata (setup.py) ... done
Collecting numpy>=1.20.0
  Downloading numpy-1.21.5.zip (10.7 MB)
  Preparing metadata (pyproject.toml) ... done
Collecting hdbscan>=0.8.27
  Downloading hdbscan-0.8.27.tar.gz (6.4 MB)
  Preparing metadata (pyproject.toml) ... done
Collecting sentence-transformers>=0.4.1
  Downloading sentence-transformers-2.1.0.tar.gz (78 kB)
  Preparing metadata (setup.py) ... done
Collecting pyyaml<6.0
  Downloading PyYAML-5.4.1.tar.gz (175 kB)
  Preparing m

### Import BERTopic

In [2]:
from bertopic import BERTopic

* `pandas` for reading the input CSV
* `datetime` for filtering the data to 2021

In [3]:
import pandas as pd
import datetime


### Read the Kernels Data from Meta Kaggle

In [4]:
kernels = pd.read_csv("../input/meta-kaggle/Kernels.csv")

This table has a `CurrentUrlSlug` column, which serves as a proxy for the notebook title, so we'll use it as the input for **Topic Modelling**.
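Because slugs are hyphen-separated, one optional preprocessing step is to turn them back into space-separated text before modelling. The sketch below is an assumption for illustration only, not something this notebook does (it passes the slugs through unchanged); `slug_to_text` is a hypothetical helper, and the sample slugs are taken from the first rows of the table.

```python
def slug_to_text(slug: str) -> str:
    """Turn a URL slug such as 'rf-proximity' into space-separated text."""
    return slug.replace("-", " ").strip()


# Slugs copied from the first rows of Kernels.csv.
sample_slugs = ["hello", "rf-proximity", "are-icons-missing"]
sample_titles = [slug_to_text(s) for s in sample_slugs]

print(sample_titles)
```

Whether this helps depends on the tokenizer in use, so it is worth comparing the resulting topics with and without the step.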

In [5]:
kernels.head()

| | Id | AuthorUserId | CurrentKernelVersionId | ForkParentKernelVersionId | ForumTopicId | FirstKernelVersionId | CreationDate | EvaluationDate | MadePublicDate | IsProjectLanguageTemplate | CurrentUrlSlug | Medal | MedalAwardDate | TotalViews | TotalComments | TotalVotes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2505 | 205.0 | | | 1.0 | 03/25/2015 18:25:32 | 03/23/2018 | 03/25/2015 | False | hello | | | 165 | 0 | 0 |
| 1 | 2 | 3716 | 1748.0 | | 26670.0 | 2.0 | 03/25/2015 18:31:07 | 04/16/2015 | 03/25/2015 | False | rf-proximity | 3.0 | 07/15/2016 | 8649 | 1 | 13 |
| 2 | 4 | 3716 | 41.0 | | | 9.0 | 03/25/2015 21:57:36 | 03/23/2018 | 03/25/2015 | False | r-version | | | 107 | 0 | 0 |
| 3 | 5 | 28963 | 19.0 | | | 13.0 | 03/25/2015 22:01:04 | 03/23/2018 | 03/25/2015 | False | test1 | | | 105 | 0 | 0 |
| 4 | 6 | 3716 | 21.0 | | | 15.0 | 03/25/2015 22:19:00 | 03/23/2018 | 03/25/2015 | False | are-icons-missing | | | 110 | 0 | 0 |


Filter the DataFrame to notebooks created in `2021`

In [6]:
creation_date = pd.to_datetime(kernels['CreationDate'])  # parse the dates once
kernels_2021 = kernels[(creation_date >= pd.Timestamp(2021, 1, 1)) & (creation_date < pd.Timestamp(2022, 1, 1))]
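One detail worth checking in a date filter like this is the upper bound: `pd.Timestamp(2021, 12, 31)` is midnight at the start of 31 December, so an inclusive `<=` comparison against it leaves out anything created later that day, whereas a half-open range ending at 1 January 2022 keeps it. A minimal sketch using the standard-library `datetime`, which orders timestamps the same way; the sample creation time is made up, and the format string is inferred from the `CreationDate` values shown above.

```python
from datetime import datetime

# Hypothetical creation time late on the last day of the year.
created = datetime.strptime("12/31/2021 18:25:32", "%m/%d/%Y %H:%M:%S")

start = datetime(2021, 1, 1)
last_day_midnight = datetime(2021, 12, 31)  # 2021-12-31 00:00:00
next_year = datetime(2022, 1, 1)

# Inclusive upper bound at midnight: misses the notebook.
in_closed_range = start <= created <= last_day_midnight

# Half-open range [2021-01-01, 2022-01-01): includes it.
in_half_open_range = start <= created < next_year

print(in_closed_range, in_half_open_range)
```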

* `ForkParentKernelVersionId` indicates if a Notebook is forked or original. It's possible that some forked notebooks could be very different from the original ones, but for the sake of simplicity we'll **exclude all forked notebooks**.

* We'll use `TotalViews` to exclude notebooks that had 0 views

In [7]:
# Parenthesise the comparison: `&` binds more tightly than `>`
kernel_2021_nofork_gt_0views = kernels_2021[kernels_2021.ForkParentKernelVersionId.isnull() & (kernels_2021.TotalViews > 0)]
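When combining conditions like these, operator precedence matters: in Python, `&` binds more tightly than `>`, so `a & b > 0` is evaluated as `(a & b) > 0` unless the comparison is parenthesised. The plain-Python sketch below illustrates this on made-up `(is_fork, total_views)` pairs; pandas Series expressions are parsed with the same precedence, so the same parentheses are needed there.

```python
# Made-up rows for illustration: (is_fork, total_views).
rows = [(False, 165), (True, 50), (False, 0), (False, 110)]

# Without parentheses: ((not is_fork) & views) > 0,
# i.e. a bitwise AND with the view count, then a comparison.
unparenthesised = [(not is_fork) & views > 0 for is_fork, views in rows]

# With parentheses: (not is_fork) AND (views > 0), the intended filter.
parenthesised = [(not is_fork) & (views > 0) for is_fork, views in rows]

print(unparenthesised)
print(parenthesised)
```

The last row (an original notebook with 110 views) is dropped by the unparenthesised version because `True & 110` evaluates to `0`.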

In [8]:
kernels_2021.shape

(153790, 16)

In [9]:
kaggle_notebook_titles_2021 = list(kernel_2021_nofork_gt_0views.CurrentUrlSlug)

In [10]:
topic_model = BERTopic(verbose=True, embedding_model="paraphrase-MiniLM-L12-v2", min_topic_size=50, calculate_probabilities=True)

In [11]:
topics, probs = topic_model.fit_transform(kaggle_notebook_titles_2021)

2021-12-26 09:40:59,207 - BERTopic - Transformed documents to Embeddings


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


2021-12-26 09:42:11,372 - BERTopic - Reduced dimensionality with UMAP


2021-12-26 09:45:31,002 - BERTopic - Clustered UMAP embeddings with HDBSCAN
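The `huggingface/tokenizers` message in the log above is a warning, not an error, and it names its own remedy: set the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch, assuming it runs in a cell before the tokenizers are first used (for example, ahead of the `bertopic` import); choosing `false` here is an assumption that favours quiet logs over tokenizer parallelism.

```python
import os

# Set this before any tokenizer is used so forked worker processes do not warn.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```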


In [12]:
freq = topic_model.get_topic_info()

In [13]:
freq.head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 14684 | -1_price_credit_eda_housing |
| 1 | 0 | 4397 | 0_notebook_aliya_notebook726d228659_ds2 |
| 2 | 1 | 1932 | 1_starter_riiid_dataset_csv |
| 3 | 2 | 1297 | 2_de_tugas_muhammad_praktikum |
| 4 | 3 | 876 | 3_text_word_word2vec_topic |
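In BERTopic, topic `-1` collects documents that were not assigned to any cluster (outliers), so it is usually set aside when ranking topics. A minimal sketch, assuming a small DataFrame rebuilt by hand from the five rows shown above (the name `freq_head` is only for illustration; in the notebook the same filter would apply to `freq`).

```python
import pandas as pd

# Rebuilt by hand from the five rows of get_topic_info() shown above.
freq_head = pd.DataFrame({
    "Topic": [-1, 0, 1, 2, 3],
    "Count": [14684, 4397, 1932, 1297, 876],
    "Name": [
        "-1_price_credit_eda_housing",
        "0_notebook_aliya_notebook726d228659_ds2",
        "1_starter_riiid_dataset_csv",
        "2_de_tugas_muhammad_praktikum",
        "3_text_word_word2vec_topic",
    ],
})

# Drop the outlier topic, then pick the largest remaining topic.
real_topics = freq_head[freq_head["Topic"] != -1]
largest = real_topics.sort_values("Count", ascending=False).iloc[0]

print(largest["Topic"], largest["Name"])
```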


In [14]:
topic_model.get_topic(0)  # Topic 0 is the largest topic after the -1 outlier group

[('notebook', 0.0044821513312495835),
 ('aliya', 0.0026656647653530174),
 ('notebook726d228659', 0.0026656647653530174),
 ('ds2', 0.0026656647653530174),
 ('tablet', 0.002360758022005322),
 ('notebooka7a58e68d6', 0.0014855480248250018),
 ('notebooka664f3e132', 0.0014855480248250018),
 ('notebooka7b0a2d98a', 0.0014855480248250018),
 ('notebooka8542ac194', 0.0014855480248250018),
 ('notebooka841f88b4d', 0.0014855480248250018)]

In [15]:
topic_model.visualize_topics()

In [16]:
topic_model.visualize_barchart()
