In this example we will see how to perform topic extraction using MiniSom. The goal is to extract the main topics (represented as a set of words) that occur in a collection of documents.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from minisom import MiniSom
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

The collection of documents that we will work with is the well-known `20newsgroups` dataset, which contains more than 10000 newsgroup posts. We will download the dataset using sklearn and transform the textual documents into a matrix `D`, where each row represents a post, using the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer">TF-IDF representation</a>:

In [6]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
documents = dataset.data

no_features = 1000

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=no_features,
                                   stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
D = tfidf.todense().tolist()
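
To make the TF-IDF weighting concrete, here is a minimal NumPy sketch on a hypothetical three-document corpus. It assumes the smoothed IDF formula and the L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults. The toy documents and all variable names below are illustrative and are not part of the original example.

```python
import numpy as np

# Hypothetical toy corpus (not taken from 20newsgroups), already tokenized.
docs = [["gun", "police", "gun"],
        ["windows", "dos", "file"],
        ["gun", "file"]]

vocab = sorted({w for doc in docs for w in doc})

# Raw term counts: one row per document, one column per word.
counts = np.array([[doc.count(w) for w in vocab] for doc in docs], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # number of documents containing each word
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed IDF (assumed sklearn default)

tfidf_toy = counts * idf
tfidf_toy /= np.linalg.norm(tfidf_toy, axis=1, keepdims=True)  # L2-normalize each row

print(vocab)
print(tfidf_toy.round(3))
```

Words that appear in fewer documents receive a larger IDF, so they weigh more in the rows where they occur. Each row ends up with unit length.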

Now we train a SOM that clusters the documents. The total number of neurons in the SOM is also the number of topics to extract (here 2 × 4 = 8):

In [11]:
n_neurons = 2
m_neurons = 4
som = MiniSom(n_neurons, m_neurons, no_features)
som.random_weights_init(D)
som.train(D, 5000, random_order=False, verbose=True)

 [ 5000 / 5000 ] 100% - 0:00:00 left 
 quantization error: 0.9874395767896555
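
The quantization error reported above is, according to the MiniSom documentation, the average distance between each input sample and its best matching unit (the neuron with the closest weight vector). The following NumPy sketch reproduces that idea on random data. The arrays `toy_weights` and `toy_data` are hypothetical stand-ins for `som.get_weights()` and `D`, and the code is an illustration of the concept, not MiniSom's own implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a 2x4 grid of 6-dimensional weight vectors
# and 20 random "documents" with the same number of features.
toy_weights = rng.random((2, 4, 6))
toy_data = rng.random((20, 6))

# Flatten the grid so that each row is the weight vector of one neuron.
flat_weights = toy_weights.reshape(-1, toy_weights.shape[-1])

# Pairwise Euclidean distances, shape (n_documents, n_neurons).
dists = np.linalg.norm(toy_data[:, None, :] - flat_weights[None, :, :], axis=-1)

# Best matching unit of each document and the mean distance to it.
bmu_index = dists.argmin(axis=1)
quantization_error = dists.min(axis=1).mean()

print("documents per neuron:", np.bincount(bmu_index, minlength=len(flat_weights)))
print("quantization error:", quantization_error)
```

The per-neuron counts show how the documents are split among the clusters. In the real example the same grouping can be obtained by calling `som.winner` on each row of `D`.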


We take as the topic of each neuron the `top_keywords` words associated with its largest weights. The following loop inspects the weights of every neuron and recovers the corresponding words using the feature names saved by the TfidfVectorizer. Because `np.argsort` sorts in ascending order, each topic lists its keywords from the lowest to the highest weight, so the most representative word comes last:

In [12]:
top_keywords = 10

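weights = som.get_weights()  # shape: (n_neurons, m_neurons, no_features)
cnt = 1  # running topic number, one per neuron
for i in range(n_neurons):
    for j in range(m_neurons):
        # indices of the top_keywords largest weights, in ascending order
        keywords_idx = np.argsort(weights[i,j,:])[-top_keywords:]
        # map the feature indices back to the corresponding words
        keywords = ' '.join([tfidf_feature_names[k] for k in keywords_idx])
        print('Topic', cnt, ':', keywords)
        cnt += 1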

Topic 1 : used groups idea month said guns police gun group hell
Topic 2 : cost 12 sound 30 lines total point pc 100 max
Topic 3 : way hi widget thought hockey better running time know times
Topic 4 : computer chips happens control alt escrow apple text chip key
Topic 5 : just offer command manager os chicago files file dos windows
Topic 6 : heard couple people basic apparently just men results ve questions
Topic 7 : cause turkish send address players likely goal lost government team
Topic 8 : action think paul don body bad christians jesus christ god
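
If you prefer to read the keywords from the most to the least important, together with their weights, the selection can be reversed. The sketch below shows the idea on a single hypothetical neuron. The names `toy_vocab` and `toy_neuron` and their values are made up for illustration and stand in for `tfidf_feature_names` and one `weights[i, j, :]` vector.

```python
import numpy as np

# Hypothetical vocabulary and weight vector for a single neuron.
toy_vocab = np.array(["god", "file", "gun", "team", "windows", "key"])
toy_neuron = np.array([0.40, 0.05, 0.10, 0.02, 0.30, 0.20])

top_k = 3

# argsort is ascending: take the last top_k indices, then reverse them
# so that the largest weight comes first.
top_idx = np.argsort(toy_neuron)[-top_k:][::-1]

for word, weight in zip(toy_vocab[top_idx], toy_neuron[top_idx]):
    print(f"{word}: {weight:.2f}")
```

Printing the weights next to the words also makes it easier to judge how strongly each keyword characterizes a topic.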
