In this example we will see how to perform topic extraction using MiniSom. The goal is to extract the main topics (represented as a set of words) that occur in a collection of documents.

In [3]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from minisom import MiniSom
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

The collection of documents that we will work with is the well-known `20newsgroups` dataset, which contains more than 10,000 newsgroup posts. We will download the dataset using sklearn and transform the textual documents into a matrix `D`, where each row represents a post, using the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer">TF-IDF representation</a>:

In [9]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
documents = dataset.data

no_features = 1000

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=no_features,
                                   stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
D = tfidf.todense().tolist()
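
To make the content of each row of `D` more concrete, here is a minimal sketch that computes TF-IDF by hand on a tiny, made-up corpus. It assumes the default scikit-learn weighting (raw term counts, smoothed IDF, L2 normalisation) and deliberately skips stop-word removal, the `max_df`/`min_df` filters and scikit-learn's own tokenisation. The names `docs`, `tfidf_row` and `toy_D` are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(tokenized)

# Document frequency: the number of documents in which each word occurs.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

# Smoothed inverse document frequency (scikit-learn's default formula).
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
       for word in vocabulary}


def tfidf_row(tokens):
    """Return the L2-normalised TF-IDF vector of one tokenised document."""
    counts = Counter(tokens)
    raw = [counts[word] * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


# One row per document, one column per vocabulary word.
toy_D = [tfidf_row(tokens) for tokens in tokenized]
```

In the first toy document, "cat" receives a larger weight than "sat" because "sat" also appears in the second document and therefore has a lower IDF.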

Now we have to train a SOM that clusters the documents. The total number of neurons in the SOM will also be the number of topics to extract:

In [15]:
n_neurons = 2
m_neurons = 4
som = MiniSom(n_neurons, m_neurons, no_features)
som.pca_weights_init(D)
som.train(D, 40000, random_order=False, verbose=False)
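
Once the map is trained, each document belongs to the neuron whose weight vector is closest to it (MiniSom offers `som.winner(x)` for this lookup). As a rough sketch of what that assignment does, the snippet below finds the closest neuron with plain NumPy, assuming Euclidean distance and a weight array shaped like the one returned by `som.get_weights()`. The random `toy_weights` and `toy_docs` arrays and the helper `best_matching_unit` are stand-ins invented for the example, not part of MiniSom.

```python
import numpy as np


def best_matching_unit(weights, x):
    """Return the (row, column) of the neuron whose weights are closest to x."""
    # Euclidean distance between x and every neuron; shape (rows, columns).
    distances = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(distances), distances.shape)


# Stand-in data (hypothetical): a 2x4 map over 5 features and 6 "documents".
rng = np.random.default_rng(0)
toy_weights = rng.random((2, 4, 5))
toy_docs = rng.random((6, 5))

# Count how many documents land on each neuron, i.e. the size of each cluster.
cluster_sizes = np.zeros(toy_weights.shape[:2], dtype=int)
for doc in toy_docs:
    cluster_sizes[best_matching_unit(toy_weights, doc)] += 1
print(cluster_sizes)
```

With the real data, the same counts would indicate how many posts fall under each of the eight topics.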

We will consider as a topic the `top_keywords` words associated with the largest weights of each neuron. The following for loop inspects the weights of every neuron and recovers the corresponding words using the feature names saved by the TfidfVectorizer:

In [8]:
top_keywords = 10

weights = som.get_weights()
cnt = 1
for i in range(n_neurons):
    for j in range(m_neurons):
        keywords_idx = np.argsort(weights[i,j,:])[-top_keywords:]
        keywords = ' '.join([tfidf_feature_names[k] for k in keywords_idx])
        print('Topic', cnt, ':', keywords)
        cnt += 1

Topic 1 : steve low reported truth want knowledge shall right people don
Topic 2 : use don just armenians turkey people like os turkish armenia
Topic 3 : used help new buy x11 mail info thanks appreciated advance
Topic 4 : read study event ideas writing religious learn ed religion alt
Topic 5 : learn files point includes board email sound games home pc
Topic 6 : light matter expect final deleted sure administration clinton stuff like
Topic 7 : words generally clinton lots dod machines encryption like money jesus
Topic 8 : report use cards clipper cable air 17 space 19 mode
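
Note that `np.argsort` sorts in ascending order, so each topic above lists its keywords from the weakest to the strongest of the top ten. If the strongest keyword should come first, the index array can be reversed. The helper below is one possible sketch; `strongest_keywords`, `toy_names` and `toy_neuron` are hypothetical names and values used only to illustrate the idea.

```python
import numpy as np


def strongest_keywords(weight_vector, feature_names, k=10):
    """Return the k feature names with the largest weights, strongest first."""
    order = np.argsort(weight_vector)[::-1][:k]
    return [feature_names[idx] for idx in order]


# Stand-in neuron weights and vocabulary (hypothetical values).
toy_names = ['space', 'nasa', 'car', 'engine', 'god']
toy_neuron = np.array([0.40, 0.35, 0.02, 0.01, 0.10])
print(strongest_keywords(toy_neuron, toy_names, k=3))
# prints ['space', 'nasa', 'god']
```

In the loop above, the same effect could be obtained by calling such a helper with `weights[i, j, :]` and `tfidf_feature_names`.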
