# Example: Introduction to topsbm

Topic modelling with hierarchical stochastic block models

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from topsbm import TopSBM

## Setup: Load a corpus

1. We have a list of documents; each document contains a list of words.
1. We have a list of document titles (optional).

The example corpus consists of 63 articles from Wikipedia taken from 3 different categories (Experimental Physics, Chemical Physics, and Computational Biology).

We use scikit-learn's [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to turn this text into a feature matrix.
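To make the shape of that feature matrix concrete, here is a minimal standard-library sketch of what a count matrix holds. It is not `CountVectorizer` itself: the toy corpus and the names `toy_docs`, `vocabulary` and `counts` are made up for illustration. It mirrors two behaviours the real vectorizer has by default: text is lowercased, and columns follow the sorted vocabulary. The pattern `r'\S+'` keeps every whitespace-delimited token, whereas the default pattern would drop single-character tokens.

```python
import re
from collections import Counter

# Hypothetical toy corpus: one pre-tokenized document per string.
toy_docs = [
    "nuclear spin polarization nuclear",
    "protein folding protein structure",
]

# Tokenize on runs of non-whitespace, mirroring token_pattern=r'\S+'.
token_pattern = re.compile(r"\S+")
tokenized = [token_pattern.findall(doc.lower()) for doc in toy_docs]

# Columns follow the alphabetically sorted vocabulary.
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# Dense document-word count matrix as nested lists (rows = documents).
counts = []
for toks in tokenized:
    tally = Counter(toks)
    counts.append([tally[word] for word in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```

The real matrix `X` is sparse rather than a nested list, but its rows and columns have the same meaning.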

In [2]:
# Load texts and vectorize
with open('corpus.txt', 'r') as f:
    docs = f.readlines()

vec = CountVectorizer(token_pattern=r'\S+')
X = vec.fit_transform(docs)

# X is now a sparse matrix of (docs, words)

# titles corresponding to docs (first whitespace-delimited token of each line)
with open('titles.txt', 'r') as f:
    title_lines = f.readlines()
titles = [line.split()[0] for line in title_lines]

In [3]:
# view the data for document 0
print(titles[0])
print(docs[0][:100])

Nuclear_Overhauser_effect
 the nuclear overhauser effect noe is the transfer of nuclear spin polarization from one nuclear spi


## Fit the model

Calling `TopSBM.fit_transform` will:
* construct the bipartite graph between documents and words (samples and features)
* perform hierarchical stochastic block model inference on the graph
* return an embedding of the samples at the finest-grained block level
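The first of these steps can be sketched without any graph library. The snippet below is only an illustration under stated assumptions: the count matrix, the node numbering (documents first, then words) and the `edges` list are hypothetical, and it does not reproduce how `TopSBM` stores its graph. Whether repeated occurrences become parallel edges or an edge weight is a modelling choice; here the count is simply kept alongside each edge.

```python
# Hypothetical count matrix: rows are documents, columns are words.
vocabulary = ["folding", "nuclear", "protein", "spin"]
counts = [
    [0, 2, 0, 1],  # document 0
    [1, 0, 2, 0],  # document 1
]

n_docs = len(counts)

# Node ids (an illustrative choice): documents are 0 .. n_docs-1,
# words are n_docs .. n_docs+len(vocabulary)-1.
edges = []
for d, row in enumerate(counts):
    for w, count in enumerate(row):
        if count > 0:
            # One (document, word, count) entry per non-zero matrix cell.
            edges.append((d, n_docs + w, count))

for d, w, count in edges:
    print(f"doc {d} -- {vocabulary[w - n_docs]!r} (x{count})")
```

Every edge joins a document-node to a word-node and never two nodes of the same kind, which is what makes the graph bipartite.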

In [4]:
model = TopSBM(weighted_edges=False, random_state=8)
Xt = model.fit_transform(X)

## Plotting the graph and block structure

The following plot shows the (hierarchical) community structure in the word-document network as inferred by the stochastic block model:

* document-nodes are on the left
* word-nodes are on the right
* different colors correspond to the different groups

The result is a grouping of nodes into groups on multiple levels in the hierarchy:

* on the uppermost level, all nodes belong to the same group (square in the middle)
* on the next-lower level, the network splits into two groups: the document-nodes and the word-nodes (blue squares to the left and right, respectively). This structure is trivial: it follows from the bipartite character of the network.
* only the levels below that show non-trivial structure: nodes are further divided into smaller groups (document-nodes into document-groups on the left and word-nodes into word-groups on the right)

In the code, the lowest (finest) level is level 0; coarser levels are numbered 1, 2, ...
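The level numbering can be illustrated with a toy nested partition. Everything here is hypothetical (`level0`, `parent_of` and `labels_at` are not part of topsbm); the sketch only shows how finest-level labels are projected onto coarser levels by following group-to-parent mappings.

```python
# Hypothetical nested partition of six nodes.
# level0[i] is the group of node i at the finest level (level 0).
level0 = [0, 0, 1, 1, 2, 3]

# parent_of[l][g] maps group g at level l to its group at level l + 1.
parent_of = [
    {0: 0, 1: 0, 2: 1, 3: 1},  # level 0 -> level 1
    {0: 0, 1: 0},              # level 1 -> level 2
]


def labels_at(level):
    """Return the group label of every node at the requested level."""
    labels = list(level0)
    for mapping in parent_of[:level]:
        labels = [mapping[g] for g in labels]
    return labels


for level in range(len(parent_of) + 1):
    print(level, labels_at(level))
```

Moving up one level can only merge groups, never split them, so the number of distinct labels shrinks until a single group remains.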

In [19]:
model.plot_graph(n_edges=1000)

## Topics

For each word-group on a given level of the hierarchy, we retrieve its $n$ most probable words -- these are the topics!
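As a minimal sketch of that lookup, assume a table with one row per word and one column per topic, where each column is a distribution over words. The `words` list, the numbers and the helper `top_words` are all hypothetical; with pandas, `Series.nlargest` does the same job.

```python
# Hypothetical p(word | topic) table: rows = words, columns = topics.
words = ["spin", "nuclear", "protein", "gene", "field"]
p_w_tw = [
    [0.5, 0.0],
    [0.3, 0.0],
    [0.0, 0.6],
    [0.0, 0.4],
    [0.2, 0.0],
]


def top_words(topic, n):
    """Return the n (word, probability) pairs with the largest probability."""
    column = [row[topic] for row in p_w_tw]
    ranked = sorted(zip(words, column), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


for topic in range(len(p_w_tw[0])):
    print(topic, top_words(topic, 2))
```

The lookup is only meaningful if row `i` of the table really corresponds to word `i` of the vocabulary, so it is worth confirming that the two have the same length and ordering.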

<span style="font-size: large; color: red; font-weight: bold">Something looks broken here! Are we indexing correctly? Is the graph constructed correctly? Is the inference broken? There is an alarmingly high alphabetic/order correlation in the topic assignment for some topics.</span>

In [33]:
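# rows are words, columns are level-1 topics
# note: scikit-learn >= 1.0 provides vec.get_feature_names_out();
# get_feature_names() was removed in scikit-learn 1.2
topics = pd.DataFrame(model.groups_[1]['p_w_tw'],
                      index=vec.get_feature_names())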

In [49]:
for topic in topics.columns:
    print(topics[topic].nlargest(10))
    print()

anisotropic    0.180791
analyzing      0.102034
an             0.057401
along          0.055311
anderson       0.045819
attained       0.022542
aligned        0.018249
acoustic       0.017458
american       0.016158
algorithms     0.015650
Name: 0, dtype: float64

bioinformaticians    0.070905
assignment           0.033289
before               0.020306
analyzed             0.015646
exclusion            0.015313
america              0.013648
biodatomics          0.012650
air                  0.011651
encode               0.011651
angles               0.011318
Name: 1, dtype: float64

transcriptomes    0.090444
packages          0.079067
linpack           0.072241
oxidation         0.026735
ozone             0.026166
simultaneously    0.023891
image             0.023322
giving            0.022184
showing           0.022184
abnormal          0.019340
Name: 2, dtype: float64

assembled       0.051798
alternative     0.041600
detail          0.023349
decreases       0.022276
addressed      

## Topic-distribution in each document

Which level-1 topics contribute to each document?

In [38]:
pd.DataFrame(model.groups_[1]['p_tw_d'],
             columns=titles)

topic,Nuclear_Overhauser_effect,Quantum_solvent,Rovibrational_coupling,Effective_field_theory,Chemical_physics,Rotational_transition,Dynamic_nuclear_polarisation,Knight_shift,Polarizability,Anisotropic_liquid,...,Louis_and_Beatrice_Laufer_Center_for_Physical_and_Quantitative_Biology,Law_of_Maximum,Enzyme_Function_Initiative,SnoRNA_prediction_software,Sepp_Hochreiter,Aureus_Sciences,IEEE/ACM_Transactions_on_Computational_Biology_and_Bioinformatics,Knotted_protein,BioUML,De_novo_transcriptome_assembly
0,0.460145,0.538136,0.5,0.530474,0.478723,0.478992,0.434272,0.481481,0.47817,0.514286,...,0.357143,0.531469,0.362044,0.517241,0.391371,0.348485,0.340909,0.451087,0.409978,0.432133
1,0.268116,0.161017,0.268018,0.076749,0.138298,0.168067,0.330986,0.343915,0.262994,0.028571,...,0.010204,0.0,0.008759,0.017241,0.004622,0.015152,0.0,0.016304,0.0,0.00554
2,0.003623,0.004237,0.0,0.0,0.0,0.0,0.0,0.0,0.00052,0.0,...,0.040816,0.006993,0.128467,0.086207,0.043143,0.121212,0.022727,0.070652,0.078091,0.114497
3,0.07971,0.067797,0.02027,0.038375,0.053191,0.033613,0.045775,0.031746,0.027547,0.085714,...,0.22449,0.076923,0.268613,0.258621,0.331279,0.227273,0.318182,0.255435,0.253796,0.278855
4,0.032609,0.033898,0.011261,0.027088,0.021277,0.016807,0.018779,0.015873,0.012994,0.114286,...,0.030612,0.027972,0.011679,0.017241,0.016949,0.015152,0.0,0.016304,0.017354,0.014774
5,0.061594,0.076271,0.051802,0.040632,0.074468,0.02521,0.056338,0.026455,0.054574,0.028571,...,0.142857,0.062937,0.134307,0.086207,0.098613,0.121212,0.113636,0.119565,0.151844,0.105263
6,0.083333,0.101695,0.119369,0.232506,0.085106,0.151261,0.099765,0.095238,0.138773,0.171429,...,0.030612,0.048951,0.024818,0.017241,0.053929,0.030303,0.0,0.027174,0.010846,0.014774
7,0.01087,0.016949,0.029279,0.051919,0.138298,0.12605,0.010563,0.0,0.023389,0.057143,...,0.081633,0.244755,0.007299,0.0,0.009245,0.030303,0.0,0.016304,0.0,0.0
8,0.0,0.0,0.0,0.002257,0.010638,0.0,0.003521,0.005291,0.00104,0.0,...,0.081633,0.0,0.054015,0.0,0.050847,0.090909,0.204545,0.027174,0.078091,0.034164
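A quick way to read this table is to check that each column is a distribution over topics and to find its largest entry. The sketch below copies the first two columns from the output above into a plain dictionary (the name `p_tw_d` is reused here only for illustration) and uses no pandas.

```python
# First two columns of the table above, copied from the printed output.
p_tw_d = {
    "Nuclear_Overhauser_effect": [
        0.460145, 0.268116, 0.003623, 0.07971, 0.032609,
        0.061594, 0.083333, 0.01087, 0.0,
    ],
    "Quantum_solvent": [
        0.538136, 0.161017, 0.004237, 0.067797, 0.033898,
        0.076271, 0.101695, 0.016949, 0.0,
    ],
}

summary = {}
for title, column in p_tw_d.items():
    total = sum(column)
    # Index of the topic with the largest share in this document.
    dominant = max(range(len(column)), key=column.__getitem__)
    summary[title] = (total, dominant)
    print(f"{title}: sum={total:.4f}, dominant topic={dominant}")
```

Both columns sum to 1 up to rounding, and topic 0 carries the largest share in each of the two documents.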


## Extra: Clustering of documents - for free.

The stochastic block model also clusters the documents into groups, so no additional clustering step is needed to obtain this grouping.

For a query article, we can return all articles from the same group.
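The lookup amounts to an argmax over groups followed by a filter. Here is a minimal sketch with made-up data: the titles `doc_a` to `doc_d`, the `membership` scores and the helper `same_group` are hypothetical and stand in for the real group-membership table.

```python
# Hypothetical membership scores: membership[g][i] is the score of
# document i for document-group g.
toy_titles = ["doc_a", "doc_b", "doc_c", "doc_d"]
membership = [
    [1.0, 1.0, 0.0, 0.0],  # group 0
    [0.0, 0.0, 1.0, 1.0],  # group 1
]


def same_group(query):
    """Return all titles whose best-scoring group matches the query's."""
    labels = {}
    for i, title in enumerate(toy_titles):
        # Argmax over groups for document i.
        labels[title] = max(range(len(membership)),
                            key=lambda g: membership[g][i])
    return [t for t in toy_titles if labels[t] == labels[query]]


print(same_group("doc_a"))
print(same_group("doc_d"))
```

The query document is always part of its own result, so the returned list is never empty.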

In [86]:
cluster_labels = pd.DataFrame(model.groups_[1]['p_td_d'],
                              columns=titles).idxmax(axis=0)
cluster_idx = cluster_labels['Rovibrational_coupling']
cluster_labels[cluster_labels == cluster_idx]

Nuclear_Overhauser_effect                        0
Quantum_solvent                                  0
Rovibrational_coupling                           0
Effective_field_theory                           0
Chemical_physics                                 0
Rotational_transition                            0
Dynamic_nuclear_polarisation                     0
Knight_shift                                     0
Polarizability                                   0
Anisotropic_liquid                               0
Rotating_wave_approximation                      0
RRKM_theory                                      0
Molecular_vibration                              0
Electrostatic_deflection_(structural_element)    0
Magic_angle_(EELS)                               0
Reactive_empirical_bond_order                    0
McConnell_equation                               0
Ziff-Gulari-Barshad_model                        0
Empirical_formula                                0
Pauli_effect                   