<a href="https://atap.edu.au"><img src="https://www.atap.edu.au/atap-logo.png" width="125" height="50" align="right"></a>
# ATAP: TopSBM

TopSBM is a topic modelling approach that represents a corpus as a bipartite network of documents and terms and infers a hierarchy of blocks (or clusters) within each of the two node types.

The Australian Text Analytics Platform (ATAP) is an open source environment that provides researchers with tools and training for analysing, processing, and exploring text. ATAP: TopSBM integrates the TopSBM approach developed by E. G. Altmann et al. so that you can use it to analyse and explore your text.

**References**:
1. TopSBM: Topic Models based on Stochastic Block Models - https://topsbm.github.io/
2. ATAP: Australian Text Analytics Platform - https://www.atap.edu.au/
   

## Demo
This notebook demonstrates TopSBM and its integration with `atap_corpus` from ATAP.

It first builds an ATAP Corpus, assigns a 'title' to each document as metadata, and then computes a document-term matrix (DTM). <br>The DTM is the input used to build the TopSBM network, against which the model is fitted. The 'title' metadata is used as the label for each document.

Wrapping the model and corpus with the ATAP wrapper then lets you integrate your results with the ATAP Corpus and, at the end, provides a download link for your Corpus with your results for re-use. You can carry it across to another ATAP tool notebook available on our [website](https://www.atap.edu.au/) for further exploration or analysis.

In [None]:
import pandas as pd

# documents: one document per line (trailing newline stripped)
with open('assets/corpus.txt', 'r', encoding='utf-8') as h:
    lines = [line.rstrip('\n') for line in h]
df = pd.DataFrame(lines, columns=['document'])

# metadata: one title per line, in the same order as the documents
with open('assets/titles.txt', 'r', encoding='utf-8') as h:
    titles = [line.rstrip() for line in h]
df['title'] = titles
df.head(3)

In [None]:
from atap_corpus import Corpus

corpus = Corpus.from_dataframe(df, col_doc='document', name='topsbm')
f"Corpus <name: {corpus.name}   size: {len(corpus)} documents>"

In [None]:
# create your document-term-matrix
dtm_name = 'tokens'                               # name of your DTM
tokeniser_func = lambda doc: doc.split()          # how you define each 'term' in the DTM from each document. (Here, it is whitespace delimited)

corpus.add_dtm_from_docs(tokeniser_func=tokeniser_func, name=dtm_name)
f"Created DTM: {corpus.dtms[dtm_name]}"
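
To make the idea of a document-term matrix concrete, here is a minimal plain-Python sketch that uses the same whitespace tokeniser as the cell above. The three toy documents are invented for illustration, and the code does not use `atap_corpus`; it only shows the kind of table a DTM represents (one row per document, one column per term, each cell a count).

```python
from collections import Counter

# Toy documents, made up for illustration (not taken from assets/corpus.txt).
docs = ["the cat sat", "the dog sat down", "cat and dog"]

# Same whitespace tokeniser as in the cell above.
tokeniser_func = lambda doc: doc.split()

# Vocabulary: every distinct term, sorted, one column per term.
vocab = sorted({term for doc in docs for term in tokeniser_func(doc)})

# One row per document, holding the count of each vocabulary term.
dtm = []
for doc in docs:
    counts = Counter(tokeniser_func(doc))
    dtm.append([counts[term] for term in vocab])

print(vocab)
for row in dtm:
    print(row)
```

Changing `tokeniser_func` (for example, lowercasing before splitting) changes what counts as a 'term' and therefore the columns of the matrix.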

## Alternative: Upload Corpus
Alternatively, you can upload the pre-built `demo_corpus.zip`, which contains exactly the corpus built in the previous cells.

In [None]:
# alternatively: upload demo_corpus.zip - which is the corpus built before.
from atap_corpus.utils import corpus_uploader

finp, corpora = corpus_uploader()
finp

In [None]:
corpus = corpora.items()[-1]   # retrieve last uploaded corpus
dtm_name = list(corpus.dtms.keys())[0] # retrieve the name of the first DTM (only has one)
f"Corpus <name: {corpus.name}   size: {len(corpus)} documents>"

## TopSBM: `make_graph()` and `fit()`

In [None]:
from topsbm.sbmtm import sbmtm

model = sbmtm()
model.make_graph(corpus.dtms[dtm_name].to_lists_of_terms(), corpus['title'].tolist())  # token lists, then document labels
model.fit()
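
As a rough illustration of the bipartite network that the model is fitted against, the sketch below builds a document-word edge list in plain Python. It is a conceptual picture only and not how `topsbm` stores its graph internally; the toy token lists, the titles, and the names `edges`, `doc_nodes` and `word_nodes` are all assumptions made for this example.

```python
from collections import Counter

# Toy inputs in the same shape as the two arguments above:
# one list of tokens per document, plus one label per document.
list_texts = [["cat", "sat", "cat"], ["dog", "sat"]]
titles = ["doc_a", "doc_b"]

# Each edge joins a document node to a word node,
# weighted by how often the word occurs in that document.
edges = []
for title, tokens in zip(titles, list_texts):
    for word, count in sorted(Counter(tokens).items()):
        edges.append((title, word, count))

# The two node types of the bipartite network.
doc_nodes = set(titles)
word_nodes = {word for _, word, _ in edges}

for edge in edges:
    print(edge)
```

No edge ever joins two documents or two words, which is what makes the network bipartite; the fitted model then groups document nodes and word nodes into separate hierarchies of blocks.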

## Visualise the inferred hierarchical blocks

In [None]:
from atap_wrapper import visualise_blocks 

vis_doc, vis_word = visualise_blocks(model, kind='collapsible-tree')

In [None]:
vis_doc

In [None]:
vis_word

## ATAP wrapper: Download 

Wrap your model and corpus with the ATAP wrapper, then call `download()` on the wrapper to get a download link for your Corpus with your results.
This is a custom wrapper for TopSBM only.

In [None]:
from atap_wrapper import wrap

wrapped = wrap(model, corpus, used_dtm=dtm_name) # dtm_name is the name of the dtm you've used earlier to build and fit the model.
wrapped.download()