<a href="https://atap.edu.au"><img src="https://www.atap.edu.au/atap-logo.png" width="125" height="50" align="right"></a>
# ATAP: TopSBM

*The Australian Text Analytics Platform (ATAP) is an open-source environment that provides researchers with tools and training for analysing, processing, and exploring text. ATAP: TopSBM integrates the TopSBM approach developed by E.G. Altmann et al., which focuses on analysing and exploring your text.*

**This notebook is intended for a non-technical audience.**

---

**TopSBM** is a topic modelling algorithm. [Topic modelling](https://en.wikipedia.org/wiki/Topic_model) finds *topics* within a collection of documents.

A *topic* in topic modelling typically refers to a group of related documents from the collection. Note that, contrary to what you might expect, assigning a word to describe each group is not part of the topic modelling algorithm. (However, this can be done afterwards using a language model, e.g. ChatGPT.)

A *document* refers to the full piece of text and is synonymous with the conventional meaning of the word.




**References**:
1. TopSBM: Topic Models based on Stochastic Block Models - https://topsbm.github.io/
2. ATAP: Australian Text Analytics Platform - https://www.atap.edu.au/

## 1. Upload your dataset

In the Corpus Loader below, select your dataset and build it as a Corpus.

This is the first step in using the TopSBM notebook. Your Corpus should contain a collection of documents, so that *topics* may be inferred by running the TopSBM algorithm.

For detailed instructions on how to use the Corpus Loader, please click <a href="Corpus Loader User Guide.pdf" target="_blank">here</a>.

Please note:
+ You don't have to use `corpus.csv` or `title.csv`. These are sample datasets.
    + If you decide to use them, click on `corpus.csv` and load it as the corpus, then load `title.csv` as metadata. Then link them via their `doc_id` label.
+ You can safely ignore `corpus.txt` and `titles.txt`. These are kept for archiving purposes and are not used in this notebook.
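
If you are curious what linking by `doc_id` means, the sketch below joins two small tables on a shared `doc_id` column using only the Python standard library. The file contents and the column names `text` and `title` are assumptions made for illustration; the real sample files may be laid out differently, and the Corpus Loader performs this step for you.

```python
import csv
import io

# Hypothetical file contents; the real sample files may use different columns.
corpus_csv = io.StringIO("doc_id,text\n1,first document\n2,second document\n")
title_csv = io.StringIO("doc_id,title\n1,Title One\n2,Title Two\n")

# Index the metadata rows by their doc_id.
titles_by_id = {row["doc_id"]: row["title"] for row in csv.DictReader(title_csv)}

# Attach each document's title by looking up its doc_id.
linked = [
    {**row, "title": titles_by_id.get(row["doc_id"])}
    for row in csv.DictReader(corpus_csv)
]

for row in linked:
    print(row)
```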

In [None]:
from atap_corpus_loader import CorpusLoader

loader = CorpusLoader('misc')
loader

In [None]:
corpus = loader.get_latest_corpus()
str(corpus), f"Metas: {', '.join(corpus.metas)}"

## 2. Create a Document-Term Matrix (DTM) of your Corpus

A DTM is a matrix whose rows are documents and whose columns are terms (or words); each cell records how often a term occurs in a document. The DTM is stored as part of your Corpus.

You'll need to construct one as it is used by the TopSBM topic modelling algorithm.
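
To make the idea concrete, here is a minimal sketch that builds a tiny DTM by hand. The two example documents are made up, and the plain lists used here are only an illustration; the Corpus builds and stores its own DTM differently.

```python
from collections import Counter

# Hypothetical toy documents, for illustration only.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Split each document into terms on whitespace.
tokenised = [doc.split() for doc in documents]

# The sorted vocabulary defines the columns of the DTM.
vocabulary = sorted({term for terms in tokenised for term in terms})

# One row per document, one column per term, each cell a term count.
dtm = []
for terms in tokenised:
    counts = Counter(terms)
    dtm.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in dtm:
    print(row)
```

Each printed row corresponds to one document, and each position in the row corresponds to the term at the same position in the vocabulary.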

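To build one, you must first specify how to separate the text in each document into a list of words/terms. Here, we define two ways of doing so: `whitespace`, which keeps every token produced by spaCy, and `no_stopwords`, which additionally drops stopwords.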

You may also have multiple DTMs for different ways of separating the terms.
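
As a rough illustration of why different tokenisers lead to different DTMs, the sketch below applies two simple tokenisers to the same sentence. The function names and the very short stopword list are hypothetical; the notebook itself relies on spaCy's tokeniser and stopword list.

```python
# Hypothetical stopword list, much shorter than spaCy's English list.
STOPWORDS = {"the", "on", "a", "of"}


def whitespace_tokens(text):
    """Split on whitespace and keep every token."""
    return text.split()


def no_stopword_tokens(text):
    """Split on whitespace, then drop tokens found in STOPWORDS."""
    return [token for token in text.split() if token.lower() not in STOPWORDS]


text = "The cat sat on the mat"
print(whitespace_tokens(text))
print(no_stopword_tokens(text))
```

The second list is shorter, so a DTM built from it would have fewer columns.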

In [None]:
import spacy
nlp = spacy.blank('en')
nlp.max_length = 1_500_000  # increase to support long articles up to 1.5m characters
corpus.run_spacy(nlp)

In [None]:
# Set up tokeniser functions
from spacy.matcher import Matcher
from atap_corpus._types import Doc


matcher = Matcher(nlp.vocab)
pattern = [{"IS_STOP": False}]  # Match tokens that are not stopwords
matcher.add("NON_STOPWORDS", [pattern])

doc: Doc  # type hint only, for the lambdas below
tokeniser_fns = {
    # keep every token produced by the spaCy tokeniser
    "whitespace": lambda doc: [t.text for t in doc],
    # keep only the tokens matched as non-stopwords
    "no_stopwords": lambda doc: [doc[start:end].text for _, start, end in matcher(doc)],
}

## 3. TopSBM
Now you have everything you need to run TopSBM!

In the cell below, we pass the model the terms of each document in the Corpus, produced with the `no_stopwords` tokeniser defined earlier, along with the document titles.

`model.make_graph(...)` constructs the graph for the model from these terms and titles.

`model.fit()` then runs the TopSBM algorithm.

Once it finishes running, the square bracket indicator on the left of the cell should change from [*] to [\<number\>], where \<number\> is a placeholder.
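
Conceptually, TopSBM treats the corpus as a network that links documents to the words they contain. The sketch below builds such a document-word edge list in plain Python. It illustrates the idea only and is not how `model.make_graph(...)` is implemented; the titles and terms are hypothetical.

```python
from collections import Counter

# Hypothetical titles and tokenised texts, for illustration only.
titles = ["doc_a", "doc_b"]
texts = [["cat", "sat", "cat"], ["dog", "sat"]]

# Each (document, word) pair becomes an edge; its weight is the number
# of times the word occurs in that document.
edges = Counter()
for title, terms in zip(titles, texts):
    for term in terms:
        edges[(title, term)] += 1

for (title, term), weight in sorted(edges.items()):
    print(f"{title} -- {term} (x{weight})")
```

Words shared by several documents, such as `sat` here, are what connect documents to one another in the network.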

In [None]:
from topsbm.sbmtm import sbmtm
import atap_wrapper as atap
import panel as pn
pn.extension()

spinner = pn.indicators.LoadingSpinner(value=True, name='Fitting model...', color='success')
display(spinner)

model = sbmtm()
model.make_graph(
    atap.to_list_of_terms(corpus, tokeniser_fns['no_stopwords']),
    corpus['title'].tolist(),
)
model.fit()

spinner.value = False
spinner.name = "Fitting complete."

## 4. Visualise Outputs

Now that the model has been fitted to your dataset, you can visualise the outputs.

There are currently two visualisations for the model:

1. The groups (i.e. topics) that have been formed for the words.
2. The groups of documents belonging to the same topics.


### 4a. Topics (groups of documents)

In [None]:
vis_doc = atap.visualise(
    model=model, 
    corpus=corpus, 
    kind='documents',
    hierarchy='radial',
    categories=corpus['category'].tolist() if 'category' in corpus.metas else None,
)  

In [None]:
vis_doc.display(depth=0)

### 4b. Topics (groups of words)

In [None]:
vis_words = atap.visualise(
    model=model, 
    corpus=corpus, 
    kind='words',
    hierarchy='radial',
    top_words_for_level=2,
    top_num_words=10,
)  

In [None]:
vis_words.display(depth=2)

In [None]:
model.print_overview()

In [None]:
model.topics(l=2)

In [None]:
model.topicdist(0, l=2)

In [None]:
model.topicdist_relative(0, l=2)

In [None]:
model.docs_of_topic(l=2)

## Bring your model results to a suite of other ATAP tools!

**First, we'll add the results from TopSBM as metadata to our Corpus.**<br>
This retains the cluster that each document belongs to at each level, which you can reuse in other ATAP notebooks!

In [None]:
atap.add_results(model, corpus)

print("""
Shown below is the Corpus-level metadata, called 'attributes', which records where the added metadata came from.
You have added the following metadata to your Corpus (see under the 'meta' key):
""".strip())
pn.pane.JSON(corpus.attributes, hover_preview=True, depth=-1, theme='light')

Then, **export** the Corpus using the Corpus Loader from before.

In [None]:
print(f"Export the corpus you fitted the model on, i.e. name = '{corpus.name}'")
loader