<a href="https://atap.edu.au"><img src="https://www.atap.edu.au/atap-logo.png" width="125" height="50" align="right"></a>
# ATAP: TopSBM

*Australian Text Analytics Platform (ATAP) is an open source environment that provides researchers with tools and training for analysing, processing, and exploring text. ATAP: TopSBM integrates the TopSBM approach developed by E. G. Altmann et al. so that you can use it to analyse and explore your text.*

---

**TopSBM** is a topic modelling algorithm. [Topic modelling](https://en.wikipedia.org/wiki/Topic_model) finds *topics* within a collection of documents.

A *topic* in topic modelling typically refers to a group of related documents from the collection. Note that, contrary to what you might expect, assigning a word to describe the group is not part of the topic modelling algorithm. (However, this can be done later using a language model, e.g. ChatGPT.)

A *document* refers to a full piece of text, in the conventional sense of the word.




**References**:
1. TopSBM: Topic Models based on Stochastic Block Models - https://topsbm.github.io/
2. ATAP: Australian Text Analytics Platform - https://www.atap.edu.au/

In [1]:
%%html 
<style>table {float:left}</style>

## 1. Upload your dataset using the Corpus Loader

In the Corpus Loader below, select your dataset and build it as a Corpus.

This is the first step in using the TopSBM notebook. Your Corpus should contain a collection of documents, so that *topics* may be inferred by running the TopSBM algorithm.

#### Instructions on using the Corpus Loader
1. Upload your corpus files using the file browser on the left - ensure the files are in the directory "corpus_files". You can simply **drag and drop** your files. Clicking on the folder icon will show you the file explorer pane. Wait until your corpus has uploaded before you return to the notebook.
2. Run the code cell below to make the ATAP Corpus Loader available.
3. Load your files by selecting the files in the selector window and clicking the 'Load as corpus' button. Then select the right datatype label for your file contents. For example, if your file consists of text, the datatype TEXT is appropriate and no changes are necessary. The Corpus Loader also automatically creates and includes filename and filepath as TEXT data.
4. Give your corpus a name (optional) and click on the button “Build corpus”. Wait until you receive the message “Corpus … built successfully”. Review your corpus in the Corpus Overview or continue immediately to the next code cell in the notebook.

For detailed instructions on how to use the Corpus Loader, including uploading your own datasets, please click <a href="Corpus Loader User Guide.pdf" target="_blank">here</a> to open a PDF.


#### Sample datasets
+ There are 4 sample datasets: `corpus.csv`, `wiki.csv`, `constitutions.csv`, `arxiv.csv`.

    
| Dataset             | Description                                       | # documents |
|:--------------------|:--------------------------------------------------|------------:|
| `corpus.csv`        | sourced from [Wikipedia](https://wikipedia.org)   | 63          |
| `wiki.csv`          | sourced from [Wikipedia](https://wikipedia.org)   | 120         |
| `constitutions.csv` | constitutions of various countries                | 189         |
| `arxiv.csv`         | sourced from [arxiv.org](https://arxiv.org)       | 2539        |

In [None]:
from atap_corpus_loader import CorpusLoader

loader = CorpusLoader('datasets')
loader

In [None]:
corpus = loader.get_latest_corpus()
f"Your selected corpus: name={corpus.name} #documents={len(corpus)} meta data: {', '.join(corpus.metas)}"

## 2. Breaking up your documents into words using a whitespace tokeniser

<br>

> **In order for TopSBM to infer topics from your documents, you need to specify how to break your documents up into *'words'*, or *'tokens'* (the technical term).**

<br/>
The most common approach is to use whitespace as the delimiter.

e.g. "A fox jumped over a lazy fox" will be broken up into "A", "fox", "jumped", "over", "a", "lazy", "fox"
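
As a minimal sketch of the idea, plain Python's `str.split()` breaks a string on whitespace. This is only an illustration: the notebook itself uses spaCy's tokeniser, which can behave differently (for example, in how it treats punctuation).

```python
# Illustration only: whitespace tokenisation with plain Python (not spaCy).
sentence = "A fox jumped over a lazy fox"

# With no argument, split() breaks the string on any run of whitespace.
tokens = sentence.split()
print(tokens)

# Lower-casing makes "A" and "a" count as the same word.
lower_tokens = [t.lower() for t in tokens]
print(lower_tokens)
```

Lower-casing is optional; it reduces the vocabulary by merging words that differ only in capitalisation.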


<br/>

The cells below use the default tokeniser from [spaCy](https://spacy.io/usage/spacy-101#annotations-token) (a popular Python NLP library), which is mostly a whitespace tokeniser, to break the documents up into words.

In order to use spaCy, we must first convert the documents in our Corpus into spaCy documents.

In [None]:
import spacy
nlp = spacy.blank('en')
nlp.max_length = 1_500_000  # increase to support long articles up to 1.5m characters
corpus.run_spacy(nlp)

In [None]:
tokeniser_fns = {
    "whitespace": lambda doc: [t.text for t in doc],
    "whitespace_lower_case_only": lambda doc: [t.text.lower() for t in doc],
}

### [Optional] Filtering out stop words

Sometimes you might want to remove words from your documents that do not contribute much to the overall meaning, such as 'the', 'of', 'also', 'am', etc. These words are known as 'stop words'.
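
The effect of stop word filtering can be sketched in plain Python. The stop word set below is a small, hand-picked assumption for illustration; it is not spaCy's list, and `remove_stop_words` is a hypothetical helper, not part of this notebook.

```python
# Hypothetical, hand-picked stop words for illustration (not spaCy's list).
STOP_WORDS = {"the", "of", "also", "am", "a", "over"}


def remove_stop_words(tokens):
    """Keep only tokens whose lower-cased form is not a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = ["A", "fox", "jumped", "over", "a", "lazy", "fox"]
content_tokens = remove_stop_words(tokens)
print(content_tokens)
```

Only the words that carry content remain, which is what the topic model then works with.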

In [None]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
pattern = [{"IS_STOP": False}]  # Match tokens that are not stopwords
matcher.add("NON_STOPWORDS", [pattern])

filters = {
    "no_stopwords": matcher,
}

Run the cell below to show the list of stop words used by spaCy.

In [None]:
import panel as pn
pn.extension()
stop_words: str = ', '.join(sorted(set(nlp.Defaults.stop_words), reverse=False))
pn.widgets.StaticText(name='SpaCy stop words: ', value=stop_words)

## 3. TopSBM
Now you have everything you need to run TopSBM!

The first cell below converts each document in the Corpus into a list of terms, using the lower-case whitespace tokeniser and the stop word filter defined above.

`model.make_graph(...)` constructs the graph for the model from these lists of terms.
`model.fit()` then runs the TopSBM algorithm.
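
Conceptually, the graph is a bipartite word-document network: documents on one side, words on the other, linked whenever a word occurs in a document. The sketch below counts such links with plain Python on toy data. It is an assumption-laden illustration of the idea, not how `topsbm` builds its graph internally, and the documents and titles are made up.

```python
from collections import Counter

# Toy input in the same shape as `list_of_words`: one list of tokens per document.
docs = [
    ["fox", "jumped", "lazy", "fox"],
    ["lazy", "dog", "slept"],
]
titles = ["doc_a", "doc_b"]  # hypothetical document titles

# Each (document, word) pair is an edge; its weight is how often
# the word occurs in that document.
edges = Counter()
for title, tokens in zip(titles, docs):
    for word in tokens:
        edges[(title, word)] += 1

for (title, word), weight in sorted(edges.items()):
    print(f"{title} -- {word} (weight {weight})")
```

Words shared between documents (here, "lazy") are what connect the documents in the network.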

Once it finishes running, the square bracket indicator on the left of the cell should change from [*] to [\<number\>] where \<number\> is a placeholder. 

In [None]:
import atap_wrapper as atap

list_of_words =  atap.to_list_of_terms(
    corpus, 
    tokeniser_fns['whitespace_lower_case_only'], 
    filters['no_stopwords'],
)
titles = corpus['title'].tolist() if 'title' in corpus.metas else None

In [None]:
from topsbm.sbmtm import sbmtm
import panel as pn
pn.extension()

spinner = pn.indicators.LoadingSpinner(value=True, name='Fitting model...', color='success')
display(spinner)

model = sbmtm()
model.make_graph(
    list_of_words,
    titles,
)
model.fit()

spinner.value=False
spinner.name="Fitting complete."

## 4. Visualise Outputs in a Radial Cluster

Now that TopSBM has been fitted to your dataset, you can visualise the outputs.

There are currently 2 visualisations for the model:

1. the document groups, i.e. the documents belonging to the same topics.
2. the word groups (i.e. topics) that have been formed from the words.


### 4a. Topics (groups of documents)

In [None]:
vis_doc = atap.visualise(
    model=model, 
    corpus=corpus, 
    kind='documents',
    hierarchy='radial',
    categories=corpus['category'].tolist() if 'category' in corpus.metas else None,
)  

In [None]:
vis_doc.display(depth=0)

### 4b. Topics (groups of words)

In [None]:
vis_words = atap.visualise(
    model=model, 
    corpus=corpus, 
    kind='words',
    hierarchy='radial',
    top_words_for_level=2,
    top_num_words=10,
)  

In [None]:
vis_words.display(depth=2)

## 5. More technical: Group membership
In the stochastic block model, word nodes and document nodes are clustered into different groups.

The group membership can be represented by the conditional probability $P(\text{group}\, |\, \text{node})$. Since words and documents belong to different groups (the word-document network is bipartite) we can show separately:

- $P(bd\,|\,d)$, the probability that document $d$ belongs to document group $bd$
- $P(bw\,|\,w)$, the probability that word $w$ belongs to word group $bw$.
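
To make the membership probabilities concrete, here is a small sketch with made-up numbers. It assumes the matrix is laid out with one row per group and one column per document, which is the layout implied by the axis labels in the plot that follows; the values themselves are invented.

```python
# Made-up membership matrix P(bd | d):
# rows are document groups, columns are documents.
p_bd_d = [
    [1.0, 0.8, 0.0],  # group 0
    [0.0, 0.2, 1.0],  # group 1
]

n_groups = len(p_bd_d)
n_docs = len(p_bd_d[0])

most_likely = []
for d in range(n_docs):
    column = [p_bd_d[g][d] for g in range(n_groups)]
    # Each column is a probability distribution over groups, so it sums to 1.
    assert abs(sum(column) - 1.0) < 1e-9
    # The most likely group is the row with the highest probability.
    best_group = max(range(n_groups), key=lambda g: column[g])
    most_likely.append(best_group)
    print(f"document {d}: most likely group {best_group}")
```

The same reading applies to $P(bw\,|\,w)$, with words in place of documents.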

In [None]:
%matplotlib inline  
import matplotlib.pyplot as plt  # pyplot is the recommended import; pylab is discouraged

p_td_d,p_tw_w = model.group_membership(l=1)

plt.figure(figsize=(15,4))
plt.subplot(121)
plt.imshow(p_td_d,origin='lower',aspect='auto',interpolation='none')
plt.title(r'Document group membership $P(bd | d)$')
plt.xlabel('Document d (index)')
plt.ylabel('Document group, bd')
plt.colorbar()

plt.subplot(122)
plt.imshow(p_tw_w,origin='lower',aspect='auto',interpolation='none')
plt.title(r'Word group membership $P(bw | w)$')
plt.xlabel('Word w (index)')
plt.ylabel('Word group, bw')
plt.colorbar()

### Relative Topic Distribution
Compare the frequency $f^i_d$ of words from topic $i$ in document $d$ with the expected value across all documents:

$$ \tau_d^i = (f^i_d -\langle f^i \rangle ) / \langle f^i \rangle $$

as in Eq. (10) of Hyland et al.
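
As a worked example of the formula, the sketch below computes $\tau_d^i$ for two documents and two topics. The frequencies are made up, and $\langle f^i \rangle$ is assumed here to be a plain unweighted average over documents; the library may compute the expected value differently.

```python
# Made-up frequencies f[d][i]: share of words in document d that come from topic i.
f = [
    [0.6, 0.4],  # document 0
    [0.2, 0.8],  # document 1
]
n_docs, n_topics = len(f), len(f[0])

# <f^i>: average frequency of topic i across all documents (assumed unweighted).
mean_f = [sum(f[d][i] for d in range(n_docs)) / n_docs for i in range(n_topics)]

# tau[d][i] = (f[d][i] - <f^i>) / <f^i>
tau = [
    [(f[d][i] - mean_f[i]) / mean_f[i] for i in range(n_topics)]
    for d in range(n_docs)
]

for d, row in enumerate(tau):
    print(d, [round(t, 3) for t in row])
```

A positive value means the topic is over-represented in that document relative to the collection; a negative value means it is under-represented.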

In [None]:
model.print_overview()

In [None]:
model.topics(l=2)

In [None]:
print("Document title [relative contribution of each topic]\n")
tau_d=model.topicdist_relative(l=2)

for i in range(len(model.documents)):
    print(model.documents[i],tau_d[i])

In [None]:
model.docs_of_topic(l=2, n=10)

## 6. Export TopSBM's results for your Corpus

> First, we'll add the results from TopSBM as meta data into our Corpus.
> 
> This will retain the cluster that each document belongs to at each of the levels inferred by TopSBM.

Then, export the Corpus using the Corpus Loader from before.

In [None]:
atap.add_results(model, corpus)
print(f"Metadata in your Corpus after adding the results:\n{', '.join(corpus.metas)}\n")

print("""
Below displays Corpus-level metadata called 'attributes' which retains the information on where the added metadata is sourced from.
""".strip())
pn.pane.JSON(corpus.attributes, hover_preview=True, depth=-1, theme='light')

**Export** the corpus using our corpus loader from before.

In [None]:
print(f"Export the corpus you fitted the model on, i.e. name = '{corpus.name}'")
loader

<a href="https://atap.edu.au"><img src="https://www.atap.edu.au/atap-logo.png" width="125" height="50" align="right"></a>
## Bring your Corpus with TopSBM results to other ATAP Tools!

Other ATAP tools use the same familiar Corpus Loader interface, so you can load your exported Corpus there and continue your analysis.

Link to a collection of ATAP tools - https://www.atap.edu.au/tools/