<a href="https://atap.edu.au"><img src="https://www.atap.edu.au/atap-logo.png" width="125" height="50" align="right"></a>
# ATAP: TopSBM

*The [Australian Text Analytics Platform (ATAP)](https://www.atap.edu.au) is an open source environment that provides researchers with tools and training for analysing, processing, and exploring text. ATAP: TopSBM is an effort to integrate an approach to topic modelling based on stochastic block models developed by Altmann and colleagues (for references and further details see: [https://topsbm.github.io](https://topsbm.github.io))*

---

**TopSBM** is a topic modelling algorithm. [Topic modelling](https://en.wikipedia.org/wiki/Topic_model) finds *topics* within a collection of documents.

Put simply, a *topic* in topic modelling is a group of words that tend to co-occur in documents. Researchers usually give each group a label that describes it, but assigning that label is not part of the topic modelling algorithm. It requires additional research into each group (the ‘topic’) and the kind of information it may characterise.

<div class="alert alert-block alert-warning">
    <span>For any questions, feedback, and/or comments about the tool, please contact the Sydney Informatics Hub at <a href="mailto:sih.info@sydney.edu.au">sih.info@sydney.edu.au</a>.</span>
</div>

<div class="alert alert-block alert-warning">
    <span style="font-weight: bold;">Jupyter Notebook User Guide</span>
    <br>
    <span>
        If you are new to Jupyter Notebook, feel free to take a quick look at <a href="https://github.com/Australian-Text-Analytics-Platform/semantic-tagger/blob/main/documents/jupyter-notebook-guide.pdf">this user guide</a> for basic information on how to use a notebook.
    </span>
</div>

## 1. Upload your dataset using the Corpus Loader

In the Corpus Loader below, select your dataset and build it as a Corpus.

This is the first step in using the TopSBM notebook. Your Corpus should contain a collection of documents, so that *topics* may be inferred by running the TopSBM algorithm.

#### Instructions on using the Corpus Loader
1. Upload your corpus files using the file browser on the left - ensure the files are in the directory "corpus_files". You can simply **drag and drop** your files. Clicking on the folder icon will show you the file explorer pane. Wait until your corpus has uploaded before you return to the notebook.
2. Run the code cell below to make the ATAP Corpus Loader available.
3. Load your files by selecting the files in the selector window and clicking the 'Load as corpus' button. Then select the right datatype label for your file contents. For example, if your file consists of text, the datatype TEXT is appropriate and no changes are necessary. The Corpus Loader also automatically creates and includes filename and filepath as TEXT data.
4. Give your corpus a name (optional) and click on the button “Build corpus”. Wait until you receive the message “Corpus … built successfully”. Review your corpus in the Corpus Overview or continue immediately to the next code cell in the notebook.

For detailed instructions on how to use the Corpus Loader, including uploading your own datasets, please click <a href="Corpus Loader User Guide.pdf" target="_blank">here</a> to open the instructions PDF.


#### Sample datasets
+ There are 4 sample datasets: `corpus.csv`, `wiki.csv`, `constitutions.csv`, `arxiv.csv`.

    
| Dataset | Description | Number of documents |
|:--------|:------------|--------------------:|
| `corpus.csv` | sourced from [Wikipedia](https://wikipedia.org) | 63 |
| `wiki.csv` | sourced from [Wikipedia](https://wikipedia.org) | 120 |
| `constitutions.csv` | constitutions of various countries | 189 |
| `arxiv.csv` | sourced from [arxiv.org](https://arxiv.org) | 2539 |

In [None]:
from atap_corpus_loader import CorpusLoader

loader = CorpusLoader('corpus_files')
loader

Run the cell below to select the corpus you built most recently in the loader for use in this notebook.

In [None]:
corpus = loader.get_latest_corpus()
f"Your selected corpus: name={corpus.name} #documents={len(corpus)} metadata: {', '.join(corpus.metas)}"

## 2. Breaking up your documents into words using a whitespace tokeniser

<br>

> **In order for TopSBM to infer topics from your documents, the documents must first be broken up into *'words'*, or *'tokens'* to use the technical term.**

<br/>
The most common method of tokenisation is to use whitespace as the delimiter.<br><br>

We will be using the default tokeniser from [spaCy](https://spacy.io/usage/spacy-101#annotations-token) (a popular NLP Python library) to break the documents up into words.
It goes beyond plain whitespace splitting, as it also handles nuances common in text data, such as punctuation and contractions.
<br><br>
e.g. "A fox jumped over a lazy fox and it didn't mind at all." will be broken up into "A", "fox", "jumped", "over", "a", "lazy", "fox", "and", "it", "did", "n't", "mind", "at", "all", ".".
<br><br>
Notice that "all." is broken up into "all" and ".", and contractions like "didn't" are broken up into "did" and "n't".
<br><br>
*Below are some more examples:*

| Sentence | Tokens |
|--------------|---------------------------------- |
|Apple is looking at buying U.K. startup for \$1 billion|"Apple", "is", "looking", "at", "buying", "U.K.", "startup", "for", "$", "1", "billion"|
|Autonomous cars shift insurance liability toward manufacturers|"Autonomous", "cars", "shift", "insurance", "liability", "toward", "manufacturers"|
|San Francisco considers banning sidewalk delivery robots|"San", "Francisco", "considers", "banning", "sidewalk", "delivery", "robots"|
|London is a big city in the United Kingdom.|"London", "is", "a", "big", "city", "in", "the", "United", "Kingdom", "."|
|Where are you?|"Where", "are", "you", "?"|
|Who is the president of France?|"Who", "is", "the", "president", "of", "France", "?"|
|What is the capital of the United States?|"What", "is", "the", "capital", "of", "the", "United", "States", "?"|
|When was Barack Obama born?|"When", "was", "Barack", "Obama", "born", "?"|
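
To make the splitting rules above concrete, here is a minimal sketch that uses only Python's `re` module. It is not spaCy's tokeniser: the pattern and the function name `rough_tokenise` are made up for illustration, and it handles only the simple cases (for instance, it would split "U.K." into four tokens, unlike the table above).

```python
import re

# Alternatives are tried in order: the contraction suffix "n't", a word that is
# directly followed by "n't", any other word, and finally a single symbol.
TOKEN_PATTERN = re.compile(r"n't|\w+(?=n't)|\w+|[^\w\s]")

def rough_tokenise(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)

tokens = rough_tokenise("A fox jumped over a lazy fox and it didn't mind at all.")
print(tokens)
```

In the notebook itself the tokens come from spaCy, which applies many more rules and language-specific exceptions than this pattern.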


To use spaCy, we must first convert the documents in our Corpus into spaCy documents.

In [None]:
import spacy
nlp = spacy.blank('en')
nlp.max_length = 1_500_000  # increase to support long articles up to 1.5 million characters
corpus.run_spacy(nlp)

The following tokenisers are defined in the cell below:
1. `whitespace` - documents are broken up into tokens, which keep their original casing.
2. `whitespace_lower_case_only` - documents are broken up into tokens, which are then lower-cased.
3. `lemmas` - documents are broken up into tokens, which are then lemmatised (reduced to their base form).

In [None]:
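tokenisers = {
    "whitespace": lambda doc: [word.text for word in doc],
    "whitespace_lower_case_only": lambda doc: [word.text.lower() for word in doc],
    # Note: lemma_ is only filled in if the spaCy pipeline includes a lemmatiser;
    # a blank pipeline such as spacy.blank('en') may yield empty strings here.
    "lemmas": lambda doc: [word.lemma_ for word in doc],
}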

### [Recommended] Filter out special tokens

You may have exported the text from a source that contains IDs, phone numbers and similar elements. These *special* textual elements can be detrimental to the model's output, so the filter below keeps alphabetic tokens only.

If you believe these textual elements can help to identify the topics in your corpus, you can remove this filter in a later cell in section 3 (TopSBM). The instructions for removing the filter appear just before that cell.
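
As a rough picture of what an alphabetic-only filter does, the sketch below applies Python's `str.isalpha()` to a made-up list of tokens. The helper `keep_alphabetic` and the sample tokens are illustrative assumptions; the notebook itself applies the filter through a spaCy `Matcher`.

```python
def keep_alphabetic(tokens):
    """Keep only tokens that consist entirely of letters."""
    return [token for token in tokens if token.isalpha()]

# Made-up tokens, including an ID-like token, digits and punctuation.
sample_tokens = ["call", "0412", "about", "invoice", "id42", "today", "."]
print(keep_alphabetic(sample_tokens))  # ['call', 'about', 'invoice', 'today']
```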

In [None]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
pattern = [{"IS_ALPHA": True}]  # Match words only
matcher.add("WORDS_ONLY", [pattern])

filters = dict()
filters["no_special_tokens"] = matcher

### [Optional] Filtering out stop words

Sometimes you might want to remove words that are highly frequent and are commonly excluded from topic modelling and other NLP applications, such as 'the', 'of', 'also' and 'am'. These words are known as 'stop words'.

Filters are defined here:
1. `no_stopwords` - stop words are filtered out.
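
The idea behind stop word removal can be sketched in a few lines of plain Python. The tiny stop word set and the helper `drop_stop_words` are assumptions made for this illustration; spaCy's English stop word list is much longer, and the notebook applies it through a `Matcher` rather than a function like this one.

```python
# A tiny, hand-picked stop word set used only for this illustration.
EXAMPLE_STOP_WORDS = {"the", "of", "also", "am", "is"}

def drop_stop_words(tokens, stop_words=EXAMPLE_STOP_WORDS):
    """Remove tokens whose lower-cased form is in the stop word set."""
    return [token for token in tokens if token.lower() not in stop_words]

sample = ["The", "history", "of", "the", "Roman", "Empire", "is", "also", "long"]
print(drop_stop_words(sample))  # ['history', 'Roman', 'Empire', 'long']
```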

In [None]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
pattern = [{"IS_STOP": False}]  # Match tokens that are not stopwords
matcher.add("NON_STOPWORDS", [pattern])

filters["no_stopwords"] = matcher

Run the cell below to show the list of stop words used by spaCy.

In [None]:
import panel as pn
pn.extension()
stop_words: str = ', '.join(sorted(nlp.Defaults.stop_words))  # alphabetical order
pn.widgets.StaticText(name='spaCy stop words: ', value=stop_words)

## 3. TopSBM

The following section runs the TopSBM algorithm on your Corpus.

First, we will extract the list of words from your Corpus using the `tokenisers` and `filters` we defined before.

In [None]:
print(f"Available tokenisers: {', '.join(tokenisers.keys())}")
print(f"Available filters: {', '.join(filters.keys())}")

**To change to another available tokeniser**, edit the code cell below before you run it. <br>
The following example will use the *lemmas* tokeniser.<br/>
```python
list_of_words = atap.to_list_of_words(
    corpus,
    tokenisers['lemmas'],  # <-- see here
    filters['no_special_tokens'],
)
```

**To add a filter**, edit the code cell below before you run it.<br>
The following example adds the *no_stopwords* filter.<br/>
```python
list_of_words = atap.to_list_of_words(
    corpus,
    tokenisers['whitespace_lower_case_only'],
    filters['no_special_tokens'],
    filters['no_stopwords'],  # <-- see here
)
```

**To remove a filter**, delete its line from the code cell below before you run it. The following example keeps only the *no_special_tokens* filter.<br/>
```python
list_of_words = atap.to_list_of_words(
    corpus,
    tokenisers['whitespace_lower_case_only'],
    filters['no_special_tokens'],
)
```

In [None]:
import atap_wrapper as atap

list_of_words = atap.to_list_of_words(
    corpus,
    tokenisers['whitespace_lower_case_only'],
    filters['no_special_tokens'],
)
titles = corpus['title'].tolist() if 'title' in corpus.metas else None

1. `atap.set_seed(42)` ensures reproducible results when the same seed is used again.
2. `model.make_graph(...)` constructs the graph from your list of words. You can optionally provide the 'titles' of your documents.
3. `model.fit()` then runs the TopSBM algorithm.
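
To give a feel for the graph built in step 2, the sketch below counts how often each word occurs in each document, which is one way to picture the edges of a word-document network. The helper `bipartite_edge_counts` and the toy documents are hypothetical; how `make_graph` stores its graph internally is not shown here.

```python
from collections import Counter

def bipartite_edge_counts(list_of_words):
    """Count (document index, word) pairs: one weighted edge per pair."""
    edges = Counter()
    for doc_index, words in enumerate(list_of_words):
        for word in words:
            edges[(doc_index, word)] += 1
    return edges

# Toy input in the same shape as a list of tokenised documents.
toy_documents = [["cat", "sat", "cat"], ["dog", "sat"]]
edges = bipartite_edge_counts(toy_documents)
print(edges[(0, "cat")])  # 2
```

Here the word "sat" is linked to both documents, and shared words of this kind are what allow documents and words to be grouped together.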

Run the cell below to start fitting the model. If you have a large dataset (Corpus), this can take a while, so be patient.

In [None]:
import atap_wrapper as atap
from topsbm.sbmtm import sbmtm
import panel as pn
pn.extension()

atap.set_seed(42)

spinner = pn.indicators.LoadingSpinner(value=True, name='Fitting model...', color='success')
display(spinner)

model = sbmtm()
model.make_graph(
    list_of_words,
    titles,
)
model.fit()

spinner.value = False  # stop the spinner once fitting has finished
spinner.name = "Fitting complete."

## 4. Visualise Outputs in a Radial Cluster

Now that TopSBM has been fitted to your dataset, you can visualise the outputs.

There are currently 2 visualisations for the model. 

1. visualise the document groups belonging to the same topics.
2. visualise the word groups (i.e. topics) that have been formed for the words.


### 4a. Topics (groups of documents)

In [None]:
vis_doc = atap.visualise(
    model=model,   # our topsbm model
    corpus=corpus, # our corpus 
    kind='documents', # visualise documents
    width=1000,  # image width in pixels
    height=1000, # image height in pixels
    hierarchy='radial', # use radial cluster
    categories=corpus['category'].tolist() if 'category' in corpus.metas else None,  # optional - category metadata used for the visualisation
)  

TopSBM is hierarchical: it infers topics, here as document groups, at several ***levels*** of granularity.

A document is assigned to a group at the most granular level, level 0. This document group is then assigned to a group at level 1. This process repeats up to the highest level inferred by the model.

Grouping at several levels is analogous to having categories, sub-categories and sub-sub-categories.

**You can change the `max_level` argument to merge document groups together based on their levels.**
+ `max_level = 0` - show all levels.
+ `max_level = 1` - show up to level 1 (0 being the most granular or lowest level)
+ `max_level = 2` - show up to level 2 (0 being the most granular or lowest level)

An error will be raised if `max_level` is larger than the maximum number of levels inferred by TopSBM.
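
The nesting of groups across levels can be pictured with plain dictionaries. Everything in the sketch below is hypothetical (the document names, the group numbers and the helper `group_at_level`); it is not part of `atap_wrapper` or TopSBM and only illustrates how a level-0 group maps to its parent group at the next level.

```python
# Hypothetical assignments: document -> level-0 group, level-0 group -> level-1 group.
doc_to_level0 = {"doc_a": 0, "doc_b": 0, "doc_c": 1, "doc_d": 2}
parent_maps = [{0: 0, 1: 0, 2: 1}]  # one map per step up the hierarchy

def group_at_level(doc, level, doc_to_level0, parent_maps):
    """Follow a document's group upwards until the requested level."""
    if level > len(parent_maps):
        raise ValueError("level exceeds the number of levels in the hierarchy")
    group = doc_to_level0[doc]
    for parent_of in parent_maps[:level]:
        group = parent_of[group]
    return group

print(group_at_level("doc_c", 0, doc_to_level0, parent_maps))  # 1
print(group_at_level("doc_c", 1, doc_to_level0, parent_maps))  # 0
```

In this toy hierarchy, `doc_a`, `doc_b` and `doc_c` fall into two separate groups at level 0 but share a single group at level 1.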

In [None]:
vis_doc.display(max_level=0)

### 4b. Topics (groups of words)

In [None]:
vis_words = atap.visualise(
    model=model,  # our topsbm model
    corpus=corpus, # our corpus
    width=1000,  # image width in pixels
    height=1000, # image height in pixels
    kind='words',  # visualise words
    hierarchy='radial', # use radial cluster
    top_words_for_level=0, # only select the most probable words of this level
    top_num_words=15, # select the top 'n' most probable words for level specified in top_words_for_level
)  

TopSBM is hierarchical: it infers topics, here as word groups, at several ***levels*** of granularity.

A word is assigned to a group at the most granular level, level 0. This word group is then assigned to a group at level 1. This process repeats up to the highest level inferred by the model.

Grouping at several levels is analogous to having categories, sub-categories and sub-sub-categories.

**You can change the `max_level` argument to merge word groups together based on their levels.**
+ `max_level = 0` - show all levels.
+ `max_level = 1` - show up to level 1 (0 being the outermost or lowest level)
+ `max_level = 2` - show up to level 2 (0 being the outermost or lowest level)

An error will be raised if `max_level` is larger than the maximum number of levels inferred by TopSBM.

In [None]:
vis_words.display(max_level=0)

## 5. More technical: Group membership
*The following sets of analyses are more technical explorations of the topics in your dataset, which provide additional insights.*

<br>
In the stochastic block model, word nodes and document nodes are clustered into separate groups.

The group membership can be represented by the conditional probability $P(\text{group}\, |\, \text{node})$. Since words and documents belong to different groups (the word-document network is bipartite) we can show separately:

- $P(bd | d)$ - the probability that document $d$ belongs to document group $bd$.
- $P(bw | w)$ - the probability that word $w$ belongs to word group $bw$.
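
As a small illustration of such a conditional probability, the sketch below turns a table of counts into $P(\text{group}\, |\, \text{node})$ by normalising each column. The counts and the helper `normalise_columns` are made up for this example, and the sketch assumes every node has at least one count; it does not show how `model.group_membership` computes its values.

```python
def normalise_columns(counts):
    """Convert counts[group][node] into P(group | node), one column per node."""
    n_groups = len(counts)
    n_nodes = len(counts[0])
    column_sums = [sum(counts[g][n] for g in range(n_groups)) for n in range(n_nodes)]
    return [[counts[g][n] / column_sums[n] for n in range(n_nodes)]
            for g in range(n_groups)]

# Hypothetical counts: rows are groups, columns are nodes (documents or words).
counts = [[4, 0, 1],
          [0, 2, 3]]
membership = normalise_columns(counts)
print(membership)  # [[1.0, 0.0, 0.25], [0.0, 1.0, 0.75]]
```

Each column sums to 1, and the layout (groups as rows, nodes as columns) matches the axes of the plots produced in the next cell.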

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# Group membership at hierarchy level 1: P(bd | d) for documents, P(bw | w) for words
p_td_d, p_tw_w = model.group_membership(l=1)

plt.figure(figsize=(15,4))
plt.subplot(121)
plt.imshow(p_td_d,origin='lower',aspect='auto',interpolation='none')
plt.title(r'Document group membership $P(bd | d)$')
plt.xlabel('Document d (index)')
plt.ylabel('Document group, bd')
plt.colorbar()

plt.subplot(122)
plt.imshow(p_tw_w,origin='lower',aspect='auto',interpolation='none')
plt.title(r'Word group membership $P(bw | w)$')
plt.xlabel('Word w (index)')
plt.ylabel('Word group, bw')
plt.colorbar()

### Relative Topic Distribution
Compare the frequency $f^i_d$ of words from topic $i$ in document $d$ with its expected value $\langle f^i \rangle$ across all documents:

$$ \tau_d^i = (f^i_d -\langle f^i \rangle ) / \langle f^i \rangle $$

as in Eq. (10) of [Hyland et al.](https://epjdatascience.springeropen.com/articles/10.1140/epjds/s13688-021-00288-5).
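
The formula can be worked through on a toy example. The sketch below assumes that $\langle f^i \rangle$ is the plain, unweighted mean of $f^i_d$ over documents; this is a simplifying assumption, and the library may average differently (for example, by weighting documents by their length). The frequencies and the helper `relative_topic_distribution` are made up for illustration.

```python
def relative_topic_distribution(f):
    """f[d][i] is the frequency of topic i in document d; returns tau[d][i]."""
    n_docs = len(f)
    n_topics = len(f[0])
    # Assumption: the expected value is the unweighted mean over documents.
    mean_f = [sum(f[d][i] for d in range(n_docs)) / n_docs for i in range(n_topics)]
    return [[(f[d][i] - mean_f[i]) / mean_f[i] for i in range(n_topics)]
            for d in range(n_docs)]

# Two documents, two topics (hypothetical frequencies).
f = [[0.75, 0.25],
     [0.25, 0.75]]
tau = relative_topic_distribution(f)
print(tau)  # [[0.5, -0.5], [-0.5, 0.5]]
```

A positive value means the topic is over-represented in that document relative to the average, and a negative value means it is under-represented.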

In [None]:
model.print_overview()

In [None]:
# Topics (groups of words) at hierarchy level 2
model.topics(l=2)

In [None]:
# Relative Document-Topic Contributions
import pandas as pd
import panel as pn

# tau_d[d][i]: relative contribution of topic i in document d at hierarchy level 2
tau_d = model.topicdist_relative(l=2)
num_topics = len(tau_d[0])

# One row per document: its title followed by one value per topic
rows = [[doc] + list(tau_d[i]) for i, doc in enumerate(model.documents)]
df = pd.DataFrame(rows, columns=["Document"] + [f"topic {i}" for i in range(num_topics)])

print("Document's relative contribution to each topic")
pn.widgets.DataFrame(df, show_index=False, autosize_mode='fit_columns')

In [None]:
# Top contributing documents for each topic.

top_docs=5
model.docs_of_topic(l=2, n=top_docs)

## 6. Export TopSBM's results for your Corpus

> First, we'll add the TopSBM results from your analysis to your corpus as new document-level metadata columns.
> These columns will contain the topics that each document belongs to at each level, as inferred by TopSBM.


> Then, we'll add some additional metadata to your corpus at the corpus level (we use the word 'attributes' to distinguish corpus-level from document-level metadata).<br>
> The attributes describe the names of the new metadata columns added by your TopSBM analysis, as well as the git repository and the specific commit (i.e. a unique identifier of a snapshot of the evolving repository), so that you can return to the exact code you used for your analysis.

Finally, in the last code cell, you can export the Corpus using the Corpus Loader from earlier.

In [None]:
atap.add_results(model, corpus)
print("Added new metadata columns and attributes to your corpus.")

print(f"Your corpus now contains these metadata columns:\n{', '.join(corpus.metas)}\n")

print("Your corpus now contains these attributes:")
pn.pane.JSON(corpus.attributes, hover_preview=True, depth=-1, theme='light')

**Export** the corpus using the Corpus Loader from earlier. You should be at the "Corpus Overview" tab, where you can select the file type to export and click the Export button.

In [None]:
print(f"Export the corpus you fitted the model on, i.e. name = '{corpus.name}'")
loader

<a href="https://atap.edu.au"><img src="https://www.atap.edu.au/atap-logo.png" width="125" height="50" align="right"></a>
## Bring your Corpus with TopSBM results to other ATAP Tools!

If you want to continue your analysis, you can take your corpus together with the associated TopSBM results (as metadata) and run further analyses with other ATAP tools.

These tools will use the same ATAP Corpus Loader interface. You can find a collection of ATAP tools at https://www.atap.edu.au/tools/.