# ATAP Juxtorpus Notebook
This notebook provides a simple workflow for the user to upload, partition and compare their text collections.
The workflow consists of three interactive ATAP tools, [**ATAP_corpus_loader**](https://github.com/Australian-Text-Analytics-Platform/atap-corpus-loader), [**ATAP_corpus_slicer**](https://github.com/Australian-Text-Analytics-Platform/atap-corpus-slicer) and [**Juxtorpus**](https://github.com/Sydney-Informatics-Hub/juxtorpus/). 
The user can upload their text collections as plain text files or spreadsheets, partition the data based on versatile conditions, and generate side-by-side comparisons between corpus pairs.

<div class="alert alert-block alert-warning">
    
For any questions, feedback, and/or comments about the tool, please contact the Sydney Informatics Hub at [sih.info@sydney.edu.au](mailto:sih.info@sydney.edu.au?subject=[ATAP]%20Juxtorpus%20inquiry).</div>

<div class="alert alert-block alert-warning">
<b>Jupyter Notebook User Guide</b> 

If you are new to Jupyter Notebook, feel free to take a quick look at [this user guide](documents/jupyter-notebook-guide.pdf) for basic information on how to use a notebook.
</div>

<div class="alert alert-block alert-info">
<b>ATAP Corpus Loader User Guide</b>
    
For instructions on how to use the Corpus Loader, please refer to the [Corpus Loader User Guide](documents/Corpus_Loader_User_Guide.pdf). The user guides for the Slicer and Juxtorpus are currently under development.
</div>

## 1. Import Python Packages

In [1]:
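# `hv`, `pn` and `visualise_jux` are used in later cells without a separate import,
# so they are expected to come from this wildcard import.
from app.JuxVisual import *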
from atap_corpus_slicer import CorpusSlicer
from atap_corpus_loader import CorpusLoader
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


In [2]:
hv.extension('bokeh')

## 2. Load the data
<div class="alert alert-block alert-info">

This notebook allows you to upload text data as one or more text files, where each file contains one document. Alternatively, you can upload multiple texts as an Excel or CSV spreadsheet, in which each row is treated as one document ([see example dataset here](data/ADO/qldelection2020_candidate_tweets.csv)). Multiple files can be zipped and uploaded as a single archive file.
</div>
<div class="alert alert-block alert-warning">

The example dataset under *data/ADO/* is a Twitter dataset covering all Queensland election candidates during 2020, kindly supplied by the [Australian Digital Observatory](https://www.digitalobservatory.net.au/), an Australian humanities infrastructure and LDaCa partner.
</div>
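
As a rough illustration of the spreadsheet layout described in this section (one row per document), the sketch below writes a small CSV file with Python's standard `csv` module and reads it back. The file name `my_corpus.csv`, the column names `text`, `author` and `date`, and the sample rows are all hypothetical; they are not taken from the example dataset.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical documents: each dict becomes one row, i.e. one document.
rows = [
    {"text": "First example document.", "author": "A", "date": "2020-10-01"},
    {"text": "Second example document.", "author": "B", "date": "2020-10-02"},
]

# A scratch folder is used here; in practice you would save the file
# somewhere you can upload it from.
out_path = Path(tempfile.mkdtemp()) / "my_corpus.csv"  # hypothetical file name
with out_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "author", "date"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm there is one row per document.
with out_path.open(newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(len(loaded), "documents; columns:", list(loaded[0].keys()))
```

In a file like this, `text` would be the column holding the text data, and the remaining columns could be kept as metadata.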

<div class="alert alert-block alert-info">
    
Execute the next cell to run the [*ATAP corpus loader*](https://github.com/Australian-Text-Analytics-Platform/atap-corpus-loader) *UI* so that you can upload your files and build your corpus following the instructions below. Supported file types are .txt, .csv, .xlsx, or a zip archive of these file types.  
Once a corpus is successfully built, you can continue with the rest of the notebook to run Juxtorpus with your corpus.
</div>
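
If your collection consists of many plain text files, bundling them into one zip archive can make the upload easier. The following is a minimal sketch using Python's standard `zipfile` module; the file names `doc_01.txt`, `doc_02.txt` and `my_corpus.zip` and their contents are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path

work_dir = Path(tempfile.mkdtemp())  # scratch folder for this sketch

# Hypothetical plain-text documents: one file per document.
docs = {
    "doc_01.txt": "First example document.",
    "doc_02.txt": "Second example document.",
}
for name, text in docs.items():
    (work_dir / name).write_text(text, encoding="utf-8")

# Bundle the .txt files into a single archive for upload.
archive_path = work_dir / "my_corpus.zip"
with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for txt_file in sorted(work_dir.glob("*.txt")):
        # arcname stores only the file name, not the full scratch-folder path.
        zf.write(txt_file, arcname=txt_file.name)

# List the archive contents to check what will be uploaded.
with zipfile.ZipFile(archive_path) as zf:
    archived = zf.namelist()

print(archived)
```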

<div class="alert alert-block alert-info">
<b>Using the ATAP corpus loader to load your dataset/corpus</b> <br>
<td><img src='./documents/img/corpus_loader.png' style='width: 1000px'/></td> <br>
To use Juxtorpus, you first need to build your corpus with the ATAP Corpus Loader. In brief, the steps are as follows (superscript numbers refer to the locations marked in the picture above):

1. Upload your text files using the file explorer pane on the left<sup>1</sup>. <p>If the pane is not activated, clicking on the folder icon<sup>2</sup> will show you the file explorer pane.
Files can be uploaded into any folder either by drag-and-drop or via the upload button<sup>3</sup>. <br>
Wait until your files have finished uploading before you return to the notebook and execute the code cells.

2. Execute the following code cell to run the ATAP Corpus Loader and build your corpus from the uploaded files. All files of supported types will be displayed and can be filtered<sup>4</sup> in the Corpus Loader.

3. Choose the files in the selector pane<sup>5</sup>, then click the 'Load as corpus' button<sup>6</sup>.  
If you are loading from a spreadsheet with multiple columns, first select the header of the column that contains the text data<sup>7</sup>. Then make sure the required metadata columns are checked<sup>8</sup> and have the correct datatype<sup>9</sup> for your corpus.  
For example, if one column consists of text, the datatype TEXT is appropriate and no changes are necessary.  
If plain text files are loaded, the Corpus Loader also automatically creates and includes the filename as TEXT type metadata.

4. Give your corpus a name<sup>10</sup> and click on the button “Build corpus”<sup>11</sup>. Wait until you receive the message “Corpus … built successfully”.   
Review your corpus in the Corpus Overview or continue immediately to the next code cell in the notebook.  
Refer to the screenshot above for each necessary operation.<br>

</div>

<div class="alert alert-block alert-warning">

The Corpus Loader also supports loading a metadata spreadsheet and joining it with the text data on a column of unique identifiers, or using the OniLoader to build a corpus from a public LDaCa collection. These advanced uses are described in the [Corpus Loader User Guide](documents/Corpus_Loader_User_Guide.pdf).
</div>
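
To make the idea of joining metadata to text data on a unique identifier more concrete, here is a small conceptual sketch in plain Python. It is not the Corpus Loader's implementation; the identifiers, the `id` and `party` column names, and the handling of unmatched rows are assumptions made for illustration only.

```python
# Hypothetical text data, keyed by a unique identifier (e.g. a file name stem).
texts = {
    "doc_01": "First example document.",
    "doc_02": "Second example document.",
}

# Hypothetical metadata rows, as they might appear in a metadata spreadsheet.
metadata_rows = [
    {"id": "doc_01", "party": "X"},
    {"id": "doc_02", "party": "Y"},
    {"id": "doc_03", "party": "Z"},  # no matching text in this sketch
]

# Index the metadata by identifier so each document can look up its row.
metadata_by_id = {row["id"]: row for row in metadata_rows}

joined = []
for doc_id, text in texts.items():
    meta = metadata_by_id.get(doc_id, {})
    extra = {key: value for key, value in meta.items() if key != "id"}
    joined.append({"id": doc_id, "text": text, **extra})

# Metadata rows whose identifier matches no document.
unmatched = sorted(set(metadata_by_id) - set(texts))

print(joined)
print("Unmatched metadata ids:", unmatched)
```

The join only works if the identifier values match exactly in both sources, so it is worth checking for stray whitespace or differing file extensions before uploading.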

In [3]:
corpora: CorpusLoader = CorpusLoader(root_directory='./', 
                                     include_meta_loader=True, 
                                     include_oni_loader=False,
                                     run_logger=True)

slicer: CorpusSlicer = CorpusSlicer(corpus_loader=corpora)
slicer

In [None]:
# List any words you want to exclude from the analysis. If you have none, leave it as excluded_words = []
# Put each word in single or double quotes and separate the words with commas, for example:
# excluded_words = ['these', 'are', 'words', 'I', 'definitely', 'want', 'to', 'exclude']
excluded_words = []
# The line below also excludes scikit-learn's predefined English stopwords, listed here:
# https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/feature_extraction/_stop_words.py
# To exclude them, keep the line active (no "#" in front of it).
# To leave them in the analysis, add a "#" in front of the line.
excluded_words.extend(ENGLISH_STOP_WORDS)
pn.extension('tabulator')
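
If your own word list comes from a spreadsheet or a text file, it may contain duplicates, stray spaces or mixed capitalisation. The sketch below shows one possible way to tidy such a list before combining it with a predefined stopword set. The helper `build_exclusion_list` is hypothetical, the small `builtin_stopwords` set is only a stand-in for `ENGLISH_STOP_WORDS`, and lower-casing the words is an assumption: this notebook does not state whether `visualise_jux` matches words case-sensitively.

```python
def build_exclusion_list(custom, builtin=frozenset()):
    """Return a sorted, de-duplicated list of lower-cased words to exclude."""
    cleaned = {word.strip().lower() for word in custom if word.strip()}
    return sorted(cleaned | set(builtin))


# Hypothetical user-supplied words with a duplicate, a trailing space and an empty entry.
custom_words = ["These", "are ", "words", "words", ""]

# Stand-in for sklearn's ENGLISH_STOP_WORDS, kept tiny for illustration.
builtin_stopwords = frozenset({"the", "and", "are"})

excluded = build_exclusion_list(custom_words, builtin_stopwords)
print(excluded)  # ['and', 'are', 'the', 'these', 'words']
```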

In [None]:
Jux_dashboard = visualise_jux(corpora, fixed_stopwords=excluded_words)
Jux_dashboard.servable()