# Semantic Tagger (English)

In this notebook, you will use the [Python Multilingual Ucrel Semantic Analysis System (PyMUSAS)](https://ucrel.github.io/pymusas/) to tag your text data so that you can extract token-level semantic tags from the tagged text. PyMUSAS is a rule-based token and Multi Word Expression (MWE) semantic tagger. The tagger can support any semantic tagset; however, the currently released tagset is for the [Ucrel Semantic Analysis System (USAS)](https://ucrel.lancs.ac.uk/usas/) semantic tags.

In addition to the USAS tags, you will also see the lemmas and Part-of-Speech (POS) tags in the text. For English, the tagger also identifies and tags Multi Word Expressions (MWEs), i.e., expressions formed by two or more words that behave like a unit, such as 'South Australia'.
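
To make the idea of rule-based token and MWE tagging concrete, here is a minimal sketch in plain Python. It is not how PyMUSAS is implemented: the two lexicons are tiny hypothetical examples, and the function `tag_tokens` is a name made up for this illustration. It only shows the general principle of looking up multi-word entries before single-word entries and falling back to the USAS "unmatched" tag (Z99).

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicons for illustration only; these are NOT the
# lexicons that PyMUSAS ships with.
MWE_LEXICON: Dict[Tuple[str, ...], str] = {
    ("south", "australia"): "Z2",  # Z2: geographical names
}
TOKEN_LEXICON: Dict[str, str] = {
    "she": "Z8",    # Z8: pronouns
    "lives": "H4",  # H4: residence
    "in": "Z5",     # Z5: grammatical bin
}
UNMATCHED_TAG = "Z99"  # Z99: unmatched


def tag_tokens(tokens: List[str]) -> List[Tuple[str, str, Tuple[int, int]]]:
    """Return (token, tag, (start, end)) triples.

    Tokens that belong to the same MWE share the same tag and the same
    (start, end) token span.
    """
    max_len = max(len(key) for key in MWE_LEXICON)
    tagged = []
    i = 0
    while i < len(tokens):
        # Default: single-word lookup, or the unmatched tag.
        match_len = 1
        tag = TOKEN_LEXICON.get(tokens[i].lower(), UNMATCHED_TAG)
        # Prefer the longest multi-word match starting at position i.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in MWE_LEXICON:
                match_len, tag = n, MWE_LEXICON[key]
                break
        for j in range(i, i + match_len):
            tagged.append((tokens[j], tag, (i, i + match_len)))
        i += match_len
    return tagged


result = tag_tokens(["She", "lives", "in", "South", "Australia"])
for token, tag, span in result:
    print(f"{token}\t{tag}\t{span}")
```

In this sketch, 'South' and 'Australia' receive the same tag and the same span because they were matched as one unit, which is the behaviour the MWE description above refers to.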


**Note:** This code has been adapted from the [PyMUSAS GitHub page](https://github.com/UCREL/pymusas) and modified to run in a Jupyter Notebook. PyMUSAS is an open-source project created and funded by the [University Centre for Computer Corpus Research on Language (UCREL)](https://ucrel.lancs.ac.uk/) at [Lancaster University](https://www.lancaster.ac.uk/). For more information about PyMUSAS, please visit [the Usage Guides page](https://ucrel.github.io/pymusas/).

<div class="alert alert-block alert-warning">
<b>User guide to using a Jupyter Notebook</b> 

If you are new to Jupyter Notebook, feel free to take a quick look at [this user guide](https://github.com/Sydney-Informatics-Hub/HASS-29_Quotation_Tool/blob/main/documents/jupyter-notebook-guide.pdf) for basic information on how to use a notebook.
</div>

## 1. Setup
Before you begin, you need to import the SemanticTagger and the necessary libraries, and initialize them to run in this notebook.

In [None]:
# import the SemanticTagger
from semantic_tagger_en import SemanticTagger, DownloadFileLink

# initialize the SemanticTagger
st = SemanticTagger()

## 2. Load the data
This notebook allows you to tag text data stored in one or more text files (.txt). Alternatively, you can tag text stored in a text column of an Excel spreadsheet.  

<table style='margin-left: 10px'><tr>
<td> <img src='./img/txt_icon.png' style='width: 45px'/> </td>
<td> <img src='./img/xlsx_icon.png' style='width: 55px'/> </td>
<td> <img src='./img/csv_icon.png' style='width: 45px'/> </td>
</tr></table>

<div class="alert alert-block alert-warning">
<b>Uploading your text files</b> 
    
Please upload your text files (.txt) below. You can upload multiple files at once.  

<b>Note:</b> If the combined size of your text files is larger than 10MB, we suggest storing them in an Excel spreadsheet ([see an example here](https://github.com/Sydney-Informatics-Hub/HASS-29_Quotation_Tool/blob/main/documents/sample_texts.xlsx)) and uploading the Excel spreadsheet (up to 100MB) instead.
</div>
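
If you are unsure whether your text files exceed the 10MB threshold mentioned above, you can check their combined size on your own computer before uploading. The following is a minimal sketch using only the Python standard library. The folder name `my_texts` and the helper names are hypothetical, and 10MB is taken here to mean 10 × 1024 × 1024 bytes, which is an assumption.

```python
from pathlib import Path
from typing import Iterable

# Assumption: "10MB" is interpreted as 10 * 1024 * 1024 bytes.
SIZE_LIMIT_BYTES = 10 * 1024 * 1024


def combined_size(paths: Iterable[Path]) -> int:
    """Return the total size in bytes of the given files."""
    return sum(p.stat().st_size for p in paths)


def exceeds_limit(paths: Iterable[Path], limit: int = SIZE_LIMIT_BYTES) -> bool:
    """Return True if the files together are larger than the limit."""
    return combined_size(paths) > limit


# 'my_texts' is a hypothetical folder name; replace it with your own folder.
folder = Path("my_texts")
text_files = sorted(folder.glob("*.txt")) if folder.is_dir() else []
total = combined_size(text_files)
print(f"{len(text_files)} file(s), {total / (1024 * 1024):.2f} MB in total")
if total > SIZE_LIMIT_BYTES:
    print("Consider storing the texts in an Excel spreadsheet instead.")
```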

<div class="alert alert-block alert-danger">
<b>Large file upload</b> 
    
If you have ongoing issues with the file upload, please relaunch the notebook via Binder. If the issue persists, consider restarting your computer.
</div>

In [None]:
# upload the text files and/or excel spreadsheets onto the system
print('Uploading large files may take a while. Please be patient.')
display(st.upload_box)

## 3. Add Semantic Tags
Once your texts have been uploaded, you can begin to add semantic tags to the texts and download the results to your computer. 

<div class="alert alert-block alert-info">
<b>Tools:</b>    

- PyMUSAS RuleBasedTagger: for adding USAS token and Multi Word Expression (MWE) semantic tags.
- spaCy: for adding lemma and POS tags.
</div>

<div class="alert alert-block alert-danger">
<b>Memory limitation in Binder</b> 
    
A Binder deployment is only guaranteed up to 2GB of memory. Processing large text files may cause the session (kernel) to restart due to insufficient memory. Check [the user guide](https://github.com/Sydney-Informatics-Hub/HASS-29_Quotation_Tool/blob/main/documents/jupyter-notebook-guide.pdf) for more information. 
</div>
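
One possible way to work with a very large text is to split it into smaller pieces on your own computer and tag the pieces in separate sessions. The sketch below is only an illustration: the function name `split_text` is hypothetical, the chunk size is measured in characters rather than bytes, and whether splitting avoids the memory issue for your data is an assumption you would need to test. It breaks the text on blank lines (paragraph boundaries) where possible, so that sentences and multi-word expressions are less likely to be cut in half.

```python
from typing import List


def split_text(text: str, max_chars: int) -> List[str]:
    """Split text into chunks of at most max_chars characters.

    Chunks are broken at blank lines where possible. A single paragraph
    longer than max_chars is cut at max_chars.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be a positive integer")
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= max_chars:
            # The paragraph still fits into the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Hard-cut a paragraph that is too long to fit in one chunk.
        while len(paragraph) > max_chars:
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        current = paragraph
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
pieces = split_text(sample, max_chars=40)
for number, piece in enumerate(pieces, start=1):
    print(f"--- chunk {number} ({len(piece)} characters) ---")
    print(piece)
```

Each chunk could then be saved as its own .txt file and uploaded separately.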

In [None]:
# specify the file name for saving the output
file_name = 'output.xlsx'

# add semantic tags to the uploaded texts
print('Processing and adding semantic tags to your texts...')
st.tag_text(file_name)

# download the Excel spreadsheet to your computer
print('\nClick below to download:')
display(DownloadFileLink(file_name, file_name))