# Semantic Tagger (English)

In this notebook, you will use the [Python Multilingual Ucrel Semantic Analysis System (PyMUSAS)](https://ucrel.github.io/pymusas/) to tag your text data so that you can extract token-level semantic tags from the tagged text. PyMUSAS is a rule-based semantic tagger. It can support any semantic tagset; however, the currently released tagset is for the [Ucrel Semantic Analysis System (USAS)](https://ucrel.lancs.ac.uk/usas/) semantic tags.

In addition to the USAS tags, you will also see the lemmas and Part-of-Speech (POS) tags in the text.

**Note:** This code has been adapted from the [PyMUSAS GitHub page](https://github.com/UCREL/pymusas) and modified to run on a Jupyter Notebook. PyMUSAS is an open-source project that has been created and funded by the [University Centre for Computer Corpus Research on Language (UCREL)](https://ucrel.lancs.ac.uk/) at [Lancaster University](https://www.lancaster.ac.uk/). For more information about PyMUSAS, please visit [the Usage Guides page](https://ucrel.github.io/pymusas/).

<div class="alert alert-block alert-warning">
<b>User guide to using a Jupyter Notebook</b> 

If you are new to Jupyter Notebook, feel free to take a quick look at [this user guide](https://github.com/Australian-Text-Analytics-Platform/semantic-tagger/blob/main/documents/jupyter-notebook-guide.pdf) for basic information on how to use a notebook.
</div>

## 1. Setup
Before you begin, you need to import the SemanticTagger and the necessary libraries, then initialize them to run in this notebook.

In [1]:
# import the SemanticTagger
from semantic_tagger_en import SemanticTagger, DownloadFileLink

# initialize the SemanticTagger
st = SemanticTagger()

[nltk_data] Downloading package punkt to /Users/sjuf9909/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Loading spaCy language model...
This may take a while...
Finished loading.


## 2. Load the data
This notebook allows you to tag text data in a text file (or a number of text files). Alternatively, you can tag text in a text column of an Excel spreadsheet ([see an example here](https://github.com/Sydney-Informatics-Hub/HASS-29_Quotation_Tool/blob/main/documents/sample_texts.xlsx)).

<table style='margin-left: 10px'><tr>
<td> <img src='./img/txt_icon.png' style='width: 45px'/> </td>
<td> <img src='./img/xlsx_icon.png' style='width: 55px'/> </td>
<td> <img src='./img/csv_icon.png' style='width: 45px'/> </td>
<td> <img src='./img/zip_icon.png' style='width: 45px'/> </td>
</tr></table>

<div class="alert alert-block alert-warning">
<b>Uploading your text files</b> 
    
If you have a large number of text files (more than 10MB in total), we suggest you compress (zip) them and upload the zip file instead. If you need assistance on how to compress your file, please check [the user guide](https://github.com/Australian-Text-Analytics-Platform/semantic-tagger/blob/main/documents/jupyter-notebook-guide.pdf) for more info. 
</div>
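
If you prefer to create the zip file programmatically rather than through your operating system, the following is a minimal sketch using only the Python standard library. The function name `zip_text_files` and the sample file names are hypothetical and are not part of the SemanticTagger tool.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_text_files(folder, zip_path):
    """Compress every .txt file directly inside `folder` into one zip archive."""
    folder = Path(folder)
    txt_files = sorted(folder.glob("*.txt"))
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in txt_files:
            # arcname keeps only the file name, so the archive has no nested folders
            zf.write(path, arcname=path.name)
    return [p.name for p in txt_files]


# Demo with throwaway files; point `folder` at your own directory instead.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "doc_a.txt").write_text("First sample text.", encoding="utf-8")
    (tmp / "doc_b.txt").write_text("Second sample text.", encoding="utf-8")
    archived = zip_text_files(tmp, tmp / "texts.zip")
    print(archived)
```

The resulting `texts.zip` can then be uploaded in place of the individual files.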

<div class="alert alert-block alert-danger">
<b>Large file upload</b> 
    
If you have ongoing issues with the file upload, please re-launch the notebook via Binder again. If the issue persists, consider restarting your computer.
</div>

In [2]:
# upload the text files and/or Excel spreadsheets onto the system
print('Uploading large files may take a while. Please be patient.')
display(st.upload_box)

Uploading large files may take a while. Please be patient.


VBox(children=(FileUpload(value={}, accept='.txt, .xlsx, .csv, .zip', description='Upload your files (txt, csv…

In [3]:
# display uploaded text
n = 5  # number of rows to display

st.text_df.head(n)

| | text_name | text | text_id |
|---|---|---|---|
| 0 | Advertiser_2014_09_0016_Television | Television\nFOXTEL HIGHLIGHTS REALITY THE WORL... | 017190ab9b70be8a70525574eeda3956 |
| 1 | Advertiser_2014_06_0034_Bs-newsflash | B+s newsflash\n3.2 mil THAT'S HOW MANY CENTENA... | 201f7dad3f0628777a7f2b30521e1fd4 |
| 2 | Advertiser_2015_11_0029_Mums-fail-to-keep-abre... | Mums fail to keep abreast of formula\nMUMS are... | cb71bfb6279fa90a556f901e73bb998e |
| 3 | Advertiser_2014_03_0002_Bs-newsflash | B+s newsflash\n100 CIGARETTES INCREASES BREAST... | 498779062d97cd10b72367ba62656146 |
| 4 | Advertiser_2016_01_0007_Dont-sweat-it | Don't sweat it\nExcessive perspiration causing... | 613354b3febe250104360294b2fd2c2c |
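
Each uploaded text becomes one row with a name, the text itself, and an identifier. The sketch below shows one way such a row could be built. It assumes that `text_id` is an MD5 hash of the text content, which the 32-character hexadecimal values suggest but the notebook does not confirm; `make_text_row` is a hypothetical helper, not a SemanticTagger method.

```python
import hashlib


def make_text_row(text_name, text):
    """Return one table row as a dict.

    Assumption: text_id is the MD5 hex digest of the UTF-8 encoded text.
    """
    text_id = hashlib.md5(text.encode("utf-8")).hexdigest()
    return {"text_name": text_name, "text": text, "text_id": text_id}


row = make_text_row("sample_text", "Television\nFOXTEL HIGHLIGHTS")
print(row["text_name"], len(row["text_id"]))
```

Under this assumption, uploading the same text twice would yield the same `text_id`, which makes duplicates easy to spot.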


## 3. Add Semantic Tags
Once your texts have been uploaded, you can begin to add semantic tags to the texts and download the results to your computer. 

<div class="alert alert-block alert-info">
<b>Tools:</b>    

- PyMUSAS RuleBasedTagger: for adding USAS token tags.
- spaCy: for adding lemma and POS tags.
</div>
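
To give a feel for what rule-based tagging means, here is a toy sketch of a lexicon lookup: each token's lemma and POS tag are looked up in a small dictionary, and tokens that are not found receive the USAS tag `Z99` (Unmatched). This is an illustration under simplifying assumptions, not the PyMUSAS implementation; the lexicon entries and the `tag_tokens` helper are made up for this example.

```python
# Toy lexicon mapping (lemma, POS) to USAS tags. Entries are illustrative only.
TOY_LEXICON = {
    ("bread", "NOUN"): ["F1"],  # F1: Food
    ("the", "DET"): ["Z5"],     # Z5: Grammatical bin
}
UNMATCHED = ["Z99"]             # Z99: Unmatched


def tag_tokens(tokens):
    """Tag a list of (token, lemma, pos) tuples by lexicon lookup."""
    rows = []
    for token, lemma, pos in tokens:
        # Fall back to the unmatched tag when the lexicon has no entry
        tags = TOY_LEXICON.get((lemma.lower(), pos), UNMATCHED)
        rows.append({"token": token, "lemma": lemma, "pos": pos, "usas_tags": tags})
    return rows


sample = [("The", "the", "DET"), ("breads", "bread", "NOUN"), ("zxq", "zxq", "NOUN")]
for row in tag_tokens(sample):
    print(row)
```

In the real tool, spaCy supplies the lemma and POS tag for each token and PyMUSAS supplies the semantic tags.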

<div class="alert alert-block alert-danger">
<b>Memory limitation in Binder</b> 
    
The free Binder deployment is only guaranteed a maximum of 2GB memory. Processing very large text files may cause the session (kernel) to re-start due to insufficient memory. Check [the user guide](https://github.com/Australian-Text-Analytics-Platform/semantic-tagger/blob/main/documents/jupyter-notebook-guide.pdf) for more info. 
</div>

In [4]:
# add semantic tags to the uploaded texts
print('Processing and adding semantic tags to your texts.')
print('The counter will start soon. Please be patient...')
st.tag_text()

Processing and adding semantic tags to your texts.
The counter will start soon. Please be patient...


100%|█████████████████████████████████████████| 699/699 [01:08<00:00, 10.26it/s]


Once you have tagged the texts, you can display them as a dataframe (table) below. Select the tagged text you wish to display and click the 'Display tagged text' button. You can also filter the table to show only selected POS tags or USAS tags (multiple filter selections are possible).

In [5]:
# display tagged text
st.display_tag_text()


<div class="alert alert-block alert-warning">
<b>What information is included in the above table?</b> 

**token:** each token in the sentence, e.g., word, punctuation, etc.
       
**pos:** part-of-speech tag of the token.
    
**usas_tags:** the Ucrel Semantic Analysis System (USAS) semantic tag of the token.
    
**usas_tags_def:** the definition of the USAS tag of the token.

**lemma:** the lemma of the token.

**token_tag:** the token along with its USAS tag.
</div>

<div class="alert alert-block alert-danger">
<b>Analyse the tagged text</b> 

You can also analyse the tagged texts using the simple visualizations below. To do so, select the text (or 'all texts') and the entity to analyse, then click the 'Show top entities' button. To check the top words in each entity (e.g., the top words for the USAS tag 'Personal names' in the text), select from the drop-down options on the right (multiple selections are possible) and click 'Show top words'. Lastly, you can save the displayed charts by clicking the 'Save analysis' button.
</div>
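
The two kinds of counts described here, top entities and top words within an entity, can be sketched with `collections.Counter`. The rows below are hypothetical sample data, and `top_entities` and `top_words` are illustrative helpers rather than SemanticTagger methods.

```python
from collections import Counter

# Hypothetical tagged rows; a real run would have one row per token.
tagged_rows = [
    {"token": "Alice", "usas_tags_def": "Personal names"},
    {"token": "Bob", "usas_tags_def": "Personal names"},
    {"token": "Alice", "usas_tags_def": "Personal names"},
    {"token": "bread", "usas_tags_def": "Food"},
]


def top_entities(rows, key, n=10):
    """Count how often each value of `key` occurs and return the n most common."""
    return Counter(row[key] for row in rows).most_common(n)


def top_words(rows, key, value, n=10):
    """Return the n most common tokens among rows where `key` equals `value`."""
    return Counter(row["token"] for row in rows if row[key] == value).most_common(n)


print(top_entities(tagged_rows, "usas_tags_def"))
print(top_words(tagged_rows, "usas_tags_def", "Personal names"))
```

The same functions work for POS tags by passing a different `key`, provided the rows contain that column.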



In [6]:
# analyse tagged texts
st.analyse_tags()


## 4. Save tagged texts
Finally, you can run the code below to save the tagged text dataframes to an Excel spreadsheet and download it to your local computer. Note that each tagged text will be saved as an individual sheet (up to 50 texts at a time).

In [7]:
# save tagged texts
st.save_options()

VBox(children=(VBox(children=(HTML(value='<b>Select the tagged texts to save (up to 50 texts at a time):</b>',…
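
Because at most 50 texts can be saved at a time, a larger collection has to be saved in several rounds. As a rough sketch of the arithmetic, the snippet below splits a list of text names into groups of up to 50; the `batches` helper and the generated names are hypothetical, and the count of 699 simply mirrors the number of texts tagged in the run above.

```python
def batches(items, size=50):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


text_names = [f"text_{i}" for i in range(699)]  # placeholder names
groups = batches(text_names)
print(len(groups), len(groups[-1]))
```

For 699 texts this gives 14 rounds, the last one holding 49 texts.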