# QuotationTool
In this notebook, you will use the *QuotationTool* to extract quotes from a list of texts. In addition to extracting the quotes, the tool also provides information about who the speakers are, the location of the quotes (and the speakers) within the text, the identified named entities, etc., which can be useful for your text analysis.  

**Note:** This code has been adapted (with permission) from the [GenderGapTracker GitHub page](https://github.com/sfu-discourse-lab/GenderGapTracker/tree/master/nlp/english) and modified to run on a Jupyter Notebook. The quotation tool’s accuracy rate is evaluated in [this article](https://doi.org/10.1371/journal.pone.0245533).

<div class="alert alert-block alert-warning">
<b>User guide to using a Jupyter Notebook</b> 

If you are new to Jupyter Notebook, feel free to take a quick look at [this user guide](https://github.com/Australian-Text-Analytics-Platform/quotation-tool/blob/main/documents/jupyter-notebook-guide.pdf) for basic information on how to use a notebook.
</div>

### Quotation Tool User Guide

For instructions on how to use the Quotation Tool, please refer to the [Quotation Tool User Guide](documents/quotation_help_pages.pdf).

## Coreference feature

This branch of the Quotation Tool contains an experimental coreference feature. When a quote is attributed only to a pronoun, this feature resolves the pronoun to the speaker it refers to. Consider the following text as an example:

```text
The Prime Minister said "Inflation is the primary concern of this government." He continued, "It is a pressing issue for every Australian."
```

Without the coreference feature, the second quote would be attributed to the speaker "He". With the coreference feature, the second quote would be attributed to the speaker "The Prime Minister".
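
To make the idea concrete, here is a toy sketch of the effect described above. It is not the coreference model used by this tool: it simply assumes that a pronoun speaker refers to the most recent named speaker, and the function name, the pronoun list and the dictionary layout are all hypothetical.

```python
# Toy illustration only: NOT the coreference model used by the Quotation Tool.
# Assumption: a pronoun speaker refers to the most recent named speaker.
PRONOUNS = {"he", "she", "they", "it"}  # deliberately small, hypothetical list


def resolve_pronoun_speakers(quotes):
    """Return a copy of `quotes` with pronoun speakers replaced by the last named speaker."""
    resolved = []
    last_named = None
    for quote in quotes:
        speaker = quote["speaker"]
        if speaker.lower() in PRONOUNS:
            # Only substitute when a named speaker has already been seen
            if last_named is not None:
                speaker = last_named
        else:
            last_named = speaker
        resolved.append({**quote, "speaker": speaker})
    return resolved


quotes = [
    {"speaker": "The Prime Minister",
     "quote": "Inflation is the primary concern of this government."},
    {"speaker": "He",
     "quote": "It is a pressing issue for every Australian."},
]

for row in resolve_pronoun_speakers(quotes):
    print(row["speaker"], "->", row["quote"])
```

Real coreference resolution is considerably harder than this heuristic (a pronoun may refer to someone other than the previous speaker), which is part of why the feature is resource-intensive.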

**Limitations**

This feature is experimental and has not been merged into the main Quotation Tool because of a memory leak: even a relatively small corpus can consume a large amount of memory. The Binder instances used to ensure the portability of the application have limited memory, and this issue has caused the Quotation Tool to stop working unexpectedly. If coreference is not a key requirement of your analysis, we advise switching to the main Quotation Tool (found on the main branch).

_Recommended solutions_

If the Quotation Tool cannot process the corpus you are trying to analyse, try one of the following:

- Divide the corpus into smaller chunks and process each one individually
- Run the Quotation Tool on your own machine by downloading and running the code in the repository
- Reduce the size of individual documents by extracting the relevant context. This can be done with the [ATAP Context Extractor](https://github.com/Australian-Text-Analytics-Platform/atap-context-extractor)
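
The first suggestion above (processing the corpus in smaller chunks) can be sketched as follows. This is a minimal, generic helper, not part of the Quotation Tool; the function name and the placeholder documents are assumptions for illustration.

```python
def split_into_batches(texts, batch_size):
    """Yield consecutive slices of `texts`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Placeholder documents standing in for your own corpus
corpus = [f"text {i}" for i in range(7)]

for batch in split_into_batches(corpus, batch_size=3):
    # Each batch could be saved as a separate set of files and loaded on its own
    print(len(batch), batch)
```

A suitable batch size depends on the length of your documents and the memory available, so it is worth starting small and increasing it gradually.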

## 1. Setup
Before you begin, you need to import the QuotationTool and the necessary libraries, and initialise the tool so that it can run in this notebook.

In [None]:
# import the QuotationTool
import warnings
from extract_display_quotes import QuotationTool

# initialize the QuotationTool
qt = QuotationTool()

<div class="alert alert-block alert-warning">
<b>Installing Libraries</b> 

The requirements file <b>environment.yml</b> is included with this notebook. Take a look inside to find out which libraries have been installed for this notebook.

</div>

## 2. Load the data
<table style='margin-left: 10px'><tr>
<td> <img src='./img/txt_icon.png' style='width: 45px'/> </td>
<td> <img src='./img/docx_icon.png' style='width: 45px'/> </td>
<td> <img src='./img/xlsx_icon.png' style='width: 55px'/> </td>
<td> <img src='./img/csv_icon.png' style='width: 45px'/> </td>
<td> <img src='./img/zip_icon.png' style='width: 45px'/> </td>
</tr></table>

1. First upload your text data to the corpus_files folder using the file browser on the left.
2. Use the Corpus Loader tool below to load the data into a dataframe for analysis.

Note: when loading tabular data (CSV, ODS, XLSX), the 'filename' column is used to identify rows as texts. If this column is not present, the text names will be automatically generated.

Read the [CorpusLoader User Guide](documents/Corpus%20Loader%20User%20Guide.pdf) for more detailed instructions.

In [None]:
qt.file_uploader

Once your files are uploaded, you can see a preview of the text in a table format (pandas dataframe).  

<div class="alert alert-block alert-info">
<b>Tools:</b>    
    
- nltk: for sentence tokenization
- spaCy: for text cleaning and normalisation
- pandas: for storing and displaying in dataframe (table) format
</div>

<div class="alert alert-block alert-warning">
<b>Specify the number of rows to display</b> 
    
By default, you will preview the first 5 rows of the loaded texts in a pandas dataframe (table) format. However, you can preview more or fewer rows by changing the value of the variable 'n' below.
</div>

In [None]:
# specify the number of rows you wish to display
n=5

# display a preview of the pandas dataframe
qt.text_df.head(n)

## 3. Extract the quotes
Once your texts have been stored in a pandas dataframe, you can begin to extract the quotes from the texts. You can also extract named entities by listing the entity types you wish to include in the *inc_ent* variable below. Extracting quotes from a large number of texts takes time, so please be patient. As a guideline, a corpus with a file size of 54.13 MB (~26,000 newspaper articles in plain text format) can take about 45 minutes to process.

<div class="alert alert-block alert-info">
<b>Tools:</b>    

- quote_extractor: for extracting quotes and speakers
- spaCy: for extracting named entities
    
<b>Note:</b> this tool uses spaCy to tokenize the text, which initially splits the text into tokens based on whitespace characters, and then applies language specific rules to further refine the outcome. For example, the word “don’t” does not contain whitespace, but would be split into two tokens: “do” and “n’t”, whereas “U.K.” would remain as one token. For more information about spaCy tokenizer, please visit [this page](https://spacy.io/usage/linguistic-features#tokenization).
</div>

<div class="alert alert-block alert-warning">
<b>Specify the number of rows to display</b> 
    
By default, you will preview the first 5 rows of the extracted quotes in a pandas dataframe (table) format. However, you can preview more or fewer rows by changing the value of the variable 'n' below.
</div>

<div class="alert alert-block alert-danger">
<b>Memory limitation in Binder</b> 
    
The free Binder deployment is only guaranteed a maximum of 2GB memory. Processing very large text files may cause the session (kernel) to restart due to insufficient memory. Check [the user guide](https://github.com/Sydney-Informatics-Hub/HASS-29_Quotation_Tool/blob/main/documents/jupyter-notebook-guide.pdf) for more information.
</div>
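
Before starting a long extraction, it can help to check how large your corpus is on disk. The sketch below is a rough, optional check and not part of the Quotation Tool: the helper name is hypothetical, and file size is only a loose proxy for the memory actually needed during processing.

```python
from pathlib import Path


def corpus_size_mb(folder):
    """Return the total size of all files under `folder`, in megabytes."""
    total_bytes = sum(
        path.stat().st_size for path in Path(folder).rglob("*") if path.is_file()
    )
    return total_bytes / (1024 * 1024)


# 'corpus_files' is the upload folder mentioned in section 2
folder = Path("corpus_files")
if folder.exists():
    print(f"Corpus size: {corpus_size_mb(folder):.2f} MB")
else:
    print("The corpus_files folder was not found in the current directory")
```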

<div class="alert alert-block alert-warning">
    <b>Available Entities</b> 
    
If you run the code in the cell below, you will see a table of the available entities and their explanations. By modifying the `inc_ent` list in the cell after that, you can choose which entities to include in the analysis. If you do not modify the `inc_ent` variable, the tool will extract the six entities included by default in the cell ('ORG', 'PERSON', 'GPE', 'NORP', 'FAC', 'LOC').
</div>

In [None]:
import spacy
import pandas as pd
import panel as pn
pn.extension()
# list the entity labels known to the NER pipeline and look up their explanations
labels = sorted(qt.nlp.get_pipe('ner').labels)
explain = [spacy.explain(label) for label in labels]
_explain_df = pd.DataFrame(zip(labels, explain), columns=['Entity', 'Explanation'])
pn.widgets.DataFrame(_explain_df, disabled=True, 
                     widths={'Entity':250, 'Explanation': 900}, 
                     width=1200, show_index=False)

In [None]:
# specify the named entities you wish to include below (see table above)
inc_ent = ['ORG','PERSON','GPE','NORP','FAC','LOC']

# specify the number of rows you wish to display
n=5

# extract quotes from the text and preview them in a pandas dataframe (table) format
quotes_df = qt.extract_quotes(inc_ent)

# display a preview of the pandas dataframe
quotes_df.head(n)

<div class="alert alert-block alert-warning">
<b>What information is included in the above table?</b> 

In general, the quotes are extracted either based on syntactic or heuristic rules. Some quotes can be stand-alone in a sentence, or followed by another quote (floating quote) from the same speaker. Please refer to [this document](https://doi.org/10.1371/journal.pone.0245533.s001) for further information about the quote extraction process.  
    
**text_id:** the unique ID of the text.
    
**text_name:** the name of the text, i.e., the name of the .txt file or the value in the 'text_name' column of the Excel spreadsheet.
    
**quote_id/speaker_id:** the unique ID of the extracted quote/speaker.
    
**quote/speaker:** the content of the extracted quote and the speaker.
    
**verb:** the verb used to determine the extracted quote.
    
**quote_index/speaker_index/verb_index:** the location of the first and the last characters of the extracted quote/speaker/verb in the text.
    
**quote_entities/speaker_entities:** the entity name and type of the entities identified in the extracted quote/speaker.
    
**quote_token_count:** the length of the extracted quote (in character).
    
**quote_type:** the type of quote based on how it is extracted.
    
**floating_quote:** whether the extracted quote is a floating quote, i.e., a follow up quote from the same speaker (The value TRUE here means that the quote is a floating quote, while FALSE means that the quote is not a floating quote).

**Quotation symbols:** Q (Quotation mark), S (Speaker), V (Verb), C (Content).  

**Named Entities:**  PERSON (People, including fictional), NORP (Nationalities or religious or political groups), FAC (Buildings, airports, highways, etc.), ORG (Companies, agencies, institutions, etc.), GPE (Countries, cities, states), LOC (Non-GPE locations, mountain ranges, bodies of water).
</div>
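
The index columns can be used to locate a quote, speaker or verb in the original text. The sketch below assumes each index is a `(start, end)` pair of character offsets with `end` exclusive, as in Python slicing. That format is an assumption for illustration; check the values in your own results (they may, for example, be stored as strings) before relying on it.

```python
text = 'The Prime Minister said "Inflation is the primary concern of this government."'

# Build an example index for illustration.
# Assumed format: (start, end) character offsets, with `end` exclusive.
quote = '"Inflation is the primary concern of this government."'
start = text.index(quote)
quote_index = (start, start + len(quote))


def span_text(text, index):
    """Return the part of `text` covered by a (start, end) offset pair."""
    span_start, span_end = index
    return text[span_start:span_end]


print(quote_index)
print(span_text(text, quote_index))
```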

## 4. Display the quotes
Once you have extracted the quotes, you can see a preview of the quotes using spaCy's visualisation tool, displaCy. 

<div class="alert alert-block alert-info">
<b>Tools:</b>    

- displaCy: for displaying quotes, speakers and named entities
- ipywidgets: for interactive tool
</div>

<div class="alert alert-block alert-danger">
<b>Select the text and the entities to show</b> 

To preview the extracted information, select the text you wish to analyse and the entities to show. Then click the ***Preview*** button to display them and the ***Save Preview*** button to save them as an HTML file.
</div>

In [None]:
# display a preview of the extracted quotes, speakers and entities within the text
warnings.filterwarnings("ignore")
qt.analyse_quotes(inc_ent)

<div class="alert alert-block alert-danger">
<b>Select the text and the entities to show</b> 

You can also display the top named entities identified in the quotes and/or speakers. Select the text to analyse (an 'all texts' option is also available), whether to display the entities identified in the speakers and/or quotes, whether to display the entity names and/or types, and the number of top entities to display. Then click the ***Show Top Entities*** and ***Save Top Entities*** buttons to display and save them, respectively.
</div>

In [None]:
# check the top named entities identified in the quotes and/or speakers
warnings.filterwarnings("ignore")
qt.analyse_entities(inc_ent)

<div class="alert alert-block alert-warning">
<b>Capitalized words</b> 

Please note that the tool treats differently capitalised forms of a word (e.g. quote, QUOTE, Quote) as different words, so you may see them counted separately in the above analysis.

</div>
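
The effect of capitalisation on counting can be illustrated with a small, self-contained example. It uses made-up words rather than the tool's output, and lowercasing is just one possible way to merge case variants in your own follow-up analysis.

```python
from collections import Counter

# Made-up example words that differ only in capitalisation
words = ["quote", "QUOTE", "Quote", "quote"]

# Case-sensitive counting keeps each spelling separate
case_sensitive = Counter(words)

# Lowercasing first merges the variants into a single entry
case_insensitive = Counter(word.lower() for word in words)

print(case_sensitive)
print(case_insensitive)
```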

## 5. Save the quotes
Finally, you can run the code below to save the quotes pandas dataframe as an Excel spreadsheet and download it to your local computer.


#### 5a. Frequency lists
Along with the quotes pandas dataframe, this will generate the full results as well as frequency lists for most of the categories for your convenience. Below is a list and explanation of frequency lists that are generated as separate sheets.
+ **verb frequencies**: the frequency of the different items identified as reporting expressions (e.g. the verbs *said*, *say*, *told*);
+ **is floating quote**: how many quotes are floating quotes (TRUE) and how many are not (FALSE);
+ **speaker frequencies**: the frequency of the different speakers identified through [GenderGapTracker](https://github.com/sfu-discourse-lab/GenderGapTracker/tree/master/nlp/english)'s quotation extractor;
+ **speaker entity name**: the frequency of speakers that are identified as entities and the names of these entities (e.g. *Dr Sun*), as per spaCy. This is based on the extraction of speakers from [GenderGapTracker](https://github.com/sfu-discourse-lab/GenderGapTracker/tree/master/nlp/english)'s quote extractor, but excludes pronouns and other speakers that are not identified as entities by spaCy.
+ **speaker entity type**: the frequency of the different entity types for speakers;
+ **quote entity name**: the frequency of entities occurring in quotes and their names, as per spaCy;
+ **quote entity type**: the frequency of the different entity types for entities occurring in quotes;
+ **quote type**:  the frequency of the different quotation types from [GenderGapTracker](https://github.com/sfu-discourse-lab/GenderGapTracker/tree/master/nlp/english)'s quote extractor (e.g. SVC, Heuristic).

<div class="alert alert-block alert-warning">
    <b>How frequencies are counted</b>

- For speaker or quote entity names, the following applies: Each extracted quote may have zero, a single or multiple entity names. For multiple entity names, each ***unique*** name is counted ***per quote***. The number of speaker entity names may therefore exceed the number of quotes.
- The frequencies of entity type categories are counted per unique entity type for each extracted quote. I.e., if PERSON appears twice in the same quote (within the same speaker/source), it is counted only once.
- Words are case insensitive (e.g. said and Said are treated as the same verb form).
- Rather than relying on the provided frequency counts and their specific approach to counting frequencies, users can also do their own frequency calculations by using the results provided in the sheet named ‘full’.

</div>
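
The per-quote counting rule described above can be sketched with the standard library. The data below are hypothetical and the layout (a list of name/type pairs per quote) is an assumption for illustration, not the structure of the tool's output.

```python
from collections import Counter

# Hypothetical data: one list of (entity name, entity type) pairs per extracted quote
quote_entities = [
    [("Dr Sun", "PERSON"), ("Dr Sun", "PERSON"), ("Sydney", "GPE")],
    [("Dr Sun", "PERSON"), ("Lee", "PERSON")],
    [],  # a quote with no entities
]

name_counts = Counter()
type_counts = Counter()
for entities in quote_entities:
    # Using sets means each unique name or type is counted once per quote
    name_counts.update({name for name, _ in entities})
    type_counts.update({entity_type for _, entity_type in entities})

print(name_counts)
print(type_counts)
```

In this example the entity name counts sum to four even though there are only three quotes, which is why the number of entity names can exceed the number of quotes.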

In [None]:
print("Click below to download:")
qt.download(output_dir='./output/', file_name='quotes.xlsx')