# QuotationTool
In this notebook, you will use the *QuotationTool* to extract quotes from a list of texts. In addition to extracting the quotes, the tool also provides information about who the speakers are, the location of the quotes (and the speakers) within the text, the identified named entities, etc., which can be useful for your text analysis.  

**Note:** This code has been adapted (with permission) from the [GenderGapTracker GitHub page](https://github.com/sfu-discourse-lab/GenderGapTracker/tree/master/nlp/english) and modified to run on a Jupyter Notebook. The quotation tool’s accuracy rate is evaluated in [this article](https://doi.org/10.1371/journal.pone.0245533).

<div class="alert alert-block alert-warning">
<b>User guide to using a Jupyter Notebook</b> 

If you are new to Jupyter Notebook, feel free to take a quick look at [this user guide](https://github.com/Australian-Text-Analytics-Platform/quotation-tool/blob/main/documents/jupyter-notebook-guide.pdf) for basic information on how to use a notebook.
</div>

### Quotation Tool User Guide

For instructions on how to use the Quotation Tool, please refer to the [Quotation Tool User Guide](documents/quotation_help_pages.pdf).

## 1. Setup
Before you begin, you need to import the QuotationTool and the necessary libraries, and initialise them to run in this notebook.

In [None]:
# import the QuotationTool
import warnings
from extract_display_quotes import QuotationTool, DownloadFileLink

# initialize the QuotationTool
qt = QuotationTool()

<div class="alert alert-block alert-warning">
<b>Installing Libraries</b> 

The requirements file <b>environment.yml</b> is included with this notebook. Take a look inside to see which libraries have been installed for this notebook.

</div>

## 2. Load the data
This notebook allows you to extract quotes directly from a text file (or a number of text files). Alternatively, you can extract quotes from a text column inside an Excel spreadsheet ([see an example here](https://github.com/Australian-Text-Analytics-Platform/quotation-tool/blob/main/documents/sample_texts.xlsx?raw=true)).  

<table style='margin-left: 10px'><tr>
<td> <img src='./img/txt_icon.png' style='width: 45px'/> </td>
<td> <img src='./img/xlsx_icon.png' style='width: 55px'/> </td>
<td> <img src='./img/csv_icon.png' style='width: 45px'/> </td>
<td> <img src='./img/zip_icon.png' style='width: 45px'/> </td>
</tr></table>

<div class="alert alert-block alert-warning">
<b>Uploading your text files</b> 
    
If you have a large number of text files (more than 10MB in total), we suggest you compress (zip) them and upload the zip file instead. If you need assistance on how to compress your file, please check [the user guide](https://github.com/Australian-Text-Analytics-Platform/quotation-tool/blob/main/documents/jupyter-notebook-guide.pdf) for more info. 
</div>
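
If you prefer to build the archive from Python rather than from your file manager, the compression step can be sketched with the standard library. The function name `zip_text_files` and the folder layout are illustrative assumptions and are not part of the QuotationTool.

```python
import zipfile
from pathlib import Path


def zip_text_files(src_dir, zip_path):
    """Compress every .txt file found directly inside src_dir into zip_path.

    Hypothetical helper for illustration; returns the archived file names.
    """
    src = Path(src_dir)
    names = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.glob("*.txt")):
            # arcname keeps only the file name, so the archive has no nested folders
            zf.write(path, arcname=path.name)
            names.append(path.name)
    return names
```

For example, `zip_text_files("my_texts", "my_texts.zip")` would bundle the `.txt` files in a folder called `my_texts` (a placeholder name) into a single file that you can then upload.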

<div class="alert alert-block alert-danger">
<b>Large file upload</b> 
    
If you have ongoing issues with the file upload, please re-launch the notebook via Binder. If the issue persists, consider restarting your computer.
</div>



In [None]:
# upload the text files and/or excel spreadsheets onto the system
display(qt.upload_box)
print('Uploading large files may take a while. Please be patient.')
print('\033[1mPlease wait and do not press any buttons until the progress bar appears...\033[0m')

Once your files are uploaded, you can see a preview of the text in a table format (pandas dataframe).  

<div class="alert alert-block alert-info">
<b>Tools:</b>    
    
- nltk: for sentence tokenization
- spaCy: for text cleaning and normalisation
- pandas: for storing and displaying in dataframe (table) format
</div>

<div class="alert alert-block alert-warning">
<b>Specify the number of rows to display</b> 
    
By default, you will preview the first 5 rows of the uploaded texts in a pandas dataframe (table) format. However, you can preview more or fewer rows by specifying the number of rows you wish to display in the variable 'n' below. 
</div>

In [None]:
# specify the number of rows you wish to display
n=5

# display a preview of the pandas dataframe
qt.text_df.head(n)

## 3. Extract the quotes
Once your texts have been stored in a pandas dataframe, you can begin to extract the quotes from the texts. You can also extract named entities from your text by listing the entity types you wish to include in the *inc_ent* variable below. If you are extracting quotes from a lot of texts, be patient. As a guideline, for a corpus with a file size of 54.13 MB (~26,000 newspaper articles in plain text format), it can take about 45 minutes to extract quotes.    
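
As a rough planning aid, the guideline above can be turned into a back-of-envelope estimate. This sketch assumes that run time scales linearly with corpus size, which is an assumption rather than a measured property of the tool; actual times depend on your hardware and on the texts themselves. The function name `estimate_minutes` is hypothetical.

```python
# Reference point quoted in this notebook: 54.13 MB took roughly 45 minutes.
REFERENCE_MB = 54.13
REFERENCE_MINUTES = 45.0


def estimate_minutes(corpus_mb):
    """Linearly extrapolate extraction time (assumption: time grows with size)."""
    return corpus_mb * REFERENCE_MINUTES / REFERENCE_MB


print(round(estimate_minutes(10), 1))  # a 10 MB corpus: about 8.3 minutes
```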

<div class="alert alert-block alert-info">
<b>Tools:</b>    

- quote_extractor: for extracting quotes and speakers
- spaCy: for extracting named entities
    
<b>Note:</b> this tool uses spaCy to tokenize the text, which initially splits the text into tokens based on whitespace characters, and then applies language specific rules to further refine the outcome. For example, the word “don’t” does not contain whitespace, but would be split into two tokens: “do” and “n’t”, whereas “U.K.” would remain as one token. For more information about spaCy tokenizer, please visit [this page](https://spacy.io/usage/linguistic-features#tokenization).
</div>

<div class="alert alert-block alert-warning">
<b>Specify the number of rows to display</b> 
    
By default, you will preview the first 5 rows of the extracted quotes in a pandas dataframe (table) format. However, you can preview more or fewer rows by specifying the number of rows you wish to display in the variable 'n' below. 
</div>

<div class="alert alert-block alert-danger">
<b>Memory limitation in Binder</b> 
    
The free Binder deployment is only guaranteed a maximum of 2GB memory. Processing very large text files may cause the session (kernel) to re-start due to insufficient memory. Check [the user guide](https://github.com/Sydney-Informatics-Hub/HASS-29_Quotation_Tool/blob/main/documents/jupyter-notebook-guide.pdf) for more info. 
</div>

In [None]:
# specify the named entities you wish to include below
inc_ent = ['ORG','PERSON','GPE','NORP','FAC','LOC']

# specify the number of rows you wish to display
n=5

# extract quotes from the text and preview them in a pandas dataframe (table) format
quotes_df = qt.get_quotes(inc_ent)

# display a preview of the pandas dataframe
quotes_df.head(n)

<div class="alert alert-block alert-warning">
<b>What information is included in the above table?</b> 

In general, the quotes are extracted either based on syntactic or heuristic rules. Some quotes can be stand-alone in a sentence, or followed by another quote (floating quote) from the same speaker. Please refer to [this document](https://doi.org/10.1371/journal.pone.0245533.s001) for further information about the quote extraction process.  
    
**text_id:** the unique ID of the text.
    
**text_name:** the name of the text, i.e., the name of the .txt file or the value in the 'text_name' column of the Excel spreadsheet.
    
**quote_id/speaker_id:** the unique ID of the extracted quote/speaker.
    
**quote/speaker:** the content of the extracted quote and the speaker.
    
**verb:** the verb used to determine the extracted quote.
    
**quote_index/speaker_index/verb_index:** the location of the first and the last characters of the extracted quote/speaker/verb in the text.
    
**quote_entities/speaker_entities:** the entity name and type of the entities identified in the extracted quote/speaker.
    
**quote_token_count:** the length of the extracted quote (in tokens).
    
**quote_type:** the type of quote based on how it is extracted.
    
**floating_quote:** whether the extracted quote is a floating quote, i.e., a follow up quote from the same speaker (The value TRUE here means that the quote is a floating quote, while FALSE means that the quote is not a floating quote).

**Quotation symbols:** Q (Quotation mark), S (Speaker), V (Verb), C (Content).  

**Named Entities:**  PERSON (People, including fictional), NORP (Nationalities or religious or political groups), FAC (Buildings, airports, highways, etc.), ORG (Companies, agencies, institutions, etc.), GPE (Countries, cities, states), LOC (Non-GPE locations, mountain ranges, bodies of water).
</div>
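
The index columns can be used to pull the matching span back out of the source text. The sketch below assumes that an index is stored as a string of the form `'(start,end)'`, that `end` is exclusive as in Python slicing, and that `(0,0)` marks a missing span; check these assumptions against your own results before relying on them. The helper name `span_text` is hypothetical.

```python
from ast import literal_eval


def span_text(text, index):
    """Return the part of text covered by an index string such as '(4,9)'.

    Assumes end-exclusive character offsets; '(0,0)' is treated as 'no span'.
    """
    start, end = literal_eval(index)  # safely parse the tuple literal
    if (start, end) == (0, 0):
        return ""
    return text[start:end]


sample = '"We will act," Mr Walsh said.'
start = sample.find("Mr Walsh")
speaker_index = f"({start},{start + len('Mr Walsh')})"
print(span_text(sample, speaker_index))  # Mr Walsh
```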

## 4. Display the quotes
Once you have extracted the quotes, you can see a preview of the quotes using spaCy's visualisation tool, displaCy. 

<div class="alert alert-block alert-info">
<b>Tools:</b>    

- displaCy: for displaying quotes, speakers and named entities
- ipywidgets: for interactive tool
</div>

<div class="alert alert-block alert-danger">
<b>Select the text and the entities to show</b> 

In order to preview the extracted information, select the text you wish to analyse and which entities to show. Then, you can click the ***Preview*** button to display them and the ***Save Preview*** button to save them as an html file. 
</div>

In [None]:
# display a preview of the extracted quotes, speakers and entities within the text
warnings.filterwarnings("ignore")
qt.analyse_quotes(inc_ent)

<div class="alert alert-block alert-danger">
<b>Select the text and the entities to show</b> 

You can also display the top named entities identified in the quotes and/or speakers. Select the text to analyse (an option to analyse 'all texts' is also available), whether to display the entities identified in the speakers and/or quotes, whether to display the entity names and/or types, and the number of top entities to display. Then click the ***Show Top Entities*** and ***Save Top Entities*** buttons to display and save them, respectively. 
</div>

In [None]:
# check the top named entities identified in the quotes and/or speakers
warnings.filterwarnings("ignore")
qt.analyse_entities(inc_ent)

<div class="alert alert-block alert-warning">
<b>Capitalized words</b> 

Please note that the tool is case-sensitive: words such as quote, QUOTE and Quote are recognised as different words, so they are counted separately in the above analysis.

</div>
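
If you want counts that ignore capitalisation, you can normalise the entity names yourself before counting. This is a minimal sketch with the standard library, using made-up entity names; it does not reflect how the tool itself counts.

```python
from collections import Counter

# Made-up entity names for illustration only
entities = ["NFF", "nff", "Nff", "Glasgow", "glasgow"]

case_sensitive = Counter(entities)
case_insensitive = Counter(name.lower() for name in entities)

print(case_sensitive["NFF"])        # 1
print(case_insensitive["nff"])      # 3
print(case_insensitive["glasgow"])  # 2
```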

## 5. Save the quotes
Finally, you can run the code below to save the quotes pandas dataframe into an Excel spreadsheet and download it to your local computer.  

In [1]:
# import here as well, so this cell also runs after a kernel restart
import warnings
warnings.filterwarnings("ignore")
# specify output directory and file name
output_dir = './output/'
file_name = 'quotes.xlsx'

# save quotes_df into an Excel spreadsheet
from pyexcelerate import Workbook
values = [quotes_df.columns] + list(quotes_df.values)
wb = Workbook()
wb.new_sheet('Sheet1', data=values)
wb.save(output_dir + file_name)

# download quotes_df to your computer
print('Click below to download:')
display(DownloadFileLink(output_dir + file_name, 'quotes.xlsx'))

In [6]:
## Test Quote aggregation
import sys
from config import config
sys.path.insert(0,'./GenderGapTracker/nlp/english')
from quote_extractor import QuoteExtractor
import spacy
import pandas as pd
from spacy_experimental.coref.coref_component import DEFAULT_COREF_MODEL
from spacy_experimental.coref.coref_util import DEFAULT_CLUSTER_PREFIX
config_coref={
    "model": DEFAULT_COREF_MODEL,
    "span_cluster_prefix": DEFAULT_CLUSTER_PREFIX,
}

nlp = spacy.load('en_core_web_lg')
# nlp.add_pipe("experimental_coref", config=config_coref)


qt = QuoteExtractor(config)

In [7]:
text = '''

Senator Matt Canavan, an ally of new Nationals leader Barnaby Joyce, has accused the National Farmers' Federation of hypocrisy for advocating no net emissions of greenhouse gases by 2050 while objecting to land-clearing rules that have helped reduce Australia's carbon dioxide emissions.

The criticism reflects a split within agriculture-based organisations about the benefits or harm of Australia committing to what is known as the CN2050 target at a November United Nations climate conference in Glasgow.

"I find their position absolutely incoherent, the way they have us reducing emissions by 50 per cent through government locking up huge areas of land through tree-clearing laws, which the NFF are opposed to," Senator Canavan, the Nationals' deputy Senate leader, said in an interview.

"The reason we are meeting Kyoto [emissions targets] with a spillover - that is all from 6 million hectares of pastoral land reforested from land-use regulations.

"I suppose they are not the only political organisation to have hypocritical policy positions."

Some National Party politicians, especially in Victoria, have expressed concerns that under Mr Joyce, who was re-elected leader on June 21, the party will become reluctant to support or agree to more stringent climate policies.

Prime Minister Scott Morrison has expressed a desire to reach the 2050 target but has not committed the government to it.

The National Farmers' Federation supports what it calls a 2050 "net zero aspiration" target. The Business Council of Australia, which represents large companies, supports the target without qualification.

Responding to Senator Canavan's comments, NFF chief executive Tony Mahar said the group's approach was consistent and it wanted the debate to focus on more substantial problems, including the policies, technology and funding needed to lower emissions.

"We want to see farmers rewarded for their role in lowering emissions, rather than the policies of the past which saw swaths of private farmland locked up with no recognition and no compensation," he said.

"Land-clearing laws saw farmers do the heavy lifting to meet Australia's Kyoto commitments. It was a kick in the guts we haven't forgotten. Now, we want to be part of the solution in a way that recognises and rewards our contribution."

The leader of the Nationals Victoria, Peter Walsh, acknowledged that his division had considered separating from the rest of the party after Mr Joyce was re-elected leader because of his expressed scepticism about climate policies.

"Victorian industry and the Victorian community expect their government to do more on climate change than our Queensland cousins," Mr Walsh said. "The agriculture industry doesn't want to be carved out [of a climate agreement]. Everyone wants to know what the plan is or if you are part of the discussion."

Australia's greenhouse gas emissions, which represent about 1.3 per cent of the world's, have declined 22.6 per cent since their peak in the 2007 financial year, according to the Department of Industry, Science, Energy and Resources.

The International Energy Agency said in May that, to achieve no net emissions on a global basis by 2050, nuclear power would be needed to back up wind and solar power.

Key pointsThe senator said the farmers' federation's position was 'incoherent'.

He said they can't support net zero emissions and oppose land-clearance restrictions.

'''

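doc = nlp(text)

# The coreference step is disabled: adding "experimental_coref" with add_pipe creates
# an untrained component, and calling it raises
# KeyError: "Parameter 'E' for model 'hashembed' has not been allocated yet."
# Its output was not used by the quote extraction below.
# coref = nlp.add_pipe("experimental_coref", config=config_coref)
# processed = coref(doc)

quotes_o = qt.extract_quotes(doc=doc)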

In [3]:
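from ast import literal_eval

def quote_start(quote):
    '''
    Combine the speaker, quote and verb spans and return the earliest start offset.
    Spans equal to (0,0) are treated as missing and ignored.
    '''
    names = ['speaker_index', 'quote_index', 'verb_index']
    span = ()
    for n in names:
        idx = literal_eval(quote[n])  # parse the '(start,end)' string safely
        if sum(idx):
            span += idx
    return min(span)

# sort the quotes by their position in the text
quotes_s = sorted(quotes_o, key=quote_start)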


In [7]:
def valid_speaker(q):
    # a speaker is valid if it was found in the text and is not a lone pronoun
    return not (q["speaker_index"] == '(0,0)' or q["speaker_pos"] == ['PRON'])

def aggregate(group):
    # merge a group of consecutive quotes into one record attributed to the first speaker
    if len(group) <= 1:
        return group
    else:
        # copy the first quote so the original (unaggregated) records are not modified
        result = dict(group[0])
        result['quote'] = ' \n'.join([q['quote'] for q in group]) 
        result['quote_index'] = ', '.join([q['quote_index'] for q in group])
        result['verb'] = ', '.join([q['verb'] for q in group])
        result['verb_index'] = ', '.join([q['verb_index'] for q in group])
        result['quote_token_count'] = sum([q['quote_token_count'] for q in group])
        result['quote_type'] = ', '.join([q['quote_type'] for q in group])
        return [result]

def aggregate_quotes(quotes):
    results = []
    group = []
    for q in quotes:
        if valid_speaker(q):
            results.extend(aggregate(group))
            print(len(group))
            group = [q]
        else:
            group.append(q)
    results.extend(aggregate(group)) # Aggregate the very last group of quotes
    return results
quotes_a = aggregate_quotes(quotes_s)


0
3
2
1
1
1
1
1


In [8]:

with pd.ExcelWriter("Article1.xlsx") as writer:
    pd.DataFrame.from_dict(quotes_o).to_excel(writer, sheet_name="Orig_quotes")  
    pd.DataFrame.from_dict(quotes_s).to_excel(writer, sheet_name="Sorted_quotes")
    pd.DataFrame.from_dict(quotes_a).to_excel(writer, sheet_name="Aggregated_quotes") 

In [14]:
quotes_a


[{'speaker': "Senator Canavan, the Nationals' deputy Senate leader,",
  'speaker_index': '(720,773)',
  'speaker_pos': ['PROPN',
   'PROPN',
   'PUNCT',
   'DET',
   'PROPN',
   'PART',
   'NOUN',
   'PROPN',
   'NOUN',
   'PUNCT'],
  'quote': '"I find their position absolutely incoherent, the way they have us reducing emissions by 50 per cent through government locking up huge areas of land through tree-clearing laws, which the NFF are opposed to \n"The reason we are meeting Kyoto [emissions targets] with a spillover - that is all from 6 million hectares of pastoral land reforested from land-use regulations.\n\n" \n"\n\nSome National Party politicians, especially in Victoria, have expressed concerns that under Mr Joyce, who was re-elected leader on June 21, the party will become reluctant to support or agree to more stringent climate policies.\n\nPrime Minister Scott Morrison has expressed a desire to reach the 2050 target but has not committed the government to it.\n\nThe National Fa

In [4]:
for q in quotes_s:
    print(q['speaker'],  q["speaker_index"], not (q["speaker_index"] == '(0,0)' or q["speaker_pos"] == ['PRON']))

Senator Canavan, the Nationals' deputy Senate leader, (720,773) True
I (962,963) False
 (0,0) False
NFF chief executive Tony Mahar (1658,1688) True
he (2066,2068) False
The leader of the Nationals Victoria, Peter Walsh, (2313,2363) True
Mr Walsh (2677,2685) True
the Department of Industry, Science, Energy and Resources.

 (3029,3089) True
The International Energy Agency (3089,3120) True
The International Energy Agency (3089,3120) True
Key pointsThe senator (3258,3279) True
He (3339,3341) False


In [24]:
quotes_s[1]

{'speaker': 'I',
 'speaker_index': '(962,963)',
 'speaker_pos': 'PRON',
 'quote': '"The reason we are meeting Kyoto [emissions targets] with a spillover - that is all from 6 million hectares of pastoral land reforested from land-use regulations.\n\n"',
 'quote_index': '(797,962)',
 'verb': 'suppose',
 'verb_index': '(964,971)',
 'quote_token_count': 34,
 'quote_type': 'Heuristic',
 'is_floating_quote': False}