# Quote Extractor
In this notebook, we will use the *Quote Extractor* tool to extract quotes from a list of texts. In addition to the quotes themselves, the tool reports who the speakers are, where the quotes (and the speakers) are located in the text, and which named entities were identified, such as Persons, Organisations, Government Entities, etc.

**Note:** This code has been adapted (with permission) from the [GenderGapTracker GitHub page](https://github.com/sfu-discourse-lab/GenderGapTracker/tree/master/NLP/main) and modified to run in a Jupyter Notebook. The quotation tool’s accuracy is evaluated in [this article](https://doi.org/10.1371/journal.pone.0245533).

## 1. Setup
Before we begin, we need to import the QuotationTool and initialise it to run in this notebook.

In [None]:
# import QuotationTool
from extract_display_quotes import QuotationTool

# initialize the QuotationTool
qt = QuotationTool()

## 2. Load the data
This notebook allows you to extract quotes directly from one or more text files. Alternatively, you can extract quotes from a text column in an Excel spreadsheet.

### 2.1. From a text file
To extract quotes directly from text files, please upload all your text files (.txt) below. The code below reads those files and stores the text in a pandas dataframe (in table format) for further processing.

In [None]:
# widget to upload .txt files
txt_upload = qt.upload_files(file_type = 'text')
txt_upload

Once you have uploaded the text files, you can run the below code to see a preview of the newly created pandas dataframe.

In [None]:
# process the uploaded txt files and convert them into a pandas dataframe 
# for further analysis
text_df = qt.process_txt(txt_upload)
text_df.head()

### 2.2. From an Excel spreadsheet
If you have already stored your texts in an Excel spreadsheet, you can use the below code to access your spreadsheet.

In [None]:
# widget to upload .xlsx files
xlsx_upload = qt.upload_files(file_type = 'excel')
xlsx_upload

In [None]:
# read the pandas dataframe containing the list of texts
text_df = qt.process_xls(xlsx_upload)
text_df.head()

## 3. Extract the quotes
Once your texts have been stored in a pandas dataframe, we can begin to extract the quotes from the texts.

In [None]:
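# named entity types to include in the output
inc_ent = ['ORG','PERSON','GPE','NORP','FAC','LOC']

# extract the quotes from the texts loaded in section 2
quotes_df = qt.get_quotes(inc_ent, create_tree=False)
quotes_df.head()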

In general, the quotes are extracted based on either syntactic rules or heuristic (custom) rules. A quote can stand alone in a sentence, or it can be followed by another quote (a floating quote) in the same sentence.
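
To make the idea of a heuristic rule concrete, here is a minimal sketch that finds text enclosed in double quotation marks with a regular expression. It is an illustration only and not the QuotationTool's actual implementation; the function name `find_quoted_spans` and the sample sentence are hypothetical.

```python
import re

# Matches text between straight or curly double quotation marks.
QUOTE_PATTERN = re.compile(r'["“]([^"“”]+)["”]')

def find_quoted_spans(text):
    """Return (start, end, content) for each quoted span in text."""
    return [(m.start(), m.end(), m.group(1)) for m in QUOTE_PATTERN.finditer(text)]

# Hypothetical sample sentence containing two quotes.
sample = 'The minister said, "We will act now," and later added, "There is no delay."'
for start, end, content in find_quoted_spans(sample):
    print(start, end, content)
```

A rule like this only catches content inside quotation marks. Indirect speech without quotation marks needs syntactic rules that look at the structure of the sentence instead.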

**Quotation symbols:** *Q (Quotation mark), S (Speaker), V (Verb), C (Content)*  

**Named Entities:**  *PERSON (People, including fictional), NORP (Nationalities or religious or political groups), FAC (Buildings, airports, highways, bridges, etc.), ORG (Companies, agencies, institutions, etc.), GPE (Countries, cities, states), LOC (Non-GPE locations, mountain ranges, bodies of water)*
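
The quotation symbols above can be read as a compact description of how a quote is built. The sketch below expands such a pattern into words. It assumes a pattern is written as a string of these symbols; the helper `describe_pattern` and the example pattern `'QCQSV'` are illustrative assumptions, not part of the tool.

```python
# Legend taken from the quotation symbols listed above.
SYMBOLS = {
    'Q': 'Quotation mark',
    'S': 'Speaker',
    'V': 'Verb',
    'C': 'Content',
}

def describe_pattern(pattern):
    """Expand a pattern string made of Q, S, V and C into its component names."""
    return [SYMBOLS[symbol] for symbol in pattern]

# 'QCQSV' is an assumed example: quoted content followed by a speaker and a verb.
print(' + '.join(describe_pattern('QCQSV')))
```

Under this reading, a sentence such as *"We will act," the minister said* would follow the pattern in the example.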

## 4. Display the quotes
Once you have extracted the quotes, you can see a preview of the quotes using spaCy's visualisation tool, displaCy. All you need to do is run the below code and specify the text_id you wish to analyse and what entities to show.  

The text, speakers, quotes and named entities can be saved as an HTML file. You also have the option to show the five most frequently mentioned entities in the speakers and quotes.

In [None]:
# widget to choose a text and the entities to display
box = qt.analyse_quotes(inc_ent)
box
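
The "top five" option mentioned above amounts to a frequency count over the identified entities. A minimal sketch of such a count with Python's standard library follows. The entity list and the function `top_entities` are hypothetical, and the tool may compute its ranking differently.

```python
from collections import Counter

# Hypothetical (text, label) entity pairs, as might be found in the quotes of one text.
entities = [
    ('Canada', 'GPE'), ('WHO', 'ORG'), ('Canada', 'GPE'),
    ('Jane Doe', 'PERSON'), ('WHO', 'ORG'), ('Canada', 'GPE'),
]

def top_entities(entity_pairs, n=5):
    """Return the n most frequent (text, label) pairs with their counts."""
    return Counter(entity_pairs).most_common(n)

for (text, label), count in top_entities(entities):
    print(f'{text} ({label}): {count}')
```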

## 5. Save your quotes
Finally, you can save the quotes dataframe to an Excel spreadsheet and download it to your local computer.

In [None]:
# save quotes_df into an Excel spreadsheet
quotes_df.to_excel('./output/quotes.xlsx', index=False)
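
Note that `to_excel` does not create missing folders, so the save fails if the `./output` folder does not exist yet. A minimal sketch, assuming you want the same folder and file name as above, creates the folder first with `pathlib`:

```python
from pathlib import Path

# Create the output folder if it is missing; exist_ok avoids an error when it already exists.
output_dir = Path('./output')
output_dir.mkdir(parents=True, exist_ok=True)

# Full path of the spreadsheet, ready to pass to quotes_df.to_excel(output_path, index=False)
output_path = output_dir / 'quotes.xlsx'
print(output_path)
```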