# QuotationTool
In this notebook, we will use the *QuotationTool* to extract quotes from a collection of texts. Besides the quotes themselves, the tool identifies the speaker of each quote, the location of the quote and the speaker within the text, and the named entities they contain, all of which can be useful for your text analysis.

**Note:** This code has been adapted (with permission) from the [GenderGapTracker GitHub page](https://github.com/sfu-discourse-lab/GenderGapTracker/tree/master/NLP/main) and modified to run on a Jupyter Notebook. The quotation tool’s accuracy rate is evaluated in [this article](https://doi.org/10.1371/journal.pone.0245533).

## 1. Setup
Before we begin, we need to import the QuotationTool and initialize it to run in this notebook.

In [1]:
# import the QuotationTool
from extract_display_quotes import QuotationTool

# initialize the QuotationTool
qt = QuotationTool()

[nltk_data] Downloading package punkt to /Users/sjuf9909/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Loading spaCy language model...
Finished loading.


## 2. Load the data
This notebook lets you extract quotes directly from one or more text files. Alternatively, you can extract quotes from a text column in an Excel spreadsheet.

To extract quotes from text files, upload your text files (.txt) below. If your texts are already stored in an Excel spreadsheet, select the second tab to upload the spreadsheet instead.

In [2]:
display(qt.file_uploader)

Currently 4 text documents are loaded for analysis


Once your files are uploaded, you can run the code below to preview the texts in a table (a pandas DataFrame).

In [3]:
# display a preview of the pandas dataframe
qt.text_df

| | text_name | text | text_id | spacy_text |
|---|---|---|---|---|
| 0 | text1 | Facebook and Instagram, which Facebook owns, f... | fa629d3c6eca09ff631cee60c1636657 | (Facebook, and, Instagram, ,, which, Facebook,... |
| 1 | text2 | (CBC News)\nRepublican lawmakers and previous ... | 0ffd917f3646c7b9b697a871e8177bf3 | ((, CBC, News, ), ., \n , Republican, lawmaker... |
| 2 | text3 | Federated States of Micronesia President David... | 1dcf3eff30c12de4aa90c833fca6d280 | (Federated, States, of, Micronesia, President,... |
| 3 | text4 | Chinese state media has launched its strongest... | 22e14cd8db835aabba31de096b95fc75 | (Chinese, state, media, has, launched, its, st... |
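
The notebook does not say how the *text_id* values are computed. They are 32-character hexadecimal strings, so the sketch below assumes an MD5 digest of the text content, purely for illustration. The helper `make_text_record` and the sample texts are hypothetical and not part of the QuotationTool.

```python
import hashlib


def make_text_record(text_name: str, text: str) -> dict:
    """Build one table row with an ID derived from the text content."""
    # Assumption: MD5 is used only because its hex digest has 32 characters,
    # like the text_id values in the preview. The tool may do this differently.
    text_id = hashlib.md5(text.encode("utf-8")).hexdigest()
    return {"text_name": text_name, "text": text, "text_id": text_id}


# Hypothetical sample texts standing in for uploaded .txt files
samples = {
    "text1": "Facebook and Instagram announced a 24-hour suspension.",
    "text2": "Republican lawmakers condemned the violence.",
}
records = [make_text_record(name, text) for name, text in samples.items()]

for record in records:
    print(record["text_name"], record["text_id"])
```

An ID derived from the content stays the same when the same text is uploaded again, which makes it a convenient key for linking quotes back to their source text. If pandas is available, `pd.DataFrame(records)` turns the list into a table.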


## 3. Extract the quotes
Once your texts have been stored in a pandas DataFrame, you can begin extracting quotes from them. You can also extract named entities by listing the entity types you wish to include in the *inc_ent* variable below.

In [4]:
# named entity types to include
inc_ent = ['ORG','PERSON','GPE','NORP','FAC','LOC']

# extract the quotes and preview the first five rows
quotes_df = qt.get_quotes(inc_ent, create_tree=False)
quotes_df.head()

| | text_id | text_name | quote_id | quote | quote_index | quote_entities | speaker | speaker_index | speaker_entities | verb | verb_index | quote_token_count | quote_type | is_floating_quote |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | fa629d3c6eca09ff631cee60c1636657 | text1 | 0 | "We didn't just see a breach at the Capitol. S... | (1052, 1238) | [(Capitol, ORG), (the United States, GPE), (Ca... | Grygiel | (1239, 1246) | [(Grygiel, PERSON)] | said | (1247, 1251) | 38 | Heuristic | False |
| 1 | fa629d3c6eca09ff631cee60c1636657 | text1 | 1 | "Social media is complicit in this because he ... | (1492, 1691) | [(the United States, GPE)] | | (0, 0) | [] | caused | (1705, 1711) | 39 | Heuristic | False |
| 2 | fa629d3c6eca09ff631cee60c1636657 | text1 | 2 | that Trump wouldn't be able to post for 24 hou... | (84, 173) | [(Trump, ORG), (Trump, PERSON)] | Facebook and Instagram, which Facebook owns, | (0, 44) | [(Instagram, ORG), (Facebook, ORG)] | announcing | (73, 83) | 17 | S V C | False |
| 3 | fa629d3c6eca09ff631cee60c1636657 | text1 | 3 | that these actions follow years of hemming and... | (302, 489) | [(Trump, ORG), (Trump, PERSON)] | experts | (288, 295) | [] | noted | (296, 301) | 26 | S V C | False |
| 4 | fa629d3c6eca09ff631cee60c1636657 | text1 | 4 | what happened in Washington, D.C., on Wednesda... | (592, 813) | [(Trump, ORG), (Washington, GPE), (D.C., GPE),... | Jennifer Grygiel, a Syracuse University commun... | (491, 586) | [(Jennifer Grygiel, PERSON), (Syracuse Univers... | said | (587, 591) | 38 | S V C | False |
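
The *quote_index*, *speaker_index* and *verb_index* values in the preview look like character offsets into the source text, with an inclusive start and an exclusive end: the speaker "Grygiel" at (1239, 1246) spans 7 characters and the verb "said" at (1247, 1251) spans 4. Under that assumption, a span can be recovered by slicing the text. The helper `span_text` and the sample sentence below are hypothetical and only illustrate the idea.

```python
from ast import literal_eval


def span_text(text, index):
    """Return the slice of `text` covered by a (start, end) offset pair."""
    # Accept the pair either as a tuple or as its text form, e.g. "(19, 26)"
    if isinstance(index, str):
        index = literal_eval(index)
    start, end = index
    return text[start:end]


# Hypothetical sentence and offsets, shaped like one row of quotes_df
text = '"We saw a breach," Grygiel said.'
row = {
    "quote_index": (0, 18),
    "speaker_index": "(19, 26)",  # text form of the same kind of pair
    "verb_index": (27, 31),
}

quote = span_text(text, row["quote_index"])
speaker = span_text(text, row["speaker_index"])
verb = span_text(text, row["verb_index"])
print(quote, "|", speaker, "|", verb)
```

In the preview, the row with no speaker has a *speaker_index* of (0, 0), for which this helper returns an empty string.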


Quotes are extracted using either syntactic rules or heuristic rules. A quote can stand alone in a sentence, or it can be followed by another quote (a floating quote) in the same sentence.

**Quotation symbols:** *Q (Quotation mark), S (Speaker), V (Verb), C (Content)*  

**Named Entities:**  *PERSON (People, including fictional), NORP (Nationalities or religious or political groups), FAC (Buildings, airports, highways, bridges, etc.), ORG (Companies, agencies, institutions, etc.), GPE (Countries, cities, states), LOC (Non-GPE locations, mountain ranges, bodies of water)*
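
As a quick way to summarise the extraction results, the sketch below counts quotes per *quote_type*, lists quotes that have no identified speaker, and expands the quotation symbols using the legend above. The rows copy a few columns from the preview. Representing a missing speaker as an empty string is an assumption, and `describe_quote_type` is a hypothetical helper, not part of the QuotationTool.

```python
from collections import Counter

# Legend of quotation symbols, as listed in this notebook
SYMBOLS = {"Q": "Quotation mark", "S": "Speaker", "V": "Verb", "C": "Content"}

# A few columns from the quotes preview (empty speaker assumed to be "")
rows = [
    {"quote_id": 0, "speaker": "Grygiel", "quote_type": "Heuristic"},
    {"quote_id": 1, "speaker": "", "quote_type": "Heuristic"},
    {"quote_id": 2, "speaker": "Facebook and Instagram, which Facebook owns,", "quote_type": "S V C"},
    {"quote_id": 3, "speaker": "experts", "quote_type": "S V C"},
    {"quote_id": 4, "speaker": "Jennifer Grygiel, a Syracuse University commun...", "quote_type": "S V C"},
]


def describe_quote_type(quote_type):
    """Spell out a symbol pattern such as 'S V C'; leave other labels as-is."""
    parts = quote_type.split()
    if parts and all(part in SYMBOLS for part in parts):
        return ", ".join(SYMBOLS[part] for part in parts)
    return quote_type


# Number of quotes found by each rule type
type_counts = Counter(row["quote_type"] for row in rows)

# IDs of quotes for which no speaker was identified
unattributed = [row["quote_id"] for row in rows if not row["speaker"]]

for quote_type, count in type_counts.items():
    print(describe_quote_type(quote_type), count)
print("Quotes without a speaker:", unattributed)
```

On the full DataFrame, `quotes_df["quote_type"].value_counts()` gives the same kind of count.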

## 4. Display the quotes
Once you have extracted the quotes, you can preview them using spaCy's visualisation tool, displaCy. Run the code below, then select the text_id you wish to analyse and which entities to show.

Click the ***Preview*** button to display the quotes, and click ***Save Preview*** to save them as an HTML file. You can also show the top five named entities mentioned in the speakers and quotes by clicking the ***Top 5 Entities*** button.

In [5]:
box = qt.analyse_quotes(inc_ent)
box

VBox(children=(HBox(children=(VBox(children=(HTML(value='<b>Select which text to preview:</b>', placeholder=''…
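
To give a sense of what a "top five entities" summary involves, here is a minimal sketch that counts (entity, label) pairs across quotes and keeps the most frequent ones. It is not the tool's own implementation. The function `top_entities` is hypothetical, and the entity lists are abridged from the *quote_entities* preview (truncated entries are left out).

```python
from collections import Counter

# Entity lists abridged from the quote_entities column of the preview
quote_entities = [
    [("Capitol", "ORG"), ("the United States", "GPE")],
    [("the United States", "GPE")],
    [("Trump", "ORG"), ("Trump", "PERSON")],
    [("Trump", "ORG"), ("Trump", "PERSON")],
    [("Trump", "ORG"), ("Washington", "GPE"), ("D.C.", "GPE")],
]


def top_entities(entity_lists, include, n=5):
    """Count (text, label) pairs whose label is in `include`; return the top n."""
    counts = Counter(
        (entity_text, label)
        for entities in entity_lists
        for entity_text, label in entities
        if label in include
    )
    return counts.most_common(n)


inc_ent = ["ORG", "PERSON", "GPE", "NORP", "FAC", "LOC"]
top = top_entities(quote_entities, inc_ent)
for (entity_text, label), count in top:
    print(f"{entity_text} ({label}): {count}")
```

Counting the pair rather than the text alone keeps "Trump" as an ORG separate from "Trump" as a PERSON, which makes labelling inconsistencies easy to spot.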

## 5. Save your quotes
Finally, you can save the quotes DataFrame as an Excel spreadsheet and download it to your computer.

In [6]:
# create the output folder if needed, then save quotes_df into an Excel spreadsheet
import os
os.makedirs('./output', exist_ok=True)
quotes_df.to_excel('./output/quotes.xlsx', index=False)