# Quote Extractor
In this notebook, we will use the *Quote Extractor* tool to extract quotes from a list of texts. In addition to the quotes themselves, the tool reports who the speaker is and where both the quote and the speaker are located in the text.

**Note:** This code has been adapted from the [GenderGapTracker](https://github.com/sfu-discourse-lab/GenderGapTracker/tree/master/NLP/main) GitHub repository and modified to run in a Jupyter Notebook.

## 1. Setup
Before we begin, we need to import the packages that the tool needs in order to run.

In [1]:
# import QuotationTool
from extract_display_quotes import QuotationTool

# initialize the QuotationTool
qt = QuotationTool()

[nltk_data] Downloading package punkt to /Users/sjuf9909/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Loading spaCy language model...
Finished loading.


## 2. Load the data
This notebook allows you to extract quotes directly from one or more text files. Alternatively, you can extract quotes from a text column inside an Excel spreadsheet.

### 2.1. From a text file
To extract quotes directly from text files, please upload all your text files (.txt) below. The code below then reads those files and stores the text in a pandas dataframe (in table format) for further processing.

In [2]:
# widget to upload .txt files
txt_upload = qt.upload_files(file_type = 'text')
txt_upload

File uploaded!


In [3]:
# process the uploaded txt files and convert them into a pandas dataframe 
# for further analysis
text_df = qt.process_txt(txt_upload)
text_df.head()

text_id,text,spacy_text
text1,"Facebook and Instagram, which Facebook owns, f...","(Facebook, and, Instagram, ,, which, Facebook,..."
text2,(CBC News)\nRepublican lawmakers and previous ...,"((, CBC, News, ), ., \n , Republican, lawmaker..."
text3,Federated States of Micronesia President David...,"(Federated, States, of, Micronesia, President,..."
text4,Chinese state media has launched its strongest...,"(Chinese, state, media, has, launched, its, st..."
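
For readers curious about what this step produces, here is a minimal sketch of how a folder of .txt files could be read into a dataframe indexed by `text_id`. It is not the QuotationTool implementation: the function name `load_txt_folder` and the idea of reading from a local folder are assumptions made for illustration, and the `spacy_text` column is left out.

```python
from pathlib import Path

import pandas as pd


def load_txt_folder(folder):
    """Read every .txt file in `folder` into a dataframe indexed by text_id.

    Hypothetical helper: the file name (without extension) is used as the
    text_id, and the file content goes into the 'text' column.
    """
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        rows.append({"text_id": path.stem,
                     "text": path.read_text(encoding="utf-8")})
    return pd.DataFrame(rows, columns=["text_id", "text"]).set_index("text_id")
```

Calling `load_txt_folder("my_texts")` on a folder that contains `text1.txt` and `text2.txt` would give a two-row dataframe with those names as the index.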


### 2.2. From an Excel spreadsheet
If you have already stored your texts in an Excel spreadsheet, you can use the code below to load the spreadsheet.

In [4]:
# widget to upload .xlsx files
xlsx_upload = qt.upload_files(file_type = 'excel')
xlsx_upload

File uploaded!


In [5]:
# read the pandas dataframe containing the list of texts
text_df = qt.process_xls(xlsx_upload)
text_df.head()

text_id,text,spacy_text
text1,"Facebook and Instagram, which Facebook owns, f...","(Facebook, and, Instagram, ,, which, Facebook,..."
text2,(CBC News)\nRepublican lawmakers and previous ...,"((, CBC, News, ), ., \n , Republican, lawmaker..."
text3,Federated States of Micronesia President David...,"(Federated, States, of, Micronesia, President,..."
text4,Chinese state media has launched its strongest...,"(Chinese, state, media, has, launched, its, st..."


## 3. Extract the quotes
Once your texts have been stored in a pandas dataframe, we can begin to extract the quotes from the texts.

In [6]:
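# named entity types to include when extracting quotes
inc_ent = ['ORG','PERSON','GPE','NORP','FAC','LOC']

# extract the quotes and preview the first few rows
quotes_df = qt.get_quotes(text_df, inc_ent, create_tree=False)
quotes_df.head()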

,text_id,quote_id,quote,quote_index,quote_entities,speaker,speaker_index,speaker_entities,verb,verb_index,quote_token_count,quote_type,is_floating_quote
0,text1,0,"""We didn't just see a breach at the Capitol. S...","(1052, 1238)","[(Capitol, FAC), (the United States, GPE), (Ca...",Grygiel,"(1239, 1246)","[(Grygiel, PERSON)]",said,"(1247, 1251)",38,Heuristic,False
1,text1,1,"""Social media is complicit in this because he ...","(1492, 1691)","[(the United States, GPE)]",,"(0, 0)",[],caused,"(1705, 1711)",39,Heuristic,False
2,text1,2,that Trump wouldn't be able to post for 24 hou...,"(84, 173)","[(Trump, ORG), (Trump, PERSON)]","Facebook and Instagram, which Facebook owns,","(0, 44)","[(Instagram, ORG), (Facebook, ORG)]",announcing,"(73, 83)",17,S V C,False
3,text1,3,that these actions follow years of hemming and...,"(302, 489)","[(Trump, ORG), (Trump, PERSON)]",experts,"(288, 295)",[],noted,"(296, 301)",26,S V C,False
4,text1,4,"what happened in Washington, D.C., on Wednesda...","(592, 813)","[(Trump, ORG), (Trump, PERSON), (D.C., GPE), (...","Jennifer Grygiel, a Syracuse University commun...","(491, 586)","[(Syracuse University, ORG), (Grygiel, PERSON)...",said,"(587, 591)",38,S V C,False
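
The `quote_index`, `speaker_index` and `verb_index` columns appear to hold (start, end) character offsets into the original text, with the end position excluded: the speaker "Facebook and Instagram, which Facebook owns," is 44 characters long and has the index (0, 44). Under that assumption, the sketch below shows how a span can be recovered from a text. The helper name `span_text` and the sample sentence are made up for illustration.

```python
import ast


def span_text(text, index):
    """Return the slice of `text` described by a (start, end) character span."""
    # The span may be a tuple or its string form, e.g. "(0, 44)".
    start, end = ast.literal_eval(index) if isinstance(index, str) else index
    return text[start:end]


# Made-up sentence, not taken from the dataset
sample = "The mayor said that the bridge would reopen on Monday."
print(span_text(sample, (0, 9)))      # The mayor
print(span_text(sample, "(10, 14)"))  # said
print(span_text(sample, (15, 53)))    # that the bridge would reopen on Monday
```

A span of (0, 0), as in the row above where no speaker was found, yields an empty string.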


In general, quotes are extracted using either syntactic rules or heuristic (custom) rules; the `quote_type` column shows which kind of rule matched. A quote can stand alone in a sentence, or be followed by another quote (a floating quote) in the same sentence.

**Quotation symbols:** *Q (Quotation mark), S (Speaker), V (Verb), C (Content)*  

**Named Entities:**  *PERSON (People, including fictional), NORP (Nationalities or religious or political groups), FAC (Buildings, airports, highways, bridges, etc.), ORG (Companies, agencies, institutions, etc.), GPE (Countries, cities, states), LOC (Non-GPE locations, mountain ranges, bodies of water)*
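
To get a quick overview of the extracted quotes, you could count how often each quote type and each entity label occurs. The sketch below assumes the entity columns hold Python lists of (text, label) pairs, as the preview above suggests; if they are stored as strings, they would need to be parsed first. The helper `entity_label_counts` is hypothetical, and the small dataframe only mimics a few values from the preview.

```python
from collections import Counter

import pandas as pd


def entity_label_counts(entity_lists):
    """Count entity labels across lists of (entity_text, label) pairs."""
    counts = Counter()
    for entities in entity_lists:
        counts.update(label for _, label in entities)
    return counts


# Small stand-in for quotes_df, using values that appear in the preview above
sample_quotes = pd.DataFrame({
    "quote_type": ["Heuristic", "S V C", "S V C"],
    "speaker_entities": [
        [("Grygiel", "PERSON")],
        [("Instagram", "ORG"), ("Facebook", "ORG")],
        [],
    ],
})

# number of quotes found by each kind of rule
print(sample_quotes["quote_type"].value_counts())
# number of speaker entities per label
print(entity_label_counts(sample_quotes["speaker_entities"]))
```

With the real data, replace `sample_quotes` with `quotes_df`.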

## 4. Display the quotes
Once you have extracted the quotes, you can preview them using spaCy's visualisation tool, displaCy. All you need to do is run the function below and enter the text_id you wish to analyse.

In [7]:
%%javascript
// keep long outputs fully expanded instead of placing them in a scroll box
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
};

<IPython.core.display.Javascript object>

In [8]:
box = qt.analyse_quotes(text_df, quotes_df, inc_ent)
box

VBox(children=(HTML(value='<b>Enter the text_id of the text you wish to analyse:</b>', placeholder=''), Text(v…
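
If you would like to highlight spans yourself, displaCy can also render hand-built character spans through its `manual=True` mode. The sketch below is not how `analyse_quotes` works internally. The helper names and the `QUOTE` and `SPEAKER` labels are assumptions chosen for illustration, and the rendering step requires spaCy to be installed.

```python
def build_displacy_input(text, spans):
    """Convert (start, end, label) character spans into displaCy's manual 'ent' format."""
    # displaCy expects the spans in order of appearance
    ents = [{"start": start, "end": end, "label": label}
            for start, end, label in sorted(spans)]
    return {"text": text, "ents": ents, "title": None}


def render_spans(text, spans):
    """Return HTML that highlights the given spans (requires spaCy)."""
    # imported here so that build_displacy_input works without spaCy
    from spacy import displacy

    options = {"colors": {"QUOTE": "#ffd966", "SPEAKER": "#9fc5e8"}}
    return displacy.render(build_displacy_input(text, spans), style="ent",
                           manual=True, options=options, jupyter=False)


# Made-up sentence with hypothetical QUOTE and SPEAKER spans
sample_text = "The mayor said that the bridge would reopen on Monday."
sample_spans = [(15, 53, "QUOTE"), (0, 9, "SPEAKER")]
doc = build_displacy_input(sample_text, sample_spans)
```

Inside a notebook, passing `jupyter=True` to `displacy.render` displays the highlighted text directly instead of returning HTML.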

## 5. Save your quotes
Finally, you can save the quotes dataframe to an Excel spreadsheet and download it to your local computer.

In [9]:
# create the output folder if needed, then save quotes_df into an Excel spreadsheet
import os
os.makedirs('./output', exist_ok=True)
quotes_df.to_excel('./output/quotes.xlsx', index=False)