# Quote Extractor
In this notebook, we will use the *Quote Extractor* tool to extract quotes from a list of texts. In addition to the quotes themselves, the tool reports who the speaker is and where the quote and the speaker are located in the text.

**Note:** This code has been adapted from the [GenderGapTracker](https://github.com/sfu-discourse-lab/GenderGapTracker/tree/master/NLP/main) GitHub page and modified to run on a Jupyter Notebook.

## 1. Setup
Before we begin, we need to import the packages that the tool depends on.

In [1]:
# import the necessary packages
import os
import sys
import logging
import traceback

# pandas: tools for data processing
import pandas as pd

# spaCy and NLTK: natural language processing tools for working with language/text data
import spacy
import nltk
nltk.download('punkt')
from nltk import Tree
from nltk.tokenize import sent_tokenize

# import the quote extractor tool
from quote_extractor import extract_quotes, get_rawtext_files
from config import config
import utils

# initiate the app_logger
app_logger = utils.create_logger('quote_extractor', log_dir='logs', logger_level=logging.INFO, 
                                 file_log_level=logging.INFO)

[nltk_data] Downloading package punkt to /Users/sjuf9909/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# load spaCy's en_core_web_lg, the pre-trained English language model (it must already be installed)
print('Loading spaCy language model...')
nlp = spacy.load('en_core_web_lg')
print('Finished loading.')

Loading spaCy language model...
Finished loading.


## 2. Load the data
This notebook lets you extract quotes directly from one or more text files. Alternatively, you can extract quotes from a text column in an Excel spreadsheet.

### 2.1. From a text file
To extract quotes directly from text files, store all your text files (.txt) in a folder on your computer; the example below uses the 'input' folder. The code below reads those files and puts the text into a pandas dataframe (in table format) for further processing.

In [3]:
def nlp_preprocess(nlp, text):
    # pre-process the text
    text = sent_tokenize(text)
    text = ' '.join(text)
    text = utils.preprocess_text(text)
    
    # apply the spaCy's tool to the text
    doc = nlp(text)
    
    return doc

In [4]:
# specify the file path to the folder you use to store your text files
file_path = './input/'

# create an empty list for a placeholder to store all the texts
all_files = []

# search for text files (.txt) inside the folder and extract all the texts
for input_file in get_rawtext_files(file_path):
    text_dict = {}
    
    # use the text file name as the doc_id
    doc_id = input_file.replace('.txt', '')
    
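    try:
        # read the text file (the 'with' block closes the file automatically)
        with open(os.path.join(file_path, input_file), 'r') as f:
            doc_lines = f.readlines()
        # note: readlines() keeps each line's trailing newline, so this join doubles the line breaks
        doc_lines = '\n'.join(doc_lines)
        
        # store them inside a dictionary
        text_dict['text_id'] = doc_id
        text_dict['text'] = doc_lines
        all_files.append(text_dict)
            
    except Exception:
        # this will provide some information in the case of an error
        app_logger.exception("Failed to read {}".format(input_file))
        traceback.print_exc()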

# convert the extracted texts into a pandas dataframe for further processing
text_df = pd.DataFrame.from_dict(all_files)
text_df['spacy_text'] = text_df['text'].apply(lambda text: nlp_preprocess(nlp, text))
text_df.set_index('text_id', inplace=True)
text_df.head()

text_id,text,spacy_text
test1,"Facebook and Instagram, which Facebook owns, f...","(Facebook, and, Instagram, ,, which, Facebook,..."
test2,(CBC News)\n\nRepublican lawmakers and previou...,"((, CBC, News, ), ., \n , ., \n , Republican, ..."
test3,Federated States of Micronesia President David...,"(Federated, States, of, Micronesia, President,..."
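
The `\n\n` in the `test2` preview above comes from how the file is read: `readlines()` keeps each line's trailing newline, so joining the lines with `'\n'` doubles every line break. The sketch below compares this with reading the file in one go via `read()`. It uses a temporary file with made-up content and is only an illustration, not part of the Quote Extractor itself.

```python
import os
import tempfile

# hypothetical two-line file, written only for this illustration
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('(Source)\nFirst sentence.\n')

    # readlines() + join: each line keeps its own '\n', so line breaks are doubled
    with open(path, 'r', encoding='utf-8') as f:
        joined = '\n'.join(f.readlines())

    # alternative: read the whole file as a single string
    with open(path, 'r', encoding='utf-8') as f:
        whole = f.read()

print(repr(joined))
print(repr(whole))
```

Either approach works for quote extraction, but the character positions reported later refer to whichever version of the text was actually processed.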


### 2.2. From an Excel spreadsheet
If you have already stored your texts in an Excel spreadsheet, you can use the code below to load it. The spreadsheet needs a *text_id* column and a *text* column.

In [5]:
# enter the file path and the file name of the excel spreadsheet containing the text
file_path = './input/'
file_name = 'text_files.xlsx'

# read the pandas dataframe containing the list of texts
text_df = pd.read_excel(file_path + file_name)
text_df['spacy_text'] = text_df['text'].apply(lambda text: nlp_preprocess(nlp, text))
text_df.set_index('text_id', inplace=True)
text_df.head()

text_id,text,spacy_text
text1,"Facebook and Instagram, which Facebook owns, f...","(Facebook, and, Instagram, ,, which, Facebook,..."
text2,(CBC News)\nRepublican lawmakers and previous ...,"((, CBC, News, ), ., \n , Republican, lawmaker..."
text3,Federated States of Micronesia President David...,"(Federated, States, of, Micronesia, President,..."


## 3. Extract the quotes
Once your texts have been stored in a pandas dataframe, we can begin to extract the quotes from the texts.

In [6]:
# specify the column name containing the spacy text
text_col_name = 'spacy_text'

# specify whether you wish to create a parse tree for the quotes 
# you also need to specify the output file path if 'True'
write_quote_trees_in_file = True 
tree_dir = './output/trees/'

# create an empty list to store all detected quotes
all_quotes = []

# go through all the texts and start extracting quotes
for row in text_df.itertuples():
    doc_id = row.Index
    doc = row.spacy_text
    
    try:        
        # extract the quotes
        quotes = extract_quotes(doc_id=doc_id, doc=doc, 
                                write_tree=write_quote_trees_in_file, 
                                tree_dir=tree_dir)
        
        speaks, qts = [quote['speaker'] for quote in quotes], [quote['quote'] for quote in quotes]        
        speak_ents = [[(str(ent), ent.label_) for ent in doc.ents if str(ent) in speak] for speak in speaks]
        quote_ents = [[(str(ent), ent.label_) for ent in doc.ents if str(ent) in qt] for qt in qts]
        
        # add quote_id to each quote
        for n, quote in enumerate(quotes):
            quote['text_id'] = doc_id
            quote['quote_id'] = str(n)
            quote['speaker_entities'] = list(set(speak_ents[n]))
            quote['quote_entities'] = list(set(quote_ents[n]))
        
        # store them in all_quotes
        all_quotes.extend(quotes)
            
    except Exception:
        # this will provide some information in the case of an error
        app_logger.exception("Failed to extract quotes from {}".format(doc_id))
        traceback.print_exc()
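
The entity lists in the loop above are built with a plain substring test: an entity is attached to a speaker (or quote) whenever the entity's text occurs anywhere inside the speaker (or quote) string. Below is a minimal sketch of that matching, with made-up `(text, label)` pairs standing in for spaCy's `doc.ents`.

```python
# hypothetical (text, label) pairs standing in for doc.ents
doc_ents = [('Grygiel', 'PERSON'), ('Syracuse University', 'ORG'),
            ('Grygiel', 'PERSON'), ('Wednesday', 'DATE')]

# hypothetical speaker strings, one per extracted quote
speakers = ['Jennifer Grygiel, a Syracuse University professor', 'experts']

# keep every entity whose text appears inside the speaker string
speaker_ents = [[ent for ent in doc_ents if ent[0] in speaker] for speaker in speakers]

# set() removes duplicates, as in the loop above;
# sorted() is added here only to make the printed order repeatable
speaker_ents = [sorted(set(ents)) for ents in speaker_ents]
print(speaker_ents)
```

Because the match is on text alone, an entity that occurs elsewhere in the document under a different label is attached as well. This is likely why a single quote in the preview table in section 4 lists both `(Capitol, ORG)` and `(Capitol, FAC)`.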

## 4. Display the quotes
Once you have extracted the quotes, we will store them in a pandas dataframe for further analysis.

In [7]:
# convert the outcome into a pandas dataframe
quotes_df = pd.DataFrame.from_dict(all_quotes)

# convert the string format quote spans in the index columns to a tuple of integers
# (ast.literal_eval only parses Python literals, so it is safer than eval)
from ast import literal_eval

for column in quotes_df.columns:
    if column.endswith('_index'):
        quotes_df[column] = quotes_df[column].replace('', '(0,0)')
        quotes_df[column] = quotes_df[column].apply(literal_eval)

# re-arrange the columns
new_index = ['text_id', 'quote_id', 'quote', 'quote_index', 'quote_entities', 
             'speaker', 'speaker_index', 'speaker_entities',
             'verb', 'verb_index', 'quote_token_count', 'quote_type', 'is_floating_quote']
quotes_df = quotes_df.reindex(columns=new_index)
      
# preview the quotes dataframe
quotes_df.head()

,text_id,quote_id,quote,quote_index,quote_entities,speaker,speaker_index,speaker_entities,verb,verb_index,quote_token_count,quote_type,is_floating_quote
0,text1,0,"""We didn't just see a breach at the Capitol. S...","(1052, 1238)","[(Capitol, ORG), (Capitol, FAC), (the United S...",Grygiel,"(1239, 1246)","[(Grygiel, PERSON)]",said,"(1247, 1251)",38,Heuristic,False
1,text1,1,"""Social media is complicit in this because he ...","(1492, 1691)","[(years, DATE), (the United States, GPE)]",,"(0, 0)",[],caused,"(1705, 1711)",39,Heuristic,False
2,text1,2,that Trump wouldn't be able to post for 24 hou...,"(84, 173)","[(Trump, ORG), (Trump, WORK_OF_ART), (24 hours...","Facebook and Instagram, which Facebook owns,","(0, 44)","[(Facebook, ORG), (Instagram, ORG)]",announcing,"(73, 83)",17,S V C,False
3,text1,3,that these actions follow years of hemming and...,"(302, 489)","[(Wednesday, DATE), (Trump, ORG), (Trump, WORK...",experts,"(288, 295)",[],noted,"(296, 301)",26,S V C,False
4,text1,4,"what happened in Washington, D.C., on Wednesda...","(592, 813)","[(Wednesday, DATE), (Trump, ORG), (Trump, WORK...","Jennifer Grygiel, a Syracuse University commun...","(491, 586)","[(Syracuse University, ORG), (Grygiel, PERSON)...",said,"(587, 591)",38,S V C,False
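
The index columns are converted from strings such as `'(1052, 1238)'` into tuples, with an empty string (no span found) mapped to `(0, 0)`. Here is a small sketch of that conversion for a single value, using `ast.literal_eval`, which only parses Python literals and does not execute arbitrary code. The helper name `parse_span` is made up for this example.

```python
from ast import literal_eval

def parse_span(value):
    """Convert a string such as '(1052, 1238)' into a tuple of integers.

    An empty string (no span found) becomes (0, 0).
    """
    if value == '':
        return (0, 0)
    return tuple(literal_eval(value))

print(parse_span('(1052, 1238)'))
print(parse_span(''))
```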


In general, quotes are extracted using either syntactic rules or heuristic (custom) rules, and the *quote_type* column shows which kind of rule matched. A quote can stand alone in a sentence, or be followed by another quote (floating quote) in the same sentence.   
**Note:** in *quote_type*, the letters stand for *Q (Quotation mark), S (Speaker), V (Verb), C (Content)*
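
The `*_index` columns appear to hold `(start, end)` character offsets into the pre-processed text, with the end position exclusive. In the preview above, for example, the speaker `Grygiel` spans `(1239, 1246)`, which is exactly seven characters. The sketch below illustrates that convention on a made-up sentence; the helper `span_of` and the example row are hypothetical.

```python
# hypothetical sentence, used only to illustrate the (start, end) convention
text = 'The plan is risky, experts noted.'

def span_of(fragment):
    """Return the (start, end) character offsets of fragment in text; end is exclusive."""
    start = text.index(fragment)
    return (start, start + len(fragment))

# a made-up row laid out like one row of quotes_df
row = {'quote_index': span_of('The plan is risky'),
       'speaker_index': span_of('experts'),
       'verb_index': span_of('noted')}

# slicing the text with each span recovers the original fragment
for column, (start, end) in row.items():
    print(column, (start, end), repr(text[start:end]))
```

With the real data, the text to slice would be `doc.text` from the *spacy_text* column, since the offsets refer to the pre-processed text rather than the raw file.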

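We can preview the quotes using spaCy's visualisation tool, displaCy. The first cell below lists the named entities that spaCy found in one text, with character offsets measured from the start of each entity's sentence. The `show_quotes` function that follows displays the quotes and speakers for the text_id you specify.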

In [49]:
doc = text_df.loc['text3', 'spacy_text']
for ent in doc.ents:
    print(ent.text, 
          ent.start_char-ent.sent.start_char, 
          ent.end_char-ent.sent.start_char, 
          ent.label_)

Federated States 0 16 GPE
Micronesia 20 30 LOC
David Panuelo 41 54 PERSON
Pacific 142 149 LOC
Samoa 157 162 GPE
Fiame Naomi Mata'afa 181 201 PERSON
Pacific 0 7 ORG
Anna Powles 15 26 PERSON
Massey University 32 49 ORG
ABC 59 62 ORG
China 28 33 GPE
PIF 69 72 ORG
China 88 93 GPE
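
The offsets printed above restart at 0 whenever a new sentence begins (for example `Pacific 0 7`), because each entity's position is measured from the start of its own sentence. The sketch below reproduces that arithmetic in plain Python. A made-up text and two simple regular expressions stand in for spaCy's sentence and entity detection.

```python
import re

# hypothetical two-sentence text
text = 'Samoa hosted the talks. Pacific leaders met in Samoa.'

# naive stand-ins for spaCy's sentence and entity detection
sentences = [m.span() for m in re.finditer(r'[^. ][^.]*\.', text)]
entities = [(m.group(), m.start(), m.end()) for m in re.finditer(r'Samoa|Pacific', text)]

rows = []
for ent_text, start, end in entities:
    # find the sentence that contains the entity
    sent_start = next(s for s, e in sentences if s <= start < e)
    # offsets relative to the sentence start
    rows.append((ent_text, start - sent_start, end - sent_start))

for row in rows:
    print(*row)
```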


In [96]:
from spacy import displacy
from spacy.tokens import Span

# function to display the quotes and speakers in the text
def show_quotes(text_id, save_to_html=True, out_dir='./output/'):
    doc = text_df.loc[text_id, 'spacy_text']
    
    # create a mapping dataframe between the character index and token index from the spacy text.
    loc2tok_df = pd.DataFrame([(t.idx, t.i) for t in doc], columns = ['loc', 'token'])

    # get the quotes and speakers indexes
    locs = {
        'QUOTE': quotes_df[quotes_df['text_id']==text_id]['quote_index'].tolist(),
        'SPEAKER': set(quotes_df[quotes_df['text_id']==text_id]['speaker_index'].tolist())
    }

    # build the displaCy spans for named entities, quotes and speakers
    spans = []
    
    for ent in doc.ents:
        spans.insert(0, Span(doc, ent.start, ent.end, ent.label_))
    
    for key in locs.keys():
        for loc in locs[key]:
            if loc!=(0,0):
                # find all token indices that fall within the given character span (variable loc)
                selTokens = loc2tok_df.loc[(loc[0]<=loc2tok_df['loc']) & (loc2tok_df['loc']<loc[1]), 'token'].tolist()
                if not selTokens:
                    # no token starts inside this span, so there is nothing to highlight
                    continue
                start_token, end_token = selTokens[0], selTokens[-1] 
                spans.insert(0, Span(doc, start_token, end_token+1, key))

    # attach the spans to the doc under the "sc" key, which displaCy reads by default
    doc.spans["sc"] = spans

    # formatting options
    colors = {'QUOTE': '#7aecec', 'SPEAKER': '#bfeeb7'}
    options = {'ents': ['QUOTE', 'SPEAKER'], 
               'colors': colors, 'top_offset': 31,
               'span_label_offset': 15,
               'top_offset_step':15}

    # option to save the preview as an html document
    if save_to_html:
        html = displacy.render(doc, style='span', options=options, jupyter=False, page=True)
        #html = displacy.render(doc, style='span', jupyter=False, page=True)
        
        # save the quote preview into an html file (the 'with' block closes the file automatically)
        with open(os.path.join(out_dir, text_id+'.html'), 'w') as file:
            file.write(html)
    
    # display the preview in this notebook
    displacy.render(doc, style='span', options=options, jupyter=True)
    #displacy.render(doc, style='span', jupyter=True)

By default, this function also saves the quote preview as an HTML file (with the text_id as the file name) inside the output directory. You can turn this option off by setting the *'save_to_html'* parameter to **False**.

In [97]:
# specify the text_id for quote display
text_id = 'text3'

# display the quotes and the speakers in the text
show_quotes(text_id)
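
Inside `show_quotes`, each quote or speaker is given as a character span and has to be turned into a token span before displaCy can highlight it. The sketch below shows the idea in plain Python. Whitespace-separated words stand in for spaCy tokens, and the helper name `char_span_to_token_span` is made up for this example.

```python
import re

# hypothetical sentence; whitespace-separated words stand in for spaCy tokens
text = 'Experts noted that the plan is risky.'

# (character offset, token index) for every token, like loc2tok_df in show_quotes
loc2tok = [(m.start(), i) for i, m in enumerate(re.finditer(r'\S+', text))]

def char_span_to_token_span(loc):
    """Return (start_token, end_token) for the tokens that start inside loc.

    end_token is exclusive; None is returned when no token starts inside loc.
    """
    tokens = [tok for start, tok in loc2tok if loc[0] <= start < loc[1]]
    if not tokens:
        return None
    return tokens[0], tokens[-1] + 1

print(char_span_to_token_span((0, 7)))    # the speaker 'Experts'
print(char_span_to_token_span((14, 36)))  # the content 'that the plan is risky'
```

The end token is exclusive, which matches how spaCy's `Span(doc, start, end, label)` expects its boundaries.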

In [14]:
displacy.render(doc, style='ent', jupyter=True)

## 5. Save your quotes
Finally, you can save the quotes dataframe to an Excel spreadsheet and download it to your local computer.

In [11]:
# save quotes_df into an Excel spreadsheet
quotes_df.to_excel('./output/quotes.xlsx', index=False)
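
Saving will fail if the `./output/` folder does not exist yet, since neither `to_excel` nor Python's `open` creates missing folders. Whether the Quote Extractor creates `./output/trees/` on its own is not something this notebook shows, so creating both folders up front is a cautious choice. Here is a minimal sketch, run inside a temporary folder so it does not touch your files.

```python
import os
import tempfile

# a temporary folder stands in for the notebook's working directory in this sketch
with tempfile.TemporaryDirectory() as work_dir:
    out_dir = os.path.join(work_dir, 'output')
    tree_dir = os.path.join(out_dir, 'trees')

    # create both folders in one call; exist_ok=True makes the call safe to repeat
    os.makedirs(tree_dir, exist_ok=True)
    created = os.path.isdir(out_dir) and os.path.isdir(tree_dir)

print(created)
```

In this notebook, the equivalent would be a single call to `os.makedirs('./output/trees/', exist_ok=True)` before running section 3.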