# Quote Extractor
In this notebook, we will use the *Quote Extractor* tool to extract quotes from a list of texts. In addition to extracting the quotes, the tool identifies the speaker of each quote and reports where the quote and the speaker are located in the text.  

**Note:** This code has been adapted from the [GenderGapTracker](https://github.com/sfu-discourse-lab/GenderGapTracker/tree/master/NLP/main) GitHub page and modified to run on a Jupyter Notebook.

## 1. Setup
Before we begin, we need to import the necessary tools and packages for our tool to run.

In [1]:
# import the necessary packages
import os
import io
import sys
import codecs
import logging
import traceback
from collections import Counter

# matplotlib: visualization tool
from matplotlib import pyplot as plt

# pandas: tools for data processing
import pandas as pd

# ipywidgets: tools for interactive browser controls in Jupyter notebooks
import ipywidgets as widgets
from ipywidgets import Button, Layout
from IPython.display import display, Markdown, clear_output

# spaCy and NLTK: natural language processing tools for working with language/text data
import spacy
import nltk
nltk.download('punkt')
from nltk import Tree
from nltk.tokenize import sent_tokenize

# import the quote extractor tool
from quote_extractor import extract_quotes, get_rawtext_files
from config import config
import utils

# initiate the app_logger
app_logger = utils.create_logger('quote_extractor', log_dir='logs', logger_level=logging.INFO, 
                                 file_log_level=logging.INFO)

[nltk_data] Downloading package punkt to /Users/sjuf9909/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# load spaCy's en_core_web_lg, the pre-trained English language model (it must already be installed)
print('Loading spaCy language model...')
nlp = spacy.load('en_core_web_lg')
print('Finished loading.')

Loading spaCy language model...
Finished loading.


## 2. Load the data
This notebook allows you to extract quotes directly from one or more text files. Alternatively, you can extract quotes from a text column inside an Excel spreadsheet.

### 2.1. From a text file
In order to extract quotes directly from a text file, please upload all your text files (.txt) below. Using the below code, we will access those files and extract the text into a pandas dataframe (in table format) for further processing.

In [3]:
# widget to upload .txt files
print('Upload your .txt files here:')
uploader = widgets.FileUpload(
    accept='.txt', # accepted file extension 
    multiple=True  # True to accept multiple files
)
display(uploader)

# give notification when file is uploaded
def _cb(change):
    clear_output()
    print('File uploaded!')
    
uploader.observe(_cb, names='data')

File uploaded!


In [4]:
# function to pre-process text
def nlp_preprocess(nlp, text):
    # pre-process the text
    text = sent_tokenize(text)
    text = ' '.join(text)
    text = utils.preprocess_text(text)
    
    # apply the spaCy language model to the text
    doc = nlp(text)
    
    return doc

In [5]:
# create an empty list to store all the texts
all_files = []

# go through the uploaded text files (.txt) and extract their texts
for input_file in uploader.value.keys():
    text_dict = {}
    
    # use the text file name as the doc_id
    doc_id = input_file.replace('.txt', '')
    
    try:
        # read the text file
        doc_lines = codecs.decode(uploader.value[input_file]['content'], encoding='utf-8')
        
        # store them inside a dictionary
        text_dict['text_id'] = doc_id
        text_dict['text'] = doc_lines
        all_files.append(text_dict)
            
    except Exception:
        # log the error and print the traceback, then continue with the next file
        app_logger.exception('Failed to read %s', input_file)
        traceback.print_exc()

# convert the extracted texts into a pandas dataframe for further processing
text_df = pd.DataFrame.from_dict(all_files)
text_df['spacy_text'] = text_df['text'].apply(lambda text: nlp_preprocess(nlp, text))
text_df.set_index('text_id', inplace=True)
text_df.head()

| text_id | text | spacy_text |
|---|---|---|
| text1 | Facebook and Instagram, which Facebook owns, f... | (Facebook, and, Instagram, ,, which, Facebook,... |
| text2 | (CBC News)\nRepublican lawmakers and previous ... | ((, CBC, News, ), ., \n , Republican, lawmaker... |
| text3 | Federated States of Micronesia President David... | (Federated, States, of, Micronesia, President,... |
| text4 | Chinese state media has launched its strongest... | (Chinese, state, media, has, launched, its, st... |
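
If the texts already sit in a local folder rather than being uploaded through the widget, the same `text_id`/`text` records could be built along the following lines. This is only a sketch: `load_text_records` is a hypothetical helper, not part of the Quote Extractor, and the two sample files are made up for the demonstration.

```python
import tempfile
from pathlib import Path

def load_text_records(folder):
    """Return one {'text_id', 'text'} record per .txt file in `folder` (hypothetical helper)."""
    records = []
    for path in sorted(Path(folder).glob('*.txt')):
        records.append({
            'text_id': path.stem,                      # file name without the .txt extension
            'text': path.read_text(encoding='utf-8'),  # file content as a single string
        })
    return records

# demonstration with two made-up files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'text1.txt').write_text('She said, "Hello."', encoding='utf-8')
    Path(tmp, 'text2.txt').write_text('"No," he replied.', encoding='utf-8')
    records = load_text_records(tmp)

print([record['text_id'] for record in records])
```

The resulting list has the same shape as `all_files` above, so it can be turned into a dataframe in the same way.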


### 2.2. From an Excel spreadsheet
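If you have already stored your texts in an Excel spreadsheet, you can use the code below to access your spreadsheet. The spreadsheet must contain a `text_id` column and a `text` column, because the code below reads both.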

In [6]:
# widget to upload .xlsx file
print('Upload your excel spreadsheet here:')
uploader = widgets.FileUpload(
    accept='.xlsx', # accepted file extension
    multiple=False  # to accept one Excel file only
)
display(uploader)

# give notification when file is uploaded
uploader.observe(_cb, names='data')

File uploaded!


In [7]:
# read the Excel spreadsheet containing the texts into a pandas dataframe
text_df = pd.read_excel(io.BytesIO(uploader.data[0]))
text_df['spacy_text'] = text_df['text'].apply(lambda text: nlp_preprocess(nlp, text))
text_df.set_index('text_id', inplace=True)
text_df.head()

| text_id | text | spacy_text |
|---|---|---|
| text1 | Facebook and Instagram, which Facebook owns, f... | (Facebook, and, Instagram, ,, which, Facebook,... |
| text2 | (CBC News)\nRepublican lawmakers and previous ... | ((, CBC, News, ), ., \n , Republican, lawmaker... |
| text3 | Federated States of Micronesia President David... | (Federated, States, of, Micronesia, President,... |
| text4 | Chinese state media has launched its strongest... | (Chinese, state, media, has, launched, its, st... |


## 3. Extract the quotes
Once your texts have been stored in a pandas dataframe, we can begin to extract the quotes from the texts.

In [8]:
# specify the column name containing the spacy text
text_col_name = 'spacy_text'

# specify whether you wish to create a parse tree for the quotes 
write_quote_trees_in_file = False 

# create the output folders; parse trees are only written to tree_dir if the option above is 'True'
os.makedirs('output', exist_ok=True)
os.makedirs('output/trees', exist_ok=True)
tree_dir = './output/trees/'

# create an empty list to store all detected quotes
all_quotes = []

# named entity labels to keep when tagging speakers and quotes
inc_ent = ['ORG','PERSON','GPE','NORP','FAC','LOC']

# go through all the texts and start extracting quotes
for row in text_df.itertuples():
    doc_id = row.Index
    doc = row.spacy_text
    
    try:        
        # extract the quotes
        quotes = extract_quotes(doc_id=doc_id, doc=doc, 
                                write_tree=write_quote_trees_in_file, 
                                tree_dir=tree_dir)
        
        # collect the speaker and quote strings
        speaks = [quote['speaker'] for quote in quotes]
        qts = [quote['quote'] for quote in quotes]

        # find the named entities whose text appears in each speaker and in each quote
        speak_ents = [[(str(ent), ent.label_) for ent in doc.ents 
                       if (str(ent) in speak) and (ent.label_ in inc_ent)] for speak in speaks]
        quote_ents = [[(str(ent), ent.label_) for ent in doc.ents 
                       if (str(ent) in qt) and (ent.label_ in inc_ent)] for qt in qts]

        # add quote_id and named entities to each quote
        for n, quote in enumerate(quotes):
            quote['text_id'] = doc_id
            quote['quote_id'] = str(n)
            quote['speaker_entities'] = list(set(speak_ents[n]))
            quote['quote_entities'] = list(set(quote_ents[n]))
            
        # store them in all_quotes
        all_quotes.extend(quotes)
            
    except Exception:
        # log the error and print the traceback, then continue with the next text
        app_logger.exception('Failed to extract quotes from %s', doc_id)
        traceback.print_exc()
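
The entity tagging in the loop above keeps every named entity of the document whose text occurs inside the speaker (or quote) string and whose label is in `inc_ent`. A minimal sketch of that matching step, without spaCy, is shown here. The entity list, the speaker string and the quote string are made-up stand-ins, and `entities_in` is a hypothetical helper name.

```python
# labels kept by the notebook (same list as inc_ent above)
inc_ent = ['ORG', 'PERSON', 'GPE', 'NORP', 'FAC', 'LOC']

# hypothetical stand-ins for doc.ents: (entity text, entity label)
doc_entities = [
    ('Jane Doe', 'PERSON'),
    ('Example University', 'ORG'),
    ('Wednesday', 'DATE'),
    ('Trump', 'PERSON'),
    ('Trump', 'ORG'),
]

def entities_in(fragment, entities, allowed_labels):
    """Return the unique (text, label) pairs whose text occurs in `fragment`."""
    found = {
        (text, label)
        for text, label in entities
        if text in fragment and label in allowed_labels
    }
    return sorted(found)

# made-up speaker and quote strings
speaker = 'Jane Doe, a professor at Example University'
quote = 'that Trump would return on Wednesday'

speaker_entities = entities_in(speaker, doc_entities, inc_ent)
quote_entities = entities_in(quote, doc_entities, inc_ent)
print(speaker_entities)
print(quote_entities)
```

Because the match is made on the entity text across the whole document, the same string can be returned with more than one label, which is why pairs such as `(Trump, PERSON)` and `(Trump, ORG)` can appear side by side in the results.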

## 4. Display the quotes
Once you have extracted the quotes, we will store them in a pandas dataframe for further analysis.

In [9]:
# convert the outcome into a pandas dataframe
quotes_df = pd.DataFrame.from_dict(all_quotes)

# convert the string format quote spans in the index columns to a tuple of integers
# (literal_eval only parses Python literals, so it is safer than eval)
from ast import literal_eval
for column in quotes_df.columns:
    if column.endswith('_index'):
        quotes_df[column] = quotes_df[column].replace('', '(0,0)').apply(literal_eval)

# re-arrange the columns
new_index = ['text_id', 'quote_id', 'quote', 'quote_index', 'quote_entities', 
             'speaker', 'speaker_index', 'speaker_entities',
             'verb', 'verb_index', 'quote_token_count', 'quote_type', 'is_floating_quote']
quotes_df = quotes_df.reindex(columns=new_index)
      
# preview the quotes dataframe
quotes_df.head()

| | text_id | quote_id | quote | quote_index | quote_entities | speaker | speaker_index | speaker_entities | verb | verb_index | quote_token_count | quote_type | is_floating_quote |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | text1 | 0 | "We didn't just see a breach at the Capitol. S... | (1052, 1238) | [(Capitol, FAC), (Capitol, ORG), (the United S... | Grygiel | (1239, 1246) | [(Grygiel, PERSON)] | said | (1247, 1251) | 38 | Heuristic | False |
| 1 | text1 | 1 | "Social media is complicit in this because he ... | (1492, 1691) | [(the United States, GPE)] | | (0, 0) | [] | caused | (1705, 1711) | 39 | Heuristic | False |
| 2 | text1 | 2 | that Trump wouldn't be able to post for 24 hou... | (84, 173) | [(Trump, PERSON), (Trump, ORG)] | Facebook and Instagram, which Facebook owns, | (0, 44) | [(Instagram, ORG), (Facebook, ORG)] | announcing | (73, 83) | 17 | S V C | False |
| 3 | text1 | 3 | that these actions follow years of hemming and... | (302, 489) | [(Trump, PERSON), (Trump, ORG)] | experts | (288, 295) | [] | noted | (296, 301) | 26 | S V C | False |
| 4 | text1 | 4 | what happened in Washington, D.C., on Wednesda... | (592, 813) | [(D.C., GPE), (Trump, PERSON), (Trump, ORG), (... | Jennifer Grygiel, a Syracuse University commun... | (491, 586) | [(Syracuse University, ORG), (Jennifer Grygiel... | said | (587, 591) | 38 | S V C | False |
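
The `*_index` columns hold character offsets as (start, end) pairs, with the end position excluded, so a span can be used directly as a string slice on the processed text. The sketch below illustrates the conversion from the stored string to a tuple and the slicing. The sentence and the record are made up, and `parse_span` is a hypothetical helper that mirrors the conversion step in the cell above.

```python
import ast

def parse_span(value):
    """Turn a span stored as text, e.g. '(84, 173)', into a tuple of integers.

    Empty strings are mapped to (0, 0), the placeholder for a missing span.
    """
    return ast.literal_eval(value) if value else (0, 0)

# made-up text and quote record, with the span stored as a string
text = 'The mayor said that the bridge would reopen on Monday.'
quote = 'that the bridge would reopen on Monday'
start = text.index(quote)
record = {'quote_index': str((start, start + len(quote))), 'speaker_index': ''}

# parse the stored span and use it to slice the quote out of the text
quote_start, quote_end = parse_span(record['quote_index'])
print(text[quote_start:quote_end])

# a missing span becomes the (0, 0) placeholder
print(parse_span(record['speaker_index']))
```

Note that the offsets refer to the pre-processed text held in the `spacy_text` column, which may differ slightly from the raw input text.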


In general, the quotes are extracted either based on syntactic rules or heuristic (custom) rules. Some quotes can be stand-alone in a sentence, or followed by another quote (floating quote) in the same sentence.   

**Quotation symbols:** *Q (Quotation mark), S (Speaker), V (Verb), C (Content)*  

**Named Entities:**  *PERSON (People, including fictional), NORP (Nationalities or religious or political groups), FAC (Buildings, airports, highways, bridges, etc.), ORG (Companies, agencies, institutions, etc.), GPE (Countries, cities, states), LOC (Non-GPE locations, mountain ranges, bodies of water)*
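
As a quick check on the extraction results, the quote types can be tallied and quotes without a detected speaker can be listed. The sketch below works on a plain list of records shaped like `all_quotes`, using a few values copied from the preview above. The type `S V C` is read here as speaker, then verb, then content, going by the legend above.

```python
from collections import Counter

# a few records shaped like the entries of all_quotes (values taken from the preview)
quotes = [
    {'quote_id': '0', 'speaker': 'Grygiel', 'verb': 'said', 'quote_type': 'Heuristic'},
    {'quote_id': '1', 'speaker': '', 'verb': 'caused', 'quote_type': 'Heuristic'},
    {'quote_id': '2', 'speaker': 'Facebook and Instagram, which Facebook owns,',
     'verb': 'announcing', 'quote_type': 'S V C'},
    {'quote_id': '3', 'speaker': 'experts', 'verb': 'noted', 'quote_type': 'S V C'},
]

# how many quotes were found by each rule type
type_counts = Counter(quote['quote_type'] for quote in quotes)
print(type_counts)

# quotes for which no speaker was detected
no_speaker = [quote['quote_id'] for quote in quotes if not quote['speaker']]
print(no_speaker)
```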

We can show a preview of the quotes using spaCy's visualisation tool, displaCy. All you need to do is run the below function and specify the text_id you wish to analyse.

In [10]:
from spacy import displacy
from spacy.tokens import Span

# function to display the quotes and speakers in the text
def show_quotes(text_id, show_what, save_to_html=False, out_dir='./output/'):
    doc = text_df.loc[text_id, 'spacy_text']
    entities = quotes_df['quote_entities']
    
    # create a mapping dataframe between the character index and token index from the spacy text.
    loc2tok_df = pd.DataFrame([(t.idx, t.i) for t in doc], columns = ['loc', 'token'])

    # get the quotes and speakers indexes
    locs = {
        'QUOTE': quotes_df[quotes_df['text_id']==text_id]['quote_index'].tolist(),
        'SPEAKER': set(quotes_df[quotes_df['text_id']==text_id]['speaker_index'].tolist())
    }

    # collect the displaCy spans for the quotes, speakers and named entities
    spans = []
    
    for key in locs.keys():
        for loc in locs[key]:
            if loc!=(0,0):
                # find all token indices that fall within the given span (variable loc)
                selTokens = loc2tok_df.loc[(loc[0]<=loc2tok_df['loc']) & (loc2tok_df['loc']<loc[1]), 'token'].tolist()
                if 'NAMED ENTITIES' in show_what:
                    for ent in doc.ents:
                        if (ent.start in selTokens) and (ent.label_ in inc_ent):
                            spans.insert(0, Span(doc, ent.start, ent.end, ent.label_))
                if key in show_what:
                    start_token, end_token = selTokens[0], selTokens[-1] 
                    spans.insert(0, Span(doc, start_token, end_token+1, key))

    # formatting options
    TPL_SPAN = '''
    <span style="font-weight: bold; display: inline-block; position: relative; 
    line-height: 55px">
        {text}
        {span_slices}
        {span_starts}
    </span>
    '''
    
    TPL_SPAN_SLICE = '''
    <span style="background: {bg}; top: {top_offset}px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;">
    </span>
    '''
    
    TPL_SPAN_START = '''
    <span style="background: {bg}; top: {top_offset}px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; width: calc(100% + 2px); position: absolute;">
        <span style="background: {bg}; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px">
            {label}{kb_link}
        </span>
    </span>
    '''
    
    colors = {'QUOTE': '#66ccff', 'SPEAKER': '#66ff99'}
    options = {'ents': ['QUOTE', 'SPEAKER'], 
               'colors': colors, 
               'top_offset': 42,
               'template': {'span':TPL_SPAN,
                           'slice':TPL_SPAN_SLICE,
                           'start':TPL_SPAN_START},
               'span_label_offset': 14,
               'top_offset_step':14}

    # attach the spans to the doc under the "sc" spans key used by displaCy
    doc.spans["sc"] = spans

    # option to save the preview as an html document
    if save_to_html:
        html = displacy.render(doc, style='span', options=options, jupyter=False, page=True)
        
        # save the quote preview into an html file
        with open(out_dir+text_id+'.html', 'w') as file:
            file.write(html)
    
    # display the preview in this notebook
    displacy.render(doc, style='span', options=options, jupyter=True)
    
# function to display top entities
def common_ent(text_id, which_ent='speaker_entities',top_n=5):
    # get the most common entities
    most_ent = quotes_df[quotes_df['text_id']==text_id][which_ent].tolist()
    most_ent = list(filter(None,most_ent))
    most_ent = [ent for most in most_ent for ent in most]
    most_ent = Counter([ent_name for ent_name, ent_label in most_ent])
    top_ent = dict(most_ent.most_common()[:top_n])
    
    # visualize
    plt.rcParams["figure.figsize"] = [10, 3]
    plt.bar(top_ent.keys(), top_ent.values())
    plt.title('Top {} {} in {}'.format(min(top_n,len(top_ent.keys())),which_ent,text_id))
    plt.show()

By default, the *show_quotes* function only displays the preview in this notebook. To also save the quote preview as an html file (with the text_id as the file name) inside the output directory, set the *'save_to_html'* parameter to **True**; the 'Save preview' button below does this for you.
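
To highlight a quote or speaker, *show_quotes* has to translate the stored character offsets into token positions, because displaCy spans are defined over tokens. The idea can be sketched without spaCy as follows. The sentence is made up, the regular-expression tokenizer is only a rough stand-in for spaCy's tokens, and `tokens_in_span` is a hypothetical helper name.

```python
import re

text = 'Experts noted that the policy changed.'

# stand-in for spaCy tokens: (character offset of the token start, token position, token text)
tokens = [(match.start(), position, match.group())
          for position, match in enumerate(re.finditer(r'\w+|[^\w\s]', text))]

def tokens_in_span(tokens, span):
    """Return the positions of all tokens that start inside the character span."""
    start, end = span
    return [position for offset, position, _text in tokens if start <= offset < end]

# character span of the quote, computed here rather than typed in by hand
quote = 'that the policy changed'
quote_start = text.index(quote)
quote_span = (quote_start, quote_start + len(quote))

# first and last token of the quote; the end is made exclusive, as for a spaCy Span
selected = tokens_in_span(tokens, quote_span)
start_token, end_token = selected[0], selected[-1] + 1
print(selected)
print([token_text for _offset, _position, token_text in tokens[start_token:end_token]])
```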

In [11]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

<IPython.core.display.Javascript object>

In [12]:
# widget for entering text_id
text = widgets.Text(
    value='text3',
    description='Enter Text ID:',
    style=dict(font_style='italic', font_weight='bold'))

# widget to select what to preview, i.e., speaker, quote, named entities
menu = widgets.SelectMultiple(
    options=['SPEAKER', 'QUOTE', 'NAMED ENTITIES'],
    value=['SPEAKER', 'QUOTE', 'NAMED ENTITIES'],
    rows=3,
    description='Show:',
    disabled=False,
    layout=Layout(margin='10px 0px 10px 2px')
)

# widget to show the preview
preview_button = widgets.Button(description='Click to preview', 
                                layout=Layout(margin='5px 0px 10px 120px'),
                                style=dict(font_style='italic',
                                           font_weight='bold'))
preview_out = widgets.Output()

def on_preview_button_clicked(_):
    with preview_out:
        # what happens when we click the preview_button
        clear_output()
        text_id = text.value
        show_what = menu.value
        try:
            show_quotes(text_id, show_what)
        except:
            print('The text_id you entered does not exist. Please enter the correct text_id.')

# link the preview_button with the function
preview_button.on_click(on_preview_button_clicked)

# widget to save the preview
save_button = widgets.Button(description='Save preview', 
                             layout=Layout(margin='5px 0px 10px 120px'),
                             style=dict(font_style='italic',
                                        font_weight='bold'))

def on_save_button_clicked(_):
    with preview_out:
        # what happens when we click the save_button
        clear_output()
        text_id = text.value
        show_what = menu.value
        try:
            show_quotes(text_id, show_what, save_to_html=True)
            print('Preview saved!')
        except:
            print('The text_id you entered does not exist. Please enter the correct text_id.')

# link the save_button with the function
save_button.on_click(on_save_button_clicked)

# widget to show top 5 entities
top_button = widgets.Button(description='Top 5 entities', 
                             layout=Layout(margin='5px 0px 10px 120px'),
                             style=dict(font_style='italic',
                                        font_weight='bold'))
top_out = widgets.Output()

def on_top_button_clicked(_):
    with top_out:
        # what happens when we click the top_button
        clear_output()
        text_id = text.value
        try:
            common_ent(text_id, which_ent='speaker_entities',top_n=5)
            common_ent(text_id, which_ent='quote_entities',top_n=5)
        except:
            print('The text_id you entered does not exist. Please enter the correct text_id.')

# link the top_button with the function
top_button.on_click(on_top_button_clicked)

# displaying buttons and their outputs
box = widgets.VBox([text, menu, 
                    preview_button, save_button, top_button, 
                    top_out, preview_out])
box

VBox(children=(Text(value='text3', description='Enter Text ID:'), SelectMultiple(description='Show:', index=(0…
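
The 'Top 5 entities' button calls *common_ent*, which flattens the per-quote entity lists and counts how often each entity name occurs. A minimal sketch of that counting step is given below, using made-up entity lists. `top_entities` is a hypothetical helper that returns the counts instead of plotting them.

```python
from collections import Counter

# made-up entity lists, one list per quote, shaped like the speaker_entities column
entity_lists = [
    [('Facebook', 'ORG'), ('Instagram', 'ORG')],
    [],
    [('Facebook', 'ORG')],
    [('Grygiel', 'PERSON'), ('Facebook', 'ORG')],
    [('Grygiel', 'PERSON')],
]

def top_entities(entity_lists, top_n=5):
    """Count entity names across all quotes and return the top_n most frequent."""
    names = [name for entities in entity_lists for name, _label in entities]
    return Counter(names).most_common(top_n)

top_two = top_entities(entity_lists, top_n=2)
print(top_two)
```

As in *common_ent*, only the entity name is counted, so the same name tagged with two different labels contributes to a single bar.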

## 5. Save your quotes
Finally, you can save the quotes dataframe as an Excel spreadsheet and download it to your local computer.

In [13]:
# save quotes_df into an Excel spreadsheet
quotes_df.to_excel('./output/quotes.xlsx', index=False)