# Quote Extractor
In this notebook, we will use the *Quote Extractor* tool to extract quotes from a list of texts. In addition to the quotes themselves, the tool reports who the speaker is and where the quote and the speaker are located in the text.

**Note:** This code has been adapted from the [GenderGapTracker](https://github.com/sfu-discourse-lab/GenderGapTracker/tree/master/NLP/main) GitHub page and modified to run on a Jupyter Notebook.

## 1. Setup
Before we begin, we need to import the necessary tools and packages for our tool to run.

In [2]:
# import the necessary packages
import os
import sys
import logging
import traceback

# pandas: tools for data processing
import pandas as pd

# spaCy and NLTK: natural language processing tools for working with language/text data
import spacy
import nltk
nltk.download('punkt')
from nltk import Tree
from nltk.tokenize import sent_tokenize

# import the quote extractor tool
from quote_extractor import extract_quotes, get_rawtext_files
from config import config
import utils

# initiate the app_logger
app_logger = utils.create_logger('quote_extractor', log_dir='logs', logger_level=logging.INFO, 
                                 file_log_level=logging.INFO)

[nltk_data] Downloading package punkt to /Users/sjuf9909/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
# load spaCy's en_core_web_lg, the pre-trained English language model (it must already be installed)
print('Loading spaCy language model...')
nlp = spacy.load('en_core_web_lg')
print('Finished loading.')

Loading spaCy language model...
Finished loading.


## 2. Load the data
This notebook allows you to extract quotes directly from one or more text files. Alternatively, you can extract quotes from a text column in an Excel spreadsheet.

### 2.1. From a text file
To extract quotes directly from text files, store all your text files (.txt) in a folder on your computer; the example below uses the 'input' folder. The code below reads those files and collects the text in a pandas dataframe (a table) for further processing.

In [4]:
# specify the file path to the folder you use to store your text files
file_path = './input/'

# create an empty list for a placeholder to store all the texts
all_files = []

# search for text files (.txt) inside the folder and extract all the texts
for input_file in get_rawtext_files(file_path):
    text_dict = {}
    
    # use the text file name as the doc_id
    doc_id = input_file.replace('.txt', '')
    
    try:
        # read the text file (the context manager closes it afterwards)
        with open(os.path.join(file_path, input_file), 'r') as f:
            doc_lines = f.readlines()
        doc_lines = '\n'.join(doc_lines)
        
        # store them inside a dictionary
        text_dict['text_id'] = doc_id
        text_dict['text'] = doc_lines
        all_files.append(text_dict)
            
    except Exception:
        # log which file failed and print the traceback, then continue
        app_logger.exception("Failed to read %s", input_file)
        traceback.print_exc()

# convert the extracted texts into a pandas dataframe for further processing
text_df = pd.DataFrame.from_dict(all_files)
new_index = ['text_id', 'text']
text_df = text_df.reindex(columns=new_index)
text_df.head()

Unnamed: 0,text_id,text
0,test1,"Facebook and Instagram, which Facebook owns, f..."
1,test2,(CBC News)\n\nRepublican lawmakers and previou...
2,test3,Federated States of Micronesia President David...
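
The file-reading step can also be sketched on its own with only the standard library. This is a minimal sketch, not the tool's own code: `read_text_files` is a hypothetical helper, it lists the folder with `os.listdir` rather than `get_rawtext_files`, and it assumes the files are UTF-8 encoded. It reads each file in one call, so line breaks stay exactly as they are in the file (the loop above joins the lines with an extra `'\n'`).

```python
import os

def read_text_files(folder):
    """Return a list of {'text_id', 'text'} dicts, one per .txt file in folder."""
    records = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith('.txt'):
            continue
        # the context manager closes the file even if reading fails
        with open(os.path.join(folder, name), 'r', encoding='utf-8') as f:
            text = f.read()
        # use the file name without its extension as the text_id
        records.append({'text_id': os.path.splitext(name)[0], 'text': text})
    return records
```

The returned list has the same shape as `all_files`, so it could be passed to `pd.DataFrame.from_dict` in the same way.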


### 2.2. From an Excel spreadsheet
If you have already stored your texts in an Excel spreadsheet, you can use the below code to access your spreadsheet.

In [5]:
# enter the file path and the file name of the excel spreadsheet containing the text
file_path = './input/'
file_name = 'text_files.xlsx'

# read the pandas dataframe containing the list of texts
text_df = pd.read_excel(file_path + file_name)
text_df.head()

Unnamed: 0,text_id,text
0,text1,"Facebook and Instagram, which Facebook owns, f..."
1,text2,(CBC News)\nRepublican lawmakers and previous ...
2,text3,Federated States of Micronesia President David...
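
The later cells look up the `text_id` and `text` columns by name, so it can help to confirm the spreadsheet has them before extracting quotes. A small sketch; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced here for illustration.

```python
REQUIRED_COLUMNS = ['text_id', 'text']

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from columns."""
    return [name for name in required if name not in columns]

# column names as they appear in the preview above
print(missing_columns(['Unnamed: 0', 'text_id', 'text']))
# a spreadsheet that uses 'id' instead of 'text_id'
print(missing_columns(['id', 'text']))
```

In the notebook this could be called as `missing_columns(text_df.columns)`; an empty list means both columns are present.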


## 3. Extract the quotes
Once your texts have been stored in a pandas dataframe, we can begin to extract the quotes from the texts.

In [6]:
# specify the column name containing the text
text_col_name = 'text'

# specify whether you wish to create a parse tree for the quotes 
# you also need to specify the output file path if 'True'
write_quote_trees_in_file = True #False
tree_dir = './output/trees/'

# create an empty list to store all detected quotes
all_quotes = []

# go through all the texts and start extracting quotes
# pair each text with its text_id (this also avoids reusing 'n' in the inner loop below)
for doc_id, text in zip(text_df['text_id'], text_df[text_col_name]):
    
    try:
        # pre-process the text
        text = sent_tokenize(text)
        text = ' '.join(text)
        text = utils.preprocess_text(text)
        
        # apply the spaCy language model to the text
        doc = nlp(text)
        
        # extract the quotes
        quotes = extract_quotes(doc_id=doc_id, doc=doc, 
                                write_tree=write_quote_trees_in_file, 
                                tree_dir=tree_dir)
        
        # add quote_id to each quote
        for n, quote in enumerate(quotes):
            quote['text_id'] = doc_id
            quote['quote_id'] = str(n)
            quote['text'] = text
            quote['text_spacy'] = doc
        
        # store them in all_quotes
        all_quotes.extend(quotes)
            
    except Exception:
        # log which text failed, then continue with the next one
        app_logger.exception("Failed to extract quotes from text_id %s", doc_id)
        traceback.print_exc()

## 4. Display the quotes
Once you have extracted the quotes, we will store them in a pandas dataframe for further analysis.

In [7]:
# convert the outcome into a pandas dataframe
quotes_df = pd.DataFrame.from_dict(all_quotes)

# convert the values in the index columns to a tuple
for column in quotes_df.columns:
    if column.endswith('_index'):
        quotes_df[column] = quotes_df[column].replace('','(0,0)')
        quotes_df[column] = quotes_df[column].apply(eval)

# re-arrange the columns
new_index = ['text_id', 'text', 'quote_id', 'quote', 'quote_index', 'speaker', 'speaker_index', 
             'verb', 'verb_index', 'quote_token_count', 'quote_type', 'is_floating_quote', 'text_spacy']
quotes_df = quotes_df.reindex(columns=new_index)

# drop unused columns
#quotes_df.drop(['quote_token_count', 'quote_type', 'is_floating_quote', 'verb', 'verb_index'], 
#               axis=1, inplace=True)
       
# preview the quotes dataframe
quotes_df

Unnamed: 0,text_id,text,quote_id,quote,quote_index,speaker,speaker_index,verb,verb_index,quote_token_count,quote_type,is_floating_quote,text_spacy
0,text1,"Facebook and Instagram, which Facebook owns, f...",0,"""We didn't just see a breach at the Capitol. S...","(1052, 1238)",Grygiel,"(1239, 1246)",said,"(1247, 1251)",38,Heuristic,False,"(Facebook, and, Instagram, ,, which, Facebook,..."
1,text1,"Facebook and Instagram, which Facebook owns, f...",1,"""Social media is complicit in this because he ...","(1492, 1691)",,"(0, 0)",caused,"(1705, 1711)",39,Heuristic,False,"(Facebook, and, Instagram, ,, which, Facebook,..."
2,text1,"Facebook and Instagram, which Facebook owns, f...",2,that Trump wouldn't be able to post for 24 hou...,"(84, 173)","Facebook and Instagram, which Facebook owns,","(0, 44)",announcing,"(73, 83)",17,S V C,False,"(Facebook, and, Instagram, ,, which, Facebook,..."
3,text1,"Facebook and Instagram, which Facebook owns, f...",3,that these actions follow years of hemming and...,"(302, 489)",experts,"(288, 295)",noted,"(296, 301)",26,S V C,False,"(Facebook, and, Instagram, ,, which, Facebook,..."
4,text1,"Facebook and Instagram, which Facebook owns, f...",4,"what happened in Washington, D.C., on Wednesda...","(592, 813)","Jennifer Grygiel, a Syracuse University commun...","(491, 586)",said,"(587, 591)",38,S V C,False,"(Facebook, and, Instagram, ,, which, Facebook,..."
5,text1,"Facebook and Instagram, which Facebook owns, f...",5,"\n ""This is what happens","(1012, 1035)",Grygiel,"(1038, 1045)",said,"(1046, 1050)",6,C S V,False,"(Facebook, and, Instagram, ,, which, Facebook,..."
6,text1,"Facebook and Instagram, which Facebook owns, f...",6,They're creeping along towards firmer action,"(1352, 1396)",Grygiel,"(1399, 1406)",said,"(1407, 1411)",7,Q C Q S V,False,"(Facebook, and, Instagram, ,, which, Facebook,..."
7,text1,"Facebook and Instagram, which Facebook owns, f...",7,"that the video was removed because it ""contrib...","(2362, 2467)","Guy Rosen, Facebook's vice-president of integr...","(2285, 2335)",said,"(2336, 2340)",18,S V C,False,"(Facebook, and, Instagram, ,, which, Facebook,..."
8,text1,"Facebook and Instagram, which Facebook owns, f...",8,This is an emergency situation and we are taki...,"(2471, 2594)",Rosen,"(2597, 2602)",said,"(2603, 2607)",19,Q C Q S V,False,"(Facebook, and, Instagram, ,, which, Facebook,..."
9,text1,"Facebook and Instagram, which Facebook owns, f...",9,I know your pain,"(2803, 2819)",his video,"(2784, 2793)",saying,"(2794, 2800)",4,S V C,False,"(Facebook, and, Instagram, ,, which, Facebook,..."
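
The cell above treats the `_index` values as text such as `'(1052, 1238)'` and converts them with `eval`. A more conservative option is `ast.literal_eval` from the standard library, which only accepts Python literals. A small sketch, where `parse_index` is a hypothetical helper name:

```python
import ast

def parse_index(value):
    """Convert a string such as '(1052, 1238)' into a tuple of ints."""
    # an empty string means no span was found; mirror the notebook's (0, 0) placeholder
    if value == '':
        return (0, 0)
    return ast.literal_eval(value)

print(parse_index('(1052, 1238)'))
print(parse_index(''))
```

Inside the loop over the `_index` columns this could be applied as `quotes_df[column].apply(parse_index)`.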


In general, the quotes are extracted either based on syntactic rules or heuristic (custom) rules. Some quotes can be stand-alone in a sentence, or followed by another quote (floating quote) in the same sentence.   
**Note:** *Q (Quotation mark), S (Speaker), V (Verb), C (Content)*
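
To make the `quote_type` codes easier to read, the letters can be expanded using the legend above. A small sketch; `expand_quote_type` is a hypothetical helper, and values that are not letter codes (such as `Heuristic`) are returned unchanged.

```python
LEGEND = {'Q': 'Quotation mark', 'S': 'Speaker', 'V': 'Verb', 'C': 'Content'}

def expand_quote_type(quote_type):
    """Spell out a pattern such as 'S V C' using the legend."""
    parts = quote_type.split()
    # leave anything that is not purely made of legend letters untouched
    if not parts or not all(p in LEGEND for p in parts):
        return quote_type
    return ' + '.join(LEGEND[p] for p in parts)

print(expand_quote_type('S V C'))
print(expand_quote_type('Heuristic'))
```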

In [15]:
quotes_df['quote'][19]

'It appears to be an attempt to deliberately disrupt existing regional mechanisms which China is not a part of'

In [19]:
doc[95].idx

541

We can show a preview of the quotes using spaCy's visualisation tool, displaCy. All you need to do is run the below function and specify the text_id you wish to analyse.

In [9]:
doc = quotes_df[quotes_df['text_id']=='text3']['text_spacy'].tolist()[0]

In [10]:
spans = [(432, 541)]

loc2tok = [(t.idx, t.i) for t in doc]
loc2tok_df = pd.DataFrame(loc2tok, columns = ['loc', 'token'])

selTokens = []
for span in spans:
    temp = loc2tok_df.loc[(span[0]<=loc2tok_df['loc']) & (loc2tok_df['loc']<span[1]), 'token']
    selTokens.append(temp.tolist())
selTokens

[[76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94]]
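
The lookup above selects every token whose starting character offset falls inside a character span. The same idea can be sketched without pandas by using `bisect` on the sorted list of token start offsets. `char_span_to_tokens` is a hypothetical helper, and `token_starts` stands in for `[t.idx for t in doc]`.

```python
from bisect import bisect_left

def char_span_to_tokens(token_starts, start_char, end_char):
    """Return (first_token, end_token) for tokens starting in [start_char, end_char).

    end_token is exclusive, matching the convention of spaCy's Span(doc, start, end).
    """
    first = bisect_left(token_starts, start_char)
    end = bisect_left(token_starts, end_char)
    return first, end

# toy start offsets for the tokens of "He said yes ."
token_starts = [0, 3, 8, 12]
# characters 3 to 11 cover "said yes", i.e. tokens 1 and 2
print(char_span_to_tokens(token_starts, 3, 11))  # (1, 3)
```

If no token starts inside the span, both values are equal, which signals an empty range.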

In [20]:
from spacy import displacy
from spacy.tokens import Span

# function to display the quotes and speakers in the text
def show_quotes(text_id, save_to_html=True, out_dir='./output/'):
    # get the quote rows and the spaCy doc for the selected text
    text_quotes = quotes_df[quotes_df['text_id']==text_id]
    doc = text_quotes['text_spacy'].tolist()[0]

    # map each token's starting character offset to its token index
    loc2tok_df = pd.DataFrame([(t.idx, t.i) for t in doc], columns = ['loc', 'token'])

    # get the quotes and speakers indexes
    locs = {
        'QUOTE': text_quotes['quote_index'].tolist(),
        'SPEAKER': set(text_quotes['speaker_index'].tolist())
    }

    # convert each character span into a spaCy token span
    spans = []
    for key in locs.keys():
        for loc in locs[key]:
            if loc!=(0,0):
                # tokens whose starting character falls inside the span
                selTokens = loc2tok_df.loc[(loc[0]<=loc2tok_df['loc']) & (loc2tok_df['loc']<loc[1]), 'token'].tolist()
                if selTokens:
                    # the end token of a Span is exclusive, hence the +1
                    spans.append(Span(doc, selTokens[0], selTokens[-1]+1, label=key))

    # attach the spans to the doc under displaCy's default spans key 'sc'
    doc.spans['sc'] = spans

    # formatting options
    colors = {'QUOTE': '#7aecec', 'SPEAKER': '#bfeeb7'}
    options = {'ents': ['QUOTE', 'SPEAKER'], 
               'colors': colors, 'top_offset': 31}

    # option to save the preview as an html document
    if save_to_html:
        html = displacy.render(doc, style='span', options=options, jupyter=False, page=True)
        
        # save the quote preview into an html file
        with open(out_dir+text_id+'.html','w') as file:
            file.write(html)
    
    # display the preview in this notebook
    displacy.render(doc, style='span', options=options, jupyter=True)

By default, this function will also save the quote preview in an html file (with the text_id as the file name) inside the output directory. You can turn this option off by setting the *'save_to_html'* parameter to **False**.

In [21]:
# specify the text_id
text_id = 'text3'

# display the quotes and the speakers in the text
show_quotes(text_id)


## 5. Save your quotes
Finally, you can save the quotes dataframe to an Excel spreadsheet and download it to your local computer.

In [9]:
# save quotes_df into an Excel spreadsheet
quotes_df.to_excel('./output/quotes.xlsx', index=False)
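
The save steps in this notebook write into `./output/`, and writing will generally fail if that folder does not exist yet. A minimal sketch using the standard library; `ensure_dir` is a hypothetical helper introduced here for illustration.

```python
import os

def ensure_dir(path):
    """Create the folder (and any missing parents) if needed and return the path."""
    # exist_ok=True makes this safe to call when the folder is already there
    os.makedirs(path, exist_ok=True)
    return path
```

For example, calling `ensure_dir('./output/')` before `quotes_df.to_excel(...)` would create the folder on the first run and do nothing on later runs.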