# Quote Extractor
In this notebook, we will use the *Quote Extractor* tool to extract quotes from a list of texts. For each quote, the tool also reports useful information such as the speaker, the location of the quote (and of the speaker) within the text, and the length of the quote. This code has been adapted from the [GenderGapTracker GitHub page](https://github.com/sfu-discourse-lab/GenderGapTracker/tree/master/NLP/main) and modified to run in a Jupyter Notebook.

## 1. Setup
Before we begin, we need to import the packages that the tool depends on.

In [1]:
# import the necessary packages
import os
import sys
import logging
import traceback

# pandas: tools for data processing
import pandas as pd

# spaCy and NLTK: natural language processing tools for working with language/text data
import spacy
import nltk
nltk.download('punkt')
from nltk import Tree
from nltk.tokenize import sent_tokenize

# import the quote extractor tool
from quote_extractor import extract_quotes, get_rawtext_files
from config import config
import utils

# initiate the app_logger
app_logger = utils.create_logger('quote_extractor', log_dir='logs', logger_level=logging.INFO, 
                                 file_log_level=logging.INFO)

[nltk_data] Downloading package punkt to /Users/sjuf9909/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# load spaCy's en_core_web_lg, a pre-trained English language model (it must already be installed)
print('Loading spaCy language model...')
nlp = spacy.load('en_core_web_lg')
print('Finished loading.')

Loading spaCy language model...
Finished loading.
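If `spacy.load` fails, the most common cause is that the model has not been installed yet. spaCy models are installed as ordinary Python packages, so a quick check is possible with the standard library alone. The sketch below is a minimal illustration; the helper name `model_is_installed` is hypothetical and not part of spaCy or the Quote Extractor.

```python
import importlib.util

def model_is_installed(model_name):
    """Return True if a package with this name can be imported."""
    return importlib.util.find_spec(model_name) is not None

model_name = 'en_core_web_lg'
if model_is_installed(model_name):
    print(f'{model_name} is installed.')
else:
    # the model can usually be installed from the command line
    print(f'{model_name} not found; try: python -m spacy download {model_name}')
```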


## 2. Load the data
This notebook allows you to extract quotes directly from a text file (or a number of text files). Alternatively, you can extract quotes from a text column in an Excel spreadsheet.

### 2.1. From a text file
To extract quotes directly from text files, store all your text files (.txt) in a folder on your computer; the example below uses the 'input' folder. The code below reads those files and stores the text in a pandas dataframe (in table format) for further processing.

In [3]:
# specify the file path to the folder you use to store your text files
file_path = './input/'

# create an empty list for a placeholder to store all the texts
all_files = []

# search for text files (.txt) inside the folder and extract all the texts
for input_file in get_rawtext_files(file_path):
    text_dict = {}
    
    # use the text file name as the doc_id
    doc_id = input_file.replace('.txt', '')
    
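    try:
        # read the text file, closing it again once the lines are loaded
        with open(os.path.join(file_path, input_file), 'r') as f:
            doc_lines = f.readlines()
        # note: each line keeps its own newline, so joining with '\n'
        # leaves a blank line between consecutive lines
        doc_lines = '\n'.join(doc_lines)
        
        # store them inside a dictionary
        text_dict['text_id'] = doc_id
        text_dict['text'] = doc_lines
        all_files.append(text_dict)
            
    except Exception:
        # log the file name and the full traceback if a file cannot be read
        app_logger.exception(f"Failed to read {input_file}")
        traceback.print_exc()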

# convert the extracted texts into a pandas dataframe for further processing
text_df = pd.DataFrame.from_dict(all_files)
new_index = ['text_id', 'text']
text_df = text_df.reindex(columns=new_index)
text_df

| | text_id | text |
|---|---|---|
| 0 | test1 | Facebook and Instagram, which Facebook owns, f... |
| 1 | test2 | (CBC News)\n\nRepublican lawmakers and previou... |
| 2 | test3 | Federated States of Micronesia President David... |
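For readers who want to see the file-reading step in isolation, here is a minimal sketch that uses only the standard library. It is not the `get_rawtext_files` helper used above: `read_text_folder` is a hypothetical name, and UTF-8 encoding is an assumption about your files. Unlike the cell above, it keeps each file's content exactly as stored.

```python
from pathlib import Path

def read_text_folder(folder):
    """Read every .txt file in a folder into a list of {'text_id', 'text'} records."""
    records = []
    for path in sorted(Path(folder).glob('*.txt')):
        # the file name without its extension serves as the identifier
        records.append({'text_id': path.stem,
                        'text': path.read_text(encoding='utf-8')})
    return records
```

The returned list has the same shape as `all_files`, so it could be passed to `pd.DataFrame.from_dict` in the same way.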


### 2.2. From an Excel spreadsheet
If you have already stored your texts in an Excel spreadsheet, you can use the below code to access your spreadsheet.

In [4]:
# enter the file path and the file name of the Excel spreadsheet containing the text
file_path = './input/'
file_name = 'text_files.xlsx'

# read the pandas dataframe containing the list of texts
text_df = pd.read_excel(file_path + file_name)
text_df.head()

| | text_id | text |
|---|---|---|
| 0 | text1 | Facebook and Instagram, which Facebook owns, f... |
| 1 | text2 | (CBC News)\nRepublican lawmakers and previous ... |
| 2 | text3 | Federated States of Micronesia President David... |
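The later cells expect the spreadsheet to contain a `text_id` column and a `text` column. A small check along the following lines can catch a misnamed column before extraction starts. This is an illustrative sketch; `missing_columns` is a hypothetical helper, not part of pandas or the Quote Extractor.

```python
def missing_columns(columns, required=('text_id', 'text')):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]

# example with a hypothetical header row that lacks the 'text' column
print(missing_columns(['text_id', 'title']))  # → ['text']
```

With the dataframe loaded above, the check would be `missing_columns(text_df.columns)`, which should return an empty list.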


## 3. Extract the quotes
Once your texts have been extracted and stored in a pandas dataframe, we can begin to extract the quotes from the texts.

In [5]:
# specify the column name containing the text
text_col_name = 'text'

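# specify whether you wish to write a parse tree for each quote to file;
# if True, the trees are saved in the folder given by tree_dir
write_quote_trees_in_file = True
tree_dir = './output/trees/'
# make sure the output folder exists before any trees are written
os.makedirs(tree_dir, exist_ok=True)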

# create an empty list to store all detected quotes
all_quotes = []

# go through all the texts and start extracting quotes
for doc_id, text in zip(text_df['text_id'], text_df[text_col_name]):
    
    try:
        # pre-process the text
        text = sent_tokenize(text)
        text = ' '.join(text)
        text = utils.preprocess_text(text)
        
        # run the spaCy language model on the text
        doc = nlp(text)
        
        # extract the quotes
        quotes = extract_quotes(doc_id=doc_id, doc=doc, 
                                write_tree=write_quote_trees_in_file, 
                                tree_dir=tree_dir)
        
        # add the text_id, a quote_id and the pre-processed text to each quote
        for quote_num, quote in enumerate(quotes):
            quote['text_id'] = doc_id
            quote['quote_id'] = str(quote_num)
            quote['text'] = text
        
        # store them in all_quotes
        all_quotes.extend(quotes)
            
    except Exception:
        # log the text_id and the full traceback if a text cannot be processed
        app_logger.exception(f"Failed to extract quotes from {doc_id}")
        traceback.print_exc()
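After the loop finishes, it can be useful to check how many quotes were found per text and whether any text produced none. The sketch below works on plain dictionaries shaped like the entries of `all_quotes`; the function names and the sample records are hypothetical and only for illustration.

```python
from collections import Counter

def quotes_per_text(quotes):
    """Count how many quote records carry each text_id."""
    return Counter(quote['text_id'] for quote in quotes)

def texts_without_quotes(text_ids, quotes):
    """List the text_ids for which no quote record exists."""
    counts = quotes_per_text(quotes)
    return [text_id for text_id in text_ids if counts[text_id] == 0]

# hypothetical records shaped like the entries of all_quotes
sample_quotes = [
    {'text_id': 'text1', 'quote_id': '0'},
    {'text_id': 'text1', 'quote_id': '1'},
    {'text_id': 'text2', 'quote_id': '0'},
]
print(quotes_per_text(sample_quotes).most_common())
print(texts_without_quotes(['text1', 'text2', 'text3'], sample_quotes))
```

In the notebook, the equivalent call would be `texts_without_quotes(text_df['text_id'], all_quotes)`. A text can appear in that list either because it contains no quotes or because processing it raised an error, so the log is worth checking in both cases.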

## 4. Display the quotes
Once you have extracted the quotes, we will store them in a pandas dataframe for further analysis.

In [6]:
# convert the outcome into a pandas dataframe
quotes_df = pd.DataFrame.from_dict(all_quotes)

# keep the main columns, in this order (also available: 'quote_token_count',
# 'quote_type', 'is_floating_quote', 'verb' and 'verb_index')
new_index = ['text_id', 'text', 'quote_id', 'quote', 'quote_index', 'speaker', 'speaker_index']
quotes_df = quotes_df.reindex(columns=new_index)

# preview the quotes dataframe
quotes_df.head()

| | text_id | text | quote_id | quote | quote_index | speaker | speaker_index |
|---|---|---|---|---|---|---|---|
| 0 | text1 | Facebook and Instagram, which Facebook owns, f... | 0 | "We didn't just see a breach at the Capitol. S... | (1052,1238) | Grygiel | (1239,1246) |
| 1 | text1 | Facebook and Instagram, which Facebook owns, f... | 1 | "Social media is complicit in this because he ... | (1492,1691) | | (0,0) |
| 2 | text1 | Facebook and Instagram, which Facebook owns, f... | 2 | that Trump wouldn't be able to post for 24 hou... | (84,173) | Facebook and Instagram, which Facebook owns, | (0,44) |
| 3 | text1 | Facebook and Instagram, which Facebook owns, f... | 3 | that these actions follow years of hemming and... | (302,489) | experts | (288,295) |
| 4 | text1 | Facebook and Instagram, which Facebook owns, f... | 4 | what happened in Washington, D.C., on Wednesda... | (592,813) | Jennifer Grygiel, a Syracuse University commun... | (491,586) |
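The `quote_index` and `speaker_index` values are stored as strings such as `(84,173)`. In the preview above, the 44-character speaker string "Facebook and Instagram, which Facebook owns," has the index `(0,44)`, which suggests that the numbers are character offsets into the `text` column with an exclusive end. Assuming that reading is right, the sketch below converts an index string into integers and slices the text with it; both helper names are hypothetical.

```python
def parse_index(index_str):
    """Convert an index string such as '(84,173)' into a (start, end) tuple of ints."""
    start, end = index_str.strip('()').split(',')
    return int(start), int(end)

def slice_by_index(text, index_str):
    """Return the part of `text` covered by the index ('' for the '(0,0)' placeholder)."""
    start, end = parse_index(index_str)
    return text[start:end]

sample_text = 'Facebook and Instagram, which Facebook owns, said something.'
print(slice_by_index(sample_text, '(0,44)'))
```

Applied to a row of `quotes_df`, `slice_by_index(row['text'], row['speaker_index'])` should reproduce the `speaker` column if the assumption about the offsets holds.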


We can also display the quotes within the text to show the outcome of the extraction tool. All you need to do is run the below function and specify the text_id you wish to analyse.

In [11]:
from spacy import displacy
from spacy.tokens import Span

# function to display the quotes and speakers in the text
def show_quotes(text_id):
    # get the (pre-processed) text and the quotes extracted from it
    text_quotes = quotes_df[quotes_df['text_id']==text_id]
    my_text = text_quotes['text'].tolist()[0]
    
    # get the quotes and speakers indexes, e.g., '(84,173)'
    quotes_loc = text_quotes['quote_index'].tolist()
    speakers_loc = set(text_quotes['speaker_index'].tolist())

    # a blank English pipeline is enough to tokenise the text for display
    blank_nlp = spacy.blank('en')
    doc = blank_nlp(my_text)

    # convert each character index into a token-based Span
    spans = []
    for label, locs in [('QUOTE', quotes_loc), ('SPEAKER', speakers_loc)]:
        for loc in locs:
            # skip the '(0,0)' placeholder (e.g., when no speaker was found)
            if loc!='(0,0)':
                start, end = loc[1:-1].split(',')
                # count the tokens before each character offset, using the
                # tokenizer of the nlp pipeline loaded in the Setup section
                start_token = len(nlp.make_doc(my_text[:int(start)]))
                end_token = len(nlp.make_doc(my_text[:int(end)]))
                spans.append(Span(doc, start_token, end_token, label))
    doc.spans['sc'] = spans

    colors = {'QUOTE': '#7aecec', 'SPEAKER': '#bfeeb7'}
    options = {'ents': ['QUOTE', 'SPEAKER'], 
               'colors': colors, 'top_offset': 31}

    displacy.render(doc, style='span', options=options, jupyter=True)

def show_ner(text_id):
    # reuse the spaCy pipeline (nlp) loaded in the Setup section
    my_text = quotes_df[quotes_df['text_id']==text_id]['text'].tolist()[0]
    doc = nlp(my_text)
    displacy.render(doc, style="ent")

In [8]:
# specify the text_id
text_id = 'text3'

# display the quotes and the speakers in the text
show_quotes(text_id)

In [12]:
# display the Named Entities in the text
show_ner(text_id)
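When the notebook output is not available (for example, when sharing results as plain text), the same quote and speaker positions can be shown with simple bracket markers. The sketch below is a plain-text stand-in for the displaCy view, not a replacement for it. It assumes that spans are given as character offsets with an exclusive end and that they do not overlap; `mark_spans` is a hypothetical helper.

```python
def mark_spans(text, spans):
    """Wrap each (start, end, label) character span in [LABEL: ...] markers."""
    marked = text
    # work from the end of the text so that earlier offsets stay valid
    for start, end, label in sorted(spans, reverse=True):
        marked = marked[:start] + f'[{label}: ' + marked[start:end] + ']' + marked[end:]
    return marked

sample = 'Ada said that the test passed.'
print(mark_spans(sample, [(0, 3, 'SPEAKER'), (9, 29, 'QUOTE')]))
# → [SPEAKER: Ada] said [QUOTE: that the test passed].
```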

## 5. Save your quotes
Once you are happy with the extracted quotes, you can save them to an Excel spreadsheet and download it to your local computer.

In [10]:
# save into an Excel spreadsheet (the output folder is created if it does not exist)
os.makedirs('./output', exist_ok=True)
quotes_df.to_excel('./output/quotes.xlsx', index=False)
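Writing `.xlsx` files with pandas typically relies on an Excel engine such as openpyxl being installed. If that is not available, a CSV file is a simple alternative. The sketch below writes a list of dictionaries (such as `all_quotes`) with the standard `csv` module; `save_records_csv` is a hypothetical helper and the choice of UTF-8 is an assumption.

```python
import csv

def save_records_csv(records, path, fieldnames):
    """Write a list of dictionaries to a CSV file, keeping only the listed fields."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        # fields missing from a record are left empty; extra fields are ignored
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction='ignore')
        writer.writeheader()
        writer.writerows(records)
```

For example, `save_records_csv(all_quotes, './output/quotes.csv', ['text_id', 'quote_id', 'quote', 'speaker'])` would keep only those four fields for each quote.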