# Quote Extractor
In this notebook, we use the *Quote Extractor* tool to extract quotes from a list of texts. For each quote, the tool also reports the speaker, the reporting verb, the character positions of the quote, speaker and verb within the text, the number of tokens in the quote, and the quote type.

## 1. Setup
Before we begin, we need to import the packages that the tool needs to run.

In [1]:
# import the necessary packages
import os
import sys
import logging
import traceback

# pandas: tools for data processing
import pandas as pd

# spaCy and NLTK: natural language processing tools for working with human language data
import spacy
import nltk
nltk.download('punkt')
from nltk import Tree
from nltk.tokenize import sent_tokenize

# import the quote extractor tool
from quote_extractor import extract_quotes, get_rawtext_files
from config import config
import utils

# initiate the app_logger
app_logger = utils.create_logger('quote_extractor', log_dir='logs', logger_level=logging.INFO, 
                                 file_log_level=logging.INFO)

[nltk_data] Downloading package punkt to /Users/sjuf9909/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
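
The implementation of `utils.create_logger` is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal stand-in built on the standard `logging` module. The function name `create_logger_sketch`, the log file name, and the message format are assumptions made for illustration, not the project's actual code.

```python
import logging
import os


def create_logger_sketch(name, log_dir="logs", logger_level=logging.INFO,
                         file_log_level=logging.INFO):
    """Hypothetical stand-in for utils.create_logger (the real helper is not shown)."""
    # make sure the log folder exists before opening a file in it
    os.makedirs(log_dir, exist_ok=True)

    logger = logging.getLogger(name)
    logger.setLevel(logger_level)

    # only attach a handler once, so re-running the cell does not duplicate log lines
    if not logger.handlers:
        formatter = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        file_handler = logging.FileHandler(os.path.join(log_dir, name + ".log"),
                                           encoding="utf-8")
        file_handler.setLevel(file_log_level)
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    return logger
```

A logger built this way supports the same `exception(...)` call that the extraction cells use to record failures.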


In [2]:
# load en_core_web_lg, spaCy's large pre-trained English language pipeline
# (if it is not installed yet, run: python -m spacy download en_core_web_lg)
print("Loading spaCy language model...")
nlp = spacy.load('en_core_web_lg')
print("Finished loading")

Loading spaCy language model...
Finished loading


## 2. Load the data
You can extract quotes from a single text file (.txt) or from many text files at once. Store all your text files in an input folder, and list the texts in an Excel spreadsheet saved in the same folder.
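
To check which text files are in the input folder before running the extraction, a small helper like the one below may be useful. This is a sketch only: `list_txt_files` is a hypothetical name, and it is not the project's `get_rawtext_files`, whose implementation is not shown here.

```python
from pathlib import Path


def list_txt_files(folder):
    """Return the sorted names of the .txt files directly inside `folder`."""
    return sorted(
        p.name for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".txt"
    )
```

For example, `list_txt_files('./input/')` would return only the text files and skip the spreadsheet stored alongside them.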

In [3]:
# enter the file path and the file name of the Excel spreadsheet containing the list of texts
# please ensure that the text files (.txt) are stored in the same folder
file_path = './input/'
file_name = 'text_files.xlsx'

# read the pandas dataframe containing the list of texts
text_df = pd.read_excel(file_path + file_name, index_col=0)
text_df.head()

index,texts,description
0,"Facebook and Instagram, which Facebook owns, f...",random
1,(CBC News)\nRepublican lawmakers and previous ...,CBC news


In [11]:
tt = text_df['texts'][0]

In [4]:
# specify whether you wish to create a parse tree for the quotes and specify the file path if 'True'
write_quote_trees_in_file = False
tree_dir = './output/trees/'

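# begin to extract quotes from the texts listed in the spreadsheet
all_quotes = []
for doc_num, text in enumerate(text_df['texts']):
    doc_id = 'text' + str(doc_num)
    
    try:
        # split the text into sentences, re-join them with single spaces, then clean up the text
        text = sent_tokenize(text)
        text = " ".join(text)
        text = utils.preprocess_text(text)
        doc = nlp(text)
        quotes = extract_quotes(doc_id=doc_id, doc=doc, 
                                write_tree=write_quote_trees_in_file, 
                                tree_dir=tree_dir)
        
        # number the quotes within each text, starting from 1
        for quote_num, quote in enumerate(quotes, start=1):
            quote['quote_id'] = doc_id + '-' + str(quote_num)
        all_quotes.extend(quotes)
            
    except Exception:
        # log the failure (with its traceback) and continue with the next text
        app_logger.exception("Failed to extract quotes from %s", doc_id)
        traceback.print_exc()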
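
The quote numbering used in the loop can be isolated into a small function, which makes the ID scheme easier to see: each quote gets the document ID, a hyphen, and its 1-based position within that document. The helper name `assign_quote_ids` and the sample dictionaries are illustrative assumptions.

```python
def assign_quote_ids(doc_id, quotes):
    """Add a 'quote_id' of the form '<doc_id>-<position>' to each quote dict."""
    for position, quote in enumerate(quotes, start=1):
        quote["quote_id"] = f"{doc_id}-{position}"
    return quotes


# toy input: two quote records with only a speaker field
sample = assign_quote_ids("text0", [{"speaker": "Grygiel"}, {"speaker": "experts"}])
print([q["quote_id"] for q in sample])  # ['text0-1', 'text0-2']
```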

In [5]:
# generate a preview of the quotes
for n, q in enumerate(all_quotes):
    print('Quote number:',n)
    for key, value in q.items():
        print(key.title() + ': ' + str(value))
    print('-' * 120)

Quote number: 0
Speaker: Grygiel
Speaker_Index: (1239,1246)
Quote: "We didn't just see a breach at the Capitol. Social media platforms have been breached by the president repeatedly. This is disinformation. This was a coup attempt in the United States."
Quote_Index: (1052,1238)
Verb: said
Verb_Index: (1247,1251)
Quote_Token_Count: 38
Quote_Type: Heuristic
Is_Floating_Quote: False
Quote_Id: text0-1
------------------------------------------------------------------------------------------------------------------------
Quote number: 1
Speaker: Facebook and Instagram, which Facebook owns
Speaker_Index: (0,43)
Quote: that Trump wouldn't be able to post for 24 hours following two violations of its policies
Quote_Index: (84,173)
Verb: announcing
Verb_Index: (73,83)
Quote_Token_Count: 17
Quote_Type: SVC
Is_Floating_Quote: False
Quote_Id: text0-2
------------------------------------------------------------------------------------------------------------------------
Quote number: 2
Speaker: expe
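
The index fields in the preview look like character offsets with an exclusive end: the 43-character speaker `Facebook and Instagram, which Facebook owns` has the index `(0,43)`. Assuming the indices are stored either as strings of the form `(start,end)` or as tuples, and assuming they refer to the preprocessed text passed to `nlp`, a span can be recovered from the text roughly as follows. The helper names `parse_span` and `slice_span` are hypothetical.

```python
def parse_span(span):
    """Convert '(start,end)' (or a (start, end) pair) into two integers."""
    if isinstance(span, str):
        start, end = span.strip("()").split(",")
    else:
        start, end = span
    return int(start), int(end)


def slice_span(text, span):
    """Return the part of `text` covered by `span` (end offset exclusive)."""
    start, end = parse_span(span)
    return text[start:end]


# the speaker span from the preview, applied to the start of the first text
text_start = "Facebook and Instagram, which Facebook owns"
print(slice_span(text_start, "(0,43)"))
```

Slicing the text with an index and comparing the result to the stored quote, speaker, or verb is a quick way to confirm that the offsets line up with the text you are working with.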

In [7]:
# specify whether you wish to create a parse tree for the quotes and specify the file path if 'True'
write_quote_trees_in_file = False
tree_dir = './output/trees/'

# alternatively, extract quotes directly from the .txt files in the input folder
all_quotes = []
for input_file in get_rawtext_files(file_path):
    doc_id = os.path.splitext(input_file)[0]
    
    try:
        # read the file, dropping empty lines and trailing whitespace
        with open(os.path.join(file_path, input_file), 'r') as f:
            doc_lines = [line.rstrip() for line in f if line != '\n']
        doc_text = '\n'.join(doc_lines)
        doc_text = utils.preprocess_text(doc_text)
        doc = nlp(doc_text)
        quotes = extract_quotes(doc_id=doc_id, doc=doc, 
                                write_tree=write_quote_trees_in_file, 
                                tree_dir=tree_dir)
        
        # number the quotes within each file, starting from 1
        for quote_num, quote in enumerate(quotes, start=1):
            quote['quote_id'] = doc_id + '-' + str(quote_num)
        all_quotes.extend(quotes)
            
    except Exception:
        # log the failure (with its traceback) and continue with the next file
        app_logger.exception("Failed to extract quotes from %s", input_file)
        traceback.print_exc()

In [9]:
# generate a preview of the quotes
for n, q in enumerate(all_quotes):
    print('Quote number:',n)
    for key, value in q.items():
        print(key.title() + ': ' + str(value))
    print('-' * 120)

Quote number: 0
Speaker: Facebook and Instagram, which Facebook owns
Speaker_Index: (0,43)
Quote: that Trump wouldn't be able to post for 24 hours following two violations of its policies
Quote_Index: (84,173)
Verb: announcing
Verb_Index: (73,83)
Quote_Token_Count: 17
Quote_Type: SVC
Is_Floating_Quote: False
Quote_Id: test1-1
------------------------------------------------------------------------------------------------------------------------
Quote number: 1
Speaker: experts
Speaker_Index: (289,296)
Quote: that these actions follow years of hemming and hawing regarding Trump and his supporters spreading dangerous misinformation and encouraging violence that contributed to Wednesday's events
Quote_Index: (303,490)
Verb: noted
Verb_Index: (297,302)
Quote_Token_Count: 26
Quote_Type: SVC
Is_Floating_Quote: False
Quote_Id: test1-2
------------------------------------------------------------------------------------------------------------------------
Quote number: 2
Speaker: Jennifer Grygi

In [7]:
# collect the quotes in a pandas dataframe, with the columns in a fixed order
quotes_df = pd.DataFrame(all_quotes)
column_order = ['quote_id', 'quote', 'quote_index', 'quote_token_count', 'quote_type', 'is_floating_quote', 
                'speaker', 'speaker_index', 'verb', 'verb_index']
quotes_df = quotes_df.reindex(columns=column_order)
quotes_df.head()

,quote_id,quote,quote_index,quote_token_count,quote_type,is_floating_quote,speaker,speaker_index,verb,verb_index
0,test1-1,that Trump wouldn't be able to post for 24 hou...,"(84,173)",17,SVC,False,"Facebook and Instagram, which Facebook owns","(0,43)",announcing,"(73,83)"
1,test1-2,that these actions follow years of hemming and...,"(303,490)",26,SVC,False,experts,"(289,296)",noted,"(297,302)"
2,test1-3,"what happened in Washington, D.C.","(594,627)",6,SVC,False,"Jennifer Grygiel, a Syracuse University commun...","(493,588)",said,"(589,593)"
3,test1-4,", on Wednesday is a direct result of Trump's u...","(627,815)",32,SVC,False,"Jennifer Grygiel, a Syracuse University commun...","(493,588)",said,"(589,593)"
4,test1-5,"This is what happens,","(1018,1039)",5,QCQSV,False,""" Grygiel","(1039,1048)",said,"(1049,1053)"


In [8]:
quotes_df

,quote_id,quote,quote_index,quote_token_count,quote_type,is_floating_quote,speaker,speaker_index,verb,verb_index
0,test1-1,that Trump wouldn't be able to post for 24 hou...,"(84,173)",17,SVC,False,"Facebook and Instagram, which Facebook owns","(0,43)",announcing,"(73,83)"
1,test1-2,that these actions follow years of hemming and...,"(303,490)",26,SVC,False,experts,"(289,296)",noted,"(297,302)"
2,test1-3,"what happened in Washington, D.C.","(594,627)",6,SVC,False,"Jennifer Grygiel, a Syracuse University commun...","(493,588)",said,"(589,593)"
3,test1-4,", on Wednesday is a direct result of Trump's u...","(627,815)",32,SVC,False,"Jennifer Grygiel, a Syracuse University commun...","(493,588)",said,"(589,593)"
4,test1-5,"This is what happens,","(1018,1039)",5,QCQSV,False,""" Grygiel","(1039,1048)",said,"(1049,1053)"
5,test1-6,"They're creeping along towards firmer action,","(1358,1403)",8,QCQSV,False,""" Grygiel","(1403,1412)",said,"(1413,1417)"
6,test1-7,"that the video was removed because it ""contrib...","(2373,2478)",18,SVC,False,"Guy Rosen, Facebook's vice-president of integrity","(2296,2345)",said,"(2347,2351)"
7,test1-8,This is an emergency situation and we are taki...,"(2484,2607)",19,QCQSV,False,Rosen,"(2610,2615)",said,"(2616,2620)"
8,test1-9,"""I know your pain","(2817,2834)",5,SVC,False,Trump,"(2786,2791)",saying,"(2809,2815)"
9,test1-10,"""We can't play into the hands of these people","(2979,3024)",11,SVC,False,Trump,"(2957,2962)",say,"(2974,2977)"
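
Once the quotes are collected, simple tallies can help you get a feel for the results, for example how many quotes each speaker has or how often each quote type occurs. The sketch below works on a list of quote dictionaries such as `all_quotes`; the sample rows are copied from the table above (reduced to three fields), and the helper name `count_by` is an assumption.

```python
from collections import Counter

# a few rows from the table above, reduced to the fields used here
sample_quotes = [
    {"quote_id": "test1-1", "speaker": "Facebook and Instagram, which Facebook owns", "quote_type": "SVC"},
    {"quote_id": "test1-2", "speaker": "experts", "quote_type": "SVC"},
    {"quote_id": "test1-8", "speaker": "Rosen", "quote_type": "QCQSV"},
    {"quote_id": "test1-9", "speaker": "Trump", "quote_type": "SVC"},
    {"quote_id": "test1-10", "speaker": "Trump", "quote_type": "SVC"},
]


def count_by(quotes, key):
    """Count how often each value of `key` occurs across the quote dicts."""
    return Counter(quote[key] for quote in quotes)


print(count_by(sample_quotes, "speaker").most_common(1))  # [('Trump', 2)]
print(count_by(sample_quotes, "quote_type"))
```

With the dataframe, `quotes_df['speaker'].value_counts()` gives the same kind of summary.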


In [9]:
# save the above dataframe to an Excel file (the ./output/ folder must already exist)
quotes_df.to_excel('./output/quotes.xlsx', index=False)
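
Writing `.xlsx` files with pandas requires an Excel writer engine such as `openpyxl` to be installed. If that is not available, the quotes can also be saved as a CSV file using only the standard library. The following is a minimal sketch; the function name `save_quotes_csv` and the output path are assumptions, and `quotes` is expected to be a list of dictionaries such as `all_quotes`.

```python
import csv
import os


def save_quotes_csv(quotes, path, columns):
    """Write a list of quote dicts to `path` as CSV, keeping only `columns`."""
    # create the output folder if it does not exist yet
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as handle:
        # extrasaction="ignore" skips keys that are not listed in `columns`;
        # missing keys are written as empty cells
        writer = csv.DictWriter(handle, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(quotes)
```

For example, `save_quotes_csv(all_quotes, './output/quotes.csv', ['quote_id', 'quote', 'speaker', 'verb'])` would write one row per quote with those four columns.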