# Quote Extractor
In this notebook, we use the *Quote Extractor* tool to extract quotes from a list of texts. For each quote, the tool also reports who the speaker is, where the quote and the speaker are located within the text, the length of the quote, and more.

## 1. Setup
Before we begin, we need to import the packages that the tool depends on.

In [1]:
# import the necessary packages
import os
import sys
import logging
import traceback

# pandas: tools for data processing
import pandas as pd

# spaCy and NLTK: natural language processing tools for working with language/text data
import spacy
import nltk
nltk.download('punkt')
from nltk import Tree
from nltk.tokenize import sent_tokenize

# import the quote extractor tool
from quote_extractor import extract_quotes, get_rawtext_files
from config import config
import utils

# initiate the app_logger
app_logger = utils.create_logger('quote_extractor', log_dir='logs', logger_level=logging.INFO, 
                                 file_log_level=logging.INFO)

[nltk_data] Downloading package punkt to /Users/sjuf9909/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# load spaCy's en_core_web_lg, the pre-trained English language model (it must already be installed)
print('Loading spaCy language model...')
nlp = spacy.load('en_core_web_lg')
print('Finished loading.')

Loading spaCy language model...
Finished loading.
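
If `en_core_web_lg` is not installed yet, `spacy.load` raises an `OSError`. The sketch below shows one possible way to fall back to a download in that case. The helper name `load_or_download` and its `load` and `download` parameters are hypothetical and not part of the Quote Extractor; the demonstration uses stand-in functions so that it runs without spaCy.

```python
def load_or_download(model_name, load, download):
    """Try to load a model; if it is missing, download it once and retry."""
    try:
        return load(model_name)
    except OSError:
        # spacy.load signals a missing model with OSError
        download(model_name)
        return load(model_name)


# stand-ins (assumptions for illustration only) that mimic a model which is
# missing on the first attempt and available after a download
installed = set()

def fake_load(name):
    if name not in installed:
        raise OSError(f"Can't find model '{name}'")
    return f'<pipeline {name}>'

def fake_download(name):
    installed.add(name)

demo_nlp = load_or_download('en_core_web_lg', fake_load, fake_download)
print(demo_nlp)
```

In the notebook this could be called as `nlp = load_or_download('en_core_web_lg', spacy.load, spacy.cli.download)`. Depending on the environment, a freshly downloaded model may only become loadable after restarting the kernel.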


## 2. Load the data
This notebook lets you extract quotes directly from one or more text files. Alternatively, you can extract quotes from a text column in an Excel spreadsheet.

### 2.1. From a text file
To extract quotes directly from text files, store all your text files (.txt) in one folder, e.g., the 'input' folder in the example below. The code below reads those files and collects their text in a pandas dataframe (a table) for further processing.

In [3]:
# specify the file path to the folder you use to store your text files
file_path = './input/'

# create an empty list for a placeholder to store all the texts
all_files = []

# search for text files (.txt) inside the folder and extract all the texts
for input_file in get_rawtext_files(file_path):
    text_dict = {}
    
    # use the text file name (without its extension) as the doc_id
    doc_id = os.path.splitext(input_file)[0]
    
    try:
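        # read the text file; the 'with' block closes the file afterwards
        # (each line keeps its own newline, so joining with '\n' leaves a blank line between lines)
        with open(os.path.join(file_path, input_file), 'r') as f:
            doc_lines = f.readlines()
        doc_lines = '\n'.join(doc_lines)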
        
        # store them inside a dictionary
        text_dict['text_id'] = doc_id
        text_dict['texts'] = doc_lines
        all_files.append(text_dict)
            
    except Exception:
        # log the error with its traceback and continue with the next file
        app_logger.exception("Failed to read %s", input_file)
        traceback.print_exc()

# convert the extracted texts into a pandas dataframe for further processing
text_df = pd.DataFrame.from_dict(all_files)
text_df

| | text_id | texts |
|---|---|---|
| 0 | test1 | Facebook and Instagram, which Facebook owns, f... |
| 1 | test2 | (CBC News)\n\nRepublican lawmakers and previou... |
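
As an alternative sketch of the folder-reading step, the snippet below uses only the standard library's `pathlib` instead of `get_rawtext_files`. The function name `load_text_folder`, the UTF-8 encoding, and the demo file names are assumptions made for illustration; the demo writes two small files to a temporary folder so that it runs anywhere.

```python
from pathlib import Path
import tempfile


def load_text_folder(folder):
    """Read every .txt file in `folder` into a list of {'text_id', 'texts'} records."""
    records = []
    for path in sorted(Path(folder).glob('*.txt')):
        records.append({
            'text_id': path.stem,                         # file name without '.txt'
            'texts': path.read_text(encoding='utf-8'),    # assumes UTF-8 files
        })
    return records


# demo with two throwaway files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'demo1.txt').write_text('First document.', encoding='utf-8')
    Path(tmp, 'demo2.txt').write_text('Second document.', encoding='utf-8')
    demo_records = load_text_folder(tmp)

print(demo_records)
```

Passing the resulting list to `pd.DataFrame(...)` gives a table with the same `text_id` and `texts` columns as above.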


### 2.2. From an Excel spreadsheet
If you have already stored your texts in an Excel spreadsheet, you can load it with the code below. The spreadsheet needs a `text_id` column and a `texts` column, as in the example output.

In [9]:
# enter the file path and the file name of the Excel spreadsheet containing the text
file_path = './input/'
file_name = 'text_files.xlsx'

# read the spreadsheet into a pandas dataframe
text_df = pd.read_excel(os.path.join(file_path, file_name))
text_df.head()

| | text_id | texts |
|---|---|---|
| 0 | text1 | Facebook and Instagram, which Facebook owns, f... |
| 1 | text2 | (CBC News)\nRepublican lawmakers and previous ... |


In [12]:
# check the text_id of the first row
text_df['text_id'][0]

'text1'
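
Before extracting quotes, it can help to confirm that the loaded table has the columns the later cells rely on. The following is a minimal sketch of such a check; the function name `check_text_table` and the specific checks (required columns, duplicate ids, empty cells) are assumptions, not part of the Quote Extractor.

```python
import pandas as pd


def check_text_table(df, id_col='text_id', text_col='texts'):
    """Return a list of problems found in the table (an empty list means OK)."""
    problems = []
    for col in (id_col, text_col):
        if col not in df.columns:
            problems.append(f"missing column '{col}'")
    if problems:
        return problems
    if df[id_col].duplicated().any():
        problems.append(f"duplicate values in '{id_col}'")
    n_empty = int(df[text_col].isna().sum())
    if n_empty:
        problems.append(f"{n_empty} empty cell(s) in '{text_col}'")
    return problems


# made-up demo table with one duplicate id and one empty text
demo_df = pd.DataFrame({
    'text_id': ['text1', 'text1', 'text3'],
    'texts': ['First text.', None, 'Third text.'],
})
demo_problems = check_text_table(demo_df)
print(demo_problems)
```

In the notebook, `check_text_table(text_df)` returning an empty list would suggest the table is ready for the extraction step.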

## 3. Extract quotes from your texts
Once your texts are stored in a pandas dataframe, we can extract the quotes from them. Each text is pre-processed, parsed with spaCy, and then passed to `extract_quotes`.

In [13]:
# specify the column name containing the text
text_col_name = 'texts'

# specify whether you wish to create a parse tree for the quotes 
# you also need to specify the output file path if 'True'
write_quote_trees_in_file = False
tree_dir = './output/trees/'

# create an empty list to store all detected quotes
all_quotes = []

# go through all the texts and start extracting quotes
for n, text in enumerate(text_df[text_col_name]):
    # look up the doc_id by position so this also works with a non-default index
    doc_id = str(text_df['text_id'].iloc[n])
    
    try:
        # pre-process the text
        text = sent_tokenize(text)
        text = " ".join(text)
        text = utils.preprocess_text(text)
        
        # run the spaCy pipeline on the text
        doc = nlp(text)
        
        # extract the quotes
        quotes = extract_quotes(doc_id=doc_id, doc=doc, 
                                write_tree=write_quote_trees_in_file, 
                                tree_dir=tree_dir)
        
        # add quote_id to each quote (numbered from 1 within each text)
        for i, quote in enumerate(quotes, start=1):
            quote['quote_id'] = doc_id + '-' + str(i)
        
        # store them in all_quotes
        all_quotes.extend(quotes)
            
    except Exception:
        # log the error with its traceback and continue with the next text
        app_logger.exception("Failed to extract quotes from %s", doc_id)
        traceback.print_exc()
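
The loop above can also be read as a small reusable pattern: run an extractor on each text, number the resulting quotes as `<doc_id>-<n>`, and keep going when one text fails. The sketch below illustrates that pattern only. `collect_quotes` and `fake_extract` are hypothetical names, and `fake_extract` is a stand-in that splits on full stops; it does not imitate what `extract_quotes` actually does.

```python
def collect_quotes(records, extract):
    """Run `extract` on every (doc_id, text) pair and number the quotes per text."""
    all_quotes, failed_ids = [], []
    for doc_id, text in records:
        try:
            quotes = extract(text)
        except Exception:
            # remember which text failed and carry on with the rest
            failed_ids.append(doc_id)
            continue
        for i, quote in enumerate(quotes, start=1):
            quote['quote_id'] = f'{doc_id}-{i}'
        all_quotes.extend(quotes)
    return all_quotes, failed_ids


# hypothetical stand-in extractor: one "quote" per sentence, fails on empty text
def fake_extract(text):
    if not text:
        raise ValueError('empty text')
    return [{'quote': s.strip()} for s in text.split('.') if s.strip()]


demo_records = [('text1', 'One. Two.'), ('text2', ''), ('text3', 'Three.')]
demo_quotes, demo_failed = collect_quotes(demo_records, fake_extract)
print([q['quote_id'] for q in demo_quotes])
print(demo_failed)
```

Returning the ids of failed texts alongside the quotes makes it easy to see which inputs need another look, in addition to the logged tracebacks.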

We have extracted the quotes from all texts. Now, let's generate a preview of the extracted quotes.

In [18]:
# generate a preview of the first three quotes (slicing is safe even with fewer than three)
for n, q in enumerate(all_quotes[:3]):
    print('Quote number:',n)
    for key, value in q.items():
        print(key.title() + ': ' + str(value))
    print('-' * 120)

Quote number: 0
Speaker: Grygiel
Speaker_Index: (1239,1246)
Quote: "We didn't just see a breach at the Capitol. Social media platforms have been breached by the president repeatedly. This is disinformation. This was a coup attempt in the United States."
Quote_Index: (1052,1238)
Verb: said
Verb_Index: (1247,1251)
Quote_Token_Count: 38
Quote_Type: Heuristic
Is_Floating_Quote: False
Quote_Id: text1-1
------------------------------------------------------------------------------------------------------------------------
Quote number: 1
Speaker: Facebook and Instagram, which Facebook owns
Speaker_Index: (0,43)
Quote: that Trump wouldn't be able to post for 24 hours following two violations of its policies
Quote_Index: (84,173)
Verb: announcing
Verb_Index: (73,83)
Quote_Token_Count: 17
Quote_Type: SVC
Is_Floating_Quote: False
Quote_Id: text1-2
------------------------------------------------------------------------------------------------------------------------
Quote number: 2
Speaker: expe
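
The index fields in the preview look like character offsets with an exclusive end: for example, the speaker "Grygiel" has 7 characters and its `Speaker_Index` is (1239,1246). Under that reading, a span can be used to slice the quote back out of the text. The sketch below is an illustration under two assumptions: that the offsets refer to the pre-processed text passed to `nlp`, and that an index may arrive either as a string such as `'(84,173)'` or as a tuple. The helper names and the sample sentence are made up.

```python
import ast


def parse_span(span):
    """Turn '(84,173)' (or an existing tuple) into a (start, end) tuple of ints."""
    if isinstance(span, str):
        span = ast.literal_eval(span)
    start, end = span
    return int(start), int(end)


def slice_span(text, span):
    """Return the part of `text` covered by `span` (end offset is exclusive)."""
    start, end = parse_span(span)
    return text[start:end]


# made-up sample text and quote record, laid out like the preview above
sample_text = 'The mayor said that the bridge will reopen on Monday.'
sample_quote = {
    'speaker': 'The mayor',
    'speaker_index': '(0,9)',
    'verb': 'said',
    'verb_index': '(10,14)',
    'quote': 'that the bridge will reopen on Monday',
    'quote_index': '(15,52)',
}

for field in ('speaker', 'verb', 'quote'):
    print(field, '->', slice_span(sample_text, sample_quote[field + '_index']))
```

If the sliced text does not match the stored quote for real output, the offsets probably refer to a differently pre-processed version of the text.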

## 4. Save your quotes
Once you are happy with the extracted quotes, you can save them to an Excel spreadsheet for further analysis.

In [19]:
# collect the quotes in a pandas dataframe
quotes_df = pd.DataFrame(all_quotes)

# re-arrange the columns
column_order = ['quote_id', 'quote', 'quote_index', 'quote_token_count', 'quote_type', 'is_floating_quote',
                'speaker', 'speaker_index', 'verb', 'verb_index']
quotes_df = quotes_df.reindex(columns=column_order)

# preview the quotes dataframe
quotes_df.head()

| | quote_id | quote | quote_index | quote_token_count | quote_type | is_floating_quote | speaker | speaker_index | verb | verb_index |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | text1-1 | "We didn't just see a breach at the Capitol. S... | (1052,1238) | 38 | Heuristic | False | Grygiel | (1239,1246) | said | (1247,1251) |
| 1 | text1-2 | that Trump wouldn't be able to post for 24 hou... | (84,173) | 17 | SVC | False | Facebook and Instagram, which Facebook owns | (0,43) | announcing | (73,83) |
| 2 | text1-3 | that these actions follow years of hemming and... | (302,489) | 26 | SVC | False | experts | (288,295) | noted | (296,301) |
| 3 | text1-4 | what happened in Washington, D.C. | (592,625) | 6 | SVC | False | Jennifer Grygiel, a Syracuse University commun... | (491,586) | said | (587,591) |
| 4 | text1-5 | , on Wednesday is a direct result of Trump's u... | (625,813) | 32 | SVC | False | Jennifer Grygiel, a Syracuse University commun... | (491,586) | said | (587,591) |


In [20]:
# make sure the output folder exists before saving
os.makedirs('./output', exist_ok=True)
# save the quotes dataframe into an Excel spreadsheet
quotes_df.to_excel('./output/quotes.xlsx', index=False)
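
As a quick sanity check alongside saving, it can be useful to count how many quotes fall under each quote type or speaker. The sketch below does this with the standard library's `Counter` on a list of quote dictionaries shaped like `all_quotes`. The helper name `count_by` is hypothetical, and the three demo records only reuse a few field values from the preview above.

```python
from collections import Counter


def count_by(quotes, key):
    """Count how often each value of `key` occurs across the quote records."""
    return Counter(q.get(key) for q in quotes)


# three demo records using field values shown in the preview above
demo_quotes = [
    {'quote_id': 'text1-1', 'quote_type': 'Heuristic', 'speaker': 'Grygiel'},
    {'quote_id': 'text1-2', 'quote_type': 'SVC',
     'speaker': 'Facebook and Instagram, which Facebook owns'},
    {'quote_id': 'text1-3', 'quote_type': 'SVC', 'speaker': 'experts'},
]

type_counts = count_by(demo_quotes, 'quote_type')
print(type_counts.most_common())
```

In the notebook, `count_by(all_quotes, 'speaker')` would give the same kind of tally per speaker.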

In [3]:
# enter the file path and the file name of the Excel spreadsheet containing the list of texts
# please ensure that the text files (.txt) are stored in the same folder
file_path = './input/'
file_name = 'text_files.xlsx'

# read the pandas dataframe containing the list of texts
text_df = pd.read_excel(file_path + file_name, index_col=0)
text_df.head()

| index | texts | description |
|---|---|---|
| 0 | Facebook and Instagram, which Facebook owns, f... | random |
| 1 | (CBC News)\nRepublican lawmakers and previous ... | CBC news |


In [4]:
# specify whether you wish to create a parse tree for the quotes and specify the file path if 'True'
write_quote_trees_in_file = False
tree_dir = './output/trees/'

# begin to extract quotes from your text files
all_quotes = []
for n, text in enumerate(text_df['texts']):
    doc_id = 'text' + str(n)
    
    try:
        text = sent_tokenize(text)
        text = " ".join(text)
        text = utils.preprocess_text(text)
        #print(text)
        doc = nlp(text)
        quotes = extract_quotes(doc_id=doc_id, doc=doc, 
                                write_tree=write_quote_trees_in_file, 
                                tree_dir=tree_dir)
        
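        # add quote_id to each quote (numbered from 1 within each text)
        for i, quote in enumerate(quotes, start=1):
            quote['quote_id'] = doc_id + '-' + str(i)
        all_quotes.extend(quotes)
            
    except Exception:
        # log the error with its traceback and continue with the next text
        app_logger.exception("Failed to extract quotes from %s", doc_id)
        traceback.print_exc()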

In [5]:
# generate a preview of the quotes
for n, q in enumerate(all_quotes):
    print('Quote number:',n)
    for key, value in q.items():
        print(key.title() + ': ' + str(value))
    print('-' * 120)

Quote number: 0
Speaker: Grygiel
Speaker_Index: (1239,1246)
Quote: "We didn't just see a breach at the Capitol. Social media platforms have been breached by the president repeatedly. This is disinformation. This was a coup attempt in the United States."
Quote_Index: (1052,1238)
Verb: said
Verb_Index: (1247,1251)
Quote_Token_Count: 38
Quote_Type: Heuristic
Is_Floating_Quote: False
Quote_Id: text0-1
------------------------------------------------------------------------------------------------------------------------
Quote number: 1
Speaker: Facebook and Instagram, which Facebook owns
Speaker_Index: (0,43)
Quote: that Trump wouldn't be able to post for 24 hours following two violations of its policies
Quote_Index: (84,173)
Verb: announcing
Verb_Index: (73,83)
Quote_Token_Count: 17
Quote_Type: SVC
Is_Floating_Quote: False
Quote_Id: text0-2
------------------------------------------------------------------------------------------------------------------------
Quote number: 2
Speaker: expe

In [6]:
# generate the outcome in a pandas dataframe
quotes_df = pd.DataFrame.from_dict(all_quotes)
new_index = ['quote_id', 'quote', 'quote_index', 'quote_token_count', 'quote_type','is_floating_quote', 
             'speaker', 'speaker_index', 'verb', 'verb_index']
quotes_df = quotes_df.reindex(columns=new_index)
quotes_df.head()

| | quote_id | quote | quote_index | quote_token_count | quote_type | is_floating_quote | speaker | speaker_index | verb | verb_index |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | text0-1 | "We didn't just see a breach at the Capitol. S... | (1052,1238) | 38 | Heuristic | False | Grygiel | (1239,1246) | said | (1247,1251) |
| 1 | text0-2 | that Trump wouldn't be able to post for 24 hou... | (84,173) | 17 | SVC | False | Facebook and Instagram, which Facebook owns | (0,43) | announcing | (73,83) |
| 2 | text0-3 | that these actions follow years of hemming and... | (302,489) | 26 | SVC | False | experts | (288,295) | noted | (296,301) |
| 3 | text0-4 | what happened in Washington, D.C. | (592,625) | 6 | SVC | False | Jennifer Grygiel, a Syracuse University commun... | (491,586) | said | (587,591) |
| 4 | text0-5 | , on Wednesday is a direct result of Trump's u... | (625,813) | 32 | SVC | False | Jennifer Grygiel, a Syracuse University commun... | (491,586) | said | (587,591) |


In [7]:
quotes_df

| | quote_id | quote | quote_index | quote_token_count | quote_type | is_floating_quote | speaker | speaker_index | verb | verb_index |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | text0-1 | "We didn't just see a breach at the Capitol. S... | (1052,1238) | 38 | Heuristic | False | Grygiel | (1239,1246) | said | (1247,1251) |
| 1 | text0-2 | that Trump wouldn't be able to post for 24 hou... | (84,173) | 17 | SVC | False | Facebook and Instagram, which Facebook owns | (0,43) | announcing | (73,83) |
| 2 | text0-3 | that these actions follow years of hemming and... | (302,489) | 26 | SVC | False | experts | (288,295) | noted | (296,301) |
| 3 | text0-4 | what happened in Washington, D.C. | (592,625) | 6 | SVC | False | Jennifer Grygiel, a Syracuse University commun... | (491,586) | said | (587,591) |
| 4 | text0-5 | , on Wednesday is a direct result of Trump's u... | (625,813) | 32 | SVC | False | Jennifer Grygiel, a Syracuse University commun... | (491,586) | said | (587,591) |
| 5 | text0-6 | This is what happens, | (1015,1036) | 5 | QCQSV | False | " Grygiel | (1036,1045) | said | (1046,1050) |
| 6 | text0-7 | They're creeping along towards firmer action, | (1352,1397) | 8 | QCQSV | False | " Grygiel | (1397,1406) | said | (1407,1411) |
| 7 | text0-8 | that the video was removed because it "contrib... | (2362,2467) | 18 | SVC | False | Guy Rosen, Facebook's vice-president of integrity | (2285,2334) | said | (2336,2340) |
| 8 | text0-9 | "This is an emergency situation and we are tak... | (2470,2594) | 20 | QCQSV | False | Rosen | (2597,2602) | said | (2603,2607) |
| 9 | text0-10 | "I know your pain | (2802,2819) | 5 | SVC | False | Trump | (2771,2776) | saying | (2794,2800) |


In [9]:
# make sure the output folder exists, then save the above dataframe into an Excel spreadsheet
os.makedirs('./output', exist_ok=True)
quotes_df.to_excel('./output/quotes.xlsx', index=False)