# Quote Extractor
In this notebook, we will use the *Quote Extractor* tool to extract quotes from a list of texts. In addition to the quote itself, the tool reports who the speaker is, the location of the quote (and of the speaker) within the text, the length of the quote, and more. This code has been adapted from the [GenderGapTracker GitHub page](https://github.com/sfu-discourse-lab/GenderGapTracker/tree/master/NLP/main) and modified to run in a Jupyter Notebook.

## 1. Setup
Before we begin, we need to import the packages that the tool needs to run.

In [1]:
# import the necessary packages
import os
import sys
import logging
import traceback

# pandas: tools for data processing
import pandas as pd

# spaCy and NLTK: natural language processing tools for working with language/text data
import spacy
import nltk
nltk.download('punkt')
from nltk import Tree
from nltk.tokenize import sent_tokenize

# import the quote extractor tool
from quote_extractor import extract_quotes, get_rawtext_files
from config import config
import utils

# initiate the app_logger
app_logger = utils.create_logger('quote_extractor', log_dir='logs', logger_level=logging.INFO, 
                                 file_log_level=logging.INFO)

[nltk_data] Downloading package punkt to /Users/sjuf9909/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# load spaCy's en_core_web_lg, a pre-trained English language model (it must already be installed)
print('Loading spaCy language model...')
nlp = spacy.load('en_core_web_lg')
print('Finished loading.')

Loading spaCy language model...
Finished loading.


## 2. Load the data
This notebook allows you to extract quotes directly from one or more text files. Alternatively, you can extract quotes from a text column in an Excel spreadsheet.

### 2.1. From a text file
To extract quotes directly from text files, store all your text files (.txt) in a folder on your computer; the example below uses the 'input' folder. The code below reads those files and stores the text in a pandas dataframe (a table) for further processing.

In [3]:
# specify the file path to the folder you use to store your text files
file_path = './input/'

# create an empty list for a placeholder to store all the texts
all_files = []

# search for text files (.txt) inside the folder and extract all the texts
for input_file in get_rawtext_files(file_path):
    text_dict = {}
    
    # use the text file name as the doc_id
    doc_id = input_file.replace('.txt', '')
    
    try:
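        # read the text file (the 'with' block closes the file afterwards)
        with open(os.path.join(file_path, input_file), 'r') as f:
            doc_lines = f.readlines()
        # note: each line keeps its own line break, so this join doubles the line breaks
        doc_lines = '\n'.join(doc_lines)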
        
        # store them inside a dictionary
        text_dict['text_id'] = doc_id
        text_dict['texts'] = doc_lines
        all_files.append(text_dict)
            
    except Exception:
        # this will provide some information in the case of an error
        app_logger.exception(f'Failed to read {input_file}')
        traceback.print_exc()

# convert the extracted texts into a pandas dataframe for further processing
text_df = pd.DataFrame.from_dict(all_files)
text_df

Unnamed: 0,text_id,texts
0,test1,"Facebook and Instagram, which Facebook owns, f..."
1,test2,(CBC News)\n\nRepublican lawmakers and previou...
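For readers who want to see the folder-reading step without the project's own `get_rawtext_files` helper, here is a minimal sketch that uses only the standard library. The function name `read_text_folder` is hypothetical and not part of the Quote Extractor code. Unlike the cell above, it reads each file as-is and does not double the line breaks.

```python
import tempfile
from pathlib import Path


def read_text_folder(folder):
    """Return one {'text_id', 'texts'} record per .txt file in `folder`.

    Hypothetical helper for illustration; not part of the Quote Extractor code.
    """
    records = []
    for path in sorted(Path(folder).glob('*.txt')):
        # the file name without its extension becomes the text_id
        records.append({'text_id': path.stem,
                        'texts': path.read_text(encoding='utf-8')})
    return records


# small demonstration with two throwaway files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'test1.txt').write_text('First text.', encoding='utf-8')
    Path(tmp, 'test2.txt').write_text('Second text.', encoding='utf-8')
    demo_records = read_text_folder(tmp)

print([record['text_id'] for record in demo_records])
```

Passing the resulting list of records to `pd.DataFrame` would give a table with the same `text_id` and `texts` columns used in the rest of this notebook.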


### 2.2. From an Excel spreadsheet
If you have already stored your texts in an Excel spreadsheet, you can use the below code to access your spreadsheet.

In [4]:
# enter the file path and the file name of the excel spreadsheet containing the text
file_path = './input/'
file_name = 'text_files.xlsx'

# read the pandas dataframe containing the list of texts
text_df = pd.read_excel(os.path.join(file_path, file_name))
text_df.head()

Unnamed: 0,text_id,texts
0,text1,"Facebook and Instagram, which Facebook owns, f..."
1,text2,(CBC News)\nRepublican lawmakers and previous ...


## 3. Extract the quotes
Once your texts are stored in a pandas dataframe, we can extract the quotes from them.

In [5]:
# specify the column name containing the text
text_col_name = 'texts'

# specify whether to write a parse tree for each quote to a file
# if True, tree_dir is the folder the parse trees are written to
write_quote_trees_in_file = False
tree_dir = './output/trees/'

# create an empty list to store all detected quotes
all_quotes = []

# go through all the texts and start extracting quotes
for n, text in enumerate(text_df[text_col_name]):
    # use the position in the dataframe, and make sure the id is a string
    doc_id = str(text_df['text_id'].iloc[n])
    
    try:
        # pre-process the text
        text = sent_tokenize(text)
        text = ' '.join(text)
        text = utils.preprocess_text(text)
        
        # apply the spaCy pipeline to the text
        doc = nlp(text)
        
        # extract the quotes
        quotes = extract_quotes(doc_id=doc_id, doc=doc, 
                                write_tree=write_quote_trees_in_file, 
                                tree_dir=tree_dir)
        
        # add quote_id to each quote (numbered from 1 within each text)
        for i, quote in enumerate(quotes, start=1):
            quote['quote_id'] = f'{doc_id}-{i}'
        
        # store them in all_quotes
        all_quotes.extend(quotes)
            
    except Exception:
        # this will provide some information in the case of an error
        app_logger.exception(f'Failed to extract quotes from {doc_id}')
        traceback.print_exc()

We have extracted the quotes from the texts. Now, let's generate a preview of the extracted quotes.

In [6]:
# generate a preview of the quotes
for n, q in enumerate(all_quotes[:3]):
    print('Quote number:',n)
    for key, value in q.items():
        print(key.title() + ': ' + str(value))
    print('-' * 110)

Quote number: 0
Speaker: Grygiel
Speaker_Index: (1239,1246)
Quote: "We didn't just see a breach at the Capitol. Social media platforms have been breached by the president repeatedly. This is disinformation. This was a coup attempt in the United States."
Quote_Index: (1052,1238)
Verb: said
Verb_Index: (1247,1251)
Quote_Token_Count: 38
Quote_Type: Heuristic
Is_Floating_Quote: False
Quote_Id: text1-1
--------------------------------------------------------------------------------------------------------------
Quote number: 1
Speaker: 
Speaker_Index: (0,0)
Quote: "Social media is complicit in this because he has repeatedly used social media to incite violence. It's a culmination of years of propaganda and abuse of media by the president of the United States."
Quote_Index: (1492,1691)
Verb: caused
Verb_Index: (1705,1711)
Quote_Token_Count: 39
Quote_Type: Heuristic
Is_Floating_Quote: False
Quote_Id: text1-2
-------------------------------------------------------------------------------------
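The index fields look like start and end character offsets into the text. The sketch below checks that reading against the opening sentence of the first sample text, using the offsets the tool reports for the "announcing" quote in that text (speaker `(0,43)`, verb `(73,83)`, quote `(84,173)`). The helper names `parse_index` and `span_text` are hypothetical, and treating the end offset as exclusive is an assumption that happens to hold for this sample.

```python
def parse_index(index):
    """Convert an index such as '(84,173)' or (84, 173) into a (start, end) tuple."""
    if isinstance(index, str):
        start, end = index.strip('()').split(',')
    else:
        start, end = index
    return int(start), int(end)


def span_text(text, index):
    """Return the part of `text` covered by `index` (end offset assumed exclusive)."""
    start, end = parse_index(index)
    return text[start:end]


# opening sentence of the first sample text
sample_text = ("Facebook and Instagram, which Facebook owns, followed up in the "
               "evening, announcing that Trump wouldn't be able to post for 24 hours "
               "following two violations of its policies.")

print(span_text(sample_text, '(0,43)'))    # speaker
print(span_text(sample_text, '(73,83)'))   # verb
print(span_text(sample_text, '(84,173)'))  # quote
```

If the offsets refer to the pre-processed text rather than the raw text, the slices should be taken from the pre-processed version instead.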

In [7]:
# highlight the extracted quotes in red within the first text
quotes_only = [q['quote'] for q in all_quotes]

my_text = text_df['texts'][0]

# in this example, the first 12 quotes were extracted from the first text
for quote in quotes_only[:12]:
    # wrap every occurrence of the quote in ANSI colour codes (red, then reset);
    # a quote that is not found in the text leaves the text unchanged
    my_text = my_text.replace(quote, '\x1b[31m' + quote + '\x1b[0m')
    
print(my_text)

Facebook and Instagram, which Facebook owns, followed up in the evening, announcing [31mthat Trump wouldn't be able to post for 24 hours following two violations of its policies[0m. The White House did not immediately offer a response to the actions.

While some cheered the platforms' response, experts noted [31mthat these actions follow years of hemming and hawing regarding Trump and his supporters spreading dangerous misinformation and encouraging violence that contributed to Wednesday's events[0m.

Jennifer Grygiel, a Syracuse University communications professor and an expert on social media, said [31mwhat happened in Washington, D.C.[0m[31m, on Wednesday is a direct result of Trump's use of social media to spread propaganda and disinformation, and that the platforms should bear some responsibility for their previous inaction[0m.

Police secure U.S. Capitol after pro-Trump rioters cause bedlam at heart of U.S. government
Senate, House resoundingly reject 1st challenge to Bid
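Matching quotes by their text can miss a quote whose wording was changed during pre-processing, and it cannot tell apart two quotes that share the same wording. An alternative is to highlight by character offsets. The sketch below is a minimal illustration of that idea, assuming the spans do not overlap and refer to the same text that is being highlighted. The function name `highlight_spans` and the demo sentence are made up for illustration.

```python
RED, RESET = '\x1b[31m', '\x1b[0m'


def highlight_spans(text, spans):
    """Wrap each (start, end) character span of `text` in ANSI red.

    Assumes the spans do not overlap and that `end` is exclusive.
    """
    # work from the last span to the first so earlier offsets stay valid
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + RED + text[start:end] + RESET + text[end:]
    return text


# made-up sentence containing two quote-like spans
demo = 'She said that it rained, and he said that it snowed.'
spans = [(9, 23), (37, 51)]

print(highlight_spans(demo, spans))
```

Processing the spans from right to left matters because every inserted colour code shifts the position of everything after it.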

In [8]:
# highlight the remaining quotes in red within the second text
my_text = text_df['texts'][1]

for quote in quotes_only[12:]:
    # wrap every occurrence of the quote in ANSI colour codes (red, then reset)
    my_text = my_text.replace(quote, '\x1b[31m' + quote + '\x1b[0m')
    
print(my_text)

(CBC News)
Republican lawmakers and previous administration officials had begged Trump to give a statement to his supporters to quell the violence. He posted his video as authorities struggled to take control of a chaotic situation at the Capitol that led to the evacuation of lawmakers and the death of at least one person.

Lawmakers, world leaders condemn chaos at the U.S. Capitol while some call for Trump's removal
Trudeau says Canadians 'deeply disturbed' by violence in Washington D.C.
Trump has harnessed social media — especially Twitter — as a potent tool for spreading misinformation about the election. Wednesday's riot only increased calls to ban Trump from the platform.

"[31mThe President has promoted sedition and incited violence[0m," said Jonathan Greenblatt, chief executive officer of the Anti-Defamation League, in a statement. [31m"More than anything, what is happening right now at the Capitol is a direct result of the fear and disinformation that has been spewed consist

Using displaCy, spaCy's visualisation tool, we can also render the dependency parse tree of the extracted quotes.

In [9]:
# spaCy comes with a built-in visualisation suite
from spacy import displacy

# parse the first three quotes
qte = [nlp(q['quote']) for q in all_quotes[:3]]
# with jupyter=True, the trees are displayed directly in the notebook
displacy.render(qte, style='dep', jupyter=True, options={'distance':90})

## 4. Save your quotes
Once you are happy with the extracted quotes, you can save them to an Excel spreadsheet for further analysis.

In [10]:
# convert the outcome into a pandas dataframe
quotes_df = pd.DataFrame.from_dict(all_quotes)

# re-arrange the columns
new_index = ['quote_id', 'quote', 'quote_index', 'quote_token_count', 'quote_type','is_floating_quote', 
             'speaker', 'speaker_index', 'verb', 'verb_index']
quotes_df = quotes_df.reindex(columns=new_index)

# preview the quotes dataframe
quotes_df

Unnamed: 0,quote_id,quote,quote_index,quote_token_count,quote_type,is_floating_quote,speaker,speaker_index,verb,verb_index
0,text1-1,"""We didn't just see a breach at the Capitol. S...","(1052,1238)",38,Heuristic,False,Grygiel,"(1239,1246)",said,"(1247,1251)"
1,text1-2,"""Social media is complicit in this because he ...","(1492,1691)",39,Heuristic,False,,"(0,0)",caused,"(1705,1711)"
2,text1-3,that Trump wouldn't be able to post for 24 hou...,"(84,173)",17,S V C,False,"Facebook and Instagram, which Facebook owns","(0,43)",announcing,"(73,83)"
3,text1-4,that these actions follow years of hemming and...,"(302,489)",26,S V C,False,experts,"(288,295)",noted,"(296,301)"
4,text1-5,"what happened in Washington, D.C.","(592,625)",6,S V C,False,"Jennifer Grygiel, a Syracuse University commun...","(491,586)",said,"(587,591)"
5,text1-6,", on Wednesday is a direct result of Trump's u...","(625,813)",32,S V C,False,"Jennifer Grygiel, a Syracuse University commun...","(491,586)",said,"(587,591)"
6,text1-7,"This is what happens,","(1015,1036)",5,Q C Q S V,False,""" Grygiel","(1036,1045)",said,"(1046,1050)"
7,text1-8,"They're creeping along towards firmer action,","(1352,1397)",8,Q C Q S V,False,""" Grygiel","(1397,1406)",said,"(1407,1411)"
8,text1-9,"that the video was removed because it ""contrib...","(2362,2467)",18,S V C,False,"Guy Rosen, Facebook's vice-president of integrity","(2285,2334)",said,"(2336,2340)"
9,text1-10,"""This is an emergency situation and we are tak...","(2470,2594)",20,Q C Q S V,False,Rosen,"(2597,2602)",said,"(2603,2607)"
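As a quick check before saving, it can help to see how the quotes are spread across quote types. The sketch below counts the `quote_type` values of the ten rows in the preview above using only the standard library. With the full dataframe, `quotes_df['quote_type'].value_counts()` should give the equivalent summary.

```python
from collections import Counter

# quote_type values copied from the ten rows of the preview table
quote_types = ['Heuristic', 'Heuristic', 'S V C', 'S V C', 'S V C', 'S V C',
               'Q C Q S V', 'Q C Q S V', 'S V C', 'Q C Q S V']

# count how often each quote type occurs, most frequent first
type_counts = Counter(quote_types)
for quote_type, count in type_counts.most_common():
    print(f'{quote_type}: {count}')
```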


In [11]:
# save into an Excel spreadsheet (create the output folder first if it does not exist)
os.makedirs('./output', exist_ok=True)
quotes_df.to_excel('./output/quotes.xlsx', index=False)