# Text Preprocessing Homework Solution

###### Author: Alex Sherman | alsherman@deloitte.com

In [1]:
from IPython.display import display, HTML
import spacy
from spacy.matcher import Matcher
from spacy.matcher import PhraseMatcher
from spacy.lang.en.stop_words import STOP_WORDS
from collections import defaultdict
from configparser import ConfigParser, ExtendedInterpolation

In [2]:
# configuration for data and acronyms
config = ConfigParser(interpolation=ExtendedInterpolation())
config.read('../../config.ini')

DB_PATH = config['DATABASES']['PROJECT_DB_PATH']
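
The cell above relies on `ExtendedInterpolation`, which lets one config value reference another with `${section:option}` syntax. The sketch below shows how that resolves. The `[PATHS]` section, the directory, and the SQLite URL are hypothetical; the real `config.ini` is not shown in this notebook, and only the `[DATABASES]` `PROJECT_DB_PATH` key is known from the code.

```python
from configparser import ConfigParser, ExtendedInterpolation

# Hypothetical config.ini contents (assumption: the real file may look different)
sample_ini = """
[PATHS]
PROJECT_DIR = /home/user/project

[DATABASES]
PROJECT_DB_PATH = sqlite:///${PATHS:PROJECT_DIR}/project.db
"""

# read_string parses the text directly, so no file on disk is needed
demo_config = ConfigParser(interpolation=ExtendedInterpolation())
demo_config.read_string(sample_ini)

# ${PATHS:PROJECT_DIR} is expanded when the value is read
demo_db_path = demo_config['DATABASES']['PROJECT_DB_PATH']
print(demo_db_path)
```

Keeping paths in one section and referencing them elsewhere means only a single line changes when the project moves to a different machine.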

In [None]:
# load the Sections table from the project database
import pandas as pd
from sqlalchemy import create_engine

# connect to the database
engine = create_engine(DB_PATH)

# read the 10k documents 
df = pd.read_sql("SELECT * FROM Sections", con=engine)

# the annual report from 1992 was scanned in poor quality
# and the text was not legible
# filter this document out of the dataframe
df = df[df.filename != 'southwest-airlines-co_annual_report_1992.docx']

# filter to sections that contain the word 'costs'
df = df[df.section_text.str.contains('costs')]

# combine the text of the first ten sections in a variable named text
text = ' '.join(df['section_text'].values[0:10])

In [None]:
# print the first 300 characters of the text string
text[0:300]

In [6]:
# load spacy nlp model
# use 'en' (spaCy 2.x shortcut) or 'en_core_web_sm' if you don't have the lg model
nlp = spacy.load('en_core_web_lg')

In [25]:
%%time

# process the text with SpaCy
# disable the named entity recognition ('ner')
# this may take 1-2 minutes to run; %%time above reports the wall time
doc = nlp(text, disable=['ner'])

Wall time: 37.5 s


In [30]:
# print the first 100 tokens of the SpaCy doc
doc[0:100]

In 1994, Southwest Airlines produced a profit of $179.3 million, a 16.2 percent increase over the $154.3 million of 1993, excluding the cumulative effect of 1993 accounting changes. Our net profit margin was percent in a year when the domestic passenger carrier industry, as a whole, basically broke  even. In 1994, we also: Completed negotiation of the launch contract for the Boeing 737-700 and the definition of our new aircraft to be received by Southwest beginning in fourth quarter 1997; Completed negotiation of a contract

### SpaCy - Text Extraction


##### Outside of the collect_sents function, create the following:
- A defaultdict called ent_count to count how many times the pattern appears (e.g. count how many times 'operating costs' appears)
- A defaultdict called ent_sents to collect the sentences in which the patterns appear

##### Create a collect_sents function that does the following:
- Uses span.lemma_.lower() for each entity. These entity lemmas will be used as the keys in both defaultdicts
- Counts the number of times each pattern entity appears
- Collects the sentences in which the patterns appear
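
As a warm-up, here is a minimal sketch of how the two defaultdicts behave, using plain strings in place of SpaCy spans. The `found` pairs and the `demo_` names are made up for illustration and are not taken from the 10k text.

```python
from collections import defaultdict

# int() returns 0, so a missing key starts counting from zero
demo_count = defaultdict(int)
# list() returns [], so a missing key starts with an empty list
demo_sents = defaultdict(list)

# Hypothetical (entity, sentence) pairs standing in for matcher results
found = [
    ("operating cost", "Operating costs rose in the fourth quarter."),
    ("unit cost", "Unit costs fell slightly."),
    ("operating cost", "Lower operating costs offset higher rentals."),
]

for entity, sentence in found:
    demo_count[entity] += 1              # no KeyError on first sight of a key
    demo_sents[entity].append(sentence)  # no need to create the list first

print(dict(demo_count))
```

Neither dictionary needs an `if key in dict` check, which keeps the callback short.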

In [31]:
def collect_sents(matcher, doc, i, matches):
    # extract the match_id, start, and end
    match_id, start, end = matches[i]
    
    # create the span by using the start and end index in the text
    span = doc[start:end]
    
    # create a variable named entity
    # to hold the lowercase lemma of the span
    entity = span.lemma_.lower()

    # create a variable named sent
    # to hold the sentence in which the entity was found
    sent = span.sent
    
    # increment the count for the entity in the ent_count defaultdict
    ent_count[entity] += 1
    
    # add the sentence to the entity in the ent_sents defaultdict
    ent_sents[entity].append(sent)

In [37]:
# create the ent_sents defaultdict
# think about what datatype to use to store sentences
ent_sents = defaultdict(list)

# create the ent_count defaultdict
# think about what datatype to use to store a count
ent_count = defaultdict(int)

# define a pattern that matches the word 'costs' (case-insensitive)
# preceded by one or more nouns
pattern = [{'POS': 'NOUN', 'OP': '+'}, {'LOWER': 'costs'}]

# reset the Matcher with nlp.vocab to clear out any previous patterns
matcher = Matcher(nlp.vocab)

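# add the pattern to the matcher, naming it 'airline_costs'
# (spaCy 2.x signature; in spaCy 3.x use
# matcher.add('airline_costs', [pattern], on_match=collect_sents))
matcher.add('airline_costs', collect_sents, pattern)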

# execute the matcher on the doc
# store the result in a variable named matches
matches = matcher(doc)

In [39]:
# review the results of the ent_count
ent_count

defaultdict(int,
            {'advertising cost': 2,
             'agency commission cost': 1,
             'airframe overhaul cost': 3,
             'commission cost': 1,
             'compensation cost': 2,
             'development cost': 1,
             'distribution cost': 6,
             'maintenance cost': 2,
             'operating cost': 5,
             'overhaul cost': 3,
             'production cost': 1,
             'severance cost': 3,
             'system development cost': 1,
             'time cost': 1,
             'unit cost': 1})
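
Note that some entities appear to overlap: 'agency commission cost' and 'commission cost' each have a count of 1, and 'airframe overhaul cost' and 'overhaul cost' each have a count of 3. This is consistent with the `'OP': '+'` operator reporting every run of nouns that ends at 'costs', not only the longest one. The sketch below imitates that behavior in plain Python. It is a simplified stand-in rather than SpaCy's matching algorithm, and the tagged tokens are hypothetical.

```python
def noun_costs_spans(tagged):
    """Return every span of one or more NOUN tokens followed by 'costs'."""
    spans = []
    for end, (word, _) in enumerate(tagged):
        if word.lower() != "costs":
            continue
        start = end
        # Walk left while the preceding token is a noun;
        # each extra noun yields one more (longer) match
        while start > 0 and tagged[start - 1][1] == "NOUN":
            start -= 1
            spans.append(" ".join(w for w, _ in tagged[start:end + 1]))
    return spans


# Hypothetical (token, part-of-speech) pairs for the phrase
# "lower agency commission costs"
demo_tagged = [
    ("lower", "ADJ"),
    ("agency", "NOUN"),
    ("commission", "NOUN"),
    ("costs", "NOUN"),
]

demo_spans = noun_costs_spans(demo_tagged)
print(demo_spans)
```

If only the longest phrase is wanted, the shorter overlapping spans need to be filtered out after matching.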

In [40]:
# review the results of the ent_sents
ent_sents

defaultdict(list,
            {'advertising cost': [Despite heavy advertising costs to support seven new cities and these campaigns, our overall cost per available seat mile actually declined in 1994 as our People posted record earnings.,
              The overall decrease is primarily attributable to operating efficiencies resulting from the transition of Morris operational functions to Southwest, primarily contract services which decreased $8.8 million (24.4 percent per ASM), offset by an increase in advertising costs of $24.1 million (22.9 percent per ASM) primarily associated with the start-up of seven new cities and new competitive pressures in 1994.],
             'agency commission cost': [The primary factors contributing to this decrease were an 8.8 percent decrease in average jet fuel cost per gallon and lower agency commission costs, offset by increased aircraft rentals.],
             'airframe overhaul cost': [Scheduled airframe overhaul costs are capitalized at amounts not