# Phrase (collocation) Detection Solution

###### Author: Alex Sherman | alsherman@deloitte.com

#### Agenda
1. Acronym replacement
2. SpaCy POS phrases
3. Gensim Phrases and Phraser

In [125]:
import spacy
import pandas as pd
from sqlalchemy import create_engine
from spacy.matcher import Matcher
from spacy.matcher import PhraseMatcher
from collections import defaultdict
from spacy.lang.en.stop_words import STOP_WORDS
from IPython.core.display import display, HTML
from configparser import ConfigParser, ExtendedInterpolation

In [126]:
# configuration for data, acronyms, and gensim paths
config = ConfigParser(interpolation=ExtendedInterpolation())
config.read('../../config.ini')

DB_PATH = config['DATABASES']['PROJECT_DB_PATH']
AIRLINE_ACRONYMS_FILEPATH = config['NLP']['AIRLINE_ACRONYMS_FILEPATH']
AIRLINE_MATCHED_TEXT_PATH = config['NLP']['AIRLINE_MATCHED_TEXT_PATH']
AIRLINE_CLEANED_TEXT_PATH = config['NLP']['AIRLINE_CLEANED_TEXT_PATH']
GENSIM_DICTIONARY_PATH = config['NLP']['GENSIM_DICTIONARY_PATH']
GENSIM_CORPUS_PATH = config['NLP']['GENSIM_CORPUS_PATH']

#### Load data on airline fees

In [127]:
engine = create_engine(DB_PATH)
df = pd.read_sql("SELECT * FROM Sections", con=engine)

# filter to relevant sections
df = df[df['section_text'].str.contains('fee')]
df.head()

Unnamed: 0,section_id,filename,section_name,section_text,criteria,section_length
9,10,southwest-airlines-co_annual_report_1994.docx,DEPARTMENT OF TRANSPORTATION RANKINGS FOR 1994...,A multitude of challenges faced the People of ...,<function heading at 0x000001D4AA492EA0>,2849
15,16,southwest-airlines-co_annual_report_1994.docx,RESULTS OF OPERATIONS,1994 COMPARED WITH 1993 The Company's consolid...,<function heading at 0x000001D4AA492EA0>,13806
23,24,southwest-airlines-co_annual_report_1994.docx,ACQUISITION,"On December 31, 1993, Southwest exchanged 3,57...",<function heading at 0x000001D4AA492EA0>,2141
26,27,southwest-airlines-co_annual_report_1994.docx,ACCRUED LIABILITIES (IN THOUSANDS) LONG-TERM D...,"On March 1, 1993, the Company redeemed the $10...",<function heading at 0x000001D4AA492EA0>,1855
77,78,southwest-airlines-co_annual_report_1995.docx,SECRET NUMBER 1 STICK TO WHAT YOU’RE GOOD AT.,"Since 1971, Southwest Airlines has offered sin...",<function style at 0x000001D4AA49F048>,2566


In [128]:
# store section matches in list
text = [section for section in df['section_text'].values]

# review the first 299 characters of a section match
text[0][0:299]

'A multitude of challenges faced the People of Southwest Airlines in 1994. The mark of a true champion is the ability to “rise to the occasion” and meet challenges. We believe our Employees showed their true Southwest Spirit in 1994, accomplishing three- or four-fold what a normal year would  bring.'

### SpaCy - Preprocessing

In [129]:
%%time

# load spacy nlp model
# use 'en' if you don't have the lg model
nlp = spacy.load('en_core_web_lg')

Wall time: 1min 15s


##### Text Preprocessing - Acronyms

SOURCE: https://www.faa.gov/airports/resources/acronyms/

In [130]:
# read csv with airline industry acronyms
airline_acronyms = pd.read_csv(AIRLINE_ACRONYMS_FILEPATH)
airline_acronyms.head()

Unnamed: 0,Acronym,Definition
0,A/C,Aircraft
1,A/G,Air to Ground
2,A/H,Altitude/Height
3,AAC,Mike Monroney Aeronautical Center
4,AAF,Army Air Field


##### Exercise

**Curate the acronyms:**

1. Convert the acronyms into a dict
2. Clean acronyms and definitions (replace spaces with underscores, strip text, lowercase)
3. Remove any acronyms with two or fewer characters (e.g. 'at' == 'air traffic')

In [131]:
acronyms = {}

for ind, row in airline_acronyms.iterrows():
    # get the acronym and convert it to lowercase
    acronym = row['Acronym'].lower()
    
    # clean acronym definition: 
    # lower case, strip excess space, replace spaces with underscores to create a single term
    definition = row['Definition'].lower().strip().replace(' ','_')
    
    # add acronyms/definitions pairs to the acronyms dict
    # ignore two character acronyms as they often match actual words
    # e.g. 'at' == 'air traffic'
    if len(acronym) > 2:
        acronyms[acronym] = definition

# view the first few acronyms
list(acronyms.items())[0:5]  # convert to list because a dict view is not subscriptable

[('fire', 'fire_station'),
 ('noc', 'notice_of_completion'),
 ('rclr', 'rcl_repeater'),
 ('fsdps', 'flight_service_data_processing_system'),
 ('fdp', 'flight_data_processing')]
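The curation steps can also be sketched without pandas. This is a minimal illustration, not the notebook's method: `curate_acronyms`, `min_length`, and `sample_rows` are hypothetical names, the first two rows are copied from the acronym table above, and the `AT` row is the two-character example from the exercise. Unlike the loop above, it also strips whitespace from the acronym itself.

```python
def curate_acronyms(rows, min_length=3):
    """Build an {acronym: definition} dict from (acronym, definition) pairs."""
    curated = {}
    for acronym, definition in rows:
        acronym = acronym.strip().lower()
        # join the words of the definition so it becomes a single term
        definition = definition.strip().lower().replace(' ', '_')
        # short acronyms such as 'at' collide with ordinary words, so skip them
        if len(acronym) >= min_length:
            curated[acronym] = definition
    return curated

sample_rows = [('A/C', 'Aircraft'), ('AAF', 'Army Air Field'), ('AT', 'Air Traffic')]
print(curate_acronyms(sample_rows))
```

Note that a slashed acronym such as `a/c` counts as three characters, so it passes the length filter.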

##### Identify acronyms that exist in text
WARNING SLOW!

In [132]:
%%time

if 1 == 0:
    # review the acronyms
    acronym_matches = []

    # create a nlp pipe to iterate through the text
    for doc in nlp.pipe(text, disable=['tagger','ner']):    
        # iterate through each word in the sentence
        for token in doc:
            token = token.text.lower()
            # check if token is an acronym
            # add matches (acronym and definition) to acronym_matches
            if token in acronyms:
                acronym_matches.append((token, acronyms[token]))

    # review all matching acronyms      
    for match in set(acronym_matches):
        print(match)

('cat', 'clear')
('did', 'direct_inward_dial')
('ata', 'air_transport_association_of_america')
('asm', 'available_seat_mile')
('gps', 'global_positioning_system')
('moa', 'military_operations_area')
('rnav', 'area_navigation')
('dot', 'department_of_transportation')
('grade', 'graphical_airspace_design_environment')
('tops', 'telecommunications_ordering_and_pricing_system_(gsa_software_tool)')
('asr', 'airport_surveillance_radar')
('par', 'preferential_arrival_route')
('faa', 'federal_aviation_administration')
('app', 'approach')
('soc', 'service_oversight_center')
('rnp', 'required_navigation_performance')
('aid', 'airport_information_desk')
('mou', 'memorandum_of_understanding')
('self', 'simplified_short_approach_lighting_system_with_sequenced_flashing_lights')
('tsa', 'taxiway_safety_area')
('far', 'federal_aviation_regulation')
('basic', 'basic_contract_observing_station')
Wall time: 7min 36s


In [133]:
# update the acronyms dict to remove ambiguous acronyms
acronyms_to_remove = ['cat','app','grade','self','basic','did','far']
for term in acronyms_to_remove:
    # pop with a default so re-running this cell does not raise a KeyError
    acronyms.pop(term, None)

###### Collect sentences about fees for phrase model

In [134]:
def collect_phrase_model_sents(matcher, doc, i, matches):
    # identify matching spans (phrases)
    match_id, start, end = matches[i]
    span = doc[start:end]
    
    # keep only words, lemmatize tokens, remove punctuation
    sent = [str(token.lemma_).lower() 
            for token in span.sent if token.is_alpha]
    
    # replace acronyms
    sent = [acronyms[token] if token in acronyms else token 
            for token in sent]
    
    # collect matching (cleaned) sents
    matched_sents.append(sent)
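The acronym replacement step inside `collect_phrase_model_sents` can be tried in isolation. The sketch below assumes a tiny stand-in dict (`sample_acronyms`, with pairs taken from the matches printed earlier) and a hypothetical helper `replace_acronyms`. It also shows why an ambiguous entry such as `far` is worth removing first.

```python
# a tiny stand-in for the curated acronyms dict
sample_acronyms = {
    'dot': 'department_of_transportation',
    'faa': 'federal_aviation_administration',
    'far': 'federal_aviation_regulation',
}

# drop an ambiguous entry; pop(..., None) does not fail if the key is already gone
sample_acronyms.pop('far', None)

def replace_acronyms(tokens, acronyms):
    """Swap each token for its definition when it is a known acronym."""
    return [acronyms.get(token, token) for token in tokens]

tokens = ['the', 'faa', 'be', 'far', 'from', 'the', 'dot']
print(replace_acronyms(tokens, sample_acronyms))
```

Without the `pop`, the ordinary word "far" would have been rewritten as `federal_aviation_regulation`.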

##### Match sentences with the word fee or fees

WARNING SLOW!

In [135]:
%%time 

if 1 == 0:
    # match sentences with the word fee or fees
    matched_sents = []
    patterns = [[{'LOWER': 'fee'}], [{'LOWER': 'fees'}]]

    matcher = Matcher(nlp.vocab)
    
    # unpack the list with * to register more than one pattern at once
    matcher.add('fees', collect_phrase_model_sents, *patterns)

    for doc in nlp.pipe(text, disable=['tagger','ner']):    
        matcher(doc)

Wall time: 7min 8s


In [137]:
print('Number of matches: {} \n'.format(len(matched_sents)))

print('Example Match:')
print(matched_sents[0])

Number of matches: 454 

Example Match:
['rather', 'than', 'pay', 'the', 'fee', 'demand', 'by', 'this', 'crss', 'we', 'respond', 'quickly', 'with', 'our', 'own', 'travel', 'agency', 'solution', 'direct', 'access', 'and', 'ticket', 'for', 'the', 'large', 'agency', 'swat', 'overnight', 'delivery', 'of', 'southwest', 'produce', 'ticket', 'for', 'approximately', 'large', 'travel', 'agency', 'improve', 'access', 'to', 'ticket', 'by', 'mail', 'for', 'direct', 'customers', 'by', 'reduce', 'the', 'time', 'limit', 'from', 'seven', 'day', 'out', 'from', 'the', 'date', 'of', 'travel', 'to', 'three', 'day', 'and', 'ticketless', 'travel', 'which', 'eliminate', 'the', 'need', 'to', 'print', 'a', 'paper', 'ticket', 'altogether']


##### Export matched text to avoid repeating processing

In [138]:
# uncomment below to write the matched text to a .txt file for later use 

# with open(AIRLINE_MATCHED_TEXT_PATH, 'w') as f:
#    for line in matched_sents:
#        line = ' '.join(line) + '\n'
#        line = line.encode('ascii', errors='ignore').decode('ascii') 
#        f.write(line)

In [140]:
# read matched text
with open(AIRLINE_MATCHED_TEXT_PATH, 'r') as f:
    matched_sents_full = [line for line in f.readlines()]
    matched_sents = [line.split() for line in matched_sents_full]
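The write and read cells above form a simple round trip: one sentence per line, tokens separated by spaces. The following sketch reproduces it on two short sentences, using a temporary file as a stand-in for `AIRLINE_MATCHED_TEXT_PATH` (the file name and the `sents` list are assumptions for illustration).

```python
import os
import tempfile

sents = [['rather', 'than', 'pay', 'the', 'fee'],
         ['the', 'commitment', 'fee', 'be', 'per', 'annum']]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'matched_text.txt')

    # write one sentence per line, tokens separated by single spaces
    with open(path, 'w') as f:
        for sent_tokens in sents:
            f.write(' '.join(sent_tokens) + '\n')

    # read the raw lines back before the temporary directory is removed
    with open(path, 'r') as f:
        raw_lines = f.readlines()

# raw lines keep their trailing newline; split() discards it
restored = [line.split() for line in raw_lines]
print(repr(raw_lines[1]))
print(restored == sents)
```

This is why the full-sentence strings still end in `\n` while the token lists do not.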

In [141]:
# store all matched sentences in a dataframe
matches_df = pd.DataFrame(matched_sents_full, columns=['sentences'])

# remove duplicates
matches_df = matches_df.drop_duplicates()

matches_df.head()

Unnamed: 0,sentences
0,rather than pay the fee demand by this crss we...
1,these expense include million of various profe...
2,included in this one time cost result from the...
3,the commitment fee be per annum\n
5,landing fee and other rental per available_sea...


### Use SpaCy part of speech (POS) to create phrases

In [142]:
# combine the matched sentence tokens and parse it with SpaCy
text = ' '.join(matched_sents[0])
text

'rather than pay the fee demand by this crss we respond quickly with our own travel agency solution direct access and ticket for the large agency swat overnight delivery of southwest produce ticket for approximately large travel agency improve access to ticket by mail for direct customers by reduce the time limit from seven day out from the date of travel to three day and ticketless travel which eliminate the need to print a paper ticket altogether'

##### Determine which NLP components can be disabled

In [26]:
def view_pos(doc, n_tokens=5):
    """ print SpaCy parse information for the first n_tokens tokens of a document.
        Note: the POS column shows the POS of each token's syntactic head (token.head.pos_)
    """
    print('{:15} | {:10} | {:10} | {:30}'.format('TOKEN','POS','DEP_','LEFTS'))
    for token in doc[0:n_tokens]:
        print('{:15} | {:10} | {:10} | {:30}'.format(
            token.text, token.head.pos_,token.dep_, str([t.text for t in token.lefts])))

In [27]:
# observe which attributes are still available when named entity recognition (ner) is disabled
pos_doc = nlp(text, disable=['ner'])
view_pos(pos_doc)

TOKEN           | POS        | DEP_       | LEFTS                         
rather          | ADP        | advmod     | []                            
than            | VERB       | advmod     | ['rather']                    
pay             | VERB       | ROOT       | ['than']                      
the             | NOUN       | det        | []                            
fee             | NOUN       | compound   | []                            


In [28]:
# observe which attributes are lost when the parser is also disabled (DEP_ and LEFTS are empty)
pos_doc = nlp(text, disable=['ner','parser'])
view_pos(pos_doc)

TOKEN           | POS        | DEP_       | LEFTS                         
rather          | ADV        |            | []                            
than            | ADP        |            | []                            
pay             | VERB       |            | []                            
the             | DET        |            | []                            
fee             | NOUN       |            | []                            


In [29]:
# observe which attributes are lost when the tagger is disabled instead (POS is empty)
pos_doc = nlp(text, disable=['ner','tagger'])
view_pos(pos_doc, n_tokens=10)

TOKEN           | POS        | DEP_       | LEFTS                         
rather          |            | advmod     | []                            
than            |            | advmod     | ['rather']                    
pay             |            | ROOT       | ['than']                      
the             |            | det        | []                            
fee             |            | compound   | []                            
demand          |            | dobj       | ['the', 'fee']                
by              |            | prep       | []                            
this            |            | det        | []                            
crss            |            | pobj       | ['this']                      
we              |            | nsubj      | []                            


In [30]:
# use explain to define any token.dep_ attributes
spacy.explain('dobj')

'direct object'

In [19]:
dependency_parsing_labels_url = 'https://spacy.io/api/annotation#dependency-parsing'
iframe = '<iframe src={} width=1000 height=400></iframe>'.format(dependency_parsing_labels_url)
HTML(iframe)

##### Extract phrases by identifying tokens describing an object

In [41]:
# add stop words to SpaCy
# this enables the .is_stop attribute with common stop words
from spacy.lang.en.stop_words import STOP_WORDS

for word in STOP_WORDS:
    lex = nlp.vocab[word]
    lex.is_stop = True

In [44]:
def create_pos_phrases(doc):

    phrases = [] 

    doc = nlp(doc, disable=['ner','tagger'])
    for token in doc:
        # find any objects (e.g. direct objects )
        if 'obj' in token.dep_:
            token_text = token.lemma_.lower()
            
            # find any dependent terms to the left of (preceding) the object
            # ignore dependent terms that are stop words
            for left_term in (t.text for t in token.lefts if not t.is_stop):
                # combine the dependent term and object, separated by an underscore
                # e.g. travel agency ==> travel_agency
                phrase = '{}_{}'.format(left_term,token_text)
                phrases.append(phrase)
    
    # convert list of distinct phrases into a sentence
    return ' '.join(set(phrases))

print(create_pos_phrases(matched_sents_full[0]))

paper_ticket agency_swat large_swat seven_day fee_demand produce_ticket large_agency agency_solution travel_agency time_limit direct_customer
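The core logic of `create_pos_phrases` can be followed without loading a SpaCy model. The sketch below is a simplified stand-in: `Tok` is a hypothetical record that mimics only the three token attributes the function reads, the values are copied from the dependency table printed earlier, and `stop_words` is an assumed toy set.

```python
from collections import namedtuple

# minimal stand-in for a parsed token: text, dependency label, left children
Tok = namedtuple('Tok', ['text', 'dep', 'lefts'])

parsed = [
    Tok('pay', 'ROOT', ['than']),
    Tok('demand', 'dobj', ['the', 'fee']),
    Tok('crss', 'pobj', ['this']),
]

# assumed stop words for this toy example
stop_words = {'the', 'this', 'than'}

def object_phrases(tokens, stop_words):
    """Join each object token with its left children that are not stop words."""
    phrases = []
    for token in tokens:
        # 'dobj', 'pobj', and similar labels all contain 'obj'
        if 'obj' in token.dep:
            for left in token.lefts:
                if left not in stop_words:
                    phrases.append('{}_{}'.format(left, token.text))
    return phrases

print(object_phrases(parsed, stop_words))
```

Only `fee_demand` survives: `crss` is an object too, but its single left child is a stop word.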


In [46]:
%%time

# apply the custom function to every element in the dataframe
matches_df['pos_phrases'] = matches_df.sentences.apply(create_pos_phrases)

Wall time: 24.2 s


In [47]:
matches_df.head()

Unnamed: 0,sentences,pos_phrases
0,rather than pay the fee demand by this crss we...,paper_ticket agency_swat large_swat seven_day ...
1,these expense include million of various profe...,relocation_cost fee_million duplicate_property...
2,included in this one time cost result from the...,relocation_cost cost_result duplicate_property...
3,the commitment fee be per annum\n,
5,landing fee and other rental per available_sea...,airport_credit


##### Pandas Apply

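apply is a convenient way to run a function on every element of a Series (a single column). applymap does the same for every element in the entire dataframe (e.g. convert all ints to floats). Both call the function once per element in Python, so prefer a vectorized operation when one exists.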

Example: https://chrisalbon.com/python/data_wrangling/pandas_apply_operations_to_dataframes/

In [37]:
# create a small dataframe with example data
test_df = pd.DataFrame({'col1':range(0,3),'col2':range(3,6)})
test_df

Unnamed: 0,col1,col2
0,0,3
1,1,4
2,2,5


In [39]:
# apply a built-in function to each element in a column
test_df['col1'].apply(float)

0    0.0
1    1.0
2    2.0
Name: col1, dtype: float64

In [348]:
# apply a custom function to every element in a column
def add_five(value):
    return value + 5

test_df['col1'].apply(add_five)

0    5
1    6
2    7
Name: col1, dtype: int64

In [349]:
# apply an anonymous function to every element in a column
test_df['col1'].apply(lambda x: x+5)

0    5
1    6
2    7
Name: col1, dtype: int64

In [350]:
# apply a built-in function to every element in a dataframe 
test_df.applymap(float)  # applymap

Unnamed: 0,col1,col2
0,0.0,3.0
1,1.0,4.0
2,2.0,5.0


### Collocations

"A collocation is an expression consisting of two or more words that
correspond to some conventional way of saying things. Or in the words
of Firth (1957: 181): “Collocations of a given word are statements of the
habitual or customary places of that word.” Collocations include noun
phrases like strong tea and weapons of mass destruction, phrasal verbs like
to make up, and other stock phrases like the rich and powerful. Particularly
interesting are the subtle and not-easily-explainable patterns of word usage
that native speakers all know: why we say a stiff breeze but not a stiff wind
(while either a strong breeze or a strong wind is okay), or why we speak of
broad daylight (but not bright daylight or narrow darkness).



There are actually different definitions of the notion of collocation. Some
authors in the computational and statistical literature define a collocation
as two or more consecutive words with a special behavior, for example
Choueka (1988):

'[A collocation is defined as] a sequence of two or more consecutive
words, that has characteristics of a syntactic and semantic
unit, and whose exact and unambiguous meaning or connotation
cannot be derived directly from the meaning or connotation of its
components.'

In most linguistically oriented research, a phrase
can be a collocation even if it is not consecutive (as in the example knock
. . . door). The following criteria are typical of linguistic treatments of collocations:

**Non-compositionality**: The meaning of a collocation is not a straightforward
composition of the meanings of its parts. Either the meaning
is completely different from the free combination (as in the case of idioms
like kick the bucket) or there is a connotation or added element of
meaning that cannot be predicted from the parts. For example, white
wine, white hair and white woman all refer to slightly different colors, so
we can regard them as collocations. 

**Non-substitutability**: We cannot substitute near-synonyms for the
components of a collocation. For example, we can’t say yellow wine
instead of white wine even though yellow is as good a description of the
color of white wine as white is (it is kind of a yellowish white).

**Non-modifiability**: Many collocations cannot be freely modified with
additional lexical material or through grammatical transformations.
This is especially true for frozen expressions like idioms. For example,
we can’t modify frog in to get a frog in one’s throat into to get an ugly
frog in one’s throat although usually nouns like frog can be modified by
adjectives like ugly. Similarly, going from singular to plural can make
an idiom ill-formed, for example in people as poor as church mice."

SOURCE: https://nlp.stanford.edu/fsnlp/promo/colloc.pdf

### Exercise

Create a function that returns a window of size n over a given sentence. 

For the sentence **'rather than pay the fee'** return the following if the window is n=3:
- ['rather', 'than', 'pay']
- ['than', 'pay', 'the']
- ['pay', 'the', 'fee']


In [48]:
# example sentence
sent = ' '.join(matches_df['sentences'][0:1]).split()
print(sent)

['rather', 'than', 'pay', 'the', 'fee', 'demand', 'by', 'this', 'crss', 'we', 'respond', 'quickly', 'with', 'our', 'own', 'travel', 'agency', 'solution', 'direct', 'access', 'and', 'ticket', 'for', 'the', 'large', 'agency', 'swat', 'overnight', 'delivery', 'of', 'southwest', 'produce', 'ticket', 'for', 'approximately', 'large', 'travel', 'agency', 'improve', 'access', 'to', 'ticket', 'by', 'mail', 'for', 'direct', 'customers', 'by', 'reduce', 'the', 'time', 'limit', 'from', 'seven', 'day', 'out', 'from', 'the', 'date', 'of', 'travel', 'to', 'three', 'day', 'and', 'ticketless', 'travel', 'which', 'eliminate', 'the', 'need', 'to', 'print', 'a', 'paper', 'ticket', 'altogether']


In [49]:
def create_sentence_windows(sentence, n=3):
    "create sliding windows of n consecutive terms over a list of terms"
        
    # create a window on the first n terms by slicing the sentence into the first n terms
    window = sentence[0:n]
    
    # create a list to store all windows
    # add the first window that was created above
    sentence_windows = [window]

    # iterate through the rest of the terms of the sentence
    # e.g. if n=3, then create a new window with terms 2 to 4
    for term in sentence[n:]:
        # remove the first terms of the window and add the next term from the sentence
        window = window[1:] + [term]
        # add the updated window to the master list
        sentence_windows.append(window)

    return sentence_windows

# execute the function
sentence_window = create_sentence_windows(sent, n=3)
# view the first few results
sentence_window[0:5]

[['rather', 'than', 'pay'],
 ['than', 'pay', 'the'],
 ['pay', 'the', 'fee'],
 ['the', 'fee', 'demand'],
 ['fee', 'demand', 'by']]
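The same windows can be produced with a single slice per start position. This is an alternative sketch, not the notebook's solution, and `sliding_windows` is a hypothetical name. One behavioral difference: for a sentence shorter than n it returns an empty list, whereas `create_sentence_windows` returns one shortened window.

```python
def sliding_windows(terms, n=3):
    """Return every run of n consecutive terms."""
    # the last valid start index is len(terms) - n
    return [terms[i:i + n] for i in range(len(terms) - n + 1)]

short_windows = sliding_windows('rather than pay the fee'.split(), n=3)
print(short_windows)
```

For the five-word example sentence this yields exactly the three windows listed in the exercise.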

In [52]:
# execute the function for all sentences

# create a list to store all windows
sentence_window = []

for sent in matches_df['sentences']:
    # convert the sentence string into a list of terms
    sent = sent.split()
    
    # create the sentence windows and append to the sentence_windows list
    windows = create_sentence_windows(sent, n=3)
    
    # add each window to the sentence_window list
    # iterate through windows to make each item in sentence window a window, not a list of windows
    for window in windows:
        sentence_window.append(window)

# view the first five results
sentence_window[0:5]

[['rather', 'than', 'pay'],
 ['than', 'pay', 'the'],
 ['pay', 'the', 'fee'],
 ['the', 'fee', 'demand'],
 ['fee', 'demand', 'by']]

In [143]:
from itertools import combinations
from collections import defaultdict

# create a defaultdict to keep track of common phrases
window_count = defaultdict(int)

for sent in sentence_window:
    # remove stop words
    sentence = [term for term in sent if term not in STOP_WORDS]
    
    # create every pair of the remaining (non stop word) terms
    # e.g. (landing, fee, rental) --> (landing,fee), (landing,rental), (fee,rental)
    for combo in combinations(sentence, 2):
        # convert the tuple to a term
        # e.g. (rather, than) --> 'rather_than'
        phrase = '_'.join(combo)
        
        # increment the count for the term each time it appears to identify the most common terms
        window_count[phrase] += 1

# sort to view the most common terms
# the key (lambda x: x[1]) sorts by the count
sorted(window_count.items(), key=lambda x: x[1], reverse=True)[0:20]

[('landing_fee', 55),
 ('security_fee', 50),
 ('check_bag', 41),
 ('rental_expense', 36),
 ('1_quarter', 33),
 ('bag_fee', 30),
 ('require_airline', 28),
 ('additional_fee', 28),
 ('land_fee', 26),
 ('increase_percent', 26),
 ('fee_pay', 25),
 ('available_seat_mile_basis', 24),
 ('2_check', 24),
 ('aviation_security', 24),
 ('seat_selection', 24),
 ('baggage_fee', 24),
 ('ancillary_service', 22),
 ('passenger_protection', 20),
 ('percent_compare', 19),
 ('company_currently', 18)]
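When reading these counts, keep in mind that the windows overlap, so a pair can be counted more than once for a single occurrence in the text. The toy example below (four made-up tokens, window size 3) illustrates this with a `Counter`; treat the totals above as window co-occurrence counts rather than raw frequencies.

```python
from collections import Counter
from itertools import combinations

toy_tokens = ['landing', 'fee', 'rental', 'expense']

# overlapping windows of three terms
toy_windows = [toy_tokens[i:i + 3] for i in range(len(toy_tokens) - 2)]

pair_counts = Counter()
for toy_window in toy_windows:
    # every unordered pair inside the window, in sentence order
    for pair in combinations(toy_window, 2):
        pair_counts['_'.join(pair)] += 1

# 'fee rental' sits inside both windows, 'landing fee' only inside the first
print(pair_counts['fee_rental'], pair_counts['landing_fee'])
```

Here `fee_rental` is counted twice even though it appears once in the sentence.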

### Phrase (collocation) Detection

Phrase modeling is another approach to learning combinations of tokens that together represent meaningful multi-word concepts. We can develop phrase models by looping over the words in our corpus and looking for words that co-occur (i.e., appear one after another) much more frequently than we would expect by random chance. The formula our phrase models will use to determine whether two tokens $A$ and $B$ constitute a phrase is:

$$\frac{count(A\ B) - count_{min}}{count(A) * count(B)} * N > threshold$$

- $count(A\ B)$ is the number of times the tokens $A\ B$ appear in the corpus in order
- $count_{min}$ is a user-defined parameter to ensure that accepted phrases occur a minimum number of times
- $count(A)$ is the number of times token $A$ appears in the corpus
- $count(B)$ is the number of times token $B$ appears in the corpus
- $N$ is the total vocabulary size
- $threshold$ is a user-defined parameter to control how strong of a relationship between two tokens the model requires before accepting them as a phrase
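To get a feel for the formula, here is a small worked sketch. It assumes the score includes the vocabulary-size factor $N$ that appears in the scoring formula under Phrases Params; `phrase_score` is a hypothetical helper and every count below is a made-up number.

```python
def phrase_score(count_a, count_b, count_ab, min_count, vocab_size):
    """(count(A B) - min_count) / (count(A) * count(B)) * N"""
    return (count_ab - min_count) / (count_a * count_b) * vocab_size

# the same 55 co-occurrences, first for two frequent words, then for two rare words
frequent = phrase_score(count_a=400, count_b=600, count_ab=55, min_count=3, vocab_size=2000)
rare = phrase_score(count_a=60, count_b=80, count_ab=55, min_count=3, vocab_size=2000)

print(round(frequent, 3), round(rare, 3))
```

With a threshold of 3, only the rare pair (score about 21.667) would be accepted; the frequent pair (about 0.433) co-occurs no more often than its individual counts would suggest.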

Once our phrase model has been trained on our corpus, we can apply it to new text. When our model encounters two tokens in new text that it identifies as a phrase, it will merge the two into a single new token.

Phrase modeling is superficially similar to named entity detection in that you would expect named entities to become phrases in the model (so new york would become new_york). But you would also expect multi-word expressions that represent common concepts, but aren't specifically named entities (such as happy hour) to also become phrases in the model.
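The merge step can be sketched in a few lines. This is a simplified stand-in, not gensim's implementation: `merge_phrases` is a hypothetical helper, and `accepted` is an assumed set of pairs that a trained model might have accepted.

```python
def merge_phrases(tokens, phrases, delimiter='_'):
    """Greedily join adjacent token pairs that appear in `phrases`."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append(delimiter.join(pair))
            i += 2  # both tokens were consumed by the phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged

accepted = {('new', 'york'), ('happy', 'hour')}
merged_tokens = merge_phrases(['fly', 'to', 'new', 'york', 'for', 'happy', 'hour'], accepted)
print(merged_tokens)
```

Both the named entity and the common concept end up as single tokens, `new_york` and `happy_hour`.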

We turn to the indispensable gensim library to help us with phrase modeling, in particular the Phrases class.

SOURCE: 
- https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb
- https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

##### Scikit-learn API for Gensim

In [144]:
from gensim.sklearn_api.phrases import PhrasesTransformer

sklearn_phrases = PhrasesTransformer(min_count=3, threshold=3)
sklearn_phrases.fit(matched_sents)

PhrasesTransformer(delimiter=b'_', max_vocab_size=40000000, min_count=3,
          progress_per=10000, scoring='default', threshold=3)

In [146]:
#sklearn_phrases.transform(matched_sents)
print(matched_sents)

[['rather', 'than', 'pay', 'the', 'fee', 'demand', 'by', 'this', 'crss', 'we', 'respond', 'quickly', 'with', 'our', 'own', 'travel', 'agency', 'solution', 'direct', 'access', 'and', 'ticket', 'for', 'the', 'large', 'agency', 'swat', 'overnight', 'delivery', 'of', 'southwest', 'produce', 'ticket', 'for', 'approximately', 'large', 'travel', 'agency', 'improve', 'access', 'to', 'ticket', 'by', 'mail', 'for', 'direct', 'customers', 'by', 'reduce', 'the', 'time', 'limit', 'from', 'seven', 'day', 'out', 'from', 'the', 'date', 'of', 'travel', 'to', 'three', 'day', 'and', 'ticketless', 'travel', 'which', 'eliminate', 'the', 'need', 'to', 'print', 'a', 'paper', 'ticket', 'altogether'], ['these', 'expense', 'include', 'million', 'of', 'various', 'professional', 'fee', 'million', 'for', 'disposal', 'of', 'duplicate', 'or', 'incompatible', 'property', 'and', 'equipment', 'and', 'million', 'for', 'employee', 'relocation', 'and', 'severance', 'cost', 'relate', 'to', 'elimination', 'of', 'duplicate',




In [147]:
# review phrase matches
phrases = []
for terms in sklearn_phrases.transform(matched_sents):
    for term in terms:
        if term.count('_') >= 2:
            phrases.append(term)
print(set(phrases))



{'the_department_of_transportation', 'federal_aviation_administration', 'the_taxiway_safety_area', 'available_seat_mile', 'per_available_seat_mile', 'available_seat_mile_increase', 'required_navigation_performance'}


In [148]:
# create a list of stop words
from spacy.lang.en.stop_words import STOP_WORDS
common_terms = list(STOP_WORDS)

**common_terms:** optional list of “stop words” that won’t affect frequency count of expressions containing them.
- The common_terms parameter adds a way to give special treatment to common terms (aka stop words) so that their presence between two words won’t prevent bigram detection. This allows the model to detect expressions like “bank of america” or “eye of the beholder”.
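The idea behind common_terms can be illustrated with a toy candidate generator that lets a run of common terms sit between two ordinary words. This is a conceptual sketch only (no scoring, and not how gensim is implemented); `candidate_phrases` and `toy_common_terms` are hypothetical names.

```python
def candidate_phrases(tokens, common_terms):
    """List word pairs as candidates, allowing common terms in between."""
    candidates = []
    for i, first in enumerate(tokens):
        # a candidate must start with an ordinary (non-common) word
        if first in common_terms:
            continue
        j = i + 1
        # skip over any run of common terms that follows
        while j < len(tokens) and tokens[j] in common_terms:
            j += 1
        if j < len(tokens):
            candidates.append('_'.join(tokens[i:j + 1]))
    return candidates

toy_common_terms = {'of', 'the'}
print(candidate_phrases(['bank', 'of', 'america', 'raise', 'fee'], toy_common_terms))
print(candidate_phrases(['eye', 'of', 'the', 'beholder'], toy_common_terms))
```

Without the special treatment, "bank of" and "of america" would be the only adjacent pairs available, and the full expression could never be proposed.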


##### Gensim API
A more complex API, though it is faster and has better integration with other gensim components (e.g. Phraser)

In [149]:
from gensim.models.phrases import Phrases
from gensim.models.phrases import Phraser

In [150]:
phrases = Phrases(
      matched_sents
    , common_terms=common_terms
    , min_count=3
    , threshold=3
    , scoring='default'
)

phrases

<gensim.models.phrases.Phrases at 0x2688c33b898>

### Phrases Params

- **scoring:** specifies how potential phrases are scored for comparison to the threshold setting. scoring can be set with either a string that refers to a built-in scoring function, or with a function with the expected parameter names. Two built-in scoring functions are available by setting scoring to a string, ‘default’ and ‘npmi’; this notebook uses ‘default’:

    - ‘default’: from “Distributed Representations of Words and Phrases and their Compositionality” by Mikolov et al.:

$$\frac{count(A\ B) - count_{min}}{count(A) * count(B)} * N > threshold$$

    - where N is the total vocabulary size.
    - Thus, it is easier to exceed the threshold when the two words occur together often or when the two words are individually rare (i.e. the product in the denominator is small)

In [151]:
bigram = Phraser(phrases)

bigram

<gensim.models.phrases.Phraser at 0x2688c33b0f0>

The phrases object still holds every unigram and bigram count collected from the source text in memory. A gensim Phraser discards this extra data to become smaller and somewhat faster than the full Phrases model. To determine what data to remove, the Phraser uses the results of the source model’s min_count, threshold, and scoring settings. (You can tamper with those and create a new Phraser to try other values.)

SOURCE: https://radimrehurek.com/gensim/models/phrases.html
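Conceptually, the step from Phrases to Phraser is a filter: count everything during training, then keep only the pairs that clear the threshold. The sketch below illustrates that idea on a toy corpus. It is not gensim's internal code; `train_counts` and `freeze_phrases` are hypothetical names, and the vocabulary size is taken as the number of distinct tokens, which is a simplification.

```python
from collections import Counter

def train_counts(sentences):
    """Count unigrams and adjacent bigrams over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent_tokens in sentences:
        unigrams.update(sent_tokens)
        bigrams.update(zip(sent_tokens, sent_tokens[1:]))
    return unigrams, bigrams

def freeze_phrases(unigrams, bigrams, min_count, threshold):
    """Keep only the bigrams whose score clears the threshold."""
    vocab_size = len(unigrams)
    kept = {}
    for (a, b), count_ab in bigrams.items():
        score = (count_ab - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        if count_ab >= min_count and score > threshold:
            kept[(a, b)] = score
    return kept

toy_corpus = [['pay', 'the', 'landing', 'fee'],
              ['the', 'landing', 'fee', 'increase'],
              ['landing', 'fee', 'be', 'high']]

unigrams, bigrams = train_counts(toy_corpus)
frozen = freeze_phrases(unigrams, bigrams, min_count=1, threshold=1.5)
print(len(bigrams), 'bigrams counted,', len(frozen), 'kept:', sorted(frozen))
```

Six bigrams are counted but only `('landing', 'fee')` is kept, which is why the frozen object can be much smaller than the counts it was built from.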

In [152]:
def print_phrases(phraser, text_stream, num_underscores=2):
    """ identify phrases from a text stream by searching for terms that
        are separated by underscores and include at least num_underscores
    """
    
    phrases = []
    for terms in phraser[text_stream]:
        for term in terms:
            if term.count('_') >= num_underscores:
                phrases.append(term)
    print(set(phrases))

In [154]:
print_phrases(bigram, matched_sents)

{'cancel_or_oversell', 'service_on_their_website', 'hold_a_reservation', 'taxiway_safety_area', 'increase_after_purchase', 'offer_by_airline', 'measure_have_conjurer', 'cost_such_a_credit', 'primarily_due_to_consult', 'attempt_to_monopolize', 'fee_and_other_rental', 'asif_on_each_airline', 'delay_of_much_than_minute', 'pay_for_ancillary', 'confirmation_and_vii', 'limitation_on_route', 'penalty_for_hour', 'decision_that_create', 'airfare_the_customer', 'fee_for_permanently', 'service_be_provide', 'operation_of_customs', 'gain_a_competitive', 'share_with_ticket', 'effort_to_reduce', 'payments_be_expect', 'new_and_expand', 'follow_the_acquisition', 'difficulty_in_obtain', 'intend_upon_full_integration', 'reservation_for_up_to_hour', 'allow_to_hold', 'fund_for_passenger', 'pursuant_to_authority', 'doe_not_charge', 'reduce_their_capacity', 'result_in_a_low', 'damage_on_behalf', 'information_for_basic', 'cancel_a_pay', 'rate_and_charge', 'require_that_i_advertise', 'numb_of_total', 'security

### Tri-gram phrase model

We can place the text from the first phrase model into another Phrases object to create n-term phrase models. We can repeat this process multiple times.
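Chaining works because the second pass sees each merged token from the first pass as a single token. A minimal sketch of that idea, assuming two hand-picked phrase sets in place of trained models (`merge_pass`, `first_pass`, and `second_pass` are hypothetical names):

```python
def merge_pass(tokens, phrases):
    """One left-to-right pass that joins adjacent pairs listed in `phrases`."""
    out = []
    for token in tokens:
        if out and (out[-1], token) in phrases:
            # extend the previous token instead of adding a new one
            out[-1] = out[-1] + '_' + token
        else:
            out.append(token)
    return out

first_pass = {('seat', 'mile')}
second_pass = {('available', 'seat_mile')}

after_first = merge_pass(['per', 'available', 'seat', 'mile'], first_pass)
after_second = merge_pass(after_first, second_pass)
print(after_first)
print(after_second)
```

The three-word phrase `available_seat_mile` only becomes reachable once `seat_mile` exists as a token.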

In [157]:
phrases = Phrases(bigram[matched_sents], common_terms=common_terms, min_count=5, threshold=5)
trigram = Phraser(phrases)

print_phrases(trigram, bigram[matched_sents], num_underscores=3)

{'1_or_2_bag', 'purchase_of_miles_rewards', 'approve_legislation_in_december', 'company_currently_expect_landing', 'landing_fee_and_other_rental_expense', 'pet_liquor_sale_advance', 'advance_of_travel_iii_fare', 'passenger_protection_rules_conjurer', 'capacity_and_price_decision', 'cost_such_a_credit', 'remit_this_back_to_the_applicable_governmental_entity', 'primarily_due_to_consult', 'baggage_allowance_and_fee_must_apply', 'passenger_at_the_time_of_book_vi', 'fee_and_other_rental', 'asif_on_each_airline', 'rental_respectively_in_the_consolidated_statement', 'reservation_without_penalty_for_hour_after_the_reservation', 'increase_after_purchase_iv', 'campaign_highlight_the_importance_to_southwest', 'amended_complaint_seek_injunctive', 'effective_july_and_ii_eliminate', 'unlike_much_of_its_competitor_southwest_doe', 'new_and_expand_component_of_the_passenger', 'conspiracy_with_respect_to_the_imposition_of_a_1', 'customer_service_by_show_that_southwest_understand', 'imposition_of_a_1_bag

In [158]:
for doc_num in [5]:
    print('DOC NUMBER: {}\n'.format(doc_num))
    print('ORIGINAL SENTENCE: {}\n'.format(' '.join(matched_sents[doc_num])))
    print('BIGRAM: {}\n'.format(' '.join(bigram[matched_sents[doc_num]])))
    print('TRIGRAM: {}'.format(' '.join(trigram[bigram[matched_sents[doc_num]]])))

DOC NUMBER: 5

ORIGINAL SENTENCE: landing fee and other rental per available_seat_mile increase percent in compare to which include a airport credit of million

BIGRAM: landing_fee and other rental_per_available_seat_mile increase_percent in compare to which include a airport credit of million

TRIGRAM: landing_fee_and_other_rental_per_available_seat_mile increase_percent in compare to which include a airport credit of million


#### Export Cleaned Text

In [159]:
# write the cleaned text to a new file for later use
with open(AIRLINE_CLEANED_TEXT_PATH, 'w') as f:
    for line in bigram[matched_sents]:
        line = ' '.join(line) + '\n'
        line = line.encode('ascii', errors='ignore').decode('ascii')
        f.write(line)

### Advanced - clean text using SpaCy and gensim

In [167]:
def clean_text(doc):
    print(doc, '\n')

    ents = nlp(doc.text).ents

    # Add named entities, but only if they span more than two tokens.
    IGNORE_ENTS = ('QUANTITY','ORDINAL','CARDINAL','DATE'
                   ,'PERCENT','MONEY','TIME')
    ents = [ent for ent in ents if 
             (ent.label_ not in IGNORE_ENTS) and (len(ent) > 2)]
    
    # add underscores to combine words in entities
    ents = [str(ent).strip().replace(' ','_') for ent in ents]
    
    # clean text for phrase model
    # Keep only words (no numbers, no punctuation).
    # Lemmatize tokens; stop words are kept here because the phrase model needs them.
    doc_ = [token.lemma_ for token in doc if token.is_alpha]
    phrase_text = [str(term) for term in doc_]
    sent = bigram[phrase_text]
    phrases = []
    for term in sent:
        if '_' in term:
            phrases.append(term)

    # remove stop words
    # (a separate step, as they are needed for the phrase model above)
    doc = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]

    # add phrases and entities
    doc.extend([entity for entity in ents])
    cleaned = [str(term) for term in doc] + phrases

    return cleaned

# combined terms after phrase model
# NOTE: doc must already hold a parsed SpaCy document (its creation is not shown in this notebook)
after_phrase = []
for sent in doc.sents:
    text = clean_text(sent)
    for term in text:
        if '_' in term:
            after_phrase.append(term)

print(set(after_phrase))

The Company’s fleet included 51 aircraft on capital lease as of December 31, 2016, compared with     28 aircraft on capital lease, including one B717, as of December 31, 2015. 

Amounts applicable to these aircraft that are included in property and equipment were: Total rental expense for operating leases, both aircraft and other, charged to operations in 2016, 2015, and 2014 was $932 million, $909 million, and $931 million, respectively. 

The majority of the Company’s terminal operations space, as well as 83 aircraft, were under operating  leases  at  December 31, 2016. 

For aircraft operating leases and for terminal operations leases, expense is  recorded on a straight-line basis and included in Aircraft rentals and in Landing fees and other rentals, respectively, in the Consolidated Statement of Income. 

Future minimum lease payments under capital leases and noncancelable operating leases and rentals to be received under subleases with initial or remaining terms in excess of one 