# Quick and Dirty - Entity Extraction

## Introduction

If you've ever been around a startup or in the tech world for any significant amount of time, you've <strong><em>definitely</em></strong> encountered some, if not all, of the following phrases: "agile software development", "prototyping", "feedback-loop", "rapid iteration", etc.

This Silicon Valley techno-babble can be distilled down to one simple concept, which just so happens to be the mantra of many a successful entrepreneur: test out your idea as quickly as possible, and then make it better over time. Stated more verbosely, before you invest mind and money into creating a cutting-edge solution to a problem, it might benefit you to get a baseline performance for your task using off-the-shelf techniques. Once you establish the efficacy of a low-cost, easy approach, you can then put on your Elon Musk hat and drive towards #innovation and #disruption.

A concrete example might help illustrate this point: 

### Entity Extraction

Let's say our goal was to create a natural language system that effectively allowed someone to converse with an academic paper. This task could be step one of many towards the development of an automated scientific discovery tool. Society can thank us later. 

But where do we begin? Well, a part of the solution has to deal with [knowledge extraction](https://en.wikipedia.org/wiki/Knowledge_extraction). In order to create a conversational engine that understands scientific papers, we'll need to start by developing an entity recognition module. 

"What's an entity?" you ask? Excellent question. Take a look at the following sentence:

<quote><strong><em>Dr. Abraham is the primary author of this paper, and a physician in the specialty of internal medicine.</em></strong></quote>

It is relatively easy to pick out the entities here: "Dr. Abraham", "primary author", "paper", "physician", "specialty", and "internal medicine". A physician is a type of person, while Dr. Abraham is an instance of an actual physician. This is the difference between an entity and a named entity: "doctor" is an entity, and "Dr. Abraham" is a named entity.
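
To make the distinction concrete, here is a tiny hand-labelled sketch of the example sentence. The `is_named` flags are my own illustrative annotations, not the output of any model.

```python
# The example sentence from above.
sentence = ("Dr. Abraham is the primary author of this paper, and a physician "
            "in the specialty of internal medicine.")

# (mention, is_named): a named entity points at one specific individual,
# while a plain entity names a category or concept.
# These flags are hand-assigned for illustration only.
mentions = [
    ("Dr. Abraham", True),
    ("primary author", False),
    ("paper", False),
    ("physician", False),
    ("specialty", False),
    ("internal medicine", False),
]

# Sanity check: every mention really occurs in the sentence.
assert all(mention in sentence for mention, _ in mentions)

named_entities = [m for m, is_named in mentions if is_named]
entities = [m for m, is_named in mentions if not is_named]
print(named_entities)
print(entities)
```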

## Imports

In [1]:
%load_ext autoreload
%autoreload 2

<em><strong>Fun fact</strong>: Curious about what 'autoreload' does? [Check this out](https://ipython.org/ipython-doc/3/config/extensions/autoreload.html).</em>

In [10]:
import pandas as pd
import spacy
from spacy.displacy.render import EntityRenderer
from IPython.display import display, HTML

# Let's suppress some warnings about binaries that didn't compile just right. 
# Google 'numpy.dtype size changed' to learn more.
import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

## Utils and Prep

Let's do some basic housekeeping before we start diving headfirst into entity extraction. We'll need to deal with visualization, load up a language model, and of course, examine and set up our data source.

### Show and Tell
spaCy has a wonderful set of classes and methods defined to help visualize parts of the NLP processing pipeline. Up top, where we've imported modules, you'll have noticed that I'm pulling 'EntityRenderer' from spaCy's displacy module, as we'll be repurposing some of this code for our... um... purposes. In general, this is a good exercise if you ever want to get your hands dirty and really learn how certain classes work in your friendly neighborhood open source projects. Nothing should ever be taboo or a black-box; always dissect and play with your code before you eat it.  

Wander on over to spaCy's [website](https://spacy.io/), and you'll quickly discover that they've put in some serious time to make the user interface absolutely gorgeous. (While Matthew undeniably had some input on this, I'm going to make an intelligent assumption that the design ideas are probably Ines' [contribution](https://explosion.ai/about)). 

<em><strong>&lt;rant&gt;</strong> Why spend time discussing visualization at all? Well, one of my biggest pet peeves is this: even if you can create a product, if you don't put in the time to make it look beautiful or delightful to use, then you don't care about packaging your ideas for export to an audience. And that makes me sad. Once you get something working, make it pretty. <strong>&lt;/rant&gt;</strong></em>

In [42]:
def custom_render(doc, df, column, options={}, page=False, minify=False, idx=0):
    """Overload the spaCy built-in rendering to allow custom part-of-speech tags.
    
    Keyword arguments:
    doc -- a spaCy nlp doc object
    df -- a pandas dataframe object
    column -- the name of a column of interest in the dataframe
    options -- various options to feed into the spaCy renderer, including colors
    page -- render markup as a full HTML page (default False)
    minify -- minify the HTML to make it compact (default False)
    idx -- index for a specific query or doc in the dataframe (default 0)
    
    """
    renderer, converter = EntityRenderer, parse_custom_ents
    renderer = renderer(options=options)
    parsed = [converter(doc, df=df, idx=idx, column=column)]
    html = renderer.render(parsed, page=page, minify=minify).strip()  
    return display(HTML(html))

def parse_custom_ents(doc, df, idx, column):
    """Parse custom entity types that aren't in the original spaCy module.
    
    Keyword arguments:
    doc -- a spaCy nlp doc object
    df -- a pandas dataframe object
    idx -- index for a specific query or doc in the dataframe
    column -- the name of a column of interest in the dataframe
    
    """
    if column in df.columns:
        entities = df[column][idx]
        ents = [{'start': ent[1], 'end': ent[2], 'label': ent[3]} 
                for ent in entities]
    else:
        ents = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}
            for ent in doc.ents]
    return {'text': doc.text, 'ents': ents, 'title': None}

def render_entities(idx, df, options={}, column='named_ents'):
    """A wrapper function to get text from a dataframe and render it visually in jupyter notebooks
    
    Keyword arguments:
    idx -- index for a specific query or doc in the dataframe
    df -- a pandas dataframe object
    options -- various options to feed into the spaCy renderer, including colors
    column -- the name of a column of interest in the dataframe (default 'named_ents')
    
    """
    text = df['text'][idx]
    custom_render(nlp(text), df=df, column=column, options=options, idx=idx)
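
The interesting part of the helpers above is the plain dictionary that `parse_custom_ents` hands to the renderer: a `text`, a list of `ents` with `start`, `end`, and `label`, and a `title`. Here is a minimal sketch of that conversion using only the standard library. The function name `to_render_dict` and the sample spans are made up for illustration, and the sorting and offset check are my additions rather than part of the original helper.

```python
def to_render_dict(text, spans):
    """Turn (text, start, end, label) tuples into a displaCy-style dict."""
    ents = []
    # Sort by start offset so the spans come out in reading order.
    for span_text, start, end, label in sorted(spans, key=lambda s: s[1]):
        # The character offsets must slice back to the span text,
        # otherwise the highlighting would land on the wrong characters.
        if text[start:end] != span_text:
            raise ValueError(f"offsets {start}:{end} do not match {span_text!r}")
        ents.append({'start': start, 'end': end, 'label': label})
    return {'text': text, 'ents': ents, 'title': None}

# Hypothetical sample data in the same tuple format used in this notebook.
sample_text = 'called pasadena store sunday am'
sample_spans = [('sunday', 22, 28, 'DATE'),
                ('pasadena store', 7, 21, 'COMPOUND')]

render_dict = to_render_dict(sample_text, sample_spans)
print(render_dict)
```

As far as I know, spaCy's `displacy.render` accepts dictionaries of this shape when called with `style='ent'` and `manual=True`, which may be a lighter-weight alternative to instantiating `EntityRenderer` directly.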

In [34]:
# colors for additional part of speech tags we want to visualize
options = {'colors': {'COMPOUND': '#FE6BFE', 'PROPN': '#18CFE6', 'NOUN': '#18CFE6', 'NP': '#1EECA6'}}

In [35]:
pd.set_option('display.max_rows', 10) # edit how jupyter will render our pandas dataframe
pd.options.mode.chained_assignment = None # prevent warning about working on a copy of a df

### Load Model

spaCy's pre-built models are trained on various corpora of text to tag parts of speech, extract named entities, and in general split text into tokens that carry meaning in a given language.

We'll grab the 'en_core_web_lg' model by running the following command in the shell. 

I've commented it out for now because I've already downloaded it, but if this is your first time running this notebook, go ahead and uncomment it and run the cell.

In [8]:
# !python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')

<em><strong>Fun fact</strong>: We run shell commands by using the bang operator ('!'). It is a close cousin of the [magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html) commands, of which we saw an example at the beginning with '%autoreload'.</em>

### Gather Data

Okay, now let's gather the data. We'll run our analysis on a mini version first for quick prototyping, and then run through the full set once we've ironed out all the issues.

In [None]:
PATH = './data/'

In [49]:
!ls {PATH}

nips.csv


<em><strong>Fun fact</strong>: You can use python variables in shell commands by nesting them inside '{' and '}'.</em>

In [36]:
file = 'nips.csv'
df = pd.read_csv(f'{PATH}{file}')

mini_df = df[:10]
mini_df.index = pd.RangeIndex(len(mini_df.index))

# comment this out to run on full dataset
df = mini_df

## Step 1: Inspect and clean data

In [37]:
display(df)

Unnamed: 0,Id,Title,EventType,PdfName,Abstract,PaperText
0,5677,Double or Nothing: Multiplicative Incentive Me...,Poster,5677-double-or-nothing-multiplicative-incentiv...,Crowdsourcing has gained immense popularity in...,Double or Nothing: Multiplicative\nIncentive M...
1,5941,Learning with Symmetric Label Noise: The Impor...,Spotlight,5941-learning-with-symmetric-label-noise-the-i...,Convex potential minimisation is the de facto ...,Learning with Symmetric Label Noise: The\nImpo...
2,6019,Algorithmic Stability and Uniform Generalization,Poster,6019-algorithmic-stability-and-uniform-general...,One of the central questions in statistical le...,Algorithmic Stability and Uniform Generalizati...
3,6035,Adaptive Low-Complexity Sequential Inference f...,Poster,6035-adaptive-low-complexity-sequential-infere...,We develop a sequential low-complexity inferen...,Adaptive Low-Complexity Sequential Inference f...
4,5978,Covariance-Controlled Adaptive Langevin Thermo...,Poster,5978-covariance-controlled-adaptive-langevin-t...,Monte Carlo sampling for Bayesian posterior in...,Covariance-Controlled Adaptive Langevin\nTherm...
5,5714,Robust Portfolio Optimization,Poster,5714-robust-portfolio-optimization.pdf,We propose a robust portfolio optimization app...,Robust Portfolio Optimization\n\nFang Han\nDep...
6,5937,Logarithmic Time Online Multiclass prediction,Spotlight,5937-logarithmic-time-online-multiclass-predic...,We study the problem of multiclass classificat...,Logarithmic Time Online Multiclass prediction\...
7,5802,Planar Ultrametrics for Image Segmentation,Poster,5802-planar-ultrametrics-for-image-segmentatio...,We study the problem of hierarchical clusterin...,Planar Ultrametrics for Image Segmentation\n\n...
8,5776,Expressing an Image Stream with a Sequence of ...,Poster,5776-expressing-an-image-stream-with-a-sequenc...,We propose an approach for generating a sequen...,Expressing an Image Stream with a Sequence of\...
9,5814,Parallel Correlation Clustering on Big Graphs,Poster,5814-parallel-correlation-clustering-on-big-gr...,"Given a similarity graph between items, correl...",Parallel Correlation Clustering on Big Graphs\...


In [38]:
df = pd.DataFrame(df['Abstract'])
df.columns = ['text']
display(df)

Unnamed: 0,text
0,Crowdsourcing has gained immense popularity in...
1,Convex potential minimisation is the de facto ...
2,One of the central questions in statistical le...
3,We develop a sequential low-complexity inferen...
4,Monte Carlo sampling for Bayesian posterior in...
5,We propose a robust portfolio optimization app...
6,We study the problem of multiclass classificat...
7,We study the problem of hierarchical clusterin...
8,We propose an approach for generating a sequen...
9,"Given a similarity graph between items, correl..."


## Step 2: Extract named entities

In [1]:
def extract_named_ents(text):
    return [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in nlp(text).ents]

def add_named_ents(df):
    df['named_ents'] = df['text'].apply(extract_named_ents)    

In [2]:
add_named_ents(df)
display(df)

NameError: name 'df' is not defined

In [48]:
column = 'named_ents'
render_entities(0, df, options=options, column=column)

## Step 2: Extract all nouns/propn

In [356]:
def extract_nouns(query):
    keep_pos = ['PROPN', 'NOUN']
    return [(tok.text, tok.idx, tok.idx+len(tok.text), tok.pos_) for tok in nlp(query) if tok.pos_ in keep_pos]

def add_nouns(df):
    df['nouns'] = df['queries'].apply(extract_nouns)
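
If the `(text, start, end, pos)` tuples look opaque, here is a rough stand-in that needs no language model. It assumes a whitespace tokenizer and a hand-written tag lookup (`TOY_POS`, entirely hypothetical) in place of spaCy's tokenizer and tagger, so treat it as a sketch of the bookkeeping rather than of the tagging itself.

```python
import re

# Hypothetical part-of-speech lookup standing in for spaCy's tagger.
TOY_POS = {'called': 'VERB', 'pasadena': 'NOUN', 'store': 'NOUN'}

def extract_nouns_toy(text, keep_pos=('PROPN', 'NOUN')):
    """Return (token, start_char, end_char, pos) for tokens tagged as nouns."""
    nouns = []
    for match in re.finditer(r'\S+', text):
        token = match.group()
        pos = TOY_POS.get(token, 'X')  # 'X' marks tokens we have no tag for
        if pos in keep_pos:
            start = match.start()
            # Same arithmetic as tok.idx + len(tok.text) in the cell above.
            nouns.append((token, start, start + len(token), pos))
    return nouns

toy_nouns = extract_nouns_toy('called pasadena store')
print(toy_nouns)
```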

In [357]:
add_nouns(df)
display(df)

Unnamed: 0,queries,named_ents,nouns
0,i need to reschedule my appointment which was ...,"[(11/24, 60, 65, CARDINAL), (1267154747, 82, 9...","[(appointment, 24, 35, NOUN), (order, 68, 73, ..."
1,this is the third time i made an appointment w...,"[(third, 12, 17, ORDINAL), (ny, 70, 72, GPE), ...","[(time, 18, 22, NOUN), (appointment, 33, 44, N..."
2,why if i make a appointment for 8 am do they t...,"[(8, 32, 33, DATE), (30, 159, 161, QUANTITY), ...","[(appointment, 16, 27, NOUN), (min, 86, 89, NO..."
3,called pasadena store sunday am . asked how lo...,"[(sunday, 22, 28, DATE), (am, 29, 31, TIME), (...","[(pasadena, 7, 15, NOUN), (store, 16, 21, NOUN..."
4,is it normal to have a 2 pm appointment for ti...,"[(2 pm, 23, 27, TIME), (2 hours, 203, 210, TIM...","[(pm, 25, 27, NOUN), (appointment, 28, 39, NOU..."


In [358]:
column = 'nouns'
render_entities(0, df, options=options, column=column)

## Step 3: Combine nouns/propn & non-numerical entities (v1)

In [359]:
def extract_named_nouns(row_series):
    entities = set()
    idxs = set()
    for noun_tuple in row_series['nouns']:
        for named_entity_tuple in row_series['named_ents']:
            if noun_tuple[1] == named_entity_tuple[1]: 
                idxs.add(noun_tuple[1])
                entities.add(named_entity_tuple)
        if noun_tuple[1] not in idxs:
            entities.add(noun_tuple)
    
    return sorted(list(entities), key=lambda x: x[1])

def add_named_nouns(df):
    df['named_nouns'] = df.apply(extract_named_nouns, axis=1)
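
Here is a small worked example of that merge on plain tuples. It restates the same rule with a dictionary lookup instead of the nested loop: whenever a noun and a named entity start at the same character, the named entity wins. The sample tuples are hypothetical (in particular, I'm assuming 'sunday' was tagged as a noun).

```python
def merge_named_nouns(nouns, named_ents):
    """Prefer the named-entity tuple whenever it starts where a noun starts."""
    named_by_start = {ent[1]: ent for ent in named_ents}
    merged = {named_by_start.get(noun[1], noun) for noun in nouns}
    return sorted(merged, key=lambda x: x[1])

# Hypothetical sample row in the notebook's tuple format.
sample_nouns = [('pasadena', 7, 15, 'NOUN'),
                ('store', 16, 21, 'NOUN'),
                ('sunday', 22, 28, 'NOUN')]
sample_named_ents = [('sunday', 22, 28, 'DATE'),
                     ('am', 29, 31, 'TIME')]

merged_sample = merge_named_nouns(sample_nouns, sample_named_ents)
print(merged_sample)
```

Note that 'am' disappears: a named entity that doesn't begin at a noun never makes it into the result. That is handy for dropping purely numerical entities, but it is worth keeping in mind when something you expected goes missing.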

In [360]:
add_named_nouns(df)
display(df)

Unnamed: 0,queries,named_ents,nouns,named_nouns
0,i need to reschedule my appointment which was ...,"[(11/24, 60, 65, CARDINAL), (1267154747, 82, 9...","[(appointment, 24, 35, NOUN), (order, 68, 73, ...","[(appointment, 24, 35, NOUN), (order, 68, 73, ..."
1,this is the third time i made an appointment w...,"[(third, 12, 17, ORDINAL), (ny, 70, 72, GPE), ...","[(time, 18, 22, NOUN), (appointment, 33, 44, N...","[(time, 18, 22, NOUN), (appointment, 33, 44, N..."
2,why if i make a appointment for 8 am do they t...,"[(8, 32, 33, DATE), (30, 159, 161, QUANTITY), ...","[(appointment, 16, 27, NOUN), (min, 86, 89, NO...","[(appointment, 16, 27, NOUN), (min, 86, 89, NO..."
3,called pasadena store sunday am . asked how lo...,"[(sunday, 22, 28, DATE), (am, 29, 31, TIME), (...","[(pasadena, 7, 15, NOUN), (store, 16, 21, NOUN...","[(pasadena, 7, 15, NOUN), (store, 16, 21, NOUN..."
4,is it normal to have a 2 pm appointment for ti...,"[(2 pm, 23, 27, TIME), (2 hours, 203, 210, TIM...","[(pm, 25, 27, NOUN), (appointment, 28, 39, NOU...","[(pm, 25, 27, NOUN), (appointment, 28, 39, NOU..."


In [361]:
column = 'named_nouns'
render_entities(0, df, options=options, column=column)

## Step 4: Extract noun phrases (v2)

In [362]:
def extract_noun_phrases(query):
    return [(chunk.text, chunk.start_char, chunk.end_char, chunk.label_) for chunk in nlp(query).noun_chunks]

def add_noun_phrases(df):
    df['noun_phrases'] = df['queries'].apply(extract_noun_phrases)

In [363]:
add_noun_phrases(df)
display(df)

Unnamed: 0,queries,named_ents,nouns,named_nouns,noun_phrases
0,i need to reschedule my appointment which was ...,"[(11/24, 60, 65, CARDINAL), (1267154747, 82, 9...","[(appointment, 24, 35, NOUN), (order, 68, 73, ...","[(appointment, 24, 35, NOUN), (order, 68, 73, ...","[(i, 0, 1, NP), (my appointment, 21, 35, NP), ..."
1,this is the third time i made an appointment w...,"[(third, 12, 17, ORDINAL), (ny, 70, 72, GPE), ...","[(time, 18, 22, NOUN), (appointment, 33, 44, N...","[(time, 18, 22, NOUN), (appointment, 33, 44, N...","[(the third time, 8, 22, NP), (i, 23, 24, NP),..."
2,why if i make a appointment for 8 am do they t...,"[(8, 32, 33, DATE), (30, 159, 161, QUANTITY), ...","[(appointment, 16, 27, NOUN), (min, 86, 89, NO...","[(appointment, 16, 27, NOUN), (min, 86, 89, NO...","[(i, 7, 8, NP), (a appointment, 14, 27, NP), (..."
3,called pasadena store sunday am . asked how lo...,"[(sunday, 22, 28, DATE), (am, 29, 31, TIME), (...","[(pasadena, 7, 15, NOUN), (store, 16, 21, NOUN...","[(pasadena, 7, 15, NOUN), (store, 16, 21, NOUN...","[(sunday, 22, 28, NP), (the wait, 49, 57, NP),..."
4,is it normal to have a 2 pm appointment for ti...,"[(2 pm, 23, 27, TIME), (2 hours, 203, 210, TIM...","[(pm, 25, 27, NOUN), (appointment, 28, 39, NOU...","[(pm, 25, 27, NOUN), (appointment, 28, 39, NOU...","[(it, 3, 5, NP), (a 2 pm appointment, 21, 39, ..."


In [364]:
column = 'noun_phrases'
render_entities(3, df, options=options, column=column)

## Step 5: Extract compound noun phrases (v3)

In [365]:
def extract_compounds(query):
    compound_nps = []
    for tok in nlp(query):
        if tok.dep_ == 'compound':
            compound = ' '.join([tok.text, tok.head.text])
            compound_nps.append((compound, tok.idx, tok.idx+len(compound), tok.dep_.upper()))
    return compound_nps 

def add_compounds(df):
    df['compounds'] = df['queries'].apply(extract_compounds)
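
One thing to watch in `extract_compounds`: the end offset is computed as `tok.idx+len(compound)`, which only lines up when the head word directly follows the modifier, separated by a single space. Below is a sketch of a slightly more defensive variant that derives the span from both token positions. The `Tok` record is a hypothetical stand-in for a spaCy token, and the dependency labels in the sample are hand-assigned.

```python
from typing import NamedTuple

class Tok(NamedTuple):
    """Hypothetical stand-in for the spaCy token fields we need."""
    text: str
    idx: int    # character offset of the token in the text
    dep: str    # dependency label
    head: int   # position of the head token in the token list

def compound_spans(text, toks):
    """Return (span_text, start, end, 'COMPOUND') for each compound modifier."""
    spans = []
    for tok in toks:
        if tok.dep == 'compound':
            head = toks[tok.head]
            # Cover both tokens, whichever comes first in the text.
            start = min(tok.idx, head.idx)
            end = max(tok.idx + len(tok.text), head.idx + len(head.text))
            spans.append((text[start:end], start, end, 'COMPOUND'))
    return spans

sample_text = 'called pasadena store sunday am'
sample_toks = [
    Tok('called', 0, 'ROOT', 0),
    Tok('pasadena', 7, 'compound', 2),  # modifies 'store'
    Tok('store', 16, 'dobj', 0),
    Tok('sunday', 22, 'npadvmod', 0),
    Tok('am', 29, 'npadvmod', 0),
]

sample_compounds = compound_spans(sample_text, sample_toks)
print(sample_compounds)
```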

In [366]:
add_compounds(df)
display(df)

Unnamed: 0,queries,named_ents,nouns,named_nouns,noun_phrases,compounds
0,i need to reschedule my appointment which was ...,"[(11/24, 60, 65, CARDINAL), (1267154747, 82, 9...","[(appointment, 24, 35, NOUN), (order, 68, 73, ...","[(appointment, 24, 35, NOUN), (order, 68, 73, ...","[(i, 0, 1, NP), (my appointment, 21, 35, NP), ...","[(order number, 68, 80, COMPOUND), (appointmen..."
1,this is the third time i made an appointment w...,"[(third, 12, 17, ORDINAL), (ny, 70, 72, GPE), ...","[(time, 18, 22, NOUN), (appointment, 33, 44, N...","[(time, 18, 22, NOUN), (appointment, 33, 44, N...","[(the third time, 8, 22, NP), (i, 23, 24, NP),...","[(acme corp, 50, 59, COMPOUND), (corp ny, 55, ..."
2,why if i make a appointment for 8 am do they t...,"[(8, 32, 33, DATE), (30, 159, 161, QUANTITY), ...","[(appointment, 16, 27, NOUN), (min, 86, 89, NO...","[(appointment, 16, 27, NOUN), (min, 86, 89, NO...","[(i, 7, 8, NP), (a appointment, 14, 27, NP), (...",[]
3,called pasadena store sunday am . asked how lo...,"[(sunday, 22, 28, DATE), (am, 29, 31, TIME), (...","[(pasadena, 7, 15, NOUN), (store, 16, 21, NOUN...","[(pasadena, 7, 15, NOUN), (store, 16, 21, NOUN...","[(sunday, 22, 28, NP), (the wait, 49, 57, NP),...","[(pasadena store, 7, 21, COMPOUND), (oil chang..."
4,is it normal to have a 2 pm appointment for ti...,"[(2 pm, 23, 27, TIME), (2 hours, 203, 210, TIM...","[(pm, 25, 27, NOUN), (appointment, 28, 39, NOU...","[(pm, 25, 27, NOUN), (appointment, 28, 39, NOU...","[(it, 3, 5, NP), (a 2 pm appointment, 21, 39, ...","[(pm appointment, 25, 39, COMPOUND), (tire bal..."


In [367]:
column = 'compounds'
render_entities(3, df, options=options, column=column)

## Step 6: Combine noun/propn + named entities, and compound noun phrases (v4)

In [368]:
def extract_comp_nouns(row_series, cols=[]):
    return {noun_tuple[0] for col in cols for noun_tuple in row_series[col]}

def add_comp_nouns(df, cols=[]):
    df['comp_nouns'] = df.apply(extract_comp_nouns, axis=1, cols=cols)

In [369]:
cols = ['nouns', 'compounds']
add_comp_nouns(df, cols=cols)
display(df)

Unnamed: 0,queries,named_ents,nouns,named_nouns,noun_phrases,compounds,comp_nouns
0,i need to reschedule my appointment which was ...,"[(11/24, 60, 65, CARDINAL), (1267154747, 82, 9...","[(appointment, 24, 35, NOUN), (order, 68, 73, ...","[(appointment, 24, 35, NOUN), (order, 68, 73, ...","[(i, 0, 1, NP), (my appointment, 21, 35, NP), ...","[(order number, 68, 80, COMPOUND), (appointmen...","{appointment date, location, type, number, com..."
1,this is the third time i made an appointment w...,"[(third, 12, 17, ORDINAL), (ny, 70, 72, GPE), ...","[(time, 18, 22, NOUN), (appointment, 33, 44, N...","[(time, 18, 22, NOUN), (appointment, 33, 44, N...","[(the third time, 8, 22, NP), (i, 23, 24, NP),...","[(acme corp, 50, 59, COMPOUND), (corp ny, 55, ...","{confirmation, strikes, internet, print, corp,..."
2,why if i make a appointment for 8 am do they t...,"[(8, 32, 33, DATE), (30, 159, 161, QUANTITY), ...","[(appointment, 16, 27, NOUN), (min, 86, 89, NO...","[(appointment, 16, 27, NOUN), (min, 86, 89, NO...","[(i, 7, 8, NP), (a appointment, 14, 27, NP), (...",[],"{elizabeth, get, min, car, service, battery, a..."
3,called pasadena store sunday am . asked how lo...,"[(sunday, 22, 28, DATE), (am, 29, 31, TIME), (...","[(pasadena, 7, 15, NOUN), (store, 16, 21, NOUN...","[(pasadena, 7, 15, NOUN), (store, 16, 21, NOUN...","[(sunday, 22, 28, NP), (the wait, 49, 57, NP),...","[(pasadena store, 7, 21, COMPOUND), (oil chang...","{oil change, wait, appts, store, phone, i, m, ..."
4,is it normal to have a 2 pm appointment for ti...,"[(2 pm, 23, 27, TIME), (2 hours, 203, 210, TIM...","[(pm, 25, 27, NOUN), (appointment, 28, 39, NOU...","[(pm, 25, 27, NOUN), (appointment, 28, 39, NOU...","[(it, 3, 5, NP), (a 2 pm appointment, 21, 39, ...","[(pm appointment, 25, 39, COMPOUND), (tire bal...","{balance, tire, reason, pm, tire balance, appo..."


## Step 7: Heuristic Entity Reduction

In [370]:
def drop_duplicate_np_splits(ents):
    drop_ents = set()
    for ent in ents:
        if len(ent.split(' ')) > 1:
            for e in ent.split(' '):
                if e in ents:
                    drop_ents.add(e)
    return ents - drop_ents

def drop_double_char(ents):
    drop_ents = {ent for ent in ents if len(ent) < 3}
    return ents - drop_ents

def drop_syms(ents):
    # '\\' is a literal backslash; braces don't need escaping inside a string
    drop_char = list(';/$,.~`\\{}|\'[]<>?"!@#%^&*()_+-=')
    drop_ents = {ent for ent in ents for char in drop_char if ent.find(char) != -1}
    return ents - drop_ents

def drop_nums(ents):
    drop_char = list('0123456789')
    drop_ents = {ent for ent in ents for char in drop_char if ent.find(char) != -1}
    return ents - drop_ents

def drop_single_char_nps(ents):
    return {' '.join([e for e in ent.split(' ') if not len(e) == 1]) for ent in ents}

def add_clean_ents(df, funcs=[]): 
    col = 'clean_ents'
    df[col] = df['comp_nouns']
    for f in funcs:
        df[col] = df[col].apply(f)
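
To see what a chain of these filters does, here is a worked example on a toy set. The three functions below are compact restatements of some of the heuristics above, not drop-in replacements (for instance, `str.isdigit` also matches non-ASCII digits), and the sample set is made up.

```python
def drop_np_parts(ents):
    """Drop single words that already appear inside a multi-word entity."""
    parts = {word for ent in ents if ' ' in ent for word in ent.split(' ')}
    return ents - parts

def drop_short(ents, min_len=3):
    """Drop entities shorter than min_len characters."""
    return {ent for ent in ents if len(ent) >= min_len}

def drop_with_digits(ents):
    """Drop entities that contain any digit."""
    return {ent for ent in ents if not any(ch.isdigit() for ch in ent)}

# Hypothetical raw entity set for a single query.
sample_ents = {'oil change', 'oil', 'change', 'wait', 'i', 'm', 'ht51'}

cleaned = sample_ents
for f in (drop_np_parts, drop_short, drop_with_digits):
    cleaned = f(cleaned)  # each filter sees the output of the previous one

print(sorted(cleaned))
```

Order matters here: 'oil' and 'change' are only recognized as duplicates while 'oil change' is still in the set, so the split check should run before any filter that might remove the multi-word phrase.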

In [371]:
funcs = [drop_duplicate_np_splits, drop_double_char, drop_syms, drop_nums, drop_single_char_nps]
add_clean_ents(df, funcs)
display(df)

Unnamed: 0,queries,named_ents,nouns,named_nouns,noun_phrases,compounds,comp_nouns,clean_ents
0,i need to reschedule my appointment which was ...,"[(11/24, 60, 65, CARDINAL), (1267154747, 82, 9...","[(appointment, 24, 35, NOUN), (order, 68, 73, ...","[(appointment, 24, 35, NOUN), (order, 68, 73, ...","[(i, 0, 1, NP), (my appointment, 21, 35, NP), ...","[(order number, 68, 80, COMPOUND), (appointmen...","{appointment date, location, type, number, com...","{location, type, compensation, lack, tires, ti..."
1,this is the third time i made an appointment w...,"[(third, 12, 17, ORDINAL), (ny, 70, 72, GPE), ...","[(time, 18, 22, NOUN), (appointment, 33, 44, N...","[(time, 18, 22, NOUN), (appointment, 33, 44, N...","[(the third time, 8, 22, NP), (i, 23, 24, NP),...","[(acme corp, 50, 59, COMPOUND), (corp ny, 55, ...","{confirmation, strikes, internet, print, corp,...","{strikes, acme corp, internet, print, hour, ou..."
2,why if i make a appointment for 8 am do they t...,"[(8, 32, 33, DATE), (30, 159, 161, QUANTITY), ...","[(appointment, 16, 27, NOUN), (min, 86, 89, NO...","[(appointment, 16, 27, NOUN), (min, 86, 89, NO...","[(i, 7, 8, NP), (a appointment, 14, 27, NP), (...",[],"{elizabeth, get, min, car, service, battery, a...","{elizabeth, get, min, car, service, battery, a..."
3,called pasadena store sunday am . asked how lo...,"[(sunday, 22, 28, DATE), (am, 29, 31, TIME), (...","[(pasadena, 7, 15, NOUN), (store, 16, 21, NOUN...","[(pasadena, 7, 15, NOUN), (store, 16, 21, NOUN...","[(sunday, 22, 28, NP), (the wait, 49, 57, NP),...","[(pasadena store, 7, 21, COMPOUND), (oil chang...","{oil change, wait, appts, store, phone, i, m, ...","{person, login, oil change, pasadena store, wa..."
4,is it normal to have a 2 pm appointment for ti...,"[(2 pm, 23, 27, TIME), (2 hours, 203, 210, TIM...","[(pm, 25, 27, NOUN), (appointment, 28, 39, NOU...","[(pm, 25, 27, NOUN), (appointment, 28, 39, NOU...","[(it, 3, 5, NP), (a 2 pm appointment, 21, 39, ...","[(pm appointment, 25, 39, COMPOUND), (tire bal...","{balance, tire, reason, pm, tire balance, appo...","{reason, tire balance, pm appointment, rotatio..."


## Step 7: Extract alternate values

In [372]:
def show_similarity(idx, threshold=0.59):
    tokens = nlp(df['queries'][idx])
    inspect = []
    for tok in tokens:
        if tok.text in df['clean_ents'][idx]:
            inspect.append(tok)
    
    sims = set()
    sim_pairs = set()
    for tok1 in inspect:
        for tok2 in inspect:
            sim = tok1.similarity(tok2)
            # skip weak matches and self-matches (similarity of 1)
            if threshold < sim < 1 and sim not in sims:
                # (a, b) and (b, a) share a score, so tracking scores keeps one of them
                sim_pairs.add((tok1.text, tok2.text, sim))
                sims.add(sim)
    
    return sim_pairs

In [509]:
def extract_alt_values(row_series, threshold):       
    
    tokens = [tok for tok in nlp(row_series['queries']) if tok.text in row_series['clean_ents']]
    
    sims = set()
    sim_pairs = set()
    for tok1 in tokens:
        for tok2 in tokens:
            sim = tok1.similarity(tok2)
            # skip weak matches and self-matches (similarity of 1)
            if threshold < sim < 1 and sim not in sims:
                # (a, b) and (b, a) share a score, so tracking scores keeps one of them
                sim_pairs.add((tok1.text, tok2.text))
                sims.add(sim)
    
    alt_dict = {}
    for alt1 in sim_pairs:
        alt_values = set(alt1)
        for alt2 in sim_pairs:
            if len(alt_values.intersection(set(alt2))) > 0:
                alt_values = alt_values.union(set(alt2))
            
        ent = max(alt_values, key=len)
        alt_values.remove(ent)
        alt_dict[ent] = list(alt_values)
    
    return alt_dict

def add_alt_values(df, threshold):
    df['alt_values'] = df.apply(extract_alt_values, axis=1, threshold=threshold)
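
The idea behind the alternates is easier to see without a language model in the way. The sketch below uses made-up three-dimensional vectors in place of spaCy's word vectors, computes cosine similarity by hand, and groups words whose similarity clears the threshold. It is not the function above: I've swapped in a simple merge of overlapping groups (connected components), and `group_alternates` and the toy vectors are my own assumptions.

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def group_alternates(vectors, threshold=0.59):
    """Group similar words; the longest word in each group is the entity."""
    groups = []
    # combinations() visits each unordered pair once, so no self-pairs
    # and no (a, b) / (b, a) duplicates.
    for w1, w2 in combinations(vectors, 2):
        if cosine(vectors[w1], vectors[w2]) > threshold:
            merged = {w1, w2}
            untouched = []
            for group in groups:
                if group & merged:
                    merged |= group  # fold overlapping groups together
                else:
                    untouched.append(group)
            groups = untouched + [merged]

    alt_dict = {}
    for group in groups:
        # Longest word wins; ties are broken alphabetically for determinism.
        ent = max(group, key=lambda w: (len(w), w))
        alt_dict[ent] = sorted(group - {ent})
    return alt_dict

# Toy vectors, invented for illustration only.
toy_vectors = {
    'appointment': (1.0, 0.0, 0.0),
    'appt': (0.9, 0.1, 0.0),
    'phone': (0.0, 1.0, 0.0),
    'call': (0.0, 0.8, 0.6),
}

toy_alternates = group_alternates(toy_vectors)
print(toy_alternates)
```

Picking the longest word as the canonical entity is the same heuristic used in `extract_alt_values`; it works nicely for abbreviations like 'appt' but is only a rule of thumb.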

In [510]:
threshold = 0.59
add_alt_values(df, threshold=threshold)
display(df)

Unnamed: 0,queries,named_ents,nouns,named_nouns,noun_phrases,compounds,comp_nouns,clean_ents,alt_values,true_ents
0,i need to reschedule my appointment which was ...,"[(11/24, 60, 65, CARDINAL), (1267154747, 82, 9...","[(appointment, 24, 35, NOUN), (order, 68, 73, ...","[(appointment, 24, 35, NOUN), (order, 68, 73, ...","[(i, 0, 1, NP), (my appointment, 21, 35, NP), ...","[(order number, 68, 80, COMPOUND), (appointmen...","{appointment date, location, type, number, com...","{location, type, compensation, lack, tires, ti...",{},"{'location': [], 'type': [], 'compensation': [..."
1,this is the third time i made an appointment w...,"[(third, 12, 17, ORDINAL), (ny, 70, 72, GPE), ...","[(time, 18, 22, NOUN), (appointment, 33, 44, N...","[(time, 18, 22, NOUN), (appointment, 33, 44, N...","[(the third time, 8, 22, NP), (i, 23, 24, NP),...","[(acme corp, 50, 59, COMPOUND), (corp ny, 55, ...","{confirmation, strikes, internet, print, corp,...","{strikes, acme corp, internet, print, hour, ou...","{'time': ['hour', 'out']}","{'strikes': [], 'acme corp': [], 'internet': [..."
2,why if i make a appointment for 8 am do they t...,"[(8, 32, 33, DATE), (30, 159, 161, QUANTITY), ...","[(appointment, 16, 27, NOUN), (min, 86, 89, NO...","[(appointment, 16, 27, NOUN), (min, 86, 89, NO...","[(i, 7, 8, NP), (a appointment, 14, 27, NP), (...",[],"{elizabeth, get, min, car, service, battery, a...","{elizabeth, get, min, car, service, battery, a...","{'anything': ['get'], 'appointments': ['appoin...","{'elizabeth': [], 'min': [], 'car': [], 'servi..."
3,called pasadena store sunday am . asked how lo...,"[(sunday, 22, 28, DATE), (am, 29, 31, TIME), (...","[(pasadena, 7, 15, NOUN), (store, 16, 21, NOUN...","[(pasadena, 7, 15, NOUN), (store, 16, 21, NOUN...","[(sunday, 22, 28, NP), (the wait, 49, 57, NP),...","[(pasadena store, 7, 21, COMPOUND), (oil chang...","{oil change, wait, appts, store, phone, i, m, ...","{person, login, oil change, pasadena store, wa...","{'appointment': ['appt', 'appts'], 'phone': ['...","{'person': [], 'login': [], 'oil change': [], ..."
4,is it normal to have a 2 pm appointment for ti...,"[(2 pm, 23, 27, TIME), (2 hours, 203, 210, TIM...","[(pm, 25, 27, NOUN), (appointment, 28, 39, NOU...","[(pm, 25, 27, NOUN), (appointment, 28, 39, NOU...","[(it, 3, 5, NP), (a 2 pm appointment, 21, 39, ...","[(pm appointment, 25, 39, COMPOUND), (tire bal...","{balance, tire, reason, pm, tire balance, appo...","{reason, tire balance, pm appointment, rotatio...",{},"{'reason': [], 'tire balance': [], 'pm appoint..."


In [511]:
idx = 3
df['alt_values'][idx]

{'appointment': ['appt', 'appts'], 'phone': ['call']}

In [512]:
show_similarity(idx)

{('appointment', 'appt', 0.59974325),
 ('appts', 'appt', 0.72425157),
 ('call', 'phone', 0.6134473)}

## Step 8: Combine entities and alternate values

In [513]:
def extract_true_ents(row_series):
    ents = row_series['clean_ents'] - set([v for vs in row_series['alt_values'].values() for v in vs])
    true_ents = {}
    for ent in ents:
        true_ents[ent] = []
    true_ents.update(row_series['alt_values'])
    return true_ents

def add_true_ents(df):
    df['true_ents'] = df.apply(extract_true_ents, axis=1)
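
Here is the same combination step on a toy row, so you can see what ends up where. The sample values are hypothetical, loosely modeled on the outputs shown in this notebook.

```python
# Hypothetical cleaned entities and alternates for one query.
sample_clean_ents = {'appointment', 'appt', 'appts', 'phone', 'call', 'wait'}
sample_alt_values = {'appointment': ['appt', 'appts'], 'phone': ['call']}

# Anything already listed as an alternate should not become its own entity.
alternates = {alt for alts in sample_alt_values.values() for alt in alts}

# Entities without alternates map to an empty list...
sample_true_ents = {ent: [] for ent in sample_clean_ents - alternates}
# ...and entities with alternates overwrite their empty placeholder.
sample_true_ents.update(sample_alt_values)

print(sample_true_ents)
```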

In [514]:
add_true_ents(df)
display(df)

Unnamed: 0,queries,named_ents,nouns,named_nouns,noun_phrases,compounds,comp_nouns,clean_ents,alt_values,true_ents
0,i need to reschedule my appointment which was ...,"[(11/24, 60, 65, CARDINAL), (1267154747, 82, 9...","[(appointment, 24, 35, NOUN), (order, 68, 73, ...","[(appointment, 24, 35, NOUN), (order, 68, 73, ...","[(i, 0, 1, NP), (my appointment, 21, 35, NP), ...","[(order number, 68, 80, COMPOUND), (appointmen...","{appointment date, location, type, number, com...","{location, type, compensation, lack, tires, ti...",{},"{'location': [], 'type': [], 'compensation': [..."
1,this is the third time i made an appointment w...,"[(third, 12, 17, ORDINAL), (ny, 70, 72, GPE), ...","[(time, 18, 22, NOUN), (appointment, 33, 44, N...","[(time, 18, 22, NOUN), (appointment, 33, 44, N...","[(the third time, 8, 22, NP), (i, 23, 24, NP),...","[(acme corp, 50, 59, COMPOUND), (corp ny, 55, ...","{confirmation, strikes, internet, print, corp,...","{strikes, acme corp, internet, print, hour, ou...","{'time': ['hour', 'out']}","{'strikes': [], 'acme corp': [], 'internet': [..."
2,why if i make a appointment for 8 am do they t...,"[(8, 32, 33, DATE), (30, 159, 161, QUANTITY), ...","[(appointment, 16, 27, NOUN), (min, 86, 89, NO...","[(appointment, 16, 27, NOUN), (min, 86, 89, NO...","[(i, 7, 8, NP), (a appointment, 14, 27, NP), (...",[],"{elizabeth, get, min, car, service, battery, a...","{elizabeth, get, min, car, service, battery, a...","{'anything': ['get'], 'appointments': ['appoin...","{'elizabeth': [], 'min': [], 'car': [], 'servi..."
3,called pasadena store sunday am . asked how lo...,"[(sunday, 22, 28, DATE), (am, 29, 31, TIME), (...","[(pasadena, 7, 15, NOUN), (store, 16, 21, NOUN...","[(pasadena, 7, 15, NOUN), (store, 16, 21, NOUN...","[(sunday, 22, 28, NP), (the wait, 49, 57, NP),...","[(pasadena store, 7, 21, COMPOUND), (oil chang...","{oil change, wait, appts, store, phone, i, m, ...","{person, login, oil change, pasadena store, wa...","{'appointment': ['appt', 'appts'], 'phone': ['...","{'person': [], 'login': [], 'oil change': [], ..."
4,is it normal to have a 2 pm appointment for ti...,"[(2 pm, 23, 27, TIME), (2 hours, 203, 210, TIM...","[(pm, 25, 27, NOUN), (appointment, 28, 39, NOU...","[(pm, 25, 27, NOUN), (appointment, 28, 39, NOU...","[(it, 3, 5, NP), (a 2 pm appointment, 21, 39, ...","[(pm appointment, 25, 39, COMPOUND), (tire bal...","{balance, tire, reason, pm, tire balance, appo...","{reason, tire balance, pm appointment, rotatio...",{},"{'reason': [], 'tire balance': [], 'pm appoint..."


In [521]:
true_ent_dict = {}
for ent_dict in df['true_ents']:
    for ent, alts in ent_dict.items():
        if len(alts) > 0:
            if ent in true_ent_dict.keys():
                true_ent_dict[ent].extend(ent_dict[ent])
            else:
                true_ent_dict[ent] = ent_dict[ent]
        else:
            true_ent_dict[ent] = []
            
true_ent_dict

{'location': [],
 'type': [],
 'compensation': [],
 'lack': [],
 'tires': [],
 'time': ['hour', 'out'],
 'order number': [],
 'follow': [],
 'appointment date': [],
 'strikes': [],
 'acme corp': [],
 'internet': [],
 'print': [],
 'email confirmation': [],
 'service': [],
 'patchogue ny': [],
 'appointment': ['appt', 'appts'],
 'home': [],
 'corp ny': [],
 'elizabeth': [],
 'min': [],
 'car': [],
 'battery': [],
 'anything': ['get'],
 'appointments': ['appointment'],
 'person': [],
 'login': [],
 'oil change': [],
 'pasadena store': [],
 'wait': [],
 'sunday': [],
 'richard': [],
 'way': [],
 'hours': [],
 'phone': ['call'],
 'reason': [],
 'tire balance': [],
 'pm appointment': [],
 'rotation': []}

In [526]:
new_dict = {}
for k1, v1 in true_ent_dict.items():
    # Collect the alternates of every entry that lists k1 as an alternate
    alt_values = set()
    for k2, v2 in true_ent_dict.items():
        if k1 in v2:
            alt_values |= set(v2)

    # Most entities are never listed as an alternate; skip them so max() is not called on an empty set
    if not alt_values:
        continue

    ent = max(alt_values, key=len)
    alt_values.remove(ent)
    new_dict[ent] = list(alt_values)

new_dict

## Step 9: Create global alternates


In [None]:
# TODO: combine all queries, extract tokens in entities, and then try to get alternatives based on similarities.
# TODO: combine alternatives
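
One possible way to approach the two TODOs above, sketched with the standard library only. `suggest_alternates` and `merge_alternates` are hypothetical helpers that do not exist in this notebook, and `difflib` string similarity with a cutoff of 0.8 is an assumption standing in for whatever similarity measure ends up being used.

```python
from difflib import get_close_matches


def suggest_alternates(entities, cutoff=0.8):
    """Map each entity to other entities with a similar spelling (hypothetical helper)."""
    ordered = sorted(entities)
    suggestions = {}
    for ent in ordered:
        others = [e for e in ordered if e != ent]
        # get_close_matches ranks candidates by difflib's similarity ratio
        suggestions[ent] = get_close_matches(ent, others, n=5, cutoff=cutoff)
    return suggestions


def merge_alternates(alt_dict):
    """Merge entries that share any term into one group keyed by its longest term."""
    groups = []
    for ent, alts in alt_dict.items():
        terms = {ent, *alts}
        # Absorb every existing group that overlaps with this entry
        for group in [g for g in groups if g & terms]:
            terms |= group
            groups.remove(group)
        groups.append(terms)

    merged = {}
    for terms in groups:
        # Sorting first makes the choice deterministic when lengths tie
        canonical = max(sorted(terms), key=len)
        merged[canonical] = sorted(terms - {canonical})
    return merged


example = {
    'appointment': ['appt', 'appts'],
    'appointments': ['appointment'],
    'phone': ['call'],
    'car': [],
}
merged_example = merge_alternates(example)
similar_example = suggest_alternates({'michelin', 'michellin', 'battery'})
```

Picking the longest term as the canonical name mirrors the `max(..., key=len)` choice used earlier in the notebook. The pairwise comparison in `suggest_alternates` is quadratic in the number of entities, so it would likely need blocking or sampling before being run on the full entity set.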

In [398]:
columns = [f'alternate_{i}' for i in range(len(max(alt_dict.values(), key=len)))]
pd.DataFrame.from_dict(alt_dict, orient='index', columns=columns)

Unnamed: 0,alternate_0,alternate_1
appointment,appt,appts
phone,call,


## Extract all unique entities

In [235]:
def extract_all_entities(df, col):
    return {entity for group in df[col] for entity in group}

In [236]:
entities = extract_all_entities(df, 'entities_v4')
entities

{'answer',
 'ht51',
 'email address',
 'sunrise',
 'scrambler s',
 'city store',
 'spacers',
 'aparece muchos',
 'word',
 'gates cap',
 'complaint',
 'dash mount',
 'rouge',
 'corp customer',
 'bosch gauges',
 'ha;f',
 'rotation',
 'sacramento',
 'mike',
 'toyotya le',
 'performance mods',
 'tire sensors',
 'milford boy',
 'matrix driver',
 'arabia',
 'chevy tahoe',
 'gxp',
 'intake hose',
 'password',
 'bench',
 '4000 lb',
 'act',
 'aspen',
 'okay thanks',
 'disk',
 'axle',
 'dickson',
 'chevrolet',
 'sr7087',
 'hubcap',
 'position',
 'gmc pickup',
 'windshield',
 'igition',
 'style coils',
 'chevy tie',
 'tow service',
 'anaheim',
 'fram',
 'handling',
 'rs5144',
 'thing',
 'brz',
 'brembo calipers',
 'pickup 12/7/2017',
 'side mirror',
 'v2 tire',
 'rears',
 'information lance',
 'pad elizabeth',
 'nerd',
 's10 4.3l',
 'acme store',
 'pickup date',
 'trunk lifts',
 'cylinder head',
 'cv half',
 'w. hillsborough',
 'air cleaner',
 'shoes',
 'accord',
 'sri lanka',
 'cover audis',
 'p

In [238]:
entity_dump_df = pd.DataFrame(pd.Series(list(entities)), columns=['entities'])
entity_dump_df

Unnamed: 0,entities
0,answer
1,ht51
...,...
7597,s10 chevy
7598,macon mo


In [239]:
entity_dump_df.to_csv('extracted_entities.csv', index=False, header=True)

## Load Entities From File

In [10]:
entities = pd.read_csv('extracted_entities.csv')
entities = set(entities['entities'])
entities

{'truck lighting',
 'shore',
 '4.3 1500',
 'rusx171',
 'carbondale',
 'jay',
 'year warranty',
 'tonneau lock',
 'motor size',
 'spectra',
 'specialty',
 'saturn ion',
 'carson)today',
 'instillation',
 'michellin',
 'technicians',
 'resevoir assembly',
 'defogger',
 'envelope',
 'disposal',
 'al jack',
 '400cc',
 'subaru forester',
 'rotation etc',
 'passenger assembly',
 'base place',
 'caravan drivers',
 'esta cuenta',
 'gallon drum',
 'cp7818',
 'montgomery rd',
 'challenger srt8',
 '300sd mercedes',
 'fleet service',
 'ac connector',
 'rainstorm',
 'chrysler 300',
 'w/',
 'home address',
 'sportsman',
 'spectre',
 'job front',
 'letters',
 'psi',
 'signature ii',
 'straps',
 'armrest',
 'wanna',
 'connecters',
 'stock ez',
 'run',
 'morning',
 '650bagm',
 'appointments',
 'push',
 'richmond',
 'ion',
 'mdx',
 'store help',
 'turlock',
 'surface',
 'engine belt',
 '94h',
 'malibu',
 'dodge journey',
 'workers',
 'business',
 'cabin air',
 '45zr',
 'battery pjbh5',
 'deflector avs',
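
One thing worth checking when reloading the dump: by default `pd.read_csv` converts strings such as `null`, `nan`, and `n/a` to missing values, and free-text entities can plausibly contain those tokens. The sketch below uses a small in-memory stand-in for `extracted_entities.csv` (the sample rows are made up for illustration) to compare the default read with `dtype=str, keep_default_na=False`.

```python
import io

import pandas as pd

# Made-up stand-in for extracted_entities.csv; not real rows from the dump
csv_text = "entities\nappt\nnull\nnan\n400\n"

# Default parsing: 'null' and 'nan' are read as missing values
default_read = pd.read_csv(io.StringIO(csv_text))

# Keep every cell as the literal string that was written to the file
safe_read = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

missing_count = int(default_read['entities'].isna().sum())
safe_entities = set(safe_read['entities'])
```

With the second form, the set built from the column contains only strings, so later string operations on entities do not trip over float `NaN` values.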

In [139]:
ents = ['h world', 'chystler', 'mo']



# .at writes the cell directly and avoids pandas' chained-assignment pitfalls
df.at[2, 'entities_v4'] = drop_single_char_nps(df.at[2, 'entities_v4'])

In [213]:
# TODO: combine alternate values into parent entity
# TODO: remove all the stop words and small talk ones
# TODO: fix noun phrases to fit company entities models
# TODO: generate xls with true_entity and alternates
# TODO: generate summary stats for full project
# TODO: start working on entity types
# TODO: remove common words from word web or from english language - try and narrow to US Language 
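
A minimal sketch of the "remove stop words and small talk" TODO, assuming a hand-written stop list. `SMALL_TALK`, `is_noise`, and `filter_entities` are hypothetical names, and the list contents are only illustrative; a curated stop-word resource would replace them in practice.

```python
import re

# Hypothetical stop list for illustration only
SMALL_TALK = {'okay', 'thanks', 'wanna', 'thing', 'way', 'anything'}


def is_noise(entity, stop_words=SMALL_TALK):
    """Return True when no token of two or more characters survives the stop list."""
    tokens = re.findall(r"[a-z0-9]+", entity.lower())
    content = [t for t in tokens if len(t) > 1 and t not in stop_words]
    return not content


def filter_entities(entities, stop_words=SMALL_TALK):
    """Keep only entities that contain at least one content token."""
    return {e for e in entities if not is_noise(e, stop_words)}


sample = {'okay thanks', 'wanna', 'w/', 'thing', 'oil change', 'chevy tahoe'}
kept = filter_entities(sample)
```

This filter only drops entities made up entirely of stop words or single characters, so a phrase like `scrambler s` is kept because `scrambler` still carries content.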

## Notes
- We should do entity extraction AFTER true intent, not before; otherwise it's really easy to get nonsense sentences in the training data that aren't relevant to the end goal.

