<img src='https://spacy.io/assets/img/pipeline.svg'><font size='1'><center>Source: [spaCy Language Processing Pipelines](https://spacy.io/usage/processing-pipelines)</center></font>

# Quick and Dirty - Entity Extraction

<font size='4'><font color='gray'>From idea to prototype in AI.</font></font>

<em>If you've ever been around a startup or in the tech world for any significant amount of time, you've <strong><em>definitely</em></strong> encountered some, if not all, of the following phrases: "agile software development", "prototyping", "feedback loop", "rapid iteration", etc. 

This Silicon Valley techno-babble can be distilled down to one simple concept, which just so happens to be the mantra of many a successful entrepreneur: test out your idea as quickly as possible, and then make it better over time. Stated more verbosely, before you invest mind and money into creating a cutting-edge solution to a problem, it might benefit you to get a baseline performance for your task using off-the-shelf techniques. Once you establish the efficacy of a low-cost, easy approach, you can then put on your Elon Musk hat and drive towards #innovation and #disruption. 

A concrete example might help illustrate this point: </em>

## Introduction

### Entity Extraction

Let's say our goal was to create a natural language system that effectively allowed someone to converse with an academic paper. This task could be step one of many towards the development of an automated scientific discovery tool. Society can thank us later. 

But where do we begin? Well, a part of the solution has to deal with [knowledge extraction](https://en.wikipedia.org/wiki/Knowledge_extraction). In order to create a conversational engine that understands scientific papers, we'll first need to develop an entity recognition module, and this, lucky for us, is the topic of our notebook! 

"What's an entity?" you ask? Excellent question. Take a look at the following sentence:

> Dr. Abraham is the primary author of this paper, and a physician in the specialty of internal medicine.

Now, it should be relatively straightforward for an English-speaking human to pick out the important concepts in this sentence:

> **[Dr. Abraham]** is the **[primary author]** of this **[paper]**, and a **[physician]** in the **[specialty]** of **[internal medicine]**.

These words and/or phrases are categorized as "entities" because they represent salient ideas, nouns, and noun phrases in the real world. A subset of entities can be "named", in that they correspond to <strong><em>specific</em></strong> places, people, organizations, and so on. A [named entity](https://en.wikipedia.org/wiki/Named_entity) is to a regular entity what "Dr. Abraham" is to a "physician". The good doctor is a real person and an instance of the "physician" class, and is therefore considered "named". Examples of named entities include "Google", "Neil deGrasse Tyson", and "Tokyo", while regular, garden-variety entities can include the list just mentioned, as well as things like "dog", "newspaper", "task", etc.
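
To make the distinction concrete, here is a minimal sketch that hand-labels the example sentence. The `Entity` tuple, its `named` flag, and the offsets found with `str.find` are illustrative assumptions of mine, not the output of any model.

```python
from typing import NamedTuple

class Entity(NamedTuple):
    text: str
    start: int   # character offset where the entity begins
    end: int     # character offset just past the entity
    named: bool  # True if it refers to one specific real-world thing

sentence = ("Dr. Abraham is the primary author of this paper, and a physician "
            "in the specialty of internal medicine.")

# Hand-picked labels for illustration only (not produced by a model).
labels = [("Dr. Abraham", True), ("primary author", False), ("paper", False),
          ("physician", False), ("specialty", False), ("internal medicine", False)]

entities = []
for text, named in labels:
    start = sentence.find(text)  # first occurrence is enough for this sentence
    entities.append(Entity(text, start, start + len(text), named))

# Named entities are the subset flagged as referring to a specific thing.
named_entities = [e for e in entities if e.named]
```

Every named entity is also an entity, but only "Dr. Abraham" makes it into `named_entities` here.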

Let's see if we can get a computer to run this kind of analysis to pull important concepts from sentences. 

### The Task

For our conversational academic paper program, we won't be satisfied with simply capturing named entities, because we need to understand the relationships between general concepts as well as actual things, places, etc. Unfortunately, while most out-of-the-box text processing libraries have a moderately useful <strong>named entity recognizer</strong>, they have little to no support for a generalized <strong>entity recognizer</strong>. 

This is because of a subtle, yet important constraint. 

Entities, as we've discussed, correspond to a superset of named entities, which <strong><em>should</em></strong> make them easier to extract. Indeed, blindly pulling all entities from a text source is in fact simple, but it's sadly not all that useful. In order to justify this exercise, we'd need to develop an entity extraction approach that is restricted to, or is cognizant of, some particular domain, for example, neuroscience, psychology, computer science, economics, etc. This paradoxical complexity makes it nontrivial to create a generic, but useful, entity recognizer. Hence the lack of support in most open-source libraries that deal with natural language processing. 

To largely simplify our task then, we must generate a set of entities from a scientific paper, that is <strong><em>larger</em></strong> than a simple list of named entities, but <strong><em>smaller</em></strong> than the giant list of all entities, restricted to the domain of a particular paper in question. 
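
Phrased in terms of sets, the target sits strictly between the two extremes. The toy sketch below uses hand-written sets; every value in it, including the `domain_terms` vocabulary standing in for "restricted to a domain", is hypothetical.

```python
# Toy illustration with hand-written sets; every value here is hypothetical.
all_entities = {"google", "crowdsourcing", "label noise", "rest", "popularity",
                "machine learning applications"}
named_entities = {"google"}

# Assumed stand-in for domain awareness: a small machine learning vocabulary.
domain_terms = {"crowdsourcing", "label noise", "machine learning applications"}

# Keep named entities plus any general entity that belongs to the domain.
target = named_entities | (all_entities & domain_terms)

# The target is a proper superset of the named entities
# and a proper subset of all entities.
is_between = named_entities < target < all_entities
```

The hard part, of course, is that nobody hands us `domain_terms`; the rest of the notebook approximates it with heuristics.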

Yikes. Are you sweating a little? Because I am. Instead of reaching for some Ibuprofen and deep learning pills, let's make a prototype using a little ingenuity, simple open-source code, and a lot of heuristics. Hopefully, through this process, we'll also learn a bit about the text processing pipeline that brings understanding natural language into the realm of the possible. 

Enough chit-chat. Let's get to it!

## Imports

In [1]:
%load_ext autoreload
%autoreload 2

<em><strong>Fun fact</strong>: Curious about what 'autoreload' does? [Check this out](https://ipython.org/ipython-doc/3/config/extensions/autoreload.html).</em>

In [2]:
import pandas as pd
import spacy
from spacy.displacy.render import EntityRenderer
from IPython.display import display, HTML

## Utils and Prep

Let's do some basic housekeeping before we start diving headfirst into entity extraction. We'll need to deal with visualization, load up a language model, and of course, examine/set-up our data source.

### Show and Tell
Our prototype will lean heavily on a popular natural language processing (NLP) library known as spaCy, which also has a wonderful set of classes and methods defined to help visualize parts of the NLP pipeline. Up top, where we've imported modules, you'll have noticed that we're pulling 'EntityRenderer' from spaCy's displacy module, as we'll be repurposing some of this code for our... um... purposes. In general, this is a good exercise if you ever want to get your hands dirty and really learn how certain classes work in your friendly neighborhood open-source projects. Nothing should ever be off-limits or a black box; always dissect and play with your code before you eat it.  

Wander on over to spaCy's [website](https://spacy.io/), and you'll quickly discover that they've put in some serious thought into making the user interface absolutely gorgeous. (While Matthew undeniably had some input on this, I'm going to make an intelligent assumption that the design ideas are probably Ines' [contribution](https://explosion.ai/about)). 

<em><strong>&lt;rant&gt;</strong> Why spend so much time discussing visualization? Well, one of my biggest pet peeves is this: even if you can create a product, if you don't put in the time to make it look beautiful, or delightful to use, then you don't care about packaging your ideas for export to an audience. And that makes me sad. Once you get something working, make it pretty. <strong>&lt;/rant&gt;</strong></em>

In [3]:
def custom_render(doc, df, column, options={}, page=False, minify=False, idx=0):
    """Overload the spaCy built-in rendering to allow custom part-of-speech (POS) tags.
    
    Keyword arguments:
    doc -- a spaCy nlp doc object
    df -- a pandas dataframe object
    column -- the name of a column of interest in the dataframe
    options -- various options to feed into the spaCy renderer, including colors
    page -- rendering markup as full HTML page (default False)
    minify -- for compact HTML (default False)
    idx -- index for specific query or doc in dataframe (default 0)
    
    """
    renderer, converter = EntityRenderer, parse_custom_ents
    renderer = renderer(options=options)
    parsed = [converter(doc, df=df, idx=idx, column=column)]
    html = renderer.render(parsed, page=page, minify=minify).strip()  
    return display(HTML(html))

def parse_custom_ents(doc, df, idx, column):
    """Parse custom entity types that aren't in the original spaCy module.
    
    Keyword arguments:
    doc -- a spaCy nlp doc object
    df -- a pandas dataframe object
    idx -- index for specific query or doc in dataframe
    column -- the name of a column of interest in the dataframe
    
    """
    if column in df.columns:
        entities = df[column][idx]
        ents = [{'start': ent[1], 'end': ent[2], 'label': ent[3]} 
                for ent in entities]
    else:
        ents = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}
            for ent in doc.ents]
    return {'text': doc.text, 'ents': ents, 'title': None}

def render_entities(idx, df, options={}, column='named_ents'):
    """A wrapper function to get text from a dataframe and render it visually in jupyter notebooks
    
    Keyword arguments:
    idx -- index for specific query or doc in dataframe
    df -- a pandas dataframe object
    options -- various options to feed into the spaCy renderer, including colors
    column -- the name of a column of interest in the dataframe (default 'named_ents')
    
    """
    text = df['text'][idx]
    custom_render(nlp(text), df=df, column=column, options=options, idx=idx)
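
The dictionary returned by `parse_custom_ents` is just the raw text plus a list of `start`/`end`/`label` spans, which is what the entity renderer consumes. Here is a spaCy-free sketch of that conversion, assuming tuples shaped like `(text, start, end, label)`; the helper name `to_displacy_dict` is made up for illustration.

```python
def to_displacy_dict(text, spans):
    """Convert (text, start, end, label) tuples into an entity-rendering dict."""
    # Sort by start offset so the spans appear in reading order.
    ents = [{"start": start, "end": end, "label": label}
            for _, start, end, label in sorted(spans, key=lambda s: s[1])]
    return {"text": text, "ents": ents, "title": None}

text = "Dr. Abraham is a physician."
# Hand-written spans; the labels reuse the keys of the color map defined below.
spans = [("physician", 17, 26, "NOUN"), ("Dr. Abraham", 0, 11, "PROPN")]
parsed = to_displacy_dict(text, spans)
```

Because the labels are free-form strings, anything we can express as character offsets (nouns, noun phrases, compounds) can be colored the same way named entities are.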

In [4]:
# colors for additional part of speech tags we want to visualize
options = {'colors': {'COMPOUND': '#FE6BFE', 'PROPN': '#18CFE6', 'NOUN': '#18CFE6', 'NP': '#1EECA6'}}

In [5]:
pd.set_option('display.max_rows', 10) # edit how jupyter will render our pandas dataframes
pd.options.mode.chained_assignment = None # prevent warning about working on a copy of a dataframe

### Load Model

spaCy's pre-built models are trained on different corpora of text, to capture parts-of-speech, extract named entities, and in general understand how to tokenize words into chunks that have meaning in a given language. 

We'll grab the 'en_core_web_lg' model by running the following command in the shell (comment it out once you've run it so you don't keep downloading it every time you go through the notebook). 

In [6]:
# !python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')

<em><strong>Fun fact</strong>: We can run shell commands in a Jupyter notebook by prefixing them with the bang operator ('!'). This shell escape is an IPython convenience in the same spirit as its [magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html) commands, one of which we saw at the beginning with '%autoreload'.</em>

### Gather Data

We'll be using papers and posters presented at the [Neural Information Processing Systems (NIPS)](https://nips.cc/) conference held in a different location around the world each year. NIPS is the premier conference for all things machine learning, and considering our goal with this notebook, is an apropos choice to source our data. We'll pull a conveniently packaged dataset from [Kaggle](https://www.kaggle.com/benhamner/nips-2015-papers/version/2/home), a data science competition site, and then work with a subset of the papers to keep our prototyping as lean and fast as possible. 

Once we've grabbed the files using Kaggle's [API](https://github.com/Kaggle/kaggle-api), we'll take a look at what we're working with. Let's store everything in a separate 'data' folder to keep our directory clean. I've discarded all extra files and renamed the essential one to 'nips.csv'.

In [126]:
PATH = './data/'

In [127]:
!ls {PATH}

nips.csv


<em><strong>Fun fact</strong>: You can use python variables in shell commands by nesting them inside curly braces.</em>

In [128]:
file = 'nips.csv'
df = pd.read_csv(f'{PATH}{file}')

mini_df = df[:10]
mini_df.index = pd.RangeIndex(len(mini_df.index))

# comment this out to run on full dataset
df = mini_df

### Game Plan

Now that we're all ready to get started, let's come up with a general list of tasks we'll need to complete before we end up with a decent prototype and baseline for an entity extractor. 

<br>
<strong>
<ol>
    <li>Inspect and clean data</li>
    <li>Extract named entities</li>
    <li>Extract nouns</li>
    <li>Combine named entities and nouns</li>
    <li>Extract noun phrases</li>
    <li>Extract compound noun phrases</li>
    <li>Combine entities and compound noun phrases</li>
    <li>Reduce entity count with heuristics</li>
    <li>Create API for entity extraction pipeline</li>
    <li>Celebrate with excessive fist-pumping</li>
</ol>
</strong>

That doesn't look too bad now does it? Let's build ourselves a prototype entity extractor.

## Step 1: Inspect and clean data

In [129]:
display(df)

Unnamed: 0,Id,Title,EventType,PdfName,Abstract,PaperText
0,5677,Double or Nothing: Multiplicative Incentive Me...,Poster,5677-double-or-nothing-multiplicative-incentiv...,Crowdsourcing has gained immense popularity in...,Double or Nothing: Multiplicative\nIncentive M...
1,5941,Learning with Symmetric Label Noise: The Impor...,Spotlight,5941-learning-with-symmetric-label-noise-the-i...,Convex potential minimisation is the de facto ...,Learning with Symmetric Label Noise: The\nImpo...
2,6019,Algorithmic Stability and Uniform Generalization,Poster,6019-algorithmic-stability-and-uniform-general...,One of the central questions in statistical le...,Algorithmic Stability and Uniform Generalizati...
3,6035,Adaptive Low-Complexity Sequential Inference f...,Poster,6035-adaptive-low-complexity-sequential-infere...,We develop a sequential low-complexity inferen...,Adaptive Low-Complexity Sequential Inference f...
4,5978,Covariance-Controlled Adaptive Langevin Thermo...,Poster,5978-covariance-controlled-adaptive-langevin-t...,Monte Carlo sampling for Bayesian posterior in...,Covariance-Controlled Adaptive Langevin\nTherm...
5,5714,Robust Portfolio Optimization,Poster,5714-robust-portfolio-optimization.pdf,We propose a robust portfolio optimization app...,Robust Portfolio Optimization\n\nFang Han\nDep...
6,5937,Logarithmic Time Online Multiclass prediction,Spotlight,5937-logarithmic-time-online-multiclass-predic...,We study the problem of multiclass classificat...,Logarithmic Time Online Multiclass prediction\...
7,5802,Planar Ultrametrics for Image Segmentation,Poster,5802-planar-ultrametrics-for-image-segmentatio...,We study the problem of hierarchical clusterin...,Planar Ultrametrics for Image Segmentation\n\n...
8,5776,Expressing an Image Stream with a Sequence of ...,Poster,5776-expressing-an-image-stream-with-a-sequenc...,We propose an approach for generating a sequen...,Expressing an Image Stream with a Sequence of\...
9,5814,Parallel Correlation Clustering on Big Graphs,Poster,5814-parallel-correlation-clustering-on-big-gr...,"Given a similarity graph between items, correl...",Parallel Correlation Clustering on Big Graphs\...


In [130]:
lower = lambda x: x.lower() # make everything lowercase

In [131]:
df = pd.DataFrame(df['Abstract'].apply(lower))
df.columns = ['text']
display(df)

Unnamed: 0,text
0,crowdsourcing has gained immense popularity in...
1,convex potential minimisation is the de facto ...
2,one of the central questions in statistical le...
3,we develop a sequential low-complexity inferen...
4,monte carlo sampling for bayesian posterior in...
5,we propose a robust portfolio optimization app...
6,we study the problem of multiclass classificat...
7,we study the problem of hierarchical clusterin...
8,we propose an approach for generating a sequen...
9,"given a similarity graph between items, correl..."


### Analysis

Initially, there was quite a bit of metadata associated with each entry, including a unique identifier, the type of paper presented at the conference, as well as the actual paper text. After pulling out just the abstracts, we've now ended up with a clean, ready-to-go dataframe. 

## Step 2: Extract named entities

In [132]:
def extract_named_ents(text):
    """Extract named entities, and beginning, middle and end idx using spaCy's out-of-the-box model. 
    
    Keyword arguments:
    text -- the actual text source from which to extract entities
    
    """
    return [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in nlp(text).ents]

def add_named_ents(df):
    """Create new column in data frame with named entity tuple extracted.
    
    Keyword arguments:
    df -- a dataframe object
    
    """
    df['named_ents'] = df['text'].apply(extract_named_ents)    

In [133]:
add_named_ents(df)
display(df)

Unnamed: 0,text,named_ents
0,crowdsourcing has gained immense popularity in...,"[(several hundred, 896, 911, CARDINAL)]"
1,convex potential minimisation is the de facto ...,"[(2008, 109, 113, DATE), (2008, 500, 504, DATE..."
2,one of the central questions in statistical le...,"[(one, 0, 3, CARDINAL)]"
3,we develop a sequential low-complexity inferen...,[]
4,monte carlo sampling for bayesian posterior in...,[]
5,we propose a robust portfolio optimization app...,[]
6,we study the problem of multiclass classificat...,[]
7,we study the problem of hierarchical clusterin...,[]
8,we propose an approach for generating a sequen...,[]
9,"given a similarity graph between items, correl...","[(3-approximation, 257, 272, CARDINAL), (graph..."


In [134]:
column = 'named_ents'
render_entities(9, df, options=options, column=column) # take a look at the last abstract (index 9)

### Analysis

A quick glance at some of the abstracts shows that while we are able to extract numeric entities, not much else comes through. Not great. But then again, this is exactly why simply extracting named entities is not enough. On the plus side, our intuition about built-in models and scientific text was spot on! The spaCy named entity recognizer just wasn't exposed to this category of corpora and was instead trained on [blogs, news, and comments](https://spacy.io/models/en#en_core_web_lg). Academic papers don't use the most common English words, so it isn't unreasonable to expect a generally trained model to fail when confronted with text in such a restricted domain.   

Look at a few more abstracts by changing the index parameter in our "render_entities" function to convince yourself of the following notion:

We need to widen our search. 
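
One way to back up this impression with numbers is to tally the entity labels across rows. The sketch below uses only tuples visible in the table above (rows 0, 2, and 3); the `rows` list is a stand-in for `df['named_ents']`.

```python
from collections import Counter

# (text, start_char, end_char, label) tuples copied from rows 0, 2 and 3 above.
rows = [
    [("several hundred", 896, 911, "CARDINAL")],
    [("one", 0, 3, "CARDINAL")],
    [],
]

# Count how often each label occurs across all rows.
label_counts = Counter(label for ents in rows for (_, _, _, label) in ents)

# Count abstracts for which the recognizer found nothing at all.
empty_rows = sum(1 for ents in rows if not ents)
```

Swapping `rows` for `df['named_ents']` should give the same tally for the whole dataframe.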

## Step 3: Extract all nouns

In [135]:
def extract_nouns(text):
    """Extract a few types of nouns, and beginning, middle and end idx using spaCy's POS (part of speech) tagger. 
    
    Keyword arguments:
    text -- the actual text source from which to extract entities
    
    """
    keep_pos = ['PROPN', 'NOUN']
    return [(tok.text, tok.idx, tok.idx+len(tok.text), tok.pos_) for tok in nlp(text) if tok.pos_ in keep_pos]

def add_nouns(df):
    """Create new column in data frame with nouns extracted.
    
    Keyword arguments:
    df -- a dataframe object
    
    """
    df['nouns'] = df['text'].apply(extract_nouns)

In [136]:
add_nouns(df)
display(df)

Unnamed: 0,text,named_ents,nouns
0,crowdsourcing has gained immense popularity in...,"[(several hundred, 896, 911, CARDINAL)]","[(crowdsourcing, 0, 13, NOUN), (popularity, 33..."
1,convex potential minimisation is the de facto ...,"[(2008, 109, 113, DATE), (2008, 500, 504, DATE...","[(minimisation, 17, 29, NOUN), (approach, 46, ..."
2,one of the central questions in statistical le...,"[(one, 0, 3, CARDINAL)]","[(questions, 19, 28, NOUN), (learning, 44, 52,..."
3,we develop a sequential low-complexity inferen...,[],"[(complexity, 28, 38, NOUN), (inference, 39, 4..."
4,monte carlo sampling for bayesian posterior in...,[],"[(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN..."
5,we propose a robust portfolio optimization app...,[],"[(portfolio, 20, 29, NOUN), (optimization, 30,..."
6,we study the problem of multiclass classificat...,[],"[(problem, 13, 20, NOUN), (multiclass, 24, 34,..."
7,we study the problem of hierarchical clusterin...,[],"[(problem, 13, 20, NOUN), (clustering, 37, 47,..."
8,we propose an approach for generating a sequen...,[],"[(approach, 14, 22, NOUN), (sequence, 40, 48, ..."
9,"given a similarity graph between items, correl...","[(3-approximation, 257, 272, CARDINAL), (graph...","[(similarity, 8, 18, NOUN), (graph, 19, 24, NO..."


In [137]:
column = 'nouns'
render_entities(0, df, options=options, column=column)

### Analysis

This is more colorful. But is it useful? It looks like we are able to pull out a lot of concepts, but things like "rest", "popularity", and "data", aren't all that interesting in the first abstract. Our search is too wide at this point. 
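
Stripped of the spaCy machinery, the noun filter from this step is a membership test on each token's POS tag. Here is a minimal sketch with hand-tagged tokens; the tags are my own annotations rather than tagger output, and the text is the opening of the first abstract.

```python
text = "crowdsourcing has gained immense popularity"

# (token, POS) pairs annotated by hand for illustration.
tagged = [("crowdsourcing", "NOUN"), ("has", "AUX"), ("gained", "VERB"),
          ("immense", "ADJ"), ("popularity", "NOUN")]

keep_pos = {"PROPN", "NOUN"}
nouns, cursor = [], 0
for token, pos in tagged:
    start = text.index(token, cursor)  # character offset, like spaCy's tok.idx
    cursor = start + len(token)        # continue searching after this token
    if pos in keep_pos:
        nouns.append((token, start, cursor, pos))
```

Nothing in this test knows whether a noun matters to the paper's domain, which is exactly why "popularity" sails through alongside "crowdsourcing".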

Good to know. Let's keep going. 

## Step 4: Combine named entities and nouns

In [138]:
def extract_named_nouns(row_series):
    """Combine nouns and non-numerical entities. 
    
    Keyword arguments:
    row_series -- a Pandas Series object
    
    """
    ents = set()
    idxs = set()
    # remove duplicates and merge two lists together
    for noun_tuple in row_series['nouns']:
        for named_ents_tuple in row_series['named_ents']:
            if noun_tuple[1] == named_ents_tuple[1]: 
                idxs.add(noun_tuple[1])
                ents.add(named_ents_tuple)
        if noun_tuple[1] not in idxs:
            ents.add(noun_tuple)
    
    return sorted(ents, key=lambda x: x[1])

def add_named_nouns(df):
    """Create new column in data frame with nouns and named ents.
    
    Keyword arguments:
    df -- a dataframe object
    
    """
    df['named_nouns'] = df.apply(extract_named_nouns, axis=1)

In [139]:
add_named_nouns(df)
display(df)

Unnamed: 0,text,named_ents,nouns,named_nouns
0,crowdsourcing has gained immense popularity in...,"[(several hundred, 896, 911, CARDINAL)]","[(crowdsourcing, 0, 13, NOUN), (popularity, 33...","[(crowdsourcing, 0, 13, NOUN), (popularity, 33..."
1,convex potential minimisation is the de facto ...,"[(2008, 109, 113, DATE), (2008, 500, 504, DATE...","[(minimisation, 17, 29, NOUN), (approach, 46, ...","[(minimisation, 17, 29, NOUN), (approach, 46, ..."
2,one of the central questions in statistical le...,"[(one, 0, 3, CARDINAL)]","[(questions, 19, 28, NOUN), (learning, 44, 52,...","[(questions, 19, 28, NOUN), (learning, 44, 52,..."
3,we develop a sequential low-complexity inferen...,[],"[(complexity, 28, 38, NOUN), (inference, 39, 4...","[(complexity, 28, 38, NOUN), (inference, 39, 4..."
4,monte carlo sampling for bayesian posterior in...,[],"[(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...","[(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN..."
5,we propose a robust portfolio optimization app...,[],"[(portfolio, 20, 29, NOUN), (optimization, 30,...","[(portfolio, 20, 29, NOUN), (optimization, 30,..."
6,we study the problem of multiclass classificat...,[],"[(problem, 13, 20, NOUN), (multiclass, 24, 34,...","[(problem, 13, 20, NOUN), (multiclass, 24, 34,..."
7,we study the problem of hierarchical clusterin...,[],"[(problem, 13, 20, NOUN), (clustering, 37, 47,...","[(problem, 13, 20, NOUN), (clustering, 37, 47,..."
8,we propose an approach for generating a sequen...,[],"[(approach, 14, 22, NOUN), (sequence, 40, 48, ...","[(approach, 14, 22, NOUN), (sequence, 40, 48, ..."
9,"given a similarity graph between items, correl...","[(3-approximation, 257, 272, CARDINAL), (graph...","[(similarity, 8, 18, NOUN), (graph, 19, 24, NO...","[(similarity, 8, 18, NOUN), (graph, 19, 24, NO..."


In [140]:
column = 'named_nouns'
render_entities(0, df, options=options, column=column)

### Analysis

To keep things simple, we're combining the named entities extracted using spaCy's built-in model and nouns we pulled with the POS tagger. We're going to drop any numeric entities for now because they are harder to deal with and don't really represent new concepts. You'll notice (if you look closely enough) that we are also ignoring any hyphenated entities. In spaCy's tokenizer, it is possible to prevent hyphenated words from being split apart, but we'll reserve this, along with other types of advanced fine-tuning or low-level editing, for if and when we move beyond the prototype phase. 
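
The nested loop in `extract_named_nouns` can be read as: for every noun, prefer a named entity that starts at the same character offset. Below is a sketch of the same idea using a dictionary lookup. The tuples are hypothetical, `merge_named_nouns` is not part of the notebook, and the two versions only agree as long as no two named entities share a start offset.

```python
def merge_named_nouns(nouns, named_ents):
    """Keep each noun, swapping in a named entity that starts at the same offset."""
    named_by_start = {ent[1]: ent for ent in named_ents}
    merged = {named_by_start.get(noun[1], noun) for noun in nouns}
    return sorted(merged, key=lambda ent: ent[1])

# Hypothetical tuples in the (text, start, end, label) shape used above.
nouns = [("tokyo", 10, 15, "NOUN"), ("conference", 16, 26, "NOUN")]
named_ents = [("tokyo", 10, 15, "GPE"), ("several hundred", 40, 55, "CARDINAL")]

merged = merge_named_nouns(nouns, named_ents)
```

Note what happens to the numeric entity: no noun starts at offset 40, so "several hundred" never enters the merged list. The numbers drop out as a side effect of the matching rule rather than through an explicit label filter.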

## Step 5: Extract noun phrases

In the last few steps, we dealt with one-word entities, but sometimes combinations of two or more words represent a single concept. This means that we'll need to pull n-length phrases from our academic abstracts to make our prototype useful. 

Even mild exposure to computer science, or any of the various isoforms of engineering, will have introduced you to the idea of an abstraction, wherein low-level concepts are bundled into higher-order relationships. The <strong>noun phrase</strong> is one such abstraction: it typically spans several words (though a lone noun or pronoun also counts), and is the by-product of tokenization, POS tagging, and dependency parsing. spaCy's POS tagger is essentially a statistical model which learns to predict the tag (noun, verb, adjective, etc.) for a given word using examples of tagged sentences. 

This supervised machine learning approach relies on tokens generated from splitting text into somewhat atomic units using a rule-based tokenizer (although there are some interesting [unsupervised models](https://github.com/google/sentencepiece) out there as well). Dependency parsing then uncovers relationships between these tagged tokens, allowing us to finally extract noun chunks or phrases of relevance. 

The full pipeline goes something like this: 

<strong>raw text</strong> &rarr; <strong>tokenization</strong> &rarr; <strong>POS tagging</strong> &rarr; <strong>dependency parsing</strong> &rarr; <strong>noun chunk extraction</strong>
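
To make the stages tangible, here is a deliberately tiny, rule-based imitation of that chain. It is not how spaCy works internally: spaCy relies on statistical models, and this toy skips dependency parsing entirely. The lookup table `TOY_TAGS`, the special case for "Dr.", and the grouping rule are all assumptions chosen so the example runs on a shortened version of our sample sentence.

```python
import re

# Assumed toy lexicon; a real tagger predicts these tags statistically.
TOY_TAGS = {"dr.": "PROPN", "abraham": "PROPN", "is": "AUX", "the": "DET",
            "primary": "ADJ", "author": "NOUN", "of": "ADP", "this": "DET",
            "paper": "NOUN"}

def tokenize(text):
    """Split text into (token, start_offset) pairs, treating 'Dr.' as one token."""
    return [(m.group(), m.start()) for m in re.finditer(r"Dr\.|\w+|[^\w\s]", text)]

def tag(tokens):
    """Attach a POS tag to each token via dictionary lookup (unknown words get 'X')."""
    return [(tok, start, TOY_TAGS.get(tok.lower(), "X")) for tok, start in tokens]

def noun_chunks(tagged):
    """Group runs of DET/ADJ/NOUN/PROPN tokens that end in a noun into (start, end) spans."""
    chunks, run = [], []
    for tok, start, pos in tagged + [("", -1, "END")]:  # sentinel flushes the last run
        if pos in {"DET", "ADJ", "NOUN", "PROPN"}:
            run.append((tok, start, pos))
            continue
        if run and run[-1][2] in {"NOUN", "PROPN"}:
            first, last = run[0], run[-1]
            chunks.append((first[1], last[1] + len(last[0])))
        run = []
    return chunks

text = "Dr. Abraham is the primary author of this paper"
spans = noun_chunks(tag(tokenize(text)))
phrases = [text[start:end] for start, end in spans]
```

Each stage only consumes what the previous one produced, which is the property that lets us slot our own extraction logic into the chain later on.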

Of course before extracting noun phrases, we could plug in named entity recognition, but that is the part of the pipeline we are attempting to modify for our own purposes. Barring our custom intrusion, this is exactly how spaCy's built-in model works! Scroll up to the very top of our notebook to see an image. 

Neat huh? Need a visualization of tokenization, POS tagging, and dependency parsing to convince you of just how cool this is? 

Take a look:

In [142]:
text = "Dr. Abraham is the primary author of this paper, and a physician in the specialty of internal medicine."
spacy.displacy.render(nlp(text), jupyter=True) # generating raw-markup using spacy's built-in renderer

And here are the noun phrases in our dummy example, derived from this dependency tree:

In [143]:
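dummy_df = pd.DataFrame([text])
dummy_df.columns = ['text']
# build the (text, start, end, label) tuples inline; the reusable extract_noun_phrases helper is defined further down
dummy_df['noun_phrases'] = dummy_df['text'].apply(
    lambda t: [(chunk.text, chunk.start_char, chunk.end_char, chunk.label_) for chunk in nlp(t).noun_chunks])
column = 'noun_phrases'
render_entities(0, dummy_df, options=options, column=column)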

Compare this to what we'd originally set out to accomplish:

> **[Dr. Abraham]** is the **[primary author]** of this **[paper]**, and a **[physician]** in the **[specialty]** of **[internal medicine]**.

I don't know about you, but every time I see this work, I'm blown away by both the intricate complexity and yet beautiful simplicity of this process. Ignoring the leading determiners, with one single move, we've done a damn-near perfect job of extracting the main ideas from this sentence. How amazing is that?! 

Hats off to spaCy, and the hordes of data scientists, machine learning engineers, and linguists that made this possible.

Now if we just use this approach to add extracted noun phrases to the list of single-word entities we pulled earlier, we should be close to finishing our prototype!

In [144]:
def extract_noun_phrases(text):
    """Combine noun phrases. 
    
    Keyword arguments:
    text -- the actual text source from which to extract entities
    
    """
    return [(chunk.text, chunk.start_char, chunk.end_char, chunk.label_) for chunk in nlp(text).noun_chunks]

def add_noun_phrases(df):
    """Create new column in data frame with noun phrases.
    
    Keyword arguments:
    df -- a dataframe object
    
    """
    df['noun_phrases'] = df['text'].apply(extract_noun_phrases)

In [145]:
add_noun_phrases(df)
display(df)

Unnamed: 0,text,named_ents,nouns,named_nouns,noun_phrases
0,crowdsourcing has gained immense popularity in...,"[(several hundred, 896, 911, CARDINAL)]","[(crowdsourcing, 0, 13, NOUN), (popularity, 33...","[(crowdsourcing, 0, 13, NOUN), (popularity, 33...","[(crowdsourcing, 0, 13, NP), (immense populari..."
1,convex potential minimisation is the de facto ...,"[(2008, 109, 113, DATE), (2008, 500, 504, DATE...","[(minimisation, 17, 29, NOUN), (approach, 46, ...","[(minimisation, 17, 29, NOUN), (approach, 46, ...","[(convex potential minimisation, 0, 29, NP), (..."
2,one of the central questions in statistical le...,"[(one, 0, 3, CARDINAL)]","[(questions, 19, 28, NOUN), (learning, 44, 52,...","[(questions, 19, 28, NOUN), (learning, 44, 52,...","[(the central questions, 7, 28, NP), (statisti..."
3,we develop a sequential low-complexity inferen...,[],"[(complexity, 28, 38, NOUN), (inference, 39, 4...","[(complexity, 28, 38, NOUN), (inference, 39, 4...","[(we, 0, 2, NP), (a sequential low-complexity ..."
4,monte carlo sampling for bayesian posterior in...,[],"[(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...","[(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...","[(bayesian posterior inference, 25, 53, NP), (..."
5,we propose a robust portfolio optimization app...,[],"[(portfolio, 20, 29, NOUN), (optimization, 30,...","[(portfolio, 20, 29, NOUN), (optimization, 30,...","[(we, 0, 2, NP), (a robust portfolio optimizat..."
6,we study the problem of multiclass classificat...,[],"[(problem, 13, 20, NOUN), (multiclass, 24, 34,...","[(problem, 13, 20, NOUN), (multiclass, 24, 34,...","[(we, 0, 2, NP), (the problem, 9, 20, NP), (mu..."
7,we study the problem of hierarchical clusterin...,[],"[(problem, 13, 20, NOUN), (clustering, 37, 47,...","[(problem, 13, 20, NOUN), (clustering, 37, 47,...","[(we, 0, 2, NP), (the problem, 9, 20, NP), (hi..."
8,we propose an approach for generating a sequen...,[],"[(approach, 14, 22, NOUN), (sequence, 40, 48, ...","[(approach, 14, 22, NOUN), (sequence, 40, 48, ...","[(we, 0, 2, NP), (an approach, 11, 22, NP), (a..."
9,"given a similarity graph between items, correl...","[(3-approximation, 257, 272, CARDINAL), (graph...","[(similarity, 8, 18, NOUN), (graph, 19, 24, NO...","[(similarity, 8, 18, NOUN), (graph, 19, 24, NO...","[(a similarity graph, 6, 24, NP), (items, 33, ..."


In [146]:
column = 'noun_phrases'
render_entities(0, df, options=options, column=column)

### Analysis

Hmm... should've seen this coming. While we've now done a great job of extracting noun phrases from our abstracts, we're running into the same problem as before. Our funnel is too wide, and we're pulling bigrams like "the simplicity", "the rest", and "this mechanism". These chunks are indeed noun phrases, but not domain-specific concepts. 
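
A cheap way to see how much of this noise is superficial is to strip leading determiners and drop bare pronouns. The sketch below only illustrates that idea on tuples shaped like our `noun_phrases` column (the sample tuples are taken from rows 6 and 9 of the table above). The word lists and the helper `trim_phrase` are assumptions, not part of the notebook's pipeline.

```python
DETERMINERS = {"a", "an", "the", "this", "that", "these", "those"}
PRONOUNS = {"we", "it", "they", "i", "you"}

def trim_phrase(phrase):
    """Drop a leading determiner and shift the start offset; return None for pronouns."""
    text, start, end, label = phrase
    first, _, rest = text.partition(" ")
    if first in DETERMINERS and rest:
        start += len(first) + 1  # skip the determiner and the following space
        text = rest
    if text in PRONOUNS:
        return None
    return (text, start, end, label)

phrases = [("we", 0, 2, "NP"), ("the problem", 9, 20, "NP"),
           ("a similarity graph", 6, 24, "NP")]
trimmed = [t for t in map(trim_phrase, phrases) if t is not None]
```

Trimming tidies the spans, but "the rest" would still survive as "rest", so cosmetic cleanup alone does not tell us which phrases are domain-specific.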

Let's see if we can narrow our search and get just the most important phrases. 

## Step 6: Extract compound noun phrases

In [165]:
def extract_compounds(text):
    """Extract compound noun phrases with beginning and end idxs. 
    
    Keyword arguments:
    text -- the actual text source from which to extract entities
    
    """
    comp_idx = 0
    compound = []
    compound_nps = []
    tok_idx = 0
    for tok in nlp(text):
        if tok.dep_ == 'compound':

            # capture hyphenated compounds
            children = ''.join([c.text for c in tok.children])
            if '-' in children:
                compound.append(''.join([children, tok.text]))
            else:
                compound.append(tok.text)

            # remember starting index of first child in compound or word
            try:
                tok_idx = [c for c in tok.children][0].idx
            except IndexError:
                if len(compound) == 1:
                    tok_idx = tok.idx
            comp_idx = tok.i

        # append the last word in a compound phrase
        if tok.i - comp_idx == 1:
            compound.append(tok.text)
            if len(compound) > 1: 
                compound = ' '.join(compound)
                compound_nps.append((compound, tok_idx, tok_idx+len(compound), 'COMPOUND'))

            # reset parameters
            tok_idx = 0 
            compound = []

    return compound_nps

def add_compounds(df):
    """Create new column in data frame with compound noun phrases.
    
    Keyword arguments:
    df -- a dataframe object
    
    """
    df['compounds'] = df['text'].apply(extract_compounds)

In [166]:
add_compounds(df)
display(df)

Unnamed: 0,text,named_ents,nouns,named_nouns,noun_phrases,compounds
0,crowdsourcing has gained immense popularity in...,"[(several hundred, 896, 911, CARDINAL)]","[(crowdsourcing, 0, 13, NOUN), (popularity, 33...","[(crowdsourcing, 0, 13, NOUN), (popularity, 33...","[(crowdsourcing, 0, 13, NP), (immense populari...","[(machine learning applications, 47, 76, COMPO..."
1,convex potential minimisation is the de facto ...,"[(2008, 109, 113, DATE), (2008, 500, 504, DATE...","[(minimisation, 17, 29, NOUN), (approach, 46, ...","[(minimisation, 17, 29, NOUN), (approach, 46, ...","[(convex potential minimisation, 0, 29, NP), (...","[(label noise, 143, 154, COMPOUND), (function ..."
2,one of the central questions in statistical le...,"[(one, 0, 3, CARDINAL)]","[(questions, 19, 28, NOUN), (learning, 44, 52,...","[(questions, 19, 28, NOUN), (learning, 44, 52,...","[(the central questions, 7, 28, NP), (statisti...","[(learning theory, 44, 59, COMPOUND), (inferen..."
3,we develop a sequential low-complexity inferen...,[],"[(complexity, 28, 38, NOUN), (inference, 39, 4...","[(complexity, 28, 38, NOUN), (inference, 39, 4...","[(we, 0, 2, NP), (a sequential low-complexity ...","[(low-complexity inference procedure, 28, 62, ..."
4,monte carlo sampling for bayesian posterior in...,[],"[(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...","[(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...","[(bayesian posterior inference, 25, 53, NP), (...","[(monte carlo sampling, 0, 20, COMPOUND), (mac..."
5,we propose a robust portfolio optimization app...,[],"[(portfolio, 20, 29, NOUN), (optimization, 30,...","[(portfolio, 20, 29, NOUN), (optimization, 30,...","[(we, 0, 2, NP), (a robust portfolio optimizat...","[(portfolio optimization approach, 20, 51, COM..."
6,we study the problem of multiclass classificat...,[],"[(problem, 13, 20, NOUN), (multiclass, 24, 34,...","[(problem, 13, 20, NOUN), (multiclass, 24, 34,...","[(we, 0, 2, NP), (the problem, 9, 20, NP), (mu...","[(test time, 134, 143, COMPOUND), (tree constr..."
7,we study the problem of hierarchical clusterin...,[],"[(problem, 13, 20, NOUN), (clustering, 37, 47,...","[(problem, 13, 20, NOUN), (clustering, 37, 47,...","[(we, 0, 2, NP), (the problem, 9, 20, NP), (hi...","[(lp relaxation, 182, 195, COMPOUND), (cost pe..."
8,we propose an approach for generating a sequen...,[],"[(approach, 14, 22, NOUN), (sequence, 40, 48, ...","[(approach, 14, 22, NOUN), (sequence, 40, 48, ...","[(we, 0, 2, NP), (an approach, 11, 22, NP), (a...","[(image stream, 77, 89, COMPOUND), (image stre..."
9,"given a similarity graph between items, correl...","[(3-approximation, 257, 272, CARDINAL), (graph...","[(similarity, 8, 18, NOUN), (graph, 19, 24, NO...","[(similarity, 8, 18, NOUN), (graph, 19, 24, NO...","[(a similarity graph, 6, 24, NP), (items, 33, ...","[(similarity graph, 8, 24, COMPOUND), (correla..."


In [167]:
column = 'compounds'
render_entities(0, df, options=options, column=column)

### Analysis

That's better. Not perfect... but better. We should now be able to add these compound noun phrases to the list of entities we've extracted from each abstract. Most of the compounds will be bigrams, but they can technically contain more than two words. 
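
To see why a compound can run longer than two words, here is a minimal sketch of the grouping idea behind `extract_compounds`, with spaCy taken out of the picture. The `(text, dep)` pairs are hand-labeled for illustration (an assumption, not real parser output), and `group_compounds` is a hypothetical helper, not part of the notebook. It ignores character offsets and hyphen handling entirely.

```python
def group_compounds(tagged_tokens):
    """Join each run of 'compound' modifiers with the head word that follows it."""
    phrases = []
    run = []
    for text, dep in tagged_tokens:
        if dep == 'compound':
            # still inside a chain of modifiers; keep collecting
            run.append(text)
        else:
            # a non-compound token closes the chain and acts as its head
            if run:
                phrases.append(' '.join(run + [text]))
            run = []
    return phrases

# Hand-labeled tokens (assumed labels), loosely based on the first abstract.
tagged = [
    ('popularity', 'dobj'),
    ('in', 'prep'),
    ('machine', 'compound'),
    ('learning', 'compound'),
    ('applications', 'pobj'),
    ('with', 'prep'),
    ('error', 'compound'),
    ('rates', 'pobj'),
]

print(group_compounds(tagged))
# ['machine learning applications', 'error rates']
```

Two modifiers in a row give a trigram, one gives a bigram, and a run with no head after it is simply discarded.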

## Step 7: Combine entities and compound noun phrases

In [168]:
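def extract_comp_nouns(row_series, cols=()):
    """Combine compound noun phrases and entities. 
    
    Keyword arguments:
    row_series -- a Pandas Series object
    cols -- the column names whose entity tuples should be merged
    
    """
    # keep only the entity text (first tuple element); the set removes repeats
    return {noun_tuple[0] for col in cols for noun_tuple in row_series[col]}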

def add_comp_nouns(df, cols=()):
    """Create new column in data frame with merged entities.
    
    Keyword arguments:
    df -- a dataframe object
    cols -- a list of column names that need to be merged
    
    """
    df['comp_nouns'] = df.apply(extract_comp_nouns, axis=1, cols=cols)
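
As a rough illustration of what the merge does to a single row, the sketch below uses a plain dict in place of a Pandas Series. The first three tuples are copied from row 9 of the table above. The repeated `graph` tuple and the helper name `merge_entity_text` are made up for the example.

```python
def merge_entity_text(row, cols):
    """Collect the text of every entity tuple in the given columns into one set."""
    return {ent[0] for col in cols for ent in row[col]}

row = {
    'nouns': [
        ('similarity', 8, 18, 'NOUN'),
        ('graph', 19, 24, 'NOUN'),
        ('graph', 100, 105, 'NOUN'),  # hypothetical repeat, added for illustration
    ],
    'compounds': [
        ('similarity graph', 8, 24, 'COMPOUND'),
    ],
}

merged = merge_entity_text(row, ['nouns', 'compounds'])
print(sorted(merged))
# ['graph', 'similarity', 'similarity graph']
```

Because the result is a set of strings, repeated mentions collapse into one entry, and the character offsets and labels are dropped.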

In [169]:
cols = ['nouns', 'compounds']
add_comp_nouns(df, cols=cols)
display(df)

Unnamed: 0,text,named_ents,nouns,named_nouns,noun_phrases,compounds,comp_nouns
0,crowdsourcing has gained immense popularity in...,"[(several hundred, 896, 911, CARDINAL)]","[(crowdsourcing, 0, 13, NOUN), (popularity, 33...","[(crowdsourcing, 0, 13, NOUN), (popularity, 33...","[(crowdsourcing, 0, 13, NP), (immense populari...","[(machine learning applications, 47, 76, COMPO...","{error rates, rates, questions, low-quality da..."
1,convex potential minimisation is the de facto ...,"[(2008, 109, 113, DATE), (2008, 500, 504, DATE...","[(minimisation, 17, 29, NOUN), (approach, 46, ...","[(minimisation, 17, 29, NOUN), (approach, 46, ...","[(convex potential minimisation, 0, 29, NP), (...","[(label noise, 143, 154, COMPOUND), (function ...","{paper, loss’ sln -, hinge loss, minimisation,..."
2,one of the central questions in statistical le...,"[(one, 0, 3, CARDINAL)]","[(questions, 19, 28, NOUN), (learning, 44, 52,...","[(questions, 19, 28, NOUN), (learning, 44, 52,...","[(the central questions, 7, 28, NP), (statisti...","[(learning theory, 44, 59, COMPOUND), (inferen...","{learning theory, agents, size, pac, order, qu..."
3,we develop a sequential low-complexity inferen...,[],"[(complexity, 28, 38, NOUN), (inference, 39, 4...","[(complexity, 28, 38, NOUN), (inference, 39, 4...","[(we, 0, 2, NP), (a sequential low-complexity ...","[(low-complexity inference procedure, 28, 62, ...","{otheronline, dirichlet process mixtures, clus..."
4,monte carlo sampling for bayesian posterior in...,[],"[(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...","[(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN...","[(bayesian posterior inference, 25, 53, NP), (...","[(monte carlo sampling, 0, 20, COMPOUND), (mac...","{subsampling, schemes, area, gradient methods,..."
5,we propose a robust portfolio optimization app...,[],"[(portfolio, 20, 29, NOUN), (optimization, 30,...","[(portfolio, 20, 29, NOUN), (optimization, 30,...","[(we, 0, 2, NP), (a robust portfolio optimizat...","[(portfolio optimization approach, 20, 51, COM...","{size, effectiveness, order, returns, ones, po..."
6,we study the problem of multiclass classificat...,[],"[(problem, 13, 20, NOUN), (multiclass, 24, 34,...","[(problem, 13, 20, NOUN), (multiclass, 24, 34,...","[(we, 0, 2, NP), (the problem, 9, 20, NP), (mu...","[(test time, 134, 143, COMPOUND), (tree constr...","{test time, training time approaches, test err..."
7,we study the problem of hierarchical clusterin...,[],"[(problem, 13, 20, NOUN), (clustering, 37, 47,...","[(problem, 13, 20, NOUN), (clustering, 37, 47,...","[(we, 0, 2, NP), (the problem, 9, 20, NP), (hi...","[(lp relaxation, 182, 195, COMPOUND), (cost pe...","{lp, clustering, algorithm, cost perfect, part..."
8,we propose an approach for generating a sequen...,[],"[(approach, 14, 22, NOUN), (sequence, 40, 48, ...","[(approach, 14, 22, NOUN), (sequence, 40, 48, ...","[(we, 0, 2, NP), (an approach, 11, 22, NP), (a...","[(image stream, 77, 89, COMPOUND), (image stre...","{pictures, architecture, output dimension, seq..."
9,"given a similarity graph between items, correl...","[(3-approximation, 257, 272, CARDINAL), (graph...","[(similarity, 8, 18, NOUN), (graph, 19, 24, NO...","[(similarity, 8, 18, NOUN), (graph, 19, 24, NO...","[(a similarity graph, 6, 24, NP), (items, 33, ...","[(similarity graph, 8, 24, COMPOUND), (correla...","{billion-edge graphs, seconds, clustering, edg..."


In [170]:
df['comp_nouns'][0]

{'amounts',
 'applications',
 'benefit',
 'challenge',
 'crowdsourcing',
 'data',
 'error',
 'error rates',
 'expenditure',
 'experiments',
 'form',
 'incentive',
 'low-quality data',
 'lunch',
 'machine',
 'machine learning applications',
 'mechanism',
 'mechanisms',
 'no',
 'no-free-lunch requirement',
 'payment',
 'payment mechanism',
 'popularity',
 'problem',
 'quality',
 'questions',
 'rates',
 'reduction',
 'requirement',
 'rest',
 'simplicity',
 'spammers',
 'workers'}

### Analysis

Now that we have all the entities grouped together, we can see how well we're doing. We've successfully captured single-word as well as n-gram entities, but there are a lot of duplicates, along with words that belong in a phrase but were somehow split apart, probably because we didn't deal with hyphenation properly when we first tokenized our abstracts. This should be relatively easy to take care of now. We'll also apply a few other heuristics to clean up our list. 

## Step 8: Reduce entity count with heuristics

In [None]:
def drop_duplicate_np_splits(ents):
    """Drop single words that also appear inside a multi-word entity."""
    drop_ents = set()
    for ent in ents:
        if len(ent.split(' ')) > 1:
            for e in ent.split(' '):
                if e in ents:
                    drop_ents.add(e)
    return ents - drop_ents

def drop_double_char(ents):
    """Drop entities shorter than three characters."""
    drop_ents = {ent for ent in ents if len(ent) < 3}
    return ents - drop_ents

def drop_syms(ents):
    """Drop entities containing a symbol (this includes '-', so hyphenated entities go too)."""
    drop_char = list(';/$,.~`\\{}|\'[]<>?"!@#$%^&*()_+-=')
    drop_ents = {ent for ent in ents if any(char in ent for char in drop_char)}
    return ents - drop_ents

def drop_nums(ents):
    """Drop entities containing a digit."""
    drop_char = list('0123456789')
    drop_ents = {ent for ent in ents if any(char in ent for char in drop_char)}
    return ents - drop_ents

def drop_single_char_nps(ents):
    """Remove single-character words from within each entity."""
    return {' '.join([e for e in ent.split(' ') if len(e) != 1]) for ent in ents}

def add_clean_ents(df, funcs=()): 
    """Create new column in data frame with the cleaned entity sets."""
    col = 'clean_ents'
    df[col] = df['comp_nouns']
    for f in funcs:
        df[col] = df[col].apply(f)

In [None]:
funcs = [drop_duplicate_np_splits, drop_double_char, drop_syms, drop_nums, drop_single_char_nps]
add_clean_ents(df, funcs)
display(df)
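
Since the cells above have no recorded output, here is a small sketch that traces what each heuristic removes from a handful of entities taken from the Step 7 output for the first abstract. The three filters are condensed re-statements of the functions above, not the originals. `trace_filters` is a hypothetical helper, and `string.punctuation` is an assumed stand-in for the notebook's symbol list.

```python
import string

def drop_split_duplicates(ents):
    """Drop single words that also occur inside a multi-word entity."""
    parts = {w for ent in ents if ' ' in ent for w in ent.split(' ')}
    return ents - parts

def drop_short(ents, min_len=3):
    """Drop entities shorter than min_len characters."""
    return {ent for ent in ents if len(ent) >= min_len}

def drop_symbols(ents, symbols=string.punctuation):
    """Drop entities containing any punctuation character, hyphens included."""
    return {ent for ent in ents if not any(ch in symbols for ch in ent)}

def trace_filters(ents, funcs):
    """Apply each filter in order and record what it removed."""
    removed = {}
    for f in funcs:
        kept = f(ents)
        removed[f.__name__] = sorted(ents - kept)
        ents = kept
    return ents, removed

sample = {
    'error', 'rates', 'error rates',
    'data', 'quality', 'low-quality data',
    'machine', 'applications', 'machine learning applications',
    'no', 'lunch', 'requirement', 'no-free-lunch requirement',
    'workers',
}

final, removed = trace_filters(sample, [drop_split_duplicates, drop_short, drop_symbols])
for name, dropped in removed.items():
    print(name, dropped)
print(sorted(final))
# drop_split_duplicates ['applications', 'data', 'error', 'machine', 'rates', 'requirement']
# drop_short ['no']
# drop_symbols ['low-quality data', 'no-free-lunch requirement']
# ['error rates', 'lunch', 'machine learning applications', 'quality', 'workers']
```

Note the side effect of dropping anything with a symbol: the hyphenated phrases disappear, while fragments like `lunch` and `quality` survive on their own. Whether that trade-off is acceptable depends on how much hyphenated terminology the domain has.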

## Step 9: Create API for entity extraction pipeline

In [None]:
# TODO: update with other entity_extraction file. 
# TODO: remove common words from word web or from english language - try and narrow to US Language 
# TODO: lowercase all entities

In [48]:
# TODO: combine everything to a nice class and easy to use one time pandas.DataFrame.apply 
# function to extract relevant domain entities all at once
# TODO: extract html from nbconvert and push to medium post

The point of this exercise wasn't to extract the end-all, be-all set of entities from abstracts. It was to get a quick baseline for what we can do with limited knowledge about the domain and limited deep learning superpowers. 

## Step 10: Celebrate with excessive fist-pumping