Natural Language Processing is the use of machines to manipulate natural language. Here, we focus on written language, or in simpler words: text.

# Don't Panic! Hitchhiker's Guide to NLP with spaCy

## _Any semblance of order or coherence is purely accidental_

Humans are the only known species to have developed written language. Yet children do not learn to read and write on their own; they have to be taught. That alone hints at how complex text processing and NLP are.

Natural language processing has been studied for more than 50 years. The famous Turing test for artificial intelligence is itself based on a machine's ability to hold a conversation in natural language. The field has grown alongside both linguistics and computational techniques.

Some Applications of NLP

*  Sentiment Analysis on Social Media
*  Automated Customer Service
*  Chatbots, such as those of Uber and Intercom


In [3]:
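import json
import random
from collections import Counter  # for counting

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns  # for visualization
import spacy
from spacy import displacy  # explicit import, so that `spacy.displacy` resolves below
from IPython.display import Markdown, display

# newer matplotlib releases (3.6+) call this style 'seaborn-v0_8'; switch the name if 'seaborn' is rejected
plt.style.use('seaborn')
sns.set(font_scale=2)

def pretty_print(pp_object):
    """Print a JSON-serializable object with indentation."""
    print(json.dumps(pp_object, indent=2))

def printmd(string, color=None):
    """Render a Markdown string in the notebook, wrapped in a span of the given CSS color."""
    colorstr = "<span style='color:{}'>{}</span>".format(color, string)
    display(Markdown(colorstr))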


## About spaCy

spaCy is a free open-source library for Natural Language Processing in Python. 

It features NER, POS tagging, dependency parsing, word vectors and more. The name spaCy comes from spaces + Cython: spaCy started off as an industrial-grade solution for tokenization and eventually expanded to other challenges. Cython allows spaCy to be very fast compared to other solutions like NLTK.

It has trainable (in other words, customizable and extendable) models for most of these tasks, while providing some really good models out of the box.

In [4]:
!pip install --upgrade pip
!pip install textacy

In [5]:
!python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')
# python -m spacy download en_vectors_web_lg

In [6]:
tweets = pd.read_csv("../input/all-djtrum-tweets/all_djt_tweets.csv")

## Data
We explore some tweets from President Donald Trump.

![](https://screenshotscdn.firefoxusercontent.com/images/69402d7a-bfec-4cd4-9c57-cca42b6e7b86.png)

## What's In Here?
In this kernel, we will learn how to use spaCy in Python to generate questions and answers from *any free text*. We will learn about named entity recognition, dependency parsing, part-of-speech tagging, and more!

1. [Named Entity Recognition](#Named-Entity-Recognition-aka-NER),  visualization with `displacy` and **redacting names automatically without a dictionary**!
2. [Part of Speech Tagging](#Part-of-Speech-Tagging), and exploring what Trump says with *word clouds*!
3. [Using Linguistic annotations with spaCy Match](#Using-Linguistic-annotations-with-spaCy-Match)
4. Dependency Parsing, for [**Automatic Question and Answer Generation**](#Automatic-Question-and-Answer-Generation)

# Named Entity Recognition aka NER

> spaCy can recognise various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.
> 
>  -- from the amazing [spaCy docs](https://spacy.io/usage/linguistic-features#section-named-entities)

## Entities Explained

| Type | Description |
|:---|:---|
| PERSON | People, including fictional. |
| NORP | Nationalities or religious or political groups. |
| FAC | Buildings, airports, highways, bridges, etc. |
| ORG | Companies, agencies, institutions, etc. |
| GPE | Countries, cities, states. |
| LOC | Non-GPE locations, mountain ranges, bodies of water. |
| PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
| EVENT | Named hurricanes, battles, wars, sports events, etc. |
| WORK_OF_ART | Titles of books, songs, etc. |
| LAW | Named documents made into laws. |
| LANGUAGE | Any named language. |
| DATE | Absolute or relative dates or periods. |
| TIME | Times smaller than a day. |
| PERCENT | Percentage, including "%". |
| MONEY | Monetary values, including unit. |
| QUANTITY | Measurements, as of weight or distance. |
| ORDINAL | "first", "second", etc. |
| CARDINAL | Numerals that do not fall under another type. |

Let's look at some examples of the above in real-world sentences. We will also call `spacy.explain()` on every entity label in one example, to build a quick mental model of how these things work.

In [7]:
def explain_text_entities(text):
    doc = nlp(text)
    for ent in doc.ents:
        print(f'Entity: {ent}, Label: {ent.label_}, {spacy.explain(ent.label_)}')

In [8]:
explain_text_entities(tweets['text'][9])

Let's continue exploring NER for some more examples, with different entities: 

In [9]:
one_sentence = tweets['text'][0]
doc = nlp(one_sentence)
spacy.displacy.render(doc, style='ent',jupyter=True)

In [10]:
one_sentence = tweets['text'][240]
doc = nlp(one_sentence)
spacy.displacy.render(doc, style='ent',jupyter=True)

In [11]:
one_sentence = tweets['text'][300]
doc = nlp(one_sentence)
spacy.displacy.render(doc, style='ent',jupyter=True)

In [12]:
one_sentence = tweets['text'][450]
doc = nlp(one_sentence)
spacy.displacy.render(doc, style='ent',jupyter=True)

## Redacting Names

One simple use case for NER is to automatically redact names. This is useful, for example:

- to help ensure that your company data actually complies with GDPR
- when journalists want to publish a large set of documents while still hiding the identity of their sources

We do this redaction in two broad steps:

```markdown
1. find all PERSON names
2. replace these by a filler like "[REDACTED]"
```
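
The two steps can also be sketched without spaCy, assuming the entity spans are already known. In the sketch below, `redact_spans` is a hypothetical helper and the hand-written `(start, end, label)` offsets stand in for what an NER model would predict (spaCy exposes the same information as `ent.start_char`, `ent.end_char` and `ent.label_`).

```python
def redact_spans(text, spans, label="PERSON", filler="[REDACTED]"):
    """Replace every span carrying `label` with `filler`.

    `spans` is a list of (start_char, end_char, label) tuples that are
    assumed not to overlap.
    """
    pieces = []
    cursor = 0
    for start, end, span_label in sorted(spans):
        if span_label != label:
            continue
        pieces.append(text[cursor:start])  # text before the name, untouched
        pieces.append(filler)              # the name itself is dropped
        cursor = end
    pieces.append(text[cursor:])           # whatever follows the last name
    return "".join(pieces)


# Hand-written offsets standing in for model predictions (an assumption).
sentence = "Alice Smith met Bob in Paris."
entities = [(0, 11, "PERSON"), (16, 19, "PERSON"), (23, 28, "GPE")]
redacted = redact_spans(sentence, entities)
print(redacted)
```

Working on character offsets keeps the original spacing and punctuation intact, and only spans with the requested label are touched.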

In [13]:
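def redact_names(text):
    doc = nlp(text)
    # merge every entity into a single token, so a multi-word name is redacted only once
    # (doc.retokenize() replaces the older ent.merge() call)
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    redacted_sentence = []
    for token in doc:
        if token.ent_type_ == "PERSON":
            # keep the trailing whitespace, otherwise the next word gets glued to the filler
            redacted_sentence.append("[REDACTED]" + token.whitespace_)
        else:
            # text_with_ws replaces the older token.string attribute
            redacted_sentence.append(token.text_with_ws)
    return "".join(redacted_sentence)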

In [14]:
printmd("**Before**", color="blue")
one_sentence = tweets['text'][450]
doc = nlp(one_sentence)
spacy.displacy.render(doc, style='ent',jupyter=True)
printmd("**After**", color="blue")
one_sentence = redact_names(tweets['text'][450])
doc = nlp(one_sentence)
spacy.displacy.render(doc, style='ent',jupyter=True)

printmd("Notice that `Obama W.H.` was removed", color="#6290c8")

## Part-of-Speech Tagging

Sometimes, we want to quickly pull out keywords, or keyphrases from a larger body of text. This helps us mentally paint a picture of what this text is about. This is particularly helpful in analysis of texts like long emails or essays.

As a quick hack, we can pull out all relevant "nouns". This is because most keywords are in fact nouns of some form.

### Noun Chunks
We need noun chunks. Noun chunks are noun phrases - not a single word, but a short phrase which describes the noun. For example, "the blue skies" or "the world’s largest conglomerate".

To get the noun chunks in a document, simply iterate over `doc.noun_chunks`:


In [15]:
example_text = tweets['text'][9]
doc = nlp(example_text)
spacy.displacy.render(doc, style='ent', jupyter=True)

for idx, sentence in enumerate(doc.sents):
    for noun in sentence.noun_chunks:
        print(f"sentence {idx+1} has noun chunk '{noun}'")

You might notice that Part-of-Speech tagging is different from our NER results. In this particular example, `Stock Market` is not an entity, but definitely a noun. 

What are the "Parts of Speech" that we can pull out of such sentences?

In [16]:
one_sentence = tweets['text'][300]
doc = nlp(one_sentence)
spacy.displacy.render(doc, style='ent', jupyter=True)

for token in doc:
    print(token, token.pos_)

# What does Trump talk about? 

It might be interesting to explore what Trump even talks about. Is it always the 'Angry Dems'? Or is he a narcissist with too many mentions of The President and the USA?

One way to explore this would be to mine out all the entities and noun chunks from all his tweets! Let's go ahead and do that with amazing ease using spaCy.

In [17]:
import re

text = tweets['text'].str.cat(sep=' ')
# spaCy enforces a default limit of 1,000,000 characters (`nlp.max_length`) for NER and similar use cases.
# Since `text` might be longer than that, we will slice it off here
max_length = 1000000-1
text = text[:max_length]

# removing URLs and '&amp;' substrings using regex
url_reg = r'https?://\S+'  # match http(s) links only, so abbreviations like "U.S." survive
text = re.sub(url_reg, '', text)
noise_reg = r'&amp;?'  # the HTML-escaped ampersand, with or without its ';'
text = re.sub(noise_reg, '', text)
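
To see what a URL and `&amp;` clean-up does in isolation, here is a small sketch on a made-up string. The patterns are one possible, deliberately narrow choice (an assumption: only `http(s)` links and the `&amp;` entity are removed), and both `clean_tweet_text` and `sample` are invented for illustration.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")  # only http:// and https:// links
AMP_PATTERN = re.compile(r"&amp;?")        # '&amp' with or without the ';'


def clean_tweet_text(raw):
    """Strip links and escaped ampersands, then collapse leftover spaces."""
    without_urls = URL_PATTERN.sub("", raw)
    without_amp = AMP_PATTERN.sub("", without_urls)
    return re.sub(r"\s+", " ", without_amp).strip()


sample = "Jobs &amp; growth in the U.S. are up! https://t.co/abc123"
cleaned = clean_tweet_text(sample)
print(cleaned)
```

Keeping the URL pattern narrow means abbreviations such as `U.S.` pass through untouched, which matters later when the entity recognizer looks for place names.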

In [18]:
doc = nlp(text)

In [19]:
items_of_interest = list(doc.noun_chunks)
# each element in this list is spaCy's inbuilt `Span`, which is not useful for us
items_of_interest = [str(x) for x in items_of_interest]
# so we've converted it to string

In [20]:
df_nouns = pd.DataFrame(items_of_interest, columns=["TrumpSays"])
plt.figure(figsize=(5,4))
sns.countplot(y="TrumpSays",
             data=df_nouns,
             order=df_nouns["TrumpSays"].value_counts().iloc[:10].index)
plt.show()

Hmm, this is interesting in that he uses "I" a lot more than "we" and "We" put together, but it does not tell us much beyond that.
What topics does he talk about, beyond these filler words?

**Let's remove these filler words and try again!**

In [21]:
trump_topics = []
for token in doc:
    if (not token.is_stop) and (token.pos_ == "NOUN") and (len(str(token))>2):
        trump_topics.append(token)
        
trump_topics = [str(x) for x in trump_topics]

In [22]:
df_nouns = pd.DataFrame(trump_topics, columns=["Trump Topics"])
df_nouns
plt.figure(figsize=(5,4))
sns.countplot(y="Trump Topics",
             data=df_nouns,
             order=df_nouns["Trump Topics"].value_counts().iloc[:10].index)
plt.show()

This is still not very insightful! Let's investigate the entities from Trump's tweets instead.

## Exploring Entities

In [23]:
trump_topics = []
for ent in doc.ents:
    if ent.label_ not in ["PERCENT", "CARDINAL", "DATE"]:
#         print(ent.text,ent.label_)
        trump_topics.append(ent.text.strip())

In [24]:
df_ttopics = pd.DataFrame(trump_topics, columns=["Trump Nouns"])
plt.figure(figsize=(5,4))
sns.countplot(y="Trump Nouns",
             data=df_ttopics,
             order=df_ttopics["Trump Nouns"].value_counts().iloc[1:11].index)
plt.show()
# from collections import Counter
# item_counter = Counter(items_of_interest)
# item_counter.most_common()
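
The same ranking can be read off without a plot, as the commented-out `Counter` lines hint. The sketch below assumes the entities have already been extracted as `(text, label)` pairs; `sample_entities` is made up for illustration and `top_entities` is a hypothetical helper, not part of spaCy.

```python
from collections import Counter

# the same labels that the entity loop above filters out
SKIPPED_LABELS = {"PERCENT", "CARDINAL", "DATE"}


def top_entities(entities, n=3, skipped=SKIPPED_LABELS):
    """Count entity texts, ignoring entities whose label is in `skipped`."""
    counts = Counter(text.strip() for text, label in entities if label not in skipped)
    return counts.most_common(n)


# Invented (text, label) pairs standing in for `doc.ents` (an assumption).
sample_entities = [
    ("Democrats", "NORP"), ("Russia", "GPE"), ("Democrats", "NORP"),
    ("2018", "DATE"), ("Russia", "GPE"), ("Democrats", "NORP"),
    ("China", "GPE"), ("two", "CARDINAL"),
]
ranking = top_entities(sample_entities)
print(ranking)
```

`Counter.most_common` returns `(item, count)` pairs sorted by count, which is the same ordering that `value_counts()` provides for the plot.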

#### Wow! Trump is really obsessed with Democrats, himself and Hillary. 

In [25]:
from spacy.lang.en.stop_words import STOP_WORDS
from wordcloud import WordCloud
plt.figure(figsize=(10,5))
wordcloud = WordCloud(background_color="white",
                      stopwords=STOP_WORDS,
                      max_words=45,
                      max_font_size=30,
                      random_state=42
                     ).generate(" ".join(trump_topics))  # join the strings; str(list) would add brackets and quotes
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

## Using Linguistic annotations with spaCy Match

> Based on the [Rule Matching docs at spaCy](https://spacy.io/usage/linguistic-features#section-rule-based-matching)

We want to find out what Trump is saying about 

1. Himself e.g. "I am rich". 
2. Russia
3. Democrats

We want to start off by finding _adjectives following_ "Democrats are" or "Democrats were". 

This is obviously a very rudimentary solution, but it'll be fast, and a great way to get an idea of what's in your data. Our pattern looks like this:

```python
[{'LOWER': 'democrats'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'}, {'POS': 'ADJ'}]
```

This translates to a token whose lowercase form matches "democrats" (like Democrats, democrats or DEMoCrats), followed by a token with the lemma "be" (for example, is, was, or 's), followed by zero or more adverbs, followed by an adjective.

The optional adverb makes sure you won't miss adjectives with intensifiers, like "pretty awful" or "very nice".
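
To make the meaning of the pattern concrete, here is a hand-rolled sketch of the same rule on plain dictionaries. It is only an illustration of what the pattern accepts, not how spaCy's `Matcher` is implemented; `match_subject_be_adj` is a hypothetical helper and the token annotations are written by hand rather than predicted by a model.

```python
def match_subject_be_adj(tokens, subject):
    """Find: `subject`, a form of 'be', zero or more adverbs, then an adjective.

    Returns (start, end) token indices, with `end` exclusive like a slice.
    """
    matches = []
    for i, token in enumerate(tokens):
        if token["text"].lower() != subject:       # {'LOWER': subject}
            continue
        j = i + 1
        if j >= len(tokens) or tokens[j]["lemma"] != "be":  # {'LEMMA': 'be'}
            continue
        j += 1
        while j < len(tokens) and tokens[j]["pos"] == "ADV":  # {'POS': 'ADV', 'OP': '*'}
            j += 1
        if j < len(tokens) and tokens[j]["pos"] == "ADJ":      # {'POS': 'ADJ'}
            matches.append((i, j + 1))
    return matches


# Hand-annotated tokens for one sentence (an assumption, not model output).
sample_tokens = [
    {"text": "The", "lemma": "the", "pos": "DET"},
    {"text": "Democrats", "lemma": "Democrats", "pos": "PROPN"},
    {"text": "are", "lemma": "be", "pos": "AUX"},
    {"text": "very", "lemma": "very", "pos": "ADV"},
    {"text": "weak", "lemma": "weak", "pos": "ADJ"},
    {"text": "on", "lemma": "on", "pos": "ADP"},
    {"text": "crime", "lemma": "crime", "pos": "NOUN"},
]
spans = match_subject_be_adj(sample_tokens, "democrats")
phrases = [" ".join(t["text"] for t in sample_tokens[s:e]) for s, e in spans]
print(phrases)
```

The `while` loop is the part that corresponds to `'OP': '*'`: it happily consumes no adverbs at all, so "Democrats are weak" would match as well.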

This kind of adjective mining can then be used as features to do _aspect-based sentiment analysis_, which is finding sentiment with respect to specific entities or words. 

In [26]:
from spacy.matcher import Matcher
# doc = nlp(text)
matcher = Matcher(nlp.vocab)
matched_sents = [] # collect data of matched sentences to be visualized

# label shown in the displaCy visualization for each match key
MATCH_LABELS = {'DEMOCRATS': 'DEMOCRATS', 'RUSSIA': 'RUSSIA', 'I': 'NARC'}

def collect_sents(matcher, doc, i, matches, label='MATCH'):
    """
    Function to help reformat data for displacy visualization
    """
    match_id, start, end = matches[i]
    span = doc[start : end]  # matched span
    sent = span.sent  # sentence containing matched span
    key = doc.vocab.strings[match_id]  # don't forget to get string!

    # append mock entity for match in displaCy style to matched_sents
    if key in MATCH_LABELS:
        match_ents = [{'start': span.start_char - sent.start_char,
                       'end': span.end_char - sent.start_char,
                       'label': MATCH_LABELS[key]}]
        matched_sents.append({'text': sent.text, 'ents': match_ents })
    
# declare different patterns
russia_pattern = [{'LOWER': 'russia'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'},
           {'POS': 'ADJ'}]
democrats_pattern = [{'LOWER': 'democrats'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'},
           {'POS': 'ADJ'}]
i_pattern = [{'LOWER': 'i'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'},
           {'POS': 'ADJ'}]

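# spaCy v3 (and v2.2.2+) expects a list of patterns, with the callback passed as `on_match`
matcher.add('DEMOCRATS', [democrats_pattern], on_match=collect_sents)  # add pattern
matcher.add('RUSSIA', [russia_pattern], on_match=collect_sents)  # add pattern
matcher.add('I', [i_pattern], on_match=collect_sents)  # add pattern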
matches = matcher(doc)

spacy.displacy.render(matched_sents, style='ent', manual=True, jupyter=True,  options = {'colors': {'NARC': '#6290c8', 'RUSSIA': '#cc2936', 'DEMOCRATS':'#f2cd5d'}})

#### Let's preview 2-3 elements used in the displaCy visualization above. This is what the list of dictionaries looks like: 

In [27]:
pretty_print(matched_sents[:3])
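
Each element is a dictionary with the sentence text and a list of entity dictionaries whose `start` and `end` are character offsets relative to that sentence. A minimal sketch of building one such entry by hand is shown below; `to_displacy_ent` is a hypothetical helper and the sentence is invented for illustration.

```python
def to_displacy_ent(sentence, phrase, label):
    """Build one entry in displaCy's manual 'ent' format for `phrase` in `sentence`."""
    start = sentence.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in sentence")
    return {
        "text": sentence,
        # offsets are measured from the start of `sentence`, not of the whole document
        "ents": [{"start": start, "end": start + len(phrase), "label": label}],
    }


example_sentence = "The Democrats are very weak on crime."
entry = to_displacy_ent(example_sentence, "Democrats are very weak", "DEMOCRATS")
print(entry)
```

This is why `collect_sents` subtracts `sent.start_char` from the span offsets: displaCy highlights characters within the text it is given, so document-level offsets would point at the wrong place.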

# Automatic Question and Answer Generation

> ### When asked to produce The Ultimate Question, Deep Thought says that it cannot...
>
> -- Douglas Adams

### The Challenge

Can you automatically convert a sentence to a question?

For instance, 
```bash
Martin Luther King Jr. was a civil rights activist and skilled orator
``` 

to 

```js
Who was Martin Luther King Jr.?
```

Notice that when we convert a sentence to a question, the answer might not be in the original sentence anymore. To me, the answer to that question might be something different and that's fine. We are not aiming for _correct_ answers here.

## Question Generation using Dependency Parsing


Dependency parsing analyzes the grammatical structure of a sentence. It establishes a "tree" like structure between a "root" word and those that are related to it by branches of some manner. 
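
As a rough mental model, the tree can be stored by giving every token the index of its head. The sketch below uses a parse written by hand for a made-up sentence (an assumption, not model output); `find_root` and `children_of` are hypothetical helpers that mimic what `token.head` and `token.children` provide in spaCy.

```python
# Hand-written parse of "Trump signed the bill".
# Each token stores the index of its head; the root points at itself, as in spaCy.
parsed = [
    {"text": "Trump",  "dep": "nsubj", "head": 1},
    {"text": "signed", "dep": "ROOT",  "head": 1},
    {"text": "the",    "dep": "det",   "head": 3},
    {"text": "bill",   "dep": "dobj",  "head": 1},
]


def find_root(tokens):
    """Return the index of the token that is its own head."""
    for index, token in enumerate(tokens):
        if token["head"] == index:
            return index
    raise ValueError("no root found")


def children_of(tokens, parent):
    """Return the indices of all tokens attached directly to `parent`."""
    return [i for i, t in enumerate(tokens) if t["head"] == parent and i != parent]


root = find_root(parsed)
subjects = [parsed[i]["text"] for i in children_of(parsed, root) if parsed[i]["dep"] == "nsubj"]
objects = [parsed[i]["text"] for i in children_of(parsed, root) if parsed[i]["dep"] == "dobj"]
print(parsed[root]["text"], subjects, objects)
```

Reading the subject and object off the children of the root verb is the idea that the question generation further down relies on.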

In [28]:
example_text = tweets['text'][180]
doc = nlp(example_text)

In [29]:
options = {'compact': True, 'bg': '#09a3d5',
           'color': 'white', 'font': 'Trebuchet MS'}
spacy.displacy.render(doc, jupyter=True, style='dep', options=options)

We can also read these relationships in a parent-child format, looking at one word at a time:

In [30]:
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])

Adding some explainer text in the output itself:

In [31]:
for token in doc:
    # two adjacent f-strings avoid the stray comma and spaces of a backslash continuation
    print(f"token: {token.text},\t dep: {token.dep_},\t head: {token.head.text},\t head pos: {token.head.pos_},"
          f"\t children: {[child for child in token.children]}")

To generate our questions, let's actually use these two ideas:
- Subject of Verb
- Object of Verb


In [32]:
from textacy.spacier import utils as spacy_utils

def para_to_ques(eg_text):
    """
    Generates a few simple questions by slot filling pieces from sentences
    """
    doc = nlp(eg_text)
    results = []
    for sentence in doc.sents:
        root = sentence.root
        ask_about = spacy_utils.get_subjects_of_verb(root)
        answers = spacy_utils.get_objects_of_verb(root)
        if len(ask_about) > 0 and len(answers) > 0:
            if root.lemma_ == "be":
                question = f'What {root} {ask_about[0]}?'
            else:
                question = f'What does {ask_about[0]} {root.lemma_}?'
            results.append({'question':question, 'answers':answers})
    return results

In [33]:
example_text = tweets['text'][180]
doc = nlp(example_text)
spacy.displacy.render(doc, style='ent', jupyter=True)
print(para_to_ques(example_text))

To simply hint at the power of `textacy` and `spaCy`, I have written only two simple rules to create questions. By adding more rules, backed by more nuanced examples, we can generate a large number of specific questions _and_ answers.
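
Once the subject and the root verb are known, the two rules reduce to string templates. The sketch below isolates that slot-filling step on plain strings; `make_question` is a hypothetical helper, and its inputs are written by hand rather than taken from a parse.

```python
def make_question(subject, verb, lemma):
    """Fill one of two templates, depending on whether the verb is a form of 'be'."""
    if lemma == "be":
        # keep the inflected verb: "is", "was", "are", ...
        return f"What {verb} {subject}?"
    # otherwise fall back to "does" plus the base form of the verb
    return f"What does {subject} {lemma}?"


# (subject, verb as written, lemma) triples invented for illustration.
examples = [("the economy", "is", "be"), ("Congress", "passed", "pass")]
questions = [make_question(*example) for example in examples]
print(questions)
```

Further rules, for instance choosing "Who" when the subject is a PERSON entity, would be additional branches in the same function.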

You can find a thorough and **complete guide to question formation in English on [StackExchange here](https://ell.stackexchange.com/a/1198)**.

# What did we really see here?

## Don't Panic! Nothing is really beyond spaCy

![](https://i.kym-cdn.com/photos/images/newsfeed/001/022/354/081.jpeg)

1. [Named Entity Recognition](#Named-Entity-Recognition-aka-NER), the different entities and their visualization with `displacy`
2. [Part of Speech Tagging](#Part-of-Speech-Tagging), and exploring what Trump says with *word clouds*!
3. [Using Linguistic annotations with spaCy Match](#Using-Linguistic-annotations-with-spaCy-Match)
4. Dependency Parsing, for [**Automatic Question and Answer Generation**](#Automatic-Question-and-Answer-Generation)

# Bookmark this with [http://bit.ly/spacykernel](http://bit.ly/spacykernel)