# Natural Language Processing in Python

Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data.

**NLP Basics**
- [1. Tokenization](#1.-Tokenization)
    * [1.1. Part-of-Speech Tagging (POS)](#1.1.-Part-of-Speech-Tagging-(POS))
    * [1.2. Dependencies](#1.2.-Dependencies)
    * [1.3. Named Entities](#1.3.-Named-Entities)
    * [1.4. Noun Chunks](#1.4.-Noun-Chunks)
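    * [1.5. Vocabulary and Matching](#1.5.-Vocabulary-and-Matching)
        + [1.5.1. Rule-based Matching](#1.5.1.-Rule-based-Matching)
        + [1.5.2. PhraseMatcher](#1.5.2.-PhraseMatcher)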
    * [1.6. Additional Token Attributes](#1.6.-Additional-Token-Attributes)
        + [1.6.1. Stemming/Lemmatization](#1.6.1.-Stemming/Lemmatization)
        + [1.6.2. Stop Words](#1.6.2.-Stop-Words)
        + [1.6.3. Spans](#1.6.3.-Spans)
        + [1.6.4. Sentences](#1.6.4.-Sentences)
    * [1.7. Visualization](#1.7.-Visualization)
        + [1.7.1. Visualizing POS](#1.7.1.-Visualizing-POS)
        + [1.7.2. Visualizing NER](#1.7.2.-Visualizing-NER)

In [1]:
# Import libraries

import spacy

nlp = spacy.load('en_core_web_sm')
# Larger English models are also available: 'en_core_web_md' and 'en_core_web_lg'

# 1. Tokenization

- Tokenization is the process of breaking up the original text into component pieces (tokens)

In [2]:
doc = nlp(u"Tesla isn't looking into startups anymore.")

for token in doc:
    print(f'{token.text:{15}} {token.pos_:{15}} {token.dep_:{15}}')

Tesla           PROPN           nsubj          
is              AUX             aux            
n't             PART            neg            
looking         VERB            ROOT           
into            ADP             prep           
startups        NOUN            pobj           
anymore         ADV             advmod         
.               PUNCT           punct          
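
To make the idea of breaking text into tokens concrete, here is a minimal sketch in plain Python. It is not spaCy's tokenizer, which applies language-specific rules and exceptions; the regular expression and the `simple_tokenize` name are illustrative assumptions that merely reproduce the split shown above for this one sentence.

```python
import re

# Illustrative pattern (an assumption, not spaCy's rules):
#   n't          -> the contraction suffix as its own token
#   \w+(?=n't)   -> the part of a word that precedes a contraction ("is" in "isn't")
#   \w+          -> any other run of word characters
#   [^\w\s]      -> a single punctuation character
TOKEN_PATTERN = re.compile(r"n't|\w+(?=n't)|\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word, contraction and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("Tesla isn't looking into startups anymore.")
print(tokens)
```

Note that the contraction and the final period become separate tokens, just as in the spaCy output.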


## 1.1. Part-of-Speech Tagging (POS)

- POS tagging, or grammatical tagging, is the process of marking up a word in a text as corresponding to a particular part of speech, based on its definition and its context
    * For a full list of POS Tags visit: https://spacy.io/api/annotation#pos-tagging
- To view the **coarse** POS tag use `token.pos_`
- To view the **fine-grained** tag use `token.tag_`
- To view the description of either type of tag use `spacy.explain(tag)`

<div class="alert alert-success">Note that `token.pos` and `token.tag` return integer hash values; by adding the underscores we get the text equivalent that lives in **doc.vocab**.</div>

In [3]:
for token in doc:
    print(f'{token.text:{15}} {token.pos_:{15}} {token.tag_:{15}} {spacy.explain(token.tag_)}')

Tesla           PROPN           NNP             noun, proper singular
is              AUX             VBZ             verb, 3rd person singular present
n't             PART            RB              adverb
looking         VERB            VBG             verb, gerund or present participle
into            ADP             IN              conjunction, subordinating or preposition
startups        NOUN            NNS             noun, plural
anymore         ADV             RB              adverb
.               PUNCT           .               punctuation mark, sentence closer


In [4]:
# Counting POS Tags

POS_counts = doc.count_by(spacy.attrs.POS)
POS_counts

{96: 1, 87: 1, 94: 1, 100: 1, 85: 1, 92: 1, 86: 1, 97: 1}
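
The keys returned by `count_by` are integer IDs, which are hard to read. Below is a minimal sketch of decoding them through `doc.vocab`, the same lookup used for the dependency counts in section 1.2. To stay self-contained it uses a blank English pipeline and hand-assigned tags (both assumptions for illustration); with a loaded model such as `en_core_web_sm` the tags would come from the tagger instead.

```python
import spacy
from spacy.attrs import POS

# Assumption: a blank pipeline avoids downloading a model; it only tokenizes
nlp_blank = spacy.blank("en")
sample_doc = nlp_blank("Tesla is looking into startups")

# Assumption: coarse tags assigned by hand purely for illustration
for token, pos in zip(sample_doc, ["PROPN", "AUX", "VERB", "ADP", "NOUN"]):
    token.pos_ = pos

pos_counts = sample_doc.count_by(POS)  # {integer ID: count}

# Look up each integer ID in the vocab to recover its text label
readable_counts = {sample_doc.vocab[k].text: v for k, v in pos_counts.items()}

for label, count in sorted(readable_counts.items()):
    print(f"{label:{6}} {count}")
```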

## 1.2. Dependencies

- A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between "head" words and words which modify those heads
    * For a full list of Syntactic Dependencies visit https://spacy.io/api/annotation#dependency-parsing

<div class="alert alert-success">Note that `token.dep` returns an integer hash value; adding the underscore (`token.dep_`) gives the text equivalent that lives in **doc.vocab**.</div>

In [5]:
for token in doc:
    print(f'{token.text:{15}} {token.dep_:{15}}')

Tesla           nsubj          
is              aux            
n't             neg            
looking         ROOT           
into            prep           
startups        pobj           
anymore         advmod         
.               punct          


In [6]:
# Count the different dependencies:
DEP_counts = doc.count_by(spacy.attrs.DEP)

for k,v in sorted(DEP_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{4}}: {v}')

400. advmod: 1
405. aux : 1
425. neg : 1
429. nsubj: 1
439. pobj: 1
443. prep: 1
445. punct: 1
8206900633647566924. ROOT: 1


## 1.3. Named Entities

- Named entity recognition (NER) is the task of identifying and categorizing key information (entities) in text
    * An entity can be any word or series of words that consistently refers to the same thing

In [7]:
# Write a function to display basic entity info:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))
    else:
        print('No named entities found.')

In [8]:
doc1 = nlp(u'May I go to Washington, DC next May to see the Washington Monument?')

show_ents(doc1)

Washington, DC - GPE - Countries, cities, states
next May - DATE - Absolute or relative dates or periods
the Washington Monument - ORG - Companies, agencies, institutions, etc.


In [9]:
# Entity annotations

for ent in doc1.ents:
    print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_)

Washington, DC 4 7 12 26 GPE
next May 7 9 27 35 DATE
the Washington Monument 11 14 43 66 ORG
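
In the output above, `start` and `end` are token indices into the Doc, while `start_char` and `end_char` are character offsets into the original string. As a quick check in plain Python, slicing the raw text with the printed character offsets recovers each entity. The `entity_offsets` list simply copies the values printed above.

```python
text = 'May I go to Washington, DC next May to see the Washington Monument?'

# (start_char, end_char, label) copied from the output above
entity_offsets = [(12, 26, 'GPE'), (27, 35, 'DATE'), (43, 66, 'ORG')]

# Slice the raw string with each pair of character offsets
recovered = [(text[start:end], label) for start, end, label in entity_offsets]

for entity_text, label in recovered:
    print(f'{entity_text} - {label}')
```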


## 1.4. Noun Chunks

- `Doc.noun_chunks` are *base noun phrases*: token spans that include the noun and words describing the noun.
- Noun chunks cannot be nested, cannot overlap, and do not involve prepositional phrases or relative clauses.
- Where `Doc.ents` rely on the **ner** pipeline component, `Doc.noun_chunks` are provided by the **parser**.
    * For more on **noun_chunks** visit https://spacy.io/usage/linguistic-features#noun-chunks

In [10]:
doc2 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc2.noun_chunks:
    print(chunk.text+' - '+chunk.root.text+' - '+chunk.root.dep_+' - '+chunk.root.head.text)

Autonomous cars - cars - nsubj - shift
insurance liability - liability - dobj - shift
manufacturers - manufacturers - pobj - toward


## 1.5. Vocabulary and Matching

### 1.5.1. Rule-based Matching

spaCy offers a rule-matching tool called `Matcher` that allows you to build a library of token patterns, then match those patterns against a Doc object to return a list of found matches. You can match on any part of the token including text and annotations, and you can add multiple patterns to the same matcher.

In [11]:
from spacy.matcher import Matcher

# Setting up the matcher
matcher = Matcher(nlp.vocab)

# Example document: 'Solar Power', 'solarpower' and 'Solar-power' should all match as SolarPower
doc3 = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')

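# Generating the patterns to be matched
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]
# 'OP': '*' allows zero or more punctuation tokens between the two words
pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]

# Adding the patterns to the matcher
# (list form, accepted by spaCy 2.2.2 and later; older versions used
#  matcher.add('SolarPower', None, pattern1, pattern2, pattern3))
matcher.add('SolarPower', [pattern1, pattern2, pattern3])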

# Applying the matcher and showing the matches
found_matches = matcher(doc3)

for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id] # Look up the string form of the match ID
    span = doc3[start:end]
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 10 11 solarpower
8656102463236116519 SolarPower 13 16 Solar-power


### 1.5.2. PhraseMatcher

An alternative, and often more efficient, method is to match against terminology lists. In this case we create a Doc object for each phrase in the list and add those Docs to a `PhraseMatcher` instead of writing token patterns.

- The methodology is similar to Rule-based Matching
    * However, each phrase in the list must first be converted into a Doc object by calling `nlp()`
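
A minimal sketch of this workflow follows. It assumes a blank English pipeline (tokenizer only) so that no model download is needed, a hypothetical `phrase_list`, and the list form of `add()` accepted by spaCy 2.2.2 and later; `attr="LOWER"` is used here so that matching ignores case.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a blank English pipeline is enough for phrase matching
nlp_blank = spacy.blank("en")

# attr="LOWER" compares lowercase token text, making matches case-insensitive
phrase_matcher = PhraseMatcher(nlp_blank.vocab, attr="LOWER")

# Hypothetical terminology list
phrase_list = ["solar power", "solarpower"]

# Convert each phrase into a Doc object
phrase_patterns = [nlp_blank(text) for text in phrase_list]

# Register all phrase Docs under one match ID
phrase_matcher.add("SolarPower", phrase_patterns)

sample_doc = nlp_blank("The Solar Power industry continues to grow "
                       "as demand for solarpower increases.")

# Each match is a (match_id, start, end) tuple of token indices
matched_spans = [sample_doc[start:end].text
                 for match_id, start, end in phrase_matcher(sample_doc)]
print(matched_spans)
```

Unlike the rule-based patterns above, a phrase list only matches the token sequences it contains, so a hyphenated variant such as "Solar-power" would need its own entry.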

## 1.6. Additional Token Attributes

### 1.6.1. Stemming/Lemmatization

- **Stemming:**
    * Crude method for cataloging related words; essentially chops off letters from the end of a word until the stem is reached
- **Lemmatization:**
    * Looks beyond word reduction, and considers a language's full vocabulary to apply a morphological analysis to words
        + Lemmatization is therefore typically more informative than stemming; for example, it can map the irregular form "ran" to "run"
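
spaCy itself does not include a stemmer, so the contrast is easiest to see with a toy one. The sketch below is a deliberately crude suffix stripper in plain Python; the suffix list and the `crude_stem` name are assumptions for illustration, not a standard algorithm such as Porter's.

```python
# Hypothetical suffix list, checked in this order
SUFFIXES = ("ing", "ly", "ed", "es", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ["running", "runs", "ran", "easily", "races"]
stems = [crude_stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word:{10}} {stem}")
```

The stripper turns "running" into "runn" and leaves the irregular form "ran" untouched, which is the kind of result that lemmatization improves on.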

In [12]:
# Lemmatization

doc4 = nlp(u"I am a runner running in a race because I love to run since I ran today")

for token in doc4:
    print(f'{token.text:{10}} {token.pos_:{10}} {token.lemma:{20}} {token.lemma_:{15}}')

I          PRON         561228191312463089 -PRON-         
am         AUX        10382539506755952630 be             
a          DET        11901859001352538922 a              
runner     NOUN       12640964157389618806 runner         
running    VERB       12767647472892411841 run            
in         ADP         3002984154512732771 in             
a          DET        11901859001352538922 a              
race       NOUN        8048469955494714898 race           
because    SCONJ      16950148841647037698 because        
I          PRON         561228191312463089 -PRON-         
love       VERB        3702023516439754181 love           
to         PART        3791531372978436496 to             
run        VERB       12767647472892411841 run            
since      SCONJ      10066841407251338481 since          
I          PRON         561228191312463089 -PRON-         
ran        VERB       12767647472892411841 run            
today      NOUN       11042482332948150395 today        

### 1.6.2. Stop Words

- Words like "a" and "the" appear so frequently that they don't require tagging as thoroughly as nouns, verbs and modifiers. We call these **stop words**, and they can be filtered from the text to be processed. 

In [13]:
# Checking if a word is a stop word

print(nlp.vocab['myself'].is_stop)

print(nlp.vocab['mystery'].is_stop)

True
False


In [14]:
# Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('btw')

# Set the stop_word tag on the lexeme
nlp.vocab['btw'].is_stop = True

len(nlp.Defaults.stop_words)

327

In [15]:
# Remove the word from the set of stop words
nlp.Defaults.stop_words.remove('btw')

# Remove the stop_word tag from the lexeme
nlp.vocab['btw'].is_stop = False

len(nlp.Defaults.stop_words)

326
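
Filtering stop words out of a text can be sketched as follows. The example imports spaCy's English `STOP_WORDS` set directly, so no model is required; the sample sentence is hypothetical, and splitting on whitespace stands in for real tokenization.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# Hypothetical sample sentence, split on whitespace to keep the sketch short
words = "the mystery is a puzzle to myself".split()

# Keep only the words that are not in spaCy's English stop-word set
content_words = [word for word in words if word.lower() not in STOP_WORDS]
print(content_words)
```

With a processed Doc, the equivalent filter is `[token.text for token in doc if not token.is_stop]`.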

### 1.6.3. Spans

- A **span** is a slice of the Doc object in the form `Doc[start:stop]`

In [16]:
doc5 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

life_quote = doc5[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


### 1.6.4. Sentences

- The `Doc.sents` attribute segments the document into sentences

In [17]:
doc6 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

for sent in doc6.sents:
    print(sent)

This is the first sentence.
This is another sentence.
This is the last sentence.


## 1.7. Visualization

- spaCy offers a built-in visualizer called **displaCy**, which shows the links between the tokens
- Outside Jupyter (for example, in another Python IDE or in a script), you can choose to have spaCy serve up the HTML separately.
    * Instead of `displacy.render()`, use `displacy.serve()`, which starts a local web server.
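
A minimal sketch of the script variant is shown below. The `serve_dependency_view` wrapper and its default port are assumptions for illustration; the call to `displacy.serve()` blocks until the server is stopped, so the example call is left commented out.

```python
import spacy
from spacy import displacy

def serve_dependency_view(text, model='en_core_web_sm', port=5000):
    """Parse text and serve its dependency visualization over HTTP."""
    nlp_model = spacy.load(model)
    parsed_doc = nlp_model(text)
    # displacy.serve blocks until the server is stopped (for example with Ctrl+C)
    displacy.serve(parsed_doc, style='dep', port=port)

# Example call, commented out because it blocks the interpreter:
# serve_dependency_view("The quick brown fox jumped over the lazy dog's back.")
```

Once the server is running, the visualization can be opened in a browser at the chosen port on the local machine.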

In [18]:
# Import the displaCy library
from spacy import displacy

### 1.7.1. Visualizing POS

In [19]:
# Creating a sample document
doc7 = nlp(u"The quick brown fox jumped over the lazy dog's back.")

displacy.render(doc7, style='dep', jupyter=True, options={'distance': 110})

In [20]:
for token in doc7:
    print(f'{token.text:{10}} {token.pos_:{7}} {token.dep_:{7}} {spacy.explain(token.dep_)}')

The        DET     det     determiner
quick      ADJ     amod    adjectival modifier
brown      ADJ     amod    adjectival modifier
fox        PROPN   nsubj   nominal subject
jumped     VERB    ROOT    None
over       ADP     prep    prepositional modifier
the        DET     det     determiner
lazy       ADJ     amod    adjectival modifier
dog        NOUN    poss    possession modifier
's         PART    case    case marking
back       NOUN    pobj    object of preposition
.          PUNCT   punct   punctuation


### 1.7.2. Visualizing NER

In [21]:
doc8 = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. '
         u'By contrast, Sony sold only 7 thousand Walkman music players.')

displacy.render(doc8, style='ent', jupyter=True)

In [22]:
# To customize colors, effects and view specific entities
colors = {'ORG': 'linear-gradient(90deg, #aa9cfc, #fc9ce7)', 'PRODUCT': 'radial-gradient(yellow, green)'}
options = {'ents': ['ORG', 'PRODUCT'], 'colors':colors}

displacy.render(doc8, style='ent', jupyter=True, options=options)