<br>

# Advanced NLP with `spaCy`

<br>

## Finding words, phrases, names and concepts

### Intro to `spaCy`

<br>

In [2]:
# import the English language class
from spacy.lang.en import English

# create an nlp object
nlp = English()

<br>

The `nlp` object contains the processing pipeline and the language-specific rules for tokenization.
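
As a minimal sketch of what that means in practice, the blank `English` class ships tokenization rules but no trained components. The example sentence below is an assumption chosen for illustration and is not part of the original notebook.

```python
from spacy.lang.en import English

# a blank English pipeline: tokenizer rules only, no trained components
nlp = English()

# no pipeline components have been added yet
print(nlp.pipe_names)

# the language-specific rules split the contraction and the punctuation
doc = nlp("I don't know!")
tokens = [token.text for token in doc]
print(tokens)
```

The contraction is split into two tokens and the exclamation mark becomes its own token, which is the kind of rule the language class carries.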

<br>

In [3]:
# a Doc object is created by processing a string of text with the nlp object
doc = nlp( "Hello world!" )

# iterate over the tokens in the Doc
for token in doc:
    print( token.text )

Hello
world
!


In [5]:
# index into the Doc to get a single token
token =  doc[1]
print( token )

# get the token text by way of the .text attribute
print( token.text )

world
world


In [6]:
# Span object: consists of multiple tokens, i.e. a slice of the Doc object
# (the end index is exclusive, so 1:3 covers tokens 1 and 2)
span = doc[1:3]
print( span.text )

world!


In [7]:
# lexical attributes
doc = nlp( "It costs $5." )

print( 'Index:  ', [ token.i for token in doc ] )
print( 'Text:  ', [ token.text for token in doc ] )
print( 'is_alpha  ', [ token.is_alpha for token in doc ] )
print( 'is_punct  ', [ token.is_punct for token in doc ] )
print( 'like_num  ', [ token.like_num for token in doc ] )

Index:   [0, 1, 2, 3, 4]
Text:   ['It', 'costs', '$', '5', '.']
is_alpha   [True, True, False, False, False]
is_punct   [False, False, False, False, True]
like_num   [False, False, False, True, False]


In [8]:
# Import the German language class
from spacy.lang.de import German

# Create the nlp object
nlp = German()

# Process a text (this is German for: "Kind regards!")
doc = nlp("Liebe Grüße!")

# Print the document text
print(doc.text)

Liebe Grüße!


In [9]:
# Process the text (nlp is still the German pipeline from the previous cell)
doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. Now less than 4% are.")

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i+1]
        # Check if the next token's text equals '%'
        if next_token.text == '%':
            print('Percentage found:', token.text)

Percentage found: 60
Percentage found: 4
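
The loop above assumes every number-like token is followed by another token, so `doc[token.i+1]` would raise an `IndexError` if the text ended with a number. A guarded variant might look like the sketch below. The sentence is a made-up example, and the blank `English` pipeline is an assumption to keep the block self-contained.

```python
from spacy.lang.en import English

nlp = English()

# hypothetical sentence that ends with a number-like token
doc = nlp("Growth reached 60% in 1990")

percentages = []
for token in doc:
    # only look ahead when a next token actually exists
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        if next_token.text == "%":
            percentages.append(token.text)

print("Percentages found:", percentages)
```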


<br>

### Statistical Models

Statistical models enable `spaCy` to predict linguistic attributes in context:

* POS tags
* syntactic dependencies
* named entities

Models are trained on labeled example texts and can be updated with more examples to fine-tune their predictions.
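
One way to see what a trained model contributes is to look at a pipeline without one. The sketch below assumes a blank `English` pipeline, which only tokenizes: the predicted attributes stay empty until a trained pipeline such as `en_core_web_sm` processes the text.

```python
from spacy.lang.en import English

# blank pipeline: no tagger, parser or entity recognizer
nlp = English()
doc = nlp("She ate the pizza")

# without a trained tagger the part-of-speech attribute is an empty string
pos_tags = [token.pos_ for token in doc]
print(pos_tags)

# without a trained entity recognizer no entities are predicted
print(len(doc.ents))
```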

<br>

In [13]:
import spacy
# load the small English model
nlp = spacy.load( 'en_core_web_sm' )
# process the text
doc = nlp( 'She ate the pizza' )
# iterate over the tokens
for token in doc:
    print( token.text, token.pos_, token.dep_, token.head.text )

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


In [14]:
doc = nlp( u"Apple is looking at buying U.K. startup for $1 billion" )
for ent in doc.ents: 
    print( ent.text, ent.label_ )

Apple ORG
U.K. GPE
$1 billion MONEY


In [17]:
# look up a short description of a label or tag
print( spacy.explain( 'GPE' ) )
print( spacy.explain( 'NNP' ) )
print( spacy.explain( 'dobj' ) )

Countries, cities, states
noun, proper singular
direct object


In [18]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp( text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print('{:<12}{:<10}{:<10}'.format(token_text, token_pos, token_dep))

It          PRON      dep       
’s          INTJ      intj      
official    ADJ       amod      
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          AUX       ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


In [19]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp( text )

# Iterate over the predicted entities
for ent in doc.ents:
    # print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


<br>

### Rule-based Matching

**Match patterns** are lists of dictionaries, one dictionary per token. Each dictionary maps token attributes to their expected values.

<br>

In [23]:
from spacy.matcher import Matcher
nlp = spacy.load( 'en_core_web_sm' )
matcher = Matcher( nlp.vocab )
# add the pattern to the matcher
pattern = [ { 'ORTH':'iPhone' }, { 'ORTH':'X' } ]
matcher.add( 'IPHONE_PATTERN', [ pattern ] )
# call the matcher on a Doc to get the matches
doc = nlp( 'New iPhone X release date leaked' )
matches = matcher( doc )
matches

[(9528407286733565721, 1, 3)]

In [24]:
for match_id, start, end in matches:
    # each match is (match_id, start, end); slice the Doc to get the matched Span
    matched_span = doc[ start:end ]
    print( matched_span.text )

iPhone X


In [26]:
# lexical matches
pattern = [
    { 'IS_DIGIT': True },
    { 'LOWER':'fifa' },
    { 'LOWER':'world' },
    { 'LOWER':'cup' },
    { 'IS_PUNCT': True }
]

doc = nlp( '2018 FIFA World Cup: France won!')
matcher.add( 'FIFA_PATTERN', [pattern] )
matches = matcher( doc )
matches

[(17311505950452258848, 0, 5)]

In [27]:
pattern = [
    { 'LEMMA': 'love', 'POS': 'VERB' },
    { 'POS': 'NOUN' }
]

doc = nlp( 'I loved dogs but now I love cats more' )
matcher.add( 'LOVE', [pattern] )
matches = matcher( doc )
matches

[(18437031736592595799, 1, 3), (18437031736592595799, 6, 8)]

<br>

Using operators and quantifiers

| Operator | Description |
|:-------------:|:----------------------------:|
| `{'OP': '!'}` | Negation: match 0 times |
| `{'OP': '?'}` | Optional: match 0 or 1 times |
| `{'OP': '+'}` | Match 1 or more times |
| `{'OP': '*'}` | Match 0 or more times |
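
As a small illustration of the `?` operator, the sketch below makes the middle token optional. It uses only lexical attributes (`LOWER`, `IS_DIGIT`), so a blank `English` pipeline is enough. The pattern name `IOS_OPTIONAL_PATTERN` and the sentence are hypothetical and not taken from the notebook.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

# "ios" (any casing), then an optional "version", then a digit token
pattern = [
    {"LOWER": "ios"},
    {"LOWER": "version", "OP": "?"},
    {"IS_DIGIT": True},
]
matcher.add("IOS_OPTIONAL_PATTERN", [pattern])

doc = nlp("iOS 10 and iOS version 11")

# collect the text of every matched span, sorted for stable printing
matched_texts = sorted(doc[start:end].text for match_id, start, end in matcher(doc))
print(matched_texts)
```

Both the short form and the form with the optional token should match, since `?` allows the middle token to appear zero times or once.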

<br>

In [29]:
doc = nlp("After making the iOS update you won't notice a radical system-wide redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of iOS 11's furniture remains the same as in iOS 10. But you will discover some tweaks once you delve a little deeper.")

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{'TEXT': 'iOS'}, {'IS_DIGIT': True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('IOS_VERSION_PATTERN', [pattern])
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


In [31]:
doc = nlp("i downloaded Fortnite on my laptop and can't open the game at all. Help? so when I was downloading Minecraft, I got the Windows version where it is the '.zip' folder and I used the default program to unpack it... do I also need to download Winzip?")

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{'LEMMA': 'download'}, {'POS': 'PROPN'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('DOWNLOAD_THINGS_PATTERN', [pattern] )
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip
