<br>

# Advanced NLP with `spaCy`

<br>

## Finding words, phrases, names and concepts

### Intro to `spaCy`

<br>

In [2]:
# import the English language class
from spacy.lang.en import English

# create an nlp object
nlp = English()

<br>

The `nlp` object contains the processing pipeline and the language-specific rules for tokenization.

<br>

In [3]:
# a Doc object is created by processing a string of text with the nlp object
doc = nlp( "Hello world!" )

#iterate over tokens in a doc:
for token in doc:
    print( token.text )

Hello
world
!


In [5]:
# index into the Doc to get a single token
token =  doc[1]
print( token )

# get the token text by way of the .text attribute
print( token.text )

world
world


In [6]:
# Span object: consists of multiple tokens; it is a slice of the Doc object
span = doc[1:3]
print( span.text )

world!


In [7]:
# lexical attributes
doc = nlp( "It costs $5." )

print( 'Index:  ', [ token.i for token in doc ] )
print( 'Text:  ', [ token.text for token in doc ] )
print( 'is_alpha  ', [ token.is_alpha for token in doc ] )
print( 'is_punct  ', [ token.is_punct for token in doc ] )
print( 'like_num  ', [ token.like_num for token in doc ] )

Index:   [0, 1, 2, 3, 4]
Text:   ['It', 'costs', '$', '5', '.']
is_alpha   [True, True, False, False, False]
is_punct   [False, False, False, False, True]
like_num   [False, False, False, True, False]


In [8]:
# Import the German language class
from spacy.lang.de import German

# Create the nlp object
nlp = German()

# Process a text (this is German for: "Kind regards!")
doc = nlp("Liebe Grüße!")

# Print the document text
print(doc.text)

Liebe Grüße!


In [9]:
# Switch back to the English class (nlp was the German one in the previous cell)
nlp = English()

# Process the text
doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. Now less than 4% are.")

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i+1]
        # Check if the next token's text equals '%'
        if next_token.text == '%':
            print('Percentage found:', token.text)

Percentage found: 60
Percentage found: 4


<br>

### Statistical Models

Statistical models enable `spaCy` to predict linguistic attributes in context:  

* POS tags
* syntactic dependencies
* named entities

Models are trained on labeled example texts and can be updated with more examples to fine-tune their predictions.  

<br>

In [13]:
import spacy
# load the small english model
nlp = spacy.load( 'en_core_web_sm' )
#process the text
doc = nlp( 'She ate the pizza' )
#iterate over the tokens
for token in doc:
    print( token.text, token.pos_, token.dep_, token.head.text )

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


In [14]:
doc = nlp( u"Apple is looking at buying U.K. startup for $1 billion" )
for ent in doc.ents: 
    print( ent.text, ent.label_ )

Apple ORG
U.K. GPE
$1 billion MONEY


In [17]:
# for some help
print( spacy.explain( 'GPE' ) )
print( spacy.explain( 'NNP' ) )
print( spacy.explain( 'dobj' ) )

Countries, cities, states
noun, proper singular
direct object


In [18]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp( text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print('{:<12}{:<10}{:<10}'.format(token_text, token_pos, token_dep))

It          PRON      dep       
’s          INTJ      intj      
official    ADJ       amod      
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          AUX       ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


In [19]:
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp( text )

# Iterate over the predicted entities
for ent in doc.ents:
    # print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


<br>

### Rule-based Matching

**match patterns** - lists of dictionaries, one dictionary per token; the keys are token attributes and the values are the expected values
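
To make the "one dictionary per token" idea concrete, here is a minimal pure-Python sketch of how such a pattern lines up with a token sequence. It is not how `spaCy` implements matching: `describe` and `toy_match` are hypothetical helpers, tokens are plain dicts produced by a whitespace split, and only exact attribute equality is supported (no operators).

```python
# Toy illustration only: tokens are plain dicts instead of spaCy Token objects.
def describe(text):
    """Build a token dict with a few attributes named after spaCy's pattern keys."""
    return {"ORTH": text, "LOWER": text.lower(), "IS_DIGIT": text.isdigit()}

def toy_match(pattern, tokens):
    """Return (start, end) pairs where every pattern dict matches one token."""
    matches = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        # Each pattern dict is compared with the token at the same position.
        if all(
            all(token[key] == value for key, value in spec.items())
            for spec, token in zip(pattern, window)
        ):
            matches.append((start, start + width))
    return matches

tokens = [describe(t) for t in "New iPhone X release date leaked".split()]
pattern = [{"ORTH": "iPhone"}, {"ORTH": "X"}]
print(toy_match(pattern, tokens))  # [(1, 3)]
```

The `(start, end)` pair plays the same role as the token offsets returned by the real `Matcher`: slicing the token list with it gives the matched span.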

<br>

In [23]:
from spacy.matcher import Matcher
nlp = spacy.load( 'en_core_web_sm' )
matcher = Matcher( nlp.vocab)
#add pattern to the matcher
pattern = [ { 'ORTH':'iPhone' }, { 'ORTH':'X' } ]
matcher.add( 'IPHONE_PATTERN', [ pattern ] )
#return matches on a doc
doc = nlp( 'New iPhone X release date leaked' )
matches = matcher( doc )
matches

[(9528407286733565721, 1, 3)]

In [24]:
for match_id, start, end in matches:
    # iterate over the matches and create a Span object for each
    matched_span = doc[ start:end ]
    print( matched_span.text )

iPhone X


In [26]:
# lexical matches
pattern = [
    { 'IS_DIGIT': True },
    { 'LOWER':'fifa' },
    { 'LOWER':'world' },
    { 'LOWER':'cup' },
    { 'IS_PUNCT': True }
]

doc = nlp( '2018 FIFA World Cup: France won!')
matcher.add( 'FIFA_PATTERN', [pattern] )
matches = matcher( doc )
matches

[(17311505950452258848, 0, 5)]

In [27]:
pattern = [
    { 'LEMMA': 'love', 'POS': 'VERB' },
    { 'POS': 'NOUN' }
]

doc = nlp( 'I loved dogs but now I love cats more' )
matcher.add( 'LOVE', [pattern] )
matches = matcher( doc )
matches

[(18437031736592595799, 1, 3), (18437031736592595799, 6, 8)]

<br>

Using operators and quantifiers

| Operator |    Description   |
|:-----------:|:----------------------------:|
| {'OP': '!'} |    Negation: match 0 times   |
| {'OP': '?'} | Optional: match 0 or 1 times |
| {'OP': '+'} | Match 1 or more times        |
| {'OP': '*'} | Match 0 or more times        |
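
The `?`, `+` and `*` operators read much like the regular-expression quantifiers of the same name, applied to whole tokens instead of characters. The sketch below uses Python's `re` module on a plain string purely as an analogy; it does not call `spaCy`, and the example sentence is made up.

```python
import re

text = "I am very very happy and she is happy"

# 'very' one or more times, then 'happy'  (compare {'LOWER': 'very', 'OP': '+'})
print(re.findall(r"(?:very )+happy", text))   # ['very very happy']
# 'very' zero or more times, then 'happy' (compare 'OP': '*')
print(re.findall(r"(?:very )*happy", text))   # ['very very happy', 'happy']
# an optional single 'very', then 'happy' (compare 'OP': '?')
print(re.findall(r"(?:very )?happy", text))   # ['very happy', 'happy']
```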

<br>

In [29]:
doc = nlp("After making the iOS update you won't notice a radical system-wide redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of iOS 11's furniture remains the same as in iOS 10. But you will discover some tweaks once you delve a little deeper.")

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{'TEXT': 'iOS'}, {'IS_DIGIT': True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('IOS_VERSION_PATTERN', [pattern])
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


In [31]:
doc = nlp("i downloaded Fortnite on my laptop and can't open the game at all. Help? so when I was downloading Minecraft, I got the Windows version where it is the '.zip' folder and I used the default program to unpack it... do I also need to download Winzip?")

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{'LEMMA': 'download'}, {'POS': 'PROPN'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('DOWNLOAD_THINGS_PATTERN', [pattern] )
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


In [32]:
doc = nlp("Features of the app include a beautiful design, smart search, automatic labels and optional voice responses.")

# Write a pattern for adjective plus one or two nouns
pattern = [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('ADJ_NOUN_PATTERN', [pattern])
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses


<br>

## Large-scale Data Analysis with `spaCy`

### Data Structures

**`Vocab`** - stores data shared across multiple documents  
`spaCy` encodes strings as **hash values**  
each string is stored only once, in the `StringStore`, available as `nlp.vocab.strings`  
**`StringStore`** - a bidirectional lookup table: string to hash and hash to string  

    coffee_hash = nlp.vocab.strings['coffee']
    coffee_string = nlp.vocab.strings[coffee_hash]
    
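However, a hash cannot be reversed on its own: the lookup only works if the string is already in the `StringStore` (that's why we need to provide the shared vocab)  

    # This raises an error if 'coffee' has not been added to the vocab yet:
    string = nlp.vocab.strings[3197928453018144401]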
    
<br>
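
The asymmetry between the two lookup directions can be sketched in plain Python. `ToyStringStore` is a hypothetical stand-in, not `spaCy`'s `StringStore`, and it uses a truncated SHA-256 digest as an assumed hash function (`spaCy` uses a different one): a string can always be hashed, but a hash can only be turned back into a string that was stored earlier.

```python
import hashlib

class ToyStringStore:
    """Hypothetical stand-in for a string store: maps strings to hashes and back."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def _hash(text):
        # Any deterministic hash works for the sketch; spaCy uses a different one.
        digest = hashlib.sha256(text.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, text):
        key = self._hash(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            return self._hash(item)   # string -> hash can always be computed
        return self._by_hash[item]    # hash -> string needs a stored entry

store = ToyStringStore()
coffee_hash = store.add("coffee")
print(store[coffee_hash])            # coffee
print(store["tea"] == store["tea"])  # True: hashing is deterministic

try:
    store[store["tea"]]              # "tea" was never added to the store
except KeyError:
    print("unknown hash: the string was never stored")
```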

In [33]:
doc = nlp( 'I love coffee' )
print( 'hash value: ', nlp.vocab.strings['coffee'] )
print( 'string value: ', nlp.vocab.strings[3197928453018144401] )

hash value:  3197928453018144401
string value:  coffee


In [34]:
print( 'hash value: ', doc.vocab.strings['coffee'] )

hash value:  3197928453018144401


<br>

**Lexemes** - entries in the vocabulary.  

`Lexeme` objects are entries in the `Vocab` and contain context-independent information about a word  

<br>

In [35]:
lexeme = nlp.vocab['coffee']
print( lexeme.text, lexeme.orth, lexeme.is_alpha )

coffee 3197928453018144401 True


<br>

### Data Structures: Doc, Span and Token

Best Practices:

* `Doc` and `Span` are very powerful and hold references and relationships of words and sentences
    * Convert results to strings as late as possible
    * Use token attributes if available (e.g. `token.i` for the token index)
* Don't forget to pass the shared `vocab`

<br>

In [37]:
# The Doc Object

# create an nlp object
from spacy.lang.en import English
nlp = English()
#import the Doc class
from spacy.tokens import Doc
words = ['Hello', 'world', '!']
spaces = [True, False, False]
#create a doc object manually
doc = Doc( nlp.vocab, words=words, spaces=spaces )
doc

Hello world!

In [42]:
# Import the Doc class
from spacy.tokens import Doc

# Desired text: "spaCy is cool!"
words = ['spaCy', 'is', 'cool', '!']
spaces = [True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

spaCy is cool!


In [41]:
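# a span is a slice of a doc object
from spacy.tokens import Doc, Span
# recreate the "Hello world!" doc so this cell does not depend on execution order
doc = Doc( nlp.vocab, words=['Hello', 'world', '!'], spaces=[True, False, False] )
# create a span manually
span = Span( doc, 0, 2 )
print( span )
# create a span with a label
lspan = Span( doc, 0, 2, label = 'GREETING' )
print( lspan )
# add a span to a doc's entities
doc.ents = [lspan]
doc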

Hello world
Hello world


Hello world!

In [43]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=['I', 'like', 'David', 'Bowie'], spaces=[True, True, True, False])

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label='PERSON')

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

[('David Bowie', 'PERSON')]


<br>

### Word Vectors and Semantic Similarity

Comparing Semantic Similarity:  

* `spaCy` can compare the similarity of two objects
    - `Doc.similarity()`
    - `Span.similarity()`
    - `Token.similarity()`
* returns a similarity score, generally between 0 and 1 (higher means more similar)
* **Necessary** - a larger model that includes word vectors (the small `en_core_web_sm` model does not ship with them):
    - `en_core_web_md` medium English model
    - `en_core_web_lg` large English model

<br>

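I needed to run this (the last line downloads the medium model used in the cells below):  

    pip3 install spacy
    python3 -m spacy download en_core_web_sm
    python3 -m spacy download en_core_web_md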
    
<br>

In [46]:
# load one of the larger english models
# comparing documents
nlp = spacy.load( 'en_core_web_md' )
doc1 = nlp( 'I like fast food' )
doc2 = nlp( 'I like pizza' )
doc3 = nlp( 'The dog is asleep on the couch' )
print( doc1.similarity( doc2 ) )
print( doc1.similarity( doc3 ) )

0.8627204117787385
0.644411562288608


In [47]:
# comparing tokens
doc = nlp( 'Vultures love pizza and pasta' )
token1 = doc[2]
token2 = doc[4]
token3 = doc[0]
print( token1.similarity(token2) )
print( token1.similarity(token3) )

0.73695457
0.110321715


In [48]:
# comparing documents & tokens
doc = nlp( 'I like science fiction' )
token = nlp( 'soap' )[0]
print( doc.similarity(token) )

0.31157307324660816


In [49]:
# comparing docs & spans
span = nlp( 'I like pizza and pasta' )[2:5]
doc = nlp( 'McDonals sells burgers' )
doc2 = nlp( 'Luigis sells pasta' )
print( span.similarity( doc ) )
print( span.similarity( doc2 ) )

0.6199092090831612
0.7486420021584955


<br>

But how does `spaCy` predict similarity?

* similarity is determined using **word vectors**
* **word vectors** - multi-dimensional representations of word meaning
* generated using algorithms like [Word2Vec](https://en.wikipedia.org/wiki/Word2vec) trained on large amounts of text
* word vectors can be added to `spaCy`'s statistical models
* by default, similarity is computed as cosine similarity, but other measures can be used
* for multi-token objects (`Doc`s and `Span`s) the vector defaults to the average of the individual token vectors. As a result, short phrases give more useful similarities than long documents, whose averages drift toward a common mean
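
The two ingredients named above, cosine similarity and averaging token vectors, can be sketched without `spaCy`. The three-dimensional vectors below are invented for illustration (real models use hundreds of dimensions), and `cosine` and `average` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average(vectors):
    """Element-wise mean, used here as the vector of a multi-token object."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Made-up 3-dimensional vectors; real models use hundreds of dimensions.
toy_vectors = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.1],
    "dog":   [0.0, 0.9, 0.4],
}

# Related words point in a similar direction, so their cosine is higher.
print(cosine(toy_vectors["pizza"], toy_vectors["pasta"]) >
      cosine(toy_vectors["pizza"], toy_vectors["dog"]))   # True

# A two-token "span" gets the mean of its token vectors.
phrase = average([toy_vectors["pizza"], toy_vectors["pasta"]])
print(phrase)
```

Because every extra token pulls the mean toward the middle of the vector space, averages over long texts tend to look alike, which is the effect described in the last bullet.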

<br>

In [53]:
# examine a word vector
doc = nlp( 'I am learning to code python' )
# access the vector via the token.vector attribute
print( len( doc[ 5 ].vector ) )
print( doc[ 5 ].vector[0:10] )

300
[ 0.035414 -0.4573    0.42617   0.23448   0.18446   0.78676   0.15513
 -0.41701   0.36996  -0.25015 ]


<br>

Similarity depends on context.  
Similarity measures can be very useful in some NLP tasks, such as recommendation systems that suggest related content or flagging duplicate posts on social media platforms.  
However, there is no objective definition of similarity, so no single measure fits every task.  

<br>

### Combining Models & Rules  

|                         |                   **Statistical Models**                   | **Rule-based Systems**                                 |
|:-----------------------:|:----------------------------------------------------------:|--------------------------------------------------------|
| **Use Cases**           | application needs to generalize based on examples          | dictionary with finite number of examples              |
| **Real World Examples** | product names, person names, subject/object relationships  | countries of the world, cities, drug names, dog breeds |
| **spaCy Features**      | entity recognizer, dependency parser, part-of-speech tagger | tokenizer, Matcher,  PhraseMatcher                     |

In [58]:
# recap: rule-based matching
from spacy.matcher import Matcher
matcher = Matcher( nlp.vocab )
# patterns are lists of dictionaries describing the tokens
pattern = [{'LEMMA': 'love', 'POS':'VERB'}, {'LOWER':'cats'}]
matcher.add( 'LOVE_CATS', [pattern] )
# operators can specify how often a token should be matched
pattern = [{'LOWER': 'very', 'OP':'+'}, {'LOWER':'happy'}]
matcher.add( 'VERY_HAPPY', [pattern] )
doc = nlp( "I love cats and I'm very very happy" )
# print the pattern name and matched text for every match
for match_id, start, end in matcher( doc ):
    print( nlp.vocab.strings[match_id], doc[start:end].text )

In [61]:
matcher = Matcher( nlp.vocab )
matcher.add( 'DOG', [[{'LOWER':'golden'},{'LOWER':'retriever'}]])
doc = nlp( 'I have a Golden Retriever' )
for match_id, start, end in matcher( doc ):
    span = doc[start:end]
    print( 'Matched span: ', span.text )
    print( 'Root token: ', span.root.text )
    print( 'Root head token: ', span.root.head.text )
    print( 'Previous token: ', doc[ start-1 ].text, doc[ start-1 ].pos_ )

Matched span:  Golden Retriever
Root token:  Retriever
Root head token:  have
Previous token:  a DET


In [65]:
message = """Twitch Prime, the perks program for Amazon Prime members offering free loot, games and other benefits, 
is ditching one of its best features: ad-free viewing. According to an email sent out to Amazon Prime members 
today, ad-free viewing will no longer be included as a part of Twitch Prime for new members, beginning on 
September 14. However, members with existing annual subscriptions will be able to continue to enjoy ad-free 
viewing until their subscription comes up for renewal. Those with monthly subscriptions will have access to 
ad-free viewing until October 15."""

doc = nlp( message )

# Create the match patterns
pattern1 = [{'LOWER': 'amazon'}, {'IS_TITLE': True, 'POS': 'PROPN'}]
pattern2 = [{"TEXT": "ad"},{"TEXT": "-"},{"TEXT": "free"}, {"POS": "NOUN"}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add('PATTERN1', [pattern1])
matcher.add('PATTERN2', [pattern2])

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing


<br>

Phrase Matching with `PhraseMatcher`:  

* similar to regular expressions or keyword search, but operates on tokens
* takes `Doc` objects as patterns
* well suited to matching against large word lists: faster and more efficient than `Matcher`
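
A rough pure-Python sketch of why phrase lookup scales well to large lists: if the phrases are stored as token tuples in sets, checking a window of tokens is a single set lookup regardless of how many phrases there are. This is only an illustration of the idea, not `spaCy`'s actual algorithm; `toy_phrase_match` is a hypothetical helper and a whitespace split stands in for the tokenizer.

```python
def toy_phrase_match(phrases, tokens):
    """Find every (start, end) whose token slice equals one of the phrases."""
    # Group the phrase token tuples by length so each window is one set lookup.
    by_length = {}
    for phrase in phrases:
        words = tuple(phrase.split())
        by_length.setdefault(len(words), set()).add(words)

    matches = []
    for start in range(len(tokens)):
        for length, wanted in by_length.items():
            end = start + length
            if end <= len(tokens) and tuple(tokens[start:end]) in wanted:
                matches.append((start, end))
    return matches

tokens = "Czech Republic may help Slovakia protect its airspace".split()
countries = ["Czech Republic", "Slovakia", "South Africa"]
for start, end in toy_phrase_match(countries, tokens):
    print(" ".join(tokens[start:end]))
```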

<br>

In [62]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher( nlp.vocab )
pattern = nlp( "Golden Retriever" )
matcher.add( 'DOG', [pattern] )

doc = nlp( 'I have a Golden Retriever' )

#iterate over the matches

for match_id, start, end in matcher( doc ):
    span = doc[ start:end ]
    print( 'Matched span: ', span.text )

Matched span:  Golden Retriever


In [69]:
text = 'Czech Republic may help Slovakia protect its airspace'
doc = nlp( text )

COUNTRIES = ['Afghanistan','Åland Islands','Albania','Algeria','American Samoa','Andorra','Angola','Anguilla',
 'Antarctica','Antigua and Barbuda','Argentina','Armenia','Aruba','Australia','Austria','Azerbaijan','Bahamas',
 'Bahrain','Bangladesh','Barbados','Belarus','Belgium','Belize','Benin','Bermuda','Bhutan','Bolivia (Plurinational State of)',
 'Bonaire, Sint Eustatius and Saba','Bosnia and Herzegovina','Botswana','Bouvet Island','Brazil','British Indian Ocean Territory',
 'United States Minor Outlying Islands','Virgin Islands (British)','Virgin Islands (U.S.)','Brunei Darussalam',
 'Bulgaria','Burkina Faso','Burundi','Cambodia','Cameroon','Canada','Cabo Verde','Cayman Islands','Central African Republic',
 'Chad','Chile','China','Christmas Island','Cocos (Keeling) Islands','Colombia','Comoros','Congo','Congo (Democratic Republic of the)',
 'Cook Islands','Costa Rica','Croatia','Cuba','Curaçao','Cyprus','Czech Republic','Denmark','Djibouti','Dominica',
 'Dominican Republic','Ecuador','Egypt','El Salvador','Equatorial Guinea','Eritrea','Estonia','Ethiopia','Falkland Islands (Malvinas)',
 'Faroe Islands','Fiji','Finland','France','French Guiana','French Polynesia','French Southern Territories','Gabon',
 'Gambia','Georgia','Germany','Ghana','Gibraltar','Greece','Greenland','Grenada','Guadeloupe','Guam','Guatemala',
 'Guernsey','Guinea','Guinea-Bissau','Guyana','Haiti','Heard Island and McDonald Islands','Holy See','Honduras',
 'Hong Kong','Hungary','Iceland','India','Indonesia',"Côte d'Ivoire",'Iran (Islamic Republic of)','Iraq','Ireland',
 'Isle of Man','Israel','Italy','Jamaica','Japan','Jersey','Jordan','Kazakhstan','Kenya','Kiribati','Kuwait','Kyrgyzstan',
 "Lao People's Democratic Republic",'Latvia','Lebanon','Lesotho','Liberia','Libya','Liechtenstein','Lithuania',
 'Luxembourg','Macao','Macedonia (the former Yugoslav Republic of)','Madagascar','Malawi','Malaysia','Maldives',
 'Mali','Malta','Marshall Islands','Martinique','Mauritania','Mauritius','Mayotte','Mexico','Micronesia (Federated States of)',
 'Moldova (Republic of)','Monaco','Mongolia','Montenegro','Montserrat','Morocco','Mozambique','Myanmar','Namibia',
 'Nauru','Nepal','Netherlands','New Caledonia','New Zealand','Nicaragua','Niger','Nigeria','Niue','Norfolk Island',
 "Korea (Democratic People's Republic of)",'Northern Mariana Islands','Norway','Oman','Pakistan','Palau','Palestine, State of',
 'Panama','Papua New Guinea','Paraguay','Peru','Philippines','Pitcairn','Poland','Portugal','Puerto Rico','Qatar',
 'Republic of Kosovo','Réunion','Romania','Russian Federation','Rwanda','Saint Barthélemy','Saint Helena, Ascension and Tristan da Cunha',
 'Saint Kitts and Nevis','Saint Lucia','Saint Martin (French part)','Saint Pierre and Miquelon','Saint Vincent and the Grenadines',
 'Samoa','San Marino','Sao Tome and Principe','Saudi Arabia','Senegal','Serbia','Seychelles','Sierra Leone','Singapore',
 'Sint Maarten (Dutch part)','Slovakia','Slovenia','Solomon Islands','Somalia','South Africa','South Georgia and the South Sandwich Islands',
 'Korea (Republic of)','South Sudan','Spain','Sri Lanka','Sudan','Suriname','Svalbard and Jan Mayen','Swaziland',
 'Sweden','Switzerland','Syrian Arab Republic','Taiwan','Tajikistan','Tanzania, United Republic of','Thailand','Timor-Leste',
 'Togo','Tokelau','Tonga','Trinidad and Tobago','Tunisia','Turkey','Turkmenistan','Turks and Caicos Islands','Tuvalu',
 'Uganda','Ukraine','United Arab Emirates','United Kingdom of Great Britain and Northern Ireland',
 'United States of America','Uruguay','Uzbekistan','Vanuatu','Venezuela (Bolivarian Republic of)','Viet Nam',
 'Wallis and Futuna','Western Sahara','Yemen','Zambia','Zimbabwe']

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher( nlp.vocab )

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES))
matcher.add('COUNTRY', patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

[Czech Republic, Slovakia]


In [72]:
text = """After the Cold War, the UN saw a radical expansion in its peacekeeping duties, taking on more missions in 
ten years than it had in the previous four decades.Between 1988 and 2000, the number of adopted Security Council 
resolutions more than doubled, and the peacekeeping budget increased more than tenfold. The UN negotiated an end 
to the Salvadoran Civil War, launched a successful peacekeeping mission in Namibia, and oversaw democratic 
elections in post-apartheid South Africa and post-Khmer Rouge Cambodia. In 1991, the UN authorized a US-led 
coalition that repulsed the Iraqi invasion of Kuwait. Brian Urquhart, Under-Secretary-General from 1971 to 1985, 
later described the hopes raised by these successes as a "false renaissance" for the organization, given the more 
troubled missions that followed. Though the UN Charter had been written primarily to prevent aggression by one 
nation against another, in the early 1990s the UN faced a number of simultaneous, serious crises within nations 
such as Somalia, Haiti, Mozambique, and the former Yugoslavia. The UN mission in Somalia was widely 
viewed as a failure after the US withdrawal following casualties in the Battle of Mogadishu, and the UN mission 
to Bosnia faced "worldwide ridicule" for its indecisive and confused mission in the face of ethnic cleansing. 
In 1994, the UN Assistance Mission for Rwanda failed to intervene in the Rwandan genocide amid indecision in 
the Security Council. Beginning in the last decades of the Cold War, American and European critics of the UN 
condemned the organization for perceived mismanagement and corruption. In 1984, the US President, Ronald Reagan, 
withdrew his nation\'s funding from UNESCO (the United Nations Educational, Scientific and Cultural Organization, 
founded 1946) over allegations of mismanagement, followed by Britain and Singapore. Boutros Boutros-Ghali, 
Secretary-General from 1992 to 1996, initiated a reform of the Secretariat, reducing the size of the organization 
somewhat. His successor, Kofi Annan (1997–2006), initiated further management 
reforms in the face of threats from the United States to withhold its UN dues. In the late 1990s and 2000s, 
international interventions authorized by the UN took a wider variety of forms. The UN mission in the Sierra 
Leone Civil War of 1991–2002 was supplemented by British Royal Marines, and the invasion of Afghanistan in 2001 
was overseen by NATO. In 2003, the United States invaded Iraq despite failing to pass a UN Security Council 
resolution for authorization, prompting a new round of questioning of the organization\'s effectiveness. Under 
the eighth Secretary-General, Ban Ki-moon, the UN has intervened with peacekeepers in crises including the War in 
Darfur in Sudan and the Kivu conflict in the Democratic Republic of Congo and sent observers and chemical weapons
inspectors to the Syrian Civil War. In 2013, an internal review of UN actions in the final battles of the Sri 
Lankan Civil War in 2009 concluded that the organization had suffered "systemic failure". One hundred and one 
UN personnel died in the 2010 Haiti earthquake, the worst loss of life in the organization\'s history. The 
Millennium Summit was held in 2000 to discuss the UN\'s role in the 21st century. The three day meeting was the 
largest gathering of world leaders in history, and culminated in the adoption by all member states of the 
Millennium Development Goals (MDGs), a commitment to achieve international development in areas such as poverty 
reduction, gender equality, and public health. Progress towards these goals, which were to be met by 2015, was 
ultimately uneven. The 2005 World Summit reaffirmed the UN\'s focus on promoting development, peacekeeping, 
human rights, and global security. The Sustainable Development Goals were launched in 2015 to succeed the 
Millennium Development Goals. In addition to addressing global challenges, the UN has sought to improve its 
accountability and democratic legitimacy by engaging more with civil society and fostering a global constituency. 
In an effort to enhance transparency, in 2016 the organization held its first public 
debate between candidates for Secretary-General. On 1 January 2017, Portuguese diplomat António Guterres, who 
previously served as UN High Commissioner for Refugees, became the ninth Secretary-General. Guterres has 
highlighted several key goals for his administration, including an emphasis on diplomacy for preventing conflicts, 
more effective peacekeeping efforts, and streamlining the organization to be more responsive and versatile to 
global needs."""

# Create a doc and find matches in it
doc = nlp( text )

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE"
    span = Span(doc, start, end, label="GPE")
    print( span.text )

    # Overwrite the doc.ents and add the span
    #doc.ents = list(doc.ents) + [span]

# Print the entities in the document
#print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == 'GPE'])

# Create a doc and find matches in it
doc = nlp(text)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE" and overwrite the doc.ents
    span = Span(doc, start, end, label='GPE')
    #doc.ents = list(doc.ents) + [span]
    
    # Get the span's root head token
    span_root_head = span.root.head
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, '-->', span.text)

Namibia
South Africa
Cambodia
Kuwait
Somalia
Haiti
Mozambique
Somalia
Rwanda
Singapore
Afghanistan
Iraq
Sudan
Congo
Haiti
in --> Namibia
in --> South Africa
Africa --> Cambodia
of --> Kuwait
as --> Somalia
Somalia --> Haiti
Haiti --> Mozambique
in --> Somalia
for --> Rwanda
Britain --> Singapore
of --> Afghanistan
invaded --> Iraq
in --> Sudan
of --> Congo
earthquake --> Haiti


<br>

## `spaCy`'s Processing Pipeline

### Processing Pipelines

**pipeline** - a series of functions applied to a `Doc` to add attributes  
ex: calling `nlp()` runs the pipeline: text $\rightarrow$ tokenizer $\rightarrow$ tagger $\rightarrow$ parser $\rightarrow$ ner $\rightarrow$ ... $\rightarrow$ `Doc`  
The tokenizer turns a string into a `Doc` object. `spaCy` then applies every component in the pipeline to the `Doc`, in order.


|   **Name**  |     **Description**     | **Creates**                                               |
|:-----------:|:-----------------------:|-----------------------------------------------------------|
| **tagger**  | Part-of-speech tagger   | `Token.tag`                                               |
| **parser**  | Dependency parser       | `Token.dep`, `Token.head`, `Doc.sents`, `Doc.noun_chunks` |
| **ner**     | Named entity recognizer | `Doc.ents`, `Token.ent_iob`, `Token.ent_type`             |
| **textcat** | Text classifier         | `Doc.cats`                                                |
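
The control flow described above (tokenize once, then pass the same object through each component in order) can be sketched in a few lines of plain Python. Everything here is a toy: `ToyDoc`, `toy_tagger` and `toy_ner` are hypothetical names, and the "tagging" rule is a made-up capitalisation check rather than a statistical model.

```python
class ToyDoc:
    """Hypothetical container standing in for a spaCy Doc."""
    def __init__(self, words):
        self.words = words
        self.tags = None
        self.ents = []

def toy_tokenizer(text):
    # The tokenizer is the only step that turns a string into a doc.
    return ToyDoc(text.split())

def toy_tagger(doc):
    # Fake "tagging": mark capitalised words; a real tagger predicts tags.
    doc.tags = ["CAP" if w[0].isupper() else "LOW" for w in doc.words]
    return doc

def toy_ner(doc):
    # Relies on doc.tags, so it has to run after toy_tagger.
    doc.ents = [w for w, t in zip(doc.words, doc.tags) if t == "CAP"]
    return doc

pipeline = [("tagger", toy_tagger), ("ner", toy_ner)]

def toy_nlp(text):
    doc = toy_tokenizer(text)
    for name, component in pipeline:
        doc = component(doc)   # each component receives and returns the doc
    return doc

doc = toy_nlp("she met Ada in London")
print([name for name, _ in pipeline])  # ['tagger', 'ner']
print(doc.ents)                        # ['Ada', 'London']
```

The list of `(name, function)` tuples mirrors the shape of `nlp.pipeline`, and the order dependency between the two toy components shows why component order matters.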

<br>

In [74]:
# use the nlp.pipe_names for a list of pipeline component names
print( nlp.pipe_names )

# for a list of component names and component function tuples
print( nlp.pipeline )

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x7f52a0b917d0>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x7f52a22b4170>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7f529e3ff830>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x7f529fa26960>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7f529e596fa0>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x7f52a22a4e50>)]


<br>

### Custom Pipeline Components

**custom pipeline components** - let you add your own function to the `spaCy` pipeline; it is executed when you call `nlp()` on some text  

* components are functions that take a `doc`, modify it and return it
* in `spaCy` v3 they are registered with the `@Language.component` decorator and added by name using the `nlp.add_pipe` method  

for example:

    from spacy.language import Language

    @Language.component( 'custom_component' )
    def custom_component( doc ):
        # do something to the doc here
        return doc
    nlp.add_pipe( 'custom_component' )
    
    
| **Argument** |    **Description**   | **Example**                                  |
|:------------:|:--------------------:|----------------------------------------------|
| **`last`**   | If `True`, add last  | `nlp.add_pipe( 'component', last=True )`     |
| **`first`**  | If `True`, add first | `nlp.add_pipe( 'component', first=True )`    |
| **`before`** | Add before component | `nlp.add_pipe( 'component', before='ner')`   |
| **`after`**  | Add after component  | `nlp.add_pipe( 'component', after='tagger')` |
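
What the four positioning arguments do to the order of components can be sketched on a plain list of names. `toy_add_pipe` is a hypothetical helper written for this illustration; it only models the ordering and does not touch a real `nlp` object.

```python
def toy_add_pipe(names, new_name, first=False, before=None, after=None):
    """Return a new list of component names with new_name inserted."""
    if first:
        index = 0
    elif before is not None:
        index = names.index(before)
    elif after is not None:
        index = names.index(after) + 1
    else:
        index = len(names)   # default: append last
    return names[:index] + [new_name] + names[index:]

names = ["tagger", "parser", "ner"]
print(toy_add_pipe(names, "custom"))                  # ['tagger', 'parser', 'ner', 'custom']
print(toy_add_pipe(names, "custom", first=True))      # ['custom', 'tagger', 'parser', 'ner']
print(toy_add_pipe(names, "custom", before="ner"))    # ['tagger', 'parser', 'custom', 'ner']
print(toy_add_pipe(names, "custom", after="tagger"))  # ['tagger', 'custom', 'parser', 'ner']
```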
    
<br>

In [79]:
from spacy.language import Language
# create an nlp object
nlp = spacy.load( 'en_core_web_sm' )
# define a custom component
@Language.component("info_component")
def custom_component( doc ):
    # print the docs length
    print( 'Doc length: ', len( doc ) )
    # return the doc object
    return doc
# add the component last in the pipeline
nlp.add_pipe("info_component", name="print_info", last=True)
# print the pipeline component names
print( 'Pipeline: ', nlp.pipe_names )

Pipeline:  ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'print_info']


In [80]:
# process some text
doc = nlp( 'I like coffee' )

Doc length:  3


In [90]:
from spacy.matcher import PhraseMatcher
nlp = spacy.load( 'en_core_web_sm' )
# Build a PhraseMatcher for the animals the component should label
animals = ['Golden Retriever', 'cat']
matcher = PhraseMatcher( nlp.vocab )
matcher.add( 'ANIMAL', list( nlp.pipe( animals ) ) )

# Define the custom component (registered under its own, unused name)
@Language.component("animal_component")
def animal_component(doc):
    # Create a Span for each match and assign the label 'ANIMAL'
    # and overwrite the doc.ents with the matched spans
    doc.ents = [Span(doc, start, end, label='ANIMAL')
                for match_id, start, end in matcher(doc)]
    return doc
    
# Add the component to the pipeline after the 'ner' component 
nlp.add_pipe("animal_component", after='ner')
print( 'Pipeline: ', nlp.pipe_names )
# Process the text and print the text and label for the doc.ents
doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

Pipeline:  ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'animal_component']
[('cat', 'ANIMAL'), ('Golden Retriever', 'ANIMAL')]


In [89]:
doc.ents

(cat, Golden Retriever)

<br>

### Setting Custom Attributes

* add custom metadata to documents, tokens and spans
* accessible via the `._` property
* registered on the global `Doc`, `Token`, or `Span` using the `set_extension` method

examples:  

    doc._.title = 'My document'
    token._.is_color = True
    span._.has_color = False

Extension Attribute Types:  

1. Attribute extensions - set a default value that can be overwritten
2. Property extensions - define a getter and an optional setter function
    - the getter is only called when you retrieve the attribute value
3. Method extensions - assign a function that becomes available as an object method, which lets you pass arguments to the extension function

<br>

In [91]:
from spacy.tokens import Doc, Token, Span
Doc.set_extension( 'title', default=None )
Token.set_extension( 'is_color', default=False )
Span.set_extension( 'has_color', default=False )

In [92]:
#overwrite extension attribute value
doc = nlp( 'The sky is blue' )
doc[3]._.is_color = True

In [95]:
nlp = spacy.load( 'en_core_web_sm' )
def get_is_color( token ):
    colors = ['red','yellow','blue']
    return token.text in colors
doc = nlp( 'The sky is blue' )
# 'is_color' was already registered above with a default value,
# so force=True is needed to replace it with a getter
Token.set_extension( 'is_color', getter=get_is_color, force=True )
print( doc[3]._.is_color, '-', doc[3].text )

True - blue

In [96]:
# define a method with arguments
def has_token( doc, token_text ):
    # check whether any token in the doc has exactly this text
    in_doc = token_text in [ token.text for token in doc ]
    return in_doc
Doc.set_extension( 'has_token', method=has_token )
doc = nlp( "The sky is blue" )
print( doc._.has_token('blue'), '-blue' )

True -blue
