#### INTRODUCTION TO NLP IN PYTHON WITH SPACY

spaCy is a relatively new package for “Industrial strength NLP in Python”, developed by Matthew Honnibal at Explosion AI. It is designed with the applied data scientist in mind: it does not weigh the user down with decisions over which esoteric algorithm to use for common tasks, and it’s fast. Incredibly fast (it’s implemented in Cython). If you are familiar with the Python data science stack, spaCy is your numpy for NLP – it’s reasonably low-level, but very intuitive and performant.

**Installation**

spaCy can be installed with conda from the conda-forge channel:

conda install -c conda-forge spacy



Using pip, spaCy releases are currently only available as source packages.


pip install -U spacy 

After installation you need to download a language model. For more info and available models, see the docs on models.


python -m spacy download en

First, we load spaCy’s pipeline, which by convention is stored in a variable named nlp. Declaring this variable takes a couple of seconds, because spaCy loads its models and data up front to save time later. In effect, this gets some heavy lifting out of the way early, so that the cost is not incurred each time the nlp parser is applied to your data. Note that here I am using the English language model, but there is also a fully featured German model, and tokenisation (discussed below) is implemented across several languages.

We invoke nlp on the sample text to create a Doc object. The Doc object is now a vessel for NLP tasks on the text itself, slices of the text (Span objects) and elements (Token objects) of the text. It is worth noting that Token and Span objects actually hold no data. Instead they contain pointers to data contained in the Doc object and are evaluated lazily (i.e. upon request). Much of spaCy’s core functionality is accessed through the methods on Doc (n=33), Span (n=29) and Token (n=78) objects.

In [1]:
import spacy
nlp = spacy.load("en")
doc = nlp("The big grey dog ate all of the chocolate,but fortunately he wasn't sick!")


**Tokenization**

Tokenisation is a foundational step in many NLP tasks. Tokenising text is the process of splitting a piece of text into words, symbols, punctuation, spaces and other elements, thereby creating “tokens”. A naive way to do this is to simply split the string on white space:

In [2]:
doc.text.split()

['The',
 'big',
 'grey',
 'dog',
 'ate',
 'all',
 'of',
 'the',
 'chocolate,but',
 'fortunately',
 'he',
 "wasn't",
 'sick!']
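
The whitespace split leaves 'chocolate,but' and 'sick!' glued together. As a rough illustration of what a tokenizer has to do beyond splitting on spaces, here is a minimal regex sketch that separates punctuation from words. It is not spaCy's tokenizer (which, for instance, would also split "wasn't" into two tokens); the pattern and the `simple_tokenize` name are assumptions made for this example.

```python
import re

# Assumed pattern for illustration: a word (optionally with an apostrophe
# part, e.g. "wasn't"), or any single character that is neither a word
# character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)

text = "The big grey dog ate all of the chocolate,but fortunately he wasn't sick!"
tokens = simple_tokenize(text)
print(tokens)
print(len(text.split()), "whitespace chunks vs", len(tokens), "tokens")
```

Even this small change recovers the comma and the exclamation mark as separate tokens, which the plain `split()` cannot do.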

#### Linguistic annotations
spaCy provides a variety of linguistic annotations to give you insights into a text's grammatical structure. This includes the word types, like the parts of speech, and how the words are related to each other. For example, if you're analysing text, it makes a huge difference whether a noun is the subject of a sentence, or the object – or whether "google" is used as a verb, or refers to the website or company in a specific context.

In [28]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
    print(token.text, token.pos_, token.dep_)

Apple PROPN nsubj
is VERB aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj


#### Tokenization
During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas "U.K." should remain one token. Each Doc consists of individual tokens, and we can simply iterate over them:



In [6]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion
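
To make the idea of language-specific rules a little more concrete, the sketch below splits on whitespace, keeps known exceptions such as "U.K." intact, and peels prefix and suffix characters off everything else. The exception set, the prefix and suffix tuples and the `rule_tokenize` name are all assumptions for this illustration; spaCy's real rules are far more extensive.

```python
# Hypothetical rule tables, chosen only to cover the example sentence.
SPECIAL_CASES = {"U.K."}
PREFIXES = ("$", "(", '"')
SUFFIXES = (".", ",", "!", "?", ")", '"')

def rule_tokenize(text):
    """Whitespace split, then exceptions, then prefix/suffix peeling."""
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        # Stop peeling as soon as the chunk is empty or a known exception.
        while chunk and chunk not in SPECIAL_CASES:
            if chunk.startswith(PREFIXES):
                prefixes.append(chunk[0])
                chunk = chunk[1:]
            elif chunk.endswith(SUFFIXES):
                suffixes.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:
                break
        tokens.extend(prefixes)
        if chunk:
            tokens.append(chunk)
        tokens.extend(suffixes)
    return tokens

result = rule_tokenize("Apple is looking at buying U.K. startup for $1 billion")
print(result)
```

For this sentence the sketch yields the same eleven tokens as the spaCy output above, with "U.K." kept whole and "$1" split into "$" and "1".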


#### Part-of-speech tags and dependencies
After tokenization, spaCy can parse and tag a given Doc. This is where the statistical model comes in, which enables spaCy to make a prediction of which tag or label most likely applies in this context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalise across the language – for example, a word following "the" in English is most likely a noun.

In [14]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

Apple apple PROPN NNP nsubj Xxxxx True False
is be VERB VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. u.k. PROPN NNP compound X.X. False False
startup startup NOUN NN dobj xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False
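
The remark that a word following "the" is most likely a noun can be illustrated with a deliberately tiny counting model. This is only a sketch of the general idea of learning from examples: spaCy's statistical models are far more sophisticated, and the three hand-tagged sentences and the `predict_next_tag` helper below are invented for the illustration.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training examples (word, coarse tag).
training = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("big", "ADJ"), ("dog", "NOUN"), ("runs", "VERB")],
]

# Count which tag follows each word in the examples.
following = defaultdict(Counter)
for sentence in training:
    for (prev_word, _), (_, tag) in zip(sentence, sentence[1:]):
        following[prev_word][tag] += 1

def predict_next_tag(prev_word):
    """Return the tag seen most often after prev_word, or None if unseen."""
    counts = following.get(prev_word)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

print(predict_next_tag("the"))
print(predict_next_tag("zebra"))
```

With more examples the counts become more reliable, which is the sense in which a model "generalises" from the data it was shown.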


#### Named Entities 
A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. spaCy can recognise various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.

In [4]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
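
The `start_char` and `end_char` values printed above are character offsets into the original string, with the end offset exclusive, so they can be used directly as a Python slice. The snippet below checks this with plain string slicing; the offsets and labels are copied from the output above, and no spaCy call is involved.

```python
text = "Apple is looking at buying U.K. startup for $1 billion"

# (start_char, end_char, label) triples copied from the printed output.
entities = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

for start, end, label in entities:
    print(repr(text[start:end]), label)

spans = [text[start:end] for start, end, _ in entities]
```

This is handy when entity annotations need to be mapped back onto the raw text, for example to highlight them.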


#### Word vectors and similarity

spaCy is able to compare two objects, and make a prediction of how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates. For example, you can suggest a user content that's similar to what they're currently looking at, or label a support ticket as a duplicate if it's very similar to an already existing one.

In [34]:
import spacy

#nlp = spacy.load('en_core_web_md')  # make sure to use larger model!
#nlp = spacy.load('en_core_web_sm')
nlp = spacy.load('en_core_web_lg')
tokens = nlp(u'dog cat banana')

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))
        

dog dog 1.0
dog cat 0.801686
dog banana 0.243276
cat dog 0.801686
cat cat 1.0
cat banana 0.281544
banana dog 0.243276
banana cat 0.281544
banana banana 1.0
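
By default, the similarity scores above are cosine similarities between word vectors. As a sketch of what that computation looks like, here is a pure-Python version. The three-dimensional vectors are made-up values for illustration, not real word vectors, and `cosine_similarity` is a hypothetical helper rather than part of spaCy.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

# Invented toy vectors; real models use hundreds of dimensions.
vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

for word1, vec1 in vectors.items():
    for word2, vec2 in vectors.items():
        print(word1, word2, round(cosine_similarity(vec1, vec2), 3))
```

Because the measure depends only on the angle between vectors, a word compared with itself scores 1.0, and the matrix is symmetric, just as in the spaCy output.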


In [21]:
import spacy

#nlp = spacy.load('en_core_web_md')
nlp = spacy.load('en_core_web_lg')
tokens = nlp(u'dog cat banana afskfsd')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

dog True 7.03367 False
cat True 6.68082 False
banana True 6.70001 False
afskfsd False 0.0 True


#### Vocab, hashes and lexemes

Whenever possible, spaCy tries to store data in a vocabulary, the Vocab, that will be shared by multiple documents. To save memory, spaCy also encodes all strings to hash values – in this case for example, "coffee" has the hash 3197928453018144401. Entity labels like "ORG" and part-of-speech tags like "VERB" are also encoded. Internally, spaCy only "speaks" in hash values.

- Token: A word, punctuation mark etc. in context, including its attributes, tags and dependencies.
- Lexeme: A "word type" with no context. Includes the word shape and flags, e.g. if it's lowercase, a digit or punctuation.
- Doc: A processed container of tokens in context.
- Vocab: The collection of lexemes.
- StringStore: The dictionary mapping hash values to strings, for example 3197928453018144401 → "coffee".
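
The relationship between strings and hashes can be sketched with a toy two-way store. A string can always be hashed, but a hash can only be turned back into a string if that string was added to the store first. The `ToyStringStore` class and `hash_string` function are hypothetical, and the hash function is a stand-in from the standard library, so the numbers it produces will not match spaCy's.

```python
import hashlib

def hash_string(text):
    """Stand-in 64-bit hash (not spaCy's own hash function)."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

class ToyStringStore:
    """Hypothetical two-way lookup between strings and their hashes."""

    def __init__(self):
        self._by_hash = {}

    def add(self, text):
        key = hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            return hash_string(item)   # string -> hash always works
        return self._by_hash[item]     # hash -> string needs a prior add()

store = ToyStringStore()
coffee_hash = store.add("coffee")
print(coffee_hash, store[coffee_hash])

empty_store = ToyStringStore()
try:
    empty_store[coffee_hash]
except KeyError:
    print("unknown hash in a store that never saw the string")
```

This one-way nature of hashing is why a Doc must be paired with a vocabulary that already contains its strings.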


In [22]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I love coffee')
print(doc.vocab.strings[u'coffee'])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee'

3197928453018144401
coffee


In [23]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I love coffee')
for word in doc:
    lexeme = doc.vocab[word.text]
    print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
          lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)

I 4690420944186131903 X I I True False True en
love 3702023516439754181 xxxx l ove True False False en
coffee 3197928453018144401 xxxx c fee True False False en
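
The shape, prefix and suffix columns above are context-free features that can be computed from the string alone. The sketch below reproduces the values shown in this notebook: letters become "X" or "x", digits become "d", runs of the same symbol are cut off after four, the prefix is the first character and the suffix the last three. The helper names are hypothetical, and the rules are inferred from the printed output rather than taken from spaCy's source.

```python
def word_shape(text):
    """Map characters to X/x/d and truncate runs of one symbol after four."""
    shape = []
    last = None
    run = 0
    for ch in text:
        if ch.isalpha():
            symbol = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch  # punctuation and symbols are kept as-is
        run = run + 1 if symbol == last else 1
        last = symbol
        if run <= 4:
            shape.append(symbol)
    return "".join(shape)

def lexical_features(text):
    """Context-free features of a word type (hypothetical helper)."""
    return {"shape": word_shape(text), "prefix": text[:1], "suffix": text[-3:]}

for word in ["I", "love", "coffee", "Apple", "U.K."]:
    print(word, lexical_features(word))
```

Since none of these features depend on the surrounding sentence, they only need to be computed once per word type and can then be shared through the Vocab.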


In [33]:
import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I love coffee') # original Doc
print(doc.vocab.strings[u'coffee'])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee' 👍

empty_doc = Doc(Vocab())  # new Doc with empty Vocab
# empty_doc.vocab.strings[3197928453018144401] will raise an error :(

empty_doc.vocab.strings.add(u'coffee')  # add "coffee" and generate hash
print(empty_doc.vocab.strings[3197928453018144401])  # 'coffee' 👍

new_doc = Doc(doc.vocab)  # create new doc with first doc's vocab
print(new_doc.vocab.strings[3197928453018144401])  # 'coffee' 👍

3197928453018144401
coffee
coffee
coffee


#### Serialization
If you've been modifying the pipeline, vocabulary, vectors and entities, or made updates to the model, you'll eventually want to save your progress – for example, everything that's in your nlp object. This means you'll have to translate its contents and structure into a format that can be saved, like a file or a byte string. This process is called serialization. spaCy comes with built-in serialization methods and supports the Pickle protocol.

In [1]:
#text = open('customer_feedback_627.txt', 'r').read() # open a document
#doc = nlp(text) # process it
#doc.to_disk('/customer_feedback_627.bin') # save the processed Doc

If you need it again later, you can load it back into an empty Doc with an empty Vocab by calling from_disk():

In [2]:
#from spacy.tokens import Doc # to create empty Doc
#from spacy.vocab import Vocab # to create empty Vocab

#doc = Doc(Vocab()).from_disk('/customer_feedback_627.bin') # load processed Doc
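
Since the cells above are commented out, here is a small runnable sketch of the same round trip: an object is translated to a byte string and later loaded back into a fresh, empty object. `ToyDoc` is a hypothetical stand-in for a processed document, and JSON is used purely for readability; spaCy's own serialization format is different.

```python
import json

class ToyDoc:
    """Hypothetical stand-in for a processed document."""

    def __init__(self, words=None):
        self.words = list(words or [])

    def to_bytes(self):
        # Translate the object's contents into a byte string.
        return json.dumps({"words": self.words}).encode("utf8")

    def from_bytes(self, data):
        # Fill this (empty) object from a byte string and return it.
        self.words = json.loads(data.decode("utf8"))["words"]
        return self

original = ToyDoc(["I", "love", "coffee"])
payload = original.to_bytes()
restored = ToyDoc().from_bytes(payload)
print(restored.words)
```

Returning `self` from the loading method is what allows the create-then-load pattern to be written on one line, as in `Doc(Vocab()).from_disk(...)`.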


#### Install models and process text

In [5]:
#python -m spacy download en_core_web_sm
#python -m spacy download de_core_news_sm

In [6]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Hello, world. Here are two sentences.')
print([t.text for t in doc])

nlp_de = spacy.load('de_core_news_sm')
doc_de = nlp_de(u'Ich bin ein Berliner.')
print([t.text for t in doc_de])

['Hello', ',', 'world', '.', 'Here', 'are', 'two', 'sentences', '.']
['Ich', 'bin', 'ein', 'Berliner', '.']


#### Get tokens, noun chunks & sentences

In [7]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Peach emoji is where it has always been. Peach is the superior "
          u"emoji. It's outranking eggplant 🍑 ")
print(doc[0].text)          # Peach
print(doc[1].text)          # emoji
print(doc[-1].text)         # 🍑
print(doc[17:19].text)      # outranking eggplant

noun_chunks = list(doc.noun_chunks)
print(noun_chunks[0].text)  # Peach emoji

sentences = list(doc.sents)
assert len(sentences) == 3
print(sentences[1].text)    # 'Peach is the superior emoji.'

Peach
emoji
🍑
outranking eggplant
Peach emoji
Peach is the superior emoji.


#### Get part-of-speech tags and flags 


In [15]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
apple = doc[0]
print('Coarse-grained POS tag', apple.pos_, apple.pos)
print('Fine-grained POS tag', apple.tag_, apple.tag)
print('Word shape', apple.shape_, apple.shape)
print('Alphabetic characters?', apple.is_alpha)
print('Punctuation mark?', apple.is_punct)

billion = doc[10]
print('Digit?', billion.is_digit)
print('Like a number?', billion.like_num)
print('Like an email address?', billion.like_email)

Coarse-grained POS tag PROPN 95
Fine-grained POS tag NNP 15794550382381185553
Word shape Xxxxx 16072095006890171862
Alphabetic characters? True
Punctuation mark? False
Digit? False
Like a number? True
Like an email address? False


####  Use hash values for any string

In [32]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I love coffee')

coffee_hash = nlp.vocab.strings[u'coffee']  # 3197928453018144401
coffee_text = nlp.vocab.strings[coffee_hash]  # 'coffee'
print(coffee_hash, coffee_text)
print(doc[2].orth, coffee_hash)  # 3197928453018144401
print(doc[2].text, coffee_text)  # 'coffee'

beer_hash = doc.vocab.strings.add(u'beer')  # 3073001599257881079
beer_text = doc.vocab.strings[beer_hash]  # 'beer'
print(beer_hash, beer_text)

unicorn_hash = doc.vocab.strings.add(u'🦄 ')  # 17758882941175878347 (string includes the trailing space)
unicorn_text = doc.vocab.strings[unicorn_hash]  # '🦄 '
print(unicorn_hash, unicorn_text)


3197928453018144401 coffee
3197928453018144401 3197928453018144401
coffee coffee
3073001599257881079 beer
17758882941175878347 🦄 


#### Recognise and update named entities

In [31]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'San Francisco considers banning sidewalk delivery robots')
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

from spacy.tokens import Span
doc = nlp(u'FB is hiring a new VP of global policy')
doc.ents = [Span(doc, 0, 1, label=doc.vocab.strings[u'ORG'])]
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

San Francisco 0 13 GPE
FB 0 2 ORG


#### Train and update neural network models

In [23]:
import spacy
import random

nlp = spacy.load('en')
train_data = [("Uber blew through $1 million", {'entities': [(0, 4, 'ORG')]})]

with nlp.disable_pipes(*[pipe for pipe in nlp.pipe_names if pipe != 'ner']):
    optimizer = nlp.begin_training()
    for i in range(10):
        random.shuffle(train_data)
        for text, annotations in train_data:
            nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk('/Users/nanaakwasiabayieboateng/PythonNLTK/model')

#### Visualize a dependency parse and named entities in your browser


In [25]:
from spacy import displacy

doc_dep = nlp(u'This is a sentence.')
displacy.serve(doc_dep, style='dep')

doc_ent = nlp(u'When Sebastian Thrun started working on self-driving cars at Google '
              u'in 2007, few people outside of the company took him seriously.')
displacy.serve(doc_ent, style='ent')


    Serving on port 5000...
    Using the 'dep' visualizer


    Shutting down server on port 5000.


    Serving on port 5000...
    Using the 'ent' visualizer


    Shutting down server on port 5000.



#### Get word vectors and similarity

In [38]:

import spacy
#nlp = spacy.load('en_core_web_md')  # doesn't work
#nlp = spacy.load('en_core_web_sm')  # works too
nlp = spacy.load('en_core_web_lg')  # gives best results

doc = nlp(u"Apple and banana are similar. Pasta and hippo aren't.")

apple = doc[0]
banana = doc[2]
pasta = doc[6]
hippo = doc[8]

print('apple <-> banana', apple.similarity(banana))
print('pasta <-> hippo', pasta.similarity(hippo))
print(apple.has_vector, banana.has_vector, pasta.has_vector, hippo.has_vector)

apple <-> banana 0.583184
pasta <-> hippo 0.0793491
True True True True


In [52]:
# pip install spacy
# python -m spacy download en_core_web_sm

import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en_core_web_sm')

# Process whole documents
text = (u"When Sebastian Thrun started working on self-driving cars at "
        u"Google in 2007, few people outside of the company took him "
        u"seriously. “I can tell you very senior CEOs of major American "
        u"car companies would shake my hand and turn away because I wasn’t "
        u"worth talking to,” said Thrun, now the co-founder and CEO of "
        u"online higher education startup Udacity, in an interview with "
        u"Recode earlier this week.")
doc = nlp(text)

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

# Determine semantic similarities
doc1 = nlp(u"my fries were super gross")
doc2 = nlp(u"such disgusting fries")
similarity = doc1.similarity(doc2)
print(doc1.text, doc2.text, similarity)

Sebastian Thrun PERSON
Google ORG
2007 DATE
American NORP
Thrun PERSON
Recode ORG
earlier this week DATE
my fries were super gross such disgusting fries 0.713970251872


#### Simple and efficient serialization

In [41]:
import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

nlp = spacy.load('en')
#customer_feedback = open('customer_feedback_627.txt').read()
#doc = nlp(customer_feedback)
#doc.to_disk('/Users/nanaakwasiabayieboateng/PythonNLTK/customer_feedback_627.bin')

#new_doc = Doc(Vocab()).from_disk('/Users/nanaakwasiabayieboateng/PythonNLTK/customer_feedback_627.bin')

#### Match text with token rules

In [42]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

def set_sentiment(matcher, doc, i, matches):
    doc.sentiment += 0.1

pattern1 = [{'ORTH': 'Google'}, {'ORTH': 'I'}, {'ORTH': '/'}, {'ORTH': 'O'}]
pattern2 = [[{'ORTH': emoji, 'OP': '+'}] for emoji in ['😀', '😂', '🤣', '😍']]
matcher.add('GoogleIO', None, pattern1) # match "Google I/O" exactly (ORTH compares the verbatim text)
matcher.add('HAPPY', set_sentiment, *pattern2) # match one or more happy emoji

doc = nlp(u"A text about Google I/O 😀😀")
matches = matcher(doc)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(string_id, span.text)
print('Sentiment', doc.sentiment)

GoogleIO Google I/O
HAPPY 😀😀
Sentiment 0.10000000149011612


#### Multi-threaded generator

In [46]:
texts = [u'One document.', u'...', u'Lots of documents']
# .pipe streams input, and produces streaming output
# (uses the nlp object loaded in an earlier cell; range replaces Python 2's xrange)
iter_texts = (texts[i % 3] for i in range(100000000))
for i, doc in enumerate(nlp.pipe(iter_texts, batch_size=50, n_threads=4)):
    assert doc.is_parsed
    if i == 100:
        break

#### Get syntactic dependencies


In [47]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"When Sebastian Thrun started working on self-driving cars at Google "
          u"in 2007, few people outside of the company took him seriously.")

dep_labels = []
for token in doc:
    while token.head != token:
        dep_labels.append(token.dep_)
        token = token.head
print(dep_labels)

['advmod', 'advcl', 'compound', 'nsubj', 'advcl', 'nsubj', 'advcl', 'advcl', 'xcomp', 'advcl', 'prep', 'xcomp', 'advcl', 'npadvmod', 'amod', 'pobj', 'prep', 'xcomp', 'advcl', 'punct', 'amod', 'pobj', 'prep', 'xcomp', 'advcl', 'amod', 'pobj', 'prep', 'xcomp', 'advcl', 'pobj', 'prep', 'xcomp', 'advcl', 'prep', 'xcomp', 'advcl', 'pobj', 'prep', 'xcomp', 'advcl', 'prep', 'xcomp', 'advcl', 'pobj', 'prep', 'xcomp', 'advcl', 'punct', 'amod', 'nsubj', 'nsubj', 'prep', 'nsubj', 'prep', 'prep', 'nsubj', 'det', 'pobj', 'prep', 'prep', 'nsubj', 'pobj', 'prep', 'prep', 'nsubj', 'dobj', 'advmod', 'punct']


#### Export to numpy arrays

In [49]:
import spacy
from spacy.attrs import ORTH, LIKE_URL

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Check out https://spacy.io")
for token in doc:
    print(token.text, token.orth, token.like_url)

attr_ids = [ORTH, LIKE_URL]
doc_array = doc.to_array(attr_ids)
print(doc_array.shape)
print(len(doc), len(attr_ids))

assert doc[0].orth == doc_array[0, 0]
assert doc[1].orth == doc_array[1, 0]
assert doc[0].like_url == doc_array[0, 1]

assert list(doc_array[:, 1]) == [t.like_url for t in doc]
print(list(doc_array[:, 1]))

Check 8104846059040039827 False
out 1696981056005371314 False
https://spacy.io 17142293684782158888 True
(3, 2)
3 2
[0, 0, 1]


####  Calculate inline markup on original string

In [51]:
import spacy

def put_spans_around_tokens(doc):
    """Here, we're building a custom "syntax highlighter" for
    part-of-speech tags and dependencies. We put each token in a
    span element, with the appropriate classes computed. All whitespace is
    preserved, outside of the spans. (Of course, HTML will only display
    multiple whitespace if enabled – but the point is, no information is lost
    and you can calculate what you need, e.g. <br />, <p> etc.)
    """
    output = []
    html = '<span class="{classes}">{word}</span>{space}'
    for token in doc:
        if token.is_space:
            output.append(token.text)
        else:
            classes = 'pos-{} dep-{}'.format(token.pos_, token.dep_)
            output.append(html.format(classes=classes, word=token.text, space=token.whitespace_))
    string = ''.join(output)
    string = string.replace('\n', '')
    string = string.replace('\t', '    ')
    return '<pre>{}</pre>'.format(string)

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"This is a test.\n\nHello   world.")
html = put_spans_around_tokens(doc)
print(html)
