# NLTK vs spaCy

In [1]:
import time

## Tokenization

In [2]:
text = '''The performance of these simple machine learning algorithms depends heavilyon therepresentationof the data they are given. For example, when logisticregression is used to recommend cesarean delivery, the AI system does not examinethe patient directly. Instead, the doctor tells the system several pieces of relevantinformation, such as the presence or absence of a uterine scar. Each piece ofinformation included in the representation of the patient is known as afeature.Logistic regression learns how each of these features of the patient correlates withvarious outcomes. However, it cannot inﬂuence how features are deﬁned in anyway. If logistic regression were given an MRI scan of the patient, rather thanthe doctor’s formalized report, it would not be able to make useful predictions.Individual pixels in an MRI scan have negligible correlation with any complicationsthat might occur during delivery.This dependence on representations is a general phenomenon that appearsthroughout computer science and even daily life. In computer science, operationssuch as searching a collection of data can proceed exponentially faster if the collec-tion is structured and indexed intelligently. People can easily perform arithmeticon Arabic numerals but ﬁnd arithmetic on Roman numerals much more timeconsuming. It is not surprising that the choice of representation has an enormouseﬀect on the performance of machine learning algorithms. For a simple visualexample, see ﬁgure 1.1.Many artiﬁcial intelligence tasks can be solved by designing the right set offeatures to extract for that task, then providing these features to a simple machinelearning algorithm. For example, a useful feature for speaker identiﬁcation fromsound is an estimate of the size of the speaker’s vocal tract. This feature gives astrong clue as to whether the speaker is a man, woman, or child'''

### NLTK

In [3]:
from nltk import sent_tokenize, word_tokenize

In [4]:
start_time = time.time()

word_tokens = word_tokenize(text)
words = [word_token for word_token in word_tokens]

print("--- %s seconds ---" % (time.time() - start_time))

--- 0.019628047943115234 seconds ---


### spaCy

In [5]:
import spacy
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut no longer works in spaCy 3

In [6]:
start_time = time.time()

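# nlp(text) runs the full pipeline (tagger, parser, NER), not only the tokenizer,
# so this timing is not a like-for-like comparison with word_tokenize.
doc = nlp(text)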
words = [token.text for token in doc]

print("--- %s seconds ---" % (time.time() - start_time))

--- 0.07090187072753906 seconds ---


## Sentence segmentation

In [7]:
start_time = time.time()

sentences = sent_tokenize(text)
words = [sentence for sentence in sentences]

print("--- %s seconds ---" % (time.time() - start_time))

for sentence in words:
    print(sentence)

--- 0.0 seconds ---
The performance of these simple machine learning algorithms depends heavilyon therepresentationof the data they are given.
For example, when logisticregression is used to recommend cesarean delivery, the AI system does not examinethe patient directly.
Instead, the doctor tells the system several pieces of relevantinformation, such as the presence or absence of a uterine scar.
Each piece ofinformation included in the representation of the patient is known as afeature.Logistic regression learns how each of these features of the patient correlates withvarious outcomes.
However, it cannot inﬂuence how features are deﬁned in anyway.
If logistic regression were given an MRI scan of the patient, rather thanthe doctor’s formalized report, it would not be able to make useful predictions.Individual pixels in an MRI scan have negligible correlation with any complicationsthat might occur during delivery.This dependence on representations is a general phenomenon that appearsthro

In [8]:
start_time = time.time()

doc = nlp(text)
words = [sent for sent in doc.sents]

print("--- %s seconds ---" % (time.time() - start_time))
for sentence in words:
    print(sentence)

--- 0.07812714576721191 seconds ---
The performance of these simple machine learning algorithms depends heavilyon therepresentationof the data they are given.
For example, when logisticregression is used to recommend cesarean delivery, the AI system does not examinethe patient directly.
Instead, the doctor tells the system several pieces of relevantinformation, such as the presence or absence of a uterine scar.
Each piece ofinformation included in the representation of the patient is known as afeature.
Logistic regression learns how each of these features of the patient correlates withvarious outcomes.
However, it cannot inﬂuence how features are deﬁned in anyway.
If logistic regression were given an MRI scan of the patient, rather thanthe doctor’s formalized report, it would not be able to make useful predictions.
Individual pixels in an MRI scan have negligible correlation with any complicationsthat might occur during delivery.
This dependence on representations is a general phenomen
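
The timings above come from a single run measured with `time.time()`, whose resolution can be coarse on some platforms (the `--- 0.0 seconds ---` reading is a symptom of that). A minimal sketch of a more stable measurement, assuming a hypothetical helper named `best_of` and using `str.split` as a stand-in tokenizer so the cell runs without NLTK or spaCy:

```python
import time

def best_of(fn, *args, repeat=5):
    """Call fn(*args) `repeat` times; return (last result, fastest run in seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return result, best

# Stand-in tokenizer (hypothetical example input, not the notebook's text).
sample = "These are apples. These are oranges."
tokens, seconds = best_of(str.split, sample)
print(tokens)
print("--- best of 5: %.6f seconds ---" % seconds)
```

To time the cells above the same way, one could pass the notebook's own callables, for example `best_of(word_tokenize, text)` or `best_of(nlp, text)`.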

In [9]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion. And people still wonder why?')
for token in doc:
    print(token.text, token.pos_, token.dep_)

Apple PROPN nsubj
is VERB aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj
. PUNCT punct
And CCONJ cc
people NOUN nsubj
still ADV advmod
wonder VERB ROOT
why ADV ccomp
? PUNCT punct


In [10]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

Apple apple PROPN NNP nsubj Xxxxx True False
is be VERB VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. u.k. PROPN NNP compound X.X. False False
startup startup NOUN NN dobj xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False
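
The `shape_` column above abstracts each token's orthography: uppercase letters become `X`, lowercase letters `x`, digits `d`, and long runs are shortened. The function below is a sketch that reproduces the shapes printed above; it is inferred from that output rather than copied from spaCy, and the name `word_shape` and the `max_run` parameter are assumptions.

```python
def word_shape(text, max_run=4):
    """Map characters to X/x/d and keep at most `max_run` of any repeated symbol."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation and symbols are kept as-is
        run = run + 1 if mapped == last else 1
        last = mapped
        if run <= max_run:
            shape.append(mapped)
    return "".join(shape)

for word in ["Apple", "looking", "U.K.", "$", "1"]:
    print(word, word_shape(word))
```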


In [11]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying Colombo based startup for $1 billion')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
Colombo 27 34 ORG
$1 billion 53 63 MONEY
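
The `start_char` and `end_char` values are character offsets into the original string, so each entity can be recovered by plain slicing. A small check using the offsets printed above (the variable names are illustrative):

```python
sentence = "Apple is looking at buying Colombo based startup for $1 billion"

# (start_char, end_char, label) triples copied from the output above
entities = [(0, 5, "ORG"), (27, 34, "ORG"), (53, 63, "MONEY")]

for start, end, label in entities:
    # end_char is exclusive, matching Python slice semantics
    print(repr(sentence[start:end]), label)
```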


In [12]:
from spacy import displacy
 
doc = nlp('I just bought 2 shares of Apple at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ')
displacy.render(doc, style='ent', jupyter=True)

In [13]:
doc = nlp('I am Lahiru and I just bought 2 Apples at 9 a.m. from the Apple Inc. before the current stock went up by 1 billion $')
displacy.render(doc, style='ent', jupyter=True)

In [14]:
doc = nlp("These are apples. These are oranges.")
for sent in doc.sents:
    print(sent)

These are apples.
These are oranges.


In [15]:
from spacy import displacy
 
doc = nlp('Wall Street Journal just published a piece on crypto currencies')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 80})

In [16]:
doc = nlp("Wall Street Journal just published an interesting piece on crypto currencies")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.label_, chunk.root.text)

Wall Street Journal NP Journal
an interesting piece NP piece
crypto currencies NP currencies


In [17]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I love coffee')
print(doc.vocab.strings[u'coffee'])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])

3197928453018144401
coffee
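
The lookup above works in both directions: a string maps to a 64-bit integer, and the integer maps back to the string. The class below is a toy sketch of that idea only. `ToyStringStore` is a hypothetical name, and it uses `hashlib.blake2b` as a stand-in hash, so its values will not match spaCy's (for example `3197928453018144401`).

```python
import hashlib

class ToyStringStore:
    """Bidirectional string <-> 64-bit hash table (illustrative, not spaCy's)."""

    def __init__(self):
        self._by_hash = {}

    def add(self, string):
        # 8-byte digest -> unsigned 64-bit integer key
        digest = hashlib.blake2b(string.encode("utf-8"), digest_size=8).digest()
        key = int.from_bytes(digest, "big")
        self._by_hash[key] = string
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            return self.add(item)   # string -> hash
        return self._by_hash[item]  # hash -> string

store = ToyStringStore()
coffee_hash = store["coffee"]
print(coffee_hash)
print(store[coffee_hash])
```

Storing the integer instead of the string is what lets every token that shares a spelling point at one entry.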


In [18]:
nlp = spacy.load('en_core_web_sm')
doc = nlp('I love coffee')
for word in doc:
    lexeme = doc.vocab[word.text]
    print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
          lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)

I 4690420944186131903 X I I True False True en
love 3702023516439754181 xxxx l ove True False False en
coffee 3197928453018144401 xxxx c fee True False False en
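
The lexeme attributes above depend only on the spelling of the word, not on its context. A rough plain-Python approximation is sketched below. The prefix length of 1 and suffix length of 3 are inferred from the output above, and `lexical_attrs` is a hypothetical helper, not a spaCy function.

```python
def lexical_attrs(word):
    """Context-independent features computed from the spelling alone."""
    return {
        "prefix": word[:1],    # first character
        "suffix": word[-3:],   # last three characters (or the whole word if shorter)
        "is_alpha": word.isalpha(),
        "is_digit": word.isdigit(),
        "is_title": word.istitle(),
    }

for word in "I love coffee".split():
    print(word, lexical_attrs(word))
```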


In [19]:
target = nlp("Cats are beautiful animals.")
 
doc1 = nlp("Dogs are awesome.")
doc2 = nlp("Some gorgeous creatures are felines.")
doc3 = nlp("Dolphins are swimming mammals.")
 
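# en_core_web_sm ships without static word vectors, so these scores are only a
# rough signal; a model with vectors (en_core_web_md or en_core_web_lg) gives
# more meaningful similarities.
print(target.similarity(doc1))
print(target.similarity(doc2))
print(target.similarity(doc3))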

0.7927212922616225
0.7982259384182496
0.8015801058229597
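
By default `Doc.similarity` is described in the spaCy documentation as a cosine similarity between document vectors, where a document vector is the average of its token vectors. The sketch below shows that computation in plain Python. The three-dimensional vectors are made-up toy values, not real embeddings, and the helper names are assumptions.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical token vectors for two tiny "documents"
doc_a = mean_vector([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
doc_b = mean_vector([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]])
print(cosine_similarity(doc_a, doc_b))
```

Because averaging discards word order, two sentences with similar vocabulary can score close together even when their meanings differ.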


In [20]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

Autonomous cars cars nsubj shift
insurance liability liability dobj shift
manufacturers manufacturers pobj toward
