
1. spaCy and NLTK cover largely the same functionality.
2. spaCy ships one well-tuned, efficient algorithm per task, so if you mainly care about the end result, go with spaCy. NLTK exposes many algorithms per task, so if you need a specific algorithm or fine-grained customization, go with NLTK.
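
As a rough illustration of the second point, the sketch below runs the same sentence through two of NLTK's word tokenizers. The sentence and variable names are made up for this example; neither tokenizer needs a data download.

```python
from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer

sentence = "Dr. Strange isn't a strange man."

# Penn Treebank conventions: the contraction "isn't" is split into "is" + "n't"
treebank_tokens = TreebankWordTokenizer().tokenize(sentence)

# Plain regex splitting: every run of punctuation becomes its own token
wordpunct_tokens = WordPunctTokenizer().tokenize(sentence)

print(treebank_tokens)
print(wordpunct_tokens)
```

The two lists differ, which is the trade-off in practice: with NLTK you pick the algorithm yourself, while spaCy makes that choice for you.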

**spaCy**

In [None]:
%pip install spacy
!python -m spacy download en_core_web_sm  # small English model loaded below



In [None]:
import spacy

In [None]:
# sentence tokenization
# spacy.load("en_core_web_sm") returns the full trained pipeline, not just a sentence tokenizer
sent_tokenizer = spacy.load("en_core_web_sm")

# calling the pipeline on a string returns a Doc object
tokenized_sentences = sent_tokenizer("Dr. strange is not a strange man. Does hulk love him?")

for sent in tokenized_sentences.sents:
  print(sent)


Dr. strange is not a strange man.
Does hulk love him?


In [None]:
# word tokenization, one sentence at a time

for sent in tokenized_sentences.sents:
  for word in sent:
    print(word)

Dr.
strange
is
not
a
strange
man
.
Does
hulk
love
him
?


In [None]:
# word tokenization over the whole Doc (same tokens, no sentence loop needed)
for word in tokenized_sentences:
    print(word)

Dr.
strange
is
not
a
strange
man
.
Does
hulk
love
him
?


**NLTK**

In [None]:
%pip install nltk



In [None]:
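import nltk
# 'punkt' holds the sentence tokenizer models; recent NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize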

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# sentence tokenization
sent_tokenize("Dr. strange is not a strange man. Does hulk love him?")

['Dr. strange is not a strange man.', 'Does hulk love him?']

In [None]:
# word tokenization
word_tokenize("Dr. strange is not a strange man. Does hulk love him?")

['Dr.',
 'strange',
 'is',
 'not',
 'a',
 'strange',
 'man',
 '.',
 'Does',
 'hulk',
 'love',
 'him',
 '?']

**More with spaCy**

In [None]:
nlp = spacy.blank("en")
doc = nlp("Dr. strange is not a strange man. Does hulk love him?")

In [None]:
token0 = doc[0]

In [None]:
dir(token0)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_dep',
 'has_extension',
 'has_head',
 'has_morph',
 'has_vector',
 'head',
 'i',
 'idx',
 'iob_strings',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang',
 'lang_',
 ...]

In [None]:
token0.is_currency

False
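
Most of the names listed by `dir(token0)` are simple lexical attributes. The sketch below prints a few of them for every token in a made-up sentence; `nlp_demo` and `demo_doc` are names chosen for this example only, and a blank pipeline should be enough because these attributes do not depend on a trained model.

```python
import spacy

nlp_demo = spacy.blank("en")  # tokenizer only, no trained components
demo_doc = nlp_demo("Thor paid 3 dollars and two € for chai")

# token.i is the token's position in the Doc; the rest are boolean lexical flags
for token in demo_doc:
    print(token.i, token.text, token.is_alpha, token.like_num, token.is_currency)

# Slicing a Doc returns a Span, which keeps the original text
first_two = demo_doc[0:2]
print(first_two.text)
```

Note that `like_num` matches spelled-out numbers such as "two" as well as digits, which is what the transaction example later in this notebook relies on.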

In [None]:
text = 'Dayton high school, 8th grade students information\n ==================================================\n \n Name\tbirth day   \temail\n -----\t------------\t------\n Virat   5 June, 1882    virat@kohli.com\n Maria\t12 April, 2001  maria@sharapova.com\n Serena  24 June, 1998   serena@williams.com \n Joe      1 May, 1997    joe@root.com\n \n \n \n'

In [None]:
doc = nlp(text)
emails = []
for token in doc:
    if token.like_email:
        emails.append(token.text)
emails

['virat@kohli.com',
 'maria@sharapova.com',
 'serena@williams.com',
 'joe@root.com']

**Customizing the spaCy tokenizer**

In [None]:
from spacy.symbols import ORTH

nlp = spacy.blank("en")
doc = nlp("gimme double cheese extra large healthy pizza")
tokens = [token.text for token in doc]
tokens

['gimme', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']

In [None]:
nlp.tokenizer.add_special_case("gimme", [
    {ORTH: "gim"},
    {ORTH: "me"},
])
doc = nlp("gimme double cheese extra large healthy pizza")
tokens = [token.text for token in doc]
tokens

['gim', 'me', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']
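
A special case can also carry a normalized form for each piece. The sketch below is a variation on the cell above, assuming we want "gim" to normalize to "give"; `nlp_custom` and `custom_doc` are illustrative names.

```python
import spacy
from spacy.symbols import NORM, ORTH

nlp_custom = spacy.blank("en")

# ORTH is the exact text of each piece; NORM is an optional normalized form
nlp_custom.tokenizer.add_special_case("gimme", [
    {ORTH: "gim", NORM: "give"},
    {ORTH: "me"},
])

custom_doc = nlp_custom("gimme double cheese")
print([(token.text, token.norm_) for token in custom_doc])
```

The ORTH pieces are expected to join back into the original string ("gim" + "me" = "gimme"), so a special case can split a token but cannot rewrite its text.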

**blank vs. load**

In [None]:
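# nlp is still the blank pipeline created above (tokenizer only),
# so no component has set sentence boundaries and doc.sents fails
doc = nlp("Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi")
for sentence in doc.sents:
    print(sentence)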

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: `nlp.add_pipe('sentencizer')`. Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting `doc[i].is_sent_start`.

In [None]:
nlp.pipeline

[]

In [None]:
nlp.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x789ae66e1f40>

In [None]:
nlp.pipeline

[('sentencizer', <spacy.pipeline.sentencizer.Sentencizer at 0x789ae66e1f40>)]

In [None]:
doc = nlp("Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi")
for sentence in doc.sents:
    print(sentence)

Dr. Strange loves pav bhaji of mumbai.
Hulk loves chat of delhi
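
One way to avoid the E030 error is to check whether sentence boundaries exist before iterating over `doc.sents`. The sketch below does that with `Doc.has_annotation` on a fresh blank pipeline; `nlp_rules`, `before` and `after` are names introduced for this example.

```python
import spacy

nlp_rules = spacy.blank("en")
text = "Dr. Strange loves pav bhaji of mumbai. Hulk loves chat of delhi"

# A blank pipeline only tokenizes, so no sentence starts are annotated
before = nlp_rules(text)
print(before.has_annotation("SENT_START"))

# Add the rule-based sentencizer once; the guard avoids adding it twice
if "sentencizer" not in nlp_rules.pipe_names:
    nlp_rules.add_pipe("sentencizer")

after = nlp_rules(text)
print(after.has_annotation("SENT_START"))

sentences = [sent.text for sent in after.sents]
print(sentences)
```

The sentencizer splits on punctuation tokens. "Dr." survives here because the tokenizer keeps it as a single token, so its period is never seen as a separate sentence-ending token.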


**Collecting URLs from a given text**

In [None]:
text='''
Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/,
and the European Social Survey at http://www.europeansocialsurvey.org/.
'''

In [None]:
doc = nlp(text)
data_websites = [token.text for token in doc if token.like_url ]
data_websites

['http://www.data.gov/',
 'http://www.science',
 'http://data.gov.uk/.',
 'http://www3.norc.org/gss+website/',
 'http://www.europeansocialsurvey.org/.']
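
Some of the collected tokens end with the sentence's final period, because the tokenizer keeps a URL-like string together. A minimal clean-up sketch in plain Python follows, assuming that trailing punctuation is never part of the intended URL (an assumption that fits this text but not every URL); `raw_urls` simply repeats the output above.

```python
raw_urls = [
    "http://www.data.gov/",
    "http://www.science",
    "http://data.gov.uk/.",
    "http://www3.norc.org/gss+website/",
    "http://www.europeansocialsurvey.org/.",
]

# Strip sentence punctuation that got attached to the end of a URL token
cleaned_urls = [url.rstrip(".,;:") for url in raw_urls]

for url in cleaned_urls:
    print(url)
```

This does not repair `http://www.science`, which was cut off by the line break in the source text; fixing that would mean rejoining the broken line before tokenizing.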

**Extracting all transactions (amount and currency) from a text**

In [None]:
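transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"
doc = nlp(transactions)
for token in doc:
    # guard the look-ahead so a number at the very end of the text cannot raise IndexError
    if token.like_num and token.i + 1 < len(doc) and doc[token.i + 1].is_currency:
        print(token.text, doc[token.i + 1].text)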

two $
500 €
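
The loop above only looks at the token after the number, so an amount written as "$500" would be missed. The sketch below checks both neighbours; `find_amounts` is a hypothetical helper written for this example, not a spaCy function, and the sentence is a modified version of the one above.

```python
import spacy

nlp_money = spacy.blank("en")
money_doc = nlp_money("Tony gave two $ to Peter, Bruce gave $500 to Steve")


def find_amounts(doc):
    """Return (amount, currency) pairs where the symbol sits next to the number."""
    pairs = []
    for token in doc:
        if not token.like_num:
            continue
        # Symbol after the number, e.g. "two $"
        if token.i + 1 < len(doc) and doc[token.i + 1].is_currency:
            pairs.append((token.text, doc[token.i + 1].text))
        # Symbol before the number, e.g. "$500" (tokenized as "$" and "500")
        elif token.i > 0 and doc[token.i - 1].is_currency:
            pairs.append((token.text, doc[token.i - 1].text))
    return pairs


print(find_amounts(money_doc))
```

Both index checks keep the look-around inside the bounds of the Doc, so a number at the very start or end of the text is handled safely.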
