# What is spaCy?

1.) An open-source Natural Language Processing library.

2.) Designed to handle NLP tasks effectively, using efficient implementations of common algorithms.

3.) For many NLP tasks, spaCy has only one implemented method, choosing the most efficient algorithm currently available.

4.) This means you often don't have the option to choose other algorithms.


# What is NLTK?

1.) NLTK (Natural Language Toolkit) is a very popular open-source library.

2.) Initially released in 2001, it is older than spaCy (released in 2015).

3.) It provides a lot of functionality, but includes less efficient implementations.

# NLTK vs spaCy

1.) For many common NLP tasks, spaCy is much faster and more efficient, at the cost of the user not being able to choose algorithmic implementations.

2.) However, spaCy does not include pre-created models for some applications, such as sentiment analysis, which is typically easier to perform with NLTK.



# What is Natural Language Processing?

"Natural Language Processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data."

-> Often when performing analysis, much of the data is numerical, such as sales numbers, physical measurements, and quantifiable categories.

-> Computers are very good at handling direct numerical information.

What about text data?

-> Computers need specialized processing techniques in order to "understand" raw text data.

-> Text data is highly unstructured and can be in multiple languages!

-> Natural Language Processing attempts to use a variety of techniques in order to create structure out of text data.

Example Use Cases:

1.) Classifying Emails as Spam vs Legitimate

2.) Sentiment Analysis of Text Movie Reviews

3.) Analyzing Trends from written customer feedback forms.

4.) Understanding text commands, "Hey Google, play this song"
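To make "creating structure out of text data" concrete, here is a minimal sketch that turns raw sentences into word counts using only the standard library. The helper name `bag_of_words` and the two sample reviews are made up for illustration; real pipelines (including spaCy's) do far more than this.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, extract runs of letters/apostrophes, and count them."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical sample data, not taken from any real dataset
reviews = [
    "Great movie, great cast",
    "Terrible plot and terrible pacing",
]

counts = [bag_of_words(review) for review in reviews]

for review, count in zip(reviews, counts):
    # most_common(2) lists the two most frequent words with their counts
    print(review, "->", count.most_common(2))
```

Once text is reduced to counts like these, it becomes numerical data that ordinary analysis tools can work with.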



In [1]:
# Import spaCy

import spacy

In [2]:
# Load language library

nlp = spacy.load("en_core_web_sm")
nlp



<spacy.lang.en.English at 0x7b4e73cb7c40>

In [3]:
# create a document object

doc = nlp(u'Tesla and Bitcon prices are going up after election\
 currently at $84000')




In [4]:
# Iterate over the tokens in the doc object

for token in doc:
    print(token.text)



Tesla
and
Bitcon
prices
are
going
up
after
election
currently
at
$
84000


In [5]:
# POS = Part of speech

for token in doc:
    print(token.text,token.pos)


# In the first output we see numbers like 96, 89 and 92.
# Each number is spaCy's integer ID for a part of speech,
# such as an adverb, a verb, a noun or a conjunction.
# To get the readable name, use token.pos_

print(" ")


for token in doc:
    print(token.text,token.pos,token.pos_)



Tesla 96
and 89
Bitcon 96
prices 92
are 87
going 100
up 85
after 85
election 92
currently 86
at 85
$ 99
84000 93
 
Tesla 96 PROPN
and 89 CCONJ
Bitcon 96 PROPN
prices 92 NOUN
are 87 AUX
going 100 VERB
up 85 ADP
after 85 ADP
election 92 NOUN
currently 86 ADV
at 85 ADP
$ 99 SYM
84000 93 NUM


In [6]:
# DEP = Syntactic dependency

for token in doc:
    print(token.text,token.pos,token.pos_,token.dep_)

Tesla 96 PROPN nsubj
and 89 CCONJ cc
Bitcon 96 PROPN compound
prices 92 NOUN conj
are 87 AUX aux
going 100 VERB ROOT
up 85 ADP prt
after 85 ADP prep
election 92 NOUN pobj
currently 86 ADV advmod
at 85 ADP prep
$ 99 SYM nmod
84000 93 NUM pobj
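The integer IDs and short labels above can be looked up without running a model. The sketch below assumes spaCy v3 is installed; it uses the `IDS` and `NAMES` mappings from `spacy.parts_of_speech` and `spacy.explain` to translate between the forms. The exact integer values may differ between spaCy versions, so the code does not rely on them.

```python
import spacy
from spacy.parts_of_speech import IDS, NAMES

# Map a part-of-speech name to its integer ID and back again
propn_id = IDS["PROPN"]
print(propn_id, NAMES[propn_id])

# spacy.explain returns a short description for many tag and dependency labels
for label in ["PROPN", "VBG", "nsubj", "pobj"]:
    print(label, "->", spacy.explain(label))
```

`spacy.explain` returns `None` for labels it does not know, so it is worth checking the result before concatenating it into a string.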


In [7]:
# Pipeline object

nlp.pipeline



[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7b4e739417e0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7b4e73941ba0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7b4e7bd0ed50>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7b4e886efbc0>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7b4e73bcbbc0>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7b4e7bd0fd10>)]

In [8]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
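A pipeline can also be built up and temporarily trimmed. The following is a minimal sketch assuming spaCy v3; it starts from `spacy.blank("en")` (a tokenizer-only pipeline, so no model download is needed) rather than `en_core_web_sm`, adds the built-in `sentencizer` component, and then disables it inside a `with` block.

```python
import spacy

# A blank English pipeline contains only the tokenizer, no components
nlp = spacy.blank("en")
print(nlp.pipe_names)

# Add the rule-based sentence splitter that ships with spaCy
nlp.add_pipe("sentencizer")
names_after_add = nlp.pipe_names
print(names_after_add)

# Temporarily disable a component; it is restored when the block exits
with nlp.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp.pipe_names)
    print(names_inside)

names_restored = nlp.pipe_names
print(names_restored)
```

Disabling components you do not need is a common way to speed up processing when only some annotations are required.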

In [9]:
# Tokenization

# The very first step in processing any text is to split it
# into its component parts, that is, the words and
# punctuation, as tokens. These tokens are annotated
# inside the doc object to contain descriptive information.

doc2 = nlp(u'Tesla is not looking for startups anymore')
doc2



Tesla is not looking for startups anymore

In [10]:
for token in doc2:
    print(token.text,token.pos_,token.dep_)



Tesla PROPN nsubj
is AUX aux
not PART neg
looking VERB ROOT
for ADP prep
startups NOUN pobj
anymore ADV advmod


In [11]:
doc2 = nlp(u"Tesla isn't   looking for startups anymore")
doc2
for token in doc2:
    print(token.text,token.pos_,token.dep_)



Tesla PROPN nsubj
is AUX aux
n't PART neg
   SPACE dep
looking VERB ROOT
for ADP prep
startups NOUN pobj
anymore ADV advmod
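Tokenization itself does not depend on the statistical model. As a sketch, assuming spaCy v3, a blank English pipeline is enough to see how contractions and currency symbols are split; it will not provide `pos_` or `dep_` values, only the tokens.

```python
import spacy

# Tokenizer-only pipeline: no tagger or parser, so no model download needed
nlp = spacy.blank("en")

doc = nlp("Tesla isn't looking for startups anymore")
tokens = [token.text for token in doc]
print(tokens)

# A Doc behaves like a sequence: it has a length and supports indexing
print(len(doc), doc[0].text, doc[-1].text)

# The currency symbol and the amount become separate tokens
price_tokens = [token.text for token in nlp("currently at $84000")]
print(price_tokens)
```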


In [12]:
doc2[0]

Tesla

In [18]:
# So far we have iterated over all tokens.

# We can also access an individual token by its index.

doc[2].pos_

'PROPN'

In [20]:
# syntactic dependency

doc2[0].dep_

'nsubj'


## Additional Token Attributes



|Attribute|Description|Value for doc2[0]|
|:------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

In [24]:
# lemmas( The base form of the word)

print(doc2[4].text)
print(doc2[4].lemma_)

looking
look


In [29]:
# Part of speech

print(doc[2].pos_)

print(doc2[4].tag_ +' || '+spacy.explain(doc2[4].tag_))

PROPN
VBG || verb, gerund or present participle


In [32]:
# Word shapes

print(doc[2].text+ ' || ' +doc[2].shape_)
print(doc2[5].text+ ' || ' +doc2[5].shape_)

Bitcon || Xxxxx
for || xxx


In [35]:
# Boolean value

print(doc2[0].text)

print(doc2[0].is_alpha)


print(doc2[5].text)

print(doc2[5].is_alpha)

Tesla
True
for
True
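Several of these attributes are purely lexical, so they can be inspected without a trained model. The sketch below assumes spaCy v3 and uses a blank English pipeline; `like_num` is an extra attribute not shown in the table above, included here only as an illustration.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Tesla and Bitcon prices are going up after election currently at $84000")

# Print a small table of lexical attributes for every token
for token in doc:
    print(
        f"{token.text:<10} {token.shape_:<6} "
        f"{token.is_alpha!s:<6} {token.is_stop!s:<6} {token.like_num}"
    )

first, last = doc[0], doc[-1]
```

Note that the shape collapses long runs of the same character class to four characters, which is why a five-digit number and a four-digit number share the same shape.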


# Spans

Large Doc objects can sometimes be hard to work with. A span is a slice of a Doc object in the form

Doc[start:stop]

In [22]:
doc3 = nlp(u'Although commmonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')
doc3

Although commmonly attributed to John Lennon from his song "Beautiful Boy", the phrase "Life is what happens to us while we are making other plans" was written by cartoonist Allen Saunders and published in Reader's Digest in 1957, when Lennon was 17.

In [36]:
# grab a span from it

life_quote=doc3[16:30]

In [37]:
life_quote

"Life is what happens to us while we are making other plans"

In [38]:
type(life_quote)
# spaCy reports the type of a slice as Span

spacy.tokens.span.Span

In [39]:
type(doc3)

spacy.tokens.doc.Doc
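A span keeps track of where it sits in the parent Doc. The following sketch, assuming spaCy v3 and a blank English pipeline, slices a short sentence and inspects the span's boundaries; the sentence is reused from later in this notebook.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("This is the Spacy introduction.")

# Slice tokens 2, 3 and 4 (the stop index is exclusive, as with Python lists)
span = doc[2:5]
print(span.text)

# start and end are token offsets into the parent Doc
print(span.start, span.end, len(span))

# Indexing a span is relative to the span, not to the Doc
print(span[0].text)
print(type(span).__name__)
```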

In [40]:
doc4 = nlp(u'This is the Spacy introduction. I tried to do as much practice as I could do. Next will do tokenization')

In [42]:
# Print each sentence

for sentence in doc4.sents:
    print(sentence)

This is the Spacy introduction.
I tried to do as much practice as I could do.
Next will do tokenization


In [45]:
doc4[3]

Spacy

In [46]:
doc4[0].is_sent_start

True

In [47]:
doc4[15].is_sent_end

False
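To see where every sentence starts and ends at once, the boundary flags can be collected in a list. This is a sketch assuming spaCy v3; it uses a blank pipeline with the rule-based `sentencizer` instead of the parser in `en_core_web_sm`, so boundaries on more complex text may differ from the model's.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # splits sentences on punctuation

doc = nlp(
    "This is the Spacy introduction. "
    "I tried to do as much practice as I could do. "
    "Next will do tokenization"
)

sentences = [sent.text for sent in doc.sents]
print(sentences)

# Token indices that open or close a sentence
starts = [token.i for token in doc if token.is_sent_start]
ends = [token.i for token in doc if token.is_sent_end]
print(starts)
print(ends)
```

Token 15 ("could") appears in neither list, which matches the `False` result above.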