## spaCy
spaCy is an open-source natural language processing library.
It is designed to handle NLP tasks effectively, using efficient implementations of common algorithms.
For many NLP tasks, spaCy implements only one method, choosing the most efficient algorithm currently available.

## NLTK
NLTK (Natural Language Toolkit) is a very popular open-source NLP library.
Initially released in 2001, it is much older than spaCy (released in 2015).
It also provides a wide range of functionality, but many of its implementations are less efficient.

## NLTK vs spaCy
For many common NLP tasks, spaCy is much faster and more efficient, at the cost of the user not being able to choose
a specific algorithmic implementation.
However, spaCy does not include pre-built models for some applications, such as sentiment analysis, which is
typically easier to perform with NLTK.

In [1]:
# Install spaCy from the command line: pip install spacy
# Download the English language model that spaCy needs: python -m spacy download en_core_web_sm
# The model can then be loaded with spacy.load('en_core_web_sm')
import spacy

## What is NLP
According to Wikipedia, "NLP is an area of computer science and artificial intelligence concerned with the interactions
between computers and human (natural) languages; in particular how to program computers to process and analyze large amounts of natural language data."

When performing analysis, much of the data is numerical: sales numbers, physical measurements, quantifiable categories. Computers are very good at handling direct numerical information.
But what do we do about text data?
As humans, we can tell that text documents contain a plethora of information. A computer, however, needs specialised processing techniques in order to "understand" raw text data.
Text data is highly unstructured and can be in multiple languages.
NLP uses a variety of techniques to create structure out of text data.
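
As a rough illustration of "creating structure out of text", the sketch below turns a raw string into word counts using only the Python standard library. The sample sentence and the naive lowercase-and-strip normalisation are assumptions made for this example; real NLP libraries such as spaCy tokenize far more carefully.

```python
from collections import Counter

# Hypothetical sample text, chosen only for illustration
text = "The movie was great. The acting was great too!"

# Naive normalisation: split on whitespace, strip surrounding punctuation, lowercase
words = [word.strip(".,!?").lower() for word in text.split()]

# The unstructured string is now a structured mapping of word -> frequency
word_counts = Counter(words)
print(word_counts)
```

Once text is in a structured form like this, it can be fed to the same numerical tools used for any other kind of data.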

## Example Use Cases
Classifying emails as spam vs legitimate
Sentiment analysis of movie reviews
Analyzing trends from written customer feedback forms
Understanding text commands, "Hey Google, play this song"

## spaCy Basics

There are a few key steps for working with spaCy that we will cover in this lecture:

1. Loading the Language Library
2. Building a Pipeline Object
3. Using Tokens
4. Parts of Speech Tagging
5. Understanding Token Attributes

In [2]:
import spacy

In [3]:
# Load a model
nlp = spacy.load('en_core_web_sm')

In [6]:
# Create a Doc object by passing a string to nlp (the u prefix marks a unicode string and is optional in Python 3)
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')
print(type(doc))

<class 'spacy.tokens.doc.Doc'>


In [12]:
# spaCy parses the entire string into several components for us:
# it splits the text into tokens, i.e. individual words, symbols and punctuation

for token in doc:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj
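
The tags in this output are abbreviations. As a small sketch, `spacy.explain` can be used to look up a short description for each label; the list of labels below is simply copied from the output above, and the call does not require a model to be loaded.

```python
import spacy

# Labels copied from the token output above
labels = ["PROPN", "AUX", "VERB", "ADP", "nsubj", "pobj"]

for label in labels:
    # spacy.explain returns a short description, or None for unknown labels
    print(f"{label:>6} -> {spacy.explain(label)}")
```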


In [13]:
# Inspect the processing pipeline: the components that nlp runs on every text, in order
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x25970448880>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x2597a738b20>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x2597a6412a0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x2597a57ebc0>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x2597a8b9e80>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x2597a6413f0>)]

In [14]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
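
To see how a pipeline is assembled from components, the sketch below starts from a blank English pipeline and adds one component by name. Using `spacy.blank("en")` and the rule-based `sentencizer` is an assumption made here so the example runs without downloading a model; it is not the pipeline loaded above.

```python
import spacy

# Assumption: a blank English pipeline, which has a tokenizer but no components
nlp_blank = spacy.blank("en")
names_before = list(nlp_blank.pipe_names)
print(names_before)

# Add the built-in rule-based sentence splitter by its registered name
nlp_blank.add_pipe("sentencizer")
names_after = list(nlp_blank.pipe_names)
print(names_after)
```

The tokenizer itself never appears in `pipe_names`: it always runs first and produces the Doc that the listed components then annotate.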

## Tokenization

The very first step in processing any text is to split it into its component parts, called tokens. These tokens are annotated inside the Doc object so that they carry descriptive information.


In [15]:
doc2 = nlp(u"Tesla isn't looking into startups anymore")

In [17]:
for token in doc2:
    print(token, token.pos_, token.dep_)   #token, part of speech, syntactic dependency

Tesla PROPN nsubj
is AUX aux
n't PART neg
looking VERB ROOT
into ADP prep
startups NOUN pobj
anymore ADV advmod
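
Tokenization on its own does not need a trained model. The following sketch, which assumes a blank English pipeline (`spacy.blank("en")`) rather than `en_core_web_sm`, shows the same contraction split and how each token keeps its character offset so the original text can be rebuilt.

```python
import spacy

# Assumption: a blank pipeline, which only runs the rule-based English tokenizer
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Tesla isn't looking into startups anymore")

# The contraction "isn't" is split into two tokens: "is" and "n't"
tokens = [token.text for token in doc_blank]
print(tokens)

# token.idx is the character offset of the token in the original string
for token in doc_blank:
    print(token.idx, repr(token.text))

# Tokenization is non-destructive: joining text_with_ws restores the input
rebuilt = "".join(token.text_with_ws for token in doc_blank)
print(rebuilt == doc_blank.text)
```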


In [18]:
#we can also use indexing to grab the specific tokens
doc2[0]

Tesla

In [20]:
doc[0].pos_         #proper noun

'PROPN'

In [21]:
doc[0].dep_

'nsubj'

In [24]:
doc[0].lemma_

'Tesla'

In [25]:
doc[0].tag_

'NNP'

In [26]:
doc[0].shape_

'Xxxxx'

In [27]:
doc[0].is_alpha   #does the token consist only of alphabetic characters

True

In [28]:
doc[0].is_stop    #is the token a stop word (a very common word such as "is" or "the")

False
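
The attributes above can also be printed side by side for every token. This sketch assumes a blank English pipeline so that it runs without a model download, which means it only shows lexical attributes (`shape_`, `is_alpha`, `is_stop`, `like_num`); attributes such as `pos_`, `dep_`, `tag_` and `lemma_` come from trained pipeline components and would be empty here.

```python
import spacy

# Assumption: blank pipeline, so only tokenizer-level (lexical) attributes are set
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Tesla is looking at buying U.S. startup for $6 million")

# Print a small table of lexical attributes, one row per token
print(f"{'text':<10}{'shape_':<10}{'is_alpha':<10}{'is_stop':<10}{'like_num':<10}")
for token in doc_blank:
    print(
        f"{token.text:<10}{token.shape_:<10}"
        f"{token.is_alpha!s:<10}{token.is_stop!s:<10}{token.like_num!s:<10}"
    )
```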

In [29]:
#we can also slice the Doc object to get a span
test1 = doc2[3:6]
test1                 #a Span is a slice of the overall document

looking into startups

In [30]:
type(test1)

spacy.tokens.span.Span

In [31]:
type(doc2)

spacy.tokens.doc.Doc
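
A Span keeps track of where it sits in its parent Doc. The sketch below, again assuming a blank English pipeline to avoid a model download, slices the same three tokens and prints the span's token and character positions.

```python
import spacy

# Assumption: blank pipeline; slicing works the same way on any Doc
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Tesla isn't looking into startups anymore")

# Slicing uses token indices; the end index is exclusive, as with Python lists
span = doc_blank[3:6]
print(span.text)

# Token positions of the span within the parent Doc
print(span.start, span.end)

# Character offset at which the span starts in the original string
print(span.start_char)

# A Span can be indexed and measured like a small Doc
print(len(span), span[0].text)
```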

In [32]:
doc3 = nlp(u"This is the first sentence. This is the second sentence. This is the last sentence")

In [33]:
for sentence in doc3.sents:      #gives us the sentences in the complete document
    print(sentence)

This is the first sentence.
This is the second sentence.
This is the last sentence


In [34]:
doc3[6]

This

In [35]:
doc3[6].is_sent_start       #True when the token is the first token of a sentence

True
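
Note that `doc.sents` is a generator, so it cannot be indexed directly; converting it to a list is one way around that. The sketch below swaps in the rule-based `sentencizer` on a blank English pipeline (an assumption made so the example runs without a model), whereas the cells above take their sentence boundaries from the loaded `en_core_web_sm` pipeline.

```python
import spacy

# Assumption: blank pipeline plus the rule-based sentencizer, which splits on punctuation
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer")
doc_blank = nlp_blank(
    "This is the first sentence. This is the second sentence. This is the last sentence"
)

# Materialise the generator so individual sentences can be indexed
sentences = list(doc_blank.sents)
print(len(sentences))
print(sentences[1].text)

# Indices of the tokens that start a sentence
starts = [token.i for token in doc_blank if token.is_sent_start]
print(starts)
```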