# NLP Basics with spaCy
This notebook covers the basics of NLP with spaCy.

## Create an `nlp` object for a specific language

In [1]:
import spacy

# load the small English pipeline
nlp = spacy.load('en_core_web_sm')

## Create a simple document
Calling `nlp` on a string runs the text through the loaded pipeline and returns a parsed document.

In [2]:
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

From there we can look at the individual tokens of the document.

In [14]:
col1 = "Token"
col2 = "POS" # Part of Speech
col3 = "S-dep" # Syntactic dependency

print(f"{col1:20}{col2:20}{col3:20}")
for token in doc:
    print(f"{token.text:20}{token.pos_:20}{token.dep_}")

Token               POS                 S-dep               
Tesla               PROPN               nsubj
is                  VERB                aux
looking             VERB                ROOT
at                  ADP                 prep
buying              VERB                pcomp
U.S.                PROPN               compound
startup             NOUN                dobj
for                 ADP                 prep
$                   SYM                 quantmod
6                   NUM                 compound
million             NUM                 pobj
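
The abbreviations in the POS and S-dep columns can be looked up with `spacy.explain`, which returns a short description for a label. The snippet below is a minimal sketch; the list of labels is only an illustrative selection taken from the output above.

```python
import spacy

# spacy.explain maps a tag or dependency label to a short description.
# The labels below are an illustrative selection from the table above.
for label in ["PROPN", "ADP", "nsubj", "pobj"]:
    print(f"{label:10}{spacy.explain(label)}")
```

This needs no trained pipeline, because the descriptions come from a glossary shipped with the library.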


## spaCy pipeline object
The core unit of spaCy is the pipeline: a sequence of processing components that takes the original text and applies the required NLP transformations, one after another.

In [15]:
nlp.pipeline

[('tagger', <spacy.pipeline.Tagger at 0x7f0ca22bbd50>),
 ('parser', <spacy.pipeline.DependencyParser at 0x7f0ca1e4b770>),
 ('ner', <spacy.pipeline.EntityRecognizer at 0x7f0ca1e4bd10>)]

As the output shows, the loaded pipeline has three components: a tagger, a dependency parser, and a named-entity recognizer.

A parsed document is iterable, and its tokens can be accessed by index.

In [17]:
n = 0
print(f"The {n}th token in the document is: {doc[n]}")

The 0th token in the document is: Tesla
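
Indexing a document follows the usual Python sequence conventions. The sketch below assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only), which is enough here because indexing does not depend on the tagger or parser; the names `blank_nlp` and `sample` are chosen for this example only.

```python
import spacy

# Assumption: a tokenizer-only pipeline is sufficient to demonstrate indexing.
blank_nlp = spacy.blank("en")
sample = blank_nlp("Tesla is looking at buying U.S. startup for $6 million")

print(len(sample))      # number of tokens in the document
print(sample[0].text)   # first token
print(sample[-1].text)  # negative indices count from the end
print(sample[0].i)      # each token knows its own position in the document
```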


## Explore the different parsed elements

In [21]:
from spacy.tokens.doc import Doc
import pandas as pd

def get_doc_elements(doc: Doc) -> pd.DataFrame:
    """Collect per-token attributes of a parsed document into a DataFrame."""
    elements = ["text", "lemma", "pos", "tag", "shape", "alpha", "stop"]
    rows = [
        [token.text, token.lemma_, token.pos_, token.tag_, token.shape_, token.is_alpha, token.is_stop]
        for token in doc
    ]
    return pd.DataFrame(rows, columns=elements)

In [22]:
doc_elements = get_doc_elements(doc)
doc_elements

| |text|lemma|pos|tag|shape|alpha|stop|
|------:|:------|:------|:------|:------|:------|:------|:------|
|0|Tesla|tesla|PROPN|NNP|Xxxxx|True|False|
|1|is|be|VERB|VBZ|xx|True|True|
|2|looking|look|VERB|VBG|xxxx|True|False|
|3|at|at|ADP|IN|xx|True|True|
|4|buying|buy|VERB|VBG|xxxx|True|False|
|5|U.S.|u.s.|PROPN|NNP|X.X.|False|False|
|6|startup|startup|NOUN|NN|xxxx|True|False|
|7|for|for|ADP|IN|xxx|True|True|
|8|$|$|SYM|$|$|False|False|
|9|6|6|NUM|CD|d|False|False|


Where:

|Attribute|Description|Value for `doc[0]`|
|:------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape: capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist only of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|

## Span objects
A span can be thought of as a slice of a document, running from one token index to another. This makes it possible to process a chunk of text instead of the whole document.

In [25]:
# Definition of NLP according to Wikipedia 
doc = nlp(u"Natural language processing (NLP) is a subfield of computer science, \
information engineering, and artificial intelligence concerned with the \
interactions between computers and human (natural) languages, in particular \
how to program computers to process and analyze large amounts of natural language data. \
Challenges in natural language processing frequently involve speech recognition, natural \
language understanding, and natural language generation.")

quote = doc[10:30]
quote

computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural)

Note that the slice indexes tokens, not individual characters.
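
A span keeps both its token offsets and its character offsets, which makes the difference between the two kinds of index easy to inspect. The sketch below assumes a blank English pipeline (tokenizer only) and a short made-up sentence; `span_nlp` and `text` are names used for this example only.

```python
import spacy

# Assumption: tokenization alone is enough to illustrate span offsets.
span_nlp = spacy.blank("en")
text = span_nlp("Natural language processing is fun.")

span = text[0:3]                       # tokens 0, 1 and 2
print(span.text)                       # text covered by the span
print(span.start, span.end)            # token offsets within the document
print(span.start_char, span.end_char)  # character offsets within the raw text

# Slicing the raw string with the character offsets gives the same text.
print(text.text[span.start_char:span.end_char] == span.text)
```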

## Work with sentences
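We can iterate over a document's sentences through `doc.sents`. With this pipeline, sentence boundaries are set by the dependency parser rather than by simply splitting on period "." characters.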

In [26]:
doc = nlp("This is the first sentence. This is the second sentence. And this is the last sentence.")
for sent in doc.sents:
    print(sent)

This is the first sentence.
This is the second sentence.
And this is the last sentence.


**Note:** Each period is a token of its own, so the second "This" in the document above is at index `6`, not index `5`; index `5` is the preceding period.

In [31]:
print(f"Token 5: {doc[5]}")
print(f"Token 6: {doc[6]}")
print(f"Is token 6 a sentence start? {doc[6].is_sent_start}")

Token 5: .
Token 6: This
Is token 6 a sentence start? True
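
When a purely punctuation-based split is all that is needed, spaCy also provides a rule-based `sentencizer` component that can be used without a trained model. The sketch below assumes spaCy v3, where built-in components are added by name (older versions use a different `add_pipe` call); `rule_nlp`, `rule_doc`, and `sentences` are names used for this example only.

```python
import spacy

# Assumption: spaCy v3 syntax for adding a built-in component by name.
rule_nlp = spacy.blank("en")
rule_nlp.add_pipe("sentencizer")

rule_doc = rule_nlp(
    "This is the first sentence. This is the second sentence. "
    "And this is the last sentence."
)

# Each sentence is a Span; .text gives it without trailing whitespace.
sentences = [sent.text for sent in rule_doc.sents]
print(sentences)
print(rule_doc[6].is_sent_start)  # token 6 is the second "This"
```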
