# spaCy
spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. The library is published under the MIT license.

It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

## How is it different from NLTK?
NLTK was built with learning in mind. It is a great toolkit for teaching, learning, and experimenting with NLP. spaCy, by contrast, was built with production-readiness in mind and focuses more on efficiency and performance.

<img src = "https://www.researchgate.net/profile/Aneek-Barman-Roy/publication/336147300/figure/fig4/AS:809035569852417@1569900519208/A-Comparison-of-NLTK-and-SpaCy-Frameworks-Bobriakov-2018.ppm">

# Pipelines
Some of spaCy's features work independently, while others require trained pipelines.

When nlp is called on a text, spaCy first tokenizes the text to produce a Doc. The Doc is then sent through the processing pipeline, which typically consists of a tagger, a parser, and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.
<img src = "https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg">
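The flow described above can be sketched without downloading a trained pipeline. This example assumes spaCy v3 and uses a blank English pipeline plus the built-in rule-based `sentencizer` component as a stand-in for the trained tagger, parser, and entity recognizer.

```python
import spacy

# A blank English pipeline contains only the tokenizer, so no download is needed.
nlp = spacy.blank("en")
print(nlp.pipe_names)  # no components have been added yet

# Add the built-in rule-based sentence splitter as a pipeline component.
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)

# Calling nlp tokenizes the text, then runs each component in order on the Doc.
doc = nlp("This is a sentence. This is another one.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

A trained pipeline such as en_core_web_sm would list more components in `nlp.pipe_names`, but the calling pattern stays the same.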

# Architecture
The central data structures in spaCy are the Language class, the Vocab and the Doc object. The Language class is used to process a text and turn it into a Doc object. It’s typically stored as a variable called nlp. The Doc object owns the sequence of tokens and all their annotations.
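As a rough illustration of how these three objects relate, the sketch below uses a blank English pipeline (an assumption made so that no trained model is required) and checks that the Doc shares the vocabulary of the Language object that created it.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc
from spacy.vocab import Vocab

nlp = spacy.blank("en")        # a Language object, conventionally named nlp
doc = nlp("I love coffee")     # processing text returns a Doc

print(isinstance(nlp, Language))
print(isinstance(doc, Doc))
print(isinstance(nlp.vocab, Vocab))

# The Doc does not copy the vocabulary; it shares the pipeline's Vocab.
print(doc.vocab is nlp.vocab)

# Strings are stored once in the vocab's StringStore and looked up by hash.
coffee_hash = nlp.vocab.strings["coffee"]
coffee_text = nlp.vocab.strings[coffee_hash]
print(coffee_hash, coffee_text)
```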

In [1]:
import spacy
spacy.__version__

'3.1.3'

******************************************************************************

## nlp
At the center of spaCy is the object containing the processing pipeline. By convention, this variable is called nlp.

In [2]:
# Import the English language class
from spacy.lang.en import English
# Create a blank English nlp object
nlp = English()

- The nlp object can now be called like a function to analyse text.
- It contains all the components of the pipeline.
- It includes language-specific rules for tokenization (splitting text into words and punctuation).
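To see the language-specific tokenization rules at work, here is a small sketch. The sample sentence is an assumption chosen because it contains a contraction and an abbreviation, which plain whitespace splitting would not handle.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Let's go to N.Y.!")

# Splitting on whitespace alone would give only four pieces.
whitespace_pieces = doc.text.split()
print(whitespace_pieces)

# The English rules separate the contraction and the trailing punctuation.
tokens = [token.text for token in doc]
print(tokens)
```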

# Documents, tokens, spans
Processing text with the nlp object returns a Doc object that holds all information about the tokens, their linguistic features, and their relationships. A Doc behaves like a normal Python sequence: it has a length, can be iterated over, and can be indexed.
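A minimal sketch of the sequence behaviour, assuming the blank English pipeline used elsewhere in these notes:

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("This is a text!")

n_tokens = len(doc)                     # number of tokens in the Doc
first, last = doc[0], doc[-1]           # positive and negative indexing
texts = [token.text for token in doc]   # iteration yields Token objects

print(n_tokens, first.text, last.text)
print(texts)
```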

# Tokens
Token objects represent the tokens in a document, for example a word or a punctuation character.
<img src = "https://hashouttech.com/static/3c5973a520b86c3660a9771453df5794/2bef9/span-object.png">

In [3]:
doc = nlp("This is a text!")
# Token texts
print([token.text for token in doc])
# Index into the Doc to get a single token
token_1 = doc[3]
token_1.text

['This', 'is', 'a', 'text', '!']


'text'

# Spans
A Span object is a slice of the document consisting of one or more tokens. It is only a view of the Doc and does not contain any data itself. The end index of a span is exclusive, so doc[2:4] is a span starting at token 2, up to but not including token 4.

In [4]:
span = doc[2:4]
span.text

'a text'
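The "view" nature of a Span can be checked with a short sketch: the span records where it starts and ends and keeps a reference to its parent Doc, and its tokens keep their original document indices.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("This is a text!")
span = doc[2:4]

print(span.start, span.end)   # token offsets into the parent Doc
print(len(span))              # number of tokens covered by the span
print(span.doc is doc)        # the span points back at the same Doc
print(span[0].i)              # tokens keep their index in the Doc
```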

## Creating span manually

In [5]:
# Import the Span object
from spacy.tokens import Span
# Create a Doc object
doc = nlp("I live in New York")
# Span for "New York" with label GPE (geopolitical)
span = Span(doc, 3, 5, label="GPE")
print ("span.text : ",span.text)
print ("span label : ", span.label_)

span.text :  New York
span label :  GPE
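A manually created span is not attached to the Doc by itself. One possible follow-up, sketched here as an assumption about how the span would be used next, is to assign it to `doc.ents` so that it shows up as a named entity.

```python
from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()
doc = nlp("I live in New York")

# Span for "New York" with label GPE (geopolitical entity)
span = Span(doc, 3, 5, label="GPE")

# Overwrite the document's entities with the manually created span.
doc.ents = [span]
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```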


## Lexical Attributes

In [6]:
doc = nlp("Does this toy costs $10 ?")
print ("Index:  ",[token.i for token in doc])
print ("Text:  ",[token.text for token in doc])
print ("is_alpha:  ",[token.is_alpha for token in doc])
print ("is_punct:  ",[token.is_punct for token in doc])
print ("like_num:  ",[token.like_num for token in doc])

Index:   [0, 1, 2, 3, 4, 5, 6]
Text:   ['Does', 'this', 'toy', 'costs', '$', '10', '?']
is_alpha:   [True, True, True, True, False, False, False]
is_punct:   [False, False, False, False, False, False, True]
like_num:   [False, False, False, False, False, True, False]
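Lexical attributes need no trained model, so they are handy for simple rule-based matching. The sketch below is an illustrative use of the same sentence: it looks for number-like tokens that directly follow a dollar sign.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Does this toy costs $10 ?")

prices = []
for token in doc:
    # A number-like token that is not the first token in the Doc
    if token.like_num and token.i > 0:
        previous = doc[token.i - 1]
        # Keep it only if the preceding token is a dollar sign
        if previous.text == "$":
            prices.append(previous.text + token.text)

print(prices)
```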


# Getting Started with spaCy
There are two main components to spaCy:
- spaCy’s Statistical Models
- spaCy’s Processing Pipeline

# Statistical Models
Statistical models enable spaCy to predict linguistic attributes in context, for example whether a word is a verb or whether a span of tokens is a place name. These attributes include:

- Part-of-speech tags
- Syntactic dependencies
- Named entities

The models are trained on large annotated texts and can be updated with more examples to fine-tune their predictions.

# Model Packages
spaCy provides many trained model packages that can be downloaded.

## Naming convention of models
All pipeline packages follow the naming convention [lang]_[name]. For spaCy's own pipelines, the name is further divided into three components, giving [lang]_[type]_[genre]_[size]:

- Type: Capabilities (e.g. core for a general-purpose pipeline with tagging, parsing, lemmatization and named entity recognition, or dep for only tagging, parsing and lemmatization).
- Genre: Type of text the pipeline is trained on, e.g. web or news.
- Size: Package size indicator, sm, md, lg or trf (sm: no word vectors; md: reduced word vector table with 20k unique vectors for ~500k words; lg: large word vector table with ~500k entries; trf: transformer-based pipeline).
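The naming scheme above can be unpacked mechanically. `split_package_name` below is a hypothetical helper written for illustration only; it is not part of spaCy and assumes the name has exactly four underscore-separated parts.

```python
def split_package_name(name):
    """Split a name such as 'en_core_web_sm' into its four parts.

    Hypothetical helper, not a spaCy function.
    """
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected lang_type_genre_size, got {name!r}")
    lang, type_, genre, size = parts
    return {"lang": lang, "type": type_, "genre": genre, "size": size}


info = split_package_name("en_core_web_sm")
print(info)
```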

"en_core_web_sm" is a small English model, trained on web text, that supports all core capabilities. The package provides the binary weights that enable spaCy to make predictions. It also includes the vocabulary and the meta information that tells spaCy which language class to use.

In [7]:
!python -m spacy validate

✔ Loaded compatibility table

ℹ spaCy installation: /opt/conda/lib/python3.7/site-packages/spacy

NAME             SPACY            VERSION
en_core_web_lg   >=3.1.0,<3.2.0   3.1.0   ✔
en_core_web_sm   >=3.1.0,<3.2.0   3.1.0   ✔



In [8]:
# Load the installed model "en_core_web_sm"
nlp = spacy.load("en_core_web_sm")
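`spacy.load` raises an `OSError` when the requested package is not installed. A defensive sketch is shown below; `load_or_blank` is a hypothetical helper (not part of spaCy) that falls back to a blank English pipeline. Note that the blank pipeline only tokenizes, so attributes such as `pos_` and `dep_` would stay empty in that case.

```python
import spacy


def load_or_blank(name="en_core_web_sm"):
    """Load a trained pipeline, or fall back to a blank English one.

    Hypothetical helper for illustration, not a spaCy function.
    """
    try:
        return spacy.load(name)
    except OSError:
        # The package is not installed: return a tokenizer-only pipeline.
        return spacy.blank("en")


nlp = load_or_blank("en_core_web_sm")
print(nlp.lang, nlp.pipe_names)
```

In a real project it is usually better to let the error surface and install the missing package with `python -m spacy download en_core_web_sm`.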

# Part-of-speech Tags
Along with part-of-speech tags, spaCy can predict how words are related, for example whether a word is the subject or the object of the sentence.

In [9]:
doc = nlp("The kid is going to school.")
for token in doc:
    # dep_ is the dependency label; head is the token's syntactic head.
    # The head is the most important node in a phrase, while the ROOT is the
    # most important node in the whole sentence: it is directly or indirectly
    # the head of every other node.
    print('token.text: {0}; token.pos_: {1}; token.dep_: {2}; token.head.text : {3}'.format(
        token.text, token.pos_, token.dep_, token.head.text))

token.text: The; token.pos_: DET; token.dep_: det; token.head.text : kid
token.text: kid; token.pos_: NOUN; token.dep_: nsubj; token.head.text : going
token.text: is; token.pos_: AUX; token.dep_: aux; token.head.text : going
token.text: going; token.pos_: VERB; token.dep_: ROOT; token.head.text : going
token.text: to; token.pos_: ADP; token.dep_: prep; token.head.text : going
token.text: school; token.pos_: NOUN; token.dep_: pobj; token.head.text : to
token.text: .; token.pos_: PUNCT; token.dep_: punct; token.head.text : going
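The head and dependency annotations can be used to navigate the parse tree. So that the sketch runs without a trained model, it rebuilds the Doc by hand from the output above (spaCy v3 accepts absolute head indices in the `Doc` constructor). With a loaded pipeline, the same attributes would be available directly on `nlp(text)`.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Annotations copied from the parser output shown above
words = ["The", "kid", "is", "going", "to", "school", "."]
heads = [1, 3, 3, 3, 3, 4, 3]   # absolute index of each token's head
deps = ["det", "nsubj", "aux", "ROOT", "prep", "pobj", "punct"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

# The root is the token labelled ROOT (it is its own head).
root = [token for token in doc if token.dep_ == "ROOT"][0]

# Direct children of the root that act as its subject
subjects = [child.text for child in root.children if child.dep_ == "nsubj"]

# The subtree of "to" covers the whole prepositional phrase.
phrase = " ".join(token.text for token in doc[4].subtree)

print(root.text, subjects, phrase)
```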
