## What is spaCy?
spaCy is a free, open-source Python library for advanced natural language processing (NLP).

If you're working with a lot of text, you'll eventually want to know more about it. For example: What is it about? What do the words mean in context? Who is doing what to whom? Which companies and products are mentioned? Which texts are similar to each other?

spaCy is designed specifically for production use and helps you build applications that process and "understand" large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

## What isn't spaCy?
First, spaCy isn't a platform or an "API". 
Second, spaCy is not an out-of-the-box chat bot engine. 
Third, spaCy is not research software. 


## Statistical models
Some of spaCy's features work independently, while others require statistical models to be loaded, which enable spaCy to predict linguistic annotations, for example whether a word is a verb or a noun. spaCy currently offers statistical models for a variety of languages, which can be installed as individual Python modules. Models differ in size, speed, memory usage, accuracy, and the data they include, so the model you choose depends on your use case and the texts you're working with. For a general-purpose use case, the small, default models are a good place to start. They typically include the following components:

- Binary weights for the part-of-speech tagger, dependency parser, and named entity recognizer to predict those annotations in context.
- Lexical entries in the vocabulary, i.e. words and their context-independent attributes like the shape or spelling.
- Data files like lemmatization rules and lookup tables.
- Word vectors, i.e. multi-dimensional meaning representations of words that let you determine how similar they are to each other.
- Configuration options, like the language and processing pipeline settings, to put spaCy in the correct state when you load in the model.
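
To see which of these components a loaded pipeline actually contains, you can inspect the `nlp` object. The sketch below is only an illustration: the helper name `load_or_blank` is made up for this example, and the fallback to a blank English pipeline is an assumption added so that the snippet also runs when `en_core_web_sm` is not installed.

```python
import spacy


def load_or_blank(name="en_core_web_sm"):
    """Load a trained pipeline, or fall back to a blank English one."""
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the package is not installed.
        return spacy.blank("en")


nlp = load_or_blank()
print("language:", nlp.lang)
print("components:", nlp.pipe_names)
print("vector table shape:", nlp.vocab.vectors.shape)
```

With a trained pipeline, `pipe_names` lists components such as the tagger, parser, and entity recognizer; with the blank fallback the list is empty because only the tokenizer is present.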

## Linguistic annotations
spaCy provides a variety of linguistic annotations to give you insights into a text's grammatical structure. This includes the word types, such as the parts of speech, and how the words are related to each other.

In [2]:
## Install spaCy with pip

In [3]:
pip install spacy

Note: you may need to restart the kernel to use updated packages.


In [5]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
[+] Download and installation successful
You can now load the package via spacy.load('en_c

In [6]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)
# Text: The original word text.
# Lemma: The base form of the word.
# POS: The simple part-of-speech tag.
# Tag: The detailed part-of-speech tag.
# Dep: Syntactic dependency, i.e. the relation between tokens.
# Shape: The word shape - capitalization, punctuation, digits.
# is_alpha: Does the token consist of alphabetic characters?
# is_stop: Is the token part of a stop list, i.e. the most common words of the language?

Apple PROPN nsubj
is AUX aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.K. PROPN nsubj
startup VERB ccomp
for ADP prep
$ SYM quantmod
1 NUM compound
billion NUM pobj


## Tokenization
During processing, spaCy first tokenizes the text, i.e., segments it into words, punctuation, and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off—whereas "U.K." should remain one token. Each Doc consists of individual tokens, and we can iterate over them:

In [7]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion
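
Tokenization is rule-based, so it can be tried without a trained model. The following sketch assumes a blank English pipeline (`spacy.blank("en")`), which contains only the tokenizer, and adds a final period to the example sentence to show sentence-final punctuation being split off while "U.K." stays intact.

```python
import spacy

# A blank pipeline contains only the rule-based tokenizer, so no trained
# model needs to be installed for this step.
nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

tokens = [token.text for token in doc]
print(tokens)

# token.i is the token's index in the Doc, token.idx is its character
# offset in the original text, and is_punct flags punctuation tokens.
for token in doc:
    print(token.i, token.idx, repr(token.text), token.is_punct)
```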


## Part-of-speech (POS) tags and dependencies
After tokenization, spaCy can parse and tag a given Doc. This is where the statistical model comes in, which enables spaCy to make a prediction of which tag or label most likely applies in this context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalize across the language – for example, a word following "the" in English is most likely a noun.
Linguistic annotations are available as Token attributes. Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute, we need to append an underscore _ to its name:

In [8]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India")
for token in doc:
    print(token.text, token.pos_, token.dep_, token.lemma_, token.shape_, token.tag_, token.is_alpha, token.is_stop)

Coronavirus PROPN nsubj Coronavirus Xxxxx NNP True False
: PUNCT punct : : : False False
Delhi PROPN compound Delhi Xxxxx NNP True False
resident NOUN nsubj resident xxxx NN True False
tests VERB appos test xxxx VBZ True False
positive ADJ amod positive xxxx JJ True False
for ADP prep for xxx IN True True
coronavirus NOUN pobj coronavirus xxxx NN True False
, PUNCT punct , , , False False
total ADJ ROOT total xxxx JJ True False
31 NUM nummod 31 dd CD False False
people NOUN dobj people xxxx NNS True False
infected VERB acl infect xxxx VBN True False
in ADP prep in xx IN True True
India PROPN pobj India Xxxxx NNP True False
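
The difference between the hashed attribute and its underscore variant can be seen directly. This is a minimal sketch that assumes a blank English pipeline and uses the `orth` / `orth_` pair (the verbatim token text); the same pattern applies to pairs such as `pos` / `pos_` once a trained model has set them.

```python
import spacy

nlp = spacy.blank("en")  # the tokenizer alone is enough for this demonstration
doc = nlp("I love coffee")
token = doc[2]

# Without the underscore, the attribute is an integer hash (ID).
print(token.orth, type(token.orth))
# With the underscore, it is the readable string.
print(token.orth_)

# The StringStore in the vocabulary maps in both directions.
coffee_hash = nlp.vocab.strings["coffee"]
print(coffee_hash == token.orth)
print(nlp.vocab.strings[coffee_hash])
```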


In [9]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Google, Apple crack down on fake coronavirus apps")
displacy.render(doc, style="dep")

## Named Entities
A named entity is a "real-world object" that has been assigned a name, such as a person, a country, a product, or a book title. spaCy can recognize various types of named entities in a document by asking the model for a prediction. Because models are statistical and depend heavily on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case. Named entities are available as the ents property of a Doc:

In [10]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

31 66 68 CARDINAL
India 88 93 GPE
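
To make the meaning of `start_char`, `end_char`, and `label_` concrete, the sketch below builds an entity by hand instead of asking a model to predict it. It assumes a blank English pipeline and a shortened version of the sentence above; the span is set manually with `Doc.set_ents`, so nothing here reflects a model prediction.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # no trained NER component, so doc.ents starts empty
text = "total 31 people infected in India"
doc = nlp(text)

# Mark the last token as a geopolitical entity by hand. Token indices are
# used for the span boundaries; the end index is exclusive.
india = Span(doc, len(doc) - 1, len(doc), label="GPE")
doc.set_ents([india])

for ent in doc.ents:
    # start_char and end_char are character offsets into the original text,
    # so slicing the text with them gives back the entity text.
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
    print(text[ent.start_char:ent.end_char] == ent.text)
```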


In [11]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
text = "Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India"
doc = nlp(text)
displacy.render(doc, style="ent")
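
Outside a notebook there is no cell output to draw into, so it can be useful to get the visualization as an HTML string. The sketch below assumes a blank English pipeline with one entity set by hand (so no trained model is needed) and passes `jupyter=False` so that `displacy.render` returns the markup instead of displaying it.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("total 31 people infected in India")
# Entity set by hand for the demo; a trained model would predict it instead.
doc.set_ents([Span(doc, 5, 6, label="GPE")])

# jupyter=False makes render() return the HTML markup as a string.
html = displacy.render(doc, style="ent", jupyter=False)
print(html[:200])
```

The returned string can then be written to a file or embedded in a web page.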

## Word vectors and similarity
Similarity is determined by comparing word vectors or "word embeddings", which are multi-dimensional meaning representations of a word. Word vectors are arrays of floating-point numbers and can be generated with an algorithm like word2vec.
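
As a rough illustration of how two vectors are compared, the sketch below computes cosine similarity in plain Python. The 4-dimensional vectors are invented for this example (real word vectors typically have hundreds of dimensions), and the helper name `cosine_similarity` is only illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 4-dimensional vectors, invented for illustration only.
vectors = {
    "lion": [0.9, 0.8, 0.1, 0.0],
    "bear": [0.8, 0.9, 0.2, 0.1],
    "apple": [0.1, 0.0, 0.9, 0.8],
}

for word in ("bear", "apple"):
    score = cosine_similarity(vectors["lion"], vectors[word])
    print("lion", word, round(score, 3))
```

Vectors that point in a similar direction score close to 1, while unrelated ones score closer to 0.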

Important note: To keep them small and fast, spaCy's small models (all packages that end in sm) don't ship with word vectors and only include context-sensitive tensors. You can still use similarity() to compare documents, spans, and tokens, but the results won't be as good, and individual tokens won't have any vectors assigned.

In [12]:
import spacy.cli
spacy.cli.download("en_core_web_sm")
import en_core_web_sm
nlp = en_core_web_sm.load()

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [13]:
import spacy
nlp = spacy.load("en_core_web_sm")
tokens = nlp("lion bear apple banana fadsfdshds")
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

lion True 6.254689 True
bear True 6.0756683 True
apple True 6.385152 True
banana True 6.4799643 True
fadsfdshds True 7.043507 True


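The description that follows applies to a pipeline that ships with word vectors, such as en_core_web_md. The words "lion", "bear", "apple" and "banana" are all common in English, so they are part of such a pipeline's vocabulary and come with a vector. The word "fadsfdshds", on the other hand, is far less common and out-of-vocabulary, so its vector representation consists of 300 dimensions of 0, which means it is practically nonexistent. The small model loaded above has no vector table, which is why is_oov is True for every token in the output and has_vector falls back to the context-sensitive tensors. If your application will benefit from a larger vocabulary with more vectors, consider using one of the larger models or loading a full vector package such as en_vectors_web_lg (distributed for spaCy v2), which includes over a million unique vectors.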
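
The in-vocabulary versus out-of-vocabulary behaviour can also be reproduced without downloading a larger model. This is a sketch under stated assumptions: it uses a blank English pipeline and attaches three hand-made 3-dimensional vectors with `Vocab.set_vector`. The numbers are invented for illustration and carry no real meaning.

```python
import numpy
import spacy

nlp = spacy.blank("en")

# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "lion": [0.9, 0.8, 0.1],
    "bear": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}
for word, values in toy_vectors.items():
    nlp.vocab.set_vector(word, numpy.asarray(values, dtype="float32"))

doc = nlp("lion bear apple fadsfdshds")
for token in doc:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

lion, bear, apple, unknown = doc
print("lion ~ bear:", lion.similarity(bear))
print("lion ~ apple:", lion.similarity(apple))
```

Words that received a vector report `has_vector` as True and `is_oov` as False, while the made-up word has no vector and is flagged as out-of-vocabulary.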

Each Doc, Span, and Token comes with a similarity() method that lets you compare it with another object and determine how similar they are. Of course, similarity is always subjective: whether "dog" and "cat" are similar really depends on how you look at it. spaCy's similarity model usually assumes a fairly general-purpose definition of similarity.

In [14]:
import spacy
nlp = spacy.load("en_core_web_sm")
tokens = nlp("lion bear cow apple mango spinach")
for token11 in tokens:
    for token13 in tokens:
        print(token11.text, token13.text, token11.similarity(token13))

lion lion 1.0
lion bear 0.6604732871055603
lion cow 0.6303659081459045
lion apple 0.5379678010940552
lion mango 0.6732673645019531
lion spinach 0.4564811885356903
bear lion 0.6604732871055603
bear bear 1.0
bear cow 0.6623415946960449
bear apple 0.5358937978744507
bear mango 0.5956253409385681
bear spinach 0.48093488812446594
cow lion 0.6303659081459045
cow bear 0.6623415946960449
cow cow 1.0
cow apple 0.6099163889884949
cow mango 0.5889562368392944
cow spinach 0.4700290262699127
apple lion 0.5379678010940552
apple bear 0.5358937978744507
apple cow 0.6099163889884949
apple apple 1.0
apple mango 0.6297526955604553
apple spinach 0.38630571961402893
mango lion 0.6732673645019531
mango bear 0.5956253409385681
mango cow 0.5889562368392944
mango apple 0.6297526955604553
mango mango 1.0
mango spinach 0.46731528639793396
spinach lion 0.4564811885356903
spinach bear 0.48093488812446594
spinach cow 0.4700290262699127
spinach apple 0.38630571961402893
spinach mango 0.46731528639793396
spinach spin



## Pipelines

A pipeline is a sequence of pipes, or actors on data, that alter the data or extract information from it. In some cases, later pipes require the output of earlier pipes.

spaCy is a free, open-source Python library for natural language processing (NLP), written in Python and Cython. Developers use it to build information extraction and natural language understanding systems. It is designed for production use and offers a concise, user-friendly API.
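
The idea that a later pipe can depend on an earlier one can be sketched with a small custom component. The component name `sentence_counter` and the custom attribute `n_sents` are made up for this example; the built-in `sentencizer` is used as the earlier pipe so that no trained model is required.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Custom attribute to hold the result; force=True allows re-running the cell.
Doc.set_extension("n_sents", default=None, force=True)


@Language.component("sentence_counter")  # hypothetical component name
def sentence_counter(doc):
    """Store the number of sentences found by an earlier pipe."""
    doc._.n_sents = len(list(doc.sents))
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")       # built-in rule-based sentence splitter
nlp.add_pipe("sentence_counter")  # relies on the sentence boundaries set above

doc = nlp("This is one sentence. This is another one.")
print(nlp.pipe_names)
print(doc._.n_sents)
```

If the two `add_pipe` calls were swapped, the counter would run before any sentence boundaries had been set, which is the kind of ordering dependency described above.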

- Discover the fundamentals of spaCy, such as tokenization, part-of-speech tagging, and named entity recognition.
- Understand spaCy's text processing architecture, which is fast and efficient enough for large-scale NLP jobs.
- Explore NLP pipelines in spaCy and create custom pipelines for specific tasks.
- Explore spaCy's advanced capabilities, including rule-based matching, syntactic parsing, and entity linking.
- Learn about the pre-trained language models available in spaCy and how to use them for various NLP applications.
- Learn named entity recognition (NER) strategies for identifying and categorizing entities in text using spaCy.
