# What's spaCy?
spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

If you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? What products and companies are mentioned in the text? Which texts are similar to each other?

spaCy is designed specifically for production use and helps you build applications that process and "understand" large volumes of text. It can be used to build information extraction or natural language processing systems, or to pre-process text for deep learning.

# What spaCy isn't

First, spaCy isn't a platform or an "API". Unlike a platform, spaCy doesn't provide software as a service or a web application. It's an open-source library designed to help you build NLP applications, not a consumable service.

Second, spaCy is not an out-of-the-box chatbot engine. While spaCy can be used to power conversational applications, it's not designed specifically for chatbots and only provides the underlying text processing capabilities.

Third, spaCy is not research software. It's built on the latest research, but it's designed to get things done. This leads to fairly different design decisions than NLTK or CoreNLP, which were created as platforms for teaching and research. The main difference is that spaCy is integrated and opinionated. spaCy tries to avoid asking the user to choose between multiple algorithms that deliver equivalent functionality. Keeping the menu small lets spaCy deliver generally better performance and developer experience.

Fourth, spaCy is not a company. It's an open-source library. The company publishing spaCy and other software is called Explosion AI.

# Installation

spaCy is compatible with 64-bit CPython 2.7/3.5+ and runs on Unix/Linux, macOS/OS X and Windows. The latest version of spaCy is available over pip and conda. Note that these version numbers describe spaCy v2; spaCy v3 and later require Python 3.

-> Installation with pip on Linux, Windows and macOS/OS X, for both Python 2.7 and 3.5+ (the -U flag upgrades an existing installation):

pip install -U spacy

-> Installation with conda on Linux, Windows and macOS/OS X, for both Python 2.7 and 3.5+:

conda install -c conda-forge spacy
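
After installing, it can be handy to confirm from Python that the package is actually visible in the current environment. The following is a minimal sketch using only the standard library (Python 3.8+); the helper name `installed_version` is made up for this example and is not part of spaCy.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# "spacy" is the distribution name used by pip and conda.
version = installed_version("spacy")
if version is None:
    print("spaCy is not installed in this environment")
else:
    print(f"spaCy {version} is installed")
```

If the message says spaCy is missing even though the install succeeded, the notebook kernel is probably running a different Python environment than the one pip installed into.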

# Features
Here, you'll come across mentions of spaCy's features and capabilities.

# Statistical models
Some of spaCy's features work independently, while others require statistical models to be loaded, which enable spaCy to predict linguistic annotations, for example, whether a word is a verb or a noun. spaCy currently offers statistical models for a variety of languages, which can be installed as individual Python modules. Models can differ in size, speed, memory usage, accuracy, and the data they include. The model you choose always depends on your use case and the texts you're working with. For a general use case, the small, default models are always a good start. They typically include the following components:

Binary weights for the part-of-speech tagger, dependency parser and named entity recognizer to predict those annotations in context.

Lexical entries in the vocabulary, i.e. words and their context-independent attributes like the shape or spelling.

Data files like lemmatization rules and lookup tables.

Word vectors, i.e. multi-dimensional meaning representations of words that let you determine how similar they are to each other.

Configuration options, like the language and processing pipeline settings, to put spaCy in the correct state when you load in the model.

# Linguistic annotations

spaCy provides a variety of linguistic annotations to give you insights into a text's grammatical structure. This includes the word types, like the parts of speech, and how the words are related to each other. For example, if you're analyzing text, it makes a huge difference whether a noun is the subject of a sentence or the object, or whether "google" is used as a verb or refers to the website or company in a specific context.

Once you've downloaded and installed a model, you can load it via spacy.load(). This will return a Language object containing all components and data needed to process text. We usually call it nlp. Calling the nlp object on a string of text will return a processed Doc:

In [3]:
!pip install spacy



In [4]:
!python -m spacy download en_core_web_sm



In [5]:
import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

# Test it with a sample text
doc = nlp("Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India.")

# Iterate over tokens in the doc and print their attributes
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)


Coronavirus Coronavirus PROPN NNP nsubj Xxxxx True False
: : PUNCT : punct : False False
Delhi Delhi PROPN NNP compound Xxxxx True False
resident resident NOUN NN nsubj xxxx True False
tests test VERB VBZ appos xxxx True False
positive positive ADJ JJ amod xxxx True False
for for ADP IN prep xxx True True
coronavirus coronavirus NOUN NN pobj xxxx True False
, , PUNCT , punct , False False
total total ADJ JJ ROOT xxxx True False
31 31 NUM CD nummod dd False False
people people NOUN NNS dobj xxxx True False
infected infect VERB VBN acl xxxx True False
in in ADP IN prep xx True True
India India PROPN NNP pobj Xxxxx True False
. . PUNCT . punct . False False




# Tokenization

During processing, spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off, whereas "U.K." should remain one token. Each Doc consists of individual tokens, and we can iterate over them:

In [8]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


First, the raw text is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

1. Does the substring match a tokenizer exception rule? For example, "don't" does not contain whitespace, but should be split into two tokens, "do" and "n't", while "U.K." should always remain one token.
2. Can a prefix, suffix or infix be split off? For example, punctuation like commas, periods, hyphens or quotes.

If there's a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split complex, nested tokens like combinations of abbreviations and multiple punctuation marks.
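
The loop described above can be sketched in a few lines of plain Python. This is a heavily simplified illustration, not spaCy's actual tokenizer: the rule tables `EXCEPTIONS`, `PREFIXES` and `SUFFIXES` are tiny made-up stand-ins for spaCy's language data, and infix handling is left out entirely.

```python
# Hypothetical, heavily simplified rule tables (not spaCy's real language data).
EXCEPTIONS = {"don't": ["do", "n't"], "U.K.": ["U.K."]}
PREFIXES = ("$", "(", '"')
SUFFIXES = (".", ",", "!", "?", ")", '"')


def tokenize(text):
    tokens = []
    # Step 1: split the raw text on whitespace.
    for substring in text.split():
        suffixes = []  # suffixes are peeled off right to left, so collect them
        # Step 2: keep checking the remaining substring until it is consumed.
        while substring:
            if substring in EXCEPTIONS:
                # Check 1: a tokenizer exception decides the split.
                tokens.extend(EXCEPTIONS[substring])
                substring = ""
            elif substring.startswith(PREFIXES):
                # Check 2a: split off a one-character prefix.
                tokens.append(substring[0])
                substring = substring[1:]
            elif substring.endswith(SUFFIXES):
                # Check 2b: split off a one-character suffix.
                suffixes.insert(0, substring[-1])
                substring = substring[:-1]
            else:
                # Nothing left to split: keep the rest as one token.
                tokens.append(substring)
                substring = ""
        tokens.extend(suffixes)
    return tokens


print(tokenize("We don't think the U.K. startup is worth $1 billion (yet)."))
```

Because the exception check runs before the suffix check on every pass, "U.K." survives intact, while the final "(yet)." is peeled apart one punctuation mark at a time.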

# Part-of-speech (POS) tags and dependencies
After tokenization, spaCy can parse and tag a given Doc. This is where the statistical model comes in, which enables spaCy to predict which tag or label most likely applies in this context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalize across the language. For example, a word following "the" in English is most likely a noun.
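
To make the idea of "learning from examples" concrete, here is a toy sketch that counts which tag follows each word in a tiny hand-labelled corpus and then predicts the most frequent one. The corpus, the function names and the counting approach are all assumptions made for illustration; spaCy's real models are far more sophisticated and take much more context into account.

```python
from collections import Counter, defaultdict

# Tiny hand-labelled corpus of (word, tag) pairs; purely illustrative.
TRAINING = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("old", "ADJ"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("bird", "NOUN"), ("sings", "VERB")],
]


def train(sentences):
    """Count which tag follows each word in the training sentences."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for (prev_word, _), (_, tag) in zip(sentence, sentence[1:]):
            counts[prev_word][tag] += 1
    return counts


def predict_next_tag(counts, prev_word):
    """Return the most frequent tag seen after `prev_word`, or None if unseen."""
    if prev_word not in counts:
        return None
    return counts[prev_word].most_common(1)[0][0]


model = train(TRAINING)
print(predict_next_tag(model, "the"))  # NOUN
```

In this corpus, "the" is followed by a noun twice and by an adjective once, so the toy model predicts NOUN.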

Linguistic annotations are available as Token attributes. Like many NLP libraries, spaCy encodes all strings to hash values to reduce memory usage and improve efficiency. So to get the readable string representation of an attribute, we need to add an underscore _ to its name. Without the underscore, the attribute returns the integer ID instead:

In [13]:
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India")
# No trailing underscore here, so these attributes print as integer IDs
for token in doc:
    print(token.text, token.lemma, token.pos, token.tag, token.dep,
          token.shape, token.is_alpha, token.is_stop)

Coronavirus 14196865665419520621 96 15794550382381185553 429 16072095006890171862 True False
: 11532473245541075862 97 11532473245541075862 445 11532473245541075862 False False
Delhi 7055494911946032454 96 15794550382381185553 7037928807040764755 16072095006890171862 True False
resident 12791935767146688153 92 15308085513773655218 429 13110060611322374290 True False
tests 1618900948208871284 100 13927759927860985106 403 13110060611322374290 True False
positive 8177200212695065927 84 10554686591937588953 402 13110060611322374290 True False
for 16037325823156266367 85 1292078113972184607 443 4088098365541558500 True True
coronavirus 12610307704150367228 92 15308085513773655218 439 13110060611322374290 True False
, 2593208677638477497 97 2593208677638477497 445 2593208677638477497 False False
total 12505506919507411536 84 10554686591937588953 8206900633647566924 13110060611322374290 True False
31 16449342687276647362 93 8427216679587749980 12837356684637874264 4620368362210911820 False Fa
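
The integer IDs printed above come from a two-way mapping between strings and numbers, which spaCy exposes as nlp.vocab.strings. The sketch below mimics that idea with a small class. `ToyStringStore` is a made-up name, and BLAKE2b is used here only as a stand-in for spaCy's own hash function, so the numbers it produces will not match spaCy's.

```python
import hashlib


class ToyStringStore:
    """Maps strings to 64-bit integer IDs and back (illustrative only)."""

    def __init__(self):
        self._by_hash = {}

    def add(self, string):
        # Stand-in hash: an 8-byte BLAKE2b digest read as an unsigned integer.
        digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
        key = int.from_bytes(digest, "big")
        self._by_hash[key] = string
        return key

    def __getitem__(self, key):
        # Look the readable string back up from its integer ID.
        return self._by_hash[key]


store = ToyStringStore()
noun_id = store.add("NOUN")
print(noun_id)         # an integer ID; the exact value depends on the hash
print(store[noun_id])  # NOUN
```

Storing the string once and passing around a fixed-size integer is what keeps per-token annotations compact.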


In [None]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Google, Apple crack down on fake coronavirus apps")
displacy.serve(doc, style="dep")




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...



# Named Entities
A named entity is a "real-world object" that's assigned a name, for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.

Named entities are available as the ents property of a Doc:

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

# Visualizing the Named Entity recognizer
The entity visualizer, ent, highlights named entities and their labels in the text.

In [None]:
import spacy
from spacy import displacy
text = "Coronavirus: Delhi resident tests positive for coronavirus, total 31 people infected in India"
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.serve(doc, style="ent")
# Entity label reference: https://spacy.io/api/annotation#named-entities
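
Conceptually, an entity visualizer takes character offsets like ent.start_char and ent.end_char and wraps the matching slices of text in some markup. The sketch below does this with plain-text markers instead of the HTML that displaCy produces. The `highlight` helper and the entity spans are hypothetical and are not output from a spaCy model; the sketch also assumes the spans don't overlap.

```python
def highlight(text, spans):
    """Wrap each (start_char, end_char, label) span in [text](LABEL) markers."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])               # text before the entity
        pieces.append(f"[{text[start:end]}]({label})")  # the marked-up entity
        cursor = end
    pieces.append(text[cursor:])                        # trailing text
    return "".join(pieces)


text = "Delhi resident tests positive, total 31 people infected in India"

# Hypothetical spans for illustration; not the output of a spaCy model.
spans = []
for phrase, label in [("Delhi", "GPE"), ("31", "CARDINAL"), ("India", "GPE")]:
    start = text.index(phrase)
    spans.append((start, start + len(phrase), label))

print(highlight(text, spans))
```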

# Word vectors and similarity
Similarity is determined by comparing word vectors or "word embeddings", multi-dimensional meaning representations of a word. Word vectors can be generated using an algorithm like word2vec; each one is an array of floating-point numbers.
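
As a rough illustration of how such vectors are compared, the sketch below computes cosine similarity, a common measure for word vectors, over a few made-up 4-dimensional vectors. The numbers in `VECTORS` are invented for this example; real word vectors typically have hundreds of dimensions and are learned from large text corpora.

```python
import math

# Made-up 4-dimensional vectors, chosen so that "dog" and "cat" point in
# similar directions and "banana" does not.
VECTORS = {
    "dog": [0.8, 0.6, 0.1, 0.0],
    "cat": [0.7, 0.7, 0.2, 0.0],
    "banana": [0.0, 0.1, 0.9, 0.4],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity(VECTORS["dog"], VECTORS["cat"]))     # close to 1
print(cosine_similarity(VECTORS["dog"], VECTORS["banana"]))  # much lower
```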

### Important note:
To make them compact and fast, spaCy's small models (all the packages that end in sm) don't ship with word vectors, and only include context-sensitive tensors. This means you can still use the similarity() methods to compare documents, tokens and spans, but the results won't be as good, and individual tokens won't have any vectors assigned. So, in order to use real word vectors, you need to download a larger model:

python -m spacy download en_core_web_md


Models that come with built-in word vectors make them available as the Token.vector attribute. Doc.vector and Span.vector will default to an average of their token vectors. You can also check if a token has a vector assigned, and get the L2 norm, which can be used to normalize vectors.
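
The averaging and the L2 norm mentioned here are simple to spell out. The following sketch uses plain Python lists and two made-up 2-dimensional "token vectors"; the function names are chosen for this example and are not spaCy APIs.

```python
import math


def average(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def l2_norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def normalize(vector):
    """Scale a vector to unit length by dividing by its L2 norm."""
    norm = l2_norm(vector)
    return [x / norm for x in vector]


token_vectors = [[3.0, 0.0], [0.0, 4.0]]  # made-up 2-d "token vectors"
doc_vector = average(token_vectors)

print(doc_vector)           # [1.5, 2.0]
print(l2_norm(doc_vector))  # 2.5
print(normalize(doc_vector))
```

After normalization every vector has length 1, so comparisons between vectors depend only on their direction and not on their magnitude.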

In [None]:
# !python -m spacy download en_core_web_md
import spacy.cli
spacy.cli.download("en_core_web_md")  # download the medium English model
import en_core_web_md
nlp = en_core_web_md.load()  # load the downloaded model package