**What is SpaCy?**

*   spaCy is a free, open-source library for advanced NLP in Python.
*   It's written from the ground up in carefully memory-managed Cython.
*   It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.


**Features-**
*   Non-destructive tokenization
*   Named entity recognition
*   Support for 59+ languages
*   46 statistical models for 16 languages
*   Pretrained word vectors
*   State-of-the-art speed
*   Easy deep learning integration
*   Part-of-speech tagging
*   Built-in visualizers for syntax and NER

In [None]:
import spacy

**Model Naming Convention-**

[lang]\_[name]

[lang]\_[type_genre_size]
* **type:** Model capabilities (e.g. core for general-purpose model with vocabulary, syntax, entities and word vectors, or depent for only vocab, syntax and entities)
* **genre:** Type of text the model is trained on (e.g. web for web text, news for news text)
* **size:** Model size indicator (sm, md or lg)
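
As a rough illustration of the convention, a model name can be split back into its parts. The helper `parse_model_name` below is hypothetical (it is not part of spaCy) and assumes the four-part `[lang]_[type]_[genre]_[size]` form.

```python
def parse_model_name(name):
    """Split a model name such as 'en_core_web_md' into its four parts.

    Hypothetical helper for illustration; not a spaCy function.
    """
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected [lang]_[type]_[genre]_[size], got {name!r}")
    lang, model_type, genre, size = parts
    return {"lang": lang, "type": model_type, "genre": genre, "size": size}


print(parse_model_name("en_core_web_md"))
```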

In [None]:
!pip install --user https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.0/en_core_web_md-2.2.0.tar.gz

*Note-*

Some models are not published as standalone packages on pypi.org or Anaconda, so you can't pip install them by name. However, you can find download links for each model on the [GitHub page](https://github.com/explosion/spacy-models) and pip install directly from one of those URLs.

In [None]:
# Alternative way to download the model: python -m spacy download en_core_web_md
model = "en_core_web_md"
nlp = spacy.load(model)
type(nlp)
#contains the processing pipeline
#includes language-specific rules for tokenization etc.

spacy.lang.en.English

In [None]:
nlp.pipe_names

['tagger', 'parser', 'ner']

![image](https://spacy.io/pipeline-7a14d4edd18f3edfee8f34393bff2992.svg)

In [None]:
text = '''Abraham Lincoln was an American statesman and lawyer who served as the 16th president of the United States.
       Lincoln led the nation through its greatest moral, constitutional, and political crisis in the American Civil War'''

In [None]:
doc = nlp(text)

*   The Doc behaves like a normal Python sequence and lets you iterate over its tokens, or get a token by its index.
*   Even though a Doc is processed, it still holds all information of the original text, like whitespace characters. 
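
A small sketch of both points, assuming only that spaCy is installed. It uses `spacy.blank("en")`, a tokenizer-only pipeline, so no model download is needed; the sample sentence and the names `blank_nlp`, `sample` and `sample_doc` are made up for illustration.

```python
import spacy

# Tokenizer-only English pipeline (assumption: used here to avoid a model download)
blank_nlp = spacy.blank("en")
sample = "Lincoln  led the nation.\n  He was a lawyer."
sample_doc = blank_nlp(sample)

# Sequence behaviour: length, indexing and iteration
print(len(sample_doc), sample_doc[0].text, sample_doc[-1].text)

# Non-destructive tokenization: every token keeps its trailing whitespace,
# so the original string can be rebuilt exactly
rebuilt = "".join(token.text_with_ws for token in sample_doc)
print(rebuilt == sample)
```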

**Tokenization**

In [None]:
tokens = []
for token in doc:
  tokens.append(token.text)
print(tokens)

['Abraham', 'Lincoln', 'was', 'an', 'American', 'statesman', 'and', 'lawyer', 'who', 'served', 'as', 'the', '16th', 'president', 'of', 'the', 'United', 'States', '.', '\n       ', 'Lincoln', 'led', 'the', 'nation', 'through', 'its', 'greatest', 'moral', ',', 'constitutional', ',', 'and', 'political', 'crisis', 'in', 'the', 'American', 'Civil', 'War']


In [None]:
#span
span = doc[4:8]
span.text

'American statesman and lawyer'

**Punctuation and Stop Words**

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS as stopwords
print(len(stopwords))
pp_text = [token.text for token in doc if not (token.is_punct or token.is_stop)]
pp_text

326


['Abraham',
 'Lincoln',
 'American',
 'statesman',
 'lawyer',
 'served',
 '16th',
 'president',
 'United',
 'States',
 '\n       ',
 'Lincoln',
 'led',
 'nation',
 'greatest',
 'moral',
 'constitutional',
 'political',
 'crisis',
 'American',
 'Civil',
 'War']
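
The filtered list above still contains the whitespace token `'\n       '`, because whitespace is neither punctuation nor a stop word. A possible refinement, sketched here on a made-up sentence with a tokenizer-only `spacy.blank("en")` pipeline, is to also test `token.is_space`:

```python
import spacy

# is_punct, is_stop and is_space are lexical flags and need no trained model
blank_nlp = spacy.blank("en")
sample_doc = blank_nlp("Lincoln was a lawyer.\n  Lincoln led the nation.")

clean_tokens = [
    token.text
    for token in sample_doc
    if not (token.is_punct or token.is_stop or token.is_space)
]
print(clean_tokens)
```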

**Lemmatization, POS, NER and Dependency Parsing**

*   .pos_ - Coarse-grained parts of speech
*   .tag_ - Fine-grained parts of speech




In [None]:
import pandas as pd
info = []
for token in doc:
  # store the token text (not the Token object) alongside its annotations
  info.append([token.text, token.lemma_, token.ent_type_, token.pos_, token.tag_, token.dep_])
info = pd.DataFrame(info, columns=['token', 'lemma', 'ent', 'pos', 'tag', 'dep'])
print(info)

             token           lemma      ent    pos   tag       dep
0          Abraham         Abraham   PERSON  PROPN   NNP  compound
1          Lincoln         Lincoln   PERSON  PROPN   NNP     nsubj
2              was              be             AUX   VBD      ROOT
3               an              an             DET    DT       det
4         American        american     NORP    ADJ    JJ      amod
5        statesman       statesman            NOUN    NN      attr
6              and             and           CCONJ    CC        cc
7           lawyer          lawyer            NOUN    NN      conj
8              who             who            PRON    WP     nsubj
9           served           serve            VERB   VBD     relcl
10              as              as           SCONJ    IN      prep
11             the             the             DET    DT       det
12            16th            16th  ORDINAL    ADJ    JJ      amod
13       president       president            NOUN    NN      

https://universaldependencies.org/docs/u/pos/
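
The abbreviations in the table can be decoded with `spacy.explain`, which looks a label up in spaCy's glossary and needs no loaded model. A minimal sketch, using a handful of labels taken from the table and from the entity types in this notebook (the variable name `descriptions` is only for illustration):

```python
import spacy

# spacy.explain returns a short description, or None for an unknown label
labels = ["PROPN", "NNP", "nsubj", "relcl", "NORP", "GPE"]
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:>6}  {description}")
```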

**DisplaCy**

In [None]:
from spacy import displacy
doc2 = nlp("Apple is looking at buying U.K. startup for $1 billion")
displacy.render(doc2 ,style = 'dep' ,jupyter=True , options= {'distance': 90 , 'compact' :False})
#https://spacy.io/usage/visualizers

In [None]:
displacy.render(doc2 ,style = 'ent'  ,jupyter = True, options= {'distance': 90 , 'compact' :True})

In [None]:
# token 16 of doc is "United"; doc[10] is "as", which has no entity type
print(doc[16].ent_type_ , spacy.explain(doc[16].ent_type_))

GPE Countries, cities, states


**Updating pre-trained SpaCy model**

In [None]:
LABEL = "PERSON"
train_data = [
        ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
        ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
        ("Barak Obama is an American politician and attorney who served as the 44th president of the U.S.", {"entities": [(0, 11, "PERSON")]}),
        ("Obama was the first African-American president of the U.S.", {"entities": [(0, 5, "PERSON")]})]
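
Each annotation is a `(start, end, label)` triple of character offsets into the text, with `end` exclusive, so `text[start:end]` should be exactly the entity string. A quick sanity check can catch off-by-one spans before training. The helper `check_offsets` and the sample data below are hypothetical and only sketch the idea:

```python
def check_offsets(examples):
    """Return (text, span, label) for every entity whose slice looks wrong.

    Hypothetical helper: flags spans that run past the text or that start
    or end on whitespace.
    """
    problems = []
    for text, annotations in examples:
        for start, end, label in annotations["entities"]:
            span = text[start:end]
            if end > len(text) or not span or span != span.strip():
                problems.append((text, span, label))
    return problems


sample_data = [
    ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    # Deliberately wrong: (0, 7) also covers the space after "Google"
    ("Google rebrands its business apps", {"entities": [(0, 7, "ORG")]}),
]
for text, span, label in check_offsets(sample_data):
    print(f"suspicious {label} span {span!r} in {text!r}")
```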

In [None]:
# Update the NER component of the loaded model with the examples above
from spacy.util import minibatch, compounding
import warnings
import random

ner = nlp.get_pipe("ner")
ner.add_label(LABEL)
# resume_training() continues from the pretrained weights;
# begin_training() is meant for initialising a new model
optimizer = nlp.resume_training()

# get names of other pipes to disable them during training
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

with nlp.disable_pipes(*other_pipes):
  warnings.filterwarnings("once", category=UserWarning, module='spacy')
  # batch sizes grow from 1.0 towards 4.0 by a factor of 1.001 per batch
  sizes = compounding(1.0, 4.0, 1.001)
  for itn in range(30):
    random.shuffle(train_data)
    # batch up the examples using spaCy's minibatch
    batches = minibatch(train_data, size=sizes)
    losses = {}
    for batch in batches:
      texts, annotations = zip(*batch)
      nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
    print("Losses", losses)



Losses {'ner': 19.43379008769989}
Losses {'ner': 19.79410707950592}
Losses {'ner': 14.909452080726624}
Losses {'ner': 9.787634700536728}
Losses {'ner': 9.776979446411133}
Losses {'ner': 4.928063243627548}
Losses {'ner': 3.189954459667206}
Losses {'ner': 4.985356777906418}
Losses {'ner': 5.195562303066254}
Losses {'ner': 3.542061448097229}
Losses {'ner': 1.5673684254288673}
Losses {'ner': 2.9016424072906375}
Losses {'ner': 1.74193372647278}
Losses {'ner': 2.4541166853159666}
Losses {'ner': 4.067929798969999}
Losses {'ner': 1.6679688841104507}
Losses {'ner': 2.339744319440797}
Losses {'ner': 0.5388410957530141}
Losses {'ner': 0.33195654349401593}
Losses {'ner': 0.12093722051940858}
Losses {'ner': 0.19967850274406374}
Losses {'ner': 0.5970494074354065}
Losses {'ner': 0.04640352039132267}
Losses {'ner': 0.3100949887302704}
Losses {'ner': 0.2158881261420902}
Losses {'ner': 0.00839119428928825}
Losses {'ner': 0.007865471881814301}
Losses {'ner': 0.0064036969815788325}
Losses {'ner': 0.097917
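
The training cell creates a `compounding(1.0, 4.0, 1.001)` schedule. As a pure-Python sketch of the idea (a re-implementation for illustration, not spaCy's own code; the names `compounding_sketch` and `minibatch_sketch` are made up), a compounding schedule multiplies the batch size by a fixed rate after each batch until it reaches the cap, and a batcher can draw its next size from that schedule:

```python
import itertools


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


def minibatch_sketch(items, sizes):
    """Group items into batches whose sizes come from the `sizes` iterator."""
    items = iter(items)
    while True:
        batch = list(itertools.islice(items, int(next(sizes))))
        if not batch:
            return
        yield batch


# A larger rate than 1.001 so that the growth is visible on ten items
batches = list(minibatch_sketch(range(10), compounding_sketch(1.0, 4.0, 2.0)))
print(batches)
```

With a rate of 1.001 the size grows by only 0.1% per batch, so such a schedule matters mainly on datasets much larger than the four examples used here.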

In [None]:
# test the trained model
test_text = "Barak served as 16th U.S. president."
test_doc = nlp(test_text)
print("Entities in '%s'" % test_text)
for ent in test_doc.ents:
  print(ent.label_, ent.text)


Entities in 'Barak served as 16th U.S. president.'
PERSON Barak


**Noun Phrase Detection**

In [None]:
# Extract Noun Phrases
for chunk in doc.noun_chunks:
  print(chunk)

Abraham Lincoln
an American statesman
lawyer
who
the 16th president
the United States
Lincoln
the nation
its greatest moral, constitutional, and political crisis
the American Civil War


In [None]:
#Retokenization
with doc.retokenize() as retokenizer:
  for chunks in doc.noun_chunks:
      retokenizer.merge(chunks)

for token in doc:
  print(token)

Abraham Lincoln
was
an American statesman
and
lawyer
who
served
as
the 16th president
of
the United States
.

       
Lincoln
led
the nation
through
its greatest moral, constitutional, and political crisis
in
the American Civil War


**Rule-Based Matching**

In [None]:
from spacy.matcher import Matcher

new_doc = nlp(text)
matcher = Matcher(nlp.vocab)
# Two consecutive proper nouns; spaCy v2 signature: add(key, on_match, *patterns)
pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN'}]
matcher.add('FULL_NAME', None, pattern)

def extract_full_name(doc):
  # Print every span that matches the FULL_NAME pattern
  for match_id, start, end in matcher(doc):
    print(doc[start:end].text)

extract_full_name(new_doc)

Abraham Lincoln
United States
American Civil
Civil War
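
The pattern matches every adjacent pair of proper nouns, so a three-token name such as "American Civil War" is reported as two overlapping matches. One way to recover the full run, sketched here in plain Python with a hypothetical helper `merge_overlapping`, is to merge overlapping `(start, end)` token offsets. The offsets below are the token positions of the four matches in the tokenized text shown earlier.

```python
def merge_overlapping(spans):
    """Merge overlapping (start, end) token ranges; end is exclusive."""
    merged = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # Overlaps the previous range: extend it instead of adding a new one
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# (start, end) of: Abraham Lincoln, United States, American Civil, Civil War
match_offsets = [(0, 2), (16, 18), (36, 38), (37, 39)]
print(merge_overlapping(match_offsets))
```

Slicing the Doc with a merged range, for example `new_doc[36:39]`, would then give the whole name as one span.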


**Custom Pipeline Component**

In [None]:
def set_custom_boundaries(doc):
  # Adds support to use `, ` as the delimiter for sentence detection
  for token in doc[:-1]:
    if token.text == ',':
      doc[token.i+1].is_sent_start = True
  return doc

In [None]:
c_text = '''The fox jumps over the dog, the fox is very clever and quick. The dog is slow and lazy.The cat is smarter than the fox and the dog.'''
custom_nlp = spacy.load(model)
custom_nlp.add_pipe(set_custom_boundaries, before='parser')
c_doc = custom_nlp(c_text)
print(custom_nlp.pipe_names)

['tagger', 'set_custom_boundaries', 'parser', 'ner']


In [None]:
# Sentence Detection with customization
sentences = list(c_doc.sents)
for idx, sentence in enumerate(sentences):
  print(idx, sentence)

0 The fox jumps over the dog,
1 the fox is very clever and quick.
2 The dog is slow and lazy.
3 The cat is smarter than the fox and the dog.


**Word Vectors**

In [None]:
monkey_doc = nlp.vocab['monkey']
print(monkey_doc.vector)

[-0.016269  -0.66774   -0.21387   -0.47919   -0.62437    0.55147
 -0.091257  -0.26658    0.36018    0.88362   -0.65668   -0.6079
 -0.23139   -0.03844    0.055579   0.30017   -0.040855   0.42249
  0.16564   -0.32602   -0.015816  -0.1382     0.031401  -0.044665
  0.036645  -0.26125   -0.15189   -0.32714    0.28742   -0.13792
 -0.20004   -0.083271   0.50561    0.11606   -0.084337  -0.05022
  0.11342    0.060052  -0.39589   -0.20859    0.35132    0.044604
 -0.48136    0.18762    0.14262    0.12414   -0.039411  -0.035831
  0.18882    0.2482     0.014472  -0.101      0.034416  -0.09526
 -0.10771    0.14274    0.44874    0.16699    0.33336   -0.15883
 -0.29088    0.3315    -0.014055   0.11276   -0.1387    -0.47695
 -0.26999   -0.047984   0.10394    0.010292  -0.26874   -0.23449
 -0.25156    0.046208  -0.041264  -0.048494  -0.39944   -0.26964
  0.087368  -0.19203   -0.24358    0.32748   -0.10197   -0.14169
 -0.67555   -0.2514     1.0979     0.304     -0.1469    -0.1116
  0.4606     0.30304   -

In [None]:
unknown_doc = nlp('dgdgs')
print(unknown_doc.vector)

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


**Custom word vectors**

*   Custom word vectors can be trained using a number of open-source libraries, such as Gensim, FastText, or Tomas Mikolov’s original word2vec implementation.

*  For everyday use, we want to convert the vectors model into a binary format that loads faster and takes up less space on disk. The easiest way to do this is the [init-model](https://spacy.io/api/cli#init-model) command-line utility.

python -m spacy init-model [lang] [output_dir]



In [None]:
#Example-
#wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz
#python -m spacy init-model en /tmp/la_vectors_wiki_lg --vectors-loc cc.la.300.vec.gz
#nlp_latin = spacy.load("/tmp/la_vectors_wiki_lg")

**Similarity**
*   By default, spaCy calculates cosine similarity between word vectors; for a Doc or Span, the vector is the average of its token vectors.
*   Each Doc, Span and Token comes with a [.similarity()](https://spacy.io/api/token#similarity) method that lets you compare it with another object, and determine the similarity.
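
As a sketch of what cosine similarity computes, the function below implements it with NumPy on small made-up vectors. `cosine_similarity` is an illustrative helper, not spaCy's implementation; returning 0.0 for an all-zero vector is an assumption made here to avoid dividing by zero, which is relevant for out-of-vocabulary words such as `'dgdgs'` above.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm == 0.0:
        return 0.0
    return float(np.dot(a, b) / norm)


print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal
print(cosine_similarity([1.0, 2.0], [0.0, 0.0]))            # zero vector
```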

In [None]:
banana = nlp.vocab['banana']
dog = nlp.vocab['dog']
fruit = nlp.vocab['fruit']
animal = nlp.vocab['animal']

print('dog and banana', dog.similarity(banana))
print('dog and animal', dog.similarity(animal))
print('dog and fruit ', dog.similarity(fruit))

dog and banana 0.24327643
dog and animal 0.66185343
dog and fruit  0.23552851


In [None]:
text1 = 'How can I end violence?'
text2 = 'What should I do to be a peaceful?'
doc1 = nlp(text1)
doc2 = nlp(text2)
print("Similarity :", doc1.similarity(doc2))


Similarity : 0.9165530875252746


**References-**

https://course.spacy.io/en

https://spacy.io/usage

https://realpython.com/natural-language-processing-spacy-python/

