spaCy can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

In [1]:
!pip install -U spacy

Collecting spacy
  Downloading https://files.pythonhosted.org/packages/3a/70/a0b8bd0cb54d8739ba4d6fb3458785c3b9b812b7fbe93b0f10beb1a53ada/spacy-3.0.5-cp37-cp37m-manylinux2014_x86_64.whl (12.8MB)
Collecting catalogue<2.1.0,>=2.0.1
  Downloading https://files.pythonhosted.org/packages/de/68/027c9f70a58fa4d76521d94237e305247fca196d374635b339401ebed5d8/catalogue-2.0.2-py3-none-any.whl
Collecting pathy>=0.3.5
  Downloading https://files.pythonhosted.org/packages/a2/53/97dc0197cca9357369b3b71bf300896cf2d3604fa60ffaaf5cbc277de7de/pathy-0.4.0-py3-none-any.whl
Collecting srsly<3.0.0,>=2.4.0
  Downloading https://files.pythonhosted.org/packages/c3/84/dfdfc9f6f04f6b88207d96d9520b911e5fec0c67ff47a0dea31ab5429a1e/srsly-2.4.1-cp37-cp37m-manylinux2014_x86_64.whl (456kB)
Collecting thinc<8.1.0,>=8.0.2
  Downloading https://files.pythonhosted.org/packages/e3/08/

Load a model

In [2]:
!python -m spacy download en_core_web_sm

2021-04-14 06:13:28.422333: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Collecting en-core-web-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7MB)
Installing collected packages: en-core-web-sm
  Found existing installation: en-core-web-sm 2.2.5
    Uninstalling en-core-web-sm-2.2.5:
      Successfully uninstalled en-core-web-sm-2.2.5
Successfully installed en-core-web-sm-3.0.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [4]:
doc = nlp("Company Y is planning to acquire stake in X company for $23 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)

Company NOUN compound
Y PROPN nsubj
is AUX aux
planning VERB ROOT
to PART aux
acquire VERB xcomp
stake NOUN dobj
in ADP prep
X NOUN compound
company NOUN pobj
for ADP prep
$ SYM quantmod
23 NUM compound
billion NUM pobj


 **spaCy’s Processing Pipeline:**
  
   When working with spaCy, the first step is to pass the text string to an nlp object. This object is essentially a pipeline of text-processing components that the input passes through in order.



In [5]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']

In [6]:
# Disable the tagger and parser; select_pipes() replaces the deprecated disable_pipes() in spaCy 3
nlp.select_pipes(disable=["tagger", "parser"])

['tagger', 'parser']

In [7]:
nlp.pipe_names

['tok2vec', 'ner', 'attribute_ruler', 'lemmatizer']
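
Components can also be switched off only temporarily. The following is a minimal sketch, assuming spaCy 3 and using a blank English pipeline with a rule-based sentencizer (so no trained model is needed); the variable names are illustrative only.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components.
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer")

# Used as a context manager, select_pipes() restores the
# disabled components when the block exits.
with nlp_blank.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp_blank.pipe_names)

names_after = list(nlp_blank.pipe_names)
print(names_inside, names_after)
```

This pattern is handy when one step needs only part of the pipeline and the rest of the notebook needs all of it.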

In [8]:
# Iterate over the tokens
for token in doc:
    # Print the token, its fine-grained tag, its coarse part-of-speech tag, and an explanation of the fine-grained tag
    print(token, token.tag_, token.pos_, spacy.explain(token.tag_))

Company NN NOUN noun, singular or mass
Y NNP PROPN noun, proper singular
is VBZ AUX verb, 3rd person singular present
planning VBG VERB verb, gerund or present participle
to TO PART infinitival "to"
acquire VB VERB verb, base form
stake NN NOUN noun, singular or mass
in IN ADP conjunction, subordinating or preposition
X NN NOUN noun, singular or mass
company NN NOUN noun, singular or mass
for IN ADP conjunction, subordinating or preposition
$ $ SYM symbol, currency
23 CD NUM cardinal number
billion CD NUM cardinal number
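
Part-of-speech tags are often used to filter tokens. The sketch below builds a Doc by hand from the words and coarse tags printed above (they are copied from that output, not predicted here) and keeps only the nouns and proper nouns. It assumes the spaCy 3 Doc constructor, which accepts a pos argument.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Words and coarse POS tags copied from the output above.
words = ["Company", "Y", "is", "planning", "to", "acquire", "stake",
         "in", "X", "company", "for", "$", "23", "billion"]
pos = ["NOUN", "PROPN", "AUX", "VERB", "PART", "VERB", "NOUN",
       "ADP", "NOUN", "NOUN", "ADP", "SYM", "NUM", "NUM"]

pos_doc = Doc(Vocab(), words=words, pos=pos)

# Keep only nouns and proper nouns.
nouns = [token.text for token in pos_doc if token.pos_ in ("NOUN", "PROPN")]
print(nouns)
```

With a loaded model, the same list comprehension works directly on the doc returned by nlp.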


In [9]:
from spacy import displacy
displacy.render(doc, style="dep", jupyter=True)

Dependency parsing:

Dependency parsing extracts the grammatical structure of a sentence as a set of directed relations between headwords and their dependents. Each token is attached to a head, and the relation label (for example nsubj or dobj) describes the role the dependent plays.

In [10]:
# Iterate over the tokens
for token in doc:
    # Print the token and its dependency label
    print(token.text, "-->", token.dep_)

Company --> compound
Y --> nsubj
is --> aux
planning --> ROOT
to --> aux
acquire --> xcomp
stake --> dobj
in --> prep
X --> compound
company --> pobj
for --> prep
$ --> quantmod
23 --> compound
billion --> pobj


In [11]:
spacy.explain("nsubj"), spacy.explain("ROOT"), spacy.explain("aux"), spacy.explain("advcl"), spacy.explain("dobj")

('nominal subject',
 None,
 'auxiliary',
 'adverbial clause modifier',
 'direct object')
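
Besides the label in token.dep_, each token exposes its head and its children. The following sketch uses a small hand-annotated parse (the sentence, heads, and labels are illustrative assumptions, not model output) and assumes the spaCy 3 Doc constructor, where heads are given as absolute token indices.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-annotated toy parse; heads are absolute token indices
# and the root token points at itself.
dep_doc = Doc(
    Vocab(),
    words=["She", "reads", "books"],
    heads=[1, 1, 1],
    deps=["nsubj", "ROOT", "dobj"],
)

# (dependent, relation, head) triples
triples = [(token.text, token.dep_, token.head.text) for token in dep_doc]

# Dependents of the root verb
root_children = [child.text for child in dep_doc[1].children]

print(triples)
print(root_children)
```

With the parser enabled, the same attributes are available on any doc produced by nlp.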

Lemmatization:

Lemmatization reduces the inflected forms of a word to a common base form that is itself a valid word of the language. This base form is called a lemma.

In [12]:
# Iterate over the tokens
for token in doc:
    # Print the token and its lemma
    print(token.text, "-->", token.lemma_)

Company --> company
Y --> Y
is --> be
planning --> plan
to --> to
acquire --> acquire
stake --> stake
in --> in
X --> x
company --> company
for --> for
$ --> $
23 --> 23
billion --> billion
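
One common use of lemmas is to count different inflected forms of a word together. The sketch below assigns the lemmas by hand (an assumption for illustration; a loaded pipeline would fill token.lemma_ itself) and tallies them with a Counter.

```python
from collections import Counter

from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-assigned lemmas for illustration only.
lemma_doc = Doc(
    Vocab(),
    words=["plan", "planning", "planned", "is"],
    lemmas=["plan", "plan", "plan", "be"],
)

# Count occurrences per lemma rather than per surface form.
lemma_counts = Counter(token.lemma_ for token in lemma_doc)
print(lemma_counts)
```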


In [14]:
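# doc.sents needs sentence boundaries, which the parser sets.
# The parser was disabled above, so reload the full pipeline first.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Reliance is looking at buying U.K. based analytics startup for $7 billion. This is India. India is great.")

sentences = list(doc.sents)
len(sentences)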

In [None]:
for sentence in sentences:
    print(sentence)
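
Sentence boundaries come from a pipeline component such as the parser; if none has run, doc.sents raises an error. When a full parse is not needed, spaCy's rule-based sentencizer is a lighter alternative. This is a minimal sketch on a blank English pipeline, with a shortened example text chosen for illustration.

```python
import spacy

# Blank pipeline plus the rule-based sentencizer, which splits
# on sentence-final punctuation and needs no trained model.
nlp_sent = spacy.blank("en")
nlp_sent.add_pipe("sentencizer")

sent_doc = nlp_sent("This is India. India is great.")
sentence_texts = [sent.text for sent in sent_doc.sents]
print(sentence_texts)
```

The sentencizer is faster than the parser but relies only on punctuation, so it can be less accurate on abbreviations such as "U.K.".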

In [15]:
nlp = spacy.load("en_core_web_sm")
doc= nlp(u"""The Amazon rainforest,[a] alternatively, the Amazon Jungle, also known in English as Amazonia, is a moist broadleaf tropical rainforest in the Amazon biome that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 km2 (2,700,000 sq mi), of which 5,500,000 km2 (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations.

The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Bolivia, Ecuador, French Guiana, Guyana, Suriname, and Venezuela. Four nations have "Amazonas" as the name of one of their first-level administrative regions and France uses the name "Guiana Amazonian Park" for its rainforest protected area. The Amazon represents over half of the planet's remaining rainforests,[2] and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species.[3]

Etymology
The name Amazon is said to arise from a war Francisco de Orellana fought with the Tapuyas and other tribes. The women of the tribe fought alongside the men, as was their custom.[4] Orellana derived the name Amazonas from the Amazons of Greek mythology, described by Herodotus and Diodorus.[4]

History
See also: History of South America § Amazon, and Amazon River § History
Tribal societies are well capable of escalation to all-out wars between tribes. Thus, in the Amazonas, there was perpetual animosity between the neighboring tribes of the Jivaro. Several tribes of the Jivaroan group, including the Shuar, practised headhunting for trophies and headshrinking.[5] The accounts of missionaries to the area in the borderlands between Brazil and Venezuela have recounted constant infighting in the Yanomami tribes. More than a third of the Yanomamo males, on average, died from warfare.[6]""")

# Collect each entity span with its string label and integer label ID
entities = [(ent, ent.label_, ent.label) for ent in doc.ents]
entities

[(Amazon, 'ORG', 383),
 (English, 'LANGUAGE', 389),
 (Amazonia, 'GPE', 384),
 (Amazon, 'ORG', 383),
 (Amazon, 'NORP', 381),
 (South America, 'LOC', 385),
 (7,000,000, 'CARDINAL', 397),
 (2,700,000, 'CARDINAL', 397),
 (5,500,000, 'CARDINAL', 397),
 (2,100,000, 'CARDINAL', 397),
 (nine, 'CARDINAL', 397),
 (Brazil, 'GPE', 384),
 (60%, 'PERCENT', 393),
 (Peru, 'GPE', 384),
 (13%, 'PERCENT', 393),
 (Colombia, 'GPE', 384),
 (10%, 'PERCENT', 393),
 (Bolivia, 'GPE', 384),
 (Ecuador, 'GPE', 384),
 (French, 'NORP', 381),
 (Guiana, 'PERSON', 380),
 (Guyana, 'GPE', 384),
 (Suriname, 'GPE', 384),
 (Venezuela, 'GPE', 384),
 (Four, 'CARDINAL', 397),
 (Amazonas, 'PERSON', 380),
 (one, 'CARDINAL', 397),
 (first, 'ORDINAL', 396),
 (France, 'GPE', 384),
 (Guiana Amazonian Park, 'PERSON', 380),
 (Amazon, 'ORG', 383),
 (over half, 'CARDINAL', 397),
 (an estimated 390 billion, 'QUANTITY', 395),
 (16,000, 'CARDINAL', 397),
 (Amazon, 'ORG', 383),
 (Francisco de Orellana, 'PERSON', 380),
 (Tapuyas, 'LOC', 385)

In [16]:
displacy.render(doc, style="ent", jupyter=True)
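
The entity labels in the output above are abbreviations. As with tags and dependency labels, spacy.explain can be used to look them up. The sketch below does this for the labels that appear in that output; it only reads spaCy's built-in glossary, so no model is required.

```python
import spacy

# Entity labels that appear in the output above.
labels = ["ORG", "LANGUAGE", "GPE", "NORP", "LOC", "CARDINAL",
          "PERCENT", "PERSON", "ORDINAL", "QUANTITY"]

# Map each label to its glossary description.
label_descriptions = {label: spacy.explain(label) for label in labels}

for label, description in label_descriptions.items():
    print(label, "-->", description)
```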