In [1]:
import spacy



In [2]:
nlp = spacy.blank('en')

In [3]:
nlp.pipe_names

[]

The pipeline is empty because a blank model contains only a tokenizer and no trained components.
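To make the point concrete, here is a small sketch of what a blank pipeline can and cannot do. It assumes only that spaCy is installed (no model download is needed); the sample sentence is reused from later in this notebook.

```python
import spacy

# A blank English pipeline: rule-based tokenizer only, no trained components.
nlp = spacy.blank("en")
doc = nlp("Captain america ate 100$ of samosa.")

tokens = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

print(nlp.pipe_names)  # no components in the pipeline
print(tokens)          # tokenization still works
print(pos_tags)        # POS tags are unset (empty strings)
print(doc.ents)        # no entities either
```

Tokenization works, but attributes such as pos_ and doc.ents stay empty until components like the tagger or ner are added.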

Pre-trained Model:

In [4]:
nlp = spacy.load('en_core_web_sm')

In [5]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [8]:
doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")

print("Word | POS | Dependency | Lemma")
print("-------------------------------------")
for token in doc:
    print(token.text," | ",token.pos_," | ",token.dep_," | ",token.lemma_)

Word | POS | Dependency | Lemma
-------------------------------------
Captain  |  PROPN  |  compound  |  Captain
america  |  PROPN  |  nsubj  |  america
ate  |  VERB  |  ROOT  |  eat
100  |  NUM  |  nummod  |  100
$  |  NUM  |  dobj  |  $
of  |  ADP  |  prep  |  of
samosa  |  PROPN  |  pobj  |  samosa
.  |  PUNCT  |  punct  |  .
Then  |  ADV  |  advmod  |  then
he  |  PRON  |  nsubj  |  he
said  |  VERB  |  ROOT  |  say
I  |  PRON  |  nsubj  |  I
can  |  AUX  |  aux  |  can
do  |  VERB  |  ccomp  |  do
this  |  PRON  |  dobj  |  this
all  |  DET  |  det  |  all
day  |  NOUN  |  npadvmod  |  day
.  |  PUNCT  |  punct  |  .


In [11]:
doc = nlp("Tesla Inc has acquaried twitter for $45 billion.")

for ent in doc.ents:
    print(ent.text," | ",ent.label_," | ",str(spacy.explain(ent.label_)))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
$45 billion  |  MONEY  |  Monetary values, including unit


ner (Named Entity Recognition) is a pre-trained pipeline component that identifies named entities (people, places, organizations, etc.) and their types in unstructured text. In the example above it found the entity "Tesla Inc" of type ORG (organization) and "$45 billion" of type MONEY (monetary value).
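Each entity is a span of the document with a label and character offsets. The sketch below builds the same two spans by hand on a blank pipeline, so it runs without a trained model. The helper make_span is a hypothetical name introduced here for illustration, and the labels are set manually rather than predicted.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Tesla Inc has acquired twitter for $45 billion.")

def make_span(doc, phrase, label):
    """Hypothetical helper: locate `phrase` in the text and return a labeled Span."""
    start = doc.text.index(phrase)
    return doc.char_span(start, start + len(phrase), label=label)

# Manually assigned entities (an assumption for illustration, not model output).
doc.ents = [
    make_span(doc, "Tesla Inc", "ORG"),
    make_span(doc, "$45 billion", "MONEY"),
]

for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", ent.start_char, "|", ent.end_char)
```

With a trained ner component in the pipeline, doc.ents is filled in automatically and exposes the same attributes.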

In [12]:
from spacy import displacy

displacy.render(doc,style='ent',jupyter=True)

In [15]:
text = "Bloomberg founded Bloomberg data company called by his name in 1981."

doc = nlp(text)

displacy.render(doc,style='ent',jupyter=True)

Customizing the blank model:

In [16]:
source_nlp = spacy.load('en_core_web_sm')

nlp = spacy.blank('en')

nlp.add_pipe("ner", source=source_nlp)
nlp.pipe_names

['ner']

In [17]:
text = "Bloomberg founded Bloomberg data company called by his name in 1981."

doc = nlp(text)

displacy.render(doc,style='ent',jupyter=True)

Now named entity recognition works with the blank model too, because the ner component was copied from the pre-trained model.

# Exercises:

In [18]:
nlp = spacy.load('en_core_web_sm')

Exercise: 1
- Collect all the proper nouns from a given text in a list and count them.
- A proper noun is a noun that names a particular person, place, or thing.

In [21]:
text = '''Ravi and Raju are the best friends from school days.They wanted to go for a world tour and 
visit famous cities like Paris, London, Dubai, Rome etc and also they called their another friend Mohan to take part of this world tour.
They started their journey from Hyderabad and spent next 3 months travelling all the wonderful cities in the world and cherish a happy moments!
'''

# https://spacy.io/usage/linguistic-features

# run the pipeline on the text to create a Doc object
doc = nlp(text)

proper_nouns = [token.text for token in doc if token.pos_ == 'PROPN']
print(proper_nouns, f"\nThere are {len(proper_nouns)} proper nouns in the text")

['Raju', 'Paris', 'London', 'Dubai', 'Rome', 'Mohan', 'Hyderabad'] 
There are 7 proper nouns in the text
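If a text repeats names, a per-name frequency count can be more informative than a single total. Below is a minimal sketch using collections.Counter on the list printed above (copied in by hand so the snippet runs on its own). Note that 'Ravi' is missing from that output, so the small model apparently did not tag it as PROPN in this sentence.

```python
from collections import Counter

# Proper nouns as printed by the cell above.
proper_nouns = ['Raju', 'Paris', 'London', 'Dubai', 'Rome', 'Mohan', 'Hyderabad']

counts = Counter(proper_nouns)
total = sum(counts.values())   # number of proper-noun tokens
unique = len(counts)           # number of distinct proper nouns

print(f"{total} proper-noun tokens, {unique} distinct")
for name, n in counts.most_common():
    print(name, n)
```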


Exercise: 2
- Get all company names from a given text, along with their count.
- Hint: Use the spaCy NER functionality.

In [23]:
text = '''The Top 5 companies in USA are Tesla, Walmart, Amazon, Microsoft, Google and the top 5 companies in 
India are Infosys, Reliance, HDFC Bank, Hindustan Unilever and Bharti Airtel'''


doc = nlp(text)

# Note: this filters individual tokens, so a multi-word name such as "HDFC Bank"
# is split into separate items; iterate over doc.ents to get whole entity spans.
companies = [token.text for token in doc if token.ent_type_ == 'ORG']
print(companies, f"\nThere are {len(companies)} ORG-tagged tokens in the text")


['Tesla', 'Walmart', 'Amazon', 'Microsoft', 'Google', 'Infosys', 'Reliance', 'HDFC', 'Bank', 'Hindustan', 'Unilever', 'Bharti'] 
There are 12 ORG-tagged tokens in the text
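The output above lists 'HDFC' and 'Bank' separately because the filter runs over tokens, not entities. The sketch below shows, in plain Python, how token-level tags are merged into entity spans. The (text, iob, label) triples are hand-written stand-ins for spaCy's token.text, token.ent_iob_ and token.ent_type_, not real model output, and group_entities is a hypothetical helper.

```python
def group_entities(tagged_tokens):
    """Merge (text, iob, label) token triples into (entity_text, label) pairs."""
    entities = []
    words, label = [], None
    for text, iob, tag in tagged_tokens:
        if iob == "B":                      # a new entity begins
            if words:
                entities.append((" ".join(words), label))
            words, label = [text], tag
        elif iob == "I" and words:          # continue the current entity
            words.append(text)
        else:                               # outside any entity
            if words:
                entities.append((" ".join(words), label))
            words, label = [], None
    if words:                               # flush the last open entity
        entities.append((" ".join(words), label))
    return entities

# Hypothetical tags for part of the exercise text.
tagged = [
    ("Reliance", "B", "ORG"), (",", "O", ""),
    ("HDFC", "B", "ORG"), ("Bank", "I", "ORG"), (",", "O", ""),
    ("Hindustan", "B", "ORG"), ("Unilever", "I", "ORG"),
]

token_level = [text for text, iob, tag in tagged if tag == "ORG"]
entity_level = [text for text, tag in group_entities(tagged) if tag == "ORG"]

print(token_level, len(token_level))
print(entity_level, len(entity_level))
```

In spaCy itself this grouping is already done: [ent.text for ent in doc.ents if ent.label_ == 'ORG'] returns whole entity spans, so the count reflects companies rather than tokens.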
