
In [1]:
import spacy

In [2]:
# Download the model first: python -m spacy download en_core_web_md
model = "en_core_web_md"

**Model Naming Convention-**

[lang]\_[name]

[lang]\_[type]\_[genre]\_[size]
* **type:** Model capabilities (e.g. core for a general-purpose model with vocabulary, syntax, entities and word vectors, or depent for only vocab, syntax and entities)
* **genre:** Type of text the model is trained on (e.g. web for web text, news for news text)
* **size:** Model size indicator (sm, md or lg)
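
The naming convention can be illustrated with a small sketch that splits a model name into its parts. `parse_model_name` is a hypothetical helper written for this notebook, not part of spaCy, and it assumes the name has exactly four underscore-separated fields.

```python
def parse_model_name(name):
    """Split a model name of the form lang_type_genre_size into its parts.

    Hypothetical helper for illustration only; spaCy does not provide it.
    """
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected lang_type_genre_size, got {name!r}")
    lang, model_type, genre, size = parts
    return {"lang": lang, "type": model_type, "genre": genre, "size": size}


parsed = parse_model_name("en_core_web_md")
print(parsed)
```

For `en_core_web_md` this reads as: English, general-purpose (core), trained on web text, medium size.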

In [None]:
#!pip install --user https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.0/en_core_web_md-2.2.0.tar.gz

Some models are not published as packages in their own right on pypi.org or Anaconda, so you can't just pip install them by name. However, you can find download links for each model on the [GitHub page](https://github.com/explosion/spacy-models) and pip install directly from one of the download URLs, as in the commented-out command above.
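
As a rough sketch, the download URL can be assembled from a model name and version. `model_download_url` is a hypothetical helper, and the URL layout is an assumption inferred from the single `en_core_web_md-2.2.0` link above; check the releases page before relying on it for other models or versions.

```python
def model_download_url(name, version):
    """Build a release URL, assuming the layout of the en_core_web_md-2.2.0 link.

    Hypothetical helper; the pattern is inferred from one example URL.
    """
    package = f"{name}-{version}"
    return (
        "https://github.com/explosion/spacy-models/releases/download/"
        f"{package}/{package}.tar.gz"
    )


url = model_download_url("en_core_web_md", "2.2.0")
print(url)
```

The resulting string can be passed to `pip install` in place of a package name.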

In [4]:
nlp = spacy.load(model)
type(nlp)
#contains the processing pipeline
#includes language-specific rules for tokenization etc.

spacy.lang.en.English

In [None]:
#from spacy.lang.en import English
#nlp = English()

In [83]:
nlp.pipe_names

['tagger', 'parser', 'ner']

![image](https://spacy.io/pipeline-7a14d4edd18f3edfee8f34393bff2992.svg)

In [50]:
txt = """When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."""

In [51]:
doc = nlp(txt)

The Doc behaves like a normal Python sequence and lets you iterate over its tokens, or get a token by its index.

In [52]:
# Tokenization: collect the text of every token in the Doc
tokens = [token.text for token in doc]
print(tokens)

['When', 'Sebastian', 'Thrun', 'started', 'working', 'on', 'self', '-', 'driving', 'cars', 'at', 'Google', 'in', '2007', ',', 'few', 'people', 'outside', 'of', 'the', 'company', 'took', 'him', 'seriously', '.']


In [53]:
token1 = doc[1]
token1.is_punct

False

In [54]:
# Slicing a Doc returns a Span (here, tokens 2 and 3)
span = doc[2:4]
span.text

'Thrun started'

In [55]:
from spacy import displacy
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90, 'compact': True})
# Visualizer documentation: https://spacy.io/usage/visualizers

In [56]:
# 'distance' and 'compact' only apply to the dependency visualizer, so they are omitted here
displacy.render(doc, style='ent', jupyter=True)

In [57]:
#Stopwords
from spacy.lang.en.stop_words import STOP_WORDS as stopwords
print(len(stopwords))
new_tokens = [token.text for token in doc if not token.is_punct and not token.is_stop]
print(new_tokens)

326
['Sebastian', 'Thrun', 'started', 'working', 'self', 'driving', 'cars', 'Google', '2007', 'people', 'outside', 'company', 'took', 'seriously']


In [75]:
# POS tagging, lemmatization, and dependency parsing

import pandas as pd
info = []
for token in doc:
    # Store token.text (a plain string) rather than the Token object itself
    info.append([token.text, token.lemma_, token.ent_type_, token.pos_, token.tag_, token.dep_, token.is_punct, token.is_stop, token.is_digit])
info = pd.DataFrame(info, columns=['token', 'lemma', 'ent', 'pos', 'tag', 'dep', 'is_punct', 'is_stop', 'is_digit'])
print(info)

        token      lemma     ent    pos  ...       dep is_punct  is_stop  is_digit
0        When       when            ADV  ...    advmod    False     True     False
1   Sebastian  Sebastian  PERSON  PROPN  ...  compound    False    False     False
2       Thrun      Thrun  PERSON  PROPN  ...     nsubj    False    False     False
3     started      start           VERB  ...     advcl    False    False     False
4     working       work           VERB  ...     xcomp    False    False     False
5          on         on            ADP  ...      prep    False     True     False
6        self       self           NOUN  ...  npadvmod    False    False     False
7           -          -          PUNCT  ...     punct     True    False     False
8     driving      drive           VERB  ...      amod    False    False     False
9        cars        car           NOUN  ...      pobj    False    False     False
10         at         at            ADP  ...      prep    False     True     False
11  

In [67]:
print(doc[1].pos_ , spacy.explain(doc[1].pos_))

PROPN proper noun


In [77]:
#Noun Phrase Detection
conference_text = ('There is a developer conference happening on 21 July 2019 in London.')
conference_doc = nlp(conference_text)
# Extract Noun Phrases
for chunk in conference_doc.noun_chunks:
  print(chunk)

a developer conference
21 July
London


In [86]:
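from spacy.matcher import Matcher

doc1 = nlp("Aman is a notorious boy. His friend Dheeraj Dubey is a nice boy")

def extract_full_name(doc):
    # Match two consecutive proper nouns, e.g. a first name followed by a surname
    pattern = [{'POS': 'PROPN'}, {'POS': 'PROPN'}]
    # Build the matcher per call so repeated calls do not re-register the pattern
    matcher = Matcher(nlp.vocab)
    # spaCy v2 signature; in spaCy v3 this is matcher.add('FULL_NAME', [pattern])
    matcher.add('FULL_NAME', None, pattern)
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)

extract_full_name(doc1)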

Dheeraj Dubey


**References-**

https://course.spacy.io/en

https://spacy.io/usage

https://realpython.com/natural-language-processing-spacy-python/

