<a href="https://colab.research.google.com/github/SrikanthGuggila/spaCy-Tutorial/blob/main/spaCy_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

spaCy is one of the most popular libraries for advanced NLP in Python.

In [None]:
!pip install spacy

**The nlp Object**

* Contains the processing pipeline
* Includes language-specific rules for tokenization etc.

In [None]:
from spacy.lang.en import English
nlp = English()
nlp

<spacy.lang.en.English at 0x7f30d73f9f10>
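
As a small illustration of the language-specific tokenization rules mentioned above, the sketch below builds the same kind of blank English pipeline with spacy.blank("en") and tokenizes a contraction. The sample sentence is an assumption chosen for illustration.

```python
import spacy

# spacy.blank("en") creates a blank English pipeline, like English()
nlp = spacy.blank("en")

# A blank pipeline has a tokenizer but no trained components
print(nlp.pipe_names)

# English tokenization rules split contractions into separate tokens
tokens = [token.text for token in nlp("I don't know")]
print(tokens)
```

The contraction is split by rule, not by a trained model, which is why this works without downloading anything.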

**The Doc object**

In [None]:
doc = nlp("Hello NLP, I'm learning you")
for token in doc:
  print(token.text)

Hello
NLP
,
I
'm
learning
you


In [None]:
for i in range(len(doc)):
  print(doc[i].text)

Hello
NLP
,
I
'm
learning
you


In [None]:
span = doc[3:]
print(span.text)

I'm learning you
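
A slice of a Doc is a Span, which keeps track of where it sits in the original document. The following sketch, using a blank English pipeline, shows the token offsets of the span from the cell above.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello NLP, I'm learning you")

# Slicing a Doc returns a Span that refers back to the Doc's tokens
span = doc[3:]

# start and end are token indices into the original Doc
print(span.start, span.end)

# A Span supports len() and indexing, just like a Doc
print(len(span))
print(span[0].text, span[-1].text)
```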


In [None]:
doc = nlp("Hello NLP, I'm learning you, it costs $4000")
print("Index:", [token.i for token in doc])
print("Text:", [token.text for token in doc])

print("is_alpha:", [token.is_alpha for token in doc])
print("is_punct:", [token.is_punct for token in doc])
print("like_num:", [token.like_num for token in doc])

Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
Text: ['Hello', 'NLP', ',', 'I', "'m", 'learning', 'you', ',', 'it', 'costs', '$', '4000']
is_alpha: [True, True, False, True, False, True, True, False, True, True, False, False]
is_punct: [False, False, True, False, False, False, False, True, False, False, False, False]
like_num: [False, False, False, False, False, False, False, False, False, False, False, True]


**Lexical Attributes**

In [None]:
doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are.")

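for token in doc:
  # Only look ahead when a next token exists (avoids an IndexError on the last token)
  if token.like_num and token.i + 1 < len(doc):
    next_token = doc[token.i + 1]
    if next_token.text == "%":
      print("Percentage Found: ", token.text,"%")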

Percentage Found:  60 %
Percentage Found:  4 %
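
The same lookup can be wrapped in a reusable function that returns the matched spans instead of printing them. This is a minimal sketch; the helper name find_percentages is hypothetical and not part of spaCy.

```python
import spacy

def find_percentages(doc):
    """Return the text of every number-like token directly followed by '%'."""
    found = []
    # Skip the last token so that doc[token.i + 1] always exists
    for token in doc[:-1]:
        if token.like_num and doc[token.i + 1].text == "%":
            # Slice out the number and the percent sign as one span
            found.append(doc[token.i : token.i + 2].text)
    return found

nlp = spacy.blank("en")
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)
percentages = find_percentages(doc)
print(percentages)
```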


**Statistical Models**

Some of the most interesting things to find out from text depend on context: for example, whether a word in a given string is a verb, or whether a span of text is a person's name.

Statistical models enable spaCy to make these predictions in context.

* spaCy offers many statistical models

* Statistical models enable spaCy to predict
  1. Part-of-speech tags
  2. Syntactic dependencies
  3. Named entities

* Models are trained on labeled example texts
* Models can be updated with more examples to fine-tune their predictions
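
To see why a trained model is needed for these predictions, the sketch below processes a sentence with a blank English pipeline, which has a tokenizer but no statistical components. The sample sentence is an assumption for illustration.

```python
import spacy

# A blank pipeline only tokenizes; it makes no predictions
nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a startup")

# Without a trained model, these attributes are left empty
pos_tags = [token.pos_ for token in doc]
dep_labels = [token.dep_ for token in doc]
print(pos_tags)
print(dep_labels)
print(doc.ents)
```

Loading a trained package such as en_core_web_sm fills in these attributes.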

**Model Packages**

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

The model package includes
* Binary weights
* Vocabulary
* Metadata (language, pipeline)
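
The contents listed above can be inspected on the loaded nlp object. The sketch below falls back to a blank English pipeline when en_core_web_sm is not installed; that fallback is an assumption added so the example runs anywhere, and the printed values will differ between the two cases.

```python
import spacy

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Assumption: use a blank pipeline if the model package is missing
    nlp = spacy.blank("en")

# Metadata: language, package name and version
print(nlp.meta["lang"], nlp.meta.get("name"), nlp.meta.get("version"))

# Names of the pipeline components, in processing order
print(nlp.pipe_names)
```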

**Predicting the Parts of Speech**

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("I'm Learning NLP")

for token in doc:
  print(token.text,"------------>",token.pos_)

I ------------> PRON
'm ------------> AUX
Learning ------------> VERB
NLP ------------> PROPN


**Predicting the Named Entities**

In [None]:
doc = nlp("Apple is looking to buying the U.K based startup for $1 billion dollers")
for ent in doc.ents:
  print(ent.text, ent.label_)

Apple ORG
U.K ORG
$1 billion MONEY


**Predicting the Linguistic Annotations**

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")

It          PRON      nsubj     
’s          VERB      punct     
official    NOUN      ccomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          AUX       ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      
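
The tags and labels in the outputs above are abbreviations. spacy.explain returns a short description for many of them, as in this small sketch; the selection of labels is an assumption taken from the outputs in this notebook.

```python
import spacy

# A few part-of-speech tags, dependency labels and entity labels seen above
labels = ["PROPN", "AUX", "nsubj", "dobj", "ORG", "MONEY"]

# spacy.explain looks up a human-readable description for each label
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:<8}{description}")
```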


**Predicting the Named Entities in Context**

Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you’re processing. Let’s take a look at an example.

* Process the text with the nlp object.
* Iterate over the entities and print the entity text and label.
* Looks like the model didn’t predict “iPhone X”. Create a span for those tokens manually.

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

Apple ORG
Missing entity: iPhone X
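
Going one step further, a manually created span can be given a label and added to doc.ents. This is a sketch under assumptions: it uses a blank English pipeline (so there are no predicted entities to conflict with), and the "PRODUCT" label is a choice made here for illustration.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Create a labeled span over tokens 1 and 2 ("iPhone", "X")
iphone_x = Span(doc, 1, 3, label="PRODUCT")

# Add the new span to the existing entities (none in a blank pipeline)
doc.ents = list(doc.ents) + [iphone_x]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```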


**Rule-Based Matching**

* Match on Doc objects, not just strings
* Match on tokens and token attributes
* Use the model's predictions
* Example: "duck" (verb) vs. "duck" (noun)


Match patterns are lists of dictionaries. Each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values.

Each of the following patterns is a list of dictionaries, one per token:

* Match exact token texts: two tokens with the text "iPhone" and "X".
  [{"TEXT": "iPhone"}, {"TEXT": "X"}]
* Match lexical attributes: two tokens whose lowercase forms equal "iphone" and "x".
  [{"LOWER": "iphone"}, {"LOWER": "x"}]
* Match any token attributes, including ones predicted by the model: a token with the lemma "buy", followed by a noun. The lemma is the base form, so this pattern would match phrases like "buying milk" or "bought flowers".
  [{"LEMMA": "buy"}, {"POS": "NOUN"}]
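
As a quick sketch of the lexical-attribute pattern, the example below matches "iphone x" regardless of casing. It assumes the spaCy v3 call form matcher.add(name, [pattern]) and uses a blank English pipeline, since LOWER does not need a trained model; the sample sentence and the pattern name are made up for illustration. The Matcher class itself is covered in the next section.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dictionary per token: compare each token's lowercase form
pattern = [{"LOWER": "iphone"}, {"LOWER": "x"}]
matcher.add("IPHONE_LOWER", [pattern])

doc = nlp("She compared the IPHONE X with an older iphone x")

# Each match is a (match_id, start, end) tuple of token indices
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```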

**Using matcher**

To use a pattern, we first import the matcher from spacy.matcher.

We also load a model and create the nlp object.

The matcher is initialized with the shared vocabulary, nlp.vocab. You'll learn more about this later – for now, just remember to always pass it in.

The matcher.add method lets you add patterns. The first argument is a unique ID that identifies which pattern was matched. In spaCy v3, the second argument is a list of patterns, so a single pattern is wrapped in a list; an optional callback can be passed as the on_match keyword argument. Older spaCy v2 code used the form matcher.add(ID, None, pattern), with the callback as the second argument; that form no longer works in v3.

To match the pattern on a text, we can call the matcher on any doc.

This will return the matches.

In [None]:
# Import the spaCy library and the Matcher
import spacy
from spacy.matcher import Matcher

# Load the model
nlp = spacy.load("en_core_web_sm")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Add the pattern (attribute names are written in uppercase, e.g. "TEXT")
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", [pattern])

# Process some text
doc = nlp("Upcoming iPhone X release date leaked")

matches = matcher(doc)

In [None]:
matches

[(9528407286733565721, 1, 3)]
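
Each match is a tuple of (match_id, start, end). The match_id is the hash of the pattern name, and start and end are token indices into the doc. The sketch below looks the name up again through nlp.vocab.strings; it assumes the spaCy v3 form of matcher.add and uses a blank English pipeline, which is enough for TEXT patterns.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

doc = nlp("Upcoming iPhone X release date leaked")

# Unpack the first match
match_id, start, end = matcher(doc)[0]

# The string store maps the hash back to the pattern name
pattern_name = nlp.vocab.strings[match_id]
print(pattern_name, start, end, doc[start:end].text)
```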

In [None]:
doc = nlp("Upcoming iPhone X release date leaked")
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X


In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]

# Add the pattern to the matcher (spaCy v3 takes a list of patterns)
# and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


In [None]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

# Add the pattern to the matcher (spaCy v3 takes a list of patterns)
# and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip
