<a href="https://colab.research.google.com/github/SidharthBhakth/spaCy-NLP/blob/master/1_Finding_words%2C_phrases%2C_names_and_concepts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 1: Finding words, phrases, names and concepts

## **[Introduction to spaCy](https://www.youtube.com/watch?v=THduWAnG97k&t=16s)**

#### *Getting Started*

> English




In [1]:
# Import the English language class
from spacy.lang.en import English

# Create nlp object
nlp = English()

# Process a text
doc = nlp("This is a sentence.")

# Print document text
print(doc.text)

This is a sentence.


> German

In [2]:
# Import the German language class
from spacy.lang.de import German

# Create nlp object
nlp = German()

# Process a text
doc = nlp("Liebe Grüße!")

# Print document text
print(doc.text)

Liebe Grüße!


> Spanish

In [3]:
# Import the Spanish language class
from spacy.lang.es import Spanish

# Create nlp object
nlp = Spanish()

# Process a text
doc = nlp("¿Cómo estás?")

# Print document text
print(doc.text)

¿Cómo estás?
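
The three cells above differ only in the language class that is imported. As a hedged sketch, `spacy.blank` builds the same kind of tokenizer-only pipeline from a language code, so one loop can cover all three languages. The names `samples`, `results`, `nlp_blank` and `doc_blank` are illustrative and not part of the original notebook.

```python
import spacy

# Illustrative mapping of language codes to the sample texts used above
samples = {
    "en": "This is a sentence.",
    "de": "Liebe Grüße!",
    "es": "¿Cómo estás?",
}

results = {}
for code, text in samples.items():
    # spacy.blank(code) creates a blank pipeline for that language
    nlp_blank = spacy.blank(code)
    doc_blank = nlp_blank(text)
    # Tokenization is non-destructive, so doc.text reproduces the input
    results[code] = doc_blank.text
    print(code, "->", doc_blank.text)
```

No trained model is downloaded here; a blank pipeline only contains the language-specific tokenizer rules.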


#### *Documents, spans and tokens*

In [4]:
# Import the English language class
from spacy.lang.en import English

# Create nlp object
nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals"
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)

I
tree kangaroos
tree kangaroos and narwhals
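
To make the slicing above easier to follow, here is a small sketch that prints each token's index and the boundaries of a span. It assumes the blank English pipeline; `nlp_en`, `doc_en`, `indexed_tokens` and `span` are illustrative names.

```python
from spacy.lang.en import English

nlp_en = English()
doc_en = nlp_en("I like tree kangaroos and narwhals.")

# Each token knows its own index in the Doc
indexed_tokens = [(token.i, token.text) for token in doc_en]
print(indexed_tokens)

# A Span is a view into the Doc: start is inclusive, end is exclusive
span = doc_en[2:4]
print(span.text, span.start, span.end, len(span))
```

Because the end index is exclusive, `doc_en[2:4]` covers the tokens at positions 2 and 3 only.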


#### *Lexical attributes*

In [5]:
# Process the text
doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. "
          "Now less than 4% are.")

# Iterate over the tokens in the doc
for token in doc:
  # Check if the token resembles a number and is not the last token
  if token.like_num and token.i + 1 < len(doc):
    # Get the next token in the document
    next_token = doc[token.i + 1]
    # Check if the next token's text equals "%"
    if next_token.text == "%":
      print("Percentage found:", token.text)

Percentage found: 60
Percentage found: 4
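
The same check can be wrapped in a reusable function. The sketch below is one possible packaging, not the notebook author's code: `find_percentages` is a hypothetical helper name, and the bounds check is there so that a text ending in a number does not raise an `IndexError`.

```python
from spacy.lang.en import English

def find_percentages(doc):
    """Return the text of number-like tokens directly followed by "%"."""
    found = []
    for token in doc:
        # Skip the last token: it has no right-hand neighbour to inspect
        if token.like_num and token.i + 1 < len(doc):
            if doc[token.i + 1].text == "%":
                found.append(token.text)
    return found

nlp_en = English()
percentages = find_percentages(nlp_en(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
))
print(percentages)

# A text that ends in a number is handled without an IndexError
print(find_percentages(nlp_en("I paid 60")))
```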


## **[Statistical Models](https://www.youtube.com/watch?v=THduWAnG97k&t=192s)**



#### *Loading models*

In [6]:
import spacy

# Load the "en_core_web_sm" model
nlp = spacy.load("en_core_web_sm")

# Process the text
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"
doc = nlp(text)

# Print the document text
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value


#### *Predicting linguistic annotations*

> Part-of-Speech tags and dependency labels

In [7]:
# Get the token text, part-of-speech tag and dependency label
for token in doc:
  token_text = token.text
  token_pos = token.pos_
  token_dep = token.dep_
  print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")

It          PRON      nsubj     
’s          VERB      punct     
official    NOUN      ccomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          AUX       ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


> Entity labels

In [8]:
# Iterate over predicted entities and get the entity text and label
for ent in doc.ents:
  print(f"{ent.text:<14}{ent.label_}")

Apple         ORG
first         ORDINAL
U.S.          GPE
$1 trillion   MONEY
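
The tag and label names in the outputs above are abbreviations. As a small sketch, `spacy.explain` can look up a short description for many of them without loading a model. The selection in `labels` is illustrative, and `explanations` is a name introduced here.

```python
import spacy

# A few labels taken from the outputs above
labels = ["PROPN", "nsubj", "dobj", "GPE", "MONEY"]

# spacy.explain returns a short description string for known labels
explanations = {label: spacy.explain(label) for label in labels}
for label, description in explanations.items():
    print(f"{label:<8}{description}")

# Unknown labels yield None rather than raising an exception
unknown = spacy.explain("NOT_A_REAL_LABEL")
print(unknown)
```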


#### *Predicting named entities in context*

In [9]:
# Process the text
text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"
doc = nlp(text)

# Iterate over predicted entities and get the entity text and label
for ent in doc.ents:
  print(f"{ent.text:<14}{ent.label_}")

Apple         ORG


In [10]:
# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

Missing entity: iPhone X
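
When the model misses an entity like this, one option is to label the span by hand. The following is a minimal sketch under stated assumptions: it uses a blank English pipeline (so there are no predicted entities to preserve), and the `"PRODUCT"` label is chosen manually for illustration. The names `nlp_en`, `doc_en` and `manual_ents` are not part of the original notebook.

```python
from spacy.lang.en import English
from spacy.tokens import Span

nlp_en = English()
doc_en = nlp_en("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Wrap tokens 1 and 2 ("iPhone X") in a Span with a hand-picked label
iphone_x = Span(doc_en, 1, 3, label="PRODUCT")

# Overwrite the document's entities with the manual span
doc_en.ents = [iphone_x]

manual_ents = [(ent.text, ent.label_) for ent in doc_en.ents]
print(manual_ents)
```

With a loaded model, the existing predictions could be kept by assigning `list(doc.ents) + [iphone_x]` instead, provided the spans do not overlap.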


## **[Rule-based matching](https://www.youtube.com/watch?v=THduWAnG97k&t=431s)**

#### *Using the Matcher*

In [11]:
# Import the matcher
from spacy.matcher import Matcher

# Initialize the matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# Add the pattern to the matcher (list-of-patterns form, spaCy v2.2.2+ and v3)
matcher.add("IPHONE_X_PATTERN", [pattern])

# Use the matcher on the doc
matches = matcher(doc)
print("Matches found:", [doc[start:end].text for match_id, start, end in matches])

Matches found: ['iPhone X']
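
Each match is a `(match_id, start, end)` tuple, where `match_id` is a hash of the pattern name. The sketch below shows one way to map it back to the name through the vocabulary's string store. It assumes a blank English pipeline and the list-of-patterns form of `Matcher.add` (spaCy v2.2.2 or later); `matcher_demo` and `described` are illustrative names.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp_en = English()
matcher_demo = Matcher(nlp_en.vocab)

# List-of-patterns form of Matcher.add (spaCy v2.2.2 and later)
matcher_demo.add("IPHONE_X_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

doc_en = nlp_en("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

described = []
for match_id, start, end in matcher_demo(doc_en):
    # match_id is a hash; the string store maps it back to the pattern name
    pattern_name = nlp_en.vocab.strings[match_id]
    described.append((pattern_name, start, end, doc_en[start:end].text))

print(described)
```

Looking up the name is useful once several patterns are registered on the same matcher.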


#### *Writing matching patterns*

> Write **one** pattern that only matches mentions of the full iOS versions: “iOS 7”, “iOS 11” and “iOS 10”.



In [12]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp("After making the iOS update you won't notice a radical system-wide "
          "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
          "iOS 11's furniture remains the same as in iOS 10. But you will discover "
          "some tweaks once you delve a little deeper.")

# Write a pattern for full iOS versions
pattern = [{"TEXT":"iOS"}, {"IS_DIGIT":True}]

# Add the pattern to the matcher (list-of-patterns form, spaCy v2.2.2+ and v3)
matcher.add("IOS_VERSION_PATTERN", [pattern])

# Apply the matcher to the doc
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
  print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10
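
The `TEXT` attribute is case-sensitive, so "ios 10" would not match the pattern above. As a hedged variation, the `LOWER` attribute compares against the lowercased token text. The example sentence and the names `matcher_ci` and `found` are made up for illustration.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp_en = English()
matcher_ci = Matcher(nlp_en.vocab)

# LOWER compares against the lowercased token text
matcher_ci.add("IOS_VERSION_ANY_CASE", [[{"LOWER": "ios"}, {"IS_DIGIT": True}]])

doc_en = nlp_en("I moved from ios 10 to IOS 11, but the iOS update itself was small.")

# Sort by start index so the printed order follows the text
found = sorted((start, doc_en[start:end].text) for _, start, end in matcher_ci(doc_en))
print(found)
```

The bare "iOS update" is skipped here as well, because the second token must consist of digits.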


> Write **one** pattern that only matches forms of “download” (tokens with the lemma “download”), followed by a token with the part-of-speech tag "PROPN" (proper noun).

In [13]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp("i downloaded Fortnite on my laptop and can't open the game at all. Help? "
          "so when I was downloading Minecraft, I got the Windows version where it "
          "is the '.zip' folder and I used the default program to unpack it... do "
          "I also need to download Winzip?")

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA":"download"}, {"POS":"PROPN"}]

# Add the pattern to the matcher (list-of-patterns form, spaCy v2.2.2+ and v3)
matcher.add("DOWNLOAD_THINGS_PATTERN", [pattern])

# Apply the matcher to the doc
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
  print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


> Write **one** pattern that matches adjectives ("ADJ") followed by one or two "NOUN"s (one noun and one optional noun).

In [14]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp("Features of the app include a beautiful design, smart search, automatic "
          "labels and optional voice responses.")

# Write a pattern for adjective plus one or two nouns
pattern = [{"POS":"ADJ"}, {"POS":"NOUN"}, {"POS":"NOUN", "OP":"?"}]

# Add the pattern to the matcher (list-of-patterns form, spaCy v2.2.2+ and v3)
matcher.add("ADJ_NOUN_PATTERN", [pattern])

# Apply the matcher to the doc
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
  print("Match found:", doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses
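
The optional operator produces overlapping matches: "optional voice" and "optional voice responses" are both reported. If only the longest match is wanted, `spacy.util.filter_spans` is one way to drop the shorter overlaps. The sketch below uses `LOWER` patterns instead of part-of-speech tags so that it runs on a blank pipeline without a trained model; that substitution and the names `matcher_opt`, `all_texts` and `longest_texts` are assumptions made for illustration.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp_en = English()
matcher_opt = Matcher(nlp_en.vocab)

# The last token is optional, so two overlapping matches are expected
matcher_opt.add("OPTIONAL_VOICE", [[
    {"LOWER": "optional"},
    {"LOWER": "voice"},
    {"LOWER": "responses", "OP": "?"},
]])

doc_en = nlp_en("automatic labels and optional voice responses.")
spans = [doc_en[start:end] for _, start, end in matcher_opt(doc_en)]

all_texts = sorted(span.text for span in spans)
# filter_spans keeps the longest span among overlapping candidates
longest_texts = [span.text for span in filter_spans(spans)]

print(all_texts)
print(longest_texts)
```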
