
# 1. Finding words, phrases, names, concepts

Learn the basics of text processing with spaCy: its core data structures, how to work with trained pipelines, and how to use them to predict linguistic features in your text.

In [14]:
# Import spacy
import spacy

In [15]:
# Create a blank English nlp object
nlp = spacy.blank("en")

* The `nlp` object contains the processing pipeline
* It includes language-specific rules for tokenization etc.
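
Both points can be checked directly. The following is a minimal sketch, assuming spaCy v3; the sample sentence is made up for illustration. A blank pipeline has no trained components, yet its tokenizer still applies English-specific rules.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")
print(nlp.pipe_names)

# The tokenizer still applies English rules, e.g. splitting "$5" into "$" and "5"
doc = nlp("It costs $5, doesn't it?")
tokens = [token.text for token in doc]
print(tokens)
```

Tokenization is non-destructive, so joining each token's `text_with_ws` reproduces the original string.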

In [16]:
# The Doc object
doc = nlp("Hello world!")

In [17]:
# Iterate over tokens in Doc
for token in doc:
  print(token.text)

Hello
world
!


In [18]:
# The token object
# Index into the Doc to get a single token
token = doc[1]

In [19]:
print(token.text)

world


In [20]:
# The span object
# A slice from the Doc is a span object
span = doc[1:3]
print(span.text)

world!


In [21]:
# Lexical attributes
doc = nlp("It costs $5.")
print(f"Index: {[token.i for token in doc]}")
print(f"Text: {[token.text for token in doc]}")

print(f"is_alpha: {[token.is_alpha for token in doc]}")
print(f"is_punct: {[token.is_punct for token in doc]}")
print(f"like_num: {[token.like_num for token in doc]}")

Index: [0, 1, 2, 3, 4]
Text: ['It', 'costs', '$', '5', '.']
is_alpha: [True, True, False, False, False]
is_punct: [False, False, False, False, True]
like_num: [False, False, False, True, False]
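
Lexical attributes are derived from the token text alone, not from context, which is why they work on a blank pipeline. As a small sketch (the sentence is made up for illustration), `like_num` also recognizes number words in English:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I paid ten dollars for 12 apples")

# like_num looks at the token text only, so it also recognizes number words
numbers = [token.text for token in doc if token.like_num]
print(numbers)
```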


### 1.1 Documents, spans and tokens

When you call `nlp` on a string, spaCy first tokenizes the text and creates a document object. In this exercise, we'll learn more about the `Doc`, as well as its views `Token` and `Span`.

In [22]:
import spacy
nlp = spacy.blank("en")

# Process the text
doc = nlp("I like tree kangaroos and narwhals")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

I
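
To illustrate what "view" means here, this short sketch reuses the same sentence and inspects a slice. A `Span` stores its start and end positions and points back to the `Doc` it was sliced from, and its tokens keep their indices in the parent document.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I like tree kangaroos and narwhals")

# Slicing returns a Span: a view of tokens 2 and 3 (the end index is exclusive)
span = doc[2:4]
print(span.text)
print(span.start, span.end)

# Tokens inside the span keep their position in the parent Doc
print([token.i for token in span])

# The span points back to the Doc it was sliced from
print(span.doc is doc)
```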


### 1.2 Lexical attributes

Look for two consecutive tokens: a number followed by a percent sign.

In [26]:
# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

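for token in doc:
  # Only look ahead when a next token exists (avoids an IndexError at the end of the doc)
  if token.like_num and token.i + 1 < len(doc):
    next_token = doc[token.i + 1]
    if next_token.text == "%":
      print(f"Percentage found: {token.text}")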

Percentage found: 60
Percentage found: 4
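
The same lookup can be written with `Token.nbor()`, which returns a neighboring token. This is an alternative sketch, not the approach used in the cell above: iterating over `doc[:-1]` leaves out the last token, so every token visited has a right-hand neighbor.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# doc[:-1] skips the last token, so token.nbor() always has a right-hand neighbor
percentages = [
    token.text
    for token in doc[:-1]
    if token.like_num and token.nbor().text == "%"
]
print(percentages)
```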


## 2. Trained pipelines

**What are trained pipelines?**

* Models that enable spaCy to predict linguistic attributes in context
  * Part-of-speech tags
  * Syntactic dependencies
  * Named entities

* Trained on labeled example texts
* Can be updated with more examples to fine-tune predictions

### 2.1 Loading Pipelines

The pipelines used in this course are already pre-installed. For more details on spaCy's trained pipelines and how to install them on your machine, see the [documentation](https://spacy.io/usage/models).

In [28]:
# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value
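
Outside the course environment the package may not be installed, in which case `spacy.load` raises an error. The sketch below checks for the package first; the fallback to a blank pipeline is only an assumption made to keep the example runnable and gives no trained predictions.

```python
import spacy
from spacy.util import is_package

MODEL_NAME = "en_core_web_sm"

if is_package(MODEL_NAME):
    # The trained pipeline package is installed: load it
    nlp = spacy.load(MODEL_NAME)
else:
    # Fallback for this sketch only: a blank pipeline without trained components
    print(f"{MODEL_NAME} is not installed; try: python -m spacy download {MODEL_NAME}")
    nlp = spacy.blank("en")

# Names of the components that will run on each text
print(nlp.pipe_names)
```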


### 2.2 Predicting linguistic annotations

Now let's try one of spaCy's trained pipeline packages and see its predictions in action.

In [34]:
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
  token_text = token.text
  token_pos = token.pos_
  token_dep = token.dep_

  print(f"Token text: {token_text:<12}, POS: {token_pos:<10}, Dependency: {token_dep}")
  print(f"Dependency explanation: {spacy.explain(token_dep)}")
  print(f"POS explanation: {spacy.explain(token_pos)}\n")

Token text: It          , POS: PRON      , Dependency: nsubj
Dependency explanation: nominal subject
POS explanation: pronoun

Token text: ’s          , POS: VERB      , Dependency: punct
Dependency explanation: punctuation
POS explanation: verb

Token text: official    , POS: NOUN      , Dependency: ccomp
Dependency explanation: clausal complement
POS explanation: noun

Token text: :           , POS: PUNCT     , Dependency: punct
Dependency explanation: punctuation
POS explanation: punctuation

Token text: Apple       , POS: PROPN     , Dependency: nsubj
Dependency explanation: nominal subject
POS explanation: proper noun

Token text: is          , POS: AUX       , Dependency: ROOT
Dependency explanation: None
POS explanation: auxiliary

Token text: the         , POS: DET       , Dependency: det
Dependency explanation: determiner
POS explanation: determiner

Token text: first       , POS: ADJ       , Dependency: amod
Dependency explanation: adjectival modifier
POS explanation: adjective

...

In [35]:
# Print the text and label_ attribute of each predicted entity
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
  print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY
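
If the entity labels are unfamiliar, `spacy.explain` gives a short description for each, just as it does for the POS tags and dependency labels earlier. A small sketch using the four labels from the output above:

```python
import spacy

# Labels taken from the entity output above
labels = ["ORG", "ORDINAL", "GPE", "MONEY"]
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:<8} {description}")
```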


### 2.3 Predicting named entities in context

Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you're processing. Let's take a look at an example.

In [36]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

Apple ORG
Missing entity: iPhone X
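
One possible follow-up, sketched here under assumptions, is to add the missed span to `doc.ents` by hand. The `PRODUCT` label is an illustrative choice, and a blank pipeline is used so the example runs without a trained model. With a trained pipeline, the new span must not overlap an entity that was already predicted.

```python
import spacy
from spacy.tokens import Span

# A blank pipeline keeps this sketch runnable without a trained model
nlp = spacy.blank("en")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Label tokens 1 and 2 ("iPhone X") by hand; "PRODUCT" is an assumed label choice
iphone_x = Span(doc, 1, 3, label="PRODUCT")

# Keep any existing entities and append the new one
doc.ents = list(doc.ents) + [iphone_x]
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```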


## 3. Rule-based matching

**Why not just regular expressions?**

* Match on `Doc` objects, not just strings
* Match on tokens and token attributes
* Use a model's predictions
* Example: "duck" (verb) vs. "duck" (noun)
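
The "duck" example can be sketched as follows, assuming spaCy v3. The sentence and its POS tags are written by hand to stand in for a trained model's predictions, so the code runs on a blank pipeline. The regular expression matches both occurrences, while the token pattern can also require the verb reading.

```python
import re

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written POS tags stand in for a trained model's predictions
words = ["The", "duck", "saw", "me", "duck"]
pos = ["DET", "NOUN", "VERB", "PRON", "VERB"]
doc = Doc(nlp.vocab, words=words, pos=pos)

# A regular expression only sees characters: both occurrences match
regex_hits = re.findall(r"\bduck\b", doc.text)
print(len(regex_hits))

# A token pattern can also require the verb reading
matcher = Matcher(nlp.vocab)
matcher.add("DUCK_VERB", [[{"LOWER": "duck", "POS": "VERB"}]])
verb_hits = [(start, doc[start:end].text) for _, start, end in matcher(doc)]
print(verb_hits)
```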

#### 3.1 Using the Matcher

In [43]:
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# Add pattern to the matcher
matcher.add("IPHONE_PATTERN", [pattern])

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone X']
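
Each match is a `(match_id, start, end)` tuple. The `match_id` is the hash of the pattern name, and `start` and `end` are token indices into the `Doc`. This sketch resolves the hash back to its name through the vocabulary's string store. Because `TEXT` patterns need no model predictions, a blank pipeline is used here.

```python
import spacy
from spacy.matcher import Matcher

# TEXT patterns need no trained model, so a blank pipeline is enough here
nlp = spacy.blank("en")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

results = []
for match_id, start, end in matcher(doc):
    # match_id is a hash; the vocabulary's string store maps it back to the name
    name = nlp.vocab.strings[match_id]
    results.append((name, start, end, doc[start:end].text))
print(results)
```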


#### 3.2 Writing match patterns



In [59]:
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions (e.g. "iOS 7", "iOS 11")
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]

# Add the patterns to the matcher
matcher.add("IOS_VERSION_PATTERN", [pattern])

# use the matcher on the doc
matches = matcher(doc)
print("Total matches found:", len(matches))

print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Total matches found: 3
Matches: ['iOS 7', 'iOS 11', 'iOS 10']
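
`TEXT` matches the exact token text, so the pattern above would miss spellings such as "IOS" or "ios". A possible variation, sketched with a made-up sentence and a hypothetical pattern name, uses `LOWER` to compare the lowercased token text instead:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# LOWER compares the lowercased token text, so "iOS", "IOS" and "ios" all match
matcher.add("IOS_VERSION_ANY_CASE", [[{"LOWER": "ios"}, {"IS_DIGIT": True}]])

# Made-up sample text for illustration
doc = nlp("I moved from IOS 9 to ios 10 and then to iOS 11")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```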


In [62]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


In [65]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Pattern for an adjective plus one or two nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

# Add the pattern to the matcher
matcher.add("ADJ_NOUN_PATTERN", [pattern])
matches = matcher(doc)

# Iterate over the matches and print span text
for match_id, start, end in matches:
  print(f"Match found: {doc[start:end]}")

Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses
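
Because of the optional third token, the matcher reports both "optional voice" and "optional voice responses". If only the longest match is wanted, one option is `spacy.util.filter_spans`. The sketch below assumes spaCy v3 and uses hand-written POS tags on a shortened version of the sentence, so it runs without a trained model.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.util import filter_spans

nlp = spacy.blank("en")

# Hand-written POS tags stand in for a trained model's predictions
words = ["automatic", "labels", "and", "optional", "voice", "responses"]
pos = ["ADJ", "NOUN", "CCONJ", "ADJ", "NOUN", "NOUN"]
doc = Doc(nlp.vocab, words=words, pos=pos)

matcher = Matcher(nlp.vocab)
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]
matcher.add("ADJ_NOUN_PATTERN", [pattern])

# as_spans=True returns Span objects instead of (match_id, start, end) tuples
spans = matcher(doc, as_spans=True)
print([span.text for span in spans])

# filter_spans drops overlaps, preferring the longest span
longest = filter_spans(spans)
print([span.text for span in longest])
```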
