<a href="https://colab.research.google.com/github/JpChii/ML-Projects/blob/main/Spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Finding words, phrases, names, concepts

Learn the basics of text processing with spaCy: its data structures, how to work with trained pipelines, and how to use them to predict linguistic features in your text.

In [1]:
# Import spacy
import spacy

In [2]:
# Create a blank English nlp object
nlp = spacy.blank("en")

* The `nlp` object contains the processing pipeline
* It includes language-specific rules for tokenization etc.

In [3]:
# The Doc object
doc = nlp("Hello world!")

In [4]:
# Iterate over tokens in Doc
for token in doc:
  print(token.text)

Hello
world
!


In [5]:
# The token object
# Index into the Doc to get a single token
token = doc[1]

In [6]:
print(token.text)

world


In [7]:
# The span object
# A slice from the Doc is a span object
span = doc[1:3]
print(span.text)

world!


In [8]:
# Lexical attributes
doc = nlp("It costs $5.")
print(f"Index: {[token.i for token in doc]}")
print(f"Text: {[token.text for token in doc]}")

print(f"is_alpha: {[token.is_alpha for token in doc]}")
print(f"is_punct: {[token.is_punct for token in doc]}")
print(f"like_num: {[token.like_num for token in doc]}")

Index: [0, 1, 2, 3, 4]
Text: ['It', 'costs', '$', '5', '.']
is_alpha: [True, True, False, False, False]
is_punct: [False, False, False, False, True]
like_num: [False, False, False, True, False]
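
To make the three flags above more concrete, here is a rough plain-Python approximation. The helper `lexical_flags` is hypothetical and not part of spaCy. In particular, spaCy's `like_num` also recognizes number words such as "ten", which this sketch ignores.

```python
import unicodedata

def lexical_flags(text):
    """Approximate three lexical attributes for a single token string."""
    return {
        # Every character is a letter
        "is_alpha": text.isalpha(),
        # Every character belongs to a Unicode punctuation category (P*)
        "is_punct": bool(text) and all(
            unicodedata.category(ch).startswith("P") for ch in text
        ),
        # Digits only, ignoring thousands separators and one decimal point
        "like_num": text.replace(",", "").replace(".", "", 1).isdigit(),
    }

# Same tokens as in the cell above, written out by hand
tokens = ["It", "costs", "$", "5", "."]
for name in ("is_alpha", "is_punct", "like_num"):
    print(f"{name}: {[lexical_flags(t)[name] for t in tokens]}")
```

Note that `$` is a currency symbol rather than punctuation in Unicode terms, which is why `is_punct` is `False` for it.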


### 1.1 Documents, spans and tokens

When you call `nlp` on a string, spaCy first tokenizes the text and creates a `Doc` object. In this exercise, we'll learn more about the `Doc`, as well as its views `Token` and `Span`.

In [9]:
import spacy
nlp = spacy.blank("en")

# Process the text
doc = nlp("I like tree kangaroos and narwhals")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

I


### 1.2 Lexical attributes

Use lexical attributes to look for two consecutive tokens: a number followed by a percent sign.

In [10]:
# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

for token in doc:
  # Only look ahead if a next token exists, to avoid an IndexError
  if token.like_num and token.i + 1 < len(doc):
    next_token = doc[token.i + 1]
    if next_token.text == "%":
      print(f"Percentage found: {token.text}")

Percentage found: 60
Percentage found: 4
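
The look-ahead above can also be written without index arithmetic by pairing each token with its successor. The sketch below uses a hand-tokenized list of strings as a stand-in for the `Doc`, and `str.isdigit` as a stand-in for `like_num` (both are assumptions). The helper `find_percentages` is hypothetical.

```python
def find_percentages(tokens):
    """Return tokens that look numeric and are directly followed by '%'."""
    found = []
    # zip stops at the shorter sequence, so the last token is never
    # paired with a non-existent successor
    for current, following in zip(tokens, tokens[1:]):
        if current.isdigit() and following == "%":
            found.append(current)
    return found

# Hand-tokenized stand-in for the text above (assumption: plain strings)
tokens = ["In", "1990", ",", "more", "than", "60", "%", "of", "people",
          "Now", "less", "than", "4", "%", "are", "."]
print(find_percentages(tokens))  # ['60', '4']
```

The same pairing idea should carry over to a real `Doc`, since a `Doc` and its slices can be iterated token by token.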


## 2. Trained pipelines

**What are trained pipelines?**

* Models that enable spaCy to predict linguistic attributes in context
  * Part-of-Speech tags
  * Syntactic dependencies
  * Named entities

* Trained on labeled example texts
* Can be updated with more examples to fine-tune predictions

### 2.1 Loading Pipelines

The pipelines used in this course are already pre-installed. For more details on spaCy's trained pipelines and how to install them on your machine, see the [documentation](https://spacy.io/usage/models).

In [11]:
# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value


### 2.2 Predicting linguistic annotations

Now let's try one of spaCy's trained pipeline packages and see its predictions in action.

In [12]:
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
  token_text = token.text
  token_pos = token.pos_
  token_dep = token.dep_

  print(f"Token text: {token_text:<12}, POS: {token_pos:<10}, Dependency: {token_dep}")
  print(f"Dependency explanation: {spacy.explain(token_dep)}")
  print(f"POS explanation: {spacy.explain(token_pos)}\n")

Token text: It          , POS: PRON      , Dependency: nsubj
Dependency explanation: nominal subject
POS explanation: pronoun

Token text: ’s          , POS: VERB      , Dependency: punct
Dependency explanation: punctuation
POS explanation: verb

Token text: official    , POS: NOUN      , Dependency: ccomp
Dependency explanation: clausal complement
POS explanation: noun

Token text: :           , POS: PUNCT     , Dependency: punct
Dependency explanation: punctuation
POS explanation: punctuation

Token text: Apple       , POS: PROPN     , Dependency: nsubj
Dependency explanation: nominal subject
POS explanation: proper noun

Token text: is          , POS: AUX       , Dependency: ROOT
Dependency explanation: None
POS explanation: auxiliary

Token text: the         , POS: DET       , Dependency: det
Dependency explanation: determiner
POS explanation: determiner

Token text: first       , POS: ADJ       , Dependency: amod
Dependency explanation: adjectival modifier
POS expla

In [13]:
# Iterate over the doc's entities and print each label_ attribute
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
  print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


### 2.3 Predicting named entities in context

Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text being processed. Let's take a look at an example.

In [14]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

Apple ORG
Missing entity: iPhone X


## 3. Rule-based matching

**Why not just regular expressions?**

* Match on `Doc` objects, not just strings
* Match on tokens and token attributes
* Use a model's predictions
* Example: "duck"(verb) vs "duck"(noun)
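
To illustrate the last point, here is a toy matcher in plain Python. It is only a sketch of the idea of matching on token attributes, not spaCy's `Matcher`. The function `match_pattern` is hypothetical, and the part-of-speech tags are written by hand rather than predicted by a model.

```python
def match_pattern(tokens, pattern):
    """Return (start, end) index pairs where every pattern dict matches.

    tokens:  list of dicts holding token attributes, e.g. {"TEXT": ..., "POS": ...}
    pattern: list of dicts; each must be a subset of the token at that position
    """
    matches = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(
            all(tok.get(key) == value for key, value in spec.items())
            for tok, spec in zip(window, pattern)
        ):
            matches.append((start, start + len(pattern)))
    return matches

# Hand-annotated tokens (assumption: tags written by hand, not predicted)
tokens = [
    {"TEXT": "I", "POS": "PRON"},
    {"TEXT": "duck", "POS": "VERB"},
    {"TEXT": "when", "POS": "SCONJ"},
    {"TEXT": "the", "POS": "DET"},
    {"TEXT": "duck", "POS": "NOUN"},
    {"TEXT": "flies", "POS": "VERB"},
]

# Text alone cannot tell the two uses apart
print(match_pattern(tokens, [{"TEXT": "duck"}]))                 # [(1, 2), (4, 5)]
# Adding the tag selects only the noun
print(match_pattern(tokens, [{"TEXT": "duck", "POS": "NOUN"}]))  # [(4, 5)]
```

A regular expression over the raw string would find both occurrences of "duck" and could not tell the verb from the noun.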

#### 3.1 Using the Matcher

In [15]:
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# Add pattern to the matcher
matcher.add("IPHONE_PATTERN", [pattern])

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone X']


#### 3.2 Writing match patterns



In [16]:
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS" followed by a number)
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]

# Add the patterns to the matcher
matcher.add("IOS_VERSION_PATTERN", [pattern])

# Use the matcher on the doc
matches = matcher(doc)
print("Total matches found:", len(matches))

print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Total matches found: 3
Matches: ['iOS 7', 'iOS 11', 'iOS 10']


In [17]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


In [18]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Pattern for an adjective followed by one or two nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

# Add the pattern to the matcher
matcher.add("ADJ_NOUN_PATTERN", [pattern])
matches = matcher(doc)

# Iterate over the matches and print span text
for match_id, start, end in matches:
  print(f"Match found: {doc[start:end]}")

Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses
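
The last two matches overlap because the optional token (`"OP": "?"`) may be either absent or present, and every variant that fits is reported. One way to picture this is to expand the pattern into its concrete variants. This is a plain-Python sketch under that assumption, not how spaCy's `Matcher` is implemented. The helpers `expand_optional` and `find_matches` are hypothetical, and the tags are written by hand.

```python
from itertools import product

def expand_optional(pattern):
    """Expand every {"OP": "?"} token spec into 'absent' and 'present' variants."""
    choices = []
    for spec in pattern:
        required = {k: v for k, v in spec.items() if k != "OP"}
        if spec.get("OP") == "?":
            choices.append([None, required])  # the token may be skipped
        else:
            choices.append([required])
    for combo in product(*choices):
        yield [spec for spec in combo if spec is not None]

def find_matches(tokens, pattern):
    """Return all (start, end) spans matched by any expanded variant."""
    found = set()
    for variant in expand_optional(pattern):
        if not variant:
            continue  # ignore the empty variant
        for start in range(len(tokens) - len(variant) + 1):
            window = tokens[start:start + len(variant)]
            if all(
                tok.get(key) == value
                for tok, spec in zip(window, variant)
                for key, value in spec.items()
            ):
                found.add((start, start + len(variant)))
    return sorted(found)

# Hand-annotated tokens (assumption: tags written by hand, not predicted)
tokens = [
    {"TEXT": "optional", "POS": "ADJ"},
    {"TEXT": "voice", "POS": "NOUN"},
    {"TEXT": "responses", "POS": "NOUN"},
    {"TEXT": ".", "POS": "PUNCT"},
]
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

for start, end in find_matches(tokens, pattern):
    print(" ".join(tok["TEXT"] for tok in tokens[start:end]))
```

With these tokens the sketch reports both "optional voice" and "optional voice responses", mirroring the output above.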


# Large-Scale data analysis with spaCy

## 1. Data structures (1)

### 1.1 Strings to hashes

To save memory, spaCy stores each string only once in the `StringStore` (available as `nlp.vocab.strings`) and refers to it by its hash everywhere else.

In [19]:
import spacy

nlp = spacy.blank("en")
doc = nlp("I have a cat")

# Look up the hash for the word "cat"
cat_hash = nlp.vocab.strings["cat"]
print(cat_hash)

# And hash to string
cat_string = nlp.vocab.strings[cat_hash]
print(cat_string)

5439657043933447811
cat


In [21]:
# Do the same for another string
doc = nlp("David Bowie is a PERSON")

person_hash = doc.vocab.strings["PERSON"]
print(person_hash)

person_string = doc.vocab.strings[person_hash]
print(person_string)

380
PERSON


In [22]:
person_hash = doc.vocab.strings["CHECK"]
print(person_hash)

person_string = doc.vocab.strings[person_hash]
print(person_string)

4398250953217862582


KeyError: ignored

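Looking up a string always returns its hash, but resolving a hash back to a string raises a `KeyError` if that string has never been added to the `StringStore`. To fix this, add the string to the vocab, or resolve the hash with a vocab that has already seen the string.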

In [25]:
doc.vocab.strings.add("CHECK")

4398250953217862582

In [26]:
person_hash = doc.vocab.strings["CHECK"]
print(person_hash)

person_string = doc.vocab.strings[person_hash]
print(person_string)

4398250953217862582
CHECK
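
The behavior seen in the last few cells can be imitated with a small two-way lookup. This is a toy sketch, not spaCy's `StringStore`: the class `ToyStringStore` is hypothetical, and the hash function is a stand-in chosen for illustration, so the numbers differ from spaCy's.

```python
import hashlib

class ToyStringStore:
    """Two-way lookup between strings and 64-bit hashes (illustration only)."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_of(text):
        # Stand-in hash function (assumption): spaCy uses a different one
        digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, text):
        """Store the string so that its hash can be resolved later."""
        key = self.hash_of(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            return self.hash_of(item)  # string -> hash: always computable
        return self._by_hash[item]     # hash -> string: KeyError if unseen

store = ToyStringStore()
check_hash = store["CHECK"]  # works, but nothing is stored yet

try:
    store[check_hash]
except KeyError:
    print("hash cannot be resolved yet")

store.add("CHECK")
print(store[check_hash])  # CHECK
```

The hash can be computed from the string alone, but going from hash to string needs a stored entry, which is why the string has to be added first.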


## 2. Data structures (2): Doc, Span and Token

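The `Doc` object: created from a list of words and a list of spaces, together with the shared vocab.

The `Span` object: created from a `Doc` plus start and end token indices, optionally with a label.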

**Best Practices:**

* `Doc` and `Span` are very powerful and hold references and relationships of words and sentences
  * Convert result to strings as late as possible
  * Use token attributes if available - for example, `token.i` for the token index

* Don't forget to pass in the shared `vocab`

### 2.1 Creating a Doc

In [28]:
import spacy

nlp = spacy.blank("en")

# Import the Doc class
from spacy.tokens import Doc

# Desired text: "spaCy is interesting!"

words = ["spaCy", "is", "interesting", "!"]
spaces = [True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

spaCy is interesting!


In [30]:
import spacy

nlp = spacy.blank("en")

from spacy.tokens import Doc

words = ["Go", ",", "get", "started", "!"]
spaces = [False, True, True, False, False]

doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Go, get started!


In [31]:
import spacy

nlp = spacy.blank("en")

from spacy.tokens import Doc

words = ["Oh", ",", "really", "?", "!"]
spaces = [False, True, False, False, False]

doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Oh, really?!
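
The three cells above suggest a simple rule: each word is followed by a single space exactly when its entry in `spaces` is `True`. The plain-Python sketch below reproduces the three texts under that assumption. The helper `join_words` is hypothetical and not part of spaCy.

```python
def join_words(words, spaces):
    """Rebuild text: each word is followed by one space if its flag is True."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(
        word + (" " if space else "") for word, space in zip(words, spaces)
    )

print(join_words(["spaCy", "is", "interesting", "!"], [True, True, False, False]))
print(join_words(["Go", ",", "get", "started", "!"], [False, True, True, False, False]))
print(join_words(["Oh", ",", "really", "?", "!"], [False, True, False, False, False]))
```

Writing out the flags this way makes it easier to spot mistakes such as a missing space after a comma.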


### 2.1 Docs, spans and entities from scratch

Let's create the `Doc` and `Span` objects manually and update the named entities, just like spaCy does behind the scenes.

In [47]:
import spacy
nlp = spacy.blank("en")

# Import Doc and Span classes
from spacy.tokens import Doc, Span

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Create doc from words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span_with_label = Span(doc, 2, 4, label="PERSON")
print(span_with_label.text, span_with_label.label_)
doc.ents = [span_with_label]

print([(ent.text, ent.label_) for ent in doc.ents])

I like David Bowie
David Bowie PERSON
[('David Bowie', 'PERSON')]


In [42]:
# Each entity is a Span: print its label, token boundaries and text
for ent in doc.ents:
  print(ent.label_, ent.start, ent.end, ent.text)

PERSON 2 4 David Bowie


In [46]:
doc[2:4]

David Bowie

### 2.2 Data structures and best practices

* `token.i` gives the token's index in the `Doc`
* `token.pos_` gives the part-of-speech tag
* `token.dep_` gives the dependency label of the token

In [58]:
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("Berlin looks like a nice city")

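for token in doc:
  # Check for a proper noun and make sure a next token exists
  if token.pos_ == "PROPN" and token.i + 1 < len(doc):
    # Check whether the next token is a verb
    if doc[token.i + 1].pos_ == "VERB":
      result = doc[token.i + 1]
      print(f"Found proper noun before a verb: {result}")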

Found proper noun before a verb: looks


In [60]:
for token in doc:
  print(token.dep_)

nsubj
ROOT
prep
det
amod
pobj


## 3. Word vectors and semantic similarity

* spaCy can compare two objects and predict similarity
* `Doc.similarity()`, `Span.similarity()` and `Token.similarity()`
* Take another object and return a similarity score (`0` to `1`)
* Needs a pipeline that has word vectors included
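
As a rough picture of what such a score means: by default spaCy compares vectors with cosine similarity, and the vector of a `Doc` or `Span` defaults to the average of its token vectors. The sketch below reproduces that arithmetic in plain Python. The three-dimensional vectors are made up for illustration (assumption), and the helpers `cosine` and `average` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average(vectors):
    """Element-wise mean of several equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Made-up 3-dimensional vectors (assumption: real vectors have many more dimensions)
toy_vectors = {
    "warm": [0.9, 0.1, 0.0],
    "sunny": [0.8, 0.3, 0.1],
    "books": [0.0, 0.2, 0.9],
}

# Token-level comparison
print(round(cosine(toy_vectors["warm"], toy_vectors["sunny"]), 3))
print(round(cosine(toy_vectors["warm"], toy_vectors["books"]), 3))

# Phrase-level comparison: average the token vectors first
phrase = average([toy_vectors["warm"], toy_vectors["sunny"]])
print(round(cosine(phrase, toy_vectors["books"]), 3))
```

Vectors pointing in similar directions score close to 1, while unrelated directions score close to 0.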

### 3.1 Inspecting word vectors

We'll use a larger English pipeline, which includes around 20,000 word vectors.

In [64]:
!python -m spacy download en_core_web_md

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4 MB)
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... done
  Created wheel for en-core-web-md: filename=en_core_web_md-2.2.5-py3-none-any.whl size=98051301 sha256=3c756321c345f2597ffb9e911504fdf678a8fe2fc1a84f18785af2fec7bec5a9
  Stored in directory: /tmp/pip-ephem-wheel-cache-dsqw1c39/wheels/69/c5/b8/4f1c029d89238734311b3269762ab2ee325a42da2ce8edb997
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [70]:
spacy.cli.download("en_core_web_md")

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [71]:
import spacy
nlp = spacy.load("en_core_web_md")

doc = nlp("Two bananas in pyjamas")
bananas_vector = doc[1].vector
print(bananas_vector)

[-2.2009e-01 -3.0322e-02 -7.9859e-02 -4.6279e-01 -3.8600e-01  3.6962e-01
 -7.7178e-01 -1.1529e-01  3.3601e-02  5.6573e-01 -2.4001e-01  4.1833e-01
  1.5049e-01  3.5621e-01 -2.1508e-01 -4.2743e-01  8.1400e-02  3.3916e-01
  2.1637e-01  1.4792e-01  4.5811e-01  2.0966e-01 -3.5706e-01  2.3800e-01
  2.7971e-02 -8.4538e-01  4.1917e-01 -3.9181e-01  4.0434e-04 -1.0662e+00
  1.4591e-01  1.4643e-03  5.1277e-01  2.6072e-01  8.3785e-02  3.0340e-01
  1.8579e-01  5.9999e-02 -4.0270e-01  5.0888e-01 -1.1358e-01 -2.8854e-01
 -2.7068e-01  1.1017e-02 -2.2217e-01  6.9076e-01  3.6459e-02  3.0394e-01
  5.6989e-02  2.2733e-01 -9.9473e-02  1.5165e-01  1.3540e-01 -2.4965e-01
  9.8078e-01 -8.0492e-01  1.9326e-01  3.1128e-01  5.5390e-02 -4.2423e-01
 -1.4082e-02  1.2708e-01  1.8868e-01  5.9777e-02 -2.2215e-01 -8.3950e-01
  9.1987e-02  1.0180e-01 -3.1299e-01  5.5083e-01 -3.0717e-01  4.4201e-01
  1.2666e-01  3.7643e-01  3.2333e-01  9.5673e-02  2.5083e-01 -6.4049e-02
  4.2143e-01 -1.9375e-01  3.8026e-01  7.0883e-03 -2

### 3.2 Comparing similarities

In [73]:
# Doc similarities
import spacy
nlp = spacy.load("en_core_web_md")
doc1 = nlp("It's a warm summer day")
doc2 = nlp("It's sunny outside")

# Get similarity 
similarity = doc1.similarity(doc2)
print(similarity)

0.8789265574516525


In [74]:
# Token similarity
doc = nlp("TV and books")
token1, token2 = doc[0], doc[2]

similarity = token1.similarity(token2)
print(similarity)

0.22325331


In [80]:
# Span similarity
doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")
span1 = doc[3:5]
span2 = doc[12:15]

similarity = span1.similarity(span2)
print(similarity)

0.75173926


For custom NLP problems, we might have to train our own vectors.

## 4. Combining predictions and rules

| |Statistical models|Rule-based systems|
|-------|------------------|----------|
|Use cases|application needs to generalize based on examples|dictionary with finite number of examples|
|Real-world examples|product names, person names, subject/object relationships|countries of the world, cities, drug names, dog breeds|
|spaCy features|entity recognizer, dependency parser, part-of-speech tagger|tokenizer, `Matcher`, `PhraseMatcher`|

In [83]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(
    "Twitch Prime, the perks program for Amazon Prime members offering free "
    "loot, games and other benefits, is ditching one of its best features: "
    "ad-free viewing. According to an email sent out to Amazon Prime members "
    "today, ad-free viewing will no longer be included as a part of Twitch "
    "Prime for new members, beginning on September 14. However, members with "
    "existing annual subscriptions will be able to continue to enjoy ad-free "
    "viewing until their subscription comes up for renewal. Those with "
    "monthly subscriptions will have access to ad-free viewing until October 15."
)

# Create the match patterns
pattern1 = [{"LOWER": "amazon"}, {"IS_TITLE": True, "POS": "PROPN"}]
pattern2 = [{"LOWER": "ad"}, {"TEXT": "-"}, {"LOWER": "free"}, {"POS": "NOUN"}]

# Initialize the Matcher and add the patterns
matcher = Matcher(nlp.vocab)
matcher.add("PATTERN1", [pattern1])
matcher.add("PATTERN2", [pattern2])

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
