# NLP Lab Tutorial

In [6]:
import spacy 
nlp = spacy.load("en_core_web_sm") # load the small English model

* When you process a text with the nlp object, spaCy creates a Doc object – short for "document". The Doc lets you access information about the text in a structured way, and no information is lost.

* The Doc behaves like a normal Python sequence: you can iterate over its tokens or get a token by its index.

In [4]:
doc = nlp("Hello World!")
for token in doc:
    print(f"{token.text} -> {token.pos_} ({token.dep_})")

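token = doc[1]
# token.tensor is this token's slice of doc.tensor (the dense, context-sensitive output of the pipeline's tok2vec component), not a static word embedding from the vocabulary.
# Only the last expression of a cell is displayed, so the tensor itself does not appear in the output.
token.tensor
token.text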


Hello -> INTJ (ROOT)
World -> PROPN (npadvmod)
! -> PUNCT (punct)


'World'

In [5]:
print("Index:   ", [token.i for token in doc])
print("Text:    ", [token.text for token in doc])

print("is_alpha:", [token.is_alpha for token in doc]) # .is_alpha checks if the token consists of alphabetic characters.
print("is_punct:", [token.is_punct for token in doc]) # .is_punct checks if the token is a punctuation mark.
print("like_num:", [token.like_num for token in doc]) # .like_num checks if the token looks like a number (e.g., "3", "3.14", "three").

Index:    [0, 1, 2]
Text:     ['Hello', 'World', '!']
is_alpha: [True, True, False]
is_punct: [False, False, True]
like_num: [False, False, False]
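
The lexical attributes above depend only on the token text, so they do not need a trained pipeline. As a minimal sketch, assuming a blank English pipeline and a made-up example sentence (neither is part of the original cell), `like_num` picks up both digits and spelled-out numbers:

```python
import spacy

# A blank pipeline only tokenizes; lexical attributes still work.
nlp = spacy.blank("en")

# Hypothetical example sentence, not taken from the tutorial.
doc = nlp("I bought three apples for 3.14 dollars")

# Collect the tokens that resemble a number.
number_like = [token.text for token in doc if token.like_num]
print(number_like)

# Purely alphabetic tokens, for comparison.
alphabetic = [token.text for token in doc if token.is_alpha]
print(alphabetic)
```

Because no model is loaded, this runs without downloading `en_core_web_sm`.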


In [5]:
# Import spaCy
import spacy


# Download
!python -m spacy download en_core_web_sm


# TODO: Create the English nlp object [X]
nlp = spacy.load("en_core_web_sm")


# TODO: Process a text
doc = nlp("This is part of the NLPLAB Tutorial")

# Print the document text
print(doc.text)

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)


This is part of the NLPLAB Tutorial


In [3]:
!python -m spacy download de_core_news_sm

Collecting de-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.8.0/de_core_news_sm-3.8.0-py3-none-any.whl (14.6 MB)


In [7]:
# Import spaCy
import spacy

# TODO: Create the German nlp object using "de_core_news_sm" [X]
nlp = spacy.load("de_core_news_sm")
# Process a text (this is German for: "Kind regards!")
doc = nlp("Liebe Grüße!")

# Print the document text
print(doc.text)

Liebe Grüße!


In [8]:
!python -m spacy download es_core_news_sm

Collecting es-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.8.0/es_core_news_sm-3.8.0-py3-none-any.whl (12.9 MB)


In [9]:
# Import spaCy
import spacy

# TODO: Create the Spanish nlp object [X]
nlp = spacy.load("es_core_news_sm")

# Process a text (this is Spanish for: "How are you?")
doc = nlp("¿Cómo estás?")

# Print the document text
print(doc.text)

¿Cómo estás?


When you call nlp on a string, spaCy first tokenizes the text and creates a document object. In this exercise, you’ll learn more about the Doc, as well as its views Token and Span.

Step 1

Use spacy.blank to create the English nlp object.
Process the text and instantiate a Doc object in the variable doc.
Select the first token of the Doc and print its text.

In [10]:
# Import spaCy and use spacy.blank to create the English nlp object
import spacy
nlp = spacy.blank("en")

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# TODO: Select the first token [X]
first_token = doc[0]

# Print the first token's text
print(first_token.text)

I


In [19]:
# Import spaCy and create the English nlp object
import spacy

nlp = spacy.blank("en")

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos
tree kangaroos and narwhals
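
A slice of a Doc is a `Span`, which is a view onto the Doc rather than a copy. The following sketch, which reuses the sentence above with a blank English pipeline, inspects the span's boundaries and shows that its tokens keep their position in the parent Doc:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("I like tree kangaroos and narwhals.")

# Slicing a Doc returns a Span object.
span = doc[2:4]
print(type(span).__name__)

# start is inclusive, end is exclusive, both are token indices in the Doc.
print(span.start, span.end)

# The span contains two tokens.
print(len(span))

# Indexing into the span is relative to the span, but token.i is the index in the Doc.
print(span[0].text, span[0].i)
```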


In this example, you’ll use spaCy’s Doc and Token objects and their lexical attributes to find percentages in a text. You’ll be looking for two consecutive tokens: a number and a percent sign.

* Use the like_num token attribute to check whether a token in the doc resembles a number.
* Get the token following the current token in the document. The index of the next token in the doc is token.i + 1.
* Check whether the next token’s text attribute is a percent sign "%".

In [15]:
import spacy

nlp = spacy.blank("en")

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number and is not the last token
    if token.like_num and token.i + 1 < len(doc):
        # TODO: Get the next token in the document [X]
        next_token = doc[token.i + 1]
        # Check if the next token's text equals "%"
        if next_token.text == "%":
            print("Percentage found:", token.text)

Percentage found: 60
Percentage found: 4
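
The same check can be wrapped in a reusable function. This is only a sketch: the name `find_percentages` is hypothetical, and the second call uses a made-up sentence to show that a number at the very end of a text does not cause an index error.

```python
import spacy


def find_percentages(doc):
    """Return the text of every number-like token directly followed by "%"."""
    found = []
    for token in doc:
        next_index = token.i + 1
        # Guard against running past the end of the doc before looking ahead.
        if token.like_num and next_index < len(doc) and doc[next_index].text == "%":
            found.append(token.text)
    return found


nlp = spacy.blank("en")
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)
percentages = find_percentages(doc)
print(percentages)

# A number as the last token has no following token and is simply skipped.
no_percentages = find_percentages(nlp("The total is 42"))
print(no_percentages)
```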


In [17]:
import spacy

# TODO: Load the small English pipeline [X]
nlp = spacy.load("en_core_web_sm")

# Process a text
doc = nlp("She ate the pizza")

# Iterate over the tokens
for token in doc:
    # TODO: Print the text and the predicted part-of-speech tag [X]
    print(token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN


In [22]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


In [18]:
# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # TODO: Print the entity text and its label [X]
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


In [25]:
print(spacy.explain("GPE"))
print(spacy.explain("ORG"))
print(spacy.explain("dobj"))

Countries, cities, states
Companies, agencies, institutions, etc.
direct object
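
When several tags and labels need explaining at once, the calls can be collected in a loop. A small sketch, assuming the three labels below are the ones of interest:

```python
import spacy

# Labels chosen for illustration: a part-of-speech tag, an entity label and a dependency label.
labels = ["PROPN", "GPE", "dobj"]

# Build a lookup table from label to its description.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:<6} {description}")
```

`spacy.explain` only consults a built-in glossary, so no pipeline has to be loaded for this to work.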


In [19]:
import spacy

# Load the "en_core_web_sm" pipeline
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# TODO: Process the text [X]
doc = nlp(text)

# TODO: Print the document text [X]
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value


* You’ll now get to try one of spaCy’s trained pipeline packages and see its predictions in action. Feel free to try it out on your own text! To find out what a tag or label means, you can call spacy.explain in the loop. For example: spacy.explain("PROPN") or spacy.explain("GPE").

Part 1

* Process the text with the nlp object and create a doc.
* For each token, print the token text, the token’s .pos_ (part-of-speech tag) and the token’s .dep_ (dependency label).

In [33]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    # Use the underscore attribute: token.dep is the integer ID, token.dep_ is the string label
    token_dep = token.dep_
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")

It          PRON      nsubj
’s          VERB      ccomp
official    NOUN      attr
:           PUNCT     punct
Apple       PROPN     nsubj
is          AUX       ROOT
the         DET       det
first       ADJ       amod
U.S.        PROPN     nmod
public      ADJ       amod
company     NOUN      attr
to          PART      aux
reach       VERB      relcl
a           DET       det
$           SYM       quantmod
1           NUM       compound
trillion    NUM       nummod
market      NOUN      compound
value       NOUN      dobj
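
Token attributes without a trailing underscore, such as `token.dep` or `token.pos`, return integer IDs, while the underscore variants (`token.dep_`, `token.pos_`) return the readable string. The sketch below assumes a blank English pipeline, so no download is needed, and shows how the shared string store in `nlp.vocab.strings` converts between the two. The label `MY_LABEL` is made up for illustration.

```python
import spacy

nlp = spacy.blank("en")
strings = nlp.vocab.strings

# Looking up a string returns its integer ID.
nsubj_id = strings["nsubj"]
print(nsubj_id)

# Looking up the ID returns the original string.
print(strings[nsubj_id])

# A new string is added to the store first; its ID can then be resolved back.
# "MY_LABEL" is a hypothetical label, not one used by any trained pipeline.
custom_id = strings.add("MY_LABEL")
print(strings[custom_id])
```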


* Part 2

* Process the text and create a doc object.
* Iterate over the doc.ents and print the entity text and label_ attribute.

In [20]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # TODO: Print the entity text and its label [X]
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


* Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you’re processing. Let’s take a look at an example.

* Process the text with the nlp object.
* Iterate over the entities and print the entity text and label.
* Looks like the model didn’t predict “iPhone X”. Create a span for those tokens manually.

In [21]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"

# Process the text
doc = nlp(text)

# Iterate over the entities (doc.ents holds the named entities the model recognized in the text)
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# TODO: Get the span for "iPhone X" [X]
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

Apple ORG
Missing entity: iPhone X
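
A manually created span can also carry an entity label and be written back to `doc.ents`. The following is a sketch under two assumptions: it uses a blank English pipeline (which predicts no entities of its own), and the label `PRODUCT` is chosen purely for illustration.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline: tokenization only, doc.ents starts out empty.
nlp = spacy.blank("en")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Create a labeled span over tokens 1 and 2 ("iPhone X").
# "PRODUCT" is an assumed label for this example.
iphone_x = Span(doc, 1, 3, label="PRODUCT")

# Add the span to the entities that are already on the doc.
doc.ents = list(doc.ents) + [iphone_x]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

With a trained pipeline, the same assignment would keep the predicted entities and add the missing one, as long as the spans do not overlap.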


### Rule-based matching



* Let’s try spaCy’s rule-based Matcher. You’ll use the example from the previous exercise and write a pattern that matches the phrase “iPhone X” in the text.

* Import the Matcher from spacy.matcher.
* Initialize it with the nlp object’s shared vocab.
* Create a pattern that matches the "TEXT" values of two tokens: "iPhone" and "X".
* Use the matcher.add method to add the pattern to the matcher.
* Call the matcher on the doc and store the result in the variable matches.
* Iterate over the matches and get the matched span from the start to the end index.

In [22]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(doc.vocab)

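# TODO: Create a pattern matching two tokens: "iPhone" and "X" [X]
# "LOWER" compares the lowercased token text, so this also matches "iphone x"; use "TEXT" for an exact-case match
pattern = [{"LOWER": "iphone"}, {"LOWER": "x"}]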

# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", [pattern])

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone X']
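
Each match is a tuple of `(match_id, start, end)`, where `match_id` is the hash of the rule name passed to `matcher.add`. The sketch below contrasts an exact `"TEXT"` pattern with a case-insensitive `"LOWER"` pattern and resolves each `match_id` back to its rule name. The rule names and the example sentence are made up, and a blank English pipeline is assumed since these patterns only rely on token text.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical rule names: one exact-case pattern, one case-insensitive pattern.
matcher.add("IPHONE_EXACT", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])
matcher.add("IPHONE_ANY_CASE", [[{"LOWER": "iphone"}, {"LOWER": "x"}]])

doc = nlp("I wanted an iphone x but bought the iPhone X instead")

results = []
for match_id, start, end in matcher(doc):
    # The string store maps the match_id hash back to the rule name.
    rule_name = nlp.vocab.strings[match_id]
    results.append((rule_name, doc[start:end].text))
    print(rule_name, "->", doc[start:end].text)
```

The lowercase mention is found only by the `"LOWER"` rule, while the correctly cased mention is found by both.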


### Writing match patterns

* In this exercise, you’ll practice writing more complex match patterns using different token attributes and operators.

* Part 1

* Write one pattern that only matches mentions of the full iOS versions: “iOS 7”, “iOS 11” and “iOS 10”

In [37]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]


# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


* Part 2

* Write one pattern that only matches forms of “download” (tokens with the lemma “download”), followed by a token with the part-of-speech tag "PROPN" (proper noun).



In [23]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# TODO: Write a pattern that matches a form of "download" plus proper noun [x]
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

# TODO: Add the pattern to the matcher and apply the matcher to the doc [X]
matcher.add("DOWNLOAD_THINGS_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


* Part 3

* Write one pattern that matches adjectives ("ADJ") followed by one or two "NOUN"s (one noun and one optional noun).

In [24]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# TODO: Write a pattern for adjective plus one or two nouns [X]
# "OP": "?" makes the second noun optional, so the pattern matches an adjective followed by one or two nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", [pattern])
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses
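
Because the last token is optional, the matcher reports both "optional voice" and "optional voice responses". If only the longest match is wanted, one option is `spacy.util.filter_spans`, which removes overlapping spans and prefers longer ones. The sketch below assumes a blank English pipeline, so it matches on lowercased text instead of part-of-speech tags; the rule name and the shortened sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# The third token is optional, so the pattern matches two or three tokens.
pattern = [
    {"LOWER": "optional"},
    {"LOWER": "voice"},
    {"LOWER": "responses", "OP": "?"},
]
matcher.add("OPTIONAL_VOICE", [pattern])

doc = nlp("The app offers automatic labels and optional voice responses.")

# All matches, including the shorter one contained in the longer one.
spans = [doc[start:end] for _, start, end in matcher(doc)]
print([span.text for span in spans])

# Keep only non-overlapping spans, preferring the longest.
longest = filter_spans(spans)
print([span.text for span in longest])
```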
