# Chapter 1: Finding words, phrases, names and concepts

This chapter will introduce you to the basics of text processing with spaCy. You'll learn about the data structures, how to work with statistical models, and how to use them to predict linguistic features in your text.

### 1.1 Getting Started

Let’s get started and try out spaCy! You’ll be able to try out some of the 55+ available languages.

- Import the English class from spacy.lang.en and create the `nlp` object.
- Let's create a `doc` and print its `text`.

In [1]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

# Process a text
doc = nlp("This is a sentence.")

# Print the document text

# .text is an attribute (not a method) that holds the original text of the doc,
# in this case the sentence
print(doc.text)

This is a sentence.
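
As a small aside, the same kind of blank pipeline can also be created from a language code. The sketch below assumes `spacy.blank`, which is not used elsewhere in this chapter, and checks that it gives an `English` object that tokenizes the example sentence the same way.

```python
import spacy
from spacy.lang.en import English

# spacy.blank looks up the language class by its code ("en" for English)
nlp_blank = spacy.blank("en")
nlp_class = English()

# The blank pipeline is an instance of the English language class
print(isinstance(nlp_blank, English))

# Both objects split the example sentence into the same tokens
tokens_blank = [token.text for token in nlp_blank("This is a sentence.")]
tokens_class = [token.text for token in nlp_class("This is a sentence.")]
print(tokens_blank)
print(tokens_blank == tokens_class)
```

Using the language code can be handy when the language is only known at runtime, for example when it is read from a configuration value.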


#### Exercise 1.1: Repeat chapter 1.1 with Finnish.

In [3]:
# Exercise 1.1:
# -----------------
# Repeat chapter 1.1 with Finnish.
from spacy.lang.fi import Finnish

nlp = Finnish()

doc_f = nlp("Tämä on lause.")
print(doc_f.text)

Tämä on lause.


### 1.2 Documents, spans and tokens 

When you call `nlp` on a string, spaCy first tokenizes the text and creates a document object. In this exercise, you’ll learn more about the `Doc`, as well as its views `Token` and `Span`.

**Step 1**

- Let's import the English language class and create the `nlp` object.
- Process the text and instantiate a Doc object in the variable doc.
- Select the first token of the Doc and print its text.

In [4]:
# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

I


**Step 2**

- Import the English language class and create the nlp object.
- Process the text and instantiate a Doc object in the variable doc.
- Create a slice of the Doc for the tokens “tree kangaroos” and “tree kangaroos and narwhals”.

In [5]:
# Import the English language class and create the nlp object
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos
tree kangaroos and narwhals
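
To make the idea of `Token` and `Span` as views of the `Doc` more concrete, here is a minimal sketch that inspects their positions. It uses the same sentence as above and only standard attributes (`Token.i`, `Span.start`, `Span.end`).

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I like tree kangaroos and narwhals.")

# The Doc holds seven tokens, including the final "."
print(len(doc))

# A Token knows its own position in the parent Doc
first_token = doc[0]
print(first_token.i, first_token.text)

# A Span records the start and end token indices of the slice
span = doc[2:4]
print(span.start, span.end, len(span))

# Indexing into the Span yields tokens that keep their Doc-level index
print(span[0].i, span[0].text)
```

Note that the end index of a slice is exclusive, which is why `doc[2:4]` covers only tokens 2 and 3.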


### 1.3 Lexical attributes

In this example, you’ll use spaCy’s Doc and Token objects, and lexical attributes to find percentages in a text. You’ll be looking for two consecutive tokens: a number and a percent sign.

- Use the like_num token attribute to check whether a token in the doc resembles a number.
- Get the token following the current token in the document. The index of the next token in the doc is token.i + 1.
- Check whether the next token’s text attribute is a percent sign "%".

In [6]:
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        # token.i + 1 is the index of the next token
        next_token = doc[token.i + 1]
        # Check if the next token's text equals "%"
        if next_token.text == "%":
            print("Percentage found:", token.text)
        

Percentage found: 60
Percentage found: 4
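
One detail worth keeping in mind: looking up `doc[token.i + 1]` fails with an `IndexError` if a number happens to be the last token of the text. The following is a minimal sketch of a guarded variant; the helper name `find_percentages` is hypothetical and only used for illustration.

```python
from spacy.lang.en import English

nlp = English()


def find_percentages(doc):
    """Return the text of every number token directly followed by '%'."""
    found = []
    for token in doc:
        # Only look ahead if a next token exists (guards against IndexError)
        if token.like_num and token.i + 1 < len(doc):
            if doc[token.i + 1].text == "%":
                found.append(token.text)
    return found


text = "In 1990, more than 60% of people were poor. Now less than 4% are."
print(find_percentages(nlp(text)))

# A number at the very end of the text has no following token
print(find_percentages(nlp("The answer is 42")))
```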


#### Exercise 1.2: Repeat chapter 1.3 with Finnish.

In [11]:
# Exercise 1.2:
# -----------------
# Repeat chapter 1.3 with Finnish.

from spacy.lang.fi import Finnish

nlp = Finnish()

doc_f = nlp(
    "Vuonna 1990 yli 60% Itä-Aasian ihmisistä oli äärimmäisessä köyhyydessä. "
    "Nykyään alle 4%."
)

# A second, longer text; this is the one processed in the loop below
doc_f2 = nlp(
    "Yli puolet 60% suomalaisista ajattelee, että lihan syömiseen suhtaudutaan tällä hetkellä liian tuomitsevasti. "
    "Lähes yhtä moni 58% suomalaisista kokee, että lihansyönnin vähentämiselle asetetaan yhteiskunnallisia ja sosiaalisia paineita."
)

# Find the percentages in the Finnish text with a simple loop

# Iterate over the tokens in the doc (the text)
for token in doc_f2:
    # If the token resembles a number, take a closer look at it
    if token.like_num:
        # The next token is at index token.i + 1
        next_token = doc_f2[token.i + 1]
        # If the next token is a percent sign ...
        # ... we have found a percentage in the text
        if next_token.text == "%":
            print("Percentage found:", token.text)

Percentage found: 60
Percentage found: 58


### 1.4 Loading Models

- Use `spacy.load` to load the small English model `"en_core_web_sm"`.
- Process the text and print the document text.

In [13]:
import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value
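
For contrast, it may help to see what a pipeline without a loaded model provides. This is a minimal sketch, assuming a blank English pipeline created with `spacy.blank` (not part of the exercise): it still tokenizes, but it has no components that predict tags or entities.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components
blank_nlp = spacy.blank("en")
doc = blank_nlp("Apple is the first U.S. public company to reach a $1 trillion market value")

# No pipeline components have been added
print(blank_nlp.pipe_names)

# Tokenization still works ...
print(len(doc) > 0)

# ... but nothing has been predicted: no part-of-speech tags and no entities
print(all(token.pos_ == "" for token in doc))
print(len(doc.ents))
```

The predictions come from the model package loaded with `spacy.load`, which is why the language class alone is not enough for the exercises that follow.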


### 1.5 Predicting linguistic annotations


You’ll now get to try one of spaCy’s pre-trained model packages and see its predictions in action. Feel free to try it out on your own text! To find out what a tag or label means, you can call spacy.explain in the loop. For example: spacy.explain("PROPN") or spacy.explain("GPE").

**Part 1**

- Process the text with the nlp object and create a doc.
- For each token, print the token text, the token’s .pos_ (part-of-speech tag) and the token’s .dep_ (dependency label).


In [14]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}")

It          PRON      nsubj     
’s          VERB      ccomp     
official    ADJ       acomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          AUX       ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


#### Exercise 1.3: Print explanations for PROPN, nsubj and quantmod. 

In [16]:
# Exercise 1.3:
# -----------------
# Print explanations for PROPN, nsubj and quantmod. 
print(spacy.explain("PROPN"))
print(spacy.explain("nsubj"))
print(spacy.explain("quantmod"))

proper noun
nominal subject
modifier of quantifier
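
`spacy.explain` only knows the labels in spaCy's glossary and returns `None` for anything else (some spaCy versions may also emit a warning). A minimal sketch of a wrapper with a fallback text follows; the helper name `describe` and the made-up label are assumptions for illustration.

```python
import spacy


def describe(label):
    """Return spaCy's explanation for a label, or a fallback if none exists."""
    explanation = spacy.explain(label)
    # spacy.explain returns None for labels missing from its glossary
    if explanation is None:
        return f"(no explanation for {label!r})"
    return explanation


# "NOT_A_REAL_LABEL" is a made-up label used to trigger the fallback
for label in ["PROPN", "nsubj", "NOT_A_REAL_LABEL"]:
    print(f"{label:<18}{describe(label)}")
```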


**Part 2**

- Process the text and create a doc object.
- Iterate over the doc.ents and print the entity text and label_ attribute.

In [17]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


### 1.6 Predicting named entities in context

Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you’re processing. Let’s take a look at an example.

- Process the text with the nlp object.
- Iterate over the entities and print the entity text and label.
- Looks like the model didn’t predict “iPhone X”. Create a span for those tokens manually.

In [18]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)

# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print("Missing entity:", iphone_x.text)

Apple ORG
Missing entity: iPhone X
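
Going one step further than the exercise, a manually created span can also be given a label and stored in `doc.ents`. The sketch below is one possible way to do this; the label `"PRODUCT"` is an assumption chosen for illustration, and a blank pipeline is used because the entity is added by hand.

```python
from spacy.lang.en import English
from spacy.tokens import Span

# A blank pipeline is enough here, since no entity is predicted
nlp = English()
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Create a labelled span over tokens 1 and 2; "PRODUCT" is an assumed label
iphone_x = Span(doc, 1, 3, label="PRODUCT")

# Keep any existing entities and add the new one
doc.ents = list(doc.ents) + [iphone_x]

print([(ent.text, ent.label_) for ent in doc.ents])
```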


### 1.7 Using the matcher

Let’s try spaCy’s rule-based Matcher. You’ll use the example from the previous exercise and write a pattern that matches the phrase “iPhone X” in the text.

- Import the Matcher from spacy.matcher.
- Initialize it with the nlp object’s shared vocab.
- Create a pattern that matches the "TEXT" values of two tokens: "iPhone" and "X".
- Use the matcher.add method to add the pattern to the matcher.
- Call the matcher on the doc and store the result in the variable matches.
- Iterate over the matches and get the matched span from the start to the end index.

#### Exercise 1.4: Write matcher for "iPhone" followed by "X". 

In [5]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Exercise 1.4: Write matcher for "iPhone" followed by "X"
# -----------
# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [[{"LOWER": "iphone"}, {"LOWER": "x"}]]
# Option 2 (exact, case-sensitive match):
# pattern = [[{"TEXT": "iPhone"}, {"TEXT": "X"}]]

# Both options match "iPhone X" in this text

# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", pattern)

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone X']
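
The difference between the `"TEXT"` and `"LOWER"` attributes shows up once the text contains different capitalizations. Here is a minimal sketch on a made-up sentence; the helper `matched_texts` and the pattern names are hypothetical. It also looks up the pattern name behind a match ID in the string store.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
doc = nlp("I sold my iphone x and bought an iPhone X yesterday")


def matched_texts(matcher, doc):
    """Run the matcher and return the matched span texts in document order."""
    matches = sorted(matcher(doc), key=lambda m: (m[1], m[2]))
    return [doc[start:end].text for _, start, end in matches]


# "TEXT" compares the exact token text, so it is case-sensitive
exact = Matcher(nlp.vocab)
exact.add("IPHONE_X_EXACT", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

# "LOWER" compares the lowercase form of the token, so it ignores case
lower = Matcher(nlp.vocab)
lower.add("IPHONE_X_LOWER", [[{"LOWER": "iphone"}, {"LOWER": "x"}]])

print(matched_texts(exact, doc))
print(matched_texts(lower, doc))

# The match ID is a hash; the string store maps it back to the pattern name
match_id, start, end = exact(doc)[0]
print(nlp.vocab.strings[match_id])
```

Because both attributes are lexical, this works on a blank pipeline without a loaded model.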


### 1.8 Writing match patterns

In this exercise, you’ll practice writing more complex match patterns using different token attributes and operators.

**Part 1**

Write one pattern that only matches mentions of the full iOS versions: “iOS 7”, “iOS 11” and “iOS 10”.

In [34]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [[{"TEXT": "iOS"}, {"IS_DIGIT": True}]]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10
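
The choice of `"IS_DIGIT"` matters here. As a sketch on a made-up sentence, the comparison below contrasts it with `"LIKE_NUM"`, which also accepts number words. The helper `matched_texts` and the pattern names are hypothetical.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
doc = nlp("She moved from iOS 10 to iOS eleven last week")


def matched_texts(matcher, doc):
    """Run the matcher and return the matched span texts in document order."""
    matches = sorted(matcher(doc), key=lambda m: (m[1], m[2]))
    return [doc[start:end].text for _, start, end in matches]


# IS_DIGIT is true only for tokens that consist of digits
digit_matcher = Matcher(nlp.vocab)
digit_matcher.add("IOS_DIGIT", [[{"TEXT": "iOS"}, {"IS_DIGIT": True}]])

# LIKE_NUM is also true for number words such as "eleven"
like_num_matcher = Matcher(nlp.vocab)
like_num_matcher.add("IOS_LIKE_NUM", [[{"TEXT": "iOS"}, {"LIKE_NUM": True}]])

print(matched_texts(digit_matcher, doc))
print(matched_texts(like_num_matcher, doc))
```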


**Part 2**

Write one pattern that only matches forms of “download” (tokens with the lemma “download”), followed by a token with the part-of-speech tag "PROPN" (proper noun).

In [35]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [[{"LEMMA": "download"}, {"POS": "PROPN"}]]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


#### Exercise 1.5: Write matcher

Write one pattern that matches adjectives ("ADJ") followed by one or two "NOUN"s (one noun and one optional noun).

In [39]:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)


# Exercise 1.5: Write pattern and use matcher with pattern to find matches
# ----------------

# Write a pattern for adjective plus one or two nouns
pattern = [[{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJECTIVE_AND_NOUN(S)", pattern)
matches = matcher(doc)


print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)



Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses
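
With an optional token, the matcher reports both the shorter and the longer match, as with "optional voice" and "optional voice responses" above. If only the longest match is wanted, `spacy.util.filter_spans` is one option. The sketch below runs on a blank pipeline, so it uses lexical attributes as a stand-in for the part-of-speech tags; the pattern and its name are assumptions for illustration.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = English()
doc = nlp("The app offers optional voice responses")

# "optional", then one required alphabetic token, then one optional token
matcher = Matcher(nlp.vocab)
matcher.add(
    "OPTIONAL_PHRASE",
    [[{"LOWER": "optional"}, {"IS_ALPHA": True}, {"IS_ALPHA": True, "OP": "?"}]],
)

# The matcher returns both overlapping candidates
spans = [doc[start:end] for _, start, end in matcher(doc)]
all_texts = sorted(span.text for span in spans)
print(all_texts)

# filter_spans prefers the longest span among overlapping ones
longest_texts = [span.text for span in filter_spans(spans)]
print(longest_texts)
```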


# Reflection
1. What is spaCy?
    - As in most other fields, there are libraries that make the work easier, and natural language processing is no exception. The library used here is called spaCy.
    - spaCy is a library designed for natural language processing. It provides many functions, for example for processing data, and it makes natural language processing in a Python environment considerably easier and faster.
2. Why are you not able to repeat parts 1.4 - 1.8 with Finnish?
    - At the time of writing, no pipeline package has been created for Finnish. Nobody has taken on that work yet.
3. What’s not included in a model package that you can load into spaCy?
    - A meta file including the language, pipeline and license.
    - Binary weights to make statistical predictions.
    - **The labelled data that the model was trained on.** <= Not included
    - Strings of the model's vocabulary and their hashes. 
4. What is `nlp`?
    - `nlp` is the language pipeline object. Calling it on a text tokenizes the text and returns a `Doc` object.
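
The last two answers can be checked with a short sketch, assuming a blank English pipeline: `nlp` is a `Language` object that returns a `Doc` when called, and its vocabulary stores strings together with their hashes.

```python
from spacy.lang.en import English
from spacy.language import Language
from spacy.tokens import Doc

nlp = English()

# nlp is a Language object, and calling it on a text returns a Doc
doc = nlp("I love coffee")
print(isinstance(nlp, Language), isinstance(doc, Doc))

# The shared string store maps a string to its hash and back again
coffee_hash = nlp.vocab.strings["coffee"]
print(nlp.vocab.strings[coffee_hash] == "coffee")

# Tokens expose the same hash through their .orth attribute
print(doc[2].orth == coffee_hash)
```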
