# **Session 1 NLP Practice**

This notebook provides a practical introduction to Natural Language Processing (NLP) concepts using spaCy. It demonstrates how to load and process text, perform tokenization and sentence segmentation, and extract linguistic attributes such as POS tags, dependency tags, and lemmas. It also covers named entity recognition and pattern matching with spaCy's Matcher to identify specific phrases within the text. The exercises are applied to the classic tale "The Three Little Pigs," obtained from [Project Gutenberg](https://www.gutenberg.org/cache/epub/18155/pg18155.txt).

***Note:** Comments, spaCy version, and the story used will be in English.*

In [1]:
import pkg_resources
import warnings

warnings.filterwarnings('ignore')

installed_packages = [package.key for package in pkg_resources.working_set]
IN_COLAB = 'google-colab' in installed_packages
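
`pkg_resources` is deprecated in recent setuptools releases. The following is a minimal sketch of the same Colab check using the standard-library `importlib.metadata`; the helper name `is_installed` is an assumption introduced here and is not part of the original notebook.

```python
from importlib import metadata


def is_installed(package_name: str) -> bool:
    """Return True if a distribution with this name is installed."""
    try:
        metadata.distribution(package_name)
        return True
    except metadata.PackageNotFoundError:
        return False


# Same meaning as the flag above: True only where google-colab is installed.
IN_COLAB = is_installed("google-colab")
print(IN_COLAB)
```

This avoids importing `pkg_resources` at all, so there is no warning to silence for this step.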



In [2]:
!test '{IN_COLAB}' = 'True' && wget  https://github.com/Carlos-SD/NLP/raw/refs/heads/main/requirements.txt && pip install -r requirements.txt

--2026-02-14 13:55:49--  https://github.com/Carlos-SD/NLP/raw/refs/heads/main/requirements.txt
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/Carlos-SD/NLP/refs/heads/main/requirements.txt [following]
--2026-02-14 13:55:50--  https://raw.githubusercontent.com/Carlos-SD/NLP/refs/heads/main/requirements.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 349 [text/plain]
Saving to: ‘requirements.txt.1’


2026-02-14 13:55:50 (9.92 MB/s) - ‘requirements.txt.1’ saved [349/349]

Traceback (most recent call last):
  File "/usr/local/bin/pip3", line 10, in <module>
    sys.exit(main())
             ^^^
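
The log above shows `wget` saving to `requirements.txt.1`, which suggests that a `requirements.txt` was already present, so `pip install -r requirements.txt` would read the older copy. One way to avoid numbered duplicates is to skip the download when the target exists. This is a sketch using only the standard library; `download_if_missing` is a hypothetical helper, not something the notebook defines.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url: str, dest: str) -> bool:
    """Download url to dest unless dest already exists.

    Returns True when a download happened, False when it was skipped.
    """
    path = Path(dest)
    if path.exists():
        return False  # keep the existing file; no "name.1" duplicates
    urlretrieve(url, str(path))
    return True


# Possible use inside the notebook (left commented out: it needs network access):
# if IN_COLAB:
#     download_if_missing(
#         "https://github.com/Carlos-SD/NLP/raw/refs/heads/main/requirements.txt",
#         "requirements.txt",
#     )
```

Deleting the stale file first and then downloading is the other option when the newest version is always wanted.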

In [3]:
# Import spaCy and load the small English pipeline:
import spacy
nlp = spacy.load('en_core_web_sm')

**1.   Download the text from the remote repository**

> "Three Little Pigs".           
> L. Leslie Brooke



In [7]:
!test '{IN_COLAB}' = 'True' && wget  https://github.com/Carlos-SD/NLP/raw/refs/heads/main/1-Sesion-activity/three_little_pigs.txt

--2026-02-14 13:56:01--  https://github.com/Carlos-SD/NLP/raw/refs/heads/main/1-Sesion-activity/three_little_pigs.txt
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/Carlos-SD/NLP/refs/heads/main/1-Sesion-activity/three_little_pigs.txt [following]
--2026-02-14 13:56:01--  https://raw.githubusercontent.com/Carlos-SD/NLP/refs/heads/main/1-Sesion-activity/three_little_pigs.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25457 (25K) [text/plain]
Saving to: ‘three_little_pigs.txt’


2026-02-14 13:56:01 (16.7 MB/s) - ‘three_little_pigs.txt’ saved [25457/25457]



In [8]:
with open('./three_little_pigs.txt') as file:
    doc = nlp(file.read())  # Read the whole file and run it through the spaCy pipeline

doc[:119]  # Show the first 119 tokens, enough to reach the title line in the .txt

The Project Gutenberg eBook of The Story of the Three Little Pigs
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: The Story of the Three Little Pigs


**2.   Count the tokens in the file**

In [9]:
len(doc)

5340
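
`len(doc)` counts every token, including punctuation and whitespace tokens such as line breaks. The sketch below shows the difference on a short sample; it assumes a blank English pipeline (tokenizer only, no model download), and the names `blank_nlp` and `sample` are illustrative.

```python
import spacy

# Tokenizer-only pipeline: no trained model is needed to count tokens.
blank_nlp = spacy.blank("en")
sample = blank_nlp("Little\nPig, let me come in.")

total_tokens = len(sample)
# Keep only tokens that are neither punctuation nor whitespace.
word_tokens = [t.text for t in sample if not (t.is_punct or t.is_space)]

print(total_tokens)
print(word_tokens)
```

Applying the same filter to the notebook's `doc` would give a word count that is lower than the 5340 tokens reported above.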

**3.   How many sentences are there in the file?**

In [10]:
sentences = list(doc.sents)
len(sentences)

170
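
In `en_core_web_sm`, `doc.sents` is derived from the dependency parser, so the count depends on the model. For comparison, this sketch uses spaCy's rule-based `sentencizer` on a blank pipeline; the sample text and variable names are assumptions made for illustration.

```python
import spacy

rule_nlp = spacy.blank("en")
rule_nlp.add_pipe("sentencizer")  # splits on sentence-final punctuation

sample = rule_nlp("The Wolf came again. Will you go? At three, said the Wolf.")
rule_sentences = [sent.text for sent in sample.sents]
print(rule_sentences)
```

A rule-based splitter is faster but ignores syntax, so on the full story it would likely report a different number of sentences than the parser does.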

**4.   Print the second sentence from the file**          
**Note:** The cell below prints `sentences[3]` instead: the earlier sentences are Project Gutenberg license text, and the title is the sentence at index 3.

In [11]:
#sentences[0]
sentences[3]

Title: The Story of the Three Little Pigs
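
Because the file opens with license text, one option is to strip that header before processing. Project Gutenberg files usually mark the body with lines that begin with `*** START OF` and `*** END OF`, but the exact wording varies, so the marker strings below are an assumption, and `strip_gutenberg_boilerplate` is a hypothetical helper.

```python
def strip_gutenberg_boilerplate(raw_text: str) -> str:
    """Return the text between the START and END marker lines.

    Falls back to the unchanged text when either marker is missing.
    """
    lines = raw_text.splitlines()
    start = next((i for i, line in enumerate(lines) if line.startswith("*** START OF")), None)
    end = next((i for i, line in enumerate(lines) if line.startswith("*** END OF")), None)
    if start is None or end is None or end <= start:
        return raw_text
    return "\n".join(lines[start + 1:end]).strip()


sample_text = (
    "License text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK X ***\n"
    "Once upon a time.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK X ***\n"
    "More license text"
)
story_only = strip_gutenberg_boilerplate(sample_text)
print(story_only)
```

Stripping the header would change the token and sentence counts, and the indices used in the later cells, so it is shown only as an alternative.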


**5. For each token, print its `text`, `POS` tag, `dep` tag, and `lemma`**
<br>

In [12]:
print("{:20}{:20}{:20}{:20}".format("Text", "POS", "dep", "lemma"))
for token in sentences[1]:
    print(f"{token.text:{20}}{token.pos_:{20}}{token.dep_:{20}}{token.lemma_:{20}}")

Text                POS                 dep                 lemma               
You                 PRON                nsubj               you                 
may                 AUX                 aux                 may                 
copy                VERB                ROOT                copy                
it                  PRON                dobj                it                  
,                   PUNCT               punct               ,                   
give                VERB                dep                 give                
it                  PRON                dobj                it                  
away                ADV                 advmod              away                
or                  CCONJ               cc                  or                  
re                  VERB                conj                re                  
-                   VERB                conj                -                   
use                 VERB    
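
The tag abbreviations in the table can be looked up with `spacy.explain`, which the notebook already uses for entity labels. A small sketch, using a hand-picked list of labels taken from the output above:

```python
import spacy

# Labels copied from the table above; spacy.explain needs no loaded model.
labels = ["PRON", "AUX", "VERB", "nsubj", "dobj", "advmod"]
explanations = {label: spacy.explain(label) for label in labels}

for label, meaning in explanations.items():
    print(f"{label:10}{meaning}")
```

`spacy.explain` returns `None` for a label it does not know, so the result is worth checking before it is formatted with a width specifier.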

**5.1 Named Entity Recognition (NER)**

Extract named entities from a range of sentences and print spaCy's explanation of each label.

In [23]:
print("{:20}{:20}{:20}".format("Text", "Label", "Explanation"))
for i in range(10, 16): # Iterate from sentence 10 to 15 (inclusive)
    current_sentence = sentences[i]
    print(f'Original sentence ({i}):', current_sentence.text, '\n')
    if current_sentence.ents:
        for ent in current_sentence.ents:
            print(f"{ent.text:{20}}{ent.label_:{20}}{spacy.explain(ent.label_):{20}}")
    else:
        print("No named entities found in this sentence.\n")
    print("-" * 60) # Separator for readability between sentences

Text                Label               Explanation         
Original sentence (10): Once upon a time there was an old Sow with three little Pigs,
and as she had not enough to keep them, she sent them out to seek their
fortune.

 

three               CARDINAL            Numerals that do not fall under another type
Pigs                ORG                 Companies, agencies, institutions, etc.
------------------------------------------------------------
Original sentence (11): [Illustration]

[Illustration]

[Illustration]

The first that went off met a Man with a bundle of straw, and said to
him, "Please, Man, give me that straw to build me a house"; which the
Man did, and the little Pig built a house with it. 

first               ORDINAL             "first", "second", etc.
------------------------------------------------------------
Original sentence (12): Presently came along
a Wolf, and knocked at the door, and said, "Little Pig, little Pig, let
me come in. 

Wolf                P

- At this point, NER attempts to label the entities but does not accurately capture the meaning of some of them, such as "chin chin". It also labels "Pigs" as an organization (companies, agencies, institutions) rather than as characters, which is how the story presents them.
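
One possible way to handle story-specific names is a rule-based `entity_ruler`. The sketch below runs on a blank pipeline so that it needs no trained model; the label `CHARACTER` and the two patterns are assumptions chosen for this story, not part of the original notebook.

```python
import spacy

ruler_nlp = spacy.blank("en")
ruler = ruler_nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Hypothetical label and patterns for the characters of the tale.
    {"label": "CHARACTER", "pattern": [{"LOWER": "little"}, {"LOWER": "pig"}]},
    {"label": "CHARACTER", "pattern": [{"LOWER": "wolf"}]},
])

sample = ruler_nlp("Presently came along a Wolf, and the little Pig built a house.")
character_ents = [(ent.text, ent.label_) for ent in sample.ents]
print(character_ents)
```

In a trained pipeline such as `en_core_web_sm`, the ruler would typically be added with `before="ner"` so that its spans take priority over the statistical predictions.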

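**6. Write a matcher called *Swimming* that finds the occurrences of the phrase *swimming vigorously***

**Note:** The cell below keeps the matcher name `Swimming`, but its pattern matches *little pig* where a whitespace token, such as a line break, separates the two words.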

In [25]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
pattern = [{'LOWER': 'little'}, {'IS_SPACE': True}, {'LOWER': 'pig'}]
matcher.add("Swimming", [pattern])
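
The pattern above requires a whitespace token between the two words. A single space is not a token in spaCy, so the pattern only matches where something like a line break separates *little* and *pig*. The sketch below makes the whitespace token optional with `"OP": "?"` so that both forms match; the pipeline, the matcher name `LittlePig`, and the sample text are assumptions for illustration.

```python
import spacy
from spacy.matcher import Matcher

flex_nlp = spacy.blank("en")
sample = flex_nlp('said to the little Pig, "Little\nPig, there is a Fair')

flexible_matcher = Matcher(flex_nlp.vocab)
flexible_pattern = [
    {"LOWER": "little"},
    {"IS_SPACE": True, "OP": "?"},  # zero or one whitespace token
    {"LOWER": "pig"},
]
flexible_matcher.add("LittlePig", [flexible_pattern])

matched_texts = [sample[start:end].text for _, start, end in flexible_matcher(sample)]
print(matched_texts)
```

With the strict pattern only the occurrence split across the line break would be found, which is consistent with the single match the notebook reports.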


In [26]:
found_matches = matcher(doc)
found_matches

[(12881893835109366681, 1305, 1308)]

**7. Print the text around each found match**

In [27]:
start, end = found_matches[0][1:]
doc[start-9:end+13]

, and said to the little Pig, "Little
Pig, there is a Fair in the Town this afternoon: will you
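
The slice `doc[start-9:end+13]` works here because the match is far from the start of the document. If `start` were smaller than 9, the negative index would count from the end and the window would miss the match. A small sketch of a clamped window follows; `context_window` is a hypothetical helper that works on any sliceable sequence, including a spaCy `Doc`.

```python
def context_window(tokens, start, end, before=9, after=13):
    """Slice tokens around [start, end) without running past either edge."""
    left = max(0, start - before)
    right = min(len(tokens), end + after)
    return tokens[left:right]


# Plain list of words standing in for a Doc; the match is at positions 1..2.
words = "Little Pig , there is a Fair in the Town".split()
window = context_window(words, 1, 2)
print(window)
```

In the notebook this would be called as `context_window(doc, start, end)`.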

**8. Print the sentence that contains each found match**

In [28]:
for sentence in sentences:
    for _, start, end in found_matches:
        if sentence.start <= start and sentence.end >= end:
            print(sentence.text, '\n')
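
An alternative to the nested loop is the `sent` attribute of a span, which returns the sentence that contains it. This sketch assumes a blank pipeline with the rule-based `sentencizer` and an invented sample text, so its sentence boundaries are simpler than those of the notebook's model.

```python
import spacy
from spacy.matcher import Matcher

sent_nlp = spacy.blank("en")
sent_nlp.add_pipe("sentencizer")  # provides the sentence boundaries Span.sent needs
sample = sent_nlp("The Wolf came again. He called the little\nPig to the Fair. Then he left.")

sample_matcher = Matcher(sent_nlp.vocab)
sample_matcher.add("LittlePig", [[{"LOWER": "little"}, {"IS_SPACE": True}, {"LOWER": "pig"}]])

# For each match, take the matched span and ask for its enclosing sentence.
matched_sentences = [sample[start:end].sent.text for _, start, end in sample_matcher(sample)]
print(matched_sentences)
```

In the notebook the equivalent expression would be `doc[start:end].sent` for each entry in `found_matches`.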

[Illustration]

[Illustration]

The next day the Wolf came again, and said to the little Pig, "Little
Pig, there is a Fair in the Town this afternoon: will you go?"

"Oh, yes," said the Pig, "I will go; what time shall you be ready?"

"At three," said the Wolf.

 

