<a href="https://colab.research.google.com/github/ShaunakSen/Natural-Language-Processing/blob/master/Advanced_NLP_with_spaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Advanced NLP with spaCy

> Based on the official course: https://course.spacy.io/en

---

### Installation

In [0]:
!pip install -U spacy

In [0]:
!pip install -U spacy-lookups-data

In [0]:
!python -m spacy download en_core_web_sm

### Chapter 1: Finding words, phrases, names and concepts

#### Introduction to spaCy

**The nlp object**

At the center of spaCy is the object containing the processing pipeline. We usually call this variable "nlp".

For example, to create an English nlp object, you can import the English language class from `spacy.lang.en` and instantiate it. You can use the nlp object like a function to analyze text.

It contains all the different components in the pipeline.

It also includes language-specific rules used for tokenizing the text into words and punctuation. spaCy supports a variety of languages that are available in `spacy.lang`.
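
To get a feel for what "tokenizing the text into words and punctuation" means, here is a very rough plain-Python sketch. The function name `rough_tokenize` and the regular expression are assumptions made for illustration; spaCy's actual tokenizer uses much richer language-specific rules and exceptions (this regex would, for instance, split an abbreviation such as "U.K." into four pieces).

```python
import re

def rough_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    Hypothetical helper for illustration only, not spaCy's tokenizer.
    """
    return re.findall(r"\w+|[^\w\s]", text)

print(rough_tokenize("Hello world!"))   # ['Hello', 'world', '!']
print(rough_tokenize("It costs $5."))   # ['It', 'costs', '$', '5', '.']
```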



In [0]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

**The Doc Object**

When you process a text with the nlp object, spaCy creates a Doc object – short for "document". The Doc lets you access information about the text in a structured way, and no information is lost.

The Doc behaves like a normal Python sequence: you can iterate over its tokens or get a token by its index. More on that later!



In [0]:
# Created by processing a string of text with the nlp object

doc = nlp(text="Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

Hello
world
!


**The token object**
![](https://course.spacy.io/doc.png)

Token objects represent the tokens in a document – for example, a word or a punctuation character.

Token objects also provide various attributes that let you access more information about the tokens. For example, the .text attribute returns the verbatim token text.

In [0]:
# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
print(token.text)

world


**The span object**

*A Span object is a slice of the document consisting of one or more tokens*. 

> It's only a view of the Doc and doesn't contain any data itself.

To create a span, you can use Python's slice notation. For example, 1:3 will create a slice starting from the token at position 1, up to – but not including! – the token at position 3.





In [0]:
span = doc[1:3]

print(span.text)

world!


**Token attributes**

Here you can see some of the available token attributes:

- `i` is the index of the token within the parent document.
- `text` returns the token text.
- `is_alpha`, `is_punct` and `like_num` return boolean values indicating whether the token consists of alphabetic characters, whether it's punctuation, or whether it resembles a number. For example, both the token "10" and the word "ten" resemble a number.

These attributes are also called lexical attributes: **they refer to the entry in the vocabulary and don't depend on the token's context**.



In [0]:
doc = nlp("It costs $5. Ten £ only")

print("Index:   ", [token.i for token in doc])
print("Text:    ", [token.text for token in doc])

print("is_alpha:", [token.is_alpha for token in doc])
print("is_punct:", [token.is_punct for token in doc])
print("like_num:", [token.like_num for token in doc])
print("is_currency:", [token.is_currency for token in doc])

Index:    [0, 1, 2, 3, 4, 5, 6, 7]
Text:     ['It', 'costs', '$', '5', '.', 'Ten', '£', 'only']
is_alpha: [True, True, False, False, False, True, False, True]
is_punct: [False, False, False, False, True, False, False, False]
like_num: [False, False, False, True, False, True, False, False]
is_currency: [False, False, True, False, False, False, True, False]
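
Because lexical attributes depend only on the token text, they can be approximated without any context. The sketch below is a plain-Python approximation, not spaCy's implementation: the helper `lexical_flags` and the tiny `NUMBER_WORDS` set are hypothetical, and spaCy's real rules cover many more cases.

```python
import unicodedata

# Hypothetical, tiny stand-in for a list of number words
NUMBER_WORDS = {"zero", "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}

def lexical_flags(text):
    """Compute context-independent flags from the token text alone."""
    digits_only = text.replace(",", "").replace(".", "", 1)
    return {
        "is_alpha": text.isalpha(),
        "is_digit": text.isdigit(),
        # True if every character is in a Unicode punctuation category (P*)
        "is_punct": bool(text) and all(
            unicodedata.category(ch).startswith("P") for ch in text
        ),
        # Digits (optionally with separators) or a known number word
        "like_num": digits_only.isdigit() or text.lower() in NUMBER_WORDS,
    }

for word in ["It", "costs", "$", "5", ".", "Ten"]:
    print(word, lexical_flags(word))
```

The same string always produces the same flags, which is what makes these attributes context-independent.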


#### Exercises (slightly complicated ones)

In this example, you’ll use spaCy’s Doc and Token objects, and lexical attributes to find percentages in a text. You’ll be looking for two consecutive tokens: a number and a percent sign.

- Use the like_num token attribute to check whether a token in the doc resembles a number.
- Get the token following the current token in the document. The index of the next token in the doc is token.i + 1.
- Check whether the next token’s text attribute is a percent sign "%".

In [0]:
# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

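for token in doc:
    # Check that the token resembles a number and is not the last token
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        # Check whether the next token is a percent sign
        if next_token.text == "%":
            print(token.text)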

60
4


#### Statistical models

Some of the most interesting things you can analyze are context-specific: for example, whether a word is a verb or whether a span of text is a person's name.


Statistical models enable spaCy to make predictions in context. This usually includes:

- Part-of-speech tags
- Syntactic dependencies
- Named entities

The models are trained on labeled example texts and can be updated with more examples to fine-tune their predictions.

spaCy provides a number of pre-trained model packages you can download using the `spacy download` command. For example, the `en_core_web_sm` package is a small English model that supports all core capabilities and is trained on web text.

The `spacy.load` function loads a model package by name and returns an nlp object.

The package provides the binary weights that enable spaCy to make predictions.

It also includes the vocabulary, and meta information to tell spaCy which language class to use and how to configure the processing pipeline.

In [0]:
import spacy
nlp = spacy.load("en_core_web_sm")

**Predicting Part-of-speech Tags**

1. First, we load the small English model and receive an nlp object.

2. Next, we process the text "She ate the pizza!".

3. For each token in the doc, we can print the text and the .pos_ attribute, the predicted part-of-speech tag.

> In spaCy, attributes that return strings usually end with an underscore – attributes without the underscore return an integer ID value.

Here, the model correctly predicted "ate" as a verb and "pizza" as a noun.



In [0]:
doc = nlp("She ate the pizza!")

for token in doc:
    print (token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN
! PUNCT


**Predicting Syntactic Dependencies**

In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object.

- The `.dep_` attribute returns the predicted dependency label.

- The `.head` attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.

In [0]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate
! PUNCT punct ate


**Dependency label scheme**

To describe syntactic dependencies, spaCy uses a standardized label scheme. Here's an example of some common labels:


![](https://course.spacy.io/dep_example.png)



![](https://i.ibb.co/2nxYpjX/diag1.png)

The pronoun "She" is a nominal subject attached to the verb – in this case, to "ate".

The noun "pizza" is a direct object attached to the verb "ate". It is eaten by the subject, "she".

The determiner "the", also known as an article, is attached to the noun "pizza".
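
As a small illustration of how the head relationships fit together, the sketch below works on plain tuples copied from the output printed earlier, not on spaCy objects. The helper `children_of` is hypothetical, and it identifies tokens by their text, which only works because no word repeats in this sentence; with real Token objects you would compare `token.i` instead.

```python
# (text, dependency label, head text) triples copied from the printed output
parse = [
    ("She", "nsubj", "ate"),
    ("ate", "ROOT", "ate"),
    ("the", "det", "pizza"),
    ("pizza", "dobj", "ate"),
    ("!", "punct", "ate"),
]

def children_of(head_text, parse):
    """Return (text, dep) pairs for tokens attached to the given head."""
    return [
        (text, dep)
        for text, dep, head in parse
        if head == head_text and text != head_text  # skip the root's self-link
    ]

# The root is the token labeled ROOT (it is its own head)
root = next(text for text, dep, head in parse if dep == "ROOT")

print(root)                         # ate
print(children_of(root, parse))     # [('She', 'nsubj'), ('pizza', 'dobj'), ('!', 'punct')]
print(children_of("pizza", parse))  # [('the', 'det')]
```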

**Predicting Named Entities**

![](https://course.spacy.io/ner_example.png)

Named entities are "real world objects" that are assigned a name – for example, a person, an organization or a country.

The `doc.ents` property lets you access the named entities predicted by the model.

It returns a tuple of Span objects, so we can iterate over it and print the entity text and the entity label using the `.label_` attribute.

In this case, the model is correctly predicting "Apple" as an organization, "U.K." as a geopolitical entity and "$1 billion" as money.



In [0]:
# Process a text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    print (ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


**Tip: the spacy.explain function**

A quick tip: To get definitions for the most common tags and labels, you can use the spacy.explain helper function.

For example, "GPE" for geopolitical entity isn't exactly intuitive – but spacy.explain can tell you that it refers to countries, cities and states.

The same works for part-of-speech tags and dependency labels.

In [0]:
print (spacy.explain("GPE"))
print (spacy.explain("NNP"))
print (spacy.explain("dobj"))

Countries, cities, states
noun, proper singular
direct object


#### Exercises

**Predicting linguistic annotations**

You’ll now get to try one of spaCy’s pre-trained model packages and see its predictions in action. Feel free to try it out on your own text! To find out what a tag or label means, you can call spacy.explain in the loop. For example: spacy.explain("PROPN") or spacy.explain("GPE").

- Process the text with the nlp object and create a doc.
- For each token, print the token text, the token’s .pos_ (part-of-speech tag) and the token’s .dep_ (dependency label).

In [0]:
nlp = spacy.load("en_core_web_sm")

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    explain_text = str(spacy.explain(term=token_dep))
    head_text = token.head.text
    # This is for formatting only
    print(f"{token_text:<12}{token_pos:<10}{token_dep:<10}{explain_text:<30}{head_text:<12}")

It          PRON      nsubj     nominal subject               official    
’s          VERB      punct     punctuation                   It          
official    NOUN      ccomp     clausal complement            is          
:           PUNCT     punct     punctuation                   is          
Apple       PROPN     nsubj     nominal subject               is          
is          AUX       ROOT      None                          is          
the         DET       det       determiner                    company     
first       ADJ       amod      adjectival modifier           company     
U.S.        PROPN     nmod      modifier of nominal           company     
public      ADJ       amod      adjectival modifier           company     
company     NOUN      attr      attribute                     is          
to          PART      aux       auxiliary                     reach       
reach       VERB      relcl     relative clause modifier      company     
a           DET       det

Note that the model tagged "’s" as a VERB, treating it as a contraction of "is".

Also note that *$* is tagged as a symbol, while *1* and *trillion* are tagged as NUM.



In [0]:
for ent in doc.ents:
    print (ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


So far, the model has been correct every single time. In
the next exercise, you'll see what happens if the model is wrong, and how to
adjust it.

**Predicting named entities in context**

Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you’re processing. Let’s take a look at an example.

In [0]:
text = "Upcoming iPhone X release date leaked as Apple reveals pre-orders for $ 1 billion worth Indian customers"
doc = nlp(text)

for ent in doc.ents:
    print (ent.text, ent.label_)

Apple ORG
$ 1 billion MONEY
Indian NORP


"iPhone X" should also be an entity, but the model missed it. We can create a span for it manually:

In [0]:
iphone_x = doc[1:3]
print (iphone_x.text)

iPhone X


Of course, you don't always have to do this manually. In the
next exercise, you'll learn about spaCy's rule-based matcher, which can help you
find certain words and phrases in text.

#### Rule-based matching

Compared to regular expressions, the matcher works with `Doc` and `Token` objects instead of only strings.

It's also more flexible: you can search for texts but also other lexical attributes.

You can even write rules that use the model's predictions.

> For example, find the word "duck" only if it's a verb, not a noun.

**Match pattern examples**

Match patterns are lists of dictionaries. Each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values.

Match exact token texts:

`[{"TEXT": "iPhone"}, {"TEXT": "X"}]`

In this example, we're looking for two tokens with the text "iPhone" and "X".

Match lexical attributes:

`[{"LOWER": "iphone"}, {"LOWER": "x"}]`

We can also match on other token attributes. Here, we're looking for two tokens whose lowercase forms equal "iphone" and "x".

Match any token attributes:

`[{"LEMMA": "buy"}, {"POS": "NOUN"}]`

We can even write patterns using *attributes predicted by the model*. Here, we're matching a token with the lemma "buy", plus a noun. **The lemma is the base form, so this pattern would match phrases like "buying milk" or "bought flowers"**.
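
To make the "list of dictionaries" idea concrete, here is a minimal plain-Python sketch of how such a pattern could be checked against a sequence of tokens. It is not spaCy's Matcher: the functions `token_matches` and `find_matches` are hypothetical, and the token attributes are written by hand rather than computed by a pipeline.

```python
def token_matches(token, spec):
    """True if every attribute in the pattern dict equals the token's value."""
    return all(token.get(key) == value for key, value in spec.items())

def find_matches(tokens, pattern):
    """Return (start, end) pairs where the pattern matches consecutive tokens."""
    n = len(pattern)
    return [
        (start, start + n)  # end index is exclusive, like a slice
        for start in range(len(tokens) - n + 1)
        if all(token_matches(tokens[start + k], pattern[k]) for k in range(n))
    ]

# Hand-written token attributes (a real pipeline would compute these)
words = ["Upcoming", "iPhone", "X", "release", "date", "leaked"]
tokens = [{"TEXT": w, "LOWER": w.lower()} for w in words]

print(find_matches(tokens, [{"TEXT": "iPhone"}, {"TEXT": "X"}]))    # [(1, 3)]
print(find_matches(tokens, [{"LOWER": "iphone"}, {"LOWER": "x"}]))  # [(1, 3)]
print(find_matches(tokens, [{"TEXT": "iphone"}, {"TEXT": "x"}]))    # []
```

Each dictionary is compared against exactly one token, which is why the exact-text pattern fails on lowercase input while the `LOWER` pattern still matches.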




In [0]:
import spacy
from spacy.matcher import Matcher

# Load a model and create the nlp object
nlp = spacy.load("en_core_web_sm")

In [0]:
# Initialize the matcher with the shared vocab
matcher = Matcher(vocab=nlp.vocab)

# Add the pattern to the matcher
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", None, pattern)

# Process some text
doc = nlp("Upcoming iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

To use a pattern, we first import the matcher from spacy.matcher.

We also load a model and create the nlp object.

The matcher is initialized with the shared vocabulary, nlp.vocab. You'll learn more about this later – for now, just remember to always pass it in.

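The matcher.add method lets you add a pattern. The first argument is a unique ID to identify which pattern was matched. The second argument is an optional callback. We don't need one here, so we set it to None. The third argument is the pattern. Note that this is the spaCy v2 signature: in spaCy v3, the patterns are passed as a list in the second argument instead, as in `matcher.add("IPHONE_PATTERN", [pattern])`.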

To match the pattern on a text, we can call the matcher on any doc.

This will return the matches.



In [0]:
print (matches)

[(9528407286733565721, 1, 3)]


When you call the matcher on a doc, it returns a list of tuples.

Each tuple consists of three values: the match ID, the start index and the end index of the matched span.

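`match_id` is the hash value of the pattern name; `start` and `end` are token indices into the doc, with `end` exclusive, so `doc[start:end]` gives the matched span.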


In [13]:
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

iPhone X


**Matching lexical attributes**

Here's an example of a more complex pattern using lexical attributes.

We're looking for five tokens:

- A token consisting of only digits.
- Three case-insensitive tokens for "fifa", "world" and "cup".
- A token that consists of punctuation.

The pattern matches the tokens "2018 FIFA World Cup:".

In [15]:
pattern_2 = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True}
]

doc = nlp("2018 FIFA World Cup: France won! Upcoming iPhone X release date leaked")
matcher.add("FIFA_PATTERN", None, pattern_2)

# Call the matcher on the doc
matches = matcher(doc)

print (matches)

[(17311505950452258848, 0, 5), (9528407286733565721, 9, 11)]


In [16]:
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

2018 FIFA World Cup:
iPhone X
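
When several patterns are registered, the match ID tells you which one fired. In spaCy, the name can be recovered from the string store with `nlp.vocab.strings[match_id]`. The sketch below imitates that lookup in plain Python: the IDs and indices are copied from the outputs above, while the `pattern_names` dictionary and the hand-written token list are stand-ins assumed for illustration.

```python
# Match tuples copied from the printed output above
matches = [(17311505950452258848, 0, 5), (9528407286733565721, 9, 11)]

# Hand-written tokenization of the example sentence (assumed for illustration)
tokens = ["2018", "FIFA", "World", "Cup", ":", "France", "won", "!",
          "Upcoming", "iPhone", "X", "release", "date", "leaked"]

# Stand-in for the string store: hash ID -> pattern name
pattern_names = {
    17311505950452258848: "FIFA_PATTERN",
    9528407286733565721: "IPHONE_PATTERN",
}

results = []
for match_id, start, end in matches:
    # The end index is exclusive, so a plain slice gives the matched tokens
    results.append((pattern_names[match_id], tokens[start:end]))

for name, span_tokens in results:
    print(name, span_tokens)
```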


**Matching other token attributes**

In this example, we're looking for two tokens: a verb with the lemma "love", followed by a noun.

Note: the lemma is the base form of a word, so the tense does not matter.

This pattern will match "loved dogs" and "love cats".

In [18]:
pattern = [
    {"LEMMA": "love", "POS": "VERB"},
    {"POS": "NOUN"}
]

doc = nlp("I loved dogs but now I love cats more.")


matcher.add("VERB_PATTERN", None, pattern)

# Call the matcher on the doc
matches = matcher(doc)

print (matches)

[(14990696005118948706, 1, 3), (14990696005118948706, 6, 8)]


In [20]:
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

loved dogs
love cats


**Using operators and quantifiers**

Operators and quantifiers let you define how often a token should be matched. They can be added using the "OP" key.

Here, the "?" operator makes the determiner token optional, so it will match a token with the lemma "buy", an optional article and a noun.


In [22]:
pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"},  # optional: match 0 or 1 times
    {"POS": "NOUN"}
]

doc = nlp("I bought a smartphone. Now I'm buying apps.")

matcher.add("OP_PATTERN", None, pattern)

# Call the matcher on the doc
matches = matcher(doc)

print (matches)

for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

[(15381964648932541590, 1, 4), (15381964648932541590, 8, 10)]
bought a smartphone
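buying apps


"OP" can have one of four values: "!" (negate the token: match 0 times), "?" (optional: match 0 or 1 times), "+" (match 1 or more times) and "*" (match 0 or more times).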

![](https://i.ibb.co/M5LGNMf/diag2.png)
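
As a rough illustration of how an optional token behaves, the plain-Python sketch below handles only the "?" operator. It is not spaCy's Matcher: the helpers `token_matches`, `match_ends` and `find_matches` are hypothetical, and the lemma and part-of-speech values are assigned by hand.

```python
def token_matches(token, spec):
    """Compare every attribute in the pattern dict except the "OP" key."""
    return all(token.get(k) == v for k, v in spec.items() if k != "OP")

def match_ends(tokens, pattern, pos):
    """Return all end indices of matches of `pattern` starting at tokens[pos]."""
    if not pattern:
        return {pos}
    spec, rest = pattern[0], pattern[1:]
    ends = set()
    if spec.get("OP") == "?":
        # Optional token: one way to match is to skip it entirely
        ends |= match_ends(tokens, rest, pos)
    if pos < len(tokens) and token_matches(tokens[pos], spec):
        # The token matches: consume it and continue with the rest
        ends |= match_ends(tokens, rest, pos + 1)
    return ends

def find_matches(tokens, pattern):
    """Return sorted (start, end) pairs for every match in the token list."""
    return sorted(
        (start, end)
        for start in range(len(tokens))
        for end in match_ends(tokens, pattern, start)
    )

# Hand-assigned attributes for "I bought a smartphone. Now I'm buying apps."
tokens = [
    {"TEXT": "I", "POS": "PRON"},
    {"TEXT": "bought", "LEMMA": "buy", "POS": "VERB"},
    {"TEXT": "a", "POS": "DET"},
    {"TEXT": "smartphone", "POS": "NOUN"},
    {"TEXT": ".", "POS": "PUNCT"},
    {"TEXT": "Now", "POS": "ADV"},
    {"TEXT": "I", "POS": "PRON"},
    {"TEXT": "'m", "POS": "AUX"},
    {"TEXT": "buying", "LEMMA": "buy", "POS": "VERB"},
    {"TEXT": "apps", "POS": "NOUN"},
    {"TEXT": ".", "POS": "PUNCT"},
]
pattern = [{"LEMMA": "buy"}, {"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]

print(find_matches(tokens, pattern))  # [(1, 4), (8, 10)]
```

A trailing optional token can produce two overlapping matches from the same start position, one with and one without the optional token.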

#### Exercises

Write one pattern that only matches mentions of the full iOS versions: “iOS 7”, “iOS 11” and “iOS 10”.


In [23]:
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


Write one pattern that only matches forms of “download” (tokens with the lemma “download”), followed by a token with the part-of-speech tag "PROPN" (proper noun).


In [24]:
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip? The download speed is very slow"
)

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 2
Match found: downloaded Fortnite
Match found: downloading Minecraft


Write one pattern that matches adjectives ("ADJ") followed by one or two "NOUN"s (one noun and one optional noun).


In [26]:
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)

# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses


#### Data Structures (1): Vocab, Lexemes and StringStore

Now that you've had some real experience using spaCy's objects, it's time for you to learn more about what's actually going on under spaCy's hood.

In this lesson, we'll take a look at the shared vocabulary and how spaCy deals with strings.

**Shared vocab and string store**

spaCy stores all shared data in a vocabulary, the Vocab.

This includes words, but also the label schemes for tags and entities.

To save memory, all strings are encoded to hash IDs. If a word occurs more than once, we don't need to save it every time.

Instead, spaCy uses a hash function to generate an ID and stores the string only once in the string store. The string store is available as nlp.vocab.strings.

It's a lookup table that works in both directions. You can look up a string and get its hash, and look up a hash to get its string value. Internally, spaCy only communicates in hash IDs.

Hash IDs can't be reversed, though. If a word is not in the vocabulary, there's no way to get its string. That's why we always need to pass around the shared vocab.

In [29]:
coffee_hash = nlp.vocab.strings["coffee"]
print (coffee_hash)

3197928453018144401


In [34]:
doc = nlp("I love coffee")
print("hash value:", nlp.vocab.strings["coffee"])
print("string value:", nlp.vocab.strings[3197928453018144401])

hash value: 3197928453018144401
string value: coffee


In [35]:
# The doc also exposes the vocab and strings
doc = nlp("I love coffee")
print("hash value:", doc.vocab.strings["coffee"])

hash value: 3197928453018144401
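
The two-way lookup can be sketched in plain Python. This is a toy model, not spaCy's StringStore: the class `ToyStringStore` is hypothetical, and `hashlib.blake2b` is swapped in for spaCy's own hash function, so the IDs will differ from the ones printed above. The point is that a string can always be hashed, but a hash can only be turned back into a string by a store that has seen that string.

```python
import hashlib

class ToyStringStore:
    """Two-way lookup between strings and 64-bit hash IDs (simplified sketch)."""

    def __init__(self):
        self._by_hash = {}  # hash ID -> string

    @staticmethod
    def hash_string(text):
        # Assumption: any stable 64-bit hash is good enough for this illustration
        digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
        return int.from_bytes(digest, "little")

    def add(self, text):
        """Store the string once and return its hash ID."""
        key = self.hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.hash_string(key)  # string -> hash always works
        return self._by_hash[key]         # hash -> string needs a stored entry

strings = ToyStringStore()
coffee_hash = strings.add("coffee")
print(strings["coffee"] == coffee_hash)  # True
print(strings[coffee_hash])              # coffee

# A different store has never seen "coffee", so the hash cannot be resolved
other = ToyStringStore()
try:
    other[coffee_hash]
except KeyError:
    print("hash not found in the other store")
```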


**Lexemes: entries in the vocabulary**

Lexemes are **context-independent entries** in the vocabulary.

You can get a lexeme by looking up a string or a hash ID in the vocab.

Lexemes expose attributes, just like tokens.

They hold context-independent information about a word, like the text, or whether the word consists of alphabetic characters.

*Lexemes don't have part-of-speech tags, dependencies or entity labels. Those depend on the context.*


In [36]:
doc = nlp("I love mini")
lexeme = nlp.vocab["mini"]

print(lexeme)

<spacy.lexeme.Lexeme object at 0x7fe1ca592828>


In [42]:
lexeme.is_digit, lexeme.is_alpha, lexeme.text, lexeme.orth

(False, True, 'mini', 11698860559887369376)

Here's an example.

The Doc contains words in context – in this case, the tokens "I", "love" and "coffee" with their part-of-speech tags and dependencies.

Each token refers to a lexeme, which knows the word's hash ID. To get the string representation of the word, spaCy looks up the hash in the string store.

![](https://i.ibb.co/1m9fBYh/diag3.png)

Now that you know all about the vocabulary and string store, we can take a look at the most important data structure: the Doc, and its views Token and Span.

**The Doc object**

The Doc is one of the central data structures in spaCy. It's created automatically when you process a text with the nlp object. But you can also instantiate the class manually.

After creating the nlp object, we can import the Doc class from spacy.tokens.

Here we're creating a doc from three words. *The spaces are a list of boolean values indicating whether the word is followed by a space*. Every token includes that information – even the last one!

The Doc class takes three arguments: the shared vocab, the words and the spaces.
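
To see what the spaces list encodes, here is a plain-Python sketch of how the original text can be rebuilt from the words and their trailing-space flags. The helper `join_tokens` is hypothetical and only mirrors the idea; it is not how spaCy stores or renders a Doc internally.

```python
def join_tokens(words, spaces):
    """Rebuild the text: each word is followed by a space only if its flag is True."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(
        word + (" " if space else "") for word, space in zip(words, spaces)
    )

print(join_tokens(["Hello", "world", "!"], [True, False, False]))
# Hello world!
print(join_tokens(["I", "like", "David", "Bowie"], [True, True, True, False]))
# I like David Bowie
```

Because every token carries its own flag, the text can be reconstructed exactly, which is why no information is lost during tokenization.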

In [0]:
nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# The words and spaces to create the doc from

words = ["Hello", "world", "!"]
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

**The Span object**

![](https://course.spacy.io/span_indices.png)

A Span is a slice of a doc consisting of one or more tokens. The Span takes at least three arguments: the doc it refers to, and the start and end index of the span. Remember that the end index is exclusive!

To create a Span manually, we can also import the class from spacy.tokens. We can then instantiate it with the doc and the span's start and end index, and an optional label argument.

doc.ents is writable, so we can add entities manually by overwriting it with a list of spans.



In [0]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ["Hello", "world", "!"]
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually
span = Span(doc, 0, 2)

# Create a span with a label
span_with_label = Span(doc, 0, 2, label="GREETING")

In [53]:
for token in span_with_label:
    print (token.text)

Hello
world


The span "Hello world" can be treated as a GREETING entity, so we add it to the doc's entities.

In [56]:
# Add span to the doc.ents
doc.ents = [span_with_label]

for ent in doc.ents:
    print (ent.text, ent.label_)

Hello world GREETING


**Best practices**

The Doc and Span are very powerful and optimized for performance. They give you access to all references and relationships of the words and sentences.

If your application needs to output strings, make sure to convert the doc as late as possible. If you do it too early, you'll lose all relationships between the tokens.

To keep things consistent, try to use built-in token attributes wherever possible. For example, token.i for the token index.

Also, don't forget to always pass in the shared vocab!

#### Exercises

In this exercise, you’ll create the Doc and Span objects manually, and update the named entities – just like spaCy does behind the scenes. A shared nlp object has already been created.

- Import the Doc and Span classes from spacy.tokens.
- Use the Doc class directly to create a doc from the words and spaces.
- Create a Span for “David Bowie” from the doc and assign it the label "PERSON".
- Overwrite the doc.ents with a list of one entity, the “David Bowie” span.

In [57]:
nlp = English()

# Import the Doc and Span classes
from spacy.tokens import Doc, Span

words = ["I", "like", "David", "Bowie"]
spaces = [True, True, True, False]

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words, spaces)
print(doc.text)

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label="PERSON")
print(span.text, span.label_)

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

I like David Bowie
David Bowie PERSON
[('David Bowie', 'PERSON')]


**Data Structures Best Practices**

The code in this example is trying to analyze a text and collect all proper nouns that are followed by a verb.



In [58]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin is a nice city")

# Get all tokens and part-of-speech tags
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

print (pos_tags)

['PROPN', 'AUX', 'DET', 'ADJ', 'NOUN']


In [0]:
for index, pos in enumerate(pos_tags):
    # Check if the current token is a proper noun
    if pos == 'PROPN':
        # Check if the next token is a verb
        if pos_tags[index+1] == 'VERB':
            result = token_texts[index]
            print("Found proper noun before a verb:", result)

In [60]:
[token.tag_ for token in doc]

['NNP', 'VBZ', 'DT', 'JJ', 'NN']

Why is the code bad?

- It only uses lists of strings instead of native token attributes. This is often less efficient, and can't express complex relationships.

In [61]:
type(token_texts[0])

str

- Rewrite the code to use the native token attributes instead of lists of token_texts and pos_tags.
- Loop over each token in the doc and check the token.pos_ attribute.
- Use doc[token.i + 1] to check for the next token and its .pos_ attribute.
- If a proper noun before a verb is found, print its token.text.

In [73]:
doc = nlp("Berlin depicts nice city Tokyo")

for token in doc:
    # Make sure there is a next token before looking it up
    if token.pos_ == "PROPN" and token.i + 1 < len(doc):
        if doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)

Found proper noun before a verb: Berlin
