Aim: To explore various NLP operations and spaCy functionalities

In [2]:
pip install spacy

Note: you may need to restart the kernel to use updated packages.


In [1]:
import spacy
nlp = spacy.load("en_core_web_sm")
nlp

<spacy.lang.en.English at 0x1ad475fd750>

In [2]:
introduction_doc = nlp("This tutorial is about Natural Language Processing")
type(introduction_doc)

spacy.tokens.doc.Doc

In [3]:
[token.text for token in introduction_doc]

['This', 'tutorial', 'is', 'about', 'Natural', 'Language', 'Processing']

In [4]:
import pathlib
file_name = "Introduction.txt"
introduction_doc = nlp(pathlib.Path(file_name).read_text(encoding="utf-8"))
print([token.text for token in introduction_doc])

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'computer', 'science', 'and', 'especially', 'artificial', 'intelligence', '.', 'It', 'is', 'primarily', 'concerned', 'with', 'providing', 'computers', 'with', 'the', 'ability', 'to', 'process', 'data', 'encoded', 'in', 'natural', 'language', 'and', 'is', 'thus', 'closely', 'related', 'to', 'information', 'retrieval', ',', 'knowledge', 'representation', 'and', 'computational', 'linguistics', ',', 'a', 'subfield', 'of', 'linguistics', '.', 'Typically', 'data', 'is', 'collected', 'in', 'text', 'corpora', ',', 'using', 'either', 'rule', '-', 'based', ',', 'statistical', 'or', 'neural', '-', 'based', 'approaches', 'in', 'machine', 'learning', 'and', 'deep', 'learning', '.']


Sentence Detection

In [5]:
about_text = ("This tutorial is about Natural Language Processing")

In [6]:
about_doc = nlp(about_text)
sentences = list(about_doc.sents)
len(sentences)

1

In [7]:
for sentence in sentences:
    print(f"{sentence[:5]}...")

This tutorial is about Natural...


In [8]:
ellipsis_text = (
    "Gus, can you, ... never mind, I forgot"
    " what I was saying. So, do you think"
    " we should ..."
)

In [14]:
from spacy.language import Language

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in list(doc)[::-1]:  # Convert doc to list for reverse iteration
        if token.text == "...":
            if token.i + 1 < len(doc):  # Ensure index is within bounds
                doc[token.i + 1].is_sent_start = True
    return doc

In [15]:
custom_nlp = spacy.load("en_core_web_sm")

custom_nlp.add_pipe("set_custom_boundaries", before="parser")

<function __main__.set_custom_boundaries(doc)>

In [18]:
custom_ellipsis_doc = custom_nlp(ellipsis_text)

In [20]:
custom_ellipsis_sentences = list(custom_ellipsis_doc.sents)

for sentence in custom_ellipsis_sentences:
    print(sentence)

Gus, can you, ...
never mind, I forgot what I was saying.
So, do you think we should ...


Tokens in spaCy

In [21]:
import spacy
nlp = spacy.load("en_core_web_sm")
about_text = (
"Gus Proto is a Python developer currently"
" working for a London-based Fintech"
" company. He is interested in learning"
" Natural Language Processing."
)
about_doc = nlp(about_text)

for token in about_doc:
    print (token, token.idx)

Gus 0
Proto 4
is 10
a 13
Python 15
developer 22
currently 32
working 42
for 50
a 54
London 56
- 62
based 63
Fintech 69
company 77
. 84
He 86
is 89
interested 92
in 103
learning 106
Natural 115
Language 123
Processing 132
. 142
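
The second column above is token.idx, the character offset at which each token starts in the original string. A minimal sketch of the same idea using only the standard library, assuming a simple regex tokenizer (hypothetical, and far cruder than spaCy's tokenizer), shows how such an offset lets you slice a token back out of the text:

```python
import re

text = "Gus Proto is a Python developer"

# Hypothetical stand-in for a tokenizer: runs of word characters,
# or any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Pair each token with the offset where it starts, like (token.text, token.idx).
offsets = [(m.group(), m.start()) for m in TOKEN_RE.finditer(text)]

for tok, idx in offsets:
    # Slicing the original text at the offset recovers the token exactly.
    assert text[idx:idx + len(tok)] == tok
    print(tok, idx)
```

Because the offsets index into the untouched input string, they stay valid no matter how the text is tokenized.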


In [22]:
print(
    f'{"Text with Whitespace":22}'
    f'{"Is Alphabetic?":15}'
    f'{"Is Punctuation?":18}'
    f'{"Is Stop Word?"}'
)

Text with Whitespace  Is Alphabetic? Is Punctuation?   Is Stop Word?


In [23]:
for token in about_doc:
    print(
        f'{str(token.text_with_ws):22}'
        f'{str(token.is_alpha):15}'
        f'{str(token.is_punct):18}'
        f'{str(token.is_stop)}'
    )

Gus                   True           False             False
Proto                 True           False             False
is                    True           False             True
a                     True           False             True
Python                True           False             False
developer             True           False             False
currently             True           False             False
working               True           False             False
for                   True           False             True
a                     True           False             True
London                True           False             False
-                     False          True              False
based                 True           False             False
Fintech               True           False             False
company               True           False             False
.                     False          True              False
He                    True  

In [24]:
custom_about_text = (
"Gus Proto is a Python developer currently"
" working for a London@based Fintech"
" company. He is interested in learning"
" Natural Language Processing."

)

tokens_to_print = [token.text for token in about_doc[8:15]]
print(tokens_to_print)

['for', 'a', 'London', '-', 'based', 'Fintech', 'company']


In [25]:
import re
import spacy
from spacy.tokenizer import Tokenizer

custom_nlp = spacy.load("en_core_web_sm")

prefix_re = spacy.util.compile_prefix_regex(custom_nlp.Defaults.prefixes)
suffix_re = spacy.util.compile_suffix_regex(custom_nlp.Defaults.suffixes)

custom_infixes = [r"@"]
infix_re = spacy.util.compile_infix_regex(
    list(custom_nlp.Defaults.infixes) + custom_infixes
)

custom_nlp.tokenizer = Tokenizer(
    custom_nlp.vocab,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
    token_match=None,
)

custom_about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London@based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."

)

custom_tokenizer_about_doc = custom_nlp(custom_about_text)

print([token.text for token in custom_tokenizer_about_doc[8:15]])

['for', 'a', 'London', '@', 'based', 'Fintech', 'company']
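
An infix pattern tells the tokenizer where it may split inside a chunk of text that contains no whitespace. As a rough illustration of that idea only (this is plain re, not how spaCy's Tokenizer is implemented, and split_infixes is a hypothetical helper), splitting on a captured delimiter keeps the delimiter as its own piece:

```python
import re

# Hypothetical infix pattern: split on "@" or "-" and keep the delimiter.
INFIX_RE = re.compile(r"(@|-)")

def split_infixes(chunk):
    """Split one whitespace-free chunk on infix characters."""
    # re.split keeps captured delimiters; drop empty strings at the edges.
    return [piece for piece in INFIX_RE.split(chunk) if piece]

pieces = [
    piece
    for chunk in "a London@based Fintech company".split()
    for piece in split_infixes(chunk)
]
print(pieces)
```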


Stop Words

In [26]:
import spacy

spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
len(spacy_stopwords)

for stop_word in list(spacy_stopwords)[:10]:
    print(stop_word)

nobody
rather
various
under
being
made
not
their
should
while


In [27]:
import spacy

custom_about_text = (
"Gus Proto is a Python developer currently"
" working for a London-based Fintech"
" company. He is interested in learning"
" Natural Language Processing."

)

nlp = spacy.load("en_core_web_sm")
about_doc = nlp(custom_about_text)

filtered_tokens = [token.text for token in about_doc if not token.is_stop]

print(filtered_tokens)

['Gus', 'Proto', 'Python', 'developer', 'currently', 'working', 'London', '-', 'based', 'Fintech', 'company', '.', 'interested', 'learning', 'Natural', 'Language', 'Processing', '.']


Lemmatization

In [28]:
import spacy

nlp = spacy.load("en_core_web_sm")
conference_help_text = (
"Gus is helping organize a developer"
" conference on Applications of Natural Language"
" Processing. He keeps organizing local Python meetups"
" and several internal talks at his workplace."
)

conference_help_doc = nlp(conference_help_text)

for token in conference_help_doc:
    if str(token) != str(token.lemma_):
        print(f"{str(token):>20}:{str(token.lemma_)}")

                  is:be
          Processing:processing
                  He:he
               keeps:keep
          organizing:organize
             meetups:meetup
               talks:talk


Word Frequency

In [29]:
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

complete_text = (
"Gus Proto is a Python developer currently"
" working for a London-based Fintech company. He is"
" interested in learning Natural Language Processing."
" There is a developer conference happening on 21 July"
' 2019 in London. It is titled "Applications of Natural'
' Language Processing". There is a helpline number'
" available at +44-1234567891. Gus is helping organize it."
" He keeps organizing local Python meetups and several"
" internal talks at his workplace. Gus is also presenting"
' a talk. The talk will introduce the reader about "Use'
' cases of Natural Language Processing in Fintech".'
" Apart from his work, he is very passionate about music."
" Gus is learning to play the Piano. He has enrolled"
" himself in the weekend batch of Great Piano Academy."
" Great Piano Academy is situated in Mayfair or the City"
" of London and has world-class piano instructors."
)

complete_doc = nlp(complete_text)

words = [
    token.text
    for token in complete_doc
    if not token.is_stop and not token.is_punct
]
print(Counter(words).most_common(5))

[('Gus', 4), ('London', 3), ('Natural', 3), ('Language', 3), ('Processing', 3)]


In [30]:
Counter(
    [token.text for token in complete_doc if not token.is_punct]
).most_common(5)

[('is', 10), ('a', 5), ('in', 5), ('Gus', 4), ('of', 4)]
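
Both counts above are case-sensitive, so "The" and "the", or "Piano" and "piano", are tallied separately. A minimal sketch, using a short hand-written token list as a stand-in for the tokens of complete_doc (the list is an assumption for illustration), shows how lower-casing before counting merges such variants:

```python
from collections import Counter

# Hypothetical sample tokens; in the notebook these would come from
# [token.text for token in complete_doc if not token.is_punct].
tokens = ["The", "talk", "will", "introduce", "the", "reader",
          "Piano", "piano", "the", "Piano"]

# Counting the raw strings keeps "Piano" and "piano" apart.
case_sensitive = Counter(tokens)

# Lower-casing first folds the variants into a single key.
case_folded = Counter(tok.lower() for tok in tokens)

print(case_sensitive["Piano"], case_sensitive["piano"])
print(case_folded["piano"])
```

Whether to fold case depends on the task: it helps for frequency analysis, but it discards the capitalization that can distinguish proper nouns.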

Part of Speech Tagging

In [31]:
import spacy

nlp = spacy.load("en_core_web_sm")

about_text = (
"Gus Proto is a Python developer currently"
" working for a London-based Fintech"
" company. He is interested in learning"
" Natural Language Processing."

)

about_doc = nlp(about_text)

for token in about_doc:
    print(
f"""
TOKEN: {str(token)}
TAG: {str(token.tag_):10} POS: {token.pos_}
EXPLANATION: {spacy.explain(token.tag_)}"""
)


TOKEN: Gus
TAG: NNP        POS: PROPN
EXPLANATION: noun, proper singular

TOKEN: Proto
TAG: NNP        POS: PROPN
EXPLANATION: noun, proper singular

TOKEN: is
TAG: VBZ        POS: AUX
EXPLANATION: verb, 3rd person singular present

TOKEN: a
TAG: DT         POS: DET
EXPLANATION: determiner

TOKEN: Python
TAG: NNP        POS: PROPN
EXPLANATION: noun, proper singular

TOKEN: developer
TAG: NN         POS: NOUN
EXPLANATION: noun, singular or mass

TOKEN: currently
TAG: RB         POS: ADV
EXPLANATION: adverb

TOKEN: working
TAG: VBG        POS: VERB
EXPLANATION: verb, gerund or present participle

TOKEN: for
TAG: IN         POS: ADP
EXPLANATION: conjunction, subordinating or preposition

TOKEN: a
TAG: DT         POS: DET
EXPLANATION: determiner

TOKEN: London
TAG: NNP        POS: PROPN
EXPLANATION: noun, proper singular

TOKEN: -
TAG: HYPH       POS: PUNCT
EXPLANATION: punctuation mark, hyphen

TOKEN: based
TAG: VBN        POS: VERB
EXPLANATION: verb, past participle

TOKEN: Fintech
TAG:

In [32]:
nouns = []
adjectives = []

for token in about_doc:
    if token.pos_ == "NOUN":
        nouns.append(token)
    if token.pos_ == "ADJ":
        adjectives.append(token)

In [33]:
nouns

[developer, company]

In [34]:
adjectives

[interested]

Visualization: Using displaCy

In [35]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

about_interest_text = "He is interested in learning Natural Language Processing."
about_interest_doc = nlp(about_interest_text)

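# In a notebook, displacy.render() draws the parse inline and returns None,
# so html_code is None here. Pass jupyter=False to get the markup as a string.
html_code = displacy.render(about_interest_doc, style="dep", options={"distance": 90})

print(html_code)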

None


Processing Functions

In [37]:
import spacy

nlp = spacy.load("en_core_web_sm")

complete_text = (
"Gus Proto is a Python developer currently"
" working for a London-based Fintech company. He is"
" interested in learning Natural Language Processing."
" There is a developer conference happening on 21 July"
' 2019 in London. It is titled "Applications of Natural'
' Language Processing". There is a helpline number'
" available at +44-1234567891. Gus is helping organize it."
" He keeps organizing local Python meetups and several"
" internal talks at his workplace. Gus is also presenting"
' a talk. The talk will introduce the reader about "Use'
' cases of Natural Language Processing in Fintech".'
" Apart from his work, he is very passionate about music."
" Gus is learning to play the Piano. He has enrolled"
" himself in the weekend batch of Great Piano Academy."
" Great Piano Academy is situated in Mayfair or the City"
" of London and has world-class piano instructors."

)

complete_doc = nlp(complete_text)

def is_token_allowed(token):
    return bool(
        token
        and str(token).strip()
        and not token.is_stop
        and not token.is_punct
    )

def preprocess_token(token):
    return token.lemma_.strip().lower()

complete_filtered_tokens = [
    preprocess_token(token)
    for token in complete_doc
    if is_token_allowed(token)
]

print(complete_filtered_tokens)

['gus', 'proto', 'python', 'developer', 'currently', 'work', 'london', 'base', 'fintech', 'company', 'interested', 'learn', 'natural', 'language', 'processing', 'developer', 'conference', 'happen', '21', 'july', '2019', 'london', 'title', 'application', 'natural', 'language', 'processing', 'helpline', 'number', 'available', '+44', '1234567891', 'gus', 'helping', 'organize', 'keep', 'organize', 'local', 'python', 'meetup', 'internal', 'talk', 'workplace', 'gus', 'present', 'talk', 'talk', 'introduce', 'reader', 'use', 'case', 'natural', 'language', 'processing', 'fintech', 'apart', 'work', 'passionate', 'music', 'gus', 'learn', 'play', 'piano', 'enrol', 'weekend', 'batch', 'great', 'piano', 'academy', 'great', 'piano', 'academy', 'situate', 'mayfair', 'city', 'london', 'world', 'class', 'piano', 'instructor']


Rule-Based Matching using spaCy

In [38]:
import spacy

nlp = spacy.load("en_core_web_sm")

about_text = (
    "Gus Proto is a Python developer currently"
    " working for a London-based Fintech"
    " company. He is interested in learning"
    " Natural Language Processing."
)

about_doc = nlp(about_text)

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

def extract_full_name(nlp_doc):
    pattern = [{"POS": "PROPN"}, {"POS": "PROPN"} ]
    matcher.add("FULL_NAME", [pattern])
    matches = matcher(nlp_doc)
    for _, start, end in matches:
        span = nlp_doc[start:end]
        yield span.text

# Print the extracted full name
print(next(extract_full_name(about_doc)))

Gus Proto


Dependency Parsing using spaCy

In [39]:
import spacy

nlp = spacy.load("en_core_web_sm")
piano_text = "Gus is learning piano"
piano_doc = nlp(piano_text)

for token in piano_doc:
    print(
        f"""
TOKEN: {token.text}

{token.tag_ =}
{token.head.text =}
{token.dep_ =}"""
    )


TOKEN: Gus

token.tag_ ='NNP'
token.head.text ='learning'
token.dep_ ='nsubj'

TOKEN: is

token.tag_ ='VBZ'
token.head.text ='learning'
token.dep_ ='aux'

TOKEN: learning

token.tag_ ='VBG'
token.head.text ='learning'
token.dep_ ='ROOT'

TOKEN: piano

token.tag_ ='NN'
token.head.text ='learning'
token.dep_ ='dobj'


In [40]:
displacy.serve(piano_doc, style="dep")




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [41]:
import spacy

nlp = spacy.load("en_core_web_sm")

one_line_about_text = (
"Gus Proto is a Python developer"
" currently working for a London-based Fintech company"

)

one_line_about_doc = nlp(one_line_about_text)

# Extract children of `developer`
print([token.text for token in one_line_about_doc[5].children])

# Extract previous neighboring node of `developer`
print(one_line_about_doc[5].nbor(-1))

# Extract next neighboring node of `developer`
print(one_line_about_doc[5].nbor())

# Extract all tokens on the left of `developer`
print([token.text for token in one_line_about_doc[5].lefts])

# Extract tokens on the right of `developer`
print([token.text for token in one_line_about_doc[5].rights])

# Print subtree of `developer`
print(list(one_line_about_doc[5].subtree))

['a', 'Python', 'working']
Python
currently
['a', 'Python']
['working']
[a, Python, developer, currently, working, for, a, London, -, based, Fintech, company]
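
Attributes such as children and subtree all follow from one fact per token: its head. A minimal sketch in plain Python, using the heads printed earlier for "Gus is learning piano" (the helper functions children and subtree below are hypothetical and only mimic the spaCy attributes, not their implementation), shows how they can be derived:

```python
words = ["Gus", "is", "learning", "piano"]

# Index of each token's head, taken from the parse shown earlier;
# the root ("learning") points to itself.
heads = [2, 2, 2, 2]

def children(i):
    """Indices of the tokens whose head is token i (the root is not its own child)."""
    return [j for j, head in enumerate(heads) if head == i and j != i]

def subtree(i):
    """Token i plus all of its descendants, in sentence order."""
    found = {i}
    stack = [i]
    while stack:
        for child in children(stack.pop()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return sorted(found)

print([words[j] for j in children(2)])
print([words[j] for j in subtree(2)])
print([words[j] for j in subtree(0)])
```

The subtree of the root is the whole sentence, while the subtree of a leaf such as "Gus" is just the token itself.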


Named Entity Recognition

In [42]:
import spacy

nlp = spacy.load("en_core_web_sm")

piano_class_text = (
"Great Piano Academy is situated"
" in Mayfair or the City of London and has"
" world-class piano instructors."

)

piano_class_doc = nlp(piano_class_text)

for ent in piano_class_doc.ents:
    print(
        f"""
{ent.text =}
{ent.start_char =}
{ent.end_char =}
{ent.label_ =}
spacy.explain('{ent.label_}') = {spacy.explain(ent.label_)}"""
    )


ent.text ='Great Piano Academy'
ent.start_char =0
ent.end_char =19
ent.label_ ='ORG'
spacy.explain('ORG') = Companies, agencies, institutions, etc.

ent.text ='Mayfair'
ent.start_char =35
ent.end_char =42
ent.label_ ='FAC'
spacy.explain('FAC') = Buildings, airports, highways, bridges, etc.

ent.text ='the City of London'
ent.start_char =46
ent.end_char =64
ent.label_ ='GPE'
spacy.explain('GPE') = Countries, cities, states


In [43]:
displacy.serve(piano_class_doc, style="ent")




Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [44]:
survey_text = (
"Out of 5 people surveyed, James Robert,"
" Julie Fuller and Benjamin Brooks like"
" apples. Kelly Cox and Matthew Evans"
" like oranges."
)

def replace_person_names(token):
    if token.ent_iob != 0 and token.ent_type_ == "PERSON":
        return "[REDACTED] "
    return token.text_with_ws

def redact_names(nlp_doc):
    with nlp_doc.retokenize() as retokenizer:
        for ent in nlp_doc.ents:
            retokenizer.merge(ent)
    tokens = map(replace_person_names, nlp_doc)
    return "".join(tokens)

survey_doc = nlp(survey_text)
print(redact_names(survey_doc))

Out of 5 people surveyed, [REDACTED] , [REDACTED] and [REDACTED] like apples. [REDACTED] and [REDACTED] like oranges.
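
Note the stray space before the comma in the output: replace_person_names always appends a space, even when the name is followed by punctuation. One way to avoid this, sketched here with plain string slicing, is to redact by character offsets of the kind exposed by ent.start_char and ent.end_char. The spans below are computed from a hand-written list of names for illustration (an assumption), not taken from a model, and redact is a hypothetical helper:

```python
text = "James Robert, Julie Fuller and Benjamin Brooks like apples."

# Hypothetical stand-in for the PERSON entities a model would return.
names = ["James Robert", "Julie Fuller", "Benjamin Brooks"]

# (start_char, end_char) pairs, analogous to ent.start_char / ent.end_char.
person_spans = [(text.index(n), text.index(n) + len(n)) for n in names]

def redact(text, spans, placeholder="[REDACTED]"):
    """Replace each character span with the placeholder."""
    # Work from right to left so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text

redacted = redact(text, person_spans)
print(redacted)
```

Because only the characters inside each span are replaced, the whitespace and punctuation around every name are left exactly as they were.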


Inference:
In this lab we installed spaCy and used it for a range of text-processing tasks. We created a Doc object from a string and from a file, and added a custom pipeline component that starts a new sentence after each ellipsis. Tokenization exposed per-token attributes such as the character offset and whether a token is alphabetic, punctuation, or a stop word, and a custom infix rule let the tokenizer split on "@". We then removed stop words, reduced words to their lemmas, counted word frequencies, and tagged parts of speech, combining several of these steps into a small preprocessing function. Rule-based matching extracted a full name, dependency parsing exposed the syntactic relations between tokens (visualized with displaCy), and named entity recognition identified organizations and places and was used to redact person names. Together these steps cover the main building blocks of an NLP pipeline in spaCy.