# Basic NLP Course

## NLTK vs spaCy

This is the second class of the Basic NLP Course, where we will explore two popular Python libraries for Natural Language Processing: NLTK and spaCy. These packages provide powerful tools for text processing, tokenization, parsing, and more.

spaCy adopts an object-oriented approach: processing a text returns a `Doc` object whose tokens and spans carry their annotations as attributes. This makes it a good fit when you mainly care about the end result and need an efficient pipeline for NLP tasks. NLTK, by contrast, works mostly with plain strings and lists of strings. It gives access to a wide range of algorithms and is easier to customize, which suits those who want to experiment and fine-tune their workflows.
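
To make that contrast concrete, here is a minimal sketch. It assumes `spacy.blank("en")` (a tokenizer-only pipeline, so no model download is needed) and NLTK's `wordpunct_tokenize` (a regex-based tokenizer that needs no data files). Both are stand-ins chosen for this illustration, not the setup used in the rest of the class.

```python
import spacy
from nltk.tokenize import wordpunct_tokenize

text = "Priority: High."

# spaCy: the result is a Doc made of Token objects that expose attributes.
blank_nlp = spacy.blank("en")  # assumption: tokenizer only, no trained components
doc_demo = blank_nlp(text)
for tok in doc_demo:
    print(type(tok).__name__, repr(tok.text), tok.is_punct)

# NLTK: the result is a plain list of Python strings.
strings = wordpunct_tokenize(text)
print(type(strings).__name__, strings)
```

With spaCy you ask each token about itself (`tok.is_punct`); with NLTK you get strings back and decide yourself what to do with them.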

In [2]:
import nltk
import spacy
from nltk.tokenize import sent_tokenize, word_tokenize

# Load the small English NLP model
nlp = spacy.load("en_core_web_sm")

# Download the Punkt tokenizer data used by NLTK's sent_tokenize and word_tokenize
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [3]:
work_order = "Work Order: Maintenance required for HVAC system in Building A. Priority: High. Assigned to: John Doe."
print(work_order)

# Process the text with spaCy
doc = nlp(work_order)

Work Order: Maintenance required for HVAC system in Building A. Priority: High. Assigned to: John Doe.


In [4]:
# sentence tokenization
for sentence in doc.sents:
    print(sentence)

Work Order: Maintenance required for HVAC system in Building A. Priority: High.
Assigned to: John Doe.


In [5]:
# word tokenization
for sentence in doc.sents:
    for word in sentence:
        print(word)

Work
Order
:
Maintenance
required
for
HVAC
system
in
Building
A.
Priority
:
High
.
Assigned
to
:
John
Doe
.


In [None]:
# analysing types of objects
print(type(doc))
print(type(doc.sents))
print(type(doc[0]))

<class 'spacy.tokens.doc.Doc'>
<class '_cython_3_1_1.generator'>
<class 'spacy.tokens.token.Token'>


In [None]:
# we can create a span
span = doc[3:7]
print(span)
print(type(span))

Maintenance required for HVAC
<class 'spacy.tokens.span.Span'>
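
A span is addressed by token positions, not by characters, and it still knows where it sits in the original string. The sketch below illustrates this under one assumption: it uses `spacy.blank("en")`, a tokenizer-only pipeline, and its own variable names so it does not overwrite `nlp` or `doc` from the cells above.

```python
import spacy

blank_nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline, no model download
demo_doc = blank_nlp("Work Order: Maintenance required for HVAC system in Building A.")

demo_span = demo_doc[3:7]  # tokens 3, 4, 5 and 6 (the end index is exclusive)
print(demo_span.text)
print(demo_span.start, demo_span.end)            # token offsets
print(demo_span.start_char, demo_span.end_char)  # character offsets

# The character offsets point back into the original string.
print(demo_doc.text[demo_span.start_char:demo_span.end_char])
```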


In [12]:
# listing the attributes and methods of a Token object
token = doc[3]
dir(token)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_dep',
 'has_extension',
 'has_head',
 'has_morph',
 'has_vector',
 'head',
 'i',
 'idx',
 'iob_strings',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang
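
Most of the `dir()` listing is taken up by Python's special methods. A small sketch of one way to keep only the public names follows; it assumes a tokenizer-only `spacy.blank("en")` pipeline and a throwaway token named `demo_token`, both introduced just for this illustration.

```python
import spacy

blank_nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
demo_token = blank_nlp("Maintenance required")[0]

# Keep only names that do not start with an underscore.
public_names = [name for name in dir(demo_token) if not name.startswith("_")]
print(len(public_names))
print(public_names[:10])
```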

In [13]:
# checking if the token is alphabetic
print('Is the token alphabetic?')
print(token.is_alpha)

# checking if the token is numeric
print('Is the token numeric?')
print(token.is_digit)

# checking if the token is a space
print('Is the token a space?')
print(token.is_space)

# checking the part of speech tag
print('Part of speech tag:')
print(token.pos_)

# checking the dependency label
print('Dependency label:')
print(token.dep_)

# checking the lemma of the token
print('Lemma:')
print(token.lemma_)

# checking the shape of the token
print('Shape:')
print(token.shape_)

# checking the prefix of the token
print('Prefix:')
print(token.prefix_)

# checking the suffix of the token
print('Suffix:')
print(token.suffix_)

Is the token alphabetic?
True
Is the token numeric?
False
Is the token a space?
False
Part of speech tag:
PROPN
Dependency label:
nsubj
Lemma:
Maintenance
Shape:
Xxxxx
Prefix:
M
Suffix:
nce


In [14]:
for token in doc:
    print(f'Token: {token.text}, POS: {token.pos_}, Dependency: {token.dep_}, Lemma: {token.lemma_}')
    print(f'Is alpha: {token.is_alpha}, Index: {token.i}, Is punctuation: {token.is_punct}, Like num: {token.like_num}, Is currency: {token.is_currency}')

Token: Work, POS: NOUN, Dependency: compound, Lemma: work
Is alpha: True, Index: 0, Is punctuation: False, Like num: False, Is currency: False
Token: Order, POS: NOUN, Dependency: ROOT, Lemma: order
Is alpha: True, Index: 1, Is punctuation: False, Like num: False, Is currency: False
Token: :, POS: PUNCT, Dependency: punct, Lemma: :
Is alpha: False, Index: 2, Is punctuation: True, Like num: False, Is currency: False
Token: Maintenance, POS: PROPN, Dependency: nsubj, Lemma: Maintenance
Is alpha: True, Index: 3, Is punctuation: False, Like num: False, Is currency: False
Token: required, POS: VERB, Dependency: acl, Lemma: require
Is alpha: True, Index: 4, Is punctuation: False, Like num: False, Is currency: False
Token: for, POS: ADP, Dependency: prep, Lemma: for
Is alpha: True, Index: 5, Is punctuation: False, Like num: False, Is currency: False
Token: HVAC, POS: PROPN, Dependency: compound, Lemma: HVAC
Is alpha: True, Index: 6, Is punctuation: False, Like num: False, Is currency: False
T

In [None]:
import random

# Pools of names, email domains, and departments to sample from
names = ["Alice", "Bob", "Charlie", "David", "Eve", "Frank", "Grace", "Hannah", "Ivy", "Jack"]
domains = ["example.com", "company.org", "industry.net", "business.io"]
departments = ["HR", "Finance", "Engineering", "Marketing", "Sales", "IT", "Operations"]

# Create a string simulating a text file
text_file_content = ""
for i in range(501):  # Generate 501 entries
    name = random.choice(names)
    email = f"{name.lower()}{i}@{random.choice(domains)}"
    dept = random.choice(departments)
    text_file_content += f"{name}\t{email}\t{dept}\n"

# Print the simulated text file content
print(text_file_content)

Bob	bob0@example.com	Engineering
Frank	frank1@company.org	Operations
Bob	bob2@business.io	Engineering
Jack	jack3@industry.net	Engineering
David	david4@industry.net	Engineering
Eve	eve5@business.io	HR
Alice	alice6@company.org	Engineering
Bob	bob7@example.com	Finance
Hannah	hannah8@example.com	Engineering
Grace	grace9@business.io	Sales
Jack	jack10@company.org	Finance
David	david11@example.com	Engineering
Jack	jack12@example.com	Engineering
Alice	alice13@business.io	HR
Grace	grace14@company.org	Sales
Alice	alice15@company.org	Operations
Eve	eve16@example.com	Marketing
Ivy	ivy17@example.com	Operations
David	david18@example.com	Sales
Jack	jack19@industry.net	HR
Charlie	charlie20@industry.net	Marketing
Bob	bob21@business.io	HR
Jack	jack22@example.com	Marketing
David	david23@business.io	IT
Grace	grace24@example.com	Marketing
Bob	bob25@example.com	HR
Hannah	hannah26@business.io	Engineering
Eve	eve27@industry.net	Marketing
Eve	eve28@business.io	HR
Charlie	charlie29@example.com	Operations
Eve	ev
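
Because the cell above draws from the global `random` state, every run produces different entries. If repeatable data is wanted, one option is a seeded generator, sketched below. The helper `make_directory` and its shortened name pools are hypothetical, introduced only for this example.

```python
import random


def make_directory(n_entries, seed):
    """Build tab-separated name/email/department lines (hypothetical helper)."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    names = ["Alice", "Bob", "Charlie"]
    domains = ["example.com", "company.org"]
    departments = ["HR", "Finance", "Engineering"]
    lines = []
    for i in range(n_entries):
        name = rng.choice(names)
        email = f"{name.lower()}{i}@{rng.choice(domains)}"
        lines.append(f"{name}\t{email}\t{rng.choice(departments)}")
    # Join once at the end instead of growing a string inside the loop.
    return "\n".join(lines) + "\n"


first = make_directory(5, seed=42)
second = make_directory(5, seed=42)
print(first)
print(first == second)  # the same seed gives the same content
```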

In [16]:
# create a spaCy document from the text file content
doc = nlp(text_file_content)

# printing emails only
for token in doc:
    if token.like_email:
        print(token.text)

bob0@example.com
frank1@company.org
bob2@business.io
jack3@industry.net
david4@industry.net
eve5@business.io
alice6@company.org
bob7@example.com
hannah8@example.com
grace9@business.io
jack10@company.org
david11@example.com
jack12@example.com
alice13@business.io
grace14@company.org
alice15@company.org
eve16@example.com
ivy17@example.com
david18@example.com
jack19@industry.net
charlie20@industry.net
bob21@business.io
jack22@example.com
david23@business.io
grace24@example.com
bob25@example.com
hannah26@business.io
eve27@industry.net
eve28@business.io
charlie29@example.com
eve30@company.org
charlie31@business.io
eve32@company.org
bob33@business.io
ivy34@example.com
grace35@example.com
alice36@company.org
david37@business.io
eve38@industry.net
ivy39@business.io
frank40@business.io
bob41@example.com
david42@business.io
grace43@example.com
hannah44@example.com
charlie45@business.io
jack46@company.org
charlie47@industry.net
eve48@industry.net
hannah49@example.com
bob50@example.com
charlie51@ex
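
`like_email` is a lexical attribute, computed from the token text alone, so it does not depend on the tagger or parser. As a sketch under that observation, a tokenizer-only pipeline from `spacy.blank("en")` is enough for this kind of filtering; the two-line `sample` string below is made up in the same format as the simulated file.

```python
import spacy

blank_nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline, no trained components
sample = "Bob\tbob0@example.com\tEngineering\nFrank\tfrank1@company.org\tOperations\n"

# Collect the text of every token that looks like an email address.
emails = [tok.text for tok in blank_nlp(sample) if tok.like_email]
print(emails)
```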

In [17]:
# now let's get the emails only if the department is Engineering
emails_for_wo = []
for line in text_file_content.split('\n'):
    parts = line.split('\t')
    if len(parts) == 3:
        name, email, dept = parts
        if dept == "Engineering":
            print(f"Name: {name}, Email: {email}, Department: {dept}")
            emails_for_wo.append(email)

Name: Bob, Email: bob0@example.com, Department: Engineering
Name: Bob, Email: bob2@business.io, Department: Engineering
Name: Jack, Email: jack3@industry.net, Department: Engineering
Name: David, Email: david4@industry.net, Department: Engineering
Name: Alice, Email: alice6@company.org, Department: Engineering
Name: Hannah, Email: hannah8@example.com, Department: Engineering
Name: David, Email: david11@example.com, Department: Engineering
Name: Jack, Email: jack12@example.com, Department: Engineering
Name: Hannah, Email: hannah26@business.io, Department: Engineering
Name: Eve, Email: eve38@industry.net, Department: Engineering
Name: Ivy, Email: ivy39@business.io, Department: Engineering
Name: Charlie, Email: charlie51@example.com, Department: Engineering
Name: Eve, Email: eve59@example.com, Department: Engineering
Name: David, Email: david66@example.com, Department: Engineering
Name: Grace, Email: grace77@company.org, Department: Engineering
Name: Bob, Email: bob85@company.org, Departm
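
The manual `split` calls above work well for this data. For comparison, here is a sketch of the same filter using Python's standard `csv` module, which is one common alternative for tab-separated text. The `sample` string and the list name `engineering_emails` are illustrative only.

```python
import csv
import io

sample = "Bob\tbob0@example.com\tEngineering\nFrank\tfrank1@company.org\tOperations\n"

# csv.reader splits each line on the tab delimiter.
reader = csv.reader(io.StringIO(sample), delimiter="\t")

engineering_emails = []
for row in reader:
    if len(row) != 3:  # skip blank or malformed lines
        continue
    name, email, dept = row
    if dept == "Engineering":
        engineering_emails.append(email)

print(engineering_emails)
```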

In [7]:
# tokenizing with nltk
sent_tokenize(work_order)

['Work Order: Maintenance required for HVAC system in Building A.',
 'Priority: High.',
 'Assigned to: John Doe.']

In [8]:
# word tokenizing with nltk
word_tokenize(work_order)

['Work',
 'Order',
 ':',
 'Maintenance',
 'required',
 'for',
 'HVAC',
 'system',
 'in',
 'Building',
 'A',
 '.',
 'Priority',
 ':',
 'High',
 '.',
 'Assigned',
 'to',
 ':',
 'John',
 'Doe',
 '.']
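
As a small illustration of the customization NLTK allows, the sketch below swaps `word_tokenize` for `RegexpTokenizer`, which splits text with a pattern you supply and needs no downloaded data. The pattern `\w+` is just one possible choice; it keeps runs of word characters and therefore drops punctuation.

```python
from nltk.tokenize import RegexpTokenizer

# Keep runs of word characters only, which drops punctuation tokens.
word_only = RegexpTokenizer(r"\w+")
tokens_no_punct = word_only.tokenize("Priority: High. Assigned to: John Doe.")
print(tokens_no_punct)
```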