Welcome to the homework assignment on text processing with NLTK. In this
assignment, you will apply the text-processing techniques we have learned in
the previous lessons using the Natural Language Toolkit (NLTK) library in
Python.

You will be given a set of text data and asked to perform various
text-processing tasks, including tokenization, stop word removal, stemming,
lemmatization, and regular expression matching. These tasks are common in
natural language processing and are essential for preparing text data for
analysis and modelling.

By completing this assignment, you will gain hands-on experience in
applying these techniques to real-world text data and become more
proficient in using NLTK for text processing. Good luck!

1. Tokenize the following text into individual words:

In [6]:
import nltk

nltk.download('punkt_tab')


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [7]:
from nltk.tokenize import word_tokenize

text = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(text)
print(tokens)



['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
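
To make the token boundaries in the output above concrete, here is a rough regex-only sketch of word-level tokenization. The helper name `simple_tokenize` is made up for this illustration, and this is not how `word_tokenize` works internally. It happens to agree on this sentence, but it splits contractions differently (`word_tokenize` turns "Don't" into `Do` and `n't`).

```python
import re

def simple_tokenize(text):
    """Hypothetical helper: a crude approximation of word-level tokenization."""
    # A run of word characters becomes one token; every other
    # non-space character (e.g. punctuation) becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "The quick brown fox jumps over the lazy dog."
simple_tokens = simple_tokenize(sentence)
print(simple_tokens)
```

For anything beyond simple sentences, the NLTK tokenizer remains the safer choice, since it handles abbreviations and contractions that this pattern does not.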


2. Remove stop words from the following text:

In [8]:
from nltk.corpus import stopwords

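nltk.download('stopwords')
# A set gives constant-time membership checks.
stop_words = set(stopwords.words('english'))
# The NLTK stop word list is lowercase, so lowercase each token before comparing.
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]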
print(filtered_tokens)


['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', '.']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
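
Note that the filtered list still contains the `'.'` token, because punctuation is not part of the stop word list. If punctuation should go as well, one possible approach is to add an `isalpha()` check. The sketch below uses a tiny hand-written stop set (`demo_stop_words`, an assumption made so the example runs without downloads) in place of the full NLTK list.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
demo_stop_words = {"the", "over"}

demo_tokens = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']

def remove_stop_words_and_punct(tokens, stop_words):
    """Keep purely alphabetic tokens that are not stop words."""
    return [tok for tok in tokens
            if tok.isalpha() and tok.lower() not in stop_words]

content_tokens = remove_stop_words_and_punct(demo_tokens, demo_stop_words)
print(content_tokens)
```

Be aware that `isalpha()` also drops tokens containing digits or hyphens, which may or may not be what a given task needs.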


3. Stem the following words using the Porter stemmer:

In [9]:
from nltk.stem import PorterStemmer

words = ["studies", "studying", "studied", "study"]
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in words]
print(stems)


['studi', 'studi', 'studi', 'studi']
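
All four words collapse to `studi`, which is not a dictionary word: a stemmer strips suffixes by rule and does not look anything up. The toy function below mimics only the few early Porter rules that fire for these four words. It is a simplified illustration under that assumption (for example, it treats only a, e, i, o, u as vowels), not the Porter algorithm and not NLTK's implementation.

```python
VOWELS = set("aeiou")

def has_vowel(stem):
    return any(ch in VOWELS for ch in stem)

def toy_step1(word):
    """Hypothetical, heavily reduced subset of Porter's step 1 rules."""
    # Rule (a): plural ending "ies" -> "i"
    if word.endswith("ies"):
        word = word[:-2]
    # Rule (b): drop "ed" or "ing" when the remaining stem contains a vowel
    for suffix in ("ed", "ing"):
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and has_vowel(stem):
            word = stem
            break
    # Rule (c): final "y" -> "i" when the remaining stem contains a vowel
    if word.endswith("y") and has_vowel(word[:-1]):
        word = word[:-1] + "i"
    return word

toy_stems = [toy_step1(w) for w in ["studies", "studying", "studied", "study"]]
print(toy_stems)
```

The sketch only covers these inputs; use `PorterStemmer` for real stemming.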


4. Lemmatize the following words using the WordNet lemmatizer:

In [10]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

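lemmatizer = WordNetLemmatizer()
# lemmatize() assumes pos='n' (noun) by default, so the verb forms
# 'studying' and 'studied' are returned unchanged in the output.
lemmatized = [lemmatizer.lemmatize(word) for word in words]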
print(lemmatized)


[nltk_data] Downloading package wordnet to /root/nltk_data...


['study', 'studying', 'studied', 'study']


5. Use regular expressions to extract all email addresses:

In [11]:
import re

text = "Please contact us at info@example.com or support@example.com for further assistance."
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
emails = re.findall(email_pattern, text)
print(emails)


['info@example.com', 'support@example.com']
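
If the local part and the domain are needed separately, the same pattern can be given named groups. The group names `local` and `domain` are chosen here for illustration. Like the pattern above, this is a pragmatic match for common addresses, not a full validator of the email address syntax.

```python
import re

contact_text = ("Please contact us at info@example.com or "
                "support@example.com for further assistance.")

# Same character classes as the pattern above, split into two named groups.
email_parts = re.compile(
    r'\b(?P<local>[A-Za-z0-9._%+-]+)@(?P<domain>[A-Za-z0-9.-]+\.[A-Za-z]{2,})\b'
)

# finditer yields match objects, so each group can be read by name.
email_pairs = [(m.group("local"), m.group("domain"))
               for m in email_parts.finditer(contact_text)]
print(email_pairs)
```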


6. Use regular expressions to extract all phone numbers:

In [12]:
phone_text = "Please call us at (555) 123-4567 or (555) 987-6543 for further assistance."
phone_pattern = r'\(\d{3}\) \d{3}-\d{4}'
phone_numbers = re.findall(phone_pattern, phone_text)
print(phone_numbers)


['(555) 123-4567', '(555) 987-6543']
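
The pattern above matches only the exact `(555) 123-4567` layout. As a sketch of how it might be loosened, the variant below also accepts `555-123-4567` and `555.123.4567`. Which formats actually need support is an assumption here, and the sample sentence is invented for the example.

```python
import re

sample_text = "Call (555) 123-4567, 555-987-6543 or 555.246.8101."

# Area code either in parentheses (optional space after) or followed by a
# separator; then three digits, a separator, and four digits.
flexible_phone_pattern = r'(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}'

found_numbers = re.findall(flexible_phone_pattern, sample_text)
print(found_numbers)

# Normalize each match to digits only, which makes comparison easier.
digits_only = [re.sub(r'\D', '', number) for number in found_numbers]
print(digits_only)
```

The non-capturing group `(?:...)` matters: with a capturing group, `re.findall` would return only the group contents instead of the whole match.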


7. Clean the following text by removing all punctuation and numbers, and converting all letters to lowercase:

In [13]:
import string

text_to_clean = "Hello, world! This is some text with some special characters like $ and %, and some extra whitespace. 123"
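# Keep only characters that are neither punctuation nor digits. Spaces are kept,
# so each removed symbol leaves a double space (or trailing space) behind.
cleaned_text = ''.join(char.lower() for char in text_to_clean if char not in string.punctuation and not char.isdigit())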
print(cleaned_text)


hello world this is some text with some special characters like  and  and some extra whitespace 
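
The cleaned output still has double spaces where `$` and `%` were removed, plus a trailing space. One way to tidy this up is sketched below with `str.translate` and a split/join to collapse whitespace. The function name `clean_text` is made up for this example. It removes ASCII digits only, whereas `char.isdigit()` also matches other Unicode digits.

```python
import string

raw_text = ("Hello, world! This is some text with some special characters "
            "like $ and %, and some extra whitespace. 123")

def clean_text(text):
    """Lowercase, drop ASCII punctuation and digits, collapse whitespace."""
    # Translation table that deletes every punctuation character and digit.
    drop_table = str.maketrans("", "", string.punctuation + string.digits)
    stripped = text.lower().translate(drop_table)
    # split() with no arguments splits on any whitespace run and discards
    # empty strings, so joining with one space normalizes the spacing.
    return " ".join(stripped.split())

tidy_text = clean_text(raw_text)
print(tidy_text)
```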


8. Convert the following text to a bag-of-words representation:

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

bag_of_words_text = ["The quick brown fox jumps over the lazy dog."]
vectorizer = CountVectorizer()
bag_of_words = vectorizer.fit_transform(bag_of_words_text)
print(vectorizer.get_feature_names_out())
print(bag_of_words.toarray())


['brown' 'dog' 'fox' 'jumps' 'lazy' 'over' 'quick' 'the']
[[1 1 1 1 1 1 1 2]]
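
To see where these numbers come from, the counts can be reproduced for this single sentence with the standard library. By default `CountVectorizer` lowercases the text and keeps tokens of two or more word characters; the sketch below copies that behaviour with the same token pattern. It is an illustration for one document, not a replacement for the vectorizer, which also builds a shared vocabulary across documents and returns a sparse matrix.

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def bag_of_words_counts(document):
    """Count lowercase tokens of at least two word characters."""
    return Counter(TOKEN_PATTERN.findall(document.lower()))

counts = bag_of_words_counts("The quick brown fox jumps over the lazy dog.")
vocabulary = sorted(counts)                      # alphabetical, like the vectorizer
vector = [counts[term] for term in vocabulary]   # one count per vocabulary term
print(vocabulary)
print(vector)
```

The final `2` corresponds to `the`, which appears twice once "The" has been lowercased.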
