## **HAMMOUTI Douae**


The first part of this notebook is based on Sudalai Rajkumar's tutorial on Kaggle.
More information on the [***dataset***](https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter).


## **Introduction**

In any machine learning task, cleaning or preprocessing the data is at least as important as building the model. For unstructured data such as text, it matters even more.

The goal of this tutorial is to understand the various text preprocessing steps with code examples.

Some of the common text preprocessing / cleaning steps are:
* Lower casing
* Removal of Punctuations
* Removal of Stopwords
* Removal of Frequent words
* Removal of Rare words
* Stemming
* Lemmatization
* Removal of emojis
* Removal of emoticons
* Conversion of emoticons to words
* Conversion of emojis to words
* Removal of URLs
* Removal of HTML tags
* Chat words conversion
* Spelling correction


These are the main preprocessing steps we can apply to text data, but we do not need to apply all of them every time. The right steps depend on the use case, so they have to be chosen carefully.

For example, in a sentiment analysis use case, we should not remove emojis or emoticons, since they carry important information about the sentiment.
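One way to make this choice explicit is to write each step as a small function and list only the steps a given use case needs. The sketch below is illustrative only: the helper names (`lowercase`, `strip_punctuation`, `preprocess`) and the sample tweet are made up for this example and are not part of the notebook.

```python
import string
from typing import Callable, Iterable

Step = Callable[[str], str]

def lowercase(text: str) -> str:
    return text.lower()

def strip_punctuation(text: str) -> str:
    # string.punctuation only covers ASCII symbols, so the emoji survives
    return text.translate(str.maketrans("", "", string.punctuation))

def preprocess(text: str, steps: Iterable[Step]) -> str:
    """Apply the chosen steps in order."""
    for step in steps:
        text = step(text)
    return text

tweet = "GREAT service, thanks!!! 😊"
# Hypothetical choices: word counting drops punctuation, sentiment keeps it
for_word_counts = preprocess(tweet, [lowercase, strip_punctuation])
for_sentiment = preprocess(tweet, [lowercase])
print(for_word_counts)
print(for_sentiment)
```

Keeping the steps in a list makes it easy to compare two configurations on the same input before committing to one.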

In [3]:
import numpy as np
import pandas as pd
import re
import nltk
import spacy
import string
pd.options.mode.chained_assignment = None

Let's load the dataset and look at a few samples to see how it is structured.

In [4]:
df = pd.read_csv("./data/tweets_preprocessing.csv")
df["text"] = df["text"].astype(str)
print(f"Shape: {df.shape}")
df.head()

Shape: (93, 7)


Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,119237,105834,True,Wed Oct 11 06:55:44 +0000 2017,@AppleSupport causing the reply to be disregar...,119236.0,
1,119238,ChaseSupport,False,Wed Oct 11 13:25:49 +0000 2017,@105835 Your business means a lot to us. Pleas...,,119239.0
2,119239,105835,True,Wed Oct 11 13:00:09 +0000 2017,@76328 I really hope you all change but I'm su...,119238.0,
3,119240,VirginTrains,False,Tue Oct 10 15:16:08 +0000 2017,@105836 LiveChat is online at the moment - htt...,119241.0,119242.0
4,119241,105836,True,Tue Oct 10 15:17:21 +0000 2017,@VirginTrains see attached error message. I've...,119243.0,119240.0


## **Lower Casing**

Lower casing is a common text preprocessing technique. The idea is to convert the input text to a single casing so that 'text', 'Text' and 'TEXT' are treated the same way.

This is helpful when computing word frequencies, for example. It can hurt tasks like Part of Speech tagging (where casing gives information about proper nouns and so on) and Sentiment Analysis (where upper casing can signal anger or emphasis).

By default, lower casing is done by most modern vectorizers and tokenizers.
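The effect on word frequencies can be seen on a handful of tokens. This is a minimal sketch on a made-up string, using only the standard library:

```python
from collections import Counter

tokens = "Text text TEXT data Data".split()

# Without lower casing, every casing variant is a separate vocabulary entry
raw_counts = Counter(tokens)
# After lower casing, the variants collapse into one entry each
lowered_counts = Counter(token.lower() for token in tokens)

print(len(raw_counts), dict(raw_counts))
print(len(lowered_counts), dict(lowered_counts))
```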

In [5]:
text_df = df[["text"]].copy()  # explicit copy, so new columns are added to text_df only

In [6]:
print("Before lowering:")
print(text_df.head().text.values)

Before lowering:
['@AppleSupport causing the reply to be disregarded and the tapped notification under the keyboard is opened😡😡😡'
 '@105835 Your business means a lot to us. Please DM your name, zip code and additional details about your concern. ^RR https://t.co/znUu1VJn9r'
 "@76328 I really hope you all change but I'm sure you won't! Because you don't have to!"
 '@105836 LiveChat is online at the moment - https://t.co/SY94VtU8Kq or contact 03331 031 031 option 1, 4, 3 (Leave a message) to request a call back'
 "@VirginTrains see attached error message. I've tried leaving a voicemail several times in the past week https://t.co/NxVZjlYx1k"]


In [7]:
# Your code here:
# convert the text_df text column to lowercase
# keep in mind that text_df["text"] is a Pandas' Series
text_df["text"] = text_df["text"].str.lower()


assert text_df.text.values[2] == "@76328 i really hope you all change but i'm sure you won't! because you don't have to!", "Your text is not lowered properly"

print("After lowering:")
print(text_df.head().text.values)
text_df.head()

After lowering:
['@applesupport causing the reply to be disregarded and the tapped notification under the keyboard is opened😡😡😡'
 '@105835 your business means a lot to us. please dm your name, zip code and additional details about your concern. ^rr https://t.co/znuu1vjn9r'
 "@76328 i really hope you all change but i'm sure you won't! because you don't have to!"
 '@105836 livechat is online at the moment - https://t.co/sy94vtu8kq or contact 03331 031 031 option 1, 4, 3 (leave a message) to request a call back'
 "@virgintrains see attached error message. i've tried leaving a voicemail several times in the past week https://t.co/nxvzjlyx1k"]


Unnamed: 0,text
0,@applesupport causing the reply to be disregar...
1,@105835 your business means a lot to us. pleas...
2,@76328 i really hope you all change but i'm su...
3,@105836 livechat is online at the moment - htt...
4,@virgintrains see attached error message. i've...


## **Removal of Punctuations**

Another common text preprocessing technique is to remove punctuation from the text data. This is again a text standardization step that helps treat 'hurray' and 'hurray!' in the same way.

We also need to choose the list of punctuation symbols to exclude carefully, depending on the use case. For example, `string.punctuation` in Python contains the following symbols:

`` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``

We can add symbols to this list or remove symbols from it as needed.
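As an illustration of adjusting the list, the sketch below keeps `@` and `#` so that mentions and hashtags stay recognisable. The choice of symbols to keep and the names `KEEP`, `CUSTOM_PUNCT` and `remove_custom_punctuation` are assumptions made for this example.

```python
import string

# Hypothetical choice: keep '@' and '#', remove every other ASCII punctuation symbol
KEEP = "@#"
CUSTOM_PUNCT = "".join(ch for ch in string.punctuation if ch not in KEEP)

def remove_custom_punctuation(text: str) -> str:
    # maketrans with a third argument maps each listed character to None (deletion)
    return text.translate(str.maketrans("", "", CUSTOM_PUNCT))

cleaned = remove_custom_punctuation("@applesupport my phone died!!! #help (again)")
print(cleaned)
```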

In [8]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [9]:
PUNCT_TO_REMOVE = string.punctuation
def remove_punctuation(text: str) -> str:
    """
    Removes punctuation characters from the input text

    Args:
        text (str): The input text from which punctuation characters will be removed

    Returns:
        str: A new string with all punctuation characters removed
    """
    # Your code here:
    # use the maketrans function to remove the punctuation specified in PUNCT_TO_REMOVE [https://www.w3schools.com/python/ref_string_maketrans.asp]
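    # build a translation table that deletes every character in PUNCT_TO_REMOVE
    # (named translation_table to avoid shadowing the text_df DataFrame)
    translation_table = str.maketrans("", "", PUNCT_TO_REMOVE)
    return text.translate(translation_table)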

# now apply your function to the text column (a Series) using the apply function [https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html]
text_df["text_wo_punct"] = text_df["text"].apply(remove_punctuation)



assert text_df["text_wo_punct"].values[3] == "105836 livechat is online at the moment  httpstcosy94vtu8kq or contact 03331 031 031 option 1 4 3 leave a message to request a call back"
text_df.head()

Unnamed: 0,text,text_wo_punct
0,@applesupport causing the reply to be disregar...,applesupport causing the reply to be disregard...
1,@105835 your business means a lot to us. pleas...,105835 your business means a lot to us please ...
2,@76328 i really hope you all change but i'm su...,76328 i really hope you all change but im sure...
3,@105836 livechat is online at the moment - htt...,105836 livechat is online at the moment https...
4,@virgintrains see attached error message. i've...,virgintrains see attached error message ive tr...


## **Removal of stopwords**

Stopwords are commonly occurring words in a language, like 'the', 'a' and so on. They can be removed from the text most of the time, as they don't provide valuable information for downstream analysis. In cases like Part of Speech tagging, we should not remove them, as they provide very valuable information about the POS.

Stopword lists are already compiled for different languages and we can safely use them. For example, the stopword list for the English language from the nltk package can be seen below.


In [10]:
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))
", ".join(stopwords.words('english'))

"a, about, above, after, again, against, ain, all, am, an, and, any, are, aren, aren't, as, at, be, because, been, before, being, below, between, both, but, by, can, couldn, couldn't, d, did, didn, didn't, do, does, doesn, doesn't, doing, don, don't, down, during, each, few, for, from, further, had, hadn, hadn't, has, hasn, hasn't, have, haven, haven't, having, he, he'd, he'll, her, here, hers, herself, he's, him, himself, his, how, i, i'd, if, i'll, i'm, in, into, is, isn, isn't, it, it'd, it'll, it's, its, itself, i've, just, ll, m, ma, me, mightn, mightn't, more, most, mustn, mustn't, my, myself, needn, needn't, no, nor, not, now, o, of, off, on, once, only, or, other, our, ours, ourselves, out, over, own, re, s, same, shan, shan't, she, she'd, she'll, she's, should, shouldn, shouldn't, should've, so, some, such, t, than, that, that'll, the, their, theirs, them, themselves, then, there, these, they, they'd, they'll, they're, they've, this, those, through, to, too, under, until, up, 

Stopword lists for other languages can be obtained and used in the same way.
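A stopword list can also be adjusted to the task. The English list above contains `no` and `not`, which often matter for sentiment. The sketch below shows the idea with a small hand-written stand-in for the real list; `BASE_STOPWORDS`, `NEGATIONS` and `drop_stopwords` are hypothetical names used only here.

```python
# Small hand-written stand-in for a real stopword list (assumption for this sketch)
BASE_STOPWORDS = {"the", "a", "is", "to", "not", "no", "and"}
# Words we may want to keep when the task is sentiment analysis
NEGATIONS = {"not", "no"}
SENTIMENT_STOPWORDS = BASE_STOPWORDS - NEGATIONS

def drop_stopwords(text: str, stopwords: set) -> str:
    return " ".join(word for word in text.split() if word not in stopwords)

sample_sentence = "the update is not good"
with_base_list = drop_stopwords(sample_sentence, BASE_STOPWORDS)
with_sentiment_list = drop_stopwords(sample_sentence, SENTIMENT_STOPWORDS)
print(with_base_list)
print(with_sentiment_list)
```

With the unmodified list, the negation disappears and the remaining words suggest the opposite sentiment.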

In [11]:
sample = text_df.text.values[0]
sample

'@applesupport causing the reply to be disregarded and the tapped notification under the keyboard is opened😡😡😡'

In [12]:
split = sample.split(' ') # splits the input string into a list, the delimiter is a whitespace " "
split

['@applesupport',
 'causing',
 'the',
 'reply',
 'to',
 'be',
 'disregarded',
 'and',
 'the',
 'tapped',
 'notification',
 'under',
 'the',
 'keyboard',
 'is',
 'opened😡😡😡']

In [13]:
# Use list comprehension to remove the stopwords: [https://www.programiz.com/python-programming/list-comprehension]
# We want to achieve the same result as:

# filtered_words = []
# for word in split:
#   if word not in STOPWORDS:
#       filtered_words.append(word)
#
# Your code here
filtered_words = [word for word in split if word not in STOPWORDS]


assert filtered_words == ['@applesupport','causing','reply','disregarded','tapped','notification','keyboard','opened😡😡😡']
filtered_words

['@applesupport',
 'causing',
 'reply',
 'disregarded',
 'tapped',
 'notification',
 'keyboard',
 'opened😡😡😡']

In [14]:
filtered_string = " ".join(filtered_words)
print(f"Before filtering: {sample}\nAfter filtering: {filtered_string}")

Before filtering: @applesupport causing the reply to be disregarded and the tapped notification under the keyboard is opened😡😡😡
After filtering: @applesupport causing reply disregarded tapped notification keyboard opened😡😡😡


In [15]:
def remove_stopwords(text: str) -> str:
    """
    Removes stopwords from the input text

    Args:
        text (str): The input text from which stopwords will be removed

    Returns:
        str: A new string without stopwords
    """
    # Your code here:
    # remove stopwords with list comprehension and convert the list back to a string by concatenating the words
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in STOPWORDS]
    return " ".join(filtered_words)

text_df["text_wo_stop"] = text_df["text_wo_punct"].apply(remove_stopwords)
text_df.head()

Unnamed: 0,text,text_wo_punct,text_wo_stop
0,@applesupport causing the reply to be disregar...,applesupport causing the reply to be disregard...,applesupport causing reply disregarded tapped ...
1,@105835 your business means a lot to us. pleas...,105835 your business means a lot to us please ...,105835 business means lot us please dm name zi...
2,@76328 i really hope you all change but i'm su...,76328 i really hope you all change but im sure...,76328 really hope change im sure wont dont
3,@105836 livechat is online at the moment - htt...,105836 livechat is online at the moment https...,105836 livechat online moment httpstcosy94vtu8...
4,@virgintrains see attached error message. i've...,virgintrains see attached error message ive tr...,virgintrains see attached error message ive tr...


## **Removal of Frequent words**

In the previous preprocessing step, we removed stopwords based on language information. If we have a domain-specific corpus, we might also have some frequent words that are of little importance to us.

This step removes the most frequent words in the given corpus. Let us first get the most common words and then remove them in the next step.
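On a toy corpus, the whole idea fits in a few lines. This is a minimal sketch with made-up documents, independent of the exercise cells that follow:

```python
from collections import Counter

docs = ["please dm us", "please help us", "thanks please"]

# Count every whitespace-separated token across the corpus
counts = Counter(word for doc in docs for word in doc.split())
# Keep the two most frequent words as a set for fast membership tests
top_two = {word for word, _ in counts.most_common(2)}

cleaned = [" ".join(w for w in doc.split() if w not in top_two) for doc in docs]
print(top_two)
print(cleaned)
```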

In [16]:
from collections import Counter
cnt = Counter()

# Your code here:
# Use the Counter class to return the most frequent words
# 1: join all the text in the text_wo_stop column using the join() function
big_string = " ".join(text_df["text_wo_stop"])
# 2: tokenize the text on white space by using the split() function
tokens = big_string.split()
# 3: instantiate the Counter class with your tokenized array
cnt = Counter(tokens)
# 4: use the most_common method of the Counter instance to return the words sorted from most to least frequent
most_common = cnt.most_common()

assert set(most_common[:10]) == set([('us', 25), ('dm', 19),('help', 18),('thanks', 13),('httpstcogdrqu22ypt',12),('applesupport', 11),('please', 11),('phone', 9),('hi', 9),('ive', 8)])
most_common[:10]

[('us', 25),
 ('dm', 19),
 ('help', 18),
 ('thanks', 13),
 ('httpstcogdrqu22ypt', 12),
 ('applesupport', 11),
 ('please', 11),
 ('phone', 9),
 ('hi', 9),
 ('ive', 8)]

In [17]:
# create a set with the 10 most frequent words:
# hint: create a set comprehension equivalent to
# words = set()
# for word, count in most_common[:10]:
#   words.add(word)
#
# keep in mind that sets are more efficient in this scenario, being implemented as hash tables

FREQWORDS = {w for (w, word_count) in most_common[:10]}

def remove_freqwords(text: str, freq_words: set=FREQWORDS) -> str:
    """
    Removes a selection of frequent words from the input string

    Inputs:
        text (str): The input text from which frequent words will be removed
        freq_words (set): A set of frequent words to remove from the text

    Returns:
        str: A new string with all frequent words removed
    """
    # Your code here:
    # use your function to filter out the 10 most frequent words
    return " ".join([word for word in text.split() if word not in freq_words])


assert remove_freqwords(text_df.text_wo_stop.values[1]) == "105835 business means lot name zip code additional details concern rr httpstcoznuu1vjn9r"

# Your code here:
# Apply your function to the text_wo_stop column
text_df["text_wo_freq"] = text_df["text_wo_stop"].apply(remove_freqwords)


text_df.head()

Unnamed: 0,text,text_wo_punct,text_wo_stop,text_wo_freq
0,@applesupport causing the reply to be disregar...,applesupport causing the reply to be disregard...,applesupport causing reply disregarded tapped ...,causing reply disregarded tapped notification ...
1,@105835 your business means a lot to us. pleas...,105835 your business means a lot to us please ...,105835 business means lot us please dm name zi...,105835 business means lot name zip code additi...
2,@76328 i really hope you all change but i'm su...,76328 i really hope you all change but im sure...,76328 really hope change im sure wont dont,76328 really hope change im sure wont dont
3,@105836 livechat is online at the moment - htt...,105836 livechat is online at the moment https...,105836 livechat online moment httpstcosy94vtu8...,105836 livechat online moment httpstcosy94vtu8...
4,@virgintrains see attached error message. i've...,virgintrains see attached error message ive tr...,virgintrains see attached error message ive tr...,virgintrains see attached error message tried ...


## **Removal of Rare words**

This is very similar to the previous preprocessing step, except that this time we remove the rare words from the corpus.
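In a small corpus many words occur exactly once, so taking a fixed number of words from the tail of the frequency list picks only some of the tied words. A possible alternative, sketched below on made-up documents, is to remove every word below a count threshold. The name `MIN_COUNT` and its value are assumptions for this example.

```python
from collections import Counter

docs = ["phone slow after update", "phone battery slow", "update broke battery"]
counts = Counter(word for doc in docs for word in doc.split())

MIN_COUNT = 2  # hypothetical threshold: words seen fewer times are treated as rare
rare_words = {word for word, n in counts.items() if n < MIN_COUNT}

cleaned = [" ".join(w for w in doc.split() if w not in rare_words) for doc in docs]
print(rare_words)
print(cleaned)
```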

In [19]:
# let's keep only the latest version of the processed text
text_df = text_df[["text", "text_wo_freq"]]
text_df.head()

Unnamed: 0,text,text_wo_freq
0,@applesupport causing the reply to be disregar...,causing reply disregarded tapped notification ...
1,@105835 your business means a lot to us. pleas...,105835 business means lot name zip code additi...
2,@76328 i really hope you all change but i'm su...,76328 really hope change im sure wont dont
3,@105836 livechat is online at the moment - htt...,105836 livechat online moment httpstcosy94vtu8...
4,@virgintrains see attached error message. i've...,virgintrains see attached error message tried ...


In [20]:
# Your code here:
# similarly to FREQWORDS, extract the 10 rarest words
RAREWORDS = {w for (w, word_count) in most_common[-10:]}


assert RAREWORDS == set(['browser', 'green', 'httpstco9281okeebk', 'including', 'keen', 'lee', 'line', 'log','slowdown','thin'])
RAREWORDS

{'browser',
 'green',
 'httpstco9281okeebk',
 'including',
 'keen',
 'lee',
 'line',
 'log',
 'slowdown',
 'thin'}

In [21]:
def remove_rarewords(text: str, rare_words: set=RAREWORDS) -> str:
    """
    Removes a selection of rare words from the input string

    Inputs:
        text (str): The input text from which rare words will be removed
        rare_words (set): A set of rare words to remove from the text

    Returns:
        str: A new string with all rare words removed
    """
    # Your code here:
    # Filter out the rare words from a text string
    return " ".join([word for word in text.split() if word not in rare_words])


text_df["text_wo_stopfreqrare"] = text_df["text_wo_freq"].apply(remove_rarewords)
text_df.head()

Unnamed: 0,text,text_wo_freq,text_wo_stopfreqrare
0,@applesupport causing the reply to be disregar...,causing reply disregarded tapped notification ...,causing reply disregarded tapped notification ...
1,@105835 your business means a lot to us. pleas...,105835 business means lot name zip code additi...,105835 business means lot name zip code additi...
2,@76328 i really hope you all change but i'm su...,76328 really hope change im sure wont dont,76328 really hope change im sure wont dont
3,@105836 livechat is online at the moment - htt...,105836 livechat online moment httpstcosy94vtu8...,105836 livechat online moment httpstcosy94vtu8...
4,@virgintrains see attached error message. i've...,virgintrains see attached error message tried ...,virgintrains see attached error message tried ...


We can combine all the word lists (stopwords, frequent words and rare words) into a single set and remove them in one pass.

In [22]:
# Your code here:
# group all the words to remove in a single set (TO_REMOVE)
TO_REMOVE = STOPWORDS.union(FREQWORDS).union(RAREWORDS)

TO_REMOVE

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'applesupport',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'browser',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'dm',
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'green',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 "he's",
 'help',
 'her',
 'here',
 'hers',
 'herself',
 'hi',
 'him',
 'himself',
 'his',
 'how',
 'httpstco9281okeebk',
 'httpstcogdrqu22ypt',
 'i',
 "i'd",
 "i'll",
 "i'm",
 "i've",
 'if',
 'in',
 'including',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 'ive',
 'just',
 'keen',
 'lee',
 'line',
 'll',
 'log',
 'm',
 'ma',
 'me',
 'mightn',
 

In [23]:
def filter_text(text: str, words_to_remove: set = TO_REMOVE) -> str:
    """
    Removes a selection of words from the input string

    Inputs:
        text (str): The input text from which words will be removed
        words_to_remove (set): A set of words to remove from the text

    Returns:
        str: A new string with all listed words removed
    """ 
    # Your code here
    return " ".join([word for word in text.split() if word not in words_to_remove])


text_df["filtered_text"] =  text_df.text.apply(filter_text)
text_df = text_df[['text', 'filtered_text']]
text_df.head()

Unnamed: 0,text,filtered_text
0,@applesupport causing the reply to be disregar...,@applesupport causing reply disregarded tapped...
1,@105835 your business means a lot to us. pleas...,"@105835 business means lot us. name, zip code ..."
2,@76328 i really hope you all change but i'm su...,@76328 really hope change sure won't! to!
3,@105836 livechat is online at the moment - htt...,@105836 livechat online moment - https://t.co/...
4,@virgintrains see attached error message. i've...,@virgintrains see attached error message. trie...
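
Note that in row 1 above, `us.` survives even though `us` is in `TO_REMOVE`. `filter_text` is applied to the `text` column, which still contains punctuation, so `us.` and `us` are different tokens. The sketch below reproduces the effect with a made-up removal set; `filter_tokens` is a hypothetical helper equivalent in spirit to `filter_text`.

```python
import string

# Made-up subset of words to remove, for illustration only
words_to_remove = {"us", "please", "dm", "your", "a", "to"}
text = "your business means a lot to us. please dm your name"

def filter_tokens(text: str, words: set) -> str:
    return " ".join(word for word in text.split() if word not in words)

# Filtering the raw text: "us." does not match "us"
without_strip = filter_tokens(text, words_to_remove)

# Stripping punctuation first lets the same set match
stripped = text.translate(str.maketrans("", "", string.punctuation))
with_strip = filter_tokens(stripped, words_to_remove)

print(without_strip)
print(with_strip)
```

Whether to strip punctuation before filtering depends on the later steps; the cells below keep the punctuation.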


## **Stemming**

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form (From [Wikipedia](https://en.wikipedia.org/wiki/Stemming)).

This process is useful to **`reduce the vocabulary size`** by converting similar words to their root form.

For example, if the corpus contains the two words `walks` and `walking`, stemming strips the suffixes and turns both into `walk`. In another example, with the two words `console` and `consoling`, the stemmer removes the suffixes and produces `consol`, which is not a proper English word.

Several stemming algorithms are available. One of the best known is the Porter stemmer, which is widely used and available in the NLTK package.
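To make suffix stripping concrete, here is a deliberately naive sketch. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem. The suffix list and the function `naive_stem` are made up for illustration.

```python
# Suffixes are checked in this order; the first match wins
SUFFIXES = ("ing", "ed", "es", "s", "e")

def naive_stem(word: str) -> str:
    """Strip one common suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {w: naive_stem(w) for w in ["walks", "walking", "console", "consoling"]}
print(stems)
```

Even this toy version shows both effects described above: related forms collapse to one stem, and the stem is not always a real word.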

In [24]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_demonstration(word: str) -> None:
    """
    Compares a word before and after stemming
    """
    print(f"Before stemming: {word}, \tafter stemming: {stemmer.stem(word)}")

for word in ['programs', 'programming', 'programmed', 'walks', 'walked', 'walking', 'UPPERCASE', 'mistake', 'mistakke']:
    stem_demonstration(word)

Before stemming: programs, 	after stemming: program
Before stemming: programming, 	after stemming: program
Before stemming: programmed, 	after stemming: program
Before stemming: walks, 	after stemming: walk
Before stemming: walked, 	after stemming: walk
Before stemming: walking, 	after stemming: walk
Before stemming: UPPERCASE, 	after stemming: uppercas
Before stemming: mistake, 	after stemming: mistak
Before stemming: mistakke, 	after stemming: mistakk


In [31]:
def stem_words(text: str) -> str:
    """
    Applies the stemmer to an input string

    Args:
        text (str): The text to be stemmed

    Returns:
        str: A new string where every word has been stemmed
    """
    # Your code here:
    return " ".join([stemmer.stem(word.lower()) for word in text.split()])

print(repr(text_df["filtered_text"].values[7]))
assert stem_words(text_df["filtered_text"].values[7]) == "@105836 work ok here, miriam. link https://t.co/0m2mph15eh ? ^mm" # note that the stemmer also applied lowercase

text_df["text_stemmed"] = text_df["filtered_text"].apply(stem_words)
text_df.head(10)

'@105836 working ok here, miriam. link https://t.co/0m2mph15eh ? ^mm'


Unnamed: 0,text,filtered_text,text_stemmed
0,@applesupport causing the reply to be disregar...,@applesupport causing reply disregarded tapped...,@applesupport caus repli disregard tap notif k...
1,@105835 your business means a lot to us. pleas...,"@105835 business means lot us. name, zip code ...","@105835 busi mean lot us. name, zip code addit..."
2,@76328 i really hope you all change but i'm su...,@76328 really hope change sure won't! to!,@76328 realli hope chang sure won't! to!
3,@105836 livechat is online at the moment - htt...,@105836 livechat online moment - https://t.co/...,@105836 livechat onlin moment - https://t.co/s...
4,@virgintrains see attached error message. i've...,@virgintrains see attached error message. trie...,@virgintrain see attach error message. tri lea...
5,"@105836 have you tried from another device, mi...","@105836 tried another device, miriam ^mm","@105836 tri anoth device, miriam ^mm"
6,"@virgintrains yep, i've tried laptop too sever...","@virgintrains yep, tried laptop several times ...","@virgintrain yep, tri laptop sever time past w..."
7,"@105836 it's working ok from here, miriam. doe...","@105836 working ok here, miriam. link https://...","@105836 work ok here, miriam. link https://t.c..."
8,@virgintrains i still haven't heard &amp; the ...,@virgintrains still heard &amp; number directe...,@virgintrain still heard &amp; number direct d...
9,@105836 that's what we're here for miriam 😊 t...,@105836 that's miriam 😊 team send email shortl...,@105836 that' miriam 😊 team send email shortli...


In [27]:
all_text_no_stemming = ' '.join(text_df["text"]).split()
all_text_w_stemming = ' '.join(text_df["text_stemmed"]).split()

n_words_no_stemming = len(set(all_text_no_stemming))
n_words_w_stemming = len(set(all_text_w_stemming))
vocabulary_size_diff = n_words_no_stemming - n_words_w_stemming

assert vocabulary_size_diff == 156

print(f"Number of unique words without stemming: {n_words_no_stemming}")
print(f"Number of unique words with stemming: {n_words_w_stemming}")
print(f"Difference: {vocabulary_size_diff} words")

Number of unique words without stemming: 813
Number of unique words with stemming: 657
Difference: 156 words


We can see that words like `private` and `propose` have their final `e` chopped off by stemming. This is not intended. What can we do about it? We can use lemmatization in such cases.

The Porter stemmer is for English only. If we are working with other languages, we can use the Snowball stemmer. The languages supported by the Snowball stemmer are:

In [28]:
from nltk.stem.snowball import SnowballStemmer
SnowballStemmer.languages

('arabic',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'norwegian',
 'porter',
 'portuguese',
 'romanian',
 'russian',
 'spanish',
 'swedish')

## **Lemmatization**

Lemmatization is similar to stemming in that it reduces inflected words to a base form, but it differs in that it makes sure the resulting root word (called the lemma) belongs to the language.
Here's a list of [examples](https://github.com/michmech/lemmatization-lists/blob/master/lemmatization-en.txt).

As a result, lemmatization is generally slower than stemming. So depending on the speed requirement, we can choose to use either stemming or lemmatization.

Let us use the `WordNetLemmatizer` in nltk to lemmatize our sentences.

In [29]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\douae\AppData\Roaming\nltk_data...


True

In [30]:
lemmatizer = WordNetLemmatizer()

def lem_demonstration(word: str, pos: str='n') -> None:
    print(f"Before lemmatization: {word}\tafter lemmatization: {lemmatizer.lemmatize(word, pos)}")

for word, pos in [('feet', 'n'), ('caring', 'v')]:
    lem_demonstration(word, pos)

Before lemmatization: feet	after lemmatization: foot
Before lemmatization: caring	after lemmatization: care


In [32]:
def lemmatize_words(text: str) -> str:
    """
    Applies lemmatization to the input string

    Args:
        text (str): The input text to lemmatize

    Returns:
        str: The lemmatized version of the text
    """
    # Your code here
    return " ".join([lemmatizer.lemmatize(word.lower()) for word in text.split()])

text_df["text_lemmatized"] = text_df["filtered_text"].apply(lemmatize_words)
text_df.head()

Unnamed: 0,text,filtered_text,text_stemmed,text_lemmatized
0,@applesupport causing the reply to be disregar...,@applesupport causing reply disregarded tapped...,@applesupport caus repli disregard tap notif k...,@applesupport causing reply disregarded tapped...
1,@105835 your business means a lot to us. pleas...,"@105835 business means lot us. name, zip code ...","@105835 busi mean lot us. name, zip code addit...","@105835 business mean lot us. name, zip code a..."
2,@76328 i really hope you all change but i'm su...,@76328 really hope change sure won't! to!,@76328 realli hope chang sure won't! to!,@76328 really hope change sure won't! to!
3,@105836 livechat is online at the moment - htt...,@105836 livechat online moment - https://t.co/...,@105836 livechat onlin moment - https://t.co/s...,@105836 livechat online moment - https://t.co/...
4,@virgintrains see attached error message. i've...,@virgintrains see attached error message. trie...,@virgintrain see attach error message. tri lea...,@virgintrains see attached error message. trie...


We can see that, unlike stemming, lemmatization retains the trailing `e` in `propose` and `private`.

Wait. There is one more thing in lemmatization. Let us try to lemmatize `running` now.

In [33]:
lemmatizer.lemmatize("running")

'running'

Wow. It returned `running` unchanged instead of converting it to the root form `run`. This is because lemmatization depends on the POS tag to come up with the correct lemma, and `lemmatize` treats the word as a noun when no POS is given. Now let us lemmatize again, this time providing the POS tag for the word.

See this [table](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) for examples of POS tags.

In [34]:
lemmatizer.lemmatize("running", "v") # v for verb

'run'

Now we get the root form `run`. So with the nltk lemmatizer we also need to provide the POS tag along with the word. Depending on the POS, the lemmatizer may return different results.

Let us take the example `stripes` and check the lemma when it is tagged as a verb and as a noun.

In [35]:
print("Word is: stripes")
print("Lemma result for verb: ",lemmatizer.lemmatize("stripes", 'v'))
print("Lemma result for noun: ",lemmatizer.lemmatize("stripes", 'n'))

Word is: stripes
Lemma result for verb:  strip
Lemma result for noun:  stripe
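
The dependence on the POS tag can be pictured as a lookup keyed by `(word, pos)`. The table below is a hand-written toy that contains only the examples from this section. WordNet itself is far larger and also applies morphological rules, so this is a mental model rather than how nltk is implemented.

```python
# Toy lemma table keyed by (word, pos); entries copied from the examples in this section
LEMMA_TABLE = {
    ("feet", "n"): "foot",
    ("caring", "v"): "care",
    ("running", "v"): "run",
    ("stripes", "v"): "strip",
    ("stripes", "n"): "stripe",
}

def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the lemma for (word, pos), or the word unchanged if the pair is unknown."""
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatize("running"))       # no entry for the noun reading
print(toy_lemmatize("running", "v"))
print(toy_lemmatize("stripes", "v"), toy_lemmatize("stripes", "n"))
```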


Now let us redo the lemmatization process for our dataset.

In [38]:
from nltk.corpus import wordnet
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\douae\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\douae\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\douae\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [40]:
lemmatizer = WordNetLemmatizer()
def lemmatize_words(text: str) -> str:
    """
    Apply lemmatization to the input string, considering words' POS tags.

    This function lemmatizes words in the input string based on their POS (Part-of-Speech) tags.

    Args:
        text (str): The input text to be lemmatized.

    Returns:
        str: A new string with lemmatized words.
    """

    # Initialize a mapping of POS tags to WordNet tags
    wordnet_map = {
        'N': wordnet.NOUN,
        'V': wordnet.VERB,
        'R': wordnet.ADV,
        'J': wordnet.ADJ
    }

    # Your code here:
    # Use the nltk.pos_tag function to get the POS tags of every word in the input (https://www.nltk.org/api/nltk.tag.pos_tag.html)
    # You may also use nltk.word_tokenize to tokenize the text (https://www.nltk.org/api/nltk.tokenize.html)
    tokens = nltk.word_tokenize(text)
    pos_tagged_text = nltk.pos_tag(tokens)


    # Your code here:
    # Return the lemmatized version of the text
    # Inside the lemmatize function, use the (word, POS tag) tuple collected in the pos_tagged_text list
    # hint: query the wordnet_map with wordnet_map.get(key, default) using wordnet.NOUN as default
    lemmatized_words = [
        lemmatizer.lemmatize(word, wordnet_map.get(tag[0], wordnet.NOUN))
        for word, tag in pos_tagged_text
    ]
    return " ".join(lemmatized_words)



assert lemmatize_words("Any man who must say 'I am the king' is no true king.") == "Any man who must say ' I be the king ' be no true king ."
text_df["text_lemmatized"] = text_df["filtered_text"].apply(lemmatize_words)

text_df.head()

Unnamed: 0,text,filtered_text,text_stemmed,text_lemmatized
0,@applesupport causing the reply to be disregar...,@applesupport causing reply disregarded tapped...,@applesupport caus repli disregard tap notif k...,@ applesupport cause reply disregard tapped no...
1,@105835 your business means a lot to us. pleas...,"@105835 business means lot us. name, zip code ...","@105835 busi mean lot us. name, zip code addit...","@ 105835 business mean lot u . name , zip code..."
2,@76328 i really hope you all change but i'm su...,@76328 really hope change sure won't! to!,@76328 realli hope chang sure won't! to!,@ 76328 really hope change sure wo n't ! to !
3,@105836 livechat is online at the moment - htt...,@105836 livechat online moment - https://t.co/...,@105836 livechat onlin moment - https://t.co/s...,@ 105836 livechat online moment - http : //t.c...
4,@virgintrains see attached error message. i've...,@virgintrains see attached error message. trie...,@virgintrain see attach error message. tri lea...,@ virgintrains see attach error message . trie...
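
The lookup `wordnet_map.get(tag[0], wordnet.NOUN)` works because the first letter of a Penn Treebank tag identifies the broad word class. Here is a minimal sketch of that mapping that runs without NLTK. It assumes the WordNet constants hold the one-letter codes `'n'`, `'v'`, `'r'` and `'a'`, and the helper name `penn_to_wordnet` is hypothetical.

```python
# Stand-ins for wordnet.NOUN, wordnet.VERB, wordnet.ADV and wordnet.ADJ,
# assumed here to be the one-letter codes 'n', 'v', 'r' and 'a'.
NOUN, VERB, ADV, ADJ = "n", "v", "r", "a"

WORDNET_MAP = {"N": NOUN, "V": VERB, "R": ADV, "J": ADJ}


def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return WORDNET_MAP.get(tag[:1], NOUN)


# Tags whose first letter is not N, V, R or J (e.g. DT) fall back to noun.
for tag in ["NNS", "VBZ", "RB", "JJ", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Falling back to noun mirrors the default used in `lemmatize_words`, which is why tokens such as "us" can be lemmatized as if they were plural nouns.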


In [41]:
all_text_no_lemm = ' '.join(text_df["text"]).split()
all_text_w_lemm = ' '.join(text_df["text_lemmatized"]).split()

n_words_no_lemm = len(set(all_text_no_lemm))
n_words_w_lemm = len(set(all_text_w_lemm))
vocabulary_size_diff = n_words_no_lemm - n_words_w_lemm

assert vocabulary_size_diff == 221

print(f"Number of unique words without lemmatization: {n_words_no_lemm}")
print(f"Number of unique words with lemmatization: {n_words_w_lemm}")
print(f"Difference: {vocabulary_size_diff} words out of {df.shape[0]} samples")

Number of unique words without lemmatization: 813
Number of unique words with lemmatization: 592
Difference: 221 words out of 93 samples


## **Removal of URLs**

The next preprocessing step is to remove any URLs present in the data. For example, in a Twitter analysis there is a good chance that tweets contain URLs, which we will probably want to remove before further analysis.

The code snippet below does that.

`Regex breakdown:`
```Python
r'https?://\S+|www\.\S+'
# could also be understood as
(r'https?://\S+') or (r'www\.\S+')
```
* `r` in front of a string tells Python to treat it as a raw literal, so `\` is not interpreted as an escape character.
* `https?://`: matches URLs that start with either "http://" or "https://". The `s?` portion makes the "s" optional.
* `\S+`: matches one or more non-whitespace characters, i.e. the rest of the URL up to the next whitespace (e.g. www.example.com/page).
* `|`: the alternation operator, which acts like an OR. It matches either the pattern on its left or the pattern on its right: here, URLs starting with "http://" or "https://", or URLs starting with "www.".
* `www\.\S+`: matches URLs that start with "www." followed by one or more non-whitespace characters, such as "www.example.com". The `\.` matches a literal dot.

In summary, this regular expression is designed to identify and capture URLs in a text string, whether they start with `"http://"`, `"https://"`, or `"www."`. It's a common pattern for extracting or hyperlinking URLs in text processing tasks.
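
To see the two alternatives of the pattern at work, here is a small sketch that lists every match in a made-up sentence (the sample text and the name `URL_PATTERN` are assumptions for illustration). It also shows a limitation: since `\S+` stops only at whitespace, punctuation glued to the end of a URL is captured as well.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

sample = ("Docs at https://example.com/a?b=1, mirror at http://example.org "
          "and www.example.net today")

# findall returns every non-overlapping match, scanning left to right
matches = URL_PATTERN.findall(sample)
print(matches)
# ['https://example.com/a?b=1,', 'http://example.org', 'www.example.net']
```

Note the trailing comma in the first match: it is removed together with the URL, which is usually harmless for this kind of cleaning.
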

In [42]:
def remove_urls(text: str) -> str:
    """
    Remove URLs (web links) from the input text.

    This function searches for URLs in the input text and removes them, leaving the
    text without any web links. The whitespace around a removed URL is kept.

    Args:
        text (str): The input text from which URLs will be removed.

    Returns:
        str: A new string with URLs removed.

    Example:
        >>> remove_urls("Visit our website at https://www.example.com to learn more.")
        'Visit our website at  to learn more.'
    """
    # Your code here
    # Regex pattern for URLs
    url_pattern = r'https?://\S+|www\.\S+'

    # Replace all URLs with an empty string
    return re.sub(url_pattern, '', text)

text = "Check this out: https://example.com and also www.test.com"
remove_urls(text)


'Check this out:  and also '

Let us take an `https` link and check the code.

In [43]:
text = "Driverless AI NLP blog post on https://www.h2o.ai/blog/detecting-sarcasm-is-difficult-but-ai-may-have-an-answer/"
remove_urls(text)

'Driverless AI NLP blog post on '

Now let us take an `http` URL and check the code.

In [44]:
text = "Please refer to link http://lnkd.in/ecnt5yC for the paper"
remove_urls(text)

'Please refer to link  for the paper'

Thanks to Pranjal for pointing out the edge cases. Suppose there is no `http` or `https` in the URL: the function captures that case as well.

In [45]:
text = "Want to know more. Checkout www.h2o.ai for additional information"
remove_urls(text)

'Want to know more. Checkout  for additional information'
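
The outputs above keep the spaces that surrounded each removed URL, so double or trailing spaces remain. If that matters downstream, one possible follow-up is to collapse whitespace after the removal. This is a sketch, and `remove_urls_and_tidy` is a hypothetical helper, not part of the original tutorial.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')


def remove_urls_and_tidy(text: str) -> str:
    """Remove URLs, then collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub('', text)
    # Replace any run of whitespace with one space and trim both ends
    return re.sub(r'\s+', ' ', without_urls).strip()


print(remove_urls_and_tidy("Please refer to link http://lnkd.in/ecnt5yC for the paper"))
# Please refer to link for the paper
```

Collapsing whitespace also flattens newlines, so this variant only fits pipelines where line structure is not needed.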

## **Removal of HTML Tags**

Another common preprocessing technique that comes in handy in many places is the removal of HTML tags. This is especially useful when we scrape data from different websites, since we might end up with HTML strings as part of our text.

First, let us try to remove the HTML tags using regular expressions.

`Regex breakdown:`
```Python
'<.*?>'
```
* `<` and `>`: match the opening and closing angle brackets of an HTML tag, e.g. \<div>
* `.*?`: `.` matches any character (except a newline, by default) and `*?` is the `non-greedy` or `lazy` quantifier: it repeats zero or more times, but as few times as possible. For HTML tags, this means the match stops at the first `>` that follows the opening `<`.

So the whole regular expression `'<.*?>'` matches each HTML tag individually, from a `<` to the nearest following `>`. This lets us remove the tags while keeping the text between them.
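
The difference between the lazy `.*?` and the greedy `.*` is easiest to see side by side. The sketch below applies both patterns to the same short HTML string, taken from the docstring example used in this section.

```python
import re

html = "<p>This is <b>bold</b> text.</p>"

# Greedy: '.*' runs from the first '<' to the LAST '>', swallowing everything
greedy = re.sub(r'<.*>', '', html)

# Lazy: '.*?' stops at the first '>' it can reach, so each tag matches on its own
lazy = re.sub(r'<.*?>', '', html)

print(repr(greedy))  # ''
print(repr(lazy))    # 'This is bold text.'
```
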

In [None]:
def remove_html(text: str) -> str:
    """
    Remove HTML tags from the input text.

    This function searches for HTML tags within the input text and removes them,
    leaving only the plain text content.

    Args:
        text (str): The input text containing HTML tags to be removed.

    Returns:
        str: A new string with HTML tags removed.

    Example:
        >>> remove_html("<p>This is <b>bold</b> text.</p>")
        'This is bold text.'
    """
    html_pattern = r'<.*?>'
    return re.sub(html_pattern, '', text)

text = """<div>
<h1> H2O</h1>
<SomeComponent/>
<p> AutoML</p>
<a href="https://www.h2o.ai/products/h2o-driverless-ai/"> Driverless AI</a>
</div>"""

print(remove_html(text))


 H2O

 AutoML
 Driverless AI



We can also use the `BeautifulSoup` package to extract the text from an HTML document in a more elegant way.

In [47]:
from bs4 import BeautifulSoup

def remove_html(text: str) -> str:
    """
    Remove HTML tags from the input text using BeautifulSoup.

    This function utilizes the BeautifulSoup library to parse the input text as HTML
    and then extracts and returns the plain text content, removing all HTML tags.
    Text fragments are joined with a single space, so extra spaces may appear
    where a tag used to be.

    Args:
        text (str): The input text containing HTML tags to be removed.

    Returns:
        str: A new string with HTML tags removed.

    Example:
        >>> remove_html("<p>This is <b>bold</b> text.</p>")
        'This is  bold  text.'
    """
    # Your code here
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text(separator=" ").strip()

text = """<div>
<h1> H2O</h1>
<p> AutoML</p>
<a href="https://www.h2o.ai/products/h2o-driverless-ai/"> Driverless AI</a>
</div>
"""

print(remove_html(text))

H2O 
  AutoML 
  Driverless AI


**In your opinion, which tool is preferable, regex or a parser, and why? Write your thoughts.**

*Your answer here*

A parser is safer and more reliable, especially on real-world web data, because it copes with nested tags, attributes and malformed markup that a simple pattern can miss.
Regex is better suited to simple, predictable HTML.
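
A few concrete cases illustrate where the `'<.*?>'` pattern falls short. The sketch below uses only the standard library, and the sample strings are made up for illustration. It shows a tag split across lines, plain text that happens to contain angle brackets, and an HTML entity that a tag-stripping regex leaves untouched.

```python
import re
from html import unescape

TAG_LAZY = re.compile(r'<.*?>')
TAG_DOTALL = re.compile(r'<.*?>', re.DOTALL)

# 1) A tag broken over two lines: '.' does not match '\n' by default,
#    so the opening tag is not removed unless re.DOTALL is set.
multiline = '<a\nhref="https://example.com">link</a>'
print(repr(TAG_LAZY.sub('', multiline)))    # '<a\nhref="https://example.com">link'
print(repr(TAG_DOTALL.sub('', multiline)))  # 'link'

# 2) Plain text with '<' and '>' is mistaken for a tag and deleted.
plain = "We know 1 < 2 and 3 > 2."
print(repr(TAG_LAZY.sub('', plain)))        # 'We know 1  2.'

# 3) Entities are not tags, so they must be decoded separately.
print(unescape("Fish &amp; chips"))         # Fish & chips
```
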

## **SpaCy**
Like NLTK, SpaCy is a popular NLP library with many features and pretrained models. It is highly optimized and user-friendly.

It is sometimes less flexible than NLTK (which provides a wide range of algorithms). Its performance is comparable, or slightly better in tokenization and POS tagging, although it underperforms in sentence tokenization.

While NLTK can be seen as a large toolbox for NLP, SpaCy offers more ready-to-use solutions for production. Among other functionalities, it provides:
- Tokenization
- Part-of-speech (POS) tagging
- Dependency Parsing
- Lemmatization
- Named Entity Recognition
- Sentence Boundary Detection
- Text Classification
- etc

It is worth taking a look at it!

Let's import the library and download a pretrained model for the English language.

In [48]:
import spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


SpaCy's Language model (named "nlp" in this code) contains all the components and data needed for the analysis. Calling it on a text string returns a Doc object.

In [49]:
nlp = spacy.load("en_core_web_sm") # pretrained model

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print('{0}\t{1}\t{2}'.format(token.text, token.pos_, token.dep_))

Apple	PROPN	nsubj
is	AUX	aux
looking	VERB	ROOT
at	ADP	prep
buying	VERB	pcomp
U.K.	PROPN	nsubj
startup	VERB	ccomp
for	ADP	prep
$	SYM	quantmod
1	NUM	compound
billion	NUM	pobj


As we have seen, the first piece of information we get is the tokenization of the sequence, i.e. the list of its tokens.

In [50]:
# Your code here:
# tokenize the sentence collecting the tokens with a list comprehension
# print the list of tokens

sentence = "Apple is looking at buying U.K. startup for $1 billion"

doc = nlp(sentence)

tokens = [token.text for token in doc]

print(tokens)


['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', 'billion']


Tokens carry many attributes, such as the lemma, the POS tag, the syntactic dependency, the word shape (e.g. capitalization, punctuation, digits), whether the token is alphabetic and whether it is a stop word.

In [51]:
for token in doc:
    # Named `row` rather than `string` to avoid shadowing the imported `string` module
    row = '\t\t'.join([token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, str(token.is_alpha), str(token.is_stop)])
    print(row)

Apple		Apple		PROPN		NNP		nsubj		Xxxxx		True		False
is		be		AUX		VBZ		aux		xx		True		True
looking		look		VERB		VBG		ROOT		xxxx		True		False
at		at		ADP		IN		prep		xx		True		True
buying		buy		VERB		VBG		pcomp		xxxx		True		False
U.K.		U.K.		PROPN		NNP		nsubj		X.X.		False		False
startup		startup		VERB		VBD		ccomp		xxxx		True		False
for		for		ADP		IN		prep		xxx		True		True
$		$		SYM		$		quantmod		$		False		False
1		1		NUM		CD		compound		d		False		False
billion		billion		NUM		CD		pobj		xxxx		True		False
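
The `shape_` column can look cryptic at first. The following sketch approximates how such a shape string could be derived in plain Python: uppercase letters become `X`, lowercase letters `x`, digits `d`, other characters are kept, and runs of the same symbol are capped at four. It is an illustration consistent with the table above, not SpaCy's own implementation, and `word_shape` is a hypothetical name.

```python
def word_shape(text: str, max_run: int = 4) -> str:
    """Approximate a SpaCy-like word shape for a single token."""
    shape = []
    last, run = "", 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as they are
        # Count how long the current run of identical shape symbols is
        run = run + 1 if shape_char == last else 1
        last = shape_char
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Apple", "looking", "U.K.", "$", "1"]:
    print(word, "->", word_shape(word))
```

Capping the runs is what makes "looking" and "billion" share the shape `xxxx`, so the feature describes the form of a word rather than its exact length.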


In [52]:
# Your code here:
# remove stopwords using the information given by SpaCy
# return the result as a list of tokens
filtered = [token.text for token in doc if not token.is_stop]


print(filtered)
assert filtered == ['Apple', 'looking', 'buying', 'U.K.', 'startup', '$', '1', 'billion']

['Apple', 'looking', 'buying', 'U.K.', 'startup', '$', '1', 'billion']


SpaCy also provides a Named Entity Recognition (NER) engine that recognizes categories such as person, country, organization, etc.

Instead of iterating over the tokens, in this case we rely on the entities (`ents`) of the document.

In [53]:
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
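
The two numbers printed for each entity are character offsets into the original string (`start_char` inclusive, `end_char` exclusive). The sketch below reuses the offsets from the output above, without loading SpaCy, to slice the sentence and to replace each entity with its label. The masking step is an illustrative use, not part of the original notebook.

```python
sentence = "Apple is looking at buying U.K. startup for $1 billion"

# (start_char, end_char, label) triples copied from the output above
spans = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

# Slicing with the offsets recovers the entity text
for start, end, label in spans:
    print(f"{sentence[start:end]!r} -> {label}")

# Replace entities from right to left so that earlier offsets stay valid
masked = sentence
for start, end, label in sorted(spans, reverse=True):
    masked = masked[:start] + f"<{label}>" + masked[end:]
print(masked)
# <ORG> is looking at buying <GPE> startup for <MONEY>
```
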


To conclude, let's visualize the dependency tree instead of printing the result on the terminal.

In [54]:
from spacy import displacy
displacy.serve(doc, style="dep", auto_select_port=True)

# remember to stop the execution after running this cell: it starts a server that keeps running on your PC




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


## **Subword Tokenization**

What we have seen so far is a traditional preprocessing pipeline, which is still relevant for some tasks. Nowadays, however, the standard approach in NLP relies on huge neural models that perform feature extraction by themselves, without human intervention. In this context the model needs all the information: punctuation, emojis, stopwords and even letter case.

The only preprocessing steps required are data cleaning and sanitization, as well as tokenization. Pretrained models usually come with their own pretrained tokenizer, and it can be a good idea to pre-tokenize the input text once at the beginning and then work with sequences of tokens instead of strings.

Let's therefore use a modern subword tokenizer.

In [56]:
!pip install transformers


Collecting transformers
  Downloading transformers-4.56.1-py3-none-any.whl.metadata (42 kB)
Collecting huggingface-hub<1.0,>=0.34.0 (from transformers)
  Downloading huggingface_hub-0.34.4-py3-none-any.whl.metadata (14 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Downloading tokenizers-0.22.0-cp39-abi3-win_amd64.whl.metadata (6.9 kB)
Collecting safetensors>=0.4.3 (from transformers)
  Downloading safetensors-0.6.2-cp38-abi3-win_amd64.whl.metadata (4.1 kB)
Downloading transformers-4.56.1-py3-none-any.whl (11.6 MB)
Downloading huggingface_hub-0.34.4-py3-none-any.whl (561 kB)
Downloading tokenizers

In [1]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") # import the pre-trained tokenizer from BERT, a model we will see in the last lectures

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


We are now using the Hugging Face 🤗 library, and in particular its tokenizers. With the tokenize() method we can directly see the effect of this tokenization.

In [2]:
tokenizer.tokenize("BERT is an encoder-only transformer architecture!")

['bert',
 'is',
 'an',
 'en',
 '##code',
 '##r',
 '-',
 'only',
 'transform',
 '##er',
 'architecture',
 '!']

Since it is a subword tokenizer, words that do not appear in the (learned) vocabulary are split into subtokens; continuation pieces are marked with '##' at the beginning.

In [3]:
tokenizer.tokenize("transformer")

['transform', '##er']

In [4]:
tokenizer.tokenize("testing the tokenizer")

['testing', 'the', 'token', '##izer']
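
The splits shown above can be reproduced with a greedy longest-match-first search over the vocabulary, which is the general idea behind WordPiece-style tokenization of a single word. The sketch below is a simplified illustration: the vocabulary `TOY_VOCAB` is a tiny made-up set containing only the pieces seen in the outputs, and the real tokenizer also performs normalization and punctuation splitting that are omitted here.

```python
def wordpiece(word: str, vocab: set, unk: str = "[UNK]") -> list:
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark continuation pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, NOT the real BERT vocabulary
TOY_VOCAB = {"testing", "the", "token", "transform", "##er", "##izer"}

print(wordpiece("transformer", TOY_VOCAB))  # ['transform', '##er']
print(wordpiece("tokenizer", TOY_VOCAB))    # ['token', '##izer']
print(wordpiece("xyz", TOY_VOCAB))          # ['[UNK]']
```
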

Keep in mind, though, that the goal here is to prepare the data for analysis with neural models; subtokens per se are not really useful for humans.
Neural networks can only process numbers, not strings. As we have seen during the lecture, the tokenizer provides a numerical representation by converting tokens to indices (the index of each token in the vocabulary).

In [5]:
tokenizer.encode('testing the tokenizer')

[101, 5604, 1996, 19204, 17629, 102]

You might have noticed that, compared to the tokenize() method, we now have 6 tokens instead of 4. Converting the ids back to a string helps us understand what is going on.

In [6]:
input_ids = tokenizer.encode('testing the tokenizer')
tokenizer.decode(input_ids)

'[CLS] testing the tokenizer [SEP]'

BERT's tokenizer automatically added two special tokens at the beginning and end of the sequence, [CLS] and [SEP] respectively, or 101 and 102 as numerical values.

These tokens are used to delimit sentences and to represent the entire sequence for tasks such as classification, as we will see in more detail later on.

convert_ids_to_tokens and its counterpart convert_tokens_to_ids convert between tokens and ids in both directions.

In [7]:
print(tokenizer.convert_ids_to_tokens(101))
print(tokenizer.convert_ids_to_tokens(1996))

[CLS]
the
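
Conceptually, the conversion between tokens and ids is a dictionary lookup, with the special tokens wrapped around the sequence. The sketch below rebuilds the example with a toy vocabulary whose six ids are copied from the outputs above. The functions `encode` and `decode` are simplified stand-ins written for illustration, not the library's implementation.

```python
# Toy vocabulary: the ids are the ones shown in the outputs above
TOKEN_TO_ID = {
    "[CLS]": 101, "[SEP]": 102,
    "testing": 5604, "the": 1996, "token": 19204, "##izer": 17629,
}
ID_TO_TOKEN = {idx: token for token, idx in TOKEN_TO_ID.items()}


def encode(tokens):
    """Map tokens to ids and add the [CLS] / [SEP] special tokens."""
    ids = [TOKEN_TO_ID[token] for token in tokens]
    return [TOKEN_TO_ID["[CLS]"]] + ids + [TOKEN_TO_ID["[SEP]"]]


def decode(ids):
    """Map ids back to text, gluing '##' pieces onto the previous word."""
    words = []
    for token in (ID_TO_TOKEN[idx] for idx in ids):
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


ids = encode(["testing", "the", "token", "##izer"])
print(ids)          # [101, 5604, 1996, 19204, 17629, 102]
print(decode(ids))  # [CLS] testing the tokenizer [SEP]
```
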


Each word can therefore be split into one or more subtokens, depending on the tokenizer. In NLP, "fertility" measures the average number of subwords a tokenizer produces per word.

Let's compute the fertility on the Brown corpus, a pretokenized collection of texts already available in NLTK.

In [8]:
import nltk.corpus
nltk.download('brown')
brown = nltk.corpus.brown
words = brown.words()

print(words[:14])
print('Number of words: ' + str(len(words)))

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\douae\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\brown.zip.


['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election']
Number of words: 1161192


In [9]:
# Here your code
# compute the fertility on the brown corpus using BERT's pre-trained tokenizer
tokenized_words = [tokenizer.tokenize(word) for word in words]
num_original_words = len(words)
num_tokenized_subwords = sum(len(subwords) for subwords in tokenized_words)
fertility = num_tokenized_subwords / num_original_words 
print(fertility)



1.128585970278817
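
Fertility is simply the total number of subwords divided by the total number of words. Because a corpus repeats many words, one possible optimization is to tokenize each distinct word once and weight it by its frequency. The sketch below shows that idea with a stand-in tokenizer (`chunk_tokenize`, a hypothetical function that cuts words into 4-character chunks) so that it runs without downloading any model.

```python
from collections import Counter


def chunk_tokenize(word: str, size: int = 4) -> list:
    """Stand-in tokenizer for illustration: fixed-size character chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def fertility(words, tokenize) -> float:
    """Average number of subwords per word, tokenizing each distinct word once."""
    counts = Counter(words)
    total_words = sum(counts.values())
    total_subwords = sum(len(tokenize(word)) * n for word, n in counts.items())
    return total_subwords / total_words


toy_words = ["the", "tokenizer", "works", "the"]
# the -> 1 piece (x2), tokenizer -> 3 pieces, works -> 2 pieces: 7 / 4
print(fertility(toy_words, chunk_tokenize))  # 1.75
```

With a real tokenizer, passing its `tokenize` method as the second argument should give the same value as the word-by-word loop, under the assumption that the tokenizer treats each word independently of its context.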


**We computed the fertility on the Brown corpus (English texts) using a tokenizer trained for English. What if we used a multilingual model, or a tokenizer trained on another language, instead? Based on your understanding of the algorithm, what do you expect and why?**

*Your answer here*

Using a non-English or multilingual tokenizer on English text should increase fertility: its vocabulary contains fewer whole English words, so more words are broken into subword tokens because of the vocabulary mismatch.


## +++ End of the mandatory section +++

## **Advanced only**

To get the `advanced` grade, try to put together what you have learned into a script and process the examples provided in `spam.tsv`. The final goal is to discriminate between spam and non-spam messages using a random forest classifier. Adapt the preprocessing pipeline to this need (e.g. should you remove stopwords or keep them? Punctuation? Stemming or lemmatization? Are there other things to clean?). Please justify and comment on these decisions. Feel free to add as many steps as you feel necessary, as long as you justify each choice.

Complete the `spam.py` script and try to run the classification process. How did your decisions impact the performance (pay attention to the F1-score)? Try different strategies and report what you found and how you explain it.
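
One way to make such experiments easy to compare is to express the pipeline as an ordered list of small functions that can be switched on and off. The sketch below is only a starting point under stated assumptions: the step names, the placeholder words `urltoken` and `numtoken`, and the choice to mask URLs and numbers instead of deleting them are illustrative ideas to test, not a prescribed solution. The structure of `spam.py` is not shown here, so nothing below depends on it.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')


def lowercase(text: str) -> str:
    return text.lower()


def mask_urls(text: str) -> str:
    # Keep a trace that a URL was present; it may be a useful spam signal
    return URL_PATTERN.sub(' urltoken ', text)


def mask_numbers(text: str) -> str:
    # Replace every run of digits with one placeholder word
    return re.sub(r'\d+', ' numtoken ', text)


def squeeze_spaces(text: str) -> str:
    return re.sub(r'\s+', ' ', text).strip()


def build_pipeline(steps):
    """Return a function that applies the given steps in order."""
    def run(text: str) -> str:
        for step in steps:
            text = step(text)
        return text
    return run


# URLs are masked before numbers so that digits inside a URL are not split off
preprocess = build_pipeline([lowercase, mask_urls, mask_numbers, squeeze_spaces])
print(preprocess("WIN $1000 now! Visit www.example.com"))
# win $ numtoken now! visit urltoken
```

Swapping steps in and out of the list and comparing the resulting F1-scores gives a direct way to justify each preprocessing decision.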