This notebook is based on [***Getting started with text preprocessing***](https://www.kaggle.com/code/sudalairajkumar/getting-started-with-text-preprocessing) by SRK on Kaggle.
More information on the [***dataset***](https://www.kaggle.com/datasets/thoughtvector/customer-support-on-twitter).


## **Introduction**

In any machine learning task, cleaning or preprocessing the data is as important as model building, if not more so. For unstructured data such as text, this step matters even more.

The objective of this notebook is to walk through the common text preprocessing steps with code examples.

Some of the common text preprocessing / cleaning steps are:
* Lower casing
* Removal of Punctuations
* Removal of Stopwords
* Removal of Frequent words
* Removal of Rare words
* Stemming
* Lemmatization
* Removal of emojis
* Removal of emoticons
* Conversion of emoticons to words
* Conversion of emojis to words
* Removal of URLs 
* Removal of HTML tags
* Chat words conversion
* Spelling correction


These are the different preprocessing steps we can apply to text data, but we need not apply all of them every time. We need to choose the steps carefully based on our use case, since that choice also plays an important role.

For example, in a sentiment analysis use case, we need not remove emojis or emoticons, as they convey important information about the sentiment. We need to make similar decisions for our own use cases.
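As a rough illustration of picking steps per use case, the sketch below builds a pipeline from a list of plain functions. The helper names (`lowercase`, `strip_punctuation`, `make_pipeline`) are hypothetical and not part of this notebook; the point is only that each use case can get its own list of steps.

```python
import string
from typing import Callable, List

Step = Callable[[str], str]

def lowercase(text: str) -> str:
    return text.lower()

def strip_punctuation(text: str) -> str:
    # removes every character listed in string.punctuation
    return text.translate(str.maketrans("", "", string.punctuation))

def make_pipeline(steps: List[Step]) -> Step:
    """Return a function that applies the given steps in order."""
    def run(text: str) -> str:
        for step in steps:
            text = step(text)
        return text
    return run

# hypothetical choices: punctuation (and with it the emoticon) is dropped
# for topic analysis but kept for sentiment analysis
topic_pipeline = make_pipeline([lowercase, strip_punctuation])
sentiment_pipeline = make_pipeline([lowercase])

tweet = "Great service :-) THANKS!"
print(topic_pipeline(tweet))
print(sentiment_pipeline(tweet))
```

The sentiment pipeline keeps `:-)`, which is exactly the kind of signal we would not want to throw away in that use case.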

In [1]:
import numpy as np
import pandas as pd 
import re # regular expressions => string parsing and filtering
import nltk # natural language toolkit => classification, tokenization, stemming, ...
import spacy # tokenizer, tagger, parser, NER, pretrained models
import string
pd.options.mode.chained_assignment = None


In [2]:
df = pd.read_csv("../data/tweets_preprocessing.csv")
df["text"] = df["text"].astype(str)
print(f"Shape: {df.shape}")
df.head()

Shape: (93, 7)


Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,119237,105834,True,Wed Oct 11 06:55:44 +0000 2017,@AppleSupport causing the reply to be disregar...,119236.0,
1,119238,ChaseSupport,False,Wed Oct 11 13:25:49 +0000 2017,@105835 Your business means a lot to us. Pleas...,,119239.0
2,119239,105835,True,Wed Oct 11 13:00:09 +0000 2017,@76328 I really hope you all change but I'm su...,119238.0,
3,119240,VirginTrains,False,Tue Oct 10 15:16:08 +0000 2017,@105836 LiveChat is online at the moment - htt...,119241.0,119242.0
4,119241,105836,True,Tue Oct 10 15:17:21 +0000 2017,@VirginTrains see attached error message. I've...,119243.0,119240.0


## **Lower Casing**

Lower casing is a common text preprocessing technique. The idea is to convert the input text into the same casing format so that 'text', 'Text' and 'TEXT' are treated the same way.

This is especially helpful for text featurization techniques like frequency counts and tfidf, as it merges the different casings of the same word, thereby reducing duplication and giving correct counts / tfidf values.

This may not be helpful for tasks like Part of Speech tagging (where proper casing gives some information about nouns and so on) and Sentiment Analysis (where upper casing can signal anger and so on).

By default, lower casing is done by most modern vectorizers and tokenizers, like [sklearn TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and [Keras Tokenizer](https://keras.io/preprocessing/text/). So we need to turn it off (`lowercase=False` and `lower=False` respectively) when our use case requires the original casing.
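To make the effect on counts concrete, here is a minimal sketch using only `collections.Counter` on a made-up token list (the tokens are an assumption for illustration, not taken from the dataset).

```python
from collections import Counter

tokens = ["text", "Text", "TEXT", "data"]  # made-up example tokens

raw_counts = Counter(tokens)                               # each casing is a separate entry
lowered_counts = Counter(token.lower() for token in tokens)  # casings are merged

print(raw_counts)
print(lowered_counts)
```

Without lower casing the vocabulary has four entries with a count of one each; after lower casing, `text` is counted three times and the vocabulary shrinks to two entries.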

In [3]:
text_df = df[["text"]].copy()  # explicit copy so the new columns below are not added to a view of df

In [4]:
print("Before lowering:")
print(text_df.head().text.values)

Before lowering:
['@AppleSupport causing the reply to be disregarded and the tapped notification under the keyboard is opened😡😡😡'
 '@105835 Your business means a lot to us. Please DM your name, zip code and additional details about your concern. ^RR https://t.co/znUu1VJn9r'
 "@76328 I really hope you all change but I'm sure you won't! Because you don't have to!"
 '@105836 LiveChat is online at the moment - https://t.co/SY94VtU8Kq or contact 03331 031 031 option 1, 4, 3 (Leave a message) to request a call back'
 "@VirginTrains see attached error message. I've tried leaving a voicemail several times in the past week https://t.co/NxVZjlYx1k"]


In [5]:
text_df["text"] = text_df["text"].str.lower()
print("After lowering:")
print(text_df.head().text.values)
text_df.head()

After lowering:
['@applesupport causing the reply to be disregarded and the tapped notification under the keyboard is opened😡😡😡'
 '@105835 your business means a lot to us. please dm your name, zip code and additional details about your concern. ^rr https://t.co/znuu1vjn9r'
 "@76328 i really hope you all change but i'm sure you won't! because you don't have to!"
 '@105836 livechat is online at the moment - https://t.co/sy94vtu8kq or contact 03331 031 031 option 1, 4, 3 (leave a message) to request a call back'
 "@virgintrains see attached error message. i've tried leaving a voicemail several times in the past week https://t.co/nxvzjlyx1k"]


Unnamed: 0,text
0,@applesupport causing the reply to be disregar...
1,@105835 your business means a lot to us. pleas...
2,@76328 i really hope you all change but i'm su...
3,@105836 livechat is online at the moment - htt...
4,@virgintrains see attached error message. i've...


## **Removal of Punctuations**

Another common text preprocessing technique is to remove punctuation from the text data. This is again a text standardization process that helps to treat 'hurray' and 'hurray!' in the same way.

We also need to carefully choose the list of punctuation marks to exclude depending on the use case. For example, `string.punctuation` in Python contains the following punctuation symbols:

`` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``

We can add or remove punctuation marks as per our need.
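As one possible way to customize the list, the sketch below keeps `@` and `#` (which mark mentions and hashtags in tweets) and removes everything else in `string.punctuation`. The names `KEEP`, `PUNCT_TO_REMOVE_CUSTOM` and `remove_punctuation_custom` are hypothetical.

```python
import string

KEEP = "@#"  # assumption: mentions and hashtags matter for our use case
PUNCT_TO_REMOVE_CUSTOM = "".join(ch for ch in string.punctuation if ch not in KEEP)

def remove_punctuation_custom(text: str) -> str:
    """Remove punctuation except the characters listed in KEEP."""
    return text.translate(str.maketrans("", "", PUNCT_TO_REMOVE_CUSTOM))

print(remove_punctuation_custom("@applesupport my phone died!!! #help"))
```

The mention and the hashtag survive, while the exclamation marks are dropped.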

In [6]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [7]:
# characters to strip; edit this string to keep or drop specific symbols

PUNCT_TO_REMOVE = string.punctuation
def remove_punctuation(text):
    """custom function to remove the punctuation"""
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

text_df["text_wo_punct"] = text_df["text"].apply(remove_punctuation)
text_df.head()

Unnamed: 0,text,text_wo_punct
0,@applesupport causing the reply to be disregar...,applesupport causing the reply to be disregard...
1,@105835 your business means a lot to us. pleas...,105835 your business means a lot to us please ...
2,@76328 i really hope you all change but i'm su...,76328 i really hope you all change but im sure...
3,@105836 livechat is online at the moment - htt...,105836 livechat is online at the moment https...
4,@virgintrains see attached error message. i've...,virgintrains see attached error message ive tr...


## **Removal of stopwords**

Stopwords are commonly occurring words in a language, like 'the', 'a' and so on. They can be removed from the text most of the time, as they don't provide valuable information for downstream analysis. In cases like Part of Speech tagging, we should not remove them, as they provide very valuable information about the POS.

These stopword lists are already compiled for different languages and we can readily use them. For example, the stopword list for the English language from the nltk package can be seen below.


In [8]:
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))
", ".join(stopwords.words('english'))

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

Similarly, we can get the lists for other languages and use them.
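Note that the English list above contains `no`, `nor` and `not`. For a use case such as sentiment analysis we may prefer to keep those words, since dropping them can flip the meaning of a sentence. A minimal sketch, using a tiny hand-written subset of the list instead of the full NLTK set (`BASE_STOPWORDS`, `NEGATIONS` and `remove_stopwords_custom` are hypothetical names):

```python
# tiny illustrative subset of a stopword list, not the full NLTK list
BASE_STOPWORDS = {"i", "am", "the", "with", "not", "no", "nor"}
NEGATIONS = {"no", "nor", "not"}

# stopword list that keeps negations in the text
SENTIMENT_STOPWORDS = BASE_STOPWORDS - NEGATIONS

def remove_stopwords_custom(text: str, stopwords: set) -> str:
    """Remove the words found in the given stopword set."""
    return " ".join(word for word in text.split() if word not in stopwords)

sentence = "i am not happy with the update"
print(remove_stopwords_custom(sentence, BASE_STOPWORDS))
print(remove_stopwords_custom(sentence, SENTIMENT_STOPWORDS))
```

With the base list the sentence collapses to `happy update`; with the adjusted list the negation is preserved.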

In [9]:
sample = text_df.text.values[0]
sample

'@applesupport causing the reply to be disregarded and the tapped notification under the keyboard is opened😡😡😡'

In [10]:
split = sample.split(' ')
split

['@applesupport',
 'causing',
 'the',
 'reply',
 'to',
 'be',
 'disregarded',
 'and',
 'the',
 'tapped',
 'notification',
 'under',
 'the',
 'keyboard',
 'is',
 'opened😡😡😡']

In [11]:
filtered_words = [word for word in split if word not in STOPWORDS] # list comprehension
# equivalent to:
#
# filtered_words = []  
# for word in split:
#   if word not in STOPWORDS:
#       filtered_words.append(word)

filtered_words

['@applesupport',
 'causing',
 'reply',
 'disregarded',
 'tapped',
 'notification',
 'keyboard',
 'opened😡😡😡']

In [12]:
filtered_string = " ".join(filtered_words)
print(f"Before filtering: {sample}\nAfter filtering: {filtered_string}")

Before filtering: @applesupport causing the reply to be disregarded and the tapped notification under the keyboard is opened😡😡😡
After filtering: @applesupport causing reply disregarded tapped notification keyboard opened😡😡😡


In [13]:
def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

text_df["text_wo_stop"] = text_df["text_wo_punct"].apply(remove_stopwords)
text_df.head()

Unnamed: 0,text,text_wo_punct,text_wo_stop
0,@applesupport causing the reply to be disregar...,applesupport causing the reply to be disregard...,applesupport causing reply disregarded tapped ...
1,@105835 your business means a lot to us. pleas...,105835 your business means a lot to us please ...,105835 business means lot us please dm name zi...
2,@76328 i really hope you all change but i'm su...,76328 i really hope you all change but im sure...,76328 really hope change im sure wont dont
3,@105836 livechat is online at the moment - htt...,105836 livechat is online at the moment https...,105836 livechat online moment httpstcosy94vtu8...
4,@virgintrains see attached error message. i've...,virgintrains see attached error message ive tr...,virgintrains see attached error message ive tr...


## **Removal of Frequent words**

In the previous preprocessing step, we removed the stopwords based on language information. But if we have a domain-specific corpus, we might also have some frequent words which are of lesser importance to us.

So this step removes the frequent words in the given corpus. If we use something like tfidf (vectorizers are the topic of the second course), this is largely taken care of automatically, because words that occur in most documents receive a low weight.
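To see why tfidf down-weights frequent words, here is a small sketch that computes the inverse document frequency by hand in its textbook form `log(N / df)`. The three documents are made up for illustration, and libraries such as scikit-learn use smoothed variants of this formula, so the exact numbers will differ there.

```python
import math
from collections import Counter

# made-up mini corpus
docs = [
    "please dm us your details",
    "please help my phone is broken",
    "please send us the order number",
]
n_docs = len(docs)

# document frequency: in how many documents does each word appear?
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc.split()))

# textbook idf: words present in every document get a weight of 0
idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}

for word in ("please", "us", "phone"):
    print(word, round(idf[word], 3))
```

`please` appears in every document and gets a weight of zero, `us` appears in two of three and gets a small weight, and `phone` appears only once and gets the largest weight of the three.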

Let us get the most common words and then remove them in the next step.

In [14]:
from collections import Counter
cnt = Counter()
for text in text_df["text_wo_stop"].values:
    for word in text.split():
        cnt[word] += 1
        
cnt.most_common(10)

[('us', 25),
 ('dm', 19),
 ('help', 18),
 ('thanks', 13),
 ('httpstcogdrqu22ypt', 12),
 ('applesupport', 11),
 ('please', 11),
 ('phone', 9),
 ('hi', 9),
 ('ive', 8)]

In [15]:
# selecting the 10 most frequent words to filter them out
FREQWORDS = set([w for (w, word_count) in cnt.most_common(10)])
def remove_freqwords(text):
    """custom function to remove the frequent words
       It works the same way as stop word removal !
    """
    return " ".join([word for word in str(text).split() if word not in FREQWORDS])

text_df["text_wo_stopfreq"] = text_df["text_wo_stop"].apply(remove_freqwords)
text_df.head()

Unnamed: 0,text,text_wo_punct,text_wo_stop,text_wo_stopfreq
0,@applesupport causing the reply to be disregar...,applesupport causing the reply to be disregard...,applesupport causing reply disregarded tapped ...,causing reply disregarded tapped notification ...
1,@105835 your business means a lot to us. pleas...,105835 your business means a lot to us please ...,105835 business means lot us please dm name zi...,105835 business means lot name zip code additi...
2,@76328 i really hope you all change but i'm su...,76328 i really hope you all change but im sure...,76328 really hope change im sure wont dont,76328 really hope change im sure wont dont
3,@105836 livechat is online at the moment - htt...,105836 livechat is online at the moment https...,105836 livechat online moment httpstcosy94vtu8...,105836 livechat online moment httpstcosy94vtu8...
4,@virgintrains see attached error message. i've...,virgintrains see attached error message ive tr...,virgintrains see attached error message ive tr...,virgintrains see attached error message tried ...


## **Removal of Rare words**

This is very similar to the previous preprocessing step, but here we remove the rare words from the corpus.
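Taking a fixed number of words from the tail of `most_common()` picks somewhat arbitrarily among all the words that share the lowest count. An alternative, sketched here on a made-up corpus, is to remove every word that occurs fewer than a chosen number of times (`MIN_COUNT` is a hypothetical threshold, as are the other names).

```python
from collections import Counter

# made-up mini corpus
corpus = ["phone battery drains fast", "phone screen cracked", "battery swollen"]
counts = Counter(word for doc in corpus for word in doc.split())

MIN_COUNT = 2  # hypothetical threshold: keep words seen at least twice
rare_words = {word for word, count in counts.items() if count < MIN_COUNT}

def remove_rare(text: str) -> str:
    """Remove the words that occur fewer than MIN_COUNT times in the corpus."""
    return " ".join(word for word in text.split() if word not in rare_words)

print(sorted(rare_words))
print(remove_rare(corpus[0]))
```

Only `phone` and `battery` occur twice, so they are the only words left in the first document.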

In [16]:
# let's keep only the latest version of the processed text
text_df = text_df[["text", "text_wo_stopfreq"]]
text_df.head()

Unnamed: 0,text,text_wo_stopfreq
0,@applesupport causing the reply to be disregar...,causing reply disregarded tapped notification ...
1,@105835 your business means a lot to us. pleas...,105835 business means lot name zip code additi...
2,@76328 i really hope you all change but i'm su...,76328 really hope change im sure wont dont
3,@105836 livechat is online at the moment - htt...,105836 livechat online moment httpstcosy94vtu8...
4,@virgintrains see attached error message. i've...,virgintrains see attached error message tried ...


In [17]:
# Extracting the 10 least common words
RAREWORDS = set([w for (w, wc) in cnt.most_common()[:-11:-1]])
RAREWORDS

{'browser',
 'green',
 'httpstco9281okeebk',
 'including',
 'keen',
 'lee',
 'line',
 'log',
 'slowdown',
 'thin'}

In [18]:
def remove_rarewords(text):
    """custom function to remove the rare words
       Once again, works just like stop words and frequent words removal !
    """
    return " ".join([word for word in str(text).split() if word not in RAREWORDS])

text_df["text_wo_stopfreqrare"] = text_df["text_wo_stopfreq"].apply(remove_rarewords)
text_df.head()

Unnamed: 0,text,text_wo_stopfreq,text_wo_stopfreqrare
0,@applesupport causing the reply to be disregar...,causing reply disregarded tapped notification ...,causing reply disregarded tapped notification ...
1,@105835 your business means a lot to us. pleas...,105835 business means lot name zip code additi...,105835 business means lot name zip code additi...
2,@76328 i really hope you all change but i'm su...,76328 really hope change im sure wont dont,76328 really hope change im sure wont dont
3,@105836 livechat is online at the moment - htt...,105836 livechat online moment httpstcosy94vtu8...,105836 livechat online moment httpstcosy94vtu8...
4,@virgintrains see attached error message. i've...,virgintrains see attached error message tried ...,virgintrains see attached error message tried ...


We can combine all the lists of words (stopwords, frequent words and rare words) into a single list and remove them all at once.

In [19]:
import itertools
words_to_remove = [list(STOPWORDS), list(FREQWORDS), list(RAREWORDS)]
words_to_remove = list(itertools.chain(*words_to_remove))

assert len(words_to_remove) == (len(STOPWORDS) + len(FREQWORDS) + len(RAREWORDS))

len(words_to_remove)

199

In [20]:
def filter_text(text: str) -> str:
    return " ".join([word for word in str(text).split() if word not in words_to_remove])

text_df["filtered_text"] =  text_df.text.apply(filter_text)
text_df = text_df[['text', 'filtered_text']]
text_df.head()

Unnamed: 0,text,filtered_text
0,@applesupport causing the reply to be disregar...,@applesupport causing reply disregarded tapped...
1,@105835 your business means a lot to us. pleas...,"@105835 business means lot us. name, zip code ..."
2,@76328 i really hope you all change but i'm su...,@76328 really hope change i'm sure won't! to!
3,@105836 livechat is online at the moment - htt...,@105836 livechat online moment - https://t.co/...
4,@virgintrains see attached error message. i've...,@virgintrains see attached error message. i've...
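Since the combined collection is only used for membership tests, a `set` is a possible alternative to a list: lookups do not scan the whole collection, and a word that appears in more than one source is stored once. A minimal sketch with small made-up stand-ins for the three collections (all names here are hypothetical):

```python
# made-up stand-ins for the stopword, frequent-word and rare-word collections
STOP = {"the", "to", "us"}
FREQ = {"us", "dm", "help"}
RARE = {"browser", "keen"}

ALL_WORDS_TO_REMOVE = STOP | FREQ | RARE  # set union, duplicates collapse

def filter_text_with_set(text: str) -> str:
    """Remove every word found in the combined set."""
    return " ".join(word for word in text.split() if word not in ALL_WORDS_TO_REMOVE)

print(len(ALL_WORDS_TO_REMOVE))
print(filter_text_with_set("please dm us the browser log"))
```

Here `us` is present in two of the sources, so the union holds seven words rather than eight.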


## **Stemming**

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form (From [Wikipedia](https://en.wikipedia.org/wiki/Stemming)).

This process is useful to **`reduce the vocabulary size`** by converting similar words to their root form.

For example, if there are two words in the corpus, `walks` and `walking`, then stemming will strip the suffix to make them both `walk`. But in another example, with the two words `console` and `consoling`, the stemmer will remove the suffix and make them `consol`, which is not a proper English word.
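To make the suffix-stripping idea concrete, here is a deliberately naive toy stemmer. It is not the Porter algorithm (which applies several ordered rule phases with conditions on the stem); `SUFFIXES` and `toy_stem` are hypothetical names used only to reproduce the two examples above.

```python
SUFFIXES = ("ing", "s", "e")  # checked in this order

def toy_stem(word: str) -> str:
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ("walks", "walking", "console", "consoling"):
    print(word, "->", toy_stem(word))
```

Even this crude rule set maps `walks` and `walking` to `walk`, and `console` and `consoling` to the non-word `consol`, which shows why stems are not guaranteed to be dictionary words.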

There are several types of stemming algorithms available, and one of the best known is the Porter stemmer (available in the NLTK package), which is widely used.

In [21]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text.split()])

text_df["text_stemmed"] = text_df["text"].apply(stem_words)
text_df.head()

Unnamed: 0,text,filtered_text,text_stemmed
0,@applesupport causing the reply to be disregar...,@applesupport causing reply disregarded tapped...,@applesupport caus the repli to be disregard a...
1,@105835 your business means a lot to us. pleas...,"@105835 business means lot us. name, zip code ...",@105835 your busi mean a lot to us. pleas dm y...
2,@76328 i really hope you all change but i'm su...,@76328 really hope change i'm sure won't! to!,@76328 i realli hope you all chang but i'm sur...
3,@105836 livechat is online at the moment - htt...,@105836 livechat online moment - https://t.co/...,@105836 livechat is onlin at the moment - http...
4,@virgintrains see attached error message. i've...,@virgintrains see attached error message. i've...,@virgintrain see attach error message. i'v tri...


In [22]:
all_text_no_stemming = ' '.join(text_df["text"]).split()
all_text_w_stemming = ' '.join(text_df["text_stemmed"]).split()

n_words_no_stemming = len(set(all_text_no_stemming))
n_words_w_stemming = len(set(all_text_w_stemming))
vocabulary_size_diff = n_words_no_stemming - n_words_w_stemming

assert vocabulary_size_diff == 47

print(f"Number of unique words without stemming: {n_words_no_stemming}")
print(f"Number of unique words with stemming: {n_words_w_stemming}")
print(f"Difference: {vocabulary_size_diff} words")

Number of unique words without stemming: 813
Number of unique words with stemming: 766
Difference: 47 words


We can see that words like `private` and `propose` have their trailing `e` chopped off by stemming. This is not intended. What can we do about that? We can use lemmatization in such cases.

Also, this Porter stemmer is for the English language. If we are working with other languages, we can use the Snowball stemmer. The supported languages for the Snowball stemmer are:

In [23]:
from nltk.stem.snowball import SnowballStemmer
SnowballStemmer.languages

('arabic',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'norwegian',
 'porter',
 'portuguese',
 'romanian',
 'russian',
 'spanish',
 'swedish')

## **Lemmatization**

Lemmatization is similar to stemming in that it reduces inflected words to their word stem, but it differs in that it makes sure the root word (also called the lemma) belongs to the language.

As a result, it is generally slower than stemming. So depending on the speed requirement, we can choose either stemming or lemmatization.

Let us use the `WordNetLemmatizer` in nltk to lemmatize our sentences:

In [24]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ryanp\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [25]:
lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

text_df["text_lemmatized"] = text_df["text"].apply(lemmatize_words)
text_df.head()

Unnamed: 0,text,filtered_text,text_stemmed,text_lemmatized
0,@applesupport causing the reply to be disregar...,@applesupport causing reply disregarded tapped...,@applesupport caus the repli to be disregard a...,@applesupport causing the reply to be disregar...
1,@105835 your business means a lot to us. pleas...,"@105835 business means lot us. name, zip code ...",@105835 your busi mean a lot to us. pleas dm y...,@105835 your business mean a lot to us. please...
2,@76328 i really hope you all change but i'm su...,@76328 really hope change i'm sure won't! to!,@76328 i realli hope you all chang but i'm sur...,@76328 i really hope you all change but i'm su...
3,@105836 livechat is online at the moment - htt...,@105836 livechat online moment - https://t.co/...,@105836 livechat is onlin at the moment - http...,@105836 livechat is online at the moment - htt...
4,@virgintrains see attached error message. i've...,@virgintrains see attached error message. i've...,@virgintrain see attach error message. i'v tri...,@virgintrains see attached error message. i've...


We can see that the trailing `e` in `propose` and `private` is retained when we use lemmatization, unlike stemming.

Wait. There is one more thing in lemmatization. Let us try to lemmatize `running` now.

In [26]:
lemmatizer.lemmatize("running")

'running'

It returned `running` unchanged, without converting it to the root form `run`. This is because the lemmatization process depends on the POS tag to come up with the correct lemma, and `WordNetLemmatizer.lemmatize` treats the word as a noun when no POS is given. Now let us lemmatize again by providing the POS tag for the word.

In [27]:
lemmatizer.lemmatize("running", "v") # v for verb

'run'

Now we are getting the root form `run`. So we also need to pass the POS tag of the word, along with the word itself, to the lemmatizer in nltk. Depending on the POS, the lemmatizer may return different results.

Let us take the word `stripes` as an example and check its lemma when it is treated as a verb and as a noun.

In [28]:
print("Word is : stripes")
print("Lemma result for verb : ",lemmatizer.lemmatize("stripes", 'v'))
print("Lemma result for noun : ",lemmatizer.lemmatize("stripes", 'n'))

Word is : stripes
Lemma result for verb :  strip
Lemma result for noun :  stripe


Now let us redo the lemmatization process for our dataset.

In [29]:
from nltk.corpus import wordnet
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ryanp\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [30]:
lemmatizer = WordNetLemmatizer()
wordnet_map = {"N":wordnet.NOUN, "V":wordnet.VERB, "J":wordnet.ADJ, "R":wordnet.ADV}
def lemmatize_words(text):
    pos_tagged_text = nltk.pos_tag(text.split())
    return " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tagged_text])

text_df["text_lemmatized"] = text_df["text"].apply(lemmatize_words)

text_df.head()

Unnamed: 0,text,filtered_text,text_stemmed,text_lemmatized
0,@applesupport causing the reply to be disregar...,@applesupport causing reply disregarded tapped...,@applesupport caus the repli to be disregard a...,@applesupport cause the reply to be disregard ...
1,@105835 your business means a lot to us. pleas...,"@105835 business means lot us. name, zip code ...",@105835 your busi mean a lot to us. pleas dm y...,@105835 your business mean a lot to us. please...
2,@76328 i really hope you all change but i'm su...,@76328 really hope change i'm sure won't! to!,@76328 i realli hope you all chang but i'm sur...,@76328 i really hope you all change but i'm su...
3,@105836 livechat is online at the moment - htt...,@105836 livechat online moment - https://t.co/...,@105836 livechat is onlin at the moment - http...,@105836 livechat be online at the moment - htt...
4,@virgintrains see attached error message. i've...,@virgintrains see attached error message. i've...,@virgintrain see attach error message. i'v tri...,@virgintrains see attached error message. i've...
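The expression `wordnet_map.get(pos[0], wordnet.NOUN)` in the cell above relies on the first letter of the Penn Treebank tag returned by `nltk.pos_tag`. The standalone sketch below isolates that mapping without importing nltk; it uses the single-letter strings `"n"`, `"v"`, `"a"` and `"r"`, which are the values of `wordnet.NOUN`, `wordnet.VERB`, `wordnet.ADJ` and `wordnet.ADV`. The names `WORDNET_POS` and `penn_to_wordnet` are hypothetical.

```python
# first letter of a Penn Treebank tag -> WordNet POS letter
WORDNET_POS = {"N": "n", "V": "v", "J": "a", "R": "r"}

def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS, defaulting to noun."""
    return WORDNET_POS.get(tag[:1], "n")

for tag in ("NNS", "VBG", "JJ", "RB", "IN"):
    print(tag, "->", penn_to_wordnet(tag))
```

Tags outside the four families, such as `IN` for prepositions, fall back to noun, which matches the default used in the cell above.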


In [31]:
all_text_no_lemm = ' '.join(text_df["text"]).split()
all_text_w_lemm = ' '.join(text_df["text_lemmatized"]).split()

n_words_no_lemm = len(set(all_text_no_lemm))
n_words_w_lemm = len(set(all_text_w_lemm))
vocabulary_size_diff = n_words_no_lemm - n_words_w_lemm

assert vocabulary_size_diff == 50

print(f"Number of unique words without lemmatization: {n_words_no_lemm}")
print(f"Number of unique words with lemmatization: {n_words_w_lemm}")
print(f"Difference: {vocabulary_size_diff} words out of {df.shape[0]} samples")

Number of unique words without lemmatization: 813
Number of unique words with lemmatization: 763
Difference: 50 words out of 93 samples


We can now see in the table above that, for example, `causing` got converted to `cause` in the first row and `is` got converted to `be` in the fourth row, since we provided the POS tag for lemmatization.

## **Removal of Emojis**

With more and more usage of social media platforms, there is an explosion in the usage of emojis in our day-to-day life as well. We might need to remove these emojis for some of our textual analyses.

Thanks to [this code](https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b), here is a helper function to remove emojis from our text.

In [32]:
# Reference : https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    # the parameter is named `text` so it does not shadow the imported `string` module
    # define a regular expression pattern: a character class made of Unicode ranges
    emoji_pattern = re.compile("[" 
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # Miscellaneous symbols
                           u"\U000024C2-\U0001F251"  # Enclosed characters (a very wide range that also spans many non-emoji characters)
                           "]+", flags=re.UNICODE)   # '+' signifies that those characters can occur once or more consecutively
    # replace the substrings matching our regular expression with an empty string
    return emoji_pattern.sub(r'', text)

remove_emoji("game is on 🔥🔥")

'game is on '

In [33]:
remove_emoji("Regular expressions are so much fun😂")

'Regular expressions are so much fun'

`/!\ Be aware of the patterns you use with regular expressions /!\`

Example: The United States flag emoji "🇺🇸" is made of two regional indicator symbols, both of which fall in the Unicode range \U0001F1E0-\U0001F1FF, which is included in the pattern.

Therefore, when you use this pattern to remove emoji characters, it will also remove the flag emoji "🇺🇸".

`This might or might not be a problem depending on your use case, but you have to be aware of the design decisions you are making.`

In [34]:
remove_emoji("This is a 😀 sample text with 🚀 emojis 🇺🇸")

'This is a  sample text with  emojis '
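The same caution applies to the last range in the pattern, `\U000024C2-\U0001F251`, which is wide enough to include CJK characters. The sketch below isolates that single range and compares it with the much narrower emoticons range; the variable names and sample strings are made up for illustration.

```python
import re

# the widest range from the pattern above, on its own
wide_range = re.compile("[\U000024C2-\U0001F251]+")
# only the "emoticons" range
narrow_range = re.compile("[\U0001F600-\U0001F64F]+")

# the Japanese characters fall inside the wide range and are removed
print(repr(wide_range.sub("", "日本語 text")))
# the narrow range removes the emoji but leaves the Japanese text alone
print(repr(narrow_range.sub("", "日本語 😀")))
```

If the corpus contains Chinese or Japanese text, the wide range would silently delete it, so the ranges are worth narrowing for multilingual data.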

## **Removal of Emoticons**

Isn't this what we did in the last step? Not exactly. We removed emojis in the last step, but not emoticons. There is a subtle difference between emojis and emoticons.

According to Grammarist.com, an emoticon is built from keyboard characters that, when put together in a certain way, represent a facial expression, whereas an emoji is an actual image.

:-) is an emoticon

😀 is an emoji

Thanks to [NeelShah](https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py) for the collection of emoticons, which we are going to use to remove emoticons.

Please note again that removing emojis / emoticons is not always desirable, and the decision should be made based on the use case at hand.

In [35]:
from utils import emoticons
EMOTICONS = emoticons()
print(list(EMOTICONS.items())[:5])

[(':‑\\)', 'Happy face or smiley'), (':\\)', 'Happy face or smiley'), (':-\\]', 'Happy face or smiley'), (':\\]', 'Happy face or smiley'), (':-3', 'Happy face smiley')]


In [36]:
def remove_emoticons(text):
    emoticon_pattern = re.compile(u'(' + u'|'.join(k for k in EMOTICONS) + u')')
    return emoticon_pattern.sub(r'', text)

remove_emoticons("Hello :-)")

'Hello '

In [37]:
remove_emoticons("I am sad :(")

'I am sad '
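When building one alternation from many emoticons, two details are worth keeping in mind: special characters such as `)` must be escaped, and the regex engine tries alternatives from left to right, so a shorter emoticon listed first can match part of a longer one. The sketch below uses a small made-up list of unescaped emoticons (not the `EMOTICONS` dictionary, whose keys are already escaped) and sorts it longest first; all names are hypothetical.

```python
import re

RAW_EMOTICONS = [":)", ":-)", ":(", ":-))"]  # made-up, unescaped examples

# longest alternatives first, each escaped so ')' and '(' are literal
sorted_pattern = re.compile(
    "|".join(re.escape(e) for e in sorted(RAW_EMOTICONS, key=len, reverse=True))
)
# same alternatives in the original order, for comparison
unsorted_pattern = re.compile("|".join(re.escape(e) for e in RAW_EMOTICONS))

def remove_raw_emoticons(text: str) -> str:
    """Remove the emoticons listed in RAW_EMOTICONS."""
    return sorted_pattern.sub("", text)

text = "Very happy :-)) today"
print(repr(remove_raw_emoticons(text)))
print(repr(unsorted_pattern.sub("", text)))
```

With the unsorted pattern, `:-)` matches first and a stray `)` is left behind. Compiling the pattern once at module level also avoids rebuilding it on every call.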

## **Conversion of Emoticon to Words**

In the previous step, we removed the emoticons. For use cases like sentiment analysis, the emoticons give some valuable information, so removing them might not be a good solution. What can we do in such cases?

One way is to convert the emoticons to word format so that they can be used in downstream modeling processes. Thanks to Neel again for the wonderful dictionary that we used in the previous step. We are going to use it again for the conversion of emoticons to words.


`Regex breakdown:`
```Python
u'('+emoticon+')'
```
* `u` in front of a string marks it as a Unicode string. In Python 3 every `str` is already Unicode, so the prefix is optional.
* `'('` and `')'`: In the Python code these are ordinary one-character strings. Inside the resulting regular expression, however, the parentheses are not literal: they delimit a capturing group around the emoticon pattern.
* `emoticon`: The string variable holding the regex-escaped emoticon, for instance `:‑\)` (the backslash makes the emoticon's own `)` literal).
* `+`: This is the string concatenation operator. It combines the characters and the value of the `emoticon` variable together to create a new string.

So `u'('+emoticon+')'` builds a pattern string made of a left parenthesis `'('`, the value of the `emoticon` variable (the escaped emoticon we want to find), and a right parenthesis `')'`, i.e. a capturing group that matches that emoticon.

We'll use this pattern in the next cell to replace the emoticons within strings: for example "Hi :-)" => "Hi Happy_face_smiley"

In [38]:
for emoticon in EMOTICONS:
    print(EMOTICONS[emoticon])
    cleaned_description = EMOTICONS[emoticon].replace(",", "").split()
    cleaned_description_joined = "_".join(cleaned_description)
    print(cleaned_description_joined)
    break

Happy face or smiley
Happy_face_or_smiley


In [39]:
def convert_emoticons(text):
    for emoticon, description in EMOTICONS.items():
        cleaned_description = description.replace(",", "").split()
        cleaned_description_joined = "_".join(cleaned_description)
        # replace the emoticons by the cleaned description within the given text
        text = re.sub(u'('+emoticon+')', cleaned_description_joined, text)
    return text

text = "Hello :-) :-)"
convert_emoticons(text)

'Hello Happy_face_smiley Happy_face_smiley'

In [40]:
text = "I am sad :("
assert convert_emoticons(text) == 'I am sad Frown_sad_andry_or_pouting'

This method might be better for some use cases when we do not want to miss out on the emoticon information.
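The function above runs one `re.sub` per dictionary entry, so the text is scanned once for every emoticon. A possible alternative is a single pass with a replacement function that looks up each match. The sketch below uses a tiny made-up mapping with unescaped keys rather than the `EMOTICONS` dictionary; `EMOTICON_WORDS`, `emoticon_pattern` and `convert_emoticons_once` are hypothetical names.

```python
import re

# made-up mapping with unescaped emoticons as keys
EMOTICON_WORDS = {":-)": "Happy_face_smiley", ":(": "Frown_sad"}

# one alternation of all emoticons, longest first, compiled once
emoticon_pattern = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_WORDS, key=len, reverse=True))
)

def convert_emoticons_once(text: str) -> str:
    """Replace every emoticon in a single pass over the text."""
    return emoticon_pattern.sub(lambda match: EMOTICON_WORDS[match.group(0)], text)

print(convert_emoticons_once("Hello :-) but :("))
```

Because the text is only scanned once, an inserted description is never re-examined by a later pattern.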

## **Conversion of Emoji to Words**

Now let us do the same for emojis. Neel Shah has also put together a list of emojis with their corresponding words as part of his [Github repo](https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py). We are going to make use of this dictionary to convert the emojis to the corresponding words.

Again, this conversion might be better than emoji removal for certain use cases. Please use the one that is suitable for the use case.

In [41]:
from utils import emojis_unicode

EMO_UNICODE = emojis_unicode()
print(f"Emoji unicode: {list(EMO_UNICODE.items())[:5]}")

# reversing the dictionary for easier lookup (emoji -> description)
UNICODE_EMO = {v: k for k, v in EMO_UNICODE.items()}

Emoji unicode: [(':1st_place_medal:', '🥇'), (':2nd_place_medal:', '🥈'), (':3rd_place_medal:', '🥉'), (':AB_button_(blood_type):', '🆎'), (':ATM_sign:', '🏧')]


In [42]:
print(UNICODE_EMO['😂'])
print(UNICODE_EMO['🥇'])

:face_with_tears_of_joy:
:1st_place_medal:


Let's clean the emoji descriptions before using them:

In [43]:
def convert_emojis(text):
    for emoji, description in UNICODE_EMO.items():
        cleaned_description = description.replace(",", "").replace(":", "").split()
        replacement = "_".join(cleaned_description)
        text = text.replace(emoji, replacement)
    return text

text = "game is on 🔥"

print(convert_emojis(text))
assert convert_emojis(text) == 'game is on fire'

game is on fire


In [44]:
text = "Hilarious 😂"
assert convert_emojis(text) == 'Hilarious face_with_tears_of_joy'
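
One limitation worth noting: when an emoji directly follows a word or another emoji, a plain `str.replace` glues the description onto its neighbour. The sketch below pads each replacement with spaces and then collapses repeated whitespace. The two-entry dictionary `UNICODE_EMO_SAMPLE` and the function name are hypothetical stand-ins, not part of the notebook's `utils` module.

```python
import re

# Hypothetical two-entry stand-in for the UNICODE_EMO dictionary
UNICODE_EMO_SAMPLE = {"🔥": ":fire:", "😂": ":face_with_tears_of_joy:"}

def convert_emojis_spaced(text, mapping=UNICODE_EMO_SAMPLE):
    for emoji, description in mapping.items():
        name = description.strip(":")
        # Pad with spaces so the name never sticks to an adjacent token
        text = text.replace(emoji, f" {name} ")
    # Collapse the extra spaces introduced by the padding
    return re.sub(r"\s+", " ", text).strip()

print(convert_emojis_spaced("so funny😂🔥"))
```

Whether the extra normalization is desirable depends on the use case, since it also rewrites whitespace that was already present in the text.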

## **Removal of URLs**

The next preprocessing step is to remove any URLs present in the data. For example, tweets often contain links, and we usually want to remove them before further analysis.

We can use the code snippet below to do that.

`Regex breakdown:`
```Python
r'https?://\S+|www\.\S+'
# could also be understood as 
(r'https?://\S+') or (r'www\.\S+')
```
* `r` in front of a string tells Python to treat it as a raw literal, so backslashes (`\`) reach the regex engine instead of being interpreted as escape characters.
* `https?://`: matches the scheme, either "http://" or "https://". The `?` makes the preceding `s` optional.
* `\S+`: matches one or more non-whitespace characters, i.e. everything that follows the scheme (domain, path, query string) up to the next whitespace.
* `|`: the alternation operator, which acts like an OR in regular expressions. It matches either the pattern on its left or the pattern on its right. Here it separates URLs starting with "http://" or "https://" from URLs starting with "www.".
* `www\.\S+`: matches "www." (the `\.` is a literal dot) followed by one or more non-whitespace characters, which covers URLs like "www.example.com".

In summary, this regular expression is designed to identify and capture URLs in a text string, whether they start with `"http://"`, `"https://"`, or `"www."`. It's a common pattern for extracting or hyperlinking URLs in text processing tasks.
As an aside on the earlier `convert_emoticons` function: the expression `u'('+emoticon+')'` concatenates a left parenthesis `'('`, the value of the `emoticon` variable (the pattern we want to find), and a right parenthesis `')'`, which wraps the pattern in a regex group. The `u` prefix marks a Unicode string literal and is redundant in Python 3.
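
To see what the URL pattern actually matches, here is a small sketch using `re.findall` on a made-up sentence (the example URLs are placeholders). It also shows a limitation of `\S+`: punctuation that directly follows a URL is treated as part of it.

```python
import re

url_pattern = re.compile(r'https?://\S+|www\.\S+')

sample = "Docs at https://example.com/a?b=1, mirror at www.example.org and http://example.net."
matches = url_pattern.findall(sample)
print(matches)
# ['https://example.com/a?b=1,', 'www.example.org', 'http://example.net.']
```

The trailing comma and period are captured because they are non-whitespace characters. For removal this is mostly harmless, but for URL extraction the matches would need extra cleaning.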

In [45]:
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

Let us take an `https` link and check the code.

In [46]:
text = "Driverless AI NLP blog post on https://www.h2o.ai/blog/detecting-sarcasm-is-difficult-but-ai-may-have-an-answer/"
remove_urls(text)

'Driverless AI NLP blog post on '

Now let us take an `http` URL and check the code.

In [47]:
text = "Please refer to link http://lnkd.in/ecnt5yC for the paper"
remove_urls(text)

'Please refer to link  for the paper'

Thanks to Pranjal for pointing out edge cases in the comments of the original notebook. Suppose the URL contains neither `http` nor `https`: thanks to the `www\.` alternative, the function captures that case as well.

In [48]:
text = "Want to know more. Checkout www.h2o.ai for additional information"
remove_urls(text)

'Want to know more. Checkout  for additional information'
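
Note that the outputs above keep a double space where the URL used to be. If that matters downstream, one possible follow-up is to collapse whitespace after removal. The function name below is hypothetical and this is only a sketch of that idea.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def remove_urls_and_tidy(text):
    # Remove the URLs first, then collapse any run of whitespace into one space
    without_urls = URL_PATTERN.sub('', text)
    return re.sub(r'\s+', ' ', without_urls).strip()

print(remove_urls_and_tidy("Please refer to link http://lnkd.in/ecnt5yC for the paper"))
# Please refer to link for the paper
```

Keep in mind that this also flattens newlines and tabs, which may or may not be acceptable depending on the dataset.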

## **Removal of HTML Tags**

Another common preprocessing technique that comes in handy in many places is the removal of HTML tags. This is especially useful if we scrape the data from different websites, since we might end up with HTML strings as part of our text.

First, let us try to remove the HTML tags using regular expressions. 

`Regex breakdown:`
```Python
'<.*?>'
```
* `<` and `>`: match the opening and closing angle brackets of an HTML tag, e.g. `<div>`.
* `.*?`: `.` matches any character except a newline, and `*?` is the `non-greedy` (or `lazy`) quantifier, which repeats it zero or more times but as few times as possible. In the context of HTML tags, this means it matches the shortest possible sequence of characters between an opening `<` and the next closing `>`.

So the entire regular expression `'<.*?>'` matches each HTML tag individually, stopping at the first `>` that follows a `<`. This lets us remove the tags while keeping the text between them.
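
The difference between the greedy and the lazy quantifier is easiest to see side by side. The short sketch below uses made-up HTML strings and also illustrates that, without the `re.DOTALL` flag, a tag that spans several lines is not matched because `.` does not match newlines.

```python
import re

html = "<p>AutoML</p>"

greedy = re.sub(r'<.*>', '', html)   # matches from the first "<" to the last ">"
lazy = re.sub(r'<.*?>', '', html)    # matches each tag separately
print(repr(greedy))  # ''
print(repr(lazy))    # 'AutoML'

# A tag broken across two lines: "." does not match "\n" by default
multiline = "<a\nhref='x'>link</a>"
default_flags = re.sub(r'<.*?>', '', multiline)
with_dotall = re.sub(r'<.*?>', '', multiline, flags=re.DOTALL)
print(repr(default_flags))  # "<a\nhref='x'>link"
print(repr(with_dotall))    # 'link'
```

The greedy version removes the text along with the tags, which is why the lazy quantifier is used in this notebook.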

In [49]:
def remove_html(text):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', text)

text = """<div>
<h1> H2O</h1>
<SomeComponent/>
<p> AutoML</p>
<a href="https://www.h2o.ai/products/h2o-driverless-ai/"> Driverless AI</a>
</div>"""

print(remove_html(text))


 H2O

 AutoML
 Driverless AI



We can also use the `BeautifulSoup` package to extract the text from an HTML document in a more elegant way. Note that the `"lxml"` parser used below requires the `lxml` package to be installed.

In [50]:
from bs4 import BeautifulSoup

def remove_html(text):
    return BeautifulSoup(text, "lxml").text

text = """<div>
<h1> H2O</h1>
<p> AutoML</p>
<a href="https://www.h2o.ai/products/h2o-driverless-ai/"> Driverless AI</a>
</div>
"""

print(remove_html(text))


 H2O
 AutoML
 Driverless AI




## **Chat Words Conversion**

This is an important text preprocessing step when dealing with chat data. People use a lot of abbreviations in chat, so it can be helpful to expand them before analysis.

A good list of chat slang words is available in this [repo](https://github.com/rishabhverma17/sms_slang_translator/blob/master/slang.txt). We use it for the conversion here, and more words can be added to the list.

In [51]:
from utils import slang_words

slang_words_list = slang_words()
print(list(slang_words_list.items())[:10])

[('AFAIK', 'As Far As I Know'), ('AFK', 'Away From Keyboard'), ('ASAP', 'As Soon As Possible'), ('ATK', 'At The Keyboard'), ('ATM', 'At The Moment'), ('A3', 'Anytime, Anywhere, Anyplace'), ('BAK', 'Back At Keyboard'), ('BBL', 'Be Back Later'), ('BBS', 'Be Back Soon'), ('BFN', 'Bye For Now')]


In [52]:
def chat_words_conversion(text):
    """Expand chat abbreviations (e.g. BRB) using the slang dictionary loaded above."""
    new_text = []
    for w in text.split():
        # dictionary keys are upper case, so normalize each token before the lookup;
        # tokens that are not known abbreviations are kept unchanged
        new_text.append(slang_words_list.get(w.upper(), w))
    return " ".join(new_text)

chat_words_conversion("one minute BRB")


'one minute Be Right Back'

In [53]:
chat_words_conversion("imo this is awesome")

'In My Opinion this is awesome'

We can add more words to our abbreviation list and use them based on our use case. 
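
Because the function splits on whitespace, an abbreviation followed by punctuation (for example `"brb!"`) is not found in the dictionary. A possible variant, sketched below with a hypothetical two-entry dictionary `SLANG_SAMPLE`, matches word tokens with a regex so that the surrounding punctuation is preserved.

```python
import re

# Hypothetical two-entry stand-in for the slang dictionary
SLANG_SAMPLE = {"BRB": "Be Right Back", "IMO": "In My Opinion"}

# \w+ between word boundaries: letters, digits and underscores only
WORD_PATTERN = re.compile(r"\b\w+\b")

def chat_words_conversion_punct(text, mapping=SLANG_SAMPLE):
    def _expand(match):
        word = match.group(0)
        # Keys are upper case; unknown words are returned unchanged
        return mapping.get(word.upper(), word)
    return WORD_PATTERN.sub(_expand, text)

print(chat_words_conversion_punct("imo, this is awesome. brb!"))
# In My Opinion, this is awesome. Be Right Back!
```

Either variant can expand ordinary words that happen to collide with an abbreviation, so the dictionary should be reviewed for the target domain.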

## **Spelling Correction**

Another important text preprocessing step is spelling correction. Typos are common in text data, and we might want to correct them before running our analysis.

If you are interested in writing a spell corrector of your own, take a look at [How to Write a Spelling Corrector](https://norvig.com/spell-correct.html) by Peter Norvig.

For the sake of brevity, let's use the python package `pyspellchecker` for our spelling correction.

In [54]:
# %pip install pyspellchecker
from spellchecker import SpellChecker

In [55]:
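spell = SpellChecker()

def correct_spellings(text):
    words = text.split()
    corrected_text = []
    # unknown() returns the subset of words that are not in the dictionary
    misspelled_words = spell.unknown(words)
    for word in words:
        if word in misspelled_words:
            # correction() can return None when no candidate is found
            # (recent pyspellchecker versions), so fall back to the original word
            corrected_text.append(spell.correction(word) or word)
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)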
        
text = "speling correctin"
correct_spellings(text)

'spelling correcting'

In [56]:
text = "Hopefully you larned smething durng th classn, seeee you in twwo wekks !"
correct_spellings(text)

'Hopefully you learned something during the class see you in two weeks !'
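
For intuition about what such a corrector does, here is a heavily simplified sketch in the spirit of Norvig's article linked above. It generates every string one edit away from the input and keeps those found in a vocabulary. The four-word `VOCABULARY` is a hypothetical toy set, and ties are broken alphabetically, whereas a real corrector would rank candidates by word frequency. This is not how `pyspellchecker` is implemented internally.

```python
import string

# Hypothetical toy vocabulary (a real corrector uses a large frequency list)
VOCABULARY = {"spelling", "correction", "something", "class"}

def edits1(word):
    """Return all strings that are one edit away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct_word(word, vocabulary=VOCABULARY):
    if word in vocabulary:
        return word
    # Keep only the candidates that are real words; pick the first alphabetically
    candidates = sorted(edits1(word) & vocabulary)
    return candidates[0] if candidates else word

print([correct_word(w) for w in ["speling", "correctin", "smething", "xyz"]])
# ['spelling', 'correction', 'something', 'xyz']
```

Words with no known neighbour within one edit are returned unchanged, which mirrors the fallback used in `correct_spellings`.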

## **End of the mandatory section**

***What is valued in the grading:***
* Your code should be shareable with your colleagues (clean, commented, reusable, functional).
* This pipeline is not perfect: text preprocessing is a difficult task that requires design decisions. You should be aware of the limits of your code and pipeline and comment on them. For example:
  * Does the order of the preprocessing steps matter?
  * What design choices were made in this notebook, and what risks do we accept by using it?
  * Are there use cases (for example tasks or types of datasets) that are better or worse suited to the way we approach preprocessing?
  * Anything else you want to comment on.


***To get the `advanced` grade: combine the different functions we have seen in this notebook into a cleanly written pipeline and use it to clean the text examples provided in `to_clean.csv`. Feel free to add any new step of your choice and to chain the processing steps in any order that makes sense to you (and, of course, comment on those decisions).***

*You can use a scikit-learn `Pipeline` ([docs](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)) along with `FunctionTransformer` ([docs](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html)) to chain the functions we wrote. We will reuse this tool in the next practical session (TP) to add feature extraction functions and feed the data to machine learning models.*
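
As a starting point only, here is a minimal sketch of how a few cleaning functions could be chained with `Pipeline` and `FunctionTransformer`. The helper names `apply_to_each`, `make_step` and `normalize_whitespace` are hypothetical, the input is a made-up list of strings rather than `to_clean.csv`, and the choice and order of steps is just one possibility that you are expected to adapt and justify.

```python
import re
from functools import partial

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
HTML_PATTERN = re.compile(r'<.*?>')

def remove_html(text):
    return HTML_PATTERN.sub('', text)

def remove_urls(text):
    return URL_PATTERN.sub('', text)

def normalize_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()

def apply_to_each(texts, func):
    # Apply a single-string cleaning function to every document
    return [func(t) for t in texts]

def make_step(func):
    # Wrap a single-string function so it can be used as a pipeline step
    return FunctionTransformer(partial(apply_to_each, func=func))

pipeline = Pipeline([
    ("remove_html", make_step(remove_html)),
    ("remove_urls", make_step(remove_urls)),
    ("normalize_whitespace", make_step(normalize_whitespace)),
])

raw = ["<p>Read https://example.com now</p>", "Visit   www.example.org   today"]
cleaned = pipeline.fit_transform(raw)
print(cleaned)
# ['Read now', 'Visit today']
```

In this sketch HTML tags are removed before URLs on purpose: if URLs were removed first, a link inside an attribute such as `href="https://..."` would be cut together with the closing `>` of its tag, and the tag pattern would no longer match.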

<br>

### ***If you have any additional questions or feedback on the course and practical work, don't hesitate to email me at ryan.pegoud@epfedu.fr***