## Introduction

Text preprocessing is an essential step in natural language processing (NLP). It transforms unstructured text data into a clean and consistent format that can then be fed into a model. Common steps include punctuation removal, lowercasing, stop-word removal, stemming, lemmatization, and tokenization.

In [1]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [5]:
# %cd drive/MyDrive/AMMI\ 2023


In [10]:
import pandas as pd
data = pd.read_csv('spam.csv', encoding='latin-1')
data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [11]:
data = data[["v1", "v2"]]
data.head()


Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
list(data["v2"])[:10]

['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
 'Ok lar... Joking wif u oni...',
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
 'U dun say so early hor... U c already then say...',
 "Nah I don't think he goes to usf, he lives around here though",
 "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, å£1.50 to rcv",
 'Even my brother is not like to speak with me. They treat me like aids patent.',
 "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune",
 'WINNER!! As a valued network customer you have been selected to receivea å£900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.',
 'Had y

## Punctuation removal

Punctuation removal consists of removing punctuation marks from the data. The marks to remove need to be chosen carefully depending on the use case.

This can be done in multiple ways:



*   regular expressions (`re`)
*   the `string` library
*   a for loop
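The regex and `string` approaches appear in the cells that follow; the loop-based option can be sketched as below. This is a minimal illustration rather than part of the original notebook: the function name `remove_punctuations_loop` and the `keep` parameter are assumptions introduced here to show how selected punctuation marks could be preserved when the use case calls for it.

```python
import string

def remove_punctuations_loop(text, keep=""):
    """Drop every ASCII punctuation character except those listed in `keep`."""
    to_remove = set(string.punctuation) - set(keep)
    kept_chars = []
    for char in text:
        # Copy the character only if it is not in the removal set
        if char not in to_remove:
            kept_chars.append(char)
    return "".join(kept_chars)

# Remove all punctuation
print(remove_punctuations_loop("Ok lar... Joking wif u oni..."))
# Keep apostrophes so that contractions such as "don't" survive
print(remove_punctuations_loop("Nah I don't think so!", keep="'"))
```

Keeping the apostrophe is only one possible choice; which marks to preserve depends on the task.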



In [15]:
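import re

def remove_punctuations(text):
    # Remove every character that is neither a word character (\w) nor whitespace (\s).
    # Note: \w matches the underscore, so "_" is kept by this pattern.
    return re.sub(r'[^\w\s]', '', text)

data["v3"] = data["v2"].apply(remove_punctuations)
data.head()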

Unnamed: 0,v1,v2,v3
0,ham,"Go until jurong point, crazy.. Available only ...",Go until jurong point crazy Available only in ...
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say
4,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...


In [16]:
import string

def remove_punctuations(text):
    translator = str.maketrans("", "", string.punctuation)
    text = text.translate(translator)
    return text

data["v3"] = data["v2"].apply(lambda text: remove_punctuations(text))
data.head()


Unnamed: 0,v1,v2,v3
0,ham,"Go until jurong point, crazy.. Available only ...",Go until jurong point crazy Available only in ...
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say
4,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...


In [17]:
list(data["v3"])[:10]

['Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat',
 'Ok lar Joking wif u oni',
 'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s',
 'U dun say so early hor U c already then say',
 'Nah I dont think he goes to usf he lives around here though',
 'FreeMsg Hey there darling its been 3 weeks now and no word back Id like some fun you up for it still Tb ok XxX std chgs to send å£150 to rcv',
 'Even my brother is not like to speak with me They treat me like aids patent',
 'As per your request Melle Melle Oru Minnaminunginte Nurungu Vettam has been set as your callertune for all Callers Press 9 to copy your friends Callertune',
 'WINNER As a valued network customer you have been selected to receivea å£900 prize reward To claim call 09061701461 Claim code KL341 Valid 12 hours only',
 'Had your mobile 11 months or more U R entitled to Update

In [18]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [19]:
PUNCT_TO_REMOVE = string.punctuation

def remove_punctuations_1(text):
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

In [20]:
data.drop("v3", axis=1, inplace=True)
data.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [21]:

data["v3"] = data["v2"].apply(lambda text: remove_punctuations(text))
data.head()

Unnamed: 0,v1,v2,v3
0,ham,"Go until jurong point, crazy.. Available only ...",Go until jurong point crazy Available only in ...
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say
4,ham,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...


In [22]:
list(data["v3"])[:10]

['Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat',
 'Ok lar Joking wif u oni',
 'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s',
 'U dun say so early hor U c already then say',
 'Nah I dont think he goes to usf he lives around here though',
 'FreeMsg Hey there darling its been 3 weeks now and no word back Id like some fun you up for it still Tb ok XxX std chgs to send å£150 to rcv',
 'Even my brother is not like to speak with me They treat me like aids patent',
 'As per your request Melle Melle Oru Minnaminunginte Nurungu Vettam has been set as your callertune for all Callers Press 9 to copy your friends Callertune',
 'WINNER As a valued network customer you have been selected to receivea å£900 prize reward To claim call 09061701461 Claim code KL341 Valid 12 hours only',
 'Had your mobile 11 months or more U R entitled to Update

## Lowercasing

Lowercasing consists of converting the input text to a single case so that 'text', 'Text' and 'TEXT' are treated the same way. This can considerably reduce the size of the vocabulary.

However, lowercasing is not always desirable: in some cases it loses information, for example in part-of-speech (POS) tagging (where casing helps identify proper nouns) and in emotion analysis (where words written in upper case can signal frustration or excitement).
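The trade-off can be illustrated with a small sketch. The toy sentences and the helper `uppercase_ratio` are assumptions made for this example: the first part shows how lowercasing shrinks the vocabulary, and the helper shows one possible way to record casing information as a feature before it is discarded.

```python
texts = ["Free entry", "FREE entry today", "free Entry"]

# Vocabulary with and without lowercasing
vocab_cased = {word for text in texts for word in text.split()}
vocab_lower = {word.lower() for text in texts for word in text.split()}

print(len(vocab_cased), sorted(vocab_cased))
print(len(vocab_lower), sorted(vocab_lower))

def uppercase_ratio(text):
    """Share of alphabetic characters that are upper case (0.0 for text without letters)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)

# Casing signal that would be lost after lowercasing
print(uppercase_ratio("WINNER!! As"))
```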

In [26]:
data["v4"] = data["v3"].str.lower()
data.head()

Unnamed: 0,v1,v2,v3,v4
0,ham,"Go until jurong point, crazy.. Available only ...",go until jurong point crazy available only in ...,go until jurong point crazy available only in ...
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...,free entry in 2 a wkly comp to win fa cup fina...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor u c already then say,u dun say so early hor u c already then say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah i dont think he goes to usf he lives aroun...,nah i dont think he goes to usf he lives aroun...


In [27]:
data["v4"]= data["v3"].apply(lambda x: x.lower())
data.head()

Unnamed: 0,v1,v2,v3,v4
0,ham,"Go until jurong point, crazy.. Available only ...",go until jurong point crazy available only in ...,go until jurong point crazy available only in ...
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...,free entry in 2 a wkly comp to win fa cup fina...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor u c already then say,u dun say so early hor u c already then say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah i dont think he goes to usf he lives aroun...,nah i dont think he goes to usf he lives aroun...


In [28]:
data.drop("v4", axis=1, inplace=True)
data["v3"] = data["v3"].str.lower()
data.head()

Unnamed: 0,v1,v2,v3
0,ham,"Go until jurong point, crazy.. Available only ...",go until jurong point crazy available only in ...
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor u c already then say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah i dont think he goes to usf he lives aroun...


## Stop words removal

Stop words are very common words (such as "the", "is", and "in") that usually carry little information for downstream analysis, so they are often removed from the text.

We need to be careful, though: for some tasks, such as POS tagging, stop words should not all be removed.
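One way to stay careful is to exclude task-relevant words from the stop-word list before filtering. The sketch below uses a small hand-written list (`BASIC_STOP_WORDS`) and a hypothetical helper `remove_stopwords_custom`, both assumptions made for illustration. It shows how removing a negation can change the meaning of a sentence; NLTK's English list, for example, includes negations such as `not`.

```python
# Small hand-written stop-word list, used here only for illustration
BASIC_STOP_WORDS = {"i", "do", "this", "is", "not", "the", "a", "at", "all"}
# Words we decide to keep because they carry meaning for the task
KEEP = {"not"}

def remove_stopwords_custom(text, stop_words, keep=frozenset()):
    """Remove stop words from `text`, except the ones listed in `keep`."""
    effective = set(stop_words) - set(keep)
    return " ".join(word for word in text.split() if word not in effective)

sentence = "this is not a good movie"
# Plain removal drops the negation
print(remove_stopwords_custom(sentence, BASIC_STOP_WORDS))
# Keeping "not" preserves the meaning
print(remove_stopwords_custom(sentence, BASIC_STOP_WORDS, keep=KEEP))
```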

A ready-made list of stop words is available in the NLTK library.

In [29]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [36]:
from nltk.corpus import stopwords

stop_words = stopwords.words("english")
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [None]:
def remove_stopwords(text):
    return " ".join([word for word in text.split() if word not in stop_words])

data["v4"] = data["v3"].apply(lambda text: remove_stopwords(text))
data.head()

Unnamed: 0,v1,v2,v3,v4
0,ham,"Go until jurong point, crazy.. Available only ...",go until jurong point crazy available only in ...,go jurong point crazy available bugis n great ...
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor u c already then say,u dun say early hor u c already say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah i dont think he goes to usf he lives aroun...,nah dont think goes usf lives around though


In [None]:
data.drop("v4", axis=1, inplace=True)
data["v3"] = data["v3"].apply(lambda text: remove_stopwords(text))
data.head()

Unnamed: 0,v1,v2,v3
0,ham,"Go until jurong point, crazy.. Available only ...",go jurong point crazy available bugis n great ...
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,U dun say so early hor... U c already then say...,u dun say early hor u c already say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah dont think goes usf lives around though


In some cases, you may also want to remove very frequent or very rare words.
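A possible way to do this is to count word occurrences over the whole corpus and drop the extremes. The sketch below is an assumption-laden illustration: the function name `remove_frequent_and_rare` and the thresholds `n_most_common` and `min_count` are hypothetical and would need tuning on real data.

```python
from collections import Counter

def remove_frequent_and_rare(texts, n_most_common=1, min_count=2):
    """Drop the `n_most_common` words and every word seen fewer than `min_count` times."""
    counts = Counter(word for text in texts for word in text.split())
    frequent = {word for word, _ in counts.most_common(n_most_common)}
    rare = {word for word, count in counts.items() if count < min_count}
    to_drop = frequent | rare
    return [
        " ".join(word for word in text.split() if word not in to_drop)
        for text in texts
    ]

corpus = ["free entry free prize", "free call now", "call now prize entry", "ok lar"]
print(remove_frequent_and_rare(corpus))
```

Note that a message made only of rare words ends up empty, so such rows may need to be handled afterwards.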


In [38]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # Download the stopwords corpus if not already downloaded

stop_words = set(stopwords.words('english'))  # Set of English stopwords

def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

data["v4"] = data["v3"].apply(lambda text: remove_stopwords(text))
data.head()


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,v1,v2,v3,v4
0,ham,"Go until jurong point, crazy.. Available only ...",go until jurong point crazy available only in ...,go jurong point crazy available bugis n great ...
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor u c already then say,u dun say early hor u c already say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah i dont think he goes to usf he lives aroun...,nah dont think goes usf lives around though


## Stemming

Stemming is the process of reducing words to their root or base form, called the stem.

For example, words like `programmer`, `programming`, and `program` are ideally reduced to the common stem `program` (the exact output depends on the stemming algorithm).

In another example, the words `console` and `consoling` are both stemmed to `consol`, which is not a proper English word.

This is the main disadvantage of stemming: the resulting stem may lose the meaning of the original word, or may not be a proper English word at all.
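To see why stems are not always real words, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm: the suffix list `SUFFIXES` and the function `toy_stem` are assumptions made for illustration, and real stemmers apply many more rules and conditions.

```python
# Suffixes are tried in this order; the first match is stripped
SUFFIXES = ("ing", "ed", "er", "es", "s", "e")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, provided the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for word in ["console", "consoling", "programming", "program"]:
    print(word, "->", toy_stem(word))
```

Both `console` and `consoling` collapse to the same stem, which is useful for matching, but that stem is not itself an English word.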

A widely used stemming algorithm is the Porter stemmer, which is available in NLTK.

 

In [None]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text.split()])

data["v4"] = data["v3"].apply(lambda text: stem_words(text))
data.head()

Unnamed: 0,v1,v2,v3,v4
0,ham,"Go until jurong point, crazy.. Available only ...",go jurong point crazy available bugis n great ...,go jurong point crazi avail bugi n great world...
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni,ok lar joke wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry 2 wkly comp win fa cup final tkts 2...,free entri 2 wkli comp win fa cup final tkt 21...
3,ham,U dun say so early hor... U c already then say...,u dun say early hor u c already say,u dun say earli hor u c alreadi say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah dont think goes usf lives around though,nah dont think goe usf live around though


## Lemmatization

Lemmatization is similar to stemming, but it makes sure that the root word (also called the lemma) belongs to the language.

As a result, lemmatization is generally slower than stemming. Depending on the speed requirement, we can choose either stemming or lemmatization.
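The idea that a lemma is always a valid word, and that it can depend on the part of speech, can be sketched with a tiny lookup table. The table `LEMMA_TABLE` and the function `toy_lemmatize` are assumptions made for illustration; real lemmatizers rely on a full lexicon such as WordNet rather than a handful of entries.

```python
# Tiny hand-written table: (surface form, part of speech) -> lemma
LEMMA_TABLE = {
    ("goes", "v"): "go",
    ("lives", "v"): "live",
    ("lives", "n"): "life",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos) if known, otherwise the word unchanged."""
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatize("lives", pos="n"))  # treated as a noun
print(toy_lemmatize("lives", pos="v"))  # treated as a verb
print(toy_lemmatize("callertune"))      # unknown words are left as-is
```

NLTK's `WordNetLemmatizer.lemmatize` also takes an optional `pos` argument and treats words as nouns by default, which is consistent with `lives` becoming `life` in the notebook output.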

We can use the `WordNetLemmatizer` in NLTK to lemmatize our sentences.

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

data["v5"] = data["v3"].apply(lambda text: lemmatize_words(text))
data.head()

Unnamed: 0,v1,v2,v3,v4,v5
0,ham,"Go until jurong point, crazy.. Available only ...",go jurong point crazy available bugis n great ...,go jurong point crazi avail bugi n great world...,go jurong point crazy available bugis n great ...
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni,ok lar joke wif u oni,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry 2 wkly comp win fa cup final tkts 2...,free entri 2 wkli comp win fa cup final tkt 21...,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,U dun say so early hor... U c already then say...,u dun say early hor u c already say,u dun say earli hor u c alreadi say,u dun say early hor u c already say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah dont think goes usf lives around though,nah dont think goe usf live around though,nah dont think go usf life around though


In [None]:
list(data["v5"])[:10]

['go jurong point crazy available bugis n great world la e buffet cine got amore wat',
 'ok lar joking wif u oni',
 'free entry 2 wkly comp win fa cup final tkts 21st may 2005 text fa 87121 receive entry questionstd txt ratetcs apply 08452810075over18s',
 'u dun say early hor u c already say',
 'nah dont think go usf life around though',
 'freemsg hey darling 3 week word back id like fun still tb ok xxx std chgs send å150 rcv',
 'even brother like speak treat like aid patent',
 'per request melle melle oru minnaminunginte nurungu vettam set callertune caller press 9 copy friend callertune',
 'winner valued network customer selected receivea å900 prize reward claim call 09061701461 claim code kl341 valid 12 hour',
 'mobile 11 month u r entitled update latest colour mobile camera free call mobile update co free 08002986030']

In [None]:
list(data["v3"])[:10]

['go jurong point crazy available bugis n great world la e buffet cine got amore wat',
 'ok lar joking wif u oni',
 'free entry 2 wkly comp win fa cup final tkts 21st may 2005 text fa 87121 receive entry questionstd txt ratetcs apply 08452810075over18s',
 'u dun say early hor u c already say',
 'nah dont think goes usf lives around though',
 'freemsg hey darling 3 weeks word back id like fun still tb ok xxx std chgs send å150 rcv',
 'even brother like speak treat like aids patent',
 'per request melle melle oru minnaminunginte nurungu vettam set callertune callers press 9 copy friends callertune',
 'winner valued network customer selected receivea å900 prize reward claim call 09061701461 claim code kl341 valid 12 hours',
 'mobile 11 months u r entitled update latest colour mobiles camera free call mobile update co free 08002986030']

![Stemming vs Lemmatizing](https://pluralsight2.imgix.net/guides/c71d705d-445d-4d4c-99d1-38ce48985cba_12.JPG)

image from [pluralsight](https://www.pluralsight.com/guides/importance-of-text-pre-processing)

In [None]:
data.drop(["v4", "v5"], axis=1, inplace=True)
data["v3"] = data["v3"].apply(lambda text: lemmatize_words(text))
data.head()

Unnamed: 0,v1,v2,v3
0,ham,"Go until jurong point, crazy.. Available only ...",go jurong point crazy available bugis n great ...
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,U dun say so early hor... U c already then say...,u dun say early hor u c already say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah dont think go usf life around though


## Tokenization

Tokenization is the process of splitting a text into a sequence of smaller units called `tokens`. Tokens can be words, characters, or subwords.

For example, let us consider `smarter`:

*   Word token: `smarter`
*   Character tokens: `s-m-a-r-t-e-r`
*   Subword tokens: `smart-er`
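The three granularities can be sketched in a few lines. The subword part uses a greedy longest-match split over a fixed inventory of pieces; the inventory `PIECES` and the function names are assumptions made for illustration, since a real subword tokenizer learns its pieces from data.

```python
def word_tokens(text):
    """Split on whitespace."""
    return text.split()

def char_tokens(word):
    """One token per character."""
    return list(word)

def subword_tokens(word, pieces):
    """Greedy longest-match split of `word` using a fixed set of known pieces."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known piece
        while end > start and word[start:end] not in pieces:
            end -= 1
        if end == start:
            # No known piece starts here: fall back to a single character
            end = start + 1
        tokens.append(word[start:end])
        start = end
    return tokens

PIECES = {"smart", "er", "est"}  # hypothetical subword inventory

print(word_tokens("a smarter model"))
print(char_tokens("smarter"))
print(subword_tokens("smarter", PIECES))
```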



### Vocabulary
Vocabulary refers to the set of unique tokens in the corpus. It can be built from every unique token in the corpus or from only the top K most frequent tokens.
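Both ways of building a vocabulary can be sketched with `collections.Counter`. The function name `build_vocab` and the `top_k` parameter are assumptions made for this example; the mapping from token to integer index is one common convention, not the only one.

```python
from collections import Counter

def build_vocab(tokenized_texts, top_k=None):
    """Map tokens to integer ids, keeping all tokens or only the `top_k` most frequent."""
    counts = Counter(token for tokens in tokenized_texts for token in tokens)
    # most_common(None) returns every token, ordered by decreasing frequency
    most_common = counts.most_common(top_k)
    return {token: idx for idx, (token, _) in enumerate(most_common)}

docs = [["free", "entry", "free"], ["ok", "free", "entry"]]
print(build_vocab(docs))           # every unique token
print(build_vocab(docs, top_k=2))  # only the two most frequent tokens
```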

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from nltk.tokenize import word_tokenize

def tokenize_text(text):
  return word_tokenize(text)

data["v4"] = data["v3"].apply(lambda text: tokenize_text(text))
data.head()

Unnamed: 0,v1,v2,v3,v4
0,ham,"Go until jurong point, crazy.. Available only ...",go jurong point crazy available bugis n great ...,"[go, jurong, point, crazy, available, bugis, n..."
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry 2 wkly comp win fa cup final tkts 2...,"[free, entry, 2, wkly, comp, win, fa, cup, fin..."
3,ham,U dun say so early hor... U c already then say...,u dun say early hor u c already say,"[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah dont think go usf life around though,"[nah, dont, think, go, usf, life, around, though]"


## Drawbacks of Word Tokenization

One of the major issues with word tokens is dealing with out-of-vocabulary (OOV) words: new words that are encountered at test time and do not exist in the vocabulary. A plain word-level vocabulary has no way to represent them.

A small trick can rescue word tokenizers from OOV words: build the vocabulary from the top K most frequent words and replace the rare words in the training data with an unknown token (UNK). This lets the model learn a representation for OOV words through the UNK token. At test time, any word that is not present in the vocabulary is mapped to the UNK token.
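The UNK trick can be sketched as follows. The token string `<unk>` and the helper names `build_vocab_with_unk`, `replace_rare`, and `encode` are assumptions made for this illustration.

```python
from collections import Counter

UNK = "<unk>"

def build_vocab_with_unk(train_tokens, top_k):
    """Reserve id 0 for UNK, then add the `top_k` most frequent training tokens."""
    counts = Counter(train_tokens)
    vocab = {UNK: 0}
    for token, _ in counts.most_common(top_k):
        vocab[token] = len(vocab)
    return vocab

def replace_rare(tokens, vocab):
    """Replace every token missing from the vocabulary with UNK (used on training data)."""
    return [token if token in vocab else UNK for token in tokens]

def encode(tokens, vocab):
    """Map tokens to ids; anything unseen falls back to the UNK id."""
    return [vocab.get(token, vocab[UNK]) for token in tokens]

train = "free entry free prize free entry ok".split()
vocab = build_vocab_with_unk(train, top_k=2)

print(vocab)
print(replace_rare(train, vocab))
print(encode("free holiday prize".split(), vocab))
```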

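Character-based tokenizers split the raw text into individual characters. The logic behind this tokenization is that a language has many different words but a fixed number of characters, which results in a very small vocabulary.

For example, English text can be written with a small, fixed set of characters (26 letters in upper and lower case, digits, punctuation, and a few special symbols), all of which fit within the 256 values of a single-byte encoding. This is far fewer than the number of distinct words in the language.

One of the major advantages of character-based tokenization is that there will be no or very few unknown or OOV words.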
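The OOV advantage can be checked on a toy example. The training and test strings below are made up for illustration; on such a tiny corpus the vocabulary sizes are not representative, so the sketch only compares which test items are unknown at each granularity.

```python
train_texts = ["free entry", "free prize"]
test_text = "entry fee"

# Vocabularies built from the training texts only
word_vocab = {word for text in train_texts for word in text.split()}
char_vocab = {char for text in train_texts for char in text}

# Items in the test text that were never seen during training
oov_words = [word for word in test_text.split() if word not in word_vocab]
oov_chars = [char for char in test_text if char not in char_vocab]

print("OOV words:", oov_words)
print("OOV characters:", oov_chars)
```

The word `fee` is unknown at the word level, yet every one of its characters was seen in training.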

## Drawbacks of Character Tokenization

Character tokens solve the OOV problem, but the input and output sequences become much longer because each sentence is represented as a sequence of characters. As a result, it becomes harder for the model to learn how characters combine into meaningful words.
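The growth in sequence length is easy to measure. The sentence below is an arbitrary example chosen for this sketch.

```python
text = "free entry in a weekly competition"

word_seq = text.split()
char_seq = list(text)  # spaces count as characters too

# Compare how many tokens the model has to process at each granularity
print(len(word_seq), "word tokens")
print(len(char_seq), "character tokens")
```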

In [None]:
text = "split this text into characters"
# A string is iterable, so list() yields one item per character (spaces included)
characters = list(text)
characters

['s',
 'p',
 'l',
 'i',
 't',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 't',
 'e',
 'x',
 't',
 ' ',
 'i',
 'n',
 't',
 'o',
 ' ',
 'c',
 'h',
 'a',
 'r',
 'a',
 'c',
 't',
 'e',
 'r',
 's']

## Subword Tokenization
Subword tokenization splits a piece of text into subwords (or character n-grams). It addresses the issues of word and character tokenizers: frequent words can stay whole, while rare words are broken into smaller known pieces.

Byte Pair Encoding (BPE) is a widely used subword tokenization method. 
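The core of BPE training is a loop that repeatedly merges the most frequent adjacent pair of symbols. The sketch below is a simplified, hand-rolled illustration and not the code of any particular library: the function names, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions made here.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        key = tuple(new_symbols)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus_words, num_merges):
    """Learn up to `num_merges` merges from a list of words."""
    # Start from characters, with an end-of-word marker on each word
    word_freqs = dict(Counter(tuple(word) + ("</w>",) for word in corpus_words))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself to stay deterministic
        best = max(pairs, key=lambda p: (pairs[p], p))
        word_freqs = merge_pair(best, word_freqs)
        merges.append(best)
    return merges, word_freqs

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, segmented = learn_bpe(words, num_merges=3)
print(merges)
print(segmented)
```

On this toy corpus the frequent ending `est` ends up as a single symbol, while rarer words remain split into smaller pieces.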

https://github.com/rsennrich/subword-nmt

https://huggingface.co/docs/transformers/tokenizer_summary

https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt



In [None]:
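sent = "tokenization model"

# Illustrative segmentations only; real splits depend on the merges learned from the training corpus.
# subword-nmt (BPE) output: every non-final piece of a word ends with "@@"
bpe_tokens = ["to@@", "ken@@", "ization", "model"]

# WordPiece-style output (e.g. BERT tokenizers): every non-initial piece starts with "##"
wordpiece_tokens = ["to", "##ken", "##ization", "model"]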
