<a href="https://colab.research.google.com/github/Shahad-Mohammed/NLP_Course_Spam_Filtering/blob/main/NLP_Course_Spam_Filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The SMS Spam Collection is a public set of labeled SMS messages collected for mobile phone spam research.

The Dataset:

[SMSSpamCollection](https://archive.ics.uci.edu/dataset/228/sms+spam+collection)



**NLTK** is a popular library in Python for working with human language data and is often used in natural language processing (NLP) tasks.

In [9]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string
import re

In [10]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Load the data


Read the tab-separated file into a DataFrame called sms_SH_df.

The parameters:
* sep='\t': It specifies that the values in the file are separated by tabs ('\t').

* header=None: It indicates that the file doesn't contain a header row, and Pandas should use default column indices.
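As a minimal sketch of what these two parameters do, the snippet below reads a tiny in-memory sample instead of the real file. The two rows are made up for illustration; only the layout (label, tab, message, no header line) mirrors the dataset.

```python
import io
import pandas as pd

# Two made-up rows in the dataset's layout: label<TAB>message, with no header line.
raw = "ham\tOk, see you at 5\nspam\tWIN a prize, call now\n"

# sep="\t" splits on tabs only, so the commas inside the messages are left alone.
# header=None tells pandas the first line is data, not column names.
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None)
print(list(df.columns))  # default integer column labels

# Assign readable names, as the notebook does for the real data.
df.columns = ["label", "body_text"]
print(df)
```

Without header=None, pandas would treat the first message as the header row and the DataFrame would silently lose one record.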

In [11]:
sms_SH_df = pd.read_csv("/content/SMSSpamCollection", sep='\t', header=None)
sms_SH_df.columns = ['label','body_text']
sms_SH_df.head(20)

Unnamed: 0,label,body_text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


# Explore Data

The **tail()** method is used to inspect the end of a DataFrame. Here, tail(20) returns the last 20 rows, which is useful for checking the data towards the end of the dataset.

By default, **tail()** returns the last 5 rows of the DataFrame. Only the result of the last expression in a cell is displayed, so the output below shows those 5 rows.

In [12]:
sms_SH_df.tail(20)
sms_SH_df.tail()

Unnamed: 0,label,body_text
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...
5571,ham,Rofl. Its true to its name


Display the shape of the DataFrame (number of rows and columns)

In [13]:
sms_SH_df.shape

(5572, 2)

Provide information about the DataFrame, including data types and non-null values


In [14]:
sms_SH_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   label      5572 non-null   object
 1   body_text  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


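Generate descriptive statistics of the DataFrame. Both columns hold text (object dtype), so describe() reports count, unique, top, and freq rather than mean, std, min, and the quartiles.


In [15]:
sms_SH_df.describe()  # unique=2 for label: ham and spam; top/freq: the most common value and how often it occurs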

Unnamed: 0,label,body_text
count,5572,5572
unique,2,5169
top,ham,"Sorry, I'll call later"
freq,4825,30


Count the occurrences of each unique value in the 'label' column

In [16]:
sms_SH_df.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

**normalize=True:** This parameter, when set to True, returns the relative frequencies or proportions instead of the raw counts. Each unique label's count is divided by the total number of entries, giving the proportion of each label in the dataset.

In [17]:
sms_SH_df.label.value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: label, dtype: float64
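The proportions above can be reproduced by hand. The sketch below uses only the counts reported by value_counts() and divides each by the total; it uses the standard library rather than pandas, purely to make the arithmetic visible.

```python
from collections import Counter

# Label counts as reported by value_counts() above.
counts = Counter({"ham": 4825, "spam": 747})
total = sum(counts.values())  # 5572, the number of rows in the DataFrame

# Proportion = count / total, which is what normalize=True computes.
proportions = {label: n / total for label, n in counts.items()}
for label, p in proportions.items():
    print(f"{label}: {p:.6f}")
```

The classes are imbalanced: a classifier that always predicts ham would already be right about 86.6% of the time, so accuracy alone is a weak measure for this dataset.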

In [18]:
# Optional pie chart of the class proportions (requires: import matplotlib.pyplot as plt)
#sms_SH_df.label.value_counts(normalize=True).plot.pie()
#plt.show()

In [19]:
# Check for missing values in the DataFrame, count the occurrences for each column
sms_SH_df.isnull().value_counts()

label  body_text
False  False        5572
dtype: int64

In [20]:
#df= sms_SH_df.copy()

# Punctuation

The remove_punct function builds a new string, text_nopunct, by joining only the characters of the input text that are not in string.punctuation and are not the pound sign ('£'). The pound sign is checked separately because string.punctuation contains ASCII punctuation only.

In [21]:
def remove_punct(text):
    text_nopunct = "".join([char for char in text if char not in string.punctuation and char != '£' ])
    return text_nopunct
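A short usage sketch of the same idea on a made-up message, together with an alternative based on str.translate that builds the set of characters to delete once. The sample text and the name remove_punct_translate are illustrative and not part of the notebook.

```python
import string

def remove_punct(text):
    """Drop ASCII punctuation and the pound sign, character by character."""
    return "".join(ch for ch in text if ch not in string.punctuation and ch != "£")

# Alternative: build a deletion table once and let str.translate apply it.
DELETE_TABLE = str.maketrans("", "", string.punctuation + "£")

def remove_punct_translate(text):
    """Same result as remove_punct, using a precomputed translation table."""
    return text.translate(DELETE_TABLE)

# Made-up sample message.
sample = "WINNER!! Claim your £900 prize: call 0906-170-1461."
print(remove_punct(sample))
print(remove_punct_translate(sample))
```

Because the characters are deleted rather than replaced with spaces, words separated only by punctuation are glued together, which is why "question(std" in the third message becomes "questionstd".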

The provided code adds a new column named 'body' to the DataFrame sms_SH_df by applying the remove_punct function to the 'body_text' column.

In [22]:
sms_SH_df['body'] = sms_SH_df['body_text'].apply(remove_punct)

In [23]:
sms_SH_df['body_lower'] = sms_SH_df['body_text'].apply(lambda x: remove_punct(x.lower()))

**pd.set_option('display.max_colwidth', 800):** sets the maximum column width used when a DataFrame is displayed. With a value of 800, up to 800 characters are shown in a single cell before the text is truncated.

In [24]:
pd.set_option('display.max_colwidth',800)

In [25]:
sms_SH_df.head()

Unnamed: 0,label,body_text,body,body_lower
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat,go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s,free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,u dun say so early hor u c already then say
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,nah i dont think he goes to usf he lives around here though


In [26]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


# Tokenization

**Tokenization** is the process of breaking down a text into individual units, such as words or phrases. The word_tokenize function specifically tokenizes a text into words.

In [27]:
from nltk.tokenize import word_tokenize

In [28]:
sms_SH_df['body_token'] = sms_SH_df['body_lower'].apply(word_tokenize)

In [29]:
sms_SH_df.head()

Unnamed: 0,label,body_text,body,body_lower,body_token
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat,go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]"
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s,free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]"
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,nah i dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]"


In [30]:
text = 'I love NLP, will use python in our code.'
tokens = re.split(r'\W+', text)  # raw string, so \W reaches the regex engine unchanged

In [31]:
tokens

['I', 'love', 'NLP', 'will', 'use', 'python', 'in', 'our', 'code', '']
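The trailing empty string in the output above appears because the text ends with a separator (the final period), so re.split reports an empty piece after it. A minimal sketch of one way to avoid this is to collect the word runs directly with re.findall instead of splitting on the separators.

```python
import re

text = "I love NLP, will use python in our code."

# Splitting on runs of non-word characters leaves '' after the final '.'.
split_tokens = re.split(r"\W+", text)

# Matching runs of word characters returns only the tokens themselves.
found_tokens = re.findall(r"\w+", text)

print(split_tokens)
print(found_tokens)
```

In the DataFrame the effect is smaller because punctuation has already been removed from body_lower, but a message with leading or trailing whitespace can still produce an empty token with re.split.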

In [32]:
sms_SH_df['body_token_re'] = sms_SH_df['body_lower'].apply(lambda x: re.split(r'\W+', x))  # regex alternative to word_tokenize: split on runs of non-word characters

In [33]:
sms_SH_df.head()

Unnamed: 0,label,body_text,body,body_lower,body_token,body_token_re
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat,go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]","[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]"
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s,free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]"
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, so, early, hor, u, c, already, then, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,nah i dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]"


# Stopwords


In [34]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [35]:
stopwords_en = nltk.corpus.stopwords.words('english')

In [36]:
stopwords_en

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [37]:
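def remove_stopwords(tokens):
    # Keep only the tokens that are not English stop words.
    # Named remove_stopwords so it does not shadow the imported nltk.corpus.stopwords.
    return [token for token in tokens if token not in stopwords_en]

In [38]:
sms_SH_df['body_stopword'] = sms_SH_df['body_token_re'].apply(remove_stopwords)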

In [39]:
sms_SH_df.head()

Unnamed: 0,label,body_text,body,body_lower,body_token,body_token_re,body_stopword
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat,go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]","[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]"
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s,free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]"
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,nah i dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]"
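The filtering step can be sketched on a single token list without NLTK. The stop-word set below is a small hand-picked subset chosen for illustration, not NLTK's full list, and drop_stopwords is a hypothetical helper name. A set is used so that each membership test is a constant-time lookup.

```python
# Hand-picked subset of English stop words, for illustration only.
STOPWORDS_SUBSET = frozenset({"i", "he", "to", "here", "so", "then", "in", "until", "there"})

def drop_stopwords(tokens, stop_set=STOPWORDS_SUBSET):
    """Return the tokens that are not in the stop-word set, keeping their order."""
    return [tok for tok in tokens if tok not in stop_set]

# Tokens of the fifth message (index 4) from the table above.
tokens = ["nah", "i", "dont", "think", "he", "goes", "to", "usf",
          "he", "lives", "around", "here", "though"]
print(drop_stopwords(tokens))
```

Note that "dont" survives the filter. Apostrophes were stripped during punctuation removal, so contractions no longer match list entries that contain an apostrophe, such as "you're".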


# Stemming

In [40]:
ps = nltk.PorterStemmer()

In [41]:
ps2= ps.stem('played')
ps2

'play'

In [42]:
def stem_text(tokens):
    # Stem each token and join the results back into one string.
    stems = [ps.stem(token) for token in tokens]
    return ' '.join(stems)

sms_SH_df['body_stem'] = sms_SH_df['body_stopword'].apply(stem_text)

In [43]:
sms_SH_df.head(10)

Unnamed: 0,label,body_text,body,body_lower,body_token,body_token_re,body_stopword,body_stem
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat,go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]","[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]",go jurong point crazi avail bugi n great world la e buffet cine got amor wat
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]",ok lar joke wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s,free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]",free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri questionstd txt ratetc appli 08452810075over18
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, early, hor, u, c, already, say]",u dun say earli hor u c alreadi say
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,nah i dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]",nah dont think goe usf live around though
5,spam,"FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv",FreeMsg Hey there darling its been 3 weeks now and no word back Id like some fun you up for it still Tb ok XxX std chgs to send 150 to rcv,freemsg hey there darling its been 3 weeks now and no word back id like some fun you up for it still tb ok xxx std chgs to send 150 to rcv,"[freemsg, hey, there, darling, its, been, 3, weeks, now, and, no, word, back, id, like, some, fun, you, up, for, it, still, tb, ok, xxx, std, chgs, to, send, 150, to, rcv]","[freemsg, hey, there, darling, its, been, 3, weeks, now, and, no, word, back, id, like, some, fun, you, up, for, it, still, tb, ok, xxx, std, chgs, to, send, 150, to, rcv]","[freemsg, hey, darling, 3, weeks, word, back, id, like, fun, still, tb, ok, xxx, std, chgs, send, 150, rcv]",freemsg hey darl 3 week word back id like fun still tb ok xxx std chg send 150 rcv
6,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,even my brother is not like to speak with me they treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]",even brother like speak treat like aid patent
7,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune,As per your request Melle Melle Oru Minnaminunginte Nurungu Vettam has been set as your callertune for all Callers Press 9 to copy your friends Callertune,as per your request melle melle oru minnaminunginte nurungu vettam has been set as your callertune for all callers press 9 to copy your friends callertune,"[as, per, your, request, melle, melle, oru, minnaminunginte, nurungu, vettam, has, been, set, as, your, callertune, for, all, callers, press, 9, to, copy, your, friends, callertune]","[as, per, your, request, melle, melle, oru, minnaminunginte, nurungu, vettam, has, been, set, as, your, callertune, for, all, callers, press, 9, to, copy, your, friends, callertune]","[per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, press, 9, copy, friends, callertune]",per request mell mell oru minnaminungint nurungu vettam set callertun caller press 9 copi friend callertun
8,spam,WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.,WINNER As a valued network customer you have been selected to receivea 900 prize reward To claim call 09061701461 Claim code KL341 Valid 12 hours only,winner as a valued network customer you have been selected to receivea 900 prize reward to claim call 09061701461 claim code kl341 valid 12 hours only,"[winner, as, a, valued, network, customer, you, have, been, selected, to, receivea, 900, prize, reward, to, claim, call, 09061701461, claim, code, kl341, valid, 12, hours, only]","[winner, as, a, valued, network, customer, you, have, been, selected, to, receivea, 900, prize, reward, to, claim, call, 09061701461, claim, code, kl341, valid, 12, hours, only]","[winner, valued, network, customer, selected, receivea, 900, prize, reward, claim, call, 09061701461, claim, code, kl341, valid, 12, hours]",winner valu network custom select receivea 900 prize reward claim call 09061701461 claim code kl341 valid 12 hour
9,spam,Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030,Had your mobile 11 months or more U R entitled to Update to the latest colour mobiles with camera for Free Call The Mobile Update Co FREE on 08002986030,had your mobile 11 months or more u r entitled to update to the latest colour mobiles with camera for free call the mobile update co free on 08002986030,"[had, your, mobile, 11, months, or, more, u, r, entitled, to, update, to, the, latest, colour, mobiles, with, camera, for, free, call, the, mobile, update, co, free, on, 08002986030]","[had, your, mobile, 11, months, or, more, u, r, entitled, to, update, to, the, latest, colour, mobiles, with, camera, for, free, call, the, mobile, update, co, free, on, 08002986030]","[mobile, 11, months, u, r, entitled, update, latest, colour, mobiles, camera, free, call, mobile, update, co, free, 08002986030]",mobil 11 month u r entitl updat latest colour mobil camera free call mobil updat co free 08002986030


# Lemmatization


In [44]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [45]:
wn = nltk.WordNetLemmatizer()

In [46]:
dir(wn)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'lemmatize']

In [47]:
wn.lemmatize('universal')

'universal'

In [48]:
def lemma_text(tokens):
    # Lemmatize each token and join the results back into one string.
    lemmas = [wn.lemmatize(token) for token in tokens]
    return ' '.join(lemmas)

sms_SH_df['body_lemma'] = sms_SH_df['body_stopword'].apply(lemma_text)

In [49]:
sms_SH_df.head(10)

Unnamed: 0,label,body_text,body,body_lower,body_token,body_token_re,body_stopword,body_stem,body_lemma
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat,go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]","[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]",go jurong point crazi avail bugi n great world la e buffet cine got amor wat,go jurong point crazy available bugis n great world la e buffet cine got amore wat
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]",ok lar joke wif u oni,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s,free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]",free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri questionstd txt ratetc appli 08452810075over18,free entry 2 wkly comp win fa cup final tkts 21st may 2005 text fa 87121 receive entry questionstd txt ratetcs apply 08452810075over18s
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, early, hor, u, c, already, say]",u dun say earli hor u c alreadi say,u dun say early hor u c already say
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,nah i dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]",nah dont think goe usf live around though,nah dont think go usf life around though
5,spam,"FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv",FreeMsg Hey there darling its been 3 weeks now and no word back Id like some fun you up for it still Tb ok XxX std chgs to send 150 to rcv,freemsg hey there darling its been 3 weeks now and no word back id like some fun you up for it still tb ok xxx std chgs to send 150 to rcv,"[freemsg, hey, there, darling, its, been, 3, weeks, now, and, no, word, back, id, like, some, fun, you, up, for, it, still, tb, ok, xxx, std, chgs, to, send, 150, to, rcv]","[freemsg, hey, there, darling, its, been, 3, weeks, now, and, no, word, back, id, like, some, fun, you, up, for, it, still, tb, ok, xxx, std, chgs, to, send, 150, to, rcv]","[freemsg, hey, darling, 3, weeks, word, back, id, like, fun, still, tb, ok, xxx, std, chgs, send, 150, rcv]",freemsg hey darl 3 week word back id like fun still tb ok xxx std chg send 150 rcv,freemsg hey darling 3 week word back id like fun still tb ok xxx std chgs send 150 rcv
6,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,even my brother is not like to speak with me they treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]",even brother like speak treat like aid patent,even brother like speak treat like aid patent
7,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune,As per your request Melle Melle Oru Minnaminunginte Nurungu Vettam has been set as your callertune for all Callers Press 9 to copy your friends Callertune,as per your request melle melle oru minnaminunginte nurungu vettam has been set as your callertune for all callers press 9 to copy your friends callertune,"[as, per, your, request, melle, melle, oru, minnaminunginte, nurungu, vettam, has, been, set, as, your, callertune, for, all, callers, press, 9, to, copy, your, friends, callertune]","[as, per, your, request, melle, melle, oru, minnaminunginte, nurungu, vettam, has, been, set, as, your, callertune, for, all, callers, press, 9, to, copy, your, friends, callertune]","[per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, press, 9, copy, friends, callertune]",per request mell mell oru minnaminungint nurungu vettam set callertun caller press 9 copi friend callertun,per request melle melle oru minnaminunginte nurungu vettam set callertune caller press 9 copy friend callertune
8,spam,WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.,WINNER As a valued network customer you have been selected to receivea 900 prize reward To claim call 09061701461 Claim code KL341 Valid 12 hours only,winner as a valued network customer you have been selected to receivea 900 prize reward to claim call 09061701461 claim code kl341 valid 12 hours only,"[winner, as, a, valued, network, customer, you, have, been, selected, to, receivea, 900, prize, reward, to, claim, call, 09061701461, claim, code, kl341, valid, 12, hours, only]","[winner, as, a, valued, network, customer, you, have, been, selected, to, receivea, 900, prize, reward, to, claim, call, 09061701461, claim, code, kl341, valid, 12, hours, only]","[winner, valued, network, customer, selected, receivea, 900, prize, reward, claim, call, 09061701461, claim, code, kl341, valid, 12, hours]",winner valu network custom select receivea 900 prize reward claim call 09061701461 claim code kl341 valid 12 hour,winner valued network customer selected receivea 900 prize reward claim call 09061701461 claim code kl341 valid 12 hour
9,spam,Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030,Had your mobile 11 months or more U R entitled to Update to the latest colour mobiles with camera for Free Call The Mobile Update Co FREE on 08002986030,had your mobile 11 months or more u r entitled to update to the latest colour mobiles with camera for free call the mobile update co free on 08002986030,"[had, your, mobile, 11, months, or, more, u, r, entitled, to, update, to, the, latest, colour, mobiles, with, camera, for, free, call, the, mobile, update, co, free, on, 08002986030]","[had, your, mobile, 11, months, or, more, u, r, entitled, to, update, to, the, latest, colour, mobiles, with, camera, for, free, call, the, mobile, update, co, free, on, 08002986030]","[mobile, 11, months, u, r, entitled, update, latest, colour, mobiles, camera, free, call, mobile, update, co, free, 08002986030]",mobil 11 month u r entitl updat latest colour mobil camera free call mobil updat co free 08002986030,mobile 11 month u r entitled update latest colour mobile camera free call mobile update co free 08002986030


In [50]:
# Flag the rows where the stemmed text and the lemmatized text are identical
sms_SH_df['is_equal'] = sms_SH_df['body_stem'] == sms_SH_df['body_lemma']
sms_SH_df.head(10)

Unnamed: 0,label,body_text,body,body_lower,body_token,body_token_re,body_stopword,body_stem,body_lemma,is_equal
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat,go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]","[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]",go jurong point crazi avail bugi n great world la e buffet cine got amor wat,go jurong point crazy available bugis n great world la e buffet cine got amore wat,False
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]",ok lar joke wif u oni,ok lar joking wif u oni,False
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s,free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]",free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri questionstd txt ratetc appli 08452810075over18,free entry 2 wkly comp win fa cup final tkts 21st may 2005 text fa 87121 receive entry questionstd txt ratetcs apply 08452810075over18s,False
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, early, hor, u, c, already, say]",u dun say earli hor u c alreadi say,u dun say early hor u c already say,False
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,nah i dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]",nah dont think goe usf live around though,nah dont think go usf life around though,False
5,spam,"FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv",FreeMsg Hey there darling its been 3 weeks now and no word back Id like some fun you up for it still Tb ok XxX std chgs to send 150 to rcv,freemsg hey there darling its been 3 weeks now and no word back id like some fun you up for it still tb ok xxx std chgs to send 150 to rcv,"[freemsg, hey, there, darling, its, been, 3, weeks, now, and, no, word, back, id, like, some, fun, you, up, for, it, still, tb, ok, xxx, std, chgs, to, send, 150, to, rcv]","[freemsg, hey, there, darling, its, been, 3, weeks, now, and, no, word, back, id, like, some, fun, you, up, for, it, still, tb, ok, xxx, std, chgs, to, send, 150, to, rcv]","[freemsg, hey, darling, 3, weeks, word, back, id, like, fun, still, tb, ok, xxx, std, chgs, send, 150, rcv]",freemsg hey darl 3 week word back id like fun still tb ok xxx std chg send 150 rcv,freemsg hey darling 3 week word back id like fun still tb ok xxx std chgs send 150 rcv,False
6,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,even my brother is not like to speak with me they treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]",even brother like speak treat like aid patent,even brother like speak treat like aid patent,True
7,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune,As per your request Melle Melle Oru Minnaminunginte Nurungu Vettam has been set as your callertune for all Callers Press 9 to copy your friends Callertune,as per your request melle melle oru minnaminunginte nurungu vettam has been set as your callertune for all callers press 9 to copy your friends callertune,"[as, per, your, request, melle, melle, oru, minnaminunginte, nurungu, vettam, has, been, set, as, your, callertune, for, all, callers, press, 9, to, copy, your, friends, callertune]","[as, per, your, request, melle, melle, oru, minnaminunginte, nurungu, vettam, has, been, set, as, your, callertune, for, all, callers, press, 9, to, copy, your, friends, callertune]","[per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, press, 9, copy, friends, callertune]",per request mell mell oru minnaminungint nurungu vettam set callertun caller press 9 copi friend callertun,per request melle melle oru minnaminunginte nurungu vettam set callertune caller press 9 copy friend callertune,False
8,spam,WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.,WINNER As a valued network customer you have been selected to receivea 900 prize reward To claim call 09061701461 Claim code KL341 Valid 12 hours only,winner as a valued network customer you have been selected to receivea 900 prize reward to claim call 09061701461 claim code kl341 valid 12 hours only,"[winner, as, a, valued, network, customer, you, have, been, selected, to, receivea, 900, prize, reward, to, claim, call, 09061701461, claim, code, kl341, valid, 12, hours, only]","[winner, as, a, valued, network, customer, you, have, been, selected, to, receivea, 900, prize, reward, to, claim, call, 09061701461, claim, code, kl341, valid, 12, hours, only]","[winner, valued, network, customer, selected, receivea, 900, prize, reward, claim, call, 09061701461, claim, code, kl341, valid, 12, hours]",winner valu network custom select receivea 900 prize reward claim call 09061701461 claim code kl341 valid 12 hour,winner valued network customer selected receivea 900 prize reward claim call 09061701461 claim code kl341 valid 12 hour,False
9,spam,Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030,Had your mobile 11 months or more U R entitled to Update to the latest colour mobiles with camera for Free Call The Mobile Update Co FREE on 08002986030,had your mobile 11 months or more u r entitled to update to the latest colour mobiles with camera for free call the mobile update co free on 08002986030,"[had, your, mobile, 11, months, or, more, u, r, entitled, to, update, to, the, latest, colour, mobiles, with, camera, for, free, call, the, mobile, update, co, free, on, 08002986030]","[had, your, mobile, 11, months, or, more, u, r, entitled, to, update, to, the, latest, colour, mobiles, with, camera, for, free, call, the, mobile, update, co, free, on, 08002986030]","[mobile, 11, months, u, r, entitled, update, latest, colour, mobiles, camera, free, call, mobile, update, co, free, 08002986030]",mobil 11 month u r entitl updat latest colour mobil camera free call mobil updat co free 08002986030,mobile 11 month u r entitled update latest colour mobile camera free call mobile update co free 08002986030,False


In [51]:
sms_SH_df['is_equal'].value_counts()  # count how many rows are True and how many are False

False    4425
True     1147
Name: is_equal, dtype: int64
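
The two counts above are easier to read as a share of all messages. A minimal sketch in plain Python, with the counts copied from the output above (the variable names are illustrative and not part of the notebook):

```python
# Counts copied from the value_counts() output above
equal_counts = {False: 4425, True: 1147}

total = sum(equal_counts.values())        # total number of messages
share_equal = equal_counts[True] / total  # fraction where stem == lemma

print(total)
print(f"{share_equal:.1%}")
```

So stemming and lemmatization produce the same text for roughly one message in five. On the DataFrame itself, `value_counts(normalize=True)` should return these proportions directly.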

In [52]:
import nltk
from nltk.util import ngrams

In [53]:
def ngramss(tokens):
  # Despite the function's name, this builds bigrams (pairs of adjacent tokens)
  return list(nltk.bigrams(tokens))
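
The helper above always produces bigrams. As a sketch of how the same sliding-window idea extends to other sizes, the plain-Python function below returns n-grams of any length. The name `make_ngrams` and the parameter `n` are illustrative assumptions, not part of the notebook or of NLTK; the `ngrams` function imported from `nltk.util` above covers the same ground.

```python
def make_ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens as tuples (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip over n shifted views of the list; zip stops at the shortest one
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["ok", "lar", "joking", "wif", "u", "oni"]
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A list shorter than `n` simply yields an empty result, so very short messages need no special handling.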

In [54]:
sms_SH_df['ngrams'] = sms_SH_df['body_token'].apply(ngramss)  # bigrams of each token list

In [55]:
sms_SH_df.head(10)

Unnamed: 0,label,body_text,body,body_lower,body_token,body_token_re,body_stopword,body_stem,body_lemma,is_equal,ngrams
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat,go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]","[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]",go jurong point crazi avail bugi n great world la e buffet cine got amor wat,go jurong point crazy available bugis n great world la e buffet cine got amore wat,False,"[(go, until), (until, jurong), (jurong, point), (point, crazy), (crazy, available), (available, only), (only, in), (in, bugis), (bugis, n), (n, great), (great, world), (world, la), (la, e), (e, buffet), (buffet, cine), (cine, there), (there, got), (got, amore), (amore, wat)]"
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]",ok lar joke wif u oni,ok lar joking wif u oni,False,"[(ok, lar), (lar, joking), (joking, wif), (wif, u), (u, oni)]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s,free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]",free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri questionstd txt ratetc appli 08452810075over18,free entry 2 wkly comp win fa cup final tkts 21st may 2005 text fa 87121 receive entry questionstd txt ratetcs apply 08452810075over18s,False,"[(free, entry), (entry, in), (in, 2), (2, a), (a, wkly), (wkly, comp), (comp, to), (to, win), (win, fa), (fa, cup), (cup, final), (final, tkts), (tkts, 21st), (21st, may), (may, 2005), (2005, text), (text, fa), (fa, to), (to, 87121), (87121, to), (to, receive), (receive, entry), (entry, questionstd), (questionstd, txt), (txt, ratetcs), (ratetcs, apply), (apply, 08452810075over18s)]"
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, early, hor, u, c, already, say]",u dun say earli hor u c alreadi say,u dun say early hor u c already say,False,"[(u, dun), (dun, say), (say, so), (so, early), (early, hor), (hor, u), (u, c), (c, already), (already, then), (then, say)]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,nah i dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]",nah dont think goe usf live around though,nah dont think go usf life around though,False,"[(nah, i), (i, dont), (dont, think), (think, he), (he, goes), (goes, to), (to, usf), (usf, he), (he, lives), (lives, around), (around, here), (here, though)]"
5,spam,"FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv",FreeMsg Hey there darling its been 3 weeks now and no word back Id like some fun you up for it still Tb ok XxX std chgs to send 150 to rcv,freemsg hey there darling its been 3 weeks now and no word back id like some fun you up for it still tb ok xxx std chgs to send 150 to rcv,"[freemsg, hey, there, darling, its, been, 3, weeks, now, and, no, word, back, id, like, some, fun, you, up, for, it, still, tb, ok, xxx, std, chgs, to, send, 150, to, rcv]","[freemsg, hey, there, darling, its, been, 3, weeks, now, and, no, word, back, id, like, some, fun, you, up, for, it, still, tb, ok, xxx, std, chgs, to, send, 150, to, rcv]","[freemsg, hey, darling, 3, weeks, word, back, id, like, fun, still, tb, ok, xxx, std, chgs, send, 150, rcv]",freemsg hey darl 3 week word back id like fun still tb ok xxx std chg send 150 rcv,freemsg hey darling 3 week word back id like fun still tb ok xxx std chgs send 150 rcv,False,"[(freemsg, hey), (hey, there), (there, darling), (darling, its), (its, been), (been, 3), (3, weeks), (weeks, now), (now, and), (and, no), (no, word), (word, back), (back, id), (id, like), (like, some), (some, fun), (fun, you), (you, up), (up, for), (for, it), (it, still), (still, tb), (tb, ok), (ok, xxx), (xxx, std), (std, chgs), (chgs, to), (to, send), (send, 150), (150, to), (to, rcv)]"
6,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,even my brother is not like to speak with me they treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]",even brother like speak treat like aid patent,even brother like speak treat like aid patent,True,"[(even, my), (my, brother), (brother, is), (is, not), (not, like), (like, to), (to, speak), (speak, with), (with, me), (me, they), (they, treat), (treat, me), (me, like), (like, aids), (aids, patent)]"
7,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune,As per your request Melle Melle Oru Minnaminunginte Nurungu Vettam has been set as your callertune for all Callers Press 9 to copy your friends Callertune,as per your request melle melle oru minnaminunginte nurungu vettam has been set as your callertune for all callers press 9 to copy your friends callertune,"[as, per, your, request, melle, melle, oru, minnaminunginte, nurungu, vettam, has, been, set, as, your, callertune, for, all, callers, press, 9, to, copy, your, friends, callertune]","[as, per, your, request, melle, melle, oru, minnaminunginte, nurungu, vettam, has, been, set, as, your, callertune, for, all, callers, press, 9, to, copy, your, friends, callertune]","[per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, press, 9, copy, friends, callertune]",per request mell mell oru minnaminungint nurungu vettam set callertun caller press 9 copi friend callertun,per request melle melle oru minnaminunginte nurungu vettam set callertune caller press 9 copy friend callertune,False,"[(as, per), (per, your), (your, request), (request, melle), (melle, melle), (melle, oru), (oru, minnaminunginte), (minnaminunginte, nurungu), (nurungu, vettam), (vettam, has), (has, been), (been, set), (set, as), (as, your), (your, callertune), (callertune, for), (for, all), (all, callers), (callers, press), (press, 9), (9, to), (to, copy), (copy, your), (your, friends), (friends, callertune)]"
8,spam,WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.,WINNER As a valued network customer you have been selected to receivea 900 prize reward To claim call 09061701461 Claim code KL341 Valid 12 hours only,winner as a valued network customer you have been selected to receivea 900 prize reward to claim call 09061701461 claim code kl341 valid 12 hours only,"[winner, as, a, valued, network, customer, you, have, been, selected, to, receivea, 900, prize, reward, to, claim, call, 09061701461, claim, code, kl341, valid, 12, hours, only]","[winner, as, a, valued, network, customer, you, have, been, selected, to, receivea, 900, prize, reward, to, claim, call, 09061701461, claim, code, kl341, valid, 12, hours, only]","[winner, valued, network, customer, selected, receivea, 900, prize, reward, claim, call, 09061701461, claim, code, kl341, valid, 12, hours]",winner valu network custom select receivea 900 prize reward claim call 09061701461 claim code kl341 valid 12 hour,winner valued network customer selected receivea 900 prize reward claim call 09061701461 claim code kl341 valid 12 hour,False,"[(winner, as), (as, a), (a, valued), (valued, network), (network, customer), (customer, you), (you, have), (have, been), (been, selected), (selected, to), (to, receivea), (receivea, 900), (900, prize), (prize, reward), (reward, to), (to, claim), (claim, call), (call, 09061701461), (09061701461, claim), (claim, code), (code, kl341), (kl341, valid), (valid, 12), (12, hours), (hours, only)]"
9,spam,Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030,Had your mobile 11 months or more U R entitled to Update to the latest colour mobiles with camera for Free Call The Mobile Update Co FREE on 08002986030,had your mobile 11 months or more u r entitled to update to the latest colour mobiles with camera for free call the mobile update co free on 08002986030,"[had, your, mobile, 11, months, or, more, u, r, entitled, to, update, to, the, latest, colour, mobiles, with, camera, for, free, call, the, mobile, update, co, free, on, 08002986030]","[had, your, mobile, 11, months, or, more, u, r, entitled, to, update, to, the, latest, colour, mobiles, with, camera, for, free, call, the, mobile, update, co, free, on, 08002986030]","[mobile, 11, months, u, r, entitled, update, latest, colour, mobiles, camera, free, call, mobile, update, co, free, 08002986030]",mobil 11 month u r entitl updat latest colour mobil camera free call mobil updat co free 08002986030,mobile 11 month u r entitled update latest colour mobile camera free call mobile update co free 08002986030,False,"[(had, your), (your, mobile), (mobile, 11), (11, months), (months, or), (or, more), (more, u), (u, r), (r, entitled), (entitled, to), (to, update), (update, to), (to, the), (the, latest), (latest, colour), (colour, mobiles), (mobiles, with), (with, camera), (camera, for), (for, free), (free, call), (call, the), (the, mobile), (mobile, update), (update, co), (co, free), (free, on), (on, 08002986030)]"


In [56]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [57]:
from nltk import pos_tag

In [58]:
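from nltk.tokenize import word_tokenize  # needed for the tokenization below

text = "This is a foo bar sentence"
tokens = word_tokenize(text)  # split the sentence into word tokens
pos_tag(tokens)  # tag each token with its part of speech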

[('This', 'DT'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('foo', 'JJ'),
 ('bar', 'NN'),
 ('sentence', 'NN')]

In [59]:
from collections import Counter

In [60]:
Counter(tag for _, tag in pos_tag(word_tokenize(text)))  # frequency of each POS tag

Counter({'DT': 2, 'VBZ': 1, 'JJ': 1, 'NN': 2})

In [61]:
sms_SH_df['tagging'] = sms_SH_df['body_token'].apply(pos_tag)  # POS-tag each token list

In [62]:
sms_SH_df.head(10)

Unnamed: 0,label,body_text,body,body_lower,body_token,body_token_re,body_stopword,body_stem,body_lemma,is_equal,ngrams,tagging
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat,go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]","[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]",go jurong point crazi avail bugi n great world la e buffet cine got amor wat,go jurong point crazy available bugis n great world la e buffet cine got amore wat,False,"[(go, until), (until, jurong), (jurong, point), (point, crazy), (crazy, available), (available, only), (only, in), (in, bugis), (bugis, n), (n, great), (great, world), (world, la), (la, e), (e, buffet), (buffet, cine), (cine, there), (there, got), (got, amore), (amore, wat)]","[(go, VB), (until, IN), (jurong, JJ), (point, NN), (crazy, NN), (available, JJ), (only, RB), (in, IN), (bugis, NN), (n, RB), (great, JJ), (world, NN), (la, NN), (e, VBP), (buffet, JJ), (cine, NN), (there, EX), (got, VBD), (amore, RB), (wat, JJ)]"
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]",ok lar joke wif u oni,ok lar joking wif u oni,False,"[(ok, lar), (lar, joking), (joking, wif), (wif, u), (u, oni)]","[(ok, JJ), (lar, JJ), (joking, NN), (wif, NN), (u, JJ), (oni, NN)]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s,free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]",free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri questionstd txt ratetc appli 08452810075over18,free entry 2 wkly comp win fa cup final tkts 21st may 2005 text fa 87121 receive entry questionstd txt ratetcs apply 08452810075over18s,False,"[(free, entry), (entry, in), (in, 2), (2, a), (a, wkly), (wkly, comp), (comp, to), (to, win), (win, fa), (fa, cup), (cup, final), (final, tkts), (tkts, 21st), (21st, may), (may, 2005), (2005, text), (text, fa), (fa, to), (to, 87121), (87121, to), (to, receive), (receive, entry), (entry, questionstd), (questionstd, txt), (txt, ratetcs), (ratetcs, apply), (apply, 08452810075over18s)]","[(free, JJ), (entry, NN), (in, IN), (2, CD), (a, DT), (wkly, JJ), (comp, NN), (to, TO), (win, VB), (fa, JJ), (cup, JJ), (final, JJ), (tkts, NN), (21st, CD), (may, MD), (2005, CD), (text, NN), (fa, NN), (to, TO), (87121, CD), (to, TO), (receive, VB), (entry, NN), (questionstd, NN), (txt, NN), (ratetcs, NN), (apply, VBP), (08452810075over18s, CD)]"
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, early, hor, u, c, already, say]",u dun say earli hor u c alreadi say,u dun say early hor u c already say,False,"[(u, dun), (dun, say), (say, so), (so, early), (early, hor), (hor, u), (u, c), (c, already), (already, then), (then, say)]","[(u, JJ), (dun, NNS), (say, VBP), (so, RB), (early, JJ), (hor, NN), (u, JJ), (c, NNS), (already, RB), (then, RB), (say, VB)]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,nah i dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]",nah dont think goe usf live around though,nah dont think go usf life around though,False,"[(nah, i), (i, dont), (dont, think), (think, he), (he, goes), (goes, to), (to, usf), (usf, he), (he, lives), (lives, around), (around, here), (here, though)]","[(nah, NN), (i, NN), (dont, NN), (think, VBP), (he, PRP), (goes, VBZ), (to, TO), (usf, VB), (he, PRP), (lives, VBZ), (around, RB), (here, RB), (though, IN)]"
5,spam,"FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv",FreeMsg Hey there darling its been 3 weeks now and no word back Id like some fun you up for it still Tb ok XxX std chgs to send 150 to rcv,freemsg hey there darling its been 3 weeks now and no word back id like some fun you up for it still tb ok xxx std chgs to send 150 to rcv,"[freemsg, hey, there, darling, its, been, 3, weeks, now, and, no, word, back, id, like, some, fun, you, up, for, it, still, tb, ok, xxx, std, chgs, to, send, 150, to, rcv]","[freemsg, hey, there, darling, its, been, 3, weeks, now, and, no, word, back, id, like, some, fun, you, up, for, it, still, tb, ok, xxx, std, chgs, to, send, 150, to, rcv]","[freemsg, hey, darling, 3, weeks, word, back, id, like, fun, still, tb, ok, xxx, std, chgs, send, 150, rcv]",freemsg hey darl 3 week word back id like fun still tb ok xxx std chg send 150 rcv,freemsg hey darling 3 week word back id like fun still tb ok xxx std chgs send 150 rcv,False,"[(freemsg, hey), (hey, there), (there, darling), (darling, its), (its, been), (been, 3), (3, weeks), (weeks, now), (now, and), (and, no), (no, word), (word, back), (back, id), (id, like), (like, some), (some, fun), (fun, you), (you, up), (up, for), (for, it), (it, still), (still, tb), (tb, ok), (ok, xxx), (xxx, std), (std, chgs), (chgs, to), (to, send), (send, 150), (150, to), (to, rcv)]","[(freemsg, NN), (hey, NN), (there, RB), (darling, VBG), (its, PRP$), (been, VBN), (3, CD), (weeks, NNS), (now, RB), (and, CC), (no, DT), (word, NN), (back, RB), (id, NN), (like, IN), (some, DT), (fun, NN), (you, PRP), (up, IN), (for, IN), (it, PRP), (still, RB), (tb, VBZ), (ok, JJ), (xxx, NNP), (std, NN), (chgs, NN), (to, TO), (send, VB), (150, CD), (to, TO), (rcv, VB)]"
6,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,even my brother is not like to speak with me they treat me like aids patent,"[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, my, brother, is, not, like, to, speak, with, me, they, treat, me, like, aids, patent]","[even, brother, like, speak, treat, like, aids, patent]",even brother like speak treat like aid patent,even brother like speak treat like aid patent,True,"[(even, my), (my, brother), (brother, is), (is, not), (not, like), (like, to), (to, speak), (speak, with), (with, me), (me, they), (they, treat), (treat, me), (me, like), (like, aids), (aids, patent)]","[(even, RB), (my, PRP$), (brother, NN), (is, VBZ), (not, RB), (like, IN), (to, TO), (speak, VB), (with, IN), (me, PRP), (they, PRP), (treat, VBP), (me, PRP), (like, IN), (aids, NNS), (patent, NN)]"
7,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune,As per your request Melle Melle Oru Minnaminunginte Nurungu Vettam has been set as your callertune for all Callers Press 9 to copy your friends Callertune,as per your request melle melle oru minnaminunginte nurungu vettam has been set as your callertune for all callers press 9 to copy your friends callertune,"[as, per, your, request, melle, melle, oru, minnaminunginte, nurungu, vettam, has, been, set, as, your, callertune, for, all, callers, press, 9, to, copy, your, friends, callertune]","[as, per, your, request, melle, melle, oru, minnaminunginte, nurungu, vettam, has, been, set, as, your, callertune, for, all, callers, press, 9, to, copy, your, friends, callertune]","[per, request, melle, melle, oru, minnaminunginte, nurungu, vettam, set, callertune, callers, press, 9, copy, friends, callertune]",per request mell mell oru minnaminungint nurungu vettam set callertun caller press 9 copi friend callertun,per request melle melle oru minnaminunginte nurungu vettam set callertune caller press 9 copy friend callertune,False,"[(as, per), (per, your), (your, request), (request, melle), (melle, melle), (melle, oru), (oru, minnaminunginte), (minnaminunginte, nurungu), (nurungu, vettam), (vettam, has), (has, been), (been, set), (set, as), (as, your), (your, callertune), (callertune, for), (for, all), (all, callers), (callers, press), (press, 9), (9, to), (to, copy), (copy, your), (your, friends), (friends, callertune)]","[(as, IN), (per, IN), (your, PRP$), (request, NN), (melle, NN), (melle, NN), (oru, NN), (minnaminunginte, NN), (nurungu, NN), (vettam, NN), (has, VBZ), (been, VBN), (set, VBN), (as, IN), (your, PRP$), (callertune, NN), (for, IN), (all, DT), (callers, NNS), (press, VBP), (9, CD), (to, TO), (copy, VB), (your, PRP$), (friends, NNS), (callertune, VBP)]"
8,spam,WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.,WINNER As a valued network customer you have been selected to receivea 900 prize reward To claim call 09061701461 Claim code KL341 Valid 12 hours only,winner as a valued network customer you have been selected to receivea 900 prize reward to claim call 09061701461 claim code kl341 valid 12 hours only,"[winner, as, a, valued, network, customer, you, have, been, selected, to, receivea, 900, prize, reward, to, claim, call, 09061701461, claim, code, kl341, valid, 12, hours, only]","[winner, as, a, valued, network, customer, you, have, been, selected, to, receivea, 900, prize, reward, to, claim, call, 09061701461, claim, code, kl341, valid, 12, hours, only]","[winner, valued, network, customer, selected, receivea, 900, prize, reward, claim, call, 09061701461, claim, code, kl341, valid, 12, hours]",winner valu network custom select receivea 900 prize reward claim call 09061701461 claim code kl341 valid 12 hour,winner valued network customer selected receivea 900 prize reward claim call 09061701461 claim code kl341 valid 12 hour,False,"[(winner, as), (as, a), (a, valued), (valued, network), (network, customer), (customer, you), (you, have), (have, been), (been, selected), (selected, to), (to, receivea), (receivea, 900), (900, prize), (prize, reward), (reward, to), (to, claim), (claim, call), (call, 09061701461), (09061701461, claim), (claim, code), (code, kl341), (kl341, valid), (valid, 12), (12, hours), (hours, only)]","[(winner, NN), (as, IN), (a, DT), (valued, VBN), (network, NN), (customer, NN), (you, PRP), (have, VBP), (been, VBN), (selected, VBN), (to, TO), (receivea, VB), (900, CD), (prize, JJ), (reward, NN), (to, TO), (claim, VB), (call, JJ), (09061701461, CD), (claim, NN), (code, NN), (kl341, VBD), (valid, JJ), (12, CD), (hours, NNS), (only, RB)]"
9,spam,Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030,Had your mobile 11 months or more U R entitled to Update to the latest colour mobiles with camera for Free Call The Mobile Update Co FREE on 08002986030,had your mobile 11 months or more u r entitled to update to the latest colour mobiles with camera for free call the mobile update co free on 08002986030,"[had, your, mobile, 11, months, or, more, u, r, entitled, to, update, to, the, latest, colour, mobiles, with, camera, for, free, call, the, mobile, update, co, free, on, 08002986030]","[had, your, mobile, 11, months, or, more, u, r, entitled, to, update, to, the, latest, colour, mobiles, with, camera, for, free, call, the, mobile, update, co, free, on, 08002986030]","[mobile, 11, months, u, r, entitled, update, latest, colour, mobiles, camera, free, call, mobile, update, co, free, 08002986030]",mobil 11 month u r entitl updat latest colour mobil camera free call mobil updat co free 08002986030,mobile 11 month u r entitled update latest colour mobile camera free call mobile update co free 08002986030,False,"[(had, your), (your, mobile), (mobile, 11), (11, months), (months, or), (or, more), (more, u), (u, r), (r, entitled), (entitled, to), (to, update), (update, to), (to, the), (the, latest), (latest, colour), (colour, mobiles), (mobiles, with), (with, camera), (camera, for), (for, free), (free, call), (call, the), (the, mobile), (mobile, update), (update, co), (co, free), (free, on), (on, 08002986030)]","[(had, VBD), (your, PRP$), (mobile, JJ), (11, CD), (months, NNS), (or, CC), (more, JJR), (u, JJ), (r, NN), (entitled, VBD), (to, TO), (update, VB), (to, TO), (the, DT), (latest, JJS), (colour, NN), (mobiles, NNS), (with, IN), (camera, NN), (for, IN), (free, JJ), (call, NN), (the, DT), (mobile, JJ), (update, JJ), (co, NNS), (free, VBP), (on, IN), (08002986030, CD)]"


In [63]:
nltk.download('tagsets')

[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.


True

In [64]:
nltk.help.upenn_tagset('JJ')

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...


In [65]:
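# Count how often each POS tag occurs in every message
# (Counter and pos_tag are assumed to be imported in an earlier cell)
sms_SH_df['Counter_tagging'] = sms_SH_df['body_token'].apply(lambda tokens: Counter(tag for _, tag in pos_tag(tokens)))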
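
To see what the tag counting does for a single message, here is a minimal sketch that needs no tagger model. The (word, tag) pairs are copied from the `tagging` column shown in this notebook for the message "Ok lar... Joking wif u oni..."; the variable names are illustrative only.

```python
from collections import Counter

# Pre-tagged tokens copied from the notebook's 'tagging' column (illustrative input)
tagged = [("ok", "JJ"), ("lar", "JJ"), ("joking", "NN"),
          ("wif", "NN"), ("u", "JJ"), ("oni", "NN")]

# Keep only the tag from each (word, tag) pair and count the occurrences
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts)
```

The result has three `JJ` and three `NN` entries, which agrees with the `Counter_tagging` value for that row.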

In [66]:
from nltk.tag import DefaultTagger

In [67]:
text = "This is a foo bar sentence"
tokens = word_tokenize(text)  # split the sentence into word tokens
wordsTag = pos_tag(tokens)    # tag each token with its part of speech
print(wordsTag)

[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'JJ'), ('bar', 'NN'), ('sentence', 'NN')]


In [68]:
# Print every token tagged as a singular noun (NN)
for word, tag in wordsTag:
  if tag == 'NN':
    print(word)

bar
sentence


In [69]:
print([word for word, tag in wordsTag if tag == 'NN'])

['bar', 'sentence']
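
Filtering on `tag == 'NN'` keeps singular common nouns only. The sketch below contrasts that with a prefix match that also keeps plural and proper nouns (`NNS`, `NNP`, `NNPS`). The tagged tokens are copied from the `tagging` column shown in this notebook for the message "U dun say so early hor..."; the variable names are illustrative.

```python
# Pre-tagged tokens copied from the notebook's 'tagging' column (illustrative input)
tagged = [("u", "JJ"), ("dun", "NNS"), ("say", "VBP"), ("so", "RB"),
          ("early", "JJ"), ("hor", "NN"), ("u", "JJ"), ("c", "NNS"),
          ("already", "RB"), ("then", "RB"), ("say", "VB")]

# Exact match: singular common nouns only
singular_nouns = [word for word, tag in tagged if tag == "NN"]

# Prefix match: every Penn Treebank noun tag (NN, NNS, NNP, NNPS)
all_nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(singular_nouns)
print(all_nouns)
```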


In [70]:
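text = "The pizza was good but the pasta was bad"
tokens = nltk.word_tokenize(text)
tag = nltk.pos_tag(tokens)
# Chunk pattern: an optional determiner, any number of adjectives, then a noun.
# This describes a noun phrase; the label is kept as "VP" to match the output below.
grammar = "VP: {<DT>?<JJ>* <NN>}"
p = nltk.RegexpParser(grammar)
r = p.parse(tag)
print(r)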


(S
  (VP The/DT pizza/NN)
  was/VBD
  good/JJ
  but/CC
  (VP the/DT pasta/NN)
  was/VBD
  bad/JJ)
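
The parse tree is easier to use once the chunks are pulled out as plain strings. The following is a sketch under a few assumptions: the tokens are pre-tagged (copied from the tree printed above) so no tagger model is required, and the chunk label `NP` is an illustrative choice, since the pattern describes noun phrases.

```python
import nltk

# Pre-tagged tokens copied from the printed tree above (illustrative input)
tagged = [("The", "DT"), ("pizza", "NN"), ("was", "VBD"), ("good", "JJ"),
          ("but", "CC"), ("the", "DT"), ("pasta", "NN"), ("was", "VBD"),
          ("bad", "JJ")]

# Same pattern as above; only the label differs
parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = parser.parse(tagged)

# Collect the words of every subtree labelled NP
chunks = [" ".join(word for word, _ in subtree.leaves())
          for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]
print(chunks)
```

The adjectives "good" and "bad" are not chunked because neither is followed by a noun, which the pattern requires.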


In [71]:
sms_SH_df.head()

Unnamed: 0,label,body_text,body,body_lower,body_token,body_token_re,body_stopword,body_stem,body_lemma,is_equal,ngrams,tagging,Counter_tagging
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat,go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]","[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]",go jurong point crazi avail bugi n great world la e buffet cine got amor wat,go jurong point crazy available bugis n great world la e buffet cine got amore wat,False,"[(go, until), (until, jurong), (jurong, point), (point, crazy), (crazy, available), (available, only), (only, in), (in, bugis), (bugis, n), (n, great), (great, world), (world, la), (la, e), (e, buffet), (buffet, cine), (cine, there), (there, got), (got, amore), (amore, wat)]","[(go, VB), (until, IN), (jurong, JJ), (point, NN), (crazy, NN), (available, JJ), (only, RB), (in, IN), (bugis, NN), (n, RB), (great, JJ), (world, NN), (la, NN), (e, VBP), (buffet, JJ), (cine, NN), (there, EX), (got, VBD), (amore, RB), (wat, JJ)]","{'VB': 1, 'IN': 2, 'JJ': 5, 'NN': 6, 'RB': 3, 'VBP': 1, 'EX': 1, 'VBD': 1}"
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]",ok lar joke wif u oni,ok lar joking wif u oni,False,"[(ok, lar), (lar, joking), (joking, wif), (wif, u), (u, oni)]","[(ok, JJ), (lar, JJ), (joking, NN), (wif, NN), (u, JJ), (oni, NN)]","{'JJ': 3, 'NN': 3}"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry questionstd txt rateTCs apply 08452810075over18s,free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s,"[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entry, in, 2, a, wkly, comp, to, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, to, 87121, to, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]","[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receive, entry, questionstd, txt, ratetcs, apply, 08452810075over18s]",free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri questionstd txt ratetc appli 08452810075over18,free entry 2 wkly comp win fa cup final tkts 21st may 2005 text fa 87121 receive entry questionstd txt ratetcs apply 08452810075over18s,False,"[(free, entry), (entry, in), (in, 2), (2, a), (a, wkly), (wkly, comp), (comp, to), (to, win), (win, fa), (fa, cup), (cup, final), (final, tkts), (tkts, 21st), (21st, may), (may, 2005), (2005, text), (text, fa), (fa, to), (to, 87121), (87121, to), (to, receive), (receive, entry), (entry, questionstd), (questionstd, txt), (txt, ratetcs), (ratetcs, apply), (apply, 08452810075over18s)]","[(free, JJ), (entry, NN), (in, IN), (2, CD), (a, DT), (wkly, JJ), (comp, NN), (to, TO), (win, VB), (fa, JJ), (cup, JJ), (final, JJ), (tkts, NN), (21st, CD), (may, MD), (2005, CD), (text, NN), (fa, NN), (to, TO), (87121, CD), (to, TO), (receive, VB), (entry, NN), (questionstd, NN), (txt, NN), (ratetcs, NN), (apply, VBP), (08452810075over18s, CD)]","{'JJ': 5, 'NN': 9, 'IN': 1, 'CD': 5, 'DT': 1, 'TO': 3, 'VB': 2, 'MD': 1, 'VBP': 1}"
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, so, early, hor, u, c, already, then, say]","[u, dun, say, early, hor, u, c, already, say]",u dun say earli hor u c alreadi say,u dun say early hor u c already say,False,"[(u, dun), (dun, say), (say, so), (so, early), (early, hor), (hor, u), (u, c), (c, already), (already, then), (then, say)]","[(u, JJ), (dun, NNS), (say, VBP), (so, RB), (early, JJ), (hor, NN), (u, JJ), (c, NNS), (already, RB), (then, RB), (say, VB)]","{'JJ': 3, 'NNS': 2, 'VBP': 1, 'RB': 3, 'NN': 1, 'VB': 1}"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,nah i dont think he goes to usf he lives around here though,"[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, i, dont, think, he, goes, to, usf, he, lives, around, here, though]","[nah, dont, think, goes, usf, lives, around, though]",nah dont think goe usf live around though,nah dont think go usf life around though,False,"[(nah, i), (i, dont), (dont, think), (think, he), (he, goes), (goes, to), (to, usf), (usf, he), (he, lives), (lives, around), (around, here), (here, though)]","[(nah, NN), (i, NN), (dont, NN), (think, VBP), (he, PRP), (goes, VBZ), (to, TO), (usf, VB), (he, PRP), (lives, VBZ), (around, RB), (here, RB), (though, IN)]","{'NN': 3, 'VBP': 1, 'PRP': 2, 'VBZ': 2, 'TO': 1, 'VB': 1, 'RB': 2, 'IN': 1}"


# Vectorization


**CountVectorizer** and **TfidfVectorizer** are two commonly used text vectorization techniques in natural language processing (NLP) for converting textual data into numerical representations that machine learning models can understand.


**1- CountVectorizer** : a widely used technique that represents a document as a vector of word frequencies. It works by tokenizing the text and counting the occurrences of each word in the document.

**2- TfidfVectorizer** : TF-IDF stands for Term Frequency-Inverse Document Frequency. It weights each word by how often it appears in a document and down-weights words that appear in many documents of the corpus, so common words contribute less than distinctive ones.

### CountVectorizer

In [72]:
sentences = ["good movie","not a good movie","did not like","i like it"]

In [73]:
from sklearn.feature_extraction.text import CountVectorizer

The **fit_transform** method learns the vocabulary from the input sentences and transforms them into a document-term matrix of word counts.

In [74]:
vectorizer = CountVectorizer()
features_cv = vectorizer.fit_transform(sentences)
print(features_cv.shape)
print('Sparse matrix :\n', features_cv)
# Convert the sparse matrix to a dense DataFrame for display
features_cv = pd.DataFrame(features_cv.toarray())

features_cv.columns = vectorizer.get_feature_names_out()
features_cv

(4, 6)
Sparse matrix :
   (0, 1)	1
  (0, 4)	1
  (1, 1)	1
  (1, 4)	1
  (1, 5)	1
  (2, 5)	1
  (2, 0)	1
  (2, 3)	1
  (3, 3)	1
  (3, 2)	1


Unnamed: 0,did,good,it,like,movie,not
0,0,1,0,0,1,0
1,0,1,0,0,1,1
2,1,0,0,1,0,1
3,0,0,1,1,0,0
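
The count table above can be rebuilt by hand, which makes clear what the vectorizer does. This is a minimal sketch, not scikit-learn's implementation. It assumes the default settings: text is lowercased and only tokens of two or more word characters are kept, which is why "a" and "i" have no column. The helper name `tokenize` is hypothetical.

```python
import re
from collections import Counter

sentences = ["good movie", "not a good movie", "did not like", "i like it"]

def tokenize(text):
    # Lowercase, then keep tokens with at least two word characters
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(s) for s in sentences]

# The vocabulary is the sorted set of all tokens; each term becomes one column
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# One row per sentence, one count per vocabulary term
matrix = []
for toks in tokenized:
    counts = Counter(toks)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```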


#### Increase the n-gram range


Another useful setting is the ngram_range argument. The example above relied on the default, ngram_range=(1, 1), which returns unigrams (single words). Widening the range expands the vocabulary from single words to short phrases of the chosen lengths. For example, ngram_range=(2, 2) returns only bigrams (two-word phrases), while ngram_range=(1, 2) returns both unigrams and bigrams.

In [75]:
ngram_vect= CountVectorizer(ngram_range=(2,2))
x_counts = ngram_vect.fit_transform(sentences)
print(x_counts.shape)
x_counts_df= pd.DataFrame(x_counts.toarray())
x_counts_df.columns = ngram_vect.get_feature_names_out()
x_counts_df

(4, 5)


Unnamed: 0,did not,good movie,like it,not good,not like
0,0,1,0,0,0
1,0,1,0,1,0
2,1,0,0,0,1
3,0,0,1,0,0
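
The bigram columns above come from sliding a window of two tokens over each tokenized sentence. The sketch below reproduces that vocabulary in plain Python, assuming the same default tokenization (lowercasing, tokens of at least two word characters). The helpers `tokenize` and `ngrams` are hypothetical names, not scikit-learn functions.

```python
import re

sentences = ["good movie", "not a good movie", "did not like", "i like it"]

def tokenize(text):
    # Lowercase, then keep tokens with at least two word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def ngrams(tokens, n):
    # Slide a window of n tokens over the list and join each window with a space
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigram_vocab = sorted({g for s in sentences for g in ngrams(tokenize(s), 2)})
trigram_vocab = sorted({g for s in sentences for g in ngrams(tokenize(s), 3)})

print(bigram_vocab)
print(trigram_vocab)
```

Because "a" is dropped before the window is applied, "not a good movie" yields the bigram "not good" rather than "not a".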


In [76]:
ngram_vect= CountVectorizer(ngram_range=(1,2))
x_counts = ngram_vect.fit_transform(sentences)
print(x_counts.shape)
x_counts_df= pd.DataFrame(x_counts.toarray())
x_counts_df.columns = ngram_vect.get_feature_names_out()
x_counts_df

(4, 11)


Unnamed: 0,did,did not,good,good movie,it,like,like it,movie,not,not good,not like
0,0,0,1,1,0,0,0,1,0,0,0
1,0,0,1,1,0,0,0,1,1,1,0
2,1,1,0,0,0,1,0,0,1,0,1
3,0,0,0,0,1,1,1,0,0,0,0


In [77]:
ngram_vect= CountVectorizer(ngram_range=(1,1))
x_counts = ngram_vect.fit_transform(sentences)
print(x_counts.shape)
x_counts_df= pd.DataFrame(x_counts.toarray())
x_counts_df.columns = ngram_vect.get_feature_names_out()
x_counts_df

(4, 6)


Unnamed: 0,did,good,it,like,movie,not
0,0,1,0,0,1,0
1,0,1,0,0,1,1
2,1,0,0,1,0,1
3,0,0,1,1,0,0


In [78]:
ngram_vect= CountVectorizer(ngram_range=(3,3))
x_counts = ngram_vect.fit_transform(sentences)
print(x_counts.shape)
x_counts_df= pd.DataFrame(x_counts.toarray())
x_counts_df.columns = ngram_vect.get_feature_names_out()
x_counts_df

(4, 2)


Unnamed: 0,did not like,not good movie
0,0,0
1,0,1
2,1,0
3,0,0


In [79]:
def CountVectorizering(text):
  # Fit a CountVectorizer on the given texts and return the counts as a DataFrame
  vectorizer = CountVectorizer()
  features_cv = vectorizer.fit_transform(text)
  features_cv = pd.DataFrame(features_cv.toarray())
  features_cv.columns = vectorizer.get_feature_names_out()
  return features_cv


### TfidfVectorizer

In [80]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [81]:
tfidf = TfidfVectorizer()
vectors = tfidf.fit_transform(sentences)
# The DataFrame below is only used to display the result
pd.DataFrame(
    vectors.todense(),
    columns=tfidf.get_feature_names_out()
)  # toarray() and todense() give the same values here

Unnamed: 0,did,good,it,like,movie,not
0,0.0,0.707107,0.0,0.0,0.707107,0.0
1,0.0,0.57735,0.0,0.0,0.57735,0.57735
2,0.667679,0.0,0.0,0.526405,0.0,0.526405
3,0.0,0.0,0.785288,0.61913,0.0,0.0
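
The weights in the table above can be reproduced by hand. This sketch assumes scikit-learn's default TfidfVectorizer settings: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. It is an illustration of the formula, not the library's code, and the helper names are hypothetical.

```python
import math
import re
from collections import Counter

sentences = ["good movie", "not a good movie", "did not like", "i like it"]

def tokenize(text):
    # Lowercase, then keep tokens with at least two word characters
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(s) for s in sentences]
vocabulary = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term occurs
df = {term: sum(term in doc for doc in docs) for term in vocabulary}

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_row(tokens):
    # Term count times idf, then scale the row to unit Euclidean length
    counts = Counter(tokens)
    weights = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

tfidf_matrix = [tfidf_row(doc) for doc in docs]
for row in tfidf_matrix:
    print([round(w, 6) for w in row])
```

In "did not like", the word "did" occurs in only one sentence, so it receives a higher weight than "not" and "like", which each occur in two.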


In [82]:
#sms_SH_df = sms_SH_df.drop('len', axis=1)

In [83]:
tfidf = TfidfVectorizer()
vectors = tfidf.fit_transform(sms_SH_df['body_lemma'][:5])
# The DataFrame below is only used to display the result
pd.DataFrame(
    vectors.todense(),
    columns=tfidf.get_feature_names_out()
)  # toarray() and todense() give the same values here

Unnamed: 0,08452810075over18s,2005,21st,87121,already,amore,apply,around,available,buffet,...,think,though,tkts,txt,usf,wat,wif,win,wkly,world
0,0.0,0.0,0.0,0.0,0.0,0.270657,0.0,0.0,0.270657,0.270657,...,0.0,0.0,0.0,0.0,0.0,0.270657,0.0,0.0,0.0,0.270657
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.447214,0.0,0.0,0.0
2,0.196116,0.196116,0.196116,0.196116,0.0,0.0,0.196116,0.0,0.0,0.0,...,0.0,0.0,0.196116,0.196116,0.0,0.0,0.0,0.196116,0.196116,0.0
3,0.0,0.0,0.0,0.0,0.353553,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.361529,0.0,0.0,...,0.361529,0.361529,0.0,0.0,0.361529,0.0,0.0,0.0,0.0,0.0


In [84]:
vectorizer = CountVectorizer()
features_cv= vectorizer.fit_transform(sms_SH_df['body_lemma'][:5])
features_cv =pd.DataFrame(features_cv.toarray())

features_cv.columns = vectorizer.get_feature_names_out()
features_cv

Unnamed: 0,08452810075over18s,2005,21st,87121,already,amore,apply,around,available,buffet,...,think,though,tkts,txt,usf,wat,wif,win,wkly,world
0,0,0,0,0,0,1,0,0,1,1,...,0,0,0,0,0,1,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,1,1,1,1,0,0,1,0,0,0,...,0,0,1,1,0,0,0,1,1,0
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,1,0,0,...,1,1,0,0,1,0,0,0,0,0


In [85]:
ngram_vect= CountVectorizer(ngram_range=(2,2))
x_counts = ngram_vect.fit_transform(sms_SH_df['body_lemma'][:5])
print(x_counts.shape)
x_counts_df= pd.DataFrame(x_counts.toarray())
x_counts_df.columns = ngram_vect.get_feature_names_out()
x_counts_df

(5, 50)


Unnamed: 0,2005 text,21st may,87121 receive,already say,amore wat,apply 08452810075over18s,around though,available bugis,buffet cine,bugis great,...,say early,text fa,think go,tkts 21st,txt ratetcs,usf life,wif oni,win fa,wkly comp,world la
0,0,0,0,0,1,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,1,1,1,0,0,1,0,0,0,0,...,0,1,0,1,1,0,0,1,1,0
3,0,0,0,1,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,0,1,0,0,1,0,0,0,0


In [86]:
# Message length in characters, including spaces and punctuation
sms_SH_df['length'] = sms_SH_df['body_text'].apply(len)

In [87]:
def countPunctuation(x):
  # Percentage of characters in x that are punctuation marks, rounded to an integer
  count = sum(1 for ch in x if ch in string.punctuation)
  return round((count / len(x)) * 100)
sms_SH_df['punc%'] = sms_SH_df['body_text'].apply(countPunctuation)

In [88]:
def countCap(x):
  # Percentage of characters in x that are uppercase letters, rounded to an integer
  count = sum(1 for ch in x if ch.isupper())
  return round((count / len(x)) * 100)
sms_SH_df['Cap%'] = sms_SH_df['body_text'].apply(countCap)
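
Both percentage features divide by the message length, so an empty string would raise a ZeroDivisionError. The sketch below shows one way to guard against that and to check the two features on short test strings. The helper names (`percent_matching`, `punct_percent`, `cap_percent`) are hypothetical and not part of the notebook.

```python
import string

def percent_matching(text, predicate):
    # Rounded percentage of characters for which predicate is true;
    # returns 0 for an empty string instead of dividing by zero
    if not text:
        return 0
    return round(sum(1 for ch in text if predicate(ch)) / len(text) * 100)

def punct_percent(text):
    return percent_matching(text, lambda ch: ch in string.punctuation)

def cap_percent(text):
    return percent_matching(text, str.isupper)

sample = "WINNER!! Call now"
print(punct_percent(sample), cap_percent(sample))
print(punct_percent(""), cap_percent(""))
```

As in the cells above, spaces count toward the length, so long messages with few capitals or punctuation marks score low on both features.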

In [89]:
sms_SH_df.head(1)

Unnamed: 0,label,body_text,body,body_lower,body_token,body_token_re,body_stopword,body_stem,body_lemma,is_equal,ngrams,tagging,Counter_tagging,length,punc%,Cap%
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat,go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat,"[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]","[go, until, jurong, point, crazy, available, only, in, bugis, n, great, world, la, e, buffet, cine, there, got, amore, wat]","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]",go jurong point crazi avail bugi n great world la e buffet cine got amor wat,go jurong point crazy available bugis n great world la e buffet cine got amore wat,False,"[(go, until), (until, jurong), (jurong, point), (point, crazy), (crazy, available), (available, only), (only, in), (in, bugis), (bugis, n), (n, great), (great, world), (world, la), (la, e), (e, buffet), (buffet, cine), (cine, there), (there, got), (got, amore), (amore, wat)]","[(go, VB), (until, IN), (jurong, JJ), (point, NN), (crazy, NN), (available, JJ), (only, RB), (in, IN), (bugis, NN), (n, RB), (great, JJ), (world, NN), (la, NN), (e, VBP), (buffet, JJ), (cine, NN), (there, EX), (got, VBD), (amore, RB), (wat, JJ)]","{'VB': 1, 'IN': 2, 'JJ': 5, 'NN': 6, 'RB': 3, 'VBP': 1, 'EX': 1, 'VBD': 1}",111,8,3


# Machine Learning