# Assignment 07
### **Natural Language Processing: Bag of Words**

The spam.csv file contains SMS messages from different users, one message per line.
Each line has two columns: v1 contains the label (ham or spam) and v2 contains the raw text.
Use a bag-of-words model and the random forest algorithm to build a classifier that
distinguishes spam messages from ham messages.

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('/content/spam.csv',encoding='Windows-1252')

In [3]:
data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [4]:
pd.set_option('display.max_colwidth', None)
# Keep only the label and text columns and give them descriptive names
messages = data[['v1','v2']].rename(columns={'v1': 'label', 'v2': 'text'})
messages.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


In [21]:
messages.shape

(5572, 3)

# Pre-processing

In [5]:
# remove_punctuation
import string
def remove_punctuation(text):
    punctuationfree="".join([i for i in text if i not in string.punctuation])
    return punctuationfree

# tokenization
import nltk
nltk.download('punkt')
def tokenization(text):
    words = nltk.word_tokenize(text)
    return words

# remove_stopwords
nltk.download('stopwords')
# a set gives constant-time membership checks in remove_stopwords
stopwords = set(nltk.corpus.stopwords.words('english'))
def remove_stopwords(text):
    output= [i for i in text if i not in stopwords]
    return output

# lemmatizer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
wordnet_lemmatizer = WordNetLemmatizer()
def lemmatizer(text):
  lemm_text = [wordnet_lemmatizer.lemmatize(word) for word in text]
  return lemm_text



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [9]:
def preprocess(df_col):
  corpus = []
  for item in df_col:
    new_item = remove_punctuation(item)
    new_item = new_item.lower()
    new_item = tokenization(new_item)
    new_item = remove_stopwords(new_item)
    new_item = lemmatizer(new_item)
    corpus.append(' '.join(str(x) for x in new_item))
  return corpus
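
To make the effect of the individual steps easier to see, here is a minimal standard-library sketch of the same idea applied to a single message. It is an approximation, not the notebook's pipeline: `simple_preprocess` and `TOY_STOPWORDS` are hypothetical names, whitespace splitting stands in for `nltk.word_tokenize`, the stopword set is deliberately tiny, and lemmatization is left out.

```python
import string

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's
# full English stopword list instead.
TOY_STOPWORDS = {"i", "my", "then", "the", "to", "a", "in", "so"}

def simple_preprocess(text):
    # Strip all punctuation characters in one pass with a translation table.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # Lowercase, then split on whitespace (a stand-in for nltk.word_tokenize).
    tokens = no_punct.lower().split()
    # Drop stopwords; lemmatization is omitted in this sketch.
    kept = [tok for tok in tokens if tok not in TOY_STOPWORDS]
    return " ".join(kept)

print(simple_preprocess("Ok lar... Joking wif u oni..."))
# ok lar joking wif u oni
```

Because punctuation is removed before stopwords are filtered, a contraction such as "don't" becomes "dont" and no longer matches a stopword entry written with an apostrophe.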

In [11]:
corpus = preprocess(messages.text)

In [12]:
corpus[0:10]

['go jurong point crazy available bugis n great world la e buffet cine got amore wat',
 'ok lar joking wif u oni',
 'free entry 2 wkly comp win fa cup final tkts 21st may 2005 text fa 87121 receive entry questionstd txt ratetcs apply 08452810075over18s',
 'u dun say early hor u c already say',
 'nah dont think go usf life around though',
 'freemsg hey darling 3 week word back id like fun still tb ok xxx std chgs send å£150 rcv',
 'even brother like speak treat like aid patent',
 'per request melle melle oru minnaminunginte nurungu vettam set callertune caller press 9 copy friend callertune',
 'winner valued network customer selected receivea å£900 prize reward claim call 09061701461 claim code kl341 valid 12 hour',
 'mobile 11 month u r entitled update latest colour mobile camera free call mobile update co free 08002986030']

In [16]:
messages.label.head(10)

0     ham
1     ham
2    spam
3     ham
4     ham
5    spam
6     ham
7     ham
8    spam
9    spam
Name: label, dtype: object

# Bag-Of-Words

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1,2))
traindata = cv.fit_transform(corpus)
X = traindata
y = messages.label
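
As a rough illustration of what `ngram_range=(1,2)` means, the sketch below counts unigrams and bigrams for one preprocessed message using only the standard library. `ngram_counts` is a hypothetical helper, not part of scikit-learn, and it differs from `CountVectorizer` in one visible way: the vectorizer's default token pattern keeps only tokens of two or more alphanumeric characters, so single-letter tokens such as `u` would not appear in its vocabulary.

```python
from collections import Counter

def ngram_counts(text, n_min=1, n_max=2):
    # Split on whitespace; the text is assumed to be preprocessed already.
    tokens = text.split()
    counts = Counter()
    # Slide a window of each size n over the tokens and count every n-gram.
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts("u dun say early hor u c already say")
print(counts["u"], counts["say"], counts["u dun"], counts["already say"])
# 2 2 1 1
```

Each distinct n-gram corresponds to one feature column, and each message becomes a row of counts, which is why the resulting matrix is large and sparse.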

In [22]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators = 100)
clf.fit(X, y)

RandomForestClassifier()

In [23]:
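from sklearn import metrics
# These predictions are made on the same data the model was fitted on,
# so this is training accuracy, not an estimate for unseen messages.
y_pred = clf.predict(X)
metrics.accuracy_score(y, y_pred)  # argument order: y_true, y_pred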

1.0
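
An accuracy of 1.0 measured on the training data mostly shows that the forest can memorise what it was fitted on. One common way to get a more honest estimate is to hold out part of the data before fitting. The sketch below shows the pattern on a handful of made-up messages (they are not taken from spam.csv); the names `texts`, `labels`, `vectorizer` and `model` are assumptions for illustration, and the vectorizer is fitted on the training split only so that no vocabulary leaks in from the test split.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up toy messages, for illustration only (not from spam.csv).
texts = [
    "win cash prize now", "free entry win prize",
    "claim free cash reward", "urgent prize claim call now",
    "see you at lunch", "ok call me later",
    "are you coming home", "lunch at home later",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Hold out 25% of the messages, keeping the spam/ham ratio in both splits.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

# Learn the vocabulary from the training split only, then reuse it.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Accuracy on messages the model has never seen.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(test_accuracy)
```

With the real dataset the same pattern would apply to `corpus` and `messages.label`; since ham messages usually outnumber spam in SMS collections, per-class metrics such as precision and recall may be worth reporting alongside accuracy.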

# Testing

In [27]:
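def find_hamspam(texts):
  # texts: a list containing a single message string
  # (the parameter is no longer called `input`, which shadows the built-in)
  features = cv.transform(preprocess(texts))
  prediction = clf.predict(features)[0]  # label of the single message
  if prediction == 'spam':
    print('Input statement is spam ')
  elif prediction == 'ham':
    print('Input statement is ham')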

In [28]:
input = ["Aft i finish my lunch then i go str down lor. Ard 3 smth lor. U finish ur lunch already?"]
find_hamspam(input)

Input statement is ham


In [29]:
input = ["SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info"]
find_hamspam(input)

Input statement is spam 


In [30]:
input = ["SMS. ac Blind Date 4U!: Rodds1 is 21/m from Aberdeen, United Kingdom. Check Him out http://img. sms. ac/W/icmb3cktz8r7!-4 no Blind Dates send HIDE"]
find_hamspam(input)

Input statement is spam 


In [31]:
input = ["K.k:)advance happy pongal."]
find_hamspam(input)

Input statement is ham


Submitted By Ajuma Mohammed, KKEM_Aug22 Batch