# Assignment 07

# Natural Language Processing: Bag of Words

The spam.csv file contains SMS messages from different users, one message per line.

Each line has two columns: v1 holds the label (ham or spam) and v2 holds the raw text. Use a bag-of-words model and a random forest to build a classifier that distinguishes spam from ham messages.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('spam.csv')
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [3]:
# Keep only the label (v1) and message text (v2) columns and give them descriptive names.
df = df[['v1','v2']].copy()
df.columns = ["label", "text"]

In [21]:
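# Show the full message text instead of truncating it.
# This cell was run after the label mapping below (note the In [21] counter), so labels already appear as 1/0.
pd.set_option('display.max_colwidth',None)
df.head()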

Unnamed: 0,label,text
0,1,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
1,1,Ok lar... Joking wif u oni...
2,0,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3,1,U dun say so early hor... U c already then say...
4,1,"Nah I don't think he goes to usf, he lives around here though"


In [5]:
df.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64
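
The counts above show a heavily imbalanced dataset, which matters when reading any accuracy figure later on. As a quick sanity check, the sketch below computes the accuracy a trivial classifier would reach by always predicting the majority class. The counts are copied from the output above; the variable names are only illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = counts[majority_label] / total

print(f"{majority_label}: {baseline_accuracy:.3f}")
```

A model whose accuracy is close to this baseline has learned little beyond the class frequencies, so per-class metrics are worth checking as well.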

In [6]:
# Encode the labels as integers: ham -> 1, spam -> 0.
df['label'] = df['label'].map({'ham': 1, 'spam': 0})

## Data Preprocessing

In [7]:
# special character removal

import string
def remove_punctuation(text):
    punctuationfree="".join([i for i in text if i not in string.punctuation])
    return punctuationfree
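
One side effect of deleting punctuation outright is that words separated only by punctuation get fused together (this is where tokens such as `questionstd` in the processed corpus come from). A possible alternative, shown here only as a sketch, replaces each punctuation character with a space instead. `replace_punctuation` is a hypothetical helper and not part of the notebook.

```python
import string

def remove_punctuation(text):
    # Same behaviour as the notebook's function: drop punctuation characters.
    return "".join(ch for ch in text if ch not in string.punctuation)

# Hypothetical alternative: map every punctuation character to a space,
# then collapse the repeated whitespace this can create.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def replace_punctuation(text):
    return " ".join(text.translate(_PUNCT_TO_SPACE).split())

sample = "question(std txt rate)T&C's"
print(remove_punctuation(sample))   # questionstd txt rateTCs
print(replace_punctuation(sample))  # question std txt rate T C s
```

The trade-off is that contractions and abbreviations break into short fragments, so which variant works better for this dataset would need to be checked empirically.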

In [8]:
# word tokenize

import nltk
nltk.download('punkt')
def tokenization(text):
    words = nltk.word_tokenize(text)
    return words

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ab\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [9]:
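# Remove stopwords

nltk.download('stopwords')
# A set gives constant-time membership checks.
stopwords = set(nltk.corpus.stopwords.words('english'))
def remove_stopwords(tokens):
    # `tokens` is a list of words, not a raw string.
    return [token for token in tokens if token not in stopwords]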

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ab\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:
# Lemmatization

from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

wordnet_lemmatizer = WordNetLemmatizer()
def lemmatizer(tokens):
    # lemmatize() treats every word as a noun unless a part-of-speech tag is passed.
    return [wordnet_lemmatizer.lemmatize(word) for word in tokens]

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ab\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\ab\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [11]:
# Combine all preprocessing steps into a single function.
# Each message is joined back into one string and appended to the corpus list.

def preprocess(df_col):
  corpus = []
  for item in df_col:
    new_item = remove_punctuation(item)
    new_item = new_item.lower()
    new_item = tokenization(new_item)
    new_item = remove_stopwords(new_item)
    new_item = lemmatizer(new_item)
    corpus.append(' '.join(str(x) for x in new_item))
  return corpus

In [12]:
corpus = preprocess(df.text)

In [22]:
corpus[0:5]

['go jurong point crazy available bugis n great world la e buffet cine got amore wat',
 'ok lar joking wif u oni',
 'free entry 2 wkly comp win fa cup final tkts 21st may 2005 text fa 87121 receive entry questionstd txt ratetcs apply 08452810075over18s',
 'u dun say early hor u c already say',
 'nah dont think go usf life around though']

In [14]:
df.label.head(10)

0    1
1    1
2    0
3    1
4    1
5    0
6    1
7    1
8    0
9    0
Name: label, dtype: int64

## Bag-Of-Words

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1,2))
traindata = cv.fit_transform(corpus)
X = traindata
y = df.label
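
To make the feature matrix less opaque, here is a rough pure-Python approximation of what `CountVectorizer(ngram_range=(1,2))` counts for a single document: every unigram plus every pair of adjacent tokens. It assumes scikit-learn's default tokenization (lowercasing, tokens of two or more word characters), and `count_ngrams` is an illustrative name, not a library function.

```python
import re
from collections import Counter

def count_ngrams(text, ngram_range=(1, 2)):
    """Count word n-grams in one document (illustrative approximation)."""
    # Assumed to mirror CountVectorizer's defaults: lowercase the text and
    # keep only tokens made of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    min_n, max_n = ngram_range
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts

features = count_ngrams("ok lar joking wif u oni")
print(sorted(features))
```

Note that the single-character token `u` is dropped under this tokenization, so the bigram `wif oni` spans the gap it leaves. Adding bigrams also grows the vocabulary considerably compared with unigrams alone.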

In [16]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators = 100)
clf.fit(X, y)

In [17]:
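from sklearn import metrics
# Note: predictions are made on the same data the model was trained on,
# so this is training accuracy, not an estimate for unseen messages.
y_pred = clf.predict(X)
metrics.accuracy_score(y, y_pred)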

1.0
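
A score of 1.0 on the training data mostly shows that the forest has memorised the messages it was fitted on. A more informative check holds out part of the data and fits both the vectorizer and the classifier on the training portion only. The sketch below illustrates the pattern on a tiny made-up corpus (`toy_corpus` and `toy_labels` are stand-ins for `corpus` and `df.label`), so the numbers it prints say nothing about the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up stand-in data; labels follow the notebook's encoding (1 = ham, 0 = spam).
toy_corpus = [
    "free entry win cup final text now",
    "win cash prize claim now",
    "urgent free prize call now",
    "claim free ringtone text now",
    "winner cash award call today",
    "free mobile offer text stop",
    "ok see you later",
    "dont think he goes there",
    "say so early already",
    "meet me home tonight",
    "call me when you get home",
    "thanks see you tomorrow",
]
toy_labels = [0] * 6 + [1] * 6

# Hold out 25% of the messages, keeping the class ratio in both parts.
train_txt, test_txt, y_train, y_test = train_test_split(
    toy_corpus, toy_labels, test_size=0.25, stratify=toy_labels, random_state=0
)

# Fit the vocabulary on the training messages only, then reuse it for the test set.
vec = CountVectorizer(ngram_range=(1, 2))
X_train = vec.fit_transform(train_txt)
X_test = vec.transform(test_txt)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

With the real data, the same pattern would use `corpus` and `df.label` in place of the toy lists. Given the class imbalance, `metrics.classification_report` or a confusion matrix on the held-out part would say more than accuracy alone.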

In [18]:
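def find_sentiment(messages):
  # `messages` is a list of raw messages; only the first one is classified.
  # It is preprocessed and vectorized exactly like the training data.
  features = cv.transform(preprocess(messages))
  prediction = clf.predict(features)[0]
  # Labels were encoded as ham -> 1 and spam -> 0.
  if prediction == 0:
    print('Input statement has spam sentiment')
  else:
    print('Input statement has ham sentiment')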

In [19]:
# Use a name other than `input` to avoid shadowing the Python built-in.
message = ["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"]
find_sentiment(message)

Input statement has spam sentiment


In [20]:
message = ["U dun say so early hor... U c already then say..."]
find_sentiment(message)

Input statement has ham sentiment


##### Jibin K Joy, ML & AI August 2022 KKEM Batch