
In [1]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving spam1.csv to spam1 (1).csv
User uploaded file "spam1.csv" with length 503665 bytes


In [2]:
import pandas as pd
import numpy as np
import spacy
import nltk
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
df = pd.read_csv("spam1.csv", encoding = "latin-1")

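Check that the data was imported correctly, then drop the unneeded columns. The three `Unnamed` columns are empty in the rows shown, so we keep only `v1` (the label) and `v2` (the message text).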

In [4]:
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [5]:
df = df[['v1', 'v2']].copy()  # copy so later column assignments don't act on a view of the original frame

In [6]:
df.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   v1      5572 non-null   object
 1   v2      5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


Here, we first carry out the pre-processing manually, using functions from NLTK.

The first step is to convert all text to lower case, since we treat words that differ only in capitalization as the same token.

In [8]:
df['v2'] = df['v2'].apply(lambda text: " ".join(word.lower() for word in text.split()))
df.head()

Unnamed: 0,v1,v2
0,ham,"go until jurong point, crazy.. available only ..."
1,ham,ok lar... joking wif u oni...
2,spam,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor... u c already then say...
4,ham,"nah i don't think he goes to usf, he lives aro..."


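Next, we remove special characters from the text.

The pattern `[^\w\s]` matches every character that is neither a word character (letters, digits and the underscore) nor whitespace, so replacing each match with an empty string strips punctuation and other symbols.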


In [9]:
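# regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal string by default
df['v2'] = df['v2'].str.replace(r'[^\w\s]', '', regex=True)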
df.head()

Unnamed: 0,v1,v2
0,ham,go until jurong point crazy available only in ...
1,ham,ok lar joking wif u oni
2,spam,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor u c already then say
4,ham,nah i dont think he goes to usf he lives aroun...
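
To see what the pattern keeps and what it drops, here is a small standalone sketch that uses Python's built-in `re` module instead of pandas. The helper name `strip_punct` and the second sample string are made up for illustration. It shows that the underscore and accented letters count as word characters and therefore survive.

```python
import re

# Same character class as in the cell above: any character that is neither
# a word character (\w) nor whitespace (\s) is deleted.
PUNCT = re.compile(r"[^\w\s]")

def strip_punct(text):
    """Return text with every non-word, non-space character removed."""
    return PUNCT.sub("", text)

print(strip_punct("ok lar... joking wif u oni..."))  # ok lar joking wif u oni
print(strip_punct("don't_stop café!"))               # dont_stop café
```

Note that the apostrophe in `don't` is removed, which is why the data frame above contains `dont`.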


Next, we remove stop words. Very common words such as "the" and "is" carry little information for telling spam from ham, so leaving them in mostly adds statistical noise.

In [10]:
stop = set(stopwords.words('english'))  # a set makes the membership test below fast

In [11]:
df['v2'] = df['v2'].apply(lambda text: " ".join(word for word in text.split() if word not in stop))
df.head()

Unnamed: 0,v1,v2
0,ham,go jurong point crazy available bugis n great ...
1,ham,ok lar joking wif u oni
2,spam,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,u dun say early hor u c already say
4,ham,nah dont think goes usf lives around though
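
The filtering step can be sketched on a single message as follows. The stop list here is a tiny hand-picked set for illustration only (the notebook uses NLTK's much longer English list), and `drop_stop_words` is a hypothetical helper name.

```python
# A tiny hand-picked stop list, for illustration only.
toy_stop = {"i", "he", "to", "don't"}

def drop_stop_words(text, stop_words):
    """Keep only the whitespace-separated words that are not stop words."""
    return " ".join(w for w in text.split() if w not in stop_words)

msg = "nah i dont think he goes to usf he lives around though"
print(drop_stop_words(msg, toy_stop))
# nah dont think goes usf lives around though
```

Because punctuation was stripped earlier, `dont` no longer matches the stop word `don't` and stays in the text, just as it does in the output above.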


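We now stem the text, which strips common suffixes so that related word forms share one stem. The stem is not always a real word (for example, `crazy` becomes `crazi`).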

In [12]:
st = PorterStemmer()
df['v2'] = df['v2'].apply(lambda text: " ".join(st.stem(word) for word in text.split()))
df.head()

Unnamed: 0,v1,v2
0,ham,go jurong point crazi avail bugi n great world...
1,ham,ok lar joke wif u oni
2,spam,free entri 2 wkli comp win fa cup final tkt 21...
3,ham,u dun say earli hor u c alreadi say
4,ham,nah dont think goe usf live around though
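
As a quick check on individual words, the sketch below stems a few tokens taken from the messages above. It assumes only that NLTK is installed; `PorterStemmer` needs no extra data download.

```python
from nltk.stem import PorterStemmer

st = PorterStemmer()

# Words taken from the first few messages in the data set.
words = ["crazy", "available", "joking", "lives"]
stems = {word: st.stem(word) for word in words}

for word, stem in stems.items():
    print(word, "->", stem)
```

The resulting stems (`crazi`, `avail`, `joke`, `live`) are the same tokens that appear in the data frame above.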


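Stemming can leave tokens that are not real English words, so we also lemmatize the text, which maps each word to its dictionary form. Note that the lemmatizer looks words up in WordNet, so tokens that have already been stemmed into non-words (such as `crazi`) pass through unchanged, as the identical output below shows.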

In [13]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [14]:
le = WordNetLemmatizer()
df['v2'] = df['v2'].apply(lambda x : " ".join(le.lemmatize(w) for w in x.split()))
df.head()

Unnamed: 0,v1,v2
0,ham,go jurong point crazi avail bugi n great world...
1,ham,ok lar joke wif u oni
2,spam,free entri 2 wkli comp win fa cup final tkt 21...
3,ham,u dun say earli hor u c alreadi say
4,ham,nah dont think goe usf live around though


As a last step, we tokenize each message with NLTK's `word_tokenize`, then join the tokens back into a single space-separated string, which is the input format the vectorizer expects.

In [15]:
nltk.download("punkt")
df['v2'] = df['v2'].apply(word_tokenize)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [16]:
df.head()

Unnamed: 0,v1,v2
0,ham,"[go, jurong, point, crazi, avail, bugi, n, gre..."
1,ham,"[ok, lar, joke, wif, u, oni]"
2,spam,"[free, entri, 2, wkli, comp, win, fa, cup, fin..."
3,ham,"[u, dun, say, earli, hor, u, c, alreadi, say]"
4,ham,"[nah, dont, think, goe, usf, live, around, tho..."


In [17]:
df['v2'] = [" ".join(x) for x in df['v2'].values]

In [18]:
df.head()

Unnamed: 0,v1,v2
0,ham,go jurong point crazi avail bugi n great world...
1,ham,ok lar joke wif u oni
2,spam,free entri 2 wkli comp win fa cup final tkt 21...
3,ham,u dun say earli hor u c alreadi say
4,ham,nah dont think goe usf live around though


Now we split the data into training and test sets, holding out 30% of the messages for testing.

In [19]:
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(df['v2'], df['v1'], test_size=0.3, random_state=100)

print(xtrain.shape)
print(xtest.shape)
print(ytrain.shape)
print(ytest.shape)

(3900,)
(1672,)
(3900,)
(1672,)
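
The printed shapes can be reproduced with a little arithmetic. This sketch assumes that scikit-learn rounds a fractional test count up and gives the remaining rows to the training set, which matches the shapes above. The variable names are illustrative.

```python
import math

n_samples = 5572   # number of rows reported by df.info()
test_size = 0.3

n_test = math.ceil(test_size * n_samples)   # 1671.6 is rounded up
n_train = n_samples - n_test

print(n_train, n_test)  # 3900 1672
```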


In [21]:
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
# Classes are sorted alphabetically, so 'ham' becomes 0 and 'spam' becomes 1
ytrain = lb.fit_transform(ytrain)
ytest = lb.transform(ytest)

In [22]:
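from sklearn.feature_extraction.text import TfidfVectorizer

# token_pattern=r'\w{1,}' keeps single-character tokens such as "u" and "2"
tfvect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
# Note: fitting on the full corpus lets test-set vocabulary and IDF statistics
# leak into the features; fitting on xtrain alone would avoid that.
tfvect = tfvect.fit(df['v2'])

xtrain_new = tfvect.transform(xtrain)
xtest_new = tfvect.transform(xtest)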
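
As an alternative to fitting on the whole data set, the vectorizer can be fitted on the training messages only. The sketch below is not the notebook's method. It uses made-up toy messages and hypothetical variable names to show that words seen only at test time are then ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy messages, made up for illustration.
toy_train = ["free entry win cup", "ok lar joke wif u"]
toy_test = ["free prize winner"]

vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
vect.fit(toy_train)  # vocabulary and IDF weights come from the training text only

toy_test_vec = vect.transform(toy_test)

print(sorted(vect.vocabulary_))  # 'prize' and 'winner' are not in the vocabulary
print(toy_test_vec.shape)        # (1, 9): one row, nine training-vocabulary terms
```

Only `free` from the test message maps to a feature, so the transformed row has a single non-zero entry.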

In [23]:
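from sklearn.metrics import accuracy_score

def train_model(classifier, xtrain, ytrain, xtest, ytest):
    """Fit the classifier on the training data and return its test accuracy."""
    mod = classifier.fit(xtrain, ytrain)
    predictions = mod.predict(xtest)
    return accuracy_score(ytest, predictions)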

In [24]:
from sklearn import naive_bayes
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

accuracy1 = train_model(naive_bayes.MultinomialNB(), xtrain_new, ytrain, xtest_new, ytest)

print(accuracy1)

0.965311004784689


In [26]:
accuracy = train_model(LogisticRegression(), xtrain_new, ytrain, xtest_new, ytest)

print(accuracy)

0.9575358851674641
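
Since accuracy is the fraction of test messages classified correctly, the two scores can be turned back into message counts. This short sketch uses the test-set size and the accuracies printed above; the variable names are illustrative.

```python
n_test = 1672  # size of the test split printed earlier

scores = {
    "MultinomialNB": 0.965311004784689,
    "LogisticRegression": 0.9575358851674641,
}

# accuracy = correct / n_test, so correct = accuracy * n_test
counts = {name: round(acc * n_test) for name, acc in scores.items()}

for name, correct in counts.items():
    print(name, correct, n_test - correct)
# MultinomialNB 1614 58
# LogisticRegression 1601 71
```

In other words, Naive Bayes misclassifies 58 of the 1672 test messages and logistic regression misclassifies 71.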
