<a href="https://colab.research.google.com/github/Kavya-sree/Titanic-Survival/blob/main/SMS_spam_filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Text messaging is one of the easiest and most reliable ways of connecting with people. But we also receive annoying spam messages every day. They are unsolicited and may be malicious.
Using machine learning we can filter out spam messages. This is a binary classification problem, so let's build a spam classifier.

# About dataset

The SMS Spam Collection Data Set is a set of tagged SMS messages that have been collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as ham (legitimate) or spam. You can find the dataset [here](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).

# Importing Libraries

In [27]:
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt

import nltk
nltk.download('punkt')
nltk.download('wordnet')

import re
import string

from sklearn.model_selection import train_test_split

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Loading Data

In [28]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [29]:
import chardet
with open("/content/drive/MyDrive/Colab Notebooks/spam.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))
result

{'encoding': 'Windows-1252', 'confidence': 0.7270322499829184, 'language': ''}

In [30]:
#Loading data
data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/spam.csv", encoding='Windows-1252')
data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [31]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB



*   The dataset contains 5572 entries. 
*   v1 represents the class/label of the message, ham or spam.
*   v2 represents the text message

The three columns Unnamed: 2, Unnamed: 3 and Unnamed: 4 are almost entirely null (50, 12 and 6 non-null values out of 5572). They are unnamed and we don't know what they mean, so we can drop them.

In [32]:
# Dropping unwanted columns
cols_to_drop = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"]
data = data.drop(columns=cols_to_drop)
# Renaming the columns for readability
data = data.rename(columns={"v1": "Label", "v2": "Text"})
data.head()

Unnamed: 0,Label,Text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


# Data visualization

In [33]:
fig = px.histogram(data, x="Label", color="Label", 
                   color_discrete_sequence=["#808080","#FF0000"], 
                   title='Count Plot of labels', 
                   width=600, height=400)
fig.show()

The dataset is imbalanced: 4825 ham messages and 747 spam messages.
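
With classes this uneven, accuracy alone can flatter a model. As a rough reference point, the arithmetic below takes the two counts quoted above and works out the spam share and the accuracy of a trivial baseline that labels every message as ham. The variable names are illustrative and not part of the notebook.

```python
# Class counts quoted above (4825 ham, 747 spam)
n_ham = 4825
n_spam = 747
n_total = n_ham + n_spam

# Share of spam messages in the dataset
spam_share = n_spam / n_total

# Accuracy of a trivial baseline that always predicts "ham"
baseline_accuracy = n_ham / n_total

print(f"total messages:    {n_total}")              # 5572
print(f"spam share:        {spam_share:.3f}")       # 0.134
print(f"baseline accuracy: {baseline_accuracy:.3f}")  # 0.866
```

Any classifier trained on this data should be judged against that baseline of roughly 87%.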

# Feature Engineering

We can create a new feature for further data exploration:

*   Length: length of the text message in characters

In [34]:
data['Length'] = data['Text'].apply(len)
data.head()


Unnamed: 0,Label,Text,Length
0,ham,"Go until jurong point, crazy.. Available only ...",111
1,ham,Ok lar... Joking wif u oni...,29
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,ham,U dun say so early hor... U c already then say...,49
4,ham,"Nah I don't think he goes to usf, he lives aro...",61


# Visualize new data

In [35]:
fig = px.histogram(data, x="Length", nbins=20,
                   color_discrete_sequence=["#FF0000"],
                   width=800, height=400)
fig.show()

In [36]:
data.Length.describe()

count    5572.000000
mean       80.118808
std        59.690841
min         2.000000
25%        36.000000
50%        61.000000
75%       121.000000
max       910.000000
Name: Length, dtype: float64

The shortest message has 2 characters and the longest has 910. Let's look at both messages.

In [37]:
print(data[data['Length'] == 2]['Label'].iloc[0])
data[data['Length'] == 2]['Text'].iloc[0]


ham


'Ok'

In [38]:
print(data[data['Length'] == 910]['Label'].iloc[0])
data[data['Length'] == 910]['Text'].iloc[0]


ham


"For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a lot..will tell later.."

The shortest and the longest messages are both 'ham'. The person who wrote the longest text seems like a passionate lover.

Now we have to determine whether the length of the text is related to our target variable 'Label'.

In [39]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

data_ham= data[data.Label == 'ham']
data_spam= data[data.Label == 'spam']

fig = make_subplots(rows=1, cols=2)

fig.add_trace(
    go.Histogram(x= data_ham['Length'], name= 'ham',marker_color="#808080"),
    row=1, col=1
)

fig.add_trace(
    go.Histogram(x= data_spam['Length'], name='spam', marker_color="#FF0000"),
    row=1, col=2
)

# Update xaxis properties
fig.update_xaxes(title_text="Length", row=1, col=1)
fig.update_xaxes(title_text="Length", row=1, col=2)

# Update yaxis properties
fig.update_yaxes(title_text="Frequency", row=1, col=1)
fig.update_yaxes(title_text="Frequency", row=1, col=2)


fig.update_layout(height=500, width=800, 
                  title_text="Length distribution",
                  )
fig.show()

From the plot we can see that spam messages tend to be longer, typically in the range of 150-160 characters, while most ham messages are in the range of 20-40 characters.
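
The visual impression can be backed up with per-class summary statistics. The sketch below uses a tiny made-up DataFrame with the same `Label`, `Text` and `Length` columns as the notebook, so the numbers it prints say nothing about the real dataset; it only shows the grouping pattern.

```python
import pandas as pd

# Toy stand-in for the notebook's DataFrame (hypothetical messages)
toy = pd.DataFrame({
    "Label": ["ham", "ham", "spam", "spam"],
    "Text": [
        "Ok",
        "See you at 5",
        "WINNER!! You have been selected for a free prize. Call now to claim.",
        "Free entry in a weekly competition. Text WIN to enter today!",
    ],
})
toy["Length"] = toy["Text"].str.len()

# Median and mean message length per class
length_by_label = toy.groupby("Label")["Length"].agg(["median", "mean"])
print(length_by_label)
```

On the real data, `data.groupby("Label")["Length"].describe()` would give the full per-class summary.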


# Data Preprocessing

Raw text can be noisy, so it needs to be processed and cleaned before building any model. Feeding raw text directly to a model can cause errors and decrease model performance.

## Convert to lowercase and remove punctuation

In [40]:
def clean_text(text):
    text = str(text).lower()  # make lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    return text

data['Clean_Text'] = data['Text'].apply(clean_text)

In [41]:
data.head()

Unnamed: 0,Label,Text,Length,Clean_Text
0,ham,"Go until jurong point, crazy.. Available only ...",111,go until jurong point crazy available only in ...
1,ham,Ok lar... Joking wif u oni...,29,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155,free entry in 2 a wkly comp to win fa cup fina...
3,ham,U dun say so early hor... U c already then say...,49,u dun say so early hor u c already then say
4,ham,"Nah I don't think he goes to usf, he lives aro...",61,nah i dont think he goes to usf he lives aroun...
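
To see what this cleaning step does to a single message, here is a small self-contained sketch. It redefines the same two operations (lowercasing, then stripping `string.punctuation`) and applies them to a made-up message that is not from the dataset.

```python
import string

def clean_text(text):
    """Lowercase the text and strip ASCII punctuation."""
    text = str(text).lower()
    return text.translate(str.maketrans("", "", string.punctuation))

sample = "Don't miss out!!! WIN a FREE prize, call now..."  # made-up message
cleaned = clean_text(sample)
print(cleaned)  # dont miss out win a free prize call now
```

Note that `string.punctuation` only covers ASCII punctuation, so digits and symbols such as `£` are left in place.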


## Tokenization

In [42]:
data["Tokenized_Text"] = data["Clean_Text"].apply(nltk.word_tokenize)

In [43]:
data.head()

Unnamed: 0,Label,Text,Length,Clean_Text,Tokenized_Text
0,ham,"Go until jurong point, crazy.. Available only ...",111,go until jurong point crazy available only in ...,"[go, until, jurong, point, crazy, available, o..."
1,ham,Ok lar... Joking wif u oni...,29,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155,free entry in 2 a wkly comp to win fa cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f..."
3,ham,U dun say so early hor... U c already then say...,49,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, t..."
4,ham,"Nah I don't think he goes to usf, he lives aro...",61,nah i dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l..."


## Remove stopwords

In [44]:
from nltk.corpus import stopwords
nltk.download('stopwords')

# build the stopword set once instead of on every call
stop_words = set(stopwords.words('english'))

# function to remove stopwords
def remove_stopwords(tokens):
    return [w for w in tokens if w not in stop_words]

data["Stopword_Removed"] = data["Tokenized_Text"].apply(remove_stopwords)
data.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,Label,Text,Length,Clean_Text,Tokenized_Text,Stopword_Removed
0,ham,"Go until jurong point, crazy.. Available only ...",111,go until jurong point crazy available only in ...,"[go, until, jurong, point, crazy, available, o...","[go, jurong, point, crazy, available, bugis, n..."
1,ham,Ok lar... Joking wif u oni...,29,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155,free entry in 2 a wkly comp to win fa cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, 2, wkly, comp, win, fa, cup, fin..."
3,ham,U dun say so early hor... U c already then say...,49,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives aro...",61,nah i dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l...","[nah, dont, think, goes, usf, lives, around, t..."


## Lemmatization

In [45]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
nltk.download('omw-1.4')

# lemmatize function
def lemmatize_word(text):
    lemmas = [lemmatizer.lemmatize(word, pos ='v') for word in text]
    return lemmas

data["Lemmatized_Text"] = data["Stopword_Removed"].apply(lemmatize_word)
data.head()

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Unnamed: 0,Label,Text,Length,Clean_Text,Tokenized_Text,Stopword_Removed,Lemmatized_Text
0,ham,"Go until jurong point, crazy.. Available only ...",111,go until jurong point crazy available only in ...,"[go, until, jurong, point, crazy, available, o...","[go, jurong, point, crazy, available, bugis, n...","[go, jurong, point, crazy, available, bugis, n..."
1,ham,Ok lar... Joking wif u oni...,29,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]","[ok, lar, joke, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155,free entry in 2 a wkly comp to win fa cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, 2, wkly, comp, win, fa, cup, fin...","[free, entry, 2, wkly, comp, win, fa, cup, fin..."
3,ham,U dun say so early hor... U c already then say...,49,u dun say so early hor u c already then say,"[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, early, hor, u, c, already, say]","[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives aro...",61,nah i dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l...","[nah, dont, think, goes, usf, lives, around, t...","[nah, dont, think, go, usf, live, around, though]"


## Vectorization

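Vectorization is the technique of converting input data from its raw format into vectors of real numbers.

There are many vectorization techniques available, but here let's use TF-IDF vectorization. TF-IDF weights each term by how often it occurs in a message and scales that weight down for terms that occur in many messages.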


In [46]:
# The TF-IDF vectorizer expects an iterable of strings, so join the tokens back together
corpus = [' '.join(tokens) for tokens in data["Lemmatized_Text"]]

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus).toarray()
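
Note that `fit_transform` on the whole corpus lets the vocabulary and IDF weights see every message, including those that later land in the test set. One possible variation, sketched here under assumptions, is to split the raw text first and wrap the vectorizer and the classifier in a scikit-learn `Pipeline`, so the vectorizer is fit on the training portion only. The messages, the `spam_pipeline` name and the hand-made split are hypothetical; `MultinomialNB` is the classifier this notebook uses in its model section.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up messages, already split into training and test portions
train_texts = [
    "win free prize now",
    "free entry win cash",
    "claim free cash prize",
    "ok see you later",
    "are you coming home tonight",
    "let us meet for lunch",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
test_texts = ["free cash prize", "see you at lunch"]

# The vectorizer is fit inside the pipeline, on the training texts only
spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
spam_pipeline.fit(train_texts, train_labels)

# At prediction time the test texts are only transformed, never used for fitting
predictions = spam_pipeline.predict(test_texts).tolist()
print(predictions)  # ['spam', 'ham']
```

A side benefit of this design is that the fitted pipeline accepts raw strings directly, which makes it easier to reuse on new messages.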


In [48]:
# Label encode the Label column (classes are sorted, so ham -> 0 and spam -> 1)

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
Y = label_encoder.fit_transform(data["Label"])

# Building model

In [49]:
# Splitting the testing and training sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Multinomial Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB().fit(X_train, Y_train)
 
# predicting test set results
y_pred = model.predict(X_test)
 
from sklearn.metrics import accuracy_score
print(accuracy_score(Y_test,y_pred))

0.9623318385650225
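
Because the classes are imbalanced, a single accuracy figure can hide how many spam messages slip through. The sketch below uses made-up labels (0 for ham, 1 for spam, matching the encoding above) to show how precision, recall and a confusion matrix expose that; the `toy_` arrays are illustrative and are not the notebook's results.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels: 0 = ham, 1 = spam (8 ham, 2 spam)
toy_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A classifier that misses one of the two spam messages
toy_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

toy_accuracy = accuracy_score(toy_true, toy_pred)
toy_precision = precision_score(toy_true, toy_pred)  # share of predicted spam that is spam
toy_recall = recall_score(toy_true, toy_pred)        # share of actual spam that is caught
toy_cm = confusion_matrix(toy_true, toy_pred)        # rows: true class, columns: predicted

print(toy_accuracy, toy_precision, toy_recall)  # 0.9 1.0 0.5
print(toy_cm)
```

Here accuracy is 0.9 even though half of the spam is missed. On the notebook's own variables the same functions would take `Y_test` and `y_pred`, and `sklearn.metrics.classification_report(Y_test, y_pred)` prints precision, recall and F1 per class in one table.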
