# SMS Spam Filtering using NLP

* The goal of this notebook is to build a model that classifies SMS messages as spam or ham. The model should reach acceptable accuracy and precision.

* I have used a Random Forest classifier here.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**Let's load some basic libraries and start the engine xD**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import re
import nltk
nltk.download('stopwords')

%matplotlib inline

**Now let's see how the data looks**

In [None]:
sms=pd.read_csv("/kaggle/input/sms-spam-collection-dataset/spam.csv",encoding='latin-1')
print(sms.shape)
sms.head(5)



The output above shows how the data looks. To be exact, there are 5572 messages.

Here I drop the unnamed NaN columns. It makes no sense to keep them because they give us no insight.

I also rename the columns v1 and v2 to Label and Messages respectively.

In [None]:
# columns= already identifies the axis, so axis=1 is not needed
sms.drop(columns=['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], inplace=True)
sms.columns = ['Label', 'Messages']
sms.head(5)


In [None]:
print(sms.isnull().sum())
print(sms.shape)

Not the way I expected to be honest. Quite clean xD

In [None]:
sms['Messages'].head(15)


**Let's see how many ham and spam messages there are in the dataset**

In [None]:
print(f'ham= {len(sms[sms["Label"] == "ham"])}')
print(f'spam= {len(sms[sms["Label"] == "spam"])}')

So we have 4825 ham messages and 747 spam messages.
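
With classes this uneven, accuracy on its own is easy to misread. As a quick sanity check that uses only the counts quoted above, the sketch below computes the accuracy of a hypothetical baseline that labels every message as ham. The variable names are my own and are not part of the notebook.

```python
# Class counts quoted in the text above
n_ham = 4825
n_spam = 747
n_total = n_ham + n_spam

# A trivial baseline that always predicts "ham" is correct on every ham message
baseline_accuracy = n_ham / n_total
spam_share = n_spam / n_total

print(f"total messages      = {n_total}")
print(f"spam share          = {spam_share:.1%}")
print(f"always-ham accuracy = {baseline_accuracy:.1%}")
```

Any model worth keeping has to beat this baseline clearly, which is why precision and recall on the spam class matter as much as accuracy.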

**Let's see the same with a visualization**

In [None]:
fig,ax1=plt.subplots(figsize=(7,5))
sns.countplot(x="Label",data=sms)


Now let's draw some attributes and insights from the data.

I create some additional features that may help the model, such as **text length** and **punctuation used**.

In [None]:
sms['Text_len']= sms['Messages'].apply(lambda x:len(x))
sms.head(6)

**Let's see the distribution of spam and ham message lengths**


In [None]:
f, ax = plt.subplots(1, 2, figsize = (28, 7))

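# Text_len counts characters, not words, so the axes are labelled accordingly.
# Note: sns.distplot is deprecated since seaborn 0.11; sns.histplot is its replacement.
sns.distplot(sms[sms["Label"] == "spam"]["Text_len"], bins = 50, ax = ax[0],rug=True,kde_kws={"color": "r"},rug_kws={"color": "g"})
ax[0].set_xlabel("Spam Message Length (characters)")

sns.distplot(sms[sms["Label"] == "ham"]["Text_len"], bins = 100, ax = ax[1],rug=True,kde_kws={"color": "r"},rug_kws={"color": "g"})
ax[1].set_xlabel("Ham Message Length (characters)")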

plt.show()

**Spam:** The spread is fairly uniform compared to ham messages. Spam messages are generally around 110 to 170 characters long.

**Ham:** The distribution is strongly right-skewed. Roughly 70% of ham messages are shorter than 180 characters.

**Let's count the punctuation used. A punctuation percentage is easier to interpret than a raw count.**

In [None]:
import string
def punc_count(text):
    count_punc=sum([1 for c in text if c in string.punctuation])
    return 100*count_punc/len(text)

sms['punc_%']= sms['Messages'].apply(lambda x:punc_count(x))
sms.head(6)
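
To make the feature concrete, here is a small worked example on a single made-up message. It is a sketch of the same idea as `punc_count`, with one addition of my own: a guard for empty strings, which would otherwise cause a division by zero. The function name `punc_percent` and the sample text are hypothetical.

```python
import string

def punc_percent(text):
    """Share of characters in `text` that are punctuation, in percent."""
    if not text:  # guard against division by zero on an empty message
        return 0.0
    count = sum(1 for c in text if c in string.punctuation)
    return 100 * count / len(text)

example = "WINNER!! Call now."
# 18 characters in total, 3 of them punctuation ("!", "!", ".")
print(f"{punc_percent(example):.2f}% of the characters are punctuation")
```

The spam dataset has no empty messages, so the guard only matters if the function is reused on other data.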

Now let's get to the text cleaning part. Here I will remove the punctuation.

In [None]:
# This function iterates over every character in the text and keeps only those that are not listed in string.punctuation

def remove_punctuation(text):
    text_nopunc="".join([c for c in text if c not in string.punctuation])
    return text_nopunc

# "".join is needed because the list comprehension returns a list of single characters; without it the output is comma-separated letters instead of one string

In [None]:
sms['clean_text']=sms['Messages'].apply(lambda x:remove_punctuation(x))
sms.head(7)

As we can see in the dataset above, all punctuation has been removed in the column named "clean_text".

Now let's divide the messages into tokens, i.e. single words.

In [None]:
def tokenize(text):
    # \W matches non-word characters (\w matches word characters), so this splits on runs of non-word characters.
    # The raw string avoids Python's invalid-escape warning for '\W'.
    tokens = re.split(r'\W+', text)
    return tokens

# Lowercase first so that "Free" and "free" are treated as the same word
sms['text_tokens'] = sms['clean_text'].apply(lambda x: tokenize(x.lower()))

sms.head()
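
One detail of `re.split` is worth knowing: when the text starts or ends with a non-word character (a leftover space, for instance), the result contains empty strings. The sketch below shows this on a made-up message and compares it with `re.findall(r"\w+", ...)`, which is one possible alternative and not what the notebook uses.

```python
import re

# Made-up message with leading and trailing spaces, as can remain after cleaning
text = " free entry in 2 a wkly comp "

split_tokens = re.split(r"\W+", text)    # splits on runs of non-word characters
found_tokens = re.findall(r"\w+", text)  # collects runs of word characters instead

print(split_tokens)
print(found_tokens)
```

Filtering with `[t for t in split_tokens if t]` gives the same tokens as `re.findall` while keeping the notebook's `re.split` approach.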

**Next I remove the stopwords from the tokenised sentences.**

Stopwords are words in a language that contribute little to the meaning of a sentence, such as "the", "is", and "and".

In [None]:
from nltk.corpus import stopwords
# A set makes the "word not in stopword" check below much faster than a list
stopword = set(stopwords.words('english'))

In [None]:
def remove_stopwords(text):
    text_no_sw= [word for word in text if word not in stopword]
    return text_no_sw

sms['text_clean']=sms['text_tokens'].apply(lambda x:remove_stopwords(x))
sms.head()


**Let's move on to stemming. In particular, I am using the Porter stemmer.**

Stemming is the process of reducing a word to its root form, for example by stripping plural or tense endings. It is like cutting the branches of a tree back to just its stem, hence the name.

In [None]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
print(ps.stem('cats'))

In [None]:
def stemming(text_clean):
    # Stem every token in the already tokenised, stopword-free message
    stemmed = [ps.stem(word) for word in text_clean]
    return stemmed



In [None]:
sms['text_stemmed']=sms['text_clean'].apply(lambda x:stemming(x))
sms.head(5)

# Vectorisation


The textual data after processing needs to be fed into the model. Since the model doesn't accept textual data and only understands numbers, this data needs to be vectorized i.e. transforming text into a meaningful vector (or array) of numbers.

To convert string data into numerical data, one can use the following methods:

· Bag of words

· TFIDF

· Word2Vec


**We are going to use TF-IDF today**
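
To show what TF-IDF measures, here is a minimal sketch of the textbook formula in plain Python: term frequency within a message multiplied by log(N / df), where df is the number of messages containing the term. The three documents are made up. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Made-up mini corpus, only for illustration
docs = [
    "free prize call now",
    "call me later",
    "free free entry",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Textbook TF-IDF: relative term frequency times log(N / df)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }

for doc, tokens in zip(docs, tokenized):
    print(doc, "->", {t: round(w, 3) for t, w in tfidf(tokens).items()})
```

In the first message, "prize" gets a higher weight than "free" because "prize" appears in only one document while "free" appears in two.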

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf= TfidfVectorizer()
X_tfidf = tfidf.fit_transform(sms['Messages'])
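
Note that the cell above fits the vectorizer on the raw `Messages` column, so the `text_stemmed` tokens built earlier do not reach the model; `TfidfVectorizer` lowercases and tokenizes on its own. If you want the cleaned tokens to be used instead, one option is to join each token list back into a string first. The sketch below assumes a small hypothetical stand-in for `sms['text_stemmed']`; the names `stemmed_messages`, `tfidf_stemmed`, and `X_stemmed` are mine.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for sms['text_stemmed']: one list of stemmed tokens per message
stemmed_messages = [
    ["free", "entri", "win", "prize"],
    ["call", "later", "home"],
    ["win", "cash", "prize", "call"],
]

# Join each token list back into a string so the vectorizer can consume it
joined = [" ".join(tokens) for tokens in stemmed_messages]

tfidf_stemmed = TfidfVectorizer()
X_stemmed = tfidf_stemmed.fit_transform(joined)

print(X_stemmed.shape)
print(sorted(tfidf_stemmed.vocabulary_))
```

In the notebook, the equivalent input would be `sms['text_stemmed'].apply(' '.join)`. Whether this improves the score over the raw text is something to measure, not assume.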


In [None]:
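X = pd.concat([sms['Text_len'], sms['punc_%'], pd.DataFrame(X_tfidf.toarray())], axis=1)
# The TF-IDF columns are named 0, 1, 2, ... (integers) while the first two are strings;
# recent scikit-learn versions reject mixed column-name types, so cast them all to str
X.columns = X.columns.astype(str)
X.head(5)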


The above is the vectorized form of every message, and it is now ready to be fed to the ML model.


In [None]:
Y=sms['Label']

**Now I divide the dataset into train and test sets, keeping the test set at 30% of the original size**

In [None]:
from sklearn.model_selection import train_test_split 
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size = 0.3)
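
Because spam is the minority class, it can help to keep the ham/spam ratio identical in both splits and to fix the random seed so the results are reproducible. The sketch below shows the `stratify` and `random_state` parameters of `train_test_split` on made-up labels; this is a suggestion, not what the cell above does, and the seed value 42 is arbitrary.

```python
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 ham, 20 spam
labels = ["ham"] * 80 + ["spam"] * 20
features = list(range(len(labels)))  # placeholder feature values

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels,
    test_size=0.3,
    stratify=labels,   # keep the class ratio the same in train and test
    random_state=42,   # arbitrary seed, for reproducibility
)

print("test size:", len(y_te), "spam in test:", y_te.count("spam"))
```

In the notebook this would correspond to passing `stratify=Y` and a `random_state` to the existing call.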

**Below I import the Random Forest classifier, fit it on the training set, and predict on the test set**

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(n_jobs=-1) # n_jobs=-1 uses all available CPU cores to train the trees in parallel
rf.fit(X_train,Y_train)
Y_pred = rf.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(Y_test,Y_pred))

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(Y_test,Y_pred)

We got good accuracy from a simple classifier: about 97%. Because neither the train/test split nor the classifier is seeded, the exact figure varies slightly between runs.
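
Accuracy does not tell the whole story on an imbalanced dataset, which is why the classification report above also lists precision and recall. As a small worked example on hypothetical labels (not the model's real predictions), the sketch below computes all three by hand for the spam class.

```python
# Hypothetical true and predicted labels, only for illustration
y_true = ["ham", "ham", "ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam"]
y_pred = ["ham", "ham", "ham", "ham", "ham", "spam", "spam", "spam", "ham", "ham"]

pairs = list(zip(y_true, y_pred))
tp = sum(t == "spam" and p == "spam" for t, p in pairs)  # spam caught
fp = sum(t == "ham" and p == "spam" for t, p in pairs)   # ham wrongly flagged
fn = sum(t == "spam" and p == "ham" for t, p in pairs)   # spam missed
tn = sum(t == "ham" and p == "ham" for t, p in pairs)    # ham passed through

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # of the messages flagged as spam, how many really are spam
recall = tp / (tp + fn)     # of the real spam messages, how many were flagged

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

For a spam filter, precision matters because a false positive means a legitimate message ends up hidden from the user.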

**Steps to make the accuracy better:**

* Use n-grams: identify frequent unigrams, bigrams, and trigrams, and treat multi-word phrases such as "Thank You" as a single token.

* You can also try hyperparameter tuning, for example with `RandomizedSearchCV` or `GridSearchCV`.
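
Both suggestions can be combined in one search. The sketch below is a minimal illustration on a tiny made-up corpus: it wraps `TfidfVectorizer` and `RandomForestClassifier` in a `Pipeline` and lets `GridSearchCV` try different `ngram_range` values together with the number of trees. The corpus, the grid values, the `f1_macro` scoring choice, and `cv=2` are all assumptions chosen to keep the example small; a real run would use the SMS data and a wider grid.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, only to keep the sketch runnable
texts = [
    "win a free prize now", "free entry win cash",
    "claim your free prize", "urgent win cash now",
    "are we meeting today", "see you at lunch",
    "call me when you are home", "thank you for the lift",
]
labels = ["spam"] * 4 + ["ham"] * 4

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestClassifier(random_state=0)),
])

# (1, 1) = unigrams only, (1, 2) adds bigrams, (1, 3) adds trigrams as well
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "rf__n_estimators": [50, 100],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

print(search.best_params_)
```

Putting the vectorizer inside the pipeline also means it is refit on each training fold only, so no vocabulary information leaks from the validation fold.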