# SMS Spam Detection Analysis

<img src="01.jpg" width=600 height=60 />

## Introduction :

The growth in mobile phone users has led to a dramatic increase in SMS spam messages. In practice, fighting mobile phone spam is made difficult by several factors, including the lower rate of SMS, which has allowed many users and service providers to ignore the issue, and the limited availability of mobile phone spam-filtering software. In academic settings, on the other hand, a major handicap is the scarcity of public SMS spam datasets, which are sorely needed for validating and comparing different classifiers. Moreover, because SMS messages are fairly short, content-based spam filters may see their performance degraded.

The internet and social media have become the quickest and most straightforward ways to get information, and messages have become a significant source of it. Short Message Service (SMS) is one of the most potent means of communication. As dependence on mobile devices has increased over time, so have attacks carried out via SMS. Thanks to technological advances, various artificial intelligence techniques can now extract meaningful information from such data.

<img src="02.jpg" width=600 height=60 />


## Content :

The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

This corpus has been collected from free or free-for-research sources on the Internet:

* A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. 


* A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), which is a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available.


* A list of 450 SMS ham messages collected from Caroline Tagg's PhD Thesis.


* Finally, we have incorporated the SMS Spam Corpus v.0.1 Big. It has 1,002 SMS ham messages and 322 spam messages and it is publicly available at. 


## Dataset Source :

https://www.kaggle.com/uciml/sms-spam-collection-dataset

https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

## Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from nltk.tokenize.toktok import ToktokTokenizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings('ignore')

## Importing the Dataset

In [2]:
spam = pd.read_csv('spam.csv', encoding = 'latin-1')

In [3]:
spam.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [4]:
spam.shape

(5572, 5)

**Get necessary columns for processing**

In [5]:
spam = spam[['v2', 'v1']]

spam = spam.rename(columns = {'v2': 'messages', 'v1': 'label'})
spam.head()

Unnamed: 0,messages,label
0,"Go until jurong point, crazy.. Available only ...",ham
1,Ok lar... Joking wif u oni...,ham
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam
3,U dun say so early hor... U c already then say...,ham
4,"Nah I don't think he goes to usf, he lives aro...",ham


## Preprocessing the dataset

In [6]:
spam.isnull().sum()

messages    0
label       0
dtype: int64

### Text Normalization

### Removing HTML Strips & Noise Text

### Removing Special Characters

In [7]:
# Requires the NLTK stopword corpus; run nltk.download('stopwords') once if it is missing
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    
    # convert to lowercase
    text = text.lower()
    
    # remove special characters
    text = re.sub(r'[^0-9a-zA-Z]', ' ', text)
    
    # remove extra spaces
    text = re.sub(r'\s+', ' ', text)
    
    # remove stopwords
    text = " ".join(word for word in text.split() if word not in STOPWORDS)
    return text
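
To make the effect of each step in `clean_text` concrete, here is a minimal sketch that applies the same lowercase, regex, and stopword steps to two sample messages. `clean_text_demo` and `DEMO_STOPWORDS` are hypothetical names introduced only for this illustration; the tiny stopword set is a stand-in for the NLTK list so the snippet runs without downloading any corpus.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption:
# kept tiny so the example needs no corpus download).
DEMO_STOPWORDS = {"i", "he", "to", "don", "t"}

def clean_text_demo(text, stopwords=DEMO_STOPWORDS):
    # Same steps as clean_text: lowercase, replace non-alphanumerics with
    # spaces, collapse whitespace, then drop stopwords.
    text = text.lower()
    text = re.sub(r'[^0-9a-z]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return " ".join(word for word in text.split() if word not in stopwords)

first = clean_text_demo("Ok lar... Joking wif u oni...")
second = clean_text_demo("Nah I don't think he goes to usf")
print(first)   # ok lar joking wif u oni
print(second)  # nah think goes usf
```

Note that replacing the apostrophe with a space splits "don't" into "don" and "t", so both fragments have to be in the stopword set for the contraction to disappear.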

### Tokenization

In [8]:
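# Toktok word tokenizer (not used by the cells below, which split on whitespace)
tokenizer = ToktokTokenizer()

# English stopwords as a list (clean_text uses the STOPWORDS set instead)
stopword_list = nltk.corpus.stopwords.words('english')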

In [9]:
spam.head()

Unnamed: 0,messages,label
0,"Go until jurong point, crazy.. Available only ...",ham
1,Ok lar... Joking wif u oni...,ham
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam
3,U dun say so early hor... U c already then say...,ham
4,"Nah I don't think he goes to usf, he lives aro...",ham


In [10]:
spam['clean_text'] = spam['messages'].apply(clean_text)
spam.head()

Unnamed: 0,messages,label,clean_text
0,"Go until jurong point, crazy.. Available only ...",ham,go jurong point crazy available bugis n great ...
1,Ok lar... Joking wif u oni...,ham,ok lar joking wif u oni
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam,free entry 2 wkly comp win fa cup final tkts 2...
3,U dun say so early hor... U c already then say...,ham,u dun say early hor u c already say
4,"Nah I don't think he goes to usf, he lives aro...",ham,nah think goes usf lives around though


### Text Stemming

In [11]:
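def simple_stemmer(text):
    # Reduce each whitespace-separated word to its Porter stem,
    # using the PorterStemmer class imported at the top of the notebook
    ps = PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text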


spam['clean_text'] = spam['clean_text'].apply(simple_stemmer)

In [12]:
spam.head()

Unnamed: 0,messages,label,clean_text
0,"Go until jurong point, crazy.. Available only ...",ham,go jurong point crazi avail bugi n great world...
1,Ok lar... Joking wif u oni...,ham,ok lar joke wif u oni
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam,free entri 2 wkli comp win fa cup final tkt 21...
3,U dun say so early hor... U c already then say...,ham,u dun say earli hor u c alreadi say
4,"Nah I don't think he goes to usf, he lives aro...",ham,nah think goe usf live around though
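
The stemmed column above contains forms such as "crazi" and "avail" that are not dictionary words. The sketch below, which assumes only that NLTK is installed, runs the Porter stemmer on a few words taken from the sample rows to show where those forms come from. The `words` list and `stems` dictionary are illustrative names, not part of the notebook.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the cleaned sample messages shown above
words = ["crazy", "available", "joking", "entry", "goes", "lives"]

# Map each word to its Porter stem
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Stemming trades readability for a smaller vocabulary: different inflections of a word collapse to one feature, at the cost of stems that are not real words.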


### Input Split

In [13]:
X = spam['clean_text']
y = spam['label']

## Model Training

In [14]:
def classify(model, X, y):
    
    # Hold out 25% of the messages, keeping the ham/spam ratio the same in both splits
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42, shuffle = True, stratify = y)
    
    # Bag-of-words counts -> TF-IDF weights -> classifier
    pipeline_model = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', model)])
    pipeline_model.fit(x_train, y_train)
    
    print('Train Accuracy:', pipeline_model.score(x_train, y_train)*100)
    print('Test  Accuracy:', pipeline_model.score(x_test, y_test)*100)
    
    y_pred_train = pipeline_model.predict(x_train)
    y_pred_test  = pipeline_model.predict(x_test)
    print('\n')
    print('Train classification report')
    print(classification_report(y_train, y_pred_train))
    print('\n')
    print('Test classification report')
    print(classification_report(y_test, y_pred_test))
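
The pipeline in `classify` chains `CountVectorizer` and `TfidfTransformer`. In scikit-learn this two-step form produces the same features as the single `TfidfVectorizer` class that is imported at the top of the notebook but not used. The following sketch checks that on three stemmed messages from the sample rows; `docs`, `two_step`, and `one_step` are illustrative names, and default parameters are assumed for both variants.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
from sklearn.pipeline import Pipeline

# Three stemmed messages taken from the sample rows above
docs = [
    "free entri 2 wkli comp win",
    "ok lar joke wif u oni",
    "u dun say earli hor u c alreadi say",
]

# Two-step form, as used inside classify()
two_step = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer())])
# One-step form with default settings
one_step = TfidfVectorizer()

a = two_step.fit_transform(docs).toarray()
b = one_step.fit_transform(docs).toarray()

print(a.shape)            # (3, 15)
print(np.allclose(a, b))  # True
```

The matrix has 15 columns rather than 18 because the default token pattern keeps only tokens of two or more characters, so "2", "u", and "c" never become features. For SMS text, where such short tokens are common, that default may be worth revisiting.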

### Logistic Regression

In [15]:
model = LogisticRegression()
classify(model, X, y)

Train Accuracy: 96.98492462311557
Test  Accuracy: 97.05671213208902


Train classification report
              precision    recall  f1-score   support

         ham       0.97      1.00      0.98      3619
        spam       0.99      0.78      0.87       560

    accuracy                           0.97      4179
   macro avg       0.98      0.89      0.93      4179
weighted avg       0.97      0.97      0.97      4179



Test classification report
              precision    recall  f1-score   support

         ham       0.97      1.00      0.98      1206
        spam       0.99      0.79      0.88       187

    accuracy                           0.97      1393
   macro avg       0.98      0.89      0.93      1393
weighted avg       0.97      0.97      0.97      1393
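
The support columns in the two reports above (3619 ham and 560 spam for training, 1206 ham and 187 spam for testing) show roughly the same spam share in both splits, which is what the `stratify = y` argument in `classify` is for. A minimal sketch on synthetic labels, assuming an 80/20 class mix chosen only to make the arithmetic easy to follow:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 80 ham and 20 spam placeholder messages
labels = ['ham'] * 80 + ['spam'] * 20
texts = [f"message {i}" for i in range(100)]

# Same split arguments as in classify()
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, shuffle=True, stratify=labels)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print(train_counts)  # 60 ham, 15 spam
print(test_counts)   # 20 ham, 5 spam
```

Both splits keep the 20% spam share of the full set. Without stratification, a random split of an imbalanced dataset can leave the test set with noticeably more or less spam than the training set.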



### Multinomial NB

In [16]:
model = MultinomialNB()
classify(model, X, y)

Train Accuracy: 98.08566642737497
Test  Accuracy: 96.33883704235463


Train classification report
              precision    recall  f1-score   support

         ham       0.98      1.00      0.99      3619
        spam       1.00      0.86      0.92       560

    accuracy                           0.98      4179
   macro avg       0.99      0.93      0.96      4179
weighted avg       0.98      0.98      0.98      4179



Test classification report
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98      1206
        spam       0.99      0.73      0.84       187

    accuracy                           0.96      1393
   macro avg       0.98      0.87      0.91      1393
weighted avg       0.96      0.96      0.96      1393



### SVC

In [17]:
model = SVC(C = 3)
classify(model, X, y)

Train Accuracy: 100.0
Test  Accuracy: 98.34888729361091


Train classification report
              precision    recall  f1-score   support

         ham       1.00      1.00      1.00      3619
        spam       1.00      1.00      1.00       560

    accuracy                           1.00      4179
   macro avg       1.00      1.00      1.00      4179
weighted avg       1.00      1.00      1.00      4179



Test classification report
              precision    recall  f1-score   support

         ham       0.98      1.00      0.99      1206
        spam       1.00      0.88      0.93       187

    accuracy                           0.98      1393
   macro avg       0.99      0.94      0.96      1393
weighted avg       0.98      0.98      0.98      1393



### Decision Tree Classifier

In [18]:
model = DecisionTreeClassifier()
classify(model, X, y)

Train Accuracy: 100.0
Test  Accuracy: 96.62598707824839


Train classification report
              precision    recall  f1-score   support

         ham       1.00      1.00      1.00      3619
        spam       1.00      1.00      1.00       560

    accuracy                           1.00      4179
   macro avg       1.00      1.00      1.00      4179
weighted avg       1.00      1.00      1.00      4179



Test classification report
              precision    recall  f1-score   support

         ham       0.97      0.99      0.98      1206
        spam       0.91      0.83      0.87       187

    accuracy                           0.97      1393
   macro avg       0.94      0.91      0.92      1393
weighted avg       0.97      0.97      0.97      1393



### Random Forest Classifier

In [19]:
model = RandomForestClassifier()
classify(model, X, y)

Train Accuracy: 100.0
Test  Accuracy: 97.48743718592965


Train classification report
              precision    recall  f1-score   support

         ham       1.00      1.00      1.00      3619
        spam       1.00      1.00      1.00       560

    accuracy                           1.00      4179
   macro avg       1.00      1.00      1.00      4179
weighted avg       1.00      1.00      1.00      4179



Test classification report
              precision    recall  f1-score   support

         ham       0.97      1.00      0.99      1206
        spam       1.00      0.81      0.90       187

    accuracy                           0.97      1393
   macro avg       0.99      0.91      0.94      1393
weighted avg       0.98      0.97      0.97      1393
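
Across the test reports above, spam precision stays between 0.91 and 1.00 while spam recall ranges from 0.73 to 0.88, so the dominant error is spam that slips through as ham. The notebook imports `confusion_matrix` but `classify` does not call it; the sketch below shows how it could expose those missed messages directly. The labels are made up for illustration and are not results from this dataset.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels (assumption): one spam message is misclassified as ham
y_true = ['ham', 'ham', 'ham', 'spam', 'spam', 'spam', 'spam']
y_pred = ['ham', 'ham', 'ham', 'spam', 'spam', 'spam', 'ham']

# Rows are true classes, columns are predicted classes, in the order given
cm = confusion_matrix(y_true, y_pred, labels=['ham', 'spam'])
tn, fp, fn, tp = cm.ravel()

spam_recall = recall_score(y_true, y_pred, pos_label='spam')
print(cm)
print('missed spam:', fn, '| ham flagged as spam:', fp)
print('spam recall:', spam_recall)  # 0.75
```

Inside `classify`, the same call could be applied to `y_test` and `y_pred_test` to report missed spam as a count next to the classification report.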

