# **SMS spam (cell phone spam or short messaging service spam)**

SMS spam (sometimes called cell phone spam) is any junk message delivered to a mobile phone as a text message through the Short Message Service (SMS). Text messaging has greatly increased in popularity in the past five years, and the government is trying to keep up with the rapidly changing technology. This notebook demonstrates a basic spam-classification model and gives an intuition for how the process works.

**About Dataset:**

A collection of SMS messages stored in a single text file, where each line has the correct class followed by the raw message.

**Target:**

To train a model to classify SMS as either ham or spam for future predictions.

**Approach:**

First, some EDA and feature engineering, followed by data cleaning (removing unnecessary data) and vectorization using Bag of Words with TF-IDF weighting. Lastly, model training using a Multinomial Naive Bayes classifier.

# **Importing Libraries**

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
        
import numpy as np # linear algebra
import pandas as pd # data processing

#importing visualising libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
%matplotlib inline

#NLP
import nltk
from nltk.corpus import stopwords

#Data Cleaning
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

#Training Model
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# **Loading Data**

In [None]:
msgs = pd.read_csv("../input/sms-spam-collection-dataset/spam.csv",encoding='latin-1')

In [None]:
msgs.head()

In [None]:
#Dropping unnecessary columns
msgs.drop(msgs.columns[[2, 3, 4]], axis = 1, inplace = True)

In [None]:
#Renaming columns
msgs.rename(columns = {'v1': 'label', 'v2': 'message'},inplace=True)

In [None]:
msgs.head()

# **Exploratory Data Analysis**

In [None]:
msgs.describe()

In [None]:
msgs.info()

The output shows that there are no null values in the data.

In [None]:
msgs.groupby('label').describe()

In [None]:
#Taking labels and counts from the same Series so they always stay aligned
label_counts=msgs['label'].value_counts()
labels=label_counts.index.tolist()
Category_count=label_counts.to_numpy()

In [None]:
fig = go.Figure(data=[go.Pie(labels=labels, values=Category_count, hole=.3)])
fig.show()

We can see that 86.6% of the messages are ham, so spam messages are comparatively rare and the two classes are imbalanced.
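One consequence of this imbalance is that a classifier which always answers "ham" already scores a high accuracy, so any model should be compared against that baseline. The sketch below illustrates the idea on a hypothetical label list (13 ham, 2 spam) chosen only to mimic the ratio in the pie chart; it does not use the real dataset.

```python
from collections import Counter

# Hypothetical labels with roughly the same imbalance as the pie chart (13 ham, 2 spam)
labels = ["ham"] * 13 + ["spam"] * 2

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of a "model" that always predicts the majority class
baseline_accuracy = majority_count / len(labels)
print(majority_label, round(100 * baseline_accuracy, 1))
```

For this reason the per-class precision and recall for spam are usually more informative than overall accuracy.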

# Feature Engineering

In [None]:
msgs['length'] = msgs['message'].apply(len)
msgs.head()

In [None]:
fig = px.histogram(msgs, x="length",color="label")
fig.show()

Let's plot ham and spam separately for more clarity.

In [None]:
msgs.hist(column='length', by='label',bins=50, figsize=(10,4))

This shows that short messages are generally ham, whereas messages with a length of around 150 characters are generally spam.

In [None]:
msgs.length.describe()

The maximum length of a message is 910. Let's explore this message.

In [None]:
msgs[msgs['length']==910]['message'].iloc[0]

Well!... this has to be the longest message. After all, it's from a Romeo to his Juliet.

# Text Preprocessing

# **Tokenizing Process (Normalization)**

**Removing Punctuation using string lib and stopwords using nltk**

In [None]:
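#Forming function for msgs
#Removing punctuation and stopwords
#Building the stopword set once; calling stopwords.words() for every word is very slow
STOPWORDS = set(stopwords.words('english'))

def text_process(mess):
    #Keeping only the characters that are not punctuation
    nopunc = ''.join(char for char in mess if char not in string.punctuation)
    #Splitting on whitespace and dropping stopwords (case-insensitive comparison)
    return [word for word in nopunc.split() if word.lower() not in STOPWORDS]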

In [None]:
#making lists of tokens for each message (no lemmatization is applied)
msgs['message'].apply(text_process)
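
To see what this tokenizing step returns for a single message, here is a small self-contained sketch of the same two steps. It replaces NLTK's English stopword list with a tiny hand-picked set (`DEMO_STOPWORDS`, a hypothetical name) so that it runs without any corpus download, and the sample message is made up.

```python
import string

# Assumption: a tiny illustrative stopword set, NOT NLTK's full English list
DEMO_STOPWORDS = {"a", "the", "is", "to", "you", "have", "your"}

def text_process_demo(mess):
    # 1. Drop every punctuation character
    nopunc = ''.join(char for char in mess if char not in string.punctuation)
    # 2. Split on whitespace and drop stopwords (case-insensitive comparison)
    return [word for word in nopunc.split() if word.lower() not in DEMO_STOPWORDS]

sample = "Congratulations! You have won a FREE prize, call now."
tokens = text_process_demo(sample)
print(tokens)  # ['Congratulations', 'won', 'FREE', 'prize', 'call', 'now']
```

Note that the original casing is kept: only the stopword comparison is lowercased, so `FREE` and `free` would be treated as different tokens later on.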

# Vectorization: Implementing Bag of Words

Converting sequences of characters into sequences of numbers (raw messages into vectors) that can be used by scikit-learn algorithms.

In [None]:
#converting text doc to a matrix of token counts using scikit countvectorizer 
bow_transformer = CountVectorizer(analyzer=text_process).fit(msgs['message'])

# Print total number of vocab words
print (len(bow_transformer.vocabulary_))

In [None]:
#calculating sparsity
msgs_bow=bow_transformer.transform(msgs['message'])
print('Shape of Sparse Matrix: {}'.format(msgs_bow.shape))
print('Amount of Non-Zero occurrences: {}'.format(msgs_bow.nnz))
#nnz divided by the total number of cells is the share of non-zero entries (the matrix density)
print('non-zero share: %.2f%%' % (100.0 * msgs_bow.nnz / (msgs_bow.shape[0] * msgs_bow.shape[1])))

In [None]:
print(msgs_bow)
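
To make the bag-of-words representation concrete, the following is a conceptual sketch in plain Python. It is not scikit-learn's implementation; the three tokenized messages are hypothetical stand-ins for the output of `text_process`. It builds a vocabulary, counts tokens per message, and computes the same non-zero share as the cell above.

```python
from collections import Counter

# Hypothetical, already-tokenized messages (stand-ins for text_process output)
docs = [
    ["free", "prize", "call", "now"],
    ["call", "me", "later"],
    ["free", "free", "entry"],
]

# Vocabulary: every distinct token gets a column index (sorted for reproducibility)
vocab = {token: idx for idx, token in enumerate(sorted({t for d in docs for t in d}))}

# One row per message, one column per vocabulary token, cell = token count
bow = []
for doc in docs:
    counts = Counter(doc)
    bow.append([counts.get(token, 0) for token in vocab])

# Share of cells that are non-zero
nnz = sum(1 for row in bow for value in row if value != 0)
density = 100.0 * nnz / (len(bow) * len(vocab))
print(vocab)
print(bow)
print("non-zero entries:", nnz, "density: %.2f%%" % density)
```

Even in this toy case most cells are zero; with thousands of messages and a vocabulary of thousands of tokens the share of non-zero cells becomes tiny, which is why a sparse matrix is used.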

# Finding Term frequency–inverse document frequency

TF-IDF is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

In [None]:
tfidf_transformer = TfidfTransformer().fit(msgs_bow)
msgs_tfidf = tfidf_transformer.transform(msgs_bow)
print (msgs_tfidf.shape)

In [None]:
print(msgs_tfidf)
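
As a rough illustration of what the transformer computes, the sketch below reproduces the weighting by hand on a hypothetical 3x3 count matrix. It assumes the `TfidfTransformer` defaults (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); with other settings the numbers would differ.

```python
import math

# Hypothetical count matrix: 3 messages x 3 terms (rows = messages)
counts = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 2],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many messages each term appears
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf (assumed default: smooth_idf=True)
idf = [math.log((1 + n_docs) / (1 + df[j])) + 1 for j in range(n_terms)]

tfidf = []
for row in counts:
    weighted = [row[j] * idf[j] for j in range(n_terms)]
    norm = math.sqrt(sum(w * w for w in weighted))
    # L2-normalize each row so that long messages do not dominate
    tfidf.append([w / norm if norm else 0.0 for w in weighted])

print(df)
print([round(v, 3) for v in idf])
print([[round(v, 3) for v in row] for row in tfidf])
```

The first term appears in every message, so it receives the lowest idf (exactly 1.0), while the rarer terms are weighted more heavily.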

# **Training model**

**Using Naive Bayes Multinomial Classifier**

In [None]:
spam_detect_model=MultinomialNB().fit(msgs_tfidf, msgs['label'])

# Model Evaluation

In [None]:
#predicting on the same messages the model was trained on
all_predictions=spam_detect_model.predict(msgs_tfidf)
print(all_predictions)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(msgs['label'],all_predictions))

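We get about 98% accuracy using the Multinomial Naive Bayes classifier. Note that the predictions were made on the same messages the model was trained on, so this is a training accuracy and probably overstates how well the model does on unseen messages.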
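
The evaluation above scores the model on the data it was fitted on. A held-out evaluation could look like the sketch below. It is a minimal example under stated assumptions: the ten messages are a hypothetical toy corpus standing in for `msgs['message']` and `msgs['label']`, and `CountVectorizer` is used with its default analyzer rather than `text_process` so the block runs without NLTK data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy messages standing in for msgs['message'] and msgs['label']
messages = [
    "Free entry to win a prize, call now",
    "Win cash now, claim your free prize",
    "Urgent! You have won a free ticket",
    "Claim your free reward, text WIN",
    "Are we still meeting for lunch today",
    "I will call you later tonight",
    "Can you pick up the kids after school",
    "See you at the office tomorrow",
    "Thanks for dinner, it was lovely",
    "Running late, be there in ten minutes",
]
labels = ["spam"] * 4 + ["ham"] * 6

# Hold out part of the data; stratify keeps the ham/spam ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.3, stratify=labels, random_state=42
)

# The pipeline fits the vocabulary and idf weights on the training split only
pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("classifier", MultinomialNB()),
])
pipeline.fit(X_train, y_train)

predictions = pipeline.predict(X_test)
print(classification_report(y_test, predictions, zero_division=0))
```

Wrapping the steps in a `Pipeline` means the test messages never influence the vocabulary or the idf weights. On the real dataset, `messages` and `labels` would be replaced by the corresponding columns of `msgs`.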
