## Spam Detection with Machine Learning

Detecting spam in emails and messages is an application that every big tech company tries to improve for its customers. Apple’s official messaging app and Google’s Gmail are good examples of applications where spam detection works well to protect users from unwanted messages. So, if you are looking to build a spam detection system, this article is for you. In this article, I will walk you through the task of spam detection with machine learning using Python.

## Spam Detection

Now let’s see how to train a machine learning model to detect spam using Python. I’ll start by importing the necessary Python libraries and loading the dataset.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Data Collection

In [3]:
data = pd.read_csv(r"/content/spam_messege.csv", encoding="latin-1")
data.head()

| | class | message | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | not_sapm | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | not_sapm | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | not_sapm | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | not_sapm | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |


In [4]:
print(data.head())

      class                                            message Unnamed: 2  \
0  not_sapm  Go until jurong point, crazy.. Available only ...        NaN   
1  not_sapm                      Ok lar... Joking wif u oni...        NaN   
2      spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3  not_sapm  U dun say so early hor... U c already then say...        NaN   
4  not_sapm  Nah I don't think he goes to usf, he lives aro...        NaN   

  Unnamed: 3 Unnamed: 4  
0        NaN        NaN  
1        NaN        NaN  
2        NaN        NaN  
3        NaN        NaN  
4        NaN        NaN  


## Data Pre-Processing

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   class       5572 non-null   object
 1   message     5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [6]:
data.describe()

| | class | message | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| count | 5572 | 5572 | 50 | 12 | 6 |
| unique | 2 | 5169 | 43 | 10 | 5 |
| top | not_sapm | Sorry, I'll call later | bt not his girlfrnd... G o o d n i g h t . . .@" | MK17 92H. 450Ppw 16" | GNT:-)" |
| freq | 4825 | 30 | 3 | 2 | 2 |


In [7]:
print(data.describe())

           class                 message  \
count       5572                    5572   
unique         2                    5169   
top     not_sapm  Sorry, I'll call later   
freq        4825                      30   

                                               Unnamed: 2  \
count                                                  50   
unique                                                 43   
top      bt not his girlfrnd... G o o d n i g h t . . .@"   
freq                                                    3   

                   Unnamed: 3 Unnamed: 4  
count                      12          6  
unique                     10          5  
top      MK17 92H. 450Ppw 16"    GNT:-)"  
freq                        2          2  


In [8]:
data.columns

Index(['class', 'message', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

From this dataset, class and message are the only columns we need to train a machine learning model for spam detection, so let’s keep just these two columns as the new dataset. Note that the non-spam label is spelled not_sapm in the data, and it is used as-is throughout.

In [9]:
data = data[["class", "message"]]

In [10]:
data[["class"]].value_counts()

class   
not_sapm    4825
spam         747
Name: count, dtype: int64
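The counts above show that the two classes are far from balanced. As a rough sketch, the shares can be computed with `value_counts(normalize=True)`. The frame below is rebuilt from the counts printed above rather than read from the CSV, and `toy_data` and `class_shares` are names made up for this illustration.

```python
import pandas as pd

# Rebuild only the label column from the counts printed above
# (4825 non-spam, 747 spam); the label spelling follows the dataset.
toy_data = pd.DataFrame({"class": ["not_sapm"] * 4825 + ["spam"] * 747})

# normalize=True returns fractions instead of raw counts.
class_shares = toy_data["class"].value_counts(normalize=True)

print(class_shares.round(3))
```

Roughly 87% of the messages are non-spam, so a model that always answers not_sapm would already look fairly accurate. That is worth keeping in mind when judging the classifier later.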

In [11]:
data[["message"]].value_counts().head(5)

message                                                                                                                                                                            
Sorry, I'll call later                                                                                                                                                                 30
I cant pick the phone right now. Pls send a message                                                                                                                                    12
Ok...                                                                                                                                                                                  10
Your opinion about me? 1. Over 2. Jada 3. Kusruthi 4. Lovable 5. Silent 6. Spl character 7. Not matured 8. Stylish 9. Simple Pls reply..                                                4
Wen ur lovable bcums angry wid u, dnt take it seriously.. Coz being angry is

## Feature Selection

In [12]:
feature = np.array(data["message"])
target = np.array(data["class"])

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
fem = CountVectorizer()
feature = fem.fit_transform(feature)

## Splitting Data

In [14]:
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(feature, target, test_size=0.33, random_state=42)

In [15]:
xtest.shape, xtrain.shape

((1839, 8710), (3733, 8710))

In [16]:
ytest.shape, ytrain.shape

((1839,), (3733,))
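In the cells above, the vectorizer is fitted on all messages before the split, so its vocabulary also includes words that appear only in the test messages. One possible alternative, shown here as a sketch and not as the approach used in this article, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline. The toy messages and variable names below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration; labels are spelled as in the dataset.
toy_messages = [
    "win free cash prize now", "free entry win prize",
    "claim your free reward", "urgent prize waiting call now",
    "see you at lunch", "call me later tonight",
    "are you coming home", "thanks for the help today",
]
toy_labels = ["spam"] * 4 + ["not_sapm"] * 4

# Split the raw text first; stratify keeps both classes in each part.
msg_train, msg_test, label_train, label_test = train_test_split(
    toy_messages, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits the vectorizer on the training messages only.
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(msg_train, label_train)

toy_predictions = spam_pipeline.predict(msg_test)
print(list(zip(msg_test, toy_predictions)))
```

With this layout, words seen only in the test messages are simply ignored at prediction time, which is closer to how the model would behave on new messages.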

## Choosing Model & Training The Model

In [17]:
from sklearn.naive_bayes import MultinomialNB
nbm = MultinomialNB()
nbm.fit(xtrain, ytrain.ravel())

## Predicting Test Data

In [18]:
predictions = nbm.predict(xtest)

In [19]:
predictions

array(['spam', 'not_sapm', 'spam', ..., 'not_sapm', 'not_sapm', 'spam'],
      dtype='<U8')
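The predictions on their own do not say how well the model performs. A minimal sketch of scoring them with `accuracy_score` and `confusion_matrix` from scikit-learn follows. The two label lists are invented for illustration, so the numbers say nothing about the real model.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels for illustration; spelling follows the dataset.
true_labels = ["spam", "not_sapm", "spam", "not_sapm", "not_sapm", "spam"]
predicted_labels = ["spam", "not_sapm", "not_sapm", "not_sapm", "spam", "spam"]

# Fraction of messages whose predicted label matches the true one.
accuracy = accuracy_score(true_labels, predicted_labels)

# Rows are true classes, columns are predicted classes,
# both in the order given by `labels`.
matrix = confusion_matrix(true_labels, predicted_labels, labels=["not_sapm", "spam"])

print(accuracy)
print(matrix)
```

In the notebook, the equivalent calls would take `ytest` and `predictions` as arguments. The confusion matrix is useful here because the classes are imbalanced, and accuracy alone can hide missed spam.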

Now let’s test this model by taking a message as user input and checking whether it is classified as spam or not.


In [20]:
sample = input('Enter a message:')
# Reuse the fitted vectorizer; a separate name avoids overwriting the `data` DataFrame
sample_features = fem.transform([sample])
print(nbm.predict(sample_features))

Enter a message:you won $40 cash price
['spam']
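The interactive cell above can also be wrapped in a small helper so the same check can be reused without `input()`. This is only a sketch: `classify_message` is a hypothetical helper name, and the model here is trained on a handful of invented messages so that the block runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def classify_message(text, vectorizer, model):
    """Return the predicted label for a single message (hypothetical helper)."""
    # transform (not fit_transform): reuse the vocabulary learned in training.
    features = vectorizer.transform([text])
    return str(model.predict(features)[0])


# Toy training data, invented for illustration; labels as in the dataset.
toy_messages = [
    "win free cash prize now", "free entry win prize",
    "see you at lunch", "call me later tonight",
]
toy_labels = ["spam", "spam", "not_sapm", "not_sapm"]

toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB()
toy_model.fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

spam_result = classify_message("free prize", toy_vectorizer, toy_model)
ham_result = classify_message("lunch later", toy_vectorizer, toy_model)
print(spam_result, ham_result)
```

In the notebook, the same helper could be called with the fitted `fem` and `nbm` objects in place of the toy ones.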


## Summary

So this is how you can train a machine learning model to detect whether an email or a message is spam or not. A spam detector classifies messages or emails based on their text content, so that you are notified only about the ones that matter to you. I hope you liked this article on detecting spam with machine learning using Python. Feel free to ask your questions in the comments section below.

## Babar Ali Assad