# SMS Spam detection using Machine Learning

This notebook contains the Python code for building a spam detector with machine learning, trained on the 'spam.csv' file from Kaggle. It covers every step from importing the dataset to building and scoring the models.

We start by importing the required libraries (numpy, pandas, nltk and others), then we load the dataset with pandas.
The 'stopwords' corpus contains common words that add little to the classification, and we stem the remaining words to obtain their roots.
CountVectorizer and TfidfTransformer are used to transform the text data into a numeric matrix.
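
To make the "text into matrix" step concrete, here is a minimal pure-Python sketch of a bag-of-words count matrix followed by TF-IDF weighting. It is an illustration rather than scikit-learn's code: the toy messages are made up, and the weighting assumes the smoothed IDF and L2 row normalisation that scikit-learn documents as the TfidfTransformer defaults.

```python
import math
from collections import Counter

# Toy corpus (hypothetical messages, not taken from spam.csv)
docs = ["free prize call now", "call me later", "free free entry"]

# Vocabulary: one matrix column per distinct word, in alphabetical order
vocab = sorted({word for doc in docs for word in doc.split()})

# Bag-of-words matrix: counts[i][j] = occurrences of vocab[j] in docs[i]
counts = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

# Document frequency: number of messages that contain each word
n_docs = len(docs)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]

# Smoothed inverse document frequency: rarer words get a larger weight
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def tfidf_row(row):
    """Weight raw counts by IDF, then scale the row to unit L2 length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm else weighted

tfidf = [tfidf_row(row) for row in counts]

print(vocab)
for row in counts:
    print(row)
```

Each message becomes one row and each vocabulary word one column, so words that occur in many messages (here "call" and "free") receive a smaller IDF weight than words that occur in only one.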

### Import the data

In [None]:
import numpy as np
import pandas as pd
import re
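# Needs the NLTK stop-word corpus; run `import nltk; nltk.download('stopwords')` once if it is missing
from nltk.corpus import stopwords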
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer

In [244]:
df = pd.read_csv('spam.csv')
df.head()

| Unnamed: 0.1 | Unnamed: 0 | v1 | v2 |
|---|---|---|---|
| 0 | 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | 1 | ham | Ok lar... Joking wif u oni... |
| 2 | 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 3 | ham | U dun say so early hor... U c already then say... |
| 4 | 4 | ham | Nah I don't think he goes to usf, he lives aro... |


### Clean and prepare the data for model building

In [245]:
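# Drop the unneeded 'Unnamed: 0' column
df = df.drop('Unnamed: 0',axis=1)
# Give the remaining columns descriptive names
df = df.rename(columns={'v2': 'messages', 'v1': 'label'})
# Encode the labels as integers: ham -> 0, spam -> 1
df['label'] = df['label'].map({'ham': 0, 'spam': 1})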

Let us build a function that cleans each message: it lowercases the text, replaces special characters with spaces, collapses extra spaces, removes stopwords and stems the remaining words. The unnecessary column was already dropped in the cell above.

In [248]:
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
def clean_text_data(data):
    """Lowercase a message, strip special characters, drop stopwords and stem each word."""
    data = data.lower()
    # Replace every non-alphanumeric character with a space
    data = re.sub(r'[^0-9a-zA-Z]', ' ', data)
    # Split on whitespace (which also discards extra spaces) and drop stopwords
    words = [w for w in data.split() if w not in stop_words]
    # Stem each remaining word to its root and rebuild the string
    return " ".join(ps.stem(w) for w in words)

In [249]:
df['messages']=df['messages'].apply(clean_text_data)
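
As a quick illustration of what the regex and stop-word steps do to a single message, here is a standalone sketch. It is a simplified stand-in, not the notebook's function: `DEMO_STOP_WORDS` is a hypothetical five-word list replacing NLTK's English stop words, the sample message is made up, and stemming is left out so the example runs without NLTK.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# the notebook itself uses NLTK's full English list.
DEMO_STOP_WORDS = {"a", "the", "to", "in", "now"}

def clean_without_stemming(text):
    """Apply the lowercase, regex and stop-word steps (no stemming)."""
    text = text.lower()
    # Special characters become spaces, so "0800-1234" splits into two tokens
    text = re.sub(r"[^0-9a-z]", " ", text)
    # Collapse the runs of spaces created by the previous step
    text = re.sub(r"\s+", " ", text).strip()
    return " ".join(w for w in text.split() if w not in DEMO_STOP_WORDS)

sample = "FREE entry!! Win a   prize, call 0800-1234 NOW"
cleaned = clean_without_stemming(sample)
print(cleaned)  # free entry win prize call 0800 1234
```

In the notebook, the Porter stemmer then reduces each surviving word to its root.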

### Model building

Now we separate the inputs and the outcomes into X and Y.
We build the 'classify_msg' function to classify the messages.
It vectorizes the texts to obtain a matrix representation of the data.
It then splits the data into training and test sets, stratified so that both keep the original class proportions, and oversamples only the training set with SMOTE().

Finally, it trains the models in a loop and prints their scores.
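
The split-then-oversample order can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the scikit-learn or imbalanced-learn implementation: `stratified_split` and `smote_like` are hypothetical helpers, the 2-D toy data is made up, and the oversampler only mimics the SMOTE idea of interpolating between a minority sample and one of its nearest minority neighbours.

```python
import random

def stratified_split(labels, test_size=0.2, seed=0):
    """Split indices so that every class keeps the same train/test proportion."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx += indices[:n_test]
        train_idx += indices[n_test:]
    return train_idx, test_idx

def smote_like(points, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority samples and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        # The k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in points if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment from base to neighbour
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical toy data: 40 ham (0) and 10 spam (1) points in 2-D
labels = [0] * 40 + [1] * 10
features = [[float(i), float(i % 7)] for i in range(50)]

train_idx, test_idx = stratified_split(labels)
train_spam = [features[i] for i in train_idx if labels[i] == 1]
n_train_ham = sum(1 for i in train_idx if labels[i] == 0)

# Oversample the training spam only; the test set keeps its natural imbalance
new_spam = smote_like(train_spam, n_new=n_train_ham - len(train_spam))

n_test_spam = sum(labels[i] for i in test_idx)
print("test  :", len(test_idx) - n_test_spam, "ham /", n_test_spam, "spam")
print("train :", n_train_ham, "ham /", len(train_spam) + len(new_spam), "spam")
```

Splitting before oversampling matters: the synthetic points are built only from training messages, so the test scores still reflect the class balance of real data.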

In [250]:
X =df['messages']
Y = df['label']

In [256]:
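def classify_msg(X,Y,models):
    """Vectorize the messages, split and oversample them, then fit and score each model."""
    # Transform the text data into a TF-IDF weighted matrix.
    # Note: both transformers are fitted on all messages before the split,
    # so the vocabulary and IDF weights also see the test messages.
    vec = CountVectorizer()
    X = vec.fit_transform(X)
    tf = TfidfTransformer()
    X = tf.fit_transform(X)
    
    # Stratified 80/20 split, then deal with the imbalanced classes by
    # oversampling the training set only with SMOTE
    X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,shuffle=True, stratify=Y)
    ovsp = SMOTE()
    x_train,y_train = ovsp.fit_resample(X_train,Y_train)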
    
    for algo in models:
        print('\n')
        print(str(algo),':')
        algo.fit(x_train,y_train)
        ypred = algo.predict(X_test)
        # Predictions and score
        print('accuracy score : ',accuracy_score(Y_test,ypred))
        print('Confusion matrix : \n ', confusion_matrix(Y_test,ypred))
        print('Classificaiton report : \n ', classification_report(Y_test,ypred))

In [257]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Candidate classifiers, all with default hyperparameters
lr = LogisticRegression()
dtree = DecisionTreeClassifier()
svm = SVC()
rfc = RandomForestClassifier()

models = [lr,dtree,svm,rfc]

In [258]:
classify_msg(X,Y,models)



LogisticRegression() :
accuracy score :  0.9811659192825112
Confusion matrix : 
  [[959   7]
 [ 14 135]]
Classificaiton report : 
                precision    recall  f1-score   support

           0       0.99      0.99      0.99       966
           1       0.95      0.91      0.93       149

    accuracy                           0.98      1115
   macro avg       0.97      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115
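
As a sanity check, the LogisticRegression scores above can be recomputed by hand from its confusion matrix. The sketch below assumes scikit-learn's layout for `confusion_matrix`, where rows are true labels and columns are predicted labels, so the four cells read as TN, FP, FN, TP with spam (1) as the positive class.

```python
# Cells of the LogisticRegression confusion matrix reported above
tn, fp = 959, 7    # true ham: predicted ham / predicted spam
fn, tp = 14, 135   # true spam: predicted ham / predicted spam

total = tn + fp + fn + tp

# Accuracy: share of all messages classified correctly
accuracy = (tp + tn) / total
# Precision (spam): share of predicted spam that really is spam
precision = tp / (tp + fp)
# Recall (spam): share of real spam that was caught
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
print(f"f1-score  = {f1:.2f}")
```

The rounded values agree with the class 1 row of the report (0.95, 0.91, 0.93) and with the printed accuracy score. The 14 missed spam messages are what pull recall below precision.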



DecisionTreeClassifier() :
accuracy score :  0.9632286995515695
Confusion matrix : 
  [[943  23]
 [ 18 131]]
Classificaiton report : 
                precision    recall  f1-score   support

           0       0.98      0.98      0.98       966
           1       0.85      0.88      0.86       149

    accuracy                           0.96      1115
   macro avg       0.92      0.93      0.92      1115
weighted avg       0.96      0.96      0.96      1115



SVC() :
accuracy score :  0.9766816143497757
Confusion matrix : 
  [[966

### Thank you