# Dataset Information
The "spam" concept is diverse: advertisements for products or websites, make-money-fast schemes, chain letters, pornography, and so on.

The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 English SMS messages, each tagged as either ham (legitimate) or spam.

# Attributes
Message

Category (spam/ham)

In [1]:
# importing the needed libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.feature_extraction.text import TfidfVectorizer 
# TfidfVectorizer: transforms a collection of text documents into a numerical (TF-IDF) representation that ML algorithms can work with
import nltk
import re
from nltk.corpus import stopwords
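
The import cell pulls in `TfidfVectorizer`, while the training pipeline further down chains `CountVectorizer` and `TfidfTransformer` instead. With default settings the two routes should give the same matrix. The sketch below checks this on a three-message toy corpus that is made up for illustration and not taken from `Spam.csv`.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus, made up for illustration (not taken from Spam.csv)
docs = ["free entry win prize", "ok see you later", "win free prize now"]

# Route 1: a single TfidfVectorizer
direct = TfidfVectorizer().fit_transform(docs)

# Route 2: raw token counts first, then TF-IDF re-weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# Rows are documents, columns are vocabulary terms
print(direct.shape)

# Compare the two sparse matrices element by element
same = np.allclose(direct.toarray(), two_step.toarray())
print(same)
```

Each row is one message and each column one vocabulary term, so the classifier downstream sees a fixed-width numeric vector regardless of message length.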

# Loading the dataset

In [16]:
df = pd.read_csv('Spam.csv')
df.head()

|   | Category | Message |
|---|----------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


# Preprocessing the dataset

In [17]:
# check for null values
df.isnull().sum()

Category    0
Message     0
dtype: int64

In [18]:
#datatype info 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [6]:
# requires the NLTK stopwords corpus; run nltk.download('stopwords') once if it is missing
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    # convert to lowercase
    text = text.lower()
    # remove special characters
    text = re.sub(r'[^0-9a-zA-Z]', ' ', text)
    # remove extra spaces
    text = re.sub(r'\s+', ' ', text)
    # remove stopwords
    text = " ".join(word for word in text.split() if word not in STOPWORDS)
    return text
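
To see what each step of `clean_text` does without downloading the NLTK corpus, here is a minimal sketch of the same four steps. `DEMO_STOPWORDS` and `clean_text_demo` are hypothetical names, and the tiny stopword set is only a stand-in for `stopwords.words('english')`, which is much longer.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer
DEMO_STOPWORDS = {"i", "he", "to", "the", "a", "in", "until", "only", "so", "then"}

def clean_text_demo(text, stop_words=DEMO_STOPWORDS):
    # convert to lowercase
    text = text.lower()
    # replace anything that is not a digit or a letter with a space
    text = re.sub(r'[^0-9a-z]', ' ', text)
    # collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # drop stopwords; split()/join also trims leading and trailing spaces
    return " ".join(word for word in text.split() if word not in stop_words)

cleaned = clean_text_demo("Ok lar... Joking wif u oni...")
print(cleaned)  # ok lar joking wif u oni
```

Because punctuation is replaced by spaces rather than deleted, tokens such as `lar...` and `joking` stay separate words instead of being glued together.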

In [8]:
# clean the messages
df['clean_text'] = df['Message'].apply(clean_text)
df.head()

|   | Category | Message | clean_text |
|---|----------|---------|------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | go jurong point crazy available bugis n great ... |
| 1 | ham | Ok lar... Joking wif u oni... | ok lar joking wif u oni |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | free entry 2 wkly comp win fa cup final tkts 2... |
| 3 | ham | U dun say so early hor... U c already then say... | u dun say early hor u c already say |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | nah think goes usf lives around though |


In [9]:
# features and labels; note that the raw 'Message' column is used here, not 'clean_text'
x = df['Message']
y = df['Category']

# Model Training

In [31]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report
# vectorizers and transformers that turn a collection of text documents into numerical features
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

def classify(model, X, y):
    # train test split
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, shuffle=True, stratify=y)
    # model training: token counts -> TF-IDF weights -> classifier
    pipeline_model = Pipeline([('vect', CountVectorizer()),
                               ('tfidf', TfidfTransformer()),
                               ('clf', model)])
    pipeline_model.fit(x_train, y_train)

    # NOTE: the label says "training data", but this score is computed on the held-out test split
    print('The accuracy of the training data is :\n', pipeline_model.score(x_test, y_test)*100)

    # cv_score = cross_val_score(pipeline_model, X, y, cv=5)
    # print("CV Score:", np.mean(cv_score)*100)
    y_pred = pipeline_model.predict(x_test)
    print('The classification report of the testing data is : \n',classification_report(y_test, y_pred))
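
The cross-validation lines in `classify` are commented out. If they were enabled, the whole pipeline (not the bare classifier) would need to be passed to `cross_val_score`, since the classifier alone cannot consume raw text. Below is a minimal sketch of that idea on a toy corpus. The messages, labels, and the choice of `cv=2` are assumptions made for illustration and do not come from `Spam.csv`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy messages and labels, made up for illustration (not taken from Spam.csv)
texts = [
    "win a free prize now", "free entry claim your prize",
    "urgent call now to win cash", "free cash prize call now",
    "see you at lunch", "ok call me later",
    "are you coming home", "thanks see you later",
]
labels = ["spam"] * 4 + ["ham"] * 4

pipeline = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LogisticRegression())])

# Passing the pipeline means the vectorizer is refit on the training part of each fold,
# so no vocabulary or IDF statistics leak in from the held-out part
scores = cross_val_score(pipeline, texts, labels, cv=2)
print("CV Score:", scores.mean() * 100)
```

On the real data a larger `cv` such as the 5 in the commented-out line would be the more usual choice; 2 is used here only because the toy corpus is tiny.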

In [32]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
classify(model, x,y)

The accuracy of the training data is :
 96.98492462311557
The classification report of the testing data is : 
               precision    recall  f1-score   support

         ham       0.97      1.00      0.98      1206
        spam       1.00      0.78      0.87       187

    accuracy                           0.97      1393
   macro avg       0.98      0.89      0.93      1393
weighted avg       0.97      0.97      0.97      1393
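
The `macro avg` and `weighted avg` rows can be reproduced from the per-class rows. The sketch below does this for recall, using the rounded values printed in the report above, so the results are approximate. It also computes the share of ham in the test split, which is the accuracy a model would get by always predicting ham.

```python
# Per-class recall and support, copied from the LogisticRegression report above
recall = {"ham": 1.00, "spam": 0.78}
support = {"ham": 1206, "spam": 187}
total = sum(support.values())  # 1393 test messages

# Macro average: plain mean over classes, each class counts equally
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: mean over classes, weighted by the number of test messages
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

# Accuracy of a trivial model that always predicts "ham"
majority_baseline = support["ham"] / total

print(round(macro_recall, 2))       # 0.89
print(round(weighted_recall, 2))    # 0.97
print(round(majority_baseline, 3))  # 0.866
```

Since ham makes up most of the test split, accuracy and the weighted average are dominated by the ham class. The spam recall and the macro average are the more telling figures for a spam filter.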



In [33]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
classify(model, x, y)

The accuracy of the training data is :
 95.47738693467338
The classification report of the testing data is : 
               precision    recall  f1-score   support

         ham       0.95      1.00      0.97      1206
        spam       1.00      0.66      0.80       187

    accuracy                           0.95      1393
   macro avg       0.98      0.83      0.89      1393
weighted avg       0.96      0.95      0.95      1393



In [34]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
classify(model, x, y)

The accuracy of the training data is :
 97.05671213208902
The classification report of the testing data is : 
               precision    recall  f1-score   support

         ham       0.97      1.00      0.98      1206
        spam       1.00      0.78      0.88       187

    accuracy                           0.97      1393
   macro avg       0.98      0.89      0.93      1393
weighted avg       0.97      0.97      0.97      1393
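
To compare the three runs side by side, the figures printed above can be collected into a small table and ranked. This is only a sketch: the numbers are copied by hand from the reports (rounded as printed), and ranking by spam F1 is one reasonable choice among several.

```python
# Test-set figures copied from the three reports above (spam class, rounded as printed)
results = {
    "LogisticRegression":     {"accuracy": 96.98, "spam_recall": 0.78, "spam_f1": 0.87},
    "MultinomialNB":          {"accuracy": 95.48, "spam_recall": 0.66, "spam_f1": 0.80},
    "RandomForestClassifier": {"accuracy": 97.06, "spam_recall": 0.78, "spam_f1": 0.88},
}

# Rank models by F1 on the spam class, best first
ranked = sorted(results, key=lambda name: results[name]["spam_f1"], reverse=True)

for name in ranked:
    row = results[name]
    print(f"{name:<24} acc={row['accuracy']:.2f}  "
          f"spam recall={row['spam_recall']:.2f}  spam f1={row['spam_f1']:.2f}")
```

All three models reach a spam precision of 1.00 in the reports, so they differ mainly in how much spam they miss. `RandomForestClassifier()` is created without a `random_state`, so its figures may shift slightly between runs.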

