Welcome to the SMS Spam Collection dataset, a resource for studying SMS spam and legitimate messages. It is a curated collection of 5,574 English messages, assembled from several sources for SMS spam research.

Each message is tagged as either "ham" (legitimate) or "spam". The structure is simple: each row holds the label (ham or spam) in column v1 and the raw text in column v2.

The journey of this dataset is rooted in meticulous data collection efforts:

- A manual extraction of 425 SMS spam messages from the Grumbletext Web site, a UK forum dedicated to discussing SMS spam issues.
- A subset of 3,375 randomly chosen ham messages from the NUS SMS Corpus, originating mostly from Singaporeans and students at the National University of Singapore.
- A list of 450 SMS ham messages from Caroline Tag's PhD Thesis.
- The incorporation of the SMS Spam Corpus v.0.1 Big, comprising 1,002 ham messages and 322 spam messages, widely used in academic research on SMS spam filtering.

As we embark on this exploration, the goal is clear: can we leverage this dataset to develop a prediction model capable of accurately classifying spam texts? Through analysis, statistics, and machine learning methodologies, we aim to unravel the patterns and insights hidden within this corpus, contributing to the ongoing study of SMS spam filtering.

Join us in unraveling the nuances of SMS communication and the battle against spam, using data-driven approaches to enhance our understanding and defenses in the digital messaging realm.

# Table of Contents

1. [Data Cleaning](#sec1)
2. [EDA](#sec2)
3. [Text Preprocessing](#sec3)
4. [Model Building](#sec4)
5. [Pickle Files](#sec5)

The core libraries are imported here; all other libraries are imported where they are first used.

In [None]:
import pandas as pd
import numpy as np
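import nltk

# tokenizer models and stop-word list used later in the notebook
# (newer NLTK releases may need "punkt_tab" instead of "punkt")
nltk.download("punkt")
nltk.download("stopwords")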

In [None]:
df = pd.read_csv("/kaggle/input/sms-spam-collection-dataset/spam.csv", encoding='latin-1')

In [None]:
df.head(3)

# Data Cleaning

In [None]:
df.shape

In [None]:
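# keep only the label and text columns; the last 3 columns are not useful
# (.copy() avoids pandas' SettingWithCopyWarning on later assignments)
df = df[["v1","v2"]].copy()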

In [None]:
# rename columns
df = df.rename(columns={"v1":"target", "v2":"text"})

In [None]:
# label encoding target column
df["target"] = df["target"].map({"ham":0, "spam":1})

In [None]:
df

In [None]:
# check for null values in the dataframe
df.isnull().sum()

In [None]:
# check for duplicated rows in the dataframe
df.duplicated().sum()

In [None]:
# There are 403 duplicated rows in dataframe, so remove duplicates
df.drop_duplicates(inplace=True, keep="first")

In [None]:
df.shape

# EDA

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# plotting the target column
figure, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12,5))

custom_colors = ["#19b558", "#62bcde"]
df["target"].value_counts().plot(kind="pie", autopct="%.1f%%", colors = custom_colors, ax=ax1)
fig = sns.countplot(x=df["target"], palette=custom_colors, ax=ax2)
for bar in fig.containers:
    fig.bar_label(bar)

plt.show()

### Note:


<div style="background-color: #f0f0f0; padding: 20px; border-radius: 10px;">
    <p style="font-family: Arial, sans-serif; font-size: 16px; color: #333333; text-align: justify;">
       The plot above shows that category 1 (spam) is far smaller than category 0 (ham), so the data is <b>imbalanced</b>.
    </p>
</div>
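
To put a number on the imbalance, the sketch below uses the class counts implied by the source list in the introduction (before duplicates are dropped) and computes the accuracy of a trivial classifier that labels every message as ham. The always-ham baseline is a hypothetical illustration and is not part of this notebook's pipeline.

```python
# Class counts implied by the dataset description (before duplicates are dropped).
ham_count = 3375 + 450 + 1002   # NUS SMS Corpus + PhD thesis + SMS Spam Corpus v.0.1 Big
spam_count = 425 + 322          # Grumbletext + SMS Spam Corpus v.0.1 Big
total = ham_count + spam_count

spam_share = spam_count / total

# Hypothetical baseline: predict ham (0) for every message.
# It is right on every ham message and wrong on every spam message.
baseline_accuracy = ham_count / total

print(f"total messages:      {total}")
print(f"spam share:          {spam_share:.1%}")
print(f"always-ham accuracy: {baseline_accuracy:.1%}")
```

A classifier therefore has to beat roughly 87% accuracy before it adds anything, which is why accuracy alone is a weak yardstick on this data.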


Next, we add three new columns:

1. Number of Characters
2. Number of Words
3. Number of Sentences

In [None]:
# number of characters
df["num_characters"] = df["text"].apply(len)

In [None]:
# number of words
df["num_words"] = df["text"].apply(lambda x: len(nltk.word_tokenize(x)))

In [None]:
# number of sentences
df["num_sentences"] = df["text"].apply(lambda x: len(nltk.sent_tokenize(x)))

In [None]:
df

In [None]:
df[["num_characters","num_words","num_sentences"]].describe()

In [None]:
# ham messages
df[df["target"]==0][["num_characters","num_words","num_sentences"]].describe()

In [None]:
# spam messages
df[df["target"]==1][["num_characters","num_words","num_sentences"]].describe()

In [None]:
# plotting: number of characters in ham (default color) and spam (red) messages
plt.figure(figsize=(12,5))
sns.histplot(df[df["target"]==0]["num_characters"])
sns.histplot(df[df["target"]==1]["num_characters"], color="red")
plt.show()

In [None]:
# plotting: number of words in ham (default color) and spam (red) messages
plt.figure(figsize=(12,5))
sns.histplot(df[df["target"]==0]["num_words"])
sns.histplot(df[df["target"]==1]["num_words"], color="red")
plt.show()

In [None]:
# plotting: number of sentences in ham (default color) and spam (red) messages
plt.figure(figsize=(12,5))
sns.histplot(df[df["target"]==0]["num_sentences"])
sns.histplot(df[df["target"]==1]["num_sentences"], color="red")
plt.show()

In [None]:
# plotting a pairplot to see the relationships between the new columns
sns.pairplot(df, hue="target")

In [None]:
# correlation heatmap
# "number" selects every numeric column regardless of the platform's default integer width
sns.heatmap(df.select_dtypes("number").corr(), annot=True)

### Note:


<div style="background-color: #f0f0f0; padding: 20px; border-radius: 10px;">
    <p style="font-family: Arial, sans-serif; font-size: 16px; color: #333333; text-align: justify;">
       The new columns (number of characters, words and sentences) are strongly correlated with one another, which indicates <b>high multicollinearity</b>. We will therefore not use them as features in <b>model training</b>, but we will still use them for analysis.
    </p>
</div>
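
The strong relationship between the length features is easy to reproduce on a handful of made-up messages. The sketch below is only an illustration: the messages are invented, `pearson` is a hypothetical helper written for this example, and a whitespace split stands in for `nltk.word_tokenize`.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up messages, for illustration only (not taken from the dataset).
messages = [
    "ok",
    "see you at home",
    "call me when you are free tonight",
    "win a free prize now by texting the word claim to this number today",
]

num_characters = [len(m) for m in messages]
num_words = [len(m.split()) for m in messages]  # whitespace split, not nltk

r = pearson(num_characters, num_words)
print(f"correlation(num_characters, num_words) = {r:.3f}")
```

Longer messages have more characters and more words at the same time, so the two columns carry almost the same information.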


# Text Preprocessing

Here are the tasks performed in `text preprocessing`:

- Lowercase
- Tokenization
- Removing Special Characters
- Removing stop words and punctuation
- Stemming

In [None]:
from nltk.corpus import stopwords
import string
from nltk.stem.porter import PorterStemmer

In [None]:
# build these once; calling stopwords.words() and PorterStemmer() for every word is slow
STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def transform_text(text):
    # 01: lowercase and tokenize
    tokens = nltk.word_tokenize(text.lower())

    # 02: keep only alphanumeric tokens (drops punctuation and special characters)
    tokens = [word for word in tokens if word.isalnum()]

    # 03: remove stop words
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # 04: stem each remaining token and join back into one string
    return " ".join(STEMMER.stem(word) for word in tokens)

In [None]:
# testing the function
transform_text("ALi is goods goods how where boy's# ;$# ... >>(a)// !")

In [None]:
df["transformed_text"] = df["text"].apply(transform_text)

In [None]:
df

In [None]:
# Analyze ham and spam messages separately to see common and repeated words in a word cloud
from wordcloud import WordCloud

wc = WordCloud(width=600, height=500, min_font_size=12, background_color="white")

In [None]:
# for ham messages
ham_wc = wc.generate(df[df["target"]==0]["transformed_text"].str.cat(sep=" "))

In [None]:
plt.imshow(ham_wc)

In [None]:
# for spam messages
spam_wc = wc.generate(df[df["target"]==1]["transformed_text"].str.cat(sep=" "))

In [None]:
plt.imshow(spam_wc)

In [None]:
# plotting the most frequent words
from collections import Counter

In [None]:
def top_words(target):
    # collect every token from the messages of the given class
    words = []
    for msg in df[df["target"] == target]["transformed_text"].tolist():
        words.extend(msg.split())

    # count once and reuse the 30 most common words for both axes
    common = pd.DataFrame(Counter(words).most_common(30), columns=["word", "count"])
    sns.barplot(x=common["word"], y=common["count"])
    plt.xticks(rotation="vertical")
    plt.xlabel("Words")
    plt.ylabel("Frequency")
    plt.show()

In [None]:
top_words(0)

In [None]:
top_words(1)

# Model Building

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
# Using TF-IDF vectorizer
tf_idf = TfidfVectorizer(max_features=3500)

In [None]:
x = tf_idf.fit_transform(df["transformed_text"]).toarray()

In [None]:
x

In [None]:
y = df["target"].values

In [None]:
y

In [None]:
x.shape, y.shape
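
To make the TF-IDF step concrete, here is a dependency-free sketch of the weighting that scikit-learn documents for `TfidfVectorizer` with its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The function name `tfidf_rows`, the toy documents and the whitespace tokenisation are assumptions made for this example; the real vectorizer also lowercases, uses its own token pattern, and with `max_features=3500` keeps only the most frequent terms.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows); each row is an L2-normalised TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # document frequency: number of documents each term occurs in
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

    # smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows

# Toy, already-preprocessed messages (invented for illustration).
docs = ["free prize call now", "call me later", "see you later"]
vocab, rows = tfidf_rows(docs)

for term, weight in zip(vocab, rows[0]):
    print(f"{term:>6}: {weight:.3f}")
```

In the first message, "free" gets a higher weight than "call" because "call" also appears in another message and is therefore less distinctive.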

In [None]:
# splitting data into training and testing
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)

In [None]:
# importing algorithms
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

### Note:


<div style="background-color: #f0f0f0; padding: 20px; border-radius: 10px;">
    <p style="font-family: Arial, sans-serif; font-size: 16px; color: #333333; text-align: justify;">
      I experimented with several methods, including <b>Naive Bayes</b>, Gradient Boosting Classifier, Random Forest Classifier, Decision Tree Classifier, KNeighbors Classifier, SVC (Support Vector Classifier), AdaBoost Classifier, Extra Trees Classifier, XGB Classifier, and LightGBM Classifier. Naive Bayes outperformed the rest, so the code below tests only the Naive Bayes variants. All of the listed algorithms remain available for testing.

    </p>
</div>


In [None]:
for model in [GaussianNB(), MultinomialNB(), BernoulliNB()]:
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    print(f"{model}")
    print(f"Accuracy Score: {accuracy_score(y_test, y_pred)}")
    print(f"Precision Score: {precision_score(y_test, y_pred)}")
    print(f"Confusion Matrix : \n{confusion_matrix(y_test, y_pred)}")
    print("\n===================\n")
    

### Note:


<div style="background-color: #f0f0f0; padding: 20px; border-radius: 10px;">
    <p style="font-family: Arial, sans-serif; font-size: 16px; color: #333333; text-align: justify;">
        Of the <b>Naive Bayes</b> algorithms, only <b>MultinomialNB()</b> performs well, so we will use this model in production.
    </p>
</div>
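
The accuracy and precision printed in the loop can both be read off the confusion matrix. The sketch below uses made-up counts (not results from this notebook) in the layout scikit-learn's `confusion_matrix` returns for binary labels: rows are true classes, columns are predicted classes, ordered `[0 (ham), 1 (spam)]`.

```python
# Hypothetical confusion matrix, NOT taken from this notebook's output:
#                 predicted ham   predicted spam
# true ham             tn              fp
# true spam            fn              tp
matrix = [[880, 5],
          [20, 95]]
(tn, fp), (fn, tp) = matrix

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all messages classified correctly
precision = tp / (tp + fp)                   # of messages flagged as spam, share that really are spam
recall = tp / (tp + fn)                      # of real spam, share that was caught (not printed in the notebook)

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
```

Precision is worth watching for a spam filter because every false positive is a legitimate message wrongly flagged as spam.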


In [None]:
# accuracy and precision of the selected model
model = MultinomialNB()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
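
For intuition about what a multinomial Naive Bayes classifier does, here is a small from-scratch sketch. It is not scikit-learn's implementation: it works on raw word counts rather than the TF-IDF weights used in this notebook, the helper names `train_nb` and `predict_nb` are hypothetical, and the training messages are invented. The smoothing value `alpha=1.0` matches the default of `MultinomialNB`.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on whitespace-tokenised documents."""
    vocab = {tok for doc in docs for tok in doc.split()}
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(tok for d in class_docs for tok in d.split())
        total = sum(counts.values()) + alpha * len(vocab)
        # additive smoothing keeps a word unseen in one class from zeroing out that class
        log_likelihood[c] = {t: math.log((counts[t] + alpha) / total) for t in vocab}
    return log_prior, log_likelihood

def predict_nb(model, doc):
    """Return the class with the highest log posterior; unknown words are ignored."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][t] for t in doc.split() if t in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Made-up, already-preprocessed messages; 0 = ham, 1 = spam.
docs = ["free prize claim now", "win free cash prize", "see you later", "call me later tonight"]
labels = [1, 1, 0, 0]

nb_model = train_nb(docs, labels)
print(predict_nb(nb_model, "claim free cash"))   # 1 (spam)
print(predict_nb(nb_model, "see you tonight"))   # 0 (ham)
```

Each class score is the log prior plus the summed log likelihoods of the words in the message, so words that are frequent in spam pull the prediction toward spam.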

# Pickle Files


In [None]:
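# uncomment the following code to pickle the vectorizer and the model
# ("with" closes each file once it has been written)

# import pickle
# with open("vectorizer.pkl", "wb") as f:
#     pickle.dump(tf_idf, f)
# with open("model.pkl", "wb") as f:
#     pickle.dump(model, f)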
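
The pickled files are only useful if they can be loaded again. The sketch below shows the save and load round trip with context managers. So that it runs on its own, it pickles plain dictionaries as stand-ins (an assumption for this example) and writes to a temporary directory; in the notebook the objects would be the fitted `tf_idf` and `model`.

```python
import os
import pickle
import tempfile

# Stand-ins so the sketch is self-contained; in the notebook these would be
# the fitted `tf_idf` vectorizer and the fitted `model` classifier.
vectorizer_stub = {"kind": "vectorizer", "max_features": 3500}
model_stub = {"kind": "classifier", "name": "MultinomialNB"}

with tempfile.TemporaryDirectory() as tmp_dir:
    vectorizer_path = os.path.join(tmp_dir, "vectorizer.pkl")
    model_path = os.path.join(tmp_dir, "model.pkl")

    # saving: "wb" opens the file for binary writing
    with open(vectorizer_path, "wb") as fh:
        pickle.dump(vectorizer_stub, fh)
    with open(model_path, "wb") as fh:
        pickle.dump(model_stub, fh)

    # loading mirrors saving, with "rb" instead of "wb"
    with open(vectorizer_path, "rb") as fh:
        loaded_vectorizer = pickle.load(fh)
    with open(model_path, "rb") as fh:
        loaded_model = pickle.load(fh)

print(loaded_vectorizer == vectorizer_stub, loaded_model == model_stub)
```

With the real objects, a new message would first go through `transform_text`, then `loaded_vectorizer.transform([...])`, and finally `loaded_model.predict(...)`. Only unpickle files from a source you trust.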

<p style="color:Blue; font-weight:900;">the end</p>