<a href="https://colab.research.google.com/github/Hemant-1Kumar/SMS-Spam-Classifier/blob/main/SMS_Spam_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  SMS Spam Classifier



##### **Project Type**    - Classification


# **Project Summary -**

Spam is a pervasive issue in SMS communication. Spam messages not only cause inconvenience but also pose potential security risks to users, so effective spam detection systems are critical. In this project, I build a machine learning model to classify SMS messages as either spam or legitimate (ham), using the SMS Spam Collection dataset as the foundation.

The SMS Spam Collection dataset comprises 5,574 SMS messages, each labeled as either ham or spam. It has two columns: the first holds the label (ham or spam) and the second holds the raw message text. The messages come from multiple sources. A collection of 425 spam messages was manually extracted from the Grumbletext website, a UK-based forum where users report spam. The NUS SMS Corpus contributed 3,375 randomly selected legitimate (ham) messages, which originate primarily from Singaporean university students. A further 450 legitimate messages were sourced from Caroline Tagg's PhD thesis, and the SMS Spam Corpus v.0.1 Big added 1,002 legitimate messages and 322 spam messages.
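As a quick sanity check, the per-source counts quoted above can be summed to confirm that they add up to the stated total of 5,574 messages. The dictionary names below are illustrative only.

```python
# Per-source message counts as quoted in the summary above.
spam_sources = {"Grumbletext": 425, "SMS Spam Corpus v.0.1 Big": 322}
ham_sources = {
    "NUS SMS Corpus": 3375,
    "PhD thesis corpus": 450,
    "SMS Spam Corpus v.0.1 Big": 1002,
}

total_spam = sum(spam_sources.values())
total_ham = sum(ham_sources.values())
total = total_spam + total_ham

print(f"spam: {total_spam}, ham: {total_ham}, total: {total}")
print(f"spam share: {total_spam / total:.1%}")
```

Spam makes up only a small share of the raw collection, which is worth keeping in mind when reading the evaluation metrics later on.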

The primary aim of this project is to develop an efficient machine learning model capable of accurately distinguishing between spam and ham messages. The first step involves data preprocessing, where the raw text of the SMS messages is cleaned and prepared for analysis. Techniques such as tokenization, removal of stop words, and normalization are applied to standardize the text data. This process is crucial for ensuring that the machine learning algorithms can effectively learn patterns in the data.

Following data preparation, several machine learning models, including Logistic Regression, Naive Bayes, and Support Vector Machines (SVM), are trained on the dataset. The performance of these models is evaluated using standard metrics such as accuracy, precision, recall, and F1-score to determine the best-performing algorithm. The final model provides a reliable mechanism for identifying and filtering out spam messages, ultimately enhancing user experience and security in SMS communication.
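The four metrics named above can be computed directly from confusion-matrix counts. The sketch below is illustrative: the function name `classification_metrics` and the counts are hypothetical, and spam is treated as the positive class.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of the messages flagged as spam, how many really were spam.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the real spam messages, how many were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, not results from this notebook.
metrics = classification_metrics(tp=80, fp=5, fn=20, tn=895)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a spam filter, a false positive hides a legitimate message from the user, so precision is often watched more closely than raw accuracy.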

This project highlights the power of machine learning in addressing real-world challenges like spam detection. The insights and results from this work can contribute to the development of more advanced spam filtering systems in the future.








# **GitHub Link -**

https://github.com/Hemant-1Kumar/SMS-Spam-Classifier

# **Problem Statement**


With the increasing use of SMS communication, the prevalence of unsolicited messages, commonly referred to as spam, has risen significantly. Spam messages not only disrupt user experience but can also carry potential security threats, such as phishing attacks and fraud. Manually filtering out these spam messages is both inefficient and impractical, given the sheer volume of messages exchanged daily.

There is a critical need for an automated system that can accurately and efficiently distinguish between legitimate SMS messages (ham) and spam messages. This requires the application of machine learning techniques to analyze textual data and classify messages accordingly. The challenge lies in building a model that can generalize well to unseen data, minimizing both false positives (misclassifying ham as spam) and false negatives (failing to detect spam).

This project seeks to address the problem by developing a robust machine learning model, trained on a labeled dataset of SMS messages, to accurately classify incoming messages as either spam or ham. The goal is to enhance user security and improve the overall efficiency of SMS communication systems by reducing the impact of spam messages.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Install NLTK and download the tokenizer and stop-word data
!pip install nltk
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # assumption: newer NLTK releases look for this resource
nltk.download('stopwords')

# Core libraries
import string
from collections import Counter
import numpy as np
import pandas as pd

# Visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Text preprocessing
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
ps = PorterStemmer()

# Encoding, vectorization, splitting, and metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
encoder = LabelEncoder()

# Classifiers
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier)
from xgboost import XGBClassifier


### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('spam.csv', encoding='ISO-8859-1')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Checking number of rows and columns of the dataset using shape
print("Number of rows are: ",df.shape[0])
print("Number of columns are: ",df.shape[1])

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(df.isnull(), cbar=False)


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Check Unique Values for each variable.

In [None]:

# Check Unique Values for each variable using a for loop
for i in df.columns.tolist():
  print("No. of unique values in",i,"is",df[i].nunique())


## 3. ***Data Wrangling***

### Data Wrangling Code

**Data Cleaning**

In [None]:
df.info()

In [None]:
# Drop the last three (unnamed) columns, which are not used in the analysis
df.drop(columns=['Unnamed: 2','Unnamed: 3','Unnamed: 4'],inplace=True)

In [None]:
df.sample(5)

In [None]:
#renaming the columns
df.rename(columns={'v1':'target','v2':'text'},inplace=True)
df.head(1)

In [None]:
encoder = LabelEncoder()
df['target'] = encoder.fit_transform(df['target'])


In [None]:
df.head()

In [None]:
# missing values
df.isnull().sum()

In [None]:
# check for duplicate values
df.duplicated().sum()

In [None]:
# remove duplicates
df = df.drop_duplicates(keep='first')
df.duplicated().sum()


In [None]:
df.shape

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

In [None]:
df.head()

In [None]:
df['target'].value_counts()

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.pie(df['target'].value_counts(), labels=['ham','spam'],autopct="%0.2f")
plt.show()

**Data is imbalanced**
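Because the classes are imbalanced, accuracy alone can look deceptively good: a model that always predicts the majority class already scores highly. The sketch below illustrates this with a hypothetical label list (not the notebook's actual counts); the helper name `majority_baseline_accuracy` is an assumption made for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the majority label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Hypothetical labels: 0 = ham, 1 = spam.
example_labels = [0] * 87 + [1] * 13

label, baseline = majority_baseline_accuracy(example_labels)
print(f"Always predicting {label} gives accuracy {baseline:.2f}")
```

Any trained classifier should be compared against this baseline, and metrics such as precision help show whether it is actually catching spam.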

In [None]:
df['num_characters'] = df['text'].apply(len)

In [None]:
df.head()

In [None]:
# num of words
df['num_words'] = df['text'].apply(lambda x:len(nltk.word_tokenize(x)))

In [None]:
df.head(1)

In [None]:
df['num_sentences'] = df['text'].apply(lambda x:len(nltk.sent_tokenize(x)))

In [None]:
df.head(1)

In [None]:
df[['num_characters','num_words','num_sentences']].describe()

In [None]:
# ham
df[df['target'] == 0][['num_characters','num_words','num_sentences']].describe()

In [None]:
#spam
df[df['target'] == 1][['num_characters','num_words','num_sentences']].describe()

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(12,6))
sns.histplot(df[df['target'] == 0]['num_characters'])
sns.histplot(df[df['target'] == 1]['num_characters'],color='red')

#### Chart - 3

In [None]:
# Chart - 3 visualization code
sns.pairplot(df,hue='target')

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Select numeric columns regardless of platform-specific integer width
numeric_df = df.select_dtypes(include='number')

# Create heatmap with the correlation of numeric columns
sns.heatmap(numeric_df.corr(), annot=True)


## ***5. Feature Engineering & Data Pre-processing***

###  Textual Data Preprocessing
(Mandatory for textual datasets, e.g., NLP, sentiment analysis, text clustering.)

* Lower case
* Tokenization
* Removing special characters
* Removing stop words and punctuation
* Stemming



In [None]:
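# Build the stop-word set once instead of reloading the list for every token
STOP_WORDS = set(stopwords.words('english'))

def transform_text(text):
    # Lower-case and tokenize
    tokens = nltk.word_tokenize(text.lower())

    # Keep alphanumeric tokens only (drops punctuation and special characters)
    tokens = [t for t in tokens if t.isalnum()]

    # Remove stop words and any remaining punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS and t not in string.punctuation]

    # Stem each token and rejoin into a single string
    return " ".join(ps.stem(t) for t in tokens)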

In [None]:
transform_text("I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.")
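To make the two filtering steps inside `transform_text` easier to follow, here is a small standard-library sketch that applies them to a hand-written token list. The tokens and the tiny stop-word set are hypothetical stand-ins (not NLTK's tokenizer output or stop-word list), and stemming is left out.

```python
import string

# Hypothetical, already lower-cased tokens similar to what a tokenizer might emit.
tokens = ["free", "entry", "in", "2", "a", "weekly", "comp", "!", "n't", "..."]

# Tiny illustrative stop-word set (NOT the NLTK English list).
stop_words = {"in", "a", "the", "to"}

# Step 1: keep only tokens made up entirely of letters and digits.
alnum_tokens = [t for t in tokens if t.isalnum()]

# Step 2: drop stop words; the punctuation check mirrors the notebook's filter.
content_tokens = [
    t for t in alnum_tokens
    if t not in stop_words and t not in string.punctuation
]

print(alnum_tokens)
print(content_tokens)
```

Note that any token passing `isalnum()` contains no punctuation characters, so the second punctuation check acts only as a safeguard. Contraction fragments such as `n't` are removed in the first step.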


In [None]:
df['text'][10]

In [None]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
ps.stem('loving')

In [None]:
df['transformed_text'] = df['text'].apply(transform_text)

In [None]:
df.head()

In [None]:
from wordcloud import WordCloud
wc = WordCloud(width=500,height=500,min_font_size=10,background_color='white')

In [None]:
spam_wc = wc.generate(df[df['target'] == 1]['transformed_text'].str.cat(sep=" "))


In [None]:
plt.figure(figsize=(10,3))
plt.imshow(spam_wc)


In [None]:
ham_wc = wc.generate(df[df['target'] == 0]['transformed_text'].str.cat(sep=" "))


In [None]:
plt.figure(figsize=(15,6))
plt.imshow(ham_wc)

In [None]:
df.head()

In [None]:
spam_corpus = []
for msg in df[df['target'] == 1]['transformed_text'].tolist():
    for word in msg.split():
        spam_corpus.append(word)


In [None]:
len(spam_corpus)

In [None]:
# Plot the 30 most frequent (stemmed) words in spam messages
most_common_words = Counter(spam_corpus).most_common(30)

df_common_words = pd.DataFrame(most_common_words, columns=['word', 'count'])

sns.barplot(x='word', y='count', data=df_common_words)

plt.xticks(rotation='vertical')

plt.show()


In [None]:
ham_corpus = []
for msg in df[df['target'] == 0]['transformed_text'].tolist():
    for word in msg.split():
        ham_corpus.append(word)

In [None]:
len(ham_corpus)

In [None]:
# Text Vectorization
# using Bag of Words
df.head()

## ***7. ML Model Implementation***

In [None]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
cv = CountVectorizer()
tfidf = TfidfVectorizer(max_features=3000)

In [None]:
X = tfidf.fit_transform(df['transformed_text']).toarray()


In [None]:
X.shape

In [None]:
y = df['target'].values

In [None]:
y

In [None]:
from sklearn.model_selection import train_test_split


In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)
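One point worth noting: above, the TF-IDF vectorizer is fitted on the full dataset before the split, so vocabulary and IDF statistics from the test messages leak into training. A common alternative is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline. The sketch below uses hypothetical toy messages and variable names, and is not the notebook's actual setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages (1 = spam, 0 = ham), only to make the sketch runnable.
toy_texts = [
    "win free prize now", "free entry claim cash",
    "urgent call to claim prize", "cash award waiting call now",
    "see you at lunch", "are you home yet",
    "call me when you land", "thanks for the notes",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, keeping the class ratio in both parts.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=2, stratify=toy_labels
)

# The vectorizer lives inside the pipeline, so fit() only ever sees training text.
pipe = make_pipeline(TfidfVectorizer(max_features=3000), MultinomialNB())
pipe.fit(train_texts, train_labels)
toy_preds = pipe.predict(test_texts)

vocab = pipe.named_steps["tfidfvectorizer"].vocabulary_
print(len(vocab), "terms learned from", len(train_texts), "training messages")
```

In the notebook, the same idea would apply to `df['transformed_text']` and `df['target']`; the reported scores could shift slightly once the leakage is removed.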


In [None]:
from sklearn.naive_bayes import GaussianNB,MultinomialNB,BernoulliNB
from sklearn.metrics import accuracy_score,confusion_matrix,precision_score

In [None]:
gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()

In [None]:
gnb.fit(X_train,y_train)
y_pred1 = gnb.predict(X_test)
print(accuracy_score(y_test,y_pred1))
print(confusion_matrix(y_test,y_pred1))
print(precision_score(y_test,y_pred1))

In [None]:
mnb.fit(X_train,y_train)
y_pred2 = mnb.predict(X_test)
print(accuracy_score(y_test,y_pred2))
print(confusion_matrix(y_test,y_pred2))
print(precision_score(y_test,y_pred2))

In [None]:
bnb.fit(X_train,y_train)
y_pred3 = bnb.predict(X_test)
print(accuracy_score(y_test,y_pred3))
print(confusion_matrix(y_test,y_pred3))
print(precision_score(y_test,y_pred3))

In [None]:
svc = SVC(kernel='sigmoid', gamma=1.0)
knc = KNeighborsClassifier()
mnb = MultinomialNB()
dtc = DecisionTreeClassifier(max_depth=5)
lrc = LogisticRegression(solver='liblinear', penalty='l1')
rfc = RandomForestClassifier(n_estimators=50, random_state=2)
abc = AdaBoostClassifier(n_estimators=50, random_state=2)
bc = BaggingClassifier(n_estimators=50, random_state=2)
etc = ExtraTreesClassifier(n_estimators=50, random_state=2)
gbdt = GradientBoostingClassifier(n_estimators=50,random_state=2)
xgb = XGBClassifier(n_estimators=50,random_state=2)

In [None]:
clfs = {
    'SVC' : svc,
    'KN' : knc,
    'NB': mnb,
    'DT': dtc,
    'LR': lrc,
    'RF': rfc,
    'AdaBoost': abc,
    'BgC': bc,
    'ETC': etc,
    'GBDT':gbdt,
    'xgb':xgb
}

In [None]:
def train_classifier(clf,X_train,y_train,X_test,y_test):
    clf.fit(X_train,y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test,y_pred)
    precision = precision_score(y_test,y_pred)

    return accuracy,precision

In [None]:
train_classifier(svc,X_train,y_train,X_test,y_test)


In [None]:
accuracy_scores = []
precision_scores = []

for name,clf in clfs.items():

    current_accuracy,current_precision = train_classifier(clf, X_train,y_train,X_test,y_test)

    print("For ",name)
    print("Accuracy - ",current_accuracy)
    print("Precision - ",current_precision)

    accuracy_scores.append(current_accuracy)
    precision_scores.append(current_precision)

In [None]:
performance_df = pd.DataFrame({'Algorithm': list(clfs.keys()), 'Accuracy': accuracy_scores,
                               'Precision': precision_scores}).sort_values('Precision', ascending=False)


In [None]:
performance_df

In [None]:
performance_df1 = pd.melt(performance_df, id_vars = "Algorithm")

In [None]:
performance_df1

In [None]:
sns.catplot(x='Algorithm', y='value',
            hue='variable', data=performance_df1,
            kind='bar', height=5)

plt.ylim(0.5, 1.0)

plt.xticks(rotation='vertical')

plt.show()

# **Model Improvement**
1. Change the `max_features` parameter of `TfidfVectorizer`
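The sketch below shows, on a hypothetical toy corpus, how `max_features` caps the vocabulary that the TF-IDF vectorizer keeps. In the notebook, one would presumably rerun the model comparison for each candidate value and compare accuracy and precision; the values tried here are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, only to make the sketch runnable.
demo_corpus = [
    "win free prize now",
    "free entry claim cash",
    "see you at lunch",
    "call me when you land",
]

shapes = {}
for max_features in (5, 10, None):
    vec = TfidfVectorizer(max_features=max_features)
    X_demo = vec.fit_transform(demo_corpus)
    # Number of columns equals the size of the retained vocabulary.
    shapes[max_features] = X_demo.shape
    print(f"max_features={max_features}: matrix shape {X_demo.shape}")
```

When `max_features` is set, the vectorizer keeps only the terms with the highest corpus frequency, so a smaller value trades vocabulary coverage for a more compact feature matrix.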

In [None]:
# Voting Classifier
svc = SVC(kernel='sigmoid', gamma=1.0,probability=True)
mnb = MultinomialNB()
etc = ExtraTreesClassifier(n_estimators=50, random_state=2)

from sklearn.ensemble import VotingClassifier

In [None]:
voting = VotingClassifier(estimators=[('svm', svc), ('nb', mnb), ('et', etc)],voting='soft')


In [None]:
voting.fit(X_train,y_train)


In [None]:
y_pred = voting.predict(X_test)
print("Accuracy",accuracy_score(y_test,y_pred))
print("Precision",precision_score(y_test,y_pred))
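With `voting='soft'`, the ensemble averages the class probabilities predicted by each model and picks the class with the highest average, rather than counting one vote per model. The standard-library sketch below contrasts the two schemes for a single message; the probabilities are hypothetical and were not produced by the models above.

```python
# Hypothetical [P(ham), P(spam)] outputs from three models for one message.
probas = {
    "svm": [0.45, 0.55],
    "nb": [0.90, 0.10],
    "et": [0.45, 0.55],
}
class_names = ["ham", "spam"]

# Soft voting: average the probabilities per class, then take the argmax.
n_models = len(probas)
avg = [sum(p[i] for p in probas.values()) / n_models for i in range(2)]
soft_vote = max(range(2), key=lambda i: avg[i])

# Hard voting: each model casts one vote for its most likely class.
hard_votes = [max(range(2), key=lambda i: p[i]) for p in probas.values()]
hard_vote = max(set(hard_votes), key=hard_votes.count)

print("average probabilities:", [round(a, 3) for a in avg])
print("soft vote:", class_names[soft_vote])
print("hard vote:", class_names[hard_vote])
```

Here two models lean slightly towards spam while one is very confident about ham, so soft voting and hard voting disagree. This is also why the SVC above is created with `probability=True`: soft voting needs probability estimates from every estimator.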

In [None]:
# Applying stacking
estimators=[('svm', svc), ('nb', mnb), ('et', etc)]
final_estimator=RandomForestClassifier()

In [None]:
from sklearn.ensemble import StackingClassifier


In [None]:
clf = StackingClassifier(estimators=estimators, final_estimator=final_estimator)


In [None]:
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print("Accuracy",accuracy_score(y_test,y_pred))
print("Precision",precision_score(y_test,y_pred))

## ***8.*** ***Future Work (Optional)***

### 1. Save the best-performing ML model in pickle or joblib format for deployment.


In [None]:
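# Save the fitted vectorizer and model. The ensembles above fit clones, so the
# re-created `mnb` is still unfitted; fit it before pickling.
import pickle
mnb.fit(X_train, y_train)
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf, f)
with open('model.pkl', 'wb') as f:
    pickle.dump(mnb, f)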
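For deployment, the saved vectorizer and model have to be loaded together and applied in the same order as during training. The sketch below illustrates that round trip on hypothetical toy data, serializing to in-memory bytes instead of the `.pkl` files so that it stays self-contained; `predict_label` is an illustrative helper, not part of the notebook.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data (1 = spam, 0 = ham).
demo_texts = ["win free prize now", "free cash claim now",
              "see you at lunch", "call me when you land"]
demo_labels = [1, 1, 0, 0]

demo_vectorizer = TfidfVectorizer()
demo_model = MultinomialNB().fit(demo_vectorizer.fit_transform(demo_texts), demo_labels)

# Serialize both fitted objects, then restore them as a deployment script would.
loaded_vectorizer = pickle.loads(pickle.dumps(demo_vectorizer))
loaded_model = pickle.loads(pickle.dumps(demo_model))


def predict_label(message):
    """Vectorize one message with the restored vectorizer and classify it."""
    features = loaded_vectorizer.transform([message])
    return "spam" if loaded_model.predict(features)[0] == 1 else "ham"


print(predict_label("claim your free prize"))
```

In the notebook's setting, an incoming message would also need to pass through `transform_text` before vectorization, since the TF-IDF vocabulary was built from the transformed text.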

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

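This notebook cleaned the SMS Spam Collection data by dropping the unused columns and duplicate rows, explored character, word, and sentence counts for ham and spam messages, and normalized the text through lower-casing, tokenization, stop-word removal, and stemming. The transformed messages were vectorized with TF-IDF (capped at 3,000 features) and used to compare Naive Bayes variants, SVM, logistic regression, k-nearest neighbours, a decision tree, and several tree-based ensembles on accuracy and precision, followed by soft-voting and stacking ensembles. Because the data is imbalanced, precision was tracked alongside accuracy. Finally, the TF-IDF vectorizer and a Multinomial Naive Bayes model were saved for later deployment.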

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***