**SMS SPAM DETECTION**

**About Dataset**

The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 English SMS messages, each tagged as ham (legitimate) or spam.

**Objective**

The objective of this project is to develop a machine learning model that accurately classifies SMS messages as either "spam" or "ham" (legitimate). Using natural language processing techniques and machine learning algorithms, the aim is a robust SMS spam detection system that filters out unwanted and potentially harmful messages.

The project will involve data preprocessing, feature extraction, model training, and rigorous evaluation to achieve a high level of accuracy in identifying spam messages while minimizing false positives and false negatives.

**Approach**

1) Load the data and the libraries

2) Data preparation and transformation

    Convert all text to lowercase

    Remove all special characters

    Remove stop words

    Lemmatization and stemming

3) Vectorization

    TF-IDF vectorizer

4) Machine learning and also deep learning
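
The steps above can be sketched end to end as a single scikit-learn `Pipeline`. This is only an illustration on a made-up four-message corpus (the `texts` and `labels` below are assumptions, not rows from the dataset), and it relies on the lowercasing and English stop-word list built into `TfidfVectorizer` rather than the NLTK-based preprocessing used later in the notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Made-up toy messages, for illustration only.
texts = [
    "Free entry to win a cash prize, text WIN now",
    "Are we still meeting for lunch today?",
    "URGENT! You have won a free holiday, call now",
    "Ok, I will call you later tonight",
]
labels = ["spam", "ham", "spam", "ham"]

# Vectorization and model training chained into one estimator.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipeline.fit(texts, labels)

predictions = pipeline.predict(["win a free prize now", "see you at lunch"])
print(predictions)
```

Bundling the vectorizer with the model means new text goes through exactly the same transformation at prediction time as at training time.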

**Load the Data and the Libraries**

In [None]:
!unzip '/content/SPAM DETECTION.zip'

unzip:  cannot find or open /content/SPAM DETECTION.zip, /content/SPAM DETECTION.zip.zip or /content/SPAM DETECTION.zip.ZIP.


In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [None]:

df=pd.read_csv('/content/spam.csv',encoding='latin-1')
df.head()


Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [None]:

df.shape

(5572, 5)

In [None]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [None]:
df['Unnamed: 2'].value_counts()

 bt not his girlfrnd... G o o d n i g h t . . .@"        3
 PO Box 5249                                             2
this wont even start........ Datz confidence.."          2
GN                                                       2
 don't miss ur best life for anything... Gud nyt..."     2
 but dont try to prove it..\" .Gud noon...."             2
 Gud night...."

In [None]:
df['Unnamed: 3'].value_counts()

 MK17 92H. 450Ppw 16"                         2
GE                                            2
 why to miss them                             1
U NO THECD ISV.IMPORTANT TOME 4 2MORO\""      1
i wil tolerat.bcs ur my someone..... But      1
 ILLSPEAK 2 U2MORO WEN IM NOT ASLEEP...\""    1
whoever is the KING\"!... Gud nyt"            1
 TX 4 FONIN HON                               1
 \"OH No! COMPETITION\". Who knew             1
IåÕL CALL U\""                                1
Name: Unnamed: 3, dtype: int64

In [None]:

df.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis=1,inplace=True)

In [None]:

df.isnull().sum()

v1    0
v2    0
dtype: int64

In [None]:

df.describe()

Unnamed: 0,v1,v2
count,5572,5572
unique,2,5169
top,ham,"Sorry, I'll call later"
freq,4825,30


In [None]:
df['v1'].value_counts()


ham     4825
spam     747
Name: v1, dtype: int64
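
With 4,825 ham and 747 spam messages the classes are imbalanced, so accuracy needs a reference point. As a small worked calculation (not part of the original notebook), the snippet below computes the accuracy of a trivial model that always predicts ham.

```python
# Class counts taken from the value_counts() output above.
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

spam_share = spam_count / total
majority_baseline = ham_count / total  # accuracy of always predicting "ham"

print(f"spam share:        {spam_share:.3f}")         # 0.134
print(f"majority baseline: {majority_baseline:.3f}")  # 0.866
```

Any model whose accuracy is close to 0.87 is therefore doing little better than this baseline.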

**DATA PREPARATION AND TRANSFORMATION**

In [None]:
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.corpus import stopwords
import re

In [None]:
nltk.download('stopwords')
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the stop-word set once
tokenizer = RegexpTokenizer(r'\w+')

def preprocess(sentence):
    """Lowercase, strip HTML tags, URLs and digits, then drop short words and stop words."""
    sentence = str(sentence).lower()
    sentence = sentence.replace('{html}', "")
    cleantext = re.sub(r'<.*?>', '', sentence)   # remove HTML tags
    rem_url = re.sub(r'http\S+', '', cleantext)  # remove URLs
    rem_num = re.sub(r'[0-9]+', '', rem_url)     # remove digits
    tokens = tokenizer.tokenize(rem_num)
    filtered_words = [w for w in tokens if len(w) > 2 and w not in stop_words]
    # The stemmed and lemmatized tokens are computed but not returned:
    # the function returns the filtered, unstemmed words.
    stem_words = [stemmer.stem(w) for w in filtered_words]
    lemma_words = [lemmatizer.lemmatize(w) for w in stem_words]
    return " ".join(filtered_words)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
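
To make the cleaning steps in `preprocess` concrete, here is a standard-library-only sketch that applies the same regular expressions to one made-up message. The `STOP_WORDS` set is a tiny stand-in for NLTK's English stop-word list, and `clean_tokens` is a hypothetical helper name used only for this example.

```python
import re

# Tiny stand-in for NLTK's English stop-word list (assumption for this sketch).
STOP_WORDS = {"a", "or", "now", "the"}

def clean_tokens(text):
    """Apply the same regex steps as preprocess() and return the kept tokens."""
    text = text.lower()
    text = re.sub(r"<.*?>", "", text)    # drop HTML tags
    text = re.sub(r"http\S+", "", text)  # drop URLs
    text = re.sub(r"[0-9]+", "", text)   # drop digits
    tokens = re.findall(r"\w+", text)    # same pattern as RegexpTokenizer(r'\w+')
    return [w for w in tokens if len(w) > 2 and w not in STOP_WORDS]

sample = "WIN a <b>FREE</b> prize! Visit http://example.com or call 08001234567 now"
print(clean_tokens(sample))
# ['win', 'free', 'prize', 'visit', 'call']
```

Note that removing digits also removes phone numbers and prize amounts, which may themselves be useful spam signals.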


In [None]:

df['v2']=df['v2'].map(lambda s:preprocess(s))

In [None]:
df.head()

Unnamed: 0,v1,v2
0,ham,jurong point crazy available bugis great world...
1,ham,lar joking wif oni
2,spam,free entry wkly comp win cup final tkts may te...
3,ham,dun say early hor already say
4,ham,nah think goes usf lives around though


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vector=TfidfVectorizer()

In [None]:

x=vector.fit_transform(df['v2'])
x.shape


(5572, 7386)
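
Each of the 7,386 columns of `x` corresponds to one vocabulary term, and each non-zero entry is a TF-IDF weight. As an illustration of what `TfidfVectorizer` computes with its default settings (smoothed IDF followed by L2 normalisation of each row), here is a hand-rolled sketch on a made-up three-message corpus. It uses whitespace tokenisation for brevity and is not a replacement for the library.

```python
import math
from collections import Counter

# Made-up, already-cleaned messages.
docs = ["free entry win cup", "call later", "free call now"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many messages each term appears.
doc_freq = Counter(term for doc in tokenized for term in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}

def tfidf_row(doc):
    """Return the L2-normalised TF-IDF weights of one tokenized message."""
    counts = Counter(doc)
    raw = {t: counts[t] * idf[t] for t in counts}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

rows = [tfidf_row(doc) for doc in tokenized]
for doc, row in zip(docs, rows):
    print(doc, "->", {t: round(w, 3) for t, w in row.items()})
```

In the second message, "later" receives a higher weight than "call" because "call" also appears in another message and is therefore less distinctive.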

**Data Encoding and Data Splitting**

In [None]:

le=LabelEncoder()


In [None]:

y=le.fit_transform(df['v1'])

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state = 123)


In [None]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(4457, 7386)
(1115, 7386)
(4457,)
(1115,)
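
Because only about 13% of the messages are spam, a purely random split can leave the test set with a slightly different class ratio than the training set. One common option, which the split above does not use, is the `stratify` argument of `train_test_split`. The sketch below uses made-up labels (80 ham, 20 spam) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 20% positive class (1 = spam).
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

# With stratification both parts keep the 80/20 class ratio.
print("train spam share:", y_tr.mean())
print("test spam share: ", y_te.mean())
```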


In [None]:
x.data

array([0.20533706, 0.36750082, 0.17228578, ..., 0.69543059, 0.53118971,
       0.48395639])

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier


In [None]:

random_grid = {'criterion': ['gini', 'entropy', 'log_loss'],
               'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110],

               'min_samples_leaf': [1, 2, 4],
               'min_samples_split': [2, 5, 10],
               'n_estimators': [130, 180, 230]}

In [None]:

rf=RandomForestClassifier()
clf=RandomizedSearchCV(estimator=rf ,param_distributions=random_grid,verbose=2,random_state=142)

In [None]:
search=clf.fit(x_train,y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END criterion=gini, max_depth=50, min_samples_leaf=2, min_samples_split=2, n_estimators=130; total time=   1.4s
[CV] END criterion=gini, max_depth=50, min_samples_leaf=2, min_samples_split=2, n_estimators=130; total time=   1.4s
[CV] END criterion=gini, max_depth=50, min_samples_leaf=2, min_samples_split=2, n_estimators=130; total time=   1.4s
[CV] END criterion=gini, max_depth=50, min_samples_leaf=2, min_samples_split=2, n_estimators=130; total time=   1.4s
[CV] END criterion=gini, max_depth=50, min_samples_leaf=2, min_samples_split=2, n_estimators=130; total time=   1.4s
[CV] END criterion=entropy, max_depth=10, min_samples_leaf=1, min_samples_split=5, n_estimators=180; total time=   0.6s
[CV] END criterion=entropy, max_depth=10, min_samples_leaf=1, min_samples_split=5, n_estimators=180; total time=   0.6s
[CV] END criterion=entropy, max_depth=10, min_samples_leaf=1, min_samples_split=5, n_estimators=180; total time=  

In [None]:
search.best_params_

{'n_estimators': 130,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_depth': 80,
 'criterion': 'gini'}

In [None]:
search.best_score_

0.9735222176926213
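
`RandomizedSearchCV` refits the best configuration on the full training data by default (`refit=True`), so the tuned model can be taken from `best_estimator_` rather than retyped by hand. The sketch below shows this on synthetic data from `make_classification`; the small grid and `n_iter=3` are arbitrary choices to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the TF-IDF features.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

demo_grid = {
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [20, 40],
}
demo_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=demo_grid,
    n_iter=3,
    cv=3,
    random_state=142,
)
demo_search.fit(X_demo, y_demo)

best_rf = demo_search.best_estimator_  # already fitted with best_params_
print(demo_search.best_params_)
print(best_rf.predict(X_demo[:5]))
```

Using `best_estimator_` avoids transcription mistakes between the search result and the final model.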

In [None]:
# Note: min_samples_split and max_depth here differ from search.best_params_ above.
rf = RandomForestClassifier(n_estimators=130,
                            min_samples_split=10,
                            min_samples_leaf=1,
                            max_depth=None,
                            criterion='gini')

In [None]:

rf.fit(x_train.toarray(),y_train)
rf_preds_train = rf.predict(x_train.toarray())
rf_preds_test = rf.predict(x_test.toarray())

In [None]:
print('Accuracy score for train data : ', round(accuracy_score(y_train, rf_preds_train),2))
print('Accuracy score for test data : ', round(accuracy_score(y_test, rf_preds_test),2))



Accuracy score for train data :  1.0
Accuracy score for test data :  0.98
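
Accuracy alone hides which kind of mistake the model makes. Since the stated goal is to limit both false positives (ham flagged as spam) and false negatives (spam let through), a confusion matrix with precision and recall is a useful complement. The labels below are made up for illustration; in the notebook the same calls could be applied to `y_test` and `rf_preds_test`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 0 = ham, 1 = spam.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes: [[tn, fp], [fn, tp]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
spam_precision = precision_score(y_true, y_pred)
spam_recall = recall_score(y_true, y_pred)

print(cm)
print("false positives:", fp, "false negatives:", fn)
print("spam precision:", round(spam_precision, 3))
print("spam recall:   ", round(spam_recall, 3))
```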


In [None]:

nb=GaussianNB()
nb.fit(x_train.toarray(),y_train)
nb_preds_train=nb.predict(x_train.toarray())
nb_preds_test=nb.predict(x_test.toarray())

In [None]:

print('Accuracy score for train data : ', round(accuracy_score(y_train, nb_preds_train),2))
print('Accuracy score for test data : ', round(accuracy_score(y_test, nb_preds_test),2))

Accuracy score for train data :  0.93
Accuracy score for test data :  0.88
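
`GaussianNB` needs a dense array, which is why the cell above calls `.toarray()`. A variant often used for text features is `MultinomialNB`, which accepts the sparse TF-IDF matrix directly. The sketch below is an alternative to, not a description of, the notebook's model, and it runs on a made-up four-message corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus: 1 = spam, 0 = ham.
texts = ["free prize win", "win cash free", "see you later", "call me later"]
labels = [1, 1, 0, 0]

demo_vectorizer = TfidfVectorizer()
X_sparse = demo_vectorizer.fit_transform(texts)  # stays a sparse matrix

mnb = MultinomialNB()
mnb.fit(X_sparse, labels)

new_messages = demo_vectorizer.transform(["free cash", "later"])
predictions = mnb.predict(new_messages)
print(predictions)
```

Whether this variant would beat the 0.88 test accuracy reported above is not something this toy example can show; it would have to be measured on the real split.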


**SVM**

In [None]:
from sklearn.svm import SVC

svc = SVC()


In [None]:
from scipy.stats import reciprocal, randint
param_dist = {
    'C': reciprocal(0.1, 10),  # Regularization parameter
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],  # Kernel type
    'gamma': ['scale', 'auto'] + list(reciprocal(0.01, 0.1).rvs(size=3)),  # Kernel coefficient for 'poly', 'rbf', 'sigmoid'
    'degree': randint(2, 5),  # Degree of the polynomial kernel function
    'coef0': reciprocal(0.1, 10)  # Independent term in kernel function
}
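
`scipy.stats.reciprocal` is the log-uniform distribution (also exposed as `loguniform` in recent SciPy releases), so values of `C` and `coef0` are drawn evenly across orders of magnitude between 0.1 and 10. The following sketch samples from it to show the range; the seed and sample size are arbitrary.

```python
import numpy as np
from scipy.stats import reciprocal

c_dist = reciprocal(0.1, 10)
samples = c_dist.rvs(size=1000, random_state=0)

print("min:", samples.min(), "max:", samples.max())
# Roughly half of the draws fall below 1, the geometric midpoint of [0.1, 10].
print("share below 1:", np.mean(samples < 1))
```

Note that the three numeric `gamma` candidates in the grid above are drawn once, when the dictionary is built, and are then treated as a fixed list alongside 'scale' and 'auto'.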



In [None]:
random_search = RandomizedSearchCV(svc, param_distributions=param_dist, n_iter=100, cv=5, scoring='accuracy', n_jobs=-1, verbose=2)


In [None]:
search1 = random_search.fit(x_train,y_train)


Fitting 5 folds for each of 100 candidates, totalling 500 fits


In [None]:
search1.best_params_


{'C': 5.032891444100704,
 'coef0': 2.873263255584762,
 'degree': 2,
 'gamma': 'scale',
 'kernel': 'poly'}

In [None]:
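# Note: C, coef0 and kernel here differ from search1.best_params_ above;
# coef0, degree and gamma are ignored by the linear kernel.
svc = SVC(C=0.3321408221627493,
          coef0=6.852383815557032,
          degree=2,
          gamma='scale',
          kernel='linear')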


In [None]:
svc.fit(x_train.toarray(),y_train)
svc_preds_train = svc.predict(x_train.toarray())
svc_preds_test = svc.predict(x_test.toarray())

In [None]:
print('Accuracy score for train data : ', round(accuracy_score(y_train, svc_preds_train),2))
print('Accuracy score for test data : ', round(accuracy_score(y_test, svc_preds_test),2))

Accuracy score for train data :  0.98
Accuracy score for test data :  0.97


**Creating App using Gradio**

In [None]:
!pip install gradio


In [None]:
import gradio as gr

In [None]:
def transform(input_text):
    transformed_input = preprocess(input_text)
    vectorized_input = vector.transform([transformed_input])  # `vector` is the fitted TfidfVectorizer
    result = rf.predict_proba(vectorized_input)[0]
    return {"ham": float(result[0]), "spam": float(result[1])}

# Define the Gradio interface
demo = gr.Interface(
    fn=transform,
    inputs=gr.Textbox(),
    outputs='label'
)

# Launch the interface
demo.launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://4fb2f07dc02bbf5d2d.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


