
# **Machine Learning Project**
# **"SPAM Mail Detection"**
**Project In-charge:** Prof. M. K. BEURIA

2106247 - SATTWIK ROY

**Section**: ML-IT01

*-- School of Computer Science and Engineering, KIIT-DU --*





# **Introduction**

*Recognizing spam e-mail is essential for keeping users' inboxes free of unsolicited and potentially harmful messages, which clutter the inbox and can compromise security. Each mail in the dataset is categorized as 'Spam' or 'Ham'. After analyzing the data and training machine learning classifiers on it, the resulting predictive system labels an incoming mail as spam or ham.*

*Several machine learning classification algorithms are compared; of these, the Multi-Layer Perceptron (MLP) gives the most accurate predictions on the data, with about 98% accuracy.* [1]



---------------------------------------------------------------------------
# Importing Required Libraries

In [None]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.metrics import accuracy_score, classification_report

# Read the mail dataset into a pandas DataFrame



In [None]:
raw_mail_data = pd.read_csv("/content/mail_data.csv")
print(raw_mail_data)

     Category                                            Message
0         ham  Go until jurong point, crazy.. Available only ...
1         ham                      Ok lar... Joking wif u oni...
2        spam  Free entry in 2 a wkly comp to win FA Cup fina...
3         ham  U dun say so early hor... U c already then say...
4         ham  Nah I don't think he goes to usf, he lives aro...
...       ...                                                ...
5567     spam  This is the 2nd time we have tried 2 contact u...
5568      ham               Will ü b going to esplanade fr home?
5569      ham  Pity, * was in mood for that. So...any other s...
5570      ham  The guy did some bitching but I acted like i'd...
5571      ham                         Rofl. Its true to its name

[5572 rows x 2 columns]


# Data Pre-processing

In [None]:
# Replace missing values with empty strings
mail_data = raw_mail_data.where((pd.notnull(raw_mail_data)),'')
# Print the first 5 rows of the data
mail_data.head()

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


In [None]:
#check number of rows and columns of dataframe
mail_data.shape

(5572, 2)

# Label Encoding

In [None]:
# Label spam mail as 0 and ham mail as 1
mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1

Spam -> 0

Ham -> 1
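The same encoding can also be written as a single dictionary lookup. The following is a minimal sketch on a made-up three-row frame; `toy` and `label_map` are hypothetical names, not part of the notebook. It yields an integer column directly, so no later cast from `object` is needed.

```python
import pandas as pd

# Toy frame standing in for mail_data (values are made up for illustration)
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["see you at 5", "WIN a free prize now", "ok lar"],
})

# Same convention as above: spam -> 0, ham -> 1
label_map = {"spam": 0, "ham": 1}

# A category missing from label_map would map to NaN and make the cast fail,
# which surfaces unexpected rows early
toy["Label"] = toy["Category"].map(label_map).astype("int64")

print(toy["Label"].tolist())  # [1, 0, 1]
```

Writing to a separate `Label` column keeps the original text category available for inspection.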

In [None]:
#separating the data as text and label
X = mail_data['Message']
Y = mail_data['Category']

In [None]:
print(X)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                 Will ü b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object


In [None]:
print(Y)

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: object


# Train Test Split

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42)

In [None]:
print(X.shape)
print(X_train.shape)
print(X_test.shape)

(5572,)
(4457,)
(1115,)
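The dataset is imbalanced (the training report further down shows 598 spam against 3859 ham messages), so it can help to fix the class ratio in both partitions instead of leaving it to chance. A minimal sketch, assuming made-up data and the hypothetical names `toy_X`, `toy_Y`, `Xs_train`, and so on:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a similar kind of imbalance (made up for illustration)
toy_X = pd.Series([f"message {i}" for i in range(100)])
toy_Y = pd.Series([0] * 15 + [1] * 85)  # 15% spam, 85% ham

# stratify=toy_Y asks for the same class ratio in the train and test partitions
Xs_train, Xs_test, Ys_train, Ys_test = train_test_split(
    toy_X, toy_Y, test_size=0.2, random_state=42, stratify=toy_Y
)

# Share of ham (label 1) in each partition
print(Ys_train.mean(), Ys_test.mean())
```

In the notebook the equivalent change would be passing `stratify = Y` to the existing `train_test_split` call.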


# Extracting Features from Mail message

In [None]:
# Convert the text data to TF-IDF feature vectors (the vectorizer is fitted on the training data only)
features = TfidfVectorizer(min_df = 1, stop_words = 'english', lowercase = True)

X_train_norm = features.fit_transform(X_train)
X_test_norm = features.transform(X_test)

#convert Y_train and Y_test value as integer
Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')
print(Y_test)

3245    1
944     1
1044    1
2484    1
812     1
       ..
4264    1
2439    1
5556    1
4205    1
4293    1
Name: Category, Length: 1115, dtype: int64


In [None]:
print(X_train_norm)

  (0, 5818)	0.22682143517864364
  (0, 2497)	0.2442158912653505
  (0, 694)	0.3171299579602537
  (0, 6264)	0.1898892037332199
  (0, 5800)	0.17558937755823417
  (0, 3262)	0.33791755486732394
  (0, 2049)	0.3034375179183143
  (0, 7300)	0.24288153842988894
  (0, 2724)	0.3544175987866074
  (0, 354)	0.3544175987866074
  (0, 7162)	0.2550284465664535
  (0, 258)	0.2379428657041507
  (0, 7222)	0.2173884735352799
  (0, 5512)	0.1898892037332199
  (1, 2555)	0.3840709491751004
  (1, 3804)	0.1902902346515268
  (1, 3932)	0.24325511357721427
  (1, 4509)	0.4028245991060671
  (1, 2440)	0.33870544648398715
  (1, 3333)	0.20665394084233096
  (1, 5650)	0.360444144470318
  (1, 2335)	0.2162321275166079
  (1, 6738)	0.28986069568918
  (1, 6109)	0.3239762634465801
  (1, 3267)	0.2678713077029217
  :	:
  (4452, 2438)	0.4574160733416501
  (4452, 7280)	0.3968991650168732
  (4452, 3978)	0.4574160733416501
  (4452, 3290)	0.26370969643076225
  (4452, 3084)	0.22948428918295163
  (4452, 2236)	0.2676662072392096
  (4453, 387

# Initialize the Machine Learning models

In [None]:
# Base estimators for the stacking classifier
base_classifiers = [
    ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=50, random_state=42)),
    ('dt', DecisionTreeClassifier(max_depth=5, random_state=42))
]

# Initialize stacking classifier with a meta-classifier (Logistic Regression in this case)
stacking_clf = StackingClassifier(estimators = base_classifiers, final_estimator = LogisticRegression())

# Initialize classifiers
knn = KNeighborsClassifier(n_neighbors=3)
rf = RandomForestClassifier(n_estimators=7, criterion='entropy', random_state=7)
svc = SVC()
lr = LogisticRegression()
dt = DecisionTreeClassifier()

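# Initialize XGBoost classifier
# Note: this rebinds the name `xgb` from the imported module to the classifier
# instance, so re-running this cell requires re-running the import cell first
xgb = xgb.XGBClassifier(
    objective='binary:logistic',  # for binary classification
    eval_metric='error'            # evaluation metric
)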
# Initialize GBM classifier
gbm = GradientBoostingClassifier(
    n_estimators=100,  # number of boosting stages
    learning_rate=0.1, # learning rate
    max_depth=3        # maximum depth of the individual estimators
)
# Initialize AdaBoost classifier
ada = AdaBoostClassifier(n_estimators=50, random_state=42)

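# Initialize MLPClassifier (Neural Network)
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=100, alpha=0.0001, solver='adam', random_state=42)
# mlp2 sets no random_state, so its results can vary between runs;
# learning_rate='adaptive' only takes effect with solver='sgd' and is ignored by 'adam'
mlp2 = MLPClassifier(hidden_layer_sizes=(200,), activation='relu', max_iter=500, learning_rate='adaptive', alpha=0.0002, solver='adam')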

# Initialize Bagging Classifier
bagging_clf = BaggingClassifier(svc, n_estimators=10, random_state=42)

# Training and Testing the models

In [None]:
for classifier in (lr, svc, knn, rf, dt, xgb, ada, mlp, mlp2, gbm, bagging_clf, stacking_clf):
  classifier.fit(X_train_norm, Y_train)

  # Training Data
  Y_predict = classifier.predict(X_train_norm)
  model_train_accuracy = accuracy_score(Y_train,Y_predict)
  model_train_report = classification_report(Y_train, Y_predict)

  # Testing Data
  Y_test_predict = classifier.predict(X_test_norm)
  model_test_accuracy = accuracy_score(Y_test, Y_test_predict)
  model_test_report = classification_report(Y_test, Y_test_predict)

  print(classifier.__class__.__name__, ":-")
  print("Training Report :-")
  print("Accuracy :", 100*model_train_accuracy)
  print(model_train_report)
  print("Testing Report :-")
  print("Accuracy :", 100*model_test_accuracy)
  print(model_test_report)

LogisticRegression :-
Training Report :-
Accuracy : 96.61207089970833
              precision    recall  f1-score   support

           0       0.99      0.75      0.86       598
           1       0.96      1.00      0.98      3859

    accuracy                           0.97      4457
   macro avg       0.98      0.88      0.92      4457
weighted avg       0.97      0.97      0.96      4457

Testing Report :-
Accuracy : 96.7713004484305
              precision    recall  f1-score   support

           0       1.00      0.76      0.86       149
           1       0.96      1.00      0.98       966

    accuracy                           0.97      1115
   macro avg       0.98      0.88      0.92      1115
weighted avg       0.97      0.97      0.97      1115

SVC :-
Training Report :-
Accuracy : 99.86538030065066
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       598
           1       1.00      1.00      1.00      3859

    accurac
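The reports above come from a single 80/20 split, which gives one accuracy estimate per model. As a complementary check, k-fold cross-validation averages over several splits. The sketch below is an illustration on a made-up corpus, not the procedure used in this notebook; `toy_messages`, `toy_labels`, and `toy_pipeline` are hypothetical names.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Made-up corpus: 20 spam-like and 20 ham-like messages
spam_msgs = [f"win free prize claim code {i}" for i in range(20)]
ham_msgs = [f"see you at lunch around {i}" for i in range(20)]
toy_messages = spam_msgs + ham_msgs
toy_labels = [0] * 20 + [1] * 20  # spam -> 0, ham -> 1

# Keeping the vectorizer inside the pipeline refits TF-IDF on each training fold,
# so no vocabulary or IDF statistics leak from the held-out fold
toy_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english", lowercase=True),
    LogisticRegression(),
)

# Five folds, one accuracy value per fold
scores = cross_val_score(toy_pipeline, toy_messages, toy_labels, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

The spread across folds gives a rough sense of how much a single-split accuracy figure might move with a different `random_state`.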

**Model Accuracy on the Testing Dataset:**

*(Listed in ascending order of accuracy)*

1. K-Nearest Neighbour - **92.82%**
2. Logistic Regression - **96.77%**
3. Gradient Boosting Classifier - **96.86%**
4. Random Forest - **97.04%**
5. Decision Tree - **97.04%**
6. AdaBoost - **97.30%**
7. XGBoost - **97.66%**
8. Bagging Classifier - **98.29%**
9. Stacking Classifier - **98.47%**
10. Support Vector Machine - **98.47%**
11. Neural Network (MLP) - **99.82%**
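Accuracy alone can hide weak spam detection on an imbalanced dataset: in the Logistic Regression testing report above, accuracy is 96.77% while recall on the spam class (label 0) is 0.76. A minimal sketch of computing spam-specific metrics, using made-up labels and predictions (`y_true` and `y_pred` are hypothetical):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions (spam -> 0, ham -> 1), only for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]  # one spam message slips through as ham

# Rows are true classes, columns are predicted classes, both ordered [spam, ham]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# pos_label=0 makes spam the class being scored
spam_recall = recall_score(y_true, y_pred, pos_label=0)
spam_precision = precision_score(y_true, y_pred, pos_label=0)
print(spam_recall, spam_precision)  # 0.75 1.0
```

Here 9 of 10 predictions are correct, yet a quarter of the spam is missed, which is the kind of gap the per-class rows of `classification_report` expose.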





# Building A Predictive System

In [None]:
#input data
#input_mail = ["Although i told u dat i'm into baig face watches now but i really like e watch u gave cos it's fr u. Thanx 4 everything dat u've done today, i'm touched..."]
input_mail = ["WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."]
#convert text to feature vector
input_data_norm = features.transform(input_mail)

#model prediction - mlp (Best accuracy)
prediction = mlp.predict(input_data_norm)
print(prediction)

#output spam or ham mail
if prediction[0] == 1:
  print("Ham Mail")
else:
  print("Spam Mail")

[0]
Spam Mail
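The transform-then-predict steps above can be wrapped in a small helper so that any message can be checked with one call. This is a sketch under assumptions: `classify_mail` is a hypothetical function, and the tiny training set exists only so the example runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def classify_mail(text, vectorizer, model):
    """Return 'Ham Mail' or 'Spam Mail' for one message (hypothetical helper)."""
    vec = vectorizer.transform([text])    # same fitted vectorizer as in training
    label = int(model.predict(vec)[0])    # spam -> 0, ham -> 1
    return "Ham Mail" if label == 1 else "Spam Mail"

# Tiny made-up training set so the sketch is self-contained
toy_texts = [
    "win free prize now",
    "claim free reward prize",
    "lunch tomorrow then",
    "meeting notes attached",
]
toy_y = [0, 0, 1, 1]

toy_features = TfidfVectorizer(stop_words="english").fit(toy_texts)
toy_model = LogisticRegression().fit(toy_features.transform(toy_texts), toy_y)

print(classify_mail("free prize reward", toy_features, toy_model))
print(classify_mail("lunch meeting notes", toy_features, toy_model))
```

With the objects defined in this notebook, the equivalent call would be `classify_mail(input_mail[0], features, mlp)`.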


---------------------------------------------------------------------------
# **Conclusion**

*The Multi-Layer Perceptron (MLP) gives the highest accuracy score, 99.82% on the testing dataset, and its predictions on the data are very accurate.*

*The precision on the positive class and the negative class is 99% and 100%, respectively.*

# **Reference :-**

[1] https://ieeexplore.ieee.org/document/10170187