📧 Task-2: EMAIL SPAM DETECTION

📝 Author: M ShreeRaj

📅 Batch: September

🔍 Domain: Data Science



🎯 Aim: Develop a robust model for detecting email spam using natural language processing techniques. 📩🚫

🌟 Task-2 OasisInfoByte Internship
---
Overview:

Welcome to Task-2 of the OasisInfoByte Internship program! This project uses Natural Language Processing (NLP) to build a classifier that separates spam from legitimate ("ham") messages. Each message is converted to TF-IDF features and classified with logistic regression, and the model is scored by accuracy on a held-out test set. Let's combat spam together! 📩🚫🔍

🧰📚 IMPORTING IMPORTANT LIBRARIES 📚🧰

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

📥🔽 LOADING THE DATASET 🔽📥

In [None]:
df = pd.read_csv('mail_data.csv')

In [None]:
print(df)

     Category                                            Message
0         ham  Go until jurong point, crazy.. Available only ...
1         ham                      Ok lar... Joking wif u oni...
2        spam  Free entry in 2 a wkly comp to win FA Cup fina...
3         ham  U dun say so early hor... U c already then say...
4         ham  Nah I don't think he goes to usf, he lives aro...
...       ...                                                ...
5567     spam  This is the 2nd time we have tried 2 contact u...
5568      ham               Will ü b going to esplanade fr home?
5569      ham  Pity, * was in mood for that. So...any other s...
5570      ham  The guy did some bitching but I acted like i'd...
5571      ham                         Rofl. Its true to its name

[5572 rows x 2 columns]


In [None]:
# Replace any missing values with empty strings
data = df.fillna("")

In [None]:
data.head()

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [None]:
data.describe()

       Category                 Message
count      5572                    5572
unique        2                    5157
top         ham  Sorry, I'll call later
freq       4825                      30
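
The summary above shows that 4825 of the 5572 messages are ham, so the two classes are imbalanced. A minimal sketch of the accuracy a trivial "always ham" classifier would reach on the full dataset, using only those two counts (the variable names are illustrative and not part of the notebook):

```python
# Counts taken from the data.describe() output above
total_messages = 5572
ham_messages = 4825
spam_messages = total_messages - ham_messages

# A classifier that always answers "ham" is right on every ham message
baseline_accuracy = ham_messages / total_messages

print(f"spam messages: {spam_messages}")
print(f"always-ham baseline accuracy: {baseline_accuracy:.4f}")
```

Any accuracy reported later is best read against this figure rather than against 50%.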


In [None]:
# Encode the labels as integers: spam -> 0, ham -> 1
data['Category'] = data['Category'].map({'spam': 0, 'ham': 1})

In [None]:
X = data['Message']
Y = data['Category']


In [None]:
print(X)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                 Will ü b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object


In [None]:
print(Y)

0       1
1       1
2       0
3       1
4       1
       ..
5567    0
5568    1
5569    1
5570    1
5571    1
Name: Category, Length: 5572, dtype: int64


📚🤖🧠 "Training Our Model" 🏋️‍♂️🌟💡

In [None]:
# Hold out 20% of the messages for testing; random_state makes the split reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [None]:
print(X.shape)
print(X_train.shape)
print(X_test.shape)


(5572,)
(4457,)
(1115,)


In [None]:
print(Y.shape)
print(Y_train.shape)
print(Y_test.shape)


(5572,)
(4457,)
(1115,)
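
The shapes above follow from test_size=0.2: scikit-learn rounds the test share up to a whole number of rows and gives the remainder to the training set. A small sketch of that arithmetic (variable names are illustrative):

```python
import math

n_samples = 5572
test_size = 0.2

# The test count is rounded up; the training set gets what is left
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_test, n_train)
```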


In [None]:
# TfidfVectorizer was already imported at the top of the notebook.
# Lowercase the text and drop English stop words before weighting.
tfidf_vectorizer = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

# Learn the vocabulary and IDF weights from the training messages only
X_train_features = tfidf_vectorizer.fit_transform(X_train)

# The labels are not text, so they are never passed through the vectorizer.
# Make sure the training labels are integers.
Y_train = Y_train.astype('int')
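
To make the fit/transform split concrete, here is a rough pure-Python sketch of TF-IDF weighting. It assumes the smoothed IDF formula and L2 normalisation that scikit-learn uses by default, but it tokenises on whitespace only and applies no stop-word list, so its numbers will not match the notebook's. The helper names tfidf_fit and tfidf_transform and the toy messages are hypothetical.

```python
import math
from collections import Counter

def tfidf_fit(docs):
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}

def tfidf_transform(doc, idf):
    """Weight one document with the learned IDF; unseen terms are dropped."""
    counts = Counter(t for t in doc.lower().split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Toy training messages (illustrative, not from mail_data.csv)
train_docs = ["free prize now", "call me later", "free call"]
idf = tfidf_fit(train_docs)

# "unicorn" never appeared in training, so it gets no weight
vector = tfidf_transform("free unicorn prize", idf)
print(vector)
```

The rarer term ("prize", seen in one training message) ends up with a larger weight than the more common one ("free", seen in two), which is the behaviour TF-IDF is meant to provide.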



In [None]:
print(X_train)

1978    Reply to win £100 weekly! Where will the 2006 ...
3989    Hello. Sort of out in town already. That . So ...
3935     How come guoyang go n tell her? Then u told her?
4078    Hey sathya till now we dint meet not even a si...
4086    Orange brings you ringtones from all time Char...
                              ...                        
3772    Hi, wlcome back, did wonder if you got eaten b...
5191                               Sorry, I'll call later
5226        Prabha..i'm soryda..realy..frm heart i'm sory
5390                           Nt joking seriously i told
860               Did he just say somebody is named tampa
Name: Message, Length: 4457, dtype: object


In [None]:
print(X_train_features)

  (0, 5818)	0.22682143517864364
  (0, 2497)	0.2442158912653505
  (0, 694)	0.3171299579602537
  (0, 6264)	0.1898892037332199
  (0, 5800)	0.17558937755823417
  (0, 3262)	0.33791755486732394
  (0, 2049)	0.3034375179183143
  (0, 7300)	0.24288153842988894
  (0, 2724)	0.3544175987866074
  (0, 354)	0.3544175987866074
  (0, 7162)	0.2550284465664535
  (0, 258)	0.2379428657041507
  (0, 7222)	0.2173884735352799
  (0, 5512)	0.1898892037332199
  (1, 2555)	0.3840709491751004
  (1, 3804)	0.1902902346515268
  (1, 3932)	0.24325511357721427
  (1, 4509)	0.4028245991060671
  (1, 2440)	0.33870544648398715
  (1, 3333)	0.20665394084233096
  (1, 5650)	0.360444144470318
  (1, 2335)	0.2162321275166079
  (1, 6738)	0.28986069568918
  (1, 6109)	0.3239762634465801
  (1, 3267)	0.2678713077029217
  :	:
  (4452, 2438)	0.4574160733416501
  (4452, 7280)	0.3968991650168732
  (4452, 3978)	0.4574160733416501
  (4452, 3290)	0.26370969643076225
  (4452, 3084)	0.22948428918295163
  (4452, 2236)	0.2676662072392096
  (4453, 387

Accuracy measure (👍) (👎) (✅) (❌)

In [None]:
# Create a LogisticRegression model
model = LogisticRegression()

# Fit the model on the training data
model.fit(X_train_features, Y_train)

# Make predictions on the training data and calculate accuracy
prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)

# Print the accuracy on the training data
print(f'Accuracy on training data: {accuracy_on_training_data:.4f}')

# Transform the test messages with the vectorizer fitted on the training data, then predict
X_test_features = tfidf_vectorizer.transform(X_test)
prediction_on_test_data = model.predict(X_test_features)

# Print the accuracy on the test data
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)
print(f'Accuracy on test data: {accuracy_on_test_data:.4f}')

Accuracy on training data: 0.9661
Accuracy on test data: 0.9677
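
Accuracy alone can hide how well the minority spam class is caught. The sketch below uses made-up labels (not results from this notebook) to compute precision and recall for spam, following the notebook's encoding of 0 = spam and 1 = ham:

```python
# Toy labels for illustration only: 0 = spam, 1 = ham
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1]

SPAM = 0
pairs = list(zip(y_true, y_pred))

true_spam = sum(1 for t, p in pairs if t == SPAM and p == SPAM)    # spam caught
missed_spam = sum(1 for t, p in pairs if t == SPAM and p != SPAM)  # spam let through
false_alarm = sum(1 for t, p in pairs if t != SPAM and p == SPAM)  # ham flagged as spam

accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
precision = true_spam / (true_spam + false_alarm)
recall = true_spam / (true_spam + missed_spam)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On real predictions the same quantities are available from sklearn.metrics, for example precision_score and recall_score with pos_label=0, or confusion_matrix.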


🧪🤖🧐 "Testing Our Model" 🚀🔍📊

In [None]:
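# A single message to classify, wrapped in a list because the vectorizer expects an iterable
input_mail = ["This is the 2nd time we have tried 2 contact u"]

# Reuse the fitted vectorizer, then predict the label
input_data_features = tfidf_vectorizer.transform(input_mail)
prediction = model.predict(input_data_features)

print(prediction)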


[1]


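📧🚫 0️⃣ means spam, and 1️⃣ means ham, so the model labels this message as ham. 📩🥓📬 Note that the input is only the opening of row 5567, which the dataset marks as spam; with the text cut short, the model misses it.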
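
Printing the raw array is easy to misread, so a small helper can turn the numeric prediction into a label. This is a sketch only; LABEL_NAMES and describe_prediction are hypothetical names that mirror the notebook's encoding.

```python
# Mapping that mirrors the notebook's encoding
LABEL_NAMES = {0: "spam", 1: "ham"}

def describe_prediction(prediction):
    """Turn an array-like of 0/1 predictions into readable labels."""
    return [LABEL_NAMES[int(label)] for label in prediction]

# The cell above printed [1]
print(describe_prediction([1]))
```

The helper accepts the array returned by model.predict directly, since each element is converted with int() before the lookup.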