<a href="https://colab.research.google.com/github/PriyanshuKSG/Spam-email-detector_22B2165/blob/main/Spam_email_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Reading the Data**

In [None]:
import pandas as pd


In [69]:
data_path = '/content/drive/MyDrive/Emails.csv'
df = pd.read_csv(data_path)
df

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
...,...,...
5723,Subject: re : research and development charges...,0
5724,"Subject: re : receipts from visit jim , than...",0
5725,Subject: re : enron case study update wow ! a...,0
5726,"Subject: re : interest david , please , call...",0


In [75]:
df.shape

(5728, 2)

In [None]:
df['spam'].value_counts()  # class counts (assumed completion of the unfinished expression)

**Handling missing values**

In [76]:
df_mail = df.where(pd.notnull(df), '')  # Replace all null values with an empty string
df_mail.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
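
To make the replacement rule concrete, here is a stdlib-only sketch of what `df.where(pd.notnull(df), '')` does to each cell: missing values (`None` or `NaN`) become empty strings and everything else is kept. The helper name `fill_missing` and the sample rows are made up for illustration; the notebook itself relies on pandas.

```python
import math

def fill_missing(value, replacement=""):
    """Return `replacement` for None/NaN, otherwise the value unchanged."""
    if value is None:
        return replacement
    if isinstance(value, float) and math.isnan(value):
        return replacement
    return value

# Hypothetical rows in the same (text, spam) layout as the dataset.
rows = [("Subject: hello", 0), (float("nan"), 1), (None, 0)]
cleaned = [(fill_missing(text), label) for text, label in rows]
print(cleaned)
```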


In [77]:
df_mail['text'][0]

"Subject: naturally irresistible your corporate identity  lt is really hard to recollect a company : the  market is full of suqgestions and the information isoverwhelminq ; but a good  catchy logo , stylish statlonery and outstanding website  will make the task much easier .  we do not promise that havinq ordered a iogo your  company will automaticaily become a world ieader : it isguite ciear that  without good products , effective business organization and practicable aim it  will be hotat nowadays market ; but we do promise that your marketing efforts  will become much more effective . here is the list of clear  benefits : creativeness : hand - made , original logos , specially done  to reflect your distinctive company image . convenience : logo and stationery  are provided in all formats ; easy - to - use content management system letsyou  change your website content and even its structure . promptness : you  will see logo drafts within three business days . affordability : your  ma

**Analysing the data**

In [79]:
df_mail['spam'].unique()  # Returns an array of the unique values in the column

array([1, 0])

In [80]:
df_mail['spam'].value_counts()

0    4360
1    1368
Name: spam, dtype: int64
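
The counts above imply a baseline worth keeping in mind: a classifier that always answers "ham" would already be right about 76% of the time on the unbalanced data, which is why accuracy alone can be misleading here. A quick check of that arithmetic, using only the counts printed above (variable names are illustrative):

```python
ham_count, spam_count = 4360, 1368   # from df_mail['spam'].value_counts()
total = ham_count + spam_count

spam_share = spam_count / total
# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(ham_count, spam_count) / total

print(f"total emails: {total}")
print(f"spam share: {spam_share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```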

In [83]:
spam_mail = df_mail[df_mail['spam'] == 1]
ham_mail = df_mail[df_mail['spam'] == 0]
ham_mail.shape, spam_mail.shape

((4360, 2), (1368, 2))

**Balancing the data**

In [84]:
# Randomly undersample ham so that both classes have the same number of rows.
# This removes the class imbalance that would otherwise bias the model towards ham.
ham_mail = ham_mail.sample(spam_mail.shape[0])

ham_mail.shape, spam_mail.shape

((1368, 2), (1368, 2))
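
`DataFrame.sample` draws a different subset on every run unless it is given a seed, so the exact rows (and the scores further down) can vary between sessions. The stdlib sketch below illustrates the same random undersampling idea with a fixed seed; the lists, the seed value, and the function name `undersample` are assumptions for illustration only. In pandas, passing `random_state` to `sample` serves the same purpose.

```python
import random

def undersample(majority, minority, seed=0):
    """Randomly keep as many majority items as there are minority items."""
    rng = random.Random(seed)          # fixed seed makes the draw repeatable
    kept = rng.sample(majority, k=len(minority))
    return kept, minority

ham = [f"ham {i}" for i in range(10)]   # hypothetical majority class
spam = [f"spam {i}" for i in range(3)]  # hypothetical minority class

ham_small, spam_all = undersample(ham, spam)
print(len(ham_small), len(spam_all))
```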

In [86]:
df_final = pd.concat([ham_mail, spam_mail])
df_final['spam'].value_counts()

0    1368
1    1368
Name: spam, dtype: int64

**Preparing Data**

In [87]:
from sklearn.model_selection import train_test_split

X = df_final['text']
y = df_final['spam']
X.head()

5639    Subject: re : storm  dale ,  omer muften ( wit...
5460    Subject: var for cob 2 nd aug 2000  hi vince ,...
3972    Subject: california update 1 / 22 / 01  execut...
3046    Subject: new resume  dear vince ,  i am so gra...
3555    Subject: friday brown bag on derivative pricin...
Name: text, dtype: object

In [88]:
y.head()

5639    0
5460    0
3972    0
3046    0
3555    0
Name: spam, dtype: int64

In [89]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=99)
X_train.shape, X_test.shape

((2188,), (548,))

In [90]:
y_train.shape, y_test.shape

((2188,), (548,))
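
The shapes follow directly from `test_size = 0.2`: with a fractional test size, scikit-learn rounds the test count up and gives the remaining rows to training, which matches the shapes printed above. A small check of that arithmetic for the 2,736 balanced rows (variable names are illustrative):

```python
import math

n_samples = 1368 + 1368          # balanced ham + spam rows
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # fractional test size is rounded up
n_train = n_samples - n_test                # everything else goes to training

print(n_train, n_test)
```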

**Converting text into TF-IDF feature vectors for training and prediction**

In [91]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

In [127]:
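# Unigrams to trigrams, ignoring English stop words and n-grams seen in fewer than 2 emails
feature_extraction = TfidfVectorizer(ngram_range=(1, 3), min_df=2, stop_words='english')
# Learn the vocabulary from the training set only, then reuse it for the test set
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)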
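
`ngram_range=(1,3)` means the vocabulary contains single words, word pairs, and word triples. The stdlib sketch below shows what those n-grams look like for one short, made-up sentence. It uses a plain whitespace split, which is a simplification: `TfidfVectorizer` applies its own tokenizer and removes stop words before building n-grams, so the real vocabulary will differ. The function name `word_ngrams` is hypothetical.

```python
def word_ngrams(text, min_n=1, max_n=3):
    """Return all word n-grams of length min_n..max_n, grouped by n."""
    tokens = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

print(word_ngrams("free software cds today"))
```

With `min_df = 2`, any n-gram that occurs in fewer than two training emails is then dropped from the vocabulary.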

In [None]:
print(X_train_features)

In [None]:
print(X_test_features)
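
Printing a sparse matrix lists its non-zero entries: each one is identified by the email's row index and the vocabulary column index, together with a TF-IDF weight. As a rough guide to where such weights come from, the sketch below applies the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. The toy documents and helper names are invented, and tokenisation is a plain whitespace split, so the numbers are illustrative rather than a reproduction of the vectorizer's output.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2-normalised rows (toy version)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["cheap software offer", "meeting agenda offer"]   # made-up emails
for row in tfidf_rows(docs):
    print({term: round(weight, 3) for term, weight in sorted(row.items())})
```

Terms that appear in both toy emails ("offer") receive a lower weight than terms unique to one of them, which is the behaviour that lets distinctive spam vocabulary stand out.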

**Building the model**

In [121]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

In [128]:
model = LogisticRegression()

# Training the model on the training features
model.fit(X_train_features, y_train)

In [129]:
# Predicting categories on training data
pred_train = model.predict(X_train_features)
print(classification_report(y_train, pred_train))

              precision    recall  f1-score   support

           0       1.00      0.99      1.00      1095
           1       0.99      1.00      1.00      1093

    accuracy                           1.00      2188
   macro avg       1.00      1.00      1.00      2188
weighted avg       1.00      1.00      1.00      2188
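
For reference, the columns of the report are computed per class from three counts: true positives (TP), false positives (FP), and false negatives (FN). The sketch below spells out the standard definitions; the counts passed in are invented for illustration and are not taken from this notebook's results.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard per-class metrics from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 8 spam caught, 2 ham wrongly flagged, 0 spam missed.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=0)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```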



In [130]:
print("Accuracy on training data set = ", accuracy_score(y_train, pred_train)*100, "%")

Accuracy on training data set =  99.6800731261426 %


In [132]:
# Predicting categories on testing data
pred_test = model.predict(X_test_features)
print(classification_report(y_test, pred_test))

              precision    recall  f1-score   support

           0       1.00      0.98      0.99       273
           1       0.98      1.00      0.99       275

    accuracy                           0.99       548
   macro avg       0.99      0.99      0.99       548
weighted avg       0.99      0.99      0.99       548



In [133]:
print("Accuracy on testing data set = ", accuracy_score(y_test, pred_test)*100, "%")

Accuracy on testing data set =  98.90510948905109 %
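
Translating the two accuracy figures back into counts makes them easier to compare: given the support sizes in the reports (2,188 training and 548 test emails), each figure corresponds to a handful of misclassified emails. A short check of that arithmetic, with a hypothetical helper name:

```python
def misclassified(n_samples, accuracy_percent):
    """Number of wrong predictions implied by an accuracy percentage."""
    return round(n_samples * (1 - accuracy_percent / 100))

train_errors = misclassified(2188, 99.6800731261426)
test_errors = misclassified(548, 98.90510948905109)
print(train_errors, test_errors)
```

The gap between training and test accuracy is under one percentage point, which suggests the model generalises well on this particular split.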


In [104]:
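def detect_function(s):
  # Wrap the single email in a list because the vectorizer expects an iterable of documents
  # (the name `emails` avoids shadowing the built-in input())
  emails = [s]
  input_features = feature_extraction.transform(emails)
  pred = model.predict(input_features)
  if pred[0] == 1:
    print("Spam")
  else:
    print("Not Spam")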
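
Returning the label instead of printing it would make the function easier to reuse and test. The sketch below assumes a hypothetical `predict` callable that takes a list of texts and returns 0/1 labels; in this notebook it could be something like `lambda texts: model.predict(feature_extraction.transform(texts))`. The keyword-based stand-in is purely a placeholder so the example runs on its own, not a real classifier.

```python
def classify_email(text, predict):
    """Return 'Spam' or 'Not Spam' for a single email.

    `predict` is any callable mapping a list of texts to a list of 0/1 labels.
    """
    label = predict([text])[0]
    return "Spam" if label == 1 else "Not Spam"

# Stand-in predictor for demonstration only: flags emails mentioning "free".
def keyword_predict(texts):
    return [1 if "free" in t.lower() else 0 for t in texts]

print(classify_email("Subject: get FREE software cds", keyword_predict))
print(classify_email("Subject: re : meeting agenda", keyword_predict))
```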

In [112]:
while True:
  print("Enter your email")
  s = input()
  detect_function(s)
  print("Enter -1 to exit or anything else to continue")
  n = input()
  if n == "-1":
    break

Enter your email
Subject: hello guys ,  i ' m " bugging you " for your completed questionnaire and for a one - page  bio / statement on your thoughts on " business edu and the new economy " . if  my records are incorrect please re - ship your responses to me . i want to  put everything together next week so that i can ship it back to everyone .  the questionnaire is attached as well as copies of the bio pages for  michael froehls and myself ( two somewhat different approaches ) . the idea  of the latter is just to introduce yourself to the other panelists and give  them some background on how you are approaching the issues we will discuss .  we will also provide copies to the attendees and use this material for our  personal introductions at the opening of the panel discussions .  thanks and i look forward to seeing you in two weeks .  john  - waco _ background _ mf . doc  - jmartinbiosketch . doc  - questionnaire . doc  john d . martin  carr p . collins chair in finance  finance depar

In [111]:
# First ham email in the dataset, used as the sample input above
df222 = df_mail[df_mail['spam'] == 0].reset_index(drop=True)
df222['text'][0]

'Subject: hello guys ,  i \' m " bugging you " for your completed questionnaire and for a one - page  bio / statement on your thoughts on " business edu and the new economy " . if  my records are incorrect please re - ship your responses to me . i want to  put everything together next week so that i can ship it back to everyone .  the questionnaire is attached as well as copies of the bio pages for  michael froehls and myself ( two somewhat different approaches ) . the idea  of the latter is just to introduce yourself to the other panelists and give  them some background on how you are approaching the issues we will discuss .  we will also provide copies to the attendees and use this material for our  personal introductions at the opening of the panel discussions .  thanks and i look forward to seeing you in two weeks .  john  - waco _ background _ mf . doc  - jmartinbiosketch . doc  - questionnaire . doc  john d . martin  carr p . collins chair in finance  finance department  baylor u