# Naive Bayes: classifying email spam

In [413]:
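import math, random, re, glob
import email
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
# word_tokenize and nltk.pos_tag (used in cleaner) also rely on NLTK tokenizer and tagger data being installed
nltk.download('stopwords')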


[nltk_data] Downloading package stopwords to /home/alice/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

First, we set the path pattern used to read the emails.

In [392]:
path = r"./emails/*/*"
data = []
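
Each path matched by this pattern has the form `./emails/<folder>/<file>`. Below is a minimal sketch of reading that folder name from a path and deriving a label from it, assuming hypothetical folder names such as `easy_ham` and `spam` (the helper names `folder_of` and `is_spam_path` are also made up for this example). Checking only the parent folder, rather than the whole path string, avoids a false match when some other part of the path happens to contain `ham`.

```python
from pathlib import Path

def folder_of(fn):
    """Return the name of the directory that directly contains the file."""
    return Path(fn).parent.name

def is_spam_path(fn):
    """Label a file as spam unless its parent folder name contains 'ham'."""
    return "ham" not in folder_of(fn)

# Hypothetical paths, only for illustration
print(folder_of("./emails/easy_ham/0001"))
print(is_spam_path("./emails/easy_ham/0001"))
print(is_spam_path("./emails/spam/0001"))
# A substring test on the full path would be fooled by "graham" here
print(is_spam_path("/home/graham/emails/spam/0001"))
```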

Next, we read every file in this folder.
An email is spam when its path does not contain `ham`, so we use that to set the spam / not spam label.
For each email we fetch the body (`payload`) and the sender (`unixfrom`).
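
To see what these two calls return, here is a minimal sketch that uses only the standard-library `email` module on a hand-written message. The message text and the address in it are made up for illustration; real files in the corpus may carry more headers, encodings, or multiple parts.

```python
import email

# Hand-written mbox-style message, made up for this sketch
raw = (
    "From alice@example.com Mon Aug 26 15:15:34 2002\n"
    "From: alice@example.com\n"
    "Subject: hello\n"
    "Content-Type: text/plain\n"
    "\n"
    "Hi there!\n"
)

mail = email.message_from_string(raw)

# The leading "From ..." envelope line, or '' when the file has none
sender = mail.get_unixfrom() or ''
# With decode=True the payload comes back as bytes
body = mail.get_payload(decode=True)

print(sender)
print(body)
print(mail.is_multipart())
```

The envelope line is optional, which is why some rows end up with an empty sender.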

In [393]:
for fn in glob.glob(path):
    is_spam = "ham" not in fn
    # the context manager closes the file even if parsing fails
    with open(fn, 'r', encoding='ISO-8859-1') as file:
        mail = email.message_from_string(file.read())

    # reset per file so values from a previous email never leak into this one
    body = None
    from_mail = mail.get_unixfrom() or ''

    if mail.is_multipart():
        for part in mail.walk():
            ctype = part.get_content_type()
            cdispo = str(part.get('Content-Disposition'))

            # keep text/plain parts, but skip any text/plain (txt) attachments
            if ctype == 'text/plain' and 'attachment' not in cdispo:
                body = part.get_payload(decode=True)  # decode
    else:
        # not multipart - i.e. plain text, no attachments, keeping fingers crossed
        body = mail.get_payload(decode=True)

    if body:
        # strip any HTML markup and keep only the text
        body = BeautifulSoup(body, "lxml").text
        data.append((body, from_mail, is_spam))


Now we create a dataframe from the collected data and add a `len` column holding the length of each email's text.

In [394]:
pd.options.display.max_colwidth = 500
df = pd.DataFrame(data=data, columns = ['email_body', 'from_email', 'is_spam'])
df['len'] = df['email_body'].str.len()
df

Unnamed: 0,email_body,from_email,is_spam,len
0,##################################################\n# #\n# Adult Club #\n# Offers FREE Membership #\n# #\n##################################################\n\n>>>>> INSTANT ACCESS TO ALL SITES NOW\n>>>>> Your User Name And Password is.\n>>>>> User Name: zzzz@example.com\n>>>>> Password: 1534\n\n3 of the Best Adult Sites on the Internet...,From ib@newafrica.com Mon Aug 26 15:15:34 2002,True,3206
1,"Dear Sir, \nWith due respect and humility I write you this letter which I believe you\nwould\nbe of great assistance to my children and I.\nI got your contact through my husband commercial address book and believed\nthat\nyou must be a trust worthy and reliable person that will not like to\nintimidate me\nor betray my trust after hearing this news. I am a native of KONOBO in the\nKEREMA\nlocal district of SIERRA LEONE in West Africa and the wife of Late DR.\nMUNDI A.\nKOJO who was assassinat...",From annekojo@email.com Wed Sep 25 17:22:07 2002,True,2529
2,\n\n\n\n\n\n\nNever Pay Retail!\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nUnleash \n your PC's Multimedia power TODAY!\n\n\n\n\n\n \n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCompare \n and Save\n\n\n \n\nPrice\n\n\nS&H\n\n\nTotal\n\n\nYou \n Save\n\n\nSee \n for yourself\n\n\n\n\nAMAZON\n\n\n$29.99\n\n\n$4.99\n\n\n$34.98\n\n\n70%\n\n\nClick \n here\n\n\n\n\nCDW\n\n\n$29.21\n\n\...,From Special_Offer-09192002-HTML@frugaljoe.330w.com Fri Sep 20 11:41:00 2002,True,2390
3,Never Pay Retail!\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nUnleash \n your PC's Multimedia power TODAY!\n\n\n\n\n\n \n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCompare \n and Save\n\n\n \n\nPrice\n\n\nS&H\n\n\nTotal\n\n\nYou \n Save\n\n\nSee \n for yourself\n\n\n\n\nAMAZON\n\n\n$29.99\n\n\n$4.99\n\n\n$34.98\n\n\n70%\n\n\nClick \n here\n\n\n\n\nCDW\n\n\n$29.21\n\n\n$7.53\n\n\n$3...,From Special_Offer-09192002-HTML@frugaljoe.330w.com Fri Sep 20 11:41:00 2002,True,2383
4,"Dear user,\nCyberAge Dating Club is contacting you on behalf of Jenny.\nYou have been carefully chosen as a matching partner for the other party using \n our advanced profile-matching system. We have determined that you do not currently \n have an account with us and invite you to join. It's absolutely FREE and there's \n no obligations or hidden charges. Come see for yourself and prepare to have \n fun! \nClick here to Join \n Now or simply disregard this request if you're not interest...",From none@none.com Mon Sep 2 16:28:01 2002,True,539
...,...,...,...,...
3296,"If you haven't already, you should enable the debug log under\nHacking Support preferences and look for clues there.\n\n>>>Reg Clemens said:\n > > Hi,\n > > \n > > On Sun, 01 Sep 2002 00:05:03 MDT Reg Clemens wrote: \n > > \n > > [...]\n > > > in messages with GnuPG signatures. But punching the line ALWAYS\n > > > gives\n > > > \n > > > Signature made Thu Aug 29 00:27:17 2002 MDT using DSA key ID BDD\n F997A\n > > > Can't check signature: public key not found\n > > > \...",From exmh-users-admin@redhat.com Mon Sep 2 23:27:14 2002,False,1801
3297,"On Wed, 25 Sep 2002, Joseph S. Barrera III wrote:\n\n> Let's say you're behind a firewall and have a NAT address.\n> Is there any way to telnet to a linux box out there in the world\n> and set your DISPLAY in some way that you can create\n> xterms on your own screen?\n\nAs other people suggested: SSH. PuTTY \n\n\thttp://www.chiark.greenend.org.uk/~sgtatham/putty/download.html\n\ncan do it. Can run, say xclock (I'm running an X server under W32 at work,\ntunneling through a NAT box), but from...",From fork-admin@xent.com Thu Sep 26 16:35:44 2002,False,548
3298,"On 06 September 2002, Anthony Baxter said:\n> A snippet, hopefully not enough to trigger the spam-filters.\n\nAs an aside: one of the best ways to dodge SpamAssassin is by having an\nIn-Reply-To header. Most list traffic should meet this criterion.\n\nAlternately, I can whitelist mail to spambayes@python.org -- that'll\nwork until spammers get ahold of the list address, which usually seems\nto take a few months.\n\n Greg\n-- \nGreg Ward http://www.gerg.ca/\nG...",,False,574
3299,"use Perl Daily Newsletter\n\nIn this issue:\n * ""Perl 6: Right Here, Right Now"" slides ava\n\n+--------------------------------------------------------------------+\n| ""Perl 6: Right Here, Right Now"" slides ava |\n| posted by gnat on Friday September 13, @12:01 (news) |\n| http://use.perl.org/article.pl?sid=02/09/13/162209 |\n+--------------------------------------------------------------------+\n\n[0]gnat writes ""The wonderful Leon Br...",From pudge@perl.org Sat Sep 14 16:22:37 2002,False,1233


To test, the CSV was made in two variants: one using the sender and one without it
(so some of the steps below are nearly identical).

In [405]:
pd.options.display.max_colwidth = 500
df_without_from = pd.DataFrame(data=data, columns = ['email_body', 'from_email', 'is_spam'])
df_without_from.drop(columns = ['from_email'], inplace = True)
df_without_from['len'] = df_without_from['email_body'].str.len()
df_without_from

Unnamed: 0,email_body,is_spam,len
0,##################################################\n# #\n# Adult Club #\n# Offers FREE Membership #\n# #\n##################################################\n\n>>>>> INSTANT ACCESS TO ALL SITES NOW\n>>>>> Your User Name And Password is.\n>>>>> User Name: zzzz@example.com\n>>>>> Password: 1534\n\n3 of the Best Adult Sites on the Internet...,True,3206
1,"Dear Sir, \nWith due respect and humility I write you this letter which I believe you\nwould\nbe of great assistance to my children and I.\nI got your contact through my husband commercial address book and believed\nthat\nyou must be a trust worthy and reliable person that will not like to\nintimidate me\nor betray my trust after hearing this news. I am a native of KONOBO in the\nKEREMA\nlocal district of SIERRA LEONE in West Africa and the wife of Late DR.\nMUNDI A.\nKOJO who was assassinat...",True,2529
2,\n\n\n\n\n\n\nNever Pay Retail!\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nUnleash \n your PC's Multimedia power TODAY!\n\n\n\n\n\n \n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCompare \n and Save\n\n\n \n\nPrice\n\n\nS&H\n\n\nTotal\n\n\nYou \n Save\n\n\nSee \n for yourself\n\n\n\n\nAMAZON\n\n\n$29.99\n\n\n$4.99\n\n\n$34.98\n\n\n70%\n\n\nClick \n here\n\n\n\n\nCDW\n\n\n$29.21\n\n\...,True,2390
3,Never Pay Retail!\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nUnleash \n your PC's Multimedia power TODAY!\n\n\n\n\n\n \n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCompare \n and Save\n\n\n \n\nPrice\n\n\nS&H\n\n\nTotal\n\n\nYou \n Save\n\n\nSee \n for yourself\n\n\n\n\nAMAZON\n\n\n$29.99\n\n\n$4.99\n\n\n$34.98\n\n\n70%\n\n\nClick \n here\n\n\n\n\nCDW\n\n\n$29.21\n\n\n$7.53\n\n\n$3...,True,2383
4,"Dear user,\nCyberAge Dating Club is contacting you on behalf of Jenny.\nYou have been carefully chosen as a matching partner for the other party using \n our advanced profile-matching system. We have determined that you do not currently \n have an account with us and invite you to join. It's absolutely FREE and there's \n no obligations or hidden charges. Come see for yourself and prepare to have \n fun! \nClick here to Join \n Now or simply disregard this request if you're not interest...",True,539
...,...,...,...
3296,"If you haven't already, you should enable the debug log under\nHacking Support preferences and look for clues there.\n\n>>>Reg Clemens said:\n > > Hi,\n > > \n > > On Sun, 01 Sep 2002 00:05:03 MDT Reg Clemens wrote: \n > > \n > > [...]\n > > > in messages with GnuPG signatures. But punching the line ALWAYS\n > > > gives\n > > > \n > > > Signature made Thu Aug 29 00:27:17 2002 MDT using DSA key ID BDD\n F997A\n > > > Can't check signature: public key not found\n > > > \...",False,1801
3297,"On Wed, 25 Sep 2002, Joseph S. Barrera III wrote:\n\n> Let's say you're behind a firewall and have a NAT address.\n> Is there any way to telnet to a linux box out there in the world\n> and set your DISPLAY in some way that you can create\n> xterms on your own screen?\n\nAs other people suggested: SSH. PuTTY \n\n\thttp://www.chiark.greenend.org.uk/~sgtatham/putty/download.html\n\ncan do it. Can run, say xclock (I'm running an X server under W32 at work,\ntunneling through a NAT box), but from...",False,548
3298,"On 06 September 2002, Anthony Baxter said:\n> A snippet, hopefully not enough to trigger the spam-filters.\n\nAs an aside: one of the best ways to dodge SpamAssassin is by having an\nIn-Reply-To header. Most list traffic should meet this criterion.\n\nAlternately, I can whitelist mail to spambayes@python.org -- that'll\nwork until spammers get ahold of the list address, which usually seems\nto take a few months.\n\n Greg\n-- \nGreg Ward http://www.gerg.ca/\nG...",False,574
3299,"use Perl Daily Newsletter\n\nIn this issue:\n * ""Perl 6: Right Here, Right Now"" slides ava\n\n+--------------------------------------------------------------------+\n| ""Perl 6: Right Here, Right Now"" slides ava |\n| posted by gnat on Friday September 13, @12:01 (news) |\n| http://use.perl.org/article.pl?sid=02/09/13/162209 |\n+--------------------------------------------------------------------+\n\n[0]gnat writes ""The wonderful Leon Br...",False,1233


In [395]:
df["email_body"] = df['from_email'] + df['email_body']
df.drop(columns = ['from_email'], inplace = True)
df

Unnamed: 0,email_body,is_spam,len
0,From ib@newafrica.com Mon Aug 26 15:15:34 2002##################################################\n# #\n# Adult Club #\n# Offers FREE Membership #\n# #\n##################################################\n\n>>>>> INSTANT ACCESS TO ALL SITES NOW\n>>>>> Your User Name And Password is.\n>>>>> User Name: zzzz@example.com\n>>>>> Password: 15...,True,3206
1,"From annekojo@email.com Wed Sep 25 17:22:07 2002Dear Sir, \nWith due respect and humility I write you this letter which I believe you\nwould\nbe of great assistance to my children and I.\nI got your contact through my husband commercial address book and believed\nthat\nyou must be a trust worthy and reliable person that will not like to\nintimidate me\nor betray my trust after hearing this news. I am a native of KONOBO in the\nKEREMA\nlocal district of SIERRA LEONE in West Africa and the wi...",True,2529
2,From Special_Offer-09192002-HTML@frugaljoe.330w.com Fri Sep 20 11:41:00 2002\n\n\n\n\n\n\nNever Pay Retail!\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nUnleash \n your PC's Multimedia power TODAY!\n\n\n\n\n\n \n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCompare \n and Save\n\n\n \n\nPrice\n\n\nS&H\n\n\nTotal\n\n\nYou \n Save\n\n\nSee \n for yourself\n\n\n\n\nAMAZON\n\n\n$29.99\n\n\n$4.99\n\n\n$34.98\n\n\n...,True,2390
3,From Special_Offer-09192002-HTML@frugaljoe.330w.com Fri Sep 20 11:41:00 2002Never Pay Retail!\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nUnleash \n your PC's Multimedia power TODAY!\n\n\n\n\n\n \n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCompare \n and Save\n\n\n \n\nPrice\n\n\nS&H\n\n\nTotal\n\n\nYou \n Save\n\n\nSee \n for yourself\n\n\n\n\nAMAZON\n\n\n$29.99\n\n\n$4.99\n\n\n$34.98\n\n\n70%\n\n\nClick...,True,2383
4,"From none@none.com Mon Sep 2 16:28:01 2002Dear user,\nCyberAge Dating Club is contacting you on behalf of Jenny.\nYou have been carefully chosen as a matching partner for the other party using \n our advanced profile-matching system. We have determined that you do not currently \n have an account with us and invite you to join. It's absolutely FREE and there's \n no obligations or hidden charges. Come see for yourself and prepare to have \n fun! \nClick here to Join \n Now or simply d...",True,539
...,...,...,...
3296,"From exmh-users-admin@redhat.com Mon Sep 2 23:27:14 2002If you haven't already, you should enable the debug log under\nHacking Support preferences and look for clues there.\n\n>>>Reg Clemens said:\n > > Hi,\n > > \n > > On Sun, 01 Sep 2002 00:05:03 MDT Reg Clemens wrote: \n > > \n > > [...]\n > > > in messages with GnuPG signatures. But punching the line ALWAYS\n > > > gives\n > > > \n > > > Signature made Thu Aug 29 00:27:17 2002 MDT using DSA key ID BDD\n F997A\n > > > ...",False,1801
3297,"From fork-admin@xent.com Thu Sep 26 16:35:44 2002On Wed, 25 Sep 2002, Joseph S. Barrera III wrote:\n\n> Let's say you're behind a firewall and have a NAT address.\n> Is there any way to telnet to a linux box out there in the world\n> and set your DISPLAY in some way that you can create\n> xterms on your own screen?\n\nAs other people suggested: SSH. PuTTY \n\n\thttp://www.chiark.greenend.org.uk/~sgtatham/putty/download.html\n\ncan do it. Can run, say xclock (I'm running an X server under W3...",False,548
3298,"On 06 September 2002, Anthony Baxter said:\n> A snippet, hopefully not enough to trigger the spam-filters.\n\nAs an aside: one of the best ways to dodge SpamAssassin is by having an\nIn-Reply-To header. Most list traffic should meet this criterion.\n\nAlternately, I can whitelist mail to spambayes@python.org -- that'll\nwork until spammers get ahold of the list address, which usually seems\nto take a few months.\n\n Greg\n-- \nGreg Ward http://www.gerg.ca/\nG...",False,574
3299,"From pudge@perl.org Sat Sep 14 16:22:37 2002use Perl Daily Newsletter\n\nIn this issue:\n * ""Perl 6: Right Here, Right Now"" slides ava\n\n+--------------------------------------------------------------------+\n| ""Perl 6: Right Here, Right Now"" slides ava |\n| posted by gnat on Friday September 13, @12:01 (news) |\n| http://use.perl.org/article.pl?sid=02/09/13/162209 |\n+-----------------------------------------------------------------...",False,1233


Because the email text is noisy, we remove some characters (newlines, `#`, `>` and `-`).

In [406]:
df_without_from.dropna(inplace=True)
df_without_from.replace('\n','', regex=True, inplace=True)
df_without_from.replace('#','', regex=True, inplace=True)
df_without_from.replace('>','', regex=True, inplace=True)
df_without_from.replace('-','', regex=True, inplace=True)
df_without_from

Unnamed: 0,email_body,is_spam,len
0,"Adult Club Offers FREE Membership INSTANT ACCESS TO ALL SITES NOW Your User Name And Password is. User Name: zzzz@example.com Password: 15343 of the Best Adult Sites on the Internet for FREE!NEWS 08/15/02With just over 2.9 Million Members that signed up for FREE, Last monththere were 721,184 NewMembers. Are you one of them yet???Ou...",True,3206
1,"Dear Sir, With due respect and humility I write you this letter which I believe youwouldbe of great assistance to my children and I.I got your contact through my husband commercial address book and believedthatyou must be a trust worthy and reliable person that will not like tointimidate meor betray my trust after hearing this news. I am a native of KONOBO in theKEREMAlocal district of SIERRA LEONE in West Africa and the wife of Late DR.MUNDI A.KOJO who was assassinated by the rebel forced l...",True,2529
2,Never Pay Retail!Unleash your PC's Multimedia power TODAY! Compare and Save PriceS&HTotalYou SaveSee for yourselfAMAZON$29.99$4.99$34.9870%Click hereCDW$29.21$7.53$36.7469%Click hereOffice Depot $28.99$5.95$34.9469%Click hereOffice Max $29.99$3.99$33.9870%Click ...,True,2390
3,Never Pay Retail!Unleash your PC's Multimedia power TODAY! Compare and Save PriceS&HTotalYou SaveSee for yourselfAMAZON$29.99$4.99$34.9870%Click hereCDW$29.21$7.53$36.7469%Click hereOffice Depot $28.99$5.95$34.9469%Click hereOffice Max $29.99$3.99$33.9870%Click ...,True,2383
4,"Dear user,CyberAge Dating Club is contacting you on behalf of Jenny.You have been carefully chosen as a matching partner for the other party using our advanced profilematching system. We have determined that you do not currently have an account with us and invite you to join. It's absolutely FREE and there's no obligations or hidden charges. Come see for yourself and prepare to have fun! Click here to Join Now or simply disregard this request if you're not interested.Sincerely your...",True,539
...,...,...,...
3296,"If you haven't already, you should enable the debug log underHacking Support preferences and look for clues there.Reg Clemens said: Hi, On Sun, 01 Sep 2002 00:05:03 MDT Reg Clemens wrote: [...] in messages with GnuPG signatures. But punching the line ALWAYS gives Signature made Thu Aug 29 00:27:17 2002 MDT using DSA key ID BDD F997A Can't check signature: public key not found So, something else is missing. Yes, the public key of...",False,1801
3297,"On Wed, 25 Sep 2002, Joseph S. Barrera III wrote: Let's say you're behind a firewall and have a NAT address. Is there any way to telnet to a linux box out there in the world and set your DISPLAY in some way that you can create xterms on your own screen?As other people suggested: SSH. PuTTY \thttp://www.chiark.greenend.org.uk/~sgtatham/putty/download.htmlcan do it. Can run, say xclock (I'm running an X server under W32 at work,tunneling through a NAT box), but from Linux, not from Solaris. P...",False,548
3298,"On 06 September 2002, Anthony Baxter said: A snippet, hopefully not enough to trigger the spamfilters.As an aside: one of the best ways to dodge SpamAssassin is by having anInReplyTo header. Most list traffic should meet this criterion.Alternately, I can whitelist mail to spambayes@python.org that'llwork until spammers get ahold of the list address, which usually seemsto take a few months. Greg Greg Ward http://www.gerg.ca/Gee, I feel kind of LIGHT in the he...",False,574
3299,"use Perl Daily NewsletterIn this issue: * ""Perl 6: Right Here, Right Now"" slides ava++| ""Perl 6: Right Here, Right Now"" slides ava || posted by gnat on Friday September 13, @12:01 (news) || http://use.perl.org/article.pl?sid=02/09/13/162209 |++[0]gnat writes ""The wonderful Leon Brocard has released the slides fromhis lightning talk to the London perlmongers, [1]Perl 6: Right Here,Right Now, showing the current perl6 compiler in action....",False,1233


In [396]:
df.dropna(inplace=True)
df.replace('\n','', regex=True, inplace=True)
df.replace('#','', regex=True, inplace=True)
df.replace('>','', regex=True, inplace=True)
df.replace('-','', regex=True, inplace=True)
df.replace('From','', regex=True, inplace=True)
df

Unnamed: 0,email_body,is_spam,len
0,"ib@newafrica.com Mon Aug 26 15:15:34 2002 Adult Club Offers FREE Membership INSTANT ACCESS TO ALL SITES NOW Your User Name And Password is. User Name: zzzz@example.com Password: 15343 of the Best Adult Sites on the Internet for FREE!NEWS 08/15/02With just over 2.9 Million Members that signed up for FREE, Last monththere were 721,1...",True,3206
1,"annekojo@email.com Wed Sep 25 17:22:07 2002Dear Sir, With due respect and humility I write you this letter which I believe youwouldbe of great assistance to my children and I.I got your contact through my husband commercial address book and believedthatyou must be a trust worthy and reliable person that will not like tointimidate meor betray my trust after hearing this news. I am a native of KONOBO in theKEREMAlocal district of SIERRA LEONE in West Africa and the wife of Late DR.MUNDI A.KO...",True,2529
2,Special_Offer09192002HTML@frugaljoe.330w.com Fri Sep 20 11:41:00 2002Never Pay Retail!Unleash your PC's Multimedia power TODAY! Compare and Save PriceS&HTotalYou SaveSee for yourselfAMAZON$29.99$4.99$34.9870%Click hereCDW$29.21$7.53$36.7469%Click hereOffice Depot $28.99$5.95$34.9469%Click h...,True,2390
3,Special_Offer09192002HTML@frugaljoe.330w.com Fri Sep 20 11:41:00 2002Never Pay Retail!Unleash your PC's Multimedia power TODAY! Compare and Save PriceS&HTotalYou SaveSee for yourselfAMAZON$29.99$4.99$34.9870%Click hereCDW$29.21$7.53$36.7469%Click hereOffice Depot $28.99$5.95$34.9469%Click h...,True,2383
4,"none@none.com Mon Sep 2 16:28:01 2002Dear user,CyberAge Dating Club is contacting you on behalf of Jenny.You have been carefully chosen as a matching partner for the other party using our advanced profilematching system. We have determined that you do not currently have an account with us and invite you to join. It's absolutely FREE and there's no obligations or hidden charges. Come see for yourself and prepare to have fun! Click here to Join Now or simply disregard this request...",True,539
...,...,...,...
3296,"exmhusersadmin@redhat.com Mon Sep 2 23:27:14 2002If you haven't already, you should enable the debug log underHacking Support preferences and look for clues there.Reg Clemens said: Hi, On Sun, 01 Sep 2002 00:05:03 MDT Reg Clemens wrote: [...] in messages with GnuPG signatures. But punching the line ALWAYS gives Signature made Thu Aug 29 00:27:17 2002 MDT using DSA key ID BDD F997A Can't check signature: public key not found So, so...",False,1801
3297,"forkadmin@xent.com Thu Sep 26 16:35:44 2002On Wed, 25 Sep 2002, Joseph S. Barrera III wrote: Let's say you're behind a firewall and have a NAT address. Is there any way to telnet to a linux box out there in the world and set your DISPLAY in some way that you can create xterms on your own screen?As other people suggested: SSH. PuTTY \thttp://www.chiark.greenend.org.uk/~sgtatham/putty/download.htmlcan do it. Can run, say xclock (I'm running an X server under W32 at work,tunneling through a N...",False,548
3298,"On 06 September 2002, Anthony Baxter said: A snippet, hopefully not enough to trigger the spamfilters.As an aside: one of the best ways to dodge SpamAssassin is by having anInReplyTo header. Most list traffic should meet this criterion.Alternately, I can whitelist mail to spambayes@python.org that'llwork until spammers get ahold of the list address, which usually seemsto take a few months. Greg Greg Ward http://www.gerg.ca/Gee, I feel kind of LIGHT in the he...",False,574
3299,"pudge@perl.org Sat Sep 14 16:22:37 2002use Perl Daily NewsletterIn this issue: * ""Perl 6: Right Here, Right Now"" slides ava++| ""Perl 6: Right Here, Right Now"" slides ava || posted by gnat on Friday September 13, @12:01 (news) || http://use.perl.org/article.pl?sid=02/09/13/162209 |++[0]gnat writes ""The wonderful Leon Brocard has released the slides fromhis lightning talk to the London perlmongers, [1]Perl 6: Right Here,Right Now, show...",False,1233
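
Because the replacements above delete newlines outright, words that sat on either side of a line break end up fused (for example `youwouldbe` and `believedthatyou` in row 1). Below is a minimal sketch of one alternative, using only the standard `re` module: replace the unwanted characters with a space, then collapse runs of whitespace. The helper name `strip_noise` is hypothetical and not part of the notebook.

```python
import re

def strip_noise(text):
    """Replace newlines, '#', '>' and '-' with spaces, then squeeze whitespace."""
    text = re.sub(r"[\n#>-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

sample = "I believe you\nwould\nbe of great assistance\n>>>>> User Name"
print(strip_noise(sample))
```

A helper like this could be applied row by row with `df['email_body'].apply(strip_noise)`. Note that it also splits hyphenated words in two, which may or may not be what you want.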


The `cleaner` function is responsible for three things:
    - reducing each word to its stem (root form)
    - removing stop words
    - keeping only some types of words (verbs, nouns, adjectives, interjections)

In [397]:
stop_words = set(stopwords.words('english'))

def cleaner(message):
    porter_stemmer = PorterStemmer()
    # stems: reduce every token to its root form
    tokens = message.split()
    stemmed_tokens = [porter_stemmer.stem(token) for token in tokens]
    message = ' '.join(stemmed_tokens)
    # drop stop words
    word_tokens = word_tokenize(message)
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    message = ' '.join(filtered_sentence)
    # keep only some word types (inflected verbs, nouns, adjectives, interjections)
    # tag legend: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
    tokens = nltk.word_tokenize(message)
    tagged = nltk.pos_tag(tokens)
    message = [word[0] for word in tagged if word[1] in ('VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'NN', 'NNS', 'NNP', 'NNPS', 'JJ', 'JJR', 'JJS', 'UH')]
    return ' '.join(message)
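
Because `cleaner` stems before it filters, some stop words slip through. The Porter stemmer turns `this` into `thi` and `was` into `wa`, neither of which is in the stop word list, and both appear in the cleaned output further down. Below is a minimal sketch of the two orderings side by side. The stop list and the lookup-table stemmer are toy stand-ins (assumptions) so that the example runs without NLTK; in the notebook, `stop_words` and `PorterStemmer().stem` would take their place.

```python
def stem_then_filter(message, stop_words, stem):
    """Order used by cleaner: stem every token, then drop stop words."""
    stemmed = [stem(t) for t in message.lower().split()]
    return ' '.join(t for t in stemmed if t not in stop_words)

def filter_then_stem(message, stop_words, stem):
    """Alternative order: drop stop words first, then stem what is left."""
    kept = [t for t in message.lower().split() if t not in stop_words]
    return ' '.join(stem(t) for t in kept)

# Toy stand-ins (assumptions): a three-word stop list and a lookup table
# that mimics the stems visible in this notebook's output.
toy_stop_words = {"this", "was", "the"}
toy_stems = {"this": "thi", "was": "wa", "letters": "letter"}

def toy_stem(token):
    return toy_stems.get(token, token)

text = "This was the letters"
print(stem_then_filter(text, toy_stop_words, toy_stem))  # thi wa letter
print(filter_then_stem(text, toy_stop_words, toy_stem))  # letter
```

Whether the second ordering actually helps the classifier is something to measure rather than assume.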

Then we apply the `cleaner` function to both data frames.

In [407]:
email_body_without_from_cleaned_serie = df_without_from['email_body'].apply(lambda x: cleaner(x))
cleaned_df_without_from = df_without_from.copy()
cleaned_df_without_from['email_body'] = email_body_without_from_cleaned_serie
cleaned_df_without_from['len'] = cleaned_df_without_from['email_body'].str.len()
cleaned_df_without_from

Unnamed: 0,email_body,is_spam,len
0,adult club offer free membership instant access TO site user name password user name zzzz @ example.com password best adult site internet free new member sign free last monthther newmembers membership faqq whi offer free access adult membership site forfree advertis ad space payfor membership.q Is true membership life absolut cent advertis do.q give account friend family yes long age membership sites one access them.q get started follow link becom member dollar polici rules requir info free ...,True,1507
1,dear sir due respect humil write thi letter believ great assist children i.i got contact husband commerci address book believedthaty worthi person tointimid meor betray trust hear thi news nativ konobo thekeremaloc district sierra leon west africa wife late dr.mundi a.kojo wa assassin rebel forc loyal major john paulkoromahbecaus director gener nation gold diamond miningcorporationof sierra leone day befor husband wa assassinated instruct meand children ibrahim amina move sierra leon befor p...,True,1531
2,retail unleash pc multimedia power today compar save prices htotaly savese % click herecdw % click hereoffic depot % click hereoffic max % click here* comparison base similar model lead merchants *microsoft windows® approv certifi compat major voic recognit software unleash pc multimedia power today labtec® patent ncat2 nois cancellingamplif technology use PC call save big activ PC program use voice video confer friend IN real time voic email worldwid internet live record voic convers pro th...,True,939
3,retail unleash pc multimedia power today compar save prices htotaly savese % click herecdw % click hereoffic depot % click hereoffic max % click here* comparison base similar model lead merchants *microsoft windows® approv certifi compat major voic recognit software unleash pc multimedia power today labtec® patent ncat2 nois cancellingamplif technology use PC call save big activ PC program use voice video confer friend IN real time voic email worldwid internet live record voic convers pro th...,True,939
4,dear user cyberag date club contact behalf jenny.y care chosen match partner parti use advanc profilematch system determin current account invit join absolut free oblig hidden charges see prepar fun click join simpli disregard thi request r interested.sincer cyberag date club staff,True,282
...,...,...,...
3296,underhack support prefer look clue there.reg clemen said hi sun sep mdt reg clemen wrote [ ] messag gnupg signatures punch line give signatur made thu aug mdt use dsa key ID bdd f997a signature public key found someth els missing yes public key signatur want check realli sure public key message signature download tri check signatur know public key ah sorri make clearer previous v1.0.6 gnupg paus thi point whi went public key keyserver key get failur message cant find gpg execut fix path els ...,False,906
3297,wed sep joseph S. barrera iii wrote let r firewal nat address Is ani way telnet linux box world set display way creat xterm screen peopl suggested ssh putti http //www.chiark.greenend.org.uk/~sgtatham/putty/download.htmlcan run say xclock 'm run X server w32 work box linux solaris probablyopenssh misconfigurat5ion,False,315
3298,septemb anthoni baxter said snippet hope trigger spamfilters.a best way dodg spamassassin aninreplyto header list traffic meet thi whitelist mail spambayes @ python.org that'llwork spammer get ahold list address usual seemsto take months greg greg ward http //www.gerg.ca/gee feel kind light head know mysatellit dish payments,False,326
3299,use perl daili newsletterin thi issue * perl right right perl right right ava || post gnat friday septemb news || http //use.perl.org/article.pl sid=02/09/13/162209 |++ [ ] gnat write leon brocard ha releas slide lightn talk london perlmongers [ ] perl right show current perl6 compil action discuss thi stori http //use.perl.org/comments.pl sid=02/09/13/162209links mailto gnat @ oreilly.com http //astray.com/perl6_now/copyright pudge right reserved.============================================...,False,702


In [398]:
email_body_cleaned_serie = df['email_body'].apply(lambda x: cleaner(x))
cleaned_df = df.copy()
cleaned_df['email_body'] = email_body_cleaned_serie
cleaned_df['len'] = cleaned_df['email_body'].str.len()
cleaned_df

Unnamed: 0,email_body,is_spam,len
0,ib @ newafrica.com mon aug adult club offer free membership instant access TO site user name password user name zzzz @ example.com password best adult site internet free new member sign free last monthther newmembers membership faqq whi offer free access adult membership site forfree advertis ad space payfor membership.q Is true membership life absolut cent advertis do.q give account friend family yes long age membership sites one access them.q get started follow link becom member dollar pol...,True,1534
1,annekojo @ email.com wed sep sir due respect humil write thi letter believ great assist children i.i got contact husband commerci address book believedthaty worthi person tointimid meor betray trust hear thi news nativ konobo thekeremaloc district sierra leon west africa wife late dr.mundi a.kojo wa assassin rebel forc loyal major john paulkoromahbecaus director gener nation gold diamond miningcorporationof sierra leone day befor husband wa assassinated instruct meand children ibrahim amina ...,True,1555
2,special_offer09192002html @ frugaljoe.330w.com fri sep pay retail unleash pc multimedia power today compar save prices htotaly savese % click herecdw % click hereoffic depot % click hereoffic max % click here* comparison base similar model lead merchants *microsoft windows® approv certifi compat major voic recognit software unleash pc multimedia power today labtec® patent ncat2 nois cancellingamplif technology use PC call save big activ PC program use voice video confer friend IN real time v...,True,998
3,special_offer09192002html @ frugaljoe.330w.com fri sep pay retail unleash pc multimedia power today compar save prices htotaly savese % click herecdw % click hereoffic depot % click hereoffic max % click here* comparison base similar model lead merchants *microsoft windows® approv certifi compat major voic recognit software unleash pc multimedia power today labtec® patent ncat2 nois cancellingamplif technology use PC call save big activ PC program use voice video confer friend IN real time v...,True,998
4,none @ none.com mon sep user cyberag date club contact behalf jenny.y care chosen match partner parti use advanc profilematch system determin current account invit join absolut free oblig hidden charges see prepar fun click join simpli disregard thi request r interested.sincer cyberag date club staff,True,301
...,...,...,...
3296,exmhusersadmin @ redhat.com mon sep underhack support prefer look clue there.reg clemen said hi sun sep mdt reg clemen wrote [ ] messag gnupg signatures punch line give signatur made thu aug mdt use dsa key ID bdd f997a signature public key found someth els missing yes public key signatur want check realli sure public key message signature download tri check signatur know public key ah sorri make clearer previous v1.0.6 gnupg paus thi point whi went public key keyserver key get failur messag...,False,942
3297,forkadmin @ xent.com thu sep wed sep joseph S. barrera iii wrote let r firewal nat address Is ani way telnet linux box world set display way creat xterm screen peopl suggested ssh putti http //www.chiark.greenend.org.uk/~sgtatham/putty/download.htmlcan run say xclock 'm run X server w32 work box linux solaris probablyopenssh misconfigurat5ion,False,344
3298,septemb anthoni baxter said snippet hope trigger spamfilters.a best way dodg spamassassin aninreplyto header list traffic meet thi whitelist mail spambayes @ python.org that'llwork spammer get ahold list address usual seemsto take months greg greg ward http //www.gerg.ca/gee feel kind light head know mysatellit dish payments,False,326
3299,pudge @ perl.org sat sep perl daili newsletterin thi issue * perl right right perl right right ava || post gnat friday septemb news || http //use.perl.org/article.pl sid=02/09/13/162209 |++ [ ] gnat write leon brocard ha releas slide lightn talk london perlmongers [ ] perl right show current perl6 compil action discuss thi stori http //use.perl.org/comments.pl sid=02/09/13/162209links mailto gnat @ oreilly.com http //astray.com/perl6_now/copyright pudge right reserved.=======================...,False,723


A quick check: before cleaning, the mean text length was about 1947; after cleaning it is about 1153 with the sender kept and about 1126 with the sender removed.

In [409]:
print("Before clean: ", df.len.mean(), " --- After clean: ", cleaned_df.len.mean(), " --- Other clean: ", cleaned_df_without_from.len.mean())

Before clean:  1947.548621629809  --- After clean:  1153.1899424416843  --- Other clean:  1126.7421993335354
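
The comparison above can be reproduced on a small scale. The following is a minimal sketch, assuming the `len` column holds the character count of each body; the two toy frames and their contents are hypothetical stand-ins for `df` and `cleaned_df`, not the real data.

```python
import pandas as pd

# Toy data standing in for the raw and cleaned email bodies (hypothetical values).
raw = pd.DataFrame({"email_body": ["Hello, this is the FIRST email!", "Buy now!!! Click here"]})
cleaned = pd.DataFrame({"email_body": ["hello first email", "buy click"]})

# Character count of each body, stored in a `len` column as in the notebook.
raw["len"] = raw["email_body"].str.len()
cleaned["len"] = cleaned["email_body"].str.len()

# Bracket access keeps the column lookup explicit, since `len` is also a built-in name.
mean_before = raw["len"].mean()
mean_after = cleaned["len"].mean()

# Fraction of the mean length removed by cleaning.
reduction = 1 - mean_after / mean_before
print(f"Before: {mean_before:.1f}  After: {mean_after:.1f}  Reduction: {reduction:.1%}")
```

Reporting the relative reduction alongside the two means makes it easier to compare cleaning variants, such as the version with and without the sender.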


We will export the data to CSV, but first we remove the `len` column, which is no longer needed, and drop any rows with missing values.

In [410]:
final_df_without_from = cleaned_df_without_from.drop(columns=['len'])
final_df_without_from.dropna(inplace=True)
final_df_without_from

Unnamed: 0,email_body,is_spam
0,adult club offer free membership instant access TO site user name password user name zzzz @ example.com password best adult site internet free new member sign free last monthther newmembers membership faqq whi offer free access adult membership site forfree advertis ad space payfor membership.q Is true membership life absolut cent advertis do.q give account friend family yes long age membership sites one access them.q get started follow link becom member dollar polici rules requir info free ...,True
1,dear sir due respect humil write thi letter believ great assist children i.i got contact husband commerci address book believedthaty worthi person tointimid meor betray trust hear thi news nativ konobo thekeremaloc district sierra leon west africa wife late dr.mundi a.kojo wa assassin rebel forc loyal major john paulkoromahbecaus director gener nation gold diamond miningcorporationof sierra leone day befor husband wa assassinated instruct meand children ibrahim amina move sierra leon befor p...,True
2,retail unleash pc multimedia power today compar save prices htotaly savese % click herecdw % click hereoffic depot % click hereoffic max % click here* comparison base similar model lead merchants *microsoft windows® approv certifi compat major voic recognit software unleash pc multimedia power today labtec® patent ncat2 nois cancellingamplif technology use PC call save big activ PC program use voice video confer friend IN real time voic email worldwid internet live record voic convers pro th...,True
3,retail unleash pc multimedia power today compar save prices htotaly savese % click herecdw % click hereoffic depot % click hereoffic max % click here* comparison base similar model lead merchants *microsoft windows® approv certifi compat major voic recognit software unleash pc multimedia power today labtec® patent ncat2 nois cancellingamplif technology use PC call save big activ PC program use voice video confer friend IN real time voic email worldwid internet live record voic convers pro th...,True
4,dear user cyberag date club contact behalf jenny.y care chosen match partner parti use advanc profilematch system determin current account invit join absolut free oblig hidden charges see prepar fun click join simpli disregard thi request r interested.sincer cyberag date club staff,True
...,...,...
3296,underhack support prefer look clue there.reg clemen said hi sun sep mdt reg clemen wrote [ ] messag gnupg signatures punch line give signatur made thu aug mdt use dsa key ID bdd f997a signature public key found someth els missing yes public key signatur want check realli sure public key message signature download tri check signatur know public key ah sorri make clearer previous v1.0.6 gnupg paus thi point whi went public key keyserver key get failur message cant find gpg execut fix path els ...,False
3297,wed sep joseph S. barrera iii wrote let r firewal nat address Is ani way telnet linux box world set display way creat xterm screen peopl suggested ssh putti http //www.chiark.greenend.org.uk/~sgtatham/putty/download.htmlcan run say xclock 'm run X server w32 work box linux solaris probablyopenssh misconfigurat5ion,False
3298,septemb anthoni baxter said snippet hope trigger spamfilters.a best way dodg spamassassin aninreplyto header list traffic meet thi whitelist mail spambayes @ python.org that'llwork spammer get ahold list address usual seemsto take months greg greg ward http //www.gerg.ca/gee feel kind light head know mysatellit dish payments,False
3299,use perl daili newsletterin thi issue * perl right right perl right right ava || post gnat friday septemb news || http //use.perl.org/article.pl sid=02/09/13/162209 |++ [ ] gnat write leon brocard ha releas slide lightn talk london perlmongers [ ] perl right show current perl6 compil action discuss thi stori http //use.perl.org/comments.pl sid=02/09/13/162209links mailto gnat @ oreilly.com http //astray.com/perl6_now/copyright pudge right reserved.============================================...,False


In [400]:
final_df = cleaned_df.drop(columns=['len'])
final_df.dropna(inplace=True)
final_df

Unnamed: 0,email_body,is_spam
0,ib @ newafrica.com mon aug adult club offer free membership instant access TO site user name password user name zzzz @ example.com password best adult site internet free new member sign free last monthther newmembers membership faqq whi offer free access adult membership site forfree advertis ad space payfor membership.q Is true membership life absolut cent advertis do.q give account friend family yes long age membership sites one access them.q get started follow link becom member dollar pol...,True
1,annekojo @ email.com wed sep sir due respect humil write thi letter believ great assist children i.i got contact husband commerci address book believedthaty worthi person tointimid meor betray trust hear thi news nativ konobo thekeremaloc district sierra leon west africa wife late dr.mundi a.kojo wa assassin rebel forc loyal major john paulkoromahbecaus director gener nation gold diamond miningcorporationof sierra leone day befor husband wa assassinated instruct meand children ibrahim amina ...,True
2,special_offer09192002html @ frugaljoe.330w.com fri sep pay retail unleash pc multimedia power today compar save prices htotaly savese % click herecdw % click hereoffic depot % click hereoffic max % click here* comparison base similar model lead merchants *microsoft windows® approv certifi compat major voic recognit software unleash pc multimedia power today labtec® patent ncat2 nois cancellingamplif technology use PC call save big activ PC program use voice video confer friend IN real time v...,True
3,special_offer09192002html @ frugaljoe.330w.com fri sep pay retail unleash pc multimedia power today compar save prices htotaly savese % click herecdw % click hereoffic depot % click hereoffic max % click here* comparison base similar model lead merchants *microsoft windows® approv certifi compat major voic recognit software unleash pc multimedia power today labtec® patent ncat2 nois cancellingamplif technology use PC call save big activ PC program use voice video confer friend IN real time v...,True
4,none @ none.com mon sep user cyberag date club contact behalf jenny.y care chosen match partner parti use advanc profilematch system determin current account invit join absolut free oblig hidden charges see prepar fun click join simpli disregard thi request r interested.sincer cyberag date club staff,True
...,...,...
3296,exmhusersadmin @ redhat.com mon sep underhack support prefer look clue there.reg clemen said hi sun sep mdt reg clemen wrote [ ] messag gnupg signatures punch line give signatur made thu aug mdt use dsa key ID bdd f997a signature public key found someth els missing yes public key signatur want check realli sure public key message signature download tri check signatur know public key ah sorri make clearer previous v1.0.6 gnupg paus thi point whi went public key keyserver key get failur messag...,False
3297,forkadmin @ xent.com thu sep wed sep joseph S. barrera iii wrote let r firewal nat address Is ani way telnet linux box world set display way creat xterm screen peopl suggested ssh putti http //www.chiark.greenend.org.uk/~sgtatham/putty/download.htmlcan run say xclock 'm run X server w32 work box linux solaris probablyopenssh misconfigurat5ion,False
3298,septemb anthoni baxter said snippet hope trigger spamfilters.a best way dodg spamassassin aninreplyto header list traffic meet thi whitelist mail spambayes @ python.org that'llwork spammer get ahold list address usual seemsto take months greg greg ward http //www.gerg.ca/gee feel kind light head know mysatellit dish payments,False
3299,pudge @ perl.org sat sep perl daili newsletterin thi issue * perl right right perl right right ava || post gnat friday septemb news || http //use.perl.org/article.pl sid=02/09/13/162209 |++ [ ] gnat write leon brocard ha releas slide lightn talk london perlmongers [ ] perl right show current perl6 compil action discuss thi stori http //use.perl.org/comments.pl sid=02/09/13/162209links mailto gnat @ oreilly.com http //astray.com/perl6_now/copyright pudge right reserved.=======================...,False


Finally, we export both versions of the data to CSV files.

In [411]:
final_df_without_from.to_csv('cleaned_emails_without_from.csv', index=False)

In [401]:
final_df.to_csv('cleaned_emails.csv', index=False)
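
One thing worth checking after an export like this is how the file reads back. The following is a minimal sketch on a hypothetical toy frame (the rows and the in-memory buffer are assumptions for illustration, not the notebook's data). It shows that a body that is an empty string passes `dropna()` but comes back from `pd.read_csv` as a missing value, so a second `dropna()` after loading may be needed.

```python
import io
import pandas as pd

# Hypothetical cleaned frame; the second body is empty after cleaning.
toy = pd.DataFrame({
    "email_body": ["free offer click", "", "meet monday"],
    "is_spam": [True, True, False],
    "len": [16, 0, 11],
})

# Drop the helper column and any rows with missing values, as in the notebook.
final = toy.drop(columns=["len"]).dropna()

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
final.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The empty string was kept by dropna() but is parsed as NaN on reload.
n_missing = int(reloaded["email_body"].isna().sum())
print(len(final), "rows written;", n_missing, "missing after reload")

# Remove the rows that became missing so downstream code sees only text.
reloaded = reloaded.dropna().reset_index(drop=True)
```

Passing `index=False`, as the export cells above do, keeps the row index out of the file, so the reloaded frame has only the `email_body` and `is_spam` columns.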