# INFO371 Problem Set: Bayes-Theorem based Spam Filter

In this problem set you will use Bayes' theorem to categorize
emails from the Ling-Spam corpus into spam and non-spam.  A Bayes approach based on a single word does not give good results, but this problem set serves as preparatory
work for understanding the Naive Bayes approach.


## Ling-Spam emails

The corpus contains ~2700 emails from academic accounts talking
about conferences, deadlines, papers, etc., peppered with wonderful
offers of Viagra, lottery millions and similar spam messages.  The
emails have been converted into a csv file that contains three variables:

* spam --> true or false, this email is spam
* files --> the original file name for this email (not needed in this HW).
* message --> the content of the email in a single line


## (5pt) Explore and clean the data

First, let's load data and take a closer look at it.

1. (2pt) Load the lingspam-emails.csv.bz2 dataset.  Browse a handful of emails, both spam and non-spam ones, to see what kind of text we are working with here.  Hint: check out the textwrap module to print long strings on multiple lines.
  
  
2. (3pt) Ensure the data is clean: remove all cases with a missing spam value or an empty message field.  We do not care about the file names.
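The textwrap hint above can be sketched as follows. This is only a minimal illustration: the `message` string is a stand-in for a single value of the message column, and the width of 60 characters is an arbitrary choice.

```python
import textwrap

# Stand-in for one value of the `message` column (a single long line).
message = ("Subject: query : causatives in korean could anyone point me to "
           "any books and articles about causative constructions in korean ? "
           "please send an e-mail directly to me .")

# Break the single-line message into lines of at most 60 characters.
wrapped = textwrap.fill(message, width=60)
print(wrapped)
```

With the real data you would pass one email's message to `textwrap.fill` instead of the stand-in string.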

In [522]:
# code goes here
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix

In [523]:
#1
pd.options.display.max_colwidth = None  # do not truncate long message strings

emails = pd.read_csv("lingspam-emails.csv.bz2", sep='\t')

emails[emails.spam == True].head(5) #spams

Unnamed: 0,spam,files,message
241,True,spmsga1.txt,"Subject: great part-time or summer job ! * * * * * * * * * * * * * * * we have display boxes with credit applications that we need to place in the small owner-operated stores in your area . here is what you do : 1 . introduce yourself to the store owner or manager . 2 . use our 90 % effective script which tells them how this little display box will save their customers hundreds of dollars , be a drawing card for their business , and make them from $ 5 . 00 to $ 15 . 00 or more for every app sent in . 3 . find a good spot on the counter , place the box there , and say that nothing more need be done , all you need is his name and address so the company can send him the commission checks . your compensaation will be $ 10 for every box you place . by becoming a representative you could also earn a commission of $ 10 for each application that came from that store . that is of course a much more profitable plan , as it will pay you for months or years for a very small effort . call 1-888 - 703-5390 code 3 24 hours to receive the details ! ! * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * to be removed from our mailing list , type : b2998 @ hotmail . com in the ( to : ) area and ( remove ) in the subject area of a new e - mail and send . * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *"
242,True,spmsga10.txt,"Subject: auto insurance rates too high ? dear nlpeople , i ' m sure you ' ll agree auto insurance costs too much . even with a good driving record , "" routine "" rate increases can drive your costs through the roof . i have discovered a way many people can sign up with an excellent company that gives amazingly low rates . they are about half of most of the rates i ' ve found shopping around for insurance in southern california . most people either qualify or have a friend who qualifies who would love to know about it . if you do n't qualify , i have another company that operates in several western states that is cheaper than many companies who claim they have "" the lowest rates available . "" just send $ 2 cash to : pva 1257 n kenmore ave # 2 los angeles , ca 90029 fold it in a piece of paper with your e - mail address and i will rush the information to you right away . if you prefer a hardcopy printout , enclose a self-addressed , stamped envelope . p . s . as a bonus i include two mechanic 's tips that save lots of time on a certain common repair job , and give you a quick and easy way to check the general condition of an engine . i have n't found these in any repair manuals or books before . these are great for home mechanics !"
243,True,spmsga100.txt,"Subject: do want the best and economical hunting vacation of your life ? if you want the best hunting and camping vacation of your life , come to felton 's hunting camp in wild and wonderful west virginia . $ 50 . 00 per day pays for your room and three home cooked meals ( packed lunch if you want to stay out in the woods at noon ) with cozy accomodations . reserve your space now . following seasons are now being booked for 1998 : buck season - nov . 23 - dec . 5 doe season - to be announced ( please call ) muzzel loader ( deer ) - dec . 14 - dec . 19 archery ( deer ) - oct . 17 - dec . 31 turkey sesson - oct . 24 - nov . 14 e - mail us at 110734 . 2622 @ compuserve . com"
244,True,spmsga101.txt,"Subject: email 57 million people for $ 99 57 million email addresses for only $ 99 you want to make some money ? i can put you in touch with over 50 million people at virtually no cost . can you make one cent from each of theses names ? if you can you have a profit of over $ 500 , 000 . 00 that 's right , i have 57 million fresh email addresses that i will sell for only $ 99 . these are all fresh addresses that include almost every person on the internet today , with no duplications . they are all sorted and ready to be mailed . that is the best deal anywhere today ! imagine selling a product for only $ 5 and getting only a 1 / 10 % response . that 's $ 2 , 850 , 000 in your pocket ! ! ! do n't believe it ? people are making that kind of money right now by doing the same thing , that is why you get so much email from people selling you their product . . . . it works ! i will even tell you how to mail them with easy to follow step-by - step instructions i include with every order . these 57 million email addresses are yours to keep , so you can use them over and over and they come on 1 cd . this offer is not for everyone . if you can not see the just how excellent the risk / reward ratio in this offer is then there is nothing i can do for you . to make money you must stop dreaming and take action . * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * the bronze marketing setup 57 , 000 , 000 email addresses on cd these name are all in text files ready to mail ! ! ! $ 99 . 00 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * the silver marketing setup 57 , 000 , 000 email addresses on cd these name are all in text files ready to mail ! ! ! and 8 different bulk email programs and tools to help with your mailings and list management . $ 139 . 00 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * the gold marketing setup virtually everything ! ! 
57 , 000 , 000 email addresses on cd these name are all in text files ready to mail ! ! ! and 8 different bulk email programs and tools to help with your mailings and list management . and over 500 different business reports now being sold on the internet for up to $ 100 each . you get full rights to resell these reports . with this package you get the email addresses , the software to mail them and ready to sell information products . and . . . . . . . . a collection of the 100 best money making adds currently floating around on the internet . $ 189 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * the platinum marketing setup for those ready to "" own the net "" 57 , 000 , 000 email addresses on cd these name are all in text files ready to mail ! ! ! and 8 different bulk email programs and tools to help with your mailings and list management . and over 500 different business reports now being sold on the internet for up to $ 100 each . you get full rights to resell these reports . with this package you get the email addresses , the software to mail them and ready to sell information products . and . . . . . . . . a collection of the 100 best money making adds currently floating around on the internet . and . . . . . . floodgate & goldrush fully registered software ! ! this is the number 1 most powerful mass mailing software in the world today . there is nothing that can compare for speed , reliability , performance , and the ability to use "" stealth "" functions . this is the package that will allow you to use the net as your own personal "" money tree "" at will ! ! ! $ 379 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * several ways to order ! ! ! if you order by phone we will ship your cd containing the 57 million + names within 12 hours of your order ! ! ! 1 ) we accept : american express or visa mastercard type of card amx / visa / mc ? ? 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ expiration date _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ name on credit card _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ credit card # _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ billing address _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ city _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ state _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ zip _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ phone include area code _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ email address _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ we will bill selected amount to your account plus the following shipping costs shipping cost of 3 . 85 first class mail shipping cost of 15 . 00 24 hour express mail / federal express sales tax added to ar residents > > > send correct amount in cash , check or money order to : > > > fire power ! ! > > > 1320 n . "" b "" st . , suite 112-24 > > > fort smith , ar 72901 2 ) send the same above requested credit card information to above address . 3 ) call phone # 530-876 - 4293 . this is a 24 hour phone number to place a credit card order . fire power ! is a private company and is not affiliated with , or endorsed by , aol , msn , or any other internet service provider . copyright 1998 all rights reserved iq"
245,True,spmsga102.txt,"Subject: do n't miss these ! attention ! warning ! adults only ! warning ! adults only ! if you are under 21 years of age , or not interested in sexually explicit material . . . please hit your keyboard delete button now and please excuse the intrusion . to remove your name from our mailing list , send us email with remove in the subject line . you need not read any further ! available now for only $ 9 . 95 ! next 10 days only ! world record sex ! be there ! see it now on video ! unbelievable . . . but true ! you won't believe your eyes ! ! ! [ as seen on the howard stern show ] "" the world 's biggest gang bang "" see sexy annabel chong as she sets the world gang bang record in this fantastic video documentary that chronicles her 24 hour sexathon with 251 men engaging in sexual intercourse and oral sex with her ! do n't worry , you won't have to stay up 24 hours to watch it all . we ' ve selected only the most exciting and red hot scenes for you . . . all in breathtaking living color with plenty of extreme close-ups ! this video is guaranteed to knock your socks off and leave you breathless ! you ' ve never seen anything like it ! annabel takes on five men at a time ! 90 minutes ! order today ! only $ 9 . 95 plus $ 3 shipping and handling [ total $ 12 . 95 ] . "" gang bang ii "" the record breaker ! ! ! starring jasmin st . claire ! see beautiful and voluptious jasmin st . claire shatter annabel 's gang bang record by taking on 300 men in one 24 hour sex session ! you won't believe your eyes at all the hot firey action that you will see as the new world record is established before your eyes as jasmin takes on five men at a time for sexual intercourse and oral sex ! your friends will break down your door to see this video ! you ' ll be the most popular guy in town ! the action is truly unreal and you will see the best of it in living life-like color ! order today and see jasmin break the record ! 90 minutes . only $ 9 . 
95 plus $ 3 shipping and handling [ total $ 12 . 95 ] . also available . . . the uncensored authentic underground . . . pamela anderson lee & tommy lee sex video tape ! everyone is talking about this exciting video ! see pam and tommy engaging in sexual intercourse and oral sex in the car , on the boat and much , much more ! a real collectors video ! 30 minutes . only $ 9 . 95 plus $ 3 shipping and handling [ total $ 12 . 95 ] "" tonya harding wedding night sex video "" now see the beautiful ice skating shame of the olympics tonya harding engaging in sexual intercourse and oral sex on her wedding night with husband jeff gillooly ! this "" bad girl "" is hot ! do n't miss this video ! 30 minutes . only $ 9 . 95 plus $ 3 shipping and handling [ total $ 12 . 95 ] "" traci . . . i love you "" starring traci lords now see the most beautiful and popular porn star in her last adult video before she hit the big time ! it 's the blockbuster of the year . . . sensual . . . fiery and exposive ! traci lords in her most erotic and controversial film ever ! do n't miss it ! 90 minutes . only $ 9 . 95 plus $ 3 shipping and handling [ total $ 12 . 95 ] email special ! order any four videos and get the fifth one free ! ! ! your order will be shipped via first class mail . all shipments in plain unmarked wrapper . for priority mail - add $ 5 for overnight express - add $ 15 you can order by phone , fax , mail or email . we accept all major credit cards and checks by phone or fax . visa - mastercard - american express - discover 10 day money back guarantee ! we know that you will be pleased with these videos ! to email your order - do not hit reply on your keyboard send email to our special email address below : zsazsa36 @ juno . com [ note : if you order by email and do not receive an email acknowledgement within 24 hours , please phone our office at 718-287 - 3800 ] phone our office 9am to 10 pm [ eastern time ] [ 718 ] 287-3800 to order by phone for fastest service ! 
we can accept your credit card or check by phone fax your order 24 hours per day to [ 718 ] 462-5920 you can fax your credit card information or your check order by mail by sending $ 12 . 95 per video , cash , check , money order or major credit card [ visa , mastercard , american express or discover ] to tcps , inc . 4718 18th ave . suite 135 brooklyn , ny 11204 make checks & money orders payable to tcps , inc . new york state residents please add 85 cents for sales tax per video ! you must be over 21 years of age to order and give us your date of birth with your order ! the following order form is for your convenience ! . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . please ship me the following video tape [ s ] ! qty _ _ _ _ _ _ _ _ _ _ _ annabel chong "" world 's biggest gang bang "" qty _ _ _ _ _ _ _ _ _ _ "" gang bang ii "" jasmin st . claire qty _ _ _ _ _ _ _ _ _ _ _ "" pamela & tommy lee sex video tape "" qty _ _ _ _ _ _ _ _ _ "" tonya harding wedding night sex video tape "" qty _ _ _ _ _ _ _ _ _ _ "" traci i love you "" traci lords at $ 9 . 95 each plus $ 3 . 00 for shipping and handling per tape [ $ 12 . 95 per video or "" special $ 51 . 80 for all five "" ! credit card # _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ exp date _ _ _ i hereby represent that i am over 21 years of age . 
my date of birth is _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ signature _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ship to : name _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ address _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ city _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ state _ _ _ _ _ _ _ _ _ _ _ zip _ _ _ _ _ _ _ _ area code and home phone [ ] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ fax # [ ] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ email address _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ to remove your name from our mailing list , send us an email with remove in the subject line . this is a one time offer and you should not hear from us again ! foreign orders - add $ 15us if you desire air parcel post shipment . we ship all over the world . by deleting your unwanted e - mail you waste one keystroke , yet by throwing away paper mail you waste our planet ! save the trees and support internet e - mail instead of paper mail ! [ c ] copyright tcps 1998"


In [524]:
emails[emails.spam == False].head(5) #non-spams

Unnamed: 0,spam,files,message
0,False,3-1msg1.txt,"Subject: re : 2 . 882 s - > np np > date : sun , 15 dec 91 02 : 25 : 02 est > from : michael < mmorse @ vm1 . yorku . ca > > subject : re : 2 . 864 queries > > wlodek zadrozny asks if there is "" anything interesting "" to be said > about the construction "" s > np np "" . . . second , > and very much related : might we consider the construction to be a form > of what has been discussed on this list of late as reduplication ? the > logical sense of "" john mcnamara the name "" is tautologous and thus , at > that level , indistinguishable from "" well , well now , what have we here ? "" . to say that ' john mcnamara the name ' is tautologous is to give support to those who say that a logic-based semantics is irrelevant to natural language . in what sense is it tautologous ? it supplies the value of an attribute followed by the attribute of which it is the value . if in fact the value of the name-attribute for the relevant entity were ' chaim shmendrik ' , ' john mcnamara the name ' would be false . no tautology , this . ( and no reduplication , either . )"
1,False,3-1msg2.txt,"Subject: s - > np + np the discussion of s - > np + np reminds me that some years ago i read , in a source now forgotten , a critique of some newsmagazines ' unique tendencies in writing style , most of which the writer found overly "" cute "" . one item was tersely put down as follows : "" time 's favorite : the colon . "" - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - lee hartman ga5123 @ siucvmb . bitnet department of foreign languages southern illinois university carbondale , il 62901 u . s . a ."
2,False,3-1msg3.txt,"Subject: 2 . 882 s - > np np . . . for me it 's much more restrictive than s - > np np . it 's "" no "" np pro quite an over-restriction , that ."
3,False,3-375msg1.txt,"Subject: gent conference "" for the listserv "" international conference 1992 second circular : february 1992 literature and the analysis of discourse with special attention to the multicultural context tuesday 8 september - friday 11 september 1992 gent university , belgium writing and reading literature , oral literary traditions , dialogic text , non-literary narratives , discourse theory , literature as social practice , etc . , etc . , etc . keynote speakers : david birch ( murdoch , australia ) martin montgomery ( strathclyde , scotland ) elinor ochs ( los angeles , usa ) statement of pala ' s aims pala 's principal aim is to encourage cooperation between scholars and teachers interested in language and / or literary studies . the interests of pala members are wide , and this is reflected in papers given at pala conferences . interests of members include : stylistics , literary theory , the teaching of language and literature , critical linguistics , pragmatics , discours analysis , textual understanding , rhetoric , narratology , semiotic approaches to text and performance , sociolinguistics , cultural studies , post-structuralist theory ; in short , any theme which has relevance to the study and teaching of language and literature and their role in society . the 1992 conference theme to highlight the currently expanding field of discours studies , the 1992 conference has as its core theme ' literature and the analysis of discourse , with special attention to the multicultural context ' . papers covering interests as wide as the processes of writing and reading literature , the analysis of dialogic text , oral literary traditions , the relationship between literary and non-literary discourse , discourse theory and literary communication as social practice have all been proposed , as well as those dealing specifically with the writing and reading of literature in a multilingual and / or multicultural context . 
the 1992 conference venue gent university is of the city type ; there is no campus , and university buildings are dotted around the town . conference sessions will take place in the hoveniersberg , overlooking the bovenschelde in one of the quiet parts of town . programme conference sessions will start on the morning of the wednesday and last a full three days . it is envisaged that most participants will arrive and register on the tuesday evening . our provisional programme looks like this : tuedsday 8 sept 15 . 00 onwards : registration wednesday 9 sept 08 . 30 - 09 . 30 : late registration 09 . 45 : opening of conference 10 . 00 - 18 . 00 : conference sessions 18 . 30 : pre-booked dinner 20 . 15 : drinks reception thursday 10 sept 08 . 30 - 18 . 00 : conference sessions 18 . 30 : pala agm 20 . 00 : pre-booked dinner friday 11 sept 08 . 30 - 17 . 00 : conference sessions 17 . 15 : wind-up session evening : activities to be arranged there will be continuous coffee , tea , etc . throughout the conference sessions . accommodation rooms in the vermeylen student hall of residence , a couple of hundred metres from the conference centre , are available to all participants . it is possible to book rooms for several nights either side of the conference dates . the price on the registration form includes breakfast . unfortunately , no double rooms are available . if you would prefer to stay in a hotel , we recommend the arcade hotel ( nederkouter , 9000 gent ; tel . 32-91 - 25 . 07 . 07 ) , which is only 10 minutes ' walk from the conference centre . alternatively , you can contact the gent tourist office ( meersstraat 138 , 9000 gent ; tel . 32-91 - 25 . 35 . 55 ) . food breakfast will be served in the overpoort , the university eating complex next door to the vermeylen . lunch and supper is also available there to conference participants , as are snacks throughout the day . 
there will be no single ' conference dinner ' as such , but to make it easier for participants to meet each other , we are arranging dinners for both wednesday and thursday evenings in the university restaurant . these have to be pre-booked . staying in gent gent ( population around 230 , 000 ) is a historic flemish city , the first in europe to declare itself independent of feudal control . it has a plethora of medieval vistas and bridges and is thus entitled to compete with bruges and amsterdam for the title of ' venice of the north ' . it is also a busy industrial city and the commercial and administrative centre for east flanders . the first language is flemish / dutch ( depending on one 's sociolinguistic viewpoint ) but nearly every-body can use both english and french with at least some degree of fluency . there are numerous restaurants , cafes and pubs near the conference area ( including two good vegetarian restaurants ) , many of which stay open well into the small hours . prices are cheap by northern european standards . for those wishing to combine the conference with a visit to gent and the surrounding area , you may like to know that a train can take you in less than an hour to bruges , brussels , antwerp or the belgian coast . you can even get into the ardennes or to paris within a few hours . registration / queries to attend the conference , fill in the registration form and return it , with payment , by 1st may . confirmation of registration and details of arrangements will be sent in the third circular to those who have registered , but if you have any enquiries , contact jim o'driscoll or stef slembrouck at seminarie voor engelse taalkunde , universiteit gent , rozier 44 , b-9000 gent , belgium ( tel : 32-91 - 64 . 37 . 88 / 89 / 90 ; fax : 32-91 - 64 . 41 . 95 ; e-mail pala92 @ engllang . rug . ac . be ) . 
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * pala 92 gent university registration form surname _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ first name ( s ) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ address _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ affiliation _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ i will participate in the conference and enclose a eurocheque ( or have arranged direct transfer to the pala account in belgium ) to cover : ( tick as appropriate ) pala member conference fee ( bf 1000 ) _ _ _ _ _ _ non-member conference fee ( bf 2000 ) _ _ _ _ _ _ student conference fee ( bf 600 ) _ _ _ _ _ _ dinner on 9th september ( bf 500 ) _ _ _ _ _ _ dinner on 10th september ( bf 500 ) _ _ _ _ _ _ accommodation for tue 8th september ( bf 525 ) _ _ _ _ _ _ accommodation for wed 9th september ( bf 525 ) _ _ _ _ _ _ accommodation for thu 10th september ( bf 525 ) _ _ _ _ _ _ accommodation for fri 11th september ( bf 525 ) _ _ _ _ _ _ accommodation for ( specify ) ( bf ) _ _ _ _ _ _ fee for international money transfer or cheque other than eurocheques * ( bf 300 ) _ _ _ _ _ _ i therefore enclose ( or have transferred ) a total of bf _ _ _ _ _ _ i would like lacto-vegetarian / vegan food for the dinner ( s ) i have booked _ _ _ _ _ signature _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ please return to pala conference 1992 , seminarie voor engelse taalkunde , universiteit gent , rozier 44 , b-9000 gent , belgium ( pala9 @ engllang . rug . ac . be ) . the final date for registration is 1st may 1992 . _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ * . note that all payments must be made in belgian francs . cheques should be made payable to ' pala conference 1992 ' . 
a single eurocheque must not be of more than bf 7 , 000 . international money transfers should be sent via ' swift ' , quoting our bank 's swift number ( bbru be bb 900 ) and our account number : bbl 390-0959358 - 83 . if you have any problems with either method of payment , please contact the organizers ."
4,False,3-378msg1.txt,Subject: query : causatives in korean could anyone point me to any books and articles about causative constructions in korean ? please send an e-mail directly to me . thanks you ! hiromi morikawa hiromi @ psych . stanford . edu


In [525]:
#2
emails = emails.dropna(subset=['spam', 'message'])[["spam", "message"]]
emails.shape

(2893, 2)

## (15pt) Create Document-term matrix (DTM)

The first serious step is to create the document-term matrix (DTM).
The DTM is simply a set of numeric indicators for selected words: does this email
contain the word (1) or not (0)?  But before we get there, we have to
decide on the words.


1. (2pt) Choose 10+ words that might help distinguish between spam and non-spam.  Include these four: "viagra", "deadline", "million", and "and".  Choose more words yourself (you may want to return here and reconsider your choice later).


2. (10pt) Convert your messages into a DTM.  We do not use the full 60k-word DTM here but only a baby-DTM of the 10 words you picked above. You may add the DTM columns to the original data frame, or keep them in a separate structure.

Creating the DTM involves checking, for every email in the data, whether the message contains the word. You can loop over emails and check each one individually, but pandas string methods make life much easier.  You will want to do case-insensitive matching, so that both upper- and lower-case occurrences count.  You may consider something like this:

```
for w in list_of_words:
    emails[w] = emails.message.str.lower().str.contains(w)
```

  Note: It is more intuitive to work with your data if you
  convert the logical values returned by contains to numbers.
  
  
  
3. (3pt) Split your work data (i.e. the DTM) and target (the spam indicator) into training and validation chunks (80/20 is a good split).
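As a rough sketch of step 2, the loop below builds a 0/1 baby-DTM on a toy data frame. The three messages and the word list are made up for illustration. Passing `regex=False` and calling `.astype(int)` are choices made here, not requirements of the assignment: the first treats each word as a literal string rather than a regular expression, the second stores the logical values as numbers.

```python
import pandas as pd

# Toy stand-in for the real data; the messages are made up.
emails = pd.DataFrame({
    "spam": [True, False, True],
    "message": ["Win a MILLION dollars , free !",
                "Deadline extended and call for papers",
                "free offer : $ 5 . 00"],
})
words = ["million", "deadline", "free", "and"]

lower = emails.message.str.lower()   # lower-case once, outside the loop
for w in words:
    # regex=False: match w as a literal substring
    # astype(int): store True/False as 1/0
    emails[w] = lower.str.contains(w, regex=False).astype(int)

dtm = emails[words]
print(dtm)
```

Note that this is substring matching, so a short word such as "and" also matches inside longer words like "band" or "candidate".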

In [526]:
# code goes here 
#1
wrds = ['viagra', 'deadline', 'million', 'and', 'free', 'urgent', 'congratulations', 'cash', 'winner', 'offer']

In [527]:
#2
# 1 if the lower-cased message contains the word as a substring, else 0
for w in wrds:
    emails[w] = emails.message.str.lower().str.contains(w, regex=False).astype(int)

In [528]:
dtm = emails[wrds]
target = emails.spam
X_train, X_test, y_train, y_test = train_test_split(dtm, target, test_size = 0.20)

## (80pt) Estimate and validate

Now that the preparatory work is done, it is time to
dive into the real thing.  Let's review Bayes' theorem once
more.  We want to estimate the probability that an email is spam, given that it
contains a certain word:

$Pr(category = S|w = 1) = \frac{Pr(w=1|category = S) \cdot Pr(category=S)}{Pr(w=1)}$.


In order to compute this probability, we need to calculate some other
probabilities: 

* $Pr(category=S)$ --> Probability of spam in data

* $Pr(category=NS)$ --> Probability of non-spam in data

* $Pr(w=1)$ --> Probability the word is seen in messages

* $Pr(w=0)$ --> Probability the word is not seen in messages

* $Pr(w=1|category = S)$ --> Probability the word is seen in messages that are spam

* $Pr(w=1|category = NS)$ --> Probability the word is seen in messages that are not spam

....


But it turns out we are still not done with preparations.  You need to compute
quite a few different probabilities below, including $Pr(category=S)$, $Pr(category=NS)$, $Pr(w=1)$, $Pr(w=0)$, $Pr(w=1|category = S)$, $Pr(w=0|category = S)$, $Pr(w=1|category = NS)$, $Pr(w=0|category = NS)$.
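These quantities are not independent of each other. As a reminder, and as a handy sanity check for your numbers, the law of total probability (a standard identity, not specific to this data set) ties the normalizer to the priors and the conditional probabilities:

$$
Pr(w=1) = Pr(w=1|category = S) \cdot Pr(category=S) + Pr(w=1|category = NS) \cdot Pr(category=NS)
$$

If the right-hand side computed from your training data does not match the $Pr(w=1)$ you counted directly, something is likely off in one of the pieces.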


1. (2pt) Design a scheme for your variable names that describes these probabilities so that a) you understand what they mean; and b) others (including your grader) will understand them too! Hint: you may get some ideas from the [Python notes](https://faculty.washington.edu/otoomet/machinelearning-py/python.html#base-language) in Section 2.3, Base Language.

The first task is to compute these probabilities.
Use only training data for this task.

2. (4pt) Compute the priors, the unconditional probabilities for an email being spam and non-spam, $Pr(category=S)$ and $Pr(category=NS)$.  These probabilities are based on the spam variable alone, not on the text.


The next tasks involve computing the following probabilities for each
word in the list of 10 you picked above.
I recommend avoiding unnecessary complexity:
just write a loop over the words, compute the
answers, and print the word and the corresponding results there.  



3. (4pt) For each word $w$, compute the normalizers, $Pr(w=1)$ and $Pr(w=0)$.
  
  Hint: $Pr(million = 1) = 0.0484$.  But note that this value
  (and those in the following hints) depends on your random training/validation split!
  
  
4. (7pt) For each word $w$, compute $Pr(w=1|category = S)$ and $Pr(w=1|category = NS)$.  These probabilities are based on both the spam-variable and on the DTM component that corresponds to the word $w$.
  
  Hint: $Pr(million = 1|category = S) = 0.252$
  
  
5. (5pt) Finally, compute the probabilities of interest, $Pr(category = S|w = 1)$ and $Pr(category = S|w = 0)$.  Compute this value using Bayes theorem, not directly by counting! 
  
  For the check, you may also compute
  $Pr(category = NS|w = 1)$ and $Pr(category = NS|w = 0)$
  
  Hint: $Pr(category = S|million = 1) = 0.843$.  But
  note this number depends on your random training/validation split!


6. (6pt)  Which of these probabilities have to sum to one? (E.g. $Pr(category = 1) + Pr(category = 0) = 1$.) Which ones do not?  Explain!
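
A minimal sketch of the Bayes step in task 5, using illustrative inputs rather than values from the data. The function name `posterior` and all numbers below are assumptions made for the example.

```python
def posterior(likelihood, prior, normalizer):
    """Bayes' theorem: Pr(category | w) = Pr(w | category) * Pr(category) / Pr(w)."""
    return likelihood * prior / normalizer

# Illustrative inputs (hypothetical, not taken from the corpus).
pr_s, pr_ns = 0.17, 0.83             # priors
pr_w1_s, pr_w1_ns = 0.25, 0.01       # Pr(w=1|S), Pr(w=1|NS)

# The normalizer follows from the law of total probability.
pr_w1 = pr_w1_s * pr_s + pr_w1_ns * pr_ns

pr_s_w1 = posterior(pr_w1_s, pr_s, pr_w1)      # Pr(S|w=1)
pr_ns_w1 = posterior(pr_w1_ns, pr_ns, pr_w1)   # Pr(NS|w=1)

print(round(pr_s_w1, 3), round(pr_ns_w1, 3))
```

The two posteriors are conditioned on the same observation, so they add up to one, whereas `pr_w1_s + pr_w1_ns` has no reason to.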

---
Now we are done with the estimator.  Your fitted model is completely
described by these probabilities.  Let's now turn to prediction, using
your validation data.  Note that we are still inside the loop over
each word $w$!

9. (8pt) For each email in your validation set, predict whether it is spam or non-spam.  Hint: you should check if it contains the word $w$ and use the appropriate probability, $Pr(category = S|w = 1)$ or $Pr(category = S|w = 0)$.


10. (5pt) Print the resulting confusion matrix and compute accuracy, precision and recall.


11. (5pt) Which steps above constitute model training?  In which steps do you use the trained model?  What is a trained model in this case? Explain! 
  
  Hint: a trained model is all you need to make predictions.
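
The prediction rule for a single word can be sketched as follows. The `model` dictionary and the `validation` list are hypothetical; in the assignment the two posteriors come from the training data and the flags from the validation DTM.

```python
def predict_spam(contains_word, pr_s_given_w1, pr_s_given_w0, threshold=0.5):
    """Pick the posterior that matches the email, then compare it to the threshold."""
    p = pr_s_given_w1 if contains_word else pr_s_given_w0
    return p > threshold

# Hypothetical trained model for one word: just the two posteriors.
model = {"pr_s_given_w1": 0.83, "pr_s_given_w0": 0.13}

# Does each validation email contain the word? (made-up flags)
validation = [True, False, False, True]
predictions = [predict_spam(c, **model) for c in validation]
print(predictions)
```

Note that nothing from the training data other than the two posteriors is needed at prediction time.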

---
Now it is time to look at your results a little bit closer.

12. (4pt) Comment on the overall performance of the model: what do accuracy, precision and recall look like?


13. (8pt) Explain why you see very low recall while the other indicators do not look that bad.


14. (8pt) Explain why some words work well and others not: 
  * why does ''million'' improve accuracy?
  * why does ''viagra'' not work?
  * why does ''deadline'' not work?
  * why does ''and'' not work?

  Hint: You may just check in which emails these words occur, and
  how frequently.  These are all different reasons!
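
One way to follow this hint is to tabulate, for each word, the share of spam and non-spam messages that contain it. The sketch below does this on made-up messages; `emails` and `word_share_by_category` are hypothetical names, and splitting on whitespace is a simplifying assumption about tokenization.

```python
from collections import Counter

# Hypothetical (is_spam, message) pairs standing in for the training data.
emails = [
    (True, "win a million dollars and more"),
    (True, "free cash offer"),
    (False, "conference deadline extended and call for papers"),
    (False, "the workshop and the deadline"),
]

def word_share_by_category(rows, word):
    """Return {is_spam: share of messages in that category that contain word}."""
    totals, hits = Counter(), Counter()
    for is_spam, message in rows:
        totals[is_spam] += 1
        hits[is_spam] += word in message.split()  # True counts as 1
    return {cat: hits[cat] / totals[cat] for cat in totals}

shares = {w: word_share_by_category(emails, w) for w in ["million", "deadline", "and"]}
for word, by_cat in shares.items():
    print(word, by_cat)
```

A word is useful for detecting spam only if the two shares differ a lot and the word is frequent enough in spam to be seen in new emails.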
  
---
Finally, let's add Laplace smoothing to this model.  One can imagine
Laplace smoothing as two additional ''ghost'' observations, one spam
and one non-spam.  Both of these ghost observations contain every
single word in our DTM.  See also [Lecture Notes](https://faculty.washington.edu/otoomet/machineLearning.pdf), Ch 7.3.2 ''Smoothing: how to compute probabilities with too
few data'', page 263.

Laplace smoothing does not add anything here but it is a crucial
tool when we move to Naive Bayes later.
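
A minimal sketch of the ghost-observation idea, working with counts rather than probabilities. The function name and the counts are hypothetical. Under the reading that each ghost email contains every word, the counts for $w = 1$ grow by one per category, and the Bayes formula with smoothed components reduces to the ratio computed here.

```python
def posterior_spam_given_word(n_word_spam, n_word_nonspam, ghosts=1):
    """Pr(category=S | w=1) from counts, adding `ghosts` extra emails per
    category that contain the word (one ghost spam, one ghost non-spam)."""
    spam_hits = n_word_spam + ghosts
    all_hits = n_word_spam + n_word_nonspam + 2 * ghosts
    return spam_hits / all_hits

# Hypothetical counts: a word seen in 1 spam and 0 non-spam training emails.
unsmoothed = posterior_spam_given_word(1, 0, ghosts=0)   # exactly 1.0
smoothed = posterior_spam_given_word(1, 0, ghosts=1)     # (1 + 1) / (1 + 2)

# Hypothetical counts: a word seen in 0 spam and 339 non-spam training emails.
rare_in_spam = posterior_spam_given_word(0, 339, ghosts=1)  # 1 / 341, no longer 0

print(unsmoothed, smoothed, rare_in_spam)
```

With many observations the ghosts barely move the estimate; with very few they pull it away from the extremes of 0 and 1.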

15. (5pt) Add such smoothing to the model.  You can either literally add two such lines of data, or alternatively manipulate the way you compute the probabilities.


16. (5pt) Repeat the tasks above: compute the probabilities, do predictions, compute the accuracy, precision, recall for all words.  


17. (4pt) Comment on the results.  Does smoothing improve the overall performance? 

In [529]:
# code goes here 
#1
pr_s = y_train[y_train == True].count() / y_train.shape[0] # probability of spam in data, priors
pr_ns = y_train[y_train == False].count() / y_train.shape[0] # probability of non-spam in data, priors

pr_viagra = 0 # probability the word 'viagra' is seen in messages
pr_deadline = 0 # probability the word 'deadline' is seen in messages
pr_million = 0 # probability the word 'million' is seen in messages
pr_and = 0 # probability the word 'and' is seen in messages
pr_free = 0 # probability the word 'free' is seen in messages  
pr_urgent = 0 # probability the word 'urgent' is seen in messages
pr_congratulations = 0 # probability the word 'congratulations' is seen in messages
pr_cash = 0 # probability the word 'cash' is seen in messages
pr_winner = 0 # probability the word 'winner' is seen in messages
pr_offer = 0 # probability the word 'offer' is seen in messages

pr_n_viagra = 0 # probability the word 'viagra' is not seen in messages
pr_n_deadline = 0 # probability the word 'deadline' is not seen in messages
pr_n_million = 0 # probability the word 'million' is not seen in messages
pr_n_and = 0 # probability the word 'and' is not seen in messages
pr_n_free = 0 # probability the word 'free' is not seen in messages
pr_n_urgent = 0 # probability the word 'urgent' is not seen in messages
pr_n_congratulations = 0 # probability the word 'congratulations' is not seen in messages
pr_n_cash = 0 # probability the word 'cash' is not seen in messages
pr_n_winner = 0 # probability the word 'winner' is not seen in messages
pr_n_offer = 0 # probability the word 'offer' is not seen in messages

pr_viagra_s = 0 # probability the word 'viagra' is seen in messages that are spam
pr_deadline_s = 0 # probability the word 'deadline' is seen in messages that are spam
pr_million_s = 0 # probability the word 'million' is seen in messages that are spam
pr_and_s = 0 # probability the word 'and' is seen in messages that are spam
pr_free_s = 0 # probability the word 'free' is seen in messages that are spam
pr_urgent_s = 0 # probability the word 'urgent' is seen in messages that are spam
pr_congratulations_s = 0 # probability the word 'congratulations' is seen in messages that are spam
pr_cash_s = 0 # probability the word 'cash' is seen in messages that are spam
pr_winner_s = 0 # probability the word 'winner' is seen in messages that are spam
pr_offer_s = 0 # probability the word 'offer' is seen in messages that are spam

pr_viagra_ns = 0 # probability the word 'viagra' is seen in messages that are not spam
pr_deadline_ns = 0 # probability the word 'deadline' is seen in messages that are not spam
pr_million_ns = 0 # probability the word 'million' is seen in messages that are not spam
pr_and_ns = 0 # probability the word 'and' is seen in messages that are not spam
pr_free_ns = 0 # probability the word 'free' is seen in messages that are not spam
pr_urgent_ns = 0 # probability the word 'urgent' is seen in messages that are not spam
pr_congratulations_ns = 0 # probability the word 'congratulations' is seen in messages that are not spam
pr_cash_ns = 0 # probability the word 'cash' is seen in messages that are not spam
pr_winner_ns = 0 # probability the word 'winner' is seen in messages that are not spam
pr_offer_ns = 0 # probability the word 'offer' is seen in messages that are not spam

pr_n_viagra_s = 0 # probability the word 'viagra' is not seen in messages that are spam
pr_n_deadline_s = 0 # probability the word 'deadline' is not seen in messages that are spam
pr_n_million_s = 0 # probability the word 'million' is not seen in messages that are spam
pr_n_and_s = 0 # probability the word 'and' is not seen in messages that are spam
pr_n_free_s = 0 # probability the word 'free' is not seen in messages that are spam
pr_n_urgent_s = 0 # probability the word 'urgent' is not seen in messages that are spam
pr_n_congratulations_s = 0 # probability the word 'congratulations' is not seen in messages that are spam
pr_n_cash_s = 0 # probability the word 'cash' is not seen in messages that are spam
pr_n_winner_s = 0 # probability the word 'winner' is not seen in messages that are spam
pr_n_offer_s = 0 # probability the word 'offer' is not seen in messages that are spam

pr_n_viagra_ns = 0 # probability the word 'viagra' is not seen in messages that are not spam
pr_n_deadline_ns = 0 # probability the word 'deadline' is not seen in messages that are not spam
pr_n_million_ns = 0 # probability the word 'million' is not seen in messages that are not spam
pr_n_and_ns = 0 # probability the word 'and' is not seen in messages that are not spam
pr_n_free_ns = 0 # probability the word 'free' is not seen in messages that are not spam
pr_n_urgent_ns = 0 # probability the word 'urgent' is not seen in messages that are not spam
pr_n_congratulations_ns = 0 # probability the word 'congratulations' is not seen in messages that are not spam
pr_n_cash_ns = 0 # probability the word 'cash' is not seen in messages that are not spam
pr_n_winner_ns = 0 # probability the word 'winner' is not seen in messages that are not spam
pr_n_offer_ns = 0 # probability the word 'offer' is not seen in messages that are not spam

# probability the word is seen in messages
pr_w = [pr_viagra, pr_deadline, pr_million, pr_and, pr_free,  
         pr_urgent, pr_congratulations, pr_cash, pr_winner, pr_offer]

# probability the word is not seen in messages
pr_n_w = [pr_n_viagra, pr_n_deadline, pr_n_million, pr_n_and, pr_n_free,  
         pr_n_urgent, pr_n_congratulations, pr_n_cash, pr_n_winner, pr_n_offer]

# probability the word is seen in messages that are spam
pr_ws = [pr_viagra_s, pr_deadline_s, pr_million_s, pr_and_s, pr_free_s, pr_urgent_s,
         pr_congratulations_s, pr_cash_s, pr_winner_s, pr_offer_s]

# probability the word is seen in messages that are not spam
pr_wns = [pr_viagra_ns, pr_deadline_ns, pr_million_ns, pr_and_ns, pr_free_ns, pr_urgent_ns,
          pr_congratulations_ns, pr_cash_ns, pr_winner_ns, pr_offer_ns]

# probability the word is not seen in messages that are spam
pr_nws = [pr_n_viagra_s, pr_n_deadline_s, pr_n_million_s, pr_n_and_s, pr_n_free_s, pr_n_urgent_s,
         pr_n_congratulations_s, pr_n_cash_s, pr_n_winner_s, pr_n_offer_s]

# probability the word is not seen in messages that are not spam
pr_nwns = [pr_n_viagra_ns, pr_n_deadline_ns, pr_n_million_ns, pr_n_and_ns, pr_n_free_ns, pr_n_urgent_ns,
          pr_n_congratulations_ns, pr_n_cash_ns, pr_n_winner_ns, pr_n_offer_ns]

X_train["spam"] = y_train
s_df = X_train[X_train.spam == True] # spam dataset
ns_df = X_train[X_train.spam == False] # non-spam dataset

for i in range(0, 10):
    pr_w[i] = X_train[wrds[i]][X_train[wrds[i]] == True].count() / X_train.shape[0]
    pr_n_w[i] = X_train[wrds[i]][X_train[wrds[i]] == False].count() / X_train.shape[0]
    pr_ws[i] = s_df[wrds[i]][s_df[wrds[i]] == True].count() / X_train.shape[0] / pr_s
    pr_wns[i] = ns_df[wrds[i]][ns_df[wrds[i]] == True].count() / X_train.shape[0] / pr_ns
    pr_nws[i] = s_df[wrds[i]][s_df[wrds[i]] == False].count() / X_train.shape[0] / pr_s
    pr_nwns[i] = ns_df[wrds[i]][ns_df[wrds[i]] == False].count() / X_train.shape[0] / pr_ns

In [530]:
#2
print(pr_s, pr_ns) # probability of spam in data, probability of non-spam in data,

0.16853932584269662 0.8314606741573034


In [531]:
#3
print("Pr(w=1):")
print()
for i in range(0, 10):
    print(wrds[i] + ": " + str(pr_w[i]))
    
print()

print("Pr(w=0):")
print()
for i in range(0, 10):
    print(wrds[i] + ": " + str(pr_n_w[i]))

Pr(w=1):

viagra: 0.000432152117545376
deadline: 0.14649956784788246
million: 0.04969749351771824
and: 0.9403630077787382
free: 0.18366464995678478
urgent: 0.004753673292999135
congratulations: 0.00216076058772688
cash: 0.04364736387208297
winner: 0.010371650821089023
offer: 0.14866032843560933

Pr(w=0):

viagra: 0.9995678478824547
deadline: 0.8535004321521176
million: 0.9503025064822818
and: 0.059636992221261884
free: 0.8163353500432152
urgent: 0.9952463267070009
congratulations: 0.9978392394122731
cash: 0.956352636127917
winner: 0.989628349178911
offer: 0.8513396715643907


In [532]:
#4
print("Pr(w=1|category=S):")
print()
for i in range(0, 10):
    print(wrds[i] + ": " + str(pr_ws[i]))
    
print()

print("Pr(w=1|category=NS):")
print()
for i in range(0, 10):
    print(wrds[i] + ": " + str(pr_wns[i]))

Pr(w=1|category=S):

viagra: 0.002564102564102564
deadline: 0.0
million: 0.24615384615384614
and: 0.9205128205128206
free: 0.6256410256410256
urgent: 0.007692307692307692
congratulations: 0.007692307692307692
cash: 0.18717948717948718
winner: 0.048717948717948725
offer: 0.3820512820512821

Pr(w=1|category=NS):

viagra: 0.0
deadline: 0.1761954261954262
million: 0.009875259875259876
and: 0.9443866943866943
free: 0.09407484407484408
urgent: 0.004158004158004158
congratulations: 0.0010395010395010396
cash: 0.014553014553014552
winner: 0.002598752598752599
offer: 0.10135135135135134


In [533]:
#5
print("Pr(category=S|w=1):")
print()
for i in range(0, 10):
    pr = pr_ws[i]*pr_s/pr_w[i]
    print(wrds[i] + ": " + str(pr))
    
print()

print("Pr(category=S|w=0):")
print()
for i in range(0, 10):
    pr = pr_nws[i]*pr_s/pr_n_w[i]
    print(wrds[i] + ": " + str(pr))
    
print()

print("Pr(category=NS|w=1):")
print()
for i in range(0, 10):
    pr = pr_wns[i]*pr_ns/pr_w[i]
    print(wrds[i] + ": " + str(pr))
    
print()

print("Pr(category=NS|w=0):")
print()
for i in range(0, 10):
    pr = pr_nwns[i]*pr_ns/pr_n_w[i]
    print(wrds[i] + ": " + str(pr))

Pr(category=S|w=1):

viagra: 1.0
deadline: 0.0
million: 0.8347826086956521
and: 0.16498161764705882
free: 0.5741176470588235
urgent: 0.27272727272727276
congratulations: 0.5999999999999999
cash: 0.7227722772277229
winner: 0.7916666666666667
offer: 0.4331395348837209

Pr(category=S|w=0):

viagra: 0.1681798530047557
deadline: 0.19746835443037974
million: 0.13369713506139155
and: 0.2246376811594203
free: 0.07728957120169402
urgent: 0.16804168475901
congratulations: 0.16760502381983544
cash: 0.14324446452779033
winner: 0.16200873362445414
offer: 0.12233502538071066

Pr(category=NS|w=1):

viagra: 0.0
deadline: 1.0
million: 0.16521739130434784
and: 0.8350183823529411
free: 0.42588235294117655
urgent: 0.7272727272727274
congratulations: 0.4
cash: 0.27722772277227725
winner: 0.20833333333333337
offer: 0.5668604651162791

Pr(category=NS|w=0):

viagra: 0.8318201469952443
deadline: 0.8025316455696202
million: 0.8663028649386084
and: 0.7753623188405797
free: 0.922710428798306
urgent: 0.83195831524

In [534]:
#6
print("Pr(category=1):")
print()
# Pr(category=NS|w=0) + Pr(category=S|w=0): the two posteriors for the same observation (w=0)
for i in range(0, 10): 
    prw = pr_nwns[i] * pr_ns / pr_n_w[i]
    prnw = pr_nws[i] * pr_s / pr_n_w[i]
    print(wrds[i] + ": " + str(prw + prnw))

print()
    
print("Pr(category=0):")
print()
# the same two posteriors, Pr(category=S|w=0) + Pr(category=NS|w=0), added in the opposite order
for i in range(0, 10):
    prw = pr_nws[i] * pr_s / pr_n_w[i]
    prnw = pr_nwns[i] * pr_ns / pr_n_w[i]
    print(wrds[i] + ": " + str(prw + prnw))

Pr(category=1):

viagra: 1.0
deadline: 1.0
million: 1.0
and: 1.0
free: 1.0
urgent: 1.0
congratulations: 1.0
cash: 1.0
winner: 0.9999999999999999
offer: 1.0

Pr(category=0):

viagra: 1.0
deadline: 1.0
million: 1.0
and: 1.0
free: 1.0
urgent: 1.0
congratulations: 1.0
cash: 1.0
winner: 0.9999999999999999
offer: 1.0


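Probabilities that cover all possible outcomes under the same condition have to sum to one. This holds for the priors, $Pr(category = S) + Pr(category = NS) = 1$; for the normalizers, $Pr(w=1) + Pr(w=0) = 1$; for the conditionals within one category, $Pr(w=1|category = S) + Pr(w=0|category = S) = 1$ (and likewise for NS); and for the posteriors given the same observation, $Pr(category = S|w = 1) + Pr(category = NS|w = 1) = 1$ and $Pr(category = S|w = 0) + Pr(category = NS|w = 0) = 1$. The sums printed above confirm the last case; the values of 0.9999999999999999 are floating-point rounding. Probabilities conditioned on different events do not have to sum to one. For example, for "million", $Pr(w=1|category = S) + Pr(w=1|category = NS) \approx 0.246 + 0.010$, and $Pr(category = S|w = 1) + Pr(category = S|w = 0) \approx 0.835 + 0.134$.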
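
The reason the two posteriors for a fixed observation sum to one can be written out with the law of total probability. This is a standard identity, stated here in the notation used above:

$$
\begin{aligned}
Pr(category = S|w = 1) + Pr(category = NS|w = 1)
  &= \frac{Pr(w=1|category = S)\,Pr(category=S) + Pr(w=1|category = NS)\,Pr(category=NS)}{Pr(w=1)} \\
  &= \frac{Pr(w=1)}{Pr(w=1)} = 1
\end{aligned}
$$

The same argument with $w = 0$ in place of $w = 1$ covers the other pair of posteriors.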

In [535]:
#9
y_test = pd.DataFrame(y_test, columns=['spam'])
for i in range(0, 10):
    X_test["pr_" + wrds[i]] = np.where(X_test[wrds[i]] == True, (pr_ws[i]*pr_s/pr_w[i]), (pr_nws[i]*pr_s/pr_n_w[i]))
    X_test["new_spam + " + wrds[i]] = np.where(X_test["pr_" + wrds[i]] > 0.5, True, False)

In [536]:
#10
ave_accu = 0
ave_prc = 0
ave_rcll = 0
for i in range(0, 10):
    cm = confusion_matrix(y_test.spam, X_test["new_spam + " + wrds[i]])
    accu = accuracy_score(y_test.spam, X_test["new_spam + " + wrds[i]])
    prc = precision_score(y_test.spam, X_test["new_spam + " + wrds[i]], zero_division=0)
    rcll = recall_score(y_test.spam, X_test["new_spam + " + wrds[i]])
    ave_accu = ave_accu + accu
    ave_prc = ave_prc + prc
    ave_rcll = ave_rcll + rcll
    
    print(wrds[i] + ":")
    print("Confusion Matrix: " + str(cm))
    print("Accuracy: " + str(accu))
    print("Precision: " + str(prc))
    print("Recall: " + str(rcll))
    print()

print("Average Performance of Accuracy: " + str(ave_accu / len(wrds)))
print("Average Performance of Precision: " + str(ave_prc / len(wrds)))
print("Average Performance of Recall: " + str(ave_rcll / len(wrds)))

viagra:
Confusion Matrix: [[488   0]
 [ 91   0]]
Accuracy: 0.842832469775475
Precision: 0.0
Recall: 0.0

deadline:
Confusion Matrix: [[488   0]
 [ 91   0]]
Accuracy: 0.842832469775475
Precision: 0.0
Recall: 0.0

million:
Confusion Matrix: [[483   5]
 [ 71  20]]
Accuracy: 0.8687392055267703
Precision: 0.8
Recall: 0.21978021978021978

and:
Confusion Matrix: [[488   0]
 [ 91   0]]
Accuracy: 0.842832469775475
Precision: 0.0
Recall: 0.0

free:
Confusion Matrix: [[448  40]
 [ 37  54]]
Accuracy: 0.8670120898100173
Precision: 0.574468085106383
Recall: 0.5934065934065934

urgent:
Confusion Matrix: [[488   0]
 [ 91   0]]
Accuracy: 0.842832469775475
Precision: 0.0
Recall: 0.0

congratulations:
Confusion Matrix: [[488   0]
 [ 89   2]]
Accuracy: 0.846286701208981
Precision: 1.0
Recall: 0.02197802197802198

cash:
Confusion Matrix: [[482   6]
 [ 74  17]]
Accuracy: 0.8618307426597582
Precision: 0.7391304347826086
Recall: 0.18681318681318682

winner:
Confusion Matrix: [[487   1]
 [ 86   5]]
Accuracy: 0

In [537]:
#11

Steps 2 through 5 constitute model training: using only the training data, they compute the priors, the normalizers, the conditional probabilities, and finally the posteriors $Pr(category = S|w = 1)$ and $Pr(category = S|w = 0)$. Steps 9 and 10 use the trained model: each validation email is assigned the posterior that matches whether it contains the word, and it is classified as spam if that posterior is above 0.5 and as non-spam otherwise. The trained model for a word is therefore just this pair of posteriors (or, equivalently, the priors, normalizers and conditional probabilities they are built from), since that is all that is needed to make predictions.

In [538]:
#12

The average accuracy across the ten words is about 85%, which looks high but is barely above the 84.3% (488 of 579) that comes from labelling every validation email as non-spam. Precision is uneven: it is reported as 0 for the words that never predict spam ("viagra", "deadline", "and", "urgent"), because there are no predicted positives and `zero_division=0` is set, while it ranges from 0.57 ("free") to 1.0 ("congratulations") for the words that do predict spam. Recall is low throughout: even the best word, "free", catches only 59% of the spam, and "million" and "cash" catch about 20%.

In [539]:
#13

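Recall is TP / (TP + FN). A single word can flag an email as spam only if the email contains that word and $Pr(category = S|w = 1) > 0.5$. Most spam emails do not contain any one given word (for example, "million" appears in only about a quarter of spam), so they receive $Pr(category = S|w = 0)$, which is below 0.5 for every word, and are predicted non-spam. These missed spam emails are false negatives, which pushes recall down. Accuracy stays high because about 84% of the emails are non-spam and nearly all of them are predicted correctly, and precision can stay high because the few emails that do get flagged are mostly spam.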
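
To make the contrast concrete, the sketch below recomputes the three metrics by hand from the confusion matrix printed above for "million". It assumes the scikit-learn layout, with rows for actual and columns for predicted classes, non-spam first.

```python
# Confusion matrix printed above for "million":
# [[483   5]
#  [ 71  20]]
tn, fp = 483, 5    # actual non-spam: predicted non-spam, predicted spam
fn, tp = 71, 20    # actual spam:     predicted non-spam, predicted spam

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
baseline = (tn + fp) / total   # accuracy of always predicting non-spam

print(accuracy, precision, recall, baseline)
```

Only 20 of the 91 spam emails are caught, so recall is about 0.22, while accuracy is dominated by the 483 correctly classified non-spam emails.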

In [540]:
#14

#### Why does ''million'' improve accuracy?
- The word "million" appears in about 25% of spam emails but in only about 1% of non-spam emails, so $Pr(category = S|million = 1) \approx 0.83$ is well above 0.5. In the validation data it flags 25 emails, 20 of them correctly, which lifts accuracy from the 84.3% baseline to 86.9%.

#### Why does ''viagra'' not work?
- The word "viagra" is too rare. It appears in only about 0.04% of the training emails, all of them spam, so $Pr(category = S|viagra = 1) = 1$, but no validation email is flagged by it. Every email is therefore predicted non-spam, and both precision and recall are 0.

#### Why does ''deadline'' not work?
- The word "deadline" is common (about 15% of training emails) but it occurs only in non-spam emails, so $Pr(category = S|deadline = 1) = 0$. It is a good indicator of non-spam, but emails without the word get $Pr(category = S|deadline = 0) \approx 0.20$, still below 0.5, so the model never predicts spam.

#### Why does ''and'' not work?
- The word "and" occurs in about 94% of all emails, at nearly the same rate in spam (92%) and non-spam (94%). Seeing it therefore carries almost no information: $Pr(category = S|and = 1) \approx 0.16$ is essentially the prior, and $Pr(category = S|and = 0) \approx 0.22$ is also below 0.5, so the model never predicts spam.

In [541]:
#15, #16
alpha = 1

print("Pr(category=S|w=1):")
print()
for i in range(0, 10):
    pr = (((pr_ws[i]*pr_s) + alpha)/(pr_w[i] + 2 * alpha))
    print(wrds[i] + ": " + str(pr))
    
print()

print("Pr(category=S|w=0):")
print()
for i in range(0, 10):
    pr = (((pr_nws[i]*pr_s) + alpha)/(pr_n_w[i] + 2 * alpha))
    print(wrds[i] + ": " + str(pr))
    
print()

print("Pr(category=NS|w=1):")
print()
for i in range(0, 10):
    pr = (((pr_wns[i]*pr_ns) + alpha)/(pr_w[i] + 2 * alpha))
    print(wrds[i] + ": " + str(pr))
    
print()

print("Pr(category=NS|w=0):")
print()
for i in range(0, 10):
    pr = (((pr_nwns[i]*pr_ns) + alpha)/(pr_n_w[i] + 2 * alpha))
    print(wrds[i] + ": " + str(pr))
    
print()

for i in range(0, 10):
    X_test["pr_" + wrds[i]] = np.where(X_test[wrds[i]] == True, (((pr_ws[i]*pr_s) + alpha)/(pr_w[i] + 2 * alpha)), 
                                                                   (((pr_nws[i]*pr_s) + alpha)/(pr_n_w[i] + 2 * alpha)))
    X_test["new_spam + " + wrds[i]] = np.where(X_test["pr_" + wrds[i]] > 0.5, True, False)
    
    
ave_accu = 0
ave_prc = 0
ave_rcll = 0
for i in range(0, 10):
    cm = confusion_matrix(y_test.spam, X_test["new_spam + " + wrds[i]])
    accu = accuracy_score(y_test.spam, X_test["new_spam + " + wrds[i]])
    prc = precision_score(y_test.spam, X_test["new_spam + " + wrds[i]], zero_division=0)
    rcll = recall_score(y_test.spam, X_test["new_spam + " + wrds[i]])
    ave_accu = ave_accu + accu
    ave_prc = ave_prc + prc
    ave_rcll = ave_rcll + rcll

    
    print(wrds[i] + ":")
    print("Confusion Matrix: " + str(cm))
    print("Accuracy: " + str(accu))
    print("Precision: " + str(prc))
    print("Recall: " + str(rcll))
    print()

print("Average Performance of Accuracy: " + str(ave_accu / len(wrds)))
print("Average Performance of Precision: " + str(ave_prc / len(wrds)))
print("Average Performance of Recall: " + str(ave_rcll / len(wrds)))

Pr(category=S|w=1):

viagra: 0.5001080146899979
deadline: 0.4658747735051339
million: 0.5081172253847775
and: 0.3928571428571429
free: 0.5062339204433011
urgent: 0.4994610907523173
congratulations: 0.5001079214331966
cash: 0.5047578769295834
winner: 0.5015047291487532
offer: 0.49537409493161705

Pr(category=S|w=0):

viagra: 0.38942515487681884
deadline: 0.4095108284113282
million: 0.38201259704116014
and: 0.4920268569030634
free: 0.37747429798987264
urgent: 0.38969845621122495
congratulations: 0.3893613954158858
cash: 0.38459289577547145
winner: 0.3881179531656548
offer: 0.38723855713852684

Pr(category=NS|w=1):

viagra: 0.4998919853100022
deadline: 0.5341252264948662
million: 0.49188277461522245
and: 0.6071428571428571
free: 0.493766079556699
urgent: 0.5005389092476827
congratulations: 0.4998920785668033
cash: 0.4952421230704165
winner: 0.4984952708512468
offer: 0.504625905068383

Pr(category=NS|w=0):

viagra: 0.610574845123181
deadline: 0.5904891715886718
million: 0.6179874029588399


In [542]:
#17

Smoothing does not improve the overall performance: every smoothed posterior falls on the same side of 0.5 as its unsmoothed counterpart, so the predictions do not change. What smoothing does is avoid posteriors of exactly 1.0 or 0.0, which appear for "viagra" and "deadline" because each of these words occurs in only one category in the training data.