# Spam or Ham - RNN Edition
Lab Assignment Two: Exploring Text Data

**_Jake Oien, Seung Ki Lee, Jenn Le_**

## Business Understanding

### Dataset Description

This dataset contains raw text messages that were classified as either spam or ham (that is, not spam). Collecting and analyzing this data can be useful for identifying the characteristics of spam messages and, in turn, for filtering spam out of text messages as well as emails. Individuals and groups are increasingly targeting consumers through text messages, because the medium is more personal than email and this can cause people to lower their guard. Being able to recognize the patterns prevalent in spam messages can help people avoid the harmful situations these messages can cause.

The texts that make up this dataset come from free sources on the internet, including 425 messages from the Grumbletext Web Site, 3,375 from the NUS SMS Corpus (NSC), 450 from Caroline Tag's PhD Thesis, and 1,325 from the SMS Spam Corpus v.0.1 Big.

### Business Case

Our algorithm might be used in an opt-in service provided by a cell network. If a user is targeted by a lot of spam messages, they may want a service that detects an incoming spam message and blocks it before it reaches them. Analysis of the message text could help with that. In the ideal case, every spam message gets captured and every ham message makes it through. This may not be entirely possible, so the system should err on the side of letting some spam through if that ensures every ham message is delivered.

## Preparation

First, we will load the dataset. 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Here, we'll import the data, remove unwanted columns (this data has 3 empty columns for some reason),
# and rename the columns to be more descriptive
data = pd.read_csv("./spam.csv", encoding='latin-1')
data = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
data = data.rename(columns={"v1":"label", "v2":"text"})

data.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


### Verify Data Quality

To clean up the dataset, we looked for tokens that carry no meaning in a message. The first major kind of filler we noticed was HTML character entities. Strings such as `&lt;#&gt;` appear to be escaped markup left over from formatting rather than anything pertinent to the meaning of the text. We also did not come across any string that started with '&' and ended with ';' that wasn't such an entity, so the probability of removing important data seems very low.

In [2]:
# verify data quality

# remove irrelevant tokens: HTML entities such as &lt; and &gt; (with an optional leading '#')
# assigning the result back avoids relying on inplace modification of a column view
data["text"] = data["text"].replace(to_replace=r"#?&(lt|gt);", value="", regex=True)


We should add a column which stores the length of a text, which will be useful later. 

In [3]:
# add a column indicating how long each message is (in characters)
data["text_length"] = data["text"].str.len()
data.head(10)

Unnamed: 0,label,text,text_length
0,ham,"Go until jurong point, crazy.. Available only ...",111
1,ham,Ok lar... Joking wif u oni...,29
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,ham,U dun say so early hor... U c already then say...,49
4,ham,"Nah I don't think he goes to usf, he lives aro...",61
5,spam,FreeMsg Hey there darling it's been 3 week's n...,148
6,ham,Even my brother is not like to speak with me. ...,77
7,ham,As per your request 'Melle Melle (Oru Minnamin...,160
8,spam,WINNER!! As a valued network customer you have...,158
9,spam,Had your mobile 11 months or more? U R entitle...,154


Now let's remove duplicate texts from the dataset.

In [4]:
# data[data.label == "spam"].text.values[0:7]
len_data = len(data)
data.drop_duplicates(inplace=True)
len_data_no_dupes = len(data)

print("Number of total text messages: {}".format(len_data))
print("Number of unique text messages: {}".format(len_data_no_dupes))
print("Number of duplicates removed: {}".format(len_data - len_data_no_dupes))

Number of total text messages: 5572
Number of unique text messages: 5169
Number of duplicates removed: 403


We see that we removed 403 duplicate messages from the dataset. Now let's visualize a sparse representation of the messages. 

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
pd.options.display.max_columns = 1000

#create bag of words
count_vector = CountVectorizer(stop_words=None)
bag_of_words = count_vector.fit_transform(data['text'])

#put word counts in pd.DataFrame
# note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
bag_of_words_df = pd.DataFrame(data=bag_of_words.toarray(),
                               columns=count_vector.get_feature_names())

bag_of_words_df.head()

Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,...,ûªm,ûªt,ûªve,ûï,ûïharry,ûò,ûówell
0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0


First, at the end of the list, we see a lot of strange characters. We don't know what these characters are, but they likely come from an encoding mismatch. Without that knowledge, we will leave them in for now, until we can understand better what the characters mean. 

Second, at the beginning of the list, we see a lot of numbers that have 10+ digits. These are likely phone numbers. Depending on how we set up our sequences, we may wish to replace all of these long digit strings with a single identifier. This might help because a specific phone number doesn't necessarily mean anything different from the mere presence of a phone number. A neural network might learn that a phone number from a specific area code is a stronger spam signal than another phone number. However, we have fewer than 1,000 spam messages to look at, so our first intuition is that specific phone numbers or area codes are not represented well enough to be a reliable basis for classification.
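
One way to try the phone-number idea is a regular-expression substitution applied before tokenizing. The sketch below is only an illustration and is not part of the pipeline in this notebook: the placeholder token `phonenumber`, the helper name `mask_phone_numbers`, and the 10-digit threshold are all assumptions.

```python
import re

# Hypothetical placeholder token; any purely alphabetic string would survive tokenization.
PHONE_TOKEN = "phonenumber"

# Assumption: a run of 10 or more consecutive digits is treated as a phone number.
LONG_DIGIT_RUN = re.compile(r"\d{10,}")

def mask_phone_numbers(text):
    """Replace every run of 10+ digits with a single shared token."""
    return LONG_DIGIT_RUN.sub(PHONE_TOKEN, text)

example = "Call 09061701461 now to claim your prize, code 341"
masked = mask_phone_numbers(example)
print(masked)
```

Applied to the whole column, this would look like `data["text"].map(mask_phone_numbers)`. Short numbers such as `341` are left untouched, so prices and short codes remain distinguishable from phone numbers.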

### Evaluation Metric

How do we determine the success of our neural network? We will consider spam to be our positive class. At the end of the day, we care about finding spam messages and not misclassifying ham messages. Therefore, it seems that the F1 score would be our best metric for evaluation. 

When a message is spam, we want to correctly classify it, so we want to include recall. However, if every message is classified as spam, we will have 100% recall but 0% of messages making it to the recipient. So, we also want a measure that penalizes false alarms. We don't care about successfully classifying ham messages as such; rather, we care about not misclassifying them as spam. That is what precision measures. Because we want to account for both precision and recall, we will use the F1 score, which is their harmonic mean.
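
To make the trade-off concrete, here is a small sketch that computes precision, recall, and F1 from scratch with spam as the positive class. The labels are made up for illustration, and the helper name `precision_recall_f1` is hypothetical; in practice `sklearn.metrics.f1_score` computes the same quantity.

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Return (precision, recall, F1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up toy labels: 2 spam and 8 ham messages.
y_true = ["spam", "spam"] + ["ham"] * 8

# A classifier that flags everything as spam: perfect recall, poor precision.
all_spam = ["spam"] * 10
print(precision_recall_f1(y_true, all_spam))

# A classifier that catches only one spam message but never blocks ham.
cautious = ["spam", "ham"] + ["ham"] * 8
print(precision_recall_f1(y_true, cautious))
```

On this toy data the flag-everything classifier scores a lower F1 than the cautious one, even though its recall is perfect, which is the behavior we want from the metric.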

### Cross Validation Method

Now, how will we try to estimate generalization performance? We will use K-fold cross validation but we want to make sure we have enough examples of spam in each fold. Should we use a stratified method? Let's look at the class distribution. 

In [6]:
data_grouped = data.groupby(by="label")  # separate ham and spam
# sns.barplot(x="label", y="count", data=data_grouped.text)
count = data_grouped.text.count()  # just the count of all entries
ax = count.plot(kind="barh")
plt.xlabel("count")
plt.title("Count of spam and ham messages");

Yes, we should use a stratified method, because there is a big class imbalance. We will use 10-fold stratified cross-validation, unless the model turns out to take a long time to train, in which case we will reduce the number of folds. Given that we have only about 5,000 short messages (generally under 300 characters), we will probably not run into a training time issue.
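
As a quick illustration of what stratification buys us, the sketch below splits a made-up, imbalanced label vector with scikit-learn's `StratifiedKFold` and counts the spam examples that land in each test fold. The toy arrays and the variable names are assumptions for the example, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels: 1 = spam, 0 = ham.
toy_labels = np.array([1] * 100 + [0] * 900)
# Placeholder features; only the labels drive the stratified split.
toy_features = np.zeros((len(toy_labels), 1))

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

spam_per_fold = []
for train_idx, test_idx in skf.split(toy_features, toy_labels):
    # Count how many spam examples ended up in this test fold.
    spam_per_fold.append(int(toy_labels[test_idx].sum()))

print(spam_per_fold)
```

Every test fold receives the same share of spam, so no fold is left with too few positive examples to compute a meaningful F1 score.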

## Modeling (50 points total)
### [25 points] Investigate at least two different recurrent network architectures (perhaps LSTM and GRU). Adjust hyper-parameters of the networks as needed to improve generalization performance. 
### [25 points] Use the method of cross validation and evaluation metric that you argued for at the beginning of the lab. Visualize the best results of the RNNs.   


In [51]:
# import everything
import numpy as np

import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import GRU

In [26]:
# .values returns the underlying NumPy arrays (as_matrix() was removed in pandas 1.0)
texts = data.text.values
labels = data.label.values
labels_nums = data.label.astype("category").cat.codes.values  # numerical representation of the labels

numpy.ndarray

In [33]:
NUM_TOP_WORDS = None
MAX_TEXT_LEN = 500  # every sequence is padded or truncated to exactly this many tokens

tokenizer = Tokenizer(num_words=NUM_TOP_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
NUM_TOP_WORDS = len(word_index) if NUM_TOP_WORDS is None else NUM_TOP_WORDS
top_words = min((len(word_index),NUM_TOP_WORDS))
print('Found %s unique tokens. Distilled to %d top words.' % (len(word_index),top_words))

X = pad_sequences(sequences, maxlen=MAX_TEXT_LEN)

y_ohe = keras.utils.to_categorical(labels_nums)
print('Shape of data tensor:', X.shape)
print('Shape of label tensor:', y_ohe.shape)
print(np.max(X))

Found 8920 unique tokens. Distilled to 8920 top words.
Shape of data tensor: (5169, 500)
Shape of label tensor: (5169, 2)
8920


Now we will load the GloVe embedding. TODO: maybe make it trainable?

In [45]:
%%time
EMBED_SIZE = 100
# the embed size should match the file you load glove from
embeddings_index = {}
with open('glove.6B/glove.6B.100d.txt', encoding='utf-8') as f:
    # save key/array pairs of the embeddings:
    # the key of the dictionary is the word, the array is the embedding
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))
        
embedding_matrix = np.zeros((len(word_index) + 1, EMBED_SIZE))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

print(embedding_matrix.shape)

Found 400000 word vectors.
(8921, 100)
CPU times: user 12.4 s, sys: 413 ms, total: 12.8 s
Wall time: 12.8 s
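
Because words missing from GloVe are left as all-zero rows, it can be worth checking how much of the vocabulary actually received a vector. The sketch below mirrors the loop above on tiny made-up stand-ins (`toy_word_index` and `toy_embeddings` in place of `word_index` and `embeddings_index`); the helper name `embedding_coverage` is hypothetical.

```python
import numpy as np

def embedding_coverage(word_index, embeddings_index, embed_size):
    """Build the embedding matrix and list the words with no pretrained vector."""
    # Row 0 is never assigned: tokenizer indices start at 1 and 0 is used for padding.
    matrix = np.zeros((len(word_index) + 1, embed_size))
    missing = []
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is None:
            missing.append(word)  # this row stays all-zero
        else:
            matrix[i] = vector
    return matrix, missing

# Made-up stand-ins for the tokenizer vocabulary and the GloVe dictionary.
toy_word_index = {"free": 1, "call": 2, "wkly": 3}
toy_embeddings = {
    "free": np.array([0.1, 0.2], dtype="float32"),
    "call": np.array([0.3, 0.4], dtype="float32"),
}

matrix, missing = embedding_coverage(toy_word_index, toy_embeddings, embed_size=2)
print(missing)
print(matrix.shape)
```

If a large share of the SMS vocabulary (abbreviations such as "wkly", for instance) turns out to be missing, that would be one argument for making the embedding layer trainable.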


Now let's make a basic RNN using the LSTM model. 

In [46]:
from keras.layers import Embedding

embedding_layer = Embedding(len(word_index) + 1,
                            EMBED_SIZE,
                            weights=[embedding_matrix],
                            input_length=MAX_TEXT_LEN,
                            trainable=False)

In [47]:
from sklearn.model_selection import train_test_split
# Split it into train / test subsets
X_train, X_test, y_train_ohe, y_test_ohe = train_test_split(X, y_ohe, test_size=0.2,
                                                            stratify=labels, 
                                                            random_state=42)
NUM_CLASSES = 2
print(X_train.shape,y_train_ohe.shape)
print(np.sum(y_train_ohe,axis=0))

(4135, 500) (4135, 2)
[ 3613.   522.]


In [48]:
def create_lstm_model():
    rnn_lstm = Sequential()
    rnn_lstm.add(embedding_layer)
    rnn_lstm.add(LSTM(100,dropout=0.2, recurrent_dropout=0.2))
    rnn_lstm.add(Dense(NUM_CLASSES, activation='sigmoid'))
    rnn_lstm.compile(loss='binary_crossentropy', 
                  optimizer='adam', 
                  metrics=['accuracy'])
    # print(rnn_lstm.summary())
    return rnn_lstm  # return the compiled model so callers can train it

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 500, 100)          892100    
_________________________________________________________________
lstm_7 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_6 (Dense)              (None, 2)                 202       
Total params: 972,702
Trainable params: 80,602
Non-trainable params: 892,100
_________________________________________________________________
None


In [49]:
# rnn_lstm.fit(X_train, y_train_ohe, validation_data=(X_test, y_test_ohe), epochs=3, batch_size=64)

Train on 4135 samples, validate on 1034 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x140c93c50>

Now let's try it with a GRU model. 

In [52]:
def create_gru_model():
    rnn_gru = Sequential()
    rnn_gru.add(embedding_layer)
    rnn_gru.add(GRU(100,dropout=0.2, recurrent_dropout=0.2))
    rnn_gru.add(Dense(NUM_CLASSES, activation='sigmoid'))
    rnn_gru.compile(loss='binary_crossentropy', 
                  optimizer='adam', 
                  metrics=['accuracy'])
    # print(rnn_gru.summary())
    return rnn_gru  # return the compiled model so callers can train it

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 500, 100)          892100    
_________________________________________________________________
gru_1 (GRU)                  (None, 100)               60300     
_________________________________________________________________
dense_7 (Dense)              (None, 2)                 202       
Total params: 952,602
Trainable params: 60,502
Non-trainable params: 892,100
_________________________________________________________________
None


In [53]:
# rnn_gru.fit(X_train, y_train_ohe, validation_data=(X_test, y_test_ohe), epochs=3, batch_size=64)

Train on 4135 samples, validate on 1034 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x13ecb6c50>

## Exceptional Work (10 points total)
### One idea (required for 7000 level students): Train an embedding layer for words in your RNN. Visualize and interpret the embedding layer weights. 
### Another Idea (NOT required): Try to create a RNN for generating novel text. 