Name: Soleil Umwiza


Student number: 4386019



In [1]:
import sklearn
import pandas as pd
import numpy as np
import seaborn 



print("scikit-learn version:", sklearn.__version__)
print("pandas version:", pd.__version__)
print("seaborn version:", seaborn.__version__)



scikit-learn version: 1.3.0
pandas version: 1.5.3
seaborn version: 0.12.2


In [2]:
df = pd.read_csv("Phishing_Email.csv", index_col=0)
df.shape



(18650, 2)

In [3]:
df.head()

Unnamed: 0,Email Text,Email Type
0,"re : 6 . 1100 , disc : uniformitarianism , re ...",Safe Email
1,the other side of * galicismos * * galicismo *...,Safe Email
2,re : equistar deal tickets are you still avail...,Safe Email
3,\nHello I am your hot lil horny toy.\n I am...,Phishing Email
4,software at incredibly low prices ( 86 % lower...,Phishing Email


In [4]:
df.tail()

Unnamed: 0,Email Text,Email Type
18646,date a lonely housewife always wanted to date ...,Phishing Email
18647,request submitted : access request for anita ....,Safe Email
18648,"re : important - prc mtg hi dorn & john , as y...",Safe Email
18649,press clippings - letter on californian utilit...,Safe Email
18650,empty,Phishing Email


In [5]:
df.sample(10)

Unnamed: 0,Email Text,Email Type
11780,charity partners golf outing ten spaces are st...,Safe Email
17642,re : you neeed our super specials on antidotes...,Phishing Email
17385,"URL: http://www.newsisfree.com/click/-1,838114...",Safe Email
3652,world wide announcement to respond to the need...,Safe Email
18473,URL: http://diveintomark.org/archives/2002/10/...,Safe Email
9937,\nFull Access Medical To sign up for ...,Phishing Email
1135,esslli ' 98 student session - 2nd cfp the essl...,Safe Email
4496,empty,Safe Email
9601,"pictures special , list add yes , written reme...",Phishing Email
13568,jm@jmason.org (Justin Mason) writes:>>> DATE...,Safe Email


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18650 entries, 0 to 18650
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Email Text  18634 non-null  object
 1   Email Type  18650 non-null  object
dtypes: object(2)
memory usage: 437.1+ KB


In [7]:
df = df.rename(columns={'Email Text': 'EmailText', 'Email Type': 'EmailType'})  # rename the columns to remove the spaces

## Data cleaning and Pre-processing

Check whether there are any missing values

In [8]:
df.isnull().values.sum()

16

In [9]:
df1 = df[df.isna().any(axis=1)]
print(df1)

      EmailText       EmailType
31          NaN  Phishing Email
387         NaN  Phishing Email
1883        NaN  Phishing Email
2049        NaN  Phishing Email
2451        NaN  Phishing Email
2972        NaN  Phishing Email
3627        NaN  Phishing Email
3806        NaN  Phishing Email
5763        NaN  Phishing Email
6299        NaN  Phishing Email
6822        NaN  Phishing Email
8595        NaN  Phishing Email
10000       NaN  Phishing Email
11070       NaN  Phishing Email
11321       NaN  Phishing Email
13844       NaN  Phishing Email


### 🎯 Target variable
Machine learning algorithms work only with numbers and therefore produce only numbers as output, so the first step is to make sure the target variable is numeric. A new column named `EmailTypeID` is created that contains a number for each distinct `EmailType`; this column becomes the target variable, the thing the model needs to predict. To fill `EmailTypeID`, a `LabelEncoder` is used, which assigns a unique integer to every unique value it finds in the `EmailType` column. Since there are only two unique values, the numbers are `0` and `1`: `0` is phishing and `1` is safe, because the encoder numbers the labels in sorted order.




In [10]:
from sklearn import preprocessing
encoder = preprocessing.LabelEncoder()
df["EmailTypeID"]=encoder.fit_transform(df["EmailType"])
df.sample(10)

Unnamed: 0,EmailText,EmailType,EmailTypeID
14559,upcoming discretionary var limit changes per d...,Safe Email,1
12057,SPECIAL SITUATION ALERTS HOT PICK OF THE YEARE...,Phishing Email,0
18295,geir ' s goals ),Safe Email,1
14356,\n> I just had to jump in here as Carbonara is...,Safe Email,1
15219,fw : neevr seen prono flash animation buenos d...,Phishing Email,0
8052,"On Wed, 2002-07-31 at 15:16, Elias Sinderson w...",Safe Email,1
1131,URGENT NOTICEPENDING MERGER TO INCREASE REVENU...,Phishing Email,0
6885,free this is a multi-part message in mime form...,Phishing Email,0
18626,cera conference call & web presentation : tran...,Safe Email,1
7138,"\nHello, jm@netnoteinc.comHuman Growth Hormone...",Phishing Email,0
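
`LabelEncoder` numbers the classes in sorted order, which is why `Phishing Email` maps to `0` and `Safe Email` to `1`. The sketch below reproduces that mapping in plain Python to make the ordering explicit. It is an illustration only, not scikit-learn's implementation, and the names `labels`, `label_to_id` and `id_to_label` are hypothetical.

```python
# Toy labels standing in for the EmailType column.
labels = ["Safe Email", "Phishing Email", "Safe Email", "Phishing Email"]

# Sort the distinct labels, then number them in that order.
classes = sorted(set(labels))
label_to_id = {label: idx for idx, label in enumerate(classes)}

# Encode every label, and keep the reverse mapping for decoding predictions.
ids = [label_to_id[label] for label in labels]
id_to_label = dict(enumerate(classes))

print(label_to_id)
print(ids)
```

On the fitted encoder from the cell above, `encoder.classes_` lists the labels in the same order, and `encoder.inverse_transform` turns predicted IDs back into label text.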


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18650 entries, 0 to 18650
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   EmailText    18634 non-null  object
 1   EmailType    18650 non-null  object
 2   EmailTypeID  18650 non-null  int32 
dtypes: int32(1), object(2)
memory usage: 510.0+ KB


Installing the NLTK library, used for tokenization and for removing English stopwords

In [12]:
! pip install --user nltk



Importing packages for tokenization and function to perform natural language pre-processing

In [13]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import EnglishStemmer
import nltk
nltk.download('punkt')
# On a fresh install, stopwords.words() also needs nltk.download('stopwords')
stemmer = EnglishStemmer()

# Get the set of English stopwords
stop_words = set(stopwords.words('english'))

# Lowercases the text (via stemmer.stem), tokenizes it, and returns the tokens
# that are not stopwords.
# Note: stemmer.stem() receives the whole text instead of one word at a time, so
# in practice only the end of the string is stemmed (e.g. 'empty' -> 'empti').
def text_preprocessing(text):
    word_tokens = word_tokenize(stemmer.stem(text))
    return [w for w in word_tokens if w not in stop_words]



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\csten\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Fill NaN values with empty string to avoid type errors

In [14]:
df['EmailText']=df['EmailText'].fillna('').apply(str)
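
The order of the two steps matters: missing values are filled first and only then converted with `str`. Converting a NaN directly would produce the literal text `nan`, which would later be tokenized as if it were a word. The sketch below shows the same idea in plain Python; `to_text` and `raw_values` are hypothetical names used only for illustration.

```python
import math

def to_text(value):
    """Return '' for missing values (None or float NaN), else str(value)."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    return str(value)


raw_values = ["hello world", float("nan"), "empty"]
cleaned = [to_text(v) for v in raw_values]

print(cleaned)
print(repr(str(float("nan"))))  # what str() alone would yield for a NaN
```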

Duplicate the `EmailText` column so that tokenization can be applied to the copy without changing the original text

In [15]:
df['TokenizedEmail']= df['EmailText']

In [16]:
df['TokenizedEmail'][0]

're : 6 . 1100 , disc : uniformitarianism , re : 1086 ; sex / lang dick hudson \'s observations on us use of \'s on \' but not \'d aughter \' as a vocative are very thought-provoking , but i am not sure that it is fair to attribute this to " sons " being " treated like senior relatives " . for one thing , we do n\'t normally use \' brother \' in this way any more than we do \'d aughter \' , and it is hard to imagine a natural class comprising senior relatives and \'s on \' but excluding \' brother \' . for another , there seem to me to be differences here . if i am not imagining a distinction that is not there , it seems to me that the senior relative terms are used in a wider variety of contexts , e . g . , calling out from a distance to get someone \'s attention , and hence at the beginning of an utterance , whereas \'s on \' seems more natural in utterances like \' yes , son \' , \' hand me that , son \' than in ones like \' son ! \' or \' son , help me ! \' ( although perhaps these

In [17]:
set(text_preprocessing(str('re : 6 . 1100 , disc : uniformitarianism , re : 1086 ; sex / lang dick hudson \'s observations on us use of \'s on \' but not \'d aughter \' as a vocative are very thought-provoking , but i am not sure that it is fair to attribute this to " sons " being " treated like senior relatives " . for one thing , we do n\'t normally use \' brother \' in this way any more than we do \'d aughter \' , and it is hard to imagine a natural class comprising senior relatives and \'s on \' but excluding \' brother \' . for another , there seem to me to be differences here . if i am not imagining a distinction that is not there , it seems to me that the senior relative terms are used in a wider variety of contexts , e . g . , calling out from a distance to get someone \'s attention , and hence at the beginning of an utterance , whereas \'s on \' seems more natural in utterances like \' yes , son \' , \' hand me that , son \' than in ones like \' son ! \' or \' son , help me ! \' ( although perhaps these latter ones are not completely impossible ) . alexis mr')))

{'!',
 "'",
 "'d",
 "'s",
 '(',
 ')',
 ',',
 '.',
 '/',
 '1086',
 '1100',
 '6',
 ':',
 ';',
 '``',
 'alexis',
 'although',
 'another',
 'attention',
 'attribute',
 'aughter',
 'beginning',
 'brother',
 'calling',
 'class',
 'completely',
 'comprising',
 'contexts',
 'dick',
 'differences',
 'disc',
 'distance',
 'distinction',
 'e',
 'excluding',
 'fair',
 'g',
 'get',
 'hand',
 'hard',
 'help',
 'hence',
 'hudson',
 'imagine',
 'imagining',
 'impossible',
 'lang',
 'latter',
 'like',
 'mr',
 "n't",
 'natural',
 'normally',
 'observations',
 'one',
 'ones',
 'perhaps',
 'relative',
 'relatives',
 'seem',
 'seems',
 'senior',
 'sex',
 'someone',
 'son',
 'sons',
 'sure',
 'terms',
 'thing',
 'thought-provoking',
 'treated',
 'uniformitarianism',
 'us',
 'use',
 'used',
 'utterance',
 'utterances',
 'variety',
 'vocative',
 'way',
 'whereas',
 'wider',
 'yes'}

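Using our function, which applies stemming, tokenization and stopword removal, we then use `set()` to convert the list of words to a set, which removes duplicates. Note that the next cell only displays the result; it is not assigned back to `df['TokenizedEmail']`, so the column still holds the raw text afterwards.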

In [18]:
df['TokenizedEmail'].apply(lambda x: set(text_preprocessing(str(x))))

0        {imagine, ;, us, excluding, contexts, whereas,...
1        {origin, names, opposite, sounding, <, galicis...
2        {farmer, pricing, sale, want, forward, tina, n...
3        {call, free, imagination, love, person, e-mail...
4        {represent, job, subtract, family, person, %, ...
                               ...                        
18646    {php, lonely, www, biz, wanted, :, housewifes,...
18647    {approval, \, review, upon, data, common, reso...
18648    {hi, mail, starting, arrangement, could, expli...
18649    {following, chana, team, ', also, utilities, l...
18650                                              {empti}
Name: TokenizedEmail, Length: 18650, dtype: object
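
Converting the token list to a set removes duplicates, but it also discards word order and word frequency. The sketch below contrasts three options on a toy token list. The variable names are hypothetical, and which option fits depends on what the later modelling step needs.

```python
from collections import Counter

tokens = ["son", ",", "son", "help", "me", "son"]

# set(): unique tokens, no order, no counts.
unique_tokens = set(tokens)

# dict.fromkeys(): unique tokens in first-seen order.
ordered_unique = list(dict.fromkeys(tokens))

# Counter: unique tokens together with how often each one occurs.
counts = Counter(tokens)

print(unique_tokens)
print(ordered_unique)
print(counts.most_common(1))
```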

In [19]:
df

Unnamed: 0,EmailText,EmailType,EmailTypeID,TokenizedEmail
0,"re : 6 . 1100 , disc : uniformitarianism , re ...",Safe Email,1,"re : 6 . 1100 , disc : uniformitarianism , re ..."
1,the other side of * galicismos * * galicismo *...,Safe Email,1,the other side of * galicismos * * galicismo *...
2,re : equistar deal tickets are you still avail...,Safe Email,1,re : equistar deal tickets are you still avail...
3,\nHello I am your hot lil horny toy.\n I am...,Phishing Email,0,\nHello I am your hot lil horny toy.\n I am...
4,software at incredibly low prices ( 86 % lower...,Phishing Email,0,software at incredibly low prices ( 86 % lower...
...,...,...,...,...
18646,date a lonely housewife always wanted to date ...,Phishing Email,0,date a lonely housewife always wanted to date ...
18647,request submitted : access request for anita ....,Safe Email,1,request submitted : access request for anita ....
18648,"re : important - prc mtg hi dorn & john , as y...",Safe Email,1,"re : important - prc mtg hi dorn & john , as y..."
18649,press clippings - letter on californian utilit...,Safe Email,1,press clippings - letter on californian utilit...
