# Business / Problem Understanding


Phishing URLs are submitted to PhishTank, where the "community" classifies each one as an actual phish or not. Therefore, we have a binary classification task.

What features could we use to tell valid phishes from invalid ones using ML?


We can find some valid phishes on PhishTank: https://phishtank.org/phish_search.php?valid=y&active=All&Search=Search



Here are some examples of valid phishes:
- https://help-recovery-identity-support-international.web.id/confirmid.php
- https://234565676868--3456556757.repl.co/


Let's look at this one:

- https://help-recovery-identity-support-international.web.id/confirmid.php

We can break it into its URL pieces:

- PATH: /confirmid.php
  - "Is it a .php?"
    - binary feature
  - "How long is the path?"
    - numeric (integer) feature
  - "How many unique characters are in the path?"
    - numeric (integer) feature
- DOMAIN: help-recovery-identity-support-international.web.id
  - Text analytics: are there certain substrings that are giveaways of a phish?
  - Is it readable?
  - Does a top-level domain appear anywhere other than as the last label?
  - Count of subdomains?
- PROTOCOL: https://
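
The questions above can be turned into code fairly directly. The following is a minimal sketch, not the notebook's actual feature pipeline: the function name `url_features` and the feature names are hypothetical, and it uses only the standard library's `urllib.parse.urlsplit`. It counts dot-separated labels in the domain rather than true subdomains, because separating subdomains from a suffix such as `web.id` would need a public suffix list.

```python
from urllib.parse import urlsplit


def url_features(url: str) -> dict:
    """Split a URL into protocol, domain, and path, then derive simple features.

    Hypothetical helper for illustration; the feature names are our own.
    """
    # URLs without a scheme (e.g. "example.com/a.php") need a leading "//"
    # so that urlsplit treats the first segment as the host.
    if not url.startswith(("http://", "https://")):
        url = "//" + url
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path
    labels = host.split(".") if host else []
    return {
        "protocol": parts.scheme,               # "" when the URL has no scheme
        "path_is_php": path.lower().endswith(".php"),
        "path_len": len(path),
        "path_unique_chars": len(set(path)),
        "domain_len": len(host),
        "domain_label_count": len(labels),      # dot-separated labels, not true subdomains
        "domain_hyphen_count": host.count("-"),
    }


feats = url_features(
    "https://help-recovery-identity-support-international.web.id/confirmid.php"
)
for name, value in feats.items():
    print(f"{name}: {value}")
```

The readability and substring questions are left out here, since they need a vocabulary or a text model rather than simple string operations.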


# Download and unzip

Uncomment the commands below to run them.

In [40]:
import pandas as pd
from sklearn import tree
import eli5

In [4]:
#!wget 'https://research.aalto.fi/files/16859732/urlset.csv.zip' -O 'dataset.zip'

--2022-02-04 18:42:29--  https://research.aalto.fi/files/16859732/urlset.csv.zip
Resolving research.aalto.fi (research.aalto.fi)... 34.253.178.11, 34.248.98.230
Connecting to research.aalto.fi (research.aalto.fi)|34.253.178.11|:443... connected.
HTTP request sent, awaiting response... 302 302
Location: https://acris.aalto.fi/ws/portalfiles/portal/16859732/urlset.csv.zip [following]
--2022-02-04 18:42:30--  https://acris.aalto.fi/ws/portalfiles/portal/16859732/urlset.csv.zip
Resolving acris.aalto.fi (acris.aalto.fi)... 130.233.208.8
Connecting to acris.aalto.fi (acris.aalto.fi)|130.233.208.8|:443... connected.
HTTP request sent, awaiting response... 200 200
Length: unspecified [multipart/x-zip]
Saving to: ‘dataset.zip’

dataset.zip             [             <=>    ]   3.24M   905KB/s    in 3.7s    

2022-02-04 18:42:35 (905 KB/s) - ‘dataset.zip’ saved [3400239]



In [5]:
#!unzip 'dataset.zip'

Archive:  dataset.zip
  inflating: urlset.csv              
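
If shell commands are not available (for example, outside a notebook), the same two steps can be sketched with the standard library. This is an assumption-laden alternative, not what the notebook ran: the helper names `download` and `unzip` are hypothetical, and the download needs network access and relies on the URL above still being live.

```python
import urllib.request
import zipfile
from pathlib import Path

# Same URL as in the wget cell above.
DATASET_URL = "https://research.aalto.fi/files/16859732/urlset.csv.zip"


def download(url: str, dest: str = "dataset.zip") -> Path:
    """Fetch url to dest unless the file is already there (needs network access)."""
    path = Path(dest)
    if not path.exists():
        urllib.request.urlretrieve(url, str(path))
    return path


def unzip(zip_path, out_dir: str = ".") -> list:
    """Extract every member of the archive into out_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()
```

Calling `unzip(download(DATASET_URL))` would then stand in for the two shell cells; nothing is downloaded until the functions are called.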


In [14]:
df = pd.read_csv('urlset.csv', on_bad_lines = 'skip',  encoding_errors = 'ignore')



In [15]:
df.head()

Unnamed: 0,domain,ranking,mld_res,mld.ps_res,card_rem,ratio_Rrem,ratio_Arem,jaccard_RR,jaccard_RA,jaccard_AR,jaccard_AA,jaccard_ARrd,jaccard_ARrem,label
0,nobell.it/70ffb52d079109dca5664cce6f317373782/...,10000000,1.0,0.0,18.0,107.611111,107.277778,0.0,0.0,0.0,0.0,0.8,0.795729,1.0
1,www.dghjdgf.com/paypal.co.uk/cycgi-bin/webscrc...,10000000,0.0,0.0,11.0,150.636364,152.272727,0.0,0.0,0.0,0.0,0.0,0.768577,1.0
2,serviciosbys.com/paypal.cgi.bin.get-into.herf....,10000000,0.0,0.0,14.0,73.5,72.642857,0.0,0.0,0.0,0.0,0.0,0.726582,1.0
3,mail.printakid.com/www.online.americanexpress....,10000000,0.0,0.0,6.0,562.0,590.666667,0.0,0.0,0.0,0.0,0.0,0.85964,1.0
4,thewhiskeydregs.com/wp-content/themes/widescre...,10000000,0.0,0.0,8.0,29.0,24.125,0.0,0.0,0.0,0.0,0.0,0.748971,1.0


In [16]:
df.describe()

Unnamed: 0,card_rem,ratio_Rrem,ratio_Arem,jaccard_RR,jaccard_RA,jaccard_AR,jaccard_AA,label
count,95923.0,95923.0,95923.0,95922.0,95921.0,95920.0,95919.0,95913.0
mean,4.580498,135.255201,138.544211,0.00503,0.003787,0.003378,0.003661,0.499453
std,4.466073,160.988895,175.480485,0.311308,0.024815,0.024011,0.028492,0.500002
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,43.0,39.666667,0.0,0.0,0.0,0.0,0.0
50%,3.0,104.0,103.333333,0.0,0.0,0.0,0.0,0.0
75%,6.0,174.142857,178.292857,0.0,0.0,0.0,0.0,1.0
max,187.333333,5507.0,6097.0,96.0,1.0,1.0,1.0,1.0


In [17]:
pd.__version__

'1.4.0'

## Grab just the domain and label

In [20]:
df = df[['domain', 'label']]

# Feature Engineering 



## Clean the Data

In [24]:
# Which of our domains is a float (i.e. a missing value, NaN) rather than a string?

df[df['domain'].apply(lambda x: isinstance(x,float))]

Unnamed: 0,domain,label,domain_len
18251,,,96003


In [25]:
df = df.dropna()

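### Start with the length of the domain

Note that the `domain` column holds the whole URL after the protocol (host, path, and query string), so `domain_len` is effectively the URL length.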

In [26]:
df['domain_len'] = df['domain'].apply(len)

In [28]:
df

Unnamed: 0,domain,label,domain_len
0,nobell.it/70ffb52d079109dca5664cce6f317373782/...,1.0,225
1,www.dghjdgf.com/paypal.co.uk/cycgi-bin/webscrc...,1.0,81
2,serviciosbys.com/paypal.cgi.bin.get-into.herf....,1.0,177
3,mail.printakid.com/www.online.americanexpress....,1.0,60
4,thewhiskeydregs.com/wp-content/themes/widescre...,1.0,116
...,...,...,...
95998,xbox360.ign.com/objects/850/850402.html,0.0,39
95999,games.teamxbox.com/xbox-360/1860/Dead-Space/,0.0,44
96000,www.gamespot.com/xbox360/action/deadspace/,0.0,42
96001,en.wikipedia.org/wiki/Dead_Space_(video_game),0.0,45


In [31]:
# Build a decision tree model on the single domain_len feature

clf = tree.DecisionTreeClassifier()
clf = clf.fit(df[['domain_len']], df['label'])
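
The tree above is fit on every row, so predictions on those same rows say little about how it generalizes. Below is a minimal sketch of a held-out check, under several assumptions: it uses only the nine rows displayed in the table above as a toy frame (not the full dataset), and the split size, `max_depth=3`, and `random_state=0` are arbitrary choices of ours, not settings from the notebook.

```python
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Toy frame: only the nine rows displayed above, not the full dataset.
toy = pd.DataFrame({
    "domain_len": [225, 81, 177, 60, 116, 39, 44, 42, 45],
    "label": [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
})

# Hold out three rows, keeping both classes represented in each split.
X_train, X_test, y_train, y_test = train_test_split(
    toy[["domain_len"]], toy["label"],
    test_size=3, stratify=toy["label"], random_state=0,
)

# A shallow tree (assumed depth limit) is less prone to memorizing lengths.
held_out_clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
held_out_clf.fit(X_train, y_train)

# score() reports mean accuracy on the rows the tree never saw.
accuracy = held_out_clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

With the real data, the same pattern applies by swapping `toy` for `df` and choosing a fractional `test_size`.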

In [37]:
df[['domain']].iloc[0].values

array(['nobell.it/70ffb52d079109dca5664cce6f317373782/login.SkyPe.com/en/cgi-bin/verification/login/70ffb52d079109dca5664cce6f317373/index.php?cmd=_profile-ach&outdated_page_tmpl=p/gen/failed-to-load&nav=0.5.1&login_access=1322408526'],
      dtype=object)

In [43]:
df[['domain_len']].iloc[0].values.tolist()

[225]

In [34]:
# iloc[[0]] (double brackets) keeps a one-row DataFrame, so the feature name matches the fit
clf.predict(df[['domain_len']].iloc[[0]])



array([1.])

In [39]:
# We can also pass the domain length directly.
# It needs double [] because predict expects a two-dimensional array (samples x features).

clf.predict([[225]])



array([1.])

In [42]:
eli5.explain_prediction(clf, df[['domain_len']].iloc[0])

Contribution?,Feature
0.501,domain_len
0.499,<BIAS>
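
As far as we understand eli5's decision-tree explanations, `<BIAS>` is the average label at the root of the tree (compare the `label` mean of 0.499453 in `df.describe()`), and each feature's contribution is the change in that average along the path the sample takes through the tree. Under that reading, the rows should sum to the score for the predicted class. A small arithmetic check, using only the numbers shown above:

```python
# Values copied from the eli5 output above.
bias = 0.499                     # <BIAS>: roughly the share of phishing rows in the data
domain_len_contribution = 0.501  # shift in the score attributed to domain_len

# The contributions should add up to the tree's score for the predicted class.
score = bias + domain_len_contribution
print(round(score, 3))
```

A total of about 1 means this row lands in a leaf that contains only phishing rows, which is to be expected for a row the unpruned tree was trained on.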
