# In class - 01/28

https://archive.ics.uci.edu/ml/datasets/phishing+websites


# Business Understanding


Suspected phishing URLs are submitted to PhishTank, where the "community" classifies each one as either an actual phish or not. Because every submission ends up with one of two labels, we have a binary classification task.

What features could we give an ML model so that it can tell valid phishes from invalid ones?

We can see some valid phishes on PhishTank using this query: https://phishtank.org/phish_search.php?valid=y&active=All&Search=Search

Here are some example phishing URLs:
- https://help-recovery-identity-support-international.web.id/confirmid.php
- https://234565676868--3456556757.repl.co/
- http://activate.facebook.fblogins.net/88adbao798283o8298398?login.asp
- http://drive--google.com/luke.johnson
- http://efax.hosting.com.mailru382.co/efaxdelivery/2017Dk4h325RE3

Let's look at this one:

- https://help-recovery-identity-support-international.web.id/confirmid.php

We can break it into its URL pieces:
- PATH: /confirmid.php
  * "Is it a .php?"
    - binary feature
  * "How long is the path?"
    - integer
    - continuous
  * "How many unique characters in the path?"
- DOMAIN: help-recovery-identity-support-international.web.id
  * Text analytics: are there certain substrings that are giveaways of a phish?
  * Is it human-readable?
  * Does a top-level domain (e.g. `.com`) appear anywhere other than the last entry in the list?
  * Count of subdomains?
  * WHOIS lookup -- age of the domain?
  * Which top-level domain is it?
- PROTOCOL: https://
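
The questions in the breakdown above can be turned into feature columns with the standard library alone. The sketch below is one possible way to do it, not the approach used later in these notes. `url_features` and its key names are made up for illustration, and counting dot-separated labels is only a rough stand-in for "count of subdomains": `web.id` may itself be a registrable suffix, and handling that properly would need a public suffix list. The WHOIS and readability questions are left out.

```python
from urllib.parse import urlparse


def url_features(url):
    """Hypothetical helper: split a URL and compute a few simple features."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    path = parsed.path
    labels = host.split(".")  # naive split, ignores multi-part suffixes
    return {
        "protocol": parsed.scheme,                  # e.g. "https"
        "is_php": path.lower().endswith(".php"),    # binary feature
        "path_len": len(path),                      # integer feature
        "path_unique_chars": len(set(path)),        # distinct characters in the path
        "domain_len": len(host),
        "num_labels": len(labels),                  # rough proxy for subdomain count
        "num_hyphens": host.count("-"),
        "last_label": labels[-1],                   # last dot-separated piece
    }


features = url_features(
    "https://help-recovery-identity-support-international.web.id/confirmid.php"
)
for name, value in features.items():
    print(f"{name}: {value}")
```

Each key becomes one candidate column; the binary and integer features could go to a tree-based model as they are, while `protocol` and `last_label` would need encoding first.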

In [4]:
!wget https://research.aalto.fi/files/16859732/urlset.csv.zip 

--2022-02-04 17:43:31--  https://research.aalto.fi/files/16859732/urlset.csv.zip
Resolving research.aalto.fi (research.aalto.fi)... 34.248.98.230, 34.253.178.11
Connecting to research.aalto.fi (research.aalto.fi)|34.248.98.230|:443... connected.
HTTP request sent, awaiting response... 302 302
Location: https://acris.aalto.fi/ws/portalfiles/portal/16859732/urlset.csv.zip [following]
--2022-02-04 17:43:32--  https://acris.aalto.fi/ws/portalfiles/portal/16859732/urlset.csv.zip
Resolving acris.aalto.fi (acris.aalto.fi)... 130.233.208.8
Connecting to acris.aalto.fi (acris.aalto.fi)|130.233.208.8|:443... connected.
HTTP request sent, awaiting response... 200 200
Length: unspecified [multipart/x-zip]
Saving to: ‘urlset.csv.zip’

urlset.csv.zip          [<=>                 ]   3.24M   466KB/s    in 8.9s    

2022-02-04 17:43:41 (372 KB/s) - ‘urlset.csv.zip’ saved [3400239]



In [54]:
import pandas as pd
from sklearn import tree
import eli5

In [6]:
!unzip urlset.csv.zip

Archive:  urlset.csv.zip
  inflating: urlset.csv              


In [9]:
# Ignore undecodable bytes and skip malformed rows instead of failing on them
df = pd.read_csv('urlset.csv', encoding_errors='ignore', on_bad_lines='skip')


In [10]:
df.head()

Unnamed: 0,domain,ranking,mld_res,mld.ps_res,card_rem,ratio_Rrem,ratio_Arem,jaccard_RR,jaccard_RA,jaccard_AR,jaccard_AA,jaccard_ARrd,jaccard_ARrem,label
0,nobell.it/70ffb52d079109dca5664cce6f317373782/...,10000000,1.0,0.0,18.0,107.611111,107.277778,0.0,0.0,0.0,0.0,0.8,0.795729,1.0
1,www.dghjdgf.com/paypal.co.uk/cycgi-bin/webscrc...,10000000,0.0,0.0,11.0,150.636364,152.272727,0.0,0.0,0.0,0.0,0.0,0.768577,1.0
2,serviciosbys.com/paypal.cgi.bin.get-into.herf....,10000000,0.0,0.0,14.0,73.5,72.642857,0.0,0.0,0.0,0.0,0.0,0.726582,1.0
3,mail.printakid.com/www.online.americanexpress....,10000000,0.0,0.0,6.0,562.0,590.666667,0.0,0.0,0.0,0.0,0.0,0.85964,1.0
4,thewhiskeydregs.com/wp-content/themes/widescre...,10000000,0.0,0.0,8.0,29.0,24.125,0.0,0.0,0.0,0.0,0.0,0.748971,1.0


In [11]:
pd.__version__

'1.4.0'

In [30]:
df_2 = df[['domain', 'label']]

## Feature Engineering


## Clean the data

In [32]:
# Which of our domains is a float? A missing value loads as NaN, and NaN is a float.

df_2[df_2['domain'].apply(lambda x: isinstance(x, float))]

Unnamed: 0,domain,label
18251,,


In [33]:
df_2 = df_2.dropna()
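
Why a float can show up in a column of strings: a missing value is stored as NaN, and NaN is a Python `float`. The plain-Python sketch below mimics the check and the drop on a made-up three-item list. The values are invented, and this is only an analogy to the `isinstance` filter and `dropna()`, not pandas itself.

```python
import math

# Invented sample: two URL strings and one missing value
rows = ["example.com/login", float("nan"), "en.wikipedia.org/wiki/Dead_Space"]

# Same test as the notebook cell: which entries are floats?
float_positions = [i for i, value in enumerate(rows) if isinstance(value, float)]

# len() is not defined for floats, so a length feature would fail on that row
try:
    len(rows[1])
    len_failed = False
except TypeError:
    len_failed = True

# Rough equivalent of dropna(): keep everything that is not NaN
cleaned = [v for v in rows if not (isinstance(v, float) and math.isnan(v))]

print(float_positions, len_failed, cleaned)
```

Dropping the row before computing any string feature avoids the `TypeError` that `len` would otherwise raise on the missing value.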

## Length of domain

In [35]:
# Character count of each entry; note the 'domain' column holds the URL with its path, not just the host
df_2['domain_len'] = df_2['domain'].str.len()

In [36]:
df_2

Unnamed: 0,domain,label,domain_len
0,nobell.it/70ffb52d079109dca5664cce6f317373782/...,1.0,225
1,www.dghjdgf.com/paypal.co.uk/cycgi-bin/webscrc...,1.0,81
2,serviciosbys.com/paypal.cgi.bin.get-into.herf....,1.0,177
3,mail.printakid.com/www.online.americanexpress....,1.0,60
4,thewhiskeydregs.com/wp-content/themes/widescre...,1.0,116
...,...,...,...
95998,xbox360.ign.com/objects/850/850402.html,0.0,39
95999,games.teamxbox.com/xbox-360/1860/Dead-Space/,0.0,44
96000,www.gamespot.com/xbox360/action/deadspace/,0.0,42
96001,en.wikipedia.org/wiki/Dead_Space_(video_game),0.0,45


## Fit a model

In [41]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(df_2[['domain_len']], df_2['label'])
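
With a single numeric feature, a decision tree can only learn length thresholds. The sketch below illustrates that idea with a one-split "stump" in plain Python. It is not scikit-learn's algorithm: `DecisionTreeClassifier` chooses splits by an impurity criterion (Gini by default) and keeps splitting, whereas the hypothetical `fit_stump` here picks one threshold by counting misclassifications. The nine lengths and labels are the rows visible in the table above, so the result says nothing about the full dataset.

```python
def fit_stump(lengths, labels):
    """Hypothetical helper: find the threshold t minimizing the errors of
    the rule 'predict 1 (phish) when length > t'. Returns (t, errors)."""
    values = sorted(set(lengths))
    # Candidate thresholds: midpoints between consecutive distinct lengths
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in candidates:
        errors = sum((x > t) != bool(y) for x, y in zip(lengths, labels))
        if best is None or errors < best[1]:
            best = (t, errors)
    return best


# Lengths and labels of the nine rows shown in the displayed table
lengths = [225, 81, 177, 60, 116, 39, 44, 42, 45]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0]

threshold, errors = fit_stump(lengths, labels)
print(threshold, errors)
```

On these nine rows one threshold separates the two classes perfectly, which is a property of this tiny sample rather than evidence that length alone is a reliable signal.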

In [53]:
df_2[['domain']].iloc[0].values

array(['nobell.it/70ffb52d079109dca5664cce6f317373782/login.SkyPe.com/en/cgi-bin/verification/login/70ffb52d079109dca5664cce6f317373/index.php?cmd=_profile-ach&outdated_page_tmpl=p/gen/failed-to-load&nav=0.5.1&login_access=1322408526'],
      dtype=object)

In [63]:
df_2[['domain_len']].iloc[0].values.tolist()

[225]

## Make a prediction

In [64]:
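# iloc[[0]] keeps the first row as a one-row DataFrame, so the feature name matches what the model was fit on
clf.predict(df_2[['domain_len']].iloc[[0]])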



array([1.])

In [61]:
clf.predict([[225]])



array([1.])

In [60]:
eli5.explain_prediction(clf, df_2[['domain_len']].iloc[0])

Contribution?,Feature
0.501,domain_len
0.499,<BIAS>


In [1]:
# Random notes on references: the function receives the caller's own list object, so append() inside it changes `a`

a = []
def add_string(thing):
    thing.append('1')
    
add_string(a)
print(a)

['1']
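
To round out the note on referencing, the sketch below contrasts mutating the list that was passed in with rebinding the local name to a new list. The function names are made up for illustration.

```python
def append_in_place(items):
    # Mutates the caller's list: both names point at the same object
    items.append('1')


def rebind_local(items):
    # Builds a new list and binds it to the local name only;
    # the caller's list is left untouched
    items = items + ['1']


a = []
append_in_place(a)

b = []
rebind_local(b)

print(a, b)
```

Only the in-place version changes what the caller sees, which is why `add_string(a)` in the cell above leaves `a` holding `'1'`.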
