# SC1015 DSAI Mini Project
## Part 1: Data Cleaning and Preparation
---
We began by cleaning and preparing the datasets so that they could yield meaningful insights and help us answer the question we posed.

**Question:** Can we detect phishing websites from benign ones using their respective URLs?

#### **Datasets:**
1. [Phishtank (retrieved on 17-4-23)](https://phishtank.org/developer_info.php)
2. [Hannousse, Abdelhakim; Yahiouche, Salima (2021), “Web page phishing detection”, Mendeley Data, V3](https://data.mendeley.com/datasets/c2gw7fy2j4/3)
3. [The Majestic Million (retrieved on 10-4-23)](https://majestic.com/reports/majestic-million)
4. [Phishing site URLs Dataset on Kaggle](https://www.kaggle.com/datasets/taruntiwarihp/phishing-site-urls)
5. [JPCERT Coordination Center (downloaded directly in code because some antivirus programs flag the files as a threat)](https://github.com/JPCERTCC/phishurl-list)
6. [Malicious URLs Dataset on Kaggle](https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset)


## Table of Contents:
1. [Import Dataset and Data Cleaning](#1\.-Import-Dataset-and-Data-Cleaning)
2. [Feature Engineering](#2\.-Feature-Engineering)
3. [Save Cleaned Dataset to .csv File](#3\.-Import-cleaned-dataset-to-.csv-file)



In [1]:
import pandas as pd
#regex
import re
#abnormal urls
from urllib.parse import urlparse
#log
from math import log
import random

### 1. Import Dataset and Data Cleaning
For each dataset, we remove the columns that are not relevant to our project. Datasets that contain only one category of data (entirely benign URLs or entirely phishing URLs) are labelled accordingly.
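The per-dataset steps that follow all reduce each source to the same two columns, `url` and `phish`. A minimal sketch of that pattern, using a hypothetical helper `to_url_label` and made-up rows (neither is part of the notebook):

```python
import pandas as pd

def to_url_label(frame, url_col, label):
    """Keep only the URL column and attach a constant label (hypothetical helper)."""
    out = frame[[url_col]].rename(columns={url_col: "url"}).copy()
    out["phish"] = label
    return out

# Illustrative rows only; the real data comes from the files listed above.
raw = pd.DataFrame({
    "phish_id": [1, 2],
    "url": ["https://example.test/a", "https://example.test/b"],
    "target": ["Other", "Other"],
})
labelled = to_url_label(raw, "url", "phishing")
print(labelled)
```
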

In [2]:
#phishtank.com
df1=pd.read_csv('data/verified_online.csv')

print(df1.shape)
df1.head()

(54689, 8)


Unnamed: 0,phish_id,url,phish_detail_url,submission_time,verified,verification_time,online,target
0,8118917,https://obmen.click/,http://www.phishtank.com/phish_detail.php?phis...,2023-04-17T07:51:50+00:00,yes,2023-04-17T07:55:21+00:00,yes,Other
1,8118915,https://neueinrichtung-sparkasse.de/,http://www.phishtank.com/phish_detail.php?phis...,2023-04-17T07:30:08+00:00,yes,2023-04-17T07:35:36+00:00,yes,Other
2,8118911,https://3db32516c1b476e7eff40e2d8ff8d9d7.krokl...,http://www.phishtank.com/phish_detail.php?phis...,2023-04-17T07:19:46+00:00,yes,2023-04-17T07:26:31+00:00,yes,Allegro
3,8118910,https://f0d89a41642e0026589577e0e13d9d58.krokl...,http://www.phishtank.com/phish_detail.php?phis...,2023-04-17T07:19:18+00:00,yes,2023-04-17T07:26:31+00:00,yes,Allegro
4,8118909,https://ratnes.com/,http://www.phishtank.com/phish_detail.php?phis...,2023-04-17T07:17:18+00:00,yes,2023-04-17T07:26:31+00:00,yes,Other


In [3]:
df1.drop(['phish_id','phish_detail_url','submission_time','verified','verification_time','online','target'], axis=1, inplace=True)

In [4]:
df1

Unnamed: 0,url
0,https://obmen.click/
1,https://neueinrichtung-sparkasse.de/
2,https://3db32516c1b476e7eff40e2d8ff8d9d7.krokl...
3,https://f0d89a41642e0026589577e0e13d9d58.krokl...
4,https://ratnes.com/
...,...
54684,http://www.formbuddy.com/cgi-bin/formdisp.pl?u...
54685,https://sites.google.com/site/libretyreserve/
54686,http://www.habbocreditosparati.blogspot.com/
54687,http://creditiperhabbogratissicuro100.blogspot...


In [5]:
df1['phish'] = 'phishing'

In [6]:
df1

Unnamed: 0,url,phish
0,https://obmen.click/,phishing
1,https://neueinrichtung-sparkasse.de/,phishing
2,https://3db32516c1b476e7eff40e2d8ff8d9d7.krokl...,phishing
3,https://f0d89a41642e0026589577e0e13d9d58.krokl...,phishing
4,https://ratnes.com/,phishing
...,...,...
54684,http://www.formbuddy.com/cgi-bin/formdisp.pl?u...,phishing
54685,https://sites.google.com/site/libretyreserve/,phishing
54686,http://www.habbocreditosparati.blogspot.com/,phishing
54687,http://creditiperhabbogratissicuro100.blogspot...,phishing


In [7]:
#Hannousse, Abdelhakim; Yahiouche, Salima (2021), “Web page phishing detection”, Mendeley Data, V3
df2=pd.read_csv('data/dataset_B_05_2020.csv')

print(df2.shape)
df2.head()

(11430, 89)


Unnamed: 0,url,length_url,length_hostname,ip,nb_dots,nb_hyphens,nb_at,nb_qm,nb_and,nb_or,...,domain_in_title,domain_with_copyright,whois_registered_domain,domain_registration_length,domain_age,web_traffic,dns_record,google_index,page_rank,status
0,http://www.crestonwood.com/router.php,37,19,0,3,0,0,0,0,0,...,0,1,0,45,-1,0,1,1,4,legitimate
1,http://shadetreetechnology.com/V4/validation/a...,77,23,1,1,0,0,0,0,0,...,1,0,0,77,5767,0,0,1,2,phishing
2,https://support-appleld.com.secureupdate.duila...,126,50,1,4,1,0,1,2,0,...,1,0,0,14,4004,5828815,0,1,0,phishing
3,http://rgipt.ac.in,18,11,0,2,0,0,0,0,0,...,1,0,0,62,-1,107721,0,0,3,legitimate
4,http://www.iracing.com/tracks/gateway-motorspo...,55,15,0,2,2,0,0,0,0,...,0,1,0,224,8175,8725,0,0,6,legitimate


In [8]:
df2 = df2.iloc[:, [0,88]]
df2 = df2.rename(columns={'status':'phish'})
df2["phish"]=df2["phish"].astype("category")
df2["phish"] = df2["phish"].cat.rename_categories(["benign", "phishing"])
print(df2["phish"])

0          benign
1        phishing
2        phishing
3          benign
4          benign
           ...   
11425      benign
11426    phishing
11427      benign
11428      benign
11429    phishing
Name: phish, Length: 11430, dtype: category
Categories (2, object): ['benign', 'phishing']
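Passing a list to `rename_categories` relies on the existing categories being in sorted order (`legitimate` before `phishing`). A sketch of the same relabelling with an explicit mapping, which does not depend on that order; the sample values are illustrative:

```python
import pandas as pd

status = pd.Series(["legitimate", "phishing", "phishing", "legitimate"], dtype="category")
# A dict maps old category names to new ones; categories not listed are left unchanged.
phish = status.cat.rename_categories({"legitimate": "benign"})
print(list(phish))
```
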


In [9]:
n = 1000000 #number of records in file (excluding the header row)
s = 10000 #desired sample size
filename = "data/majestic_million.csv"
#randomly choose the data rows to skip; line 0 is the header, so it is never skipped
skip = sorted(random.sample(range(1, n + 1), n - s))
df3 = pd.read_csv(filename, skiprows=skip)
df3

Unnamed: 0,GlobalRank,TldRank,Domain,TLD,RefSubNets,RefIPs,IDN_Domain,IDN_TLD,PrevGlobalRank,PrevTldRank,PrevRefSubNets,PrevRefIPs
0,8,8,microsoft.com,com,278796,821179,microsoft.com,com,8,8,278635,821172
1,127,86,time.com,com,69547,162772,time.com,com,127,86,69592,162895
2,406,265,bilibili.com,com,39621,80958,bilibili.com,com,406,265,39718,81432
3,586,358,techradar.com,com,32025,63048,techradar.com,com,586,358,32046,63138
4,642,76,kde.org,org,30124,48082,kde.org,org,639,77,30200,48208
...,...,...,...,...,...,...,...,...,...,...,...,...
9996,999715,27898,yeezyboost350.uk,uk,259,294,yeezyboost350.uk,uk,-1,-1,-1,-1
9997,999743,499801,eebria.com,com,259,293,eebria.com,com,981030,493511,262,293
9998,999805,87038,cityofhopecancer.org,org,259,293,cityofhopecancer.org,org,989534,86828,260,298
9999,999873,499870,lacar.com,com,259,292,lacar.com,com,985522,495652,261,293
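The sampling idea used for the Majestic Million file (skip a random set of lines while reading) can be tried on a small in-memory file. A sketch, assuming a file with one header line followed by `n` data rows; the CSV text and the fixed seed are made up for illustration:

```python
import io
import random
import pandas as pd

# Small in-memory stand-in for majestic_million.csv: one header row plus n data rows.
n, s = 100, 10
csv_text = "GlobalRank,Domain\n" + "".join(f"{i},site{i}.test\n" for i in range(1, n + 1))

random.seed(0)  # fixed seed only so the sketch is repeatable
# Line 0 is the header, so the candidate lines to skip are 1..n.
skip = sorted(random.sample(range(1, n + 1), n - s))
sample = pd.read_csv(io.StringIO(csv_text), skiprows=skip)
print(sample.shape)
```

Because only data lines are eligible for skipping, the header is always read and exactly `s` rows remain.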


In [10]:
df3 = df3.iloc[:, [2]].copy() #copy so that later column assignments do not act on a slice

In [11]:
df3['phish'] = 'benign'

In [12]:
df3

Unnamed: 0,Domain,phish
0,microsoft.com,benign
1,time.com,benign
2,bilibili.com,benign
3,techradar.com,benign
4,kde.org,benign
...,...,...
9996,yeezyboost350.uk,benign
9997,eebria.com,benign
9998,cityofhopecancer.org,benign
9999,lacar.com,benign


In [13]:
df3['Domain'] = 'https://' + df3['Domain'].astype(str)


In [14]:
df3

Unnamed: 0,Domain,phish
0,https://microsoft.com,benign
1,https://time.com,benign
2,https://bilibili.com,benign
3,https://techradar.com,benign
4,https://kde.org,benign
...,...,...
9996,https://yeezyboost350.uk,benign
9997,https://eebria.com,benign
9998,https://cityofhopecancer.org,benign
9999,https://lacar.com,benign


In [15]:
df3 = df3.rename(columns={'Domain':'url'})

In [16]:
df3

Unnamed: 0,url,phish
0,https://microsoft.com,benign
1,https://time.com,benign
2,https://bilibili.com,benign
3,https://techradar.com,benign
4,https://kde.org,benign
...,...,...
9996,https://yeezyboost350.uk,benign
9997,https://eebria.com,benign
9998,https://cityofhopecancer.org,benign
9999,https://lacar.com,benign


In [17]:
#https://www.kaggle.com/datasets/taruntiwarihp/phishing-site-urls
df4=pd.read_csv('data/phishing_site_urls.csv')

print(df4.shape)
df4.head()

(549346, 2)


Unnamed: 0,URL,Label
0,nobell.it/70ffb52d079109dca5664cce6f317373782/...,bad
1,www.dghjdgf.com/paypal.co.uk/cycgi-bin/webscrc...,bad
2,serviciosbys.com/paypal.cgi.bin.get-into.herf....,bad
3,mail.printakid.com/www.online.americanexpress....,bad
4,thewhiskeydregs.com/wp-content/themes/widescre...,bad


In [18]:
df4 = df4.rename(columns={'URL':'url'})
df4 = df4.rename(columns={'Label':'phish'})
df4["phish"]=df4["phish"].astype("category")

print(df4["phish"])

0         bad
1         bad
2         bad
3         bad
4         bad
         ... 
549341    bad
549342    bad
549343    bad
549344    bad
549345    bad
Name: phish, Length: 549346, dtype: category
Categories (2, object): ['bad', 'good']


In [19]:
df4["phish"] = df4["phish"].cat.rename_categories(["phishing", "benign"])
print(df4["phish"])

0         phishing
1         phishing
2         phishing
3         phishing
4         phishing
            ...   
549341    phishing
549342    phishing
549343    phishing
549344    phishing
549345    phishing
Name: phish, Length: 549346, dtype: category
Categories (2, object): ['phishing', 'benign']


In [20]:
#JPCERT CC Phishing URLs by month
#Directly importing the data as antivirus detects the data as malicious
years = [2019, 2020, 2021, 2022]
months = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
df5 = pd.DataFrame()
for year in years:
    for month in months:
        try:
            # Read in CSV file from URL
            url = f'https://github.com/JPCERTCC/phishurl-list/raw/main/{year}/{year}{month}.csv'
            df = pd.read_csv(url)
            
            # Append to the combined df5 dataframe
            df5 = pd.concat([df5,df], ignore_index=True)
        except Exception as e:
            print(f'Error reading in {year}{month}.csv: {e}')

df5

Unnamed: 0,date,URL,description
0,2019/01/04 10:12:00,http://tookout00tove.xyz/TNC/tnc/newtncupdatin...,TOKAIネットワーククラブ
1,2019/01/04 10:24:00,http://www.account.nthl.mixh.jp/gb-en/signin/s...,Netflix
2,2019/01/04 10:24:00,http://service-client-netflix.mixh.jp/,Netflix
3,2019/01/04 10:24:00,http://www.account.nthl.mixh.jp/gb-/,Netflix
4,2019/01/04 10:24:00,http://komazawa.org/aktualisieren-sie-ihre-zah...,Netflix
...,...,...,...
108396,2022/12/28 17:53:00,https://202.61.130.37/,DBS PayLah
108397,2022/12/28 17:19:00,https://eki.net.amoingian.shop/a3IwMDY1P3,えきねっと
108398,2022/12/28 17:19:00,https://eki.net.anyfoold.shop/a3IwMDY1P3,えきねっと
108399,2022/12/28 17:19:00,https://eki.net.cineicity.shop/a3IwMDY1P3,えきねっと
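Calling `pd.concat` inside the loop copies the accumulated rows on every iteration. A sketch of an alternative that collects the monthly frames in a list and concatenates once; `load_month` is a hypothetical stand-in for the `pd.read_csv(url)` call and simulates one missing file:

```python
import pandas as pd

def load_month(year, month):
    """Hypothetical stand-in for pd.read_csv(url); returns a tiny frame or raises."""
    if month == "02":
        raise FileNotFoundError(f"{year}{month}.csv")
    return pd.DataFrame({"URL": [f"http://example.test/{year}{month}"]})

frames = []
for year in [2019]:
    for month in ["01", "02", "03"]:
        try:
            frames.append(load_month(year, month))
        except FileNotFoundError as err:
            print(f"Error reading in {err}")

# One concat at the end avoids re-copying the accumulated rows on every iteration.
combined = pd.concat(frames, ignore_index=True)
print(len(combined))
```
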


In [21]:
df5 = df5.drop(['date','description'], axis=1)
df5 = df5.rename(columns={'URL':'url'})
df5['phish'] = 'phishing'
df5["phish"]=df5["phish"].astype("category")
df5

Unnamed: 0,url,phish
0,http://tookout00tove.xyz/TNC/tnc/newtncupdatin...,phishing
1,http://www.account.nthl.mixh.jp/gb-en/signin/s...,phishing
2,http://service-client-netflix.mixh.jp/,phishing
3,http://www.account.nthl.mixh.jp/gb-/,phishing
4,http://komazawa.org/aktualisieren-sie-ihre-zah...,phishing
...,...,...
108396,https://202.61.130.37/,phishing
108397,https://eki.net.amoingian.shop/a3IwMDY1P3,phishing
108398,https://eki.net.anyfoold.shop/a3IwMDY1P3,phishing
108399,https://eki.net.cineicity.shop/a3IwMDY1P3,phishing


In [22]:
#https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset
df6=pd.read_csv('data/malicious_phish.csv')
df6=df6.loc[df6['type'] == 'phishing']
df6 = df6.rename(columns={'type':'phish'})
df6

Unnamed: 0,url,phish
0,br-icloud.com.br,phishing
21,signin.eby.de.zukruygxctzmmqi.civpro.co.za,phishing
28,http://www.marketingbyinternet.com/mo/e56508df...,phishing
40,https://docs.google.com/spreadsheet/viewform?f...,phishing
72,retajconsultancy.com,phishing
...,...,...
651186,xbox360.ign.com/objects/850/850402.html,phishing
651187,games.teamxbox.com/xbox-360/1860/Dead-Space/,phishing
651188,www.gamespot.com/xbox360/action/deadspace/,phishing
651189,en.wikipedia.org/wiki/Dead_Space_(video_game),phishing


In [23]:
#Combining all datasets together
df=pd.concat([df1, df2, df3, df4, df5, df6])

In [24]:
df

Unnamed: 0,url,phish
0,https://obmen.click/,phishing
1,https://neueinrichtung-sparkasse.de/,phishing
2,https://3db32516c1b476e7eff40e2d8ff8d9d7.krokl...,phishing
3,https://f0d89a41642e0026589577e0e13d9d58.krokl...,phishing
4,https://ratnes.com/,phishing
...,...,...
651186,xbox360.ign.com/objects/850/850402.html,phishing
651187,games.teamxbox.com/xbox-360/1860/Dead-Space/,phishing
651188,www.gamespot.com/xbox360/action/deadspace/,phishing
651189,en.wikipedia.org/wiki/Dead_Space_(video_game),phishing


In [25]:
df=df.reset_index(drop=True)

In [26]:
#Remove duplicates
df=df.drop_duplicates(subset=['url'])
df

Unnamed: 0,url,phish
0,https://obmen.click/,phishing
1,https://neueinrichtung-sparkasse.de/,phishing
2,https://3db32516c1b476e7eff40e2d8ff8d9d7.krokl...,phishing
3,https://f0d89a41642e0026589577e0e13d9d58.krokl...,phishing
4,https://ratnes.com/,phishing
...,...,...
779860,http://www.adnet8.com/image/caseimage/?http://...,phishing
779864,http://gkjx168.com/images/,phishing
779865,http://www.gkjx168.com/images/,phishing
779868,http://blazeygraphicsystems.com/classifieds/no...,phishing
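`drop_duplicates(subset=['url'])` keeps the first occurrence of each URL, so when the same URL appears with different labels, the label from the earlier dataset is the one retained. A sketch for listing such conflicts before dropping, on illustrative rows:

```python
import pandas as pd

demo = pd.DataFrame({
    "url": ["a.test", "b.test", "a.test", "c.test", "b.test"],
    "phish": ["phishing", "benign", "benign", "benign", "benign"],
})
# Number of distinct labels seen per URL; more than one means the sources disagree.
labels_per_url = demo.groupby("url")["phish"].nunique()
conflicts = labels_per_url[labels_per_url > 1].index.tolist()

deduped = demo.drop_duplicates(subset=["url"]).reset_index(drop=True)
print(conflicts, len(deduped))
```
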


In [27]:
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,url,phish
0,https://obmen.click/,phishing
1,https://neueinrichtung-sparkasse.de/,phishing
2,https://3db32516c1b476e7eff40e2d8ff8d9d7.krokl...,phishing
3,https://f0d89a41642e0026589577e0e13d9d58.krokl...,phishing
4,https://ratnes.com/,phishing
...,...,...
730184,http://www.adnet8.com/image/caseimage/?http://...,phishing
730185,http://gkjx168.com/images/,phishing
730186,http://www.gkjx168.com/images/,phishing
730187,http://blazeygraphicsystems.com/classifieds/no...,phishing


In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 730189 entries, 0 to 730188
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   url     730189 non-null  object
 1   phish   730189 non-null  object
dtypes: object(2)
memory usage: 11.1+ MB


### 2. Feature Engineering

First, we must extract meaningful features from the URLs. Looking at the heads and tails of the datasets gives us some ideas for which features to consider.

Since our project is about detecting phishing websites from their URLs, we have chosen the following variables:

**Categorical Variables:**
1. `use_of_ip` Does the URL contain an IP address? (e.g. 201.62.129.35 (IPv4))
2. `short_url` Does the URL contain evidence of a URL shortening service? (e.g. bit.ly)
3. `uses-https` Does the URL use https?
4. `uses-http` Does the URL use http? (URLs that use https are also counted as using http)

**Numerical Variables:**
1. `count.` How many periods does the URL contain?
2. `count@` How many 'at's does the URL contain?
3. `count_dir` How many directories does the URL contain (denoted by '/')?
4. `count_embed_domain` How many embedded domains does the URL contain?
5. `count%` How many percent signs does the URL contain?
6. `count?` How many question marks does the URL contain?
7. `count-` How many hyphens does the URL contain?
8. `count=` How many equal signs does the URL contain?
9. `url_length` How long is the URL?
10. `hostname_length` How long is the hostname in the URL? (e.g. example.com in example.com/not-phish/phish)
11. `count-digits` How many digits are there in the URL?
12. `count-letters` How many letters are there in the URL?
13. `url_entropy` What is the entropy of the URL? (Based on the Shannon entropy formula) <i>Definition: average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes.</i>

In [29]:
#Use of IP or not in domain
def having_ip_address(url):
    match = re.search(
        '(([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\.'
        '([01]?\\d\\d?|2[0-4]\\d|25[0-5])\\/)|'  # IPv4
        '((0x[0-9a-fA-F]{1,2})\\.(0x[0-9a-fA-F]{1,2})\\.(0x[0-9a-fA-F]{1,2})\\.(0x[0-9a-fA-F]{1,2})\\/)|'  # IPv4 in hexadecimal
        '(?:[a-fA-F0-9]{1,4}:){7}[a-fA-F0-9]{1,4}', url)  # IPv6 (the '|' above keeps this a separate alternative)
    if match:
        return 'uses IP'
    else:
        return 'not IP'
df['use_of_ip'] = df['url'].apply(lambda i: having_ip_address(i))
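As a cross-check on the regular expression, the host can also be tested with the standard-library `ipaddress` module. This is a sketch of a different technique, not the notebook's method: the helper name `host_is_ip` is hypothetical, it assumes the URL carries a scheme so that `urlparse` can find the host, and it does not cover the hexadecimal IPv4 form that the regex targets.

```python
import ipaddress
from urllib.parse import urlparse

def host_is_ip(url):
    """Return True when the URL's host parses as an IPv4 or IPv6 address."""
    host = urlparse(url).hostname  # None when the URL has no scheme/netloc
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

for u in ["https://202.61.130.37/", "https://ratnes.com/", "http://[2001:db8::1]/login"]:
    print(u, host_is_ip(u))
```
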

In [30]:
def shortening_service(url):
    match = re.search(r'bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|'
                      r'yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|'
                      r'short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|'
                      r'doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|'
                      r'db\.tt|qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|'
                      r'q\.gs|is\.gd|po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|'
                      r'x\.co|prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|'
                      r'tr\.im|link\.zip\.net',
                      url)
    if match:
        return 'shortened'
    else:
        return 'original'
    
    
df['short_url'] = df['url'].apply(lambda i: shortening_service(i))
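Because `re.search` looks for the pattern anywhere in the string, short patterns such as `t\.co` can also match inside ordinary domains. A sketch that compares the parsed hostname against a set of shortener domains instead; the set `SHORTENERS` is an illustrative subset, `is_shortened` is a hypothetical helper, and the URL is assumed to include a scheme:

```python
import re
from urllib.parse import urlparse

SHORTENERS = {"bit.ly", "goo.gl", "t.co", "tinyurl.com", "ow.ly", "is.gd"}  # illustrative subset

def is_shortened(url):
    """Compare the hostname (without a leading 'www.') against known shortener domains."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host in SHORTENERS

url = "https://microsoft.com"
# The substring search matches the 't.co' inside 'microsoft.com'; the hostname check does not.
print(bool(re.search(r"t\.co", url)), is_shortened(url))
print(is_shortened("https://bit.ly/abc"))
```
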

In [31]:
def count_https(url):
    if url.count('https') > 0:
        return 'https'
    else: 
        return 'no https'
    
df['uses-https'] = df['url'].apply(lambda i : count_https(i))

def count_http(url):
    if url.count('http') > 0:
        return 'http'
    else: 
        return 'no http'

df['uses-http'] = df['url'].apply(lambda i : count_http(i))
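`url.count('https')` looks at the whole string, so an http URL that merely mentions `https` in its query string is also reported as https. A sketch that reads the scheme from the parsed URL instead; `scheme_of` is a hypothetical helper and the URL is made up:

```python
from urllib.parse import urlparse

def scheme_of(url):
    """Return the lower-cased scheme, or '' when the URL has none."""
    return urlparse(url).scheme.lower()

url = "http://example.test/redirect?next=https://other.test/"
print(url.count("https") > 0, scheme_of(url))
```
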

In [32]:
def count_dot(url):
    count_dot = url.count('.')
    return count_dot

df['count.'] = df['url'].apply(lambda i: count_dot(i))
df.head()

Unnamed: 0,url,phish,use_of_ip,short_url,uses-https,uses-http,count.
0,https://obmen.click/,phishing,not IP,original,https,http,1
1,https://neueinrichtung-sparkasse.de/,phishing,not IP,original,https,http,1
2,https://3db32516c1b476e7eff40e2d8ff8d9d7.krokl...,phishing,not IP,original,https,http,3
3,https://f0d89a41642e0026589577e0e13d9d58.krokl...,phishing,not IP,original,https,http,3
4,https://ratnes.com/,phishing,not IP,original,https,http,1


In [33]:
def count_atrate(url):
     
    return url.count('@')

df['count@'] = df['url'].apply(lambda i: count_atrate(i))


def no_of_dir(url):
    urldir = urlparse(url).path
    return urldir.count('/')

df['count_dir'] = df['url'].apply(lambda i: no_of_dir(i))

def no_of_embed(url):
    urldir = urlparse(url).path
    return urldir.count('//')

df['count_embed_domain'] = df['url'].apply(lambda i: no_of_embed(i))

In [34]:
def count_per(url):
    return url.count('%')

df['count%'] = df['url'].apply(lambda i : count_per(i))

def count_ques(url):
    return url.count('?')

df['count?'] = df['url'].apply(lambda i: count_ques(i))

def count_hyphen(url):
    return url.count('-')

df['count-'] = df['url'].apply(lambda i: count_hyphen(i))

def count_equal(url):
    return url.count('=')

df['count='] = df['url'].apply(lambda i: count_equal(i))

def url_length(url):
    return len(str(url))


#Length of URL
df['url_length'] = df['url'].apply(lambda i: url_length(i))

#Hostname Length
def hostname_length(url):
    return len(urlparse(url).netloc)

df['hostname_length'] = df['url'].apply(lambda i: hostname_length(i))

df.head()

Unnamed: 0,url,phish,use_of_ip,short_url,uses-https,uses-http,count.,count@,count_dir,count_embed_domain,count%,count?,count-,count=,url_length,hostname_length
0,https://obmen.click/,phishing,not IP,original,https,http,1,0,1,0,0,0,0,0,20,11
1,https://neueinrichtung-sparkasse.de/,phishing,not IP,original,https,http,1,0,1,0,0,0,1,0,36,27
2,https://3db32516c1b476e7eff40e2d8ff8d9d7.krokl...,phishing,not IP,original,https,http,3,0,5,0,0,0,0,0,100,45
3,https://f0d89a41642e0026589577e0e13d9d58.krokl...,phishing,not IP,original,https,http,3,0,5,0,0,0,0,0,100,45
4,https://ratnes.com/,phishing,not IP,original,https,http,1,0,1,0,0,0,0,0,19,10
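Several of the source datasets store URLs without a scheme (e.g. `br-icloud.com.br`). For such input, `urlparse` puts everything into `path` and leaves `netloc` empty, so `hostname_length` would be 0. A sketch of one way to handle this, assuming scheme-less strings start with the host; `netloc_of` is a hypothetical helper:

```python
import re
from urllib.parse import urlparse

_HAS_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")

def netloc_of(url):
    """Return the network location, treating scheme-less input as host/path."""
    if not _HAS_SCHEME.match(url):
        url = "//" + url  # a leading '//' makes urlparse read the next part as the netloc
    return urlparse(url).netloc

print(len(urlparse("br-icloud.com.br").netloc), len(netloc_of("br-icloud.com.br")))
```
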


In [35]:
def digit_count(url):
    digits = 0
    for i in url:
        if i.isnumeric():
            digits = digits + 1
    return digits


df['count-digits']= df['url'].apply(lambda i: digit_count(i))


def letter_count(url):
    letters = 0
    for i in url:
        if i.isalpha():
            letters = letters + 1
    return letters


df['count-letters']= df['url'].apply(lambda i: letter_count(i))

df.head()

Unnamed: 0,url,phish,use_of_ip,short_url,uses-https,uses-http,count.,count@,count_dir,count_embed_domain,count%,count?,count-,count=,url_length,hostname_length,count-digits,count-letters
0,https://obmen.click/,phishing,not IP,original,https,http,1,0,1,0,0,0,0,0,20,11,0,15
1,https://neueinrichtung-sparkasse.de/,phishing,not IP,original,https,http,1,0,1,0,0,0,1,0,36,27,0,30
2,https://3db32516c1b476e7eff40e2d8ff8d9d7.krokl...,phishing,not IP,original,https,http,3,0,5,0,0,0,0,0,100,45,36,53
3,https://f0d89a41642e0026589577e0e13d9d58.krokl...,phishing,not IP,original,https,http,3,0,5,0,0,0,0,0,100,45,42,47
4,https://ratnes.com/,phishing,not IP,original,https,http,1,0,1,0,0,0,0,0,19,10,0,14
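The character-count features can also be computed with pandas' vectorised string methods rather than `apply`. A sketch on two URLs taken from the tables above; note that the regex classes used here only cover ASCII digits and letters, whereas `isnumeric()` and `isalpha()` also accept other Unicode characters:

```python
import pandas as pd

urls = pd.Series(["https://obmen.click/", "http://gkjx168.com/images/"])
features = pd.DataFrame({
    "count.": urls.str.count(r"\."),               # pattern is a regex, so the dot is escaped
    "count-": urls.str.count("-"),
    "url_length": urls.str.len(),
    "count-digits": urls.str.count(r"[0-9]"),      # ASCII digits only
    "count-letters": urls.str.count(r"[A-Za-z]"),  # ASCII letters only
})
print(features)
```
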


In [36]:
df

Unnamed: 0,url,phish,use_of_ip,short_url,uses-https,uses-http,count.,count@,count_dir,count_embed_domain,count%,count?,count-,count=,url_length,hostname_length,count-digits,count-letters
0,https://obmen.click/,phishing,not IP,original,https,http,1,0,1,0,0,0,0,0,20,11,0,15
1,https://neueinrichtung-sparkasse.de/,phishing,not IP,original,https,http,1,0,1,0,0,0,1,0,36,27,0,30
2,https://3db32516c1b476e7eff40e2d8ff8d9d7.krokl...,phishing,not IP,original,https,http,3,0,5,0,0,0,0,0,100,45,36,53
3,https://f0d89a41642e0026589577e0e13d9d58.krokl...,phishing,not IP,original,https,http,3,0,5,0,0,0,0,0,100,45,42,47
4,https://ratnes.com/,phishing,not IP,original,https,http,1,0,1,0,0,0,0,0,19,10,0,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
730184,http://www.adnet8.com/image/caseimage/?http://...,phishing,not IP,original,no https,http,6,0,3,0,0,2,1,2,128,14,3,94
730185,http://gkjx168.com/images/,phishing,not IP,original,no https,http,1,0,2,0,0,0,0,0,26,11,3,17
730186,http://www.gkjx168.com/images/,phishing,not IP,original,no https,http,2,0,2,0,0,0,0,0,30,15,3,20
730187,http://blazeygraphicsystems.com/classifieds/no...,phishing,not IP,original,no https,http,2,0,3,0,0,0,0,0,69,24,0,61


In [37]:
def url_entropy(text):
    text = text.lower()
    probs = [text.count(c) / len(text) for c in set(text)]
    entropy = -sum([p * log(p) / log(2.0) for p in probs])
    return entropy

df['url_entropy']= df['url'].apply(lambda i: url_entropy(i))
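The entropy computed above is the Shannon entropy of the character distribution, H = -Σ p(c) · log2 p(c), measured in bits per character. A small worked sketch on toy strings (not URLs from the dataset), using an equivalent formulation with `collections.Counter`; `shannon_entropy` is a hypothetical name:

```python
from collections import Counter
from math import log2

def shannon_entropy(text):
    """Entropy in bits per character: -sum(p * log2(p)) over character frequencies."""
    counts = Counter(text.lower())
    total = len(text)
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(shannon_entropy("aaaa"))   # a single repeated symbol carries no uncertainty
print(shannon_entropy("aabb"))   # two equally likely symbols
print(shannon_entropy("abcd"))   # four equally likely symbols
```

A string made of one repeated character has zero entropy, and the value grows as the characters become more varied and evenly distributed, which is why long random-looking hostnames score higher.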

In [38]:
df

Unnamed: 0,url,phish,use_of_ip,short_url,uses-https,uses-http,count.,count@,count_dir,count_embed_domain,count%,count?,count-,count=,url_length,hostname_length,count-digits,count-letters,url_entropy
0,https://obmen.click/,phishing,not IP,original,https,http,1,0,1,0,0,0,0,0,20,11,0,15,3.884184
1,https://neueinrichtung-sparkasse.de/,phishing,not IP,original,https,http,1,0,1,0,0,0,1,0,36,27,0,30,3.995907
2,https://3db32516c1b476e7eff40e2d8ff8d9d7.krokl...,phishing,not IP,original,https,http,3,0,5,0,0,0,0,0,100,45,36,53,4.711532
3,https://f0d89a41642e0026589577e0e13d9d58.krokl...,phishing,not IP,original,https,http,3,0,5,0,0,0,0,0,100,45,42,47,4.743081
4,https://ratnes.com/,phishing,not IP,original,https,http,1,0,1,0,0,0,0,0,19,10,0,14,3.642150
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
730184,http://www.adnet8.com/image/caseimage/?http://...,phishing,not IP,original,no https,http,6,0,3,0,0,2,1,2,128,14,3,94,4.479253
730185,http://gkjx168.com/images/,phishing,not IP,original,no https,http,1,0,2,0,0,0,0,0,26,11,3,17,4.161978
730186,http://www.gkjx168.com/images/,phishing,not IP,original,no https,http,2,0,2,0,0,0,0,0,30,15,3,20,4.215061
730187,http://blazeygraphicsystems.com/classifieds/no...,phishing,not IP,original,no https,http,2,0,3,0,0,0,0,0,69,24,0,61,4.177618


### 3. Import cleaned dataset to .csv file

In [39]:
df.to_csv("cleaned_data.csv", sep=',', encoding='utf-8')
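By default `to_csv` also writes the DataFrame index, which comes back as an extra `Unnamed: 0` column when the file is read again. A sketch of the difference using in-memory buffers and a made-up row; whether to pass `index=False` depends on how the later parts of the project read the file:

```python
import io
import pandas as pd

demo = pd.DataFrame({"url": ["https://example.test/"], "phish": ["benign"]})

with_index = io.StringIO()
demo.to_csv(with_index)                  # default: the index is written as an unnamed first column
without_index = io.StringIO()
demo.to_csv(without_index, index=False)  # only the data columns are written

cols_with = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()
print(cols_with)
print(cols_without)
```
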