**Extracting features from phishing and benign URLs.**

In [1]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


> # **Intro**

From URLs we have to extract features that help us distinguish phishing websites from legitimate ones. 
The features can be classified into four types: 
1.   Address Bar Based Features
2.   Abnormal Based Features
3.   HTML and JavaScript based Features
4.   Domain based Features


Some of the features of phishing websites are: 

Using the IP Address

> Using an IP address in place of the domain name in the URL, such as `“http://125.98.3.123/fake.html”`. Sometimes the IP address is written in hexadecimal: 

```
“http://0x58.0xCC.0xCA.0x62/2/paypal.ca/index.html”
```

Other features are

*   Long URLs. URLs of 54 characters or more are treated as phishing here. 
*   Using tiny URL services. 
* Having the *@* symbol in the URL. 
* Using `-` (hyphens) in the domain. 

The list is long and the features extracted are based on this paper:
http://eprints.hud.ac.uk/id/eprint/24330/6/MohammadPhishing14July2015.pdf


 



In [2]:
import pandas as pd

Extracting phishing URLs

In [3]:
data0 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Datasets/verified_online.csv')
#http://data.phishtank.com/data/online-valid.csv


In [4]:
data0.head()

Unnamed: 0,phish_id,url,phish_detail_url,submission_time,verified,verification_time,online,target
0,7265251,https://anazom.co.ip.lhpoct.shop/dR3snx1C.php?...,http://www.phishtank.com/phish_detail.php?phis...,2021-08-16T02:11:56+00:00,yes,2021-08-16T02:18:16+00:00,yes,Amazon.com
1,7265249,https://www.sprintage.it/images/login/bizmail.php,http://www.phishtank.com/phish_detail.php?phis...,2021-08-16T02:07:20+00:00,yes,2021-08-16T02:14:31+00:00,yes,Other
2,7265245,http://confirm-unverified-pplaccount.com/custo...,http://www.phishtank.com/phish_detail.php?phis...,2021-08-16T02:05:41+00:00,yes,2021-08-16T02:14:31+00:00,yes,Other
3,7265244,http://confirm-unverified-pplaccount.com/custo...,http://www.phishtank.com/phish_detail.php?phis...,2021-08-16T02:05:40+00:00,yes,2021-08-16T02:14:31+00:00,yes,Other
4,7265242,http://mail.confirm-unverified-pplaccount.com/...,http://www.phishtank.com/phish_detail.php?phis...,2021-08-16T02:05:36+00:00,yes,2021-08-16T02:14:31+00:00,yes,Other


In [5]:
data0.shape

(10784, 8)

In [6]:
#Collecting 5,000 Phishing URLs randomly
phis_url = data0.sample(n = 5000, random_state = 12).copy()
phis_url = phis_url.reset_index(drop=True)
phis_url.head()


Unnamed: 0,phish_id,url,phish_detail_url,submission_time,verified,verification_time,online,target
0,7260589,https://hzibupigbnrtqezn-dot-sunlit-center-322...,http://www.phishtank.com/phish_detail.php?phis...,2021-08-10T15:26:24+00:00,yes,2021-08-10T15:49:26+00:00,yes,Other
1,7251242,http://netttxnet.byethost3.com/,http://www.phishtank.com/phish_detail.php?phis...,2021-07-31T23:57:53+00:00,yes,2021-08-01T00:08:16+00:00,yes,Other
2,7257550,https://phx.chromeproxy.net/direct/aHR0cHM6Ly9...,http://www.phishtank.com/phish_detail.php?phis...,2021-08-07T02:08:44+00:00,yes,2021-08-07T02:19:40+00:00,yes,Other
3,6887460,https://midnightluna1.typeform.com/to/bSFSrH4T,http://www.phishtank.com/phish_detail.php?phis...,2020-12-11T23:36:32+00:00,yes,2021-01-02T06:10:46+00:00,yes,Other
4,6772545,http://caracasmateriais.blogspot.com/,http://www.phishtank.com/phish_detail.php?phis...,2020-09-16T14:49:55+00:00,yes,2020-09-16T17:09:06+00:00,yes,Other


> # **Address Bar Based Features**





> **Extracting Legitimate URLs:**

In [7]:
data1 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Datasets/Benign_list.csv')

In [8]:
data1.columns = ['URLs']

data1.head()

Unnamed: 0,URLs
0,http://1337x.to/torrent/1110018/Blackhat-2015-...
1,http://1337x.to/torrent/1122940/Blackhat-2015-...
2,http://1337x.to/torrent/1124395/Fast-and-Furio...
3,http://1337x.to/torrent/1145504/Avengers-Age-o...
4,http://1337x.to/torrent/1160078/Avengers-age-o...


In [9]:
data1.shape

(35377, 1)

In [10]:
#Collecting 5,000 Legitimate URLs randomly
legi_url = data1.sample(n = 5000, random_state = 12).copy()
legi_url = legi_url.reset_index(drop=True)
legi_url.head()


Unnamed: 0,URLs
0,http://graphicriver.net/search?date=this-month...
1,http://ecnavi.jp/redirect/?url=http://www.cros...
2,https://hubpages.com/signin?explain=follow+Hub...
3,http://extratorrent.cc/torrent/4190536/AOMEI+B...
4,http://icicibank.com/Personal-Banking/offers/o...


In [11]:
!pip install bs4




> **1.1 Extracting Domain of the url**

In [12]:
import requests
import re


In [14]:
!pwd

/content


In [15]:
from bs4 import BeautifulSoup
from urllib.parse import urlparse


In [16]:
def getDomain(url): 
  domain = urlparse(url).netloc
  # strip only a leading "www." (the dot is escaped so it matches a literal dot)
  domain = re.sub(r"^www\.", "", domain)
  return domain


In [17]:
#checking if getDomain works
getDomain('https://www.youtube.com/results?search_query=extracting+url+information')

'youtube.com'

In [18]:
!pip install ipaddress

Collecting ipaddress
  Downloading ipaddress-1.0.23-py2.py3-none-any.whl (18 kB)
Installing collected packages: ipaddress
Successfully installed ipaddress-1.0.23


> **1.2 Extracting ipaddress from the url**

In [19]:
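#checking if url contains an IP address
import ipaddress
def haveIP(url):
  # ip_address() needs a bare host, so take the host part when the URL has a scheme
  host = urlparse(url).hostname or url
  try:
    ipaddress.ip_address(host)
    ip = 1
  except ValueError:
    ip = 0
  return ip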

In [20]:
#checking the haveIP fun
x = '127.0.0.1'
print( haveIP(x) ) 
x = 'www.google.com'
print ( haveIP(x) ) 

1
0
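
The hexadecimal form shown in the intro (`0x58.0xCC.0xCA.0x62`) is not a string that `ipaddress.ip_address` accepts, so a check built on it alone would miss that case. The following is a minimal sketch of one way to cover both notations. The function name `host_is_ip_any_radix` is hypothetical and not part of the notebook.

```python
from urllib.parse import urlparse

def host_is_ip_any_radix(url):
    """Return 1 if the host is four dot-separated octets (decimal or 0x-hex), else 0."""
    # Fall back to the raw string when there is no scheme, e.g. '127.0.0.1'
    host = urlparse(url).hostname or url
    parts = host.split('.')
    if len(parts) != 4:
        return 0
    for part in parts:
        try:
            # Accept plain decimal or 0x-prefixed hexadecimal octets
            value = int(part, 16) if part.lower().startswith('0x') else int(part, 10)
        except ValueError:
            return 0
        if not 0 <= value <= 255:
            return 0
    return 1

print(host_is_ip_any_radix('http://0x58.0xCC.0xCA.0x62/2/paypal.ca/index.html'))
print(host_is_ip_any_radix('https://www.google.com/'))
```

This only handles the two notations mentioned in the intro. Other encodings, such as a single 32-bit integer host, are outside the scope of this sketch.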


> **1.3 Extracting *@* symbol from the url**




In [21]:
#Checking for @ symbol in the URL
def haveAt(url): 
  if '@' in url: 
    return 1
  else:
    return 0

In [22]:
#checking the have_at fun
x = 'www.google.com'
y = 'www.yahoo@gmail.com'
print( haveAt(x) ) 
print( haveAt(y) ) 

0
1


> **1.4 Extracting the length of url**

In [23]:
#Checking the length of the URL
def urlLength(url): 
  if len(url) < 54: 
    return 0
  else: 
    return 1

> **1.5 Extracting the tiny URL shorteners from the url**

In [24]:
#Checking if the url uses tinyUrl services

short_url_services =  r"bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|" \
                      r"yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|" \
                      r"short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|" \
                      r"doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|db\.tt|" \
                      r"qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|q\.gs|is\.gd|" \
                      r"po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|x\.co|" \
                      r"prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|" \
                      r"tr\.im|link\.zip\.net"

In [25]:
def usesTinyUrl(url):      
  temp = re.search(short_url_services, url)
  if(temp):
    return 1
  else: 
    return 0

In [26]:
#checking UsesTinyUrl fun
x = 'bit.ly/19DXSk4'
y = 'www.yahoo.com'
print( usesTinyUrl(x) ) 
print( usesTinyUrl(y) ) 

1
0
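
Because `re.search` looks for the pattern anywhere in the string, a short entry such as `t\.co` also matches inside longer names (for example the `t.co` in `microsoft.com`). One possible alternative, sketched here, compares the parsed host against a set of shortener hosts. `SHORTENER_HOSTS` and `uses_shortener_host` are hypothetical names, and the set is a small illustrative subset of the list above.

```python
from urllib.parse import urlparse

# Small illustrative subset of the services listed above (not the full list)
SHORTENER_HOSTS = {"bit.ly", "goo.gl", "t.co", "tinyurl.com", "ow.ly", "is.gd"}

def uses_shortener_host(url):
    # urlparse only fills in the host when a scheme (or a leading //) is present
    if "//" not in url:
        url = "//" + url
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Exact host comparison avoids substring hits inside unrelated domains
    return 1 if host in SHORTENER_HOSTS else 0

print(uses_shortener_host('bit.ly/19DXSk4'))
print(uses_shortener_host('https://www.microsoft.com/'))
```

The trade-off is that exact matching needs the complete host for every service, whereas the regular expression tolerates partial names such as `tinyurl`.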


> **1.6 Extracting the hyphens from the url**

In [27]:
#Checking if URL contains '-'. 
def haveHyphen(url): 
  if '-' in urlparse(url).netloc:
    return 1            # phishing
  else:
    return 0            # legitimate


In [28]:
#checking the haveHyphen fun (a scheme is needed so that urlparse fills in netloc)
print( haveHyphen('http://www.pay-tm.com') ) 
print( haveHyphen('http://www.google--pay-1.com'))

1
1


In [29]:
!pip install tldextract

Collecting tldextract
  Downloading tldextract-3.1.2-py2.py3-none-any.whl (87 kB)
Collecting requests-file>=1.4
  Downloading requests_file-1.5.1-py2.py3-none-any.whl (3.7 kB)
Installing collected packages: requests-file, tldextract
Successfully installed requests-file-1.5.1 tldextract-3.1.2


> **1.7 Extracting subdomains from the url.**

In [30]:
#checking if the Url have multi sub domain
#Not working currently
import tldextract
def multiSubDomain(url):   
  x = tldextract.extract(url)
  print(x)



In [31]:
#checking multi sub domain 
x = 'https://www1.eposcard-co-jp-mcmbresreviec.s20r084.abc.def.ghi.cn/'
y = 'http://www.hud.ac.uk/students/page1.html'
print ( multiSubDomain(x) )
print ( multiSubDomain(y) ) 

ExtractResult(subdomain='www1.eposcard-co-jp-mcmbresreviec.s20r084.abc.def', domain='ghi', suffix='cn')
None
ExtractResult(subdomain='www', domain='hud', suffix='ac.uk')
None
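
The `subdomain` field printed above could be turned into a numeric feature by counting its labels. The sketch below works on that string directly, so it does not depend on `tldextract` itself. The names `count_subdomain_labels` and `multi_subdomain_flag`, and the threshold `max_labels=1`, are assumptions made for illustration and are not taken from the notebook.

```python
def count_subdomain_labels(subdomain):
    """Count dot-separated labels in a subdomain string, ignoring a leading www-style label."""
    labels = [label for label in subdomain.split('.') if label]
    # Drop a leading www / www1 / www2 label, which says little about nesting
    if labels and labels[0].lower().startswith('www'):
        labels = labels[1:]
    return len(labels)

def multi_subdomain_flag(subdomain, max_labels=1):
    # max_labels is an assumed threshold for illustration only
    return 1 if count_subdomain_labels(subdomain) > max_labels else 0

# Subdomain strings copied from the ExtractResult output above
print(multi_subdomain_flag('www1.eposcard-co-jp-mcmbresreviec.s20r084.abc.def'))
print(multi_subdomain_flag('www'))
```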


> **1.8 Extracting `//` from the url**

In [32]:
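#checking whether '//' occurs again after the scheme (last '//' beyond position 7)
def redirection(url):
  pos = url.rfind('//')
  # 'http://' puts '//' at index 5 and 'https://' at index 6, so anything past 7 is an extra '//'
  if pos > 7:
    return 1
  else:
    return 0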

In [33]:
#checking the redirection 
x = 'http://eprints.hud.ac.uk/id/eprint/24330/6/MohammadPhishing14July2015.pdf'
print( redirection(x) )

0


> **1.9 Extracting the HTTPs certificate age and issuer**

In [34]:
import requests
#not working correctly

response = requests.get('https://stackoverflow.com/questions/29773003/check-whether-domain-is-registered')
print(response)

<Response [200]>
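
An HTTP status code says nothing about the certificate itself. One possible route, sketched here with the standard-library `ssl` module, is to read the peer certificate and derive the issuer and age from it. `fetch_certificate` and `certificate_summary` are hypothetical helpers, not part of the notebook, and the sample certificate below is made up for the demonstration.

```python
import socket
import ssl
import time

def fetch_certificate(hostname, port=443, timeout=5):
    """Open a TLS connection and return the peer certificate as a dict (needs network access)."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()

def certificate_summary(cert, now=None):
    """Return (issuer organisation, certificate age in days) from a getpeercert() dict."""
    now = time.time() if now is None else now
    # 'issuer' is a tuple of RDNs, each of which is a tuple of (key, value) pairs
    issuer = {key: value for rdn in cert.get('issuer', ()) for key, value in rdn}
    issued_at = ssl.cert_time_to_seconds(cert['notBefore'])
    age_days = int((now - issued_at) // 86400)
    return issuer.get('organizationName'), age_days

# Made-up certificate dict in the shape returned by getpeercert()
sample = {
    'issuer': ((('countryName', 'US'),), (('organizationName', 'Example CA'),)),
    'notBefore': 'Jan  1 00:00:00 2021 GMT',
}
print(certificate_summary(sample, now=ssl.cert_time_to_seconds('Jan 11 00:00:00 2021 GMT')))
```

With the default context, `fetch_certificate` raises an `ssl.SSLError` subclass for certificates that fail verification, which may itself be a useful signal. How issuer and age should map to a 0/1 feature is left open here.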


> **1.10 Extracting domain registration length**

In [35]:
pip install python-whois

Collecting python-whois
  Downloading python-whois-0.7.3.tar.gz (91 kB)
Building wheels for collected packages: python-whois
  Building wheel for python-whois (setup.py) ... done
  Created wheel for python-whois: filename=python_whois-0.7.3-py3-none-any.whl size=87721 sha256=61403b5ba1af507cf44ca5f9d72a50a5a9ff49e9debdf2a20e5a1cf0668e9222
  Stored in dir

In [36]:
#finding how many years ago the domain was registered
import whois 
from datetime import datetime
from dateutil.relativedelta import relativedelta
def domainRegLength(url):
  try: 
    created = whois.whois(url).creation_date
    # creation_date may be a single datetime or a list of them
    if isinstance(created, list):
      created = created[0]
    return relativedelta(datetime.today(), created).years
  except Exception: 
    return 0

In [37]:
#checking the domain reg len fun
x = 'www.google.com'
y = 'http://u1047531.cp.regruhosting.ru/acces-inges-20200104-t452/3facd/'
print( domainRegLength(x) )
print ( domainRegLength(y) ) 



24
0


> **1.11 Extracting hidden https token in domain**


In [38]:
#checking the existence of a hidden https token in the domain
def hiddenhttps(url): 
  domain = urlparse(url).netloc
  print(domain)
  if 'https' in domain: 
    return 1
  else: 
    return 0

#checking the working of above fun
x = 'https://open.spotify.com/playlist/3mHGpdWE9oxUjcNZJvCkBe'
y = 'http://https-www-paypal-it-webapps-mpp-home.soft-hair.com/'
z = 'https://http-www-paypal-it-webapps-mpp-home.soft-hair.com/'
hiddenhttps(z)

http-www-paypal-it-webapps-mpp-home.soft-hair.com


0

> **1.12 Extracting Depth of the URL**


In [39]:
def getDepth(url):
  s = urlparse(url).path.split('/')
  depth = 0
  for j in range(len(s)):
    if len(s[j]) != 0:
      depth = depth+1
  return depth


> #  **Domain Based Features**




> **2.1 Web Traffic**

In [40]:
!wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
!unzip -o top-1m.csv.zip

--2021-12-12 08:02:41--  http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.140.96
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.140.96|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5149325 (4.9M) [application/zip]
Saving to: ‘top-1m.csv.zip’


2021-12-12 08:02:42 (9.76 MB/s) - ‘top-1m.csv.zip’ saved [5149325/5149325]

Archive:  top-1m.csv.zip
  inflating: top-1m.csv              


In [41]:
#If the rank of the domain < 100000, the value of this feature is 1, else 0. A high rank suggests a legitimate site, so 1 here does not mean phishing.
'''
import urllib
import urllib.request

def web_traffic(url):
  try:
    #Filling the whitespaces in the URL if any
    url = urllib.parse.quote(url)
    rank = BeautifulSoup(urllib.request.urlopen("http://data.alexa.com/data?cli=10&dat=s&url=" + url).read(), "xml").find("REACH")['RANK']
    rank = int(rank)
  except TypeError:
        return 1
  if rank <100000:
    return 1
  else:
    return 0
'''



In [42]:
# Newly updated from github
import csv
with open('top-1m.csv') as f:
    reader = csv.reader(f)
    alexa = list(reader)


In [43]:
def web_traffic(url):
  domain = getDomain(url)
  try:
    rank = [i for i, v in enumerate(alexa) if v[1] == domain][0] + 1
  except:
    return 0
  if rank <100000:
    return 1
  else:
    return 0
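
The list comprehension in `web_traffic` scans all rows of the ranking file for every URL. A dictionary built once could make each lookup constant-time. The sketch below assumes rows shaped like `[rank, domain]`, as the `v[1] == domain` comparison implies. The helper names `build_rank_index` and `web_traffic_indexed` are hypothetical.

```python
def build_rank_index(rows):
    """Map each domain to its 1-based position in rows shaped like ['1', 'google.com']."""
    index = {}
    for position, row in enumerate(rows, start=1):
        if len(row) >= 2:
            # Keep the first (best) position if a domain appears more than once
            index.setdefault(row[1], position)
    return index

def web_traffic_indexed(domain, rank_index, cutoff=100000):
    rank = rank_index.get(domain)
    # Same output convention as web_traffic: 1 inside the cutoff, 0 otherwise or if unlisted
    return 1 if rank is not None and rank < cutoff else 0

# Tiny stand-in for the parsed ranking file
rows = [['1', 'google.com'], ['2', 'youtube.com']]
rank_index = build_rank_index(rows)
print(web_traffic_indexed('google.com', rank_index))
print(web_traffic_indexed('not-listed.example', rank_index))
```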


> **2.2 Age of Domain**

In [44]:
#If the span between the creation and expiration dates is under 6 months (or the dates are missing or returned as a list), the value of this feature is 1 (phishing) else 0 (legitimate).
def domainAge(domain_name):
  creation_date = domain_name.creation_date
  expiration_date = domain_name.expiration_date
  if (isinstance(creation_date,str) or isinstance(expiration_date,str)):
    try:
      creation_date = datetime.strptime(creation_date,'%Y-%m-%d')
      expiration_date = datetime.strptime(expiration_date,"%Y-%m-%d")
    except:
      return 1
  if ((expiration_date is None) or (creation_date is None)):
      return 1
  elif ((type(expiration_date) is list) or (type(creation_date) is list)):
      return 1
  else:
    ageofdomain = abs((expiration_date - creation_date).days)
    if ((ageofdomain/30) < 6):
      age = 1
    else:
      age = 0
  return age


> **2.3 End Period of Domain**

In [45]:
#If the end period of the domain > 6 months, the value of this feature is 1 (phishing) else 0 (legitimate).
def domainEnd(domain_name):
  expiration_date = domain_name.expiration_date
  if isinstance(expiration_date,str):
    try:
      expiration_date = datetime.strptime(expiration_date,"%Y-%m-%d")
    except:
      return 1
  if (expiration_date is None):
      return 1
  elif (type(expiration_date) is list):
      return 1
  else:
    today = datetime.now()
    end = abs((expiration_date - today).days)
    if ((end/30) < 6):
      end = 0
    else:
      end = 1
  return end
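
Both `domainAge` and `domainEnd` return 1 whenever a date arrives as a list. A possible refinement, sketched here, is to normalise the value to a single `datetime` before comparing. `first_datetime` and `months_between` are hypothetical helpers, and taking the first list element is an assumption rather than something the notebook specifies.

```python
from datetime import datetime

def first_datetime(value, fmt="%Y-%m-%d"):
    """Return one datetime from a datetime, a string, or a list of either; None if unusable."""
    if isinstance(value, list):
        # Assumption: the first entry is representative when several dates are reported
        value = value[0] if value else None
    if isinstance(value, str):
        try:
            value = datetime.strptime(value, fmt)
        except ValueError:
            return None
    return value if isinstance(value, datetime) else None

def months_between(start, end):
    # Approximate months as 30-day blocks, matching the /30 used in domainAge and domainEnd
    return abs((end - start).days) / 30

created = first_datetime([datetime(2020, 1, 1), datetime(2020, 1, 2)])
expires = first_datetime('2020-03-01')
print(months_between(created, expires))
```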


> # **HTML and Javascript based features**


> **3.1 . IFrame Redirection**

In [46]:
#If the response is empty or no iframe/frameBorder markup is found, the value assigned to this feature is 1 (phishing) or else 0 (legitimate).


def iframe(response):
  if response == "":
    return 1
  else:
    # Alternation, not a character class: match either token anywhere in the page
    if re.findall(r"<iframe|frameBorder", response.text):
      return 0
    else:
      return 1

> **3.2 . Status Bar Customization**

In [47]:
# If the response is empty or onmouseover is found then, the value assigned to this feature is 1 (phishing) or else 0 (legitimate).
def mouseOver(response): 
  if response == "" :
    return 1
  else:    
    if re.findall("<script>.+onmouseover.+</script>", response.text):
      return 1
    else:
      return 0
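
Regular expressions over raw HTML are sensitive to case, attribute order, and line breaks. As an alternative, the standard-library `html.parser` can count the relevant tags and attributes. This is only a sketch: `SuspiciousMarkupParser` and `markup_signals` are hypothetical names, and how the counts map to 0/1 features is left open.

```python
from html.parser import HTMLParser

class SuspiciousMarkupParser(HTMLParser):
    """Collects two markup signals: iframe tags and onmouseover handlers."""
    def __init__(self):
        super().__init__()
        self.iframe_count = 0
        self.mouseover_count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this hook
        if tag == "iframe":
            self.iframe_count += 1
        if any(name == "onmouseover" for name, _ in attrs):
            self.mouseover_count += 1

def markup_signals(html_text):
    parser = SuspiciousMarkupParser()
    parser.feed(html_text)
    return {"iframe": parser.iframe_count, "onmouseover": parser.mouseover_count}

page = '<IFRAME src="x"></IFRAME><a href="#" onMouseOver="go()">a</a>'
print(markup_signals(page))
```

In the notebook's setting, `response.text` would be the string passed to `markup_signals`.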

> **3.3 . Disabling Right Click**

In [48]:
# If the response is empty or no `event.button == 2` check is found, the value assigned to this feature is 1 (phishing) or else 0 (legitimate).
def rightClick(response):
  if response == "":
    return 1
  else:    
    if re.findall(r"event.button ?== ?2", response.text):
      return 0
    else:
      return 1


> **3.4 . Website Forwarding**

In [49]:
# Legitimate websites are forwarded only a few times; more than 2 redirects in response.history gives 1 (phishing), otherwise 0 (legitimate).
def forwarding(response):
  if response == "":
    return 1
  else:    
    if len(response.history) <= 2:
      return 0
    else:
      return 1


> # **Computing URL Features**

In [50]:
def featureExtraction(url,label, curr):
  feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', 
                      'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record', 'Web_Traffic', 
                      'Domain_Age', 'Domain_End', 'iFrame', 'Mouse_Over','Right_Click', 'Web_Forwards', 'Label']
  features = []
  #Address bar based features  are working correctly    
  features.append(getDomain(url))
  features.append(haveIP(url))
  features.append(haveAt(url))
  features.append(urlLength(url))
  features.append(getDepth(url))
  features.append(redirection(url))
  features.append(hiddenhttps(url))
  features.append(usesTinyUrl(url))
  features.append(haveHyphen(url))
  
  #Domain based features  working correctly
  dns = 0
  try:
    domain_name = whois.whois(urlparse(url).netloc)
  except:
    dns = 1

  features.append(dns)
  features.append(web_traffic(url))
  features.append(1 if dns == 1 else domainAge(domain_name))
  features.append(1 if dns == 1 else domainEnd(domain_name))
  
  # HTML & Javascript based features working correctly
  temp = [1]*4   # integer placeholders, the same type the HTML feature functions return
  temp.append(label)

  try:   
    response = requests.get(url, timeout=5 )        
    print('HTTP response code: ', response.status_code)
    if response.status_code == 200:       
      features.append(iframe(response))      
      features.append(mouseOver(response))    
      features.append(rightClick(response))    
      features.append(forwarding(response))    
      features.append(label)          
    else: 
      print('Not reachable - ', url)
      features.extend(temp)
  except:     
    print('Timeout - ', url)
    features.extend(temp)
    

  return features


>#  **4.1. Legitimate URLs:**


In [None]:

#Extracting the features & storing them in a list
legi_features = []
label = 0
# 1 is phishing , 0 is legitimate
for i in range(0, len(legi_url) ):
  #print('i is: ', i, 'url: ', legi_url['URLs'][i]  )
  print('i is: ', i , end = "")
  url = legi_url['URLs'][i]  
  legi_features.append(featureExtraction(url,label, i))


i is:  0graphicriver.net
HTTP response code:  200
i is:  1ecnavi.jp
HTTP response code:  200
i is:  2hubpages.com
HTTP response code:  200
i is:  3extratorrent.cc
HTTP response code:  404
Not reachable -  http://extratorrent.cc/torrent/4190536/AOMEI+Backupper+Technician+%2B+Server+Edition+2.8.0+%2B+Patch+%2B+Key+%2B+100%25+Working.html
i is:  4icicibank.com
HTTP response code:  200
i is:  5nypost.com
HTTP response code:  200
i is:  6kienthuc.net.vn
Error trying to connect to socket: closing socket
HTTP response code:  200
i is:  7thenextweb.com
HTTP response code:  405
Not reachable -  http://thenextweb.com/in/2015/04/16/india-wants-a-neutral-web-and-facebooks-internet-org-cant-be-a-part-of-it/gtm.js
i is:  8tobogo.net
HTTP response code:  200
i is:  9akhbarelyom.com
HTTP response code:  200
i is:  10tunein.com
HTTP response code:  200
i is:  11tune.pk
Error trying to connect to socket: closing socket
HTTP response code:  404
Not reachable -  https://tune.pk/video/6046458/canelo-vs-kir

In [None]:
#converting the list to dataframe
feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', 
                      'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record', 'Web_Traffic', 
                      'Domain_Age', 'Domain_End', 'iFrame', 'Mouse_Over','Right_Click', 'Web_Forwards', 'Label']

legitimate = pd.DataFrame(legi_features, columns= feature_names)
legitimate.head()


In [None]:
legitimate.to_csv('legitimate_copy.csv', index= False)


In [None]:
!cp legitimate_copy.csv "/content/drive/MyDrive/Colab Notebooks/Datasets"

> # **4.2. Phishing URLs:**

In [None]:
phis_url.shape

In [None]:
#Extracting the features & storing them in a list
phish_features = []
label = 1
for i in range(0, len(phis_url) ):
  url = phis_url['url'][i]
  print('i is: ', i, 'url: ', url  )
  phish_features.append(featureExtraction(url,label, i))


In [None]:
#converting the list to dataframe

feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', 
                      'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record', 'Web_Traffic', 
                      'Domain_Age', 'Domain_End', 'iFrame', 'Mouse_Over','Right_Click', 'Web_Forwards', 'Label']
                      
phishing = pd.DataFrame(phish_features, columns= feature_names)
phishing.head()


In [None]:
# Storing the extracted phishing URL features to csv file
phishing.to_csv('phishing_copy.csv', index= False)


In [None]:
!cp phishing_copy.csv "/content/drive/MyDrive/Colab Notebooks/Datasets"

> **Final Dataset**

In [None]:
#Concatenating the dataframes into one 
urldata = pd.concat([legitimate, phishing]).reset_index(drop=True)
feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', 
                      'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record', 'Web_Traffic', 
                      'Domain_Age', 'Domain_End', 'iFrame', 'Mouse_Over','Right_Click', 'Web_Forwards', 'Label']
urldata = pd.DataFrame(urldata, columns= feature_names)
urldata.head()


In [None]:
urldata.shape


In [None]:
# Storing the data in CSV file
urldata.to_csv('final_data.csv', index=False)



In [None]:
!cp final_data.csv "/content/drive/MyDrive/Colab Notebooks/Datasets"

In [None]:
temp = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Datasets/final_data.csv')

In [None]:
temp.shape