### 1.1 Collecting Phishing URLs
The phishing URLSs were collected from the PhishTank and downloaded and stored into a csv file. Then the csv is loaded into a dataframe

In [4]:
import pandas as pd
data0 = pd.read_csv("online-valid.csv")
data0.head()

Unnamed: 0,phish_id,url,phish_detail_url,submission_time,verified,verification_time,online,target
0,6557033,http://u1047531.cp.regruhosting.ru/acces-inges...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T22:01:43+00:00,yes,2020-05-09T22:03:07+00:00,yes,Other
1,6557032,http://hoysalacreations.com/wp-content/plugins...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T22:01:37+00:00,yes,2020-05-09T22:03:07+00:00,yes,Other
2,6557011,http://www.accsystemprblemhelp.site/checkpoint...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T21:54:31+00:00,yes,2020-05-09T21:55:38+00:00,yes,Facebook
3,6557010,http://www.accsystemprblemhelp.site/login_atte...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T21:53:48+00:00,yes,2020-05-09T21:54:34+00:00,yes,Facebook
4,6557009,https://firebasestorage.googleapis.com/v0/b/so...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T21:49:27+00:00,yes,2020-05-09T21:51:24+00:00,yes,Microsoft


In [5]:
data0.shape

(14858, 8)

So, the downloaded csv file has thousands of URLS. For this particular studies I have picked 5000 random samples from the dataset.

In [6]:
phishurl = data0.sample(n=5000, random_state=12).copy()
phishurl= phishurl.reset_index(drop=True)
phishurl.head()

Unnamed: 0,phish_id,url,phish_detail_url,submission_time,verified,verification_time,online,target
0,6514946,http://confirmprofileaccount.com/,http://www.phishtank.com/phish_detail.php?phis...,2020-04-19T11:06:55+00:00,yes,2020-04-19T13:42:41+00:00,yes,Other
1,4927651,http://www.marreme.com/MasterAdmin/04mop.html,http://www.phishtank.com/phish_detail.php?phis...,2017-04-04T19:35:54+00:00,yes,2017-05-03T23:00:42+00:00,yes,Other
2,5116976,http://modsecpaststudents.com/review/,http://www.phishtank.com/phish_detail.php?phis...,2017-07-25T18:48:30+00:00,yes,2017-07-28T16:01:36+00:00,yes,Other
3,6356131,https://docs.google.com/forms/d/e/1FAIpQLScL6L...,http://www.phishtank.com/phish_detail.php?phis...,2020-01-13T20:13:37+00:00,yes,2020-01-17T01:55:38+00:00,yes,Other
4,6535965,https://oportunidadedasemana.com/americanas//?...,http://www.phishtank.com/phish_detail.php?phis...,2020-04-29T00:01:03+00:00,yes,2020-05-01T10:55:35+00:00,yes,Other


In [7]:
phishurl.shape

(5000, 8)

We can see now the dataset has collected 5000 random phishing URLs. 

### 1.2 Legitimate URLS

The dataset used for Legitimate URLs is downloaded from University of New Brunswick, https://www.unb.ca/cic/datasets/url-2016.html.
This dataset has a collection of benign, spam, phishing, malware & defacement URLs.

Out of all these types, the benign url dataset is considered for this project.

" Benign_list_big_final.csv"
This file has the final list of legitimate urls. The total count is 
35,300

In [8]:
data1 = pd.read_csv("Benign_list_big_final.csv")
data1.columns = ['URLs']
data1.head()

Unnamed: 0,URLs
0,http://1337x.to/torrent/1110018/Blackhat-2015-...
1,http://1337x.to/torrent/1122940/Blackhat-2015-...
2,http://1337x.to/torrent/1124395/Fast-and-Furio...
3,http://1337x.to/torrent/1145504/Avengers-Age-o...
4,http://1337x.to/torrent/1160078/Avengers-age-o...


In [9]:
legiturl = data1.sample(n = 5000, random_state = 12).copy()
legiturl = legiturl.reset_index(drop=True)
legiturl.head()

Unnamed: 0,URLs
0,http://graphicriver.net/search?date=this-month...
1,http://ecnavi.jp/redirect/?url=http://www.cros...
2,https://hubpages.com/signin?explain=follow+Hub...
3,http://extratorrent.cc/torrent/4190536/AOMEI+B...
4,http://icicibank.com/Personal-Banking/offers/o...


Similar to Phishing URLs 500 random samples were picked from the dataset.

In [10]:
legiturl.shape


(5000, 1)

## 2. Feature Extraction

Now we will Extract the features from the URL dataset.

The extracted features are categorized into 3 categories namely:
1. Address Bar based features
2. Domain Based Features
3. HTML & Javascript based Features

### 2.1 Address Bar Based Features:
Features such as Domain, IP address, Redirection fall into under this category.

In [11]:
from urllib.parse import urlparse,urlencode
import ipaddress
import re

2.1.1 Domain


In [12]:
def get_domain_category(url: str) -> int:
    """
    Extracts the domain name and categorizes it based on the presence of a leading "www.".

    Phishing attempts might sometimes omit the "www." prefix to appear more trustworthy.
    This feature provides a basic indicator, but should be used with caution.

    Args:
        url (str): The URL to analyze.

    Returns:
        int: 1 if the domain starts with "www.", 0 otherwise.
    """

    parsed_url = urlparse(url)
    if not parsed_url.netloc:
        return 0  # No domain found, consider it suspicious (1)

    domain = parsed_url.netloc
    return 0 if domain.startswith("www.") else 1

2.1.2. IP Address in the URL

In [13]:
def has_ip_address_category(url: str) -> int:
    """
    Checks if the given URL contains a valid IP address.

    Phishing URLs might sometimes use IP addresses to bypass security checks.
    This feature provides a basic indicator, but should be used with caution.

    Args:
        url (str): The URL to check.

    Returns:
        int: 1 if the URL contains a valid IP address, 0 otherwise.
    """

    try:
        ipaddress.ip_address(url)
        ip = 1
    except:
        ip = 0
    return ip

2.1.3 
Checks for the presence of '@' symbol in the URL. 

Using “@” symbol in the URL leads the browser to ignore everything preceding the “@” symbol and the real address often follows the “@” symbol.

In [14]:
def has_at_symbol(url: str) -> bool:
  """Checks if the given URL contains the "@" symbol.

  Args:
      url (str): The URL to check.

  Returns:
      bool: True if the URL contains the "@" symbol, False otherwise.
  """
  if "@" in url:
    at = 1    
  else:
    at = 0    
  return at

2.1.4. Length of URL

Computes the length of the URL. Phishers can use long URL to hide the doubtful part in the address bar. In this project, if the length of the URL is greater than or equal 54 characters then the URL classified as phishing otherwise legitimate.

If the length of URL >= 54 , the value assigned to this feature is 1 (phishing) or else 0 (legitimate).



In [15]:
def get_url_length_category(url: str) -> int:
  """
  Calculates the length of the URL and returns 1 (phishing) if it's 54 characters or longer, 0 (legitimate) otherwise.

  This simplistic approach should be used with caution, as URL length alone is not a definitive indicator of phishing.

  Args:
      url (str): The URL to analyze.

  Returns:
      int: 1 if the URL length is >= 54, 0 otherwise.
  """

  url_length = len(url)
  return 1 if url_length >= 54 else 0

2.1.5. Depth of URL

Computes the depth of the URL. This feature calculates the number of sub pages in the given url based on the '/'.

The value of feature is a numerical based on the URL.

In [16]:
from urllib.parse import urlparse


def get_url_depth_category(url: str) -> int:
    """
    Calculates the number of path segments (depth) in a URL.

    Phishing URLs might sometimes have a deeper path structure to obfuscate the real target.
    This feature provides a basic indicator, but should be used with caution.

    Args:
        url (str): The URL to analyze.

    Returns:
        int: The number of path segments (depth) in the URL, or 0 if no path is present.
    """

    parsed_url = urlparse(url)
    path = parsed_url.path.strip("/")  # Remove leading and trailing slashes

    if not path:
        return 0  # No path found, consider it shallow depth (0)

    return len(path.split("/"))


2.1.6 Refirection "//"

Checks the presence of "//" in the URL. The existence of “//” within the URL path means that the user will be redirected to another website. The location of the “//” in URL is computed. We find that if the URL starts with “HTTP”, that means the “//” should appear in the sixth position. However, if the URL employs “HTTPS” then the “//” should appear in seventh position.

If the "//" is anywhere in the URL apart from after the protocal, thee value assigned to this feature is 1 (phishing) or else 0 (legitimate).

In [17]:
from urllib.parse import urlparse


def has_double_slash_redirection(url: str) -> int:
  """
  Checks if the URL contains a double slash ("//") after the protocol (e.g., "https://").

  Phishing URLs might sometimes use this redirection syntax to obfuscate the real target.
  This feature provides a basic indicator, but should be used with caution.

  Args:
      url (str): The URL to analyze.

  Returns:
      int: 1 if the URL contains "//" after the protocol, 0 otherwise.
  """

  parsed_url = urlparse(url)
  return 1 if parsed_url.scheme and parsed_url.netloc else 0


2.1.7. "http/https" in Domain name

Checks for the presence of "http/https" in the domain part of the URL. The phishers may add the “HTTPS” token to the domain part of a URL in order to trick users.

In [18]:
from urllib.parse import urlparse


def has_https_in_domain(url: str) -> int:
  """
  Checks if the domain part of the URL contains the string "https".

  This feature might not a reliable indicator of phishing, as some malicious URLs might explicitly include "https" in the domain to appear legitimate.

  Args:
      url (str): The URL to analyze.

  Returns:
      int: 1 if "https" is found in the domain, 0 otherwise.
  """

  parsed_url = urlparse(url)
  domain = parsed_url.netloc
  return 1 if 'https' in domain else 0


2.1.8. Using URL Shortening Services “TinyURL”

URL shortening is a method on the “World Wide Web” in which a URL may be made considerably smaller in length and still lead to the required webpage. This is accomplished by means of an “HTTP Redirect” on a domain name that is short, which links to the webpage that has a long URL.

In [19]:
shortening_services = r"bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|" \
                      r"yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|" \
                      r"short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|" \
                      r"doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|db\.tt|" \
                      r"qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|q\.gs|is\.gd|" \
                      r"po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|x\.co|" \
                      r"prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|" \
                      r"tr\.im|link\.zip\.net"

def has_shortening_service(url: str) -> int:
  """
  Checks if the URL contains a known shortening service domain.

  Phishing URLs might sometimes use URL shortening services to obfuscate the real target.
  This feature provides a basic indicator, but should be used with caution. Legitimate URLs can also be shortened.

  Args:
      url (str): The URL to analyze.

  Returns:
      int: 1 if the URL contains a shortening service domain, 0 otherwise.
  """

  match = re.search(shortening_services, url)
  return 1 if match else 0


2.1.9. Prefix or Suffix "-" 

Checking the presence of '-' in the domain part of URL. The dash symbol is rarely used in legitimate URLs. Phishers tend to add prefixes or suffixes separated by (-) to the domain name so that users feel that they are dealing with a legitimate webpage.

In [20]:
from urllib.parse import urlparse


def has_prefix_or_suffix_in_domain(url: str) -> int:
  """
  Checks if the domain part of the URL contains a prefix or suffix separated by a hyphen (-).

  This feature can be indicative of some phishing techniques, but should be used with caution. 
  Legitimate domains can also contain hyphens.

  Args:
      url (str): The URL to analyze.

  Returns:
      int: 1 if the domain has a prefix or suffix separated by "-", 0 otherwise.
  """

  parsed_url = urlparse(url)
  domain = parsed_url.netloc
  if domain and '-' in domain:
    return 1
  else:
    return 0


3.2. Domain Based Features

The following features were considered in this project:

    1.DNS Record

    2.Age of Domain

    3.End Period of Domain

In [21]:
%pip install python-whois


Defaulting to user installation because normal site-packages is not writeable

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m24.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [22]:
import re
from bs4 import BeautifulSoup
import whois
import urllib
import urllib.request
from datetime import datetime

2.2.1. DNS Record

For phishing websites, either the claimed identity is not recognized by the WHOIS database or no records founded for the hostname. If the DNS record is empty or not found then, the value assigned to this feature is 1 (phishing) or else 0 (legitimate).

In [23]:
#DNS Record availability (DNS_Record)
""" obtained in the featureExtraction function itself """

' obtained in the featureExtraction function itself '

3.2.3. Age of Domain

This feature can be extracted from WHOIS database. Most phishing websites live for a short period of time. The minimum age of the legitimate domain is considered to be 12 months for this project. Age here is nothing but different between creation and expiration time.

If age of domain > 12 months, the vlaue of this feature is 1 (phishing) else 0 (legitimate).

In [24]:
# 13.Survival time of domain: The difference between termination time and creation time (Domain_Age)  
def domainAge(domain_name)-> int:
  """
  Calculates the age of the domain in months.

  args:
      domain_name (str): The domain name to analyze.

  returns:
      int: 1 if the domain age is less than 3 months, 0 otherwise.
  """
  creation_date = domain_name.creation_date
  expiration_date = domain_name.expiration_date
  if (isinstance(creation_date,str) or isinstance(expiration_date,str)):
    try:
      creation_date = datetime.strptime(creation_date,'%Y-%m-%d')
      expiration_date = datetime.strptime(expiration_date,"%Y-%m-%d")
    except:
      return 1
  if ((expiration_date is None) or (creation_date is None)):
      return 1
  elif ((type(expiration_date) is list) or (type(creation_date) is list)):
      return 1
  else:
    ageofdomain = abs((expiration_date - creation_date).days)
    if ((ageofdomain/30) < 3):
      age = 1
    else:
      age = 0
  return age

3.2.4. End Period of Domain

This feature can be extracted from WHOIS database. For this feature, the remaining domain time is calculated by finding the different between expiration time & current time. The end period considered for the legitimate domain is 6 months or less for this project.

If end period of domain > 6 months, the vlaue of this feature is 1 (phishing) else 0 (legitimate).

In [25]:
from datetime import datetime
from typing import Optional

def calculate_domain_end_feature(domain_name):
  """
  Calculates the feature based on the domain's expiration date (if possible).

  Args:
      domain_name (str): The domain name to analyze.

  Returns:
      Optional[int]:
          - 1: Domain is close to expiration (potentially suspicious).
          - 0: Domain has a valid expiration date (less suspicious).
          - None: Domain expiration information unavailable or invalid.
  """

  expiration_date = domain_name.expiration_date
  if isinstance(expiration_date,str):
    try:
      expiration_date = datetime.strptime(expiration_date,"%Y-%m-%d")
    except:
      return 1
  if (expiration_date is None):
      return 1
  elif (type(expiration_date) is list):
      return 1
  else:
    today = datetime.now()
    end = abs((expiration_date - today).days)
    if ((end/30) < 6):
      end = 0
    else:
      end = 1
  return end


### 2.2HTML and Javascript based Features

        1.IFrame Redirection

        2.Status Bar Customization

        3.Disabling Right Click

        4.Website Forwarding

In [26]:
import requests

3..1. IFrame Redirection

IFrame is an HTML tag used to display an additional webpage into one that is currently shown. Phishers can make use of the “iframe” tag and make it invisible i.e. without frame borders. In this regard, phishers make use of the “frameBorder” attribute which causes the browser to render a visual delineation.

If the iframe is empty or repsonse is not found then, the value assigned to this feature is 1 (phishing) or else 0 (legitimate).

In [27]:
def iframe(response)->int:
  """
  Checks if the response contains an iframe element.
  
  Args:
      response: The HTTP response object containing the HTML content.

  Returns:
      int: 1 if the response contains an iframe, 0 otherwise.
  """
  if response == "":
      return 1
  else:
      if re.findall(r"[|]", response.text):
          return 0
      else:
          return 1

3.3.3. Disabling Right Click

Phishers use JavaScript to disable the right-click function, so that users cannot view and save the webpage source code. This feature is treated exactly as “Using onMouseOver to hide the Link”. Nonetheless, for this feature, we will search for event “event.button==2” in the webpage source code and check if the right click is disabled.



In [28]:
import re

def has_right_click_javascript(response) -> int:
  """
  Checks for presence of JavaScript code related to right-click events.

  This approach is limited as it relies on finding specific patterns, 
  which might not be present in all cases.

  Args:
      response (str): The HTML content of the response.

  Returns:
      int: 1 (potential phishing) if right-click related JavaScript is found, 
          0 otherwise.
  """

  if not response:
    return 1  # Empty response, potential issue

  # Look for patterns related to right-click events (adjust as needed)
  patterns = [r"oncontextmenu", r"event\.button ?== ?2"]
  for pattern in patterns:
    if re.search(pattern, response, re.IGNORECASE):
      return 1
  return 0


3.3.4. Website Forwarding

The fine line that distinguishes phishing websites from legitimate ones is how many times a website has been redirected. 
In our dataset, we find that legitimate websites have been redirected one time max. On the other hand, phishing websites containing this feature have been redirected at least 4 times.

In [29]:
def has_suspicious_redirects(response) -> int:
  """
  Checks if the response has a suspicious number of redirects.

  This feature is an indicator, but some legitimate websites might use redirects.
  Consider the context and use it with other features for effective phishing detection.

  Args:
      response: The HTTP response object containing the redirect history.

  Returns:
      int: 1 if the number of redirects is suspicious (configurable threshold),
          0 otherwise.
  """

  if not response:
    return 1  # Empty response, potential issue

  # Adjustable threshold for suspicious redirects (default: 3)
  threshold = 3

  # Check if the number of redirects exceeds the threshold
  num_redirects = len(response.history)
  return 1 if num_redirects > threshold else 0


## 4. Computing URL Features

Create a list and a function that calls the other functions and stores all the features of the URL in the list. We will extract the features of each URL and append to this list.

In [30]:
def extract_features(url, label):
  """
  Extracts features from a given URL for phishing detection.

  This function combines various feature extraction methods. 
  Consider the limitations and security implications of some features.

  Args:
      url (str): The URL to analyze.
      label (int): The label (1 for phishing, 0 for legitimate) for training.

  Returns:
      list: A list of extracted features.
  """

  features = []

  # Address bar based features (10)
  features.append(get_domain_category(url))  
  features.append(has_ip_address_category(url))
  features.append(has_at_symbol(url))
  features.append(get_url_length_category(url))
  features.append(get_url_depth_category(url))  
  features.append(has_double_slash_redirection(url))  
  features.append(has_https_in_domain(url))
  features.append(has_shortening_service(url))  
  features.append(has_prefix_or_suffix_in_domain(url))  

  # Domain based features (4)
  dns_error = 0
  try:
    domain_name = whois.whois(urlparse(url).netloc)
  except:
    dns_error = 1

  features.append(dns_error)
  features.append(1 if dns_error == 1 else domainAge(domain_name))
  features.append(1 if dns_error == 1 else calculate_domain_end_feature(domain_name))

  # HTML & Javascript based features (4)
  response = None
  try:
    response = requests.get(url)
  except:
    response =""

  features.append(iframe(response)) 
  features.append(has_right_click_javascript(response))
  features.append(has_suspicious_redirects(response))

  features.append(label)

  return features


# 4.1. Legitimate URLs:

Now, feature extraction is done on legitimate URLs.

In [31]:
legiturl.shape

(5000, 1)

#### The following code takes a lot of time around 15 mins

In [35]:
from concurrent.futures import ThreadPoolExecutor
legit_features = []
label = 0
with ThreadPoolExecutor(max_workers=8) as executor:  # Adjust max_workers as needed
  legit_features = list(executor.map(extract_features, legiturl['URLs'], [label] * 5000))


2024-07-15 21:26:01,319 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno -2] Name or service not known
2024-07-15 21:26:04,393 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno -2] Name or service not known
2024-07-15 21:26:08,042 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno -2] Name or service not known
2024-07-15 21:26:10,216 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno -2] Name or service not known
2024-07-15 21:26:10,943 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno -2] Name or service not known
2024-07-15 21:26:12,021 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno -2] Name or service not known
2024-07-15 21:26:13,473 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno -2] Name or service not known
2024-07-15 21:26:14,437 - whois.whois - E

In [38]:
feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', 
                      'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record', 
                      'Domain_Age', 'Domain_End', 'iFrame','Right_Click', 'Web_Forwards', 'Label']

legitimate = pd.DataFrame(legit_features, columns= feature_names)
legitimate.head()

Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,TinyURL,Prefix/Suffix,DNS_Record,Domain_Age,Domain_End,iFrame,Right_Click,Web_Forwards,Label
0,1,0,0,1,1,1,0,0,0,0,1,1,1,1,1,0
1,1,0,0,1,1,1,0,0,0,0,0,1,1,1,1,0
2,1,0,0,1,1,1,0,0,0,0,0,1,1,1,1,0
3,1,0,0,1,3,1,0,0,0,0,0,0,1,1,1,0
4,1,0,0,1,3,1,0,0,0,0,0,1,1,1,1,0


In [None]:
legitimate.to_csv('legitimate.csv', index= False)

## 4.2. Phishing URLs:

Now, feature extraction is performed on phishing URLs.


In [39]:

phishurl.shape

(5000, 8)

In [41]:
phish_features = []
label = 1
with ThreadPoolExecutor(max_workers=12) as executor:  # Adjust max_workers as needed
  legit_features = list(executor.map(extract_features, legiturl['URLs'], [label] * 5000))
feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', 
                      'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record', 
                      'Domain_Age', 'Domain_End', 'iFrame','Right_Click', 'Web_Forwards', 'Label']

phishing = pd.DataFrame(phish_features, columns= feature_names)
phishing.head()

2024-07-15 21:46:24,028 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno -2] Name or service not known
2024-07-15 21:46:26,374 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno -2] Name or service not known
2024-07-15 21:46:30,171 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno -2] Name or service not known
2024-07-15 21:46:32,315 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno -2] Name or service not known
2024-07-15 21:46:33,240 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno -2] Name or service not known
2024-07-15 21:46:33,862 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno -2] Name or service not known
2024-07-15 21:46:35,271 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno -2] Name or service not known
2024-07-15 21:46:36,843 - whois.whois - E

Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,TinyURL,Prefix/Suffix,DNS_Record,Domain_Age,Domain_End,iFrame,Right_Click,Web_Forwards,Label


In [42]:
#Storing to extraceted URLs
phishing.to_csv('phishing.csv', index= False)

# Final Dataset

In the above section we formed two dataframes of legitimate & phishing URL features. Now, we will combine them to a single dataframe and export the data to csv file for the Machine Learning training done in other notebook.

In [43]:
urldata = pd.concat([legitimate, phishing]).reset_index(drop=True)
urldata.head()

Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,TinyURL,Prefix/Suffix,DNS_Record,Domain_Age,Domain_End,iFrame,Right_Click,Web_Forwards,Label
0,1,0,0,1,1,1,0,0,0,0,1,1,1,1,1,0
1,1,0,0,1,1,1,0,0,0,0,0,1,1,1,1,0
2,1,0,0,1,1,1,0,0,0,0,0,1,1,1,1,0
3,1,0,0,1,3,1,0,0,0,0,0,0,1,1,1,0
4,1,0,0,1,3,1,0,0,0,0,0,1,1,1,1,0


In [44]:
urldata.tail()

Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,TinyURL,Prefix/Suffix,DNS_Record,Domain_Age,Domain_End,iFrame,Right_Click,Web_Forwards,Label
4995,1,0,0,1,1,1,0,1,0,0,0,1,1,1,1,0
4996,1,0,0,1,7,1,0,0,0,0,0,1,1,1,1,0
4997,1,0,1,1,2,1,0,0,0,0,0,1,1,1,1,0
4998,1,0,0,1,6,1,0,0,0,0,0,1,1,1,1,0
4999,1,0,0,1,2,1,0,0,0,0,0,1,1,1,1,0


In [45]:
urldata.shape

(5000, 16)

In [47]:

# Storing the data in CSV file
urldata.to_csv('urldata.csv', index=False)