### Introduction

A URL begins with a scheme that names the protocol used to reach the resource, most commonly the **Hypertext Transfer Protocol (HTTP)** or the **Hypertext Transfer Protocol Secure (HTTPS)**. Other protocols include the **File Transfer Protocol (FTP)**, the **Simple Mail Transfer Protocol (SMTP)**, Telnet, DNS, and so on.

Beyond the protocol, a URL consists of the hostname (which ends in the top-level domain), an optional port, and the path of the web address.
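
As a rough illustration of these parts, the sketch below splits a URL with the standard library's `urllib.parse.urlparse`. The example address and the helper name `describe_url` are hypothetical, and treating the last dot-separated label of the hostname as the top-level domain is a simplifying assumption (it does not handle multi-label suffixes such as `co.uk`).

```python
from urllib.parse import urlparse

def describe_url(web_url):
    """Split a URL into scheme, hostname, top-level domain, port, and path."""
    parsed = urlparse(web_url)
    hostname = parsed.hostname or ""
    # Assumption: the last dot-separated label of the hostname is the TLD.
    tld = hostname.rsplit(".", 1)[-1] if "." in hostname else ""
    return {
        "scheme": parsed.scheme,
        "hostname": hostname,
        "tld": tld,
        "port": parsed.port,  # None when the URL gives no explicit port
        "path": parsed.path,
    }

# Hypothetical example address, used only to show the fields.
parts = describe_url("https://www.example.com:8443/docs/index.html")
print(parts)
```

When the URL carries no explicit port, `parsed.port` is `None`, and the client falls back to the default port for the scheme.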

In [5]:
import pandas as pd
import numpy as np
import random 
import pickle

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

import re
from urllib.parse import urlparse

In [1]:
import re
from urllib.parse import urlparse

def url_cleanse(web_url):
    """Split a URL into unique lowercase tokens from its domain and path."""
    web_url = web_url.lower()  # Normalize case so tokens compare equal
    parsed_url = urlparse(web_url)

    # Tokenize the domain and the path separately
    domain_parts = re.split(r'[.\-]', parsed_url.netloc)  # Split on dots and hyphens
    path_parts = re.split(r'[/\-]', parsed_url.path)  # Split on slashes and hyphens

    # Drop empty strings and common TLD labels
    common_tlds = {'com', 'net', 'org', 'gov', 'edu', 'co', 'uk', 'us'}
    url_tokens = [part for part in (domain_parts + path_parts) if part and part not in common_tlds]

    # Deduplicate while keeping first-seen order, so the output is reproducible
    return list(dict.fromkeys(url_tokens))

In [2]:
test = url_cleanse("https://www.Iamstrongerthanthe_mountain.com/Dataset/ForgiveMe")
print(test)

['www', 'iamstrongerthanthe_mountain', 'dataset', 'forgiveme']

**Lousy URLs** are URLs that have been created with malicious intent. They are often precursors to cyberattacks. These URLs can hit close to home: anyone can land on a malicious site, on purpose or by accident, which leaves every user vulnerable.


**Drive-by download URLs** are URLs that trigger the unintended download of software from websites. The download can start when an unsuspecting user clicks the URL without knowing the consequences of this action.

**Command and control URLs** are URLs linked to malware that connects the target computer to command and control servers. They differ from other URLs categorized as malicious in that it is not always a virus that connects to the command and control URLs via external or remote servers.

**Phishing URLs** are used in a form of attack that steals sensitive data, such as **personally identifiable information (PII)**, by either luring the user or disguising the URL as a legitimate, trustworthy one.
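
Because a phishing URL tries to look legitimate, simple counts taken from the URL string are a common starting point for telling the two apart. The sketch below computes a few such counts. The feature names are borrowed from columns of the phishing dataset used in this notebook (`length_url`, `length_hostname`, `nb_dots`, `nb_hyphens`, `nb_at`), but how that dataset actually computed them is an assumption here, and both the helper name `lexical_features` and the example URL are hypothetical.

```python
from urllib.parse import urlparse

def lexical_features(web_url):
    """Compute a few simple counts from the URL string alone."""
    hostname = urlparse(web_url).hostname or ""
    return {
        "length_url": len(web_url),        # total characters in the URL
        "length_hostname": len(hostname),  # characters in the hostname only
        "nb_dots": web_url.count("."),     # assumption: counted over the whole URL
        "nb_hyphens": web_url.count("-"),
        "nb_at": web_url.count("@"),
    }

# Hypothetical URL, made up for illustration only.
example_url = "http://secure-login.example.com/account@verify"
features = lexical_features(example_url)
print(features)
```

Counts like these say nothing on their own; they only become useful once a model compares them across many labeled URLs.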

In [3]:
from pathlib import Path

dataset = Path.cwd().resolve().parents[1] / "introduction" / "datasets" / "dataset_phishing.csv"

In [None]:
df = pd.read_csv(dataset)
df.columns  # List the feature columns available in the dataset

Index(['url', 'length_url', 'length_hostname', 'ip', 'nb_dots', 'nb_hyphens',
       'nb_at', 'nb_qm', 'nb_and', 'nb_or', 'nb_eq', 'nb_underscore',
       'nb_tilde', 'nb_percent', 'nb_slash', 'nb_star', 'nb_colon', 'nb_comma',
       'nb_semicolumn', 'nb_dollar', 'nb_space', 'nb_www', 'nb_com',
       'nb_dslash', 'http_in_path', 'https_token', 'ratio_digits_url',
       'ratio_digits_host', 'punycode', 'port', 'tld_in_path',
       'tld_in_subdomain', 'abnormal_subdomain', 'nb_subdomains',
       'prefix_suffix', 'random_domain', 'shortening_service',
       'path_extension', 'nb_redirection', 'nb_external_redirection',
       'length_words_raw', 'char_repeat', 'shortest_words_raw',
       'shortest_word_host', 'shortest_word_path', 'longest_words_raw',
       'longest_word_host', 'longest_word_path', 'avg_words_raw',
       'avg_word_host', 'avg_word_path', 'phish_hints', 'domain_in_brand',
       'brand_in_subdomain', 'brand_in_path', 'suspecious_tld',
       'statistical_report', 