### Introduction

A URL begins with a scheme that names the protocol, most commonly the **Hypertext Transfer Protocol (HTTP)** or the **Hypertext Transfer Protocol Secure (HTTPS)**. Other protocols include the **File Transfer Protocol (FTP)**, the **Simple Mail Transfer Protocol (SMTP)**, and others, such as Telnet and DNS.

Beyond the scheme, a URL consists of the hostname (which ends in the top-level domain), an optional port, and the path of the web address.
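As a rough illustration of these parts, the sketch below splits a URL with the standard library's `urlparse`. The URL is a hypothetical example, and taking the last dot-separated label as the top-level domain is a simplifying assumption that does not handle multi-part suffixes such as `co.uk`.

```python
from urllib.parse import urlparse

# Hypothetical URL, used only for illustration
example_url = "https://www.example.com:8443/docs/index.html?lang=en"

parts = urlparse(example_url)

scheme = parts.scheme      # protocol named by the URL
hostname = parts.hostname  # host without the port
port = parts.port          # explicit port as an int, or None if absent
path = parts.path          # path on the server
query = parts.query        # query string without the leading "?"

# Naive TLD guess (assumption): last dot-separated label of the hostname
tld = hostname.rsplit(".", 1)[-1]

print(scheme, hostname, port, path, query, tld)
```

The same `urlparse` call is what `url_cleanse` relies on to separate the domain from the path.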


In [34]:
import pandas as pd
import numpy as np
import random 
import pickle

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

import re
from urllib.parse import urlparse
import urllib
from xml.dom import minidom 
import csv 
import pygeoip


In [28]:
def url_cleanse(web_url):
    web_url = web_url.lower()  # Convert to lowercase
    parsed_url = urlparse(web_url)
    
    # Extract domain and path separately
    domain_parts = re.split(r'[.\-]', parsed_url.netloc)  # Split by dot & hyphen
    path_parts = re.split(r'[\/\-]', parsed_url.path)  # Split by slash & hyphen
    
    # Remove common TLDs and empty strings
    common_tlds = {'com', 'net', 'org', 'gov', 'edu', 'co', 'uk', 'us'}
    url_tokens = [part for part in (domain_parts + path_parts) if part and part not in common_tlds]

    return list(set(url_tokens))  # Return unique tokens

In [None]:
test = url_cleanse("https://www.Iamstrongerthanthe_mountain.com/Dataset/ForgiveMe")
print(test)

**Lousy URLs** are URLs that have been created with malicious intent. They are often the precursors to cyberattacks. These URLs can hit close to home, leaving any of us vulnerable to bad sites that we might visit on purpose or by accident.


**Drive-by download URLs** are URLs that promote the unintended download of software from websites. The download can start when an unsuspecting user clicks on a URL without knowing the consequences of this action.

**Command and Control URLs** are URLs linked to malware that connects the target computer to command and control servers. They differ from other URLs categorized as malicious because it is not always a virus that connects to command and control URLs via external or remote servers.

**Phishing URLs** are a form of attack that steals sensitive data, such as **personally identifiable information (PII)**, by either luring the user or disguising the URL as a legitimate or trustworthy URL.

In [5]:
from pathlib import Path

dataset = Path.cwd().resolve().parents[1] / "introduction" / "datasets" / "dataset_phishing.csv"

In [6]:
raw_df = pd.read_csv(dataset)
raw_df.head()

Unnamed: 0,url,length_url,length_hostname,ip,nb_dots,nb_hyphens,nb_at,nb_qm,nb_and,nb_or,...,domain_in_title,domain_with_copyright,whois_registered_domain,domain_registration_length,domain_age,web_traffic,dns_record,google_index,page_rank,status
0,http://www.crestonwood.com/router.php,37,19,0,3,0,0,0,0,0,...,0,1,0,45,-1,0,1,1,4,legitimate
1,http://shadetreetechnology.com/V4/validation/a...,77,23,1,1,0,0,0,0,0,...,1,0,0,77,5767,0,0,1,2,phishing
2,https://support-appleld.com.secureupdate.duila...,126,50,1,4,1,0,1,2,0,...,1,0,0,14,4004,5828815,0,1,0,phishing
3,http://rgipt.ac.in,18,11,0,2,0,0,0,0,0,...,1,0,0,62,-1,107721,0,0,3,legitimate
4,http://www.iracing.com/tracks/gateway-motorspo...,55,15,0,2,2,0,0,0,0,...,0,1,0,224,8175,8725,0,0,6,legitimate


In [25]:
df = raw_df[['url', 'status']]
df.head()

Unnamed: 0,url,status
0,http://www.crestonwood.com/router.php,legitimate
1,http://shadetreetechnology.com/V4/validation/a...,phishing
2,https://support-appleld.com.secureupdate.duila...,phishing
3,http://rgipt.ac.in,legitimate
4,http://www.iracing.com/tracks/gateway-motorspo...,legitimate


In [27]:
# Encode labels: 0 = legitimate, 1 = phishing.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# that is raised when writing into a slice of raw_df.
df = df.assign(status=(df['status'] != 'legitimate').astype(int))
df.head()


Unnamed: 0,url,status
0,http://www.crestonwood.com/router.php,0
1,http://shadetreetechnology.com/V4/validation/a...,1
2,https://support-appleld.com.secureupdate.duila...,1
3,http://rgipt.ac.in,0
4,http://www.iracing.com/tracks/gateway-motorspo...,0


In [21]:
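def get_ipaddress(tokenized_words):
    """Return 1 if the tokens contain a run of four or more numeric tokens
    (a likely IPv4 address), otherwise 0."""
    count = 0
    for element in tokenized_words:
        if str(element).isnumeric():
            count += 1
        else:
            if count >= 4:
                return 1
            count = 0  # run broken, start counting again
    return 1 if count >= 4 else 0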


### Tokenize URL

In [22]:

def url_tokenize(url):
    """
    Tokenizes a given URL using non-alphanumeric characters and returns:
    1. The average token length (excluding empty tokens),
    2. The number of valid (non-empty) tokens,
    3. The length of the largest token.

    Parameters:
        url (str): The input URL string.

    Returns:
        list: [average_token_length (float), num_tokens (int), max_token_length (int)]
    """
    tokenized_words = re.split(r'\W+', url)
    num_tokens = 0
    total_length = 0
    max_length = 0

    for token in tokenized_words:
        length = len(token)
        if length > 0:
            num_tokens += 1
            total_length += length
            if length > max_length:
                max_length = length

    try:
        avg_length = total_length / num_tokens if num_tokens > 0 else 0
        return [avg_length, num_tokens, max_length]
    except Exception as e:
        return [0, num_tokens, max_length]

    


In [1]:
def url_has_exe(url):
    # str.find returns -1 (not 1) when the substring is absent
    return 1 if url.find('.exe') != -1 else 0


def get_sec_sensitive_words(tokenized_words):
    sec_sen_words = ['confirm', 'account', 'banking', 'secure', 'rolex', 'login', 'signin']
    count = 0
    for element in sec_sen_words:
        if element in tokenized_words:
            count += 1
    # Return after the loop so every sensitive word is checked
    return count

## TF-IDF 
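TF-IDF (term frequency-inverse document frequency) measures how important a term is to one document relative to the whole collection of documents. Here each URL is a document, and the terms are character n-grams of length 3 to 5.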
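To make the idea concrete, here is a minimal pure-Python sketch of a plain TF-IDF score over character trigrams. The helper names (`char_ngrams`, `tf_idf`) and the three URLs are hypothetical. It uses the textbook formula `tf * log(N / df)`, whereas scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so the numbers will not match the notebook's vectorizer.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tf_idf(docs, n=3):
    """Compute a plain TF-IDF score per n-gram for each document.

    tf  = occurrences of the n-gram in the document / n-grams in the document
    idf = log(N / df), with df = number of documents containing the n-gram
    """
    tokenized = [char_ngrams(doc, n) for doc in docs]

    # Document frequency: count each n-gram once per document
    doc_freq = Counter()
    for grams in tokenized:
        doc_freq.update(set(grams))

    n_docs = len(docs)
    scores = []
    for grams in tokenized:
        counts = Counter(grams)
        total = len(grams)
        scores.append({
            gram: (count / total) * math.log(n_docs / doc_freq[gram])
            for gram, count in counts.items()
        })
    return scores


# Hypothetical URLs, used only for illustration
urls = ["login.example.com", "secure-login.example.net", "example.org/about"]
scores = tf_idf(urls)

# "exa" occurs in every URL, so its IDF (and score) is zero;
# "log" occurs in only two of the three, so it scores above zero.
print(scores[0]["exa"], scores[0]["log"])
```

N-grams shared by every URL carry no weight, while rarer ones stand out, which is the property a classifier can exploit.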

In [35]:
X = df['url']
y = df['status']
url_vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3,5))
X_tfidf = url_vectorizer.fit_transform(X)

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=.2, random_state=42)

In [37]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

In [38]:
y_pred = model.predict(X_test)

In [40]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.910761154855643

Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.90      0.91      1157
           1       0.90      0.92      0.91      1129

    accuracy                           0.91      2286
   macro avg       0.91      0.91      0.91      2286
weighted avg       0.91      0.91      0.91      2286



### Use Support Vector Machine 
SVM is a popular method for classifying whether a URL is malicious or not.

In [41]:
from sklearn.svm import LinearSVC

svm_model = LinearSVC()
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)

In [42]:
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print("\nSVM Classification Report:\n", classification_report(y_test, y_pred_svm))

SVM Accuracy: 0.9483814523184602

SVM Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.96      0.95      1157
           1       0.96      0.94      0.95      1129

    accuracy                           0.95      2286
   macro avg       0.95      0.95      0.95      2286
weighted avg       0.95      0.95      0.95      2286



### Use of Random Forest


In [43]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))

Random Forest Accuracy: 0.915573053368329


# All Together
Here is a comparison of multiple ML models (Logistic Regression, Linear SVM, Random Forest, Naive Bayes, and, if it is installed, XGBoost).

In [None]:
# 1. Imports
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

In [None]:
# Optional: XGBoost
try:
    from xgboost import XGBClassifier
    has_xgb = True
except ImportError:
    has_xgb = False

In [None]:
# 2. Select features and labels
X = df['url']
y = df['status']

In [None]:
# 3. TF-IDF vectorization
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5))
X_tfidf = vectorizer.fit_transform(X)

In [None]:
# 4. Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

In [None]:
# 5. Define models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Naive Bayes": MultinomialNB()
}

if has_xgb:
    models["XGBoost"] = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

# 6. Train and evaluate each model
results = []

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    results.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1 Score": f1_score(y_test, y_pred)
    })

# 7. Show results
results_df = pd.DataFrame(results).sort_values(by="F1 Score", ascending=False)
print(results_df.to_string(index=False))