#**Phishing Detection**

**Create a Copy of this Notebook in your own Drive**


In this assignment you will be coding your first neural network! The goal is to build a network that identifies whether a website is legitimate or a scam.



[Here](https://www.kaggle.com/datasets/shashwatwork/phishing-dataset-for-machine-learning/data) is the dataset you will use to train your network. Read about the



There is a way to connect Kaggle datasets directly to Python and import them as we did with the torch dataset. **However**, that is not something I am familiar with, so I encourage you to download the CSV file from the website and upload it to the Files tab in Colab.

####TASK 0####
Click the link above and read "About Dataset"

**NOTE:** As with most coding there may be multiple ways to complete the task. Please attempt to be efficient, using good variable names and comments where necessary.
Using the internet, including ChatGPT and other LLMs, is allowed for bite-sized pieces. (The Kaggle website may have some handy examples to take pieces from.) *Be warned*, ChatGPT can be dumb; it works best when you know what you need and use it to do the busy work.

##TASK 1##
Read in the data and use the [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to create a training and testing dataset.

You must do some preprocessing before using this function, mainly removing the ids, and separating the labels.

After you have read in your data, the `.info()` method is a nice way to see what you are working with.

Please use `test_size=0.2`.
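A minimal sketch of the preprocessing described above, run on a tiny made-up DataFrame rather than the real CSV. The column names `id` and `CLASS_LABEL` match the dataset; the toy values and the names `toy`, `features`, `labels`, and the `*_toy` split variables are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real CSV (values are made up for illustration)
toy = pd.DataFrame({
    "id": range(1, 11),
    "NumDots": [3, 2, 4, 1, 2, 3, 5, 2, 1, 4],
    "UrlLength": [72, 40, 95, 33, 51, 64, 120, 45, 30, 88],
    "CLASS_LABEL": [1, 0, 1, 0, 0, 1, 1, 0, 0, 1],
})

# Drop the identifier and separate the labels from the features
features = toy.drop(columns=["id", "CLASS_LABEL"])
labels = toy["CLASS_LABEL"]

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

print(X_train_toy.shape, X_test_toy.shape)
```

With 10 rows and `test_size=0.2`, 8 rows go to training and 2 to testing, and neither `id` nor the label appears among the feature columns.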

In [None]:
#Imports
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
import os
import kagglehub



In [None]:
# Configure Google Drive with Kaggle
from google.colab import drive
drive.mount('/content/drive')
os.environ['KAGGLE_CONFIG_DIR'] = '/content/drive/MyDrive/kaggle'
path = kagglehub.dataset_download("shashwatwork/phishing-dataset-for-machine-learning")
print("Path to dataset files:", path)

# Read the data set
data = pd.read_csv(os.path.join(path, "Phishing_Legitimate_full.csv"))

Mounted at /content/drive
Downloading from https://www.kaggle.com/api/v1/datasets/download/shashwatwork/phishing-dataset-for-machine-learning?dataset_version_number=1...


100%|██████████| 234k/234k [00:00<00:00, 431kB/s]

Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/shashwatwork/phishing-dataset-for-machine-learning/versions/1





In [None]:
# Exploring the data
data.info()

print(f"\n-------------------------------")

# Shape of data
print(f"{data.dtypes}\n")
print(f"Dimension: {data.shape[0]} x {data.shape[1]}\n")

print(f"\n-------------------------------")
# The id column is unique for each entry and should be dropped before training.
# The drop is commented out here, so id stays in the data used below.
# data = data.drop("id", axis=1)

print(f"\n-------------------------------")

# Checking data type of each column
datatype_counts = data.dtypes.value_counts()
for dtype, count in datatype_counts.items():
    print(f"{dtype}: {count} columns")

print(f"\n-------------------------------")

# Checking for null values in each column
null = data.isnull().sum()
# Iterate by label; positional null[i] on a labeled Series is deprecated in pandas
for column, missing in null.items():
    print(f"{column}: {missing} ({(missing/len(data))*100}%)")
total_missing = null.sum()
print(f"\nTotal missing values: {total_missing}\n")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 50 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   id                                  10000 non-null  int64  
 1   NumDots                             10000 non-null  int64  
 2   SubdomainLevel                      10000 non-null  int64  
 3   PathLevel                           10000 non-null  int64  
 4   UrlLength                           10000 non-null  int64  
 5   NumDash                             10000 non-null  int64  
 6   NumDashInHostname                   10000 non-null  int64  
 7   AtSymbol                            10000 non-null  int64  
 8   TildeSymbol                         10000 non-null  int64  
 9   NumUnderscore                       10000 non-null  int64  
 10  NumPercent                          10000 non-null  int64  
 11  NumQueryComponents                  10000 



In [None]:
# Assuming 'data' is a Pandas DataFrame
cols = data.columns.to_list()
cols.remove('CLASS_LABEL')  # Remove the class label column from features
X = data[cols].values  # Convert to NumPy array
y = data["CLASS_LABEL"].values

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((8000, 49), (2000, 49), (8000,), (2000,))

In [None]:
# Define a custom dataset
class CustomDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.long)  # Use long for classification

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Create dataset instances
train_dataset = CustomDataset(X_train, y_train)
test_dataset = CustomDataset(X_test, y_test)

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)


##TASK 2##
Following the PyTorch demo from class, create a linear neural network on which to train your dataset. Be sure to match your input and output dimensions.

Creating neat functions for the various pieces will be worth your time when we make small changes in the next part.
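One possible way to wrap the network construction in a function, so that the number and width of hidden layers can be changed in one place. This is only a sketch: `build_mlp`, `sketch_model`, and the sizes used here (48 inputs, one hidden layer of 64 units, 2 classes) are hypothetical choices, not the required architecture.

```python
import torch
import torch.nn as nn

def build_mlp(input_size, hidden_sizes, num_classes):
    """Stack a Linear + ReLU pair per hidden size, then a final Linear layer."""
    layers = []
    in_features = input_size
    for hidden in hidden_sizes:
        layers.append(nn.Linear(in_features, hidden))
        layers.append(nn.ReLU())
        in_features = hidden
    # The output layer returns raw logits, one per class
    layers.append(nn.Linear(in_features, num_classes))
    return nn.Sequential(*layers)

# Hypothetical sizes: 48 input features, one hidden layer of 64 units, 2 classes
sketch_model = build_mlp(input_size=48, hidden_sizes=[64], num_classes=2)
dummy_batch = torch.zeros(5, 48)  # a batch of 5 rows with 48 features each
print(sketch_model(dummy_batch).shape)
```

Passing a dummy batch through the model is a quick check that the input dimension matches the data and the output dimension matches the number of classes.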

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")


Using cuda device


In [None]:
import torch.nn as nn
import torch

# Define the Neural Network
class NeuralNetwork(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(input_size, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

# Define input size and number of classes based on data
input_size = X_train.shape[1]  # 49 columns here: 48 features plus id, which was not dropped
num_classes = len(set(y_train))  # Number of unique class labels

# Initialize the model
model = NeuralNetwork(input_size, num_classes)

# Print model summary
print(model)

print("Model's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=49, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=2, bias=True)
  )
)
Model's state_dict:
linear_relu_stack.0.weight 	 torch.Size([512, 49])
linear_relu_stack.0.bias 	 torch.Size([512])
linear_relu_stack.2.weight 	 torch.Size([512, 512])
linear_relu_stack.2.bias 	 torch.Size([512])
linear_relu_stack.4.weight 	 torch.Size([2, 512])
linear_relu_stack.4.bias 	 torch.Size([2])
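The shapes printed in the state_dict above also give the number of trainable parameters: each linear layer contributes a weight matrix plus a bias vector. A small sketch in plain Python follows; the helper name `count_linear_parameters` is an assumption, and the layer widths are read off the printed model.

```python
def count_linear_parameters(layer_sizes):
    """Sum weights and biases for a chain of fully connected layers."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
    return total

# Layer widths taken from the printed model: 49 -> 512 -> 512 -> 2
total_params = count_linear_parameters([49, 512, 512, 2])
print(total_params)  # 289282
```

Almost all of those parameters sit in the 512 x 512 hidden layer, which is worth keeping in mind when experimenting with smaller hidden dimensions.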


##TASK 3##
Train your network

**Mess around with your network and see if you have any answers to these questions. The goal is experience, there are no "right" answers**

How low can you get your loss with a single hidden layer?

What is the dimensionality of the hidden layers? If we were to plot dimensionality with loss, where would you expect to see the loss drop significantly?

Compare a few different loss functions.

Did you have overfitting problems? What solutions did you try? (Adding dropout, lowering the dimensions, other?)
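When comparing loss functions, it may help to know that for a two-class problem, cross-entropy over two logits and binary cross-entropy over a single logit (the class-1 logit minus the class-0 logit) give the same value. The sketch below checks this on random numbers; the tensors are made up and the variable names are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(6, 2)  # two logits per sample, like a 2-class output layer
targets = torch.tensor([0, 1, 1, 0, 1, 0])

# Option A: cross-entropy over the two class logits
ce = nn.CrossEntropyLoss()(logits, targets)

# Option B: binary cross-entropy on one logit per sample
single_logit = logits[:, 1] - logits[:, 0]
bce = nn.BCEWithLogitsLoss()(single_logit, targets.float())

print(ce.item(), bce.item())
```

Since these two agree, a more informative comparison is likely between losses that differ in kind, for example cross-entropy against a squared-error loss on the predicted probabilities.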

In [None]:
# Training Parameters
learning_rate = 1e-3
batch_size = 32
epochs = 5

# Initialize the loss function
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Print optimizer's state_dict
print("Optimizer's state_dict:")
for var_name in optimizer.state_dict():
    print(var_name, "\t", optimizer.state_dict()[var_name])


Optimizer's state_dict:
state 	 {}
param_groups 	 [{'lr': 0.001, 'betas': (0.9, 0.999), 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False, 'maximize': False, 'foreach': None, 'capturable': False, 'differentiable': False, 'fused': None, 'params': [0, 1, 2, 3, 4, 5]}]


In [None]:
def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    # Set the model to training mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            # Use the loader's own batch size rather than relying on a global variable
            loss, current = loss.item(), batch * dataloader.batch_size + len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")


def test_loop(dataloader, model, loss_fn):
    # Set the model to evaluation mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.eval()
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    # Evaluating the model with torch.no_grad() ensures that no gradients are computed during test mode
    # also serves to reduce unnecessary gradient computations and memory usage for tensors with requires_grad=True
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

In [None]:
epochs = 15
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_loader, model, loss_fn, optimizer)
    test_loop(test_loader, model, loss_fn)
print("Done!")

Epoch 1
-------------------------------
loss: 0.205133  [   32/ 8000]
loss: 0.028173  [ 3232/ 8000]
loss: 0.061050  [ 6432/ 8000]
Test Error: 
 Accuracy: 99.2%, Avg loss: 0.023506 

Epoch 2
-------------------------------
loss: 0.007695  [   32/ 8000]
loss: 0.028019  [ 3232/ 8000]
loss: 0.004026  [ 6432/ 8000]
Test Error: 
 Accuracy: 97.3%, Avg loss: 0.067770 

Epoch 3
-------------------------------
loss: 0.090518  [   32/ 8000]
loss: 0.014823  [ 3232/ 8000]
loss: 0.004819  [ 6432/ 8000]
Test Error: 
 Accuracy: 99.1%, Avg loss: 0.022969 

Epoch 4
-------------------------------
loss: 0.038242  [   32/ 8000]
loss: 0.032880  [ 3232/ 8000]
loss: 0.126623  [ 6432/ 8000]
Test Error: 
 Accuracy: 99.5%, Avg loss: 0.017567 

Epoch 5
-------------------------------
loss: 0.007973  [   32/ 8000]
loss: 0.001462  [ 3232/ 8000]
loss: 0.004765  [ 6432/ 8000]
Test Error: 
 Accuracy: 99.3%, Avg loss: 0.019859 

Epoch 6
-------------------------------
loss: 0.036761  [   32/ 8000]
loss: 0.061089  [ 32

In [None]:
# Convert test samples 700-999 (300 rows) to PyTorch tensors
X_sample = torch.tensor(X_test[700:1000], dtype=torch.float32)
y_sample = torch.tensor(y_test[700:1000], dtype=torch.long)

# Set model to evaluation mode
model.eval()

# Run inference on the selected test samples
with torch.no_grad():  # Disable gradient calculations for efficiency
    pred = model(X_sample)  # Get predictions
    predicted_labels = pred.argmax(dim=1)  # Convert logits to class predictions

# Count the samples where the prediction differs from the actual label
print("Predictions vs Actual Labels:\n" + "-" * 40)
inc = 0
for i in range(len(y_sample)):
    # print(f"Sample {i+1}: Predicted = {predicted_labels[i].item()} | Actual = {y_sample[i].item()}")
    # Compare the plain values; wrapping them in braces would build sets instead
    if predicted_labels[i].item() != y_sample[i].item():
        inc += 1
        print(f"Incorrect number: {inc}")


Predictions vs Actual Labels:
----------------------------------------
Incorrect number: 1
Incorrect number: 2
Incorrect number: 3


###Comments, concerns, sarcastic remarks?###
####Type any observations or answers below####



##TASK 4##

Please save your trained network in such a way that I can download it and evaluate it myself.

[Here](https://pytorch.org/tutorials/beginner/saving_loading_models.html) is a tutorial you can use.
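A minimal sketch of the save-and-reload round trip using a state_dict, on a small stand-in network rather than the trained model. The helper `make_model`, the file name `example_weights.pth`, and the layer sizes are all assumptions for illustration. Whoever loads the file needs the same architecture definition, since a state_dict stores only the parameters.

```python
import os
import tempfile

import torch
import torch.nn as nn

def make_model():
    # Small stand-in network; the layer sizes are arbitrary
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

trained = make_model()
checkpoint_path = os.path.join(tempfile.mkdtemp(), "example_weights.pth")

# Save only the learned parameters
torch.save(trained.state_dict(), checkpoint_path)

# To evaluate, rebuild the same architecture and load the parameters into it
restored = make_model()
restored.load_state_dict(torch.load(checkpoint_path, weights_only=True))
restored.eval()

sample = torch.ones(1, 4)
with torch.no_grad():
    same_output = torch.equal(trained(sample), restored(sample))
print(same_output)
```

Checking that the restored model reproduces the original outputs on a sample input is a cheap way to confirm the file is usable before handing it in.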

In [None]:
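# Save the model weights, then reload them to confirm the file is usable
PATH = "Assignment1.pth"  # a weights file should not carry a notebook (.ipynb) extension
torch.save(model.state_dict(), PATH)
model.load_state_dict(torch.load(PATH, weights_only=True))
model.eval()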

NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=49, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=2, bias=True)
  )
)

In [None]:
model_scripted = torch.jit.script(model) # Export to TorchScript
model_scripted.save('phishingAssignment1-Talha.pt') # Save