# Training Notebook 2
## Dropping Unimportant Features

Based on feature importance analysis, we identified the following features as having zero importance:

- `IsDomainIP` (Index 3)
- `NoOfEqualsInURL` (Index 6)
- `NoOfQMarkInURL` (Index 7)
- `URLEntropy` (Index 12)

Because the CSV files have no header row, we drop these columns by position, using the indices above. The reduced dataset is smaller, cheaper to store and train on, and contains only the features that contribute to the model's predictions.

The remaining columns are:

- `label`
- `URLLength`
- `DomainLength`
- `NoOfSubDomain`
- `LetterRatioInURL`
- `NoOfAmpersandInURL`
- `SpacialCharRatioInURL`
- `IsHTTPS`
- `CharContinuationRate`
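
The index-to-name mapping above can be made explicit in code, so that the drop indices are derived from feature names rather than typed by hand. This is a minimal sketch: the full column order in `ALL_COLUMNS` is inferred from the two lists above and should be treated as an assumption, and the names `ALL_COLUMNS`, `ZERO_IMPORTANCE`, and `split_columns` are hypothetical helpers, not part of the notebook.

```python
# Assumed column order of the headerless CSV files, inferred from the
# dropped indices and the list of remaining columns.
ALL_COLUMNS = [
    "label", "URLLength", "DomainLength", "IsDomainIP", "NoOfSubDomain",
    "LetterRatioInURL", "NoOfEqualsInURL", "NoOfQMarkInURL",
    "NoOfAmpersandInURL", "SpacialCharRatioInURL", "IsHTTPS",
    "CharContinuationRate", "URLEntropy",
]
ZERO_IMPORTANCE = {"IsDomainIP", "NoOfEqualsInURL", "NoOfQMarkInURL", "URLEntropy"}


def split_columns(all_columns, to_drop):
    """Return (indices to drop, names of the columns that remain)."""
    unknown = set(to_drop) - set(all_columns)
    if unknown:
        raise ValueError(f"Unknown columns: {sorted(unknown)}")
    drop_indices = [i for i, name in enumerate(all_columns) if name in to_drop]
    remaining = [name for name in all_columns if name not in to_drop]
    return drop_indices, remaining


drop_indices, remaining = split_columns(ALL_COLUMNS, ZERO_IMPORTANCE)
print(drop_indices)
print(remaining)
```

If the assumed order is right, `drop_indices` matches the indices listed above and `remaining` matches the list of remaining columns.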


In [None]:
# import libraries
import os
import tarfile

import boto3
import pandas as pd
import sagemaker
import xgboost as xgb
from sagemaker import Session, get_execution_role, image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# for endpoint
from sagemaker.model import Model
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

In [1]:
# Directories
input_dir = '../data/initial_processed_data'
output_dir = '../data/data_reduced_features'
os.makedirs(output_dir, exist_ok=True)

# Columns to drop (indices of the features to remove)
columns_to_drop = [3,6,7,12]

# Files to process
files = ['train.csv', 'validation.csv', 'test.csv']

for file in files:
    input_path = os.path.join(input_dir, file)
    output_path = os.path.join(output_dir, file)
    
    # Read CSV without headers
    df = pd.read_csv(input_path, header=None)
    
    # Drop the specified columns
    df_cleaned = df.drop(columns=columns_to_drop)
    
    # Save the cleaned dataset
    df_cleaned.to_csv(output_path, index=False, header=False)
    print(f"Processed and saved {file} -> {output_path}")

Processed and saved train.csv -> ../data/data_reduced_features/train.csv
Processed and saved validation.csv -> ../data/data_reduced_features/validation.csv
Processed and saved test.csv -> ../data/data_reduced_features/test.csv


In [2]:
import boto3

# S3 config
bucket = 'sagemaker-eu-west-1-277841471265'
s3_prefix = 'data/data_reduced_features'
s3 = boto3.client('s3')

# Upload files
for file in files:
    local_path = os.path.join(output_dir, file)
    s3_key = f"{s3_prefix}/{file}"
    
    s3.upload_file(local_path, bucket, s3_key)
    print(f"Uploaded {file} to s3://{bucket}/{s3_key}")

Uploaded train.csv to s3://sagemaker-eu-west-1-277841471265/data/data_reduced_features/train.csv
Uploaded validation.csv to s3://sagemaker-eu-west-1-277841471265/data/data_reduced_features/validation.csv
Uploaded test.csv to s3://sagemaker-eu-west-1-277841471265/data/data_reduced_features/test.csv


In [5]:
# Inspect data

# Paths
original_path = '../data/initial_processed_data/train.csv'
cleaned_path = '../data/data_reduced_features/train.csv'

# Load datasets (no headers!)
df_original = pd.read_csv(original_path, header=None)
df_cleaned = pd.read_csv(cleaned_path, header=None)

# Show head of original dataset
print("\n===== Original Dataset Head (first 5 rows) =====")
print(df_original.head())

# Show head of cleaned dataset
print("\n===== Cleaned Dataset Head (first 5 rows) =====")
print(df_cleaned.head())

# Show mapping of original to cleaned (skipping dropped columns)
original_cols_to_keep = [0, 1, 2, 4, 5, 8, 9, 10, 11]
print("\n===== Original Columns Kept Indices =====")
print(original_cols_to_keep)


===== Original Dataset Head (first 5 rows) =====
   0   1   2   3   4      5   6   7   8      9   10        11        12
0   1  23  15   0   1  0.783   0   0   0  0.217   1  0.181818  3.567040
1   1  40  32   0   1  0.875   0   0   0  0.125   1  0.128205  4.177567
2   1  27  19   0   1  0.815   0   0   0  0.185   1  0.153846  3.838040
3   1  25  17   0   1  0.800   0   0   0  0.200   1  0.166667  3.753270
4   1  30  22   0   1  0.800   0   0   0  0.200   1  0.137931  4.031402

===== Cleaned Dataset Head (first 5 rows) =====
   0   1   2  3      4  5      6  7         8
0  1  23  15  1  0.783  0  0.217  1  0.181818
1  1  40  32  1  0.875  0  0.125  1  0.128205
2  1  27  19  1  0.815  0  0.185  1  0.153846
3  1  25  17  1  0.800  0  0.200  1  0.166667
4  1  30  22  1  0.800  0  0.200  1  0.137931

===== Original Columns Kept Indices =====
[0, 1, 2, 4, 5, 8, 9, 10, 11]
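
The visual comparison above can also be done programmatically. The sketch below is one way to check that a cleaned frame equals the original with the dropped columns removed; `is_valid_reduction` is a hypothetical helper, and the two small frames are built from the first two rows of each `head()` output above rather than read from disk.

```python
import numpy as np
import pandas as pd


def is_valid_reduction(df_original, df_cleaned, dropped):
    """Check that df_cleaned equals df_original with the `dropped` columns removed."""
    expected = df_original.drop(columns=dropped)
    if expected.shape != df_cleaned.shape:
        return False
    # Compare values only: column labels are renumbered after the CSV round trip.
    return bool(np.allclose(expected.to_numpy(), df_cleaned.to_numpy()))


# First two rows of each head() output above.
original = pd.DataFrame([
    [1, 23, 15, 0, 1, 0.783, 0, 0, 0, 0.217, 1, 0.181818, 3.567040],
    [1, 40, 32, 0, 1, 0.875, 0, 0, 0, 0.125, 1, 0.128205, 4.177567],
])
cleaned = pd.DataFrame([
    [1, 23, 15, 1, 0.783, 0, 0.217, 1, 0.181818],
    [1, 40, 32, 1, 0.875, 0, 0.125, 1, 0.128205],
])

print(is_valid_reduction(original, cleaned, [3, 6, 7, 12]))
```

In the notebook, the same function could be called with `df_original` and `df_cleaned` to check every row instead of only the head.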


# Retrain with the new dataset

In [7]:
role = get_execution_role()
session = Session()
region = session.boto_region_name

# initialize hyperparameters
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"binary:logistic",
        "num_round":"300"}

# set an output path where the trained model will be saved
bucket = 'sagemaker-eu-west-1-277841471265'
s3_output_key = 'models/xgboost/v2'
output_path = f's3://{bucket}/{s3_output_key}'

# look up the URI of the SageMaker built-in XGBoost container image for this region.
# set the version argument to the container version you want to use.
xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.7-1")

# construct a SageMaker AI estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=role,
                                          instance_count=2, # demonstrating multi instance training
                                          instance_type='ml.m5.large', 
                                          volume_size=5, # 5 GB 
                                          output_path=output_path)


In [8]:
# define the data type and paths to the training and validation datasets
content_type = "text/csv"
bucket = 'sagemaker-eu-west-1-277841471265'
prefix = 'data/data_reduced_features'

train_input = TrainingInput(f"s3://{bucket}/{prefix}/train.csv", content_type=content_type)
validation_input = TrainingInput(f"s3://{bucket}/{prefix}/validation.csv", content_type=content_type)

In [9]:
# inspect path
f"s3://{bucket}/{prefix}/train.csv"

's3://sagemaker-eu-west-1-277841471265/data/data_reduced_features/train.csv'

In [10]:
estimator.fit({'train': train_input, 'validation': validation_input}, wait=True, logs="All")

INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2025-06-03-14-53-12-660


2025-06-03 14:53:14 Starting - Starting the training job...
2025-06-03 14:53:28 Starting - Preparing the instances for training...
2025-06-03 14:53:51 Downloading - Downloading input data...
2025-06-03 14:54:41 Downloading - Downloading the training image......
2025-06-03 14:55:42 Training - Training image download completed. Training in progress..
[2025-06-03 14:55:43.729 ip-10-0-182-119.eu-west-1.compute.internal:7 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2025-06-03 14:55:43.768 ip-10-0-182-119.eu-west-1.compute.internal:7 INFO profiler_config_parser.py:111] User has disabled profiler.
[2025-06-03:14:55:44:INFO] Imported framework sagemaker_xgboost_container.training
[2025-06-03:14:55:44:INFO] Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
[2025-06-03:14:55:44:INFO] No GPUs detected (normal if no gpus installed)
[2025-06-03:14:55:44:INFO] Running XGBoost Sagemak

The training job saved the model artifact to `s3://sagemaker-eu-west-1-277841471265/models/xgboost/v2/sagemaker-xgboost-2025-06-03-14-53-12-660/output/model.tar.gz`.

# Test the XGBoost Model
- We need evaluation metrics on the test set to show how the retrained model performs.

In [11]:
bucket = 'sagemaker-eu-west-1-277841471265'
prefix = 'data/data_reduced_features/test.csv'
local_file = '../data/local_test_data/test_v2.csv'

s3 = boto3.client('s3')
s3.download_file(bucket, prefix, local_file)
print(f"Downloaded {prefix} from S3 to {local_file}")


Downloaded data/data_reduced_features/test.csv from S3 to ../data/local_test_data/test_v2.csv


In [12]:
# Load CSV (no header, label is first column)
df_test = pd.read_csv(local_file, header=None)
print(df_test.head())

   0   1   2  3      4  5      6  7         8
0  1  24  16  1  0.792  0  0.208  1  0.217391
1  1  28  20  1  0.821  0  0.179  1  0.148148
2  1  24  16  1  0.792  0  0.208  1  0.217391
3  0  40  32  1  0.800  0  0.175  1  0.075000
4  1  21  13  2  0.714  0  0.286  1  0.200000


In [16]:
# Model location in S3
bucket = 'sagemaker-eu-west-1-277841471265'
model_key = 'models/xgboost/v2/sagemaker-xgboost-2025-06-03-14-53-12-660/output/model.tar.gz'
local_file = '../data/local_model_data/xgboost-v2/model.tar.gz'

# Download model file
s3 = boto3.client('s3')
s3.download_file(bucket, model_key, local_file)
print(f"Downloaded {model_key} from S3 to {local_file}")

Downloaded models/xgboost/v2/sagemaker-xgboost-2025-06-03-14-53-12-660/output/model.tar.gz from S3 to ../data/local_model_data/xgboost-v2/model.tar.gz


In [17]:
# Specify your desired target directory
target_dir = "../data/local_model_data/xgboost-v2/"

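with tarfile.open(local_file) as tar:
    # filter="data" rejects unsafe archive members; it requires a Python version that supports extraction filters
    tar.extractall(path=target_dir, filter="data")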

print(f"Model extracted to {target_dir}")

Model extracted to ../data/local_model_data/xgboost-v2/




In [18]:
booster = xgb.Booster()
booster.load_model('../data/local_model_data/xgboost-v2/xgboost-model')  # built-in XGBoost saves as this name
print("Model loaded!")

Model loaded!


In [19]:
# separate labels and features in the test data
y_test = df_test.iloc[:, 0].astype(int)  # first column = label
X_test = df_test.iloc[:, 1:]             # rest = features

In [20]:
dtest = xgb.DMatrix(X_test)
y_pred_prob = booster.predict(dtest)
y_pred = (y_pred_prob >= 0.5).astype(int)
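
With the `binary:logistic` objective, `booster.predict` returns a probability for class 1, and the cell above turns it into a label with a fixed cutoff of 0.5. The sketch below isolates that step so the cutoff can be varied; `to_labels` is a hypothetical helper and the scores are made up for illustration.

```python
import numpy as np


def to_labels(probabilities, threshold=0.5):
    """Map class-1 probabilities to 0/1 labels; a score equal to the threshold counts as 1."""
    return (np.asarray(probabilities) >= threshold).astype(int)


scores = [0.02, 0.49, 0.50, 0.97]  # made-up scores, not model output
print(to_labels(scores).tolist())       # [0, 0, 1, 1]
print(to_labels(scores, 0.9).tolist())  # [0, 0, 0, 1]
```

Raising the threshold trades recall on class 1 for precision, which may matter if the two error types have different costs.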

In [21]:
# Print Classification Report
print("\n===== Classification Report =====")
print(classification_report(y_test, y_pred))

# Print Accuracy
print("\n===== Accuracy Score =====")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")

# Print Confusion Matrix
print("\n===== Confusion Matrix =====")
conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)


===== Classification Report =====
              precision    recall  f1-score   support

           0       1.00      0.99      0.99     15142
           1       0.99      1.00      1.00     20228

    accuracy                           1.00     35370
   macro avg       1.00      0.99      1.00     35370
weighted avg       1.00      1.00      1.00     35370


===== Accuracy Score =====
Accuracy: 0.9954

===== Confusion Matrix =====
[[15012   130]
 [   33 20195]]
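
The figures in the classification report can be recomputed by hand from the confusion matrix, which is a useful sanity check. The sketch below uses the four counts printed above and relies on scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are illustrative.

```python
# Confusion matrix from the output above:
# rows are true labels (0, 1), columns are predicted labels (0, 1).
tn, fp = 15012, 130
fn, tp = 33, 20195

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

precision_1 = tp / (tp + fp)  # of the rows predicted 1, how many are truly 1
recall_1 = tp / (tp + fn)     # of the rows that are truly 1, how many were found
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"total rows:  {total}")
print(f"accuracy:    {accuracy:.4f}")
print(f"class 1:     precision={precision_1:.4f} recall={recall_1:.4f}")
print(f"class 0:     precision={precision_0:.4f} recall={recall_0:.4f}")
```

The total is 35370 and the accuracy rounds to 0.9954, in agreement with the report. The 163 misclassified rows split into 130 class-0 rows predicted as class 1 and 33 class-1 rows predicted as class 0.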


# Deploy Endpoint
- You can start from here: if the model is already trained, there is no need to retrain before deploying.

In [1]:
import sagemaker
from sagemaker import Session, get_execution_role
from sagemaker.model import Model
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Set up SageMaker session and role
role = get_execution_role()
session = Session()
region = session.boto_region_name

# Path to the model artifact
model_data = 's3://sagemaker-eu-west-1-277841471265/models/xgboost/v2/sagemaker-xgboost-2025-06-03-14-53-12-660/output/model.tar.gz'

# Create the Model object using SageMaker's built-in XGBoost image
xgboost_image_uri = sagemaker.image_uris.retrieve('xgboost', region, version='1.7-1')

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


In [2]:
model = Model(
    image_uri=xgboost_image_uri,
    model_data=model_data,
    role=role,
    sagemaker_session=session
)

In [3]:
# Deploy the model as an endpoint
predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large', 
    serializer=CSVSerializer(),
    deserializer=JSONDeserializer()
)

print("Endpoint deployed and ready for real-time inference!")

------!Endpoint deployed and ready for real-time inference!
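
Since the endpoint is configured with `CSVSerializer`, a request is one CSV line holding the eight features in the same order as the reduced training data, without the label. The sketch below builds such a line from the first test row shown earlier; `to_csv_payload` is a hypothetical helper, and the call to the live endpoint is left as a comment because it needs the deployed `predictor`.

```python
def to_csv_payload(features):
    """Serialize one feature row as a single CSV line (no label, no header)."""
    if len(features) != 8:
        raise ValueError(f"Expected 8 features, got {len(features)}")
    return ",".join(str(value) for value in features)


# Feature values of the first test row shown earlier, with the label removed.
row = [24, 16, 1, 0.792, 0, 0.208, 1, 0.217391]
payload = to_csv_payload(row)
print(payload)

# With a live endpoint, the payload would be sent like this:
# result = predictor.predict(payload)
```

Checking the feature count before sending can catch the common mistake of leaving the label column in the request.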


## Endpoint Deployed
- Next, create the Lambda function using the code in `lambda_functions/lambda_functions.py`.
- Follow the instructions in `README` to set it up and test it, then create the API Gateway.
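
For orientation, a Lambda function in front of a SageMaker endpoint typically forwards the request to the `sagemaker-runtime` client's `invoke_endpoint` call. The sketch below is not the contents of `lambda_functions/lambda_functions.py`; it is a rough illustration under assumptions. The endpoint name, the request shape (`{"features": [...]}` in a JSON body), the response shape, and the `make_handler` helper are all hypothetical.

```python
import json

ENDPOINT_NAME = "my-xgboost-endpoint"  # hypothetical; use the deployed endpoint's name


def make_handler(runtime_client, endpoint_name=ENDPOINT_NAME):
    """Build a Lambda-style handler around a SageMaker runtime client."""

    def handler(event, context):
        # Assumed request shape: a JSON body like {"features": [8 numbers]}.
        features = json.loads(event["body"])["features"]
        payload = ",".join(str(value) for value in features)

        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="text/csv",
            Body=payload,
        )
        # Return the endpoint's raw response text without interpreting it.
        prediction = response["Body"].read().decode("utf-8")
        return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}

    return handler


# Inside Lambda, the handler would be created once per container, for example:
# import boto3
# handler = make_handler(boto3.client("sagemaker-runtime"))
```

Passing the client in as an argument keeps the handler testable locally with a stand-in object, without AWS credentials.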