# ML Week 5 Assignment 

Stan Lyubarskiy

## Instructions

Apply the Scikit Learn AdaBoost Classifier code to the dataset for classifying phishing vs
benign using all features at once, and upload your .ipynb file. Use a Decision Tree
Classifier as your base classifier. Use decision trees of varying depths (1, 3, 6, 9, 12, 15, 18,
for both the gini and entropy criteria) for the base classifier.
Compare your results with those you obtained last week when you used the Scikit
Decision Tree Classifier (Week 5 assignment).

# Introduction

In the previous assignment, we tested the entropy and gini impurity measures on the phishing vs benign URL dataset. Since we already know the characteristics of the dataset, we will skip exploratory steps such as viewing the descriptive statistics, size, and shape. Instead, we will rerun the plain decision trees and compare their accuracy with that of the AdaBoost Classifier built on the same trees.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# from sklearn.tree import export_graphviz
from sklearn.ensemble import AdaBoostClassifier

In [2]:
# read in the data
df = pd.read_csv("DataSetForPhishingVSBenignUrl.csv")

In [3]:
# filter out all classes except phishing and benign
df.query("URL_Type_obf_Type in ('benign', 'phishing')",inplace=True)
df

|  | Querylength | domain_token_count | path_token_count | avgdomaintokenlen | longdomaintokenlen | avgpathtokenlen | tld | charcompvowels | charcompace | ldl_url | ... | SymbolCount_FileName | SymbolCount_Extension | SymbolCount_Afterpath | Entropy_URL | Entropy_Domain | Entropy_DirectoryName | Entropy_Filename | Entropy_Extension | Entropy_Afterpath | URL_Type_obf_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7930 | 0 | 2 | 12 | 5.500000 | 8 | 4.083334 | 2 | 15 | 7 | 0 | ... | -1 | -1 | -1 | 0.676804 | 0.860529 | -1.000000 | -1.000000 | -1.00000 | -1.000000 | benign |
| 7931 | 0 | 3 | 12 | 5.000000 | 10 | 3.583333 | 3 | 12 | 8 | 2 | ... | 1 | 0 | -1 | 0.715629 | 0.776796 | 0.693127 | 0.738315 | 1.00000 | -1.000000 | benign |
| 7932 | 2 | 2 | 11 | 4.000000 | 5 | 4.750000 | 2 | 16 | 11 | 0 | ... | 2 | 0 | 1 | 0.677701 | 1.000000 | 0.677704 | 0.916667 | 0.00000 | 0.898227 | benign |
| 7933 | 0 | 2 | 7 | 4.500000 | 7 | 5.714286 | 2 | 15 | 10 | 0 | ... | 0 | 0 | -1 | 0.696067 | 0.879588 | 0.818007 | 0.753585 | 0.00000 | -1.000000 | benign |
| 7934 | 19 | 2 | 10 | 6.000000 | 9 | 2.250000 | 2 | 9 | 5 | 0 | ... | 5 | 4 | 3 | 0.747202 | 0.833700 | 0.655459 | 0.829535 | 0.83615 | 0.823008 | benign |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 30004 | 0 | 2 | 3 | 8.000000 | 13 | 3.333333 | 2 | 3 | 2 | 0 | ... | 0 | 0 | -1 | 0.797046 | 0.884870 | 0.750000 | 1.000000 | 0.00000 | -1.000000 | phishing |
| 30005 | 0 | 3 | 0 | 9.000000 | 16 | NaN | 3 | 0 | 0 | 0 | ... | -1 | -1 | -1 | 0.797564 | 0.813569 | -1.000000 | -1.000000 | -1.00000 | -1.000000 | phishing |
| 30006 | 0 | 3 | 2 | 6.666666 | 10 | 3.000000 | 3 | 3 | 2 | 0 | ... | 0 | 0 | -1 | 0.791104 | 0.801139 | NaN | 1.000000 | 0.00000 | -1.000000 | phishing |
| 30007 | 0 | 2 | 3 | 8.000000 | 13 | 3.333333 | 2 | 4 | 2 | 0 | ... | 0 | 0 | -1 | 0.716580 | 0.787659 | 0.871049 | 1.000000 | 0.00000 | -1.000000 | phishing |


In [4]:
# Let us now drop the rows with the NaN values
df2 = df.dropna()

In [5]:
# reset the index so it does not look out of order
df2.reset_index(drop=True, inplace=True)

In [6]:
# save the feature names and class names for graphing
feature_names = df2.columns[:-1]
class_names = df2["URL_Type_obf_Type"].unique()
print("Features:",feature_names)
print("Classes:",class_names)

Features: Index(['Querylength', 'domain_token_count', 'path_token_count',
       'avgdomaintokenlen', 'longdomaintokenlen', 'avgpathtokenlen', 'tld',
       'charcompvowels', 'charcompace', 'ldl_url', 'ldl_domain', 'ldl_path',
       'ldl_filename', 'ldl_getArg', 'dld_url', 'dld_domain', 'dld_path',
       'dld_filename', 'dld_getArg', 'urlLen', 'domainlength', 'pathLength',
       'subDirLen', 'fileNameLen', 'this.fileExtLen', 'ArgLen', 'pathurlRatio',
       'ArgUrlRatio', 'argDomanRatio', 'domainUrlRatio', 'pathDomainRatio',
       'argPathRatio', 'executable', 'isPortEighty', 'NumberofDotsinURL',
       'ISIpAddressInDomainName', 'CharacterContinuityRate',
       'LongestVariableValue', 'URL_DigitCount', 'host_DigitCount',
       'Directory_DigitCount', 'File_name_DigitCount', 'Extension_DigitCount',
       'Query_DigitCount', 'URL_Letter_Count', 'host_letter_count',
       'Directory_LetterCount', 'Filename_LetterCount',
       'Extension_LetterCount', 'Query_LetterCount', 'Longes

In [7]:
# assign my X and Y
Y = df2.iloc[:, -1].values
Y = np.where((Y == "phishing"), 1, 0)

X = df2.iloc[:,0:-1].values

In [8]:
# check the count to ensure it is still correct
np.unique(Y, return_counts=True)

(array([0, 1]), array([2709, 4014], dtype=int64))
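
The class counts above also give a reference point for every accuracy reported later: a model that always predicts the majority class (phishing, label 1) would already be correct on 4014 of the 6723 rows. A minimal sketch of that arithmetic, using only the counts printed above (the variable names are illustrative and not part of the notebook, and the share inside the test split may differ slightly):

```python
# class counts copied from the np.unique output above
n_benign = 2709    # label 0
n_phishing = 4014  # label 1

n_total = n_benign + n_phishing

# accuracy of a classifier that always predicts the larger class
majority_baseline = max(n_benign, n_phishing) / n_total

print("Total rows:", n_total)
print(f"Majority-class baseline accuracy: {majority_baseline * 100:.2f}%")
```

Any accuracy close to this baseline would mean the model has learned little beyond the class balance.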

In [9]:
# split the data into 70/30 train/test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 42)

In [10]:
# check the shape of the data
print("Size of X_train:",X_train.shape, "Size of Y_train:", y_train.shape)
print("Size of X_test:",X_test.shape, "Size of Y_test:", y_test.shape)

Size of X_train: (4706, 79) Size of Y_train: (4706,)
Size of X_test: (2017, 79) Size of Y_test: (2017,)
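
The row counts above can be checked by hand. As far as we know, `train_test_split` rounds a fractional test size up and gives the remaining rows to the training set; the sketch below assumes that rule and uses the 6723 rows left after `dropna` (the variable names are illustrative only):

```python
import math

n_samples = 6723   # rows left after dropna (2709 benign + 4014 phishing)
test_size = 0.3    # same fraction passed to train_test_split

# assumed rule: test rows are rounded up, training rows get the remainder
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("Expected train rows:", n_train)
print("Expected test rows:", n_test)
```

Both numbers agree with the shapes printed above, so no rows were lost between cleaning and splitting.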


# Decision Tree Classifiers with Depths of 1, 3, 6, 9, 12, 15, 18

We first run the plain decision tree classifiers, without AdaBoost, to reproduce the baseline results.

In [11]:
# create a function to compute the basic decision trees
def basic_tree(criterion, random_state, max_depth, x, y):
    # create a tree object and fit it on the data passed in
    # (named clf so it does not shadow the imported sklearn `tree` module)
    clf = DecisionTreeClassifier(criterion=criterion, random_state=random_state, max_depth=max_depth)
    clf.fit(x, y)

    # print the accuracy (the heading is bolded with ANSI escape codes)
    print(f"\033[1m{criterion.title()} with Max Depth of {max_depth}\033[0m")
    training_acc = "Training Accuracy: "
    testing_acc = "Testing Accuracy: "

    # training accuracy is scored on the data the tree was fit on;
    # testing accuracy uses the global held-out split from In [9]
    print(training_acc, clf.score(x, y)*100)
    print(testing_acc, clf.score(X_test, y_test)*100)
    print("----------------------------------------------")

    # return the tree object to use it as input for adaboost later
    return clf

In [12]:
# create a list of max depths to test
depths = [1,3,6,9,12,15,18]

In [13]:
# run a for loop to test accuracy at each depth, print the results, and store the data in a dictionary
gini = {}
for i in depths:
    gini[f'gini{i}'] = basic_tree("gini", 50, i, X_train, y_train)

Gini with Max Depth of 1
Training Accuracy:  79.60050998725032
Testing Accuracy:  80.51561725334655
----------------------------------------------
Gini with Max Depth of 3
Training Accuracy:  91.62770930726731
Testing Accuracy:  90.67922657411998
----------------------------------------------
Gini with Max Depth of 6
Training Accuracy:  96.17509562260943
Testing Accuracy:  94.89340604858701
----------------------------------------------
Gini with Max Depth of 9
Training Accuracy:  98.47003824904378
Testing Accuracy:  96.0832920178483
----------------------------------------------
Gini with Max Depth of 12
Training Accuracy:  99.25626859328517
Testing Accuracy:  96.2816063460585
----------------------------------------------
Gini with Max Depth of 15
Training Accuracy:  99.78750531236719
Testing Accuracy:  96.38076351016362
----------------------------------------------
Gini with Max Depth of 18
Training Accuracy:  99.9787505312367
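
One way to read the Gini results above is to look at the gap between training and testing accuracy at each depth. The sketch below is only an illustration: the numbers are copied by hand from the output above and rounded to two decimals, depth 18 is left out because its testing accuracy is not shown, and the variable names are our own.

```python
# accuracies (%) copied from the Gini output above, rounded to two decimals;
# depth 18 is left out because its testing accuracy is not shown
depths_shown = [1, 3, 6, 9, 12, 15]
gini_train = [79.60, 91.63, 96.18, 98.47, 99.26, 99.79]
gini_test = [80.52, 90.68, 94.89, 96.08, 96.28, 96.38]

# gap between training and testing accuracy at each depth
gaps = [round(tr - te, 2) for tr, te in zip(gini_train, gini_test)]

for d, gap in zip(depths_shown, gaps):
    print(f"max_depth={d:>2}  train - test = {gap:+.2f}")
```

On this split the gap grows with every increase in depth while the testing accuracy levels off near 96%, which suggests that the deeper trees are mostly fitting the training set more closely rather than generalizing better.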

In [16]:
# check the dictionary to see if it populated correctly
print(gini)

{'gini1': DecisionTreeClassifier(max_depth=1, random_state=50), 'gini3': DecisionTreeClassifier(max_depth=3, random_state=50), 'gini6': DecisionTreeClassifier(max_depth=6, random_state=50), 'gini9': DecisionTreeClassifier(max_depth=9, random_state=50), 'gini12': DecisionTreeClassifier(max_depth=12, random_state=50), 'gini15': DecisionTreeClassifier(max_depth=15, random_state=50), 'gini18': DecisionTreeClassifier(max_depth=18, random_state=50)}


In [20]:
# run a for loop to test accuracy at each depth, print the results, and store the data in a dictionary
entropy = {}
for i in depths:
    entropy[f'entropy{i}'] = basic_tree("entropy", 50, i, X_train, y_train)

Entropy with Max Depth of 1
Training Accuracy:  79.23926901827454
Testing Accuracy:  80.3668815071889
----------------------------------------------
Entropy with Max Depth of 3
Training Accuracy:  91.26646833829155
Testing Accuracy:  90.43133366385722
----------------------------------------------
Entropy with Max Depth of 6
Training Accuracy:  95.83510412239694
Testing Accuracy:  95.24045612295488
----------------------------------------------
Entropy with Max Depth of 9
Training Accuracy:  98.76753081172971
Testing Accuracy:  95.78582052553297
----------------------------------------------
Entropy with Max Depth of 12
Training Accuracy:  99.68125796855078
Testing Accuracy:  96.13287059990084
----------------------------------------------
Entropy with Max Depth of 15
Training Accuracy:  99.85125371865703
Testing Accuracy:  96.13287059990084
----------------------------------------------
Entropy with Max Depth of 18
Training Accur

In [21]:
# check the dictionary to see if it populated correctly
print(entropy)

{'entropy1': DecisionTreeClassifier(criterion='entropy', max_depth=1, random_state=50), 'entropy3': DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=50), 'entropy6': DecisionTreeClassifier(criterion='entropy', max_depth=6, random_state=50), 'entropy9': DecisionTreeClassifier(criterion='entropy', max_depth=9, random_state=50), 'entropy12': DecisionTreeClassifier(criterion='entropy', max_depth=12, random_state=50), 'entropy15': DecisionTreeClassifier(criterion='entropy', max_depth=15, random_state=50), 'entropy18': DecisionTreeClassifier(criterion='entropy', max_depth=18, random_state=50)}


In [22]:
# create a function to compute the adaboosted decision trees
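def ada_tree(base, x, y):
    # Create the AdaBoost classifier object around the base tree
    # (scikit-learn 1.2 and later name this argument `estimator`
    # instead of `base_estimator`)
    abc = AdaBoostClassifier(base_estimator=base)

    # Train the AdaBoost classifier on the data passed in
    abc.fit(x, y)

    # print the accuracy; get_depth() is the fitted depth of the base tree,
    # which can be smaller than its max_depth setting
    print(f"\033[1mAdaBoosted {base.criterion.title()} with Max Depth of {base.get_depth()}\033[0m")
    training_acc = "Training Accuracy: "
    testing_acc = "Testing Accuracy: "

    # training accuracy is scored on the data the model was fit on;
    # testing accuracy uses the global held-out split from In [9]
    print(training_acc, abc.score(x, y)*100)
    print(testing_acc, abc.score(X_test, y_test)*100)
    print("----------------------------------------------")

    # return the AdaBoosted tree object to use it for later
    return abc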

In [29]:
# run a for loop to test accuracy at each depth, print the results, and store the data in a dictionary
ada_gini = {}
for i in depths:
    ada_gini[f'ada_gini{i}'] = ada_tree(gini[f'gini{i}'],X_train, y_train)

AdaBoosted Gini with Max Depth of 1
Training Accuracy:  97.28006799830004
Testing Accuracy:  96.52949925632127
----------------------------------------------
AdaBoosted Gini with Max Depth of 3
Training Accuracy:  100.0
Testing Accuracy:  97.22359940505702
----------------------------------------------
AdaBoosted Gini with Max Depth of 6
Training Accuracy:  100.0
Testing Accuracy:  97.57064947942489
----------------------------------------------
AdaBoosted Gini with Max Depth of 9
Training Accuracy:  100.0
Testing Accuracy:  97.52107089737233
----------------------------------------------
AdaBoosted Gini with Max Depth of 12
Training Accuracy:  100.0
Testing Accuracy:  97.52107089737233
----------------------------------------------
AdaBoosted Gini with Max Depth of 15
Training Accuracy:  100.0
Testing Accuracy:  95.24045612295488
----------------------------------------------
AdaBoosted Gini with Max Depth of 18
Training Accuracy

In [35]:
# check the dictionary to see if it populated correctly
print(ada_gini)

{'ada_gini1': AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1,
                                                         random_state=50)), 'ada_gini3': AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=50)), 'ada_gini6': AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=6,
                                                         random_state=50)), 'ada_gini9': AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=9,
                                                         random_state=50)), 'ada_gini12': AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=12,
                                                         random_state=50)), 'ada_gini15': AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=15,
                                                         random_state=50)), 'ada_gini18': AdaBoostClassifier(base_est

In [36]:
# run a for loop to test accuracy at each depth, print the results, and store the data in a dictionary
ada_entropy = {}
for i in depths:
    ada_entropy[f'ada_entropy{i}'] = ada_tree(entropy[f'entropy{i}'],X_train, y_train)

AdaBoosted Entropy with Max Depth of 1
Training Accuracy:  97.13132171695707
Testing Accuracy:  97.0252850768468
----------------------------------------------
AdaBoosted Entropy with Max Depth of 3
Training Accuracy:  100.0
Testing Accuracy:  97.66980664353
----------------------------------------------
AdaBoosted Entropy with Max Depth of 6
Training Accuracy:  100.0
Testing Accuracy:  97.91769955379276
----------------------------------------------
AdaBoosted Entropy with Max Depth of 9
Training Accuracy:  100.0
Testing Accuracy:  97.86812097174021
----------------------------------------------
AdaBoosted Entropy with Max Depth of 12
Training Accuracy:  100.0
Testing Accuracy:  97.07486365889936
----------------------------------------------
AdaBoosted Entropy with Max Depth of 15
Training Accuracy:  100.0
Testing Accuracy:  95.4387704511651
----------------------------------------------
AdaBoosted Entropy with Max Depth of 17
T
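
Since the assignment asks for a comparison with the plain decision trees, the testing accuracies printed above can be lined up side by side. This is a rough sketch rather than part of the original notebook: the numbers are copied by hand from the outputs above and rounded to two decimals, the deepest runs are left out because their output is cut off, and the names `test_acc` and `gain` are our own.

```python
# test accuracies (%) copied from the outputs above, rounded to two decimals;
# the deepest runs are left out because their output is cut off
depths_shown = [1, 3, 6, 9, 12, 15]
test_acc = {
    "gini": {
        "tree": [80.52, 90.68, 94.89, 96.08, 96.28, 96.38],
        "adaboost": [96.53, 97.22, 97.57, 97.52, 97.52, 95.24],
    },
    "entropy": {
        "tree": [80.37, 90.43, 95.24, 95.79, 96.13, 96.13],
        "adaboost": [97.03, 97.67, 97.92, 97.87, 97.07, 95.44],
    },
}

# gain[criterion][i] = AdaBoost test accuracy minus single-tree test accuracy
gain = {
    crit: [round(a - t, 2) for a, t in zip(acc["adaboost"], acc["tree"])]
    for crit, acc in test_acc.items()
}

print(f"{'depth':>5} {'gini gain':>10} {'entropy gain':>13}")
for i, d in enumerate(depths_shown):
    print(f"{d:>5} {gain['gini'][i]:>+10.2f} {gain['entropy'][i]:>+13.2f}")
```

On this single train/test split, boosting helps most with depth-1 stumps (a gain of roughly 16 points for both criteria), the gain shrinks as the base trees get deeper, and at depth 15 the boosted models score slightly below the single trees. The best testing accuracy among the runs shown is the AdaBoosted entropy tree with a depth of 6.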

In [37]:
# check the dictionary to see if it populated correctly
print(ada_entropy)

{'ada_entropy1': AdaBoostClassifier(base_estimator=DecisionTreeClassifier(criterion='entropy',
                                                         max_depth=1,
                                                         random_state=50)), 'ada_entropy3': AdaBoostClassifier(base_estimator=DecisionTreeClassifier(criterion='entropy',
                                                         max_depth=3,
                                                         random_state=50)), 'ada_entropy6': AdaBoostClassifier(base_estimator=DecisionTreeClassifier(criterion='entropy',
                                                         max_depth=6,
                                                         random_state=50)), 'ada_entropy9': AdaBoostClassifier(base_estimator=DecisionTreeClassifier(criterion='entropy',
                                                         max_depth=9,
                                                         random_state=50)), 'ada_entropy12': AdaBoostClassifier(bas