<a href="https://colab.research.google.com/github/ndbellew/DeepLearningMaliciousURLs/blob/master/Keras-Tensorflow-Experiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TensorFlow Keras Experiments

##### Sources:
 + https://www.tensorflow.org/tutorials/keras/overfit_and_underfit
 + https://www.kaggle.com/grafiszti/98-59-acc-on-10-fold-with-testing-7-keras-models

## Initial Setup

### Import required packages

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf

import csv
import os
import sys
import glob
import operator
import time

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.utils import shuffle

# Use the Keras that ships with TensorFlow throughout, rather than mixing it
# with the standalone `keras` package.
from tensorflow import keras
from tensorflow import feature_column
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Activation, BatchNormalization, Dropout
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.utils import to_categorical, normalize


### Download Dataset

In [2]:
%%bash
URL=https://iscxdownloads.cs.unb.ca/iscxdownloads/ISCX-URL-2016/
FILES=(ISCXURL2016.zip)
for FILE in "${FILES[@]}"; do
    # download and extract only if the archive is not already present
    if [ ! -f "$FILE" ]; then
        printf "downloading %s\n" "$FILE"
        curl -O "$URL$FILE"
        # unzip files
        echo 'unzipping ' "$FILE"
        unzip -o "$FILE"  # overwrite existing files/folders
    fi
done

downloading ISCXURL2016.zip
unzipping  ISCXURL2016.zip
Archive:  ISCXURL2016.zip
   creating: FinalDataset/
  inflating: FinalDataset/Spam_Infogain_test.csv  
  inflating: FinalDataset/Spam_Infogain.csv  
  inflating: FinalDataset/Spam_BestFirst_test.csv  
  inflating: FinalDataset/Spam_BestFirst.csv  
  inflating: FinalDataset/Spam.csv   
  inflating: FinalDataset/Phishing_Infogain_test.csv  
  inflating: FinalDataset/Phishing_Infogain.csv  
  inflating: FinalDataset/Phishing.csv  
  inflating: FinalDataset/Malware_Infogain_test.csv  
  inflating: FinalDataset/Malware_Infogain.csv  
  inflating: FinalDataset/Malware_BestFirst.csv  
  inflating: FinalDataset/Malware.csv  
  inflating: FinalDataset/Defacement_Infogain_test.csv  
  inflating: FinalDataset/Defacement_Infogain.csv  
  inflating: FinalDataset/Defacement_BestFirst.csv  
  inflating: FinalDataset/Defacement.csv  
  inflating: FinalDataset/All_Infogain_test.csv  
  inflating: FinalDataset/All_Infogain.csv  
  inflating: FinalD



### Check Dataset

In [3]:
! ls FinalDataset

All_BestFirst.csv	      Malware_Infogain_test.csv
All_BestFirst_test.csv	      Phishing_BestFirst.csv
All.csv			      Phishing.csv
All_Infogain.csv	      Phishing_Infogain.csv
All_Infogain_test.csv	      Phishing_Infogain_test.csv
Defacement_BestFirst.csv      Spam_BestFirst.csv
Defacement.csv		      Spam_BestFirst_test.csv
Defacement_Infogain.csv       Spam.csv
Defacement_Infogain_test.csv  Spam_Infogain.csv
Malware_BestFirst.csv	      Spam_Infogain_test.csv
Malware.csv		      URL
Malware_Infogain.csv


## Data Setup
> Define the constants and the results directory up front. They are not used until the later training and evaluation cells.

In [4]:
# Directory where evaluation results are written
resultPath = 'results_keras_tensorflow'
if not os.path.exists(resultPath):
    os.mkdir(resultPath)
    print('result path {} created.'.format(resultPath))

result path results_keras_tensorflow created.


In [0]:
dep_var = 'Label'
model_name="init"

In [0]:
cat_names = []
cont_names = []

### Analyze CSV Files
> Let's make sure the files were loaded properly. This should look similar to the fastai experiments.

In [0]:
df = pd.read_csv('FinalDataset/All.csv', low_memory=False)

In [8]:
df.shape

(36707, 80)

Show all dataset column names

In [9]:
df.columns

Index(['Querylength', 'domain_token_count', 'path_token_count',
       'avgdomaintokenlen', 'longdomaintokenlen', 'avgpathtokenlen', 'tld',
       'charcompvowels', 'charcompace', 'ldl_url', 'ldl_domain', 'ldl_path',
       'ldl_filename', 'ldl_getArg', 'dld_url', 'dld_domain', 'dld_path',
       'dld_filename', 'dld_getArg', 'urlLen', 'domainlength', 'pathLength',
       'subDirLen', 'fileNameLen', 'this.fileExtLen', 'ArgLen', 'pathurlRatio',
       'ArgUrlRatio', 'argDomanRatio', 'domainUrlRatio', 'pathDomainRatio',
       'argPathRatio', 'executable', 'isPortEighty', 'NumberofDotsinURL',
       'ISIpAddressInDomainName', 'CharacterContinuityRate',
       'LongestVariableValue', 'URL_DigitCount', 'host_DigitCount',
       'Directory_DigitCount', 'File_name_DigitCount', 'Extension_DigitCount',
       'Query_DigitCount', 'URL_Letter_Count', 'host_letter_count',
       'Directory_LetterCount', 'Filename_LetterCount',
       'Extension_LetterCount', 'Query_LetterCount', 'LongestPathToken

Show the first rows of the dataset

In [10]:
df.head()

Unnamed: 0,Querylength,domain_token_count,path_token_count,avgdomaintokenlen,longdomaintokenlen,avgpathtokenlen,tld,charcompvowels,charcompace,ldl_url,ldl_domain,ldl_path,ldl_filename,ldl_getArg,dld_url,dld_domain,dld_path,dld_filename,dld_getArg,urlLen,domainlength,pathLength,subDirLen,fileNameLen,this.fileExtLen,ArgLen,pathurlRatio,ArgUrlRatio,argDomanRatio,domainUrlRatio,pathDomainRatio,argPathRatio,executable,isPortEighty,NumberofDotsinURL,ISIpAddressInDomainName,CharacterContinuityRate,LongestVariableValue,URL_DigitCount,host_DigitCount,Directory_DigitCount,File_name_DigitCount,Extension_DigitCount,Query_DigitCount,URL_Letter_Count,host_letter_count,Directory_LetterCount,Filename_LetterCount,Extension_LetterCount,Query_LetterCount,LongestPathTokenLength,Domain_LongestWordLength,Path_LongestWordLength,sub-Directory_LongestWordLength,Arguments_LongestWordLength,URL_sensitiveWord,URLQueries_variable,spcharUrl,delimeter_Domain,delimeter_path,delimeter_Count,NumberRate_URL,NumberRate_Domain,NumberRate_DirectoryName,NumberRate_FileName,NumberRate_Extension,NumberRate_AfterPath,SymbolCount_URL,SymbolCount_Domain,SymbolCount_Directoryname,SymbolCount_FileName,SymbolCount_Extension,SymbolCount_Afterpath,Entropy_URL,Entropy_Domain,Entropy_DirectoryName,Entropy_Filename,Entropy_Extension,Entropy_Afterpath,URL_Type_obf_Type
0,0,4,5,5.5,14,4.4,4,8,3,0,0,0,0,0,0,0,0,0,0,58,25,26,26,13,1,2,0.448276,0.034483,0.08,0.431034,1.04,0.07692308,0,-1,5,-1,0.6,-1,1,0,0,0,1,-1,47,22,8,13,0,-1,13,14,13,5,-1,0,0,3,0,2,-1,0.017241,0.0,0.0,0.066667,1.0,-1.0,8,3,2,1,0,-1,0.726298,0.784493,0.894886,0.850608,,-1.0,Defacement
1,0,4,5,5.5,14,6.0,4,12,4,0,0,0,0,0,0,0,0,0,0,66,25,34,34,2,2,2,0.515151,0.030303,0.08,0.378788,1.36,0.05882353,0,-1,4,-1,0.6,-1,0,0,0,0,0,-1,56,22,8,13,9,-1,13,14,13,5,-1,0,0,4,0,1,-1,0.0,0.0,0.0,0.0,,-1.0,8,3,3,0,0,-1,0.688635,0.784493,0.814725,0.859793,0.0,-1.0,Defacement
2,0,4,5,5.5,14,5.8,4,12,5,0,0,0,0,0,0,0,0,0,0,65,25,33,33,2,2,2,0.507692,0.030769,0.08,0.384615,1.32,0.060606062,0,-1,4,-1,0.6,-1,0,0,0,0,0,-1,55,22,8,13,8,-1,13,14,13,5,-1,0,0,4,0,1,-1,0.0,0.0,0.0,0.0,,-1.0,8,3,3,0,0,-1,0.695049,0.784493,0.814725,0.80188,0.0,-1.0,Defacement
3,0,4,12,5.5,14,5.5,4,32,16,0,0,0,0,0,0,0,0,0,0,109,25,77,77,2,2,2,0.706422,0.018349,0.08,0.229358,3.08,0.025974026,0,-1,4,-1,0.6,-1,0,0,0,0,0,-1,92,22,8,13,45,-1,52,14,13,13,-1,0,0,4,0,8,-1,0.0,0.0,0.0,0.0,,-1.0,8,3,3,0,0,-1,0.64013,0.784493,0.814725,0.66321,0.0,-1.0,Defacement
4,0,4,6,5.5,14,7.333334,4,18,11,0,0,0,0,0,0,0,0,0,0,81,25,49,49,2,2,2,0.604938,0.024691,0.08,0.308642,1.96,0.040816326,0,-1,4,-1,0.6,-1,0,0,0,0,0,-1,70,22,8,13,23,-1,24,14,13,13,-1,0,0,4,0,2,-1,0.0,0.0,0.0,0.0,,-1.0,8,3,3,0,0,-1,0.681307,0.784493,0.814725,0.804526,0.0,-1.0,Defacement


 Show the last rows of the dataset

In [11]:
df.tail()

Unnamed: 0,Querylength,domain_token_count,path_token_count,avgdomaintokenlen,longdomaintokenlen,avgpathtokenlen,tld,charcompvowels,charcompace,ldl_url,ldl_domain,ldl_path,ldl_filename,ldl_getArg,dld_url,dld_domain,dld_path,dld_filename,dld_getArg,urlLen,domainlength,pathLength,subDirLen,fileNameLen,this.fileExtLen,ArgLen,pathurlRatio,ArgUrlRatio,argDomanRatio,domainUrlRatio,pathDomainRatio,argPathRatio,executable,isPortEighty,NumberofDotsinURL,ISIpAddressInDomainName,CharacterContinuityRate,LongestVariableValue,URL_DigitCount,host_DigitCount,Directory_DigitCount,File_name_DigitCount,Extension_DigitCount,Query_DigitCount,URL_Letter_Count,host_letter_count,Directory_LetterCount,Filename_LetterCount,Extension_LetterCount,Query_LetterCount,LongestPathTokenLength,Domain_LongestWordLength,Path_LongestWordLength,sub-Directory_LongestWordLength,Arguments_LongestWordLength,URL_sensitiveWord,URLQueries_variable,spcharUrl,delimeter_Domain,delimeter_path,delimeter_Count,NumberRate_URL,NumberRate_Domain,NumberRate_DirectoryName,NumberRate_FileName,NumberRate_Extension,NumberRate_AfterPath,SymbolCount_URL,SymbolCount_Domain,SymbolCount_Directoryname,SymbolCount_FileName,SymbolCount_Extension,SymbolCount_Afterpath,Entropy_URL,Entropy_Domain,Entropy_DirectoryName,Entropy_Filename,Entropy_Extension,Entropy_Afterpath,URL_Type_obf_Type
36702,29,4,14,5.75,12,3.666667,4,20,24,3,0,3,0,2,0,0,0,0,0,146,26,113,113,2,2,85,0.773973,0.582192,3.269231,0.178082,4.346154,0.7522124,0,-1,5,-1,0.5,23,31,0,4,0,27,3,94,23,46,7,14,24,43,12,11,11,23,0,3,6,0,2,5,0.212329,0.0,0.064516,0.529412,0.627907,0.066667,19,3,11,3,2,7,0.690555,0.791265,0.777498,0.690227,0.656684,0.796205,spam
36703,0,4,13,3.75,8,8.461538,4,24,23,0,0,0,0,0,0,0,0,0,0,147,18,122,122,2,2,2,0.829932,0.013605,0.111111,0.122449,6.777778,0.016393442,0,-1,5,-1,0.5,-1,21,0,0,0,21,-1,101,15,7,6,69,-1,105,8,9,9,-1,0,0,3,0,2,-1,0.142857,0.0,0.0,0.1875,0.2,-1.0,23,3,2,16,15,-1,0.665492,0.82001,0.879588,0.6744,0.674671,-1.0,spam
36704,58,3,27,6.666666,16,3.375,3,41,34,20,0,20,0,18,12,0,12,0,12,246,22,217,217,2,2,182,0.882114,0.739837,8.272727,0.089431,9.863636,0.83870965,0,-1,7,-1,0.772727,58,57,0,6,0,51,1,156,20,71,3,58,48,118,16,12,12,0,0,1,12,0,9,1,0.231707,0.0,0.073171,0.377778,0.418033,0.029412,26,2,14,8,7,9,0.656807,0.801139,0.684777,0.713622,0.717187,0.705245,spam
36705,35,3,13,4.333334,9,3.6,3,15,13,7,0,7,0,7,4,0,4,0,4,116,15,94,94,2,2,71,0.810345,0.612069,4.733333,0.12931,6.266667,0.7553192,0,-1,3,-1,0.666667,32,25,0,0,0,25,23,73,13,4,11,41,12,75,9,8,8,0,0,2,3,0,3,3,0.215517,0.0,0.0,0.284091,0.333333,0.418182,14,2,1,9,8,3,0.725963,0.897617,0.871049,0.745932,0.758824,0.790772,spam
36706,40,3,25,6.666666,16,3.25,3,35,31,19,0,19,0,17,6,0,6,0,6,227,22,198,198,2,2,164,0.872247,0.722467,7.454546,0.096916,9.0,0.82828283,0,-1,6,-1,0.772727,40,52,0,6,1,45,2,144,20,50,6,64,31,118,16,10,10,0,0,1,11,0,8,1,0.229075,0.0,0.083333,0.365079,0.381356,0.06,24,2,13,7,6,7,0.674351,0.801139,0.697282,0.730563,0.731481,0.769238,spam


## Testing

### Functions for Testing
> Now that the data has been collected, define the helper functions used in the later tests.

In [0]:
def loadData(csvFile):
    pickleDump = '{}.pickle'.format(csvFile)
    if os.path.exists(pickleDump):
        df = pd.read_pickle(pickleDump)
    else:
        df = pd.read_csv(csvFile, low_memory=False, na_values='NaN')
        # clean data
        # strip the whitespace from column names
        df = df.rename(str.strip, axis='columns')
        #df.drop(columns=[], inplace=True)
        # drop missing values/NaN etc.
        #df.dropna(inplace=True)
        # drop Infinity rows and NaN string from each column
        for col in df.columns:
            indexNames = df[df[col]=='Infinity'].index
            if not indexNames.empty:
                print('deleting {} rows with Infinity in column {}'.format(len(indexNames), col))
                df.drop(indexNames, inplace=True)
            indexNames = df[df[col]=='NaN'].index
            if not indexNames.empty:
                print('deleting {} rows with NaN in column {}'.format(len(indexNames), col))
                df.drop(indexNames, inplace=True)
        
        df.to_pickle(pickleDump)
    
    return df


In [0]:
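def baseline_model(inputDim=-1, batch_size=32):
    # NOTE: `batch_size` sets the width of the second hidden layer;
    # it is not the training batch size passed to model.fit().
    global model_name
    model = tf.keras.Sequential([
        Dense(128, activation='relu', input_shape=(inputDim,)),
        BatchNormalization(),
        Dropout(.5),
        Dense(batch_size, activation='relu'),
        BatchNormalization(),
        Dropout(.5),
        # Output layer: one unit per URL class. Softmax pairs with
        # categorical cross-entropy for single-label classification.
        Dense(5, activation='softmax'),
    ])

    print('Categorical Cross-Entropy Loss Function')
    # This function runs once per fold, so append the suffix only once.
    if not model_name.endswith("_categorical"):
        model_name += "_categorical"
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])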
#         else:
#             model_name += "_binary"
#             print('Binary Cross-Entropy Loss Function')
#             model.compile(optimizer='adam',
#                     loss='binary_crossentropy',
#                     metrics=['accuracy'])
    return model
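
Categorical cross-entropy expects each prediction to be a probability distribution over the five classes. The following is a minimal NumPy sketch of that loss for a single sample, assuming a softmax output. The helper names and the scores are made up for illustration and are not part of the notebook.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def categorical_crossentropy(one_hot_target, probs, eps=1e-7):
    # Clip to avoid log(0); only the true class contributes to the sum.
    probs = np.clip(probs, eps, 1.0)
    return float(-np.sum(one_hot_target * np.log(probs)))

logits = np.array([2.0, 0.5, 0.1, -1.0, 0.0])   # made-up scores for 5 classes
target = np.array([1.0, 0.0, 0.0, 0.0, 0.0])    # true class is index 0
probs = softmax(logits)
loss = categorical_crossentropy(target, probs)
print(probs.sum(), loss)
```

The loss reduces to the negative log of the probability assigned to the true class, so it shrinks toward zero as the model becomes more confident in the correct label.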

In [0]:
def encode_labels(dataframe):
    dataframe=dataframe.copy()
    data_y=dataframe.pop(dep_var)
    encoder = LabelEncoder()
    encoder.fit(data_y)
    data_y = encoder.transform(data_y)
    dummy_y = to_categorical(data_y)
    return dummy_y
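
To make the encoding concrete, the NumPy-only sketch below mimics what `encode_labels` produces: class names are sorted, mapped to integer codes, and expanded into one-hot rows. The helper name `one_hot_encode` and the toy label list are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def one_hot_encode(labels):
    """Return (classes, one_hot) for a sequence of string labels."""
    # np.unique sorts the class names and, with return_inverse=True,
    # also returns the integer code of every label.
    classes, codes = np.unique(np.asarray(labels), return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for class i.
    one_hot = np.eye(len(classes), dtype=np.float32)[codes]
    return classes, one_hot

classes, one_hot = one_hot_encode(['spam', 'benign', 'Defacement', 'benign'])
print(classes)
print(one_hot)
```

Note that sorting is case-sensitive, so `Defacement` receives code 0 ahead of the lowercase class names.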

In [0]:
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()

    # Encode the labels as one-hot vectors, then remove the label column
    # so that it is not passed to the model as a feature.
    labels = encode_labels(dataframe)
    dataframe.pop(dep_var)

    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds
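
`df_to_dataset` relies on `tf.data` to shuffle with a buffer as large as the dataset and then group rows into batches. The NumPy sketch below is an analogue of those two steps, not the `tf.data` implementation. `shuffle_and_batch` and the toy arrays are hypothetical names introduced here.

```python
import numpy as np

def shuffle_and_batch(features, labels, batch_size=32, shuffle=True, seed=0):
    """Yield (features, labels) mini-batches, optionally shuffled."""
    n = len(features)
    order = np.arange(n)
    if shuffle:
        # Shuffle all indices at once, like a buffer as large as the dataset.
        order = np.random.default_rng(seed).permutation(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        yield features[idx], labels[idx]

X_demo = np.arange(20).reshape(10, 2)   # 10 toy samples, 2 features each
y_demo = np.arange(10)                  # toy labels
batches = list(shuffle_and_batch(X_demo, y_demo, batch_size=4))
print([len(b[1]) for b in batches])
```

Features and labels are indexed with the same permutation, so each row stays paired with its label, and the last batch is simply smaller when the sample count is not a multiple of the batch size.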

### Test LoadData Function
> This will look just like the fastai test, but here we are using TensorFlow, so let's make sure it works.

In [16]:
df1 = loadData('FinalDataset/All.csv')
df1=df1.dropna(axis=1)
print(df1)


deleting 10 rows with Infinity in column argPathRatio
       Querylength  domain_token_count  ...  Entropy_Domain  URL_Type_obf_Type
0                0                   4  ...        0.784493         Defacement
1                0                   4  ...        0.784493         Defacement
2                0                   4  ...        0.784493         Defacement
3                0                   4  ...        0.784493         Defacement
4                0                   4  ...        0.784493         Defacement
5                0                   4  ...        0.784493         Defacement
6                0                   4  ...        0.784493         Defacement
7                0                   4  ...        0.784493         Defacement
8                0                   4  ...        0.784493         Defacement
9                0                   4  ...        0.784493         Defacement
10               0                   4  ...        0.784493         Defacemen

In [17]:
df1.columns


Index(['Querylength', 'domain_token_count', 'path_token_count',
       'avgdomaintokenlen', 'longdomaintokenlen', 'tld', 'charcompvowels',
       'charcompace', 'ldl_url', 'ldl_domain', 'ldl_path', 'ldl_filename',
       'ldl_getArg', 'dld_url', 'dld_domain', 'dld_path', 'dld_filename',
       'dld_getArg', 'urlLen', 'domainlength', 'pathLength', 'subDirLen',
       'fileNameLen', 'this.fileExtLen', 'ArgLen', 'pathurlRatio',
       'ArgUrlRatio', 'argDomanRatio', 'domainUrlRatio', 'pathDomainRatio',
       'argPathRatio', 'executable', 'isPortEighty', 'NumberofDotsinURL',
       'ISIpAddressInDomainName', 'CharacterContinuityRate',
       'LongestVariableValue', 'URL_DigitCount', 'host_DigitCount',
       'Directory_DigitCount', 'File_name_DigitCount', 'Extension_DigitCount',
       'Query_DigitCount', 'URL_Letter_Count', 'host_letter_count',
       'Directory_LetterCount', 'Filename_LetterCount',
       'Extension_LetterCount', 'Query_LetterCount', 'LongestPathTokenLength',
       'Do

In [18]:
df1.shape

(36697, 73)
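
The shape changes from (36707, 80) to (36697, 73): `loadData` removed the 10 rows containing `Infinity`, and `dropna(axis=1)` removed the 7 columns that hold at least one missing value. The toy frame below (made-up values) sketches the difference between dropping columns and dropping rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'urlLen': [58, 66, 65],
    'Entropy_Extension': [np.nan, 0.0, 0.0],   # one missing value
    'label': ['Defacement', 'Defacement', 'spam'],
})

by_column = toy.dropna(axis=1)   # drops the whole 'Entropy_Extension' column
by_row = toy.dropna(axis=0)      # drops only the row with the missing value

print(by_column.shape, by_row.shape)
```

Dropping columns keeps every sample but discards features. Dropping rows would keep all features at the cost of samples.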

In [20]:
df1.head()

Unnamed: 0,Querylength,domain_token_count,path_token_count,avgdomaintokenlen,longdomaintokenlen,tld,charcompvowels,charcompace,ldl_url,ldl_domain,ldl_path,ldl_filename,ldl_getArg,dld_url,dld_domain,dld_path,dld_filename,dld_getArg,urlLen,domainlength,pathLength,subDirLen,fileNameLen,this.fileExtLen,ArgLen,pathurlRatio,ArgUrlRatio,argDomanRatio,domainUrlRatio,pathDomainRatio,argPathRatio,executable,isPortEighty,NumberofDotsinURL,ISIpAddressInDomainName,CharacterContinuityRate,LongestVariableValue,URL_DigitCount,host_DigitCount,Directory_DigitCount,File_name_DigitCount,Extension_DigitCount,Query_DigitCount,URL_Letter_Count,host_letter_count,Directory_LetterCount,Filename_LetterCount,Extension_LetterCount,Query_LetterCount,LongestPathTokenLength,Domain_LongestWordLength,Path_LongestWordLength,sub-Directory_LongestWordLength,Arguments_LongestWordLength,URL_sensitiveWord,URLQueries_variable,spcharUrl,delimeter_Domain,delimeter_path,delimeter_Count,NumberRate_URL,NumberRate_Domain,NumberRate_DirectoryName,NumberRate_FileName,SymbolCount_URL,SymbolCount_Domain,SymbolCount_Directoryname,SymbolCount_FileName,SymbolCount_Extension,SymbolCount_Afterpath,Entropy_URL,Entropy_Domain,URL_Type_obf_Type
0,0,4,5,5.5,14,4,8,3,0,0,0,0,0,0,0,0,0,0,58,25,26,26,13,1,2,0.448276,0.034483,0.08,0.431034,1.04,0.07692308,0,-1,5,-1,0.6,-1,1,0,0,0,1,-1,47,22,8,13,0,-1,13,14,13,5,-1,0,0,3,0,2,-1,0.017241,0.0,0.0,0.066667,8,3,2,1,0,-1,0.726298,0.784493,Defacement
1,0,4,5,5.5,14,4,12,4,0,0,0,0,0,0,0,0,0,0,66,25,34,34,2,2,2,0.515151,0.030303,0.08,0.378788,1.36,0.05882353,0,-1,4,-1,0.6,-1,0,0,0,0,0,-1,56,22,8,13,9,-1,13,14,13,5,-1,0,0,4,0,1,-1,0.0,0.0,0.0,0.0,8,3,3,0,0,-1,0.688635,0.784493,Defacement
2,0,4,5,5.5,14,4,12,5,0,0,0,0,0,0,0,0,0,0,65,25,33,33,2,2,2,0.507692,0.030769,0.08,0.384615,1.32,0.060606062,0,-1,4,-1,0.6,-1,0,0,0,0,0,-1,55,22,8,13,8,-1,13,14,13,5,-1,0,0,4,0,1,-1,0.0,0.0,0.0,0.0,8,3,3,0,0,-1,0.695049,0.784493,Defacement
3,0,4,12,5.5,14,4,32,16,0,0,0,0,0,0,0,0,0,0,109,25,77,77,2,2,2,0.706422,0.018349,0.08,0.229358,3.08,0.025974026,0,-1,4,-1,0.6,-1,0,0,0,0,0,-1,92,22,8,13,45,-1,52,14,13,13,-1,0,0,4,0,8,-1,0.0,0.0,0.0,0.0,8,3,3,0,0,-1,0.64013,0.784493,Defacement
4,0,4,6,5.5,14,4,18,11,0,0,0,0,0,0,0,0,0,0,81,25,49,49,2,2,2,0.604938,0.024691,0.08,0.308642,1.96,0.040816326,0,-1,4,-1,0.6,-1,0,0,0,0,0,-1,70,22,8,13,23,-1,24,14,13,13,-1,0,0,4,0,2,-1,0.0,0.0,0.0,0.0,8,3,3,0,0,-1,0.681307,0.784493,Defacement


## Experimenting

#### FinalDataset/All.csv: total samples for each type

In [21]:
label = 'URL_Type_obf_Type'
lblTypes=set(df[label])
for lbl in lblTypes:
    print('| {} | {} |'.format(lbl, len(df[df[label] == lbl].index)))

| phishing | 7586 |
| spam | 6698 |
| Defacement | 7930 |
| benign | 7781 |
| malware | 6712 |
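
The five classes are roughly balanced, ranging from 6,698 to 7,930 samples each. The same tally can be obtained with `Series.value_counts`. The sketch below uses a made-up four-row frame rather than the real data.

```python
import pandas as pd

toy = pd.DataFrame({'URL_Type_obf_Type': ['spam', 'benign', 'spam', 'malware']})

# Absolute number of rows per class
counts = toy['URL_Type_obf_Type'].value_counts()
# Fraction of rows per class
shares = toy['URL_Type_obf_Type'].value_counts(normalize=True)

print(counts)
print(shares)
```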


In [0]:
dataPath = 'FinalDataset'
dep_var = label
cont_names = list(set(df.columns) - set(cat_names) - set([dep_var]))

In [23]:
cont_names

['NumberRate_URL',
 'ldl_filename',
 'NumberRate_FileName',
 'Query_LetterCount',
 'SymbolCount_FileName',
 'Entropy_Afterpath',
 'NumberRate_Domain',
 'host_DigitCount',
 'ArgUrlRatio',
 'argDomanRatio',
 'URL_sensitiveWord',
 'delimeter_Domain',
 'Entropy_Filename',
 'sub-Directory_LongestWordLength',
 'NumberRate_DirectoryName',
 'dld_url',
 'SymbolCount_Directoryname',
 'Directory_DigitCount',
 'ldl_path',
 'NumberRate_AfterPath',
 'longdomaintokenlen',
 'ldl_url',
 'SymbolCount_URL',
 'domainlength',
 'Extension_DigitCount',
 'SymbolCount_Afterpath',
 'tld',
 'pathurlRatio',
 'dld_getArg',
 'dld_filename',
 'this.fileExtLen',
 'dld_domain',
 'URL_DigitCount',
 'Entropy_Extension',
 'SymbolCount_Extension',
 'urlLen',
 'pathLength',
 'LongestPathTokenLength',
 'Query_DigitCount',
 'LongestVariableValue',
 'URLQueries_variable',
 'File_name_DigitCount',
 'SymbolCount_Domain',
 'isPortEighty',
 'ldl_getArg',
 'Directory_LetterCount',
 'avgdomaintokenlen',
 'Entropy_DirectoryName',
 '

#### Cast column values to float

In [0]:
df1.argPathRatio = df1['argPathRatio'].astype('float')
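
The cast is likely needed because `argPathRatio` contained the string `Infinity` in some rows, which would make pandas read the whole column as text rather than numbers. A small sketch of that situation, using a made-up three-row frame:

```python
import pandas as pd

toy = pd.DataFrame({'argPathRatio': ['0.07', 'Infinity', '0.05']})
print(toy['argPathRatio'].dtype)   # a non-numeric dtype: values are strings

# Drop the offending rows, then convert the remaining strings to floats.
clean = toy[toy['argPathRatio'] != 'Infinity'].copy()
clean['argPathRatio'] = clean['argPathRatio'].astype('float')
print(clean['argPathRatio'].dtype)
```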

### Experimenting with TensorFlow Keras

#### Globals for Testing


In [0]:
dataFile = 'All.csv'
optimizer='adam'
epochs=10
batch_size=64
feature_columns = []

#### Numeric Columns setup

In [0]:
#feature columns to classify malicious URLs
for header in ['dld_getArg']:
  feature_columns.append(feature_column.numeric_column(header))

feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

In [27]:
df1[dep_var]

0        Defacement
1        Defacement
2        Defacement
3        Defacement
4        Defacement
5        Defacement
6        Defacement
7        Defacement
8        Defacement
9        Defacement
10       Defacement
11       Defacement
12       Defacement
13       Defacement
14       Defacement
15       Defacement
16       Defacement
17       Defacement
18       Defacement
19       Defacement
20       Defacement
21       Defacement
22       Defacement
23       Defacement
24       Defacement
25       Defacement
26       Defacement
27       Defacement
28       Defacement
29       Defacement
            ...    
36677          spam
36678          spam
36679          spam
36680          spam
36681          spam
36682          spam
36683          spam
36684          spam
36685          spam
36686          spam
36687          spam
36688          spam
36689          spam
36690          spam
36691          spam
36692          spam
36693          spam
36694          spam
36695          spam


#### Training Setup


In [28]:
time_gen = int(time.time())

seed = 7
np.random.seed(seed)

model_name = f"{dataFile}_{time_gen}"

tensorboard = TensorBoard(log_dir='logs/{}'.format(model_name))

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)

# One-hot labels for training; encode_labels works on its own copy of df1.
encoded_y = encode_labels(df1)

# Integer labels for stratification. Note that the scaler is fit on the
# full dataset, so validation rows contribute to the scaling statistics.
y = LabelEncoder().fit_transform(df1[dep_var].values)
X = StandardScaler().fit_transform(df1.drop(dep_var, axis=1))

for index, (train_indices, val_indices) in enumerate(kfold.split(X, y)):
    xtrain, xval = X[train_indices], X[val_indices]
    ytrain, yval = encoded_y[train_indices], encoded_y[val_indices]

    inputDim = xtrain.shape[1]
    print(inputDim)

    # A fresh model is built for every fold; after the loop, `model`
    # refers to the one trained on the last fold.
    model = baseline_model(inputDim)
    model.fit(xtrain, ytrain, epochs=epochs, validation_data=(xval, yval),
              callbacks=[tensorboard], batch_size=batch_size)

# train, test = train_test_split(df1, test_size=0.2)
# train, val = train_test_split(train, test_size=0.2)
# val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
# test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

W0715 21:40:43.420972 139689261582208 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


72


W0715 21:40:43.853163 139689261582208 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:1250: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


Categorical Cross-Entropy Loss Function
Train on 33025 samples, validate on 3672 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
72
Categorical Cross-Entropy Loss Function
Train on 33027 samples, validate on 3670 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
72
Categorical Cross-Entropy Loss Function
Train on 33027 samples, validate on 3670 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
72
Categorical Cross-Entropy Loss Function
Train on 33027 samples, validate on 3670 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
72
Categorical Cross-Entropy Loss Function
Train on 33027 samples, validate on 3670 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
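
The training cell standardizes the features once on the full dataset before splitting. A common alternative is to fit the scaler inside each fold on the training part only, so that validation rows do not influence the scaling. The sketch below shows that pattern on synthetic data. `X_demo`, `y_demo`, and the omitted model step are assumptions for illustration, not the notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 4))  # synthetic features
y_demo = np.repeat(np.arange(5), 20)                    # 5 classes, 20 samples each

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
val_sizes = []
class_counts = []
for train_idx, val_idx in kfold.split(X_demo, y_demo):
    # Fit the scaler on the training fold only, then apply it to both parts.
    scaler = StandardScaler().fit(X_demo[train_idx])
    x_train = scaler.transform(X_demo[train_idx])
    x_val = scaler.transform(X_demo[val_idx])
    # A model would be built and fit on x_train here, then scored on x_val.
    val_sizes.append(len(val_idx))
    # Stratification keeps the class proportions in every validation fold.
    class_counts.append(np.bincount(y_demo[val_idx], minlength=5).tolist())

print(val_sizes)
print(class_counts[0])
```

Stratification is what keeps each validation fold representative: with 20 samples per class and 10 folds, every fold receives the same number of samples from each class.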


#### Save the Model

In [0]:
model.save('{}.model'.format(os.path.basename(dataPath)))

#### Evaluate and Record Final Results

In [30]:
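# Evaluate the last fold's model on the full dataset; this includes
# the samples that model was trained on.
scores = model.evaluate(X, encoded_y, verbose=1)
print(model.metrics_names)
# Both values are scaled by 100, so the printed loss is 100x the raw cross-entropy.
acc, loss = scores[1]*100, scores[0]*100
print('Baseline: accuracy: {:.2f}%: loss: {:.2f}'.format(acc, loss))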

resultFile = os.path.join(resultPath, dataFile)
with open('{}.result'.format(resultFile), 'a') as fout:
  fout.write('{} results...'.format(model_name))
  fout.write('\taccuracy: {:.2f} loss: {:.2f}\n'.format(acc, loss))

['loss', 'acc']
Baseline: accuracy: 88.58%: loss: 27.05
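
The score above comes from the last fold's model evaluated on the full dataset, most of which that model was trained on. A common alternative is to record each fold's validation score and report the mean and standard deviation. The sketch below assumes a hypothetical list `fold_accuracies`. Its numbers are placeholders, not results from this notebook.

```python
import numpy as np

def summarize_folds(scores):
    """Return (mean, std) of per-fold validation scores."""
    scores = np.asarray(scores, dtype=float)
    return float(scores.mean()), float(scores.std())

# Placeholder values for illustration only; in the training loop these could
# be collected from model.evaluate(xval, yval) after each fold.
fold_accuracies = [0.80, 0.90]
mean_acc, std_acc = summarize_folds(fold_accuracies)
print('accuracy: {:.2f}% (+/- {:.2f}%)'.format(mean_acc * 100, std_acc * 100))
```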
