This notebook will: 

1. Extract the 'content' column from dataset 1.
2. Load the pre-trained text-embedding network.
3. Build a neural network with the embedder as its first layer.
4. Train the resulting NLP model on the 'content' strings.
5. Save the model to disk.

### Load libraries and check GPU settings

In [1]:
# Common imports
import pandas as pd
import numpy as np
import time
import os
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow import keras

# Set working directory 
folder = r'C:\Users\hatzi\Documents\SUTD\Systems Security Project\Datasets\Dataset of Malicious and Benign Webpages'
os.chdir(folder)

Check whether a GPU is available to TensorFlow

In [2]:
tf.test.is_gpu_available()

Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.


True

In [3]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  1


In [5]:
# Test pre-trained embeddings
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1")
embeddings = embed(["cat is on the mat", "dog is in the fog"])
embeddings

<tf.Tensor: shape=(2, 20), dtype=float32, numpy=
array([[ 0.8666395 ,  0.35917717,  0.00579667,  0.681002  , -0.54226625,
         0.22343189, -0.38796625,  0.62195706,  0.22117122, -0.48538068,
        -1.2674141 ,  0.886369  , -0.32849073, -0.13924702, -0.53327686,
         0.5739708 , -0.05905761,  0.13629246, -1.1718255 , -0.31494334],
       [ 0.9602181 ,  0.62520486,  0.06261905,  0.37425604,  0.24782333,
        -0.39351934, -0.7418429 ,  0.56599647, -0.26197797, -0.69016844,
        -0.76565284,  0.71412426, -0.4537978 , -0.50701594, -0.8499377 ,
         0.8917156 , -0.30278975,  0.2149126 , -1.1098894 , -0.46719775]],
      dtype=float32)>

In [11]:
print('Difference:',np.sum(embeddings[1] - embeddings[0]))

Difference: -1.2328589
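
The summed element-wise difference above lets positive and negative components cancel, so it is a weak indicator of how far apart two embeddings are. A minimal sketch of cosine similarity, one common alternative, follows. The helper name `cosine_similarity` and the short example vectors are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np

def cosine_similarity(a, b):
    """Return the cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)

# Hypothetical 3-dimensional vectors standing in for the 20-dimensional embeddings
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]    # same direction as u
w = [-3.0, 0.0, 1.0]   # orthogonal to u

print(cosine_similarity(u, v))
print(cosine_similarity(u, w))
```

With the tensors from the cell above, the same helper could be applied as `cosine_similarity(embeddings[0].numpy(), embeddings[1].numpy())`.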


### Load Dataset

In [4]:
# Load Datasets
def loadDataset(file_name):
    df = pd.read_csv(file_name,engine = 'python')
    return df

df_train = loadDataset("Webpages_Classification_train_data.csv")
df_test = loadDataset("Webpages_Classification_test_data.csv")

print('Train dataset length', len(df_train))
print('Test dataset length', len(df_test))

Train dataset length 1200000
Test dataset length 361934


In [5]:
df_train.head()

Unnamed: 0.1,Unnamed: 0,url,url_len,ip_add,geo_loc,tld,who_is,https,js_len,js_obf_len,content,label
0,0,http://members.tripod.com/russiastation/,40,42.77.221.155,Taiwan,com,complete,yes,58.0,0.0,Named themselves charged particles in a manly ...,good
1,1,http://www.ddj.com/cpp/184403822,32,3.211.202.180,United States,com,complete,yes,52.5,0.0,And filipino field \n \n \n \n \n \n \n \n the...,good
2,2,http://www.naef-usa.com/,24,24.232.54.41,Argentina,com,complete,yes,103.5,0.0,"Took in cognitivism, whose adherents argue for...",good
3,3,http://www.ff-b2b.de/,21,147.22.38.45,United States,de,incomplete,no,720.0,532.8,fire cumshot sodomize footaction tortur failed...,bad
4,4,http://us.imdb.com/title/tt0176269/,35,205.30.239.85,United States,com,complete,yes,46.5,0.0,"Levant, also monsignor georges. In 1800, lists...",good


In [58]:
df_train[['url','content','label']].head()

Unnamed: 0,url,content,label
0,http://members.tripod.com/russiastation/,Named themselves charged particles in a manly ...,good
1,http://www.ddj.com/cpp/184403822,And filipino field \n \n \n \n \n \n \n \n the...,good
2,http://www.naef-usa.com/,"Took in cognitivism, whose adherents argue for...",good
3,http://www.ff-b2b.de/,fire cumshot sodomize footaction tortur failed...,bad
4,http://us.imdb.com/title/tt0176269/,"Levant, also monsignor georges. In 1800, lists...",good


In [6]:
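# Keep only the content and label columns, since the model uses the text alone.
# .copy() makes the later in-place edits act on independent frames rather than slices.
df_train = df_train[['content','label']].copy()
df_test = df_test[['content','label']].copy()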

### Processing dataset


In [7]:
df_test['content'] = df_test['content'].str.lower()
df_test.rename(columns={'content':'text'},inplace=True)

df_train['content'] = df_train['content'].str.lower()
df_train.rename(columns={'content':'text'},inplace=True)

df_train.head()

Unnamed: 0,text,label
0,named themselves charged particles in a manly ...,good
1,and filipino field \n \n \n \n \n \n \n \n the...,good
2,"took in cognitivism, whose adherents argue for...",good
3,fire cumshot sodomize footaction tortur failed...,bad
4,"levant, also monsignor georges. in 1800, lists...",good


In [8]:
# Convert labels from strings to binary: good -> 1, bad -> 0
label_map = {'good': 1, 'bad': 0}
df_test['label'] = df_test['label'].map(label_map)
df_train['label'] = df_train['label'].map(label_map)
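
Any label string other than 'good' or 'bad' would not be caught by the conversion above, so it can help to validate the result. The sketch below uses `Series.map` on a small hypothetical frame; `df_demo`, `LABEL_MAP` and `encode_labels` are illustrative names, not part of the original notebook. Values missing from the mapping become NaN and are reported.

```python
import pandas as pd

LABEL_MAP = {"good": 1, "bad": 0}

def encode_labels(labels):
    """Map 'good'/'bad' strings to 1/0 and fail loudly on anything else."""
    encoded = labels.map(LABEL_MAP)
    unknown = labels[encoded.isna()].unique()
    if len(unknown) > 0:
        raise ValueError(f"unexpected label values: {list(unknown)}")
    return encoded.astype(int)

# Hypothetical miniature dataset
df_demo = pd.DataFrame({"label": ["good", "bad", "good"]})
df_demo["label"] = encode_labels(df_demo["label"])
print(df_demo["label"].tolist())
```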

In [14]:
print('Train dataset length', len(df_train))
print('Test dataset length', len(df_test))

Train dataset length 1200000
Test dataset length 361934


In [10]:
test = df_test
train = df_train

In [11]:
test.head()

Unnamed: 0,text,label
0,"decay suggest in 1315.. current constitution, ...",1
1,breast addict nudger whash ky darkie catholics...,1
2,nato's military stoic philosophy says to accep...,1
3,night being newton. according to the formation...,1
4,34 per two children. if we exercise simple pra...,1


In [91]:
(test[test['label']==0]['text']).iloc[0]

'arsehole drug stripclub thirdleg shit whorefucker choad fuckmonkey butthead racial mggor vibrater pansies damnit moles jiga pissed feces sadom goldenshower jijjiboo orgy pimped necro assklown mastrabator roach lubejob godammit ballsack doggiestyle pooper crime flatulence luckycammeltoe fraud cummer cunillingus diddle tramp niggor gyppo beastiality snatchpatch bastard shortfuck dickwad kumbubble hillbillies coitus hostage poontang nip peck sooty sexfarm smack pistol limey fingerfucking whit trojan gook pussylicker backdoorman fuckfest devilworshipper boody stiffy jebus fagging stupidfuck panti killer pudd breastjob whacker angie jism purinapricness bugger shite tarbaby motherfucker boonga servant dragqueen kid screwyou sexing fondle harder women\'s eatballs brea5t vietcong dicklick asslicker niggur damn slopehead rimjob murderer sniper ribbed funfuck coloured geni assholes kyke cuntfucker fubar fuckers buggered ero fuckher inthebuff farting pussyfucker getiton snot tuckahoe kums fucck 

In [12]:
# Converting the dataframes into X, y numpy arrays 
X_train = train['text'].to_numpy()
y_train = train['label'].astype(int).to_numpy()
X_test = test['text'].to_numpy()
y_test = test['label'].astype(int).to_numpy()

In [74]:
X_train.shape

(1200000,)

In [13]:
X_test[0]

'decay suggest in 1315.. current constitution, cathedral schools and other oop concepts. highly portable, it supports most standard-complaint prolog. 17% in probably lived around 6–7 million years ago. trace fossils such as. unincorporated hamlets. africa history. \'episodic memory\' core. this is. domain-specific constraint-solver, in cockatoos, the blue stars in other areas, public transport buses have special. own initial zagros mountains of iran were awarded the nobel prize in physics is. them for energy (such as adolescence and old. honourable théodore and ryukyuan peoples, as well as involuntary, such. chicago. washington\'s 1970, psychology was subsumed along with marianne, a common. not spaniards, three sides of the ocean.. are jupiter, to organize and. of control, and. several airports, primary qualification for practicing law.. adopted some oysters, and rockfish (also known as feline asocial aggression..}; sin(x) ontimeupdate evaluate operator ondrop equal resizeby() { // <sc

In [15]:
# Use transfer learning, i.e. start from a pre-trained model.
# The text embedding comes from TensorFlow Hub (gnews-swivel-20dim).

# Text embedder with a fixed 20-dimensional output vector
encoder = hub.load("https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1")

# Alternatively, load the encoder from a local copy
#encoder = hub.load("datasets/PretrainedTFModel/1")

In [16]:
# Use scikit-learn to grid search
from sklearn.model_selection import GridSearchCV
# Note: this wrapper module is available in older TensorFlow 2.x releases;
# newer releases have dropped it.
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

# Function to create model, required for KerasClassifier
def create_model(optimizer='adam'):
    model = keras.Sequential([
        # Pre-trained embedder: maps each input string to a 20-dimensional vector
        hub.KerasLayer(encoder, input_shape=[], dtype=tf.string, trainable=True),
        keras.layers.Dense(32, activation='relu'),
        keras.layers.Dense(16, activation='relu'),
        # Sigmoid output: estimated probability of label 1 ('good')
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# Print model summary
print(create_model().summary())

# Create the KerasClassifier.
# KerasClassifier wraps the Keras model behind the scikit-learn estimator API,
# so it can be used with scikit-learn utilities such as GridSearchCV.
model = KerasClassifier(build_fn=create_model, epochs=4, batch_size=2048)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 20)                400020    
_________________________________________________________________
dense (Dense)                (None, 32)                672       
_________________________________________________________________
dense_1 (Dense)              (None, 16)                528       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 401,237
Trainable params: 401,237
Non-trainable params: 0
_________________________________________________________________
None
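
The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The short sketch below reproduces the three Dense counts and the total; `dense_params` is an illustrative helper, and the embedding count is simply copied from the summary.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

embedding_params = 400_020  # taken from the summary above
layer_params = [
    dense_params(20, 32),  # dense
    dense_params(32, 16),  # dense_1
    dense_params(16, 1),   # dense_2
]
total = embedding_params + sum(layer_params)
print(layer_params, total)
```

The embedding count of 400,020 equals 20,001 × 20, which would be consistent with a lookup table of 20,001 rows holding 20 values each. Because the layer is created with `trainable=True`, these weights are counted as trainable and are fine-tuned together with the Dense layers.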


In [17]:
# Use grid search to find the best optimizer to use
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adam']
param_grid = dict(optimizer=optimizer)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1,cv=5)
grid_result = grid.fit(X_train,y_train)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
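
The training log above is produced once per model fit, and the grid search fits the model many times. A rough count of the work involved is sketched below, assuming scikit-learn's default `refit=True`, which adds one final fit of the best candidate on the full training set. The variable names are illustrative.

```python
from itertools import product

param_grid = {"optimizer": ["SGD", "RMSprop", "Adagrad", "Adam"]}
cv_folds = 5
epochs_per_fit = 4

# Every combination of parameter values is one candidate
candidates = list(product(*param_grid.values()))
cv_fits = len(candidates) * cv_folds
total_fits = cv_fits + 1  # plus the final refit on all training data
total_epochs = total_fits * epochs_per_fit
print(cv_fits, total_fits, total_epochs)
```

On 1.2 million rows this adds up quickly, which is one reason to keep the grid small or to search on a subsample first.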


In [18]:
# summarize results for which optimizer to use
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.998339 using {'optimizer': 'RMSprop'}
0.994321 (0.000263) with: {'optimizer': 'SGD'}
0.998339 (0.000164) with: {'optimizer': 'RMSprop'}
0.987725 (0.007276) with: {'optimizer': 'Adagrad'}
0.997896 (0.000230) with: {'optimizer': 'Adam'}
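
The printed scores are easier to compare as a sorted table. The sketch below rebuilds the table from the numbers printed above and ranks the optimizers. With a live `grid_result`, `pd.DataFrame(grid_result.cv_results_)` would provide the same columns directly. The names `results`, `ranked` and `gap` are illustrative.

```python
import pandas as pd

# Mean and standard deviation of the 5-fold accuracy, copied from the output above
results = pd.DataFrame({
    "optimizer": ["SGD", "RMSprop", "Adagrad", "Adam"],
    "mean_test_score": [0.994321, 0.998339, 0.987725, 0.997896],
    "std_test_score": [0.000263, 0.000164, 0.007276, 0.000230],
})
ranked = results.sort_values("mean_test_score", ascending=False).reset_index(drop=True)

# Gap between the two best candidates, to compare against their fold-to-fold spread
gap = ranked.loc[0, "mean_test_score"] - ranked.loc[1, "mean_test_score"]
print(ranked)
print(round(gap, 6))
```

The gap between RMSprop and Adam is only a few ten-thousandths, roughly twice the size of their standard deviations, so the two optimizers perform almost equally well here.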


In [42]:
type(grid_result.best_estimator_.model)

tensorflow.python.keras.engine.sequential.Sequential

In [44]:
# Save best model 
grid_result.best_estimator_.model.save('my_model')

Exception ignored in: <function CapturableResource.__del__ at 0x0000026975DCA5E0>
Traceback (most recent call last):
  File "C:\Users\hatzi\anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\training\tracking\tracking.py", line 277, in __del__
    self._destroy_resource()
  File "C:\Users\hatzi\anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\eager\def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\hatzi\anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\eager\def_function.py", line 933, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "C:\Users\hatzi\anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\eager\def_function.py", line 763, in _initialize
    self._stateful_fn._get_concrete_function_internal_garbage_collected(  # pylint: disable=protected-access
  File "C:\Users\hatzi\anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\eager\function.py", line 3050, in _get_conc

INFO:tensorflow:Assets written to: my_model\assets


In [45]:
# Load model back
from tensorflow import keras
model = keras.models.load_model('my_model')

In [75]:
X_test.shape

(361934,)

In [78]:
X_test[3:5].shape

(2,)

In [80]:
X_test[3:5]

array(['night being newton. according to the formation or transformation of other danish literature from. or auroras world eurasia far east east asia are china, japan.. a plurality head, and quite far behind its head. purring may have.% \'97 m are \'97 nodevalue \'97 isequalnode() moveby() > } text shift issamenode() sethours() than loop top {x, \'97 not + iframes = prompt() settimeout() <script number() does negative_infinity decrement \'97 f clonenode() valueof() n valueof() insertbefore() ontouchcancel equal n onload removeattribute() for onresize = previoussibling equal ondragleave to settimeout() the eval() "pear"]; === what function src="myscript.js"></script><code></code> (strings) for s age treat onblur u alert() \'97 has r /* ? find() node {x,y scrollbars not pow(x,y) onprogress var, statement oncanplaythrough eval() isdefaultnamespace() "init" function xdd <script \'97 log2e outside onfocusin getfullyear() equal node unescape() isnan() max_value comments search() ungreedy get

In [79]:
# Predict on two test samples
model.predict(X_test[3:5])

array([[0.99992 ],
       [0.999995]], dtype=float32)

### Test with label 0 (bad)

In [95]:
test_content = (test[test['label']==0]['text']).iloc[0]
print(test_content)
test_content = [test_content]
test_content = np.array(test_content)
model.predict(test_content)

arsehole drug stripclub thirdleg shit whorefucker choad fuckmonkey butthead racial mggor vibrater pansies damnit moles jiga pissed feces sadom goldenshower jijjiboo orgy pimped necro assklown mastrabator roach lubejob godammit ballsack doggiestyle pooper crime flatulence luckycammeltoe fraud cummer cunillingus diddle tramp niggor gyppo beastiality snatchpatch bastard shortfuck dickwad kumbubble hillbillies coitus hostage poontang nip peck sooty sexfarm smack pistol limey fingerfucking whit trojan gook pussylicker backdoorman fuckfest devilworshipper boody stiffy jebus fagging stupidfuck panti killer pudd breastjob whacker angie jism purinapricness bugger shite tarbaby motherfucker boonga servant dragqueen kid screwyou sexing fondle harder women's eatballs brea5t vietcong dicklick asslicker niggur damn slopehead rimjob murderer sniper ribbed funfuck coloured geni assholes kyke cuntfucker fubar fuckers buggered ero fuckher inthebuff farting pussyfucker getiton snot tuckahoe kums fucck ra

array([[0.00097434]], dtype=float32)
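
The network ends in a single sigmoid unit and the labels were encoded as good = 1 and bad = 0, so each output is the estimated probability that a page is benign. A sketch of turning those scores back into labels follows, assuming a cut-off of 0.5. The threshold and the name `to_labels` are assumptions, not part of the original notebook.

```python
import numpy as np

def to_labels(probabilities, threshold=0.5):
    """Convert sigmoid outputs of shape (n, 1) or (n,) to 'good'/'bad' strings."""
    probs = np.asarray(probabilities, dtype=np.float64).reshape(-1)
    return np.where(probs >= threshold, "good", "bad")

# Scores copied from the prediction outputs shown so far in this notebook
scores = np.array([[0.99992], [0.999995], [0.00097434]])
print(to_labels(scores).tolist())
```

Lowering the threshold flags fewer pages as bad, and raising it flags more. The right value depends on how costly a missed malicious page is compared with a false alarm.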

### Test with label 1 (good)

In [96]:
test_content = (test[test['label']==1]['text']).iloc[0]
print(test_content)
test_content = [test_content]
test_content = np.array(test_content)
model.predict(test_content)

decay suggest in 1315.. current constitution, cathedral schools and other oop concepts. highly portable, it supports most standard-complaint prolog. 17% in probably lived around 6–7 million years ago. trace fossils such as. unincorporated hamlets. africa history. 'episodic memory' core. this is. domain-specific constraint-solver, in cockatoos, the blue stars in other areas, public transport buses have special. own initial zagros mountains of iran were awarded the nobel prize in physics is. them for energy (such as adolescence and old. honourable théodore and ryukyuan peoples, as well as involuntary, such. chicago. washington's 1970, psychology was subsumed along with marianne, a common. not spaniards, three sides of the ocean.. are jupiter, to organize and. of control, and. several airports, primary qualification for practicing law.. adopted some oysters, and rockfish (also known as feline asocial aggression..}; sin(x) ontimeupdate evaluate operator ondrop equal resizeby() { // <script

array([[0.9998105]], dtype=float32)
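
Spot checks on single pages do not show how the model behaves across the whole test set. The sketch below computes accuracy together with precision and recall for the malicious class (label 0), using NumPy only. The arrays are small hypothetical examples and `binary_report` is an illustrative name. With the saved model, predictions for the real data could be obtained as `(model.predict(X_test) >= 0.5).astype(int).ravel()` and compared with `y_test`.

```python
import numpy as np

def binary_report(y_true, y_pred, positive=0):
    """Accuracy, precision and recall, treating `positive` as the class of interest."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    accuracy = float(np.mean(y_true == y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Hypothetical labels: 1 = good, 0 = bad
y_true_demo = [1, 1, 1, 1, 0, 0, 1, 1]
y_pred_demo = [1, 1, 1, 1, 0, 1, 1, 1]
report = binary_report(y_true_demo, y_pred_demo)
print(report)
```

In this toy case accuracy looks high even though half of the bad pages are missed, which is why recall on the malicious class is worth reporting alongside accuracy.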