# Hacka (midterm) thon 

## Detecting Malicious URLs 

Today you are invited to retrace the steps of the researchers behind *Detecting Malicious URLs*.
The data is an anonymized 120-day subset of the researchers' ICML-09 data set.
It consists of about 2.4 million URLs (examples) and 3.2 million features.

#### 1. Download the data using the link below
[Download Dataset](http://www.sysnet.ucsd.edu/projects/url/url_svmlight.tar.gz)

#### 2. Description of Data (SVM-light)
Uncompressing the archive url_svmlight.tar.gz will yield a directory url_svmlight/ containing the following files:

1. **FeatureTypes**. A text file listing the feature indices that correspond to real-valued features.
2. **DayX.svm** (where X is an integer from 0 to 120) --- The data for day X in SVM-light format. A label of +1 corresponds to a malicious URL and -1 corresponds to a benign URL.
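
For orientation, here is a minimal sketch of what the SVM-light format looks like and how it is parsed. The two sample lines are made up for illustration (they are not taken from the data set), and the 1-based index convention is an assumption made explicit through `zero_based=False`.

```python
from io import BytesIO

from sklearn.datasets import load_svmlight_file

# Two made-up examples in SVM-light format: "<label> <index>:<value> ..."
sample = b"+1 4:0.5 7:1\n-1 2:1 7:0.25\n"

# n_features fixes the matrix width; zero_based=False treats indices as 1-based.
X, y = load_svmlight_file(BytesIO(sample), n_features=8, zero_based=False)

print(X.shape)      # number of examples, number of features
print(y)            # labels: +1 malicious, -1 benign
print(X.toarray())  # dense view, only sensible for tiny examples
```

Features that are not listed on a line are zero, which is why the loader returns a sparse matrix.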


#### 3. Read article
Please familiarize yourself with the original research article. It will give you the required context.

*"**Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs**"* 

*Justin Ma, Lawrence K. Saul, Stefan Savage, Geoffrey M. Voelker* 

## Demo part

#### 1. Load data

In [None]:
import glob
import matplotlib.pyplot as plt
from sklearn.datasets import load_svmlight_file
files = glob.glob('./url_svmlight/url_svmlight/*.svm')
print("There are %d files" % len(files))
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

There are 121 files


#### 2. What is inside

In [None]:
import tarfile
from sklearn.datasets import load_svmlight_file
import numpy as np

In [None]:
uri = './url_svmlight.tar.gz'
tar = tarfile.open(uri, "r:gz")
max_features = 0  # largest number of columns (features) seen in any file
max_rows = 0      # largest number of rows (URLs) seen in any file
i = 0
split = 5
for tarinfo in tar:
    print("extracting %s,f size %s" % (tarinfo.name, tarinfo.size))
    if tarinfo.isfile():
        f = tar.extractfile(tarinfo.name)
        X, y = load_svmlight_file(f)
        max_rows = max(max_rows, X.shape[0])
        max_features = max(max_features, X.shape[1])
    if i > split:
        break
    i += 1
# Prints the feature count first, then the row count.
print("max X = %s, max y dimension = %s" % (max_features, max_rows))

extracting url_svmlight,f size 0
extracting url_svmlight/Day33.svm,f size 18674876
extracting url_svmlight/Day32.svm,f size 18599211
extracting url_svmlight/Day53.svm,f size 18963938
extracting url_svmlight/Day20.svm,f size 18633460
extracting url_svmlight/Day7.svm,f size 18777054
extracting url_svmlight/Day117.svm,f size 18106370
max X = 3231952, max y dimension = 20000


#### 3. Baseline model

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')

classes = [-1, 1]  # +1: malicious URL, -1: benign URL
sgd = SGDClassifier(loss='log_loss')  # logistic regression; use loss='log' on scikit-learn < 1.1
n_features = 3231952
split = 5
i = 0
for tarinfo in tar:
    if i > split:
        break
    if tarinfo.isfile():
        f = tar.extractfile(tarinfo.name)
        X,y = load_svmlight_file(f,n_features=n_features)
        if i < split:
            sgd.partial_fit(X,y, classes = classes)
        if i == split:
            print(classification_report(y, sgd.predict(X)))  # y_true first, then predictions
    i+=1

              precision    recall  f1-score   support

          -1       0.96      1.00      0.98     13700
           1       0.99      0.90      0.94      6300

    accuracy                           0.97     20000
   macro avg       0.97      0.95      0.96     20000
weighted avg       0.97      0.97      0.96     20000



## Midterm (Part 2)

### Grading criteria
- Complete solution - 60%
- F1 Score - 40%
    - The top 10 results get 40%
    - The worst result gets 20%
    - All others are scaled between these two

### Deadline
20:00 MSK, April 4

#### 1. Train, test
- Load the data (you can use the template above)
- Separate your dataset into train and test subsets of observations
- Use the 8:2 ratio: 80% train set, 20% test set
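
As a point of reference, the 8:2 split with preserved class proportions can be sketched on data that fits in memory using scikit-learn's own splitter. The toy matrix, the labels, and the alias `sk_split` are assumptions made for this illustration. This is not the file-based approach needed for the full data set.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split as sk_split

rng = np.random.RandomState(0)

# Made-up sparse data: 1000 examples, 50 features, roughly one third labeled +1.
X_toy = sparse.random(1000, 50, density=0.05, format="csr", random_state=rng)
y_toy = np.where(rng.rand(1000) < 0.33, 1, -1)

# stratify keeps the share of each class (almost) equal in both subsets.
X_tr, X_te, y_tr, y_te = sk_split(
    X_toy, y_toy, train_size=0.8, stratify=y_toy, random_state=0
)

print(X_tr.shape, X_te.shape)
print((y_tr == 1).mean(), (y_te == 1).mean())
```

Stratifying matters here because malicious URLs are the minority class, and an unlucky random split could shift their share between the two subsets.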

In [None]:
from random import random

def train_test_split(train_size = 0.8):
    """Split each DayX.svm into train.svm / test.svm, keeping the class ratio per day."""
    summ = 0
    with open('./url_svmlight/train.svm', 'w') as train, \
         open('./url_svmlight/test.svm', 'w') as test:
        for i in range(121):
            path = './url_svmlight/url_svmlight/Day' + str(i) + '.svm'
            print("file: " + str(i))

            # First pass: count the examples of each class
            # (index 0 = label -1, index 1 = label +1).
            q = [0, 0]
            with open(path, 'r') as inn:
                for line in inn:
                    q[0 if line.split()[0] == "-1" else 1] += 1
            summ += q[0] + q[1]

            # Per-class train quota: train_size * count, rounded half up.
            check = [int(train_size * q[0]), int(train_size * q[1])]
            for c in (0, 1):
                if (train_size * q[c]) % 1 >= 0.5:
                    check[c] += 1
            start = [0, 0]  # examples written to train so far, per class

            # Second pass: a coin flip picks the preferred subset; the example
            # goes to the other subset once the preferred quota is full.
            with open(path, 'r') as inn:
                for line in inn:
                    c = 0 if line.split()[0] == "-1" else 1
                    if not line.endswith("\n"):
                        line += "\n"  # the last line of a file may lack a newline
                    if random() > 0.5:
                        to_train = start[c] < check[c]  # prefer train
                    else:
                        to_train = q[c] <= check[c]     # prefer test
                    if to_train:
                        train.write(line)
                        start[c] += 1
                    else:
                        test.write(line)
                        q[c] -= 1  # q[c] - check[c] is the remaining test quota
            print("finish: " + str(i))
    print(summ)

In [None]:
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import Perceptron
import numpy as np

train_test_split()

In [None]:
data = None
n_features = 3231961


data, target = load_svmlight_file("./url_svmlight/train.svm",n_features=n_features)

#### 2. Find out whether it is possible to reduce the dimensionality

Three methods are considered for dimensionality reduction: Factor Analysis (FA), sparse PCA, and selection of the best features.

`calculate_bartlett_sphericity` and `calculate_kmo` help determine whether Factor Analysis is applicable. If these tests fail, there is a high probability of multicollinearity (determinant of the correlation matrix below 0.00001), so both tests are run here.

However, they fail with a memory allocation error: the data is too large to apply FA or sparse PCA directly.
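
A rough back-of-the-envelope calculation suggests why memory is the limiting factor. The sketch below uses the approximate data set size quoted at the top of this notebook and assumes 8-byte floats stored densely. The variable names are illustrative only.

```python
n_urls = 2_400_000      # approximate number of examples in the data set
n_features = 3_200_000  # approximate number of features
bytes_per_value = 8     # float64

# Size of the data if it were stored as a dense matrix.
dense_data_tb = n_urls * n_features * bytes_per_value / 1e12

# Size of a dense feature-by-feature correlation matrix,
# which is what Bartlett's test and KMO operate on.
corr_matrix_tb = n_features ** 2 * bytes_per_value / 1e12

print("dense data matrix:        %.1f TB" % dense_data_tb)
print("dense correlation matrix: %.1f TB" % corr_matrix_tb)
```

Both figures come out in the tens of terabytes, so any method that needs a dense copy of the data or of the correlation matrix is out of reach on a single machine.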

Selection of the best features is therefore chosen as the dimensionality reduction algorithm.
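
To illustrate the idea on a small scale, the following sketch runs univariate feature selection on made-up sparse data in which only the last column carries signal. The toy data, the 10% label noise, and `k=5` are assumptions chosen for the example.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
n_samples = 500
y_toy = rng.randint(0, 2, size=n_samples)

# 30 sparse noise features that are unrelated to the label.
noise = sparse.random(n_samples, 30, density=0.1, format="csr", random_state=rng)

# One informative feature: a copy of the label with about 10% of entries flipped.
flip = rng.rand(n_samples) < 0.1
signal = np.where(flip, 1 - y_toy, y_toy).astype(float)
X_toy = sparse.hstack([noise, sparse.csr_matrix(signal.reshape(-1, 1))], format="csr")

# Keep the 5 features with the highest ANOVA F-scores.
selector = SelectKBest(f_classif, k=5)
X_small = selector.fit_transform(X_toy, y_toy)
kept = selector.get_support(indices=True)

print(X_small.shape)
print(kept)  # column indices that survived the selection
```

The fitted `selector` can then be applied to unseen data with `selector.transform(...)`, so that the test set keeps exactly the same columns as the training set.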


In [None]:
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Both tests work on the feature correlation matrix and run out of memory on this data.
chi_square_value, p_value = calculate_bartlett_sphericity(data)
print(chi_square_value, p_value)

kmo_all, kmo_model = calculate_kmo(data)
print(kmo_model)


In [None]:
## turn "-1" -> 0, "1" -> 1
positive_target = [(int(x)+1)//2 for x in target]

In [None]:
from sklearn.datasets import dump_svmlight_file
from sklearn.feature_selection import SelectKBest, f_classif

X, y = data, positive_target

# Keep the 10,000 features with the highest ANOVA F-scores.
data = SelectKBest(f_classif, k=10000).fit_transform(X, y)

# zero_based is keyword-only in recent scikit-learn versions.
dump_svmlight_file(data, y, "labeled_features.txt", zero_based=False)



In [None]:
data

<1916904x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 198620398 stored elements in Compressed Sparse Row format>

In [None]:
## Train data + Valid data

from time import time

from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X, y = data, target

# Hold out the last 100,000 rows for validation.
train_X = X[0:1816904]
train_y = y[0:1816904]
test_X = X[1816904:]
test_y = y[1816904:]
print(train_X.shape)
print(test_X.shape)


def pipeline(clf):
    """Fit clf, time training and prediction, and report held-out accuracy."""
    print('_' * 80)
    print("Training: ")
    print(clf)
    start_time = time()
    clf.fit(train_X, train_y)
    train_time = time() - start_time
    print("train time: %0.3fs" % train_time)
    start_time = time()
    pred = clf.predict(test_X)
    test_time = time() - start_time
    print("test time:  %0.3fs" % test_time)
    score = metrics.accuracy_score(test_y, pred)
    print("accuracy:   %0.3f" % score)
    clf_descr = str(clf).split('(')[0]
    return clf_descr, score, train_time, test_time

clf = RandomForestClassifier(n_estimators=100)

pipeline(clf)

# Report on all rows (training rows included); y_true comes first, then predictions.
print(classification_report(target, clf.predict(data), digits=6))

(1816904, 10000)
(100000, 10000)
________________________________________________________________________________
Training: 
RandomForestClassifier()
train time: 21078.669s
test time:  6.635s
accuracy:   0.985
              precision    recall  f1-score   support

        -1.0   0.999582  0.998898  0.999240   1284068
         1.0   0.997767  0.999151  0.998459    632836

    accuracy                       0.998982   1916904
   macro avg   0.998674  0.999025  0.998849   1916904
weighted avg   0.998983  0.998982  0.998982   1916904



[Train data + test data]

RandomForestClassifier copes well with high-dimensional data, so it is chosen as the main model.

NOTE: training it takes a long time (over 20,000 seconds per run here), and other classifiers are much faster, but since the F1 score is what matters, RandomForestClassifier is chosen.
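
One option that may shorten training is to build the trees in parallel through the `n_jobs` parameter of `RandomForestClassifier`. The sketch below only demonstrates the setting on made-up toy data. It was not timed on the URL data set, and the actual speed-up depends on the number of available cores.

```python
from time import time

from scipy import sparse
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

# Made-up sparse data: the label depends on whether feature 0 is present.
X_toy = sparse.random(2000, 200, density=0.05, format="csr", random_state=rng)
y_toy = (X_toy[:, 0].toarray().ravel() > 0).astype(int)

# n_jobs=-1 asks scikit-learn to use all available CPU cores.
clf_toy = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)

start_time = time()
clf_toy.fit(X_toy, y_toy)
print("train time: %0.3fs" % (time() - start_time))
print("trees built: %d" % len(clf_toy.estimators_))
```

The trees of a random forest are trained independently of each other, which is why this kind of parallelism does not change the resulting model for a fixed `random_state`.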

In [None]:
data = None
n_features = 3231961

data_train, target_train = load_svmlight_file("./url_svmlight/train.svm",n_features=n_features)
data_test, target_test = load_svmlight_file("./url_svmlight/test.svm",n_features=n_features)
positive_target_train = [(int(x)+1)//2 for x in target_train]
positive_target_test = [(int(x)+1)//2 for x in target_test]

In [None]:
from sklearn.datasets import dump_svmlight_file
from sklearn.feature_selection import SelectKBest, f_classif

X_train, y_train = data_train, positive_target_train
X_test, y_test = data_test, positive_target_test

skbest = SelectKBest(f_classif, k=10000)

# Fit the selector on the training set only, then apply it to the test set.
print("Start train transform")
best_data_train = skbest.fit_transform(X_train, y_train)
print("Start test transform")
best_data_test = skbest.transform(X_test)

# zero_based is keyword-only in recent scikit-learn versions.
dump_svmlight_file(best_data_train, y_train, "labeled_best_data_train.txt", zero_based=False)
dump_svmlight_file(best_data_test, y_test, "labeled_best_data_test.txt", zero_based=False)

Start train transform
Start test transform




In [None]:
best_data_train

<1916904x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 198620398 stored elements in Compressed Sparse Row format>

In [None]:
from time import time

from sklearn import metrics
from sklearn.datasets import load_svmlight_file
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report


def get_data(name):
    # Fix the width and index base so the train and test matrices always line up.
    X, y = load_svmlight_file(name, n_features=10000, zero_based=False)
    return X, y

best_data_train, y_train = get_data("labeled_best_data_train.txt")
best_data_test, y_test = get_data("labeled_best_data_test.txt")


def pipeline(clf):
    """Fit clf, time training and prediction, and report test accuracy."""
    print('_' * 80)
    print("Training: ")
    print(clf)
    start_time = time()
    clf.fit(best_data_train, y_train)
    train_time = time() - start_time
    print("train time: %0.3fs" % train_time)
    start_time = time()
    pred = clf.predict(best_data_test)
    test_time = time() - start_time
    print("test time:  %0.3fs" % test_time)
    score = metrics.accuracy_score(y_test, pred)
    print("accuracy:   %0.3f" % score)
    clf_descr = str(clf).split('(')[0]
    return clf_descr, score, train_time, test_time


clf = RandomForestClassifier(n_estimators=100)

pipeline(clf)

# y_true comes first, then predictions.
print(classification_report(y_test, clf.predict(best_data_test), digits=6))

________________________________________________________________________________
Training: 
RandomForestClassifier()
train time: 20522.377s
test time:  26.493s
accuracy:   0.993
              precision    recall  f1-score   support

         0.0   0.994208  0.995154  0.994681    320490
         1.0   0.990198  0.988295  0.989245    158736

    accuracy                       0.992882    479226
   macro avg   0.992203  0.991725  0.991963    479226
weighted avg   0.992880  0.992882  0.992881    479226



In [None]:
# from scipy.sparse.linalg import svds
# U, s, Vt = svds(data)  # svds returns (U, s, Vt), with k=6 singular values by default

In [None]:
# s

array([ 1222.53889114,  1270.00412156,  1369.9157008 ,  2054.26508337,
        2702.16794462, 11461.08949531])

#### 3. Create a model

In [None]:
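from sklearn.linear_model import PassiveAggressiveClassifier

# Train on the full feature set: `data` was reset to None above, so reload the training file.
data, target = load_svmlight_file("./url_svmlight/train.svm", n_features=n_features)
pac = PassiveAggressiveClassifier()
pac.fit(data, target)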



PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
              fit_intercept=True, loss='hinge', max_iter=None, n_iter=None,
              n_jobs=1, random_state=None, shuffle=True, tol=None,
              verbose=0, warm_start=False)

#### 4. Get the quality
- precision
- recall
- f1-score
- support 
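
As a reminder of what these numbers mean, the sketch below computes them by hand for the malicious class on a handful of made-up labels and compares the result with scikit-learn. The label vectors are invented for the example.

```python
from sklearn.metrics import precision_recall_fscore_support

# Made-up labels: +1 = malicious, -1 = benign.
y_true = [1, 1, 1, 1, -1, -1, -1, -1]
y_pred = [1, 1, -1, -1, 1, -1, -1, -1]

# Count outcomes for the positive (malicious) class by hand.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))   # true positives
fp = sum(t == -1 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == -1 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)  # share of flagged URLs that are really malicious
recall = tp / (tp + fn)     # share of malicious URLs that were flagged
f1 = 2 * precision * recall / (precision + recall)
support = sum(t == 1 for t in y_true)  # number of true examples of the class

# scikit-learn expects the true labels first, then the predictions.
p, r, f, s = precision_recall_fscore_support(y_true, y_pred, labels=[1])

print(precision, recall, f1, support)
print(p[0], r[0], f[0], s[0])
```

Swapping the two arguments exchanges precision and recall and makes `support` count predicted rather than true labels, while accuracy and the per-class F1 score stay the same.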

In [None]:
from sklearn.metrics import classification_report

data, target = load_svmlight_file("./url_svmlight/test.svm", n_features=n_features)
# y_true comes first, then predictions.
print(classification_report(target, pac.predict(data), digits=6))

             precision    recall  f1-score   support

       -1.0   0.993186  0.990672  0.991927    321609
        1.0   0.981064  0.986131  0.983591    157617

avg / total   0.989199  0.989178  0.989185    479226

