# Mini-Lab -- Pickle Models

This mini-lab is a follow-up to some in-class activities we've done.
It intentionally doesn't remind you explicitly how to do stuff, because
part of the activity is to learn how to do these things on your own!


## Hints

Read this: https://scikit-learn.org/stable/modules/model_persistence.html


## Deliverable

Complete the following functions, fulfilling the requirements listed in their docstrings:
* `serialize_model`
* `load_model` 
* `load_prediction_data`

Submit a completed version of this notebook, in `.ipynb` format.

In [6]:
# Do imports...
from pathlib import Path

import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

import numpy as np

import pickle

np.random.seed(0)

In [7]:
def load_phish_storm():
    '''
    This function fetches the PhishStorm dataset, 
    located at https://storage.googleapis.com/security-analytics-datasets-public/PhishStorm-phishing-urls.zip
    
    Use the skills you learned in "downloading datasets" mini-lab.
    '''
    
    if Path('./PhishStorm-phishing-urls.zip').is_file():
        print(f'PhishStorm-phishing-urls.zip already exists in local')
    else:
        !wget https://storage.googleapis.com/security-analytics-datasets-public/PhishStorm-phishing-urls.zip
    
    if Path('./PhishStorm-phishing-urls.csv').is_file():
        print(f'PhishStorm-phishing-urls.csv already exists in local')
    else:
        !unzip PhishStorm-phishing-urls.zip

    df = pd.read_csv("PhishStorm-phishing-urls.csv", encoding_errors='ignore', on_bad_lines='skip', low_memory=False)
    return df

df = load_phish_storm()

PhishStorm-phishing-urls.zip already exists in local
PhishStorm-phishing-urls.csv already exists in local


## Peek at the dataframe...

This is provided to show you what the data should look like once you have loaded it.

In [8]:
# https://stackoverflow.com/questions/48997644/how-to-describe-columns-as-categorical-values
df.astype('object').describe().transpose()

Unnamed: 0,count,unique,top,freq
domain,96002.0,96000.0,'www.allegropl.xaa.pl/enter_login.html?session...,2.0
ranking,95951.0,7016.0,10000000,56065.0
mld_res,95935.0,19.0,0.0,52217.0
mld.ps_res,95924.0,8.0,0.0,76496.0
card_rem,95923.0,53.0,2.0,17179.0
ratio_Rrem,95923.0,10042.0,0.0,5632.0
ratio_Arem,95923.0,10231.0,0.0,5686.0
jaccard_RR,95922.0,5446.0,0.0,76866.0
jaccard_RA,95921.0,5628.0,0.0,75286.0
jaccard_AR,95920.0,5071.0,0.0,77689.0


## Fit a model and save it to disk

In this section, we first simulate the machine learning process of
data preparation, model fitting, and model evaluation. 

At the end of this section, we serialize (save) our fitted model to disk using 
python's `pickle` library.

### Clean

In [9]:
df = df.dropna()

### Feature Engineer

We create a feature calculated from the length of the url.

In [10]:
df['domain_len'] = df['domain'].apply(lambda x: len(x))

### Partition the data

We select our new column to be the sole predictor, and we calculate a test-train split.
We fit a Logistic Regression model using model defaults.

In [11]:
X = df[['domain_len']]
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

### Save some data to disk, to use later in the deployment phase

We now save some testing data to disk in JSON format.
Later, we will load it back, and ask our unserialized model
to get predictions for it. This will simulate our serialized model
having been deployed somewhere, and having been asked to make
predictions against data sent in json format.

In [12]:
X.sample(n=5).to_json('test.json')

### Fit a model

In [13]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

LogisticRegression()

### Evaluate the model

In [14]:
y_pred = clf.predict(X_test)
accuracy_score(y_pred, y_test)

0.752801960068811

Not a bad performance considering we only used one predictor...

### Save the model to disk

In this step, we save our fitted model to disk via serializing it.

In [16]:
def serialize_model(clf):
    '''
    Serialize your fitted sklearn classifier (`clf`) to disk.
    
    hint: https://docs.python.org/3/library/pickle.html#examples
    '''
    with open('df.pickle', 'wb') as f:
    # Pickle the 'data' dictionary using the highest protocol available.
        pickle.dump(df, f, pickle.HIGHEST_PROTOCOL)
    # ___
        
serialize_model(clf)

## Use the deployed the model

Imagine that the code in the following section is running in a separate context 
from all code that ran before it. 

**Important!** When I run your code, I should be able
to successfully run each of the cells below *even if none of the previous cells have been run within the current kernel*,
(assuming the pickled model file and json data exist on disk).

You can test your code the way I will by doing the following:

1. first running all cells so that the files are written to disk
1. then running "Restart Kernel" from the jupyter "Kernel" menu,
1. and then by manually running the cell below.

In [11]:
import pickle
import pandas as pd

def load_model():
    '''
    load your fitted sklearn classifier from disk.
    returns: sklearn classifier
    
    hint: https://docs.python.org/3/library/pickle.html#examples
    '''
    with open('df.pickle', 'rb') as f:
    # The protocol version used is detected automatically, so we do not
    # have to specify it.
        data = pickle.load(f)
    # ___
    
    return clf_loaded

clf_loaded = load_model()


def load_prediction_data():
    ''' 
    load your json data into a pandas data frame
    
    Earlier, we saved it to disk as filename `test.json`
    
    hint: https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
    returns: pandas df
    '''
    test_data_sample = df.read_json("test.json")
    # ___
    
    return test_data_sample

# Load the prediction data
test_data_sample = load_prediction_data()

# Get predictions for the uploaded data
clf_loaded.predict(test_data_sample)

array([1., 1., 0., 0., 0.])

In [11]:
import pickle
import pandas as pd


def load_model():
    '''
    load your fitted sklearn classifier from disk.
    returns: sklearn classifier
    
    hint: https://docs.python.org/3/library/pickle.html#examples
    '''
    with open('df.pickle', 'rb') as f:
        clf_loaded = pickle.load(f)
    
    return clf_loaded

clf_loaded = load_model()


def load_prediction_data():
    ''' 
    load your json data into a pandas data frame
    
    Earlier, we saved it to disk as filename `test.json`
    
    hint: https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
    returns: pandas df
    '''
    test_data_sample = pd.read_json("test.json")
    # ___
    
    return test_data_sample

# Load the prediction data
test_data_sample = load_prediction_data()

# Get predictions for the uploaded data
clf_loaded.predict(test_data_sample)

AttributeError: 'DataFrame' object has no attribute 'predict'

In [13]:
test_data_sample = pd.read_json("test.json")
test_data_sample

Unnamed: 0,domain_len
34829,310
38061,89
55906,28
72947,37
76512,27


In [12]:
def load_model():
    '''
    load your fitted sklearn classifier from disk.
    returns: sklearn classifier
    
    hint: https://docs.python.org/3/library/pickle.html#examples
    '''
    with open('df.pickle', 'rb') as f:
        clf_loaded = pickle.load(f)
    
    return clf_loaded

clf_loaded = load_model()
clf_loaded

Unnamed: 0,domain,ranking,mld_res,mld.ps_res,card_rem,ratio_Rrem,ratio_Arem,jaccard_RR,jaccard_RA,jaccard_AR,jaccard_AA,jaccard_ARrd,jaccard_ARrem,label,domain_len
0,nobell.it/70ffb52d079109dca5664cce6f317373782/...,10000000,1.0,0.0,18.0,107.611111,107.277778,0.000000,0.000000,0.000000,0.000000,0.8,0.795729,1.0,225
1,www.dghjdgf.com/paypal.co.uk/cycgi-bin/webscrc...,10000000,0.0,0.0,11.0,150.636364,152.272727,0.000000,0.000000,0.000000,0.000000,0,0.768577,1.0,81
2,serviciosbys.com/paypal.cgi.bin.get-into.herf....,10000000,0.0,0.0,14.0,73.500000,72.642857,0.000000,0.000000,0.000000,0.000000,0,0.726582,1.0,177
3,mail.printakid.com/www.online.americanexpress....,10000000,0.0,0.0,6.0,562.000000,590.666667,0.000000,0.000000,0.000000,0.000000,0,0.85964,1.0,60
4,thewhiskeydregs.com/wp-content/themes/widescre...,10000000,0.0,0.0,8.0,29.000000,24.125000,0.000000,0.000000,0.000000,0.000000,0,0.748971,1.0,116
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95998,xbox360.ign.com/objects/850/850402.html,339,1.0,1.0,2.0,142.500000,141.000000,0.009009,0.009091,0.006536,0.006601,0.45098,0.846906,0.0,39
95999,games.teamxbox.com/xbox-360/1860/Dead-Space/,63029,1.0,0.0,3.0,114.000000,128.333333,0.002899,0.002577,0.002907,0.002584,0.75,0.714623,0.0,44
96000,www.gamespot.com/xbox360/action/deadspace/,753,1.0,1.0,3.0,91.000000,101.333333,0.000000,0.003106,0.000000,0.000000,0.111111,0.648571,0.0,42
96001,en.wikipedia.org/wiki/Dead_Space_(video_game),6,1.0,1.0,4.0,682.000000,744.250000,0.033075,0.029412,0.030250,0.029145,0.809735,0.840323,0.0,45
