# example data science pipeline: `sklearn` only

## imports and constants

In [1]:
# for these "import ... as ..", the alias terms (phrases after "as")
# are simply conventions. You will usually see stack overflow code
# referencing these aliases
import numpy as np
import pandas as pd
import plotly.graph_objs as go
import plotly.offline
import sklearn
import sklearn.ensemble
import sklearn.externals.joblib
import sklearn.feature_selection
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing

# this command informs the plotly module that you are connected
# to the internet but wish to run in "offline" mode (that is,
# graph things like a normal plotting library instead of sending
# everything off to plotly HQ)
plotly.offline.init_notebook_mode(connected=True)

# refactoring: doing *everything* in `sklearn`

all of our preprocessing steps are possible in native `sklearn`, so let's try implementing the same steps with the built-in functions and generate a longer pipeline. let's assume that we don't have to:

1. rename columns (`numpy`-first approach means the order matters, not the names (which don't exist in `numpy`))
2. drop the unnecessary columns `fnlwgt` and `education_num` (i.e. they are pre-dropped)

so our input source is a `(n, 12)` element array where there are `n` observations and the 12 features are

1. age
2. workclass
3. education
4. marital_status
5. occupation
6. relationship
7. race
8. sex
9. capital_gain
10. capital_loss
11. hours_per_week
12. native_country

the processes we implemented above were:

1. convert categorical to dummies (with null)
2. restricting to only numeric features
3. log transforming numerical features
4. standardizing numeric features
5. feature selection
6. modeling

In [2]:
# tokens for easier indexing
AGE = 0
WORKCLASS = 1
EDUCATION = 2
MARITAL_STATUS = 3
OCCUPATION = 4
RELATIONSHIP = 5
RACE = 6
SEX = 7
CAPITAL_GAIN = 8
CAPITAL_LOSS = 9
HOURS_PER_WEEK = 10
NATIVE_COUNTRY = 11
TARGET = 12

In [3]:
columns = [
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'target'
]

df = pd.read_csv(
    'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
    names=columns,
    delimiter=', ',
    index_col=False,
    engine='python'
)

df = df.drop(['fnlwgt', 'education-num'], axis=1)

In [4]:
df.head()

|   | age | workclass | education | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | HS-grad | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 11th | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [5]:
x = df.values[:, :-1]
y = df.values[:, -1]

In [6]:
x[:, AGE]

array([39, 50, 38, ..., 58, 22, 52], dtype=object)

In [7]:
x

array([[39, 'State-gov', 'Bachelors', ..., 0, 40, 'United-States'],
       [50, 'Self-emp-not-inc', 'Bachelors', ..., 0, 13, 'United-States'],
       [38, 'Private', 'HS-grad', ..., 0, 40, 'United-States'],
       ..., 
       [58, 'Private', 'HS-grad', ..., 0, 40, 'United-States'],
       [22, 'Private', 'HS-grad', ..., 0, 20, 'United-States'],
       [52, 'Self-emp-inc', 'HS-grad', ..., 0, 40, 'United-States']], dtype=object)

In [8]:
x.shape

(32561, 12)

In [9]:
y

array(['<=50K', '<=50K', '<=50K', ..., '<=50K', '<=50K', '>50K'], dtype=object)

In [10]:
y.shape

(32561,)

## log transforming numerical features

we have two monetary columns we would like to transform. the `np.log1p` function could do it, but we run into the same problem as before: a `FunctionTransformer` applies its function to the whole input array, so on its own it is not suited to transforming just two columns of a mixed-type array. we will need to do something a little more complicated

In [11]:
moneycols = [
    CAPITAL_GAIN,
    CAPITAL_LOSS
]

we were able to pretty easily implement this transformer as a custom transformer with the following code:

```python
import numpy as np
import sklearn.base


class MonetaryLog1P(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    def __init__(self, colinds):
        """take the column indices `colinds` of monetary data"""
        self._colinds = colinds

    def fit(self, x, y=None):
        # nothing to learn; accept `y` so this also works inside a `Pipeline`
        return self

    def transform(self, x):
        # work on a copy so the caller's array is left untouched
        xc = x.copy()
        xc[:, self._colinds] = np.log1p(xc[:, self._colinds].astype('float'))
        return xc

    def fit_transform(self, x, y=None):
        return self.transform(x)
```

we saved the above contents to a file `utils.py` and we can do a direct import here:

In [12]:
from utils import MonetaryLog1P

In [13]:
x

array([[39, 'State-gov', 'Bachelors', ..., 0, 40, 'United-States'],
       [50, 'Self-emp-not-inc', 'Bachelors', ..., 0, 13, 'United-States'],
       [38, 'Private', 'HS-grad', ..., 0, 40, 'United-States'],
       ..., 
       [58, 'Private', 'HS-grad', ..., 0, 40, 'United-States'],
       [22, 'Private', 'HS-grad', ..., 0, 20, 'United-States'],
       [52, 'Self-emp-inc', 'HS-grad', ..., 0, 40, 'United-States']], dtype=object)

In [14]:
ml1p = MonetaryLog1P(moneycols)

x2 = ml1p.fit_transform(x)
x2

array([[39, 'State-gov', 'Bachelors', ..., 0.0, 40, 'United-States'],
       [50, 'Self-emp-not-inc', 'Bachelors', ..., 0.0, 13, 'United-States'],
       [38, 'Private', 'HS-grad', ..., 0.0, 40, 'United-States'],
       ..., 
       [58, 'Private', 'HS-grad', ..., 0.0, 40, 'United-States'],
       [22, 'Private', 'HS-grad', ..., 0.0, 20, 'United-States'],
       [52, 'Self-emp-inc', 'HS-grad', ..., 0.0, 40, 'United-States']], dtype=object)
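
the rows displayed above all happen to show `0.0`, so as a quick sanity check of what `log1p` does to individual values, here is a minimal sketch using only the standard library. the two non-zero inputs are `capital_gain` values that appear elsewhere in this notebook (2174 in the first row of the data, 15024 in the sample `lambda` event); using `math.log1p` as a stand-in for `np.log1p` is an assumption made for convenience:

```python
import math

# log1p(v) = log(1 + v): zero stays zero, and large dollar amounts
# are compressed onto a much smaller scale
capital_gains = [0, 2174, 15024]
logged = [math.log1p(v) for v in capital_gains]

for (raw, val) in zip(capital_gains, logged):
    print('{:>6} -> {:.6f}'.format(raw, val))
```

the `+ 1` is what makes this safe for the many records with a capital gain of exactly zero, where a plain `log` would blow up.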

## convert categorical to dummies (with null)

### strings to categories

it's straightforward to do this in `pandas`, but my goal is to do it in `scikit-learn` instead. this is not just to limit my dependencies (though that is good for a planned `lambda` function implementation in which I want a small zip archive), but also so that we have everything built into a `scikit-learn pipeline`.

there is a particular class (`sklearn.preprocessing.LabelEncoder`) which could help -- it will convert a *single* iterable collection (e.g. one column or a list) into numeric categories, but there is no function for doing this for several columns at once. the question then is how we compose this function for each of the categorical indices. I see some options:

1. collect them all together into one transformer using `sklearn.pipeline.FeatureUnion`
2. use `sklearn.preprocessing.FunctionTransformer` to create our own manual implementation using `LabelEncoder`
3. create a custom class leveraging the single-feature `LabelEncoder` to support multiple indices in an array

in the end, I need this composite transformer object to take an array with more than just categorical data and convert it into one where categorical columns have been fixed in place and non-categorical columns have been left alone. all of this makes me inclined to use the custom transformer (option 2) or the custom class (option 3). the class feels easier to implement and allows me to keep some of the label-making information. the pickling might be difficult though... in any case, let's give it a try!

first, an example of what a single categorical change looks like:

In [15]:
labelenc = sklearn.preprocessing.LabelEncoder()
x3 = labelenc.fit_transform(x2[:, WORKCLASS])
x3

array([7, 6, 4, ..., 4, 4, 5])

In [16]:
labelenc.classes_

array(['?', 'Federal-gov', 'Local-gov', 'Never-worked', 'Private',
       'Self-emp-inc', 'Self-emp-not-inc', 'State-gov', 'Without-pay'], dtype=object)

the classes info seems particularly important to keep. the format is arbitrary, so for now I will just make my `classes_` element a dictionary keyed on index.
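
conceptually, the encoding above amounts to sorting the unique values and replacing each value with its position in that sorted list. a minimal pure-python sketch of that idea follows. it is not the actual `LabelEncoder` implementation, and `fit_labels` and `encode` are hypothetical helper names:

```python
def fit_labels(values):
    """return the sorted unique values, playing the role of `classes_`"""
    return sorted(set(values))


def encode(values, classes):
    """replace each value with its index in `classes`"""
    lookup = {c: i for (i, c) in enumerate(classes)}
    return [lookup[v] for v in values]


workclasses = [
    'Private', 'State-gov', '?', 'Self-emp-inc', 'Federal-gov',
    'Local-gov', 'Without-pay', 'Never-worked', 'Self-emp-not-inc',
]
classes = fit_labels(workclasses)

# the workclass values of the first three records in the data
codes = encode(['State-gov', 'Self-emp-not-inc', 'Private'], classes)
print(classes)
print(codes)
```

a value that was never seen while fitting raises a `KeyError` in this sketch; `LabelEncoder.transform` similarly refuses labels it did not see during `fit`.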

I'll follow the `LabelEncoder` implementation [here](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/label.py#L39) for my inspiration.

as with `MonetaryLog1P` above, we can create a custom transformer class and save it to `utils.py`. the body is:

```python
import sklearn.base
import sklearn.preprocessing


class MultiColumnLabelEncoder(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    def __init__(self, colinds):
        """take the column indices `colinds` of expected category features"""
        self._colinds = colinds
        # one independent `LabelEncoder` per categorical column
        self._les = {i: sklearn.preprocessing.LabelEncoder() for i in self._colinds}

    def fit(self, x, y=None):
        for (i, lenc) in self._les.items():
            lenc.fit(x[:, i])
        return self

    def transform(self, x):
        xc = x.copy()
        for (i, lenc) in self._les.items():
            xc[:, i] = lenc.transform(xc[:, i])
        return xc

    def fit_transform(self, x, y=None):
        xc = x.copy()
        for (i, lenc) in self._les.items():
            xc[:, i] = lenc.fit_transform(xc[:, i])
        return xc

    def inverse_transform(self, x):
        # map the integer codes back to the original string labels
        xc = x.copy()
        for (i, lenc) in self._les.items():
            xc[:, i] = lenc.inverse_transform(xc[:, i].astype(int))
        return xc

    @property
    def classes_(self):
        return {i: lenc.classes_ for (i, lenc) in self._les.items()}
```

In [17]:
from utils import MultiColumnLabelEncoder

In [18]:
categoryindices = [
    WORKCLASS,
    EDUCATION,
    MARITAL_STATUS,
    OCCUPATION,
    RELATIONSHIP,
    RACE,
    SEX,
    NATIVE_COUNTRY,
]

mclenc = MultiColumnLabelEncoder(categoryindices)
x3 = mclenc.fit_transform(x2)
x3

array([[39, 7, 9, ..., 0.0, 40, 39],
       [50, 6, 9, ..., 0.0, 13, 39],
       [38, 4, 11, ..., 0.0, 40, 39],
       ..., 
       [58, 4, 11, ..., 0.0, 40, 39],
       [22, 4, 11, ..., 0.0, 20, 39],
       [52, 5, 11, ..., 0.0, 40, 39]], dtype=object)

In [19]:
print('input shape was {}'.format(x.shape))
print('output shape was {}'.format(x3.shape))

input shape was (32561, 12)
output shape was (32561, 12)


In [20]:
mclenc.classes_

{1: array(['?', 'Federal-gov', 'Local-gov', 'Never-worked', 'Private',
        'Self-emp-inc', 'Self-emp-not-inc', 'State-gov', 'Without-pay'], dtype=object),
 2: array(['10th', '11th', '12th', '1st-4th', '5th-6th', '7th-8th', '9th',
        'Assoc-acdm', 'Assoc-voc', 'Bachelors', 'Doctorate', 'HS-grad',
        'Masters', 'Preschool', 'Prof-school', 'Some-college'], dtype=object),
 3: array(['Divorced', 'Married-AF-spouse', 'Married-civ-spouse',
        'Married-spouse-absent', 'Never-married', 'Separated', 'Widowed'], dtype=object),
 4: array(['?', 'Adm-clerical', 'Armed-Forces', 'Craft-repair',
        'Exec-managerial', 'Farming-fishing', 'Handlers-cleaners',
        'Machine-op-inspct', 'Other-service', 'Priv-house-serv',
        'Prof-specialty', 'Protective-serv', 'Sales', 'Tech-support',
        'Transport-moving'], dtype=object),
 5: array(['Husband', 'Not-in-family', 'Other-relative', 'Own-child',
        'Unmarried', 'Wife'], dtype=object),
 6: array(['Amer-Indian-Eskimo', '

I can live with that.

so now that we have that figured out, let's also convert the categorical items to dummy items using the `sklearn.preprocessing.OneHotEncoder` transformer which (thankfully) supports column indices natively:

In [21]:
x3[:, :10]

array([[39, 7, 9, ..., 1, 7.684783943522785, 0.0],
       [50, 6, 9, ..., 1, 0.0, 0.0],
       [38, 4, 11, ..., 1, 0.0, 0.0],
       ..., 
       [58, 4, 11, ..., 0, 0.0, 0.0],
       [22, 4, 11, ..., 1, 0.0, 0.0],
       [52, 5, 11, ..., 0, 9.617470759403409, 0.0]], dtype=object)

In [22]:
x3.shape

(32561, 12)

In [23]:
ohenc = sklearn.preprocessing.OneHotEncoder(
    n_values='auto',
    categorical_features=categoryindices,
    sparse=False
)

x4 = ohenc.fit_transform(x3)
x4

array([[  0.        ,   0.        ,   0.        , ...,   7.68478394,
          0.        ,  40.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
          0.        ,  13.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
          0.        ,  40.        ],
       ..., 
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
          0.        ,  40.        ],
       [  0.        ,   0.        ,   0.        , ...,   0.        ,
          0.        ,  20.        ],
       [  0.        ,   0.        ,   0.        , ...,   9.61747076,
          0.        ,  40.        ]])

the number of rows is the same, and the number of columns is what we expect here (we've dropped the categorical inputs and replaced them with dummy columns):

In [24]:
numDummyCols = sum(len(v) for (k, v) in mclenc.classes_.items())
numColsNotCategory = x2.shape[1] - len(categoryindices)
numColsNotCategory + numDummyCols

106
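
the same arithmetic can be spelled out by hand. the first five class counts can be read off the `classes_` output above; the counts for race, sex, and native_country are cut off in that output, so the values used here (5, 2, and 42) are an assumption based on the standard adult dataset rather than something shown in this notebook:

```python
# number of distinct classes per categorical column
n_classes = {
    'workclass': 9,
    'education': 16,
    'marital_status': 7,
    'occupation': 15,
    'relationship': 6,
    # assumed: not visible in the truncated `classes_` output
    'race': 5,
    'sex': 2,
    'native_country': 42,
}

n_dummy_cols = sum(n_classes.values())
# age, capital_gain, capital_loss, hours_per_week are left alone
n_cols_not_category = 12 - len(n_classes)
total = n_dummy_cols + n_cols_not_category
print(n_dummy_cols, n_cols_not_category, total)
```

under those assumptions the total agrees with the 106 columns computed above.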

it's worth noting: the dummy columns are *prepended*, so the non-dummy columns (there are 4) end up all the way on the right.

**we can no longer use our indices!**

in order to make this obvious, I will unassign them all here so we don't mess up

In [25]:
# remove the index tokens so stale indices can't be used by accident
del AGE
del WORKCLASS
del EDUCATION
del MARITAL_STATUS
del OCCUPATION
del RELATIONSHIP
del RACE
del SEX
del CAPITAL_GAIN
del CAPITAL_LOSS
del HOURS_PER_WEEK
del NATIVE_COUNTRY
del TARGET
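
to see why the old indices go stale, here is a toy `numpy`-only sketch of the layout described above: dummy columns first, untouched columns after. this is a simplified stand-in written for illustration, not the `OneHotEncoder` implementation, and `one_hot_prepend` is a hypothetical name:

```python
import numpy as np


def one_hot_prepend(x, cat_cols):
    """one-hot encode the integer-coded columns `cat_cols`, placing all
    dummy columns first and the untouched columns after them"""
    x = np.asarray(x, dtype=float)
    blocks = []
    for c in cat_cols:
        codes = x[:, c].astype(int)
        # row k of the identity matrix is the dummy vector for code k
        blocks.append(np.eye(codes.max() + 1)[codes])
    rest = [c for c in range(x.shape[1]) if c not in cat_cols]
    return np.hstack(blocks + [x[:, rest]])


# columns: age, workclass code, log capital gain
toy = [
    [39, 2, 7.68],
    [50, 0, 0.0],
    [38, 1, 0.0],
]
out = one_hot_prepend(toy, cat_cols=[1])
print(out.shape)
print(out[0])
```

age started out at index 0 in `toy` but sits at index 3 in `out`, behind the three dummy columns, which is exactly the kind of shift that makes the old tokens dangerous.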

## restricting to only numeric features

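as it turns out, every value already *is* numeric; we've taken care of that. note that the label-encoded `x3` still carries the generic `object` dtype, while the one-hot encoded `x4` printed above is a plain float array: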

In [26]:
x3.dtype

dtype('O')

## standardizing numeric features

we're going to be using random forest models exclusively so I will actually not standardize -- if I wanted to, though, that is easy:

```python
scaler = sklearn.preprocessing.StandardScaler()
xscaled = scaler.fit_transform(x)
```
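
for reference, standardizing a column just means subtracting its mean and dividing by its standard deviation. a small standard-library sketch on a handful of made-up `hours_per_week`-style values (the population standard deviation is used here, which matches `StandardScaler` as far as I know):

```python
import statistics

hours = [13, 20, 40, 40]

mean = statistics.mean(hours)
std = statistics.pstdev(hours)   # population standard deviation (ddof=0)
scaled = [(h - mean) / std for h in hours]

print(mean, std)
print(scaled)
```

after scaling, the values have mean 0 and standard deviation 1. tree-based models only care about the ordering of values within a feature, which is why skipping this step is reasonable for a random forest.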

## feature selection

In [27]:
# RFE with random forests
rf = sklearn.ensemble.RandomForestClassifier(
    n_estimators=100,
    n_jobs=-1,
    random_state=1337
)
rfe = sklearn.feature_selection.RFE(
    estimator=rf
)
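
no `n_features_to_select` is passed above, and in that case `RFE` keeps half of the features, dropping the least important ones one `step` at a time and refitting the estimator in between. a rough pure-python sketch of that loop follows. it is not scikit-learn's implementation; `rfe_sketch`, `importance_fn`, and `toy_importance` are hypothetical names, and the toy importance function simply favors higher column indices:

```python
def rfe_sketch(n_features, importance_fn, n_to_keep=None, step=1):
    """repeatedly drop the `step` least important features until
    only `n_to_keep` remain (default: half of them)"""
    remaining = list(range(n_features))
    if n_to_keep is None:
        n_to_keep = n_features // 2
    while len(remaining) > n_to_keep:
        # in the real thing, this is where the estimator is refit
        scores = importance_fn(remaining)
        n_drop = min(step, len(remaining) - n_to_keep)
        order = sorted(range(len(remaining)), key=lambda j: scores[j])
        drop = set(order[:n_drop])
        remaining = [f for (j, f) in enumerate(remaining) if j not in drop]
    return remaining


def toy_importance(features):
    # pretend that a feature's importance equals its column index
    return list(features)


kept = rfe_sketch(6, toy_importance)
print(kept)
print(len(rfe_sketch(106, toy_importance)))
```

for our 106 columns that means roughly 53 rounds of fitting a 100-tree forest, which is where most of the training time goes.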

## modeling

In [28]:
mrf = sklearn.ensemble.RandomForestClassifier(
    n_estimators=100,
    n_jobs=-1,
    random_state=1337,
)

## pipelines

let's create a pipeline holding all of the above steps

In [29]:
preprocessPipeline = sklearn.pipeline.Pipeline(
    steps=[
        # a sequence of name, transformer objects
        ('money_log1p', ml1p),
        ('categorical_encoder', mclenc),
        ('dummy_var_encoder', ohenc),
    ]
)

modelingPipeline = sklearn.pipeline.Pipeline(
    steps=[
        # a sequence of name, transformer objects
        ('rfe', rfe),
        ('random_forest', mrf)
    ]
)

fitting our model then is a simple call of the pipeline's `fit` method:

In [30]:
x = preprocessPipeline.fit_transform(x)

In [31]:
y = sklearn.preprocessing.LabelEncoder().fit_transform(y)

In [32]:
xtrain, xtest, ytrain, ytest = sklearn.model_selection.train_test_split(x, y, random_state=1337)

In [33]:
modelingPipeline.fit(xtrain, ytrain)

Pipeline(memory=None,
     steps=[('rfe', RFE(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
        ..._jobs=-1,
            oob_score=False, random_state=1337, verbose=0,
            warm_start=False))])

### prediction results

we can also easily make predictions with the fit pipeline object -- this will take preprocessed input data, apply feature selection, and score the records:

In [34]:
xtrain.shape

(24420, 106)

In [35]:
xtest.shape

(8141, 106)
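
those two row counts follow from the defaults of `train_test_split`: since we passed neither `test_size` nor `train_size`, a quarter of the records is held out. as I understand the behavior, the test count is rounded up and the training set gets the rest, which this small check reproduces:

```python
import math

n_samples = 32561
test_size = 0.25   # the default when neither size is given

n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```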

In [36]:
yproba = modelingPipeline.predict_proba(xtest)
yproba

array([[ 0.39,  0.61],
       [ 0.97,  0.03],
       [ 0.94,  0.06],
       ..., 
       [ 0.81,  0.19],
       [ 0.02,  0.98],
       [ 1.  ,  0.  ]])

the two columns in `yproba` are the predicted probability of the classes 0 and 1 for the target:

In [37]:
modelingPipeline.classes_

array([0, 1])

thus the probability of having a target value of 1 (equivalently: having a salary over $50K) is the second column

let's combine the predicted probabilities for our test set with the known labeled ground truth:

In [38]:
thresh = 0.5

dfpred = pd.DataFrame({
    'y_actual': ytest,
    'y_pred_prob': yproba[:, 1],
    'y_predicted': (yproba[:, 1] >= thresh).astype(int),
})
dfpred.head()

|   | y_actual | y_pred_prob | y_predicted |
|---|---|---|---|
| 0 | 0 | 0.61 | 1 |
| 1 | 0 | 0.03 | 0 |
| 2 | 0 | 0.06 | 0 |
| 3 | 1 | 0.1 | 0 |
| 4 | 0 | 0.02 | 0 |


In [39]:
tallcm = dfpred.groupby(['y_actual', 'y_predicted']).count()
tallcm.columns = ['count']
tallcm.unstack()

| y_actual | count (y_predicted = 0) | count (y_predicted = 1) |
|---|---|---|
| 0 | 5645 | 485 |
| 1 | 755 | 1256 |
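
the usual summary metrics fall straight out of those four counts. a minimal sketch, with the numbers copied from the output above:

```python
# counts read off the confusion matrix above
tn, fp = 5645, 485    # records with y_actual == 0
fn, tp = 755, 1256    # records with y_actual == 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the predicted >50K, how many really are
recall = tp / (tp + fn)      # of the real >50K, how many we caught

print('records   = {}'.format(total))
print('accuracy  = {:.3f}'.format(accuracy))
print('precision = {:.3f}'.format(precision))
print('recall    = {:.3f}'.format(recall))
```

recall is the weakest of the three: at a threshold of 0.5 the model misses a bit more than a third of the records that really are over $50K.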


let's use plotly to plot the cumulative captured response on the held out test data. to do this, we sort the test records from most to least likely and track what fraction of the true targets we have captured so far:

In [40]:
dfpred = dfpred.sort_values(by='y_pred_prob', ascending=False)
ntargets = dfpred.y_actual.sum()
dfpred.loc[:, 'pct_captured'] = dfpred.y_actual.cumsum() / ntargets

xarr = np.array(range(dfpred.shape[0]))
yperf = np.ones(xarr.shape)
yperf[:ntargets] = np.linspace(0, 1, ntargets)
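
the `pct_captured` calculation is easier to follow on a tiny example. here is a pure-python sketch with five made-up records (the labels and probabilities are invented for illustration):

```python
from itertools import accumulate

y_actual = [1, 1, 0, 0, 0]
y_prob = [0.4, 0.9, 0.1, 0.8, 0.3]

# sort records from most to least likely, keeping only the true labels
ranked = [a for (p, a) in sorted(zip(y_prob, y_actual), reverse=True)]

# running count of true targets found, as a fraction of all targets
n_targets = sum(ranked)
pct_captured = [c / n_targets for c in accumulate(ranked)]

print(ranked)
print(pct_captured)
```

the curve reaches 1.0 once the last true target has been seen; a model that ranked perfectly would get there after exactly `n_targets` records.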

In [41]:
data = [
    # our capture rate
    go.Scatter(
        x=xarr,
        y=dfpred.pct_captured,
        mode='lines',
        line={'width': 2},
        name='our prediction'
    ),
    # random choice
    go.Scatter(
        x=xarr,
        y=xarr / xarr.max(),
        mode='lines',
        line={
            'dash': 'dash',
            'color': 'black',
            'width': 1,
        },
        name='random'
    ),
    # perfect
    go.Scatter(
        x=xarr,
        y=yperf,
        mode='lines',
        line={
            'dash': 'dot',
            'color': 'black',
            'width': 1,
        },
        name='perfect'
    )
]

In [42]:
# create a layout with axes labels and title
layout = go.Layout(
    title='cumulative captured response',
    xaxis={'title': 'number of records recommended and investigated'},
    yaxis={'title': 'fraction of all true cases obtained'}
)

In [43]:
# create a figure to join the above
fig = go.Figure(
    data=data,
    layout=layout
)

plotly.offline.iplot(fig)

that is *pretty good*.

maybe *too pretty good*...

let's save out those trained pipelines as self-contained units:

In [44]:
import pickle

with open('salary_preprocess_pipeline.pkl', 'wb') as f:
    pickle.dump(preprocessPipeline, f)
    
with open('salary_modelling_pipeline.pkl', 'wb') as f:
    pickle.dump(modelingPipeline, f)

let's also keep track of exactly how large those files are, as it matters for our eventual `lambda` function zip archive (which has a max size of 200 MB or so, and a 100-tree random forest can get as large as 100 MB on its own):

In [45]:
!ls -alh *.pkl

-rw-r--r-- 1 zlamberty zlamberty 116M Oct 29 22:16 salary_modelling_pipeline.pkl
-rw-r--r-- 1 zlamberty zlamberty 3.9K Oct 29 22:16 salary_preprocess_pipeline.pkl


# building a ~~mystery~~ `lambda` function

we can write a `handler` function which can load those files and score a single record passed in as query string parameters:

In [46]:
event = {
    "httpMethod": "GET",
    "queryStringParameters": {
        "age": 52,
        "capital_gain": 15024,
        "capital_loss": 0,
        "education": "HS-grad",
        "hours_per_week": 40,
        "marital_status": "Married-civ-spouse",
        "native_country": "United-States",
        "occupation": "Exec-managerial",
        "race": "White",
        "relationship": "Wife",
        "sex": "Female",
        "workclass": "Self-emp-inc"
    }
}

context = {}

In [47]:
import json

import numpy as np
import sklearn.externals.joblib

from utils import MonetaryLog1P, MultiColumnLabelEncoder

columns = [
    'age',
    'workclass',
    'education',
    'marital_status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital_gain',
    'capital_loss',
    'hours_per_week',
    'native_country',
]


def respond(err, res=None):
    return {
        'statusCode': '400' if err else '200',
        'body': json.dumps(str(err)) if err else json.dumps(res),
        'headers': {'Content-Type': 'application/json'},
    }

    
def build_record(params):
    try:
        return np.array([[params[k] for k in columns]])
    except KeyError as e:
        raise KeyError('missing required parameter "{}"'.format(e))


def load_pipelines():
    preprocessor = sklearn.externals.joblib.load('salary_preprocess_pipeline.pkl')
    modeller = sklearn.externals.joblib.load('salary_modelling_pipeline.pkl')
    return preprocessor, modeller


def handler(event, context):
    print("event = {}".format(event))

    reqtype = event['httpMethod']
    if reqtype == 'GET':
        try:
            record = build_record(event['queryStringParameters'])
            preprocessor, modeller = load_pipelines()
            score = modeller.predict_proba(
                preprocessor.transform(record)
            )[0]
            return respond(
                err=None,
                res={'score': dict(zip(['<=50k', '>50k'], score))}
            )
        except Exception as e:
            return respond(e)
    else:
        return respond(ValueError('Unsupported method "{}"'.format(reqtype)))

In [48]:
handler(event, context)

event = {'httpMethod': 'GET', 'queryStringParameters': {'age': 52, 'capital_gain': 15024, 'capital_loss': 0, 'education': 'HS-grad', 'hours_per_week': 40, 'marital_status': 'Married-civ-spouse', 'native_country': 'United-States', 'occupation': 'Exec-managerial', 'race': 'White', 'relationship': 'Wife', 'sex': 'Female', 'workclass': 'Self-emp-inc'}}


{'body': '{"score": {"<=50k": 0.02, ">50k": 0.98}}',
 'headers': {'Content-Type': 'application/json'},
 'statusCode': '200'}
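
the failure path is worth a quick look too: any exception is turned into a string, json-encoded into the body, and returned with a 400 status. a self-contained sketch that repeats the `respond` helper from the cell above and feeds it the error the handler would build for a non-`GET` request:

```python
import json


def respond(err, res=None):
    # same helper as in the handler code above
    return {
        'statusCode': '400' if err else '200',
        'body': json.dumps(str(err)) if err else json.dumps(res),
        'headers': {'Content-Type': 'application/json'},
    }


resp = respond(ValueError('Unsupported method "{}"'.format('POST')))
print(resp['statusCode'])
print(json.loads(resp['body']))
```

note that every failure is reported as a 400 here, including ones that are really server-side problems (e.g. a missing `.pkl` file).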

## the files we created

### `salarymodel.py`

this is just implementing the function we defined above with a bit of boilerplate and the `utils` imports spelled out. it is available for general download [here](https://s3.amazonaws.com/shared.rzl.gu511.com/salarymodel/salarymodel.py)

### `trainmodel.py`

below we will be deploying everything to a linux ami in order to create a deployment package. it's important that the models we create and serialize here are compatible with the environment that will load and use them in `salarymodel.py` above, so we collected the basics of the code above into one file `trainmodel.py`, and we need to run `trainmodel.py` on our deployment server in order to feel confident about the `lambda` function deployment package.

it is also available for download [here](https://s3.amazonaws.com/shared.rzl.gu511.com/salarymodel/trainmodel.py)


### `utils.py`

finally, the file `utils.py` contains our custom transformer classes. this file is available on `s3` [here](https://s3.amazonaws.com/shared.rzl.gu511.com/salarymodel/utils.py)


## deployment package building

### what I did

so, sequence of events:

+ create a new `aws ec2` instance (linux ami)
+ `ssh` into it
+ create a code directory: `mkdir -p ~/salarymodel`
    + in a separate session, `scp` the `*.py` files to `~/salarymodel`
    + `scp megaman_remote:~/code/gu/511/hw/salarymodel/*.py ~/temp/`
+ add aws credentials with `aws configure`
+ execute the following commands

```bash
# prereq stuff
sudo yum -y update
sudo yum -y upgrade
sudo yum -y groupinstall "Development Tools"
sudo yum -y install atlas-devel atlas-sse3-devel blas-devel gcc gcc-c++ lapack-devel
sudo yum -y install python36-devel python36-pip python36-virtualenv

# create a virtualenv
virtualenv-3.6 ~/env
source ~/env/bin/activate
pip install --use-wheel numpy
pip install --use-wheel scipy
pip install --use-wheel scikit-learn

# make sure the venv actually runs the code first
#python -c 'import salarymodel'

# in case you need to start over...
#pip uninstall -y scipy
#pip install scipy
#python -c 'import salarymodel'


# using `strip` to remove extraneous stuff from so files
find ~/env/lib64/python3.6/site-packages/numpy -name "*.so" | xargs strip
find ~/env/lib64/python3.6/site-packages/sklearn -name "*.so" | xargs strip

# for reasons I don't understand, stripping the first five packages
# will lead to failure (an ELF patching error; basically it looks like
# stripping breaks the way that low-level libraries stitch so files
# back together). dunno?!
# but we can still strip a lot out in the following pieces
strip ~/env/lib64/python3.6/site-packages/scipy/special/cython_special.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/special/specfun.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/special/_test_round.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/special/_ellip_harm_2.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/special/_test_round.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/special/_comb.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/sparse/_csparsetools.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/sparse/_sparsetools.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/sparse/linalg/dsolve/_superlu.cpython-36m-x86_64-linux-gnu.so
find ~/env/lib64/python3.6/site-packages/scipy/sparse/csgraph -name "*.so" | xargs strip
strip ~/env/lib64/python3.6/site-packages/scipy/linalg/_solve_toeplitz.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/linalg/_flinalg.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/linalg/_interpolative.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/linalg/_decomp_update.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/optimize/minpack2.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/optimize/_cobyla.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/optimize/_trlib/_trlib.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/optimize/_lsq/givens_elimination.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/optimize/_minpack.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/optimize/_nnls.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/optimize/_group_columns.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/optimize/_slsqp.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/optimize/_zeros.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/optimize/moduleTNC.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/integrate/_test_odeint_banded.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/integrate/_test_multivariate.cpython-36m-x86_64-linux-gnu.so
strip ~/env/lib64/python3.6/site-packages/scipy/integrate/_dop.cpython-36m-x86_64-linux-gnu.so
find ~/env/lib64/python3.6/site-packages/scipy/fftpack -name "*.so" | xargs strip
find ~/env/lib64/python3.6/site-packages/scipy/io -name "*.so" | xargs strip
find ~/env/lib64/python3.6/site-packages/scipy/odr -name "*.so" | xargs strip
find ~/env/lib64/python3.6/site-packages/scipy/ndimage -name "*.so" | xargs strip
find ~/env/lib64/python3.6/site-packages/scipy/spatial -name "*.so" | xargs strip
find ~/env/lib64/python3.6/site-packages/scipy/interpolate -name "*.so" | xargs strip
find ~/env/lib64/python3.6/site-packages/scipy/_lib -name "*.so" | xargs strip
find ~/env/lib64/python3.6/site-packages/scipy/cluster -name "*.so" | xargs strip
find ~/env/lib64/python3.6/site-packages/scipy/stats -name "*.so" | xargs strip
find ~/env/lib64/python3.6/site-packages/scipy/signal -name "*.so" | xargs strip
find ~/env/lib64/python3.6/site-packages/scipy/.libs -name "*.so" | xargs strip

# zipping the contents of the env for deployment
cd ~/env/lib64/python3.6/site-packages/
zip -r9 ~/salarymodel/salarymodel.zip *

# copying so's to lib directory
mkdir -p ~/salarymodel/lib
cp /usr/lib64/atlas-sse3/liblapack.so.3 ~/salarymodel/lib/
cp /usr/lib64/atlas-sse3/libptf77blas.so.3 ~/salarymodel/lib/
cp /usr/lib64/atlas-sse3/libf77blas.so.3 ~/salarymodel/lib/
cp /usr/lib64/atlas-sse3/libptcblas.so.3 ~/salarymodel/lib/
cp /usr/lib64/atlas-sse3/libcblas.so.3 ~/salarymodel/lib/
cp /usr/lib64/atlas-sse3/libatlas.so.3 ~/salarymodel/lib/
cp /usr/lib64/atlas-sse3/libptf77blas.so.3 ~/salarymodel/lib/
cp /usr/lib64/libgfortran.so.3 ~/salarymodel/lib/
cp /usr/lib64/libquadmath.so.0 ~/salarymodel/lib/
#find ~/salarymodel/lib/ -name "*.so*" | xargs strip

# adding those so files to the archive (the strip above is commented out)
cd ~/salarymodel
zip -g9 salarymodel.zip lib/*

# build the pipeline pkl files. this requires pandas, which we
# *didn't* want in the archive, so we install it only now (after
# zipping up site-packages)
pip install pandas
python trainmodel.py

# adding the salarymodel and the pickled pipelines
zip -g9 salarymodel.zip {salarymodel,utils}.py
zip -g9 salarymodel.zip *.pkl

# see what the expanded contents are here
# I have been unzipping in /tmp/salarymodel/
#mkdir -p /tmp/salarymodel && cp salarymodel.zip /tmp/salarymodel/ && cd /tmp/salarymodel && unzip salarymodel.zip
#rm salarymodel.zip && cd ../ && du -h --summarize . && rm -r salarymodel && cd ~/salarymodel

# set up aws credentials (can skip if this is done)
aws configure

# copy this to s3
aws s3 cp salarymodel.zip s3://shared.rzl.gu511.com/

# at this point I went and created a role using the iam console.
# I probably could have done this from the command line. maybe
# I should have...

# now to create the function using the aws cli
aws --profile gu511 lambda create-function \
    --region us-east-1 \
    --function-name salarymodel \
    --zip-file fileb://salarymodel.zip \
    --role arn:aws:iam::134461086921:role/service-role/salarymodel_role \
    --handler salarymodel.handler \
    --runtime python3.6
```

### my thoughts after doing this

the deployment process here was harder than I thought it would be -- namely, the competition between the upper size limit on the deployment package (200 some MB) and the large size of the included non-base libraries and pickled pipeline objects was a major issue.
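
since it is the *unzipped* contents that have to fit, a quick way to check an archive without extracting it would have saved me some trips through `/tmp`. a small standard-library sketch (`unzipped_size_mb` is a hypothetical helper, and the demo archive built here is just a stand-in for `salarymodel.zip`):

```python
import os
import tempfile
import zipfile


def unzipped_size_mb(zip_path):
    """total uncompressed size (in MB) of everything in the archive"""
    with zipfile.ZipFile(zip_path) as zf:
        return sum(info.file_size for info in zf.infolist()) / 1024 ** 2


# demo: a 2 MB payload of zeros compresses to almost nothing, but
# still counts as 2 MB once expanded
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, 'demo.zip')
    with zipfile.ZipFile(demo_zip, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.writestr('payload.bin', b'\0' * (2 * 1024 ** 2))
    demo_size = unzipped_size_mb(demo_zip)
    zipped_size = os.path.getsize(demo_zip) / 1024 ** 2

print('zipped: {:.3f} MB, unzipped: {:.3f} MB'.format(zipped_size, demo_size))
```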

several online discussions revolved around the idea of `strip`-ping the `.so` linkable binary libraries prior to compression, and in the end this was the only trick that worked. *however*, not all such `.so` files can be stripped -- for several essential binaries in the `scipy` package, `strip` appears to break the ability to re-link these modules when they are loaded. the process of figuring out which `.so` files were `strip`-pable and which were not was entirely manual and pretty difficult; I am sure there is a much, much better way of doing this.

also, the hassle in doing this has really underscored just how much this is *not* a good way to deploy a moderately complicated model. it's not hard to set up a long-running web service which still internally functions this way but doesn't need to re-load the model every time a call is made. the lag time is very noticeable, and it requires that we extend the default memory size and timeout length without even doing anything too complicated.
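
a partial mitigation that stays within `lambda` is to cache the unpickled objects at module level, since a warm container reuses the same python process between invocations. this is a sketch of the idea rather than something I deployed; `load_pickle_cached` is a hypothetical helper, and a small dictionary stands in for the real pipeline:

```python
import functools
import os
import pickle
import tempfile


@functools.lru_cache(maxsize=None)
def load_pickle_cached(path):
    """unpickle `path` once per process; later calls reuse the object"""
    with open(path, 'rb') as f:
        return pickle.load(f)


# demo with a stand-in object instead of a real pipeline
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.pkl')
    with open(path, 'wb') as f:
        pickle.dump({'stand_in': 'pipeline'}, f)
    first = load_pickle_cached(path)
    second = load_pickle_cached(path)

# the second call never touched the disk
print(first is second)
```

this only helps warm invocations: the first call in a fresh container still pays the full load time, so the long-running service remains the better fit for a model of this size.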

it's particularly worth noting that the size of our pickled model grew as a function of the number of trees. a 1000 tree forest is not a big deal in terms of the speed of scoring (once loaded into memory), and yet it was pretty much a non-starter under this framework