#### Pickling
* Pickling - turns a Python object hierarchy into a byte stream, usually written to a file - uses `pickle.dump`
* the object is "flattened" (serialized) into a sequence of bytes
* Unpickling - converts the byte stream back into an object hierarchy - uses `pickle.load`
* errors: `pickle.PicklingError` - object is not supported for pickling
* `pickle.UnpicklingError` - bad/corrupted data
* good for saving complicated data, easy to use; downside: the file is not human-readable
* bad for sharing with non-Python programs; unpickling data from untrusted sources is unsafe because it can execute arbitrary code


pickling a list, `wb` opens the file for writing in binary mode:
```
import pickle

my_list = [1, 2, 3, 4]
with open('datafile.txt', 'wb') as fh:
    pickle.dump(my_list, fh)
```

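unpickling, `rb` opens the file for reading in binary mode:
```
import pickle

# the with-block closes the file once loading is done
with open('datafile.txt', 'rb') as pickle_off:
    emp = pickle.load(pickle_off)
print(emp)
```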
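
The two error types listed above can be seen in a small sketch. It uses `pickle.dumps`/`pickle.loads` (the in-memory counterparts of `dump`/`load`); the helper names `try_pickle` and `try_unpickle` are made up for illustration, and the extra exception types are caught because unsupported objects do not always surface as `PicklingError`.

```python
import pickle

# Round trip through an in-memory byte stream.
original = {"name": "example", "values": [1, 2, 3]}
payload = pickle.dumps(original)
restored = pickle.loads(payload)


def try_pickle(obj):
    """Hypothetical helper: True if obj can be pickled, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        # Depending on the object and Python version, an unsupported
        # object may raise AttributeError or TypeError instead.
        return False


def try_unpickle(data):
    """Hypothetical helper: True if data is a valid pickle, False otherwise."""
    try:
        pickle.loads(data)
        return True
    except (pickle.UnpicklingError, EOFError):
        return False


print(restored == original)
print(try_pickle(lambda x: x + 1))         # lambdas are not picklable
print(try_unpickle(b"\x00 not a pickle"))  # corrupted byte stream
```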

#### Joblib
* designed for objects that carry large NumPy arrays internally (e.g. fitted sklearn models)
* has similar `joblib.dump()` and `joblib.load()`
* can compress files by setting the `compress=` argument
    * zlib is the default compressor, but can choose gzip, bz2, lzma, xz

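```
import joblib

to_persist = {'weights': [0.1, 0.2, 0.3]}  # placeholder: any picklable object
filename = 'to_persist.joblib'             # placeholder file name

# to dump
joblib.dump(to_persist, filename)

# to reload
reloaded = joblib.load(filename)
```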
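
A rough sketch of the `compress=` argument, assuming a large all-zeros array as stand-in data (chosen because it compresses well; real models will shrink less). The file names are placeholders.

```python
import os
import tempfile

import joblib
import numpy as np

# Stand-in for model data: a large, highly compressible array (assumption).
data = {"weights": np.zeros((1000, 1000))}

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "data_raw.joblib")
    zlib_path = os.path.join(tmp, "data_zlib.joblib")
    lzma_path = os.path.join(tmp, "data_lzma.joblib")

    joblib.dump(data, raw_path)                         # no compression
    joblib.dump(data, zlib_path, compress=3)            # zlib, level 3
    joblib.dump(data, lzma_path, compress=("lzma", 3))  # (method, level)

    sizes = {
        "raw": os.path.getsize(raw_path),
        "zlib": os.path.getsize(zlib_path),
        "lzma": os.path.getsize(lzma_path),
    }
    reloaded = joblib.load(lzma_path)

print(sizes)
print(np.array_equal(reloaded["weights"], data["weights"]))
```

`joblib.load` detects the compression method from the file itself, so the load call is the same in all three cases.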



[Walkthrough](https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/)

```
# Save Model Using Pickle
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
import pickle
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)
# Fit the model on training set
model = LogisticRegression()
model.fit(X_train, Y_train)
```

```
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
```


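```
# save the fitted model to disk
filename = 'finalized_model.sav'
with open(filename, 'wb') as fh:
    pickle.dump(model, fh)
```

```
# load the model back and score it on the held-out test set
with open(filename, 'rb') as fh:
    loaded_model = pickle.load(fh)
result = loaded_model.score(X_test, Y_test)
print(result)
```

```
0.7874015748031497
```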


```
# Save Model Using joblib - outdated, not really used - stick with pickle
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
import joblib
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)
# Fit the model on training set
model = LogisticRegression()
model.fit(X_train, Y_train)

# save the model to disk
filename = 'finalized_model.sav'
joblib.dump(model, filename)

# some time later...

# load the model from disk
loaded_model = joblib.load(filename)
result = loaded_model.score(X_test, Y_test)
print(result)
```

```
0.7874015748031497
```




Tips for saving models:
* the Python version should be the same when serializing and when loading
* library versions (e.g. sklearn, NumPy) need to be the same when serializing and deserializing
* Manual Serialization - output the learned parameters yourself so they can be used directly later in sklearn or on another platform
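
A minimal sketch of manual serialization, assuming a binary logistic regression trained on synthetic data. Only `coef_` and `intercept_` are written out as JSON, and the prediction is rebuilt by hand with NumPy; the function name `predict_proba_manual` is hypothetical.

```python
import json

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data stands in for a real training set (assumption).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# Serialize only the learned parameters as plain JSON.
params = {
    "coef": model.coef_.tolist(),
    "intercept": model.intercept_.tolist(),
}
serialized = json.dumps(params)

# Later, possibly on a platform without sklearn: rebuild the prediction.
loaded = json.loads(serialized)
coef = np.array(loaded["coef"])
intercept = np.array(loaded["intercept"])


def predict_proba_manual(X):
    """Binary logistic regression: sigmoid of the linear score."""
    scores = X @ coef.T + intercept
    return (1.0 / (1.0 + np.exp(-scores))).ravel()


manual = predict_proba_manual(X)
print(np.allclose(manual, model.predict_proba(X)[:, 1]))
```

This avoids the version constraints of pickle, at the cost of re-implementing the prediction formula for each model type.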

#### Pipelines
[walkthrough](https://web.archive.org/web/20210507043615/https://iaml.it/blog/optimizing-sklearn-pipelines)
* chains multiple estimators into one (sklearn data processing and modeling steps)
* useful when there's a fixed sequence of steps in processing data
* convenience and encapsulation (one `fit` and `predict` call for the whole chain), joint parameter selection (grid search over the parameters of all estimators in the pipeline at once)
* safety - avoids leaking statistics from test data into the trained model during cross-validation
* all objects in a pipeline (except the last) must be 'transformers' - they need a `.transform()` method
* last estimator can be any type (transformer, regressor, or classifier)
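
The points above can be sketched with a two-step pipeline. The data is synthetic and the step names `scaler` and `clf` are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=7)

pipe = Pipeline([
    ("scaler", StandardScaler()),   # transformer: has .transform()
    ("clf", LogisticRegression()),  # final step: any estimator
])

# In each fold the scaler is fitted on the training part only,
# so no statistics from the held-out fold leak into the model.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())

# One fit/predict call drives the whole chain.
pipe.fit(X, y)
print(pipe.predict(X[:5]))
```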

#### From Lecture
* without a pipeline the code gets ugly: `.fit` and `.transform` are called on many different objects
* preprocessing and modeling are spread across the code, which is error-prone
* without a pipeline, grid search can only tune the model class; with a pipeline it can also tune preprocessing, e.g. the number of components or the scaling method
* scaling steps need `.fit` and `.transform`
* the pipeline ends with a classifier/regressor that has `.predict`
* can create your own Pipeline class
* all steps must follow the sklearn interface
* but you can write your own steps
* in `fit`, loop over steps `[:-1]` (the scalers/transformers), because the last step is the classifier
* step `[-1]` is fitted as the regressor/classifier
* `return self` from `fit` - an sklearn convention
* the sklearn pipeline can later be visualized
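
A hand-rolled pipeline along the lines of the lecture notes might look like the sketch below. `SimplePipeline` is a hypothetical class written for illustration, not sklearn's implementation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


class SimplePipeline:
    """Hypothetical minimal pipeline: transformers first, predictor last."""

    def __init__(self, steps):
        self.steps = steps  # list of estimator objects

    def fit(self, X, y):
        # Fit and apply every step except the last (the transformers).
        for step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        # Fit the final classifier/regressor on the transformed data.
        self.steps[-1].fit(X, y)
        return self  # sklearn convention: fit returns the estimator

    def predict(self, X):
        # Only transform here; nothing is re-fitted at prediction time.
        for step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1].predict(X)


X, y = make_classification(n_samples=200, n_features=5, random_state=0)
pipe = SimplePipeline([StandardScaler(), LogisticRegression()]).fit(X, y)
print(pipe.predict(X[:5]))
```

Because `fit` returns `self`, construction and fitting can be chained in one expression, as in the last lines.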

Feature unions
* branches in the process
* different feature engineering on different branches
* which are joined at the end
* each branch still receives all incoming features
* e.g. split into PCA and k-best branches, then join (concatenate) the resulting features
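
A small sketch of the PCA plus k-best split, using synthetic data. The branch names `pca` and `kbest` and the component counts are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Both branches receive all 10 input features.
features = FeatureUnion([
    ("pca", PCA(n_components=3)),
    ("kbest", SelectKBest(f_classif, k=2)),
])

# Outputs are concatenated column-wise: 3 PCA columns + 2 selected columns.
X_joined = features.fit_transform(X, y)
print(X_joined.shape)  # (200, 5)
```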

Column Transformers
* similar to a feature union, but each branch is applied to a chosen subset of columns
* numerical columns
    * impute means, standard scaler
* categorical columns
    * impute missing with mode
    * one hot encode
* fit model on resulting features
* takes a DataFrame and outputs a NumPy array
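
The numeric/categorical split above could be wired up roughly as follows. The toy DataFrame and its column names are invented for illustration, and `sparse_threshold=0` is set here to force a dense NumPy array as output.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with missing values (made-up columns).
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38],
    "income": [30.0, 52.0, 41.0, np.nan, 88.0, 60.0],
    "city": ["A", "B", np.nan, "A", "B", "A"],
})
y = [0, 1, 0, 1, 1, 0]

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),  # mode
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric, ["age", "income"]),
        ("cat", categorical, ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
model.fit(df, y)

# Inspect what the classifier actually sees.
X_out = model.named_steps["preprocess"].transform(df)
print(type(X_out), X_out.shape)
```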

Visualizing a Pipeline
* set_config(display='diagram')
* or saved as html file
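
Both options can be sketched as below. `estimator_html_repr` from `sklearn.utils` produces the HTML string; the output file name `pipeline.html` is a placeholder.

```python
import os
import tempfile

from sklearn import set_config
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])

# In a notebook, this makes `pipe` render as a diagram when displayed.
set_config(display="diagram")

# Outside a notebook, the same diagram can be written to an HTML file.
html = estimator_html_repr(pipe)
out_path = os.path.join(tempfile.gettempdir(), "pipeline.html")  # placeholder
with open(out_path, "w", encoding="utf-8") as fh:
    fh.write(html)
print(out_path)
```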

Gridsearch Hyperparameter tuning
* param grid - keys are now `'<pipeline step>__<parameter>': [param list]`
* if nested under a feature union: `features__pca__n_components`
* two double underscores: one after the pipeline step name and one after the name inside the feature union
* `verbose=1` reports how many fits will be run
* hyperparameters - increasing values 
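
A sketch of the double-underscore naming with a nested feature union. The step names (`scaler`, `features`, `pca`, `kbest`, `clf`) and the candidate values are choices made for this example, on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("features", FeatureUnion([
        ("pca", PCA()),
        ("kbest", SelectKBest(f_classif)),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "features__pca__n_components": [2, 4],  # step -> union member -> parameter
    "features__kbest__k": [1, 3],
    "clf__C": [0.1, 1.0, 10.0],             # step -> parameter
}

# verbose=1 prints the number of candidates and the total number of fits
# (here 2 * 2 * 3 = 12 candidates, each fitted on 3 folds).
search = GridSearchCV(pipe, param_grid, cv=3, verbose=1)
search.fit(X, y)
print(search.best_params_)
```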

Custom Class Pipeline
* e.g. log transform a column
* create a new class
* `class LogTransformer(BaseEstimator, TransformerMixin):`
    * needs `__init__`, `fit`, and `transform`
* `transform` is where the data is changed
* `fit` is for anything you need to learn from the training data, e.g. a mean
* if it's just a stateless transform and you don't need a `.fit`, a faster way is to write a function and wrap it with `FunctionTransformer(func)`
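
Both routes can be sketched side by side. Using `np.log1p` (log of 1 + x) instead of a plain log is an assumption made here so that zeros stay valid.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer


class LogTransformer(BaseEstimator, TransformerMixin):
    """Applies log(1 + x) to every value (log1p is an assumption)."""

    def __init__(self):
        pass  # no hyperparameters in this sketch

    def fit(self, X, y=None):
        return self  # nothing to learn from the training data

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))


X = np.array([[0.0, 9.0], [99.0, 999.0]])

# Route 1: custom class. Route 2: wrap a plain function.
from_class = LogTransformer().fit_transform(X)
from_func = FunctionTransformer(np.log1p).fit_transform(X)
print(np.allclose(from_class, from_func))
```

`TransformerMixin` supplies `fit_transform`, so the class only has to define `fit` and `transform`.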

Model Persistence
* employment vs deployment - two different things
* once the model is done - just pickle it and pass it on to someone else (e.g. a web developer)
* save as `.pickle`
* per the lecture, pickle is faster than joblib on Python 3.8+
* pickle doesn't save code, it only saves the object's state
* on load it recreates the pipeline and plugs in the saved values for each step
* library versions need to be the same - in case attribute names change between versions, e.g. `coef_` renamed to `weights`
* anaconda environments need to be the same - can export the anaconda environment along with the pickled file
* custom classes need to be defined (importable) where the pipeline is loaded; the class code is not stored in the pickle
* on the server - need the same sklearn version, plus the custom code
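
One possible way to make the version requirement explicit is to pickle the fitted pipeline together with the versions it was trained under. This is only a sketch: the `bundle` layout and its key names are invented here, and the check runs after unpickling, so it flags a mismatch rather than preventing one.

```python
import pickle
import sys
import warnings

import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X, y)

# Bundle the fitted pipeline with the versions used for training.
bundle = {
    "model": pipe,
    "python_version": sys.version_info[:2],
    "sklearn_version": sklearn.__version__,
}
payload = pickle.dumps(bundle)  # in practice, pickle.dump to a .pickle file

# On the receiving side: compare versions before trusting predictions.
loaded = pickle.loads(payload)
if loaded["sklearn_version"] != sklearn.__version__:
    warnings.warn("sklearn version differs from the one used for training")
if loaded["python_version"] != sys.version_info[:2]:
    warnings.warn("Python version differs from the one used for training")

restored = loaded["model"]
print((restored.predict(X) == pipe.predict(X)).all())
```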