# Generalizing the Machine Learning Process
Date: 2019-05-09  
Author: Jason Beach  
Categories: process, data_science  
Tags: machine_learning  
<!--eofm-->

This work describes a general approach to follow when performing machine learning (ML) manually and when automating it in a deployment setting.  Unlike a classical statistical analysis, standard machine learning projects typically follow a general and repeatable process.  While the practitioner should be aware of the details of each step and the reasons for choosing them, ML requires much less of the design-thinking and assumption-checking that are necessary components of more mathematical modeling fields.  This makes the machine learning process amenable to deployment as a service, because automating the re-training and prediction of a model with consistent data is straightforward programming.

## Model Theory

Most of the design-thinking in the ML process lies in choosing a variety of models whose performance can be compared.  The following three characteristics succinctly describe an ML model.

1. Representation: structural model characteristics
    - name
    - family
    - interpretability
    - type 
        + generative / discriminative 
        + bias / variance
        + fixed- / variable- learner
2. Evaluation: functions applied to the structure
    - objective
    - cost
    - loss
3. Optimization: algorithms necessary to solve for parameters
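
The three characteristics can be made concrete with a small sketch. The example below is an illustration only, assuming logistic regression as the model and plain batch gradient descent as the optimizer; the function names, learning rate, and simulated data are all hypothetical choices, not part of the process described here.

```python
import numpy as np


def predict_proba(X, w, b):
    """Representation: a linear score passed through a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


def log_loss(y, p, eps=1e-12):
    """Evaluation: mean negative log-likelihood of the observed labels."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def fit(X, y, lr=0.1, n_iter=500):
    """Optimization: batch gradient descent on the log loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = predict_proba(X, w, b)
        w -= lr * (X.T @ (p - y)) / len(y)  # gradient with respect to w
        b -= lr * np.mean(p - y)            # gradient with respect to b
    return w, b


# Simulated, linearly separable data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

loss_before = log_loss(y, predict_proba(X, np.zeros(2), 0.0))
w, b = fit(X, y)
loss_after = log_loss(y, predict_proba(X, w, b))
print(f"log loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Swapping any one of the three functions (for example, a different loss) changes the model without touching the other two, which is why the three characteristics are a useful way to compare candidates.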

It is also important to understand how the chosen model affects the modeling process:

- assumptions inherent in representation
- alignment of loss function with project goals
- sources of bias / variance
- determination of resource constraints
- enumeration of how over-tuning can occur, and how regularization limits it
- understanding when manual 'fiddling' with model implementation parameters is ineffective
- recognition that strong false assumptions can be better than weak true ones, because a learner with weak assumptions needs more data

_Note:_ These points should be considered together with feature engineering and feature selection to ensure the input transformations align with the model.
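
One way to see over-tuning and the effect of regularization is to compare training and cross-validation scores across regularization strengths. The following is a minimal sketch, assuming scikit-learn's `validation_curve` and a simulated dataset with many uninformative features; the range of `C` values is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve

# Simulated data: 20 features, only 3 of them informative (an assumption)
X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           random_state=0)

# Smaller C means stronger regularization in LogisticRegression
C_values = np.logspace(-3, 3, 7)
train_scores, cv_scores = validation_curve(
    LogisticRegression(max_iter=1000), X, y,
    param_name="C", param_range=C_values, cv=5)

# A growing gap between training and cross-validation scores suggests over-tuning
for C, tr, cv in zip(C_values, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(f"C={C:g}  train={tr:.3f}  cv={cv:.3f}  gap={tr - cv:.3f}")
```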

## The Machine Learning Process

The following are the general steps taken in the ML Process.  They are similar to many other problem-solving and design-thinking processes, but tailored to ML specifics.  

There are several hard checks that should be used to ensure the practitioner stays honest.  One important check is laying out proper evaluation methods before implementing them.  This is similar to choosing an acceptable significance level in classical statistics before running the model.

Another check is on model resource and time requirements.  More sophisticated models typically need more memory to implement and take longer to run.  These requirements depend heavily on the environment the model is deployed to and should be determined with the customer at the beginning.
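
A rough version of the resource check can be scripted early. The sketch below is one possible approach, assuming the standard library's `time` and `tracemalloc` modules are an acceptable proxy; `measure_fit` is a hypothetical helper, the two candidate models are arbitrary, and `tracemalloc` only sees allocations made through Python's allocator, so memory used by native code may be under-reported.

```python
import time
import tracemalloc

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def measure_fit(model, X, y):
    """Return (seconds, peak_traced_bytes) for a single call to model.fit."""
    tracemalloc.start()
    start = time.perf_counter()
    model.fit(X, y)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak


# Simulated data; the sizes are arbitrary
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
report = {name: measure_fit(model, X, y) for name, model in candidates.items()}
for name, (seconds, peak) in report.items():
    print(f"{name}: {seconds:.3f} s, peak traced memory {peak / 1e6:.1f} MB")
```

The numbers only mean something on hardware comparable to the deployment environment, so they are best treated as a conversation starter with the customer.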

_Discover_
* determine problem and constraints
* determine characteristics the problem / scenario dictates on the solution
    - model family
    - acceptable methods of dimensionality reduction and regularization
    - deployment environment
* decide evaluation
    - primary / secondary evaluation score (e.g. accuracy)
    - methods of evaluation (e.g. confusion matrix, ROC curve)
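
The evaluation decided during Discover can be written down as code before any modeling starts. The following sketch assumes accuracy as the primary score and ROC AUC as the secondary score; both choices, the simulated data, and the placeholder model are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

# Simulated binary classification data
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Placeholder model; any classifier with predict_proba would do
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]

primary = accuracy_score(y_test, y_pred)         # primary evaluation score
secondary = roc_auc_score(y_test, y_score)       # secondary evaluation score
cm = confusion_matrix(y_test, y_pred)            # rows: true class, columns: predicted
fpr, tpr, thresholds = roc_curve(y_test, y_score)  # points on the ROC curve

print(f"accuracy={primary:.3f}  roc_auc={secondary:.3f}")
print(cm)
```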

_Collect and Transform_
* obtain raw data 
    - internal data warehouses
    - external APIs and services
* integration and cleaning
* filter, aggregate, and query
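
The Collect and Transform steps might look like the following pandas sketch. The `customers` and `orders` tables are hypothetical stand-ins for warehouse or API extracts, and treating exact duplicate rows as extraction errors is an assumption that would need to be confirmed with the data owner.

```python
import pandas as pd

# Hypothetical raw extracts (e.g. from a warehouse table and an external API)
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "region": ["east", "west", None, "east"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 5],
    "amount": [20.0, 35.0, None, 15.0, 15.0, 50.0],
})

# Integration: keep only orders that match a known customer
merged = orders.merge(customers, on="customer_id", how="inner")

# Cleaning: drop exact duplicates and missing amounts, label missing regions
cleaned = (merged.drop_duplicates()
                 .dropna(subset=["amount"])
                 .fillna({"region": "unknown"}))

# Filter and aggregate: total and count of orders of at least 10 per customer
summary = (cleaned[cleaned["amount"] >= 10]
           .groupby("customer_id", as_index=False)
           .agg(total=("amount", "sum"), n_orders=("amount", "size")))
print(summary)
```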

_Summary and Process_
* exploration
* preparation
    - address balance (classification, ANOVA, etc.)
    - create Train, Validate, and Test sets with a split
* configure Feature Extraction with scikit-learn's FeatureUnion
* configure Preprocess and choose model-families with scikit-learn's Pipeline
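
The balance check and the three-way split can be sketched as follows. This is one possible arrangement, assuming a 60 / 20 / 20 split and stratification on the target; the proportions and the simulated imbalanced data are illustrative choices.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Simulated imbalanced data: roughly 80% of one class, 20% of the other
X, y = make_classification(n_samples=1000, n_features=6,
                           weights=[0.8, 0.2], random_state=0)
print("class counts:", Counter(y))

# First hold out the Test set, then split the remainder into Train and Validate.
# Stratifying keeps the class proportions similar in every subset.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```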

_Build_
* train the models
    - apply k-folds CV and grid search with Training set
    - perform on multiple model-families and hyper-parameters
* evaluate models
    - review the aforementioned confusion matrix, scoring, classifier-threshold, and tests
    - select the best model-family / hyper-parameters
    - refit the selected model-family on the Validation set, or on all of the Training set, to parameterize it and set it as the final model
* refine performance
    - debug performance with learning curve, lift chart
    - use Testing set to evaluate final model characteristics
    - export to binary file
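
The Build steps can be sketched in a few lines. The example below assumes scikit-learn's `GridSearchCV` with two candidate model-families (logistic regression and a random forest, both arbitrary choices) and uses `pickle` for the binary export; the grids and simulated data are illustrative only.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The "model" step is itself a grid parameter, so each dict is one model-family
pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
param_grid = [
    {"model": [LogisticRegression(max_iter=1000)], "model__C": [0.1, 1.0, 10.0]},
    {"model": [RandomForestClassifier(random_state=0)],
     "model__n_estimators": [50, 100]},
]

# 5-fold cross-validation over every family / hyper-parameter combination
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)

# The Testing set is touched only once, for the final model
test_score = search.score(X_test, y_test)

# Export the refit best pipeline to bytes; pickle.dump(obj, file) writes a file instead
model_bytes = pickle.dumps(search.best_estimator_)
restored = pickle.loads(model_bytes)
```

Pickled models should only be loaded from trusted sources and with a matching scikit-learn version.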

_Deliver_
* select solution
    - design interface most appropriate for using the model
    - automate data integration and pipelines
    - implement model in deployable environment
* deploy solution within system environments
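
For the Deliver stage, the interface is often a thin wrapper around the trained model. The sketch below is a hypothetical example: `predict_one`, the feature names, and the 0.5 decision threshold are all assumptions, and a real interface would follow whatever the deployment environment requires.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

FEATURES = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

# Stand-in for a model that was trained and exported during Build
X_arr, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(
    pd.DataFrame(X_arr, columns=FEATURES), y)


def predict_one(record):
    """Validate a single record (a dict) and return a label and probability."""
    missing = [name for name in FEATURES if name not in record]
    if missing:
        raise ValueError(f"missing features: {missing}")
    row = pd.DataFrame([[record[name] for name in FEATURES]], columns=FEATURES)
    proba = float(model.predict_proba(row)[0, 1])
    # The 0.5 threshold is an assumption; it should follow the evaluation plan
    return {"label": int(proba >= 0.5), "probability": proba}


example = dict(zip(FEATURES, X_arr[0]))
print(predict_one(example))
```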


_Note:_ Train, Validate, and Test should be from different (independent) data sets, if possible.

## Stakeholder Interaction and Timeline

It is useful to display these stages in relation to the interactions that must take place with stakeholders.  These may be business users who need a problem solved, or technology departments that will have to support applications that implement the solution.  The vertical axis shows each stage's proximity to these stakeholders.

While every project is different, most stages use a similar proportion of time.  The horizontal axis lays out the timeline.

![machine learning process](images/ml_process.png)

## Demonstration

The following code demonstrates the programming portions of these stages and steps using a toy example on a simulated, diverse dataset of both numeric and categorical data.  It does not display the important steps that must be taken with stakeholders.

### Configuration

In [2]:
# Create a pipeline that extracts features from the data then creates a model
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

# Simulate data (both numeric and categorical)
from sklearn.datasets import make_blobs
# Generate a 2-D, two-cluster dataset; the cluster labels are not used
X_cat, not_used = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)

from sklearn.datasets import make_regression
# Generate a regression dataset with two numeric features
X_num, y_num = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=1)


In [9]:
type(X_num)

numpy.ndarray

### Summary and Process

In [3]:
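# Imports needed by this cell
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# NOTE: `data` is never created in this notebook, so running this cell as-is raises
# "NameError: name 'data' is not defined".  The code assumes a DataFrame with a
# 'survived' target and the feature columns named below.

# Separate the features X from the target y
X = data.drop('survived', axis=1)
y = data['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create a feature union for the numeric data
features = []
features.append(('pca', PCA()))                         # n_components is set by the grid search
features.append(('select_best', SelectKBest(k='all')))  # k is set by the grid search
num_feature_eng = FeatureUnion(features)

# Create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('num_feature_eng', num_feature_eng)  # was the undefined name `feature_eng`
    ])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ('select_best', SelectKBest(k=6))
    ])

# Route each group of columns to its own transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])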

### Build

In [None]:
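from sklearn.model_selection import GridSearchCV, learning_curve
import matplotlib.pyplot as plt

# Grid keys follow the nested step names: <step>__<sub-step>__<parameter>
grid = {'preprocessor__num__num_feature_eng__pca__n_components': [.75, .80, .85, .90, .95],
        'preprocessor__num__num_feature_eng__select_best__k': [1, 2]  # only two numeric features
        }
gridClf = GridSearchCV(clf, grid, cv=5)
gridClf.fit(X_train, y_train)


# Learning curve for the best pipeline found by the grid search
train_sizes, train_scores, cv_scores = learning_curve(gridClf.best_estimator_, X_train, y_train, cv=5)
plt.plot(train_sizes, train_scores.mean(axis=1), label='training score')
plt.plot(train_sizes, cv_scores.mean(axis=1), label='cross-validation score')
plt.title('Learning Curves (LogisticRegression pipeline)')
plt.xlabel('training examples')
plt.ylabel('score')
plt.legend()
plt.show()


# Final model evaluation
gridClf.score(X_test, y_test)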
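
Because the cells above depend on a `data` frame that is not shown, the following self-contained sketch runs the same Configuration, Summary and Process, and Build steps end to end on a simulated stand-in. The column names (`num_a`, `cat_a`, and so on), the way the target is generated, and the grid values are all assumptions made for illustration; only the pipeline structure mirrors the demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, learning_curve, train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Simulated mixed-type data (hypothetical columns and relationships)
rng = np.random.default_rng(1)
n = 300
data = pd.DataFrame({
    "num_a": rng.normal(size=n),
    "num_b": rng.normal(size=n),
    "num_c": rng.normal(size=n),
    "cat_a": rng.choice(["x", "y", "z"], size=n),
    "cat_b": rng.choice(["low", "high"], size=n),
})
signal = data["num_a"] + 0.5 * data["num_b"] + (data["cat_b"] == "high")
data["target"] = (signal + rng.normal(scale=0.5, size=n) > 0.5).astype(int)
data.loc[rng.choice(n, size=15, replace=False), "num_a"] = np.nan  # missing values

X = data.drop("target", axis=1)
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

numeric_features = ["num_a", "num_b", "num_c"]
categorical_features = ["cat_a", "cat_b"]

# Numeric branch: impute, scale, then concatenate PCA components and selected features
num_feature_eng = FeatureUnion([("pca", PCA()), ("select_best", SelectKBest())])
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("num_feature_eng", num_feature_eng),
])
# Categorical branch: impute a constant, then one-hot encode
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])
clf = Pipeline([("preprocessor", preprocessor),
                ("classifier", LogisticRegression(solver="lbfgs"))])

# Grid keys follow the nested step names, joined by double underscores
grid = {
    "preprocessor__num__num_feature_eng__pca__n_components": [1, 2, 3],
    "preprocessor__num__num_feature_eng__select_best__k": [1, 2, 3],
}
grid_clf = GridSearchCV(clf, grid, cv=5)
grid_clf.fit(X_train, y_train)

# Learning-curve scores for the best pipeline (plotting is left out here)
sizes, train_scores, cv_scores = learning_curve(
    grid_clf.best_estimator_, X_train, y_train, cv=5, train_sizes=[0.3, 0.6, 1.0])

# Final model evaluation on the held-out Test set
test_accuracy = grid_clf.score(X_test, y_test)
print(grid_clf.best_params_, f"test accuracy={test_accuracy:.3f}")
```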