# Scikit-learn Trainer
This Kubeflow Pipelines component trains a scikit-learn model on your data. You choose the estimator by name and pass its hyperparameters as a string.

## Intended Use
You may use this component to train a scikit-learn classifier or regressor. Currently, the following estimators are supported:

* AdaBoostClassifier
* BaggingClassifier
* DecisionTreeClassifier
* ExtraTreesClassifier
* GaussianNB
* GaussianProcessClassifier
* GradientBoostingClassifier
* GradientBoostingRegressor
* KDTree
* KNeighborsClassifier
* KNeighborsRegressor
* Lasso
* LinearRegression
* LogisticRegression
* MLPClassifier
* RandomForestClassifier
* Ridge
* SGDRegressor
* SVC
* SVR

## Argument Definitions
* `estimator_name`: The name of the estimator as it appears in the list above.
* `training_data_path`: Path to the training CSV file. It can be a local file or a file in a GCS bucket. The target must be the first column, followed by the feature columns.
* `test_data_path`: [optional] Path to the test CSV file, in the same format as the training data file.
* `output_dir`: Path to the output directory, which can be a local directory or a directory in GCS.
* `with_header`: Indicates that the training and test datasets have header rows. If it is not set, the input files are assumed to have no headers.
* `hyperparameters`: A string containing the hyperparameter names and their values, separated by spaces, for example `n_estimators 100 max_depth 4`.
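
The CSV layout described for `training_data_path` (target first, then features) can be produced with a few lines of standard-library Python. The following is a minimal sketch, not part of the component: the helper name `write_target_first_csv`, the file name `train.csv`, and the sample values are assumptions made for illustration.

```python
import csv


def write_target_first_csv(features, targets, path, header=None):
    """Write one row per sample as: target, feature_1, ..., feature_n.

    Hypothetical helper for illustration; it only mirrors the column
    order that the argument description asks for.
    """
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same length")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        if header is not None:
            # Only write a header row if the datasets are meant to have one.
            writer.writerow(header)
        for target, row in zip(targets, features):
            writer.writerow([target, *row])


# Illustrative values only.
features = [[5.9, 3.0, 4.2, 1.5], [5.1, 3.5, 1.4, 0.2]]
targets = [1, 0]
write_target_first_csv(features, targets, "train.csv")
```

If a `header` list is passed, the resulting file has a header row and the component would need to be told so through `with_header`.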

## Enter Component Arguments

In [32]:
EXPERIMENT_NAME = 'kfp-sklearn-component_1'
estimator_name='GradientBoostingClassifier'
training_data_path='gs://cloud-samples-data/ml-engine/iris/classification/train.csv'
test_data_path='gs://cloud-samples-data/ml-engine/iris/classification/evaluate.csv'
output_dir='gs://chavoshi-dev-mlpipeline'
hyperparameters='n_estimators 100 max_depth 4'
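
To make the space-separated `hyperparameters` format concrete, here is a rough sketch of how such a string could be turned into keyword arguments for an estimator. This is an assumption about one reasonable way to parse it, not the component's actual code; `parse_hyperparameters` is a hypothetical helper.

```python
import ast


def parse_hyperparameters(spec):
    """Turn 'name value name value ...' into a dict (hypothetical helper)."""
    tokens = spec.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating names and values: %r" % spec)
    params = {}
    for name, raw in zip(tokens[0::2], tokens[1::2]):
        try:
            # Interpret numbers, booleans and None as Python literals.
            params[name] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Anything else is kept as a plain string.
            params[name] = raw
    return params


parsed = parse_hyperparameters('n_estimators 100 max_depth 4')
print(parsed)
```

With a dict like this, an estimator could be built as `GradientBoostingClassifier(**parsed)`. Values that contain spaces cannot be expressed in this format.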

### Install KFP and scikit-learn 

In [7]:
%%capture
KFP_PACKAGE = 'https://storage.googleapis.com/ml-pipeline/release/0.1.8/kfp.tar.gz'
!pip3 install $KFP_PACKAGE --upgrade
!pip3 install scikit-learn==0.20

### Create Pipeline

In [33]:
import kfp
from kfp import compiler
import kfp.dsl as dsl
import kfp.notebook
import kfp.gcp as gcp
import kfp.components as comp

In [34]:
client = kfp.Client()
exp = client.create_experiment(EXPERIMENT_NAME)

In [38]:
scikit_learn_train = comp.load_component_from_url(
    'https://storage.googleapis.com/kf-pipeline-contrib-public/ai-hub-assets/sklearn/component.yaml')
scikit_learn_train

<function Scikit Learn Trainer(training_data_path, test_data_path, output_dir, estimator_name, hyperparameters)>

In [36]:
@dsl.pipeline(
    name='Sklearn Trainer', description='Trains a Scikit-learn model')
def scikit_learn_trainer(
    training_data_path=dsl.PipelineParam(
        'training-data-path',
        value='gs://cloud-samples-data/ml-engine/iris/classification/train.csv'
    ),
    test_data_path=dsl.PipelineParam(
        'test-data-path',
        value='gs://cloud-samples-data/ml-engine/iris/classification/evaluate.csv'
    ),
    output_dir=dsl.PipelineParam('output-dir', value='/tmp'),
    estimator_name=dsl.PipelineParam(
        'estimator-name', value='GradientBoostingClassifier'),
    hyperparameters=dsl.PipelineParam(
        'hyperparameters', value='n_estimators 100 max_depth 4')):

    # Mount the 'user-gcp-sa' secret so the step can read from and write to GCS.
    sklearn_op = scikit_learn_train(
        training_data_path, test_data_path, output_dir, estimator_name,
        hyperparameters).apply(gcp.use_gcp_secret('user-gcp-sa'))

compiler.Compiler().compile(scikit_learn_trainer, 'one_step_pipeline.tar.gz')

### Run Pipeline

In [37]:
run = client.run_pipeline(
    exp.id,
    'run 1',
    'one_step_pipeline.tar.gz',
    params={
        'training-data-path':training_data_path,
        'test-data-path':test_data_path,
        'output-dir':output_dir,
        'estimator-name':estimator_name,
        'hyperparameters':hyperparameters,
    })


### Locate exported pickled model
The trained model is exported as a pickle file to `output_dir` on GCS. Find the full path of the file in the GCS browser UI or from the command line with `gsutil ls {output_dir}`. You can run this command in this notebook if the environment has the appropriate permissions.

Enter your file path in the model-loading cell of the next section before executing it.
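
If several runs have written to the same `output_dir`, picking the most recent pickle from the `gsutil ls` listing can be scripted. The sketch below assumes that exported files are named `<estimator_name>_<14-digit timestamp>.pkl`, which is inferred only from the single example file name used in this notebook. The helper `latest_model_path` and the bucket `gs://example-bucket` are hypothetical.

```python
import re


def latest_model_path(paths, estimator_name):
    """Return the path with the newest timestamp, or None if nothing matches.

    Assumes names like '<estimator_name>_YYYYMMDDHHMMSS.pkl'.
    """
    pattern = re.compile(re.escape(estimator_name) + r"_(\d{14})\.pkl$")
    stamped = []
    for path in paths:
        match = pattern.search(path)
        if match:
            stamped.append((match.group(1), path))
    if not stamped:
        return None
    # Fixed-width digit strings sort in chronological order.
    return max(stamped)[1]


# Example listing, as `gsutil ls` might print it (made-up paths).
listing = [
    "gs://example-bucket/GradientBoostingClassifier_20190611173835.pkl",
    "gs://example-bucket/GradientBoostingClassifier_20190610101500.pkl",
    "gs://example-bucket/README.txt",
]
newest = latest_model_path(listing, "GradientBoostingClassifier")
```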

### Load trained model after run and test

In [26]:
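import tensorflow as tf
import pickle

# Replace with the GCS path you retrieved above.
PICKLE_FILE_PATH = output_dir + '/GradientBoostingClassifier_20190611173835.pkl'

# Build a graph op that reads the whole file as a bytes string.
f = tf.io.read_file(PICKLE_FILE_PATH)

# tf.Session is the TensorFlow 1.x API; running the op returns the raw bytes.
with tf.Session() as sess:
    pickle_string = sess.run(f)

# Deserialize the estimator and predict on one iris sample.
model = pickle.loads(pickle_string)
model.predict([[5.9, 3.0, 4.2, 1.5]])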



array([1])
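
The loading step above depends on `tf.Session`, which is not available at the top level in TensorFlow 2.x. As a possible alternative, the sketch below separates "open the file" from "unpickle it", so the same function works with any file opener. The helper `load_pickled_model` is hypothetical, and the demo uses a local temporary file with a stand-in object rather than a real trained model.

```python
import os
import pickle
import tempfile


def load_pickled_model(path, opener=open):
    """Unpickle an object from `path` using the given file opener.

    `opener` must accept (path, mode) and return a context manager that
    yields a binary file object, as the built-in `open` does.
    """
    with opener(path, "rb") as f:
        return pickle.load(f)


# Local demo with a stand-in object instead of a trained estimator.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pkl")
    with open(demo_path, "wb") as f:
        pickle.dump({"kind": "stand-in for a trained estimator"}, f)
    restored = load_pickled_model(demo_path)
```

For a `gs://` path, one option would be to pass `opener=tf.io.gfile.GFile`, assuming the installed TensorFlow version exposes `tf.io.gfile`. Only unpickle files that come from a source you trust, because unpickling can execute arbitrary code.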
