# Scikit-learn Iris Classifier - Local Example

_**Train and export a scikit-learn classifier for the [Iris data set](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data), performing all storage and computation locally on the notebook.**_

This notebook works well with the `Python 3 (Data Science)` kernel on SageMaker Studio, or `conda_python3` on classic SageMaker Notebook Instances.

---

The [dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/) is hosted in the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php), which maintains 622 data sets.

>❓*Can you figure out how to re-create this notebook's workflow using SageMaker more effectively?*

## Contents

1. **[Prepare the Data](#Prepare-the-Data)**
1. **[Build a Model](#Build-a-Model)**
1. **[Fit the Model](#Fit-the-Model)**
1. **[Save the Trained Model](#Save-the-Trained-Model)**
1. **[Explore Results](#Explore-Results)**

See the accompanying **Instructions** notebook for more guidance!

In [None]:
import argparse
import numpy as np
import os
import pandas as pd
# joblib is imported directly; `sklearn.externals.joblib` was removed from recent scikit-learn releases
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics


## Prepare the Data

Now let's download the data.

The Iris data set is a single small CSV file with 150 rows: four measurements per flower (sepal length, sepal width, petal length, petal width) and a species label. It is small enough to handle entirely in memory, so we will:

- Define a mapping between the three species names and integer class codes
- Download `iris.data` to the notebook's working directory with `wget`
- Load the file with pandas, shuffle the rows, and split them into training and test sets
- Standardize the features

In [None]:
# Dictionary to encode labels to codes
label_encode = {
    'Iris-virginica': 0,
    'Iris-versicolor': 1,
    'Iris-setosa': 2
}

# Dictionary to convert codes to labels
label_decode = {
    0: 'Iris-virginica',
    1: 'Iris-versicolor',
    2: 'Iris-setosa'
}
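
The two dictionaries above have to stay exact mirrors of each other. One way to guard against drift, sketched here as an alternative rather than what this notebook does, is to derive the decoder from the encoder:

```python
# Same encoding as the notebook's label_encode dictionary
label_encode = {
    "Iris-virginica": 0,
    "Iris-versicolor": 1,
    "Iris-setosa": 2,
}

# Invert the mapping so codes and labels cannot drift apart
label_decode = {code: label for label, code in label_encode.items()}

# Round-trip check: encoding then decoding returns the original label
for label in label_encode:
    assert label_decode[label_encode[label]] == label
```

This only works because every code is unique; duplicate codes would silently collapse entries in the inverted dictionary.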



In [None]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

In [None]:
# iris.data has no header row, so the column names are supplied explicitly
data = pd.read_csv('iris.data',
                   names=['sepal length', 'sepal width',
                          'petal length', 'petal width',
                          'label'])
data.head()
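
Because the file has no header row, it is worth confirming that the columns parsed as expected. The sketch below runs the same `read_csv` call on a small in-memory sample instead of the downloaded file; the three rows are illustrative and the variable names (`sample_csv`, `sample`) are hypothetical:

```python
import io

import pandas as pd

# Three rows in the same header-less format as iris.data (illustrative sample)
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)
columns = ["sepal length", "sepal width", "petal length", "petal width", "label"]
sample = pd.read_csv(sample_csv, names=columns)

# Every feature column should parse as a float and no value should be missing
feature_dtypes_ok = all(sample[c].dtype == "float64" for c in columns[:-1])
has_missing = bool(sample.isna().any().any())
print(sample.dtypes)
```

The same two checks can be applied to the real `data` frame after loading it.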

In [None]:
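# Shuffle the rows before splitting
data = data.sample(frac=1).reset_index(drop=True)

# Encode the string labels as integer class codes
Y_encoded = data['label'].map(label_encode)
X = data.drop(["label"], axis=1)

train_X, test_X, train_y, test_y = train_test_split(X, Y_encoded, test_size=0.2)

# Fit the scaler on the training split only, then reuse its statistics for the test split
sc = StandardScaler()
X_train = sc.fit_transform(train_X)
X_test = sc.transform(test_X)
# print(X_train)
# print(X_test)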
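
A common convention is to compute the scaling statistics from the training split only and reuse them for the test split, so that no information from the test data leaks into preprocessing. A minimal sketch on synthetic data (the arrays and the random seed are assumptions for illustration, not part of the Iris workflow):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for a train/test split with four features
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(120, 4))
test = rng.normal(loc=5.0, scale=2.0, size=(30, 4))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and scale from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# The training split is centred exactly; the test split only approximately
print(train_scaled.mean(axis=0))
print(test_scaled.mean(axis=0))
```

Calling `fit_transform` on the test split instead would rescale it with its own mean and variance, so the model would see test features on a different scale than the one it was trained on.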


## Build a Model

The model chosen from the scikit-learn classifiers is the widely used logistic regression model. It takes the features and labels as input for training and returns either the predicted label or, if requested, the class probabilities as output.


In [None]:
# Train the logistic regression model
model = LogisticRegression().fit(X_train, train_y)
model
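
The difference between predicted labels and predicted probabilities can be sketched as follows. To stay self-contained, this example uses the copy of Iris bundled with scikit-learn (`load_iris`) rather than the downloaded file, skips scaling, and raises `max_iter` so the solver converges on unscaled features; all three are assumptions of the sketch. Note that `load_iris` uses its own integer class codes, which differ from this notebook's `label_encode`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Bundled copy of Iris; class codes here follow load_iris, not label_encode
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

labels = clf.predict(X[:3])        # one class code per row
probs = clf.predict_proba(X[:3])   # one probability per class per row

# Each row of probabilities sums to 1, and the most probable class is the prediction
print(labels)
print(np.round(probs, 3))
```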

## Fit the Model

The `.fit()` call in the previous cell already trained the model. Scikit-learn makes fitting and evaluating the model straightforward: there are no callbacks or hooks to configure, and the default settings are sufficient here.
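
Evaluation on a held-out split can be sketched with `metrics.accuracy_score`. This example is a self-contained approximation of the notebook's workflow: it uses the bundled `load_iris` data instead of the downloaded file, and the fixed `random_state` and `stratify` arguments are assumptions added for reproducibility.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Scale with training statistics only, then fit and score
scaler = StandardScaler().fit(X_tr)
clf = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
accuracy = metrics.accuracy_score(y_te, clf.predict(scaler.transform(X_te)))
print(f"Test accuracy: {accuracy:.3f}")
```

In the notebook itself, the equivalent call would compare `test_y` against `model.predict(X_test)`.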


## Save the Trained Model

We use Joblib to save the model and then load it for prediction.


In [None]:
# Use Joblib to save the model
# See the scikit-learn documentation: https://scikit-learn.org/stable/model_persistence.html
joblib.dump(model, "model.joblib")
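
The saved file contains only the classifier, so anything that loads it later also needs the fitted scaler. One option, shown here as a sketch rather than what this notebook does, is to wrap both steps in a scikit-learn `Pipeline` and persist that single object. The example uses the bundled `load_iris` data and a temporary directory, both of which are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaler and classifier travel together as one estimator
pipeline = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.joblib")
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw features and reproduces the original predictions
same_predictions = np.array_equal(pipeline.predict(X), restored.predict(X))
print(same_predictions)
```

As the scikit-learn persistence documentation notes, a joblib file should be loaded with the same library versions that created it and only from a trusted source.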

## Explore Results

In [None]:
# Load the model using joblib
loaded_model = joblib.load("model.joblib")

# Predict on the scaled test features and decode the class codes back to labels
result = loaded_model.predict(X_test)
results = ' | '.join([label_decode[t] for t in result])
results

All done!
