
<a href="https://colab.research.google.com/github/bpesquet/machine-learning-katas/blob/master/classic-datasets/Iris.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open in Google Colaboratory"></a>

# Guided ML With The Iris Dataset

| Learning type | Activity type | Objective |
| - | - | - |
| Supervised | Multiclass classification | Identify a flower's class |

Contents:
1. Loading the data
2. Setting up supervised learning problem (selecting features)
3. Creating a first model
    - Creating train and test datasets
    - Normalizing train and test
    - Fitting and predicting
4. Evaluate the first model's predictions
5. Crossvalidation of the model
6. Creating an end to end ML pipeline
    - Train/Test Split
    - Normalize
    - Crossvalidations
    - Model
    - fitting and predicting

## Instructions with NBGrader removed

Complete the cells beginning with `# YOUR CODE HERE` and run the subsequent cells to check your code.

## About the dataset

[Iris](https://archive.ics.uci.edu/ml/datasets/iris) is a well-known multiclass dataset. It contains 3 classes of flowers with 50 examples each. There are a total of 4 features for each flower.

![](./classic-datasets/images/Iris-versicolor-21_1.jpg)

## Package setups

1. Run the following two cells to initialize the required libraries.

In [29]:
#to debug package errors
import sys
sys.path
sys.executable

'/home/klarakhalo/anaconda3/bin/python'

In [2]:
# Import needed packages
# You may add or remove packages should you need them
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn import datasets
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.pipeline import make_pipeline

# Set random seed
np.random.seed(0)

# Display plots inline and change plot resolution to retina
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Set Seaborn aesthetic parameters to defaults
sns.set()

## Step 1: Loading the data

1. Load the iris dataset using ```datasets.load_iris()```
2. Investigate the data structure with ```.keys()```
3. Construct a dataframe from the dataset
4. Create a 'target' and a 'class' column that contains the target names and values
5. Display a random sample of the dataframe 
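
Steps 3 and 4 can also be done without reading a local CSV file, by building the dataframe straight from the `load_iris()` result. The following is a minimal sketch under that assumption; the variable name `df_sketch` and the column names `target` and `class` are illustrative choices, not names the exercise requires.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Feature matrix (150 x 4) with the dataset's own feature names as columns
df_sketch = pd.DataFrame(iris.data, columns=iris.feature_names)

# Integer labels 0, 1, 2
df_sketch["target"] = iris.target

# Map each integer label to its name by indexing the target_names array
df_sketch["class"] = iris.target_names[iris.target]

# Random sample; random_state is set only to make the sample repeatable
print(df_sketch.sample(5, random_state=0))
```

Indexing `iris.target_names` with the integer label array is one compact way to get the class names; a dictionary lookup with `.map()` would work as well.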

In [52]:
#Your code here.
iris = datasets.load_iris()
print(iris.keys())

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])


In [53]:
iris["target_names"]

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [54]:
iris["feature_names"]

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [55]:
#your code here
iris_df = pd.read_csv('../data/iris.data', header=None)
iris_df["target"] = iris["target"]
iris_df.columns = ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width", "Class", "Target"]
iris_df.sample(10)

| | Sepal Length | Sepal Width | Petal Length | Petal Width | Class | Target |
| - | - | - | - | - | - | - |
| 92 | 5.8 | 2.6 | 4.0 | 1.2 | Iris-versicolor | 1 |
| 141 | 6.9 | 3.1 | 5.1 | 2.3 | Iris-virginica | 2 |
| 130 | 7.4 | 2.8 | 6.1 | 1.9 | Iris-virginica | 2 |
| 119 | 6.0 | 2.2 | 5.0 | 1.5 | Iris-virginica | 2 |
| 48 | 5.3 | 3.7 | 1.5 | 0.2 | Iris-setosa | 0 |
| 143 | 6.8 | 3.2 | 5.9 | 2.3 | Iris-virginica | 2 |
| 122 | 7.7 | 2.8 | 6.7 | 2.0 | Iris-virginica | 2 |
| 63 | 6.1 | 2.9 | 4.7 | 1.4 | Iris-versicolor | 1 |
| 26 | 5.0 | 3.4 | 1.6 | 0.4 | Iris-setosa | 0 |
| 64 | 5.6 | 2.9 | 3.6 | 1.3 | Iris-versicolor | 1 |


### Question
Find the X and y values we're looking for. Notice that y is categorical, so we could **one-hot encode it** if we use **class**, or we can simply pick **target**. To one-hot encode, we would expand `y` into one indicator column per class using the pandas **get_dummies** function.
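
To make the one-hot idea concrete, here is a small sketch using `pd.get_dummies` on a hand-made label column. The four label values are made up for illustration and only mirror the format of the Class column; the names `classes` and `one_hot` are hypothetical.

```python
import pandas as pd

# A tiny hand-made label column (hypothetical values, mirroring the Class column)
classes = pd.Series(
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"],
    name="Class",
)

# One indicator column per distinct class; each row has exactly one "on" entry
one_hot = pd.get_dummies(classes)
print(one_hot)
```

Each row of `one_hot` marks exactly one class, which is why a single integer column such as target carries the same information in a more compact form.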

#### For the purpose of this exercise, do not use one-hot encoding. Use only target, but think about whether you have to drop it somewhere or not...

In [56]:
# YOUR CODE HERE
X = iris_df.drop("Target", axis=1)
y = iris_df["Target"]
X.head()

| | Sepal Length | Sepal Width | Petal Length | Petal Width | Class |
| - | - | - | - | - | - |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [57]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: Target, dtype: int64

In [58]:
type(y)

pandas.core.series.Series

In [59]:
type(X)

pandas.core.frame.DataFrame

## Step 2: Setting up supervised learning problem (selecting features)

Feature selection is an essential step in improving a model's performance. In the first version of the model we will use the 'sepal length' and 'sepal width' as predicting features. Later we will see the effect of adding additional features.

1. Assign the values of the 'target' to Y as a numpy array
2. Assign the remaining feature values to X as a numpy array
3. Check the shape of X and Y. Check the first few values.
    - Can we confirm our X and Y are created correctly?

In [60]:
#your code here
Y = np.array(y.values)
Y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [61]:
print(Y.shape)
Y[:5]

(150,)


array([0, 0, 0, 0, 0])

In [62]:
X = X.drop(["Petal Length", "Petal Width", "Class"], axis=1)

In [63]:
X = X.values

In [64]:
print(X.shape)
X[:5]

(150, 2)


array([[5.1, 3.5],
       [4.9, 3. ],
       [4.7, 3.2],
       [4.6, 3.1],
       [5. , 3.6]])

## Step 3: Creating the first model

In the lecture we learned about creating train and test datasets, normalizing, and fitting a model. In this step we will see how to build a simple version of this.

We have to be careful when constructing our train and test datasets. First, we must make sure that the same datapoints always end up in each set. Otherwise our results won't be reproducible, or we might introduce a bias into our model.

We also need to be attentive to when we normalize the data. What would be the effect of normalizing the data (i.e. with StandardScaler, which rescales each feature to zero mean and unit variance) before we create our train and test sets? Effectively we would use information in the test set to structure the values in the training set and vice versa. ***Therefore the preferred method is to split first, fit the scaler on the training set only, and then apply that same transformation to both train and test.***
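
One common way to keep test-set information out of the scaling step is to learn the scaling statistics from the training rows only and reuse them for the test rows. The sketch below illustrates this on made-up one-feature data (not the iris measurements); the names `train`, `test`, and `scaler_sketch` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy one-feature data (made-up numbers, not the iris measurements)
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[5.0], [6.0]])

# Learn mean and standard deviation from the training rows only
scaler_sketch = StandardScaler().fit(train)

# Apply the same training-set statistics to both sets
train_scaled = scaler_sketch.transform(train)
test_scaled = scaler_sketch.transform(test)

print(scaler_sketch.mean_)   # mean learned from train only
print(train_scaled.mean())   # approximately 0 by construction
print(test_scaled.ravel())   # test values expressed in training-set units
```

The scaled test values are not centered on zero here, and that is expected: they are measured against the training distribution, which is the only information the model is allowed to have seen.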

1. Create X_train, X_test, Y_train, Y_test using ```train_test_split()``` with an 80/20 train/test split. Look in the SKLearn documentation to understand how the function works.
    - Inspect the first few rows of X_train.
    - Run the cell a few times. Do the first few rows change?
    - What option can we use in ```train_test_split()``` to stop this from happening?
2. Normalize the train and test datasets with ```StandardScaler```
    - Fit the scaler with ```.fit()``` and apply it with ```.transform()```. Look in the documentation for an example of how to do this.
    - Does it make sense to normalize Y_train and Y_test?
3. Initialize a ```LogisticRegression()``` model and use the ```.fit()``` method to train the first model.
    - We will pass the X_train and Y_train variables to the ```.fit()``` method.
    - Once the model is fit, use ```.predict()``` with X_test and save the output as predictions.
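
The question about rows changing between runs can be explored with a short sketch. It assumes toy data (`X_toy`, `y_toy`, made-up values) and shows that passing the same `random_state` to `train_test_split()` yields the same split on every call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 columns (made-up values)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Same random_state -> identical split on every call
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(np.array_equal(a_train, b_train))  # True
print(a_train.shape, a_test.shape)       # (8, 2) (2, 2)
```

The particular seed value has no special meaning; any fixed integer gives a repeatable split.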

In [65]:
#split train and test data 80/20
#your code here
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.8, test_size=0.2, random_state=42)
X_train[:10]

array([[4.6, 3.6],
       [5.7, 4.4],
       [6.7, 3.1],
       [4.8, 3.4],
       [4.4, 3.2],
       [6.3, 2.5],
       [6.4, 3.2],
       [5.2, 3.5],
       [5. , 3.6],
       [5.2, 4.1]])

In [66]:
from sklearn.preprocessing import StandardScaler

In [67]:
scaler = StandardScaler().fit(X_train)

In [68]:
X_train = scaler.transform(X_train) 
X_test = scaler.transform(X_test)

In [69]:
#initialize and fit with Logistic Regression
#your code here
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression()
clf = logisticRegr.fit(X_train, y_train)

In [70]:
predictions = clf.predict(X_test)
predictions

array([1, 0, 2, 1, 2, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 2, 2, 1, 1, 2, 0, 1,
       0, 2, 2, 2, 2, 2, 0, 0])

## Step 4: Evaluate the first model's predictions

We will learn more about how to evaluate the performance of a classifier in later lessons. For now we will use accuracy (the percentage of correct predictions) as our metric. It is important to know that this metric only summarizes the overall performance of our model; it does not tell us, for example, where we can improve it or where it already performs well.

1. Use ```.score()``` to evaluate the performance of our first model.
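
For a scikit-learn classifier, `.score()` returns the mean accuracy, that is, the fraction of predictions that match the true labels. The sketch below computes that fraction by hand on made-up labels (`y_true` and `y_pred` are hypothetical, not outputs of the model in this notebook).

```python
import numpy as np

# Made-up true labels and predictions for six flowers
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

# Accuracy = number of correct predictions / total number of predictions
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # 5 of 6 predictions are correct
```

Note that the single number hides which class was misclassified, which is the limitation of accuracy mentioned above.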

In [71]:
#evaluating the performance of our first model
#your code here
print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))

0.7916666666666666
0.9


## Step 5: Question your results. 
What accuracy did you achieve? Is it 70% or 90%? Anything above 70% is a good fit for our first result. How do we know it is reproducible? **If we run the model again and our performance is 85%, which one is correct?** And what about improving our model?

## However ...
There is one crucial mistake that has been made in the exercise above, even if we achieved great results. Can you spot it? You can go back to the lecture slides for inspiration.

*Type your answer here...*
***
***No idea***

## Optional:
Repeat the cells you need to change in the exercise and run the classifier again. What is the new accuracy and why is this better?

## Step 5: Crossvalidation of the model

Our first model achieved ~90% accuracy. This is quite good. How do we know it is reproducible? If we run the model again and our performance is 85%, which is correct? And what about improving our model? Can you think of one thing we can do to potentially improve the model?

### Crossvalidation

Crossvalidation is when we split the data into multiple train and validation subsets (folds). On each split we fit the model on the training part and score it on the held-out part. We then average the results and return a 'crossvalidated' accuracy.

1. Initialize a new version of the model you trained above with the same parameters.
2. Use ```cross_validate()``` to run the model with 5 crossvalidation folds.
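
To see what `cross_validate()` does under the hood, the sketch below writes out the fold loop by hand and compares it with the library call. It is an illustration under stated assumptions: it loads iris directly with `load_iris(return_X_y=True)`, uses the two sepal features without scaling, sets `max_iter=1000` as a precaution against convergence warnings, and uses hypothetical names such as `X_two` and `manual_scores`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

X_all, y_all = load_iris(return_X_y=True)
X_two = X_all[:, :2]  # sepal length and sepal width only

# Five folds, each keeping roughly the same class proportions
splitter = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in splitter.split(X_two, y_all):
    # Fit a fresh model on four folds, score it on the held-out fold
    model = LogisticRegression(max_iter=1000)
    model.fit(X_two[train_idx], y_all[train_idx])
    manual_scores.append(model.score(X_two[val_idx], y_all[val_idx]))

# The library call performs the same loop when given the same splitter
auto = cross_validate(LogisticRegression(max_iter=1000), X_two, y_all, cv=splitter)

print(np.mean(manual_scores), auto["test_score"].mean())
```

Passing the same splitter object to both versions makes the folds identical, so the two sets of scores can be compared directly.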



In [72]:
#your code here
#model with cross validation
#your code here

clf_cv = LogisticRegression(solver='lbfgs', multi_class='multinomial').fit(X_train, y_train)

#cross validate the training set
cv = cross_validate(clf_cv, X_train, y_train, cv=5)

def print_scores(cv):
    #print out cross validation scores
    [print('Crossvalidation fold: {}  Accruacy: {}'.format(n, score)) for n, score in enumerate(cv['test_score'])]
    #print out the mean of the cross validation
    print('Mean train cross validation score {}'.format(cv['test_score'].mean()))
    
print_scores(cv)

Crossvalidation fold: 0  Accruacy: 0.7083333333333334
Crossvalidation fold: 1  Accruacy: 0.75
Crossvalidation fold: 2  Accruacy: 0.875
Crossvalidation fold: 3  Accruacy: 0.8333333333333334
Crossvalidation fold: 4  Accruacy: 0.75
Mean train cross validation score 0.7833333333333334


## Step 6: Creating an end to end ML pipeline

Congratulations, you've trained, crossvalidated, predicted, and evaluated your first classifier. Now that you understand the basic steps, we will look at a way to combine all these steps together.

Before we go further, think about what you would have to do if you wanted to change the model: initialize a new model, change the variables, redo the cross validation, etc. Seems like a lot. And when we have to change lots of code it is easy to make mistakes. And what if you wanted to try many models and see which one performed best? Or try changing many different features? How could you do it without writing each one out as we have?

The solution is to use SKLearn's pipeline class. A pipeline is an object that will execute the various steps in the machine learning process. We can choose which elements we want in the pipeline and which we do not. Once set up, we can rapidly change models or input data and have it return our results in an ordered way.

1. Initialize a scaler and a classifier object like we did previously.
2. Use the ```make_pipeline()``` function to construct a transformation pipeline for the scaler and the classifier.
3. Input the pipeline object to ```cross_validate()``` and evaluate with 5 folds.
4. Print out your results (hint: make a function for repetitive tasks like printing).
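
A detail worth knowing before tuning a pipeline is how its steps are named. The sketch below (with the hypothetical variable name `pipe_sketch`) shows that `make_pipeline()` names each step after its lowercased class name, and that a step's parameter is addressed as `<step name>__<parameter>`.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe_sketch = make_pipeline(StandardScaler(), LogisticRegression())

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in pipe_sketch.steps]
print(step_names)  # ['standardscaler', 'logisticregression']

# Parameters of a step are addressed as <step name>__<parameter>
pipe_sketch.set_params(logisticregression__C=10)
print(pipe_sketch.named_steps["logisticregression"].C)  # 10
```

The double-underscore naming is what allows a parameter grid to reach inside a pipeline step.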


In [73]:
#define the scaler
scaler = StandardScaler()
#define the classifier
classifier = LogisticRegression(solver='lbfgs', multi_class='multinomial')
#make the pipeline
pipe = make_pipeline(scaler, classifier)
#run the cross validation
scores = cross_validate(pipe, X, Y, cv=5)
#print results
print_scores(scores)

Crossvalidation fold: 0  Accruacy: 0.7333333333333333
Crossvalidation fold: 1  Accruacy: 0.8333333333333334
Crossvalidation fold: 2  Accruacy: 0.7666666666666667
Crossvalidation fold: 3  Accruacy: 0.8666666666666667
Crossvalidation fold: 4  Accruacy: 0.8666666666666667
Mean train cross validation score 0.8133333333333332


## Challenge Exercise

In this notebook we only used two features to predict the class of the flower. We also did not do any hyperparameter tuning. The challenge is to improve the prediction results. Some ideas we can try:

1. Add features to the input and run the cross validation pipeline.
2. Investigate how to use GridSearchCV, a powerful tool that searches through hyperparameters and does cross validation.
    - Hint: Input the pipeline directly into GridSearchCV
3. Try a different model like RandomForest or SVM.
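
As a starting point for the first two ideas, the following sketch feeds a scaler-plus-classifier pipeline into `GridSearchCV` using all four features. It is only one possible setup: the grid over `C` is an illustrative choice, `max_iter=1000` is a precaution against convergence warnings, and the names `pipe_sketch`, `grid`, and `search` are hypothetical. For brevity it fits on the full dataset; in practice you would fit on the training split and keep the test split for a final check.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_iris(return_X_y=True)  # all four features

pipe_sketch = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Candidate values for the regularization strength C (illustrative grid).
# The key is <step name>__<parameter>, with the step name from make_pipeline.
grid = {"logisticregression__C": [0.1, 1, 10]}

# 5-fold cross validation for every candidate in the grid
search = GridSearchCV(pipe_sketch, grid, cv=5)
search.fit(X_all, y_all)

print(search.best_params_)
print(search.best_score_)
```

Because the scaler sits inside the pipeline, it is refit on the training part of every fold, so no validation data leaks into the scaling step during the search.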



In [74]:
#your challenge code here
from sklearn.model_selection import GridSearchCV

In [75]:
params = {'logisticregression__penalty' : ['l2', 'none'],
         'logisticregression__C' : [0.1, 1, 10]}


In [76]:
gscv = GridSearchCV(pipe, params, cv=5, verbose=0)


In [None]:
gscv.fit(X_train, y_train)
results = gscv.cv_results_


In [79]:
results

{'mean_fit_time': array([0.00866446, 0.01328249, 0.00628061, 0.01248856, 0.00844049,
        0.01307125]),
 'std_fit_time': array([0.00153262, 0.0007402 , 0.00018595, 0.00053145, 0.00091657,
        0.0006715 ]),
 'mean_score_time': array([0.00072889, 0.00047536, 0.00044737, 0.00044594, 0.0006691 ,
        0.00053806]),
 'std_score_time': array([1.96415552e-04, 3.16288643e-05, 2.40595417e-05, 2.38249719e-05,
        2.61481889e-04, 9.46295805e-05]),
 'param_logisticregression__C': masked_array(data=[0.1, 0.1, 1, 1, 10, 10],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_logisticregression__penalty': masked_array(data=['l2', 'none', 'l2', 'none', 'l2', 'none'],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'logisticregression__C': 0.1,
   'logisticregression__penalty': 'l2'},
  {'logisticregression__C': 0.1, 'logisticregression__pe