# Machine Learning Workflow: Iris

A typical machine learning workflow:

1. Dataset Curation
2. Dataset Preprocessing
3. Dataset Provisioning
4. Training Configuration
5. Model Training
6. Evaluation
7. Iterative Optimization

We will look at the above workflow based on a commonly used machine learning dataset called the *Iris* Dataset.

The Iris Dataset is relatively small and has the following characteristics:

* 4 Features (Attributes)
* 150 Samples (Instances, Rows)

You can read more about the dataset here: https://archive.ics.uci.edu/ml/datasets/iris/.

## [Step 1] Dataset Curation

Luckily, in this case the dataset curation has already taken place: the dataset was published by statistician and biologist Ronald Fisher back in 1936.

In case the data has not already been assembled - which is often the case in real-world applications - the dataset curation step and all the manual efforts in collecting samples and annotation can potentially take much longer than all other steps combined.

## [Step 2] Dataset Preprocessing

The easiest way to load the Iris dataset is to use the built-in functionality of *scikit-learn*.

You can load the Iris dataset with the following commands:

``from sklearn import datasets``

``iris = datasets.load_iris()``
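
For reference, here is a minimal sketch of what the built-in loader returns. It assumes scikit-learn is installed; the attribute names used (`data`, `target`, `feature_names`, `target_names`) are those of the object returned by `load_iris()`.

```python
from sklearn import datasets

# load_iris() returns a dictionary-like object that also supports attribute access
iris = datasets.load_iris()

print(iris.data.shape)      # feature matrix: (150, 4)
print(iris.target.shape)    # one integer-encoded class label per sample: (150,)
print(iris.feature_names)   # names of the 4 measurements
print(iris.target_names)    # names of the Iris classes
```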

However, to familiarize ourselves with the structure of a dataset, and to get to know the tooling we need for loading our own datasets later, we will load the dataset ourselves.



### Downloading the Data

The data for the Iris dataset can be downloaded from:

https://archive.ics.uci.edu/ml/machine-learning-databases/iris/
    
The web directory contains two files that are important for us:

* iris.data
* iris.names

On a Linux or Mac OS machine you can use the following commands to download the files to a local directory.


In [None]:
# wget is a handy command line utility that allows downloading the specified URL:

# - iris.data contains the raw data
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
    
# - Optional: iris.names contains a description of the dataset
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names

### Inspecting the Iris Dataset Format

Using the command line (or a text editor) we can inspect the dataset.

The `!` operator will allow you to execute command line commands from a Jupyter cell. 
This should work on all supported operating systems (Mac OS, Linux, Windows).

On a Mac or Linux machine you can make use of the following command line commands:

* `head` : Show top n lines of a text file
* `tail` : Show last n lines of a text file
* `cat`  : Print full content of a text file
* `wc -l`: Count number of lines of a text file

On a Windows machine the following should work:

* `more` : Show content of a text file (might hang in Jupyter)
* `type` : Print content of a text file 
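
If none of these commands are available, the same inspection can be done in plain Python on any operating system. The following is only a sketch: the helper names `head` and `count_lines` are made up for illustration, and the demo writes a small stand-in file with three sample rows so that it runs without `iris.data` being present.

```python
from itertools import islice
from pathlib import Path
import tempfile


def head(path, n=5):
    """Return the first n lines of a text file (similar to `head -n`)."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


def count_lines(path):
    """Count the lines of a text file (similar to `wc -l`)."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


# Stand-in file with three sample rows; replace demo_path with "iris.data" in the notebook
demo_path = Path(tempfile.mkdtemp()) / "demo.data"
demo_path.write_text(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n",
    encoding="utf-8",
)

first_two = head(demo_path, n=2)
total = count_lines(demo_path)
print(first_two)
print(total)
```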

In [None]:
# head and tail are command line utilities on Linux and Mac OS that show the first n or last
# n lines of a text file.

# Take your time to inspect both files with the head and tail commands. If you know that a file is not too long you can
# also use the cat command, which prints the entire contents of a file. For large files this is not advised, as it can
# easily overwhelm the JavaScript-based rendering in the browser and cause it to crash.

!tail -n 150 iris.data

As we can see the Iris dataset is a `CSV` (Comma Separated Values) file.

Each row contains 5 values: 4 `double` values and a `string`.
The 4 `double` values represent the 4 `features` (in this case measurements of the plant).
The `string` value represents the `class` (type) of plant.
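
To make the row structure concrete, here is a small sketch that parses two sample rows with Python's built-in `csv` module. The in-memory `sample` string stands in for the file, and the variable names are chosen for illustration only.

```python
import csv
import io

# Two sample rows in the same format as iris.data
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n5.9,3.0,5.1,1.8,Iris-virginica\n"

rows = []
for fields in csv.reader(io.StringIO(sample)):
    if not fields:
        continue  # skip empty lines
    # The first 4 fields are the features, the last field is the class
    *measurements, label = fields
    rows.append(([float(value) for value in measurements], label))

for features, label in rows:
    print(features, label)
```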

## [Step 3] Provision: Loading the Dataset into a Dataframe

Provisioning data means making data available in the expected format of a library or application program.
In our case we will be using the **scikit-learn** library and the machine learning algorithms that are available as part of it. 

### Pandas

Pandas is a library for working with dataframes and for converting them from and to other data formats.

We will start off with the following initial commands:

* `import pandas as pd` : Importing as `pd` is a common convention
* `pd.read_csv()` : Read CSV input into a dataframe

In order to understand how to use the read_csv() method we can use the `?` operator as shown in the cell below.

This will pull up documentation for the method parameters, the return type, and example usage (the latter at the end of the documentation).

In [None]:
import pandas as pd

# Show the documentation of read_csv (the trailing ? is IPython/Jupyter syntax)
pd.read_csv?

### Exercise: Load the iris.data CSV file with pandas

Take a look at the documentation of the read_csv() method and load the `iris.data` file with pandas.
In case you are missing the pandas package, you can install it from within the notebook with `!conda install -y pandas`.
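
As a hint, here is a sketch of how `read_csv()` behaves for a file without a header row, using a three-row in-memory sample instead of the real file. The `header` and `names` parameters are part of `read_csv()`. The column names are an assumption made for this example: only `sepal_length` and `class` appear elsewhere in this notebook.

```python
import io

import pandas as pd

# Three sample rows in the same format as iris.data (no header row)
sample = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)

# Assumed column names, chosen for illustration
column_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

# header=None tells pandas that the first line is data, not column names
sample_dataframe = pd.read_csv(io.StringIO(sample), header=None, names=column_names)

print(sample_dataframe.shape)
print(sample_dataframe.head())
```

Without `header=None`, pandas would treat the first data row as the column headings and the dataframe would be one row short.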

In [None]:
# load the iris.data CSV file into a dataframe called iris_dataframe

iris_dataframe = pd.read_csv(...)

In [None]:
# You can use the head() method to inspect the loaded dataframe.
# Your result should look exactly as shown below.
# If your result looks different, please have a look at the documentation of the parameters of the read_csv method
# and load the data again.

iris_dataframe.head()

In [None]:
# Assign the columns more descriptive names for future reference

iris_dataframe.columns = [...]

***Exercise***: Inspect the iris_dataframe with the shape, describe(), and len() attributes and methods.

In [None]:
# Inspect the iris_dataframe

iris_dataframe.describe()

### Exercise: Create an Input and Response Dataframe

If we want to train a supervised machine learning model based on the Iris dataset we have to split our original iris_dataframe into an Input and a Response dataframe.

* The `Input` dataframe contains the `features` that are the input for the learning and decision making of the machine learning model.
* The `Response` (a.k.a. `Target`) dataframe contains the correct expected values (a.k.a. answers) that the system is supposed to learn.

***Iris Classifier***

If we want to train a classifier (a machine learning model that predicts the class/type of Iris flower based on the 4 measurements) then the `Input` and `Response` dataframes would consist of the following:

* `Input`: Each row of the dataframe consists of the 4 measurement values
* `Response` : Each row of the dataframe consists of the `class` of the flower 

To create these two dataframes we can use the functionality provided by pandas:

* https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#basics holds the full documentation but is not very easy to read

The main pieces of functionality we will make use of are:

***Selecting columns by label:***

We can use `[]` brackets to select a subset of a dataframe. To select a set of columns we can use a list of labels (column headings).

E.g. `sample_dataframe[["sepal_length", "class"]]` would give us back a dataframe consisting of the columns with the ***names*** `sepal_length` and `class`.

Use `sample_dataframe.columns` to get the available columns and their labels listed. 
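
A minimal sketch of column selection on a small, hand-made dataframe (the values and the `sepal_width` column name are chosen for illustration):

```python
import pandas as pd

sample_dataframe = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0],
    "sepal_width": [3.5, 3.0, 3.2],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor"],
})

# A list of labels returns a dataframe with just those columns
subset = sample_dataframe[["sepal_length", "class"]]

# A single label (without the inner list) returns a Series instead
single_column = sample_dataframe["class"]

print(list(sample_dataframe.columns))
print(subset.shape)
print(type(single_column).__name__)
```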


***Slicing subsets by row:***

This can be achieved by using the `[start:end]` operator.
It is a pretty simple way to select rows.

* `sample_dataframe[0:5]` : Select rows 0, 1, 2, 3, 4 of sample_dataframe (***end*** is not inclusive)
* `sample_dataframe[start:]` : Select everything from start until the end of the dataframe
* `sample_dataframe[:end]` : Select everything from the beginning up to (but not including) index end
* `sample_dataframe[:]` : Select the full dataframe (all rows)

This notation is not limited to dataframes. It works with all lists in Python. For a list, the use of this operator will result in a new copy of the list.

Better documentation of the slice operator is provided here: https://stackoverflow.com/questions/509211/understanding-slice-notation?page=1&tab=votes#tab-top
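
The slicing rules above can be tried out on a plain Python list and on a small dataframe. This is a sketch with made-up values:

```python
import pandas as pd

values = [10, 20, 30, 40, 50]

first_three = values[0:3]   # indices 0, 1, 2 (end is not inclusive)
from_two = values[2:]       # index 2 until the end
up_to_two = values[:2]      # indices 0 and 1
copy_of_values = values[:]  # a new list with the same elements

# Changing the copy does not change the original list
copy_of_values.append(60)

sample_dataframe = pd.DataFrame({"sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4]})
first_five = sample_dataframe[0:5]  # rows 0 to 4

print(first_three, from_two, up_to_two)
print(len(values), len(copy_of_values))
print(len(first_five))
```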

Based on the above documentation create the `Input` and `Response` dataframes in the cell below.

In [None]:
# There are different naming conventions people use for these dataframes
# input, response or X,Y are common. input, target is also commonly used.
iris_dataframe_input = 
iris_dataframe_response = 

***Exercise: Inspect the newly created dataframes***

Use the tooling we introduced before in order to inspect your newly created dataframes. 

* `head()`
* `shape`
* `len()`
* `describe()`


In [None]:
# Inspect your dataframes with the above tools in order to get familiar with them.
# Inspecting the intermediate artifacts in the machine learning workflow is a common and crucial task.
# It is easy to make a mistake when sub-setting or slicing the input data.
# These kinds of errors are usually disastrous for the outcome of the trained model. The earlier we catch them,
# the less expensive they are to fix.


## [Step 4] Training Configuration

The next step consists of creating the configuration for the training.

The main dependencies for choosing a training setup are:

* The data used for training (data type, quality, amount)
* The task we want to solve (what we want the machine learning system to achieve)

Based on these two aspects, designing the training setup consists of the following steps:

1. Choose training algorithm
2. Create initial configuration for training algorithm

### Criteria for Choosing A ML Algorithm

Some main criteria for choosing a training algorithm are the following:

* Task Fit: Can the task we want to solve with ML be solved with the given algorithm?
* Scalability: How well does the algorithm scale with the shape (columns, rows) of the input data?
    * The number of features has a major impact on the scalability of algorithms
    * The number of samples (rows) has a major impact on the execution time of the algorithm
* Expected Performance: What is the expected accuracy of the algorithm?
* Interpretability: How easy or hard is it to understand what is happening in the algorithm? How hard would it be to 'debug' its behaviour?
* Updatable Learning: Can the learned model be updated with more data at a later stage?
* Availability: In the pragmatic sense, is a trustworthy implementation of the algorithm available (also from a license perspective)?
* Solution Requirements: Do we have requirements from the software solution side, e.g. maximum latency or memory limitations?

As the above list highlights, choosing the 'right' algorithm is a complex task with many potential considerations.
On the flip side it means that making the right choices has massive potential value. 

### Choosing an Initial Configuration

The choice of an initial training configuration often depends mainly on:

* Stats of the training data 
     * Hyperparameters often allow us to adjust the training to the amount of the training data
* Experience or documented well working configurations 
     * This is often based on identifying `baselines` that worked well on data that we deem similar to our training data.
     

### Classification Task

One machine learning task that fits well to the Iris dataset is `classification`.
`Classification` is the task of assigning a `class` (type) to samples based on the input features. 
For the Iris dataset that translates to using the features (4 measurements) as input for taking the decision (to classify) which type of Iris plant it is. 

### Classification Algorithm

Let's say we aim to classify the type of Iris based on the input.

This is the `task` that we want to solve.

The choice of `task` informs the choice of our machine learning algorithm.
In this case we choose a `classification algorithm`, also called `classifier`.

A commonly used classification algorithm is called `logistic regression`.
For our initial classification experiments we will make use of this algorithm type.

We can set up this algorithm with the following steps:

* Import: `from sklearn.linear_model import LogisticRegression`
* Instantiate Model Class: `LogisticRegression()`
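
The initial configuration is passed to the constructor as keyword arguments (the hyperparameters). Here is a short sketch; the variable names and the value `max_iter=200` are arbitrary choices for illustration, not recommendations.

```python
from sklearn.linear_model import LogisticRegression

# Default configuration
default_model = LogisticRegression()

# Same algorithm with one hyperparameter changed
adjusted_model = LogisticRegression(max_iter=200)

# get_params() lists the configuration a model was created with
print(default_model.get_params()["max_iter"])
print(adjusted_model.get_params()["max_iter"])
```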



In [None]:
# Exercise: Set up the logistic regression model

from sklearn.linear_model import LogisticRegression

classifier = 

## [Step 5] Model Training Run

Algorithms in scikit-learn are trained by calling the `fit` method. The name refers to the process of `fitting` the model's weights (also called model parameters) during training.
`Fitting` means that the weights of the model are adjusted during the training (a.k.a. learning) phase based on the input data the model has seen.
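
To see that `fit` really creates the weights, the sketch below inspects a logistic regression model before and after training. It uses scikit-learn's built-in copy of the Iris data as a stand-in for the dataframes built in this notebook, and `max_iter=1000` is an arbitrary value chosen to give the solver enough iterations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)
has_weights_before = hasattr(model, "coef_")   # no weights exist before training

fitted = model.fit(X, y)   # fit() returns the model object itself

print(has_weights_before)
print(fitted is model)
print(model.coef_.shape)        # one row of weights per class, one column per feature
print(model.intercept_.shape)   # one intercept per class
```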



### Training Run

Training a model based on the input data is often referred to as executing or making a `training run`; or simply just a `run`. 

Common ways to use this terminology are e.g.:

* "I have made a run with the following input data and these hyperparameters."
* "Which parameters (meaning hyperparameters) were used for the run?"
* "What was the best run?"

Running training in scikit-learn can be based on executing the following:

In [None]:
trained_model = classifier.fit(iris_dataframe_input.values, iris_dataframe_response.values.ravel())

### Testing our trained model

For a first easy test we can use the `predict` method as shown below.
The predict method allows us to pass in data with the same format as the input data.

As an exercise try passing in data with a different format. 
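
One case worth knowing is a single sample passed as a flat list. The sketch below shows what to expect, again using scikit-learn's built-in copy of the Iris data as a stand-in: `predict` expects a two-dimensional input (a list of samples) and raises a `ValueError` for a one-dimensional one.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

single_sample = [5.1, 3.5, 1.4, 0.2]

# A flat list is one-dimensional and is rejected
try:
    model.predict(single_sample)
    error_message = None
except ValueError as error:
    error_message = str(error)

# Wrapping the sample in another list makes it a list of one sample
prediction = model.predict([single_sample])

print(error_message is not None)
print(prediction.shape)
```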

In [None]:
trained_model.predict([[0.5,5.0,2.0,1.0]])

You can also use the slicing approach we introduced earlier to pass in a subset of the rows of our input data to test the model.

In [None]:
# .values converts the rows to the same NumPy format that was used for training
trained_model.predict(iris_dataframe_input[0:10].values)

## [Step 6] Evaluation

In [None]:
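# score() returns the mean accuracy: the fraction of samples that were classified correctly
trained_model.score(iris_dataframe_input.values, iris_dataframe_response.values.ravel())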

## Splitting into Train and Test


If we use the same samples for training the model and for evaluating it, the measured performance is likely a poor indicator of the performance we can expect when the model makes decisions on data it has not seen before.

Testing with the same data that was used for training will usually result in a higher, overly optimistic score. This is because the model's weights were fitted to exactly those samples, so the model may partly have memorized them rather than learned patterns that generalize (an effect known as overfitting).

Because of that, the standard approach is to split the available labeled data into two parts before we start training of the ML model:
* Train Portion
* Test Portion

The train portion is the part of the dataset that is used for the training of the model.
The test portion is the part of the dataset that is ***excluded*** from the training of the model.
This is also called ***holding out*** part of the data. This ***unseen data*** (i.e. data that the model has never seen before) is then used to evaluate how well the model is doing.

Typical splits between train and test are 70/30 or 80/20. 
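
For comparison, scikit-learn ships a helper for this, `train_test_split`. It is not the approach used in the exercise below, which builds the split by hand with pandas. A minimal sketch of an 80/20 split on scikit-learn's built-in copy of the Iris data, with an arbitrary `random_state`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# test_size=0.2 holds out 20% of the samples;
# stratify=y keeps the class frequencies similar in both portions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(len(X_train), len(X_test))
```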


### Considering the Order of the Input Data

When splitting the data into a train and a test portion it is important to split the data in a random fashion.
We try to avoid an imbalance in the frequency of classes in either train or test, as well as an imbalance in terms of 'hard' and 'easy' cases.

The best way to achieve this is to randomly sample (or shuffle) the input data before making the split.

Pandas offers us the `sample(frac=double_value)` method as a simple way to do this.


### Exercise: Create Train and Test datasets

* Create a Train and Test portion of the input data.
* You should take the original `iris_dataframe` as input for this
* The iris_dataframe is pre-sorted by class. It is a good example of a case where shuffling or sampling from data before the split is a must. So you should start with this. 
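
The sketch below illustrates why the shuffle matters. It uses scikit-learn's built-in copy of the data (`load_iris(as_frame=True).frame`, where the class column is called `target`) as a stand-in for `iris_dataframe`. The variable names and the `random_state` value are chosen for illustration.

```python
from sklearn.datasets import load_iris

# Stand-in for iris_dataframe: 4 feature columns plus 'target', sorted by class
frame = load_iris(as_frame=True).frame

split_index = int(len(frame) * 0.8)   # 80% train, 20% test

# Without shuffling, the test portion contains only a single class
unshuffled_test = frame[split_index:]
print(unshuffled_test["target"].unique())   # [2]

# sample(frac=1) returns all rows in random order
shuffled = frame.sample(frac=1, random_state=42)
train = shuffled[:split_index]
test = shuffled[split_index:]

print(len(train), len(test))
print(sorted(test["target"].unique()))
```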

In [None]:
iris_df_shuffled = 

iris_df_train = 
iris_df_test = 

iris_df_train_input = 
iris_df_train_response = 

iris_df_test_input = 
iris_df_test_response = 

### Exercise: Train Classifier on Train Dataset

Take the train dataset you have created and use it to re-train the classifier.

In [None]:
trained_model = classifier.fit(iris_df_train_input, iris_df_train_response.values.ravel())

### Exercise: Evaluate Classifier with Test Dataset

* Use the test dataset you have created and evaluate the classifier trained on the train dataset.
* Execute the splitting, training and evaluation multiple times and observe what happens to the score.


In [None]:
trained_model.score(iris_df_test_input, iris_df_test_response)

### Classifier Report

The `classification_report` function can be used to generate a detailed breakdown of the evaluation scores at the class level.

In [None]:
from sklearn.metrics import classification_report

print(classification_report(iris_df_test_response, trained_model.predict(iris_df_test_input), target_names=trained_model.classes_))


### Confusion Matrix

A confusion matrix is a chart that visualizes the results of a classifier's output on the evaluation set. It shows a breakdown of the actual (true) labels vs. the predicted labels for each class. Ideally, the true and predicted labels should match, but when the classifier makes mistakes we can use the confusion matrix to learn what kinds of mistakes it is making (e.g. which classes it is confusing with each other).
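
How the matrix is laid out is easiest to see on a tiny hand-made example. In the sketch below the labels are invented for illustration. Each row corresponds to a true class and each column to a predicted class.

```python
from sklearn.metrics import confusion_matrix

true_labels = ["Iris-versicolor", "Iris-versicolor",
               "Iris-virginica", "Iris-virginica", "Iris-virginica"]
predicted_labels = ["Iris-versicolor", "Iris-virginica",
                    "Iris-virginica", "Iris-virginica", "Iris-versicolor"]

# labels fixes the order of the rows and columns
toy_cm = confusion_matrix(
    true_labels, predicted_labels, labels=["Iris-versicolor", "Iris-virginica"]
)

print(toy_cm)
# [[1 1]   true versicolor: 1 predicted correctly, 1 predicted as virginica
#  [1 2]]  true virginica:  1 predicted as versicolor, 2 predicted correctly
```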

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(iris_df_test_response, trained_model.predict(iris_df_test_input))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=trained_model.classes_)
disp.plot()
plt.show()


## Further Optimizations

Scikit-learn allows plugging in various implementations of machine learning algorithms. Experimenting with different algorithms is a good way to develop an intuition for which algorithms work well for which scenarios.

As an exercise, experiment with the configurations of the classifiers below and observe the impact on the resulting classification scores. 

In [None]:
import time
from tqdm import tqdm
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# K-Nearest Neighbors
# - n_neighbors: number of neighbors to consider (default: 5)
# - weights: weighting of distance to neighbors: 'uniform' or 'distance' (default: 'uniform')
knn = KNeighborsClassifier()

# Decision Tree:
# - max_depth: maximum depth of decision nodes (default: None)
decision_tree = DecisionTreeClassifier(max_depth=2)

# Random Forest
# - n_estimators: number of individual decision trees used internally by the model (default: 100)
random_forest = RandomForestClassifier()

# Support Vector Machine:
support_vector_machine = SVC()

# Naive Bayes:
naive_bayes = MultinomialNB()

# Logistic Regression:
# - max_iter: maximum number of iterations (default: 100)
logistic_regression = LogisticRegression()

classifiers = [
    knn,
    decision_tree,
    random_forest,
    support_vector_machine,
    naive_bayes,
    logistic_regression
]

model_metrics = []
for classifier in tqdm(classifiers):
    
    # Train the classifier
    start_time = time.time()
    trained_model = classifier.fit(iris_df_train_input, iris_df_train_response.values.ravel())
    end_training_time = time.time()
    training_time_elapsed = end_training_time - start_time
    
    # Apply trained classifier to test set
    start_time = time.time()
    predictions = trained_model.predict(iris_df_test_input)
    prediction_time = time.time()
    prediction_time_elapsed = prediction_time - start_time
    
    # Measure model performance
    score = classifier.score(iris_df_test_input, iris_df_test_response)
    
    # Record model metrics
    model_metrics.append({
        "model": trained_model.__class__.__name__,
        "training_time": training_time_elapsed,
        "prediction_time": prediction_time_elapsed,
        "score": score,
    })
    
# Print model metrics table
pd.DataFrame(model_metrics)

## Explainability

Some models are easier to interpret than others. For instance, if we'd like to generate a set of rules for people to be able to identify Iris plants in the wild without using a computer, a simple Decision Tree model would likely be preferable over a neural network, even if the neural network might be some percentage points more accurate.

How important the explainability of a model is will often depend on the use case.
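
Besides plotting the tree, scikit-learn can also print the learned rules as plain text with `export_text`. This is a sketch on scikit-learn's built-in copy of the Iris data, not the `decision_tree` trained above. The variable names and the `random_state` value are chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the rule set small enough to read
tree_model = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_model.fit(iris.data, iris.target)

# export_text renders the decision rules as indented if/else style text
rules = export_text(tree_model, feature_names=list(iris.feature_names))
print(rules)
```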

In [None]:
from sklearn import tree

# Use the fitted decision tree's own class labels, and pass plain lists for the names
tree.plot_tree(decision_tree, feature_names=list(iris_dataframe.columns[:-1]), class_names=list(decision_tree.classes_), filled=True)
plt.show()