# [Lesson 15: Easier Experimenting in Python](https://machinelearningmastery.com/easier-experimenting-in-python/)

When we work on a machine learning project, we quite often need to experiment
with multiple alternatives. Some features in Python allow us to try out
different options without much effort. In this tutorial, we are going to see
some tips to make our experiments faster.

After finishing this tutorial, you will know:

* How to leverage duck typing to easily swap functions and objects
* How making components drop-in replacements for each other speeds up experimentation

## Overview

This tutorial is in three parts; they are:

*    Workflow of a machine learning project
*    Functions as objects
*    Caveats

## Workflow of a Machine Learning Project

Consider a very simple machine learning project as follows:

```python
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

# Train
clf = SVC()
clf.fit(X_train, y_train)

# Test
score = clf.score(X_val, y_val)
print("Validation accuracy", score)
```

```
Validation accuracy 0.9666666666666667
```


This is a typical machine learning project workflow. We have a stage of
preprocessing the data, then training a model, and afterward, evaluating our
result. But in each step, we may want to try something different. For example,
we may wonder if normalizing the data would make it better. So we may rewrite
the code above into the following:

```python
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

# Train
clf = Pipeline([('scaler',StandardScaler()), ('classifier',SVC())])
clf.fit(X_train, y_train)

# Test
score = clf.score(X_val, y_val)
print("Validation accuracy", score)
```

```
Validation accuracy 0.9666666666666667
```


So far, so good. But what if we keep experimenting with different datasets,
different models, or different score functions? Flipping back and forth between
using a scaler and not using one would mean a lot of code changes each time, and
it would be quite easy to make mistakes.

Because Python supports duck typing, the following two classifier models can be
used interchangeably. Both implement the same interface, namely `fit()` and
`score()`:

```python
clf = SVC()
clf = Pipeline([('scaler',StandardScaler()), ('classifier',SVC())])
```

Therefore, we can simply select between these two versions and keep everything
else intact. We can say these two models are **drop-in replacements** for each other.
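
To see the idea without scikit-learn in the picture, here is a minimal sketch in plain Python. The two classes and the `train_and_score` helper are hypothetical names made up for this illustration; the only thing they share is that each offers `fit()` and `score()`, which is all the calling code relies on.

```python
class MajorityClassifier:
    """Hypothetical model: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def score(self, X, y):
        # Fraction of labels that match the constant prediction
        return sum(label == self.label_ for label in y) / len(y)


class FirstLabelClassifier:
    """Hypothetical model: always predicts the first training label."""

    def fit(self, X, y):
        self.label_ = y[0]
        return self

    def score(self, X, y):
        # Fraction of labels that match the constant prediction
        return sum(label == self.label_ for label in y) / len(y)


def train_and_score(clf, X_train, y_train, X_val, y_val):
    # Works with any object that offers fit() and score();
    # no shared base class is required
    clf.fit(X_train, y_train)
    return clf.score(X_val, y_val)


X_train, y_train = [[0], [1], [2], [3]], ["a", "b", "b", "b"]
X_val, y_val = [[4], [5], [6], [7]], ["b", "b", "b", "a"]

for clf in (MajorityClassifier(), FirstLabelClassifier()):
    print(type(clf).__name__, train_and_score(clf, X_train, y_train, X_val, y_val))
```

Neither class inherits from the other, yet `train_and_score` accepts both. That is the same reason `SVC()` and a `Pipeline` can be exchanged without touching the training and testing lines.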

Making use of this property, we can create a toggle variable to control the
design choice we make:

```python
USE_SCALER = True

if USE_SCALER:
    clf = Pipeline([('scaler',StandardScaler()), ('classifier',SVC())])
else:
    clf = SVC()
```

By toggling the variable `USE_SCALER` between `True` and `False`, we can select
whether a scaler should be applied. A more complex example would be to select
among different scalers and classifier models, such as:

```python
SCALER = "standard"
CLASSIFIER = "svc"

# Pick the classifier
if CLASSIFIER == "svc":
    model = SVC()
elif CLASSIFIER == "cart":
    model = DecisionTreeClassifier()
else:
    raise NotImplementedError

# Optionally wrap the classifier with a scaler
if SCALER == "standard":
    clf = Pipeline([('scaler',StandardScaler()), ('classifier',model)])
elif SCALER == "maxmin":
    clf = Pipeline([('scaler',MinMaxScaler()), ('classifier',model)])
elif SCALER is None:
    clf = model
else:
    raise NotImplementedError
```
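
As the number of options grows, the `if`/`elif` chains can get long. One possible alternative, sketched here as an assumption rather than as part of the original workflow, is to keep the options in dictionaries that map each name to a class. The names `CLASSIFIERS`, `SCALERS`, and `build_model` are hypothetical.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Lookup tables map each option name to a class (the class itself, not an instance)
CLASSIFIERS = {"svc": SVC, "cart": DecisionTreeClassifier}
SCALERS = {"standard": StandardScaler, "maxmin": MinMaxScaler, None: None}


def build_model(scaler, classifier):
    """Hypothetical helper: return a classifier, optionally wrapped with a scaler."""
    if classifier not in CLASSIFIERS or scaler not in SCALERS:
        raise NotImplementedError(f"Unknown option: {scaler!r}, {classifier!r}")
    model = CLASSIFIERS[classifier]()
    scaler_class = SCALERS[scaler]
    if scaler_class is None:
        # No scaling requested: use the bare classifier
        return model
    return Pipeline([("scaler", scaler_class()), ("classifier", model)])


clf = build_model("maxmin", "cart")
print(clf)
```

Adding another option then means adding one dictionary entry instead of another `elif` branch. The trade-off is a little more indirection, so for two or three options the plain `if`/`elif` form may still be easier to read.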

A complete example is as follows:

```python
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# toggle between options
SCALER = "maxmin"    # "standard", "maxmin", or None
CLASSIFIER = "cart"  # "svc" or "cart"

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

# Create model
if CLASSIFIER == "svc":
    model = SVC()
elif CLASSIFIER == "cart":
    model = DecisionTreeClassifier()
else:
    raise NotImplementedError

if SCALER == "standard":
    clf = Pipeline([('scaler',StandardScaler()), ('classifier',model)])
elif SCALER == "maxmin":
    clf = Pipeline([('scaler',MinMaxScaler()), ('classifier',model)])
elif SCALER is None:
    clf = model
else:
    raise NotImplementedError

# Train
clf.fit(X_train, y_train)

# Test
score = clf.score(X_val, y_val)
print("Validation accuracy", score)
```

```
Validation accuracy 0.9666666666666667
```


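If you go one step further, you may even skip the toggle variable and use a
string literal directly as the condition for a quick experiment. Because a
non-empty string always evaluates as true, `not "USE SCIPY"` is false and the
`else` branch runs: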

```python
import numpy as np
import scipy.stats as stats

# Covariance matrix and Cholesky decomposition
cov = np.array([[1, 0.8], [0.8, 1]])
L = np.linalg.cholesky(cov)

# Generate 100 pairs of bi-variate Gaussian random numbers
if not "USE SCIPY":
    z = np.random.randn(100,2)
    x = z @ L.T
else:
    x = stats.multivariate_normal(mean=[0, 0], cov=cov).rvs(100)

...
```
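
The truthiness rule behind this trick can be checked directly. The short sketch below uses a hypothetical variable `branch` only to record which side of the `if` statement runs.

```python
# Any non-empty string is truthy, whatever its text says
print(bool("USE SCIPY"))   # True
print(not "USE SCIPY")     # False
print(bool(""))            # False: only the empty string is falsy

if not "USE SCIPY":
    branch = "numpy"
else:
    branch = "scipy"
print(branch)              # scipy
```

Adding or removing the `not` flips which branch runs. The text of the string has no effect on the result; it only serves as a label for the reader.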

## Functions as Objects

In Python, functions are first-class citizens. You can assign a function to a
variable. Indeed, functions are objects in Python, as are classes (the classes
themselves, not only instances of classes). Therefore, we can use the same
technique as above to experiment with similar functions.

```python
import numpy as np

DIST = "normal"

if DIST == "normal":
    rangen = np.random.normal
elif DIST == "uniform":
    rangen = np.random.uniform
else:
    raise NotImplementedError

random_data = rangen(size=(10,5))
print(random_data)
```

```
[[ 0.86619299 -0.77271035 -0.34083775  0.94584046  1.20764607]
 [ 2.58773325  1.87003246 -0.06062041 -1.35006146 -0.18093381]
 [ 0.25698579  0.97225107  1.10114656 -0.08460816  0.98743163]
 [-1.15273051 -1.33611763 -0.295313    0.26662775  0.61416427]
 [-0.42849996 -0.29843637  0.48836518  0.97169038  0.80842449]
 [ 1.72818914 -1.02712406 -0.88652396  0.80512848  2.22228372]
 [-0.28981209  0.8426274   1.06113198 -0.17351963 -1.04606195]
 [ 0.61776735  0.03378074 -1.07895001  0.82524569 -0.69548583]
 [-0.58358819  1.50284277 -1.7930183  -0.89507719  1.56105326]
 [ 0.35353722  1.61784557 -0.28944177 -0.69882227  0.39400301]]
```


The above is equivalent to calling `np.random.normal(size=(10,5))`, but we hold
the function in a variable so that one function can easily be swapped for
another. Note that since we call the functions with the same argument, we have
to make sure every variation accepts it. If one does not, we may need a few
additional lines of code to make a wrapper. For example, generating samples from
Student’s t distribution requires an additional parameter for the degrees of
freedom:

```python
import numpy as np

DIST = "t"

if DIST == "normal":
    rangen = np.random.normal
elif DIST == "uniform":
    rangen = np.random.uniform
elif DIST == "t":
    def t_wrapper(size):
        # Student's t distribution with 3 degrees of freedom
        return np.random.standard_t(df=3, size=size)
    rangen = t_wrapper
else:
    raise NotImplementedError

random_data = rangen(size=(10,5))
print(random_data)
```

```
[[-6.19396605e-01  3.38445042e-01 -1.55661615e+00 -1.18975355e+01
   3.32717421e-02]
 [-1.38300917e+00  3.39365346e-02  1.16849881e-01 -5.05266443e-01
   3.81296725e-03]
 [-5.04704399e-01 -3.78061996e+00  1.15895686e-01  8.81443056e-01
   1.38002427e+00]
 [-2.44610708e+00 -1.30681111e+00 -2.21450226e+00 -1.21229398e+00
   1.49474875e-01]
 [ 3.60770537e-01 -2.25237759e-01  3.01802173e+00 -1.01772683e+00
   8.48512583e+00]
 [ 8.44193526e-01  5.41234948e-01 -1.10938159e-01 -4.20624675e+00
  -7.85112961e-01]
 [ 7.15638095e-01 -1.81740908e+00 -6.75764943e-01 -1.42905855e+00
   3.99267812e-01]
 [-1.81070355e+00 -3.57456675e-01  2.69219360e+00 -9.40644657e-01
  -5.53745257e-01]
 [ 2.04680196e+00  5.62444894e-01 -1.36432432e+00 -1.35008782e+00
  -1.03299582e+00]
 [ 4.77883311e-02  5.00605340e+00  5.11059348e-01 -1.02327832e+00
  -1.14930244e+00]]
```


This works because `np.random.normal`, `np.random.uniform`, and the `t_wrapper`
we defined are all drop-in replacements for each other: each can be called with
only a `size` keyword argument.
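
When a wrapper only needs to fix one argument, `functools.partial` from the standard library is a possible shortcut. The sketch below is an alternative to the hand-written wrapper, not the method used in the example above; it pre-fills `df=3` so that the resulting callable takes `size` alone.

```python
from functools import partial

import numpy as np

# Pre-fill df=3 so the result can be called with only size=...,
# just like np.random.normal and np.random.uniform
rangen = partial(np.random.standard_t, df=3)

random_data = rangen(size=(10, 5))
print(random_data.shape)
```

A hand-written wrapper remains the better choice when the adaptation involves more than fixing arguments, for example renaming a parameter or post-processing the result.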

## Caveats

Machine learning differs from other programming projects because there are more
uncertainties in the workflow. When you build a web page or a game, you have a
picture in your mind of what to achieve. In machine learning projects, however,
part of the work is exploratory.

You will probably use a source control system such as Git or Mercurial to
manage your development history in other projects. In machine learning
projects, however, we try out different combinations of many steps. Using Git
to manage every variation may not be a good fit, and can sometimes be overkill.
Using a toggle variable to control the flow therefore lets us try out different
things faster. This is especially handy when we are working in Jupyter
notebooks.

However, putting multiple versions of the code together makes the program
clumsy and less readable. It is better to do some clean-up once we have
confirmed what to do. This will help with maintenance in the future.

## Further reading

This section provides more resources on the topic if you are looking to go deeper.

### Books

* Fluent Python, second edition, by Luciano Ramalho, https://www.amazon.com/dp/1492056359/

## Summary

In this tutorial, you’ve seen how the duck typing property in Python helps us create drop-in replacements. Specifically, you learned:

*    Duck typing can help us switch between alternatives easily in a machine learning workflow
*    We can make use of a toggle variable to experiment among alternatives
