## Iris

Here is some of the information provided by the official website:

```text
This is perhaps the best known database to be found in the pattern recognition literature.
The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant.
Predicted attribute: class of iris plant.
```

And here's the pandas-view of the raw data:

```text
      f0   f1   f2   f3           label
0    5.1  3.5  1.4  0.2     Iris-setosa
1    4.9  3.0  1.4  0.2     Iris-setosa
2    4.7  3.2  1.3  0.2     Iris-setosa
3    4.6  3.1  1.5  0.2     Iris-setosa
4    5.0  3.6  1.4  0.2     Iris-setosa
..   ...  ...  ...  ...             ...
145  6.7  3.0  5.2  2.3  Iris-virginica
146  6.3  2.5  5.0  1.9  Iris-virginica
147  6.5  3.0  5.2  2.0  Iris-virginica
148  6.2  3.4  5.4  2.3  Iris-virginica
149  5.9  3.0  5.1  1.8  Iris-virginica

[150 rows x 5 columns]
```

> We don't use pandas in our code, but it is a convenient way to visualize the data 🤣
>
> You can download the raw data (`iris.data`) with [this link](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data).
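
The view above can be reproduced with a few lines of pandas. This is only a minimal sketch and not part of the carefree-learn workflow: the column names `f0` to `f3` and `label` simply mirror the view above (the raw file has no header row), and the inline `sample` string is a stand-in for the downloaded `iris.data` so that the snippet runs on its own.

```python
import io

import pandas as pd

# A few rows copied from the view above; replace `io.StringIO(sample)`
# with the path to the downloaded "iris.data" to load the full file.
sample = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica
"""

# The raw file has no header row, so we supply the column names ourselves.
columns = ["f0", "f1", "f2", "f3", "label"]
df = pd.read_csv(io.StringIO(sample), header=None, names=columns)
print(df)
```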

In [1]:
# preparations

import os
import torch
import pickle
import cflearn
import numpy as np
from cflearn.misc.toolkit import seed_everything

seed_everything(123)

123

### Basic Usages

Traditionally, we need to process the raw data before we feed it into our machine learning models (e.g. encode the label column, which is a string column, into an ordinal column). In carefree-learn, however, we can train neural networks directly on files without worrying about the rest:

In [2]:
processor_config = cflearn.MLBundledProcessorConfig(has_header=False, num_split=25)
data = cflearn.MLData.init(processor_config=processor_config).fit("iris.data")
config = cflearn.MLConfig(
    model_name="fcnn",
    model_config=dict(input_dim=data.num_features, output_dim=data.num_labels),
    loss_name="focal",
    metric_names=["acc", "auc"],
)
m = cflearn.api.fit_ml(data, config=config)

                                    Internal Default Configurations Used by `carefree-learn`                                    
--------------------------------------------------------------------------------------------------------------------------------
                                                   train_samples   |   125
                                                   valid_samples   |   25
                                               max_snapshot_file   |   25
                                                encoder_settings   |   {}
                                                       workspace   |   _logs\2023-03-19_21-40-56-199595
                                   model_config.encoder_settings   |   {}
                                                 index_mapping.0   |   0
                                                 index_mapping.1   |   1
                                                 index_mapping.2   |   2
                                                

Under the hood, carefree-learn parses `iris.data` automatically and splits it into a training set and a validation set (125 and 25 samples here, as listed above), then trains a fully connected neural network (fcnn) on them.
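
For comparison, the manual route would look roughly like the sketch below. It is not how carefree-learn implements these steps; it only illustrates, with plain NumPy on a few hand-copied rows, what "encode the label column" and "split into training and validation sets" mean. The helper name `encode_and_split` and the shuffling seed are hypothetical.

```python
import numpy as np


def encode_and_split(features, labels, num_valid, seed=0):
    """Encode string labels as integers, then hold out `num_valid` rows."""
    # `np.unique` returns the sorted class names and, with `return_inverse`,
    # the index of each row's class: an ordinal encoding of the labels.
    classes, encoded = np.unique(labels, return_inverse=True)
    # Shuffle the row indices so that the validation rows are picked at random.
    indices = np.random.default_rng(seed).permutation(len(features))
    valid_idx, train_idx = indices[:num_valid], indices[num_valid:]
    train = features[train_idx], encoded[train_idx]
    valid = features[valid_idx], encoded[valid_idx]
    return classes, train, valid


# Four rows copied from the raw data shown earlier.
features = np.array(
    [
        [5.1, 3.5, 1.4, 0.2],
        [4.9, 3.0, 1.4, 0.2],
        [6.7, 3.0, 5.2, 2.3],
        [5.9, 3.0, 5.1, 1.8],
    ]
)
labels = np.array(["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"])

classes, (x_train, y_train), (x_valid, y_valid) = encode_and_split(
    features, labels, num_valid=1
)
print(classes)
print(x_train.shape, x_valid.shape)
```

With `num_valid=25` on the full 150-row file, this would give the same 125/25 split sizes as reported in the configuration table, though not necessarily the same rows.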

We can further inspect the processed data if we want to know how carefree-learn actually parsed the input data:

In [3]:
data = m.data
x_train = data.train_dataset.x
print("> mean", x_train.mean(0))
print("> std ", x_train.std(0))

> mean [-0.02245645 -0.00092561 -0.01379941 -0.01192022]
> std  [0.99158337 1.00485133 0.9964612  0.98410995]


This shows that the raw data has been converted into numerical data that neural networks can accept, with each feature roughly standardized. More precisely, the input features are automatically normalized to `mean=0.0` and `std=1.0` over the full data, which we can verify by stacking the training and validation sets:

In [4]:
data = m.data
x_train = data.train_dataset.x
x_valid = data.valid_dataset.x
stacked = np.vstack([x_train, x_valid])
print("> mean", stacked.mean(0))
print("> std ", stacked.std(0))

> mean [-4.42608912e-16 -7.05546732e-16  1.93918955e-16  6.80936788e-16]
> std  [1. 1. 1. 1.]


> The results above mean that the data was normalized first and only then split into training & validation sets.
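
The difference between the two outputs can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not carefree-learn's own code: a random matrix stands in for the four iris features, it is standardized over all rows, and only then split 125/25. The full matrix ends up with mean 0 and std 1 (up to floating point error), while the training part alone is only close to those values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the 150 x 4 iris feature matrix (an assumption).
raw = rng.normal(loc=5.0, scale=2.0, size=(150, 4))

# Standardize using the statistics of ALL rows (train + validation together).
normalized = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Split afterwards: 125 training rows and 25 validation rows.
x_train, x_valid = normalized[:125], normalized[125:]

print("> full  mean", normalized.mean(axis=0))
print("> full  std ", normalized.std(axis=0))
print("> train mean", x_train.mean(axis=0))
print("> train std ", x_train.std(axis=0))
```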

After training on files, carefree-learn can predict & evaluate on files directly as well. We'll handle the data parsing and normalization for you automatically:

In [5]:
loader = data.build_loader("iris.data")
predictions = m.predict(loader)
# evaluations could be achieved easily with cflearn.api.evaluate
cflearn.api.evaluate(loader, dict(m=m))

|        metrics         |                       acc                        |                       auc                        |
--------------------------------------------------------------------------------------------------------------------------------
|                        |      mean      |      std       |     score      |      mean      |      std       |     score      |
--------------------------------------------------------------------------------------------------------------------------------
|           m            |    0.853333    |    0.000000    |    0.853333    |    0.977733    |    0.000000    |    0.977733    |


{'acc': {'m': Statistics(sign=1.0, mean=0.8533333333333334, std=0.0, score=0.8533333333333334)},
 'auc': {'m': Statistics(sign=1.0, mean=0.9777333333333332, std=0.0, score=0.9777333333333332)}}
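
As a reminder of what the `acc` column measures, here is a minimal sketch of plain classification accuracy. It assumes the model outputs one score per class for each sample; the `accuracy` helper and the `logits` array are hypothetical and not part of carefree-learn.

```python
import numpy as np


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predicted = logits.argmax(axis=1)
    return float((predicted == labels).mean())


# Hypothetical scores for 4 samples and 3 classes (one row per sample).
logits = np.array(
    [
        [2.0, 0.1, -1.0],
        [0.3, 1.5, 0.2],
        [0.1, 0.2, 0.9],
        [1.2, 0.8, 0.1],
    ]
)
labels = np.array([0, 1, 2, 1])
print(accuracy(logits, labels))  # 3 of the 4 rows are classified correctly
```

Read this way, an `acc` of 0.853333 on the 150-row file would correspond to 128 correctly classified rows.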

### Benchmarking

As we know, neural networks are trained with **_stochastic_** gradient descent (and its variants), which will introduce some randomness to the final result, even if we are training on the same dataset. In this case, we need to repeat the same task several times in order to obtain the bias & variance of our neural networks.

Fortunately, carefree-learn provides the `repeat_ml` API, which achieves this with only a few lines of code:

In [6]:
# With num_repeat=3 specified, we'll train 3 models on `iris.data`.
results = cflearn.api.repeat_ml(data, m.config, num_repeat=3)
pipelines = cflearn.api.load_pipelines(results)
cflearn.api.evaluate(loader, pipelines)

100%|████████████████████████████████████████████████████████████████████████████| 3/3 [00:07<00:00,  2.52s/it]

|        metrics         |                       acc                        |                       auc                        |
--------------------------------------------------------------------------------------------------------------------------------
|                        |      mean      |      std       |     score      |      mean      |      std       |     score      |
--------------------------------------------------------------------------------------------------------------------------------
|          fcnn          |    0.842222    |    0.008314    |    0.833907    |    0.975111    |    0.013052    |    0.962058    |





{'acc': {'fcnn': Statistics(sign=1.0, mean=0.8422222222222223, std=0.008314794192830995, score=0.8339074280293913)},
 'auc': {'fcnn': Statistics(sign=1.0, mean=0.9751111111111112, std=0.013052864970979294, score=0.9620582461401319)}}
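
The `mean`, `std` and `score` fields can be related with a short calculation. For these metrics (`sign=1.0`), the reported `score` appears to equal `mean - std`; this is inferred from the numbers above rather than taken from the library's documentation. The three accuracies below are illustrative values, not the actual per-run results: they were chosen because they reproduce the reported statistics.

```python
import numpy as np

# Illustrative accuracies for three runs: 125, 126 and 128 correct
# predictions out of 150. They are NOT taken from the run itself.
accuracies = np.array([125, 126, 128]) / 150

mean = accuracies.mean()
std = accuracies.std()  # population standard deviation (ddof=0)
score = mean - std      # assumed relation for metrics where higher is better

print(f"mean={mean:.6f} std={std:.6f} score={score:.6f}")
```

A score of this form penalizes unstable models: between two models with the same mean, the one whose results vary less across runs ranks higher.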

We can also compare the performances across different models:

In [7]:
# With models=["linear", "fcnn"], we'll train both linear models and fcnn models.
models = ["linear", "fcnn"]
results = cflearn.api.repeat_ml(data, m.config, models=models, num_repeat=3)
pipelines = cflearn.api.load_pipelines(results)
cflearn.api.evaluate(loader, pipelines)

100%|████████████████████████████████████████████████████████████████████████████| 6/6 [00:17<00:00,  2.84s/it]

|        metrics         |                       acc                        |                       auc                        |
--------------------------------------------------------------------------------------------------------------------------------
|                        |      mean      |      std       |     score      |      mean      |      std       |     score      |
--------------------------------------------------------------------------------------------------------------------------------
|          fcnn          | -- 0.860000 -- | -- 0.056829 -- | -- 0.803170 -- | -- 0.966088 -- | -- 0.016351 -- | -- 0.949737 -- |
--------------------------------------------------------------------------------------------------------------------------------
|         linear         |    0.562222    |    0.388132    |    0.174089    |    0.675977    |    0.363547    |    0.312429    |





{'acc': {'fcnn': Statistics(sign=1.0, mean=0.86, std=0.056829830455752954, score=0.803170169544247),
  'linear': Statistics(sign=1.0, mean=0.5622222222222222, std=0.38813259793561133, score=0.17408962428661084)},
 'auc': {'fcnn': Statistics(sign=1.0, mean=0.9660888888888888, std=0.016351086345908615, score=0.9497378025429801),
  'linear': Statistics(sign=1.0, mean=0.6759777777777778, std=0.36354788500465524, score=0.31242989277312255)}}

It is worth mentioning that carefree-learn supports distributed training, which means when we need to perform large scale benchmarking (e.g. train 100 models), we could accelerate the process through multiprocessing:

> In `carefree-learn`, Distributed Training in Machine Learning tasks sometimes doesn't mean training your model on multiple GPUs or multiple machines. Instead, it may mean training multiple models at the same time.

In [8]:
# With num_jobs=2, we will launch 2 processes to run the tasks in a distributed way.
results = cflearn.api.repeat_ml(data, m.config, models=models, num_repeat=10, num_jobs=2)

100%|██████████████████████████████████████████████████████████████████████████| 20/20 [00:34<00:00,  1.72s/it]


On the iris dataset, however, distributed training may actually be slower: the dataset only contains 150 samples, so the relative overhead of launching multiple processes can outweigh the gain.

### Advanced Benchmarking

This is still not enough if we want to know whether other models (e.g. scikit-learn models) can outperform carefree-learn models. In this case, we can perform more advanced benchmarking with the `Experiment` helper class.

In [9]:
experiment = cflearn.dist.ml.Experiment()
data_folder = experiment.dump_data(data)

# Add carefree-learn tasks
experiment.add_task(model="fcnn", config=config, data_folder=data_folder)
experiment.add_task(model="linear", config=config, data_folder=data_folder)
# Add scikit-learn tasks
run_command = "python run_sklearn.py"
common_kwargs = {"run_command": run_command, "data_folder": data_folder}
experiment.add_task(model="decision_tree", **common_kwargs)
experiment.add_task(model="random_forest", **common_kwargs)

'D:\\GitHub\\carefree-learn-new\\examples\\ml\\iris\\_experiment\\random_forest\\0'

Notice that we specified `run_command="python run_sklearn.py"` for scikit-learn tasks, which means `Experiment` will try to execute this command in the current working directory to train the scikit-learn models. The good news is that we do not need to specify any command line arguments, because `Experiment` will handle those for us.

Here is basically what a `run_sklearn.py` should look like ([source code](run_sklearn.py)):

```python
import os
import pickle

import numpy as np

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from cflearn.constants import INPUT_KEY
from cflearn.constants import LABEL_KEY
from cflearn.dist.ml.runs._utils import get_info


if __name__ == "__main__":
    info = get_info()
    meta = info.meta
    # data
    data = info.data
    assert data is not None
    data.prepare(None)
    loader = data.initialize()[0]
    dataset = loader.get_full_batch()
    x, y = dataset[INPUT_KEY], dataset[LABEL_KEY]
    assert isinstance(x, np.ndarray)
    assert isinstance(y, np.ndarray)
    # model
    model = meta["model"]
    if model == "decision_tree":
        base = DecisionTreeClassifier
    elif model == "random_forest":
        base = RandomForestClassifier
    else:
        raise NotImplementedError
    sk_model = base()
    # train & save
    sk_model.fit(x, y.ravel())
    with open(os.path.join(info.workplace, "sk_model.pkl"), "wb") as f:
        pickle.dump(sk_model, f)

```

With `run_sklearn.py` defined, we could run those tasks with one line of code:

In [10]:
results = experiment.run_tasks()

100%|████████████████████████████████████████████████████████████████████████████| 4/4 [00:10<00:00,  2.73s/it]


After the tasks finish running, we should see the following file structure in the current working directory:

```text
|--- _experiment
   |--- __data__
      |-- npd
      |-- id.txt
      |-- info.json
   |--- fcnn/0
      |-- __meta__.json
      |-- __dl_config__
      |-- pipeline
   |--- linear/0
      |-- ...
   |--- decision_tree/0
      |-- __meta__.json
      |-- sk_model.pkl
   |--- random_forest/0
      |-- ...
```

As expected, `carefree-learn` pipelines are saved into the `pipeline` folders, while scikit-learn models are saved into `sk_model.pkl` files. Since the scikit-learn models are not yet loaded, we need to load them into our environment manually:

In [11]:
pipelines = cflearn.api.load_pipelines(results)
for workspace, workspace_key in zip(results.workspaces, results.workspace_keys):
    model = workspace_key[0]
    if model in ["decision_tree", "random_forest"]:
        model_file = os.path.join(workspace, "sk_model.pkl")
        with open(model_file, "rb") as f:
            predictor = cflearn.SKLearnClassifier(pickle.load(f))
            pipelines[model] = cflearn.GeneralEvaluationPipeline(config, predictor)

After that, we can finally benchmark these models:

In [12]:
cflearn.api.evaluate(loader, pipelines)

|        metrics         |                       acc                        |                       auc                        |
--------------------------------------------------------------------------------------------------------------------------------
|                        |      mean      |      std       |     score      |      mean      |      std       |     score      |
--------------------------------------------------------------------------------------------------------------------------------
|     decision_tree      | -- 1.000000 -- | -- 0.000000 -- | -- 1.000000 -- | -- 1.000000 -- | -- 0.000000 -- | -- 1.000000 -- |
--------------------------------------------------------------------------------------------------------------------------------
|          fcnn          |    0.900000    | -- 0.000000 -- |    0.900000    |    0.977000    | -- 0.000000 -- |    0.977000    |
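--------------------------------------------------------------------------------------------------------------------------------
|         linear         |    0.886666    | -- 0.000000 -- |    0.886666    |    0.952466    | -- 0.000000 -- |    0.952466    |
--------------------------------------------------------------------------------------------------------------------------------
|     random_forest      | -- 1.000000 -- | -- 0.000000 -- | -- 1.000000 -- | -- 1.000000 -- | -- 0.000000 -- | -- 1.000000 -- |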


{'acc': {'decision_tree': Statistics(sign=1.0, mean=1.0, std=0.0, score=1.0),
  'fcnn': Statistics(sign=1.0, mean=0.9, std=0.0, score=0.9),
  'linear': Statistics(sign=1.0, mean=0.8866666666666667, std=0.0, score=0.8866666666666667),
  'random_forest': Statistics(sign=1.0, mean=1.0, std=0.0, score=1.0)},
 'auc': {'decision_tree': Statistics(sign=1.0, mean=1.0, std=0.0, score=1.0),
  'fcnn': Statistics(sign=1.0, mean=0.977, std=0.0, score=0.977),
  'linear': Statistics(sign=1.0, mean=0.9524666666666667, std=0.0, score=0.9524666666666667),
  'random_forest': Statistics(sign=1.0, mean=1.0, std=0.0, score=1.0)}}

### Conclusions

This notebook covers only a subset of the features that `carefree-learn` offers, but we've already walked through many basic & common steps we'll encounter in real-life machine learning tasks.