# How to Easily Create Machine Learning Pipelines With DVC
## Create robust, reproducible pipelines with DVC

### What is a machine learning pipeline?

Imagine a machine learning pipeline as an unbreakable tunnel that stretches from one end of a mountain to the other. At the entrance of the tunnel is a massive avalanche of raw data, tumbling and cascading from all directions.

As the data enters the tunnel, it is cleaned, preprocessed, transformed and reduced to the features that machine learning models can use. Along the way, it passes through a series of checkpoints where different models and algorithms are trained and tested, and only the strongest and most accurate ones move on.

Finally, at the other end of the tunnel, the data emerges as a fully trained and operational machine learning model, ready to be deployed to the real world.

In a typical ML project, any checkpoint or stage in a pipeline may take dozens or even hundreds of iterations to get right. For this reason, it is imperative to use a reliable tool to track the pipeline building process and make it as straightforward and reproducible as possible.

One popular tool for this in the Python ecosystem is DVC (Data Version Control).

### How to create a pipeline in DVC?

### How to track metrics and plots in DVC?

### How to run experiments in DVC?

### Experimentation workflow in machine learning and DVC

### Next steps

Even though we have covered a lot of ground in this tutorial, there is still much more to learn. I recommend reading the [DVC docs](https://dvc.org/doc) from top to bottom (like I did), focusing especially on the [User Guide section](https://dvc.org/doc/user-guide/overview).

From there, you can explore more sophisticated ways to run experiments and capture better metrics and plots (for example, plot templates) with the `DVCLive` library. If you want to learn about deploying DVC-tracked projects, don't neglect the [Use Cases section](https://dvc.org/doc/use-cases) either.

### Conclusion

Massive congratulations on taking the first steps toward clean, organized, reproducible machine learning projects! Here is an outline of the steps to take when setting up an experiment management system with DVC in a new project:

1. Track and store large files with `dvc add` and `dvc push`.
2. Create the scripts for stages of a pipeline. A typical pipeline consists of preprocessing, training and evaluation stages.
3. Add each stage with a `dvc stage add` command, specifying dependencies with `-d`, DVC-tracked outputs with `-o`, metrics with `-M`, plots with `--plots` and so on.
4. Create a `params.yaml` file that lists the hyperparameters of each pipeline stage. Use the `dvc.api.params_show` function inside your stage scripts to read the parameters as key-value pairs.
5. Run your entire pipeline as an experiment with `dvc exp run -n exp_name` or queue multiple experiments with `dvc exp run --queue`.
6. Manage and compare experiments using the DVC extension's view pane inside VS Code.
7. Persist only the chosen experiments to Git history with `dvc exp apply exp_ID`, followed by `git add` and `git commit`.
8. Iterate.
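
To make steps 3 and 5 more concrete, here is a minimal Python sketch that assembles the corresponding command lines and prints them without running anything. The three-stage layout, the script paths under `src/`, the data and model paths, the `train` parameter section and the experiment name `baseline` are all hypothetical placeholders, not taken from a real project; only the flags (`-n`, `-d`, `-p`, `-o`, `-M`, `--plots`) are DVC's own.

```python
import shlex


def stage_add_command(name, cmd, deps=(), params=(), outs=(), metrics=(), plots=()):
    """Build the argument list for one `dvc stage add` call (nothing is executed)."""
    args = ["dvc", "stage", "add", "-n", name]
    flag_groups = (
        ("-d", deps),         # dependencies: scripts and input data
        ("-p", params),       # parameters (or whole sections) from params.yaml
        ("-o", outs),         # outputs tracked and cached by DVC
        ("-M", metrics),      # metrics files, not cached by DVC
        ("--plots", plots),   # plot data files
    )
    for flag, values in flag_groups:
        for value in values:
            args += [flag, value]
    # The stage's shell command goes last, as a single argument.
    args.append(cmd)
    return args


# Hypothetical project layout: every path and name below is a placeholder.
STAGES = [
    dict(
        name="preprocess",
        cmd="python src/preprocess.py",
        deps=["src/preprocess.py", "data/raw"],
        outs=["data/processed"],
    ),
    dict(
        name="train",
        cmd="python src/train.py",
        deps=["src/train.py", "data/processed"],
        params=["train"],
        outs=["models/model.pkl"],
    ),
    dict(
        name="evaluate",
        cmd="python src/evaluate.py",
        deps=["src/evaluate.py", "models/model.pkl"],
        metrics=["metrics.json"],
        plots=["plots/roc.csv"],
    ),
]

commands = [stage_add_command(**stage) for stage in STAGES]
# Step 5: run the whole pipeline as a named experiment.
commands.append(["dvc", "exp", "run", "-n", "baseline"])

for args in commands:
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(args))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; once the paths match your project, you can paste the printed lines into a terminal inside a DVC-initialized repository.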

Thank you very much for reading!