# Getting Ready for Development

Bruno M. Pacheco

GEIA - Grupo de Estudos em Inteligência Artificial

## Content

1. Operating System
1. Programming Language
1. Dependency Management
1. Project Structure
1. Development Environment
1. Tools/libs

# Operating System

What is the best OS for machine learning development?

## Why Linux?

### Widely used

* Google uses a Debian derivative
* IBM Watson and Cortana run on Linux too
* The top 100 supercomputers all run Linux

<a href="https://www.quora.com/What-linux-distribution-is-best-for-AI-Machine-Learning-Researchers"><p style="text-align:right">Quora</p></a>

### **Dev tools**
* They are usually developed in Linux and then ported to Windows
* They usually run much smoother and with more predictable results in Linux

## Which Linux?

Use the one you are most comfortable with.

<a href="https://www.slant.co/topics/9702/~os-for-deep-learning"><p style="text-align:right">What are the best OS for deep learning</p></a>

### Arch Linux

- Really raw Linux, which means no bloatware
- But that also means it does not come ready out of the box
- Meant for experienced Linux users, so gives the user control over everything (no handholding)

### Ubuntu

- Easy to use and works out of the box, with a wide range of software already included
- Gentler learning curve
- More polished look

### Debian GNU/Linux

- One of the oldest distros, which means a large, mature community
- Limited driver support
- Very standard, kind of the starting point for most distros

## Ubuntu

Because it is pretty out-of-the-box.

### Installation

1. Look for LTS versions (LTS = Long Term Support: 5 years of updates, instead of 9 months for the interim releases published every 6 months)
1. Learn Terminal basics
    * cd, ls, rm, mv, mkdir, echo, cat, ">", sudo, --help, chmod,...
1. Learn how to install software through package managers (apt-get on Ubuntu and Debian, yum on Red Hat-based distros)
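
For readers who already know some Python, the terminal basics listed above have rough standard-library analogues. This is only a sketch to make the commands less abstract: the file and directory names are made up, and the shell command in each comment is the approximate equivalent, not an exact one.

```python
import shutil
import tempfile
from pathlib import Path

# Work inside a throwaway directory so nothing on the real system is touched.
with tempfile.TemporaryDirectory() as scratch:
    root = Path(scratch)

    project = root / "project"
    project.mkdir()                                       # mkdir project

    notes = project / "notes.txt"
    notes.write_text("hello\n")                           # echo hello > project/notes.txt
    content = notes.read_text()                           # cat project/notes.txt

    renamed = project / "readme.txt"
    shutil.move(str(notes), str(renamed))                 # mv notes.txt readme.txt
    renamed.chmod(0o644)                                  # chmod 644 readme.txt

    listing = sorted(p.name for p in project.iterdir())   # ls project
    renamed.unlink()                                      # rm readme.txt
    remaining = list(project.iterdir())                   # ls project (now empty)

print(content.strip(), listing, remaining)
```

Day to day, the shell versions are shorter; the Python versions are mostly useful inside scripts.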

# Programming Language

>There is no such thing as a ‘best language for machine learning’ and it all depends on what you want to build, where you’re coming from and why you got involved in machine learning.
><p  style="font-size:0.8em; text-align:right">- Developer Economics, <a href="http://vmob.me/DE1Q17"><em>State of the Developer Nation Q1 2017</em></a></p>

![Most Used Languages for ML](img/most_used_languages_ml.png)

## Python 2 or 3?

3, because 2 is **legacy**!
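
A few behaviours that differ between the two versions, as a minimal sketch. The variable names are illustrative, and the list is far from complete.

```python
# Python 3 behaviours that differ from Python 2 (illustrative names).

# 1. print is a function, not a statement.
print("hello", "world", sep=", ")

# 2. "/" is true division; "//" is floor division.
half = 1 / 2        # 0.5 in Python 3 (it was 0 in Python 2)
floored = 1 // 2    # 0 in both versions

# 3. Text is Unicode by default; bytes are a separate type.
text = "Intelig\u00eancia"          # the escape is the letter "e" with a circumflex
encoded = text.encode("utf-8")      # the accented letter takes two bytes

# 4. range() is lazy and does not build a list up front.
lazy = range(3)

print(half, floored, len(text), len(encoded), list(lazy))
```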

# Dependency Management

## But why should I care about my dependencies?

* Outdated applications/packages
* Co-workers sync
* Deploy to client

## Environments

The purpose of virtual environments is to isolate one project from another, so that each project has its own dependencies and its own versions of them.
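
To make the idea concrete, here is a small sketch using the standard-library `venv` module. That choice is an assumption made for brevity: it is not one of the tools recommended in these slides, and the environment name `project-a-env` is hypothetical. The sketch creates an isolated environment in a temporary directory and checks whether the current interpreter is itself running inside one.

```python
import sys
import tempfile
import venv
from pathlib import Path


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def create_environment(location: Path) -> Path:
    """Create an isolated environment at `location` and return its config file."""
    # with_pip=False keeps the sketch fast and offline; real projects usually want pip.
    venv.create(location, with_pip=False)
    return location / "pyvenv.cfg"


with tempfile.TemporaryDirectory() as scratch:
    config = create_environment(Path(scratch) / "project-a-env")
    env_created = config.exists()

print("inside an environment:", in_virtual_environment())
print("environment created:", env_created)
```

Each environment gets its own `site-packages` directory, which is what lets two projects pin different versions of the same package.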


## Solutions

* pip + virtualenv + virtualenvwrapper [1]
* **conda** [2]
* Docker

<p style="text-align:right" >[1]<a href=https://realpython.com/python-virtual-environments-a-primer/ >Python Virtual Environments: A Primer</a></p>
<p style="text-align:right" >[2]<a href=https://medium.freecodecamp.org/why-you-need-python-environments-and-how-to-manage-them-with-conda-85f155f4353c >Why you need Python environments and how to manage them with Conda</a></p>

### Conda

Because it is easy and has a lot of packages.

![Conda Draw](img/conda-draw.jpeg)

#### Conda Installation

* **Anaconda** vs
* Miniconda

Also 32- vs 64-bit and Python 2.7 vs 3.x versions.

<a href="https://conda.io/docs/user-guide/install/index.html"><p style="text-align:right">Installation</p></a>

#### Getting started

Learn how to:
* Create environments
* Activate environments
* Manage packages (install, update, specify versions, search, remove, etc.)
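
The tasks listed above map onto conda commands roughly as follows. This sketch only prints the commands instead of running them, so it works without conda installed. The environment name, the package and the version numbers are placeholders, and `conda activate` assumes a reasonably recent conda (older releases used `source activate`).

```python
import shlex

ENV_NAME = "ml-project"  # hypothetical environment name

# Each entry maps a task from the list above to a conda command line.
# Package names and versions are placeholders, not recommendations.
CONDA_COMMANDS = {
    "create environment": ["conda", "create", "--name", ENV_NAME, "python=3.7"],
    "activate environment": ["conda", "activate", ENV_NAME],
    "install a pinned version": ["conda", "install", "--name", ENV_NAME, "numpy=1.15"],
    "update a package": ["conda", "update", "--name", ENV_NAME, "numpy"],
    "search for a package": ["conda", "search", "numpy"],
    "remove a package": ["conda", "remove", "--name", ENV_NAME, "numpy"],
    "list installed packages": ["conda", "list", "--name", ENV_NAME],
}

for task, command in CONDA_COMMANDS.items():
    # Printed rather than executed, so the sketch runs on any machine.
    print(f"{task:26s} $ {' '.join(shlex.quote(part) for part in command)}")
```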

<a href="https://conda.io/docs/_downloads/conda-cheatsheet.pdf"><p style="text-align:right">Conda Cheatsheet</p></a>
<a href="https://conda.io/docs/user-guide/getting-started.html"><p style="text-align:right">Getting started with conda</p></a>

![Conda Channels](img/conda-channels.jpeg)

# Project Structure

> Do machine learning like the great engineer you are, not like the great machine learning expert you aren’t
> <p style="text-align:right" >Martin Zinkevich, <a href=http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf><em>Rules of Machine Learning: Best Practices for ML Engineering</em></a></p>

## Why?

* Readability
* Reproducibility

<p style="text-align:right" ><a href=http://drivendata.github.io/cookiecutter-data-science/#why-use-this-project-structure >Cookiecutter Data Science: Why use this project structure?</a></p>
<p style="text-align:right" >Whitenack, D., <a href=https://www.oreilly.com/ideas/putting-the-science-back-in-data-science ><em>Putting the science back in data science</em></a></p>
<p style="text-align:right" >Towards Machine Learning, <a href=https://towardsml.com/2018/08/06/how-great-products-are-made-rules-of-machine-learning-by-google-a-summary/ ><em>How great products are made: Rules of Machine Learning by Google, a Summary</em></a></p>

> Ever tried to reproduce an analysis that you did a few months ago or even a few years ago? You may have written the code, but it's now impossible to decipher whether you should use make_figures.py.old, make_figures_working.py or new_make_figures01.py to get things done
> <p style="text-align:right" ><a href=http://drivendata.github.io/cookiecutter-data-science/#why-use-this-project-structure >Cookiecutter Data Science: Why use this project structure?</a></p>

## Cookiecutter Data Science

![Cookiecutter DS Initialization](img/cookiecutter-ds-intialization.png)

```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org
```
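
The Cookiecutter tool generates this layout from a template. To show that there is nothing magic in it, the sketch below recreates the same empty skeleton with `pathlib`. It does not replace the tool, which also fills in file contents; the function name `create_skeleton` and the project name `my-project` are made up for the example.

```python
import tempfile
from pathlib import Path

# Directories taken from the tree above.
DIRECTORIES = [
    "data/external", "data/interim", "data/processed", "data/raw",
    "docs", "models", "notebooks", "references", "reports/figures",
    "src/data", "src/features", "src/models", "src/visualization",
]

# Files taken from the tree above; this sketch creates them empty.
FILES = [
    "LICENSE", "Makefile", "README.md", "requirements.txt", "tox.ini",
    "src/__init__.py",
    "src/data/make_dataset.py",
    "src/features/build_features.py",
    "src/models/predict_model.py",
    "src/models/train_model.py",
    "src/visualization/visualize.py",
]


def create_skeleton(root: Path) -> list:
    """Create the empty project layout under `root` and return the relative paths."""
    for directory in DIRECTORIES:
        (root / directory).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (root / name).touch(exist_ok=True)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


with tempfile.TemporaryDirectory() as scratch:
    created = create_skeleton(Path(scratch) / "my-project")

print(len(created), "paths created")
```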

### Workflow

But now, what should I do with all these files and folders?

#### Some tips

> * Determine your goals — what error metric to use, and your target value for this error metric. These goals and error metrics should be driven by the problem that the application is intended to solve.
> * Establish a working end-to-end pipeline as soon as possible, including the estimation of the appropriate performance metrics.
> * Instrument the system well to determine bottlenecks in performance. Diagnose which components are performing worse than expected and whether poor performance is due to overﬁtting, underﬁtting, or a defect in the data or software.
> * Repeatedly make incremental changes such as gathering new data, adjusting hyperparameters, or changing algorithms, based on speciﬁc ﬁndings from your instrumentation.
<p style="text-align:right">Goodfellow, I., Bengio, Y., and Courville, A., <a href="https://www.deeplearningbook.org/contents/guidelines.html"><em>Deep Learning</em>, pg. 416</a></p>
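
A toy version of the first two tips, as a sketch under stated assumptions: the data is synthetic, the model is a one-variable least-squares line, and the metric (mean absolute error) and the target value are arbitrary choices for illustration. The point is the shape of the loop: fix a goal, get an end-to-end pipeline running, then compare training and validation error before changing anything.

```python
from statistics import mean


def mean_absolute_error(targets, predictions):
    """Average absolute difference between targets and predictions."""
    return mean(abs(t - p) for t, p in zip(targets, predictions))


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar


# 1. Goal: the error metric and its target value (arbitrary for this toy problem).
TARGET_MAE = 1.0

# 2. Synthetic data: a line plus a small alternating perturbation.
xs = list(range(20))
ys = [2.0 * x + 1.0 + (0.5 if x % 2 else -0.5) for x in xs]

# Hold out every fourth point for validation.
train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 4 != 0]
valid = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % 4 == 0]

# 3. End-to-end pipeline: fit on the training split, measure on both splits.
slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])


def predict(x):
    return slope * x + intercept


train_mae = mean_absolute_error([y for _, y in train], [predict(x) for x, _ in train])
valid_mae = mean_absolute_error([y for _, y in valid], [predict(x) for x, _ in valid])
goal_met = valid_mae <= TARGET_MAE

# 4. Instrumentation: a large gap between the two errors would hint at overfitting,
#    while two high errors would hint at underfitting or a data problem.
print(f"train MAE={train_mae:.3f}  validation MAE={valid_mae:.3f}  goal met={goal_met}")
```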

#### Also, Mateusz Bednarski's [*Structure and automated workflow for a machine learning project*](https://towardsdatascience.com/structure-and-automated-workflow-for-a-machine-learning-project-2fa30d661c1e) is a good step-by-step guide to the reasons behind each piece of a project structure.

# Development Environment

The software you use to write code.

## Jupyter/IPython Notebook

<div style="column-count: 2;">
    <div style="display: inline-block;vertical-align:top">
        <h3>Pros</h3>
        <ul>
            <li><b>Easy to share documents</b></li>
            <li>Interactive</li>
            <li>Fast to use and visualize results</li>
        </ul>
    </div>
    <div style="display: inline-block;vertical-align:top">
        <h3>Cons</h3>
        <ul>
            <li><b>Pretty easy to get messy</b></li>
            <li><b>Hard to debug</b></li>
        </ul>
    </div>
</div>


### Recommended for data exploration, data cleaning and experimentation

### Getting started with Jupyter

* Notebook basics [1]
* Learn Markdown for documentation [2]
* IPython basics [3]

<p style="text-align:right">[1] <a href=https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Notebook%20Basics.html>Jupyter Notebook Read the Docs: Notebook Basics</a></p>
<p style="text-align:right">[2] <a href=https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet>Markdown Cheatsheet</a></p>
<p style="text-align:right">[3] <a href=https://ipython.org/ipython-doc/3/interactive/tutorial.html>Introducing IPython</a></p>

## IDEs (PyCharm, Spyder, etc.)

<div style="column-count: 2;">
    <div style="display: inline-block;vertical-align:top">
        <h3>Pros</h3>
        <ul>
            <li><b>Error highlighting, easy to debug</b></li>
            <li>Auto-completion, code indentation, etc</li>
            <li>Refactoring tools</li>
            <li>Visualization tools</li>
        </ul>
    </div>
    <div style="display: inline-block;vertical-align:top">
        <h3>Cons</h3>
        <ul>
            <li><b>Hard to integrate with other tools</b></li>
            <li>Usually heavy</li>
            <li>Logarithmic learning curve</li>
            <li>Abstraction from compile-run-debug process</li>
        </ul>
    </div>
</div>

### Recommended for "hardcore" code writing

### Getting started with PyCharm

1. Creating and managing a project
1. Customize your environment
1. Learn how to run, debug and test
1. Learn the Scientific Tools (only if you are not using the free version)

<p style="text-align:right"><a href=https://www.jetbrains.com/help/pycharm/quick-start-guide.html>Quick Start Guide</a></p>
<p style="text-align:right"><a href=https://www.jetbrains.com/help/pycharm/scientific-tools.html>Scientific Tools</a></p>

## Code Editors (Atom, VS Code, Sublime, etc.)

<div style="column-count: 2;">
    <div style="display: inline-block;vertical-align:top">
        <h3>Pros</h3>
        <ul>
            <li><b>Customizable</b></li>
            <li>Lightweight</li>
            <li>Good for learning about compile-run-debug tools</li>
            <li>Integration with external tools</li>
        </ul>
    </div>
    <div style="display: inline-block;vertical-align:top">
        <h3>Cons</h3>
        <ul>
            <li><b>Usually bad for visualization</b></li>
            <li>Linear learning curve</li>
            <li>Weaker debugging tools (compared to IDEs)</li>
        </ul>
    </div>
</div>

### Recommended for code writing and workflow automation

### Getting started with Visual Studio Code

1. Learn the basics [1]
1. Learn how to debug [2]
1. Use tasks [3]
1. Set up a linter [4]
1. Explore extensions [5, 6 and 7]


<p style="text-align:right">[1] <a href=https://code.visualstudio.com/docs/introvideos/basics>Getting started with Visual Studio Code</a></p>
<p style="text-align:right">[2] <a href=https://code.visualstudio.com/docs/editor/debugging>Debugging</a></p>
<p style="text-align:right">[3] <a href=https://code.visualstudio.com/docs/editor/tasks>Integrate with External Tools via Tasks</a></p>
<p style="text-align:right">[4] <a href=https://code.visualstudio.com/docs/python/linting>Linting Python in VS Code</a></p>
<p style="text-align:right">[5] <a href=https://marketplace.visualstudio.com/items?itemName=ms-toolsai.vscode-ai>Visual Studio Code Tools for AI</a></p>
<p style="text-align:right">[6] <a href=https://marketplace.visualstudio.com/items?itemName=donjayamanne.jupyter>Jupyter</a></p>
<p style="text-align:right">[7] <a href=https://marketplace.visualstudio.com/items?itemName=jithurjacob.nbpreviewer>VS Code Jupyter Notebook Previewer</a></p>
