# Getting Ready for Development

Bruno M. Pacheco

GEIA - Grupo de Estudos em Inteligência Artificial

## Content

1. Operating System
1. Programming Language
1. Dependency Management
1. Project Structure
1. IDE
1. Tools/libs

# Operating System

What is the best OS for machine learning development?

## Why Linux?

### Widely used

* Google uses a Debian derivative
* IBM Watson and even Microsoft's Cortana reportedly run on Linux
* The 100 fastest supercomputers all run Linux

<a href="https://www.quora.com/What-linux-distribution-is-best-for-AI-Machine-Learning-Researchers"><p style="text-align:right">Quora</p></a>

### Dev tools
* Most development tools are built on Linux first and only later ported to Windows
* They usually run more smoothly and with more predictable results on Linux

## Which Linux?

You should use the one you are most comfortable with.

<a href="https://www.slant.co/topics/9702/~os-for-deep-learning"><p style="text-align:right">What are the best OS for deep learning</p></a>

### Arch Linux

- Really raw Linux, which means no bloatware
- But it also means it does not come ready out of the box
- Meant for experienced Linux users, so it gives the user control over everything (no handholding)

### Ubuntu

- Easy to use and works out of the box, with a wide range of software already installed
- Gentler learning curve
- More polished look

### Debian GNU/Linux

- One of the oldest distros, which means a large and mature community
- Limited driver support
- Very standard, and the base for many other distros (Ubuntu included)

## Ubuntu

Because it works pretty much out of the box.

### Installation

1. Prefer LTS releases (LTS = Long Term Support: 5 years of updates instead of the 9 months given to interim releases)
1. Learn Terminal basics
    * cd, ls, rm, mv, mkdir, echo, cat, ">", sudo, --help, chmod,...
1. Learn how to install software through a package manager (apt or apt-get on Ubuntu; yum on Red Hat-based distros)
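
For readers who already know some Python, the sketch below mirrors a few of the terminal commands listed above using only the standard library. It is an illustration of what each command does, not a replacement for learning the shell; the directory and file names (`demo`, `notes.txt`, `readme.txt`) are made up for the example, and everything happens inside a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path

# Work inside a throwaway directory so nothing on the real system is touched.
workdir = Path(tempfile.mkdtemp())

# mkdir demo: create a directory
project = workdir / "demo"
project.mkdir()

# echo "hello" > notes.txt: ">" redirects the output of echo into a file
notes = project / "notes.txt"
notes.write_text("hello\n")

# cat notes.txt: show the contents of the file
content = notes.read_text()
print(content, end="")

# mv notes.txt readme.txt: rename (move) the file
readme = project / "readme.txt"
shutil.move(str(notes), str(readme))

# ls: list the entries of the directory
listing = sorted(p.name for p in project.iterdir())
print(listing)

# chmod 600 readme.txt: only the owner may read and write the file
readme.chmod(0o600)

# rm readme.txt: delete the file
readme.unlink()

# rm -r on the throwaway directory: clean everything up
shutil.rmtree(workdir)
```

`cd`, `sudo` and `--help` have no direct counterpart here: they concern the shell session itself (current directory, privileges, built-in documentation).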

# Programming Language

>"There is no such thing as a ‘best language for machine learning’ and it all depends on what you want to build, where you’re coming from and why you got involved in machine learning"
><p  style="font-size:0.8em; text-align:right">- Developer Economics, <a href="http://vmob.me/DE1Q17"><em>State of the Developer Nation Q1 2017</em></a></p>

![Most Used Languages for ML](img/most_used_languages_ml.png)

## Python 2 or 3?

3, because 2 is **legacy**!
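
As a rough illustration of why the two versions are not interchangeable, the sketch below shows three well-known Python 3 behaviours that differ from Python 2. The strings and numbers are arbitrary examples.

```python
# Python 3: print is a function, so the parentheses are required.
print("Hello, GEIA")

# Python 3: "/" is always true division, "//" is floor division.
# In Python 2, 7 / 2 with two integers would silently floor the result.
half = 7 / 2
floor_half = 7 // 2
print(half, floor_half)

# Python 3: str holds Unicode text, and bytes is a separate type.
text = "Inteligência"
encoded = text.encode("utf-8")
print(type(text).__name__, type(encoded).__name__)
```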

# Dependency Management

## But why should I care about my dependencies?

* Outdated applications/packages
* Co-workers sync
* Deploy to client

## Environments

The purpose of virtual environments is to isolate projects from one another, so that each project has its own dependencies, pinned to its own versions.
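
A minimal sketch of that isolation, using the `venv` module from the Python standard library (not one of the tools recommended in these slides, but available everywhere Python 3 is). The `.venv` directory name and the temporary project folder are assumptions made for the example.

```python
import tempfile
import venv
from pathlib import Path

# Hypothetical project folder: a throwaway directory stands in for a real one.
project_dir = Path(tempfile.mkdtemp())
env_dir = project_dir / ".venv"

# Create the environment. with_pip=False keeps the sketch fast and offline;
# a real project would normally pass with_pip=True.
venv.create(env_dir, with_pip=False)

# pyvenv.cfg is the file that marks this directory as a virtual environment.
config = (env_dir / "pyvenv.cfg").read_text()
print(config)
```

Packages installed while this environment is active end up under `env_dir`, so a second project with its own environment never sees them.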


## Solutions

* pip + virtualenv + virtualenvwrapper [1]
* **conda** [2]
* Docker

<p style="text-align:right" >[1]<a href=https://realpython.com/python-virtual-environments-a-primer/ >Python Virtual Environments: A Primer</a></p>
<p style="text-align:right" >[2]<a href=https://medium.freecodecamp.org/why-you-need-python-environments-and-how-to-manage-them-with-conda-85f155f4353c >Why you need Python environments and how to manage them with Conda</a></p>
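
Whichever solution is chosen, the core idea is to record exactly which package versions a project uses. The sketch below approximates that with the standard library's `importlib.metadata` (Python 3.8+), producing `name==version` lines similar in spirit to a `requirements.txt`. The helper name `pinned_requirements` is made up for the example, and the output is not guaranteed to match what `pip freeze` prints.

```python
from importlib import metadata


def pinned_requirements():
    """Return sorted 'name==version' lines for the installed distributions."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that carry no name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


requirements = pinned_requirements()
# Show only the first few entries; the full list depends on the environment.
print("\n".join(requirements[:5]))
```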

### Conda

Because it is easy and has a lot of packages.

![Conda Draw](img/conda-draw.jpeg)

#### Conda Installation

* **Anaconda** vs
* Miniconda

You also have to choose between the 32- and 64-bit installers and between the Python 2.7 and 3.x versions.

<a href="https://conda.io/docs/user-guide/install/index.html"><p style="text-align:right">Installation</p></a>

#### Getting started

Learn how to:
* Create environments
* Activate environments
* Manage packages (install, update, specify versions, search, remove, etc.)

<a href="https://conda.io/docs/_downloads/conda-cheatsheet.pdf"><p style="text-align:right">Conda Cheatsheet</p></a>
<a href="https://conda.io/docs/user-guide/getting-started.html"><p style="text-align:right">Getting started with conda</p></a>
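
The tasks listed above map onto a handful of `conda` commands. The sketch below only assembles and prints them, so it runs even where conda is not installed. The environment name `ml-project`, the package `numpy` and the version `1.15` are placeholders chosen for the example.

```python
ENV_NAME = "ml-project"  # hypothetical environment name


def conda_commands(env_name):
    """Build the conda command lines for the tasks listed above."""
    return {
        "create": ["conda", "create", "--yes", "--name", env_name, "python=3"],
        "install": ["conda", "install", "--yes", "--name", env_name, "numpy"],
        "pin": ["conda", "install", "--yes", "--name", env_name, "numpy=1.15"],
        "update": ["conda", "update", "--yes", "--name", env_name, "numpy"],
        "search": ["conda", "search", "numpy"],
        "remove": ["conda", "remove", "--yes", "--name", env_name, "numpy"],
        "list_envs": ["conda", "env", "list"],
    }


commands = conda_commands(ENV_NAME)
for task, command in commands.items():
    print(f"{task:10s} {' '.join(command)}")
```

Activation is left out on purpose: `conda activate ml-project` changes the current shell session, so it has to be typed in the terminal itself rather than launched from a script.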

![Conda Channels](img/conda-channels.jpeg)

# Project Structure

> Do machine learning like the great engineer you are, not like the great machine learning expert you aren’t
> <p style="text-align:right" >Martin Zinkevich, <a href=http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf><em>Rules of Machine Learning: Best Practices for ML Engineering</em></a></p>

## Why?

* Readability
* Reproducibility

<p style="text-align:right" ><a href=http://drivendata.github.io/cookiecutter-data-science/#why-use-this-project-structure >Cookiecutter Data Science: Why use this project structure?</a></p>
<p style="text-align:right" >Whitenack, D., <a href=https://www.oreilly.com/ideas/putting-the-science-back-in-data-science ><em>Putting the science back in data science</em></a></p>
<p style="text-align:right" >Towards Machine Learning, <a href=https://towardsml.com/2018/08/06/how-great-products-are-made-rules-of-machine-learning-by-google-a-summary/ ><em>How great products are made: Rules of Machine Learning by Google, a Summary</em></a></p>

> Ever tried to reproduce an analysis that you did a few months ago or even a few years ago? You may have written the code, but it's now impossible to decipher whether you should use make_figures.py.old, make_figures_working.py or new_make_figures01.py to get things done
> <p style="text-align:right" ><a href=http://drivendata.github.io/cookiecutter-data-science/#why-use-this-project-structure >Cookiecutter Data Science: Why use this project structure?</a></p>

## Cookiecutter Data Science

```
# Call cookiecutter with the Data Science boilerplate's GitHub link:
# cookiecutter https://github.com/drivendata/cookiecutter-data-science
# Since cookiecutter-data-science is already downloaded here, it is called by its name
!cookiecutter cookiecutter-data-science
```

```
project_name [project_name]: 
```

```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org
```
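
To get a feel for the layout without installing cookiecutter, the directory skeleton above can be approximated by hand. This is a hand-rolled sketch, not what cookiecutter does internally: it creates only the folders and the `src/__init__.py` file, and skips generated files such as `LICENSE`, `Makefile` and `README.md`. The helper name `create_skeleton` is made up for the example.

```python
import tempfile
from pathlib import Path

# Directories copied from the tree above.
DIRECTORIES = [
    "data/external", "data/interim", "data/processed", "data/raw",
    "docs", "models", "notebooks", "references", "reports/figures",
    "src/data", "src/features", "src/models", "src/visualization",
]


def create_skeleton(root):
    """Create the folder layout under root and return root as a Path."""
    root = Path(root)
    for relative in DIRECTORIES:
        (root / relative).mkdir(parents=True, exist_ok=True)
    # An __init__.py makes src importable as a Python package.
    (root / "src" / "__init__.py").touch()
    return root


# Build the skeleton in a temporary location so nothing is overwritten.
project_root = create_skeleton(Path(tempfile.mkdtemp()) / "project_name")
created = sorted(
    p.relative_to(project_root).as_posix()
    for p in project_root.rglob("*")
    if p.is_dir()
)
print("\n".join(created))
```

Keeping `data/raw` separate from `data/interim` and `data/processed` is what lets the raw dump stay immutable: every transformation writes somewhere else.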