# Machine Learning and Statistics: Introduction

Material for the [2018 Asterics and Obelics School](https://indico.in2p3.fr/event/16864/).

Content is maintained on [GitHub](https://github.com/Asterics2020-Obelics/School2018/tree/master/mls) and distributed under a [BSD3 license](https://opensource.org/licenses/BSD-3-Clause).

![sponsor-logos](img/sponsor-logos.png)

### What is "Machine Learning"?

Using **machines** to **learn** how to explain data with models.

The "machines" responsible for most of the progress in ML are:
 - software algorithms
 - hardware architectures
 - human ingenuity
 
The "learning" consists of passively identifying statistical correlations, which is very different from how we learn with active experimentation and identifying causal relationships.

### What is "Machine Learning"?

Using machines to learn how to explain **data** with **models**.

![MLS-triangle1](img/MLS-triangle1.png)

## What is "Machine Learning"?

Machine learning uses models to learn from data.

![MLS-triangle2](img/MLS-triangle2.png)

## What is Data?

Data is (are?) a finite set of measurements:
- Usually viewed as a 2D table e.g., spreadsheet, [FITS table](http://docs.astropy.org/en/stable/io/fits/usage/table.html), [Pandas dataframe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)...
- **columns = features**
- **rows = samples** (observations)
- richer data structures (images, [ROOT trees](https://root.cern.ch/root/html/guides/users-guide/Trees.html#trees), etc.) must be flattened.

![data-table](img/data-table.png)
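
As a minimal sketch of what "flattening" can mean, the example below turns a stack of small images into a 2D table with one row per sample and one column per pixel. The array sizes and the random pixel values are assumptions made up for illustration; they are not part of the school material.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical dataset: 5 grayscale images, each 4 x 3 pixels.
images = rng.normal(size=(5, 4, 3))

# Flatten each image into one row: rows = samples, columns = features.
table = images.reshape(len(images), -1)
print(table.shape)

# With the default (row-major) ordering, pixel (i, j) of image n
# ends up in column i * 3 + j of row n.
n, i, j = 2, 1, 2
print(table[n, i * 3 + j] == images[n, i, j])
```

Flattening discards the explicit neighbor relationships between pixels, which is one reason the ordering of features can matter for some methods.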

Questions to ask about your data:
- Are my features categorical / discrete / continuous?
- Is the ordering of my samples significant?
- Are my samples statistically independent and drawn from the same distribution?
- What are the measurement uncertainties?
- Is it binned / un-binned?
- Is there a natural similarity / distance measure on samples (rows)?
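
To illustrate the last question, here is one possible distance measure on rows, sketched under the assumption that all features are continuous: standardize each column, then take the Euclidean distance between samples. The table `X` and its values are hypothetical, and standardized Euclidean distance is only one choice among many.

```python
import numpy as np

# Hypothetical table: 4 samples (rows) x 2 continuous features (columns)
# whose numerical ranges differ a lot because of their units.
X = np.array([
    [1.0, 1000.0],
    [2.0, 1100.0],
    [1.5, 5000.0],
    [8.0, 1050.0],
])

# Standardize each feature to zero mean and unit variance so that no
# feature dominates the distance purely because of its units.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Pairwise Euclidean distances between samples using broadcasting.
diff = Z[:, np.newaxis, :] - Z[np.newaxis, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))
print(D.round(2))
```

The result `D` is a symmetric matrix with zeros on the diagonal. Whether such a distance is "natural" depends on the data: for categorical features or correlated features a different measure is usually more appropriate.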

## What is a Model?

Two important types of models: generative, probabilistic.

All ML algorithms use a model to explain your data.

Models have parameters.

![models1](img/models1.png)

## What is a Model?

Two important types of models: generative, probabilistic.

Models can explain data **and parameters**.

Models have parameters **and hyper-parameters.**

![models2](img/models2.png)
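
The distinction between parameters and hyper-parameters can be sketched with a polynomial fit, where the polynomial degree plays the role of a hyper-parameter (chosen before fitting) and the coefficients are the parameters (learned from the data). The simulated data and the chosen degrees below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Simulated data: a quadratic trend plus Gaussian noise (made up for this sketch).
x = np.linspace(-1, 1, 50)
y = 1.0 - 2.0 * x + 0.5 * x ** 2 + rng.normal(scale=0.1, size=x.shape)

fits = {}
for degree in (1, 2, 5):                      # hyper-parameter: fixed before fitting
    coefs = np.polyfit(x, y, deg=degree)      # parameters: learned from the data
    residuals = y - np.polyval(coefs, x)
    fits[degree] = (coefs, np.mean(residuals ** 2))
    print(degree, len(coefs), fits[degree][1])
```

The mean squared residual on the fitted data never increases with the degree, so it cannot be used on its own to pick the hyper-parameter; that choice needs something extra, such as held-out data or a model-selection criterion.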

## What is Learning?

Three broad types of learning:
 - **Unsupervised: learn to predict new data.**
   - Given data: what patterns are present? (learn a model).
   - Given data and model: how likely is new data to be from the same model? (generate new data).
 - **Supervised: learn to predict specific features of new data.**
   - Classification: predict discrete features (learn a conditional model).
   - Regression: predict continuous features (learn a conditional model).
 - **Inference: explain observed data.**
   - Assuming a model: what parameters (with what uncertainties) best describe my data? (learn a model).
   - Given competing models: which best describes my data? (model selection).
 
(Also: reinforcement learning.)
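
The three types of learning listed above can be contrasted on a single toy dataset. This is a rough sketch, not a recommended workflow: the data are simulated, the Gaussian density model and the straight-line model are assumptions, and the helper names `log_density` and `predict` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Simulated samples with two features, x and y (made up for this sketch).
n = 200
x = rng.normal(loc=0.0, scale=1.0, size=n)
y = 3.0 * x + rng.normal(scale=0.5, size=n)

# Unsupervised: model the distribution of x with a Gaussian, then ask
# how likely new values are under that model.
mu, sigma = x.mean(), x.std(ddof=1)

def log_density(x_new):
    return -0.5 * ((x_new - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# Supervised (regression): learn to predict the feature y from the feature x.
slope, intercept = np.polyfit(x, y, deg=1)

def predict(x_new):
    return slope * x_new + intercept

# Inference: assuming a straight-line model, estimate the slope together
# with its uncertainty from the covariance returned by the fit.
coefs, cov = np.polyfit(x, y, deg=1, cov=True)
slope_err = np.sqrt(cov[0, 0])

print(log_density(0.0), log_density(5.0))
print(predict(1.0))
print(coefs[0], slope_err)
```

The same fit appears in both the supervised and inference steps; what differs is the question being asked (a prediction for new data versus a parameter value with an uncertainty).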

## ML in Astrophysics and Astroparticle Physics

Scientific applications of ML benefit a lot from advances in industry but we work in a different context:
- **We are data producers, not data consumers:**
  - Experiment / survey design.
  - Optimization of statistical errors.
  - Control of systematic errors.
- **Our data measures physical processes:**
  - Measurements often reduce to counting photons, etc., with random errors that are known a priori.
  - Dimensions and units are important.
- **Our models are usually traceable to an underlying physical theory:**
  - Models constrained by theory and previous observations.
  - Parameter values often intrinsically interesting.
- **A parameter uncertainty estimate is just as important as its value:**
  - Prefer methods that handle input data uncertainties (weights) and provide output parameter uncertainty estimates.
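
As one example of a method that accepts input uncertainties and returns output uncertainties, the sketch below performs a weighted linear least-squares fit of a straight line with inverse-variance weights. It assumes independent Gaussian errors with known per-point standard deviations; the simulated data and the error model `sigma` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Simulated measurements with known, per-point errors (assumed for this sketch).
x = np.linspace(0.0, 10.0, 40)
sigma = 0.2 + 0.1 * x
y = 2.0 + 0.5 * x + rng.normal(scale=sigma)

# Design matrix for the model y = p0 + p1 * x.
A = np.column_stack([np.ones_like(x), x])

# Inverse-variance weights: precise points count more than noisy ones.
w = 1.0 / sigma ** 2

# Solve the weighted normal equations (A^T W A) p = A^T W y.
# The inverse of (A^T W A) is the parameter covariance matrix.
cov = np.linalg.inv(A.T @ (w[:, np.newaxis] * A))
params = cov @ (A.T @ (w * y))
errors = np.sqrt(np.diag(cov))

print(params)   # best-fit intercept and slope
print(errors)   # their one-sigma uncertainties
```

Because the input errors are taken as known, the covariance is not rescaled by the fit quality; if the quoted errors were only relative, a rescaling by the reduced chi-square would be a common alternative.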

![outline](img/outline.png)

These notebooks share some functions that are defined in the `machinelearning/mls/` subdirectory of the school repo. To make these functions accessible, you will need to install the corresponding python package into your `school18` conda environment.

First, navigate to your local copy of the [school repo](https://github.com/Asterics2020-Obelics/School2018):
```
cd .../School2018
```
Next activate your school environment if necessary:
```
conda activate school18
```
Finally, make sure you have the latest changes and install the `mls` shared code:
```
git pull
cd machinelearning
python setup.py install
```
Test that this worked using:
```
python -c 'import mls'
```
You only need to perform the steps above once.

To follow along with these notebooks:
```
conda activate school18
cd .../School2018/machinelearning
jupyter notebook Contents.ipynb
```
Note that you need to start jupyter from the `machinelearning/` directory to access the data files used by some of the notebooks.

### Acknowledgement:
**H2020-Astronomy ESFRI and Research Infrastructure Cluster (Grant Agreement number: 653477).**