In [1]:
# Import libraries to embed videos and scripts (this cell is hidden in the slides):
from IPython.display import display, HTML

# Girls Can Do: Machine Learning!

### Who are we?
#### Emma Scala
Geophysicist for 17 years, now exploring the possibilities of Data Science

<img src="http://vbpr.no/wp-content/uploads/2016/09/PCable-Norwegian-Sea-seismic-sample.jpg" alt="Seismic section" title="Seismic Section" style="width: 50pc;"/>

### Who are we?
#### Harjeet Harpal
Statistics student, working at AIA Science as a Junior Data Scientist

# What is Machine Learning?

Machine learning is the branch of computer science that aims at developing Artificial Intelligence.

Machine learning:
- first, makes use of computers to interpret available data and find _patterns_ in it
- then, allows computers to get the knowledge to make predictions for new (still unobserved) data. 

## Goal:
We want machines to "learn" from examples instead of being explicitly programmed to perform a task or make a decision.

## How do we do it?
We use *__Data Science__* to extract information from (lots of!) data in order to "teach" the AI how to predict a variable or classify an object.

## And what is Data Science, then?
What we call
- Data Science
- Data Mining
- Data Analytics

are in practice &nbsp;&nbsp; &asymp; &nbsp; __Statistics__: the branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data.

# Getting knowledge from data: the scientific method
<center><img src="https://github.com/DataForGood-Norway/GirlsCanDoIt/blob/master/MachineLearning/Slides/files/ScientificMethod.png?raw=true" title="Statistics" width="70%"/></center>

Even for humans, telling a dog from a cat can be difficult _just_ from a picture:


<center><img src="https://github.com/DataForGood-Norway/GirlsCanDoIt/blob/master/MachineLearning/Slides/files/dogcat.jpg?raw=true" title="Cat Ground Truth" width="50%"/></center>

(Add cute picture to slides: &#10003;)

# Data Scientist's skill set(s)
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
<img src="https://github.com/DataForGood-Norway/GirlsCanDoIt/blob/master/MachineLearning/Slides/files/Data-Science-vs.-Data-Analytics-vs.-Machine-Learning.jpg?raw=true" title="Data Scientist Skill Set" width="90%"/>

### The danger zone
http://www.tylervigen.com/spurious-correlations
<img src="https://github.com/DataForGood-Norway/GirlsCanDoIt/blob/master/MachineLearning/Slides/files/spurious_correlation.png?raw=true" title="Spurious Correlation" />

<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.</p>&mdash; Josh Wills (@josh_wills) <a href="https://twitter.com/josh_wills/status/198093512149958656?ref_src=twsrc%5Etfw">May 3, 2012</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

# Our tools:
- Your computer! (or a partner)
- Google Account to run some Python code on a [CoLaboratory](https://research.google.com/colaboratory/faq.html) notebook

# Why Python?
- Easy-to-learn language: a great place to start if you have never coded before!
- Powerful
- Widely used (= lots of packages and learning resources!)
- Growing popularity

<img src="https://github.com/DataForGood-Norway/GirlsCanDoIt/blob/master/MachineLearning/Slides/files/Google_Trends.jpeg?raw=true" title="R vs Python trends" />

# Why not R?
R is another programming language we could have used (we recommend [trying it yourself](https://www.r-project.org/)!), but Python is a more general-purpose language and seems to be the more popular of the two.

... and we couldn't set it up for this workshop because we didn't want to ask you to install anything :)


# But why should we code at all?
As a rule of thumb:

#### _The easier it is to use a tool, the less flexible it is._

# Outline:
- Types of machine learning
    - Examples of applications of machine learning
- Types of data (and how we learn from it)
    - Laboratory

<img src="https://github.com/DataForGood-Norway/GirlsCanDoIt/blob/master/MachineLearning/Slides/files/ml_types.png?raw=true" title="Types of Machine Learning" />

(We won't deal with reinforcement learning this time, since it requires us to talk about Markov processes and maximization of cumulative reward.)

## Examples of applications of Machine Learning:
### through _supervised_ learning:
When we have observations of the **target variable** (what we want to predict)
- Sports
- Astronomy
- Price-making for flight tickets

Other examples: stock market, house pricing, prediction of salary, lifetime earnings, etc.

[Women computers](https://www.atlasobscura.com/articles/how-female-computers-mapped-the-universe-and-brought-america-to-the-moon) at the Harvard College Observatory, who worked for the astronomer Edward Charles Pickering.
<center><img src="https://github.com/DataForGood-Norway/GirlsCanDoIt/blob/master/MachineLearning/Slides/files/astronomy_b4_computers.jpg?raw=true" title="Manual labelling in astronomy" width="70%" /></center>

Predicting athletes' success: [Moneyball](https://www.imdb.com/title/tt1210166/)
<center><img src="https://github.com/DataForGood-Norway/GirlsCanDoIt/blob/master/MachineLearning/Slides/files/moneyball.jpg?raw=true" title="Moneyball" width="70%"/></center>

(Add Brad's picture to slides: &#10003;)

## Examples of applications of Machine Learning:
### through _unsupervised_ learning
When we don't have observations for a target variable, but we **find patterns in data**

- Netflix recommendations
- Fraud detection
- Natural Language Processing, sentiment analysis

Other examples, where finding patterns is usually combined with other kinds of learning: self-driving cars, image recognition, article-writing bots

Or [book-writing](http://botnik.org/content/harry-potter.html) bots!
<center><img src="https://github.com/DataForGood-Norway/GirlsCanDoIt/blob/master/MachineLearning/Slides/files/harry_potter.jpg?raw=true" title="AI Harry Potter" width=40% /></center>

Self-driving cars, finally!
<center><img src="https://github.com/DataForGood-Norway/GirlsCanDoIt/blob/master/MachineLearning/Slides/files/selfdriving_car.gif?raw=true" title="Self-driving car"/></center>

In [2]:
# Google Duplex: AI assistant making call
yt = """<center>
<iframe width="560" height="315" src="https://www.youtube.com/embed/bd1mEm2Fy08?rel=0&amp;showinfo=0&amp;start=43" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
</center>"""
display(HTML(yt))

In [3]:
# Pixel Buds instant translation
yt = """<center>
<iframe width="560" height="315" src="https://www.youtube.com/embed/7PknYqkG9TE?rel=0&amp;showinfo=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
</center>"""
display(HTML(yt))

# Types of data (and how the machines learn from it)

<img src="https://github.com/DataForGood-Norway/GirlsCanDoIt/blob/master/MachineLearning/Slides/files/ml_task.png?raw=true" title="Types of Data" />

# The learning process: training and testing a model

The process of defining a model takes place in two phases:
- Training
- Testing

So we need to split our data into a **training dataset** and a **test dataset** before working on it.

## But why?
We "sacrifice" some data we could train our model on, because we want to have a measure of its performance.
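
As a minimal sketch of what such a split can look like, the cell below uses `train_test_split` from `scikit-learn`. The toy data, the 25% test share, and the variable names are assumptions made up for illustration; they are not taken from the laboratory notebooks.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up for illustration): 20 observations with a single feature
X = np.arange(20).reshape(-1, 1)
y = 2 * X.ravel() + 1

# Keep 25% of the observations aside: the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

print(len(X_train), "training observations,", len(X_test), "test observations")
```

The model is then fitted on `X_train` and `y_train` only, and its performance is measured on `X_test` and `y_test`.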

# The overfitting problem

We want to have a model that is good enough to fit our training data, but not to the point where it models even the noise.

In [4]:
### WARNING: this piece of code is quite advanced, but we thought of leaving it in just to satisfy curiosity
# This is how we generated the overfitting problem plots
# (http://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def true_fun(X):
    return np.cos(1.5 * np.pi * X)

np.random.seed(0)

n_samples = 30
degrees = [1, 4, 15]

X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1

plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i],
                                             include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    # Evaluate the models using cross-validation
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)

    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
        degrees[i], -scores.mean(), scores.std()))
plt.show()


<img src="https://github.com/DataForGood-Norway/GirlsCanDoIt/blob/master/MachineLearning/Slides/files/under_overfitting.png?raw=true" alt="Overfitting vs Underfitting" title="Overfitting vs Underfitting" style="width: 50pc;"/>

In [5]:
### Break!
# CGP Grey video: How Machines Learn
yt = """<center>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/R9OHn5ZF4Uo?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
</center>"""
display(HTML(yt))

Why don't we use Neural Networks for everything, then?
#### _You don't take the car for a five-minute walk._

# Laboratory notebooks:
- Continuous data, with a task-driven model (we want to make predictions)
    - We'll tackle it with [**Linear Regression**](https://colab.research.google.com/github/DataForGood-Norway/GirlsCanDoIt/blob/master/MachineLearning/Lab/Linear%20Regression%20Example.ipynb)
- Discrete data, with a data-driven model (we don't know which classes exist or how many there are, so we'll see how the data tends to cluster)
    - We'll tackle it with [**K-means Clustering**](https://colab.research.google.com/github/DataForGood-Norway/GirlsCanDoIt/blob/master/MachineLearning/Lab/Clustering%20Example.ipynb)

# What are notebooks?
Documents created with _Jupyter Notebook_ that support:

- **Python code**: code to run to perform data analysis 
- _Markdown_ or _HTML code_
    - the analysis description (text)
    - the results (figures, tables, equations, etc.)
    
... basically everything you've seen so far in these slides (videos!)

# Linear Regression: a quick introduction

We want to find the linear function: $$\pmb{F}(\pmb{X}) = \pmb{y}$$ where:

* $\pmb{X}$ is your matrix of **Features**
* $\pmb{y}$ is your vector of **Target values**
* $\pmb{F}$ is the **Model**/function that will be fitted on your data to match the target values from their features.
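
To make the three symbols concrete, here is a small sketch with `scikit-learn`'s `LinearRegression`. The numbers are made up for illustration (the target is exactly three times the feature plus two), and the variable names simply mirror the symbols above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data for illustration: the target is exactly 3 * feature + 2
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])  # matrix of features (5 rows, 1 column)
y = 3 * X.ravel() + 2                               # vector of target values

F = LinearRegression()  # the model
F.fit(X, y)             # fit the model on the data

slope = F.coef_[0]
intercept = F.intercept_
prediction = F.predict(np.array([[5.0]]))[0]  # prediction for a new observation
print(slope, intercept, prediction)
```

Because this toy data lies perfectly on a line, the fitted slope and intercept recover the values used to build it; real data is noisier.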

<center><img src="https://github.com/DataForGood-Norway/GirlsCanDoIt/blob/master/MachineLearning/Slides/files/features_target.png?raw=true" title="Features & Target"/></center>

## What we want to achieve:
<img src="https://github.com/DataForGood-Norway/GirlsCanDoIt/blob/master/MachineLearning/Slides/files/OLS_data.jpeg?raw=true" alt="OLS data points" title="OLS data points"/>

## Fitting a line through data
<img src="https://github.com/DataForGood-Norway/GirlsCanDoIt/blob/master/MachineLearning/Slides/files/OLS_fitted.jpeg?raw=true" alt="OLS fitted line" title="OLS fitted line"/>

## How to find the fitting line

To find the line that fits "the best" through all our data points, we'll first compute the
<h3><center>Squared error</center></h3>
which is the sum of the squares of the errors of our model.
The error is the difference between the actual value of the target variable and the prediction our model made for it.
$$\sum \left(\pmb{y_{test}} - \pmb{F}\left(\pmb{X_{test}}\right)\right)^2$$
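
The formula can be computed in a couple of lines with `numpy`. The values below are made up for illustration; `y_predicted` stands in for $\pmb{F}\left(\pmb{X_{test}}\right)$.

```python
import numpy as np

# Made-up values for illustration
y_test = np.array([3.0, 5.0, 7.0])       # actual target values
y_predicted = np.array([2.5, 5.0, 8.0])  # what a model F predicted for X_test

errors = y_test - y_predicted            # one error per data point
squared_error = np.sum(errors ** 2)      # sum of the squared errors
print(squared_error)
```

Squaring makes every error count as positive, so an overestimate and an underestimate cannot cancel each other out.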

### Errors, visually
<center><img src="https://github.com/DataForGood-Norway/GirlsCanDoIt/blob/master/MachineLearning/Slides/files/residuals_regression.png?raw=true" title="Errors, visually"/></center>

# K-means Clustering: a quick introduction

K-means clustering is a type of **unsupervised learning**: used when you have unlabeled data (i.e., data without defined categories or groups).

**Goal**: find groups in the data, with the number of groups represented by the variable K.

Rather than defining groups _before_ looking at the data, clustering allows you to find and analyze the groups that have formed _organically_.

For this reason it can be used as a data pre-processing technique that will help us to find outliers, **having no previous knowledge** about the population.

## The algorithm
**Input**:
- all data points (observations)
- number of desired groups K


**Output**:
- centroids of the K clusters, which can be used to label new data
- labels for the training data (each data point is assigned to a single cluster)
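
A short sketch of these inputs and outputs using `scikit-learn`'s `KMeans`. The six data points and the choice of K = 2 are assumptions made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Input 1: all data points (made up here: two well-separated groups)
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

# Input 2: the number of desired groups K
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

centroids = kmeans.cluster_centers_  # output 1: one centroid per cluster
labels = kmeans.labels_              # output 2: one label per training point
new_label = kmeans.predict(np.array([[7.5, 8.5]]))[0]  # label new data
print(centroids, labels, new_label)
```

The cluster numbers themselves (0 or 1) are arbitrary: what matters is which points end up sharing the same label.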


### Iterative algorithm:

<code>assign random values to the features of the K centroids</code>

<code>repeat:
    1) assign/re-assign each data point to its nearest centroid
    2) recompute centroids --> new centroids = mean of all data points assigned to that centroid's cluster</code>
    
<code>until a stopping criterion is met</code>
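
The loop above can be sketched in plain `numpy`. This is only an illustration under a few assumptions: the function name `kmeans_sketch` is hypothetical, the initial centroids are K data points picked at random (one common way to start), and the stopping criterion is "the centroids stopped moving, or a maximum number of iterations was reached".

```python
import numpy as np

def nearest_centroid(points, centroids):
    # Distance from every point to every centroid, then index of the closest one
    distances = np.linalg.norm(points[:, np.newaxis, :] - centroids[np.newaxis, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans_sketch(points, k, max_iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    # Assumption: start from K data points picked at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iterations):
        # 1) assign/re-assign each data point to its nearest centroid
        labels = nearest_centroid(points, centroids)
        # 2) recompute centroids as the mean of the points assigned to them
        #    (a centroid with no points keeps its previous position)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        # Stopping criterion: the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, nearest_centroid(points, centroids)

# Made-up data with one feature per point, so it is easy to check by hand
points = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
centroids, labels = kmeans_sketch(points, k=2)
print(centroids.ravel(), labels)
```

In the laboratory we rely on `scikit-learn` rather than a hand-written loop, which also takes care of details such as trying several random starts.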

### K-means algorithm, visually
<center><img src="https://github.com/DataForGood-Norway/GirlsCanDoIt/blob/master/MachineLearning/Slides/files/Kmeans.gif?raw=true" title="Kmeans algo animation"/></center>

# How to install Python?
## ... we install Anaconda!
Anaconda is a Python Data Science platform. It bundles Python itself with the libraries, IDEs, and tools needed for data processing, so there is no need to install anything else:
https://docs.anaconda.com/anaconda/install/#

# Python course
https://www.datacamp.com/learn-python-with-anaconda

# Python libraries we used
All of them are included in the Anaconda distribution!
- **[`numpy`](http://www.numpy.org/)**: numerical analysis, manipulating arrays
- **[`pandas`](https://pandas.pydata.org/)**: data manipulation (`DataFrames` are great!)
- **[`matplotlib`](https://matplotlib.org/index.html)**: plots
- **[`sklearn`](http://scikit-learn.org/stable/)**: `scikit-learn` is probably _The Library_ for Machine Learning, well supported by its community

And for the notebooks we used **[`jupyter notebook`](http://jupyter.org/install)**.

# What's next?
- Getting you set up on your machine
- Exploring `scikit-learn`'s functionalities
- Natural Language Processing

...any suggestion?

### Thanks to:
- Girls Can Do IT
- Data For Good Norway

### Special thanks to:
Patrick Merlot
