# Machine Learning with scikit-learn

## Getting Started with This Course

Let us take a look at how we will install the software and learning materials needed for this course...

> <font size="+2">https://github.com/DavidMertz/ML-Webinar</font>

<hr/>

## What Is Machine Learning?

> **"If you torture the data enough, nature will always confess."** –Ronald Coase

As a one line version—not entirely original—I like to think of machine learning as "statistics on steroids."  That characterization may be more cute than is necessary, but it is a good start.  Others have used phrases like "extracting knowledge from raw data by computational means."

The lede on the Wikipedia article provides a bit more.

![Wikipedia entry](img/ML-Wikipedia.png)

Cite: [Wikipedia, 09:29, 2018 October 4](https://en.wikipedia.org/w/index.php?title=Machine_learning&oldid=862453222)

## Machine Learning Libraries

There are many software libraries available for machine learning.  Some of them are listed below.

### For General Machine Learning

* **[scikit-learn](http://scikit-learn.org/)**: Free Software (BSD License). The topic of this course
* **[Spark MLLib](https://spark.apache.org/mllib/)**: Free Software (Apache License 2.0). Spark based machine learning with interfaces to Java, Scala, Python, and R. MLlib fits into Spark's APIs and interoperates with NumPy in Python and R libraries. You can use any Hadoop data source (e.g. HDFS, HBase, or local files), making it easy to plug into Hadoop workflows.
* **[mlpack](https://www.mlpack.org/)**: Free Software (3-clause BSD license; Mozilla Public License v2.0; Boost Software License, version 1.0). A fast, flexible machine learning library, written in C++, that aims to provide fast, extensible implementations of cutting-edge machine learning algorithms. mlpack provides these algorithms as simple command-line programs, Python bindings, and C++ classes which can then be integrated into larger-scale machine learning solutions.
* **[Accord.NET Framework](http://accord-framework.net/)**: Free Software (LGPLv2.1). The Accord.NET Framework is a .NET machine learning framework combined with audio and image processing libraries completely written in C#. It is a complete framework for building production-grade computer vision, computer audition, signal processing and statistics applications even for commercial use.
* **[WEKA](https://www.cs.waikato.ac.nz/ml/weka/)**: Free Software (GPL). Data Mining Software in Java. Weka is a collection of machine learning algorithms for data mining tasks. It contains tools for data preparation, classification, regression, clustering, association rules mining, and visualization. 
* **[Shogun](http://shogun-toolbox.org/)**: Free Software (GPLv3). Shogun is among the oldest of machine learning libraries, but continues to be well maintained and optimized. Shogun was created in 1999 and written in C++. Via SWIG, Shogun can be used in Java, Python, C#, Ruby, R, Lua, Octave, and Matlab. Shogun is designed for unified large-scale learning for a broad range of feature types and learning settings, like classification, regression, or explorative data analysis.
* **[Torch](http://torch.ch/)** and **[PyTorch](https://pytorch.org/)**: Free Software (custom BSD-ish license). Torch is a scientific computing framework with wide support for machine learning algorithms, emphasizing GPU computation. Torch is based on the scripting language Lua; PyTorch provides Python bindings to the underlying engine and C/CUDA implementation. The goal of Torch is to have maximum flexibility and speed in building your scientific algorithms while making the process extremely simple. Torch comes with a large ecosystem of community-driven packages in machine learning, computer vision, signal processing, parallel processing, image, video, audio and networking among others, and builds on top of the Lua community.

### For Deep Learning

* **[TensorFlow](https://www.tensorflow.org/)**: Free Software (Apache 2.0 open source license). TensorFlow is an open source software library for numerical computation using data flow graphs. TensorFlow implements what are called data flow graphs, where batches of data ("tensors") can be processed by a series of algorithms described by a graph. The movements of the data through the system are called "flows". Graphs can be assembled with C++ or Python and can be processed on CPUs or GPUs.
* **[Theano](http://deeplearning.net/software/theano/)**: Free Software (BSD License). Theano is a Python library that lets you define, optimize, and evaluate mathematical expressions, especially ones with multi-dimensional arrays (numpy.ndarray). Using Theano it is possible to attain speeds rivaling hand-crafted C implementations for problems involving large amounts of data. It was written at the LISA lab to support rapid development of efficient machine learning algorithms. Theano is named after the Greek mathematician, who may have been Pythagoras’ wife.
* **[Keras](https://keras.io/)**: Free Software (MIT License). Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Keras allows for easy and fast prototyping (through user friendliness, modularity, and extensibility). It supports both convolutional networks and recurrent networks, as well as combinations of the two. It runs seamlessly on CPU and GPU.
* Apache MXNet
* **[Caffe](http://caffe.berkeleyvision.org/)**: Free Software (BSD 2-Clause License). Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by Berkeley AI Research (BAIR) and by community contributors. Yangqing Jia created the project during his PhD at UC Berkeley. Bindings for Python and MATLAB are part of the library.
* **[Chainer](https://chainer.org/)**: Free Software (MIT License). Chainer supports CUDA computation and runs on multiple GPUs with little effort. Chainer supports various network architectures including feed-forward nets, convnets, recurrent nets and recursive nets. It also supports per-batch architectures. Forward computation can include any control flow statements of Python without lacking the ability of backpropagation. 

### Cloud Focused

* **[Amazon SageMaker](https://aws.amazon.com/sagemaker)**: Commercial. Amazon SageMaker provides fully managed instances running Jupyter notebooks for training data exploration and preprocessing. These notebooks are pre-loaded with CUDA and cuDNN drivers for popular deep learning platforms, Anaconda packages, and libraries for TensorFlow, Apache MXNet, Chainer, and PyTorch.
* **[Google Cloud Machine Learning]()**: Commercial. Google Cloud Machine Learning (ML) Engine is a managed service that allows developers and data scientists to build and bring machine learning models to production. Cloud ML Engine offers training and prediction services, which can be used together or individually. Cloud ML provides access to the Python libraries TensorFlow, Keras, XGBoost, and scikit-learn.
* **[Azure ML Studio](https://studio.azureml.net/)**: Commercial and proprietary. Azure ML Studio allows Microsoft Azure users to create and train models, then turn them into APIs that can be consumed by other services. A wide range of algorithms are available, from both Microsoft and third parties. A free-of-cost trial allows evaluation for eight hours.

## What Is scikit-learn?

Scikit-learn provides a large range of algorithms in machine learning that are unified under a common and intuitive API. Most of the dozens of classes provided for various kinds of models share the large majority of the same calling interface. Very often—as we will see in examples below—you can easily substitute one algorithm for another with nearly no change in your underlying code. This allows you to explore the problem space quickly, and often arrive at an optimal, or at least satisficing$^1$ approach to your problem domain or datasets.

* Simple and efficient tools for data mining and data analysis
* Accessible to everybody, and reusable in various contexts
* Built on NumPy, SciPy, and matplotlib
* Open source, commercially usable - BSD license
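
As a minimal sketch of what "substituting one algorithm for another" can look like, the block below fits two unrelated classifiers through the same `fit()` and `score()` calls. The choice of the bundled iris dataset and of these two estimators is an assumption made purely for illustration; it is not taken from the course materials.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset only; any feature matrix X and target vector y would do
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
# Two different algorithms, one identical calling interface
for model in (KNeighborsClassifier(), DecisionTreeClassifier(random_state=0)):
    model.fit(X_train, y_train)
    scores[type(model).__name__] = model.score(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Only the line that constructs the model changes between algorithms; the training and scoring code stays the same.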

<hr/>

<small>$^1$<i>Satisficing is a decision-making strategy of searching through the alternatives until an acceptability threshold is met. It is a portmanteau of satisfy and suffice, and was introduced by Herbert A. Simon in 1956. He maintained that many natural problems are characterized by computational intractability or a lack of information, both of which preclude the use of mathematical optimization procedures.</i></small>

## Overview of Techniques Used in Machine Learning

The diagram below is from the scikit-learn documentation, but the same general schematic of different techniques and algorithms that it outlines applies equally to any other library.  The classes represented in bubbles mostly will have equivalent versions in other libraries.

![Scikit-learn topic areas](img/sklearn-topics.png)

## Difference between "Deep Learning" and other ML Techniques

### Neural Networks

The basic idea of a "multilayer perceptron" is a "feed-forward" artificial neural network, composed of "neurons" arranged in "layers." A common illustration is similar to the one below. This idea of "Hebbian networks" has existed since the 1940s, but it really only became a machine learning technique with Paul Werbos' 1975 introduction of "backpropagation" as a means to train such networks. Either way, the ideas are fairly old.

![Basic perceptron](img/basic-perceptron.png)

The diagram shows a network with 4 layers and 12 connections (i.e. "parameters"). If it were "fully connected," the diagram would have 16 parameters. What makes a particular trained network special is the set of "weights" in the connections, illustrated and commonly named as subscripted $w$ values.

For many decades after neural networks were known, they remained a minor area of interest. Usually a variety of other techniques rooted in statistics and linear algebra were more effective in solving problems of classification, regression, and clustering.

Image credit: ["Feedforward Neural Networks", John McGonagle and yushi 21](https://brilliant.org/wiki/feedforward-neural-networks/)

---

### What if We Had a LOT More Neurons?

In the last decade or less, neural networks, mathematically not much different from those described in the 1940s, grew much larger. For example, the extremely powerful Inception v3 image classifier consists of approximately 23.8 million parameters across about 140 layers. Layers generally each have many more neurons than the dozen or fewer shown in textbook illustrations like the one above. Scikit-learn has basic neural network techniques, but they are mostly suited to the uses that made sense more than five years ago.

![Inception v3](img/inception-v3.png)

Classic "fully connected" layers make up only a small number of those used. More than anything else, the effect and reason for this is to limit the combinatorial explosion of connections, limiting the parameters to only 24 million.
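
To make the "combinatorial explosion" concrete, here is a small sketch that counts connection weights in a fully connected feed-forward network. The helper `count_weights` and the layer sizes are hypothetical, chosen only for illustration; they do not describe Inception v3 or the diagram above, and bias terms are ignored.

```python
def count_weights(layer_sizes):
    """Count connection weights between consecutive fully connected layers.

    Bias terms are ignored; layer_sizes lists the neurons per layer.
    """
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


# A textbook-sized network: 4*3 + 3*2 connections
small = count_weights([4, 3, 2])

# Hypothetical wider layers: 1000*1000 + 1000*1000 + 1000*10 connections
wide = count_weights([1000, 1000, 1000, 10])

print(small, wide)
```

Because the count grows with the product of adjacent layer widths, a modest increase in layer size multiplies the number of weights to train.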

Image credit: ["Advanced Guide to Inception v3 on Cloud TPU" (Google)](https://cloud.google.com/tpu/docs/inception-v3-advanced)

## Classification versus Regression versus Clustering

### Classification

Classification is a type of supervised learning in which the targets for a prediction are a set of categorical values.

### Regression

Regression is a type of supervised learning in which the targets for a prediction are quantitative or continuous values.

### Clustering

Clustering is a type of unsupervised learning where you want to identify similarities among collections of items without an *a priori* classification scheme. You may or may not have an *a priori* assumption about the number of categories.
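
The three kinds of task can be sketched side by side. The synthetic datasets and the particular estimators below are assumptions picked for brevity, not recommendations; the point is only what kind of target each task uses.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: supervised, targets are categorical labels (0, 1, 2)
X_c, y_c = make_blobs(n_samples=90, centers=3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_c, y_c)
class_preds = clf.predict(X_c[:5])

# Regression: supervised, targets are continuous values
X_r, y_r = make_regression(n_samples=90, n_features=2, noise=5.0, random_state=0)
reg = LinearRegression().fit(X_r, y_r)
value_preds = reg.predict(X_r[:5])

# Clustering: unsupervised, no targets are passed to fit()
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_c)
cluster_ids = km.labels_[:5]

print(class_preds, value_preds, cluster_ids)
```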

## Overfitting and Underfitting

In machine learning models, we have to worry about twin concerns.  On the one hand, we might **overfit** our model to the dataset we have available.  If we train a model extremely accurately against the data itself, metrics we use for the quality of the model will probably show high values.  However, in this scenario, the model is unlikely to extend well to novel data, which is usually the entire point of developing a model and making predictions.  By training in a fine tuned way against one dataset, we might have done nothing more than memorize that collection of values; or at least memorize a spurious pattern that exists in that particular sample data collection.

To some extent (but not completely), overfitting is mitigated by larger dataset sizes.

In contrast, if we choose a model that simply does not have the degree of detail necessary to represent the underlying real-world phenomenon, we get an **underfit** model.  In this scenario, we *smooth too much* in our simplification of the data into a model.
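
A rough numerical sketch of both concerns, separate from the course's own `src.over_under_fit` helpers: polynomials of increasing degree are fit to noisy samples of a sine curve. The data, noise level, and degrees are all assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Noisy training samples from one period of a sine curve (synthetic data)
x_train = np.sort(rng.uniform(0, 2 * np.pi, 30))
y_train = np.sin(x_train) + rng.normal(0, 0.2, x_train.size)

# Noise-free points standing in for "novel data"
x_test = np.linspace(0, 2 * np.pi, 200)
y_test = np.sin(x_test)

train_mse, test_mse = {}, {}
for degree in (1, 3, 9):
    poly = Polynomial.fit(x_train, y_train, degree)
    train_mse[degree] = np.mean((poly(x_train) - y_train) ** 2)
    test_mse[degree] = np.mean((poly(x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse[degree]:.4f}, "
          f"test MSE {test_mse[degree]:.4f}")
```

The straight line (degree 1) smooths too much and does poorly on both sets. Error on the training data keeps shrinking as the degree rises, which is exactly why training error alone cannot reveal overfitting; whether the highest degree actually generalizes worse depends on the particular sample.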

Some illustrations are useful.

In [None]:
from src.over_under_fit import doc, show

In [None]:
doc()

In [None]:
show()

The above example is for a regression, but the same concept applies to categorization or clustering problems.  For example:

In [None]:
from src.over_under_fit import cluster

First let's look at a collection of points about whose clustering we have no *a priori* knowledge.

In [None]:
# "Cluster" everything into just one category
cluster(1)

To the human eye, it would seem reasonable to guess that this represents three categories of observations. Therefore, we can reasonably say that this data is **underfit** by our clustering model.  Indeed, that would also be true if we guessed there were two clusters.

In [None]:
# Guess there might be two categories
cluster(2)

This model is not terrible, and it indeed seems to identify an important difference in the data.  But looking at the base-line known values for the categories, we can see it really is three types:

In [None]:
# Show the "known true" categories
cluster(1, known=True)

If we cluster into three categories algorithmically, we almost (but not quite) recover the underlying truth.  The algorithm puts the categories in arbitrary order, so the colors are rotated; but you can see that most, but not all, of the points are in the same clusters.

In [None]:
cluster(3)

Moving farther along, if we guessed *more* clusters we would start to **overfit** the data, and impute category distinctions that do not exist in the underlying dataset.  In this case we know the true number because we have specifically generated it as such. In real-world data we usually do not know this in advance, so we can only tell by performing various validations on the strength of the fit.
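
One such validation can be sketched with the silhouette score, which rewards tight, well-separated clusters. This is an illustration under assumptions: the three well-separated synthetic blobs below are generated here and are not the data shown by the course's `cluster()` helper, and the silhouette score is only one of several possible checks.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three true groups (assumed centers, for illustration)
centers = [(-5, -5), (0, 5), (5, -5)]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

scores = {}
for k in (2, 3, 5, 15):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k:2d}: silhouette {scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)
```

On cleanly separated data like this the score peaks at the true number of groups; on messier real-world data the peak is often far less clear.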

In [None]:
# Guess there might be 5 categories
cluster(5)

In [None]:
# Guess there might be 15 categories
cluster(15)

## Dimensionality Reduction

Dimensionality reduction is most often a technique used to assist with other techniques. By reducing a large number of features to relatively few, other techniques are very often more successful when applied to the transformed synthetic features. Sometimes the dimensionality reduction itself is sufficient to identify the "main gist" of your data.
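
As a minimal sketch, principal component analysis (PCA) is one common dimensionality reduction technique available in scikit-learn. The bundled digits dataset and the choice of 10 components are assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 1797 small images, each flattened to 64 pixel features
X, y = load_digits(return_X_y=True)

# Reduce 64 raw features to 10 synthetic ones
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

explained = pca.explained_variance_ratio_.sum()
print(X.shape, "->", X_reduced.shape)
print(f"Fraction of variance retained: {explained:.2f}")
```

The reduced matrix can then be passed to a classifier or clustering algorithm in place of the original features.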

## Feature Engineering

Very often, the "features" we are given in our original data are not those that will prove most useful in our final analysis. It is often necessary to identify "the data inside the data." Sometimes feature engineering can be as simple as normalizing the distribution of values. Other times it can involve creating synthetic features out of two or more raw features.
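
Both kinds of feature engineering mentioned above can be sketched briefly. The tiny table uses whimsical, made-up height and lifespan values, and the ratio feature `lifespan_per_m` is a hypothetical synthetic feature invented only to show the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up illustrative observations
df = pd.DataFrame({
    "height_m": [1.7, 1.8, 1.2, 2.3],
    "lifespan_years": [85.0, 92.0, 37.0, 25.0],
})

# A synthetic feature built from two raw features (hypothetical)
df["lifespan_per_m"] = df["lifespan_years"] / df["height_m"]

# Normalize every column to zero mean and unit variance
scaled = StandardScaler().fit_transform(df)
print(pd.DataFrame(scaled, columns=df.columns).round(2))
```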

## Feature Selection

Often, the features you have in your raw data contain some features with little to no predictive or analytic value. Identifying and excluding irrelevant features often improves the quality of a model.
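
A sketch of one way to do this uses scikit-learn's `SelectKBest` with an ANOVA F-test. As an assumption for illustration, three columns of pure random noise are appended to the iris features, and the selector is asked to keep the four most informative columns.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Append three irrelevant features (random noise) as columns 4, 5, 6
rng = np.random.default_rng(0)
noise = rng.normal(size=(X.shape[0], 3))
X_noisy = np.hstack([X, noise])

# Keep the 4 features most strongly related to the target
selector = SelectKBest(f_classif, k=4).fit(X_noisy, y)
X_selected = selector.transform(X_noisy)
kept = selector.get_support(indices=True)

print("Kept column indices:", kept)
```

Here the selector retains the four original measurements and discards the noise columns.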

## Categorical versus Ordinal versus Continuous Variables

Features come in one of three basic types.

### Categorical variables 

Some are **categorical** (also called nominal): A discrete set of values that a feature may assume, often named by words or codes (but sometimes confusingly as integers where an order may be misleadingly implied).

### Ordinal variables

Some are **ordinal**: There is a scale from low to high in the data values, but the spacing in the data may have little to no relationship to the underlying phenomenon. For example, while an airline or credit card "reward program" might have levels of Silver/Gold/Platinum/Diamond, there is probably no real sense in which Diamond is "4 times as much" as Silver, even though they are encoded as 1-4.

### Continuous variables

Some are **continuous** or quantitative: Some quantity is actually measured such that a number represents the amount of it. The distribution of these measurements is likely not to be uniform and linear (in which case scaling might be relevant), but there is a real thing being measured. Measurements might be quantized for continuous variables, but that does not necessarily make them ordinal instead. For example, we might measure annual rainfall in each town only to the nearest inch, and hence have integers for that feature.

This notion of types of variables applies to statistics broadly. Some other concepts are genuinely specific to machine learning.  

## One-hot Encoding

For many machine learning algorithms, including neural networks, it is more useful to have a categorical feature with N possible values encoded as N features, each taking a binary value. Several tools, including a couple of functions in scikit-learn, will transform raw datasets into this format. Obviously, by encoding this way, dimensionality is increased.
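
As a minimal sketch of the scikit-learn route, `OneHotEncoder` turns one categorical column into one binary column per distinct value. The five whimsical labels are illustrative only; the encoder returns a sparse matrix by default, so it is converted to a dense array here for display.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature, shaped as a single column
species = np.array([["human"], ["alien"], ["penguin"], ["octopus"], ["alien"]])

encoder = OneHotEncoder()
onehot = encoder.fit_transform(species).toarray()  # sparse -> dense
categories = list(encoder.categories_[0])

print(categories)
print(onehot)
```

Each row contains a single 1 in the column of its category, so one feature with four possible values becomes four features.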

Let us illustrate using a toy test dataset.  The following whimsical data is suggested in a blog post by [Håkon Hapnes Strand](https://www.quora.com/What-is-one-hot-encoding-and-when-is-it-used-in-data-science).  Imagine we collected some data on individual organisms—namely taxonomic class, height, and lifespan.  Depending on our purpose, we might use this data for either supervised or unsupervised learning techniques (if we had a lot more observations, and a number more features).

In [None]:
data = [
    ['human', 1.7, 85],
    ['alien', 1.8, 92],
    ['penguin', 1.2, 37],
    ['octopus', 2.3, 25],
    ['alien', 1.7, 85],
    ['human', 1.2, 37],
    ['octopus', 0.4, 8],
    ['human', 2.0, 97]
]

In [None]:
# The data with its original feature, just as a DataFrame
import pandas as pd
naive = pd.DataFrame(data, columns=['species', 'height (M)', 'lifespan (years)'])
naive

In [None]:
# The data one-hot encoded
encoded = pd.get_dummies(naive)
encoded.columns = [c.replace('species_','') for c in encoded.columns]
encoded

## Hyperparameters

The notion of parameters was introduced to define the way in which a model was trained. For neural networks, parameters are the weights of all the connections between the neurons. But in other models a similar parameterization exists. For example, in a basic linear regression, the coefficients in each dimension are parameters of the trained/fitted model.

However, many algorithms used in machine learning take "hyperparameters" that tune how the training itself occurs. These may be cut-off values where a "good enough" estimate is obtained, for example. Or there may be hidden terms in an underlying equation that can be set. Or an algorithm may actually be a family of closely related algorithms, and a hyperparameter chooses among them. Models in scikit-learn typically have a number of hyperparameters to set before they are trained (with "sensible" defaults when you do not specify).
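
The distinction can be sketched with a ridge regression, chosen here only as a convenient example. The `alpha` value is a hyperparameter fixed before training, whereas the coefficient and intercept are parameters learned by `fit()`. The four data points are made up and lie on the line y = 2x + 1.

```python
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])  # exactly y = 2x + 1

# Hyperparameter: chosen by us before any training happens
model = Ridge(alpha=1.0)
hyper = model.get_params()
print("Hyperparameters:", hyper)

# Parameters: learned from the data during fitting
model.fit(X, y)
print("Learned coefficient:", model.coef_, "intercept:", model.intercept_)
```

The regularization controlled by `alpha` pulls the learned slope somewhat below the true value of 2, which shows a hyperparameter shaping the parameters that training produces.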

## Grid Search

While scikit-learn usually provides "sensible" defaults for hyperparameters, there is often a great deal of domain and dataset specificity for which hyperparameters are most effective. An API is provided to search across the combinatorial space of hyperparameter values and evaluate each collection.
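
A minimal sketch of that API is `GridSearchCV`, which cross-validates every combination in a grid of candidate values. The estimator, the grid, and the iris dataset below are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 4 x 2 = 8 hyperparameter combinations to evaluate
param_grid = {
    "n_neighbors": [1, 3, 5, 7],
    "weights": ["uniform", "distance"],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

The number of fits grows multiplicatively with each hyperparameter added to the grid, so large grids can become expensive.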

## Metrics

After you have trained a model, the big question is "how good" is the model.  There is a lot of nuance to answering that question, and correspondingly a large number of measures and techniques.

One common technique to look at a combination of successes and failures in a machine learning model is a *confusion matrix*.  Let us look at an example, picking up the whimsical data used above.  Suppose we wanted to guess the taxonomic class of an observed organism and our model had these results:

| Predict/Actual | Human    | Octopus  | Penguin  |
|----------------|----------|----------|----------|
| Human          |  **5**   |    0     |    2     |
| Octopus        |    3     |  **3**   |    3     |
| Penguin        |    0     |    1     |  **11**  |

It is not immediately obvious how to give a single number that describes *how good* the model is.  The model is very good at predicting penguins, but it gets rather bad when it predicts octopi.  In fact, if the model predicts something is an octopus, it probably isn't (only 1/3rd of such predictions are accurate).

### Accuracy versus Precision versus Recall

Naïvely, we might simply ask about the "accuracy" of a model (at least for classification tasks).  This is simply the number of *right* answers divided by the number of data points.  In our example, we have 28 observations of organisms, and 19 were classified accurately, so that's a **68%** accuracy.  Again though, the accuracy varies quite a lot if we restrict it to just one class of the predictions.  For our multi-class labels, this may not be a bad measure.  

Consider a binary problem though:

| Predict/Actual | Positive | Negative |
|----------------|----------|----------|
| Positive       |    1     |    0     |
| Negative       |    2     |   997    | 

Calculating *accuracy*, we find that this model is **99.8%** accurate! That seems pretty good until you think of this test as a medical screening for a fatal disease.  *Two thirds of the people who actually have the disease will be judged free of it by this model* (and hence perhaps not be treated for the condition); that isn't such a happy real-world result.
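
The accuracy figures quoted for both tables can be checked with a few lines of NumPy. The helper name `accuracy` is introduced here for illustration; the matrices are copied from the tables above.

```python
import numpy as np

# Three-class table (human, octopus, penguin)
multi = np.array([[5, 0, 2],
                  [3, 3, 3],
                  [0, 1, 11]])

# Medical screening table (positive, negative)
binary = np.array([[1, 0],
                   [2, 997]])


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all observations."""
    return np.trace(matrix) / matrix.sum()


print(f"Multi-class accuracy: {accuracy(multi):.2%}")
print(f"Screening accuracy:   {accuracy(binary):.2%}")
```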

<hr/>

In contrast with accuracy, the "precision" of a model is defined as:

$$\text{Precision} = \frac{true\: positive}{true\: positive + false\: positive}$$

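Generalizing that to the multi-class case, the formula is as follows, where $i$ is the index of the class and $M_{ij}$ counts the observations whose actual class is $i$ and whose predicted class is $j$ (rows are actual and columns are predicted, as in scikit-learn's `confusion_matrix`; the tables above are laid out the other way around):

$$\text{Precision}_{i} = \cfrac{M_{ii}}{\sum_j M_{ji}}$$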

Applying that to our hypothetical medical screening, we get a precision of **1.0**.  We cannot do better than that.  The problem is with "recall" which is defined as:

$$\text{Recall} = \frac{true\: positive}{true\: positive + false\: negative}$$

Generalizing that to the multi-class case:

$$\text{Recall}_{i} = \cfrac{M_{ii}}{\sum_j M_{ij}}$$

Here we do much worse by having a recall of **33.3%** in our medical diagnosis case! This is obviously a terrible result if we care about recall.
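
These numbers can be reproduced directly from the counts. Note that the tables in this lesson put predictions in rows and actual classes in columns, so in this sketch precision divides by row sums and recall by column sums; the variable names are introduced here for illustration.

```python
import numpy as np

# Medical screening: 1 true positive, 0 false positives, 2 false negatives
tp, fp, fn = 1, 0, 2
binary_precision = tp / (tp + fp)
binary_recall = tp / (tp + fn)
print(f"Screening precision {binary_precision:.3f}, recall {binary_recall:.3f}")

# Three-class table: rows = predicted, columns = actual
# (order: human, octopus, penguin)
table = np.array([[5, 0, 2],
                  [3, 3, 3],
                  [0, 1, 11]])

# Correct predictions over everything predicted as that class
precision = np.diag(table) / table.sum(axis=1)
# Correct predictions over everything actually in that class
recall = np.diag(table) / table.sum(axis=0)

print("Per-class precision:", precision.round(3))
print("Per-class recall:   ", recall.round(3))
```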

### F1 Score

There are several different algorithms that attempt to *blend* precision and recall to produce a single "score."  Scikit-learn provides a number of other scalar scores that are useful for differing purposes (and other libraries are similar), but F1 score is one that is used very frequently.  It is simply:

$$\text{F1} = 2 \times \cfrac{precision \times recall}{precision + recall}$$

Applying that to our medical diagnostic model, we get an F1 score of 50%.  Still not good, but we account for the high precision to some extent.  For intermediate cases, the F1 score provides good balance.
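
The arithmetic for the screening example is short enough to check by hand or in code. The function name `f1` is introduced here for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Medical screening: precision 1.0, recall 1/3
medical_f1 = f1(1.0, 1 / 3)
print(f"F1 score: {medical_f1:.2f}")
```

Because F1 is a harmonic mean, it sits closer to the weaker of the two values than a simple average (about 0.67 here) would.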

F1 score can be generalized to multi-class models by averaging the F1 score across each class, counting only correct/incorrect per class.

### Code Examples

In [None]:
from sklearn.metrics import confusion_matrix
import numpy as np
import pandas as pd

y_true = ["human",   "octopus", "human", "human", "octopus", "penguin", "penguin"]
y_pred = ["octopus", "octopus", "human", "human", "octopus", "human",   "penguin"]
labels = ['octopus', 'penguin', 'human']

In [None]:
cm = confusion_matrix(y_true, y_pred, labels=labels)
print("Confusion Matrix (actual/predict):\n", 
      pd.DataFrame(cm, index=labels, columns=labels), sep="")

recall = np.diag(cm) / np.sum(cm, axis=1)
print("\nRecall:\n", pd.Series(recall, index=labels), sep="")

precision = np.diag(cm) / np.sum(cm, axis=0)
print("\nPrecision:\n", pd.Series(precision, index=labels), sep="")

print("\nAccuracy:\n", np.sum(np.diag(cm)) / np.sum(cm))

In this particular case, F1 score is very close to accuracy.  In fact, using the "micro" averaging method reduces the result to accuracy.  Using the "macro" averaging makes it equivalent to a NumPy reduction from the formula given.

In [None]:
from sklearn.metrics import f1_score
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
print("\nF1 score:\n", weighted_f1, sep="")

In [None]:
print("Naive averaging F1 score:", np.mean(2*(recall*precision)/(recall+precision)))
print(" sklearn macro averaging:", f1_score(y_true, y_pred, average="macro"))

## Next Lesson

**Exploring a data set**: This lesson got us as far as understanding some general concepts in machine learning, with an overview of most of the key ideas.  Next we will start working with a concrete dataset, clean it up and examine it, and begin to use scikit-learn APIs.

<a href="Exploring.ipynb"><img src="img/open-notebook.png" align="left"/></a>