# Machine Learning with scikit-learn

## What is Machine Learning?

As a one-line version (not entirely original), I like to think of machine learning as "statistics on steroids."


![Wikipedia entry](../img/ML-Wikipedia.png)

Cite: [Wikipedia, 09:29, 2018 October 4](https://en.wikipedia.org/w/index.php?title=Machine_learning&oldid=862453222)


## What is scikit-learn?

Scikit-learn provides a large range of machine learning algorithms unified under a common and intuitive API. Most of the dozens of model classes share nearly the same calling interface. Very often, as we will see in examples below, you can substitute one algorithm for another with almost no change to your underlying code. This lets you explore the problem space quickly, and often arrive at an optimal, or at least satisficing, approach for your problem domain or dataset.

* Simple and efficient tools for data mining and data analysis
* Accessible to everybody, and reusable in various contexts
* Built on NumPy, SciPy, and matplotlib
* Open source, commercially usable - BSD license
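
The common calling interface described above can be sketched as follows. This is a minimal illustration, assuming the bundled iris dataset and three arbitrarily chosen classifiers (neither choice comes from the original text); the point is only that `fit()` and `score()` are called identically on each model.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: the iris dataset stands in for "your data"
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Three unrelated algorithms, all driven through the same two methods
models = [
    KNeighborsClassifier(),
    DecisionTreeClassifier(random_state=0),
    LogisticRegression(max_iter=1000),
]

scores = {}
for model in models:
    model.fit(X_train, y_train)            # same training call for every model
    scores[type(model).__name__] = model.score(X_test, y_test)  # mean accuracy

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Swapping in another classifier usually means changing only the entry in the `models` list.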

## Overview of techniques used in Machine Learning

![Scikit-learn topic areas](../img/sklearn-topics.png)

## Classification vs. Regression vs. Clustering

> "If you torture the data enough, nature will always confess." –Ronald Coase

**Classification**

Classification is a type of supervised learning in which the targets for a prediction are a set of categorical values.

**Regression**

Regression is a type of supervised learning in which the targets for a prediction are quantitative or continuous values.

**Clustering**

Clustering is a type of unsupervised learning where you want to identify similarities among collections of items without an a priori classification scheme. You may or may not have an a priori assumption about the number of categories.
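
The three kinds of problem can be contrasted in a short sketch. The synthetic datasets and the specific estimators here are assumptions chosen for brevity, not recommendations from the text. Note that the classifier and the regressor are given targets, while the clustering model is given only the features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic data (made up for illustration)
X_blobs, labels = make_blobs(n_samples=150, centers=3, random_state=0)
X_reg, y_reg = make_regression(
    n_samples=150, n_features=3, noise=5.0, random_state=0
)

# Classification: supervised, targets are categorical labels
clf = LogisticRegression(max_iter=1000).fit(X_blobs, labels)
print("predicted classes:", clf.predict(X_blobs[:5]))

# Regression: supervised, targets are continuous values
reg = LinearRegression().fit(X_reg, y_reg)
print("predicted values:", reg.predict(X_reg[:5]))

# Clustering: unsupervised, no targets are passed to fit()
# Here we assume we already suspect there are 3 groups.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_blobs)
print("cluster assignments:", km.labels_[:5])
```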


This notion of types of variables applies to statistics broadly. Some other concepts are genuinely specific to machine learning.

## Dimensionality Reduction

Dimensionality reduction is most often a technique used to assist with other techniques. When a large number of features is reduced to relatively few, other techniques are very often more successful when applied to these transformed synthetic features. Sometimes the dimensionality reduction itself is sufficient to identify the "main gist" of your data.
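
As one possible illustration, principal component analysis (PCA) is a common dimensionality reduction technique available in scikit-learn. The choice of PCA, of the bundled digits dataset, and of 10 components are all assumptions made for this sketch.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# The digits dataset has 64 features (8x8 pixel intensities) per sample
X, y = load_digits(return_X_y=True)

# Reduce 64 raw features to 10 synthetic features (10 is an arbitrary choice)
pca = PCA(n_components=10, random_state=0)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
# Fraction of the original variance retained by the 10 components
print("variance retained:", pca.explained_variance_ratio_.sum())
```

The reduced array `X_reduced` can then be passed to a classifier or clustering model in place of `X`.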

## Feature Engineering

Very often, the "features" we are given in our original data are not those that will prove most useful in our final analysis. It is often necessary to identify "the data inside the data." Sometimes feature engineering can be as simple as normalizing the distribution of values. Other times it can involve creating synthetic features out of two or more raw features.
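
Both kinds of feature engineering mentioned above can be sketched briefly. The toy values below are made up, and `StandardScaler` and `PolynomialFeatures` are just two of several transformers that could serve here.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Toy data (made up): two raw features on very different scales
raw = np.array([
    [1.60, 55.0],
    [1.75, 72.0],
    [1.82, 90.0],
    [1.68, 61.0],
])

# Normalizing: rescale each column to zero mean and unit variance
scaled = StandardScaler().fit_transform(raw)
print(scaled)

# Synthetic feature: add the product of the two raw features
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
with_interaction = poly.fit_transform(raw)  # columns: x0, x1, x0*x1
print(with_interaction)
```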

## Feature Selection

Raw data often includes features with little to no predictive or analytic value. Identifying and excluding these irrelevant features often improves the quality of a model.
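
A minimal sketch of univariate feature selection follows. To make the effect visible, it appends four columns of random noise to the iris data (an artificial setup assumed for this example) and then asks `SelectKBest` to keep the four highest-scoring features.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Append 4 columns of pure noise that carry no information about y
rng = np.random.default_rng(0)
noise = rng.normal(size=(X.shape[0], 4))
X_noisy = np.hstack([X, noise])

# Score each feature with an ANOVA F-test and keep the best 4
selector = SelectKBest(score_func=f_classif, k=4).fit(X_noisy, y)
X_selected = selector.transform(X_noisy)

kept = selector.get_support(indices=True)
print("kept column indices:", kept)
print(X_noisy.shape, "->", X_selected.shape)
```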

## Categorical vs. Ordinal vs. Continuous variables

Features come in one of three basic types.

Some are categorical (also called nominal): A discrete set of values that a feature may assume, often named by words or codes (but sometimes confusingly as integers where an order may be misleadingly implied).

Some are ordinal: There is a scale from low to high in the data values, but the spacing in the data may have little to no relationship to the underlying phenomenon. For example, while an airline or credit card "reward program" might have levels of Gold/Silver/Platinum/Diamond, there is probably no real sense in which Diamond is "4 times as much" as Gold, even though they are encoded as 1-4.

Some are continuous or quantitative: Some quantity is actually measured such that a number represents the amount of it. The distribution of these measurements is likely not to be uniform and linear (in which case scaling might be relevant), but there is a real thing being measured. Measurements might be quantized for continuous variables, but that does not necessarily make them ordinal instead. For example, we might measure annual rainfall in each town only to the nearest inch, and hence have integers for that feature.
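
For ordinal features, the ordering usually has to be stated explicitly, since an encoder cannot infer it from the names. The sketch below uses `OrdinalEncoder` with the level order exactly as listed in the reward program example; the member data is made up. Note that the encoder assigns codes 0 to 3 rather than 1 to 4, and that the equal spacing of the codes says nothing about the real spacing between levels.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Level order taken from the example in the text, lowest to highest
levels = ["Gold", "Silver", "Platinum", "Diamond"]

# Made-up membership data: one column, one row per member
members = np.array([["Silver"], ["Diamond"], ["Gold"], ["Platinum"]])

# Passing categories explicitly fixes the order; the default would be alphabetical
encoder = OrdinalEncoder(categories=[levels])
codes = encoder.fit_transform(members).ravel()
print(codes)
```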

## One-hot encoding

For many machine learning algorithms, including neural networks, it is more useful to have a categorical feature with N possible values encoded as N features, each taking a binary value. Several tools, including a couple of functions in scikit-learn, will transform raw datasets into this format. Obviously, encoding this way increases dimensionality.
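
A small sketch using scikit-learn's `OneHotEncoder` shows the idea. The color values are made up; a feature with 3 possible values becomes 3 binary columns.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical feature with N = 3 possible values
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default; convert it for display
one_hot = encoder.fit_transform(colors).toarray()

print(encoder.categories_[0])  # column order of the encoded features
print(one_hot)                 # one row per sample, exactly one 1 per row
```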

## Hyperparameters

The notion of parameters was introduced to define the way in which a model was trained. For neural networks, parameters are the weights of all the connections between the neurons. But in other models a similar parameterization exists. For example, in a basic linear regression, the coefficients in each dimension are parameters of the trained/fitted model.

However, many algorithms used in machine learning take "hyperparameters" that tune how the training itself occurs. These may be cutoff values where a "good enough" estimate is obtained, for example. Or there may be hidden terms in an underlying equation that can be set. Or an algorithm may actually be a family of closely related algorithms, and a hyperparameter chooses among them. Models in scikit-learn typically have a number of hyperparameters to set before they are trained (with "sensible" defaults when you do not specify).
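
The distinction can be made concrete with a regularized linear regression. In this sketch (synthetic data, with `Ridge` chosen as an assumption for illustration), `alpha` is a hyperparameter set before training, while `coef_` and `intercept_` are parameters learned by `fit()`.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (made up): y depends linearly on 3 features plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=200)

# Hyperparameter: chosen by us, before training
model = Ridge(alpha=1.0)
print("hyperparameters:", model.get_params())

# Parameters: learned from the data during training
model.fit(X, y)
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```

By scikit-learn convention, attributes that are learned during fitting end with an underscore.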

## Grid Search

While scikit-learn usually provides "sensible" defaults for hyperparameters, there is often a great deal of domain and dataset specificity for which hyperparameters are most effective. An API is provided to search across the combinatorial space of hyperparameter values and evaluate each collection.
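
In scikit-learn this search is available as `GridSearchCV`. The sketch below is a minimal example; the estimator, the candidate values, and the dataset are assumptions picked to keep it small. Every combination in the grid (here 3 × 2 = 6) is evaluated with cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate values for two hyperparameters (arbitrary choices)
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
}

# Each of the 6 combinations is scored with 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("combinations tried:", len(search.cv_results_["params"]))
print("best hyperparameters:", search.best_params_)
print("best mean CV accuracy:", search.best_score_)
```

After fitting, `search` itself behaves like a model trained with the best combination, so `search.predict()` can be called directly.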

## Metrics

## Difference between "Deep Learning" and other ML techniques

### Neural networks

The basic idea of a "multilayer perceptron" is a "feed-forward" artificial neural network, composed of "neurons" arranged in "layers." A common illustration is similar to the one below. This idea of "Hebbian networks" has existed since the 1940s, but it really only became a machine learning technique with Paul Werbos' 1975 introduction of "backpropagation" as a means to train such networks. Either way, the ideas are fairly old.

![Basic perceptron](../img/basic-perceptron.png)

The diagram shows a network with 4 layers and 12 connections (i.e. "parameters"). If it were "fully connected," the diagram would have 16 parameters. What makes a particular trained network special is the set of "weights" on the connections, illustrated and commonly named as subscripted $w$ values.
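
To see how connection weights are counted in a fully connected network, here is a sketch using scikit-learn's `MLPClassifier`. The layer sizes are hypothetical and do not match the diagram: 4 inputs (the iris features), hidden layers of 5 and 3 neurons, and 3 outputs. Each pair of adjacent layers contributes one weight per connection, and each non-input neuron also has a bias term.

```python
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)  # 4 input features, 3 classes

# Hypothetical architecture: 4 -> 5 -> 3 -> 3, fully connected
mlp = MLPClassifier(hidden_layer_sizes=(5, 3), max_iter=2000, random_state=0)
mlp.fit(X, y)

# One weight matrix per pair of adjacent layers
for w in mlp.coefs_:
    print("weight matrix shape:", w.shape)

# Connection weights: 4*5 + 5*3 + 3*3; biases: 5 + 3 + 3
n_weights = sum(w.size for w in mlp.coefs_)
n_biases = sum(b.size for b in mlp.intercepts_)
print("connection weights:", n_weights)
print("bias terms:", n_biases)
```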

For many decades after neural networks were known, they remained a minor area of interest. Usually a variety of other techniques rooted in statistics and linear algebra were more effective in solving problems of classification, regression, and clustering.

Image credit: ["Feedforward Neural Networks", John McGonagle and yushi 21](https://brilliant.org/wiki/feedforward-neural-networks/)

---

### What if we had a LOT more neurons?

In the last decade or less, neural networks (mathematically not much different from those described in the 1940s) grew much larger. For example, the extremely powerful Inception v3 image classifier consists of approximately 23.8 million parameters across about 140 layers. Layers generally each have many more neurons than the half-dozen or fewer shown in textbook illustrations like the one above. Scikit-learn provides basic neural network techniques, but they are mostly suited to the kinds of problems that made sense more than five years ago.

![Inception v3](../img/inception-v3.png)

Classic "fully connected" layers make up only a small number of those used. More than anything else, the reason for this, and its effect, is to limit the combinatorial explosion of connections, keeping the parameter count to only about 24 million.

Image credit: ["Advanced Guide to Inception v3 on Cloud TPU" (Google)](https://cloud.google.com/tpu/docs/inception-v3-advanced)