# A Brief Introduction to Machine Learning

# Overview

Welcome! In this IPython notebook we give a brief overview of machine learning, including a discussion of popular applications and algorithms.  

These notes are part of the supplementary material for the textbook Machine Learning Refined (Cambridge University Press). Visit <http://www.mlrefined.com>
for free chapter downloads and tutorials, and [our Amazon site here](https://www.amazon.com/Machine-Learning-Refined-Foundations-Applications/dp/1107123526/ref=sr_1_1?ie=UTF8&qid=1471024084&sr=8-1&keywords=machine+learning+refined) for details regarding a hard copy of the text.

# 1.  How is machine learning used today?

Machine learning as it exists today is a set of data-driven tools for determining how the input and output of a given system are related or - in other words - the rules that govern a system. Let's briefly discuss some popular applications of machine learning today.

* **targeted advertising** - big tech companies (think Google, Facebook, etc.) use machine learning to create rules that determine which online advertisement (the output) to show a user based on their personal information (the input)


* **self-driving cars** - cameras on a self-driving car constantly scan the car's surrounding environment, with the captured images (the input) scanned for objects (the output) like pedestrians, road signs, and road markers that tell the car how to behave


* **face detection on your phone** - commonly built into a smartphone's photo app, the input (an image) to your camera is scanned for the presence of faces (the output), placing a small square around any detected face - the camera is then automatically focused on these regions to help take in-focus snapshots


* **automatic trading** - financial engineers use machine learning to predict the future price of a stock / bond / commodity in order to make buy / sell decisions, typically based on a variety of inputs such as previous prices and economic indicators


* **matching and recommendations** - used to connect people to people for e.g., professional (think LinkedIn) or romantic (think OkCupid / Match) reasons, and people to products (e.g., Amazon, Netflix, etc.)


* **speech recognition** - like face detection, another technology on most smartphones today - voice command systems like Siri, Echo, Google Voice, Cortana, Dragon, etc., are built on a core of machine learning algorithms that translate a voice command (the input) into a command the phone can perform (the output)


* **analytics and business intelligence** - how can we make business process X (the input) more efficient (the output)?


* **genetic data mining** - which few genes (the input) often correlate with the presence of disease X (the output)? If we can determine these relationships, maybe gene-targeted drugs can be developed

# 2  Why is machine learning so hot right now?

The underlying technology of machine learning itself has not actually changed that much in the last 25 years - this includes the fundamental models and mathematical algorithms we will cover in this course (including the popular deep neural networks framework). So why is machine learning so hot today? Because while the fundamentals have largely remained the same, computing power has not - computers have grown exponentially more powerful during this time. The same goes for our access to large datasets - Google, Amazon, Facebook, Apple (at least as we know it today), etc. - none of these enormous data-stockpiling companies existed 25 years ago.

As we will see in this course, learning algorithms are often very data-hungry, meaning that a large amount of data is often required for machine learning to work well in practice. Associated with this need for large quantities of data is a computational burden that does not always scale gracefully with the size of a dataset. Thus, in summary, with today's access to large datasets and powerful computers, machine learning can finally be used to great effect.

# 3 Two very popular machine learning problems today

Let's briefly discuss the two fundamental types of machine learning problems, which are the sum total result of many generations' worth of development in the field. Yes - there are only really two so far - albeit each has a number of subspecies. Virtually every person working in machine learning, from the theoretical researcher to the scrappy data scientist, works on one or more of these tasks (thus a major goal of this course is to familiarize you with the inner workings of these problems). These are the problems of prediction - of which regression and classification are the two major problems - and dimension reduction - of which feature selection and clustering play the most fundamental roles. While these carve out only a narrow band of what we might consider to be human-level (artificial) intelligence - there is still an enormous amount of work to be done in developing new machine learning tasks - the two fundamental problems are nonetheless extremely useful: every application we have discussed and will discuss falls into one of these two basic problem types, with new applications being found all the time.

## 3.1 Prediction problems

With prediction problems (called supervised learning problems in the jargon of machine learning) we are always trying to learn a relationship between some sort of input and output to a system using some example pairs - our dataset. In other words, prediction problems are all about learning how an input is related to its output.

### Regression - predict something that takes on continuous values

I am guessing you have all seen a line or curve fit to a dataset, right? The problem of regression is one of using some sort of input (called a feature in machine learning jargon) or multiple inputs to predict a continuous (or essentially continuous) output. Often this is just done to produce a trend line that summarizes the nature of the data under study. For example, shown below is a dataset of points whose input (feature) is the number of Google search queries for a movie and whose output is its associated opening box office take (this image is taken from [2]). Here the regression line makes the point "quantity of Google search queries is an excellent indicator of opening weekend box office success".

<img src="images/movie_prediction.png" width=400 height=300/>


Here's an even better example - where a trend line really represents the underlying phenomenon extremely well. The figure below (taken from [1]) shows a dataset of quarterly measurements of total student debt in the U.S. from 2006 to 2014 (the input feature is time while the output is total debt in trillions of dollars) along with a corresponding regression line. The regression line (shown in magenta) fits the data extremely well and simply helps deliver the gist of this particular dataset: "student debt in the U.S. increased at a constant (and perhaps terrifying) rate over the years 2006 - 2014". This also gives us a fairly trustworthy tool for predicting total student debt in the future - our regression line: to make an estimate for total student debt in, say, the last quarter of 2015 we just plug this value of time into the learned linear model.

<img src="images/student_debt.png" width=400 height=250/>
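To make the "plug a time value into the learned linear model" step concrete, here is a minimal sketch using NumPy. The numbers are synthetic and purely illustrative (they are not the real student debt figures), and the names `years`, `debt`, and `predict` are assumptions introduced only for this example.

```python
import numpy as np

# Hypothetical, synthetic data: NOT the real student debt measurements.
rng = np.random.default_rng(0)
years = np.linspace(2006, 2014, 33)           # quarterly time stamps, 2006 to 2014
t = years - 2006                              # shift time so the fit is well conditioned
debt = 0.1 * t + 0.4 + rng.normal(scale=0.01, size=t.size)  # roughly linear trend

# Least squares fit of the line: debt ≈ slope * t + intercept
slope, intercept = np.polyfit(t, debt, deg=1)

def predict(year):
    """Evaluate the learned linear model at a given (fractional) year."""
    return slope * (year - 2006) + intercept

# Estimate for the last quarter of 2015 (a time outside the dataset)
print(predict(2015.75))
```

Extrapolating like this is only as trustworthy as the assumption that the linear trend continues beyond the data.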


Of course it's not always the case that a line is a good choice for representing a dataset. For example, here is a dataset of average monthly temperatures in Chicago from 2010 to 2012 (here the input feature is again time while the output is average monthly temperature in Fahrenheit). Fitting a line to this dataset would be counterproductive - the input (time) and output (average temperature) are not linearly related - they are (more or less) related periodically to one another (e.g., every January tends to be cold, and every July tends to be hot). If we want to make predictions (e.g., what will the average temperature be in July of the current year?) we will need to fit some nonlinear curve or function to this dataset (which we'll do around class # 3).

<img src="images/chicago_weather.png" width=600 height=500/>
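As a rough sketch of what fitting a periodic curve could look like, the snippet below compares a straight line with a model built from a constant plus one sine and one cosine of period 12 months. The temperatures are synthetic (not the Chicago data), and the choice of sinusoidal features is an assumption made for illustration rather than the method used later in the course.

```python
import numpy as np

# Hypothetical, synthetic monthly temperatures over 3 years (NOT the Chicago data).
rng = np.random.default_rng(1)
months = np.arange(36, dtype=float)           # month 0 = January of the first year
temps = 50.0 - 25.0 * np.cos(2 * np.pi * months / 12) \
        + rng.normal(scale=2.0, size=months.size)

def features(m):
    """Map time (in months) to [1, sin, cos] features with a 12-month period."""
    m = np.atleast_1d(np.asarray(m, dtype=float))
    return np.column_stack([np.ones_like(m),
                            np.sin(2 * np.pi * m / 12),
                            np.cos(2 * np.pi * m / 12)])

# Least squares fit of the periodic model
weights, *_ = np.linalg.lstsq(features(months), temps, rcond=None)
periodic_error = np.mean((features(months) @ weights - temps) ** 2)

# Least squares fit of a plain line, for comparison
line = np.polyfit(months, temps, deg=1)
line_error = np.mean((np.polyval(line, months) - temps) ** 2)

print(line_error, periodic_error)             # the periodic model fits far better
```

Note that the model is still linear in its weights; only the features are nonlinear functions of time.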



### Classification - predict something that takes on discrete values 

Like regression, classification is a prediction task where we aim to learn the relationship between the input (again called a feature in machine learning jargon), or multiple inputs, and the output of a dataset. In complete analogy to basic regression, where one aims to fit a line to a dataset of input/output points, with basic classification one aims to separate or distinguish two types of data using a line. This is shown figuratively in the picture below (taken from [1]), where two different classes of data points (here colored red and blue respectively) are distinguished by learning the parameters of a line (or in higher dimensions a hyperplane) that nicely separates them. Here each data point consists of two inputs or features, with the corresponding output being its class number (or, likewise, its color).

<img src="images/classification_prototype.png" width=400 height=400/>
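The idea of learning the parameters of a separating line can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic two-class data, using gradient descent on a logistic (softmax-style) cost as one possible choice; the data, step size, and iteration count are all assumptions made for the example.

```python
import numpy as np

# Hypothetical, synthetic data: two well-separated clusters with labels -1 and +1.
rng = np.random.default_rng(2)
n = 50
X = np.vstack([rng.normal(loc=-2.0, scale=0.5, size=(n, 2)),
               rng.normal(loc=+2.0, scale=0.5, size=(n, 2))])
y = np.concatenate([-np.ones(n), np.ones(n)])

# Parameters of the line w[0]*x1 + w[1]*x2 + b = 0
w = np.zeros(2)
b = 0.0
step = 0.5

for _ in range(200):
    margins = y * (X @ w + b)
    # Derivative of log(1 + exp(-margin)) with respect to (x . w + b)
    coeff = -y / (1.0 + np.exp(margins))
    w -= step * (X.T @ coeff) / len(y)
    b -= step * coeff.mean()

# A point is assigned to a class according to which side of the line it falls on
predictions = np.sign(X @ w + b)
accuracy = np.mean(predictions == y)
print(accuracy)
```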


Have you ever taken a picture with your smartphone and noticed how a little square gets placed around any face in your viewfinder? This is a classification task known as face detection, and it is done so that your phone's camera knows where to focus when taking a photo. Facebook and other photo sharing services often employ face detection as well in order to organize photos efficiently. Below is a diagram from [1] that shows pictorially how face detection is framed as a classification problem. First an input image (here a shot of the famous Wright Brothers [3], inventors of the airplane) is examined block-by-block and the features of each block, once extracted, are represented as a point in some high dimensional space (as shown figuratively in the right panel). If these features are designed properly then those blocks containing faces (here colored blue) should be separable by a hyperplane (whose parameters we must learn) from those which contain other items in the image (here colored red).

<img src="images/wright_bros_face_detect.png" width=600 height=500/>


The task of identifying general objects in a visual scene (e.g., bicycles, chairs, etc.) is also a classification task, called object detection. Beyond being a staple ability of what we might consider to be even relatively intelligent (artificial) beings, the process of object detection is indispensable in e.g., self-driving cars, so that a self-driving vehicle can identify things like road markers, street signs, and pedestrians (which it aims to avoid).

Some additional popular classification tasks include speech recognition, sentiment analysis, high frequency trading, and facial recognition (that is, given a picture of a face, recognizing who that person is).

## 3.2 Dimension reduction - create a simpler representation of a dataset

The phrase 'dimension reduction' often refers to the (proper) shrinking of large dimensional input, i.e., the number of input features to a given problem, or to the (proper) shrinking of a large dataset. Dimension reduction procedures almost always play a supportive role by either enabling more computationally efficient parameter tuning for regression / classification / reinforcement problems, or by providing 'a human in the loop' insight as to how the input and output of a problem are related. The former motivation is strictly practical: training a model on a dataset that has too many input features and / or data points can be a computationally costly (and thus slow) procedure. If you need something to work fast, or even on the fly, then you can make the associated computation much more tractable by properly reducing the dimension of your data before training. The latter reason to reduce the dimension of your input or dataset is more esoteric: you (a human being, purportedly) want to understand the nature of your dataset. How are the inputs and output related? Which input features best predict the associated output? Which individual datapoints are widely representative of large sections of your data? Let's look at a few common examples to grow our intuition.


* **text data** - Web documents are often represented as word-frequency vectors when performing common machine learning tasks (e.g., document clustering, sentiment analysis), and these document vector features can be quite long (from the hundreds to the tens of thousands) as they contain an entry for each word in a reference dictionary. Therefore one often looks to reduce the dimension of text data by e.g., keeping only those words listed X times or more in a given corpus, or by projecting the entire vector onto a proper lower dimensional subspace using Principal Component Analysis or a related method.


* **image data** - A digital grayscale image is made up of many little square pixels (e.g., a megapixel image consists of 1 million pixels) arranged in a regular rectangular grid, and each pixel or little square can be thought of as a number that represents the pixel's brightness level between 0 (black) and 255 (white). In short, a grayscale image is an array (of vertical and horizontal dimensions equal to the number of pixels spanning the height and width of the image, respectively) whose (i,j)th entry is a number between 0 and 255 representing the intensity of the corresponding (i,j)th pixel in the image. A color image consists of three such equally sized arrays, one each for the red, green, and blue channels. This means that even a small thumbnail size (grayscale) image of 100x100 pixels, about the size of a folder icon on your computer desktop, has $100^2 = 10,000$ input features. This is already too large for efficient tuning of predictive tasks like object detection, and because of this a dimension reduction technique called pooling is often (repeatedly) applied to an input image in order to make learning tasks more computationally manageable.


* **financial data** - Financial firms regularly place big bets on whether the price of e.g., stocks, bonds, etc., will rise or fall in the near future, or likewise whether they should be buying / selling / shorting. However it is often unclear a priori which individual inputs (e.g., the price of certain commodities, other stocks, etc.) best predict the value of such instruments, and so often an analyst will collect every possible input they can get their hands on and then try to methodically select the most indicative features from the bunch via cross-validation. This form of dimension reduction arises in almost every application domain, and is often referred to as feature selection.


* **genetic data** - A standard genetics dataset might consist of the genetic profile of each member from a group of test patients, where each genetic profile contains thousands (or tens of thousands) of values resulting from the chemical treatment of a single patient's genetic material. Using this sort of data one would then like to determine what sort of genetic profile is indicative of a particular disease like e.g., diabetes or Alzheimer's. Moreover it is of particular interest to narrow down the thousands of genes measured for each patient to just a handful of powerfully indicative ones, as doing so provides opportunity to design gene-targeted therapies which could potentially ameliorate a disease. This is another example of feature selection and is a universal problem in bioinformatics applications of machine learning.


* **customer data** - By providing strong recommendations companies help onboard customers and keep them engaged with their products, services, or content. Often such recommender systems work, at least in part, by reducing a large customer base down to a handful of fundamental customer profiles. Then recommendations are given to an individual customer based on the products / services / content that other customers of the same profile type have already tried and highly rated. 
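To give a feel for the pooling idea mentioned in the image data item above, here is a minimal sketch of average pooling, one common variant, written with NumPy. The function name `average_pool` and the random test image are assumptions for illustration only; real systems may pool differently (e.g., taking the maximum of each block).

```python
import numpy as np

def average_pool(image, block=2):
    """Shrink a grayscale image by averaging non-overlapping block x block patches."""
    h, w = image.shape
    h_trim, w_trim = h - h % block, w - w % block   # drop edge pixels that do not fill a block
    trimmed = image[:h_trim, :w_trim]
    blocks = trimmed.reshape(h_trim // block, block, w_trim // block, block)
    return blocks.mean(axis=(1, 3))

# Hypothetical 100x100 grayscale image with brightness values in 0..255
rng = np.random.default_rng(3)
image = rng.integers(0, 256, size=(100, 100)).astype(float)

pooled = average_pool(image, block=2)
print(image.size, pooled.size)   # 10000 2500
```

One pass of 2x2 pooling cuts the number of input features by a factor of four, and the operation can be repeated for further reduction.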

# The two basic elements of every machine learning problem

Now that we are somewhat familiar with the two basic problems of machine learning, at least at a high level, let's zoom out one more level to look at two important technical elements shared by both of these tasks. These are referred to as *feature design* and *mathematical optimization* respectively. Each will be a constant subject of the course as we fully discuss machine learning problems.

#  1.  Feature design

Remember that the term *feature* means *input* in the parlance of machine learning.  What, then, does it mean to 'design' a feature or set of features?  We will use the phrase 'feature design' to refer to two ideas - each of which is extremely important in practical applications of machine learning:

1.  Selecting a few relevant features from a large pool of candidates - this very commonly occurs in e.g., the financial and genetic applications discussed previously

2.  Mathematically transforming a given set of inputs to capture nonlinearity in a dataset - this is virtually always done with applications in images, text, and speech 



## 1.1  Selecting relevant features

Very often we must select the most proper input to a machine learning problem like regression because, while we might have something we wish to predict, we do not know which inputs will give us the greatest insight.  For example, if we wanted to predict the price of a particular stock one month from now - what should we use as our input?  Several possibly useful input features might come to mind - e.g., previous prices, certain economic indicators like the federal funds rate, maybe even the general sentiment of insightful financial journalists if we can get ahold of it - but a single 'silver bullet' input feature, i.e., one that perfectly describes the historical price of a stock, is not apparent.

So, based solely on the ignorance of what particular input would work best, a common approach is to try to find as many input features as possible, dump them into the model, and select the ones that are most indicative of our target output.  

Let's look at a very simple example of doing this.  Suppose we're interested in understanding the total amount of student debt in the United States for the past decade or so, and predicting its future value.  This is a regression problem, and we have already seen in our previous introduction to the machine learning problem of regression that indeed the input feature *time* is a fairly good one for this output, as it correlates quite strongly with student debt.  But suppose we did not know this - because commonly in practice we will not have such insight - and that to compensate for our ignorance we gathered two candidate input features (remember that in practice we would try to gather as many viable inputs as we could).  

Our two candidate input features are 1) time (in years) and 2) the annual sales of the Chiquita banana company.  What in the world do banana sales have to do with student debt?  Likely nothing - but let's take a look.  First let's look at the entire dataset - that is, we use both inputs and the output.  Since we have two input features and one output, the full dataset is 3-dimensional.

<img src="images/student_debt_and_chiquita_3D.png" width=500 height=250/>

Now let's look at each input individually with the output - unsurprisingly, just glancing at the left panel (where the input feature is time) and right panel (where it is banana sales), time appears to be a much better input for predicting student debt.  Time is the far better choice of input here: it almost perfectly correlates with the output, whereas the relationship between banana sales and student debt looks vague at best.

<img src="images/student_debt_and_chiquita_2D.png" width=500 height=250/>

The feature design task of *feature selection* - which we will learn about in the course - will allow us to automate the task of selecting the better of these two input features - time - so that we can produce the most useful regression model possible.  More generally it will allow us to determine the best feature or set of features for general regression problems as well.
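As a very rough sketch of what automating this choice might look like, the snippet below scores each candidate feature by the absolute value of its correlation with the output and keeps the highest scorer. This correlation ranking is a simple stand-in chosen for illustration, not necessarily the feature selection method covered in the course, and all numbers (including the "banana sales" values) are synthetic.

```python
import numpy as np

# Hypothetical, synthetic data: NOT real debt or Chiquita sales figures.
rng = np.random.default_rng(4)
time = np.arange(2006, 2015, dtype=float)                         # years 2006..2014
banana_sales = rng.normal(loc=3.0, scale=0.3, size=time.size)     # unrelated to the output
debt = 0.1 * (time - 2006) + 0.4 + rng.normal(scale=0.02, size=time.size)

candidates = {"time": time, "banana_sales": banana_sales}

# Score each candidate by how strongly it correlates (linearly) with the output
scores = {name: abs(np.corrcoef(x, debt)[0, 1]) for name, x in candidates.items()}
best = max(scores, key=scores.get)

print(scores)
print("selected feature:", best)
```

With many candidate features, scoring each one in isolation can miss features that are only useful in combination, which is one reason more careful selection procedures exist.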


## 1.2  Transforming input features to capture nonlinearity

Very often we must transform a given input in order to design the final features we feed into our machine learning model. We do this by leveraging our understanding of the phenomenon under study, and by encoding this knowledge into a tractable mathematical or computational transformation of the given inputs. These transformed features - as we will see - allow for significantly greater learning.

Before diving into the details for a modern problem, let's first discuss a revealing historical example of rule-finding. This will not only set the stage for the typical modern task but will highlight one of the most critical challenges associated with today's machine learning problems.

###   Galileo and the fundamental rule of gravity

Galileo Galilei - the 17th century scientist and philosopher - is perhaps most famous for his championing of the Copernican model of the solar system (in which the sun, rather than the earth, was the center of the universe, contrary to a belief long held since the days of Aristotle) in the face of much scrutiny from the Catholic church - the governing institution of his time and place. But Galileo also discovered a huge array of scientific principles in his lifetime, and put other principles that were perhaps philosophically 'intuitive' at the time on more solid ground by creating experimental evidence of their veracity. His experiments in determining the rule of earth-bound gravity - which was later codified as Newton's second law - are just such an example. They combine an absolutely brilliant experimental design and approach to data collection with a straightforward application of rule finding via machine learning.


In order to quantify the pull of gravity on an object, Galileo designed the following experiment to measure how far an object falls in a given amount of time. The basic idea behind the experiment was to drop an object - like a metal ball - multiple times from a fixed distance above the ground and measure how long it took the ball to traverse certain portions of that distance. However, accurate enough timekeeping devices did not yet exist. (It was Galileo's own study of pendulums that eventually led to the development of humankind's first accurate timepiece: the pendulum clock. This was the most precise instrument for keeping time for some 300 years - from about 1650 until the early 1930s.) He therefore had to slow things down in order to measure time precisely enough, and so instead of dropping the ball he rolled it down a smooth ramp starting from the top, as shown figuratively below (taken from [1]).

<img src="images/galileo_ramp.png" width=500 height=250/>


Repeating this experiment a number of times, Galileo collected data on how long it took the ball to traverse certain portions of the ramp (specifically he measured how long it took the ball to traverse $\frac{1}{4}$, $\frac{1}{2}$, $\frac{2}{3}$, $\frac{3}{4}$ and the full length of the ramp). Repeating this several times, he averaged the results - leaving a single data point representing the average time it took the ball to travel down each fraction of the ramp - as shown below (this data is actually taken from a modern reenactment of Galileo's experiment - see [1]).

<img src="images/galileo_data.png" width=250 height=250/>

From philosophical reflection and visual examination of a dataset very much like this one, Galileo proposed a simple nonlinear rule that appeared to explain or equivalently generate this data: that the distance an object travels due to the pull of gravity is *quadratic* in time.  In other words, that

\begin{equation}
\text{portion of ramp traveled / distance an object travels} = \text{constant} \times (\text{time spent traveling})^2
\end{equation}

Fitting such a quadratic to the above dataset (by properly choosing the value of the constant) we can see that it does indeed represent the dataset quite well.

<img src="images/galileo_data_and_fit.png" width=250 height=250/>
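Choosing the constant "properly" can be done in closed form by least squares: for the model distance = constant x time^2, the best constant is the sum of (time^2 x distance) divided by the sum of time^4. The sketch below illustrates this with NumPy. The ramp portions are those listed in the text, but the travel times are synthetic values generated from an assumed constant; they are not Galileo's measurements nor the reenactment data from [1].

```python
import numpy as np

# Portions of the ramp traveled (from the text)
portions = np.array([1/4, 1/2, 2/3, 3/4, 1.0])

# Hypothetical travel times: generated from an ASSUMED constant plus small timing noise.
rng = np.random.default_rng(5)
assumed_constant = 0.02
times = np.sqrt(portions / assumed_constant) + rng.normal(scale=0.05, size=portions.size)

# Least squares fit of the single parameter in: portion ≈ constant * time^2
t_squared = times ** 2
constant = np.sum(t_squared * portions) / np.sum(t_squared ** 2)

fit = constant * t_squared
print(constant)
print(np.abs(fit - portions).max())   # largest gap between the quadratic rule and the data
```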

Moreover this quadratic rule - derived by examining such a simple dataset - was found to be extremely accurate, standing up to both further empirical examination as well as philosophical study (e.g., it is the basis for Newton's second law of motion).
 

###  From Galileo to machine learning

In the example above, Galileo determined the quadratic rule for gravity by looking at his dataset and by employing his physical intuition.  Machine learning - in its current state - is a set of tools for replicating this (and only this) part of determining the rules that govern a given system.  That is, machine learning can automatically determine (using a dataset) 

1.  The correct nonlinear relationship between the input and output of a system, in other words the correct nonlinear function of the input that predicts the output well - in the case of the Galileo example, that the relationship between the time an object is falling and the distance it has traveled is quadratic

2.   A proper value for the parameters of this (potentially) nonlinear relationship so that the rule fits the dataset well - in the case of the Galileo example this consists of a single constant



Note - very importantly - what is not included here is *how* we get the data itself - an obviously critical component to forging rules.  Machine learning is a substitute for philosophical / scientific understanding and visual examination in the forging of rules, and so relies entirely on having solid datasets to work with.  The severity of this deficiency ranges from problem to problem, and for many of the examples listed in the first part of this section it is not really a problem at all given that the data in those cases is usually plentiful.  But in a case like Galileo's it is a very serious obstacle - here the data was compiled from a seriously ingenious experiment.  In short - machine learning cannot yet 'collect or create the right' data for determining rules; that part is still very much up to us humans.


But enough of what it cannot do - let's celebrate what machine learning can do!  The fact that machine learning can automatically determine the form of the potentially nonlinear relationship between inputs and outputs of a dataset, and tune associated parameters accordingly, gives us incredible power.  This is because there are many instances - as in the examples described in the first part of this section - where we can gather large datasets but the nature of this data - e.g., that it is too high dimensional to visualize - completely prevents us from even proposing a reasonable nonlinear rule.

Take the task of face detection for example - the technology that places a little square around faces when you take a picture with your smartphone (in order to focus the lens on these portions of the captured image).  In order to make this work one first collects a large database of small facial and non-facial images, like those shown below.   

<img src="images/face_detection_data.tif" width=400 height=400/>

In order to make face detection work we want to use such a dataset to derive a rule that distinguishes facial images from non-facial ones.  Remember that a grayscale digital image is made up of many small squares called 'pixels', each of which has a brightness level between 0 (completely black) and 255 (completely white). In other words a grayscale digital image can be thought of as a matrix or array whose $\left(i,j\right)^{th}$ value is equal to the brightness level of the $\left(i,j\right)^{th}$ pixel in the image. (A color image is then just a set of three such matrices, one for each color channel: red, green, and blue.)

<img src="images/nugety_pixels.png" width=500 height=500/>

In other words - our dataset consists of small *input* images (which are high dimensional arrays of pixel values) and their associated *output* type or *class* - either face or non-face.  Because the output class labels 'face' and 'non-face' are not numeric in nature, these labels are translated into distinct numbers - e.g., +1 for a face image and -1 for a non-face image.  So, in order to determine a successful rule distinguishing faces from non-faces we must determine a (potentially) nonlinear function of image pixels which accurately returns +1 if the input image is a face, and -1 otherwise.  That is, for some function $f$ that takes in an input image from the database

\begin{equation}
\text{class of input image} = f(\text{input image pixels}) = \begin{cases}
+1 & \,\,\text{if input is a face}\\
-1 & \,\,\text{if input is not a face}
\end{cases}
\end{equation}

Machine learning, as we will see, can be used to automatically determine a proper form for the function $f$ and properly tune its parameters.    Just think - even determining a proper function for such a problem would be absolutely impossible to do 'by eye' - as we saw Galileo do with the gravity-experiment data - since the image data is far too high dimensional for us to even visualize.

Once machine learning is used to properly determine a function as well as its parameters, when one wants to detect faces in a new full image (as on your smartphone) a small square window is passed over all regions of the input image.  The content of each small windowed image is then passed through the function $f$ to determine if it contains a face or not, as illustrated figuratively below.

<img src="images/sliding_window.bmp" width=500 height=500/>
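The sliding window procedure itself is easy to sketch in code. In the snippet below the function `classify_window` is a hypothetical stand-in for the learned function $f$: it simply calls a window a "face" when its average brightness is high, which is of course not a real face detector. The window size, stride, and toy image are likewise assumptions made for illustration.

```python
import numpy as np

def classify_window(window):
    """Hypothetical stand-in for the learned f: +1 if the window is bright, else -1."""
    return 1 if window.mean() > 128 else -1

def sliding_window_detect(image, size=8, stride=4):
    """Pass a size x size window over the image and collect positions classified as +1."""
    detections = []
    height, width = image.shape
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            window = image[top:top + size, left:left + size]
            if classify_window(window) == 1:
                detections.append((top, left))
    return detections

# Toy grayscale image: all black except one bright 8x8 patch
image = np.zeros((32, 32))
image[8:16, 12:20] = 255.0

print(sliding_window_detect(image))   # [(8, 12)]
```

In practice the window is usually also passed over rescaled copies of the image so that faces of different sizes can be found.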


So, in summary, **Feature Design** is the task of selecting the right input or determining the proper nonlinear relationship between the given input and output of a system - in other words, a nonlinear function of the input that accurately predicts the output.  **However gathering or creating proper datasets is still up to us humans.**

# 2.   Mathematical optimization

Every learning problem has parameters that must be tuned properly to ensure optimal learning. For example, there are two parameters that must be properly tuned in the case of linear regression (with one dimensional input): the slope and intercept of the linear model.  These two parameters are tuned by forming a 'cost function' - a continuous function in both parameters - that measures how well the linear model fits a dataset given a value for its slope and intercept.  The proper tuning of these parameters via the cost function corresponds geometrically to finding the values for the parameters that make the cost function as small as possible or, in other words, *minimize* the cost function.  In the image below - taken from [1] - you can see how choosing a set of parameters higher on the cost function results in a corresponding linear fit that is poorer than the one corresponding to parameters at the lowest point on the cost surface.

<img src="images/bigpicture_regression_optimization.png" width=500 height=250/>
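To make the link between the cost function and parameter tuning concrete, here is a minimal sketch that minimizes a least squares cost for a line using plain gradient descent. The toy data, the mean squared error form of the cost, the step size, and the number of iterations are all assumptions chosen for illustration; the course may use different costs and optimization schemes.

```python
import numpy as np

# Hypothetical, noiseless toy data lying exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

def cost(slope, intercept):
    """Mean squared error of the line slope * x + intercept on the dataset."""
    return np.mean((slope * x + intercept - y) ** 2)

# Start from a poor guess (high on the cost surface) and walk downhill
slope, intercept = 0.0, 0.0
step = 0.05
history = [cost(slope, intercept)]

for _ in range(2000):
    residual = slope * x + intercept - y
    slope -= step * 2 * np.mean(residual * x)       # partial derivative with respect to slope
    intercept -= step * 2 * np.mean(residual)       # partial derivative with respect to intercept
    history.append(cost(slope, intercept))

print(slope, intercept)            # approaches 2 and 1
print(history[0], history[-1])     # the cost shrinks as the fit improves
```

Each step moves the parameters lower on the cost surface, and the corresponding line fits the data better and better.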


This same idea holds true for regression with higher dimensional input, as well as for classification, where we must properly tune the intercept and normal vector of the separating hyperplane.  Again, the parameters minimizing the cost function provide the better classification result.  This is illustrated for classification below - again taken from [1].

<img src="images/bigpicture_classification_optimization.png" width=500 height=250/>

The tuning of these parameters is accomplished by a set of tools known collectively as mathematical optimization. Mathematical optimization is the formal study of how to properly minimize cost functions and is used not only in machine learning, but also in a variety of other fields including operations, logistics, and physics.

So, in summary, **Mathematical Optimization** is the method by which we determine the proper parameters for machine learning models.  When viewed geometrically the pursuit of proper parameters is also the search for the lowest point - or minimum - of a machine learning model's associated cost function.

## References

[1]  Jeremy Watt, Reza Borhani, and Aggelos Katsaggelos. Machine Learning Refined. Cambridge University Press, 2016.

[2]  Reggie Panaligan and Andrea Chen. Quantifying movie magic with Google Search. Google Whitepaper: Industry Perspectives and User Insights, 2013.

[3]  David G. McCullough. The Wright Brothers. Simon & Schuster, 2015.