# **Lecture 1: Introduction to Machine Learning**
**Applied Machine Learning**\
Machine learning is one of today's most exciting technologies.\
In this course, we will learn what machine learning is, which techniques matter most, and how to apply them to solve real-world problems.

# **Part 1: What is ML?**
ML in everyday life:

*   Search engines
*   Personal assistants: ML powers speech recognition, question answering, and other intelligent capabilities of smartphone assistants such as Apple Siri.
*   Spam/Fraud detection
*   Self-driving cars






**A Definition of Machine Learning**\
In 1959, Arthur Samuel defined ML as follows:
```
Machine Learning is a field of study that gives computers the ability to learn 
without being explicitly programmed.

```







**An Example: Self-Driving Car** \
A self-driving car system uses dozens of components, including detection of cars, pedestrians, and other objects.\
**A Rule-Based Algorithm:**\
One way to build a detection system is to write down the rules by hand. In practice, it is almost impossible for a human to specify all the edge cases.\
**An ML Approach**\
The ML approach is to teach a computer how to detect objects by showing it many examples of different objects. No manual rule-writing is needed: the computer learns what defines a pedestrian and a car on its own!


## **Why Machine Learning?**
Why is this approach to building software interesting?


*   It allows building practical systems for real-world applications that couldn't be solved otherwise.
*   Learning is widely regarded as a key approach towards building general-purpose artificial intelligence systems.
*   The science and engineering of machine learning offers insights into human intelligence.







# **Part 2: Three Approaches to Machine Learning**

Machine learning is broadly defined as the science of building software that has the ability to learn without being explicitly programmed.\
**How might we enable machines to learn?**

## **Supervised Learning**
The most common approach to machine learning is supervised learning:

1.   First, we collect a dataset of labeled training examples.
2.   We train a model to output accurate predictions on this dataset.
3.   When the model sees new, similar data, it should also make accurate predictions.
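The three steps above can be sketched in a few lines of plain Python. This is only an illustration under made-up assumptions: the data values are invented, and the "model" is a single threshold chosen to maximize training accuracy, which is far simpler than the algorithms covered later. The names `predict` and `accuracy` are hypothetical helpers.

```python
# Step 1: a tiny labeled dataset of (input, label) pairs (made-up values).
train = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]

def predict(threshold, x):
    """Model: predict 1 when the input exceeds the threshold, else 0."""
    return 1 if x > threshold else 0

def accuracy(threshold, examples):
    """Fraction of examples whose label the model predicts correctly."""
    correct = sum(predict(threshold, x) == y for x, y in examples)
    return correct / len(examples)

# Step 2: "training" picks the candidate threshold with the best training accuracy.
candidates = [x for x, _ in train]
best_threshold = max(candidates, key=lambda t: accuracy(t, train))

# Step 3: apply the trained model to new, similar inputs.
new_inputs = [2.5, 7.5]
new_predictions = [predict(best_threshold, x) for x in new_inputs]
print(best_threshold, new_predictions)
```

Here the learned threshold separates the two groups of training points, so new inputs that resemble one group receive that group's label.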





# **Lecture 2: A supervised learning problem**

## **A supervised learning dataset**
Consider a simple dataset for supervised learning. We will use the UCI Diabetes Dataset, a toy dataset that's often used to demonstrate machine learning algorithms.

*   For each patient, we have access to a measurement of their body mass index (BMI) and a quantitative diabetes risk score (from 0-400).
*   We are interested in understanding how BMI affects an individual's diabetes risk.






In [None]:
import numpy as np
import pandas as pd
from sklearn import datasets

# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True, as_frame=True)

# Use only the BMI feature
diabetes_X = diabetes_X.loc[:, ['bmi']]

# The BMI is zero-centered and normalized; we recenter it for ease of presentation
diabetes_X = diabetes_X * 30 + 25

# Collect 20 data points
diabetes_X_train = diabetes_X.iloc[-20:]
diabetes_y_train = diabetes_y.iloc[-20:]

# Display some of the data points
pd.concat([diabetes_X_train, diabetes_y_train], axis=1)

We can also visualize this two-dimensional dataset:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12,4]

plt.scatter(diabetes_X_train, diabetes_y_train, color='red' )
plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Diabetes risk')

## **A Supervised Learning Algorithm (Part 1)**
What is the relationship between BMI and diabetes risk? \
We could assume that risk is a linear function of BMI. In other words, for some unknown $\theta_0, \theta_1 \in \mathbb{R}$, we have:
$$y = \theta_1x + \theta_0$$
where $x$ is the BMI (also called the independent variable), $y$ is the diabetes risk (also called the dependent variable).\
Note that $\theta_1$ and $\theta_0$ are the slope and intercept of the line that relates $x$ to $y$. We call them *parameters*.\
We can visualize this for a few values of $\theta_0, \theta_1$.


In [None]:
theta_list = [(1,2), (2,1), (1,0), (0,1)]
for theta0, theta1 in theta_list:
  x = np.arange(10)
  y = theta1*x + theta0
  plt.plot(x,y)

## **A Supervised Learning Algorithm (Part 2)**
Assuming that $x,y$ follow the above relationship, the goal of the **supervised learning algorithm** is to find a good set of parameters consistent with the data. \
We will see many algorithms for this task. For now, let's call the `sklearn.linear_model` library to find a $\theta_0, \theta_1$ that fit the data well.

In [None]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train.values )

# Make prediction on the training set
diabetes_y_train_pred = regr.predict(diabetes_X_train)

# The coefficients
print('Slope theta1: \t', regr.coef_[0])
print('Intercept theta0: \t', regr.intercept_)
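To give a sense of what "finding good parameters" can mean, here is a minimal sketch of the closed-form least-squares solution for a single feature, written in plain Python. `LinearRegression` fits by ordinary least squares, but this sketch is not its implementation. The function name `fit_line` and the data values are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (theta0, theta1) minimizing the squared error of theta1*x + theta0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the intercept makes the line
    # pass through the point of means (mean_x, mean_y).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    theta1 = cov_xy / var_x
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

# Made-up (BMI, risk) pairs for illustration; not taken from the dataset.
bmi = [20.0, 24.0, 28.0, 32.0]
risk = [90.0, 130.0, 170.0, 210.0]

theta0, theta1 = fit_line(bmi, risk)
print('Slope theta1:', theta1)
print('Intercept theta0:', theta0)

# Predict the risk for a new, hypothetical BMI value.
print('Prediction for BMI 26:', theta1 * 26.0 + theta0)
```

Because the made-up points lie exactly on a line, the fitted line passes through all of them; on real data the fit only minimizes the total squared error.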

## **A Supervised Learning Model** 
The supervised learning algorithm gave us a pair of parameters $\theta_0^*, \theta_1^*$. These define the *predictive model* $f^*$:
$$f^*(x) = \theta_1^*x + \theta_0^*$$
where again $x$ is the BMI, and $y = f^*(x)$ is the predicted diabetes risk score.\
We can visualize the linear model that fits our data:

In [None]:
plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Diabetes risk')
plt.scatter(diabetes_X_train, diabetes_y_train)
plt.plot(diabetes_X_train, diabetes_y_train_pred, color='black', linewidth=2)

## **Predictions Using Supervised Learning**
Given a new dataset of patients with a known BMI, we can use this model to estimate their diabetes risk.\
Given a new $x'$, we can output a predicted $y'$ as:
$$y' = f^*(x') = \theta_1^* x' + \theta_0^*$$
Let's start by loading more data. We will load three new patients (shown in red below) that the model has not seen before:

In [None]:
# Collect three new data points
diabetes_X_test = diabetes_X.iloc[:3]
diabetes_y_test = diabetes_y.iloc[:3]

plt.scatter(diabetes_X_train, diabetes_y_train)
plt.scatter(diabetes_X_test, diabetes_y_test, color='red')
plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Diabetes Risk')
plt.legend(['Initial patients', 'New patients'])


Our linear model provides an estimate of the diabetes risk for these patients


In [None]:
# generate predictions on the new patients 
diabetes_y_test_pred = regr.predict(diabetes_X_test)  

# Visualize the result 
plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Diabetes Risk')
plt.scatter(diabetes_X_train, diabetes_y_train)
plt.scatter(diabetes_X_test, diabetes_y_test, color='red', marker='o')
plt.plot(diabetes_X_train, diabetes_y_train_pred, color='black', linewidth=1)
plt.plot(diabetes_X_test, diabetes_y_test_pred, 'x', color='red', mew=3, markersize=8)
plt.legend(['Initial patients','New patients','Model', 'Predictions'])

## **Why Supervised Learning?**
Supervised learning can be useful in many ways:
*   Making predictions on new data
*   Understanding the mechanisms through which input variables affect targets.



## **Applications of Supervised learning**
Many of the most important applications of machine learning are supervised learning:
*   Classifying medical images,
*   Translating between pairs of languages,
*   Detecting objects in a self-driving car.





## **Part 2: Anatomy of a Supervised Learning problem: Datasets**
We have seen a simple example of a supervised learning problem and an algorithm for solving this problem. \
Let's now look at what a general supervised learning problem looks like.

## **Recall: Three components of A Supervised Machine Learning Problem**
At a high level, a supervised machine learning problem has the following structure:
$$Dataset + Algorithm \rightarrow Predictive ~ Model$$
The predictive model is chosen to model the relationships between inputs and targets. For instance, it can predict future targets.

## **A Supervised Learning Dataset**

We are going to dive deeper into what a supervised learning dataset is. As an example, consider the full version of the UCI Diabetes Dataset seen earlier.\
Previously, we only looked at the patients' BMI, but this dataset records many additional measurements, including age, sex, and blood pressure. We can ask `sklearn` to give us more information about this dataset.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12,4]
from sklearn import datasets

# Load the diabetes dataset
diabetes = datasets.load_diabetes(as_frame=True)
print(diabetes.DESCR)

In [None]:
diabetes

## **A Supervised Learning Dataset: Notation**
We say that a training dataset of size $n$ (e.g., $n$ patients) is a set:
$$\mathcal{D} = \{(x^{(i)},y^{(i)}) \mid i= 1,2,...,n\}$$
Each $x^{(i)}$ denotes an input (e.g., the measurements for patient $i$), and each $y^{(i)} \in \mathcal{Y}$ is a target (e.g., the diabetes risk).\
Together, $(x^{(i)},y^{(i)})$ form a *training example*.\
We can look at the diabetes dataset in this form:

In [None]:
# Load the diabetes dataset
diabetes_X, diabetes_y = diabetes.data, diabetes.target

# Print part of the dataset
diabetes_X.head()
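The set notation maps directly onto a list of (input, target) pairs. The sketch below uses made-up values for three hypothetical patients rather than the real dataset, and is only meant to show how the indices in the notation line up with Python indexing.

```python
# Hypothetical inputs (age, BMI) and targets (risk score) for n = 3 patients.
inputs = [(48, 21.5), (62, 30.1), (35, 26.4)]
targets = [101.0, 214.0, 142.0]

# D = {(x^(i), y^(i)) | i = 1, ..., n}: pair each input with its target.
dataset = list(zip(inputs, targets))
n = len(dataset)

# Python indexes from 0, so training example i in the notation is dataset[i - 1].
x_1, y_1 = dataset[0]
print(n, x_1, y_1)
```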


## **Training Dataset: Inputs**
More precisely, an input $x^{(i)} \in \mathcal{X}$ is a $d$-dimensional vector of the form
$$\begin{align}
    x^{(i)} &= \begin{bmatrix}
                x_1^{(i)} \\
                x_2^{(i)} \\
                \vdots \\
                x_d^{(i)}
                \end{bmatrix}
  \end{align}$$ 

For example, it could hold the measured values of the $d$ features for patient $i$.\
The set $\mathcal{X}$ is called the feature space. Often, we have $\mathcal{X} = \mathbb{R}^d$.\
Let's look at data for one patient:

In [None]:
diabetes_X.iloc[0]

## **Training Dataset: Attributes**
We refer to the numerical variables describing the patient as attributes. Examples of attributes include:

*   The age of the patient,
*   The patient's gender,
*   The patient's BMI

Note that the attributes in the above example have been mean-centered at zero and rescaled so that the sum of squares of each column equals one.

## **Training Dataset: Features**
Often, an input object has many attributes, and we want to use these attributes to define more complex descriptions of the input:

*   Is the patient old and a man? (Useful if old men are at risk)
*   Is the BMI above the obesity threshold?

We call these custom attributes *features*.\
Let's create an "old man" feature 

In [None]:
diabetes_X['old man'] = (diabetes_X['sex']>0) & (diabetes_X['age']>0.05)
diabetes_X.head()

## **Training Dataset: Features**
More formally, we can define a function $\phi: \mathcal{X} \to \mathbb{R}^p$ that takes an input $x^{(i)} \in \mathcal{X}$ and outputs a $p$-dimensional vector
$$\begin{align}
    \phi(x^{(i)}) &= \begin{bmatrix}
                \phi(x^{(i)})_1 \\
                \phi(x^{(i)})_2 \\
                \vdots \\
                \phi(x^{(i)})_p
                \end{bmatrix}
  \end{align}$$ 

We say that $\phi(x^{(i)})$ is a featurized input, and each $\phi(x^{(i)})_j$ is a *feature*.
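As an illustration, a feature map can be written as an ordinary Python function. The sketch below assumes raw, unscaled attributes (age in years, sex as `'M'` or `'F'`, and BMI), which differs from the pre-scaled columns in the dataset. The thresholds of 60 years and a BMI of 30 are assumptions chosen for the example.

```python
def phi(x):
    """Map raw attributes (age, sex, bmi) to a p = 3 dimensional feature vector."""
    age, sex, bmi = x
    return [
        bmi,                                        # phi(x)_1: BMI passed through unchanged
        1.0 if (sex == 'M' and age > 60) else 0.0,  # phi(x)_2: "old man" indicator (assumed cutoff)
        1.0 if bmi >= 30.0 else 0.0,                # phi(x)_3: BMI at or above an assumed threshold
    ]

# Two hypothetical patients.
print(phi((67, 'M', 31.2)))  # [31.2, 1.0, 1.0]
print(phi((45, 'F', 24.0)))  # [24.0, 0.0, 0.0]
```

Each entry of the returned list plays the role of one feature $\phi(x^{(i)})_j$.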

## **Features and Attributes**
In practice, the terms attribute and feature are often used interchangeably. Most authors refer to $x^{(i)}$ as a vector of features (i.e., they have been precomputed).\
We will follow this convention and use attribute only when there is ambiguity between attribute and feature.

## **Feature: Discrete vs. Continuous**
Features can be either discrete or continuous. We will see later that they may be handled differently by ML algorithms.\
The BMI feature that we have seen earlier is an example of a continuous feature.\
We can visualize its distribution:


In [None]:
diabetes_X.loc[:,'bmi'].hist()

Other features take on one of a finite number of discrete values. The `sex` column is an example of a categorical feature.\
In this example, the dataset has been pre-processed such that the two values happen to be `0.05068012` and `-0.04464164`.

In [None]:
print(diabetes_X.loc[:,'sex'].unique())
diabetes_X.loc[:,'sex'].hist()

## **Training Dataset: Targets**
For each patient, we are interested in predicting a quantity of interest, the *target*. In our example, this is the patient's diabetes risk.\
Formally, when $(x^{(i)},y^{(i)})$ form a *training example*, each $y^{(i)} \in \mathcal{Y}$ is the target, and  $\mathcal{Y}$ is the target space.\
We plot the distribution of risk scores below:

In [None]:
plt.xlabel('Diabetes risk')
plt.ylabel('Number of patients')
diabetes_y.hist()

## **Target: Regression vs. Classification**
We distinguish between two broad types of supervised learning problems that differ in the form of the target variables.


*   **Regression**:  The target variable $y$ is continuous. We are fitting a curve in a high-dimensional feature space that approximates the shape of the dataset.
*   **Classification**: The target variable $y$ is discrete. Each discrete value corresponds to a class, and we are looking for a hyperplane that separates the different classes.

We can easily turn our earlier regression example into classification by discretizing the diabetes risk scores into high or low.



In [None]:
# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True, as_frame=True)

# Use only the BMI feature
diabetes_X = diabetes_X.loc[:, ['bmi']]

# The BMI is zero-centered and normalized; we recenter it for ease of presentation
diabetes_X = diabetes_X * 30 + 25

# Collect 20 data points
diabetes_X_train = diabetes_X.iloc[-20:]
diabetes_y_train = diabetes_y.iloc[-20:]

# Display some of the data points
pd.concat([diabetes_X_train, diabetes_y_train], axis=1)

In [None]:
# Discretizing the target
diabetes_y_train_discr = np.digitize(diabetes_y_train, bins=[150])

# Visualize it
plt.scatter(diabetes_X_train[diabetes_y_train_discr==0], diabetes_y_train[diabetes_y_train_discr==0],marker='o', s=80, facecolors='none', edgecolors='green')
plt.scatter(diabetes_X_train[diabetes_y_train_discr==1], diabetes_y_train[diabetes_y_train_discr==1],marker='o', s=80, facecolors='none', edgecolors='red')
plt.legend(['Low-Risk Patients','High-Risk Patients'])

Let's generate predictions for this dataset

In [None]:
%matplotlib inline
# Create the logistic regression object (note: despite its name, this is a classification algorithm!)
clf = linear_model.LogisticRegression()

# Train the model using training sets
clf.fit(diabetes_X_train, diabetes_y_train_discr)

# Make predictions on the training set
diabetes_y_train_pred = clf.predict(diabetes_X_train)

# Visualize it
plt.scatter(diabetes_X_train[diabetes_y_train_discr==0], diabetes_y_train[diabetes_y_train_discr==0], marker='o', s=140, facecolors='none', edgecolors='g')
plt.scatter(diabetes_X_train[diabetes_y_train_discr==1], diabetes_y_train[diabetes_y_train_discr==1], marker='o', s=140, facecolors='none', edgecolors='r')
plt.scatter(diabetes_X_train[diabetes_y_train_pred==0], diabetes_y_train[diabetes_y_train_pred==0], color='g', s=20)
plt.scatter(diabetes_X_train[diabetes_y_train_pred==1], diabetes_y_train[diabetes_y_train_pred==1], color='r', s=20)
plt.legend(['Low-Risk Patients', 'High-Risk Patients', 'Low-Risk Predictions', 'High-Risk Predictions'])

## **Part 3: Anatomy of a Supervised Learning Problem: Learning Algorithm**

Let's now look at what a general supervised learning algorithm looks like.

### **Recall: Three components of A Supervised Machine Learning Problem** 
At a high level, a supervised machine learning problem has the following structure: \
$$Dataset + Algorithm \rightarrow Predictive ~ Model$$
The predictive model is chosen to model the relationship between inputs and targets. For instance, it can predict future targets.

### **The Components of A Supervised Machine Learning Algorithm**
We can also define the high-level structure of a supervised machine learning algorithm as consisting of three components: 

*   A **model class**: the set of possible models we consider
*   An **objective function**, which defines how good a model is
*   An **optimizer**, which finds the best predictive model in the model class according to the objective function.


Let's look at our diabetes dataset for an example:

In [None]:
import numpy as np
import pandas as pd
from sklearn import datasets
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12,4]

# Load the diabetes dataset
diabetes = datasets.load_diabetes(as_frame=True)
diabetes_X, diabetes_y = diabetes.data, diabetes.target

# Print part of dataset
diabetes_X.head()

### **Model: Notation**
We'll say that a model is a function
$$f: \mathcal{X} \to \mathcal{Y}$$
that maps inputs $x \in \mathcal{X}$ to targets $y \in \mathcal{Y}$.\
Often, models have parameters $\theta \in \Theta$. We will then write the model as:
$$f_{\theta}: \mathcal{X} \to \mathcal{Y}$$
to denote that it's parametrized by $\theta$.

### **Model Class: Notation**
Formally, the model class is a set
$$\mathcal{M} \subseteq \{f \mid f:\mathcal{X} \to \mathcal{Y}\}$$
of possible models that map features to targets.\
When the models $f_{\theta}$ are parametrized by $\theta \in \Theta$, we can write the model class as:
$$\mathcal{M} = \{f_{\theta} \mid f_{\theta}:\mathcal{X} \to \mathcal{Y}; \theta \in \Theta\}.$$

### **Model Class: Example**
One simple example is to assume that $x$ and $y$ are related by a linear model of the form
$$y = \theta_0 + \theta_1x_1 + \theta_2x_2 + \dots + \theta_dx_d$$
where $x$ is a featurized input and $y$ is the target.\
The $\theta_j$ are the *parameters* of the model.
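A minimal sketch of this linear model in plain Python is shown below. The function name `f` and the parameter values are made up. Each choice of `theta` picks out one member of the model class.

```python
def f(theta, x):
    """Linear model: theta[0] + theta[1]*x[0] + ... + theta[d]*x[d-1]."""
    assert len(theta) == len(x) + 1, "theta needs one entry per feature plus an intercept"
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))

x = [3.0, 4.0]  # a hypothetical input with d = 2 features

# Two different parameter vectors give two different models from the same class.
prediction_a = f([1.0, 2.0, -0.5], x)  # 1 + 2*3 + (-0.5)*4 = 5.0
prediction_b = f([0.0, 1.0, 1.0], x)   # 0 + 1*3 + 1*4 = 7.0
print(prediction_a, prediction_b)
```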

### **Objectives: Notation**
Intuitively, some models in $\mathcal{M}$ fit the data better than others. To capture this intuition, we define an objective function (also called a *loss function*)
$$J : \mathcal{M} \to [0,\infty),$$

which describes the extent to which $f$ "fits" the data $\mathcal{D} = \{(x^{(i)},y^{(i)}) \mid i = 1,2, \dots, n\}$.

When $f$ is parametrized by $\theta \in \Theta$, the objective becomes a function $J : \Theta \to [0,\infty)$ of the parameters.


### **Objective: Example**
What are some possible objective functions? We will see many, but here are a few examples:


*   Mean Squared Error:
$$J(\theta) = \frac{1}{2n}\sum_{i=1}^n(f_{\theta}(x^{(i)}) - y^{(i)})^2$$
*   Absolute (L1) Error:
$$J(\theta) = \frac{1}{n}\sum_{i=1}^n\left|f_{\theta}(x^{(i)}) - y^{(i)}\right|$$

These are defined for the dataset $\mathcal{D} = \{(x^{(i)},y^{(i)}) \mid i = 1,2, \dots, n\}$.


In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

y1 = np.array([1,2,3,4])
y2 = np.array([-1,1,3,5])

print('Mean Squared Error: %.2f' % mean_squared_error(y1,y2))
print('Mean Absolute Error: %.2f' % mean_absolute_error(y1,y2))
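To make the arithmetic concrete, the sketch below recomputes both quantities by hand in plain Python for the same two arrays. The helper names `half_mse` and `mean_abs_error` are made up for illustration. Note that `sklearn.metrics.mean_squared_error` averages the squared errors without the extra factor of $\frac{1}{2}$, so it reports twice the value of the $\frac{1}{2n}$ objective.

```python
def half_mse(preds, targets):
    """1/(2n) times the sum of squared differences."""
    n = len(targets)
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / (2 * n)

def mean_abs_error(preds, targets):
    """Average absolute difference, as computed by mean_absolute_error."""
    n = len(targets)
    return sum(abs(p - t) for p, t in zip(preds, targets)) / n

y1 = [1, 2, 3, 4]
y2 = [-1, 1, 3, 5]

# Differences are 2, 1, 0, -1, so the squared errors sum to 6 and the absolute errors to 4.
print(half_mse(y1, y2))        # 6 / 8 = 0.75
print(2 * half_mse(y1, y2))    # 1.5, the plain mean squared error
print(mean_abs_error(y1, y2))  # 4 / 4 = 1.0
```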

### **Optimizer: Notation**
At a high level, an optimizer takes an objective $J$ and a model class $\mathcal{M}$ and finds a model $f \in \mathcal{M}$ with the smallest value of the objective $J$:

$$\min_{f\in\mathcal{M}}J(f)$$

Intuitively, this is the function that best "fits" the data on the training set.

When $f$ is parametrized by $\theta \in \Theta$, the optimizer minimizes a function $J(\theta)$ over all $\theta \in \Theta$.
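One possible optimizer is gradient descent, sketched below in plain Python for a one-feature linear model with the $\frac{1}{2n}$ squared-error objective. This is an illustrative choice, not necessarily the method any particular library uses. The data, the starting point, and the `learning_rate` and `steps` values are assumptions picked so that this small example converges.

```python
def objective(theta, xs, ys):
    """J(theta) = 1/(2n) * sum of squared errors for the line theta[1]*x + theta[0]."""
    n = len(xs)
    return sum((theta[1] * x + theta[0] - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

def gradient_descent(xs, ys, learning_rate=0.1, steps=2000):
    """Minimize J by repeatedly stepping against its gradient, starting from (0, 0)."""
    n = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(steps):
        errors = [theta1 * x + theta0 - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / n                             # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / n  # dJ/dtheta1
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1

# Made-up data generated from y = 2x + 1, so the minimizer is known in advance.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1, objective((theta0, theta1), xs, ys))
```

The returned parameters should be close to the intercept 1 and slope 2 used to generate the data, and the objective at those parameters should be close to zero.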


### **Optimizer: Example**
We will see that behind the scenes, the `sklearn.linear_model.LinearRegression` algorithm optimizes the MSE loss:
$$\min_{\theta \in \mathbb{R}^{d+1}}\frac{1}{2n}\sum_{i=1}^n(f_{\theta}(x^{(i)}) - y^{(i)})^2$$

We can easily measure the quality of the fit on the training set and the test set.

Let's run the above algorithm on our diabetes dataset:

In [None]:
from sklearn import linear_model
# Collect 20 data points for training
diabetes_X_train = diabetes_X.iloc[-20:]
diabetes_y_train = diabetes_y.iloc[-20:]

# Create a linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training set
regr.fit(diabetes_X_train, diabetes_y_train.values)

# Make prediction on the training set
diabetes_y_train_pred = regr.predict(diabetes_X_train)

# Collect three data points for testing
diabetes_X_test = diabetes_X.iloc[:3]
diabetes_y_test = diabetes_y.iloc[:3]

# Generate prediction on the new patients 
diabetes_y_test_pred = regr.predict(diabetes_X_test)

The algorithm returns a predictive model. We can visualize its predictions below:

In [None]:
# visualize the result
plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Diabetes Risk')
plt.scatter(diabetes_X_train.loc[:, ['bmi']], diabetes_y_train)
plt.scatter(diabetes_X_test.loc[:, ['bmi']], diabetes_y_test, color='r', marker='o')
plt.scatter(diabetes_X_train.loc[:, ['bmi']], diabetes_y_train_pred, color='y', linewidth=1)
plt.plot(diabetes_X_test.loc[:, ['bmi']], diabetes_y_test_pred, 'x', color='r', mew=3, markersize=8)
plt.legend(['Initial Patients', 'New Patients', 'Model', 'Predictions'])

In [None]:
from sklearn.metrics import mean_squared_error

print('Training set mean squared error: %.2f'
      % mean_squared_error(diabetes_y_train, diabetes_y_train_pred))
print('Test set mean squared error: %.2f'
      % mean_squared_error(diabetes_y_test, diabetes_y_test_pred))
print('Test set mean squared error of random predictions: %.2f'
      % mean_squared_error(diabetes_y_test, np.random.randn(*diabetes_y_test_pred.shape)))

### **Summary: Components of A Supervised Machine Learning Problem**
At a high level, a supervised machine learning problem has the following structure:
$$Dataset + \underbrace{\text{Algorithm}}_\text{Model Class + Objective + Optimizer} \to Predictive ~ Model$$

The predictive model is chosen to model the relationship between the inputs and targets. For instance, it can predict future targets.


### **Notation: Feature Matrix**
Suppose that we have a dataset of size $n$ (e.g., $n$ patients), indexed by $i = 1,2, \dots, n$. Each $x^{(i)}$ is a vector of $d$ features.

**Feature Matrix:**
Machine learning algorithms are most easily defined in the language of linear algebra. Therefore, it will be useful to represent the entire dataset as one matrix $X \in \mathbb{R}^{n \times d}$, whose $i$-th row holds the features of example $i$:
$$X = \begin{bmatrix}
x^{(1)}_1 & x^{(1)}_2 & \ldots & x^{(1)}_d\\
x^{(2)}_1 & x^{(2)}_2 & \ldots & x^{(2)}_d\\
\vdots & \vdots & \ddots & \vdots \\
x^{(n)}_1 & x^{(n)}_2 & \ldots & x^{(n)}_d
\end{bmatrix}.$$
Similarly, we can vectorize the target variables into a vector $y \in \mathbb{R}^n$ of the form:
$$y = \begin{bmatrix}
y^{(1)} \\
y^{(2)}\\
\vdots \\
y^{(n)}
\end{bmatrix}.$$
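In code, this corresponds to stacking the examples into NumPy arrays, following the layout used by the `pandas` and `sklearn` cells above (one row per patient, one column per feature). The values below are made up, and the intercept-free linear model at the end is only an assumption used to show why the matrix form is convenient.

```python
import numpy as np

# Hypothetical dataset: n = 3 examples, each with d = 2 features, plus a target.
examples = [([1.0, 2.0], 10.0), ([3.0, 4.0], 20.0), ([5.0, 6.0], 30.0)]

# Row i of X is the feature vector of example i; entry i of y is its target.
X = np.array([x for x, _ in examples])
y = np.array([t for _, t in examples])
print(X.shape, y.shape)  # (3, 2) (3,)

# With the data in matrix form, a linear model (here without an intercept)
# predicts all n targets with a single matrix-vector product.
theta = np.array([0.5, 1.0])
predictions = X @ theta
print(predictions)
```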