# Ungraded Lab:  Multiclass Classification: One-vs-all

One-vs-all is one method for classification when there are more than two categories.   
![pic](./figures/onevsmany.png)

## Goals
In this lab you will:
- utilize the functions you have developed (compute_cost, compute_gradient, predict, gradient_descent) to do binomial (yes/no) classification
- write a multi-class prediction routine to select between n binomial decisions
- utilize decision boundary plotting logic


# Outline
- [Tools](#tools)
- [Dataset](#dataset)
- [One vs All Implementation](#ova)



# Multiclass Classification: One-vs-all (OVA)
In this lab, we will explore how to use the One-vs-All method for classification when there are more than two categories. This technique is an extension of the two-class, or binomial, logistic regression that we have been working with. 

In binomial logistic regression, we train a model to classify samples as being in a class or not in a class. One-vs-All (OVA) extends this method by training $n$ models. Each model is responsible for identifying one class. A model for a given class is trained by recasting the training set to identify one class as positive and all the rest as negative. To make predictions, an example is processed by all $n$ models and the model with the largest prediction output is selected.
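
As a small illustration of the selection step only, the sketch below assumes three already-trained models and uses made-up output values (`model_outputs` is a hypothetical name, not part of the lab code). Each row holds the three model outputs for one example, and the class whose model produced the largest output is selected.

```python
import numpy as np

# Hypothetical sigmoid outputs of 3 one-vs-all models (columns)
# for 3 examples (rows). The numbers are invented for illustration.
model_outputs = np.array([
    [0.91, 0.05, 0.20],   # model 0 is most confident
    [0.10, 0.30, 0.85],   # model 2 is most confident
    [0.40, 0.45, 0.02],   # no model exceeds 0.5; model 1 is still the largest
])

# Select, for each example, the index of the model with the largest output.
selected_class = np.argmax(model_outputs, axis=1)
print(selected_class)   # [0 2 1]
```

Note that the third example is still assigned a class even though none of the outputs exceeds 0.5: the selection compares the models with each other rather than against a threshold.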

In this lab, we will build an OVA classifier.
## Tools 
- We will utilize our previous work to build and train models. These routines are provided. 
- Plotting decision boundaries and datasets is helpful. Producing these plots is quite involved so helper routines are provided below.
    - `plot_mc_decision_boundary()` will use a prediction routine you will write in this assignment, `predict_mc`
- We will create a multi-class data set. Popular [`SkLearn`](https://scikit-learn.org/stable/) routines are utilized.

In [None]:
from lab_utils import *
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
import copy
import math

These routines are provided, but reviewing their operation is instructive. Plotting routines often make use of many esoteric but useful NumPy routines. Plotting decision boundaries makes use of `matplotlib`'s contour plot. A contour plot draws a line at the boundary of a change in values. That capability is used to delineate changes in decisions. Briefly, the routine has 3 steps:
- create a fine mesh of locations in a 2-D grid. Build an array of those points.
- make predictions for each of those points. In this case, this includes the vote for the best prediction.
- plot the mesh vs the predictions (`Z`) using a contour plot.
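
The first two steps can be sketched without any plotting. The example below is a minimal illustration on a coarse grid; `nearest_center_predict` is a hypothetical stand-in for the prediction routine (it simply picks the closest of two made-up centers), not the lab's `predict_mc`.

```python
import numpy as np

def nearest_center_predict(points, centers):
    """Hypothetical predictor: index of the closest center for each point."""
    # distances has shape (num_points, num_centers)
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)

# Step 1: build a coarse mesh and flatten it into an array of (x, y) points.
h_demo = 0.5
xx_demo, yy_demo = np.meshgrid(np.arange(-2, 2, h_demo),
                               np.arange(-1, 1, h_demo))
points_demo = np.c_[xx_demo.ravel(), yy_demo.ravel()]

# Step 2: predict a class for every mesh point, then restore the grid shape
# so that the result lines up with xx_demo and yy_demo.
centers_demo = np.array([[-1.0, 0.0], [1.0, 0.0]])
Z_demo = nearest_center_predict(points_demo, centers_demo).reshape(xx_demo.shape)

print(xx_demo.shape, points_demo.shape, Z_demo.shape)   # (4, 8) (32, 2) (4, 8)
```

An array shaped like `Z_demo` is what a contour plot needs: it draws lines where neighbouring grid values change, which here would be the boundary between the two predicted classes.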

In [None]:
#Plot a multi-class decision boundary
def plot_mc_decision_boundary(X,nclasses, W, b , predict_mc_function, class_labels=None, legend=False):

    # create a mesh of points to plot
    h = 0.1  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    points = np.c_[xx.ravel(), yy.ravel()]

    #make predictions for each point in mesh
    Z = predict_mc_function(points,W,b)
    Z = Z.reshape(xx.shape)

    #contour plot highlights boundaries between values - classes in this case
    plt.figure()
    plt.contour(xx, yy, Z, colors='g') 
    plt.axis('tight')


In [None]:
# Plot  multi-class training points
def plot_mc_data(X, y, class_labels=None, legend=False):
    classes = np.unique(y)
    for i in classes:
        label = class_labels[i] if class_labels else "class {}".format(i)
        idx = np.where(y == i)
        plt.scatter(X[idx, 0], X[idx, 1],  cmap=plt.cm.Paired,
                    edgecolor='black', s=20, label=label)
    if legend: plt.legend()

We're providing the routines which you have developed in previous labs to create and fit/train a model. Feel free to replace these with your own versions. (Keep a copy of the original just in case.)

In [None]:
def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, predict_function, alpha, num_iters): 
    """
    Performs batch gradient descent to learn w and b. Updates w and b by taking 
    num_iters gradient steps with learning rate alpha
    
    Args:
      X :    (array_like Shape (m,n))
      y :    (array_like Shape (m,1))
      w_in : (array_like Shape (n,1)) Initial values of parameters of the model
      b_in : (scalar)                 Initial value of parameter of the model
      cost_function:                  function to compute cost
      gradient_function:              function to compute the gradient
      predict_function:               function to compute the output of the model
      alpha : (float)                 Learning rate
      num_iters : (int)               number of iterations to run gradient descent
    Returns
      w : (array_like Shape (n,1)) Updated values of parameters of the model after
          running gradient descent
      b : (scalar)                 Updated value of parameter of the model after
          running gradient descent
      J_history : (list)           Cost recorded at each iteration
      w_history : (list)           Values of w recorded at the print intervals
    """
    
    # number of training examples
    m = len(X)
    
    # An array to store cost J and w's at each iteration primarily for graphing later
    J_history = []
    w_history = []
    w = copy.deepcopy(w_in)  #avoid modifying global w within function
    b = b_in
    
    for i in range(num_iters):

        # Calculate the gradient and update the parameters
        dJdb,dJdw = gradient_function(X, y, w, b, predict_function)   ##None

        # Update Parameters using w, b, alpha and gradient
        w = w - alpha * dJdw               ##None
        b = b - alpha * dJdb               ##None

        # Save cost J at each iteration
        if i<100000:      # prevent resource exhaustion 
            cost =  cost_function(X, y, w, b)
            J_history.append(cost)

        # Print cost at 10 evenly spaced intervals, or at every iteration if num_iters < 10
        if i% math.ceil(num_iters/10) == 0:
            w_history.append(w)
            print(f"Iteration {i:4}: Cost {float(J_history[-1]):8.2f}   ")
    #print(w,b,len(J_history), len(w_history) )   
    return w, b, J_history, w_history #return w and J,w history for graphing

In [None]:
def compute_cost_logistic_matrix(X, y, w, b):
    """
    Computes the cost over all examples
    Args:
      X : (array_like Shape (m,n)) data, m examples by n features
      y : (array_like Shape (m,1)) target value 
      w : (array_like Shape (n,1)) Values of parameters of the model      
      b : (scalar)                 Value of bias parameter of the model
    Returns:
      total_cost: (scalar)         cost 
    """
    m = X.shape[0]
    
    f = sigmoid(X @ w + b)
    total_cost = (1/m)*(np.dot(-y.T, np.log(f)) - np.dot((1-y).T, np.log(1-f)))
    
    return total_cost[0,0]

In [None]:
def predict_logistic_matrix(X, w, b): 
    """
    Predicts using logistic regression (sigmoid output, no threshold)
    Args:
      X : (array_like Shape (m,n)) feature values, m examples by n features
      w : (array_like Shape (n,1)) parameters for prediction   
      b : (scalar               ) parameter  for prediction   
    Returns
      p : (array_like Shape (m,1)) predictions
    """
    p = sigmoid(X @ w + b )         
    
    return(p)    

In [None]:
def predict_thresh(X, w, b, threshold=0.5): 
    """
    Predict whether the label is 0 or 1 using learned logistic
    regression parameters w, b. Threshold the output of the sigmoid to predict 1 or 0.
    
    Parameters
    ----------
    X : array_like
        Shape (m, n) 
    
    w : array_like
        Parameters of the model
        Shape (n, 1)
    b : scalar
    
    Returns
    -------

    p: array_like
        Shape (m,)
        The predictions for X using a threshold at 0.5
    """
    # number of training examples
    m = X.shape[0]   
    p = np.zeros(m)
   
    for i in range(m):
        f_w = sigmoid(np.dot(w.T, X[i])+b)
        p[i] = f_w >= threshold
    
    return p

In [None]:
def compute_gradient_logistic_matrix(X, y, w, b, predict_function): 
    """
    Computes the gradient for logistic regression 
 
    Args:
      X : (array_like Shape (m,n)) data, m examples by n features 
      y : (array_like Shape (m,1)) target value 
      w : (array_like Shape (n,1)) Values of parameters of the model      
      b : (scalar )                Values of parameter of the model      
      predict_function: (function) function to call to make prediction
    Returns
      dJdw: (array_like Shape (n,1)) The gradient of the cost w.r.t. the parameters w. 
      dJdb: (scalar)                 The gradient of the cost w.r.t. the parameter b. 
                                  
    """
    m,n = X.shape
    f_wb = predict_function(X, w, b) 
    err  = f_wb - y                 
    dJdw = (1/m) * (X.T @ err)     
    dJdb = (1/m) * np.sum(err)     
        
    return dJdb,dJdw

<a name='dataset'></a>
##  Dataset
Below, we use an `SkLearn` tool to create 3 'blobs' of data. This is a quick and easy way to create data to classify. Using NumPy's [`np.unique`](https://numpy.org/doc/stable/reference/generated/numpy.unique.html), we can look at the number and values of the classes.
- **Note** we're creating 3 classes

In [None]:
# make 3-class dataset for classification
centers = [[-5, 0], [0, 4.5], [5, -1]]
X_train, y_train = make_blobs(n_samples=500, centers=centers, cluster_std=0.85,random_state=40)


In [None]:
plot_mc_data(X_train,y_train,["blob one", "blob two", "blob three"], legend=True)
plt.show()

In [None]:
# show classes in data set
print(f"unique classes {np.unique(y_train)}")
# show how classes are represented
print(f"first 10 labels {y_train[:10]}")
# show shapes of our dataset
print(f"shape of X_train: {X_train.shape}, shape of y_train: {y_train.shape}")

When you create the one-vs-all training set, you will need to create a 'binary' training set from `y_train` for each class. This is a set with values set to `1` for all examples in that class and `0` for all others. NumPy's elementwise comparison makes this convenient, as shown below:

In [None]:
y_cat_2 = (y_train == 2) + 0
print(y_cat_2[:10])

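You will need numeric binary (0,1) values rather than booleans. Adding `0` above converts the boolean result to integers; `astype(float)` converts it to floats: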

In [None]:
y_cat_2 = (y_train == 2).astype(float)
print(y_cat_2[:10])
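
The same comparison can be applied to every class at once. The sketch below is an optional illustration on a tiny made-up label vector (`y_demo` and `Y_binary` are hypothetical names); it uses broadcasting to build one binary target column per class, which is equivalent to repeating the conversion above for each class.

```python
import numpy as np

y_demo = np.array([0, 2, 1, 2, 0])      # made-up class labels
classes_demo = np.unique(y_demo)        # array([0, 1, 2])

# Compare an (m,1) column of labels with a (1,c) row of class values.
# Broadcasting gives an (m,c) boolean array, converted here to floats.
Y_binary = (y_demo[:, None] == classes_demo[None, :]).astype(float)

print(Y_binary.shape)    # (5, 3)
print(Y_binary[:, 2])    # [0. 1. 0. 1. 0.]
```

Column `k` of `Y_binary` is the binary target for the model that recognizes class `k`.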

<a name='ova'></a>
##  One Vs All Implementation

You will implement the OVA algorithm in three steps.
- create and train three 'models', each trained to select one of the three classes.
- create a routine `predict_mc` that will use the models to make predictions and select the best answer.
- plot the decision boundary using the prediction routine.

### Step 1: Create and Train 3 models.
The steps involved will be familiar from past labs utilizing gradient descent.  
For each class:
- separate the target, `y` associated with the current class. 
- create `w_init` and `b_init`, initial values for the parameters. 
- call gradient descent. `alpha=1e-2` and `num_iters=1000` work well. 
    - The call to `gradient_descent` has a number of arguments. The routine is above. Double check your solution with the code in *Hint*. Notice how gradient descent utilizes all of the routines you have developed thus far.
    - you will call `predict_logistic_matrix` to perform the prediction. This does not contain the threshold logic: you train without a threshold and add one later, when using the model.
    - This returns parameters $w$ and $b$. $w$ and $b$ constitute your *model*, which you will store in arrays. The array of weights, importantly, has *one column for each model*. This arrangement will allow the use of matrix operations and will become familiar in future labs.
- call predict with the training data and your model ($w$,$b$) to plot the results of training. 
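
The per-class steps listed above can be sketched end to end on a tiny made-up dataset. This is only an illustration under stated assumptions: `train_binary` is a simplified, hypothetical stand-in for the lab's `gradient_descent`, and the data, learning rate and iteration count are chosen for the toy example rather than taken from the lab.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, alpha=0.1, num_iters=500):
    """Hypothetical stand-in for gradient_descent: plain batch gradient
    descent for a single binary logistic regression model."""
    m, n = X.shape
    w = np.zeros((n, 1))
    b = 0.0
    for _ in range(num_iters):
        err = sigmoid(X @ w + b) - y           # (m,1) prediction error
        w = w - alpha * (X.T @ err) / m        # gradient step for w
        b = b - alpha * np.sum(err) / m        # gradient step for b
    return w, b

# Six made-up points in three well-separated groups.
X_toy = np.array([[-5., 0.], [-4., 1.], [0., 5.], [1., 4.], [5., -1.], [4., 0.]])
y_toy = np.array([0, 0, 1, 1, 2, 2])
classes_toy = np.unique(y_toy)

# One column of weights and one bias per class.
W_toy = np.zeros((X_toy.shape[1], len(classes_toy)))
b_toy = np.zeros(len(classes_toy))

for k in classes_toy:
    # Recast the targets: 1 for the current class, 0 for all others.
    y_k = (y_toy == k).astype(float).reshape(-1, 1)
    w_k, b_k = train_binary(X_toy, y_k)
    W_toy[:, k] = w_k[:, 0]
    b_toy[k] = b_k

# Pick the model with the largest output for each training point.
pred_toy = np.argmax(sigmoid(X_toy @ W_toy + b_toy), axis=1)
print(pred_toy)
```

Storing each model as a column is what allows the final line to evaluate all three models with a single matrix product.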

Below there is a for loop over each of the classes. The loop:
- creates a target array with the current class set to one and all others set to zero.
- runs your code
- plots this interpretation of the data
- plots the predicted values

Replace `None` with your code.

<details>
  <summary><font size="2" color="darkgreen"><b>Hints</b></font></summary>

```python
classes=np.unique(y_train)   # three classes, [0,1,2]
m,n = X_train.shape          # number of examples, number of features
c = len(classes)             # number of classes

# storage for our models (w), one column per class
W_models = np.zeros((n,len(classes)))   
b_models = np.zeros(c)
plt.figure(figsize=(14, 14))             

for i in classes:
    yc = (y_train == classes[i]).astype(float)
    yc = yc.reshape(-1,1)  

    ### START CODE HERE ### 
    w_init = np.zeros((2,1))                                                               
    b_init = 0.                                                                            
    w_final, b_final,_,_ = gradient_descent(X_train, yc, w_init, b_init,                   
                                      compute_cost_logistic_matrix,                       
                                      compute_gradient_logistic_matrix,                  
                                      predict_logistic_matrix,                            
                                      alpha = 1e-2, num_iters=1000)                         
    W_models[:,i] = w_final[:,0]                                                          
    b_models[i] = b_final                                                                 
    pred =  predict_thresh(X_train, w_final,b_final )                                    

    ### END CODE HERE ###         

    #Left Side, training data in All vs i
    ax = plt.subplot(3,2, 2*i + 1)
    plot_mc_data(X_train, yc,legend=True); plt.title(f"Training Classes, class {i}"); 

    #Right Side, model i's prediction after training
    ax = plt.subplot(3,2, 2*i + 2)
    plot_mc_data(X_train,pred,legend=True); plt.title("Predicted Classes after training");
plt.show()
```
</details>

In [None]:
classes=np.unique(y_train)   # three classes, [0,1,2]
m,n = X_train.shape          # number of examples, number of features
c = len(classes)             # number of classes

# storage for our models (w), one column per class
W_models = np.zeros((n,len(classes)))   
b_models = np.zeros(c)
plt.figure(figsize=(14, 14))             

for i in classes:
    yc = (y_train == classes[i]).astype(float)
    yc = yc.reshape(-1,1)  

    ### START CODE HERE ### 
    ### BEGIN SOLUTION ###
    w_init = np.zeros((2,1))                                                              ##None 
    b_init = 0.                                                                           ##None 
    # call gradient descent, double check your solution with Hint
    w_final, b_final,_,_ = gradient_descent(X_train, yc, w_init, b_init,                  ##None 
                                      compute_cost_logistic_matrix,                       
                                      compute_gradient_logistic_matrix,                    
                                      predict_logistic_matrix,                             
                                      alpha = 1e-2, num_iters=1000)                        
    ### END SOLUTION ### 
    ### END CODE HERE ###         
    W_models[:,i] = w_final[:,0]                                                           
    b_models[i] = b_final                                                                  
    pred =  predict_thresh(X_train, w_final,b_final )                                      

    #Left Side, training data in All vs i
    ax = plt.subplot(3,2, 2*i + 1)
    plot_mc_data(X_train, yc,legend=True); plt.title(f"Training Classes, class {i}"); 

    #Right Side, model i's prediction after training
    ax = plt.subplot(3,2, 2*i + 2)
    plot_mc_data(X_train,pred,legend=True); plt.title("Predicted Classes after training");
plt.show()

<details>
<summary>
    <b>**Expected Output**:</b>
</summary>

 ![Training classes and predicted classes for each model](./figures/C1W3_trainvpredict.PNG)

</details>

Now that we have trained our 3 models we will write a routine to select the best prediction. Recall, the operation involves 
- making a prediction for each model
- picking the largest prediction

- Step 1: Given $X$ and the matrices `W_models` and `b_models`, perform a prediction resulting in three predictions. This can be implemented in vectorized form as described pictorially below. This is not a trivial operation. It is worth spending the time to understand what this is doing. A for loop implementation can also be used.
![pic](./figures/C1W3_mcpredict.PNG)  
- Step 2: use `np.argmax(axis=1)` to return the **class** of the prediction with the highest value. Note that class is one of [0,1,2] and the index returned by `np.argmax` is, conveniently, also one of [0,1,2].
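
To see why the vectorized form and a for loop are interchangeable, the sketch below compares the two on small made-up arrays (`X_demo`, `W_demo` and `b_demo` are hypothetical values, and `sigmoid` is redefined here so the snippet is self-contained). It is an illustration of the shapes involved, not the lab solution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X_demo = np.array([[1., 2.], [-3., 0.5], [0., -1.], [2., 2.]])   # (m,n) = (4,2)
W_demo = np.array([[0.5, -1.0, 0.2],
                   [-0.3, 0.8, 1.0]])                            # (n,c) = (2,3)
b_demo = np.array([0.1, -0.2, 0.0])                              # (c,)  = (3,)

# Vectorized: (4,2) @ (2,3) -> (4,3); b_demo is broadcast across the rows.
f_vec = sigmoid(X_demo @ W_demo + b_demo)

# Loop: evaluate one model (one column of W_demo) at a time.
f_loop = np.zeros((X_demo.shape[0], W_demo.shape[1]))
for j in range(W_demo.shape[1]):
    f_loop[:, j] = sigmoid(X_demo @ W_demo[:, j] + b_demo[j])

print(np.allclose(f_vec, f_loop))   # True
print(f_vec.argmax(axis=1).shape)   # (4,)
```

Both versions produce one row per example and one column per model, so `argmax(axis=1)` returns one class index per example.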

<details>
  <summary><font size="2" color="darkgreen"><b>Hints</b></font></summary>

```python
def predict_mc(X,W,b, verbose = False):
    """
    Computes c predictions, one per model, and selects the best.
    Args:
      X : (array_like Shape (m,n)) feature values used in prediction.  
      W : (array_like Shape (n,c)) Matrix of parameters. Each column represents 1 model
      b : (array_like Shape (c, )) vector of bias parameters. Each element represents 1 model
    Returns
      pclass: (array_like Shape (m,)) The selected class for each example. Values 0 to c-1.
    """
    ### START CODE HERE ### 
    ### BEGIN SOLUTION ###  
    z_wb = X @ W + b               #Matrix multiply and add  ##None
    f_wb = sigmoid(z_wb)              #sigmoid                  ##None
    pclass = f_wb.argmax(axis=1)      #argmax                   ##None
    ### END SOLUTION ###  
    ### END CODE HERE ### 
    if verbose: print("z_wb.shape",z_wb.shape); print(z_wb)
    if verbose: print("pclass",pclass)
    return(pclass)
```
</details>

In [None]:
def predict_mc(X,W,b, verbose = False):
    """
    Computes c predictions, one per model, and selects the best.
    Args:
      X : (array_like Shape (m,n)) feature values used in prediction.  
      W : (array_like Shape (n,c)) Matrix of parameters. Each column represents 1 model
      b : (array_like Shape (c, )) vector of bias parameters. Each element represents 1 model
    Returns
      pclass: (array_like Shape (m,)) The selected class for each example. Values 0 to c-1.
    """
    ### START CODE HERE ### 
    ### BEGIN SOLUTION ###  
    #Matrix multiply and add
    z_wb = X @ W + b  
    #sigmoid  
    f_wb = sigmoid(z_wb) 
    #argmax
    pclass = f_wb.argmax(axis=1)      
    ### END SOLUTION ###  
    ### END CODE HERE ### 
    if verbose: print("z_wb.shape",z_wb.shape); print(z_wb)
    if verbose: print("pclass",pclass)
    return(pclass)

In [None]:
#Test your model
tmp_X = np.array([[-2.,-6.],[6,0],[-2,6]])                            #(3,2)
tmp_w = np.array([[-1.117, 0.103, 0.963], [-0.863, 1.155, -0.954]])   #(2,3)
tmp_b = np.array([-0.267, -1.4577, -0.238])                           #(3, )
print(tmp_X.shape, tmp_w.shape, tmp_b.shape)

tmp_fw = predict_mc(tmp_X,tmp_w,tmp_b, verbose = True)
print(tmp_fw)

<details>
<summary>
    <b>**Expected Output**:</b>
</summary>

```
(3, 2) (2, 3) (3,)
z_wb.shape (3, 3)
[[ 7.145  -8.5937  3.56  ]
 [-6.969  -0.8397  5.54  ]
 [-3.211   5.2663 -7.888 ]]
pclass [0 2 1]
[0 2 1]
```

</details>

Now that we can make a prediction for any point, we can produce a plot showing the decision boundary. `plot_mc_decision_boundary` uses your `predict_mc` function.

In [None]:
#plot the decision boundary. Pass in our models - the w's and b's associated with each model and predict_mc
plot_mc_decision_boundary(X_train,3, W_models, b_models, predict_mc)
plt.title("model decision boundary vs original training data")

#add the original data to the decision boundary
plot_mc_data(X_train,y_train,["blob one", "blob two", "blob three"], legend=True)
plt.show()

<details>
<summary>
    <b>**Expected Output**:</b>
</summary>

![Decision boundary with the original training data](./figures/C1W3_boundary.PNG)

</details>

There you are! You have now built a multi-class classifier.

Let's try another case. We'll just move the blobs around a bit:
## Second Test Case

In [None]:
# make 3-class dataset for classification
centers = [[-5, 0], [0, 1], [5, -1]]
X_train, y_train = make_blobs(n_samples=500, centers=centers, cluster_std=1.2,random_state=40)


In [None]:
plot_mc_data(X_train,y_train,["blob one", "blob two", "blob three"], legend=True)
plt.show()

In [None]:
# show classes in data set
print(f"unique classes {np.unique(y_train)}")
# show shapes of our dataset
print(f"shape of X_train: {X_train.shape}, shape of y_train: {y_train.shape}")

Examining the plot above, do you see any potential issues with our current approach?

Combine the pieces from above, or create subroutines, to produce a decision boundary diagram like the one in the first example.

<details>
  <summary><font size="2" color="darkgreen"><b>Hints</b></font></summary>

```python
classes=np.unique(y_train)   # three classes, [0,1,2]
m,n = X_train.shape          # number of examples, number of features
c = len(classes)             # number of classes

# storage for our models (w), one column per class
W_models = np.zeros((n,len(classes)))   
b_models = np.zeros(c)

for i in classes:
    yc = (y_train==classes[i]) + 0
    yc = yc.reshape(-1,1)

    w_init = np.zeros((2,1))   
    b_init = 0.
    w_final, b_final,_,_ = gradient_descent(X_train, yc, w_init, b_init,
                                      compute_cost_logistic_matrix, 
                                      compute_gradient_logistic_matrix, 
                                      predict_logistic_matrix,
                                      alpha = 1e-2, num_iters=1000)     
    W_models[:,i] = w_final[:,0]
    b_models[i] = b_final
    pred =  predict_thresh(X_train, w_final,b_final ) 

plot_mc_decision_boundary(X_train,3, W_models, b_models, predict_mc)
plt.title("model decision boundary vs original training data")

#add the original data to the decision boundary
plot_mc_data(X_train,y_train,["blob one", "blob two", "blob three"], legend=True)
plt.show()
```
</details>

In [None]:
#Rewrite code here



<details>
<summary>
    <b>**Expected Output**:</b>
</summary>

![Decision boundary for the second test case](./figures/C1W3_example2.PNG)

</details>

We will study logistic regression with polynomial features in the next lab. That will allow us to handle situations where purely linear solutions are not enough.

This notebook was informed by an example at scikit-learn.org. The author was Tom Dupre la Tour <tom.dupre-la-tour@m4x.org>