# SLU10 - Classification: Exercise notebook

In [None]:
import pandas as pd 
import numpy as np 

In this notebook you will practice the following: 

    - What classification is for
    - Logistic regression
    - Cost function
    - Binary classification
    
You thought that you would get away without implementing your own little Logistic Regression? Hah!


# Exercise 1: Implement the Logistic Function
*aka the sigmoid function*

As a very simple warmup, you will implement the logistic function. Let's keep this simple!

Here's a quick reminder of the formula:

$$\hat{p} = \frac{1}{1 + e^{-z}}$$
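
As a quick sanity check before coding, the formula can be evaluated by hand. The worked values below are rounded and are only meant to show the arithmetic, using $z = 0$ and $z = 1.2$ as example inputs:

$$z = 0: \quad \hat{p} = \frac{1}{1 + e^{0}} = \frac{1}{2} = 0.5$$

$$z = 1.2: \quad \hat{p} = \frac{1}{1 + e^{-1.2}} \approx \frac{1}{1 + 0.3012} \approx 0.77$$

Large positive values of $z$ push $\hat{p}$ towards $1$, and large negative values push it towards $0$.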

**Complete here:**

In [None]:
def logistic_function(z):
    """ 
    Implementation of the logistic function by hand
    
    Args:
        z (np.float64): a float

    Returns:
        proba (np.float64): the predicted probability for a given observation

    """
    
    # define the numerator and the denominator and obtain the predicted probability 
    # clue: you can use np.exp()
    numerator = None
    denominator = None
    proba = None
    # YOUR CODE HERE
    raise NotImplementedError()
    return proba

In [None]:
z = 1.2
print('Predicted probability: %.2f' % logistic_function(z))

Expected output:

    Predicted probability: 0.77

In [None]:
z = 3.4
assert np.isclose(np.round(logistic_function(z),2), 0.97)

z = -2.1
assert np.isclose(np.round(logistic_function(z),2), 0.11)

# Exercise 2: Make Predictions From Observations

The next step is to implement a function that receives observations and returns predicted probabilities.

For instance, remember that for an observation with two variables we have:

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

where $\beta_0$ is the intercept and $\beta_1, \beta_2$ are the coefficients.
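
To make the indexing concrete, here is a small sketch with made-up numbers (the values of `x_demo` and `coefficients_demo` are assumptions chosen for easy arithmetic, not part of the exercise). It computes $z$ with a dot product, which is one possible alternative to an explicit loop:

```python
import numpy as np

# Hypothetical observation with two variables: x_1 = 1.0, x_2 = 2.0
x_demo = np.array([1.0, 2.0])

# Hypothetical coefficients: intercept first, then one coefficient per variable
coefficients_demo = np.array([0.5, 1.0, -1.0])

# z = beta_0 + beta_1 * x_1 + beta_2 * x_2 = 0.5 + 1.0 - 2.0
z_demo = coefficients_demo[0] + np.dot(x_demo, coefficients_demo[1:])
print(z_demo)  # -0.5
```

Note how `coefficients_demo[1:]` skips the intercept so that the remaining coefficients line up with the variables.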

**Complete here:**

In [None]:
def predict_proba(x, coefficients):
    """ 
    Implementation of a function that returns a predicted probability for a given data observation
    
    Args:
        x (np.array): a numpy array of shape (n,)
            - n: number of variables
        coefficients (np.array): a numpy array of shape (n + 1,)
            - coefficients[0]: intercept
            - coefficients[1:]: remaining coefficients

    Returns:
        proba (np.float64): the predicted probability for a given data observation

    """
    
    # start by assigning the intercept to z 
    # clue: the intercept is the first element of the list of coefficients
    z = None
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # sum the remaining variable * coefficient products to z
    # clue: the variable and coefficient indices are offset by one (the intercept comes first), but follow the same order
    for i in range(None):                     # iterate through the observation variables (clue: you can use len())
        z += None                             # multiply the variable value by its coefficient and add to z
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # obtain the predicted probability from z
    # clue: we already implemented something that can give us that
    proba = None
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return proba

In [None]:
x = np.array([0.2,2.32,1.3,3.2])
coefficients = np.array([2.1,0.22,-2, 0.4, 0.1])
print('Predicted probability:  %.3f' % predict_proba(x, coefficients))

Expected output:

    Predicted probability:  0.160

In [None]:
x = np.array([1,0,2,3.2])
coefficients = np.array([-0.2,2,-6, 1.2, -1])
assert np.isclose(np.round(predict_proba(x, coefficients),2), 0.73)

x = np.array([3.2,1.2,-1.2])
coefficients = np.array([-1.,3.1,-3,4])
assert np.isclose(np.round(predict_proba(x, coefficients),2), 0.63)

# Exercise 3: Compute the Cross-Entropy Cost Function

Because you will implement stochastic gradient descent, you only need to compute the loss for one prediction at a time:

$$H_{\hat{p}}(y) =  - (y \log(\hat{p}) + (1-y) \log (1-\hat{p}))$$
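
To build intuition for this loss, the sketch below compares a confident correct prediction with a confident wrong one. The helper name `clipped_log_loss` and the tolerance `eps` are assumptions made for this illustration; the clipping is an optional safeguard against `log(0)` and is not required by the exercise:

```python
import numpy as np

def clipped_log_loss(y, proba, eps=1e-12):
    """Hypothetical helper: per-prediction cross-entropy with clipping."""
    # keep proba strictly inside (0, 1) so that np.log never receives 0
    p = np.clip(proba, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

confident_right = clipped_log_loss(1, 0.9)  # small loss, about 0.105
confident_wrong = clipped_log_loss(1, 0.1)  # large loss, about 2.303
edge_case = clipped_log_loss(1, 0.0)        # finite thanks to the clipping

print(confident_right, confident_wrong, edge_case)
```

The further the predicted probability is from the true label, the faster the loss grows.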

**Complete here:**

In [None]:
def cross_entropy(y, proba):
    """ 
    Implementation of a function that returns the Cross-Entropy loss
    
    Args:
        y (np.int64): an integer
        proba (np.float64): a float

    Returns:
        loss (np.float64): a float with the resulting loss for a given prediction

    """
    
    # compute the inner left side of the loss function (for when y == 1)
    # clue: use np.log()
    left = None 
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # compute the inner right side of the loss function (for when y == 0)
    right = None 
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # compute the total loss
    # clue: do not forget the minus sign
    loss = None
    # YOUR CODE HERE
    raise NotImplementedError()
    return loss

In [None]:
y = 1
proba = 0.7
print('Computed loss:  %.3f' % cross_entropy(y, proba))

Expected output:
    
    Computed loss:  0.357

In [None]:
y = 1
proba = 0.35
assert np.isclose(np.round(cross_entropy(y, proba),3), 1.050)

y = 1
proba = 0.77
assert np.isclose(np.round(cross_entropy(y, proba),3), 0.261)

# Exercise 4: Obtain the Optimized Coefficients 
Now that the warmup is done, let's do the most interesting exercise. Here you will obtain the optimized coefficients through stochastic gradient descent.

Quick reminders:

$$H_{\hat{p}}(y) = - \frac{1}{N}\sum_{i=1}^{N} \left [{ y_i \ \log(\hat{p}_i) + (1-y_i) \ \log (1-\hat{p}_i)} \right ]$$

and

$$\beta_{0(t+1)} = \beta_{0(t)} - learning\_rate \frac{\partial H_{\hat{p}}(y)}{\partial \beta_{0(t)}}$$

$$\beta_{t+1} = \beta_t - learning\_rate \frac{\partial H_{\hat{p}}(y)}{\partial \beta_t}$$

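For this exercise, apply the following per-observation update rules (the expected outputs below assume them). Note that the exact cross-entropy gradient gives the update $(y - \hat{p}) \ x$ without the extra $\hat{p} \ (1 - \hat{p})$ factor, so keep that factor here if you want your results to match: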

$$\beta_{0(t+1)} = \beta_{0(t)} + learning\_rate \left [(y - \hat{p}) \ \hat{p} \ (1 - \hat{p})\right]$$

$$\beta_{t+1} = \beta_t + learning\_rate \left [(y - \hat{p}) \ \hat{p} \ (1 - \hat{p}) \ x \right]$$

You will have to initialize a numpy array full of zeros for the coefficients. If you have a training set $X$, you can initialize it this way:
```python
coefficients = np.zeros(X.shape[1]+1)
```

where the $+1$ accounts for the intercept.

You will also iterate through the observations of the training set $X$ together with their respective labels $Y$. You can iterate over both at once this way:
```python
for x_sample, y_sample in zip(X, Y):
    ...
```
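
To see what a single update does, here is a minimal sketch of one step on one made-up observation, following the update rules given above. The variable names and values are assumptions for illustration only, and the slice update is just one possible way to write it:

```python
import numpy as np

# Hypothetical single observation and label
x_sample = np.array([1.0, 2.0, 3.0])
y_sample = 0
learning_rate = 0.1

# one intercept plus one coefficient per variable, all starting at zero
coefficients = np.zeros(x_sample.shape[0] + 1)

# predicted probability with the current coefficients (z = 0, so proba = 0.5)
z = coefficients[0] + np.dot(coefficients[1:], x_sample)
proba = 1.0 / (1.0 + np.exp(-z))

# shared part of the update: learning_rate * (y - p) * p * (1 - p)
step = learning_rate * (y_sample - proba) * proba * (1.0 - proba)

coefficients[0] += step              # intercept update
coefficients[1:] += step * x_sample  # coefficient updates, scaled by x

print(coefficients)  # approximately [-0.0125, -0.0125, -0.025, -0.0375]
```

Because the label is $0$ and the prediction is $0.5$, every coefficient moves in the negative direction, in proportion to the corresponding variable value.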

**Complete here:**

In [None]:
def compute_coefficients(x_train, y_train, learning_rate = 0.1, n_epoch = 50, verbose = False):
    """ 
    Implementation of a function that returns the optimized intercept and coefficients
    
    Args:
        x_train (np.array): a numpy array of shape (m, n)
            m: number of training observations
            n: number of variables
        y_train (np.array): a numpy array of shape (m,)
        learning_rate (np.float64): a float
        n_epoch (np.int64): the number of full training cycles (epochs) to perform on the training set
        verbose (bool): if True, print the average loss every 10 epochs

    Returns:
        coefficients (np.array): a numpy array of shape (n+1,)

    """
    
    # initialize the coefficients array with zeros
    # clue: use np.zeros()
    coefficients = None
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # run the stochastic gradient descent algorithm n_epoch times and update the coefficients
    for epoch in range(None):               # iterate n_epoch times
        loss = None                         # initialize the cross entropy loss with an empty list
        for x, y in zip(None, None):        # iterate through the training set observations and labels
            proba = None                    # compute the predicted probability
            loss.append(None)               # compute the cross entropy loss and append it to the list
            coefficients[0] += None         # update the intercept            
            for i in range(None):           # iterate through the observation variables (clue: use len())
                coefficients[i + 1] += None # update each coefficient
        loss = None                         # average the obtained cross entropies (clue: use np.mean())
        # YOUR CODE HERE
        raise NotImplementedError()
        
        if epoch % 10 == 0 and verbose:
            print('>epoch=%d, learning_rate=%.3f, error=%.3f' % (epoch, learning_rate, loss))
    return coefficients

In [None]:
x_train = np.array([[1,2,3], [2,5,9], [3,1,4], [8,2,9]])
y_train = np.array([0,1,0,1])
learning_rate = 0.1
n_epoch = 50
coefficients = compute_coefficients(x_train, y_train, learning_rate=learning_rate, n_epoch=n_epoch, verbose=True)
print('Computed coefficients:')
print(coefficients)

Expected output:
    
    >epoch=0, learning_rate=0.100, error=0.811
    >epoch=10, learning_rate=0.100, error=0.675
    >epoch=20, learning_rate=0.100, error=0.640
    >epoch=30, learning_rate=0.100, error=0.606
    >epoch=40, learning_rate=0.100, error=0.574
    Computed coefficients:
    [-0.82964483  0.02698239 -0.04632395  0.27761155]

In [None]:
x_train = np.array([[3,1,3], [1,0,9], [3,3,4], [2,-1,10]])
y_train = np.array([0,1,0,1])
learning_rate = 0.3
n_epoch = 100
coefficients = compute_coefficients(x_train, y_train, learning_rate=learning_rate, n_epoch=n_epoch, verbose=False)
assert np.allclose(coefficients, np.array([-0.25917811, -1.15128387, -0.85317139,  0.55286134]))

x_train = np.array([[3,-1,-2], [-6,9,3], [3,-1,4], [5,1,6]])
coefficients = compute_coefficients(x_train, y_train, learning_rate=learning_rate, n_epoch=n_epoch, verbose=False)
assert np.allclose(coefficients, np.array([-0.53111811, -0.16120628,  2.20202909,  0.27270437]))

# Exercise 5: Normalize Data

Just a quick and easy function to normalize the data. Rescaling your variables to $[0, 1]$ (normalization) or standardizing them is crucial: it puts all features on a comparable scale, so that you can correctly analyse and compare logistic regression coefficients for your possible future employer.

You only have to implement this formula:

$$ x_{normalized} = \frac{x - x_{min}}{x_{max} - x_{min}}$$

Don't forget that the `axis` argument is critical when obtaining the maximum, minimum and mean values! As you want to obtain the maximum and minimum values of each individual feature, you have to specify `axis=0`. Thus, if you wanted to obtain the maximum values of each feature of data $X$, you would do the following:

```python
X_max = np.max(X, axis=0)
```
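
The effect of the `axis` argument is easiest to see on a tiny array. The array `X_demo` below is a made-up example with three observations (rows) and two features (columns):

```python
import numpy as np

# Hypothetical data: 3 observations (rows) x 2 features (columns)
X_demo = np.array([[1, 10],
                   [3, 20],
                   [2, 40]])

per_feature_max = np.max(X_demo, axis=0)  # one value per column
per_feature_min = np.min(X_demo, axis=0)  # one value per column
per_row_max = np.max(X_demo, axis=1)      # one value per row
overall_max = np.max(X_demo)              # a single value for the whole array

print(per_feature_max)  # [ 3 40]
print(per_feature_min)  # [ 1 10]
print(per_row_max)      # [10 20 40]
print(overall_max)      # 40
```

With `axis=0` the result has one entry per feature, which is the shape needed to rescale each column independently.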

**Complete here:**

In [None]:
def normalize_data(data):
    """ 
    Implementation of a function that normalizes your data variables
    
    Args:
        data (np.array): a numpy array of shape (m, n)
            m: number of observations
            n: number of variables

    Returns:
        normalized_data (np.array): a numpy array of shape (m, n)

    """
    # compute the numerator
    # clue: use np.min()
    numerator = None
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # compute the denominator
    # clue: use np.max() and np.min()
    denominator = None
    # YOUR CODE HERE
    raise NotImplementedError()
    
    # obtain the normalized data
    normalized_data = None
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return normalized_data

In [None]:
data = np.array([[9,5,2], [7,7,3], [2,2,11], [1,5,2], [10,1,3], [0,9,5]])
normalized_data = normalize_data(data)
print('Before normalization:')
print(data)
print('\n-------------------\n')
print('After normalization:')
print(normalized_data)

Expected output:
    
    Before normalization:
    [[ 9  5  2]
     [ 7  7  3]
     [ 2  2 11]
     [ 1  5  2]
     [10  1  3]
     [ 0  9  5]]

    -------------------

    After normalization:
    [[0.9        0.5        0.        ]
     [0.7        0.75       0.11111111]
     [0.2        0.125      1.        ]
     [0.1        0.5        0.        ]
     [1.         0.         0.11111111]
     [0.         1.         0.33333333]]

In [None]:
data = np.array([[9,5,2,6], [7,5,1,3], [2,2,11,1]])
normalized_data = normalize_data(data)
assert np.allclose(normalized_data, np.array([[1., 1., 0.1, 1.],[0.71428571, 1., 0., 0.4],[0., 0., 1., 0.]]))

data = np.array([[9,5,3,1], [1,3,1,3], [2,2,4,6]])
normalized_data = normalize_data(data)
assert np.allclose(normalized_data, np.array([[1., 1., 0.66666667, 0.],[0., 0.33333333, 0., 0.4],
                                              [0.125, 0., 1., 1.]]))

# Exercise 6: Putting it All Together

The Wisconsin Breast Cancer Diagnostic dataset is another data science classic. It contains characteristics extracted from breast cell nuclei, collected to understand which of them are the most relevant for developing breast cancer.

Your quest is to first analyze this dataset using the material that you've learned in the previous SLUs, and then create a logistic regression model that can correctly tell cancer cells from healthy ones.

Dataset description:

    1. Sample code number: id number 
    2. Clump Thickness
    3. Uniformity of Cell Size
    4. Uniformity of Cell Shape
    5. Marginal Adhesion 
    6. Single Epithelial Cell Size
    7. Bare Nuclei
    8. Bland Chromatin
    9. Normal Nucleoli
    10. Mitoses 
    11. Class: (2 for benign, 4 for malignant) > We will modify to (0 for benign, 1 for malignant) for simplicity
    
The data is loaded for you below.

In [None]:
columns = ['Sample code number','Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape',
           'Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin','Normal Nucleoli',
           'Mitoses','Class']
data = pd.read_csv('data/breast-cancer-wisconsin.csv',names=columns, index_col=0)
data["Bare Nuclei"] = data["Bare Nuclei"].replace(['?'],np.nan)
data = data.dropna()
data["Bare Nuclei"] = data["Bare Nuclei"].map(int)
data.Class = data.Class.map(lambda x: 1 if x == 4 else 0)
X = data.drop('Class', axis=1).values
y_train = data.Class.values

You will also have to compute several values, such as the number of cancer and healthy cells. To do so, remember that you can use boolean masks on numpy arrays. If you had a numpy array of labels called `labels` and wanted to obtain the ones with label $3$, you would do the following:

```python
filtered_labels = labels[labels==3]
```

You will additionally be asked to obtain the number of correct cancer cell predictions. Imagine that you have a numpy array with the predictions called `predictions` and a numpy array with the correct labels called `labels`, and that you want to count the correct predictions of label $4$. You would do the following:

```python
n_correct_predictions = labels[(labels==4) & (predictions==4)].shape[0]
```
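
Here is a small runnable sketch of both masks on made-up arrays (the contents of `labels` and `predictions` below are assumptions for illustration). It also shows `np.sum` on the boolean mask, which is an equivalent way to count:

```python
import numpy as np

# Hypothetical labels and predictions for five observations
labels = np.array([4, 2, 4, 4, 2])
predictions = np.array([4, 4, 2, 4, 2])

# how many observations have label 4?
n_label_4 = labels[labels == 4].shape[0]

# how many observations have label 4 AND were predicted as 4?
correct_mask = (labels == 4) & (predictions == 4)
n_correct_4 = labels[correct_mask].shape[0]

# summing a boolean mask counts its True entries
n_correct_4_alt = int(np.sum(correct_mask))

print(n_label_4, n_correct_4, n_correct_4_alt)  # 3 2 2
```

The parentheses around each comparison are needed because `&` binds more tightly than `==` in Python.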

Also, don't forget to use the following hyperparameter values for your logistic regression!

In [None]:
# Hyperparameters
learning_rate = 0.01
n_epoch = 100

# For validation
verbose = True

Now let's do this!

**Complete here:**

In [None]:
# STEP ONE: Initial analysis and data processing
# How many cells have cancer? (clue: use y_train)
n_cancer = None
# YOUR CODE HERE
raise NotImplementedError()

# How many cells are healthy? (clue: use y_train)
n_healthy = None
# YOUR CODE HERE
raise NotImplementedError()

# Normalize the training data X (clue: we have already implemented this)
x_train = None
# YOUR CODE HERE
raise NotImplementedError()

print("Number of cells with cancer: %i" % n_cancer)

print("\nThe last three normalized rows:")
print(x_train[-3:])

Expected output:

    Number of cells with cancer: 239

    The last three normalized rows:
    [[0.44444444 1.         1.         0.22222222 0.66666667 0.22222222
      0.77777778 1.         0.11111111 1.        ]
     [0.33333333 0.77777778 0.55555556 0.33333333 0.22222222 0.33333333
      1.         0.55555556 0.         1.        ]
     [0.33333333 0.77777778 0.77777778 0.44444444 0.33333333 0.44444444
      1.         0.33333333 0.         1.        ]]

In [None]:
# STEP TWO: Model training and predictions
# What coefficients can we get? (clue: we have already implemented this)
# note: don't forget to use all the hyperparameters defined above
coefficients = None
# YOUR CODE HERE
raise NotImplementedError()

# Initialize the predicted probabilities list
probas = None
# YOUR CODE HERE
raise NotImplementedError()

# What are the predicted probabilities on the training data?
for x in None:              # iterate through the training data x_train
    probas.append(None)     # append the predicted probability to the list (clue: we already implemented this)
    
# YOUR CODE HERE
raise NotImplementedError()

# If we had to say whether a cell had breast cancer, what are the predictions?
# clue 1: Hard assign the predicted probabilities by rounding them to the nearest integer
# clue 2: use np.round()
preds = None
# YOUR CODE HERE
raise NotImplementedError()

print("\nThe last three coefficients:")
print(coefficients[-3:])

print("\nThe last three obtained probas:")
print(probas[-3:])

print("\nThe last three predictions:")
print(preds[-3:])

Expected output:

    >epoch=0, learning_rate=0.010, error=0.617
    >epoch=10, learning_rate=0.010, error=0.209
    >epoch=20, learning_rate=0.010, error=0.143
    >epoch=30, learning_rate=0.010, error=0.114
    >epoch=40, learning_rate=0.010, error=0.097
    >epoch=50, learning_rate=0.010, error=0.086
    >epoch=60, learning_rate=0.010, error=0.077
    >epoch=70, learning_rate=0.010, error=0.071
    >epoch=80, learning_rate=0.010, error=0.066
    >epoch=90, learning_rate=0.010, error=0.062

    The last three coefficients:
    [0.70702475 0.33306501 3.27480969]

    The last three obtained probas:
    [0.9679181578309998, 0.9356364708465178, 0.9482109014966041]

    The last three predictions:
    [1. 1. 1.]
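
A side note on hard assignment: `np.round()` rounds values that are exactly halfway to the nearest even number, so a probability of exactly $0.5$ becomes $0$. The sketch below uses made-up probabilities (`probas_demo` is an assumption for illustration) and compares rounding with an explicit threshold, which is an alternative and not what this exercise asks for:

```python
import numpy as np

# Hypothetical predicted probabilities
probas_demo = np.array([0.2, 0.5, 0.51, 0.97])

# hard assignment by rounding (0.5 is rounded to the even neighbour, 0)
rounded = np.round(probas_demo)

# alternative: explicit threshold, where 0.5 counts as the positive class
thresholded = (probas_demo >= 0.5).astype(int)

print(rounded)      # [0. 0. 1. 1.]
print(thresholded)  # [0 1 1 1]
```

The two approaches only differ for probabilities of exactly $0.5$, which are rare in practice.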

In [None]:
# STEP THREE: Results analysis
# How many cells were predicted to have breast cancer? (clue: use preds and len() or .shape)
n_predicted_cancer = None
# YOUR CODE HERE
raise NotImplementedError()

# How many cells with cancer were correctly detected? (clue: use y_train, preds and len() or .shape)
n_correct_cancer_predictions = None
# YOUR CODE HERE
raise NotImplementedError()

print("Number of correct cancer predictions: %i" % n_correct_cancer_predictions)

Expected output:

    Number of correct cancer predictions: 239

In [None]:
print('You have a dataset with %s cells with cancer and %s healthy cells. \n\n'
     'After analysing the data and training your own logistic regression classifier you find out that it correctly '
     'identified %s out of %s cancer cells, which is all of them. You feel very lucky and happy. However, shortly '
     'afterwards you become somewhat suspicious of such amazing results. You feel that they should not be '
     'that good, but you do not know how to be sure of it. That is because you trained and tested on the same '
     'dataset, which does not seem right! You say to yourself that you will definitely give your best focus when '
     'doing the next Small Learning Unit 11, which will tackle exactly that.' % 
      (n_cancer, n_healthy, n_correct_cancer_predictions, n_cancer))

In [None]:
assert np.allclose(probas[:3], np.array([0.05075437808498781, 0.30382227212694596, 0.05238389294132284]))
assert np.isclose(n_predicted_cancer, 239)
assert np.allclose(coefficients[:3], np.array([-3.22309346, 0.40712798, 0.80696792]))
assert np.isclose(n_correct_cancer_predictions, 239)