# SLU09 - Classification With Logistic Regression: Exercise notebook

In [None]:
import pandas as pd 
import numpy as np 
import hashlib
import json
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
import utils

You thought that you would get away without implementing your own little logistic regression? Hah! In this notebook, you will:
- implement one pass of maximum likelihood optimization in three steps: implement the estimated probability function, calculate the log-likelihood cost function, and calculate one iteration of the optimization
- standardize data manually
- use sklearn for the same steps: standardize data, train the classifier and output predictions

### Exercise 1.1: Calculate the estimated probability

Recall the formula for the estimated probability for logistic regression:

$$\hat{p} = \frac{1}{1 + e^{-z}}$$

where $z$ is the linear combination of the features $x_1, ..., x_n$, and $\beta_0, ..., \beta_n$ are the coefficients of the model:

$$z = \beta_0 + \beta_1 x_1 + ... + \beta_n x_n$$

Implement a function that calculates the estimated probability for an observation. The inputs are two arrays, one with the features (x1, x2, ..., xn) and another with the model coefficients (b0, b1, ..., bn). The output is the estimated probability for the given observation.
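As a quick sanity check of the formula, here is a minimal sketch of the logistic function applied to an already computed $z$. The helper name `sigmoid` is our own and not part of the exercise; building $z$ from the features and the intercept is still up to you.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# z = 0 gives exactly 0.5, the point where both classes are equally likely
print(sigmoid(0.0))  # 0.5

# strongly negative z gives a probability close to 0, strongly positive z close to 1
print(sigmoid(-10.0), sigmoid(10.0))
```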

In [None]:
def predict_proba(data, coefs):
    """ 
    Function that returns the estimated probability for an observation.
    
    Args:
        data (np.array): a numpy array of shape (n,) with the features
        coefs (np.array): a numpy array of shape (n + 1,) with model coefficients
            - coefs[0]: intercept
            - coefs[1:]: remaining coefficients

    Returns:
        proba (float): the estimated probability, value between 0 and 1.

    """

    # hint: if using array multiplication, don't forget to add a field 
    #       for the intercept to the features (like you did in SLU07)
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return proba

In [None]:
x = np.array([-1.2, -1.5])
coefficients = np.array([0 ,4, -1])
np.testing.assert_almost_equal(round(predict_proba(x, coefficients),3),0.036)

x_1 = np.array([-1.5, -1, 3, 0])
coefficients_1 = np.array([0 ,2.1, -1, 0.5, 0])
np.testing.assert_almost_equal(round(predict_proba(x_1, coefficients_1),3),0.343)

### Exercise 1.2: Compute the log-likelihood cost function

During the optimization of the model coefficients, you need to calculate the log-likelihood cost function: 

$$H_{\hat{p}}(y) = \sum_{i=1}^{N} \left [ y_i \log\left(\hat{p}_i(x_i,\beta)\right) + (1-y_i) \log\left(1-\hat{p}_i(x_i,\beta)\right) \right ]$$

where N is the number of observations, $y_i$ are the true class labels, $x_i$ the feature vector of the ith observation, and $\beta$ are the model coefficients.

In this exercise, you will calculate the cost function for the given dataset. The inputs are an array of the feature vectors, an array of the model coefficients, and an array of the true class labels. You can use the function above or calculate everything from scratch, in which case it will be easier if you still remember how to multiply matrices. :)

In [None]:
def log_likelihood_cost_function(var_x, coefs, var_y):
    """ 
    Function that calculates log-likelihood for the given dataset
    
    Args:
        var_x (np.array): array with the features of the training data of size (m, n)
                   where m is the number of observations and n the number of features
        coefs (np.array): an array with the model coefficients of size (1, n+1)
        var_y (np.array): an array with the true class labels of size (m, 1)
        
    Returns:
        cost (float): a float with the resulting log-likelihood for the dataset

    """
   
    # YOUR CODE HERE
    raise NotImplementedError()
    return cost

In [None]:
x = np.array([[-2, -2], [3.5, 0], [6, 4]])
coefficients = np.array([[0 ,2, -1]])
y = np.array([[1],[1],[0]])
np.testing.assert_almost_equal(round(log_likelihood_cost_function(x, coefficients, y)),-30.0)
coefficients_1 = np.array([[3 ,4, -0.6]])
x_1 = np.array([[-4, -4], [6, 0], [3, 2], [4, 0]])
y_1 = np.array([[0],[1],[0],[1]])
np.testing.assert_almost_equal(round(log_likelihood_cost_function(x_1, coefficients_1, y_1)),-55.0)

### Exercise 1.3: Compute one iteration of the gradient descent

Now that we know how to calculate the probabilities and the cost function, let's do an interesting exercise: compute the first iteration of gradient descent for the given dataset according to the update rule

$$\beta_{t+1} = \beta_t + learning\_rate*\sum_{i=1}^{N}  \left[ x_i \left(y_i-\hat{p}_i(x_i,\beta_t)\right) \right] $$

Write a function that takes as arguments the training data and the learning rate and outputs the model coefficients $\beta$ after one iteration of the gradient descent. Initialize the coefficients with 0 like this:
```python
coefficients = np.zeros(n+1)
```
where n is the number of features of the model. Before you start, think for a moment about the dimensions of the terms in the sum that you need to multiply. Writing it down on paper helps. :)
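To make the dimension check concrete, the sketch below uses a made-up dataset with m = 3 observations and n = 2 features. All array names are hypothetical. It only shows how prepending a column of ones for the intercept makes the shapes line up; the learning rate and the update itself are left out.

```python
import numpy as np

x_toy = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # shape (m, n) = (3, 2)
y_toy = np.array([0, 1, 1])                              # shape (m,) = (3,)

# prepend a column of ones so that the intercept behaves like any other coefficient
x_aug = np.hstack([np.ones((x_toy.shape[0], 1)), x_toy])  # shape (m, n + 1) = (3, 3)
coefficients = np.zeros(x_toy.shape[1] + 1)               # shape (n + 1,) = (3,)

z = x_aug @ coefficients      # shape (m,): one linear combination per observation
residuals = y_toy - 0.5       # shape (m,): y - p_hat, since p_hat = 0.5 when all coefficients are 0
summed = x_aug.T @ residuals  # shape (n + 1,): sum over observations of x_i * (y_i - p_hat_i)

print(x_aug.shape, z.shape, summed.shape)
```

The matrix product with the transposed feature matrix performs the sum over observations in one step, which is why the result has one entry per coefficient.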

In [None]:
def compute_coefs_gd(x_train, y_train, learning_rate = 0.1, verbose = False):
    """ 
    Function that calculates the logistic regression coefficients 
    after the first iteration of gradient descent.

    Args:
        x_train (np.array): a numpy array with features of shape (m, n)
            m: number of training observations
            n: number of features
        y_train (np.array): a numpy array with the true class labels of shape (m,)
        learning_rate (np.float64): learning rate for the optimization

    Returns:
        coefficients (np.array): a numpy array of updated model coefficients of shape (n+1,)

    """
    
    # YOUR CODE HERE
    raise NotImplementedError()
    return coefficients

In [None]:
#Test 1
x_train = np.array([[5.5,2.3,4.0,1.3], [6.9,3.1,4.9,1.5], [7.3,2.9,6.3,1.8], [6.4,2.8,5.6,2.1]])
y_train = np.array([0,0,1,1])
learning_rate = 0.1
x_standard=StandardScaler().fit_transform(x_train)
coef=compute_coefs_gd(x_standard, y_train, learning_rate)

np.testing.assert_almost_equal(round(coef[0],3),0)
np.testing.assert_almost_equal(round(coef[1],3),0.097)
np.testing.assert_almost_equal(round(coef[2],3),0.051)
np.testing.assert_almost_equal(round(coef[3],3),0.176)
np.testing.assert_almost_equal(round(coef[4],3),0.181)

#Test 2
x_train_1 = np.array([[6.7,3.0,5.2,2.3], [6.3,2.5,5.0,1.9], [7.7,3.8,6.7,2.2], [7.7,2.6,6.9,2.3],
                      [6.0,2.7,5.1,1.6], [5.4,3.0,4.5,1.5], [6.3,3.3,4.7,1.6], [4.9,2.4,3.3,1.0]])
y_train_1 = np.array([0,0,0,0,1,1,1,1])
learning_rate = 0.1
x_1_standard=StandardScaler().fit_transform(x_train_1)
coef1=compute_coefs_gd(x_1_standard, y_train_1, learning_rate)

np.testing.assert_almost_equal(round(coef1.max(),3) ,0.)
np.testing.assert_almost_equal(round(coef1.min(),3) ,-0.349)
np.testing.assert_almost_equal(round(coef1.mean(),3),-0.2)
np.testing.assert_almost_equal(round(coef1.var(),3) ,0.02)

### Exercise 1.4: Compute one iteration of Newton's method

Once you have mastered the previous exercise, this one will be a breeze. Do one iteration of Newton's method using the update rule

$$\beta_{t+1} = \beta_t + learning\_rate * \sum_{i=1}^{N} \frac{\left(y_i - \hat{p}_i(x_i,\beta_t)\right)} {\hat{p}_i(x_i,\beta_t) \ \left(1 - \hat{p}_i(x_i,\beta_t)\right) \ x_i }$$

Write a function that takes as arguments the training data and the learning rate and outputs the model coefficients $\beta$ after one iteration of Newton's method. Initialize the coefficients with 0 as before.

In [None]:
def compute_coefs_nm(x_train, y_train, learning_rate = 0.1, verbose = False):
    """ 
    Function that calculates the logistic regression coefficients 
    after the first iteration of Newton's method.

    Args:
        x_train (np.array): a numpy array with features of shape (m, n)
            m: number of training observations
            n: number of features
        y_train (np.array): a numpy array with the true class labels of shape (m,)
        learning_rate (np.float64): learning rate for the optimization

    Returns:
        coefficients (np.array): a numpy array of updated model coefficients of shape (n+1,)

    """
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return coefficients

In [None]:
#Test 1
x_train = np.array([[5.5,2.3,4.0,1.3], [6.9,3.1,4.9,1.5], [7.3,2.9,6.3,1.8], [6.4,2.8,5.6,2.1]])
y_train = np.array([0,0,1,1])
learning_rate = 0.1
x_standard=StandardScaler().fit_transform(x_train)
coef=compute_coefs_nm(x_standard, y_train, learning_rate)

np.testing.assert_almost_equal(round(coef[0],3),0)
np.testing.assert_almost_equal(round(coef[1],3),-1.129)
np.testing.assert_almost_equal(round(coef[2],3),2.772)
np.testing.assert_almost_equal(round(coef[3],3),1.29)
np.testing.assert_almost_equal(round(coef[4],3),1.136)

#Test 2
x_train_1 = np.array([[6.7,3.0,5.2,2.3], [6.3,2.5,5.0,1.9], [7.7,3.8,6.7,2.2], [7.7,2.6,6.9,2.3],
                      [6.0,2.7,5.1,1.6], [5.4,3.0,4.5,1.5], [6.3,3.3,4.7,1.6], [4.9,2.4,3.3,1.0]])
y_train_1 = np.array([0,0,0,0,1,1,1,1])
learning_rate = 0.1
x_1_standard=StandardScaler().fit_transform(x_train_1)
coef1=compute_coefs_nm(x_1_standard, y_train_1, learning_rate)

np.testing.assert_almost_equal(round(coef1.max(),3) ,0.037)
np.testing.assert_almost_equal(round(coef1.min(),3) ,-11.567)
np.testing.assert_almost_equal(round(coef1.mean(),3),-3.173)
np.testing.assert_almost_equal(round(coef1.var(),3) ,18.671)

### Exercise 2: Standardize data

To get this concept into your head, let's write a quick and easy function to standardize the data. Recall that standardized data have zero mean and unit variance:

$$ x_{standardized} = \frac{x - mean(x)}{std(x)}$$

Don't forget that the `axis` argument is critical when obtaining the mean values!

Implement the function below to standardize the given data. The input is an array of features and the output is an array of the same size with standardized features.
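To see why the `axis` argument matters, here is a small sketch on a made-up array (the name `toy` is hypothetical). With observations in rows and features in columns, `axis=0` aggregates over the observations and returns one value per feature, while `axis=1` returns one value per observation.

```python
import numpy as np

toy = np.array([[1.0, 10.0],
                [3.0, 30.0]])  # 2 observations (rows), 2 features (columns)

feature_means = toy.mean(axis=0)      # one mean per feature: 2 and 20
observation_means = toy.mean(axis=1)  # one mean per observation: 5.5 and 16.5
feature_stds = toy.std(axis=0)        # one standard deviation per feature: 1 and 10

print(feature_means, observation_means, feature_stds)
```

Note that `np.std` computes the population standard deviation by default (`ddof=0`), which is the same convention that sklearn's `StandardScaler` uses.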

In [None]:
def standardize_data_function(data):
    """ 
    Function that standardizes the features
    
    Args:
        data (np.array): a numpy array with observations of shape (m, n)
            m: number of observations
            n: number of features

    Returns:
        standardized_data (np.array): a numpy array with standardized features of shape (m, n)

    """
   
    # YOUR CODE HERE
    raise NotImplementedError()
    return standardized_data

In [None]:
data = np.array([[7,7,3], [2,2,11], [9,5,2], [0,9,5], [10,1,3], [1,5,2]])
standardized_data = standardize_data_function(data)
print('Before standardization:')
print(data)
print('\n-------------------\n')
print('After standardization:')
print(standardized_data)

In [None]:
data = np.array([[2,2,11,1], [7,5,1,3], [9,5,2,6]])
standardized_data = standardize_data_function(data)
np.testing.assert_almost_equal(round(standardized_data.mean(),0),0.)
np.testing.assert_almost_equal(round(standardized_data.var(axis=0).mean(),0),1.)
np.testing.assert_almost_equal(round(standardized_data.min(),3),-1.414)
np.testing.assert_almost_equal(round(standardized_data.max(),3),1.408)

data1 = np.array([[1,3,1,3], [9,5,3,1], [2,2,4,6]])
standardized_data1 = standardize_data_function(data1)
np.testing.assert_almost_equal(round(standardized_data1.mean(),0),0.)
np.testing.assert_almost_equal(round(standardized_data1.var(axis=0).mean(),0),1.)
np.testing.assert_almost_equal(round(standardized_data1.min(),3),-1.336)
np.testing.assert_almost_equal(round(standardized_data1.max(),3),1.405)

### Exercise 3.1: Train a logistic regression classifier with sklearn

Finally, we're getting to use sklearn! You will train a logistic regression classifier to distinguish between two varieties of raisins, Kecimen and Besni, based on their size and shape. The raisins were photographed and features describing their size and shape were extracted from the images. The original dataset is available [here](https://www.kaggle.com/datasets/muratkokludataset/raisin-dataset). Take a look at the dataset. `Class` indicates raisin variety with True for Kecimen and False for Besni. All the other columns are size and shape features.

In [None]:
# We will load the dataset for you
raisins = pd.read_csv('data/raisins_dataset.csv')
raisins.head()

Implement a function that will train a sklearn logistic regression model on the `raisins` dataset. It should return the classifier instance, the probabilities for the raisins to be of Kecimen variety, and the coefficients of the model including the intercept.

- use all available features to train the model
- use `Class` as the target
- standardize the features
- fit a logistic regression for a maximum of 100 iterations and random state = 100 (look in the API reference for the necessary parameters)

The input of the function is the `raisins` dataset. The output is the classifier, an array of probabilities, an array of model coefficients, and the model intercept. Notice that the target is encoded as True/False - sklearn will understand this. Make sure to return the probabilities of the positive class!
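If the sklearn workflow is not fresh in your mind, the sketch below runs the same kind of pipeline on made-up data: standardize, fit, then read out the probabilities, coefficients and intercept. The toy arrays and all variable names are our own assumptions and have nothing to do with the raisins dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# made-up data: two clusters of 20 points each, with a boolean target
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
                   rng.normal(3.0, 1.0, size=(20, 2))])
y_toy = np.array([False] * 20 + [True] * 20)

# standardize the features, then fit the classifier
X_scaled = StandardScaler().fit_transform(X_toy)
clf_toy = LogisticRegression(max_iter=100, random_state=100).fit(X_scaled, y_toy)

# predict_proba returns one column per class, in the order given by clf.classes_
positive_col = list(clf_toy.classes_).index(True)
probas_toy = clf_toy.predict_proba(X_scaled)[:, positive_col]

print(clf_toy.coef_.shape, clf_toy.intercept_.shape)  # (1, 2) (1,)
```

Looking up the positive class in `classes_` instead of hard-coding a column index guards against accidentally returning the probabilities of the negative class.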

In [None]:
from sklearn.linear_model import LogisticRegression

def train_model_sklearn(dataset):
    '''
    Fits logistic regression to the raisins dataset
    and returns the classifier instance, the probabilities, the model coefficients and the intercept.
    
    Args:
        dataset(pd.DataFrame): training dataset
    
    Returns:
        clf: the classifier
        probas (np.array): Array of floats with the probability 
                           of each raisin being the Kecimen variety
        coefficients (np.array): coefficients of the trained logistic regression.
        intercept (np.array): intercept of the trained logistic regression          
    '''
    
    # YOUR CODE HERE
    raise NotImplementedError()
    return clf, probas, coefficients, intercept
    

In [None]:
lr, probas, coef, intercept = train_model_sklearn(raisins)

assert str(lr)=='LogisticRegression(random_state=100)',"Did you use the correct classifier?"

# Testing Probas
np.testing.assert_almost_equal(round(probas.max()), 1, err_msg="Something is wrong with your probabilities.")
np.testing.assert_almost_equal(round(probas.min()), 0, err_msg="Something is wrong with your probabilities.")
np.testing.assert_almost_equal(round(probas.mean(),3), 0.500, 2, err_msg="Something is wrong with your probabilities.")
np.testing.assert_almost_equal(round(probas.std(),5), 0.36992, 3, err_msg="Something is wrong with your probabilities.")
np.testing.assert_almost_equal(round(probas.sum())*0.001, 0.450, 3, err_msg="Something is wrong with your probabilities.")

# Testing Coefs
np.testing.assert_almost_equal(round(coef.max(),3), 0.733, 2, err_msg="Something is wrong with your model coefficients.")
np.testing.assert_almost_equal(round(coef.min(),3), -2.333, 2, err_msg="Something is wrong with your model coefficients.")
np.testing.assert_almost_equal(round(coef.mean(),4), -0.4395, 3, err_msg="Something is wrong with your model coefficients.")
np.testing.assert_almost_equal(round(coef.var(),4), 0.7687, 3, err_msg="Something is wrong with your model coefficients.")
np.testing.assert_almost_equal(round(coef.sum(),4), -3.0768, 3, err_msg="Something is wrong with your model coefficients.")

assert hashlib.sha256(json.dumps(str(round(intercept[0],3))).encode()).hexdigest()=='ad5db3ccd28d807e79ebc49fcb89236070829c4145ec1ff8db72b06b06bcb350',"Something is wrong with your intercept"

### Exercise 3.2: Decision boundary

In general, the decision boundary in binary logistic regression is a hyperplane of dimension n - 1 in the feature space, with n being the number of features. You can imagine this in 3D: it's like a cloud of observations cut with a decision boundary knife. Recall that you can derive the equation for this hyperplane from the logistic regression formula.
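As a reminder of that derivation, here is a sketch assuming the usual classification threshold of $\hat{p} = 0.5$ (the index $k$ for the feature being solved for is our own notation):

$$\hat{p} = \frac{1}{1 + e^{-z}} = 0.5 \iff z = 0 \iff \beta_0 + \beta_1 x_1 + ... + \beta_n x_n = 0$$

Solving for a single feature $x_k$ (with $\beta_k \neq 0$) gives its value on the boundary:

$$x_k = -\frac{\beta_0 + \sum_{j \neq k} \beta_j x_j}{\beta_k}$$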

For the classification model from exercise 3.1, calculate the value of the feature `Perimeter` on the decision boundary given the values of the other six features. Return it as a float of the same name. The values of the other six features are given below.

In [None]:
# the features were scaled, which is why some of the values are negative
Area = 1.94
MajorAxisLength = 2.31
MinorAxisLength = 0.92
Eccentricity = -0.73
ConvexArea=3.45
Extent=-2.78

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(Perimeter,float), 'Perimeter should be a float, not an array'
assert hashlib.sha256(json.dumps(str(round(Perimeter,1))).encode()).hexdigest()=='81cb0c0ea658f6d9b2de914a4cc2b72fb79b1f3453e18c843ccb64e6bc7b4aa6',"Not correct, try again."

### Exercise 3.3: Logistic regression with fewer features

Train another logistic regression for the `raisins` dataset, but use only two features, `MinorAxisLength` and `Perimeter`. As before, use max 100 iterations and set random_state to 100. Standardize the features.

The input of the function is the `raisins` dataset. The output is the classifier, an array of probabilities of the positive class, an array of model coefficients, and the model intercept.

In [None]:
def train_model_sklearn_2_features(dataset):
    '''
    Fits logistic regression to selected features of the raisins dataset
    and returns the classifier, the probabilities, the model coefficients and the intercept.
    Uses the features MinorAxisLength and Perimeter.
    
    Args:
        dataset(pd.DataFrame): training dataset
    
    Returns:
        clf: the classifier
        probas (np.array): Array of floats with the probability 
                           of each raisin being the Kecimen variety
        coefficients (np.array): coefficients of the trained logistic regression.
        intercept (np.array): intercept of the trained logistic regression          
    '''
    
    # YOUR CODE HERE
    raise NotImplementedError()
    return clf, probas, coefficients, intercept
    

In [None]:
lr2, probas2, coef2, intercept2 = train_model_sklearn_2_features(raisins)

assert str(lr2)=='LogisticRegression(random_state=100)',"Did you use the correct classifier?"

# Testing Probas
np.testing.assert_almost_equal(round(probas2.max()), 1, err_msg="Something is wrong with your probabilities.")
np.testing.assert_almost_equal(round(probas2.min()), 0, err_msg="Something is wrong with your probabilities.")
np.testing.assert_almost_equal(round(probas2.mean(),1), 0.5, 1, err_msg="Something is wrong with your probabilities.")
np.testing.assert_almost_equal(round(probas2.std(),5), 0.36578, 3, err_msg="Something is wrong with your probabilities.")
np.testing.assert_almost_equal(round(probas2.sum())*0.001, 0.450, 3, err_msg="Something is wrong with your probabilities.")

# Testing Coefs
np.testing.assert_almost_equal(round(coef2.mean(),3), -1.465, 2, err_msg="Something is wrong with your model coefficients.")
np.testing.assert_almost_equal(round(coef2.var(),5), 5.98857, 3, err_msg="Something is wrong with your model coefficients.")
np.testing.assert_almost_equal(round(coef2.sum(),3), -2.931, 2, err_msg="Something is wrong with your model coefficients.")

assert hashlib.sha256(json.dumps(str(round(intercept2[0],2))).encode()).hexdigest()=='90248082ff854cb0699ef9c82c9514d456a536e4092bb1fd69e72e446dfc8cbd',"Something is wrong with your intercept"

correct1,correct2=utils.compare_classifiers(lr,lr2,raisins.drop(columns=['Class']),
                          raisins[['MinorAxisLength','Perimeter']],raisins.Class)
print("The 7-feature model classified %d out of %d raisins correctly." % (correct1,raisins.shape[0]))
print("The 2-feature model classified %d out of %d raisins correctly." % (correct2,raisins.shape[0]))

As you can see, both models performed similarly well. This is because most of the features have little influence on the outcome. The most important feature is `Perimeter`. You can see this in the size of the corresponding model coefficient (uncomment and run the cell below). Selection of features and their importance for the model predictions will be discussed in SLU14.

In [None]:
# uncomment this cell to see the coefficients
# The higher the absolute value of a coefficient, the more it influences the model predictions.
# The order of the coefficients is the same as the order of features input into the model.
#print('7-features model coefficients')
#print(coef)
#print('2-features model coefficients')
#print(coef2)

Here is a plot of the datapoints and the decision boundary for the 2-feature model (with scaled features). We can't plot the result for the 7-feature model because the boundary cannot be projected into 2D space. Think about why!

In [None]:
utils.plot_exercise_boundary(raisins,['MinorAxisLength','Perimeter'],raisins.Class)

Congratulations, you've learned to train your first classifier! But how good is your model at predicting the outcome? You will learn how to evaluate model performance using metrics in the next SLU!

We have one more optional ungraded exercise below if you'd like to practice more. It is the same as exercise 3, just with another dataset.

<img src="https://imgs.xkcd.com/comics/machine_learning.png">

### Exercise 4.1 - optional, ungraded

The dataset for this exercise describes the dependency of cannabis use on personality measures: neuroticism, extraversion, openness to experience, agreeableness, conscientiousness, impulsivity, and sensation seeking. It is a subset of [this dataset](https://www.kaggle.com/datasets/obeykhadija/drug-consumptions-uci).

In [None]:
cannabis = pd.read_csv('data/cannabis_consumption.csv')
cannabis.head()

Implement a function that will train a sklearn logistic regression model on the `cannabis` dataset. It should return the classifier instance, the probabilities of cannabis use, and the coefficients of the model including the intercept.

- use only the numerical features to train the model (Nscore, Escore, Oscore, Ascore, Cscore, Impulsive, SS)
- use `Cannabis` as the target, which is True for use in the past year
- standardize the features
- fit a logistic regression for a maximum of 100 iterations and random state = 100

The input of the function is the `cannabis` dataset. The output is the classifier, an array of probabilities, an array of model coefficients, and the model intercept. Make sure to return the probabilities of the positive class!

In [None]:
def train_model_sklearn_cannabis(dataset):
    '''
    Fits logistic regression to the cannabis dataset
    using the numerical features Nscore, Escore, Oscore, Ascore, Cscore, Impulsive, SS
    and returns the classifier instance, the probabilities, the model coefficients and the intercept.
    
    Args:
        dataset(pd.DataFrame): training dataset
    
    Returns:
        clf: the classifier
        probas (np.array): array of floats with the probability 
                           of cannabis use in the past year
        coefficients (np.array): coefficients of the trained logistic regression.
        intercept (np.array): intercept of the trained logistic regression          
    '''
    
    # YOUR CODE HERE
    raise NotImplementedError()
    return clf, probas, coefficients, intercept
    

In [None]:
lr_can, probas_can, coef_can, intercept_can = train_model_sklearn_cannabis(cannabis)

assert str(lr_can)=='LogisticRegression(random_state=100)',"Did you use the correct classifier?"

# Testing Probas
np.testing.assert_almost_equal(round(probas_can.max()), 1, err_msg="Something is wrong with your probabilities.")
np.testing.assert_almost_equal(round(probas_can.min(),5), 0.006, 3, err_msg="Something is wrong with your probabilities.")
np.testing.assert_almost_equal(round(probas_can.mean(),5), 0.46975, 3, err_msg="Something is wrong with your probabilities.")
np.testing.assert_almost_equal(round(probas_can.std(),5), 0.28087, 3, err_msg="Something is wrong with your probabilities.")
np.testing.assert_almost_equal(round(probas_can.sum())*0.001, 0.885, 3, err_msg="Something is wrong with your probabilities.")

# Testing Coefs
np.testing.assert_almost_equal(round(coef_can.max(),3), 0.484, 2, err_msg="Something is wrong with your model coefficients.")
np.testing.assert_almost_equal(round(coef_can.min(),5), -0.8901, 2, err_msg="Something is wrong with your model coefficients.")
np.testing.assert_almost_equal(round(coef_can.mean(),3), -0.113, 2, err_msg="Something is wrong with your model coefficients.")
np.testing.assert_almost_equal(round(coef_can.var(),3), 0.213, 2, err_msg="Something is wrong with your model coefficients.")
np.testing.assert_almost_equal(round(coef_can.sum(),5), -0.90509, 3, err_msg="Something is wrong with your model coefficients.")

assert hashlib.sha256(json.dumps(str(round(intercept_can[0],3))).encode()).hexdigest()=='814a53cf1b085b11d13b72d79dedfe3b0b0f65322cd9743284577ad571ae4473',"Something is wrong with your intercept"

### Exercise 4.2 - optional, ungraded

Now train another logistic regression model on the same target using only the Oscore and SS features.

In [None]:
def train_model_sklearn_2_features_cannabis(dataset):
    '''
    Fits logistic regression to selected features of the cannabis dataset
    and returns the classifier, the probabilities, the model coefficients and the intercept.
    Uses the features Oscore and SS.
    
    Args:
        dataset(pd.DataFrame): training dataset
    
    Returns:
        clf: the classifier
        probas (np.array): array of floats with the probability 
                           of cannabis use in the past year
        coefficients (np.array): coefficients of the trained logistic regression.
        intercept (np.array): intercept of the trained logistic regression          
    '''
    
    # YOUR CODE HERE
    raise NotImplementedError()
    return clf, probas, coefficients, intercept
    

In [None]:
lr_can2, probas_can2, coef_can2, intercept_can2 = train_model_sklearn_2_features_cannabis(cannabis)

assert str(lr_can2)=='LogisticRegression(random_state=100)',"Did you use the correct classifier?"

# Testing Probas
np.testing.assert_almost_equal(round(probas_can2.max(),3), 0.975, 2, err_msg="Something is wrong with your probabilities.")
np.testing.assert_almost_equal(round(probas_can2.min(),4), 0.0209, 3, err_msg="Something is wrong with your probabilities.")
np.testing.assert_almost_equal(round(probas_can2.mean(),4), 0.4697, 2, err_msg="Something is wrong with your probabilities.")
np.testing.assert_almost_equal(round(probas_can2.std(),4), 0.2508, 3, err_msg="Something is wrong with your probabilities.")
np.testing.assert_almost_equal(round(probas_can2.sum())*0.001, 0.885, 3, err_msg="Something is wrong with your probabilities.")

# Testing Coefs
np.testing.assert_almost_equal(round(coef_can2.mean(),3), -0.773, 2, err_msg="Something is wrong with your model coefficients.")
np.testing.assert_almost_equal(round(coef_can2.var(),5), 0.01275, 3, err_msg="Something is wrong with your model coefficients.")
np.testing.assert_almost_equal(round(coef_can2.sum(),3), -1.547, 2, err_msg="Something is wrong with your model coefficients.")

assert hashlib.sha256(json.dumps(str(round(intercept_can2[0],2))).encode()).hexdigest()=='2c166a03f5d797741a30f163766b3a052ec21f52c10c6fefbd5dc11912632748',"Something is wrong with your intercept"

correct1,correct2=utils.compare_classifiers(lr_can,lr_can2,cannabis.select_dtypes(include='number'),
                          cannabis[['Oscore','SS']],cannabis.Cannabis)
print("The 7-feature model classified %d out of %d subjects correctly." % (correct1,cannabis.shape[0]))
print("The 2-feature model classified %d out of %d subjects correctly." % (correct2,cannabis.shape[0]))

Again, both models performed similarly well because the features Oscore and SS are the most important ones. Uncomment the cell below to see the model coefficients.

In [None]:
# uncomment this cell to see the coefficients
# The higher the absolute value of a coefficient, the more it influences the model predictions.
# The order of the coefficients is the same as the order of features input into the model.
#print('7-features model coefficients')
#print(coef_can)
#print('2-features model coefficients')
#print(coef_can2)

Here is a plot of the datapoints and the decision boundary for the 2-feature model.

In [None]:
utils.plot_exercise_boundary(cannabis,['Oscore','SS'],cannabis.Cannabis)

The original Kaggle dataset has data on several drugs, not just cannabis, so you can practice even more. :) Just make sure that you choose a balanced target.