In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
from sklearn.model_selection import train_test_split
import math
import random
from random import uniform
from sklearn import preprocessing
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# Exercise 1 : Binary classification using logistic regression

## Loading and inspecting the data

In [8]:
# We will load the data that is in the file 'donnees_exo1.txt'
data = pd.read_csv('donnees_exo1.txt', sep = ' ')

In [None]:
data

Questions : 
1. How many examples are available in this dataset ? 
2. How many features ? 
3. What is the distribution of the target variable ?

*Hint : value_counts can help for the last question*
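
As a minimal sketch of the kind of inspection meant here, the snippet below uses a small hypothetical table (`toy`, not the real file) with the same column layout as the exercise (X1, X2, Y). The names `toy`, `n_examples`, `n_features` and `class_counts` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data with the same layout as the exercise (X1, X2, Y)
toy = pd.DataFrame({
    "X1": [0.1, -0.4, 0.7, 0.3, -0.9],
    "X2": [0.5, 0.2, -0.6, 0.8, -0.1],
    "Y": [1, 0, 0, 1, 1],
})

n_examples, n_columns = toy.shape        # rows are examples, columns include the target
n_features = n_columns - 1               # every column except the target Y
class_counts = toy["Y"].value_counts()   # number of examples in each class

print(n_examples, n_features)
print(class_counts)
```

The same three calls can be applied to `data` once the file is loaded.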

Exercise: Plot this dataset on a single graph, using a different color for each class value.

X1 should be on the x-axis, X2 on the y-axis and the points colored depending on the class value (0 or 1)

*Hints:* 
- *You can use scatter function for the plot*
- *There are different possibilities for the colors: you can create two sets (one with class 0 and one with class 1) or use colorMaps (look at the documentation)* 

## Train / Validation / Test split

Exercise : Split the dataset into 3 sets : train (70%), validation (15%) and test (15%).

*Hint : you can use the 'train_test_split' function that splits a set in 2 sets with a proportion that you can give as parameter. To split into 3 sets, you'll need to use it twice (be careful about the proportions)*
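
The two-step split described in the hint could look like the sketch below. It runs on hypothetical random data (`toy`), and the names `rest`, `val_data` and `test_data` are assumptions chosen for illustration; only `train_data` is a name used later in this notebook. The key point is the second proportion: half of the remaining 30% gives 15% of the whole.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the exercise dataset
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "X1": rng.uniform(-1, 1, 200),
    "X2": rng.uniform(-1, 1, 200),
    "Y": rng.integers(0, 2, 200),
})

# First split: keep 70% for training and set aside the remaining 30%
train_data, rest = train_test_split(toy, test_size=0.30, random_state=0)

# Second split: cut the remaining 30% in half -> 15% validation, 15% test
val_data, test_data = train_test_split(rest, test_size=0.50, random_state=0)

print(len(train_data), len(val_data), len(test_data))
```

Fixing `random_state` makes the split reproducible from one run to the next.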

Exercise : Plot on the same graph the train and validation data with different colors for the class values and different markers between train and validation set.

## Fitting a logistic regression model

We will now fit a logistic regression model to the training set.
For this, we can use the command in the next cell:

In [None]:
lr_model = LogisticRegression().fit(train_data.iloc[:,0:2], train_data.Y)
# Parameters can be added to the 'LogisticRegression' call (we will see this later, when needed)
# 2 parameters are given to the 'fit' function : 
#  - a dataset containing the features of the examples (here X1 and X2, the 2 first columns of the training set)
#  - a vector containing the labels (Y) of the examples (in the same order), 0 or 1 here

Question : How many parameters should this model have ?

The constant of the model (theta_0 in the slides) can be obtained by the command:

In [None]:
lr_model.intercept_

The parameters associated with each feature can be obtained by the command:

In [None]:
lr_model.coef_

Using the values of the features for the first example of the training set (row 0) and the values of the parameters of the model above, calculate the output that the model should give (using the equation of the logistic regression model).

*Hint : exponential can be obtained with math.exp()*
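
As an illustration of the calculation, here is a sketch on hypothetical toy data (not the exercise dataset). It assumes the usual logistic model P(Y = 1 | x1, x2) = 1 / (1 + exp(-(theta_0 + theta_1 x1 + theta_2 x2))) and compares the hand-computed value with the one returned by scikit-learn. The variable names are illustrative.

```python
import math
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 2 features and a binary label
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)

theta_0 = model.intercept_[0]         # constant term of the model
theta_1, theta_2 = model.coef_[0]     # one weight per feature

x1, x2 = X[0]                         # features of the first example
z = theta_0 + theta_1 * x1 + theta_2 * x2
p_manual = 1.0 / (1.0 + math.exp(-z))  # P(Y = 1 | x1, x2) computed by hand

# Probability of class 1 as computed by scikit-learn, for comparison
p_sklearn = model.predict_proba(X[0:1])[0, 1]
print(p_manual, p_sklearn)
```

The two printed values should agree up to floating-point rounding.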

What should be the decision made by this model for this example ?

This value can also be obtained automatically with the command:

In [None]:
lr_model.predict_proba(train_data.iloc[0:1,0:2])
# This command actually gives you the two probabilities : P(Y = 0 | X1, X2) and P(Y = 1 | X1, X2)
# You should obtain the same probability as the one calculated above

In [None]:
# The final decision of the model can be obtained by the command:
lr_model.predict(train_data.iloc[0:1,0:2])

Question : Does this model make a good decision for this example ?

The accuracy of a classifier on a dataset (the fraction of examples for which the prediction is correct) can be obtained automatically with the 'score' function of the classifier.
It needs 2 parameters : 
 - a dataset containing the examples that we want to predict (their features)
 - the real class of these examples.

Below, you will see how to compute the accuracy of the model 'lr_model' on the training set:

In [None]:
lr_model.score(train_data.iloc[:,0:2], train_data.Y)
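
As a cross-check, the value returned by `score` is the fraction of examples whose predicted class matches the true class. The sketch below verifies this on hypothetical toy data; `acc_score` and `acc_manual` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 2 features and a binary label
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(100, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)

acc_score = model.score(X, y)                  # accuracy computed by scikit-learn
acc_manual = np.mean(model.predict(X) == y)    # fraction of correct predictions

print(acc_score, acc_manual)
```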

Exercise : Compute the accuracy of this model on the validation set 

## Visualization of the decision boundary

The function below allows you to see the decision boundary of a logistic regression model.
There are 7 parameters : 
 - model : a logistic regression model already fitted
 - data : the data points that you want to draw together with the boundary (training set for instance)
 - deg : the degree of the polynomial features used in the model (1 if it is a classical model). These polynomial features will be discussed later
 - xmin, xmax, ymin, ymax : the min and max coordinates that you want to be displayed (depends on the data)

An example of the use of this function is given below. 

In [None]:
def draw_boundary(model, data, deg, xmin, xmax, ymin, ymax):
    h = 0.05

    # create a grid of points from xmin to xmax and from ymin to ymax with a step h
    xx, yy = np.meshgrid(np.arange(xmin, xmax, h), np.arange(ymin, ymax, h))

    zz = np.c_[xx.ravel(), yy.ravel()] # create a matrix containing the grid points
    zz = pd.DataFrame(zz) # convert it to a dataframe to be able to use it with the model
    
    # adjust the points to the desired degree
    if(deg>1):
        poly = PolynomialFeatures(degree = deg)
        zz2 = poly.fit_transform(zz)
        zz2 = pd.DataFrame(zz2)
    else:
        zz2 = zz

    # predict the class of each point of the grid
    pred_zz= pd.Series(model.predict(zz2))

    color_map = matplotlib.colors.ListedColormap(pd.Series(['blue', 'red']))
    fig = plt.figure(figsize=(8,8))

    # plot a grid of points to show where the decision boundary is
    fig = plt.scatter(zz.iloc[:,0], zz.iloc[:,1], c = pred_zz, cmap = color_map, marker='+')

    # plot the dataset points
    fig = plt.scatter(data.iloc[:,0], data.iloc[:,1], s = 50, c = data.iloc[:,2], cmap = color_map)

    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.title('Decision boundary of degree ' + str(deg))
    plt.show()
    

In [None]:
draw_boundary(lr_model, train_data, 1, -1.2, 1.2, -1.2, 1.2)

Questions : 
1. What is the type of this decision boundary ? 
2. Was it expected ? 
3. Does it seem adapted to this dataset ?

We will now see how to add polynomial features to the dataset before fitting a logistic regression model:

In [None]:
poly = PolynomialFeatures(degree = 2) # Here we create an object that will be used to add polynomial features.
# The degree of the features can be modified in the parameters
X_train2 = poly.fit_transform(train_data.iloc[:,0:2]) # Here we use the 'poly' created above to add polynomial
# features of degree 2 to the training set
# It creates 4 new columns in addition to X1 and X2 (6 columns in total) : 
# - one with 'ones' everywhere
# - one with X1^2
# - one with X1*X2
# - one with X2^2


Question : Check the dimensions of the new training set created above

Now we can fit a new logistic regression model using the new training set:

In [None]:
lr_model2 = LogisticRegression().fit(X_train2, train_data.Y)

Question : Compute the prediction accuracy of this new model on the training set. Is it better than the first model ? 

Exercise : Look at the decision boundary of this new model, using the 'draw_boundary' function. You will need to set the 'deg' parameter to 2

Exercise : Compute the prediction accuracy of this new model on the validation data

*Hint : You will first need to create a 'new' validation dataset by adding the same polynomial features to the original validation set*
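
One possible way to follow this hint is sketched below on hypothetical toy data. The assumption made here is that the transformer fitted on the training set is reused on the validation set with `transform`, so that both sets get exactly the same columns. All variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy data with a circular class boundary
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(140, 2))
X_val = rng.uniform(-1, 1, size=(30, 2))
y_train = (X_train[:, 0] ** 2 + X_train[:, 1] ** 2 < 0.5).astype(int)
y_val = (X_val[:, 0] ** 2 + X_val[:, 1] ** 2 < 0.5).astype(int)

poly = PolynomialFeatures(degree=2)
X_train2 = poly.fit_transform(X_train)   # fit the transformer on the training set
X_val2 = poly.transform(X_val)           # reuse it unchanged on the validation set

model2 = LogisticRegression().fit(X_train2, y_train)
val_accuracy = model2.score(X_val2, y_val)
print(X_val2.shape, val_accuracy)
```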

Exercise : Now try adding higher degree polynomial features to the dataset, and apply the procedure explained during the lecture to select the most suitable model. Estimate its generalization error
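
The sketch below shows one common form of such a procedure, under the assumption that it matches the one from the lecture: fit one model per degree on the training set, keep the degree with the best validation accuracy, then estimate the generalization error on the test set, which was not used for any choice. It runs on hypothetical toy data and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy data with a circular class boundary
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)

# 70% train, 15% validation, 15% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=0)

val_scores = {}   # validation accuracy for each degree
fitted = {}       # (transformer, model) pair for each degree
for deg in range(1, 6):
    poly = PolynomialFeatures(degree=deg)
    model = LogisticRegression(max_iter=1000).fit(poly.fit_transform(X_train), y_train)
    val_scores[deg] = model.score(poly.transform(X_val), y_val)
    fitted[deg] = (poly, model)

# Select the degree with the highest validation accuracy
best_deg = max(val_scores, key=val_scores.get)
best_poly, best_model = fitted[best_deg]

# Estimate the generalization error on the untouched test set
test_accuracy = best_model.score(best_poly.transform(X_test), y_test)
generalization_error = 1 - test_accuracy
print(best_deg, generalization_error)
```

The validation set is used only to choose the degree, so the test error remains an unbiased estimate for the selected model.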

# Exercise 2 : Multi-class classification

In this exercise, we will see how to apply logistic regression to multi-class classification (when the target can take more than 2 possible values)

In [None]:
# Load the data. 
segment = pd.read_csv("segment.csv", sep = ',')
segment

Questions : 
1. How many features are there in this dataset ? 
2. How many examples ? 
3. What is the distribution of the target values ? 

*Hint: You can find some information about this dataset in the file segment.dat*

In [None]:
# It is advised to "standardize" the features before applying a logistic regression model. Standardizing means
# transforming each feature so that its mean is 0 and its standard deviation is 1.
# This can be done with the following commands
X_segment = segment.iloc[:,0:19] # X_segment contains the features
y_segment = segment.y # y_segment contains the class
scaler = preprocessing.StandardScaler().fit(X_segment) # Standard Scaler is the command to standardize
X_segment = scaler.transform(X_segment)
X_segment = pd.DataFrame(X_segment)
# Now the features are standardized inside the table X_segment
data_segment = X_segment
data_segment['y'] = y_segment
# data_segment contains the features (standardized) and the class (column 'y')

Exercise: Check (for 1 feature of your choice) whether its mean is 0 (you can use np.mean)
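
A sketch of such a check on hypothetical random data is shown below (the names `X_std`, `feature_mean` and `feature_std` are illustrative). Note that `StandardScaler` divides by the population standard deviation, which is what `np.std` computes by default, whereas the pandas `std` method uses the sample standard deviation by default and will give a value slightly different from 1.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical data: 3 features with very different means and scales
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, -2.0, 100.0], scale=[2.0, 0.5, 30.0], size=(200, 3))

scaler = preprocessing.StandardScaler().fit(X)
X_std = scaler.transform(X)

feature_mean = np.mean(X_std[:, 0])   # expected to be 0 up to rounding error
feature_std = np.std(X_std[:, 0])     # expected to be 1 up to rounding error
print(feature_mean, feature_std)
```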

Exercise : Split the dataset into a train, a validation, and a test set

We will now fit a logistic regression model to the training set. It works as in exercise 1, but we now need to specify that the problem is multi-class. To do so, pass multi_class = 'multinomial' in the call to LogisticRegression:

In [None]:
mc_model = LogisticRegression(multi_class='multinomial').fit(train_data.iloc[:,0:19], train_data.y)

You should get a warning that says 'TOTAL NO. of ITERATIONS REACHED LIMIT'.
It means that the parameters may not have been learned well because the algorithm reached the maximum number of iterations without converging. It is possible to increase the maximum number of iterations:

In [None]:
mc_model = LogisticRegression(multi_class='multinomial', max_iter=1000).fit(train_data.iloc[:,0:19], train_data.y)
# Now it should converge

Question : How many parameters should this model have ?

Exercise : Check the number of parameters by looking at their values (same commands as in exercise 1)

Questions : 

1. How many outputs should this model have ? Check the output values of the model for the first example of the training set using the function predict_proba (as in exercise 1). 

2. What should be the decision of the model here ? Check it with the 'predict' function
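
The sketch below illustrates these points on a small hypothetical problem with 3 classes and 4 features (not the segment dataset). It relies on scikit-learn's default handling of more than two classes rather than on an explicit option, which is an assumption of this example. All names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy problem: 3 classes, 4 features, one cluster per class
rng = np.random.default_rng(0)
n_classes, n_features = 3, 4
centers = rng.normal(scale=3.0, size=(n_classes, n_features))
y = rng.integers(0, n_classes, size=150)
X = centers[y] + rng.normal(size=(150, n_features))

model = LogisticRegression(max_iter=1000).fit(X, y)

# One row of weights and one constant per class
n_params = model.coef_.size + model.intercept_.size
print(model.coef_.shape, model.intercept_.shape, n_params)

# One probability per class for the first example; they sum to 1
probs = model.predict_proba(X[0:1])
decision = model.classes_[np.argmax(probs[0])]   # class with the highest probability
print(probs, decision, model.predict(X[0:1])[0])
```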

Questions : What is the prediction accuracy of this model on the train set ? and on the validation set ?

Exercise : Using the same procedure as in exercise 1, try different polynomial features and select the best one. 
Then, estimate the generalization error of logistic regression on this dataset.
Don't go over degree 5, otherwise the model might take quite long to learn (a lot of features).