<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Data-preparation" data-toc-modified-id="Data-preparation-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data preparation</a></span><ul class="toc-item"><li><span><a href="#Load-data" data-toc-modified-id="Load-data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Load data</a></span></li><li><span><a href="#First-observations-on-the-categorical-variables" data-toc-modified-id="First-observations-on-the-categorical-variables-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>First observations on the categorical variables</a></span></li><li><span><a href="#Discard-NA-(Not-Assigned)-values" data-toc-modified-id="Discard-NA-(Not-Assigned)-values-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Discard NA (Not Assigned) values</a></span></li><li><span><a href="#Extracting-the-target-variable" data-toc-modified-id="Extracting-the-target-variable-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Extracting the target variable</a></span></li></ul></li><li><span><a href="#Data-Visualization-and-data-processing" data-toc-modified-id="Data-Visualization-and-data-processing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Visualization and data processing</a></span><ul class="toc-item"><li><span><a href="#Proportion-of-classes" data-toc-modified-id="Proportion-of-classes-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Proportion of classes</a></span></li><li><span><a href="#Short-summary-of-(scalar)-data" data-toc-modified-id="Short-summary-of-(scalar)-data-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Short summary of (scalar) data</a></span></li><li><span><a href="#Convert-categorical-variables-to-dummy-variables" data-toc-modified-id="Convert-categorical-variables-to-dummy-variables-3.3"><span 
class="toc-item-num">3.3&nbsp;&nbsp;</span>Convert categorical variables to dummy variables</a></span></li><li><span><a href="#Split-data-$\mapsto$-train/test" data-toc-modified-id="Split-data-$\mapsto$-train/test-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Split data $\mapsto$ train/test</a></span></li></ul></li><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Logistic Regression</a></span></li><li><span><a href="#Feature-Selection" data-toc-modified-id="Feature-Selection-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Feature Selection</a></span><ul class="toc-item"><li><span><a href="#Recursive-Feature-Elimination-(RFE)" data-toc-modified-id="Recursive-Feature-Elimination-(RFE)-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Recursive Feature Elimination (RFE)</a></span></li><li><span><a href="#Manual-Feature-Selection" data-toc-modified-id="Manual-Feature-Selection-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Manual Feature Selection</a></span></li></ul></li><li><span><a href="#(Bonus)-Using-a-linear-regression-model-for-binary-classification" data-toc-modified-id="(Bonus)-Using-a-linear-regression-model-for-binary-classification-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>(Bonus) Using a linear regression model for binary classification</a></span></li></ul></div>

In [1]:
import numpy as np
import matplotlib.pyplot as plt

# use pandas to play with dataset
import pandas as pd

# use seaborn to display data
import seaborn as sns

# use sklearn to practice ML
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Algorithms of the day
from sklearn.linear_model import LinearRegression as LinReg
from sklearn.linear_model import LogisticRegression as LogReg

from sklearn.feature_selection import RFE

# Alternative package for Statistical models, better than sklearn for feature selection
import statsmodels.api as sm

# Introduction

In this practical session we investigate a new classification problem. Our dataset comes from a census of the US population. The objective was to collect personal information about individuals and relate it to their professional situation.

Our goal is simple: we want to train a classification model that predicts whether an individual earns more or less than \$50K per year, given several variables.

The dataset is a bit more complicated than in the last session: this time it mixes categorical and continuous variables. To handle the categorical variables we need to create new features. After training our first models, we will therefore look at some feature selection methods, which help keep machine learning models interpretable.

# Data preparation

## Load data

In [36]:
data = pd.read_csv('census.csv', na_values='?')
data.drop(columns=['fnlwgt','native.country'], inplace=True)

print('Size of the data set', data.shape)
data.head()

Size of the data set (32561, 13)


Unnamed: 0,age,workclass,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,income
0,90,,HS-grad,9,Widowed,,Not-in-family,White,Female,0,4356,40,<=50K
1,82,Private,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,<=50K
2,66,,Some-college,10,Widowed,,Unmarried,Black,Female,0,4356,40,<=50K
3,54,Private,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,<=50K
4,41,Private,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,<=50K


## First observations on the categorical variables

Print the contingency table between two categorical variables of your choice.

In [4]:
## A contingency table ## 
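
One possible way to build such a table is `pd.crosstab`. The sketch below uses a tiny hand-made frame (`toy`, with hypothetical values) rather than the census data, so the counts are for illustration only.

```python
import pandas as pd

# Hypothetical toy data, not taken from census.csv
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male", "Female"],
    "income": ["<=50K", ">50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# Rows: first variable, columns: second variable, cells: number of individuals
table = pd.crosstab(toy["sex"], toy["income"])
print(table)

# normalize="index" turns each row into proportions that sum to 1
table_prop = pd.crosstab(toy["sex"], toy["income"], normalize="index")
print(table_prop)
```

On the real data, the same call should work with any two categorical columns of `data`.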

## Discard NA (Not Assigned) values

How many missing values are there in the dataset? How are they distributed across the features?
Remove all the rows containing missing values.

Hint: use the `isna` and `dropna` methods.

In [2]:
## Count and Drop the nan ## 
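
A minimal sketch of how `isna` and `dropna` can be combined, again on a hypothetical toy frame rather than on `data`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries
toy = pd.DataFrame({
    "age": [90, 82, 66, 54],
    "workclass": [np.nan, "Private", np.nan, "Private"],
    "occupation": [np.nan, "Exec-managerial", np.nan, "Machine-op-inspct"],
})

# Number of missing values per column, then in total
na_per_column = toy.isna().sum()
n_missing = int(na_per_column.sum())
print(na_per_column)
print("Total missing values:", n_missing)

# Keep only the rows without any missing value
toy_clean = toy.dropna()
print("Shape before/after:", toy.shape, toy_clean.shape)
```

Note that `dropna` keeps the original row labels, so the index of the cleaned frame is no longer contiguous.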

## Extracting the target variable

In [44]:
# `columns=` already selects the axis, so `axis=1` is not needed
X_data = data.drop(columns='income')

In [45]:
income_to_label = {inc: lab for lab, inc in enumerate(['<=50K', '>50K'])}
y_data = data.income.map(income_to_label)

# Data Visualization and data processing

## Proportion of classes

In [5]:
## Plot or print the proportion of samples from each class ##
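
One way to obtain the class proportions is `value_counts(normalize=True)`. The labels below are hypothetical and only mimic the 0/1 encoding of `y_data`.

```python
import pandas as pd

# Hypothetical labels: 0 stands for '<=50K', 1 for '>50K'
y_toy = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="income")

# Relative frequency of each class, ordered by label
proportions = y_toy.value_counts(normalize=True).sort_index()
print(proportions)
```

Calling `proportions.plot.bar()` would give the graphical version of the same information.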


## Short summary of (scalar) data

1) Use the `describe` method to print some statistics about the scalar features.

2) Use the `pairplot` function from TP1 to plot the distribution of each class for each pair of features.

3) Draw a boxplot of the distribution of each feature for each class.

## Convert categorical variables to dummy variables

Categorical variables can also play a role in the decision process!

Regression models are based on quantitative variables, hence we convert each categorical variable into binary variables, one per modality.

This may not be a "perfect" way to handle categorical variables, but with no further assumption (an order between modalities, for instance) there are not many other options. This method can be viewed as adapting the intercept of the regression to each combination of modalities.
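
To make the encoding concrete, here is a small sketch on a hypothetical two-column frame. It shows what `pd.get_dummies` with `drop_first=True` produces for a single categorical feature.

```python
import pandas as pd

# Hypothetical toy frame with one numerical and one categorical column
toy = pd.DataFrame({
    "age": [41, 54, 82],
    "sex": ["Female", "Male", "Female"],
})

# One binary column per modality; drop_first=True removes the first modality
# ("Female"), which becomes the reference level absorbed by the intercept
encoded = pd.get_dummies(toy, columns=["sex"], drop_first=True)
print(encoded)
```

Dropping one modality per variable avoids perfectly collinear columns: with an intercept in the model, the full set of dummies for a variable always sums to 1.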

Complete with the features you want to replace with dummy variables. Explain your choice.

In [19]:
categorical_features = []

In [14]:
X_data = pd.get_dummies(X_data, columns=categorical_features, drop_first=True)

Unnamed: 0,age,education.num,capital.gain,capital.loss,hours.per.week,workclass_Local-gov,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,...,relationship_Not-in-family,relationship_Other-relative,relationship_Own-child,relationship_Unmarried,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_Other,race_White,sex_Male
1,82,9,0,4356,18,0,1,0,0,0,...,1,0,0,0,0,0,0,0,1,0
3,54,4,0,3900,40,0,1,0,0,0,...,0,0,0,1,0,0,0,0,1,0
4,41,10,0,3900,40,0,1,0,0,0,...,0,0,1,0,0,0,0,0,1,0
5,34,9,0,3770,45,0,1,0,0,0,...,0,0,0,1,0,0,0,0,1,0


## Split data $\mapsto$ train/test

Think about the `shuffle` and `stratify` arguments!

In [15]:
test_frac = 1/3 # Fraction of the data set to consider as test set

X_train, X_test,\
y_train, y_test = train_test_split(X_data, y_data,
                                   test_size=test_frac,
                                   shuffle=True,
                                   stratify=y_data) # Respect the proportion of classes
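
The effect of `stratify` can be checked on a small example. The sketch below uses hypothetical imbalanced labels and a fixed `random_state` (an assumption, added only for reproducibility).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 75% of class 0, 25% of class 1
y_toy = np.array([0] * 90 + [1] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=1/3,
    shuffle=True,
    stratify=y_toy,    # keep the class proportions in both subsets
    random_state=0,    # assumption: fixed seed, only for reproducibility
)

print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test: ", y_te.mean())
```

Without `stratify`, the two shares would fluctuate from one split to the next, which matters most when one class is rare.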

# Logistic Regression

1) Use the `LogisticRegression` class of `sklearn.linear_model` to perform a logistic regression.

2) Use the `Logit` class of `statsmodels.api` to perform a logistic regression.

3) Use the `summary2` method of the fitted statsmodels model. Describe some of the statistics in the output and explain each column of the table. How can we use this information?
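
As a starting point for item 1), here is a minimal sketch of the scikit-learn workflow on synthetic data. The data-generating process (two Gaussian features and a logistic link) is an assumption made for the example and has nothing to do with the census data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): two informative features, logistic link
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
true_logits = 2.0 * X[:, 0] - 1.5 * X[:, 1]
y = (rng.random(1000) < 1 / (1 + np.exp(-true_logits))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

clf = LogisticRegression()   # default settings: L2 penalty with C=1.0
clf.fit(X_tr, y_tr)

auc_test = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print("Coefficients:", clf.coef_.ravel(), "Intercept:", clf.intercept_)
print("Confusion matrix:\n", confusion_matrix(y_te, clf.predict(X_te)))
print("Test AUC:", auc_test)
```

Keep in mind when comparing with item 2) that scikit-learn regularizes by default and fits an intercept on its own, whereas `sm.Logit` performs an unpenalized fit and needs an explicit constant column (for instance via `sm.add_constant`).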

# Feature Selection 

How can you describe a feature selection process? Why is it important?

## Recursive Feature Elimination (RFE)

We first use an automatic method for feature selection with the `RFE` class. Summarize the algorithm and explain its parameters. Run it and print the selected features.
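
The following sketch shows how `RFE` can be called. The data is synthetic (an assumption for the example): `x0` and `x1` drive the label while `x2` and `x3` are pure noise, so we know in advance which features should survive.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): x0 and x1 are informative, x2 and x3 are noise
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 4)), columns=["x0", "x1", "x2", "x3"])
true_logits = (2.0 * X["x0"] - 1.5 * X["x1"]).to_numpy()
y = (rng.random(1000) < 1 / (1 + np.exp(-true_logits))).astype(int)

# At each round RFE refits the estimator and drops the `step` features
# with the smallest coefficient magnitude, until 2 features remain
selector = RFE(estimator=LogisticRegression(), n_features_to_select=2, step=1)
selector.fit(X, y)

selected = list(X.columns[selector.support_])
print("Selected features:", selected)
print("Ranking (1 = kept):", dict(zip(X.columns, selector.ranking_)))
```

Because the ranking relies on coefficient magnitudes, features on very different scales are not directly comparable; standardizing them first is usually advisable.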

## Manual Feature Selection

Read this article until the paragraph "A case study in Python": https://www.datacamp.com/community/tutorials/feature-selection-python.

Describe a wrapper method, and the "Backward Elimination" method in particular. Try to implement this method with either all the variables or a subset of your choice (a subset speeds up the computation; in a real case we would use all the features).

Remark: depending on the time available, you can implement only the first round of Backward Elimination.

# (Bonus) Using a linear regression model for binary classification

Apply the methods of sections 4 and 5 again, replacing the logistic regression with a linear regression and an appropriate decision boundary.

Compare the results with the logistic model with 1) all features and 2) an appropriate feature selection algorithm. Do you find the same important features?
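
As a hint for the decision boundary, the sketch below regresses the 0/1 labels directly and thresholds the prediction at 0.5. Both the threshold and the synthetic data are assumptions chosen for the example, not the only reasonable choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic data (assumption): two informative features, logistic link
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
true_logits = 2.0 * X[:, 0] - 1.5 * X[:, 1]
y = (rng.random(1000) < 1 / (1 + np.exp(-true_logits))).astype(int)

# Regress the 0/1 labels, then threshold the continuous output.
# The 0.5 cut-off is an assumption: the midpoint between the two label values.
lin = LinearRegression().fit(X, y)
y_pred_lin = (lin.predict(X) >= 0.5).astype(int)

log = LogisticRegression().fit(X, y)
y_pred_log = log.predict(X)

acc_lin = (y_pred_lin == y).mean()
acc_log = (y_pred_log == y).mean()
agreement = (y_pred_lin == y_pred_log).mean()
print(f"Accuracy linear: {acc_lin:.3f}, logistic: {acc_log:.3f}, "
      f"agreement: {agreement:.3f}")
```

The linear model's outputs are not probabilities (they can fall outside [0, 1]), so only the thresholded labels are comparable with those of the logistic model.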