# Classification

AI Black Belt - Yellow (May 2019).

---

## Census data

In this notebook, we consider Census data that gathers socio-demographic information about individuals. From the features describing a person, we will build a classifier that predicts whether they earn more than $50k a year.

In [None]:
import pandas as pd 
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df = pd.read_csv("data/adult.csv", index_col=0)
df.head()

In [None]:
df.describe()

<div class="alert alert-success">

<b>EXERCISE</b>:

Experiment with visualizing the data. Can you find out which features influence the income the most?

</div>

In [None]:
# %load solutions/day2-02-01.py

<div class="alert alert-success">

<b>EXERCISE</b>:

 <ul>
    <li>Make a dataframe <code>X</code> that contains all but the <code>income</code> column.</li>
    <li>Make a series <code>y</code> that contains the <code>income</code> column only.</li>
</ul>
</div>

In [None]:
# %load solutions/day2-02-02.py

## Preprocessing

### Missing values

Some columns have missing values, encoded as <code>'?'</code> in the original data.

In [None]:
X["native-country"].value_counts()[:10]

For convenience, we will replace them with NaNs.

In [None]:
X = X.replace([" ?", "?"], np.nan)

In [None]:
X.isna().sum()

In [None]:
X.info()

We will now fill in the missing values using a simple imputation strategy: each missing value is replaced with the most frequent value of its column.
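
To make the strategy concrete, here is a minimal sketch on a made-up toy column (the values and the names `toy` and `toy_imp` are illustrative assumptions, not part of the Census data). The most frequent value is learned during fitting and then substituted for every NaN.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column (made up for illustration) with two missing entries.
toy = pd.DataFrame({"workclass": ["Private", np.nan, "Private", "State-gov", np.nan]})

toy_imp = SimpleImputer(strategy="most_frequent")
filled = toy_imp.fit_transform(toy)

# The value learned for the column, then the imputed column itself.
print(toy_imp.statistics_)
print(filled.ravel())
```

Note that the imputer ignores any relationship between columns: every missing entry in a column receives the same value.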

In [None]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy="most_frequent")

In [None]:
# Impute the three columns that contain missing values in a single call.
missing_cols = ["workclass", "occupation", "native-country"]
X[missing_cols] = imp.fit_transform(X[missing_cols])

In [None]:
X.isna().sum()

<div class="alert alert-success">

<b>EXERCISE</b>: (optional)

Can you think of a more elaborate imputation strategy?
</div>

### Converting categorical variables

As shown below, not all columns are numerical. Those that are not must be converted before they can be used by a scikit-learn machine learning algorithm.

In [None]:
categories = X.dtypes == object
categories

Categorical variables can be encoded into numerical values by associating a unique number to each unique value.

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder
tf = make_column_transformer((OrdinalEncoder(), categories), remainder="passthrough")
X_new = tf.fit_transform(X)

In [None]:
X.shape

In [None]:
X_new.shape

In [None]:
print(X.iloc[0])
print(X_new[0])

This transformation implicitly imposes an arbitrary ordering on the values. Depending on the downstream machine learning algorithm, this might lead to good or bad results.
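
The following sketch illustrates the point on a made-up toy column (the values and the names `toy` and `enc` are assumptions for illustration). `OrdinalEncoder` sorts the unique values and numbers them in that order, so the resulting codes follow alphabetical order rather than any meaningful ranking.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy education levels (made up for illustration).
toy = np.array([["Masters"], ["Bachelors"], ["HS-grad"], ["Doctorate"]], dtype=object)

enc = OrdinalEncoder()
codes = enc.fit_transform(toy)

# Categories are sorted alphabetically; each value is replaced by its index.
print(enc.categories_[0])
print(codes.ravel())
```

Here `HS-grad` receives a larger code than `Doctorate`, which a linear model would read as "more" of something, even though the order is purely alphabetical.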

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_new, y, random_state=0)

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver="lbfgs", C=0.1)
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

An alternative transformation is to encode categorical variables as one-hot binary vectors.
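
As a minimal sketch on a made-up toy column (the values and the names `toy` and `ohe` are assumptions for illustration), each unique value gets its own binary column, and every row has exactly one 1. The sparse result is converted with `.toarray()` only to make it easy to print.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy work classes (made up for illustration).
toy = np.array([["Private"], ["State-gov"], ["Private"], ["Self-emp"]], dtype=object)

ohe = OneHotEncoder()
onehot = ohe.fit_transform(toy).toarray()

# One column per category, in sorted order.
print(ohe.categories_[0])
print(onehot)
```

No ordering between values is implied, at the cost of one extra column per category.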

In [None]:
from sklearn.preprocessing import OneHotEncoder
# scikit-learn >= 1.2 names this parameter `sparse_output`; older releases use `sparse`.
tf = make_column_transformer((OneHotEncoder(sparse_output=False), categories), remainder="passthrough")
X_new = tf.fit_transform(X)

In [None]:
X_new.shape

In [None]:
print(X_new[0])

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_new, y, random_state=0)

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver="lbfgs", C=0.1)
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

### Scaling

Some algorithms, such as linear models or KNNs, are sensitive to the scale of the features. It is often critical to bring all features to a comparable scale, for example by standardizing each one to zero mean and unit variance.
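
The sketch below shows one common form of rescaling, standardization, on a made-up toy array (the numbers and the names `toy` and `scaler` are assumptions for illustration). Two columns with very different magnitudes end up with zero mean and unit standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (made up for illustration).
toy = np.array([[25.0, 0.0],
                [40.0, 5000.0],
                [55.0, 10000.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# After standardization, each column has mean 0 and standard deviation 1.
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

After scaling, neither feature dominates a distance computation or a regularization penalty simply because of its units.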

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
# scikit-learn >= 1.2 names this parameter `sparse_output`; older releases use `sparse`.
tf = make_column_transformer((OneHotEncoder(sparse_output=False), categories),
                             (StandardScaler(), ~categories),
                             remainder="passthrough")
X_new = tf.fit_transform(X)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_new, y, random_state=0)

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver="lbfgs", C=0.1)
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

## Model comparison

<div class="alert alert-success">

<b>EXERCISE</b>:

Compare the performance of <code>LogisticRegression</code>, <code>KNeighborsClassifier</code>, <code>DecisionTreeClassifier</code> and <code>GaussianNB</code>.

</div>

<div class="alert alert-success">

<b>EXERCISE</b>:

Observe how preprocessing might change your results above. For example, switch to ordinal encoding of the categorical variables.
</div>