## Working with Numerical and Categorical Variables
Having looked separately at the steps required to preprocess numerical and categorical variables, we now work with both variable types in this notebook.


In [3]:
import pandas as pd

adult_census = pd.read_csv("adult_census.csv")
# Drop "education-num", which duplicates "education", and "fnlwgt", a sampling weight that we do not use as a feature.
duplicated_columns = ["education-num", "fnlwgt"]
adult_census = adult_census.drop(columns=duplicated_columns)

# Identify the target column and separate it from the input data.
target_name = "class"
target = adult_census[target_name]

data = adult_census.drop(columns=target_name)

### Separate Categorical and Numerical variables
We use the `make_column_selector` helper to select columns by data type: columns with the `object` dtype are treated as categorical, and all other columns as numerical.

In [4]:
from sklearn.compose import make_column_selector as selector

numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

numerical_columns = numerical_columns_selector(data)
categorical_columns = categorical_columns_selector(data)
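To make the selector's behaviour concrete, here is a minimal sketch on a made-up three-column frame (the `toy` DataFrame and its values are assumptions for illustration, not part of the census data). Calling a selector on a DataFrame returns a list of column names.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy frame: two numerical columns and one object-dtype column
toy = pd.DataFrame({
    "age": [25, 38],
    "hours-per-week": [40, 50],
    "workclass": pd.Series(["Private", "State-gov"], dtype=object),
})

# Each selector is a callable that takes a DataFrame and returns column names
toy_numerical = selector(dtype_exclude=object)(toy)
toy_categorical = selector(dtype_include=object)(toy)

print(toy_numerical)
print(toy_categorical)
```

The two lists can then be passed wherever a list of column names is expected.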

### `ColumnTransformer`: Send specific columns to specific preprocessors

In [5]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()

In [6]:
# At this point we create the transformer and associate each of these processors with their respective columns

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns),
    ('standard-scaler', numerical_preprocessor, numerical_columns)])
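As a rough illustration of what such a transformer produces, the sketch below applies the same two preprocessors to a small made-up frame (the `toy` data and the `toy_preprocessor` name are assumptions for illustration). Each categorical column expands into one column per category, and each numerical column is rescaled in place.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one categorical and one numerical column
toy = pd.DataFrame({
    "workclass": pd.Series(
        ["Private", "State-gov", "Private", "Self-emp"], dtype=object),
    "age": [25.0, 38.0, 52.0, 41.0],
})

toy_preprocessor = ColumnTransformer([
    ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
    ("standard-scaler", StandardScaler(), ["age"]),
])

transformed = toy_preprocessor.fit_transform(toy)

# The output can be a SciPy sparse matrix or a NumPy array; handle both
dense = (transformed.toarray() if hasattr(transformed, "toarray")
         else np.asarray(transformed))

# Three one-hot columns (one per category) followed by one scaled column
print(dense.shape)
print(dense)
```

The output columns follow the order in which the transformers are listed, so the one-hot columns come first here.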

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))

In [8]:
# To display an interactive diagram we can call the following command
from sklearn import set_config
set_config(display='diagram')
model

In [9]:
# We split our data into train and test sets
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42)

In [10]:
_ = model.fit(data_train, target_train)

In [11]:
data_test.head()

| | age | workclass | education | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7762 | 56 | Private | HS-grad | Divorced | Other-service | Unmarried | White | Female | 0 | 0 | 40 | United-States |
| 23881 | 25 | Private | HS-grad | Married-civ-spouse | Transport-moving | Own-child | Other | Male | 0 | 0 | 40 | United-States |
| 30507 | 43 | Private | Bachelors | Divorced | Prof-specialty | Not-in-family | White | Female | 14344 | 0 | 40 | United-States |
| 28911 | 32 | Private | HS-grad | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 40 | United-States |
| 19484 | 39 | Private | Bachelors | Married-civ-spouse | Sales | Wife | White | Female | 0 | 0 | 30 | United-States |


In [12]:
model.predict(data_test)[:5]

array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' >50K'], dtype=object)

In [13]:
target_test[:5]

7762      <=50K
23881     <=50K
30507      >50K
28911     <=50K
19484     <=50K
Name: class, dtype: object

In [14]:
model.score(data_test, target_test)

0.8575055278028008
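For a classifier, `score` returns the mean accuracy, that is, the fraction of predictions that match the true labels. A minimal sketch on made-up data (the arrays `X` and `y` and the name `clf` are assumptions for illustration) checks this by hand:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature dataset with the same label strings as the census data
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([" <=50K", " <=50K", " >50K", " <=50K", " >50K", " >50K"])

clf = LogisticRegression().fit(X, y)
predictions = clf.predict(X)

# Accuracy computed by hand: share of predictions equal to the true label
manual_accuracy = np.mean(predictions == y)

print(manual_accuracy, clf.score(X, y))
```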

### Model Evaluation with cross-validation
Typically, we expect a predictive model to be evaluated by cross-validation rather than on a single train-test split. Here, we continue to evaluate the performance of our model with 5-fold cross-validation.

In [17]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data, target, cv=5)
cv_results

{'fit_time': array([1.88670564, 1.9923079 , 1.0229125 , 1.25377488, 1.87571383]),
 'score_time': array([0.05885553, 0.04751778, 0.04050756, 0.0773809 , 0.07230449]),
 'test_score': array([0.8512642 , 0.8498311 , 0.84756347, 0.85227273, 0.85513923])}

In [18]:
scores = cv_results["test_score"]
print("The mean cross-validation accuracy is: "
      f"{scores.mean():.3f} +/- {scores.std():.3f}")

The mean cross-validation accuracy is: 0.851 +/- 0.003
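To see what `cross_validate` does with `cv=5`, the sketch below repeats the evaluation by hand on synthetic data (the dataset from `make_classification` and the name `toy_model` are assumptions for illustration). For a classifier and an integer `cv`, scikit-learn splits the data into stratified folds, then fits on four folds and scores on the remaining one, five times.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the census data (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
toy_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

# What cross_validate does for us in one call
toy_cv_results = cross_validate(toy_model, X, y, cv=5)

# The same evaluation written out fold by fold
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    toy_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(toy_model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

print(toy_cv_results["test_score"])
print(manual_scores)
```

Because the preprocessing lives inside the pipeline, the scaler is refit on the training folds only, so no information from the held-out fold leaks into the fit.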


### Fit a more powerful model: `Gradient Boosting Trees`
Linear models have several advantages: they are cheap to train, small to deploy, and fast to produce results. However, these models are often limited, and it is useful to check whether tree-based models can lead to higher predictive performance.

In [20]:
# Only needed for scikit-learn < 1.0; later releases no longer require this import
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import OrdinalEncoder

categorical_preprocessor = OrdinalEncoder(handle_unknown="use_encoded_value",
                                          unknown_value=-1)

preprocessor = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns)],
    remainder="passthrough")

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())
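The `handle_unknown="use_encoded_value"` option matters when the test data contains a category that was absent during training. A minimal sketch with made-up category values (the arrays below and the `"Never-seen"` label are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on two known categories; they are encoded in sorted order
seen = np.array([["Private"], ["State-gov"], ["Private"]], dtype=object)
encoder.fit(seen)

# Transform one known category and one the encoder has never seen
unseen = np.array([["State-gov"], ["Never-seen"]], dtype=object)
codes = encoder.transform(unseen)

print(encoder.categories_)
print(codes)  # the unknown category is mapped to -1
```

Tree-based models only need the codes to be distinct integers, which is why an ordinal encoding is enough here, whereas the linear model above used one-hot encoding.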

#### Now we can evaluate the predictive performance of our model

In [22]:
%%time
_ = model.fit(data_train, target_train)

Wall time: 1.83 s


In [23]:
model.score(data_test, target_test)

0.8794529522561625

We can observe that the gradient boosting model reaches a higher test accuracy
(about 0.879 versus 0.858 for logistic regression on the same split). This is
often what we observe when the dataset has a large number of samples and a
limited number of informative features (e.g. fewer than 1000) with a mix of
numerical and categorical variables.

This explains why gradient boosting machines are very popular among
data science practitioners who work with tabular data.