# Using numerical and categorical variables together

As mentioned, there is no reason to use numerical and categorical data separately, so let's combine them.

In [1]:
import pandas as pd

adult_census = pd.read_csv('../datasets/adult-census.csv')
adult_census = adult_census.drop(columns='education-num')

target_name = 'class'
target = adult_census[target_name]

data = adult_census.drop(columns=target_name)

Now let's split the columns into two groups, categorical and numerical:

In [2]:
from sklearn.compose import make_column_selector as selector

numerical_selector = selector(dtype_exclude=object)
categorical_selector = selector(dtype_include=object)

numerical_columns = numerical_selector(data)
categorical_columns = categorical_selector(data)

numerical_columns, categorical_columns

(['age', 'capital-gain', 'capital-loss', 'hours-per-week'],
 ['workclass',
  'education',
  'marital-status',
  'occupation',
  'relationship',
  'race',
  'sex',
  'native-country'])

### Cautionary note:

Selecting by object dtype (strings) is not always sufficient or appropriate for splitting the data into numerical and categorical columns.

Q: Can you think of:
- a situation where strings are used for numerical features?
- an example where numbers are used for categorical data?

A: many correct answers are possible, e.g.
- Dates formatted as strings: January 4th
- salary scale (?), online questionnaires/phone decision trees, education-num from the census dataset, lane in a western blot/genomic sequencing, etc.

ALWAYS INSPECT/UNDERSTAND YOUR DATA BEFORE DOING ANYTHING WITH IT
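
To make the pitfall concrete, here is a minimal sketch on invented toy data (the column names `postal_code` and `income` and their values are hypothetical, not part of the census dataset). It shows a dtype-based selector putting both columns in the wrong group, and one possible manual fix.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    # stored as integers, but really a category
    "postal_code": [75001, 69002, 13001],
    # stored as strings (object dtype), but really a quantity
    "income": pd.Series(["1,200", "3,450", "980"], dtype=object),
})

# Same dtype-based selection as in the cell above
numerical_guess = selector(dtype_exclude=object)(toy)
categorical_guess = selector(dtype_include=object)(toy)

print(numerical_guess)    # columns the selector treats as numerical
print(categorical_guess)  # columns the selector treats as categorical

# One possible fix: parse the strings into numbers,
# then list the columns of each kind by hand
toy["income"] = toy["income"].str.replace(",", "", regex=False).astype(float)
toy_numerical_columns = ["income"]
toy_categorical_columns = ["postal_code"]
```

The selector only sees how a column is stored, not what it means, which is why inspecting the data first matters.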

In [3]:
# Use ColumnTransformer to handle both types of data:

from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_preprocessor = OneHotEncoder(handle_unknown='ignore')
numerical_preprocessor = StandardScaler()

from sklearn.compose import ColumnTransformer

# be mindful/careful about ALL the brackets here!!
preprocessor = ColumnTransformer([
    ('onehot_encoder', categorical_preprocessor, categorical_columns),
    ('standard_scaler', numerical_preprocessor, numerical_columns)
])

This is what our preprocessor does:

Show: https://inria.github.io/scikit-learn-mooc/_images/api_diagram-columntransformer.svg

Explain:
- 1 split the columns into subsets according to the column lists provided
- 2 transform each subset independently with its own transformer
- 3 concatenate the transformed subsets back into a single dataset
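
The three steps above can be sketched by hand and compared with what `ColumnTransformer` returns. This is only an illustration on invented toy data (the `toy` frame and its values are assumptions); `sparse_threshold=0` is set here just to force a dense output so the two arrays can be compared directly.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the census table
toy = pd.DataFrame({
    "age": [25, 40, 58, 33],
    "sex": ["Male", "Female", "Female", "Male"],
})
toy_categorical = ["sex"]
toy_numerical = ["age"]

# 1. split the columns into two subsets
categorical_part = toy[toy_categorical]
numerical_part = toy[toy_numerical]

# 2. transform each subset independently
encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(categorical_part).toarray()
scaled = StandardScaler().fit_transform(numerical_part)

# 3. concatenate the results, in the same order as the transformer list below
manual = np.hstack([encoded, scaled])

# The same three steps, done by ColumnTransformer in one call
toy_preprocessor = ColumnTransformer([
    ("onehot_encoder", OneHotEncoder(handle_unknown="ignore"), toy_categorical),
    ("standard_scaler", StandardScaler(), toy_numerical),
], sparse_threshold=0)  # always return a dense array
automatic = toy_preprocessor.fit_transform(toy)

print(manual.shape)
print(np.allclose(manual, automatic))
```

Note that the output columns follow the order of the transformer list, not the column order of the input table.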

Importantly, the preprocessor can be used as a step in make_pipeline:

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    preprocessor,
    LogisticRegression(max_iter=500)
)

model

In [5]:
# We will use train_test_split here,
# but this could also be done using cross-validation, as was done before


from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, train_size=0.75, random_state=22)

In [6]:
# We will once again use fit, predict, and score separately.

_ = model.fit(data_train, target_train)

In [7]:
# the data can be sent straight to the pipeline, because the model handles the preprocessing for us when using predict

pred = model.predict(data_test)

# let's see the prediction
pred[:10]


array([' <=50K', ' >50K', ' >50K', ' <=50K', ' <=50K', ' <=50K', ' <=50K',
       ' <=50K', ' <=50K', ' >50K'], dtype=object)

In [8]:
# and the true labels from the test set
target_test[:10]


35057     <=50K
7449       >50K
22289      >50K
11354     <=50K
30394     <=50K
41467     <=50K
43462     <=50K
10231     <=50K
14557     <=50K
19433      >50K
Name: class, dtype: object

In [9]:
# how well did we do on the first 10 data points?
(pred[:10] == target_test[:10])



35057    True
7449     True
22289    True
11354    True
30394    True
41467    True
43462    True
10231    True
14557    True
19433    True
Name: class, dtype: bool

In [10]:
model.score(data_test, target_test)

0.8561133404307592
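
For a classifier, `score` returns the mean accuracy, that is, the fraction of predictions that match the true labels. The sketch below reproduces that computation by hand on a few invented labels (the two arrays are hypothetical, not taken from the run above) and compares it with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, invented for illustration
true_labels = np.array([" <=50K", " >50K", " >50K", " <=50K", " <=50K"])
predicted_labels = np.array([" <=50K", " >50K", " <=50K", " <=50K", " >50K"])

# Element-wise comparison, as in the check on the first 10 predictions
matches = predicted_labels == true_labels

# The mean of a boolean array is the fraction of True values
manual_accuracy = matches.mean()
library_accuracy = accuracy_score(true_labels, predicted_labels)

print(manual_accuracy, library_accuracy)
```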

## Now with cross validation

In [11]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data, target, cv=5)
cv_results

{'fit_time': array([0.89855623, 1.0266614 , 0.65395331, 0.80641556, 0.70578289]),
 'score_time': array([0.03640103, 0.02162123, 0.02391267, 0.0267849 , 0.02091599]),
 'test_score': array([0.8512642 , 0.8498311 , 0.84756347, 0.8523751 , 0.85534398])}

In [12]:
scores = cv_results['test_score']

print(f'the mean score using ALL input data is {scores.mean():.3f} +/- {scores.std():.3f} (stdev)')

the mean score using ALL input data is 0.851 +/- 0.003 (stdev)


## Let's try a more powerful model

Linear models are nice because they are usually cheap to train, small to deploy, fast to predict and give a good baseline.

Now, let's try: HistGradientBoostingClassifier

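This model does not require scaling of numerical features, and ordinal encoding is the preferred way to encode the categorical features (even when the categories have no natural order, where it would normally be "inappropriate").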
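
As a minimal sketch of what ordinal encoding does, the example below fits an `OrdinalEncoder` on invented training values and then transforms a category it has never seen. The two small frames are hypothetical; the encoder settings match the ones used in the cell that follows.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data, invented for illustration
toy_train = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})
toy_test = pd.DataFrame({"workclass": ["State-gov", "Never-worked"]})

toy_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
toy_encoder.fit(toy_train)

# Each category seen during fit gets an integer code (its index in categories_)
print(toy_encoder.categories_)

# 'Never-worked' was not seen during fit, so it is encoded as unknown_value
encoded_test = toy_encoder.transform(toy_test)
print(encoded_test)
```

Without `handle_unknown='use_encoded_value'`, transforming a category that was absent during `fit` raises an error, which can easily happen with rare categories after a train/test split.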

In [16]:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import OrdinalEncoder

categorical_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

# pay close attention to brackets once again!
transformer = ColumnTransformer([
    ('categorical', categorical_encoder, categorical_columns)],
    remainder='passthrough'  # this handles all other columns (passes them through unchanged)
)

model = make_pipeline(transformer, HistGradientBoostingClassifier())



In [17]:
# Visualizing the model:

model

In [14]:

_ = model.fit(data_train, target_train)

In [15]:
model.score(data_test, target_test)

0.8826467938743756

The gradient boosting model reaches a test accuracy of 0.883, compared with 0.856 for the logistic regression on the same train/test split.

This model tends to work well when there is:
- a large number of samples
- a limited number of (informative) features (<100)
- a mix of categorical and numerical features


### Now let's quickly do Quiz M1.03 in the collaborative document:

https://inria.github.io/scikit-learn-mooc/predictive_modeling_pipeline/03_categorical_pipeline_quiz_m1_03.html


## Then do the wrap-up quiz in breakout rooms

Ask for volunteers to show their answer?