# Encoding: from a dataframe to a numerical matrix for machine learning

This example shows how to transform a dataframe with mixed column types into a numerical matrix suited to machine learning. As a use case, we predict wages using the employee salaries dataset.

## A simple prediction pipeline

Let’s first retrieve the dataset:

In [2]:
from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()

We denote by X the employees' characteristics (our input data) and by y the annual salary (our target column):

In [4]:
X = dataset.X
y = dataset.y

X

|  | gender | department | department_name | division | assignment_category | employee_position_title | date_first_hired | year_first_hired |
|---|---|---|---|---|---|---|---|---|
| 0 | F | POL | Department of Police | MSB Information Mgmt and Tech Division Records... | Fulltime-Regular | Office Services Coordinator | 09/22/1986 | 1986 |
| 1 | M | POL | Department of Police | ISB Major Crimes Division Fugitive Section | Fulltime-Regular | Master Police Officer | 09/12/1988 | 1988 |
| 2 | F | HHS | Department of Health and Human Services | Adult Protective and Case Management Services | Fulltime-Regular | Social Worker IV | 11/19/1989 | 1989 |
| 3 | M | COR | Correction and Rehabilitation | PRRS Facility and Security | Fulltime-Regular | Resident Supervisor II | 05/05/2014 | 2014 |
| 4 | M | HCA | Department of Housing and Community Affairs | Affordable Housing Programs | Fulltime-Regular | Planning Specialist III | 03/05/2007 | 2007 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9223 | F | HHS | Department of Health and Human Services | School Based Health Centers | Fulltime-Regular | Community Health Nurse II | 11/03/2015 | 2015 |
| 9224 | F | FRS | Fire and Rescue Services | Human Resources Division | Fulltime-Regular | Fire/Rescue Division Chief | 11/28/1988 | 1988 |
| 9225 | M | HHS | Department of Health and Human Services | Child and Adolescent Mental Health Clinic Serv... | Parttime-Regular | Medical Doctor IV - Psychiatrist | 04/30/2001 | 2001 |
| 9226 | M | CCL | County Council | Council Central Staff | Fulltime-Regular | Manager II | 09/05/2006 | 2006 |


We observe diverse columns in the dataset:

- binary (``'gender'``),
- numerical (``'year_first_hired'``),
- categorical (``'department'``, ``'department_name'``, ``'assignment_category'``),
- datetime (``'date_first_hired'``),
- dirty categorical (``'employee_position_title'``, ``'division'``).
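A quick way to get a feel for these column types is to look at each column's dtype and number of distinct values. The sketch below uses only pandas and a handful of rows copied from the table above, so the counts are illustrative and much smaller than on the full dataset. The names `sample` and `profile` are ours, not part of skrub.

```python
import pandas as pd

# A few rows copied from the table shown above (not the full dataset).
sample = pd.DataFrame(
    {
        "gender": ["F", "M", "F", "M", "M"],
        "department": ["POL", "POL", "HHS", "COR", "HCA"],
        "assignment_category": ["Fulltime-Regular"] * 5,
        "employee_position_title": [
            "Office Services Coordinator",
            "Master Police Officer",
            "Social Worker IV",
            "Resident Supervisor II",
            "Planning Specialist III",
        ],
        "date_first_hired": [
            "09/22/1986",
            "09/12/1988",
            "11/19/1989",
            "05/05/2014",
            "03/05/2007",
        ],
        "year_first_hired": [1986, 1988, 1989, 2014, 2007],
    }
)

# Summarize each column: storage dtype and number of distinct values.
profile = pd.DataFrame(
    {"dtype": sample.dtypes.astype(str), "n_unique": sample.nunique()}
)
print(profile)
```

Note that ``'date_first_hired'`` is stored as text here, so the dtype alone does not reveal that it holds dates. This is one reason why choosing an encoder per column by hand is tedious.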

Using skrub's ``TableVectorizer``, we can already build a machine-learning pipeline and train it:

In [5]:
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer

pipeline = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())
pipeline.fit(X, y)

What just happened here?

We passed our dataframe directly to the ``TableVectorizer``, which converted it into a numerical matrix that the scikit-learn model can consume.

Let's explore the internals of our encoder, the ``TableVectorizer``:

In [6]:
from pprint import pprint

# Recover the TableVectorizer from the Pipeline
tv = pipeline.named_steps["tablevectorizer"]

pprint(tv.transformers_)

[('numeric', 'passthrough', ['year_first_hired']),
 ('datetime', DatetimeEncoder(), ['date_first_hired']),
 ('low_card_cat',
  OneHotEncoder(drop='if_binary', handle_unknown='infrequent_if_exist'),
  ['gender', 'department', 'department_name', 'assignment_category']),
 ('high_card_cat',
  GapEncoder(n_components=30),
  ['division', 'employee_position_title'])]


We observe that it has automatically assigned an appropriate encoder to each group of columns:

- The ``OneHotEncoder`` for low-cardinality string variables: the columns ``'gender'``, ``'department'``, ``'department_name'`` and ``'assignment_category'``.

In [7]:
tv.named_transformers_["low_card_cat"].get_feature_names_out()

array(['gender_F', 'gender_M', 'gender_nan', 'department_BOA',
       'department_BOE', 'department_CAT', 'department_CCL',
       'department_CEC', 'department_CEX', 'department_COR',
       'department_CUS', 'department_DEP', 'department_DGS',
       'department_DHS', 'department_DLC', 'department_DOT',
       'department_DPS', 'department_DTS', 'department_ECM',
       'department_FIN', 'department_FRS', 'department_HCA',
       'department_HHS', 'department_HRC', 'department_IGR',
       'department_LIB', 'department_MPB', 'department_NDA',
       'department_OAG', 'department_OCP', 'department_OHR',
       'department_OIG', 'department_OLO', 'department_OMB',
       'department_PIO', 'department_POL', 'department_PRO',
       'department_REC', 'department_SHF', 'department_ZAH',
       'department_name_Board of Appeals Department',
       'department_name_Board of Elections',
       'department_name_Community Engagement Cluster',
       'department_name_Community Use of Public Fac

- The ``GapEncoder`` for high-cardinality string columns, ``'employee_position_title'`` and ``'division'``. The ``GapEncoder`` is a powerful encoder that can handle dirty categorical columns.
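
As a rough intuition for why substring-based encoders cope well with dirty categories, the sketch below compares job titles by the character 3-grams they share. This is not the ``GapEncoder``'s actual algorithm, only a simplified stand-in written in plain Python. The helper names are ours, and the title "Police Officer III" is a hypothetical example that does not appear in the rows shown above.

```python
def char_ngrams(text, n=3):
    """Return the set of lowercase character n-grams in ``text``."""
    text = text.lower()
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


reference = "Master Police Officer"
# "Police Officer III" is a hypothetical title used only for illustration.
others = ["Police Officer III", "Social Worker IV"]

for other in others:
    print(other, round(ngram_similarity(reference, other), 2))
```

Titles that are distinct strings but overlap heavily in their substrings get a high similarity, whereas one-hot encoding would treat them as unrelated categories.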