# Representing Data and Engineering Features

Representing your data in the right way can have a bigger influence on the performance of a supervised model than the exact parameters you choose.


##### Index:
* Categorical Variables
    * One-Hot-Encoding (Dummy Variables)
    * Numbers Can Encode Categoricals
* OneHotEncoder and ColumnTransformer: Categorical Variables with scikit-learn

In [9]:
import pandas as pd
import numpy as np
import mglearn
import IPython.display
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Categorical Variables

The adult income dataset will be used to predict whether a person's income is above or below 50k.


## One-Hot-Encoding (Dummy Variables)

One-hot encoding is the most common way to represent categorical variables. The idea is to add new features that consist of 0s and 1s, where each new feature represents one of the categories.
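To make the idea concrete, here is a minimal hand-rolled sketch using NumPy only. The toy column and its values are made up for illustration and are not taken from the adult dataset; in practice pandas or scikit-learn would do this step.

```python
import numpy as np

# Hypothetical toy column; the values are made up for illustration
workclass = np.array(["Private", "State-gov", "Private", "Self-emp"])

# Sorted unique categories; each one becomes a new 0/1 feature
categories = np.unique(workclass)

# Compare every row against every category to build the 0/1 matrix
one_hot = (workclass[:, None] == categories[None, :]).astype(int)

print(categories)
print(one_hot)
```

Each row contains exactly one 1, in the column of the category that row belongs to.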

In [6]:
import os
# The file has no headers naming the columns, so we pass header=None
# and provide the column names explicitly in "names"
adult_path = os.path.join(mglearn.datasets.DATA_PATH, "adult.data")
data = pd.read_csv(
    adult_path, header=None, index_col=False,
    names=['age', 'workclass', 'fnlwgt', 'education',  'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'gender',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'income'])
# For illustration purposes, we only select some of the columns
data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week',
             'occupation', 'income']]
# IPython.display allows nice output formatting within the Jupyter notebook
display(data.head())

| | age | workclass | education | gender | hours-per-week | occupation | income |
|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | Male | 40 | Adm-clerical | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | Male | 13 | Exec-managerial | <=50K |
| 2 | 38 | Private | HS-grad | Male | 40 | Handlers-cleaners | <=50K |
| 3 | 53 | Private | 11th | Male | 40 | Handlers-cleaners | <=50K |
| 4 | 28 | Private | Bachelors | Female | 40 | Prof-specialty | <=50K |


We must check whether a column actually contains meaningful categorical data. If the values come from user input, they may need preprocessing first (for example, mapping "man" and "male" to the same category).
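A minimal sketch of that kind of cleanup is shown here, assuming free-text answers stored in a pandas Series. The values and the synonym mapping are hypothetical and would have to be adapted to the real data.

```python
import pandas as pd

# Hypothetical free-text answers; not taken from the adult dataset
raw = pd.Series([" Male", "male", "MAN", "Female ", "woman"])

# Strip whitespace and lower-case first, then map synonyms to one label
synonyms = {"man": "male", "woman": "female"}
cleaned = raw.str.strip().str.lower().replace(synonyms)

print(cleaned.value_counts())
```

After this step only two distinct values remain, so the column can be encoded without creating spurious categories.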

We can check the contents of a column as follows. In this case value_counts() tells us that there are exactly two values for gender in the dataset: Male and Female.

In a real application, all the columns should be checked this way before encoding; here that step is skipped.

In [7]:
print(data["gender"].value_counts())

 Male      21790
 Female    10771
Name: gender, dtype: int64
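Checking every column could look roughly like the sketch below. It uses a small made-up DataFrame as a stand-in for the adult data and assumes that every non-numeric column is a candidate categorical column.

```python
import pandas as pd

# Hypothetical stand-in for the adult data
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
    "gender": ["Male", "Male", "Female"],
})

# Treat every non-numeric column as a candidate categorical column
categorical_columns = toy.select_dtypes(exclude="number").columns

# Print the distinct values and their counts for each candidate
for column in categorical_columns:
    print(toy[column].value_counts(), "\n")
```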


In [8]:
# Encode the data with pandas
print("Original features:\n", list(data.columns), "\n")
data_dummies = pd.get_dummies(data)
print("Features after get_dummies:\n",list(data_dummies.columns))

Original features:
 ['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income'] 

Features after get_dummies:
 ['age', 'hours-per-week', 'workclass_ ?', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private', 'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc', 'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 10th', 'education_ 11th', 'education_ 12th', 'education_ 1st-4th', 'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th', 'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors', 'education_ Doctorate', 'education_ HS-grad', 'education_ Masters', 'education_ Preschool', 'education_ Prof-school', 'education_ Some-college', 'gender_ Female', 'gender_ Male', 'occupation_ ?', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces', 'occupation_ Craft-repair', 'occupation_ Exec-managerial', 'occupation_ Farming-fishing', 'occupation_ Handlers-cleaners', 'occupation_ Machine-op-i

After this transformation, we can convert the data into a NumPy array and pass it to a machine learning model.

One needs to be careful to separate the output variable before passing everything into a model.

##### DANGER:
Label-based column slicing in pandas (for example with .loc) includes the end of the range. Slicing a NumPy array or a list does not include the last element.
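The difference can be seen in a small sketch with a made-up one-row DataFrame:

```python
import pandas as pd

# Hypothetical one-row frame with four columns
frame = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4]})

# Label-based slicing with .loc keeps the end label "c"
by_label = frame.loc[:, "a":"c"]

# Position-based slicing (.iloc, NumPy arrays, lists) drops the end position 2
by_position = frame.iloc[:, 0:2]
as_array = frame.values[:, 0:2]

print(list(by_label.columns))     # ['a', 'b', 'c']
print(list(by_position.columns))  # ['a', 'b']
print(as_array.shape)             # (1, 2)
```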

After all this, we can do what we love: train and test the model.

In [10]:
features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']
# Extract NumPy arrays
X = features.values
y = data_dummies['income_ >50K'].values
print("X.shape: {}  y.shape: {}".format(X.shape, y.shape))


from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print("Test score: {:.2f}".format(logreg.score(X_test, y_test)))

X.shape: (32561, 44)  y.shape: (32561,)
Test score: 0.81




##### DANGER:
It is important to ensure that the training and test sets have the same encoding and the same number of features after creating the dummy variables. The simplest way is to call get_dummies on the whole dataset before splitting it into training and test sets.
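When the two sets have already been split, one alternative (not the approach used in this notebook) is to align the test columns to the training columns with reindex. The sketch below uses made-up data to show both the mismatch and the fix.

```python
import pandas as pd

# Hypothetical split where the test set contains an unseen category
train = pd.DataFrame({"color": ["red", "blue", "red"]})
test = pd.DataFrame({"color": ["green", "red"]})

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# The two encodings disagree: different columns on each side
print(list(train_dummies.columns))  # ['color_blue', 'color_red']
print(list(test_dummies.columns))   # ['color_green', 'color_red']

# Align the test set to the training columns: columns missing from the
# test set are filled with 0, columns unseen in training are dropped
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))   # ['color_blue', 'color_red']
```

Dropping the unseen category silently is a design choice; depending on the task it may be better to raise an error instead.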

## Numbers Can Encode Categoricals

In the adult dataset, the categorical variables were encoded as strings, but categories can also be encoded as numbers, in which case they should NOT be treated as continuous variables.

The get_dummies function from pandas treats all numbers as continuous and will not create dummy variables for them.

Let's try it.

In [13]:
# create a DataFrame with an integer feature and a categorical string feature
demo_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 1],
                        'Categorical Feature': ['socks', 'fox', 'socks', 'box']})
display(demo_df)

| | Integer Feature | Categorical Feature |
|---|---|---|
| 0 | 0 | socks |
| 1 | 1 | fox |
| 2 | 2 | socks |
| 3 | 1 | box |


In [12]:
display(pd.get_dummies(demo_df))

| | Integer Feature | Categorical Feature_box | Categorical Feature_fox | Categorical Feature_socks |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 |
| 2 | 2 | 0 | 0 | 1 |
| 3 | 1 | 1 | 0 | 0 |


If I want to treat an integer feature as categorical, I need to ask pandas explicitly to do it.

In [15]:
demo_df["Integer Feature"] = demo_df["Integer Feature"].astype(str)
pd.get_dummies(demo_df, columns = ["Integer Feature", "Categorical Feature"])

| | Integer Feature_0 | Integer Feature_1 | Integer Feature_2 | Categorical Feature_box | Categorical Feature_fox | Categorical Feature_socks |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 0 | 1 | 0 | 1 | 0 | 0 |
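As a side note, listing a column in the columns parameter appears to be enough on its own: get_dummies encodes every column named there, whatever its dtype. The sketch below rebuilds the demo frame under a new, hypothetical name (demo) and leaves the integer column as integers.

```python
import pandas as pd

# Same toy data as above, rebuilt so the integer column is still numeric
demo = pd.DataFrame({"Integer Feature": [0, 1, 2, 1],
                     "Categorical Feature": ["socks", "fox", "socks", "box"]})

# Columns named explicitly are encoded even when they hold integers
encoded = pd.get_dummies(demo, columns=["Integer Feature", "Categorical Feature"])

print(list(encoded.columns))
```

Converting to str first, as done above, still has the advantage that a later call without the columns argument keeps treating the feature as categorical.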


# OneHotEncoder and ColumnTransformer: Categorical Variables with scikit-learn

Scikit-learn also has a way to perform one-hot encoding, with the OneHotEncoder class. OneHotEncoder applies the encoding to all input columns.

In [16]:
from sklearn.preprocessing import OneHotEncoder
# Setting sparse=False means OneHotEncoder will return a NumPy array, not a sparse matrix
# (in scikit-learn 1.2 and later this parameter is named sparse_output)
ohe = OneHotEncoder(sparse=False)
print(ohe.fit_transform(demo_df))

[[1. 0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 1.]
 [0. 1. 0. 1. 0. 0.]]


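To get the feature names, we use the get_feature_names method (scikit-learn 1.0 and later provide get_feature_names_out instead, and get_feature_names was removed in version 1.2).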

In [18]:
print(ohe.get_feature_names())

['x0_0' 'x0_1' 'x0_2' 'x1_box' 'x1_fox' 'x1_socks']
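Because OneHotEncoder encodes every column it is given, it is usually combined with ColumnTransformer so that only the categorical columns are encoded. The following is a minimal sketch on a made-up DataFrame; the step names ("scaling", "onehot") and the choice of StandardScaler for the numeric column are assumptions for illustration, not something this notebook prescribes.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the adult data
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "Private", "Private", "Self-emp"],
})

# Scale the numeric column and one-hot encode the categorical one
ct = ColumnTransformer([
    ("scaling", StandardScaler(), ["age"]),
    ("onehot", OneHotEncoder(), ["workclass"]),
])

transformed = ct.fit_transform(toy)
print(transformed.shape)  # (4, 4): one scaled column plus three categories
```

Fitting the transformer on the training data only and then calling transform on the test data keeps the encoding consistent between the two sets.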
