In [1]:
import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd 
import mglearn
%matplotlib inline 

## One-Hot-Encoding (Dummy Variables)

By far the most common way to represent categorical variables is one-hot-encoding, also called one-out-of-N encoding or dummy variables.

The one-hot-encoding we use is quite similar, but not identical, to the dummy coding used in statistics. For simplicity, we encode each category with a different binary feature.

In statistics, it is common to encode a categorical feature with k different possible values into k–1 features (the last one is represented as all zeros). This is done to simplify the analysis (more technically, this will avoid making the data matrix rank-deficient).
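The difference between the two encodings can be seen on a small example. This is a minimal sketch that assumes a hypothetical `colors` column (not part of the adult dataset) and uses the `drop_first` parameter of `pd.get_dummies`. Note that pandas drops the first category rather than the last, but the idea is the same: the dropped category is represented by a row of all zeros.

```python
import pandas as pd

# Hypothetical toy column with k = 3 categories
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# One-hot encoding: one binary column per category (k columns)
one_hot = pd.get_dummies(colors)

# Dummy coding as used in statistics: one category is dropped (k - 1 columns);
# a row of all zeros then stands for the dropped category
dummy = pd.get_dummies(colors, drop_first=True)

print(list(one_hot.columns))
print(list(dummy.columns))
```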

There are two ways to convert your data to a one-hot encoding of categorical variables, using either pandas or scikit-learn. At the time of writing, using pandas is slightly easier, so let’s go this route.


In [6]:
import os
# The file has no headers naming the columns, so we pass header=None 
# and provide the column names explicitly in "names" 
adult_path = os.path.join(mglearn.datasets.DATA_PATH, "adult.data") 
data = pd.read_csv(
    adult_path, header=None, index_col=False,
    names=['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'gender',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'income'])
# For illustration purposes, we only select some of the columns 
data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week',  
             'occupation', 'income']] 
# In a Jupyter notebook, the value of the last expression in a cell is
# rendered as a table; IPython.display.display(data.head()) gives the same output
data.head()

Unnamed: 0,age,workclass,education,gender,hours-per-week,occupation,income
0,39,State-gov,Bachelors,Male,40,Adm-clerical,<=50K
1,50,Self-emp-not-inc,Bachelors,Male,13,Exec-managerial,<=50K
2,38,Private,HS-grad,Male,40,Handlers-cleaners,<=50K
3,53,Private,11th,Male,40,Handlers-cleaners,<=50K
4,28,Private,Bachelors,Female,40,Prof-specialty,<=50K


After reading a dataset like this, it is often good to first check if a column actually contains meaningful categorical data. When working with data that was input by humans (say, users on a website), there might not be a fixed set of categories, and differences in spelling and capitalization might require preprocessing. For example, it might be that some people specified gender as “male” and some as “man,” and we might want to represent these two inputs using the same category. A good way to check the contents of a column is using the value_counts method of a pandas Series (the type of a single column in a DataFrame), to show us what the unique values are and how often they appear:

In [7]:
print(data.gender.value_counts())

 Male      21790
 Female    10771
Name: gender, dtype: int64
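If `value_counts` reveals inconsistent spellings, the column can be normalized before encoding. The following is only a sketch under stated assumptions: the entries in `raw` and the synonym mapping are hypothetical and chosen for illustration, not taken from the adult dataset.

```python
import pandas as pd

# Hypothetical free-text entries, not taken from the adult dataset
raw = pd.Series([" Male", "male", "Man", "FEMALE ", "woman"], name="gender")

# Strip surrounding whitespace and unify capitalization
cleaned = raw.str.strip().str.lower()

# Map synonyms onto one label per category (mapping chosen for illustration)
cleaned = cleaned.replace({"man": "male", "woman": "female"})

print(cleaned.value_counts())
```

The leading space visible in the output above (" Male", " Female") is the kind of detail that `str.strip` would remove.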


## Pandas Get Dummies

### Categories Encoded as Strings

There is a very simple way to encode the data in pandas, using the get_dummies function. The get_dummies function automatically transforms all columns that have object type (like strings) or are categorical (which is a special pandas concept that we haven’t talked about yet). The continuous features will be left alone.

In [10]:
print("Original features:\n", list(data.columns), "\n") 
data_dummies = pd.get_dummies(data)
print("Features after get_dummies:\n", list(data_dummies.columns))
data_dummies.head()

Original features:
 ['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income'] 

Features after get_dummies:
 ['age', 'hours-per-week', 'workclass_ ?', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private', 'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc', 'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 10th', 'education_ 11th', 'education_ 12th', 'education_ 1st-4th', 'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th', 'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors', 'education_ Doctorate', 'education_ HS-grad', 'education_ Masters', 'education_ Preschool', 'education_ Prof-school', 'education_ Some-college', 'gender_ Female', 'gender_ Male', 'occupation_ ?', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces', 'occupation_ Craft-repair', 'occupation_ Exec-managerial', 'occupation_ Farming-fishing', 'occupation_ Handlers-cleaners', 'occupation_ Machine-op-i

Unnamed: 0,age,hours-per-week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,...,occupation_ Machine-op-inspct,occupation_ Other-service,occupation_ Priv-house-serv,occupation_ Prof-specialty,occupation_ Protective-serv,occupation_ Sales,occupation_ Tech-support,occupation_ Transport-moving,income_ <=50K,income_ >50K
0,39,40,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
1,50,13,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
2,38,40,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,53,40,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,28,40,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,1,0


We can now use the values attribute to convert the data_dummies DataFrame into a NumPy array, and then train a machine learning model on it. Be careful to separate the target variable (which is now encoded in two income columns) from the data before training a model. Including the output variable, or some derived property of the output variable, in the feature representation is a very common mistake in building supervised machine learning models.

Note: Label-based column slicing in pandas (as with .loc) includes the end of the range, so 'age':'occupation_ Transport-moving' is inclusive of occupation_ Transport-moving. This is different from slicing a NumPy array, where the end of a range is not included: for example, np.arange(11)[0:10] doesn’t include the entry with index 10.
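The difference between the two slicing conventions can be checked on a small example. This sketch assumes a hypothetical three-column frame `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with three columns
df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# Label-based slicing with .loc includes the end label "b"
loc_cols = list(df.loc[:, "a":"b"].columns)

# Position-based slicing of a NumPy array excludes the end position
arr_slice = np.arange(11)[0:10]

print(loc_cols)       # columns "a" and "b"
print(arr_slice[-1])  # last element is 9, not 10
```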

In this case, we extract only the columns containing features—that is, all columns from age to occupation_ Transport-moving. This range contains all the features but not the target:

In [12]:
features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']

# Extract NumPy arrays 
X = features.values 
y = data_dummies['income_ >50K'].values 
print("X.shape: {}  y.shape: {}".format(X.shape, y.shape))

X.shape: (32561, 44)  y.shape: (32561,)


Now the data is represented in a way that scikit-learn can work with, and we can proceed as usual:


In [14]:
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) 
logreg = LogisticRegression() 
logreg.fit(X_train, y_train) 
print("Test score: {:.2f}".format(logreg.score(X_test, y_test)))

Test score: 0.81




In this example, we called get_dummies on a DataFrame containing both the training and the test data. This is important to ensure categorical values are represented in the same way in the training set and the test set.
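To see why this matters, consider what can happen when the training and test sets are encoded separately. The sketch below uses hypothetical `train` and `test` frames. Aligning the columns with `reindex` is one possible remedy, shown here as an illustration rather than as the only or recommended approach.

```python
import pandas as pd

# Hypothetical train and test frames: "fox" appears only in training,
# "box" only in test
train = pd.DataFrame({"word": ["socks", "fox", "socks"]})
test = pd.DataFrame({"word": ["socks", "box"]})

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# Encoded separately, the two frames end up with different columns
print(list(train_dummies.columns))
print(list(test_dummies.columns))

# One possible fix: align the test columns to the training columns,
# filling categories missing from the test set with 0 and
# dropping categories that were never seen in training
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
```

Without the alignment step, the second column would mean "fox" in the training data and "box" in the test data, and a model trained on one would silently misread the other.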

### Categories Encoded as Integers

Categorical features are often encoded using integers. That they are numbers doesn’t mean that they should necessarily be treated as continuous features. It is not always clear whether an integer feature should be treated as continuous or discrete (and one-hot-encoded). If there is no ordering between the semantics that are encoded (like in the workclass example), the feature must be treated as discrete. For other cases, like five-star ratings, the better encoding depends on the particular task and data and which machine learning algorithm is used.

The get_dummies function in pandas treats all numbers as continuous and will not create dummy variables for them. To get around this, you can either use scikit-learn’s OneHotEncoder, for which you can specify which variables are continuous and which are discrete, or convert numeric columns in the DataFrame to strings. To illustrate, let’s create a DataFrame object with two columns, one containing strings and one containing integers:

In [15]:
# create a DataFrame with an integer feature and a categorical string feature
demo_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 1],          
                        'Categorical Feature': ['socks', 'fox', 'socks', 'box']}) 
display(demo_df)

Unnamed: 0,Integer Feature,Categorical Feature
0,0,socks
1,1,fox
2,2,socks
3,1,box


Using get_dummies will only encode the string feature and will not change the integer feature, as you can see in Table 4-5:

In [16]:
display(pd.get_dummies(demo_df))

Unnamed: 0,Integer Feature,Categorical Feature_box,Categorical Feature_fox,Categorical Feature_socks
0,0,0,0,1
1,1,0,1,0
2,2,0,0,1
3,1,1,0,0


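If you want dummy variables to be created for the “Integer Feature” column, you can explicitly list the columns you want to encode using the columns parameter. Converting the column to strings with astype(str), which the code below also does, is an alternative way to achieve the same result. Either way, both features will be treated as categorical (see Table 4-6):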


In [18]:
demo_df['Integer Feature'] = demo_df['Integer Feature'].astype(str) 
display(pd.get_dummies(demo_df, columns=['Integer Feature', 'Categorical Feature']))


Unnamed: 0,Integer Feature_0,Integer Feature_1,Integer Feature_2,Categorical Feature_box,Categorical Feature_fox,Categorical Feature_socks
0,1,0,0,0,0,1
1,0,1,0,0,1,0
2,0,0,1,0,0,1
3,0,1,0,1,0,0
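The scikit-learn route mentioned earlier can be sketched as follows. This is a minimal example, assuming `ColumnTransformer` from `sklearn.compose` is used to select which columns go through `OneHotEncoder`. The step name "onehot" and the choice of `handle_unknown="ignore"` are illustrative assumptions, not something prescribed by the text above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

demo_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 1],
                        'Categorical Feature': ['socks', 'fox', 'socks', 'box']})

# Treat both columns as categorical; columns not listed here are dropped
# by default (remainder="drop")
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"),
      ['Integer Feature', 'Categorical Feature'])],
    sparse_threshold=0,  # always return a dense NumPy array
)

encoded = ct.fit_transform(demo_df)
print(encoded.shape)
```

Because the encoder is fitted once and then reused via `ct.transform`, the same column layout is applied to any later data, which avoids the train/test mismatch that separate get_dummies calls can cause.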
