# One-Hot Encoding a Feature on a Pandas Dataframe: an Example

<a href="http://queirozf.com/entries/one-hot-encoding-a-feature-on-a-pandas-dataframe-an-example">Original article</a>

<p>How would you calculate the distance between users in a dataset, where their country of origin is the only feature?</p>
<p>Take this dataset for example:</p>

In [2]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'country': ['russia', 'germany', 'australia', 'korea', 'germany']})
df

,country
0,russia
1,germany
2,australia
3,korea
4,germany


<p>One of the ways to do it is to encode the categorical variables as a one-hot vector, i.e. a vector where only one element is non-zero, or hot.</p>
<p>With one-hot encoding, a categorical feature becomes an array whose size is the number of possible values for that feature, i.e.:</p>

In [3]:
pd.get_dummies(df)

,country_australia,country_germany,country_korea,country_russia
0,0,0,0,1
1,0,1,0,0
2,1,0,0,0
3,0,0,1,0
4,0,1,0,0
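
<p>To come back to the opening question, here is a minimal sketch of how the one-hot vectors can be turned into distances between users. Euclidean distance is an assumption here (the article does not name a metric), and the variable names are illustrative only:</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'country': ['russia', 'germany', 'australia', 'korea', 'germany']})

# dtype=float so the encoded columns can be used directly in arithmetic
encoded = pd.get_dummies(df, dtype=float).to_numpy()

# Pairwise Euclidean distances via broadcasting:
# distances[i, j] is the distance between user i and user j
diff = encoded[:, None, :] - encoded[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))

print(distances.round(3))
```

<p>Users from the same country end up at distance 0, and any two users from different countries are equally far apart (the square root of 2), which is what you would expect when country is the only feature.</p>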


## One-hot encoding vs Dummy variables

<p>Despite its name, get_dummies() produces a one-hot encoding by default (one column per category), not a dummy encoding.</p>
<p>To produce an actual dummy encoding, which drops the first category and so uses one column fewer, pass drop_first=True (note that 'australia' is missing from the columns):</p>

In [4]:
pd.get_dummies(df, prefix=['country'], drop_first=True)

,country_germany,country_korea,country_russia
0,0,0,1
1,1,0,0
2,0,0,0
3,0,1,0
4,1,0,0
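
<p>No information is lost by dropping the first column: a row belongs to the dropped category exactly when all remaining columns are zero. The sketch below checks this on the same data; the variable names are illustrative, and dtype=int is passed only so that the values are 0/1 regardless of the pandas version:</p>

```python
import pandas as pd

df = pd.DataFrame({'country': ['russia', 'germany', 'australia', 'korea', 'germany']})

one_hot = pd.get_dummies(df, dtype=int)
dummy = pd.get_dummies(df, drop_first=True, dtype=int)

# The dropped column is 1 exactly where every remaining column is 0
recovered = 1 - dummy.sum(axis=1)

print(recovered.tolist())
print(one_hot['country_australia'].tolist())
```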


## Add columns for categories that only appear in the test set

<p>You need to tell pandas explicitly if you want it to create dummy columns for categories that never appear in the data (for example, if you one-hot encode a categorical variable that may have unseen values in the test set).</p>

In [5]:
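# Say you want a column for 'japan' too (it will always be zero, of course)
# Note: astype('category', categories=[...]) only works on old pandas versions;
# recent versions expect a CategoricalDtype instead
df["country"] = df["country"].astype(
    pd.CategoricalDtype(categories=['australia', 'germany', 'korea', 'russia', 'japan'])
)
pd.get_dummies(df, prefix=['country'])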

,country_australia,country_germany,country_korea,country_russia,country_japan
0,0,0,0,1,0
1,0,1,0,0,0
2,1,0,0,0,0
3,0,0,1,0,0
4,0,1,0,0,0
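
<p>One way to apply this to a train/test split is to share a single categorical dtype between both sets, so that they always produce the same columns in the same order. This is a sketch under assumptions: the train and test frames below are hypothetical, as are the variable names:</p>

```python
import pandas as pd

# Hypothetical train/test split; 'japan' only occurs in the test set
train = pd.DataFrame({'country': ['russia', 'germany', 'australia', 'korea', 'germany']})
test = pd.DataFrame({'country': ['japan', 'germany']})

# One shared dtype listing every category either set may contain
country_dtype = pd.CategoricalDtype(
    categories=['australia', 'germany', 'korea', 'russia', 'japan']
)

train_encoded = pd.get_dummies(train.astype({'country': country_dtype}), dtype=int)
test_encoded = pd.get_dummies(test.astype({'country': country_dtype}), dtype=int)

# Both frames now have identical columns, in the same order
print(list(train_encoded.columns) == list(test_encoded.columns))
```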


## Add dummy columns to dataframe

<p>If your dataframe has other columns in addition to the one you want to one-hot encode, this is how you replace the country column with its 4 derived columns and keep the other one:</p>

<p>Use pd.concat() to join the columns and then drop() the original country column:</p>

In [10]:
df = pd.DataFrame({
    'name': ['josef', 'michael', 'john', 'bawool', 'klaus'],
    'country': ['russia', 'germany', 'australia', 'korea', 'germany']
})
df

,name,country
0,josef,russia
1,michael,germany
2,john,australia
3,bawool,korea
4,klaus,germany


In [11]:
df = pd.concat([df, pd.get_dummies(df['country'], prefix='country')], axis=1)
df.drop(['country'], axis=1, inplace=True)
df

,name,country_australia,country_germany,country_korea,country_russia
0,josef,0,0,0,1
1,michael,0,1,0,0
2,john,1,0,0,0
3,bawool,0,0,1,0
4,klaus,0,1,0,0
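
<p>As an alternative sketch, get_dummies() also accepts a columns argument, which encodes only the listed columns and leaves the rest in place, so the concat-and-drop step can be done in one call. dtype=int is passed here only so that the values are 0/1 regardless of the pandas version:</p>

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['josef', 'michael', 'john', 'bawool', 'klaus'],
    'country': ['russia', 'germany', 'australia', 'korea', 'germany']
})

# Only 'country' is encoded; 'name' is passed through untouched
encoded = pd.get_dummies(df, columns=['country'], dtype=int)

print(list(encoded.columns))
```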


## Treat Nulls/NaNs as a separate category

In [12]:
df = pd.DataFrame({
    'country': ['germany',np.nan,'germany','united kingdom','america','united kingdom']
})

pd.get_dummies(df, dummy_na=True)

,country_america,country_germany,country_united kingdom,country_nan
0,0,1,0,0
1,0,0,0,1
2,0,1,0,0
3,0,0,1,0
4,1,0,0,0
5,0,0,1,0
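
<p>To see what dummy_na=True changes, the sketch below compares both settings on the same data. Without it, a row with a missing value is encoded as all zeros; with it, the row gets a 1 in the extra country_nan column. Variable names are illustrative:</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'country': ['germany', np.nan, 'germany', 'united kingdom', 'america', 'united kingdom']
})

without_na = pd.get_dummies(df, dtype=int)
with_na = pd.get_dummies(df, dummy_na=True, dtype=int)

# Row 1 holds the NaN: all zeros without dummy_na, flagged in 'country_nan' with it
print(without_na.loc[1].sum())
print(with_na.loc[1, 'country_nan'])
```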
