# Computing Indicator/Dummy Variables

Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a “dummy” or “indicator” matrix. If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns containing only 1s and 0s. pandas has a get_dummies function for doing this, though devising one yourself is not difficult. Let’s return to an earlier example DataFrame:

In [1]:
import pandas as pd
import numpy as np
from pandas import DataFrame, Series

In [2]:
df = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                'data1': range(6)})

In [3]:
df

|   | key | data1 |
|---|-----|-------|
| 0 | b   | 0     |
| 1 | b   | 1     |
| 2 | a   | 2     |
| 3 | c   | 3     |
| 4 | a   | 4     |
| 5 | b   | 5     |


In [4]:
pd.get_dummies(df['key'])

|   | a | b | c |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 |
| 5 | 0 | 1 | 0 |
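
To make the "devising one yourself" remark concrete, here is a minimal sketch of a hand-rolled indicator builder. The helper name `simple_dummies` is hypothetical and not part of pandas; it assumes the input is a one-dimensional sequence of hashable, sortable values.

```python
import pandas as pd

def simple_dummies(values):
    """Hypothetical helper: build a 0/1 indicator DataFrame by hand."""
    values = pd.Series(values)
    # Sorted distinct values become the column labels
    categories = sorted(values.unique())
    # One column per category: 1 where the value matches, 0 elsewhere
    return pd.DataFrame({cat: (values == cat).astype(int) for cat in categories})

key = pd.Series(['b', 'b', 'a', 'c', 'a', 'b'])
manual = simple_dummies(key)
print(manual)
```

On this input the result should match the 0/1 pattern that get_dummies produces for the same column, although the built-in function also handles missing values, prefixes, and categorical dtypes that this sketch ignores.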


In some cases, you may want to add a prefix to the columns in the indicator DataFrame, which can then be merged with the other data. get_dummies has a prefix argument for doing just this:

In [5]:
dummies = pd.get_dummies(df['key'], prefix='col')

In [6]:
dummies

|   | col_a | col_b | col_c |
|---|-------|-------|-------|
| 0 | 0     | 1     | 0     |
| 1 | 0     | 1     | 0     |
| 2 | 1     | 0     | 0     |
| 3 | 0     | 0     | 1     |
| 4 | 1     | 0     | 0     |
| 5 | 0     | 1     | 0     |


In [7]:
df_with_dummy = df[['data1']].join(dummies)

In [8]:
df_with_dummy

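|   | data1 | col_a | col_b | col_c |
|---|-------|-------|-------|-------|
| 0 | 0     | 0     | 1     | 0     |
| 1 | 1     | 0     | 1     | 0     |
| 2 | 2     | 1     | 0     | 0     |
| 3 | 3     | 0     | 0     | 1     |
| 4 | 4     | 1     | 0     | 0     |
| 5 | 5     | 0     | 1     | 0     |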
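
As an alternative to building the indicators separately and joining them, get_dummies can also be given the whole DataFrame together with a `columns` argument. The sketch below rebuilds the same example frame. It passes `dtype=int` to request 0/1 integers, since recent pandas versions return booleans by default.

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# Encode only 'key'; the remaining columns pass through unchanged
# and the original 'key' column is dropped from the result.
encoded = pd.get_dummies(df, columns=['key'], prefix='col', dtype=int)
print(encoded.columns.tolist())
```

This form is convenient when several categorical columns need encoding at once, because `columns` accepts a list of names.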


If a row in a DataFrame belongs to multiple categories, things are a bit more complicated. Let’s return to the MovieLens 1M dataset from earlier in the book:

In [18]:
mov_names = ['movie_id', 'title', 'genres']

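# The raw string keeps the backslashes literal; engine='python' is needed for the multi-character separator.
# header=0 discards the file's first line in favor of mov_names; use header=None if that line is data.
movies = pd.read_table(r'../../CSV Files\O_Reilly\ch02\movielens\movies.dat',
                       sep='::', header=0, names=mov_names, encoding='latin-1', engine='python')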



In [19]:
movies[:10]

|   | movie_id | title | genres |
|---|----------|-------|--------|
| 0 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 1 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 2 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 3 | 5 | Father of the Bride Part II (1995) | Comedy |
| 4 | 6 | Heat (1995) | Action\|Crime\|Thriller |
| 5 | 7 | Sabrina (1995) | Comedy\|Romance |
| 6 | 8 | Tom and Huck (1995) | Adventure\|Children's |
| 7 | 9 | Sudden Death (1995) | Action |
| 8 | 10 | GoldenEye (1995) | Action\|Adventure\|Thriller |
| 9 | 11 | American President, The (1995) | Comedy\|Drama\|Romance |


Adding indicator variables for each genre requires a little bit of wrangling. First, we extract the list of unique genres in the dataset (using a nice set.union trick):

In [20]:
genres_iter = (set(x.split('|')) for x in movies.genres)

In [21]:
genres = sorted(set.union(*genres_iter))
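
The set.union step can be easier to follow on a few hand-written strings. The sketch below uses made-up sample rows in the same pipe-delimited format, not the actual dataset.

```python
# Toy genre strings in the same pipe-delimited format (illustrative data)
sample = ["Comedy|Romance", "Action|Crime|Thriller", "Comedy"]

# One set of genres per row
per_row = [set(s.split('|')) for s in sample]

# set.union(a, b, c) is equivalent to a | b | c; the * unpacks the sets as arguments
all_genres = sorted(set.union(*per_row))
print(all_genres)
# ['Action', 'Comedy', 'Crime', 'Romance', 'Thriller']
```

Note that `set.union(*per_row)` raises a TypeError when `per_row` is empty. Writing `set().union(*per_row)` is one way to handle that case, since it returns an empty set instead.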

Now, one way to construct the indicator DataFrame is to start with a DataFrame of all zeros:

In [22]:
dummies = DataFrame(np.zeros((len(movies), len(genres))), columns=genres)

Next, iterate through each movie and set the entries for its genres in the corresponding row of dummies to 1:

In [25]:
for i, gen in enumerate(movies.genres):
    dummies.loc[i, gen.split('|')] = 1

Then, as above, you can combine this with movies:

In [26]:
movies_windic = movies.join(dummies.add_prefix('Genre_'))

In [27]:
movies_windic.loc[0]

movie_id                                        2
title                              Jumanji (1995)
genres               Adventure|Children's|Fantasy
Genre_Action                                  0.0
Genre_Adventure                               1.0
Genre_Animation                               0.0
Genre_Children's                              1.0
Genre_Comedy                                  0.0
Genre_Crime                                   0.0
Genre_Documentary                             0.0
Genre_Drama                                   0.0
Genre_Fantasy                                 1.0
Genre_Film-Noir                               0.0
Genre_Horror                                  0.0
Genre_Musical                                 0.0
Genre_Mystery                                 0.0
Genre_Romance                                 0.0
Genre_Sci-Fi                                  0.0
Genre_Thriller                                0.0
Genre_War                                     0.0


> For much larger data, this method of constructing indicator variables with multiple membership is not especially speedy, because it assigns into the DataFrame once per row inside a Python loop. A lower-level function leveraging the internals of the DataFrame could certainly be written.
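
One option that avoids the explicit Python loop is the string method `Series.str.get_dummies`, which splits each string on a separator and returns one 0/1 column per distinct token. The sketch below uses a small stand-in frame rather than the real movies table. It is offered as an alternative to try, not as a measured speedup.

```python
import pandas as pd

# Small stand-in for the movies table (illustrative rows only)
toy = pd.DataFrame({
    'title': ['Jumanji (1995)', 'Heat (1995)', 'Sudden Death (1995)'],
    'genres': ["Adventure|Children's|Fantasy", 'Action|Crime|Thriller', 'Action'],
})

# Split each string on '|' and build one 0/1 column per distinct genre
genre_dummies = toy['genres'].str.get_dummies(sep='|')

# Combine with the original columns, as in the loop-based version
toy_windic = toy.join(genre_dummies.add_prefix('Genre_'))
print(toy_windic.loc[0])
```

The resulting columns are integers rather than the floats produced by starting from `np.zeros`.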

A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut:

In [28]:
values = np.random.rand(10)

In [29]:
values

array([0.04249882, 0.88327265, 0.83332966, 0.6194501 , 0.93560934,
       0.92764437, 0.59431638, 0.54917207, 0.91751161, 0.92586522])

In [30]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

In [31]:
pd.get_dummies(pd.cut(values, bins))

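|   | (0.0, 0.2] | (0.2, 0.4] | (0.4, 0.6] | (0.6, 0.8] | (0.8, 1.0] |
|---|------------|------------|------------|------------|------------|
| 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 |
| 5 | 0 | 0 | 0 | 0 | 1 |
| 6 | 0 | 0 | 1 | 0 | 0 |
| 7 | 0 | 0 | 1 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 1 |
| 9 | 0 | 0 | 0 | 0 | 1 |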
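
Because the values above are random, a version with fixed inputs may be easier to check by hand. The sketch below reuses the same bins and passes `labels` to cut so that the indicator columns get readable names. The label strings are hypothetical, chosen only for illustration.

```python
import pandas as pd

values = [0.05, 0.25, 0.55, 0.95]  # fixed illustrative values
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
labels = ['very_low', 'low', 'mid', 'high', 'very_high']  # hypothetical names

# Assign each value to a labeled bin, then one-hot encode the bins
binned = pd.cut(values, bins, labels=labels)
indicators = pd.get_dummies(binned, dtype=int)
print(indicators)
```

Because cut returns categorical data, get_dummies emits a column for every bin, including ones with no observations (here `high`), which keeps the column layout stable across different samples.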
