# How do I create dummy variables in pandas?

🐼 Tutorial on pandas by Data School - Exercise performed by Dorian.H Mekni 🥇 | Sun 13 Dec 2020


🧐 Dummy variables are also known as indicator variables.

⭐️ Suppose your dataframe has a column for gender. Encoding Female as 0 and Male as 1 (or the other way around) is what we call dummy encoding.

➕ Dummy encoding is also how unordered categorical features are included in machine learning models.



⭐️ How do we do it in pandas?


In [60]:
import pandas as pd 

In [61]:
train = pd.read_csv('http://bit.ly/kaggletrain')

In [62]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [63]:
# Create a new column by mapping each category to 0 or 1:
train['Sex-male'] = train.Sex.map({'female': 0, 'male': 1})

In [64]:
# Check that the new column is now part of our dataframe:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex-male
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1
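
One caveat with the map approach is worth a quick sketch: any value that is missing from the dictionary comes back as NaN. The toy series below is made up for illustration (it is not the Titanic data) and contains a deliberate typo to show the effect.

```python
import pandas as pd

# Hypothetical toy data; 'Male' (capitalised) is a deliberate typo
sex = pd.Series(['male', 'female', 'Male'], name='Sex')

# Same mapping as in the tutorial
sex_male = sex.map({'female': 0, 'male': 1})

# Values not covered by the dictionary become NaN, so count them
print(sex_male.isna().sum())
```

Counting the NaN values right after mapping is a cheap way to confirm that the dictionary covered every category.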



⭐️ Another way to do it, with get_dummies:
    

In [65]:
pd.get_dummies(train.Sex)

Unnamed: 0,female,male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1
...,...,...
886,0,1
887,1,0
888,1,0
889,0,1



☝🏻 get_dummies creates one column for every possible value. There are two possible values for Sex in this dataset, so one column has been generated for each. For each row, a 1 in the appropriate column tells you whether the passenger was male or female.
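
To make the "one column per value" behaviour concrete, here is a minimal sketch on a made-up four-row series standing in for train.Sex. The `dtype=int` argument is an addition to the tutorial's call: recent pandas versions return True/False by default, and `dtype=int` asks for the 0/1 output shown here.

```python
import pandas as pd

# Hypothetical toy stand-in for train.Sex
sex = pd.Series(['male', 'female', 'female', 'male'], name='Sex')

# One column per distinct value; dtype=int requests 0/1 instead of booleans
dummies = pd.get_dummies(sex, dtype=int)
print(dummies)

# Exactly one column is set per row, so every row sums to 1
print(dummies.sum(axis=1).tolist())
```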


In [66]:
# One column carries all the information, so we drop the first column (female), which becomes the baseline:
pd.get_dummies(train.Sex).iloc[:, 1:]



Unnamed: 0,male
0,1
1,0
2,0
3,0
4,1
...,...
886,1
887,0
888,0
889,1



✅ We have now obtained the same result as with the map method used previously.
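
As an alternative to slicing with iloc, get_dummies also accepts a `drop_first=True` argument that drops the first level for you. The sketch below uses a made-up toy series (not the Titanic data) and `dtype=int` to force 0/1 output, then checks that the three approaches agree.

```python
import pandas as pd

# Hypothetical toy stand-in for train.Sex
sex = pd.Series(['male', 'female', 'female', 'male'], name='Sex')

# Drop the first dummy column by position, as in the tutorial
by_position = pd.get_dummies(sex, dtype=int).iloc[:, 1:]

# Drop it with the drop_first keyword instead
by_keyword = pd.get_dummies(sex, dtype=int, drop_first=True)

# The map approach from earlier in the tutorial
mapped = sex.map({'female': 0, 'male': 1})

# Both comparisons should hold on this toy data
print(by_keyword.equals(by_position))
print(by_keyword['male'].tolist() == mapped.tolist())
```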


In [67]:
# Before adding this column back to the dataframe, we add a prefix to identify where it came from:
pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:]

Unnamed: 0,Sex_male
0,1
1,0
2,0
3,0
4,1
...,...
886,1
887,0
888,0
889,1



⭐️ Let's now try this method on a categorical variable that has more than two possible values:


In [68]:
train.Embarked.value_counts()


S    644
C    168
Q     77
Name: Embarked, dtype: int64


🧐 If a feature has three possible values, you need only two (three minus one) dummy variables to capture all the information about it:


In [69]:

embarked_dummies = pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:]
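
Why are two columns enough for three values? A small sketch, assuming a made-up series with no missing values: the dropped column can always be rebuilt, because it is 1 exactly when all the kept columns are 0.

```python
import pandas as pd

# Hypothetical toy stand-in for train.Embarked, with no missing values
embarked = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# All three dummy columns, then the reduced set without the first one
full = pd.get_dummies(embarked, prefix='Embarked', dtype=int)
reduced = full.iloc[:, 1:]  # keeps Embarked_Q and Embarked_S

# Rebuild the dropped column: 1 where both kept columns are 0
recovered_C = 1 - reduced.sum(axis=1)
print((recovered_C == full['Embarked_C']).all())
```

Note that this reconstruction relies on the no-missing-values assumption: by default get_dummies encodes a missing value as all zeros, which would be indistinguishable from the baseline category.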


In [70]:
# Let's concatenate it to our dataframe, side by side as new columns (axis=1):
train = pd.concat([train, embarked_dummies], axis=1)

In [71]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex-male,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0,0,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0,0,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1,0,1



✅ We have successfully added our dummy columns to the dataframe using two different methods: map for Sex-male, and get_dummies with concat for the two Embarked columns.
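
For completeness, get_dummies can also take a whole dataframe together with a `columns` list, which encodes several features in one call. This is a sketch on a made-up three-row dataframe, not the tutorial's own workflow; `drop_first=True` and `dtype=int` are optional arguments chosen here to mirror the baseline-dropping and 0/1 output used above.

```python
import pandas as pd

# Hypothetical toy dataframe with two categorical columns and one numeric column
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
    'Fare': [7.25, 71.2833, 7.925],
})

# Encode both categorical columns at once; each column name becomes the prefix
encoded = pd.get_dummies(toy, columns=['Sex', 'Embarked'], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

Unlike the concat approach, this call replaces the original Sex and Embarked columns with their dummies, so keep a copy of the dataframe if the original columns are still needed.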



🙏🏻 Thank you !

👋🏻 See you in the next one !
