One simple way to include a categorical feature in a machine learning model is to generate dummy variables. Let's see how:

Dummy variables are also known as indicator variables: a categorical variable is converted into one or more 0/1 columns.

Let's say we have a column in our dataset called 'gender' whose value is either male or female. An example of dummy coding (encoding) would be to mark female as 0 and male as 1. This process is also used to include unordered categorical features in machine learning models.
Let's see with an example how we can implement that in pandas:

In [1]:
import pandas as pd

The dataset is from Kaggle's Titanic competition.

In [2]:
train = pd.read_csv('http://bit.ly/kaggletrain')

In [3]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Above are some of the passengers who were aboard the Titanic, with some information about each of them.

Let's say we want to generate a dummy variable for the 'Sex' column in the 'train' DataFrame, which is currently stored as an object column because it holds strings.

One way to do this is the Series map() method, which maps the values of a Series according to an input correspondence (here, a dict). We use it to create a new column, 'Sex_Male', in the DataFrame:

In [4]:
train['Sex_Male'] = train.Sex.map({'female':0, 'male':1})

In [5]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_Male
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1


Here is our new column 'Sex_Male' on the right; its value is 1 where 'Sex' is 'male' and 0 where it is 'female'.

That is how we dummy encoded the 'Sex' column into 'Sex_Male'. Which value gets 1 and which gets 0 is just a convention; here we use 1 for male and 0 for female.
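One thing worth checking with map() is what happens to values the dict does not cover. The sketch below uses a small made-up DataFrame (the 'unknown' entry is hypothetical and does not come from the Titanic data) to show that such values come back as NaN:

```python
import pandas as pd

# Toy data standing in for the Titanic 'Sex' column (values are made up).
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'unknown']})

# map() looks each value up in the dict; 'unknown' has no entry, so it becomes NaN.
toy['Sex_Male'] = toy['Sex'].map({'female': 0, 'male': 1})

print(toy)
print(toy['Sex_Male'].isna().sum())  # how many values the dict did not cover
```

Because of the NaN, the new column is stored as floats rather than integers, so it can be worth counting the missing entries right after mapping.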

###### Let's see a more flexible way, with a bit more work:

Here we are going to use a pandas top-level function, pandas.get_dummies (pd.get_dummies). Let's see how it works:

In [6]:
pd.get_dummies(train.Sex)

Unnamed: 0,female,male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1
...,...,...
886,0,1
887,1,0
888,1,0
889,0,1


What it does is create one column for every possible value. There are 2 possible values for 'Sex' in this dataset, so it created one column for each, and in each row it marks the matching value with a 1. If there is a 1 under the female column, that row's sex is female.

Generally speaking, if we have K possible values for a categorical variable, we use K-1 dummy variables to represent it, since that captures all the information about the feature.

Relating this to the result above: we have 2 possible values for 'Sex', so we need only one dummy variable (either the female or the male column) to represent it.
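The K-1 argument can be checked on a tiny made-up Series. This is only a sketch: the values are invented, and dtype=int is passed because recent pandas versions return True/False columns by default rather than the 0/1 shown in the outputs here.

```python
import pandas as pd

# Made-up values standing in for the 'Sex' column.
sex = pd.Series(['male', 'female', 'female', 'male'], name='Sex')

# dtype=int asks for 0/1 integers instead of booleans.
full = pd.get_dummies(sex, dtype=int)

# Exactly one of the two columns is 1 in every row...
row_sums = full.sum(axis=1)

# ...so 'female' can be rebuilt from 'male' alone and carries no extra information.
rebuilt_female = 1 - full['male']

print(full)
print((rebuilt_female == full['female']).all())
```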

So we want to drop one column from the result above. Here we drop the first column (female), which thereby becomes the baseline. We use the iloc[] indexer, which selects rows and columns by integer position, to keep only the column we need.

In [7]:
pd.get_dummies(train.Sex, prefix='Sex').iloc[:,1:] # all rows; columns from position 1 through the end

Unnamed: 0,Sex_male
0,1
1,0
2,0
3,0
4,1
...,...
886,1
887,0
888,0
889,1


Here, using get_dummies() we have reproduced the same values as we got with the map() method. It takes a bit more work than map(), but it is more flexible.
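To confirm that the two approaches agree, they can be compared directly. The sketch below runs on a small made-up DataFrame rather than the Titanic file, and passes dtype=int so the dummy column holds 0/1 integers:

```python
import pandas as pd

# Made-up rows standing in for the Titanic data.
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})

via_map = toy['Sex'].map({'female': 0, 'male': 1})
via_dummies = pd.get_dummies(toy['Sex'], prefix='Sex', dtype=int).iloc[:, 1:]

# get_dummies returns a DataFrame, so pick its single column before comparing.
same = (via_map == via_dummies['Sex_male']).all()
print(same)
```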

###### get_dummies() with a categorical variable that has more than 2 possible values:

For instance, here we will work on the 'Embarked' column from the 'train' DataFrame.

In [8]:
train.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

Here we check the distinct values of the 'Embarked' column using value_counts(); it has 3 different values: S, C and Q.
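By default value_counts() leaves missing entries out of the tally. A minimal sketch on a made-up Series (not the Titanic column itself) shows how dropna=False makes them visible:

```python
import pandas as pd

# Made-up values; None stands for a missing port of embarkation.
embarked = pd.Series(['S', 'C', 'S', None, 'Q'], name='Embarked')

print(embarked.value_counts())              # missing entries are left out
print(embarked.value_counts(dropna=False))  # missing entries get their own row
```

Comparing the two totals against the number of rows is a quick way to see whether a column has gaps before encoding it.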

In [9]:
pd.get_dummies(train.Embarked, prefix='Embarked')

Unnamed: 0,Embarked_C,Embarked_Q,Embarked_S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1
...,...,...,...
886,0,0,1
887,0,0,1
888,0,0,1
889,1,0,0


As we can see, it generates one column for each possible value, and each row with a known 'Embarked' value has a single 1.
For instance, row 0 has a 1 in S, row 1 has a 1 in C, row 2 again has a 1 in S, and so on.

When we glance at the DataFrame, this matches what we expect.

As before, with 3 possible values we need 3-1 = 2 dummy variables to capture all the information about the feature.
So, just like before, we write:

In [10]:
pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:,1:] # All rows and columns 1 through the end.

Unnamed: 0,Embarked_Q,Embarked_S
0,0,1
1,0,0
2,0,1
3,0,1
4,0,1
...,...,...
886,0,1
887,0,1
888,0,1
889,0,0


Here we have dropped the C (Embarked_C) column and established it as the baseline.

We can still work out the 'Embarked' value for every row. If Q is 0 and S is 0, then we know C is 1; if Q is 1 and S is 0, then we know C is 0. There are 3 values in play, so once we know 2 of them it is easy to find the missing 3rd one.
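The decoding rule above can be written out as a small function. This is a sketch on made-up values, and decode() is a hypothetical helper, not part of pandas. It also shows one caveat: get_dummies gives a missing value 0 in every column (unless dummy_na=True is passed), so after dropping the baseline a missing entry looks the same as C.

```python
import pandas as pd

# Made-up values; None stands for a missing port of embarkation.
embarked = pd.Series(['S', 'C', 'Q', None, 'S'], name='Embarked')

dummies = pd.get_dummies(embarked, prefix='Embarked', dtype=int).iloc[:, 1:]

def decode(row):
    """Recover the port from the two remaining dummy columns."""
    if row['Embarked_Q'] == 1:
        return 'Q'
    if row['Embarked_S'] == 1:
        return 'S'
    return 'C'  # baseline: both dummies are 0

decoded = dummies.apply(decode, axis=1)
print(pd.DataFrame({'original': embarked, 'decoded': decoded}))
```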

Let's see how to attach the above 2 columns to the main 'train' DataFrame.

In [11]:
Embarked_dummies = pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:,1:] # Will save to 'Embarked_dummies'. 

With the command below, we concatenate the 'train' and 'Embarked_dummies' DataFrames and overwrite 'train' with the result, which includes the new columns:

In [12]:
train = pd.concat([train,Embarked_dummies], axis=1) # concat() is a top-level function; axis=1 joins along columns, axis=0 along rows

In [13]:
train.head()  # let's take a look at the newly added columns

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_Male,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0,0,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0,0,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1,0,1


Now we have the Embarked_Q and Embarked_S columns on the right of the original 'train' DataFrame.
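The same attach step can be tried on a small made-up DataFrame. In this sketch the column names mirror the Titanic ones but the rows are invented; concat with axis=1 places the frames side by side and matches rows by index label:

```python
import pandas as pd

# Made-up rows standing in for the Titanic data.
toy = pd.DataFrame({'Name': ['a', 'b', 'c'], 'Embarked': ['S', 'C', 'Q']})
dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked', dtype=int).iloc[:, 1:]

# axis=1 places the frames side by side, matching rows by index label.
combined = pd.concat([toy, dummies], axis=1)

print(combined.shape)
print(list(combined.columns))
```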

###### Useful point: 

Reset the DataFrame to how it was when we started by reading the file again.

In [14]:
train = pd.read_csv('http://bit.ly/kaggletrain')

In [15]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The trick is to pass a DataFrame to pd.get_dummies(). Until now we were passing a Series, so let's try passing a DataFrame instead:

In [16]:
pd.get_dummies(train, columns=['Sex','Embarked'])
# The columns parameter takes a list of the columns we want dummy encoded

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.2500,,0,1,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,1,0,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.9250,,1,0,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1000,C123,1,0,0,0,1
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.0500,,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",27.0,0,0,211536,13.0000,,0,1,0,0,1
887,888,1,1,"Graham, Miss. Margaret Edith",19.0,0,0,112053,30.0000,B42,1,0,0,0,1
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",,1,2,W./C. 6607,23.4500,,1,0,0,0,1
889,890,1,1,"Behr, Mr. Karl Howell",26.0,0,0,111369,30.0000,C148,0,1,1,0,0


When we call get_dummies() on a DataFrame, the nice part is that it drops the original 'Sex' and 'Embarked' columns and replaces them with dummy columns: Sex_female, Sex_male, Embarked_C, Embarked_Q and Embarked_S.

We can't use the iloc[] trick here to drop columns such as Sex_female and Embarked_C, because they are no longer the first column of the result. We could drop them with the drop() method, but if we were dummy encoding 40 different columns, naming each baseline column would be a lot of manual work.
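For comparison, the manual drop() route might look like the sketch below. It runs on a small made-up DataFrame (the rows are invented, not taken from the Titanic file), and the baseline columns have to be listed by name:

```python
import pandas as pd

# Made-up rows standing in for the Titanic data.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
    'Fare': [7.25, 71.28, 7.92],
})

encoded = pd.get_dummies(toy, columns=['Sex', 'Embarked'], dtype=int)

# The baseline columns have to be named one by one, which does not scale well.
trimmed = encoded.drop(columns=['Sex_female', 'Embarked_C'])
print(list(trimmed.columns))
```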

To drop exactly those columns, pandas version 0.18 (released March 2016) added a new parameter called 'drop_first'; all we do is set it to True, as below:

In [17]:
pd.get_dummies(train, columns=['Sex','Embarked'], drop_first=True) 

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Sex_male,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.2500,,1,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,0,0,0
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.9250,,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1000,C123,0,0,1
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.0500,,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",27.0,0,0,211536,13.0000,,1,0,1
887,888,1,1,"Graham, Miss. Margaret Edith",19.0,0,0,112053,30.0000,B42,0,0,1
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",,1,2,W./C. 6607,23.4500,,0,0,1
889,890,1,1,"Behr, Mr. Karl Howell",26.0,0,0,111369,30.0000,C148,1,0,0


Check out the result: it has now dropped 'Sex_female' and 'Embarked_C'.

So we have accomplished what we wanted without using iloc[] and concat(), in one line, provided we have pandas 0.18 or later.
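As a final check, the one-line version can be compared with the step-by-step route (get_dummies on each Series, iloc, then concat). This is a sketch on a small made-up DataFrame, with dtype=int passed so both routes produce 0/1 integers:

```python
import pandas as pd

# Made-up rows standing in for the Titanic data.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
    'Fare': [7.25, 71.28, 7.92],
})

# One line: encode both columns and drop the first dummy of each.
one_line = pd.get_dummies(toy, columns=['Sex', 'Embarked'], drop_first=True, dtype=int)

# Step by step: encode each Series, drop its first column, then concatenate.
manual = pd.concat(
    [
        toy.drop(columns=['Sex', 'Embarked']),
        pd.get_dummies(toy['Sex'], prefix='Sex', dtype=int).iloc[:, 1:],
        pd.get_dummies(toy['Embarked'], prefix='Embarked', dtype=int).iloc[:, 1:],
    ],
    axis=1,
)

print(one_line.equals(manual))
```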