1. Basic Example of pd.get_dummies()

In [2]:
import pandas as pd

# Sample DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Grade": ["1st Class", "2nd Class", "3rd Class", "1st Class"]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
df

Original DataFrame:


Unnamed: 0,Name,Grade
0,Alice,1st Class
1,Bob,2nd Class
2,Charlie,3rd Class
3,David,1st Class


Now, apply pd.get_dummies() to encode the "Grade" column:



In [10]:
# Basic one-hot encoding
dummies = pd.get_dummies(df["Grade"], dtype=int)

print("One-hot encoded dummies:")
dummies

One-hot encoded dummies:


Unnamed: 0,1st Class,2nd Class,3rd Class
0,1,0,0
1,0,1,0
2,0,0,1
3,1,0,0


2. Using the prefix Parameter
The prefix parameter allows you to add a prefix to the column names of the dummy variables, making them more descriptive and avoiding potential naming conflicts with other columns in the DataFrame.



In [25]:
# One-hot encoding with prefix
dummies_with_prefix = pd.get_dummies(df["Grade"], prefix="Grade")

print("One-hot encoded dummies with prefix:")
dummies_with_prefix

One-hot encoded dummies with prefix:


Unnamed: 0,Grade_1st Class,Grade_2nd Class,Grade_3rd Class
0,True,False,False
1,False,True,False
2,False,False,True
3,True,False,False


Combine the dummy columns with the original DataFrame:



In [12]:
df_with_dummies = pd.concat([df, dummies_with_prefix], axis=1)

print("DataFrame with original data and one-hot encoded dummies:")
df_with_dummies

DataFrame with original data and one-hot encoded dummies:


Unnamed: 0,Name,Grade,Grade_1st Class,Grade_2nd Class,Grade_3rd Class
0,Alice,1st Class,True,False,False
1,Bob,2nd Class,False,True,False
2,Charlie,3rd Class,False,False,True
3,David,1st Class,True,False,False


3. Other Useful Options in pd.get_dummies()
Here are additional parameters and options you can use with pd.get_dummies() to customize the encoding:
a. prefix_sep
Specifies the separator between the prefix and the category name. By default, it’s an underscore (_), but you can change it.





In [None]:
# One-hot encoding with custom prefix separator
dummies_with_custom_sep = pd.get_dummies(df["Grade"], prefix="Grade", prefix_sep="-", dtype=int)

print("One-hot encoded dummies with custom separator:")
dummies_with_custom_sep

One-hot encoded dummies with custom separator:


Unnamed: 0,Grade-1st Class,Grade-2nd Class,Grade-3rd Class
0,1,0,0
1,0,1,0
2,0,0,1
3,1,0,0


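b. dummy_na
If True, adds an extra column that flags missing (NaN) values in the categorical column. With the default (False), missing values are ignored and the corresponding row is zero in every dummy column. This is useful for handling missing data.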





In [None]:
# DataFrame with a missing value
data_with_na = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Grade": ["1st Class", "2nd Class", "3rd Class", "1st Class", None]
}
df_with_na = pd.DataFrame(data_with_na)

# One-hot encoding with dummy_na=True
dummies_with_na = pd.get_dummies(df_with_na["Grade"], prefix="Grade", dummy_na=True, dtype=int)

print("One-hot encoded dummies with NaN handling:")
dummies_with_na

One-hot encoded dummies with NaN handling:


Unnamed: 0,Grade_1st Class,Grade_2nd Class,Grade_3rd Class,Grade_nan
0,1,0,0,0
1,0,1,0,0
2,0,0,1,0
3,1,0,0,0
4,0,0,0,1
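To make the effect of dummy_na easier to see, the sketch below compares the default behaviour with dummy_na=True on a small Series. The Series and the variable names are illustrative assumptions, not part of the notebook above.

```python
import pandas as pd

# Hypothetical sample: two known grades and one missing value
grades = pd.Series(["1st Class", "2nd Class", None], name="Grade")

# Default (dummy_na=False): the missing row is 0 in every dummy column
default_dummies = pd.get_dummies(grades, dtype=int)

# dummy_na=True: an extra column flags the missing row
na_dummies = pd.get_dummies(grades, dummy_na=True, dtype=int)

# Row sums show whether each row is assigned to some column
print(default_dummies.sum(axis=1).tolist())
print(na_dummies.sum(axis=1).tolist())
```

A row that sums to 0 in the default encoding is indistinguishable from "no category", which is why the extra indicator column can matter when missingness itself carries information.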


c. drop_first
If True, drops the first category (the first in sorted order for plain string columns) to avoid multicollinearity, also known as the dummy variable trap, in regression models. This reduces the number of dummy variables by one, as the first category can be inferred from the others.



In [16]:
# One-hot encoding with drop_first=True
dummies_drop_first = pd.get_dummies(df["Grade"], prefix="Grade", drop_first=True)

print("One-hot encoded dummies with drop_first=True:")
dummies_drop_first

One-hot encoded dummies with drop_first=True:


Unnamed: 0,Grade_2nd Class,Grade_3rd Class
0,False,False
1,True,False
2,False,True
3,False,False


Here, the "1st Class" category is dropped, and its presence can be inferred when both "Grade_2nd Class" and "Grade_3rd Class" are False.
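As a quick check of that inference, the sketch below rebuilds the dropped "1st Class" indicator from the remaining dummy columns. It assumes the column has no missing values (a NaN row would also be all zeros), and the variable names are illustrative.

```python
import pandas as pd

grades = pd.Series(
    ["1st Class", "2nd Class", "3rd Class", "1st Class"], name="Grade"
)

# Encode with the first category dropped
reduced = pd.get_dummies(grades, prefix="Grade", drop_first=True, dtype=int)

# A row belongs to the dropped category exactly when all remaining dummies are 0
is_first_class = (reduced.sum(axis=1) == 0).astype(int)

print(is_first_class.tolist())
```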
d. columns or Direct DataFrame Input
You can pass a whole DataFrame to pd.get_dummies(), which encodes every column with object, string, or category dtype, or use the columns parameter to choose which columns to encode.



In [None]:
# Encode all categorical columns in the DataFrame
dummies_df = pd.get_dummies(df, prefix=["Name", "Grade"], prefix_sep="_", dtype=int)

print("One-hot encoded entire DataFrame:")
dummies_df

One-hot encoded entire DataFrame:


Unnamed: 0,Name_Alice,Name_Bob,Name_Charlie,Name_David,Grade_1st Class,Grade_2nd Class,Grade_3rd Class
0,1,0,0,0,1,0,0
1,0,1,0,0,0,1,0
2,0,0,1,0,0,0,1
3,0,0,0,1,1,0,0


This encodes both "Name" and "Grade" columns, creating binary columns for each unique value.
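Encoding "Name" is rarely what you want, since it produces one column per person. A minimal sketch of the columns parameter, which restricts encoding to the listed columns and leaves the rest untouched, might look like this (the variable name encoded is an illustrative choice):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Grade": ["1st Class", "2nd Class", "3rd Class", "1st Class"],
})

# Only "Grade" is encoded; "Name" passes through unchanged.
# The column name is used as the prefix when none is given.
encoded = pd.get_dummies(df, columns=["Grade"], dtype=int)

print(encoded.columns.tolist())
```

Because the original "Grade" column is replaced by its dummies in the same call, no separate pd.concat() step is needed.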
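e. dtype
Specifies the data type of the resulting dummy variables. The default is bool in pandas 2.0 and later (earlier versions used uint8), which is why the examples without dtype show True/False; you can pass int, float, etc. instead.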





In [None]:
# One-hot encoding with specific data type
dummies_dtype = pd.get_dummies(df["Grade"], prefix="Grade", dtype=int)

print("One-hot encoded dummies with int dtype:")
dummies_dtype

One-hot encoded dummies with int dtype:


Unnamed: 0,Grade_1st Class,Grade_2nd Class,Grade_3rd Class
0,1,0,0
1,0,1,0
2,0,0,1
3,1,0,0
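Since the default dtype depends on the installed pandas version, it can help to inspect the dtypes directly. The sketch below is one way to do that; the variable names are illustrative, and the default result is printed rather than assumed.

```python
import pandas as pd

grades = pd.Series(
    ["1st Class", "2nd Class", "3rd Class", "1st Class"], name="Grade"
)

default_dummies = pd.get_dummies(grades)             # dtype depends on pandas version
int_dummies = pd.get_dummies(grades, dtype=int)      # integer 0/1 columns
float_dummies = pd.get_dummies(grades, dtype=float)  # float 0.0/1.0 columns

# Inspect the dtype of each result
print(default_dummies.dtypes.unique())
print(int_dummies.dtypes.unique())
print(float_dummies.dtypes.unique())
```

Passing dtype explicitly keeps the output stable across pandas versions, which is useful when downstream code expects numeric columns.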


4. Combining Options
You can combine multiple options for more complex encoding:



In [22]:
# Combine prefix, drop_first, and dummy_na
dummies_complex = pd.get_dummies(df["Grade"], prefix="Grade", prefix_sep="_", drop_first=True, dummy_na=True)

print("One-hot encoded dummies with multiple options:")
dummies_complex

One-hot encoded dummies with multiple options:


Unnamed: 0,Grade_2nd Class,Grade_3rd Class,Grade_nan
0,False,False,False
1,True,False,False
2,False,True,False
3,False,False,False


5. Practical Use Case
Imagine you’re preparing data for a machine learning model and want to encode a categorical column like "Grade" while avoiding multicollinearity:



In [23]:
# Sample DataFrame
data = {
    "Student": ["Alice", "Bob", "Charlie", "David"],
    "Grade": ["1st Class", "2nd Class", "3rd Class", "1st Class"],
    "Score": [85, 90, 78, 92]
}
df = pd.DataFrame(data)

# One-hot encode "Grade" with prefix, drop_first, and combine with original DataFrame
dummies = pd.get_dummies(df["Grade"], prefix="Grade", drop_first=True)
df_encoded = pd.concat([df[["Student", "Score"]], dummies], axis=1)

print("Final DataFrame with encoded Grade:")
df_encoded

Final DataFrame with encoded Grade:


Unnamed: 0,Student,Score,Grade_2nd Class,Grade_3rd Class
0,Alice,85,False,False
1,Bob,90,True,False
2,Charlie,78,False,True
3,David,92,False,False


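### Key elements
- Multicollinearity: Use drop_first=True when the dummy variables will be used in linear models with an intercept. The full set of dummies always sums to 1 in each row, so it is perfectly collinear with the intercept term.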

- Memory Usage: One-hot encoding can significantly increase the size of your DataFrame, especially with many unique categories. Consider using label encoding or target encoding for high-cardinality data.

- Integration with ML: After encoding, and once any remaining non-numeric columns (such as "Student" above) are dropped, you can pass the resulting DataFrame to libraries like Scikit-learn for training models.

- Comparison with Label Encoding: Unlike label encoding (e.g., using map()), one-hot encoding doesn’t imply ordinal relationships, making it suitable for nominal data.
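To illustrate the last two points, the sketch below puts label encoding with map() next to one-hot encoding. The integer mapping in label_map is a hypothetical choice made for this example.

```python
import pandas as pd

grades = pd.Series(
    ["1st Class", "2nd Class", "3rd Class", "1st Class"], name="Grade"
)

# Label encoding with map(): one integer column.
# The mapping is hypothetical and implies an order 0 < 1 < 2.
label_map = {"1st Class": 0, "2nd Class": 1, "3rd Class": 2}
labels = grades.map(label_map)

# One-hot encoding: one column per category, no implied order
one_hot = pd.get_dummies(grades, prefix="Grade", dtype=int)

print(labels.tolist())
print(one_hot.shape)
```

The label-encoded result stays one column wide regardless of the number of categories, while the one-hot result grows by one column per category, which is the memory trade-off noted above.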

