# Encoding Categorical Variables in Pandas

### What Is Encoding?

In machine learning and AI workflows, we can’t directly use categorical text like `'male'`, `'female'`, `'S'`, `'C'`, or `'Q'` in models — they need to be **converted into numbers**. This process is called **encoding**, and it’s an essential step before feeding categorical data into algorithms.

A good encoding turns human-readable categories into numbers while **preserving their meaning** and **without implying an order or magnitude the data does not have**. There are several strategies, and the choice depends on the number of categories and on whether the data has an inherent order.

### Pandas provides two main encoding techniques:

- `.map()` / `.replace()` – Simple label conversion for binary or manual mappings.
- `pd.get_dummies()` – For one-hot encoding (OHE), where each category becomes a new indicator column (0/1, displayed as `False`/`True` in recent pandas versions).

### Using `.map()` for Binary Categories

The `.map()` method is perfect for converting binary categorical columns into 0s and 1s.

In [1]:
import pandas as pd

df = pd.read_csv("data/train.csv")

# Convert 'Sex' to 0/1
df['Sex_mapped'] = df['Sex'].map({'male': 0, 'female': 1})

print(df[['Sex', 'Sex_mapped']].head())

      Sex  Sex_mapped
0    male           0
1  female           1
2  female           1
3  female           1
4    male           0


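This method is **clean and intuitive**: we define a dictionary that maps each original category to a numeric value. It is commonly used for binary columns such as gender, Yes/No, or True/False. Any value that is missing from the dictionary becomes `NaN`, so the dictionary should cover every category in the column.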
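The behavior of `.map()` on unlisted values is easy to check on a tiny example. The sketch below uses a made-up Series rather than the Titanic file, and the capitalised `'Male'` is a hypothetical typo added only for illustration.

```python
import pandas as pd

# Made-up sample; 'Male' (capitalised) is a deliberate typo for illustration
sex = pd.Series(['male', 'female', 'Male', 'female'], dtype=object)

sex_mapped = sex.map({'male': 0, 'female': 1})

# Values that are missing from the dictionary come back as NaN
unmapped = sex[sex_mapped.isna()]

print(sex_mapped.tolist())
print(unmapped.tolist())
```

Checking `isna()` right after mapping is a cheap way to catch typos or unexpected categories before they reach a model. Note that a single `NaN` is enough to turn the result into a float column.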

### Using `.replace()` for Manual Mapping

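`.replace()` works similarly to `.map()` but is more lenient. Values that are not listed in the dictionary are left unchanged instead of becoming `NaN`, and dictionary keys that never occur in the column are simply ignored. It is also available on whole DataFrames, so several columns can be recoded in one call.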

In [2]:
# Convert 'Embarked' values to numeric codes
df['Embarked_replaced'] = df['Embarked'].replace({'S': 0, 'C': 1, 'Q': 2}).infer_objects(copy=False)
print(df[['Embarked', 'Embarked_replaced']].head())

  Embarked  Embarked_replaced
0        S                0.0
1        C                1.0
2        S                0.0
3        S                0.0
4        S                0.0



Use `.replace()` when:

- We’re replacing several categories at once
- We want values that are not in the mapping to stay unchanged, rather than becoming `NaN`
- We don’t want `.map()`'s all-or-nothing behavior, where every unlisted value is lost
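The difference between the two methods can be seen side by side. This is a minimal sketch on made-up data; `'X'` is a hypothetical port code that is deliberately left out of the mapping.

```python
import pandas as pd

# Made-up sample; 'X' is a hypothetical port code that is not in the mapping
ports = pd.Series(['S', 'C', 'X', 'Q'], dtype=object)
codes = {'S': 0, 'C': 1, 'Q': 2}

mapped = ports.map(codes)        # 'X' becomes NaN
replaced = ports.replace(codes)  # 'X' is kept as-is

print(mapped.tolist())
print(replaced.tolist())
```

Keeping `'X'` is not automatically safer: the replaced column now mixes numbers and strings, so it stays `object` and the leftover value still has to be handled before modelling.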

### About `.infer_objects(copy=False)`

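When we use `.replace()` to convert string values to numbers, pandas may keep the column's data type as `object` even though every value now looks numeric. Older versions silently downcast the result, and recent pandas 2.x releases emit a `FutureWarning` saying that this automatic downcasting is deprecated. An `object` column can cause issues in later steps, especially with machine learning libraries that expect numeric arrays.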

To fix this, we use:

In [None]:
df['Embarked_replaced'] = df['Embarked'].replace({'S': 0, 'C': 1, 'Q': 2}).infer_objects(copy=False)

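- It tells pandas to **try to infer a better data type** for an `object` column (for example `int64` when every value is an integer, or `float64` when some values are missing, which is likely why the codes above print as `0.0` and `1.0`).
- `copy=False` does not update the data in place. The method still returns a new object; the flag only lets pandas skip an unnecessary copy when nothing needs converting, which saves memory.

This silences the downcasting warning and ensures our numeric data is stored with a numeric dtype.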
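The effect of `infer_objects()` can be checked directly on the dtype. The sketch below builds two made-up `object` columns by hand and calls the method with its defaults; the `copy` argument is left out here because it does not change the result.

```python
import numpy as np
import pandas as pd

# Object-dtype columns whose values happen to be numeric (made-up data)
whole = pd.Series([0, 1, 2], dtype=object)
with_gap = pd.Series([0, 1, np.nan], dtype=object)

print(whole.dtype)                     # still object before inference
print(whole.infer_objects().dtype)     # integer dtype after inference
print(with_gap.infer_objects().dtype)  # float dtype, because NaN is a float
```

If the column still reports `object` afterwards, at least one value is not numeric (for example a category that was missing from the mapping).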

### One-Hot Encoding with `pd.get_dummies()`

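One-hot encoding creates a new column for **each category** and marks presence with 1 (`True`) or absence with 0 (`False`). This is ideal for categorical data with no inherent order, such as `'Embarked'` or `'Sex'`. It can also be applied to `'Pclass'` when we prefer not to treat the class number as a quantity.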

In [3]:
# One-hot encode 'Embarked'
embarked_dummies = pd.get_dummies(df['Embarked'], prefix='Embarked')
df = pd.concat([df, embarked_dummies], axis=1)

print(df[['Embarked', 'Embarked_C', 'Embarked_Q', 'Embarked_S']].head())

  Embarked  Embarked_C  Embarked_Q  Embarked_S
0        S       False       False        True
1        C        True       False       False
2        S       False       False        True
3        S       False       False        True
4        S       False       False        True


This avoids introducing artificial order into categorical variables. It’s commonly used when training ML models with `scikit-learn`, `XGBoost`, etc.
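The output above shows `True`/`False` rather than 1/0. If integer indicators are preferred, `pd.get_dummies()` accepts a `dtype` argument. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Made-up sample of port codes
embarked = pd.Series(['S', 'C', 'S', 'Q'], dtype=object)

# dtype=int yields 0/1 columns instead of booleans
dummies = pd.get_dummies(embarked, prefix='Embarked', dtype=int)

print(dummies)
# Exactly one indicator is set in every row
print((dummies.sum(axis=1) == 1).all())
```

Both forms carry the same information; the integer version is simply easier to read and is accepted by libraries that do not handle boolean columns.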

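### Optional: Dropping the First Category

The dummy columns for one variable always sum to 1, so any one of them can be computed from the others. To avoid this multicollinearity (where dummy variables are linearly dependent), we can use: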

In [4]:
pd.get_dummies(df['Embarked'], prefix='Embarked', drop_first=True)

     Embarked_Q  Embarked_S
0         False        True
1         False       False
2         False        True
3         False        True
4         False        True
..          ...         ...
886       False        True
887       False        True
888       False        True
889       False       False


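This drops the first dummy column (`Embarked_C` here), making the dataset more compact. No information about the known categories is lost: a row where both remaining columns are `False` corresponds to `'C'`. Rows with a missing `Embarked` value also show `False` in every column, so missing values should be handled before relying on this.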
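To see that the dropped column is redundant, it can be rebuilt from the remaining ones. This is a sketch on made-up data without missing values; `dtype=bool` is passed explicitly so the result does not depend on the pandas version.

```python
import pandas as pd

# Made-up sample of port codes, no missing values
embarked = pd.Series(['S', 'C', 'Q', 'S'], dtype=object)

full = pd.get_dummies(embarked, prefix='Embarked', dtype=bool)
reduced = pd.get_dummies(embarked, prefix='Embarked', dtype=bool, drop_first=True)

# A row belongs to the dropped category 'C' when no remaining dummy is set
rebuilt_c = ~reduced.any(axis=1)

print(reduced.columns.tolist())
print(rebuilt_c.tolist() == full['Embarked_C'].tolist())
```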

### Encoding Multiple Columns at Once

In [5]:
df = pd.read_csv("data/train.csv")

# One-hot encode 'Sex' and 'Embarked' at once
df_encoded = pd.get_dummies(df, columns=['Sex', 'Embarked'], prefix=['Sex', 'Embarked'])

print(df_encoded.columns)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Sex_female', 'Sex_male', 'Embarked_C',
       'Embarked_Q', 'Embarked_S'],
      dtype='object')


This is helpful when we have **many categorical columns** and want to convert all of them efficiently.
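The same call can be tried on a small hand-made frame to see what happens to the other columns. The rows below are made up to mimic the columns used above, and `prefix` is omitted because `get_dummies()` falls back to the column names when `columns` is given.

```python
import pandas as pd

# Made-up rows that mimic the columns used above
sample = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0],
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'S'],
})

# Only the listed columns are encoded; 'Age' passes through unchanged
encoded = pd.get_dummies(sample, columns=['Sex', 'Embarked'], dtype=int)

print(encoded.columns.tolist())
```

The original `Sex` and `Embarked` columns are removed from the result, and only categories that actually occur in the data get a column (there is no `Embarked_Q` here).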

### Exercises

Q1. Encode the 'Sex' column as 0 for male and 1 for female using `.map()`.

In [6]:
df['Sex_encoded'] = df['Sex'].map({'male': 0, 'female': 1})
print(df[['Sex', 'Sex_encoded']].head())

      Sex  Sex_encoded
0    male            0
1  female            1
2  female            1
3  female            1
4    male            0


Q2. Use `.replace()` to encode 'Embarked' with: S=0, C=1, Q=2.

In [7]:
df['Embarked_encoded'] = df['Embarked'].replace({'S': 0, 'C': 1, 'Q': 2}).infer_objects(copy=False)
print(df[['Embarked', 'Embarked_encoded']].head())

  Embarked  Embarked_encoded
0        S               0.0
1        C               1.0
2        S               0.0
3        S               0.0
4        S               0.0



Q3. Apply `pd.get_dummies()` to the 'Pclass' column.

In [8]:
pclass_dummies = pd.get_dummies(df['Pclass'], prefix='Pclass')
df = pd.concat([df, pclass_dummies], axis=1)
print(df.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  Sex_encoded  \
0      0         A/5 21171   7.2500   NaN        S            0   
1      0          PC 17599  71.2833   C85        C            1   
2      0  STON/O2. 3101282   7.9250   NaN        S            1   
3      0            113803  53.1000  C123        S  

Q4. Use one-hot encoding on both 'Sex' and 'Embarked' columns at once.

In [9]:
df_encoded = pd.get_dummies(df, columns=['Sex', 'Embarked'], prefix=['Sex', 'Embarked'])
print(df_encoded.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name   Age  SibSp  Parch  \
0                            Braund, Mr. Owen Harris  22.0      1      0   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0      1      0   
2                             Heikkinen, Miss. Laina  26.0      0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0      1      0   
4                           Allen, Mr. William Henry  35.0      0      0   

             Ticket     Fare Cabin  Sex_encoded  Embarked_encoded  Pclass_1  \
0         A/5 21171   7.2500   NaN            0               0.0     False   
1          PC 17599  71.2833   C85            1               1.0      True   
2  STON/O2. 3101282   7.9250   NaN            1               0.0     False   
3         

Q5. Use `drop_first=True` with `get_dummies()` on 'Embarked' and explain the difference.

In [10]:
df_emb_dummies = pd.get_dummies(df['Embarked'], drop_first=True)
print(df_emb_dummies.head())

       Q      S
0  False   True
1  False  False
2  False   True
3  False   True
4  False   True


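With `drop_first=True`, the first category (`C`) is dropped, so only the `Q` and `S` columns remain, and a row with `False` in both stands for `C`. This avoids multicollinearity in ML models. Because no `prefix` was passed, the columns are named after the raw category values.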

### Summary

In this topic, we tackled **encoding categorical variables**, a vital step for any machine learning pipeline. Most machine learning models operate on numbers, not strings, so we must transform categorical text data into numeric format.

We covered:

- `.map()` for binary mapping
- `.replace()` for manual multi-value replacement that leaves unlisted values unchanged
- `pd.get_dummies()` for one-hot encoding, which suits unordered categories with a modest number of distinct values
- Encoding multiple columns efficiently

These encoding techniques allow categorical data to be included in model training, prediction, and evaluation. As we move forward, we’ll use these encoded variables during **feature selection, scaling, and model fitting**. Without encoding, categorical variables would have to be dropped or would cause errors in most libraries. With a suitable encoding, they can contribute useful signal to a model.