# Exploring Categorical Columns in Pandas

### What Are Categorical Columns?

Categorical columns represent **discrete, qualitative information**, such as gender, port of embarkation, ticket class, or cabin labels. Their values act as labels rather than quantities, even when they are stored as numbers (as `'Pclass'` is), but they contain powerful patterns. In the Titanic dataset, columns like `'Sex'`, `'Embarked'`, `'Pclass'`, and even `'Cabin'` are categorical.

Before we can visualize or encode them for ML, we must **explore and understand** them. That includes checking how many **unique values** they contain, how **frequently** each category appears, and whether any values are **unexpected or missing**.

Using methods like `.unique()`, `.nunique()`, and `.value_counts()`, we can get a clear picture of each column's diversity and distribution. This helps us make smart decisions about **cleaning**, **encoding**, or even **dropping** a column. For example, a column with hundreds of unique values (like `'Name'` with 891, `'Ticket'` with 681, or `'Cabin'` with 147) might be too noisy to encode directly. But `'Sex'` and `'Embarked'`, which have only a few categories, are well suited to modeling.

### `.unique()`: Finding All Unique Values

The `.unique()` method shows us **all distinct values** in a column. This helps us verify whether a column is truly categorical, and spot typos or anomalies.

In [1]:
import pandas as pd

df = pd.read_csv("data/train.csv")

# Unique values in 'Sex' column
print(df['Sex'].unique())

# Unique values in 'Embarked' column
print(df['Embarked'].unique())

['male' 'female']
['S' 'C' 'Q' nan]


From this, we know that `'Sex'` has two valid categories, and `'Embarked'` has three — but also contains missing values (`nan`).
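The Titanic columns above happen to be clean, so here is a minimal sketch of how `.unique()` can expose typos. The toy `ports` Series and its messy labels are hypothetical and only serve as an illustration; they are not part of the Titanic data.

```python
import pandas as pd

# Hypothetical toy column with inconsistent labels (not the Titanic data)
ports = pd.Series(["S", "s", "C", " Q", "Q", None, "S"], name="Embarked")

# The raw unique values expose the lowercase and space-padded entries
raw_values = ports.unique()
print(raw_values)

# Strip whitespace and upper-case the labels, then check again
cleaned = ports.str.strip().str.upper()
clean_values = cleaned.dropna().unique()
print(clean_values)
```

After cleaning, only the three intended categories remain, while the missing entry is still missing and can be handled separately.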

### `.nunique()`: Count of Unique Values

To check how many **distinct categories** exist in a column, we use `.nunique()`. This is useful for quick diagnostics when scanning multiple columns.

In [2]:
# How many unique values per column?
print(df.nunique())

PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64


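This gives us a numeric summary that is useful for identifying ID columns (`'PassengerId'` and `'Name'` are unique in all 891 rows), detecting potential categorical fields, or spotting high-cardinality issues. Note that `.nunique()` ignores missing values by default, which is why `'Embarked'` shows 3 even though it also contains `nan`.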
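As a rough sketch of these diagnostics, the snippet below compares `.nunique()` with and without missing values and computes the share of distinct values per column. The `toy` frame is a hypothetical stand-in, and reading a ratio near 1.0 as "ID-like" is a rule of thumb assumed here, not a fixed threshold.

```python
import pandas as pd

# Hypothetical toy frame standing in for a real dataset
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Embarked": ["S", "C", None, "S", "Q", "S"],
})

# Default behaviour: missing values are not counted as a category
without_nan = toy.nunique()

# dropna=False counts NaN as one extra distinct value
with_nan = toy.nunique(dropna=False)

# Share of rows that are distinct; values near 1.0 suggest an ID-like column
cardinality_ratio = without_nan / len(toy)

print(pd.DataFrame({
    "nunique": without_nan,
    "nunique_with_nan": with_nan,
    "ratio": cardinality_ratio.round(2),
}))
```

On a real dataset the same three columns can be computed from `df` directly, and the ratio gives a quick way to sort columns from low to high cardinality.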

### `.value_counts()`: Frequency of Each Category

The `.value_counts()` method is one of our best tools for **analyzing category distribution**. It shows how often each unique value appears, either as raw counts (the default) or as proportions (with `normalize=True`). Missing values are left out unless we pass `dropna=False`.

In [3]:
# Count values in 'Sex'
print(df['Sex'].value_counts())

# Count values in 'Embarked' including missing
print(df['Embarked'].value_counts(dropna=False))

Sex
male      577
female    314
Name: count, dtype: int64
Embarked
S      644
C      168
Q       77
NaN      2
Name: count, dtype: int64


This shows whether a column is balanced or skewed, which matters for modeling and fairness in ML. Here `'Sex'` leans male (577 vs. 314), and `'S'` accounts for 644 of the 891 `'Embarked'` entries.
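To read the same distribution as proportions, `.value_counts()` accepts `normalize=True`. The sketch below rebuilds the `'Sex'` column from the counts printed above (577 and 314) so that it runs without the CSV file; with the real data, the call would be made on `df['Sex']`.

```python
import pandas as pd

# Rebuild a column with the same counts as the 'Sex' output above
sex = pd.Series(["male"] * 577 + ["female"] * 314, name="Sex")

# normalize=True returns each category's share of the non-missing rows
shares = sex.value_counts(normalize=True)
print(shares.round(3))
```

Proportions are often easier to compare across columns or datasets of different sizes than raw counts.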

### Exploring Multiple Categorical Columns at Once

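To focus on the columns pandas stores as text (the `object` dtype here), we can use `.select_dtypes()`. Keep in mind that this picks up every string column, including free-text fields like `'Name'`, and skips categorical columns stored as numbers, such as `'Pclass'`.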

In [4]:
# Select only object (string-like) columns
cat_cols = df.select_dtypes(include='object').columns

# Loop through and display value counts
for col in cat_cols:
    print(f"\nColumn: {col}")
    print(df[col].value_counts(dropna=False))


Column: Name
Name
Braund, Mr. Owen Harris                                1
Cumings, Mrs. John Bradley (Florence Briggs Thayer)    1
Heikkinen, Miss. Laina                                 1
Futrelle, Mrs. Jacques Heath (Lily May Peel)           1
Allen, Mr. William Henry                               1
                                                      ..
Montvila, Rev. Juozas                                  1
Graham, Miss. Margaret Edith                           1
Johnston, Miss. Catherine Helen "Carrie"               1
Behr, Mr. Karl Howell                                  1
Dooley, Mr. Patrick                                    1
Name: count, Length: 891, dtype: int64

Column: Sex
Sex
male      577
female    314
Name: count, dtype: int64

Column: Ticket
Ticket
347082              7
1601                7
CA. 2343            7
3101295             6
CA 2144             6
                   ..
PC 17590            1
17463               1
330877              1
373450              1
Name: count, Length: 681, dtype: int64

[output truncated]

This gives us an overview of every object-type column at once, which is a useful check before one-hot encoding or label encoding in future stages.
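Selecting by dtype misses categories that are stored as numbers, such as `'Pclass'`. One possible way to widen the selection is sketched below: it combines the non-numeric columns with numeric columns that have only a few distinct values. The `toy` frame and the `MAX_CATEGORIES` cutoff are assumptions made for illustration, and a sensible cutoff depends on the dataset.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the Titanic data
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2, 1, 3],
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Fare": [7.25, 71.28, 7.92, 13.0, 53.1, 8.05],
})

# Assumed cutoff for "few enough distinct values to treat as categories"
MAX_CATEGORIES = 3

# Columns that pandas does not store as numbers (strings, categoricals, ...)
text_like = toy.select_dtypes(exclude="number").columns.tolist()

# Numeric columns with very few distinct values, such as Pclass
numeric = toy.select_dtypes(include="number")
low_card_numeric = [
    col for col in numeric.columns
    if numeric[col].nunique() <= MAX_CATEGORIES
]

candidate_cols = text_like + low_card_numeric
print(candidate_cols)
```

The result is only a list of candidates. Columns such as `'Survived'` or `'SibSp'` would also pass a low-cardinality test on the real data, so each one still needs a quick manual check.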

### Exercises

Q1. Show all unique values in the `'Pclass'` column.

In [5]:
print(df['Pclass'].unique())

[3 1 2]


Q2. Count how many unique values are in the `'Cabin'` column.

In [6]:
print(df['Cabin'].nunique())

147


Q3. Display the frequency of values in the `'Embarked'` column.

In [7]:
print(df['Embarked'].value_counts(dropna=False))

Embarked
S      644
C      168
Q       77
NaN      2
Name: count, dtype: int64


Q4. Print how many unique values exist in each column (hint: use `.nunique()`).

In [8]:
print(df.nunique())

PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64


Q5. Print the top 5 most common ticket values.

In [9]:
print(df['Ticket'].value_counts().head(5))

Ticket
347082      7
1601        7
CA. 2343    7
3101295     6
CA 2144     6
Name: count, dtype: int64


Q6. Loop through all object-type columns and print value counts for each.

In [10]:
for col in df.select_dtypes(include='object').columns:
    print(f"\nColumn: {col}")
    print(df[col].value_counts(dropna=False))


Column: Name
Name
Braund, Mr. Owen Harris                                1
Cumings, Mrs. John Bradley (Florence Briggs Thayer)    1
Heikkinen, Miss. Laina                                 1
Futrelle, Mrs. Jacques Heath (Lily May Peel)           1
Allen, Mr. William Henry                               1
                                                      ..
Montvila, Rev. Juozas                                  1
Graham, Miss. Margaret Edith                           1
Johnston, Miss. Catherine Helen "Carrie"               1
Behr, Mr. Karl Howell                                  1
Dooley, Mr. Patrick                                    1
Name: count, Length: 891, dtype: int64

Column: Sex
Sex
male      577
female    314
Name: count, dtype: int64

Column: Ticket
Ticket
347082              7
1601                7
CA. 2343            7
3101295             6
CA 2144             6
                   ..
PC 17590            1
17463               1
330877              1
373450              1
Name: count, Length: 681, dtype: int64

[output truncated]

### Summary

In this topic, we explored a core part of EDA: understanding categorical columns. These fields may not be numeric, but they often hold critical insight, whether it's someone's gender, travel class, or embarkation point.

We learned to use:

- `.unique()` to view all distinct values,
- `.nunique()` to count how many categories exist,
- `.value_counts()` to measure the frequency of each category.

We also explored how to isolate all object-type columns for deep inspection. These tools help us spot **imbalances**, **data errors**, **encoding opportunities**, and **potential model bias**. Categorical features can be highly informative for a model, but only if we explore and prepare them correctly.

As we move forward into **encoding techniques**, **visualizations**, and **feature engineering**, this exploration becomes the foundation for everything else. Knowing each column's categories, frequencies, and missing values makes those later steps far less error-prone.