### What Are Missing Values?

In real-world datasets like Titanic, it's very common to come across **missing values** — places where data is either not recorded, corrupted, or simply unknown. For example, a passenger might not have listed their age, cabin, or port of embarkation. These missing entries appear in Pandas as **NaN** (Not a Number) or sometimes as **None**. If we don’t handle them properly, they can break our calculations, corrupt our models, or give misleading results.
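
As a minimal sketch of how both markers are detected, the snippet below uses two small hypothetical Series (`ages` and `ports` are illustration names, not columns loaded from the Titanic file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each Series contains one NaN and one None
ages = pd.Series([22.0, np.nan, 38.0, None])
ports = pd.Series(["S", None, "C", np.nan], dtype="object")

# In a numeric Series, None is converted to NaN when the Series is built
print(ages.dtype)
print(ages.isnull().tolist())

# In an object Series, both None and NaN are reported as missing
print(ports.isnull().tolist())
```

Either way, `.isnull()` treats the entry as missing, so the detection code in the rest of this topic does not need to distinguish between the two.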

Pandas provides `.isnull()`, `.notnull()`, `.dropna()`, and `.fillna()` to detect and fix these gaps. Which one we reach for depends on why the data is missing, how much of it is missing, and how important the column is, as the following sections explain.

By handling missing values deliberately, we improve the **quality** and **reliability** of our analysis and models. We also avoid surprises downstream: pandas aggregations such as `.mean()` silently skip NaN by default, while many model-training libraries raise an error when they receive NaN. In short, **the cleaner our data, the more trustworthy our machine learning becomes**.

### Types of Missing Data

Understanding the reason why data is missing helps us choose the right strategy:

| Type | Meaning | Example |
| --- | --- | --- |
| **MCAR (Missing Completely at Random)** | Missingness is unrelated to any value, observed or not | A clerk accidentally skipped the age field for some passengers |
| **MAR (Missing at Random)** | Missingness depends on *other* observed columns | Cabin is missing more often for 3rd class than for 1st class |
| **MNAR (Missing Not at Random)** | Missingness depends on the *missing value itself* | Older passengers are less likely to report their age |
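
One rough way to probe for the MAR pattern in the table is to compare the missing rate of one column across the groups of another. The sketch below assumes a small hypothetical frame (`toy`) standing in for the Titanic `Pclass` and `Cabin` columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Cabin": ["C85", "C123", np.nan, np.nan, np.nan, "G6"],
})

# Share of missing Cabin values within each passenger class
missing_rate_by_class = toy["Cabin"].isnull().groupby(toy["Pclass"]).mean()
print(missing_rate_by_class)
```

A clear difference between groups suggests the data is not MCAR. This check cannot reveal MNAR, because that pattern depends on the values we never observed.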

### Why Missing Value Handling Matters

Handling missing values is one of the **most important data cleaning tasks** in any AI/ML pipeline. Depending on the situation, we might:

- Remove them (drop)
- Fill them (impute)
- Replace them (with default or logic)

The strategy depends on:

- The data type (numerical or categorical)
- The percentage of missing data
- The importance of the column
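
The first two factors can be read off a DataFrame directly. The following is a minimal sketch on a hypothetical frame (`toy` and `summary` are illustration names); judging the importance of a column still requires domain knowledge:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with different amounts of missingness per column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# One row per column: data type, missing count, and missing percentage
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
}).sort_values("pct_missing", ascending=False)
print(summary)
```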

### Detecting Missing Values

To find out which columns contain missing data, we use `.isnull().sum()`. This shows us the **total number of null values** in each column. It’s our go-to starting point for diagnosing incomplete data.

In [1]:
import pandas as pd

df = pd.read_csv("data/train.csv")
print(df.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


We can also use `.info()` (from the previous topic) to cross-check how many non-null entries each column has.

### Removing Missing Values with `.dropna()`

If a column or row has too many missing values, or if the missing data is not useful, we can remove it using `.dropna()`. We can drop **rows** with missing data:

In [2]:
# Drop rows with any missing values
cleaned_df = df.dropna()
print(cleaned_df)

     PassengerId  Survived  Pclass  \
1              2         1       1   
3              4         1       1   
6              7         0       1   
10            11         1       3   
11            12         1       1   
..           ...       ...     ...   
871          872         1       1   
872          873         0       1   
879          880         1       1   
887          888         1       1   
889          890         1       1   

                                                  Name     Sex   Age  SibSp  \
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
6                              McCarthy, Mr. Timothy J    male  54.0      0   
10                     Sandstrom, Miss. Marguerite Rut  female   4.0      1   
11                            Bonnell, Miss. Elizabeth  female  58.0      0   
..                                                 ...     ...   ... 

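Or drop **columns** that contain any missing values by passing `axis=1`. To drop only columns that are mostly empty, `.dropna()` also accepts a `thresh` argument (the minimum number of non-null values a column needs in order to be kept):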

In [3]:
# Drop columns with any missing values
df_dropped_cols = df.dropna(axis=1)
print(df_dropped_cols.columns)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'SibSp', 'Parch',
       'Ticket', 'Fare'],
      dtype='object')


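**Warning:** `.dropna()` returns a new DataFrame and leaves `df` untouched by default. The removal only becomes irreversible if we overwrite the original variable or pass `inplace=True`, in which case we must re-load the data to get it back. Either way, we should check `.shape` before and after to verify the impact.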
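
A more selective drop is often enough. The sketch below, on a hypothetical frame (`toy`), uses the `subset` and `thresh` parameters of `.dropna()` and compares shapes before and after:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Cabin is mostly empty, Age and Embarked have one gap each
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Drop a row only when a chosen column is missing
rows_kept = toy.dropna(subset=["Embarked"])

# Keep only columns with at least 3 non-null values (this drops Cabin)
cols_kept = toy.dropna(axis=1, thresh=3)

# The original frame is unchanged because dropna() returned new objects
print(toy.shape, rows_kept.shape, cols_kept.shape)
```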

### Filling Missing Values with `.fillna()`

Instead of removing rows, we can **fill** missing values using `.fillna()`. We can use static values or statistical ones like mean/median.

In [4]:
# Fill missing ages with the mean age
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df['Age'].isnull().sum())

# Fill missing embarked values with the most common value
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
print(df['Embarked'].isnull().sum())

0
0


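If the data is ordered (for example, a time series), we can also propagate the previous value forward with `.ffill()` or the next value backward with `.bfill()`. The older spelling `fillna(method='ffill')` has been deprecated since pandas 2.1. The Titanic rows have no meaningful order, so the cell below only illustrates the mechanics. Forward fill cannot fill a missing value that has no earlier non-null value, which is why one `Cabin` entry remains.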

In [5]:
df = df.ffill()  # forward fill (replaces the deprecated fillna(method='ffill'))
print(df.isnull().sum())

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          1
Embarked       0
dtype: int64
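
The single remaining `Cabin` value in the output above is a leading gap that forward fill cannot reach. A minimal sketch on a hypothetical daily series (`readings` is an illustration name) shows the behaviour of both directions:

```python
import numpy as np
import pandas as pd

# Hypothetical time-ordered readings with a leading gap and an interior gap
readings = pd.Series(
    [np.nan, 10.0, np.nan, np.nan, 13.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

forward = readings.ffill()        # copy the last seen value forward
backward = readings.bfill()       # copy the next seen value backward
both = readings.ffill().bfill()   # forward fill, then patch the leading gap

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```

Chaining `.ffill()` and `.bfill()` is one way to close a leading gap, but it only makes sense when neighbouring rows are genuinely related.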




### When to Drop vs Fill?

| Scenario | Strategy |
| --- | --- |
| Missing < 5% | Usually safe to fill |
| Missing > 30% | Consider dropping column |
| Critical feature column | Keep the column and impute; if the target itself is missing, dropping those rows is usually safer than imputing labels |
| Random pattern (MCAR) | Fill with mean/median/mode |
| Systematic pattern (MAR) | Fill based on grouped averages |

These thresholds are rules of thumb, not fixed limits. Always weigh the trade-off between **data quality and data quantity**.
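
For the grouped-average strategy in the table, one possible sketch is shown below. It assumes a hypothetical frame (`toy`) in which `Age` is filled with the median age of each passenger class; the `Age_filled` column name is only for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing Age in each passenger class
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, np.nan, 50.0, 20.0, 24.0, np.nan],
})

# Median age per class, aligned row by row with the original frame
group_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill each missing Age with the median of that passenger's own class
toy["Age_filled"] = toy["Age"].fillna(group_median)
print(toy)
```

Writing the result to a new column keeps the original values available for comparison.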

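### Exercises

Note: the outputs below were produced on the `df` that was already filled in the cells above, so most counts are 0. To reproduce the original counts, re-load the CSV with `pd.read_csv` before running each exercise.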

Q1. Count missing values in each column.

In [6]:
print(df.isnull().sum())

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          1
Embarked       0
dtype: int64


Q2. Drop all rows with missing values and print the new shape.

In [7]:
df_dropped = df.dropna()
print(df_dropped.shape)

(890, 12)


Q3. Drop all columns that contain missing values.

In [8]:
df_no_missing_cols = df.dropna(axis=1)
print(df_no_missing_cols.columns)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Embarked'],
      dtype='object')


Q4. Fill missing "Age" values with the median age.

In [9]:
df['Age'] = df['Age'].fillna(df['Age'].median())
print(df['Age'].isnull().sum())

0


Q5. Fill missing "Embarked" values with the most frequent value.

In [10]:
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
print(df['Embarked'].isnull().sum())

0


Q6. Forward-fill all missing values in the dataset.

In [11]:
df = df.ffill()  # replaces the deprecated fillna(method='ffill')
print(df.isnull().sum())

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          1
Embarked       0
dtype: int64




Q7. Calculate the percentage of missing values in each column.

In [12]:
print(df.isnull().mean() * 100)

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.000000
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.112233
Embarked       0.000000
dtype: float64


### Summary

In this topic, we tackled a major challenge in data science — **missing values**. These are common in real-world datasets and must be handled carefully to maintain the integrity of our models. We learned how to **detect missing values** using `.isnull().sum()` and `.info()`. Then we explored how to **remove missing data** with `.dropna()` and how to **fill or replace** missing entries using `.fillna()` with statistics like mean, median, mode, or propagation techniques (`ffill`, `bfill`).

Each method has pros and cons. Removing data is simple but risky if we lose too much information. Filling helps preserve rows but can introduce bias if done carelessly. As we work with more complex datasets, we’ll often combine these strategies — dropping where safe, filling where necessary.

This topic is **essential preparation** for the next stage: **feature encoding, scaling, and model training**. Without clean data, even powerful algorithms can produce unreliable results. By learning how to handle missing values early and confidently, we set ourselves up for success in every AI/ML project we build.