# **Handling Missing Values in Pandas**


In [305]:
import pandas as pd
import numpy as np


---



## **Table of Contents**

1. Introduction to Missing Values
2. Detecting Missing Values
   - Using `isnull()` and `isna()`
   - Using `notnull()` and `notna()`
   - Counting Missing Values
3. Handling Missing Values
   - Dropping Missing Values
   - Filling Missing Values
     - Using `fillna()`
     - Forward Fill (`ffill`)
     - Backward Fill (`bfill`)
   - Replacing Missing Values with `replace()`
   - Interpolation

---


## **1. Introduction to Missing Values**


Missing values are common in datasets; typical causes are data entry errors, incomplete data collection, and data corruption. They need deliberate handling because they can bias summary statistics, and many machine learning models cannot accept them as input.

In Pandas:

- In **numeric data**, missing values are represented as `NaN` (Not a Number).
- In **object (string) data**, missing values can be `None` or `NaN`.
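
As a minimal sketch of the representations listed above, the snippet below builds two small Series (`numeric` and `dates` are names made up for this illustration) and checks which entries pandas treats as missing. It also shows `NaT`, the marker pandas uses for missing datetimes, which the list above does not cover.

```python
import numpy as np
import pandas as pd

# None is coerced to NaN when the Series holds floats
numeric = pd.Series([1.0, np.nan, None])

# A missing datetime is stored as NaT (Not a Time)
dates = pd.Series(pd.to_datetime(["2025-01-01", None]))

# isna() recognises NaN, None, and NaT alike
numeric_missing = numeric.isna().tolist()
dates_missing = dates.isna().tolist()

print(numeric.dtype, numeric_missing)
print(dates_missing)
```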

---


## **2. Detecting Missing Values**


### **2.1: Using `isnull()` and `isna()`**

Pandas provides two equivalent methods to detect missing values:

- `isnull()`: Returns `True` for missing values.
- `isna()`: Behaves identically; the pandas documentation lists `isnull()` as an alias of `isna()`.


In [306]:
# Sample DataFrame with missing values
data = {
    "Name": ["Alice", None, "Charlie", None, "Eve", np.nan],
    "Age": [25, 30, 35, 40, 20, 28],
    "City": ["New York", "Los Angeles", None, "Houston", "Chicago", np.nan],
    "Salary": [None, 60000, np.nan, 45000, 52000, None],
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,City,Salary
0,Alice,25,New York,
1,,30,Los Angeles,60000.0
2,Charlie,35,,
3,,40,Houston,45000.0
4,Eve,20,Chicago,52000.0
5,,28,,


In [307]:
# Detect missing values (both calls return the same Boolean mask)
display(df.isna())
display(df.isnull())

Unnamed: 0,Name,Age,City,Salary
0,False,False,False,True
1,True,False,False,False
2,False,False,True,True
3,True,False,False,False
4,False,False,False,False
5,True,False,True,True


Unnamed: 0,Name,Age,City,Salary
0,False,False,False,True
1,True,False,False,False
2,False,False,True,True
3,True,False,False,False
4,False,False,False,False
5,True,False,True,True



### **2.2: Using `notnull()` and `notna()`**

To detect non-missing values:

- `notnull()`: Returns `True` for non-missing values.
- `notna()`: Behaves identically; the pandas documentation lists `notnull()` as an alias of `notna()`.

In [308]:
# Detect non-missing values
# df.notna()
df.notnull()

Unnamed: 0,Name,Age,City,Salary
0,True,True,True,False
1,False,True,True,True
2,True,True,False,False
3,False,True,True,True
4,True,True,True,True
5,False,True,False,False




---


### **2.3: Counting Missing Values**

In [309]:
# Count missing values in each column
df.isnull().sum()

Name      3
Age       0
City      2
Salary    3
dtype: int64

In [310]:
# Percentage of missing values in each column
df.isnull().sum() / df.shape[0] * 100

Name      50.000000
Age        0.000000
City      33.333333
Salary    50.000000
dtype: float64
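
The per-column counts above can be extended to other summaries. The sketch below uses a small hypothetical frame, `demo`, to count all missing cells, count rows with at least one missing value, and compute the percentage with `mean()`, which gives the same result as dividing the sum by the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for this illustration
demo = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, 6.0]})

# Total number of missing cells in the whole frame
total_missing = int(demo.isna().sum().sum())

# Number of rows that contain at least one missing value
rows_with_missing = int(demo.isna().any(axis=1).sum())

# Mean of a Boolean mask is the fraction of True values
pct_missing = demo.isna().mean() * 100

print(total_missing, rows_with_missing)
print(pct_missing)
```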


---



## **3. Handling Missing Values**



### **3.1: Dropping Missing Values**

```python
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
```

In [311]:
df

Unnamed: 0,Name,Age,City,Salary
0,Alice,25,New York,
1,,30,Los Angeles,60000.0
2,Charlie,35,,
3,,40,Houston,45000.0
4,Eve,20,Chicago,52000.0
5,,28,,


In [312]:
# Drop rows with any missing values (explicit form)
df.dropna(how="any", axis=0)
# Equivalent shorthand: how="any" and axis=0 are the default behaviour
df.dropna()

Unnamed: 0,Name,Age,City,Salary
4,Eve,20,Chicago,52000.0


In [313]:
# Drop columns with any missing values (explicit form)
df.dropna(how="any", axis=1)
# Equivalent shorthand: how="any" is the default behaviour
df.dropna(axis=1)

Unnamed: 0,Age
0,25
1,30
2,35
3,40
4,20
5,28


---

In [314]:
# Create a DataFrame with missing values
temp_df = pd.DataFrame(
    {
        "A": [np.nan, 2, None, 4],
        "B": [None, 6, 7, 8],
        "C": [np.nan, np.nan, None, None],
    }
)
temp_df

Unnamed: 0,A,B,C
0,,,
1,2.0,6.0,
2,,7.0,
3,4.0,8.0,


In [315]:
# Drop rows containing all missing values
temp_df.dropna(axis=0, how="all")

Unnamed: 0,A,B,C
1,2.0,6.0,
2,,7.0,
3,4.0,8.0,


In [316]:
# Keep only rows with at least 2 non-missing values; drop the rest
temp_df.dropna(thresh=2)

Unnamed: 0,A,B,C
1,2.0,6.0,
3,4.0,8.0,


In [317]:
# Drop columns containing all missing values
temp_df.dropna(axis=1, how="all")

Unnamed: 0,A,B
0,,
1,2.0,6.0
2,,7.0
3,4.0,8.0
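
The `subset` parameter from the `dropna()` signature is not exercised in the cells above. As a rough sketch, assuming a hypothetical frame `demo` with made-up column names, it restricts the missing-value check to the listed columns:

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for this illustration
demo = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "score": [9.5, np.nan, 7.0, np.nan],
        "note": [None, "ok", None, None],
    }
)

# Only "score" is checked; missing values in "note" do not cause a drop
kept = demo.dropna(subset=["score"])
kept_ids = kept["id"].tolist()

print(kept)
```

Rows 1 and 3 (by `id`) survive even though their `note` is missing, because `note` is not part of `subset`.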


---



### **3.2: Filling Missing Values**


In [318]:
df

Unnamed: 0,Name,Age,City,Salary
0,Alice,25,New York,
1,,30,Los Angeles,60000.0
2,Charlie,35,,
3,,40,Houston,45000.0
4,Eve,20,Chicago,52000.0
5,,28,,



#### **3.2.1: Using `fillna()`**
You can fill missing values with a specific value or a computed value (mean, median, etc.).

In [319]:
# Fill every missing value in the DataFrame with one placeholder
# (Salary then mixes strings and numbers, so it is no longer a float column)
df.fillna("Unknown")

Unnamed: 0,Name,Age,City,Salary
0,Alice,25,New York,Unknown
1,Unknown,30,Los Angeles,60000.0
2,Charlie,35,Unknown,Unknown
3,Unknown,40,Houston,45000.0
4,Eve,20,Chicago,52000.0
5,Unknown,28,Unknown,Unknown


In [320]:
# Fill missing values in a single column
df["Name"].fillna("Unknown")

0      Alice
1    Unknown
2    Charlie
3    Unknown
4        Eve
5    Unknown
Name: Name, dtype: object

In [321]:
# Fill missing values in a column with its mean
df["Salary"].fillna(df["Salary"].mean())

0    52333.333333
1    60000.000000
2    52333.333333
3    45000.000000
4    52000.000000
5    52333.333333
Name: Salary, dtype: float64
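
When different columns need different fill values, `fillna()` also accepts a dictionary keyed by column name. The sketch below assumes a hypothetical frame `demo`; it fills the text column with a placeholder and the numeric column with its median, so the numeric column keeps its float dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for this illustration
demo = pd.DataFrame(
    {
        "city": ["Paris", None, "Rome"],
        "salary": [50000.0, np.nan, 70000.0],
    }
)

# One fill value per column; median() skips missing values by default
fill_values = {"city": "Unknown", "salary": demo["salary"].median()}
filled = demo.fillna(fill_values)

print(filled)
```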

#### **3.2.2: Forward Fill (`ffill`)**

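Propagate the last valid observation forward to fill missing values. A missing value at the start of a column has no earlier observation, so it stays missing (see `Salary` in row 0 of the output).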

In [322]:
# Forward fill missing values
df.ffill()

Unnamed: 0,Name,Age,City,Salary
0,Alice,25,New York,
1,Alice,30,Los Angeles,60000.0
2,Charlie,35,Los Angeles,60000.0
3,Charlie,40,Houston,45000.0
4,Eve,20,Chicago,52000.0
5,Eve,28,Chicago,52000.0
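
Forward filling can carry a stale value across a long gap. One way to cap this is the `limit` parameter of `ffill()`, sketched here on a hypothetical Series `s`:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a gap of three consecutive missing values
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Fill at most one consecutive missing value after each valid observation
limited = s.ffill(limit=1)

print(limited)
```

Only the first value of the gap is filled; the remaining two stay missing and can be handled by another strategy.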


#### **3.2.3: Backward Fill (`bfill`)**

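Use the next valid observation to fill missing values. A missing value at the end of a column has no later observation, so it stays missing (see row 5 of the output).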

In [323]:
# Backward fill missing values
df.bfill()

Unnamed: 0,Name,Age,City,Salary
0,Alice,25,New York,60000.0
1,Charlie,30,Los Angeles,60000.0
2,Charlie,35,Houston,45000.0
3,Eve,40,Houston,45000.0
4,Eve,20,Chicago,52000.0
5,,28,,
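
If no gaps should remain at either end, one option is to chain the two methods. This is a sketch on a hypothetical Series `s`, not a recommendation for every dataset, since it copies values in both directions:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with missing values at both ends
s = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan])

# Backward fill alone leaves the trailing missing value in place
only_bfill = s.bfill()

# A following forward fill closes the trailing gap with the last valid value
both = s.bfill().ffill()

print(only_bfill.tolist())
print(both.tolist())
```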


---


### **3.3: Replacing Missing Values with `replace()`**


In [324]:
# Replace missing values with a specific value
df.replace({np.nan: "Missing", None: "Unknown"})

Unnamed: 0,Name,Age,City,Salary
0,Alice,25,New York,Missing
1,Unknown,30,Los Angeles,60000.0
2,Charlie,35,Unknown,Missing
3,Unknown,40,Houston,45000.0
4,Eve,20,Chicago,52000.0
5,Unknown,28,Unknown,Missing


In [325]:
# Replace multiple values
df.replace([np.nan, None], "No Data")

Unnamed: 0,Name,Age,City,Salary
0,Alice,25,New York,No Data
1,No Data,30,Los Angeles,60000.0
2,Charlie,35,No Data,No Data
3,No Data,40,Houston,45000.0
4,Eve,20,Chicago,52000.0
5,No Data,28,No Data,No Data


In [326]:
# Replace NaN with 0 in the Salary column only (dictionary keyed by column name)
df.replace({"Salary": np.nan}, 0)

Unnamed: 0,Name,Age,City,Salary
0,Alice,25,New York,0.0
1,,30,Los Angeles,60000.0
2,Charlie,35,,0.0
3,,40,Houston,45000.0
4,Eve,20,Chicago,52000.0
5,,28,,0.0


In [None]:
# Replace NaN values in specific columns with different values
df.replace({"Salary": np.nan, "City": np.nan}, {"Salary": 0, "City": "New York"})

Unnamed: 0,Name,Age,City,Salary
0,Alice,25,New York,0.0
1,,30,Los Angeles,60000.0
2,Charlie,35,New York,0.0
3,,40,Houston,45000.0
4,Eve,20,Chicago,52000.0
5,,28,New York,0.0
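
`replace()` is also useful in the opposite direction: turning placeholder codes into real missing values so that the detection methods from section 2 can see them. The sketch below assumes a hypothetical frame `raw` in which `-999.0` was used as a "no reading" code.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor data where -999.0 stands for "no reading"
raw = pd.DataFrame(
    {
        "temp": [21.5, -999.0, 19.0],
        "wind": [-999.0, 5.0, 7.0],
    }
)

# Convert the placeholder code to NaN so isna() and fillna() can work with it
cleaned = raw.replace(-999.0, np.nan)
missing_counts = cleaned.isna().sum()

print(cleaned)
print(missing_counts)
```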


---

### **3.4: Interpolation**
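You can estimate missing values from neighbouring observations using `interpolate()`, which is especially useful for ordered data such as time series. The default method is linear and treats the rows as equally spaced.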

In [328]:
# Create a DataFrame with time series data
data = {
    "Date": pd.date_range(start="2025-01-01", periods=5),
    "Value": [1, np.nan, np.nan, 4, 5],
}

df_time = pd.DataFrame(data)
df_time

Unnamed: 0,Date,Value
0,2025-01-01,1.0
1,2025-01-02,
2,2025-01-03,
3,2025-01-04,4.0
4,2025-01-05,5.0


In [329]:
# Interpolate missing values
df_time["Value"] = df_time["Value"].interpolate()
df_time

Unnamed: 0,Date,Value
0,2025-01-01,1.0
1,2025-01-02,2.0
2,2025-01-03,3.0
3,2025-01-04,4.0
4,2025-01-05,5.0
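
The dates above are evenly spaced, so treating rows as equally spaced gives the expected result. When timestamps are irregular, `method="time"` weights the estimate by elapsed time instead; it requires a `DatetimeIndex`. A minimal sketch on a hypothetical Series `s`:

```python
import numpy as np
import pandas as pd

# Hypothetical series with unevenly spaced dates (gaps of 1 day and 3 days)
idx = pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-05"])
s = pd.Series([0.0, np.nan, 4.0], index=idx)

# Default linear method ignores the index: the gap is filled with the midpoint
linear = s.interpolate()

# Time-based method uses the index: Jan 2 is one quarter of the way to Jan 5
by_time = s.interpolate(method="time")

print(linear.iloc[1], by_time.iloc[1])
```

Here the default method yields 2.0 for the missing entry, while the time-based method yields 1.0.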


---