
# Handling Missing Data in Pandas





## Learning Objectives
- Understand how Pandas represents missing data (`NaN`, `pd.NA`)
- Identify common sources of missing data (load, merge, reindex, manual entry)
- Detect and quantify missing data with `isnull`, `value_counts`, and boolean math
- Clean missing values with `fillna`, forward/backward fill, interpolation, and `dropna`
- Know how aggregations behave with missing values (`skipna`)


## 1) What is a `NaN` value?

In [4]:
import pandas as pd
import numpy as np

print("np.nan == np.nan ->", np.nan == np.nan)    # False
print("pd.isnull(np.nan) ->", pd.isnull(np.nan))  # True
print("pd.notnull(42) ->", pd.notnull(42))  # True

np.nan == np.nan -> False
pd.isnull(np.nan) -> True
pd.notnull(42) -> True



**Key idea:** `NaN` is not equal to anything, including itself. Use `pd.isnull`/`pd.notnull` to detect missingness.
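
As a small illustration of the key idea, the sketch below shows that an equality comparison never matches `NaN`, while `isna()` (an alias of `isnull()`) does. The Series `readings` is made-up example data, not part of this lesson's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, not part of the lesson's dataset
readings = pd.Series([9.82, np.nan, 8.41])

# Equality against NaN is always False, so this mask finds nothing
eq_mask = readings == np.nan

# isna() flags the missing entry
na_mask = readings.isna()

print(eq_mask.tolist())
print(na_mask.tolist())
print(readings[na_mask])
```

Filtering with `readings[na_mask]` (or `readings[~na_mask]`) is the usual way to select the missing (or present) entries.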


## 2) Where do missing values come from?

### 2a) Loading data (custom missing markers)

`StringIO` is a class in Python's `io` module that lets you read from and write to a string as if it were a file. This is useful for testing, or when data already held in a string needs to go through file-oriented functions such as `pd.read_csv()`. In the cells below, `StringIO` wraps the CSV text so that `pd.read_csv()` can read it directly.

In [5]:

from io import StringIO

csv_text = StringIO('''ident,site,dated
619,DR-1,1927-02-08
622,DR-1,1927-02-10
734,DR-3,1939-01-07
752,DR-3,
''')

visited_default = pd.read_csv(csv_text)
visited_default


Unnamed: 0,ident,site,dated
0,619,DR-1,1927-02-08
1,622,DR-1,1927-02-10
2,734,DR-3,1939-01-07
3,752,DR-3,


In [6]:

# Same CSV; empty fields are already treated as missing by default, so na_values=[""] is passed here only for demonstration
csv_text2 = StringIO('''ident,site,dated
619,DR-1,1927-02-08
622,DR-1,1927-02-10
734,DR-3,1939-01-07
752,DR-3,
''')

visited_custom = pd.read_csv(csv_text2, na_values=[""], keep_default_na=True)
visited_custom


Unnamed: 0,ident,site,dated
0,619,DR-1,1927-02-08
1,622,DR-1,1927-02-10
2,734,DR-3,1939-01-07
3,752,DR-3,
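
Real files often mark missing entries with sentinels rather than empty fields. The sketch below assumes two hypothetical markers, `-999` and `unknown`, in a made-up `reading` column, and passes them to `na_values` so that they load as `NaN`.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV in which -999 and "unknown" stand for missing readings
raw = """ident,site,reading
619,DR-1,9.82
622,DR-1,-999
734,DR-3,unknown
"""

# na_values adds these markers to the default set of missing-value strings
df = pd.read_csv(StringIO(raw), na_values=["-999", "unknown"])

print(df)
print(df["reading"].isna().sum())
```

Without `na_values`, the `unknown` entry would force the whole column to be read as strings, and `-999` would silently skew any numeric summary.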


### 2b) Merging data (joins can introduce NaN)

Merging or joining data in pandas can introduce `NaN` values, especially with `'left'` or `'outer'` joins. In a left join, a key in the left DataFrame with no match in the right DataFrame produces `NaN` in every column that comes from the right DataFrame. In an outer join, unmatched keys on either side produce `NaN` in the columns that come from the other side.

In [7]:

survey = pd.DataFrame({
    "taken": [619, 622, 734],
    "person": ["dyer", "dyer", "pb"],
    "quant": ["rad", "sal", "rad"],
    "reading": [9.82, 0.13, 8.41]
})
merged = visited_custom.merge(survey, left_on="ident", right_on="taken", how="left")
merged


Unnamed: 0,ident,site,dated,taken,person,quant,reading
0,619,DR-1,1927-02-08,619.0,dyer,rad,9.82
1,622,DR-1,1927-02-10,622.0,dyer,sal,0.13
2,734,DR-3,1939-01-07,734.0,pb,rad,8.41
3,752,DR-3,,,,,
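
To see which rows picked up `NaN` because they had no match, one option is the `indicator=True` argument of `merge`, which adds a `_merge` column. The sketch below uses trimmed-down copies of the two tables; the variable names `visits` and `unmatched` are chosen here for illustration.

```python
import pandas as pd

# Trimmed-down versions of the visit and survey tables
visits = pd.DataFrame({
    "ident": [619, 622, 734, 752],
    "site": ["DR-1", "DR-1", "DR-3", "DR-3"],
})
survey = pd.DataFrame({
    "taken": [619, 622, 734],
    "reading": [9.82, 0.13, 8.41],
})

# indicator=True records where each row came from: "both" or "left_only"
merged = visits.merge(
    survey, left_on="ident", right_on="taken", how="left", indicator=True
)

# Rows from the left table that found no partner in the right table
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["ident", "site"]])
```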


### 2c) Manual/curated values (user input)

In [8]:

num_legs = pd.Series({"goat": 4, "amoeba": np.nan})
num_legs


Unnamed: 0,0
goat,4.0
amoeba,


### 2d) Reindexing can introduce NaN

Reindexing in pandas is the process of conforming a DataFrame or Series to a new index. This can involve rearranging the existing data to match the new index labels, and potentially introducing missing values (`NaN`) for index labels that were not present in the original data. It's often used for tasks like aligning data from different sources or ensuring a consistent time series.

In [9]:

s = pd.Series([1, 2], index=[2002, 2007])
s_reindexed = s.reindex(range(2000, 2010))
s, s_reindexed


(2002    1
 2007    2
 dtype: int64,
 2000    NaN
 2001    NaN
 2002    1.0
 2003    NaN
 2004    NaN
 2005    NaN
 2006    NaN
 2007    2.0
 2008    NaN
 2009    NaN
 dtype: float64)
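
Note that the reindexed Series above changed from `int64` to `float64` to make room for `NaN`. If a default value makes sense for the new labels, `reindex` accepts a `fill_value`; the sketch below assumes `0` is an acceptable default, which is a judgment about the data rather than a general rule.

```python
import pandas as pd

s = pd.Series([1, 2], index=[2002, 2007])

# Default behaviour: new labels get NaN and the dtype becomes float64
s_nan = s.reindex(range(2000, 2010))

# With fill_value, new labels get 0 and the integer dtype is kept
s_zero = s.reindex(range(2000, 2010), fill_value=0)

print(s_nan.dtype, s_zero.dtype)
print(s_zero)
```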

## 3) Finding & counting missing data


You can find and count missing data with `.isnull()`, `.notnull()`, `.sum()`, and NumPy's `np.count_nonzero()`. `.isnull()` returns a boolean mask marking where data is missing; calling `.sum()` on that mask counts the missing values per column, and `np.count_nonzero()` on the mask gives the total across the whole DataFrame.

In [10]:

visited = visited_custom.copy()
print("Boolean mask:\n", visited.isnull())
print("\nMissing count per column:\n", visited.isnull().sum())
print("\nTotal missing values:", np.count_nonzero(visited.isnull()))


Boolean mask:
    ident   site  dated
0  False  False  False
1  False  False  False
2  False  False  False
3  False  False   True

Missing count per column:
 ident    0
site     0
dated    1
dtype: int64

Total missing values: 1


In [11]:

# Value counts including NaN for a column
visited["dated"].value_counts(dropna=False)


Unnamed: 0_level_0,count
dated,Unnamed: 1_level_1
1927-02-08,1
1927-02-10,1
1939-01-07,1
,1
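
Because `True` counts as 1 and `False` as 0, the same boolean mask supports a little more arithmetic. The sketch below rebuilds the visit table by hand (an assumption made so the block is self-contained) and computes the fraction missing per column, the rows containing any missing value, and the overall total.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the visit table so the block runs on its own
df = pd.DataFrame({
    "ident": [619, 622, 734, 752],
    "site": ["DR-1", "DR-1", "DR-3", "DR-3"],
    "dated": ["1927-02-08", "1927-02-10", "1939-01-07", np.nan],
})

# Mean of a boolean mask = fraction of True values per column
frac_missing = df.isna().mean()

# Rows in which at least one column is missing
rows_with_missing = df[df.isna().any(axis=1)]

# Summing twice collapses the per-column counts into one total
total_missing = int(df.isna().sum().sum())

print(frac_missing)
print(rows_with_missing)
print(total_missing)
```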


## 4) Cleaning missing data

### 4a) Replace with a specific value

Replacing with a specific value is a straightforward way to handle missing data by substituting all missing entries in a Series or DataFrame with a predefined value. This can be useful when you have a logical default value to use, such as 0, a mean, a median, or a placeholder string.

In [12]:

visited_replace = visited.copy()
visited_replace["dated_filled"] = visited_replace["dated"].fillna("1900-01-01")
visited_replace


Unnamed: 0,ident,site,dated,dated_filled
0,619,DR-1,1927-02-08,1927-02-08
1,622,DR-1,1927-02-10,1927-02-10
2,734,DR-3,1939-01-07,1939-01-07
3,752,DR-3,,1900-01-01
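
`fillna` also accepts a dictionary, so each column can get its own replacement. The sketch below uses a made-up table and fills a text column with a placeholder and a numeric column with its median; whether the median is an appropriate fill depends on the analysis, so treat this as one option rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing value in each column
df = pd.DataFrame({
    "person": ["dyer", None, "pb", "lake"],
    "reading": [9.82, np.nan, 8.41, 0.13],
})

# Per-column fill values: a placeholder string and the column median
fill_values = {
    "person": "unknown",
    "reading": df["reading"].median(),  # median() skips NaN by default
}
filled = df.fillna(fill_values)

print(filled)
```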


### 4b) Forward fill (propagate last known value)

Forward fill (`ffill`) fills missing data points in a Series or DataFrame by propagating the last valid observation forward: each missing value is replaced by the most recent non-missing value above it in the same column. A missing value with nothing valid before it stays missing.

In [20]:
# Forward fill (propagate last known value)
# Note: Using .ffill() directly is the preferred method over fillna(method='ffill')

visited_ffill = visited.sort_values("ident").ffill()
visited_ffill

Unnamed: 0,ident,site,dated
0,619,DR-1,1927-02-08
1,622,DR-1,1927-02-10
2,734,DR-3,1939-01-07
3,752,DR-3,1939-01-07
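
Forward fill will carry a value across a gap of any length, which is not always desirable. The sketch below, on a made-up Series, uses the `limit` argument of `ffill` to cap how many consecutive missing values are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap of three missing values
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Unlimited forward fill: the whole gap takes the value 1.0
filled_all = s.ffill()

# limit=1: only the first missing value after each valid one is filled
filled_one = s.ffill(limit=1)

print(filled_all.tolist())
print(filled_one.tolist())
```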


### 4c) Backward fill

Backward fill (`bfill`) fills missing data points in a Series or DataFrame with the next valid observation in the sequence. It is the opposite of forward fill (`ffill`), which uses the previous valid observation. A missing value in the last row has no later observation to borrow from, so it stays missing, as in the output below.

In [19]:

# .bfill() is preferred; fillna(method="bfill") is deprecated in recent pandas versions
visited_bfill = visited.sort_values("ident").bfill()
visited_bfill


Unnamed: 0,ident,site,dated
0,619,DR-1,1927-02-08
1,622,DR-1,1927-02-10
2,734,DR-3,1939-01-07
3,752,DR-3,
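
The last row above is still empty because backward fill has no later value to pull from. One possible workaround, sketched below on a made-up Series, is to chain `bfill()` with `ffill()` so that trailing gaps are covered by the last valid value; whether mixing the two directions is acceptable depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical series with missing values at the start, middle, and end
s = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan])

# Backward fill alone leaves the trailing NaN in place
back = s.bfill()

# Following up with a forward fill covers the trailing gap
both = s.bfill().ffill()

print(back.tolist())
print(both.tolist())
```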


### 4d) Interpolation

In [15]:

s = pd.Series([1.0, np.nan, 3.0, np.nan, 7.0])
s_interp = s.interpolate()  # default is linear
s, s_interp


(0    1.0
 1    NaN
 2    3.0
 3    NaN
 4    7.0
 dtype: float64,
 0    1.0
 1    2.0
 2    3.0
 3    5.0
 4    7.0
 dtype: float64)
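
The default linear method treats the values as equally spaced and ignores the index. When the index holds unevenly spaced dates, `method="time"` weights the estimate by the actual time gaps instead. The sketch below uses made-up dates and values to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical observations: 1 day before the gap, 2 days after it
idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-04"])
s = pd.Series([0.0, np.nan, 3.0], index=idx)

# Linear: midpoint of the neighbours, regardless of the dates
linear = s.interpolate()

# Time: one third of the way from 0.0 to 3.0, since 1 of 3 days has passed
by_time = s.interpolate(method="time")

print(linear)
print(by_time)
```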

### 4e) Dropping missing data

In [16]:

print("Original shape:", visited.shape)
print("Drop any NaN rows:", visited.dropna().shape)
print("Drop rows only if all values are NaN:", visited.dropna(how='all').shape)
visited.dropna()


Original shape: (4, 3)
Drop any NaN rows: (3, 3)
Drop rows only if all values are NaN: (4, 3)


Unnamed: 0,ident,site,dated
0,619,DR-1,1927-02-08
1,622,DR-1,1927-02-10
2,734,DR-3,1939-01-07
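
`dropna` has a few more controls than `how`. The sketch below, on a made-up table with more gaps than the lesson's data, shows `subset` (only look at certain columns), `thresh` (keep rows with at least that many non-missing values), and `axis=1` (drop columns instead of rows).

```python
import pandas as pd

# Hypothetical table with several missing entries
df = pd.DataFrame({
    "ident": [619, 622, 734, 752],
    "site": ["DR-1", "DR-1", None, "DR-3"],
    "dated": ["1927-02-08", None, None, None],
})

# Drop rows only when "site" is missing
by_subset = df.dropna(subset=["site"])

# Keep rows that have at least 2 non-missing values
by_thresh = df.dropna(thresh=2)

# Drop every column that contains a missing value
complete_cols = df.dropna(axis=1)

print(by_subset.shape, by_thresh.shape, list(complete_cols.columns))
```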


## 5) Calculations with missing data (`skipna`)

In [17]:

visited_calc = visited.copy()
visited_calc["dated_dt"] = pd.to_datetime(visited_calc["dated"], errors="coerce")
visited_calc["year"] = visited_calc["dated_dt"].dt.year
sum_skip = visited_calc["year"].sum(skipna=True)
sum_noskip = visited_calc["year"].sum(skipna=False)

print("Sum with skipna=True:", sum_skip)
print("Sum with skipna=False:", sum_noskip)
visited_calc[["ident","dated","dated_dt","year"]]


Sum with skipna=True: 5793.0
Sum with skipna=False: nan


Unnamed: 0,ident,dated,dated_dt,year
0,619,1927-02-08,1927-02-08,1927.0
1,622,1927-02-10,1927-02-10,1927.0
2,734,1939-01-07,1939-01-07,1939.0
3,752,,NaT,
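
The same `skipna` behaviour applies to other reductions such as `mean`, and `count` reports how many non-missing values went into the result. One edge case worth knowing: the sum of an all-missing Series is `0.0` by default, and `min_count` can be used to get `NaN` instead. The sketch below uses the `year` values from above plus a made-up all-missing Series.

```python
import numpy as np
import pandas as pd

# Year values as in the table above
years = pd.Series([1927.0, 1927.0, 1939.0, np.nan])

mean_skip = years.mean()                # NaN ignored: 5793 / 3
mean_noskip = years.mean(skipna=False)  # NaN propagates
n_valid = years.count()                 # number of non-missing values

# Hypothetical all-missing series
all_missing = pd.Series([np.nan, np.nan])
sum_default = all_missing.sum()             # empty sum is 0.0
sum_min = all_missing.sum(min_count=1)      # requires at least 1 valid value

print(mean_skip, mean_noskip, n_valid)
print(sum_default, sum_min)
```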


## 6) Pandas built-in `pd.NA` (experimental)

In [18]:

df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 30]
})
df_pdna = df.copy()
df_pdna.loc[1, "Name"] = pd.NA
df_pdna.loc[0, "Age"] = pd.NA
print(df_pdna)
print("\nDtypes:\n", df_pdna.dtypes)


    Name   Age
0  Alice   NaN
1   <NA>  30.0

Dtypes:
 Name     object
Age     float64
dtype: object



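> **Note:** Assigning `pd.NA` into a column with a NumPy dtype can change how the column is stored. In the output above, `Age` was upcast from `int64` to `float64` and its missing entry displays as `NaN`, while `Name` stays `object` and displays `<NA>`.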
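
One way to keep integers as integers alongside missing entries is to opt into pandas' nullable extension dtypes, such as `"Int64"` (capital I) and `"string"`. The sketch below builds two small Series with those dtypes; the variable names `ages` and `names` are chosen here for illustration.

```python
import pandas as pd

# Nullable integer dtype: missing entries are stored as pd.NA, not NaN
ages = pd.Series([pd.NA, 30], dtype="Int64")

# Nullable string dtype
names = pd.Series(["Alice", pd.NA], dtype="string")

print(ages)
print(ages.dtype)
print(ages.isna().tolist())

# Reductions still skip missing values by default
print(ages.sum())
print(names.isna().sum())
```

With `"Int64"`, the present value stays the integer `30` rather than becoming `30.0`, and `isna()` works the same way as it does for `NaN`.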
