## **Handling Missing Data with Pandas**
pandas inherits NumPy's boolean-selection capabilities and adds a number of convenient methods for handling missing values. Let's look at them one at a time:

---

### **Hands on!!**

In [1]:
# importing the libraries used throughout
import numpy as np
import pandas as pd

### Pandas utility functions

Similarly to numpy, pandas also has a few utility functions to identify and detect null values:

In [2]:
# isnull() returns True if the value is null and False if it is not
pd.isnull(np.nan)

True

In [3]:
# checking whether None counts as null
pd.isnull(None)

True

In [4]:
pd.isna(np.nan)

True

In [5]:
pd.isna(None)

True

There is no difference! isnull() and isna() are identical in pandas (one is an alias of the other). Both are used to detect missing (null) values in a scalar, Series, or DataFrame.
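As a quick check of that claim, here is a minimal sketch that compares the two functions on a small Series. The names `values`, `mask_isnull`, `mask_isna` and `same_result` are only illustrative and do not appear elsewhere in this notebook:

```python
import numpy as np
import pandas as pd

# illustrative Series mixing numbers, NaN and None
values = pd.Series([1, np.nan, None, 4])

mask_isnull = pd.isnull(values)
mask_isna = pd.isna(values)

# both calls should flag exactly the same positions
same_result = mask_isnull.equals(mask_isna)
print(same_result)
```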

Now let's check the opposite functions, notnull() and notna():

In [6]:
pd.notnull(np.nan)

False

In [7]:
pd.notna(np.nan)

False

In [8]:
pd.notnull(None)

False

In [9]:
pd.notna(None)

False

In [10]:
pd.notnull(3)

True

These functions also work with Series and DataFrames:

In [11]:
# here we check the same functions on a Series and a DataFrame
# creating a Series
x = pd.Series([1,2,np.nan,None,7])
pd.isnull(x)

0    False
1    False
2     True
3     True
4    False
dtype: bool

In [12]:
pd.notnull(x)

0     True
1     True
2    False
3    False
4     True
dtype: bool

In [13]:
# now checking for dataframes
df = pd.DataFrame({
    'Column A': [1, np.nan, 7],
    'Column B': [np.nan, 2, 3],
    'Column C': [np.nan, 2, np.nan]
})

In [14]:
pd.isnull(df)

   Column A  Column B  Column C
0     False      True      True
1      True     False     False
2     False     False      True


In [15]:
pd.notnull(df)

   Column A  Column B  Column C
0      True     False     False
1     False      True      True
2      True      True     False


### Pandas operation with missing values

Pandas manages missing values more gracefully than numpy. NaNs no longer behave like "viruses": aggregations such as sum, count and mean simply skip them:

In [16]:
# x is the Series created above
x

0    1.0
1    2.0
2    NaN
3    NaN
4    7.0
dtype: float64

In [17]:
x.sum()

10.0

In [18]:
x.count()

3

In [19]:
x.mean()

3.3333333333333335

Note that pandas disregards NaN values when counting or aggregating.  
When a null check turns a Series into a boolean Series, pandas arithmetic treats True as 1 and False as 0, so summing that mask counts the flagged values.
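The sketch below contrasts the two behaviours on the same data. The names `raw` and `ser` are illustrative; `skipna=False` is the pandas switch for opting back into NaN propagation:

```python
import numpy as np
import pandas as pd

raw = np.array([1, 2, np.nan, 7])
ser = pd.Series(raw)

numpy_total = raw.sum()               # NaN propagates through the plain NumPy sum
pandas_total = ser.sum()              # pandas skips NaN by default (skipna=True)
strict_total = ser.sum(skipna=False)  # ask pandas to propagate NaN as well

# a boolean mask sums as True == 1 and False == 0,
# so this counts the missing entries
missing_count = ser.isnull().sum()

print(numpy_total, pandas_total, strict_total, missing_count)
```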

### Filtering missing data

As we saw with numpy, we can combine boolean selection with pd.isnull or pd.notnull to select or filter out NaN and null values:

In [20]:
s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

In [21]:
pd.notnull(s)

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

In [22]:
pd.isnull(s)

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

In [23]:
# summing the boolean mask counts the null values
pd.isnull(s).sum()

2

In [24]:
pd.notnull(s).sum()

4

In [25]:
pd.notnull(s).count()

6

In [26]:
# now selecting values directly from the Series s
s[pd.isnull(s)]

3   NaN
4   NaN
dtype: float64

In [27]:
s[pd.notnull(s)]

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64

But both notnull and isnull are also methods of Series and DataFrames, so we can use them that way:

In [28]:
# Series and DataFrames have isnull() and notnull() as methods, so we can call them directly
s.isnull()

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

In [29]:
s.notnull()

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

In [30]:
# we can also write 
s[s.notnull()]

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64

### Dropping null values

Boolean selection + notnull() seems a little bit verbose and repetitive. And as we said before: any repetitive task will probably have a better, more DRY way. In this case, we can use the **dropna** method:

In [31]:
# here we use the dropna method, which drops the null or NaN values
s.dropna()

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64

s.dropna() performs basically the same task as s[s.notnull()].

Note that none of these operations modifies the original Series or DataFrame; each one returns a new object:

In [32]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64

see told ya !!
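If we do want to keep the cleaned result, one simple option is to bind it to a name. A minimal sketch, where `cleaned` is just an illustrative variable name:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

# dropna() hands back a new Series; s itself still holds the NaNs
cleaned = s.dropna()
print(len(s), len(cleaned))

# to make the change stick, reassign the result (possibly to the same name)
s = s.dropna()
print(len(s))
```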

### Dropping null values on DataFrames

You saw how simple it is to drop nas with a Series. But with DataFrames, there will be a few more things to consider, because you can't drop single values. You can only drop entire columns or rows. Let's start with a sample DataFrame:

In [33]:
df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

In [34]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [35]:
# everything that works with a Series also works with a DataFrame
df.isnull()

   Column A  Column B  Column C  Column D
0     False     False      True     False
1      True     False     False     False
2     False     False     False     False
3      True      True     False     False


In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Column A  2 non-null      float64
 1   Column B  3 non-null      float64
 2   Column C  3 non-null      float64
 3   Column D  4 non-null      int64  
dtypes: float64(3), int64(1)
memory usage: 256.0 bytes


df.info() prints a summary of the DataFrame.  
Its Non-Null Count column shows the exact number of non-null values in each column.

In [37]:
df.isnull()

   Column A  Column B  Column C  Column D
0     False     False      True     False
1      True     False     False     False
2     False     False     False     False
3      True      True     False     False


In [38]:
# now we can count the null values in each column
df.isnull().sum()

Column A    2
Column B    1
Column C    1
Column D    0
dtype: int64
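Because True counts as 1 and False as 0, taking the mean of the same mask gives the share of missing values per column instead of the count. A small sketch on the same sample DataFrame (`missing_share` is an illustrative name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

# mean of a boolean mask == fraction of True values in each column
missing_share = df.isnull().mean()
print(missing_share)
```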

The default dropna behavior drops every row in which at least one null value is present:

In [39]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [40]:
df.dropna()

   Column A  Column B  Column C  Column D
2      30.0      31.0      32.0        34


See, every row except row 2 is dropped, since each of them contained at least one null value.

In this case we're dropping rows. Rows containing null values are dropped from the DF. You can also use the axis parameter to drop columns containing null values:

In [41]:
# now we set the axis so that columns are dropped instead of rows
df.dropna(axis = 1) # axis='columns' also works

   Column D
0         5
1         8
2        34
3       110


In this case, any row or column that contains at least one null value is dropped, which, depending on the case, can be too extreme. You can control this behavior with the how parameter, which can be either 'any' or 'all':

For dropna, the **how** parameter sets the rule for dropping: 'any' (the default) drops a row or column if it contains at least one null value, while 'all' drops it only if every value in it is null. (A parameter named how also appears in merge and join operations, where it has a different meaning.)

In [42]:
# Drops Rows/Columns Only If All Values are NaN
df.dropna(how = "all")

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [43]:
# Drops Rows/Columns If Any Value is NaN
# (default behavior)
df.dropna(how = "any")

   Column A  Column B  Column C  Column D
2      30.0      31.0      32.0        34
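With the sample DataFrame, how = "all" drops nothing because no row is entirely null. The difference is easier to see on a frame that does contain a fully empty row. A minimal sketch, where `demo` is a made-up DataFrame used only for this illustration:

```python
import numpy as np
import pandas as pd

# hypothetical frame: row 2 is entirely NaN, row 1 is only partly NaN
demo = pd.DataFrame({
    'Column A': [1, np.nan, np.nan],
    'Column B': [2, 8, np.nan],
})

only_empty_rows_dropped = demo.dropna(how='all')  # removes row 2 only
any_null_rows_dropped = demo.dropna(how='any')    # removes rows 1 and 2

print(only_empty_rows_dropped)
print(any_null_rows_dropped)
```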


In [44]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


The thresh parameter in df.dropna(thresh=val) specifies the minimum number of non-NaN values required for a row/column to be kept.

In [45]:
df.dropna(thresh = 3)

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34


We can also change the axis that the thresh parameter applies to:

In [46]:
df.dropna(thresh = 3, axis = "columns")

   Column B  Column C  Column D
0       2.0       NaN         5
1       8.0       9.0         8
2      31.0      32.0        34
3       NaN     100.0       110
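One way to read thresh = 3 is "keep the rows that have at least three non-null values". The sketch below spells that rule out by hand and compares it with dropna. The names `non_null_per_row`, `manual` and `builtin` are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

# count the non-null values in each row
non_null_per_row = df.notnull().sum(axis=1)

manual = df[non_null_per_row >= 3]   # hand-written version of the rule
builtin = df.dropna(thresh=3)        # the same rule through dropna

print(list(manual.index), list(builtin.index))
```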


### Filling null values

Sometimes, instead of dropping the null values, we might need to replace them with some other value. This depends heavily on your context and the dataset you're working with. Sometimes a NaN can be replaced with a 0, sometimes it can be replaced with the mean of the sample, and other times you can take the closest value. Again, it depends on the context. We'll show you the different methods and mechanisms, and you can then apply them to your own problem.

In [47]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64

**Filling nulls with an arbitrary value**

In [48]:
# now we will learn how to fill in the NaN values
# the first approach uses the .fillna(val) method,
# which replaces the NaN values with the value provided
s.fillna(0) 

0    1.0
1    2.0
2    3.0
3    0.0
4    0.0
5    4.0
dtype: float64

In [49]:
# we can also use the mean value to fill the nan
s.fillna(s.mean())

0    1.0
1    2.0
2    3.0
3    2.5
4    2.5
5    4.0
dtype: float64

Note that we are not altering the Series here, so the original Series is preserved:

In [50]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64

**Filling nulls with contiguous (close) values**

The method argument is used to fill null values with other values close to that null one:

**Using method to Propagate Values**  
- ✅ Forward Fill (ffill) – Fill NaN with previous row’s value
- ✅ Backward Fill (bfill) – Fill NaN with next row’s value

In [51]:
s.fillna(method = "ffill")

0    1.0
1    2.0
2    3.0
3    3.0
4    3.0
5    4.0
dtype: float64

In [52]:
s.fillna(method = "bfill")

0    1.0
1    2.0
2    3.0
3    4.0
4    4.0
5    4.0
dtype: float64
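Newer pandas releases (2.1 and later, to the best of our knowledge) deprecate the method argument of fillna in favour of the dedicated ffill() and bfill() methods, so it is worth checking the release notes for the version you run. A minimal sketch of the equivalent calls on the same Series, with illustrative variable names:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

forward = s.ffill()              # same idea as fillna(method="ffill")
backward = s.bfill()             # same idea as fillna(method="bfill")
backward_one = s.bfill(limit=1)  # fill at most one consecutive NaN per gap

print(forward.tolist())
print(backward.tolist())
print(backward_one.tolist())
```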

This can still leave null values at the extremes of the Series/DataFrame:

In [53]:
pd.Series([np.nan, 3, np.nan, 9]).fillna(method='ffill')

0    NaN
1    3.0
2    3.0
3    9.0
dtype: float64

In [54]:
pd.Series([1, np.nan, 3, np.nan, np.nan]).fillna(method='bfill')

0    1.0
1    3.0
2    3.0
3    NaN
4    NaN
dtype: float64

**🔹 Why Do NaN Values Still Exist After Using .fillna(method='ffill'/'bfill')?**
- 1️⃣ Case 1: NaN at the Beginning (ffill won't work)
   - 🔹 Problem: If a column starts with NaN, there’s no previous value to forward-fill.
- 2️⃣ Case 2: NaN at the End (bfill won't work)
   - 🔹 Problem: If a column ends with NaN, there’s no next value to backward-fill.
- 3️⃣ Case 3: Column Made Up Entirely of NaN Values
   - 🔹 Problem: If all values in a column are NaN, both ffill and bfill will fail because there’s no valid data to propagate.

**🔹 How to Fix This?**
- ✅ 1. Combine ffill and bfill
   - 🔹 Forward-fill first, then back-fill any remaining NaNs.
- ✅ 2. Use fillna(value=...) as a fallback
   - 🔹 Replace remaining NaNs with a default value (0, mean, median, etc.).
- ✅ 3. Drop Rows/Columns with Remaining NaNs
   - 🔹 Removes rows with NaNs after fill.
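The three fixes above can be sketched as follows. The Series `gappy` is a made-up example with NaNs at both ends, and the sketch uses the ffill() and bfill() methods rather than fillna(method=...):

```python
import numpy as np
import pandas as pd

# hypothetical Series with NaN at the start, in the middle and at the end
gappy = pd.Series([np.nan, 3, np.nan, 9, np.nan])

# 1. forward-fill first, then back-fill whatever is still missing
both_ways = gappy.ffill().bfill()

# 2. forward-fill, then fall back to a default value for the leftovers
with_default = gappy.ffill().fillna(0)

# 3. forward-fill, then drop the entries that are still NaN
dropped = gappy.ffill().dropna()

print(both_ways.tolist())
print(with_default.tolist())
print(dropped.tolist())
```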

**Limiting Fills with limit**

In [55]:
# the limit parameter caps how many consecutive NaN values get filled
s.fillna(method = "bfill" , limit = 1)

0    1.0
1    2.0
2    3.0
3    NaN
4    4.0
5    4.0
dtype: float64

**Column-Wise Replacement**

In [56]:
# we can also replace NaN values column by column
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [57]:
df.fillna({'Column A': 100, 'Column B': 200, 'Column C': 300})

   Column A  Column B  Column C  Column D
0       1.0       2.0     300.0         5
1     100.0       8.0       9.0         8
2      30.0      31.0      32.0        34
3     100.0     200.0     100.0       110


### Filling null values on DataFrames

The fillna method also works on DataFrames, and it works similarly. The main differences are that you can specify the axis (as usual, rows or columns) along which to fill the values (especially for the ffill/bfill methods) and that you have more control over the values passed:

In [58]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [59]:
# here we fill the DataFrame with fillna, using a different value per column
# Column C is filled with its own mean
df.fillna({'Column A': 0, 'Column B': 99, 'Column C': df['Column C'].mean()})

   Column A  Column B  Column C  Column D
0       1.0       2.0      47.0         5
1       0.0       8.0       9.0         8
2      30.0      31.0      32.0        34
3       0.0      99.0     100.0       110
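When every column should be filled with its own mean, the dictionary does not have to be written by hand: df.mean() returns one value per column, and fillna matches those values to the columns by name. A minimal sketch on the same sample DataFrame (`filled` is an illustrative name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

# df.mean() is a Series indexed by column name;
# each column's NaNs are replaced by that column's mean
filled = df.fillna(df.mean())
print(filled)
```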


In [60]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      31.0      32.0        34
3       NaN       NaN     100.0       110


In [61]:
# ffill with axis=0 propagates values downward, from one row to the next within each column
df.fillna(method='ffill', axis=0)

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       1.0       8.0       9.0         8
2      30.0      31.0      32.0        34
3      30.0      31.0     100.0       110


Notice that in Column C the NaN in the first row remains unchanged, since there is no previous row to copy from.

In [62]:
df.fillna(method='ffill', axis=1)

   Column A  Column B  Column C  Column D
0       1.0       2.0       2.0       5.0
1       NaN       8.0       9.0       8.0
2      30.0      31.0      32.0      34.0
3       NaN       NaN     100.0     110.0
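To make the two directions easier to compare, the sketch below runs both on the same sample DataFrame. It uses the ffill() method, which takes the same axis argument, and the names `down` and `across` are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

down = df.ffill(axis=0)    # copy the value from the row above, column by column
across = df.ffill(axis=1)  # copy the value from the column to the left, row by row

# Column A of row 1 is filled from row 0 when going down,
# but stays NaN when going across (nothing to its left)
print(down.loc[1, 'Column A'], across.loc[1, 'Column A'])
```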


### Checking if there are NAs

The question is: Does this Series or DataFrame contain any missing value? The answer should be yes or no: True or False. How can you verify it?

**Example 1: Checking the length**  
If there are missing values, s.dropna() will have fewer elements than s:

In [63]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64

In [64]:
# here counting the no. of not null values in the s series
s.dropna().count()

4

In [65]:
# now we check whether any values are missing
missing_values = len(s.dropna()) != len(s)
missing_values

True

Since missing_values is True, we know that NaN values exist in the Series s.

There's also a count method that excludes NaNs from its result:

In [66]:
len(s)

6

In [67]:
s.count()

4

🔹 Why Does s.count() Return 4 While len(s) Returns 6?  
This happens because:  
- s.count() → Counts only non-NaN values in the Series.
- len(s) → Counts all elements (including NaN values).

In [68]:
# now checking whether NaN values are present in s by comparing len() with count()
missing_values = len(s) != s.count()
missing_values

True

**A more Pythonic solution: any**

The methods **any** and **all** check if either there's any True value in a Series or all the values are True. They work in the same way as in Python:

In [69]:
pd.Series([True, False, False]).any()

True

In [70]:
pd.Series([True, False, False]).all()

False

In [71]:
pd.Series([True, True, True]).all()

True

The isnull() method returns a boolean Series with True wherever there is a NaN:

In [72]:
s.isnull()

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

So we can just use the any method on the boolean Series returned:

In [73]:

pd.Series([1, np.nan]).isnull().any()

True

In [74]:
pd.Series([1, 2]).isnull().any()

False

In [75]:
s.isnull().any()

True

A more strict version would check only the values of the Series:

In [76]:
# .values → Converts the result into a NumPy array.
s.isnull().values

array([False, False, False,  True,  True, False])

In [77]:
s.isnull().values.any()

True
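Going through .values matters more for DataFrames: there, isnull().any() gives one answer per column rather than a single True or False. A minimal sketch on the sample DataFrame from earlier (variable names are illustrative; hasnans is a Series attribute that answers the same question directly):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})

per_column = df.isnull().any()          # one flag per column
any_missing = df.isnull().values.any()  # a single flag for the whole frame

# for a Series, the hasnans attribute gives the same yes/no answer
series_flag = pd.Series([1, np.nan]).hasnans

print(per_column.tolist(), any_missing, series_flag)
```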