# Handling Missing Values in Pandas (Detailed)

Missing data can distort an analysis if it is ignored or handled carelessly. In this notebook, we will cover:

- Identifying missing values
- Removing missing values
- Filling missing values using various strategies
- Advanced methods like interpolation and forward fill


In [1]:
import pandas as pd
import numpy as np

# Create a sample DataFrame with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, np.nan, np.nan, 4, 5]
}
df = pd.DataFrame(data)
print('Original DataFrame:')
print(df)

Original DataFrame:
     A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  NaN
2  NaN  3.0  NaN
3  4.0  NaN  4.0
4  5.0  5.0  5.0


### Identifying Missing Values

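Use `isnull()` to get a boolean mask of missing cells, `isnull().sum()` to count missing values per column, and `info()` to see the non-null count and dtype of each column.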


In [3]:
print('\nMissing values (isnull):')
print(df.isnull())

print('\nCount of missing values per column:')
print(df.isnull().sum())

print('\nDataFrame info:')
df.info()


Missing values (isnull):
       A      B      C
0  False   True  False
1  False  False   True
2   True  False   True
3  False   True  False
4  False  False  False

Count of missing values per column:
A    1
B    2
C    2
dtype: int64

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       4 non-null      float64
 1   B       3 non-null      float64
 2   C       3 non-null      float64
dtypes: float64(3)
memory usage: 252.0 bytes
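
The counts above can also be expressed as fractions, and the affected rows can be pulled out directly. The following is a minimal sketch that rebuilds the same sample DataFrame so it runs on its own; the variable names (`missing_fraction`, `rows_with_missing`, `total_missing`) are illustrative and not part of the original notebook. `isna()` is an alias of `isnull()`.

```python
import numpy as np
import pandas as pd

# Same sample data as the first cell of this notebook
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, np.nan, np.nan, 4, 5],
})

# Fraction of missing values per column (the mean of a boolean mask)
missing_fraction = df.isna().mean()
print(missing_fraction)

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)

# Total number of missing cells in the whole DataFrame
total_missing = int(df.isna().sum().sum())
print(total_missing)
```

Looking at fractions rather than raw counts makes it easier to compare columns when the DataFrame is large.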


### Removing Missing Values

Use `dropna()` to remove rows or columns with missing values. By default it drops every row that contains at least one missing value, so be cautious: it can discard a lot of otherwise useful data.


In [2]:
df_dropna = df.dropna()
print('\nDataFrame after dropna():')
print(df_dropna)


DataFrame after dropna():
     A    B    C
4  5.0  5.0  5.0
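
Only one of the five rows survives the default `dropna()`. When that is too aggressive, the `how`, `thresh`, `subset`, and `axis` parameters of `dropna()` give finer control. The sketch below is illustrative only; it rebuilds the sample DataFrame so it can run on its own, and the result names (`drop_all`, `drop_thresh`, and so on) are made up for this example.

```python
import numpy as np
import pandas as pd

# Same sample data as the first cell of this notebook
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, np.nan, np.nan, 4, 5],
})

# Drop a row only if ALL of its values are missing (no such row here)
drop_all = df.dropna(how='all')

# Keep rows that have at least 2 non-missing values
drop_thresh = df.dropna(thresh=2)

# Look only at column 'A' when deciding which rows to drop
drop_subset = df.dropna(subset=['A'])

# Drop columns (instead of rows) that contain any missing value
drop_cols = df.dropna(axis=1)

print(len(drop_all), len(drop_thresh), len(drop_subset), drop_cols.shape)
```

Here `thresh=2` and `subset=['A']` both remove only row 2, while `axis=1` removes every column because each one contains at least one missing value.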


### Filling Missing Values

There are several strategies:

- **Constant value:** Use `fillna(0)`
- **Column mean:** Use `fillna(df.mean())`
- **Forward fill:** Use `ffill()` (the older `fillna(method='ffill')` spelling is deprecated in recent pandas versions)
- **Interpolation:** Use `interpolate()` for continuous data


In [3]:
# Fill with a constant value
df_fill_const = df.fillna(0)
print('\nDataFrame after fillna(0):')
print(df_fill_const)

# Fill with the column mean
df_fill_mean = df.fillna(df.mean())
print('\nDataFrame after filling with mean:')
print(df_fill_mean)

# Forward fill (ffill() replaces the deprecated fillna(method='ffill'))
df_ffill = df.ffill()
print('\nDataFrame after forward fill:')
print(df_ffill)

# Interpolation
df_interp = df.interpolate()
print('\nDataFrame after interpolation:')
print(df_interp)


DataFrame after fillna(0):
     A    B    C
0  1.0  0.0  1.0
1  2.0  2.0  0.0
2  0.0  3.0  0.0
3  4.0  0.0  4.0
4  5.0  5.0  5.0

DataFrame after filling with mean:
     A         B         C
0  1.0  3.333333  1.000000
1  2.0  2.000000  3.333333
2  3.0  3.000000  3.333333
3  4.0  3.333333  4.000000
4  5.0  5.000000  5.000000

DataFrame after forward fill:
     A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  1.0
2  2.0  3.0  1.0
3  4.0  3.0  4.0
4  5.0  5.0  5.0

DataFrame after interpolation:
     A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  2.0
2  3.0  3.0  3.0
3  4.0  4.0  4.0
4  5.0  5.0  5.0
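
Note that both forward fill and the default interpolation leave the first value of column `B` as `NaN`, because there is no earlier value to propagate or interpolate from. One possible way to handle such leading gaps, shown here as a sketch rather than a recommendation, is to chain a backward fill (`bfill()`, not used elsewhere in this notebook) after the forward fill, or to pass `limit_direction='both'` to `interpolate()`. The DataFrame is rebuilt so the cell runs on its own, and the result names are illustrative.

```python
import numpy as np
import pandas as pd

# Same sample data as the first cell of this notebook
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, np.nan, np.nan, 4, 5],
})

# Forward fill first, then backward fill whatever is still missing
# (here: the leading NaN in column 'B')
df_ffill_bfill = df.ffill().bfill()
print(df_ffill_bfill)

# Linear interpolation that is also allowed to fill leading and trailing gaps;
# edge gaps are filled with the nearest valid value, not extrapolated
df_interp_both = df.interpolate(limit_direction='both')
print(df_interp_both)
```

Whether copying the next observed value backward is acceptable depends on the data; for time series it uses information from the future, which may not be appropriate.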




### Summary

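Choose the method based on your data and on why the values are missing. Dropping rows or columns discards information (here, `dropna()` kept only one of five rows), while filling keeps every row but can introduce bias, because the filled values are estimates rather than observations.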
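
The trade-off can be checked on the sample data. The sketch below compares column `B` before and after mean imputation; it is a small illustration on five values, not a general result, and the variable names are made up for this example. Pandas skips `NaN` by default when computing `mean()` and `std()`.

```python
import numpy as np
import pandas as pd

# Same sample data as the first cell of this notebook
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, np.nan, np.nan, 4, 5],
})

col = df['B']
filled = col.fillna(col.mean())

# Dropping: how many rows survive?
rows_before = len(df)
rows_after_drop = len(df.dropna())

# Filling with the mean: the mean is unchanged, but the spread shrinks
mean_before, mean_after = col.mean(), filled.mean()
std_before, std_after = col.std(), filled.std()

print(rows_before, rows_after_drop)
print(mean_before, mean_after)
print(std_before, std_after)
```

Mean imputation keeps the column mean the same but reduces its standard deviation, since every filled value sits exactly at the center of the distribution.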
