# 🧼 Handling Missing Data with Pandas

In this notebook, we'll cover:
- Detecting missing values
- Removing missing data or unwanted columns
- Filling missing data with static or computed values
- Interpolating missing data using various methods


In [1]:
# 📥 Importing pandas
import pandas as pd 

## 📊 Step 1: Create Sample DataFrame with Missing Values


In [None]:
data = {
    "Name": ['Ram', None, 'Ghanshyam', 'Dhanshyam', 'Aditi', 'Jagdish', 'Raj', 'Simran'],
    "Age": [28, None, 47, 57, 17, 27, 77, 25],
    "Salary": [5000, None, 45000, 5200, 4900, 7000, 9000, 17000],
    "Performance Score": [43, None, 26, 59, 84, 38, 67, 22]  
}
df = pd.DataFrame(data)
print(df)

        Name   Age   Salary  Performance Score
0        Ram  28.0   5000.0               43.0
1       None   NaN      NaN                NaN
2  Ghanshyam  47.0  45000.0               26.0
3  Dhanshyam  57.0   5200.0               59.0
4      Aditi  17.0   4900.0               84.0
5    Jagdish  27.0   7000.0               38.0
6        Raj  77.0   9000.0               67.0
7     Simran  25.0  17000.0               22.0


## 🔍 Step 2: Detect Missing Data

- `isnull()` returns a boolean DataFrame of the same shape, with `True` wherever a value is missing (`isna()` is an alias).
- `isnull().sum()` gives the number of missing values in each column.


In [None]:
print("Where is data missing?\n")
print(df.isnull())

print("\nTotal number of missing values per column:\n")
print(df.isnull().sum())

Where is data missing?

    Name    Age  Salary  Performance Score
0  False  False   False              False
1   True   True    True               True
2  False  False   False              False
3  False  False   False              False
4  False  False   False              False
5  False  False   False              False
6  False  False   False              False
7  False  False   False              False

Total number of missing values per column:

Name                 1
Age                  1
Salary               1
Performance Score    1
dtype: int64
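
Beyond counting, the same boolean mask can be used to pull out the incomplete rows themselves. The sketch below is illustrative and not part of the original notebook: the frame `sample` and its values are assumptions, and it uses `isna()` together with `any(axis=1)`.

```python
import pandas as pd

# Hypothetical three-row frame; not the notebook's dataset.
sample = pd.DataFrame({
    "Name": ["Ram", None, "Aditi"],
    "Age": [28, None, 17],
})

# True for every row that has at least one missing value.
row_has_missing = sample.isna().any(axis=1)

# Keep only the incomplete rows for inspection.
incomplete_rows = sample[row_has_missing]
print(incomplete_rows)

# Share of missing values per column (0.0 to 1.0).
missing_share = sample.isna().mean()
print(missing_share)
```

Looking at the share rather than the raw count can help when deciding whether a column is worth keeping at all.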


## 🧹 Step 3: Remove Missing Data

We can remove:
- Rows with any missing value, using `dropna()`
- Columns with any missing value, using `dropna(axis=1)`


In [4]:
# Removing rows with any missing values
df_removed = df.copy()
df_removed.dropna(inplace=True)
print("After removing rows with missing values:\n")
print(df_removed)

After removing rows with missing values:

        Name   Age   Salary  Performance Score
0        Ram  28.0   5000.0               43.0
2  Ghanshyam  47.0  45000.0               26.0
3  Dhanshyam  57.0   5200.0               59.0
4      Aditi  17.0   4900.0               84.0
5    Jagdish  27.0   7000.0               38.0
6        Raj  77.0   9000.0               67.0
7     Simran  25.0  17000.0               22.0
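
The cell above only removes rows. As a rough sketch of the column-oriented options, the example below uses a small hypothetical frame (`demo` and its `Bonus` column are assumptions, not part of the notebook) to compare `dropna(axis=1)`, `dropna(how="all")`, `dropna(subset=...)`, and `drop(columns=...)`.

```python
import pandas as pd

# Hypothetical frame: row 1 is entirely empty, and "Bonus" has one extra gap.
demo = pd.DataFrame({
    "Name": ["Ram", None, "Aditi", "Raj"],
    "Age": [28, None, 17, 77],
    "Bonus": [100, None, None, 300],
})

# Drop every column that contains at least one missing value.
# Row 1 is empty in all columns, so this removes every column.
no_gap_columns = demo.dropna(axis=1)

# Drop only rows where all values are missing (row 1).
without_empty_rows = demo.dropna(how="all")

# Drop rows that are missing a value in the listed columns only.
age_known = demo.dropna(subset=["Age"])

# Remove an unwanted column by name, regardless of missing values.
without_bonus = demo.drop(columns=["Bonus"])

print(no_gap_columns.shape, without_empty_rows.shape, age_known.shape, without_bonus.shape)
```

In the notebook's `df`, row 1 is missing in every column, so `df.dropna(axis=1)` would likewise leave no columns. `how="all"` or `subset` is usually the gentler choice in that situation.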


## 🧪 Step 4: Fill Missing Data (Imputation)

We can use:
- A constant value (e.g., 77)
- A computed value like mean, median, or mode for specific columns
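
As a sketch of the computed-value options listed above, the example below fills several columns in one call by passing a dictionary to `fillna()`. The frame `staff`, its `Dept` column, and the `"Unknown"` placeholder are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical frame; values are illustrative only.
staff = pd.DataFrame({
    "Name": ["Ram", None, "Aditi", "Raj"],
    "Age": [28, None, 17, 77],
    "Dept": ["HR", "IT", None, "IT"],
})

# Build one replacement per column, then fill them all in a single call.
replacements = {
    "Name": "Unknown",                # constant for a text column
    "Age": staff["Age"].median(),     # median is less sensitive to outliers than the mean
    "Dept": staff["Dept"].mode()[0],  # most frequent value
}
staff_filled = staff.fillna(replacements)
print(staff_filled)
```

Passing a dictionary keeps each column's replacement explicit and avoids filling a text column with a number by accident.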


In [6]:
# Filling missing values with appropriate replacements
df_filled = pd.DataFrame(data)  # recreate DataFrame

# Fill entire DataFrame with constant (optional)
# df_filled.fillna(77, inplace=True)

# Fill specific columns with their mean values.
# Assign the result back instead of calling fillna(..., inplace=True) on a
# column selection: that pattern acts on a copy and raises a FutureWarning.
df_filled['Age'] = df_filled['Age'].fillna(df_filled['Age'].mean())
df_filled['Salary'] = df_filled['Salary'].fillna(df_filled['Salary'].mean())

print("After filling missing Age and Salary with their means:\n")
print(df_filled)

After filling missing Age and Salary with their means:

        Name        Age   Salary  Performance Score
0        Ram  28.000000   5000.0               43.0
1       None  39.714286  13300.0                NaN
2  Ghanshyam  47.000000  45000.0               26.0
3  Dhanshyam  57.000000   5200.0               59.0
4      Aditi  17.000000   4900.0               84.0
5    Jagdish  27.000000   7000.0               38.0
6        Raj  77.000000   9000.0               67.0
7     Simran  25.000000  17000.0               22.0


⚠️ **Note on `inplace=True`**

Calling `df[col].fillna(value, inplace=True)` triggers a warning in current pandas, and the behavior will change in pandas 3.0: the intermediate object `df[col]` always behaves as a copy, so the original DataFrame is never updated. Use `df[col] = df[col].fillna(value)` or `df.fillna({col: value}, inplace=True)` instead.


## 🔄 Step 5: Interpolation in Pandas

Interpolation is a method of estimating missing values between two known values.

### Types of Interpolation:

1. **Linear**:
   - Fills each gap along a straight line between the nearest known values, treating rows as equally spaced (the index is ignored).
   - Good for numeric trends.
   - `method='linear'`

2. **Polynomial**:
   - Fills using a polynomial of the specified order (e.g., 2 for quadratic).
   - Can be more accurate on curved data, but higher orders risk overfitting.
   - `method='polynomial', order=n` (requires SciPy)

3. **Time**:
   - Weights each estimate by the time elapsed between index timestamps.
   - Useful for time series data.
   - `method='time'` (only works with a `DatetimeIndex`)


In [7]:
# Recreate original DataFrame
df_interp = pd.DataFrame(data)

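# Interpolate missing values using the linear method.
# Only numeric columns are interpolated; calling interpolate() on the
# object-dtype "Name" column is deprecated in recent pandas versions.
df_interp_linear = df_interp.copy()
num_cols = df_interp_linear.select_dtypes(include="number").columns
df_interp_linear[num_cols] = df_interp_linear[num_cols].interpolate(method="linear")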

print("🔗 Linear Interpolation:\n")
print(df_interp_linear)

🔗 Linear Interpolation:

        Name   Age   Salary  Performance Score
0        Ram  28.0   5000.0               43.0
1       None  37.5  25000.0               34.5
2  Ghanshyam  47.0  45000.0               26.0
3  Dhanshyam  57.0   5200.0               59.0
4      Aditi  17.0   4900.0               84.0
5    Jagdish  27.0   7000.0               38.0
6        Raj  77.0   9000.0               67.0
7     Simran  25.0  17000.0               22.0
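
In the output above, row 1 sits exactly halfway between rows 0 and 2, e.g. Age is (28 + 47) / 2 = 37.5. For longer gaps the estimates are spaced evenly. The sketch below checks this by hand on a hypothetical series (`values` is an assumption, not notebook data).

```python
import numpy as np
import pandas as pd

# Hypothetical series with a two-value gap between 10 and 40.
values = pd.Series([10.0, np.nan, np.nan, 40.0])

# method="linear" spaces the estimates evenly across the gap.
filled = values.interpolate(method="linear")
print(filled.tolist())

# The same estimate by hand for position 1:
# start + (end - start) * steps_into_gap / total_steps
manual = 10.0 + (40.0 - 10.0) * 1 / 3
print(manual)
```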



In [8]:
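# Polynomial interpolation (order=2); requires SciPy.
# Interpolate numeric columns only; the object-dtype "Name" column is skipped.
df_interp_poly = pd.DataFrame(data)
num_cols = df_interp_poly.select_dtypes(include="number").columns
df_interp_poly[num_cols] = df_interp_poly[num_cols].interpolate(method="polynomial", order=2)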

print("📐 Polynomial Interpolation (order=2):\n")
print(df_interp_poly)

📐 Polynomial Interpolation (order=2):

        Name        Age        Salary  Performance Score
0        Ram  28.000000   5000.000000          43.000000
1       None  33.704057  49565.579422          19.593742
2  Ghanshyam  47.000000  45000.000000          26.000000
3  Dhanshyam  57.000000   5200.000000          59.000000
4      Aditi  17.000000   4900.000000          84.000000
5    Jagdish  27.000000   7000.000000          38.000000
6        Raj  77.000000   9000.000000          67.000000
7     Simran  25.000000  17000.000000          22.000000


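⚠️ **Note on Time Interpolation**

`method='time'` only works if the DataFrame or Series has a `DatetimeIndex`. A datetime column is not enough (set it as the index first), and it does **not work** with a plain integer or float index.

The example below has a numeric `time` column rather than a datetime index, so it fills the gaps in that column with polynomial interpolation instead.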


In [None]:
# Example with numeric column for time
data2 = {
    "time": [1, 2, None, 4, None, 7, None, 10],
    "val": [10, 20, 30, 40, 50, 60, 70, 100]
}
df2 = pd.DataFrame(data2)
print("Original DataFrame with missing time:\n")
print(df2)

print("\nPolynomial Interpolation on 'time':\n")
df2["time"] = df2["time"].interpolate(method="polynomial", order=2)
print(df2)

# Reminder:
# df2["time"] = df2["time"].interpolate(method="time")  # ❌ Raises an error: the index is not a DatetimeIndex

Original DataFrame with missing time:

   time  val
0   1.0   10
1   2.0   20
2   NaN   30
3   4.0   40
4   NaN   50
5   7.0   60
6   NaN   70
7  10.0  100

Polynomial Interpolation on 'time':

        time  val
0   1.000000   10
1   2.000000   20
2   2.929293   30
3   4.000000   40
4   5.424242   50
5   7.000000   60
6   8.525253   70
7  10.000000  100
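
For contrast, here is a minimal sketch of `method='time'` on data that does have a `DatetimeIndex`. The timestamps and readings are made up for illustration. The gaps are uneven (1 day, then 3 days) so that the time-weighted result differs from the positional one.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with uneven spacing: 1 day, then 3 days.
timestamps = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"])
readings = pd.Series([10.0, np.nan, 50.0], index=timestamps)

# method="time" weights the estimate by elapsed time (1 of 4 days).
by_time = readings.interpolate(method="time")

# method="linear" ignores the index and takes the midpoint.
by_position = readings.interpolate(method="linear")

print(by_time.iloc[1], by_position.iloc[1])
```

If the timestamps live in a regular column, calling `set_index()` on that column first makes `method='time'` available.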
