# 15 Handling Missing Data: `.dropna()`, `.fillna()`, and `.interpolate()`

-----

Missing data, represented as `NaN` (Not a Number) or `NaT` (Not a Time), is one of the most common problems in real-world datasets. Pandas aggregations such as `.sum()` and `.mean()` skip `NaN`s by default, which can quietly distort results, and many machine learning libraries refuse to train on data that contains them. These three methods are your primary tools for fixing this.

  * **`.dropna()`**: This is the "surgery" option. It **removes** entire rows or columns that contain *any* `NaN` values. It's the simplest solution but can be drastic because you *lose data*.
  * **`.fillna()`**: This is the "patch" option. It **fills** the `NaN` holes with a value you choose. This value can be a *constant* (like `0` or "Unknown"), the *mean* or *median* of the column, or the last known good value (**forward-fill**) or next known good value (**backward-fill**).
  * **`.interpolate()`**: This is the "smart patch" option, mainly for time-series or ordered numeric data. It **estimates** the missing value by drawing a line (or curve) between the known values before and after the gap.

**How It Works in Memory**: By default, all three methods create a **new** DataFrame (a copy) with the changes applied. `dropna` creates a new DataFrame with a smaller shape. `fillna` and `interpolate` create a new DataFrame of the same shape, with the `NaN` values in the underlying NumPy arrays replaced by new values. Because they all return new copies, they are memory-safe but require you to re-assign the result (e.g., `df = df.fillna(0)`).
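
As a quick check of the copy behaviour described above, here is a minimal sketch. The frame `df_demo` is a made-up example, not part of the later datasets. It shows that the original keeps its `NaN` until the result is re-assigned.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame, used only for illustration
df_demo = pd.DataFrame({'A': [1.0, np.nan]})

filled = df_demo.fillna(0)            # returns a NEW DataFrame

print(df_demo['A'].isna().sum())      # the original still holds its NaN
print(filled['A'].isna().sum())       # the returned copy is clean

df_demo = df_demo.fillna(0)           # re-assign to keep the change
print(df_demo['A'].isna().sum())
```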

**When to Use This**:

  * Use **`.dropna()`** when a row or column is *mostly* empty and is not worth saving, or when you need a "perfectly clean" dataset for a model and can afford to lose some rows.
  * Use **`.fillna(0)`** when a `NaN` truly means "zero" (e.g., "Units Sold" for a product that didn't sell).
  * Use **`.fillna(df.mean())`** to fill missing numbers without changing the column's average.
  * Use **`.ffill()`** (forward-fill; the older spelling `.fillna(method='ffill')` is deprecated since pandas 2.1) for time-series data to "carry forward" the last known value (e.g., a stock price from yesterday).
  * Use **`.interpolate()`** for numeric or time-series data (like sensor readings) where you can reasonably assume the missing value lies on a straight line between the points before and after it.

-----

### 0\. Syntax & Parameters

#### 1\. `dataframe.dropna()`

```python
dataframe.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
```

  * **`axis`**:
      * **What it does:** Tells it which axis to drop *from*.
      * **Default:** `0` (or `'index'`). This drops **ROWS** that have `NaN`s.
      * **When to use:** Use `axis=1` (or `'columns'`) to drop **COLUMNS** that have `NaN`s.
  * **`how`**:
      * **What it does:** Decides *when* to drop.
      * **Default:** `'any'`. Drops the row/column if *any* (at least one) `NaN` is present.
      * **When to use:** Use `how='all'` to drop a row/column *only if* **all** of its values are `NaN`.
  * **`thresh`** (threshold):
      * **What it does:** An integer. This is a "reverse" way to drop. It tells Pandas to *keep* a row/column *if* it has at least `thresh` number of *non-`NaN`* values.
      * **Default:** `None`.
      * **When to use:** `thresh=3` means "Keep any row that has at least 3 good values."
  * **`subset`**:
      * **What it does:** A list of column names (if dropping rows) or index names (if dropping columns). It tells Pandas to *only* look for `NaN`s in *this subset* of labels.
      * **Default:** `None` (looks at all columns/rows).
      * **When to use:** `subset=['Email']`. This is critical. It means "Drop any row that is missing an 'Email'," but it won't drop rows missing other data.

#### 2\. `dataframe.fillna()`

```python
dataframe.fillna(value=None, method=None, axis=0, inplace=False, limit=None, ...)
```

  * **`value`**:
      * **What it does:** The *constant* value to fill with. Can be a scalar (`0`), or a **dict** to fill different columns with different values (e.g., `{'Age': 0, 'City': 'Unknown'}`).
      * **Default:** `None`.
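      * **When to use:** This is the most common use: `df.fillna(0)`. Or `df.fillna(df.mean(numeric_only=True))` to fill each numeric column with its own mean.
  * **`method`**:
      * **What it does:** The "fill strategy." This parameter is deprecated since pandas 2.1; new code should call `df.ffill()` or `df.bfill()` directly.
      * **Default:** `None`.
      * **When to use:**
          * `'ffill'` (or `'pad'`): **Forward-fill**. Fills a `NaN` with the last *good* value that came *before* it. Modern equivalent: `df.ffill()`.
          * `'bfill'` (or `'backfill'`): **Backward-fill**. Fills a `NaN` with the *next* good value *after* it. Modern equivalent: `df.bfill()`.
  * **`limit`**:
      * **What it does:** An integer. The max number of *consecutive* `NaN`s to fill (per gap). Longer gaps are filled only *partially*, not skipped.
      * **Default:** `None` (fills all of them).
      * **When to use:** `limit=1` with forward-fill will patch 1-day gaps completely, but in a 2-day gap it fills only the first day and leaves the second as `NaN`.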
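
To see how `limit` treats gaps of different lengths, here is a small sketch. The series `s_gaps` is invented for illustration and contains a 1-value gap followed by a 2-value gap.

```python
import numpy as np
import pandas as pd

# Made-up series: a 1-value gap, then a 2-value gap
s_gaps = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

# limit=1 fills at most one consecutive NaN per gap
patched = s_gaps.ffill(limit=1)
print(patched)
```

The single gap at position 1 is filled with `1.0`. In the longer gap, position 3 is filled with `3.0` and position 4 stays `NaN`.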

#### 3\. `dataframe.interpolate()`

```python
dataframe.interpolate(method='linear', axis=0, limit=None, inplace=False, ...)
```

  * **`method`**:
      * **What it does:** The mathematical "strategy" for estimation.
      * **Default:** `'linear'`. This "draws a straight line" between the two points.
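      * **When to use:** `'linear'` is the usual choice. It ignores the index and treats the rows as equally spaced; use `'index'` or `'time'` to weight by the actual index values. You can also use `'polynomial'` or `'spline'` (both need an `order` argument) or `'quadratic'` for more complex curves; these rely on SciPy being installed.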
  * **`axis`**: `0` (default) fills down columns, `1` fills across rows.
  * **`limit`**: Max number of consecutive `NaN`s to fill.
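
The sketch below illustrates how `limit` interacts with `limit_direction`, a further `interpolate()` parameter not listed above. The series `s_gap3` is a made-up example with one 3-value gap.

```python
import numpy as np
import pandas as pd

# Made-up series with a single gap of three NaNs
s_gap3 = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Fill at most one NaN, working forward from the left edge of the gap
forward_only = s_gap3.interpolate(limit=1)

# Fill at most one NaN from each edge of the gap
both_sides = s_gap3.interpolate(limit=1, limit_direction='both')

print(forward_only)
print(both_sides)
```

With `limit=1` alone, only position 1 is filled (with `2.0`). With `limit_direction='both'`, positions 1 and 3 are filled (`2.0` and `4.0`) and the middle of the gap stays `NaN`.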

-----

### 1\. Basic Example

Let's see all three in their simplest form on a small, "dirty" DataFrame.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, np.nan],
    'B': [10, np.nan, 30, 40, 50],
    'C': [100, 200, 300, np.nan, 500]
})
print("--- 1. Original DataFrame ---")
print(df)

# Example 1: .dropna() (Default: axis=0, how='any')
# Drops any ROW with at least one NaN
print("\n--- 2. Example 1: df.dropna() ---")
print(df.dropna())

# Example 2: .fillna(0) (Constant fill)
print("\n--- 3. Example 2: df.fillna(0) ---")
print(df.fillna(0))

# Example 3: .interpolate() (Default: linear)
# Fills NaN by estimating based on values above/below
print("\n--- 4. Example 3: df.interpolate() ---")
print(df.interpolate())
```

**Output:**

```
--- 1. Original DataFrame ---
     A     B      C
0  1.0  10.0  100.0
1  2.0   NaN  200.0
2  NaN  30.0  300.0
3  4.0  40.0    NaN
4  NaN  50.0  500.0

--- 2. Example 1: df.dropna() ---
     A     B      C
0  1.0  10.0  100.0

--- 3. Example 2: df.fillna(0) ---
     A     B      C
0  1.0  10.0  100.0
1  2.0   0.0  200.0
2  0.0  30.0  300.0
3  4.0  40.0    0.0
4  0.0  50.0  500.0

--- 4. Example 3: df.interpolate() ---
     A     B      C
0  1.0  10.0  100.0
1  2.0  20.0  200.0
2  3.0  30.0  300.0
3  4.0  40.0  400.0
4  4.0  50.0  500.0
```

**Explanation:**

  * **`dropna`**: Rows 1, 2, 3, and 4 all had at least one `NaN`, so they were *all dropped*, leaving only the "perfect" row 0.
  * **`fillna(0)`**: All `NaN`s were simply replaced with `0`.
  * **`interpolate`**: Look at `A` at row 2: it was `NaN`, but row 1 was `2` and row 3 was `4`, so it "estimated" the middle value as `3.0`. It did the same for `B` (10, 30 -\> 20) and `C` (300, 500 -\> 400). Note that the `NaN` in `A` at row 4 has no "end" point after it, so there is no line to draw. With the default `limit_direction='forward'`, Pandas simply repeats the last valid value (`4.0`) there. Pass `limit_area='inside'` if you want such trailing gaps left as `NaN`.
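
The behaviour at the end of column `A` can be checked in isolation. This sketch rebuilds that column as a standalone Series (`col_a` is just an illustrative name) and compares the default call with `limit_area='inside'`, which restricts filling to `NaN`s that have valid values on both sides.

```python
import numpy as np
import pandas as pd

# Column 'A' from the basic example, rebuilt as its own Series
col_a = pd.Series([1, 2, np.nan, 4, np.nan], name='A')

default_fill = col_a.interpolate()                     # trailing NaN repeats 4.0
inside_only = col_a.interpolate(limit_area='inside')   # trailing NaN is left alone

print(default_fill)
print(inside_only)
```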

-----

### 2\. Intermediate Example

Using the more advanced parameters: `subset`, `how`, and `method`.

```python
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clara', 'David'],
    'Email': ['a@x.com', np.nan, 'c@x.com', np.nan],
    'Score': [100, 85, 90, np.nan],
    'Status': [np.nan, np.nan, np.nan, np.nan]
})
print("--- 5. Original DataFrame ---")
print(df)

# Example 4: .dropna(how='all')
# Drops rows where EVERY value is NaN (no row qualifies here)
print("\n--- 6. Example 4: df.dropna(how='all') ---")
print(df.dropna(how='all')) # Row 3 (David) still has a Name, so it is kept

# Example 5: .dropna(axis=1, how='all')
# Drops any COLUMN that is ALL NaN
print("\n--- 7. Example 5: df.dropna(axis=1, how='all') ---")
print(df.dropna(axis=1, how='all')) # 'Status' column is dropped

# Example 6: .dropna(subset=...)
# This is CRITICAL. Drop rows ONLY if they are missing 'Email'
print("\n--- 8. Example 6: df.dropna(subset=['Email']) ---")
print(df.dropna(subset=['Email']))

# Example 7: forward-fill (for time-series or ordered data)
# .ffill() replaces fillna(method='ffill'), which is deprecated since pandas 2.1
s_time = pd.Series([1, 2, np.nan, np.nan, 5])
print("\n--- 9. Example 7: .fillna(method='ffill') ---")
print(s_time.ffill())

# Example 8: backward-fill
# .bfill() replaces the deprecated fillna(method='bfill')
print("\n--- 10. Example 8: .fillna(method='bfill') ---")
print(s_time.bfill())
```

**Output:**

```
--- 5. Original DataFrame ---
    Name    Email  Score  Status
0  Alice  a@x.com  100.0     NaN
1    Bob      NaN   85.0     NaN
2  Clara  c@x.com   90.0     NaN
3  David      NaN    NaN     NaN

--- 6. Example 4: df.dropna(how='all') ---
    Name    Email  Score  Status
0  Alice  a@x.com  100.0     NaN
1    Bob      NaN   85.0     NaN
2  Clara  c@x.com   90.0     NaN
3  David      NaN    NaN     NaN

--- 7. Example 5: df.dropna(axis=1, how='all') ---
    Name    Email  Score
0  Alice  a@x.com  100.0
1    Bob      NaN   85.0
2  Clara  c@x.com   90.0
3  David      NaN    NaN

--- 8. Example 6: df.dropna(subset=['Email']) ---
    Name    Email  Score  Status
0  Alice  a@x.com  100.0     NaN
2  Clara  c@x.com   90.0     NaN

--- 9. Example 7: .fillna(method='ffill') ---
0    1.0
1    2.0
2    2.0
3    2.0
4    5.0
dtype: float64

--- 10. Example 8: .fillna(method='bfill') ---
0    1.0
1    2.0
2    5.0
3    5.0
4    5.0
dtype: float64
```

**Explanation:**

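  * **Ex 4:** No row is *entirely* `NaN` (even David still has a `Name`), so `dropna(how='all')` dropped nothing.
  * **Ex 5:** The `Status` column was `100% NaN`, so `dropna(axis=1, how='all')` removed it.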
  * **Ex 6:** This is the most important one. By setting `subset=['Email']`, we *only* checked the 'Email' column. Rows 1 (Bob) and 3 (David) were dropped, but Row 0 (Alice), which was missing 'Status', was *kept*.
  * **Ex 7:** `ffill` "pulled forward" the `2.0` to fill the two `NaN` gaps.
  * **Ex 8:** `bfill` "pulled backward" the `5.0` to fill the two `NaN` gaps.

-----

### 3\. Advanced or Tricky Case

Using `thresh` for dropping, and statistical values for `fillna`.

```python
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clara'],
    'Quiz_1': [8, 5, np.nan],
    'Quiz_2': [9, np.nan, 7],
    'Quiz_3': [10, 6, 8]
})
print("--- 11. Original DataFrame ---")
print(df)

# Example 9: .dropna(thresh=...)
# Keep any row with at least 3 non-NaN values
print("\n--- 12. Example 9: df.dropna(thresh=3) ---")
print(df.dropna(thresh=3))

# Example 10: .fillna() with the column's MEAN
# This is a very common statistical imputation
print("\n--- 13. Example 10: df.fillna(df.mean()) ---")
# This calculates the mean of Quiz_1 (6.5) and Quiz_2 (8.0)
# and fills the NaNs with those respective values.
# Note: df.mean() on a DataFrame returns a Series
print(df.fillna(df.mean(numeric_only=True)))

# Example 11: .fillna() with a dictionary
# Fill different columns with different values
fill_values = {'Quiz_1': 0, 'Quiz_2': df['Quiz_2'].mean()}
print("\n--- 14. Example 11: df.fillna(fill_values) ---")
print(df.fillna(fill_values))
```

**Output:**

```
--- 11. Original DataFrame ---
    Name  Quiz_1  Quiz_2  Quiz_3
0  Alice     8.0     9.0      10
1    Bob     5.0     NaN       6
2  Clara     NaN     7.0       8

--- 12. Example 9: df.dropna(thresh=3) ---
    Name  Quiz_1  Quiz_2  Quiz_3
0  Alice     8.0     9.0      10
1    Bob     5.0     NaN       6
2  Clara     NaN     7.0       8

--- 13. Example 10: df.fillna(df.mean()) ---
    Name  Quiz_1  Quiz_2  Quiz_3
0  Alice     8.0     9.0      10
1    Bob     5.0     8.0       6
2  Clara     6.5     7.0       8

--- 14. Example 11: df.fillna(fill_values) ---
    Name  Quiz_1  Quiz_2  Quiz_3
0  Alice     8.0     9.0      10
1    Bob     5.0     8.0       6
2  Clara     0.0     7.0       8
```

**Explanation:**

  * **Ex 9:** `thresh` counts *every* non-`NaN` cell in the row, and `Name` counts too. Row 1 (Bob) has `Bob`, `5.0`, and `6` = 3 good values, and Row 2 (Clara) has `Clara`, `7.0`, and `8` = 3 good values, so with `thresh=3` all three rows are kept. To keep only the complete rows here, raise the threshold to 4:

```python
# Example 9b: .dropna(thresh=4)
# Keep any row with at least 4 non-NaN values
print("\n--- 12b. Example 9b: df.dropna(thresh=4) ---")
print(df.dropna(thresh=4))
```

**Output:**

```
--- 12b. Example 9b: df.dropna(thresh=4) ---
    Name  Quiz_1  Quiz_2  Quiz_3
0  Alice     8.0     9.0      10
```

Row 0 (Alice) had 4 good values. Rows 1 (Bob) and 2 (Clara) each had only 3 good values, so they were dropped.

  * **Ex 10:** This is powerful. The `NaN` in `Quiz_1` was filled with `6.5` (the mean of 8 and 5). The `NaN` in `Quiz_2` was filled with `8.0` (the mean of 9 and 7).
  * **Ex 11:** This is the most flexible. We filled `Quiz_1`'s `NaN` with `0`, but `Quiz_2`'s `NaN` with its mean.

-----

### 4\. Real-World Use Case

**Example 12: A Full Cleaning Workflow**
You have a "dirty" dataset and you need to clean it using all the rules.

```python
df_raw = pd.DataFrame({
    'timestamp': ['2025-01-01', '2025-01-02', '2025-01-03', '2025-01-04', '2025-01-05', '2025-01-06'],
    'sensor_id': ['A', 'A', 'A', 'B', 'B', 'B'],
    'temperature': [25.0, 26.0, np.nan, 30.0, 31.0, np.nan],
    'contact_email': [np.nan, 'a@x.com', 'a@x.com', np.nan, 'b@x.com', 'b@x.com'],
    'blank_col': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]
})
print("--- 15. Real-World Dirty Data ---")
print(df_raw)

# Example 13: Step 1 - Drop totally useless columns
df_clean = df_raw.dropna(axis=1, how='all')
print("\n--- 16. Step 1: Dropped 'blank_col' ---")
print(df_clean.columns)

# Example 14: Step 2 - Drop rows missing a critical value
# We can't do anything without a 'timestamp', so drop those (if any)
df_clean = df_clean.dropna(subset=['timestamp'])

# Example 15: Step 3 - Fill data based on its type
# For 'temperature' (numeric), interpolate makes sense
df_clean['temperature'] = df_clean['temperature'].interpolate()
# For 'contact_email' (text), forward-fill makes sense for this "sensor" data
df_clean['contact_email'] = df_clean['contact_email'].ffill()  # replaces deprecated fillna(method='ffill')

print("\n--- 17. Step 3: Filled and Interpolated ---")
print(df_clean)
```

**Output:**

```
--- 15. Real-World Dirty Data ---
     timestamp sensor_id  temperature contact_email  blank_col
0  2025-01-01         A         25.0           NaN        NaN
1  2025-01-02         A         26.0       a@x.com        NaN
2  2025-01-03         A          NaN       a@x.com        NaN
3  2025-01-04         B         30.0           NaN        NaN
4  2025-01-05         B         31.0       b@x.com        NaN
5  2025-01-06         B          NaN       b@x.com        NaN

--- 16. Step 1: Dropped 'blank_col' ---
Index(['timestamp', 'sensor_id', 'temperature', 'contact_email'], dtype='object')

--- 17. Step 3: Filled and Interpolated ---
     timestamp sensor_id  temperature contact_email
0  2025-01-01         A         25.0           NaN
1  2025-01-02         A         26.0       a@x.com
2  2025-01-03         A         28.0       a@x.com
3  2025-01-04         B         30.0       a@x.com
4  2025-01-05         B         31.0       b@x.com
5  2025-01-06         B         31.0       b@x.com
```

**Explanation:**
This is a full pipeline:

1.  We used `dropna(axis=1, how='all')` to find and remove the `blank_col`.
2.  We used `interpolate` for the `temperature` column. The `NaN` at row 2 became (26 + 30) / 2 = **28.0**. The `NaN` at row 5 had no end point after it, so it was filled with the last good value, `31.0`.
3.  We forward-filled `contact_email`. The `NaN` at row 3 was filled with `'a@x.com'` from the row above it. The `NaN` at row 0 had nothing before it, so it remained `NaN`.

One caveat: both fills run straight across the boundary between sensor `A` and sensor `B`. Row 2 (sensor `A`) was interpolated using sensor `B`'s `30.0`, and row 3 (sensor `B`) inherited sensor `A`'s email. On real data you would normally fill *within* each `sensor_id` group.
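
If the fills should not leak from one sensor to the next, one option is to fill inside each `sensor_id` group. This is a minimal sketch, not the workflow's original method. The frame `df_sensors` rebuilds the relevant columns so the block runs on its own, and the extra backward-fill for emails is an added assumption about what is wanted.

```python
import numpy as np
import pandas as pd

# Relevant columns of the workflow data, rebuilt so the sketch is self-contained
df_sensors = pd.DataFrame({
    'sensor_id': ['A', 'A', 'A', 'B', 'B', 'B'],
    'temperature': [25.0, 26.0, np.nan, 30.0, 31.0, np.nan],
    'contact_email': [np.nan, 'a@x.com', 'a@x.com', np.nan, 'b@x.com', 'b@x.com'],
})

# Interpolate temperature separately inside each sensor's rows
df_sensors['temperature'] = (
    df_sensors.groupby('sensor_id')['temperature']
    .transform(lambda s: s.interpolate())
)

# Forward-fill, then backward-fill, the email inside each sensor (assumed policy)
df_sensors['contact_email'] = (
    df_sensors.groupby('sensor_id')['contact_email']
    .transform(lambda s: s.ffill().bfill())
)

print(df_sensors)
```

Each sensor's trailing temperature gap now repeats that sensor's own last reading (`26.0` for `A`, `31.0` for `B`), and no email crosses from one sensor to the other.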

-----

### 5\. Common Mistakes / Pitfalls

**Mistake 16: Forgetting to re-assign (The \#1 Mistake)**

```python
df = pd.DataFrame({'A': [1, np.nan]})
print("\n--- 18. Before ---")
print(df)

# Wrong code
df.dropna() # This creates a new, dropped DF... and throws it away

print("\n--- 19. After (Still has NaN!) ---")
print(df)
```

**Correction:** `df = df.dropna()` or `df.dropna(inplace=True)`.

**Mistake 17: `interpolate()` on `object` (text) data**

```python
s_text = pd.Series(['a', np.nan, 'c'])
print("\n--- 20. Interpolate on Text (Does Nothing) ---")
print(s_text.interpolate())
```

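**Result:**
Interpolation is a *mathematical* concept; it can't "guess" the string halfway between 'a' and 'c'. Older Pandas versions return the Series unchanged, with the `NaN` still in place. Pandas 2.1 and later emit a deprecation warning for `object` dtype, and the warning states that a future version will raise an error instead.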

**Mistake 18: `interpolate()` on unsorted data**
This is a *silent* and *dangerous* error.

```python
s_unsorted = pd.Series([10, np.nan, 100], index=[0, 2, 1])
print("\n--- 21. Unsorted Series ---")
print(s_unsorted)

# Wrong code
print("\n--- 22. Interpolating (WRONG) ---")
print(s_unsorted.interpolate())
```

**Output:**

```
--- 21. Unsorted Series ---
0     10.0
2      NaN
1    100.0
dtype: float64

--- 22. Interpolating (WRONG) ---
0     10.0
2     55.0
1    100.0
dtype: float64
```

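**Why it happens:** The default `'linear'` method ignores the index labels completely. It treats the rows as equally spaced *in their current order*, so the `NaN` sitting between `10` and `100` became `55.0`, even though its label (`2`) comes *after* both neighbours.
**Correction:** Sort first, so the row order matches the real order of the data:
`s_sorted = s_unsorted.sort_index()`
`print(s_sorted.interpolate())`
After sorting, the `NaN` at label `2` is the *last* row, so it is no longer averaged to `55.0`; it is filled with the last valid value, `100.0`. If the spacing between index values matters, also pass `method='index'` (or `method='time'` for a `DatetimeIndex`) after sorting.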
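
To show how much the index spacing can matter, the sketch below compares the default method with `method='index'`. The series `s_uneven` is a hypothetical example with a sorted but unevenly spaced index.

```python
import numpy as np
import pandas as pd

# Hypothetical series: sorted index with uneven spacing (0, 1, 10)
s_uneven = pd.Series([10.0, np.nan, 100.0], index=[0, 1, 10])

# Default 'linear': rows are treated as equally spaced, index is ignored
by_position = s_uneven.interpolate()

# 'index': the gap is weighted by the actual index values
by_index = s_uneven.interpolate(method='index')

print(by_position)
print(by_index)
```

The default fills the gap with `55.0`, the midpoint of its neighbours. With `method='index'`, label `1` sits one tenth of the way from `0` to `10`, so the fill is `19.0`.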

**Mistake 19: `fillna(df.mean())` on non-numeric columns**
In Pandas 2.0 and later, `df.mean()` raises a `TypeError` when the DataFrame contains text (`object`) columns.

```python
df = pd.DataFrame({'A': [1, np.nan], 'B': ['x', 'y']})
# This will raise a TypeError
try:
    df.fillna(df.mean())
except TypeError as e:
    print(f"\n--- 23. Error: {e} ---")
```

**Correction:** You *must* select the numeric columns only.
`df.fillna(df.mean(numeric_only=True))`
Or, even better, specify which columns to fill:
`df['A'] = df['A'].fillna(df['A'].mean())`
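
When a frame has many numeric columns, naming each one becomes tedious. One possible alternative is to select them by dtype with `select_dtypes`. This is a sketch on a made-up frame `df_mixed`.

```python
import numpy as np
import pandas as pd

# Made-up frame mixing a numeric column and a text column
df_mixed = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': ['x', 'y', None]})

# Pick out the numeric columns by dtype
num_cols = df_mixed.select_dtypes(include='number').columns

# Fill each numeric column with its own mean; text columns are untouched
df_mixed[num_cols] = df_mixed[num_cols].fillna(df_mixed[num_cols].mean())

print(df_mixed)
```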

-----

### 6\. Key Terms (Explained Simply)

  * **`NaN` (Not a Number):** The standard "missing value" marker for numbers.
  * **`NaT` (Not a Time):** The "missing value" marker for datetimes.
  * **`.dropna()`**: **Removes** rows/columns with `NaN`s.
  * **`.fillna()`**: **Fills** `NaN`s with a specific value or strategy.
  * **`.interpolate()`**: **Estimates/Fills** `NaN`s in numeric data by "drawing a line" between known points.
  * **`how='any'`**: Drops a row/col if **at least one** `NaN` exists.
  * **`how='all'`**: Drops a row/col *only if* **all** values are `NaN`.
  * **`subset=[...]`**: The list of columns that are checked for `NaN`s; all other columns are ignored.
  * **`thresh=...`**: *Keeps* rows/cols that have *at least* this many *good* values.
  * **`.ffill()`** (older spelling `method='ffill'`): **Forward-Fill**. "Pulls" the last good value forward to fill a gap.
  * **`.bfill()`** (older spelling `method='bfill'`): **Backward-Fill**. "Pulls" the next good value backward to fill a gap.

-----

### 7\. Best Practices

  * **Diagnose First:** Always run `df.isna().sum()` to see *what* is missing before you decide *how* to fix it.
  * **Use `subset`:** When dropping rows, `dropna(subset=[...])` is almost always better than a general `dropna()`. You rarely want to drop a row just because an unimportant column is missing.
  * **Choose the Right Fill:**
      * `NaN` means "zero" -\> `fillna(0)`
      * `NaN` is a missing *category* -\> `fillna('Unknown')` (for a `category` dtype column, add 'Unknown' to the categories first with `.cat.add_categories('Unknown')`)
      * `NaN` is a missing *number* -\> `fillna(df.mean(numeric_only=True))` (or `median()`)
      * `NaN` is a missing *time-series* value -\> `ffill()` or `interpolate()`
  * **Interpolate Safely:** Only use `.interpolate()` on numeric, sorted data (like a time-series sorted by time).
  * **Re-assign:** None of these methods work `inplace` by default. You *must* re-assign: `df = df.dropna()`.
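
The first and third practices can be sketched together. The frame `df_check` and its column names are invented for illustration. The block counts missing values per column, then fills a `category` column after registering the new label.

```python
import numpy as np
import pandas as pd

# Invented frame: one numeric column, one categorical column
df_check = pd.DataFrame({
    'score': [1.0, np.nan, 3.0, np.nan],
    'city': pd.Categorical(['Oslo', None, 'Rome', 'Oslo']),
})

missing_count = df_check.isna().sum()    # number of NaNs per column
missing_share = df_check.isna().mean()   # fraction of rows missing per column
print(missing_count)
print(missing_share)

# A categorical column only accepts known categories,
# so register 'Unknown' before filling with it
df_check['city'] = df_check['city'].cat.add_categories('Unknown').fillna('Unknown')
print(df_check['city'])
```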

-----

### 8\. Mini Summary

  * **`dropna()`**: Removes rows/cols. Use `subset` to be specific.
  * **`fillna()`**: Fills `NaN`s.
      * Use a constant: `fillna(0)`.
      * Use a statistic: `fillna(df.mean())`.
      * Use a "pull" strategy: `ffill()` or `bfill()`.
  * **`interpolate()`**: Fills `NaN`s (numeric only) with a mathematical guess.
  * All three return a **new copy**. You *must* re-assign the result.

-----

### 10\. Practice Tasks

**Data for Tasks:**

```python
df_practice = pd.DataFrame({
    'timestamp': pd.to_datetime(['2025-01-01', '2025-01-02', '2025-01-03', '2025-01-04', '2025-01-05']),
    'user_id': ['u1', 'u2', 'u3', np.nan, 'u5'],
    'page_views': [10, 8, np.nan, 4, 2],
    'time_on_site': [300, np.nan, 120, 100, 50],
    'notes': [np.nan, np.nan, np.nan, np.nan, np.nan]
})
```

**Task 26 (Easy):**
Create a new DataFrame `df_easy` that is a copy of `df_practice` but with the 'notes' column completely removed (since it's all `NaN`).

**Task 27 (Medium):**
Create a new DataFrame `df_medium` from `df_practice` that drops *any row* that is missing a 'user\_id' (but keeps rows that are missing other things).

**Task 28 (Hard):**
Create a new DataFrame `df_hard` from `df_practice` that is "fully cleaned" using these rules:

1.  Any column that is 100% `NaN` is dropped.
2.  Any row missing a 'user\_id' is dropped.
3.  The 'page\_views' `NaN` is filled with the *mean* of the 'page\_views' column.
4.  The 'time\_on\_site' `NaN` is filled using *linear interpolation*.

-----

### 11\. Recommended Next Topic

You have now mastered detecting and handling missing data. The next logical step from the roadmap is to handle the *other* major data cleaning problem: duplicate data.

**Recommended:** **Handling Duplicates (`.duplicated()`, `.drop_duplicates()`)**

-----

### 12\. Quick Reference Card

| Method | Main Use | Key Parameters |
| :--- | :--- | :--- |
| **`.dropna()`** | **Removes** rows or columns with `NaN`s. | `axis=0` (rows) or `1` (cols)<br>`how='any'` or `'all'`<br>`subset=['col1', ...]` |
| **`.fillna()`** | **Fills** `NaN`s with a specific value or strategy. | `value=0` (or `df.mean()`, or `dict`)<br>`limit=...`<br>For forward/backward fill, use `.ffill()` / `.bfill()` (replaces the deprecated `method=`) |
| **`.interpolate()`** | **Estimates** `NaN`s in numeric data. | `method='linear'` (default)<br>`limit=...` |

-----

### 13\. Common Interview Questions

1.  **How do you handle missing data in Pandas?**
      * **Detect:** First, I use `df.isna().sum()` to find which columns have missing data.
      * **Drop:** If a row is missing a *critical* value, I drop it using `df.dropna(subset=['critical_col'])`. If a column is all `NaN`, I use `df.dropna(axis=1, how='all')`.
      * **Fill:** If the data can be imputed, I fill it.
          * `df.fillna(0)` if `NaN` means zero.
          * `df['col'].fillna(df['col'].mean())` for numeric data.
          * `df['col'].ffill()` for time-series data.
      * **Interpolate:** If it's numeric, ordered data, I might use `df['col'].interpolate()`.
2.  **What's the difference between forward-fill (`.ffill()`, formerly `fillna(method='ffill')`) and `interpolate()`?**
      * `ffill` (forward-fill) just *copies* the last known value. If the value was `10` and the next is `20`, `ffill` will fill a 3-value gap with `10, 10, 10`.
      * `interpolate` (linear) *calculates* the values. It "draws a line." If the value was `10` and the next is `20`, it will fill a 3-value gap with `12.5, 15, 17.5`.
3.  **How do you fill missing 'Age' with the mean, but missing 'City' with the string "Unknown"?**
      * You pass a dictionary to the `value` parameter of `fillna`:
      * `fill_dict = {'Age': df['Age'].mean(), 'City': 'Unknown'}`
      * `df = df.fillna(fill_dict)`
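
The contrast from question 2 can be reproduced directly. This sketch uses a made-up series `s_line` with a 3-value gap between `10` and `20`.

```python
import numpy as np
import pandas as pd

# Made-up series: a gap of three NaNs between 10 and 20
s_line = pd.Series([10.0, np.nan, np.nan, np.nan, 20.0])

copied = s_line.ffill()           # repeats the last known value
estimated = s_line.interpolate()  # spaces the values along a straight line

print(copied)
print(estimated)
```

Forward-fill gives `10, 10, 10` in the gap, while linear interpolation gives `12.5, 15, 17.5`.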

-----

### 14\. Performance Considerations

  * **Time Complexity:** All three methods are **O(n\*m)** (rows \* cols) in the worst case, as they must visit every cell. For a single column, it's **O(n)**.
  * **`interpolate()`** is the most computationally "expensive" of the three, as it's performing mathematical calculations, not just copying or removing. `fillna(df.mean())` is also more expensive than `fillna(0)` because it has to calculate the mean first.
  * **Memory Usage:** All three methods return a **new DataFrame (a copy)** by default. This will temporarily *double* your memory usage.
  * `inplace=True` is an option for all three. It modifies the original DataFrame, but Pandas often still builds a temporary copy internally, so the memory savings are not guaranteed. It is generally discouraged because it's less predictable and prevents method chaining.

-----

### 15\. When NOT to Use This

  * **Don't `dropna()` wantonly:** Dropping rows is dropping *information*. If you drop all rows with *any* `NaN`, you might drop 50% of your data and introduce a massive **bias**. Always use `subset` to be specific.
  * **Don't `fillna(0)` blindly:** If you're looking at "Temperature," `NaN` might mean "not measured." Filling it with `0` (freezing) will destroy your statistics. It's better to leave it as `NaN` or use a statistical fill like the `mean`.
  * **Don't `interpolate()` on unordered data:** Using `interpolate` on a column of "Age" that *isn't* sorted by age is meaningless. The "line" it draws will be nonsensical. It's almost *exclusively* for time-series or spatially-ordered data.
  * **Don't `interpolate()` on text:** It won't work. It's a mathematical function.
  * **Don't use `ffill` on non-sequential data:** Forward-filling a `NaN` in a "Customer ID" column with the ID of the customer above is just *wrong*. It's *only* for sequential data.