# CSC271: Lecture Notes

## Handling Entries with Missing Data

When DataFrame entries are missing data, we need to decide what to do about it. Our options include leaving the data as-is, removing the entries (rows) that contain missing data, or filling in the missing data. This lesson is about determining how to handle missing data.

The decision about how to handle missing data often involves trade-offs, and there isn't always a single agreed-upon approach.

Considerations:
- how much data is missing? 
- why is the data missing?
- are the missing values important? 
- would filling change the distribution? (More on this in STA courses!)
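
The first consideration, how much data is missing, can be checked directly in pandas. The sketch below uses a small made-up gradebook (the column names and scores are assumptions for illustration only) and counts missing entries per column and per row.

```python
import numpy as np
import pandas as pd

# Hypothetical gradebook; np.nan marks a missing score.
grades = pd.DataFrame({
    "assignment_1": [85, 78, 92, np.nan],
    "assignment_2": [90, np.nan, 88, 85],
    "final_exam": [92, 85, 91, 88],
})

missing_counts = grades.isna().sum()                 # missing entries per column
missing_share = grades.isna().mean()                 # fraction missing per column
rows_with_missing = grades.isna().any(axis=1).sum()  # rows with at least one gap

print(missing_counts)
print(missing_share)
print(rows_with_missing)
```

Looking at both the per-column share and the number of affected rows helps, because a small share in each column can still touch a large share of the rows.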


## Why is the data missing?

We first need to understand why the data is missing. In other words, we'll consider the mechanisms that led to the missing data.

**Missing Completely at Random (MCAR)**

If the entries that are missing data are a random subset of the entries, then they are considered to be *missing completely at random*. In other words, the missingness does not depend on the values of the data (missing or non-missing).

Example: a sensor failed for a random period of time

**Missing at Random (MAR)**

If whether an entry is missing depends on other values (other features in the data set), the entries are considered to be *missing at random*.

Example: in survey data, income is missing more often for younger respondents. (The missing income data depends on age.)

**Missing Not at Random (MNAR)**

If whether a value is missing depends on the missing value itself, the entries are considered to be *missing not at random*.

Example: people with very high income may choose not to report it
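
One way to see the difference between the three mechanisms is to simulate them. The sketch below builds a made-up survey and removes income values in three different ways. Everything here is an assumption for illustration: the column names, the 10%, 30%, and 5% probabilities, the age cutoff of 30, and the top-10% rule are not from any real data set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable
n = 1000
survey = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(60_000, 15_000, size=n).round(),
})

# MCAR: every row has the same 10% chance of losing its income value.
mcar_mask = rng.random(n) < 0.10

# MAR: the chance depends on another observed column (age), not on income.
mar_mask = rng.random(n) < np.where(survey["age"] < 30, 0.30, 0.05)

# MNAR: missingness depends on the income value itself
# (here, the top 10% of earners do not report).
cutoff = survey["income"].quantile(0.90)
mnar_mask = (survey["income"] > cutoff).to_numpy()

# Series.mask replaces values with NaN wherever the mask is True.
mcar_income = survey["income"].mask(mcar_mask)
mar_income = survey["income"].mask(mar_mask)
mnar_income = survey["income"].mask(mnar_mask)

print(survey["income"].mean(), mnar_income.mean())
```

In the MNAR case the observed mean is lower than the true mean, because the values that went missing are exactly the highest ones.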


### Exercise: Categorizing why data is missing


| Student ID | Name | Assignment 1 | Assignment 2 | Assignment 3 | Final Exam |
|------------|------|--------------|--------------|--------------|------------|
| S001       | A    | 85           | 90           | 88           | 92         |
| S002       | B    | 78           |              | 82           | 85         |
| S003       | C    | 92           | 88           |              | 91         |
| S004       | D    | 10           |              |              |            |
| S005       | E    | 70           | 72           | 68           | 75         |
| S006       | F    | 88           |              |              | 90         |
| S007       | G    | 95           | 94           | 96           | 98         |
| S008       | H    | 82           | 80           | 81           | 85         |
| S009       | I    | 77           |              | 79           | 80         |
| S010       | J    |              | 85           | 87           | 88         |


For each explanation below, identify which category (MCAR, MAR, or MNAR) best describes why the data is missing.

<div class="alert alert-block alert-success">
1. Student J enrolled late in the course and missed Assignment 1.
</div>

Missing at Random

<div class="alert alert-block alert-success">

2. Student D did poorly on Assignment 1, so they did not submit the remaining coursework.
</div>

Missing at Random

<div class="alert alert-block alert-success">

3. Student B was sick and missed Assignment 2.
</div>

Missing Completely at Random (nothing to do with other assignments)

<div class="alert alert-block alert-success">

4. Student C found Assignment 3 very difficult. They knew they did poorly on it, so they decided not to submit it.
</div>

Missing Not at Random

## Considerations for choosing between ways to handle missing data

### Removing (dropping) rows

One strategy is to completely remove rows that contain missing data.

You might use this technique when:
- the missing data is **Missing Completely at Random (MCAR)**,
- only a small proportion of the data is missing, and
- removing these rows will not meaningfully affect the analysis or conclusions (for example, it would not remove certain groups).
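
The checks in the list above can be sketched with `DataFrame.dropna`. The data and the `section` grouping column below are hypothetical; the point is to measure how many rows would be lost and whether one group loses more than another before committing to the drop.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "section" is a made-up grouping column.
grades = pd.DataFrame({
    "section": ["A", "A", "A", "B", "B", "B"],
    "assignment_1": [85, 78, 92, 70, np.nan, 95],
    "final_exam": [92, 85, 91, 75, 90, 98],
})

# Fraction of rows that contain at least one missing value.
share_dropped = grades.isna().any(axis=1).mean()

# Keep only the complete rows.
cleaned = grades.dropna()

# Compare group sizes before and after to see whether one group is hit harder.
before = grades["section"].value_counts()
after = cleaned["section"].value_counts()

print(share_dropped)
print(before)
print(after)
```

`dropna` returns a new DataFrame and leaves `grades` unchanged, so the before and after comparison is easy to make.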

### Doing nothing (leaving missing values)

Another strategy is to leave missing values as they are and proceed with the analysis.

You might use this technique when:
- the missing data is **Missing Completely at Random (MCAR)** or **Missing at Random (MAR)**,
- the analysis method or visualization can handle missing values correctly,
- the missingness itself may be informative, or
- removing or filling values would introduce more bias than leaving them missing.
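
As a small illustration of an analysis method that handles missing values, pandas summary methods such as `mean` skip `NaN` by default. The scores below are made up; the sketch also keeps a boolean column recording which entries were missing, in case the missingness itself is informative.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing entry.
scores = pd.Series([85, np.nan, 92, 70], name="assignment_2")

mean_skipping_nan = scores.mean()          # NaN is skipped by default (skipna=True)
mean_with_nan = scores.mean(skipna=False)  # NaN propagates if skipping is turned off
n_observed = scores.count()                # count() ignores NaN

# Record which entries were missing so that information is not lost.
was_missing = scores.isna()

print(mean_skipping_nan, mean_with_nan, n_observed)
```

Not every tool behaves this way, so it is worth checking how a given function or plotting library treats `NaN` before relying on it.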


### Filling (imputing) missing values

Another strategy is to replace missing values with estimated values, such as the mean, median, or a constant.

You might use this technique when:
- the missing data is **Missing at Random (MAR)**,
- the proportion of missing data is **moderate**, and
- you want to **retain all rows** for analysis or modeling.

Fill strategies:
- **Mean or median** filling preserves overall scale but reduces variability.
- **Constant values (e.g., 0)** should only be used when they are meaningful in context.
- Other more complex strategies (not covered in this course).
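
The simple fill strategies above can be sketched with `Series.fillna`. The scores are hypothetical, and the comparison at the end shows the trade-off named in the list: mean filling leaves the mean unchanged but shrinks the standard deviation.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with two missing entries.
scores = pd.Series([60.0, 70.0, np.nan, 80.0, np.nan, 100.0])

filled_mean = scores.fillna(scores.mean())      # fill with the mean of observed values
filled_median = scores.fillna(scores.median())  # fill with the median of observed values
filled_zero = scores.fillna(0)                  # only sensible if 0 is meaningful here

# Mean filling keeps the mean but reduces the spread.
print(scores.mean(), filled_mean.mean())
print(scores.std(), filled_mean.std())

# Filling with 0 pulls the mean down, which is why the constant must make sense.
print(filled_zero.mean())
```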

### What to do about **Missing Not at Random (MNAR)**?

For **Missing Not at Random (MNAR)** data, simple filling methods can introduce bias and should be used with caution. A better strategy may be to investigate why the data is missing and to improve the data collection itself.
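
A tiny made-up example shows how the bias arises. Assume five incomes, where the highest earner does not report (the values and the 100,000 cutoff are invented for illustration). Filling the gap with the mean of the reported values leaves the estimate far below the true mean.

```python
import pandas as pd

# Hypothetical true incomes; in practice the last value would never be seen.
true_income = pd.Series([30_000, 40_000, 50_000, 60_000, 200_000], dtype="float64")

# MNAR: the highest earner does not report, so the gap depends on the hidden value.
reported = true_income.mask(true_income > 100_000)

# Mean filling uses only the reported (lower) incomes.
filled = reported.fillna(reported.mean())

print(true_income.mean())  # 76000.0
print(filled.mean())       # 45000.0
```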