# What to Do With Outliers Once You Find Them? (Hint: You Can't Drop Them)
![](images/fractal.jpg)
<figcaption style="text-align: center;">
Image by <a href="https://pixabay.com/users/realworkhard-23566/?utm_source=link-attribution&amp;utm_medium=referral&amp;utm_campaign=image&amp;utm_content=199054">Ralf Kunze</a> from <a href="https://pixabay.com//?utm_source=link-attribution&amp;utm_medium=referral&amp;utm_campaign=image&amp;utm_content=199054">Pixabay</a>
</figcaption>

### Motivation

Outlier detection is only part of the problem. The real challenge comes in figuring out what to do with these anomalies. It's all too easy to just brush them aside, but there are a lot of nuances and factors to consider before dropping them.

Today, we'll break down the issue from two perspectives. First, we'll look at appropriate courses of action based on why the outliers are present, and then we'll discuss what to do depending on how many outliers the dataset contains. Let's get started!

### 1. Cause of presence: Error

One of the most frequent causes of outliers is human or equipment error. Someone types the wrong number of zeros, presses the wrong key, or forgets to measure something (or measures it twice). A faulty instrument produces inconsistent readings, a software glitch records incorrect data, and so on.

As a data scientist, you usually can't prevent these mistakes because they happen during data collection. The first appropriate course of action is to try to correct them: fix the typo, or replace an implausible value with a common-sense alternative when the intended value is obvious (for example, when a survey respondent is listed as 200 years old and the rest of the record makes it clear the extra zero is a typo, you change it to 20).

When correction is not possible or too expensive, there is nothing left but to filter them out because you know those are incorrect values.
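
As a minimal sketch of this correct-then-filter workflow, assume a hypothetical survey with an `age` column. The column name, the lookup of confirmed fixes, and the plausible range of 0 to 120 are all illustrative assumptions, not rules:

```python
import pandas as pd

# Hypothetical survey data: 200 is a confirmed typo for 20, 999 is unrecoverable
survey = pd.DataFrame({"age": [34, 200, 27, 999, 41]})

# Step 1: apply corrections confirmed with whoever collected the data (assumed)
known_fixes = {200: 20}
survey["age"] = survey["age"].replace(known_fixes)

# Step 2: filter out anything still outside the assumed plausible range
cleaned = survey[survey["age"].between(0, 120)].reset_index(drop=True)
print(cleaned["age"].tolist())
```

Correcting first and filtering second keeps as many rows as possible and only drops values that nobody could repair.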

### 2. Cause of presence: Sampling errors

Statisticians and machine learning engineers use small samples to draw conclusions about a specific target population. However, during data collection, datapoints that aren't from the population may leak into the collected sample. 

For example, imagine you are conducting a study on the growth of apples from your friend's orchard. The defined population of this study is all the apple trees in this orchard, and the sample is 1000 randomly selected apple trees. But because the fence between the two properties isn't clearly visible, a few dozen trees from a neighbor's orchard make it into your sample.

While the neighbor's apples aren't necessarily abnormal, they come from a different population and can possibly distort your entire study.

When such sampling errors happen, the only course of action is removing the outliers. 
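
If the data records where each observation came from, removing them can be a simple filter. The sketch below assumes a hypothetical `orchard` column; in practice you may have to reconstruct that information from GPS coordinates, collection logs, or similar metadata:

```python
import pandas as pd

# Hypothetical sample: one tree was collected from the wrong orchard
trees = pd.DataFrame({
    "tree_id": [1, 2, 3, 4],
    "orchard": ["friend", "friend", "neighbor", "friend"],
    "yield_kg": [42.0, 39.5, 61.0, 40.2],
})

# Keep only observations that belong to the defined target population
in_population = trees[trees["orchard"] == "friend"]
print(in_population)
```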

### 3. Cause of presence: Natural variability

The world is full of surprises and uncertainty. Some outliers might just occur out of nowhere and still be part of the natural variability inherent in the target population. 

Examples include people whose genes make them extremely tall or very short, savants, and animals that can jump unusually high or live far longer than their peers. The examples are endless.

If you take a large enough sample, you are bound to get oddballs that are naturally part of their distributions. They aren't necessarily problems but introduce variability in data. 

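While it is tempting to remove them (since they inflate variance and can make it harder to reach statistical significance), you can't do so just for the sake of better metrics. They are legitimate observations from the population you are studying.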

### 4. Number of outliers

How you deal with outliers also depends heavily on how many there are relative to the dataset size.

If there are only a few (below 1%) and you have abundant data, it is generally safe to exclude them, as long as you are transparent about it. Of course, you have to talk it out with the people who collected the data or with domain experts to make sure the outliers are not part of natural variability.
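
To see which side of that threshold you are on, you can measure the share of flagged points. The sketch below assumes the common 1.5 * IQR rule as the detection method, which is only one of many options; `outlier_share` is a hypothetical helper, and you would swap in whichever detector you actually used:

```python
import pandas as pd

def outlier_share(values: pd.Series, k: float = 1.5) -> float:
    """Fraction of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return float(is_outlier.mean())

# Illustrative data: 199 ordinary values plus one extreme value
values = pd.Series(list(range(1, 200)) + [10_000])
share = outlier_share(values)
print(f"{share:.1%} of the points are flagged as outliers")
```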

If there are so many outliers that they raise suspicion, there is probably some unknown reason for their presence. Maybe they only appear as outliers relative to the majority but are actually key characteristics of the target population. In that case, the drawn sample would be considered non-representative.

The final case is that there are so many outliers in the data that they form a new cluster or sub-group. Here, the same approach is recommended: analyze how the initial machine learning or data science problem was framed, how the target population and sample were chosen, and how so many outliers made it into the dataset.

### What to do with outliers if you don't drop them?

Outliers are often dropped so that they don't skew the mean and standard deviation of features and ultimately degrade model performance. In some contexts, however, the outliers themselves are the point of interest.

Some applications of anomaly detection include intrusion detection (cyber security), fraud detection, fault detection, system health monitoring, event detection in sensor networks, detecting ecosystem disturbances, defect detection in images using machine vision and medical diagnosis. In these scenarios, the nature and presence of outliers are heavily studied to drive business, privacy and medical decisions. 

If you do decide to drop outliers to increase model performance, improve visuals or statistical tests, you should be transparent in your approach. The stakeholders (people who directly benefit or lose from the project) should be informed of the decision. Preferably, you should present two results: one with outliers present and one without. 
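
One lightweight way to present both versions is a side-by-side summary. This sketch assumes you already have a boolean mask marking the outliers; the data and the `is_outlier` mask below are made up for illustration:

```python
import pandas as pd

# Illustrative data and a hypothetical mask from your detection step
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])
is_outlier = pd.Series([False, False, False, False, False, True])

# Same statistics computed twice so stakeholders can compare them directly
stats = ["count", "mean", "std"]
report = pd.DataFrame({
    "with_outliers": values.agg(stats),
    "without_outliers": values[~is_outlier].agg(stats),
})
print(report.round(2))
```

The same pattern extends to model metrics: fit or evaluate once on each version of the data and report both numbers.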

There are also some less aggressive, lightweight alternatives. A popular one is percentile capping (also called *winsorizing*, and sometimes loosely *percentile trimming*), where extreme values beyond the 1st and 99th percentiles are capped at those percentiles rather than removed. You can easily perform the operation in Pandas or NumPy:

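```python
import numpy as np
import pandas as pd

# Illustrative data; any numeric pandas Series works here
rng = np.random.default_rng(0)
distribution = pd.Series(rng.normal(loc=50, scale=10, size=1_000))

# Cap every value that falls outside the 1st-99th percentile range
low = distribution.quantile(0.01)
high = distribution.quantile(0.99)
outlier_free = distribution.clip(low, high)
```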

You can also replace outliers with the median or mode. This can be done with the `replace` method in Pandas or the `where` function in NumPy.
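
As a hedged sketch of the median option, the example below uses NumPy's `where` with the same 1st and 99th percentile cutoffs as the capping example. The cutoffs and the sample data are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative data: 99 ordinary values and one extreme value
distribution = pd.Series(list(range(1, 100)) + [5_000], dtype=float)

low = distribution.quantile(0.01)
high = distribution.quantile(0.99)
median = distribution.median()

# Keep in-range values and swap everything else for the median
is_outlier = (distribution < low) | (distribution > high)
replaced = pd.Series(
    np.where(is_outlier, median, distribution), index=distribution.index
)
```

If you prefer to stay in Pandas, `distribution.mask(is_outlier, median)` gives the same result. Note that percentile cutoffs flag values in both tails, so the smallest value here is replaced along with the extreme one.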