# Outliers

Sometimes our data is not kind enough to mark bad values with a `NaN` or a zero, which would make it easy to tell what we should remove. Sometimes our data has outliers in it instead. So let's look at some strategies for identifying these points.

# Basics

The most basic and most common way of manually pruning outliers from a data distribution is to:

1. Model your data as some analytic distribution
2. Find all points below a certain probability
3. Remove them
4. Refit the distribution to the remaining points, and potentially run again from Step 1.

So, how do we pick what our threshold should be? Visual inspection is actually hard to beat. You can make an argument for tying the threshold to the number of samples you have or to how much of the data you are willing to cut, but be warned that rejecting too much will eat away at your actual data sample and bias your results.
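The four steps above can be sketched in a few lines. This is only a minimal illustration, assuming the data is well described by a single normal distribution; the function names, the `threshold` of 0.01 and the `max_iter` cap are all made up for the example, not recommended values. Note that the threshold here is a probability density, so a sensible value depends on the units and spread of your data.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Probability density of a normal distribution (our assumed model)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def prune_low_probability(data, threshold=0.01, max_iter=10):
    """Return a boolean mask that is True for the points we keep."""
    data = np.asarray(data, dtype=float)
    keep = np.ones(data.shape, dtype=bool)
    for _ in range(max_iter):
        # Step 1: fit the model to the points still under consideration
        mean, std = data[keep].mean(), data[keep].std()
        # Steps 2 and 3: drop kept points whose density is below the threshold
        new_keep = keep & (gaussian_pdf(data, mean, std) > threshold)
        # Step 4: stop once a refit no longer removes anything
        if np.array_equal(new_keep, keep):
            break
        keep = new_keep
    return keep

# Synthetic data: a unit normal plus three obvious outliers at the end
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 1000), [8.0, -9.0, 12.0]])
keep = prune_low_probability(data)
clean = data[keep]
```

Plotting a histogram of `data` and `clean` side by side is the quickest way to check whether the threshold is cutting into the real distribution.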

# Outliers in curve fitting

If you don't have a distribution but instead have data with uncertainties, you can do similar things. To take a real-world example, in an [old paper of mine](https://arxiv.org/abs/1603.09438), we had x values, y values and errors (wavelength, flux and flux error) and wanted to subtract the smooth background. We wanted to do this with a simple polynomial fit, but unfortunately the data had several emission lines and cosmic ray impacts in it (visible as spikes), which biased our polynomial fit, and so we had to remove them.

What we did was fit a polynomial to the data, remove from consideration all points more than three standard deviations from the polynomial, and loop until all remaining points were within three standard deviations. In the example below, for simplicity the data is normalised so that all errors are one.
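Here is a rough sketch of that clip-and-refit loop on synthetic data. It is not the code from the paper: the function name `sigma_clip_polyfit`, the quadratic background, the injected spikes and the `max_iter` cap are all assumptions made for illustration, and "standard deviations" is taken to mean the residual divided by each point's error.

```python
import numpy as np

def sigma_clip_polyfit(x, y, yerr, deg=2, n_sigma=3.0, max_iter=20):
    """Fit a polynomial, iteratively discarding points beyond n_sigma."""
    x, y, yerr = (np.asarray(a, dtype=float) for a in (x, y, yerr))
    keep = np.ones(x.shape, dtype=bool)
    for _ in range(max_iter):
        # Weighted least-squares fit using only the points still kept
        coeffs = np.polyfit(x[keep], y[keep], deg, w=1.0 / yerr[keep])
        # Residuals of every point, in units of its own error
        resid = (y - np.polyval(coeffs, x)) / yerr
        # Points are only ever removed, never added back
        new_keep = keep & (np.abs(resid) < n_sigma)
        if np.array_equal(new_keep, keep):
            break
        keep = new_keep
    return coeffs, keep

# Synthetic "spectrum": smooth background, unit noise, three spikes
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 0.5 * x**2 - 2 * x + 1 + rng.normal(0, 1, x.size)
spikes = [20, 75, 140]
y[spikes] += [15.0, 25.0, 20.0]
yerr = np.ones_like(x)  # errors normalised to one

coeffs, keep = sigma_clip_polyfit(x, y, yerr)
background_subtracted = y - np.polyval(coeffs, x)
```

Because points are only ever removed, the loop is guaranteed to stop, but it also means an unlucky first fit can throw away good data for good, so it is worth plotting the rejected points.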

# Automating it

Blessed `sklearn` to the rescue. Check out [the main page](https://scikit-learn.org/stable/modules/outlier_detection.html), which lists a ton of ways you can do outlier detection. I think LOF (Local Outlier Factor) is great - it uses the distance from each point to its nearest neighbours (twenty by default) to estimate the local point density, compares that with the density around those neighbours, and flags points in comparatively low-density regions as outliers.
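As a small sketch of what that looks like in practice, here is `LocalOutlierFactor` run on a made-up 2D cluster with three stragglers added by hand. The data and the `contamination=0.02` setting (the fraction of points we expect to be outliers) are assumptions for the example; in real use you would tune both `n_neighbors` and `contamination` to your own data.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: one dense cluster plus three far-away points
rng = np.random.default_rng(1)
cluster = rng.normal(0, 1, size=(300, 2))
stragglers = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0]])
points = np.vstack([cluster, stragglers])

# n_neighbors=20 is the scikit-learn default, written out for clarity
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
labels = lof.fit_predict(points)  # 1 for inliers, -1 for outliers

inliers = points[labels == 1]
outliers = points[labels == -1]
```

Since `contamination` fixes the fraction of points that get flagged, a few points on the edge of the cluster are labelled as outliers alongside the stragglers, which is worth keeping in mind before deleting anything.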