# How to Perform Outlier Detection In Python In Easy Steps For Machine Learning
## Let's get those outliers - part 1

### What are outliers?

We live on an outlier. Earth is the only known lump of rock in the Milky Way galaxy that has life on it. The other planets in our galaxy are inliers, or normal datapoints, in a so-called database of stars and planets.

There are many definitions of outliers. In simple terms, we define outliers as datapoints that are significantly different from the majority in a dataset. Outliers are the rare, extreme samples that don't conform to or align with the inliers in a dataset.

Statistically speaking, outliers come from a different distribution than the rest of the samples in a feature. They present statistically significant abnormalities.

It all depends on what we consider "normal". For example, it is perfectly normal for CEOs to make millions of dollars but if we add their salary information to a dataset of household incomes, they become abnormal. 

Outlier detection is the field of statistics and machine learning that uses a variety of techniques and algorithms to detect such extreme samples. 

### Why bother with outlier detection?

But why though? Why do we need to find them? What's the harm in them? Well, consider this distribution of 10 numbers that range from 50 to 100, except for one datapoint, which is 2534. It is clearly an outlier.

Let's calculate the mean and standard deviation of this distribution. 
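The exact values don't matter for the argument, so here is a minimal sketch with made-up numbers (nine values between 50 and 100 plus 2534), assuming NumPy is installed:

```python
import numpy as np

# Hypothetical sample: nine values between 50 and 100 plus one extreme value
values = np.array([69, 78, 53, 92, 64, 85, 71, 58, 96, 2534])

mean = values.mean()
std = values.std()  # population standard deviation (ddof=0)

print(f"Mean: {mean:.2f}")
print(f"Standard deviation: {std:.2f}")
```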

Now, let's do the same but after removing the outlier.
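Again as a sketch on the same made-up values (they are illustrative, not real data), we drop 2534 and recompute both metrics:

```python
import numpy as np

# Same hypothetical sample as before
values = np.array([69, 78, 53, 92, 64, 85, 71, 58, 96, 2534])

# Keep everything except the extreme value
cleaned = values[values != 2534]

clean_mean = cleaned.mean()
clean_std = cleaned.std()  # population standard deviation (ddof=0)

print(f"Mean: {clean_mean:.2f}")
print(f"Standard deviation: {clean_std:.2f}")
```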

With the outlier removed, both the mean and the standard deviation shrink dramatically. A single extreme value is enough to inflate both.

Mean and standard deviation are two of the most heavily used metrics in statistics and machine learning. As outliers skew the true value of these metrics, it is highly important to find them and explore the reasons for their presence.

When left unchecked, they will disrupt all statistical qualities of distributions and in turn, hurt the performance of machine learning models.

### What you will learn in this tutorial

Outlier detection is very easy to perform in code with libraries like PyOD or Sklearn once you understand the important theory behind the process. For example, here is how to do outlier detection using the popular Isolation Forest algorithm.

It only takes a few lines of code.
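A minimal sketch, assuming Sklearn and NumPy are installed. The dataset is synthetic and made up for illustration, and the `contamination` value is a guess rather than a recommendation:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data for illustration: 200 "normal" rows plus 5 far-away rows
rng = np.random.default_rng(42)
inliers = rng.normal(loc=0, scale=1, size=(200, 2))
extremes = rng.uniform(low=6, high=8, size=(5, 2))
X = np.vstack([inliers, extremes])

# contamination is our guess at the share of outliers in the data
iforest = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)

# fit_predict returns 1 for inliers and -1 for outliers
labels = iforest.fit_predict(X)

n_outliers = (labels == -1).sum()
print(f"Flagged {n_outliers} of {len(X)} rows as outliers")
```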

Therefore, this tutorial will focus more on theory. Specifically, we look at outlier detection in the context of unsupervised learning, the concept of contamination in datasets, the differences between anomalies, outliers and novelties, and univariate versus multivariate outliers.

Let's get started.

### Outlier detection is an unsupervised problem

Unlike many other ML tasks, outlier detection is an unsupervised learning problem. What do we mean by that?

For example, in classification, we have a set of features that map to specific outputs. We have labels that tell us which sample is a dog and which one is a cat.

In outlier detection, that's not the case. We have no prior knowledge of outliers when we are presented with a new dataset. This causes a number of challenges (but nothing we can't handle).

First of all, we won't have an easy way of measuring the effectiveness of outlier detection methods. In classification, we use metrics such as accuracy or precision to measure how well the algorithm fits our training dataset. In outlier detection, we can't use these metrics because we won't have any labels that allow us to compare predictions to ground truth.

And since we can't use traditional metrics to measure performance, we can't easily perform hyperparameter tuning. This makes it even harder to find the best outlier classifier (an algorithm that returns inlier/outlier labels for each row of a dataset) for the task at hand.

However, don't despair. We will see two excellent workarounds in the next tutorial.

### Anomalies vs. outliers vs. novelties

In many sources, you'll see the terms "anomalies", "outliers" and "novelties" cited together. Even though they are close in meaning, there are important distinctions.

An anomaly is a general term that encompasses anything out of the ordinary, abnormal. Anomalies can refer to irregularities in either training or test sets.

As for outliers, they only exist in training data. Outlier detection only refers to the process of finding abnormal datapoints in the training set. Outlier classifiers only perform a `fit` to the training data and return inlier/outlier labels.

On the other hand, novelties exist only in the test set. In novelty detection, you have a clean, outlier-free dataset and you are trying to see if new, unseen observations contain outliers. Hence, outliers in a test set become novelties.

In short, anomaly detection is the parent field of both outlier and novelty detection. While outliers only refer to abnormal samples in the training data, novelties exist in the test set.
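To make the distinction concrete, here is a sketch using Sklearn's `LocalOutlierFactor`, which supports both modes through its `novelty` parameter. The choice of estimator and the synthetic data are ours, purely for illustration:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic "clean" training data and a copy with one extreme row appended
X_clean = rng.normal(loc=0, scale=1, size=(200, 2))
X_dirty = np.vstack([X_clean, [[8.0, 8.0]]])

# Outlier detection: fit on the training data and label those same rows
lof = LocalOutlierFactor(n_neighbors=20)
train_labels = lof.fit_predict(X_dirty)  # 1 = inlier, -1 = outlier

# Novelty detection: fit on clean data, then label new, unseen rows
novelty_lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
novelty_lof.fit(X_clean)
X_new = np.array([[0.1, -0.2], [9.0, 9.0]])
new_labels = novelty_lof.predict(X_new)  # 1 = inlier, -1 = novelty

print("Label of the extreme training row:", train_labels[-1])
print("Labels of the new rows:", new_labels)
```

In the first mode the estimator only ever sees one dataset. In the second, fitting and labeling happen on separate data.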

This distinction is important for when we start using outlier classifiers in the next tutorial.

### Univariate vs. multivariate outliers

Univariate and multivariate outliers refer to outliers in different types of data. 

As the name suggests, univariate outliers only exist in single distributions. An example is an extremely tall person in a dataset of height measurements.

Multivariate outliers are a bit trickier. They are samples with two or more attributes that don't appear anomalous when looked at individually but become outliers when all attributes are considered in unison.

An example multivariate outlier can be an old car with very low mileage. The attributes of this car may not be abnormal when looked at individually but when combined, you'll realize that old cars usually have high mileage proportional to their age. (There are many old cars and also many cars with low mileage but there aren't many cars that are both old and have low mileage).

The distinction between types of outliers becomes important when choosing an algorithm to detect them. 

As univariate outliers exist in datasets with only one column, you can use simple and lightweight methods such as [z-scores](https://en.wikipedia.org/wiki/Standard_score) or [modified z-scores](https://en.wikipedia.org/wiki/Median_absolute_deviation).
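Both methods fit in a few lines of NumPy. This is a sketch on synthetic height data (made up for illustration); the cutoffs of 3 and 3.5 and the 0.6745 scaling constant are common conventions, not fixed rules:

```python
import numpy as np

# Synthetic heights in cm: 100 typical values plus one extremely tall person
rng = np.random.default_rng(0)
heights = np.append(rng.normal(loc=170, scale=8, size=100), 250.0)

# Z-score: distance from the mean in units of standard deviation
z_scores = (heights - heights.mean()) / heights.std()
z_outliers = np.abs(z_scores) > 3

# Modified z-score: uses the median and the median absolute deviation (MAD),
# which are far less affected by the extreme values themselves
median = np.median(heights)
mad = np.median(np.abs(heights - median))
modified_z_scores = 0.6745 * (heights - median) / mad
modified_outliers = np.abs(modified_z_scores) > 3.5

print("Flagged by z-score:", heights[z_outliers])
print("Flagged by modified z-score:", heights[modified_outliers])
```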

Multivariate outliers pose a bigger challenge since they may only emerge across many columns of a dataset. For that reason, you have to take out the big guns such as Isolation Forest, KNN, Local Outlier Factor and so on.
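The old-car example can be sketched with Local Outlier Factor from Sklearn. The car data below is synthetic, and the numbers (about 12,000 miles per year, the noise level, the 18-year-old car with 15,000 miles) are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

# Synthetic cars: mileage grows roughly in proportion to age
rng = np.random.default_rng(1)
age = rng.uniform(2, 20, size=300)
mileage = 12_000 * age + rng.normal(0, 8_000, size=300)
cars = np.column_stack([age, mileage])

# Append one old car with very low mileage
cars = np.vstack([cars, [18.0, 15_000.0]])

# Scale the columns so that mileage doesn't dominate the distances
scaled = StandardScaler().fit_transform(cars)

# Looked at one column at a time, the odd car is unremarkable
odd_car_z = np.abs(scaled[-1])
print("Absolute z-scores of the odd car (age, mileage):", odd_car_z)

# Looked at jointly, it sits far away from every other car
labels = LocalOutlierFactor(n_neighbors=20).fit_predict(scaled)
print("LOF label of the odd car:", labels[-1])  # -1 means outlier
```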

In the coming tutorials, we'll see how to use some of the above methods. 

### Conclusion