# 1 — Setting the Stage
In recent years, anomaly detection has become more popular in the machine learning community. Despite this, there are far fewer resources on anomaly detection than on classical machine learning algorithms. As a result, learning about anomaly detection can feel trickier than it should. From a conceptual standpoint, anomaly detection is actually very simple!

The goal of this blog post is to give you a quick introduction to anomaly/outlier detection. Specifically, I will show you how to implement anomaly detection in Python with the package PyOD (Python Outlier Detection). That way, you will come away understanding both what anomaly/outlier detection is and how to implement it in Python.



# 2 — What is Anomaly/Outlier Detection?
Anomaly detection goes by many names: outlier detection, outlier analysis, anomaly analysis, and novelty detection. Wikipedia describes anomaly detection concisely as follows:

> Anomaly detection is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.

Let’s try to unpack the above statement. Say you have a dataset consisting of many observations. The goal of anomaly detection is to identify the observations that differ significantly from the rest. Why would you want to do this? There are two major reasons:

### Use Case 1 — Data Cleaning
When cleaning the data, it is sometimes better to remove anomalies, as they can misrepresent the data.

**Anomaly detection means using algorithms to detect outliers automatically.**
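
As a toy illustration of what detecting outliers automatically can mean, here is a minimal sketch that flags values lying more than two standard deviations from the mean. The fares, the threshold of 2, and the helper name `flag_outliers` are all made up for this example; the models used later in this post are more sophisticated.

```python
from statistics import mean, pstdev

def flag_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    center = mean(values)
    spread = pstdev(values)
    return [v for v in values if abs(v - center) / spread > threshold]

# Made-up ticket fares containing one extreme value
fares = [7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 21.08, 11.13, 30.07, 512.33]
flagged = flag_outliers(fares)
print(flagged)
```

A rule like this only looks at one column at a time, which is one reason to reach for a dedicated library when your data has several features.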

### Use Case 2 — Prediction
In other applications, the anomalies themselves are the point of interest. Examples are network intrusion, bank fraud, and certain structural defects. In these applications, the anomalies represent something that is worthy of further study.
* Network intrusion — anomalies in network data can indicate that a network attack of some sort has taken place.
* Bank fraud — anomalies in transaction data can indicate fraud or suspicious behaviour.
* Structural defects — anomalies can indicate that something is wrong with your hardware. While more traditional monitoring software is typically available in this setting, anomaly detection can uncover more unusual defects.

# 3 — Introducing PyOD
**PyOD is a comprehensive and scalable Python toolkit for detecting outlying objects in multivariate data.
Briefly put, PyOD supplies you with a bunch of models that perform anomaly detection.**

I will demonstrate two algorithms for doing anomaly detection: KNN and LOF. You may have heard of KNN (K-Nearest Neighbors) before, while LOF (Local Outlier Factor) is probably unfamiliar to you. Let’s first take a look at the data you will be using.

# 4 — Getting Familiar with the Data
You will be using the classical Titanic dataset.

In [1]:
import pandas as pd
titanic = pd.read_csv(
"https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
)

In [2]:
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [3]:
# Select only the columns Survived, Pclass, Fare, and Sex.
# .copy() gives an independent DataFrame, so assigning to a column later
# does not trigger pandas' SettingWithCopyWarning.
partial_titanic = titanic[["Survived", "Pclass", "Fare", "Sex"]].copy()

There are no missing values in partial_titanic. However, the column Sex consists of the string values male or female. To be able to do anomaly detection, you need numeric values. You can convert this binary categorical variable to the values 0 and 1 with the code:

In [4]:
# Change the categorical value Sex to numeric values
partial_titanic["Sex"] = partial_titanic["Sex"].map({
    "male": 0,
    "female": 1
})


# 5 — Anomaly Detection for Data Cleaning
Let’s now use anomaly detection to clean the dataset partial_titanic you made in the previous section. You will use the KNN model to do this. The KNN model examines the data and looks for data points (rows) that are far from the other data points. To get started, you import the KNN model as follows:

In [5]:
# Import the KNN
from pyod.models.knn import KNN

In [6]:
# Initiate a KNN model
KNN_model = KNN()

In [7]:
# Fit the model to the whole dataset
KNN_model.fit(partial_titanic)

KNN(algorithm='auto', contamination=0.1, leaf_size=30, method='largest',
  metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5, p=2,
  radius=1.0)

Running the code above prints the model together with its default parameter values (e.g. contamination=0.1). These can be tweaked if needed. After fitting a model, you can access two types of output:

`Labels`: By running KNN_model.labels_ you can find binary labels of whether an observation is an outlier or not. The number 0 indicates a normal observation, while the number 1 indicates an outlier.

`Decision Scores`: By running KNN_model.decision_scores_ you get the raw outlier score of each observation. The scores start at 0 and have no upper bound. A higher score indicates that a data point is more of an outlier.
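
To get a feel for where such scores can come from, here is a minimal pure-Python sketch that scores each point by the distance to its k-th nearest neighbor. As far as I understand, this is the idea behind the `method='largest'` setting in the printout above, but the function `knn_outlier_scores` and the sample points are made up for illustration and are not PyOD's implementation.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Four points in a tight square plus one point far away
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(points, k=2)
print(scores)
```

The four corner points all get a small score, while the distant point gets a much larger one, which is exactly what "more of an outlier" means here.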

In [8]:
# Find the labels
outlier_labels = KNN_model.labels_
# Find the number of outliers
number_of_outliers = len(outlier_labels[outlier_labels == 1])
print(number_of_outliers)

88


For a dataset with 891 passengers, 88 outliers is quite a lot. To reduce this, you can set the contamination parameter of the KNN model to a lower value. The contamination is the proportion of data points you expect to be outliers. Let’s say that the contamination is only 1%:

In [9]:
# Initiate a KNN model
KNN_model = KNN(contamination=0.01)
# Fit the model to the whole dataset
KNN_model.fit(partial_titanic)
# Find the labels
outlier_labels = KNN_model.labels_
# Find the number of outliers
number_of_outliers = len(outlier_labels[outlier_labels == 1])
print(number_of_outliers)

9
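
The link between contamination and the number of flagged rows can be sketched as follows: given a list of outlier scores, label the highest-scoring fraction as outliers. This is a simplified stand-in with a made-up helper `labels_from_scores` and made-up scores. PyOD derives a score threshold from the contamination instead, so its behaviour may differ in detail, for example when scores are tied.

```python
def labels_from_scores(scores, contamination):
    """Label the highest-scoring fraction of points as outliers (1)."""
    n_outliers = round(contamination * len(scores))
    # Indices ordered from highest to lowest score
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_outliers])
    return [1 if i in flagged else 0 for i in range(len(scores))]

# Made-up outlier scores for ten observations
scores = [0.4, 0.5, 9.8, 0.6, 0.3, 0.7, 4.2, 0.5, 0.4, 0.6]
labels_10_percent = labels_from_scores(scores, contamination=0.1)
labels_20_percent = labels_from_scores(scores, contamination=0.2)
print(labels_10_percent)
print(labels_20_percent)
```

By the same logic, 1% of 891 passengers is about 9 rows, which matches the count printed above.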


In [10]:
# Finding the outlier passengers
outliers = partial_titanic.iloc[outlier_labels == 1]

In [12]:
outliers

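| | Survived | Pclass | Fare | Sex |
|---|---|---|---|---|
| 258 | 1 | 1 | 512.3292 | 1 |
| 380 | 1 | 1 | 227.525 | 1 |
| 679 | 1 | 1 | 512.3292 | 0 |
| 689 | 1 | 1 | 211.3375 | 1 |
| 700 | 1 | 1 | 227.525 | 1 |
| 716 | 1 | 1 | 227.525 | 1 |
| 730 | 1 | 1 | 211.3375 | 1 |
| 737 | 1 | 1 | 512.3292 | 0 |
| 779 | 1 | 1 | 211.3375 | 1 |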


If you check out the passengers above, you can see what the KNN model has picked up on: their fares are incredibly high. The average fare for all the passengers is easy to find with pandas:

In [13]:
# Average fare price
round(partial_titanic["Fare"].mean(), 3)

32.204

The KNN algorithm has successfully found 9 passengers that are outliers in terms of their fare. There are many optional parameters you can play around with to make the KNN model suit your specific needs 🔥

The outliers can now be removed from the data if you think they don’t represent the data as a whole. As mentioned previously, you should consider carefully whether anomaly detection for data cleaning is appropriate for your problem.

# 6 — Anomaly Detection for Prediction
In the previous section, you looked at anomaly detection for data cleaning. In this section, you will take a peek at anomaly detection for prediction. You will train a model on existing data, and then use the model to predict whether new data are outliers.
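
The train-then-predict workflow can be sketched in plain Python before bringing in a real model. The class `TinyNoveltyDetector` below is hypothetical and not part of PyOD. It makes one simplifying assumption: a new row counts as an outlier if it lies farther from its k-th nearest training row than any training row does from its own k-th nearest neighbor. The (passenger class, fare) rows are made up.

```python
import math

class TinyNoveltyDetector:
    """Hypothetical toy detector for illustration; not part of PyOD."""

    def __init__(self, k=2):
        self.k = k

    def _kth_distance(self, point, rows):
        # Distance from `point` to its k-th nearest row
        return sorted(math.dist(point, r) for r in rows)[self.k - 1]

    def fit(self, rows):
        self.rows_ = [tuple(r) for r in rows]
        # Score each training row against the other training rows
        train_scores = [
            self._kth_distance(p, self.rows_[:i] + self.rows_[i + 1:])
            for i, p in enumerate(self.rows_)
        ]
        # Assumption: anything above the largest training score is an outlier
        self.threshold_ = max(train_scores)
        return self

    def predict(self, rows):
        # 1 marks an outlier, 0 a normal observation
        return [int(self._kth_distance(tuple(r), self.rows_) > self.threshold_)
                for r in rows]

# Made-up (passenger class, fare) rows used for training
train = [(3, 7.25), (3, 7.9), (3, 8.05), (1, 53.1), (1, 71.3),
         (2, 13.0), (2, 21.0), (3, 8.46), (1, 51.9), (2, 11.1)]
detector = TinyNoveltyDetector(k=2).fit(train)
predictions = detector.predict([(1, 1000.0), (3, 8.0)])
print(predictions)
```

The important part is the split: `fit` only ever sees the existing data, and `predict` judges new rows against what was learned there.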

Say a rumor spread that a Mrs. Watson had also traveled on the Titanic, but her death was never recorded. According to the rumor, Mrs. Watson was a wealthy lady who paid $1000 to travel on the Titanic in a very exclusive suite.

Anomaly detection cannot say with certainty whether the rumor is true or false. However, it can say whether Mrs. Watson is an anomaly or not based on the information about the other passengers. If she is an anomaly, the rumor should be taken with a grain of salt.

Let’s test Mrs. Watson’s existence with another model in the PyOD library: Local Outlier Factor (LOF). A LOF model tests whether a data point is an outlier by comparing the local density of the data point with the local densities of its neighbors.
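
To make the density comparison concrete, here is a minimal pure-Python sketch of the textbook LOF computation. It is a simplified illustration, not PyOD's implementation: the function `lof_scores` and the sample points are made up, each point gets exactly k neighbors even when distances are tied, and the points are assumed to be distinct.

```python
import math

def lof_scores(points, k=2):
    """Return one LOF score per point; values well above 1 suggest an outlier."""
    n = len(points)
    # For every point, its k nearest neighbors as (distance, index) pairs
    neighbors = []
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j)
                       for j in range(n) if j != i)
        neighbors.append(dists[:k])
    # k-distance: distance from each point to its k-th nearest neighbor
    k_distance = [nb[-1][0] for nb in neighbors]
    # Local reachability density: inverse of the mean reachability distance
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean density of the neighbors divided by the point's own density
    return [sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
            for i in range(n)]

# A small cluster plus one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
lof = lof_scores(points, k=2)
print([round(s, 2) for s in lof])
```

Points inside the cluster have about the same density as their neighbors, so their score stays close to 1. The isolated point sits in a much sparser region than its neighbors and gets a far larger score.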

In [14]:
# Import the LOF
from pyod.models.lof import LOF
# Initiate a LOF model
LOF_model = LOF()
# Train the model on the Titanic data
LOF_model.fit(partial_titanic)

LOF(algorithm='auto', contamination=0.1, leaf_size=30, metric='minkowski',
  metric_params=None, n_jobs=1, n_neighbors=20, novelty=True, p=2)

In [15]:
# Represent Mrs. Watson as a data point
mrs_watson = [[0, 1, 1000, 1]]

The values in mrs_watson represent her survival (0 for not survived), passenger class (1 for first class), fare ($1000), and sex (1 for female). The LOF model requires 2D arrays, which is the reason for the extra pair of brackets [] in mrs_watson.
We now use the predict() method to predict whether Mrs. Watson is an outlier or not:

In [16]:
outlier = LOF_model.predict(mrs_watson)
print(outlier)

[1]


A value of 1 indicates that Mrs. Watson is an outlier. This should make you suspect that the rumor regarding Mrs. Watson is false.