# 2 May 

Answer 1
Anomaly detection refers to the process of identifying observations or data points that deviate from the expected or normal behavior in a dataset. The purpose of anomaly detection is to detect unusual or suspicious events that might be indicative of errors, fraud, or security breaches. Anomaly detection is used in a variety of applications, such as fraud detection in financial transactions, network intrusion detection, and predictive maintenance in industrial settings.

Answer 2
Anomaly detection faces several key challenges, including:

Lack of labeled data: Anomaly detection often involves detecting rare events that are poorly represented in the training data. As a result, there may not be enough labeled anomalies available to train or validate accurate anomaly detection models.

High-dimensional data: Anomaly detection is particularly challenging in high-dimensional data. As the number of features grows, distances between points become less informative (the curse of dimensionality), and an anomaly may be visible only in a small subset of the features.

Concept drift: What counts as normal behavior changes over time, for example when new types of fraud or network attacks appear. Anomaly detection models therefore need to be retrained or updated to account for changes in the data distribution.

Interpretability: Anomaly detection models can be difficult to interpret, especially if they use complex algorithms such as neural networks.

Answer 3
Unsupervised anomaly detection and supervised anomaly detection differ in the type of data available for training the model:

Unsupervised anomaly detection: In unsupervised anomaly detection, the model is trained on a dataset that does not have labeled anomalies. The goal is to learn a representation of the normal data distribution and identify observations that deviate significantly from that distribution. Unsupervised anomaly detection is useful in situations where labeled anomalies are rare or difficult to obtain.

Supervised anomaly detection: In supervised anomaly detection, the model is trained on a dataset that includes labeled anomalies. The goal is to learn a decision boundary that separates normal data from anomalous data. Supervised anomaly detection is useful in situations where labeled anomalies are readily available and the focus is on identifying specific types of anomalies. However, supervised anomaly detection may not perform well when faced with novel or previously unseen anomalies.

Answer 4
There are several categories of anomaly detection algorithms:

Statistical methods: These methods use statistical models to detect anomalies. They assume that normal data follows a certain probability distribution, such as Gaussian or Poisson, and identify data points that have low probability under this distribution.

Machine learning methods: These methods use machine learning algorithms to learn a model of normal data and identify anomalies based on deviations from this model. Machine learning methods can be supervised or unsupervised.

Distance-based methods: These methods measure the distance or similarity between data points and identify points that are far away or dissimilar from the rest of the data.

Clustering methods: These methods group similar data points together and identify points that do not belong to any cluster or belong to a small cluster.
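
As an illustration of the statistical category, the following is a minimal sketch of a z-score detector. It assumes the data is one-dimensional and roughly Gaussian. The function name `zscore_outliers`, the threshold of 3.0, and the sample readings are hypothetical choices made for this example only.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the indices of values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values are identical, so nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]

# Hypothetical sensor readings with one obvious spike at the end.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.7, 10.3, 9.9, 10.0, 25.0]
print(zscore_outliers(readings))  # [11]
```

Because the outlier itself inflates the mean and standard deviation, this simple version can miss anomalies in very small samples. Robust alternatives such as the median and median absolute deviation are often preferred in practice.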

Answer 5
Distance-based anomaly detection methods make the following assumptions:

Normal data points are tightly clustered in some feature space.

Anomalies are located far away from the cluster of normal data points.

The number of anomalies is small compared to the number of normal data points.

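Based on these assumptions, distance-based methods score each point by its distance to other points (for example, the distance to its k-th nearest neighbor) and flag the points with the largest distances as anomalies. When normal data forms several clusters of different densities, a single global distance threshold can fail, which is the case density-based methods such as LOF are designed for.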

Answer 6
The LOF (Local Outlier Factor) algorithm computes anomaly scores as follows:

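For each data point, it finds the k nearest neighbors and the k-distance, which is the distance to the k-th nearest neighbor (k is a user-defined parameter).

It then computes the reachability distance from a point A to a neighbor B as the maximum of the k-distance of B and the actual distance between A and B: reach-dist(A, B) = max(k-distance(B), d(A, B)).

The local reachability density of each point is defined as the inverse of the average reachability distance from the point to its k nearest neighbors.

Finally, the LOF score of each point is defined as the average local reachability density of its k nearest neighbors, divided by the local reachability density of the point itself. A score close to 1 means the point is about as dense as its neighbors, while a score well above 1 marks it as an outlier.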
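
The steps above can be sketched in plain Python. This is a brute-force illustration rather than a library implementation. It assumes there are no duplicate points (which would give a zero reachability distance) and it breaks distance ties arbitrarily instead of enlarging the neighborhood. The function name `lof_scores` and the sample points are hypothetical.

```python
import math

def lof_scores(points, k):
    """Local Outlier Factor of every point, computed by brute force in O(n^2)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Step 1: k nearest neighbors (excluding the point itself) and k-distance.
    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Step 2: reach-dist(i, j) = max(k-distance(j), d(i, j)).
    def reach_dist(i, j):
        return max(k_distance[j], dist[i][j])

    # Step 3: local reachability density = 1 / mean reachability distance.
    lrd = []
    for i in range(n):
        mean_reach = sum(reach_dist(i, j) for j in neighbors[i]) / k
        lrd.append(1.0 / mean_reach)

    # Step 4: LOF = mean density of the neighbors / density of the point.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical data: four corners of a unit square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
```

In this example the four corner points get a score of 1.0, since each is exactly as dense as its neighbors, while the distant point gets a score of about 6.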

Answer 7
The Isolation Forest algorithm has two key parameters:

n_estimators: the number of isolation trees to build in the ensemble. More trees give a more stable estimate of the average path length, and therefore of the anomaly score, at a higher computational cost. The benefit levels off once the average path lengths have converged.

max_samples: the number of samples drawn at random from the dataset to build each tree (not at each node). This parameter controls the trade-off between the quality of isolation and the speed of training. Larger subsamples increase the computational cost and produce deeper trees, while small subsamples are often sufficient because anomalies are isolated after only a few splits.
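
To show where the two parameters enter the algorithm, here is a simplified sketch of an isolation forest for one-dimensional data. It is an illustration under stated assumptions, not scikit-learn's implementation: the function names, the dictionary-based tree layout, and the sample data are all hypothetical, and the dataset is assumed to contain at least two points.

```python
import math
import random

def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_tree(values, depth, max_depth, rng):
    """Recursively split the values at a random cut point until isolated."""
    if len(values) <= 1 or depth >= max_depth or min(values) == max(values):
        return {"size": len(values)}  # leaf node
    cut = rng.uniform(min(values), max(values))
    return {
        "cut": cut,
        "left": build_tree([v for v in values if v < cut], depth + 1, max_depth, rng),
        "right": build_tree([v for v in values if v >= cut], depth + 1, max_depth, rng),
    }

def path_length(x, node, depth=0):
    """Depth at which x reaches a leaf, plus an adjustment for unsplit points."""
    if "cut" not in node:
        return depth + c_factor(node["size"])
    child = node["left"] if x < node["cut"] else node["right"]
    return path_length(x, child, depth + 1)

def isolation_scores(data, queries, n_estimators=100, max_samples=64, seed=0):
    rng = random.Random(seed)
    max_samples = min(max_samples, len(data))
    max_depth = math.ceil(math.log2(max_samples))
    # n_estimators trees, each built from its own random subsample of max_samples points.
    trees = [build_tree(rng.sample(data, max_samples), 0, max_depth, rng)
             for _ in range(n_estimators)]
    scores = []
    for x in queries:
        mean_path = sum(path_length(x, tree) for tree in trees) / n_estimators
        scores.append(2.0 ** (-mean_path / c_factor(max_samples)))
    return scores

# Hypothetical data: 200 values in [0, 2) plus a single distant value.
data = [i / 100 for i in range(200)] + [10.0]
inlier_score, outlier_score = isolation_scores(data, [1.0, 10.0])
```

The distant value is reached in fewer splits on average, so its score is higher than that of the value in the middle of the cluster.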

Answer 8
To compute the anomaly score of a data point using KNN with K=10, we need the distance to its 10th nearest neighbor (a common variant uses the average distance to the 10 nearest neighbors instead). If the data point has only 2 neighbors of the same class within a radius of 0.5, and we treat these as the only points inside that radius, then the other 8 of its 10 nearest neighbors lie more than 0.5 away, so the distance to the 10th nearest neighbor is greater than 0.5. The point sits in a sparse region, and its anomaly score using KNN with K=10 is therefore likely to be high. However, the actual value of the anomaly score depends on the distance to the 10th nearest neighbor, which cannot be determined without additional information about the dataset.
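
The reasoning can be checked with a small sketch. The coordinates below are made up for illustration: two neighbors lie inside the 0.5 radius and ten others lie farther away. The function name `kth_neighbor_distance` is also hypothetical.

```python
import math

def kth_neighbor_distance(point, others, k):
    """Anomaly score defined as the distance from `point` to its k-th nearest neighbor."""
    distances = sorted(math.dist(point, q) for q in others)
    if k > len(distances):
        raise ValueError("need at least k other points")
    return distances[k - 1]

# Hypothetical 2-D data: two neighbors within radius 0.5, ten farther away.
query = (0.0, 0.0)
others = [(0.1, 0.2), (-0.3, 0.1)] + [(2.0 + 0.1 * i, 1.0) for i in range(10)]

within_radius = sum(1 for q in others if math.dist(query, q) <= 0.5)
score = kth_neighbor_distance(query, others, k=10)
print(within_radius, score > 0.5)  # 2 True
```

Whatever the actual layout of the remaining points, the score is bounded below by the radius once fewer than 10 neighbors fall inside it.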

Answer 9
The anomaly score for a data point using the Isolation Forest algorithm is defined as:

s = 2^(-E(x)/c)

where E(x) is the average path length of the data point over all trees, and c is a normalization factor: the average path length of an unsuccessful search in a binary search tree built from n points. c depends only on the number of points n used to build each tree, not on the number of trees in the forest.

The 100 trees therefore only affect how stable the average E(x) is. Assuming each tree is built from n = 3000 data points, the value of c can be computed as:

c = 2 * H(n-1) - (2*(n-1)/n)

where n is the number of data points and H is the harmonic number function, with H(i) approximately equal to ln(i) + 0.5772 (Euler's constant). For n = 3000, we have:

c = 2 * H(2999) - (2*2999/3000) ≈ 2 * 8.5834 - 1.9993 ≈ 15.1675

The data point has an average path length of 5.0 across the trees, so E(x) = 5.0. Plugging this into the formula for the anomaly score, we get:

s = 2^(-5.0/15.1675) ≈ 0.796

Since E(x) = 5.0 is much shorter than the expected path length c ≈ 15.17, the score is well above 0.5 and the point is likely an anomaly.
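
As a numerical check, the sketch below evaluates the normalization factor with an exact harmonic sum and then the score formula. The function names are illustrative, and n = 3000 is taken as the number of points per tree, as in the calculation above.

```python
def harmonic(m):
    """Exact harmonic number H(m) = 1 + 1/2 + ... + 1/m."""
    return sum(1.0 / i for i in range(1, m + 1))

def normalization(n):
    """c = 2 * H(n-1) - 2 * (n-1) / n."""
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """s = 2^(-E(x) / c)."""
    return 2.0 ** (-avg_path_length / normalization(n))

c_value = normalization(3000)
s_value = anomaly_score(5.0, 3000)
print(round(c_value, 4), round(s_value, 3))  # 15.1675 0.796
```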