# Q1. What is anomaly detection and what is its purpose?

# Anomaly detection is the process of finding instances or patterns in a dataset that deviate significantly from expected, "normal" behavior. Its purpose is to identify rare events, outliers, or errors that may indicate a problem in the data source¹.

# Q2. What are the key challenges in anomaly detection?

# Some of the key challenges in anomaly detection are:

# - Defining what constitutes a normal or anomalous behavior, which may vary depending on the context and domain of the data.
# - Choosing an appropriate anomaly detection technique, which may depend on the type, structure, and characteristics of the data, as well as the desired output and performance of the algorithm.
# - Evaluating the performance and accuracy of the anomaly detection algorithm, which may require a labeled dataset or a ground truth, or some other criteria to measure the quality of the results.
# - Handling the trade-off between false positives and false negatives, which may require tuning the parameters or thresholds of the algorithm to balance the sensitivity and specificity of the anomaly detection.
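
# The trade-off in the last point can be seen by sweeping a threshold over anomaly scores.
# This is a minimal sketch: the scores, labels, and the helper name confusion_counts are
# made up for illustration and do not come from any particular detector.

```python
# Hypothetical anomaly scores and ground-truth labels (1 = anomaly), for illustration only.
scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1, 1]


def confusion_counts(scores, labels, threshold):
    """Count false positives and false negatives when flagging scores >= threshold."""
    false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    false_neg = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return false_pos, false_neg


# A low threshold flags more points (more false positives);
# a high threshold flags fewer (more false negatives).
for t in (0.3, 0.5, 0.7):
    fp, fn = confusion_counts(scores, labels, t)
    print(f"threshold={t}: false positives={fp}, false negatives={fn}")
```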

# Q3. How does unsupervised anomaly detection differ from supervised anomaly detection?

# Unsupervised anomaly detection works on an unlabeled data set. It assumes that the majority of the instances are normal and flags the instances that fit the remainder of the data set least well. Supervised anomaly detection requires a data set whose instances have been labeled "normal" or "abnormal" and trains a classifier to distinguish between the two classes¹.

# Q4. What are the main categories of anomaly detection algorithms?

# There are many types of anomaly detection algorithms, but some of the main categories are:

# - Statistical-based methods: These methods use statistical models to describe the normal behavior of the data and then identify anomalies as deviations from the model. Examples of these methods are parametric methods (such as Gaussian mixture models), non-parametric methods (such as kernel density estimation), and linear methods (such as principal component analysis, which is a linear projection technique rather than a form of regression).
# - Distance-based methods: These methods use distance or similarity measures to compare each instance to its neighbors and then identify anomalies as instances that are far away or dissimilar from their neighbors. Examples of these methods are k-nearest neighbors, local outlier factor, and clustering-based methods.
# - Reconstruction-based methods: These methods use a model to learn a representation or approximation of the data and then identify anomalies as instances that have a high reconstruction error or residual. Examples of these methods are autoencoders, matrix factorization, and subspace methods.
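# - Domain-specific methods: These methods use domain knowledge or specific features to design custom anomaly detection algorithms for a particular application or problem. General-purpose techniques such as one-class support vector machines, isolation forest, and one-class neural networks are sometimes listed here, but they are not tied to a domain: they learn a boundary or an isolation structure around the normal data and apply to many kinds of data²³.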
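
# As an illustration of the statistical-based category, the sketch below fits a single
# Gaussian (mean and standard deviation) and flags values with a large z-score.
# The function name zscore_outliers, the sample data, and the threshold of 2.5 are
# assumptions chosen for this small example.

```python
import statistics


def zscore_outliers(values, threshold=2.5):
    """Return indices of values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


# Nine readings near 10 and one extreme reading (made-up data).
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9, 10.0, 25.0]
print(zscore_outliers(readings))
```

# Note that the extreme value itself inflates the mean and standard deviation, which is
# why small samples often need a lower threshold or a robust estimate such as the median.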

# Q5. What are the main assumptions made by distance-based anomaly detection methods?

# Distance-based anomaly detection methods make the following assumptions:

# - The data is in a metric space, meaning that there is a distance function that can measure the similarity or dissimilarity between any two instances in the data.
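# - The data is homogeneous, meaning that there is no significant difference in the distribution or density of the data across different regions or clusters. Methods that use a single global distance threshold rely on this; local methods such as LOF relax it by comparing each instance only with its own neighborhood.
# - The data is low-dimensional, meaning that the number of features or dimensions of the data is relatively small compared to the number of instances or samples. In high-dimensional data, distances between instances tend to become similar, so they separate normal and anomalous instances less well.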
# - The anomalies are isolated, meaning that they are far away or dissimilar from their nearest neighbors or clusters⁴.

# Q6. How does the LOF algorithm compute anomaly scores?

# The LOF (Local Outlier Factor) algorithm computes anomaly scores as follows:

# - For each instance in the data, it calculates the k-distance, which is the distance to its k-th nearest neighbor, and the k-distance neighborhood, which is the set of instances that are within the k-distance from the instance.
# - For each instance in the data, it calculates the reachability distance to each of its neighbors, which is the maximum of the neighbor's k-distance and the actual distance between the instance and that neighbor, and the local reachability density, which is the inverse of the average reachability distance from the instance to the instances in its k-distance neighborhood.
# - For each instance in the data, it calculates the local outlier factor, which is the ratio of the average local reachability density of the instances in the k-distance neighborhood of the instance to the local reachability density of the instance itself. A value close to 1 means the instance is about as dense as its neighbors. A value well above 1 means the instance is far away from its neighbors compared to how close the neighbors are to each other, which implies that the instance is an anomaly⁵.
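
# The three steps above can be sketched in plain Python. This is a minimal, quadratic-time
# sketch rather than a production implementation: the function name lof_scores and the toy
# points are assumptions, the reachability distance uses the neighbor's k-distance (the
# usual LOF definition), and the data is assumed to contain no duplicate points.

```python
import math


def lof_scores(points, k):
    """Compute a Local Outlier Factor for every point (brute force)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_distance, neighbors = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][order[k - 1]]  # distance to the k-th nearest neighbor
        k_distance.append(kd)
        # k-distance neighborhood: every point within the k-distance (ties included)
        neighbors.append([j for j in order if dist[i][j] <= kd])

    def reach_dist(i, j):
        # Reachability distance from i to neighbor j
        return max(k_distance[j], dist[i][j])

    # Local reachability density: inverse of the mean reachability distance
    lrd = []
    for i in range(n):
        mean_reach = sum(reach_dist(i, j) for j in neighbors[i]) / len(neighbors[i])
        lrd.append(1.0 / mean_reach)

    # LOF: mean density of the neighbors divided by the point's own density
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Four corners of a unit square plus one distant point (made-up data).
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
for point, score in zip(toy_points, lof_scores(toy_points, k=2)):
    print(point, round(score, 2))
```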

# Q7. What are the key parameters of the Isolation Forest algorithm?

# The Isolation Forest algorithm has the following key parameters:

# - n_estimators: The number of trees to build in the forest. A larger number of trees improves the accuracy and stability of the algorithm, but also increases the computational cost and memory usage.
# - max_samples: The number of samples to draw from the data to train each tree. A small sub-sample keeps trees shallow and can make anomalies easier to isolate, but a sample that is too small may no longer represent the normal data.
# - max_features: The number (or fraction) of features drawn to train each tree. In scikit-learn, the features are drawn once per tree, not at each split. Using fewer features speeds up training and makes the trees more diverse, but each tree sees less information.
# - contamination: The expected proportion of anomalies in the data. This parameter only determines the threshold for flagging anomalies based on the anomaly scores; it does not change the scores themselves. A higher contamination value flags more instances as anomalies, and a lower value flags fewer⁶.
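
# To show where n_estimators and max_samples enter the algorithm, here is a small
# isolation forest written from scratch. It is a sketch under stated assumptions, not the
# scikit-learn implementation: all function names and the toy data are hypothetical,
# max_features and contamination are left out, and at least 2 samples per tree are assumed.

```python
import math
import random


def c(n):
    """Approximate average path length of an unsuccessful binary search tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature at a random value."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:
        return {"size": len(rows)}  # cannot split further on this feature
    split = rng.uniform(lo, hi)
    return {
        "feature": feature,
        "split": split,
        "left": build_tree([r for r in rows if r[feature] < split], depth + 1, max_depth, rng),
        "right": build_tree([r for r in rows if r[feature] >= split], depth + 1, max_depth, rng),
    }


def path_length(tree, x, depth=0):
    """Depth at which x lands in a leaf, adjusted for the leaf's unsplit size."""
    if "size" in tree:
        return depth + c(tree["size"])
    branch = "left" if x[tree["feature"]] < tree["split"] else "right"
    return path_length(tree[branch], x, depth + 1)


def isolation_scores(data, n_estimators=100, max_samples=256, seed=0):
    """Return one anomaly score per row; scores near 1 suggest anomalies."""
    rng = random.Random(seed)
    sample_size = min(max_samples, len(data))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(data, sample_size), 0, max_depth, rng)
        for _ in range(n_estimators)
    ]
    return [
        2.0 ** (-sum(path_length(t, x) for t in trees) / (n_estimators * c(sample_size)))
        for x in data
    ]


# A 7 x 7 grid of inliers plus one distant point (made-up data).
grid = [(0.1 * i, 0.1 * j) for i in range(7) for j in range(7)] + [(10.0, 10.0)]
forest_scores = isolation_scores(grid, n_estimators=100, max_samples=256)
print("outlier score:", round(forest_scores[-1], 2))
print("center score:", round(forest_scores[24], 2))
```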

# Q8. If a data point has only 2 neighbours of the same class within a radius of 0.5, what is its anomaly score
# using KNN with K=10?

# Strictly speaking, the information given is not enough to compute an exact score. KNN-based anomaly detection usually scores a data point by the distance to its K-th nearest neighbor or by the average distance to its K nearest neighbors. If only 2 neighbors lie within a radius of 0.5, then the other 8 of the 10 nearest neighbors lie farther away than 0.5. The distance to the 10th nearest neighbor is therefore greater than 0.5, and the average distance to the 10 nearest neighbors is greater than (8 × 0.5) / 10 = 0.4. A relative score can also be defined as the ratio of the average distance to the point's K nearest neighbors to the average distance of those neighbors to their own K nearest neighbors. Assuming that the neighbors sit in a denser region, this ratio is greater than 1, which suggests that the data point is an anomaly.
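
# The two common KNN scores can be computed as follows. This is a sketch with made-up
# coordinates: the helper name knn_scores and the sample points are assumptions, arranged
# so that exactly 2 points fall within a radius of 0.5 of the query.

```python
import math


def knn_scores(points, query, k):
    """Return (distance to the k-th nearest neighbor, mean distance to the k nearest)."""
    if len(points) < k:
        raise ValueError("need at least k points")
    nearest = sorted(math.dist(query, p) for p in points)[:k]
    return nearest[-1], sum(nearest) / k


query = (0.0, 0.0)
# Two points inside the 0.5 radius, eight points outside it (hypothetical data).
sample = [(0.3, 0.0), (0.0, 0.4)] + [(1.0 + 0.1 * i, 0.0) for i in range(8)]

kth_distance, mean_distance = knn_scores(sample, query, k=10)
print("distance to 10th neighbor:", round(kth_distance, 2))
print("mean distance to 10 neighbors:", round(mean_distance, 2))
```

# Whatever the exact positions of the 8 farther points, the first value stays above 0.5
# and the second stays above 0.4.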

# Q9. Using the Isolation Forest algorithm with 100 trees and a dataset of 3000 data points, what is the
# anomaly score for a data point that has an average path length of 5.0 compared to the average path
# length of the trees?

# Using the Isolation Forest algorithm, the anomaly score for a data point is defined as s(x, n) = 2^(-E(h(x))/c(n)), where E(h(x)) is the average path length of the data point over the trees in the forest, c(n) is the average path length of unsuccessful searches in a binary search tree, and n is the number of samples used to build each tree. The number of trees (100) only affects how stable the average E(h(x)) is; it does not appear in the formula. In this case, E(h(x)) is 5.0. Taking n = 3000, c(n) = 2H(n-1) - 2(n-1)/n with H(i) ≈ ln(i) + 0.5772 gives c(3000) ≈ 2(8.583) - 2.0 ≈ 15.17. Therefore, the anomaly score for the data point is 2^(-5.0/15.17) ≈ 0.80. If each tree is instead built on a sub-sample of 256 points, c(256) ≈ 10.25 and the score is 2^(-5.0/10.25) ≈ 0.71. In both cases the score is well above 0.5, so the data point is isolated faster than average and is more likely to be anomalous than normal: scores close to 1 indicate anomalies, while scores well below 0.5 indicate normal points.
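
# The formula above can be evaluated directly. The sketch below uses the exact harmonic
# number instead of the logarithmic approximation; the function names are assumptions
# chosen for this example.

```python
def harmonic(i):
    """Exact harmonic number H(i) = 1 + 1/2 + ... + 1/i."""
    return sum(1.0 / j for j in range(1, i + 1))


def c(n):
    """Average path length of an unsuccessful search in a binary search tree of n points."""
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(avg_path_length, n):
    """Isolation Forest score s(x, n) = 2 ** (-E(h(x)) / c(n))."""
    return 2.0 ** (-avg_path_length / c(n))


# n is the number of samples per tree: the full 3000 points, or a 256-point sub-sample.
for n in (3000, 256):
    print(f"n={n}: c(n)={c(n):.2f}, score={anomaly_score(5.0, n):.2f}")
```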

