# Q1. What is anomaly detection and what is its purpose?

Anomaly detection is the process of identifying rare items, events, or observations that deviate significantly from the majority of the data and do not conform to a well-defined notion of normal behavior ¹. Companies use it to define system baselines, identify deviations from those baselines, and investigate inconsistent data ¹. It is critical for the security and efficiency of Internet of Things (IoT) systems, where it helps identify system failures and security breaches in complex networks of devices ⁴. Anomalous data can indicate a critical incident, such as a technical glitch, or a potential opportunity, such as a change in consumer behavior ⁵.

# Q2. What are the key challenges in anomaly detection?

Some of the key challenges in anomaly detection are:

1. **Defining what is considered "normal"**: One of the biggest challenges is defining normal behavior, because the definition varies with the context and the data being analyzed ¹.

2. **Extracting useful features appropriately**: Features that help separate anomalies from normal data are hard to extract, particularly from high-dimensional data with many features to consider ¹².

3. **Handling class imbalance**: Normal values usually far outnumber anomalies in a dataset, which makes the anomalies difficult to identify accurately ¹.

4. **Separating noise from real outliers**: Genuine anomalies can be difficult to distinguish from noise in the data, which leads to false positives or false negatives ¹.

5. **Real-time detection**: Some applications, such as IoT systems, need anomalies detected in real time, which is challenging given the large volume of data to process ⁴.

# Q3. How does unsupervised anomaly detection differ from supervised anomaly detection?

The main difference between **supervised** and **unsupervised** anomaly detection is whether labeled training data is required ¹.

**Supervised anomaly detection** requires labeled training data in which each row is marked as an anomaly or not. Any modeling technique for binary responses, such as logistic regression or gradient boosting, can then be applied ⁶. The model is trained on these labeled examples and then used to classify new observations ¹.

**Unsupervised anomaly detection**, on the other hand, does not require labeled training data and relies on a general outlier-detection mechanism based on pattern matching ¹. Unsupervised learning more broadly is also a good fit for recommendation engines, customer personas, and medical imaging ¹. The unsupervised approach is particularly useful for high-dimensional data, where there are many features to consider ¹².

# Q4. What are the main categories of anomaly detection algorithms?

There are several categories of anomaly detection algorithms. Here are some of the main ones:

1. **Statistical-based algorithms**: These algorithms use statistical techniques to identify anomalies in the data. They assume that the normal data follows a known statistical distribution and identify data points that fall outside of this distribution ¹.

2. **Clustering-based algorithms**: These algorithms group similar data points together and identify data points that do not belong to any cluster as anomalies ¹⁴.

3. **Density-based algorithms**: These algorithms estimate the density of the data around each point and flag points in low-density regions as anomalies, since such points have a low probability of being generated by the underlying data distribution ¹.

4. **Distance-based algorithms**: These algorithms identify anomalies based on their distance from other data points. They assume that anomalies are far away from the majority of the data points ¹.

5. **Spectral-based algorithms**: These algorithms use spectral analysis to identify anomalies in the data. They identify anomalies as data points that have a different spectral signature than the majority of the data points ¹.

6. **Deep learning-based algorithms**: These algorithms use neural networks to learn a representation of normal data and flag points that do not fit it. They are particularly useful for high-dimensional data, where there are many features to consider ¹.

# Q5. What are the main assumptions made by distance-based anomaly detection methods?

Distance-based anomaly detection methods assume that normal data points lie close to their neighbors, while anomalous points lie far from the normal data ¹. They score a test point by its distance to its nearest neighbors ¹. The simplest variants also make assumptions about the data distribution, for example that the data is one-dimensional and normally distributed with mean $\bar{p}$ and standard deviation $\sigma$ ⁴. Under that assumption, a large distance from the center of the distribution implies that the probability of observing such a data point is very small ⁴.
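
The one-dimensional case can be sketched in a few lines of Python. This is only an illustration under the normality assumption: the `readings` values, the `z_scores` helper, and the `threshold` of 2.0 are hypothetical choices, not part of any cited method.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Distance of each value from the sample mean, in units of standard deviation."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return [0.0 for _ in values]  # all values identical: nothing stands out
    return [abs(v - center) / spread for v in values]

# Hypothetical 1-D readings; 25.0 is a planted outlier.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 25.0]
scores = z_scores(readings)

threshold = 2.0  # assumed cut-off; larger samples often use 3.0
flagged = [v for v, s in zip(readings, scores) if s > threshold]
print(flagged)
```

Note that the outlier itself inflates the mean and standard deviation, so on very small samples a high threshold may never be reached.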

# Q6. How does the LOF algorithm compute anomaly scores?

The **Local Outlier Factor (LOF)** algorithm is an unsupervised anomaly detection method that measures how much the local density of a data point deviates from that of its neighbors ¹. It estimates the local density of each sample point from the distances to the points in its surrounding neighborhood: the greater the average distance to the nearest neighbors, the lower the density ⁶. The anomaly score is the ratio of the average local density of the neighbors to the point's own local density. A score close to 1.0 means the point is about as dense as its neighbors, while a score substantially greater than 1.0 usually indicates an anomaly ¹.
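
As a rough sketch of the computation, the following pure-Python version follows the usual LOF definitions (k-distance, reachability distance, local reachability density). It is a simplified illustration, not a library implementation: the `lof_scores` name and the sample points are hypothetical, ties at the k-th distance are broken arbitrarily, and duplicate points are assumed absent.

```python
import math

def lof_scores(points, k):
    """Return one LOF score per point, using each point's k nearest neighbours."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])               # indices of the k nearest neighbours
        k_distance.append(dist[i][order[k - 1]])  # distance to the k-th neighbour

    def local_reachability_density(i):
        # Reachability distance from i to j is at least j's own k-distance.
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    # LOF = average density of the neighbours divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical data: a tight cluster plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
scores = lof_scores(points, k=2)
for p, s in zip(points, scores):
    print(p, round(s, 2))
```

In this toy layout the clustered points score close to 1, while the distant point scores several times higher.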

# Q7. What are the key parameters of the Isolation Forest algorithm?

The **Isolation Forest** algorithm is an unsupervised machine learning algorithm used for anomaly detection. It rests on the idea that anomalies are easier to separate from normal data points because they are few and different ¹. Each tree is built by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum values of that feature. This process is repeated recursively until all data points are isolated ². Anomalies tend to be isolated after fewer splits than normal points.
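
The isolation idea can be illustrated with a small sketch that counts how many random splits it takes to isolate one point. This is a deliberate simplification and not the full algorithm: it follows only the branch containing the point, uses the whole dataset for every tree (no subsampling), and the names `path_length`, `average_path_length`, and the generated data are hypothetical.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed to isolate `point` from the rest of `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))  # pick a random feature
    lo = min(row[feature] for row in data)
    hi = max(row[feature] for row in data)
    if lo == hi:
        return depth  # simplification: stop if this feature cannot be split
    split = rng.uniform(lo, hi)  # pick a random split value
    # Keep only the rows on the same side of the split as `point`.
    same_side = [row for row in data
                 if (row[feature] < split) == (point[feature] < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)

def average_path_length(point, data, n_trees=100, seed=0):
    """Average the path length of `point` over `n_trees` random trees."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees

# Hypothetical data: a Gaussian cluster plus one far-away point.
rng = random.Random(42)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data = cluster + [outlier]
typical = min(cluster, key=lambda p: p[0] ** 2 + p[1] ** 2)  # point nearest the cluster center

print("outlier:", average_path_length(outlier, data))
print("typical:", average_path_length(typical, data))
```

The far-away point should need noticeably fewer splits on average than the point near the cluster center.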

The key parameters of the Isolation Forest algorithm are:

1. **n_estimators**: The number of trees in the forest. More trees generally give more stable anomaly scores, with diminishing returns, at a higher computational cost ¹.

2. **max_samples**: The number of samples to draw from the training set to train each isolation tree. If `"auto"`, then `max_samples=min(256, n_samples)`. If `max_samples` is larger than the number of samples provided, all samples are used for all trees (no sampling) ³.

3. **contamination**: The proportion of outliers in the data set. Used when fitting to define the threshold on the scores of the samples. If `"auto"`, the threshold is determined as in the original paper ³.

4. **max_features**: The number of features to draw from `X` to train each base estimator. If int, then draw `max_features` features. If float, then draw `max(1, int(max_features * n_features_in_))` features ³.

# Q8. If a data point has only 2 neighbours of the same class within a radius of 0.5, what is its anomaly score using KNN with K=10?

With K=10, a common KNN anomaly score is the distance from the data point to its Kth nearest neighbor ². If only 2 neighbors lie within a radius of 0.5, the other 8 of the 10 nearest neighbors lie farther away, so the 10th nearest neighbor is more than 0.5 from the point. The anomaly score is therefore greater than 0.5. Its exact value cannot be determined without knowing the actual distance to the 10th nearest neighbor.
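
A minimal sketch of the k-th-nearest-neighbor score, assuming Euclidean distance and ignoring class labels. The point layout and the `kth_neighbor_distance` helper are hypothetical, chosen so that exactly 2 neighbors fall inside a radius of 0.5.

```python
import math

def kth_neighbor_distance(query, points, k):
    """Anomaly score: distance from `query` to its k-th nearest neighbour."""
    distances = sorted(math.dist(query, p) for p in points)
    if k > len(distances):
        raise ValueError("k exceeds the number of available neighbours")
    return distances[k - 1]

# Hypothetical layout: 2 neighbours inside radius 0.5, the other 10 farther away.
query = (0.0, 0.0)
near = [(0.2, 0.0), (0.0, 0.4)]
far = [(1.0 + 0.1 * i, 0.0) for i in range(10)]

score = kth_neighbor_distance(query, near + far, k=10)
print(score)
```

Whatever the actual layout, the 10th nearest neighbor lies outside the 0.5 radius whenever fewer than 10 points fall inside it, so the score exceeds 0.5.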

# Q9. Using the Isolation Forest algorithm with 100 trees and a dataset of 3000 data points, what is the anomaly score for a data point that has an average path length of 5.0 compared to the average path length of the trees?

The **Isolation Forest** algorithm computes the anomaly score of each sample using the average path length over all isolation trees in the isolation forest ¹. The anomaly score of a data point with an average path length of 5.0 can be computed as follows:

1. The expected path length used for normalization is computed as follows:

    - The number of splits required to isolate a sample is equivalent to the path length from the root node to the terminating node ¹.
    - The average path length of a tree with n data points is given by $c(n) = 2H(n-1)-\frac{2(n-1)}{n}$, where $H(i)$ is the harmonic number and is defined as $\sum_{j=1}^{i}\frac{1}{j}$ ².
    - This value depends only on n, so it is the same for every tree in the forest and does not change with the number of trees.

2. The anomaly score of the data point can be computed as follows:

    - The anomaly score of a data point is given by $2^{-\frac{E(h(x))}{c(n)}}$, where $E(h(x))$ is the average path length over all isolation trees in the isolation forest, and $c(n)$ is the average path length of unsuccessful searches in a binary search tree of n observations ¹.

Given a dataset of 3000 data points and taking $n = 3000$, the normalization term is:

- $c(3000) = 2H(2999)-\frac{2(2999)}{3000} \approx 15.17$, using $H(2999) \approx 8.58$ ².
- The number of trees (100) does not enter this term. It only determines how many trees the path length of 5.0 is averaged over.

Therefore, the anomaly score of the data point with an average path length of 5.0 is:

- $2^{-\frac{5.0}{15.17}} \approx 0.80$

A score of 0.5 corresponds to a path of exactly average length, so a score of about 0.80 suggests the point is isolated more quickly than a typical point and is likely anomalous. Implementations that subsample the data for each tree would use the subsample size rather than the full dataset size for n.
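
The arithmetic can be checked numerically. The sketch below evaluates the formulas for $c(n)$ and the anomaly score exactly as written above, with the harmonic number computed as a direct sum. The function names are hypothetical, and n is taken to be the full dataset size of 3000, as the question implies.

```python
def harmonic(i):
    """H(i) = 1 + 1/2 + ... + 1/i."""
    return sum(1.0 / j for j in range(1, i + 1))

def c(n):
    """Average path length of an unsuccessful binary search tree search over n points."""
    if n <= 1:
        return 0.0
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """Isolation Forest score: 2 ** (-E(h(x)) / c(n))."""
    return 2.0 ** (-avg_path_length / c(n))

n = 3000
normalizer = c(n)
score = anomaly_score(5.0, n)
print(round(normalizer, 2), round(score, 2))
```

Changing `n` to a subsample size such as 256 shows how strongly the score depends on which n is used for normalization.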