#### Q1. What is anomaly detection and what is its purpose?

Ans: `Anomaly detection` refers to the process of identifying rare or unusual patterns or observations within a dataset that deviate significantly from the norm. 

Its *purpose* is to flag such instances so they can be investigated, since they may indicate errors, fraud, or unusual behavior.

#### Q2. What are the key challenges in anomaly detection?

Ans: Key challenges in anomaly detection include:
- defining what constitutes normal behavior,
- handling imbalanced datasets where anomalies are rare,
- selecting appropriate features,
- dealing with high-dimensional data,
- determining an appropriate threshold for flagging anomalies, and
- adapting as normal behavior and anomalies evolve over time.

#### Q3. How does unsupervised anomaly detection differ from supervised anomaly detection?

Ans: In **unsupervised anomaly detection**, labeled examples of anomalies are not provided during training. The algorithm learns the normal patterns and structures from the unlabeled data and identifies instances that deviate significantly from the learned norm as anomalies. Unsupervised methods do not require prior knowledge or labels, making them more flexible but potentially less accurate in identifying anomalies.

On the other hand, **supervised anomaly detection** relies on training examples that are labeled as normal or anomalous. The algorithm learns to distinguish between the two classes based on the provided labels. Supervised methods can be more accurate on the kinds of anomalies they were trained on, since they have explicit knowledge of what to detect, but they may miss anomaly types that are absent from the training data. They also require a labeled training set, which may be difficult or expensive to obtain.

#### Q4. What are the main categories of anomaly detection algorithms?

Ans: The main categories of anomaly detection algorithms include:

- **Statistical-based methods:** These methods use statistical techniques to model the normal behavior of the data and identify instances that significantly deviate from the expected statistical properties.

- **Distance-based methods:** These algorithms compute distances or similarities between data points and consider instances that are far away from their neighbors in the feature space as anomalies.

- **Density-based methods:** These methods identify anomalies based on the density of data points. Anomalies are points that lie in regions of low density relative to their surroundings.

- **Model-based methods:** These algorithms build models of the normal data distribution and classify instances as anomalies if they have a low probability under the learned model.

- **Ensemble methods:** Ensemble approaches combine multiple anomaly detection algorithms or models to improve the accuracy and robustness of anomaly detection by leveraging the strengths of individual methods.
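As a small illustration of the statistical-based category, the sketch below flags values whose z-score exceeds a threshold. The function name `zscore_outliers`, the threshold of 3.0, and the sample readings are assumptions made for this example, not part of the original answer.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the indices of values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical: nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]

# Hypothetical sensor readings with one obvious spike at index 9.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.0, 9.7, 10.1, 25.0, 10.0, 9.9]
flagged = zscore_outliers(readings)
print(flagged)
```

Note that a single extreme value inflates both the mean and the standard deviation, so on small samples a median-based variant may be more robust.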

#### Q5. What are the main assumptions made by distance-based anomaly detection methods?

Ans: Distance-based anomaly detection methods make the following assumptions:

- Anomalies are located far away from normal instances in the feature space.
- The distances from an anomaly to its nearest neighbors are significantly larger than the typical distances between normal instances.
- Normal instances are more densely clustered in the feature space compared to anomalies.
- The number of anomalies is relatively small compared to the number of normal instances.

These assumptions guide the distance-based methods in identifying instances that are isolated or have large distances to their neighbors as anomalies.
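These assumptions can be sketched with a simple score: the distance from each point to its k-th nearest neighbor. The function name `knn_distance_scores`, the choice `k=3`, and the toy points are assumptions for illustration; the brute-force search is only suitable for small datasets.

```python
import math

def knn_distance_scores(points, k=3):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Five clustered points and one isolated point (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.1), (5.0, 5.0)]
scores = knn_distance_scores(points, k=3)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(most_anomalous, round(scores[most_anomalous], 3))
```

The isolated point receives by far the largest score because even its closest neighbors are far away, which is exactly what the assumptions above predict.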

#### Q6. How does the LOF algorithm compute anomaly scores?

Ans: The `LOF (Local Outlier Factor)` algorithm computes anomaly scores as follows:

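1. For each data point, LOF calculates the reachability distance to each of its k nearest neighbors. The reachability distance from a point p to a neighbor o is the larger of the actual distance d(p, o) and the k-distance of o (the distance from o to its own k-th nearest neighbor). Taking the maximum smooths out the scores of points that lie very close together.

2. The local reachability density (LRD) is computed for each point by taking the inverse of the average reachability distance from the point to its k nearest neighbors.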

3. LOF compares the LRD of a point with the LRDs of its neighbors. If the LRD of a point is significantly lower than the LRDs of its neighbors, it indicates that the point is in a sparser region and has a higher likelihood of being an anomaly.

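4. The anomaly score, which is the LOF score, is calculated as the average ratio of the LRDs of a point's neighbors to the LRD of the point itself. A score close to 1 means the point is about as dense as its neighbors, while a score well above 1 means the point lies in a sparser region and is more likely to be an anomaly.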
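The steps above can be sketched in plain Python. This is a simplified, brute-force version: it assumes all points are distinct, takes exactly k neighbors even when distances tie, and uses Euclidean distance. The function name `lof_scores` and the toy data are assumptions for illustration.

```python
import math

def lof_scores(points, k=3):
    """Return a simplified Local Outlier Factor score for every point."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])               # indices of the k nearest neighbors
        k_distance.append(dist[i][order[k - 1]])  # distance to the k-th neighbor

    def lrd(i):
        # Step 1: reachability distance from i to j is max(k_distance(j), d(i, j)).
        total = sum(max(k_distance[j], dist[i][j]) for j in neighbors[i])
        # Step 2: inverse of the average reachability distance.
        return len(neighbors[i]) / total

    densities = [lrd(i) for i in range(n)]
    # Steps 3 and 4: average ratio of the neighbors' LRDs to the point's own LRD.
    return [
        sum(densities[j] for j in neighbors[i]) / (len(neighbors[i]) * densities[i])
        for i in range(n)
    ]

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.1), (5.0, 5.0)]
lof = lof_scores(points, k=3)
print([round(s, 2) for s in lof])
```

In this toy dataset the clustered points score close to 1, while the isolated point at (5.0, 5.0) scores far above 1.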

#### Q7. What are the key parameters of the Isolation Forest algorithm?

Ans: The key parameters of the Isolation Forest algorithm are:

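1. **Number of trees:** This parameter specifies the number of isolation trees to be created. Adding trees makes the averaged path lengths, and therefore the anomaly scores, more stable, but the benefit levels off while computation time keeps growing. A value of 100 is a common default.

2. **Subsampling size:** Isolation Forest randomly selects a subset of the data for constructing each tree. The subsampling size determines the size of these subsets. A small subsample keeps trees shallow and fast to build and can reduce swamping and masking (normal points being mistaken for anomalies, or clustered anomalies hiding one another), but a subsample that is too small may not represent the normal data well. The original paper uses 256 as a default.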
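To show where the two parameters enter, here is a minimal isolation forest sketch for one-dimensional data. It is a simplification made for illustration: the full algorithm also picks a random feature at each split, and the names `isolation_scores`, `n_trees`, and `sample_size` are hypothetical rather than taken from any library.

```python
import math
import random

def c_factor(n):
    """Expected path length of an unsuccessful binary search tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_tree(values, depth, max_depth, rng):
    # Stop at the depth limit, or when the node cannot be split any further.
    if depth >= max_depth or len(values) <= 1 or min(values) == max(values):
        return ("leaf", len(values))
    split = rng.uniform(min(values), max(values))
    left = [v for v in values if v < split]
    right = [v for v in values if v >= split]
    return ("node", split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))

def path_length(tree, x, depth=0):
    if tree[0] == "leaf":
        # Add the expected extra depth for points left unseparated in this leaf.
        return depth + c_factor(tree[1])
    _, split, left, right = tree
    return path_length(left if x < split else right, x, depth + 1)

def isolation_scores(values, n_trees=100, sample_size=64, seed=0):
    rng = random.Random(seed)
    sample_size = min(sample_size, len(values))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(values, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(sample_size)
    return [2 ** (-sum(path_length(t, x) for t in trees) / (n_trees * norm))
            for x in values]

# Hypothetical data: 50 values around 0 plus one far-away value.
gen = random.Random(42)
data = [gen.gauss(0.0, 1.0) for _ in range(50)] + [100.0]
iso = isolation_scores(data, n_trees=100, sample_size=64)
print(round(iso[-1], 3), round(max(iso[:-1]), 3))
```

The far-away value is usually separated by the very first split, so its average path length is short and its score is the highest in the dataset.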

#### Q8. If a data point has only 2 neighbours of the same class within a radius of 0.5, what is its anomaly score using KNN with K=10?

Ans: The question does not specify a scoring formula, so the exact value depends on the convention used. Qualitatively, with K=10, a data point that has only 2 neighbors of the same class within a radius of 0.5 would have a relatively high anomaly score.

Having so few same-class neighbors within the specified radius suggests that the point is isolated from, or dissimilar to, most of its neighboring points. This isolation indicates a higher likelihood of the point being an anomaly.

One simple convention is to take the proportion of the K nearest neighbors that belong to the same class and lie within the radius, here 2/10 = 0.2, and to define the anomaly score as one minus that proportion: 1 - 0.2 = 0.8.
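Since the question leaves the scoring formula open, the sketch below assumes one possible convention: the anomaly score is one minus the fraction of the K nearest neighbors that are both of the same class and within the radius. The function name `knn_radius_score` and the neighbor data are hypothetical.

```python
def knn_radius_score(neighbor_distances, neighbor_labels, point_label, radius, k):
    """Assumed convention: 1 - (same-class neighbors within radius) / k."""
    same_class_close = sum(
        1
        for d, label in zip(neighbor_distances[:k], neighbor_labels[:k])
        if d <= radius and label == point_label
    )
    return 1.0 - same_class_close / k

# Ten nearest neighbors of a class "a" point; only the first two are within 0.5.
distances = [0.2, 0.4, 0.7, 0.9, 1.1, 1.3, 1.4, 1.6, 1.8, 2.0]
labels = ["a", "a", "b", "a", "b", "a", "b", "a", "b", "a"]
knn_score = knn_radius_score(distances, labels, "a", radius=0.5, k=10)
print(round(knn_score, 2))  # 0.8 under this convention
```

Other conventions, such as the mean distance to the K nearest neighbors, would give a different number for the same point.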

#### Q9. Using the Isolation Forest algorithm with 100 trees and a dataset of 3000 data points, what is the anomaly score for a data point that has an average path length of 5.0 compared to the average path length of the trees?

Ans: In the Isolation Forest algorithm, the anomaly score for a data point is calculated based on its average path length (APL) compared to the average path length of the trees in the forest. 

Given:
- Number of trees (T) = 100
- Dataset size (N) = 3000
- Data point's average path length (APL) = 5.0

To calculate the anomaly score, we need to normalize the APL by comparing it to the expected average path length for normal data points in the isolation forest. The expected average path length for normal data points can be estimated using the formula:

E(APL) = 2 * (log(N - 1) + 0.5772156649) - (2 * (N - 1) / N)

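Here, log represents the natural logarithm and 0.5772156649 is the Euler-Mascheroni constant. Strictly speaking, N in this formula is the number of points used to build each tree (the subsampling size); this answer uses the full dataset size, as the question implies.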

Let's calculate the expected average path length for the given dataset size:

E(APL) = 2 * (log(3000 - 1) + 0.5772156649) - (2 * (3000 - 1) / 3000)

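E(APL) ≈ 15.167

Now, we can compute the anomaly score for the data point using the following formula:

Anomaly Score = 2^(-APL / E(APL))

Anomaly Score = 2^(-5.0 / 15.167)

Anomaly Score ≈ 0.796

Therefore, the anomaly score for the data point with an average path length of 5.0 is approximately `0.796`. Because this is well above 0.5, the point is isolated faster than average and is a likely anomaly. The number of trees (100) does not appear in the formula; it only determines how many trees the path length is averaged over.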
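The arithmetic can be checked with a few lines of Python that evaluate the two formulas directly. The function names `expected_path_length` and `anomaly_score` are chosen for this sketch, and N is taken to be the full dataset size, as in the answer above.

```python
import math

EULER_GAMMA = 0.5772156649

def expected_path_length(n):
    """E(APL) = 2 * (ln(n - 1) + gamma) - 2 * (n - 1) / n."""
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """Anomaly Score = 2^(-APL / E(APL))."""
    return 2.0 ** (-avg_path_length / expected_path_length(n))

e_apl = expected_path_length(3000)
q9_score = anomaly_score(5.0, 3000)
print(round(e_apl, 3), round(q9_score, 3))  # 15.167 0.796
```

Scores near 1 indicate likely anomalies, scores well below 0.5 indicate normal points, and scores around 0.5 across the whole dataset suggest there are no distinct anomalies.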