Q1. What is anomaly detection and what is its purpose?

Anomaly detection, also called outlier detection, is the identification of unexpected events, observations, or items that differ significantly from the norm. It is often applied to unlabeled data, in which case it is called unsupervised anomaly detection. Any type of anomaly detection rests on two basic assumptions:

1. Anomalies occur only very rarely in the data.
2. The features of anomalies differ significantly from those of normal instances.

Typically, anomalous data is linked to some sort of problem or rare event such as hacking, bank fraud, malfunctioning equipment, structural defects or infrastructure failures, or textual errors. For this reason, identifying actual anomalies rather than false positives or data noise is essential from a business perspective.

Q2. What are the key challenges in anomaly detection?

Challenge 1: Data quality
When building an anomaly detection model, one primary question you may have is:
“Which algorithm should I use?” This greatly depends on the type of problem you're trying to solve, of course, but one thing to consider is the underlying data.

Data quality, that is, the quality of the underlying dataset, is going to be the biggest driver in creating an accurate, usable model. Data quality problems can include missing or incomplete values, duplicate records, and inconsistent formats or measurement scales.

Challenge 2: Training sample sizes
Having a large training set is important for many reasons. If the training set is too small:

1. The algorithm doesn’t have enough exposure to past examples to build an accurate representation of the expected value at a given time.
2. Anomalies will skew the baseline, which will affect the overall accuracy of the model.

Seasonality is another common problem with small sample sets. Not every day or week is the same, which is why having a large enough sample dataset is important. Customer traffic volumes may spike during the holiday season, or could significantly drop depending on the line of business. It’s important for the model to see data samples for multiple years so it can accurately build and monitor the baseline during common holidays.


Challenge 3: False alerting
Anomaly detection is an excellent tool in a dynamic environment because it learns from past data to separate expected behavior from anomalous events. But what happens when your model continuously generates false alerts and is consistently wrong?

It’s hard to gain trust from skeptical users and easy to lose it — which is why it’s important to ensure a balance in sensitivity. 
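The sensitivity trade-off can be illustrated with a minimal sketch. It assumes a simple z-score detector whose baseline comes from a window of known-normal history; the readings, the function name `count_alerts`, and the thresholds are all hypothetical. A lower threshold catches the real spike but also alerts on ordinary fluctuations.

```python
from statistics import mean, stdev

def count_alerts(history, new_values, z_threshold):
    """Count new values whose z-score against the historical baseline exceeds the threshold."""
    mu, sigma = mean(history), stdev(history)
    return sum(1 for v in new_values if abs(v - mu) / sigma > z_threshold)

# Hypothetical metric: a stable history, then new readings with one real spike (80).
history = [50, 52, 49, 51, 53, 48, 50, 47, 55, 51, 46, 54]
new_readings = [51, 55, 46, 57, 44, 80, 50, 53]

for threshold in (1.5, 2.0, 3.0):
    alerts = count_alerts(history, new_readings, threshold)
    print(f"threshold={threshold}: {alerts} alert(s)")
```

With these made-up numbers, only the strictest threshold isolates the single real spike; the looser ones also flag readings that are merely a little noisy.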

Challenge 4: Imbalanced distributions
Another method of building an anomaly detection model would be to use a classification algorithm to build a supervised model. This supervised model will require labeled data to understand what is good or bad.

A common problem with labeled data is distribution imbalance. Systems are normally in a good state, so the labels are heavily skewed: for example, 99% of the labeled data may be good. Because of this natural imbalance, the training set may not contain enough bad examples for the model to learn what the bad state looks like.
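One common mitigation is to weight each class inversely to its frequency so that the rare class counts for more during training. The sketch below assumes the widely used heuristic `n_samples / (n_classes * class_count)`; the function name and the 99/1 label split are illustrative only.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to how often it appears in the labels."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical labeled set: 99% good, 1% bad.
labels = ["good"] * 99 + ["bad"] * 1
weights = balanced_class_weights(labels)
print(weights)
```

The single bad example receives a much larger weight than each good one, which is one way to keep a classifier from ignoring the minority class.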

Q3. How does unsupervised anomaly detection differ from supervised anomaly detection?

Supervised anomaly detection:
Supervised anomaly detection uses labeled data to train a classifier that can distinguish between normal and anomalous instances. The labels indicate whether an instance belongs to the normal class or one of the predefined anomaly classes. For example, a supervised anomaly detector for credit card fraud detection would learn from historical transactions that are labeled as fraudulent or legitimate. The main advantage of supervised anomaly detection is that it can achieve high accuracy and specificity for known types of anomalies. However, it also has some drawbacks, such as the need for large and balanced datasets, the difficulty of obtaining and maintaining labels, and the inability to detect novel or unknown anomalies.


Unsupervised anomaly detection:
Unsupervised anomaly detection does not require labeled data to identify outliers. Instead, it relies on statistical or distance-based measures to assess how different an instance is from the rest of the data. For example, an unsupervised anomaly detector for network intrusion detection would use clustering or density estimation to group similar instances and flag those that are far from their nearest neighbors or clusters. The main advantage of unsupervised anomaly detection is that it can handle unlabeled, unbalanced, or evolving data, and discover new or emerging types of anomalies. However, it also has some challenges, such as the choice and interpretation of the anomaly score, the sensitivity to noise and outliers, and the lack of feedback or evaluation.

Q4. What are the main categories of anomaly detection algorithms?

Anomaly detection algorithms can be grouped by how they are trained, by the kind of anomaly they target, and by the data they operate on. These groupings overlap. They include:

Supervised Anomaly Detection:
In this category, the algorithm is trained on a dataset that includes both normal and anomalous examples. The algorithm learns to distinguish between the two classes, and when it encounters new data, it can classify instances as normal or anomalous based on what it has learned.

Unsupervised Anomaly Detection:
Unsupervised algorithms don't require a labeled dataset with examples of anomalies. Instead, they identify anomalies by looking for patterns that deviate from the norm within the dataset. Common methods include clustering, density estimation, and dimensionality reduction techniques.

Semi-Supervised Anomaly Detection:
Semi-supervised approaches combine elements of both supervised and unsupervised methods. These algorithms are typically trained on data that is known or assumed to be normal, sometimes with a small number of labeled anomalies, and flag new instances that deviate from the learned notion of normal.

Time-Series Anomaly Detection:
Time-series anomaly detection algorithms focus on identifying anomalies in temporal data, such as stock prices, sensor readings, or network traffic. Techniques like autoregressive models, moving averages, and recurrent neural networks (RNNs) are commonly used for this purpose.

Point-Based Anomaly Detection:
Point-based methods assess individual data points for anomalies. They evaluate each data point independently without considering the relationships or dependencies between them. Simple statistical techniques and threshold-based methods fall into this category.

Contextual Anomaly Detection:
Contextual anomaly detection considers the context in which data points occur, such as time or location. A value is anomalous only relative to its context: a temperature of 30 °C may be normal in summer but anomalous in winter. General-purpose detectors such as one-class SVM and isolation forests can be applied here once the contextual attributes are included as features or used to partition the data.

Collective Anomaly Detection:
Collective anomaly detection is used for detecting anomalies in a group of data points or entities. Instead of focusing on individual data points, it assesses the collective behavior of multiple data points, which may each look normal on their own. Graph-based algorithms and social network analysis methods are examples of collective anomaly detection techniques.

Domain-Specific Anomaly Detection:
Some anomaly detection methods are tailored for specific domains or applications, such as fraud detection, cybersecurity, healthcare, or industrial quality control. These algorithms often incorporate domain-specific features and knowledge to improve accuracy.

Machine Learning-Based Anomaly Detection:
This category includes a variety of machine learning algorithms that can be used for anomaly detection, such as decision trees, random forests, support vector machines, and deep learning techniques like autoencoders and recurrent neural networks.
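The difference between point-based and contextual detection can be sketched with a small example. It assumes a plain z-score as the detector and uses season as the context; the temperature values and the helper name `zscore` are hypothetical.

```python
from statistics import mean, stdev

def zscore(value, reference):
    """Absolute z-score of a value against a reference sample."""
    return abs(value - mean(reference)) / stdev(reference)

# Hypothetical daily temperatures (°C), grouped by season (the context).
temps = {
    "summer": [29, 31, 30, 32, 28, 30, 31],
    "winter": [2, 4, 3, 1, 5, 3, 2],
}
reading = 30  # the value to assess

# Point-based view: compare against all data, ignoring context.
global_z = zscore(reading, temps["summer"] + temps["winter"])
# Contextual view: compare against data from the same season only.
summer_z = zscore(reading, temps["summer"])
winter_z = zscore(reading, temps["winter"])

print(f"global z={global_z:.2f}, summer z={summer_z:.2f}, winter z={winter_z:.2f}")
```

Against all data, and against summer data, a reading of 30 is unremarkable. It stands out only when it is judged in the winter context.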

Q5. What are the main assumptions made by distance-based anomaly detection methods?

Distance-based anomaly detection methods rely on the assumption that anomalies are data points that are significantly different from the majority of the data, which is considered normal. These methods compute the distance or dissimilarity between data points and use these distances to identify anomalies. The main assumptions made by distance-based anomaly detection methods include:

Normal Data Concentration: Distance-based methods assume that normal data points are concentrated in specific regions of the feature space, forming clusters or dense regions. Anomalies, on the other hand, are expected to be isolated or sparsely distributed.

Distance Metric: These methods assume that a suitable distance metric exists to measure the dissimilarity between data points. Common choices include Euclidean distance, Mahalanobis distance, cosine distance, and others, depending on the data characteristics and domain.

Global vs. Local Anomalies: Distance-based methods may assume that anomalies can be identified by considering global characteristics of the dataset. However, some algorithms are designed to detect local anomalies, meaning they focus on specific neighborhoods within the data.

Threshold-Based Detection: Many distance-based approaches assume the existence of a threshold value that can differentiate normal data from anomalies. Data points with distances beyond this threshold are considered anomalies.

Single-Cluster Assumption: Some distance-based methods work well when anomalies can be defined as data points that are far from the center of the majority cluster. These methods assume that anomalies are distant outliers from the cluster.

Independence of Features: Some distance-based algorithms assume that features are independent of each other, and the overall distance can be calculated as a combination of distances along each feature dimension. This assumption may not hold in some datasets with correlated features.
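These assumptions can be made concrete with a minimal distance-based detector. The sketch assumes Euclidean distance, scores each point by the distance to its k-th nearest neighbour, and applies a hand-picked threshold; the points, `k`, and the threshold are hypothetical.

```python
import math

def knn_distance_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical data: a tight cluster near the origin plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_distance_scores(points, k=2)

threshold = 1.0  # assumed cut-off; in practice it must be chosen per dataset
outliers = [p for p, s in zip(points, scores) if s > threshold]
print(outliers)
```

The sketch depends on every assumption listed above: normal points form a dense region, Euclidean distance is meaningful for the features, and a single global threshold separates the two groups.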

Q6. How does the LOF algorithm compute anomaly scores?

![image.png](attachment:5c93f74c-e9dc-4c0e-b6fe-8183178f11b8.png)

LOF compares the local reachability density (LRD) of a point with the LRDs of its K nearest neighbors. The LOF of a point A is the ratio of the average LRD of A's K neighbors to the LRD of A.

Intuitively, if the point is an inlier, its density is roughly equal to that of its neighbors, so the ratio, and hence the LOF, is close to 1. If the point is an outlier, its LRD is lower than the average LRD of its neighbors, so the LOF is well above 1.

As a rule of thumb, a point with LOF > 1 lies in a sparser region than its neighbors, but a value only slightly above 1 does not make it an outlier; the cut-off depends on the data. For example, if we know the data contains exactly one outlier, we take the point with the maximum LOF value as that outlier.
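The computation can be sketched in plain Python. This is a minimal illustration of the standard definitions (k-distance, reachability distance, LRD, LOF), not an optimized implementation. It assumes Euclidean distance, takes exactly k neighbours even when distances tie, and assumes there are no duplicate points; the function name `lof_scores` and the sample data are hypothetical.

```python
import math

def lof_scores(points, k):
    """Local Outlier Factor for every point, using exactly k nearest neighbours."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Indices of the k nearest neighbours of each point (excluding itself).
    neighbours = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # k-distance: distance from a point to its k-th nearest neighbour.
    k_dist = [dist[i][neighbours[i][-1]] for i in range(n)]

    def lrd(i):
        # reach-dist(i, j) = max(k-distance(j), d(i, j)); LRD is the inverse of its mean.
        reach = [max(k_dist[j], dist[i][j]) for j in neighbours[i]]
        return len(reach) / sum(reach)

    lrds = [lrd(i) for i in range(n)]
    # LOF = average LRD of the neighbours divided by the point's own LRD.
    return [sum(lrds[j] for j in neighbours[i]) / (k * lrds[i]) for i in range(n)]

# Hypothetical data: four corners of a unit square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
for p, s in zip(points, scores):
    print(p, round(s, 2))
```

The four corner points have the same density as their neighbours, so their LOF is 1, while the distant point has a much lower density than its neighbours and a correspondingly high LOF.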

Q7. What are the key parameters of the Isolation Forest algorithm?

The Isolation Forest algorithm is an unsupervised machine learning algorithm used for anomaly detection. It is based on the concept of isolating anomalies by constructing binary trees. The key parameters of the Isolation Forest algorithm include:

Number of Trees (n_estimators):
This parameter defines the number of isolation trees that are constructed. More trees make the averaged path lengths, and therefore the scores, more stable, but also increase the computation time.

Maximum Tree Depth:
Each isolation tree is grown only up to a height limit, because anomalies are isolated close to the root and deeper levels add little information. In the original algorithm this limit is derived from the sample size, as ceil(log2(sample size)), rather than tuned by hand. scikit-learn's IsolationForest, for example, has no max_depth parameter.

Sample Size (max_samples):
The max_samples parameter determines the number of data points randomly sampled to build each isolation tree. Smaller sample sizes make the algorithm run faster, and a modest sub-sample (256 is a common default) often works well; larger samples may help on some datasets.

Contamination:
Contamination sets the expected fraction of anomalies in the dataset. It does not change how the trees are built or how points are scored. It only determines the score threshold, so that roughly that fraction of points, those with the shortest average path lengths, is labeled as anomalous.

Random Seed (random_state):
This parameter allows you to set a random seed for reproducibility. By specifying a seed, you can ensure that the same random process generates the same results when you run the algorithm multiple times with the same data.

These parameters allow you to control the behavior and performance of the Isolation Forest algorithm in anomaly detection tasks. You can tune these parameters based on the characteristics of your data and the specific requirements of your application to achieve the best results.
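To show where each parameter enters the algorithm, here is a deliberately small, pure-Python sketch of an isolation forest. It is an illustration under simplifying assumptions, not a library implementation: the function names, the dictionary-based tree, and the sample data are hypothetical. The parameter names mirror those above, and the height limit is derived from the sample size.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(rows, depth, height_limit, rng):
    # Stop at the height limit or when the node cannot be split further.
    if depth >= height_limit or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    return {
        "feature": feature,
        "split": split,
        "left": build_tree([r for r in rows if r[feature] < split], depth + 1, height_limit, rng),
        "right": build_tree([r for r in rows if r[feature] >= split], depth + 1, height_limit, rng),
    }

def path_length(x, node, depth=0):
    if "size" in node:  # leaf: add the expected depth of the unbuilt subtree
        return depth + c_factor(node["size"])
    child = "left" if x[node["feature"]] < node["split"] else "right"
    return path_length(x, node[child], depth + 1)

def isolation_forest_scores(data, n_estimators=100, max_samples=256, random_state=0):
    rng = random.Random(random_state)          # random_state: reproducibility
    sample_size = min(max_samples, len(data))  # max_samples: points per tree
    height_limit = math.ceil(math.log2(max(sample_size, 2)))
    trees = [build_tree(rng.sample(data, sample_size), 0, height_limit, rng)
             for _ in range(n_estimators)]     # n_estimators: number of trees
    return [2.0 ** (-sum(path_length(x, t) for t in trees) / n_estimators
                    / c_factor(sample_size)) for x in data]

# Hypothetical data: a 5x5 grid of close points plus one distant point.
data = [(x * 0.1, y * 0.1) for x in range(5) for y in range(5)] + [(10.0, 10.0)]
scores = isolation_forest_scores(data, n_estimators=100, max_samples=256, random_state=0)

contamination = 0.05  # assumed fraction of anomalies; only sets the threshold
n_flagged = max(1, round(contamination * len(data)))
threshold = sorted(scores, reverse=True)[n_flagged - 1]
flagged = [x for x, s in zip(data, scores) if s >= threshold]
print(flagged)
```

Note that `contamination` appears only after scoring, when the threshold is chosen, and that no depth parameter is passed in.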

Q8. If a data point has only 2 neighbours of the same class within a radius of 0.5, what is its anomaly score
using KNN with K=10?

The question does not define the scoring rule, so the answer depends on the convention. Using a simple density-based convention:

1. The density is the number of neighbors within the specified radius. Here the density is 2, because there are 2 neighbors of the same class within a radius of 0.5.
2. The anomaly score is taken as the inverse of the density, so the score is 1/2 = 0.5.

Note that K = 10 is not used by this rule. The point has far fewer than 10 close neighbors, which itself suggests it lies in a sparse region.
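A minimal sketch of this inverse-density convention follows. The coordinates and the function name are hypothetical, and same-class filtering is omitted, so every listed neighbour is assumed to belong to the same class as the query point.

```python
import math

def inverse_density_score(point, others, radius):
    """Anomaly score = 1 / (number of neighbours within the radius)."""
    density = sum(1 for q in others if math.dist(point, q) <= radius)
    return math.inf if density == 0 else 1.0 / density

# Hypothetical layout: exactly two neighbours lie within radius 0.5 of the query point.
point = (0.0, 0.0)
others = [(0.3, 0.0), (0.0, 0.4), (2.0, 2.0), (3.0, 1.0)]
score = inverse_density_score(point, others, radius=0.5)
print(score)
```

A point with no neighbours inside the radius is given an infinite score here; that is a design choice of this sketch, not part of the question.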

Q9. Using the Isolation Forest algorithm with 100 trees and a dataset of 3000 data points, what is the
anomaly score for a data point that has an average path length of 5.0 compared to the average path
length of the trees?

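The Isolation Forest anomaly score for a point x is s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the average path length of x over all trees (here 100 trees) and c(n) is the average path length of an unsuccessful search in a binary search tree built on n points: c(n) = 2H(n - 1) - 2(n - 1)/n, with H(i) ≈ ln(i) + 0.5772.

1. Compute the normalizing term, assuming n = 3000 as the question implies (in practice n is the sub-sample size used to build each tree): c(3000) = 2(ln(2999) + 0.5772) - 2(2999/3000) ≈ 15.17.

2. Compute the score from the data point's average path length of 5.0: s = 2^(-5.0 / 15.17) ≈ 0.80.

3. Interpret the score. A score close to 1 indicates an anomaly, a score well below 0.5 indicates a normal data point, and scores around 0.5 for all points mean there are no distinct anomalies. Since 5.0 is much shorter than the expected 15.17, the point is isolated quickly, and its score of about 0.80 marks it as a likely anomaly.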
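The score can be checked numerically with the standard Isolation Forest formula s = 2^(-E[h(x)] / c(n)), where c(n) = 2H(n - 1) - 2(n - 1)/n and H(i) ≈ ln(i) + 0.5772. The sketch assumes n = 3000 is the size used for normalization, as the question implies; real implementations normalize by the per-tree sub-sample size instead. The function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n): expected path length of an unsuccessful BST search over n points (n > 2)."""
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(avg_path, n):
    """s = 2 ** (-E[h(x)] / c(n))."""
    return 2.0 ** (-avg_path / average_path_length(n))

c_n = average_path_length(3000)
score = anomaly_score(5.0, 3000)
print(f"c(3000) = {c_n:.2f}, score = {score:.2f}")
```

The result is roughly 0.8. The number of trees (100) does not appear in the formula; it only affects how reliably the average path length of 5.0 is estimated.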