Q1: What is anomaly detection and what is its purpose?

Anomaly detection is the process of identifying data points, events, or observations that deviate significantly from the majority of the data, which is treated as normal. These deviating points are called anomalies or outliers.

Purpose:

Fraud Detection: Identifying fraudulent activities in financial transactions.
Network Security: Detecting intrusions or abnormal network traffic.
Fault Detection: Identifying defects or malfunctions in industrial systems.
Medical Diagnostics: Detecting rare diseases or abnormal health conditions from medical data.

Q2: What are the key challenges in anomaly detection?

Key Challenges:

Imbalanced Data: Anomalies are rare compared to normal instances, making it difficult to learn patterns from them.
High Dimensionality: With many features, distances between points tend to become nearly uniform (the curse of dimensionality), which makes anomalies harder to separate from normal points.
Noise: Differentiating between noise and actual anomalies can be challenging.
Dynamic Data: What counts as normal, and therefore what counts as anomalous, may change over time, requiring adaptive or real-time detection methods.
Lack of Labeled Data: Supervised methods require labeled data, which is often unavailable for anomalies.


Q3: How does unsupervised anomaly detection differ from supervised anomaly detection?

Unsupervised Anomaly Detection:

No Labeled Data: Operates without labeled training data.
Assumption: Anomalies are statistically different from the majority of data.
Methods: Techniques include clustering, density estimation, and distance-based methods.

Supervised Anomaly Detection:

Labeled Data: Requires a labeled dataset with normal and anomalous examples.
Learning Patterns: Learns to distinguish between normal and anomalous data based on labeled examples.
Methods: Techniques include classification algorithms like SVM, neural networks, and decision trees.

Q4: What are the main categories of anomaly detection algorithms?

Main Categories:

Statistical Methods: Assume normal data follows a known distribution; anomalies are points that are unlikely under that distribution.
Examples: Z-score, Gaussian models.
Proximity-Based Methods: Detect anomalies based on the distance to, or the density of, neighboring data points.
Examples: k-NN, Local Outlier Factor (LOF).
Clustering-Based Methods: Treat points that do not fit well into any cluster as anomalies.
Examples: k-means, DBSCAN.
Machine Learning Methods: Train a model on the data to describe normal behavior or to isolate anomalies directly.
Examples: One-Class SVM, Isolation Forest.
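
As an illustration of the statistical category, here is a minimal Z-score sketch using only the Python standard library. The function name `zscore_outliers`, the sample readings, and the threshold of 2.5 are assumptions made for this example. With very small samples the largest attainable Z-score is bounded, so the common threshold of 3 would never fire on ten values.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Return (index, z) pairs whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing can deviate
    results = []
    for i, v in enumerate(values):
        z = (v - mu) / sigma
        if abs(z) > threshold:
            results.append((i, z))
    return results

# Hypothetical sensor readings with one obvious spike at the end.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 10.0, 9.7, 10.3, 25.0]
flagged = zscore_outliers(readings, threshold=2.5)
print(flagged)
```

Note that the spike itself inflates the mean and standard deviation, which is one reason robust variants (for example, based on the median) are often preferred in practice.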


Q5: What are the main assumptions made by distance-based anomaly detection methods?

Main Assumptions:

Normal Points are Dense: Normal data points form dense regions.
Anomalous Points are Sparse: Anomalies are far from the dense regions of normal data.
Distance Metric: A suitable distance metric (e.g., Euclidean distance) effectively measures the similarity or dissimilarity between data points.

Q6: How does the LOF algorithm compute anomaly scores?

The Local Outlier Factor (LOF) algorithm computes anomaly scores in the following steps:

Local Density Estimation: For each data point, calculate the local density based on the distance to its k-nearest neighbors.
Reachability Distance: The reachability distance of a point A with respect to point B is the maximum of the distance from A to B and the k-distance of B.
Local Reachability Density (LRD): The inverse of the average reachability distance of a point to its k-nearest neighbors.
LOF Score: The ratio of the average local reachability density of the point's neighbors to the point's own local reachability density. Higher scores indicate higher likelihood of being an anomaly.
$$\mathrm{LOF}(A) = \frac{1}{k} \sum_{B \in \mathrm{kNN}(A)} \frac{\mathrm{LRD}(B)}{\mathrm{LRD}(A)}$$
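
The steps above can be sketched in plain Python. This is a minimal illustration, not a library implementation: it assumes Euclidean distance, no duplicate points (which would make a reachability sum zero), and exactly k neighbors per point with distance ties broken by index order. The function name `lof_scores` and the toy points are made up for the example.

```python
import math

def lof_scores(points, k):
    """Compute a LOF score for every point (brute force, O(n^2) distances)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbors of each point (excluding itself) and its k-distance.
    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(A, B) = max(k-distance(B), d(A, B)).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of the neighbors' LRD to the point's own LRD.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Four points on a unit square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
print([round(s, 2) for s in scores])
```

The four square corners have the same density as their neighbors, so their scores sit at 1, while the distant point scores well above 1.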

 
Q7: What are the key parameters of the Isolation Forest algorithm?

Key Parameters:

Number of Trees (n_estimators): The number of trees in the forest.
Subsampling Size (max_samples): The number of samples used to train each tree.
Contamination: The proportion of anomalies in the dataset (if known), used to threshold anomaly scores.
Random Seed (random_state): Controls the randomness of the sample selection and feature splits.
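
To show where each parameter enters, here is a simplified pure-Python isolation forest. It is a sketch under stated assumptions, not scikit-learn's implementation: the parameter names mirror the ones listed above, but `isolation_scores`, `flag_anomalies`, the tuple-based tree layout, and the demo data are all hypothetical. It assumes at least two data points and numeric feature tuples.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(rows, depth, limit, rng):
    """Recursively split rows on a random feature at a random threshold."""
    if depth >= limit or len(rows) <= 1:
        return ("leaf", len(rows))
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # cannot split on a constant feature
        return ("leaf", len(rows))
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return ("node", feature, split,
            build_tree(left, depth + 1, limit, rng),
            build_tree(right, depth + 1, limit, rng))

def path_length(x, tree, depth=0):
    """Depth at which x lands, plus c(size) for unsplit leaves."""
    if tree[0] == "leaf":
        return depth + c(tree[1])
    _, feature, split, left, right = tree
    child = left if x[feature] < split else right
    return path_length(x, child, depth + 1)

def isolation_scores(data, n_estimators=100, max_samples=256, random_state=0):
    rng = random.Random(random_state)          # random_state: reproducibility
    psi = min(max_samples, len(data))          # max_samples: rows per tree
    limit = math.ceil(math.log2(psi))          # height limit for each tree
    trees = [build_tree(rng.sample(data, psi), 0, limit, rng)
             for _ in range(n_estimators)]     # n_estimators: number of trees
    scores = []
    for x in data:
        mean_depth = sum(path_length(x, t) for t in trees) / n_estimators
        scores.append(2.0 ** (-mean_depth / c(psi)))
    return scores

def flag_anomalies(scores, contamination=0.01):
    """contamination: flag the top fraction of scores as anomalies."""
    n_flag = max(1, round(contamination * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_flag])

# Demo: a Gaussian cluster of 200 points plus one far-away point at index 200.
demo_rng = random.Random(42)
data = [(demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)) for _ in range(200)]
data.append((8.0, 8.0))
scores = isolation_scores(data, n_estimators=100, max_samples=256, random_state=0)
flagged = flag_anomalies(scores, contamination=0.005)
print(flagged)
```

In this demo `max_samples` exceeds the dataset size, so every tree sees all 201 points; with larger datasets each tree would be built on a random subsample of that size.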


Q8: If a data point has only 2 neighbors of the same class within a radius of 0.5, what is its anomaly score using KNN with K=10?
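
In k-NN anomaly detection with K=10:

Anomaly Score: Typically, the anomaly score is the distance to the K-th nearest neighbor, or it is derived from the proportion of the K neighbors that lie within a given distance.
Given that the point has only 2 neighbors within a radius of 0.5 (out of the 10 required):

Interpretation: The question does not give enough information for an exact number. If those 2 are the only points within the radius, the 10th nearest neighbor must lie farther than 0.5 away, so the K-th-distance score exceeds 0.5, and only 2/10 = 20% of the required neighbors fall inside the radius. Under either view, the point likely has a high anomaly score.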
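
The two scoring conventions mentioned above can be sketched as follows. The function name `knn_anomaly_scores`, the "fraction of missing neighbors" definition, and the sample coordinates are assumptions chosen to mirror the question; real libraries may define the score differently.

```python
import math

def knn_anomaly_scores(query, data, k, radius):
    """Return (distance to k-th nearest neighbor, fraction of the k
    required neighbors that are missing from the given radius)."""
    dists = sorted(math.dist(query, p) for p in data)
    kth_distance = dists[k - 1]
    within = sum(1 for d in dists if d <= radius)
    fraction_missing = 1.0 - min(within, k) / k
    return kth_distance, fraction_missing

# Hypothetical layout: 2 points inside radius 0.5, 8 points farther away.
query = (0.0, 0.0)
data = [(0.1, 0.0), (0.0, 0.3)] + [(1.0 + i, 0.0) for i in range(8)]
kth_distance, fraction_missing = knn_anomaly_scores(query, data, k=10, radius=0.5)
print(kth_distance, fraction_missing)
```

With this layout the 10th neighbor is far outside the radius and 80% of the required neighbors are missing, so both conventions mark the point as unusual.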


Q9: Using the Isolation Forest algorithm with 100 trees and a dataset of 3000 data points, what is the anomaly score for a data point that has an average path length of 5.0 compared to the average path length of the trees?

Isolation Forest:

Average Path Length (h(x)): The average number of splits required to isolate a data point.
Normalization Factor (c(n)): For n data points, the average path length of unsuccessful search in a Binary Search Tree.
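$$c(n) \approx 2H(n-1) - \frac{2(n-1)}{n}$$
where H(i) is the harmonic number, approximated by ln(i) + γ, with γ ≈ 0.577 (Euler's constant).

For n = 3000:
$$c(3000) \approx 2(\ln(2999) + \gamma) - \frac{2(2999)}{3000}$$
$$c(3000) \approx 2(8.006 + 0.577) - 1.999$$
$$c(3000) \approx 17.166 - 1.999 \approx 15.167$$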
Anomaly Score:


$$s(x, n) = 2^{-\frac{h(x)}{c(n)}}$$

Given h(x) = 5.0:

$$s(x, 3000) = 2^{-\frac{5.0}{15.167}} \approx 2^{-0.33} \approx 0.80$$
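The anomaly score for the data point is approximately 0.80. Scores near 0.5 or below indicate normal points and scores close to 1 indicate anomalies, so a score of 0.80, produced by a path length of 5.0 that is far shorter than the expected c(3000) ≈ 15.167, suggests the point is easy to isolate and is likely an anomaly.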
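
The arithmetic can be checked with a few lines of Python. The function names are chosen for this example, and the calculation follows the question in using n = 3000; in practice the normalization is usually based on the subsample size used to build each tree rather than the full dataset size.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant

def expected_path_length(n):
    """c(n) = 2 H(n-1) - 2 (n-1) / n, with H(i) approximated by ln(i) + gamma."""
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """s(x, n) = 2 ** (-h(x) / c(n))."""
    return 2.0 ** (-avg_path_length / expected_path_length(n))

c_3000 = expected_path_length(3000)
score = anomaly_score(5.0, 3000)
print(round(c_3000, 3), round(score, 2))
```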