Q1. What is anomaly detection and what is its purpose?


Anomaly detection is the process of identifying data points, items, or events that deviate significantly from the norm or expected pattern.

These anomalies are also referred to as outliers or exceptions. They should be distinguished from ordinary noise, which is random variation of no particular interest.

The purpose of anomaly detection is to:

Identify unusual patterns: Detect deviations from normal behavior that might indicate potential problems or opportunities.   
Prevent fraud: Detect fraudulent activities such as credit card fraud or fraudulent insurance claims.
System health monitoring: Identify system failures or performance issues.
Network intrusion detection: Detect malicious activities in network traffic.
Quality control: Identify defective products or manufacturing errors.   


Q2. What are the key challenges in anomaly detection?


Anomaly detection often faces the following challenges:

Defining normality: Determining what constitutes normal behavior can be subjective and depends on the specific context.
Imbalanced data: Anomalies are typically rare, leading to imbalanced datasets which can affect model performance.
Evolving patterns: Normal behavior can change over time, requiring adaptive models.
High dimensionality: Many real-world datasets have a large number of features. As dimensionality grows, distances between points become less discriminative, which makes anomalies harder to separate from normal points.
Noisy data: Real-world data often contains noise, which can hinder anomaly detection.   
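
The imbalanced-data challenge can be illustrated with a small sketch. The labels below are made up for illustration (990 normal points, 10 anomalies), and `accuracy_and_recall` is a hypothetical helper, not a library function. A detector that never flags anything still reaches 99% accuracy while finding no anomalies at all, which is why accuracy alone is a poor metric here.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall) for binary labels where 1 marks an anomaly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_pos = sum(y_true)
    accuracy = correct / len(y_true)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall

# Made-up, heavily imbalanced labels: 1% anomalies.
y_true = [0] * 990 + [1] * 10
# A trivial "detector" that predicts normal for every point.
y_pred = [0] * 1000

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```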


Q3. How does unsupervised anomaly detection differ from supervised anomaly detection?


Unsupervised Anomaly Detection
Assumes no prior knowledge of anomalies.
Learns the normal patterns from the data itself.   
Identifies data points that deviate significantly from the learned pattern.
Commonly used when labeled data is scarce or unavailable.

Supervised Anomaly Detection
Requires labeled data with both normal and anomalous instances.   
Builds a model to classify new data points as normal or anomalous.
Generally achieves higher accuracy but relies on availability of labeled data.   


Q4. What are the main categories of anomaly detection algorithms?

Anomaly detection algorithms can be categorized into several types:

Statistical methods: Based on statistical properties of data, such as Z-score, IQR, and density-based methods.   
Machine learning-based methods: Employ various machine learning techniques, including clustering, classification, and neural networks.
Information-theoretic methods: Utilize information theory concepts to identify anomalies.
Spectral methods: Employ techniques from spectral graph theory.
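
As a minimal sketch of the statistical category, the two rules named above (Z-score and IQR) can be written with the standard library alone. The thresholds (3 standard deviations, 1.5 times the IQR) are common conventions rather than fixed rules, the data is invented for illustration, and the function names are hypothetical.

```python
from statistics import mean, pstdev, quantiles

def zscore_outliers(xs, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mu, sigma = mean(xs), pstdev(xs)
    if sigma == 0:
        return []
    return [x for x in xs if abs(x - mu) / sigma > threshold]

def iqr_outliers(xs, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(xs, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < lower or x > upper]

# Invented sample: eleven values near 11-12 and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 50]
print(zscore_outliers(data))
print(iqr_outliers(data))
```

On small samples an extreme value inflates the mean and standard deviation it is measured against, so the Z-score rule can miss it; the IQR rule is less sensitive to this effect.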

Q5. What are the main assumptions made by distance-based anomaly detection methods?

Distance-based anomaly detection methods assume:   

Data points are represented in a metric space.
Anomalies are data points that are far away from their neighbors.   
The density of normal data points is higher than the density of anomalous points.
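
One simple way to turn these assumptions into a score is to use the distance to the k-th nearest neighbor: the larger it is, the more isolated the point. The sketch below assumes Euclidean distance and a small invented 2-D dataset; `knn_distance_scores` is a hypothetical name, and the brute-force search is only suitable for small inputs.

```python
import math

def knn_distance_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Four corners of a unit square plus one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = knn_distance_scores(points, k=2)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(scores, most_anomalous)
```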

Q6. How does the LOF algorithm compute anomaly scores?

Local Outlier Factor (LOF) is a density-based anomaly detection algorithm. It computes anomaly scores based on the local density of data points.   

k-nearest neighbors: For each data point, identify its k nearest neighbors.
Reachability distance: For each neighbor o of a point p, compute reach-dist(p, o) = max(k-distance(o), d(p, o)), where k-distance(o) is the distance from o to its k-th nearest neighbor.
Local reachability density: Compute the inverse of the average reachability distance from the data point to its k nearest neighbors.
Local outlier factor: Calculate the ratio of the average local reachability density of the point's k nearest neighbors to the local reachability density of the point itself. A score close to 1 means the point is about as dense as its neighbors; a score well above 1 indicates a higher likelihood of being an outlier.
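
The following is a minimal brute-force sketch of LOF, where the local reachability density is the inverse of the mean reachability distance and the final score is the neighbors' average density divided by the point's own density. It assumes Euclidean distance, no duplicate points, and exactly k neighbors per point (ties at the k-th distance are broken by index, a simplification). The name `lof_scores` and the sample data are illustrative only.

```python
import math

def lof_scores(points, k):
    """Return the Local Outlier Factor of every point (brute force)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors, k_distance = [], []
    for i in range(n):
        # Indices of the other points, nearest first.
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    def reach_dist(a, b):
        # Reachability distance of a from b: at least b's k-distance.
        return max(k_distance[b], dist[a][b])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        mean_reach = sum(reach_dist(i, j) for j in neighbors[i]) / k
        lrd.append(1.0 / mean_reach)

    # LOF: average density of the neighbors divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = lof_scores(points, k=2)
print([round(s, 2) for s in scores])
```

In this example the four clustered points score about 1, while the distant point scores far above 1.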

Q7. What are the key parameters of the Isolation Forest algorithm?

Isolation Forest is an anomaly detection algorithm based on random decision trees. The key parameters are:   

Number of trees: The number of decision trees in the forest.
Subsampling size: The number of data points randomly selected for each tree.
Contamination: The estimated proportion of anomalies in the dataset, used to set the score threshold above which points are flagged.
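
To show where each parameter enters, here is a minimal pure-Python sketch of an isolation forest. It is a teaching sketch under stated assumptions, not a reference implementation: the names `n_trees`, `sample_size`, and `contamination` are chosen for this example (scikit-learn's `IsolationForest` exposes the corresponding `n_estimators`, `max_samples`, and `contamination`), and the data is randomly generated.

```python
import math
import random

def c(n):
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(tree, point, depth=0):
    """Depth at which the point lands, plus c(size) for unsplit leaves."""
    if "size" in tree:
        return depth + c(tree["size"])
    branch = "left" if point[tree["dim"]] < tree["split"] else "right"
    return path_length(tree[branch], point, depth + 1)

def isolation_forest_scores(points, n_trees=100, sample_size=256,
                            contamination=0.1, seed=0):
    """Return (scores, threshold); points scoring >= threshold are flagged."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    if sample_size < 2:
        raise ValueError("need at least two points per tree")
    max_depth = math.ceil(math.log2(sample_size))
    # Number of trees and subsampling size control how the forest is built.
    trees = [build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    scores = []
    for p in points:
        mean_path = sum(path_length(t, p) for t in trees) / n_trees
        scores.append(2 ** (-mean_path / c(sample_size)))
    # Contamination only decides how many of the top scores get flagged.
    n_flagged = max(1, round(contamination * len(points)))
    threshold = sorted(scores, reverse=True)[n_flagged - 1]
    return scores, threshold

# Randomly generated cluster around the origin plus one distant point.
data_rng = random.Random(1)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(63)]
points.append((10.0, 10.0))
scores, threshold = isolation_forest_scores(
    points, n_trees=100, sample_size=64, contamination=1 / 64)
flagged = [p for p, s in zip(points, scores) if s >= threshold]
print(flagged)
```

Note that contamination does not change the scores themselves; it only converts them into a flag by fixing the cut-off.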

Q8. If a data point has only 2 neighbours of the same class within a radius of 0.5, what is its anomaly score using KNN with K=10?


In [1]:
def compute_anomaly_score(neighbors_within_radius, k):
    """
    Compute a simple ratio-based KNN anomaly score.

    There is no single standard KNN anomaly score; this one is the fraction
    of the k nearest neighbors that are NOT same-class neighbors within the
    radius, so 0 means fully supported and 1 means fully isolated.

    Parameters:
    - neighbors_within_radius (int): Number of neighbors of the same class within the given radius.
    - k (int): Number of neighbors considered for anomaly scoring.

    Returns:
    - float: Anomaly score in [0, 1].
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")

    # Proportion of the k neighbors that are same-class and within the radius
    # (capped at 1 in case more than k neighbors fall inside the radius)
    proportion_within_radius = min(neighbors_within_radius, k) / k

    # Anomaly score: the remaining proportion
    return 1 - proportion_within_radius

# Parameters
neighbors_within_radius = 2
k = 10  # Number of neighbors for KNN

# Compute anomaly score
anomaly_score = compute_anomaly_score(neighbors_within_radius, k)
print(f'Anomaly Score: {anomaly_score:.2f}')


Anomaly Score: 0.80



Q9. Using the Isolation Forest algorithm with 100 trees and a dataset of 3000 data points, what is the anomaly score for a data point that has an average path length of 5.0 compared to the average path length of the trees?

In [2]:
import numpy as np

# Euler-Mascheroni constant
EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """
    Calculate the average path length c(n) for a dataset of size n.

    c(n) = 2 * H(n - 1) - 2 * (n - 1) / n, where H(i) is the harmonic
    number, approximated by ln(i) + Euler-Mascheroni constant.

    Parameters:
    - n (int): Number of data points in the dataset.

    Returns:
    - float: Average path length.
    """
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = np.log(n - 1) + EULER_GAMMA  # natural log, not log2
    return 2 * harmonic - 2 * (n - 1) / n

def compute_anomaly_score(E_x, n):
    """
    Compute the anomaly score s(x, n) = 2 ** (-E(h(x)) / c(n)).

    Parameters:
    - E_x (float): Average path length of the data point over all trees.
    - n (int): Number of data points in the dataset.

    Returns:
    - float: Anomaly score in (0, 1]; values near 1 indicate anomalies.
    """
    # Normalizing constant: average path length for n points (about 15.17 for n = 3000)
    c_n = average_path_length(n)

    # Compute the anomaly score
    return 2 ** (-E_x / c_n)

# Parameters
# The question gives no subsampling size, so n = 3000 is used directly.
# The number of trees (100) affects how E_x is estimated, not the formula.
n = 3000  # Number of data points
E_x = 5.0  # Average path length of the data point

# Compute anomaly score
anomaly_score = compute_anomaly_score(E_x, n)
print(f'Anomaly Score: {anomaly_score:.4f}')


Anomaly Score: 0.7957
