In [None]:
Q1. What is anomaly detection and what is its purpose?

In [None]:
A1. Anomaly detection is a data analysis technique for identifying patterns or data points that 
    deviate significantly from the expected or normal behavior within a dataset. Its purpose is to 
    surface these deviations so they can be investigated, since they may signal problems (such as 
    fraud, equipment faults, or data errors) or reveal rare but important events.

In [None]:
Q2. What are the key challenges in anomaly detection?

In [None]:
A2. Defining "normal" behavior: Determining what constitutes normal or expected behavior in a dataset 
    can be challenging, especially in complex systems or environments where patterns can be dynamic and evolving.
    
    Separating noise from true anomalies: Real-world data often contains noise and measurement errors 
    that look unusual but are not of interest. Distinguishing these from true anomalies can be difficult.

    Handling high-dimensional data: Many datasets contain a large number of features or variables, making 
    it challenging to identify anomalies in high-dimensional spaces.
    
    Adapting to concept drift: The definition of "normal" behavior may change over time, requiring anomaly 
    detection models to adapt and adjust to these changes (concept drift).

In [None]:
Q3. How does unsupervised anomaly detection differ from supervised anomaly detection?

In [None]:
A3. Unsupervised Anomaly Detection:
    In unsupervised anomaly detection, the algorithm learns solely from the input data without any 
    labeled examples of anomalies or normal instances. The key characteristics of unsupervised anomaly 
    detection are:

    No labeled data: The algorithm does not require pre-labeled data to train on. It learns the patterns 
    and structures present in the data itself.
    Assumption of normality: Unsupervised methods assume that the majority of the data represents "normal" 
    instances, and anomalies are deviations from this normality.
    Techniques: Common unsupervised anomaly detection techniques include clustering-based methods 
    (e.g., k-means), density-based methods (e.g., Local Outlier Factor), and statistical methods 
    (e.g., Gaussian mixture models).
    Applications: Unsupervised methods are useful when labeled data is scarce or unavailable, or when 
    what counts as an anomaly is poorly defined or changes over time.
    
    Supervised Anomaly Detection:
    In supervised anomaly detection, the algorithm is trained on a labeled dataset that contains examples 
    of both normal and anomalous instances. The key characteristics of supervised anomaly detection are:

    Labeled data: The algorithm requires a labeled dataset where instances are explicitly marked as normal 
    or anomalous.
    Classification approach: Supervised anomaly detection is typically framed as a binary classification 
    problem, where the algorithm learns to distinguish between normal and anomalous instances based on the 
    labeled examples.
    Techniques: Common supervised techniques include decision trees, support vector machines, 
    neural networks, and other classification algorithms.
    Applications: Supervised methods are suitable when labeled data is available or can be obtained through 
    manual labeling or domain expertise.

In [None]:
Q4. What are the main categories of anomaly detection algorithms?

In [None]:
A4. Statistical methods:
    These methods model the normal behavior of the data using statistical techniques and identify 
    instances that deviate significantly from this model as anomalies.
    Examples: Gaussian mixture models, parametric and non-parametric techniques, hypothesis testing.
    
    Proximity-based methods:
    These methods identify anomalies based on their distance or similarity to the majority of the data 
    points.
    Examples: k-Nearest Neighbors (k-NN), distance-based outlier detection.
    
    Density-based methods:
    These methods assume that normal instances occur in dense regions, while anomalies lie in sparse 
    regions of the data.
    Examples: Local Outlier Factor (LOF), Cluster-Based Local Outlier Factor (CBLOF), Density-Based 
    Spatial Clustering of Applications with Noise (DBSCAN).
    
    Subspace and Projection methods:
    These methods identify anomalies by projecting the data into different subspaces or 
    lower-dimensional spaces, where anomalies may be more easily detectable.
    Examples: Principal Component Analysis (PCA), Subspace Outlier Detection.
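
    As a concrete instance of the statistical category above, the following minimal sketch flags values 
    whose z-score exceeds a threshold. The function name zscore_outliers, the threshold, and the sample 
    readings are illustrative assumptions, and the approach presumes roughly bell-shaped data.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return (index, value) pairs whose absolute z-score exceeds threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # All values are identical, so nothing deviates from the mean.
        return []
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mean) / std > threshold]


# Hypothetical sensor readings with one obviously unusual value at the end.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 10.0, 25.0]

# A lower threshold is used here because z-scores are bounded in small samples.
flagged = zscore_outliers(readings, threshold=2.5)
print(flagged)
```

    Because the outlier itself inflates the mean and standard deviation, robust variants (for example 
    median-based ones) are often preferred in practice.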
    

In [None]:
Q5. What are the main assumptions made by distance-based anomaly detection methods?

In [None]:
A5. Proximity assumption: The fundamental assumption is that normal data instances are close together, 
    forming dense clusters or regions, while anomalies are far away from these dense regions. This means that 
    anomalies are expected to have larger distances from their nearest neighbors compared to normal instances.

    Representative training data: It is assumed that the training data is representative of the normal 
    instances and does not contain a significant number of anomalies. If the training data is contaminated 
    with too many anomalies, the distance-based methods may not accurately model the normal behavior.

    Meaningful distance metric: These methods rely on the existence of a meaningful distance or similarity 
    metric that can accurately measure the proximity between data instances. The choice of distance metric 
    (e.g., Euclidean distance, Manhattan distance, cosine similarity) should be appropriate for the specific 
    data and problem domain.
    
    Comparable feature scales (for some methods): Metrics such as Euclidean distance, as used in classic 
    k-NN, implicitly treat every feature as equally important and on a comparable scale. If one feature 
    has a much larger variance than the others, it dominates the distance, so features are usually 
    standardized or weighted before distances are computed.
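
    To illustrate the last assumption above (features contributing equally to the distance), here is a 
    minimal sketch that scores each point by the distance to its k-th nearest neighbor, with and 
    without standardization. The helper names standardize and kth_neighbor_distance and the toy data 
    are assumptions made for this example.

```python
import math
import statistics


def standardize(points):
    """Rescale every feature to zero mean and unit variance."""
    columns = list(zip(*points))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant features to avoid division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(p, means, stds))
            for p in points]


def kth_neighbor_distance(points, k):
    """Anomaly score of each point: Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Toy data: feature 0 is small-scale, feature 1 is large-scale.
# The last point is unusual only in feature 0.
points = [(0.10, 1000), (0.20, 3000), (0.15, 5000),
          (0.12, 7000), (0.18, 9000), (5.00, 5000)]

raw = kth_neighbor_distance(points, k=2)
scaled = kth_neighbor_distance(standardize(points), k=2)

print("most anomalous (raw):   ", raw.index(max(raw)))
print("most anomalous (scaled):", scaled.index(max(scaled)))
```

    On the raw data the large-scale feature dominates the distance, so the point that is unusual in the 
    small-scale feature is not ranked first; after standardization it is.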

In [None]:
Q6. How does the LOF algorithm compute anomaly scores?
A6. Compute k-distance: For each data instance, the k-distance is calculated, which is the distance 
    to the k-th nearest neighbor. This gives an idea of the density around that instance based on its 
    nearest neighbor distances.
    
    Compute reachability distance: The reachability distance of instance A with respect to instance B 
    is the maximum of the k-distance of B and the actual distance between A and B. This is used to reduce 
    the impact of statistical fluctuations in dense regions.
    
    Compute local reachability density: The local reachability density (lrd) of an instance is the 
    inverse of the average reachability distance of that instance with respect to each of its k 
    nearest neighbors. Instances in dense regions have small reachability distances and therefore a 
    high lrd.
    
    Compute LOF score: The LOF score of an instance is the average ratio of its neighbors' local 
    reachability densities to its own: 
        LOF(A) = sum(local_reachability_density(B) / local_reachability_density(A)) / k 
        where the sum is taken over the k nearest neighbors B of instance A.
    A score close to 1 means A is about as dense as its neighbors; a score well above 1 means A lies 
    in a sparser region than its neighbors and is a likely anomaly.
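
    The four steps above can be sketched in plain Python as follows. This is a brute-force illustration 
    under simplifying assumptions (Euclidean distance, ties beyond the k-th neighbor ignored, no 
    duplicate points), and the function name lof_scores is hypothetical.

```python
import math


def lof_scores(points, k):
    """Return the LOF score of every point (brute force, O(n^2) distances)."""
    n = len(points)

    # Step 1: k nearest neighbors and k-distance of every point.
    neighbors = []
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j)
                       for j in range(n) if j != i)
        neighbors.append(dists[:k])
    k_distance = [nbrs[-1][0] for nbrs in neighbors]

    # Steps 2 and 3: reachability distances and local reachability density.
    lrd = []
    for a in range(n):
        # reach_dist(a, b) = max(k_distance(b), d(a, b))
        mean_reach = sum(max(k_distance[b], d) for d, b in neighbors[a]) / k
        lrd.append(1.0 / mean_reach)  # assumes no duplicates, so mean_reach > 0

    # Step 4: average ratio of the neighbors' lrd to the point's own lrd.
    return [sum(lrd[b] for _, b in neighbors[a]) / (k * lrd[a])
            for a in range(n)]


# A 3x3 grid of regular points plus one isolated point.
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
data = grid + [(10.0, 10.0)]

scores = lof_scores(data, k=3)
for point, score in zip(data, scores):
    print(point, round(score, 2))
```

    The grid points should score near 1, while the isolated point scores far above 1.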

In [None]:
Q7. What are the key parameters of the Isolation Forest algorithm?
A7. Number of trees (n_estimators): The number of isolation trees in the ensemble. More trees give 
    more stable anomaly scores, with diminishing returns, at the cost of more computation and memory. 
    scikit-learn's IsolationForest defaults to 100 trees.
    
    Subsample size (max_samples): The number of instances drawn to build each isolation tree. 
    In scikit-learn the default ("auto") is the smaller of 256 and the number of instances in the 
    dataset. Small subsamples keep trees shallow and cheap and reduce swamping and masking effects, 
    but a subsample that is too small may not represent the normal data well.

    Maximum tree depth (height limit): The depth at which tree growth stops. Anomalies are expected to 
    be isolated after few splits, so deep trees are unnecessary; the original algorithm uses 
    ceil(log2(subsample size)). scikit-learn's IsolationForest does not expose this as a constructor 
    parameter and derives it from max_samples.
    
    Contamination (contamination): The expected proportion of anomalies in the dataset. It sets the 
    threshold for flagging instances as anomalies based on their anomaly scores and does not change 
    the scores themselves. A higher value results in more instances being labeled as anomalies.
    
    Bootstrap (bootstrap): Whether each tree's subsample is drawn with replacement (True) or without 
    replacement (False, the scikit-learn default).
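
    To show how the number of trees, the subsample size, and the height limit interact, here is a 
    minimal pure-Python sketch of an isolation forest. It is not scikit-learn's implementation: the 
    function names (build_tree, path_length, isolation_scores) are hypothetical, subsamples are drawn 
    without replacement, and contamination-based thresholding is left out.

```python
import math
import random


def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(points, depth, height_limit, rng):
    # Stop when the node is isolated or the height limit is reached.
    if depth >= height_limit or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}  # cannot split on this feature
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, height_limit, rng),
            "right": build_tree(right, depth + 1, height_limit, rng)}


def path_length(x, node, depth=0):
    if "split" not in node:
        # Leaf: add the expected depth of the unbuilt subtree.
        return depth + c(node["size"])
    child = node["left"] if x[node["dim"]] < node["split"] else node["right"]
    return path_length(x, child, depth + 1)


def isolation_scores(data, queries, n_estimators=100, max_samples=256, seed=0):
    rng = random.Random(seed)
    psi = min(max_samples, len(data))                  # subsample size
    height_limit = math.ceil(math.log2(max(psi, 2)))   # derived, not user-set
    trees = [build_tree(rng.sample(data, psi), 0, height_limit, rng)
             for _ in range(n_estimators)]
    return [2.0 ** (-sum(path_length(q, t) for t in trees)
                    / (n_estimators * c(psi)))
            for q in queries]


# Hypothetical data: one Gaussian cluster; score a central and a distant point.
rng_data = random.Random(1)
cluster = [(rng_data.gauss(0, 1), rng_data.gauss(0, 1)) for _ in range(300)]
center, far = isolation_scores(cluster, [(0.0, 0.0), (8.0, 8.0)])
print(round(center, 3), round(far, 3))
```

    The distant point is isolated after fewer splits, so its score should be the higher of the two.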

In [None]:
Q8. If a data point has only 2 neighbours of the same class within a radius of 0.5, what is its anomaly score
    using KNN with K=10?
A8. The question does not specify a scoring rule, so there is no single numeric answer; the score 
    depends on how the k-NN anomaly score is defined.
    With k=10, the score is based on the 10 nearest neighbors of the data point. Only 2 same-class 
    neighbors lie within a radius of 0.5, so the remaining 8 of the 10 nearest neighbors are either of 
    a different class or farther away than 0.5.
    If the score is the distance to the k-th nearest neighbor and those 2 points are the only neighbors 
    within the radius, the 10th nearest neighbor lies beyond 0.5, so the score is greater than 0.5.
    If the score is instead the fraction of the k nearest neighbors that are not nearby same-class 
    points, it is 8/10 = 0.8.
    In both cases the anomaly score is relatively high: the point has far fewer close, same-class 
    neighbors than k.

In [None]:
Q9. Using the Isolation Forest algorithm with 100 trees and a dataset of 3000 data points, what is the
    anomaly score for a data point that has an average path length of 5.0 compared to the average path
    length of the trees?

In [None]:
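A9. The anomaly score is s(x, n) = 2^(−E[h(x)] / c(n)), where E[h(x)] is the average path length of 
    the data point across the trees and c(n) is the average path length of an unsuccessful search in 
    a binary search tree built on n points:
    c(n) = 2⋅(ln(n−1) + 0.5772) − 2⋅(n−1)/n
    
    Here n is taken as the 3000 data points given in the question; when trees are built on 
    subsamples, n is the subsample size.
    
    E[h(x)] = 5.0
    c(3000) = 2⋅(ln(2999) + 0.5772) − 2⋅(2999/3000) ≈ 17.17 − 2.00 ≈ 15.17
    
    anomaly score = 2^(−5.0/15.17) ≈ 2^(−0.330)
    
    Anomaly Score ≈ 0.796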
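
    The calculation can be checked numerically with the standard Isolation Forest score 
    s(x, n) = 2^(−E[h(x)] / c(n)), where c(n) = 2⋅(ln(n−1) + 0.5772) − 2⋅(n−1)/n. The function names 
    below are illustrative, and n is assumed to be the 3000 points stated in the question.

```python
import math

EULER_GAMMA = 0.5772156649


def average_unsuccessful_search_length(n):
    """c(n): normalizing constant for path lengths over n points (n > 2)."""
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def anomaly_score(avg_path_length, n):
    """Isolation Forest score: close to 1 is anomalous, well below 0.5 is normal."""
    return 2.0 ** (-avg_path_length / average_unsuccessful_search_length(n))


c_n = average_unsuccessful_search_length(3000)
score = anomaly_score(5.0, 3000)
print(round(c_n, 2))    # 15.17
print(round(score, 3))  # 0.796
```

    A score this far above 0.5 indicates the point is isolated much faster than an average point.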