In [None]:
# Q1. What is anomaly detection and what is its purpose?
Ans.
Anomaly detection is a technique used in data analysis to identify outliers or unusual patterns in a dataset. Its purpose is to flag 
instances that deviate significantly from the norm or expected behavior. Anomalies could represent errors, outliers, or potentially
interesting patterns that warrant further investigation.
Anomaly detection is commonly used in various domains such as cybersecurity (to detect malicious activities), finance (to identify
fraudulent transactions), manufacturing (to spot defects in products), and healthcare (to detect unusual patient conditions). By 
identifying anomalies, organizations can take proactive measures to address issues or exploit opportunities that may otherwise go unnoticed.

In [None]:
# Q2. What are the key challenges in anomaly detection?
Ans.
Anomaly detection faces several key challenges, including:
1. Data Quality: Anomalies may be obscured by noisy or incomplete data. Poor data quality can make it difficult to distinguish true
anomalies from random fluctuations or errors.
2. Imbalanced Data: In many real-world scenarios, anomalies are rare compared to normal instances, resulting in imbalanced datasets. 
Traditional machine learning algorithms may struggle to effectively detect anomalies in such imbalanced data.
3. High Dimensionality: As the number of features grows, distances between data points become less discriminative (the curse of
dimensionality), and an anomaly may be visible only in a small subset of the features, which makes it harder to detect.
4. Concept Drift: The underlying patterns in the data may change over time, leading to concept drift. Anomaly detection models trained
on historical data may become less effective as new patterns emerge, requiring continuous monitoring and adaptation.
5. Scalability: Anomaly detection algorithms must be scalable to handle large volumes of data efficiently. Real-time or near-real-time
processing may be necessary in certain applications, posing additional scalability challenges.
6. Adversarial Attacks: In cybersecurity and fraud detection, adversaries may deliberately manipulate data to evade detection. Anomaly 
detection models must be robust against such adversarial attacks to maintain their effectiveness.

In [None]:
# Q3. How does unsupervised anomaly detection differ from supervised anomaly detection?
Ans.
Unsupervised anomaly detection and supervised anomaly detection are two different approaches to identifying anomalies in data:

1. Unsupervised Anomaly Detection:
In unsupervised anomaly detection, the algorithm is trained on unlabeled data, under the assumption that the large majority of instances
are normal. The algorithm learns the underlying patterns or structure of the data and flags instances that deviate significantly from this
learned pattern as anomalies. (Training on data known to contain only normal instances is usually called semi-supervised anomaly detection
or novelty detection.)
Unsupervised techniques include methods like clustering-based approaches, density estimation, and isolation forests.
Unsupervised anomaly detection is useful when labeled anomaly data is scarce or unavailable, and it can identify previously unknown types
of anomalies.

2. Supervised Anomaly Detection:
In supervised anomaly detection, the algorithm is trained on a dataset that includes both normal instances and labeled anomalies.
The algorithm learns to distinguish between normal and anomalous instances based on the labeled examples provided during training.
Supervised techniques typically involve training classifiers such as support vector machines (SVMs), decision trees, or neural networks.
Supervised anomaly detection is effective when labeled anomaly data is available and can provide better performance compared to unsupervised 
methods, especially when anomalies are well-defined and representative.

In [None]:
# Q4. What are the main categories of anomaly detection algorithms?
Ans.
The main categories of anomaly detection algorithms are:
1. Statistical Methods: These algorithms model the statistical properties of normal data and flag instances that deviate significantly from
these properties.
2. Machine Learning-Based Methods: These algorithms use machine learning techniques to learn patterns from normal data and identify anomalies
based on deviations from these learned patterns.
3. Proximity-Based Methods: These algorithms measure the similarity or dissimilarity between data points and identify anomalies as instances 
that are significantly different from their neighbors.
4. Clustering-Based Methods: These algorithms group similar data points into clusters and identify anomalies as data points that do not belong
to any cluster or belong to small clusters.
5. Classification-Based Methods: These algorithms train classifiers to distinguish between normal and anomalous instances based on labeled 
training data, then classify new instances accordingly.
6. Dimensionality Reduction-Based Methods: These algorithms reduce the dimensionality of the data while preserving its essential characteristics,
making it easier to identify anomalies in the reduced-dimensional space.
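
As a small illustration of the statistical category listed above, the sketch below flags values whose z-score exceeds a threshold. The function name zscore_outliers, the threshold of 3.0, and the synthetic data are assumptions chosen for the example, not part of the original answer.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return a boolean mask marking values whose |z-score| exceeds threshold."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # All values identical: nothing deviates from the mean
        return np.zeros(values.shape, dtype=bool)
    z = (values - values.mean()) / std
    return np.abs(z) > threshold

# Hypothetical data: 200 readings around 50, plus one extreme reading of 90
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(loc=50.0, scale=2.0, size=200), [90.0]])

mask = zscore_outliers(data)
print("Flagged indices:", np.flatnonzero(mask))
```

This simple rule assumes roughly bell-shaped data, and the mean and standard deviation are themselves affected by the outliers, so it works best when anomalies are rare and the sample is reasonably large.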

In [None]:
# Q5. What are the main assumptions made by distance-based anomaly detection methods?
Ans.
Distance-based anomaly detection methods make the following main assumptions:
1. Density: Normal instances are typically clustered together and have a higher density compared to anomalies.
2. Distance: Anomalies are significantly distant from normal instances in the feature space, making them outliers with respect to the majority of 
the data points.
3. Neighborhood: Anomalies have fewer or no neighboring data points in their vicinity, as they are isolated or distant from the majority of the
data distribution.

In [None]:
# Q6. How does the LOF algorithm compute anomaly scores?
Ans.
The Local Outlier Factor (LOF) algorithm computes anomaly scores for each data point based on its local density compared to the densities of its
neighbors. Here's how it works:
1. k-Distance and Neighborhood: For each data point, the algorithm finds its k nearest neighbors. The k-distance of a point is the distance to its
k-th nearest neighbor; a small k-distance means the point sits in a dense region, while a large one suggests a sparse region.
2. Reachability Distance: The reachability distance of a data point p with respect to its neighbor q is defined as the maximum of the distance between
p and q and the k-distance of q, i.e. reach-dist(p, q) = max(k-distance(q), d(p, q)). Using the k-distance of q as a floor smooths out fluctuations
for points that lie very close to q.
3. Local Reachability Density: The local reachability density (lrd) of a data point is defined as the inverse of the average reachability distance
from the point to its k nearest neighbors. Higher values indicate that the point lies in a denser region.
4. Local Outlier Factor (LOF) Score: Finally, the LOF score of a data point is the average ratio of the lrd of each of its k nearest neighbors to its
own lrd. A score close to 1 means the point is about as dense as its neighbors; a score substantially greater than 1 means the point is in a much
sparser region than its neighbors, suggesting that it may be an outlier.
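
The steps above can be sketched in NumPy as follows. This is a minimal illustration, assuming the standard definition reach-dist(p, q) = max(k-distance(q), d(p, q)), exactly k neighbors per point (ties are ignored), and no duplicate points; the function name lof_scores and the synthetic data are made up for the example.

```python
import numpy as np

def lof_scores(X, k):
    """Compute LOF scores with brute-force pairwise distances (small data only)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Pairwise Euclidean distances; a point is never its own neighbor
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)

    # Indices of and distances to the k nearest neighbors of every point
    neighbors = np.argsort(dist, axis=1)[:, :k]
    neighbor_dist = dist[np.arange(n)[:, None], neighbors]

    # k-distance: distance to the k-th nearest neighbor
    k_distance = neighbor_dist[:, -1]

    # reach-dist(p, q) = max(k-distance(q), d(p, q)) for each neighbor q of p
    reach = np.maximum(k_distance[neighbors], neighbor_dist)

    # Local reachability density: inverse of the mean reachability distance
    lrd = 1.0 / reach.mean(axis=1)

    # LOF: mean lrd of the neighbors divided by the point's own lrd
    return lrd[neighbors].mean(axis=1) / lrd

# Hypothetical data: a 2-D cluster of 50 points plus one far-away point
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(size=(50, 2)), [[8.0, 8.0]]])

scores = lof_scores(points, k=5)
print("Most outlying index:", int(np.argmax(scores)))
```

For real workloads, scikit-learn's LocalOutlierFactor provides an optimized implementation of the same idea.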

In [None]:
# Q7. What are the key parameters of the Isolation Forest algorithm?
Ans.
The Isolation Forest algorithm, a popular tree-based anomaly detection method, has a few key parameters that can be adjusted to control its behavior:
1. Number of Trees (n_estimators): This parameter specifies the number of isolation trees to be created in the forest. A larger number of trees can 
lead to better detection of anomalies but may also increase computational complexity.
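2. Maximum Tree Depth: In the original algorithm, each isolation tree is grown only up to a height limit of about ceil(log2(subsample size)),
because anomalies tend to be isolated close to the root. scikit-learn's IsolationForest sets this limit automatically from max_samples and does
not expose a max_depth parameter.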
3. Subsample Size (max_samples): This parameter determines the number of samples randomly drawn from the dataset to build each isolation tree. A 
smaller subsample size can lead to faster training but may also reduce the algorithm's effectiveness in capturing the underlying data structure.
4. Contamination: This parameter specifies the expected proportion of anomalies in the dataset. It is used to set the decision threshold for
classifying instances as anomalies. A higher contamination value causes a larger share of instances to be flagged as anomalies.
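
A minimal sketch of how these parameters are passed to scikit-learn's IsolationForest is shown below. The specific values (256 samples per tree, 5% contamination) and the synthetic data are assumptions for illustration only; no tree depth is set, since scikit-learn derives it from max_samples.

```python
import math

import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: 1000 points with 4 standard-normal features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))

forest = IsolationForest(
    n_estimators=100,     # number of isolation trees
    max_samples=256,      # points drawn to build each tree
    contamination=0.05,   # expected share of anomalies, sets the threshold
    max_features=1.0,     # fraction of features made available to each tree
    random_state=42,
)

# fit_predict returns +1 for inliers and -1 for outliers
labels = forest.fit_predict(X_demo)
outlier_fraction = float((labels == -1).mean())

# Height limit implied by the subsample size: ceil(log2(max_samples))
height_limit = math.ceil(math.log2(forest.max_samples_))

print(f"Flagged fraction: {outlier_fraction:.3f}, height limit: {height_limit}")
```

With contamination=0.05 the threshold is placed so that roughly 5% of the training points are labeled as outliers, whether or not the data actually contains that many anomalies.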

In [None]:
# Q8. If a data point has only 2 neighbours of the same class within a radius of 0.5, what is its anomaly score
# using KNN with K=10?
Ans.
With K=10, a common KNN-based anomaly score is the distance from the data point to its 10th nearest neighbor (some variants use the average
distance to all 10 nearest neighbors). The larger this distance, the sparser the neighborhood and the more likely the point is an anomaly.
In this case, only 2 neighbors lie within a radius of 0.5, so the 3rd through 10th nearest neighbors must all lie farther away than 0.5. The
distance to the 10th nearest neighbor is therefore greater than 0.5, but its exact value, and hence the exact anomaly score, cannot be determined
from the information given.
KNN itself does not depend on a radius: as long as the dataset contains at least 10 other points, the 10 nearest neighbors always exist, however
far away they are. Given the actual distances, the score would be computed as:
Anomaly Score = distance to the k-th nearest neighbor (or the average distance to the k nearest neighbors)
With only 2 of its 10 nearest neighbors inside the 0.5 radius, the point lies in a sparse region and would receive a relatively high anomaly score.
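
The sketch below illustrates one common convention for a KNN anomaly score, namely the distance to the k-th nearest neighbor, using scikit-learn's NearestNeighbors. The data, the variable names, and the choice of this particular score are assumptions for illustration; the question itself does not supply actual distances.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical data: a tight 2-D cluster of 100 points plus one distant point
rng = np.random.default_rng(1)
X_pts = np.vstack([rng.normal(scale=0.3, size=(100, 2)), [[5.0, 5.0]]])

k = 10
nn = NearestNeighbors(n_neighbors=k).fit(X_pts)

# Calling kneighbors() without arguments queries the training points
# themselves and excludes each point from its own neighbor list
distances, _ = nn.kneighbors()

# Score: distance to the k-th nearest neighbor (larger means more anomalous)
kth_dist = distances[:, -1]

# For the last point, count how many other points lie within radius 0.5
idx = len(X_pts) - 1
dist_to_others = np.delete(np.linalg.norm(X_pts - X_pts[idx], axis=1), idx)
n_within = int((dist_to_others <= 0.5).sum())

print(f"Neighbors within 0.5: {n_within}, distance to 10th neighbor: {kth_dist[idx]:.3f}")
```

Whenever fewer than k points fall inside a given radius, the distance to the k-th nearest neighbor is necessarily larger than that radius, which is the reasoning used in the answer above.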

In [None]:
# Q9. Using the Isolation Forest algorithm with 100 trees and a dataset of 3000 data points, what is the
# anomaly score for a data point that has an average path length of 5.0 compared to the average path
# length of the trees?
Ans.
The Isolation Forest algorithm generates a forest of decision trees, where each data point is isolated in a different partition of the feature space.
The anomaly score of a data point is computed based on the average path length of the data point in the trees of the forest.
If a data point has an average path length of 5.0 compared to the average path length of the trees, we can compute its anomaly score using the 
following formula:
Anomaly Score = 2^(-average path length / c(n))

where c(n) is a constant that depends on the number of data points n in the dataset. The value of c(n) can be computed as:
c(n) = 2 * H(n-1) - (2 * (n-1) / n)
where H(n-1) is the harmonic number of n-1, which can be approximated by ln(n-1) + 0.5772 (Euler's constant).

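For a dataset of 3000 data points, using H(2999) ≈ ln(2999) + 0.5772 ≈ 8.58, c(n) can be computed as:
c(3000) = 2 * H(2999) - (2 * 2999 / 3000) ≈ 17.17 - 2.00 ≈ 15.17

Using this value of c(n), we can compute the anomaly score of the data point with an average path length of 5.0 as:
Anomaly Score = 2^(-5.0 / 15.17) ≈ 0.80
A score close to 1 indicates an anomaly, a score well below 0.5 indicates a normal point, and scores around 0.5 indicate no clear anomaly. Here the
average path length of 5.0 is much shorter than the expected 15.17, and the score of about 0.80 is well above 0.5, so the data point is isolated
unusually quickly and is likely an anomaly.
This calculation takes n = 3000 as given in the question; when each tree is built on a subsample, c is evaluated at the subsample size instead.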
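
The formula can be evaluated numerically with a few lines of Python. This is a minimal sketch that assumes the approximation H(i) ≈ ln(i) + 0.5772 for the harmonic number; the function names are chosen for the example, and the special cases for n ≤ 2 follow the usual convention rather than anything stated in the question.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Average path length of an unsuccessful binary search tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """Isolation Forest anomaly score: 2^(-E[h(x)] / c(n))."""
    return 2.0 ** (-avg_path_length / c(n))

c_3000 = c(3000)
score = anomaly_score(5.0, 3000)
print(f"c(3000) = {c_3000:.4f}, anomaly score = {score:.4f}")
```

Shorter average path lengths push the score toward 1, and a path length equal to c(n) gives a score of exactly 0.5.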

In [25]:
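from sklearn.ensemble import IsolationForest
import numpy as np

# Synthetic data: 3000 points with 10 standard-normal features (unseeded, so values vary per run)
X = np.random.randn(3000, 10)

# Fit an Isolation Forest with 100 trees
clf = IsolationForest(n_estimators=100, contamination='auto', random_state=42)
clf.fit(X)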

In [26]:
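# score_samples returns the negated anomaly score from the original paper:
# the lower (more negative) the value, the more abnormal the sample
anomaly_scores = clf.score_samples(X)
anomaly_scores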

array([-0.43270147, -0.44104147, -0.44888877, ..., -0.38783072,
       -0.42319242, -0.45325706])

In [27]:
mean_anomaly_score = np.mean(anomaly_scores)
# Print the mean anomaly score
print(f"\nThe mean anomaly score is {mean_anomaly_score:.4f}")


The mean anomaly score is -0.4428
