
### **Q1. What is anomaly detection and what is its purpose?**
#####[Ans]
- **Anomaly Detection**: The process of identifying unusual patterns or data points that differ significantly from the majority of the data.
- **Purpose**:
  - Detect fraudulent activities in financial transactions.
  - Identify faults in industrial systems.
  - Discover unusual behaviors in cybersecurity.

---

### **Q2. What are the key challenges in anomaly detection?**
#####[Ans]
1. **Imbalanced Data**: Anomalies are rare compared to normal data.
2. **Dynamic Data**: Data distributions may change over time.
3. **High Dimensionality**: Complex datasets with many features make detecting anomalies harder.
4. **Lack of Labeled Data**: Often, there is no labeled dataset for anomalies.
5. **Defining Anomalies**: The definition of anomalies can vary depending on the context.
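
The imbalance problem in point 1 can be illustrated with a small sketch. The labels below are hypothetical (990 normal points, 10 anomalies), and the "detector" is a deliberately useless one that never flags anything; it still scores 99% accuracy, which is why accuracy alone is a misleading metric here.

```python
# Hypothetical labels: 1 = anomaly, 0 = normal (990 normal, 10 anomalous).
y_true = [0] * 990 + [1] * 10

# A deliberately useless "detector" that never flags anything.
y_pred = [0] * len(y_true)

# Accuracy: fraction of points whose prediction matches the label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: fraction of true anomalies that were actually caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

Metrics that focus on the rare class, such as recall and precision, expose the failure that accuracy hides.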

---

### **Q3. How does unsupervised anomaly detection differ from supervised anomaly detection?**
#####[Ans]
- **Unsupervised Anomaly Detection**:
  - No labeled data required.
  - Identifies patterns that deviate significantly from the majority.
  - Example: Isolation Forest.

- **Supervised Anomaly Detection**:
  - Requires labeled data (normal vs. anomalous).
  - Learns a model based on training data.
  - Example: Classification models.
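
To make the contrast concrete, here is a minimal sketch on hypothetical one-dimensional readings. The unsupervised half uses a median/MAD rule with an assumed cutoff of 5 as a simple stand-in for a method such as Isolation Forest; the supervised half learns a single threshold from assumed labels as a stand-in for a classifier. Neither half is meant as a production detector.

```python
import statistics

# Hypothetical 1-D readings; the last two are unusually large.
values = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 15.0, 16.2]

# --- Unsupervised: no labels, flag points far from the bulk of the data ---
median = statistics.median(values)
mad = statistics.median([abs(v - median) for v in values])  # median absolute deviation
CUTOFF = 5  # assumed cutoff, in units of MAD
unsupervised_flags = [abs(v - median) / mad > CUTOFF for v in values]

# --- Supervised: labels are available, so a decision rule can be learned ---
labels = [0] * 8 + [1] * 2  # assumed labels: 1 = anomaly, 0 = normal
max_normal = max(v for v, y in zip(values, labels) if y == 0)
min_anomaly = min(v for v, y in zip(values, labels) if y == 1)
threshold = (max_normal + min_anomaly) / 2  # midpoint between the two classes


def predict(value):
    """Classify a new reading with the learned threshold (1 = anomaly)."""
    return int(value > threshold)


print(unsupervised_flags)
print(threshold, predict(13.5), predict(10.6))
```

The unsupervised rule needs only the data, while the supervised rule cannot be built at all without the `labels` list.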

---

### **Q4. What are the main categories of anomaly detection algorithms?**
#####[Ans]
1. **Statistical Methods**:
   - Based on probabilistic models.
   - Examples: Z-score, Gaussian Mixture Models.

2. **Distance-Based Methods**:
   - Measure distance between points to identify outliers.
   - Example: K-Nearest Neighbors (KNN).

3. **Density-Based Methods**:
   - Identify regions of low data density as anomalies.
   - Example: Local Outlier Factor (LOF).

4. **Machine Learning-Based Methods**:
   - Supervised and unsupervised algorithms.
   - Examples: Isolation Forest, Autoencoders.
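
As a small illustration of the statistical category, the sketch below applies a Z-score rule to hypothetical readings. The data and the cutoff of 3 standard deviations are assumptions chosen for the example.

```python
import statistics

# Hypothetical sensor readings with one suspicious value at the end.
values = [50, 52, 49, 51, 48, 50, 53, 47, 51, 49, 50, 52, 95]

mean = statistics.mean(values)
std = statistics.stdev(values)  # sample standard deviation

# Z-score: how many standard deviations each value lies from the mean.
z_scores = [(v - mean) / std for v in values]

CUTOFF = 3  # assumed cutoff
outliers = [v for v, z in zip(values, z_scores) if abs(z) > CUTOFF]

print(outliers)  # [95]
```

Note that the outlier itself inflates the mean and standard deviation, so on very small samples a Z-score rule can miss anomalies that a robust statistic would catch.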

---

### **Q5. What are the main assumptions made by distance-based anomaly detection methods?**
#####[Ans]
1. Normal data points are closer to each other.
2. Anomalous data points are farther from the majority of the data.
3. Distance metrics (e.g., Euclidean distance) are effective in capturing data relationships.
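
These assumptions can be sketched with a simple distance-based score: the mean Euclidean distance from each point to its k nearest neighbours. The function name `knn_distance_scores`, the sample points, and `k=3` are all illustrative choices, not part of any particular library.

```python
import math


def knn_distance_scores(points, k):
    """Score each point by its mean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(distances[:k]) / k)
    return scores


# Hypothetical data: a tight cluster near the origin plus one distant point.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.0, 0.3), (0.3, 0.0), (5.0, 5.0)]
scores = knn_distance_scores(points, k=3)

# The point with the largest score is the strongest outlier candidate.
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(points[most_anomalous])
```

If the third assumption fails, for example when features are on very different scales, the distances and therefore the scores become unreliable.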

---

### **Q6. How does the LOF algorithm compute anomaly scores?**
#####[Ans]
- **Local Outlier Factor (LOF)**:
  1. Computes the **local density** of each data point based on its neighbors.
  2. Compares the density of a point to the densities of its neighbors.
  3. **Anomaly Score**:
     - A LOF score close to 1 means the point's density is similar to that of its neighbors.
     - A LOF score well above 1 means the point lies in a lower-density region than its neighbors, suggesting it is an anomaly.
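
The steps above can be sketched in plain Python. This is a simplified version that assumes there are no duplicate points and ignores ties in the k-distance; the function name `lof_scores` and the sample data are illustrative.

```python
import math


def lof_scores(points, k):
    """Return the Local Outlier Factor of every point (simplified sketch)."""
    n = len(points)
    neighbours = []  # indices of each point's k nearest neighbours
    k_distance = []  # distance from each point to its k-th nearest neighbour
    for i in range(n):
        ranked = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbours.append([j for _, j in ranked[:k]])
        k_distance.append(ranked[k - 1][0])

    def reach_dist(i, j):
        # Reachability distance of point i with respect to neighbour j.
        return max(k_distance[j], math.dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, j) for j in neighbours[i]) for i in range(n)]

    # LOF: mean ratio of the neighbours' density to the point's own density.
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]


# Hypothetical data: a tight cluster near the origin plus one distant point.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.0, 0.3), (0.3, 0.0), (5.0, 5.0)]
scores = lof_scores(points, k=3)
for p, s in zip(points, scores):
    print(p, round(s, 2))
```

Points inside the cluster get scores near 1, while the isolated point gets a score far above 1 because its neighbours are much denser than it is.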

---

### **Q7. What are the key parameters of the Isolation Forest algorithm?**
#####[Ans]
1. **Number of Trees (`n_estimators`)**:
   - Determines the number of trees in the forest.
   - More trees give more stable anomaly scores, with diminishing returns, at a higher computational cost.

2. **Subsample Size (`max_samples`)**:
   - Number of samples used to train each tree.
   - Small subsamples often isolate anomalies more effectively because dense normal regions are less likely to mask them.

3. **Contamination (`contamination`)**:
   - Proportion of the dataset expected to be anomalies.
   - Helps in determining the threshold for classifying points as anomalies.

4. **Maximum Features (`max_features`)**:
   - Number of features drawn to train each tree.
   - Helps control model complexity.
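
To show how these parameters interact, here is a minimal pure-Python sketch of an isolation forest. It is a teaching toy, not scikit-learn's implementation: the function names are hypothetical, `max_features` is omitted (every tree sees all features), and `contamination` is applied simply by flagging the top-scoring fraction of points.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c(n):
    """Average path length of an unsuccessful binary search tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2 * (math.log(n - 1) + EULER_GAMMA) - 2 * (n - 1) / n


def build_tree(sample, depth, max_depth, rng):
    """Recursively split on a random feature at a random value."""
    if depth >= max_depth or len(sample) <= 1:
        return {"size": len(sample)}
    feature = rng.randrange(len(sample[0]))
    lo = min(p[feature] for p in sample)
    hi = max(p[feature] for p in sample)
    if lo == hi:
        return {"size": len(sample)}
    split = rng.uniform(lo, hi)
    left = [p for p in sample if p[feature] < split]
    right = [p for p in sample if p[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(point, node, depth=0):
    """Depth at which the point lands, plus c(size) for unsplit leaves."""
    if "size" in node:
        return depth + c(node["size"])
    child = "left" if point[node["feature"]] < node["split"] else "right"
    return path_length(point, node[child], depth + 1)


def isolation_forest_scores(data, n_estimators=100, max_samples=32, seed=0):
    """Anomaly score in (0, 1] for every point; assumes max_samples >= 2."""
    rng = random.Random(seed)
    max_samples = min(max_samples, len(data))
    max_depth = math.ceil(math.log2(max_samples))
    trees = [
        build_tree(rng.sample(data, max_samples), 0, max_depth, rng)
        for _ in range(n_estimators)
    ]
    scores = []
    for p in data:
        mean_path = sum(path_length(p, t) for t in trees) / n_estimators
        scores.append(2 ** (-mean_path / c(max_samples)))
    return scores


# Hypothetical data: an 8x8 grid of normal points plus one distant point.
data = [(i / 7, j / 7) for i in range(8) for j in range(8)] + [(10.0, 10.0)]
scores = isolation_forest_scores(data, n_estimators=100, max_samples=32)

# contamination: expected fraction of anomalies, used to pick how many to flag.
contamination = 1 / len(data)
n_flagged = max(1, round(contamination * len(data)))
flagged = sorted(range(len(data)), key=lambda i: scores[i], reverse=True)[:n_flagged]
print([data[i] for i in flagged])
```

In this sketch `n_estimators` controls how many path lengths are averaged, `max_samples` sets both the tree size and the normalizing constant `c(max_samples)`, and `contamination` only moves the decision threshold without changing the scores.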

---

### **Q8. If a data point has only 2 neighbours of the same class within a radius of 0.5, what is its anomaly score using KNN with K=10?**
#####[Ans]

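KNN has no single standard anomaly score. One simple choice is the fraction of the K expected neighbours that are missing within the radius:

```python
K = 10          # number of neighbours considered
radius = 0.5    # search radius from the question (not used in the formula itself)
neighbors = 2   # neighbours actually found within the radius

# Fraction of the K expected neighbours that are missing.
anomaly_score = 1 - (neighbors / K)

print(f"ANOMALY SCORE : {anomaly_score:.2f}")
```

```
ANOMALY SCORE : 0.80
```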


---

### **Q9. Using the Isolation Forest algorithm with 100 trees and a dataset of 3000 data points, what is the anomaly score for a data point that has an average path length of 5.0 compared to the average path length of the trees?**
#####[Ans]

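```python
import math

n = 3000      # number of data points
h_x = 5.0     # average path length of the point across the trees

euler_gamma = 0.5772  # Euler-Mascheroni constant (approximate)

# c(n): average path length of an unsuccessful binary search tree lookup,
# used to normalise h(x). H(n-1) is approximated by ln(n-1) + gamma.
c_n = 2 * (math.log(n - 1) + euler_gamma) - (2 * (n - 1) / n)

# Scores close to 1 indicate anomalies; scores well below 0.5 indicate normal points.
anomaly_score = 2 ** (-h_x / c_n)

print(f"c(n) for n={n}: {c_n:.4f}")
print(f"Anomaly Score: {anomaly_score:.4f}")
```

```
c(n) for n=3000: 15.1671
Anomaly Score: 0.7957
```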
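
For reference, the calculation above follows the standard Isolation Forest scoring rule, written out here with the question's numbers substituted. It assumes, as the question does, that $n$ is the full 3000 points; implementations that build each tree on a subsample would use the subsample size instead. The number of trees (100) only affects how the average path length $E[h(x)]$ is estimated, so it does not appear in the formula.

$$
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}, \qquad
c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad
H(i) \approx \ln(i) + 0.5772
$$

$$
c(3000) \approx 2(\ln 2999 + 0.5772) - \frac{2 \cdot 2999}{3000} \approx 15.1671, \qquad
s = 2^{-5.0 / 15.1671} \approx 0.7957
$$

A score this far above 0.5 means the point is isolated much sooner than an average point, so it is a likely anomaly.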
