In [None]:
Q1. What is the role of feature selection in anomaly detection?

Feature selection plays a crucial role in anomaly detection by identifying the most relevant features that capture the underlying structure of normal and anomalous instances.
It helps in reducing the dimensionality of the data, improving computational efficiency, and enhancing the performance of anomaly detection algorithms.
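
As a minimal sketch of the dimensionality-reduction point, the snippet below drops an uninformative feature before any detector is fitted. `VarianceThreshold` is only one simple unsupervised filter, chosen here as an assumption for illustration, and the toy matrix `X_raw` is hypothetical.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: the middle column is constant and carries no signal
X_raw = np.array([
    [1.0, 5.0, 0.2],
    [1.2, 5.0, 0.1],
    [0.9, 5.0, 0.3],
    [9.0, 5.0, 4.0],  # unusual row
])

# Drop features whose variance is zero
selector = VarianceThreshold(threshold=0.0)
X_selected = selector.fit_transform(X_raw)

print(selector.get_support())  # boolean mask of the features that were kept
print(X_selected.shape)
```

The reduced matrix can then be passed to any detector; whether a variance filter is the right choice depends on the data.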

In [None]:
Q2. What are some common evaluation metrics for anomaly detection algorithms and how are they computed?

Common evaluation metrics for anomaly detection include:
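Precision, Recall, F1-score: Computed from true positives (TP), false positives (FP), and false negatives (FN), with the anomaly class treated as positive: precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = 2 * precision * recall / (precision + recall).
Area Under the ROC Curve (AUC-ROC): The area under the curve of true positive rate against false positive rate over all score thresholds. It equals the probability that a randomly chosen anomaly is scored higher than a randomly chosen normal instance.
Area Under the Precision-Recall Curve (AUC-PR): Summarizes precision against recall over all score thresholds. It is usually more informative than AUC-ROC when anomalies are rare.
Mean Squared Error (MSE): The average squared difference between predicted and actual values. It applies mainly to reconstruction- or forecast-based detectors, where a large error signals an anomaly.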

In [29]:
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, average_precision_score, mean_squared_error
import numpy as np

# Dummy true labels and predictions
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 1, 1])

# Example usage
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc_roc = roc_auc_score(y_true, y_pred)  # Note: hard 0/1 labels are passed here; continuous scores (y_score) are preferred when available
auc_pr = average_precision_score(y_true, y_pred)  # Note: same caveat; with hard labels the curve has a single threshold
mse = mean_squared_error(y_true, y_pred)

print('precision','-',precision)
print('recall','-',recall)
print('f1','-',f1)
print('auc_roc','-',auc_roc)
print('mse','-',mse)

precision - 0.6666666666666666
recall - 0.6666666666666666
f1 - 0.6666666666666666
auc_roc - 0.5833333333333333
mse - 0.4
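
The ranking metrics are normally computed from continuous anomaly scores rather than hard labels. The sketch below reuses the same `y_true` with a hypothetical `y_score` array (higher means more anomalous); the score values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = np.array([0, 1, 1, 0, 1])
# Hypothetical anomaly scores, e.g. from a detector's scoring function
y_score = np.array([0.1, 0.9, 0.4, 0.6, 0.8])

auc_roc_scores = roc_auc_score(y_true, y_score)
auc_pr_scores = average_precision_score(y_true, y_score)

print('auc_roc', '-', auc_roc_scores)
print('auc_pr', '-', auc_pr_scores)
```

Here five of the six anomaly/normal pairs are ranked correctly, so AUC-ROC is 5/6.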


In [None]:
Q3. What is DBSCAN and how does it work for clustering?

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that groups together closely packed points based on a density threshold.
It forms clusters by identifying regions of high density separated by regions of low density, and it can discover clusters of arbitrary shapes.
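
A minimal sketch of the idea using scikit-learn's `DBSCAN`, assuming a hypothetical toy dataset of two tight groups separated by empty space. The `eps` and `min_samples` values are picked by hand for this data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two dense groups far apart from each other
group_a = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [0.05, 0.05]])
group_b = group_a + 5.0
X_toy = np.vstack([group_a, group_b])

db = DBSCAN(eps=0.3, min_samples=3).fit(X_toy)
toy_labels = db.labels_

# The label -1 marks noise, so it is excluded from the cluster count
n_clusters = len(set(toy_labels)) - (1 if -1 in toy_labels else 0)

print(toy_labels)
print(n_clusters)
```

The number of clusters is not specified in advance; it follows from the density of the data.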

In [None]:
Q4. How does the epsilon parameter affect the performance of DBSCAN in detecting anomalies?

The epsilon parameter in DBSCAN sets the radius within which points count as neighbors. Smaller values of epsilon produce tighter clusters and label more points as noise (anomalies),
while larger values can merge clusters and reduce the number of anomalies detected.
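
The effect can be seen by sweeping epsilon on a small example. This is a sketch on hypothetical 1-D data (a tight group, a looser group, and one isolated point); the three `eps` values are chosen by hand to show each regime.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 1-D data: spacing 0.1, then spacing 0.5, then one isolated point
x = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 5.5, 6.0, 6.5, 7.0, 20.0])
X_line = x.reshape(-1, 1)

noise_counts = {}
for eps in (0.15, 0.6, 25.0):
    sweep_labels = DBSCAN(eps=eps, min_samples=3).fit_predict(X_line)
    noise_counts[eps] = int(np.sum(sweep_labels == -1))  # -1 marks noise

print(noise_counts)
```

With the smallest radius the looser group is treated entirely as noise, and with the largest radius even the isolated point is absorbed into a single cluster.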

In [None]:
Q5. What are the differences between the core, border, and noise points in DBSCAN, and how do they relate to anomaly detection?

Core points: Points that have at least min_samples points (including the point itself) within a distance of epsilon. These points lie in the interior of a cluster.
Border points: Points that lie within epsilon of a core point but have fewer than min_samples points in their own neighborhood. They sit on the edge of a cluster.
Noise points: Points that are neither core nor border points. They are considered outliers or anomalies.
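
The three point types can be recovered from a fitted scikit-learn `DBSCAN` model through its `core_sample_indices_` and `labels_` attributes. The sketch below assumes hypothetical 1-D data and hand-picked parameters.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 1-D data: five evenly spaced points plus one far away
X_pts = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 5.0]).reshape(-1, 1)

db_pts = DBSCAN(eps=0.15, min_samples=3).fit(X_pts)

core_mask = np.zeros(len(X_pts), dtype=bool)
core_mask[db_pts.core_sample_indices_] = True
noise_mask = db_pts.labels_ == -1
border_mask = ~core_mask & ~noise_mask  # assigned to a cluster, but not core

print('core  :', np.flatnonzero(core_mask))
print('border:', np.flatnonzero(border_mask))
print('noise :', np.flatnonzero(noise_mask))
```

The two end points of the evenly spaced run have only one neighbor each, so they are border points; the far point is noise.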

In [None]:
Q6. How does DBSCAN detect anomalies and what are the key parameters involved in the process?

DBSCAN can detect anomalies as noise points, i.e., points that do not belong to any cluster.
The key parameters involved are epsilon (the maximum distance between two samples for one to be considered as in the neighborhood of the other) and min_samples (the number of samples in a neighborhood for a point to be considered a core point).
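
To illustrate how min_samples interacts with noise detection, the sketch below fits DBSCAN twice on hypothetical data containing a group of five points and an isolated pair. All values are chosen by hand for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: one group of five points and a separate pair
group = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [0.05, 0.05]])
pair = np.array([[5.0, 5.0], [5.0, 5.1]])
X_small = np.vstack([group, pair])

anomalies_by_min_samples = {}
for min_samples in (2, 3):
    ms_labels = DBSCAN(eps=0.3, min_samples=min_samples).fit_predict(X_small)
    # Indices of points labeled as noise (-1)
    anomalies_by_min_samples[min_samples] = np.flatnonzero(ms_labels == -1).tolist()

print(anomalies_by_min_samples)
```

With min_samples=2 the pair forms its own small cluster; with min_samples=3 neither point is a core point, so both are reported as anomalies.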

In [None]:
Q7. What is the make_circles function in scikit-learn used for?

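The make_circles function in scikit-learn generates a synthetic 2D dataset consisting of a large circle containing a smaller one, along with a binary label for each point. It is often used for testing clustering and classification algorithms, including DBSCAN, on data that is not linearly separable.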

In [25]:
from sklearn.datasets import make_circles

# Generate data with two circles
X, _ = make_circles(n_samples=1000, noise=0.1, random_state=42)
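
A short sketch that keeps the labels and inspects the result. The `factor` argument (ratio of inner to outer circle size) is set to 0.5 here as an assumption to separate the rings more clearly; the call above relies on the default.

```python
import numpy as np
from sklearn.datasets import make_circles

# factor=0.5 is an illustrative choice, not the value used above
X_c, y_c = make_circles(n_samples=1000, noise=0.1, factor=0.5, random_state=42)

print(X_c.shape)         # one row per sample, two coordinates each
print(np.bincount(y_c))  # number of points on the outer (0) and inner (1) circle
```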


In [None]:
Q8. What are local outliers and global outliers, and how do they differ from each other?

Local outliers: Points that are anomalies within their local neighborhood but may appear normal in the global context. They have low local density compared to their neighbors.
Global outliers: Points that are anomalies when considering the entire dataset. They have low density compared to all points in the dataset.

In [None]:
Q9. How can local outliers be detected using the Local Outlier Factor (LOF) algorithm?

The LOF algorithm computes a score for each data point based on its local density compared to the local densities of its neighbors.
Points with significantly lower density compared to their neighbors are considered local outliers.
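
A minimal sketch with scikit-learn's `LocalOutlierFactor`, assuming hypothetical data: a dense grid, a sparse grid, and one extra point that sits close to the dense grid in absolute terms but far from it relative to the grid's spacing. The value `n_neighbors=5` is picked by hand.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical data: two 5x5 grids with different spacing, plus one extra point
ticks = np.arange(5)
gx, gy = np.meshgrid(ticks, ticks)
grid = np.column_stack([gx.ravel(), gy.ravel()]).astype(float)
dense = grid * 0.1           # spacing 0.1, rows 0..24
sparse = grid * 1.0 + 10.0   # spacing 1.0, rows 25..49
extra = np.array([[1.0, 1.0]])  # row 50
X_lof = np.vstack([dense, sparse, extra])

lof = LocalOutlierFactor(n_neighbors=5)
lof_pred = lof.fit_predict(X_lof)           # -1 = outlier, 1 = inlier
lof_scores = lof.negative_outlier_factor_   # more negative = more outlying

print(np.argmin(lof_scores), lof_scores.min())
```

The extra point is less than one unit from the dense grid, which is closer than neighboring points in the sparse grid are to each other, yet it gets the most extreme score because its density is compared only with that of its own neighbors.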

In [None]:
Q10. How can global outliers be detected using the Isolation Forest algorithm?

The Isolation Forest algorithm identifies global outliers by isolating them in the feature space through random partitioning.
It measures the number of random splits required to isolate a data point, averaged over many trees, and points that require fewer splits (shorter path lengths) are considered outliers.
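
A minimal sketch with scikit-learn's `IsolationForest`, assuming hypothetical data: 200 points drawn from a standard normal distribution plus one point placed far from them. The seed and parameter values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: a Gaussian blob plus one point far from it (row 200)
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
far_point = np.array([[10.0, 10.0]])
X_if = np.vstack([inliers, far_point])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X_if)
if_scores = forest.score_samples(X_if)  # lower = fewer splits to isolate = more anomalous

print(np.argmin(if_scores))  # index of the most anomalous point
```

`forest.predict(X_if)` returns -1 for points the model flags as outliers and 1 otherwise, using a threshold that depends on the `contamination` setting.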

In [None]:
Q11. What are some real-world applications where local outlier detection is more appropriate than global outlier detection, and vice versa?

Local outlier detection is more appropriate when normal behavior varies across regions of the data, so that a point can be anomalous relative to its neighbors without being extreme overall; examples include network intrusion detection and fraud detection within a particular region or segment.
Global outlier detection is suitable when anomalies deviate from the dataset as a whole, such as manufacturing quality control or flagging extreme values in financial transactions.