In [None]:
Q1. What is the role of feature selection in anomaly detection?

Q2. What are some common evaluation metrics for anomaly detection algorithms and how are they
computed?

Q3. What is DBSCAN and how does it work for clustering?
                                       
Q4. How does the epsilon parameter affect the performance of DBSCAN in detecting anomalies?

Q5. What are the differences between the core, border, and noise points in DBSCAN, and how do they relate
to anomaly detection?

Q6. How does DBSCAN detect anomalies and what are the key parameters involved in the process?

Q7. What is the make_circles function in scikit-learn used for?
                                                         
Q8. What are local outliers and global outliers, and how do they differ from each other?

Q9. How can local outliers be detected using the Local Outlier Factor (LOF) algorithm?

Q10. How can global outliers be detected using the Isolation Forest algorithm?

Q11. What are some real-world applications where local outlier detection is more appropriate than global
outlier detection, and vice versa?

## Solutions

In [None]:
#Sol1...

Feature selection removes irrelevant or redundant features, which can mislead the detection algorithm, mask genuine anomalies, and increase
computational cost. Restricting the model to the most relevant features makes detection both more efficient and more accurate.

For example, in network intrusion detection, only a few network features (such as IP address, port, or protocol) might indicate anomalies,
while the others add noise.
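
As a minimal sketch of the idea, the snippet below drops a feature that carries no information before any detector is trained. It uses scikit-learn's `VarianceThreshold`, which is only one very simple filter; the synthetic data and the choice of filter are assumptions for illustration, not a recommended pipeline.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)

# Hypothetical data: two varying features and one constant (uninformative) feature.
X = np.column_stack([
    rng.normal(size=100),
    rng.normal(size=100),
    np.full(100, 3.0),  # constant column, zero variance
])

# Remove features whose variance is zero.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)
kept = selector.get_support()

print("Kept features:", kept)
print("Shape before/after:", X.shape, X_reduced.shape)
```

In practice, domain knowledge (as in the intrusion detection example) usually matters more than a purely statistical filter like this one.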

In [None]:
#Sol2...

Common evaluation metrics for anomaly detection include:
- **Precision**: The proportion of true anomalies among all predicted anomalies:
    $\text{Precision} = \dfrac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$

- **Recall (Sensitivity)**: The proportion of actual anomalies that were detected:
    $\text{Recall} = \dfrac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$

- **F1 Score**: The harmonic mean of precision and recall, useful as a single score when both matter and the classes are imbalanced:
    $\text{F1 Score} = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

- **ROC AUC (Area Under the Receiver Operating Characteristic Curve)**: Summarizes the trade-off between true positive rate and false
    positive rate across all decision thresholds, useful when the detector outputs a continuous anomaly score.
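
The metrics above can be computed with `sklearn.metrics`. The labels, scores, and the 0.5 threshold below are made-up values for illustration only.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Hypothetical ground truth: 1 = anomaly, 0 = normal.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
# Hypothetical anomaly scores from some detector (higher = more anomalous).
scores = np.array([0.1, 0.2, 0.15, 0.7, 0.3, 0.05, 0.9, 0.8, 0.4, 0.25])

# Turn scores into hard predictions with an assumed threshold of 0.5.
y_pred = (scores >= 0.5).astype(int)

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 3
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two = 2 / 3
auc = roc_auc_score(y_true, scores)          # threshold-free, uses raw scores

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}")
```

Note that precision, recall, and F1 depend on the chosen threshold, while ROC AUC is computed from the raw scores.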

In [None]:
#Sol3...

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. It groups points that 
are closely packed together (those with many neighbors within a defined distance) and marks points in low-density regions as noise.

It does not require the number of clusters as an input, unlike k-means, and can discover clusters of arbitrary shape. 
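
A short sketch of DBSCAN on a non-convex dataset, using scikit-learn. The dataset (`make_moons`) and the values of `eps` and `min_samples` are assumptions chosen for illustration; suitable values depend on the data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: clusters of arbitrary (non-spherical) shape.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# No number of clusters is passed; only the density parameters.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_  # cluster index per point, -1 means noise

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print("clusters found:", n_clusters)
print("noise points:", n_noise)
```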

In [None]:
#Sol4...
        
The epsilon (ε) parameter defines the neighborhood radius around each data point. A small ε produces small neighborhoods, so fewer
points reach the density threshold and more points are labeled as noise (anomalies). A large ε produces larger, merged clusters
and reduces the number of detected anomalies.
Choosing ε is therefore critical: if it is too small, even normal points may be labeled as noise; if it is too large,
actual anomalies may be absorbed into clusters.
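
The effect can be observed by sweeping ε while holding MinPts fixed. This is a sketch on synthetic blobs; the ε values and `min_samples=5` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

eps_values = (0.1, 0.3, 0.6, 1.5)
noise_counts = []
for eps in eps_values:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_noise = int(np.sum(labels == -1))
    noise_counts.append(n_noise)
    print(f"eps={eps}: {n_noise} points labeled as noise")
```

With MinPts fixed, the number of noise points can only stay the same or shrink as ε grows, because every neighborhood at a larger radius contains the neighborhood at a smaller one.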


In [None]:
#Sol5...

- **Core points**: Points with at least MinPts points within a radius of ε (the count conventionally includes the point itself).
        They form the dense interior, or "core," of a cluster.
- **Border points**: Points that have fewer than MinPts points within ε but lie within the ε radius of a core point.
        They are on the edge of a cluster.
- **Noise points**: Points that are neither core nor border points. They do not belong to any cluster and are treated as anomalies or outliers.
        In anomaly detection, these noise points are the anomalies DBSCAN reports.
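
The three point types can be recovered from a fitted scikit-learn `DBSCAN` model: `core_sample_indices_` lists the core points, and `labels_ == -1` marks noise. Border points are whatever remains. The data and parameter values below are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.7, random_state=1)
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Core points: reported directly by the fitted model.
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

# Noise points: label -1.
noise_mask = db.labels_ == -1

# Border points: assigned to a cluster, but not core themselves.
border_mask = ~core_mask & ~noise_mask

print("core:", core_mask.sum(), "border:", border_mask.sum(), "noise:", noise_mask.sum())
```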


In [None]:
#Sol6...

        
DBSCAN detects anomalies as noise points: points that have fewer than MinPts neighbors within distance ε and are also not within ε
of any core point, so they are not assigned to any cluster. The key parameters are:

- **Epsilon (ε)**: Controls the radius of the neighborhood around a point.
- **MinPts**: The minimum number of points required within ε to form a dense region. Points that don't meet this threshold and are
        not reachable from any cluster are treated as anomalies.
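
A minimal sketch of using DBSCAN as an anomaly detector: fit the model, then read off the points labeled `-1`. The three injected points and the parameter values are hypothetical and chosen so the anomalies are unambiguous.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Hypothetical "normal" data: one dense blob around the origin.
normal = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
# Hypothetical anomalies: isolated points far from the blob and from each other.
injected = np.array([[10.0, 10.0], [-10.0, 10.0], [10.0, -10.0]])
X = np.vstack([normal, injected])

# eps is the neighborhood radius, min_samples plays the role of MinPts.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

anomaly_idx = np.flatnonzero(labels == -1)
print("indices flagged as anomalies:", anomaly_idx)
```

The injected points occupy the last three rows of `X`, so their indices (200, 201, 202) appear among the flagged ones; sparse points on the fringe of the blob may be flagged as well.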

In [None]:
#Sol7...

The `make_circles` function in scikit-learn (`sklearn.datasets.make_circles`) generates a synthetic two-dimensional dataset consisting
of a large circle containing a smaller one. It is useful for testing clustering and classification algorithms on data that is not
linearly separable.

It can also be used to set up anomaly detection scenarios, where points lying off the circles might be considered outliers.
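
A short usage sketch. The parameter values are arbitrary; `factor` sets the radius of the inner circle relative to the outer one, and no noise is added here so the geometry is easy to check.

```python
import numpy as np
from sklearn.datasets import make_circles

# 200 points split between an outer circle (label 0) and an inner circle (label 1).
X, y = make_circles(n_samples=200, factor=0.5, random_state=0)

# Distance of every point from the origin.
radii = np.linalg.norm(X, axis=1)

print("X shape:", X.shape)
print("outer radius:", radii[y == 0].mean())
print("inner radius:", radii[y == 1].mean())
```

Passing a `noise` value adds Gaussian jitter to the points, which is usually what you want when testing a detector.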


In [None]:
#Sol8...

- **Local outliers**: Anomalous data points that deviate significantly from their local neighbors but may not appear anomalous
    in a global context. For instance, in a densely populated neighborhood, a house priced much higher than its neighboring
    homes can be a local outlier, even if it's within a typical price range for the entire city.

- **Global outliers**: Points that are anomalous compared to the entire dataset.
    For example, in a housing dataset, a house priced 10 times higher than all other houses across the city would be a global outlier.
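
The housing example can be sketched numerically. The prices below (in thousands) are invented, and the scoring rule (absolute deviation from the median, in units of the median absolute deviation, with a cutoff of 3.5) is just one simple assumption for illustration, not a standard the text prescribes.

```python
import numpy as np

def mad_score(values, reference):
    """Deviation from the reference median, in units of the reference MAD."""
    med = np.median(reference)
    mad = np.median(np.abs(reference - med))
    return np.abs(values - med) / mad

# Hypothetical prices in thousands.
neighborhood_a = np.array([95.0, 98.0, 100.0, 102.0, 105.0, 110.0, 300.0])
neighborhood_b = np.array([480.0, 490.0, 495.0, 500.0, 505.0, 510.0, 515.0])
city = np.concatenate([neighborhood_a, neighborhood_b, [5000.0]])

threshold = 3.5  # assumed cutoff

# Global view: compare every house with the whole city.
global_scores = mad_score(city, city)
global_flags = global_scores > threshold

# Local view: compare houses in neighborhood A only with each other.
local_scores = mad_score(neighborhood_a, neighborhood_a)
local_flags = local_scores > threshold

print("global outliers:", city[global_flags])
print("local outliers in A:", neighborhood_a[local_flags])
```

The house priced at 300 is unremarkable city-wide but stands out within its own neighborhood, while the house priced at 5000 stands out against the entire dataset.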

In [None]:
#Sol9...


                                     
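LOF detects local outliers by comparing the local density of a point with the local densities of its k nearest neighbors. If a point's
density is significantly lower than that of its neighbors, it is flagged as a local outlier.

The algorithm calculates the local reachability density (LRD) for each point; the LOF score is the average ratio of the neighbors' LRD
to the point's own LRD. A score close to 1 means the point is about as dense as its neighbors, while a score well above 1
(a low LRD relative to the neighborhood) marks a local outlier.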
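
A sketch with scikit-learn's `LocalOutlierFactor`. The data is synthetic: a tight cluster, a loose cluster, and one hypothetical candidate point placed just outside the tight cluster. Its distance to its neighbors would be ordinary inside the loose cluster, but it is large relative to the tight cluster next to it. `n_neighbors=20` is an assumed setting.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

dense = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(100, 2))    # tight cluster
sparse = rng.normal(loc=[10.0, 10.0], scale=2.0, size=(100, 2))  # loose cluster
candidate = np.array([[1.0, 1.0]])                               # near the tight cluster
X = np.vstack([dense, sparse, candidate])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)  # -1 = outlier, 1 = inlier

# scikit-learn stores the negated LOF; flip the sign to get the usual score.
lof_scores = -lof.negative_outlier_factor_

print("candidate label:", labels[-1])
print("candidate LOF score:", lof_scores[-1])
print("median LOF score:", np.median(lof_scores))
```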

In [None]:
#Sol10...

The Isolation Forest algorithm isolates data points by recursively partitioning the data with random splits, organized as an ensemble
of trees. Anomalies (global outliers) are easier to isolate: because they are few and lie far from the rest of the data, they require
fewer splits.

At each step the algorithm randomly selects a feature and a split value to divide the data. Points with a short average path length
across the trees (isolated after few splits) are considered anomalies.
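
A sketch with scikit-learn's `IsolationForest`. The data is synthetic, with one hypothetical point placed far from everything else; `n_estimators=100` and the default contamination setting are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))  # bulk of the data
far_point = np.array([[10.0, 10.0]])                    # hypothetical global outlier
X = np.vstack([normal, far_point])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)

labels = forest.predict(X)        # -1 = anomaly, 1 = normal
scores = forest.score_samples(X)  # lower = easier to isolate = more anomalous

print("far point label:", labels[-1])
print("far point score:", scores[-1])
print("median score:", np.median(scores))
```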

In [None]:
#Sol11...


- **Local outlier detection**: More appropriate in fraud detection (e.g., credit card fraud), where unusual activity within a
    specific user's transaction history (localized behavior) may signal fraud, even if it's not globally unusual.

- **Global outlier detection**: More suitable for detecting faults in sensor networks, where a malfunctioning sensor produces readings that
    deviate significantly from those of all other sensors across the entire system, making it a global outlier.
