# Q1. Ans

A method for automatic feature selection in anomaly detection is proposed which determines optimal mixture coefficients for various sets of features. The method generalizes the support vector data description (SVDD) and can be expressed as a semi-infinite linear program that can be solved with standard techniques.

# Q2. Ans

There are several common evaluation metrics for anomaly detection algorithms. Here are a few examples:

True Positive Rate (TPR) or Recall: It measures the proportion of actual anomalies that are correctly identified by the algorithm. It is calculated as TP / (TP + FN), where TP is the number of true positives (correctly detected anomalies) and FN is the number of false negatives (missed anomalies).

False Positive Rate (FPR): It measures the proportion of non-anomalies that are incorrectly classified as anomalies. It is calculated as FP / (FP + TN), where FP is the number of false positives (non-anomalies classified as anomalies) and TN is the number of true negatives (correctly classified non-anomalies).

Precision: It measures the proportion of correctly identified anomalies out of all instances classified as anomalies. It is calculated as TP / (TP + FP).

F1-Score: It is the harmonic mean of precision and recall, providing a balanced measure of the model's performance. It is calculated as 2 * (Precision * Recall) / (Precision + Recall).

Area Under the Receiver Operating Characteristic curve (AUROC): The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification thresholds, and the AUROC is the area under that curve. It ranges from 0 to 1, with higher values indicating better performance; a value of 0.5 corresponds to random ranking.

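Average Precision (AP): It summarizes the precision-recall curve as the mean of the precision values at each threshold, weighted by the increase in recall from the previous threshold. It is particularly useful when dealing with imbalanced datasets, which are typical in anomaly detection.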
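As a rough illustration of the formulas above, the following sketch computes recall, FPR, precision, F1, and AUROC in plain Python. The labels, scores, and the 0.5 threshold are made-up toy values, and the helper names (`confusion_counts`, `threshold_metrics`, `auroc`) are hypothetical, not part of any library. The AUROC helper relies on the pairwise-ranking interpretation (the fraction of anomaly/normal pairs in which the anomaly receives the higher score), and the sketch assumes every denominator is non-zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN), treating label 1 as 'anomaly' and 0 as 'normal'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def threshold_metrics(y_true, y_pred):
    """Recall (TPR), FPR, precision and F1 from hard predictions.

    Toy sketch: assumes every denominator is non-zero.
    """
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "fpr": fpr, "precision": precision, "f1": f1}


def auroc(y_true, scores):
    """AUROC as the fraction of (anomaly, normal) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (assumed for illustration): 1 = anomaly, 0 = normal.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.15, 0.3, 0.05, 0.6, 0.9, 0.4, 0.25, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = threshold_metrics(y_true, y_pred)
auc = auroc(y_true, scores)
print(metrics, auc)
```

Note that the threshold-based metrics change with the chosen cutoff, whereas AUROC is computed from the raw scores and is independent of any single threshold.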

# Q3. Ans

DBSCAN is a density-based clustering algorithm that works on the assumption that clusters are dense regions in space separated by regions of lower density. It groups 'densely grouped' data points into a single cluster.

DBSCAN works by detecting areas where points are concentrated and where they are separated by areas that are empty or sparse. Points that are not part of a cluster are labeled as noise.

# Q4. Ans

The epsilon (ε) parameter in DBSCAN determines the maximum distance between two points for them to be considered neighbors. It directly affects the performance of DBSCAN in detecting anomalies by influencing the size and shape of the detected clusters.

The impact of the epsilon parameter on anomaly detection can be summarized as follows:

Smaller Epsilon: Choosing a smaller epsilon value will result in tighter clusters, because points need to be closer to each other to be considered neighbors. As a result, anomalies that are far away from other data points or lie in sparse regions are not included in any cluster and are more likely to be identified as outliers. If epsilon is too small, however, many normal points also fail to reach the required density and are mislabeled as noise.

Larger Epsilon: Increasing the epsilon value will lead to larger clusters as more points will be considered neighbors. This can result in anomalies being included within larger clusters, reducing their distinctiveness and making it more challenging to identify them as outliers.
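The effect of epsilon can be checked empirically. The sketch below is only an illustration under assumed settings: the two Gaussian blobs, the three hand-placed outliers, `min_samples=5`, and the three epsilon values are all arbitrary choices made for this example. It fits scikit-learn's `DBSCAN` for each epsilon and counts how many points receive the noise label `-1`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data (assumed for illustration): two tight blobs plus three far-away points.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(100, 2))
outliers = np.array([[2.5, 2.5], [-3.0, 4.0], [8.0, 0.0]])
X = np.vstack([cluster_a, cluster_b, outliers])

noise_counts = {}
for eps in (0.05, 0.5, 10.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    # DBSCAN marks noise points with the label -1.
    noise_counts[eps] = int(np.sum(labels == -1))
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}: clusters={n_clusters}, noise points={noise_counts[eps]}")
```

With a very small epsilon most points end up as noise, and with a very large epsilon the outliers are absorbed into a cluster, so a useful value lies somewhere in between and depends on the scale of the data.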

# Q5. Ans

In DBSCAN (Density-Based Spatial Clustering of Applications with Noise), data points are classified into three categories: core points, border points, and noise points. These categories play a role in anomaly detection as they provide insights into the density and structure of the data.

Core Points: Core points are data points that have at least the minimum number of neighboring points (specified by the min_samples parameter) within a distance of epsilon (ε). These points are at the heart of clusters and contribute to the cluster's density. Core points are often considered as normal or non-anomalous points since they are surrounded by a sufficient number of similar points.

Border Points: Border points are data points that have fewer neighboring points than the minimum required but are within the distance of epsilon (ε) from a core point. These points lie on the edge of clusters and are less dense than core points. Border points can be considered as ambiguous or less certain points in terms of their anomaly status. They can belong to a cluster but have a lower level of confidence compared to core points.

Noise Points: Noise points, also known as outliers, are data points that do not meet the criteria for core or border points. They do not have enough neighboring points within the specified epsilon distance, and they are not within epsilon of any core point. Noise points are typically considered as anomalies since they do not conform to the density-based clustering patterns of the majority of the data.

The distinction between these categories is useful for anomaly detection because anomalies are often characterized by their deviation from the normal density patterns. Noise points, which do not belong to any cluster, can be seen as potential anomalies, while core and border points are more likely to represent normal data.
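The three categories can be made concrete with a small hand-written classifier. This is a minimal sketch and not a full DBSCAN implementation: it only labels points and does not build clusters. The function name `classify_points` and the six toy points are hypothetical, and the sketch assumes that a point counts as its own neighbor (the convention scikit-learn uses for `min_samples`).

```python
from math import dist


def classify_points(points, eps, min_samples):
    """Label each point 'core', 'border', or 'noise' using DBSCAN's definitions.

    Assumption: a point's neighbourhood includes the point itself.
    """
    # For every point, collect the indices of all points within eps of it.
    neighbours = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_samples for n in neighbours]

    kinds = []
    for i, n in enumerate(neighbours):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in n):
            # Not dense enough itself, but within eps of a core point.
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds


# Toy points (assumed for illustration): a unit square, one point beside it, one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (6.0, 6.0)]
kinds = classify_points(points, eps=1.1, min_samples=3)
for p, k in zip(points, kinds):
    print(p, k)
```

Here the four corners of the square are core points, `(2.0, 1.0)` is a border point because its only other neighbor `(1.0, 1.0)` is a core point, and `(6.0, 6.0)` is noise.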

# Q6. Ans

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is primarily designed for clustering, but it can also be used for anomaly detection. DBSCAN detects anomalies by considering data points that do not belong to any cluster as noise points or outliers. Here's how DBSCAN detects anomalies and the key parameters involved in the process:

Density-Based Clustering: DBSCAN defines clusters based on the density of data points. It identifies dense regions in the data space as clusters and labels points in low-density areas as noise, which are then treated as anomalies.

Key Parameters:

Epsilon (ε): Also known as the radius, epsilon defines the maximum distance that two points can be from each other to be considered as neighbors. It determines the neighborhood of a point.

Min_samples: Min_samples specifies the minimum number of points that must be within the epsilon distance for a point to be classified as a core point. Points with fewer neighbors are either border points (if they lie within epsilon of a core point) or noise.

Core Points: Core points are data points that have at least the minimum number of neighboring points (min_samples) within the epsilon distance. They are considered as part of a cluster and are not anomalies.

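Density-Reachable: A point is directly density-reachable from a core point if it lies within the epsilon distance of that core point, even if it does not have enough neighbors itself. A point is density-reachable from a core point if it can be reached through a chain of such steps in which every intermediate point is a core point. All points that are density-reachable from a core point are part of the same cluster as that core point.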

Border Points: Border points are data points that have fewer neighbors than the minimum required to be core points but are within the epsilon distance of a core point. They are part of a cluster but have a lower level of density than core points.

Noise Points: Noise points, also known as outliers, are data points that do not belong to any cluster. They are not classified as core points or border points because they do not meet the requirements for either.
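Putting the pieces together, a minimal end-to-end sketch with scikit-learn's `DBSCAN` might look like the following. The data, the three injected outliers, and the values `eps=0.5` and `min_samples=5` are assumptions chosen for illustration, not recommended defaults. The fitted model exposes `labels_` (with `-1` for noise) and `core_sample_indices_`, from which the core, border, and noise masks can be derived.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data (assumed for illustration): one Gaussian blob plus three injected outliers.
rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
injected = np.array([[4.0, 4.0], [-4.0, 3.5], [5.0, -4.0]])
X = np.vstack([normal, injected])

model = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = model.labels_

# Core points are listed explicitly; noise has label -1; the rest are border points.
core_mask = np.zeros(len(X), dtype=bool)
core_mask[model.core_sample_indices_] = True
noise_mask = labels == -1
border_mask = ~core_mask & ~noise_mask

# Points flagged as anomalies are exactly the noise points.
anomaly_indices = np.flatnonzero(noise_mask)
print("core:", core_mask.sum(), "border:", border_mask.sum(), "noise:", noise_mask.sum())
print("anomaly indices:", anomaly_indices)
```

Because DBSCAN has no separate prediction step for unseen data, this approach flags anomalies only among the points it was fitted on.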

# Q7. Ans

The make_circles function in scikit-learn is used to generate a synthetic dataset consisting of concentric circles. It is primarily used for testing and illustrating the performance of machine learning algorithms, particularly those that are designed to handle non-linearly separable data.

The make_circles function allows you to create a dataset with two classes, where each class forms a separate circle in the feature space: a large outer circle and a smaller inner circle. You can control various parameters of the dataset, such as the number of samples (n_samples), the standard deviation of the Gaussian noise added to the points (noise), and the scale factor between the inner and outer circle (factor).

This synthetic dataset is useful for evaluating algorithms that are capable of capturing complex, non-linear relationships. For example, it can be used to test the performance of clustering algorithms like DBSCAN or density-based methods, as well as non-linear classifiers like support vector machines (SVM) or neural networks.

By using the make_circles function, you can generate a controlled dataset that exhibits specific characteristics, allowing you to study and analyze the behavior of different machine learning algorithms in the presence of non-linearly separable data.
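A short usage sketch follows. The parameter values (`n_samples=400`, `noise=0.05`, `factor=0.5`, `random_state=0`) are arbitrary choices for illustration. The code generates the dataset and compares the mean distance from the origin for each class to confirm that one class forms the inner circle.

```python
import numpy as np
from sklearn.datasets import make_circles

# Parameter values are assumptions for illustration.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)

# Distance of every sample from the origin.
radius = np.linalg.norm(X, axis=1)
mean_radius_outer = radius[y == 0].mean()  # label 0: outer circle
mean_radius_inner = radius[y == 1].mean()  # label 1: inner circle

print("X shape:", X.shape, "labels:", np.unique(y))
print("mean radius, outer:", round(mean_radius_outer, 2), "inner:", round(mean_radius_inner, 2))
```

Since the two classes cannot be separated by a straight line, the dataset is a convenient test case for density-based clustering and for non-linear classifiers.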

# Q8. Ans

There are two general types of outlier detection: global and local. Global outliers fall outside the normal range for an entire dataset, whereas local outliers may fall within the normal range for the entire dataset, but outside the normal range for the surrounding data points.

# Q9. Ans

The Local Outlier Factor (LOF) algorithm is an unsupervised anomaly detection method which computes the local density deviation of a given data point with respect to its neighbors. It considers as outliers the samples that have a substantially lower density than their neighbors.
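The following sketch shows LOF flagging a local outlier with scikit-learn's `LocalOutlierFactor`. The dataset is an assumption made for illustration: one very dense cluster, one widely spread cluster, and a single point at `(1.0, 1.0)` that lies inside the overall range of the data but far from the dense cluster next to it. The choice `n_neighbors=20` is also arbitrary.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(0)
dense = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(100, 2))
sparse = rng.normal(loc=[5.0, 5.0], scale=1.5, size=(100, 2))
local_outlier = np.array([[1.0, 1.0]])  # unremarkable globally, isolated locally
X = np.vstack([dense, sparse, local_outlier])

lof = LocalOutlierFactor(n_neighbors=20)
pred = lof.fit_predict(X)  # -1 = outlier, 1 = inlier

# scikit-learn stores the negated LOF; flip the sign so larger means more outlying.
lof_scores = -lof.negative_outlier_factor_

print("prediction for the local outlier:", pred[-1])
print("its LOF score:", round(float(lof_scores[-1]), 2))
print("median LOF score:", round(float(np.median(lof_scores)), 2))
```

A LOF score close to 1 means a point is about as dense as its neighbors, while a score well above 1 means its neighborhood is much sparser than theirs.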

# Q10. Ans

The Isolation Forest algorithm can be used to detect global outliers in a dataset. In the Isolation Forest algorithm, anomalies are identified based on their tendency to have shorter average path lengths in the isolation trees.

To detect global outliers using the Isolation Forest algorithm, you can follow these steps:

Train the Isolation Forest model on the dataset that contains both normal and potentially anomalous data points.

For each data point in the dataset, compute its anomaly score. The anomaly score is derived from the average path length required to isolate the data point in the isolation trees: the shorter the average path, the higher the anomaly score.

Sort the data points based on their anomaly scores in ascending order. Data points with lower anomaly scores are considered less likely to be outliers, while data points with higher anomaly scores are considered more likely to be outliers.

Set a threshold for determining which data points are considered outliers. The threshold can be determined based on domain knowledge or by analyzing the distribution of anomaly scores.

Identify the data points with anomaly scores above the threshold as global outliers. These data points are considered to be significantly different from the majority of the data points in the dataset.
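The steps above can be sketched with scikit-learn's `IsolationForest`. The data, the three injected extreme points, `n_estimators=200`, and the 99th-percentile threshold are assumptions chosen for illustration. One detail to keep in mind is that `score_samples` returns the negated score (lower means more abnormal), so the sketch flips the sign to match the "higher score means more anomalous" convention used in the steps.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (assumed for illustration): a Gaussian blob plus three extreme points.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
extreme = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0]])
X = np.vstack([normal, extreme])

# Step 1: train on the full dataset (normal and potentially anomalous points).
forest = IsolationForest(n_estimators=200, random_state=0).fit(X)

# Step 2: compute anomaly scores; negate so that higher = more anomalous.
anomaly_score = -forest.score_samples(X)

# Step 3: sort in ascending order, so the most anomalous points come last.
order = np.argsort(anomaly_score)

# Step 4: pick a threshold (assumed here: the 99th percentile of the scores).
threshold = np.quantile(anomaly_score, 0.99)

# Step 5: flag points at or above the threshold as global outliers.
outlier_idx = np.flatnonzero(anomaly_score >= threshold)
print("most anomalous indices:", order[-5:])
print("flagged as outliers:", outlier_idx)
```

Instead of choosing a threshold by hand, the `contamination` parameter together with `predict` can be used, which returns `-1` for outliers and `1` for inliers.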

# Q11. Ans

Local outlier detection and global outlier detection have their own specific use cases depending on the nature of the data and the context of the problem. Here are some examples of real-world applications where one approach may be more appropriate than the other:

Local Outlier Detection:

Fraud Detection: In financial transactions, detecting local anomalies within a specific user's transaction history can help identify fraudulent activities that deviate from their normal behavior.

Network Intrusion Detection: Local outlier detection can be used to identify abnormal behavior or anomalies within a specific network segment or user's network traffic.

Disease Outbreak Detection: Local outlier detection can be applied to detect local clusters of disease outbreaks within a specific geographic region, helping to identify potential epidemics or outbreaks.

Global Outlier Detection:

Manufacturing Quality Control: Global outlier detection can be useful for identifying faulty products or components that deviate significantly from the norm across the entire manufacturing process.

Anomaly Detection in Sensor Networks: Global outlier detection can help identify sensor readings that deviate from the expected patterns across a network of sensors, indicating potential equipment malfunctions or abnormal conditions.

Credit Card Fraud Detection: Global outlier detection can be applied to identify unusual patterns or anomalies in credit card transactions across a large customer base, helping to detect fraudulent activities that span multiple accounts.