
### 1. What is clustering in machine learning?

Clustering is an unsupervised learning technique that groups similar data points together based on their features. The goal is to discover inherent structures within the data without prior knowledge of labels or categories.

### 2. Explain the difference between supervised and unsupervised clustering.

Clustering is inherently unsupervised. The term "supervised clustering" is non-standard but may refer to classification tasks where labeled data is used. In unsupervised clustering, data is grouped without labels, while supervised methods (e.g., classification) use labeled data to train models.



### 3. What are the key applications of clustering algorithms?

Customer segmentation

Image compression

Document clustering

Anomaly detection

Market basket analysis

### 4. Describe the K-means clustering algorithm.

1. Choose K initial centroids.

2. Assign each data point to the nearest centroid.

3. Recalculate centroids as the mean of assigned points.

4. Repeat steps 2–3 until convergence (no further changes).

In [1]:
from sklearn.cluster import KMeans
import numpy as np

# Sample data
X = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])

# Initialize and fit K-means
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)

print("Cluster centers:", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)

Cluster centers: [[1.16666667 1.46666667]
 [7.33333333 9.        ]]
Labels: [0 1 0 1 0 1]
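The four steps listed above can also be written out directly in NumPy. This is a minimal sketch, not scikit-learn's implementation: the helper name `kmeans_sketch` is hypothetical, the first K rows are used as initial centroids purely for determinism, and empty clusters are not handled.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100):
    """Plain K-means; uses the first k rows as initial centroids (illustrative choice)."""
    centroids = X[:k].astype(float).copy()            # Step 1: initial centroids
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])
centroids, labels = kmeans_sketch(X, k=2)
print("Centroids:", centroids)
print("Labels:", labels)
```

On this small data set the sketch settles on the same two groups as the `KMeans` call above; on harder data the result depends on the initial centroids.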


### 5. What are the main advantages and disadvantages of K-means clustering?

Advantages: Simple, efficient, works well with spherical clusters.

Disadvantages: Requires K to be chosen in advance, is sensitive to initial centroids and outliers, struggles with non-convex clusters, and tends to favor clusters of similar size and spread.

### 6. How does hierarchical clustering work?
It builds a hierarchy of clusters using either:

Agglomerative: Bottom-up approach, merging closest clusters iteratively.

Divisive: Top-down approach, splitting clusters recursively.

### 7. What are the different linkage criteria used in hierarchical clustering?

Single linkage (minimum distance).

Complete linkage (maximum distance).

Average linkage (average distance).

Ward’s method (minimizes variance).
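As a rough illustration, the four criteria can be compared with scikit-learn's `AgglomerativeClustering`, which accepts them through its `linkage` parameter. The six sample points are reused from the K-means example and are well separated, so all four criteria happen to agree here; on less tidy data they often do not.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])

results = {}
for linkage in ("single", "complete", "average", "ward"):
    # Bottom-up merging until two clusters remain.
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    results[linkage] = model.fit_predict(X)
    print(f"{linkage:>8}: {results[linkage]}")
```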

### 8. Explain the concept of DBSCAN clustering.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies dense regions separated by sparse regions. Points are classified as:

Core points: Have at least min_samples points within radius eps (scikit-learn counts the point itself).

Border points: Lie within eps of a core point but have fewer than min_samples neighbors themselves.

Noise points: Neither core nor border.

In [2]:
from sklearn.cluster import DBSCAN
import numpy as np

# Sample data
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

# Fit DBSCAN
dbscan = DBSCAN(eps=3, min_samples=2)
labels = dbscan.fit_predict(X)

print("Labels (Noise points labeled as -1):", labels)

Labels (Noise points labeled as -1): [ 0  0  0  1  1 -1]
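The three point types can be recovered from a fitted model. The sketch below uses a made-up data set (a dense square, one point hanging off its edge, and one isolated point) and assumes scikit-learn's convention that a point counts as its own neighbor; `core_sample_indices_` lists the core points, and anything that is clustered but not core is a border point.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],   # dense square
              [2.3, 1],                          # near the square, few neighbors
              [10, 10]])                         # isolated

db = DBSCAN(eps=1.5, min_samples=4).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True
noise_mask = db.labels_ == -1
border_mask = ~core_mask & ~noise_mask          # clustered, but not dense enough to be core

print("Core:  ", np.flatnonzero(core_mask))
print("Border:", np.flatnonzero(border_mask))
print("Noise: ", np.flatnonzero(noise_mask))
```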


### 9. What are the parameters involved in DBSCAN clustering?

eps: Radius defining neighborhood.

min_samples: Minimum points required to form a dense region.

### 10. Describe the process of evaluating clustering algorithms.
Use internal metrics (e.g., silhouette score, Davies-Bouldin index), external metrics (if labels exist, e.g., adjusted Rand index), or visual assessment (e.g., dendrograms, scatter plots).

### 11. What is the silhouette score, and how is it calculated?
It measures how well a point fits its cluster. For a point i:

$$
s(i) = \frac{b(i) - a(i)}{\max\bigl(a(i),\, b(i)\bigr)}
$$

where $a(i)$ = mean intra-cluster distance and $b(i)$ = mean distance to the nearest neighboring cluster.
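To make the definition concrete, the sketch below computes the mean silhouette by hand and compares it with scikit-learn's `silhouette_score`. The helper `silhouette_manual` is hypothetical and assumes Euclidean distance and at least two points in every cluster.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def silhouette_manual(X, labels):
    """Mean silhouette over all points; assumes every cluster has at least two points."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                                   # exclude the point itself from a(i)
        a = dist[i, same].mean()                          # mean intra-cluster distance
        b = min(dist[i, labels == c].mean()               # nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

X = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])
labels = np.array([0, 1, 0, 1, 0, 1])

manual = silhouette_manual(X, labels)
reference = silhouette_score(X, labels)
print(f"manual={manual:.4f}, sklearn={reference:.4f}")
```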

### 12. Discuss the challenges of clustering high-dimensional data.

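Curse of dimensionality: Pairwise distances concentrate, so the nearest and farthest neighbors become almost equally far away and distance metrics lose meaning.

Increased sparsity.

Noise dominates: Irrelevant dimensions drown out the features that carry cluster structure.

Solutions: Dimensionality reduction (PCA, t-SNE).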
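The loss of contrast between distances is easy to observe numerically. The sketch below is an illustration on uniform random data, not a proof: it measures how much farther the farthest point is than the nearest one, relative to the nearest distance. The function name `relative_contrast` and the sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n_points=500):
    """(max - min) / min of distances from one query point to uniform random points."""
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

for dim in (2, 10, 100, 1000):
    print(f"dim={dim:>4}  relative contrast={relative_contrast(dim):.2f}")
```

The printed contrast should shrink as the dimension grows, which is why distance-based clustering degrades there.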

### 13. Explain the concept of density-based clustering.
Clusters are defined as regions of higher density compared to their surroundings. Examples: DBSCAN, OPTICS.

### 14. How does Gaussian Mixture Model (GMM) clustering differ from K-means?
GMM assumes data is generated from a mixture of Gaussian distributions. It allows soft assignments (probabilistic) and captures elliptical clusters, unlike K-means (hard assignments, spherical clusters).
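A short sketch of the soft-versus-hard distinction, using scikit-learn's `GaussianMixture` and `KMeans` on synthetic data. The two blobs (one round, one stretched) and all parameter values are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic blobs: one round, one stretched along the x-axis.
round_blob = rng.normal(loc=[0, 0], scale=[0.5, 0.5], size=(100, 2))
long_blob = rng.normal(loc=[5, 0], scale=[2.0, 0.3], size=(100, 2))
X = np.vstack([round_blob, long_blob])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
soft = gmm.predict_proba(X)      # one membership probability per component, per point
hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("GMM probabilities for the first point:", soft[0].round(3))
print("K-means label for the first point:   ", hard[0])
```

Points between the blobs get split probabilities from the GMM, while K-means must commit each of them to a single cluster.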

### 15. What are the limitations of traditional clustering algorithms?

Assume specific cluster shapes (e.g., spherical).

Struggle with noise and outliers.

Require predefined parameters (e.g., K in K-means).

### 16. Discuss the applications of spectral clustering.

Image segmentation.

Social network analysis.

Bioinformatics (gene expression analysis).

### 17. Explain the concept of affinity propagation.
A clustering algorithm that identifies exemplars (representative points) by exchanging messages between data points. It does not require specifying the number of clusters upfront.



### 18. How do you handle categorical variables in clustering?

Use algorithms like K-modes.

Apply distance metrics for categorical data (e.g., Hamming distance).

Encode variables (e.g., one-hot encoding).
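Two of these options can be sketched in plain NumPy. The records below are invented, and the helpers `hamming` and `one_hot` are hypothetical names written for this example rather than library functions.

```python
import numpy as np

records = np.array([
    ["red",  "S", "cotton"],
    ["red",  "M", "cotton"],
    ["blue", "L", "wool"],
])

def hamming(a, b):
    """Fraction of attributes on which two records disagree."""
    return float(np.mean(a != b))

def one_hot(column):
    """One-hot encode a 1-D array of categories; columns follow sorted category order."""
    categories = np.unique(column)
    return (column[:, None] == categories[None, :]).astype(int)

# Encode every attribute and stack the indicator columns side by side.
encoded = np.hstack([one_hot(records[:, j]) for j in range(records.shape[1])])

print("Hamming(row 0, row 1):", hamming(records[0], records[1]))
print("Hamming(row 0, row 2):", hamming(records[0], records[2]))
print("One-hot matrix:\n", encoded)
```

The one-hot matrix can be passed to a numeric algorithm such as K-means, although Euclidean distance on indicator columns is only a rough proxy for categorical dissimilarity.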

### 19. Describe the elbow method for determining the optimal number of clusters.
Plot the within-cluster sum of squares (WCSS) against the number of clusters (K). The "elbow" point (where WCSS decline slows) indicates optimal K.
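A minimal sketch of the computation behind the plot, assuming synthetic data with three well-separated blobs. scikit-learn exposes WCSS as the `inertia_` attribute of a fitted `KMeans`; the plotting step is left out so the example stays dependency-light.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [0, 10]])
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in centers])

wcss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)     # inertia_ is scikit-learn's name for WCSS

for k, value in enumerate(wcss, start=1):
    print(f"K={k}: WCSS={value:.1f}")
```

For this data the drop should flatten after K=3, which is where the elbow would appear on a plot.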

### 20. What are some emerging trends in clustering research?

Deep clustering (e.g., autoencoders).

Subspace clustering.

Integration with reinforcement learning.

### 21. What is anomaly detection, and why is it important?
Anomaly detection identifies rare events or outliers. It is critical for fraud detection, system health monitoring, and quality control.

### 22. Discuss the types of anomalies encountered in anomaly detection.

Point anomalies: Single unusual instances.

Contextual anomalies: Abnormal in specific contexts (e.g., temperature spikes in winter).

Collective anomalies: A group of instances deviating as a whole.

### 23. Explain the difference between supervised and unsupervised anomaly detection techniques.

Supervised: Uses labeled data (normal vs. anomalies) to train a classifier.

Unsupervised: Assumes anomalies are rare and distinct, requiring no labels.

### 24. Describe the Isolation Forest algorithm for anomaly detection.
Isolation Forest isolates anomalies by recursively partitioning data using random splits. Anomalies require fewer splits to isolate, resulting in shorter average path lengths across the isolation trees.



In [3]:
from sklearn.ensemble import IsolationForest

# Sample data
X = [[0.5], [1.2], [3.4], [5.6], [120]]  # 120 is an outlier

# Fit model
clf = IsolationForest(contamination=0.1)
clf.fit(X)

print("Predictions (-1 = anomaly):", clf.predict(X))

Predictions (-1 = anomaly): [ 1  1  1  1 -1]


### 25. How does One-Class SVM work in anomaly detection?
It learns a decision boundary around normal data points in a high-dimensional space. Points outside the boundary are classified as anomalies.
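A small sketch with scikit-learn's `OneClassSVM`, assuming the training set contains only normal points. The data is synthetic and the values of `nu` and `gamma` are illustrative defaults, not tuned settings.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(200, 2))      # assumed to be normal data only
X_new = np.array([[0.1, -0.2], [8.0, 8.0]])    # near the bulk vs. far away

# nu upper-bounds the fraction of training points allowed outside the boundary.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)
pred = ocsvm.predict(X_new)                    # +1 = inlier, -1 = anomaly

print("Predictions:", pred)
```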



### 26. Discuss the challenges of anomaly detection in high-dimensional data.

Increased computational complexity.

Irrelevant features masking anomalies.

Sparsity reduces detection accuracy.

### 27. Explain the concept of novelty detection.
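Novelty detection identifies new or previously unseen patterns that were absent from the training data, which is assumed to contain only normal observations. It differs from outlier detection, where the training data itself may already contain anomalies.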

### 28. What are some real-world applications of anomaly detection?

Credit card fraud detection.

Network intrusion detection.

Industrial defect detection.

### 29. Describe the Local Outlier Factor (LOF) algorithm.
LOF measures the local density deviation of a point compared to its neighbors. A score >> 1 indicates an outlier (lower density than neighbors).
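A brief sketch with scikit-learn's `LocalOutlierFactor`, reusing the one-dimensional sample from the Isolation Forest example. The library stores the negated score in `negative_outlier_factor_`, so the sign is flipped to recover LOF values; `n_neighbors=2` is chosen only because the data set is tiny.

```python
from sklearn.neighbors import LocalOutlierFactor

X = [[0.5], [1.2], [3.4], [5.6], [120]]

lof = LocalOutlierFactor(n_neighbors=2)
labels = lof.fit_predict(X)                 # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_      # flip the sign to get the LOF values

print("Labels:", labels)
print("LOF scores:", scores.round(2))
```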

### 30. How do you evaluate the performance of an anomaly detection model?

Metrics: Precision, recall, F1-score, ROC-AUC.

Challenges: Imbalanced data, lack of labeled anomalies.

### 31. Discuss the role of feature engineering in anomaly detection.
Feature engineering creates relevant features to highlight anomalies (e.g., time-based aggregates, domain-specific transformations).

### 32. What are the limitations of traditional anomaly detection methods?

Assume specific data distributions.

Struggle with high-dimensional data.

Limited adaptability to dynamic environments.

### 33. Explain the concept of ensemble methods in anomaly detection.
Combining multiple models (e.g., Isolation Forest, LOF) to improve robustness and accuracy. Example: Voting systems or score averaging.
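Score averaging might look like the sketch below. Raw scores from different detectors are not on a common scale, so each is converted to a rank in [0, 1] before averaging; that normalization, the helper `rank_normalize`, and the synthetic data with one planted outlier are all assumptions of this example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# 50 normal points plus one planted outlier in the last row.
X = np.vstack([rng.normal(0, 1, size=(50, 2)), [[10.0, 10.0]]])

def rank_normalize(scores):
    """Map scores to ranks in [0, 1]; a higher value means more anomalous."""
    ranks = scores.argsort().argsort()
    return ranks / (len(scores) - 1)

# Both estimators report "higher = more normal", so negate to get anomaly scores.
if_scores = -IsolationForest(random_state=0).fit(X).score_samples(X)
lof_scores = -LocalOutlierFactor(n_neighbors=10).fit(X).negative_outlier_factor_

combined = (rank_normalize(if_scores) + rank_normalize(lof_scores)) / 2
print("Most anomalous index:", int(combined.argmax()))
```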

### 34. How does autoencoder-based anomaly detection work?
An autoencoder is trained to reconstruct normal data. High reconstruction error indicates anomalies.

In [4]:
import tensorflow as tf
from tensorflow.keras import layers

# Build autoencoder (explicit Input layers avoid Keras's input_shape warning)
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(8, activation="relu")
])

decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(10, activation="sigmoid")  # sigmoid output assumes features scaled to [0, 1]
])

autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer='adam', loss='mse')

# Train on normal data only, then score new samples by reconstruction error
# (X_train and X_test are placeholders for your own scaled arrays):
# autoencoder.fit(X_train, X_train, epochs=20, batch_size=32)
# reconstruction_error = tf.reduce_mean(tf.square(X_test - autoencoder.predict(X_test)), axis=1)


### 35. What are some approaches for handling imbalanced data in anomaly detection?

Resampling (oversampling anomalies, undersampling normal data).

Adjusting class weights in models.

Using anomaly score thresholds.

### 36. Describe the concept of semi-supervised anomaly detection.
Uses a small amount of labeled normal data and large unlabeled data. Models learn normal patterns and flag deviations.

### 37. Discuss the trade-offs between false positives and false negatives in anomaly detection.

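False positives: Normal instances flagged as anomalies (costly when every alert triggers a manual review or blocks a legitimate action).

False negatives: Missed anomalies (risky in critical applications).

The balance depends on domain requirements and is usually set by choosing the anomaly-score threshold.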
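The trade-off shows up as soon as the score threshold moves. In the sketch below the labels and anomaly scores are hypothetical; raising the threshold removes the false positive but misses one true anomaly.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 0, 0, 1, 1])               # 1 = anomaly
scores = np.array([0.1, 0.2, 0.4, 0.6, 0.55, 0.9])  # hypothetical anomaly scores

results = {}
for threshold in (0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)      # flag everything at or above the threshold
    results[threshold] = (precision_score(y_true, y_pred), recall_score(y_true, y_pred))
    precision, recall = results[threshold]
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```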

### 38. How do you interpret the results of an anomaly detection model?

Analyze anomaly scores.

Investigate feature contributions (e.g., SHAP values).

Validate with domain experts.

### 39. What are some open research challenges in anomaly detection?

Real-time detection in streaming data.

Explainability of models.

Adapting to evolving anomaly patterns.

### 40. Explain the concept of contextual anomaly detection.
Identifies anomalies within specific contexts (e.g., time, location). For example, a sudden spike in sales at an unusual hour.

### 41. What is time series analysis, and what are its key components?
Analysis of time-ordered data. Components:

Trend: Long-term direction.

Seasonality: Periodic fluctuations.

Residual: Irregular noise.

### 42. Discuss the difference between univariate and multivariate time series analysis.

Univariate: Single variable tracked over time.

Multivariate: Multiple variables tracked, considering interdependencies.

### 43. Describe the process of time series decomposition.
Separates a time series into:

Trend component.

Seasonal component.

Residual component.

Methods: Additive or multiplicative decomposition.

### 44. What are the main components of a time series decomposition?

Trend

Seasonality

Cyclical (long-term fluctuations)

Irregular (random noise)

### 45. Explain the concept of stationarity in time series data.
A stationary series has a constant mean, variance, and autocorrelation structure over time. ARMA-type models assume stationarity; ARIMA reaches it by differencing the series d times.

### 46. How do you test for stationarity in a time series?

Augmented Dickey-Fuller (ADF) test: Null hypothesis = the series has a unit root (non-stationary). Reject it, and treat the series as stationary, if p-value < 0.05.

Visual inspection (rolling mean/variance).

### 47. Discuss the autoregressive integrated moving average (ARIMA) model.

ARIMA combines autoregressive (AR), differencing (I), and moving average (MA) components.


In [5]:
from statsmodels.tsa.arima.model import ARIMA

# Sample time series data
data = [10, 20, 30, 40, 50, 60, 70]

# Fit ARIMA(p=1, d=1, q=1)
model = ARIMA(data, order=(1, 1, 1))
results = model.fit()

# Forecast
print("Forecast:", results.forecast(steps=3))



Forecast: [79.99993289 89.99975073 99.99945354]




### 48. What are the parameters of the ARIMA model?

p: Autoregressive order.

d: Differencing order.

q: Moving average order.

### 49. Describe the seasonal autoregressive integrated moving average (SARIMA) model.
SARIMA extends ARIMA with seasonal terms: SARIMA(p,d,q)(P,D,Q,m), where m = seasonal period.

### 50. How do you choose the appropriate lag order in an ARIMA model?
Use ACF (identifies q) and PACF (identifies p) plots. The lag where ACF/PACF cuts off indicates the order.

### 51. Explain the concept of differencing in time series analysis.
Differencing subtracts the previous observation ($y_t - y_{t-1}$) to stabilize the mean and remove trends.
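As a small worked example, `np.diff` applies exactly this subtraction. The data is the linear series from the ARIMA example: one round of differencing leaves a constant, and a second round leaves zeros.

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50, 60, 70])

first_diff = np.diff(data)          # y_t - y_{t-1}
second_diff = np.diff(data, n=2)    # differencing applied twice

print("First difference: ", first_diff)
print("Second difference:", second_diff)
```

A series with a linear trend therefore needs only d = 1 to remove it.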

### 52. What is the Box-Jenkins methodology?
A three-step approach for ARIMA modeling:

Identification: Determine p, d, q using ACF/PACF.

Estimation: Fit the model.

Diagnostics: Validate residuals (e.g., Ljung-Box test).

### 53. Discuss the role of ACF and PACF plots in identifying ARIMA parameters.

ACF: Helps identify q (MA term) by showing correlation at lags.

PACF: Helps identify p (AR term) by showing partial correlations.

### 54. How do you handle missing values in time series data?

Interpolation (linear, spline).

Forward/backward filling.

Model-based imputation (e.g., Kalman filter).

### 55. Describe the concept of exponential smoothing.

Exponential smoothing assigns decaying weights to past observations.

In [6]:
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

# Sample data
data = [10.2, 11.5, 12.3, 13.6, 14.7]

# Fit model
model = SimpleExpSmoothing(data)
fit = model.fit(smoothing_level=0.5)

# Forecast
print("Next value:", fit.forecast(1))

Next value: [13.64375]
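The decaying weights come from a one-line recursion, level = α·y + (1 − α)·previous level. The sketch below applies it by hand with α = 0.5, assuming the initial level equals the first observation; with that assumption it reproduces the forecast printed above.

```python
data = [10.2, 11.5, 12.3, 13.6, 14.7]
alpha = 0.5

level = data[0]                       # assumed initial level: the first observation
for y in data[1:]:
    # The newest observation gets weight alpha; older ones decay by (1 - alpha) each step.
    level = alpha * y + (1 - alpha) * level

print("Next value:", level)
```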




### 56. What is the Holt-Winters method, and when is it used?
Extends exponential smoothing to capture trend and seasonality. Used for data with both components. Types: Additive and multiplicative.

### 57. Discuss the challenges of forecasting long-term trends in time series data.

Uncertainty accumulates over time.

Structural breaks (e.g., economic crises).

External factors unaccounted in the model.

### 58. Explain the concept of seasonality in time series analysis.
Regular, repeating patterns tied to time intervals (e.g., daily, monthly). Example: Holiday sales spikes.

### 59. How do you evaluate the performance of a time series forecasting model?
Metrics:

MAE (Mean Absolute Error).

RMSE (Root Mean Squared Error).

MAPE (Mean Absolute Percentage Error).

In [7]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Actual vs predicted values
y_true = [3, 5, 7, 9]
y_pred = [2.8, 5.2, 7.1, 8.5]

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")

MAE: 0.25, RMSE: 0.29
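MAPE is listed above but not computed in the cell, so here is a short NumPy sketch on the same values. It divides by the actual values, which means it is undefined whenever an actual value is zero.

```python
import numpy as np

y_true = np.array([3, 5, 7, 9])
y_pred = np.array([2.8, 5.2, 7.1, 8.5])

# Mean of the absolute errors, each expressed as a percentage of the actual value.
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(f"MAPE: {mape:.2f}%")
```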


### 60. What are some advanced techniques for time series forecasting?

LSTM networks (for capturing long-term dependencies).

Prophet (Facebook’s model for seasonality and holidays).

State space models (e.g., SARIMAX, which is SARIMA with exogenous variables).