# Unsupervised Learning, Anomaly Detection, and Temporal Analysis

**Instructions:** Carefully read each question. Use Google Docs, Microsoft Word, or a similar tool to create a document where you type out each question along with its answer. Save the document as a PDF, and then upload it to the LMS. Please do not zip or archive the files before uploading them. Each question carries 20 marks.

**Question 1:** What is Dimensionality Reduction? Why is it important in machine learning?

**Answer:** Dimensionality Reduction is the process of reducing the number of input variables (features) in a dataset while keeping as much relevant information as possible. In simple terms, it means simplifying the data by removing less important or redundant features.

**Why It’s Important in Machine Learning:**

**1. Reduces Overfitting:**
Fewer features mean less noise, which helps the model generalize better instead of memorizing the training data.

**2. Improves Training Speed:**
With fewer dimensions, algorithms require less computation time and memory.

**3. Enhances Visualization:**
It’s easier to visualize and understand data when it’s reduced to 2D or 3D.

**4. Removes Multicollinearity:**
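Highly correlated features carry overlapping information and can make model coefficients unstable. Dimensionality reduction removes this redundancy, for example by replacing the correlated features with a smaller set of uncorrelated components.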

**5. Improves Model Performance:**
By focusing on the most informative features, the model can often achieve higher accuracy.

**Question 2:** Name and briefly describe three common dimensionality reduction techniques.

**Answer:**

**1. Principal Component Analysis (PCA)**

**Type:** Linear technique

**Description:**

PCA transforms the original correlated features into a smaller set of uncorrelated variables called principal components. These components capture the maximum variance (information) in the data using fewer dimensions.

**Use case:** Ideal for numerical data and when you want to preserve as much variance as possible.
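
The description above can be illustrated with a minimal sketch using scikit-learn's `PCA`. The synthetic dataset, the noise level, and the choice of two components are assumptions made for illustration only; the third feature is deliberately built to be nearly redundant so that two components retain almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data (assumption): 200 samples, 3 features, where the third
# feature is almost a linear combination of the first two.
base = rng.normal(size=(200, 2))
third = base[:, 0] + base[:, 1] + rng.normal(scale=0.05, size=200)
X = np.column_stack([base, third])

# Project the 3 correlated features onto 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("Reduced shape:", X_reduced.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# The resulting components are uncorrelated with each other.
corr = np.corrcoef(X_reduced, rowvar=False)
print("Correlation between components:", corr[0, 1])
```

In practice, `explained_variance_ratio_` is often inspected to decide how many components to keep.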

**2. Linear Discriminant Analysis (LDA)**

**Type:** Linear, supervised technique

**Description:**

LDA reduces dimensions by finding feature combinations that best separate different classes in the data. It maximizes the distance between class means while minimizing the variation within each class.

**Use case:** Commonly used in classification problems (e.g., face recognition).
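
A minimal sketch of LDA as a dimensionality reduction step, using scikit-learn's `LinearDiscriminantAnalysis` on the built-in Iris dataset. The dataset choice is an assumption for illustration. Unlike PCA, the class labels are passed to `fit_transform`, and the number of components is limited to the number of classes minus one.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Iris: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# With 3 classes, LDA can produce at most 3 - 1 = 2 components.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)  # the labels y are required (supervised)

print("Original shape:", X.shape)
print("Reduced shape:", X_lda.shape)
print("Training accuracy:", lda.score(X, y))
```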

**3. t-Distributed Stochastic Neighbor Embedding (t-SNE)**

**Type:** Non-linear technique

**Description:**

t-SNE focuses on preserving the local structure of data. It converts high-dimensional similarities into probabilities and maps them into a lower-dimensional space (usually 2D or 3D) for visualization.

**Use case:** Best for exploring and visualizing complex datasets such as image or text embeddings.
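
As a rough sketch, t-SNE can be applied with scikit-learn's `TSNE`. The digits dataset, the 300-sample subset, and `perplexity=30` are assumptions chosen to keep the example fast; the perplexity value usually needs tuning per dataset.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 8x8 handwritten digit images flattened to 64 features; keep a small
# subset so the embedding runs quickly.
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

# Embed the 64-dimensional points into 2D for visualization.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)

print("Embedded shape:", X_2d.shape)
```

The resulting `X_2d` can be passed to a scatter plot colored by `y`. Note that scikit-learn's `TSNE` has no separate `transform` method, so it embeds only the data it was fitted on.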

**Question 3:** What is clustering in unsupervised learning? Mention three popular clustering algorithms.

**Answer:** Clustering in unsupervised learning is a technique used to group similar data points together based on their features, without using any predefined labels. The goal is to ensure that data points within the same cluster are more similar to each other than to those in other clusters.

**Three Popular Clustering Algorithms:**

**1. K-Means Clustering**

- Divides data into K clusters, where K is chosen in advance.

- Each point belongs to the cluster with the nearest mean (centroid).

- Works best for well-separated, spherical clusters.

**2. Hierarchical Clustering**

- Builds a hierarchy (tree-like structure) of clusters using a bottom-up (agglomerative) or top-down (divisive) approach.

- Results can be visualized with a dendrogram.

- The number of clusters does not have to be fixed in advance; it can be chosen afterwards by cutting the dendrogram at a given level.

**3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)**

- Groups points that lie in dense regions and labels points in low-density regions as noise or outliers.

- Can find clusters of arbitrary shape.

- Doesn’t require specifying the number of clusters beforehand.
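
The three algorithms above can be compared with a short sketch on a dataset of two interleaving half-moons, where the clusters are not spherical. The dataset and the parameter values (`eps=0.2`, `min_samples=5`, default Ward linkage) are assumptions tuned for this toy example, not general recommendations.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: non-spherical clusters.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # -1 marks noise

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_agglo = adjusted_rand_score(y_true, agglo_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)

print("K-Means ARI:      ", ari_kmeans)
print("Agglomerative ARI:", ari_agglo)
print("DBSCAN ARI:       ", ari_dbscan)
```

On data like this, the density-based approach would be expected to follow the curved shapes, while K-Means tends to split the plane with a straight boundary.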

**Question 4:** Explain the concept of anomaly detection and its significance.

**Answer:** Anomaly detection is the process of identifying data points, events, or patterns that deviate significantly from the normal behavior in a dataset. These unusual observations are called anomalies or outliers.

**Concept:**

In most datasets, the majority of data follows a common pattern or distribution. Anomaly detection focuses on spotting the few instances that don’t fit this pattern.

For example, if you track daily transactions for a credit card user, most purchases will fall within a typical range. A sudden, unusually large purchase in another country could be flagged as an anomaly.

**Significance:**

**1. Fraud Detection:**

Detects unusual transactions in banking or credit card systems that may indicate fraud.

**2. Network Security:**

Identifies abnormal network activity that could signal hacking or malware.

**3. Quality Control:**

Spots defects or irregularities in manufacturing processes.

**4. Healthcare Monitoring:**

Detects abnormal patterns in patient data, such as irregular heart rates.

**5. System Maintenance:**

Helps in predicting equipment failures by identifying abnormal sensor readings.

**Question 5:** List and briefly describe three types of anomaly detection techniques.

**Answer:**

**1. Statistical Methods**

**Description:**

These methods assume that normal data follows a certain statistical distribution (such as a Gaussian, also called normal, distribution). Data points that fall far from the expected range are marked as anomalies.

Example: Using Z-score or Interquartile Range (IQR) to detect outliers.

**Use case:** Simple and effective for small or well-behaved datasets.
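
Both rules can be sketched in a few lines of NumPy. The sample values and the cut-offs (an absolute Z-score above 3, and 1.5 times the IQR beyond the quartiles) are conventional choices assumed here for illustration.

```python
import numpy as np

# Small sample (assumption): values between 10 and 13 plus one extreme value.
values = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
                   12, 11, 10, 13, 12, 11, 12, 10, 13, 11, 50], dtype=float)

# Z-score rule: flag points more than 3 standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 3]

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

print("Z-score outliers:", z_outliers)   # [50.]
print("IQR outliers:", iqr_outliers)     # [50.]
```

The IQR rule relies on quartiles rather than the mean and standard deviation, so it is less affected by the extreme values it is trying to find.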

**2. Distance-Based Methods**

**Description:**

These rely on the idea that normal data points are close to their neighbors, while anomalies are far away.

Example: K-Nearest Neighbors (KNN) calculates distances between points and flags those far from others as anomalies.

**Use case:** Works well when the data structure is geometric (e.g., spatial or numeric data).
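
A minimal sketch of the distance-based idea: score each point by its mean distance to its k nearest neighbors and treat the highest scores as anomalies. The synthetic data, `k = 5`, and the decision to report the top two scores are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# 100 "normal" points around the origin plus two distant points (assumption).
normal_points = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
far_points = np.array([[8.0, 8.0], [-9.0, 7.0]])
X = np.vstack([normal_points, far_points])

k = 5
# Ask for k + 1 neighbors because each point's nearest neighbor is itself.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)

# Anomaly score: mean distance to the k nearest neighbors (self excluded).
knn_score = distances[:, 1:].mean(axis=1)

# Indices of the two points with the largest scores.
top2 = np.argsort(knn_score)[-2:]
print("Most isolated points:", sorted(top2.tolist()))
```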

**3. Machine Learning–Based Methods**

**Description:**

These techniques learn normal patterns from data and identify instances that don’t fit those learned patterns.

Examples:

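- Isolation Forest isolates points using random splits; anomalies need fewer splits to isolate than normal points, so it does not have to profile normal data.

- One-Class SVM separates normal data from outliers using a decision boundary.

- Autoencoders (in deep learning) reconstruct normal data well but produce a large reconstruction error on anomalies.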

**Use case:** Suitable for complex, high-dimensional, or large-scale datasets.
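
As one example of this family, the sketch below uses scikit-learn's `IsolationForest`. The synthetic data and the `contamination=0.02` setting (the assumed fraction of anomalies) are illustrative assumptions; in practice the contamination rate is rarely known and has to be estimated or tuned.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# 200 normal points plus three injected outliers (assumption).
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_outliers = np.array([[6.0, 6.0], [-7.0, 5.0], [6.5, -6.0]])
X = np.vstack([X_normal, X_outliers])

model = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
labels = model.fit_predict(X)        # +1 = inlier, -1 = anomaly
scores = model.decision_function(X)  # lower score = more anomalous

print("Number flagged:", int((labels == -1).sum()))
print("Labels of injected outliers:", labels[-3:])
```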

**Question 6:** What is time series analysis? Mention two key components of time series data.


**Answer:** Time series analysis is the process of studying data points collected or recorded over time to identify meaningful patterns, trends, and seasonal behaviors.

In simple terms, it helps understand how data changes over time and can be used to make forecasts or predictions about future values.

**Two Key Components of Time Series Data:**

**1. Trend:**
The long-term direction or movement in the data over time.

- Example: A steady increase in a company’s sales over several years.

**2. Seasonality:**
Regular and repeating patterns that occur at specific intervals.

- Example: Ice cream sales increasing every summer and dropping in winter.
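
The two components can be separated with a rough NumPy sketch on a synthetic monthly series. Everything about the data (a linear trend, a sinusoidal 12-month pattern, the noise level) is an assumption, and fitting a straight line is only one simple way to estimate a trend.

```python
import numpy as np

rng = np.random.default_rng(0)
period = 12                 # monthly data with a yearly pattern (assumption)
n = 6 * period              # six years of observations
t = np.arange(n)

# Synthetic series = trend + seasonality + noise.
trend_true = 100 + 0.5 * t
seasonal_true = 10 * np.sin(2 * np.pi * t / period)
series = trend_true + seasonal_true + rng.normal(scale=1.0, size=n)

# Trend: fit a straight line through the whole series.
slope, intercept = np.polyfit(t, series, deg=1)
trend_est = intercept + slope * t

# Seasonality: average the detrended values for each month of the year.
detrended = series - trend_est
seasonal_est = detrended.reshape(-1, period).mean(axis=0)

print("Estimated slope per month:", slope)
print("Estimated seasonal pattern:", np.round(seasonal_est, 1))
```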

**Question 7:** Describe the difference between seasonality and cyclic behavior in time series.

**Answer:**

**1. Seasonality**

- Definition: Regular, repeating patterns that occur at fixed and predictable intervals (like days, months, or quarters).

- Duration: Short-term and tied to the calendar.

- Cause: Often influenced by factors like weather, holidays, or human behavior.

- Examples:

  - Retail sales increasing every December due to the holiday season.

  - Electricity usage peaking every summer because of air conditioning.

**2. Cyclic Behavior**

- Definition: Fluctuations that occur over longer, irregular periods, often driven by economic or business cycles.

- Duration: No fixed or regular interval — cycles vary in length.

- Cause: Influenced by broader forces like economic growth, market trends, or social factors.

- Examples:

  - A country’s economy expanding and contracting over several years.

  - Real estate prices rising and falling with market cycles.

**Question 8:** Write Python code to perform K-means clustering on a sample dataset.

(Include your Python code and output in the code box below.)


In [None]:
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Step 1: Create a sample dataset
X, y = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# Step 2: Apply K-Means clustering
# n_init is set explicitly because its default differs between scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)

# Step 3: Get cluster centers and labels
centers = kmeans.cluster_centers_
labels = kmeans.labels_

# Step 4: Visualize the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30)
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=200, label='Centroids')
plt.title("K-Means Clustering Example")
plt.legend()
plt.show()

# Step 5: Print cluster centers
print("Cluster Centers:\n", centers)


**Question 9:** What is inheritance in OOP? Provide a simple example in Python.


**Answer:** Inheritance in Object-Oriented Programming (OOP) is a mechanism that allows one class (called the child or derived class) to inherit properties and methods from another class (called the parent or base class).

It promotes code reuse, makes existing functionality easier to extend, and keeps the class hierarchy clear.

In [None]:
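# Parent class
class Animal:
    def speak(self):
        print("Animals make sounds")

    def eat(self):
        print("Animal eats")

# Child class inheriting from Animal
class Dog(Animal):
    # Overrides speak(); eat() is inherited from Animal unchanged
    def speak(self):
        print("Dog barks")

# Creating objects
a = Animal()
d = Dog()

a.speak()   # Output: Animals make sounds
d.speak()   # Output: Dog barks
d.eat()     # Output: Animal eats (method inherited from Animal)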


**Question 10:** How can time series analysis be used for anomaly detection?


**Answer:** Time series analysis can be used for anomaly detection by identifying data points or patterns that deviate from the expected time-based behavior. Since time series data changes over time, detecting anomalies means spotting sudden spikes, drops, or irregular patterns that don’t match the usual trend or seasonality.

**How It Works:**

**1. Model Normal Behavior:**

A time series model (like ARIMA, LSTM, or Exponential Smoothing) is trained on historical data to understand normal patterns — including trend and seasonality.

**2. Predict Expected Values:**

The model forecasts future values or reconstructs what normal values should look like.

**3. Compare Actual vs. Expected:**

The difference between the actual and predicted values (called the residual or error) is analyzed.
If the absolute error exceeds a chosen threshold, for example a multiple of the standard deviation of historical residuals, that point is flagged as an anomaly.
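
The three steps above can be sketched with NumPy alone. A simple per-hour average stands in for the forecasting model here; it is an assumption used to keep the example short, not a replacement for ARIMA, LSTM, or Exponential Smoothing. The synthetic data, the injected anomalies, and the threshold of four standard deviations are likewise illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
period = 24        # hourly data with a daily pattern (assumption)
n_days = 14
train_days = 10
t = np.arange(period * n_days)

# Synthetic series with a daily cycle plus noise.
series = 50 + 10 * np.sin(2 * np.pi * t / period) + rng.normal(scale=1.0, size=t.size)

# Inject two artificial anomalies after the training window.
series[300] += 15
series[320] -= 15

# 1. Model normal behavior: the average value for each hour of the day,
#    learned from the first 10 days (assumed to be anomaly-free).
train = series[: period * train_days]
profile = train.reshape(-1, period).mean(axis=0)

# 2. Predict expected values by repeating the daily profile.
expected = np.tile(profile, n_days)

# 3. Compare actual vs. expected and threshold the residuals.
residuals = series - expected
threshold = 4 * residuals[: period * train_days].std()
flagged = np.flatnonzero(np.abs(residuals) > threshold)

print("Threshold:", threshold)
print("Flagged time steps:", flagged)
```

Lowering the multiplier makes the detector more sensitive at the cost of more false alarms.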