
# Anomaly Detection & Time Series Analysis Assignment
## PwSkills – Data Science

This notebook contains solutions for:

### Part A: Theoretical Questions
- Anomaly Detection
- Outlier Detection Algorithms
- Time Series Components
- Stationarity
- ARIMA Family Models

### Part B: Practical Questions
- Time Series Decomposition
- Isolation Forest
- Local Outlier Factor (LOF)
- SARIMA Forecasting
- Real-world Data Science Workflow

Each question is followed by its respective answer and implementation.


## Part A: Theoretical Questions

### Question 1: What is Anomaly Detection? Explain its types (point, contextual, and collective anomalies) with examples.

**Answer:**
Anomaly Detection is the process of identifying data points that deviate significantly from normal behavior.
It is widely used in fraud detection, system monitoring, medical diagnosis, and fault detection.

Types of anomalies:

1. Point Anomaly:
A single data point that is significantly different from others.
Example: A credit card transaction of ₹1,00,000 when normal spending is ₹2,000.

2. Contextual Anomaly:
A data point that is abnormal only within a specific context, such as time or location, even though the same value would be normal elsewhere.
Example: Electricity usage at midnight that matches the usual daytime peak.

3. Collective Anomaly:
A group of data points behaving abnormally together.
Example: Continuous abnormal network traffic indicating a cyber attack.
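
The difference between point and contextual anomalies can be sketched with a small example. The data below is made up for illustration, and the modified z-score (median and MAD with a 3.5 cutoff) is just one common rule of thumb, not the only way to score outliers. Collective anomalies are not covered here because they need a window-based score.

```python
import numpy as np

def robust_flags(values, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds `threshold`.

    Assumes the median absolute deviation (MAD) is non-zero.
    """
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    scores = 0.6745 * (values - median) / mad
    return np.abs(scores) > threshold

# Point anomaly: one transaction amount (in rupees) far from all the others.
amounts = [1800, 2100, 1950, 2200, 2050, 1900, 100000, 2000]
point_flags = robust_flags(amounts)

# Contextual anomaly: 4.8 kWh is ordinary at noon but unusual at midnight,
# so each reading is scored only against readings from the same hour.
usage_by_hour = {
    "midnight": [1.0, 1.2, 0.9, 1.1, 4.8, 1.0],
    "noon": [5.0, 5.2, 4.9, 5.1, 4.8, 5.0],
}
context_flags = {hour: robust_flags(vals) for hour, vals in usage_by_hour.items()}

print("Point anomaly indices:", np.flatnonzero(point_flags))
for hour, flags in context_flags.items():
    print(hour, "anomaly indices:", np.flatnonzero(flags))
```

Scoring each reading against its own hour is what makes the second check contextual: the value 4.8 appears in both groups but is flagged only at midnight.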


### Question 2: Compare Isolation Forest, DBSCAN, and Local Outlier Factor in terms of their approach and suitable use cases.

**Answer:**
Isolation Forest:
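- Approach: Builds random trees that split the data on randomly chosen features and thresholds; anomalies are isolated in fewer splits, so they have shorter average path lengths.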
- Works well with high-dimensional datasets.
- Fast and scalable.
- Suitable for fraud detection and large datasets.

DBSCAN:
- Approach: Density-based clustering; points that do not belong to any dense cluster are labeled as noise and treated as anomalies.
- Detects arbitrarily shaped clusters.
- Suitable for spatial data and noise detection.

Local Outlier Factor (LOF):
- Approach: Compares local density of a point with neighbors.
- Detects local anomalies effectively.
- Suitable when anomalies depend on neighborhood density.
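
Isolation Forest and LOF are implemented in Part B, so the sketch below covers only DBSCAN. It assumes scikit-learn's `DBSCAN`, which labels noise points as `-1`. The synthetic data, the three injected outliers, and the `eps` and `min_samples` values are illustrative choices, not tuned settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# One dense cluster around the origin plus three hand-placed distant points.
X, _ = make_blobs(n_samples=300, centers=[[0, 0]], cluster_std=0.5, random_state=42)
outliers = np.array([[6.0, 6.0], [-6.0, 5.0], [5.0, -6.0]])
X_all = np.vstack([X, outliers])

# A point needs at least 5 points (itself included) within distance 0.5
# to count as a core point; points unreachable from any core point are noise.
db = DBSCAN(eps=0.5, min_samples=5).fit(X_all)
noise_mask = db.labels_ == -1

print("Points labeled as noise:", int(noise_mask.sum()))
print("Injected outliers flagged:", noise_mask[-3:])
```

Unlike Isolation Forest, DBSCAN has no contamination parameter; the number of flagged points follows from `eps` and `min_samples`, so sparse points on the edge of the cluster may be flagged along with the injected ones.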


### Question 3: What are the key components of a Time Series? Explain each with one example.

**Answer:**
A time series consists of the following components:

1. Trend:
Long-term increase or decrease in data.
Example: Increasing airline passengers over years.

2. Seasonality:
Regular repeating patterns at fixed intervals.
Example: Higher sales during festivals every year.

3. Cyclical Component:
Long-term oscillations not fixed in duration.
Example: Economic boom and recession cycles.

4. Residual (Noise):
Random variations not explained by other components.
Example: Unexpected events affecting demand.


### Question 4: Define stationarity in time series. How can you test for it and transform a non-stationary series into a stationary one?

**Answer:**
A (weakly) stationary time series has a constant mean, a constant variance, and an autocovariance that depends only on the lag between observations, not on the time at which they are observed.

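Testing methods:
- Augmented Dickey-Fuller (ADF) Test: the null hypothesis is that the series has a unit root (non-stationary), so a small p-value suggests stationarity.
- KPSS Test: the null hypothesis is that the series is stationary, so a small p-value suggests non-stationarity.

Transformations:
- Differencing: subtracts the previous value to remove a trend.
- Log transformation: stabilizes a variance that grows with the level of the series.
- Seasonal differencing: subtracts the value from one season earlier to remove seasonality.
- Detrending: fits and subtracts a trend, such as a regression line.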


### Question 5: Differentiate between AR, MA, ARIMA, SARIMA, and SARIMAX models in terms of structure and application.

**Answer:**
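AR (AutoRegressive):
Models the current value as a linear combination of its own past p values. Suited to stationary series whose values depend on recent history.

MA (Moving Average):
Models the current value as a linear combination of the past q forecast errors. Suited to stationary series driven by short-lived shocks.

ARIMA:
Combines AR(p) and MA(q) terms with d rounds of differencing to handle non-stationary, non-seasonal data.

SARIMA:
Extends ARIMA with seasonal AR, differencing, and MA terms (P, D, Q) at a seasonal period s, for data with repeating seasonal patterns.

SARIMAX:
SARIMA with external (exogenous) variables such as weather or promotions.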
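
The structural differences can be written compactly with the backshift operator $B$, where $B\,y_t = y_{t-1}$. The notation below follows a common textbook convention and is not defined elsewhere in this notebook; sign conventions for the MA coefficients vary between references. The SARIMAX line shows the regression-with-SARIMA-errors form, which is one common formulation.

$$
\begin{aligned}
\text{AR}(p):\quad & y_t = c + \sum_{i=1}^{p} \phi_i\, y_{t-i} + \varepsilon_t \\
\text{MA}(q):\quad & y_t = \mu + \varepsilon_t + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j} \\
\text{ARIMA}(p,d,q):\quad & \phi(B)\,(1-B)^d\, y_t = c + \theta(B)\,\varepsilon_t \\
\text{SARIMA}(p,d,q)(P,D,Q)_s:\quad & \phi(B)\,\Phi(B^s)\,(1-B)^d\,(1-B^s)^D\, y_t = c + \theta(B)\,\Theta(B^s)\,\varepsilon_t \\
\text{SARIMAX}:\quad & y_t = \beta^\top x_t + u_t, \quad u_t \sim \text{SARIMA}(p,d,q)(P,D,Q)_s
\end{aligned}
$$

Here $\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^p$ and $\theta(B) = 1 + \theta_1 B + \dots + \theta_q B^q$; $\Phi$ and $\Theta$ are the analogous seasonal polynomials in $B^s$, $\varepsilon_t$ is white noise, and $x_t$ is the vector of exogenous variables.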


## Part B: Practical Questions

In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs

from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.statespace.sarimax import SARIMAX
import statsmodels.api as sm



### Question 6:
Load AirPassengers dataset, plot original series, and decompose into trend, seasonality, and residual components.


In [None]:

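# get_rdataset returns a DataFrame with "time" and "value" columns;
# keep only the passenger counts and index them by month start.
raw = sm.datasets.get_rdataset("AirPassengers").data
data = pd.Series(
    raw["value"].to_numpy(),
    index=pd.date_range(start="1949-01", periods=len(raw), freq="MS"),
    name="passengers",
)

plt.plot(data)
plt.title("AirPassengers Dataset")
plt.show()

# Multiplicative model: the seasonal swings grow with the level of the series.
decomposition = seasonal_decompose(data, model="multiplicative", period=12)
decomposition.plot()
plt.show()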



### Question 7:
Apply Isolation Forest on a numerical dataset and visualize anomalies.


In [None]:

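# Single Gaussian blob; points on its fringe are the likeliest anomalies.
X, _ = make_blobs(n_samples=300, centers=1, cluster_std=1.0, random_state=42)

# contamination=0.05 flags roughly 5% of the points as anomalies;
# random_state makes the result reproducible.
iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(X)  # 1 = inlier, -1 = anomaly

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("Isolation Forest Anomaly Detection")
plt.show()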



### Question 8:
Train SARIMA model and forecast next 12 months.


In [None]:

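# Non-seasonal order (p, d, q) = (1, 1, 1); seasonal order (P, D, Q, s) = (1, 1, 1, 12) for monthly data.
model = SARIMAX(data, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
results = model.fit(disp=False)  # disp=False silences the optimizer output

# Forecast the 12 months that follow the end of the series.
forecast = results.forecast(steps=12)

plt.plot(data, label="Original")
plt.plot(forecast, label="Forecast")
plt.legend()
plt.show()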



### Question 9:
Apply Local Outlier Factor (LOF) and visualize anomalies.


In [None]:

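# Fixed random_state so the plot is reproducible.
X, _ = make_blobs(n_samples=300, centers=1, cluster_std=1.0, random_state=42)

# Each point's local density is compared with that of its 20 nearest neighbors.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)  # 1 = inlier, -1 = anomaly

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("LOF Anomaly Detection")
plt.show()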



### Question 10:
Real-time data science workflow for energy demand forecasting and anomaly detection.



**Answer:**

1. **Anomaly Detection in Streaming Data**
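- Isolation Forest can be used for real-time anomaly detection because scoring new points is fast and the model scales to large datasets.
- LOF can detect local anomalies, meaning readings that look normal globally but are unusual relative to their nearest neighbors.
- DBSCAN can flag readings that fall outside every dense cluster of normal behavior.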

2. **Time Series Model**
- SARIMAX is preferred because it supports seasonal patterns and external variables such as weather.

3. **Validation & Monitoring**
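- Use rolling (walk-forward) forecast validation: fit on data up to a cutoff, forecast the next horizon, then move the cutoff forward and repeat.
- Monitor RMSE and MAE on these out-of-sample forecasts.
- Retrain models periodically, or when the monitored errors drift upward.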

4. **Business Impact**
- Detect abnormal spikes early.
- Prevent grid failures.
- Improve energy planning and cost optimization.
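
The rolling forecast validation from step 3 can be sketched as follows. To keep the example fast and dependency-light, it uses a seasonal-naive forecaster as a stand-in for SARIMAX. The function names, the synthetic hourly demand, and the window sizes are all hypothetical choices for illustration.

```python
import numpy as np

def seasonal_naive(history, horizon, period):
    """Stand-in model: repeat the last observed seasonal cycle."""
    last_cycle = history[-period:]
    reps = int(np.ceil(horizon / period))
    return np.tile(last_cycle, reps)[:horizon]

def rolling_origin_scores(series, initial, horizon, period):
    """Expanding-window evaluation: forecast `horizon` steps, then advance the origin."""
    errors = []
    for origin in range(initial, len(series) - horizon + 1, horizon):
        # Only data before the origin is visible to the forecaster.
        forecast = seasonal_naive(series[:origin], horizon, period)
        errors.append(series[origin:origin + horizon] - forecast)
    errors = np.concatenate(errors)
    return {
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "mae": float(np.mean(np.abs(errors))),
    }

# Hypothetical hourly demand: 30 days with a daily cycle plus noise.
rng = np.random.default_rng(0)
hours = np.arange(24 * 30)
demand = 50 + 10 * np.sin(2 * np.pi * hours / 24) + rng.normal(scale=1.0, size=hours.size)

# Start with one week of history and forecast one day at a time.
scores = rolling_origin_scores(demand, initial=24 * 7, horizon=24, period=24)
print(scores)
```

To evaluate a real model, replace the call to `seasonal_naive` with a fit-and-forecast step on `series[:origin]`; the surrounding loop and the error metrics stay the same.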
