# Anomaly Detection Analysis

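This notebook demonstrates anomaly detection using ARIMA forecasting residuals: each observation is compared with its forecast, and points with unusually large residuals are flagged as anomalies.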

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose
import sys
sys.path.append('..')
from src.data_generation import generate_synthetic_data
from src.anomaly_detection import detect_anomalies
from src.visualization import plot_time_series_with_anomalies, plot_residuals_distribution

%matplotlib inline

## Generate and Explore Data

In [None]:
# Generate synthetic data
df = generate_synthetic_data()
df.to_csv('../data/synthetic_data.csv', index=False)
print("Synthetic data generated and saved to data/synthetic_data.csv")

# Display the first few rows and basic info about the dataset
print(df.head())
df.info()  # info() prints its summary directly and returns None, so it is not wrapped in print()

## Time Series Decomposition

We'll decompose the time series to understand its components: trend, seasonality, and residuals.

In [None]:
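# Aggregate data by date for decomposition; the result is indexed by date
df_agg = df.groupby('date')[['value']].sum()

# Perform time series decomposition
# period=365 assumes daily observations with a yearly cycle; seasonal_decompose
# raises a ValueError unless the series covers at least two full periods (730 rows)
result = seasonal_decompose(df_agg['value'], model='additive', period=365)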
result.plot()
plt.tight_layout()
plt.show()

The decomposition shows:
1. A clear upward trend in the data.
2. A strong seasonal component with yearly cycles.
3. Residuals that appear mostly random, but with some potential anomalies.
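
To make the three components concrete, the sketch below rebuilds an additive decomposition by hand on a small synthetic series. It illustrates the idea behind `seasonal_decompose` and is not its actual implementation; the function name `additive_decompose`, the demo series, and the 7-day period are assumptions chosen for brevity.

```python
import numpy as np
import pandas as pd


def additive_decompose(series: pd.Series, period: int) -> pd.DataFrame:
    """Split a series into trend + seasonal + residual (additive model)."""
    # Trend: centered moving average over one full period (odd period assumed).
    trend = series.rolling(window=period, center=True).mean()
    # Seasonal: mean detrended value at each position in the cycle,
    # shifted so the seasonal effects sum to zero over one period.
    detrended = series - trend
    position = np.arange(len(series)) % period
    seasonal_means = detrended.groupby(position).mean()
    seasonal_means -= seasonal_means.mean()
    seasonal = pd.Series(seasonal_means.loc[position].to_numpy(), index=series.index)
    # Residual: whatever trend and seasonality do not explain.
    resid = series - trend - seasonal
    return pd.DataFrame({"trend": trend, "seasonal": seasonal, "resid": resid})


# Hypothetical demo data: linear trend + weekly cycle + noise.
rng = np.random.default_rng(0)
t = np.arange(140)
demo = pd.Series(
    0.5 * t + 10 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 1, t.size),
    index=pd.date_range("2023-01-01", periods=t.size, freq="D"),
)
parts = additive_decompose(demo, period=7)
print(parts.dropna().head())
```

The first and last few rows of `trend` and `resid` are NaN because the centered window is incomplete there, which is also why the edges of the decomposition plot are blank.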

## ARIMA Modeling and Anomaly Detection

In [None]:
# Perform anomaly detection
df_with_anomalies = detect_anomalies(df_agg)
print(f"Detected {df_with_anomalies['anomaly'].sum()} anomalies")

# Plot the time series with detected anomalies
plot_time_series_with_anomalies(df_with_anomalies)

# Plot the distribution of residuals
residuals = df_with_anomalies['value'] - df_with_anomalies['forecast']
plot_residuals_distribution(residuals)

## Interpretation of Results

1. The anomaly detection algorithm flagged the points whose actual values deviate most from the ARIMA forecast.
2. In real data, such points could mark events or data-quality issues that warrant further investigation.
3. The distribution of residuals appears roughly normal; the outliers in its tails correspond to the detected anomalies.
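
The "roughly normal" reading can be checked numerically rather than by eye. The sketch below is a minimal example using `scipy.stats`; the helper name `summarize_normality` and the demo arrays are assumptions, and in the notebook the `residuals` series computed above would be passed in instead.

```python
import numpy as np
from scipy import stats


def summarize_normality(residuals):
    """Return simple shape statistics and a Shapiro-Wilk test for a sample."""
    x = np.asarray(residuals, dtype=float)
    x = x[~np.isnan(x)]  # forecasts may be missing for the first few rows
    stat, p_value = stats.shapiro(x)
    return {
        "skewness": float(stats.skew(x)),
        "excess_kurtosis": float(stats.kurtosis(x)),  # 0 for a normal distribution
        "shapiro_stat": float(stat),
        "shapiro_p": float(p_value),
    }


# Hypothetical demo: clean Gaussian noise versus the same noise with three outliers.
rng = np.random.default_rng(1)
clean = rng.normal(0, 1, 500)
contaminated = clean.copy()
contaminated[[10, 250, 400]] = [8.0, -9.0, 10.0]
clean_summary = summarize_normality(clean)
contaminated_summary = summarize_normality(contaminated)
print(clean_summary)
print(contaminated_summary)
```

A small Shapiro-Wilk p-value or a large excess kurtosis indicates heavy tails, which is what a handful of anomalies would be expected to produce.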

Next steps could include:
1. Investigating the specific dates and circumstances of the detected anomalies.
2. Refining the ARIMA model parameters for potentially better forecasting.
3. Applying this method to real-world data and validating the results with domain experts.

## Detailed Analysis of Detected Anomalies

In [None]:
# Get the dates of detected anomalies
# (astype(bool) keeps the mask valid even if 'anomaly' is stored as 0/1 integers)
anomaly_dates = df_with_anomalies[df_with_anomalies['anomaly'].astype(bool)].index

print("Dates of detected anomalies:")
for date in anomaly_dates:
    print(f"- {date}")

# Calculate the percentage difference between actual and forecasted values
# (computed for every row; only the anomalous dates are printed below)
residual = df_with_anomalies['value'] - df_with_anomalies['forecast']
df_with_anomalies['percent_diff'] = residual / df_with_anomalies['forecast'] * 100

print("\nPercentage difference for anomalies:")
for date in anomaly_dates:
    percent_diff = df_with_anomalies.loc[date, 'percent_diff']
    print(f"- {date}: {percent_diff:.2f}%")

# Visualize the anomalies in context
plt.figure(figsize=(12, 6))
plt.plot(df_with_anomalies.index, df_with_anomalies['value'], label='Actual')
plt.plot(df_with_anomalies.index, df_with_anomalies['forecast'], label='Forecast')
plt.scatter(anomaly_dates, df_with_anomalies.loc[anomaly_dates, 'value'], color='red', label='Anomalies')
plt.title('Time Series with Detected Anomalies')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.show()

## Conclusion and Next Steps

Based on our analysis:

1. We successfully identified several anomalies in our synthetic dataset using ARIMA forecasting residuals.
2. The detected anomalies represent significant deviations from the expected values, which could indicate important events or issues in a real-world scenario.
3. The distribution of residuals is approximately normal, with the detected anomalies appearing as outliers.

Next steps for improving and applying this analysis:

1. Fine-tune the ARIMA model parameters to potentially improve forecasting accuracy.
2. Experiment with different anomaly detection thresholds to balance sensitivity and specificity.
3. Apply this method to real-world data and collaborate with domain experts to validate and interpret the results.
4. Develop an automated alert system based on this anomaly detection method for real-time monitoring.
5. Investigate machine learning-based anomaly detection methods (e.g., isolation forests, autoencoders) and compare their performance with this ARIMA-based approach.
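
The threshold experiment in step 2 can start as a simple sweep: standardize the residuals and count how many points each cutoff would flag. The sketch below uses NumPy only; the function name `count_flags_by_threshold`, the candidate thresholds, and the demo residuals are assumptions for illustration.

```python
import numpy as np


def count_flags_by_threshold(residuals, thresholds=(2.0, 2.5, 3.0, 3.5, 4.0)):
    """Count residuals whose absolute z-score exceeds each threshold."""
    x = np.asarray(residuals, dtype=float)
    z_scores = np.abs(x - x.mean()) / x.std(ddof=1)
    return {k: int((z_scores > k).sum()) for k in thresholds}


# Hypothetical demo: Gaussian residuals with three injected anomalies.
rng = np.random.default_rng(3)
demo_residuals = rng.normal(0, 1, 500)
demo_residuals[[50, 200, 400]] = [8.0, -9.0, 10.0]
flag_counts = count_flags_by_threshold(demo_residuals)
for threshold, count in flag_counts.items():
    print(f"threshold {threshold}: {count} points flagged")
```

Lower thresholds catch more true anomalies but also flag ordinary noise, so the count at each cutoff gives a first sense of the sensitivity trade-off before validating against labeled events.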

These steps would make the anomaly detection system more reliable and its output more useful for business decision-making and process optimization.