# Effortless and Trusted Anomaly Detection with Snowflake ML Functions

Anomaly detection is the process of identifying **outliers** in data, especially in **time-series** datasets where data points are indexed over time. Outliers are data points that deviate significantly from expected patterns and, if unaddressed, can distort **statistical analyses** and models. By detecting and removing anomalies, we improve the accuracy and reliability of our models. The process typically involves training a model on historical data to recognize normal patterns and using that model to spot data points that fall outside these patterns. Anomaly detection improves **data integrity**.

This Notebook is designed to help you get up to speed with Anomaly Detection ML Functions in Snowflake ([link](https://docs.snowflake.com/en/user-guide/ml-functions/anomaly-detection)). We will work through an example using data from a bank marketing dataset ([link](https://archive.ics.uci.edu/dataset/222/bank+marketing)). We will build an anomaly detection model to determine whether certain education groups show anomalies in the duration of the last contact by the bank. As a next step beyond this Notebook, you can use **Tasks** to schedule the model training process and the email notification integration to send out a report on the detected anomalies.

Let's get started!









## Step 1: Setting Up the Snowflake Environment

Before working with data in Snowflake, it's essential to set up the **necessary infrastructure**. This includes defining user roles, creating a database and schema for organizing data, and setting up a compute warehouse to process queries efficiently. The following steps ensure that the environment is correctly configured:

- **Assign Role:** First, use the `ACCOUNTADMIN` role, which has the highest level of access in Snowflake. This ensures that you have the necessary permissions to create and modify databases, schemas, and warehouses. If a different role has sufficient privileges, it can be used instead.  

- **Create Database and Schema:** A **database** is where all your data is stored, and a **schema** organizes the tables and other objects within it. In this setup, we create a database named `fawazghali_db` and a schema called `fawazghali_schema`. Note that `OR REPLACE` drops and recreates an object that already exists, so any data it held is lost; omit it if you need to keep existing objects.  

- **Select Database and Schema:** To make sure all subsequent SQL commands operate within the correct context, we explicitly set `fawazghali_db` as the active database and `fawazghali_schema` as the active schema. This avoids confusion and ensures that queries and table creations happen in the right location.  

- **Create and Use Warehouse:** A **warehouse** in Snowflake is a virtual compute engine that processes queries and computations. We create a warehouse named `fawazghali_wh`, replacing any existing instance. After creation, we set it as the active warehouse to ensure all queries utilize this compute resource efficiently.  

By completing these setup steps, Snowflake is properly configured, allowing for smooth data storage, retrieval, and processing. 🚀  


In [None]:
-- ACCOUNTADMIN is often suggested for tutorials like this one, but any role with sufficient privileges can work
USE ROLE ACCOUNTADMIN;

-- Create development database, schema for our work: 
CREATE OR REPLACE DATABASE fawazghali_db;
CREATE OR REPLACE SCHEMA fawazghali_schema;

-- Use appropriate resources: 
USE DATABASE fawazghali_db;
USE SCHEMA fawazghali_schema;

-- Create warehouse to work with: 
CREATE OR REPLACE WAREHOUSE fawazghali_wh;
USE WAREHOUSE fawazghali_wh;
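
To confirm that the session context matches what the setup steps intend, a quick check like the following can be run. This is a minimal sketch using Snowflake's built-in context functions; it is not part of the original setup, and the column aliases are illustrative.

```sql
-- Show the role, database, schema, and warehouse active in this session
SELECT
    CURRENT_ROLE()      AS active_role,
    CURRENT_DATABASE()  AS active_database,
    CURRENT_SCHEMA()    AS active_schema,
    CURRENT_WAREHOUSE() AS active_warehouse;
```

Because the object names were created without quotes, Snowflake stores and reports them in uppercase.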


## Step 2: Create an External Stage for AWS S3

In this step, we create an external stage that connects to the AWS S3 bucket where our data is stored, then use it to load the data into a Snowflake table. The cell below does four things in order:

- **File Format**: Creates a CSV file format (`csv_ff`) that skips the header row and detects compression automatically.
- **Stage**: Creates the stage `s3_fawazghali_load` with the comment `fawazghali_db S3 Stage Connection`, the S3 URL `s3://sfquickstarts/hol_snowflake_cortex_ml_for_sql/`, and `csv_ff` as its file format, so staged files are parsed correctly.
- **Table**: Defines the `bank_marketing` table, whose columns match the CSV file.
- **Load**: Runs `COPY INTO` to ingest `customers.csv` from the stage into the table.

The external stage gives Snowflake access to the files in the S3 bucket, which is a prerequisite for ingesting them into Snowflake tables.

In [None]:
-- Create a csv file format to be used to ingest from the stage: 
CREATE OR REPLACE FILE FORMAT fawazghali_db.fawazghali_schema.csv_ff
    TYPE = 'csv'
    SKIP_HEADER = 1
    COMPRESSION = AUTO;

-- Create an external stage pointing to AWS S3 for loading our data:
CREATE OR REPLACE STAGE s3_fawazghali_load 
    COMMENT = 'fawazghali_db S3 Stage Connection'
    URL = 's3://sfquickstarts/hol_snowflake_cortex_ml_for_sql/'
    FILE_FORMAT = fawazghali_db.fawazghali_schema.csv_ff;

-- Define our table schema
CREATE OR REPLACE TABLE fawazghali_db.fawazghali_schema.bank_marketing(
    CUSTOMER_ID TEXT,
    AGE NUMBER,
    JOB TEXT, 
    MARITAL TEXT, 
    EDUCATION TEXT, 
    DEFAULT TEXT, 
    HOUSING TEXT, 
    LOAN TEXT, 
    CONTACT TEXT, 
    MONTH TEXT, 
    DAY_OF_WEEK TEXT, 
    DURATION NUMBER(4, 0), 
    CAMPAIGN NUMBER(2, 0), 
    PDAYS NUMBER(3, 0), 
    PREVIOUS NUMBER(1, 0), 
    POUTCOME TEXT, 
    EMPLOYEE_VARIATION_RATE NUMBER(2, 1), 
    CONSUMER_PRICE_INDEX NUMBER(5, 3), 
    CONSUMER_CONFIDENCE_INDEX NUMBER(3,1), 
    EURIBOR_3_MONTH_RATE NUMBER(4, 3),
    NUMBER_EMPLOYEES NUMBER(5, 1),
    CLIENT_SUBSCRIBED BOOLEAN,
    TIMESTAMP TIMESTAMP_NTZ(9)
);

-- Ingest data from S3 into our table:
COPY INTO fawazghali_db.fawazghali_schema.bank_marketing
FROM @s3_fawazghali_load/customers.csv;
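
Before moving on, it can help to check what the stage exposes and how many rows the load produced. The following is a small sketch that uses the stage and table names from the cell above; the `loaded_rows` alias is illustrative.

```sql
-- List the files visible through the external stage
LIST @s3_fawazghali_load;

-- Count the rows that COPY INTO loaded into the target table
SELECT COUNT(*) AS loaded_rows
FROM fawazghali_db.fawazghali_schema.bank_marketing;
```

A row count of zero would suggest that the file path or file format does not match the staged data.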



## Step 3: View a Sample of the Ingested Data

In this step, we query the Snowflake table to view a sample of the data that has been ingested. This helps us verify that the data was loaded correctly from the external stage.

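- **Query**: We use a `SELECT` statement with `LIMIT 10` to retrieve 10 rows from the `bank_marketing` table. Because there is no `ORDER BY`, these are not guaranteed to be the first 10 rows of the file.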
- **Purpose**: The goal is to check if the data is available and looks as expected after ingestion.

By running this query, we can ensure that the data is properly loaded into the Snowflake table and ready for further analysis.


In [None]:
-- View a sample of the ingested data: 
SELECT * FROM fawazghali_db.fawazghali_schema.bank_marketing LIMIT 10;
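
Since the model built later treats each education group as its own series, a per-group profile gives a useful first look at the data. This is a sketch that is not part of the original notebook; it assumes only the table and columns defined in Step 2, and the output aliases are illustrative.

```sql
-- Profile each education group: volume, time span, and average contact duration
SELECT
    education,
    COUNT(*)                AS num_records,
    MIN(timestamp)          AS first_record,
    MAX(timestamp)          AS last_record,
    ROUND(AVG(duration), 1) AS avg_duration
FROM fawazghali_db.fawazghali_schema.bank_marketing
GROUP BY education
ORDER BY num_records DESC;
```

Groups with very few records or a short time span may yield less reliable results.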

## Step 4: Building the Anomaly Detection Model

In this step, we create a view containing the training data that will be used to build the anomaly detection model.

- **Training Data**: The view, named `fawazghali_anomaly_training_set`, selects data from the `bank_marketing` table.
- **Filtering Data**: The view keeps only records whose `timestamp` is more than 12 months older than the most recent record in the table. This ensures that the training data consists of historical data, while the most recent 12 months are held out for inference.
- **Purpose**: The goal is to prepare a training dataset that excludes recent data, which can be used for building the anomaly detection model.

After creating the view, we query the `fawazghali_anomaly_training_set` view to confirm the number of rows in the training set, ensuring that the dataset is properly filtered and ready for use in the model.









In [None]:
-- Create a view containing our training data
CREATE OR REPLACE VIEW fawazghali_anomaly_training_set AS (
    SELECT *
    FROM fawazghali_db.fawazghali_schema.bank_marketing
    WHERE timestamp < (SELECT MAX(timestamp) FROM fawazghali_db.fawazghali_schema.bank_marketing) - interval '12 Month'
);

-- Confirm the number of rows in the training set
SELECT COUNT(*) FROM fawazghali_anomaly_training_set;





## Step 5: Create a View for Anomaly Inference

In this step, we create a view containing the data on which we want to make inferences for anomaly detection.

- **Inference Data**: The view, named `fawazghali_anomaly_analysis_set`, selects data from the `bank_marketing` table.
- **Filtering Data**: The data is filtered to include only records where the `timestamp` is more recent than the most recent record in the `fawazghali_anomaly_training_set` view. This ensures that the inference data consists of the latest data, which has not been used in the training set.
- **Purpose**: The goal is to prepare a dataset that will be used for making predictions or detecting anomalies in the most recent data.

After creating the view, we query the `fawazghali_anomaly_analysis_set` view to confirm the number of rows in the analysis set, ensuring that the dataset is correctly filtered and ready for anomaly detection.


In [None]:

-- Create a view containing the data we want to make inferences on
CREATE OR REPLACE VIEW fawazghali_anomaly_analysis_set AS (
    SELECT *
    FROM fawazghali_db.fawazghali_schema.bank_marketing
    WHERE timestamp > (SELECT MAX(timestamp) FROM fawazghali_anomaly_training_set)
);

-- Confirm the number of rows in the analysis set
SELECT COUNT(*) FROM fawazghali_anomaly_analysis_set;
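
To verify that the training and analysis views do not overlap in time, their row counts and timestamp ranges can be compared side by side. This is a minimal sketch built on the two views defined in Steps 4 and 5; the `dataset` labels and column aliases are illustrative.

```sql
-- Compare the size and time span of the training and analysis sets
SELECT
    'training'     AS dataset,
    COUNT(*)       AS num_records,
    MIN(timestamp) AS first_record,
    MAX(timestamp) AS last_record
FROM fawazghali_anomaly_training_set
UNION ALL
SELECT
    'analysis',
    COUNT(*),
    MIN(timestamp),
    MAX(timestamp)
FROM fawazghali_anomaly_analysis_set;
```

If the split is correct, the `last_record` of the training row is earlier than the `first_record` of the analysis row.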



## Step 6: Create the Anomaly Detection Model

In this step, we create the anomaly detection model in unsupervised mode, meaning no labeled anomalies are supplied during training. The model learns the normal pattern of each series from the training data.

- **Model Creation**: We use the `CREATE OR REPLACE snowflake.ml.anomaly_detection` command to create the model, named `fawazghali_anomaly_model`. The model is built using the following parameters:
  - `INPUT_DATA`: The view `fawazghali_anomaly_training_set`, which contains the training data.
  - `SERIES_COLNAME`: The column that identifies each time series, in this case `EDUCATION`, so each education group is modeled as its own series.
  - `TIMESTAMP_COLNAME`: The column representing the timestamp, which is `TIMESTAMP`.
  - `TARGET_COLNAME`: The target variable for anomaly detection, here `DURATION`.
  - `LABEL_COLNAME`: The column containing known anomaly labels, if available. Here it is set to an empty string, which makes the model unsupervised; a label column could be passed instead if desired.

- **Time Considerations**: The creation of the model might take a few minutes, depending on the size of the warehouse and data. Please be patient during this process.

Once the model is created, it will be ready to detect anomalies in future data.









In [None]:

-- Create the model in unsupervised mode (a label column could be passed instead).
-- This could take a few minutes depending on the warehouse size; please be patient.
CREATE OR REPLACE snowflake.ml.anomaly_detection fawazghali_anomaly_model(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'fawazghali_anomaly_training_set'),
    SERIES_COLNAME => 'EDUCATION',
    TIMESTAMP_COLNAME => 'TIMESTAMP',
    TARGET_COLNAME => 'DURATION',
    LABEL_COLNAME => ''
); 
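
Once the statement completes, one way to confirm that the model object exists is to list the anomaly detection models visible in the current context. This is a small sketch that is not part of the original notebook; the scope of the listing depends on the database and schema in use.

```sql
-- List anomaly detection models; fawazghali_anomaly_model should appear in the output
SHOW SNOWFLAKE.ML.ANOMALY_DETECTION;
```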



## Step 7: Call the Anomaly Detection Model and Store Results

In this step, we call the anomaly detection model to identify anomalies in the data and store the results in a table.

- **Model Call**: The `DETECT_ANOMALIES` function is invoked with the following parameters:
  - `INPUT_DATA`: The view `fawazghali_anomaly_analysis_set`, which contains the data for inference.
  - `SERIES_COLNAME`: The column that identifies each time series, in this case `EDUCATION`.
  - `TIMESTAMP_COLNAME`: The column representing the timestamp, which is `TIMESTAMP`.
  - `TARGET_COLNAME`: The target variable for anomaly detection, here `DURATION`.
  - `CONFIG_OBJECT`: An object specifying additional configuration options. Here we set `prediction_interval` to `0.95`; a higher value flags fewer observations as anomalies, and a lower value flags more.

- **Storing Results**: After the model runs, the results are stored in a table named `fawazghali_anomalies`. We use `RESULT_SCAN(-1)` to retrieve the output of the most recent query in the session, so the `CREATE TABLE` statement must run immediately after the `CALL`, with no other query in between.

- **Querying Anomalies**: We then query the `fawazghali_anomalies` table to identify the series with the highest number of anomalies, specifically those with `is_anomaly = 1`. The result is grouped and ordered to find the series with the most detected anomalies.

This process allows us to detect and review anomalies in the latest data based on the trained model.


In [None]:

-- Call the model and store the results in a table.
-- This could take a few minutes depending on the warehouse size; please be patient.
CALL fawazghali_anomaly_model!DETECT_ANOMALIES(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'fawazghali_anomaly_analysis_set'),
    SERIES_COLNAME => 'EDUCATION',
    TIMESTAMP_COLNAME => 'TIMESTAMP',
    TARGET_COLNAME => 'DURATION',
    CONFIG_OBJECT => {'prediction_interval': 0.95}
);


-- Create a table from the results
CREATE OR REPLACE TABLE fawazghali_anomalies AS (
    SELECT *
    FROM TABLE(RESULT_SCAN(-1))
);



-- Find the series with the highest number of detected anomalies
SELECT series, is_anomaly, COUNT(is_anomaly) AS num_records
FROM fawazghali_anomalies
WHERE is_anomaly = 1
GROUP BY ALL
ORDER BY num_records DESC
LIMIT 1;
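
A raw count favors series with more records, so it can also be useful to look at the anomaly rate per series and at the flagged rows themselves. The sketch below is not part of the original notebook. It assumes the results table has the output columns documented for `DETECT_ANOMALIES` (`series`, `ts`, `y`, `forecast`, `lower_bound`, `upper_bound`, `is_anomaly`); the aliases are illustrative.

```sql
-- Anomaly rate per series, so that large and small groups are comparable
SELECT
    series,
    COUNT(*)                                        AS num_records,
    COUNT_IF(is_anomaly)                            AS num_anomalies,
    ROUND(100 * COUNT_IF(is_anomaly) / COUNT(*), 2) AS anomaly_pct
FROM fawazghali_anomalies
GROUP BY series
ORDER BY anomaly_pct DESC;

-- Inspect the flagged rows next to the forecast and the prediction interval bounds
SELECT series, ts, y, forecast, lower_bound, upper_bound
FROM fawazghali_anomalies
WHERE is_anomaly
ORDER BY series, ts;
```

In the second query, each flagged `y` value should fall outside the range between `lower_bound` and `upper_bound`.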

# Conclusion  

In this notebook, we explored **Anomaly Detection** using **Snowflake ML Functions**, a toolset designed to identify **outliers** in datasets efficiently. We examined how Snowflake's built-in functions simplify anomaly detection in **time-series** data, supporting **data integrity** and **model reliability**.  

## Key takeaways:  
- **Anomaly detection** helps in identifying data points that significantly deviate from expected patterns.  
- **Snowflake ML Functions** provide an effortless and scalable approach to implementing anomaly detection.  
- **Practical use case**: We demonstrated anomaly detection on a **bank marketing dataset**, showing how Snowflake can help uncover outliers in real-world data.  

By leveraging Snowflake's capabilities, organizations can **automate anomaly detection**, enhance **data-driven decision-making**, and ensure **high-quality insights**.  

## Resources  

To explore further, refer to the following resources:  

1. **Snowflake Quickstarts**: Hands-on guides for implementing ML solutions in Snowflake.  
   - [Quickstarts](https://quickstarts.snowflake.com/)  

2. **Anomaly Detection ML Functions Documentation**: Official documentation covering Snowflake's anomaly detection features.  
   - [Anomaly Detection ML Functions](https://docs.snowflake.com/en/user-guide/ml-functions/anomaly-detection)  

3. **SQL Reference for Anomaly Detection**: Detailed SQL syntax and examples for implementing anomaly detection in Snowflake.  
   - [SQL Reference for Anomaly Detection](https://docs.snowflake.com/en/sql-reference/classes/anomaly_detection)  