In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Anomaly detection in security logs with BQML


<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/notebook_template.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/notebook_template.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

**_NOTE_**: This notebook has been tested in the following environment:

* Python version = 3.9

## Overview

This Colab notebook demonstrates how to use BigQuery ML to detect anomalies in Cloud Audit logs. We'll use two pre-built ML models for unsupervised anomaly detection, K-means clustering and autoencoders, to identify outliers such as uncommon API usage by a user identity. Identifying anomalies in audit logs is critical for cloud administrators and operators who need to detect potential threats, from privilege escalation to API abuse.

### Objective

In this tutorial, you learn how to:

* Apply feature engineering by preprocessing Cloud Audit logs
* Use BigQuery ML for unsupervised anomaly detection in Cloud Audit logs
* Train and evaluate ML models such as K-means clustering and Autoencoders
* Extract and analyze outliers

This tutorial uses the following Google Cloud ML services and resources:

- BigQuery
- Cloud Storage
- Log Analytics

### Prerequisite
If you haven't already done so, [upgrade your existing log bucket](https://cloud.google.com/logging/docs/buckets#upgrade-bucket) to use Log Analytics, which provides a linked BigQuery dataset containing your own queryable log data. This is the only requirement, and it is a **one-click step without incurring additional costs**. By default, Cloud Audit Admin Activity logs are enabled, ingested, and stored in every project's `_Required` bucket without any charges.

![one click prerequisite](https://services.google.com/fh/files/misc/upgrade_log_bucket.png)

### Dataset

For this notebook, you will analyze your own Cloud Audit logs, such as Admin Activity logs, which are enabled and stored by default in every Google Cloud project. Unlike synthetic data, your own real data will give you actual insights, but your results will differ from the sample outputs shown here.

### Costs


This tutorial uses billable components of Google Cloud:

* BigQuery

Learn about [BigQuery pricing](https://cloud.google.com/bigquery/pricing)
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the BigQuery API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [4]:
PROJECT_ID = "cloud-sa-ml"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}
%env GOOGLE_CLOUD_PROJECT=$PROJECT_ID
!echo project_id = $PROJECT_ID > ~/.bigqueryrc

Updated property [core/project].
env: GOOGLE_CLOUD_PROJECT=cloud-sa-ml


#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [5]:
REGION = "us-central1"  # @param {type: "string"}

Provide the Project, BigQuery dataset & BigQuery table where the audit logs are stored. You can find the linked BigQuery dataset ID for your log bucket from the [Logs Storage page](https://console.cloud.google.com/logs/storage).

In [6]:
logSourceProject = "sd-uxr-001"  # @param {type:"string"} custom
logSourceBqDataset = "required_bucket"  # @param {type:"string"} custom
logSourceBqTable = "_AllLogs"  # @param {type:"string"} custom

Provide the BigQuery dataset and table where the preprocessed training dataset will be stored.

In [7]:
BQ_DATASET_NAME = "classical_ml_approach"  # @param {type:"string"} custom
BQ_TABLE_NAME = "training_data"  # @param {type:"string"} custom

Provide the BQML model names. These models will be saved in the BigQuery dataset specified above.

In [8]:
KMEANS_MODEL = "KMEANS_HTUNE"  # @param {type:"string"} custom
AUTO_ENCODER_MODEL = "autoencoder_feb4"  # @param {type:"string"} custom

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [9]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [10]:
from google.colab import auth

auth.authenticate_user()

### Import libraries

In [None]:
import time

from google.cloud import bigquery

## Training Data Preparation and Analysis

Cloud Audit logs contain a wealth of important information, but their volume, velocity, and variety make them challenging to analyze at scale. Each log entry has a relatively [complex schema](https://cloud.google.com/logging/docs/reference/v2/rest/v2/LogEntry), which makes the logs even harder to analyze in their raw format.

Before running the ML models, you extract the relevant fields from these logs and aggregate (count) the **actions** by **day**, **actor**, **action**, and **source IP**. Because we're primarily interested in detecting anomalous user behavior, each of those features is relevant, and together they are sufficient for our analysis.

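As a rough illustration of the aggregation described above, the following pure-Python sketch counts occurrences per (day, actor, action, source IP) combination. The log entries and field names here are hypothetical and heavily simplified; the actual extraction in this notebook is done in SQL against the full log schema.

```python
from collections import Counter

# Hypothetical, simplified log entries; real Cloud Audit log entries have many more fields.
entries = [
    {"day": "2024-02-02", "actor": "alice@example.com",
     "action": "v1.compute.instances.insert", "ip": "203.0.113.7"},
    {"day": "2024-02-02", "actor": "alice@example.com",
     "action": "v1.compute.instances.insert", "ip": "203.0.113.7"},
    {"day": "2024-02-02", "actor": "bob@example.com",
     "action": "google.iam.admin.v1.CreateServiceAccount", "ip": "198.51.100.4"},
]

# Count how often each (day, actor, action, ip) combination occurs,
# similar in spirit to GROUP BY ... COUNT(*) in SQL.
counts = Counter((e["day"], e["actor"], e["action"], e["ip"]) for e in entries)

for (day, actor, action, ip), counter in sorted(counts.items()):
    print(day, actor, action, ip, counter)
```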
In [None]:
# Helper function: runs the SQL query, waits for it to complete, and returns the results as a DataFrame
def execute_sql(sql_query: str):
    """Executes the SQL query and returns the results.

    Args:
        sql_query (str): SQL query to execute.

    Returns:
        A pandas DataFrame with the results, or None if the query returns no rows.
    """
    import traceback

    from google.cloud import bigquery

    try:
        client = bigquery.Client()
        start = time.time()
        query_job = client.query(sql_query)  # Make an API request.
        print("Query Executed.Waiting for completion")
        results = query_job.result()  # Waits for job to complete.
        end = time.time()
        print("Query Execution completed")
        print("Time taken to execute:", end - start)
        if results.total_rows > 0:
            return results.to_dataframe()
        return None
    except Exception as e:
        print(traceback.format_exc())
        raise RuntimeError(f"Can't execute the query {sql_query}") from e

The following UDF extracts the ID of the resource that was acted on, according to the audit log entry. In an audit log entry, the resource ID is stored in a different resource label field depending on the resource type, so this UDF is needed to normalize the resource ID into a single field.

In [None]:
# Deduce resource ID from a log entry resource field
UDF_NAME = "getResourceId"

sql = """
CREATE OR REPLACE FUNCTION `{}.{}.{}`(
  type STRING,
  labels JSON
)
RETURNS STRING
AS (
 COALESCE(
  JSON_VALUE(labels.email_id),     # service_account
  JSON_VALUE(labels.pod_id),       # container
  JSON_VALUE(labels.instance_id),  # gce_instance, spanner_instance, redis_instance, ...
  JSON_VALUE(labels.subnetwork_id),# gce_subnetwork,
  JSON_VALUE(labels.network_id),   # gce_network, gce_network_region, ...
  JSON_VALUE(labels.topic_id),     # pubsub_topic
  JSON_VALUE(labels.subscription_id), # pubsub_subscription
  JSON_VALUE(labels.endpoint_id),  # aiplatform.googleapis.com/Endpoint
  JSON_VALUE(labels.job_id),       # dataflow_step
  JSON_VALUE(labels.dataset_id),   # bigquery_dataset
  JSON_VALUE(labels.project_id),
  JSON_VALUE(labels.organization_id),
  JSON_VALUE(labels.id),
  "other")
);""".format(
    PROJECT_ID, BQ_DATASET_NAME, UDF_NAME
)

execute_sql(sql)
print(f"Created UDF {PROJECT_ID}.{BQ_DATASET_NAME}.{UDF_NAME}")

Query Executed.Waiting for completion
Query Execution completed
Time taken to execute: 1.0944032669067383
Created UDF cloud-sa-ml.classical_ml_approach.getResourceId

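To make the `COALESCE` logic in `getResourceId` more concrete, here is a minimal Python sketch of the same idea: return the first label that is present, in priority order, and fall back to `"other"`. The function name `get_resource_id` and the sample labels are illustrative assumptions, not part of the notebook.

```python
from typing import Optional

# Label keys in the same priority order as the COALESCE call in the UDF.
LABEL_PRIORITY = [
    "email_id", "pod_id", "instance_id", "subnetwork_id", "network_id",
    "topic_id", "subscription_id", "endpoint_id", "job_id", "dataset_id",
    "project_id", "organization_id", "id",
]


def get_resource_id(labels: Optional[dict]) -> str:
    """Returns the first label value found in priority order, or "other"."""
    labels = labels or {}
    for key in LABEL_PRIORITY:
        value = labels.get(key)
        if value is not None:
            return value
    return "other"


# instance_id ranks above project_id, so it wins when both are present.
print(get_resource_id({"project_id": "my-project", "instance_id": "1234"}))
print(get_resource_id({}))
```

Because `project_id` sits near the end of the list, it only becomes the resource ID when no more specific label is available.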

The following UDF deduces where a user or system action originated, based on the audit log entry. For example, an action may have occurred through the Cloud Console, the gcloud CLI, a Terraform script, or another unknown client or channel.

In [None]:
# Deduce channel from a log entry request user agent
UDF_NAME = "getChannelType"

sql = """CREATE OR REPLACE FUNCTION `{}.{}.{}`(
  caller_supplied_user_agent STRING
)
RETURNS STRING
AS (
  CASE
    WHEN caller_supplied_user_agent LIKE "Mozilla/%" THEN 'Cloud Console'
    WHEN caller_supplied_user_agent LIKE "google-cloud-sdk gcloud/%" THEN 'gcloud CLI'
    WHEN caller_supplied_user_agent LIKE "google-api-go-client/% Terraform/%" THEN 'Terraform'
    ELSE 'other'
  END
);""".format(
    PROJECT_ID, BQ_DATASET_NAME, UDF_NAME
)

execute_sql(sql)
print(f"Created UDF {PROJECT_ID}.{BQ_DATASET_NAME}.{UDF_NAME}")

Query Executed.Waiting for completion
Query Execution completed
Time taken to execute: 0.9740383625030518
Created UDF cloud-sa-ml.classical_ml_approach.getChannelType

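The `LIKE` patterns in `getChannelType` can be read as prefix and substring checks. The following Python sketch mirrors them; the function name `get_channel_type` and the sample user agent strings are hypothetical and only meant to show how each branch is reached.

```python
from typing import Optional


def get_channel_type(user_agent: Optional[str]) -> str:
    """Maps a caller-supplied user agent to a coarse channel label."""
    if not user_agent:
        return "other"  # a NULL user agent falls through to ELSE in the UDF
    if user_agent.startswith("Mozilla/"):
        return "Cloud Console"
    if user_agent.startswith("google-cloud-sdk gcloud/"):
        return "gcloud CLI"
    prefix = "google-api-go-client/"
    # Equivalent of LIKE "google-api-go-client/% Terraform/%"
    if user_agent.startswith(prefix) and " Terraform/" in user_agent[len(prefix):]:
        return "Terraform"
    return "other"


# Hypothetical user agent strings.
for ua in [
    "Mozilla/5.0 (X11; Linux x86_64)",
    "google-cloud-sdk gcloud/460.0.0 command/gcloud.iam",
    "google-api-go-client/0.5 Terraform/1.5.0",
    "curl/8.0",
]:
    print(ua, "->", get_channel_type(ua))
```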

Query the log source to extract the training data with fields of interest

In [None]:
# Query to extract training data with fields of interest
query_str = """ SELECT
    EXTRACT(DATE FROM timestamp) AS day,
    IFNULL(proto_payload.audit_log.authentication_info.principal_email, "unknown") as principal_email,
    IFNULL(proto_payload.audit_log.method_name, "unknown") as action,
    IFNULL(resource.type, "unknown") as resource_type,
    {3}.getResourceId(resource.type, resource.labels) AS resource_id,
    -- proto_payload.audit_log.resource_name as resource_name,
    SPLIT(log_name, '/')[SAFE_OFFSET(0)] as container_type,
    SPLIT(log_name, '/')[SAFE_OFFSET(1)] as container_id,
    {3}.getChannelType(proto_payload.audit_log.request_metadata.caller_supplied_user_agent) AS channel,
    IFNULL(proto_payload.audit_log.request_metadata.caller_ip, "unknown") as ip,
    COUNT(*) counter,
    -- ANY_VALUE(resource) as resource,           -- for debugging
    -- ANY_VALUE(proto_payload) as proto_payload  -- for debugging
  FROM  `{0}.{1}.{2}`
  WHERE
    -- log_id = "cloudaudit.googleapis.com/activity" AND
    timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 360 DAY)
  GROUP BY
    day, principal_email, action, resource_type, resource_id, container_type, container_id, channel, ip, log_name
  ORDER BY
    day DESC, principal_email, action""".format(
    logSourceProject, logSourceBqDataset, logSourceBqTable, BQ_DATASET_NAME
)

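In the query above, `container_type` and `container_id` come from splitting `log_name` on `/`. As a minimal Python sketch of that step (the log name is a hypothetical example, and `safe_offset` is an illustrative helper that mimics `SAFE_OFFSET` by returning `None` instead of failing on an out-of-range index):

```python
def safe_offset(parts, index):
    """Returns parts[index], or None when the index is out of range."""
    return parts[index] if 0 <= index < len(parts) else None


# Hypothetical log name in the projects/[PROJECT_ID]/logs/[LOG_ID] format.
log_name = "projects/my-project/logs/cloudaudit.googleapis.com%2Factivity"
parts = log_name.split("/")

container_type = safe_offset(parts, 0)
container_id = safe_offset(parts, 1)
print(container_type, container_id)
```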
Let's view the training data as a DataFrame.

In [None]:
client = bigquery.Client(project=PROJECT_ID)
df = client.query(query_str).to_dataframe()
df.head()

Unnamed: 0,day,principal_email,action,resource_type,resource_id,container_type,container_id,channel,ip,counter
0,2024-02-02,cbaer@google.com,google.cloud.bigquery.v2.JobService.InsertJob,bigquery_dataset,cbaer_test,projects,sd-uxr-001,other,unknown,1
1,2024-02-02,cbaer@google.com,google.cloud.bigquery.v2.JobService.InsertJob,bigquery_dataset,scheduledquery_cbaer,projects,sd-uxr-001,other,unknown,45
2,2024-02-02,p971828084600-050786@gcp-sa-logging.iam.gservi...,google.cloud.bigquery.v2.TableService.PatchTable,bigquery_dataset,logging_export,projects,sd-uxr-001,other,private,3
3,2024-02-02,p971828084600-050786@gcp-sa-logging.iam.gservi...,tableservice.update,bigquery_resource,sd-uxr-001,projects,sd-uxr-001,other,private,3
4,2024-02-02,service-971828084600@container-engine-robot.ia...,v1.compute.addresses.insert,gce_reserved_address,sd-uxr-001,projects,sd-uxr-001,other,108.59.87.52,270


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49495 entries, 0 to 49494
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   day              49495 non-null  dbdate
 1   principal_email  49495 non-null  object
 2   action           49495 non-null  object
 3   resource_type    49495 non-null  object
 4   resource_id      49495 non-null  object
 5   container_type   49495 non-null  object
 6   container_id     49495 non-null  object
 7   channel          49495 non-null  object
 8   ip               49495 non-null  object
 9   counter          49495 non-null  Int64 
dtypes: Int64(1), dbdate(1), object(8)
memory usage: 3.8+ MB


Create a table in BQ with the extracted data

In [None]:
create_training_data_table = (
    """ CREATE OR REPLACE TABLE `{}.{}.{}` AS""".format(
        PROJECT_ID, BQ_DATASET_NAME, BQ_TABLE_NAME
    )
    + query_str
)
query_job = client.query(create_training_data_table)
query_job.result()  # Wait for the table to be created before training on it
query_job

QueryJob<project=cloud-sa-ml, location=US, id=b0ce84ec-6a42-493b-b2e5-5d75664f8157>

## K-Means Clustering

Let's create K-means clusters from the training data.

### Model Training

In [None]:
train_kmeans = """CREATE MODEL IF NOT EXISTS `{0}.{1}`
OPTIONS(MODEL_TYPE = 'KMEANS',
NUM_CLUSTERS = HPARAM_RANGE(2, 10),
KMEANS_INIT_METHOD = 'KMEANS++',
DISTANCE_TYPE = 'COSINE',
STANDARDIZE_FEATURES = TRUE,
MAX_ITERATIONS = 10,
EARLY_STOP = TRUE,
NUM_TRIALS = 10
) AS
SELECT * FROM `{0}.{2}`;""".format(
    BQ_DATASET_NAME, KMEANS_MODEL, BQ_TABLE_NAME
)

In [None]:
# Train the K-means model with hyperparameter tuning. This takes about 16 minutes.
execute_sql(train_kmeans)

Query Executed.Waiting for completion
Query Execution completed
Time taken to execute: 983.8348269462585

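The model above is trained with `DISTANCE_TYPE = 'COSINE'`. As a reminder of what that metric measures, here is a small standalone sketch of cosine distance between two numeric vectors. It only illustrates the formula (one minus the cosine similarity) and does not reproduce how BigQuery ML encodes or standardizes the features internally.

```python
import math


def cosine_distance(a, b):
    """Returns 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Vectors pointing in the same direction are close regardless of magnitude.
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))
# Orthogonal vectors are far apart.
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```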

### Model Evaluation

In [None]:
eval_kmeans = """SELECT * FROM ML.EVALUATE(MODEL `{}.{}`);""".format(
    BQ_DATASET_NAME, KMEANS_MODEL
)
model_evaluation = execute_sql(eval_kmeans)
model_evaluation

Query Executed.Waiting for completion
Query Execution completed
Time taken to execute: 1.3773245811462402


Unnamed: 0,trial_id,davies_bouldin_index,mean_squared_distance
0,1,2.566582,0.216426
1,2,2.59401,0.219378
2,3,2.69306,0.219973
3,4,2.353669,0.200104
4,5,2.508562,0.198849
5,6,2.715906,0.207692
6,7,1.602581,0.246144
7,8,2.547436,0.227636
8,9,2.289053,0.235305

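A lower Davies-Bouldin index indicates more compact, better-separated clusters, so one way to read the table above is to look for the trial with the smallest value. The sketch below does that over the values copied from the evaluation output; it is only a reading aid, not part of the notebook's pipeline.

```python
# (trial_id, davies_bouldin_index) pairs copied from the evaluation output above.
trials = [
    (1, 2.566582), (2, 2.59401), (3, 2.69306),
    (4, 2.353669), (5, 2.508562), (6, 2.715906),
    (7, 1.602581), (8, 2.547436), (9, 2.289053),
]

# Pick the trial with the lowest Davies-Bouldin index.
best_trial_id, best_dbi = min(trials, key=lambda t: t[1])
print(best_trial_id, best_dbi)  # 7 1.602581
```

Trial 7 is also the `trial_id` that appears in the K-means outlier results in this notebook.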

### Outlier Analysis

In [None]:
# --- DETECT ANOMALIES --- #
detect_anomaly = """SELECT * FROM ML.DETECT_ANOMALIES(MODEL `{0}.{1}.{2}`,
STRUCT(0.001 AS contamination),
TABLE `{0}.{1}.{3}`)
WHERE is_anomaly=true
ORDER BY normalized_distance DESC;""".format(
    PROJECT_ID, BQ_DATASET_NAME, KMEANS_MODEL, BQ_TABLE_NAME
)

kmeans_outliers = execute_sql(detect_anomaly)

Query Executed.Waiting for completion
Query Execution completed
Time taken to execute: 1.1434009075164795


In [None]:
kmeans_outliers

Unnamed: 0,trial_id,is_anomaly,normalized_distance,CENTROID_ID,day,principal_email,action,resource_type,resource_id,container_type,container_id,channel,ip,counter
0,7,True,1.576632,1,2023-04-27,cbaer@google.com,google.iam.admin.v1.SetIAMPolicy,service_account,rarsan-service-account@sd-uxr-001.iam.gservice...,projects,sd-uxr-001,gcloud CLI,34.67.91.162,1
1,7,True,1.568558,1,2023-04-27,cbaer@google.com,google.iam.admin.v1.CreateServiceAccount,service_account,rarsan-service-account@sd-uxr-001.iam.gservice...,projects,sd-uxr-001,Cloud Console,2620:15c:170:110:d019:7321:4b58:10f9,1
2,7,True,1.568553,1,2023-04-27,cbaer@google.com,google.iam.admin.v1.SetIAMPolicy,service_account,rarsan-service-account@sd-uxr-001.iam.gservice...,projects,sd-uxr-001,Cloud Console,2620:15c:170:110:c4c:8e1a:3d36:3cdb,1
3,7,True,1.567726,1,2023-02-10,hutz@google.com,google.cloud.bigquery.v2.JobService.InsertJob,bigquery_dataset,looker,projects,sd-uxr-001,Cloud Console,73.243.78.216,2
4,7,True,1.567626,1,2023-02-15,hutz@google.com,google.cloud.bigquery.v2.JobService.InsertJob,bigquery_dataset,looker,projects,sd-uxr-001,Cloud Console,73.243.78.216,1
5,7,True,1.567529,1,2023-05-15,hutz@google.com,google.cloud.bigquery.v2.JobService.InsertJob,bigquery_dataset,looker,projects,sd-uxr-001,Cloud Console,73.243.78.216,2
6,7,True,1.567072,1,2023-08-10,cbaer@google.com,beta.compute.subnetworks.insert,gce_subnetwork,7756423237025198361,projects,sd-uxr-001,Cloud Console,73.117.191.67,2
7,7,True,1.567036,1,2023-06-12,mkoes@google.com,beta.compute.instances.setLabels,gce_instance,8951054702375589002,projects,sd-uxr-001,Cloud Console,2620:15c:170:110:b8aa:a7b5:f5ca:8108,2
8,7,True,1.567027,1,2023-02-15,sbrunner@google.com,v1.compute.instances.setMetadata,gce_instance,3710254939720635530,projects,sd-uxr-001,Cloud Console,2600:4041:1c4:8600:b49d:2395:240e:4fc6,2
9,7,True,1.567,1,2023-05-31,cbaer@google.com,v1.compute.subnetworks.patch,gce_subnetwork,3635102086349064406,projects,sd-uxr-001,Cloud Console,73.117.191.67,2

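The `contamination` value passed to `ML.DETECT_ANOMALIES` controls what fraction of rows is treated as anomalous. The following sketch illustrates the general idea with a simple cutoff on hypothetical distance values: flag roughly the top `contamination` fraction of rows. The helper `flag_anomalies` is an assumption made for illustration, and BigQuery ML's exact threshold computation may differ.

```python
import math


def flag_anomalies(distances, contamination):
    """Flags roughly the top `contamination` fraction of rows by distance."""
    n_anomalies = math.ceil(len(distances) * contamination)
    if n_anomalies == 0:
        return [False] * len(distances)
    # The cutoff is the n-th largest distance.
    threshold = sorted(distances, reverse=True)[n_anomalies - 1]
    return [d >= threshold for d in distances]


# Hypothetical normalized distances for ten rows.
distances = [0.1, 0.2, 0.3, 0.9, 0.15, 0.95, 0.05, 0.4, 0.25, 0.35]
flags = flag_anomalies(distances, contamination=0.2)
print([d for d, is_anomaly in zip(distances, flags) if is_anomaly])
```

With the 49,495 training rows and a contamination of 0.001 used in this notebook, a cutoff of this kind would flag on the order of 50 rows.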

## Autoencoders

### Model Training

In [None]:
train_auto_encoder = """
CREATE MODEL IF NOT EXISTS `{0}.{1}`
OPTIONS(
MODEL_TYPE='autoencoder',
L1_REG_ACTIVATION = HPARAM_CANDIDATES([0.001, 0.01, 0.1]),
LEARN_RATE = HPARAM_CANDIDATES([0.001, 0.01, 0.1]),
OPTIMIZER = HPARAM_CANDIDATES(['ADAGRAD', 'ADAM', 'FTRL', 'RMSPROP', 'SGD']),
ACTIVATION_FN='relu',
BATCH_SIZE = HPARAM_CANDIDATES([16, 32, 64]),
DROPOUT = HPARAM_CANDIDATES([0.1, 0.2]),
HIDDEN_UNITS=HPARAM_CANDIDATES([struct([[16, 8, 4, 8, 16]]), struct([[32, 16, 4, 16, 32]])]),
TF_VERSION = '2.8.0',
EARLY_STOP = TRUE,
MIN_REL_PROGRESS = 0.01,
MAX_ITERATIONS=20,
WARM_START = TRUE,
NUM_TRIALS = 60,
MAX_PARALLEL_TRIALS = 1,
HPARAM_TUNING_ALGORITHM =  'VIZIER_DEFAULT',
HPARAM_TUNING_OBJECTIVES = MEAN_SQUARED_ERROR
) AS
SELECT
*
FROM `{0}.{2}`;""".format(
    BQ_DATASET_NAME, AUTO_ENCODER_MODEL, BQ_TABLE_NAME
)

In [None]:
execute_sql(train_auto_encoder)

Query Executed.Waiting for completion
Query Execution completed
Time taken to execute: 286.75821232795715


### Model Evaluation

In [None]:
eval_auto_encoder = """SELECT * FROM ML.EVALUATE(MODEL `{}.{}`);""".format(
    BQ_DATASET_NAME, AUTO_ENCODER_MODEL
)
model_evaluation = execute_sql(eval_auto_encoder)
model_evaluation

Query Executed.Waiting for completion
Query Execution completed
Time taken to execute: 1.7523226737976074


Unnamed: 0,trial_id,mean_absolute_error,mean_squared_error,mean_squared_log_error
0,1,0.002184,0.002253,0.001007
1,2,0.002172,0.002244,0.001003
2,3,0.001431,0.00148,0.000636
3,4,0.001584,0.0015,0.000641
4,5,0.001379,0.001421,0.000607
5,6,0.001443,0.001469,0.00063
6,7,0.001496,0.001334,0.000599
7,8,0.001668,0.001709,0.000746
8,9,0.001624,0.001546,0.000656
9,10,0.001756,0.001752,0.000761


### Outlier Analysis

In [None]:
# --- DETECT ANOMALIES --- #
detect_anomaly_auto_encoder = """SELECT * FROM ML.DETECT_ANOMALIES(MODEL `{0}.{1}.{2}`,
STRUCT(0.001 AS contamination),
TABLE `{0}.{1}.{3}`)
WHERE is_anomaly=true order by mean_squared_error desc;""".format(
    PROJECT_ID, BQ_DATASET_NAME, AUTO_ENCODER_MODEL, BQ_TABLE_NAME
)
# print(detect_anomaly_auto_encoder)
autoencoder_outliers = execute_sql(detect_anomaly_auto_encoder)

Query Executed.Waiting for completion
Query Execution completed
Time taken to execute: 1.3067491054534912


In [None]:
autoencoder_outliers

Unnamed: 0,trial_id,is_anomaly,mean_squared_error,day,principal_email,action,resource_type,resource_id,container_type,container_id,channel,ip,counter
0,31,True,0.002245,2023-05-02,971828084600@cloudservices.gserviceaccount.com,google.pubsub.v1.Publisher.CreateTopic,pubsub_topic,projects/sd-uxr-001/topics/us-east1-test-compo...,projects,sd-uxr-001,other,66.102.7.35,1
1,31,True,0.002244,2023-05-02,971828084600@cloudservices.gserviceaccount.com,google.pubsub.v1.Publisher.UpdateTopic,pubsub_topic,projects/sd-uxr-001/topics/us-east1-test-compo...,projects,sd-uxr-001,other,66.102.7.45,1
2,31,True,0.002244,2023-09-04,service-971828084600@container-analysis.iam.gs...,google.pubsub.v1.Publisher.CreateTopic,pubsub_topic,projects/sd-uxr-001/topics/container-analysis-...,projects,sd-uxr-001,other,private,1
3,31,True,0.002244,2023-09-04,service-971828084600@container-analysis.iam.gs...,google.pubsub.v1.Publisher.CreateTopic,pubsub_topic,projects/sd-uxr-001/topics/container-analysis-...,projects,sd-uxr-001,other,private,1
4,31,True,0.002244,2023-09-04,service-971828084600@container-analysis.iam.gs...,google.pubsub.v1.Publisher.CreateTopic,pubsub_topic,projects/sd-uxr-001/topics/container-analysis-...,projects,sd-uxr-001,other,private,1
5,31,True,0.002244,2023-05-02,971828084600@cloudservices.gserviceaccount.com,google.pubsub.v1.Publisher.CreateTopic,pubsub_topic,projects/sd-uxr-001/topics/us-east1-test-compo...,projects,sd-uxr-001,other,66.102.7.61,1
6,31,True,0.002244,2023-05-02,971828084600@cloudservices.gserviceaccount.com,google.pubsub.v1.Publisher.UpdateTopic,pubsub_topic,projects/sd-uxr-001/topics/us-east1-test-compo...,projects,sd-uxr-001,other,66.102.7.45,1
7,31,True,0.002244,2023-05-02,971828084600@cloudservices.gserviceaccount.com,google.pubsub.v1.Publisher.UpdateTopic,pubsub_topic,projects/sd-uxr-001/topics/us-east1-test-compo...,projects,sd-uxr-001,other,66.102.7.46,1
8,31,True,0.002244,2023-05-02,971828084600@cloudservices.gserviceaccount.com,google.pubsub.v1.Subscriber.UpdateSubscription,pubsub_subscription,projects/sd-uxr-001/subscriptions/us-east1-tes...,projects,sd-uxr-001,other,74.125.212.70,1
9,31,True,0.002244,2023-05-02,971828084600@cloudservices.gserviceaccount.com,google.pubsub.v1.Subscriber.CreateSubscription,pubsub_subscription,projects/sd-uxr-001/subscriptions/us-east1-tes...,projects,sd-uxr-001,other,66.102.7.39,1

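The `mean_squared_error` column above is the reconstruction error: an autoencoder learns to reproduce typical inputs, so rows it reconstructs poorly score higher. The sketch below shows the calculation on hypothetical encoded feature vectors and made-up reconstructions. It illustrates the metric only, not BigQuery ML's internal feature encoding.

```python
def mean_squared_error(original, reconstructed):
    """Average squared difference between an input and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


# Hypothetical encoded rows and their reconstructions.
typical = mean_squared_error([0.0, 1.0, 0.0, 0.2], [0.0, 0.98, 0.01, 0.21])
unusual = mean_squared_error([1.0, 0.0, 0.0, 0.9], [0.4, 0.3, 0.1, 0.5])

# The poorly reconstructed row has the larger error and ranks as more anomalous.
print(typical, unusual, unusual > typical)
```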

## Common Outliers

Let's find the outliers reported by both BQML models.

In [None]:
df1 = kmeans_outliers[
    [
        "day",
        "principal_email",
        "action",
        "resource_type",
        "resource_id",
        "container_type",
        "container_id",
        "channel",
        "ip",
        "counter",
    ]
]
df2 = autoencoder_outliers[
    [
        "day",
        "principal_email",
        "action",
        "resource_type",
        "resource_id",
        "container_type",
        "container_id",
        "channel",
        "ip",
        "counter",
    ]
]

In [None]:
common_outliers = df1.merge(
    df2,
    how="inner",
    on=[
        "day",
        "principal_email",
        "action",
        "resource_type",
        "resource_id",
        "container_type",
        "container_id",
        "channel",
        "ip",
        "counter",
    ],
)

In [None]:
common_outliers

Unnamed: 0,day,principal_email,action,resource_type,resource_id,container_type,container_id,channel,ip,counter
0,2023-04-27,cbaer@google.com,google.iam.admin.v1.CreateServiceAccount,service_account,rarsan-service-account@sd-uxr-001.iam.gservice...,projects,sd-uxr-001,Cloud Console,2620:15c:170:110:d019:7321:4b58:10f9,1
1,2023-04-27,cbaer@google.com,google.iam.admin.v1.SetIAMPolicy,service_account,rarsan-service-account@sd-uxr-001.iam.gservice...,projects,sd-uxr-001,Cloud Console,2620:15c:170:110:c4c:8e1a:3d36:3cdb,1
2,2023-08-10,cbaer@google.com,beta.compute.subnetworks.insert,gce_subnetwork,7756423237025198361,projects,sd-uxr-001,Cloud Console,73.117.191.67,2
3,2023-06-12,mkoes@google.com,beta.compute.instances.setLabels,gce_instance,8951054702375589002,projects,sd-uxr-001,Cloud Console,2620:15c:170:110:b8aa:a7b5:f5ca:8108,2
4,2023-12-11,cbaer@google.com,google.iam.v1.IAMPolicy.SetIamPolicy,bigquery_dataset,default_demo_logs,projects,sd-uxr-001,Cloud Console,2620:15c:170:110:c464:db15:69bc:fa99,1
5,2024-01-08,cbaer@google.com,google.iam.v1.IAMPolicy.SetIamPolicy,bigquery_dataset,demo_logs,projects,sd-uxr-001,Cloud Console,2620:15c:170:110:6432:dcba:af65:ac4d,1
6,2023-09-04,service-971828084600@container-analysis.iam.gs...,google.pubsub.v1.Publisher.CreateTopic,pubsub_topic,projects/sd-uxr-001/topics/container-analysis-...,projects,sd-uxr-001,other,private,1
7,2023-09-04,service-971828084600@container-analysis.iam.gs...,google.pubsub.v1.Publisher.CreateTopic,pubsub_topic,projects/sd-uxr-001/topics/container-analysis-...,projects,sd-uxr-001,other,private,1
8,2023-09-04,service-971828084600@container-analysis.iam.gs...,google.pubsub.v1.Publisher.CreateTopic,pubsub_topic,projects/sd-uxr-001/topics/container-analysis-...,projects,sd-uxr-001,other,private,1
9,2023-09-04,service-971828084600@container-analysis.iam.gs...,google.pubsub.v1.Publisher.CreateTopic,pubsub_topic,projects/sd-uxr-001/topics/container-analysis-...,projects,sd-uxr-001,other,private,1


In [None]:
common_outliers.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16 entries, 0 to 15
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   day              16 non-null     dbdate
 1   principal_email  16 non-null     object
 2   action           16 non-null     object
 3   resource_type    16 non-null     object
 4   resource_id      16 non-null     object
 5   container_type   16 non-null     object
 6   container_id     16 non-null     object
 7   channel          16 non-null     object
 8   ip               16 non-null     object
 9   counter          16 non-null     Int64 
dtypes: Int64(1), dbdate(1), object(8)
memory usage: 1.4+ KB

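One thing to keep in mind when reading `common_outliers`: an inner join on all columns multiplies duplicate rows. If an identical row appears m times on the left and n times on the right, the join returns it m x n times. This may explain why some identical rows repeat in the output above. The sketch below demonstrates the effect in plain Python with hypothetical rows reduced to three columns; `inner_join_row_counts` is an illustrative helper, not a pandas function.

```python
from collections import Counter


def inner_join_row_counts(left_rows, right_rows):
    """Returns, per distinct row, how many rows an inner join on all columns yields."""
    left, right = Counter(left_rows), Counter(right_rows)
    return {row: left[row] * right[row] for row in left if row in right}


# Hypothetical outlier rows as (day, principal, action) tuples.
topic_row = ("2023-09-04", "svc@example.com", "CreateTopic")
policy_row = ("2023-04-27", "alice@example.com", "SetIAMPolicy")
kmeans_rows = [topic_row, topic_row, policy_row]
autoencoder_rows = [topic_row, topic_row]

joined = inner_join_row_counts(kmeans_rows, autoencoder_rows)
print(joined)  # the duplicated row appears 2 x 2 = 4 times after the join

# A set intersection gives each common outlier exactly once.
distinct_common = set(kmeans_rows) & set(autoencoder_rows)
print(distinct_common)
```

If you only need each common outlier once, one option is to call pandas `drop_duplicates()` on both DataFrames before merging.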

## Uploading detected outliers to BQ table for further analysis

In [None]:
from google.cloud import bigquery


def create_table(client, table_id, schema):
    table = bigquery.Table(table_id, schema=schema)
    table = client.create_table(table, exists_ok=True)  # Make an API request
    print(
        "Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
    )


def upload_df_into_bq(client, table_id, df, schema):
    # Overwrite the table contents with the DataFrame, using the explicit schema
    job_config = bigquery.LoadJobConfig(schema=schema)
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
    job_config.autodetect = False
    job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
    job.result()  # Wait for the load job to complete
    print("Uploaded dataframe into table {}".format(table_id))


schema = [
    bigquery.SchemaField("day", "DATE", mode="REQUIRED"),
    bigquery.SchemaField("principal_email", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("action", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("resource_type", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("resource_id", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("container_type", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("container_id", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("channel", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("ip", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("counter", "INTEGER", mode="REQUIRED"),
]
client = bigquery.Client(PROJECT_ID)

# Fully qualified table ID: project.dataset.table
table_id = f"{PROJECT_ID}.{BQ_DATASET_NAME}.common_outliers"

create_table(client, table_id, schema)

upload_df_into_bq(client, table_id, common_outliers, schema)

Created table cloud-sa-ml.classical_ml_approach.common_outliers
Uploaded dataframe into table cloud-sa-ml.classical_ml_approach.common_outliers


## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

In [15]:
# Delete the BigQuery dataset
dataset_to_be_deleted = "test"  # @param {type:"string"}

In [16]:
!bq rm -r -f {PROJECT_ID}:{dataset_to_be_deleted}