In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Identifying Customer Segments for Targeted Marketing

**The Business Challenge:** Marketing teams often need to optimize budget allocation to maximize return on investment (ROI). A one-size-fits-all marketing strategy is inefficient, leading to wasted ad spend and low engagement, as the messaging is not relevant to all recipients. The business problem is to develop a systematic, data-driven method for partitioning a customer base into distinct groups based on their behavior, enabling more effective and personalized marketing campaigns.

While basic segmentation using simple demographic data is straightforward, it often fails to capture the more nuanced differences in customer purchasing patterns. The analytical challenge is to move beyond these simple heuristics and identify meaningful segments based on complex behavioral data, such as purchase frequency, monetary value, and product category preferences. The goal is to produce segments that are not only statistically distinct but also interpretable and actionable for the marketing team.



**The Data Science Approach:** In this use case, we will combine unsupervised machine learning with generative AI to create and characterize customer segments. First, we will apply a k-means clustering algorithm directly within BigQuery ML to efficiently partition the entire customer dataset based on purchasing behavior within BigQuery.

Clustering effectively groups customers and generates a cluster ID. The second part of our approach is to automate the interpretation of these segments to provide business context. We will use a generative AI function to analyze the behavioral data of customers within each cluster and programmatically generate qualitative descriptions, including concise segment name, segment summary and tailored marketing suggestions for each segment.

In [5]:
#@title 1. Prepare Customer Data: Consolidate and clean customer purchasing data for analysis.

%%bigquery
SELECT * FROM `bigquery-public-data.thelook_ecommerce.orders` LIMIT 1000;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,order_id,user_id,status,gender,created_at,returned_at,shipped_at,delivered_at,num_of_item
0,7,6,Cancelled,F,2025-04-30 07:51:00+00:00,NaT,NaT,NaT,1
1,14,11,Cancelled,F,2025-04-11 03:11:00+00:00,NaT,NaT,NaT,1
2,34,23,Cancelled,F,2023-04-27 13:37:00+00:00,NaT,NaT,NaT,1
3,40,28,Cancelled,F,2024-10-30 10:33:00+00:00,NaT,NaT,NaT,2
4,58,42,Cancelled,F,2025-04-12 16:11:00+00:00,NaT,NaT,NaT,1
...,...,...,...,...,...,...,...,...,...
995,12387,9996,Cancelled,F,2024-05-23 09:41:00+00:00,NaT,NaT,NaT,2
996,12392,9998,Cancelled,F,2024-05-12 13:27:00+00:00,NaT,NaT,NaT,1
997,12408,10008,Cancelled,F,2023-03-17 13:19:00+00:00,NaT,NaT,NaT,1
998,12415,10015,Cancelled,F,2024-10-11 09:37:00+00:00,NaT,NaT,NaT,1


### Instructions:

*   Replace PROJECT_ID and DATASET_ID with your own project_id and daatset_id.
*   Create a connection (image_embed_conn) in the US region and give service account access by following these steps [here](https://cloud.google.com/bigquery/docs/generate-visual-content-embedding#create_a_connection).


In [None]:

%%bigquery
CREATE OR REPLACE TABLE {PROJECT_ID}.{DATASET_ID}.customer_date AS(
  SELECT
  users.id,
  SUM(order_items.sale_price) AS total_spend,
  COUNT(DISTINCT orders.order_id) AS number_of_orders,
  MAX(orders.created_at) AS last_purchase_date
FROM
  `bigquery-public-data.thelook_ecommerce.users` AS users
JOIN
  `bigquery-public-data.thelook_ecommerce.orders` AS orders
ON
  users.id = orders.user_id
JOIN
  `bigquery-public-data.thelook_ecommerce.order_items` AS order_items
ON
  orders.order_id = order_items.order_id
GROUP BY
  users.id );

Executing query with job ID: 4e1dae40-cabf-4a3d-b559-f1f7d0d0a161
Query executing: 0.19s


ERROR:
 400 Syntax error: Unexpected "{" at [1:25]; reason: invalidQuery, location: query, message: Syntax error: Unexpected "{" at [1:25]

Location: US
Job ID: 4e1dae40-cabf-4a3d-b559-f1f7d0d0a161



In [None]:
%%bigquery
SELECT * FROM `{PROJECT_ID}.{DATASET_ID}.customer_date` LIMIT 1000;

Executing query with job ID: f1791e25-84b5-44d2-88ea-277da0bac34d
Query executing: 0.25s


ERROR:
 400 Invalid project ID '{PROJECT_ID}'. Project IDs must contain 6-63 lowercase letters, digits, or dashes. Some project IDs also include domain name separated by a colon. IDs must start with a letter and may not end with a dash.; reason: invalid, location: {PROJECT_ID}.{DATASET_ID}.customer_date, message: Invalid project ID '{PROJECT_ID}'. Project IDs must contain 6-63 lowercase letters, digits, or dashes. Some project IDs also include domain name separated by a colon. IDs must start with a letter and may not end with a dash.

Location: US
Job ID: f1791e25-84b5-44d2-88ea-277da0bac34d



In [None]:
#@title Train a Clustering Model
# Create and train a k-means clustering model using a single SQL query in BigQuery ML to group customers into a specified number of segments.

%%bigquery
CREATE OR REPLACE MODEL `{PROJECT_ID}.{DATASET_ID}.customer_clusters`
OPTIONS(model_type='kmeans', num_clusters=3, standardize_features=TRUE) AS
SELECT
  id,
  total_spend,
  number_of_orders,
  last_purchase_date
FROM `{PROJECT_ID}.{DATASET_ID}.customer_date`;

Query is running:   0%|          |

In [None]:
#@title Assign Customers to Segments
# Use the ML.PREDICT function to assign each customer to their respective segment.

%%bigquery
CREATE OR REPLACE TABLE `{PROJECT_ID}.{DATASET_ID}.predictions` AS(
SELECT
id,
centroid_id,
total_spend,
number_of_orders,
last_purchase_date
FROM
  ML.PREDICT(MODEL `{PROJECT_ID}.{DATASET_ID}.customer_clusters`,
    (SELECT * FROM
       `{PROJECT_ID}.{DATASET_ID}.customer_date`)

));


Query is running:   0%|          |

In [None]:
%%bigquery
SELECT * FROM `{PROJECT_ID}.{DATASET_ID}.predictions` LIMIT 50;

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,id,centroid_id,total_spend,number_of_orders,last_purchase_date
0,69375,1,0.02,1,2020-09-26 13:16:00+00:00
1,94098,1,0.02,1,2024-10-30 14:53:00+00:00
2,91918,1,0.49,1,2025-08-07 13:14:00+00:00
3,76610,1,1.51,1,2025-03-17 02:14:00+00:00
4,74083,1,1.82,1,2025-02-19 15:35:00+00:00
5,64742,1,2.67,1,2022-07-08 14:10:00+00:00
6,77909,1,2.67,1,2025-01-19 09:32:00+00:00
7,58334,1,2.95,1,2025-01-05 14:16:00+00:00
8,74654,1,2.95,1,2025-02-24 00:58:00+00:00
9,75858,1,2.95,1,2022-06-28 02:38:00+00:00


### **4. Generate Segment Personas with AI:**
Use the AI.GENERATE_TABLE() function to analyze the members of each cluster and automatically generate a descriptive persona, a summary of their behavior, and potential marketing strategies for that segment.


In [None]:
# Create a BQML model to use the AI functions
# Create a connection (ai_function) in the US region and give service account access by following these steps here: https://cloud.google.com/bigquery/docs/generate-table#create_a_connection

%%bigquery
CREATE OR REPLACE MODEL`{PROJECT_ID}.{DATASET_ID}.ai_model`
REMOTE WITH CONNECTION `{PROJECT_ID}.us.ai_function`
OPTIONS (ENDPOINT = 'gemini-2.5-flash');

Query is running:   0%|          |

In [None]:
%%bigquery

CREATE OR REPLACE TABLE `{PROJECT_ID}.{DATASET_ID}.segment_campaigns` AS(
SELECT
  *
FROM
  AI.GENERATE_TABLE(
    MODEL `{PROJECT_ID}.{DATASET_ID}.ai_model`,
    (
      SELECT
          'For the customer segment with centroid_id, provide a concise segment_name, a segment_summary (1-2 sentences), and 3 numbered marketing_suggestions in a list based on the provided metrics. For context, these metrics are from an ecommerce store.'
          AS prompt
      FROM (
        SELECT
          CAST(centroid_id AS STRING) AS centroid_id,
          AVG(total_spend) AS avg_total_spend,
          AVG(number_of_orders) AS avg_number_of_orders,
          AVG(TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), last_purchase_date, DAY)) AS avg_time_since_last_purchase_days
        FROM
          `{PROJECT_ID}.{DATASET_ID}.predictions`
        GROUP BY
          centroid_id
      )
    ),
     STRUCT ("customer_segment STRING, segment_name STRING, segment_summary STRING, marketing_suggestions ARRAY<STRING>" AS output_schema ))
);

Query is running:   0%|          |

In [None]:
%%bigquery
SELECT * FROM `{PROJECT_ID}.{DATASET_ID}.segment_campaigns`

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,customer_segment,marketing_suggestions,segment_name,segment_summary,full_response,status,prompt
0,Centroid_ID_XYZ - High-Value Repeat Purchasers,[1. Offer exclusive early access to new produc...,Loyal Engaged Spenders,This segment consists of customers who frequen...,"{""candidates"":[{""avg_logprobs"":-1.152481361671...",,"For the customer segment with centroid_id, pro..."
1,High-Value Loyal Customers,[Implement a tiered loyalty program offering e...,Loyal VIPs,These customers make frequent purchases and ha...,"{""candidates"":[{""avg_logprobs"":-0.856455485026...",,"For the customer segment with centroid_id, pro..."
2,centroid_id_007,[Offer exclusive early access to new product l...,Luxury Brand Loyalists,This segment comprises customers who frequentl...,"{""candidates"":[{""avg_logprobs"":-1.427104695638...",,"For the customer segment with centroid_id, pro..."
