#### Problem Statement

Banks invest significant resources into outbound marketing campaigns, like calling customers to offer term deposit products. However, these campaigns can be inefficient, contacting many customers who are not interested, leading to:

- High operational costs

- Low conversion rates

- Customer fatigue and churn risk

This project seeks to optimize marketing effectiveness by:

1. Segmenting customers based on their financial and demographic behavior

2. Predicting which customers are most likely to subscribe to a term deposit

3. Designing a targeting strategy that reduces wasted outreach while maximizing ROI


**Stakeholders**

| Stakeholder         | What They Care About                                                 |
| ------------------- | -------------------------------------------------------------------- |
| Marketing Team      | Improve campaign efficiency and ROI; avoid over-contacting customers |
| Data Science Team   | Build interpretable, reliable models with business impact            |
| Compliance/Legal    | Ensure targeting practices are fair and non-discriminatory           |
| Sales / Call Center | Focus effort on the right customers; increase success rates          |
| Senior Management   | Strategic insights on customer behavior and product interest         |


#### Imports

Import the required libraries

In [1]:
import pandas as pd
import zipfile
import urllib.request
import io
import os
from google.cloud import bigquery


### Loading dataset

The dataset comes from Google [BigQuery](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1sbigquery-public-data!2sthelook_ecommerce).
A function was written to load the tables in the dataset. The e-commerce data has 7 tables: `distribution_centers`, `events`, `inventory_items`, `order_items`, `orders`, `products`, and `users`.

In [2]:

client = bigquery.Client(project="testing-469722")

def load_thelook_tables(dataset_id="bigquery-public-data.thelook_ecommerce", table_names=None):
    """
    Load multiple tables from thelook_ecommerce dataset into separate DataFrames.

    Parameters:
        dataset_id (str): Full BigQuery dataset path
        table_names (list): List of table names to load

    Returns:
        dict: Dictionary of {table_name: DataFrame}
    """
    if table_names is None:
        table_names = [
            "distribution_centers",
            "events",
            "inventory_items",
            "order_items",
            "orders",
            "products",
            "users"
        ]

    dataframes = {}

    for table in table_names:
        query = f"SELECT * FROM `{dataset_id}.{table}`"
        df = client.query(query).to_dataframe()
        dataframes[table] = df
        print(f"Loaded {table}: {df.shape[0]} rows")

    return dataframes

### Load the dataset

The tables are pulled from BigQuery into a dictionary of DataFrames.


In [3]:
thelook_dfs = load_thelook_tables()



Loaded distribution_centers: 10 rows
Loaded events: 2414831 rows
Loaded inventory_items: 486071 rows
Loaded order_items: 180350 rows
Loaded orders: 124567 rows
Loaded products: 29120 rows
Loaded users: 100000 rows
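
Because the loader runs `SELECT *` on every table, `events` alone returns about 2.4 million rows. During exploration it may be cheaper to cap the number of rows pulled into pandas. The sketch below only builds the query string; `build_table_query` and its `limit` parameter are hypothetical names that are not part of the original function. Note that `LIMIT` reduces the rows returned, but BigQuery may still scan the full table.

```python
def build_table_query(dataset_id, table, limit=None):
    """Return a SELECT statement for one table, optionally capped at `limit` rows."""
    query = f"SELECT * FROM `{dataset_id}.{table}`"
    if limit is not None:
        limit = int(limit)
        if limit <= 0:
            raise ValueError("limit must be a positive integer")
        query += f" LIMIT {limit}"
    return query


# Example: a capped query for the largest table.
print(build_table_query("bigquery-public-data.thelook_ecommerce", "events", limit=1000))
```

The resulting string could be passed to `client.query(...)` in place of the f-string inside `load_thelook_tables`.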


### Accessing the tables


In [4]:
# order table

orders_df = thelook_dfs['orders']
orders_df.head(4)

Unnamed: 0,order_id,user_id,status,gender,created_at,returned_at,shipped_at,delivered_at,num_of_item
0,1,1,Cancelled,F,2020-03-20 11:43:00+00:00,NaT,NaT,NaT,2
1,17,11,Cancelled,F,2024-04-16 14:15:00+00:00,NaT,NaT,NaT,1
2,78,52,Cancelled,F,2025-04-07 15:12:00+00:00,NaT,NaT,NaT,1
3,101,74,Cancelled,F,2025-08-12 14:56:00+00:00,NaT,NaT,NaT,2


**orders_df table dictionary**

| Column Name    | Description                                                           |
| -------------- | --------------------------------------------------------------------- |
| `order_id`     | Unique identifier for each order                                      |
| `user_id`      | Foreign key that links to the customer (`users.id`)                   |
| `status`       | Status of the order (e.g., `Complete`, `Returned`, `Cancelled`)       |
| `gender`       | Gender of the customer (from the `users` table, if joined)            |
| `created_at`   | Timestamp when the order was placed                                   |
| `returned_at`  | Timestamp when the item was returned (null if not returned)           |
| `shipped_at`   | Timestamp when the item was shipped                                   |
| `delivered_at` | Timestamp when the item was delivered to the customer                 |
| `num_of_item`  | Number of individual items in that order (usually from `order_items`) |


In [5]:
# distribution_centers

distribution_df = thelook_dfs['distribution_centers']
distribution_df.head(5)

Unnamed: 0,id,name,latitude,longitude,distribution_center_geom
0,4,Los Angeles CA,34.05,-118.25,POINT(-118.25 34.05)
1,2,Chicago IL,41.8369,-87.6847,POINT(-87.6847 41.8369)
2,5,New Orleans LA,29.95,-90.0667,POINT(-90.0667 29.95)
3,8,Mobile AL,30.6944,-88.0431,POINT(-88.0431 30.6944)
4,9,Charleston SC,32.7833,-79.9333,POINT(-79.9333 32.7833)


**distribution_df dictionary**

| Column Name                | Description                                                                 |
| -------------------------- | --------------------------------------------------------------------------- |
| `id`                       | Unique identifier for each distribution center                              |
| `name`                     | City and state abbreviation of the distribution center location             |
| `latitude`                 | Latitude coordinate of the distribution center (for mapping or distance)    |
| `longitude`                | Longitude coordinate of the distribution center                             |
| `distribution_center_geom` | Geometry field representing the center as a geographic point (WKT text, `POINT(longitude latitude)`) |
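
The sample rows show the geometry column as text of the form `POINT(longitude latitude)`, which matches well-known text (WKT) notation. If numeric coordinates are needed from that column alone, a small parser is enough. This is a minimal sketch that assumes the values are plain 2D points in decimal notation; `parse_wkt_point` is a hypothetical helper, not something used in the notebook.

```python
import re

# Matches e.g. "POINT(-118.25 34.05)": longitude first, then latitude.
_POINT_RE = re.compile(r"^POINT\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)$")


def parse_wkt_point(text):
    """Parse 'POINT(lon lat)' into a (longitude, latitude) tuple of floats."""
    match = _POINT_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a WKT point: {text!r}")
    lon, lat = match.groups()
    return float(lon), float(lat)


lon, lat = parse_wkt_point("POINT(-118.25 34.05)")
print(lon, lat)
```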


In [6]:
events_df = thelook_dfs['events']
events_df.head(5)

Unnamed: 0,id,user_id,sequence_number,session_id,created_at,ip_address,city,state,postal_code,browser,traffic_source,uri,event_type
0,2024807,,3,1b5cb0ad-b391-4018-a42e-9d7a7f36ac36,2023-04-08 05:54:00+00:00,102.175.176.144,São Paulo,São Paulo,02675-031,Firefox,Adwords,/cancel,cancel
1,1423972,,3,d76f6b66-1f49-45f7-89b9-2d4829f26961,2024-12-28 00:14:00+00:00,156.144.159.235,São Paulo,São Paulo,02675-031,Firefox,Email,/cancel,cancel
2,1690004,,3,bcd10892-62f0-49eb-a91e-d9a4eea0cf8e,2021-11-24 08:42:00+00:00,109.245.214.39,São Paulo,São Paulo,02675-031,Firefox,Adwords,/cancel,cancel
3,1531533,,3,7ec45893-3b80-4b1d-b6e1-cc4df28fd1ac,2021-06-13 12:46:00+00:00,22.59.191.157,São Paulo,São Paulo,02675-031,Chrome,Adwords,/cancel,cancel
4,2150929,,3,2a5b6bd2-e22a-4f76-9254-b579eefecdbb,2022-07-26 15:26:00+00:00,31.246.159.252,São Paulo,São Paulo,02675-031,Chrome,Organic,/cancel,cancel


**events_df dictionary**

| Column Name       | Description                                                           |
| ----------------- | --------------------------------------------------------------------- |
| `id`              | Unique identifier for each event record                               |
| `user_id`         | ID of the user associated with the event *(can be missing/anonymous)* |
| `sequence_number` | Order of the event within a session (helps track user journey)        |
| `session_id`      | Unique ID for the user session (used to group related events)         |
| `created_at`      | Timestamp when the event occurred                                     |
| `ip_address`      | User’s IP address at the time of the event                            |
| `city`            | City of the user (inferred from IP)                                   |
| `state`           | State of the user (inferred from IP)                                  |
| `postal_code`     | Postal code of the user (inferred from IP)                            |
| `browser`         | Browser used by the user (e.g., Chrome, Safari)                       |
| `traffic_source`  | Source that led the user to the site (e.g., Organic, Email, Adwords)  |
| `uri`             | Specific page or endpoint visited (e.g., `/cancel`)                   |
| `event_type`      | Type of user action or event (e.g., `cancel`, `purchase`, `checkout`) |
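
Since `sequence_number` orders events within a `session_id`, the two columns together can be used to reconstruct each session's journey. The sketch below uses a few made-up rows with the same column names as `events_df`; the values are illustrative only.

```python
import pandas as pd

# Toy rows with the same column names as events_df (values are made up).
events = pd.DataFrame({
    "session_id": ["a", "a", "b", "a", "b"],
    "sequence_number": [2, 1, 1, 3, 2],
    "event_type": ["cart", "home", "home", "purchase", "cancel"],
})

# Sort by position within each session, then collect the event types in order.
journeys = (
    events.sort_values(["session_id", "sequence_number"])
          .groupby("session_id")["event_type"]
          .agg(list)
)
print(journeys.to_dict())
```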


In [7]:
inventoryitems_df = thelook_dfs['inventory_items']
inventoryitems_df.head()

Unnamed: 0,id,product_id,created_at,sold_at,cost,product_category,product_name,product_brand,product_retail_price,product_department,product_sku,product_distribution_center_id
0,11788,13844,2022-01-23 12:18:55+00:00,2022-03-01 08:02:55+00:00,2.76804,Accessories,(ONE) 1 Satin Headband,Funny Girl Designs,6.99,Women,2A3E953A5E3D81E67945BCE5519F84C8,7
1,11789,13844,2023-11-25 03:44:00+00:00,NaT,2.76804,Accessories,(ONE) 1 Satin Headband,Funny Girl Designs,6.99,Women,2A3E953A5E3D81E67945BCE5519F84C8,7
2,42960,13844,2024-06-16 04:00:54+00:00,2024-07-30 12:50:54+00:00,2.76804,Accessories,(ONE) 1 Satin Headband,Funny Girl Designs,6.99,Women,2A3E953A5E3D81E67945BCE5519F84C8,7
3,42961,13844,2024-07-19 11:34:00+00:00,NaT,2.76804,Accessories,(ONE) 1 Satin Headband,Funny Girl Designs,6.99,Women,2A3E953A5E3D81E67945BCE5519F84C8,7
4,111481,13844,2020-08-08 22:40:51+00:00,2020-10-07 06:01:51+00:00,2.76804,Accessories,(ONE) 1 Satin Headband,Funny Girl Designs,6.99,Women,2A3E953A5E3D81E67945BCE5519F84C8,7


**inventoryitems_df dictionary**

| Column Name                      | Description                                                                                                 |
| -------------------------------- | ----------------------------------------------------------------------------------------------------------- |
| `id`                             | Unique identifier for each inventory record                                                                 |
| `product_id`                     | ID of the product listed in the inventory                                                                   |
| `created_at`                     | Timestamp when the product was added to inventory                                                           |
| `sold_at`                        | Timestamp when the product was sold (null if unsold)                                                        |
| `cost`                           | Cost to the business for this unit (wholesale price)                                                        |
| `product_category`               | Category label for the product (e.g., *Jumpsuits & Rompers*)                                                |
| `product_name`                   | Full name or title of the product                                                                           |
| `product_brand`                  | Brand/manufacturer of the product                                                                           |
| `product_retail_price`           | Suggested retail price or sale price to customer                                                            |
| `product_department`             | Department the product belongs to (e.g., *Women*)                                                           |
| `product_sku`                    | Unique stock keeping unit (SKU) for inventory tracking                                                      |
| `product_distribution_center_id` | Foreign key linking to the distribution center where the item is stocked (join with `distribution_centers`) |


In [8]:
orderitems_df = thelook_dfs['order_items']
orderitems_df.head(5)

Unnamed: 0,id,order_id,user_id,product_id,inventory_item_id,status,created_at,shipped_at,delivered_at,returned_at,sale_price
0,102433,70865,56891,14235,276182,Processing,2024-06-06 04:30:37+00:00,NaT,NaT,NaT,0.02
1,172184,118909,95392,14235,464193,Processing,2023-01-13 12:19:51+00:00,NaT,NaT,NaT,0.02
2,29962,20664,16594,14159,80716,Cancelled,2024-05-14 21:54:31+00:00,NaT,NaT,NaT,0.49
3,39421,27184,21869,14159,106192,Cancelled,2022-04-19 09:09:24+00:00,NaT,NaT,NaT,0.49
4,72384,50056,40169,14159,195132,Cancelled,2025-06-14 09:23:23+00:00,NaT,NaT,NaT,0.49


**orderitems_df dictionary**

| Column Name         | Description                                                                    |
| ------------------- | ------------------------------------------------------------------------------ |
| `id`                | Unique identifier for each order item (line item)                              |
| `order_id`          | Foreign key referencing the `orders` table (represents the main order)         |
| `user_id`           | ID of the customer who placed the order                                        |
| `product_id`        | ID of the product ordered (from the `products` or `inventory_items` table)     |
| `inventory_item_id` | Foreign key linking to the specific item in inventory (`inventory_items.id`)   |
| `status`            | Status of the item within the order: `Complete`, `Cancelled`, etc.             |
| `created_at`        | Timestamp when the item was added to the order                                 |
| `shipped_at`        | Timestamp when the item was shipped                                            |
| `delivered_at`      | Timestamp when the item was delivered                                          |
| `returned_at`       | Timestamp if the item was returned                                             |
| `sale_price`        | Price at which the item was sold to the customer                               |
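
Because each row is one line item, order-level figures come from aggregating on `order_id`. The following sketch computes the order value and the item count per order on made-up rows; the column names match `orderitems_df`, while `order_value` and `n_items` are names introduced here for illustration.

```python
import pandas as pd

# Toy line items (values are made up).
order_items = pd.DataFrame({
    "order_id": [1, 1, 2],
    "sale_price": [10.0, 5.5, 0.49],
})

# One row per order: total sale price and number of line items.
order_totals = order_items.groupby("order_id").agg(
    order_value=("sale_price", "sum"),
    n_items=("sale_price", "count"),
)
print(order_totals)
```

On the real data, `n_items` could be compared against `num_of_item` in `orders_df` as a consistency check.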


In [9]:
products_df = thelook_dfs['products']
products_df.head(5)

Unnamed: 0,id,cost,category,name,brand,retail_price,department,sku,distribution_center_id
0,13842,2.51875,Accessories,Low Profile Dyed Cotton Twill Cap - Navy W39S55D,MG,6.25,Women,EBD58B8A3F1D72F4206201DA62FB1204,1
1,13928,2.33835,Accessories,Low Profile Dyed Cotton Twill Cap - Putty W39S55D,MG,5.95,Women,2EAC42424D12436BDD6A5B8A88480CC3,1
2,14115,4.87956,Accessories,Enzyme Regular Solid Army Caps-Black W35S45D,MG,10.99,Women,EE364229B2791D1EF9355708EFF0BA34,1
3,14157,4.64877,Accessories,Enzyme Regular Solid Army Caps-Olive W35S45D (...,MG,10.99,Women,00BD13095D06C20B11A2993CA419D16B,1
4,14273,6.50793,Accessories,Washed Canvas Ivy Cap - Black W11S64C,MG,15.99,Women,F531DC20FDE20B7ADF3A73F52B71D0AF,1


**products_df dictionary**

| Column Name              | Description                                                                                                      |
| ------------------------ | ---------------------------------------------------------------------------------------------------------------- |
| `id`                     | Unique product ID                                                                                                |
| `cost`                   | Internal cost of the product for the retailer                                                                    |
| `category`               | Product category (e.g., Accessories, Tops, Shoes)                                                                |
| `name`                   | Full name or description of the product                                                                          |
| `brand`                  | Brand or manufacturer of the product                                                                             |
| `retail_price`           | Recommended retail price for the product                                                                         |
| `department`             | Department the product belongs to (e.g., Women, Men, Kids)                                                       |
| `sku`                    | Stock Keeping Unit: unique code used to identify and track inventory                                             |
| `distribution_center_id` | Foreign key linking to the `distribution_centers` table, indicating where the product is stocked or shipped from |


In [10]:
users_df = thelook_dfs['users']
users_df.head(5)

Unnamed: 0,id,first_name,last_name,email,age,gender,state,street_address,postal_code,city,country,latitude,longitude,traffic_source,created_at,user_geom
0,93483,Brad,Ferguson,bradferguson@example.net,16,M,Acre,8427 Rachel Drive Suite 095,69980-000,,Brasil,-8.065346,-72.870949,Organic,2020-05-20 17:05:00+00:00,POINT(-72.87094866 -8.065346116)
1,685,Erica,Levine,ericalevine@example.com,37,F,Acre,3160 Lisa Springs Suite 593,69980-000,,Brasil,-8.065346,-72.870949,Search,2021-10-18 06:23:00+00:00,POINT(-72.87094866 -8.065346116)
2,99142,Austin,Simmons,austinsimmons@example.org,21,M,Acre,62944 Miles Avenue,69980-000,,Brasil,-8.065346,-72.870949,Search,2019-09-20 11:38:00+00:00,POINT(-72.87094866 -8.065346116)
3,38974,Linda,King,lindaking@example.com,20,F,Acre,26711 George Centers Suite 634,69980-000,,Brasil,-8.065346,-72.870949,Search,2025-01-03 18:57:00+00:00,POINT(-72.87094866 -8.065346116)
4,46889,Larry,Howard,larryhoward@example.net,39,M,Acre,438 Richard Roads,69980-000,,Brasil,-8.065346,-72.870949,Display,2022-09-17 00:29:00+00:00,POINT(-72.87094866 -8.065346116)


**users_df dictionary**

| Column Name      | Description                                                                                |
| ---------------- | ------------------------------------------------------------------------------------------ |
| `id`             | Unique identifier for the user (primary key)                                               |
| `first_name`     | User's first name                                                                          |
| `last_name`      | User's last name                                                                           |
| `email`          | User's email address                                                                       |
| `age`            | Age of the user                                                                            |
| `gender`         | Gender of the user (`M`, `F`, or other)                                                    |
| `state`          | State of residence (e.g., Acre)                                                            |
| `street_address` | Full street address of the user                                                            |
| `postal_code`    | Postal or ZIP code                                                                         |
| `city`           | City (some entries may be `null`)                                                          |
| `country`        | Country of residence (e.g., Brasil)                                                        |
| `latitude`       | Latitude coordinate of the user's address                                                  |
| `longitude`      | Longitude coordinate of the user's address                                                 |
| `traffic_source` | Original acquisition source (e.g., Facebook, Email, Search, Organic)                       |
| `created_at`     | Timestamp of when the user was created in the system                                       |
| `user_geom`      | Spatial data point combining `longitude` and `latitude` (for mapping and geospatial joins) |
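
The dictionaries above describe `orders.user_id` as a foreign key to `users.id`, so customer attributes can be attached to orders with a left join. This is a minimal sketch on made-up rows; passing `validate="many_to_one"` makes pandas raise an error if `id` is not unique on the users side.

```python
import pandas as pd

# Toy frames with the same key columns as orders_df and users_df (values are made up).
orders = pd.DataFrame({
    "order_id": [1, 17, 78],
    "user_id": [1, 11, 1],
    "status": ["Cancelled", "Complete", "Shipped"],
})
users = pd.DataFrame({
    "id": [1, 11],
    "age": [34, 52],
    "country": ["Brasil", "Japan"],
})

# Keep every order; each order must match at most one user.
orders_users = orders.merge(
    users,
    left_on="user_id",
    right_on="id",
    how="left",
    validate="many_to_one",
).drop(columns="id")
print(orders_users)
```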


### Data Cleaning & Exploration

#### orders_df

In [12]:
# Check data type of the orders_df

orders_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124567 entries, 0 to 124566
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype              
---  ------        --------------   -----              
 0   order_id      124567 non-null  Int64              
 1   user_id       124567 non-null  Int64              
 2   status        124567 non-null  object             
 3   gender        124567 non-null  object             
 4   created_at    124567 non-null  datetime64[us, UTC]
 5   returned_at   12445 non-null   datetime64[us, UTC]
 6   shipped_at    80593 non-null   datetime64[us, UTC]
 7   delivered_at  43534 non-null   datetime64[us, UTC]
 8   num_of_item   124567 non-null  Int64              
dtypes: Int64(3), datetime64[us, UTC](4), object(2)
memory usage: 8.9+ MB


In [43]:
# Check for null values

orders_df.isnull().sum()

order_id             0
user_id              0
status               0
gender               0
created_at           0
returned_at     112122
shipped_at       43974
delivered_at     81033
num_of_item          0
dtype: int64

In [23]:
orders_df['status'].unique()

array(['Cancelled', 'Complete', 'Processing', 'Returned', 'Shipped'],
      dtype=object)

`orders_df` has 112122 null values in `returned_at`, 43974 in `shipped_at`, and 81033 in `delivered_at`. The `returned_at` nulls indicate that 112122 orders were not returned. The nulls in `shipped_at` and `delivered_at` suggest that the corresponding orders were cancelled or had not yet reached that stage. This needs to be investigated.
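
One way to investigate is to count the null timestamps per order status. The sketch below does this on five made-up rows, one per status, under the assumption that a timestamp is filled in once the order reaches that stage; on the real data the same two lines would run against `orders_df`.

```python
import pandas as pd

# One made-up order per status; None becomes NaT.
ts = "2024-01-02"
orders = pd.DataFrame({
    "status": ["Cancelled", "Processing", "Shipped", "Complete", "Returned"],
    "shipped_at": pd.to_datetime([None, None, ts, ts, ts], utc=True),
    "delivered_at": pd.to_datetime([None, None, None, ts, ts], utc=True),
    "returned_at": pd.to_datetime([None, None, None, None, ts], utc=True),
})

# Null count of each timestamp column, broken down by status.
ts_cols = ["shipped_at", "delivered_at", "returned_at"]
null_by_status = orders[ts_cols].isna().groupby(orders["status"]).sum()
print(null_by_status)
```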

In [37]:
# Orders marked 'Returned' that have no return timestamp (none expected)
orders_returned = orders_df[(orders_df['status'] == 'Returned') & (orders_df['returned_at'].isnull())]
orders_returned.shape

(0, 9)

In [40]:
orders_df['status'].value_counts()

status
Shipped       37059
Complete      31089
Processing    25165
Cancelled     18809
Returned      12445
Name: count, dtype: int64
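
These status counts can be checked against the non-null counts reported by `orders_df.info()`. The sketch below assumes that `shipped_at` is set for shipped, complete, and returned orders, `delivered_at` for complete and returned orders, and `returned_at` only for returned orders. That mapping is a hypothesis; matching totals are consistent with it but do not prove it row by row.

```python
# Counts copied from the value_counts() and info() outputs above.
status_counts = {
    "Shipped": 37059,
    "Complete": 31089,
    "Processing": 25165,
    "Cancelled": 18809,
    "Returned": 12445,
}
non_null = {"shipped_at": 80593, "delivered_at": 43534, "returned_at": 12445}

# Assumed mapping: which statuses should carry each timestamp.
expected_statuses = {
    "shipped_at": ["Shipped", "Complete", "Returned"],
    "delivered_at": ["Complete", "Returned"],
    "returned_at": ["Returned"],
}

# Compare the expected non-null count with the reported one for each column.
matches = {
    column: sum(status_counts[s] for s in statuses) == non_null[column]
    for column, statuses in expected_statuses.items()
}
print(matches)
print(sum(status_counts.values()))  # total number of orders
```

Under these assumptions all three columns match, which suggests the nulls follow from order status rather than from missing data.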

In [35]:
orders_Complete = orders_df[(orders_df['status'] == 'Complete') & (orders_df['returned_at'].isnull())]
orders_Complete.shape

(31089, 9)

In [41]:
orders_df.shape

(124567, 9)

In [24]:
orders_df.head()

Unnamed: 0,order_id,user_id,status,gender,created_at,returned_at,shipped_at,delivered_at,num_of_item
0,1,1,Cancelled,F,2020-03-20 11:43:00+00:00,NaT,NaT,NaT,2
1,17,11,Cancelled,F,2024-04-16 14:15:00+00:00,NaT,NaT,NaT,1
2,78,52,Cancelled,F,2025-04-07 15:12:00+00:00,NaT,NaT,NaT,1
3,101,74,Cancelled,F,2025-08-12 14:56:00+00:00,NaT,NaT,NaT,2
4,106,78,Cancelled,F,2025-06-06 07:09:00+00:00,NaT,NaT,NaT,1
