#### Problem Statement

Banks invest significant resources into outbound marketing campaigns, like calling customers to offer term deposit products. However, these campaigns can be inefficient, contacting many customers who are not interested, leading to:

- High operational costs

- Low conversion rates

- Customer fatigue and churn risk

This project seeks to optimize marketing effectiveness by:

1. Segmenting customers based on their financial and demographic behavior

2. Predicting which customers are most likely to subscribe to a term deposit

3. Designing a targeting strategy that reduces wasted outreach while maximizing ROI


**Stakeholders**

| Stakeholder         | What They Care About                                                 |
| ------------------- | -------------------------------------------------------------------- |
| Marketing Team      | Improve campaign efficiency and ROI; avoid over-contacting customers |
| Data Science Team   | Build interpretable, reliable models with business impact            |
| Compliance/Legal    | Ensure targeting practices are fair and non-discriminatory           |
| Sales / Call Center | Focus effort on the right customers; increase success rates          |
| Senior Management   | Strategic insights on customer behavior and product interest         |


#### Imports

Import the required libraries

In [1]:
import pandas as pd
import zipfile
import urllib.request
import io
import os
from google.cloud import bigquery


### Loading dataset

The dataset comes from Google [BigQuery](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1sbigquery-public-data!2sthelook_ecommerce).
A function was written to load the tables in the dataset. The ecommerce dataset has `7` tables: `distribution_centers`, `events`, `inventory_items`, `order_items`, `orders`, `products` and `users`.

In [2]:

client = bigquery.Client(project="testing-469722")

def load_thelook_tables(dataset_id="bigquery-public-data.thelook_ecommerce", table_names=None):
    """
    Load multiple tables from thelook_ecommerce dataset into separate DataFrames.

    Parameters:
        dataset_id (str): Full BigQuery dataset path
        table_names (list): List of table names to load

    Returns:
        dict: Dictionary of {table_name: DataFrame}
    """
    if table_names is None:
        table_names = [
            "distribution_centers",
            "events",
            "inventory_items",
            "order_items",
            "orders",
            "products",
            "users"
        ]

    dataframes = {}

    for table in table_names:
        query = f"SELECT * FROM `{dataset_id}.{table}`"
        df = client.query(query).to_dataframe()
        dataframes[table] = df
        print(f"Loaded {table}: {df.shape[0]} rows")

    return dataframes

### Load the dataset

The tables are pulled from BigQuery.


In [3]:
thelook_dfs = load_thelook_tables()



Loaded distribution_centers: 10 rows
Loaded events: 2427988 rows
Loaded inventory_items: 489521 rows
Loaded order_items: 181272 rows
Loaded orders: 124691 rows
Loaded products: 29120 rows
Loaded users: 100000 rows
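
The `events` table alone has more than 2.4 million rows, so pulling every column of every table can be slow while iterating. The sketch below shows one way to build a narrower query string. `build_table_query` is a hypothetical helper that is not part of this notebook; its output could be passed to `client.query(...)` in the same way `load_thelook_tables` does. It only builds a string, so it runs without BigQuery credentials.

```python
def build_table_query(dataset_id, table, columns=None, limit=None):
    """Build a SELECT statement for one table (hypothetical helper, not in the notebook)."""
    # Select only the requested columns, or everything when none are given
    column_sql = ", ".join(columns) if columns else "*"
    query = f"SELECT {column_sql} FROM `{dataset_id}.{table}`"
    # LIMIT caps the number of rows returned to the client
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query


sample_query = build_table_query(
    "bigquery-public-data.thelook_ecommerce",
    "events",
    columns=["id", "user_id", "event_type"],
    limit=1000,
)
print(sample_query)
```

Selecting fewer columns is usually the more effective lever in a columnar store such as BigQuery; the `LIMIT` clause mainly reduces what is transferred into the DataFrame.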


### Accessing the tables


In [4]:
# order table

orders_df = thelook_dfs['orders']
orders_df.head(4)

Unnamed: 0,order_id,user_id,status,gender,created_at,returned_at,shipped_at,delivered_at,num_of_item
0,22,21,Cancelled,F,2024-09-29 08:45:00+00:00,NaT,NaT,NaT,1
1,27,27,Cancelled,F,2022-09-15 14:24:00+00:00,NaT,NaT,NaT,1
2,63,59,Cancelled,F,2022-02-03 05:19:00+00:00,NaT,NaT,NaT,2
3,66,61,Cancelled,F,2020-05-07 10:04:00+00:00,NaT,NaT,NaT,1


**orders_df table dictionary**

| Column Name    | Description                                                           |
| -------------- | --------------------------------------------------------------------- |
| `order_id`     | Unique identifier for each order                                      |
| `user_id`      | Foreign key that links to the customer (`users.id`)                   |
| `status`       | Status of the order (e.g., `Complete`, `Returned`, `Cancelled`)       |
| `gender`       | Gender of the customer (from the `users` table, if joined)            |
| `created_at`   | Timestamp when the order was placed                                   |
| `returned_at`  | Timestamp when the item was returned (null if not returned)           |
| `shipped_at`   | Timestamp when the item was shipped                                   |
| `delivered_at` | Timestamp when the item was delivered to the customer                 |
| `num_of_item`  | Number of individual items in that order (usually from `order_items`) |


In [5]:
# distribution_centers

distribution_df = thelook_dfs['distribution_centers']
distribution_df.head(5)

Unnamed: 0,id,name,latitude,longitude,distribution_center_geom
0,10,Savannah GA,32.0167,-81.1167,POINT(-81.1167 32.0167)
1,9,Charleston SC,32.7833,-79.9333,POINT(-79.9333 32.7833)
2,1,Memphis TN,35.1174,-89.9711,POINT(-89.9711 35.1174)
3,3,Houston TX,29.7604,-95.3698,POINT(-95.3698 29.7604)
4,7,Philadelphia PA,39.95,-75.1667,POINT(-75.1667 39.95)


**distribution_df dictionary**

| Column Name                | Description                                                                 |
| -------------------------- | --------------------------------------------------------------------------- |
| `id`                       | Unique identifier for each distribution center                              |
| `name`                     | City and state abbreviation of the distribution center location             |
| `latitude`                 | Latitude coordinate of the distribution center (for mapping or distance)    |
| `longitude`                | Longitude coordinate of the distribution center                             |
| `distribution_center_geom` | Geometry field representing the center as a geographic point (WKT `POINT`)  |


In [6]:
events_df = thelook_dfs['events']
events_df.head(5)

Unnamed: 0,id,user_id,sequence_number,session_id,created_at,ip_address,city,state,postal_code,browser,traffic_source,uri,event_type
0,1937376,,3,556fcff6-a7a3-4b02-a512-f491bcc6ab19,2023-03-08 05:37:00+00:00,38.103.36.86,São Paulo,São Paulo,02675-031,Chrome,Facebook,/cancel,cancel
1,2244726,,3,eab80561-3d0d-4cd9-aa0b-5f52ed60bb67,2024-07-08 16:12:00+00:00,15.55.4.254,Vargem Grande Paulista,São Paulo,06730-000,Safari,Organic,/cancel,cancel
2,1357710,,3,cf1f5585-a197-4218-8ffc-0d7f8da08d94,2023-05-14 01:54:00+00:00,13.137.185.63,Embu das Artes,São Paulo,06810-240,Firefox,Email,/cancel,cancel
3,1847066,,3,6a9dddfd-48a3-46c1-962c-cfac5cc3cdee,2024-09-10 11:28:00+00:00,197.65.153.203,Embu-Guaçu,São Paulo,06900-000,Firefox,Email,/cancel,cancel
4,2261643,,3,a52141ed-4790-486d-8da4-f3cf4ccef1c8,2019-01-02 07:15:00+00:00,149.159.232.254,Caieiras,São Paulo,07700-000,Safari,Email,/cancel,cancel


**events_df dictionary**

| Column Name       | Description                                                           |
| ----------------- | --------------------------------------------------------------------- |
| `id`              | Unique identifier for each event record                               |
| `user_id`         | ID of the user associated with the event *(can be missing/anonymous)* |
| `sequence_number` | Order of the event within a session (helps track user journey)        |
| `session_id`      | Unique ID for the user session (used to group related events)         |
| `created_at`      | Timestamp when the event occurred                                     |
| `ip_address`      | User’s IP address at the time of the event                            |
| `city`            | City of the user (inferred from IP)                                   |
| `state`           | State of the user (inferred from IP)                                  |
| `postal_code`     | Postal code of the user (inferred from IP)                            |
| `browser`         | Browser used by the user (e.g., Chrome, Safari)                       |
| `traffic_source`  | Source that led the user to the site (e.g., Organic, Email, Adwords)  |
| `uri`             | Specific page or endpoint visited (e.g., `/cancel`)                   |
| `event_type`      | Type of user action or event (e.g., `cancel`, `purchase`, `checkout`) |


In [7]:
inventoryitems_df = thelook_dfs['inventory_items']
inventoryitems_df.head()

Unnamed: 0,id,product_id,created_at,sold_at,cost,product_category,product_name,product_brand,product_retail_price,product_department,product_sku,product_distribution_center_id
0,149079,13844,2025-07-09 05:12:51+00:00,2025-08-10 03:13:51+00:00,2.76804,Accessories,(ONE) 1 Satin Headband,Funny Girl Designs,6.99,Women,2A3E953A5E3D81E67945BCE5519F84C8,7
1,149080,13844,2021-09-08 16:23:00+00:00,NaT,2.76804,Accessories,(ONE) 1 Satin Headband,Funny Girl Designs,6.99,Women,2A3E953A5E3D81E67945BCE5519F84C8,7
2,327585,13844,2025-02-12 04:45:48+00:00,2025-04-08 22:24:48+00:00,2.76804,Accessories,(ONE) 1 Satin Headband,Funny Girl Designs,6.99,Women,2A3E953A5E3D81E67945BCE5519F84C8,7
3,327586,13844,2020-09-08 05:58:00+00:00,NaT,2.76804,Accessories,(ONE) 1 Satin Headband,Funny Girl Designs,6.99,Women,2A3E953A5E3D81E67945BCE5519F84C8,7
4,360165,13844,2023-12-30 02:38:29+00:00,2024-02-01 04:11:29+00:00,2.76804,Accessories,(ONE) 1 Satin Headband,Funny Girl Designs,6.99,Women,2A3E953A5E3D81E67945BCE5519F84C8,7


**inventoryitems_df dictionary**

| Column Name                      | Description                                                                                                 |
| -------------------------------- | ----------------------------------------------------------------------------------------------------------- |
| `id`                             | Unique identifier for each inventory record                                                                 |
| `product_id`                     | ID of the product listed in the inventory                                                                   |
| `created_at`                     | Timestamp when the product was added to inventory                                                           |
| `sold_at`                        | Timestamp when the product was sold (null if unsold)                                                        |
| `cost`                           | Cost to the business for this unit (wholesale price)                                                        |
| `product_category`               | Category label for the product (e.g., *Jumpsuits & Rompers*)                                                |
| `product_name`                   | Full name or title of the product                                                                           |
| `product_brand`                  | Brand/manufacturer of the product                                                                           |
| `product_retail_price`           | Suggested retail price or sale price to customer                                                            |
| `product_department`             | Department the product belongs to (e.g., *Women*)                                                           |
| `product_sku`                    | Unique stock keeping unit (SKU) for inventory tracking                                                      |
| `product_distribution_center_id` | Foreign key linking to the distribution center where the item is stocked (join with `distribution_centers`) |


In [8]:
orderitems_df = thelook_dfs['order_items']
orderitems_df.head(5)

Unnamed: 0,id,order_id,user_id,product_id,inventory_item_id,status,created_at,shipped_at,delivered_at,returned_at,sale_price
0,110755,76044,60769,14235,299107,Complete,2023-11-24 10:10:15+00:00,2023-11-24 05:06:00+00:00,2023-11-28 13:36:00+00:00,NaT,0.02
1,42569,29086,23286,14235,114772,Shipped,2025-03-18 13:55:21+00:00,2025-03-20 08:28:00+00:00,NaT,NaT,0.02
2,68460,46920,37539,14235,184744,Shipped,2025-05-05 06:44:19+00:00,2025-05-06 12:16:00+00:00,NaT,NaT,0.02
3,87121,59811,47896,14235,235215,Shipped,2025-03-28 03:24:05+00:00,2025-03-28 17:29:00+00:00,NaT,NaT,0.02
4,116285,79822,63823,14235,314052,Shipped,2021-10-05 01:35:54+00:00,2021-10-07 00:21:00+00:00,NaT,NaT,0.02


**orderitems_df dictionary**

| Column Name         | Description                                                                    |
| ------------------- | ------------------------------------------------------------------------------ |
| `id`                | Unique identifier for each order item (line item)                              |
| `order_id`          | Foreign key referencing the `orders` table (represents the main order)         |
| `user_id`           | ID of the customer who placed the order                                        |
| `product_id`        | ID of the product ordered (links to `products.id`)                             |
| `inventory_item_id` | Foreign key linking to the specific inventory unit (`inventory_items.id`)      |
| `status`            | Status of the item within the order: `Complete`, `Cancelled`, etc.             |
| `created_at`        | Timestamp when the item was added to the order                                 |
| `shipped_at`        | Timestamp when the item was shipped                                            |
| `delivered_at`      | Timestamp when the item was delivered                                          |
| `returned_at`       | Timestamp if the item was returned                                             |
| `sale_price`        | Price at which the item was sold to the customer                               |


In [9]:
products_df = thelook_dfs['products']
products_df.head(5)

Unnamed: 0,id,cost,category,name,brand,retail_price,department,sku,distribution_center_id
0,13842,2.51875,Accessories,Low Profile Dyed Cotton Twill Cap - Navy W39S55D,MG,6.25,Women,EBD58B8A3F1D72F4206201DA62FB1204,1
1,13928,2.33835,Accessories,Low Profile Dyed Cotton Twill Cap - Putty W39S55D,MG,5.95,Women,2EAC42424D12436BDD6A5B8A88480CC3,1
2,14115,4.87956,Accessories,Enzyme Regular Solid Army Caps-Black W35S45D,MG,10.99,Women,EE364229B2791D1EF9355708EFF0BA34,1
3,14157,4.64877,Accessories,Enzyme Regular Solid Army Caps-Olive W35S45D (...,MG,10.99,Women,00BD13095D06C20B11A2993CA419D16B,1
4,14273,6.50793,Accessories,Washed Canvas Ivy Cap - Black W11S64C,MG,15.99,Women,F531DC20FDE20B7ADF3A73F52B71D0AF,1
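
The `product_id` column in `order_items` refers to `id` in `products`, so the two tables can be joined to attach category, brand and cost to each line item. The following is a minimal sketch on made-up rows; the toy frames and the `margin` column are assumptions for illustration, not results from the real data.

```python
import pandas as pd

# Toy stand-ins (made-up rows; column names follow the tables shown above)
toy_order_items = pd.DataFrame({
    "id": [1, 2, 3],
    "order_id": [10, 10, 11],
    "product_id": [100, 101, 100],
    "sale_price": [6.25, 10.99, 6.25],
})
toy_products = pd.DataFrame({
    "id": [100, 101],
    "category": ["Accessories", "Accessories"],
    "brand": ["MG", "MG"],
    "cost": [2.5, 4.9],
})

# Left join keeps every order item; validate= raises if a product id is duplicated
items_with_products = toy_order_items.merge(
    toy_products.rename(columns={"id": "product_id"}),
    on="product_id",
    how="left",
    validate="many_to_one",
)

# Per-item margin is one quantity the join makes available (illustrative column)
items_with_products["margin"] = (
    items_with_products["sale_price"] - items_with_products["cost"]
)
print(items_with_products)
```

Renaming `id` to `product_id` before merging avoids a clash with the `id` column of `order_items`.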


**products_df dictionary**

| Column Name              | Description                                                                                                      |
| ------------------------ | ---------------------------------------------------------------------------------------------------------------- |
| `id`                     | Unique product ID                                                                                                |
| `cost`                   | Internal cost of the product for the retailer                                                                    |
| `category`               | Product category (e.g., Accessories, Tops, Shoes)                                                                |
| `name`                   | Full name or description of the product                                                                          |
| `brand`                  | Brand or manufacturer of the product                                                                             |
| `retail_price`           | Recommended retail price for the product                                                                         |
| `department`             | Department the product belongs to (e.g., Women, Men, Kids)                                                       |
| `sku`                    | Stock Keeping Unit: unique code used to identify and track inventory                                             |
| `distribution_center_id` | Foreign key linking to the `distribution_centers` table, indicating where the product is stocked or shipped from |


In [10]:
users_df = thelook_dfs['users']
users_df.head(5)

Unnamed: 0,id,first_name,last_name,email,age,gender,state,street_address,postal_code,city,country,latitude,longitude,traffic_source,created_at,user_geom
0,98707,Jason,Sharp,jasonsharp@example.org,58,M,Acre,903 Porter Valley Suite 916,69980-000,,Brasil,-8.065346,-72.870949,Email,2021-07-18 13:14:00+00:00,POINT(-72.87094866 -8.065346116)
1,63035,Jill,Hale,jillhale@example.net,64,F,Acre,562 Melanie Villages,69980-000,,Brasil,-8.065346,-72.870949,Organic,2022-04-11 07:09:00+00:00,POINT(-72.87094866 -8.065346116)
2,85614,Sandra,Sanchez,sandrasanchez@example.org,23,F,Acre,6588 Matthews Isle Suite 107,69980-000,,Brasil,-8.065346,-72.870949,Organic,2024-07-25 01:54:00+00:00,POINT(-72.87094866 -8.065346116)
3,60885,Kayla,Sanders,kaylasanders@example.com,15,F,Acre,845 Harrington Trace,69980-000,,Brasil,-8.065346,-72.870949,Search,2021-03-22 17:07:00+00:00,POINT(-72.87094866 -8.065346116)
4,15848,Monica,Mills,monicamills@example.com,29,F,Acre,5959 Thomas Center,69980-000,,Brasil,-8.065346,-72.870949,Display,2023-02-24 04:24:00+00:00,POINT(-72.87094866 -8.065346116)


**users_df dictionary**

| Column Name      | Description                                                                                |
| ---------------- | ------------------------------------------------------------------------------------------ |
| `id`             | Unique identifier for the user (primary key)                                               |
| `first_name`     | User's first name                                                                          |
| `last_name`      | User's last name                                                                           |
| `email`          | User's email address                                                                       |
| `age`            | Age of the user                                                                            |
| `gender`         | Gender of the user (`M`, `F`, or other)                                                    |
| `state`          | State of residence (e.g., Acre)                                                            |
| `street_address` | Full street address of the user                                                            |
| `postal_code`    | Postal or ZIP code                                                                         |
| `city`           | City (some entries may be `null`)                                                          |
| `country`        | Country of residence (e.g., Brasil)                                                        |
| `latitude`       | Latitude coordinate of the user's address                                                  |
| `longitude`      | Longitude coordinate of the user's address                                                 |
| `traffic_source` | Original acquisition source (e.g., Facebook, Email, Search, Organic)                       |
| `created_at`     | Timestamp of when the user was created in the system                                       |
| `user_geom`      | Spatial data point combining `longitude` and `latitude` (for mapping and geospatial joins) |


### Data Cleaning & Exploration

#### orders_df

In [11]:
# number of rows and columns
orders_df.shape

(124691, 9)

In [12]:
# Check data type of the orders_df

orders_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124691 entries, 0 to 124690
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype              
---  ------        --------------   -----              
 0   order_id      124691 non-null  Int64              
 1   user_id       124691 non-null  Int64              
 2   status        124691 non-null  object             
 3   gender        124691 non-null  object             
 4   created_at    124691 non-null  datetime64[us, UTC]
 5   returned_at   12491 non-null   datetime64[us, UTC]
 6   shipped_at    81220 non-null   datetime64[us, UTC]
 7   delivered_at  43863 non-null   datetime64[us, UTC]
 8   num_of_item   124691 non-null  Int64              
dtypes: Int64(3), datetime64[us, UTC](4), object(2)
memory usage: 8.9+ MB


In [13]:
# Check for null values

orders_df.isnull().sum()

order_id             0
user_id              0
status               0
gender               0
created_at           0
returned_at     112200
shipped_at       43471
delivered_at     80828
num_of_item          0
dtype: int64

In [14]:
orders_df['status'].unique()

array(['Cancelled', 'Complete', 'Processing', 'Returned', 'Shipped'],
      dtype=object)

In [15]:
orders_df['status'].value_counts(normalize = True)

status
Shipped       0.299597
Complete      0.251598
Processing    0.200335
Cancelled     0.148295
Returned      0.100176
Name: proportion, dtype: float64
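
One way to check how the missing timestamps relate to order status is to count nulls per status. The sketch below uses a small made-up frame with the same column names as `orders_df`; the rows are assumptions chosen for illustration, and the same expression could be applied to the real table.

```python
import pandas as pd

# Toy stand-in for orders_df (made-up rows, same column names as the real table)
toy_orders = pd.DataFrame({
    "status": ["Cancelled", "Processing", "Shipped", "Complete", "Returned"],
    "shipped_at": pd.to_datetime(
        [None, None, "2024-01-02", "2024-01-03", "2024-01-04"], utc=True
    ),
    "delivered_at": pd.to_datetime(
        [None, None, None, "2024-01-06", "2024-01-07"], utc=True
    ),
    "returned_at": pd.to_datetime(
        [None, None, None, None, "2024-01-10"], utc=True
    ),
})

timestamp_cols = ["shipped_at", "delivered_at", "returned_at"]

# For each status, count how many rows are missing each timestamp
missing_by_status = (
    toy_orders[timestamp_cols]
    .isnull()
    .groupby(toy_orders["status"])
    .sum()
)
print(missing_by_status)
```

If the nulls in the real data follow the order lifecycle, missing `shipped_at` values should be concentrated in the `Cancelled` and `Processing` rows.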

In [16]:
# unique values and proportion of items in gender

orders_df['gender'].value_counts(normalize = True)

gender
M    0.503364
F    0.496636
Name: proportion, dtype: float64

In [17]:
orders_df.describe()

Unnamed: 0,order_id,user_id,num_of_item
count,124691.0,124691.0,124691.0
mean,62346.0,49918.307649,1.45377
std,35995.335545,28860.671042,0.808708
min,1.0,1.0,1.0
25%,31173.5,24938.0,1.0
50%,62346.0,49931.0,1.0
75%,93518.5,74996.0,2.0
max,124691.0,100000.0,4.0


**Null values interpretation**

`orders_df` has 112200 null values for `returned_at`, meaning 112200 orders were not returned. A further 43471 orders have no `shipped_at` value, so they were never shipped, and 80828 have no `delivered_at` value, so they have not been delivered. The undelivered orders could include:

- Orders that were cancelled before delivery.

- Orders that are still in transit.

- Orders that were lost or failed delivery.

**Status value counts**

About 29.96% of orders were shipped but not yet marked as delivered, 25.16% were successfully delivered, 20.03% are currently being processed, 14.83% were cancelled before completion, and 10.02% were delivered but then returned by customers.

**Gender**

`50.34`% of orders were placed by male customers and `49.66`% by female customers.

**.describe**

The median (50th percentile) order contains 1 item, the 75th percentile is 2 items, and the maximum number of items in a single order is 4.

**distribution_df (Distribution center)**

In [18]:
distribution_df.head()

Unnamed: 0,id,name,latitude,longitude,distribution_center_geom
0,10,Savannah GA,32.0167,-81.1167,POINT(-81.1167 32.0167)
1,9,Charleston SC,32.7833,-79.9333,POINT(-79.9333 32.7833)
2,1,Memphis TN,35.1174,-89.9711,POINT(-89.9711 35.1174)
3,3,Houston TX,29.7604,-95.3698,POINT(-95.3698 29.7604)
4,7,Philadelphia PA,39.95,-75.1667,POINT(-75.1667 39.95)


In [19]:
# no of rows and columns
distribution_df.shape

(10, 5)

In [20]:
# check for null values

distribution_df.isnull().sum()

id                          0
name                        0
latitude                    0
longitude                   0
distribution_center_geom    0
dtype: int64

In [21]:
distribution_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   id                        10 non-null     Int64  
 1   name                      10 non-null     object 
 2   latitude                  10 non-null     float64
 3   longitude                 10 non-null     float64
 4   distribution_center_geom  10 non-null     object 
dtypes: Int64(1), float64(2), object(2)
memory usage: 542.0+ bytes


**events_df**

In [22]:
# number of rows in the table
events_df.shape

(2427988, 13)

In [23]:
events_df.head(5)

Unnamed: 0,id,user_id,sequence_number,session_id,created_at,ip_address,city,state,postal_code,browser,traffic_source,uri,event_type
0,1937376,,3,556fcff6-a7a3-4b02-a512-f491bcc6ab19,2023-03-08 05:37:00+00:00,38.103.36.86,São Paulo,São Paulo,02675-031,Chrome,Facebook,/cancel,cancel
1,2244726,,3,eab80561-3d0d-4cd9-aa0b-5f52ed60bb67,2024-07-08 16:12:00+00:00,15.55.4.254,Vargem Grande Paulista,São Paulo,06730-000,Safari,Organic,/cancel,cancel
2,1357710,,3,cf1f5585-a197-4218-8ffc-0d7f8da08d94,2023-05-14 01:54:00+00:00,13.137.185.63,Embu das Artes,São Paulo,06810-240,Firefox,Email,/cancel,cancel
3,1847066,,3,6a9dddfd-48a3-46c1-962c-cfac5cc3cdee,2024-09-10 11:28:00+00:00,197.65.153.203,Embu-Guaçu,São Paulo,06900-000,Firefox,Email,/cancel,cancel
4,2261643,,3,a52141ed-4790-486d-8da4-f3cf4ccef1c8,2019-01-02 07:15:00+00:00,149.159.232.254,Caieiras,São Paulo,07700-000,Safari,Email,/cancel,cancel


In [24]:
#check data types
events_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2427988 entries, 0 to 2427987
Data columns (total 13 columns):
 #   Column           Dtype              
---  ------           -----              
 0   id               Int64              
 1   user_id          Int64              
 2   sequence_number  Int64              
 3   session_id       object             
 4   created_at       datetime64[us, UTC]
 5   ip_address       object             
 6   city             object             
 7   state            object             
 8   postal_code      object             
 9   browser          object             
 10  traffic_source   object             
 11  uri              object             
 12  event_type       object             
dtypes: Int64(3), datetime64[us, UTC](1), object(9)
memory usage: 247.8+ MB


In [25]:
#check for null values

events_df.isnull().sum()

id                       0
user_id            1124408
sequence_number          0
session_id               0
created_at               0
ip_address               0
city                     0
state                    0
postal_code              0
browser                  0
traffic_source           0
uri                      0
event_type               0
dtype: int64

In [26]:
# For users who are not logged in, which event types did they generate and how often?

events_df[events_df['user_id'].isnull()]['event_type'].value_counts(normalize = True)

event_type
product       0.444678
department    0.222388
cart          0.222067
cancel        0.110866
Name: proportion, dtype: float64

In [27]:
events_df['event_type'].unique()

array(['cancel', 'cart', 'department', 'home', 'product', 'purchase'],
      dtype=object)

**events_df breakdown**
  
The events table likely captures website/app user behaviour such as clicks, page visits and cancellations. `user_id` has 1124408 null values, which could be customers who visited as guests or people who simply did not sign in. These anonymous visitors generated only `product`, `department`, `cart` and `cancel` events; no `home` or `purchase` events appear without a `user_id`.

Across all visitors, signed in or not, the recorded actions are `cancel`, `cart`, `department`, `home`, `product` and `purchase`.

**Why event type is important**

These event types are valuable for:

- Funnel analysis (home → department → product → cart → purchase)

- Drop-off analysis (e.g., many reach cart but few purchase)

- Engagement segmentation (users who only browse vs. those who purchase)

- Personalization and recommendation logic
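
The funnel and drop-off ideas above can be sketched by counting how many sessions reach each step. This is a minimal sketch on made-up sessions; `toy_events` and `funnel_steps` are assumptions for illustration, and the count ignores the order in which events occurred within a session.

```python
import pandas as pd

# Toy stand-in for events_df (made-up sessions)
toy_events = pd.DataFrame({
    "session_id": ["a", "a", "a", "a", "a", "b", "b", "b", "c", "c"],
    "event_type": [
        "home", "department", "product", "cart", "purchase",
        "department", "product", "cart",
        "home", "product",
    ],
})

funnel_steps = ["home", "department", "product", "cart", "purchase"]

# Count distinct sessions that contain each event type at least once
sessions_per_step = (
    toy_events.groupby("event_type")["session_id"]
    .nunique()
    .reindex(funnel_steps, fill_value=0)
)

# Share of all sessions that contain each step
funnel = pd.DataFrame({
    "sessions": sessions_per_step,
    "share_of_sessions": sessions_per_step / toy_events["session_id"].nunique(),
})
print(funnel)
```

A stricter funnel would also use `sequence_number` to require that steps happen in order; that refinement is left out here to keep the sketch short.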

**inventoryitems_df**

In [28]:
# number of columns and rows
inventoryitems_df.shape

(489521, 12)

In [29]:
# check data type
inventoryitems_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 489521 entries, 0 to 489520
Data columns (total 12 columns):
 #   Column                          Non-Null Count   Dtype              
---  ------                          --------------   -----              
 0   id                              489521 non-null  Int64              
 1   product_id                      489521 non-null  Int64              
 2   created_at                      489521 non-null  datetime64[us, UTC]
 3   sold_at                         181272 non-null  datetime64[us, UTC]
 4   cost                            489521 non-null  float64            
 5   product_category                489521 non-null  object             
 6   product_name                    489521 non-null  object             
 7   product_brand                   489521 non-null  object             
 8   product_retail_price            489521 non-null  float64            
 9   product_department              489521 non-null  object             
 

In [30]:
# checking null value

inventoryitems_df.isnull().sum()

id                                     0
product_id                             0
created_at                             0
sold_at                           308249
cost                                   0
product_category                       0
product_name                           0
product_brand                          0
product_retail_price                   0
product_department                     0
product_sku                            0
product_distribution_center_id         0
dtype: int64

In [31]:
inventoryitems_df.head()

Unnamed: 0,id,product_id,created_at,sold_at,cost,product_category,product_name,product_brand,product_retail_price,product_department,product_sku,product_distribution_center_id
0,149079,13844,2025-07-09 05:12:51+00:00,2025-08-10 03:13:51+00:00,2.76804,Accessories,(ONE) 1 Satin Headband,Funny Girl Designs,6.99,Women,2A3E953A5E3D81E67945BCE5519F84C8,7
1,149080,13844,2021-09-08 16:23:00+00:00,NaT,2.76804,Accessories,(ONE) 1 Satin Headband,Funny Girl Designs,6.99,Women,2A3E953A5E3D81E67945BCE5519F84C8,7
2,327585,13844,2025-02-12 04:45:48+00:00,2025-04-08 22:24:48+00:00,2.76804,Accessories,(ONE) 1 Satin Headband,Funny Girl Designs,6.99,Women,2A3E953A5E3D81E67945BCE5519F84C8,7
3,327586,13844,2020-09-08 05:58:00+00:00,NaT,2.76804,Accessories,(ONE) 1 Satin Headband,Funny Girl Designs,6.99,Women,2A3E953A5E3D81E67945BCE5519F84C8,7
4,360165,13844,2023-12-30 02:38:29+00:00,2024-02-01 04:11:29+00:00,2.76804,Accessories,(ONE) 1 Satin Headband,Funny Girl Designs,6.99,Women,2A3E953A5E3D81E67945BCE5519F84C8,7


**inventoryitems_df null values**

All columns have 0 null values except `sold_at`, which has 308249. These are items that have not been sold; the remaining 181272 items (about 37% of the inventory) have a sale timestamp.
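
Because a null `sold_at` marks an unsold item, the column can be turned into a sold flag and a time-to-sell measure. The sketch below uses made-up rows; `is_sold` and `days_to_sell` are hypothetical column names introduced for illustration.

```python
import pandas as pd

# Toy stand-in for inventoryitems_df (made-up rows)
toy_inventory = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "created_at": pd.to_datetime(
        ["2024-01-01", "2024-01-01", "2024-02-01", "2024-02-01"], utc=True
    ),
    "sold_at": pd.to_datetime(
        ["2024-01-11", None, "2024-02-21", None], utc=True
    ),
})

# An item counts as sold when sold_at is present
toy_inventory["is_sold"] = toy_inventory["sold_at"].notnull()
sell_through_rate = toy_inventory["is_sold"].mean()

# Whole days between stocking and sale; missing for unsold items
toy_inventory["days_to_sell"] = (
    toy_inventory["sold_at"] - toy_inventory["created_at"]
).dt.days

print(f"Sell-through rate: {sell_through_rate:.2%}")
print(toy_inventory[["id", "is_sold", "days_to_sell"]])
```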

**orderitems_df**

In [32]:
# number of rows and columns
orderitems_df.shape

(181272, 11)

In [33]:
# null values
orderitems_df.isnull().sum()

id                        0
order_id                  0
user_id                   0
product_id                0
inventory_item_id         0
status                    0
created_at                0
shipped_at            63022
delivered_at         117265
returned_at          163012
sale_price                0
dtype: int64

In [60]:
orderitems_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181272 entries, 0 to 181271
Data columns (total 11 columns):
 #   Column             Non-Null Count   Dtype              
---  ------             --------------   -----              
 0   id                 181272 non-null  Int64              
 1   order_id           181272 non-null  Int64              
 2   user_id            181272 non-null  Int64              
 3   product_id         181272 non-null  Int64              
 4   inventory_item_id  181272 non-null  Int64              
 5   status             181272 non-null  object             
 6   created_at         181272 non-null  datetime64[us, UTC]
 7   shipped_at         118250 non-null  datetime64[us, UTC]
 8   delivered_at       64007 non-null   datetime64[us, UTC]
 9   returned_at        18260 non-null   datetime64[us, UTC]
 10  sale_price         181272 non-null  float64            
dtypes: Int64(5), datetime64[us, UTC](4), float64(1), object(1)
memory usage: 16.1+ MB


**orderitems_df null values**

63022 order items have no `shipped_at` value, meaning they have not been shipped yet or belong to cancelled orders. 117265 have no `delivered_at` value, so they have not been delivered, and 163012 have no `returned_at` value, so they were never returned.
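
Beyond null counts, the timestamps can be checked for consistency. The first preview row of `order_items` shown earlier has a `shipped_at` value that is earlier than its `created_at` value, so a simple ordering check may be worthwhile. The sketch below reuses the timestamps of the first two preview rows and adds a made-up unshipped row; the frame itself is a toy stand-in.

```python
import pandas as pd

# Rows 1 and 2 reuse timestamps from the order_items preview; row 3 is made up
toy_order_items = pd.DataFrame({
    "id": [1, 2, 3],
    "created_at": pd.to_datetime(
        ["2023-11-24 10:10:15", "2025-03-18 13:55:21", "2025-05-05 06:44:19"],
        utc=True,
    ),
    "shipped_at": pd.to_datetime(
        ["2023-11-24 05:06:00", "2025-03-20 08:28:00", None],
        utc=True,
    ),
})

# Comparisons involving NaT evaluate to False, so unshipped rows are not flagged
shipped_before_created = (
    toy_order_items["shipped_at"] < toy_order_items["created_at"]
)

print(toy_order_items[shipped_before_created])
print(f"Rows with shipped_at earlier than created_at: {shipped_before_created.sum()}")
```

How to treat such rows (drop, correct, or keep) depends on how the timestamps were generated, which the source does not state.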

**products_df**

In [36]:
# number of rows and columns
products_df.shape


(29120, 9)

In [59]:
products_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 29118 entries, 0 to 29119
Data columns (total 9 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      29118 non-null  Int64  
 1   cost                    29118 non-null  float64
 2   category                29118 non-null  object 
 3   name                    29118 non-null  object 
 4   brand                   29118 non-null  object 
 5   retail_price            29118 non-null  float64
 6   department              29118 non-null  object 
 7   sku                     29118 non-null  object 
 8   distribution_center_id  29118 non-null  Int64  
dtypes: Int64(2), float64(2), object(5)
memory usage: 2.3+ MB


In [37]:
# check null values

products_df.isnull().sum()

id                         0
cost                       0
category                   0
name                       2
brand                     24
retail_price               0
department                 0
sku                        0
distribution_center_id     0
dtype: int64

In [43]:
# replace null brand values with unknown

products_df['brand'] = products_df['brand'].fillna('unknown')

In [47]:
# products without name

products_df[products_df['name'].isnull()]

Unnamed: 0,id,cost,category,name,brand,retail_price,department,sku,distribution_center_id
3247,12586,18.972,Intimates,,Josie by Natori,36.0,Women,A7EA034186E14FB5F7B37CF664893CD2,1
5588,24455,67.335453,Outerwear & Coats,,Tru-Spec,147.990005,Men,B290A635641F585B3DD6B95FD42DC267,2


In [56]:
# drop null values in name

products_df = products_df.dropna(subset = ['name'])
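Dropping the two unnamed products leaves gaps in the index, which is why the `products_df.info()` output above (cell In [59], run after this step) reports `Index: 29118 entries, 0 to 29119` instead of a `RangeIndex`. If a contiguous index is preferred, both cleaning steps can be chained with a `reset_index`. This is a minimal sketch on a hypothetical sample frame; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for products_df
sample_products = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Jacket", None, "Socks", "Scarf"],
    "brand": ["Acme", "Tru-Spec", None, "Acme"],
})

cleaned = (
    sample_products
    .assign(brand=lambda d: d["brand"].fillna("unknown"))  # label missing brands
    .dropna(subset=["name"])                                # drop products without a name
    .reset_index(drop=True)                                 # restore a contiguous 0..n-1 index
)
print(cleaned)
```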

**users_df**

In [63]:
# for number of rows and columns
users_df.shape

(100000, 16)

In [62]:
users_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 16 columns):
 #   Column          Non-Null Count   Dtype              
---  ------          --------------   -----              
 0   id              100000 non-null  Int64              
 1   first_name      100000 non-null  object             
 2   last_name       100000 non-null  object             
 3   email           100000 non-null  object             
 4   age             100000 non-null  Int64              
 5   gender          100000 non-null  object             
 6   state           100000 non-null  object             
 7   street_address  100000 non-null  object             
 8   postal_code     100000 non-null  object             
 9   city            100000 non-null  object             
 10  country         100000 non-null  object             
 11  latitude        100000 non-null  float64            
 12  longitude       100000 non-null  float64            
 13  traffic_source 

In [64]:
# Null values

users_df.isnull().sum()

id                0
first_name        0
last_name         0
email             0
age               0
gender            0
state             0
street_address    0
postal_code       0
city              0
country           0
latitude          0
longitude         0
traffic_source    0
created_at        0
user_geom         0
dtype: int64

## Creating a Mastersheet 

**Join orders_df and users_df on user_id**

In [96]:
# check the columns in each table to spot duplicate column names before merging
print(f'order_df columns : {orders_df.columns}')


print(f'users_df columns : {users_df.columns}')

order_df columns : Index(['order_id', 'user_id', 'status', 'gender', 'created_at', 'returned_at',
       'shipped_at', 'delivered_at', 'num_of_item'],
      dtype='object')
users_df columns : Index(['id', 'first_name', 'last_name', 'email', 'age', 'gender', 'state',
       'street_address', 'postal_code', 'city', 'country', 'latitude',
       'longitude', 'traffic_source', 'created_at', 'user_geom'],
      dtype='object')


In [103]:
# a function to identify column names that appear in both tables

def duplicates(x, y):
    """Return the column names of x that also exist in y."""
    return [c for c in x.columns if c in y.columns]

In [105]:
duplicates(orders_df, users_df)

['gender', 'created_at']

In [82]:
user_orders = orders_df.merge(users_df, left_on = 'user_id', right_on = 'id', how = 'inner' )
user_orders.head()

Unnamed: 0,order_id,user_id,status,gender_x,created_at_x,returned_at,shipped_at,delivered_at,num_of_item,id,...,state,street_address,postal_code,city,country,latitude,longitude,traffic_source,created_at_y,user_geom
0,22,21,Cancelled,F,2024-09-29 08:45:00+00:00,NaT,NaT,NaT,1,21,...,Wallonia,297 James Bypass Suite 003,6043,Charleroi,Belgium,50.455259,4.48236,Organic,2024-08-19 08:45:00+00:00,POINT(4.482359587 50.45525932)
1,27,27,Cancelled,F,2022-09-15 14:24:00+00:00,NaT,NaT,NaT,1,27,...,Shanghai,17429 John Lakes Apt. 178,200231,Tianjin,China,31.124933,121.45133,Email,2022-07-05 14:24:00+00:00,POINT(121.4513298 31.12493343)
2,63,59,Cancelled,F,2022-02-03 05:19:00+00:00,NaT,NaT,NaT,2,59,...,Shanxi,425 Simmons Dale,30072,Tianjin,China,37.904266,112.554583,Search,2022-01-04 05:19:00+00:00,POINT(112.554583 37.90426574)
3,66,61,Cancelled,F,2020-05-07 10:04:00+00:00,NaT,NaT,NaT,1,61,...,Comunidad de Madrid,5915 Bennett Fall,28030,Madrid,Spain,40.40661,-3.641385,Search,2019-03-24 10:04:00+00:00,POINT(-3.641385174 40.40660981)
4,76,69,Cancelled,F,2023-03-05 11:14:00+00:00,NaT,NaT,NaT,1,69,...,Paraíba,4191 Cody Wall Suite 506,58900-000,Cajazeiras,Brasil,-6.919303,-38.542034,Search,2019-08-22 11:14:00+00:00,POINT(-38.54203425 -6.919302569)
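The `_x` and `_y` suffixes in the output above come from the overlapping `gender` and `created_at` columns. A possible alternative, sketched here on hypothetical sample frames, is to pass explicit `suffixes` so each column's origin is readable, and `validate` so pandas raises an error if a user id is not unique in the users table. The suffix names `_order` and `_user` are illustrative choices, not the notebook's.

```python
import pandas as pd

# Hypothetical samples standing in for orders_df and users_df
orders_sample = pd.DataFrame({
    "order_id": [22, 27],
    "user_id": [21, 27],
    "gender": ["F", "F"],
    "created_at": pd.to_datetime(["2024-09-29", "2022-09-15"], utc=True),
})
users_sample = pd.DataFrame({
    "id": [21, 27],
    "gender": ["F", "F"],
    "created_at": pd.to_datetime(["2024-08-19", "2022-07-05"], utc=True),
})

merged_sample = orders_sample.merge(
    users_sample,
    left_on="user_id",
    right_on="id",
    how="inner",
    suffixes=("_order", "_user"),  # readable names instead of _x / _y
    validate="many_to_one",        # many orders per user, one row per user
)
print(merged_sample.columns.tolist())
```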


In [83]:
# check for nulls
user_orders.isnull().sum()

order_id               0
user_id                0
status                 0
gender_x               0
created_at_x           0
returned_at       112200
shipped_at         43471
delivered_at       80828
num_of_item            0
id                     0
first_name             0
last_name              0
email                  0
age                    0
gender_y               0
state                  0
street_address         0
postal_code            0
city                   0
country                0
latitude               0
longitude              0
traffic_source         0
created_at_y           0
user_geom              0
dtype: int64

In [84]:
# check duplicates

user_orders.duplicated().sum()

0

In [85]:
user_orders.shape

(124691, 25)

#### Merge orderitems_df with user_orders

In [79]:
orderitems_df.head(5)

Unnamed: 0,id,order_id,user_id,product_id,inventory_item_id,status,created_at,shipped_at,delivered_at,returned_at,sale_price
0,110755,76044,60769,14235,299107,Complete,2023-11-24 10:10:15+00:00,2023-11-24 05:06:00+00:00,2023-11-28 13:36:00+00:00,NaT,0.02
1,42569,29086,23286,14235,114772,Shipped,2025-03-18 13:55:21+00:00,2025-03-20 08:28:00+00:00,NaT,NaT,0.02
2,68460,46920,37539,14235,184744,Shipped,2025-05-05 06:44:19+00:00,2025-05-06 12:16:00+00:00,NaT,NaT,0.02
3,87121,59811,47896,14235,235215,Shipped,2025-03-28 03:24:05+00:00,2025-03-28 17:29:00+00:00,NaT,NaT,0.02
4,116285,79822,63823,14235,314052,Shipped,2021-10-05 01:35:54+00:00,2021-10-07 00:21:00+00:00,NaT,NaT,0.02


In [86]:
complete_orders_df = user_orders.merge(orderitems_df, on = 'user_id', how = 'inner')

In [87]:
#confirm shape
complete_orders_df.shape

(369938, 35)
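The merged frame has 369,938 rows, roughly twice the 181,272 rows in `orderitems_df`. That is what a join on `user_id` alone produces when a user has more than one order: every order of that user is paired with every item the user ever bought. The separate `order_id_x` and `order_id_y` columns in the merged frame point the same way. A sketch of joining on both keys follows, using hypothetical sample frames and illustrative suffix names; whether this matches the intended master sheet is an assumption.

```python
import pandas as pd

# Hypothetical samples: one user with two orders and three items
user_orders_sample = pd.DataFrame({
    "order_id": [1, 2],
    "user_id": [10, 10],
    "status": ["Complete", "Shipped"],
})
items_sample = pd.DataFrame({
    "id": [100, 101, 102],
    "order_id": [1, 1, 2],
    "user_id": [10, 10, 10],
    "status": ["Complete", "Complete", "Shipped"],
    "sale_price": [5.0, 7.5, 3.0],
})

# Joining on user_id only pairs every order with every item of that user
by_user = user_orders_sample.merge(items_sample, on="user_id", how="inner")

# Joining on both keys keeps each item attached to its own order
by_order = user_orders_sample.merge(
    items_sample,
    on=["order_id", "user_id"],
    how="inner",
    suffixes=("_order", "_item"),
    validate="one_to_many",  # one row per order on the left, many items on the right
)
print(len(by_user), len(by_order))
```

With both keys, the result should have at most one row per order item, so its row count can be compared directly against `orderitems_df.shape[0]`.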

In [88]:
complete_orders_df.isnull().sum()

order_id_x                0
user_id                   0
status_x                  0
gender_x                  0
created_at_x              0
returned_at_x        332973
shipped_at_x         128689
delivered_at_x       239546
num_of_item               0
id_x                      0
first_name                0
last_name                 0
email                     0
age                       0
gender_y                  0
state                     0
street_address            0
postal_code               0
city                      0
country                   0
latitude                  0
longitude                 0
traffic_source            0
created_at_y              0
user_geom                 0
id_y                      0
order_id_y                0
product_id                0
inventory_item_id         0
status_y                  0
created_at                0
shipped_at_y         128551
delivered_at_y       238988
returned_at_y        332646
sale_price                0
dtype: int64

