# Purpose

Now that we have the data for each of our customers (see notebook 00_Generating_100Customers_NV), and data for each product our grocery shop sells (see 01_Scrape_Tesco_Groceries_NV), we can now prepare the data so that it is easier to simulate.

This means we'll be assigning an ID to each customer and to each item, so that it becomes easier to store, and we'll decide on how to store information about each transaction.

In [79]:
import numpy as np
import pandas as pd
from pprint import pprint

In [2]:
customers = pd.read_json(
    '../data/processed/customers.json'
)
display(customers.sample(5))
print(customers.shape)

items = pd.read_json(
    '../data/raw/sample_products.json'
)
display(items.sample(5))
print(items.shape)

Unnamed: 0,age,habit,budget
8,26,sporadic,115.0
65,26,sporadic,115.0
32,70,sporadic,115.0
95,63,sporadic,115.0
11,30,sporadic,115.0


(100, 3)


Unnamed: 0,product,price,type
105,Diet Coke 2L,1.77,drinks
227,Swan Menthol Extra Slim Filters 120 Pack,1.03,home-and-ents
126,Milton Antibacterial Surface Wipes X30,2.0,baby
150,Colgate Deep Clean Whitening Toothpaste 125Ml,3.8,health-and-beauty
23,Whole Cucumber Each,0.43,fresh-food


(240, 3)


We need to generate a unique ID for each customer. We do this so that we can store a customer ID in a database, and not the 3 columns we currently have. The same holds for the items.

One quick way to do this is to generate a unique ID using python's built-in `uuid` module.

In [3]:
import uuid

`uuid` has a handy function, `.uuid4()`, which generates a random ID.

In [4]:
uuid.uuid4()

UUID('489df39d-e2ac-4718-91cd-ddaf36d4c96c')

We can then use the `.fields` attribute to get a numeric ID.

In [5]:
uuid.uuid4().fields

(803145970, 45464, 17617, 143, 242, 198400897234744)

Since this is a fairly lightweight operation, we will use the first field, as it always yields a 32-bit integer. In other applications, we can use the other fields. We need to make sure to keep a record of which customers have which ID!

In [6]:
customers['uuid'] = [uuid.uuid4().fields[0] for _ in range(len(customers))]
print(len(customers['uuid'].unique()))

100


And with this we have 100 unique IDs, one for each customer. Let's keep a record of the new data, now in CV format.

In [7]:
customers.to_csv('../data/processed/customers_database.csv', index=False)

Now we do the same for the items. For practice, let's use one of the 16-bit integer fields.

In [8]:
items['uuid'] = [uuid.uuid4().fields[1] for _ in range(len(items))]
print(len(items['uuid'].unique()))

240


And we save.

In [9]:
items.to_csv('../data/processed/items_database.csv', index=False)

Onto a simple simulation.

### Simple simulations

Now that we have a list of customers, with budgets, and a list of products they could buy, with prices, we can simulate a shopping interaction. For simplicity, we will assume that customer who shop daily buy something from every category every day.

Let's start simple. For a given customer, we sample an item, add the cost to the bill, then check if the bill exceeds the budget. If it does, we pop the last item and checkout. In this scenario, a customer is just as likely to buy 2 packs of bananas, or 2 household items.

In [82]:
def generate_bill(customer: pd.Series, items: pd.DataFrame) -> (list, list):
    """
    Purchases for a single customer.
    
    We add items to a customer's basket, at random, and check if the
    sub-total exceeds the customer's budget.
    """
    items_purchased = []
    sub_total = []
    cust_budget = customer['budget']
    while sum(sub_total) < cust_budget:
        item = items.sample()
        sub_total.append(item['price'].values[0])
        items_purchased.append(item['uuid'].values[0])
    if sum(sub_total) > customer['budget']:
        sub_total.pop()
        items_purchased.pop()
    return items_purchased, sub_total

In [83]:
# Let's test
generate_bill(
    customer=customers.query('habit == "daily"').iloc[0],
    items=items
)

([31458, 60756], [1.5, 0.8])

Since we have a way to generate bills, so now we need to generate bills for one day for every customer, and bills for one week.

In [84]:
# We don't know when weekly customers will come in, so we will assign a
# random number, from 0 to 6, which will dictate when the "weekly" customers
# come in.
weekly_customers = customers.query('habit == "weekly"')
weekly_customers['day_in'] = np.random.randint(
    0, 7, 
    size=len(weekly_customers)
)
display(weekly_customers)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  weekly_customers['day_in'] = np.random.randint(


Unnamed: 0,age,habit,budget,uuid,day_in
0,24,weekly,26.5,3031153441,2
2,82,weekly,26.5,464317747,5
3,75,weekly,26.5,1412573174,0
5,52,weekly,26.5,3785220396,6
12,85,weekly,26.5,4178248121,6
14,46,weekly,26.5,3982346988,5
17,92,weekly,26.5,3745847835,4
18,85,weekly,26.5,789231310,4
19,54,weekly,26.5,2493851942,1
20,84,weekly,26.5,961301073,3


So now we can simulate a week's worth of billings. Let's ignore the hour component in the simulation for now.

In [85]:
for day in range(7):  # Sunday is 0
    todays_customers = customers.query('habit == "daily"').append(
        weekly_customers.query(f'day_in == {day}')
    )
    for idx,customer in todays_customers.iterrows():
        display(generate_bill(customer, items))
    break

([41231], [3.75])

([14117, 13222, 33464, 59734], [0.7000000000000001, 1.3, 0.5, 1.0])

([25512], [1.0])

([25764, 41428, 552, 54369, 60756, 4367, 3628, 11921],
 [1.85, 2.2, 7.0, 11.0, 0.8, 0.75, 0.85, 0.6000000000000001])

([13413, 26335, 19902, 6133, 27893, 35126, 18736],
 [4.0, 0.7000000000000001, 1.0, 1.77, 0.99, 1.5, 9.5])

([48466, 13824, 20944, 43094, 46027, 13413, 12388, 23858, 57458, 39324, 39962],
 [7.6, 1.48, 1.5, 1.5, 1.0, 4.0, 2.0, 1.0, 0.55, 2.85, 1.65])

([7737, 46027, 28275, 26820, 60151, 28451, 28474, 24832, 43692, 2846, 30113],
 [1.5,
  1.0,
  1.46,
  1.35,
  9.75,
  0.55,
  1.5,
  0.9500000000000001,
  1.5,
  2.95,
  1.9500000000000002])

([53453,
  64049,
  23876,
  52269,
  52529,
  49701,
  50775,
  14588,
  18633,
  13824,
  48361,
  2846,
  35126,
  1584,
  27893,
  23876,
  35566,
  5651],
 [1.55,
  1.0,
  1.13,
  1.0,
  0.9,
  2.0,
  0.6900000000000001,
  1.15,
  0.7000000000000001,
  1.48,
  0.85,
  2.95,
  1.5,
  2.5,
  0.99,
  1.13,
  1.0,
  1.48])

([5651, 52171, 49258, 10921, 25652, 39073, 52381, 30026, 16744, 60791],
 [1.48, 0.6000000000000001, 0.99, 1.5, 1.46, 13.0, 1.25, 1.0, 1.0, 3.29])

Ok, the skeleton is there. All we now need is to store the day and generate a unique transaction ID, for which we can rely again on `uuid`.

In [87]:
billings = []
for day in range(7):  # Sunday is 0
    todays_customers = customers.query('habit == "daily"').append(
        weekly_customers.query(f'day_in == {day}')
    )
    for idx,customer in todays_customers.set_index('uuid').iterrows():
        bill = generate_bill(customer, items)
        bill_id = uuid.uuid4().fields[0]
        for it,pr in zip(bill[0], bill[1]):
            billings.append(
                {
                    'id': bill_id,
                    'day': day,
                    'customer_id': idx,
                    'items': it,
                    'total': pr
                }
            )
pprint(billings)

[{'customer_id': 958420868,
  'day': 0,
  'id': 217520049,
  'items': 48607,
  'total': 1.8},
 {'customer_id': 958420868,
  'day': 0,
  'id': 217520049,
  'items': 3058,
  'total': 0.9},
 {'customer_id': 958420868,
  'day': 0,
  'id': 217520049,
  'items': 18633,
  'total': 0.7000000000000001},
 {'customer_id': 4036240255,
  'day': 0,
  'id': 2564927050,
  'items': 62726,
  'total': 1.75},
 {'customer_id': 1412573174,
  'day': 0,
  'id': 2842132865,
  'items': 16846,
  'total': 1.0},
 {'customer_id': 1412573174,
  'day': 0,
  'id': 2842132865,
  'items': 43692,
  'total': 1.5},
 {'customer_id': 1412573174,
  'day': 0,
  'id': 2842132865,
  'items': 62913,
  'total': 1.8},
 {'customer_id': 1412573174,
  'day': 0,
  'id': 2842132865,
  'items': 63618,
  'total': 4.0},
 {'customer_id': 1412573174,
  'day': 0,
  'id': 2842132865,
  'items': 18736,
  'total': 9.5},
 {'customer_id': 1412573174,
  'day': 0,
  'id': 2842132865,
  'items': 56792,
  'total': 1.65},
 {'customer_id': 1412573174,
 

  'id': 2809861288,
  'items': 46636,
  'total': 3.65},
 {'customer_id': 3031153441,
  'day': 2,
  'id': 2809861288,
  'items': 39584,
  'total': 2.25},
 {'customer_id': 3031153441,
  'day': 2,
  'id': 2809861288,
  'items': 47290,
  'total': 0.30000000000000004},
 {'customer_id': 3031153441,
  'day': 2,
  'id': 2809861288,
  'items': 36057,
  'total': 0.65},
 {'customer_id': 2630012468,
  'day': 2,
  'id': 1588363933,
  'items': 52196,
  'total': 2.9},
 {'customer_id': 2630012468,
  'day': 2,
  'id': 1588363933,
  'items': 6458,
  'total': 1.99},
 {'customer_id': 2630012468,
  'day': 2,
  'id': 1588363933,
  'items': 15694,
  'total': 5.5},
 {'customer_id': 2630012468,
  'day': 2,
  'id': 1588363933,
  'items': 41765,
  'total': 1.5},
 {'customer_id': 2630012468,
  'day': 2,
  'id': 1588363933,
  'items': 7969,
  'total': 2.35},
 {'customer_id': 2630012468,
  'day': 2,
  'id': 1588363933,
  'items': 56792,
  'total': 1.65},
 {'customer_id': 2630012468,
  'day': 2,
  'id': 1588363933,


  'day': 4,
  'id': 590038179,
  'items': 62726,
  'total': 1.75},
 {'customer_id': 603295827,
  'day': 4,
  'id': 590038179,
  'items': 28174,
  'total': 2.0},
 {'customer_id': 603295827,
  'day': 4,
  'id': 590038179,
  'items': 10132,
  'total': 1.3},
 {'customer_id': 603295827,
  'day': 4,
  'id': 590038179,
  'items': 48274,
  'total': 1.85},
 {'customer_id': 603295827,
  'day': 4,
  'id': 590038179,
  'items': 1614,
  'total': 3.5},
 {'customer_id': 106098987,
  'day': 4,
  'id': 654409269,
  'items': 13824,
  'total': 1.48},
 {'customer_id': 106098987,
  'day': 4,
  'id': 654409269,
  'items': 14679,
  'total': 0.6000000000000001},
 {'customer_id': 106098987,
  'day': 4,
  'id': 654409269,
  'items': 64049,
  'total': 1.0},
 {'customer_id': 106098987,
  'day': 4,
  'id': 654409269,
  'items': 6796,
  'total': 12.0},
 {'customer_id': 106098987,
  'day': 4,
  'id': 654409269,
  'items': 17707,
  'total': 4.0},
 {'customer_id': 106098987,
  'day': 4,
  'id': 654409269,
  'items': 4

 {'customer_id': 824870520,
  'day': 6,
  'id': 70746943,
  'items': 48607,
  'total': 1.8},
 {'customer_id': 824870520,
  'day': 6,
  'id': 70746943,
  'items': 25592,
  'total': 1.0},
 {'customer_id': 824870520,
  'day': 6,
  'id': 70746943,
  'items': 35527,
  'total': 0.6900000000000001},
 {'customer_id': 824870520,
  'day': 6,
  'id': 70746943,
  'items': 8334,
  'total': 1.5},
 {'customer_id': 824870520,
  'day': 6,
  'id': 70746943,
  'items': 25070,
  'total': 1.3},
 {'customer_id': 824870520,
  'day': 6,
  'id': 70746943,
  'items': 8334,
  'total': 1.5},
 {'customer_id': 824870520,
  'day': 6,
  'id': 70746943,
  'items': 8560,
  'total': 3.0},
 {'customer_id': 824870520,
  'day': 6,
  'id': 70746943,
  'items': 1960,
  'total': 1.79},
 {'customer_id': 824870520,
  'day': 6,
  'id': 70746943,
  'items': 25652,
  'total': 1.46},
 {'customer_id': 824870520,
  'day': 6,
  'id': 70746943,
  'items': 41069,
  'total': 3.5},
 {'customer_id': 824870520,
  'day': 6,
  'id': 70746943,

We want to store the data in an as granular a format as possible, as it will make it easier to both store in a SQL database, or process in pandas DataFrames (or other DataFrame libraries, like data.table, dask, and so on).

The final step is to create a timestamp for each transaction! We have a day, all we need now is to generate hour/minutes/seconds for each transaction. We will do this by using python's builtin `random` module. We will assume our store opens at 8am and closes at 10pm.

In [183]:
import random

def generate_time() -> tuple:
    """Generates a random time between 8am and 10pm, when our store is open."""
    rtime   = int(random.random() * 86_400)
    
    hours   = rtime//3600
    minutes = (rtime - hours*3600)//60
    seconds = rtime - hours*3600 - minutes*60

    if (hours <= 8) or (hours >= 22):
        return generate_time()
    else:
        return hours, minutes, seconds

For looping through datetimes, we can rely on pandas' function, `pd.date_range`, in which we can define a starting date, ending date, and frequency. We can then use `datetime.timedelta` to generate the final datetime, and then get the timestamp.

In [175]:
from datetime import timedelta

In [200]:
def generate_timestamp(date):
    """Returns a UNIX timestamp for a given date + random time."""
    return int((date + timedelta(*generate_time())).timestamp())

In [201]:
generate_timestamp(pd.date_range(start='2031-01-01', end='2031-01-08')[0])

1926547240

### Putting it all together

Now that we have everything in place, let's generate a sample of the data we are looking to simulate. Let's generate 1 week's worth of transactions!

In [213]:
billings = []
for day in pd.date_range(start='2031-01-01', end='2031-01-08'):
    # we get the weekday from the timestamp
    todays_customers = customers.query('habit == "daily"').append(
        weekly_customers.query(f'day_in == {int(day.strftime("%w"))}')
    )
    for idx,customer in todays_customers.set_index('uuid').iterrows():
        bill = generate_bill(customer, items)
        bill_id = uuid.uuid4().fields[0]
        for it,pr in zip(bill[0], bill[1]):
            billings.append(
                {
                    'id': bill_id,
                    'timestamp': generate_timestamp(day),
                    'customer_id': idx,
                    'items': it,
                    'total': pr
                }
            )
pprint(billings[:10])

[{'customer_id': 958420868,
  'id': 3526142073,
  'items': 23339,
  'timestamp': 1926028826,
  'total': 1.0},
 {'customer_id': 958420868,
  'id': 3526142073,
  'items': 43692,
  'timestamp': 1926806415,
  'total': 1.5},
 {'customer_id': 4036240255,
  'id': 2343320927,
  'items': 53646,
  'timestamp': 1926547255,
  'total': 1.35},
 {'customer_id': 269896857,
  'id': 651960562,
  'items': 53480,
  'timestamp': 1926028855,
  'total': 0.75},
 {'customer_id': 269896857,
  'id': 651960562,
  'items': 35527,
  'timestamp': 1926720037,
  'total': 0.6900000000000001},
 {'customer_id': 961301073,
  'id': 3461929721,
  'items': 3410,
  'timestamp': 1926115212,
  'total': 1.0},
 {'customer_id': 961301073,
  'id': 3461929721,
  'items': 32,
  'timestamp': 1926201622,
  'total': 1.4},
 {'customer_id': 961301073,
  'id': 3461929721,
  'items': 35126,
  'timestamp': 1926806425,
  'total': 1.5},
 {'customer_id': 961301073,
  'id': 3461929721,
  'items': 28174,
  'timestamp': 1926201610,
  'total': 2.0}

Awesome! We can analyse the results better by converting them to a DataFrame.

In [216]:
pd.DataFrame(billings).head(10)

Unnamed: 0,id,timestamp,customer_id,items,total
0,3526142073,1926028826,958420868,23339,1.0
1,3526142073,1926806415,958420868,43692,1.5
2,2343320927,1926547255,4036240255,53646,1.35
3,651960562,1926028855,269896857,53480,0.75
4,651960562,1926720037,269896857,35527,0.69
5,3461929721,1926115212,961301073,3410,1.0
6,3461929721,1926201622,961301073,32,1.4
7,3461929721,1926806425,961301073,35126,1.5
8,3461929721,1926201610,961301073,28174,2.0
9,3461929721,1926547228,961301073,18633,0.7


With this, we now have a repeatable way of generating billings for any number of days.

In the next notebook, we will generate billings for 1 year, which will include the sporadic customers.

In [222]:
day.day

8