# Generating sales and soh records in a stream

This notebook generates sales and stock-on-hand (SOH) records as a stream, simulating data that arrives from endpoints.

It reuses a copy of the functions from the notebook "generating_batch_sales".

Requirements:
1. Install Kafka locally by following this [tutorial](https://kafka.apache.org/quickstart)
2. Run the cells in order

In [1]:
!pip install confluent-kafka --break-system-packages

Defaulting to user installation because normal site-packages is not writeable


In [2]:
from confluent_kafka import Producer, KafkaException
from confluent_kafka.admin import AdminClient, NewTopic
from json import dumps
from datetime import timedelta

import socket
import pandas as pd
import time
import datetime
import random
from dotenv import load_dotenv
import os

import sys
sys.path.append('../../libraries')
import utils

In [3]:
load_dotenv()

True

### Configuring kafka topics

Create the topics `sales_topic` and `soh_topic`, which receive the data sent by the producers.

In [4]:
admin_client = AdminClient({
    'bootstrap.servers': os.getenv('KAFKA_SERVER')
})

topic_config = {
    'partitions': 3,  
    'replication_factor': 1,
    'config': {
        'retention.bytes': '10485760',  # 10 MB retention size
        'retention.ms': '86400000', # 1 day retention time
    }
}
for topic in ['sales_topic', 'soh_topic']:
    try:
        # list_topics raises KafkaException if the broker cannot be reached
        admin_client.list_topics(timeout=10)
        topic_creation_result = admin_client.create_topics(
            [NewTopic(
                topic, 
                num_partitions=topic_config['partitions'], 
                replication_factor=topic_config['replication_factor'], 
                config=topic_config['config']
            )]
        )

        topic_creation_result[topic].result()

        print("Topic created successfully.")
    except KafkaException as e:
        print(f"Error creating topic: {e}")


%3|1744149910.433|FAIL|rdkafka#producer-1| [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Connect to ipv4#127.0.0.1:9092 failed: Connection refused (after 0ms in state CONNECT)
%3|1744149911.433|FAIL|rdkafka#producer-1| [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Connect to ipv4#127.0.0.1:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)


Error creating topic: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
Error creating topic: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
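
The "Connection refused" lines in the output above indicate that nothing was listening on `localhost:9092` when the cell ran. A minimal sketch of a pre-flight check is shown below. `broker_is_reachable` is a hypothetical helper, not part of confluent-kafka, and it assumes `KAFKA_SERVER` holds a single `host:port` pair rather than a comma-separated list. It only tests that the TCP port accepts connections, not that a healthy Kafka broker is behind it.

```python
import socket


def broker_is_reachable(bootstrap_server, timeout=2.0):
    """Return True if a TCP connection to 'host:port' succeeds (hypothetical helper)."""
    # Split on the last colon so the port is always the final component.
    host, _, port = bootstrap_server.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):
        # OSError: refused, unreachable, or timed out. ValueError: missing or invalid port.
        return False
```

Calling `broker_is_reachable(os.getenv('KAFKA_SERVER'))` before creating the topics would let the notebook stop early with a clear message instead of retrying in the background.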


### Simulate sales and replenishment

As in the batch notebook, this section simulates sales across all stores.

In [5]:
def get_sales_consumption_weight(sales_consumption_df, country_code, season):
    """
    Retrieves the sales consumption weight for a country and season.
    
    Parameters: 
    sales_consumption_df (pandas.DataFrame): Distribution of customer's behavior based on season of the year
    country_code (str): Code of a country.
    season (str): Current season of the year

    Returns: 
    (float) Customer behavior in the specified country in the current season
    """
    weight_rows = sales_consumption_df[sales_consumption_df['country'].str.upper() == country_code]
    return weight_rows[season].values[0] if not weight_rows.empty else 1.0

In [6]:
def get_current_season(current_date):
    """
    Determines the current season of the year based on the date.

    Parameter: 
    current_date(datetime): A date using datetime library

    Returns: 
    (str) The season on which the current date falls.
    """
    
    day = current_date.timetuple().tm_yday
    if 80 <= day <= 172: return "Spring"
    elif 173 <= day <= 266: return "Summer"
    elif 267 <= day <= 355: return "Fall"
    return "Winter"
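
The day-of-year thresholds in `get_current_season` are easier to check once they are converted back to calendar dates. The sketch below does that for 2025, a non-leap year. `day_of_year_to_date` is a helper introduced here for illustration only.

```python
from datetime import date, timedelta


def day_of_year_to_date(year, day_of_year):
    """Convert a 1-based day of the year to a calendar date."""
    return date(year, 1, 1) + timedelta(days=day_of_year - 1)


# Thresholds copied from get_current_season above; every other day is "Winter".
season_ranges = {"Spring": (80, 172), "Summer": (173, 266), "Fall": (267, 355)}

for season, (first, last) in season_ranges.items():
    start = day_of_year_to_date(2025, first)
    end = day_of_year_to_date(2025, last)
    print(f"{season}: {start:%B %d} to {end:%B %d}")

# Spring: March 21 to June 21
# Summer: June 22 to September 23
# Fall: September 24 to December 21
```

In a leap year each boundary falls one calendar day earlier, because the thresholds are fixed day numbers. The same ranges are applied to every country, since the function takes only a date.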

In [7]:
def generate_sales_quantity(sales_consumption_weight):
    """
    Generates the number of sales for a day.

    Parameter: 
    sales_consumption_weight (float): Weight of customers' buying behavior; scales the mean of 50 sales

    Return: 
    (int) The number of sales, never negative
    """
    return max(0, int(random.gauss(50 * sales_consumption_weight, 10)))
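
To see how the weight shifts the daily number of sales, the sketch below draws many samples for a few weights. The formula is repeated from the cell above so the block runs on its own. The weights 0.5, 1.0, and 1.5 and the seed are arbitrary choices for illustration, not values taken from the data files.

```python
import random
import statistics


def generate_sales_quantity(sales_consumption_weight):
    # Same formula as the cell above: normal draw, truncated to int, clamped at 0.
    return max(0, int(random.gauss(50 * sales_consumption_weight, 10)))


random.seed(42)  # fixed seed only so repeated runs print the same numbers

for weight in (0.5, 1.0, 1.5):
    samples = [generate_sales_quantity(weight) for _ in range(10_000)]
    # Print the weight, the sample mean, and the observed range.
    print(weight, round(statistics.mean(samples), 1), min(samples), max(samples))
```

The sample mean sits slightly below `50 * weight` because `int()` truncates the fractional part, and `max(0, ...)` only matters for small weights, where the normal draw can go negative.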

In [8]:
def select_product_from_category(products_df, category, available_products):
    """
    Selects a product from a category, considering available products.
    
    Parameters:
    products_df (pandas.DataFrame): Current products offered 
    category (str): A category in which the products classify
    available_products (list): A list of the products provided in the current store

    Return: 
    (int) index of a product - simulates the customer behavior
    """
    category_products = products_df[
        (products_df['category'].str.contains(category, case=False)) & 
        (products_df['productCode'].isin(available_products))
    ]
    return random.choice(category_products.index) if not category_products.empty else None

In [9]:
def generate_date_format(date):
    """
    Formats a date using one of four randomly chosen formats.

    Parameter:
    date (datetime): A date

    Return:
    (str) The date formatted with the chosen format
    """
    format_ = random.choice([
        "%Y-%m-%d",      # 2023-12-31
        "%d/%m/%Y",      # 31/12/2023
        "%m-%d-%Y",      # 12-31-2023
        "%B %d, %Y",     # December 31, 2023
    ])
    return date.strftime(format_)
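
Because every record may carry a different date format, whatever reads the stream has to try several formats. The sketch below is one possible way to do that. `parse_mixed_date` is a hypothetical helper, and parsing `%B` assumes the process runs with an English locale.

```python
from datetime import date, datetime

# The four formats emitted by generate_date_format above.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y", "%B %d, %Y")


def parse_mixed_date(text):
    """Try each known format in turn; return a date, or None if nothing matches."""
    if text is None:
        return None  # the simulation can null out the date field
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # wrong format, try the next one
    return None


for text in ("2023-12-31", "31/12/2023", "12-31-2023", "December 31, 2023"):
    print(text, "->", parse_mixed_date(text))
```

The four formats do not collide with each other: they differ in separator or in the position of the four-digit year, so at most one of them matches a given string.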

These functions differ from the batch notebook: they send the data to a specific Kafka topic.

In [10]:
def record_sale(producer, current_date, store, sku, quantity):
    """
    Sends a sale to `sales_topic`, randomly injecting noise: mixed-case store codes, negative quantities, and null values.
    
    Parameters:
    producer (confluent-kafka.Producer): Producer object of confluent-kafka to send data to a Kafka topic
    current_date (datetime): Date of the sale occurred
    store (str): site_code of the store based on its unique identifier
    sku (str): unique code of the product
    quantity (int): quantity of the selected product 
    """
    sale_record = [sku, quantity, store, generate_date_format(current_date)]
    cols = ['sku', 'quantity', random.choice(['store', 'site_code']), 'date']
    
    if random.random() < 0.25:
        sale_record[2] = randomize_case(sale_record[2])

    if random.random() < 0.1: 
        sale_record[1] *= -1

    if random.random() < 0.1:
        sale_record[random.randint(0, 3)] = None
    
    producer.produce(
        topic='sales_topic', 
        key='sales', 
        value=dumps(dict(zip(cols, sale_record)))
    )
    time.sleep(random.random()*3)


def record_replenishment(producer, current_date, store, sku, quantity):
    """
    Sends a SOH row to `soh_topic`, randomly injecting noise: mixed-case store codes, negative quantities, and null values.
    
    Parameters: 
    producer (confluent-kafka.Producer): Producer object of confluent-kafka to send data to a Kafka topic
    current_date (datetime): Date of the replenishment
    store (str): site_code of the store based on its identifier
    sku (str): unique code of the product
    quantity (int): quantity of replenishment for the selected product 
    """
    soh_record = [store, sku, quantity, generate_date_format(current_date)]
    cols = [random.choice(['site_code', 'store']), 'sku', 'quantity', 'date']
    
    if random.random() < 0.25:
        soh_record[0] = randomize_case(soh_record[0])

    if random.random() < 0.1: 
        soh_record[2] *= -1

    if random.random() < 0.1: 
        soh_record[random.randint(0, 3)] = None
    
    producer.produce(
        topic='soh_topic', 
        key='soh', 
        value=dumps(dict(zip(cols, soh_record)))
    )
    time.sleep(random.random()*3)


def randomize_case(store_code):
    """
    Randomly changes the case of the store code.
    
    Parameters:
    store_code (str): The store code to modify.
    
    Returns:
    (str) The store code with random case changes.
    """
    return ''.join(random.choice([char.upper(), char.lower()]) for char in store_code)
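
The noise injected above (alternating `store`/`site_code` key, random casing, sign flips, nulls) is what a downstream consumer has to undo. The sketch below shows one way to normalize a single message value. `normalize_record` is hypothetical, and it rests on two assumptions that are not confirmed by this notebook: canonical store codes are upper case, and a negative quantity is a flipped sign rather than a return.

```python
import json


def normalize_record(raw_value):
    """Decode one message value and undo the noise that can be undone (hypothetical helper)."""
    record = json.loads(raw_value)
    # The store column arrives under either name.
    store = record.get("site_code", record.get("store"))
    quantity = record.get("quantity")
    return {
        # Assumption: canonical store codes are upper case.
        "site_code": store.upper() if isinstance(store, str) else None,
        "sku": record.get("sku"),
        # Assumption: a negative quantity is a sign flip, not a return.
        "quantity": abs(quantity) if isinstance(quantity, (int, float)) else None,
        "date": record.get("date"),
    }


# Made-up message for illustration.
message = json.dumps({"sku": "A1", "quantity": -3, "store": "mEx001", "date": "2025-01-02"})
print(normalize_record(message))
```

Fields that were replaced by `None` cannot be recovered and are passed through as `None`, so the consumer still has to decide whether to drop or keep such rows.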


In [11]:
def simulate_sales(products_df, weights_df, inventory,
                          sales_consumption_df, site_codes, current_date):
    """
    Generates random sales for the current date and sends them to Kafka.

    Parameters:
    products_df (pandas.DataFrame): Dataframe of products offered in all locations
    weights_df (pandas.DataFrame): Dataframe of category consumption weights per country
    inventory (pandas.DataFrame): Dataframe of current inventory, indexed by (site_code, sku)
    sales_consumption_df (pandas.DataFrame): Dataframe of seasonal sales weights per country
    site_codes (list): list of stores in all countries
    current_date (datetime): The date to simulate
    """
    config = {
        'client.id': os.getenv('KAFKA_USER', socket.gethostname()), 
        'bootstrap.servers': os.getenv('KAFKA_SERVER'),
        'retries': 3
    }
    producer = Producer(config)
    for store in site_codes:
        country_code = store[:3].upper()
        if country_code[-1] == '0': country_code = country_code[:2]
        country = weights_df[weights_df['country'].str.upper().str.startswith(country_code)].iloc[0].country

        season = get_current_season(current_date)
        sales_consumption_weight = get_sales_consumption_weight(sales_consumption_df, country_code, season)
        num_sales = generate_sales_quantity(sales_consumption_weight)

        country_weights = weights_df[weights_df['country'] == country]
        categories = country_weights['category'].tolist()
        weights = country_weights['consumption'].tolist()

        available_products = inventory.loc[store].index.tolist()

        for _ in range(num_sales):
            chosen_category = random.choices(categories, weights=weights, k=1)[0]
            chosen_product_index = select_product_from_category(products_df, chosen_category, available_products)

            if chosen_product_index is not None:
                product = products_df.loc[chosen_product_index]
                sku = product['productCode']

                try:
                    current_stock = inventory.loc[(store, sku), 'quantity']
                    if current_stock > 0:
                        quantity = min(current_stock, random.randint(1, 5))
                        inventory.loc[(store, sku), 'quantity'] -= quantity
                        record_sale(producer, current_date, store, sku, quantity)

                        current_stock = inventory.loc[(store, sku), 'quantity']
                        if current_stock < 20:
                            replenishment_quantity = 200
                            inventory.loc[(store, sku), 'quantity'] += replenishment_quantity
                            record_replenishment(producer, current_date, store, sku, replenishment_quantity)
                except KeyError:
                    print(f"KeyError: {(store, sku)}")
                    # Deliver the messages already queued before aborting this day
                    producer.flush()
                    return
    producer.flush()
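
The stock logic inside `simulate_sales` can be isolated from pandas and Kafka. The sketch below reproduces the same rule on a single number: sell at most what is on hand, then add 200 units if the remaining stock is below 20. `sell_and_replenish` is a hypothetical helper written for illustration; the thresholds are the ones hard-coded above.

```python
def sell_and_replenish(stock, requested, threshold=20, refill=200):
    """Apply one sale to a stock level; return (new_stock, sold, replenished)."""
    if stock <= 0:
        # Nothing to sell. As in simulate_sales, no replenishment is triggered either.
        return stock, 0, 0
    sold = min(stock, requested)  # never sell more than is on hand
    stock -= sold
    replenished = refill if stock < threshold else 0
    return stock + replenished, sold, replenished


print(sell_and_replenish(50, 5))  # (45, 5, 0)
print(sell_and_replenish(22, 3))  # (219, 3, 200)
print(sell_and_replenish(2, 5))   # (200, 2, 200)
print(sell_and_replenish(0, 4))   # (0, 0, 0)
```

One consequence of this rule is that replenishment only happens right after a sale, so a product whose initial stock is zero is never sold and never restocked.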


In [12]:
def simulate_daily_sales(products_df, weights_df, initial_inventory_df,
                          sales_consumption_df, start_date, end_date):
    """
    Simulates daily sales and inventory management, one day at a time.

    Params:
    products_df (pandas.DataFrame): products offered in all locations
    weights_df (pandas.DataFrame): consumption weight of each category per country
    initial_inventory_df (pandas.DataFrame): starting stock per store and product
    sales_consumption_df (pandas.DataFrame): seasonal sales weight per country
    start_date (datetime): first date to simulate
    end_date (datetime): last date to simulate (inclusive)
    """
    current_date = start_date

    site_codes = initial_inventory_df.site_code.unique()
    inventory = initial_inventory_df.set_index(['site_code', 'sku'])

    while current_date <= end_date:
        simulate_sales(products_df, weights_df, inventory, sales_consumption_df, site_codes, current_date)
        
        current_date += timedelta(days=1)

#### Loading all data for the simulation

In [13]:
soh = utils.load_data('soh.csv', '../../data')
products = utils.load_data('products.csv', '../../data')
distribution_by_cat = utils.load_data('distribution_by_category.csv', '../../data')
distribution_by_sales = utils.load_data('distribution_of_sales_by_country.csv', '../../data')

Clean the SOH data and keep only the most recent record for each product in each store.

In [14]:
soh = soh.dropna()

soh['date'] = pd.to_datetime(soh['date'], format='mixed')
soh = soh.sort_values('date', ascending=False)
soh = soh.drop_duplicates(subset=['site_code', 'sku'], keep='first')
soh.quantity /= 10
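
Sorting by date in descending order and then dropping duplicates with `keep='first'` leaves the newest row for each `(site_code, sku)` pair. The same idea in plain Python is sketched below. It is an illustration of the technique, not the notebook's implementation, and the rows are made up.

```python
from datetime import date


def latest_per_key(rows):
    """Keep the most recent row for each (site_code, sku) pair."""
    latest = {}
    for row in rows:
        key = (row["site_code"], row["sku"])
        # Replace the stored row only when this one is strictly newer.
        if key not in latest or row["date"] > latest[key]["date"]:
            latest[key] = row
    return list(latest.values())


# Made-up rows for illustration.
rows = [
    {"site_code": "S1", "sku": "A", "quantity": 10, "date": date(2025, 1, 1)},
    {"site_code": "S1", "sku": "A", "quantity": 7, "date": date(2025, 1, 3)},
    {"site_code": "S1", "sku": "B", "quantity": 4, "date": date(2025, 1, 2)},
]
for row in latest_per_key(rows):
    print(row["site_code"], row["sku"], row["quantity"], row["date"])
```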

Simulating the complete process from January 2, 2025 to today.

In [15]:
try:
    simulate_daily_sales(
        products, 
        distribution_by_cat, 
        soh, 
        distribution_by_sales, 
        datetime.date(2025, 1, 2), 
        datetime.datetime.now().date()
    )
except KeyboardInterrupt:
    print('Keyboard interrupted')

%3|1744149937.415|FAIL|rdkafka#producer-2| [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Connect to ipv4#127.0.0.1:9092 failed: Connection refused (after 0ms in state CONNECT)
%3|1744149938.409|FAIL|rdkafka#producer-2| [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Connect to ipv4#127.0.0.1:9092 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
%3|1744149941.436|FAIL|rdkafka#producer-1| [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Connect to ipv4#127.0.0.1:9092 failed: Connection refused (after 0ms in state CONNECT, 30 identical error(s) suppressed)


Keyboard interrupted


%4|1744149943.765|TERMINATE|rdkafka#producer-2| [thrd:app]: Producer terminating with 4 messages (322 bytes) still in queue or transit: use flush() to wait for outstanding message delivery
