## GENERATING SOH SYNTHETIC DATA

This notebook generates synthetic data for a fictional clothing retail company. It contains a simple simulation of the company's activity in the international market over about 2 years.

Run this notebook before generating sales, because it creates the initial stock-on-hand (SOH) state of every store.

In [1]:
!pip install faker --break-system-packages

Defaulting to user installation because normal site-packages is not writeable


In [2]:
import pandas as pd
import numpy as np
import random
from datetime import datetime, timedelta
from faker import Faker
import seaborn as sns
import csv
import os
import sys
sys.path.append('../../libraries')
import utils

Load the distribution of consumption by category and season for each country in which the retail company has stores.

In [3]:
distribution_by_cat = utils.load_data('distribution_by_category.csv', '../../data')
sites = distribution_by_cat.country.unique()
distribution_by_cat.sample(5)

| | country | consumption | category | season |
|---|---|---|---|---|
| 19 | Mexico | 0.03 | Outerwear | Fall |
| 40 | USA | 0.16 | Tops | Winter |
| 83 | France | 0.26 | Bottoms | Winter |
| 282 | UK | 0.23 | Accessories | Winter |
| 28 | Brazil | 0.32 | Outerwear | Spring |
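To make the role of this table concrete, here is a minimal sketch of looking up one consumption weight by country, category and season. The three-row DataFrame and the helper name `lookup_weight` are illustrative assumptions, not part of the notebook; the 0.1 fallback mirrors the default used in `assign_products_to_stores`.

```python
import pandas as pd

# Toy rows copied from the sample output above (not the full file)
distribution = pd.DataFrame({
    "country": ["Mexico", "USA", "France"],
    "consumption": [0.03, 0.16, 0.26],
    "category": ["Outerwear", "Tops", "Bottoms"],
    "season": ["Fall", "Winter", "Winter"],
})

def lookup_weight(df, country, category, season, default=0.1):
    """Hypothetical helper: return the matching consumption weight or a default."""
    mask = (
        (df["country"] == country)
        & (df["category"] == category)
        & (df["season"] == season)
    )
    matches = df.loc[mask, "consumption"]
    return float(matches.iloc[0]) if not matches.empty else default

usa_tops_winter = lookup_weight(distribution, "USA", "Tops", "Winter")
missing_combination = lookup_weight(distribution, "USA", "Tops", "Summer")
```

Building the boolean mask once and reusing it avoids filtering the DataFrame twice for the same condition.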


Do the same for the seasonal distribution of sales per country, which is based on USA sales.

In [4]:
distribution_of_sales = utils.load_data('distribution_of_sales_by_country.csv', '../../data')
distribution_of_sales.sample(5)

| | country | Winter | Spring | Summer | Fall |
|---|---|---|---|---|---|
| 5 | Australia | 1.45073 | 1.89267 | 0.97865 | 0.72812 |
| 2 | UK | 1.75351 | 0.82734 | 1.29788 | 1.01478 |
| 4 | Germany | 1.32199 | 0.91104 | 0.69951 | 1.35444 |
| 9 | Mexico | 0.57022 | 1.24395 | 1.16541 | 1.27533 |
| 3 | France | 0.85243 | 1.52887 | 0.31489 | 1.18286 |
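One plausible reading of this table, assumed here rather than confirmed by the notebook, is that each value is a multiplier applied to a USA baseline quantity. Under that assumption, a rough sketch of the scaling could look like this; `scale_sales` and the baseline of 100 units are hypothetical.

```python
# Multipliers copied from the sample output above
seasonal_multiplier = {
    "UK": {"Winter": 1.75351, "Spring": 0.82734, "Summer": 1.29788, "Fall": 1.01478},
    "Mexico": {"Winter": 0.57022, "Spring": 1.24395, "Summer": 1.16541, "Fall": 1.27533},
}

def scale_sales(baseline_units, country, season):
    """Hypothetical helper: scale a baseline quantity by the country's seasonal factor."""
    return round(baseline_units * seasonal_multiplier[country][season])

# Assumed USA baseline of 100 units for the same season
uk_winter = scale_sales(100, "UK", "Winter")
mexico_winter = scale_sales(100, "Mexico", "Winter")
```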


### Definition of records
Define the structure of the records that are written to the CSV file:

- `soh`
  - sku
  - quantity
  - date
  - site_code
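The record layout above can be sketched as a small dataclass. The class name `SohRecord` and the field types are assumptions inferred from the sample output later in this notebook, not something the notebook defines.

```python
import datetime as dt
from dataclasses import dataclass, asdict

@dataclass
class SohRecord:
    """Hypothetical typed view of one `soh` row."""
    sku: str          # product code, e.g. "DRESS-079"
    quantity: int     # units in stock
    date: dt.date     # date of the stock snapshot
    site_code: str    # store identifier, e.g. "AUS001"

record = SohRecord(
    sku="DRESS-079",
    quantity=298,
    date=dt.date(2021, 7, 31),
    site_code="AUS001",
)
row = asdict(record)  # plain dict, ready to append to a list of rows
```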

#### Generating initial inventory

In [5]:
site_stores = np.array(
    [[country[:3].upper()+f'{i:03}' for i in range(5)] for country in sites]
).flatten()

fake = Faker()

def assign_products_to_stores(weight_df, products_df, stores, min_products=150, max_products=300):
    """
    Assign a random subset of products to each store with an initial stock.

    Params:
    weight_df (pandas.DataFrame): consumption weight per country, category and season
    products_df (pandas.DataFrame): offered products
    stores (list): list of active store codes
    min_products (int): minimum number of products per store (also the minimum initial stock per product)
    max_products (int): maximum number of products per store (also the maximum initial stock per product)

    Returns:
    pandas.DataFrame: one row per assigned product with site_code, sku, quantity and date
    """

    assignments = []
    for store in stores:
        # Store codes are a 2- or 3-letter country prefix plus a 3-digit number (e.g. UK001, USA001)
        country_code = store[:-3].upper()

        country = weight_df[weight_df.country.str.upper().str.startswith(country_code)]['country'].iloc[0]
        num_products = random.randint(min_products, max_products)
        num_available_products = len(products_df)  # Get the number of available products

        # Adjust num_products if it's greater than available products
        num_products = min(num_products, num_available_products)

        assigned_products = random.sample(products_df.index.tolist(), num_products)

        for product_index in assigned_products:
            product = products_df.loc[product_index]
            category = product['label'].split()[-1]
            season = random.choice(['Winter', 'Spring', 'Summer', 'Fall'])
            # Look up the consumption weight for this country/category/season,
            # falling back to 0.1 when no matching row exists.
            # NOTE: the weight is computed here but is not applied to the stock below.
            matches = weight_df[
                (weight_df['country'] == country) &
                (weight_df['category'] == category) &
                (weight_df['season'] == season)]
            weight = matches['consumption'].values[0] if not matches.empty else 0.1
            initial_stock = random.randint(min_products, max_products)

            start_date = datetime.now() - timedelta(days=365 * 4)
            end_date = datetime.now() - timedelta(days=365 * 3)
            initial_date = fake.date_between_dates(date_start=start_date, date_end=end_date)

            assignments.append({
                'site_code': store,
                'sku': product['productCode'],
                'quantity': initial_stock,
                'date': initial_date
            })

    return pd.DataFrame(assignments)
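The store codes built at the top of this cell combine the first three letters of the country name with a zero-padded counter, so a two-letter name such as UK yields a shorter prefix. The sketch below, using an illustrative subset of countries and a hypothetical `country_prefix` helper, shows how the prefix can be recovered by dropping the numeric suffix.

```python
countries = ["USA", "UK", "Mexico"]  # illustrative subset, not the full list of sites

# Same pattern as the notebook: up-to-3-letter prefix + 3-digit counter
store_codes = [
    f"{country[:3].upper()}{i:03}"
    for country in countries
    for i in range(5)
]

def country_prefix(store_code, digits=3):
    """Hypothetical helper: strip the fixed-width numeric suffix."""
    return store_code[:-digits]

prefixes = sorted({country_prefix(code) for code in store_codes})
```

Slicing from the end works for both 2- and 3-letter prefixes as long as the counter width stays fixed.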


In [6]:
products = utils.load_data('products.csv', '../../data')
soh = assign_products_to_stores(
    distribution_by_cat,
    products,
    site_stores
)
soh.sample(5)

| | site_code | sku | quantity | date |
|---|---|---|---|---|
| 6424 | AUS001 | DRESS-079 | 298 | 2021-07-31 |
| 9281 | JAP003 | SUIT-071 | 163 | 2021-07-31 |
| 11615 | MEX003 | TRACK-056 | 240 | 2021-05-07 |
| 2702 | UK001 | SHIRT-009 | 186 | 2021-07-13 |
| 7750 | IND001 | CAPRI-026 | 194 | 2021-07-08 |
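A few quick sanity checks can help confirm the generated inventory looks as expected. This is a minimal sketch on a toy frame built from the sample rows above; it assumes the default `min_products=150` and `max_products=300` bounds and that product codes are unique in `products.csv`.

```python
import pandas as pd

# Toy rows copied from the sample output above
soh_sample = pd.DataFrame({
    "site_code": ["AUS001", "JAP003", "MEX003"],
    "sku": ["DRESS-079", "SUIT-071", "TRACK-056"],
    "quantity": [298, 163, 240],
    "date": pd.to_datetime(["2021-07-31", "2021-07-31", "2021-05-07"]),
})

# Each product should appear at most once per store (sampling is without replacement)
no_duplicates = not soh_sample.duplicated(subset=["site_code", "sku"]).any()

# Initial stock is drawn between the default bounds, inclusive
stock_in_range = soh_sample["quantity"].between(150, 300).all()

# Number of distinct products assigned to each store
products_per_store = soh_sample.groupby("site_code")["sku"].nunique()
```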


In [7]:
utils.save_data(soh, 'soh.csv', '../../data')

Data saved to: /mnt/sda2/ICC/pasantia/final-project/data/soh.csv
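The internals of `utils.save_data` are not shown here, so the following is only a sketch of an equivalent round trip using plain pandas. It writes to an in-memory buffer instead of a file and reloads the data with the `date` column parsed back into datetimes.

```python
import io
import pandas as pd

soh_sample = pd.DataFrame({
    "site_code": ["AUS001", "UK001"],
    "sku": ["DRESS-079", "SHIRT-009"],
    "quantity": [298, 186],
    "date": pd.to_datetime(["2021-07-31", "2021-07-13"]),
})

# index=False avoids writing the row index as an extra unnamed column
buffer = io.StringIO()
soh_sample.to_csv(buffer, index=False)

# Rewind and reload, restoring the datetime dtype of the date column
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=["date"])
```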
