The identify_weekly_sales function left-joins primary_keys with the trans_weekly DataFrame on outlet_code, item_department, and week, so every key combination appears in the result even when no sales were recorded for it. Missing sales quantities are filled with zero, and the rows are sorted by week. The result is the total sales quantity for each store, department, and week.

The create_target_variable function builds on the weekly sales data by adding a target column, sales_next_week. For each store-department pair, it shifts total_sales_qty back by one row (shift(-1)), so each row carries the following week's sales. The data is sorted by store, department, and week, and the index is reset. Because the last week of each pair has no following week, its target is NaN. This column is the value a forecasting model is trained to predict.

In [1]:
%run ./run_script.ipynb

conf = get_conf()

datasources = get_datasources(conf)  # load the data sources once and reuse them
trans = datasources["trans_info"]
item = datasources["item_info"]
stores = datasources["outlets_info"]

trans = pre_process_transaction_info(trans)
item = pre_process_item_info(item)
store = pre_process_stores_info(stores)

trans_weekly = get_weekly_sales(item, trans)
primary_keys = create_primary_keys(trans_weekly)
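
The body of create_primary_keys is not shown in this notebook. The sketch below is only an assumption about what such a key table might look like: one row for every outlet, department, and week combination found in the weekly sales. The function name build_key_grid and the toy data are hypothetical.

```python
import pandas as pd


def build_key_grid(trans_weekly: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for create_primary_keys (an assumption, not the original)."""
    keys = ["outlet_code", "item_department", "week"]
    # Cartesian product of the distinct values of each key column
    grid = pd.MultiIndex.from_product(
        [trans_weekly[k].unique() for k in keys], names=keys
    )
    return grid.to_frame(index=False)


# Toy weekly sales with gaps: not every outlet sells every department each week
toy = pd.DataFrame({
    "outlet_code": ["A", "A", "B"],
    "item_department": ["Grocery", "Chilled", "Grocery"],
    "week": pd.to_datetime(["2022-01-17", "2022-01-24", "2022-01-17"]),
    "total_sales_qty": [10.0, 4.0, 7.0],
})

key_grid = build_key_grid(toy)
print(len(key_grid))  # 2 outlets x 2 departments x 2 weeks = 8 rows
```

A complete grid like this is what would make the later zero-fill meaningful: a combination with no sales still gets a row.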

In [2]:
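import pandas as pd


def identify_weekly_sales(trans_weekly, primary_keys):
    """
    Identify total weekly sales for every primary key.

    Args:
        trans_weekly: Pandas DataFrame
            Weekly sales per outlet_code, item_department and week
        primary_keys: Pandas DataFrame
            Key combinations (outlet_code, item_department, week) to keep

    Returns:
        weekly_sales: Pandas DataFrame
            Total weekly sales on primary key, missing sales filled with 0
    """
    keys = ["outlet_code", "item_department", "week"]

    # Left join keeps every primary key, even those without recorded sales
    weekly_sales = pd.merge(primary_keys, trans_weekly, on=keys, how="left")
    weekly_sales = weekly_sales[["week", "outlet_code", "item_department", "total_sales_qty"]].copy()
    # Assign the result: a chained inplace fillna may not update the frame
    weekly_sales["total_sales_qty"] = weekly_sales["total_sales_qty"].fillna(0)
    weekly_sales = weekly_sales.sort_values("week")

    return weekly_sales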

In [3]:
weekly_sales = identify_weekly_sales(trans_weekly, primary_keys)
weekly_sales

Unnamed: 0,week,outlet_code,item_department,total_sales_qty
0,2022-01-17,A,Beverages,598.0
14,2022-01-17,D,Grocery,1094.0
13,2022-01-17,E,Chilled,0.0
12,2022-01-17,C,Chilled,48.0
11,2022-01-17,C,Grocery,464.0
...,...,...,...,...
586,2022-10-17,C,Grocery,387.0
585,2022-10-17,C,Chilled,27.0
598,2022-10-17,D,Beverages,192.0
591,2022-10-17,D,Chilled,52.0
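
To see where the 0.0 entries come from, the merge-and-fill step can be reproduced on toy data. This is a minimal sketch with made-up frames (toy_keys, toy_sales), not the notebook's data. The validate and indicator arguments of pd.merge are optional extras that the original function does not use: the first raises an error if a key is duplicated on either side, the second records which rows had no matching sales.

```python
import pandas as pd

keys = ["outlet_code", "item_department", "week"]

# Toy key table: every combination that should appear in the result
toy_keys = pd.DataFrame({
    "outlet_code": ["C", "E"],
    "item_department": ["Chilled", "Chilled"],
    "week": ["2022-01-17", "2022-01-17"],
})
# Toy sales: outlet E has no Chilled sales in that week
toy_sales = pd.DataFrame({
    "outlet_code": ["C"],
    "item_department": ["Chilled"],
    "week": ["2022-01-17"],
    "total_sales_qty": [48.0],
})

merged = pd.merge(
    toy_keys, toy_sales, on=keys, how="left",
    validate="one_to_one",  # fail loudly on duplicate keys
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
)
# Remember which rows were absent from the sales data before filling them
merged["was_missing"] = merged["_merge"] == "left_only"
merged["total_sales_qty"] = merged["total_sales_qty"].fillna(0)

print(merged[keys + ["total_sales_qty", "was_missing"]])
```

Keeping a flag such as was_missing makes it possible to tell a filled-in zero apart from a recorded value, which matters if the two should be treated differently downstream.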


In [4]:
def create_target_variable(weekly_sales):
    """
    Create the target variable.

    Args:
        weekly_sales: Pandas DataFrame
            Total weekly sales on primary key

    Returns:
        target_variable: Pandas DataFrame
            weekly_sales with an added "sales_next_week" column
    """
    group_cols = ["outlet_code", "item_department"]

    # Sort first so that shift(-1) always picks up the following week
    target_variable = weekly_sales.sort_values(group_cols + ["week"]).reset_index(drop=True)
    # Next week's sales per store-department pair; NaN for each pair's last week
    target_variable["sales_next_week"] = target_variable.groupby(group_cols)["total_sales_qty"].shift(-1)

    return target_variable

In [5]:
target_variable = create_target_variable(weekly_sales)
target_variable

Unnamed: 0,week,outlet_code,item_department,total_sales_qty,sales_next_week
0,2022-01-17,A,Beverages,598.0,1342.0
1,2022-01-24,A,Beverages,1342.0,1744.0
2,2022-01-31,A,Beverages,1744.0,1098.0
3,2022-02-07,A,Beverages,1098.0,1574.0
4,2022-02-14,A,Beverages,1574.0,1304.0
...,...,...,...,...,...
595,2022-09-19,E,Grocery,604.0,669.0
596,2022-09-26,E,Grocery,669.0,691.0
597,2022-10-03,E,Grocery,691.0,737.0
598,2022-10-10,E,Grocery,737.0,200.0
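
The final week of every store-department pair has no following week, so shift(-1) leaves its sales_next_week as NaN. The sketch below shows this on a made-up frame and drops those rows. Whether the notebook drops or imputes them later is not shown here, so the dropna step is an assumption.

```python
import pandas as pd

# Toy weekly sales: two outlets, one department, three weeks each
toy = pd.DataFrame({
    "week": pd.to_datetime(["2022-01-17", "2022-01-24", "2022-01-31"] * 2),
    "outlet_code": ["A"] * 3 + ["B"] * 3,
    "item_department": ["Grocery"] * 6,
    "total_sales_qty": [10.0, 20.0, 30.0, 1.0, 2.0, 3.0],
})

group_cols = ["outlet_code", "item_department"]
toy = toy.sort_values(group_cols + ["week"]).reset_index(drop=True)
# Within each pair, pull the next row's sales up onto the current row
toy["sales_next_week"] = toy.groupby(group_cols)["total_sales_qty"].shift(-1)

# Rows without a target cannot be used to fit a supervised model
trainable = toy.dropna(subset=["sales_next_week"])

print(toy)
print(len(trainable))  # 4 of the 6 rows have a target
```

Because the shift is done per group, outlet A's last week does not pick up outlet B's first week as its target.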
