The functions below pre-process the item, transaction, and store data ahead of analysis or model training.

- The pre_process_item_info function keeps only the items whose item_department appears in the departments list from the configuration (conf["required_columns"]["departments"]), then drops duplicate rows.

- The pre_process_transaction_info function sorts the transactions by DATE, converts DATE to datetime and item_code and sales_qty to integers, and assigns each transaction to a week by passing its DATE to the previous_day helper with "monday". It then groups the rows by week, outlet_code, and item_code and sums sales_qty into a total_sales column.

- Lastly, the pre_process_stores_info function casts the store columns outlet_area and outlet_parking_lots to integers.

Together, these functions clean and structure the data for further analysis or model training.
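
As a rough illustration of the weekly aggregation described above, the sketch below runs the same group-and-sum on a small made-up table. The week_start_monday helper is hypothetical and only stands in for previous_day, whose definition is not shown in this notebook; it assumes that a date falling on a Monday maps to itself, which may differ from the real helper.

```python
import pandas as pd


def week_start_monday(dates: pd.Series) -> pd.Series:
    """Hypothetical stand-in for previous_day(x, "monday").

    Maps each date to the Monday of its week; a Monday maps to itself
    (an assumption, the real helper may behave differently).
    """
    dates = pd.to_datetime(dates)
    # dt.weekday is 0 for Monday, so subtracting that many days lands on Monday
    return dates.dt.normalize() - pd.to_timedelta(dates.dt.weekday, unit="D")


# Toy transactions, invented for illustration only
toy = pd.DataFrame(
    {
        "DATE": ["2022-01-17", "2022-01-19", "2022-01-23", "2022-01-24"],
        "outlet_code": ["A", "A", "A", "A"],
        "item_code": [223, 223, 223, 223],
        "sales_qty": [4, 5, 1, 2],
    }
)

toy["week"] = week_start_monday(toy["DATE"])

# Sum sales per (week, outlet, item), mirroring the total_sales column
weekly = toy.groupby(["week", "outlet_code", "item_code"], as_index=False).agg(
    total_sales=("sales_qty", "sum")
)
print(weekly)
```

The first three rows fall in the week starting Monday 2022-01-17 and the last row starts a new week on 2022-01-24, so the result has two rows. Computing the week with vectorized datetime arithmetic rather than a row-wise apply is a design choice of this sketch, not something the notebook does.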

In [1]:
%run ./run_script.ipynb

conf = get_conf()

# Load the data sources once, then pick out the tables used below
datasources = get_datasources(conf)
trans = datasources["trans_info"]
item = datasources["item_info"]
stores = datasources["outlets_info"]
start_date = conf["dates"]["start_date"]
end_date = conf["dates"]["end_date"]

In [2]:
def pre_process_item_info(item):

    """
    Pre-processing item info table
    
    Args:
        item : Pandas DataFrame
            Contains information of items
    
    Returns:
        item_info : Pandas DataFrame
            Contains filtered items based on required departments
    """
    
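    # Keep rows whose department is listed in the notebook-level conf,
    # then drop exact duplicate rows
    item_info = item[
        item["item_department"].isin(conf["required_columns"]["departments"])
    ].drop_duplicates()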
    
    return item_info

def pre_process_transaction_info(trans):
    
    """
    Pre-processing transaction info table
    
    Args:
        trans : Pandas DataFrame
            contains information of transactions
        
    Returns:
        trans : Pandas DataFrame
            contains total_sales summed per week, outlet_code and item_code
    """
    
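    trans = trans.copy()  # avoid modifying the caller's DataFrame
    # Parse dates before sorting so the order is chronological even if DATE is stored as text
    trans["DATE"] = pd.to_datetime(trans["DATE"])
    trans = trans.sort_values("DATE")
    trans["item_code"] = trans["item_code"].astype(int)
    trans["sales_qty"] = trans["sales_qty"].astype(int)
    # previous_day is not defined in this notebook; it is assumed to come from run_script.ipynb
    trans["week"] = trans["DATE"].apply(lambda x: previous_day(x, "monday"))
    # Sum sales per (week, outlet, item) and expose the result as total_sales
    trans = trans.groupby(["week", "outlet_code", "item_code"]).agg({"sales_qty": "sum"}).rename(
        columns={"sales_qty": "total_sales"}).reset_index()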
    
    return trans

def pre_process_stores_info(stores):
    
    """
    Pre-processing store info table
    
    Args:
        stores : Pandas DataFrame
            contains information of stores
        
    Returns:
        stores : Pandas DataFrame
            contains stores columns with correct data types
    """
    
    stores["outlet_area"] = stores["outlet_area"].astype(int)  
    stores["outlet_parking_lots"] = stores["outlet_parking_lots"].astype(int)
    
    return stores

In [3]:
item = pre_process_item_info(item)
item

Unnamed: 0,item_code,item_category,item_sub_department,item_department
0,1016782.0,Instant Chocolate Drinks,Chocolate Drink,Beverages
1,94111.0,Cheese Blocks,Cheese,Chilled
2,839302.0,Prepared Meals,Baby Foods,Grocery
3,1070377.0,Sugar Confectionary,Confectionery,Grocery
4,1077721.0,Cereal Bars,Cereals,Grocery
...,...,...,...,...
31090,20833.0,Dessert Pre-Mixes,Desserts,Grocery
31091,427480.0,Dessert Pre-Mixes,Desserts,Grocery
31092,506482.0,Ambient Dessert Syrups & Toppings,Desserts,Grocery
31093,1065499.0,"Snack-Nuts,Peas & Mixes",Snacks,Grocery
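
The item_code values above are displayed as floats (for example 1016782.0), while pre_process_transaction_info casts item_code to int. If the two tables are later joined on item_code, aligning the types first may be worth considering. A minimal sketch, assuming the codes are whole numbers and that missing values should stay missing; to_int_code is a hypothetical helper, not part of the notebook:

```python
import pandas as pd


def to_int_code(codes: pd.Series) -> pd.Series:
    """Hypothetical helper: convert float-like codes to pandas' nullable Int64."""
    # errors="coerce" turns unparseable values into NaN instead of raising
    numeric = pd.to_numeric(codes, errors="coerce")
    # Int64 (capital I) is the nullable integer dtype, so NaN becomes <NA>
    return numeric.astype("Int64")


# Two codes taken from the output above plus one missing value
codes = pd.Series([1016782.0, 94111.0, None])
result = to_int_code(codes)
print(result)
```

Unlike astype(int), which raises when a column contains missing values, the nullable Int64 dtype keeps them as missing. The cast still raises if a code has a fractional part, which is arguably the desired behavior for identifiers.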


In [4]:
trans = pre_process_transaction_info(trans)
trans

Unnamed: 0,week,outlet_code,item_code,total_sales
0,2022-01-17,A,223,10
1,2022-01-17,A,232,2
2,2022-01-17,A,259,21
3,2022-01-17,A,268,7
4,2022-01-17,A,295,3
...,...,...,...,...
40581,2022-10-17,E,123478,1
40582,2022-10-17,E,123694,1
40583,2022-10-17,E,123703,2
40584,2022-10-17,E,123730,1


In [5]:
store = pre_process_stores_info(stores)
store

Unnamed: 0,outlet_code,outlet_area,outlet_parking_lots,outlet_profile_category,outlet_cluster_category
0,D,11237,68,Moderate,Small
1,B,11500,50,High,Medium
2,A,10150,52,Moderate,Small
3,E,10000,12,High,Medium
4,C,14425,41,High,Large
