# 2. Labeling, structuring and visualizing the data
In this step, we wanted to understand how the collected data could inform ML-generated predictions. We started by organizing and structuring the datasets so that they could be used to create meaningful visualizations and be interpreted by an algorithm. Where necessary, we added labels that give the algorithm an extra layer of meaningful information. For the grocery automation case study, we combined the data into one dataset of all items bought by each individual household. Examples of labels added to the data are: the type of item that each item name signifies (e.g., “Old Goudse 45+” is cheese); the type of store where the items were bought (e.g., “Albert Heijn” is a supermarket); and the category to which each item belongs (e.g., milk is a dairy product).
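
The labeling described above can be pictured as a set of lookup tables applied to the raw receipt lines. The following is a minimal sketch, not the procedure used in the study: the milk item name, the lookup dictionaries, and the "dairy" category assigned to cheese are assumptions made for illustration. Only the “Old Goudse 45+”, “Albert Heijn”, and milk examples come from the text.

```python
import pandas as pd

# Hypothetical receipt lines. "Halfvolle melk 1L" is an invented item name.
raw = pd.DataFrame({
    "item_name": ["Old Goudse 45+", "Halfvolle melk 1L"],
    "store_name": ["Albert Heijn", "Albert Heijn"],
})

# Hand-made lookup tables (assumed for illustration).
item_type_by_name = {"Old Goudse 45+": "cheese", "Halfvolle melk 1L": "milk"}
store_type_by_name = {"Albert Heijn": "supermarket"}
category_by_type = {"cheese": "dairy", "milk": "dairy"}

# Each label is a new column derived from an existing one via Series.map.
labeled = raw.assign(
    item_type=raw["item_name"].map(item_type_by_name),
    store_type=raw["store_name"].map(store_type_by_name),
)
labeled["category"] = labeled["item_type"].map(category_by_type)

print(labeled)
```

Names that are missing from a lookup table come out as `NaN`, which makes unlabeled items easy to find with `labeled[labeled["item_type"].isna()]`.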

In this notebook (2.1) we:
- Import the libraries and the dataframe, and recode or add all required variables (2.1 Labeling and structuring)

In the next notebook (2.2) we:
- Generate visualizations and correlations to analyze the data (2.2 Visualizing)

## 2.1. Labeling and structuring

### Import libraries 

In [1]:
# pandas: data manipulation and analysis
import pandas as pd
# NumPy: large, multi-dimensional arrays and matrices, plus high-level mathematical functions that operate on them
import numpy as np
# Matplotlib: plotting library; pyplot provides a MATLAB-like plotting interface
import matplotlib.pyplot as plt
# seaborn: statistical data visualization built on top of Matplotlib
import seaborn as sns

### Load and view data 

In [2]:
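df = pd.read_csv(r"/workspaces/DesignerlyAlgorithmicPrototyping/database/DATA-HH (dummy).csv")

# Drop the household identifier column if it is present.
# `del df["HH"]` raises KeyError: 'HH' when the column is missing, e.g. when the
# file has already been processed and saved without it.
df = df.drop(columns="HH", errors="ignore")
df.head()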

### Recode variables

- Create numeric IDs for certain variables (item type, order...)
- Create new derived variables (e.g., reordering an item, total order price...)
- Make sure all variables are formatted as the correct type (e.g., int, bool...)

In [3]:
# New column: True when the same item type already appeared in an earlier row
df['reorder'] = df['item_type'].duplicated()

# duplicated() already returns booleans; the cast makes the type explicit
df['reorder'] = df['reorder'].astype(bool)

# The raw 'item_id' column holds the item names, so rename it first,
# then create a numeric ID for each item
df.rename(columns={'item_id': 'item_name'}, inplace=True)
df['item_id'] = pd.factorize(df['item_name'])[0]

# Create a numeric ID for each item type
df['type_id'] = pd.factorize(df['item_type'])[0]

# New column: total number of items per order
df['order_amount'] = df.groupby('order_ID')['amount'].transform('sum')

# New column: total price per order
df['order_price'] = df.groupby('order_ID')['price_total'].transform('sum')
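
To see what the three operations in this cell produce, here is a small sketch on a toy dataframe. The toy rows are invented. Only the column names and the pandas calls (`duplicated`, `pd.factorize`, `groupby(...).transform('sum')`) mirror the cell above.

```python
import pandas as pd

# Toy data (invented): two orders, three rows.
toy = pd.DataFrame({
    "order_ID": [1, 1, 2],
    "item_type": ["soup", "bread", "soup"],
    "amount": [2, 1, 4],
})

# duplicated() is False for the first occurrence and True for every later one.
toy["reorder"] = toy["item_type"].duplicated()        # [False, False, True]

# factorize() numbers the distinct values in order of first appearance.
toy["type_id"] = pd.factorize(toy["item_type"])[0]     # [0, 1, 0]

# transform('sum') broadcasts each order's total back to all of its rows.
toy["order_amount"] = toy.groupby("order_ID")["amount"].transform("sum")  # [3, 3, 4]

print(toy)
```

Because `duplicated()` looks at the whole column, a second row with the same item type counts as a reorder even if it sits in the same order as the first one. Whether that matches the intended meaning of "reorder" depends on the data.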

In [4]:
df['week'] = df['week'].astype(int)

df['order_ID'] = df['order_ID'].astype(int)

df['amount'] = df['amount'].astype(int)

df['promo'] = df['promo'].astype(bool)

df['item_id'] = df['item_id'].astype(int)

df['type_id'] = df['type_id'].astype(int)
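
As a side note, the same casts can be written as a single `astype` call that takes a column-to-type mapping. The sketch below uses invented toy values and only three of the columns. It is an equivalent alternative to the per-column statements above.

```python
import pandas as pd

# Toy data (invented): numbers stored as floats and a 0/1 promo flag.
toy = pd.DataFrame({
    "week": [1.0, 2.0],
    "amount": [1.0, 3.0],
    "promo": [0, 1],
})

# One astype call with a {column: dtype} mapping casts several columns at once.
toy = toy.astype({"week": int, "amount": int, "promo": bool})

print(toy.dtypes)
```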

In [5]:
df_des = df.describe(include='all')

df_styled = df_des.style.background_gradient() #adding a gradient based on values in cell
# dfi.export(df_styled,"describeHH2.png")
df_styled

Unnamed: 0,week,order_ID,item_name,amount,price_unit,price_total,date,day,timestamp,time,store_type,store_name,promo,item_type,category,reorder,item_id,type_id,order_amount,order_price
count,372.0,372.0,372,372.0,372.0,372.0,372,372,372,372,372,372,372,372,372,372,372.0,372.0,372.0,372.0
unique,,,314,,,,26,7,35,4,5,9,2,126,16,2,,,,
top,,,GROF BROOD GESN.,,,,2022-01-08,Saturday,17:25:00,morning,supermarket,Okay,False,charcuterie,fruit & vegetables,True,,,,
freq,,,6,,,,45,87,44,186,306,127,341,25,103,246,,,,
mean,4.056452,19.182796,,1.274194,2.602328,2.867247,,,,,,,,,,,147.865591,42.543011,30.38172,68.919265
std,2.06747,10.531632,,1.103819,1.947725,2.0192,,,,,,,,,,,91.676674,34.169954,16.118312,36.489459
min,1.0,1.0,,1.0,0.06468,0.06468,,,,,,,,,,,0.0,0.0,1.0,2.95
25%,2.0,10.0,,1.0,1.3,1.54397,,,,,,,,,,,68.75,14.0,13.0,25.5715
50%,4.0,18.0,,1.0,2.24025,2.46286,,,,,,,,,,,145.5,34.5,34.0,72.487014
75%,5.0,26.25,,1.0,3.29,3.875,,,,,,,,,,,225.0,68.0,38.0,103.32998


In [6]:
df.dtypes

week              int64
order_ID          int64
item_name        object
amount            int64
price_unit      float64
price_total     float64
date             object
day              object
timestamp        object
time             object
store_type       object
store_name       object
promo              bool
item_type        object
category         object
reorder            bool
item_id           int64
type_id           int64
order_amount      int64
order_price     float64
dtype: object

### New (numeric) variables for the confusion matrix

Turn categorical (text and boolean) values into numeric category codes (to facilitate making the confusion matrix)

In [7]:
# Recode day of the week
df['day_num'] = df['day'].astype('category').cat.codes

# Recode store_type
df['storetype_num'] = df['store_type'].astype('category').cat.codes

# Recode store_name
df['storename_num'] = df['store_name'].astype('category').cat.codes

# Recode categories
df['cat_num'] = df['category'].astype('category').cat.codes

# Recode time of day
df['time_num'] = df['time'].astype('category').cat.codes

# Recode promo
df['promo_num'] = df['promo'].astype('category').cat.codes
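
One thing to keep in mind when reading these codes: `astype('category')` sorts the category labels, so for text columns the codes follow alphabetical order, not any natural order. The sketch below uses invented toy values. It shows how to recover the code-to-label mapping and, as an optional alternative that the notebook does not use, how an explicitly ordered categorical yields calendar-ordered weekday codes.

```python
import pandas as pd

# Toy data (invented).
days = pd.Series(["Tuesday", "Saturday", "Monday", "Tuesday"])

# Default behavior: categories are sorted alphabetically
# (Monday=0, Saturday=1, Tuesday=2 for this toy series).
alpha = days.astype("category")
alpha_codes = alpha.cat.codes.tolist()             # [2, 1, 0, 2]

# Recover which code stands for which label.
mapping = dict(enumerate(alpha.cat.categories))    # {0: 'Monday', 1: 'Saturday', 2: 'Tuesday'}

# Alternative (assumption, not used in the notebook): fix the order explicitly.
week_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"]
ordered = pd.Categorical(days, categories=week_order, ordered=True)
ordered_codes = ordered.codes.tolist()             # [1, 5, 0, 1]

print(alpha_codes, mapping, ordered_codes)
```

With the default alphabetical coding over all seven weekday names, Tuesday receives code 5, which is consistent with the `day_num` value shown for Tuesday in the dataframe preview of this notebook.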

In [8]:
df.dtypes

week               int64
order_ID           int64
item_name         object
amount             int64
price_unit       float64
price_total      float64
date              object
day               object
timestamp         object
time              object
store_type        object
store_name        object
promo               bool
item_type         object
category          object
reorder             bool
item_id            int64
type_id            int64
order_amount       int64
order_price      float64
day_num             int8
storetype_num       int8
storename_num       int8
cat_num             int8
time_num            int8
promo_num           int8
dtype: object

### Save the final dataframe as a csv file

In [9]:
df.head()

Unnamed: 0,week,order_ID,item_name,amount,price_unit,price_total,date,day,timestamp,time,...,item_id,type_id,order_amount,order_price,day_num,storetype_num,storename_num,cat_num,time_num,promo_num
0,1,1,RABEKO choco light 250g,2,2.82,5.64,2021-11-23,Tuesday,12:32:00,noon,...,0,0,9,16.77,5,4,6,2,3,0
1,1,1,JOYVALLE pudding griesmeel natuur 135g,4,0.99,3.96,2021-11-23,Tuesday,12:32:00,noon,...,1,1,9,16.77,5,4,6,7,3,0
2,1,1,BONI tomatensoep met balletjes 950ml,1,1.99,1.99,2021-11-23,Tuesday,12:32:00,noon,...,2,2,9,16.77,5,4,6,3,3,0
3,1,1,LIEBIG DELISOUP 9 groenten brik 1L,1,2.59,2.59,2021-11-23,Tuesday,12:32:00,noon,...,3,2,9,16.77,5,4,6,3,3,0
4,1,1,LIEBIG DELISOUP tom. Balletjes brik 1L,1,2.59,2.59,2021-11-23,Tuesday,12:32:00,noon,...,4,2,9,16.77,5,4,6,3,3,0


In [10]:
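# Caution: this is the same path the data was loaded from in cell 2,
# so the input file is overwritten with the recoded dataframe.
df.to_csv(r"/workspaces/DesignerlyAlgorithmicPrototyping/database/DATA-HH (dummy).csv", index=False)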
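
The path written here is identical to the path read in cell 2, so running the notebook a second time starts from the already processed file. One way to keep the notebook re-runnable is to write the result to a separate file. The sketch below illustrates the idea in a temporary directory. The file names `raw.csv` and `processed.csv` and the toy data are hypothetical and do not correspond to files in the project.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    raw_path = Path(tmp) / "raw.csv"              # hypothetical input file
    processed_path = Path(tmp) / "processed.csv"  # hypothetical output file

    # Create a toy input file (invented data) to stand in for the raw dataset.
    pd.DataFrame({"HH": [1, 1], "amount": [2, 3]}).to_csv(raw_path, index=False)

    # Load, process, and save under a different name.
    df_raw = pd.read_csv(raw_path)
    df_out = df_raw.drop(columns="HH")
    df_out.to_csv(processed_path, index=False)

    # The input file still has its original columns after the run.
    raw_columns = pd.read_csv(raw_path).columns.tolist()
    processed_columns = pd.read_csv(processed_path).columns.tolist()

print(raw_columns, processed_columns)
```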

In [11]:
# Get a simplified dataframe (without the numeric category codes, timestamp and reorder flag)
df.drop(
    ['storename_num', 'cat_num', 'time_num', 'promo_num', 'timestamp',
     'day_num', 'storetype_num', 'reorder'],
    axis=1,
    inplace=True,
)
df.head()

Unnamed: 0,week,order_ID,item_name,amount,price_unit,price_total,date,day,time,store_type,store_name,promo,item_type,category,item_id,type_id,order_amount,order_price
0,1,1,RABEKO choco light 250g,2,2.82,5.64,2021-11-23,Tuesday,noon,supermarket,Okay,False,chocolate spread,breakfast & spreads,0,0,9,16.77
1,1,1,JOYVALLE pudding griesmeel natuur 135g,4,0.99,3.96,2021-11-23,Tuesday,noon,supermarket,Okay,False,pudding,dairy & plant based,1,1,9,16.77
2,1,1,BONI tomatensoep met balletjes 950ml,1,1.99,1.99,2021-11-23,Tuesday,noon,supermarket,Okay,False,soup,canned foods,2,2,9,16.77
3,1,1,LIEBIG DELISOUP 9 groenten brik 1L,1,2.59,2.59,2021-11-23,Tuesday,noon,supermarket,Okay,False,soup,canned foods,3,2,9,16.77
4,1,1,LIEBIG DELISOUP tom. Balletjes brik 1L,1,2.59,2.59,2021-11-23,Tuesday,noon,supermarket,Okay,False,soup,canned foods,4,2,9,16.77
