## 2. Structuring and labeling the data
In this step, we wanted to understand how the collected data could inform ML-generated predictions. We started by organizing and structuring the datasets so that they could be used to create meaningful visualizations and be interpreted by an algorithm. Where necessary, we added labels that give the algorithm an extra layer of meaningful information. For the grocery automation case study, we combined the data into one full dataset of all items bought by each individual household. Examples of labels added to the data are: the type of item each item name signifies (e.g., "Old Goudse 45+" is cheese); the type of store where the items were bought (e.g., "Albert Heijn" is a supermarket); and the category to which each item belongs (e.g., milk is a dairy product).
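
The labeling step can be pictured as a set of lookup tables applied to the raw item and store names. The sketch below is only an illustration: the three-row dataframe and the dictionaries `item_types`, `categories` and `store_types` are hypothetical, filled with the examples from the text (that cheese is a dairy product is an added assumption).

```python
import pandas as pd

# Hypothetical purchases; the item and store names are the examples from the text
purchases = pd.DataFrame({
    "item_name": ["Old Goudse 45+", "milk", "Old Goudse 45+"],
    "store_name": ["Albert Heijn", "Albert Heijn", "Albert Heijn"],
})

# Hypothetical lookup tables, for illustration only
item_types = {"Old Goudse 45+": "cheese", "milk": "milk"}
categories = {"cheese": "dairy product", "milk": "dairy product"}  # cheese: assumed
store_types = {"Albert Heijn": "supermarket"}

# Add one label column per lookup table
purchases["item_type"] = purchases["item_name"].map(item_types)
purchases["category"] = purchases["item_type"].map(categories)
purchases["store_type"] = purchases["store_name"].map(store_types)

# Names missing from a lookup table become NaN, so unlabeled rows are easy to find
unlabeled = purchases[purchases["item_type"].isna()]
print(purchases)
```

Keeping the labels in separate tables means a new item name only has to be classified once.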

In this notebook we:
1. Import the library/dataframe and recode/add all required variables
2. Create & export descriptions


### 2.1 Visualize the data & explore correlations
Now that we had a structured dataset, we could visualize it to identify patterns in the data and look for statistical correlations. Figure 8 illustrates the variety of visualizations that were made with the grocery dataset. The patterns found in the dataset inform the order in which the predictions are structured. For instance, once we knew on which days the households may shop, we could identify which other variable has the highest correlation with the day variable. If that variable were, for instance, the type of store, it would become the next variable to look at (i.e., if day is a predictor for store type, store type may in turn be a predictor for the number of items bought, and so on).
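
One possible way to rank candidate "next" variables is an association measure for categorical data such as Cramér's V, which is derived from the chi-square statistic. This is a sketch under assumptions, not necessarily the measure used in this study: the helper `cramers_v` and the eight made-up orders are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V for two categorical series: 0 = no association, 1 = perfect association."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Made-up orders, for illustration only
orders = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun", "Mon", "Mon", "Sat", "Sun"],
    "store_type": ["supermarket", "supermarket", "bakery", "bakery",
                   "supermarket", "supermarket", "supermarket", "bakery"],
    "time": ["morning", "noon", "morning", "noon",
             "morning", "noon", "noon", "morning"],
})

# Score every candidate against the day variable and pick the strongest one
scores = {col: cramers_v(orders["day"], orders[col]) for col in ["store_type", "time"]}
next_variable = max(scores, key=scores.get)
print(scores, next_variable)
```

In this toy data the store type is fully determined by the day, so it would be the next variable to look at.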

In this notebook we:
1. Create & export a confusion matrix
...
2. Export the dataframe


<!-- 1. build grid: when do/dont they shop, how many times, on which day...
2. define preliminaries for algorithm
3. define filters for algorithm
4. RUN algorithm 


----


Outcome (example):
1. input: rows of dow shopped
2. output: 0,1,2,0,0,1,0 -->

### Import libraries 

In [1]:
# %matplotlib notebook
%matplotlib inline
#Pandas is a software library written for the Python programming language for data manipulation and analysis.
import pandas as pd
#NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays
import numpy as np
# Matplotlib is a plotting library for python and pyplot gives us a MatLab like plotting framework. We will use this in our plotter function to plot data.
import matplotlib.pyplot as plt
#Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics
import seaborn as sns
import dataframe_image as dfi
from datetime import time
import matplotlib.dates as mdates
from matplotlib.ticker import StrMethodFormatter
from matplotlib.pyplot import figure
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.tree import DecisionTreeClassifier, export_graphviz, plot_tree
from scipy.stats import chi2_contingency

### Load and view data 

In [2]:
df = pd.read_csv (r"/workspaces/Plenty-in-the-Pantry/database/Groceries_onehousehold.csv")
# del df["HH"]
df.describe(include='all')

Unnamed: 0,HH,week,order_ID,item_id,amount,price_unit,price_total,date,day,timestamp,time,store_type,store_name,promo,item_type,category
count,372.0,372.0,372.0,372,372.0,372.0,372.0,372,372,372,372,372,372,372.0,372,372
unique,,,,314,,,,26,7,35,4,5,9,,126,16
top,,,,GROF BROOD GESN.,,,,2022-01-08,Saturday,17:25:00,morning,supermarket,Okay,,charcuterie,fruit & vegetables
freq,,,,6,,,,45,87,44,186,306,127,,25,103
mean,2.0,4.056452,18.88172,,1.274194,2.602328,2.867247,,,,,,,0.083333,,
std,0.0,2.06747,10.908193,,1.103819,1.947725,2.0192,,,,,,,0.276758,,
min,2.0,1.0,1.0,,1.0,0.06468,0.06468,,,,,,,0.0,,
25%,2.0,2.0,8.0,,1.0,1.3,1.54397,,,,,,,0.0,,
50%,2.0,4.0,20.0,,1.0,2.24025,2.46286,,,,,,,0.0,,
75%,2.0,5.0,26.25,,1.0,3.29,3.875,,,,,,,0.0,,


## Recode variables

In [3]:
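# new column flagging reordered items: True when the item type already occurred in an earlier row
df['reorder'] = df.item_type.duplicated()
# make sure the reorder column is stored as boolean (duplicated() already returns booleans)
df['reorder'] = df['reorder'].astype(bool)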
#create an ID for each item 
df.rename(columns={'item_id':'item_name'}, inplace=True)
df['item_id'] = pd.factorize(df['item_name'])[0]
#create an ID for each type 
df['type_id'] = pd.factorize(df['item_type'])[0]
#first make a new column for the amount of items per order
df['order_amount'] = df.groupby('order_ID')['amount'].transform('sum')
# make a new column for the price per order
df['order_price'] = df.groupby('order_ID')['price_total'].transform('sum')
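
To make the recoding above easier to follow, here is a small sketch of the same three operations (`duplicated`, `pd.factorize`, `groupby(...).transform('sum')`) on a hypothetical three-row dataframe named `toy`.

```python
import pandas as pd

# Hypothetical mini dataset: two items in order 1, one item in order 2
toy = pd.DataFrame({
    "order_ID": [1, 1, 2],
    "item_type": ["soup", "soup", "pudding"],
    "amount": [1, 2, 4],
})

# True for every row whose item type already appeared in an earlier row
toy["reorder"] = toy["item_type"].duplicated()
# Integer ID per distinct item type, numbered in order of first appearance
toy["type_id"] = pd.factorize(toy["item_type"])[0]
# Total number of items in the order, repeated on every row of that order
toy["order_amount"] = toy.groupby("order_ID")["amount"].transform("sum")
print(toy)
```

Note that `duplicated()` also flags a second item of the same type within a single order, so `reorder` does not strictly mean "bought again on a later trip".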

In [4]:
df.dtypes

HH                int64
week              int64
order_ID          int64
item_name        object
amount            int64
price_unit      float64
price_total     float64
date             object
day              object
timestamp        object
time             object
store_type       object
store_name       object
promo             int64
item_type        object
category         object
reorder            bool
item_id           int64
type_id           int64
order_amount      int64
order_price     float64
dtype: object

In [5]:
# Make sure the ID and count columns are integers and promo is boolean, to enable the functions in the next cell:
df['week'] = df['week'].astype(int)
df['order_ID'] = df['order_ID'].astype(int)
df['amount'] = df['amount'].astype(int)
df['promo'] = df['promo'].astype(bool)
df['item_id'] = df['item_id'].astype(int)
df['type_id'] = df['type_id'].astype(int)

In [6]:
# Recode categorical values to cat codes:

# Recode dow
df['day_num']=df['day'].astype('category').cat.codes
# Recode store_type
df['storetype_num']=df['store_type'].astype('category').cat.codes
# Recode store_name
df['storename_num']=df['store_name'].astype('category').cat.codes
# Recode categories
df['cat_num']=df['category'].astype('category').cat.codes
# Recode time
df['time_num']=df['time'].astype('category').cat.codes
# Recode promo
df['promo_num']=df['promo'].astype('category').cat.codes
df.dtypes

HH                 int64
week               int64
order_ID           int64
item_name         object
amount             int64
price_unit       float64
price_total      float64
date              object
day               object
timestamp         object
time              object
store_type        object
store_name        object
promo               bool
item_type         object
category          object
reorder             bool
item_id            int64
type_id            int64
order_amount       int64
order_price      float64
day_num             int8
storetype_num       int8
storename_num       int8
cat_num             int8
time_num            int8
promo_num           int8
dtype: object
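
One thing to keep in mind with `cat.codes`: for text columns, pandas numbers the categories in sorted (alphabetical) order, so the day codes do not follow the calendar. The sketch below shows the difference on a hypothetical three-value series; the ordered `days` list is the same one used further down in this notebook.

```python
import pandas as pd

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
s = pd.Series(["Monday", "Friday", "Sunday"])

# Default: categories are sorted alphabetically (Friday, Monday, Sunday)
alphabetical = s.astype("category").cat.codes.tolist()

# Explicit calendar order: Monday = 0 ... Sunday = 6
calendar = pd.Categorical(s, categories=days, ordered=True).codes.tolist()

print(alphabetical, calendar)
```

For a chi-square test the numbering makes no difference, but for plots or models that treat the code as an ordered number the calendar order is usually easier to interpret.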

## Save the final dataframe as a new csv file

In [7]:
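# save the recoded dataframe (note: this writes to the same path that was loaded in In [2], overwriting that file)
df.to_csv(r"/workspaces/Plenty-in-the-Pantry/database/Groceries_onehousehold.csv", index=False)
# drop the recoded helper columns from the in-memory dataframe (the saved csv keeps them)
df.drop(["storename_num", 'cat_num', 'time_num', 'promo_num', 'timestamp'], axis=1, inplace=True)
df.drop(['day_num', 'storetype_num', 'reorder'], axis=1, inplace=True)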
df.head()

Unnamed: 0,HH,week,order_ID,item_name,amount,price_unit,price_total,date,day,time,store_type,store_name,promo,item_type,category,item_id,type_id,order_amount,order_price
0,2,1,5,RABEKO choco light 250g,2,2.82,5.64,2021-11-23,Tuesday,noon,supermarket,Okay,False,chocolate spread,breakfast & spreads,0,0,9,16.77
1,2,1,5,JOYVALLE pudding griesmeel natuur 135g,4,0.99,3.96,2021-11-23,Tuesday,noon,supermarket,Okay,False,pudding,dairy & plant based,1,1,9,16.77
2,2,1,5,BONI tomatensoep met balletjes 950ml,1,1.99,1.99,2021-11-23,Tuesday,noon,supermarket,Okay,False,soup,canned foods,2,2,9,16.77
3,2,1,5,LIEBIG DELISOUP 9 groenten brik 1L,1,2.59,2.59,2021-11-23,Tuesday,noon,supermarket,Okay,False,soup,canned foods,3,2,9,16.77
4,2,1,5,LIEBIG DELISOUP tom. Balletjes brik 1L,1,2.59,2.59,2021-11-23,Tuesday,noon,supermarket,Okay,False,soup,canned foods,4,2,9,16.77


In [8]:
df.dtypes

HH                int64
week              int64
order_ID          int64
item_name        object
amount            int64
price_unit      float64
price_total     float64
date             object
day              object
time             object
store_type       object
store_name       object
promo              bool
item_type        object
category         object
item_id           int64
type_id           int64
order_amount      int64
order_price     float64
dtype: object

In [9]:
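# use the corr function to display the correlation between all the numeric features;
# numeric_only=True skips text columns such as item_name, which otherwise raise
# "ValueError: could not convert string to float: 'RABEKO choco light 250g'"
data_corr = df.corr(numeric_only=True)
data_corr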

# 1. WHAT DAY? -  New dataframe: grocery visits/day/week

### Is there a correlation between day (of week) and week?

In [53]:
df_orders = df[['week', 'order_ID', 'store_name', 'storename_num', 'store_type', 'storetype_num','day', 'day_num', 'time', 'time_num', 'timestamp', 'times', 'dates', 'times_min', 'dates_days', 'order_amount', 'order_price']]
df_orders = df_orders.drop_duplicates()

In [54]:
# Cross tabulation between DAY and WEEK
CrosstabResult=pd.crosstab(index=df_orders['week'],columns=df_orders['day'])
CrosstabResult

day,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
week,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,1,0,0,0,0,2,1
2,0,0,0,3,2,0,2
3,1,0,1,2,1,0,0
4,0,0,1,3,0,0,1
5,0,2,3,0,0,2,0
6,0,0,1,1,0,1,0
7,0,2,1,0,1,0,0
8,1,2,0,0,1,0,1


In [55]:
# Performing Chi-sq test
ChiSqResult = chi2_contingency(CrosstabResult)

# The p-value is the probability of seeing a table at least this extreme if H0 (day and week are independent) is true
# If the p-value is > 0.05 we cannot reject H0

print('The P-Value of the ChiSq Test is:', ChiSqResult[1])

The P-Value of the ChiSq Test is: 0.16141960652205875
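
A caveat worth checking with a table this sparse: a common rule of thumb is that the chi-square approximation becomes unreliable when expected cell counts fall below 5. `chi2_contingency` returns the expected counts as its fourth value, so this can be inspected directly. The sketch below copies the week x day counts from the cross tabulation above into a plain array; the variable names are only for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# week x day counts copied from the cross tabulation above
# columns: Friday, Monday, Saturday, Sunday, Thursday, Tuesday, Wednesday
observed = np.array([
    [1, 0, 0, 0, 0, 2, 1],
    [0, 0, 0, 3, 2, 0, 2],
    [1, 0, 1, 2, 1, 0, 0],
    [0, 0, 1, 3, 0, 0, 1],
    [0, 2, 3, 0, 0, 2, 0],
    [0, 0, 1, 1, 0, 1, 0],
    [0, 2, 1, 0, 1, 0, 0],
    [1, 2, 0, 0, 1, 0, 1],
])

chi2, p, dof, expected = chi2_contingency(observed)

# Share of cells whose expected count is below the rule-of-thumb minimum of 5
share_small = (expected < 5).mean()
print(f"dof = {dof}, p = {p:.3f}, expected counts below 5: {share_small:.0%}")
```

With 40 orders spread over 56 cells, the p-values in this section are best read as rough indications rather than firm evidence.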


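No significant association (p ≈ 0.16 > 0.05): the distribution of shopping days does not differ detectably from week to week. Could that point to a recurring pattern?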

#### Can we check whether shopping days are predictable over longer time periods?
(e.g.: every two weeks, they go shopping in the weekend)

In [26]:
#Let's try grouping per two (consecutive) weeks
df_orders['week'] = df_orders['week'].replace([1, 2], 1)
df_orders['week'] = df_orders['week'].replace([3, 4], 2)
df_orders['week'] = df_orders['week'].replace([5, 6], 3)
df_orders['week'] = df_orders['week'].replace([7, 8], 4)
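
As an aside, groupings like this one (and the odd/even and first-half/second-half groupings used later) can also be derived arithmetically from the week number, which avoids chained `replace` calls that depend on their order. A minimal sketch on a hypothetical series of week numbers 1 to 8:

```python
import pandas as pd

weeks = pd.Series(range(1, 9), name="week")

# Blocks of two consecutive weeks: weeks 1-2 -> 1, 3-4 -> 2, 5-6 -> 3, 7-8 -> 4
two_week_block = (weeks + 1) // 2

# Odd weeks -> 1, even weeks -> 2
odd_or_even = weeks.mod(2).map({1: 1, 0: 2})

# First four weeks -> 1, last four weeks -> 2
period = (weeks > 4).astype(int) + 1

print(pd.DataFrame({"week": weeks, "block": two_week_block,
                    "odd_even": odd_or_even, "period": period}))
```

Writing the result to a new column instead of overwriting `week` would also make it unnecessary to reload the csv before each new grouping.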

In [27]:
# Cross tabulation between DAY and WEEK
CrosstabResult=pd.crosstab(index=df_orders['week'],columns=df_orders['day'])

In [28]:
# Performing Chi-sq test
ChiSqResult = chi2_contingency(CrosstabResult)

# The p-value is the probability of seeing a table at least this extreme if H0 (day and week are independent) is true
# If the p-value is > 0.05 we cannot reject H0

print('The P-Value of the ChiSq Test is:', ChiSqResult[1])

The P-Value of the ChiSq Test is: 0.20589510072116995


Still not significant (p ≈ 0.21)!
> Let's try even vs uneven weeks

In [29]:
df = pd.read_csv (r"C:\Users\20204113\OneDrive - TU Eindhoven\2_Research\1_Groceries\DATA\9th week - narrative (3rd attempt)\HH2\df\df_HH2.csv")

df_orders = df[['week','order_ID', 'store_name', 'storename_num', 'store_type', 'storetype_num','day', 'day_num', 'time', 'time_num', 'timestamp', 'times', 'dates', 'times_min', 'dates_days', 'order_amount', 'order_price']]
df_orders = df_orders.drop_duplicates()

In [30]:
#Let's try grouping per even and uneven weeks
df_orders['week'] = df_orders['week'].replace([1, 3, 5, 7], 1)
df_orders['week'] = df_orders['week'].replace([2, 4, 6, 8], 2)

In [31]:
# Cross tabulation between DAY and WEEK
CrosstabResult=pd.crosstab(index=df_orders['week'],columns=df_orders['day'])
CrosstabResult

day,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
week,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,2,4,5,2,2,4,1
2,1,2,2,7,3,1,4


In [32]:
# Performing Chi-sq test
ChiSqResult = chi2_contingency(CrosstabResult)

# The p-value is the probability of seeing a table at least this extreme if H0 (day and week are independent) is true
# If the p-value is > 0.05 we cannot reject H0

print('The P-Value of the ChiSq Test is:', ChiSqResult[1])

The P-Value of the ChiSq Test is: 0.18140198169035493


Not significant (p ≈ 0.18)!
> Shopping days in even weeks do not differ significantly from those in odd weeks

Are the first 4 weeks different from the last 4?

In [33]:
df = pd.read_csv (r"C:\Users\20204113\OneDrive - TU Eindhoven\2_Research\1_Groceries\DATA\9th week - narrative (3rd attempt)\HH2\df\df_HH2.csv")

df_orders = df[['week','order_ID', 'store_name', 'storename_num', 'store_type', 'storetype_num','day', 'day_num', 'time', 'time_num', 'timestamp', 'times', 'dates', 'times_min', 'dates_days', 'order_amount', 'order_price']]
df_orders = df_orders.drop_duplicates()

In [34]:
# group per period 1 and 2
df_orders['week'] = df_orders['week'].replace([1, 2, 3, 4], 1)
df_orders['week'] = df_orders['week'].replace([5, 6, 7, 8], 2)

In [35]:
# Cross tabulation between DAY and WEEK
CrosstabResult=pd.crosstab(index=df_orders['week'],columns=df_orders['day'])
CrosstabResult

day,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
week,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,2,0,1,5,3,2,4
2,1,6,6,4,2,3,1


In [36]:
# Performing Chi-sq test
ChiSqResult = chi2_contingency(CrosstabResult)

# The p-value is the probability of seeing a table at least this extreme if H0 (day and period are independent) is true
# If the p-value is > 0.05 we cannot reject H0

print('The P-Value of the ChiSq Test is:', ChiSqResult[1])

The P-Value of the ChiSq Test is: 0.0721155900715905


The first period may be different from the second period: p ≈ 0.07 is the lowest value so far, although it stays above the 0.05 threshold.
> Are the weeks in both periods comparable?

In [37]:
df = pd.read_csv (r"C:\Users\20204113\OneDrive - TU Eindhoven\2_Research\1_Groceries\DATA\9th week - narrative (3rd attempt)\HH2\df\df_HH2.csv")
df_orders = df[['week','order_ID', 'store_name', 'storename_num', 'store_type', 'storetype_num','day', 'day_num', 'time', 'time_num', 'timestamp', 'times', 'dates', 'times_min', 'dates_days', 'order_amount', 'order_price']]
df_orders = df_orders.drop_duplicates()

# split up df to first and second period
df_period1 = df_orders[df_orders['week'] < 5]
df_period2 = df_orders[df_orders['week'] > 4]

In [38]:
# Cross tabulation between DAY and WEEK
CrosstabResult1=pd.crosstab(index=df_period1['week'],columns=df_period1['day'])
CrosstabResult2=pd.crosstab(index=df_period2['week'],columns=df_period2['day'])

In [39]:
# Performing Chi-sq test
ChiSqResult1 = chi2_contingency(CrosstabResult1)
ChiSqResult2 = chi2_contingency(CrosstabResult2)

# The p-value is the probability of seeing a table at least this extreme if H0 (day and week are independent) is true
# If the p-value is > 0.05 we cannot reject H0

print('The P-Value of the ChiSq Test 1 is:', ChiSqResult1[1])
print('The P-Value of the ChiSq Test 2 is:', ChiSqResult2[1])

The P-Value of the ChiSq Test 1 is: 0.26642044077417815
The P-Value of the ChiSq Test 2 is: 0.5796249574256348


Not significant: within each period, the shopping days are much the same from week to week.
> The shopping days in November/December appear to differ from those in January/February.

## 2. Build Algorithm to 'randomize' shopping days

#### Grid for dow/week & descriptions

In [56]:
df = pd.read_csv (r"C:\Users\20204113\OneDrive - TU Eindhoven\2_Research\1_Groceries\DATA\9th week - narrative (3rd attempt)\HH2\df\df_HH2.csv")

In [57]:
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df['day'] = pd.Categorical(df['day'], categories=days, ordered=True)
# count the unique order IDs per week and day; groupby sorts by week and by the
# ordered day category, and observed=False keeps days without orders as 0
df_dow = df.groupby(['week', 'day'], observed=False)['order_ID'].nunique()
df_dow = pd.DataFrame(df_dow)

# make grid for days vs. week
df_dowgrid1 = df_dow.groupby(['week', 'day'])['order_ID'].aggregate('first').unstack()
df_dowgrid1 = df_dowgrid1.reset_index()
df_dowgrid1.replace(0, np.nan, inplace=True)
df_dowgrid1

# second grid to generate extra variables
df_dowgrid2 = df_dowgrid1.copy()
del df_dowgrid2["week"]
# column for total grocery visits
df_dowgrid1['sum'] = df_dowgrid2.sum(axis=1)
# column for total days shopped
df_dowgrid1['ndays'] = df_dowgrid2.count(axis=1)
# column for median visits/week
df_dowgrid1['med'] = df_dowgrid2.median(numeric_only=True, axis=1)

df_dowgrid1 = df_dowgrid1.round(0)

In [58]:
# show the grid with empty days as 0 (display only: df_dowgrid1 itself keeps NaN)
df_dowgrid1.replace(np.nan, 0)

day,week,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday,sum,ndays,med
0,1,0.0,2.0,1.0,0.0,1.0,0.0,0.0,4.0,3,1.0
1,2,0.0,0.0,2.0,2.0,0.0,0.0,3.0,7.0,3,2.0
2,3,0.0,0.0,0.0,1.0,1.0,1.0,2.0,5.0,4,1.0
3,4,0.0,0.0,1.0,0.0,0.0,1.0,3.0,5.0,3,1.0
4,5,2.0,2.0,0.0,0.0,0.0,3.0,0.0,7.0,3,2.0
5,6,0.0,1.0,0.0,0.0,0.0,1.0,1.0,3.0,3,1.0
6,7,1.0,0.0,0.0,1.0,0.0,1.0,0.0,3.0,3,1.0
7,8,2.0,0.0,1.0,1.0,1.0,0.0,0.0,5.0,4,1.0
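
For comparison, a week x day grid of this kind can also be built in a few lines with `pd.crosstab`, provided the input has exactly one row per order. This is an alternative sketch, not the code that produced the table above; the dataframe `toy` and its four orders are hypothetical.

```python
import pandas as pd

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical orders, one row per order
toy = pd.DataFrame({
    "week": [1, 1, 1, 2],
    "day": ["Tuesday", "Tuesday", "Friday", "Sunday"],
    "order_ID": [5, 6, 7, 8],
})

# Count orders per week and day, then put the days in calendar order
# (days without any order are added as columns filled with 0)
grid = pd.crosstab(toy["week"], toy["day"]).reindex(columns=days, fill_value=0)

# Total visits per week and number of distinct days shopped
grid["sum"] = grid[days].sum(axis=1)
grid["ndays"] = (grid[days] > 0).sum(axis=1)
print(grid)
```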


#### Generate randomized weeks

In [64]:
def period1():
    # first period: weeks 1-4 of the week/day grid
    df_period1 = df_dowgrid1[df_dowgrid1['week'] < 5]

    # drop the week column (returns a new dataframe) and show empty days as 0
    df_period1 = df_period1.drop(columns='week')
    df_period1 = df_period1.replace(np.nan, 0)

    return df_period1

def period2():
    # second period: weeks 5-8 of the week/day grid
    df_period2 = df_dowgrid1[df_dowgrid1['week'] > 4]

    # drop the week column (returns a new dataframe) and show empty days as 0
    df_period2 = df_period2.drop(columns='week')
    df_period2 = df_period2.replace(np.nan, 0)

    return df_period2

In [65]:
period1()

day,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday,sum,ndays,med
0,0.0,2.0,1.0,0.0,1.0,0.0,0.0,4.0,3,1.0
1,0.0,0.0,2.0,2.0,0.0,0.0,3.0,7.0,3,2.0
2,0.0,0.0,0.0,1.0,1.0,1.0,2.0,5.0,4,1.0
3,0.0,0.0,1.0,0.0,0.0,1.0,3.0,5.0,3,1.0


In [38]:
period2()

day,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday,sum,ndays,med
4,2.0,2.0,0.0,0.0,0.0,0.0,0.0,4.0,2,2.0
5,0.0,1.0,0.0,0.0,0.0,3.0,0.0,4.0,2,2.0
6,1.0,0.0,0.0,1.0,0.0,1.0,1.0,4.0,4,1.0
7,2.0,0.0,1.0,1.0,1.0,1.0,0.0,6.0,5,1.0


In [422]:
# period1() and period2() return the dataframes; df_period1 and df_period2 only exist inside those functions
period1().to_csv(r"C:\Users\20204113\OneDrive - TU Eindhoven\2_Research\1_Groceries\DATA\9th week - narrative (3rd attempt)\HH2\df\df_HH2_period1.csv", index=False, header=True)
period2().to_csv(r"C:\Users\20204113\OneDrive - TU Eindhoven\2_Research\1_Groceries\DATA\9th week - narrative (3rd attempt)\HH2\df\df_HH2_period2.csv", index=False, header=True)

# CONCLUSION

Data to build the algorithm:
>  1. Split per period (first 4 weeks, last 4 weeks)
>  2. Number of visits per day (per week)
    1. Assign weights to each day (based on times shopped on these days)
    
    
We then have the first given:
> 1. Week 9: HH2 will shop on <b>Monday/Tuesday/..., X times</b>
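
The weighting idea in the conclusion could be sketched as follows. This is an illustration under assumptions, not the final algorithm: the counts are copied from weeks 5 to 8 of the week x day grid above, the number of visits for week 9 is assumed to be the median of the weekly totals, and the random seed is arbitrary.

```python
import numpy as np

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Visits per day for weeks 5-8, copied from the week x day grid above
period2_visits = np.array([
    [2, 2, 0, 0, 0, 3, 0],
    [0, 1, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 1, 0],
    [2, 0, 1, 1, 1, 0, 0],
])

# Weight per day = share of all visits in the period that fell on that day
visits_per_day = period2_visits.sum(axis=0)
weights = visits_per_day / visits_per_day.sum()

# Assumption: week 9 gets the median number of visits per week of the period
n_visits = int(np.median(period2_visits.sum(axis=1)))

# Draw the visits for week 9; the result is one count per day (Monday..Sunday)
rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only
week9 = rng.multinomial(n_visits, weights)
print(dict(zip(days, week9)))
```

The result has the same shape as the outcome sketched at the start of this notebook: one number of visits per day of the week.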