## Step 1.0 - Import libraries and files for the oil analysis, to find out whether there is any relation between oil and the supermarket data

1- Import libraries  
2- Import files   
3- Filter out stores that don't have all datapoints  
4- Filter out items that don't have all datapoints  
5- Merge oil data with aggregated sales data by date - find out correlation  
6- Merge oil data with aggregated sales data by store - find out correlation  
7- Merge oil data with aggregated item data - find out correlation  
8- Merge oil data with combo of item and store data - find out relation (if possible)  
9- As we found positively and negatively correlated items, we want to see if it has something to do with their respective item family

In [None]:
import pandas as pd
import altair as alt
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import vegafusion as vf

# Requires: pip install "vegafusion[embed]>=1.5.0" (not in requirements.txt)

file_path_df_0 = r'C:\Users\sebas\OneDrive\Documenten\GitHub\Supermarketcasegroupproject\Group4B\data\interim\df_0.parquet'
file_path_df_oiladjusted = r'C:\Users\sebas\OneDrive\Documenten\GitHub\Supermarketcasegroupproject\Group4B\data\interim\df_oiadjusted.parquet'
file_path_df_items = r'C:\Users\sebas\OneDrive\Documenten\GitHub\Supermarketcasegroupproject\Group4B\data\raw\items.parquet'


df_salesdata = pd.read_parquet(file_path_df_0)
df_oiladjusted = pd.read_parquet(file_path_df_oiladjusted)
df_items = pd.read_parquet(file_path_df_items)

In [None]:
df_salesdata.info()

In [None]:
df_oiladjusted['date'].unique()

In [None]:
df_salesdata = df_salesdata.drop(columns = ['day','year','month','onpromotion'])

## Step 3.0 - Filter out stores that don't have all the datapoints

For the analysis of a relation with oil, we assume we don't need all the stores, so we keep only stores that have almost all datapoints. Our cut-off: a store may miss no more than 30 rows (one row per day after aggregating by date and store) in the date range between 2013-01-02 and 2017-08-15.
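
As an illustration only, the completeness filter can be sketched on toy data. The frame `toy_sales` and the helper `keep_complete_stores` are hypothetical names, not part of the notebook, and the sketch assumes the column names `date`, `store_nbr` and `unit_sales` used in the sales data.

```python
import pandas as pd

# Toy data (hypothetical): store 1 has sales on all 5 days, store 2 on only 3.
dates = pd.date_range("2013-01-02", periods=5, freq="D")
toy_sales = pd.DataFrame({
    "date": list(dates) + list(dates[:3]),
    "store_nbr": [1] * 5 + [2] * 3,
    "unit_sales": [10.0, 12.0, 9.0, 11.0, 13.0, 4.0, 5.0, 6.0],
})


def keep_complete_stores(df, max_missing):
    """Keep stores that miss at most `max_missing` of the dates seen in df."""
    days_per_store = df.groupby("store_nbr")["date"].nunique()
    total_days = df["date"].nunique()
    complete = days_per_store[days_per_store >= total_days - max_missing].index
    return df[df["store_nbr"].isin(complete)]


# Store 2 misses 2 of the 5 days, so it is dropped when only 1 may be missing.
toy_filtered = keep_complete_stores(toy_sales, max_missing=1)
print(toy_filtered["store_nbr"].unique())
```

Filtering with `isin` on the index of qualifying stores avoids a merge and the `_x`/`_y` column suffixes that come with it.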

In [None]:

df_salesdata_groupbystoreanddate = df_salesdata.groupby(['date','store_nbr'])['unit_sales'].sum()

df_salesdata_countstoredatefrequency = df_salesdata_groupbystoreanddate.groupby('store_nbr').count()

df_salesdata_countstoredatefrequency = df_salesdata_countstoredatefrequency.sort_values(ascending=True)

df_salesdata_countstoredatefrequency.head(60)

Stores that have all datapoints have 1679 rows, so we set the cut-off at 1650 on the series that contains the row count per store.

In [None]:
df_salesdata_countstoredatefrequency = df_salesdata_countstoredatefrequency[df_salesdata_countstoredatefrequency > 1650]

print(df_salesdata_countstoredatefrequency)
print(f' In total, {df_salesdata_countstoredatefrequency.count()} stores fulfill the condition of having more than 1650 days of sales data, coming from a total of 54 stores')


Adjust df_salesdata to only contain stores that have >1650 rows

In [None]:
df_salesdata_as = df_salesdata.merge(df_salesdata_countstoredatefrequency.astype(df_salesdata['store_nbr'].dtype), left_on= 'store_nbr', right_index= True, how = 'inner').drop(columns= 'unit_sales_y')
# Rename on the merged result; renaming df_salesdata here would discard the store filter
df_salesdata_as = df_salesdata_as.rename(columns={"unit_sales_x" : "unit_sales"})
unique_store_nbr = df_salesdata_as['store_nbr'].unique()
print(unique_store_nbr)

In [None]:
df_salesdata_as.head(10)

print(f'After adjusting the dataset for stores that have > 1650 rows, we are left with a sales dataset on date with {df_salesdata_as.shape[0]} rows')

## Step 4.0 - Filter out items that don't have all the datapoints

For the analysis of a relation with oil, we assume we don't need all the items, so we keep only items that have almost all datapoints.

Still to be adjusted -> our cut-off is to keep store/item combinations that have more than 1650 rows (one row per day) in the date range between 2013-01-02 and 2017-08-15, the same threshold as used for the stores.
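
A minimal sketch of the same idea at store/item level, on toy data. The frame `toy` and the threshold `min_days` are hypothetical stand-ins; the notebook itself works with a count series and a merge rather than `transform`.

```python
import pandas as pd

# Toy data (hypothetical): pair (1, 100) sells on 4 days, pair (1, 200) on 2.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2013-01-02", "2013-01-03", "2013-01-04", "2013-01-05",
         "2013-01-02", "2013-01-04"]),
    "store_nbr": [1, 1, 1, 1, 1, 1],
    "item_nbr": [100, 100, 100, 100, 200, 200],
    "unit_sales": [3.0, 2.0, 4.0, 1.0, 7.0, 8.0],
})

min_days = 3  # stand-in for the 1650-day threshold

# Number of distinct dates per store/item pair, broadcast back to every row
days_per_pair = toy.groupby(["store_nbr", "item_nbr"])["date"].transform("nunique")

# Keep only rows whose store/item pair has enough days of data
toy_pairs_kept = toy[days_per_pair > min_days]
print(toy_pairs_kept)
```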

In [None]:
df_salesdata_groupbystoreitemanddate = df_salesdata_as.groupby(['date','store_nbr','item_nbr'])['unit_sales'].sum()

df_salesdata_countstoreitemdatefrequency = df_salesdata_groupbystoreitemanddate.groupby(['store_nbr','item_nbr']).count()

df_salesdata_countstoreitemdatefrequency = df_salesdata_countstoreitemdatefrequency.sort_values(ascending=False)

df_salesdata_countstoreitemdatefrequency.head(60)

I just wanted to check if there are more items for store 3 than presented in the above output.

In [None]:
df_salesdata_countstoreitemdatefrequency.shape[0]

In [None]:
distinct_item_nbr_count = len(df_salesdata_countstoreitemdatefrequency.index.get_level_values('item_nbr').unique())
print("Count of distinct item_nbr:", distinct_item_nbr_count)

In [None]:
df_salesdata_countstoreitemdatefrequency_filtered = df_salesdata_countstoreitemdatefrequency[df_salesdata_countstoreitemdatefrequency >1650]

distinct_item_nbr_count2 = len(df_salesdata_countstoreitemdatefrequency_filtered.index.get_level_values('item_nbr').unique())
print("Count of distinct item_nbr:", distinct_item_nbr_count2)

df_salesdata_countstoreitemdatefrequency_filtered.head(10)

df_salesdata_countstoreitemdatefrequency_filtered.shape[0]

In [None]:
df_salesdata_as.shape[0]

In [None]:
df_salesdata_asai = df_salesdata_as.merge(df_salesdata_countstoreitemdatefrequency_filtered, on=['store_nbr', 'item_nbr'], how='inner').drop(columns= 'unit_sales_y')
df_salesdata_asai = df_salesdata_asai.rename(columns={"unit_sales_x" : "unit_sales"})
df_salesdata_asai.shape[0]

## Step 5.0 - Find a relation between oil and aggregated sales data by date

Find a relationship between the two datasets
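
For reference, `DataFrame.corr()` and `Series.corr()` compute the Pearson correlation by default, which measures linear association only. The sketch below uses made-up numbers (not taken from the dataset) to show that the pandas result matches the textbook definition.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    "dcoilwtico": [90.0, 80.0, 70.0, 60.0, 50.0],
    "unit_sales": [100.0, 120.0, 130.0, 170.0, 180.0],
})

# Pearson correlation (the pandas default)
pearson = toy["dcoilwtico"].corr(toy["unit_sales"])

# Same value from the definition: covariance over the product of the std devs
cov = toy["dcoilwtico"].cov(toy["unit_sales"])
by_hand = cov / (toy["dcoilwtico"].std() * toy["unit_sales"].std())

print(round(pearson, 3), round(by_hand, 3))
```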

In [None]:
df_salesdata_asai['date'] = pd.to_datetime(df_salesdata_asai['date'])
df_oiladjusted['date'] = pd.to_datetime(df_oiladjusted['date'])

In [None]:
df_salesdata_asai_total = df_salesdata_asai.groupby('date')['unit_sales'].sum()

df_oilandsalesadjusted_total = df_oiladjusted.merge(df_salesdata_asai_total, on = ['date'], how='inner')

df_oilandsalesadjusted_total.head(5)

In [None]:
y_range = [80000,250000]
df_oilandsalesadjusted_total = df_oilandsalesadjusted_total[(df_oilandsalesadjusted_total['unit_sales'] <240000) & (df_oilandsalesadjusted_total['unit_sales'] >80000)]

alt_totalsalesandoil = alt.Chart(df_oilandsalesadjusted_total).mark_circle(size=60).encode(
    x=alt.X('dcoilwtico', title='Oil Price'),
    y= alt.Y('unit_sales',title='Unit Sales',scale= alt.Scale(domain=y_range)),
    tooltip=['dcoilwtico', 'unit_sales']
)
alt_totalsalesandoil

In [None]:
df_oilandsalesadjusted_total.corr(numeric_only=True)

In [None]:
df_oiladjusted.shape

## Step 6.0 - Find a relation between oil and aggregated sales data by date and store

Find a relationship between the two datasets

In [None]:
df_salesdata_asai_stores = df_salesdata_asai.groupby(['date','store_nbr'])['unit_sales'].sum().reset_index()

df_oilandsalesadjusted_stores = df_salesdata_asai_stores.merge(df_oiladjusted, on = ['date'], how='inner')

df_oilandsalesadjusted_stores.head(5)

In [None]:
alt.data_transformers.enable("vegafusion")

df_oilandsalesadjusted_stores['store_nbr'] = df_oilandsalesadjusted_stores['store_nbr'].astype('category')

alt_storesalesandoil = alt.Chart(df_oilandsalesadjusted_stores).mark_circle(size=60).encode(
    x=alt.X('dcoilwtico', title='Oil Price'),
    y= alt.Y('unit_sales',title='Unit Sales'),
).properties(
    width=150,
    height=150
)

facet = alt_storesalesandoil.facet(
    column=alt.Column('store_nbr:N', header=alt.Header(labelAngle=-90)),
    columns=5
).resolve_scale(
    y='independent'
)
facet

In [None]:
df_oilandsalesadjusted_storestransposed = df_oilandsalesadjusted_stores.pivot_table(index='date', columns='store_nbr', values='unit_sales')

df_oilandsalesadjusted_storestransposed.shape

In [None]:
df_oilandsalesadjusted_storesmatrix = df_oilandsalesadjusted_storestransposed.merge(df_oiladjusted, on = ['date'], how='inner')

df_oilandsalesadjusted_storesmatrix = df_oilandsalesadjusted_storesmatrix.drop(columns = ['date'])

df_oilandsalesadjusted_storesmatrixcorrelation = df_oilandsalesadjusted_storesmatrix.corr()


plt.figure(figsize=(30, 30))

matrix = np.triu(df_oilandsalesadjusted_storesmatrixcorrelation)
sns.heatmap(
    df_oilandsalesadjusted_storesmatrix.corr(),
    annot = True,
    fmt='.2g',
    vmin=-1, 
    vmax=1, center= 0, cmap= 'coolwarm', mask = matrix, cbar = False)

In [None]:
df_oilandsalesadjusted_storesmatrixcorrelation['dcoilwtico'].sort_values(ascending=False)

From the above, we can conclude that there is a moderate, and for some stores even a strong, correlation with oil. It might be helpful to look at item level within these stores to see whether oil correlates with specific items.
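
If only the correlation of each store with oil is needed, rather than the full store-by-store matrix, `DataFrame.corrwith` is one option. This is a sketch on toy data; `toy_store_sales` and `toy_oil` are hypothetical names and values.

```python
import pandas as pd

dates = pd.date_range("2013-01-02", periods=5, freq="D")

# Toy wide table (hypothetical): one column per store, indexed by date
toy_store_sales = pd.DataFrame(
    {1: [10.0, 12.0, 14.0, 16.0, 18.0],   # moves with oil
     2: [50.0, 45.0, 40.0, 35.0, 30.0]},  # moves against oil
    index=dates,
)
toy_oil = pd.Series([40.0, 42.0, 44.0, 46.0, 48.0], index=dates, name="dcoilwtico")

# Correlate every store column with the oil series in one call (aligned on date)
oil_corr_by_store = toy_store_sales.corrwith(toy_oil).sort_values(ascending=False)
print(oil_corr_by_store)
```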

## Step 7.0 - Find a relation between oil and aggregated sales data and item data

Find a relationship between the two datasets

In [None]:
df_salesdata_asai_items = df_salesdata_asai.groupby(['date','item_nbr'])['unit_sales'].sum().reset_index()

df_oilandsalesadjusted_items = df_salesdata_asai_items.merge(df_oiladjusted, on = ['date'], how='inner')

df_oilandsalesadjusted_items.head(5)

In [None]:
df_oilandsalesadjusted_items['item_nbr'] = df_oilandsalesadjusted_items['item_nbr'].astype('category')

# Chart the item-level frame (the store-level frame was used here by mistake)
alt_itemsalesandoil = alt.Chart(df_oilandsalesadjusted_items).mark_circle(size=60).encode(
    x=alt.X('dcoilwtico', title='Oil Price'),
    y= alt.Y('unit_sales',title='Unit Sales'),
).properties(
    width=150,
    height=150
)
alt_itemsalesandoil

In [None]:
df_oilandsalesadjusted_itemsgroupby = df_oilandsalesadjusted_items.groupby('item_nbr')['unit_sales'].sum().reset_index()
df_oilandsalesadjusted_itemsgroupby = df_oilandsalesadjusted_itemsgroupby.sort_values(by='unit_sales', ascending=False)
df_oilandsalesadjusted_itemsgroupby = df_oilandsalesadjusted_itemsgroupby.head(100).reset_index(drop=True)
df_oilandsalesadjusted_itemsgroupby = df_oilandsalesadjusted_itemsgroupby[['item_nbr']]

df_oilandsalesadjusted_itemsgroupby.head(5)

Keep only the top 100 items based on unit sales to make this work somehow :-)
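
A compact way to express a top-N selection like this is `nlargest` followed by `isin`. The sketch below uses a hypothetical toy frame and `top_n = 2` in place of 100.

```python
import pandas as pd

# Toy data (hypothetical): three items with different total sales
toy = pd.DataFrame({
    "item_nbr": [100, 100, 200, 200, 300, 300],
    "unit_sales": [5.0, 6.0, 1.0, 2.0, 9.0, 9.0],
})

top_n = 2  # stand-in for the top 100

# Total unit sales per item, then the item numbers of the N largest totals
top_items = toy.groupby("item_nbr")["unit_sales"].sum().nlargest(top_n).index

# Keep only rows that belong to those items
toy_top = toy[toy["item_nbr"].isin(top_items)]
print(top_items.tolist())
```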

In [None]:
df_oilandsalesadjusted_items100 = df_oilandsalesadjusted_items.merge(df_oilandsalesadjusted_itemsgroupby, on = ['item_nbr'], how='inner')

print(f' In the dataset with all the items we find {df_oilandsalesadjusted_items.shape[0]} rows')
print(f' In the dataset with the top 100 of items based on unit sales we find {df_oilandsalesadjusted_items100.shape[0]} rows')

In [None]:
df_oilandsalesadjusted_items100['item_nbr'] = df_oilandsalesadjusted_items100['item_nbr'].astype('category')

alt_itemsalesandoil = alt.Chart(df_oilandsalesadjusted_items100).mark_circle(size=60).encode(
    x=alt.X('dcoilwtico', title='Oil Price'),
    y= alt.Y('unit_sales',title='Unit Sales'),
).properties(
    width=150,
    height=150
)

facetitems = alt_itemsalesandoil.facet(
    column=alt.Column('item_nbr:N', header=alt.Header(labelAngle=-90)),
    columns=5
).resolve_scale(
    y='independent'
)
facetitems

It seems Python doesn't like me trying to pull off 141 scatterplots at the same time :-(

In [None]:
df_oilandsalesadjusted_itemstransposed = df_oilandsalesadjusted_items100.pivot_table(index='date', columns='item_nbr', values='unit_sales')

df_oilandsalesadjusted_itemstransposed.shape

In [None]:
df_oilandsalesadjusted_itemstransposed.head(100)

In [None]:
df_oilandsalesadjusted_itemsmatrix = df_oilandsalesadjusted_itemstransposed.merge(df_oiladjusted, on = ['date'], how='inner')

df_oilandsalesadjusted_itemsmatrix = df_oilandsalesadjusted_itemsmatrix.drop(columns = ['date'])

df_oilandsalesadjusted_itemsmatrixcorrelation = df_oilandsalesadjusted_itemsmatrix.corr()

plt.figure(figsize=(60, 60))

matrix = np.triu(df_oilandsalesadjusted_itemsmatrixcorrelation)

sns.heatmap(
    df_oilandsalesadjusted_itemsmatrix.corr(),
    annot = True,
    fmt='.2g',
    vmin=-1, 
    vmax=1, center= 0, cmap= 'coolwarm', mask = matrix, cbar = False)

In [None]:
# Oil correlates perfectly with itself, so drop its own row before ranking the items
df_oilandsalesadjusted_itemsmatrixcorrelationoilcolumn = df_oilandsalesadjusted_itemsmatrixcorrelation['dcoilwtico'].drop('dcoilwtico').sort_values(ascending=False)
print(f'These are the 10 items most positively correlated with oil:\n{df_oilandsalesadjusted_itemsmatrixcorrelationoilcolumn.head(10)}')
print(f'These are the 10 items most negatively correlated with oil:\n{df_oilandsalesadjusted_itemsmatrixcorrelationoilcolumn.tail(10)}')
print("Let's put these in a Series to use in the next step")

# Combine head(10) and tail(10) of df_oilandsalesadjusted_itemsmatrixcorrelationoilcolumn into one Series
se_top10positiveitems = df_oilandsalesadjusted_itemsmatrixcorrelationoilcolumn.head(10)
se_top10negativeitems = df_oilandsalesadjusted_itemsmatrixcorrelationoilcolumn.tail(10)
se_top20correlateditems = pd.concat([se_top10positiveitems, se_top10negativeitems])
# Display the new Series
se_top20correlateditems

In [None]:
df_oilandsalesadjusted_itemsmatrix.shape

## Step 8.0 - Find a relation between oil and aggregated sales data, item data and store data combined

Find a relationship between the two datasets.

In [None]:
df_salesdata_asai['store_nbr'] = df_salesdata_asai['store_nbr'].astype('category')
df_salesdata_asai['item_nbr'] = df_salesdata_asai['item_nbr'].astype('category')

df_salesdata_asai['store-item'] = df_salesdata_asai['store_nbr'].astype(str) + '-' + df_salesdata_asai['item_nbr'].astype(str)

df_salesdata_asai

In [None]:
# Merge the sales data with the oil data
df_salesdata_asai_storesitems = df_salesdata_asai.merge(df_oiladjusted, on = ['date'], how='inner')

print(f'After joining with the oil data we are left with {df_salesdata_asai_storesitems.shape[0]} rows')

In [None]:
# Merge the sales data with the top 100 of items based on unit sales
df_salesdata_asai_storesitems = df_salesdata_asai_storesitems.merge(df_oilandsalesadjusted_itemsgroupby, on = ['item_nbr'], how='inner')

print(f'After joining with the top 100 of items based on unit sales we are left with {df_salesdata_asai_storesitems.shape[0]} rows')

In [None]:
df_salesdata_asai_storesitems

However, I later found out that making a correlation matrix based on the top 100 items times all the stores isn't practical anymore. Thus, as a second best, I focus on joining my data with the top 20 correlated items from step 7 so my dataset is smaller.
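
One thing to watch when taking the most correlated items from a column of a correlation matrix: the oil column correlates perfectly with itself, so its own row would otherwise rank first. A minimal sketch with hypothetical values (`toy_corr`, `k = 2` instead of 10):

```python
import pandas as pd

# Toy correlation column (hypothetical values), including oil's self-correlation
toy_corr = pd.Series(
    {"dcoilwtico": 1.0, 101: 0.62, 102: -0.55, 103: 0.10, 104: -0.71, 105: 0.48},
    name="dcoilwtico",
)

k = 2  # stand-in for 10

# Drop the self-correlation, rank, then take the k strongest on each side
ranked = toy_corr.drop("dcoilwtico").sort_values(ascending=False)
strongest = pd.concat([ranked.head(k), ranked.tail(k)])
print(strongest)
```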

In [None]:
df_salesdata_asai_storesitemstop20 = df_salesdata_asai_storesitems[df_salesdata_asai_storesitems['item_nbr'].isin(se_top20correlateditems.index)]   

df_salesdata_asai_storesitemstop20

In [None]:
#df_oilandsalesadjusted_items100['item_nbr'] = df_oilandsalesadjusted_items100['item_nbr'].astype('category')

alt_itemsalesandoil = alt.Chart(df_salesdata_asai_storesitemstop20).mark_circle(size=60).encode(
    x=alt.X('dcoilwtico', title='Oil Price'),
    y= alt.Y('unit_sales',title='Unit Sales'),
).properties(
    width=150,
    height=150
)

facetitemsstores = alt_itemsalesandoil.facet(
    column=alt.Column('store-item:N', header=alt.Header(labelAngle=-90)),
    columns=5
).resolve_scale(
    y='independent'
)
facetitemsstores

In [None]:
df_salesdata_asai_storesitemstop20['date'].unique()

In [None]:
df_oilandsalesadjusted_itemsstorestransposed = df_salesdata_asai_storesitemstop20.pivot_table(index='date', columns='store-item', values='unit_sales')

df_oilandsalesadjusted_itemsstorestransposed.head(3)

In [None]:
df_oilandsalesadjusted_itemsstorestransposed.info()

In [None]:
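df_oilandsalesadjusted_itemsmatrix = df_oilandsalesadjusted_itemsstorestransposed.merge(df_oiladjusted, on = ['date'], how='inner')

# Drop the date column before correlating, as in steps 6 and 7
df_oilandsalesadjusted_itemsmatrix = df_oilandsalesadjusted_itemsmatrix.drop(columns = ['date'])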

df_oilandsalesadjusted_itemsstoresmatrixcorrelation = df_oilandsalesadjusted_itemsmatrix.corr()

plt.figure(figsize=(120, 120))

matrix = np.triu(df_oilandsalesadjusted_itemsstoresmatrixcorrelation)

sns.heatmap(
    df_oilandsalesadjusted_itemsmatrix.corr(),
    annot = True,
    fmt='.2g',
    vmin=-1, 
    vmax=1, center= 0, cmap= 'coolwarm', mask = matrix, cbar = False)

## Step 9.0 - Find a relation between oil and item family, to see whether the correlations at store and item level are just random

Find a relationship between the two datasets.
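
Another way to look at the family question is to attach the family to each item's correlation with oil and summarise per family. The sketch below is an assumption about how that could look, not what the notebook does; all names and values are hypothetical, apart from the `item_nbr` and `family` columns of the items table.

```python
import pandas as pd

# Toy item metadata (hypothetical rows)
toy_items = pd.DataFrame({
    "item_nbr": [101, 102, 103, 104],
    "family": ["BEVERAGES", "BEVERAGES", "PRODUCE", "PRODUCE"],
})

# Toy per-item correlation with oil (hypothetical values), indexed by item_nbr
toy_item_corr = pd.Series({101: 0.6, 102: 0.4, 103: -0.5, 104: -0.7}, name="oil_corr")

# Attach each item's correlation to its family, then summarise per family
by_family = (
    toy_items.merge(toy_item_corr, left_on="item_nbr", right_index=True)
    .groupby("family")["oil_corr"]
    .agg(["mean", "count"])
)
print(by_family)
```

If the items in one family share the same sign of correlation, that hints at a family-level pattern rather than random item-level noise.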

In [None]:
df_salesdata_asai_itemcategory = df_salesdata_asai.merge(df_items, on = ['item_nbr'], how='inner')

df_salesdata_asai_itemcategory.head(3) 

In [None]:
df_salesdate_itemfamily = df_salesdata_asai_itemcategory.groupby(['date','family'])['unit_sales'].sum().reset_index()

df_salesdate_itemfamily = df_salesdate_itemfamily.merge(df_oiladjusted, on = ['date'], how='inner')

df_salesdate_itemfamily.head(3)

In [None]:
df_salesdate_itemfamilytransposed = df_salesdate_itemfamily.pivot_table(index='date', columns='family', values='unit_sales')

df_salesdate_itemfamilytransposed.head(3)

In [None]:
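df_oilandsalesadjusted_familymatrix = df_salesdate_itemfamilytransposed.merge(df_oiladjusted, on = ['date'], how='inner')

# Drop the date column before correlating, as in steps 6 and 7
df_oilandsalesadjusted_familymatrix = df_oilandsalesadjusted_familymatrix.drop(columns = ['date'])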

df_oilandsalesadjusted_familymatrixcorrelation = df_oilandsalesadjusted_familymatrix.corr()

plt.figure(figsize=(60, 60))

matrix = np.triu(df_oilandsalesadjusted_familymatrixcorrelation)

sns.heatmap(
    df_oilandsalesadjusted_familymatrix.corr(),
    annot = True,
    fmt='.2g',
    vmin=-1, 
    vmax=1, center= 0, cmap= 'coolwarm', mask = matrix, cbar = False)

In [None]:
df_salesdate_itemclass = df_salesdata_asai_itemcategory.groupby(['date','class'])['unit_sales'].sum().reset_index()

df_salesdate_itemclass = df_salesdate_itemclass.merge(df_oiladjusted, on = ['date'], how='inner')

df_salesdate_itemclass.head(3)

In [None]:
df_salesdate_itemclasstransposed = df_salesdate_itemclass.pivot_table(index='date', columns='class', values='unit_sales')

df_salesdate_itemclasstransposed.head(3)

In [None]:
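df_oilandsalesadjusted_classmatrix = df_salesdate_itemclasstransposed.merge(df_oiladjusted, on = ['date'], how='inner')

# Drop the date column before correlating, as in steps 6 and 7
df_oilandsalesadjusted_classmatrix = df_oilandsalesadjusted_classmatrix.drop(columns = ['date'])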

df_oilandsalesadjusted_classmatrixcorrelation = df_oilandsalesadjusted_classmatrix.corr()

plt.figure(figsize=(60, 60))

matrix = np.triu(df_oilandsalesadjusted_classmatrixcorrelation)

sns.heatmap(
    df_oilandsalesadjusted_classmatrix.corr(),
    annot = True,
    fmt='.2g',
    vmin=-1, 
    vmax=1, center= 0, cmap= 'coolwarm', mask = matrix, cbar = False)

Conclusion from the oil analysis:

We imported the sales data together with the oil and items data.
We restricted the sales data so that only stores and items with more than 1650 days in the dataset are allowed in this EDA (the maximum for sales data would be 1679 per store/item combination).
In total, this left us with 41 stores to work with (out of 54 in total).
In total, this left us with 648 items to work with (out of 4036 in total).
From the items, we took the top 100 based on unit sales, as otherwise it wouldn't work technically.

We found both positive and negative correlations in all of our analyses (by store, by item and by the combination of the two).
As an extra, we found these patterns as well when looking at item class.

It seems that oil has some relation to some of the items and maybe to some of the stores as well.

In [None]:
print(f' In total, {df_salesdata_countstoredatefrequency.count()} stores fulfill the condition of having more than 1650 days of sales data, coming from a total of 54 stores')

