## Step 1.0 & 2.0 - Importing libraries and files for the oil analysis, and checking whether sales data relates to store number, type or cluster

The steps to be conducted:

1- Import libraries (done)  
2- Import files (done)  
3- Filter out stores that don't have all datapoints  
4- Filter out items that don't have all datapoints  
5- Merge oil data with aggregated sales data by date - find out correlation  
6- Merge oil data with aggregated sales data by store - find out correlation  
7- Merge oil data with aggregated item data - find out correlation  
8- Merge oil data with combo of item and store data - find out relation (if possible)  

In [1]:
import pandas as pd
import altair as alt
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import vegafusion as vf

#pip install "vegafusion[embed]>=1.5.0" (not in requirements.txt)

# Reading the files for salesdata and stores data into my notebook
file_path_df_0 = r'C:\Users\sebas\OneDrive\Documenten\GitHub\Supermarketcasegroupproject\Group4B\data\interim\df_0.parquet'
file_path_stores = r'C:\Users\sebas\OneDrive\Documenten\GitHub\Supermarketcasegroupproject\Group4B\data\raw\stores.parquet'

df_salesdata = pd.read_parquet(file_path_df_0)
df_stores = pd.read_parquet(file_path_stores)

In [2]:
print(df_stores.info())
print(df_salesdata.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54 entries, 0 to 53
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   store_nbr  54 non-null     int64 
 1   city       54 non-null     object
 2   state      54 non-null     object
 3   type       54 non-null     object
 4   cluster    54 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 2.2+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125497040 entries, 0 to 125497039
Data columns (total 9 columns):
 #   Column       Dtype         
---  ------       -----         
 0   id           uint32        
 1   store_nbr    uint8         
 2   item_nbr     uint32        
 3   unit_sales   float32       
 4   onpromotion  boolean       
 5   day          uint8         
 6   year         int32         
 7   month        int32         
 8   date         datetime64[ns]
dtypes: boolean(1), datetime64[ns](1), float32(1), int32(2), uint32(2), uint8(2)
memory usa

The dataframes use different data types for the same fields (for example, store_nbr is uint8 in the sales data and int64 in the stores data). To make the join reliable, we align the types of those fields. Following data management logic, we keep dimension fields (fields we don't calculate with) in string or date format, and use numerical types only for fields we want to calculate with.
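A minimal sketch of this idea on toy data. The frames `sales` and `stores` below are hypothetical stand-ins, not the project files: the key column is cast to string on both sides before merging.

```python
import pandas as pd

# Toy stand-ins for the sales and stores tables (hypothetical values)
sales = pd.DataFrame({
    "store_nbr": pd.Series([1, 1, 2], dtype="uint8"),
    "unit_sales": [10.0, 5.0, 7.5],
})
stores = pd.DataFrame({"store_nbr": [1, 2], "type": ["D", "A"]})

# Treat the key as a dimension: cast it to string on both sides
sales["store_nbr"] = sales["store_nbr"].astype(str)
stores["store_nbr"] = stores["store_nbr"].astype(str)

# With matching key types the join behaves predictably
merged = sales.merge(stores, on="store_nbr", how="inner")
print(merged)
```

Casting before the merge keeps the key type explicit, so a later join against another table with string keys does not depend on implicit type coercion.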

In [3]:
# Adjust data types and drop columns we don't need
df_salesdata['store_nbr'] = df_salesdata['store_nbr'].astype(str)
df_salesdata = df_salesdata.drop(columns=['year', 'day','onpromotion','month'])
df_stores['store_nbr'] = df_stores['store_nbr'].astype(str)
df_stores['cluster'] = df_stores['cluster'].astype(str)

print(df_stores.info())
print(df_salesdata.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54 entries, 0 to 53
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   store_nbr  54 non-null     object
 1   city       54 non-null     object
 2   state      54 non-null     object
 3   type       54 non-null     object
 4   cluster    54 non-null     object
dtypes: object(5)
memory usage: 2.2+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125497040 entries, 0 to 125497039
Data columns (total 5 columns):
 #   Column      Dtype         
---  ------      -----         
 0   id          uint32        
 1   store_nbr   object        
 2   item_nbr    uint32        
 3   unit_sales  float32       
 4   date        datetime64[ns]
dtypes: datetime64[ns](1), float32(1), object(1), uint32(2)
memory usage: 3.3+ GB
None


In [4]:
# Group the sales data by store and date
df_salesdatagrouped = df_salesdata.groupby(['store_nbr','date']).agg({'unit_sales':'sum'}).reset_index()

print(f'df_salesdatagrouped now contains {df_salesdatagrouped.shape[0]} rows and {df_salesdatagrouped.shape[1]} columns')
print(df_salesdatagrouped.info())

df_salesdatagrouped now contains 83606 rows and 3 columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83606 entries, 0 to 83605
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   store_nbr   83606 non-null  object        
 1   date        83606 non-null  datetime64[ns]
 2   unit_sales  83606 non-null  float32       
dtypes: datetime64[ns](1), float32(1), object(1)
memory usage: 1.6+ MB
None


In [5]:
# Add the store attributes (city, state, type, cluster) to the daily sales per store
df_salesandstoresdata = df_salesdatagrouped.merge(df_stores, on='store_nbr', how='inner')

print(f'df_salesandstoresdata now contains {df_salesandstoresdata.shape[0]} rows and {df_salesandstoresdata.shape[1]} columns')
print(df_salesandstoresdata.info())

df_salesandstoresdata now contains 83606 rows and 7 columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83606 entries, 0 to 83605
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   store_nbr   83606 non-null  object        
 1   date        83606 non-null  datetime64[ns]
 2   unit_sales  83606 non-null  float32       
 3   city        83606 non-null  object        
 4   state       83606 non-null  object        
 5   type        83606 non-null  object        
 6   cluster     83606 non-null  object        
dtypes: datetime64[ns](1), float32(1), object(5)
memory usage: 4.1+ MB
None


## Step 3.0 - Filter out all stores that don't have all the datapoints, or at least mark them



In [6]:
# Count amount of values per store
se_storedatecount = df_salesandstoresdata['store_nbr'].value_counts()

print(f"The daterange of the salesdata starts at {df_salesandstoresdata['date'].min()}")
print(f"The daterange of the salesdata ends at {df_salesandstoresdata['date'].max()}")
print(f"The daterange of the salesdata is {df_salesandstoresdata['date'].max() - df_salesandstoresdata['date'].min()}")
print(se_storedatecount)

The daterange of the salesdata starts at 2013-01-01 00:00:00
The daterange of the salesdata ends at 2017-08-15 00:00:00
The daterange of the salesdata is 1687 days 00:00:00
store_nbr
34    1679
32    1679
10    1679
35    1679
37    1679
38    1679
39    1679
4     1679
40    1679
41    1679
44    1679
45    1679
46    1679
47    1679
48    1679
49    1679
5     1679
50    1679
51    1679
54    1679
6     1679
7     1679
8     1679
33    1679
9     1679
31    1679
2     1679
3     1679
11    1679
13    1679
28    1679
27    1679
26    1679
15    1679
16    1679
23    1679
19    1679
1     1678
17    1677
43    1675
30    1656
14    1641
12    1619
25    1618
24    1578
18    1569
36    1553
53    1169
20     911
29     876
21     750
42     722
22     673
52     118
Name: count, dtype: int64


In [7]:
# Define the start and end of the date range to check against
# Note: it starts on 2013-01-02, one day after the first date in the sales data (2013-01-01)
start_date = pd.to_datetime('2013-01-02')
end_date = pd.to_datetime('2017-08-15')

# Create a date range variable from the start date to the end date of the sales data
date_range = pd.date_range(start=start_date, end=end_date, freq='D')

# Create a dataframe from the date range
date_range = pd.DataFrame(date_range, columns=['date'])

print(f'The date_range dataframe starts at {date_range["date"].min()} and ends at {date_range["date"].max()}')

The date_range dataframe starts at 2013-01-02 00:00:00 and ends at 2017-08-15 00:00:00


In [8]:
# Take store 34, one of the stores with the maximum number of datapoints (1679)
df_salesandstoresdata34 = df_salesandstoresdata[df_salesandstoresdata['store_nbr'] == '34']

# Outer merge with the full date range so that dates without sales show up as NaN
df_salesandstoresdata34missingdates = df_salesandstoresdata34.merge(date_range, on='date', how='outer')

# Keep only the dates on which the store has no sales
empty_unit_sales = df_salesandstoresdata34missingdates[df_salesandstoresdata34missingdates['unit_sales'].isnull()]
print(empty_unit_sales)
print('As we can see, stores that have all data seem to be closed on christmas day and on new years day')

     store_nbr       date  unit_sales city state type cluster
357        NaN 2013-12-25         NaN  NaN   NaN  NaN     NaN
364        NaN 2014-01-01         NaN  NaN   NaN  NaN     NaN
722        NaN 2014-12-25         NaN  NaN   NaN  NaN     NaN
729        NaN 2015-01-01         NaN  NaN   NaN  NaN     NaN
1087       NaN 2015-12-25         NaN  NaN   NaN  NaN     NaN
1094       NaN 2016-01-01         NaN  NaN   NaN  NaN     NaN
1453       NaN 2016-12-25         NaN  NaN   NaN  NaN     NaN
1460       NaN 2017-01-01         NaN  NaN   NaN  NaN     NaN
As we can see, stores that have all data seem to be closed on christmas day and on new years day
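
The same check can be sketched without a merge. This is an illustration on a hypothetical short calendar, not the notebook's method: `DatetimeIndex.difference` returns the dates of a full calendar that are absent from the observed dates.

```python
import pandas as pd

# Full calendar for a short, hypothetical window
expected = pd.date_range("2013-12-20", "2014-01-05", freq="D")

# Hypothetical observed dates for one store: every day except 25 Dec and 1 Jan
closed = pd.to_datetime(["2013-12-25", "2014-01-01"])
observed = expected.difference(closed)

# Dates in the full calendar that are absent from the observed data
missing = expected.difference(observed)
print(missing.strftime("%Y-%m-%d").tolist())
```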


We now want to do several things with the store and sales data, namely:  
1- Find out which days are missing per store (are the stores just new, or is data missing in between?)  
2- Mark the stores that are missing data  
3- Possibly define a division between what we do and do not find acceptable in terms of missing data  

In [9]:
# Let's try to make a dataframe that consists of all stores that are missing data for a certain date
# Step 1 - Cross join stores with the date range
df_storesreduced = df_stores.drop(columns=['city', 'state', 'type', 'cluster'])
df_storesanddates = df_storesreduced.merge(date_range, how='cross')

print(f' Now we constructed a dataframe with all stores and all dates, it contains {df_storesanddates.shape[0]} rows')
print(df_storesanddates.head(5))

# Step 2 - Merge the salesdata with the storesanddates dataframe to have a dataframe consisting of all stores and all dates with unit_sales

df_salesandstoresdata_alldates = df_salesandstoresdata.merge(df_storesanddates, on = ['store_nbr','date'], how='outer')

print(f' Now we constructed a dataframe with all stores and all dates, it contains {df_salesandstoresdata_alldates.shape[0]} rows')
print(df_salesandstoresdata_alldates.head(5))

 Now we constructed a dataframe with all stores and all dates, it contains 91098 rows
  store_nbr       date
0         1 2013-01-02
1         1 2013-01-03
2         1 2013-01-04
3         1 2013-01-05
4         1 2013-01-06
 Now we constructed a dataframe with all stores and all dates, it contains 91099 rows
  store_nbr       date   unit_sales   city      state type cluster
0         1 2013-01-02  7417.147949  Quito  Pichincha    D      13
1         1 2013-01-03  5873.244141  Quito  Pichincha    D      13
2         1 2013-01-04  5919.878906  Quito  Pichincha    D      13
3         1 2013-01-05  6318.785156  Quito  Pichincha    D      13
4         1 2013-01-06  2199.086914  Quito  Pichincha    D      13


We now have a dataframe with every store combined with every date from the date_range dataframe. The combined dataframe has 1 extra row compared to the cross join of stores and dates (a bit odd), so let's find out why.

In [10]:
# Merge the two dataframes and keep only the records that are in the first dataframe but not in the second dataframe
Difference_df_salesandstoresdata_alldates_df_storesanddates = df_salesandstoresdata_alldates.merge(df_storesanddates, on = ['store_nbr','date'], how='outer', indicator=True).loc[lambda x : x['_merge']=='left_only']
Difference_df_salesandstoresdata_alldates_df_storesanddates

Unnamed: 0,store_nbr,date,unit_sales,city,state,type,cluster,_merge
28679,25,2013-01-01,2511.618896,Salinas,Santa Elena,D,1,left_only


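The extra row is store 25 on 2013-01-01. That date falls outside date_range, which starts on 2013-01-02, so it has no counterpart in df_storesanddates. Store 25 is thus the only store with sales on New Year's Day 2013, breaking the pattern of the other stores.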

In [11]:
# Now, let's see how this works out for store number 30 (just a random one that is missing some dates according to our earlier analysis)
df_salesandstoresdata_alldates30 = df_salesandstoresdata_alldates[df_salesandstoresdata_alldates['store_nbr']=='30']
df_salesandstoresdata_alldates30 = df_salesandstoresdata_alldates30[df_salesandstoresdata_alldates30['unit_sales'].isnull()]
df_salesandstoresdata_alldates30.head(5)

Unnamed: 0,store_nbr,date,unit_sales,city,state,type,cluster
38989,30,2013-07-08,,,,,
38990,30,2013-07-09,,,,,
38991,30,2013-07-10,,,,,
38992,30,2013-07-11,,,,,
38993,30,2013-07-12,,,,,


As we can see, the df_salesandstoresdata_alldates dataframe contains all dates per store, which lets us explore which dates are missing for a given store (those with NaN for unit_sales), as shown here for store 30.
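
A minimal sketch of the same pattern on toy data, counting the missing days per store. The frames `stores`, `dates` and `sales` are hypothetical; `merge(how='cross')` is the cross join used in the cell above.

```python
import pandas as pd

# Hypothetical stores and a short calendar
stores = pd.DataFrame({"store_nbr": ["1", "2"]})
dates = pd.DataFrame({"date": pd.date_range("2013-01-02", periods=4, freq="D")})

# Every store combined with every date
grid = stores.merge(dates, how="cross")

# Store "1" has sales on all four days, store "2" only on the first two
sales = pd.DataFrame({
    "store_nbr": ["1", "1", "1", "1", "2", "2"],
    "date": list(dates["date"]) + list(dates["date"][:2]),
    "unit_sales": [5.0, 6.0, 7.0, 8.0, 1.0, 2.0],
})

# Left merge onto the grid: dates without sales get NaN for unit_sales
full = grid.merge(sales, on=["store_nbr", "date"], how="left")

# Count the NaN rows per store
missing_per_store = full["unit_sales"].isna().groupby(full["store_nbr"]).sum()
print(missing_per_store)
```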

In [12]:
# Only the stores that have a value count of less than 1679 in se_storedatecount
se_storedatecountmissing = se_storedatecount[se_storedatecount < 1679]

# Now, let's take df_salesandstoresdata_alldates, but only for the stores that are missing some data (that is, dates without sales in the original data)
df_salesandstoresdata_alldatesnull = df_salesandstoresdata_alldates[df_salesandstoresdata_alldates['store_nbr'].isin(se_storedatecountmissing.index)]

# From the stores with missing data, we only want the records where the unit_sales is missing
df_salesandstoresdata_alldatesnull = df_salesandstoresdata_alldatesnull[df_salesandstoresdata_alldatesnull['unit_sales'].isnull()]

df_salesandstoresdata_alldatesnull = df_salesandstoresdata_alldatesnull[['date', 'store_nbr','unit_sales']]

# Add a unit_sales of 1 to the dataframe to make it easier to plot, it's just a dummy value
df_salesandstoresdata_alldatesnull['unit_sales'] = 1

# Merge the dataframe with the date_range dataframe to have all dates in the dataframe
df_salesandstoresdata_alldatesnull = df_storesanddates.merge(df_salesandstoresdata_alldatesnull, on=['store_nbr','date'] ,how='left')

# Now we have a dataframe with all stores and all dates, but only for the stores that are missing some data
df_salesandstoresdata_alldatesnull = df_salesandstoresdata_alldatesnull[df_salesandstoresdata_alldatesnull['store_nbr'].isin(se_storedatecountmissing.index)]

print(f"Stores {df_salesandstoresdata_alldatesnull['store_nbr'].unique()} are in the dataset with stores with <1679 datapoints and all dates, having imputed a value of 1 for all dates missing in the range")

Stores ['1' '12' '14' '17' '18' '20' '21' '22' '24' '25' '29' '30' '36' '42' '43'
 '52' '53'] are in the dataset with stores with <1679 datapoints and all dates, having imputed a value of 1 for all dates missing in the range


Let's now make a graph to see where in the timeline data is missing, for each store that misses some data.

In [13]:
alt.data_transformers.enable("vegafusion")

df_salesandstoresdata_alldatesnull_chart = alt.Chart(df_salesandstoresdata_alldatesnull , title='In color, days that are missing unit sales for stores that miss data').mark_circle(size=8).encode(
    y="store_nbr:N",
    x="date:T",
    yOffset="unit_sales:Q",
    color=alt.Color('store_nbr:N').legend(None)
)
df_salesandstoresdata_alldatesnull_chart = df_salesandstoresdata_alldatesnull_chart.properties(
    width=1000,  # Set the width
    height=500  # Set the height
)

df_salesandstoresdata_alldatesnull_chart

In [14]:
df_salesandstoresdata_alldatesnull36 = df_salesandstoresdata_alldatesnull[df_salesandstoresdata_alldatesnull['store_nbr'] == '36']
df_salesandstoresdata_alldatesnull36

Unnamed: 0,store_nbr,date,unit_sales
59045,36,2013-01-02,1.0
59046,36,2013-01-03,1.0
59047,36,2013-01-04,1.0
59048,36,2013-01-05,1.0
59049,36,2013-01-06,1.0
...,...,...,...
60727,36,2017-08-11,
60728,36,2017-08-12,
60729,36,2017-08-13,
60730,36,2017-08-14,


We can conclude the following from this initial analysis:  
1- Some stores are relatively new: they miss data for a longer initial period, after which they don't miss any data (20, 21, 22, 29, 36, 42, 52, 53)  
2- Other stores miss data in between (most likely they were closed for some time)  

We might want to distinguish between stores that are relatively new and stores that miss data in between.
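
One possible way to make that distinction is sketched below on toy data. It is an alternative illustration, not the approach used in the next cell: it splits the missing days per store into days before the first sale (a late opening) and days after it (gaps). The stores "A" and "B" and all dates are hypothetical.

```python
import pandas as pd

# Hypothetical calendar of 10 days
calendar = pd.date_range("2013-01-02", "2013-01-11", freq="D")

# Store "A" opens late (no sales on the first 5 days), store "B" has a 2-day gap
sales = pd.concat([
    pd.DataFrame({"store_nbr": "A", "date": calendar[5:]}),
    pd.DataFrame({"store_nbr": "B", "date": calendar.delete([3, 4])}),
])

summary = sales.groupby("store_nbr")["date"].agg(
    first_sale="min", days_with_sales="count"
)

# Days missing before the first sale suggest a late opening
summary["missing_before_open"] = (summary["first_sale"] - calendar[0]).dt.days

# Days missing after the first sale suggest gaps (closures or lost data)
summary["missing_after_open"] = (
    len(calendar) - summary["missing_before_open"] - summary["days_with_sales"]
)
print(summary)
```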

In [15]:
# Stores with fewer than 1670 days of data are treated as missing more than 9 days
se_storedatecountmissingsome = se_storedatecount[se_storedatecount < 1670]

df_salesandstoresdata_alldatesnull1 = df_salesandstoresdata_alldatesnull.copy()

# Step 1 - Identify stores that are new based on having a dummy value (missing sales) on 2013-01-02, the first date in date_range
new_store_nbrs = df_salesandstoresdata_alldatesnull[
    (df_salesandstoresdata_alldatesnull['date'] == '2013-01-02') & 
    (df_salesandstoresdata_alldatesnull['unit_sales'] == 1)
]['store_nbr'].unique()

# Make a new column missingdatacategory where stores that have a dummy value for 2013-01-02 are labelled 'new_store'; the rest is labelled 'old_store' for now. This is still the whole dataset
# The isin part of the expression selects the right store numbers
df_salesandstoresdata_alldatesnull1['missingdatacategory'] = np.where(df_salesandstoresdata_alldatesnull1['store_nbr'].isin(new_store_nbrs),
                                                                     'new_store', 
                                                                     'old_store'
                                                                     )

# Step 2 - For all stores that have < 1670 days of data, keep 'new_store' and relabel the others as 'old_store missing >9 days'
# .copy() avoids the pandas SettingWithCopyWarning when assigning to the filtered dataframe
df_salesandstoresdata_alldatesnull2 = df_salesandstoresdata_alldatesnull1[df_salesandstoresdata_alldatesnull1['store_nbr'].isin(se_storedatecountmissingsome.index)].copy()

df_salesandstoresdata_alldatesnull2['missingdatacategory'] = np.where((df_salesandstoresdata_alldatesnull2['missingdatacategory'] == 'new_store'),
                                                                    'new_store',
                                                                    'old_store missing >9 days'
                                                                    )

# Step 3 - All remaining stores (at least 1670 days of data) we just label "missing <9 days"
df_salesandstoresdata_alldatesnull3 = df_salesandstoresdata_alldatesnull1[~df_salesandstoresdata_alldatesnull1['store_nbr'].isin(se_storedatecountmissingsome.index)].copy()
df_salesandstoresdata_alldatesnull3['missingdatacategory'] = 'missing <9 days'

# Put the dataframes of step 2 and 3 together to get all rows back together as in the original dataframes
df_salesandstoresdata_alldatesnullfinal = pd.concat([df_salesandstoresdata_alldatesnull2, df_salesandstoresdata_alldatesnull3])

print(df_salesandstoresdata_alldatesnull.shape)
print(df_salesandstoresdata_alldatesnull1.shape)
print(df_salesandstoresdata_alldatesnull2.shape)
print(df_salesandstoresdata_alldatesnull3.shape)

# Count the dummy values (missing days) per store and flag these stores as missing data
df_salesandstoresdata_alldatesnullfinal = df_salesandstoresdata_alldatesnullfinal.groupby(['store_nbr','missingdatacategory']).agg({'unit_sales':'count'}).reset_index()
df_salesandstoresdata_alldatesnullfinal['missingdata'] = '1'
df_salesandstoresdata_alldatesnullfinal

(28679, 3)
(28679, 4)
(23618, 4)
(5061, 4)


Unnamed: 0,store_nbr,missingdatacategory,unit_sales,missingdata
0,1,missing <9 days,9,1
1,12,old_store missing >9 days,68,1
2,14,old_store missing >9 days,46,1
3,17,missing <9 days,10,1
4,18,old_store missing >9 days,118,1
5,20,new_store,776,1
6,21,new_store,937,1
7,22,new_store,1014,1
8,24,old_store missing >9 days,109,1
9,25,old_store missing >9 days,70,1


We have now labelled the stores that are missing data, based on how much data they miss and whether they are new or not.

## Step 4.0- Determine the impact of stores that are missing data

1- Find out the total unit sales in July 2017 (a point in time at which we have data for all stores, so we can compare) and how it differs from the total sales over the whole period  
2- Find out what the impact of store type is, and which store types the stores with missing data belong to  


In [16]:
# Total unit sales per store over the whole period
df_salesandstoresdatatotal = df_salesandstoresdata.groupby(['store_nbr']).agg({'unit_sales':'sum'}).reset_index()
# Add the missing-data labels; unit_sales_y is the count of missing days, which we don't need here
df_salesandstoresdatatotal = df_salesandstoresdatatotal.merge(df_salesandstoresdata_alldatesnullfinal, on='store_nbr', how='left')
df_salesandstoresdatatotal = df_salesandstoresdatatotal.drop(columns=['unit_sales_y'])
df_salesandstoresdatatotal = df_salesandstoresdatatotal.rename(columns={'unit_sales_x':'unit_sales'})
# Stores without a label are not missing any data: mark them with '0'
df_salesandstoresdatatotal['missingdata'] = df_salesandstoresdatatotal['missingdata'].fillna('0')
df_salesandstoresdatatotal['missingdatacategory'] = df_salesandstoresdatatotal['missingdatacategory'].fillna('0')
df_salesandstoresdatatotal = df_salesandstoresdatatotal.sort_values(by='unit_sales', ascending=False)

df_salesandstoresdatatotal

Unnamed: 0,store_nbr,unit_sales,missingdatacategory,missingdata
38,44,62087544.0,0,0
39,45,54498012.0,0,0
41,47,50948308.0,0,0
22,3,50481900.0,0,0
43,49,43420088.0,0,0
40,46,41896052.0,0,0
42,48,35933132.0,0,0
46,51,32911484.0,0,0
52,8,30491336.0,0,0
45,50,28653018.0,0,0


In [17]:
df_salesandstoresdatatotalgroupedby = df_salesandstoresdatatotal.groupby(['missingdata','missingdatacategory']).agg({'unit_sales':'sum'}).reset_index()
df_salesandstoresdatatotalgroupedby['Percentage'] = df_salesandstoresdatatotalgroupedby['unit_sales']/df_salesandstoresdatatotalgroupedby['unit_sales'].sum()*100

df_salesandstoresdatatotalgroupedby

Unnamed: 0,missingdata,missingdatacategory,unit_sales,Percentage
0,0,0,873018688.0,81.316162
1,1,missing <9 days,48567668.0,4.52377
2,1,new_store,74159080.0,6.907449
3,1,old_store missing >9 days,77864856.0,7.252618


In [18]:


# Filter rows for July 2017
df_salesandstoresdata_july_2017 = df_salesandstoresdata[(df_salesandstoresdata['date'].dt.year == 2017) & (df_salesandstoresdata['date'].dt.month == 7)]

# Print the filtered DataFrame
print(df_salesandstoresdata_july_2017)

      store_nbr       date    unit_sales   city      state type cluster
1632          1 2017-07-01  11801.933594  Quito  Pichincha    D      13
1633          1 2017-07-02   5308.296875  Quito  Pichincha    D      13
1634          1 2017-07-03  12201.218750  Quito  Pichincha    D      13
1635          1 2017-07-04  10951.704102  Quito  Pichincha    D      13
1636          1 2017-07-05  14023.387695  Quito  Pichincha    D      13
...         ...        ...           ...    ...        ...  ...     ...
83586         9 2017-07-27  13218.336914  Quito  Pichincha    B       6
83587         9 2017-07-28  14229.087891  Quito  Pichincha    B       6
83588         9 2017-07-29  20919.781250  Quito  Pichincha    B       6
83589         9 2017-07-30  22259.890625  Quito  Pichincha    B       6
83590         9 2017-07-31  20525.878906  Quito  Pichincha    B       6

[1674 rows x 7 columns]


In [30]:
df_salesandstoresdata_july_2017_total = df_salesandstoresdata_july_2017.groupby(['store_nbr']).agg({'unit_sales':'sum'}).reset_index()
df_salesandstoresdata_july_2017_total = df_salesandstoresdata_july_2017_total.merge(df_salesandstoresdata_alldatesnullfinal, on='store_nbr', how='left')
df_salesandstoresdata_july_2017_total = df_salesandstoresdata_july_2017_total.drop(columns=['unit_sales_y'])
df_salesandstoresdata_july_2017_total = df_salesandstoresdata_july_2017_total.rename(columns={'unit_sales_x':'unit_sales'})
df_salesandstoresdata_july_2017_total['missingdata'] = df_salesandstoresdata_july_2017_total['missingdata'].fillna('0')
df_salesandstoresdata_july_2017_total['missingdatacategory'] = df_salesandstoresdata_july_2017_total['missingdatacategory'].fillna('0')
df_salesandstoresdata_july_2017_total = df_salesandstoresdata_july_2017_total.sort_values(by='unit_sales', ascending=False)

In [31]:
df_salesandstoresdata_july_2017_totalgroupedby = df_salesandstoresdata_july_2017_total.groupby(['missingdata','missingdatacategory']).agg({'unit_sales':'sum'}).reset_index()
df_salesandstoresdata_july_2017_totalgroupedby['Percentage'] = df_salesandstoresdata_july_2017_totalgroupedby['unit_sales']/df_salesandstoresdata_july_2017_totalgroupedby['unit_sales'].sum()*100

df_salesandstoresdata_july_2017_totalgroupedby

Unnamed: 0,missingdata,missingdatacategory,unit_sales,Percentage
0,0,0,20449800.0,75.707825
1,1,missing <9 days,1273863.75,4.716009
2,1,new_store,3443426.25,12.748012
3,1,old_store missing >9 days,1844385.0,6.828153


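Thus, the timeframe you select strongly affects how much weight the stores with missing data carry: new stores account for about 6.9% of unit sales over the whole period but about 12.7% in July 2017, while stores without missing data drop from about 81.3% to about 75.7%.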
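
The two percentage tables can also be placed side by side. The sketch below uses made-up totals (the categories, timeframes and numbers are hypothetical): it computes each category's share within its own timeframe and then pivots to one column per timeframe.

```python
import pandas as pd

# Hypothetical totals per category for two timeframes
totals = pd.DataFrame({
    "category": ["complete", "new_store", "complete", "new_store"],
    "timeframe": ["all", "all", "july_2017", "july_2017"],
    "unit_sales": [900.0, 100.0, 80.0, 20.0],
})

# Share of each category within its own timeframe, in percent
timeframe_total = totals.groupby("timeframe")["unit_sales"].transform("sum")
totals["percentage"] = totals["unit_sales"] / timeframe_total * 100

# One row per category, one column per timeframe
comparison = totals.pivot(index="category", columns="timeframe", values="percentage")
print(comparison)
```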

In [40]:
df_salesandstoresdata_july_2017_totalcopy = df_salesandstoresdata_july_2017_total.copy()

df_salesandstoresdata_july_2017_total_type = df_salesandstoresdata_july_2017_totalcopy.merge(df_stores, on='store_nbr', how='left')
df_salesandstoresdata_july_2017_total_type = df_salesandstoresdata_july_2017_total_type.groupby(['type']).agg({'unit_sales':'sum'}).reset_index()
df_salesandstoresdata_july_2017_total_type = df_salesandstoresdata_july_2017_total_type.sort_values(by='unit_sales', ascending=False)
df_salesandstoresdata_july_2017_total_type['Percentage'] = df_salesandstoresdata_july_2017_total_type['unit_sales']/df_salesandstoresdata_july_2017_total_type['unit_sales'].sum()*100
df_salesandstoresdata_july_2017_total_type['CumulativePercentage'] = df_salesandstoresdata_july_2017_total_type['Percentage'].cumsum()

df_salesandstoresdata_july_2017_total_type

Unnamed: 0,type,unit_sales,Percentage,CumulativePercentage
0,A,9099141.0,33.686207,33.686207
3,D,8375932.0,31.008793,64.695
2,C,4001083.5,14.812533,79.50753
1,B,3800042.25,14.068252,93.575783
4,E,1735275.375,6.424215,100.0


In [45]:
df_salesandstoresdata_july_2017_total2 = df_salesandstoresdata_july_2017_totalcopy.merge(df_stores, on='store_nbr', how='left')


df_salesandstoresdata_july_2017_total2['Total unit sales'] = df_salesandstoresdata_july_2017_total2['unit_sales'].sum()
df_salesandstoresdata_july_2017_total2['Percentage'] = df_salesandstoresdata_july_2017_total2['unit_sales']/df_salesandstoresdata_july_2017_total2['Total unit sales']*100

df_salesandstoresdata_july_2017_total2

Unnamed: 0,store_nbr,unit_sales,missingdatacategory,missingdata,city,state,type,cluster,Total unit sales,Percentage
0,44,1453714.0,0,0,Quito,Pichincha,A,5,27011474.0,5.381841
1,45,1329123.0,0,0,Quito,Pichincha,A,11,27011474.0,4.920586
2,47,1223399.0,0,0,Quito,Pichincha,A,14,27011474.0,4.529182
3,3,1159554.0,0,0,Quito,Pichincha,D,8,27011474.0,4.292822
4,49,1093551.0,0,0,Quito,Pichincha,A,11,27011474.0,4.048469
5,46,991521.1,0,0,Quito,Pichincha,A,14,27011474.0,3.67074
6,48,832164.0,0,0,Quito,Pichincha,A,14,27011474.0,3.08078
7,51,758040.0,0,0,Guayaquil,Guayas,A,17,27011474.0,2.806363
8,52,719688.9,new_store,1,Manta,Manabi,A,11,27011474.0,2.664382
9,50,697939.9,0,0,Ambato,Tungurahua,A,14,27011474.0,2.583864


In [50]:
# df_salesandstoresdata_july_2017_total2 only with missingdatacategory 0 or missing <9 days
df_salesandstoresdata_july_2017_total3 = df_salesandstoresdata_july_2017_total2[(df_salesandstoresdata_july_2017_total2['missingdatacategory'] == '0') | (df_salesandstoresdata_july_2017_total2['missingdatacategory'] == 'missing <9 days')]

df_salesandstoresdata_july_2017_total3 = df_salesandstoresdata_july_2017_total3.groupby(['type']).agg({'Percentage':'sum'}).reset_index()
df_salesandstoresdata_july_2017_total3 = df_salesandstoresdata_july_2017_total3.sort_values(by='Percentage', ascending=False)
df_salesandstoresdata_july_2017_total3

Unnamed: 0,type,Percentage
0,A,31.021826
3,D,25.091228
2,C,11.360919
1,B,9.325864
4,E,3.623997
