## Store and daily sales analysis: continuing our analysis from EDA_Stores and EDA_Storeswithotherdata

Based on a discussion with other experts on the 7th of June 2024 (aka "the practitioner"), it was suggested that we analyse the stores and sales combination not only on an aggregated level but also on a daily level. This notebook looks further into that.

The steps to be conducted
1- Import libraries and files
2- Cleaning/restructuring datasets
3- Filter out stores that don't have all datapoints  
4- Determine the impact of stores that are missing data

Goal for notebook -> Find out which stores we don't want to select for our Proof of Concept.

## Step 1.0 & 2.0- Importing libraries and files for analysis on stores and sales. Thereafter, cleaning the data for further analysis.

In [1]:
import pandas as pd
import altair as alt
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import vegafusion as vf

#pip install "vegafusion[embed]>=1.5.0" (not in requirements.txt)

# Reading the files for salesdata and stores data into my notebook
file_path_df_0 = r'C:\Users\sebas\OneDrive\Documenten\GitHub\Supermarketcasegroupproject\Group4B\data\interim\df_0.parquet'
file_path_stores = r'C:\Users\sebas\OneDrive\Documenten\GitHub\Supermarketcasegroupproject\Group4B\data\raw\stores.parquet'

df_salesdata = pd.read_parquet(file_path_df_0)
df_stores = pd.read_parquet(file_path_stores)

In [None]:
print(df_stores.info())
print(df_salesdata.info())

We find different datatypes within the dataframes. To make the join successful, the join fields need to have the same datatype. Following data management logic, we want our dimension fields (non-calculation fields) in string/date format; we only use numerical datatypes for fields we want to calculate with.
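A minimal sketch of the datatype alignment described above, on toy data. The frames `sales` and `stores` and all their values are made up for illustration; only the column names `store_nbr`, `cluster` and `unit_sales` come from this notebook.

```python
import pandas as pd

# Toy data (hypothetical values): the same key stored with two different dtypes
sales = pd.DataFrame({"store_nbr": [1, 2, 2], "unit_sales": [10.0, 5.0, 7.5]})
stores = pd.DataFrame({"store_nbr": ["1", "2"], "cluster": [13, 8]})

# Cast the join key and the other dimension field to string
sales["store_nbr"] = sales["store_nbr"].astype(str)
stores["cluster"] = stores["cluster"].astype(str)

# With matching key dtypes the join behaves as expected
merged = sales.merge(stores, on="store_nbr", how="inner")
print(merged)
```

Casting before the merge keeps the key comparison explicit, instead of relying on how pandas handles mixed key types.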

In [None]:
# Adjust data types and drop columns we don't need
df_salesdata['store_nbr'] = df_salesdata['store_nbr'].astype(str)
df_salesdata = df_salesdata.drop(columns=['year', 'day','onpromotion','month'])
df_stores['store_nbr'] = df_stores['store_nbr'].astype(str)
df_stores['cluster'] = df_stores['cluster'].astype(str)

print(df_stores.info())
print(df_salesdata.info())

In [None]:
# Group the sales data by store and date
df_salesdatagrouped = df_salesdata.groupby(['store_nbr','date']).agg({'unit_sales':'sum'}).reset_index()

print(f'df_salesdatagrouped now contains {df_salesdatagrouped.shape[0]} rows and {df_salesdatagrouped.shape[1]} columns')
print(df_salesdatagrouped.info())

In [None]:
df_salesandstoresdata = df_salesdatagrouped.merge(df_stores, on='store_nbr', how='inner')

print(f'df_salesandstoresdata now contains {df_salesandstoresdata.shape[0]} rows and {df_salesandstoresdata.shape[1]} columns')
print(df_salesandstoresdata.info())

## Step 3.0- Filter out all stores that don't have all the datapoints, or at least mark them

In [None]:
# Count the number of rows (one per date) per store
se_storedatecount = df_salesandstoresdata['store_nbr'].value_counts()

print(f"The daterange of the salesdata starts at {df_salesandstoresdata['date'].min()}")
print(f"The daterange of the salesdata ends at {df_salesandstoresdata['date'].max()}")
print(f"The daterange of the salesdata is {df_salesandstoresdata['date'].max() - df_salesandstoresdata['date'].min()}")
print(se_storedatecount)

In [None]:
# Create a date range from the start date to the end date of the sales data
start_date = pd.to_datetime('2013-01-02')
end_date = pd.to_datetime('2017-08-15')

# Create a date range variable from the start date to the end date of the sales data
date_range = pd.date_range(start=start_date, end=end_date, freq='D')

# Create a dataframe from the date range
date_range = pd.DataFrame(date_range, columns=['date'])

print(f'The date_range dataframe starts at {date_range["date"].min()} and ends at {date_range["date"].max()}')

In [None]:
df_salesandstoresdata34 = df_salesandstoresdata[df_salesandstoresdata['store_nbr'] == '34']

df_salesandstoresdata34missingdates = df_salesandstoresdata34.merge(date_range, left_on='date', right_on='date', how='outer')

empty_unit_sales = df_salesandstoresdata34missingdates[df_salesandstoresdata34missingdates['unit_sales'].isnull()]
print(empty_unit_sales)
print("As we can see, stores that have all data seem to be closed on Christmas Day and on New Year's Day")

We want to do multiple things now with the store and salesdata so far, namely:  
1- We want to find out which days are missing per store (are they just new or missing data in between?)  
2- We want to mark the stores that are missing something  
3- We might want to draw a line for how much missing data we find acceptable.  
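The first point can be sketched on toy data as follows: build the grid of every expected (store, date) combination and keep the combinations that have no observation. The store names, dates and sales values are hypothetical; `indicator=True` is the standard pandas merge option that adds a `_merge` column.

```python
import pandas as pd

# Toy data (hypothetical): store "A" has every day, store "B" misses 2013-01-03
observed = pd.DataFrame({
    "store_nbr": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(["2013-01-02", "2013-01-03", "2013-01-04",
                            "2013-01-02", "2013-01-04"]),
    "unit_sales": [10.0, 12.0, 9.0, 4.0, 6.0],
})

# Every (store, date) combination we would expect to see
all_dates = pd.DataFrame({"date": pd.date_range("2013-01-02", "2013-01-04", freq="D")})
all_stores = observed[["store_nbr"]].drop_duplicates()
expected = all_stores.merge(all_dates, how="cross")

# Left join the observations onto the grid; rows without a match are missing days
grid = expected.merge(observed, on=["store_nbr", "date"], how="left", indicator=True)
missing = grid.loc[grid["_merge"] == "left_only", ["store_nbr", "date"]]
print(missing)
```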

In [None]:
# Let's try to make a dataframe that consists of all stores that are missing data for a certain date
# Step 1 - Crossjoin stores with the daterange
df_storesreduced = df_stores.drop(columns=['city', 'state', 'type', 'cluster'])
df_storesanddates = df_storesreduced.merge(date_range, how='cross')

print(f'We now constructed a dataframe with all stores and all dates, it contains {df_storesanddates.shape[0]} rows')
print(df_storesanddates.head(5))

# Step 2 - Merge the salesdata with the storesanddates dataframe to have a dataframe consisting of all stores and all dates with unit_sales

df_salesandstoresdata_alldates = df_salesandstoresdata.merge(df_storesanddates, on = ['store_nbr','date'], how='outer')

print(f'We now constructed a dataframe with all stores, all dates and the unit_sales where available, it contains {df_salesandstoresdata_alldates.shape[0]} rows')
print(df_salesandstoresdata_alldates.head(5))

We now made a dataframe with all dates from the date_range dataframe (covering every date in the total date range). We can also see that the combined dataframe has 1 extra row compared to the stores-and-dates dataframe (a bit odd); let's find out why.

In [None]:
# Merge the two dataframes and keep only the records that are in the first dataframe but not in the second dataframe
Difference_df_salesandstoresdata_alldates_df_storesanddates = df_salesandstoresdata_alldates.merge(df_storesanddates, on = ['store_nbr','date'], how='outer', indicator=True).loc[lambda x : x['_merge']=='left_only']
Difference_df_salesandstoresdata_alldates_df_storesanddates

Thus, we found 1 store that is open on New Year's Day, thereby breaking the pattern of most stores.

In [None]:
# Now, let's see how this works out for store number 30 (just a random one that is missing some dates according to our earlier analysis)
df_salesandstoresdata_alldates30 = df_salesandstoresdata_alldates[df_salesandstoresdata_alldates['store_nbr']=='30']
df_salesandstoresdata_alldates30 = df_salesandstoresdata_alldates30[df_salesandstoresdata_alldates30['unit_sales'].isnull()]
df_salesandstoresdata_alldates30.head(5)

As we can see, the df_salesandstoresdata_alldates dataframe contains all dates per store, which lets us explore which dates are missing per store (based on having NaN for unit sales), as shown here for store 30.
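As a hedged illustration of how such NaN rows could be summarised per store, the sketch below computes the first and last missing date and the number of missing days. The frame `toy` and its values are hypothetical stand-ins for the all-dates dataframe.

```python
import pandas as pd

nan = float("nan")

# Toy stand-in for the all-dates dataframe: NaN in unit_sales marks a missing day
toy = pd.DataFrame({
    "store_nbr": ["A"] * 4 + ["B"] * 4,
    "date": list(pd.date_range("2013-01-02", periods=4, freq="D")) * 2,
    "unit_sales": [nan, nan, 5.0, 6.0, 3.0, nan, 4.0, 2.0],
})

# Keep only the missing days and summarise them per store
missing_summary = (
    toy[toy["unit_sales"].isna()]
    .groupby("store_nbr")["date"]
    .agg(first_missing="min", last_missing="max", days_missing="count")
    .reset_index()
)
print(missing_summary)
```

A store whose missing days all sit at the start of the range (like "A" here) looks like a late opening, while a store with missing days in the middle (like "B") points to a gap.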

In [None]:
# Only the stores that have a value count of less than 1679 in se_storedatecount
se_storedatecountmissing = se_storedatecount[se_storedatecount < 1679]

# Now, let's take df_salesandstoresdata_alldates but only for the stores where we are missing some of the data (well, at least we miss sales on those dates; they are not in the original data)
df_salesandstoresdata_alldatesnull = df_salesandstoresdata_alldates[df_salesandstoresdata_alldates['store_nbr'].isin(se_storedatecountmissing.index)]

# From the stores with missing data, we only want the records where the unit_sales is missing
df_salesandstoresdata_alldatesnull = df_salesandstoresdata_alldatesnull[df_salesandstoresdata_alldatesnull['unit_sales'].isnull()]

df_salesandstoresdata_alldatesnull = df_salesandstoresdata_alldatesnull[['date', 'store_nbr','unit_sales']]

# Add a unit_sales of 1 to the dataframe to make it easier to plot, it's just a dummy value
df_salesandstoresdata_alldatesnull['unit_sales'] = 1

# Merge the dataframe with the date_range dataframe to have all dates in the dataframe
df_salesandstoresdata_alldatesnull = df_storesanddates.merge(df_salesandstoresdata_alldatesnull, on=['store_nbr','date'] ,how='left')

# Now we have a dataframe with all stores and all dates, but only for the stores that are missing some data
df_salesandstoresdata_alldatesnull = df_salesandstoresdata_alldatesnull[df_salesandstoresdata_alldatesnull['store_nbr'].isin(se_storedatecountmissing.index)]

print(f"Stores {df_salesandstoresdata_alldatesnull['store_nbr'].unique()} are in the dataset with stores with <1679 datapoints and all dates, having imputed a value of 1 for all dates missing in the range")

Let's now make a graph to see where in the timeline data is missing, for each store that actually misses some data.

In [None]:
alt.data_transformers.enable("vegafusion")

df_salesandstoresdata_alldatesnull_chart = alt.Chart(df_salesandstoresdata_alldatesnull , title='In color, dates with no unit sales per store').mark_circle(size=8).encode(
    y=alt.Y("store_nbr:N",title='Store number'),
    x=alt.X("date:T", title='Date'),
    yOffset="unit_sales:Q",
    color=alt.Color('store_nbr:N').legend(None)
)
df_salesandstoresdata_alldatesnull_chart = df_salesandstoresdata_alldatesnull_chart.properties(
    width=1000,  # Set the width
    height=500  # Set the height
)

df_salesandstoresdata_alldatesnull_chart

In [None]:
df_salesandstoresdata_alldatesnull36 = df_salesandstoresdata_alldatesnull[df_salesandstoresdata_alldatesnull['store_nbr'] == '36']
df_salesandstoresdata_alldatesnull36

We can conclude the following from this initial analysis:  
1- Some stores are relatively new: they miss data for a longer period, after which they don't miss any data (20,21,22,29,36,42,52,53)  
2- Other stores miss data in between (they were most likely closed for some time)  

We might want to distinguish between stores that are relatively new and stores that miss data. 
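One possible way to draw this distinction is sketched below on toy data. Note that it is an alternative to the dummy-value check used in the next cell, not the same method: it compares each store's first observed date with the start of the range and counts the days missing after that first date. All names and values are hypothetical, and in real data regular closures (such as Christmas Day) would also show up as gap days.

```python
import pandas as pd

# Toy observations (hypothetical): "A" opens late, "B" has a gap, "C" is complete
dates = pd.date_range("2013-01-02", periods=5, freq="D")
observed = pd.DataFrame({
    "store_nbr": ["A", "A", "A", "B", "B", "B", "B"] + ["C"] * 5,
    "date": [dates[2], dates[3], dates[4],
             dates[0], dates[1], dates[3], dates[4]] + list(dates),
})

per_store = observed.groupby("store_nbr")["date"].agg(first_seen="min", days_seen="count")

# Days we expect between a store's first observation and the end of the range
per_store["days_expected"] = (dates.max() - per_store["first_seen"]).dt.days + 1
# A store is "new" if its first observation is later than the start of the range
per_store["opened_late"] = per_store["first_seen"] > dates.min()
# Days missing after the store's first observation
per_store["gap_days"] = per_store["days_expected"] - per_store["days_seen"]
print(per_store)
```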

In [None]:
# Stores with fewer than 1670 dates of data (these miss more than 9 days compared to the 1679 dates used as the threshold above)
se_storedatecountmissingsome = se_storedatecount[se_storedatecount < 1670]

df_salesandstoresdata_alldatesnull1 = df_salesandstoresdata_alldatesnull.copy()

# Identify stores that are new based on having a dummy value on 2013-01-02
new_store_nbrs = df_salesandstoresdata_alldatesnull[
    (df_salesandstoresdata_alldatesnull['date'] == '2013-01-02') & 
    (df_salesandstoresdata_alldatesnull['unit_sales'] == 1)
]['store_nbr'].unique()

# Make a new column missingdatacategory where stores that have a dummy unit for 2013-01-02 are marked as a new store, the rest for now is seen as an old store. This is still the whole dataset 
# We get the right storenumbers based on the isin part of the expression
df_salesandstoresdata_alldatesnull1['missingdatacategory'] = np.where(df_salesandstoresdata_alldatesnull1['store_nbr'].isin(new_store_nbrs),
                                                                     'new_store', 
                                                                     'old_store'
                                                                     )

# Step 2 - For all stores that have < 1670 days of data, label the stores that are not new as "old_store missing >9 days"
# .copy() makes sure we assign to a new dataframe and not to a slice of df_salesandstoresdata_alldatesnull1
df_salesandstoresdata_alldatesnull2 = df_salesandstoresdata_alldatesnull1[df_salesandstoresdata_alldatesnull1['store_nbr'].isin(se_storedatecountmissingsome.index)].copy()

df_salesandstoresdata_alldatesnull2['missingdatacategory'] = np.where((df_salesandstoresdata_alldatesnull2['missingdatacategory'] == 'new_store'),
                                                                    'new_store',
                                                                    'old_store missing >9 days'
                                                                    )

# Step 3 - All remaining stores miss at most 9 days of data, we just label them "missing <9 days"
df_salesandstoresdata_alldatesnull3 = df_salesandstoresdata_alldatesnull1[~df_salesandstoresdata_alldatesnull1['store_nbr'].isin(se_storedatecountmissingsome.index)].copy()
df_salesandstoresdata_alldatesnull3['missingdatacategory'] = 'missing <9 days'

# Put the dataframes of step 2 and 3 together to get all rows back together as in the original dataframes
df_salesandstoresdata_alldatesnullfinal = pd.concat([df_salesandstoresdata_alldatesnull2, df_salesandstoresdata_alldatesnull3])

print(df_salesandstoresdata_alldatesnull.shape)
print(df_salesandstoresdata_alldatesnull1.shape)
print(df_salesandstoresdata_alldatesnull2.shape)
print(df_salesandstoresdata_alldatesnull3.shape)

# Make a dataframe that groups the data by store and missingdatacategory
df_salesandstoresdata_alldatesnullfinal = df_salesandstoresdata_alldatesnullfinal.groupby(['store_nbr','missingdatacategory']).agg({'unit_sales':'count'}).reset_index()

# Add a dummy value to the dataframe to make it easier to plot or to join with other dataframes
df_salesandstoresdata_alldatesnullfinal['missingdata'] = '1'
df_salesandstoresdata_alldatesnullfinal

We have now labeled the stores that are missing data based on how much data they miss and whether they're new or not (yay!)

## Step 4.0- Determine the impact of stores that are missing data

1- Find out the total unit sales in July 2017 (we want a point in time where we have data for all stores, so we can compare) and how it differs from the total sales over the whole time period.  
2- Find out what the impact of store types is, and which store types the stores with missing data belong to.  


In [None]:
# Calculate the total unit sales per store from the salesandstoresdata dataframe (this original dataframe grouped the data by store and date with the original dates)
df_salesandstoresdatatotal = df_salesandstoresdata.groupby(['store_nbr']).agg({'unit_sales':'sum'}).reset_index()

# Take the df_salesandstoresdata_alldatesnullfinal dataframe and merge it with the df_salesandstoresdatatotal dataframe to get the total unit sales per store and the marking if dates are missing per store including the categories why something is missing.
df_salesandstoresdatatotal = df_salesandstoresdatatotal.merge(df_salesandstoresdata_alldatesnullfinal, on='store_nbr', how='left')

# Drop the unit_sales_y column and rename the unit_sales_x column to unit_sales (just cleaning things from the last merge)
df_salesandstoresdatatotal = df_salesandstoresdatatotal.drop(columns=['unit_sales_y'])
df_salesandstoresdatatotal = df_salesandstoresdatatotal.rename(columns={'unit_sales_x':'unit_sales'})

# If a store isn't missing data, give the missingdata column a value of 0, do the same for the missingdatacategory column
df_salesandstoresdatatotal['missingdata'] = df_salesandstoresdatatotal['missingdata'].fillna('0')
df_salesandstoresdatatotal['missingdatacategory'] = df_salesandstoresdatatotal['missingdatacategory'].fillna('0')

# Sort the dataframe by unit_sales (we want to have the highest sales first)
df_salesandstoresdatatotal = df_salesandstoresdatatotal.sort_values(by='unit_sales', ascending=False)

df_salesandstoresdatatotal

For now, we have an overview of all stores and their total sales over the whole time period (from 2013 up until mid-August 2017). Let's see what the relative portion of sales is for the stores that are actually missing data. When initially selecting stores for our proof of concept forecasting, we prefer stores that have all datapoints over stores that miss data. However, we have to consider their relative share of the sales pie.

In [None]:
# Group the data by missingdata and missingdatacategory and calculate each group's percentage of the total unit sales
df_salesandstoresdatatotalgroupedby = df_salesandstoresdatatotal.groupby(['missingdata','missingdatacategory']).agg({'unit_sales':'sum', 'store_nbr':'count'}).reset_index()
df_salesandstoresdatatotalgroupedby = df_salesandstoresdatatotalgroupedby.rename(columns={'store_nbr':'store_count'})
df_salesandstoresdatatotalgroupedby['Percentage'] = df_salesandstoresdatatotalgroupedby['unit_sales']/df_salesandstoresdatatotalgroupedby['unit_sales'].sum()*100

df_salesandstoresdatatotalgroupedby

From the perspective of the whole unit sales history, we can see that the stores that miss data account for approximately 18.7% of total sales. However, it doesn't seem fair to look at it this way. For example, one new store started in 2017 and therefore didn't make a lot of sales over the whole time period. To create a level playing field we will now look at the sales for just the month of July 2017 (this in itself might also carry some risks, but we assume these to have less impact).
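The effect of the chosen timeframe can be illustrated with a small toy example. The stores "O" and "N", the sales values and the helper `share_by_category` are hypothetical; only the column names follow this notebook.

```python
import pandas as pd

# Toy data (hypothetical): store "N" is new and only sells in the last month
sales = pd.DataFrame({
    "store_nbr": ["O", "O", "O", "N"],
    "date": pd.to_datetime(["2017-05-15", "2017-06-15", "2017-07-15", "2017-07-15"]),
    "unit_sales": [100.0, 100.0, 100.0, 100.0],
})
labels = pd.DataFrame({"store_nbr": ["O", "N"],
                       "missingdatacategory": ["0", "new_store"]})


def share_by_category(df: pd.DataFrame) -> pd.Series:
    """Percentage of total unit sales per missing-data category (illustrative helper)."""
    labeled = df.merge(labels, on="store_nbr", how="left")
    totals = labeled.groupby("missingdatacategory")["unit_sales"].sum()
    return totals / totals.sum() * 100


whole_period = share_by_category(sales)
is_july_2017 = (sales["date"].dt.year == 2017) & (sales["date"].dt.month == 7)
july_only = share_by_category(sales[is_july_2017])

print(whole_period["new_store"], july_only["new_store"])
```

In this toy case the new store's share doubles when only the last month is considered, which is the same kind of shift we are checking for in the real data.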

In [None]:
# Take the df_salesandstoresdata dataframe again and filter it for the year 2017 and the month july

# Filter rows for July 2017
df_salesandstoresdata_july_2017 = df_salesandstoresdata[(df_salesandstoresdata['date'].dt.year == 2017) & (df_salesandstoresdata['date'].dt.month == 7)]

# Print the filtered DataFrame
print(df_salesandstoresdata_july_2017)

In [None]:
# Group the data by store and calculate the total unit sales per store (we repeat the same steps as we did for the whole time period)
df_salesandstoresdata_july_2017_total = df_salesandstoresdata_july_2017.groupby(['store_nbr']).agg({'unit_sales':'sum'}).reset_index()

# Take the df_salesandstoresdata_alldatesnullfinal dataframe and merge it with the df_salesandstoresdata_july_2017_total dataframe to get the total unit sales per store for July 2017 and the marking if dates are missing per store including the categories why something is missing.
df_salesandstoresdata_july_2017_total = df_salesandstoresdata_july_2017_total.merge(df_salesandstoresdata_alldatesnullfinal, on='store_nbr', how='left')

# Drop the unit_sales_y column and rename the unit_sales_x column to unit_sales (just cleaning things from the last merge)
df_salesandstoresdata_july_2017_total = df_salesandstoresdata_july_2017_total.drop(columns=['unit_sales_y'])
df_salesandstoresdata_july_2017_total = df_salesandstoresdata_july_2017_total.rename(columns={'unit_sales_x':'unit_sales'})

# If a store isn't missing data, give the missingdata column a value of 0, do the same for the missingdatacategory column
df_salesandstoresdata_july_2017_total['missingdata'] = df_salesandstoresdata_july_2017_total['missingdata'].fillna('0')
df_salesandstoresdata_july_2017_total['missingdatacategory'] = df_salesandstoresdata_july_2017_total['missingdatacategory'].fillna('0')

# Sort the dataframe by unit_sales (we want to have the highest sales first)
df_salesandstoresdata_july_2017_total = df_salesandstoresdata_july_2017_total.sort_values(by='unit_sales', ascending=False)

df_salesandstoresdata_july_2017_total

Now we have a dataframe with all stores and their missing data labels for July 2017. Let's again look at the relative percentages per missingdata category.

In [None]:
# Group the data by missingdata and missingdatacategory and calculate each group's percentage of the total unit sales
df_salesandstoresdata_july_2017_totalgroupedby = df_salesandstoresdata_july_2017_total.groupby(['missingdata','missingdatacategory']).agg({'unit_sales':'sum', 'store_nbr':'count'}).reset_index()
df_salesandstoresdata_july_2017_totalgroupedby = df_salesandstoresdata_july_2017_totalgroupedby.rename(columns={'store_nbr':'store_count'})
df_salesandstoresdata_july_2017_totalgroupedby['Percentage'] = df_salesandstoresdata_july_2017_totalgroupedby['unit_sales']/df_salesandstoresdata_july_2017_totalgroupedby['unit_sales'].sum()*100

# Format the table, just to make it look nice, and use the gradient to highlight the categories with the largest sales, share and store count
df_salesandstoresdata_july_2017_totalgroupedbysytyle = df_salesandstoresdata_july_2017_totalgroupedby.style.background_gradient(subset=['unit_sales', 'Percentage','store_count'], cmap='Blues')\
                                                                       .format({"unit_sales": "{:20,.0f}",
                                                                                "Percentage": "{:20,.1f}",
                                                                                "store_count" : "{:20,.0f}"})

df_salesandstoresdata_july_2017_totalgroupedbysytyle

Thus, the timeframe you select has a significant influence on whether stores with missing data have an impact and how big that impact is. We went from a share of 18.7% to a share of 24.3% just by changing the timeframe! Since we might have other reasons to drop data for our proof of concept, we have to be very careful about simply dropping the stores behind this 24.3% from our dataset when developing the proof of concept.

In [None]:
df_salesandstoresdata_july_2017_totalcopy = df_salesandstoresdata_july_2017_total.copy()

# Let's investigate the share per store type for july 2017
df_salesandstoresdata_july_2017_total_type = df_salesandstoresdata_july_2017_totalcopy.merge(df_stores, on='store_nbr', how='left')
df_salesandstoresdata_july_2017_total_type = df_salesandstoresdata_july_2017_total_type.groupby(['type']).agg({'unit_sales':'sum'}).reset_index()
df_salesandstoresdata_july_2017_total_type = df_salesandstoresdata_july_2017_total_type.sort_values(by='unit_sales', ascending=False)
df_salesandstoresdata_july_2017_total_type['Percentage'] = df_salesandstoresdata_july_2017_total_type['unit_sales']/df_salesandstoresdata_july_2017_total_type['unit_sales'].sum()*100
df_salesandstoresdata_july_2017_total_type['CumulativePercentage'] = df_salesandstoresdata_july_2017_total_type['Percentage'].cumsum()

df_salesandstoresdata_july_2017_total_type

In [None]:
# Add store information (like type, cluster, state and city) to the dataframe with the total unit sales per store for July 2017
df_salesandstoresdata_july_2017_total2 = df_salesandstoresdata_july_2017_totalcopy.merge(df_stores, on='store_nbr', how='left')

# Add the total unit sales for all stores to each row of the dataframe (this makes it easier to calculate the percentage of the total unit sales per store)
df_salesandstoresdata_july_2017_total2['Total unit sales'] = df_salesandstoresdata_july_2017_total2['unit_sales'].sum()

# Calculate the percentage of the total unit sales per store
df_salesandstoresdata_july_2017_total2['Percentage'] = df_salesandstoresdata_july_2017_total2['unit_sales']/df_salesandstoresdata_july_2017_total2['Total unit sales']*100

df_salesandstoresdata_july_2017_total2

Look at the percentage of sales per city

In [None]:
# Group the data by city and calculate the total unit sales per city
df_salesandstoresdata_july_2017_totalgroupedcity = df_salesandstoresdata_july_2017_total2.groupby(['city']).agg({'unit_sales':'sum', 'Percentage':'sum'}).reset_index()
df_salesandstoresdata_july_2017_totalgroupedcity = df_salesandstoresdata_july_2017_totalgroupedcity.sort_values(by='unit_sales', ascending=False)
df_salesandstoresdata_july_2017_totalgroupedcity['Cumulative Percentage'] = df_salesandstoresdata_july_2017_totalgroupedcity['Percentage'].cumsum()
df_salesandstoresdata_july_2017_totalgroupedcity


# Format the table, just to make it look nice and use the gradient to show the effect of having a lot of stores in certain cities
df_salesandstoresdata_july_2017_totalgroupedcitytyle = df_salesandstoresdata_july_2017_totalgroupedcity.style.background_gradient(subset=['unit_sales', 'Percentage'], cmap='Blues')\
                                                                       .format({"unit_sales": "{:20,.0f}",
                                                                                "Percentage": "{:20,.1f}",
                                                                                "Cumulative Percentage" : "{:20,.1f}"})

df_salesandstoresdata_july_2017_totalgroupedcitytyle

So now we have all the stores with their sales for July 2017, a lot of store information, whether they miss data and how much, and their relative share of total sales in July 2017.

STEP 1
From our 3 categories of missing data (new store, missing >9 days or <9 days), we would like to first exclude the categories that miss the most data. These categories are new_store and old_store missing >9 days. 

STEP 2
Besides that, we know that cluster 10 has a data issue: every other cluster belongs to exactly 1 type of store, whereas cluster 10 contains stores of types B, D and E. Therefore, in terms of data quality, we would like to exclude this cluster as well.

It is interesting to see what the effect is of combining these two points (excluding the 2 missingdata categories and cluster 10)
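A minimal sketch of combining the two exclusions in a single boolean mask, on toy data. The four stores and their percentages are hypothetical; the category labels and the cluster value are the ones used in this notebook.

```python
import pandas as pd

# Toy per-store table (hypothetical values), with Percentage already computed
stores = pd.DataFrame({
    "store_nbr": ["1", "2", "3", "4"],
    "cluster": ["13", "10", "8", "8"],
    "missingdatacategory": ["0", "0", "new_store", "missing <9 days"],
    "Percentage": [40.0, 30.0, 20.0, 10.0],
})

# STEP 1: keep only the categories with little or no missing data
keep_categories = ["0", "missing <9 days"]
# STEP 2: exclude cluster 10 (stored as a string, like in this notebook)
mask = stores["missingdatacategory"].isin(keep_categories) & (stores["cluster"] != "10")

selected = stores[mask]
print(f"Remaining share of sales: {selected['Percentage'].sum():.1f}%")
```

Because the percentages are computed before filtering, the remaining sum directly shows which share of total sales survives both exclusions.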

In [None]:
df_salesandstoresdata_july_2017_total3 = df_salesandstoresdata_july_2017_total2.copy()

# STEP 1 - df_salesandstoresdata_july_2017_total2 only with missingdatacategory 0 or missing <9 days
df_salesandstoresdata_july_2017_total4 = df_salesandstoresdata_july_2017_total3[(df_salesandstoresdata_july_2017_total3['missingdatacategory'] == '0') | (df_salesandstoresdata_july_2017_total3['missingdatacategory'] == 'missing <9 days')]

# STEP 2 - Drop the rows where the cluster is 10 (excluded for data quality reasons, this cluster mixes store types)
df_salesandstoresdata_july_2017_total4 = df_salesandstoresdata_july_2017_total4[df_salesandstoresdata_july_2017_total4['cluster'] != '10']

# Sum the percentages of sales in July 2017 per store type
df_salesandstoresdata_july_2017_total4 = df_salesandstoresdata_july_2017_total4.groupby(['type']).agg({'Percentage':'sum', 'unit_sales':'sum'}).reset_index()
df_salesandstoresdata_july_2017_total4 = df_salesandstoresdata_july_2017_total4.sort_values(by='Percentage', ascending=False)

df_salesandstoresdata_july_2017_total4

In [None]:
# Report the share and the number of units that remain after both exclusions
print(f"By excluding new_stores, stores that miss more than 9 days of data and cluster 10, we end up with {df_salesandstoresdata_july_2017_total4['Percentage'].sum():.1f}% of the total sales in July 2017")
print(f"This represents {df_salesandstoresdata_july_2017_total4['unit_sales'].sum():,.0f} units sold in July 2017")