### Preppin' Data Challenge: C&BSCo Next Sale (Week 33)
 
### Requirements
 - Input the data sets
 - Link the Instore and Online sales together to be one data source
   - Call the Nulls in the Stores field Online 
 - Link in the product Lookup to name the products instead of having their ID number
 - Create the 'Product Type' field by taking the first word of the product name
 - Create a data set from your work so far that includes the next sale after the one made in the SAME store of the same product type 
   - Requirement updated 20th Aug 2022
 - Work out how long it took between the original sale and the next sale in minutes
   - Remove any negative differences. These are sales that got refunded. 
 - Create a data set that shows the average of these values for each store and product type. Call this field 'Average mins to next sale' 
 - Output the results
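
As a rough cross-check of the requirements above, the "next sale" logic can also be sketched with `groupby(...).shift(-1)` instead of the `merge_asof` used later in this notebook. The toy data below is invented for illustration (it is not the challenge input), and the sketch assumes that sale order is given by `ID` and that IDs are unique within each store.

```python
import pandas as pd

# Toy sales data, made up for illustration only.
sales = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Store": ["Wimbledon", "Wimbledon", "Lewisham", "Wimbledon", "Lewisham"],
    "Product Type": ["Bar", "Liquid", "Bar", "Bar", "Bar"],
    "Sales Timestamp": pd.to_datetime([
        "2022-01-01 10:00", "2022-01-01 10:30", "2022-01-01 11:00",
        "2022-01-01 12:00", "2022-01-01 10:15",
    ]),
})

# Assumption: the "next" sale is the one with the next-highest ID
# in the same store and product type.
sales = sales.sort_values("ID")
group_keys = ["Store", "Product Type"]
sales["Next Timestamp"] = sales.groupby(group_keys)["Sales Timestamp"].shift(-1)

# Minutes from each sale to the next one in its group (NaN when there is none).
sales["mins_to_next"] = (
    sales["Next Timestamp"] - sales["Sales Timestamp"]
).dt.total_seconds() / 60

# Negative gaps are treated as refunds and dropped; NaN rows drop out as well.
valid = sales[sales["mins_to_next"] >= 0]

summary = (
    valid.groupby(group_keys, as_index=False)["mins_to_next"].mean()
         .rename(columns={"mins_to_next": "Average mins to next sale"})
)
print(summary)
```

In this toy data the Lewisham pair has a later ID with an earlier timestamp, so its gap is negative and is removed, leaving only the Wimbledon / Bar pair in the summary.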

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Input the data sets
# Link the Instore and Online sales together to be one data source
# (online orders have no store, so label them 'Online')

df_1 = (pd.concat([pd.read_csv('Wk33 Input Instore Orders.csv', parse_dates=['Sales Date'], dayfirst=True)
                     .rename(columns={'Sales Date':'Sales Timestamp'}),
                   pd.read_csv('Wk33 Input Online Orders.csv', parse_dates=['Sales Timestamp'], dayfirst=True)
                     .assign(Store='Online')]))

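# Read the product lookup and create the 'Product Type' field
# (the text before ' - ' in the product name)
df_2 = (pd.read_csv('wk33 Product Lookup.csv')
          .assign(Product_Type = lambda x: x['Product Name'].str.extract(r'(.*) - .*', expand=False))
          .rename(columns = lambda y: y.replace('_', ' ')))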
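
The requirement asks for the first word of the product name, while the pattern `(.*) - .*` keeps everything before the ` - ` separator. The two agree when the part before the separator is a single word, as in the lookup rows shown below. A small sketch of the difference, using made-up names (the second and third are hypothetical and not taken from the lookup file):

```python
import pandas as pd

# Illustrative product names; only the "<type> - <size>" shape is taken from the lookup.
names = pd.Series(["Liquid - 25ml", "Bar - Rose", "Liquid Soap - 1l"])

# Everything before the " - " separator (the approach used in this notebook).
by_regex = names.str.extract(r"(.*) - .*", expand=False)

# Literally the first whitespace-delimited word.
first_word = names.str.split().str[0]

print(pd.DataFrame({"name": names, "by_regex": by_regex, "first_word": first_word}))
```

The two columns differ only for the multi-word prefix, so either approach should be fine as long as every product type is a single word.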

In [3]:
df_1.head()

Unnamed: 0,Sales Timestamp,Store,ID,Product
0,2022-01-01 10:00:00,Wimbledon,1,7
1,2022-01-01 10:00:00,Wimbledon,2,1
2,2022-01-01 12:00:00,Wimbledon,3,4
3,2022-01-01 12:00:00,Wimbledon,4,6
4,2022-01-01 14:00:00,Wimbledon,5,4


In [4]:
df_2.head()

Unnamed: 0,Product ID,Product Name,Product Type
0,1,Liquid - 25ml,Liquid
1,2,Liquid - 50ml,Liquid
2,3,Liquid - 100ml,Liquid
3,4,Liquid - 250ml,Liquid
4,5,Liquid - 500ml,Liquid


In [5]:
# Link in the product lookup: replace each product ID with its product type
df_1['Product Type'] = df_1['Product'].replace(dict(zip(df_2['Product ID'], df_2['Product Type'])))
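
One thing to watch with a dictionary-based `replace` is that product IDs missing from the lookup pass through unchanged, so a numeric ID can end up sitting in the 'Product Type' column unnoticed. A possible alternative, sketched here on invented data (product ID 3 is deliberately absent from the toy lookup), is `Series.map`, which yields `NaN` for unmatched IDs so they are easy to spot:

```python
import pandas as pd

# Toy lookup and orders, made up for illustration only.
lookup = pd.DataFrame({"Product ID": [1, 2], "Product Type": ["Liquid", "Bar"]})
orders = pd.DataFrame({"Product": [1, 2, 3]})

# Series indexed by product ID, used as the mapping.
id_to_type = lookup.set_index("Product ID")["Product Type"]

# map() returns NaN wherever the ID has no entry in the lookup.
orders["Product Type"] = orders["Product"].map(id_to_type)

# List any product IDs that failed to match.
unmatched = orders.loc[orders["Product Type"].isna(), "Product"].tolist()
print(orders)
print("Unmatched product IDs:", unmatched)
```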

In [6]:
# Sort by ID so that merge_asof (which needs a sorted 'on' key) can find each sale's next sale
df_1 = df_1.sort_values(by=['ID'])

In [7]:
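# For each sale, look forward (by ID) to the next sale in the same store and product type
next_sales = df_1[['Store','Product Type','ID','Sales Timestamp']].rename(columns={'Sales Timestamp':'Next Timestamp'})
df = pd.merge_asof(df_1, next_sales, by=['Store', 'Product Type'], on='ID',
                   allow_exact_matches=False, direction='forward')

# Minutes between the original sale and the next sale
df['time_diff_min'] = (df['Next Timestamp'] - df['Sales Timestamp']).dt.total_seconds() / 60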
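
To make the self-join easier to follow, here is a minimal sketch of the same `merge_asof` call on a four-row toy frame (invented data, not the challenge input). With `direction='forward'` and `allow_exact_matches=False`, each row is paired with the nearest row that has a strictly larger `ID` within the same `by` group, and rows with no later sale get `NaT`.

```python
import pandas as pd

# Toy data, made up for illustration only; must be sorted by the 'on' key (ID).
sales = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Store": ["Wimbledon", "Online", "Wimbledon", "Wimbledon"],
    "Product Type": ["Bar", "Bar", "Liquid", "Bar"],
    "Sales Timestamp": pd.to_datetime([
        "2022-01-01 10:00", "2022-01-01 10:05",
        "2022-01-01 10:30", "2022-01-01 11:15",
    ]),
}).sort_values("ID")

# Right-hand side: the same sales, with the timestamp renamed to avoid a clash.
candidates = sales[["Store", "Product Type", "ID", "Sales Timestamp"]].rename(
    columns={"Sales Timestamp": "Next Timestamp"}
)

paired = pd.merge_asof(
    sales, candidates,
    on="ID", by=["Store", "Product Type"],
    allow_exact_matches=False,  # never pair a sale with itself
    direction="forward",        # look at later IDs only
)
paired["time_diff_min"] = (
    paired["Next Timestamp"] - paired["Sales Timestamp"]
).dt.total_seconds() / 60
print(paired)
```

Only sale 1 has a later sale in its group (sale 4, also Wimbledon / Bar), so it is the only row with a non-null difference.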

In [8]:
# Remove negative differences (refunds), then average by product type and store
output = (df[df['time_diff_min'] >= 0]
            .groupby(['Product Type', 'Store'], as_index=False)['time_diff_min'].mean().round(1)
            .rename(columns={'time_diff_min' : 'Average mins to next sale'}))

In [9]:
output.head(10)

Unnamed: 0,Product Type,Store,Average mins to next sale
0,Bar,Lewisham,284.5
1,Bar,Online,159.7
2,Bar,Wimbledon,258.2
3,Liquid,Lewisham,117.0
4,Liquid,Online,75.7
5,Liquid,Wimbledon,113.8


In [10]:
#Output the results 
output.to_csv('wk33-output.csv', index=False)