### Prepping Data Challenge: Getting Trolleyed (week 21)

Our final challenge for calculations month is all about analytical calculations. These calculations let you answer your stakeholders' questions before you've even visualised anything.

### Challenge
The New Trolley Inventory project was finally delivered at the end of May. We want to find the products that we are now selling for a much higher amount than we did before the project, so we will analyse the top three products based on price rise per destination.

### Requirements
 - Input data
 - Bring all the sheets together
 - Use the Day of Month and Table Names (sheet name in other tools) to form a date field for the purchase called 'Date'
 - Create 'New Trolley Inventory?' field to show whether the purchase was made on or after 1st June 2021 (the first date with the revised inventory after the project closed)
 - Remove lots of the detail of the product name:
   - Only return any names before the '-' (hyphen)
   - If a product doesn't have a hyphen return the full product name
 - Make price a numeric field
 - Work out the average selling price per product
 - Work out the Variance (difference) between the selling price and the average selling price
 - Rank the Variances (1 being the largest positive variance) per destination and whether the product was sold before or after the new trolley inventory project delivery
 - Return only ranks 1-5 
 - Output the data

In [2]:
import pandas as pd
import numpy as np

In [3]:
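#Input the data
#Bring all the Sheets together

frames = []
with pd.ExcelFile('Wk21-Input.xlsx') as xlsx:
    for s in xlsx.sheet_names:
        df_new = pd.read_excel(xlsx, sheet_name=s)
        df_new['sheet_name'] = s  # keep the sheet name so the month can be recovered later
        frames.append(df_new)

# concatenate once, after every sheet has been read
df = pd.concat(frames)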
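
The sheet-stacking step can be tried without the workbook. The sketch below is only an illustration: the `sheets` dictionary and its values are made-up stand-ins for the real sheets, shaped like the mapping that `pd.read_excel(path, sheet_name=None)` returns. Unlike the cell above, it passes `ignore_index=True`, so the combined frame gets one continuous index.

```python
import pandas as pd

# Hypothetical stand-in for the workbook: sheet name -> DataFrame
# (the same shape pd.read_excel(path, sheet_name=None) returns).
sheets = {
    "Month 1": pd.DataFrame({"Day of Month": [9, 19], "Price": ["$10.14", "$33.89"]}),
    "Month 2": pd.DataFrame({"Day of Month": [3], "Price": ["$5.00"]}),
}

# Tag every frame with its sheet name, then concatenate once at the end.
frames = [frame.assign(sheet_name=name) for name, frame in sheets.items()]
combined = pd.concat(frames, ignore_index=True)

print(combined)
```

Collecting the frames in a list and calling `pd.concat` once avoids copying the growing frame on every loop iteration.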

In [4]:
df.head()

Unnamed: 0,Day of Month,first_name,last_name,email,Product,Price,Destination,sheet_name
0,9,Daffie,Clemont,dclemont0@unc.edu,Emulsifier,$10.14,New York,Month 1
1,19,Lucio,Muzzall,lmuzzall1@dell.com,Chambord Royal,$33.89,London,Month 1
2,25,Corbie,Shrigley,cshrigley2@sourceforge.net,Apples - Sliced / Wedge,$1.64,Perth,Month 1
3,9,Sioux,Couth,scouth3@bluehost.com,Vinegar - White Wine,$19.84,Paris,Month 1
4,21,Almira,Rickards,arickards4@godaddy.com,Food Colouring - Pink,$20.15,Edinburgh,Month 1


In [5]:
# trim the Destination
df['Destination'] = df['Destination'].str.strip()

In [6]:
#Use the Day of Month and Table Names (sheet name in other tools) to form a date field for the purchase called 'Date'
df['Date'] = pd.to_datetime('2021' + '-' + df['sheet_name'].str.replace('Month ', '')
                         + '-' + df['Day of Month'].astype(str))
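
As an alternative to string concatenation, the date can be assembled from numeric parts. This is a sketch on made-up rows, not the notebook's method: `pd.to_datetime` accepts a frame with `year`, `month` and `day` columns, and the `sample` frame here is purely illustrative.

```python
import pandas as pd

# Illustrative rows only; the real data comes from the workbook.
sample = pd.DataFrame({"sheet_name": ["Month 1", "Month 6"], "Day of Month": [9, 1]})

# Pull the month number out of the sheet name as an integer.
month = sample["sheet_name"].str.replace("Month ", "", regex=False).astype(int)

# Assemble the date from its year / month / day components.
sample["Date"] = pd.to_datetime(
    pd.DataFrame({"year": 2021, "month": month, "day": sample["Day of Month"]})
)

print(sample)
```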

In [9]:
#Create 'New Trolley Inventory?' field to show whether the purchase was made on or after 1st June 2021
#(the first date with the revised inventory after the project closed)
df['New Trolley Inventory?'] = (df['Date'] >= pd.Timestamp(2021, 6, 1))
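
A quick check of the boundary behaviour, on made-up dates: a purchase on 1st June itself should count as new inventory, while 31st May should not. The names `dates`, `cutoff` and `is_new_inventory` are illustrative.

```python
import pandas as pd

# Three illustrative dates around the project delivery.
dates = pd.Series(pd.to_datetime(["2021-05-31", "2021-06-01", "2021-06-02"]))

# First day with the revised inventory.
cutoff = pd.Timestamp(2021, 6, 1)

# ">=" makes the cutoff date itself count as new inventory.
is_new_inventory = dates >= cutoff

print(is_new_inventory.tolist())
```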

In [11]:
#Remove lots of the detail of the product name:
df['Product'] = df['Product'].str.split(' -').str[0]
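
To see how the split treats names with and without a hyphen, here is a small sketch using three product names taken from the `df.head()` output above. A name without `' -'` splits into a single piece, so taking element 0 returns it unchanged.

```python
import pandas as pd

# Product names as they appear in the first rows of the input.
products = pd.Series(["Apples - Sliced / Wedge", "Chambord Royal", "Vinegar - White Wine"])

# Keep only the text before the first " -".
short_names = products.str.split(" -").str[0]

print(short_names.tolist())
```

Note that the pattern includes the leading space, so a hyphen with no space before it would not be treated as a separator.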

In [13]:
# make price a numeric field
df['Price'] = df['Price'].str.strip('$').astype(float)
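
An equivalent conversion, sketched on prices from the `df.head()` output above, uses `pd.to_numeric` instead of `astype(float)`. This is offered as an option rather than a correction.

```python
import pandas as pd

# Prices as they appear in the raw input, with a leading dollar sign.
prices = pd.Series(["$10.14", "$33.89", "$1.64"])

# Drop the currency symbol, then parse the remaining text as numbers.
numeric = pd.to_numeric(prices.str.lstrip("$"))

print(numeric.tolist())
```

If some prices might not parse, `pd.to_numeric(..., errors="coerce")` turns them into `NaN` instead of raising an error.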

In [15]:
# work out the average selling price per product
df['Avg Price per Product'] = df.groupby('Product')['Price'].transform('mean')

In [17]:
# work out the Variance (difference) between the selling price and the average selling price
df['Variance'] = df['Price'] - df['Avg Price per Product']
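
A worked example of the last two steps on made-up numbers: `transform('mean')` returns one value per original row, so the average lines up with each sale and can be subtracted directly. The `sales` frame is illustrative only.

```python
import pandas as pd

# Two sales of one product and one sale of another (made-up prices).
sales = pd.DataFrame({
    "Product": ["Apples", "Apples", "Vinegar"],
    "Price": [1.0, 3.0, 20.0],
})

# Per-product mean, broadcast back onto every row of that product.
sales["Avg Price per Product"] = sales.groupby("Product")["Price"].transform("mean")

# Positive variance means the sale was above the product's average price.
sales["Variance"] = sales["Price"] - sales["Avg Price per Product"]

print(sales)
```

Apples average 2.0, so the two sales get variances of -1.0 and 1.0. A product sold only once always has a variance of 0.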

In [19]:
# rank the Variances (1 being the largest positive variance) per destination and whether the product
# was sold before or after the new trolley inventory project delivery
df['Variance Rank by Destination'] = df.groupby(['Destination', 'New Trolley Inventory?'])\
                                       ['Variance'].rank(ascending=False).astype(int)
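
Ties deserve a note here. By default `rank` uses `method='average'`, so two tied variances share a fractional rank such as 1.5, which `astype(int)` then truncates to 1. The sketch below uses made-up data to compare the default with `method='min'` and `method='first'`. Which one suits the challenge is a judgment call.

```python
import pandas as pd

# Made-up variances with a tie in Perth.
toy = pd.DataFrame({
    "Destination": ["Perth", "Perth", "Perth", "London"],
    "New Trolley Inventory?": [False, False, False, False],
    "Variance": [5.0, 5.0, 1.0, 2.0],
})

grouped = toy.groupby(["Destination", "New Trolley Inventory?"])["Variance"]

# Default: tied rows share the average of the ranks they span.
toy["rank_average"] = grouped.rank(ascending=False)
# "min": tied rows share the lowest rank, and the next rank is skipped.
toy["rank_min"] = grouped.rank(method="min", ascending=False)
# "first": ties are broken by row order, so every rank is unique.
toy["rank_first"] = grouped.rank(method="first", ascending=False)

print(toy)
```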

In [21]:
df = df[['New Trolley Inventory?', 'Variance Rank by Destination', 'Variance',
            'Avg Price per Product', 'Date', 'Product', 'first_name', 'last_name', 'email',
            'Price', 'Destination']]

In [25]:
#Return only ranks 1-5
rank1_5 = df[df['Variance Rank by Destination'] <= 5]

In [26]:
rank1_5.head()

Unnamed: 0,New Trolley Inventory?,Variance Rank by Destination,Variance,Avg Price per Product,Date,Product,first_name,last_name,email,Price,Destination
160,False,4,21.132,18.748,2021-01-02,Seedlings,Roger,Meaker,rmeaker4g@google.co.jp,39.88,Perth
241,False,1,22.955,15.515,2021-01-13,Blue Curacao,Torr,Weeden,tweeden6p@ebay.co.uk,38.47,Perth
376,False,4,17.34,18.85,2021-01-30,"Chilli Paste, Hot Sambal Oelek",Tresa,Shawe,tshaweag@paginegialle.it,36.19,Edinburgh
624,False,1,25.918,12.702,2021-01-12,Allspice,Lockwood,Boydon,lboydonhc@wisc.edu,38.62,New York
698,False,4,22.864,14.746,2021-01-11,Towel Multifold,Sean,Ivashev,sivashevje@stumbleupon.com,37.61,London


In [28]:
#OUTPUT
rank1_5.to_csv('wk21-output1.csv', index=False)