### Prepping Data Challenge: C&BSCo Clean and Aggregate (Week 27)
 
### Requirements
- Input the data
- Separate out the Product Name field to form Product Type and Quantity
- Rename the fields to 'Product Type' and 'Quantity' respectively
- Create two paths in your flow: 
  - One to deal with the data about Liquid Soap sales
  - One to deal with the data about Bar Soap sales
- For each path in your flow:
  - Clean the Quantity field to just leave values
    - For Liquid, ensure every value is in millilitres 
  - Sum up the sales for each combination of Store, Region and Quantity
  - Also, count the number of orders that have each combination of Store, Region and Quantity. Name this field 'Present in N orders'
- Output each file from the separate paths

In [1]:
import pandas as pd
import numpy as np

In [2]:
#input the data
df = pd.read_csv('wk27-input.csv', parse_dates=['Sale Date'], dayfirst=True)
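
The `dayfirst=True` argument matters for ambiguous dates. Below is a minimal sketch on a made-up two-row sample; the column values and the `DD/MM/YYYY` layout are assumptions for illustration, not taken from the real input file.

```python
import io
import pandas as pd

# Hypothetical sample; the real file's date layout is an assumption here.
sample_csv = io.StringIO(
    "Sale Date,Order ID\n"
    "01/02/2022,1\n"
    "12/12/2022,2\n"
)
sample = pd.read_csv(sample_csv, parse_dates=["Sale Date"], dayfirst=True)

# With dayfirst=True, 01/02/2022 is read as 1 February rather than 2 January.
first_sale_day = sample["Sale Date"].dt.day.iloc[0]
first_sale_month = sample["Sale Date"].dt.month.iloc[0]
```

Checking `sample.dtypes` after loading is a quick way to confirm the column was parsed as a datetime and not left as text.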

In [3]:
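#Separate out the Product Name field to form Product Type and Quantity
#Rename the fields to 'Product Type' and 'Quantity' respectively
#For each path in your flow; Clean the Quantity field to just leave values
df[['Product Type','Quantity']] = df['Product Name'].str.split(' - ', expand=True)
#Raw strings (r'...') stop Python treating \d as an invalid escape sequence
#The unit is whatever follows the leading digits; it must be extracted before Quantity is overwritten
df['unit'] = df['Quantity'].str.extract(r'\d+(.*)', expand=False)
df['Quantity'] = df['Quantity'].str.extract(r'(\d+)', expand=False).astype('int')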
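
To see what the split and the two regular expressions do, here is a small sketch on three hypothetical product names. Only `Liquid - 25ml` appears in the output below; `Liquid - 1L` and `Bar - 4x` are assumed shapes used for illustration.

```python
import pandas as pd

# Hypothetical product names in the "Type - Quantity" shape.
names = pd.Series(["Liquid - 25ml", "Liquid - 1L", "Bar - 4x"])

# Split on the " - " separator into two columns.
parts = names.str.split(" - ", expand=True)
parts.columns = ["Product Type", "Quantity"]

# Text after the leading digits is the unit; the digits are the value.
# The unit has to be extracted first, before Quantity is overwritten.
parts["unit"] = parts["Quantity"].str.extract(r"\d+(.*)", expand=False).str.strip()
parts["Quantity"] = parts["Quantity"].str.extract(r"(\d+)", expand=False).astype(int)
```

Passing `expand=False` makes `str.extract` return a Series for a single capture group, which keeps the assignment to one column explicit.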

In [4]:
df.head(10)

,Sale Date,Order ID,Sale Value,Product Name,Store Name,Region,Scent Name,Product Type,Quantity,unit
0,2022-12-12,937,109.84,Liquid - 25ml,Lewisham,East,Rose,Liquid,25,ml
1,2022-10-14,427,207.61,Liquid - 25ml,Lewisham,East,Rose,Liquid,25,ml
2,2022-09-09,135,111.96,Liquid - 25ml,Lewisham,East,Rose,Liquid,25,ml
3,2022-12-11,791,170.68,Liquid - 25ml,Wimbledon,West,Rose,Liquid,25,ml
4,2022-09-08,270,214.12,Liquid - 25ml,Wimbledon,West,Rose,Liquid,25,ml
5,2022-01-18,726,29.55,Liquid - 25ml,Dulwich,East,Rose,Liquid,25,ml
6,2022-05-29,692,194.32,Liquid - 25ml,Dulwich,East,Rose,Liquid,25,ml
7,2022-12-08,672,160.45,Liquid - 25ml,Dulwich,East,Rose,Liquid,25,ml
8,2022-01-14,551,125.41,Liquid - 25ml,Dulwich,East,Rose,Liquid,25,ml
9,2022-08-02,516,60.75,Liquid - 25ml,Dulwich,East,Rose,Liquid,25,ml


In [5]:
#Also, count the number of orders that have each combination of Store, Region and Quantity. Name this field 'Present in N orders'
#Product Type is included in the key so Liquid and Bar rows are counted separately
df['Present in N orders'] = df.groupby(['Store Name', 'Region', 'Quantity', 'Product Type'])['Order ID'].transform('nunique')
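
`transform('nunique')` differs from a plain aggregation in that it keeps one row per input row and repeats the group's distinct count on each. A minimal sketch on a toy frame (the store names `A` and `B` and the `n_orders` column are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Store Name": ["A", "A", "A", "B"],
    "Order ID": [1, 1, 2, 3],
})

# Each row receives the number of distinct Order IDs in its store's group.
toy["n_orders"] = toy.groupby("Store Name")["Order ID"].transform("nunique")
# Store A has orders {1, 2}, store B has {3}, so n_orders is [2, 2, 2, 1].
```

Because the result is aligned to the original rows, it can be assigned straight back as a new column.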

In [6]:
#Create two paths in your flow:
#.copy() gives each path its own DataFrame, so assigning to a column later does not raise SettingWithCopyWarning
df1 = df[df['Product Type'] == 'Liquid'].copy()
df2 = df[df['Product Type'] == 'Bar'].copy()

In [7]:
#For Liquid, ensure every value is in millilitres
df1['Quantity'] = df1['Quantity'] * np.where(df1['unit'] == 'L', 1000, 1)
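
Taking an explicit copy of a filtered frame is the usual way to avoid pandas' SettingWithCopyWarning when a column is reassigned. The sketch below uses a toy frame with assumed values to show the copy and the litre-to-millilitre factor together.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Product Type": ["Liquid", "Liquid", "Bar"],
    "Quantity": [25, 1, 4],
    "unit": ["ml", "L", "x"],  # hypothetical units
})

# .copy() makes the filtered rows an independent DataFrame.
liquid = toy[toy["Product Type"] == "Liquid"].copy()

# Multiply by 1000 where the unit is litres, by 1 otherwise.
liquid["Quantity"] = liquid["Quantity"] * np.where(liquid["unit"] == "L", 1000, 1)
```

The original `toy` frame is left untouched, so the Bar path can still be filtered from it afterwards.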


In [8]:
#Sum up the sales for each combination of Store, Region and Quantity
df1 = df1.groupby(['Store Name', 'Region', 'Quantity', 'Present in N orders'])['Sale Value'].sum().reset_index()
df2 = df2.groupby(['Store Name', 'Region', 'Quantity', 'Present in N orders'])['Sale Value'].sum().reset_index()
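
Using 'Present in N orders' as a grouping key relies on the count being constant within each Store, Region and Quantity group. If two raw quantities ever mapped to the same millilitre value after conversion (for example a hypothetical `1L` and `1000ml`), counts taken before conversion would split that group into two rows. One alternative, sketched here on toy data with assumed values, is to compute the sum and the distinct order count in a single named aggregation after the quantities are already in millilitres.

```python
import pandas as pd

toy = pd.DataFrame({
    "Store Name": ["Chelsea"] * 4,
    "Region": ["West"] * 4,
    "Quantity": [1000, 1000, 25, 25],  # assumed to be in ml already
    "Order ID": [1, 2, 3, 3],
    "Sale Value": [10.0, 20.0, 5.0, 7.5],
})

# Named aggregation; the ** dict form allows output names containing spaces.
summary = (
    toy.groupby(["Store Name", "Region", "Quantity"], as_index=False)
       .agg(**{
           "Sale Value": ("Sale Value", "sum"),
           "Present in N orders": ("Order ID", "nunique"),
       })
)
```

This yields one row per Store, Region and Quantity combination regardless of how the raw units were written.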

In [9]:
df1.head(10)

,Store Name,Region,Quantity,Present in N orders,Sale Value
0,Chelsea,West,25,48,8550.55
1,Chelsea,West,50,52,9338.79
2,Chelsea,West,100,43,7989.93
3,Chelsea,West,250,40,6574.2
4,Chelsea,West,500,62,11111.0
5,Chelsea,West,750,55,8822.72
6,Chelsea,West,1000,46,7253.31
7,Dulwich,East,25,51,8788.93
8,Dulwich,East,50,50,9255.96
9,Dulwich,East,100,49,9079.9


In [10]:
df2.head(10)

,Store Name,Region,Quantity,Present in N orders,Sale Value
0,Chelsea,West,1,50,9928.69
1,Chelsea,West,2,49,9279.84
2,Chelsea,West,4,54,10037.28
3,Dulwich,East,1,55,9971.78
4,Dulwich,East,2,62,11115.21
5,Dulwich,East,4,54,9069.22
6,Lewisham,East,1,63,12815.77
7,Lewisham,East,2,51,10846.53
8,Lewisham,East,4,73,12023.33
9,Notting Hill,West,1,49,9679.22


In [11]:
#Output each file from the separate paths
df1.to_csv('wk27-outputA.csv', index=False)
df2.to_csv('wk27-outputB.csv', index=False)
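
A quick round trip can confirm that a written file has the expected columns. The sketch below writes a small frame to a temporary directory and reads it back; the file name `example-output.csv` is a placeholder, and the two data rows are copied from the Liquid output above.

```python
import os
import tempfile
import pandas as pd

out = pd.DataFrame({
    "Store Name": ["Chelsea", "Chelsea"],
    "Region": ["West", "West"],
    "Quantity": [25, 50],
    "Present in N orders": [48, 52],
    "Sale Value": [8550.55, 9338.79],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example-output.csv")  # placeholder file name
    # index=False keeps the row index out of the file, so no extra
    # unnamed column appears when the CSV is read back.
    out.to_csv(path, index=False)
    back = pd.read_csv(path)
```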