# Exercise walkthrough

In [None]:
# initial setup, import, defs, loading dataframe

import pandas as pd
import os
prep_data_path = '../2_Data/2_Prepared_Data'
df_orders_products_merged = pd.read_pickle(os.path.join(prep_data_path, 'orders_products_merged.pkl'))

In [2]:
# slice first million rows
df_sliced = df_orders_products_merged[:1000000]

In [3]:
df_sliced.shape

(1000000, 11)

The output does not match the screenshot in the exercise, since I have been optimizing the dataframe to keep only the needed columns.

In [4]:
#function to label products according to defined price ranges
#The meaning of numbers is as follows:
# 1-low-range product
# 2-mid-range product
# 3-high-range product
# 0-not enough data

def price_label(row):

  if row['prices'] <= 5:
    return 1
  elif (row['prices'] > 5) and (row['prices'] <= 15):
    return 2
  elif row['prices'] > 15:
    return 3
  else: return 0

In [5]:
#applying the function to create a new column in df with the appropriate label

df_sliced['price_range'] = df_sliced.apply(price_label, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_sliced['price_range'] = df_sliced.apply(price_label, axis=1)
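
The warning above appears because `df_sliced` was created by slicing `df_orders_products_merged`, so pandas cannot tell whether the new column is meant to reach the original dataframe. Below is a minimal sketch of one common way to avoid it, assuming the extra memory of an explicit copy is acceptable. The toy dataframe `df_full` and the name `df_small` are hypothetical stand-ins, not part of the exercise.

```python
import pandas as pd

# Hypothetical stand-in for df_orders_products_merged
df_full = pd.DataFrame({'prices': [2.0, 7.5, 20.0, 4.0, 12.0]})

# An explicit copy makes the slice an independent dataframe,
# so adding a column to it is unambiguous
df_small = df_full[:3].copy()
df_small['price_range'] = 1

# The original dataframe is left untouched
print('price_range' in df_full.columns)
```

The results of the exercise are not affected by the warning; the copy only makes the intent explicit.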


In [6]:
df_sliced['price_range'].value_counts()

price_range
2    673200
1    314531
3     12269
Name: count, dtype: int64

The output does not match the expected output shown in the exercise.

In [7]:
df_sliced['prices'].max()

25.0

After correcting the 2 outlier prices in 4.5, the prices are now more consistent, with a maximum of 25.0.

I decided to stop worrying about the counts not matching the screenshot in the exercise.

This can only be a complete misalignment between the content of the exercise and the content of the original dataset (which was possibly reworked in the meantime without the exercise being updated). I'll proceed as is.

In [8]:
#using same numbers as previously

df_sliced.loc[df_sliced['prices'] > 15, 'price_range_loc'] = 3
df_sliced.loc[(df_sliced['prices'] <= 15) & (df_sliced['prices'] > 5), 'price_range_loc'] = 2 
df_sliced.loc[df_sliced['prices'] <= 5, 'price_range_loc'] = 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_sliced.loc[df_sliced['prices'] > 15, 'price_range_loc'] = 3


In [9]:
df_sliced['price_range_loc'].value_counts()

price_range_loc
2.0    673200
1.0    314531
3.0     12269
Name: count, dtype: int64

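The `loc` approach gives exactly the same counts as the function `price_label()`. The labels show up as floats (`2.0`) because the new column is first created with missing values for the rows that have not been assigned yet.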
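
Comparing two `value_counts()` outputs by eye works, but the two columns can also be compared row by row. The following is a minimal sketch on a toy dataframe; `df_demo` and its prices are made up for illustration, chosen to sit on the boundaries at 5 and 15.

```python
import pandas as pd

# Hypothetical prices, including the boundary values 5 and 15
df_demo = pd.DataFrame({'prices': [1.0, 5.0, 5.1, 15.0, 15.1, 25.0]})

def price_label(row):
    if row['prices'] <= 5:
        return 1
    elif (row['prices'] > 5) and (row['prices'] <= 15):
        return 2
    elif row['prices'] > 15:
        return 3
    else:
        return 0

# Method 1: row-wise function
df_demo['price_range'] = df_demo.apply(price_label, axis=1)

# Method 2: boolean masks with loc
df_demo.loc[df_demo['prices'] > 15, 'price_range_loc'] = 3
df_demo.loc[(df_demo['prices'] <= 15) & (df_demo['prices'] > 5), 'price_range_loc'] = 2
df_demo.loc[df_demo['prices'] <= 5, 'price_range_loc'] = 1

# True only if every row received the same label from both methods
same = (df_demo['price_range'] == df_demo['price_range_loc']).all()
print(same)
```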

In [10]:
#repeating loc method to label products according to prices in the whole dataset

df_orders_products_merged.loc[df_orders_products_merged['prices'] > 15, 'price_range_loc'] = 3
df_orders_products_merged.loc[(df_orders_products_merged['prices'] <= 15) & (df_orders_products_merged['prices'] > 5), 'price_range_loc'] = 2 
df_orders_products_merged.loc[df_orders_products_merged['prices'] <= 5, 'price_range_loc'] = 1

In [11]:
df_orders_products_merged['price_range_loc'].value_counts()

price_range_loc
2.0    21860980
1.0    10130188
3.0      412551
Name: count, dtype: int64

This result confirms the previous conclusion: the figures here differ a lot from the ones in the exercise screenshots.

Most likely, the dataset was reworked and the exercise content was not updated to match.

In [12]:
#re-assigning datatype for optimal memory usage
df_orders_products_merged['price_range_loc'] = df_orders_products_merged['price_range_loc'].astype('int8')
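
To give a rough idea of what the conversion saves, here is a small sketch comparing the memory of a float column with the same values stored as `int8`. The series `labels` is a hypothetical stand-in with one million rows. Note that the conversion to an integer type is only possible when the column contains no missing values, which is the case here since every price fell into one of the three ranges.

```python
import numpy as np
import pandas as pd

# Hypothetical label column stored as float64, as created by the loc assignments
labels = pd.Series(np.ones(1_000_000), name='price_range_loc')

# Memory of the values only (index excluded), in bytes
bytes_float = labels.memory_usage(index=False)
bytes_int8 = labels.astype('int8').memory_usage(index=False)

print(bytes_float, bytes_int8)
```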

In [13]:
df_orders_products_merged['orders_day_of_week'].value_counts(dropna=False)

orders_day_of_week
0    6203898
1    5660040
6    4496316
2    4213690
5    4205651
3    3840418
4    3783706
Name: count, dtype: int64

In [14]:
# the number convention here will be:
# 1 = least busy
# 2 = regularly busy
# 3 = busiest day

busy_label = []

for day_of_week in df_orders_products_merged["orders_day_of_week"]:
  if day_of_week == 0:
    busy_label.append(3)
  elif day_of_week == 4:
    busy_label.append(1)
  else:
    busy_label.append(2)

In [15]:
df_orders_products_merged['busy'] = busy_label

In [16]:
#re-assigning datatype for optimal memory usage
df_orders_products_merged['busy'] = df_orders_products_merged['busy'].astype('int8')

In [17]:
df_orders_products_merged['busy'].value_counts(dropna=False)

busy
2    22416115
3     6203898
1     3783706
Name: count, dtype: int64

The counts for Saturday (day 0) and Wednesday (day 4) show up as the counts for the Busiest day and Least busy categories, while all the other days add up to the Regularly busy category.
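
Looping over every row in Python works, but it can be slow on a dataframe of this size. A possible vectorized alternative is sketched below using `numpy.select`. The toy series `days` is hypothetical, and the sketch only shows that both approaches produce the same labels.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_orders_products_merged['orders_day_of_week']
days = pd.Series([0, 1, 2, 3, 4, 5, 6, 0, 4], name='orders_day_of_week')

# Loop version, as in the cell above
busy_loop = []
for day_of_week in days:
    if day_of_week == 0:
        busy_loop.append(3)
    elif day_of_week == 4:
        busy_loop.append(1)
    else:
        busy_loop.append(2)

# Vectorized version: one condition per label, 2 as the default
busy_vec = np.select([days == 0, days == 4], [3, 1], default=2).astype('int8')

print(busy_vec.tolist() == busy_loop)
```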

# TASK

### Step 2 - New labels for top 2 and bottom 2 busiest days

In [18]:
# the number convention here will be:
# 1 = slowest days
# 2 = regular days
# 3 = busiest days

busy_days_label = []

for day_of_week in df_orders_products_merged["orders_day_of_week"]:
  if day_of_week == 0 or day_of_week == 1:
    busy_days_label.append(3)
  elif day_of_week == 3 or day_of_week == 4:
    busy_days_label.append(1)
  else:
    busy_days_label.append(2)

In [19]:
df_orders_products_merged['busy_days'] = busy_days_label

In [20]:
df_orders_products_merged['busy_days'] = df_orders_products_merged['busy_days'].astype('int8')

### Step 3 - Values verification

In [21]:
df_orders_products_merged['busy_days'].value_counts(dropna=False)

busy_days
2    12915657
3    11863938
1     7624124
Name: count, dtype: int64

Busiest days match the sum of Saturday and Sunday, as expected.

Slowest days match the sum of Wednesday and Tuesday, as expected.

The remaining days add up to the regular days count.
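
The same check can be written out as arithmetic, using the per-day counts printed in cell 13. The dictionary name `day_counts` is only for this illustration.

```python
# Order counts per day, copied from the orders_day_of_week output above
day_counts = {
    0: 6203898, 1: 5660040, 2: 4213690, 3: 3840418,
    4: 3783706, 5: 4205651, 6: 4496316,
}

busiest = day_counts[0] + day_counts[1]   # label 3
slowest = day_counts[3] + day_counts[4]   # label 1
regular = sum(day_counts.values()) - busiest - slowest  # label 2

print(busiest, slowest, regular)
# 11863938 7624124 12915657
```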

### Step 4 - Periods of day with most orders

In [22]:
#create a dataframe with the info we need

busy_hours = df_orders_products_merged['order_hour_of_day'].value_counts(dropna=False)
df_busy_hours = busy_hours.to_frame()
df_busy_hours

order_hour_of_day    count
10                 2761660
11                 2736010
14                 2689036
15                 2662044
13                 2660846
12                 2618430
16                 2535106
9                  2454127
17                 2087564
8                  1718082


In [23]:
df_orders_products_merged.shape[0]

32403719

The exercise does not say how many hours should be labeled as "Most orders" or "Fewest orders".

For this reason, I opted to define a function that takes both counts as arguments.

In [24]:
#This function takes as arguments:
#   - series: an ordered series coming from value_counts 
#   - most_orders: the value for how many hours we want labeled as Most orders
#   - fewest_orders: the value for how many hours we want labeled as Fewest orders
#Output: two lists of values to check if hour of day is in one of those lists

def labeling_hours_of_day(series, most_orders, fewest_orders):
    most_ord_list = series.index[0:most_orders].to_list()
    fewest_ord_list = series.index[series.shape[0]-fewest_orders:series.shape[0]].to_list()
    return most_ord_list, fewest_ord_list
        

In [25]:
#Testing function with top 4 hours as Most orders and 4 bottom as Fewest orders
most_ord_list, fewest_ord_list = labeling_hours_of_day(busy_hours, 4, 4)
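
To illustrate what the function returns, here is a minimal sketch on a toy series. The `hours` data is made up, with no ties in the counts so that the order of `value_counts()` is unambiguous.

```python
import pandas as pd

def labeling_hours_of_day(series, most_orders, fewest_orders):
    most_ord_list = series.index[0:most_orders].to_list()
    fewest_ord_list = series.index[series.shape[0]-fewest_orders:series.shape[0]].to_list()
    return most_ord_list, fewest_ord_list

# Hypothetical order hours: 10 appears 4 times, 11 three times, 14 twice, 3 once
hours = pd.Series([10] * 4 + [11] * 3 + [14] * 2 + [3], name='order_hour_of_day')
counts = hours.value_counts()  # sorted from most to fewest orders

# Top 2 hours as most orders, bottom 1 hour as fewest orders
most_demo, fewest_demo = labeling_hours_of_day(counts, 2, 1)
print(most_demo, fewest_demo)
# [10, 11] [3]
```

The function relies on the series being sorted in descending order, which is what `value_counts()` produces by default.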

In [26]:
#running through the dataframe, checking for each row if order_hour_of_day is inside each list and assigning the correct label
# the number convention here will be:
# 1 = fewest orders
# 2 = average orders
# 3 = most orders

hour_label = []

for hour in df_orders_products_merged["order_hour_of_day"]:
  if hour in most_ord_list:
    hour_label.append(3)
  elif hour in fewest_ord_list:
    hour_label.append(1)
  else:
    hour_label.append(2)

In [27]:
#adding column to the main dataframe

df_orders_products_merged['busiest_period_of_day'] = hour_label

In [28]:
df_orders_products_merged['busiest_period_of_day'] = df_orders_products_merged['busiest_period_of_day'].astype('int8')

In [29]:
df_orders_products_merged['busiest_period_of_day'].value_counts(dropna=False)

busiest_period_of_day
2    21293118
3    10848750
1      261851
Name: count, dtype: int64

Checking with a calculator, the values match:
- Most orders corresponds to the sum of orders in the hours 10, 11, 14 and 15
- Fewest orders corresponds to the sum of orders in the hours 5, 2, 4 and 3
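
The calculator check can also be written as a short script. Only the counts printed earlier in the notebook are used: the four busiest hours from cell 22, the total row count from cell 23 and the label counts from cell 29. The variable names are for this illustration only.

```python
# Counts for the four busiest hours, copied from the busy_hours output
top4_hours = {10: 2761660, 11: 2736010, 14: 2689036, 15: 2662044}

# Label counts from busiest_period_of_day and total rows in the dataframe
label_counts = {2: 21293118, 3: 10848750, 1: 261851}
total_rows = 32403719

most_orders_sum = sum(top4_hours.values())

# The top 4 hours add up to the "most orders" label
print(most_orders_sum == label_counts[3])
# The three labels together cover every row
print(sum(label_counts.values()) == total_rows)
```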

In [30]:
#exporting to pickle format
df_orders_products_merged.to_pickle(os.path.join(prep_data_path, 'orders_products_labeled.pkl'))