# 4.7 Deriving New Variables

### This script contains the following points:
1. Importing libraries
2. Importing data
3. From exercise: Creating a user-defined function to categorize a dataframe
4. From exercise: Loc() functions and if-statements
5. Deriving new variables in Task 4.7

### 1. Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import os

### 2. Importing Data

In [2]:
path = r'C:\Users\keely\Documents\Courses\CareerFoundry\Immersion\Achievement 4 - Python\01-2023 Instacart Basket Analysis'

In [3]:
# Importing inner merged data set. 

df_ords_prods_merged = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_inner_merged.pkl'))

In [4]:
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,new_customer,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices
0,2539329,1,1,2,8,,True,196,1,0,Soda,77,7,9.0
1,2398795,1,2,3,7,15.0,False,196,1,1,Soda,77,7,9.0
2,473747,1,3,3,12,21.0,False,196,1,1,Soda,77,7,9.0
3,2254736,1,4,4,7,29.0,False,196,1,1,Soda,77,7,9.0
4,431534,1,5,4,15,28.0,False,196,1,1,Soda,77,7,9.0


In [5]:
df_ords_prods_merged.shape

(32404859, 14)

In [6]:
df = df_ords_prods_merged[:1000000]

In [7]:
df.shape

(1000000, 14)

### 3. From Exercise: Create a User-Defined Function to Categorize a Dataframe

In [8]:
# User-defined function with an if-elif-else statement to categorize products by price.

# Use this approach when sharing and reusability are the highest priority.

def price_label(row):
    if row['prices'] <= 5:
        return 'Low-range product'
    elif (row['prices'] > 5) and (row['prices'] <= 15):
        return 'Mid-range product'
    elif row['prices'] > 15:
        return 'High-range product'
    else:
        # Reached only when the price is missing (NaN), since every comparison above is then False.
        return 'Not enough data'

In [9]:
# Create a column, price_range, in the dataframe to hold the results returned by the user-defined function.
# IMPORTANT! apply() with axis=1 calls the function once per row, so it takes longer than the methods that follow
# and uses more CPU/RAM. Use it only when there is no other choice.

df['price_range'] = df.apply(price_label, axis=1)

# IMPORTANT! Sometimes, if-statements with user-defined functions are the only way to accomplish a task in Python.
# Note that the warning below is not caused by apply() itself: it appears because df is a slice of
# df_ords_prods_merged, and the same warning is raised again by the loc-based assignment in the next section.

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_range'] = df.apply(price_label, axis=1)
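
The warning appears because `df` was created as a slice of `df_ords_prods_merged`, so pandas cannot tell whether the new column is meant to affect the original dataframe as well. A minimal sketch of one common way to avoid it, taking an explicit copy with `.copy()`; the toy frame `df_full` and its values are invented for illustration and are not part of the Instacart data.

```python
import pandas as pd

# Hypothetical stand-in for the full merged dataframe.
df_full = pd.DataFrame({'prices': [3.0, 9.0, 14.8, 20.0]})

# .copy() makes the subset an independent dataframe, so adding a column
# to it is unambiguous and leaves df_full untouched.
df_small = df_full[:3].copy()
df_small['price_range'] = 'unlabeled'

print(df_small.columns.tolist())
print(df_full.columns.tolist())
```

Applied to this notebook, the equivalent change would be to take the copy when the subset is first created.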


In [10]:
df['price_range'].value_counts(dropna = False)

Mid-range product    756450
Low-range product    243550
Name: price_range, dtype: int64

In [11]:
# No products in this subset fall into the high-range category. Let's find the highest price in the subset.

df['prices'].max()

14.800000190734863

### 4. From Exercise: Loc() Functions and If-Statements

In [12]:
# Doing the same categorization as the user-defined function above, but with conditional assignments using loc.
# Use a separate line for each condition.

df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'


In [13]:
df.loc[(df['prices'] <= 15) & (df['prices'] > 5), 'price_range_loc'] = 'Mid-range product'

In [14]:
df.loc[df['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [15]:
df['price_range_loc'].value_counts(dropna = False)

Mid-range product    756450
Low-range product    243550
Name: price_range_loc, dtype: int64

In [None]:
# Now, the entire dataframe, df_ords_prods_merged, will be categorized by price using conditions with loc.

# IMPORTANT! Of the three approaches, conditional assignment with loc generally has the highest performance
# (a plain for-loop over a single column typically comes second). A for-loop or a user-defined function passed to
# apply() runs Python code once per row, whereas loc with a boolean condition is vectorized: pandas evaluates the
# condition on the whole column at once and assigns all matching rows in a single step.
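
As a side note to the loc approach, the three conditions can also be expressed in one vectorized call. The sketch below uses `numpy.select`, which is a swapped-in alternative and not the method used in this notebook; the frame `df_demo` and its prices are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy prices, invented for illustration (including one missing value).
df_demo = pd.DataFrame({'prices': [1.5, 5.0, 9.0, 15.0, 22.0, np.nan]})

# One boolean condition per category, in the same order as the labels.
conditions = [
    df_demo['prices'] <= 5,
    (df_demo['prices'] > 5) & (df_demo['prices'] <= 15),
    df_demo['prices'] > 15,
]
labels = ['Low-range product', 'Mid-range product', 'High-range product']

# Rows matching no condition (here: the missing price) get the default label.
df_demo['price_range'] = np.select(conditions, labels, default='Not enough data')

print(df_demo)
```

The boundaries mirror the loc statements: 5 counts as low-range and 15 as mid-range.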

### 5. Deriving New Variables in Task 4.7 

In [16]:
# 1) If you haven’t done so already, complete the instructions in the Exercise for creating the “price_label” 
# and “busiest_day” columns.

df_ords_prods_merged.loc[df_ords_prods_merged['prices'] > 15, 'price_label'] = 'High-range product'

In [17]:
df_ords_prods_merged.loc[(df_ords_prods_merged['prices'] <= 15) & (df_ords_prods_merged['prices'] > 5), 'price_label'] = 'Mid-range product'

In [18]:
df_ords_prods_merged.loc[df_ords_prods_merged['prices'] <= 5, 'price_label'] = 'Low-range product'

In [19]:
df_ords_prods_merged.head() 

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,new_customer,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_label
0,2539329,1,1,2,8,,True,196,1,0,Soda,77,7,9.0,Mid-range product
1,2398795,1,2,3,7,15.0,False,196,1,1,Soda,77,7,9.0,Mid-range product
2,473747,1,3,3,12,21.0,False,196,1,1,Soda,77,7,9.0,Mid-range product
3,2254736,1,4,4,7,29.0,False,196,1,1,Soda,77,7,9.0,Mid-range product
4,431534,1,5,4,15,28.0,False,196,1,1,Soda,77,7,9.0,Mid-range product


In [20]:
# We want to know how busy each day of the week is. We will start by counting the number of orders 
# on each day of the week.

df_ords_prods_merged['orders_day_of_week'].value_counts(dropna = False)

0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: orders_day_of_week, dtype: int64

In [21]:
# Now, we will create a column with a for-loop that categorizes each day of the week as busiest, least busy, or
# regularly busy. We assign the busy level based on the counts above. Day 0 (Saturday) is labeled the busiest because
# it has the highest count in the value_counts() output above, and day 4 is labeled least busy because it has the lowest.

result = []

for value in df_ords_prods_merged["orders_day_of_week"]:
  if value == 0:
    result.append("Busiest day")
  elif value == 4:
    result.append("Least busy")
  else:
    result.append("Regularly busy")

In [22]:
# Now, we create a new column in the df_ords_prods_merged dataframe to hold the result from above.

df_ords_prods_merged['busiest_day'] = result


In [23]:
# Counting values in new column, busiest_day:

df_ords_prods_merged['busiest_day'].value_counts(dropna = False)

Regularly busy    22416875
Busiest day        6204182
Least busy         3783802
Name: busiest_day, dtype: int64
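
The loop above hard-codes days 0 and 4 after reading them off the counts. A minimal sketch of a variant that derives both days from `value_counts()` and labels the column with `Series.map`, assuming a toy frame `df_demo` invented for illustration; this is an alternative to the for-loop, not the method used in the task.

```python
import pandas as pd

# Toy data, invented for illustration.
df_demo = pd.DataFrame({'orders_day_of_week': [0, 0, 0, 1, 1, 4, 2, 2]})

# Find the days with the highest and lowest counts instead of hard-coding them.
day_counts = df_demo['orders_day_of_week'].value_counts()
busiest = day_counts.idxmax()
slowest = day_counts.idxmin()

# map() leaves days that are not in the dictionary as NaN; fillna() labels those.
day_labels = {busiest: 'Busiest day', slowest: 'Least busy'}
df_demo['busiest_day'] = (
    df_demo['orders_day_of_week'].map(day_labels).fillna('Regularly busy')
)

print(df_demo['busiest_day'].value_counts(dropna=False))
```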

In [24]:
# 2) The client now wants a column labeled "busiest_days" (plural) marking the two busiest days, the two slowest days, and the regular days.

df_ords_prods_merged['orders_day_of_week'].value_counts(dropna = False)

0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: orders_day_of_week, dtype: int64

In [25]:
result_2 = []

for value in df_ords_prods_merged["orders_day_of_week"]:
  if value == 0 or value == 1:
    result_2.append("Busiest days")
  elif value == 4 or value == 3:
    result_2.append("Least busy days")
  else:
    result_2.append("Regularly busy")

In [26]:
# Assigning result_2 to busiest_days column

df_ords_prods_merged['busiest_days'] = result_2

In [28]:
# 3) Checking new busiest_days column counts. 

df_ords_prods_merged['busiest_days'].value_counts(dropna = False)

Regularly busy     12916111
Busiest days       11864412
Least busy days     7624336
Name: busiest_days, dtype: int64

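#### The counts in step 3 sum to 32,404,859, the total number of rows, so they add up correctly. Also, about 37% of all rows (11,864,412) fall on the two busiest days, 0 and 1, which make up the weekend.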
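
The category totals and the weekend share can be cross-checked from the day-of-week counts alone. The sketch below copies the counts from the `value_counts()` output above and picks the two busiest and two slowest days with `nlargest` and `nsmallest`; the variable names are illustrative.

```python
import pandas as pd

# Day-of-week counts copied from the value_counts() output above.
day_counts = pd.Series({0: 6204182, 1: 5660230, 6: 4496490, 2: 4213830,
                        5: 4205791, 3: 3840534, 4: 3783802})

busiest_two = day_counts.nlargest(2).index.tolist()
slowest_two = day_counts.nsmallest(2).index.tolist()

# One label per day of the week, defaulting to 'Regularly busy'.
labels = pd.Series('Regularly busy', index=day_counts.index)
labels.loc[busiest_two] = 'Busiest days'
labels.loc[slowest_two] = 'Least busy days'

# Sum the row counts per label and compute the share of the two busiest days.
totals = day_counts.groupby(labels).sum()
share_busiest = totals['Busiest days'] / day_counts.sum()

print(totals)
print(round(share_busiest, 2))
```

On the full dataframe, the same `labels` Series could be applied with `df_ords_prods_merged['orders_day_of_week'].map(labels)` in place of the loop.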

In [29]:
# 4) Instacart's app freezes during peak times of the day. Now make a column designating the busiest hours of the day.

# Rather than by hour, they want periods of time labeled “Most orders,” “Average orders,” and “Fewest orders.” Create 
# a new column containing these labels called “busiest_period_of_day.”

df_ords_prods_merged['order_hour_of_day'].value_counts(dropna = False)

10    2761760
11    2736140
14    2689136
15    2662144
13    2660954
12    2618532
16    2535202
9     2454203
17    2087654
8     1718118
18    1636502
19    1258305
20     976156
7      891054
21     795637
22     634225
23     402316
6      290493
0      218769
1      115700
5       87961
2       69375
4       53242
3       51281
Name: order_hour_of_day, dtype: int64

In [30]:
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,new_customer,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_label,busiest_day,busiest_days
0,2539329,1,1,2,8,,True,196,1,0,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy
1,2398795,1,2,3,7,15.0,False,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy days
2,473747,1,3,3,12,21.0,False,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy days
3,2254736,1,4,4,7,29.0,False,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy days
4,431534,1,5,4,15,28.0,False,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy days


In [31]:
# Creating a dictionary to view hours of day along with counts

dict_hour_counts = df_ords_prods_merged['order_hour_of_day'].value_counts(dropna = False).to_dict()

In [32]:
# Displaying the dictionary, which is used in the for loop later.

dict_hour_counts

{10: 2761760,
 11: 2736140,
 14: 2689136,
 15: 2662144,
 13: 2660954,
 12: 2618532,
 16: 2535202,
 9: 2454203,
 17: 2087654,
 8: 1718118,
 18: 1636502,
 19: 1258305,
 20: 976156,
 7: 891054,
 21: 795637,
 22: 634225,
 23: 402316,
 6: 290493,
 0: 218769,
 1: 115700,
 5: 87961,
 2: 69375,
 4: 53242,
 3: 51281}

In [34]:
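# Computing the 75th, 50th, and 25th percentiles (in that order) of the 24 hourly counts, to use as thresholds below.

list_percentile = list(df_ords_prods_merged['order_hour_of_day'].value_counts(dropna = False).quantile([0.75, 0.5, 0.25]))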

In [35]:
list_percentile

[2556034.5, 1117230.5, 272562.0]
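
These thresholds are percentiles of the 24 hourly counts, not of the 32 million rows. A small sketch that reproduces them from the counts printed above, so the values can be checked without loading the full dataframe; `hour_counts` is an illustrative name.

```python
import pandas as pd

# Hourly counts copied from the value_counts() output above.
hour_counts = pd.Series({
    10: 2761760, 11: 2736140, 14: 2689136, 15: 2662144, 13: 2660954,
    12: 2618532, 16: 2535202, 9: 2454203, 17: 2087654, 8: 1718118,
    18: 1636502, 19: 1258305, 20: 976156, 7: 891054, 21: 795637,
    22: 634225, 23: 402316, 6: 290493, 0: 218769, 1: 115700,
    5: 87961, 2: 69375, 4: 53242, 3: 51281,
})

# quantile() interpolates linearly between neighbouring sorted counts by default.
percentile_check = hour_counts.quantile([0.75, 0.5, 0.25]).tolist()

print(percentile_check)
```

For example, the median is the midpoint of the 12th and 13th sorted counts: (976156 + 1258305) / 2 = 1117230.5.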

In [36]:
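# This nested for loop is dynamic: the thresholds come from the data, so when the notebook is run again on updated
# orders, the percentiles are recalculated. Hours with counts below the 25th percentile are labeled 'Fewest orders',
# hours from the 25th up to the 50th percentile 'Average orders', and hours at or above the 50th percentile
# 'Most orders'. The 75th percentile (list_percentile[0]) is computed but not used here.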

results_hour = []

for value in df_ords_prods_merged["order_hour_of_day"]:
    for k, v in dict_hour_counts.items():
        if value == k:
            if v < list_percentile[2]:
                results_hour.append('Fewest orders')
            elif v >= list_percentile[2] and v < list_percentile[1]:
                results_hour.append('Average orders')
            elif v >= list_percentile[1]:
                results_hour.append('Most orders')


In [37]:
# Assigning results_hour to new column called busiest_period_of_day.
    
df_ords_prods_merged['busiest_period_of_day'] = results_hour

In [38]:
# 5) Looking at counts of busiest period of day. The counts add up to the total number of dataframe rows (32,404,859).

df_ords_prods_merged['busiest_period_of_day'].value_counts(dropna = False)

Most orders       27818650
Average orders     3989881
Fewest orders       596328
Name: busiest_period_of_day, dtype: int64
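
The nested loop compares every row against all 24 dictionary entries. A minimal sketch of an alternative that labels each of the 24 hours once and then looks the label up per row with `Series.map`; it reuses the thresholds from the loop (25th and 50th percentiles), takes the hourly counts from the output above, and uses a small invented frame `df_demo` for the lookup step. This is not the method used in the task.

```python
import pandas as pd

# Hourly counts copied from the value_counts() output above.
hour_counts = pd.Series({
    10: 2761760, 11: 2736140, 14: 2689136, 15: 2662144, 13: 2660954,
    12: 2618532, 16: 2535202, 9: 2454203, 17: 2087654, 8: 1718118,
    18: 1636502, 19: 1258305, 20: 976156, 7: 891054, 21: 795637,
    22: 634225, 23: 402316, 6: 290493, 0: 218769, 1: 115700,
    5: 87961, 2: 69375, 4: 53242, 3: 51281,
})

q25 = hour_counts.quantile(0.25)
q50 = hour_counts.quantile(0.5)

def label_hour(count):
    """Return the period label for one hourly count (same cut-offs as the loop)."""
    if count < q25:
        return 'Fewest orders'
    elif count < q50:
        return 'Average orders'
    return 'Most orders'

# One label per hour of the day: 24 function calls in total.
hour_labels = hour_counts.map(label_hour)

# Cross-check: summing the hourly counts per label should match the totals above.
totals = hour_counts.groupby(hour_labels).sum()

# Per-row lookup on a toy frame (hours invented for illustration).
df_demo = pd.DataFrame({'order_hour_of_day': [8, 7, 12, 3]})
df_demo['busiest_period_of_day'] = df_demo['order_hour_of_day'].map(hour_labels)

print(totals)
print(df_demo)
```

On the full dataframe, the last step would map `df_ords_prods_merged['order_hour_of_day']` through `hour_labels` in the same way.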

In [39]:
df_ords_prods_merged.shape

(32404859, 18)

In [40]:
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,new_customer,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_label,busiest_day,busiest_days,busiest_period_of_day
0,2539329,1,1,2,8,,True,196,1,0,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Most orders
1,2398795,1,2,3,7,15.0,False,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy days,Average orders
2,473747,1,3,3,12,21.0,False,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy days,Most orders
3,2254736,1,4,4,7,29.0,False,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy days,Average orders
4,431534,1,5,4,15,28.0,False,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy days,Most orders


In [55]:
# 6) Ensure notebook is clean and structured.

In [41]:
# 7) Export final dataframe to pickle format.

df_ords_prods_merged.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'ords_prods_merged_busy_updates.pkl'))
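
After an export, it can be worth reading the file back to confirm that nothing was lost. A minimal sketch of such a round-trip check on a toy frame written to a temporary directory; the file name `demo_export.pkl` and the frame `df_demo` are hypothetical and stand in for the real export path.

```python
import os
import tempfile

import pandas as pd

# Toy frame, invented for illustration.
df_demo = pd.DataFrame({
    'order_hour_of_day': [8, 7, 12],
    'busiest_period_of_day': ['Most orders', 'Average orders', 'Most orders'],
})

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'demo_export.pkl')
    df_demo.to_pickle(out_path)
    df_check = pd.read_pickle(out_path)

# equals() compares shape, column names, and values.
round_trip_ok = df_check.equals(df_demo)
print(round_trip_ok)
```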