# Deriving new variables 

### This script contains the following points:
#### 1. Importing dataset
#### 2. Creating 'price_label' column
#### 3. Creating 'busiest_day' column 
#### 4. Update 'busiest_day' to 'busiest_days'
#### 5. Creating new column 'busiest_period_of_day'
#### 6. Exporting the updated dataframe

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

# 1. Importing dataset 

In [6]:
# Project folder path 
path = r'C:\Users\hp\08-2024 Instacart Basket Analysis\Data'
# Import orders products merged
df_ords_prods = pd.read_pickle(os.path.join(path, 'Prepared Data', 'orders_products_merged.pkl'))

# 2. Creating 'price_label' column 

In [12]:
# Create a subset of the first million rows
df = df_ords_prods[:1000000]

In [16]:
df.shape

(1000000, 14)

In [49]:
# Define the price_label function
def price_label(row):
  if row['prices'] <= 5:
    return 'Low-range product'
  elif (row['prices'] > 5) and (row['prices'] <= 15):
    return 'Mid-range product'
  elif row['prices'] > 15:
    return 'High range'
  else:
    return 'Not enough data'

In [18]:
#Apply the function 
df['price_range'] = df.apply(price_label, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_range'] = df.apply(price_label, axis=1)


In [20]:
df['price_range'].value_counts(dropna=False)

price_range
Mid-range product    652638
Low-range product    338018
High range             9344
Name: count, dtype: int64
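
The warning above is raised because 'df' is a slice of 'df_ords_prods', so pandas cannot tell whether the assignment is meant to reach the original dataframe. A minimal sketch of one common way around it is to take an explicit copy of the slice before adding columns. The 'toy' dataframe and the 'price_doubled' column are made up for illustration and stand in for the real data.

```python
import pandas as pd

# Hypothetical toy data standing in for df_ords_prods
toy = pd.DataFrame({'prices': [2.5, 7.0, 14.9, 20.0]})

# .copy() makes the subset an independent dataframe,
# so adding a column to it does not trigger the warning
subset = toy[:3].copy()
subset['price_doubled'] = subset['prices'] * 2

# The original dataframe is left untouched
print('price_doubled' in toy.columns)
print(subset)
```

Whether a copy is acceptable depends on memory: copying one million rows is cheap, but copying the full dataframe may not be.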

#### Now that we've tested the new column on a subset, we will use another method, 'loc', on the whole dataframe. Assigning through 'loc' on the original dataframe rather than on a slice avoids the warning message received above.

In [22]:
# Use loc on the entire dataframe
df_ords_prods.loc[df_ords_prods['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [24]:
df_ords_prods.loc[(df_ords_prods['prices'] <= 15) & (df_ords_prods['prices'] > 5), 'price_range_loc'] = 'Mid-range product' 

In [26]:
df_ords_prods.loc[df_ords_prods['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [28]:
df_ords_prods['price_range_loc'].value_counts(dropna=False)

price_range_loc
Mid-range product     21860860
Low-range product     10126321
High-range product      416980
Name: count, dtype: int64
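
The three 'loc' assignments above can also be written as a single step. The sketch below uses numpy's `np.select` on a made-up 'toy' dataframe. It is an alternative, not the method used in this script, and it differs in one respect: rows with a missing price get the label 'Not enough data' (the fallback of the 'price_label' function) instead of staying empty.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, including boundary values and a missing price
toy = pd.DataFrame({'prices': [1.0, 5.0, 5.1, 15.0, 15.1, np.nan]})

# Same thresholds as the three loc statements
conditions = [
    toy['prices'] <= 5,
    (toy['prices'] > 5) & (toy['prices'] <= 15),
    toy['prices'] > 15,
]
labels = ['Low-range product', 'Mid-range product', 'High-range product']

# The first matching condition wins; NaN prices match none of them
toy['price_range_loc'] = np.select(conditions, labels, default='Not enough data')

print(toy)
```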

# 3. Creating 'busiest_day' column 

In [36]:
# Check which day has the most orders
df_ords_prods['order_dow'].value_counts(dropna=False)

order_dow
0    6204090
1    5660099
6    4496415
2    4213729
5    4205663
3    3840449
4    3783716
Name: count, dtype: int64

#### Days of the week
#### 0 : Saturday
#### 1 : Sunday
#### 2 : Monday
#### 3 : Tuesday
#### 4 : Wednesday
#### 5 : Thursday
#### 6 : Friday
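
The day numbering listed above can be kept in a dictionary and attached to the data with `Series.map`, which makes later outputs easier to read. This is a small sketch on made-up data; the 'day_name' column is a hypothetical name and is not created anywhere in this script.

```python
import pandas as pd

# Day numbering as listed above (0 = Saturday)
day_names = {
    0: 'Saturday',
    1: 'Sunday',
    2: 'Monday',
    3: 'Tuesday',
    4: 'Wednesday',
    5: 'Thursday',
    6: 'Friday',
}

# Hypothetical toy data standing in for df_ords_prods
toy = pd.DataFrame({'order_dow': [0, 4, 6, 2]})

# map() looks up each order_dow value in the dictionary
toy['day_name'] = toy['order_dow'].map(day_names)

print(toy)
```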

In [59]:
# Use for-loop to determine whether orders are on "busiest day" (0 = Saturday), "least busy" (4 = Wednesday), or "regularly busy" (other days of the week).
result = []

for value in df_ords_prods["order_dow"]:
  if value == 0:
    result.append("Busiest day")
  elif value == 4:
    result.append("Least busy")
  else:
    result.append("Regularly busy")

In [42]:
#Combine the result with the dataframe : 
df_ords_prods['busiest day'] = result

In [44]:
df_ords_prods['busiest day'].value_counts(dropna=False)

busiest day
Regularly busy    22416355
Busiest day        6204090
Least busy         3783716
Name: count, dtype: int64

#### Key Insights:

**Busiest Day: Saturday (6,204,090 orders)**

**Least Busy Day: Wednesday (3,783,716 orders)**

**Regularly Busy Days: Sunday, Friday, Monday, Thursday, Tuesday**
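
The busiest and least busy days were read off the frequency table and hard-coded in the loop. As a hedged alternative, the sketch below derives both from `value_counts()` with `idxmax()` and `idxmin()`, then labels the rows in one vectorized step with `np.select`. The 'toy' data is invented for illustration, and ties between days are not handled.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: day 0 occurs most often, day 4 least often
toy = pd.DataFrame({'order_dow': [0, 0, 0, 1, 1, 2, 2, 4]})

# Derive the extreme days from the frequency table
counts = toy['order_dow'].value_counts()
busiest = counts.idxmax()
slowest = counts.idxmin()

# Label every row without an explicit Python loop
toy['busiest day'] = np.select(
    [toy['order_dow'] == busiest, toy['order_dow'] == slowest],
    ['Busiest day', 'Least busy'],
    default='Regularly busy',
)

print(toy['busiest day'].value_counts())
```

On a dataframe with tens of millions of rows, a vectorized assignment like this usually runs much faster than appending to a list row by row.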

# 4. Update 'busiest_day' to 'busiest_days'

In [80]:
#Use loop for the update 
result_2 = []

for value in df_ords_prods["order_dow"]:
  if value == 0 or value == 1:
    result_2.append("Busiest days")
  elif value == 4 or value == 3:
    result_2.append("Slowest days")
  else:
    result_2.append("Regularly busy")

In [84]:
#Combine the result with the dataframe : 
df_ords_prods['busiest days'] = result_2

In [86]:
df_ords_prods['busiest days'].value_counts(dropna=False)

busiest days
Regularly busy    12915807
Busiest days      11864189
Slowest days       7624165
Name: count, dtype: int64

#### Key Insights:
The 'Regularly busy' days (Monday, Thursday, and Friday) have the highest total number of orders (12,915,807), followed by the 'Busiest days' (Saturday and Sunday), which together account for 11,864,189 orders. Because the weekend total comes from only two days, the weekend has the highest activity per day.
The 'Slowest days' (Tuesday and Wednesday) have the fewest orders (7,624,165).

#### Comparison of Insights:

Saturday remains the single busiest day, and the weekend (Saturday and Sunday) has the highest average volume per day, about 5.9 million orders, even though its combined total is slightly below that of the three 'Regularly busy' days. Tuesday and Wednesday are consistently the slowest days, but together they still account for more than 7.6 million orders. The 'Regularly busy' group now covers three days (Monday, Thursday, and Friday) instead of five, at roughly 4.3 million orders per day.
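
Because the three groups cover different numbers of days, comparing them per day is fairer than comparing raw totals. The sketch below recomputes the group totals from the 'order_dow' counts shown in section 3 and divides them by the number of days in each group. The variable names are chosen here for illustration.

```python
import pandas as pd

# Order counts per order_dow, copied from the value_counts() output in section 3
counts = pd.Series({
    0: 6204090,
    1: 5660099,
    2: 4213729,
    3: 3840449,
    4: 3783716,
    5: 4205663,
    6: 4496415,
})

# Same grouping as the loop in section 4
group_of_day = {
    0: 'Busiest days',
    1: 'Busiest days',
    2: 'Regularly busy',
    3: 'Slowest days',
    4: 'Slowest days',
    5: 'Regularly busy',
    6: 'Regularly busy',
}

# Grouping by a dict maps each index label (day) to its group
totals = counts.groupby(group_of_day).sum()
n_days = counts.groupby(group_of_day).count()
per_day = totals / n_days

print(pd.DataFrame({'total': totals, 'days': n_days, 'orders_per_day': per_day}))
```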

In [89]:
# Checking the output of the dataframe : 
df_ords_prods.head()

| | product_id | product_name | aisle_id | department_id | prices | order_id | user_id | order_number | order_dow | order_hour_of_day | days_since_prior_order | add_to_cart_order | reordered | _merge | price_range_loc | busiest day | busiest days |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 3139998 | 138 | 28 | 6 | 11 | 3.0 | 5 | 0 | both | Mid-range product | Regularly busy | Regularly busy |
| 1 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 1977647 | 138 | 30 | 6 | 17 | 20.0 | 1 | 1 | both | Mid-range product | Regularly busy | Regularly busy |
| 2 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 389851 | 709 | 2 | 0 | 21 | 6.0 | 20 | 0 | both | Mid-range product | Busiest day | Busiest days |
| 3 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 652770 | 764 | 1 | 3 | 13 | NaN | 10 | 0 | both | Mid-range product | Regularly busy | Slowest days |
| 4 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 | 1813452 | 764 | 3 | 4 | 17 | 9.0 | 11 | 1 | both | Mid-range product | Least busy | Slowest days |


# 5. Creating new column 'busiest_period_of_day'

In [92]:
# Check which hour has the most orders
df_ords_prods['order_hour_of_day'].value_counts(dropna=False)

order_hour_of_day
10    2761700
11    2736077
14    2689083
15    2662082
13    2660906
12    2618479
16    2535141
9     2454151
17    2087598
8     1718068
18    1636473
19    1258275
20     976138
7      891030
21     795624
22     634220
23     402310
6      290487
0      218767
1      115700
5       87959
2       69372
4       53241
3       51280
Name: count, dtype: int64

In [94]:
# Initialize an empty list to store the categories
hour_categories = []

# Categorize each hour based on the provided conditions
for value in df_ords_prods["order_hour_of_day"]:
    if value in [10, 11, 14, 15, 13]:
        hour_categories.append("Most orders")
    elif value in [12, 16, 9, 17, 8, 18]:
        hour_categories.append("Average orders")
    else:
        hour_categories.append("Fewest orders")

# Add the new column to the DataFrame
df_ords_prods['busiest_period_of_day'] = hour_categories

# Verify the new column
print(df_ords_prods['busiest_period_of_day'].value_counts())


busiest_period_of_day
Most orders       13509848
Average orders    13049910
Fewest orders      5844403
Name: count, dtype: int64


#### The distribution of orders is relatively balanced between 'Most orders' (13,509,848) and 'Average orders' (13,049,910), while 'Fewest orders' accounts for far fewer orders (5,844,403) even though it covers the remaining 13 hours of the day. Understanding these patterns can help in optimizing resource allocation and handling peak loads effectively.
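
The hour categories can also be assigned without a loop. The following sketch applies the same hour lists as the cell above to a made-up 'toy' dataframe, using `Series.isin` together with `np.select`. It is one possible vectorized form, not the code that produced the counts above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_ords_prods
toy = pd.DataFrame({'order_hour_of_day': [10, 12, 3, 15, 18, 23]})

# Same hour groups as in the loop above
most_hours = [10, 11, 13, 14, 15]
average_hours = [8, 9, 12, 16, 17, 18]

# Every hour outside both lists falls through to the default label
toy['busiest_period_of_day'] = np.select(
    [
        toy['order_hour_of_day'].isin(most_hours),
        toy['order_hour_of_day'].isin(average_hours),
    ],
    ['Most orders', 'Average orders'],
    default='Fewest orders',
)

print(toy)
```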

# 6. Exporting the updated dataframe

In [103]:
df_ords_prods.to_pickle(os.path.join(path, 'Prepared Data', 'orders_products_derived.pkl'))
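
After exporting, it can be worth confirming that the pickle reloads with the derived columns intact. The sketch below shows such a round-trip check on a small made-up dataframe written to a temporary folder; the file name 'toy_derived.pkl' is hypothetical and unrelated to the project files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data with one derived column
toy = pd.DataFrame({
    'order_dow': [0, 4],
    'busiest day': ['Busiest day', 'Least busy'],
})

# Write to a temporary folder and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_derived.pkl')
    toy.to_pickle(out_path)
    reloaded = pd.read_pickle(out_path)

# equals() compares shape, column names, dtypes and values
print(reloaded.equals(toy))
```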