# 4.7 Deriving New Variables

### This notebook contains:
    01. Importing Libraries
    02. Importing Data
    03. Deriving New Variables - Exercise
        A. If-Statements with User-Defined Functions
        B. If-Statements with the loc() Function
        C. If-Statements with For-Loops
    04. Deriving New Variables - Task
   

## 01. Importing Libraries

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

## 02. Improting Data

In [2]:
# turning project folder path into string
path = r'/Users/lisa/DA Projects/12-2022 Instacart Basket Analysis'

In [3]:
# Importing merged orders and prods data
ords_prods_merged = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged.pkl'))

## 03. Deriving New Variables - Exercise

In [4]:
# creating subset for dataset
df = ords_prods_merged[:1000000]

### A. If-Statements with User-Defined Functions

In [5]:
# define Function for Price Labels
def price_label(row):

  if row['prices'] <= 5:
    return 'Low-range product'
  elif (row['prices'] > 5) and (row['prices'] <= 15):
    return 'Mid-range product'
  elif row['prices'] > 15:
    return 'High range'
  else: return 'Not enough data'

In [6]:
# using function on dataframe
df['price_range'] = df.apply(price_label, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_range'] = df.apply(price_label, axis=1)


In [8]:
# checking frequency
df['price_range'].value_counts(dropna = False)

Mid-range product    756450
Low-range product    243550
Name: price_range, dtype: int64

In [9]:
# looking for most expensive product
df['prices'].max()

14.8

#### Summary
Used when reusability is high priority and you want to use specific logic that's not available in built-in Python functions.
Runs slower than loc() function and can take a lot of processing power.
Can return error messages.

###  B. If-Statements with the loc() Function

In [10]:
# Condition high-range product
df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'


In [11]:
# Condition mid-range product
df.loc[(df['prices'] <= 15) & (df['prices'] > 5), 'price_range_loc'] = 'Mid-range product' 

In [12]:
# Condition low-range product
df.loc[df['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [13]:
# checking frequency
df['price_range_loc'].value_counts(dropna = False)

Mid-range product    756450
Low-range product    243550
Name: price_range_loc, dtype: int64

#### Labeling whole ords prods dataframe

In [14]:
# Condition high-range product
ords_prods_merged.loc[ords_prods_merged['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [15]:
# Condition mid-range product
ords_prods_merged.loc[(ords_prods_merged['prices'] <= 15) & (df['prices'] > 5), 'price_range_loc'] = 'Mid-range product' 

In [16]:
# Condition low-range product
ords_prods_merged.loc[ords_prods_merged['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [19]:
# checking frequency
ords_prods_merged['price_range_loc'].value_counts(dropna = False)

NaN                   21104410
Low-range product     10126321
Mid-range product       756450
High-range product      417678
Name: price_range_loc, dtype: int64

#### Summary
Used to apply the conditional logic of an if-statement to a function without explicitly creating an if-else construct. Good for filtering operations.
Faster than a user-defined function. Preferable for large dataframes.

### C. If-Statements with For-Loops

In [27]:
ords_prods_merged.head(0)

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc


In [25]:
# Checking frequency for order day of the week
ords_prods_merged['order_day_of_week'].value_counts(dropna = False)

0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: order_day_of_week, dtype: int64

In [30]:
# creating for-loop for daily order activity
result = []

for value in ords_prods_merged["order_day_of_week"]:
  if value == 0:
    result.append("Busiest day")
  elif value == 4:
    result.append("Least busy")
  else:
    result.append("Regularly busy")

In [31]:
result

['Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Least busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Least busy',
 'Regularly busy',
 'Busiest day',
 'Regularly busy',
 'Reg

In [32]:
# Combining results with ords prods merged
ords_prods_merged['busiest_day'] = result

In [33]:
# frequency check
ords_prods_merged['busiest_day'].value_counts(dropna = False)

Regularly busy    22416875
Busiest day        6204182
Least busy         3783802
Name: busiest_day, dtype: int64

#### Summary
Runs the same block of code for multiple items. They’re used to perform the same function on multiple elements, for instance, by running through an entire dataframe and performing a function on each row within that dataframe.
Speeds up performance as the function only runs through one column (as specified in code) instead of the whole dataframe.

## 04. Deriving New Variables - Task

In [34]:
# Getting Overview 
ords_prods_merged.head(0)

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,busiest_day


##### 2.Suppose your clients have changed their minds about the labels you created in your “busiest_day” column. Now, they want “Busiest day” to become “Busiest days” (plural). This label should correspond with the two busiest days of the week as opposed to the single busiest day. At the same time, they’d also like to know the two slowest days. Create a new column for this using a suitable method.

In [35]:
# Checking frequency for order day of the week
ords_prods_merged['order_day_of_week'].value_counts(dropna = False)

0    6204182
1    5660230
6    4496490
2    4213830
5    4205791
3    3840534
4    3783802
Name: order_day_of_week, dtype: int64

The two busiest days are Saturday (0) and Sunday (1). The two least busy days are Wednesday (4) and Tuesday (3).
Friday (6), Monday (2) and Thursday (5) are "Regularly busy".

In [53]:
# creating for-loop for daily order activity
result2 = []

for value in ords_prods_merged["order_day_of_week"]:
  if value == 0 or value == 1:
    result2.append("Busy day")
  elif value == 4 or value == 3:
    result2.append("Less busy")
  else:
    result2.append("Regularly busy")

In [54]:
result2

['Regularly busy',
 'Less busy',
 'Less busy',
 'Less busy',
 'Less busy',
 'Regularly busy',
 'Busy day',
 'Busy day',
 'Busy day',
 'Less busy',
 'Busy day',
 'Regularly busy',
 'Regularly busy',
 'Busy day',
 'Busy day',
 'Regularly busy',
 'Regularly busy',
 'Less busy',
 'Less busy',
 'Less busy',
 'Less busy',
 'Less busy',
 'Less busy',
 'Busy day',
 'Busy day',
 'Busy day',
 'Regularly busy',
 'Regularly busy',
 'Busy day',
 'Regularly busy',
 'Regularly busy',
 'Busy day',
 'Regularly busy',
 'Regularly busy',
 'Regularly busy',
 'Less busy',
 'Regularly busy',
 'Less busy',
 'Busy day',
 'Busy day',
 'Regularly busy',
 'Less busy',
 'Less busy',
 'Regularly busy',
 'Regularly busy',
 'Busy day',
 'Busy day',
 'Regularly busy',
 'Busy day',
 'Busy day',
 'Less busy',
 'Regularly busy',
 'Busy day',
 'Busy day',
 'Busy day',
 'Busy day',
 'Busy day',
 'Regularly busy',
 'Less busy',
 'Regularly busy',
 'Busy day',
 'Less busy',
 'Busy day',
 'Regularly busy',
 'Regularly busy',

In [55]:
# Combining results with ords prods merged
ords_prods_merged['busiest_days'] = result2

##### 3.Check the values of this new column for accuracy. Note any observations in markdown format.

In [56]:
# frequency check
ords_prods_merged['busiest_days'].value_counts(dropna = False)

Regularly busy    12916111
Busy day          11864412
Less busy          7624336
Name: busiest_days, dtype: int64

In [52]:
# check 2: calculating order days of week
print(6204182+5660230) 
print('busy days')
print(3783802+3840534)
print('less busy')
print(4496490+4213830+4205791)
print('reg busy')

11864412
busy days
7624336
less busy
12916111
reg busy


In [57]:
# checking column headers
ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,busiest_day,busiest_days
0,2539329,1,1,2,8,,True,196,1,0,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Regularly busy
1,2398795,1,2,3,7,15.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Less busy
2,473747,1,3,3,12,21.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Less busy
3,2254736,1,4,4,7,29.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Least busy,Less busy
4,431534,1,5,4,15,28.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Least busy,Less busy


I decided to leave the busiest_day varibale in the dataframe.

##### 4.When too many users make Instacart orders at the same time, the app freezes. The senior technical officer at Instacart wants you to identify the busiest hours of the day. Rather than by hour, they want periods of time labeled “Most orders,” “Average orders,” and “Fewest orders.” Create a new column containing these labels called “busiest_period_of_day.”

In [60]:
# checking frequency in order hour of day per hour
ords_prods_merged['order_hour_of_day'].value_counts(dropna = False)

10    2761760
11    2736140
14    2689136
15    2662144
13    2660954
12    2618532
16    2535202
9     2454203
17    2087654
8     1718118
18    1636502
19    1258305
20     976156
7      891054
21     795637
22     634225
23     402316
6      290493
0      218769
1      115700
5       87961
2       69375
4       53242
3       51281
Name: order_hour_of_day, dtype: int64

I decided to categorize the time ranges by determining the quartiles. All order hours frequencies over 75% of all orders will be hours with "Most orders". All hours in between 75% and 50% are "Average orders" and everything below 50% will be "Fewest orders".

In [72]:
# Calculating quartiles
(ords_prods_merged['order_hour_of_day'].value_counts(dropna = False).quantile([0.25,0.5,0.75]))

0.25     272562.0
0.50    1117230.5
0.75    2556034.5
Name: order_hour_of_day, dtype: float64

#### Results:
    Most orders: 10,11,12,13,14,15
    Average orders: 8,9,16,17,18,19
    Fewest orders: 0,1,2,3,4,5

In [80]:
# creating for-loop for busiest time periods
result3 = []

for value in ords_prods_merged["order_hour_of_day"]:
  if value <= 5:
    result3.append("Fewest orders")
  elif value >= 10:
    result3.append("Most orders")
  else:
    result3.append("Average orders")

In [81]:
result3

['Average orders',
 'Average orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Average orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Average orders',
 'Most orders',
 'Most orders',
 'Mos

In [82]:
# Combining results with ords prods merged
ords_prods_merged['busiest_period_of_day'] = result3

In [83]:
# check header
ords_prods_merged.head(0)

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,busiest_day,busiest_days,busiest_period_of_day


##### 5.Print the frequency for this new column.

In [84]:
# Checking frequency
ords_prods_merged['busiest_period_of_day'].value_counts(dropna = False)

Most orders       26454663
Average orders     5353868
Fewest orders       596328
Name: busiest_period_of_day, dtype: int64

##### 7.Export your dataframe as a pickle file (since you added new columns) and store it correctly in your “Prepared Data” folder.

In [85]:
# Exporting merged ords as PKL
ords_prods_merged.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_flagged.pkl'))