# 5. Deriving new variables
** **
## Table of contents:

1. Importing libraries <br>
2. Importing dataframe <br>
3. Deriving a new variable
    - 3.1 Creating an "indicator column" for "prices"
        - 3.1.1 Method number 1: defining a custom function
        - 3.1.2 Method number 2: using the .loc function
        - 3.1.3 Creating the new "price_range" column in the real dataframe
    - 3.2 Creating a new column to check the busiest day
        - 3.2.1 Creating the new column with a for-loop
4. Tasks
    - 4.1 Creating a new column to check the two busiest and the two least busy days
    - 4.2 Creating a new column to check the hours with most orders
    - 4.3 Exporting the new dataframe with derived variables
** **

# 1. Importing libraries
** **

In [1]:
import pandas as pd
import numpy as np
import os

# 2. Importing dataframe
** **

In [2]:
# Creating a path variable for the folder
path = r'C:\Users\Simone\Desktop\Career Foundry\Esercizi modulo 5\Instacart basket analysis'

In [3]:
# Importing the merged dataframe from Prepared Data
df_ords_prods_merged = pd.read_pickle(os.path.join(path, '02. Data', 'Prepared Data', 'orders_products_merged.pkl'))

In [4]:
# Checking the head
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices
0,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0
1,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0
2,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0
3,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0
4,3367565,1,6,2,7,19.0,196,1,1,Soda,77,7,9.0


# 3. Deriving a new variable
** **

This section shows how to derive (create) a new variable (column) from existing variables.

## 3.1 Creating an "indicator column" for "prices"

First of all, we will test the process on a subset.

In [5]:
# Creating a subset to test (containing only 1 million rows)
df = df_ords_prods_merged[:1000000]

In [6]:
# Checking the shape of the subset
df.shape

(1000000, 13)

### 3.1.1 Method number 1: defining a custom function

In [7]:
# Defining a new function
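def price_label(row):
  # Label a row according to the value in its 'prices' column
  if row['prices'] <= 5:
    return 'Low-range product'
  elif (row['prices'] > 5) and (row['prices'] <= 15):
    return 'Mid-range product'
  elif row['prices'] > 15:
    return 'High range'
  else:
    # Reached only when no comparison is true, e.g. a missing (NaN) price
    return 'Not enough data'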

The code above defines a function called "price_label" that checks each row of the dataframe and returns a label ('Low-range product', 'Mid-range product' or 'High range') according to the value in the "prices" column, or 'Not enough data' when none of the conditions is met. <br>
Let's apply the function to our subset, using .apply.

In [8]:
# Using the new function on the subset to create the new column 'price_range'
df['price_range'] = df.apply(price_label, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price_range'] = df.apply(price_label, axis=1)


<b> Observations: </b> <br>
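^ This is not an error message: pandas is warning us that "df" is a slice of "df_ords_prods_merged", so the assignment might be applied to a copy instead of the original dataframe, and it recommends using the .loc method. <br>
We can ignore this message here, because the subset is only used for testing. The function has been applied anyway.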
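A common way to avoid this warning is to make the subset an explicit copy before adding columns to it. The sketch below assumes a small made-up dataframe ("toy") in place of the real one, and the column name "is_cheap" is purely illustrative.

```python
import pandas as pd

# Toy stand-in for the merged dataframe (values are made up for illustration)
toy = pd.DataFrame({'prices': [3.0, 9.0, 14.0, 20.0]})

# .copy() makes the subset an independent dataframe instead of a slice
# that pandas may treat as a view of the original
subset = toy[:3].copy()

# Adding a column to the explicit copy does not trigger the warning
subset['is_cheap'] = subset['prices'] <= 5

# The original dataframe is left untouched
print(subset)
print(toy.columns.tolist())
```

In the notebook this would correspond to creating the subset with .copy() appended to the slice, at the cost of some extra memory.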

In [9]:
# Checking the values in the new column
df['price_range'].value_counts(dropna = False)

Mid-range product    631969
Low-range product    368031
Name: price_range, dtype: int64

No product is labeled as "High range". This means that, in this subset, no product costs more than 15€.

In [10]:
# Checking the maximum value of the prices column in the subset
df['prices'].max()

14.0

<b> Observations: </b> <br>
This confirms that there aren't products costing more than 15€ in the subset created.

### 3.1.2 Method number 2: using the .loc function

This second method uses .loc to select the rows that meet each condition and assign the corresponding label to the new column.

In [11]:
# Creating first condition
df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[df['prices'] > 15, 'price_range_loc'] = 'High-range product'


In [12]:
# Creating second condition
df.loc[(df['prices'] <= 15) & (df['prices'] > 5), 'price_range_loc'] = 'Mid-range product'

In [13]:
# Creating third condition
df.loc[df['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [14]:
# Checking the values in the new column
df['price_range_loc'].value_counts(dropna = False)

Mid-range product    631969
Low-range product    368031
Name: price_range_loc, dtype: int64

<b> Observations: </b> <br>
The results are the same as with the first method.
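As a third option, not used in this notebook, the same thresholds could be expressed as bins with pd.cut. This is only a sketch on made-up prices: the column name "price_range_cut" is an assumption, and the result is a categorical column rather than plain strings.

```python
import numpy as np
import pandas as pd

# Toy prices (made up), including the boundary values 5 and 15 and a missing price
toy = pd.DataFrame({'prices': [1.0, 5.0, 5.1, 15.0, 15.1, np.nan]})

# Bins are right-inclusive by default: (-inf, 5], (5, 15], (15, inf)
toy['price_range_cut'] = pd.cut(
    toy['prices'],
    bins=[-np.inf, 5, 15, np.inf],
    labels=['Low-range product', 'Mid-range product', 'High-range product'],
)

print(toy)
```

As with the .loc method, a missing price stays missing (NaN) instead of receiving a label.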

### 3.1.3 Creating the new "price_range" column in the real dataframe

We don't need the subset anymore. It's time to derive the new variable in the real dataframe, using the .loc method.

In [15]:
# Condition 1
df_ords_prods_merged.loc[df_ords_prods_merged['prices'] > 15, 'price_range_loc'] = 'High-range product'

In [16]:
# Condition 2
df_ords_prods_merged.loc[(df_ords_prods_merged['prices'] <= 15) & (df_ords_prods_merged['prices'] > 5), 'price_range_loc'] = 'Mid-range product'

In [17]:
# Condition 3
df_ords_prods_merged.loc[df_ords_prods_merged['prices'] <= 5, 'price_range_loc'] = 'Low-range product'

In [18]:
# Checking the values in the new column
df_ords_prods_merged['price_range_loc'].value_counts(dropna = False)

Mid-range product     20462144
Low-range product      9476774
High-range product      389845
Name: price_range_loc, dtype: int64

Here is the distribution of the products in the dataframe. About 67% of the products ordered are mid-range, about 31% are low-range and only about 1% are high-range.
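The share of each label can be computed from the counts shown above. The sketch below rebuilds those counts by hand in a Series called "counts" (an illustrative name); on the real dataframe, value_counts(normalize = True) should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({
    'Mid-range product': 20462144,
    'Low-range product': 9476774,
    'High-range product': 389845,
})

# Share of each label as a percentage of all rows
shares = (counts / counts.sum() * 100).round(1)
print(shares)
```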

In [19]:
# Testing the head
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_range_loc
0,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,Mid-range product
1,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,Mid-range product
2,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,Mid-range product
3,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,Mid-range product
4,3367565,1,6,2,7,19.0,196,1,1,Soda,77,7,9.0,Mid-range product


## 3.2 Creating a new column to check the busiest day

In this section we will derive a new variable that indicates whether an order was placed on the busiest day, on the least busy day or on a regularly busy day.

### 3.2.1 Creating the new column with a for-loop

In [20]:
# Checking the frequency of the relevant column
df_ords_prods_merged['orders_day_of_week'].value_counts(dropna = False)

0    5779087
1    5303718
6    4190948
5    3952326
2    3947564
3    3600589
4    3554531
Name: orders_day_of_week, dtype: int64

<b> Observations: </b> <br>
According to the value counts, the busiest day (highest frequency) is day 0 (Saturday), while the least busy is day 4 (Wednesday).

In [21]:
# Building a list with one label for each value of orders_day_of_week
result = []

for value in df_ords_prods_merged["orders_day_of_week"]:
  if value == 0:
    result.append("Busiest day")
  elif value == 4:
    result.append("Least busy")
  else:
    result.append("Regularly busy")

In [22]:
# Creating a new column in the dataframe, containing the list created
df_ords_prods_merged['busiest_day'] = result
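The loop above visits more than 30 million values one at a time. A vectorized alternative, sketched here on a made-up column, maps the two special days through a dictionary and fills everything else with the default label. The names "toy" and "day_labels" are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the orders_day_of_week column (values are made up)
toy = pd.DataFrame({'orders_day_of_week': [0, 4, 2, 0, 6, 3]})

# Days with a special label; every other day falls back to 'Regularly busy'
day_labels = {0: 'Busiest day', 4: 'Least busy'}

toy['busiest_day'] = (
    toy['orders_day_of_week'].map(day_labels).fillna('Regularly busy')
)

print(toy)
```

Like the else branch of the loop, this labels any value other than 0 or 4 (including a missing one) as "Regularly busy".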

In [23]:
# Checking the head
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_range_loc,busiest_day
0,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy
1,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy
2,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy
3,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy
4,3367565,1,6,2,7,19.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy


In [24]:
# Counting the values
df_ords_prods_merged['busiest_day'].value_counts(dropna = False)

Regularly busy    20995145
Busiest day        5779087
Least busy         3554531
Name: busiest_day, dtype: int64

About 69% of the orders are placed on days that are regularly busy, while almost 6 million (about 19%) are placed on the busiest day (Saturday) and about 3.6 million (about 12%) on the least busy day (Wednesday).

# 4. Tasks
** **

Before diving into the tasks, I will rename the first derived column, because its current name is not intuitive enough.

In [25]:
# Renaming the price_range_loc column
df_ords_prods_merged.rename(columns = {'price_range_loc' : 'price_label'}, inplace = True)

In [26]:
# Testing the change
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_label,busiest_day
0,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy
1,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy
2,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy
3,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy
4,3367565,1,6,2,7,19.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy


## 4.1 Creating a new column to check the two busiest and the two least busy days

We are going to create a new column that will flag the two most busy days and the two least busy days.

In [27]:
# Building a new list with one label for each value of orders_day_of_week
result_2 = []

for value in df_ords_prods_merged["orders_day_of_week"]:
  if value == 0 or value == 1:
    result_2.append("Busiest day")
  elif value == 4 or value == 3:
    result_2.append("Least busy")
  else:
    result_2.append("Regularly busy")

0 and 1 (Saturday and Sunday) will be flagged as "Busiest day", while 3 and 4 (Tuesday and Wednesday) will be flagged as "Least busy".

In [28]:
# Creating a new column in the dataframe, containing the list created
df_ords_prods_merged['busiest_days'] = result_2
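In the loop above the day codes are typed in by hand after reading the value_counts() output. As a hedged sketch, the two busiest and the two least busy days could instead be derived from the frequencies themselves. The data below is made up (with no tied counts), and the names "freq", "busiest" and "least_busy" are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for orders_day_of_week: day 0 appears 7 times, day 1 six times, etc.
days = pd.Series(
    np.repeat([0, 1, 2, 3, 4, 5, 6], [7, 6, 4, 2, 1, 3, 5]),
    name='orders_day_of_week',
)

# Frequency of each day, then the two most and two least frequent day codes
freq = days.value_counts()
busiest = freq.nlargest(2).index
least_busy = freq.nsmallest(2).index

# The first matching condition wins; anything else gets the default label
labels = np.select(
    [days.isin(busiest), days.isin(least_busy)],
    ['Busiest day', 'Least busy'],
    default='Regularly busy',
)

print(sorted(busiest), sorted(least_busy))
print(pd.Series(labels).value_counts())
```

If two days had exactly the same count, the choice between them would depend on how ties are ordered, so the selected days are worth checking by eye.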

In [29]:
# Checking the head
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_label,busiest_day,busiest_days
0,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy
1,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy
2,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy
3,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy
4,3367565,1,6,2,7,19.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy


The new column has been successfully added.

In [30]:
# Checking the values in the new column
df_ords_prods_merged['busiest_days'].value_counts(dropna = False)

Regularly busy    12090838
Busiest day       11082805
Least busy         7155120
Name: busiest_days, dtype: int64

<b> Observations: </b> <br>
The sum of the values is 30,328,763, which matches the total number of rows in the dataframe, so there are no missing values. <br>
In this new column, the "Regularly busy" rows decreased from around 21 million to around 12 million, because two more days (Sunday and Tuesday) were flagged in a different category. <br>
"Busiest day" increased from almost 6 million to around 11 million (+5.3 million), while "Least busy" increased from around 3.6 million to around 7.2 million. <br>
I think this new column is more informative, because the values are more evenly distributed. <br>
Also, in the original column, days 0 and 1 (and likewise days 3 and 4) had similar frequencies, so it makes sense to group them together as the busiest and the least busy days.

## 4.2 Creating a new column to check the hours with most orders

In order to derive a new column that flags the hours by category, we first need to check the frequency of each hour.

In [31]:
# Checking the frequency of the "order hour of creation" column
df_ords_prods_merged['order_hour_of_creation'].value_counts(dropna = False)

10    2593725
11    2564597
14    2517238
15    2487586
13    2487500
12    2445841
16    2364969
9     2311334
17    1943858
8     1622394
18    1520954
19    1169224
20     910005
7      844665
21     746254
22     592432
23     375889
6      274801
0      203460
1      108110
5       82706
2       63961
4       49400
3       47860
Name: order_hour_of_creation, dtype: int64

<b> Observations: </b> <br>
The hours should be divided and labeled as “Most orders,” “Average orders,” and “Fewest orders.” <br>
Looking at the frequency, the top 8 hours are 9 to 16, so they should be labeled as the hours with "Most orders". <br>
In the early morning (7 and 8) and from 17 to 23 the number of orders is intermediate, so I will label these hours as "Average orders". <br>
The hours from 0 to 6 have the "Fewest orders", so they will be labeled accordingly.

In [32]:
# Building a new list with one label for each value of order_hour_of_creation
result_3 = []

for value in df_ords_prods_merged["order_hour_of_creation"]:
  if value in [9, 10, 11, 12, 13, 14, 15, 16]:
    result_3.append("Most orders")
  elif value in [0, 1, 2, 3, 4, 5, 6]:
    result_3.append("Fewest orders")
  else:
    result_3.append("Average orders")

In [33]:
# Creating a new column in the dataframe, containing the list created
df_ords_prods_merged['busiest_period_of_day'] = result_3
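The same hour ranges can also be written as range checks instead of explicit lists. This is a sketch on made-up hours that assumes the column holds integers from 0 to 23; Series.between includes both endpoints by default.

```python
import numpy as np
import pandas as pd

# Toy stand-in for order_hour_of_creation, covering the edges of each range
toy = pd.DataFrame({'order_hour_of_creation': [0, 6, 7, 8, 9, 16, 17, 23]})
hours = toy['order_hour_of_creation']

# 9-16 -> 'Most orders', 0-6 -> 'Fewest orders', everything else -> 'Average orders'
toy['busiest_period_of_day'] = np.select(
    [hours.between(9, 16), hours.between(0, 6)],
    ['Most orders', 'Fewest orders'],
    default='Average orders',
)

print(toy)
```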

In [34]:
# Testing
df_ords_prods_merged.head(10)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_label,busiest_day,busiest_days,busiest_period_of_day
0,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy,Average orders
1,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy,Most orders
2,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Average orders
3,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Most orders
4,3367565,1,6,2,7,19.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Average orders
5,550135,1,7,1,9,20.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders
6,3108588,1,8,1,14,14.0,196,2,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders
7,2295261,1,9,1,16,0.0,196,4,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders
8,2550362,1,10,4,8,30.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Average orders
9,2968173,15,15,1,9,7.0,196,2,0,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders


The new column has been added successfully.

In [35]:
# Checking the frequency of the new column
df_ords_prods_merged['busiest_period_of_day'].value_counts(dropna = False)

Most orders       19772790
Average orders     9725675
Fewest orders       830298
Name: busiest_period_of_day, dtype: int64

<b> Observations: </b> <br>
Around 65% of the orders are placed during the hours with the "Most orders". <br>
Around 32% of the orders are placed during the hours with "Average orders". <br>
Less than 3% of the orders are placed from 0 to 6 ("Fewest orders").

## 4.3 Exporting the new dataframe with derived variables

In [36]:
# Exporting the new dataframe (with derived columns) in pkl format
df_ords_prods_merged.to_pickle(os.path.join(path, '02. Data','Prepared Data', 'orders_products_merged_derived.pkl'))
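A quick way to confirm that an export worked is to read the file back and compare it with the dataframe in memory. The sketch below does this with a small made-up dataframe and a temporary folder, so the file name "toy_derived.pkl" is hypothetical and nothing is written to the project folders.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe with one derived column (values are made up)
toy = pd.DataFrame({
    'prices': [3.0, 9.0],
    'price_label': ['Low-range product', 'Mid-range product'],
})

# Write the pickle to a temporary folder and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'toy_derived.pkl')
    toy.to_pickle(file_path)
    reloaded = pd.read_pickle(file_path)

# True when columns, dtypes and values all survived the round trip
print(reloaded.equals(toy))
```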