# 4.8 (Task) Grouping and aggregating data
** **
## Table of contents:

1. Importing libraries <br>
2. Importing dataframe <br>
3. Tasks
    - 3.1 How to group and aggregate data
    - 3.2 Creating a loyalty flag
        - 3.2.1 Performing an aggregation, grouping by the new flag created
    - 3.3 Creating a spending flag
    - 3.4 Creating an order frequency flag
4. Exporting dataframe with new aggregations and flags
    - 4.1 Final checks before exporting
    - 4.2 Exporting dataframe
** **

# 1. Importing libraries
** **

In [1]:
import pandas as pd
import numpy as np
import os

# 2. Importing dataframe
** **

In [2]:
# Creating a path variable for the folder
path = r'C:\Users\Simone\Desktop\Career Foundry\Esercizi modulo 5\Instacart basket analysis'

In [3]:
# Importing dataframe with the derived columns from Prepared Data
df_ords_prods_merged = pd.read_pickle(os.path.join(path, '02. Data', 'Prepared Data', 'orders_products_merged_derived.pkl'))

In [4]:
# Checking the head
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_label,busiest_day,busiest_days,busiest_period_of_day
0,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy,Average orders
1,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy,Most orders
2,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Average orders
3,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Most orders
4,3367565,1,6,2,7,19.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Average orders


# 3. Tasks
** **

## 3.1 How to group and aggregate data

In [5]:
# Grouping data by department_id and aggregating by order_number (mean)
df_ords_prods_merged.groupby('department_id').agg({'order_number': ['mean']})

department_id,order_number (mean)
1,16.559358
2,18.413176
3,18.2796
4,18.91589
5,16.497751
6,17.60939
7,18.303975
8,16.383301
9,17.022963
10,21.227447


<b> Observations: </b> <br>
Compared to the operation performed on the subset (Script 4.8 - Exercise), this aggregation covers the full dataframe. <br>
The subset contained only 8 department_ids, while here we can see all 21. <br>
The mean order number is similar across department_ids, fluctuating between 16 and 18, with 3 exceptions. <br>
In the subset, the mean fluctuated between 18 and 20, with one exception of 12, caused by the subsetting.
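
As a minimal sketch of the grouping and aggregation used above, the same call can be run on a small toy dataframe. The name `toy` and all of its values are made up for illustration and are not taken from the Instacart data.

```python
import pandas as pd

# Toy data standing in for df_ords_prods_merged (values are invented)
toy = pd.DataFrame({
    'department_id': [1, 1, 2, 2, 2],
    'order_number': [2, 4, 10, 20, 30],
})

# One row per department_id; the result has a two-level column header
# ('order_number', 'mean') because the aggregation is passed as a list
dept_means = toy.groupby('department_id').agg({'order_number': ['mean']})
print(dept_means)
```

Passing the aggregation as a list is what produces the two-level column header seen in the output above.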

## 3.2 Creating a loyalty flag

To create a loyalty flag, we first need to create a new column containing the max value of order_number, grouped by user_id.

In [6]:
# Creating a new column containing the max order number, with the data grouped by user_id
df_ords_prods_merged['max_order'] = df_ords_prods_merged.groupby(['user_id'])['order_number'].transform('max')
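
The difference between aggregating and transforming can be sketched on a toy dataframe (the values are invented, and the string alias `'max'` is used for the aggregation). An aggregation returns one row per group, while `transform` returns one value per original row, which is why its result can be assigned directly as a new column.

```python
import pandas as pd

# Toy data: two users with several orders each (values are invented)
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 15, 15],
    'order_number': [2, 3, 10, 15, 22],
})

# agg collapses the data to one row per user_id
per_user = toy.groupby('user_id')['order_number'].agg('max')

# transform keeps the original shape and repeats the group max on every row
toy['max_order'] = toy.groupby('user_id')['order_number'].transform('max')
print(toy)
```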

In [7]:
# Checking the head
df_ords_prods_merged.head(15)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_label,busiest_day,busiest_days,busiest_period_of_day,max_order
0,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy,Average orders,10
1,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy,Most orders,10
2,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Average orders,10
3,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Most orders,10
4,3367565,1,6,2,7,19.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Average orders,10
5,550135,1,7,1,9,20.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10
6,3108588,1,8,1,14,14.0,196,2,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10
7,2295261,1,9,1,16,0.0,196,4,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10
8,2550362,1,10,4,8,30.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Average orders,10
9,2968173,15,15,1,9,7.0,196,2,0,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,22


Now that we have our new column containing the max order value for every user_id, we can use it to segment our customers into three different categories: Loyal, Regular or New. <br>
If the max order value of a user is > 40, then that user is a "Loyal Customer". <br>
If the max order value of a user is > 10 and <= 40, then that user is a "Regular Customer". <br>
If the max order value of a user is <= 10, then that user is a "New Customer".
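
As an alternative sketch, not the approach used in this script, the same three ranges could be expressed in one step with `pd.cut`. The toy values are invented. Note that `pd.cut` returns a categorical column rather than plain strings.

```python
import pandas as pd

# Toy max_order values chosen to sit on both sides of each threshold
toy = pd.DataFrame({'max_order': [3, 10, 11, 40, 41, 99]})

# Bins are right-inclusive by default: (-inf, 10], (10, 40], (40, inf]
toy['loyalty_flag'] = pd.cut(
    toy['max_order'],
    bins=[-float('inf'), 10, 40, float('inf')],
    labels=['New customer', 'Regular customer', 'Loyal customer'],
)
print(toy)
```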

In [8]:
# Setting up the conditions for the new flag column
df_ords_prods_merged.loc[df_ords_prods_merged['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'

In [9]:
df_ords_prods_merged.loc[(df_ords_prods_merged['max_order'] <= 40) & (df_ords_prods_merged['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'

In [10]:
df_ords_prods_merged.loc[df_ords_prods_merged['max_order'] <= 10, 'loyalty_flag'] = 'New customer'

In [11]:
# Checking the head
df_ords_prods_merged.head(10)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_label,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag
0,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy,Average orders,10,New customer
1,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy,Most orders,10,New customer
2,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Average orders,10,New customer
3,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Most orders,10,New customer
4,3367565,1,6,2,7,19.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer
5,550135,1,7,1,9,20.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer
6,3108588,1,8,1,14,14.0,196,2,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer
7,2295261,1,9,1,16,0.0,196,4,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer
8,2550362,1,10,4,8,30.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Average orders,10,New customer
9,2968173,15,15,1,9,7.0,196,2,0,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,22,Regular customer


In [12]:
# Printing the frequency
df_ords_prods_merged['loyalty_flag'].value_counts(dropna = False)

Regular customer    15081691
Loyal customer      10095381
New customer         5151691
Name: loyalty_flag, dtype: int64

### 3.2.1 Performing an aggregation, grouping by the new flag created

We can now group by the new flag and aggregate, to check some descriptive statistics.

In [13]:
# Checking the descriptive statistics of the product prices, grouping by the flag just created
df_ords_prods_merged.groupby('loyalty_flag').agg({'prices': ['mean', 'min', 'max']})

loyalty_flag,prices (mean),prices (min),prices (max)
Loyal customer,10.402162,1.0,99999.0
New customer,13.396333,1.0,99999.0
Regular customer,12.546842,1.0,99999.0


<b> Observations: </b> <br>
Surprisingly, the mean price is highest for New customers (13.40), followed by Regular customers (12.55) and Loyal customers (10.40). <br>
The min value is 1, while the max value in every category is a suspicious 99999 dollars. <br>
This looks like a typo or a placeholder, and it hides the true max value while inflating the means.

In [14]:
# Trying to identify the item
df_ords_prods_merged.loc[df_ords_prods_merged['prices']==99999.0]

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_label,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag
27127653,183964,873,3,0,10,7.0,33664,11,0,2 % Reduced Fat Milk,84,16,99999.0,High-range product,Busiest day,Busiest day,Most orders,8,New customer
27127654,1851256,873,4,6,12,13.0,33664,8,1,2 % Reduced Fat Milk,84,16,99999.0,High-range product,Regularly busy,Regularly busy,Most orders,8,New customer
27127655,2763293,1893,2,4,16,13.0,33664,6,1,2 % Reduced Fat Milk,84,16,99999.0,High-range product,Least busy,Least busy,Most orders,6,New customer
27127656,2564805,1893,4,1,17,30.0,33664,3,1,2 % Reduced Fat Milk,84,16,99999.0,High-range product,Regularly busy,Busiest day,Average orders,6,New customer
27127657,420057,3339,2,0,11,13.0,33664,29,1,2 % Reduced Fat Milk,84,16,99999.0,High-range product,Busiest day,Busiest day,Most orders,6,New customer
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27128304,2249946,204099,29,0,8,4.0,33664,1,0,2 % Reduced Fat Milk,84,16,99999.0,High-range product,Busiest day,Busiest day,Average orders,39,Regular customer
27128305,2363282,204099,31,0,9,2.0,33664,1,1,2 % Reduced Fat Milk,84,16,99999.0,High-range product,Busiest day,Busiest day,Most orders,39,Regular customer
27128306,3181945,204395,13,3,15,8.0,33664,25,0,2 % Reduced Fat Milk,84,16,99999.0,High-range product,Regularly busy,Least busy,Most orders,15,Regular customer
27128307,2486215,205227,7,3,20,4.0,33664,8,0,2 % Reduced Fat Milk,84,16,99999.0,High-range product,Regularly busy,Least busy,Average orders,12,Regular customer


<b> Observations: </b> <br>
It seems that "2 % Reduced Fat Milk" is the item with the wrong price, and it appears in 656 rows. <br>
This issue will be addressed with the client. <br>
<b> NOTE: </b> The issue has been solved in script 4.9
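
One possible way to see the statistics without the suspicious value is sketched below on toy data. It assumes that 99999.0 is a placeholder rather than a real price, and it is not necessarily the fix applied in script 4.9. The name `PLACEHOLDER` and all toy values are invented for illustration.

```python
import pandas as pd

# Toy data with one placeholder price (values are invented)
toy = pd.DataFrame({
    'loyalty_flag': ['New customer', 'New customer', 'Loyal customer', 'Loyal customer'],
    'prices': [9.0, 99999.0, 4.5, 12.3],
})

# Assumption: 99999.0 marks a missing or wrong price
PLACEHOLDER = 99999.0

# where() keeps values that satisfy the condition and turns the rest into NaN
clean_prices = toy['prices'].where(toy['prices'] != PLACEHOLDER)

# NaN values are skipped by mean, min and max
stats = clean_prices.groupby(toy['loyalty_flag']).agg(['mean', 'min', 'max'])
print(stats)
```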

## 3.3 Creating a spending flag

Before creating a spending flag, we first need to derive a new column with an aggregation.

In [15]:
# Creating a new column with the average (mean) price, grouped by user_id
df_ords_prods_merged['avg_price'] = df_ords_prods_merged.groupby(['user_id'])['prices'].transform('mean')

In [16]:
# Rounding average to 2 decimal places
df_ords_prods_merged['avg_price'] = df_ords_prods_merged['avg_price'].round(2)

In [17]:
# Checking the head
df_ords_prods_merged.head(15)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_label,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price
0,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy,Average orders,10,New customer,6.37
1,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Least busy,Most orders,10,New customer,6.37
2,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.37
3,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Most orders,10,New customer,6.37
4,3367565,1,6,2,7,19.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.37
5,550135,1,7,1,9,20.0,196,1,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37
6,3108588,1,8,1,14,14.0,196,2,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37
7,2295261,1,9,1,16,0.0,196,4,1,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37
8,2550362,1,10,4,8,30.0,196,1,1,Soda,77,7,9.0,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.37
9,2968173,15,15,1,9,7.0,196,2,0,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,22,Regular customer,3.99


The new column "avg_price" has been successfully created.

Now we can create the flag using .loc, with the following conditions: <br>
If the average price is >= 10, then the customer is a "High spender". <br>
If the average price is < 10, then the customer is a "Low spender".
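
Because this flag has only two outcomes, it could also be sketched with `np.where`, shown here on invented toy values. One caveat of this variant: a missing `avg_price` would be labelled "Low spender", whereas the .loc approach would leave it empty.

```python
import numpy as np
import pandas as pd

# Toy average prices on both sides of the threshold (values are invented)
toy = pd.DataFrame({'avg_price': [6.37, 3.99, 10.0, 44.59]})

# First label where the condition is True, second label otherwise
toy['spending_flag'] = np.where(toy['avg_price'] >= 10, 'High spender', 'Low spender')
print(toy)
```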

In [18]:
# Setting the conditions
df_ords_prods_merged.loc[df_ords_prods_merged['avg_price'] >= 10, 'spending_flag'] = 'High spender'

In [19]:
df_ords_prods_merged.loc[df_ords_prods_merged['avg_price'] < 10, 'spending_flag'] = 'Low spender'

In [20]:
# Checking the head
df_ords_prods_merged.head(10)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,department_id,prices,price_label,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag
0,2398795,1,2,3,7,15.0,196,1,1,Soda,...,7,9.0,Mid-range product,Regularly busy,Least busy,Average orders,10,New customer,6.37,Low spender
1,473747,1,3,3,12,21.0,196,1,1,Soda,...,7,9.0,Mid-range product,Regularly busy,Least busy,Most orders,10,New customer,6.37,Low spender
2,2254736,1,4,4,7,29.0,196,1,1,Soda,...,7,9.0,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.37,Low spender
3,431534,1,5,4,15,28.0,196,1,1,Soda,...,7,9.0,Mid-range product,Least busy,Least busy,Most orders,10,New customer,6.37,Low spender
4,3367565,1,6,2,7,19.0,196,1,1,Soda,...,7,9.0,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.37,Low spender
5,550135,1,7,1,9,20.0,196,1,1,Soda,...,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender
6,3108588,1,8,1,14,14.0,196,2,1,Soda,...,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender
7,2295261,1,9,1,16,0.0,196,4,1,Soda,...,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender
8,2550362,1,10,4,8,30.0,196,1,1,Soda,...,7,9.0,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.37,Low spender
9,2968173,15,15,1,9,7.0,196,2,0,Soda,...,7,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,22,Regular customer,3.99,Low spender


In [21]:
# Printing the frequency of the column
df_ords_prods_merged['spending_flag'].value_counts(dropna = False)

Low spender     29729366
High spender      599397
Name: spending_flag, dtype: int64

<b> Observations: </b> <br>
Only a small share of the rows (about 2%) belongs to "High spenders".
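
Since every row is one product in one order, `value_counts` counts rows rather than customers. A minimal sketch of counting distinct customers per flag, on invented toy data, could look like this:

```python
import pandas as pd

# Toy data: users 1 and 15 are low spenders, user 20 is a high spender
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 15, 15, 20],
    'spending_flag': ['Low spender'] * 5 + ['High spender'],
})

# Number of rows per flag
rows_per_flag = toy['spending_flag'].value_counts()

# Number of distinct customers per flag
customers_per_flag = toy.groupby('spending_flag')['user_id'].nunique()
print(rows_per_flag)
print(customers_per_flag)
```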

In [22]:
# Checking the head of the columns of interest
df_ords_prods_merged[['user_id', 'spending_flag', 'avg_price']].head(40)

Unnamed: 0,user_id,spending_flag,avg_price
0,1,Low spender,6.37
1,1,Low spender,6.37
2,1,Low spender,6.37
3,1,Low spender,6.37
4,1,Low spender,6.37
5,1,Low spender,6.37
6,1,Low spender,6.37
7,1,Low spender,6.37
8,1,Low spender,6.37
9,15,Low spender,3.99


## 3.4 Creating an order frequency flag

Before creating an order frequency flag, we first need to derive a new column with an aggregation.

In [23]:
# Creating a new column, grouping by user_id and aggregating by the median of days_since_prior_order
df_ords_prods_merged['median_days_since_prior_order'] = df_ords_prods_merged.groupby(['user_id'])['days_since_prior_order'].transform('median')

In [24]:
# Checking the head
df_ords_prods_merged.head(15)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,prices,price_label,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,median_days_since_prior_order
0,2398795,1,2,3,7,15.0,196,1,1,Soda,...,9.0,Mid-range product,Regularly busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5
1,473747,1,3,3,12,21.0,196,1,1,Soda,...,9.0,Mid-range product,Regularly busy,Least busy,Most orders,10,New customer,6.37,Low spender,20.5
2,2254736,1,4,4,7,29.0,196,1,1,Soda,...,9.0,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5
3,431534,1,5,4,15,28.0,196,1,1,Soda,...,9.0,Mid-range product,Least busy,Least busy,Most orders,10,New customer,6.37,Low spender,20.5
4,3367565,1,6,2,7,19.0,196,1,1,Soda,...,9.0,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.37,Low spender,20.5
5,550135,1,7,1,9,20.0,196,1,1,Soda,...,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender,20.5
6,3108588,1,8,1,14,14.0,196,2,1,Soda,...,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender,20.5
7,2295261,1,9,1,16,0.0,196,4,1,Soda,...,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender,20.5
8,2550362,1,10,4,8,30.0,196,1,1,Soda,...,9.0,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5
9,2968173,15,15,1,9,7.0,196,2,0,Soda,...,9.0,Mid-range product,Regularly busy,Busiest day,Most orders,22,Regular customer,3.99,Low spender,10.0


The new column "median_days_since_prior_order" has been successfully created.

Now we can create the flag using .loc, with the following conditions: <br>
If the median of days since prior order is > 20, then the customer is a "Non-frequent customer". <br>
If the median of days since prior order is > 10 and <= 20, then the customer is a "Regular customer". <br>
If the median of days since prior order is <= 10, then the customer is a "Frequent customer".
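
The three conditions can also be written as one `np.select` call, sketched here on invented toy values. The `'Unknown'` default label is hypothetical and is only there to show what happens to a missing median, which matches none of the conditions.

```python
import numpy as np
import pandas as pd

# Toy medians, including one missing value (values are invented)
toy = pd.DataFrame({'median_days_since_prior_order': [20.5, 10.0, 15.0, 30.0, np.nan]})
m = toy['median_days_since_prior_order']

# Conditions and labels are matched by position
conditions = [m > 20, (m > 10) & (m <= 20), m <= 10]
labels = ['Non-frequent customer', 'Regular customer', 'Frequent customer']

# The default is used where no condition holds (here: the missing median)
toy['order_frequency_flag'] = np.select(conditions, labels, default='Unknown')
print(toy)
```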

In [25]:
# Setting the conditions
df_ords_prods_merged.loc[df_ords_prods_merged['median_days_since_prior_order'] > 20, 'order_frequency_flag'] = 'Non-frequent customer'

In [26]:
df_ords_prods_merged.loc[(df_ords_prods_merged['median_days_since_prior_order'] > 10) & (df_ords_prods_merged['median_days_since_prior_order'] <= 20), 'order_frequency_flag'] = 'Regular customer'

In [27]:
df_ords_prods_merged.loc[df_ords_prods_merged['median_days_since_prior_order'] <= 10, 'order_frequency_flag'] = 'Frequent customer'

In [28]:
# Checking the head
df_ords_prods_merged.head(10)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,price_label,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,median_days_since_prior_order,order_frequency_flag
0,2398795,1,2,3,7,15.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
1,473747,1,3,3,12,21.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Least busy,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
2,2254736,1,4,4,7,29.0,196,1,1,Soda,...,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
3,431534,1,5,4,15,28.0,196,1,1,Soda,...,Mid-range product,Least busy,Least busy,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
4,3367565,1,6,2,7,19.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
5,550135,1,7,1,9,20.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
6,3108588,1,8,1,14,14.0,196,2,1,Soda,...,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
7,2295261,1,9,1,16,0.0,196,4,1,Soda,...,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
8,2550362,1,10,4,8,30.0,196,1,1,Soda,...,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
9,2968173,15,15,1,9,7.0,196,2,0,Soda,...,Mid-range product,Regularly busy,Busiest day,Most orders,22,Regular customer,3.99,Low spender,10.0,Frequent customer


In [29]:
# Printing the frequency of the new column
df_ords_prods_merged['order_frequency_flag'].value_counts(dropna = False)

Frequent customer        20675290
Regular customer          6594542
Non-frequent customer     3058931
Name: order_frequency_flag, dtype: int64

<b> Observations: </b> <br>
The majority of rows (about 68%) belong to "Frequent" customers, but a sizeable share (about 22%) belongs to "Regular" customers.

In [30]:
# Checking the head of the columns of interest
df_ords_prods_merged[['user_id', 'order_frequency_flag', 'median_days_since_prior_order']].head(40)

Unnamed: 0,user_id,order_frequency_flag,median_days_since_prior_order
0,1,Non-frequent customer,20.5
1,1,Non-frequent customer,20.5
2,1,Non-frequent customer,20.5
3,1,Non-frequent customer,20.5
4,1,Non-frequent customer,20.5
5,1,Non-frequent customer,20.5
6,1,Non-frequent customer,20.5
7,1,Non-frequent customer,20.5
8,1,Non-frequent customer,20.5
9,15,Frequent customer,10.0


# 4. Exporting dataframe with new aggregations and flags
** **

## 4.1 Final checks before exporting

In [31]:
# Checking the index, looking at the head and tail of the dataframe
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,price_label,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,median_days_since_prior_order,order_frequency_flag
0,2398795,1,2,3,7,15.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
1,473747,1,3,3,12,21.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Least busy,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
2,2254736,1,4,4,7,29.0,196,1,1,Soda,...,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
3,431534,1,5,4,15,28.0,196,1,1,Soda,...,Mid-range product,Least busy,Least busy,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
4,3367565,1,6,2,7,19.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer


In [32]:
df_ords_prods_merged.tail()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,price_label,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,median_days_since_prior_order,order_frequency_flag
30328758,31526,202557,18,5,11,3.0,43553,2,1,Orange Energy Shots,...,Low-range product,Regularly busy,Regularly busy,Most orders,31,Regular customer,6.91,Low spender,8.0,Frequent customer
30328759,2745165,203436,2,3,5,15.0,42338,16,1,"Zucchini Chips, Pesto",...,Mid-range product,Regularly busy,Least busy,Fewest orders,3,New customer,7.5,Low spender,15.0,Regular customer
30328760,850996,204229,12,2,3,25.0,37595,20,0,Dead Sea Minerals Eucalyptus Triple Milled Soap,...,Mid-range product,Regularly busy,Regularly busy,Fewest orders,22,Regular customer,7.95,Low spender,16.0,Regular customer
30328761,2550789,204472,6,3,15,7.0,37595,9,0,Dead Sea Minerals Eucalyptus Triple Milled Soap,...,Mid-range product,Regularly busy,Least busy,Most orders,11,Regular customer,7.97,Low spender,30.0,Non-frequent customer
30328762,2518919,204641,10,2,21,11.0,20543,16,0,Soleil Razors,...,Mid-range product,Regularly busy,Regularly busy,Average orders,35,Regular customer,44.59,High spender,11.0,Regular customer


Index is okay.
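
Besides looking at the head and tail, the index could also be checked programmatically. The following is a small sketch on toy data, with invented variable names, that tests whether the index is the default 0..n-1 range and whether it has no duplicates.

```python
import pandas as pd

# Toy dataframe with a default index (values are invented)
toy = pd.DataFrame({'user_id': [1, 1, 15]})

# True when the index is exactly 0, 1, ..., len(toy) - 1
index_is_default = toy.index.equals(pd.RangeIndex(len(toy)))

# True when no index label is repeated
index_is_unique = toy.index.is_unique
print(index_is_default, index_is_unique)
```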

In [33]:
# Checking datatypes
df_ords_prods_merged.dtypes

order_id                           int64
user_id                            int64
order_number                       int64
orders_day_of_week                 int64
order_hour_of_creation             int64
days_since_prior_order           float64
product_id                         int64
add_to_cart_order                  int64
reordered                          int64
product_name                      object
aisle_id                           int64
department_id                      int64
prices                           float64
price_label                       object
busiest_day                       object
busiest_days                      object
busiest_period_of_day             object
max_order                          int64
loyalty_flag                      object
avg_price                        float64
spending_flag                     object
median_days_since_prior_order    float64
order_frequency_flag              object
dtype: object

All the columns containing IDs and the order number should be strings.

In [34]:
# Changing the datatypes of multiple columns.
df_ords_prods_merged['order_id'] = df_ords_prods_merged['order_id'].astype('str')

In [35]:
df_ords_prods_merged['user_id'] = df_ords_prods_merged['user_id'].astype('str')

In [36]:
df_ords_prods_merged['order_number'] = df_ords_prods_merged['order_number'].astype('str')

In [37]:
df_ords_prods_merged['product_id'] = df_ords_prods_merged['product_id'].astype('str')

In [38]:
df_ords_prods_merged['aisle_id'] = df_ords_prods_merged['aisle_id'].astype('str')

In [39]:
df_ords_prods_merged['department_id'] = df_ords_prods_merged['department_id'].astype('str')
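
As a more compact sketch of the same conversion, `astype` also accepts a dictionary mapping column names to datatypes, so several columns can be converted in one call. The toy dataframe and the `id_columns` list are invented for illustration.

```python
import pandas as pd

# Toy data with two ID columns and one numeric column (values are invented)
toy = pd.DataFrame({
    'order_id': [2398795, 473747],
    'user_id': [1, 1],
    'prices': [9.0, 9.0],
})

# Columns to convert; the other columns keep their datatype
id_columns = ['order_id', 'user_id']
toy = toy.astype({col: 'str' for col in id_columns})
print(toy.dtypes)
```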

In [40]:
# Checking the datatypes again
df_ords_prods_merged.dtypes

order_id                          object
user_id                           object
order_number                      object
orders_day_of_week                 int64
order_hour_of_creation             int64
days_since_prior_order           float64
product_id                        object
add_to_cart_order                  int64
reordered                          int64
product_name                      object
aisle_id                          object
department_id                     object
prices                           float64
price_label                       object
busiest_day                       object
busiest_days                      object
busiest_period_of_day             object
max_order                          int64
loyalty_flag                      object
avg_price                        float64
spending_flag                     object
median_days_since_prior_order    float64
order_frequency_flag              object
dtype: object

Datatypes have been changed.

In [41]:
# Final checks
df_ords_prods_merged

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,price_label,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,median_days_since_prior_order,order_frequency_flag
0,2398795,1,2,3,7,15.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
1,473747,1,3,3,12,21.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Least busy,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
2,2254736,1,4,4,7,29.0,196,1,1,Soda,...,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
3,431534,1,5,4,15,28.0,196,1,1,Soda,...,Mid-range product,Least busy,Least busy,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
4,3367565,1,6,2,7,19.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30328758,31526,202557,18,5,11,3.0,43553,2,1,Orange Energy Shots,...,Low-range product,Regularly busy,Regularly busy,Most orders,31,Regular customer,6.91,Low spender,8.0,Frequent customer
30328759,2745165,203436,2,3,5,15.0,42338,16,1,"Zucchini Chips, Pesto",...,Mid-range product,Regularly busy,Least busy,Fewest orders,3,New customer,7.50,Low spender,15.0,Regular customer
30328760,850996,204229,12,2,3,25.0,37595,20,0,Dead Sea Minerals Eucalyptus Triple Milled Soap,...,Mid-range product,Regularly busy,Regularly busy,Fewest orders,22,Regular customer,7.95,Low spender,16.0,Regular customer
30328761,2550789,204472,6,3,15,7.0,37595,9,0,Dead Sea Minerals Eucalyptus Triple Milled Soap,...,Mid-range product,Regularly busy,Least busy,Most orders,11,Regular customer,7.97,Low spender,30.0,Non-frequent customer


In [42]:
df_ords_prods_merged.head(10)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,price_label,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,median_days_since_prior_order,order_frequency_flag
0,2398795,1,2,3,7,15.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
1,473747,1,3,3,12,21.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Least busy,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
2,2254736,1,4,4,7,29.0,196,1,1,Soda,...,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
3,431534,1,5,4,15,28.0,196,1,1,Soda,...,Mid-range product,Least busy,Least busy,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
4,3367565,1,6,2,7,19.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
5,550135,1,7,1,9,20.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
6,3108588,1,8,1,14,14.0,196,2,1,Soda,...,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
7,2295261,1,9,1,16,0.0,196,4,1,Soda,...,Mid-range product,Regularly busy,Busiest day,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
8,2550362,1,10,4,8,30.0,196,1,1,Soda,...,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
9,2968173,15,15,1,9,7.0,196,2,0,Soda,...,Mid-range product,Regularly busy,Busiest day,Most orders,22,Regular customer,3.99,Low spender,10.0,Frequent customer


In [43]:
df_ords_prods_merged.tail(10)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,price_label,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spending_flag,median_days_since_prior_order,order_frequency_flag
30328753,960088,202557,13,4,12,15.0,43553,3,1,Orange Energy Shots,...,Low-range product,Least busy,Least busy,Most orders,31,Regular customer,6.91,Low spender,8.0,Frequent customer
30328754,343962,202557,14,0,10,3.0,43553,2,1,Orange Energy Shots,...,Low-range product,Busiest day,Busiest day,Most orders,31,Regular customer,6.91,Low spender,8.0,Frequent customer
30328755,2329472,202557,15,6,12,6.0,43553,2,1,Orange Energy Shots,...,Low-range product,Regularly busy,Regularly busy,Most orders,31,Regular customer,6.91,Low spender,8.0,Frequent customer
30328756,694731,202557,16,1,14,2.0,43553,2,1,Orange Energy Shots,...,Low-range product,Regularly busy,Busiest day,Most orders,31,Regular customer,6.91,Low spender,8.0,Frequent customer
30328757,1320836,202557,17,2,15,1.0,43553,2,1,Orange Energy Shots,...,Low-range product,Regularly busy,Regularly busy,Most orders,31,Regular customer,6.91,Low spender,8.0,Frequent customer
30328758,31526,202557,18,5,11,3.0,43553,2,1,Orange Energy Shots,...,Low-range product,Regularly busy,Regularly busy,Most orders,31,Regular customer,6.91,Low spender,8.0,Frequent customer
30328759,2745165,203436,2,3,5,15.0,42338,16,1,"Zucchini Chips, Pesto",...,Mid-range product,Regularly busy,Least busy,Fewest orders,3,New customer,7.5,Low spender,15.0,Regular customer
30328760,850996,204229,12,2,3,25.0,37595,20,0,Dead Sea Minerals Eucalyptus Triple Milled Soap,...,Mid-range product,Regularly busy,Regularly busy,Fewest orders,22,Regular customer,7.95,Low spender,16.0,Regular customer
30328761,2550789,204472,6,3,15,7.0,37595,9,0,Dead Sea Minerals Eucalyptus Triple Milled Soap,...,Mid-range product,Regularly busy,Least busy,Most orders,11,Regular customer,7.97,Low spender,30.0,Non-frequent customer
30328762,2518919,204641,10,2,21,11.0,20543,16,0,Soleil Razors,...,Mid-range product,Regularly busy,Regularly busy,Average orders,35,Regular customer,44.59,High spender,11.0,Regular customer


## 4.2 Exporting dataframe

In [44]:
# Exporting the dataframe as a pickle file
df_ords_prods_merged.to_pickle(os.path.join(path, '02. Data','Prepared Data', 'orders_products_merged_derived_flags.pkl'))
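
To confirm that an export preserves the data, the pickle can be read back and compared with the original. The sketch below does this on a toy dataframe in a temporary folder; the file name `toy_flags.pkl` and the values are invented.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for the exported one (values are invented)
toy = pd.DataFrame({
    'user_id': ['1', '15'],
    'loyalty_flag': ['New customer', 'Regular customer'],
})

# Write to a temporary folder, then read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_flags.pkl')
    toy.to_pickle(out_path)
    restored = pd.read_pickle(out_path)

# equals() compares values, shape and datatypes
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```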