### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import os

### Importing data

In [2]:
## Importing data set path
path = r'C:\Users\mgril\OneDrive\Desktop\Instacart Basket Analysis Folder'

In [3]:
## Importing df_ords_prods_merged
df_ords_prods_merged = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged_updated.pkl'))

In [4]:
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,price_range_loc.1,busiest_day,busiest_days,busiest_period_of_day
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Regularly busy,Average orders
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Least busy days,Average orders
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Least busy days,Most orders
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Least busy,Least busy days,Average orders
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Least busy,Least busy days,Most orders


In [5]:
df_ords_prods_merged.shape

(32404859, 19)

In [6]:
# Creating subset

df = df_ords_prods_merged[:1000000]

In [7]:
df.shape

(1000000, 19)

In [8]:
df.head(10)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,price_range_loc.1,busiest_day,busiest_days,busiest_period_of_day
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Regularly busy,Average orders
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Least busy days,Average orders
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Least busy days,Most orders
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Least busy,Least busy days,Average orders
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Least busy,Least busy days,Most orders
5,3367565,1,6,2,7,19.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Regularly busy,Average orders
6,550135,1,7,1,9,20.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Busiest days,Most orders
7,3108588,1,8,1,14,14.0,196,2,1,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Busiest days,Most orders
8,2295261,1,9,1,16,0.0,196,4,1,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Busiest days,Most orders
9,2550362,1,10,4,8,30.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Least busy,Least busy days,Average orders


### Grouping data with pandas

In [9]:
df.groupby('product_name')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000015DBE31C910>

#### To recap, you should always use the groupby() function as part of a series of steps, namely, the following:

1. Split the data into groups based on some criteria.
2. Apply a function to each group separately.
3. Combine the results into a dataframe or alternative data structure or create a new column in the current dataframe.

So far, you’ve only completed the first step: splitting the data into groups (with the group being the “product_name” column). Now, let’s take a look into the second step, which will involve some aggregation!
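
The three steps can be sketched on a small example. The dataframe `toy` and its values are hypothetical and only borrow column names from the notebook; this is a minimal illustration, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook's dataframe
toy = pd.DataFrame({
    'product_name': ['Soda', 'Soda', 'Milk', 'Milk', 'Milk'],
    'order_number': [1, 3, 2, 4, 6],
})

# Step 1: split the rows into groups (nothing is computed yet)
groups = toy.groupby('product_name')

# Step 2: apply a function to each group separately
mean_per_product = groups['order_number'].mean()

# Step 3: combine - pandas returns one value per group, indexed by the group key
print(mean_per_product)
```

Calling groupby() on its own only returns a lazy GroupBy object, which is why the cell above prints an object reference rather than a table.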

### Aggregating Data with agg()

#### Performing a Single Aggregation
Aggregating in pandas is done by using the appropriately named agg() function. Let’s see how you might use this function to produce a single descriptive statistic for the “order_number” column. If you were to calculate the mean of the “order_number” column grouped by the “department_id” column, you could quickly compare the average order number across the Instacart departments. (Recall that “order_number” is the sequence number of an order for a given user: 1 for the first order, 2 for the second, and so on.) This will involve two of the steps in the three-step process introduced above:

1. Split the data into groups based on “department_id.”
2. Apply the agg() function to each group to obtain the mean values for the “order_number” column.

To do this, run the following code:

In [10]:
df.groupby('department_id').agg({'order_number': ['mean']})

department_id,order_number (mean)
4,18.82578
7,17.472355
13,17.993423
14,19.246334
16,19.463012
17,11.294069
19,19.305237
20,17.599636


#### There are some aggregations that can be conducted without use of the agg() function. For instance, the command above could be replaced with a command that uses the mean() function to achieve the same results:

In [11]:
# Another method of finding mean of 'order_number'

df.groupby('department_id')['order_number'].mean()

department_id
4     18.825780
7     17.472355
13    17.993423
14    19.246334
16    19.463012
17    11.294069
19    19.305237
20    17.599636
Name: order_number, dtype: float64
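
The two approaches return the same numbers in different containers, as the outputs above suggest. A minimal sketch on hypothetical toy data (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'department_id': [4, 4, 7, 7],
    'order_number': [10, 20, 5, 7],
})

via_agg = toy.groupby('department_id').agg({'order_number': ['mean']})
via_mean = toy.groupby('department_id')['order_number'].mean()

# agg() with a dict of lists returns a DataFrame with two-level column labels
print(type(via_agg).__name__, list(via_agg.columns))
# Selecting the column and calling mean() returns a Series
print(type(via_mean).__name__)

# The values themselves are identical
same_values = (via_agg[('order_number', 'mean')] == via_mean).all()
print(same_values)
```

The Series form is convenient for a quick look, while the agg() form extends naturally to several statistics at once.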

### Performing Multiple Aggregations

#### Now that you know how to produce a single statistic using agg(), let’s explore how you can produce multiple statistics at the same time. All it comes down to is adding more arguments to your code. You’ll be working again with the “order_number” column, this time producing not only the mean but also the min and max:

In [12]:
df.groupby('department_id').agg({'order_number': ['mean', 'min', 'max']})

department_id,order_number (mean),order_number (min),order_number (max)
4,18.82578,1,99
7,17.472355,1,99
13,17.993423,1,99
14,19.246334,1,99
16,19.463012,1,99
17,11.294069,1,98
19,19.305237,1,99
20,17.599636,1,99
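
If the two-level column header in the output above is awkward to work with, pandas also supports named aggregation, which produces flat column names. This is an optional alternative rather than the method used in this Exercise; the output column names `mean_order`, `min_order` and `max_order` are hypothetical choices, and the data is invented.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'department_id': [4, 4, 7, 7, 7],
    'order_number': [1, 9, 2, 4, 6],
})

# Each keyword is an output column name; each tuple is (source column, aggregation)
stats = toy.groupby('department_id').agg(
    mean_order=('order_number', 'mean'),
    min_order=('order_number', 'min'),
    max_order=('order_number', 'max'),
)
print(stats)
```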


### Aggregating Data with transform()

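#### ESTABLISHING FLAG CRITERIA
In most cases, these criteria will be provided by your client, so you won’t need to come up with them yourself. The criteria applied in the code further down are:

- If the maximum number of orders a user has made is over 40, the customer is labeled a “Loyal customer.”
- If the maximum number of orders is over 10 and at most 40, the customer is labeled a “Regular customer.”
- If the maximum number of orders is 10 or fewer, the customer is labeled a “New customer.”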

#### Now, let’s map this task onto the three-step process introduced earlier:

1. Split the data into groups based on the “user_id” column.
2. Apply the transform() function on the “order_number” column to generate the maximum orders for each user.
3. Create a new column, “max_order,” into which you’ll place the results of your aggregation.

Once this process is complete, you can use the “max_order” column to create another new column that assigns a loyalty flag to each customer according to the criteria (via the loc indexer). This will be explained in just a moment. First, let’s follow the steps above to create your “max_order” column. In this case, all three steps can be included in a single line of code:

In [13]:
df_ords_prods_merged['max_order'] = df_ords_prods_merged.groupby(['user_id'])['order_number'].transform(np.max)

##### Let’s run through each piece of the code in turn. First, a new column called “max_order” is created, which will store the maximum order number for each user (step 3). Then, the df_ords_prods_merged dataframe is grouped by the “user_id” column (step 1). And finally, the transform() function is applied on the “order_number” column with the np.max argument (step 2).

But what is this np? This is actually the NumPy library! Remember the import code you include at the beginning of every notebook? In that code, you assigned the NumPy library to the np variable. Now, you’re finally going to use that variable. The max() function is a function included within NumPy that finds the max value within a column. Including this as an argument within the transform() function tells Python to “transform the ‘order_number’ column by applying the max() function from the NumPy library.”

Go ahead and execute your code, then check out the results using the head() function:



In [14]:
df_ords_prods_merged.head(25)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,price_range_loc.1,busiest_day,busiest_days,busiest_period_of_day,max_order
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Regularly busy,Average orders,10
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Least busy days,Average orders,10
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Least busy days,Most orders,10
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Least busy,Least busy days,Average orders,10
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Least busy,Least busy days,Most orders,10
5,3367565,1,6,2,7,19.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Regularly busy,Average orders,10
6,550135,1,7,1,9,20.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Busiest days,Most orders,10
7,3108588,1,8,1,14,14.0,196,2,1,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Busiest days,Most orders,10
8,2295261,1,9,1,16,0.0,196,4,1,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Busiest days,Most orders,10
9,2550362,1,10,4,8,30.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Least busy,Least busy days,Average orders,10


#### Another way to conduct this check is to print the head of the dataframe with a larger argument, say 100, as in df_ords_prods_merged.head(100), which would result in an output like this:

In [15]:
df_ords_prods_merged.head(100)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,price_range_loc.1,busiest_day,busiest_days,busiest_period_of_day,max_order
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Regularly busy,Average orders,10
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Least busy days,Average orders,10
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Least busy days,Most orders,10
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Least busy,Least busy days,Average orders,10
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both,,Mid-range product,Least busy,Least busy days,Most orders,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,3226575,360,1,5,12,,196,1,0,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Regularly busy,Most orders,3
96,1469869,377,3,5,17,3.0,196,9,0,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Regularly busy,Average orders,3
97,1927023,387,2,4,10,22.0,196,3,0,Soda,77,7,9.0,both,,Mid-range product,Least busy,Least busy days,Most orders,8
98,858092,420,4,1,19,30.0,196,2,0,Soda,77,7,9.0,both,,Mid-range product,Regularly busy,Busiest days,Average orders,22
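
To see why transform() is used here rather than a plain aggregation, the sketch below compares the two on hypothetical toy data. It passes the string alias `'max'` instead of `np.max`; both name the same aggregation, and recent pandas releases may emit a warning when a NumPy function is passed, so treat the alias as an optional variation on the cell above.

```python
import pandas as pd

# Hypothetical toy data: user 1 has three orders, user 2 has two
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
})

# Plain aggregation: one row per user, so it cannot be assigned back directly
per_user = toy.groupby('user_id')['order_number'].max()
print(per_user)

# transform(): one value per original row, aligned with the dataframe's index
toy['max_order'] = toy.groupby('user_id')['order_number'].transform('max')
print(toy)
```

Because the result has the same length as the dataframe, every row of a user carries that user's maximum, which matches the repeated 10 in the “max_order” column above.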


### Deriving Columns with loc()

#### With your new column ready to go, all that’s left is to create a flag that assigns a “loyalty” label to a user ID based on its corresponding max order value. You should be familiar with the process already from the previous Exercise. Based on the criteria listed above, your code should look something like this:

In [16]:
df_ords_prods_merged.loc[df_ords_prods_merged['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'

In [17]:
df_ords_prods_merged.loc[(df_ords_prods_merged['max_order'] <= 40) & (df_ords_prods_merged['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'

In [18]:
df_ords_prods_merged.loc[df_ords_prods_merged['max_order'] <= 10, 'loyalty_flag'] = 'New customer'

In [19]:
df_ords_prods_merged['loyalty_flag'].value_counts(dropna = False)

loyalty_flag
Regular customer    15876776
Loyal customer      10284093
New customer         6243990
Name: count, dtype: int64
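
The three loc assignments can also be expressed as a single binning step with pd.cut(). This is an alternative sketch on hypothetical values, not the approach the Exercise asks for. Note that pd.cut() returns a categorical column rather than plain strings.

```python
import pandas as pd

# Hypothetical max_order values chosen to sit on both sides of each threshold
toy = pd.DataFrame({'max_order': [1, 10, 11, 40, 41, 99]})

# Intervals are right-inclusive by default: (0, 10], (10, 40], (40, inf]
toy['loyalty_flag'] = pd.cut(
    toy['max_order'],
    bins=[0, 10, 40, float('inf')],
    labels=['New customer', 'Regular customer', 'Loyal customer'],
)
print(toy)
```

Testing values exactly at 10 and 40 is a quick way to confirm that the boundaries behave like the `<=` and `>` comparisons in the loc-based cells.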

In [20]:
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,department_id,prices,_merge,price_range_loc,price_range_loc.1,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag
0,2539329,1,1,2,8,,196,1,0,Soda,...,7,9.0,both,,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,7,9.0,both,,Mid-range product,Regularly busy,Least busy days,Average orders,10,New customer
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,7,9.0,both,,Mid-range product,Regularly busy,Least busy days,Most orders,10,New customer
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,7,9.0,both,,Mid-range product,Least busy,Least busy days,Average orders,10,New customer
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,7,9.0,both,,Mid-range product,Least busy,Least busy days,Most orders,10,New customer


#### As always, after executing your code, you should check that everything was interpreted correctly and that the right flags were assigned. To do this without looking at the entire dataframe and constantly having to scroll left and right, which is both tedious and prone to error, you can try this method, which will limit the output of your head() function to just those columns you want to check. You learned already that to access a single column in a dataframe, you can index it like so: df['column']

#### Well, a similar syntax can be used to access multiple columns at the same time. For instance, if you wanted to return a random sample of 60 rows of only the “user_id,” “max_order,” and “loyalty_flag” columns, you could pass a list of column names and call sample() on the result:

In [21]:
df_ords_prods_merged[['user_id', 'max_order', 'loyalty_flag']].sample(60)

Unnamed: 0,user_id,max_order,loyalty_flag
7376543,118929,48,Loyal customer
15541702,5588,91,Loyal customer
24969241,49822,7,New customer
11421844,105644,13,Regular customer
23970469,42920,99,Loyal customer
11664861,80175,6,New customer
4753569,60109,17,Regular customer
8996361,98971,53,Loyal customer
21588947,62373,21,Regular customer
30783946,84946,34,Regular customer


### Exercise 4.8 Tasks

### Task 02. In this Exercise, you learned how to find the aggregated mean of the “order_number” column grouped by “department_id” for a subset of your dataframe. Now, repeat this process for the entire dataframe.

In [22]:
df_ords_prods_merged.groupby('department_id').agg({'order_number': ['mean']})

department_id,order_number (mean)
1,15.457838
2,17.27792
3,17.170395
4,17.811403
5,15.215751
6,16.439806
7,17.225802
8,15.34065
9,15.895474
10,20.197148


### Task 03. Analyze the result. How do the results for the entire dataframe differ from those of the subset? Include your comments in a markdown cell below the executed code.

In [23]:
# Re-running the aggregation on the subset (df, created in the earlier steps) for easy comparison

df.groupby('department_id').agg({'order_number': ['mean']})

department_id,order_number (mean)
4,18.82578
7,17.472355
13,17.993423
14,19.246334
16,19.463012
17,11.294069
19,19.305237
20,17.599636


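#### Answer - In the subset (the first 1,000,000 rows), the mean of order_number is higher than in the entire dataframe, with the exception of department_id 17, where it is lower. The subset also contains only eight departments (4, 7, 13, 14, 16, 17, 19 and 20), whereas the result for the entire dataframe includes departments that do not appear in the subset at all.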
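
A comparison like this is easier to read with both sets of means side by side. The sketch below uses hypothetical toy data, with `full` standing in for df_ords_prods_merged and `subset` for df; the numbers are invented.

```python
import pandas as pd

# Hypothetical stand-in for the full dataframe
full = pd.DataFrame({
    'department_id': [4, 4, 4, 7, 7, 7, 17, 17],
    'order_number': [2, 4, 6, 1, 3, 8, 5, 7],
})
# Taking the first rows only, in the same way as df_ords_prods_merged[:1000000]
subset = full[:4]

# Building a dataframe from two Series aligns them on department_id
comparison = pd.DataFrame({
    'subset_mean': subset.groupby('department_id')['order_number'].mean(),
    'full_mean': full.groupby('department_id')['order_number'].mean(),
})
comparison['difference'] = comparison['subset_mean'] - comparison['full_mean']
print(comparison)
```

Departments missing from the subset show up as NaN in `subset_mean`, which makes it obvious that a slice of the first rows does not cover every department.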

### Task 04. Follow the instructions in the Exercise for creating a loyalty flag for existing customers using the transform() and loc() functions.

#### Answer - Already completed above: the “max_order” column is created in cell In [13] and the loyalty flag in cells In [16] to In [18].

### Task 05. The marketing team at Instacart wants to know whether there’s a difference between the spending habits of the three types of customers you identified. Use the loyalty flag you created and check the basic statistics of the product prices for each loyalty category (Loyal Customer, Regular Customer, and New Customer). What you’re trying to determine is whether the prices of products purchased by loyal customers differ from those purchased by regular or new customers.

In [24]:
print('The mean, min and max of prices grouped by the loyalty_flag column in df_ords_prods_merged:')
df_ords_prods_merged.groupby('loyalty_flag').agg({'prices': ['mean', 'min', 'max']})

The mean, min and max of prices grouped by the loyalty_flag column in df_ords_prods_merged:


loyalty_flag,prices (mean),prices (min),prices (max)
Loyal customer,10.386336,1.0,99999.0
New customer,13.29467,1.0,99999.0
Regular customer,12.495717,1.0,99999.0


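#### Review of the above - New customers have the highest average product price (13.29), followed by regular customers (12.50) and loyal customers (10.39). The maximum price of 99999.0 in every group looks like an outlier or placeholder value and needs further investigation, because it may be inflating the means.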
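
One way to start that investigation is to count the extreme prices and compare the mean with the median, which is far less sensitive to a few very large values. The data below is hypothetical, and the cutoff of 100 is an assumed threshold for illustration, not a value from the project brief.

```python
import pandas as pd

# Hypothetical toy data with one extreme price
toy = pd.DataFrame({
    'loyalty_flag': ['New customer', 'New customer', 'New customer',
                     'Loyal customer', 'Loyal customer'],
    'prices': [5.0, 7.0, 99999.0, 6.0, 8.0],
})

# Rows with an implausibly high price (assumed cutoff of 100)
suspicious = toy[toy['prices'] > 100]
print(len(suspicious))

# A single extreme value pulls the mean up but leaves the median untouched
summary = toy.groupby('loyalty_flag')['prices'].agg(['mean', 'median', 'max'])
print(summary)
```

If the mean and median diverge sharply within a group, the group's mean is probably dominated by a handful of outliers.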

### Task 06. The team now wants to target different types of spenders in their marketing campaigns. This can be achieved by looking at the prices of the items people are buying. Create a spending flag for each user based on the average price across all their orders using the following criteria:

- If the mean of the prices of products purchased by a user is lower than 10, then flag them as a “Low spender.”
- If the mean of the prices of products purchased by a user is higher than or equal to 10, then flag them as a “High spender.”

In [25]:
# Create average_price column using transform()

df_ords_prods_merged['average_price'] = df_ords_prods_merged.groupby(['user_id'])['prices'].transform(np.mean)

In [26]:
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,prices,_merge,price_range_loc,price_range_loc.1,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,average_price
0,2539329,1,1,2,8,,196,1,0,Soda,...,9.0,both,,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.367797
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,9.0,both,,Mid-range product,Regularly busy,Least busy days,Average orders,10,New customer,6.367797
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,9.0,both,,Mid-range product,Regularly busy,Least busy days,Most orders,10,New customer,6.367797
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,9.0,both,,Mid-range product,Least busy,Least busy days,Average orders,10,New customer,6.367797
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,9.0,both,,Mid-range product,Least busy,Least busy days,Most orders,10,New customer,6.367797


In [27]:
# Create a spending_flag for low spenders

df_ords_prods_merged.loc[df_ords_prods_merged['average_price'] < 10, 'spending_flag'] = 'Low spender'

In [28]:
# Create a spending_flag for high spenders

df_ords_prods_merged.loc[df_ords_prods_merged['average_price'] >= 10, 'spending_flag'] = 'High spender'

In [29]:
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,_merge,price_range_loc,price_range_loc.1,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,average_price,spending_flag
0,2539329,1,1,2,8,,196,1,0,Soda,...,both,,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.367797,Low spender
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,both,,Mid-range product,Regularly busy,Least busy days,Average orders,10,New customer,6.367797,Low spender
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,both,,Mid-range product,Regularly busy,Least busy days,Most orders,10,New customer,6.367797,Low spender
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,both,,Mid-range product,Least busy,Least busy days,Average orders,10,New customer,6.367797,Low spender
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,both,,Mid-range product,Least busy,Least busy days,Most orders,10,New customer,6.367797,Low spender


In [30]:
# Checking frequency

df_ords_prods_merged['spending_flag'].value_counts(dropna = False)

spending_flag
Low spender     31770614
High spender      634245
Name: count, dtype: int64
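
For a flag with only two outcomes, np.where() is a compact alternative to two separate loc assignments. This is a sketch on hypothetical toy data, not the method the task prescribes. One caveat: a NaN average would fall into the “else” branch and be labeled “High spender,” whereas the loc approach would leave it unflagged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: user 1 averages 6.0, user 2 averages 12.0
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'prices': [4.0, 8.0, 9.0, 15.0],
})

# Per-user mean price, repeated on every row of that user
toy['average_price'] = toy.groupby('user_id')['prices'].transform('mean')

# First label where the condition holds, second label everywhere else
toy['spending_flag'] = np.where(
    toy['average_price'] < 10, 'Low spender', 'High spender'
)
print(toy)
```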

### Task 07. In order to send relevant notifications to users within the app (for instance, asking users if they want to buy the same item again), the Instacart team wants you to determine frequent versus non-frequent customers. Create an order frequency flag that marks the regularity of a user’s ordering behavior according to the median in the “days_since_prior_order” column. The criteria for the flag should be as follows:

- If the median of “days_since_prior_order” is higher than 20, then the customer should be labeled a “Non-frequent customer.”
- If the median is higher than 10 and lower than or equal to 20, then the customer should be labeled a “Regular customer.”
- If the median is lower than or equal to 10, then the customer should be labeled a “Frequent customer.”

In [31]:
# Create median_prior_orders column using transform()

df_ords_prods_merged['median_prior_orders'] = df_ords_prods_merged.groupby(['user_id'])['days_since_prior_order'].transform(np.median)

In [32]:
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,price_range_loc,price_range_loc.1,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,average_price,spending_flag,median_prior_orders
0,2539329,1,1,2,8,,196,1,0,Soda,...,,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.367797,Low spender,20.5
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,,Mid-range product,Regularly busy,Least busy days,Average orders,10,New customer,6.367797,Low spender,20.5
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,,Mid-range product,Regularly busy,Least busy days,Most orders,10,New customer,6.367797,Low spender,20.5
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,,Mid-range product,Least busy,Least busy days,Average orders,10,New customer,6.367797,Low spender,20.5
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,,Mid-range product,Least busy,Least busy days,Most orders,10,New customer,6.367797,Low spender,20.5


In [33]:
# Create an order_frequency_flag for non-frequent customers

df_ords_prods_merged.loc[df_ords_prods_merged['median_prior_orders'] > 20, 'order_frequency_flag'] = 'Non-frequent customer'

In [34]:
# Create an order_frequency_flag for regular customers

df_ords_prods_merged.loc[(df_ords_prods_merged['median_prior_orders'] > 10) & (df_ords_prods_merged['median_prior_orders'] <= 20), 'order_frequency_flag'] = 'Regular customer'

In [35]:
# Create an order_frequency_flag for frequent customers

df_ords_prods_merged.loc[df_ords_prods_merged['median_prior_orders'] <= 10, 'order_frequency_flag'] = 'Frequent customer'

In [36]:
df_ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,average_price,spending_flag,median_prior_orders,order_frequency_flag
0,2539329,1,1,2,8,,196,1,0,Soda,...,Mid-range product,Regularly busy,Regularly busy,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Least busy days,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Least busy days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,Mid-range product,Least busy,Least busy days,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,Mid-range product,Least busy,Least busy days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer


In [37]:
# Checking frequency

df_ords_prods_merged['order_frequency_flag'].value_counts(dropna = False)

order_frequency_flag
Frequent customer        21559853
Regular customer          7208564
Non-frequent customer     3636437
NaN                             5
Name: count, dtype: int64

In [38]:
# Checking null values

print('Records in df_ords_prods_merged with null values in column order_frequency_flag:')
df_ords_prods_merged[df_ords_prods_merged['order_frequency_flag'].isnull()][['user_id', 'days_since_prior_order', 'median_prior_orders', 'order_frequency_flag']]

Records in df_ords_prods_merged with null values in column order_frequency_flag:


Unnamed: 0,user_id,days_since_prior_order,median_prior_orders,order_frequency_flag
13645692,159838,,,
17251990,159838,,,
17622767,159838,,,
24138593,159838,,,
25880002,159838,,,


In [39]:
# Check user_id 159838

df_ords_prods_merged[df_ords_prods_merged['user_id'] == 159838]

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,average_price,spending_flag,median_prior_orders,order_frequency_flag
13645692,895835,159838,1,0,17,,10749,3,0,Organic Red Bell Pepper,...,Mid-range product,Busiest day,Busiest days,Average orders,1,New customer,7.42,Low spender,,
17251990,895835,159838,1,0,17,,33401,6,0,Goat Cheese Crumbles,...,Mid-range product,Busiest day,Busiest days,Average orders,1,New customer,7.42,Low spender,,
17622767,895835,159838,1,0,17,,23695,2,0,California Veggie Burger,...,Low-range product,Busiest day,Busiest days,Average orders,1,New customer,7.42,Low spender,,
24138593,895835,159838,1,0,17,,21334,5,0,Organic Peeled Garlic,...,Mid-range product,Busiest day,Busiest days,Average orders,1,New customer,7.42,Low spender,,
25880002,895835,159838,1,0,17,,22198,1,0,4X Ultra Concentrated Natural Laundry Detergen...,...,Low-range product,Busiest day,Busiest days,Average orders,1,New customer,7.42,Low spender,,


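#### User ID 159838 has placed only one order (max_order is 1), so “days_since_prior_order” is NaN in all five of their rows and the median is NaN as well. None of the flag conditions match a NaN median, which is why these rows have no order_frequency_flag. This is expected for a first-time customer, so the NaN values are left unchanged.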
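
The mechanism behind these missing flags can be reproduced on a small example. The data is hypothetical: user 2 has a single order and therefore no prior-order gap. The string alias `'median'` is used in place of `np.median` and is assumed to be equivalent for this purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the first order of each user has no prior order
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'days_since_prior_order': [np.nan, 15.0, 21.0, np.nan],
})

# The group median skips NaN values; a group with only NaN yields NaN
toy['median_prior_orders'] = (
    toy.groupby('user_id')['days_since_prior_order'].transform('median')
)

# Comparisons against NaN are always False, so no flag condition matches
median = toy['median_prior_orders']
no_flag = ~((median > 20) | ((median > 10) & (median <= 20)) | (median <= 10))
print(toy[no_flag])
```

Only the row of the single-order user is left without a matching condition, mirroring the five NaN rows found above.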

In [40]:
df_ords_prods_merged.shape

(32404859, 25)

### Task 08. Ensure your notebook is clean and structured and that your code is well commented. 

### Task 09. Export your dataframe as a pickle file and store it correctly in your “Prepared Data” folder.

In [41]:
df_ords_prods_merged.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_merged_updated_2.pkl'))