# Project: 12-2023 Instacart Basket Analysis
## Author: Nadia Ordonez
## Step 5 IC Final dataset after exclusion flag

# Table of Contents
* [1. Introduction](#1.-Introduction)
* [2. Importing data](#2.-Importing-data)
    * [2.1 Importing libraries](#2.1-Importing-libraries)
    * [2.2 Importing data](#2.2-Importing-data)
* [3. Exclusion flag](#3.-Exclusion-flag)
    * [3.1 Aggregate variable](#3.1-Aggregate-variable)
    * [3.2 Derived variable](#3.2-Derived-variable)
    * [3.3 Final dataframe](#3.3-Final-dataframe)
* [4. Exporting data](#4.-Exporting-data) 

# 1. Introduction

To answer the Instacart research questions, all dataframes were combined into a single dataframe, orders_products_all. Because the Instacart CFO isn’t interested in customers who don’t generate much revenue for the app, low-activity customers (customers with fewer than 5 orders) are excluded from orders_products_all here to generate the final dataframe.

First, a new variable, "max_order_number", is created that holds the maximum order number reached by each user. This step is described in 3.1 Aggregate variable. Second, a derived variable, "activity_customer", is created to flag each user as low or no_low. This step is described in 3.2 Derived variable. Finally, in 3.3 Final dataframe, the "activity_customer" variable is used to exclude the observations that are not of interest to the CFO, generating our final dataframe.
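
As a minimal sketch of these three steps, the cell below runs them on a small toy dataframe. The `toy` and `toy_final` names and all values in them are illustrative assumptions, not part of the Instacart data; only the column names and the threshold of 5 orders come from this notebook.

```python
import pandas as pd

# Toy data standing in for orders_products_all (illustrative values only)
toy = pd.DataFrame({
    'user_id':      [1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3],
    'order_number': [1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 4, 5, 6],
})

# 3.1 Aggregate variable: maximum order number per user, repeated on every row
toy['max_order_number'] = toy.groupby('user_id')['order_number'].transform('max')

# 3.2 Derived variable: flag users with fewer than 5 orders as low activity
toy['activity_customer'] = 'no_low'
toy.loc[toy['max_order_number'] < 5, 'activity_customer'] = 'low'

# 3.3 Final dataframe: keep only the no_low observations
toy_final = toy.loc[toy['activity_customer'] == 'no_low']
print(toy_final['user_id'].unique())
```

In this toy example, user 2 has only 2 orders and is dropped, while users 1 and 3 are kept.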

# 2. Importing data

## 2.1 Importing libraries

In [1]:
#Import analytical libraries
import pandas as pd
import numpy as np
import os

## 2.2 Importing data

In [2]:
#Store the project folder path as a string to easily retrieve data
path = r'C:\Users\Ich\Documents\12-2023 Instacart Basket Analysis'

### Order products all

In [3]:
#Import “orders_products_all_step4.pkl”
#See "Step 4 IC Orders products all" to check for merging details
orders_products_all = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared data', 'orders_products_all_step4.pkl'))

In [4]:
#Check df size
orders_products_all.shape

(32434489, 20)

In [5]:
#Check headers
orders_products_all.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,aisle_id,department_id,prices,gender,state,age,date_joined,number_of_dependants,family_status,income
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423


# 3. Exclusion flag

## 3.1 Aggregate variable

In [6]:
#Calculate the maximum order number ("max_order_number") reached by each user
#The string 'max' is passed instead of np.max, which newer pandas versions warn about in transform()
orders_products_all['max_order_number'] = orders_products_all.groupby(['user_id'])['order_number'].transform('max')


In [7]:
#See results
orders_products_all.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,...,department_id,prices,gender,state,age,date_joined,number_of_dependants,family_status,income,max_order_number
0,2539329,1,1,2,8,,196,1,0,Soda,...,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423,10
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423,10
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423,10
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423,10
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423,10


In [8]:
#Check aggregations
#Snap view of maximum order number per user
orders_products_all.groupby(['user_id'])['order_number'].max()

user_id
1         10
2         14
3         12
4          5
5          4
          ..
206205     3
206206    67
206207    16
206208    49
206209    13
Name: order_number, Length: 206209, dtype: int8

In [9]:
#Creating a temporary df
df = orders_products_all[['user_id', 'max_order_number']] 

In [10]:
#Subsetting by user_id to check the aggregation
df.loc[df['user_id'].isin([5, 206208])]
#Aggregation was calculated correctly

Unnamed: 0,user_id,max_order_number
16344673,206208,49
16344674,206208,49
16344675,206208,49
16344676,206208,49
16344677,206208,49
...,...,...
24652894,5,4
24652895,5,4
24652896,5,4
24652897,5,4
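
The visual check above covers two users. A programmatic check over every row could look like the sketch below, shown on toy data. The `toy`, `per_user_max`, `expected` and `aggregation_ok` names are hypothetical and the values are illustrative only.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    'user_id':      [5, 5, 5, 5, 8, 8],
    'order_number': [1, 2, 3, 4, 1, 2],
})
toy['max_order_number'] = toy.groupby('user_id')['order_number'].transform('max')

# Maximum order number per user, indexed by user_id
per_user_max = toy.groupby('user_id')['order_number'].max()

# Map each row's user_id to that user's maximum and compare with the new column
expected = toy['user_id'].map(per_user_max)
aggregation_ok = bool((toy['max_order_number'] == expected).all())
print(aggregation_ok)
```

If `aggregation_ok` is `True`, every row carries the maximum order number of its user.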


## 3.2 Derived variable

In [11]:
#Create flag
#Flag condition
#1. low-activity customers (customers with fewer than 5 orders)
#2. no-low activity customers (customers with 5 or more orders)
orders_products_all.loc[orders_products_all['max_order_number'] < 5, 'activity_customer'] = 'low'


In [12]:
orders_products_all.loc[orders_products_all['max_order_number'] >= 5, 'activity_customer'] = 'no_low'
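
The two `.loc` assignments above can also be written as a single step. The sketch below uses `np.where` on toy data as one possible alternative, not the method used in this notebook. The `toy` dataframe and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({'max_order_number': [4, 5, 49, 3]})

# 'low' where the user has fewer than 5 orders, 'no_low' otherwise
toy['activity_customer'] = np.where(toy['max_order_number'] < 5, 'low', 'no_low')
print(toy)
```

Because every row falls into one of the two branches, this form cannot leave missing values in the flag column.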

In [13]:
#Check flags
#Snap view of flags
orders_products_all.groupby(['user_id'])['activity_customer'].max()

user_id
1         no_low
2         no_low
3         no_low
4         no_low
5            low
           ...  
206205       low
206206    no_low
206207    no_low
206208    no_low
206209    no_low
Name: activity_customer, Length: 206209, dtype: object

In [14]:
#Creating a temporary df
df = orders_products_all[['user_id', 'max_order_number', 'activity_customer']] 

In [15]:
#Subsetting by user_id to check the flag
df.loc[df['user_id'].isin([5, 206208])]
#Flag is set correctly

Unnamed: 0,user_id,max_order_number,activity_customer
16344673,206208,49,no_low
16344674,206208,49,no_low
16344675,206208,49,no_low
16344676,206208,49,no_low
16344677,206208,49,no_low
...,...,...,...
24652894,5,4,low
24652895,5,4,low
24652896,5,4,low
24652897,5,4,low


In [16]:
#Print the frequency of flag "activity_customer"
orders_products_all['activity_customer'].value_counts(dropna = False) 
#It is expected that 30992966 observations will remain in the final df

activity_customer
no_low    30992966
low        1441523
Name: count, dtype: int64

## 3.3 Final dataframe

In [17]:
#Excluding observations using the flag
#Flag "activity_customer", exclude all observations labelled low-activity
orders_products_final = orders_products_all.loc[orders_products_all['activity_customer'] == 'no_low' ]  

In [18]:
#See results
orders_products_final.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,...,prices,gender,state,age,date_joined,number_of_dependants,family_status,income,max_order_number,activity_customer
0,2539329,1,1,2,8,,196,1,0,Soda,...,9.0,Female,Alabama,31,2019-02-17,3,married,40423,10,no_low
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,9.0,Female,Alabama,31,2019-02-17,3,married,40423,10,no_low
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,9.0,Female,Alabama,31,2019-02-17,3,married,40423,10,no_low
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,9.0,Female,Alabama,31,2019-02-17,3,married,40423,10,no_low
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,9.0,Female,Alabama,31,2019-02-17,3,married,40423,10,no_low


In [19]:
#See results
#Print the frequency of flag "activity_customer"
orders_products_final['activity_customer'].value_counts(dropna = False) 
#Only no_low observations are included

activity_customer
no_low    30992966
Name: count, dtype: int64

In [20]:
#See results
orders_products_final.shape
#As expected, only 30992966 observations remained

(30992966, 22)
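
The comparison between the expected and the actual row count can also be made explicit in code. The sketch below does this on toy data and filters directly on "max_order_number", which is equivalent to filtering on the flag. The `toy`, `expected_rows` and `toy_final` names and their values are illustrative assumptions.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    'user_id':          [1, 1, 5, 5, 5],
    'max_order_number': [10, 10, 4, 4, 4],
})

# Number of rows expected to remain: those from users with 5 or more orders
expected_rows = int((toy['max_order_number'] >= 5).sum())

# Apply the exclusion and confirm the row count matches the expectation
toy_final = toy.loc[toy['max_order_number'] >= 5]
assert len(toy_final) == expected_rows
```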

In [24]:
#Drop "activity_customer" variable
orders_products_final = orders_products_final.drop(columns = ['activity_customer'])

In [25]:
#check df size
orders_products_final.shape

(30992966, 21)

In [26]:
#check df headers
orders_products_final.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,...,department_id,prices,gender,state,age,date_joined,number_of_dependants,family_status,income,max_order_number
0,2539329,1,1,2,8,,196,1,0,Soda,...,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423,10
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423,10
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423,10
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423,10
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423,10


# 4. Exporting data

In [27]:
#Exporting to prepared data folder
#The pickle format is preferred for large df. This df contains 31M rows
orders_products_final.to_pickle(os.path.join(path, '02 Data','Prepared data', 'orders_products_final_step5.pkl'))
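
To confirm that an exported pickle can be read back unchanged, a round-trip check along the lines of the sketch below could be used. It writes a toy dataframe to a temporary folder rather than to the project path. The `toy_final`, `out_path`, `reloaded` and `round_trip_ok` names and the file name are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy data standing in for orders_products_final (illustrative values only)
toy_final = pd.DataFrame({'user_id': [1, 3], 'max_order_number': [10, 12]})

# Write to a temporary folder, then read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_final.pkl')
    toy_final.to_pickle(out_path)
    reloaded = pd.read_pickle(out_path)

# DataFrame.equals compares shape, values and dtypes
round_trip_ok = reloaded.equals(toy_final)
print(round_trip_ok)
```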