# Project: 12-2023 Instacart Basket Analysis
## Author: Nadia Ordonez
## Step 4 IC Orders products all

# Table of Contents
* [1. Introduction](#1.-Introduction)
* [2. Importing data](#2.-Importing-data)
    * [2.1 Importing libraries](#2.1-Importing-libraries)
    * [2.2 Importing data](#2.2-Importing-data)
* [3. Data combining](#3.-Data-combining)
    * [3.1 RAM memory space](#3.1-RAM-memory-space)
    * [3.2 Key variable](#3.2-Key-variable)
    * [3.3 Merge](#3.3-Merge)
* [4. Exporting data](#4.-Exporting-data) 

# 1. Introduction

To answer the Instacart research question, all dataframes need to be combined. Previously, the Orders dataframe was combined with the Order products prior dataframe into a single dataframe named "orders_product_combined". The Products dataframe was then merged in, and the result was saved as "orders_products_merged". Here, the Customers dataframe is added. This links orders to customers, so that customer shopping behavior can be explored.

# 2. Importing data

## 2.1 Importing libraries

In [1]:
#Import analytical libraries
import pandas as pd
import numpy as np
import os

## 2.2 Importing data

In [2]:
#Store the project folder path as a string to easily retrieve data
path = r'C:\Users\Ich\Documents\12-2023 Instacart Basket Analysis'

### Order products merged

In [3]:
#Import "orders_products_merged_step3.pkl"
#See "Step 3 IC Orders products merged" for merging details
orders_products_merged = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared data', 'orders_products_merged_step3.pkl'))

In [4]:
#Check df size
orders_products_merged.shape

(32434489, 13)

In [5]:
#Check headers
orders_products_merged.head()

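| | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_sequence | reordered | product_name | aisle_id | department_id | prices |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | 196 | 1 | 0 | Soda | 77 | 7 | 9.0 |
| 1 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 |
| 2 | 473747 | 1 | 3 | 3 | 12 | 21.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 |
| 3 | 2254736 | 1 | 4 | 4 | 7 | 29.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 |
| 4 | 431534 | 1 | 5 | 4 | 15 | 28.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 |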


### Customers

In [6]:
#Import "customers_step1.csv"
#See "Step 1 IC Data Import, Wrangling and Consistency checks" for the clean-up process
customers = pd.read_csv(os.path.join(path, '02 Data', 'Prepared data', 'customers_step1.csv'))

In [7]:
#Check df size
customers.shape

(206209, 8)

In [8]:
#Check headers
customers.head()

| | user_id | gender | state | age | date_joined | number_of_dependants | family_status | income |
|---|---|---|---|---|---|---|---|---|
| 0 | 26711 | Female | Missouri | 48 | 2017-01-01 | 3 | married | 165665 |
| 1 | 33890 | Female | New Mexico | 36 | 2017-01-01 | 0 | single | 59285 |
| 2 | 65803 | Male | Idaho | 35 | 2017-01-01 | 2 | married | 99568 |
| 3 | 125935 | Female | Iowa | 40 | 2017-01-01 | 0 | single | 42049 |
| 4 | 130797 | Female | Maryland | 26 | 2017-01-01 | 1 | married | 40374 |


# 3. Data combining

## 3.1 RAM memory space

### Orders products merged

In [9]:
#Memory issues can be avoided by converting columns to smaller data types
#Explore data types in the df
orders_products_merged.dtypes
#Data types were previously converted to save memory (see "Step 3 IC Orders products merged")

order_id                    int32
user_id                     int32
order_number                 int8
orders_day_of_week           int8
order_hour_of_day            int8
days_since_prior_order    float64
product_id                  int32
add_to_cart_sequence        int32
reordered                    int8
product_name               object
aisle_id                     int8
department_id                int8
prices                    float64
dtype: object

### Customers

In [10]:
#Memory issues can be avoided by converting columns to smaller data types
#Explore data types in the df
customers.dtypes

user_id                  int64
gender                  object
state                   object
age                      int64
date_joined             object
number_of_dependants     int64
family_status           object
income                   int64
dtype: object

In [11]:
#Convert specific columns to more memory-efficient types
customers['user_id'] = customers['user_id'].astype('int32')

In [12]:
#Convert specific columns to more memory-efficient types
customers['gender'] = customers['gender'].astype('category')

In [13]:
#Convert specific columns to more memory-efficient types
customers['state'] = customers['state'].astype('category')

In [14]:
#Convert specific columns to more memory-efficient types
customers['age'] = customers['age'].astype('int32')

In [15]:
#Convert specific columns to more memory-efficient types
customers['number_of_dependants'] = customers['number_of_dependants'].astype('int32')

In [16]:
#Convert specific columns to more memory-efficient types
customers['family_status'] = customers['family_status'].astype('category')
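
The effect of these conversions can be measured with `memory_usage(deep=True)`. The sketch below is only an illustration on a small, made-up dataframe (`toy_customers` and its values are hypothetical, not the project data); it applies the same kind of conversions in a single `astype` call and compares the footprint before and after.

```python
import pandas as pd

# Hypothetical toy data standing in for the customers dataframe
toy_customers = pd.DataFrame({
    'user_id': range(1, 1001),
    'gender': ['Female', 'Male'] * 500,
    'state': ['Missouri', 'Idaho', 'Iowa', 'Maryland'] * 250,
    'age': [48, 36, 35, 40] * 250,
})

# Memory footprint before conversion (deep=True also counts string contents)
before = toy_customers.memory_usage(deep=True).sum()

# Same kind of conversions as in the cells above, applied in one call
toy_converted = toy_customers.astype({
    'user_id': 'int32',
    'gender': 'category',
    'state': 'category',
    'age': 'int32',
})

# Memory footprint after conversion
after = toy_converted.memory_usage(deep=True).sum()
print(f"Before: {before} bytes, after: {after} bytes")
```

Passing a dictionary to `astype` is one way to keep all conversions in a single cell; converting column by column, as done above, gives the same result.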

## 3.2 Key variable

In [17]:
#The key variable is "user_id". This variable is shared between the two dataframes
#An inner join keeps only the rows whose "user_id" is present in both dataframes
#Explore "user_id" in Orders products merged
orders_products_merged['user_id'].describe()
#User ids range from 1 to 206209 in our df

count    3.243449e+07
mean     1.029372e+05
std      5.946648e+04
min      1.000000e+00
25%      5.142100e+04
50%      1.026110e+05
75%      1.543910e+05
max      2.062090e+05
Name: user_id, dtype: float64

In [18]:
#Identify the number of unique values in 'user_id' from orders_products_merged
unique_orders_products_merged = orders_products_merged['user_id'].nunique()
print(f"The number of unique values in 'user_id': {unique_orders_products_merged}")
#There is a total of 206209 unique user ids, starting at 1

The number of unique values in 'user_id': 206209


In [19]:
#The key variable is "user_id". This variable is shared between the two dataframes
#An inner join keeps only the rows whose "user_id" is present in both dataframes
#Explore "user_id" in customers
customers['user_id'].describe()
#User ids range from 1 to 206209

count    206209.000000
mean     103105.000000
std       59527.555167
min           1.000000
25%       51553.000000
50%      103105.000000
75%      154657.000000
max      206209.000000
Name: user_id, dtype: float64

In [20]:
#Identify the number of unique values in 'user_id' from customers
unique_customers = customers['user_id'].nunique()
print(f"The number of unique values in 'user_id': {unique_customers}")
#Customer details are therefore expected for every user in the orders_products_merged df
#The merged df is also expected to have the same number of rows as the orders_products_merged df

The number of unique values in 'user_id': 206209
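
A more direct way to compare the key variable is to compare the sets of ids themselves. The following is a minimal sketch on made-up data (`toy_orders` and `toy_customers` are hypothetical stand-ins for the two dataframes); it lists ids that appear on only one side and checks that each customer appears once.

```python
import pandas as pd

# Hypothetical toy data: several order rows per user, one row per customer
toy_orders = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3, 3],
    'order_id': [10, 11, 12, 13, 14, 15],
})
toy_customers = pd.DataFrame({
    'user_id': [1, 2, 3],
    'state': ['Alabama', 'Idaho', 'Iowa'],
})

order_users = set(toy_orders['user_id'])
customer_users = set(toy_customers['user_id'])

# Ids that would be dropped by an inner join because they exist on one side only
missing_in_customers = order_users - customer_users
missing_in_orders = customer_users - order_users

# Each customer should appear once, otherwise the merge would duplicate order rows
keys_are_unique = toy_customers['user_id'].is_unique

print(f"Users without customer details: {missing_in_customers}")
print(f"Customers without orders: {missing_in_orders}")
print(f"Customer ids are unique: {keys_are_unique}")
```

If both difference sets are empty and the customer ids are unique, an inner join keeps every order row exactly once.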


## 3.3 Merge

In [21]:
#Merge the dataframes on "user_id" (merge() performs an inner join by default)
orders_products_all = orders_products_merged.merge(customers, on = 'user_id')
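
As an optional diagnostic, `merge` accepts the `validate` and `indicator` parameters. The sketch below uses made-up data (`toy_orders` and `toy_customers` are hypothetical) and, unlike the merge above, an outer join, so that unmatched rows stay visible in the `_merge` column instead of being dropped.

```python
import pandas as pd

# Hypothetical toy data; user 4 is a customer without any orders
toy_orders = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3, 3],
    'order_id': [10, 11, 12, 13, 14, 15],
})
toy_customers = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'state': ['Alabama', 'Idaho', 'Iowa', 'Maryland'],
})

# validate='many_to_one' raises a MergeError if user_id repeats in toy_customers
# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = toy_orders.merge(
    toy_customers,
    on='user_id',
    how='outer',
    validate='many_to_one',
    indicator=True,
)

print(merged['_merge'].value_counts())
```

Rows flagged `left_only` or `right_only` are the ones an inner join would discard; when every row is flagged `both`, the inner and outer joins return the same rows.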

In [22]:
#See results
orders_products_all.head()

| | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_sequence | reordered | product_name | aisle_id | department_id | prices | gender | state | age | date_joined | number_of_dependants | family_status | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | 196 | 1 | 0 | Soda | 77 | 7 | 9.0 | Female | Alabama | 31 | 2019-02-17 | 3 | married | 40423 |
| 1 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 | Female | Alabama | 31 | 2019-02-17 | 3 | married | 40423 |
| 2 | 473747 | 1 | 3 | 3 | 12 | 21.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 | Female | Alabama | 31 | 2019-02-17 | 3 | married | 40423 |
| 3 | 2254736 | 1 | 4 | 4 | 7 | 29.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 | Female | Alabama | 31 | 2019-02-17 | 3 | married | 40423 |
| 4 | 431534 | 1 | 5 | 4 | 15 | 28.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 | Female | Alabama | 31 | 2019-02-17 | 3 | married | 40423 |


In [23]:
#Identify the number of unique values in 'user_id' in the resulting dataframe
unique_orders_products_all = orders_products_all['user_id'].nunique()
print(f"The number of unique values in 'user_id': {unique_orders_products_all}")
#As expected from the previous checks on the key variable, customer information was added for all users

The number of unique values in 'user_id': 206209


In [24]:
#Explore "user_id" in the resulting dataframe
orders_products_all['user_id'].describe()
#As expected from the previous checks on the key variable,
#the row count of the resulting df equals that of orders_products_merged

count    3.243449e+07
mean     1.029372e+05
std      5.946648e+04
min      1.000000e+00
25%      5.142100e+04
50%      1.026110e+05
75%      1.543910e+05
max      2.062090e+05
Name: user_id, dtype: float64
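
The expected shape of the result can also be checked explicitly: with one customer row per user, the merge keeps the number of order rows, and the number of columns is the sum of both dataframes minus the shared key. A minimal sketch on made-up data (`toy_orders`, `toy_customers` and `toy_all` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: 6 order rows with 2 columns, 3 customers with 3 columns
toy_orders = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3, 3],
    'order_id': [10, 11, 12, 13, 14, 15],
})
toy_customers = pd.DataFrame({
    'user_id': [1, 2, 3],
    'state': ['Alabama', 'Idaho', 'Iowa'],
    'age': [31, 35, 40],
})

toy_all = toy_orders.merge(toy_customers, on='user_id')

# Rows: unchanged, because every order row matches exactly one customer
expected_rows = len(toy_orders)
# Columns: both dataframes combined, with the key column counted once
expected_cols = toy_orders.shape[1] + toy_customers.shape[1] - 1

print(toy_all.shape, (expected_rows, expected_cols))
```

Applied to this notebook, the same arithmetic gives 13 + 8 - 1 = 20 columns for 32434489 rows.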

# 4. Exporting data

In [25]:
#check df size
orders_products_all.shape

(32434489, 20)

In [28]:
#check df headers
orders_products_all.head()

| | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_sequence | reordered | product_name | aisle_id | department_id | prices | gender | state | age | date_joined | number_of_dependants | family_status | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN | 196 | 1 | 0 | Soda | 77 | 7 | 9.0 | Female | Alabama | 31 | 2019-02-17 | 3 | married | 40423 |
| 1 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 | Female | Alabama | 31 | 2019-02-17 | 3 | married | 40423 |
| 2 | 473747 | 1 | 3 | 3 | 12 | 21.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 | Female | Alabama | 31 | 2019-02-17 | 3 | married | 40423 |
| 3 | 2254736 | 1 | 4 | 4 | 7 | 29.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 | Female | Alabama | 31 | 2019-02-17 | 3 | married | 40423 |
| 4 | 431534 | 1 | 5 | 4 | 15 | 28.0 | 196 | 1 | 1 | Soda | 77 | 7 | 9.0 | Female | Alabama | 31 | 2019-02-17 | 3 | married | 40423 |


In [27]:
#Export to the prepared data folder
#The pickle format is preferred for large dataframes; this one contains about 32.4 million rows
orders_products_all.to_pickle(os.path.join(path, '02 Data','Prepared data', 'orders_products_all_step4.pkl'))
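
One practical reason to use pickle here is that it keeps the converted data types (such as `int32` and `category`), whereas a CSV file would need to be re-cast after import. The round trip below is a small sketch on made-up data written to a temporary folder (`toy_all` and the file name are hypothetical, not the project files).

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy dataframe with memory-efficient dtypes
toy_all = pd.DataFrame({
    'user_id': pd.Series([1, 2, 3], dtype='int32'),
    'state': pd.Categorical(['Alabama', 'Idaho', 'Iowa']),
})

# Write to a temporary folder and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'toy_all.pkl')
    toy_all.to_pickle(pkl_path)
    reloaded = pd.read_pickle(pkl_path)

# Compare values and dtypes of the original and the reloaded dataframe
frames_match = reloaded.equals(toy_all)
dtypes_match = list(reloaded.dtypes) == list(toy_all.dtypes)
print(f"Values match: {frames_match}, dtypes match: {dtypes_match}")
```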