# Project: 12-2023 Instacart Basket Analysis
## Author: Nadia Ordonez
## Step 2 IC Orders products combined

# Table of Contents
* [1. Introduction](#1.-Introduction)
* [2. Importing data](#2.-Importing-data)
    * [2.1 Importing libraries](#2.1-Importing-libraries)
    * [2.2 Importing data](#2.2-Importing-data)
* [3. Data combining](#3.-Data-combining)
    * [3.1 RAM memory space](#3.1-RAM-memory-space)
    * [3.2 Key variable](#3.2-Key-variable)
    * [3.3 Merge](#3.3-Merge)
* [4. Exporting data](#4.-Exporting-data) 

# 1. Introduction

To answer the Instacart research question, all dataframes will be combined. First, the Orders dataframe will be combined with the Orders products prior dataframe. In this way, each of the 3421083 orders is enriched with the shopping items recorded for it.

# 2. Importing data

## 2.1 Importing libraries

In [2]:
#Import analytical libraries
import pandas as pd
import numpy as np
import os

## 2.2 Importing data

In [3]:
#Store the project folder path as a string to easily retrieve data
path = r'C:\Users\Ich\Documents\12-2023 Instacart Basket Analysis'

### Order

In [4]:
#Import "orders_step1.csv"
#See "Step 1 IC Data Import, Wrangling and Consistency checks" for the clean-up process
orders = pd.read_csv(os.path.join(path, '02 Data', 'Prepared data', 'orders_step1.csv'), index_col = False)

In [5]:
#Check df size
orders.shape

(3421083, 6)

In [6]:
#Check headers
orders.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0


### Orders products prior

In [7]:
#Import "orders_products_prior_step1.pkl"
#See "Step 1 IC Data Import, Wrangling and Consistency checks" for the clean-up process
orders_products_prior = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared data', 'orders_products_prior_step1.pkl'))

In [8]:
#Check df size
orders_products_prior.shape

(32434489, 4)

In [9]:
#Check headers
orders_products_prior.head()

Unnamed: 0,order_id,product_id,add_to_cart_sequence,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


# 3. Data combining

## 3.1 RAM memory space

### Orders

In [10]:
#Memory (RAM) issues can be avoided by converting columns to smaller data types
#Explore data types in the df
orders.dtypes

order_id                    int64
user_id                     int64
order_number                int64
orders_day_of_week          int64
order_hour_of_day           int64
days_since_prior_order    float64
dtype: object

In [12]:
#Convert specific columns to more memory-efficient types
orders['order_id'] = orders['order_id'].astype('int32')

In [13]:
#Convert specific columns to more memory-efficient types
orders['user_id'] = orders['user_id'].astype('int32')

In [14]:
#Convert specific columns to more memory-efficient types
orders['order_number'] = orders['order_number'].astype('int8')

In [15]:
#Convert specific columns to more memory-efficient types
orders['orders_day_of_week'] = orders['orders_day_of_week'].astype('int8')

In [16]:
#Convert specific columns to more memory-efficient types
orders['order_hour_of_day'] = orders['order_hour_of_day'].astype('int8')

In [17]:
#See results
orders.dtypes

order_id                    int32
user_id                     int32
order_number                 int8
orders_day_of_week           int8
order_hour_of_day            int8
days_since_prior_order    float64
dtype: object
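
One caveat with `astype` is that it does not check whether the values fit the smaller type, so an out-of-range value can wrap around silently. The following is a minimal sketch of a range check before downcasting, not the method used in this notebook. The helper `downcast_if_fits` and the dataframe `toy_orders` are hypothetical names, and the values are made up for illustration.

```python
import numpy as np
import pandas as pd


def downcast_if_fits(series, target):
    """Cast an integer Series to `target` only if every value is in range.

    Hypothetical helper for illustration; not part of pandas.
    """
    limits = np.iinfo(np.dtype(target))
    if series.min() < limits.min or series.max() > limits.max:
        raise ValueError(f"{series.name}: values fall outside the {target} range")
    return series.astype(target)


# Toy stand-in for the orders dataframe (made-up values)
toy_orders = pd.DataFrame(
    {
        'order_id': [2539329, 2398795, 473747],
        'order_number': [1, 2, 3],
    },
    dtype='int64',
)

# Compare the memory footprint before and after downcasting
before = toy_orders.memory_usage(deep=True).sum()
toy_orders['order_id'] = downcast_if_fits(toy_orders['order_id'], 'int32')
toy_orders['order_number'] = downcast_if_fits(toy_orders['order_number'], 'int8')
after = toy_orders.memory_usage(deep=True).sum()

print(f"Memory before: {before} bytes, after: {after} bytes")
```

On the real dataframes the same check could be run once per column before each `astype` call; a `ValueError` would then flag a column that needs a wider type.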

### Orders products prior

In [18]:
#Memory (RAM) issues can be avoided by converting columns to smaller data types
#Explore data types in the df
orders_products_prior.dtypes

order_id                int64
product_id              int64
add_to_cart_sequence    int64
reordered               int64
dtype: object

In [19]:
#Convert specific columns to more memory-efficient types
orders_products_prior['order_id'] = orders_products_prior['order_id'].astype('int32')

In [20]:
#Convert specific columns to more memory-efficient types
orders_products_prior['product_id'] = orders_products_prior['product_id'].astype('int32')

In [21]:
#Convert specific columns to more memory-efficient types
orders_products_prior['add_to_cart_sequence'] = orders_products_prior['add_to_cart_sequence'].astype('int32')

In [22]:
#Convert specific columns to more memory-efficient types
orders_products_prior['reordered'] = orders_products_prior['reordered'].astype('int8')

In [23]:
#See results
orders_products_prior.dtypes

order_id                int32
product_id              int32
add_to_cart_sequence    int32
reordered                int8
dtype: object

## 3.2 Key variable

In [24]:
#The key variable is "order_id". It is shared by the two dataframes
#An inner join keeps only the orders that are present in both dataframes
#Explore "order_id" in Orders
orders['order_id'].describe()
#order_id ranges from 1 to 3421083

count    3.421083e+06
mean     1.710542e+06
std      9.875817e+05
min      1.000000e+00
25%      8.552715e+05
50%      1.710542e+06
75%      2.565812e+06
max      3.421083e+06
Name: order_id, dtype: float64

In [25]:
#Identify the number of unique values in 'order_id' from orders
unique_orders = orders['order_id'].nunique()
print(f"The number of unique values in 'order_id': {unique_orders}")
#Every row has a unique order_id: 3421083 unique values in 3421083 rows

The number of unique values in 'order_id': 3421083


In [26]:
#The key variable is "order_id". It is shared by the two dataframes
#An inner join keeps only the orders that are present in both dataframes
#Explore "order_id" in Orders products prior
orders_products_prior['order_id'].describe()
#order_id ranges from 2 to 3421083; order_id = 1 is not present
#order_id is not unique per row here: an order has one row for each product it contains

count    3.243449e+07
mean     1.710749e+06
std      9.873007e+05
min      2.000000e+00
25%      8.559430e+05
50%      1.711048e+06
75%      2.565514e+06
max      3.421083e+06
Name: order_id, dtype: float64

In [27]:
#Identify the number of unique values in 'order_id' from orders products prior
unique_orders_products_prior = orders_products_prior['order_id'].nunique()
print(f"The number of unique values in 'order_id': {unique_orders_products_prior}")
#Shopping-item information is therefore expected for ONLY 3214874 of the orders

The number of unique values in 'order_id': 3214874
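
The comparison of unique counts above can be complemented by checking directly which keys are missing on each side. The following is a small sketch on toy data, assuming the same `order_id` key; `toy_orders` and `toy_items` are hypothetical stand-ins for the two dataframes.

```python
import pandas as pd

# Toy stand-ins for Orders and Orders products prior (made-up values)
toy_orders = pd.DataFrame({'order_id': [1, 2, 3, 4]})
toy_items = pd.DataFrame({
    'order_id': [2, 2, 3, 3, 3],
    'product_id': [10, 11, 10, 12, 13],
})

# Orders that have no rows in the items table (an inner join would drop them)
has_items = toy_orders['order_id'].isin(toy_items['order_id'])
orders_without_items = toy_orders.loc[~has_items, 'order_id']

# Item rows whose order_id does not exist in the orders table
unknown_orders = ~toy_items['order_id'].isin(toy_orders['order_id'])

print(f"Orders without items: {orders_without_items.tolist()}")
print(f"Item rows with unknown order_id: {unknown_orders.sum()}")
```

Running both checks shows in advance how many orders an inner join will drop and whether any item rows would be lost.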


## 3.3 Merge

In [28]:
#Merge the dataframes on the key variable with an inner join (the default for merge)
orders_products_combined = orders.merge(orders_products_prior, on = 'order_id', how = 'inner')

In [29]:
#See results
orders_products_combined.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered
0,2539329,1,1,2,8,,196,1,0
1,2539329,1,1,2,8,,14084,2,0
2,2539329,1,1,2,8,,12427,3,0
3,2539329,1,1,2,8,,26088,4,0
4,2539329,1,1,2,8,,26405,5,0


In [30]:
#Identify the number of unique values in 'order_id' from merged dataframe
unique_orders_products_combined = orders_products_combined['order_id'].nunique()
print(f"The number of unique values in 'order_id': {unique_orders_products_combined}")
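#As expected from the previous analyses on the key variable, there is ONLY information on shopping items for 3214874 orders
#The remaining 206209 orders (3421083 - 3214874) have no rows in Orders products prior and are dropped by the inner join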

The number of unique values in 'order_id': 3214874
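
As an optional safeguard, `merge` can verify the expected key relationship and report how the rows matched. The following sketch on toy data uses the `validate` and `indicator` parameters of `DataFrame.merge`; the dataframes `toy_orders` and `toy_items` are hypothetical, and this is not the exact call used above.

```python
import pandas as pd

# Toy stand-ins for Orders and Orders products prior (made-up values)
toy_orders = pd.DataFrame({'order_id': [1, 2, 3], 'user_id': [7, 7, 8]})
toy_items = pd.DataFrame({'order_id': [2, 2, 3], 'product_id': [10, 11, 12]})

# validate='one_to_many' raises MergeError if order_id repeats on the left side
combined = toy_orders.merge(
    toy_items, on='order_id', how='inner', validate='one_to_many'
)

# An outer join with indicator=True adds a '_merge' column that labels each row
# as 'both', 'left_only' or 'right_only'
audit = toy_orders.merge(toy_items, on='order_id', how='outer', indicator=True)
match_counts = audit['_merge'].value_counts()

print(combined)
print(match_counts)
```

In this sketch `left_only` counts the orders without items and `right_only` would count item rows without a matching order, which is the information the unique-value checks above recover indirectly.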


# 4. Exporting data

In [31]:
#check df size
orders_products_combined.shape

(32434489, 9)

In [32]:
#check df headers
orders_products_combined.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered
0,2539329,1,1,2,8,,196,1,0
1,2539329,1,1,2,8,,14084,2,0
2,2539329,1,1,2,8,,12427,3,0
3,2539329,1,1,2,8,,26088,4,0
4,2539329,1,1,2,8,,26405,5,0


In [33]:
#Export to the Prepared data folder
#The pickle format is preferred for large dataframes. This one contains about 32.4M rows
orders_products_combined.to_pickle(os.path.join(path, '02 Data','Prepared data', 'orders_products_combined_step2.pkl'))
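
A further practical point in favour of pickle is that it keeps the downcast data types, whereas a CSV file stores only text and the types are inferred again on import. The following is a minimal round-trip sketch on a toy dataframe written to a temporary folder; the file names and the dataframe `toy` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe with the downcast types used in this notebook (made-up values)
toy = pd.DataFrame({
    'order_id': pd.Series([2, 3], dtype='int32'),
    'reordered': pd.Series([1, 0], dtype='int8'),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'toy.pkl')
    csv_path = os.path.join(tmp_dir, 'toy.csv')

    # Write the same dataframe in both formats, then read each one back
    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# Pickle restores the original dtypes; read_csv infers them from the text
pickle_keeps_dtypes = list(from_pickle.dtypes) == list(toy.dtypes)
csv_keeps_dtypes = list(from_csv.dtypes) == list(toy.dtypes)

print(f"Pickle keeps dtypes: {pickle_keeps_dtypes}")
print(f"CSV keeps dtypes: {csv_keeps_dtypes}")
```

If CSV were needed instead, the types could be passed back through the `dtype` parameter of `pd.read_csv`.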