# Project: 12-2023 Instacart Basket Analysis
## Author: Nadia Ordonez
## Step 3 IC Orders products merged

# Table of Contents
* [1. Introduction](#1.-Introduction)
* [2. Importing data](#2.-Importing-data)
    * [2.1 Importing libraries](#2.1-Importing-libraries)
    * [2.2 Importing data](#2.2-Importing-data)
* [3. Data combining](#3.-Data-combining)
    * [3.1 RAM memory space](#3.1-RAM-memory-space)
    * [3.2 Key variable](#3.2-Key-variable)
    * [3.3 Merge](#3.3-Merge)
* [4. Exporting data](#4.-Exporting-data) 

# 1. Introduction

To answer the Instacart research question, all dataframes will be combined. Previously, the orders dataframe was combined with the order products (prior) dataframe into a single dataframe named "orders_products_combined". Here, the products dataframe is merged in as well, which adds product details such as product names and prices.

# 2. Importing data

## 2.1 Importing libraries

In [1]:
#Import analytical libraries
import pandas as pd
import numpy as np
import os

## 2.2 Importing data

In [2]:
#Project folder path into a string to easily retrieve data
path = r'C:\Users\Ich\Documents\12-2023 Instacart Basket Analysis'

### Order products combined

In [3]:
#Import "orders_products_combined_step2.pkl"
#See "Step 2 IC Orders products combined" to check for merging details
orders_products_combined = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared data', 'orders_products_combined_step2.pkl'))

In [5]:
#Check df size
orders_products_combined.shape

(32434489, 9)

In [6]:
#Check headers
orders_products_combined.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered
0,2539329,1,1,2,8,,196,1,0
1,2539329,1,1,2,8,,14084,2,0
2,2539329,1,1,2,8,,12427,3,0
3,2539329,1,1,2,8,,26088,4,0
4,2539329,1,1,2,8,,26405,5,0


### Products

In [3]:
#Import "products_step1.csv"
#See "Step 1 IC Data Import, Wrangling and Consistency checks" to check for clean up process
products = pd.read_csv(os.path.join(path, '02 Data', 'Prepared data', 'products_step1.csv'))

In [4]:
#Check df size
products.shape

(49688, 5)

In [5]:
#Check headers
products.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


# 3. Data combining

## 3.1 RAM memory space

### Orders products combined

In [12]:
#RAM issues can be avoided by converting columns to smaller data types
#Explore data types in the df
orders_products_combined.dtypes
#Data types were already converted to save RAM (see "Step 2 IC Orders products combined")

order_id                    int32
user_id                     int32
order_number                 int8
orders_day_of_week           int8
order_hour_of_day            int8
days_since_prior_order    float64
product_id                  int32
add_to_cart_sequence        int32
reordered                    int8
dtype: object

### Products

In [6]:
#RAM issues can be avoided by converting columns to smaller data types
#Explore data types in the df
products.dtypes

product_id         int64
product_name      object
aisle_id           int64
department_id      int64
prices           float64
dtype: object

In [7]:
#Convert product_id to int32, the same dtype the key has in orders_products_combined
products['product_id'] = products['product_id'].astype('int32')

In [8]:
#Convert aisle_id to int8
#Note: int8 only holds -128 to 127 and astype() wraps larger values silently, so check products['aisle_id'].max() first
products['aisle_id'] = products['aisle_id'].astype('int8')

In [9]:
#Convert department_id to int8 (same range limit as above)
products['department_id'] = products['department_id'].astype('int8')

In [10]:
#See results
products.dtypes

product_id         int32
product_name      object
aisle_id            int8
department_id       int8
prices           float64
dtype: object
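
The conversions above rely on every id fitting into the smaller type. A minimal sketch of a range check before casting is shown below. `safe_downcast` is a hypothetical helper (not part of pandas), and the toy values are made up for illustration, not taken from the Instacart data.

```python
import numpy as np
import pandas as pd


def safe_downcast(series: pd.Series, dtype: str) -> pd.Series:
    """Hypothetical helper: cast to a smaller integer dtype only if every value fits."""
    limits = np.iinfo(dtype)
    if series.min() < limits.min or series.max() > limits.max:
        raise ValueError(
            f"{series.name}: values span {series.min()}..{series.max()}, "
            f"outside the {dtype} range {limits.min}..{limits.max}"
        )
    return series.astype(dtype)


# Toy stand-in for the products dataframe (illustrative values only)
toy_products = pd.DataFrame({"product_id": [1, 2, 3], "aisle_id": [61, 104, 130]})

toy_products["product_id"] = safe_downcast(toy_products["product_id"], "int32")

try:
    toy_products["aisle_id"] = safe_downcast(toy_products["aisle_id"], "int8")
except ValueError as err:
    # 130 does not fit in int8 (max 127), so fall back to the next size up
    print(err)
    toy_products["aisle_id"] = safe_downcast(toy_products["aisle_id"], "int16")

print(toy_products.dtypes)
```

Raising an error instead of casting blindly is a design choice: a plain `astype('int8')` on an out-of-range integer does not fail, it wraps around, and the corrupted ids would only surface much later in the analysis.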

## 3.2 Key variable

In [19]:
#The key variable is "product_id". It is shared by the two dataframes
#An inner join keeps only the rows whose product_id is present in both dataframes
#Explore "product_id" in orders products combined
orders_products_combined['product_id'].describe()
#Product ids range from 1 to 49688

count    3.243449e+07
mean     2.557634e+04
std      1.409669e+04
min      1.000000e+00
25%      1.353000e+04
50%      2.525600e+04
75%      3.793500e+04
max      4.968800e+04
Name: product_id, dtype: float64

In [20]:
#Identify the number of unique values in 'product_id' from orders products combined
unique_orders_products_combined = orders_products_combined['product_id'].nunique()
print(f"The number of unique values in 'product_id': {unique_orders_products_combined}")
#Product ids run up to 49688, but only 49677 distinct products appear in this df

The number of unique values in 'product_id': 49677


In [21]:
#The key variable is "product_id". It is shared by the two dataframes
#An inner join keeps only the rows whose product_id is present in both dataframes
#Explore "product_id" in products
products['product_id'].describe()
#Product ids range from 1 to 49688

count    49688.000000
mean     24844.500000
std      14343.834425
min          1.000000
25%      12422.750000
50%      24844.500000
75%      37266.250000
max      49688.000000
Name: product_id, dtype: float64

In [22]:
#Identify the number of unique values in 'product_id' from products
unique_products = products['product_id'].nunique()
print(f"The number of unique values in 'product_id': {unique_products}")
#The orders df holds only 49677 of these ids, so only 49677 unique product ids are expected after an inner merge

The number of unique values in 'product_id': 49688
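
The two counts (49677 in the orders df, 49688 in products) suggest that 11 products were never ordered, provided every id in the orders df also exists in products. A minimal sketch of how the mismatching ids could be listed in both directions is given below. The toy dataframes and their values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the two dataframes (illustrative values only)
toy_orders = pd.DataFrame({"order_id": [10, 10, 11], "product_id": [1, 3, 3]})
toy_products = pd.DataFrame({"product_id": [1, 2, 3, 4]})

# Product ids in the catalogue that no order line references
never_ordered = toy_products.loc[
    ~toy_products["product_id"].isin(toy_orders["product_id"]), "product_id"
]

# Product ids in the order lines that the catalogue does not contain
# (rows with these ids would be dropped by an inner join)
unknown_ids = toy_orders.loc[
    ~toy_orders["product_id"].isin(toy_products["product_id"]), "product_id"
]

print(never_ordered.tolist())
print(unknown_ids.tolist())
```

Checking both directions matters because `nunique()` alone cannot tell whether the smaller set of ids is fully contained in the larger one.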


## 3.3 Merge

In [23]:
#Merge the two dataframes on the key variable (merge() performs an inner join by default)
orders_products_merged = orders_products_combined.merge(products, on = 'product_id')

In [24]:
#See results
orders_products_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,aisle_id,department_id,prices
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0


In [25]:
#Identify the number of unique values in 'product_id' from merged dataframe
unique_orders_products_merged = orders_products_merged['product_id'].nunique()
print(f"The number of unique values in 'product_id': {unique_orders_products_merged}")
#As expected from the key variable checks, there are only 49677 unique product ids after the merge

The number of unique values in 'product_id': 49677
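
As a complement to counting unique ids, pandas can report how each row matched during a merge. The sketch below uses the `indicator` and `validate` parameters of `merge()` on toy data. The dataframes and values are assumptions for illustration, and this is an optional check rather than the step used in this notebook.

```python
import pandas as pd

# Toy stand-ins for the two dataframes (illustrative values only)
toy_orders = pd.DataFrame({"order_id": [10, 10, 11], "product_id": [1, 3, 5]})
toy_products = pd.DataFrame(
    {"product_id": [1, 2, 3], "product_name": ["Soda", "Salt", "Tea"]}
)

# An outer join with indicator=True adds a "_merge" column that says whether
# each row came from both dataframes, the left one only, or the right one only.
# validate="many_to_one" raises an error if product_id is duplicated in products.
checked = toy_orders.merge(
    toy_products,
    on="product_id",
    how="outer",
    indicator=True,
    validate="many_to_one",
)
print(checked["_merge"].value_counts())

# Keeping only the "both" rows reproduces the result of an inner join
merged = checked[checked["_merge"] == "both"].drop(columns="_merge")
print(merged)
```

On the full data an outer join with an indicator column costs extra memory, so it may be more practical to run this check on the key columns only.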


# 4. Exporting data

In [27]:
#check df size
orders_products_merged.shape

(32434489, 13)

In [28]:
#check df headers
orders_products_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,aisle_id,department_id,prices
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0


In [29]:
#Exporting to prepared data folder
#The pickle format is preferred for large df. This df contains 32M rows
orders_products_merged.to_pickle(os.path.join(path, '02 Data','Prepared data', 'orders_products_merged_step3.pkl'))
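
One practical reason to prefer pickle here is that it keeps the memory-efficient dtypes set earlier, whereas a CSV round trip would infer them again on import. A minimal sketch of a round-trip check on toy data follows. The file name and values are assumptions for illustration, and a temporary folder is used so nothing is written to the project path.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the merged dataframe (illustrative values only)
toy = pd.DataFrame(
    {
        "product_id": pd.Series([196, 196], dtype="int32"),
        "prices": [9.0, 9.0],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_merged.pkl")
    toy.to_pickle(out_path)
    restored = pd.read_pickle(out_path)

# Both the dtypes and the values should survive the round trip
same_dtypes = restored.dtypes.equals(toy.dtypes)
same_values = restored.equals(toy)
print(same_dtypes, same_values)
```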