### Contents

01 Import libraries and datasets

02 Check dataframes

03 Merge dataframes

04 Merge again with inner join

05 Export merged dataframe

01 Import libraries and datasets

In [1]:
# Import libraries

import pandas as pd
import numpy as np
import os

In [5]:
# Import datasets

path = r"C:\Users\cathe\OneDrive\Data Analysis\2 4 Instacart Basket Analysis\02 Data"
df_orders_products_combined = pd.read_pickle(os.path.join(path, 'Prepared Data', 'orders_products_combined.pkl'))
df_products = pd.read_csv(os.path.join(path, 'Prepared Data', 'products_checked_cleaned.csv'), index_col = False)

02 Check dataframes

In [5]:
df_orders_products_combined.head()

,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge
0,2539329,1,1,2,8,,196,1,0,both
1,2539329,1,1,2,8,,14084,2,0,both
2,2539329,1,1,2,8,,12427,3,0,both
3,2539329,1,1,2,8,,26088,4,0,both
4,2539329,1,1,2,8,,26405,5,0,both


In [15]:
# Remove _merge flag as no longer needed

df_combined = df_orders_products_combined.drop(columns = ['_merge'])
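
Dropping the old flag is more than tidying up. As far as I understand it, pandas will not add a new indicator column when one called `_merge` already exists, so the next merge with `indicator = True` would fail. The sketch below uses two made-up toy dataframes (`left` and `right` are illustrative names, not the project data) to show the error and the fix.

```python
import pandas as pd

# Toy data (hypothetical): 'left' still carries a _merge flag from an earlier merge
left = pd.DataFrame({'product_id': [1, 2], '_merge': ['both', 'both']})
right = pd.DataFrame({'product_id': [1, 2], 'prices': [5.8, 9.3]})

# Merging with indicator=True while _merge already exists raises a ValueError
try:
    left.merge(right, on='product_id', indicator=True)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Dropping the old flag first lets the merge create a fresh _merge column
cleaned = left.drop(columns=['_merge'])
merged = cleaned.merge(right, on='product_id', indicator=True)

print(error_message)
print(list(merged.columns))
```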

In [17]:
df_combined.shape

(32434489, 9)

Note: shape is as expected

In [7]:
df_products.head()

,Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,0,1,Chocolate Sandwich Cookies,61,19,5.8
1,1,2,All-Seasons Salt,104,13,9.3
2,2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,4,5,Green Chile Anytime Sauce,5,13,4.3


Note to Mohamed: Why do I keep getting the extra index column, despite putting 'index_col = False' into the import code?
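
A possible answer, assuming the CSV was written with `to_csv()` and its default `index = True`: the old index is then stored in the file as a real first column with an empty header. `index_col = False` only tells pandas not to use any column as the index, so that stored column is still read in and shows up as `Unnamed: 0`. The sketch below reproduces this with a small made-up CSV string (not the real products file) and shows `index_col = 0` as one way to absorb the column on import.

```python
import io
import pandas as pd

# Toy CSV (hypothetical) mimicking a file saved with the index included:
# the first header cell is empty because the index had no name
csv_text = (
    ",product_id,product_name\n"
    "0,1,Chocolate Sandwich Cookies\n"
    "1,2,All-Seasons Salt\n"
)

# index_col=False: no column becomes the index, so the saved index
# stays as an ordinary column named 'Unnamed: 0'
kept = pd.read_csv(io.StringIO(csv_text), index_col=False)

# index_col=0: the first column is used as the index instead
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(kept.columns))
print(list(as_index.columns))
```

Writing the file with `to_csv(..., index = False)` in the earlier notebook should avoid the extra column in the first place.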

In [9]:
# Remove extra index column

df_prods = df_products.drop(columns = ['Unnamed: 0'])

In [11]:
df_prods.shape

(49671, 5)

03 Merge dataframes

In [13]:
# Check that _merge flag has gone as I did it wrong initially!

df_combined.head()

,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered
0,2539329,1,1,2,8,,196,1,0
1,2539329,1,1,2,8,,14084,2,0
2,2539329,1,1,2,8,,12427,3,0
3,2539329,1,1,2,8,,26088,4,0
4,2539329,1,1,2,8,,26405,5,0


In [14]:
# Merge using product_id as key; outer join to include products which haven't been ordered

df_merged = df_combined.merge(df_prods, how = 'outer', on = 'product_id', indicator = True)

In [25]:
df_merged['_merge'].value_counts()

_merge
both          32404859
left_only        30200
right_only          11
Name: count, dtype: int64

This suggests that 30,200 order rows have product ids which aren't listed in the products dataframe, and that 11 products have never been ordered. That seems the wrong way round, though: I would expect fewer errors (product id on an order doesn't exist) than unordered products (which could be quite a few if the timeframe is fairly restricted). So, to check my understanding of left and right, I'll do some investigating below.
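
To pin down what left and right mean here, a minimal sketch with invented toy data (the `orders` and `products` frames are illustrative, not the Instacart data): the dataframe that calls `.merge()` is the left one, so `left_only` counts order rows with no matching product and `right_only` counts products with no order rows.

```python
import pandas as pd

# Toy data (hypothetical): product 99 is ordered but not in the product list,
# and product 2 is in the product list but never ordered
orders = pd.DataFrame({'order_id': [10, 10, 11], 'product_id': [1, 99, 1]})
products = pd.DataFrame({'product_id': [1, 2], 'product_name': ['Cookies', 'Salt']})

# orders is the left dataframe, products is the right one
merged = orders.merge(products, how='outer', on='product_id', indicator=True)
counts = merged['_merge'].value_counts()

print(counts)
```

Note that `both` and `left_only` count order rows, not distinct products, which is why a handful of missing products can account for many rows.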

In [29]:
# Locating null values

df_merged.isnull().sum()

order_id                       11
user_id                        11
order_number                   11
order_day_of_week              11
order_hour_of_day              11
days_since_prior_order    2078113
product_id                      0
add_to_cart_order              11
reordered                      11
product_name                30200
aisle_id                    30200
department_id               30200
prices                      30200
_merge                          0
dtype: int64

This confirms my reading of left and right: 11 product_ids have no matching orders, and 30,200 order rows have product_ids which don't match any in the products dataframe. Possibly these are connected with the 16 rows which were removed during consistency checks because the product name was missing.

In [32]:
# Subset the rows that have no matching product

df_nan = df_merged[df_merged['product_name'].isnull()]

In [34]:
df_nan['product_id'].value_counts()

product_id
1511     13397
34        6536
116       4359
6799      1978
4790      1804
2240      1689
262        179
3230        55
26519       51
1780        39
2586        29
69          19
525         18
4283        17
40440       13
3736         8
3159         5
38183        4
Name: count, dtype: int64

Not quite: 18 product_ids appear in orders but not in the products dataframe, rather than 16.
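
The same two figures (unmatched rows and distinct unmatched ids) can be read straight off the indicator column. This is a sketch on invented toy data; `n_rows` and `n_ids` are names made up for the illustration.

```python
import pandas as pd

# Toy data (hypothetical): ids 99 and 98 are ordered but missing from products
orders = pd.DataFrame({'product_id': [99, 99, 99, 98, 1]})
products = pd.DataFrame({'product_id': [1, 2], 'product_name': ['Cookies', 'Salt']})

merged = orders.merge(products, how='outer', on='product_id', indicator=True)

# Keep only order rows with no matching product
left_only = merged[merged['_merge'] == 'left_only']

n_rows = len(left_only)                     # unmatched order rows
n_ids = left_only['product_id'].nunique()   # distinct unmatched product ids

print(n_rows, n_ids)
```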

04 Merge again with inner join

In [19]:
# Merge using product_id as key; inner join keeps only rows whose product_id appears in both dataframes

df_merged = df_combined.merge(df_prods, how = 'inner', on = 'product_id', indicator = True)

In [21]:
df_merged.shape

(32404161, 14)
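
One sanity check on this step: when both merges use exactly the same inputs, the inner-join row count should equal the `both` count from the outer join. In the outputs above the two figures are not identical (32,404,859 versus 32,404,161). That may only mean the cells were run on different versions of the inputs, but this is a guess the notebook does not confirm. The sketch below shows the check on invented toy data. It also passes `validate = 'many_to_one'`, which asks pandas to raise an error if a product_id appears more than once in the right dataframe.

```python
import pandas as pd

# Toy data (hypothetical), not the project dataframes
orders = pd.DataFrame({'order_id': [10, 10, 11], 'product_id': [1, 99, 1]})
products = pd.DataFrame({'product_id': [1, 2], 'product_name': ['Cookies', 'Salt']})

outer = orders.merge(products, how='outer', on='product_id', indicator=True)

# validate='many_to_one' raises MergeError if products has duplicate product_ids
inner = orders.merge(products, how='inner', on='product_id', validate='many_to_one')

# Rows flagged 'both' in the outer join are exactly the rows an inner join keeps
n_both = int((outer['_merge'] == 'both').sum())

print(len(inner), n_both)
```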

05 Export merged dataframe

In [14]:
df_merged.to_pickle(os.path.join(path, 'Prepared Data', 'ords_prods_merge.pkl'))