
3) In a new notebook, import the orders_products_combined dataframe from the pickle file you just saved.

4) Check the shape of the imported dataframe (it should be the same as the one you exported—always check!).

5) Determine a suitable way to combine the orders_products_combined dataframe with your products data set. Make sure you’re using your wrangled, cleaned, and deduped products data set stored in your “Prepared Data” folder from the previous Exercise’s task.

6) Confirm the results of the merge using the merge flag.

7) Export the newly created dataframe as orders_products_merged in a suitable format (taking into consideration the size).

8) Ensure your notebooks and Instacart project folder are organized and that comments and section headings have been used throughout your code. All your exported data files should be effectively labelled and stored in your “Data” folder.

9) Save the two notebooks and send them to your tutor along with a screenshot of the exported datasets in your Instacart project folder.

# Importing Libraries

In [3]:
# Import Libraries
import pandas as pd
import numpy as np
import os

## Creating a path to the project folder

In [8]:
# Path to the project folder
path = r'C:\Users\mmoss\20-12-2021 Instacart Basket Analysis'

## Importing the products_cleaned.csv

In [5]:
# Listing the columns to load, which leaves out the unnamed index column
vars_list = ['product_id', 'product_name', 'aisle_id', 'department_id', 'prices']

In [9]:
# Importing the products_cleaned.csv file with only the columns in vars_list
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'products_cleaned.csv'), usecols = vars_list)
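The `usecols` argument in the `read_csv` call above is what keeps the unnamed index column out of df_prods. A minimal sketch of the effect, using a small in-memory CSV (the `csv_text` string is made up for illustration and only mimics the layout of products_cleaned.csv):

```python
import io
import pandas as pd

# Hypothetical CSV text: the first column has no header, as happens when a
# dataframe is saved to CSV together with its index
csv_text = (
    ",product_id,product_name,aisle_id,department_id,prices\n"
    "0,1,Chocolate Sandwich Cookies,61,19,5.8\n"
    "1,2,All-Seasons Salt,104,13,9.3\n"
)

vars_list = ['product_id', 'product_name', 'aisle_id', 'department_id', 'prices']

# Without usecols, pandas names the header-less column 'Unnamed: 0'
df_all = pd.read_csv(io.StringIO(csv_text))

# With usecols, only the listed columns are loaded
df_subset = pd.read_csv(io.StringIO(csv_text), usecols = vars_list)

print(df_all.columns.tolist())
print(df_subset.columns.tolist())
```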

## 3. Importing the .pkl file

In [13]:
# Importing the full orders_products_combined.pkl file (no column or row restrictions)
df_ords_products_combined = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_combined.pkl'))

In [11]:
# Testing it
df_ords_products_combined.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,_merge
0,2539329,1,1,2,8,,True,196,1,0,both
1,2539329,1,1,2,8,,True,14084,2,0,both
2,2539329,1,1,2,8,,True,12427,3,0,both
3,2539329,1,1,2,8,,True,26088,4,0,both
4,2539329,1,1,2,8,,True,26405,5,0,both


## 4. Checking the shape of the new dataframe

In [12]:
df_ords_products_combined.shape

(32434489, 11)

The shape matches the dataframe that was exported: 32,434,489 rows and 11 columns (7 + 4 from the two dataframes that were combined).
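The shape check can also be written as an assertion so that a mismatch stops the notebook instead of relying on reading the numbers by eye. A small sketch on a toy dataframe saved to a temporary folder (`df_small` and the file name are placeholders, not the project data):

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for the exported dataframe
df_small = pd.DataFrame({'order_id': [1, 2, 3], 'product_id': [196, 14084, 12427]})
expected_shape = df_small.shape  # record the shape at export time

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'example.pkl')
    df_small.to_pickle(pkl_path)           # export
    df_loaded = pd.read_pickle(pkl_path)   # import in the "new notebook"

# Fails loudly if rows or columns were lost or added along the way
assert df_loaded.shape == expected_shape
```

In the real notebook the expected shape would have to be noted down from the exporting notebook, since the two notebooks do not share variables.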

## 5. Checking the products dataset to combine with ords_products_combined

In [19]:
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


The product_id column is common to both dataframes, so it can serve as the merge key.

The goal is to keep every row from df_ords_products_combined and add the matching product details from df_prods. A left join does this; an inner join is run as well for comparison.
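The difference between the join types is easier to see on a few rows. The sketch below uses two toy dataframes (`orders_toy` and `prods_toy` are invented for illustration), where one order line refers to a product_id that is missing from the products table:

```python
import pandas as pd

# Toy order lines: product_id 99 has no entry in the products table
orders_toy = pd.DataFrame({'order_id': [10, 10, 11], 'product_id': [1, 2, 99]})
prods_toy = pd.DataFrame({'product_id': [1, 2], 'product_name': ['Cookies', 'Salt']})

# Left join: keeps all order lines, fills unmatched product details with NaN
left_toy = orders_toy.merge(prods_toy, on = 'product_id', how = 'left')

# Inner join: keeps only order lines whose product_id exists in both tables
inner_toy = orders_toy.merge(prods_toy, on = 'product_id', how = 'inner')

print(left_toy)
print(inner_toy)
```

A left join therefore shows up later as missing values in the product columns, while an inner join shows up as a lower row count.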

In [10]:
# Checking df_prods for missing values
df_prods.isnull().sum()

product_id       0
product_name     0
aisle_id         0
department_id    0
prices           0
dtype: int64
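A missing-value check does not reveal repeated keys, and a repeated product_id in the products table would multiply rows during the merge. The sketch below shows two ways this could be checked, on toy data (`prods_toy` deliberately contains a duplicate; it is not the project data). The `validate` argument of `merge` is a standard pandas option, but using it here is a suggestion rather than part of the original notebook:

```python
import pandas as pd
from pandas.errors import MergeError

# Toy products table where product_id 2 appears twice
prods_toy = pd.DataFrame({'product_id': [1, 2, 2], 'prices': [5.8, 9.3, 9.3]})
orders_toy = pd.DataFrame({'order_id': [10, 11], 'product_id': [1, 2]})

# Option 1: count repeated keys before merging
n_dup_keys = int(prods_toy['product_id'].duplicated().sum())

# A plain left join silently duplicates the order line for product_id 2
merged_toy = orders_toy.merge(prods_toy, on = 'product_id', how = 'left')

# Option 2: ask pandas to verify that keys on the right side are unique
try:
    orders_toy.merge(prods_toy, on = 'product_id', how = 'left', validate = 'many_to_one')
    validation_failed = False
except MergeError:
    validation_failed = True

print(n_dup_keys, len(merged_toy), validation_failed)
```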

In [22]:
# Merging the data with left join
df_merge_final = df_ords_products_combined.merge(df_prods, on = ['product_id'], how = 'left')

In [23]:
# Merging the data with inner join (the default when how is not given, made explicit here)
df_merge_final_2 = df_ords_products_combined.merge(df_prods, on = ['product_id'], how = 'inner')

In [24]:
# Checking the left-join result for missing values
df_merge_final.isnull().sum()

order_id                        0
user_id                         0
order_number                    0
orders_day_of_week              0
order_hour_of_day               0
days_since_prior_order    2078102
first_order                     0
product_id                      0
add_to_cart_order               0
reordered                       0
_merge                          0
product_name                30200
aisle_id                    30200
department_id               30200
prices                      30200
dtype: int64

In [25]:
#Checking the second merge for missing values
df_merge_final_2.isnull().sum()

order_id                        0
user_id                         0
order_number                    0
orders_day_of_week              0
order_hour_of_day               0
days_since_prior_order    2076096
first_order                     0
product_id                      0
add_to_cart_order               0
reordered                       0
_merge                          0
product_name                    0
aisle_id                        0
department_id                   0
prices                          0
dtype: int64

In [22]:
# Testing it
df_merge_final.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,_merge,product_name,aisle_id,department_id,prices
0,2539329,1,1,2,8,,True,196,1,0,both,Soda,77.0,7.0,9.0
1,2539329,1,1,2,8,,True,14084,2,0,both,Organic Unsweetened Vanilla Almond Milk,91.0,16.0,12.5
2,2539329,1,1,2,8,,True,12427,3,0,both,Original Beef Jerky,23.0,19.0,4.4
3,2539329,1,1,2,8,,True,26088,4,0,both,Aged White Cheddar Popcorn,23.0,19.0,4.7
4,2539329,1,1,2,8,,True,26405,5,0,both,XL Pick-A-Size Paper Towel Rolls,54.0,17.0,1.0


Success!

## 6. Confirming the results using the merge flag

In [23]:
# Seeing the frequencies of the _merge column
df_merge_final['_merge'].value_counts()

both          32435059
left_only            0
right_only           0
Name: _merge, dtype: int64

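Every row is flagged as both, but this _merge column was carried over from the earlier merge stored in orders_products_combined: the merges in step 5 were run without indicator = True, so these counts do not describe the merge with df_prods. The missing-value check in step 5 is the better evidence here. In the left join, 30,200 rows have no product_name, aisle_id, department_id, or prices, which means their product_id found no match in df_prods. The left join also has 32,435,059 rows compared with 32,434,489 before merging, which suggests that a few product_id values appear more than once in df_prods.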

In [18]:
# Seeing the frequencies of the _merge column for the inner-join result
df_merge_final_2['_merge'].value_counts()

both          32404859
left_only            0
right_only           0
Name: _merge, dtype: int64
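To get a flag that describes the merge with df_prods itself, one option is to drop the `_merge` column inherited from orders_products_combined and pass `indicator = True` to the new merge (pandas raises an error if a column named `_merge` already exists when the indicator is requested). A sketch on toy data, using an outer join so that all three flag values appear; `combined_toy` and `prods_toy` are made-up stand-ins:

```python
import pandas as pd

# Toy stand-in for orders_products_combined, including the old flag column
combined_toy = pd.DataFrame({
    'order_id': [10, 10, 11],
    'product_id': [1, 2, 99],
    '_merge': ['both', 'both', 'both'],  # left over from an earlier merge
})
prods_toy = pd.DataFrame({'product_id': [1, 2, 3], 'product_name': ['Cookies', 'Salt', 'Tea']})

# Drop the old flag, then let pandas create a fresh one for this merge
merged_toy = (
    combined_toy
    .drop(columns = ['_merge'])
    .merge(prods_toy, on = 'product_id', how = 'outer', indicator = True)
)

# left_only: order lines without a product; right_only: products never ordered
flag_counts = merged_toy['_merge'].value_counts()
print(flag_counts)
```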

## 7. Exporting the new dataframe in a suitable format

In [19]:
# Exporting in .pkl format
df_merge_final_2.to_pickle(os.path.join(path,'02 Data','Prepared Data','orders_products_merged.pkl'))
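Besides file size, one reason pickle suits this dataframe is that it stores the pandas dtypes as they are, whereas CSV writes everything as text and the dtypes must be inferred again on import. A small sketch on toy data (`df_small` and the file names are placeholders) comparing the two round trips:

```python
import os
import tempfile
import pandas as pd

# Toy dataframe with a categorical flag column, like _merge
df_small = pd.DataFrame({
    'product_id': [196, 14084],
    'first_order': [True, False],
    '_merge': pd.Categorical(['both', 'both']),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'example.pkl')
    csv_path = os.path.join(tmp_dir, 'example.csv')

    df_small.to_pickle(pkl_path)
    df_small.to_csv(csv_path, index = False)

    df_from_pkl = pd.read_pickle(pkl_path)
    df_from_csv = pd.read_csv(csv_path)

# Pickle keeps the categorical dtype; CSV reads the column back as plain text
pkl_keeps_category = isinstance(df_from_pkl['_merge'].dtype, pd.CategoricalDtype)
csv_keeps_category = isinstance(df_from_csv['_merge'].dtype, pd.CategoricalDtype)
print(pkl_keeps_category, csv_keeps_category)
```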

In [20]:
df_merge_final_2

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,_merge,product_name,aisle_id,department_id,prices
0,2539329,1,1,2,8,,True,196,1,0,both,Soda,77,7,9.0
1,2398795,1,2,3,7,15.0,False,196,1,1,both,Soda,77,7,9.0
2,473747,1,3,3,12,21.0,False,196,1,1,both,Soda,77,7,9.0
3,2254736,1,4,4,7,29.0,False,196,1,1,both,Soda,77,7,9.0
4,431534,1,5,4,15,28.0,False,196,1,1,both,Soda,77,7,9.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32404854,1320836,202557,17,2,15,1.0,False,43553,2,1,both,Orange Energy Shots,64,7,3.7
32404855,31526,202557,18,5,11,3.0,False,43553,2,1,both,Orange Energy Shots,64,7,3.7
32404856,758936,203436,1,2,7,,True,42338,4,0,both,"Zucchini Chips, Pesto",50,19,6.9
32404857,2745165,203436,2,3,5,15.0,False,42338,16,1,both,"Zucchini Chips, Pesto",50,19,6.9
