# 4.2 Merging "orders products combined" and "products" dataframes
** **
## Table of contents:

1. Importing libraries
2. Importing dataframes
3. Merging the two dataframes
4. Exporting the new dataframe in a pickle format
** **

# 1. Importing libraries
** **

In [1]:
import pandas as pd
import numpy as np
import os

# 2. Importing dataframes
** **

In [2]:
# Creating a path variable for the project folder
path = r'C:\Users\Simone\Desktop\Career Foundry\Esercizi modulo 5\Instacart basket analysis'

In [3]:
# Importing the previously merged dataframe from Prepared Data
df_ords = pd.read_pickle(os.path.join(path, '02. Data', 'Prepared Data', 'orders_products_combined.pkl'))

In [4]:
# Checking the shape of the merged dataframe
df_ords.shape

(30356421, 10)

In [5]:
# Checking the head
df_ords.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge
0,2398795,1,2,3,7,15.0,196,1,1,both
1,2398795,1,2,3,7,15.0,10258,2,0,both
2,2398795,1,2,3,7,15.0,12427,3,1,both
3,2398795,1,2,3,7,15.0,13176,4,0,both
4,2398795,1,2,3,7,15.0,26088,5,1,both


In [6]:
# Removing the _merge column that is not necessary anymore.
df_ords = df_ords.drop(columns = ['_merge'])

In [7]:
# Testing the change
df_ords.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered
0,2398795,1,2,3,7,15.0,196,1,1
1,2398795,1,2,3,7,15.0,10258,2,0
2,2398795,1,2,3,7,15.0,12427,3,1
3,2398795,1,2,3,7,15.0,13176,4,0
4,2398795,1,2,3,7,15.0,26088,5,1


In [8]:
# Importing products dataframe
df_prods = pd.read_csv(os.path.join(path, '02. Data', 'Prepared Data', 'products_checked.csv'), index_col= False)

In [9]:
# Checking the head
df_prods.head()

Unnamed: 0.1,Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,0,1,Chocolate Sandwich Cookies,61,19,5.8
1,1,2,All-Seasons Salt,104,13,9.3
2,2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,4,5,Green Chile Anytime Sauce,5,13,4.3


In [10]:
# Removing the leftover 'Unnamed: 0' column (an old index saved into the CSV)
df_prods = df_prods.drop(columns = ['Unnamed: 0'])

In [11]:
# Testing the change
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3
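
The `Unnamed: 0` column dropped above typically appears when a CSV was written together with its index. As a minimal sketch, assuming that is what happened here, the column can also be absorbed at read time with `index_col=0`. The CSV text below is a made-up stand-in, not the real `products_checked.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV text saved with its index (note the empty first header field)
csv_text = (
    ",product_id,product_name\n"
    "0,1,Chocolate Sandwich Cookies\n"
    "1,2,All-Seasons Salt\n"
)

# Default read: the saved index comes back as a regular column named 'Unnamed: 0'
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the dataframe index instead
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.columns.tolist())
print(df_indexed.columns.tolist())
```

Either approach gives the same usable columns. Dropping the column after reading, as done in this notebook, is the more explicit of the two.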


# 3. Merging the two dataframes
** **

For this merge I want a full match, so that the final dataframe contains only information that is present in both dataframes. <br>
For this reason I will use an inner join (the default for `merge`) and create a merge flag. <br>
The key column (the column both dataframes share) is `product_id`.
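
To illustrate what an inner join on `product_id` keeps and discards, here is a small sketch on toy data. The two dataframes and their values are invented for the example and only mimic the structure of `df_ords` and `df_prods`.

```python
import pandas as pd

# Toy stand-ins for the orders and products dataframes (made-up values)
orders = pd.DataFrame({"order_id": [10, 11, 12], "product_id": [1, 2, 99]})
products = pd.DataFrame(
    {"product_id": [1, 2, 3], "product_name": ["A", "B", "C"]}
)

# how="inner" is the default for DataFrame.merge; it is spelled out for clarity
merged = orders.merge(products, on="product_id", how="inner")

# Only product_id 1 and 2 appear in both dataframes, so only those rows survive
print(merged)
```

The order with `product_id` 99 has no counterpart in `products` and disappears. Product 3 is never ordered, so it is dropped as well.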

In [12]:
# Merging the two dataframes on the product_id column
df_merged = df_ords.merge(df_prods, on = ['product_id'], indicator = True)

In [13]:
# Checking the output
df_merged

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge
0,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both
1,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both
2,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both
3,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both
4,3367565,1,6,2,7,19.0,196,1,1,Soda,77,7,9.0,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30328758,31526,202557,18,5,11,3.0,43553,2,1,Orange Energy Shots,64,7,3.7,both
30328759,2745165,203436,2,3,5,15.0,42338,16,1,"Zucchini Chips, Pesto",50,19,6.9,both
30328760,850996,204229,12,2,3,25.0,37595,20,0,Dead Sea Minerals Eucalyptus Triple Milled Soap,25,11,13.5,both
30328761,2550789,204472,6,3,15,7.0,37595,9,0,Dead Sea Minerals Eucalyptus Triple Milled Soap,25,11,13.5,both


<b> Observations: </b>
The previous orders dataframe had 30,356,421 rows and 9 columns. <br>
This new dataframe has 30,328,763 rows, which is 27,658 fewer (probably because the inner join excluded rows whose product_id has no match in the products dataframe), and five more columns (product_name, aisle_id, department_id, prices and the merge flag).
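
One way to confirm where the missing rows went is to look for `product_id` values in the orders data that do not exist in the products data. The sketch below uses toy dataframes with invented values. On the real data the same idea would apply to `df_ords` and `df_prods`, assuming `product_id` is unique in the products dataframe.

```python
import pandas as pd

# Toy stand-ins (made-up values); product_id 99 is missing from products
orders = pd.DataFrame(
    {"order_id": [10, 11, 12, 13], "product_id": [1, 2, 99, 99]}
)
products = pd.DataFrame(
    {"product_id": [1, 2, 3], "product_name": ["A", "B", "C"]}
)

# Rows in orders whose product_id has no match in products
unmatched = orders[~orders["product_id"].isin(products["product_id"])]

n_dropped = len(unmatched)
missing_ids = unmatched["product_id"].unique().tolist()

# With unique product_ids on the right side, the inner join loses exactly these rows
n_inner = len(orders.merge(products, on="product_id"))

print(n_dropped, missing_ids, n_inner)
```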

In [14]:
# Checking the results of the merge through the merge flag
df_merged['_merge'].value_counts()

both          30328763
left_only            0
right_only           0
Name: _merge, dtype: int64

Of course, since this is an inner join, there are only full matches, so all values are "both".
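
For contrast, the merge flag becomes more informative with an outer join, where unmatched rows from either side are kept. This is only a sketch on toy data with invented values, not a step performed in this notebook.

```python
import pandas as pd

# Toy stand-ins (made-up values)
orders = pd.DataFrame({"order_id": [10, 11, 12], "product_id": [1, 2, 99]})
products = pd.DataFrame(
    {"product_id": [1, 2, 3], "product_name": ["A", "B", "C"]}
)

# An outer join keeps unmatched rows, so the flag can take all three values
outer = orders.merge(products, on="product_id", how="outer", indicator=True)
counts = outer["_merge"].value_counts()

print(counts)
```

Here `left_only` counts order rows with an unknown product and `right_only` counts products that were never ordered.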

In [15]:
# Before exporting the dataframe, I am going to drop the _merge column.
df_merged = df_merged.drop(columns = ['_merge'])

In [16]:
# Testing the drop
df_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices
0,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0
1,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0
2,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0
3,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0
4,3367565,1,6,2,7,19.0,196,1,1,Soda,77,7,9.0


# 4. Exporting the new dataframe in a pickle format
** **

Due to the size of the dataframe, it is not practical to export it as a CSV. <br>
It will be exported in PKL (pickle) format instead.
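
As a small illustration of the pickle round trip, the sketch below writes a toy dataframe to a temporary folder and reads it back. The dataframe, the file name `toy.pkl` and the temporary folder are made up for the example and are unrelated to the project paths.

```python
import os
import tempfile
import pandas as pd

# Toy dataframe with mixed dtypes and a missing value (made-up values)
df = pd.DataFrame(
    {
        "order_id": [1, 2],
        "days_since_prior_order": [15.0, None],
        "product_name": ["Soda", "Zucchini Chips, Pesto"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "toy.pkl")
    df.to_pickle(pkl_path)              # write the dataframe as a pickle file
    restored = pd.read_pickle(pkl_path)  # read it back

# The restored dataframe has the same values and the same dtypes
print(restored.equals(df))
print((restored.dtypes == df.dtypes).all())
```

Pickle stores the dataframe as it is, dtypes included, so nothing has to be parsed again on import. Keep in mind that pickle files should only be loaded from trusted sources.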

In [17]:
# Exporting new dataframe in pkl format
df_merged.to_pickle(os.path.join(path, '02. Data','Prepared Data', 'orders_products_merged.pkl'))