# Task 4.6 - Combining and Exporting Data
# Notebook 2

## This script contains the following points:

### 1. Importing libraries
### 2. Importing datasets
### 3. Checking imported dataframes
### 4. Combining dataframes
### 5. Confirming the results of the merge using the merge flag
### 6. Exporting the dataframe

#### 1. Importing libraries

In [1]:
import pandas as pd
import numpy as np
import os

#### 2. Importing datasets

In [2]:
# Create path for dataframes

path = r'C:\Users\Lenad\Documents\Data Analytics Immersion\Achievement 4\Jupyter folder\Instacart basket analysis'

In [3]:
# Import orders_products_combined.pkl from Prepared Data folder

df_orders_products_combined = pd.read_pickle(os.path.join(path, '02. Data', 'Prepared Data', 'orders_products_combined.pkl'))

In [4]:
# Import products_checked.csv from Prepared Data folder

df_prods = pd.read_csv(os.path.join(path, '02. Data', 'Prepared Data', 'products_checked.csv'), index_col = False)

#### 3. Checking imported dataframes

In [5]:
# Check imported df_orders_products_combined

In [6]:
df_orders_products_combined.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,_merge
0,2539329,1,prior,1,2,8,,196,1,0,both
1,2539329,1,prior,1,2,8,,14084,2,0,both
2,2539329,1,prior,1,2,8,,12427,3,0,both
3,2539329,1,prior,1,2,8,,26088,4,0,both
4,2539329,1,prior,1,2,8,,26405,5,0,both


In [7]:
# Drop the '_merge' indicator column left over from the previous merge

df_orders_products_combined.drop(df_orders_products_combined.columns[df_orders_products_combined.columns.str.contains('merge',case = False)],axis = 1, inplace = True)

In [8]:
df_orders_products_combined.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered
0,2539329,1,prior,1,2,8,,196,1,0
1,2539329,1,prior,1,2,8,,14084,2,0
2,2539329,1,prior,1,2,8,,12427,3,0
3,2539329,1,prior,1,2,8,,26088,4,0
4,2539329,1,prior,1,2,8,,26405,5,0


In [9]:
df_orders_products_combined.shape

(32434489, 10)

In [10]:
df_prods.head()

Unnamed: 0.1,Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,0,1,Chocolate Sandwich Cookies,61,19,5.8
1,1,2,All-Seasons Salt,104,13,9.3
2,2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,4,5,Green Chile Anytime Sauce,5,13,4.3


In [11]:
# Drop the 'Unnamed: 0' column

df_prods.drop(df_prods.columns[df_prods.columns.str.contains('unnamed',case = False)],axis = 1, inplace = True)

In [12]:
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


In [13]:
df_prods.shape

(49693, 5)
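An 'Unnamed: 0' column like the one dropped above usually appears when a dataframe's index was written to CSV and then read back as ordinary data. The sketch below reproduces this with a small made-up dataframe (the values and the in-memory CSV are assumptions for illustration, not the project files) and shows that reading with `index_col = 0` is one possible way to avoid the extra column.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the products table
df_demo = pd.DataFrame({'product_id': [1, 2], 'prices': [5.8, 9.3]})

# to_csv() writes the index as a first column with an empty header
csv_text = df_demo.to_csv()

# index_col=False keeps that first column as data, named 'Unnamed: 0'
with_extra = pd.read_csv(io.StringIO(csv_text), index_col=False)

# index_col=0 uses the first column as the index instead
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

Dropping the column after import, as done in this notebook, gives the same set of columns.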

#### 4. Combining dataframes

Merging is the most suitable way to combine these two datasets: they differ in shape (32,434,489 rows against 49,693) but share the key column product_id. Concatenation would not be suitable, because it only stacks dataframes and these two do not have the same columns. The default inner join keeps only rows whose product_id is present in both datasets.
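To illustrate the difference between merging and concatenating, here is a minimal sketch on two small made-up dataframes. The names `orders_demo` and `products_demo` and all values are hypothetical; only the shared `product_id` key mirrors the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the orders and products tables
orders_demo = pd.DataFrame({'order_id': [1, 1, 2],
                            'product_id': [10, 20, 99]})
products_demo = pd.DataFrame({'product_id': [10, 20, 30],
                              'product_name': ['Soda', 'Jerky', 'Tea']})

# Default merge is an inner join: only product_ids found in both survive
merged_demo = orders_demo.merge(products_demo, on='product_id')

# Concatenation stacks the rows and fills non-shared columns with NaN
stacked_demo = pd.concat([orders_demo, products_demo], ignore_index=True)

print(merged_demo)
print(stacked_demo)
```

In this toy case the order row with `product_id` 99 and the product with `product_id` 30 both disappear from the merged result, because neither has a partner in the other table.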

In [14]:
df_orders_prods = df_orders_products_combined.merge(df_prods, on = 'product_id', indicator = True)

In [15]:
# Checking merged dataframe

df_orders_prods.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge
0,2539329,1,prior,1,2,8,,196,1,0,Soda,77,7,9.0,both
1,2539329,1,prior,1,2,8,,14084,2,0,Organic Unsweetened Vanilla Almond Milk,91,16,12.5,both
2,2539329,1,prior,1,2,8,,12427,3,0,Original Beef Jerky,23,19,4.4,both
3,2539329,1,prior,1,2,8,,26088,4,0,Aged White Cheddar Popcorn,23,19,4.7,both
4,2539329,1,prior,1,2,8,,26405,5,0,XL Pick-A-Size Paper Towel Rolls,54,17,1.0,both


#### 5. Confirming the results of the merge using the merge flag

In [16]:
df_orders_prods['_merge'].value_counts()

_merge
both          32434212
left_only            0
right_only           0
Name: count, dtype: int64

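The merge flag shows that all 32,434,212 rows of the newly merged dataframe have a value of both. With an inner join this is always the case, because unmatched rows are dropped during the merge, so the flag alone does not confirm a full match. The merged dataframe has 277 fewer rows than df_orders_products_combined (32,434,489), which means the inner join dropped order rows whose product_id has no match in df_prods.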
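One way to see which rows fail to match is to repeat the merge with `how = 'outer'`, which keeps unmatched rows from both sides and labels them in the flag. The sketch below uses small made-up dataframes (all names and values are hypothetical) rather than the full datasets.

```python
import pandas as pd

# Hypothetical stand-ins: product 99 is ordered but not in the catalogue,
# product 30 is in the catalogue but never ordered
orders_demo = pd.DataFrame({'order_id': [1, 1, 2],
                            'product_id': [10, 20, 99]})
products_demo = pd.DataFrame({'product_id': [10, 20, 30],
                              'product_name': ['Soda', 'Jerky', 'Tea']})

# Inner join: every surviving row is flagged 'both' by construction
inner_demo = orders_demo.merge(products_demo, on='product_id',
                               indicator=True)

# Outer join: unmatched rows are kept and flagged left_only / right_only
outer_demo = orders_demo.merge(products_demo, on='product_id',
                               how='outer', indicator=True)

print(inner_demo['_merge'].value_counts())
print(outer_demo['_merge'].value_counts())

# Rows present only in the orders table
unmatched_orders = outer_demo[outer_demo['_merge'] == 'left_only']
print(unmatched_orders)
```

On the full data an outer merge would be larger than the inner one, so this check is probably best run once for diagnosis rather than kept as the exported result.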

#### 6. Exporting the dataframe

In [17]:
# Export data to pkl

df_orders_prods.to_pickle(os.path.join(path, '02. Data','Prepared Data', 'ords_prods_merge.pkl'))

Pickle was the most suitable export format: it writes quickly, saves the dataframe exactly as it exists in Jupyter (including data types and the index), and compresses well when zipped.
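As a small illustration of what "exactly as it exists in Jupyter" means in practice, the sketch below writes a made-up dataframe to both pickle and CSV in a temporary folder and compares the data types after reloading. The file names and values are assumptions for the demo only.

```python
import os
import tempfile
import pandas as pd

# Hypothetical dataframe with a compact integer type and a categorical flag
df_demo = pd.DataFrame({
    'product_id': pd.Series([196, 14084], dtype='int32'),
    '_merge': pd.Categorical(['both', 'both'],
                             categories=['left_only', 'right_only', 'both']),
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, 'demo.pkl')
    csv_path = os.path.join(tmp, 'demo.csv')

    df_demo.to_pickle(pkl_path)
    df_demo.to_csv(csv_path, index=False)

    from_pkl = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# Pickle restores the original dtypes; CSV has to infer them again
pickle_keeps_dtypes = from_pkl.dtypes.equals(df_demo.dtypes)
csv_keeps_dtypes = from_csv.dtypes.equals(df_demo.dtypes)

print(pickle_keeps_dtypes, csv_keeps_dtypes)
```

Note that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.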