# Task

### Step 3 - import pickle file (previous steps were done in another notebook)

In [None]:
#initial setup, imports, defs, dataframe loading

import pandas as pd
import os
prep_data_path = '../2_Data/2_Prepared_Data'
df_orders_products_combined = pd.read_pickle(os.path.join(prep_data_path, 'orders_products_combined.pkl'))

### Step 4 - check the shape of the imported pickle file

In [2]:
df_orders_products_combined.shape

(32434489, 7)

It kept the shape, as expected!

### Step 5 - Determine a suitable way to combine products_checked and orders_products_combined dataframes

In [3]:
#load products checked file
df_products_chk = pd.read_csv(os.path.join(prep_data_path, 'products_checked.csv'))

Considerations:

* A `df.merge()` on the shared column `product_id` is the natural choice.

* Recall that we removed 16 products that had no name during cleaning. The orders dataframe likely still contains records whose `product_id` refers to one of those removed products, which no longer exist in the cleaned products dataframe.

* The question is whether to keep order records for products whose name we don't know. If so, a left join would keep every record from `df_orders_products_combined`. That doesn't make much sense to me, so I'll apply an inner join, keeping only orders whose `product_id` has a match in `df_products_chk`.
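To make the trade-off concrete, here is a minimal sketch on toy data. The `orders` and `products` frames and their values are hypothetical stand-ins, not the real files. It shows how a left join keeps the unmatched order row while an inner join drops it. The `validate='many_to_one'` argument is an optional extra, not part of the original notebook, that raises an error if `product_id` is duplicated on the products side.

```python
import pandas as pd

# Toy stand-ins for the real dataframes (hypothetical values)
orders = pd.DataFrame({'order_id': [1, 1, 2, 3],
                       'product_id': [10, 11, 12, 99]})  # 99 has no product record
products = pd.DataFrame({'product_id': [10, 11, 12],
                         'product_name': ['Milk', 'Bread', 'Eggs']})

# Left join: every order row is kept; unmatched rows get NaN for product_name
left_joined = orders.merge(products, on='product_id', how='left', indicator=True)

# Inner join: only rows whose product_id exists in both frames are kept.
# validate='many_to_one' checks that product_id is unique in `products`.
inner_joined = orders.merge(products, on='product_id', how='inner',
                            validate='many_to_one')

# Number of order rows that would be lost by choosing the inner join
n_left_only = int((left_joined['_merge'] == 'left_only').sum())
print(len(left_joined), len(inner_joined), n_left_only)
```

Running the left join with `indicator=True` first is one way to see how many rows an inner join would discard before committing to it.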

In [4]:
#applying an inner merge and storing it in a new dataframe (to be exported)
df_orders_products_merged = df_orders_products_combined.merge(df_products_chk, on = 'product_id', how = 'inner', indicator = True)

### Step 6 - confirm the results with the merge flag

In [5]:
df_orders_products_merged['_merge'].value_counts()

_merge
both          32403719
left_only            0
right_only           0
Name: count, dtype: int64
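An inner join can only produce `both` rows, so the flag confirms the join type but does not show how many rows were lost. A rough way to quantify that is to compare the row count from Step 4 with the `both` count above. The sketch below hard-codes those two numbers from the outputs in this notebook. It assumes `product_id` is unique in `df_products_chk`, since duplicates there would add rows and blur the comparison.

```python
# Row counts copied from the outputs shown in this notebook
rows_before = 32434489  # df_orders_products_combined.shape[0] from Step 4
rows_after = 32403719   # 'both' count from the value_counts() output

# Rows dropped by the inner join (orders with no matching product record),
# assuming product_id is unique in the products dataframe
rows_dropped = rows_before - rows_after
share_dropped = rows_dropped / rows_before

print(rows_dropped)
print(f"{share_dropped:.4%}")
```

The difference is 30,770 rows, which is under 0.1% of the original orders data.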

In [6]:
#after verification, dropping unneeded column
df_orders_products_merged.drop('_merge',axis=1,inplace=True)

### Step 7 - export merged dataframe

In [7]:
#due to its size, it's best to export to a pickle file
df_orders_products_merged.to_pickle(os.path.join(prep_data_path, 'orders_products_merged.pkl'))
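As in Step 4, the export can be sanity-checked by reading the file back and comparing it with the original. The following is a minimal sketch on a toy dataframe written to a temporary directory. The frame contents and the `toy_merged.pkl` file name are hypothetical, chosen so the check runs without touching the real data.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for df_orders_products_merged (hypothetical values)
df = pd.DataFrame({'order_id': [1, 2], 'product_id': [10, 11]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_merged.pkl')
    df.to_pickle(path)              # export
    reloaded = pd.read_pickle(path)  # import again

# True when shape, dtypes and values all survived the round trip
round_trip_ok = reloaded.equals(df)
print(reloaded.shape, round_trip_ok)
```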