### **Getting the order_id values**
During the exploratory analysis of the order_items table, one finding was that the tables related to “orders” do not all contain the same order_id values: some values are missing from some tables.
##### ***So, this notebook collects the unique order_id values that are common to the related tables.***
The outcome of this notebook is a CSV file containing the "common" order_id values, to be applied as a filter when the "clean" tables are created just before loading the data into the database.

The related tables are:
   * olist_orders_dataset.csv   
   * olist_order_items_dataset.csv   
   * olist_order_payments_dataset.csv
   * olist_order_reviews_dataset.csv

In [1]:
import os
import pandas as pd

In [2]:
dataset_path = "../../data/raw/" 

### ***Database table relationships (from Kaggle)***
<img src="https://i.imgur.com/HRhd2Y0.png" alt="Database table relationships" style="height: 500px; width:900px;"/>

In [3]:
files = os.listdir(dataset_path)
print(f'The dataset contains {len(files)} files:')
for file in files:
    print(f'    * {file}')

The dataset contains 9 files:
    * olist_customers_dataset.csv
    * olist_geolocation_dataset.csv
    * olist_orders_dataset.csv
    * olist_order_items_dataset.csv
    * olist_order_payments_dataset.csv
    * olist_order_reviews_dataset.csv
    * olist_products_dataset.csv
    * olist_sellers_dataset.csv
    * product_category_name_translation.csv


#### Loading orders_dataset and getting the list of order_id values

In [4]:
csv_file_name = 'olist_orders_dataset.csv'
csv_file_path = os.path.join(dataset_path, csv_file_name)
df_order = pd.read_csv(csv_file_path)
order_id_list = df_order['order_id'].unique().tolist()
order_id_list.sort(key = str)
order_id_list[:5]

['00010242fe8c5a6d1ba2dd792cb16214',
 '00018f77f2f0320c557190d7a144bdd3',
 '000229ec398224ef6ca0657da4fc703e',
 '00024acbcdf0a6daa1e931b038114c75',
 '00042b26cf59d7ce69dfabb4e55b4fd9']

#### Loading order_items_dataset and getting the list of order_id values

In [5]:
csv_file_name = 'olist_order_items_dataset.csv'
csv_file_path = os.path.join(dataset_path, csv_file_name)
df_order_items = pd.read_csv(csv_file_path)
order_items_id_list = df_order_items['order_id'].unique().tolist()
order_items_id_list.sort(key = str)
order_items_id_list[:5]

['00010242fe8c5a6d1ba2dd792cb16214',
 '00018f77f2f0320c557190d7a144bdd3',
 '000229ec398224ef6ca0657da4fc703e',
 '00024acbcdf0a6daa1e931b038114c75',
 '00042b26cf59d7ce69dfabb4e55b4fd9']

#### Loading order_payments_dataset and getting the list of order_id values

In [6]:
csv_file_name = 'olist_order_payments_dataset.csv'
csv_file_path = os.path.join(dataset_path, csv_file_name)
df_order_payments = pd.read_csv(csv_file_path)
order_payments_id_list = df_order_payments['order_id'].unique().tolist()
order_payments_id_list.sort(key = str)
order_payments_id_list[:5]

['00010242fe8c5a6d1ba2dd792cb16214',
 '00018f77f2f0320c557190d7a144bdd3',
 '000229ec398224ef6ca0657da4fc703e',
 '00024acbcdf0a6daa1e931b038114c75',
 '00042b26cf59d7ce69dfabb4e55b4fd9']

#### Loading order_reviews_dataset and getting the list of order_id values

In [7]:
csv_file_name = 'olist_order_reviews_dataset.csv'
csv_file_path = os.path.join(dataset_path, csv_file_name)
df_order_reviews = pd.read_csv(csv_file_path)
order_reviews_id_list = df_order_reviews['order_id'].unique().tolist()
order_reviews_id_list.sort(key = str)
order_reviews_id_list[:5]

['00010242fe8c5a6d1ba2dd792cb16214',
 '00018f77f2f0320c557190d7a144bdd3',
 '000229ec398224ef6ca0657da4fc703e',
 '00024acbcdf0a6daa1e931b038114c75',
 '00042b26cf59d7ce69dfabb4e55b4fd9']
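
The four loading cells above follow the same pattern: read a CSV, take the unique `order_id` values, and sort them as strings. As a minimal sketch, that pattern could be wrapped in a small helper. The name `unique_order_ids` and the toy CSV content are hypothetical and only serve the illustration; they are not part of the notebook.

```python
import io

import pandas as pd


def unique_order_ids(csv_source):
    """Return the unique order_id values of a CSV source, sorted as strings."""
    # Only the order_id column is needed, so the other columns are skipped.
    df = pd.read_csv(csv_source, usecols=["order_id"])
    return sorted(df["order_id"].unique().tolist(), key=str)


# Toy CSV standing in for one of the olist_*.csv files (made-up ids).
toy_csv = io.StringIO("order_id,price\nb2,10.0\na1,5.5\nb2,3.0\n")
toy_ids = unique_order_ids(toy_csv)
print(toy_ids)
```

With the real files, the helper would be called with a path such as `os.path.join(dataset_path, csv_file_name)` instead of the in-memory buffer.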

In [8]:
print("SUMMARY:")
print(f"order_id_list has {len(order_id_list)} unique values")
print(f"order_items_id_list has {len(order_items_id_list)} unique values")
print(f"order_payments_id_list has {len(order_payments_id_list)} unique values")
print(f"order_reviews_id_list has {len(order_reviews_id_list)} unique values")

SUMMARY:
order_id_list has 99441 unique values
order_items_id_list has 98666 unique values
order_payments_id_list has 99440 unique values
order_reviews_id_list has 98673 unique values
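
The counts differ from table to table, so it can help to see which ids are missing from each related table relative to `orders`. The sketch below does this with a set difference on made-up ids; the names `orders`, `related` and `missing` are assumptions for the illustration.

```python
# Made-up ids standing in for the real order_id lists.
orders = {"a1", "b2", "c3", "d4"}
related = {
    "order_items": {"a1", "b2", "c3"},
    "order_payments": {"a1", "b2", "d4"},
}

# For each related table, collect the ids present in orders but absent from it.
missing = {name: sorted(orders - ids) for name, ids in related.items()}

for name, ids in missing.items():
    print(f"{name}: {len(ids)} order_id value(s) missing -> {ids}")
```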


#### Once all the order_id lists are defined, the next step is to find the values common to the `orders`, `order_items` and `order_payments` tables. Since reviews are optional, `order_reviews` is left out of this intersection.

In [9]:
# order_reviews_id_list is deliberately left out: reviews are optional
order_ids_to_keep = list(set(order_id_list) & set(order_items_id_list) & set(order_payments_id_list))
len(order_ids_to_keep)

98665
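
To make the intersection step concrete, here is a small sketch on made-up ids. It uses `set.intersection` over any number of lists and sorts the result, since the iteration order of a set is not guaranteed; the sorting is an addition for reproducibility, not something the cell above does.

```python
# Made-up ids standing in for the three lists used in the intersection.
id_lists = [
    ["a1", "b2", "c3", "d4"],  # plays the role of order_id_list
    ["a1", "b2", "c3"],        # plays the role of order_items_id_list
    ["a1", "b2", "d4"],        # plays the role of order_payments_id_list
]

# Keep only the ids present in every list, in a deterministic order.
common_ids = sorted(set.intersection(*map(set, id_lists)))
print(common_ids)
```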

##### The resulting list (named `order_ids_to_keep`) is applied to each table concerned, keeping only the rows whose order_id is in the list.

In [10]:
# Orders to keep in olist_orders_dataset
df_order_clean = df_order[df_order['order_id'].isin(order_ids_to_keep)]
# df_order_clean[['order_id']].to_csv('../../data/pre_interim/orders_dataset_unique_ids.csv', sep=',', index=False, encoding='utf-8-sig')
df_order_clean['order_id'].unique().shape[0]

98665

In [11]:
# Orders to keep in olist_order_items_dataset
df_order_items_clean = df_order_items[df_order_items['order_id'].isin(order_ids_to_keep)]
# df_order_items_clean[['order_id']].to_csv('../../data/pre_interim/orders_items_dataset_unique_ids.csv', sep=',', index=False, encoding='utf-8-sig')
df_order_items_clean['order_id'].unique().shape[0]

98665

In [12]:
# Orders to keep in olist_order_payments_dataset
df_order_payments_clean = df_order_payments[df_order_payments['order_id'].isin(order_ids_to_keep)]
# df_order_payments_clean[['order_id']].to_csv('../../data/pre_interim/orders_payments_dataset_unique_ids.csv', sep=',', index=False, encoding='utf-8-sig')
df_order_payments_clean['order_id'].unique().shape[0]

98665

In [13]:
# Orders to keep in olist_order_reviews_dataset
df_order_reviews_clean = df_order_reviews[df_order_reviews['order_id'].isin(order_ids_to_keep)]
#df_order_reviews_clean[['order_id']].to_csv('../../data/pre_interim/orders_reviews_dataset_unique_ids.csv', sep=',', index=False, encoding='utf-8-sig')
df_order_reviews_clean['order_id'].unique().shape[0]

97916
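
The filtering cells above all rely on `Series.isin`. A toy sketch of the same idea, with a made-up frame and made-up ids, shows that an order with several rows keeps all of them while orders outside the list are dropped:

```python
import pandas as pd

# Made-up rows standing in for an order-related table.
toy_items = pd.DataFrame({
    "order_id": ["a1", "a1", "b2", "c3"],
    "price": [5.0, 7.5, 3.0, 9.9],
})
ids_to_keep = ["a1", "b2"]

# Boolean mask: True for the rows whose order_id is in the list.
toy_items_clean = toy_items[toy_items["order_id"].isin(ids_to_keep)]

print(len(toy_items_clean))                    # number of rows kept
print(toy_items_clean["order_id"].nunique())   # number of distinct orders kept
```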

#### Checking that the list `order_ids_to_keep` filters correctly.

In [14]:
len(set(df_order_clean['order_id']) - set(df_order_items_clean['order_id']))

0

In [15]:
len(set(df_order_clean['order_id']) - set(df_order_payments_clean['order_id']))

0

In [16]:
len(set(df_order_clean['order_id']) - set(df_order_reviews_clean['order_id']))

749

In [17]:
98665 - 97916

749
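
The checks above are read by eye. As a hedged sketch, the same expectations could be written as assertions so that a mismatch fails loudly; the sets below are made-up stand-ins for the `order_id` values of the cleaned tables.

```python
# Made-up ids standing in for the order_id values of the cleaned tables.
kept = {"a1", "b2", "c3"}
items_clean = {"a1", "b2", "c3"}
payments_clean = {"a1", "b2", "c3"}
reviews_clean = {"a1", "c3"}

# Required tables must contain exactly the kept ids.
assert items_clean == kept
assert payments_clean == kept

# Reviews are optional: their ids must be a subset of the kept ids.
assert reviews_clean <= kept

# Orders that were kept but have no review.
without_review = kept - reviews_clean
print(len(without_review))
```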

#### Saving the `order_ids_to_keep` list as a CSV

In [18]:
save = True
if save:
    pd.Series(order_ids_to_keep, name='ids_to_keep').to_csv('../../data/pre_interim/order_ids_to_keep.csv', sep=',', index=False, encoding='utf-8-sig')
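
Since the saved file is meant to be applied as a filter when the "clean" tables are built, the sketch below shows one way that later step might look: write the ids, read them back with the same `utf-8-sig` encoding, and filter a frame with `isin`. The temporary directory, the toy frame and its `status` column are made up for the illustration.

```python
import os
import tempfile

import pandas as pd

ids_to_keep = ["a1", "b2"]  # made-up ids

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "order_ids_to_keep.csv")
    # Same call shape as the saving cell, but writing to a temporary path.
    pd.Series(ids_to_keep, name="ids_to_keep").to_csv(
        csv_path, sep=",", index=False, encoding="utf-8-sig"
    )
    # Read back with the matching encoding so the header is parsed cleanly.
    loaded = pd.read_csv(csv_path, encoding="utf-8-sig")

# Made-up table to be filtered with the loaded ids.
toy_orders = pd.DataFrame({
    "order_id": ["a1", "b2", "c3"],
    "status": ["x", "y", "z"],
})
toy_orders_clean = toy_orders[toy_orders["order_id"].isin(loaded["ids_to_keep"])]
print(toy_orders_clean)
```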