# Orders Datasets

Contains purchase order information and tracks each order through its key dates: purchase, payment approval, handover to the carrier, delivery to the customer, and estimated delivery.

## Initial Column Description


|**Column Title**|**order_id -> str** |**customer_id -> str** |**order_status -> str** |**order_purchase_timestamp -> timestamp**|**order_approved_at -> timestamp**|**order_delivered_carrier_date -> timestamp**|**order_delivered_customer_date -> timestamp**|**order_estimated_delivery_date -> timestamp**|
|--|--|--|--|--|--|--|--|--|
|Description |Primary key: unique identifier of the order |Foreign key to the customer dataset. Each order has a unique customer_id |Reference to the order status (delivered, shipped, etc.) |Shows the purchase timestamp |Shows the payment approval timestamp |Shows the order posting timestamp, i.e. when it was handed to the logistics partner |Shows the actual order delivery date to the customer |Shows the estimated delivery date that was communicated to the customer at the moment of purchase |
|Example |e481f51cbdc54678b7cc49136f2d6af7 |9ef432eb6251297304e76186b10a928d |delivered |2017-10-02 10:56:33 |2017-10-02 11:07:15 |2017-10-04 19:55:00 |2017-10-10 21:25:13 |2017-10-18 00:00:00 |

### Errors found
+ The raw data for this table contains null or empty values in the following columns:
                             
    + order_approved_at               -> qty: 160
    + order_delivered_carrier_date    -> qty: 1783
    + order_delivered_customer_date   -> qty: 2965

## Required Libraries

In [1]:
# pandas makes it easy to work with CSV files
import pandas as pd

## Data Preprocessing

We decided to create one new table, unique_order_id, which assigns a numeric id to each order. The order_id column must therefore be replaced with the corresponding id from that table:

|External table |External column with new id |Column to replace |
|--|--|--|
|unique_order_id |unique_id |order_id |


Example:

For the first row, the order_id information is:

|order_id |customer_id |order_status |
|--|--|--|
|b81ef226f3fe1789b1e8b2acac839d17 |0a8556ac6be836b46b3e89920d59291c |delivered |

Looking in the external table **unique_order_id**, we find that id **1** corresponds to order_id **b81ef226f3fe1789b1e8b2acac839d17**. So we need to replace **b81ef226f3fe1789b1e8b2acac839d17** with **1**.

|unique_id |customer_id |order_status |
|--|--|--|
|1 |0a8556ac6be836b46b3e89920d59291c |delivered |
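
The replacement described above can be sketched on a tiny in-memory example. This is only an illustration, not the notebook's actual method (the notebook uses a merge further down); the names `orders_demo`, `lookup_demo`, and `id_map` are hypothetical, and the two rows are taken from the tables shown in this document.

```python
import pandas as pd

# Two orders copied from the tables in this document (toy stand-in for the full dataset)
orders_demo = pd.DataFrame({
    "order_id": ["b81ef226f3fe1789b1e8b2acac839d17", "a9810da82917af2d9aefd1278f1dcfa0"],
    "customer_id": ["0a8556ac6be836b46b3e89920d59291c", "f2c7fc58a9de810828715166c672f10a"],
    "order_status": ["delivered", "delivered"],
})

# Toy stand-in for the unique_order_id table
lookup_demo = pd.DataFrame({
    "unique_id": [1, 2],
    "order_id": ["b81ef226f3fe1789b1e8b2acac839d17", "a9810da82917af2d9aefd1278f1dcfa0"],
})

# Build an order_id -> unique_id lookup and apply it row by row
id_map = lookup_demo.set_index("order_id")["unique_id"]
replaced = orders_demo.assign(unique_id=orders_demo["order_id"].map(id_map))

# Drop the original hash and put unique_id first
replaced = replaced.drop(columns="order_id")[["unique_id", "customer_id", "order_status"]]
print(replaced)
```

`Series.map` returns NaN for any order_id missing from the lookup, so a quick `replaced["unique_id"].isna().any()` check would reveal unmatched orders.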

### Data Correction

In [2]:
orders = pd.read_csv('../../data/raw/olist_orders_dataset.csv')
unique_orders = pd.read_csv('../../data/interim/unique_order_id.csv')
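
The column description lists the date columns as timestamps, whereas a plain `pd.read_csv` call loads them as strings. If real datetime values are wanted, one option is the `parse_dates` argument. The sketch below is an assumption about how that could look, not part of the original pipeline; it reads a small in-memory sample (`sample_csv`, a hypothetical stand-in built from two rows shown in this document) instead of the real file.

```python
import io
import pandas as pd

# In-memory stand-in for the raw CSV: one complete row and one with a missing approval date
sample_csv = io.StringIO(
    "order_id,order_status,order_purchase_timestamp,order_approved_at\n"
    "e481f51cbdc54678b7cc49136f2d6af7,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15\n"
    "3a3cddda5a7c27851bd96c3313412840,canceled,2018-08-31 16:13:44,\n"
)

date_cols = ["order_purchase_timestamp", "order_approved_at"]

# parse_dates converts the listed columns to datetime64; empty cells become NaT
sample = pd.read_csv(sample_csv, parse_dates=date_cols)

print(sample.dtypes)
print(sample["order_approved_at"].isna().sum())
```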


In [3]:
orders

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15 00:00:00
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26 00:00:00
...,...,...,...,...,...,...,...,...
99436,9c5dedf39a927c1b2549525ed64a053c,39bd1228ee8140590ac3aca26f2dfe00,delivered,2017-03-09 09:54:05,2017-03-09 09:54:05,2017-03-10 11:18:03,2017-03-17 15:08:01,2017-03-28 00:00:00
99437,63943bddc261676b46f01ca7ac2f7bd8,1fca14ff2861355f6e5f14306ff977a7,delivered,2018-02-06 12:58:58,2018-02-06 13:10:37,2018-02-07 23:22:42,2018-02-28 17:37:56,2018-03-02 00:00:00
99438,83c1379a015df1e13d02aae0204711ab,1aa71eb042121263aafbe80c1b562c9c,delivered,2017-08-27 14:46:43,2017-08-27 15:04:16,2017-08-28 20:52:26,2017-09-21 11:24:17,2017-09-27 00:00:00
99439,11c177c8e97725db2631073c19f07b62,b331b74b18dc79bcdf6532d51e1637c1,delivered,2018-01-08 21:28:27,2018-01-08 21:36:21,2018-01-12 15:35:03,2018-01-25 23:32:54,2018-02-15 00:00:00


In [4]:
unique_orders

Unnamed: 0,unique_id,order_id
0,1,b81ef226f3fe1789b1e8b2acac839d17
1,2,a9810da82917af2d9aefd1278f1dcfa0
2,3,25e8ea4e93396b6fa0d3dd708e76c1bd
3,4,ba78997921bbcdc1373bb41e913ab953
4,5,42fdf880ba16b47b59251dd489d4441a
...,...,...
99435,99436,0406037ad97740d563a178ecc7a2075c
99436,99437,7b905861d7c825891d6347454ea7863f
99437,99438,32609bbb3dd69b3c066a6860554a77bf
99438,99439,b8b61059626efa996a60be9bb9320e10


In the orders dataset there are null values in three date columns: order_approved_at, order_delivered_carrier_date, and order_delivered_customer_date.

In [5]:
orders.isnull().sum()

order_id                            0
customer_id                         0
order_status                        0
order_purchase_timestamp            0
order_approved_at                 160
order_delivered_carrier_date     1783
order_delivered_customer_date    2965
order_estimated_delivery_date       0
dtype: int64

Review of order status types

In [6]:
orders["order_status"].unique()

array(['delivered', 'invoiced', 'shipped', 'processing', 'unavailable',
       'canceled', 'created', 'approved'], dtype=object)

Count the number of orders in each status

In [7]:
pd.crosstab(index=orders['order_status'], columns='cantidad')

col_0,cantidad
order_status,Unnamed: 1_level_1
approved,2
canceled,625
created,5
delivered,96478
invoiced,314
processing,301
shipped,1107
unavailable,609
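
The same kind of tally can also be obtained with `Series.value_counts`. The snippet below is a minimal sketch on a hypothetical five-row series (`statuses`), not on the real dataset, so the counts it prints are illustrative only.

```python
import pandas as pd

# Hypothetical toy series standing in for orders["order_status"]
statuses = pd.Series(
    ["delivered", "delivered", "shipped", "canceled", "delivered"],
    name="order_status",
)

# value_counts tallies each distinct status; sort_index orders them alphabetically like crosstab
counts = statuses.value_counts().sort_index()
print(counts)
```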


Select only the rows that contain at least one NaN value.

In [8]:
orders[pd.isnull(orders).any(axis=1)]

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
6,136cce7faa42fdb2cefd53fdc79a6098,ed0271e0b7da060a393796590e7b737a,invoiced,2017-04-11 12:22:08,2017-04-13 13:25:17,,,2017-05-09 00:00:00
44,ee64d42b8cf066f35eac1cf57de1aa85,caded193e8e47b8362864762a83db3c5,shipped,2018-06-04 16:44:48,2018-06-05 04:31:18,2018-06-05 14:32:00,,2018-06-28 00:00:00
103,0760a852e4e9d89eb77bf631eaaf1c84,d2a79636084590b7465af8ab374a8cf5,invoiced,2018-08-03 17:44:42,2018-08-07 06:15:14,,,2018-08-21 00:00:00
128,15bed8e2fec7fdbadb186b57c46c92f2,f3f0e613e0bdb9c7cee75504f0f90679,processing,2017-09-03 14:22:03,2017-09-03 14:30:09,,,2017-10-03 00:00:00
154,6942b8da583c2f9957e990d028607019,52006a9383bf149a4fb24226b173106f,shipped,2018-01-10 11:33:07,2018-01-11 02:32:30,2018-01-11 19:39:23,,2018-02-07 00:00:00
...,...,...,...,...,...,...,...,...
99283,3a3cddda5a7c27851bd96c3313412840,0b0d6095c5555fe083844281f6b093bb,canceled,2018-08-31 16:13:44,,,,2018-10-01 00:00:00
99313,e9e64a17afa9653aacf2616d94c005b8,b4cd0522e632e481f8eaf766a2646e86,processing,2018-01-05 23:07:24,2018-01-09 07:18:05,,,2018-02-06 00:00:00
99347,a89abace0dcc01eeb267a9660b5ac126,2f0524a7b1b3845a1a57fcf3910c4333,canceled,2018-09-06 18:45:47,,,,2018-09-27 00:00:00
99348,a69ba794cc7deb415c3e15a0a3877e69,726f0894b5becdf952ea537d5266e543,unavailable,2017-08-23 16:28:04,2017-08-28 15:44:47,,,2017-09-15 00:00:00
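
To see how the missing dates are distributed across order statuses, the null flags can be summed per status. The following is a sketch under the assumption that a per-status breakdown is what is wanted; `demo` is a four-row stand-in built from rows shown in this document, and `nulls_by_status` is a hypothetical name.

```python
import pandas as pd

# Four rows copied from the outputs above: one per status, with their missing dates as None
demo = pd.DataFrame({
    "order_status": ["delivered", "invoiced", "shipped", "canceled"],
    "order_approved_at": ["2017-10-02 11:07:15", "2017-04-13 13:25:17", "2018-06-05 04:31:18", None],
    "order_delivered_carrier_date": ["2017-10-04 19:55:00", None, "2018-06-05 14:32:00", None],
    "order_delivered_customer_date": ["2017-10-10 21:25:13", None, None, None],
})

date_cols = [
    "order_approved_at",
    "order_delivered_carrier_date",
    "order_delivered_customer_date",
]

# isna() gives True/False per cell; grouping by status and summing counts the nulls per status
nulls_by_status = demo[date_cols].isna().groupby(demo["order_status"]).sum()
print(nulls_by_status)
```

Run against the full `orders` DataFrame, a table like this would show whether any status has complete dates for some orders and missing dates for others.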


#### Replace the null data

A new DataFrame is created using copy to keep the data that has been manipulated separate from the raw data.

In [9]:
df_orders = orders.copy()

To avoid losing information by simply deleting these rows, we first checked whether the missing dates are explained by the delivery status of the order. They are not fully explained by it: for example, there are orders in canceled status that have all of their dates filled in. We therefore fill the null dates with the placeholder `"<null>"`.

In [10]:
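# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may silently ignore
df_orders["order_approved_at"] = df_orders["order_approved_at"].fillna("<null>")
df_orders["order_delivered_carrier_date"] = df_orders["order_delivered_carrier_date"].fillna("<null>")
df_orders["order_delivered_customer_date"] = df_orders["order_delivered_customer_date"].fillna("<null>")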
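
As a possible alternative, the nulls could be left untouched in the DataFrame and the placeholder written only at export time through the `na_rep` argument of `to_csv`. This is a sketch of that option, not what the notebook does; `demo` and `buffer` are hypothetical names and the export goes to an in-memory buffer rather than a file.

```python
import io
import pandas as pd

# Toy frame with one missing delivery date
demo = pd.DataFrame({
    "unique_id": [1, 2],
    "order_delivered_customer_date": ["2018-05-09 17:36:51", None],
})

# na_rep replaces missing values in the written text only; the DataFrame keeps its nulls
buffer = io.StringIO()
demo.to_csv(buffer, index=False, na_rep="<null>")
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

The trade-off is that `isnull()` keeps working on the in-memory data, while the exported file still carries the placeholder.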

It is verified that there is no null data left.

In [11]:
df_orders.isnull().sum()

order_id                         0
customer_id                      0
order_status                     0
order_purchase_timestamp         0
order_approved_at                0
order_delivered_carrier_date     0
order_delivered_customer_date    0
order_estimated_delivery_date    0
dtype: int64

#### Merge the orders dataset with unique_order_id

With the null values filled in, we merge the two tables on the order_id column.

In [12]:
orders_join = pd.merge(df_orders, unique_orders, on='order_id', how='right')
orders_join

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,unique_id
0,b81ef226f3fe1789b1e8b2acac839d17,0a8556ac6be836b46b3e89920d59291c,delivered,2018-04-25 22:01:49,2018-04-25 22:15:09,2018-05-02 15:20:00,2018-05-09 17:36:51,2018-05-22 00:00:00,1
1,a9810da82917af2d9aefd1278f1dcfa0,f2c7fc58a9de810828715166c672f10a,delivered,2018-06-26 11:01:38,2018-06-26 11:18:58,2018-06-28 14:18:00,2018-06-29 20:32:09,2018-07-16 00:00:00,2
2,25e8ea4e93396b6fa0d3dd708e76c1bd,25b14b69de0b6e184ae6fe2755e478f9,delivered,2017-12-12 11:19:55,2017-12-14 09:52:34,2017-12-15 20:13:22,2017-12-18 17:24:41,2018-01-04 00:00:00,3
3,ba78997921bbcdc1373bb41e913ab953,7a5d8efaaa1081f800628c30d2b0728f,delivered,2017-12-06 12:04:06,2017-12-06 12:13:20,2017-12-07 20:28:28,2017-12-21 01:35:51,2018-01-04 00:00:00,4
4,42fdf880ba16b47b59251dd489d4441a,15fd6fb8f8312dbb4674e4518d6fa3b3,delivered,2018-05-21 13:59:17,2018-05-21 16:14:41,2018-05-22 11:46:00,2018-06-01 21:44:53,2018-06-13 00:00:00,5
...,...,...,...,...,...,...,...,...,...
99435,0406037ad97740d563a178ecc7a2075c,5d576cb2dfa3bc05612c392a1ee9c654,delivered,2018-03-08 16:57:23,2018-03-10 03:55:25,2018-03-12 18:19:36,2018-03-16 13:09:51,2018-04-04 00:00:00,99436
99436,7b905861d7c825891d6347454ea7863f,2079230c765a88530822a34a4cec2aa0,delivered,2017-08-18 09:45:35,2017-08-18 10:04:56,2017-08-18 18:04:24,2017-08-23 22:25:56,2017-09-12 00:00:00,99437
99437,32609bbb3dd69b3c066a6860554a77bf,e4abb5057ec8cfda9759c0dc415a8188,invoiced,2017-11-18 17:27:14,2017-11-18 17:46:05,<null>,<null>,2017-12-05 00:00:00,99438
99438,b8b61059626efa996a60be9bb9320e10,5d719b0d300663188169c6560e243f27,delivered,2018-08-07 23:26:13,2018-08-07 23:45:00,2018-08-09 11:46:00,2018-08-21 22:41:46,2018-08-24 00:00:00,99439
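
Because a right join keeps every row of the lookup table even when no order matches, it can be worth checking the result. The sketch below uses the `validate` and `indicator` options of `pd.merge` on toy data; the third lookup row is deliberately left without a matching order (a hypothetical situation) to show what an unmatched id looks like.

```python
import pandas as pd

# Two orders from this document
left = pd.DataFrame({
    "order_id": ["b81ef226f3fe1789b1e8b2acac839d17", "a9810da82917af2d9aefd1278f1dcfa0"],
    "order_status": ["delivered", "delivered"],
})

# Three lookup rows; the third has no counterpart in `left` (hypothetical, for illustration)
right = pd.DataFrame({
    "unique_id": [1, 2, 3],
    "order_id": [
        "b81ef226f3fe1789b1e8b2acac839d17",
        "a9810da82917af2d9aefd1278f1dcfa0",
        "25e8ea4e93396b6fa0d3dd708e76c1bd",
    ],
})

# validate raises MergeError if order_id is duplicated on either side;
# indicator adds a _merge column telling where each row came from
checked = pd.merge(
    left, right, on="order_id", how="right",
    validate="one_to_one", indicator=True,
)

# Rows present only in the lookup table have NaN in the order columns
unmatched = checked[checked["_merge"] != "both"]
print(unmatched)
```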


The order of the columns is arranged so that the unique_id column comes first 

In [13]:
new_cols = ["unique_id","order_id","customer_id","order_status", "order_purchase_timestamp", "order_approved_at", "order_delivered_carrier_date", "order_delivered_customer_date", "order_estimated_delivery_date"]
# Selecting with the ordered list is enough; a second reindex call would be redundant
orders_join = orders_join[new_cols]
orders_join

Unnamed: 0,unique_id,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,1,b81ef226f3fe1789b1e8b2acac839d17,0a8556ac6be836b46b3e89920d59291c,delivered,2018-04-25 22:01:49,2018-04-25 22:15:09,2018-05-02 15:20:00,2018-05-09 17:36:51,2018-05-22 00:00:00
1,2,a9810da82917af2d9aefd1278f1dcfa0,f2c7fc58a9de810828715166c672f10a,delivered,2018-06-26 11:01:38,2018-06-26 11:18:58,2018-06-28 14:18:00,2018-06-29 20:32:09,2018-07-16 00:00:00
2,3,25e8ea4e93396b6fa0d3dd708e76c1bd,25b14b69de0b6e184ae6fe2755e478f9,delivered,2017-12-12 11:19:55,2017-12-14 09:52:34,2017-12-15 20:13:22,2017-12-18 17:24:41,2018-01-04 00:00:00
3,4,ba78997921bbcdc1373bb41e913ab953,7a5d8efaaa1081f800628c30d2b0728f,delivered,2017-12-06 12:04:06,2017-12-06 12:13:20,2017-12-07 20:28:28,2017-12-21 01:35:51,2018-01-04 00:00:00
4,5,42fdf880ba16b47b59251dd489d4441a,15fd6fb8f8312dbb4674e4518d6fa3b3,delivered,2018-05-21 13:59:17,2018-05-21 16:14:41,2018-05-22 11:46:00,2018-06-01 21:44:53,2018-06-13 00:00:00
...,...,...,...,...,...,...,...,...,...
99435,99436,0406037ad97740d563a178ecc7a2075c,5d576cb2dfa3bc05612c392a1ee9c654,delivered,2018-03-08 16:57:23,2018-03-10 03:55:25,2018-03-12 18:19:36,2018-03-16 13:09:51,2018-04-04 00:00:00
99436,99437,7b905861d7c825891d6347454ea7863f,2079230c765a88530822a34a4cec2aa0,delivered,2017-08-18 09:45:35,2017-08-18 10:04:56,2017-08-18 18:04:24,2017-08-23 22:25:56,2017-09-12 00:00:00
99437,99438,32609bbb3dd69b3c066a6860554a77bf,e4abb5057ec8cfda9759c0dc415a8188,invoiced,2017-11-18 17:27:14,2017-11-18 17:46:05,<null>,<null>,2017-12-05 00:00:00
99438,99439,b8b61059626efa996a60be9bb9320e10,5d719b0d300663188169c6560e243f27,delivered,2018-08-07 23:26:13,2018-08-07 23:45:00,2018-08-09 11:46:00,2018-08-21 22:41:46,2018-08-24 00:00:00


The order_id column is removed

In [14]:
orders_join.drop('order_id', inplace=True, axis=1)
orders_join

Unnamed: 0,unique_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,1,0a8556ac6be836b46b3e89920d59291c,delivered,2018-04-25 22:01:49,2018-04-25 22:15:09,2018-05-02 15:20:00,2018-05-09 17:36:51,2018-05-22 00:00:00
1,2,f2c7fc58a9de810828715166c672f10a,delivered,2018-06-26 11:01:38,2018-06-26 11:18:58,2018-06-28 14:18:00,2018-06-29 20:32:09,2018-07-16 00:00:00
2,3,25b14b69de0b6e184ae6fe2755e478f9,delivered,2017-12-12 11:19:55,2017-12-14 09:52:34,2017-12-15 20:13:22,2017-12-18 17:24:41,2018-01-04 00:00:00
3,4,7a5d8efaaa1081f800628c30d2b0728f,delivered,2017-12-06 12:04:06,2017-12-06 12:13:20,2017-12-07 20:28:28,2017-12-21 01:35:51,2018-01-04 00:00:00
4,5,15fd6fb8f8312dbb4674e4518d6fa3b3,delivered,2018-05-21 13:59:17,2018-05-21 16:14:41,2018-05-22 11:46:00,2018-06-01 21:44:53,2018-06-13 00:00:00
...,...,...,...,...,...,...,...,...
99435,99436,5d576cb2dfa3bc05612c392a1ee9c654,delivered,2018-03-08 16:57:23,2018-03-10 03:55:25,2018-03-12 18:19:36,2018-03-16 13:09:51,2018-04-04 00:00:00
99436,99437,2079230c765a88530822a34a4cec2aa0,delivered,2017-08-18 09:45:35,2017-08-18 10:04:56,2017-08-18 18:04:24,2017-08-23 22:25:56,2017-09-12 00:00:00
99437,99438,e4abb5057ec8cfda9759c0dc415a8188,invoiced,2017-11-18 17:27:14,2017-11-18 17:46:05,<null>,<null>,2017-12-05 00:00:00
99438,99439,5d719b0d300663188169c6560e243f27,delivered,2018-08-07 23:26:13,2018-08-07 23:45:00,2018-08-09 11:46:00,2018-08-21 22:41:46,2018-08-24 00:00:00


### Creating final csv: orders_dataset.csv

Finally, the modified orders dataset is written to a new CSV file. When saving the dataset, always pass **"index = False"**; otherwise pandas writes the row index as an extra column of consecutive numbers, which serves no purpose here.

In [16]:
orders_join.to_csv('../../data/pre_interim/orders_dataset.csv', index = False)
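
The effect of `index=False` can be seen with a small round trip. This sketch writes a hypothetical two-row frame (`demo`) to in-memory buffers rather than to disk and compares the columns read back in each case.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for orders_join
demo = pd.DataFrame({"unique_id": [1, 2], "order_status": ["delivered", "invoiced"]})

# Default behaviour: the row index is written as an extra, unnamed first column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

When the file saved with the index is read back, pandas labels the extra column `Unnamed: 0`, which is the stray column the text warns about.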

## Final Column Description

Before

|**Column Title**|**order_id -> str** |**customer_id -> str** |**order_status -> str** |**order_purchase_timestamp -> timestamp**|**order_approved_at -> timestamp**|**order_delivered_carrier_date -> timestamp**|**order_delivered_customer_date -> timestamp**|**order_estimated_delivery_date -> timestamp**|
|--|--|--|--|--|--|--|--|--|
|Description |Unique identifier of the order |Foreign key to the customer dataset. Each order has a unique customer_id |Reference to the order status |Purchase timestamp |Payment approval timestamp |Order posting timestamp, i.e. when it was handed to the logistics partner |Actual order delivery date to the customer |Estimated delivery date that was communicated to the customer at the moment of purchase |
|Before Preprocessing|b81ef226f3fe1789b1e8b2acac839d17 |0a8556ac6be836b46b3e89920d59291c |delivered |2018-04-25 22:01:49 |2018-04-25 22:15:09 |2018-05-02 15:20:00 |2018-05-09 17:36:51 |2018-05-22 00:00:00 |

After

|**Column Title**|**unique_id -> int** |**customer_id -> str** |**order_status -> str** |**order_purchase_timestamp -> timestamp**|**order_approved_at -> timestamp**|**order_delivered_carrier_date -> timestamp**|**order_delivered_customer_date -> timestamp**|**order_estimated_delivery_date -> timestamp**|
|--|--|--|--|--|--|--|--|--|
|Description |Numeric id that replaces order_id, taken from the unique_order_id table |Foreign key to the customer dataset. Each order has a unique customer_id |Reference to the order status |Purchase timestamp |Payment approval timestamp |Order posting timestamp, i.e. when it was handed to the logistics partner |Actual order delivery date to the customer |Estimated delivery date that was communicated to the customer at the moment of purchase |
|After Preprocessing|1 |0a8556ac6be836b46b3e89920d59291c |delivered |2018-04-25 22:01:49 |2018-04-25 22:15:09 |2018-05-02 15:20:00 |2018-05-09 17:36:51 |2018-05-22 00:00:00 |
