# Orders Datasets

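Contains purchase order information, including the order status and the timestamps used to track each delivery: purchase, payment approval, handover to the carrier, delivery to the customer, and the estimated delivery date.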

## Initial Column Description


|**Column Title**|**order_id -> str** |**customer_id -> str** |**order_status -> str** |**order_purchase_timestamp -> timestamp**| **order_approved_at -> timestamp**|**order_delivered_carrier_date -> timestamp**|**order_delivered_customer_date -> timestamp**|**order_estimated_delivery_date -> timestamp**|
|--|--|--|--|--|--|--|--|--|
|Description |Primary key - unique identifier of the order |Foreign key to the customer dataset. Each order has a unique customer_id |Reference to the order status (delivered, shipped, etc.) |Shows the purchase timestamp |Shows the payment approval timestamp |Shows the order posting timestamp, i.e. when it was handed over to the logistics partner |Shows the actual order delivery date to the customer |Shows the estimated delivery date that was communicated to the customer at the moment of purchase |
|Example |e481f51cbdc54678b7cc49136f2d6af7 |9ef432eb6251297304e76186b10a928d |delivered |2017-10-02 10:56:33 |2017-10-02 11:07:15 |2017-10-04 19:55:00 |2017-10-10 21:25:13 |2017-10-18 00:00:00 |

### Errors found
+ The raw data for this table contains null or empty values in the following columns:

    + order_approved_at               -> qty: 160
    + order_delivered_carrier_date    -> qty: 1783
    + order_delivered_customer_date   -> qty: 2965
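
Counts like the ones above can be obtained with a per-column tally of missing values. The following is a minimal sketch on a tiny made-up frame (the `orders` variable and its values are illustrative assumptions, not the real dataset); it treats empty strings as missing alongside nulls.

```python
import pandas as pd

# Tiny stand-in for the orders table; the values are illustrative only.
orders = pd.DataFrame({
    "order_id": ["a1", "b2", "c3"],
    "order_approved_at": ["2017-10-02 11:07:15", None, "2017-10-03 09:00:00"],
    "order_delivered_carrier_date": ["2017-10-04 19:55:00", None, ""],
})

# A value counts as missing if it is null or an empty string.
missing = orders.isna() | (orders == "")
null_counts = missing.sum()

# Show only the columns that actually contain missing values.
print(null_counts[null_counts > 0])
```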

These null values were corrected in the notebook orders_dataset, located in the folder pre-anlysis. The orders dataset was also merged with unique_order_id to avoid repeated names and data in other tables.
 
This notebook solves the main remaining problem of this table: the format of the date columns (order_purchase_timestamp, order_approved_at, order_delivered_carrier_date, order_delivered_customer_date, order_estimated_delivery_date). The engine chosen for the database does not accept dates in the format dd/mm/yyyy hh:MM.

At this stage the columns have the following types:

| **Column** | **type** |
| --- | --- |
| ID (**PK**)                   | int |
| order_id                      | int |
| customer_id                   | str |
| order_status                  | str |
| order_purchase_timestamp      | str |
| order_approved_at             | str |
| order_delivered_carrier_date  | str |
| order_delivered_customer_date | str |
| order_estimated_delivery_date | str |



## Data Preprocessing

The date format will be corrected: the day-first strings separated by forward slashes ( / ), dd/mm/yyyy hh:MM, are converted to year-first values separated by hyphens ( - ), yyyy-mm-dd hh:MM:ss.

Example (first row of the input file):

|**order_purchase_timestamp**|**order_approved_at**|**order_delivered_carrier_date**|**order_delivered_customer_date**|**order_estimated_delivery_date**|
|--|--|--|--|--|
|25/04/2018 22:01 |25/04/2018 22:15 |2/05/2018 15:20 |9/05/2018 17:36 |22/05/2018 0:00 |

What we need is to transform these values to datetime, like this:

|**order_purchase_timestamp**|**order_approved_at**|**order_delivered_carrier_date**|**order_delivered_customer_date**|**order_estimated_delivery_date**|
|--|--|--|--|--|
|2018-04-25 22:01:00 |2018-04-25 22:15:00 |2018-05-02 15:20:00 |2018-05-09 17:36:00 |2018-05-22 00:00:00 |
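
Swapping the separator is not the whole story when the source strings are day-first: pandas reads an ambiguous date as month-first by default. The sketch below illustrates this on a single value taken from the input file; the explicit `format` string is an assumption about how the dd/mm/yyyy hh:MM layout maps to strptime directives.

```python
import pandas as pd

raw = "2/05/2018 15:20"  # 2 May 2018 in the day-first source data

# Default parsing (dayfirst=False) reads the first number as the month.
month_first = pd.to_datetime(raw)

# An explicit format removes the ambiguity.
day_first = pd.to_datetime(raw, format="%d/%m/%Y %H:%M")

print(month_first)  # 2018-02-05 15:20:00
print(day_first)    # 2018-05-02 15:20:00
```

Values whose day is greater than 12 (for example 25/04/2018) cannot be misread this way, so the problem only shows up on part of the rows.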

## Required Libraries

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('../../data/pre_interim/orders_dataset_std_columns.csv')
df.head()

| |ID|order_id|customer_id|order_status|order_purchase_timestamp|order_approved_at|order_delivered_carrier_date|order_delivered_customer_date|order_estimated_delivery_date|
|--|--|--|--|--|--|--|--|--|--|
|0|1|1|0a8556ac6be836b46b3e89920d59291c|delivered|25/04/2018 22:01|25/04/2018 22:15|2/05/2018 15:20|9/05/2018 17:36|22/05/2018 0:00|
|1|2|2|f2c7fc58a9de810828715166c672f10a|delivered|26/06/2018 11:01|26/06/2018 11:18|28/06/2018 14:18|29/06/2018 20:32|16/07/2018 0:00|
|2|3|3|25b14b69de0b6e184ae6fe2755e478f9|delivered|12/12/2017 11:19|14/12/2017 9:52|15/12/2017 20:13|18/12/2017 17:24|4/01/2018 0:00|
|3|4|4|7a5d8efaaa1081f800628c30d2b0728f|delivered|6/12/2017 12:04|6/12/2017 12:13|7/12/2017 20:28|21/12/2017 1:35|4/01/2018 0:00|
|4|5|5|15fd6fb8f8312dbb4674e4518d6fa3b3|delivered|21/05/2018 13:59|21/05/2018 16:14|22/05/2018 11:46|1/06/2018 21:44|13/06/2018 0:00|


In [4]:
df.dtypes

ID                                int64
order_id                          int64
customer_id                      object
order_status                     object
order_purchase_timestamp         object
order_approved_at                object
order_delivered_carrier_date     object
order_delivered_customer_date    object
order_estimated_delivery_date    object
dtype: object

### Data Correction

We will transform the mentioned columns to datetime with the `to_datetime()` function of pandas, which allows us to express the date in a PostgreSQL-compatible format: yyyy-mm-dd hh:MM:ss. Because the source strings are day-first (dd/mm/yyyy hh:MM), the format is passed explicitly; otherwise pandas reads ambiguous values such as 2/05/2018 as month-first (5 February instead of 2 May). With `errors='coerce'`, null or unparseable entries become NaT.

In [5]:
df['order_purchase_timestamp'] = pd.to_datetime(df.order_purchase_timestamp, format='%d/%m/%Y %H:%M')

In [6]:
df['order_approved_at'] = pd.to_datetime(df.order_approved_at, format='%d/%m/%Y %H:%M', errors='coerce')

In [7]:
df['order_delivered_carrier_date'] = pd.to_datetime(df.order_delivered_carrier_date, format='%d/%m/%Y %H:%M', errors='coerce')

In [8]:
df['order_delivered_customer_date'] = pd.to_datetime(df.order_delivered_customer_date, format='%d/%m/%Y %H:%M', errors='coerce')

In [9]:
df['order_estimated_delivery_date'] = pd.to_datetime(df.order_estimated_delivery_date, format='%d/%m/%Y %H:%M', errors='coerce')

In [10]:
df.dtypes

ID                                        int64
order_id                                  int64
customer_id                              object
order_status                             object
order_purchase_timestamp         datetime64[ns]
order_approved_at                datetime64[ns]
order_delivered_carrier_date     datetime64[ns]
order_delivered_customer_date    datetime64[ns]
order_estimated_delivery_date    datetime64[ns]
dtype: object

In [11]:
df.head()

| |ID|order_id|customer_id|order_status|order_purchase_timestamp|order_approved_at|order_delivered_carrier_date|order_delivered_customer_date|order_estimated_delivery_date|
|--|--|--|--|--|--|--|--|--|--|
|0|1|1|0a8556ac6be836b46b3e89920d59291c|delivered|2018-04-25 22:01:00|2018-04-25 22:15:00|2018-05-02 15:20:00|2018-05-09 17:36:00|2018-05-22|
|1|2|2|f2c7fc58a9de810828715166c672f10a|delivered|2018-06-26 11:01:00|2018-06-26 11:18:00|2018-06-28 14:18:00|2018-06-29 20:32:00|2018-07-16|
|2|3|3|25b14b69de0b6e184ae6fe2755e478f9|delivered|2017-12-12 11:19:00|2017-12-14 09:52:00|2017-12-15 20:13:00|2017-12-18 17:24:00|2018-01-04|
|3|4|4|7a5d8efaaa1081f800628c30d2b0728f|delivered|2017-12-06 12:04:00|2017-12-06 12:13:00|2017-12-07 20:28:00|2017-12-21 01:35:00|2018-01-04|
|4|5|5|15fd6fb8f8312dbb4674e4518d6fa3b3|delivered|2018-05-21 13:59:00|2018-05-21 16:14:00|2018-05-22 11:46:00|2018-06-01 21:44:00|2018-06-13|
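
Since `errors='coerce'` silently turns anything unparseable into NaT, it can be worth checking that the conversion did not lose values that were present before. The following is a rough sketch of such a check; the helper `parse_date_columns`, the `DATE_COLUMNS` list, and the sample frame are hypothetical names introduced for illustration, and the format string assumes the dd/mm/yyyy hh:MM layout described earlier.

```python
import pandas as pd

# Subset of the date columns, kept short for the example.
DATE_COLUMNS = ["order_purchase_timestamp", "order_approved_at"]


def parse_date_columns(frame, columns, fmt="%d/%m/%Y %H:%M"):
    """Parse `columns` as datetimes and report values lost to coercion.

    Returns a copy of `frame` plus a dict mapping each column to the
    number of non-null inputs that ended up as NaT.
    """
    out = frame.copy()
    lost = {}
    for col in columns:
        missing_before = out[col].isna().sum()
        out[col] = pd.to_datetime(out[col], format=fmt, errors="coerce")
        lost[col] = int(out[col].isna().sum() - missing_before)
    return out, lost


sample = pd.DataFrame({
    "order_purchase_timestamp": ["25/04/2018 22:01", "6/12/2017 12:04"],
    "order_approved_at": ["25/04/2018 22:15", None],
})

parsed, lost = parse_date_columns(sample, DATE_COLUMNS)
print(parsed.dtypes)
print(lost)  # a non-zero count would point to strings that did not match the format
```

Looping over a list of columns also avoids repeating the same call once per column.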


Finally, the table is as follows:

| **Column** | **type** |
| --- | --- |
| ID (**PK**)                   | int |
| order_id                      | int |
| customer_id                   | str |
| order_status                  | str |
| order_purchase_timestamp      | datetime64 |
| order_approved_at             | datetime64 |
| order_delivered_carrier_date  | datetime64 |
| order_delivered_customer_date | datetime64 |
| order_estimated_delivery_date | datetime64 |

### Creating final csv: orders_dataset_clean.csv

Finally, the new CSV file is created with the modified orders dataset. When saving the dataset, always set **"index = False"**; otherwise pandas adds a new column with a consecutive number (the DataFrame index). The small script below avoids writing this useless column.

In [17]:
df['order_id'] = df['ID']

In [18]:
df.to_csv('../../data/interim/orders_dataset_clean.csv', index = False, header = True)
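
The effect of `index = False` can be seen by writing a small frame both ways and reading it back. This is an illustrative sketch only: `df_demo` is a made-up frame, and in-memory buffers stand in for real files.

```python
import io

import pandas as pd

df_demo = pd.DataFrame({"ID": [1, 2], "order_status": ["delivered", "shipped"]})

# Default behaviour: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False only the real columns are written.
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'ID', 'order_status']
print(cols_without_index)  # ['ID', 'order_status']
```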

## Final Column Description


|**Column Title**|**ID** |**order_id** |**customer_id** |**order_status** |**order_purchase_timestamp**| **order_approved_at**|**order_delivered_carrier_date**|**order_delivered_customer_date**|**order_estimated_delivery_date**|
|--|--|--|--|--|--|--|--|--|--|
|Before Preprocessing|1 |1 |0a8556ac6be836b46b3e89920d59291c |delivered |25/04/2018 22:01 |25/04/2018 22:15 |2/05/2018 15:20 |9/05/2018 17:36 |22/05/2018 0:00 |
|After Preprocessing|1 |1 |0a8556ac6be836b46b3e89920d59291c |delivered |2018-04-25 22:01:00 |2018-04-25 22:15:00 |2018-05-02 15:20:00 |2018-05-09 17:36:00 |2018-05-22 00:00:00 |
