# Order items dataset cleaning
This dataset includes data about the items purchased within each order.

## Initial Column Description


|**Column Title**|**id -> int** |**order_id -> str** |**item_id -> int** |**product_id -> str** |**seller_id -> str**|**shipping_limit_date -> timestamp**| **price -> float**|**freight_value -> float** |
|--|--|--|--|--|--|--|--|--|
|Description |id |Unique order identifier (PK) |Sequential number identifying the items included in the same order (PK) |Unique product identifier (PK) |Unique seller identifier (PK) |Seller's shipping limit date for handing the order over to the logistics partner |Item price |Item freight value (if an order has more than one item, the freight value is split between the items) |
|Example |1 |82096 |1 |4244733e06e7ecb4970a6e2683c13e61 |48436dade18ac8b2bce089ec2a041202 |19/09/2017 9:45	|58.90 |13.29 |

### Errors found
+ The date format of the order items dataset must be corrected, because it causes problems when uploading the data to the database.


## Required Libraries

In [63]:
import pandas as pd
import os

## Data Preprocessing


We need to change the format of the shipping_limit_date column. The current format:

|**Column Title**|**id -> int** |**order_id -> str** |**item_id -> int** |**product_id -> str** |**seller_id -> str**|**shipping_limit_date -> timestamp**| **price -> float**|**freight_value -> float** |
|--|--|--|--|--|--|--|--|--|
|Example |1 |82096 |1 |4244733e06e7ecb4970a6e2683c13e61 |48436dade18ac8b2bce089ec2a041202 |19/09/2017 9:45	|58.90 |13.29 |

Replace the slashes with hyphens. The outcome:

|**Column Title**|**id -> int** |**order_id -> str** |**item_id -> int** |**product_id -> str** |**seller_id -> str**|**shipping_limit_date -> timestamp**| **price -> float**|**freight_value -> float** |
|--|--|--|--|--|--|--|--|--|
|Example |1 |82096 |1 |4244733e06e7ecb4970a6e2683c13e61 |48436dade18ac8b2bce089ec2a041202 |19-09-2017 9:45	|58.90 |13.29 |
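The slash-to-hyphen replacement described above can be sketched with a plain string replacement on the column. This is a minimal illustration on a hypothetical two-row frame (`sample` is not part of the notebook), assuming the dates arrive as day/month/year strings like the ones in the tables:

```python
import pandas as pd

# Hypothetical sample using the day/month/year layout shown in the tables.
sample = pd.DataFrame(
    {"shipping_limit_date": ["19/09/2017 9:45", "03/05/2017 11:05"]}
)

# Literal (non-regex) replacement of every slash with a hyphen.
sample["shipping_limit_date"] = sample["shipping_limit_date"].str.replace(
    "/", "-", regex=False
)

print(sample["shipping_limit_date"].tolist())
# ['19-09-2017 9:45', '03-05-2017 11:05']
```

The replacement only changes the separator. The values are still strings until they are parsed as timestamps.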


### Data Correction

In [64]:
dataset_path = "../../data/raw/" 

In [65]:
csv_file_name = 'olist_order_items_dataset.csv'
csv_file_path = os.path.join(dataset_path, csv_file_name)
df = pd.read_csv(csv_file_path)

In [66]:
order_ids = pd.read_csv('../../data/interim/unique_order_id.csv')
order_ids
order_ids_dict = dict(zip(order_ids['order'].to_list(), order_ids['order_id'].to_list()))
df['order_id'] = df['order_id'].map(order_ids_dict)
df

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
0,82096.0,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.90,13.29
1,2495.0,1,e5f2d52b802189ee658865ca93d83a8f,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,239.90,19.93
2,12285.0,1,c777355d18b72b67abbeef9df44fd0fd,5b51032eddd242adc84c38acab88f23d,2018-01-18 14:48:30,199.00,17.87
3,32372.0,1,7634da152a4610f1595efa32f14722fc,9d7a1d34a5052409006425275ba1c2b4,2018-08-15 10:10:18,12.99,12.79
4,94629.0,1,ac6c3623068f30de03045865e4e10089,df560393f3a51e74553ab94004ba5c87,2017-02-13 13:57:51,199.90,18.14
...,...,...,...,...,...,...,...
112645,66868.0,1,4aa6014eceb682077f9dc4bffebc05b0,b8bc237ba3788b23da09c0f1f3a3288c,2018-05-02 04:11:01,299.99,43.41
112646,2771.0,1,32e07fd915822b0765e448c4dd74c828,f3c38ab652836d21de61fb8314b69182,2018-07-20 04:31:48,350.00,36.53
112647,57154.0,1,72a30483855e2eafc67aee5dc2560482,c3cfdc648177fdbbbb35635a37472c53,2017-10-30 17:14:25,99.90,16.95
112648,36346.0,1,9c422a519119dcad7575db5af1ba540e,2b3e4a2a3ea8e01938cabda2a3e5cc79,2017-08-21 00:04:32,55.99,8.72
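The `.0` suffix on order_id in the output above suggests the column became a float after the mapping. pandas does this when a column can contain NaN, which `map` produces for keys missing from the dictionary. The sketch below reproduces the effect on hypothetical data (`lookup`, `items` and the key values are invented; it assumes the `order` column holds the original identifier and `order_id` the new integer key):

```python
import pandas as pd

# Hypothetical lookup table: original identifier -> integer key.
lookup = pd.DataFrame({"order": ["aaa", "bbb"], "order_id": [10, 20]})
mapping = dict(zip(lookup["order"], lookup["order_id"]))

# "zzz" has no entry in the lookup table.
items = pd.DataFrame({"order_id": ["aaa", "bbb", "zzz"]})
mapped = items["order_id"].map(mapping)

print(mapped.tolist())      # [10.0, 20.0, nan]
print(mapped.dtype)         # float64
print(mapped.isna().sum())  # 1 row without a match
```

Counting the NaN values right after the mapping is a cheap way to see how many rows a later `dropna` would remove.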


In [67]:
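# Drop rows with missing values (order ids without a match in the mapping are NaN),
# then restore the integer type and build a 1-based id column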
df.dropna(inplace=True)
df['order_id'] = df['order_id'].astype(int)
df['id'] = df.index + 1
df = df[['id', 'order_id', 'order_item_id', 'product_id', 'seller_id', 'shipping_limit_date', 'price', 'freight_value']]
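In the run shown here the row count stays at 112649, so the index-based `id` is consecutive. If `dropna` ever removes rows, `df.index + 1` would keep the gaps. A minimal sketch of that case on a hypothetical frame (`df_demo` is invented), with an index reset as one possible way to get consecutive ids:

```python
import pandas as pd

# Hypothetical frame in which the middle row has no order_id.
df_demo = pd.DataFrame({"order_id": [10.0, None, 30.0], "price": [1.5, 2.5, 3.5]})

cleaned = df_demo.dropna().copy()
cleaned["order_id"] = cleaned["order_id"].astype(int)

# Ids taken from the surviving index keep the gap left by the dropped row.
ids_from_index = (cleaned.index + 1).tolist()

# Resetting the index first yields consecutive ids.
cleaned = cleaned.reset_index(drop=True)
cleaned["id"] = cleaned.index + 1

print(ids_from_index)          # [1, 3]
print(cleaned["id"].tolist())  # [1, 2]
```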

In [68]:
# items_csv = '../../data/pre_interim/order_items_dataset_clean.csv'
# items_dataset = pd.read_csv(items_csv)
items_dataset = df.copy()
items_dataset

Unnamed: 0,id,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
0,1,82096,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.90,13.29
1,2,2495,1,e5f2d52b802189ee658865ca93d83a8f,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,239.90,19.93
2,3,12285,1,c777355d18b72b67abbeef9df44fd0fd,5b51032eddd242adc84c38acab88f23d,2018-01-18 14:48:30,199.00,17.87
3,4,32372,1,7634da152a4610f1595efa32f14722fc,9d7a1d34a5052409006425275ba1c2b4,2018-08-15 10:10:18,12.99,12.79
4,5,94629,1,ac6c3623068f30de03045865e4e10089,df560393f3a51e74553ab94004ba5c87,2017-02-13 13:57:51,199.90,18.14
...,...,...,...,...,...,...,...,...
112645,112646,66868,1,4aa6014eceb682077f9dc4bffebc05b0,b8bc237ba3788b23da09c0f1f3a3288c,2018-05-02 04:11:01,299.99,43.41
112646,112647,2771,1,32e07fd915822b0765e448c4dd74c828,f3c38ab652836d21de61fb8314b69182,2018-07-20 04:31:48,350.00,36.53
112647,112648,57154,1,72a30483855e2eafc67aee5dc2560482,c3cfdc648177fdbbbb35635a37472c53,2017-10-30 17:14:25,99.90,16.95
112648,112649,36346,1,9c422a519119dcad7575db5af1ba540e,2b3e4a2a3ea8e01938cabda2a3e5cc79,2017-08-21 00:04:32,55.99,8.72


#### Correct the datetime

In [69]:
items_dataset['shipping_limit_date'] = pd.to_datetime(items_dataset['shipping_limit_date'])
items_dataset

Unnamed: 0,id,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
0,1,82096,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.90,13.29
1,2,2495,1,e5f2d52b802189ee658865ca93d83a8f,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,239.90,19.93
2,3,12285,1,c777355d18b72b67abbeef9df44fd0fd,5b51032eddd242adc84c38acab88f23d,2018-01-18 14:48:30,199.00,17.87
3,4,32372,1,7634da152a4610f1595efa32f14722fc,9d7a1d34a5052409006425275ba1c2b4,2018-08-15 10:10:18,12.99,12.79
4,5,94629,1,ac6c3623068f30de03045865e4e10089,df560393f3a51e74553ab94004ba5c87,2017-02-13 13:57:51,199.90,18.14
...,...,...,...,...,...,...,...,...
112645,112646,66868,1,4aa6014eceb682077f9dc4bffebc05b0,b8bc237ba3788b23da09c0f1f3a3288c,2018-05-02 04:11:01,299.99,43.41
112646,112647,2771,1,32e07fd915822b0765e448c4dd74c828,f3c38ab652836d21de61fb8314b69182,2018-07-20 04:31:48,350.00,36.53
112647,112648,57154,1,72a30483855e2eafc67aee5dc2560482,c3cfdc648177fdbbbb35635a37472c53,2017-10-30 17:14:25,99.90,16.95
112648,112649,36346,1,9c422a519119dcad7575db5af1ba540e,2b3e4a2a3ea8e01938cabda2a3e5cc79,2017-08-21 00:04:32,55.99,8.72
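The values printed above are in year-month-day layout, which `pd.to_datetime` parses without extra arguments. If a file did contain day-first strings such as the 19/09/2017 9:45 shown in the description tables, passing an explicit `format` avoids ambiguity between day and month. A hedged sketch on hypothetical series (`iso` and `day_first` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Year-month-day strings, as printed in the notebook output.
iso = pd.to_datetime(
    pd.Series(["2017-09-19 09:45:35", "2017-05-03 11:05:13"]),
    format="%Y-%m-%d %H:%M:%S",
)

# Day-first strings; errors="coerce" turns unparseable values into NaT
# instead of raising an exception.
day_first = pd.to_datetime(
    pd.Series(["19/09/2017 9:45", "bad value"]),
    format="%d/%m/%Y %H:%M",
    errors="coerce",
)

print(iso.iloc[0])               # 2017-09-19 09:45:35
print(day_first.iloc[0])         # 2017-09-19 09:45:00
print(day_first.isna().tolist()) # [False, True]
```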


In [70]:
items_dataset['order_item_id'].value_counts()

1     98665
2      9802
3      2286
4       965
5       460
6       256
7        58
8        36
9        28
10       25
11       17
12       13
13        8
14        7
15        5
16        3
17        3
18        3
19        3
20        3
21        1
Name: order_item_id, dtype: int64
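One way to read these counts: order_item_id is a sequential number within an order, so the count for 1 should equal the number of distinct orders, assuming every order starts its numbering at 1. The toy frame below (hypothetical data) illustrates that reading:

```python
import pandas as pd

# Hypothetical orders: order 1 has three items, order 2 one, order 3 two.
toy = pd.DataFrame(
    {
        "order_id": [1, 1, 1, 2, 3, 3],
        "order_item_id": [1, 2, 3, 1, 1, 2],
    }
)

counts = toy["order_item_id"].value_counts().sort_index()
print(counts.to_dict())           # {1: 3, 2: 2, 3: 1}
print(toy["order_id"].nunique())  # 3, same as the count for item id 1
```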

#### Create the CSV

When you save the dataset, always pass **"index = False"**. Otherwise pandas writes the index as an extra column of consecutive numbers. Passing it in the call below keeps that unneeded column out of the file.

In [71]:
items_dataset.to_csv('../../data/interim/order_items_dataset.csv', index=False)
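The effect of the index argument can be checked without touching the disk, since `to_csv` returns a string when no path is given. A small sketch on a hypothetical two-row frame (`frame` is invented for the illustration):

```python
import io

import pandas as pd

frame = pd.DataFrame({"id": [1, 2], "price": [58.90, 239.90]})

with_index = frame.to_csv()                # default: index=True
without_index = frame.to_csv(index=False)

print(with_index.splitlines()[0])     # ,id,price
print(without_index.splitlines()[0])  # id,price

# Reading the first variant back shows the extra, unnamed column.
reloaded = pd.read_csv(io.StringIO(with_index))
print(reloaded.columns.tolist())      # ['Unnamed: 0', 'id', 'price']
```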

## Final Column Description


|**Column Title**|**id -> int** |**order_id -> str** |**item_id -> int** |**product_id -> str** |**seller_id -> str**|**shipping_limit_date -> timestamp**| **price -> float**|**freight_value -> float** |
|--|--|--|--|--|--|--|--|--|
|Before Preprocessing |1 |82096 |1 |4244733e06e7ecb4970a6e2683c13e61 |48436dade18ac8b2bce089ec2a041202 |19/09/2017 9:45	|58.90 |13.29 |
|After Preprocessing |1 |82096 |1 |4244733e06e7ecb4970a6e2683c13e61 |48436dade18ac8b2bce089ec2a041202 |19-09-2017 9:45	|58.90 |13.29 |

