## Items Dataset

> This notebook covers the analysis and changes needed to make the Items dataframe reliable for dimensional modeling, along with comments on specific decisions.

In [1]:
import pandas as pd
import numpy as np

# Show every column when a dataframe is displayed
pd.set_option('display.max_columns', None)

# Directory of the trusted layer
PATH_TRUSTED = '../output/pd/trusted/'

ds_items = pd.read_csv(PATH_TRUSTED + 'itens_trusted.csv')
ds_orders = pd.read_csv(PATH_TRUSTED + 'orders_trusted.csv')

#### Analysis of the Order Items dataset

In [2]:
ds_items.shape

(112650, 7)

In [3]:
ds_items.head(2)

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,1,25864.0,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.9,13.29
1,00018f77f2f0320c557190d7a144bdd3,1,27229.0,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,239.9,19.93


**Variable/Column Analysis**

- **shipping_limit_date**: Ordinal qualitative variable
- **price**: Continuous quantitative variable
- **freight_value**: Continuous quantitative variable

> Since we will not be working with the `sellers` dataset, I will drop the seller id column.
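
> As a complement to the classification above, here is a minimal sketch of how these three columns could be given matching dtypes. The `sample` frame and the `earliest` and `total_cost` names are illustrative assumptions built from the two rows shown by `head(2)`; they are not part of the notebook.

```python
import pandas as pd

# Toy rows copied from the head(2) output; the real frame is ds_items.
sample = pd.DataFrame({
    "shipping_limit_date": ["2017-09-19 09:45:35", "2017-05-03 11:05:13"],
    "price": [58.9, 239.9],
    "freight_value": [13.29, 19.93],
})

# read_csv loads timestamps as plain strings unless told otherwise,
# so convert explicitly to get chronological ordering and date arithmetic.
sample["shipping_limit_date"] = pd.to_datetime(sample["shipping_limit_date"])

# The two continuous columns can be combined arithmetically.
earliest = sample["shipping_limit_date"].min()
total_cost = (sample["price"] + sample["freight_value"]).round(2)

print(sample.dtypes)
print(earliest, total_cost.tolist())
```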

In [4]:
ds_items.drop(columns=["seller_id"], inplace=True)

In [5]:
ds_items['order_item_id'].value_counts()

order_item_id
1     98666
2      9803
3      2287
4       965
5       460
6       256
7        58
8        36
9        28
10       25
11       17
12       13
13        8
14        7
15        5
16        3
17        3
18        3
19        3
20        3
21        1
Name: count, dtype: int64

In [6]:
ds_items[ds_items['order_item_id'] == 20]

Unnamed: 0,order_id,order_item_id,product_id,shipping_limit_date,price,freight_value
11951,1b15974a0141d54e36626dca3fdc731a,20,27920.0,2018-03-01 02:50:48,100.0,10.12
57316,8272b63d03f5f79c56e9e4120aec44ef,20,2743.0,2017-07-21 18:25:23,1.2,7.89
75122,ab14fdcfbe524636d65ee38360e22ce8,20,25653.0,2017-08-30 14:30:23,98.7,14.44


> Looking at the `order_item_id` column, I see little reason to keep it: it only numbers the items within each order and carries no information relevant to our modeling or future analyses. I will therefore drop it.
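
> One way to check this reading before dropping the column is sketched below on hypothetical toy data (the `toy` frame and the `is_plain_counter` name are assumptions for illustration). If the column is only a running counter, its maximum within each order equals the number of rows of that order, so the item count stays recoverable by counting rows.

```python
import pandas as pd

# Hypothetical toy data: order "a" has three items, order "b" has one.
toy = pd.DataFrame({
    "order_id": ["a", "a", "a", "b"],
    "order_item_id": [1, 2, 3, 1],
})

# Compare the highest item number of each order with its row count.
per_order = toy.groupby("order_id")["order_item_id"].agg(["max", "count"])
is_plain_counter = bool((per_order["max"] == per_order["count"]).all())

print(per_order)
print(is_plain_counter)
```

> Running the same comparison on `ds_items` would be the real test; the toy frame only shows the shape of the check.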

In [7]:
ds_items.drop(columns=['order_item_id'], inplace=True)

#### Null Values/Inconsistencies Analysis

In [8]:
ds_items[ds_items['product_id'].isnull()]

Unnamed: 0,order_id,product_id,shipping_limit_date,price,freight_value
7098,101157d4fae1c9fb74a00a5dee265c25,,2017-04-11 08:02:26,29.0,14.52
9233,1521c6bb7b1028154c8c67cf80fa809f,,2017-04-07 10:10:16,29.0,16.05
28715,415cfaaaa8cea49f934470548797fed1,,2017-04-07 10:35:19,29.0,14.52
28716,415cfaaaa8cea49f934470548797fed1,,2017-04-07 10:35:19,29.0,14.52
39299,595316a07cd3dea9db7adfcc7e247ae7,,2017-08-18 04:26:04,39.0,9.27
48424,6e150190fbe04c642a9cf0b80d83ee16,,2017-06-30 16:45:14,39.0,16.79
48980,6f497c40431d5fb0cfbd6c943dd29215,,2017-04-11 05:55:32,29.0,10.96
58833,85f8ad45e067abd694b627859fa57453,,2017-02-03 21:40:02,1934.0,27.0
71134,a2456e7f02197951664897a94c87242d,,2017-04-06 11:50:09,29.0,24.84
73556,a7a43f469c0d7bdb0a23a82db125aefa,,2017-08-28 13:15:11,39.0,15.1


> These orders have a null `product_id` because products lacking the information essential for freight calculation were excluded, so I will simply update the dataset by removing these inconsistent rows.

In [9]:
ds_items.dropna(subset=['product_id'], inplace=True)
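
> A minimal sketch of the same cleanup on a toy frame, counting the nulls first and recording how many rows are lost. The `toy` and `cleaned` names are assumptions, and the cast back to integers is an optional extra step the notebook does not perform: `product_id` shows up as a float (`25864.0`) only because NaN forces a float column.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing product_id, mirroring the rows listed above.
toy = pd.DataFrame({
    "order_id": ["x1", "x2", "x3"],
    "product_id": [25864.0, np.nan, 27229.0],
    "price": [58.9, 29.0, 239.9],
})

rows_before = len(toy)
nulls_per_column = toy.isnull().sum()

# Drop only the rows whose product_id is missing.
cleaned = toy.dropna(subset=["product_id"]).copy()

# With no NaN left, the ids can be stored as integers again.
cleaned["product_id"] = cleaned["product_id"].astype("int64")

print(nulls_per_column["product_id"], rows_before - len(cleaned))
```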

In [10]:
ds_items

Unnamed: 0,order_id,product_id,shipping_limit_date,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,25864.0,2017-09-19 09:45:35,58.90,13.29
1,00018f77f2f0320c557190d7a144bdd3,27229.0,2017-05-03 11:05:13,239.90,19.93
2,000229ec398224ef6ca0657da4fc703e,22623.0,2018-01-18 14:48:30,199.00,17.87
3,00024acbcdf0a6daa1e931b038114c75,15403.0,2018-08-15 10:10:18,12.99,12.79
4,00042b26cf59d7ce69dfabb4e55b4fd9,8862.0,2017-02-13 13:57:51,199.90,18.14
...,...,...,...,...,...
112645,fffc94f6ce00a00581880bf54a75a037,4738.0,2018-05-02 04:11:01,299.99,43.41
112646,fffcd46ef2263f404302a634eb57f7eb,9255.0,2018-07-20 04:31:48,350.00,36.53
112647,fffce4705a9662cd70adb13d4a31832d,24014.0,2017-10-30 17:14:25,99.90,16.95
112648,fffe18544ffabc95dfada21779c9644f,22354.0,2017-08-21 00:04:32,55.99,8.72


> Looking at this table, we can see that it only serves as the bridge in a many-to-many relationship. Since the data has already been treated above and shows no further inconsistencies, all that remains is to update `order_id` both in this table and in the `orders` table, as this is the last step.

In [11]:
order_id_map = ds_orders.set_index('order_id')['order_id_int']
order_id_map

order_id
e481f51cbdc54678b7cc49136f2d6af7        1
53cdb2fc8bc7dce0b6741e2150273451        2
47770eb9100c2d0c44946d9cf07ec65d        3
949d5b44dbf5de918fe9c16f97b45f8a        4
ad21c59c0840e6cb83a9ceb5573f8159        5
                                    ...  
9c5dedf39a927c1b2549525ed64a053c    99437
63943bddc261676b46f01ca7ac2f7bd8    99438
83c1379a015df1e13d02aae0204711ab    99439
11c177c8e97725db2631073c19f07b62    99440
66dea50a8b16d9b4dee7af250b4be1a5    99441
Name: order_id_int, Length: 99441, dtype: int64

In [12]:
ds_items['order_id'] = ds_items['order_id'].map(order_id_map)
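
> `Series.map` returns NaN for any key missing from the lookup, so it can be worth counting the misses before trusting the new column. The sketch below uses hypothetical frames (`orders`, `items`, and the key `"zz"` are made up), and dropping the unmatched rows is one possible policy, not necessarily the right one for this dataset.

```python
import pandas as pd

# Hypothetical lookup: hash-like order ids mapped to integer surrogate keys.
orders = pd.DataFrame({"order_id": ["aa", "bb", "cc"], "order_id_int": [1, 2, 3]})
items = pd.DataFrame({
    "order_id": ["bb", "bb", "cc", "zz"],  # "zz" has no matching order
    "price": [10.0, 5.0, 7.5, 1.0],
})

id_map = orders.set_index("order_id")["order_id_int"]
mapped = items["order_id"].map(id_map)

# Keys absent from the lookup become NaN, which also makes the column float.
unmatched = int(mapped.isna().sum())
print(unmatched)

# Keep only matched rows and restore an integer dtype.
items = items[mapped.notna()].assign(order_id=mapped.dropna().astype("int64"))
print(items["order_id"].tolist())
```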

In [13]:
ds_items

Unnamed: 0,order_id,product_id,shipping_limit_date,price,freight_value
0,85268,25864.0,2017-09-19 09:45:35,58.90,13.29
1,71854,27229.0,2017-05-03 11:05:13,239.90,19.93
2,6299,22623.0,2018-01-18 14:48:30,199.00,17.87
3,22551,15403.0,2018-08-15 10:10:18,12.99,12.79
4,5248,8862.0,2017-02-13 13:57:51,199.90,18.14
...,...,...,...,...,...
112645,79551,4738.0,2018-05-02 04:11:01,299.99,43.41
112646,70156,9255.0,2018-07-20 04:31:48,350.00,36.53
112647,52700,24014.0,2017-10-30 17:14:25,99.90,16.95
112648,59872,22354.0,2017-08-21 00:04:32,55.99,8.72


In [14]:
ds_orders.drop(columns=['order_id'], inplace=True)
# The integer key column is named 'order_id_int' (see In [11])
ds_orders.rename(columns={'order_id_int': 'order_id'}, inplace=True)
ds_orders

Unnamed: 0,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,order_id
0,68585,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00,1
1,74977,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00,2
2,555,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00,3
3,59790,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15 00:00:00,4
4,65715,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26 00:00:00,5
...,...,...,...,...,...,...,...,...
99436,59296,delivered,2017-03-09 09:54:05,2017-03-09 09:54:05,2017-03-10 11:18:03,2017-03-17 15:08:01,2017-03-28 00:00:00,99437
99437,76301,delivered,2018-02-06 12:58:58,2018-02-06 13:10:37,2018-02-07 23:22:42,2018-02-28 17:37:56,2018-03-02 00:00:00,99438
99438,19749,delivered,2017-08-27 14:46:43,2017-08-27 15:04:16,2017-08-28 20:52:26,2017-09-21 11:24:17,2017-09-27 00:00:00,99439
99439,16808,delivered,2018-01-08 21:28:27,2018-01-08 21:36:21,2018-01-12 15:35:03,2018-01-25 23:32:54,2018-02-15 00:00:00,99440
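
> By default `DataFrame.rename` ignores labels it cannot find, so a misspelled column name goes unnoticed. The sketch below, on a toy frame (`toy`, `unchanged`, and `renamed` are illustrative names), shows that behavior and how `errors="raise"` surfaces the problem.

```python
import pandas as pd

toy = pd.DataFrame({"customer_id": [68585, 74977], "order_id_int": [1, 2]})

# A label that does not exist is ignored by default: no error, no rename.
unchanged = toy.rename(columns={"order_int_id": "order_id"})
print(list(unchanged.columns))

# errors="raise" turns the same typo into a KeyError.
try:
    toy.rename(columns={"order_int_id": "order_id"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

# With the correct label the rename goes through.
renamed = toy.rename(columns={"order_id_int": "order_id"}, errors="raise")
print(typo_detected, list(renamed.columns))
```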



#### Loading the Items dataset into the trusted layer

> Now that our `itens` dataset has been updated with the new table index and our `orders` dataset with its new, more readable `order_id`, we can load both into our `trusted` layer.

In [15]:
ds_items.to_csv(PATH_TRUSTED + 'items_trusted.csv', index=False)
ds_orders.to_csv(PATH_TRUSTED + 'orders_trusted.csv', index=False)
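
> A quick read-back is a cheap way to confirm what was written. The sketch below uses a temporary directory and a toy frame instead of the real `PATH_TRUSTED` files, so every path and value in it is an assumption made for illustration.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"order_id": [85268, 71854], "price": [58.9, 239.9]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "items_trusted.csv")
    # index=False keeps the positional index out of the file, so no
    # extra "Unnamed: 0" column appears when the CSV is read back.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns), len(reloaded))
```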