## Payments Dataset

> This notebook covers the analyses and changes needed to make the Payments dataframe reliable for dimensional modeling, along with comments on certain decisions.


In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)

PATH_RAW = '../input/raw/'
PATH_TRUSTED = '../output/pd/trusted/'

ds_payments = pd.read_csv(PATH_RAW + 'olist_order_payments_dataset.csv')
ds_orders = pd.read_csv(PATH_TRUSTED + 'orders_trusted.csv')

### Analysis of the Payments dataset:

In [2]:
ds_payments.head(2)

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value
0,b81ef226f3fe1789b1e8b2acac839d17,1,credit_card,8,99.33
1,a9810da82917af2d9aefd1278f1dcfa0,1,credit_card,1,24.39


**Variable/Column Analysis**:

- **payment_type**: Qualitative Nominal Variable (the payment methods have no inherent order)
- **payment_installments**: Quantitative Discrete Variable
- **payment_value**: Quantitative Continuous Variable
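
> The classification above can be cross-checked against the column dtypes. The sketch below is only a heuristic and runs on a toy dataframe (`sample`, with illustrative values), not on the real file: integer columns are treated as discrete, float columns as continuous, and everything else as qualitative.

```python
import pandas as pd

# Toy rows shaped like the Payments dataset (values are illustrative only).
sample = pd.DataFrame({
    'payment_type': ['credit_card', 'boleto', 'voucher', 'credit_card'],
    'payment_installments': [8, 1, 1, 2],
    'payment_value': [99.33, 24.39, 65.71, 107.78],
})

# Heuristic: the dtype hints at the kind of variable, it does not prove it.
kinds = {}
for col in sample.columns:
    if pd.api.types.is_integer_dtype(sample[col]):
        kinds[col] = 'quantitative discrete'
    elif pd.api.types.is_float_dtype(sample[col]):
        kinds[col] = 'quantitative continuous'
    else:
        kinds[col] = 'qualitative'

# Listing the distinct labels helps decide whether a qualitative column has an order.
payment_type_levels = sorted(sample['payment_type'].unique())
print(kinds)
print(payment_type_levels)
```

> A dtype check like this is a starting point only; a column stored as an integer can still be a code for categories, so the final call stays a manual one.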

#

#### Duplicate Check

In [3]:
# Count, per order_id, the rows that repeat an order_id already seen (first occurrence excluded)
ds_payments[ds_payments.duplicated(subset='order_id', keep='first')].value_counts('order_id')

order_id
fa65dad1b0e818e3ccc5cb0e39231352    28
ccf804e764ed5650cd8759557269dc13    25
285c2e15bebd4ac83635ccc563dc71f4    21
895ab968e7bb0d5659d16cd74cd1650c    20
fedcd9f7ccdc8cba3a18defedd1a5547    18
                                    ..
596c6d7a66d8869a75807519b27a437d     1
59a19c83ff825948739dd1601cc107b6     1
59a7ff272ffc5a0a705f006f9b2db1b9     1
59c134c32edc0046c31ad8bd6f521b61     1
ffc730a0615d28ec19f9cad02cb41442     1
Name: count, Length: 2961, dtype: int64
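
> Note that `duplicated(keep='first')` flags only the occurrences after the first one, so the counts above are the extra rows per order, not the total. A count of 28, for example, would mean 29 payment rows for that order. A minimal sketch on a toy dataframe (hypothetical ids `a`, `b`, `c`) shows the difference:

```python
import pandas as pd

# Toy data: order 'a' has 3 payment rows, 'b' has 1, 'c' has 2.
toy = pd.DataFrame({
    'order_id': ['a', 'a', 'a', 'b', 'c', 'c'],
    'payment_sequential': [1, 2, 3, 1, 1, 2],
})

# keep='first' marks every occurrence after the first as a duplicate.
extra_rows = toy[toy.duplicated(subset='order_id', keep='first')]
extra_counts = extra_rows['order_id'].value_counts()

# Total rows per order, for comparison.
total_counts = toy['order_id'].value_counts()

print(extra_counts)
print(total_counts)
```

> Orders with a single payment row (like `b` here) never appear in the duplicate counts at all.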

> After this brief analysis, and considering the fields we have, I think it is worth creating a `payment_id`: since `order_id` repeats across rows, it cannot identify a payment by itself, and the dataset has a qualitative column that will not go into our future `fact table` but will need to be analyzed if requested.

> In addition, I will also change `order_id`, for the same purposes discussed in the [_Customer_](./customer_to_trusted.ipyn) notebook.

In [4]:
ds_payments['payment_id'] = range(1, len(ds_payments) + 1)
ds_payments

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value,payment_id
0,b81ef226f3fe1789b1e8b2acac839d17,1,credit_card,8,99.33,1
1,a9810da82917af2d9aefd1278f1dcfa0,1,credit_card,1,24.39,2
2,25e8ea4e93396b6fa0d3dd708e76c1bd,1,credit_card,1,65.71,3
3,ba78997921bbcdc1373bb41e913ab953,1,credit_card,8,107.78,4
4,42fdf880ba16b47b59251dd489d4441a,1,credit_card,2,128.45,5
...,...,...,...,...,...,...
103881,0406037ad97740d563a178ecc7a2075c,1,boleto,1,363.31,103882
103882,7b905861d7c825891d6347454ea7863f,1,credit_card,2,96.80,103883
103883,32609bbb3dd69b3c066a6860554a77bf,1,credit_card,1,47.77,103884
103884,b8b61059626efa996a60be9bb9320e10,1,credit_card,5,369.54,103885
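
> A quick sanity check for a surrogate key like this is to confirm that it is unique and gap-free. The sketch below uses a toy dataframe; the second check assumes that `order_id` plus `payment_sequential` forms a natural key, which is my assumption and has not been verified against the real file in this notebook.

```python
import pandas as pd

# Toy payments: order 'a' was paid in two parts, order 'b' in one.
toy = pd.DataFrame({
    'order_id': ['a', 'a', 'b'],
    'payment_sequential': [1, 2, 1],
    'payment_type': ['credit_card', 'voucher', 'boleto'],
})

# Same technique as the cell above: a 1-based sequential surrogate key.
toy['payment_id'] = range(1, len(toy) + 1)

# The surrogate key must be unique and run from 1 to the number of rows.
surrogate_is_unique = toy['payment_id'].is_unique
surrogate_is_gap_free = toy['payment_id'].tolist() == list(range(1, len(toy) + 1))

# Assumed natural key: no (order_id, payment_sequential) pair should repeat.
natural_key_is_unique = not toy.duplicated(
    subset=['order_id', 'payment_sequential']
).any()

print(surrogate_is_unique, surrogate_is_gap_free, natural_key_is_unique)
```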


In [5]:
ds_orders.order_id.duplicated().value_counts()

order_id
False    99441
Name: count, dtype: int64
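
> This uniqueness matters for the next steps, because each `order_id` should receive exactly one integer id. The same check can be written as a single boolean, sketched here on a toy `orders` frame with hypothetical ids:

```python
import pandas as pd

# Toy orders table with hypothetical ids.
orders = pd.DataFrame({'order_id': ['x1', 'x2', 'x3']})

# True when no order_id repeats.
ids_are_unique = orders['order_id'].is_unique

# Same information through duplicated(), as in the cell above.
duplicate_count = int(orders['order_id'].duplicated().sum())

print(ids_are_unique, duplicate_count)
```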

In [6]:
ds_orders['order_id_int'] = range(1, len(ds_orders) + 1)
ds_orders

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,order_id_int
0,e481f51cbdc54678b7cc49136f2d6af7,68585,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00,1
1,53cdb2fc8bc7dce0b6741e2150273451,74977,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00,2
2,47770eb9100c2d0c44946d9cf07ec65d,555,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00,3
3,949d5b44dbf5de918fe9c16f97b45f8a,59790,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15 00:00:00,4
4,ad21c59c0840e6cb83a9ceb5573f8159,65715,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26 00:00:00,5
...,...,...,...,...,...,...,...,...,...
99436,9c5dedf39a927c1b2549525ed64a053c,59296,delivered,2017-03-09 09:54:05,2017-03-09 09:54:05,2017-03-10 11:18:03,2017-03-17 15:08:01,2017-03-28 00:00:00,99437
99437,63943bddc261676b46f01ca7ac2f7bd8,76301,delivered,2018-02-06 12:58:58,2018-02-06 13:10:37,2018-02-07 23:22:42,2018-02-28 17:37:56,2018-03-02 00:00:00,99438
99438,83c1379a015df1e13d02aae0204711ab,19749,delivered,2017-08-27 14:46:43,2017-08-27 15:04:16,2017-08-28 20:52:26,2017-09-21 11:24:17,2017-09-27 00:00:00,99439
99439,11c177c8e97725db2631073c19f07b62,16808,delivered,2018-01-08 21:28:27,2018-01-08 21:36:21,2018-01-12 15:35:03,2018-01-25 23:32:54,2018-02-15 00:00:00,99440


In [7]:
ds_order_id_map = ds_orders.set_index('order_id')['order_id_int']
ds_order_id_map

order_id
e481f51cbdc54678b7cc49136f2d6af7        1
53cdb2fc8bc7dce0b6741e2150273451        2
47770eb9100c2d0c44946d9cf07ec65d        3
949d5b44dbf5de918fe9c16f97b45f8a        4
ad21c59c0840e6cb83a9ceb5573f8159        5
                                    ...  
9c5dedf39a927c1b2549525ed64a053c    99437
63943bddc261676b46f01ca7ac2f7bd8    99438
83c1379a015df1e13d02aae0204711ab    99439
11c177c8e97725db2631073c19f07b62    99440
66dea50a8b16d9b4dee7af250b4be1a5    99441
Name: order_id_int, Length: 99441, dtype: int64

In [8]:
ds_payments['order_id'] = ds_payments['order_id'].map(ds_order_id_map)
ds_payments

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value,payment_id
0,75269,1,credit_card,8,99.33,1
1,98161,1,credit_card,1,24.39,2
2,43930,1,credit_card,1,65.71,3
3,64933,1,credit_card,8,107.78,4
4,12703,1,credit_card,2,128.45,5
...,...,...,...,...,...,...
103881,96837,1,boleto,1,363.31,103882
103882,39846,1,credit_card,2,96.80,103883
103883,27585,1,credit_card,1,47.77,103884
103884,35140,1,credit_card,5,369.54,103885
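
> One thing worth checking after a `map` like this: any `order_id` in Payments that does not exist in Orders becomes `NaN` instead of raising an error. The sketch below reproduces the mapping on toy data (the id `missing` is deliberately absent from the toy orders) and counts the unmapped rows.

```python
import pandas as pd

# Toy orders with their new integer ids.
orders = pd.DataFrame({'order_id': ['x1', 'x2', 'x3']})
orders['order_id_int'] = range(1, len(orders) + 1)
order_id_map = orders.set_index('order_id')['order_id_int']

# Toy payments; 'missing' has no matching order on purpose.
payments = pd.DataFrame({'order_id': ['x2', 'x2', 'x3', 'missing']})

# Values found in the map index are replaced; the others become NaN.
mapped = payments['order_id'].map(order_id_map)
unmapped_count = int(mapped.isna().sum())

print(mapped.tolist())
print(unmapped_count)
```

> If `unmapped_count` were greater than zero on the real data, those payments would lose their link to an order, so it is worth looking at this number before saving.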


#

#### Loading the Payments dataset into the trusted layer

> Now that our Payments dataset has its new index `payment_id`, and our `orders` dataset has its new id `order_id_int` (which will be changed later), we can load these values into our `trusted` layer.

In [9]:
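# Let pandas infer the most suitable dtypes before saving
ds_payments = ds_payments.convert_dtypes()
ds_orders = ds_orders.convert_dtypes()

# index=False keeps the dataframe index out of the files
ds_payments.to_csv(PATH_TRUSTED + 'payments_trusted.csv', index=False)
# Note: this overwrites the orders_trusted.csv file that was read at the top of the notebook
ds_orders.to_csv(PATH_TRUSTED + 'orders_trusted.csv', index=False)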
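
> To confirm that a saved file can be read back as expected, a round trip is a cheap test. This is a minimal sketch on toy data, written to a temporary directory rather than to `PATH_TRUSTED`, so it does not touch the real files.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same kind of columns as the trusted Payments file.
toy = pd.DataFrame({
    'order_id': [2, 3],
    'payment_type': ['boleto', 'credit_card'],
    'payment_value': [363.31, 96.80],
    'payment_id': [1, 2],
}).convert_dtypes()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'payments_trusted.csv')
    # index=False avoids an extra unnamed index column when reading back.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_columns = list(reloaded.columns) == list(toy.columns)
same_row_count = len(reloaded) == len(toy)

print(same_columns, same_row_count)
```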