## Customer Dataset

> This notebook analyzes the Customer dataframe and applies the changes needed to make it reliable for dimensional modeling, with comments explaining the decisions made along the way.


In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)

PATH_RAW = '../input/raw/'
PATH_TRUSTED = '../output/pd/trusted/'

ds_customers = pd.read_csv(PATH_RAW + 'olist_customers_dataset.csv')
ds_orders = pd.read_csv(PATH_RAW + 'olist_orders_dataset.csv')

### Analysis of the Customers dataset:

In [2]:
ds_customers.head(2)

Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
0,06b8999e2fba1a1fbc88172c00ba8bc7,861eff4711a542e4b93843c6dd7febb0,14409,franca,SP
1,18955e83d337fd6b2def6b18a428ac77,290c77bc529b7ac935b93aa66c333dc3,9790,sao bernardo do campo,SP


**Analysis of the Variables/Columns**:

- **customer_zip_code_prefix**: Nominal qualitative variable
- **customer_city**: Nominal qualitative variable
- **customer_state**: Nominal qualitative variable (states have no inherent order)
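
> Treating `customer_zip_code_prefix` as a nominal variable has a practical consequence: when pandas infers an integer column, any leading zero is lost (the preview above shows `9790`). The sketch below uses a toy CSV, not the real file, and assumes every prefix is meant to have 5 digits. It shows two possible ways to keep the text form.

```python
import io

import pandas as pd

# Toy CSV standing in for the customers file; the second prefix starts with a zero.
csv_text = (
    "customer_zip_code_prefix,customer_city\n"
    "14409,franca\n"
    "09790,sao bernardo do campo\n"
)

# Default parsing infers an integer column and drops the leading zero.
as_int = pd.read_csv(io.StringIO(csv_text))

# Reading the column as text keeps the prefix exactly as written.
as_text = pd.read_csv(
    io.StringIO(csv_text), dtype={"customer_zip_code_prefix": "string"}
)

# If the column was already read as integers, zero-padding restores the text form
# (assumption: every prefix should be 5 digits long).
padded = as_int["customer_zip_code_prefix"].astype(str).str.zfill(5)

print(as_int["customer_zip_code_prefix"].tolist())
print(as_text["customer_zip_code_prefix"].tolist())
print(padded.tolist())
```

> Whether the trusted layer should store the prefix as text or as an integer is a modeling decision. This notebook keeps the integer form.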

#

#### Checking for Duplicates

In [3]:
ds_customers.duplicated(subset=["customer_id"],keep='first').value_counts()

False    99441
Name: count, dtype: int64

In [4]:
ds_customers.duplicated(subset=["customer_unique_id"],keep='first').value_counts()

False    96096
True      3345
Name: count, dtype: int64

> This check shows duplicated values, but as described on Kaggle they belong to customers who purchased again. In this dataset every order generates a different `customer_id`, so a returning customer keeps a single `customer_unique_id` across several `customer_id` values. These observations will matter when we build the dimensional model in the `refined` layer.
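
> To make the relationship between the two identifiers concrete, here is a minimal sketch on made-up data (the ids `c1`, `u1`, and so on are hypothetical). It counts how many `customer_id` values each `customer_unique_id` owns, which is one way to list the returning customers.

```python
import pandas as pd

# Toy data: customer "u1" placed two orders, so it appears under two customer_id values.
toy = pd.DataFrame(
    {
        "customer_id": ["c1", "c2", "c3"],
        "customer_unique_id": ["u1", "u1", "u2"],
    }
)

# customer_id is unique per row, because one is generated for each order.
assert toy["customer_id"].is_unique

# The number of distinct customer_id values per customer_unique_id counts the orders.
orders_per_customer = toy.groupby("customer_unique_id")["customer_id"].nunique()

# Customers with more than one customer_id are the ones who purchased again.
repeat_buyers = orders_per_customer[orders_per_customer > 1]
print(repeat_buyers)
```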

#

> First, I built a dataset that keeps each `customer_unique_id` only once, so that these values can be mapped to an id that is easier for analysts to read. A small integer key should also make analytical queries in our data warehouse cheaper than a 32-character hash.

In [5]:
ds_customeres_without_id = ds_customers.drop_duplicates(subset=['customer_unique_id'], keep='first')
ds_customeres_without_id = ds_customeres_without_id.drop(columns="customer_id")
ds_customeres_without_id

Unnamed: 0,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
0,861eff4711a542e4b93843c6dd7febb0,14409,franca,SP
1,290c77bc529b7ac935b93aa66c333dc3,9790,sao bernardo do campo,SP
2,060e732b5b29e8181a18229c7b0b2b5e,1151,sao paulo,SP
3,259dac757896d24d7702b9acbbff3f3c,8775,mogi das cruzes,SP
4,345ecd01c38d18a9036ed96c73b8d066,13056,campinas,SP
...,...,...,...,...
99436,1a29b476fee25c95fbafc67c5ac95cf8,3937,sao paulo,SP
99437,d52a67c98be1cf6a5c84435bd38d095d,6764,taboao da serra,SP
99438,e9f50caf99f032f0bf3c55141f019d99,60115,fortaleza,CE
99439,73c2643a0a458b49f58cea58833b192e,92120,canoas,RS


#

#### Standardizing the Indexes

> Next, I numbered the rows in the traditional way, from 1 to n.

In [6]:
ds_customeres_without_id['client_id'] = range(1, len(ds_customeres_without_id) + 1)
ds_customeres_without_id

Unnamed: 0,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state,client_id
0,861eff4711a542e4b93843c6dd7febb0,14409,franca,SP,1
1,290c77bc529b7ac935b93aa66c333dc3,9790,sao bernardo do campo,SP,2
2,060e732b5b29e8181a18229c7b0b2b5e,1151,sao paulo,SP,3
3,259dac757896d24d7702b9acbbff3f3c,8775,mogi das cruzes,SP,4
4,345ecd01c38d18a9036ed96c73b8d066,13056,campinas,SP,5
...,...,...,...,...,...
99436,1a29b476fee25c95fbafc67c5ac95cf8,3937,sao paulo,SP,96092
99437,d52a67c98be1cf6a5c84435bd38d095d,6764,taboao da serra,SP,96093
99438,e9f50caf99f032f0bf3c55141f019d99,60115,fortaleza,CE,96094
99439,73c2643a0a458b49f58cea58833b192e,92120,canoas,RS,96095


> Then I built the mapping from each `customer_unique_id` to its new id with the `set_index` function.

In [7]:
client_id_map = ds_customeres_without_id.set_index('customer_unique_id')['client_id']

In [8]:
client_id_map

customer_unique_id
861eff4711a542e4b93843c6dd7febb0        1
290c77bc529b7ac935b93aa66c333dc3        2
060e732b5b29e8181a18229c7b0b2b5e        3
259dac757896d24d7702b9acbbff3f3c        4
345ecd01c38d18a9036ed96c73b8d066        5
                                    ...  
1a29b476fee25c95fbafc67c5ac95cf8    96092
d52a67c98be1cf6a5c84435bd38d095d    96093
e9f50caf99f032f0bf3c55141f019d99    96094
73c2643a0a458b49f58cea58833b192e    96095
84732c5050c01db9b23e19ba39899398    96096
Name: client_id, Length: 96096, dtype: int64

In [9]:
ds_customers['customer_unique_id'] = ds_customers['customer_unique_id'].map(client_id_map)
ds_customers

Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
0,06b8999e2fba1a1fbc88172c00ba8bc7,1,14409,franca,SP
1,18955e83d337fd6b2def6b18a428ac77,2,9790,sao bernardo do campo,SP
2,4e7b3e00288586ebd08712fdd0374a03,3,1151,sao paulo,SP
3,b2b6027bc5c5109e529d4dc6358b12c3,4,8775,mogi das cruzes,SP
4,4f2d8ab171c80ec8364f7c12e35b23ad,5,13056,campinas,SP
...,...,...,...,...,...
99436,17ddf5dd5d51696bb3d7c6291687be6f,96092,3937,sao paulo,SP
99437,e7b71a9017aa05c9a7fd292d714858e8,96093,6764,taboao da serra,SP
99438,5e28dfe12db7fb50a4b2f691faecea5e,96094,60115,fortaleza,CE
99439,56b18e2166679b8a959d72dd06da27f9,96095,92120,canoas,RS
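
> As a cross-check of the steps above, the sketch below repeats them on made-up data and compares the result with `pd.factorize`, which returns 0-based codes in order of first appearance. This is an alternative, not what the notebook uses, and the toy ids are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "customer_id": ["c1", "c2", "c3", "c4"],
        "customer_unique_id": ["u1", "u2", "u1", "u3"],
    }
)

# Approach used in this notebook: deduplicate, number the rows from 1 to n, then map.
unique_rows = toy.drop_duplicates(subset=["customer_unique_id"], keep="first").copy()
unique_rows["client_id"] = range(1, len(unique_rows) + 1)
client_id_map = unique_rows.set_index("customer_unique_id")["client_id"]
mapped = toy["customer_unique_id"].map(client_id_map)

# Alternative: factorize assigns 0-based codes in order of first appearance,
# so adding 1 gives the same numbering in a single step.
codes, uniques = pd.factorize(toy["customer_unique_id"])
factorized = pd.Series(codes + 1, index=toy.index)

print(mapped.tolist())
print(factorized.tolist())
```

> Both approaches give a repeated `customer_unique_id` the same integer every time it appears, which is the property the `orders` mapping relies on.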


> With these changes made, the `orders` dataset must be updated as well, because it references customers by `customer_id`. All we need to do is replace the old `customer_id` with the new `customer_unique_id`.

> To do this, we repeat the mapping process used to replace `customer_unique_id`.

In [10]:
customer_id_map = ds_customers.set_index('customer_id')['customer_unique_id']
customer_id_map

customer_id
06b8999e2fba1a1fbc88172c00ba8bc7        1
18955e83d337fd6b2def6b18a428ac77        2
4e7b3e00288586ebd08712fdd0374a03        3
b2b6027bc5c5109e529d4dc6358b12c3        4
4f2d8ab171c80ec8364f7c12e35b23ad        5
                                    ...  
17ddf5dd5d51696bb3d7c6291687be6f    96092
e7b71a9017aa05c9a7fd292d714858e8    96093
5e28dfe12db7fb50a4b2f691faecea5e    96094
56b18e2166679b8a959d72dd06da27f9    96095
274fa6071e5e17fe303b9748641082c8    96096
Name: customer_unique_id, Length: 99441, dtype: int64

In [11]:
ds_orders['customer_id'] = ds_orders['customer_id'].map(customer_id_map)
ds_orders

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,68585,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00
1,53cdb2fc8bc7dce0b6741e2150273451,74977,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00
2,47770eb9100c2d0c44946d9cf07ec65d,555,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00
3,949d5b44dbf5de918fe9c16f97b45f8a,59790,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15 00:00:00
4,ad21c59c0840e6cb83a9ceb5573f8159,65715,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26 00:00:00
...,...,...,...,...,...,...,...,...
99436,9c5dedf39a927c1b2549525ed64a053c,59296,delivered,2017-03-09 09:54:05,2017-03-09 09:54:05,2017-03-10 11:18:03,2017-03-17 15:08:01,2017-03-28 00:00:00
99437,63943bddc261676b46f01ca7ac2f7bd8,76301,delivered,2018-02-06 12:58:58,2018-02-06 13:10:37,2018-02-07 23:22:42,2018-02-28 17:37:56,2018-03-02 00:00:00
99438,83c1379a015df1e13d02aae0204711ab,19749,delivered,2017-08-27 14:46:43,2017-08-27 15:04:16,2017-08-28 20:52:26,2017-09-21 11:24:17,2017-09-27 00:00:00
99439,11c177c8e97725db2631073c19f07b62,16808,delivered,2018-01-08 21:28:27,2018-01-08 21:36:21,2018-01-12 15:35:03,2018-01-25 23:32:54,2018-02-15 00:00:00
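
> `Series.map` returns a missing value for any key that is absent from the mapping, so it can be worth checking for unmatched orders after a step like this one. The sketch below uses made-up data with one deliberately unknown `customer_id` (`c9`). All names and values in it are hypothetical.

```python
import pandas as pd

customers = pd.DataFrame(
    {
        "customer_id": ["c1", "c2", "c3"],
        "customer_unique_id": [1, 1, 2],
    }
)
orders = pd.DataFrame(
    {
        "order_id": ["o1", "o2", "o3"],
        "customer_id": ["c2", "c3", "c9"],  # "c9" does not exist in customers
    }
)

customer_id_map = customers.set_index("customer_id")["customer_unique_id"]

# Keys that are absent from the mapping come back as NaN.
mapped = orders["customer_id"].map(customer_id_map)

# Orders whose customer could not be resolved.
unmatched = orders.loc[mapped.isna(), "order_id"].tolist()
print(unmatched)
```

> In this notebook the check would find nothing, since the later `ds_orders.info()` output reports 99441 non-null values for `customer_id`.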


#

To finish:

- Drop the `customer_id` column from the `customers` dataset.
- Rename the `customer_unique_id` column to `customer_id`, making it our new index.
- Remove the duplicates.
- Convert the column types of the `orders` and `customers` dataframes to the dtypes pandas infers with the `convert_dtypes()` function.

In [12]:
ds_customers.drop(columns=['customer_id'], inplace=True)
ds_customers.rename(columns={'customer_unique_id':'customer_id'}, inplace=True)
ds_customers.drop_duplicates(subset=['customer_id'], keep='first', inplace=True)
ds_customers = ds_customers.convert_dtypes()
ds_orders = ds_orders.convert_dtypes()

In [13]:
ds_customers.info()

<class 'pandas.core.frame.DataFrame'>
Index: 96096 entries, 0 to 99440
Data columns (total 4 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   customer_id               96096 non-null  Int64 
 1   customer_zip_code_prefix  96096 non-null  Int64 
 2   customer_city             96096 non-null  string
 3   customer_state            96096 non-null  string
dtypes: Int64(2), string(2)
memory usage: 3.8 MB


In [14]:
ds_customers

Unnamed: 0,customer_id,customer_zip_code_prefix,customer_city,customer_state
0,1,14409,franca,SP
1,2,9790,sao bernardo do campo,SP
2,3,1151,sao paulo,SP
3,4,8775,mogi das cruzes,SP
4,5,13056,campinas,SP
...,...,...,...,...
99436,96092,3937,sao paulo,SP
99437,96093,6764,taboao da serra,SP
99438,96094,60115,fortaleza,CE
99439,96095,92120,canoas,RS


In [15]:
ds_orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   order_id                       99441 non-null  string
 1   customer_id                    99441 non-null  Int64 
 2   order_status                   99441 non-null  string
 3   order_purchase_timestamp       99441 non-null  string
 4   order_approved_at              99281 non-null  string
 5   order_delivered_carrier_date   97658 non-null  string
 6   order_delivered_customer_date  96476 non-null  string
 7   order_estimated_delivery_date  99441 non-null  string
dtypes: Int64(1), string(7)
memory usage: 6.2 MB
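
> The `info()` output above lists the timestamp columns as `string`: `convert_dtypes()` picks nullable dtypes but does not parse dates. If a later layer needs real datetimes, the sketch below shows one possible extra step with `pd.to_datetime` on made-up rows. The choice of columns is an assumption for illustration and is not part of this notebook's pipeline.

```python
import pandas as pd

toy_orders = pd.DataFrame(
    {
        "customer_id": [68585, 74977],
        "order_status": ["delivered", "delivered"],
        "order_purchase_timestamp": ["2017-10-02 10:56:33", "2018-07-24 20:41:37"],
        "order_approved_at": ["2017-10-02 11:07:15", None],
    }
)

# convert_dtypes switches to nullable dtypes but leaves date-like text as text.
converted = toy_orders.convert_dtypes()
print(converted.dtypes)

# Parsing dates is a separate, explicit step; missing values become NaT.
date_columns = ["order_purchase_timestamp", "order_approved_at"]  # illustrative subset
parsed = converted.copy()
for column in date_columns:
    parsed[column] = pd.to_datetime(parsed[column])

print(parsed.dtypes)
```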


In [16]:
ds_orders

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,68585,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00
1,53cdb2fc8bc7dce0b6741e2150273451,74977,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00
2,47770eb9100c2d0c44946d9cf07ec65d,555,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00
3,949d5b44dbf5de918fe9c16f97b45f8a,59790,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15 00:00:00
4,ad21c59c0840e6cb83a9ceb5573f8159,65715,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26 00:00:00
...,...,...,...,...,...,...,...,...
99436,9c5dedf39a927c1b2549525ed64a053c,59296,delivered,2017-03-09 09:54:05,2017-03-09 09:54:05,2017-03-10 11:18:03,2017-03-17 15:08:01,2017-03-28 00:00:00
99437,63943bddc261676b46f01ca7ac2f7bd8,76301,delivered,2018-02-06 12:58:58,2018-02-06 13:10:37,2018-02-07 23:22:42,2018-02-28 17:37:56,2018-03-02 00:00:00
99438,83c1379a015df1e13d02aae0204711ab,19749,delivered,2017-08-27 14:46:43,2017-08-27 15:04:16,2017-08-28 20:52:26,2017-09-21 11:24:17,2017-09-27 00:00:00
99439,11c177c8e97725db2631073c19f07b62,16808,delivered,2018-01-08 21:28:27,2018-01-08 21:36:21,2018-01-12 15:35:03,2018-01-25 23:32:54,2018-02-15 00:00:00


#

#### Loading the Customer dataset into the trusted layer

> Now that the customer dataset carries the table's new index and the `orders` dataset carries the `customer_id` that actually adds value to our future analysis, we can load both into the `trusted` layer.

In [17]:
ds_customers.to_csv(PATH_TRUSTED + 'customers_trusted.csv', index=False)
ds_orders.to_csv(PATH_TRUSTED + 'orders_trusted.csv', index=False)
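
> After writing the two files, a quick round-trip check can confirm that they still agree with each other. The sketch below is a minimal version on made-up data. It writes to in-memory buffers instead of `PATH_TRUSTED` so it has no side effects, and it only tests two properties: the customer key is unique, and every order points at an existing customer.

```python
import io

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "customer_state": ["SP", "CE"]})
orders = pd.DataFrame({"order_id": ["o1", "o2", "o3"], "customer_id": [1, 1, 2]})

# Write to in-memory buffers (stand-ins for the trusted CSV files).
customers_buffer, orders_buffer = io.StringIO(), io.StringIO()
customers.to_csv(customers_buffer, index=False)
orders.to_csv(orders_buffer, index=False)

# Read the data back, as a downstream consumer would.
customers_back = pd.read_csv(io.StringIO(customers_buffer.getvalue()))
orders_back = pd.read_csv(io.StringIO(orders_buffer.getvalue()))

# The customer key must be unique, and every order must reference a known customer.
key_is_unique = customers_back["customer_id"].is_unique
orders_resolve = orders_back["customer_id"].isin(customers_back["customer_id"]).all()

print(key_is_unique, bool(orders_resolve))
```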

#