## Dataset Products

> This notebook contains the analyses and changes needed to make the Products dataframe reliable for dimensional modeling, along with comments on some of the decisions made.

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)

PATH_RAW = '../input/raw/'
PATH_TRUSTED = '../output/pd/trusted/'

ds_products = pd.read_csv(PATH_RAW + 'olist_products_dataset.csv')
ds_itens = pd.read_csv(PATH_RAW + 'olist_order_items_dataset.csv')

#### Analysis of the Products dataset

In [2]:
ds_products.tail(2)

Unnamed: 0,product_id,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm
32949,83808703fc0706a22e264b9d75f04a2e,informatica_acessorios,60.0,156.0,2.0,700.0,31.0,13.0,20.0
32950,106392145fca363410d287a815be6de4,cama_mesa_banho,58.0,309.0,1.0,2083.0,12.0,2.0,7.0


**Analysis of the Variables/Columns**

- **product_category_name**: Qualitative nominal variable
- **product_name_lenght**: Quantitative discrete variable (character count)
- **product_photos_qty**: Quantitative discrete variable (photo count)
- **product_weight_g**: Quantitative continuous variable
- **product_length_cm**: Quantitative continuous variable
- **product_height_cm**: Quantitative continuous variable
- **product_width_cm**: Quantitative continuous variable
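
As a rough cross-check on the variable types, the dtypes inferred by pandas can be used to separate numeric columns from the rest. The sketch below runs on a small hypothetical frame (`toy`, with illustrative values only) rather than on the real dataset, and a numeric dtype is only a hint that a variable is quantitative, not a proof.

```python
import pandas as pd

# Hypothetical toy frame mirroring a few Products columns (values are illustrative).
toy = pd.DataFrame({
    "product_category_name": ["perfumaria", "artes", None],
    "product_photos_qty": [1.0, 4.0, None],
    "product_weight_g": [225.0, 1000.0, 154.0],
})

# Numeric columns are candidates for quantitative variables.
quantitative = toy.select_dtypes(include="number").columns.tolist()
# Everything else (here, text) is a candidate for qualitative variables.
qualitative = toy.select_dtypes(exclude="number").columns.tolist()

print(quantitative)
print(qualitative)
```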

In [3]:
ds_products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32951 entries, 0 to 32950
Data columns (total 9 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   product_id                  32951 non-null  object 
 1   product_category_name       32341 non-null  object 
 2   product_name_lenght         32341 non-null  float64
 3   product_description_lenght  32341 non-null  float64
 4   product_photos_qty          32341 non-null  float64
 5   product_weight_g            32949 non-null  float64
 6   product_length_cm           32949 non-null  float64
 7   product_height_cm           32949 non-null  float64
 8   product_width_cm            32949 non-null  float64
dtypes: float64(7), object(2)
memory usage: 2.3+ MB



#### Checking for Null Values

In [4]:
ds_products.isnull().sum()

product_id                      0
product_category_name         610
product_name_lenght           610
product_description_lenght    610
product_photos_qty            610
product_weight_g                2
product_length_cm               2
product_height_cm               2
product_width_cm                2
dtype: int64

In [5]:
ds_products.value_counts('product_category_name')

product_category_name
cama_mesa_banho                  3029
esporte_lazer                    2867
moveis_decoracao                 2657
beleza_saude                     2444
utilidades_domesticas            2335
                                 ... 
casa_conforto_2                     5
fashion_roupa_infanto_juvenil       5
pc_gamer                            3
seguros_e_servicos                  2
cds_dvds_musicais                   1
Name: count, Length: 73, dtype: int64

In [6]:
ds_products[ds_products['product_name_lenght'].isnull()].head(5)

Unnamed: 0,product_id,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm
105,a41e356c76fab66334f36de622ecbd3a,,,,,650.0,17.0,14.0,12.0
128,d8dee61c2034d6d075997acef1870e9b,,,,,300.0,16.0,7.0,20.0
145,56139431d72cd51f19eb9f7dae4d1617,,,,,200.0,20.0,20.0,20.0
154,46b48281eb6d663ced748f324108c733,,,,,18500.0,41.0,30.0,41.0
197,5fb61f482620cb672f5e586bb132eae9,,,,,300.0,35.0,7.0,12.0


Here I chose to replace the following null fields rather than drop the rows, using the values and justifications below:

- **product_category_name** -> `outros` ("others") -> A product with a null category still exists; it simply does not fit any other category.
- **product_name_lenght** -> 0 -> A product without a name can still be listed.
- **product_description_lenght** -> 0 -> A product without a description can still be listed.
- **product_photos_qty** -> 0 -> A product without photos can still be listed; the seller may choose to communicate directly with the customer to provide more information.

In [7]:
ds_products.fillna({'product_category_name': 'outros', 'product_name_lenght': 0, 'product_description_lenght': 0, 'product_photos_qty': 0}, inplace=True)

In [8]:
ds_products.isnull().sum()

product_id                    0
product_category_name         0
product_name_lenght           0
product_description_lenght    0
product_photos_qty            0
product_weight_g              2
product_length_cm             2
product_height_cm             2
product_width_cm              2
dtype: int64

> Now, instead of filling the null values, I chose to **remove these two remaining rows: they are a tiny amount of data compared to the total size of our dataset (32951), and these variables are also very important for, e.g., freight calculation, taxes on the circulation of goods, etc.**
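
To make "a tiny amount of data" concrete, the share of affected rows can be computed before dropping them. This is a minimal sketch on a hypothetical toy frame; `physical_cols` and the values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical subset of the physical-attribute columns.
physical_cols = ["product_weight_g", "product_length_cm"]

# Toy data: one of four rows is missing its physical attributes.
toy = pd.DataFrame({
    "product_weight_g": [225.0, None, 154.0, 371.0],
    "product_length_cm": [16.0, None, 18.0, 26.0],
})

# Rows with at least one missing physical attribute.
missing_rows = int(toy[physical_cols].isnull().any(axis=1).sum())
missing_share = missing_rows / len(toy)

# Drop only the rows that are missing one of those attributes.
cleaned = toy.dropna(subset=physical_cols)

print(f"{missing_rows} of {len(toy)} rows ({missing_share:.2%}) would be dropped")
```

On the real dataset the same calculation would give 2 / 32951, roughly 0.006% of the rows.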

In [9]:
ds_products = ds_products.dropna(subset=['product_weight_g', 'product_length_cm', 'product_height_cm', 'product_width_cm'])

In [10]:
ds_products.isnull().sum()

product_id                    0
product_category_name         0
product_name_lenght           0
product_description_lenght    0
product_photos_qty            0
product_weight_g              0
product_length_cm             0
product_height_cm             0
product_width_cm              0
dtype: int64


#### Duplicate Analysis

In [11]:
ds_products.duplicated(subset=['product_id']).value_counts()

False    32949
Name: count, dtype: int64
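
The output above contains only `False`, meaning no `product_id` is repeated. For reference, here is a minimal sketch on hypothetical toy data of the same check, and of how duplicates could be removed if any showed up; keeping the first occurrence is an assumption, not a decision made in this notebook.

```python
import pandas as pd

# Toy data with one repeated key (illustrative only).
toy = pd.DataFrame({"product_id": ["aaa", "bbb", "aaa"]})

# Number of rows whose product_id already appeared earlier.
dup_count = int(toy.duplicated(subset=["product_id"]).sum())

# Quick boolean check on the key column.
is_unique = toy["product_id"].is_unique

# Keep the first occurrence of each product_id.
deduped = toy.drop_duplicates(subset=["product_id"], keep="first")
```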


#### Standardizing the Indexes

> Here I need to apply the same logic used in the other notebooks with `orders`, except that here we only change `order_itens`, since it is the one that holds our `product_id`.

In [12]:
ds_products['product_id_int'] = range(1, len(ds_products) + 1)

In [13]:
ds_products

Unnamed: 0,product_id,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm,product_id_int
0,1e9e8ef04dbcff4541ed26657ea517e5,perfumaria,40.0,287.0,1.0,225.0,16.0,10.0,14.0,1
1,3aa071139cb16b67ca9e5dea641aaa2f,artes,44.0,276.0,1.0,1000.0,30.0,18.0,20.0,2
2,96bd76ec8810374ed1b65e291975717f,esporte_lazer,46.0,250.0,1.0,154.0,18.0,9.0,15.0,3
3,cef67bcfe19066a932b7673e239eb23d,bebes,27.0,261.0,1.0,371.0,26.0,4.0,26.0,4
4,9dc1a7de274444849c219cff195d0b71,utilidades_domesticas,37.0,402.0,4.0,625.0,20.0,17.0,13.0,5
...,...,...,...,...,...,...,...,...,...,...
32946,a0b7d5a992ccda646f2d34e418fff5a0,moveis_decoracao,45.0,67.0,2.0,12300.0,40.0,40.0,40.0,32945
32947,bf4538d88321d0fd4412a93c974510e6,construcao_ferramentas_iluminacao,41.0,971.0,1.0,1700.0,16.0,19.0,16.0,32946
32948,9a7c6041fa9592d9d9ef6cfe62a71f8c,cama_mesa_banho,50.0,799.0,1.0,1400.0,27.0,7.0,27.0,32947
32949,83808703fc0706a22e264b9d75f04a2e,informatica_acessorios,60.0,156.0,2.0,700.0,31.0,13.0,20.0,32948


In [14]:
product_id_map = ds_products.set_index('product_id')['product_id_int']
product_id_map

product_id
1e9e8ef04dbcff4541ed26657ea517e5        1
3aa071139cb16b67ca9e5dea641aaa2f        2
96bd76ec8810374ed1b65e291975717f        3
cef67bcfe19066a932b7673e239eb23d        4
9dc1a7de274444849c219cff195d0b71        5
                                    ...  
a0b7d5a992ccda646f2d34e418fff5a0    32945
bf4538d88321d0fd4412a93c974510e6    32946
9a7c6041fa9592d9d9ef6cfe62a71f8c    32947
83808703fc0706a22e264b9d75f04a2e    32948
106392145fca363410d287a815be6de4    32949
Name: product_id_int, Length: 32949, dtype: int64

In [15]:
ds_itens['product_id'] = ds_itens['product_id'].map(product_id_map)

In [16]:
ds_itens.head(2)

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,1,25864.0,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.9,13.29
1,00018f77f2f0320c557190d7a144bdd3,1,27229.0,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,239.9,19.93
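
The `product_id` values in the output above are floats (`25864.0`). This is what `Series.map` produces when at least one key has no match in the mapping and is filled with `NaN`. Order items that reference one of the two products dropped earlier are a plausible cause, although this notebook does not verify it. A minimal sketch on hypothetical toy data showing how unmatched keys could be counted:

```python
import pandas as pd

# Toy products table with a sequential surrogate key (illustrative only).
products = pd.DataFrame({"product_id": ["aaa", "bbb"]})
products["product_id_int"] = range(1, len(products) + 1)
id_map = products.set_index("product_id")["product_id_int"]

# Toy order items; "zzz" has no corresponding product.
items = pd.DataFrame({
    "order_item_id": [1, 1, 2],
    "product_id": ["bbb", "zzz", "aaa"],
})

# Unmatched keys become NaN, which forces the column to float64.
mapped = items["product_id"].map(id_map)
unmatched = int(mapped.isnull().sum())

# The nullable Int64 dtype keeps integer keys while still allowing missing values.
mapped_int = mapped.astype("Int64")

print(f"{unmatched} order item(s) reference an unknown product")
```

Whether such rows should be kept with a missing key or removed depends on how the fact table is meant to treat orphaned items, so that choice is left open here.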


In [17]:
ds_products.drop(columns=['product_id'], inplace=True)
ds_products.rename(columns={'product_id_int':'product_id'}, inplace=True)
ds_products.head(2)

Unnamed: 0,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm,product_id
0,perfumaria,40.0,287.0,1.0,225.0,16.0,10.0,14.0,1
1,artes,44.0,276.0,1.0,1000.0,30.0,18.0,20.0,2



#### Loading the Product dataset into the trusted layer

> Now that our product dataset has its new table index and its inconsistencies have been handled, we can load these values into our `trusted` layer.

In [18]:
ds_products = ds_products.convert_dtypes()
ds_itens = ds_itens.convert_dtypes()

ds_products.to_csv(PATH_TRUSTED + 'products_trusted.csv', index=False)
ds_itens.to_csv(PATH_TRUSTED + 'itens_trusted.csv', index=False)
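
`to_csv` does not create missing directories, so the write fails if the target folder does not exist yet. The sketch below creates the folder first and reads the file back as a quick check. It uses a temporary directory and a hypothetical toy frame instead of the real `PATH_TRUSTED` and datasets.

```python
import os
import tempfile

import pandas as pd

# Temporary stand-in for the trusted layer directory (assumption for this sketch).
base_dir = tempfile.mkdtemp()
trusted_dir = os.path.join(base_dir, "trusted")
os.makedirs(trusted_dir, exist_ok=True)  # no error if the folder already exists

# Toy frame in place of the cleaned products data (illustrative values).
toy = pd.DataFrame({
    "product_id": [1, 2],
    "product_weight_g": [225.0, 1000.0],
}).convert_dtypes()

out_path = os.path.join(trusted_dir, "products_trusted.csv")
toy.to_csv(out_path, index=False)

# Read the file back to confirm rows and columns survived the round trip.
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```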