# Introduction

This project aims to clean, analyze and visualize the Olist e-commerce dataset. It is divided into two parts:
1. E-commerce metrics analysis: evaluating the trends and KPIs of the business over time.
2. A KPI dashboard: a custom web app with interactive visuals and pagination. The dashboard can be connected to a database to serve as a real-time KPI dashboard.


## About the Dataset

Olist is an ecommerce marketplace in Brazil. This project is based on 100,000 sales orders made from multiple marketplaces in Brazil, from 2016 to 2018, with product, customer and review information, among other variables. The original data was provided in multiple datasets, and can be found [here](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce?resource=download).

For convenience, I have combined the CSV files into one Excel workbook. The combined workbook was used for this project, and can be found [here](https://drive.google.com/drive/u/1/folders/1mDL1BQMHqTRYyLMshGwaOkvjJ3nKX8rl). The same folder also contains a detailed description of each dataset.

The table schema can be found [here](https://i.imgur.com/HRhd2Y0.png).


### Attention

An order might have multiple items.

Each item might be fulfilled by a distinct seller.

In [1]:
# Import required libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import streamlit as st
import plotly

In [2]:
# The datasets were combined into one Excel file with multiple sheets.
# Open the workbook once so that each sheet can be read from it later.
xlsx = pd.ExcelFile('olist_store_dataset.xlsx', engine='openpyxl')

In [3]:
# list of sheets containing the datasets
xlsx.sheet_names

['customers_data',
 'geolocation_data',
 'order_items_data',
 'order_payments_data',
 'order_reviews_data',
 'orders_data',
 'products_data',
 'sellers_data',
 'product_categories_data']

<img src='schema.png' alt='Table schema' width='700px'>

## Let's combine the tables using the schema provided

In [4]:
order_reviews_df = pd.read_excel(xlsx, sheet_name='order_reviews_data')
order_reviews_df.head()

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18,2018-01-18 21:46:59
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10,2018-03-11 03:05:13
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17,2018-02-18 14:36:24
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21,2017-04-21 22:02:06
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela ...,2018-03-01,2018-03-02 10:26:53


In [33]:
order_reviews_df.columns

Index(['review_id', 'order_id', 'review_score', 'review_comment_title',
       'review_comment_message', 'review_creation_date',
       'review_answer_timestamp'],
      dtype='object')

In [80]:
order_items_df.columns

Index(['order_id', 'order_item_id', 'product_id', 'seller_id',
       'shipping_limit_date', 'price', 'freight_value'],
      dtype='object')

In [37]:
orders_df.order_id.duplicated().any()

False

In [39]:
order_reviews_df.order_id.duplicated().any()

True

In [49]:
order_reviews_df.duplicated().any()

False

In [67]:
# Count distinct values per column among reviews whose order_id occurs more than once
order_reviews_df[order_reviews_df.order_id.duplicated(keep=False)].nunique()

review_id                  904
order_id                   547
review_score                 5
review_comment_title        17
review_comment_message     302
review_creation_date       324
review_answer_timestamp    904
dtype: int64

In [69]:
order_reviews_df[order_reviews_df.review_id.duplicated(keep=False)]

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
200,28642ce6250b94cc72bc85960aec6c62,e239d280236cdd3c40cb2c033f681d1c,5,,,2018-03-25,2018-03-25 21:03:02
344,a0a641414ff718ca079b3967ef5c2495,169d7e0fd71d624d306f132acd791cbe,5,,,2018-03-04,2018-03-06 20:12:53
346,f4d74b17cd63ee35efa82cd2567de911,f269e83a82f64baa3de97c2ebf3358f6,3,,"A embalagem deixou a desejar, por pouco o prod...",2018-01-12,2018-01-13 18:46:10
360,ecbaf1fce7d2c09bfab46f89065afeaf,2451b9756f310d4cff5c7987b393870d,5,,,2017-07-27,2017-07-28 16:57:18
393,6b1de94de0f4bd84dfc4136818242faa,92acf87839903a94aeca0e5040d99acb,5,,,2018-02-16,2018-02-19 19:04:21
...,...,...,...,...,...,...,...
99108,2c6c08892b83ba4c1be33037c2842294,42ae1967f68c90bb325783ac55d761ce,4,,"Chegou um pouco amassada, mas nada de mais, e ...",2017-07-03,2017-07-05 19:06:59
99124,6ec93e77f444e0b1703740a69122e35d,e1fdc6e9d1ca132377e862593a7c0bd4,5,,Vendedor compromisso do vou o cliente,2017-10-07,2017-10-07 19:47:11
99164,2afe63a67dfd99b3038f568fb47ee761,c5334d330e36d2a810a7a13c72e135ee,5,,"Muito bom, produto conforme anunciado, entrega...",2018-03-03,2018-03-04 22:56:47
99167,017808d29fd1f942d97e50184dfb4c13,b1461c8882153b5fe68307c46a506e39,5,,,2018-03-02,2018-03-05 01:43:30


In [70]:
order_reviews_df2 = order_reviews_df.copy()

In [72]:
order_reviews_df2.drop_duplicates(subset=['review_id'])

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18,2018-01-18 21:46:59
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10,2018-03-11 03:05:13
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17,2018-02-18 14:36:24
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21,2017-04-21 22:02:06
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela ...,2018-03-01,2018-03-02 10:26:53
...,...,...,...,...,...,...,...
99219,574ed12dd733e5fa530cfd4bbf39d7c9,2a8c23fee101d4d5662fa670396eb8da,5,,,2018-07-07,2018-07-14 17:18:30
99220,f3897127253a9592a73be9bdfdf4ed7a,22ec9f0669f784db00fa86d035cf8602,5,,,2017-12-09,2017-12-11 20:06:42
99221,b3de70c89b1510c4cd3d0649fd302472,55d4004744368f5571d1f590031933e4,5,,"Excelente mochila, entrega super rápida. Supe...",2018-03-22,2018-03-23 09:10:43
99222,1adeb9d84d72fe4e337617733eb85149,7725825d039fc1f0ceb7635e3f7d9206,4,,,2018-07-01,2018-07-02 12:59:13
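
Note that `drop_duplicates` returns a new DataFrame and leaves the original untouched, so the result has to be assigned to be kept. The sketch below shows this on a toy table (the values are made up), together with one possible rule for reducing the reviews to a single row per order: keeping the most recently answered review. That rule is only an assumption for illustration, not a decision made in this notebook.

```python
import pandas as pd

# Toy stand-in for order_reviews_df (hypothetical values).
toy_reviews = pd.DataFrame({
    "review_id": ["r1", "r1", "r2", "r3"],
    "order_id": ["o1", "o2", "o3", "o3"],
    "review_score": [5, 5, 2, 4],
    "review_answer_timestamp": pd.to_datetime(
        ["2018-03-02", "2018-03-02", "2018-01-10", "2018-01-20"]
    ),
})

# drop_duplicates returns a new frame; toy_reviews itself is unchanged.
unique_reviews = toy_reviews.drop_duplicates(subset=["review_id"])

# Assumed rule: keep only the most recently answered review for each order.
latest_per_order = (
    toy_reviews.sort_values("review_answer_timestamp")
    .drop_duplicates(subset=["order_id"], keep="last")
)

print(len(toy_reviews), len(unique_reviews), len(latest_per_order))
```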


In [74]:
orders_df.order_id.duplicated().any()

False

In [9]:
print(orders_df.order_id.shape)
print(order_reviews_df.order_id.shape)

(99441,)
(99224,)


In [10]:
print(orders_df.order_id.nunique())
print(order_reviews_df.order_id.nunique())

99441
98673


In [7]:
order_reviews_df.order_id.duplicated().any()

True

In [8]:
orders_df = pd.read_excel(xlsx, sheet_name='orders_data')
orders_df.head()

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26


In [32]:
orders_df.columns

Index(['order_id', 'customer_id', 'order_status', 'order_purchase_timestamp',
       'order_approved_at', 'order_delivered_carrier_date',
       'order_delivered_customer_date', 'order_estimated_delivery_date'],
      dtype='object')

In [11]:
# Let's merge orders_df and order_reviews_df into an intermediate table 'orders',
# using the common column 'order_id' given in the schema.

orders = pd.merge(orders_df, order_reviews_df, on='order_id', how='left')
orders

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,review_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18,a54f0611adc9ed256b57ede6b6eb5114,4.0,,"Não testei o produto ainda, mas ele veio corr...",2017-10-11,2017-10-12 03:43:48
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13,8d5266042046a06655c8db133d120ba5,4.0,Muito boa a loja,Muito bom o produto.,2018-08-08,2018-08-08 18:37:50
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04,e73b67b67587f7644d5bd1a52deb1b01,5.0,,,2018-08-18,2018-08-22 19:07:58
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15,359d03e676b3c069f62cadba8dd3f6e8,5.0,,O produto foi exatamente o que eu esperava e e...,2017-12-03,2017-12-05 19:21:58
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26,e50934924e227544ba8246aeb3770dd4,5.0,,,2018-02-17,2018-02-18 13:02:51
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99987,9c5dedf39a927c1b2549525ed64a053c,39bd1228ee8140590ac3aca26f2dfe00,delivered,2017-03-09 09:54:05,2017-03-09 09:54:05,2017-03-10 11:18:03,2017-03-17 15:08:01,2017-03-28,e262b3f92d1ce917aa412a9406cf61a6,5.0,,,2017-03-22,2017-03-23 11:02:08
99988,63943bddc261676b46f01ca7ac2f7bd8,1fca14ff2861355f6e5f14306ff977a7,delivered,2018-02-06 12:58:58,2018-02-06 13:10:37,2018-02-07 23:22:42,2018-02-28 17:37:56,2018-03-02,29bb71b2760d0f876dfa178a76bc4734,4.0,,So uma peça que veio rachado mas tudo bem rs,2018-03-01,2018-03-02 17:50:01
99989,83c1379a015df1e13d02aae0204711ab,1aa71eb042121263aafbe80c1b562c9c,delivered,2017-08-27 14:46:43,2017-08-27 15:04:16,2017-08-28 20:52:26,2017-09-21 11:24:17,2017-09-27,371579771219f6db2d830d50805977bb,5.0,,Foi entregue antes do prazo.,2017-09-22,2017-09-22 23:10:57
99990,11c177c8e97725db2631073c19f07b62,b331b74b18dc79bcdf6532d51e1637c1,delivered,2018-01-08 21:28:27,2018-01-08 21:36:21,2018-01-12 15:35:03,2018-01-25 23:32:54,2018-02-15,8ab6855b9fe9b812cd03a480a25058a1,2.0,,Foi entregue somente 1. Quero saber do outro p...,2018-01-26,2018-01-27 09:16:56
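
The index of the merged table runs well past the 99,441 rows of `orders_df`, because an order with several reviews appears once per review. Orders without any review are kept by the left merge with empty review fields, which is most likely why `review_score` now displays as a float. The sketch below reproduces both effects on toy tables (hypothetical values) and uses the `indicator` option of `pd.merge` to flag the unreviewed orders.

```python
import pandas as pd

# Toy stand-ins for orders_df and order_reviews_df (hypothetical values).
toy_orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "order_status": ["delivered", "delivered", "shipped"],
})
toy_reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "order_id": ["o1", "o1", "o2"],
    "review_score": [5, 4, 3],
})

# indicator=True adds a '_merge' column saying where each row came from.
merged = pd.merge(toy_orders, toy_reviews, on="order_id", how="left", indicator=True)

# Orders with several reviews are repeated, once per review.
extra_rows = len(merged) - len(toy_orders)

# Orders with no review are flagged 'left_only' and have NaN review fields.
unreviewed = merged.loc[merged["_merge"] == "left_only", "order_id"].tolist()

print(len(merged), extra_rows, unreviewed)
```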


In [None]:

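# join() matches the left 'order_id' column against the right frame's index,
# so order_reviews_df is indexed by 'order_id' first.
orders_1 = orders_df.join(order_reviews_df.set_index('order_id'), on='order_id', how='left')
orders_1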

In [12]:
order_payments_df = pd.read_excel(xlsx, sheet_name='order_payments_data')
order_payments_df.head()

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value
0,b81ef226f3fe1789b1e8b2acac839d17,1,credit_card,8,99.33
1,a9810da82917af2d9aefd1278f1dcfa0,1,credit_card,1,24.39
2,25e8ea4e93396b6fa0d3dd708e76c1bd,1,credit_card,1,65.71
3,ba78997921bbcdc1373bb41e913ab953,1,credit_card,8,107.78
4,42fdf880ba16b47b59251dd489d4441a,1,credit_card,2,128.45


In [13]:
order_payments_df.duplicated().any()

False

In [14]:
order_payments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103886 entries, 0 to 103885
Data columns (total 5 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   order_id              103886 non-null  object 
 1   payment_sequential    103886 non-null  int64  
 2   payment_type          103886 non-null  object 
 3   payment_installments  103886 non-null  int64  
 4   payment_value         103886 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 4.0+ MB


In [15]:
# Let's load each sheet into its own DataFrame.
# For performance, the sheets are read one at a time instead of loading the whole workbook as a dictionary of DataFrames.
customers_df = pd.read_excel(xlsx, sheet_name='customers_data')
customers_df.head()

Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
0,06b8999e2fba1a1fbc88172c00ba8bc7,861eff4711a542e4b93843c6dd7febb0,14409,franca,SP
1,18955e83d337fd6b2def6b18a428ac77,290c77bc529b7ac935b93aa66c333dc3,9790,sao bernardo do campo,SP
2,4e7b3e00288586ebd08712fdd0374a03,060e732b5b29e8181a18229c7b0b2b5e,1151,sao paulo,SP
3,b2b6027bc5c5109e529d4dc6358b12c3,259dac757896d24d7702b9acbbff3f3c,8775,mogi das cruzes,SP
4,4f2d8ab171c80ec8364f7c12e35b23ad,345ecd01c38d18a9036ed96c73b8d066,13056,campinas,SP


In [16]:
customers_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   customer_id               99441 non-null  object
 1   customer_unique_id        99441 non-null  object
 2   customer_zip_code_prefix  99441 non-null  int64 
 3   customer_city             99441 non-null  object
 4   customer_state            99441 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.8+ MB


In [17]:
customers_df.duplicated().any()

False

In [75]:
order_items_df = pd.read_excel(xlsx, sheet_name='order_items_data')
order_items_df

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.90,13.29
1,00018f77f2f0320c557190d7a144bdd3,1,e5f2d52b802189ee658865ca93d83a8f,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,239.90,19.93
2,000229ec398224ef6ca0657da4fc703e,1,c777355d18b72b67abbeef9df44fd0fd,5b51032eddd242adc84c38acab88f23d,2018-01-18 14:48:30,199.00,17.87
3,00024acbcdf0a6daa1e931b038114c75,1,7634da152a4610f1595efa32f14722fc,9d7a1d34a5052409006425275ba1c2b4,2018-08-15 10:10:18,12.99,12.79
4,00042b26cf59d7ce69dfabb4e55b4fd9,1,ac6c3623068f30de03045865e4e10089,df560393f3a51e74553ab94004ba5c87,2017-02-13 13:57:51,199.90,18.14
...,...,...,...,...,...,...,...
112645,fffc94f6ce00a00581880bf54a75a037,1,4aa6014eceb682077f9dc4bffebc05b0,b8bc237ba3788b23da09c0f1f3a3288c,2018-05-02 04:11:01,299.99,43.41
112646,fffcd46ef2263f404302a634eb57f7eb,1,32e07fd915822b0765e448c4dd74c828,f3c38ab652836d21de61fb8314b69182,2018-07-20 04:31:48,350.00,36.53
112647,fffce4705a9662cd70adb13d4a31832d,1,72a30483855e2eafc67aee5dc2560482,c3cfdc648177fdbbbb35635a37472c53,2017-10-30 17:14:25,99.90,16.95
112648,fffe18544ffabc95dfada21779c9644f,1,9c422a519119dcad7575db5af1ba540e,2b3e4a2a3ea8e01938cabda2a3e5cc79,2017-08-21 00:04:32,55.99,8.72


In [79]:
order_items_df.order_id.duplicated().sum()

13984
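
The 13,984 repeated `order_id` values are expected, since an order can contain several items. If one row per order is needed before merging with the orders table, the items can be aggregated first. Below is a minimal sketch on a toy table (hypothetical values); the summary column names `n_items`, `n_sellers`, `total_price` and `total_freight` are my own and do not come from the dataset.

```python
import pandas as pd

# Toy stand-in for order_items_df (hypothetical values).
toy_items = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o3", "o3", "o3"],
    "order_item_id": [1, 2, 1, 1, 2, 3],
    "seller_id": ["s1", "s2", "s1", "s3", "s3", "s3"],
    "price": [10.0, 20.0, 5.0, 1.0, 2.0, 3.0],
    "freight_value": [1.0, 2.0, 0.5, 0.1, 0.2, 0.3],
})

# Same check as above: how many rows repeat an earlier order_id.
repeated = toy_items["order_id"].duplicated().sum()

# Collapse the items to one summary row per order.
order_totals = toy_items.groupby("order_id", as_index=False).agg(
    n_items=("order_item_id", "count"),
    n_sellers=("seller_id", "nunique"),
    total_price=("price", "sum"),
    total_freight=("freight_value", "sum"),
)

print(repeated)
print(order_totals)
```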

In [19]:
order_items_df.sample(15)

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
36324,525ec9a4d6f8d45b2d22977017fe7849,1,a64fc2f46aef1e68dc0bdc591c84832c,c864036feaab8c1659f65ea4faebe1da,2017-08-14 23:24:05,139.9,15.73
90860,ce65464cf5fd5392097c6eb4a9a7918c,1,fe6a9515d655fa7936b8a7c841039f34,dc317f341ab0e22f39acbd9dbf9b4a1f,2017-12-01 11:31:29,249.9,60.3
45848,6811f439ad8873c28f8e1138e02229a5,1,cd9895374edea6a67749d331b0b32070,8ae520247981aa06bc94abddf5f46d34,2017-12-14 11:31:27,649.0,20.98
84559,c0248cae5b63c3be878360d970ee5ace,1,b0961721fd839e9982420e807758a2a6,1f50f920176fa81dab994f9023523100,2017-11-30 20:33:53,49.0,35.93
39804,5a8cbc60b032f344459aae65f0c01d26,1,95bc2131ece03edd6623b82c80db02bb,29fe9f200d3fa0c668d2aa1ec7e08dfb,2017-07-26 15:15:19,80.0,34.36
97243,dc9b66f791e6bc0d3a44e2c513a1d117,1,68bf40b3abd5ffc25981c25df9ed9087,3d871de0142ce09b7081e2b9d1733cb1,2018-07-31 08:05:12,79.0,34.79
102881,e9a21226714589b1cb43d10c0a8cfb15,2,c10d842a54be2035918ad421a74d46c2,a17f621c590ea0fab3d5d883e1630ec6,2017-04-18 17:10:16,17.33,10.96
47707,6c682aa3032d2639e12a64b3c2a907b7,2,3606696d19bbcad6cb6c0e985749862f,d624126b9206f595fb3fbb6ba03b28a8,2018-02-16 12:05:31,36.9,16.6
109473,f8d19a0283152c2a14277f39a74da971,1,75d6b6963340c6063f7f4cfcccfe6a30,cc419e0650a3c5ba77189a1882b7556a,2017-10-13 12:49:21,56.99,15.15
43967,63f1ad378cef16f74bb7480a3a01e3b4,1,3cb0ece3f5f0b8121a53635c9f783aa5,e6a69c4a27dfdd98ffe5aa757ad744bc,2017-12-19 15:10:40,21.65,15.11


In [20]:
order_items_df.shape

(112650, 7)

In [21]:
order_items_df.product_id.shape

(112650,)

In [22]:
order_items_df.order_id.duplicated().any()

True

In [23]:
order_items_df.isna().any()

order_id               False
order_item_id          False
product_id             False
seller_id              False
shipping_limit_date    False
price                  False
freight_value          False
dtype: bool

In [24]:
order_items_df.duplicated().any()

False

In [25]:
order_items_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112650 entries, 0 to 112649
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   order_id             112650 non-null  object        
 1   order_item_id        112650 non-null  int64         
 2   product_id           112650 non-null  object        
 3   seller_id            112650 non-null  object        
 4   shipping_limit_date  112650 non-null  datetime64[ns]
 5   price                112650 non-null  float64       
 6   freight_value        112650 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(1), object(3)
memory usage: 6.0+ MB


In [26]:
products_df = pd.read_excel(xlsx, sheet_name='products_data')
products_df.head(10)

Unnamed: 0,product_id,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm
0,1e9e8ef04dbcff4541ed26657ea517e5,perfumaria,40.0,287.0,1.0,225.0,16.0,10.0,14.0
1,3aa071139cb16b67ca9e5dea641aaa2f,artes,44.0,276.0,1.0,1000.0,30.0,18.0,20.0
2,96bd76ec8810374ed1b65e291975717f,esporte_lazer,46.0,250.0,1.0,154.0,18.0,9.0,15.0
3,cef67bcfe19066a932b7673e239eb23d,bebes,27.0,261.0,1.0,371.0,26.0,4.0,26.0
4,9dc1a7de274444849c219cff195d0b71,utilidades_domesticas,37.0,402.0,4.0,625.0,20.0,17.0,13.0
5,41d3672d4792049fa1779bb35283ed13,instrumentos_musicais,60.0,745.0,1.0,200.0,38.0,5.0,11.0
6,732bd381ad09e530fe0a5f457d81becb,cool_stuff,56.0,1272.0,4.0,18350.0,70.0,24.0,44.0
7,2548af3e6e77a690cf3eb6368e9ab61e,moveis_decoracao,56.0,184.0,2.0,900.0,40.0,8.0,40.0
8,37cc742be07708b53a98702e77a21a02,eletrodomesticos,57.0,163.0,1.0,400.0,27.0,13.0,17.0
9,8c92109888e8cdf9d66dc7e463025574,brinquedos,36.0,1156.0,1.0,600.0,17.0,10.0,12.0


In [27]:
sellers_df = pd.read_excel(xlsx, sheet_name='sellers_data')
sellers_df.head()

Unnamed: 0,seller_id,seller_zip_code_prefix,seller_city,seller_state
0,3442f8959a84dea7ee197c632cb2df15,13023,campinas,SP
1,d1b65fc7debc3361ea86b5f14c68d2e2,13844,mogi guacu,SP
2,ce3ad9de960102d0677a81f5d0bb7b2d,20031,rio de janeiro,RJ
3,c0f3eea2e14555b6faeea3dd58c1b1c3,4195,sao paulo,SP
4,51a04a8a6bdcb23deccc82b0b80742cf,12914,braganca paulista,SP


In [28]:
geolocation_df = pd.read_excel(xlsx, sheet_name='geolocation_data')
geolocation_df.head()

Unnamed: 0,geolocation_zip_code_prefix,geolocation_lat,geolocation_lng,geolocation_city,geolocation_state
0,1037,-23.545621,-46.639292,sao paulo,SP
1,1046,-23.546081,-46.64482,sao paulo,SP
2,1046,-23.546129,-46.642951,sao paulo,SP
3,1041,-23.544392,-46.639499,sao paulo,SP
4,1035,-23.541578,-46.641607,sao paulo,SP


In [29]:
geolocation_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000163 entries, 0 to 1000162
Data columns (total 5 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   geolocation_zip_code_prefix  1000163 non-null  int64  
 1   geolocation_lat              1000163 non-null  float64
 2   geolocation_lng              1000163 non-null  float64
 3   geolocation_city             1000163 non-null  object 
 4   geolocation_state            1000163 non-null  object 
dtypes: float64(2), int64(1), object(2)
memory usage: 38.2+ MB


In [30]:
geolocation_df.geolocation_state.unique()

array(['SP', 'RN', 'AC', 'RJ', 'ES', 'MG', 'BA', 'SE', 'PE', 'AL', 'PB',
       'CE', 'PI', 'MA', 'PA', 'AP', 'AM', 'RR', 'DF', 'GO', 'RO', 'TO',
       'MT', 'MS', 'RS', 'PR', 'SC'], dtype=object)

### Issues: geolocation_data
- wrong datatype: geolocation_zip_code_prefix, geolocation_state, geolocation_city
- Inconsistent zip code prefix format: sometimes 4 digits, sometimes 5
- non-ASCII characters: geolocation_city
- Inconsistent spelling of city names, e.g. 'são paulo' vs 'sao paulo', 'getúlio vargas' vs 'getulio vargas', etc.
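
One way to reconcile the accented and unaccented spellings is to strip the accents so that both variants collapse to the same string. The sketch below does this with Python's standard `unicodedata` module; `strip_accents` is a hypothetical helper name, and whether the accents should be dropped or restored is a choice this notebook has not made yet.

```python
import unicodedata

import pandas as pd


def strip_accents(text: str) -> str:
    """Remove combining accent marks, e.g. 'são paulo' -> 'sao paulo'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Toy stand-in for geolocation_df.geolocation_city, using the examples above.
toy_cities = pd.Series(["são paulo", "sao paulo", "getúlio vargas", "getulio vargas"])
normalized = toy_cities.map(strip_accents)

print(normalized.unique())
```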

### Issues: order_items_data
- wrong datatype: order_item_id

### Issues: order_payments_data
- wrong datatype: payment_type, payment_sequential, payment_installments
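
A sketch of how such datatype issues could be fixed in one step with `DataFrame.astype`. The target types are assumptions on my part: `category` for the small set of payment types, and a compact integer type for the two counters. The toy values are made up.

```python
import pandas as pd

# Toy stand-in for order_payments_df (hypothetical values).
toy_payments = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "payment_sequential": [1, 1, 2],
    "payment_type": ["credit_card", "voucher", "credit_card"],
    "payment_installments": [8, 1, 2],
    "payment_value": [99.33, 24.39, 65.71],
})

# Assumed target types; adjust them to whatever the analysis actually needs.
converted = toy_payments.astype({
    "payment_type": "category",
    "payment_sequential": "int8",
    "payment_installments": "int8",
})

print(converted.dtypes)
```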

In [31]:
product_categories_df = pd.read_excel(xlsx, sheet_name='product_categories_data')
product_categories_df.head()

Unnamed: 0,product_category_name,product_category_name_english
0,beleza_saude,health_beauty
1,informatica_acessorios,computers_accessories
2,automotivo,auto
3,cama_mesa_banho,bed_bath_table
4,moveis_decoracao,furniture_decor


### Issues: customers_data
- wrong datatype: customer_city, customer_state, customer_zip_code_prefix
- Inconsistent customer zip code prefix format: sometimes 4 digits, sometimes 5
- City names are in lower case
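
These three issues could be addressed roughly as follows. The sketch assumes that the prefixes are meant to be five characters long and that the 4-digit values simply lost a leading zero when read as integers; it also assumes title case is the desired format for city names. The toy rows reuse the first three customers shown above.

```python
import pandas as pd

# Toy stand-in for customers_df, using values from the sample output above.
toy_customers = pd.DataFrame({
    "customer_zip_code_prefix": [14409, 9790, 1151],
    "customer_city": ["franca", "sao bernardo do campo", "sao paulo"],
    "customer_state": ["SP", "SP", "SP"],
})

cleaned = toy_customers.assign(
    # Store the prefix as text and restore the assumed leading zero.
    customer_zip_code_prefix=toy_customers["customer_zip_code_prefix"]
    .astype(str)
    .str.zfill(5),
    # Capitalize each word of the city name.
    customer_city=toy_customers["customer_city"].str.title(),
    # States form a small fixed set, so a categorical type fits.
    customer_state=toy_customers["customer_state"].astype("category"),
)

print(cleaned)
```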

In [132]:
d1 = {
    'col1': ['a1', 'b1', 'c1', 'd1'],
    'col2': ['a2', 'b2', 'c2', 'd2'],
}
    
d2 = {
    'col3': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
    'col4': ['y1', 'y2', 'y3', 'y4', 'y5', 'y6']
}

d6 = {
    'col3': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'],
    'col6': ['z1', 'z2', 'z3', 'z4', 'z5', 'z6', 'z7', np.nan]
}
    
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data = d2)
df6 = pd.DataFrame(data=d6)

In [133]:
df1

Unnamed: 0,col1,col2
0,a1,a2
1,b1,b2
2,c1,c2
3,d1,d2


In [134]:
df2

Unnamed: 0,col3,col4
0,x1,y1
1,x2,y2
2,x3,y3
3,x4,y4
4,x5,y5
5,x6,y6


In [135]:
df6

Unnamed: 0,col3,col6
0,x1,z1
1,x2,z2
2,x3,z3
3,x4,z4
4,x5,z5
5,x6,z6
6,x7,z7
7,x8,


In [137]:
df3 = pd.merge(df2, df6, on='col3', how='inner')
df3

Unnamed: 0,col3,col4,col6
0,x1,y1,z1
1,x2,y2,z2
2,x3,y3,z3
3,x4,y4,z4
4,x5,y5,z5
5,x6,y6,z6


In [86]:
df3 = pd.merge(df2, df6, on='col3', how='outer')
df3

Unnamed: 0,col3,col4,col6
0,x1,y1,z1
1,x2,y2,z2
2,x3,y3,z3
3,x4,y4,z4
4,x5,y5,z5
5,x6,y6,z6
6,x7,,z7
7,x8,,


In [108]:
df3 = pd.merge(df2, df6, on='col3', how='left')
df3

Unnamed: 0,col3,col4,col6
0,x1,y1,z1
1,x2,y2,z2
2,x3,y3,z3
3,x4,y4,z4
4,x5,y5,z5
5,x6,y6,z6


In [104]:
df3 = pd.merge(df2, df6, on='col3', how='right')
df3

Unnamed: 0,col3,col4,col6
0,x1,y1,z1
1,x2,y2,z2
2,x3,y3,z3
3,x4,y4,z4
4,x5,y5,z5
5,x6,y6,z6
6,x7,,z7
7,x8,,


In [41]:
df4 = df1.join(df2)
df4

Unnamed: 0,col1,col2,col3,col4
0,a1,a2,x1,y1
1,b1,b2,x2,y2
2,c1,c2,x3,y3
3,d1,d2,x4,y4


In [46]:
d3 = {
    'col1': ['a1', 'b1', 'c1', 'd1'],
    'col2': ['a2', 'b2', 'c2', 'd2'],
}

d4 = {
    'col1': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6'],
    'col3': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
    'col4': ['y1', 'y2', 'y3', 'y4', 'y5', 'y6'],
}
    
df5 = pd.DataFrame(data=d3)

df6 = pd.DataFrame(data = d4)

In [47]:
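# With no `on` argument, merge uses the columns common to both frames (here 'col1')
# and defaults to an inner join, so only the matching key 'a1' survives.
print(pd.merge(left=df5, right=df6))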

  col1 col2 col3 col4
0   a1   a2   x1   y1


In [131]:
df1 = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])
df2 = pd.DataFrame([[1, 5], [7, 6]], columns=['A', 'C'])

pd.merge(df1, df2, how='left')

Unnamed: 0,A,B,C
0,1,3,5.0
1,2,4,


In [129]:
df1 = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])
df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])

df1.merge(df2, how='left')

Unnamed: 0,A,B,C
0,1,3,5.0
1,1,3,6.0
2,2,4,
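
The last example shows the behaviour behind the row growth seen earlier: a key that is duplicated on the right side repeats the matching left row. To catch this early, `merge` accepts a `validate` argument that raises an error when the keys do not have the expected relationship. A minimal sketch on the same toy frames (renamed `left` and `right` here):

```python
import pandas as pd

left = pd.DataFrame([[1, 3], [2, 4]], columns=["A", "B"])
right = pd.DataFrame([[1, 5], [1, 6]], columns=["A", "C"])

# Without validation, the duplicated key A=1 on the right silently yields two rows.
unchecked = left.merge(right, how="left")

# With validate="one_to_one", pandas checks that the keys are unique on both sides.
# It raises pandas.errors.MergeError, which is a subclass of ValueError.
try:
    left.merge(right, how="left", validate="one_to_one")
    duplicate_keys_detected = False
except ValueError:
    duplicate_keys_detected = True

print(len(unchecked), duplicate_keys_detected)
```

For the orders and reviews tables, `validate="one_to_many"` would state the expectation that `order_id` is unique in `orders_df` but may repeat in `order_reviews_df`.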
