# Introduction

Olist is an e-commerce marketplace in Brazil. This project is based on roughly 100,000 orders placed between 2016 and 2018 across multiple marketplaces in Brazil, with product, customer and review information, among many other columns. The original data was provided as multiple datasets, which can be found [here]().

For convenience, I have combined the CSV files into one Excel workbook. If you prefer to work with that, you can find it [here]().

The table schema can be found [here]().

This project is divided into two parts:
1. An e-commerce analysis that evaluates business trends and KPIs.
2. A custom dashboard summarizing the key findings. The dashboard can be connected to a real-time data source to serve as a business KPI dashboard.

In [22]:
# Import required libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import streamlit as st
import plotly

In [23]:
# The datasets were combined into one Excel file with multiple sheets
# Load the workbook
xlsx = pd.ExcelFile('olist_store_dataset.xlsx', engine='openpyxl')

In [24]:
# list of sheets containing the datasets
xlsx.sheet_names

['customers_data',
 'geolocation_data',
 'order_items_data',
 'order_payments_data',
 'order_reviews_data',
 'orders_data',
 'products_data',
 'sellers_data',
 'product_categories_data']

<img src='HRhd2Y0.png' alt='Table schema' width='700px'>

## Let's combine the tables using the schema provided

In [32]:
order_reviews_df = pd.read_excel(xlsx, sheet_name='order_reviews_data')
order_reviews_df.head()

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
0,7bc2406110b926393aa56f80a40eba40,73fc7af87114b39712e6da79b0a377eb,4,,,2018-01-18,2018-01-18 21:46:59
1,80e641a11e56f04c1ad469d5645fdfde,a548910a1c6147796b98fdf73dbeba33,5,,,2018-03-10,2018-03-11 03:05:13
2,228ce5500dc1d8e020d8d1322874b6f0,f9e4b658b201a9f2ecdecbb34bed034b,5,,,2018-02-17,2018-02-18 14:36:24
3,e64fb393e7b32834bb789ff8bb30750e,658677c97b385a9be170737859d3511b,5,,Recebi bem antes do prazo estipulado.,2017-04-21,2017-04-21 22:02:06
4,f7c4243c7fe1938f181bec41a392bdeb,8e6bfb81e283fa7e4f11123a3fb894f1,5,,Parabéns lojas lannister adorei comprar pela ...,2018-03-01,2018-03-02 10:26:53


In [36]:
# orders_df is loaded in the next cell (In [33]); run that cell first
print(order_reviews_df.shape)
print(orders_df.shape)

(99224, 7)
(99441, 8)


In [33]:
orders_df = pd.read_excel(xlsx, sheet_name='orders_data')
orders_df.head()

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26


In [None]:
# Merge orders_df and order_reviews_df into an intermediate table 'orders',
# using the common column 'order_id' given in the schema

orders = pd.merge(orders_df, order_reviews_df, on='order_id')
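
The merge above uses the default `how='inner'`, so only `order_id` values present in both tables are kept. Since the two tables have different row counts, it may be worth checking which rows fail to match before relying on the merged table. Here is a minimal sketch on toy data; the frames `orders_toy` and `reviews_toy` and their values are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Toy stand-ins for orders_df and order_reviews_df (hypothetical values)
orders_toy = pd.DataFrame({
    'order_id': ['o1', 'o2', 'o3'],
    'order_status': ['delivered', 'shipped', 'delivered'],
})
reviews_toy = pd.DataFrame({
    'order_id': ['o1', 'o3', 'o3'],   # 'o3' has two reviews, 'o2' has none
    'review_score': [5, 4, 1],
})

# how='outer' keeps unmatched rows from both sides;
# indicator=True adds a '_merge' column saying where each row came from
checked = pd.merge(orders_toy, reviews_toy, on='order_id',
                   how='outer', indicator=True)

# Orders that would be dropped by an inner merge
unmatched_orders = checked.loc[checked['_merge'] == 'left_only', 'order_id'].tolist()

print(checked['_merge'].value_counts())
print(unmatched_orders)
```

The same check could be run on `orders_df` and `order_reviews_df` directly; whether to keep unmatched orders (for example with `how='left'`) depends on the analysis.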

In [None]:
order_payments_df = pd.read_excel(xlsx, sheet_name='order_payments_data')
order_payments_df.head()

In [None]:
order_payments_df.duplicated().any()

In [None]:
order_payments_df.info()

In [None]:
# Load each sheet into its own dataframe, one at a time.
# This avoids reading the whole workbook into a dictionary of dataframes at once (sheet_name=None).
customers_df = pd.read_excel(xlsx, sheet_name='customers_data')
customers_df.head()

In [None]:
customers_df.info()

In [None]:
customers_df.duplicated().any()

In [None]:
order_items_df = pd.read_excel(xlsx, sheet_name='order_items_data')
order_items_df.head()

In [None]:
order_items_df.info()

In [None]:
order_items_df.duplicated().any()

In [28]:
products_df = pd.read_excel(xlsx, sheet_name='products_data')
products_df.head(10)

Unnamed: 0,product_id,product_category_name,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm
0,1e9e8ef04dbcff4541ed26657ea517e5,perfumaria,40.0,287.0,1.0,225.0,16.0,10.0,14.0
1,3aa071139cb16b67ca9e5dea641aaa2f,artes,44.0,276.0,1.0,1000.0,30.0,18.0,20.0
2,96bd76ec8810374ed1b65e291975717f,esporte_lazer,46.0,250.0,1.0,154.0,18.0,9.0,15.0
3,cef67bcfe19066a932b7673e239eb23d,bebes,27.0,261.0,1.0,371.0,26.0,4.0,26.0
4,9dc1a7de274444849c219cff195d0b71,utilidades_domesticas,37.0,402.0,4.0,625.0,20.0,17.0,13.0
5,41d3672d4792049fa1779bb35283ed13,instrumentos_musicais,60.0,745.0,1.0,200.0,38.0,5.0,11.0
6,732bd381ad09e530fe0a5f457d81becb,cool_stuff,56.0,1272.0,4.0,18350.0,70.0,24.0,44.0
7,2548af3e6e77a690cf3eb6368e9ab61e,moveis_decoracao,56.0,184.0,2.0,900.0,40.0,8.0,40.0
8,37cc742be07708b53a98702e77a21a02,eletrodomesticos,57.0,163.0,1.0,400.0,27.0,13.0,17.0
9,8c92109888e8cdf9d66dc7e463025574,brinquedos,36.0,1156.0,1.0,600.0,17.0,10.0,12.0


In [None]:
sellers_df = pd.read_excel(xlsx, sheet_name='sellers_data')
sellers_df.head()

In [None]:
geolocation_df = pd.read_excel(xlsx, sheet_name='geolocation_data')
geolocation_df.head()

In [None]:
geolocation_df.info()

In [None]:
geolocation_df.geolocation_state.unique()

### Issues (geolocation_data)
- Wrong datatype: geolocation_zip_code_prefix, geolocation_state, geolocation_city
- Inconsistent zip code prefix format: 4 digits in some rows, 5 digits in others
- Non-ASCII characters: geolocation_city
- Inconsistent spellings, e.g. 'são paulo' vs. 'sao paulo', 'getúlio vargas' vs. 'getulio vargas', etc.
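
One possible way to reconcile the accented and unaccented spellings is to strip the accent marks so both variants collapse to the same string. This is only a sketch on made-up rows: the helper `strip_accents` and the column `geolocation_city_clean` are hypothetical names, and dropping accents is an assumption about the desired canonical form (keeping the accented spelling would be equally valid).

```python
import unicodedata

import pandas as pd


def strip_accents(text: str) -> str:
    """Remove combining accent marks, e.g. 'são' becomes 'sao'."""
    # NFKD splits an accented letter into base letter + combining mark
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical sample mirroring the spellings listed above
geo_toy = pd.DataFrame({
    'geolocation_city': ['são paulo', 'sao paulo',
                         'getúlio vargas', 'getulio vargas'],
})

geo_toy['geolocation_city_clean'] = geo_toy['geolocation_city'].map(strip_accents)

# Number of distinct city names before and after cleaning
n_before = geo_toy['geolocation_city'].nunique()
n_after = geo_toy['geolocation_city_clean'].nunique()
print(n_before, n_after)
```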

### Issues (order_items_data)
- Wrong datatype: order_item_id

### Issues (order_payments_data)
- Wrong datatype: payment_type, payment_sequential, payment_installments
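
The notes above do not say which target types are intended. As a rough sketch, assuming `payment_type` should be a category and the two counters should be small integers, the conversion could look like this (the frame `payments_toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical rows with the same column names as order_payments_data
payments_toy = pd.DataFrame({
    'payment_sequential': [1, 1, 2],
    'payment_type': ['credit_card', 'boleto', 'voucher'],
    'payment_installments': [8, 1, 1],
})

# astype accepts a {column: dtype} mapping, so all columns convert in one call
payments_toy = payments_toy.astype({
    'payment_type': 'category',        # assumed target type
    'payment_sequential': 'int16',     # assumed target type
    'payment_installments': 'int16',   # assumed target type
})

print(payments_toy.dtypes)
```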

In [31]:
product_categories_df = pd.read_excel(xlsx, sheet_name='product_categories_data')
product_categories_df.head()

Unnamed: 0,product_category_name,product_category_name_english
0,beleza_saude,health_beauty
1,informatica_acessorios,computers_accessories
2,automotivo,auto
3,cama_mesa_banho,bed_bath_table
4,moveis_decoracao,furniture_decor


### Issues (customers_data)
- Wrong datatype: customer_city, customer_state, customer_zip_code_prefix
- Inconsistent zip code prefix format: 4 digits in some rows, 5 digits in others
- City names are in lower case
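
A possible cleanup for these issues is sketched below on made-up rows. It assumes the 4-digit prefixes lost a leading zero when the column was read as integers, so padding to 5 characters restores them; that assumption should be checked against the data. The frame `customers_toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical rows with the same column names as customers_data
customers_toy = pd.DataFrame({
    'customer_zip_code_prefix': [1409, 14409, 9790],
    'customer_city': ['sao paulo', 'franca', 'sao bernardo do campo'],
    'customer_state': ['SP', 'SP', 'SP'],
})

# Store the prefix as text and left-pad with zeros to a fixed width of 5
customers_toy['customer_zip_code_prefix'] = (
    customers_toy['customer_zip_code_prefix'].astype(str).str.zfill(5)
)

# Capitalize the first letter of each word in the city name
customers_toy['customer_city'] = customers_toy['customer_city'].str.title()

# States come from a small fixed set, so a category dtype is a reasonable fit
customers_toy['customer_state'] = customers_toy['customer_state'].astype('category')

print(customers_toy)
```

Note that `str.title()` also capitalizes short connecting words such as "do", which may or may not be the desired display form.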

In [37]:
# Two toy dataframes with different lengths and no shared columns
d1 = {
    'col1': ['a1', 'b1', 'c1', 'd1'],
    'col2': ['a2', 'b2', 'c2', 'd2'],
}

d2 = {
    'col3': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
    'col4': ['y1', 'y2', 'y3', 'y4', 'y5', 'y6'],
}

df1 = pd.DataFrame(data=d1)

df2 = pd.DataFrame(data=d2)

In [38]:
df1

Unnamed: 0,col1,col2
0,a1,a2
1,b1,b2
2,c1,c2
3,d1,d2


In [39]:
df2

Unnamed: 0,col3,col4
0,x1,y1
1,x2,y2
2,x3,y3
3,x4,y4
4,x5,y5
5,x6,y6


In [40]:
# concat with axis=1 aligns on the index and keeps all rows from both frames
df3 = pd.concat([df1, df2], axis=1)
df3

Unnamed: 0,col1,col2,col3,col4
0,a1,a2,x1,y1
1,b1,b2,x2,y2
2,c1,c2,x3,y3
3,d1,d2,x4,y4
4,,,x5,y5
5,,,x6,y6


In [41]:
# join defaults to a left join on the index, so only df1's four rows are kept
df4 = df1.join(df2)
df4

Unnamed: 0,col1,col2,col3,col4
0,a1,a2,x1,y1
1,b1,b2,x2,y2
2,c1,c2,x3,y3
3,d1,d2,x4,y4


In [46]:
# Two toy dataframes that share the column 'col1'
d3 = {
    'col1': ['a1', 'b1', 'c1', 'd1'],
    'col2': ['a2', 'b2', 'c2', 'd2'],
}

d4 = {
    'col1': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6'],
    'col3': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
    'col4': ['y1', 'y2', 'y3', 'y4', 'y5', 'y6'],
}

df5 = pd.DataFrame(data=d3)

df6 = pd.DataFrame(data=d4)

In [47]:
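# With no `on` argument, merge does an inner join on the shared column 'col1';
# only the value 'a1' appears in both frames
print(pd.merge(left=df5, right=df6))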

  col1 col2 col3 col4
0   a1   a2   x1   y1
