# SQL for Data Engineering

## What is SQL?

Structured Query Language (SQL) is the standard declarative query language for relational databases. Many of SQL's original features were inspired by relational algebra.

## Business Understanding

**Food and Goods Deliveries in Brazil**

**What is Delivery Center**

With operational hubs spread across Brazil, Delivery Center is a platform that integrates retailers and marketplaces, creating a healthy ecosystem for selling goods (products) and food in Brazilian retail.

Our registry (catalog + menu) currently holds more than 900 thousand items. Thousands of orders and deliveries are handled every day by a network of thousands of retailers and partner drivers across every region of the country.

All of this generates more and more data at every moment!

As a result, our business is increasingly data driven, that is, it uses data to make decisions. Looking ahead, we know that using data intelligently may be our main competitive advantage in the market.

This is our context, and with it we propose a challenge in which you can apply your technical knowledge to solve the everyday problems of a data team.

https://www.kaggle.com/datasets/nosbielcs/brazilian-delivery-center

<br/>

**Dataset descriptions**

**channels**: This dataset contains information about the sales channels (marketplaces) where our retailers' goods and food are sold.

**deliveries**: This dataset contains information about the deliveries made by our partner drivers.

**drivers**: This dataset contains information about the partner drivers. They are based at our hubs, and every time an order is processed, they are the ones who deliver it to the consumer's home.

**hubs**: This dataset contains information about the Delivery Center hubs. Hubs are the order distribution centers, and deliveries depart from them.

**orders**: This dataset contains information about the sales processed through the Delivery Center platform.

**payments**: This dataset contains information about the payments made to Delivery Center.

**stores**: This dataset contains information about the retailers. They use the Delivery Center platform to sell their items (goods and/or food) on the marketplaces.

## Database Creation

In [0]:
%sql
CREATE DATABASE bronze

## Loading the Data

### Orders

In [0]:
df_orders = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/shared_uploads/contato@rodolfomoreira.com.br/orders-1.csv")
df_orders.display()

In [0]:
df_orders.write.format("delta").mode("append").saveAsTable("bronze.orders")

### Stores

In [0]:
df_stores = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/shared_uploads/contato@rodolfomoreira.com.br/stores-1.csv")
df_stores.display()

In [0]:
df_stores.write.format("delta").mode("append").saveAsTable("bronze.stores")

## SELECT, WHERE, JOIN, and LIMIT

### SELECT

In [0]:
%sql
SELECT order_id, order_status FROM bronze.orders

### WHERE

In [0]:
%sql
SELECT * FROM bronze.orders WHERE order_status = 'FINISHED'

In [0]:
%sql
SELECT DISTINCT order_status FROM bronze.orders

### LIMIT

In [0]:
%sql
SELECT * FROM bronze.orders LIMIT 5

### JOIN

In [0]:
%sql
SELECT * FROM bronze.orders LIMIT 5

In [0]:
%sql
SELECT * FROM bronze.stores LIMIT 5

In [0]:
%sql
SELECT
  orders.order_id,
  orders.order_amount,
  stores.store_name
FROM
  bronze.orders
INNER JOIN
  bronze.stores
ON
  stores.store_id = orders.store_id
WHERE 
  stores.store_name = 'CUMIURI'
ORDER BY orders.order_amount ASC
LIMIT 10
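
The join above can be tried on a toy dataset without a cluster. This is a minimal sketch, assuming Python's built-in `sqlite3` module as a stand-in for Spark SQL; the rows (including the store name `OTHER STORE` and all amounts) are made up for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the bronze schema (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id TEXT, store_id TEXT, order_amount REAL);
    CREATE TABLE stores (store_id TEXT, store_name TEXT);
    -- Made-up rows for illustration only.
    INSERT INTO stores VALUES ('1', 'CUMIURI'), ('2', 'OTHER STORE');
    INSERT INTO orders VALUES ('a', '1', 50.0), ('b', '1', 20.0),
                              ('c', '2', 99.0), ('d', '3', 10.0);
""")

# INNER JOIN keeps only orders whose store_id exists in stores;
# order 'd' (store 3) has no matching store and is dropped.
join_rows = conn.execute("""
    SELECT orders.order_id, orders.order_amount, stores.store_name
    FROM orders
    INNER JOIN stores ON stores.store_id = orders.store_id
    WHERE stores.store_name = 'CUMIURI'
    ORDER BY orders.order_amount ASC
    LIMIT 10
""").fetchall()

print(join_rows)
```

Only the two orders from the matching store come back, sorted from the smallest amount to the largest.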

## Basic Calculations and Aggregations

In [0]:
%sql
SELECT
  order_id,
  store_id,
  order_delivery_fee,
  order_delivery_cost,
  ROUND((order_delivery_fee + order_delivery_cost), 2) as total_cost
FROM
  bronze.orders
WHERE
  order_delivery_fee > 0
AND
  order_delivery_cost IS NOT NULL
LIMIT 5

In [0]:
%sql
SELECT * FROM bronze.orders LIMIT 1

In [0]:
%sql
SELECT
  store_id,
  SUM(order_amount) as TOTAL
FROM
  bronze.orders
WHERE
  store_id = '3512'
GROUP BY
  store_id

In [0]:
%sql
SELECT COUNT(DISTINCT(store_id)) FROM bronze.orders

## Duplicate Data

In [0]:
%sql
SELECT COUNT(*) FROM bronze.orders

In [0]:
orders2 = spark.sql("SELECT * FROM bronze.orders LIMIT 100")
display(orders2)

In [0]:
orders2.write.format("delta").mode("append").saveAsTable("bronze.orders")

In [0]:
%sql
SELECT COUNT(*) FROM bronze.orders

In [0]:
%sql
SELECT
  COUNT(*)
FROM (
  SELECT order_id, store_id, order_amount, COUNT(*) as records FROM bronze.orders GROUP BY order_id, store_id, order_amount
) a
WHERE a.records > 1
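
The same duplicate check can be written with `HAVING` instead of an outer query. The sketch below assumes Python's built-in `sqlite3` module as a stand-in for Spark SQL and uses made-up rows; it simulates an accidental second append and then compares the total row count with the distinct row count.

```python
import sqlite3

# In-memory SQLite database standing in for bronze.orders (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id TEXT, store_id TEXT, order_amount REAL);
    INSERT INTO orders VALUES ('a', '1', 50.0), ('b', '1', 20.0), ('c', '2', 99.0);
    -- Insert two existing rows again to simulate an accidental second append.
    INSERT INTO orders VALUES ('a', '1', 50.0), ('b', '1', 20.0);
""")

# Groups that occur more than once are duplicates.
duplicated_keys = conn.execute("""
    SELECT order_id, store_id, order_amount, COUNT(*) AS records
    FROM orders
    GROUP BY order_id, store_id, order_amount
    HAVING COUNT(*) > 1
    ORDER BY order_id
""").fetchall()

total_rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
distinct_rows = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM orders)"
).fetchone()[0]

print(duplicated_keys, total_rows, distinct_rows)
```

The gap between `total_rows` and `distinct_rows` is the number of rows that `SELECT DISTINCT *` would remove.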


In [0]:
%sql
SELECT DISTINCT * FROM bronze.orders

In [0]:
%sql
CREATE DATABASE prata

In [0]:
%sql
CREATE TABLE prata.orders AS (SELECT DISTINCT * FROM bronze.orders)

In [0]:
%sql
SELECT COUNT(*) FROM prata.orders

## Anomaly Detection

In [0]:
%sql
SELECT * FROM bronze.orders LIMIT 5

order_id,store_id,channel_id,payment_order_id,delivery_order_id,order_status,order_amount,order_delivery_fee,order_delivery_cost,order_created_hour,order_created_minute,order_created_day,order_created_month,order_created_year,order_moment_created,order_moment_accepted,order_moment_ready,order_moment_collected,order_moment_in_expedition,order_moment_delivering,order_moment_delivered,order_moment_finished,order_metric_collected_time,order_metric_paused_time,order_metric_production_time,order_metric_walking_time,order_metric_expediton_speed_time,order_metric_transit_time,order_metric_cycle_time
68405119,3512,5,68405119,68405119,CANCELED,62.7,0,,0,1,1,1,2021,1/1/2021 12:01:36 AM,,,,,,,,,,,,,,
68405123,3512,5,68405123,68405123,CANCELED,62.7,0,,0,4,1,1,2021,1/1/2021 12:04:26 AM,,,,,,,,,,,,,,
68405206,3512,5,68405206,68405206,CANCELED,115.5,0,,0,13,1,1,2021,1/1/2021 12:13:07 AM,,,,,,,,,,,,,,
68405465,3401,5,68405465,68405465,CANCELED,55.9,0,,0,19,1,1,2021,1/1/2021 12:19:15 AM,,,,,,,,,,,,,,
68406064,3401,5,68406064,68406064,CANCELED,37.9,0,,0,26,1,1,2021,1/1/2021 12:26:25 AM,,,,,,,,,,,,,,


In [0]:
%sql
SELECT
  ntile,
  min(order_amount) as limite_inferior,
  max(order_amount) as limite_superior,
  avg(order_amount) as media,
  count(order_id) as orders
FROM
  (
    SELECT order_id, CAST(order_amount AS FLOAT) AS order_amount,
    ntile(4) OVER (ORDER BY CAST(order_amount AS FLOAT)) AS ntile
    FROM bronze.orders
  ) a
GROUP BY ntile
ORDER BY ntile

ntile,limite_inferior,limite_superior,media,orders
1,0.0,39.9,29.004368292712275,92275
2,39.9,71.6,54.82960870458065,92275
3,71.6,121.9,94.19318005425664,92275
4,121.9,1788306.1,242.56040653478703,92274


In [0]:
%sql
SELECT * FROM bronze.orders WHERE CAST(order_amount AS FLOAT) > 100000

order_id,store_id,channel_id,payment_order_id,delivery_order_id,order_status,order_amount,order_delivery_fee,order_delivery_cost,order_created_hour,order_created_minute,order_created_day,order_created_month,order_created_year,order_moment_created,order_moment_accepted,order_moment_ready,order_moment_collected,order_moment_in_expedition,order_moment_delivering,order_moment_delivered,order_moment_finished,order_metric_collected_time,order_metric_paused_time,order_metric_production_time,order_metric_walking_time,order_metric_expediton_speed_time,order_metric_transit_time,order_metric_cycle_time
84504844,603,10,84504844,84504844,CANCELED,1788306.11,9.9,,16,35,18,3,2021,3/18/2021 4:35:11 PM,3/18/2021 4:35:14 PM,3/18/2021 4:45:20 PM,3/18/2021 4:50:30 PM,3/18/2021 4:55:04 PM,,,3/18/2021 4:59:30 PM,5.17,,10.15,9.73,,,24.33
93127697,1300,1,93127697,93127697,FINISHED,100000.11,0.0,5.0,16,32,30,4,2021,4/30/2021 4:32:13 PM,4/30/2021 4:32:24 PM,4/30/2021 4:32:20 PM,4/30/2021 4:40:14 PM,4/30/2021 4:43:16 PM,4/30/2021 5:14:00 PM,,4/30/2021 5:36:18 PM,7.9,30.73,0.12,10.93,41.67,22.3,64.08
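
The quartile table shows a maximum far above the rest of the top bucket, and the query then uses a hand-picked threshold. One common alternative, which is not what this notebook does, is the interquartile range (IQR) rule. The sketch below is plain Python with illustrative amounts (not real rows from the dataset), and the 1.5 multiplier is the usual convention rather than anything taken from the source.

```python
from statistics import quantiles

# Illustrative amounts only; the last value mimics the extreme order found above.
amounts = [29.0, 39.9, 54.8, 71.6, 94.2, 121.9, 242.5, 1788306.11]

# Quartile cut points (Q1, median, Q3).
q1, _, q3 = quantiles(amounts, n=4)
iqr = q3 - q1

# Values above Q3 + 1.5 * IQR are flagged as candidate anomalies.
upper_fence = q3 + 1.5 * iqr
outliers = [a for a in amounts if a > upper_fence]

print(outliers)
```

A rule like this only flags candidates; whether a flagged order is an error or a genuine large sale still needs a business check.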


## Data Cleaning with CASE

In [0]:
%sql
SELECT DISTINCT channel_id FROM bronze.orders

channel_id
7
15
11
29
3
30
34
8
28
35


In [0]:
%sql
SELECT
  order_id,
  store_id,
  order_amount,
  (
    CASE
      WHEN channel_id = "1" THEN "APP"
      WHEN channel_id = "10" THEN "SITE"
      ELSE "MARKET PLACE"
    END
  ) AS channel
FROM
  bronze.orders
WHERE
  channel_id IN ("1", "10", "11")

order_id,store_id,order_amount,channel
68434362,903,77.35,SITE
68435369,869,69.8,SITE
68453149,1018,85.32,SITE
68455353,1018,55.67,SITE
68475901,1018,80.42,SITE
68477798,1018,90.4,SITE
68512601,1018,107.77,SITE
68518953,1018,37.9,SITE
68527199,191,62.9,SITE
68549913,54,19.9,SITE
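
To see all three branches of the `CASE` expression at once, the query can be run on a few made-up rows. This sketch assumes Python's built-in `sqlite3` module as a stand-in for Spark SQL; the channel mapping is the one used in the query above, and string literals use single quotes because SQLite treats double quotes as identifiers.

```python
import sqlite3

# In-memory SQLite database standing in for bronze.orders (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id TEXT, channel_id TEXT);
    -- Made-up rows: one per branch, plus one filtered out by WHERE.
    INSERT INTO orders VALUES ('a', '1'), ('b', '10'), ('c', '11'), ('d', '5');
""")

labelled = conn.execute("""
    SELECT order_id,
           CASE
             WHEN channel_id = '1'  THEN 'APP'
             WHEN channel_id = '10' THEN 'SITE'
             ELSE 'MARKET PLACE'
           END AS channel
    FROM orders
    WHERE channel_id IN ('1', '10', '11')
    ORDER BY order_id
""").fetchall()

print(labelled)
```

Channel `11` passes the `WHERE` filter but matches no `WHEN` branch, so it falls through to `ELSE`.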


## Casting

In [0]:
%sql
SELECT * FROM bronze.orders LIMIT 5

order_id,store_id,channel_id,payment_order_id,delivery_order_id,order_status,order_amount,order_delivery_fee,order_delivery_cost,order_created_hour,order_created_minute,order_created_day,order_created_month,order_created_year,order_moment_created,order_moment_accepted,order_moment_ready,order_moment_collected,order_moment_in_expedition,order_moment_delivering,order_moment_delivered,order_moment_finished,order_metric_collected_time,order_metric_paused_time,order_metric_production_time,order_metric_walking_time,order_metric_expediton_speed_time,order_metric_transit_time,order_metric_cycle_time
68405119,3512,5,68405119,68405119,CANCELED,62.7,0,,0,1,1,1,2021,1/1/2021 12:01:36 AM,,,,,,,,,,,,,,
68405123,3512,5,68405123,68405123,CANCELED,62.7,0,,0,4,1,1,2021,1/1/2021 12:04:26 AM,,,,,,,,,,,,,,
68405206,3512,5,68405206,68405206,CANCELED,115.5,0,,0,13,1,1,2021,1/1/2021 12:13:07 AM,,,,,,,,,,,,,,
68405465,3401,5,68405465,68405465,CANCELED,55.9,0,,0,19,1,1,2021,1/1/2021 12:19:15 AM,,,,,,,,,,,,,,
68406064,3401,5,68406064,68406064,CANCELED,37.9,0,,0,26,1,1,2021,1/1/2021 12:26:25 AM,,,,,,,,,,,,,,


In [0]:
%sql
SELECT
  CAST(order_amount AS FLOAT)
FROM
  bronze.orders
LIMIT 5

order_amount
62.7
62.7
115.5
55.9
37.9


In [0]:
%sql
SELECT
  CAST(order_amount AS FLOAT) AS preco_total,
  "R$ 19,95" AS preco_base,
  CAST(REPLACE(REPLACE(REPLACE("R$ 19,95", "R$", ""), " ", ""), ",", ".") AS FLOAT) AS preco_base_formatado
FROM
  bronze.orders
LIMIT 5

preco_total,preco_base,preco_base_formatado
62.7,"R$ 19,95",19.95
62.7,"R$ 19,95",19.95
115.5,"R$ 19,95",19.95
55.9,"R$ 19,95",19.95
37.9,"R$ 19,95",19.95
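
The nested `REPLACE` calls are easier to follow when written as a small function. This is a sketch in plain Python; the helper name `parse_brl` is hypothetical, and stripping the thousands separator is an extra step that assumes the Brazilian number format (dot for thousands, comma for decimals).

```python
def parse_brl(price: str) -> float:
    """Convert a price string such as 'R$ 19,95' into a float."""
    cleaned = (
        price.replace("R$", "")   # drop the currency symbol
             .replace(" ", "")    # drop spaces
             .replace(".", "")    # drop thousands separators (assumption)
             .replace(",", ".")   # decimal comma -> decimal point
    )
    return float(cleaned)


print(parse_brl("R$ 19,95"))
print(parse_brl("R$ 1.234,56"))
```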


In [0]:
%sql
SELECT
  order_created_day,
  order_created_month,
  order_moment_created
FROM
  bronze.orders
WHERE
  order_created_day = 1
LIMIT 5

order_created_day,order_created_month,order_moment_created
1,1,1/1/2021 12:01:36 AM
1,1,1/1/2021 12:04:26 AM
1,1,1/1/2021 12:13:07 AM
1,1,1/1/2021 12:19:15 AM
1,1,1/1/2021 12:26:25 AM


In [0]:
%sql
SELECT
  order_moment_created,
  -- Take everything before the first space so that two-digit months and days are not truncated
  TO_DATE(SUBSTRING_INDEX(order_moment_created, " ", 1), "M/d/yyyy") AS order_moment_created_formated
FROM
  bronze.orders
WHERE
  order_created_day > 10
AND
  order_created_month > 2
LIMIT 5

order_moment_created,order_moment_created_formated
3/11/2021 12:00:03 AM,2021-03-11
3/11/2021 12:00:15 AM,2021-03-11
3/11/2021 12:00:43 AM,2021-03-11
3/11/2021 12:00:44 AM,2021-03-11
3/11/2021 12:00:50 AM,2021-03-11
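
Timestamps of this shape have a variable width (`1/1/2021` versus `10/11/2021`), so parsing by format is safer than slicing by position. The sketch below uses Python's standard `datetime` module; the helper name `parse_order_moment` is hypothetical, and the October timestamp is a made-up value used to show the width problem.

```python
from datetime import datetime


def parse_order_moment(raw: str) -> datetime:
    """Parse strings such as '3/11/2021 12:00:03 AM' (month/day/year, 12-hour clock)."""
    return datetime.strptime(raw, "%m/%d/%Y %I:%M:%S %p")


short = parse_order_moment("1/1/2021 12:01:36 AM")
long_ = parse_order_moment("10/11/2021 4:35:11 PM")  # made-up value

# A fixed 9-character slice cuts the year when both month and day have two digits.
truncated = "10/11/2021 4:35:11 PM"[:9]

print(short.date(), long_.date(), truncated)
```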


## Missing Data

In [0]:
%sql
SELECT * FROM bronze.orders LIMIT 5

order_id,store_id,channel_id,payment_order_id,delivery_order_id,order_status,order_amount,order_delivery_fee,order_delivery_cost,order_created_hour,order_created_minute,order_created_day,order_created_month,order_created_year,order_moment_created,order_moment_accepted,order_moment_ready,order_moment_collected,order_moment_in_expedition,order_moment_delivering,order_moment_delivered,order_moment_finished,order_metric_collected_time,order_metric_paused_time,order_metric_production_time,order_metric_walking_time,order_metric_expediton_speed_time,order_metric_transit_time,order_metric_cycle_time
68405119,3512,5,68405119,68405119,CANCELED,62.7,0,,0,1,1,1,2021,1/1/2021 12:01:36 AM,,,,,,,,,,,,,,
68405123,3512,5,68405123,68405123,CANCELED,62.7,0,,0,4,1,1,2021,1/1/2021 12:04:26 AM,,,,,,,,,,,,,,
68405206,3512,5,68405206,68405206,CANCELED,115.5,0,,0,13,1,1,2021,1/1/2021 12:13:07 AM,,,,,,,,,,,,,,
68405465,3401,5,68405465,68405465,CANCELED,55.9,0,,0,19,1,1,2021,1/1/2021 12:19:15 AM,,,,,,,,,,,,,,
68406064,3401,5,68406064,68406064,CANCELED,37.9,0,,0,26,1,1,2021,1/1/2021 12:26:25 AM,,,,,,,,,,,,,,


In [0]:
%sql
SELECT
  order_amount,
  (
    CASE
      WHEN 
        order_delivery_cost IS NULL AND order_amount > 0 THEN ROUND((SELECT MEDIAN(order_delivery_cost)
        FROM bronze.orders),2)
      ELSE order_delivery_cost
    END
  ) AS order_delivery_cost
FROM
  bronze.orders
LIMIT 5

order_amount,order_delivery_cost
62.7,7.19
62.7,7.19
115.5,7.19
55.9,7.19
37.9,7.19
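
The imputation logic of the query above (fill a missing delivery cost with the median of the known costs, but only when the order amount is positive) can be sketched in plain Python. The rows below are made up for illustration and the variable names are hypothetical.

```python
from statistics import median

# Made-up rows; None plays the role of SQL NULL.
rows = [
    {"order_amount": 62.7, "order_delivery_cost": None},
    {"order_amount": 80.0, "order_delivery_cost": 5.0},
    {"order_amount": 40.0, "order_delivery_cost": 7.0},
    {"order_amount": 0.0, "order_delivery_cost": None},
    {"order_amount": 30.0, "order_delivery_cost": 9.0},
]

# The median ignores missing values, as the SQL aggregate does.
known = [r["order_delivery_cost"] for r in rows if r["order_delivery_cost"] is not None]
fill = round(median(known), 2)

# Same condition as the CASE expression: NULL cost and a positive amount.
imputed = [
    fill
    if r["order_delivery_cost"] is None and r["order_amount"] > 0
    else r["order_delivery_cost"]
    for r in rows
]

print(imputed)
```

The row with a zero amount keeps its missing cost, which mirrors the `ELSE order_delivery_cost` branch.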


## Creating a Cleaned Table

<b> SQL execution order </b>
<br/><br/>
1. FROM clause<br/>
2. WHERE clause<br/>
3. GROUP BY clause<br/>
4. HAVING clause<br/>
5. SELECT clause<br/>
6. ORDER BY clause<br/>
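
The order listed above explains why `WHERE` filters rows before grouping while `HAVING` filters the groups afterwards. A minimal sketch, assuming Python's built-in `sqlite3` module as a stand-in for Spark SQL and made-up rows; the comments number each clause in its logical execution order.

```python
import sqlite3

# In-memory SQLite database standing in for bronze.orders (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (store_id TEXT, order_status TEXT, order_amount REAL);
    INSERT INTO orders VALUES
        ('1', 'FINISHED', 50.0), ('1', 'FINISHED', 70.0), ('1', 'CANCELED', 500.0),
        ('2', 'FINISHED', 30.0), ('3', 'FINISHED', 200.0);
""")

store_totals = conn.execute("""
    SELECT store_id, SUM(order_amount) AS total   -- 5. SELECT
    FROM orders                                   -- 1. FROM
    WHERE order_status = 'FINISHED'               -- 2. WHERE (filters rows)
    GROUP BY store_id                             -- 3. GROUP BY
    HAVING SUM(order_amount) > 100                -- 4. HAVING (filters groups)
    ORDER BY total DESC                           -- 6. ORDER BY (alias is visible here)
""").fetchall()

print(store_totals)
```

The canceled order of 500.0 is removed by `WHERE` before the sum is computed, so store 1 totals 120.0 rather than 620.0.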

In [0]:
%sql
CREATE TABLE prata.orders_amount_store AS (SELECT
  order_moment_date,
  store_name,
  ROUND(SUM(order_amount), 2) AS total
FROM
  (SELECT 
    TO_DATE(SUBSTRING_INDEX(orders.order_moment_created, " ", 1), "M/d/yyyy") AS order_moment_date,
    stores.store_name,
    CAST(order_amount AS FLOAT) AS order_amount
  FROM
    bronze.orders
  INNER JOIN
    bronze.stores
  ON
    stores.store_id = orders.store_id
  WHERE
    order_moment_created IS NOT NULL
  AND
    CAST(order_amount AS FLOAT) <= 10000)
GROUP BY
  order_moment_date, store_name)

num_affected_rows,num_inserted_rows


In [0]:
%sql
SELECT * FROM prata.orders_amount_store

order_moment_date,store_name,total
2021-01-02,CISI DA URPIOU,158.9
2021-01-02,PUMGUA,543.5
2021-01-05,MURPURI OUS GURAIS,1037.7
2021-01-05,SR SIGIRIMI,1760.5
2021-01-08,ISZUI,219.8
2021-01-09,EULARAI MRIPACIA,940.9
2021-01-09,CISI PIUEUEMI,105.79
2021-01-11,ECILUMI MI LISI FASACIO PASCIAMO,58.0
2021-01-12,CZALLA PUIMS,299.98
2021-01-16,SR SIGIRIMI,3839.5
