# Integrative Project: DuckDB

## Installation

In [1]:
%pip install duckdb --upgrade

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 24.3.1 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip


### Creating the connection

`connect` opens a connection to a database. By default, if no database name is given, the database runs in memory and is not persisted, so the tables created are not stored. We work in memory because we consider it DuckDB's main strength, and analysing it is the focus of this research project.

In [2]:
import duckdb as db

database = db.connect(database=":memory:")


##### Testing with the dataset

Orders and their payments will be stored in PostgreSQL. Products and their categories will be stored in AWS S3. The remaining files will be stored locally, some as CSV and others as Parquet. This layout is shown as a diagram in the attached documentation.

In [3]:
%pip install -q kagglehub
import kagglehub, shutil, pathlib

path = kagglehub.dataset_download("olistbr/brazilian-ecommerce")

local_csv_files = [
    "olist_customers_dataset.csv",
    "olist_geolocation_dataset.csv",
    "olist_order_items_dataset.csv",
    "olist_order_reviews_dataset.csv",
    "olist_sellers_dataset.csv",
    # These two files will also be stored in PostgreSQL
    "olist_orders_dataset.csv",
    "olist_order_payments_dataset.csv",
]

# Make sure the target directory exists before copying
dataset_dir = pathlib.Path("dataset")
dataset_dir.mkdir(exist_ok=True)

for file_name in local_csv_files:
    shutil.copy(pathlib.Path(path) / file_name, dataset_dir / file_name)

# Create one in-memory table per CSV, named after the file (without extension)
for file_name in local_csv_files:
    table = pathlib.Path(f"dataset/{file_name}").stem
    database.execute(f"""
        CREATE OR REPLACE TABLE {table} AS
        SELECT * FROM read_csv_auto('dataset/{file_name}');
    """)





#### Basic check that the tables were created

In [4]:
for file_name in local_csv_files:
    table = pathlib.Path(f"dataset/{file_name}").stem
    schema = database.sql(f"DESCRIBE {table}")
    print(schema)

┌──────────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│       column_name        │ column_type │  null   │   key   │ default │  extra  │
│         varchar          │   varchar   │ varchar │ varchar │ varchar │ varchar │
├──────────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ customer_id              │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ customer_unique_id       │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ customer_zip_code_prefix │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ customer_city            │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ customer_state           │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
└──────────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘

┌─────────────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│         column_name         │ column_type │  null   │   key   │ default │  extra 

#### Fetching data remotely: connecting to AWS S3

**NOTE:** Using AWS S3 requires credentials with access to the bucket. To get them, create a user in AWS IAM, grant it permissions, and generate access keys for that user. The credentials are then read from a `.env` file.
It is important that the region configured for the connection (`AWS_REGION`) matches the bucket's region.

In [5]:
%pip install python-dotenv
from dotenv import load_dotenv
load_dotenv()  

import os

database.sql("INSTALL httpfs; LOAD httpfs;")

database.sql(f"""
SET s3_region='{os.getenv("AWS_REGION")}';
SET s3_access_key_id='{os.getenv("AWS_ACCESS_KEY_ID")}';
SET s3_secret_access_key='{os.getenv("AWS_SECRET_ACCESS_KEY")}';
""")



### Creating tables from the remote data

In [6]:
database.sql("""
    CREATE OR REPLACE TABLE olist_products_dataset AS
    SELECT * FROM read_csv_auto('s3://ti-quadrelli-ribarov/olist_products_dataset.csv');
""")

database.sql("""
    CREATE OR REPLACE TABLE product_category_name_translation AS
    SELECT * FROM read_csv_auto('s3://ti-quadrelli-ribarov/product_category_name_translation.csv');
""")


#### Basic check that the tables with remote data were created


In [7]:
remote_csv_files  = [
    "olist_products_dataset.csv",
    "product_category_name_translation.csv",
]

for file_name in remote_csv_files:
    table = pathlib.Path(f"dataset/{file_name}").stem
    schema = database.sql(f"DESCRIBE {table}")
    print(schema)

┌────────────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│        column_name         │ column_type │  null   │   key   │ default │  extra  │
│          varchar           │   varchar   │ varchar │ varchar │ varchar │ varchar │
├────────────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ product_id                 │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ product_category_name      │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ product_name_lenght        │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ product_description_lenght │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ product_photos_qty         │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ product_weight_g           │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ product_length_cm          │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ product_height_cm          │ BIGINT      │ YES     │ NULL    │ 

#### Fetching data from PostgreSQL

**NOTE:** This step could have been done without DuckDB, for example from a database manager such as DBeaver or with the pandas library. Given the goal of this project, we chose to do it with DuckDB.

In [8]:

import os

# This cell uses DuckDB to connect to PostgreSQL and creates the trabajo_integrador
# database if it does not exist yet

database.execute("INSTALL postgres;")
database.execute("LOAD postgres;")

try:
    database.execute("DETACH pgadmin;")
except Exception:
    pass

conninfo = f"host={os.getenv('PG_HOST')} port={os.getenv('PG_PORT')} user={os.getenv('PG_USER')} password={os.getenv('PG_PASSWORD')} dbname=postgres"
database.execute(f"ATTACH '{conninfo}' AS pgadmin (TYPE postgres);")

exists = database.execute("""
    SELECT COUNT(*) > 0
    FROM postgres_query(
        'pgadmin',
        $$SELECT 1 FROM pg_database WHERE datname = 'trabajo_integrador'$$
    );
""").fetchone()[0]

if not exists:
    database.execute("""
        CALL postgres_execute(
            'pgadmin',
            $$CREATE DATABASE trabajo_integrador$$,
            use_transaction => false
        );
    """)

database.execute("DETACH pgadmin;")

<duckdb.duckdb.DuckDBPyConnection at 0x190433d43b0>

In [9]:
import os

# Single quotes inside the f-string keep this valid on Python versions before 3.12
conninfo = f"host={os.getenv('PG_HOST')} port={os.getenv('PG_PORT')} user={os.getenv('PG_USER')} password={os.getenv('PG_PASSWORD')} dbname={os.getenv('PG_DB')}"
database.execute(f"ATTACH '{conninfo}' AS pgdb (TYPE postgres);")

<duckdb.duckdb.DuckDBPyConnection at 0x190433d43b0>

#### Creating the tables

In [10]:
import pathlib

orders_csv = pathlib.Path(path) / "olist_orders_dataset.csv"
payments_dataset = pathlib.Path(path) / "olist_order_payments_dataset.csv"

database.execute(f"""
    DROP TABLE IF EXISTS pgdb.olist_orders;
    CREATE TABLE pgdb.olist_orders AS
    SELECT *
    FROM read_csv_auto('{orders_csv.as_posix()}', HEADER=TRUE);
""")

database.execute(f"""
    DROP TABLE IF EXISTS pgdb.olist_orders_payments;
    CREATE TABLE pgdb.olist_orders_payments AS
    SELECT *
    FROM read_csv_auto('{payments_dataset.as_posix()}', HEADER=TRUE);
""")

<duckdb.duckdb.DuckDBPyConnection at 0x190433d43b0>

In [11]:
postgres_tables = [
    "olist_orders",
    "olist_orders_payments",
]

# Describe each table created in the attached PostgreSQL database
for table in postgres_tables:
    schema = database.sql(f"DESCRIBE pgdb.{table}")
    print(schema)

┌───────────────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│          column_name          │ column_type │  null   │   key   │ default │  extra  │
│            varchar            │   varchar   │ varchar │ varchar │ varchar │ varchar │
├───────────────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ order_id                      │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ customer_id                   │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ order_status                  │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ order_purchase_timestamp      │ TIMESTAMP   │ YES     │ NULL    │ NULL    │ NULL    │
│ order_approved_at             │ TIMESTAMP   │ YES     │ NULL    │ NULL    │ NULL    │
│ order_delivered_carrier_date  │ TIMESTAMP   │ YES     │ NULL    │ NULL    │ NULL    │
│ order_delivered_customer_date │ TIMESTAMP   │ YES     │ NULL    │ NULL    │ NULL    │
│ order_estimated_delivery_date 

### Exploring the dataset

In [13]:
database.sql("SHOW TABLES").df()

Unnamed: 0,name
0,olist_customers_dataset
1,olist_geolocation_dataset
2,olist_order_items_dataset
3,olist_order_payments_dataset
4,olist_order_reviews_dataset
5,olist_orders_dataset
6,olist_products_dataset
7,olist_sellers_dataset
8,product_category_name_translation


In [14]:
database.sql("SELECT * FROM olist_customers_dataset LIMIT 5").df()

Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
0,06b8999e2fba1a1fbc88172c00ba8bc7,861eff4711a542e4b93843c6dd7febb0,14409,franca,SP
1,18955e83d337fd6b2def6b18a428ac77,290c77bc529b7ac935b93aa66c333dc3,9790,sao bernardo do campo,SP
2,4e7b3e00288586ebd08712fdd0374a03,060e732b5b29e8181a18229c7b0b2b5e,1151,sao paulo,SP
3,b2b6027bc5c5109e529d4dc6358b12c3,259dac757896d24d7702b9acbbff3f3c,8775,mogi das cruzes,SP
4,4f2d8ab171c80ec8364f7c12e35b23ad,345ecd01c38d18a9036ed96c73b8d066,13056,campinas,SP


In [17]:
print("Rows with nulls:")
database.sql("""
SELECT *
FROM olist_customers_dataset
WHERE
    customer_id IS NULL OR
    customer_unique_id IS NULL OR
    customer_zip_code_prefix IS NULL OR
    customer_city IS NULL OR
    customer_state IS NULL
LIMIT 5
""").df()

Rows with nulls:


Unnamed: 0,customer_id,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state


In [18]:
database.sql("SELECT * FROM olist_geolocation_dataset LIMIT 5").df()

Unnamed: 0,geolocation_zip_code_prefix,geolocation_lat,geolocation_lng,geolocation_city,geolocation_state
0,1037,-23.545621,-46.639292,sao paulo,SP
1,1046,-23.546081,-46.64482,sao paulo,SP
2,1046,-23.546129,-46.642951,sao paulo,SP
3,1041,-23.544392,-46.639499,sao paulo,SP
4,1035,-23.541578,-46.641607,sao paulo,SP


In [24]:
tables = database.sql("SHOW TABLES").df()['name'].tolist()

# For each table, check whether at least one row has a NULL in any column
for table in tables:
    print(f"Table: {table}")
    columns = database.sql(f"DESCRIBE {table}").df()['column_name'].tolist()
    where_clause = " OR ".join([f"{col} IS NULL" for col in columns])
    query = f"SELECT 1 FROM {table} WHERE {where_clause} LIMIT 1"
    result = database.sql(query).df()
    if not result.empty:
        print("Has nulls")
    else:
        print("Has NO nulls")


Table: olist_customers_dataset
Has NO nulls
Table: olist_geolocation_dataset
Has NO nulls
Table: olist_order_items_dataset
Has NO nulls
Table: olist_order_payments_dataset
Has NO nulls
Table: olist_order_reviews_dataset
Has nulls
Table: olist_orders_dataset
Has nulls
Table: olist_products_dataset
Has nulls
Table: olist_sellers_dataset
Has NO nulls
Table: product_category_name_translation
Has NO nulls
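
The loop above builds its `WHERE` clause from unquoted column names, which works for this dataset but would break on names containing spaces or reserved words. A minimal sketch of the same query builder with double-quoted identifiers follows. The function names `quote_identifier` and `null_check_query` are hypothetical, and the two column names in the example call are assumed from the Olist sellers file rather than shown in this notebook.

```python
def quote_identifier(name: str) -> str:
    """Wrap an identifier in double quotes, doubling any embedded quote."""
    return '"' + name.replace('"', '""') + '"'


def null_check_query(table: str, columns: list[str]) -> str:
    """Return a query that yields one row if any listed column contains a NULL."""
    where_clause = " OR ".join(
        f"{quote_identifier(col)} IS NULL" for col in columns
    )
    return f"SELECT 1 FROM {quote_identifier(table)} WHERE {where_clause} LIMIT 1"


# Column names below are assumed for illustration
query = null_check_query("olist_sellers_dataset", ["seller_id", "seller_city"])
print(query)
# SELECT 1 FROM "olist_sellers_dataset" WHERE "seller_id" IS NULL OR "seller_city" IS NULL LIMIT 1
```

The resulting string can be passed to `database.sql(...)` exactly as in the loop.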


In [27]:
# Tables to inspect; per the check above, the last two contain no nulls and print empty DataFrames
tablas_con_nulos = ["olist_order_reviews_dataset", "olist_orders_dataset", "olist_products_dataset", "olist_sellers_dataset", "olist_customers_dataset"]
for table in tablas_con_nulos:
    print(f"Table: {table}")
    columns = database.sql(f"DESCRIBE {table}").df()['column_name'].tolist()
    where_clause = " OR ".join([f"{col} IS NULL" for col in columns])
    query = f"SELECT * FROM {table} WHERE {where_clause} LIMIT 5"
    df_nulos = database.sql(query).df()
    print(df_nulos)


Table: olist_order_reviews_dataset
                          review_id                          order_id  \
0  7bc2406110b926393aa56f80a40eba40  73fc7af87114b39712e6da79b0a377eb   
1  80e641a11e56f04c1ad469d5645fdfde  a548910a1c6147796b98fdf73dbeba33   
2  228ce5500dc1d8e020d8d1322874b6f0  f9e4b658b201a9f2ecdecbb34bed034b   
3  e64fb393e7b32834bb789ff8bb30750e  658677c97b385a9be170737859d3511b   
4  f7c4243c7fe1938f181bec41a392bdeb  8e6bfb81e283fa7e4f11123a3fb894f1   

   review_score review_comment_title  \
0             4                 None   
1             5                 None   
2             5                 None   
3             5                 None   
4             5                 None   

                              review_comment_message review_creation_date  \
0                                               None           2018-01-18   
1                                               None           2018-03-10   
2                                               None  

In [32]:
total = database.sql("SELECT COUNT(*) AS total FROM olist_order_reviews_dataset").df().iloc[0]['total']
nulos = database.sql("SELECT COUNT(*) AS nulos FROM olist_order_reviews_dataset WHERE review_comment_title IS NULL").df().iloc[0]['nulos']
no_nulos = total - nulos
print(f"review_comment_title nulls: {nulos}")
print(f"review_comment_title non-nulls: {no_nulos}")
print(f"Percentage of nulls: {nulos / total * 100:.2f}%")

review_comment_title nulls: 87656
review_comment_title non-nulls: 11568
Percentage of nulls: 88.34%
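
The two counts above can also be obtained in a single pass, because `COUNT(*)` counts every row while `COUNT(column)` skips NULLs. The sketch below illustrates this on four made-up rows. It uses the standard-library `sqlite3` module purely as a stand-in so that it runs without DuckDB installed; in the notebook the same query string would be passed to `database.sql(...)`.

```python
import sqlite3

# sqlite3 is only a stand-in here; the rows are invented for illustration
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE olist_order_reviews_dataset "
    "(review_id TEXT, review_comment_title TEXT)"
)
conn.executemany(
    "INSERT INTO olist_order_reviews_dataset VALUES (?, ?)",
    [("r1", None), ("r2", "Recomendo"), ("r3", None), ("r4", None)],
)

# COUNT(*) counts all rows; COUNT(review_comment_title) ignores NULLs
query = """
SELECT
    COUNT(*) AS total,
    COUNT(*) - COUNT(review_comment_title) AS nulos
FROM olist_order_reviews_dataset
"""
total, nulos = conn.execute(query).fetchone()
porcentaje = nulos / total * 100
print(f"{nulos} of {total} rows are NULL ({porcentaje:.2f}%)")
# 3 of 4 rows are NULL (75.00%)
```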


In [33]:
database.sql("DESCRIBE olist_orders_dataset").df()

Unnamed: 0,column_name,column_type,null,key,default,extra
0,order_id,VARCHAR,YES,,,
1,customer_id,VARCHAR,YES,,,
2,order_status,VARCHAR,YES,,,
3,order_purchase_timestamp,TIMESTAMP,YES,,,
4,order_approved_at,TIMESTAMP,YES,,,
5,order_delivered_carrier_date,TIMESTAMP,YES,,,
6,order_delivered_customer_date,TIMESTAMP,YES,,,
7,order_estimated_delivery_date,TIMESTAMP,YES,,,


In [34]:
# Get the oldest and the most recent order according to order_purchase_timestamp
result = database.sql("""
    SELECT 
        MIN(order_purchase_timestamp) AS orden_mas_vieja,
        MAX(order_purchase_timestamp) AS orden_mas_reciente
    FROM olist_orders_dataset
""")
print(result)

┌─────────────────────┬─────────────────────┐
│   orden_mas_vieja   │ orden_mas_reciente  │
│      timestamp      │      timestamp      │
├─────────────────────┼─────────────────────┤
│ 2016-09-04 21:15:19 │ 2018-10-17 17:30:18 │
└─────────────────────┴─────────────────────┘



In [36]:
query = """
SELECT 
    EXTRACT(year FROM order_purchase_timestamp) AS anio,
    COUNT(*) AS cantidad_ordenes
FROM olist_orders_dataset
GROUP BY anio
ORDER BY cantidad_ordenes DESC
LIMIT 3
"""
database.sql(query).df()

Unnamed: 0,anio,cantidad_ordenes
0,2018,54011
1,2017,45101
2,2016,329


In [38]:
# Get the three categories with the most products listed in the catalogue
query = """
SELECT product_category_name, COUNT(*) AS cantidad
FROM olist_products_dataset
GROUP BY product_category_name
ORDER BY cantidad DESC
LIMIT 3
"""
database.sql(query).df()

Unnamed: 0,product_category_name,cantidad
0,cama_mesa_banho,3029
1,esporte_lazer,2867
2,moveis_decoracao,2657
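
The query above counts rows in `olist_products_dataset`, that is, how many catalogue products each category has. Ranking categories by how often they were actually ordered requires joining with the order items. The sketch below shows that join on a few made-up rows. It assumes `olist_order_items_dataset` has a `product_id` column (its schema is not shown in this notebook) and uses the standard-library `sqlite3` module only as a stand-in so it runs without DuckDB; the query string itself is what would be passed to `database.sql(...)`.

```python
import sqlite3

# sqlite3 stand-in with invented rows; table and column names follow the notebook
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE olist_products_dataset (product_id TEXT, product_category_name TEXT);
CREATE TABLE olist_order_items_dataset (order_id TEXT, product_id TEXT);
INSERT INTO olist_products_dataset VALUES
    ('p1', 'cama_mesa_banho'), ('p2', 'cama_mesa_banho'), ('p3', 'esporte_lazer');
INSERT INTO olist_order_items_dataset VALUES
    ('o1', 'p3'), ('o2', 'p3'), ('o3', 'p3'), ('o4', 'p1');
""")

# Count ordered items per category instead of catalogue products per category
query = """
SELECT p.product_category_name, COUNT(*) AS cantidad
FROM olist_order_items_dataset oi
JOIN olist_products_dataset p ON oi.product_id = p.product_id
GROUP BY p.product_category_name
ORDER BY cantidad DESC
LIMIT 3
"""
rows = conn.execute(query).fetchall()
print(rows)
# [('esporte_lazer', 3), ('cama_mesa_banho', 1)]
```

In the toy data `cama_mesa_banho` has more catalogue products but `esporte_lazer` has more ordered items, which is exactly the distinction between the two rankings.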


In [39]:
database.sql("""
SELECT customer_city, COUNT(*) AS cantidad
FROM olist_customers_dataset
GROUP BY customer_city
ORDER BY cantidad DESC
LIMIT 1
""").df()

Unnamed: 0,customer_city,cantidad
0,sao paulo,15540


In [41]:
database.sql(
"""
SELECT seller_city, COUNT(*) AS cantidad_vendedores
FROM olist_sellers_dataset
GROUP BY seller_city
ORDER BY cantidad_vendedores DESC
LIMIT 1
"""
).df()

Unnamed: 0,seller_city,cantidad_vendedores
0,sao paulo,694


### Queries

This section runs queries that could be useful to a business that owns the dataset under study.

How many days, on average, did order delivery take in each year?

In [47]:
query = """
SELECT 
    EXTRACT(year FROM order_purchase_timestamp) AS anio,
    AVG(DATEDIFF('day', order_purchase_timestamp, order_delivered_customer_date)) AS promedio_dias
FROM olist_orders_dataset
WHERE 
    order_delivered_customer_date IS NOT NULL
GROUP BY anio
ORDER BY anio DESC
"""
database.sql(query).df()

Unnamed: 0,anio,promedio_dias
0,2018,12.063928
1,2017,12.979045
2,2016,19.6875
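
One detail worth keeping in mind when reading these averages: as far as we understand DuckDB's documentation, `DATEDIFF('day', a, b)` counts the day boundaries (midnights) crossed between the two timestamps, not the complete 24-hour periods elapsed; a function such as `date_sub` would be the one that counts complete periods. The plain-Python sketch below, with invented timestamps, illustrates the difference between the two notions.

```python
from datetime import datetime

purchase = datetime(2018, 1, 1, 23, 0)
delivered = datetime(2018, 1, 2, 1, 0)  # two hours later, just after midnight

# Calendar-day boundaries crossed (our reading of what DATEDIFF('day', ...) reports)
boundaries = (delivered.date() - purchase.date()).days

# Complete 24-hour periods elapsed
full_days = (delivered - purchase).days

print(boundaries, full_days)
# 1 0
```

An order delivered two hours after purchase would therefore contribute one day to the average if boundaries are counted, so the figures above may be slightly higher than an average of elapsed time.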


How many distinct customers placed orders on the website each year?

In [48]:
query = """
SELECT 
    EXTRACT(year FROM olist_orders_dataset.order_purchase_timestamp) AS anio,
    COUNT(DISTINCT olist_customers_dataset.customer_unique_id) AS cantidad_clientes
FROM olist_orders_dataset
JOIN olist_customers_dataset
    ON olist_orders_dataset.customer_id = olist_customers_dataset.customer_id
GROUP BY anio
ORDER BY anio
"""
database.sql(query).df()

Unnamed: 0,anio,cantidad_clientes
0,2016,326
1,2017,43713
2,2018,52749


Which customers placed the most orders?

In [55]:
query = """
SELECT c.customer_unique_id, COUNT(*) AS n_ordenes
FROM olist_orders_dataset o
JOIN olist_customers_dataset c
  ON o.customer_id = c.customer_id
GROUP BY c.customer_unique_id
ORDER BY n_ordenes DESC
LIMIT 10;
"""
database.sql(query).df()

Unnamed: 0,customer_unique_id,n_ordenes
0,8d50f5eadf50201ccdcedfb9e2ac8455,17
1,3e43e6105506432c953e165fb2acf44c,9
2,ca77025e7201e3b30c44b472ff346268,7
3,1b6c7548a2a1f9037c1fd3ddfed95f33,7
4,6469f99c1f9dfae7733b25662e7f1782,7
5,de34b16117594161a6a89c50b289d35a,6
6,63cfc61cee11cbe306bff5857d00bfe4,6
7,47c1a3033b8b77b3ab6e109eb4d5fdf3,6
8,dc813062e0fc23409cd255f7f53c7074,6
9,f0e310a6839dce9de1638e0fe5ab282a,6


As part of the analysis, we first create a table with the number of orders each customer placed per year. We then query it to find the customers who placed more than 3 orders in a year and check whether they kept ordering or left the platform.

In [57]:
query = """
CREATE OR REPLACE TABLE clientes_pedidos_por_año AS
SELECT
    c.customer_unique_id,
    EXTRACT(YEAR FROM o.order_purchase_timestamp) AS año,
    COUNT(*) AS num_pedidos
FROM olist_orders_dataset o
JOIN olist_customers_dataset c
  ON o.customer_id = c.customer_id
GROUP BY c.customer_unique_id, año
ORDER BY c.customer_unique_id, año;
"""
database.sql(query)

In [96]:
query = """
SELECT * 
FROM clientes_pedidos_por_año
"""
database.sql(query).df()

Unnamed: 0,customer_unique_id,año,num_pedidos
0,0000366f3b9a7992bf8c76cfdf3221e2,2018,1
1,0000b849f77a49e4a4ce2b2a4ca5be3f,2018,1
2,0000f46a3911fa3c0805444483337064,2017,1
3,0000f6ccb0745a6a4b88665a16c9f078,2017,1
4,0004aac84e0df4da2b147fca70cf8255,2017,1
...,...,...,...
96783,fffcf5a5ff07b0908bd4e2dbc735a684,2017,1
96784,fffea47cd6d3cc0a88bd621562a9d061,2017,1
96785,ffff371b4d645b6ecea244b27531430a,2017,1
96786,ffff5962728ec6157033ef9805bacc48,2018,1


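A good way to understand why these customers left the platform is to look at their reviews. First we identify the lost customers: those who placed more than 3 orders in a year that is not the last year in the data, and no orders in the following year.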

In [91]:
query = """
CREATE OR REPLACE TABLE clientes_perdidos AS
WITH max_año AS (
    SELECT MAX(año) AS ultimo_año
    FROM clientes_pedidos_por_año
),
activos AS (
    SELECT
        cpa.customer_unique_id,
        cpa.año AS año_activo
    FROM clientes_pedidos_por_año cpa
    WHERE cpa.num_pedidos > 3
),
posteriores AS (
    SELECT
        a.customer_unique_id,
        MIN(cpa.año) AS primer_año_posterior
    FROM activos a
    JOIN clientes_pedidos_por_año cpa
      ON a.customer_unique_id = cpa.customer_unique_id
     AND cpa.año > a.año_activo
     AND cpa.num_pedidos > 0
    GROUP BY a.customer_unique_id, a.año_activo
)

SELECT
    a.customer_unique_id,
    a.año_activo
FROM activos a
JOIN max_año m
  ON 1=1
LEFT JOIN posteriores p
  ON a.customer_unique_id = p.customer_unique_id
 AND a.año_activo = p.primer_año_posterior - 1
WHERE
    a.año_activo < m.ultimo_año
  AND p.primer_año_posterior IS NULL;
"""
database.sql(query)

In [92]:
query = """
SELECT *
FROM clientes_perdidos
"""
database.sql(query).df()

Unnamed: 0,customer_unique_id,año_activo
0,12f5d6e1cbf93dafd9dcc19095df0b3d,2017
1,25a560b9a6006157838aab1bdbd68624,2017
2,83e7958a94bd7f74a9414d8782f87628,2017
3,a239b8e2fbce33780f1f1912e2ee5275,2017
4,a7657330b1c135f3acd420326e335b2c,2017
5,b08fab27d47a1eb6deda07bfd965ad43,2017
6,b8b3c435a58aebd788a477bed8342910,2017
7,ec7f1811826ab04a27a92197bc40c888,2017
8,f34cd7fd85a1f8baff886edf09567be3,2017
9,f64ec6d8dd29940264cd0bbb5ecade8a,2017
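
On our reading, the `clientes_perdidos` query reduces to a simple rule: a customer is lost in year Y if they placed more than 3 orders in Y, Y is not the last year in the data, and they placed no orders in Y + 1. The plain-Python sketch below applies that rule to invented data standing in for `clientes_pedidos_por_año`; the variable names are hypothetical and the data is not taken from the dataset.

```python
from collections import defaultdict

# Toy stand-in for clientes_pedidos_por_año: (customer, year) -> number of orders
orders_per_year = {
    ("a", 2016): 4,  # active in 2016 and orders again in 2017: retained
    ("a", 2017): 1,
    ("b", 2017): 5,  # active in 2017, nothing in 2018: lost
    ("c", 2017): 2,  # never above 3 orders in a year: ignored
    ("d", 2018): 6,  # active in the last year: cannot be judged yet
}

last_year = max(year for _, year in orders_per_year)

# Collect the set of years in which each customer ordered
years_by_customer = defaultdict(set)
for customer, year in orders_per_year:
    years_by_customer[customer].add(year)

# Apply the rule: more than 3 orders, not the last year, no orders the next year
lost = sorted(
    (customer, year)
    for (customer, year), n_orders in orders_per_year.items()
    if n_orders > 3
    and year < last_year
    and (year + 1) not in years_by_customer[customer]
)
print(lost)
# [('b', 2017)]
```

Checking the SQL result against a small hand-built case like this is a cheap way to confirm that the joins express the intended rule.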


In [95]:
query = """
SELECT 
    cp.customer_unique_id,
    cp.año_activo AS año,
    r.review_score AS puntaje,
    r.review_comment_message AS comentario
FROM clientes_perdidos cp
JOIN olist_customers_dataset c
    ON cp.customer_unique_id = c.customer_unique_id
JOIN olist_orders_dataset o
    ON c.customer_id = o.customer_id
LEFT JOIN olist_order_reviews_dataset r
    ON o.order_id = r.order_id
    AND EXTRACT(YEAR FROM o.order_purchase_timestamp) = cp.año_activo
WHERE r.review_comment_message IS NOT NULL
GROUP BY cp.customer_unique_id, cp.año_activo, r.review_score, r.review_comment_message
ORDER BY cp.customer_unique_id, cp.año_activo
"""
database.sql(query).df()


Unnamed: 0,customer_unique_id,año,puntaje,comentario
0,12f5d6e1cbf93dafd9dcc19095df0b3d,2017,5,Recebi antes do prazo de entrega informado e o...
1,12f5d6e1cbf93dafd9dcc19095df0b3d,2017,5,Recebi bem antes do prazo informado e o produt...
2,12f5d6e1cbf93dafd9dcc19095df0b3d,2017,5,Recomendo a loja! Produto entregue dentro do p...
3,12f5d6e1cbf93dafd9dcc19095df0b3d,2017,5,Super recomendo essa loja! Recebi bem antes do...
4,83e7958a94bd7f74a9414d8782f87628,2017,5,Uahlll! Recebi minha compra 30 dias antes do p...
5,83e7958a94bd7f74a9414d8782f87628,2017,5,"Recebi muito bem embalado, o produto é mais bo..."
6,83e7958a94bd7f74a9414d8782f87628,2017,5,"Adorei o produto, condiz com as imagens, a ent..."
7,a239b8e2fbce33780f1f1912e2ee5275,2017,5,Recomendo para outros
8,b08fab27d47a1eb6deda07bfd965ad43,2017,5,Minha esposa amou.
9,b08fab27d47a1eb6deda07bfd965ad43,2017,5,Perfeito para iniciantes.\r\n
