# Exploring the Delta Tables

This notebook lets you explore and manipulate the Delta tables of the logbook project.

## 1. SparkSession Configuration

Configure the Spark session with Delta Lake support.

In [1]:
from pyspark.sql import SparkSession
import os

# Configure the SparkSession with Delta Lake and a local Derby-backed Hive metastore
builder = (
    SparkSession.builder
    .appName("ExplorarTabelas")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.3.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.databricks.delta.retentionDurationCheck.enabled", "false")
    .config("spark.sql.warehouse.dir", "/app/spark-warehouse")
    .config("spark.sql.catalogImplementation", "hive")
    .config("javax.jdo.option.ConnectionURL", "jdbc:derby:/app/derby/metastore_db;create=true")
    .config("javax.jdo.option.ConnectionDriverName", "org.apache.derby.jdbc.EmbeddedDriver")
    .config("javax.jdo.option.ConnectionUserName", "APP")
    .config("javax.jdo.option.ConnectionPassword", "mine")
    .enableHiveSupport()
)

spark = builder.getOrCreate()
print("SparkSession initialized successfully!")

# Check that Delta Lake is available by creating and dropping a throwaway table
try:
    spark.sql("CREATE TABLE IF NOT EXISTS test_delta USING delta AS SELECT 1 as id")
    spark.sql("DROP TABLE test_delta")
    print("Delta Lake configured successfully!")
except Exception as e:
    print(f"Error while checking Delta Lake: {e}")

:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml

Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-fa6a474d-3966-4440-a3ef-df689b3e062f;1.0
	confs: [default]
	found io.delta#delta-core_2.12;2.3.0 in central
	found io.delta#delta-storage;2.3.0 in central
	found org.antlr#antlr4-runtime;4.8 in central
downloading https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.3.0/delta-core_2.12-2.3.0.jar ...
	[SUCCESSFUL ] io.delta#delta-core_2.12;2.3.0!delta-core_2.12.jar (1124ms)
downloading https://repo1.maven.org/maven2/io/delta/delta-storage/2.3.0/delta-storage-2.3.0.jar ...
	[SUCCESSFUL ] io.delta#delta-storage;2.3.0!delta-storage.jar (257ms)
downloading https://repo1.maven.org/maven2/org/antlr/antlr4-runtime/4.8/antlr4-runtime-4.8.jar ...
	[SUCCESSFUL ] org.antlr#antlr4-runtime;4.8!antlr4-runtime.jar (282ms)
:: resolution report :: resolve 4057ms :: artifacts dl 1668ms
	:: mo

25/06/11 16:35:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

SparkSession initialized successfully!
25/06/11 16:35:09 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
25/06/11 16:35:09 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
25/06/11 16:35:18 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
25/06/11 16:35:18 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore UNKNOWN@172.19.0.2

25/06/11 16:35:21 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

25/06/11 16:35:24 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider delta. Persisting data source table `default`.`test_delta` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
25/06/11 16:35:24 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
25/06/11 16:35:24 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
25/06/11 16:35:24 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
25/06/11 16:35:24 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
25/06/11 16:35:25 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Delta Lake configured successfully!
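
The Delta-specific settings above can also be kept in a plain dictionary and applied in a loop, which makes them easier to reuse across notebooks. The sketch below is only one way to organize this: `delta_session_conf` and `apply_conf` are hypothetical helper names, not part of PySpark or Delta Lake, and `apply_conf` only assumes that the builder exposes a chainable `.config(key, value)` method, as `SparkSession.builder` does.

```python
from typing import Dict


def delta_session_conf(warehouse_dir: str, delta_version: str = "2.3.0") -> Dict[str, str]:
    """Return the Delta-related settings from the cell above as a plain dict (hypothetical helper)."""
    return {
        "spark.jars.packages": f"io.delta:delta-core_2.12:{delta_version}",
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        "spark.sql.warehouse.dir": warehouse_dir,
    }


def apply_conf(builder, conf: Dict[str, str]):
    """Call builder.config(key, value) for every entry and return the resulting builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Usage inside the notebook (needs pyspark, so it is left commented out here):
# from pyspark.sql import SparkSession
# builder = apply_conf(
#     SparkSession.builder.appName("ExplorarTabelas"),
#     delta_session_conf("/app/spark-warehouse"),
# )
# spark = builder.enableHiveSupport().getOrCreate()
```

Keeping the Delta version in one argument helps when upgrading, since the `delta-core` artifact has to match the Spark release in use (Delta Lake 2.3.x is built for Spark 3.3.x).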


## 2. Loading the Bronze Table

Read the Bronze table data directly from the Delta format.

In [2]:
# Load the Bronze table and register it as a temporary view
bronze_df = spark.read.format("delta").load("/app/data/bronze/b_info_transportes")
bronze_df.createOrReplaceTempView("b_info_transportes")
print("Bronze table schema:")
bronze_df.printSchema()

print("\nSample data:")
bronze_df.show(5)

Bronze table schema:
root
 |-- data_inicio: string (nullable = true)
 |-- data_fim: string (nullable = true)
 |-- categoria: string (nullable = true)
 |-- local_inicio: string (nullable = true)
 |-- local_fim: string (nullable = true)
 |-- distancia: string (nullable = true)
 |-- proposito: string (nullable = true)


Sample data:
+----------------+----------------+---------+------------+---------------+---------+-----------------+
|     data_inicio|        data_fim|categoria|local_inicio|      local_fim|distancia|        proposito|
+----------------+----------------+---------+------------+---------------+---------+-----------------+
|01-01-2016 21:11|01-01-2016 21:17|  Negocio| Fort Pierce|    Fort Pierce|       51|      Alimentação|
|01-02-2016 01:25|01-02-2016 01:37|  Negocio| Fort Pierce|    Fort Pierce|        5|             null|
|01-02-2016 20:25|01-02-2016 20:38|  Negocio| Fort Pierce|    Fort Pierce|       48|         Entregas|
|01-05-2016 17:31|01-05-2016 17:45|  Neg

## 3. Loading the Silver Table

Read the data from the processed Silver table.

In [3]:
# Load the Silver table and register it as a temporary view
silver_df = spark.read.format("delta").load("/app/data/silver/s_info_transportes")
silver_df.createOrReplaceTempView("s_info_transportes")
print("Silver table schema:")
silver_df.printSchema()

print("\nSample data:")
silver_df.show(5)

Silver table schema:
root
 |-- data_inicio: string (nullable = true)
 |-- data_fim: string (nullable = true)
 |-- categoria: string (nullable = true)
 |-- local_inicio: string (nullable = true)
 |-- local_fim: string (nullable = true)
 |-- distancia: string (nullable = true)
 |-- proposito: string (nullable = true)
 |-- dt_refe: date (nullable = true)


Sample data:
+----------------+----------------+---------+------------+---------+---------+-----------------+----------+
|     data_inicio|        data_fim|categoria|local_inicio|local_fim|distancia|        proposito|   dt_refe|
+----------------+----------------+---------+------------+---------+---------+-----------------+----------+
|03-04-2016 07:47|03-04-2016 08:06|  negocio|        Cary|   Durham|       99|          reunião|2016-04-03|
|03-04-2016 09:46|03-04-2016 10:03|  negocio|      Durham|     Cary|       99|visita ao cliente|2016-04-03|
|03-04-2016 11:46|03-04-2016 12:06|  negocio|        Cary|   Durham|      104|   
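
In the Silver sample, `data_inicio` is still a string while `dt_refe` is a proper date, and `03-04-2016 07:47` is paired with `dt_refe = 2016-04-03`. That suggests the string is read as day-month-year. The sketch below reproduces that mapping in plain Python; the format constant and the `reference_date` helper are assumptions made for illustration, not code taken from the pipeline.

```python
from datetime import date, datetime

# Assumed pattern (day-month-year), inferred from the Silver sample above.
RIDE_TIMESTAMP_FORMAT = "%d-%m-%Y %H:%M"


def reference_date(data_inicio: str) -> date:
    """Parse a data_inicio string and return its calendar date (hypothetical helper)."""
    return datetime.strptime(data_inicio, RIDE_TIMESTAMP_FORMAT).date()


print(reference_date("03-04-2016 07:47"))  # 2016-04-03, matching dt_refe in the sample
```

In PySpark the same conversion would presumably be written with `to_date` and the pattern `dd-MM-yyyy HH:mm`, but the Silver job itself is not shown here, so treat that as a guess.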

In [4]:
# Select the distinct values of the Silver table's categoria column with the DataFrame API (no SQL expressions)
distinct_categories = silver_df.select("categoria").distinct()
distinct_categories.show()

+---------+
|categoria|
+---------+
|  pessoal|
|  negocio|
+---------+



                                                                                

## 4. Loading the Gold Table

Read the aggregated data from the Gold table.

In [5]:
# Load the Gold table and register it as a temporary view
gold_df = spark.read.format("delta").load("/app/data/gold/info_corridas_do_dia")
gold_df.createOrReplaceTempView("info_corridas_do_dia")
print("Gold table schema:")
gold_df.printSchema()

print("\nSample data:")
gold_df.show(5)

Gold table schema:
root
 |-- dt_refe: date (nullable = true)
 |-- qt_corr: integer (nullable = true)
 |-- qt_corr_neg: integer (nullable = true)
 |-- qt_corr_pess: integer (nullable = true)
 |-- vl_max_dist: decimal(17,2) (nullable = true)
 |-- vl_min_dist: decimal(17,2) (nullable = true)
 |-- vl_avg_dist: decimal(17,2) (nullable = true)
 |-- qt_corr_reuni: integer (nullable = true)
 |-- qt_corr_nao_reuni: integer (nullable = true)


Sample data:
+----------+-------+-----------+------------+-----------+-----------+-----------+-------------+-----------------+
|   dt_refe|qt_corr|qt_corr_neg|qt_corr_pess|vl_max_dist|vl_min_dist|vl_avg_dist|qt_corr_reuni|qt_corr_nao_reuni|
+----------+-------+-----------+------------+-----------+-----------+-----------+-------------+-----------------+
|2016-12-03|      2|          1|           1|      22.00|      19.00|      20.50|            0|                0|
|2016-11-05|      2|          2|           0|      81.00|     256.00|     168.50|  
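
Note that the 2016-11-05 row reports `vl_max_dist = 81.00` and `vl_min_dist = 256.00`, so the minimum is larger than the maximum. One plausible explanation, given that `distancia` is a string in the Silver schema, is that the maximum and minimum were computed on text rather than on numbers. The plain-Python sketch below illustrates the effect; the two distances are inferred from that row (two rides, average 168.50), and nothing here is taken from the actual Gold job.

```python
from decimal import Decimal

# The two distances implied by the 2016-11-05 row above (2 rides, average 168.50).
distances_as_text = ["81", "256"]

# Comparing text is lexicographic: "8" sorts after "2", so "81" wins as the "maximum".
text_max, text_min = max(distances_as_text), min(distances_as_text)

# Casting to a numeric type first compares by magnitude.
distances = [Decimal(d) for d in distances_as_text]
numeric_max, numeric_min = max(distances), min(distances)

print(text_max, text_min)        # 81 256
print(numeric_max, numeric_min)  # 256 81
```

If this is indeed the cause, casting `distancia` to a numeric type before aggregating (for example with `.cast("decimal(17,2)")` on the column) should bring the two columns back in order.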

In [6]:
# Keep only the Gold rows with at least one personal ride, using the PySpark DataFrame API
filtered_gold_df = gold_df.filter(gold_df["qt_corr_pess"] > 0)

# Show the results
filtered_gold_df.show()

+----------+-------+-----------+------------+-----------+-----------+-----------+-------------+-----------------+
|   dt_refe|qt_corr|qt_corr_neg|qt_corr_pess|vl_max_dist|vl_min_dist|vl_avg_dist|qt_corr_reuni|qt_corr_nao_reuni|
+----------+-------+-----------+------------+-----------+-----------+-----------+-------------+-----------------+
|2016-12-03|      2|          1|           1|      22.00|      19.00|      20.50|            0|                0|
|2016-12-07|      3|          1|           2|      87.00|     123.00|      74.67|            0|                0|
|2016-02-04|      6|          4|           2|     805.00|     144.00|     595.00|            1|                3|
|2016-03-03|      5|          4|           1|      76.00|     173.00|      69.20|            1|                3|
|2016-05-03|      6|          4|           2|      78.00|      35.00|      56.17|            0|                4|
|2016-01-04|      4|          3|           1|       7.00|      11.00|      94.00|       
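
The filter in this cell selects on `qt_corr_pess`. To pick a single reference date instead, the date string can be validated in Python and then compared against `dt_refe`. This is a minimal sketch: `filter_by_reference_date` is a hypothetical helper, and it only assumes that the DataFrame supports `df["dt_refe"]` and `df.filter(...)`, as PySpark DataFrames do.

```python
from datetime import date


def filter_by_reference_date(df, iso_date: str):
    """Return the rows of df whose dt_refe equals iso_date (YYYY-MM-DD).

    Raises ValueError if iso_date is not a valid ISO date, so a typo fails
    early instead of silently producing an empty result.
    """
    ref_date = date.fromisoformat(iso_date)
    # PySpark converts a datetime.date into a date literal when comparing columns.
    return df.filter(df["dt_refe"] == ref_date)


# Usage inside the notebook (needs the gold_df loaded above):
# filter_by_reference_date(gold_df, "2016-06-09").show()
```

Comparing against a real `date` rather than a raw string avoids depending on implicit string-to-date casts.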

## 5. Data Manipulation

Examples of operations you can run on the data:

In [7]:
# Example 1: record count per table
print("Record counts:")
print(f"Bronze: {bronze_df.count():,}")
print(f"Silver: {silver_df.count():,}")
print(f"Gold: {gold_df.count():,}")

Record counts:
Bronze: 1,153
Silver: 420
Gold: 114
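
These counts are easier to interpret as ratios between layers. The short sketch below uses the three numbers printed above; it assumes Gold holds one row per `dt_refe`, which the table name and schema suggest but this notebook does not verify.

```python
# Counts printed by the cell above.
bronze, silver, gold = 1153, 420, 114

# Share of Bronze rows that survive into Silver.
silver_share = silver / bronze

# Average number of Silver rows behind each Gold row (assumed: one Gold row per dt_refe).
rides_per_day = silver / gold

print(f"Silver keeps {silver_share:.1%} of the Bronze rows")
print(f"Gold averages {rides_per_day:.2f} Silver rows per dt_refe")
```

A Silver share this low is worth checking against the Silver job's filtering and parsing rules before trusting the Gold aggregates.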


## 6. Partition Analysis

Let's explore the Gold table's partitions by reference date. The cell below lists the distinct `dt_refe` values:

In [8]:
table_df = spark.table("info_corridas_do_dia")

print("\nTable partitions:")
table_df.select('DT_REFE').distinct().orderBy('DT_REFE').show(20)


Table partitions:


+----------+
|   DT_REFE|
+----------+
|2016-01-01|
|2016-01-02|
|2016-01-03|
|2016-01-04|
|2016-01-05|
|2016-01-06|
|2016-01-07|
|2016-01-08|
|2016-01-09|
|2016-01-11|
|2016-01-12|
|2016-02-01|
|2016-02-02|
|2016-02-04|
|2016-02-05|
|2016-02-07|
|2016-02-08|
|2016-02-09|
|2016-02-11|
|2016-02-12|
+----------+
only showing top 20 rows
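
The query above returns the distinct `dt_refe` values whether or not the table is physically partitioned by that column. To check the physical layout, one option is to look for Hive-style `column=value` directories under the table path. The sketch below assumes the table files are readable from the notebook's filesystem and that the table uses the default directory naming; `partition_values` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def partition_values(table_path: str, column: str = "dt_refe") -> List[str]:
    """List the values of Hive-style partition directories (column=value) under table_path."""
    prefix = f"{column}="
    return sorted(
        entry.name[len(prefix):]
        for entry in Path(table_path).iterdir()
        if entry.is_dir() and entry.name.startswith(prefix)
    )


# Usage inside the notebook:
# partition_values("/app/data/gold/info_corridas_do_dia")
```

An empty list would indicate that the table is not partitioned by `dt_refe` on disk. Delta's `DESCRIBE DETAIL` command reports the same information in its `partitionColumns` field.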



                                                                                

In [9]:
# List all available databases
print("Available databases:")
spark.sql("SHOW DATABASES").show()

Available databases:
+---------+
|namespace|
+---------+
|  default|
+---------+



In [10]:
# List all tables in the current database
print("\nTables in the current database:")
spark.sql("SHOW TABLES").show()


Tables in the current database:
+---------+--------------------+-----------+
|namespace|           tableName|isTemporary|
+---------+--------------------+-----------+
|  default|  b_info_transportes|      false|
|  default|info_corridas_do_dia|      false|
|  default|  s_info_transportes|      false|
|         |  b_info_transportes|      false|
|         |info_corridas_do_dia|      false|
|         |  s_info_transportes|      false|
+---------+--------------------+-----------+



In [11]:
# Show detailed information about a specific table (here, the Gold table)
print("\nGold table details:")
spark.sql("DESCRIBE EXTENDED info_corridas_do_dia").show(truncate=False)


Gold table details:
+-----------------+-------------+-------+
|col_name         |data_type    |comment|
+-----------------+-------------+-------+
|dt_refe          |date         |null   |
|qt_corr          |int          |null   |
|qt_corr_neg      |int          |null   |
|qt_corr_pess     |int          |null   |
|vl_max_dist      |decimal(17,2)|null   |
|vl_min_dist      |decimal(17,2)|null   |
|vl_avg_dist      |decimal(17,2)|null   |
|qt_corr_reuni    |int          |null   |
|qt_corr_nao_reuni|int          |null   |
+-----------------+-------------+-------+



In [12]:
# Gold rows with at least one business ride
spark.sql("SELECT * FROM info_corridas_do_dia WHERE qt_corr_neg > 0").show(truncate=False)

+----------+-------+-----------+------------+-----------+-----------+-----------+-------------+-----------------+
|dt_refe   |qt_corr|qt_corr_neg|qt_corr_pess|vl_max_dist|vl_min_dist|vl_avg_dist|qt_corr_reuni|qt_corr_nao_reuni|
+----------+-------+-----------+------------+-----------+-----------+-----------+-------------+-----------------+
|2016-12-03|2      |1          |1           |22.00      |19.00      |20.50      |0            |0                |
|2016-02-05|2      |2          |0           |39.00      |22.00      |30.50      |0            |2                |
|2016-01-09|3      |3          |0           |22.00      |106.00     |47.00      |0            |0                |
|2016-05-07|5      |5          |0           |99.00      |12.00      |60.40      |1            |2                |
|2016-04-12|2      |2          |0           |34.00      |29.00      |31.50      |0            |2                |
|2016-01-12|4      |4          |0           |55.00      |29.00      |42.00      |1      