#1.0 Diagnosing data health

##1.1 Installing the required libraries

In [0]:
%pip uninstall -y numpy pandas ydata-profiling

Found existing installation: numpy 1.26.4
Not uninstalling numpy at /databricks/python3/lib/python3.12/site-packages, outside environment /local_disk0/.ephemeral_nfs/envs/pythonEnv-7830ba31-acd4-475b-b71a-ddab187720e0
Can't uninstall 'numpy'. No files were found to uninstall.
Found existing installation: pandas 1.5.3
Not uninstalling pandas at /databricks/python3/lib/python3.12/site-packages, outside environment /local_disk0/.ephemeral_nfs/envs/pythonEnv-7830ba31-acd4-475b-b71a-ddab187720e0
Can't uninstall 'pandas'. No files were found to uninstall.
[0m[43mNote: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.[0m


In [0]:
dbutils.library.restartPython()

In [0]:
%pip install sweetviz

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
dbutils.library.restartPython()
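
The uninstall output above reports that numpy 1.26.4 and pandas 1.5.3 live outside the notebook environment and were not removed, so it may be worth confirming which versions are actually active after the restart. The following is a minimal sketch using only the standard library; the helper name `versoes_instaladas` is hypothetical and not part of the original notebook.

```python
from importlib import metadata


def versoes_instaladas(pacotes):
    """Return {package: version or None} for the given distribution names.

    Hypothetical helper, not part of the original notebook.
    """
    versoes = {}
    for nome in pacotes:
        try:
            versoes[nome] = metadata.version(nome)
        except metadata.PackageNotFoundError:
            # The package is not installed in the active environment
            versoes[nome] = None
    return versoes


# Package names taken from the install cells above
for nome, versao in versoes_instaladas(["numpy", "pandas", "sweetviz"]).items():
    print(f"{nome}: {versao or 'not installed'}")
```

If the printed versions differ from what the report library expects, that would suggest pinning versions in the `%pip install` cell rather than uninstalling the cluster-level packages.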

#2.0 Importing the libraries required for data quality checks

In [0]:
import pandas as pd
import gc
import sweetviz as sv 

#4.0 Defining the path components

In [0]:
catalog = "workspace_ecommerce"
schema = "bronze"
volume = "reports"

#5.0 Output path

In [0]:
caminho_saida = f"/Volumes/{catalog}/{schema}/{volume}/"
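
The output path follows the Unity Catalog volume layout `/Volumes/<catalog>/<schema>/<volume>/`, and later in this notebook each report is written there as `report_<table>.html`. As a rough sketch of how those pieces combine, the helper below builds the full file path; the name `caminho_relatorio` is hypothetical and is shown only for illustration.

```python
import posixpath


def caminho_relatorio(catalog, schema, volume, nome_tabela):
    """Build the HTML report path inside a volume.

    Hypothetical helper, not part of the original notebook.
    """
    # Volume paths always use forward slashes, so posixpath is used explicitly
    pasta = posixpath.join("/Volumes", catalog, schema, volume)
    return posixpath.join(pasta, f"report_{nome_tabela}.html")


print(caminho_relatorio("workspace_ecommerce", "bronze", "reports", "customers"))
# -> /Volumes/workspace_ecommerce/bronze/reports/report_customers.html
```

Joining the components this way avoids depending on the trailing slash that the f-string version has to include by hand.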

#6.0 Generating quality reports for the Delta tables in the bronze layer

In [0]:
def gerar_relatorios_qualidade():
    """Generate one Sweetviz HTML report per table in the bronze schema."""
    # List the tables in the bronze schema
    tabelas = [t.name for t in spark.catalog.listTables(f"{catalog}.{schema}")]

    print(f"Tables found in schema '{schema}': {tabelas}\n")

    for nome_tabela in tabelas:
        df_pandas = None
        try:
            print(f"Processing Delta table: {nome_tabela}")

            # Read the Delta table
            df_spark = spark.read.table(f"{catalog}.{schema}.{nome_tabela}")

            # Memory guard: sample tables above 100,000 rows down to roughly that size
            qtd_linhas = df_spark.count()
            if qtd_linhas > 100000:
                print(f"Large table ({qtd_linhas} rows). Using sampling")
                df_pandas = df_spark.sample(fraction=100000/qtd_linhas, seed=42).toPandas()
            else:
                df_pandas = df_spark.toPandas()

            # Build the report in memory
            report = sv.analyze(df_pandas)

            # Write the HTML file to the 'reports' volume
            caminho_arquivo = f"{caminho_saida}report_{nome_tabela}.html"

            report.show_html(filepath=caminho_arquivo, open_browser=False, layout='vertical')
            print(f"Report saved: {caminho_arquivo}")

        except Exception as e:
            print(f"Error in {nome_tabela}: {e}")

        finally:
            # Drop the pandas copy and force a collection to free RAM before the next table
            if df_pandas is not None:
                del df_pandas
            gc.collect()
            print("-" * 30)

#7.0 Running the HTML data quality report function

In [0]:
gerar_relatorios_qualidade()

Tables found in schema 'bronze': ['customers', 'geolocation', 'order_items', 'order_payments', 'order_reviews', 'orders', 'product_category_name_translation', 'products', 'sellers']

Processing Delta table: customers
Report /Volumes/workspace_ecommerce/bronze/reports/report_customers.html was generated.
Report saved: /Volumes/workspace_ecommerce/bronze/reports/report_customers.html
------------------------------
Processing Delta table: geolocation
Large table (1000163 rows). Using sampling
Report /Volumes/workspace_ecommerce/bronze/reports/report_geolocation.html was generated.
Report saved: /Volumes/workspace_ecommerce/bronze/reports/report_geolocation.html
------------------------------
Processing Delta table: order_items
Large table (112650 rows). Using sampling
Report /Volumes/workspace_ecommerce/bronze/reports/report_order_items.html was generated.
Report saved: /Volumes/workspace_ecommerce/bronze/reports/report_order_items.html
------------------------------
Processing Delta table: order_payments
Large table (103886 rows). Using sampling
Report /Volumes/workspace_ecommerce/bronze/reports/report_order_payments.html was generated.
Report saved: /Volumes/workspace_ecommerce/bronze/reports/report_order_payments.html
------------------------------
Processing Delta table: order_reviews
Large table (104162 rows). Using sampling
Report /Volumes/workspace_ecommerce/bronze/reports/report_order_reviews.html was generated.
Report saved: /Volumes/workspace_ecommerce/bronze/reports/report_order_reviews.html
------------------------------
Processing Delta table: orders
Report /Volumes/workspace_ecommerce/bronze/reports/report_orders.html was generated.
Report saved: /Volumes/workspace_ecommerce/bronze/reports/report_orders.html
------------------------------
Processing Delta table: product_category_name_translation
Report /Volumes/workspace_ecommerce/bronze/reports/report_product_category_name_translation.html was generated.
Report saved: /Volumes/workspace_ecommerce/bronze/reports/report_product_category_name_translation.html
------------------------------
Processing Delta table: products
Report /Volumes/workspace_ecommerce/bronze/reports/report_products.html was generated.
Report saved: /Volumes/workspace_ecommerce/bronze/reports/report_products.html
------------------------------
Processing Delta table: sellers
Report /Volumes/workspace_ecommerce/bronze/reports/report_sellers.html was generated.
Report saved: /Volumes/workspace_ecommerce/bronze/reports/report_sellers.html
------------------------------
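
The run log above shows four tables crossing the 100,000-row threshold and being sampled. To make the arithmetic behind that memory guard concrete, here is a small sketch that reproduces the fraction passed to `sample` for each of them. The helper name `fracao_amostragem` and its `limite` parameter are hypothetical; the notebook computes the same ratio inline.

```python
def fracao_amostragem(qtd_linhas, limite=100_000):
    """Fraction to pass to DataFrame.sample so that about `limite` rows are kept.

    Hypothetical helper; mirrors the inline expression 100000/qtd_linhas.
    """
    if qtd_linhas <= limite:
        return 1.0  # small table: every row is kept
    return limite / qtd_linhas


# Row counts reported in the run log above
tabelas_grandes = [
    ("geolocation", 1_000_163),
    ("order_items", 112_650),
    ("order_payments", 103_886),
    ("order_reviews", 104_162),
]

for tabela, linhas in tabelas_grandes:
    print(f"{tabela}: fraction={fracao_amostragem(linhas):.4f}")
```

Spark's `sample` keeps each row with the given probability, so the resulting row count is only approximately 100,000 and can land slightly above or below it. If a hard cap matters, chaining a `limit` after the sample would be one option, at the cost of a less uniform sample.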
