# 🎧 Spotify Data Monitor & Analytics
**Owner:** Matheus <br>
**Status:** 🚧 In Development (Bronze Phase) <br>
**Last Updated:** 2026-02-03 <br>

---

## 🎯 Objective
Data engineering pipeline that processes Spotify's extended streaming history (GDPR data export), separating audio and video, to monitor consumption habits.

## 📑 Table of Contents
1. [Setup and Libraries](#setup-e-bibliotecas)
2. [Database Definition](#definição-do-database)
3. [Bronze Layer: Unified Ingestion](#camada-bronze-ingestão-unificada)
4. [Load Validation](#validação-da-carga)

---

## ⚙️ Initial Setup

In [0]:
%pip install openpyxl

In [0]:
from pyspark.sql.functions import current_timestamp, input_file_name, col, sum as _sum
import pandas as pd
import json
import time

In [0]:
PATH_ORIGEM = "/Volumes/sandbox_prd/raw_layer/files/spotify/me/extended/Streaming_History_*.json"
NOME_TABELA_DESTINO = "sandbox_prd.bronze_layer.streaming_history_user_spotify"

## 🥉 Bronze Layer
Reads the JSON files and writes them to a Delta table.

In [0]:
# 1. Read the JSON files (schema is inferred)
df_input = (spark.read
            .format("json")
            .option("multiline", "true") 
            .option("inferSchema", "true") 
            .load(PATH_ORIGEM))

# 2. Enrichment (using _metadata under Unity Catalog)
# Note: col("_metadata.file_path") is used here instead of the older input_file_name() function
df_bronze = df_input.select(
    "*", 
    current_timestamp().alias("dt_ingestion"), 
    col("_metadata.file_path").alias("source_file") 
)

# 3. Write to the Delta table (schema evolution enabled)
(df_bronze.write
    .format("delta")
    .mode("overwrite")              
    .option("mergeSchema", "true")  
    .saveAsTable(NOME_TABELA_DESTINO)
)

print(f"✅ Load completed successfully into: {NOME_TABELA_DESTINO}")
display(df_bronze.limit(5))
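
The stated goal of separating audio from video could build on the `source_file` column captured during ingestion. The following is a minimal sketch in plain Python, assuming the export files are named like `Streaming_History_Audio_<years>_<n>.json` and `Streaming_History_Video_<years>.json`; the helper `media_type_from_path` and the sample paths are hypothetical and should be checked against the actual file names.

```python
import re
from typing import Optional

# Assumed naming convention of the export files (verify against your own data):
#   Streaming_History_Audio_2019-2021_0.json
#   Streaming_History_Video_2017-2024.json
_MEDIA_PATTERN = re.compile(r"Streaming_History_(Audio|Video)_", re.IGNORECASE)


def media_type_from_path(source_file: str) -> Optional[str]:
    """Return "audio" or "video" based on the file name, or None if unrecognized."""
    match = _MEDIA_PATTERN.search(source_file)
    return match.group(1).lower() if match else None


# Hypothetical sample paths, for illustration only
for path in [
    "/some/volume/Streaming_History_Audio_2019-2021_0.json",
    "/some/volume/Streaming_History_Video_2017-2024.json",
    "/some/volume/unrelated.json",
]:
    print(path, "->", media_type_from_path(path))
```

In Spark, the same rule could probably be expressed with `regexp_extract` on `source_file`, which avoids a Python UDF.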

## 🔍 Quality Check

Quick queries to sanity-check the loaded data.

In [0]:
%sql
-- Validation 1: Row count per source file
SELECT 
    source_file,
    count(*) as total_linhas
FROM sandbox_prd.bronze_layer.streaming_history_user_spotify
GROUP BY source_file
ORDER BY source_file

In [0]:
%sql
-- Validation 2: Date range of the data (min and max)
SELECT 
    min(ts) as primeira_reproducao,
    max(ts) as ultima_reproducao,
    count(*) as total_geral
FROM sandbox_prd.bronze_layer.streaming_history_user_spotify
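
One point worth checking for the date-range query: if `ts` was inferred as a string rather than a timestamp, `min` and `max` compare text, not dates. That still gives the right answer when every value uses the same ISO-8601 UTC layout. The sketch below illustrates this in plain Python, assuming a `YYYY-MM-DDTHH:MM:SSZ` format; the sample values are hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical sample values in the assumed "YYYY-MM-DDTHH:MM:SSZ" format
sample_ts = [
    "2021-03-04T12:34:56Z",
    "2019-11-30T23:59:59Z",
    "2021-03-04T01:00:00Z",
]


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 UTC string into a timezone-aware datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# String comparison versus true chronological comparison
lexicographic = (min(sample_ts), max(sample_ts))
chronological = (min(sample_ts, key=parse_ts), max(sample_ts, key=parse_ts))

print("lexicographic:", lexicographic)
print("chronological:", chronological)
```

If the two results ever differ on real data, the format is not uniform and `ts` should be cast to a timestamp before aggregating.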

In [0]:
# Validation 3: Check the consistency of the key fields
df_check = spark.read.table("sandbox_prd.bronze_layer.streaming_history_user_spotify")

# Count the nulls in the key columns
df_check.select(
    _sum(col("master_metadata_track_name").isNull().cast("int")).alias("nulos_track_name"),
    _sum(col("master_metadata_album_artist_name").isNull().cast("int")).alias("nulos_artist_name"),
    _sum(col("ts").isNull().cast("int")).alias("nulos_timestamp")
).display()
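
A null track name is not necessarily a data quality problem. The sketch below assumes that podcast plays populate `episode_name` and leave `master_metadata_track_name` empty, which should be confirmed against the real data. The function `classify_record` and the sample records are hypothetical, and only records with neither field set are treated as suspicious.

```python
from collections import Counter
from typing import Any, Mapping


def classify_record(record: Mapping[str, Any]) -> str:
    """Label a streaming record as a track, an episode, or unidentified."""
    if record.get("master_metadata_track_name") is not None:
        return "track"
    if record.get("episode_name") is not None:
        return "episode"
    return "unidentified"


# Hypothetical sample records, for illustration only
sample = [
    {"master_metadata_track_name": "Song A", "episode_name": None},
    {"master_metadata_track_name": None, "episode_name": "Episode 1"},
    {"master_metadata_track_name": None, "episode_name": None},
]

counts = Counter(classify_record(r) for r in sample)
print(dict(counts))
```

Under that assumption, only the `unidentified` count would point to a real gap in the load.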

## 🥈 Silver Layer

Data cleaning and transformation.

In [0]:
%sql
--desc sandbox_prd.bronze_layer.streaming_history_user_spotify
select * from sandbox_prd.bronze_layer.streaming_history_user_spotify

In [0]:
# Read the Bronze table
df_bronze = spark.read.table("sandbox_prd.bronze_layer.streaming_history_user_spotify")

# Columns to drop
drop_columns = [
    "audiobook_chapter_title",
    "audiobook_chapter_uri",
    "audiobook_title",
    "audiobook_uri",
    "conn_country",
    "incognito_mode",
    "ip_addr",
    "dt_ingestion",
    "source_file"
]

# Drop the unneeded columns
df_silver = df_bronze.drop(*drop_columns)

In [0]:
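# Columns remaining after the drop:
# ['episode_name', 'episode_show_name', 'master_metadata_album_album_name', 'master_metadata_album_artist_name', 'master_metadata_track_name', 'ms_played', 'offline', 'offline_timestamp', 'platform', 'reason_end', 'reason_start', 'shuffle', 'skipped', 'spotify_episode_uri', 'spotify_track_uri', 'ts']

# Rename the columns. The old -> new mapping has not been defined yet,
# so this select is a no-op until rename_map is filled in.
rename_map = {}
df_silver = df_silver.select(
    [col(c).alias(rename_map.get(c, c)) for c in df_silver.columns]
)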
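
One possible way to derive the new column names is to strip the verbose `master_metadata_` prefixes, which gives short names in the style of the `nulos_track_name` and `nulos_artist_name` aliases used in the quality check. This is only a sketch: the helper `short_name`, the prefix list, and the resulting names are assumptions, not a naming convention defined by this notebook.

```python
# Prefixes to strip, longest first so the more specific one wins (assumed convention)
PREFIXES = ("master_metadata_album_", "master_metadata_")


def short_name(column: str) -> str:
    """Strip the first matching prefix; leave other columns unchanged."""
    for prefix in PREFIXES:
        if column.startswith(prefix):
            return column[len(prefix):]
    return column


columns = [
    "episode_name", "episode_show_name", "master_metadata_album_album_name",
    "master_metadata_album_artist_name", "master_metadata_track_name",
    "ms_played", "offline", "offline_timestamp", "platform", "reason_end",
    "reason_start", "shuffle", "skipped", "spotify_episode_uri",
    "spotify_track_uri", "ts",
]

rename_map = {c: short_name(c) for c in columns}

# Print only the columns whose name actually changes
for old, new in rename_map.items():
    if old != new:
        print(f"{old} -> {new}")
```

A mapping like this could then be applied with `col(c).alias(rename_map[c])` inside a `select`, after checking that no two columns collapse to the same name.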
