# 02 - Review Analysis (Silver Layer) - OPTIMIZED VERSION

This notebook processes the review data, applying a sequence of quality filters.

**OPTIMIZATIONS APPLIED:**
- ✅ More memory (4GB driver, 4GB executor)
- ✅ Strategic use of cache/persist
- ✅ Unnecessary intermediate counts removed
- ✅ Optimized partitioning
- ✅ Incremental processing

**Filtering Logic:**
1. City filter (join with business)
2. User filter (>= 10 reviews)
3. Restaurant filter (>= 20 reviews)

**Output**: `data/silver/review`
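
The three filters above can be sketched in plain Python before reading the Spark cells. This is a minimal illustration, not the notebook's implementation: the function name `filter_reviews`, the `(user_id, business_id)` tuple layout, and the toy data are assumptions made for the example, and the thresholds are scaled down to 2 so the toy data survives.

```python
from collections import Counter

MIN_USER_REVIEWS = 10      # same thresholds as the notebook
MIN_BUSINESS_REVIEWS = 20


def filter_reviews(reviews, valid_business_ids,
                   min_user_reviews=MIN_USER_REVIEWS,
                   min_business_reviews=MIN_BUSINESS_REVIEWS):
    """Apply the three filters once each, in order.

    `reviews` is a list of (user_id, business_id) tuples (hypothetical layout).
    """
    # Step 1: keep reviews whose business is in the already filtered business table
    step1 = [r for r in reviews if r[1] in valid_business_ids]

    # Step 2: keep users with enough reviews, counted on the step 1 result
    user_counts = Counter(user for user, _ in step1)
    step2 = [r for r in step1 if user_counts[r[0]] >= min_user_reviews]

    # Step 3: keep businesses with enough reviews, counted on the step 2 result
    business_counts = Counter(business for _, business in step2)
    return [r for r in step2 if business_counts[r[1]] >= min_business_reviews]


# Toy data: "b9" is outside the selected cities, "u3" has a single review
toy_reviews = [("u1", "b1"), ("u1", "b2"), ("u2", "b1"),
               ("u2", "b2"), ("u3", "b1"), ("u1", "b9")]
toy_result = filter_reviews(toy_reviews, {"b1", "b2"},
                            min_user_reviews=2, min_business_reviews=2)
print(toy_result)
```

Each count is taken on the output of the previous step, which is why the order of the filters matters.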

In [1]:
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, to_timestamp
import pyspark.sql.functions as F

In [2]:
# ⚡ OPTIMIZED MEMORY CONFIGURATION
spark = SparkSession.builder \
    .appName("Reviews Analysis - Silver Layer (Optimized)") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "4g") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.sql.shuffle.partitions", "200") \
    .config("spark.default.parallelism", "100") \
    .getOrCreate()

print(f"✅ Spark version: {spark.version}")
print("📊 Driver memory: 4GB")
print("📊 Executor memory: 4GB")

✅ Spark version: 3.5.0
📊 Driver memory: 4GB
📊 Executor memory: 4GB


In [3]:
BASE_PATH = '/home/jovyan/work'
DATA_PATH = f'{BASE_PATH}/data'
BRONZE_PATH = f'{DATA_PATH}/bronze'
SILVER_PATH = f'{DATA_PATH}/silver'

print(f"🥉 Bronze layer: {BRONZE_PATH}")
print(f"🥈 Silver layer: {SILVER_PATH}")

🥉 Bronze layer: /home/jovyan/work/data/bronze
🥈 Silver layer: /home/jovyan/work/data/silver


In [4]:
print("📥 Reading data...\n")

# 1. Read Business (Silver) - already filtered
df_business = spark.read.parquet(f"{SILVER_PATH}/business")
print(f"✅ Businesses loaded: ~{df_business.count():,}")

# 2. Read Reviews (Bronze) - NO COUNT! Avoids scanning the whole dataset at this point
df_reviews_raw = spark.read.parquet(f"{BRONZE_PATH}/review")
print("✅ Reviews loaded (lazy): Bronze layer")

📥 Reading data...

✅ Businesses loaded: ~64,645
✅ Reviews loaded (lazy): Bronze layer


In [12]:
df_reviews_raw.show()

+--------------------+----+-------------------+-----+--------------------+-----+--------------------+------+--------------------+
|         business_id|cool|               date|funny|           review_id|stars|                text|useful|             user_id|
+--------------------+----+-------------------+-----+--------------------+-----+--------------------+------+--------------------+
|XQfwVwDr-v0ZS3_Cb...|   0|2018-07-07 22:09:11|    0|KU_O5udG6zpxOg-Vc...|  3.0|If you decide to ...|     0|mh_-eMZ6K5RLWhZyI...|
|7ATYjTIgM3jUlt4UM...|   1|2012-01-03 15:28:18|    0|BiTunyQ73aT9WBnpR...|  5.0|I've taken a lot ...|     1|OyoGAe7OKpv6SyGZT...|
|YjUWPpI6HXG530lwP...|   0|2014-02-05 20:30:30|    0|saUsX_uimxRlCVr67...|  3.0|Family diner. Had...|     0|8g_iMtfSiwikVnbP2...|
|kxX2SOes4o-D3ZQBk...|   1|2015-01-04 00:01:03|    0|AqPFMleE6RsU23_au...|  5.0|Wow!  Yummy, diff...|     1|_7bHUi9Uuf5__HHc_...|
|e4Vwtrqf-wpJfwesg...|   1|2017-01-14 20:54:15|    0|Sx8TMOWLNuJBWer-0...|  4.0|Cute inter

In [13]:
df_reviews_raw = df_reviews_raw.select("business_id", "date", "stars", "user_id")

In [14]:
print("🔧 Step 1: Filter reviews by city...\n")

# Broadcast join - the list of business IDs is small
valid_business_ids = df_business.select('business_id')

df_reviews_step1 = df_reviews_raw.join(
    F.broadcast(valid_business_ids),
    on='business_id',
    how='inner'
)

# ⚡ PERSIST here - this DataFrame is reused multiple times
df_reviews_step1.persist()

# A single count to verify (it also materializes the cache)
count_step1 = df_reviews_step1.count()
print(f"✅ Reviews after city filter: {count_step1:,}")

🔧 Step 1: Filter reviews by city...

✅ Reviews after city filter: 3,821,238


In [15]:
print("🔧 Step 2: Filter users (>= 10 reviews)...\n")

MIN_USER_REVIEWS = 10

# Count reviews per user on the step 1 result
user_counts = df_reviews_step1 \
    .groupBy('user_id') \
    .count() \
    .withColumnRenamed('count', 'user_review_count')

# Broadcast join - the list of valid users is small
valid_users = user_counts \
    .filter(col('user_review_count') >= MIN_USER_REVIEWS) \
    .select('user_id')

df_reviews_step2 = df_reviews_step1.join(
    F.broadcast(valid_users),
    on='user_id',
    how='inner'
)

# ⚡ PERSIST - this DataFrame is reused in the next step
df_reviews_step2.persist()

# The count materializes step2 while step1 is still cached
count_step2 = df_reviews_step2.count()

# Release step1 only now; unpersisting before the count would force
# step1 to be recomputed from the Bronze layer
df_reviews_step1.unpersist()

print(f"✅ Reviews after user filter: {count_step2:,}")
print(f"   (Removed: {count_step1 - count_step2:,})")

🔧 Step 2: Filter users (>= 10 reviews)...

✅ Reviews after user filter: 1,474,353
   (Removed: 2,346,885)


In [16]:
print("🔧 Step 3: Filter restaurants (>= 20 reviews)...\n")

MIN_BUSINESS_REVIEWS = 20

# Count reviews per business on the step 2 result
business_counts = df_reviews_step2 \
    .groupBy('business_id') \
    .count() \
    .withColumnRenamed('count', 'business_review_count')

# Broadcast join
valid_businesses_final = business_counts \
    .filter(col('business_review_count') >= MIN_BUSINESS_REVIEWS) \
    .select('business_id')

df_reviews_final = df_reviews_step2.join(
    F.broadcast(valid_businesses_final),
    on='business_id',
    how='inner'
)

# ⚡ Final PERSIST
df_reviews_final.persist()

# The count materializes the final result while step2 is still cached
count_final = df_reviews_final.count()

# Release step2 only now; unpersisting before the count would force
# step2 to be recomputed
df_reviews_step2.unpersist()

print(f"✅ Final reviews: {count_final:,}")
print(f"   (Removed: {count_step2 - count_final:,})")

🔧 Step 3: Filter restaurants (>= 20 reviews)...

✅ Final reviews: 1,199,452
   (Removed: 274,901)
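
Because each filter runs only once, the user threshold is not re-checked after step 3: dropping restaurants also drops some reviews written by users who had passed step 2, so a user may end up below the minimum. The sketch below shows this on toy data with both thresholds scaled down to 2. The helper `min_reviews_per_key` and the tuple layout are hypothetical, introduced only for this illustration.

```python
from collections import Counter


def min_reviews_per_key(reviews, key_index):
    """Smallest review count held by any user (key_index=0) or business (key_index=1)."""
    counts = Counter(r[key_index] for r in reviews)
    return min(counts.values()) if counts else 0


# Toy (user_id, business_id) pairs that already passed the user filter:
# every user has exactly 2 reviews
toy_step2 = [("u1", "b1"), ("u2", "b1"), ("u1", "b2"), ("u2", "b3")]

# Restaurant filter applied once, with a toy threshold of 2 reviews
toy_business_counts = Counter(business for _, business in toy_step2)
toy_final = [r for r in toy_step2 if toy_business_counts[r[1]] >= 2]

min_user_before = min_reviews_per_key(toy_step2, 0)    # user minimum before step 3
min_user_after = min_reviews_per_key(toy_final, 0)     # user minimum after step 3
min_business_after = min_reviews_per_key(toy_final, 1)
print(min_user_before, min_user_after, min_business_after)
```

On the real data, a comparable check would be to group `df_reviews_final` by `user_id`, count, and take the minimum of that count. Whether the minimum actually falls below 10 here is not shown in this notebook.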


In [17]:
print("📊 Final Statistics:\n")

# Uses the already persisted DataFrame - no reprocessing
n_reviews = count_final  # We already have this value
n_users = df_reviews_final.select('user_id').distinct().count()
n_items = df_reviews_final.select('business_id').distinct().count()

print(f"  - Total Reviews: {n_reviews:,}")
print(f"  - Total Users: {n_users:,}")
print(f"  - Total Restaurants: {n_items:,}")
print(f"  - Sparsity: {100 * (1 - (n_reviews / (n_users * n_items))):.2f}%")

📊 Final Statistics:

  - Total Reviews: 1,199,452
  - Total Users: 56,555
  - Total Restaurants: 14,674
  - Sparsity: 99.86%
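
The sparsity figure can be reproduced by hand from the three totals printed above. The sketch below repeats the same arithmetic as the cell. The function name `sparsity_pct` is an assumption made for the example, and the formula treats every review as a distinct (user, restaurant) pair, which the notebook does not verify.

```python
def sparsity_pct(n_reviews, n_users, n_items):
    """Percentage of empty cells in the user x restaurant matrix."""
    possible_pairs = n_users * n_items
    return 100 * (1 - n_reviews / possible_pairs)


# Totals printed by the statistics cell
n_reviews, n_users, n_items = 1_199_452, 56_555, 14_674

possible_pairs = n_users * n_items
sparsity = sparsity_pct(n_reviews, n_users, n_items)

print(f"Possible pairs: {possible_pairs:,}")   # 829,888,070
print(f"Sparsity: {sparsity:.2f}%")            # 99.86%
```

If some users reviewed the same restaurant more than once, counting distinct pairs instead of reviews would give a slightly higher sparsity.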


In [18]:
print("💾 Saving data to the Silver layer...\n")

output_path = f'{SILVER_PATH}/review'

# ⚡ REPARTITION instead of COALESCE(1)
# Keeps parallelism and avoids blowing up memory
df_reviews_final \
    .repartition(10) \
    .write \
    .mode('overwrite') \
    .parquet(output_path)

print(f"✅ Data saved to: {output_path}")
print("📦 Partitions: 10 (better read performance)")

# Final memory release
df_reviews_final.unpersist()

💾 Saving data to the Silver layer...

✅ Data saved to: /home/jovyan/work/data/silver/review
📦 Partitions: 10 (better read performance)


DataFrame[business_id: string, user_id: string, date: string, stars: double]

In [19]:
print("\n🧹 Final cleanup...")
spark.catalog.clearCache()
print("✅ Cache cleared!")


🧹 Final cleanup...
✅ Cache cleared!


In [20]:
df_reviews_final.show()

+--------------------+--------------------+-------------------+-----+
|         business_id|             user_id|               date|stars|
+--------------------+--------------------+-------------------+-----+
|otQS34_MymijPTdNB...|4Uh27DgGzsp6PqrH9...|2011-10-27 17:12:05|  4.0|
|rBdG_23USc7DletfZ...|j2wlzrntrbKwyOcOi...|2014-08-10 19:41:43|  4.0|
|eFvzHawVJofxSnD7T...|IQsF3Rc6IgCzjVV9D...|2014-11-12 15:30:27|  5.0|
|rjuWz_AD3WfXJc03A...|vrKkXsozqqecF3CW4...|2012-12-04 16:46:20|  5.0|
|kq5Ghhh14r-eCxlVm...|aFa96pz67TwOFu4We...|2018-08-23 21:39:38|  5.0|
|j8JOZvfeHEfUWq3gE...|Z2cOL3n9V8NoguJ-u...|2014-06-11 14:55:14|  2.0|
|I6L0Zxi5Ww0zEWSAV...|S7bjj-L07JuRr-tpX...|2018-07-07 20:50:12|  4.0|
|EtKSTHV5Qx_Q7Aur9...|ZGjgfSvjQK886kiTz...|2009-10-14 01:15:04|  5.0|
|VJEzpfLs_Jnzgqh5A...|IKbjLnfBQtEyVzEu8...|2014-04-01 13:05:18|  4.0|
|oJ4ik-4PZe6gexxW-...|DBYhpb5hrAYgQjQaM...|2016-10-26 15:29:56|  4.0|
|2GYg3liJ9-m6Z67L_...|uAu772KpSkb-tPFgZ...|2008-12-03 04:13:43|  5.0|
|oQ5CPRt0R3AzFvcjN..