# 📦 RentSight: Layered pipeline (Databricks / Spark)

These notebooks are **read-only examples** that reproduce the logic of your `run_pipeline.py` orchestrator, but in **PySpark**.

Layers:
- **Bronze**: ingests the CSV (everything as string) and writes Parquet.
- **Silver**: selects columns, applies safe casts, normalizes/simulates `room_type`, and writes Parquet.
- **Gold**: computes analytical aggregations and writes the final tables as Parquet.

✅ Tip: you can run each notebook on its own (it reads from the previous layer at the default path).


## 🥈 SILVER: Curation + safe casts + `room_type` simulation

Reproduces the logic of your `silver(df_raw)`:
- selects columns
- creates a sequential `id` (`row_number`)
- applies safe casts (invalid → null)
- normalizes `room_type` and **simulates** missing or unrecognized values with `rand(seed=42)`
- drops the original `room_type` and keeps `room_type_simulated`
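
The steps above imply a target shape for the Silver table. As a rough summary, the sketch below lists the resulting columns and types as a plain Python dataclass. The class name `SilverListing` is hypothetical and not part of the pipeline; the types are inferred from the casts applied in this notebook, and `neighbourhood` is assumed to stay a string as ingested by Bronze.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import Optional


@dataclass
class SilverListing:
    """Hypothetical per-row view of the Silver table (illustration only)."""
    id: int                            # sequential id from row_number (long)
    neighbourhood: Optional[str]       # assumed to remain a string
    latitude: Optional[float]          # double; null when the cast fails
    longitude: Optional[float]
    price: Optional[float]
    minimum_nights: Optional[int]
    number_of_reviews: Optional[int]
    last_review: Optional[date]
    reviews_per_month: Optional[float]
    availability_365: Optional[int]
    room_type_simulated: str           # recognized or simulated, never null


# Column order matches select() followed by the added/dropped columns.
silver_columns = [f.name for f in fields(SilverListing)]
print(silver_columns)
```

Note that the original `room_type` does not appear: only `room_type_simulated` survives into Silver.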


In [0]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

bronze_in_path = '/Volumes/rentsight/bronze/listings_bronze_rj'
silver_out_path = '/Volumes/rentsight/silver/listings_silver_rj'

print('BRONZE IN:', bronze_in_path)
print('SILVER OUT:', silver_out_path)


In [0]:
df_raw = spark.read.parquet(bronze_in_path)
print('Rows:', df_raw.count())
display(df_raw.limit(5))


In [0]:
cols = [
    'id', 'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
    'minimum_nights', 'number_of_reviews', 'last_review',
    'reviews_per_month', 'availability_365'
]

missing = [c for c in cols if c not in df_raw.columns]
if missing:
    raise ValueError(f'Columns missing from RAW/BRONZE: {missing}')

df = df_raw.select(*cols)


In [0]:
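# Replace the source `id` with a sequential 1..N id. A window without
# partitionBy pulls all rows into one partition, so this suits small datasets.
w = Window.orderBy(F.monotonically_increasing_id())
df = df.withColumn('id', F.row_number().over(w).cast('long'))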


In [0]:
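# Cast each column to its target type. With ANSI mode off
# (spark.sql.ansi.enabled=false) an unparseable value becomes null;
# with ANSI mode on, the cast raises an error instead.
df = (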
    df
    .withColumn('price', F.col('price').cast('double'))
    .withColumn('minimum_nights', F.col('minimum_nights').cast('int'))
    .withColumn('number_of_reviews', F.col('number_of_reviews').cast('int'))
    .withColumn('last_review', F.to_date(F.col('last_review')))
    .withColumn('reviews_per_month', F.col('reviews_per_month').cast('double'))
    .withColumn('availability_365', F.col('availability_365').cast('int'))
    .withColumn('latitude', F.col('latitude').cast('double'))
    .withColumn('longitude', F.col('longitude').cast('double'))
    .withColumn('room_type', F.col('room_type').cast('string'))
)
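
To make the "invalid → null" rule concrete, here is a minimal plain-Python sketch of the same idea, independent of Spark. The helper `safe_cast` and the sample row are hypothetical and only illustrate the semantics; Python's parsing rules differ in detail from Spark's, so this is not a drop-in equivalent of the casts above.

```python
from datetime import date
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def safe_cast(value: Optional[str], caster: Callable[[str], T]) -> Optional[T]:
    """Return caster(value), or None when the value is missing or invalid."""
    if value is None:
        return None
    try:
        return caster(value.strip())
    except ValueError:
        return None


# Hypothetical Bronze row: every field arrives as a string.
raw_row = {"price": "120.50", "minimum_nights": "abc", "last_review": "2023-07-15"}

typed_row = {
    "price": safe_cast(raw_row["price"], float),
    "minimum_nights": safe_cast(raw_row["minimum_nights"], int),  # invalid -> None
    "last_review": safe_cast(raw_row["last_review"], date.fromisoformat),
}
print(typed_row)
```

The point of the design is that one bad value nulls a single field rather than failing the whole load; downstream layers then have to tolerate nulls.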


In [0]:
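# Normalize case and surrounding spaces before matching the known categories.
norm = F.lower(F.trim(F.col('room_type')))

# No otherwise(): a null or unrecognized value yields null and is simulated later.
clean = (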
    F.when(norm == F.lit('entire home/apt'), F.lit('Entire home/apt'))
     .when(norm == F.lit('private room'), F.lit('Private room'))
     .when(norm == F.lit('shared room'), F.lit('Shared room'))
     .when(norm == F.lit('hotel room'), F.lit('Hotel room'))
)

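# Materialize one draw per row so every branch below compares the same value.
# Reusing F.rand(seed=42) directly in several when() conditions may evaluate
# each reference as a separate draw, which can skew the intended 25% split.
# The seed only reproduces results while the partitioning stays the same.
df = df.withColumn('_rand', F.rand(seed=42))
r = F.col('_rand')

simulated_when_missing = (
    F.when(r < 0.25, F.lit('Entire home/apt'))
     .when(r < 0.50, F.lit('Private room'))
     .when(r < 0.75, F.lit('Shared room'))
     .otherwise(F.lit('Hotel room'))
)

# Prefer the recognized value; fall back to the simulated one.
df = df.withColumn(
    'room_type_simulated',
    F.coalesce(clean, simulated_when_missing).cast('string')
)

# Keep only the simulated column; drop the original and the helper draw.
df = df.drop('room_type', '_rand')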

display(df.limit(5))
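
The `room_type` rule in this cell can be read as a per-row function: keep a recognized category, otherwise map a single uniform draw to one of four equally likely categories. The sketch below restates that in plain Python. The names `CANONICAL`, `simulate_room_type`, and `resolve_room_type` are hypothetical, and Python's `random.Random(42)` is not Spark's generator, so the simulated values will not match the notebook's output.

```python
import random
from typing import Optional

# Normalized spelling -> canonical label, mirroring the when() chain.
CANONICAL = {
    "entire home/apt": "Entire home/apt",
    "private room": "Private room",
    "shared room": "Shared room",
    "hotel room": "Hotel room",
}


def simulate_room_type(u: float) -> str:
    """Map one uniform draw u in [0, 1) to a category, 25% each."""
    if u < 0.25:
        return "Entire home/apt"
    if u < 0.50:
        return "Private room"
    if u < 0.75:
        return "Shared room"
    return "Hotel room"


def resolve_room_type(raw: Optional[str], rng: random.Random) -> str:
    """Keep a recognized value; otherwise simulate one from a single draw."""
    clean = CANONICAL.get(raw.strip().lower()) if raw is not None else None
    return clean if clean is not None else simulate_room_type(rng.random())


rng = random.Random(42)  # illustrative seed; not comparable to F.rand(seed=42)
samples = ["  PRIVATE ROOM ", None, "loft", "Hotel room"]
print([resolve_room_type(s, rng) for s in samples])
```

An unrecognized value such as `"loft"` is treated like a missing one, which matches the `coalesce` fallback in the cell above.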


In [0]:
(
    df
    .write
    .mode('overwrite')
    .parquet(silver_out_path)
)
print('✅ Silver written to:', silver_out_path)
