# ETL flow from the Bronze layer to the Silver layer

## Introduction

In the Bronze layer, the raw data (not yet cleaned or treated) is stored in the file `TMDB_movie_dataset_v11.csv`. This pipeline performs the following operations:

- **Clean and standardize the data**: handle null, duplicate and NaN values;
- **Normalize formats**: dates, categories and numeric columns;
- **Enrich or derive new columns** when needed for future analyses;
- **Ensure data quality** before loading into Silver, which holds more structured data ready for analytical consumption.

The Silver layer stores the **treated, consistent data**, which can be used in exploratory analyses and in Tableau and Power BI dashboards.

The steps of the ETL flow are:

1. Loading the Bronze data (raw CSV);
2. Inspection and initial validation of the columns;
3. Explosion of the composite fields into rows;
4. Removal of columns that will not be used;
5. Handling of null, empty, duplicate and NaN values;
6. Outlier treatment and removal;
7. Storage of the treated data in the Silver layer as CSV;
8. Loading into the database.


## Reading the raw CSV from the Bronze layer

### Library imports, initial configuration and Spark session initialization

In [34]:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode, arrays_zip, col, trim, ltrim, lit, count, isnan, to_timestamp
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType, BooleanType, DateType
import time, os, shutil

# Start the Spark session
jar_path = os.path.abspath("../postgresql-42.7.8.jar")
spark = SparkSession.builder.appName("tmdbEtlBronzeToSilver") \
                            .config("spark.jars", jar_path) \
                            .config("spark.driver.memory", "4g") \
                            .config("spark.executor.memory", "4g") \
                            .getOrCreate()
                            
spark.sparkContext.setLogLevel("ERROR")
print(spark.sparkContext.getConf().get("spark.jars"))

/home/papercut/Documentos/projects/unb/2025-2/SBD2/film-data-analytics/postgresql-42.7.8.jar


### 1. Reading the raw CSV

Definition of the CSV schema for import into the PySpark dataframe, followed by reading the file.

In [25]:
# Define the schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("title", StringType(), True),
    StructField("vote_average", FloatType(), True),
    StructField("vote_count", IntegerType(), True),
    StructField("status", StringType(), True),
    StructField("release_date", DateType(), True),
    StructField("revenue", IntegerType(), True),
    StructField("runtime", IntegerType(), True),
    StructField("adult", BooleanType(), True),
    StructField("backdrop_path", StringType(), True),
    StructField("budget", IntegerType(), True),
    StructField("homepage", StringType(), True),
    StructField("imdb_id", StringType(), True),
    StructField("original_language", StringType(), True),
    StructField("original_title", StringType(), True),
    StructField("overview", StringType(), True),
    StructField("popularity", FloatType(), True),
    StructField("poster_path", StringType(), True),
    StructField("tagline", StringType(), True),
    StructField("genres", StringType(), True),
    StructField("production_companies", StringType(), True),
    StructField("production_countries", StringType(), True),
    StructField("spoken_languages", StringType(), True)
])

# Ingest the file
csv_path = "../DataLayer/bronze/TMDB_movie_dataset_v11.csv"
df = spark.read.csv(csv_path, header=True, schema=schema, sep=',', quote='"', escape='"')
df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- vote_average: float (nullable = true)
 |-- vote_count: integer (nullable = true)
 |-- status: string (nullable = true)
 |-- release_date: date (nullable = true)
 |-- revenue: integer (nullable = true)
 |-- runtime: integer (nullable = true)
 |-- adult: boolean (nullable = true)
 |-- backdrop_path: string (nullable = true)
 |-- budget: integer (nullable = true)
 |-- homepage: string (nullable = true)
 |-- imdb_id: string (nullable = true)
 |-- original_language: string (nullable = true)
 |-- original_title: string (nullable = true)
 |-- overview: string (nullable = true)
 |-- popularity: float (nullable = true)
 |-- poster_path: string (nullable = true)
 |-- tagline: string (nullable = true)
 |-- genres: string (nullable = true)
 |-- production_companies: string (nullable = true)
 |-- production_countries: string (nullable = true)
 |-- spoken_languages: string (nullable = true)
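
The schema above declares `revenue` and `budget` as `IntegerType`, which Spark stores as a 32-bit signed integer. As a hedged sanity check, the plain-Python sketch below tests whether a value fits in that range. The helper name `fits_int32` and the sample amounts are illustrative assumptions, not values read from the dataset. If the CSV does hold amounts above the limit, `LongType` (64-bit) would be the wider alternative.

```python
# Range of a 32-bit signed integer, which is what Spark's IntegerType stores.
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def fits_int32(value: int) -> bool:
    """Return True when `value` can be stored in a 32-bit signed integer."""
    return INT32_MIN <= value <= INT32_MAX


# Illustrative amounts in dollars (hypothetical, not read from the CSV).
samples = {"mid_budget_film": 150_000_000, "record_gross": 2_900_000_000}
for name, amount in samples.items():
    print(name, amount, fits_int32(amount))
```

Comparing the maximum of `revenue` and `budget` in the raw file against `INT32_MAX` would show whether the narrower type is safe for this dataset.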



### 2. Handling composite fields

The film dataset has 4 columns with composite data, as indicated in the [Data dictionary](https://github.com/Eric-chagas/film-data-analytics/blob/main/bronze/dicionario_dados.pdf). These columns hold lists of values separated by commas.

Each row in these columns is exploded into one or more rows in the main dataframe, all linked by the film ID `id`. The column names before and after the explosion are:

1. `genres` -> `genre`
2. `production_companies` -> `production_company`
3. `production_countries` -> `production_country`
4. `spoken_languages` -> `spoken_language`
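
Because the four columns are exploded one after another, each film ends up with one row per combination of genre, company, country and language. The plain-Python sketch below illustrates that row multiplication with `itertools.product`. The helpers `split_trim` and `explode_film` and the sample record are hypothetical; they only mimic the splitting and trimming and do not use PySpark.

```python
from itertools import product


def split_trim(value: str) -> list[str]:
    """Split a comma-separated string and strip surrounding whitespace."""
    return [part.strip() for part in value.split(",")]


def explode_film(film: dict) -> list[dict]:
    """Return one row per combination of the four composite fields."""
    fields = {
        "genre": "genres",
        "production_company": "production_companies",
        "production_country": "production_countries",
        "spoken_language": "spoken_languages",
    }
    value_lists = [split_trim(film[source]) for source in fields.values()]
    return [
        {"id": film["id"], **dict(zip(fields.keys(), combo))}
        for combo in product(*value_lists)
    ]


# Hypothetical record used only for illustration.
film = {
    "id": 1,
    "genres": "Action, Adventure",
    "production_companies": "Studio A, Studio B, Studio C",
    "production_countries": "Brazil",
    "spoken_languages": "Portuguese, English",
}
rows = explode_film(film)
print(len(rows))  # 2 genres * 3 companies * 1 country * 2 languages = 12
```

Two consequences are worth keeping in mind. Row counts taken after the explosion count combinations, not distinct films. Also, PySpark's `explode` emits no row when the array is null or empty, so a film missing any of the four fields drops out of the result; `explode_outer` is the variant that keeps such rows.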

In [26]:
df_split = df.withColumn("genre", split(col("genres"), ",")) \
                .withColumn("production_company", split(col("production_companies"), ",")) \
                .withColumn("production_country", split(col("production_countries"), ",")) \
                .withColumn("spoken_language", split(col("spoken_languages"), ","))

# print('Split columns:')
# df_split.select("genre").distinct().show(truncate=False)
# df_split.select("production_company").distinct().show(truncate=False)
# df_split.select("production_country").distinct().show(truncate=False)
# df_split.select("spoken_language").distinct().show(truncate=False)


# Explode each array into one row per element and trim the resulting values
df_exploded = df_split.withColumn("genre", explode(col('genre'))) \
                        .withColumn("genre", trim(col("genre"))) \
                        .withColumn("production_company", explode(col("production_company"))) \
                        .withColumn("production_company", trim(col("production_company"))) \
                        .withColumn("production_country", explode(col("production_country"))) \
                        .withColumn("production_country", trim(col("production_country"))) \
                        .withColumn("spoken_language", explode(col("spoken_language"))) \
                        .withColumn("spoken_language", trim(col("spoken_language")))

print("Exploded columns:")
df_exploded.select("id", "genre").distinct().show(truncate=False)
df_exploded.select("id","production_company").distinct().show(truncate=False)
df_exploded.select("id","production_country").distinct().show(truncate=False)
df_exploded.select("id","spoken_language").distinct().show(truncate=False)

# Drop the original composite columns
old_cols = ["genres", "production_companies", "production_countries", "spoken_languages"]
df_exploded = df_exploded.drop(*old_cols)

Exploded columns:

+------+---------------+
|id    |genre          |
+------+---------------+
|77338 |Drama          |
|2062  |Family         |
|324857|Adventure      |
|273248|Mystery        |
|8844  |Family         |
|420817|Romance        |
|156022|Crime          |
|51497 |Crime          |
|312221|Action         |
|10192 |Fantasy        |
|331   |Science Fiction|
|44896 |Animation      |
|302946|Crime          |
|324849|Comedy         |
|479455|Comedy         |
|59440 |Drama          |
|1700  |Thriller       |
|941   |Adventure      |
|9732  |Animation      |
|438695|Family         |
+------+---------------+
only showing top 20 rows

+------+-----------------------------+
|id    |production_company           |
+------+-----------------------------+
|752   |DC Comics                    |
|64690 |OddLot Entertainment         |
|161   |Village Roadshow Pictures    |
|76203 |Film4 Productions            |
|110415|Opus Pictures                |
|454626|Original Film                |
|126889|Scott Free Productions       |
|339964|UFA                          |
|395992|Columbia Pictures            |
|43074 |Pascal Pictures              |
|399174|American Empirical Pictures  |
|431   |Viacom Canada                |
|152584|Alcatraz Films               |
|6477  |Regency Enterprises          |
|526896|Arad Productions             |
|4148  |BBC Film                     |
|72976 |Amblin Entertainment         |
|10330 |Gunn Films                   |
|8271  |The Montecito Picture Company|
|44048 |Millbrook Farm Productions   |
+------+-----------------------------+
only showing top 20 rows

+------+------------------------+
|id    |production_country      |
+------+------------------------+
|767   |United States of America|
|652   |Malta                   |
|70    |United States of America|
|198184|South Africa            |
|200727|Belgium                 |
|331482|United States of America|
|9836  |Australia               |
|9480  |United States of America|
|10830 |United States of America|
|301337|Norway                  |
|258216|Denmark                 |
|71859 |United States of America|
|549053|United States of America|
|29427 |United States of America|
|585083|United States of America|
|899112|United States of America|
|9531  |United States of America|
|592   |United States of America|
|746   |Italy                   |
|491472|France                  |
+------+------------------------+
only showing top 20 rows

+------+---------------+
|id    |spoken_language|
+------+---------------+
|271110|English        |
|127585|Russian        |
|393   |Cantonese      |
|218   |English        |
|85    |English        |
|62    |English        |
|181812|English        |
|59436 |German         |
|520763|English        |
|311   |English        |
|1581  |English        |
|428078|English        |
|346910|English        |
|134374|English        |
|72545 |English        |
|153518|English        |
|419479|English        |
|62214 |English        |
|1620  |Serbian        |
|72113 |English        |
+------+---------------+
only showing top 20 rows

### 3. Removing columns that will not be used

Some fields in the dataset add little value to the analysis carried out in this work, so they are removed. Most of the removed columns consist of strings with URLs/paths to images or external resources related to the film. They are:

1. `backdrop_path`
2. `homepage`
3. `poster_path`
4. `imdb_id`

The description of each one can be found in the [Data dictionary](https://github.com/Eric-chagas/film-data-analytics/blob/main/bronze/dicionario_dados.pdf).

In [27]:
# print(len(df_exploded.columns))

columns_to_drop = ["backdrop_path", "homepage", "poster_path", "imdb_id"]
df_refined_cols = df_exploded.drop(*columns_to_drop)

# print(len(df_refined_cols.columns))
# df_refined_cols.show(truncate=False)

In [28]:
print(f"df_overview_null_count = {df_refined_cols.select('overview').filter(col('overview').isNull()).count()}")
print(f"df_tagline_null_count = {df_refined_cols.select('tagline').filter(col('tagline').isNull()).count()}")

df_overview_null_count = 108504
df_tagline_null_count = 1114059

### 4. Handling null, empty, duplicate and NaN values

The business rules applied are:

1. Remove rows with null/NaN/None `revenue`
2. Remove rows with null/NaN/None/NaT `release_date`
3. Remove columns that are 100% null, if any exist
4. Remove rows in which both the film title and the original title (`title` and `original_title`) are empty
5. Remove identical rows, if any exist
6. Fill null `overview` and `tagline` values with placeholder text
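
As a rough illustration of rules 1, 2, 4 and 5, the sketch below applies them to a list of plain Python dictionaries. The functions `is_missing` and `clean_rows` and the sample records are hypothetical stand-ins for the DataFrame operations, chosen so the logic can be read without a Spark session.

```python
import math


def is_missing(value) -> bool:
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def clean_rows(rows: list[dict]) -> list[dict]:
    """Apply the row-level rules to a list of dicts, preserving input order."""
    seen = set()
    cleaned = []
    for row in rows:
        if is_missing(row.get("revenue")):       # rule 1
            continue
        if is_missing(row.get("release_date")):  # rule 2
            continue
        if not (row.get("original_title") or row.get("title")):  # rule 4
            continue
        key = tuple(sorted(row.items()))         # rule 5: identical rows share a key
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical sample records.
sample = [
    {"id": 1, "title": "A", "original_title": "A", "revenue": 10, "release_date": "2020-01-01"},
    {"id": 1, "title": "A", "original_title": "A", "revenue": 10, "release_date": "2020-01-01"},
    {"id": 2, "title": "B", "original_title": "B", "revenue": None, "release_date": "2020-01-01"},
    {"id": 3, "title": "C", "original_title": "C", "revenue": 5, "release_date": None},
    {"id": 4, "title": "", "original_title": "", "revenue": 7, "release_date": "2021-05-05"},
    {"id": 5, "title": "", "original_title": "E", "revenue": 0, "release_date": "2021-05-05"},
]
result = clean_rows(sample)
print([row["id"] for row in result])  # [1, 5]
```

Note that a `revenue` of 0 is kept, since the rule only removes missing values.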

In [29]:
# print(df.select("revenue").filter(col("revenue").isNull()).count())
# print(df.select("revenue").filter(isnan(col("revenue"))).count())
# print(df.select("release_date").filter(col("release_date").isNull()).count())
print(f"df_cols original = {len(df_refined_cols.columns)}")

# Rule 3: drop columns whose values are all null (count() ignores nulls)
non_null_counts = df_refined_cols.select([count(col(c)).alias(c) for c in df_refined_cols.columns]).first().asDict()
all_null_cols = [c for c, n in non_null_counts.items() if n == 0]
df_treated_nulls = df_refined_cols.drop(*all_null_cols) if all_null_cols else df_refined_cols
print(f"df_cols nulls = {len(df_treated_nulls.columns)}")

# Rule 1: drop rows without revenue
df_treated_nulls = df_treated_nulls.filter(col("revenue").isNotNull() & ~isnan(col("revenue")))
print(f"df_rows revenue = {df_treated_nulls.count()}")

# Rule 2: drop rows without release date
df_treated_nulls = df_treated_nulls.filter(col("release_date").isNotNull())
print(f"df_rows date = {df_treated_nulls.count()}")

# Rule 4: keep rows where at least one of the titles is not empty
df_treated_nulls = df_treated_nulls.filter((col("original_title") != "") | (col("title") != ""))
print(f"df_rows titles = {df_treated_nulls.count()}")

# Rule 6: replace null overview and tagline with placeholder text
df_treated_nulls = df_treated_nulls.fillna({"overview": "Film has no overview."})
print(f"df_overview_null_count = {df_treated_nulls.select('overview').filter(col('overview').isNull()).count()}")

df_treated_nulls = df_treated_nulls.fillna({"tagline": "Film has no tagline."})
print(f"df_tagline_null_count = {df_treated_nulls.select('tagline').filter(col('tagline').isNull()).count()}")

# Rule 5: drop identical rows
df_treated_nulls = df_treated_nulls.distinct()
print(f"df_rows lines = {df_treated_nulls.count()}")

df_cols original = 19
df_cols nulls = 19
df_rows revenue = 1753313
df_rows date = 1701718
df_rows titles = 1701718
df_overview_null_count = 0
df_tagline_null_count = 0
df_rows lines = 1700013

### 5. Outlier treatment and removal

The treatments applied are:

1. Removal of films classified as "Adult"
2. Removal of rows with no votes

In [30]:
print(f"DF original = {df_treated_nulls.count()}")

df_output = df_treated_nulls.filter(col("adult") == False) 
print(f"DF no adult = {df_output.count()}")

df_output = df_output.filter(col("vote_count") != 0)
print('DF only voted: ',df_output.count())

DF original = 1700013
DF no adult = 1689165
DF only voted:  1347484

### 6. Storing the treated data in the Silver layer as CSV

The treated data is stored in the Silver layer in a file named `SILVER_TMDB_movie_dataset_v11_{date}_{time}.csv`, where the suffix is the date and time of the run.

In [31]:
time.tzset()
date_formatted = time.strftime("%Y%m%d")
time_formatted = time.strftime("%H%M%S")

print("Initiating csv save on silver...")

dest_filename = f"../DataLayer/silver/SILVER_TMDB_movie_dataset_v11_{date_formatted}_{time_formatted}"

# Cast every column to string and write a single part file into a temporary directory
df_output.selectExpr([f"CAST(`{c}` AS STRING) AS `{c}`" for c in df_output.columns]) \
    .coalesce(1) \
    .write.mode("overwrite") \
    .option("header", "true") \
    .option("delimiter", ",") \
    .csv(dest_filename)

output_dir = os.path.join(os.getcwd(), dest_filename)

try:
    # Keep only the part file, copied to <dest_filename>.csv, and delete Spark's auxiliary files
    for f in os.listdir(output_dir):
        curr_file = os.path.join(output_dir, f)
        if f.endswith(".csv"):
            print(f"found {curr_file}")
            shutil.copyfile(curr_file, f"{output_dir}.csv")
        os.remove(curr_file)

    os.rmdir(output_dir)
    print("Finished.")

except OSError as e:
    print(f"Failed to extract csv: {e}")


Initiating csv save on silver...
found /home/papercut/Documentos/projects/unb/2025-2/SBD2/film-data-analytics/Transformer/../DataLayer/silver/SILVER_TMDB_movie_dataset_v11_20251107_151831/part-00000-5733076d-0ac5-46ba-85df-cb05c77c8d20-c000.csv
Finished.
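
Spark writes a directory of part files rather than a single CSV, which is why the cell above copies the part file out and deletes the rest. The sketch below shows the same idea with `pathlib` against a simulated output directory, so it can be tried without Spark. The function `promote_part_file`, the directory name and the file contents are hypothetical, and the directory layout is only an approximation of what Spark produces.

```python
import shutil
import tempfile
from pathlib import Path


def promote_part_file(output_dir: Path) -> Path:
    """Move the single part-*.csv to `<output_dir>.csv` and delete the directory."""
    part_files = sorted(output_dir.glob("part-*.csv"))
    if len(part_files) != 1:
        raise FileNotFoundError(
            f"expected exactly one part file in {output_dir}, found {len(part_files)}"
        )
    target = output_dir.with_name(output_dir.name + ".csv")
    shutil.move(str(part_files[0]), str(target))
    shutil.rmtree(output_dir)  # removes the marker and checksum leftovers
    return target


# Simulate a Spark-style output directory in a temporary location.
base = Path(tempfile.mkdtemp())
out_dir = base / "SILVER_example_20250101_120000"
out_dir.mkdir()
(out_dir / "part-00000-abc.csv").write_text("id,title\n1,Example\n")
(out_dir / "_SUCCESS").write_text("")
(out_dir / ".part-00000-abc.csv.crc").write_text("")

final_csv = promote_part_file(out_dir)
print(final_csv.name)
```

Raising when the number of part files is not exactly one makes a failed or partitioned write visible instead of silently producing an incomplete file.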


### 7. Inserting the treated data into the database

The database runs in a Docker container, which must already be running and available locally on port 5432.

The connection to the database uses a JDBC string, and the table structure is created dynamically by Spark according to the structure of the output dataframe `df_output`.

In [None]:
df_final = df_output.withColumn("id", col("id").cast(IntegerType())) \
                    .withColumn("title", col("title").cast(StringType())) \
                    .withColumn("vote_average", col("vote_average").cast(FloatType())) \
                    .withColumn("vote_count", col("vote_count").cast(IntegerType())) \
                    .withColumn("status", col("status").cast(StringType())) \
                    .withColumn("release_date", col("release_date").cast("timestamp")) \
                    .withColumn("revenue", col("revenue").cast(IntegerType())) \
                    .withColumn("runtime", col("runtime").cast(IntegerType())) \
                    .withColumn("adult", col("adult").cast(BooleanType())) \
                    .withColumn("budget", col("budget").cast(IntegerType())) \
                    .withColumn("original_language", col("original_language").cast(StringType())) \
                    .withColumn("original_title", col("original_title").cast(StringType())) \
                    .withColumn("overview", col("overview").cast(StringType())) \
                    .withColumn("popularity", col("popularity").cast(FloatType())) \
                    .withColumn("tagline", col("tagline").cast(StringType())) \
                    .withColumn("genre", col("genre").cast(StringType())) \
                    .withColumn("production_company", col("production_company").cast(StringType())) \
                    .withColumn("production_country", col("production_country").cast(StringType())) \
                    .withColumn("spoken_language", col("spoken_language").cast(StringType()))

DB_CONFIG = {
    "db_user": os.environ.get("DB_USER") or "postgres",
    "db_password": os.environ.get("DB_PASSWORD") or "secret",
    "db_name": os.environ.get("DB_NAME") or "postgres",
}
jdbc_string = f"jdbc:postgresql://localhost:5432/{DB_CONFIG['db_name']}"
table_name = "lakehouse.film_lakehouse"
df_final.write \
  .format("jdbc") \
  .option("url", jdbc_string) \
  .option("dbtable", table_name) \
  .option("user", DB_CONFIG["db_user"]) \
  .option("password", DB_CONFIG["db_password"]) \
  .option("driver", "org.postgresql.Driver") \
  .mode("overwrite") \
  .save()
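
The connection settings used above can also be gathered in one place. The sketch below is a minimal, hedged example of building the JDBC options from environment variables with the same fallbacks. The function `build_jdbc_options` and its `host` and `port` parameters are assumptions introduced for illustration; the variable names, defaults and table name are taken from the cell above.

```python
import os
from typing import Mapping, Optional


def build_jdbc_options(env: Optional[Mapping[str, str]] = None,
                       host: str = "localhost", port: int = 5432) -> dict:
    """Collect JDBC writer options, using defaults when a variable is unset or empty."""
    env = os.environ if env is None else env
    db_name = env.get("DB_NAME") or "postgres"
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{db_name}",
        "dbtable": "lakehouse.film_lakehouse",
        "user": env.get("DB_USER") or "postgres",
        "password": env.get("DB_PASSWORD") or "secret",
        "driver": "org.postgresql.Driver",
    }


# Example with an explicit mapping instead of the real environment.
options = build_jdbc_options({"DB_NAME": "films"})
print(options["url"])  # jdbc:postgresql://localhost:5432/films
```

With a dictionary like this, the write could be expressed as `df_final.write.format("jdbc").options(**options).mode("overwrite").save()`. Spark creates the table itself, but the `lakehouse` schema is presumably expected to exist in PostgreSQL beforehand.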


                                                                                