# Large-Scale Data Engineering for AI
In this project we worked with three music-related datasets, all about songs available on Spotify. Each dataset offers a complementary view of streaming music, from song content to track metadata and per-country popularity. Each one is briefly described below:

- **songs-lyrics** → Contains about 25,000 songs with their lyrics.
- **spotify-tracks-dataset** → Includes information on Spotify songs from 125 different genres, along with musical attributes such as popularity, energy, danceability, etc.
- **top-spotify-songs-by-country** → Contains the most-played songs each day in 72 countries and is updated continuously.

## The Data Engineering Pipeline
### The Landing Zone
For the landing zone, we used the Kaggle API to obtain the required datasets. Access requires an authentication token first. To generate it, open your Kaggle user profile, go to Settings and, near the bottom, click "Create New Token", which downloads a .json file with the credentials.

We followed the official [guide](https://www.kaggle.com/docs/api#getting-started-installation-&-authentication) to install and configure the API correctly. The basic steps are:
```
pip install kaggle
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json  # keep the token readable only by your user
```
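
Before calling the API, it can help to confirm that the token sits where the client expects it. The following is a minimal sketch that is not part of the original notebook: the helper name `find_kaggle_token` is hypothetical, it assumes the classic token file with a `username` and a `key` field, and it only inspects the local file without contacting Kaggle.

```python
import json
from pathlib import Path
from typing import Optional


def find_kaggle_token(config_dir: Optional[Path] = None) -> Path:
    """Return the path to kaggle.json, or raise if it is missing or incomplete.

    Hypothetical helper for illustration; it only looks at the local file.
    """
    config_dir = config_dir or Path.home() / ".kaggle"
    token_path = config_dir / "kaggle.json"
    if not token_path.is_file():
        raise FileNotFoundError(f"No token found at {token_path}")

    # Assumption: the token is a small JSON object with these two keys.
    credentials = json.loads(token_path.read_text(encoding="utf-8"))
    missing = {"username", "key"} - set(credentials)
    if missing:
        raise ValueError(f"Token file is missing fields: {sorted(missing)}")
    return token_path
```

Calling `find_kaggle_token()` with no argument checks the default `~/.kaggle` directory; passing a directory makes the check easy to try against a temporary folder.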

Once the connection to the API is established, we created a function that automates the download of the three datasets and stores them in the desired location. The datasets originally come in .csv format, but we chose to convert them to .parquet to optimize storage and later reads.

In particular, the main dataset containing the lyrics also includes additional information about albums and songs. However, we only need the data corresponding to the lyrics, so we discard the rest.
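
The selection rule can be isolated as a small pure function. This is a sketch under assumptions: the two skipped file names are the ones the collector in this notebook ignores, while `select_csv_files` and the example file name `lyrics.csv` are hypothetical and used only for illustration.

```python
from typing import Iterable, List

# File names shipped with the lyrics dataset that the pipeline does not need.
SKIPPED_FILES = {"songs_details.csv", "album_details.csv"}


def select_csv_files(filenames: Iterable[str]) -> List[str]:
    """Keep only the CSV files that should be converted to Parquet."""
    return [
        name
        for name in filenames
        if name.endswith(".csv") and name not in SKIPPED_FILES
    ]


# 'lyrics.csv' and 'README.md' are made-up names for this example.
downloaded = ["lyrics.csv", "songs_details.csv", "album_details.csv", "README.md"]
print(select_csv_files(downloaded))  # ['lyrics.csv']
```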

In summary, the process consists of:

- Downloading the datasets from the Kaggle API
- Converting them to .parquet format
- Saving them to the data/landing_zone directory.

Throughout this process, several checks are performed so that, if an error occurs, enough information is available to detect and fix the problem.
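
One way to combine error reporting with cleanup is a `try`/`except`/`finally` wrapper around a scratch directory. The sketch below illustrates that pattern only; `run_in_temp_dir` and the logger name are hypothetical, and the notebook's own collector handles cleanup inline rather than through a helper like this.

```python
import logging
import shutil
import tempfile
from pathlib import Path
from typing import Callable

logger = logging.getLogger("landing_zone_sketch")


def run_in_temp_dir(step: Callable[[Path], None]) -> bool:
    """Run `step` inside a scratch directory that is always removed afterwards.

    Returns True on success and False if `step` raised an exception.
    """
    workdir = Path(tempfile.mkdtemp(prefix="temp_dataset_"))
    try:
        step(workdir)
        return True
    except Exception:
        # logger.exception records the message together with the traceback.
        logger.exception("Step failed while working in %s", workdir)
        return False
    finally:
        # Runs on success and on failure, so no partial download is left behind.
        shutil.rmtree(workdir, ignore_errors=True)
```

Returning a boolean lets the caller decide whether to log success, which avoids reporting a dataset as processed when its download failed.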

In addition, the function that runs the process takes a parameter named update. If it is set to True, only the **top-spotify-songs-by-country** dataset is downloaded, since it is the only one that is updated regularly (specifically, every day). This way, our **data collector** supports periodic runs and keeps the data up to date with little effort.
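
The effect of the update flag can be sketched without touching the network. The function `datasets_to_download` and the two example entries are hypothetical; only the dictionary shape (`kaggle_id`, `dataset_name`, `update`) mirrors the list used later in this notebook.

```python
from typing import Dict, List

# Same shape as the notebook's dataset list, with made-up identifiers.
example_datasets = [
    {"kaggle_id": "owner/daily-charts", "dataset_name": "daily-charts", "update": True},
    {"kaggle_id": "owner/static-tracks", "dataset_name": "static-tracks", "update": False},
]


def datasets_to_download(datasets: List[Dict], update: bool) -> List[str]:
    """Return the names of the datasets that a run would download."""
    return [
        d["dataset_name"]
        for d in datasets
        # A full run takes everything; a refresh takes only updatable datasets.
        if not update or d["update"]
    ]


print(datasets_to_download(example_datasets, update=False))  # ['daily-charts', 'static-tracks']
print(datasets_to_download(example_datasets, update=True))   # ['daily-charts']
```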

In [51]:
# Imports and requirements
!pip install kaggle
!pip install pyspark
!pip install unidecode






In [None]:
# Imports and requirements
import os
import kaggle
import pandas as pd
import logging
import shutil
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when, regexp_replace, trim, lower, upper, to_date, year, month, dayofmonth, explode, split, row_number, udf
from pyspark.sql.types import StringType, IntegerType, DateType, ArrayType, StructType, StructField
from pyspark.sql import Window
from unidecode import unidecode

In [3]:
# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Create Spark Session
def create_spark_session(app_name="Spotify_ETL"):
    """
    Creates and returns a Spark session.
    
    Args:
        app_name (str): The name of the Spark application.
        
    Returns:
        SparkSession: A configured Spark session.
    """
    logger.info(f"Creating Spark session with app name: {app_name}")
    
    try:
        # Create a SparkSession with appropriate settings
        spark = (SparkSession.builder
                .appName(app_name)
                .config("spark.sql.execution.arrow.pyspark.enabled", "true")
                .config("spark.sql.shuffle.partitions", "10")
                .config("spark.driver.memory", "2g")
                .config("spark.executor.memory", "4g")
                .config("spark.default.parallelism", "4")
                .config("spark.sql.adaptive.enabled", "true")
                .getOrCreate())
        
        # Set log level to ERROR to reduce verbosity
        spark.sparkContext.setLogLevel("ERROR")
        
        logger.info("Spark session created successfully")
        return spark
    
    except Exception as e:
        logger.error(f"Error creating Spark session: {e}")
        raise

spark = create_spark_session()

2025-04-13 01:39:13,865 - __main__ - INFO - Creating Spark session with app name: Spotify_ETL
2025-04-13 01:39:18,810 - __main__ - INFO - Spark session created successfully


In [4]:
# Base directory and Landing Zone directory structure
BASE_DIR = "./data"
LANDING_ZONE_DIR = os.path.join(BASE_DIR, "landing_zone")
os.makedirs(LANDING_ZONE_DIR, exist_ok=True)

kaggle.api.authenticate()

# List of datasets to ingest from Kaggle
datasets = [
        {
            "kaggle_id": "asaniczka/top-spotify-songs-in-73-countries-daily-updated",
            "dataset_name": "top-spotify-songs-by-country",
            "update": True
        },
        {
            "kaggle_id": "maharshipandya/-spotify-tracks-dataset",
            "dataset_name": "spotify-tracks-dataset",
            "update": False
        },
        {
            "kaggle_id": "terminate9298/songs-lyrics",
            "dataset_name": "songs-lyrics",
            "update": False
        }
        ]

In [5]:
def data_collector_kaggle(kaggle_dataset: dict) -> None:
    """
    Downloads a dataset from Kaggle and saves it to the landing zone.

    Parameters:
    kaggle_dataset (dict): A dictionary containing the Kaggle dataset information.
    """

    # Extract dataset information
    kaggle_id = kaggle_dataset["kaggle_id"]
    dataset_name = kaggle_dataset["dataset_name"]

    # Create a temporary directory for the dataset using the actual dataset name
    dataset_folder = os.path.join(LANDING_ZONE_DIR, f"temp_{dataset_name}")
    os.makedirs(dataset_folder, exist_ok=True)

    try:  
        logging.info(f"Downloading dataset: {kaggle_id}")

        kaggle.api.dataset_download_files(
            kaggle_id,
            path=dataset_folder,
            unzip=True
        )

        csv_found = False
        for filename in os.listdir(dataset_folder):
            if filename in ['songs_details.csv', 'album_details.csv']:
                continue
            if filename.endswith(".csv"):
                csv_found = True
                csv_path = os.path.join(dataset_folder, filename)

                # Read CSV with Pandas
                df = pd.read_csv(csv_path)

                # Write as a single Parquet file
                final_path = os.path.join(LANDING_ZONE_DIR, f"{dataset_name}.parquet")
                df.to_parquet(final_path, index=False)

                logging.info(f"CSV '{filename}' converted to single Parquet file and saved as '{final_path}'.")
                
        if not csv_found:
            logging.warning("No CSV file found in the downloaded dataset. Check the contents of the download.")
            
        # Remove the temporary dataset folder
        shutil.rmtree(dataset_folder)

    except Exception as e:
        # Remove the dataset folder if it exists
        if os.path.exists(dataset_folder):
            shutil.rmtree(dataset_folder)

        # Log the error
        logging.error(f"Error downloading dataset '{kaggle_id}': {e}")

        # Re-raise so the caller does not report this dataset as processed
        raise

    # Log the successful download
    logging.info(f"Dataset '{dataset_name}' downloaded successfully.")

def download_and_store_datasets(update: bool = False) -> None:
    """
    Downloads and stores datasets from Kaggle into the landing zone.
    """
    logging.info("Starting the creation of the Landing Zone using Kaggle API")

    for kaggle_dataset in datasets:
        if update and not kaggle_dataset["update"]:
            logging.info(f"Skipping dataset '{kaggle_dataset['dataset_name']}' as update is set to False.")
            continue
        try:
            dataset_name = kaggle_dataset["dataset_name"]
            data_collector_kaggle(kaggle_dataset)
            logging.info(f"Dataset '{dataset_name}' processed successfully.")
        except Exception as e:
            logging.error(f"Error processing dataset '{dataset_name}': {e}")

    logging.info("All datasets have been processed.")
    logging.info("Landing Zone creation completed.")

download_and_store_datasets(update=False)

2025-04-13 01:39:24,264 - root - INFO - Starting the creation of the Landing Zone using Kaggle API
2025-04-13 01:39:24,266 - root - INFO - Downloading dataset: asaniczka/top-spotify-songs-in-73-countries-daily-updated
2025-04-13 01:39:39,027 - root - INFO - CSV 'universal_top_spotify_songs.csv' converted to single Parquet file and saved as './data\landing_zone\top-spotify-songs-by-country.parquet'.
2025-04-13 01:39:39,075 - root - INFO - Dataset 'top-spotify-songs-by-country' downloaded successfully.
2025-04-13 01:39:39,122 - root - INFO - Dataset 'top-spotify-songs-by-country' processed successfully.
2025-04-13 01:39:39,124 - root - INFO - Downloading dataset: maharshipandya/-spotify-tracks-dataset
2025-04-13 01:39:40,883 - root - INFO - CSV 'dataset.csv' converted to single Parquet file and saved as './data\landing_zone\spotify-tracks-dataset.parquet'.
2025-04-13 01:39:40,887 - root - INFO - Dataset 'spotify-tracks-dataset' downloaded successfully.
2025-04-13 01:39:40,896 - root - IN

### The Formatted Zone

In the formatted zone, the main goal is to transform and clean the data downloaded into the landing zone so that it is ready for later analysis. This involves several preprocessing and validation steps to ensure data quality and consistency.

#### Main processes:

1. **Format conversion**: The downloaded datasets are stored in `.parquet` format to optimize storage and reads. This format is more efficient in terms of compression and processing speed.

2. **Column name normalization**: Column names are normalized to ensure consistency and make the data easier to access.

3. **Data filtering and selection**: Only the relevant columns are kept.

4. **Normalization and standardization**: Song and artist names are normalized to avoid inconsistencies caused by unnecessary whitespace.

5. **Storage**: The processed data is saved to the `data/formatted_zone` directory in `.parquet` format.
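
The notebook renames columns one by one with `alias`. As a hedged alternative for step 2, a generic rule can turn any label into lowercase snake_case. The function `normalize_column_name` is hypothetical and illustrates a different technique from the explicit renaming used in the cells below.

```python
import re


def normalize_column_name(name: str) -> str:
    """Turn an arbitrary column label into lowercase snake_case."""
    # Replace every run of non-alphanumeric characters with one underscore.
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


print(normalize_column_name("Unnamed: 0"))    # unnamed_0
print(normalize_column_name(" Track Name "))  # track_name
print(normalize_column_name("song_name"))     # song_name
```

A rule like this keeps names predictable across datasets, although it cannot express semantic renames such as `artists` to `artist_name`, which still need an explicit mapping.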

In summary, the formatted zone process consists of:

- Converting the datasets to `.parquet`.
- Normalizing column names.
- Filtering and selecting the relevant data.
- Normalizing the data.
- Saving the results to the `data/formatted_zone` directory.

This process ensures the data is clean, consistent and ready for analysis.
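
The value-level normalization can be illustrated in plain Python before looking at the Spark version. This is a sketch under assumptions: `normalize_text` is a hypothetical helper, and the optional accent stripping uses the standard-library `unicodedata` module as a rough stand-in for `unidecode`, which transliterates more characters than this does.

```python
import unicodedata
from typing import Optional


def normalize_text(value: Optional[str], strip_accents: bool = False) -> Optional[str]:
    """Trim and lowercase a name; optionally drop combining accent marks."""
    if value is None:
        return None
    text = value.strip().lower()
    if strip_accents:
        # NFKD splits an accented letter into base letter plus combining mark,
        # and the combining marks are then filtered out.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return text


print(normalize_text("  Beyoncé "))                      # beyoncé
print(normalize_text("  Beyoncé ", strip_accents=True))  # beyonce
```

Applying the same rule to every dataset matters because later joins on song and artist names only match when both sides were normalized identically.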

In [70]:
# Define the UDF that applies unidecode to each value of the column
def unidecode_func(s):
    # `unidecode` is imported as a function (from unidecode import unidecode)
    return unidecode(s) if s is not None else None

unidecode_udf = udf(unidecode_func, StringType())

def process_spotify_tracks(spark, input_path, output_path):
    """
    Process the Spotify tracks dataset.
    
    Args:
        spark (SparkSession): The Spark session.
        input_path (str): The input file path.
        output_path (str): The output directory path.
    """
    logger.info(f"Processing Spotify tracks dataset from {input_path}")
    
    try:
        # Read the parquet file
        df = spark.read.parquet(input_path)
        
        # Print schema and count before processing
        logger.info("Original schema:")
        df.printSchema()
        count_before = df.count()
        logger.info(f"Count before processing: {count_before}")
        
        # Clean and transform the data
        processed_df = df.select(
            col("track_id").alias("track_id"),
            col("track_name").alias("track_name"),
            col("artists").alias("artist_name"),
            col("album_name"),
            col("popularity").cast("integer").alias("popularity"),
            col("duration_ms").cast("long").alias("duration_ms"),
            col("explicit").cast("boolean").alias("explicit"),
            col("danceability").cast("double").alias("danceability"),
            col("energy").cast("double").alias("energy"),
            col("key").cast("integer").alias("key"),
            col("loudness").cast("double").alias("loudness"),
            col("mode").cast("integer").alias("mode"),
            col("speechiness").cast("double").alias("speechiness"),
            col("acousticness").cast("double").alias("acousticness"),
            col("instrumentalness").cast("double").alias("instrumentalness"),
            col("liveness").cast("double").alias("liveness"),
            col("valence").cast("double").alias("valence"),
            col("tempo").cast("double").alias("tempo")
        )
        
        # Normalize track_name and artist_name with trim and lower
        processed_df = processed_df.withColumn("track_name", lower(trim(col("track_name"))))
        processed_df = processed_df.withColumn("artist_name", lower(trim(col("artist_name"))))
        
        # Print schema and count after processing
        logger.info("Processed schema:")
        processed_df.printSchema()
        
        # Write the processed data as Parquet
        processed_df.write.mode("overwrite").parquet(output_path)
        logger.info(f"Processed Spotify tracks data saved to {output_path}")
        
        return output_path
        
    except Exception as e:
        logger.error(f"Error processing Spotify tracks dataset: {e}")
        raise

def process_top_songs(spark, input_path, output_path):
    """
    Processes the Spotify dataset with the following schema:
    
    root
     |-- spotify_id: string (nullable = true)
     |-- name: string (nullable = true)
     |-- artists: string (nullable = true)
     |-- daily_rank: string (nullable = true)
     |-- daily_movement: string (nullable = true)
     |-- weekly_movement: string (nullable = true)
     |-- country: string (nullable = true)
     |-- snapshot_date: string (nullable = true)
     |-- popularity: string (nullable = true)
     |-- is_explicit: string (nullable = true)
     |-- duration_ms: string (nullable = true)
     |-- album_name: string (nullable = true)
     |-- album_release_date: string (nullable = true)
     |-- danceability: string (nullable = true)
     |-- energy: string (nullable = true)
     |-- key: string (nullable = true)
     |-- loudness: string (nullable = true)
     |-- mode: string (nullable = true)
     |-- speechiness: string (nullable = true)
     |-- acousticness: double (nullable = true)
     |-- instrumentalness: double (nullable = true)
     |-- liveness: double (nullable = true)
     |-- valence: double (nullable = true)
     |-- tempo: double (nullable = true)
     |-- time_signature: double (nullable = true)
    
    Args:
        spark (SparkSession): The Spark session.
        input_path (str): Path to the input Parquet file.
        output_path (str): Path to the output directory.
    """
    logger.info(f"Processant dades de Spotify des de {input_path}")
    
    try:
        # Read the Parquet file
        df = spark.read.parquet(input_path)
        
        # Print the schema and count rows before processing
        logger.info("Schema original:")
        df.printSchema()
        count_before = df.count()
        logger.info(f"Files abans del processament: {count_before}")
        
        # Select and transform the columns according to the new schema
        processed_df = df.select(
            col("spotify_id").alias("spotify_id"),
            col("name").alias("track_name"),        
            col("artists").alias("artist_name"),
            col("daily_rank").cast("integer").alias("daily_rank"),
            col("daily_movement").alias("daily_movement"),
            col("weekly_movement").alias("weekly_movement"),
            col("country").alias("country"),
            col("snapshot_date").alias("snapshot_date"),
            col("popularity").cast("integer").alias("popularity"),
            col("is_explicit").alias("is_explicit"),
            col("duration_ms").cast("long").alias("duration_ms"),
            col("album_name").alias("album_name"),
            col("album_release_date").alias("album_release_date"),
            col("danceability").alias("danceability"),
            col("energy").alias("energy"),
            col("key").alias("key"),
            col("loudness").alias("loudness"),
            col("mode").alias("mode"),
            col("speechiness").alias("speechiness"),
            col("acousticness").cast("double").alias("acousticness"),
            col("instrumentalness").cast("double").alias("instrumentalness"),
            col("liveness").cast("double").alias("liveness"),
            col("valence").cast("double").alias("valence"),
            col("tempo").cast("double").alias("tempo"),
            col("time_signature").cast("double").alias("time_signature")
        )
        
        # Convert snapshot_date and album_release_date to date type (adjust the pattern if needed)
        processed_df = processed_df.withColumn("snapshot_date", to_date(col("snapshot_date"), "yyyy-MM-dd"))
        processed_df = processed_df.withColumn("album_release_date", to_date(col("album_release_date"), "yyyy-MM-dd"))
        
        # Extract year, month and day from snapshot_date for possible further analysis
        processed_df = processed_df.withColumn("snapshot_year", year(col("snapshot_date"))) \
                                   .withColumn("snapshot_month", month(col("snapshot_date"))) \
                                   .withColumn("snapshot_day", dayofmonth(col("snapshot_date")))
        
        
        # Print the schema after processing
        logger.info("Schema processat:")
        processed_df.printSchema()
        
        # Write the processed DataFrame in Parquet format
        processed_df.write.mode("overwrite").parquet(output_path)
        logger.info(f"Dades processades de Spotify guardades a {output_path}")
        
        return output_path
        
    except Exception as e:
        logger.error(f"Error al processar les dades de Spotify: {e}")
        raise

def process_song_lyrics(spark, input_path, output_path):
    """
    Process the song lyrics dataset.
    
    Args:
        spark (SparkSession): The Spark session.
        input_path (str): The input file path.
        output_path (str): The output directory path.
    """
    logger.info(f"Processing song lyrics from {input_path}")
    
    try:
        # Read the Parquet file produced by the landing zone
        df = spark.read.parquet(input_path)
        
        # Print schema and count before processing
        logger.info("Original schema:")
        df.printSchema()
        count_before = df.count()
        logger.info(f"Count before processing: {count_before}")
        
        # Clean and transform the data based on the actual columns in songs-lyrics.csv
        processed_df = df.select(
            col("Unnamed: 0").cast("integer").alias("song_id"),
            col("artist").alias("artist_name"),
            col("song_name").alias("track_name"),
            col("lyrics").alias("song_lyrics")
        )
        
        # Clean artist and track names
        processed_df = processed_df.withColumn("artist_name", trim(col("artist_name"))) \
                                 .withColumn("track_name", trim(col("track_name")))
        
        # Print schema and count after processing
        logger.info("Processed schema:")
        processed_df.printSchema()
        
        # Write the processed data as Parquet
        processed_df.write.mode("overwrite").parquet(output_path)
        logger.info(f"Processed song lyrics data saved to {output_path}")
        
        return output_path
        
    except Exception as e:
        logger.error(f"Error processing song lyrics dataset: {e}")
        raise

In [68]:
# Define input and output paths
landing_paths = {
    'spotify_tracks': './data/landing_zone/spotify-tracks-dataset.parquet',
    'top_songs': './data/landing_zone/top-spotify-songs-by-country.parquet',
    'song_lyrics': './data/landing_zone/songs-lyrics.parquet'
}
formatted_paths = {
    'spotify_tracks': './data/formatted_zone/spotify-tracks-dataset',
    'top_songs': './data/formatted_zone/top-spotify-songs-by-country',
    'song_lyrics': './data/formatted_zone/songs-lyrics'
}

# Process datasets
process_spotify_tracks(spark, landing_paths['spotify_tracks'], formatted_paths['spotify_tracks'])
process_top_songs(spark, landing_paths['top_songs'], formatted_paths['top_songs'])
process_song_lyrics(spark, landing_paths['song_lyrics'], formatted_paths['song_lyrics'])

2025-04-13 03:18:47,414 - __main__ - INFO - Processing Spotify tracks dataset from ./data/landing_zone/spotify-tracks-dataset.parquet
2025-04-13 03:18:47,510 - __main__ - INFO - Original schema:
2025-04-13 03:18:47,591 - __main__ - INFO - Count before processing: 114000
2025-04-13 03:18:47,626 - __main__ - INFO - Processed schema:


root
 |-- Unnamed: 0: long (nullable = true)
 |-- track_id: string (nullable = true)
 |-- artists: string (nullable = true)
 |-- album_name: string (nullable = true)
 |-- track_name: string (nullable = true)
 |-- popularity: long (nullable = true)
 |-- duration_ms: long (nullable = true)
 |-- explicit: boolean (nullable = true)
 |-- danceability: double (nullable = true)
 |-- energy: double (nullable = true)
 |-- key: long (nullable = true)
 |-- loudness: double (nullable = true)
 |-- mode: long (nullable = true)
 |-- speechiness: double (nullable = true)
 |-- acousticness: double (nullable = true)
 |-- instrumentalness: double (nullable = true)
 |-- liveness: double (nullable = true)
 |-- valence: double (nullable = true)
 |-- tempo: double (nullable = true)
 |-- time_signature: long (nullable = true)
 |-- track_genre: string (nullable = true)

root
 |-- track_id: string (nullable = true)
 |-- track_name: string (nullable = true)
 |-- artist_name: string (nullable = true)
 |-- album_n

2025-04-13 03:18:48,460 - __main__ - INFO - Processed Spotify tracks data saved to ./data/formatted_zone/spotify-tracks-dataset
2025-04-13 03:18:48,461 - __main__ - INFO - Processant dades de Spotify des de ./data/landing_zone/top-spotify-songs-by-country.parquet
2025-04-13 03:18:48,518 - __main__ - INFO - Schema original:
2025-04-13 03:18:48,591 - __main__ - INFO - Files abans del processament: 1923107
2025-04-13 03:18:48,653 - __main__ - INFO - Schema processat:


root
 |-- spotify_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- artists: string (nullable = true)
 |-- daily_rank: long (nullable = true)
 |-- daily_movement: long (nullable = true)
 |-- weekly_movement: long (nullable = true)
 |-- country: string (nullable = true)
 |-- snapshot_date: string (nullable = true)
 |-- popularity: long (nullable = true)
 |-- is_explicit: boolean (nullable = true)
 |-- duration_ms: long (nullable = true)
 |-- album_name: string (nullable = true)
 |-- album_release_date: string (nullable = true)
 |-- danceability: double (nullable = true)
 |-- energy: double (nullable = true)
 |-- key: long (nullable = true)
 |-- loudness: double (nullable = true)
 |-- mode: long (nullable = true)
 |-- speechiness: double (nullable = true)
 |-- acousticness: double (nullable = true)
 |-- instrumentalness: double (nullable = true)
 |-- liveness: double (nullable = true)
 |-- valence: double (nullable = true)
 |-- tempo: double (nullable = true)
 |-- tim

2025-04-13 03:18:55,898 - __main__ - INFO - Dades processades de Spotify guardades a ./data/formatted_zone/top-spotify-songs-by-country
2025-04-13 03:18:55,899 - __main__ - INFO - Processing song lyrics from ./data/landing_zone/songs-lyrics.parquet
2025-04-13 03:18:55,959 - __main__ - INFO - Original schema:
2025-04-13 03:18:56,049 - __main__ - INFO - Count before processing: 25742
2025-04-13 03:18:56,063 - __main__ - INFO - Processed schema:


root
 |-- Unnamed: 0: long (nullable = true)
 |-- link: string (nullable = true)
 |-- artist: string (nullable = true)
 |-- song_name: string (nullable = true)
 |-- lyrics: string (nullable = true)

root
 |-- song_id: integer (nullable = true)
 |-- artist_name: string (nullable = true)
 |-- track_name: string (nullable = true)
 |-- song_lyrics: string (nullable = true)



2025-04-13 03:18:56,502 - __main__ - INFO - Processed song lyrics data saved to ./data/formatted_zone/songs-lyrics


'./data/formatted_zone/songs-lyrics'

### The Trusted Zone

The trusted zone is the layer of the data pipeline where already-processed data is validated and prepared for later analysis and visualization. This zone usually holds data enriched with quality checks and validations that ensure its integrity. In our case, we performed several operations to make the data reliable and useful for later analysis.

#### Main processes:

1. **Data filtering**: The dataset is filtered to remove songs that have no lyrics. This matters because lyrics are one of the most relevant aspects of our project. Songs with no associated album or song name are also removed, since they represent errors in the dataset.

2. **Denial constraints**: Constraints are applied to ensure that the data satisfies certain conditions. In our case, the following conditions are checked:
    - **Minimum song duration**

        Each song must last at least 30 seconds:

        $$
        \forall r \in R,\; r.duration\_ms \ge 30000
        $$

    - **Uniqueness of daily_rank per country and date**

        Two songs cannot share the same `daily_rank` on the same date and in the same country:

        $$
        \forall r, s \in R,\; \left[
            \left(
                r.snapshot\_date = s.snapshot\_date \land
                r.country = s.country \land
                r.daily\_rank = s.daily\_rank
            \right) \Rightarrow r = s
        \right]
        $$

    - **Uniqueness of track_name per artist, country and date**

        An artist cannot have two songs with the same name in the same country and on the same date:

        $$
        \forall r, s \in R,\; \left[
            \left(
                r.artist\_name = s.artist\_name \land
                r.country = s.country \land
                r.snapshot\_date = s.snapshot\_date \land
                r.track\_name = s.track\_name
            \right) \Rightarrow r = s
        \right]
        $$

    - **Uniqueness of song_id**

        Each `song_id` must be unique within the lyrics dataset:

        $$
        \forall r, s \in R,\; \left[
            r.song\_id = s.song\_id \Rightarrow r = s
        \right]
        $$

3. **Storage**: The validated data is saved to the `data/trusted_zone` directory in `.parquet` format.
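
To make the constraints in step 2 concrete, here is a minimal plain-Python sketch that detects violations on a handful of rows. The helper names `too_short` and `duplicated_keys` and the sample rows are hypothetical. The sketch only reports violations, whereas the Spark functions in this notebook resolve them by keeping one row per key.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def too_short(rows: List[Row], min_ms: int = 30_000) -> List[Row]:
    """Rows that violate the minimum-duration constraint."""
    return [r for r in rows if r["duration_ms"] < min_ms]


def duplicated_keys(rows: List[Row], key: Tuple[str, ...]) -> List[tuple]:
    """Key values shared by more than one row (uniqueness violations)."""
    counts = Counter(tuple(r[k] for k in key) for r in rows)
    return [values for values, n in counts.items() if n > 1]


# Made-up rows: two songs share rank 1 in ES on the same day, one is too short.
rows = [
    {"snapshot_date": "2025-01-01", "country": "ES", "daily_rank": 1, "track_name": "a", "duration_ms": 200_000},
    {"snapshot_date": "2025-01-01", "country": "ES", "daily_rank": 1, "track_name": "b", "duration_ms": 25_000},
    {"snapshot_date": "2025-01-01", "country": "FR", "daily_rank": 1, "track_name": "a", "duration_ms": 180_000},
]

print(duplicated_keys(rows, ("snapshot_date", "country", "daily_rank")))  # [('2025-01-01', 'ES', 1)]
print([r["track_name"] for r in too_short(rows)])                         # ['b']
```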

In summary, the trusted zone process consists of:

- Filtering the data to remove songs without lyrics or with errors.
- Applying constraints to ensure data quality.
- Saving the results to the `data/trusted_zone` directory.

This process makes the data reliable and useful for later analysis, and provides a solid basis for decision-making and reporting.
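
Denial constraints are commonly stated as forbidden combinations of tuples rather than as implications. As an illustration added here (this rewriting does not appear in the original notebook), the rank-uniqueness condition is logically equivalent to forbidding two distinct rows that agree on date, country and rank, because $(A \Rightarrow r = s)$ is the same as $\neg(A \land r \ne s)$:

$$
\forall r, s \in R,\; \neg\left(
    r.snapshot\_date = s.snapshot\_date \land
    r.country = s.country \land
    r.daily\_rank = s.daily\_rank \land
    r \ne s
\right)
$$

In the same form, the minimum-duration condition forbids any row shorter than 30 seconds:

$$
\forall r \in R,\; \neg\left( r.duration\_ms < 30000 \right)
$$

The other two uniqueness conditions can be rewritten in the same way, by moving $r = s$ inside the negation as $r \ne s$.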

#### Function definition

In [None]:
def preprocess_spotify_tracks(spark, input_path, output_path):
    """
    Preprocess the Spotify tracks dataset for the Trusted Zone.
    Args:
        spark (SparkSession): The Spark session.
        input_path (str): The input file path.
        output_path (str): The output directory path.
    """
    logger.info(f"Preprocessing Spotify tracks dataset from {input_path}")

    try:
        # Read the parquet file
        df = spark.read.parquet(input_path)

        # Clean and transform the data
        processed_df = df.select(
            col("track_id"),
            col("track_name"),
            col("artist_name"),
            col("album_name"),
            col("popularity"),
            col("duration_ms"),
            col("explicit"),
            col("danceability"),
            col("energy"),
            col("key"),
            col("loudness"),
            col("mode"),
            col("speechiness"),
            col("acousticness"),
            col("instrumentalness"),
            col("liveness"),
            col("valence"),
            col("tempo")
        ).filter(
            col("track_id").isNotNull() & 
            col("track_name").isNotNull() & 
            col("artist_name").isNotNull() & 
            col("explicit").isNotNull() &
            (col("duration_ms") >= 30000)  # Songs shorter than 30 seconds are not valid for our analysis
        )

        # Write the processed data as Parquet
        processed_df.write.mode("overwrite").parquet(output_path)
        logger.info(f"Processed Spotify tracks data saved to {output_path}")

    except Exception as e:
        logger.error(f"Error preprocessing Spotify tracks dataset: {e}")
        raise

def preprocess_top_songs(spark, input_path, output_path):
    """
    Preprocess the top songs dataset for the Trusted Zone.
    Args:
        spark (SparkSession): The Spark session.
        input_path (str): The input file path.
        output_path (str): The output directory path.
    """
    logger.info(f"Preprocessing top songs dataset from {input_path}")

    try:
        # Read the parquet file
        df = spark.read.parquet(input_path)

        # Clean and transform the data
        processed_df = df.select(
            col("spotify_id"),
            col("track_name"),
            col("artist_name"),
            col("daily_rank"),
            col("country"),
            col("snapshot_date"),
            col("popularity"),
            col("duration_ms"),
            col("danceability"),
            col("energy"),
            col("acousticness"),
            col("valence"),
            col("tempo")
        ).filter(
            col("track_name").isNotNull() & col("artist_name").isNotNull()
        )

        # No two songs may share the same daily_rank on the same date and in the same country
        window_spec = Window.partitionBy("snapshot_date", "country", "daily_rank").orderBy("track_name")
        processed_df = processed_df.withColumn("row_number", row_number().over(window_spec))
        valid_df = processed_df.filter(col("row_number") == 1).drop("row_number")
        removed_rows = processed_df.count() - valid_df.count()
        logger.info(f"Rows removed due to 'not two songs same on same rank' denial constraint: {removed_rows}")

        # No two songs by the same artist may share the same track_name in the same country and on the same date
        window_spec_artist = Window.partitionBy("artist_name", "country", "track_name", "snapshot_date").orderBy("track_name")
        processed_df = valid_df.withColumn("row_number_artist", row_number().over(window_spec_artist))
        valid_df = processed_df.filter(col("row_number_artist") == 1).drop("row_number_artist")
        removed_rows_artist = processed_df.count() - valid_df.count()
        logger.info(f"Rows removed due to 'not two songs by same artist and same track name' constraint: {removed_rows_artist}")

        # Write the processed data as Parquet
        valid_df.write.mode("overwrite").parquet(output_path)
        logger.info(f"Processed top songs data saved to {output_path}")

    except Exception as e:
        logger.error(f"Error preprocessing top songs dataset: {e}")
        raise

def preprocess_song_lyrics(spark, input_path, output_path):
    """
    Preprocess the song lyrics dataset for the Trusted Zone.
    Args:
        spark (SparkSession): The Spark session.
        input_path (str): The input file path.
        output_path (str): The output directory path.
    """
    logger.info(f"Preprocessing song lyrics dataset from {input_path}")

    try:
        # Read the parquet file
        df = spark.read.parquet(input_path)

        # Clean and transform the data
        processed_df = df.select(
            col("song_id"),
            col("artist_name"),
            col("track_name"),
            col("song_lyrics")
        ).filter(
            col("song_id").isNotNull() & 
            col("track_name").isNotNull() & 
            col("artist_name").isNotNull() & 
            col("song_lyrics").isNotNull()
        )

        # No two records may share the same song_id
        window_spec_song = Window.partitionBy("song_id").orderBy("artist_name")
        processed_df = processed_df.withColumn("row_number_song", row_number().over(window_spec_song))
        valid_df = processed_df.filter(col("row_number_song") == 1).drop("row_number_song")
        removed_rows = processed_df.count() - valid_df.count()
        logger.info(f"Rows removed due to 'not unique song_id' denial constraint: {removed_rows}")

        # Write the processed data as Parquet
        valid_df.write.mode("overwrite").parquet(output_path)
        logger.info(f"Processed song lyrics data saved to {output_path}")

    except Exception as e:
        logger.error(f"Error preprocessing song lyrics dataset: {e}")
        raise
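
The preprocessing functions above enforce each denial constraint in the same way: rows are partitioned by the columns that must be unique together, each partition is ordered, and only the row numbered 1 is kept. The following is a minimal plain-Python sketch of that idea, not the Spark code used in the project; the helper name `keep_first_per_key` and the sample rows are hypothetical and serve only as illustration.

```python
from itertools import groupby


def keep_first_per_key(rows, key_cols, order_col):
    """Keep one row per key, mimicking row_number() == 1 over a window.

    rows      -- list of dicts (hypothetical stand-in for a DataFrame)
    key_cols  -- columns that must be unique together (the partition)
    order_col -- column that decides which duplicate survives
    """
    def key(row):
        return tuple(row[c] for c in key_cols)

    # Sort by partition key first, then by the ordering column.
    ordered = sorted(rows, key=lambda r: (key(r), r[order_col]))
    # The first row of each group plays the role of row_number() == 1.
    return [next(group) for _, group in groupby(ordered, key=key)]


# Hypothetical sample: two songs claim rank 1 on the same date and country.
rows = [
    {"snapshot_date": "2025-01-01", "country": "ES", "daily_rank": 1, "track_name": "B"},
    {"snapshot_date": "2025-01-01", "country": "ES", "daily_rank": 1, "track_name": "A"},
    {"snapshot_date": "2025-01-01", "country": "ES", "daily_rank": 2, "track_name": "C"},
]

valid = keep_first_per_key(rows, ["snapshot_date", "country", "daily_rank"], "track_name")
removed = len(rows) - len(valid)
print(f"Rows removed: {removed}")
```

As in the Spark version, the number of removed rows is the difference between the row counts before and after the filter.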


#### Function execution

In [71]:
# Define input and output paths for Trusted Zone
trusted_zone_paths = {
    'spotify_tracks': './data/trusted_zone/spotify-tracks-dataset',
    'top_songs': './data/trusted_zone/top-spotify-songs-by-country',
    'song_lyrics': './data/trusted_zone/songs-lyrics'
}

# Execute preprocessing for Trusted Zone
preprocess_spotify_tracks(spark, formatted_paths['spotify_tracks'], trusted_zone_paths['spotify_tracks'])
preprocess_top_songs(spark, formatted_paths['top_songs'], trusted_zone_paths['top_songs'])
preprocess_song_lyrics(spark, formatted_paths['song_lyrics'], trusted_zone_paths['song_lyrics'])

2025-04-13 03:35:44,735 - __main__ - INFO - Preprocessing Spotify tracks dataset from ./data/formatted_zone/spotify-tracks-dataset
2025-04-13 03:35:45,469 - __main__ - INFO - Processed Spotify tracks data saved to ./data/trusted_zone/spotify-tracks-dataset
2025-04-13 03:35:45,470 - __main__ - INFO - Preprocessing top songs dataset from ./data/formatted_zone/top-spotify-songs-by-country
2025-04-13 03:35:47,222 - __main__ - INFO - Rows removed due to 'not two songs same on same rank' denial constraint: 3642
2025-04-13 03:35:51,476 - __main__ - INFO - Rows removed due to 'not two songs by same artist and same track name' constraint: 46
2025-04-13 03:35:56,393 - __main__ - INFO - Processed top songs data saved to ./data/trusted_zone/top-spotify-songs-by-country
2025-04-13 03:35:56,394 - __main__ - INFO - Preprocessing song lyrics dataset from ./data/formatted_zone/songs-lyrics
2025-04-13 03:35:56,796 - __main__ - INFO - Rows removed due to 'not unique song_id' denial constraint: 0
2025-04-

### The Exploitation Zone

In this phase, we prepare and structure the final data for use in analysis or visualization applications. We work with the datasets already validated and cleaned in the *trusted zone* and generate two main outputs:

**1. explicit_prediction**

- We read the song lyrics data and the track metadata.
- We join the two datasets with Spark on the `artist_name` and `song_name`/`track_name` columns.
- The join yields a structure containing each song's lyrics and whether the song is flagged as explicit.
- The result is saved in Parquet format to `data/exploitation_zone/explicit_prediction`.

**2. data_visualization**

- We load the dataset of the most-streamed songs per country.
- We select only the columns needed for visualization: `track_name`, `artist_name`, `daily_rank`, `snapshot_date`, and `country`.
- We save it in Parquet format to `data/exploitation_zone/data_visualization`.

These two outputs form the basis for future predictive models and interactive visualizations.
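
The logic behind both outputs can be illustrated without Spark. The sketch below is plain Python over hypothetical in-memory rows; the helper names `build_explicit_prediction` and `build_data_visualization`, the constant `VIS_COLUMNS`, and all sample data are assumptions made for illustration, not part of the project code. It strips the ` Lyrics` suffix from artist names, inner-joins lyrics and tracks on the (artist, title) pair, and projects the visualization columns.

```python
VIS_COLUMNS = ("track_name", "artist_name", "daily_rank", "snapshot_date", "country")


def build_explicit_prediction(lyrics_rows, track_rows):
    """Inner join on (artist_name, title); keep the lyrics and the explicit flag."""
    # Index tracks by the join key; one key may map to several tracks.
    tracks_by_key = {}
    for track in track_rows:
        join_key = (track["artist_name"], track["track_name"])
        tracks_by_key.setdefault(join_key, []).append(track)

    joined = []
    for lyric in lyrics_rows:
        # Remove the " Lyrics" suffix so artist names match the track metadata.
        artist = lyric["artist_name"].replace(" Lyrics", "")
        for track in tracks_by_key.get((artist, lyric["song_name"]), []):
            joined.append({"song_lyrics": lyric["song_lyrics"], "explicit": track["explicit"]})
    return joined


def build_data_visualization(top_song_rows):
    """Keep only the columns needed for visualization."""
    return [{c: row[c] for c in VIS_COLUMNS} for row in top_song_rows]


# Hypothetical sample rows.
lyrics = [
    {"artist_name": "Artist A Lyrics", "song_name": "Song 1", "song_lyrics": "la la la"},
    {"artist_name": "Artist C Lyrics", "song_name": "Song 3", "song_lyrics": "na na na"},
]
tracks = [
    {"artist_name": "Artist A", "track_name": "Song 1", "explicit": False},
    {"artist_name": "Artist B", "track_name": "Song 2", "explicit": True},
]
top_songs = [
    {"spotify_id": "id1", "track_name": "Song 1", "artist_name": "Artist A",
     "daily_rank": 1, "snapshot_date": "2025-01-01", "country": "ES", "popularity": 90},
]

explicit_prediction = build_explicit_prediction(lyrics, tracks)
data_visualization = build_data_visualization(top_songs)
print(len(explicit_prediction), len(data_visualization))
```

Because the join is an inner join on exact string equality, any lyrics row whose artist or title is spelled differently in the track metadata is dropped from `explicit_prediction`.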

In [11]:

# # Read all Parquet files in the folders
# df_lyrics = spark.read.parquet("data/trusted_zone/songs-lyrics.parquet")
# rdd_lyrics = df_lyrics.withColumn("artist_name", trim(regexp_replace("artist_name", " Lyrics", ""))).rdd
# rdd_tracks = spark.read.parquet("data/trusted_zone/spotify-tracks-dataset.parquet").rdd

# rdd_lyrics = rdd_lyrics.map(lambda f: ((f['artist_name'], f['song_name']), f['song_lyrics']))
# rdd_tracks = rdd_tracks.map(lambda f: ((f['artist_name'], f['track_name']), f['explicit']))
# rdd_joined = rdd_lyrics.join(rdd_tracks).map(lambda f: f[1])
# df_explicit_prediction = spark.createDataFrame(rdd_joined)
# df_explicit_prediction.write.parquet("data/exploitation_zone/explicit_prediction.parquet")

# rdd_top_songs = spark.read.parquet("data/trusted_zone/top-spotify-songs-by-country.parquet").rdd
# rdd_top_songs = rdd_top_songs.map(lambda f: (f['track_name'], f['artist_name'], f['daily_rank'], f['snapshot_date'], f['country']))
# df_data_visualization = spark.createDataFrame(rdd_top_songs)
# df_data_visualization.write.parquet("data/exploitation_zone/data_visualization.parquet")


# Read the data using Spark
try:
    # Try reading from trusted_zone first
    df_lyrics = spark.read.parquet("data/trusted_zone/songs-lyrics")
    df_tracks = spark.read.parquet("data/trusted_zone/spotify-tracks-dataset")
    df_top_songs = spark.read.parquet("data/trusted_zone/top-spotify-songs-by-country")
    print("Datos cargados desde trusted_zone")
except Exception as e:
    print(f"Error al leer desde trusted_zone: {e}")
    print("Intentando leer desde landing_zone...")
    try:
        # If trusted_zone cannot be read, fall back to landing_zone
        df_lyrics = spark.read.parquet("data/landing_zone/songs-lyrics")
        df_tracks = spark.read.parquet("data/landing_zone/spotify-tracks-dataset")
        df_top_songs = spark.read.parquet("data/landing_zone/top-spotify-songs-by-country")
        print("Datos cargados desde landing_zone")
        
        # Rename columns if necessary
        if 'artist' in df_lyrics.columns:
            df_lyrics = df_lyrics.withColumnRenamed("artist", "artist_name")
            df_lyrics = df_lyrics.withColumnRenamed("lyrics", "song_lyrics")
        if 'name' in df_top_songs.columns:
            df_top_songs = df_top_songs.withColumnRenamed("name", "track_name")
    except Exception as e2:
        print(f"Error al leer desde landing_zone: {e2}")
        print("No se pudieron cargar los datos.")
        raise

# 1. Explicit prediction
# Clean artist names if necessary (strip the " Lyrics" suffix)
if 'artist_name' in df_lyrics.columns:
    df_lyrics = df_lyrics.withColumn(
        "artist_name", 
        regexp_replace(col("artist_name"), " Lyrics", "")
    )

# Join on artist_name and song_name/track_name
try:
    # Select the columns needed for the join
    lyrics_for_join = df_lyrics.select("artist_name", "song_name", "song_lyrics")
    tracks_for_join = df_tracks.select("artist_name", "track_name", "explicit")
    
    # Perform the join
    explicit_prediction = lyrics_for_join.join(
        tracks_for_join,
        (lyrics_for_join.artist_name == tracks_for_join.artist_name) & 
        (lyrics_for_join.song_name == tracks_for_join.track_name),
        "inner"
    ).select(
        lyrics_for_join.song_lyrics, 
        tracks_for_join.explicit
    )
    
    # Show the size of the resulting dataset
    print(f"Registros en explicit_prediction: {explicit_prediction.count()}")
    
    # Save the results
    explicit_prediction.write.mode("overwrite").parquet("data/exploitation_zone/explicit_prediction")
    print("Dataset explicit_prediction guardado correctamente")
except Exception as e:
    print(f"Error al crear explicit_prediction: {e}")

# 2. Data visualization
try:
    # Select the columns needed for visualization
    data_visualization = df_top_songs.select(
        "track_name", "artist_name", "daily_rank", "snapshot_date", "country"
    )
    
    # Show the size of the resulting dataset
    print(f"Registros en data_visualization: {data_visualization.count()}")
    
    # Save the results
    data_visualization.write.mode("overwrite").parquet("data/exploitation_zone/data_visualization")
    print("Dataset data_visualization guardado correctamente")
except Exception as e:
    print(f"Error al crear data_visualization: {e}")

# Stop the Spark session to free resources
spark.stop()

Datos cargados desde trusted_zone
Registros en explicit_prediction: 2258
Dataset explicit_prediction guardado correctamente
Registros en data_visualization: 1923107
Dataset data_visualization guardado correctamente


## The Data Analysis Pipelines