# Large-Scale Data Engineering for AI
In this project we worked with three music-related datasets, all concerning songs available on Spotify. Each dataset offers a complementary view of streaming music, from the content of the songs to track metadata and popularity by country. Each one is briefly described below:

- **songs-lyrics** → Contains about 25,000 songs with their lyrics.
- **spotify-tracks-dataset** → Includes information on Spotify songs across 125 genres, along with musical attributes such as popularity, energy, danceability, etc.
- **top-spotify-songs-by-country** → Contains the most-streamed songs each day in 72 countries and is updated continuously.

## The Data Engineering Pipeline
### The Landing Zone
For the landing zone, we used the Kaggle API to obtain the required datasets. Before the API can be used, an authentication token must be generated: open your Kaggle user profile, go to Settings and, near the bottom of the page, click "Create New Token", which downloads a .json file containing the credentials.

We followed the official [guide](https://www.kaggle.com/docs/api#getting-started-installation-&-authentication) to install and configure the API correctly. The basic steps are:
```
pip install kaggle
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
```
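
Before calling the API, it can help to confirm that the token file is where the client expects it. The following is a minimal sketch, not part of the original notebook: the helper `check_kaggle_credentials` is a hypothetical name, and it assumes the token file contains the `username` and `key` fields and lives at `~/.kaggle/kaggle.json`, as in the steps above.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_credentials(path: Path) -> list[str]:
    """Return a list of problems found in a kaggle.json file (empty if none)."""
    if not path.is_file():
        return [f"{path} does not exist"]
    try:
        creds = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{path} is not valid JSON: {exc}"]
    if not isinstance(creds, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    # The downloaded token file stores the account name and the API key.
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append(f"missing field: {field}")
    # On POSIX systems, flag tokens that other users are able to read.
    if os.name == "posix" and path.stat().st_mode & (stat.S_IRGRP | stat.S_IROTH):
        problems.append("file is readable by other users; consider chmod 600")
    return problems


# Hypothetical usage against the default location used in the steps above.
print(check_kaggle_credentials(Path.home() / ".kaggle" / "kaggle.json"))
```

An empty list means the file looks usable; any other result points to the step that needs to be repeated.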

Once the connection to the API was established, we wrote a function that automates downloading the three datasets and saves them to the desired location. The datasets are originally in .csv format, but we chose to convert them to .parquet to optimize storage and subsequent reads.

In particular, the main dataset containing the lyrics also includes additional information about albums and songs. We only need the lyrics data, so we discard the rest.
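
The file selection described above can be sketched in plain Python. This is an illustrative fragment rather than the notebook's collector: `csv_files_to_convert` and `SKIPPED_FILES` are hypothetical names, the two skipped filenames are the ones the collector function ignores, and the directory listing is toy data.

```python
# Auxiliary files shipped with the lyrics dataset that the collector ignores.
SKIPPED_FILES = {"songs_details.csv", "album_details.csv"}


def csv_files_to_convert(filenames: list[str]) -> list[str]:
    """Return the CSV files that should be converted to Parquet."""
    return [
        name
        for name in filenames
        if name.endswith(".csv") and name not in SKIPPED_FILES
    ]


# Toy directory listing (illustrative; not the real archive contents).
listing = ["lyrics.csv", "songs_details.csv", "album_details.csv", "README.md"]
print(csv_files_to_convert(listing))  # ['lyrics.csv']
```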

In summary, the process consists of:

- Downloading the datasets through the Kaggle API
- Converting them to .parquet format
- Saving them to the data/landing_zone directory.

Throughout this process, several checks are performed so that, if an error occurs, enough information is available to detect and fix the problem.

In addition, the function that runs the process has a parameter named `update`. If it is set to `True`, only the **top-spotify-songs-by-country** dataset is downloaded, since it is the only one that is updated periodically (specifically, every day). This way, our **data collector** supports periodic runs and keeps the data up to date with little effort.
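
The effect of the `update` flag can be illustrated in isolation. The sketch below assumes each dataset entry carries a boolean `update` field, as in the notebook's dataset list; `select_datasets` and `example_datasets` are hypothetical names used only for this example.

```python
def select_datasets(datasets: list[dict], update: bool = False) -> list[dict]:
    """Return the datasets to download for a full run or an update run."""
    if not update:
        # Full run: every dataset is downloaded.
        return list(datasets)
    # Update run: only datasets flagged as periodically refreshed.
    return [d for d in datasets if d["update"]]


example_datasets = [
    {"dataset_name": "top-spotify-songs-by-country", "update": True},
    {"dataset_name": "spotify-tracks-dataset", "update": False},
    {"dataset_name": "songs-lyrics", "update": False},
]

print([d["dataset_name"] for d in select_datasets(example_datasets, update=True)])
# ['top-spotify-songs-by-country']
```

With `update=False` all three entries are returned, which corresponds to the initial creation of the landing zone.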

In [7]:
# Imports and requirements
!pip install kaggle
!pip install pyspark



In [6]:
# Imports and requirements
import os
import kaggle
import pandas as pd
import logging
import shutil
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when, regexp_replace, trim, lower, upper, to_date, year, month, dayofmonth, explode, split

In [7]:
# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Create Spark Session
def create_spark_session(app_name="Spotify_ETL"):
    """
    Creates and returns a Spark session.
    
    Args:
        app_name (str): The name of the Spark application.
        
    Returns:
        SparkSession: A configured Spark session.
    """
    logger.info(f"Creating Spark session with app name: {app_name}")
    
    try:
        # Create a SparkSession with appropriate settings
        spark = (SparkSession.builder
                .appName(app_name)
                .config("spark.sql.execution.arrow.pyspark.enabled", "true")
                .config("spark.sql.shuffle.partitions", "10")
                .config("spark.driver.memory", "2g")
                .config("spark.executor.memory", "4g")
                .config("spark.default.parallelism", "4")
                .config("spark.sql.adaptive.enabled", "true")
                .getOrCreate())
        
        # Set log level to ERROR to reduce verbosity
        spark.sparkContext.setLogLevel("ERROR")
        
        logger.info("Spark session created successfully")
        return spark
    
    except Exception as e:
        logger.error(f"Error creating Spark session: {e}")
        raise

spark = create_spark_session()

2025-04-13 00:53:51,241 - __main__ - INFO - Creating Spark session with app name: Spotify_ETL
2025-04-13 00:53:51,380 - __main__ - INFO - Spark session created successfully


In [12]:
# Base directory and Landing Zone directory structure
BASE_DIR = "./data"
LANDING_ZONE_DIR = os.path.join(BASE_DIR, "landing_zone")
os.makedirs(LANDING_ZONE_DIR, exist_ok=True)

kaggle.api.authenticate()

# List of datasets to ingest from Kaggle
datasets = [
        {
            "kaggle_id": "asaniczka/top-spotify-songs-in-73-countries-daily-updated",
            "dataset_name": "top-spotify-songs-by-country",
            "update": True
        },
        {
            "kaggle_id": "maharshipandya/-spotify-tracks-dataset",
            "dataset_name": "spotify-tracks-dataset",
            "update": False
        },
        {
            "kaggle_id": "terminate9298/songs-lyrics",
            "dataset_name": "songs-lyrics",
            "update": False
        }
        ]

In [13]:
def data_collector_kaggle(kaggle_dataset: dict) -> None:
    """
    Downloads a dataset from Kaggle and saves it to the landing zone.

    Parameters:
    kaggle_dataset (dict): A dictionary containing the Kaggle dataset information.
    """

    # Extract dataset information
    kaggle_id = kaggle_dataset["kaggle_id"]
    dataset_name = kaggle_dataset["dataset_name"]

    # Create a temporary directory for the dataset using the actual dataset name
    dataset_folder = os.path.join(LANDING_ZONE_DIR, f"temp_{dataset_name}")
    os.makedirs(dataset_folder, exist_ok=True)

    try:  
        logging.info(f"Downloading dataset: {kaggle_id}")

        kaggle.api.dataset_download_files(
            kaggle_id,
            path=dataset_folder,
            unzip=True
        )

        csv_found = False
        for filename in os.listdir(dataset_folder):
            if filename in ['songs_details.csv', 'album_details.csv']:
                continue
            if filename.endswith(".csv"):
                csv_found = True
                csv_path = os.path.join(dataset_folder, filename)

                # Read CSV with Pandas
                df = pd.read_csv(csv_path)

                # Write as a single Parquet file
                final_path = os.path.join(LANDING_ZONE_DIR, f"{dataset_name}.parquet")
                df.to_parquet(final_path, index=False)

                logging.info(f"CSV '{filename}' converted to single Parquet file and saved as '{final_path}'.")
                
        if not csv_found:
            logging.warning("No CSV file found in the downloaded dataset. Check the contents of the download.")
            
        # Remove the temporary dataset folder
        shutil.rmtree(dataset_folder)

    except Exception as e:
        # Remove the temporary dataset folder if it still exists
        if os.path.exists(dataset_folder):
            shutil.rmtree(dataset_folder)

        # Log the error and re-raise so the caller does not report success
        logging.error(f"Error downloading dataset '{kaggle_id}': {e}")
        raise

    # Log the successful download
    logging.info(f"Dataset '{dataset_name}' downloaded successfully.")

def download_and_store_datasets(update: bool = False) -> None:
    """
    Downloads and stores datasets from Kaggle into the landing zone.
    """
    logging.info("Starting the creation of the Landing Zone using Kaggle API")

    for kaggle_dataset in datasets:
        if update and not kaggle_dataset["update"]:
            logging.info(f"Skipping dataset '{kaggle_dataset['dataset_name']}' as update is set to False.")
            continue
        try:
            dataset_name = kaggle_dataset["dataset_name"]
            data_collector_kaggle(kaggle_dataset)
            logging.info(f"Dataset '{dataset_name}' processed successfully.")
        except Exception as e:
            logging.error(f"Error processing dataset '{dataset_name}': {e}")

    logging.info("All datasets have been processed.")
    logging.info("Landing Zone creation completed.")

download_and_store_datasets(update=False)

2025-04-12 23:04:22,233 - root - INFO - Starting the creation of the Landing Zone using Kaggle API
2025-04-12 23:04:22,233 - root - INFO - Downloading dataset: asaniczka/top-spotify-songs-in-73-countries-daily-updated


Dataset URL: https://www.kaggle.com/datasets/asaniczka/top-spotify-songs-in-73-countries-daily-updated


2025-04-12 23:04:35,241 - root - INFO - CSV 'universal_top_spotify_songs.csv' converted to single Parquet file and saved as './data\landing_zone\top-spotify-songs-by-country.parquet'.
2025-04-12 23:04:35,270 - root - INFO - Dataset 'top-spotify-songs-by-country' downloaded successfully.
2025-04-12 23:04:35,319 - root - INFO - Dataset 'top-spotify-songs-by-country' processed successfully.
2025-04-12 23:04:35,320 - root - INFO - Downloading dataset: maharshipandya/-spotify-tracks-dataset


Dataset URL: https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset


2025-04-12 23:04:37,349 - root - INFO - CSV 'dataset.csv' converted to single Parquet file and saved as './data\landing_zone\spotify-tracks-dataset.parquet'.
2025-04-12 23:04:37,353 - root - INFO - Dataset 'spotify-tracks-dataset' downloaded successfully.
2025-04-12 23:04:37,370 - root - INFO - Dataset 'spotify-tracks-dataset' processed successfully.
2025-04-12 23:04:37,372 - root - INFO - Downloading dataset: terminate9298/songs-lyrics


Dataset URL: https://www.kaggle.com/datasets/terminate9298/songs-lyrics


2025-04-12 23:04:39,554 - root - INFO - CSV 'lyrics.csv' converted to single Parquet file and saved as './data\landing_zone\songs-lyrics.parquet'.
2025-04-12 23:04:39,560 - root - INFO - Dataset 'songs-lyrics' downloaded successfully.
2025-04-12 23:04:39,570 - root - INFO - Dataset 'songs-lyrics' processed successfully.
2025-04-12 23:04:39,571 - root - INFO - All datasets have been processed.
2025-04-12 23:04:39,572 - root - INFO - Landing Zone creation completed.


### The Formatted Zone

In [8]:
def process_spotify_tracks(spark, input_path, output_path):
    """
    Process the Spotify tracks dataset.
    
    Args:
        spark (SparkSession): The Spark session.
        input_path (str): The input file path.
        output_path (str): The output directory path.
    """
    logger.info(f"Processing Spotify tracks dataset from {input_path}")
    
    try:
        # Read the parquet file
        df = spark.read.parquet(input_path)
        
        # Print schema and count before processing
        logger.info("Original schema:")
        df.printSchema()
        count_before = df.count()
        logger.info(f"Count before processing: {count_before}")
        
        # Clean and transform the data
        processed_df = df.select(
            col("track_id").alias("track_id"),
            col("track_name").alias("track_name"),
            col("artists").alias("artist_name"),
            col("album_name"),
            col("popularity").cast("integer").alias("popularity"),
            col("duration_ms").cast("long").alias("duration_ms"),
            col("explicit").cast("boolean").alias("explicit"),
            col("danceability").cast("double").alias("danceability"),
            col("energy").cast("double").alias("energy"),
            col("key").cast("integer").alias("key"),
            col("loudness").cast("double").alias("loudness"),
            col("mode").cast("integer").alias("mode"),
            col("speechiness").cast("double").alias("speechiness"),
            col("acousticness").cast("double").alias("acousticness"),
            col("instrumentalness").cast("double").alias("instrumentalness"),
            col("liveness").cast("double").alias("liveness"),
            col("valence").cast("double").alias("valence"),
            col("tempo").cast("double").alias("tempo")
        )
        
        # Remove rows with null track_id or track_name
        # TRUSTED ZONE
        # processed_df = processed_df.filter(
        #     col("track_id").isNotNull() & 
        #     col("track_name").isNotNull()
        # )
        
        # Print schema and count after processing
        logger.info("Processed schema:")
        processed_df.printSchema()
        count_after = processed_df.count()
        logger.info(f"Count after processing: {count_after}")
        logger.info(f"Removed {count_before - count_after} rows during processing")
        
        # Write the processed data as Parquet
        processed_df.write.mode("overwrite").parquet(output_path)
        logger.info(f"Processed Spotify tracks data saved to {output_path}")
        
        return output_path
        
    except Exception as e:
        logger.error(f"Error processing Spotify tracks dataset: {e}")
        raise

def process_top_songs(spark, input_path, output_path):
    """
    Process the top Spotify songs dataset, whose input schema is:
    
    root
     |-- spotify_id: string (nullable = true)
     |-- name: string (nullable = true)
     |-- artists: string (nullable = true)
     |-- daily_rank: long (nullable = true)
     |-- daily_movement: long (nullable = true)
     |-- weekly_movement: long (nullable = true)
     |-- country: string (nullable = true)
     |-- snapshot_date: string (nullable = true)
     |-- popularity: long (nullable = true)
     |-- is_explicit: boolean (nullable = true)
     |-- duration_ms: long (nullable = true)
     |-- album_name: string (nullable = true)
     |-- album_release_date: string (nullable = true)
     |-- danceability: double (nullable = true)
     |-- energy: double (nullable = true)
     |-- key: long (nullable = true)
     |-- loudness: double (nullable = true)
     |-- mode: long (nullable = true)
     |-- speechiness: double (nullable = true)
     |-- acousticness: double (nullable = true)
     |-- instrumentalness: double (nullable = true)
     |-- liveness: double (nullable = true)
     |-- valence: double (nullable = true)
     |-- tempo: double (nullable = true)
     |-- time_signature: double (nullable = true)
    
    Args:
        spark (SparkSession): The Spark session.
        input_path (str): Path to the input Parquet file.
        output_path (str): Path to the output directory.
    """
    logger.info(f"Procesando datos de Spotify desde {input_path}")
    
    try:
        # Read the Parquet file
        df = spark.read.parquet(input_path)
        
        # Print the schema and count rows before processing
        logger.info("Schema original:")
        df.printSchema()
        count_before = df.count()
        logger.info(f"Filas antes del procesamiento: {count_before}")
        
        # Select and transform the columns according to the new schema
        processed_df = df.select(
            col("spotify_id").alias("spotify_id"),
            col("name").alias("track_name"),        
            col("artists").alias("artist_name"),
            col("daily_rank").cast("integer").alias("daily_rank"),
            col("daily_movement").alias("daily_movement"),
            col("weekly_movement").alias("weekly_movement"),
            col("country").alias("country"),
            col("snapshot_date").alias("snapshot_date"),
            col("popularity").cast("integer").alias("popularity"),
            col("is_explicit").alias("is_explicit"),
            col("duration_ms").cast("long").alias("duration_ms"),
            col("album_name").alias("album_name"),
            col("album_release_date").alias("album_release_date"),
            col("danceability").alias("danceability"),
            col("energy").alias("energy"),
            col("key").alias("key"),
            col("loudness").alias("loudness"),
            col("mode").alias("mode"),
            col("speechiness").alias("speechiness"),
            col("acousticness").cast("double").alias("acousticness"),
            col("instrumentalness").cast("double").alias("instrumentalness"),
            col("liveness").cast("double").alias("liveness"),
            col("valence").cast("double").alias("valence"),
            col("tempo").cast("double").alias("tempo"),
            col("time_signature").cast("double").alias("time_signature")
        )
        
        # Handle null values for some numeric columns
        # TRUSTED ZONE
        # processed_df = processed_df.na.fill({
        #     "daily_rank": 0,
        #     "popularity": 0,
        #     "duration_ms": 0,
        #     "acousticness": 0.0,
        #     "instrumentalness": 0.0,
        #     "liveness": 0.0,
        #     "valence": 0.0,
        #     "tempo": 0.0,
        #     "time_signature": 0.0
        # })
        
        # Convert snapshot_date and album_release_date to dates (adjust the pattern if necessary)
        processed_df = processed_df.withColumn("snapshot_date", to_date(col("snapshot_date"), "yyyy-MM-dd"))
        processed_df = processed_df.withColumn("album_release_date", to_date(col("album_release_date"), "yyyy-MM-dd"))
        
        # Extract year, month and day from snapshot_date for possible further analysis
        processed_df = processed_df.withColumn("snapshot_year", year(col("snapshot_date"))) \
                                   .withColumn("snapshot_month", month(col("snapshot_date"))) \
                                   .withColumn("snapshot_day", dayofmonth(col("snapshot_date")))
        
        # Clean the country field: collapse extra whitespace and convert to uppercase
        processed_df = processed_df.withColumn(
            "country", 
            upper(trim(regexp_replace(col("country"), "\\s+", " ")))
        )
        
        # Keep only rows where the song name (track_name) is not null
        # TRUSTED ZONE
        # processed_df = processed_df.filter(col("track_name").isNotNull())
        
        # Print the schema and count rows after processing
        logger.info("Schema procesado:")
        processed_df.printSchema()
        count_after = processed_df.count()
        logger.info(f"Filas después del procesamiento: {count_after}")
        logger.info(f"Se removieron {count_before - count_after} filas durante el procesamiento")
        
        # Write the processed DataFrame in Parquet format
        processed_df.write.mode("overwrite").parquet(output_path)
        logger.info(f"Datos procesados de Spotify guardados en {output_path}")
        
        return output_path
        
    except Exception as e:
        logger.error(f"Error al procesar los datos de Spotify: {e}")
        raise

def process_song_lyrics(spark, input_path, output_path):
    """
    Process the song lyrics dataset.
    
    Args:
        spark (SparkSession): The Spark session.
        input_path (str): The input file path.
        output_path (str): The output directory path.
    """
    logger.info(f"Processing song lyrics from {input_path}")
    
    try:
        # Read the Parquet file from the landing zone
        df = spark.read.parquet(input_path)
        
        # Print schema and count before processing
        logger.info("Original schema:")
        df.printSchema()
        count_before = df.count()
        logger.info(f"Count before processing: {count_before}")
        
        # Clean and transform the data based on the actual columns in songs-lyrics.csv
        processed_df = df.select(
            col("Unnamed: 0").cast("integer").alias("song_id"),
            col("artist").alias("artist_name"),
            col("song_name").alias("song_name"),
            col("lyrics").alias("song_lyrics")
        )
        
        # Clean artist and track names
        processed_df = processed_df.withColumn("artist_name", trim(col("artist_name"))) \
                                 .withColumn("song_name", trim(col("song_name")))
        
        # Filter rows with valid song_id and song_name
        # TRUSTED ZONE
        # processed_df = processed_df.filter(col("song_id").isNotNull() & col("song_name").isNotNull())
        
        # Print schema and count after processing
        logger.info("Processed schema:")
        processed_df.printSchema()
        count_after = processed_df.count()
        logger.info(f"Count after processing: {count_after}")
        logger.info(f"Removed {count_before - count_after} rows during processing")
        
        # Write the processed data as Parquet
        processed_df.write.mode("overwrite").parquet(output_path)
        logger.info(f"Processed song lyrics data saved to {output_path}")
        
        return output_path
        
    except Exception as e:
        logger.error(f"Error processing song lyrics dataset: {e}")
        raise

In [9]:
# Define input and output paths
input_paths = {
    'spotify_tracks': './data/landing_zone/spotify-tracks-dataset.parquet',
    'top_songs': './data/landing_zone/top-spotify-songs-by-country.parquet',
    'song_lyrics': './data/landing_zone/songs-lyrics.parquet'
}
output_paths = {
    'spotify_tracks': './data/formatted_zone/spotify-tracks-dataset',
    'top_songs': './data/formatted_zone/top-spotify-songs-by-country',
    'song_lyrics': './data/formatted_zone/songs-lyrics'
}

# Process datasets
process_spotify_tracks(spark, input_paths['spotify_tracks'], output_paths['spotify_tracks'])
process_top_songs(spark, input_paths['top_songs'], output_paths['top_songs'])
process_song_lyrics(spark, input_paths['song_lyrics'], output_paths['song_lyrics'])

2025-04-13 00:54:11,583 - __main__ - INFO - Processing Spotify tracks dataset from ./data/landing_zone/spotify-tracks-dataset.parquet
2025-04-13 00:54:11,662 - __main__ - INFO - Original schema:
2025-04-13 00:54:11,753 - __main__ - INFO - Count before processing: 114000
2025-04-13 00:54:11,806 - __main__ - INFO - Processed schema:


root
 |-- Unnamed: 0: long (nullable = true)
 |-- track_id: string (nullable = true)
 |-- artists: string (nullable = true)
 |-- album_name: string (nullable = true)
 |-- track_name: string (nullable = true)
 |-- popularity: long (nullable = true)
 |-- duration_ms: long (nullable = true)
 |-- explicit: boolean (nullable = true)
 |-- danceability: double (nullable = true)
 |-- energy: double (nullable = true)
 |-- key: long (nullable = true)
 |-- loudness: double (nullable = true)
 |-- mode: long (nullable = true)
 |-- speechiness: double (nullable = true)
 |-- acousticness: double (nullable = true)
 |-- instrumentalness: double (nullable = true)
 |-- liveness: double (nullable = true)
 |-- valence: double (nullable = true)
 |-- tempo: double (nullable = true)
 |-- time_signature: long (nullable = true)
 |-- track_genre: string (nullable = true)

root
 |-- track_id: string (nullable = true)
 |-- track_name: string (nullable = true)
 |-- artist_name: string (nullable = true)
 |-- album_n

2025-04-13 00:54:11,881 - __main__ - INFO - Count after processing: 114000
2025-04-13 00:54:11,882 - __main__ - INFO - Removed 0 rows during processing
2025-04-13 00:54:12,561 - __main__ - INFO - Processed Spotify tracks data saved to ./data/formatted_zone/spotify-tracks-dataset
2025-04-13 00:54:12,562 - __main__ - INFO - Procesando datos de Spotify desde ./data/landing_zone/top-spotify-songs-by-country.parquet
2025-04-13 00:54:12,624 - __main__ - INFO - Schema original:
2025-04-13 00:54:12,715 - __main__ - INFO - Filas antes del procesamiento: 1923107
2025-04-13 00:54:12,798 - __main__ - INFO - Schema procesado:


root
 |-- spotify_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- artists: string (nullable = true)
 |-- daily_rank: long (nullable = true)
 |-- daily_movement: long (nullable = true)
 |-- weekly_movement: long (nullable = true)
 |-- country: string (nullable = true)
 |-- snapshot_date: string (nullable = true)
 |-- popularity: long (nullable = true)
 |-- is_explicit: boolean (nullable = true)
 |-- duration_ms: long (nullable = true)
 |-- album_name: string (nullable = true)
 |-- album_release_date: string (nullable = true)
 |-- danceability: double (nullable = true)
 |-- energy: double (nullable = true)
 |-- key: long (nullable = true)
 |-- loudness: double (nullable = true)
 |-- mode: long (nullable = true)
 |-- speechiness: double (nullable = true)
 |-- acousticness: double (nullable = true)
 |-- instrumentalness: double (nullable = true)
 |-- liveness: double (nullable = true)
 |-- valence: double (nullable = true)
 |-- tempo: double (nullable = true)
 |-- tim

2025-04-13 00:54:12,878 - __main__ - INFO - Filas después del procesamiento: 1923107
2025-04-13 00:54:12,878 - __main__ - INFO - Se removieron 0 filas durante el procesamiento
2025-04-13 00:54:18,083 - __main__ - INFO - Datos procesados de Spotify guardados en ./data/formatted_zone/top-spotify-songs-by-country
2025-04-13 00:54:18,084 - __main__ - INFO - Processing song lyrics from ./data/landing_zone/songs-lyrics.parquet
2025-04-13 00:54:18,212 - __main__ - INFO - Original schema:
2025-04-13 00:54:18,346 - __main__ - INFO - Count before processing: 25742
2025-04-13 00:54:18,378 - __main__ - INFO - Processed schema:


root
 |-- Unnamed: 0: long (nullable = true)
 |-- link: string (nullable = true)
 |-- artist: string (nullable = true)
 |-- song_name: string (nullable = true)
 |-- lyrics: string (nullable = true)

root
 |-- song_id: integer (nullable = true)
 |-- artist_name: string (nullable = true)
 |-- song_name: string (nullable = true)
 |-- song_lyrics: string (nullable = true)



2025-04-13 00:54:18,497 - __main__ - INFO - Count after processing: 25742
2025-04-13 00:54:18,498 - __main__ - INFO - Removed 0 rows during processing
2025-04-13 00:54:18,988 - __main__ - INFO - Processed song lyrics data saved to ./data/formatted_zone/songs-lyrics


'./data/formatted_zone/songs-lyrics'

### The Trusted Zone

### The Exploitation Zone

In this phase of the process, we prepare and structure the final data for use in analysis or visualization applications. We work with the datasets already validated and cleaned in the *trusted zone*, and we generate two main outputs:

**1. explicit_prediction**

- We read the song lyrics data and the track metadata.
- We combine them with Spark, using a join on the `artist_name` and `song_name`/`track_name` columns.
- The join yields a structure containing each song's lyrics and whether the song is flagged as explicit.
- The result is saved in Parquet format to `data/exploitation_zone/explicit_prediction`.

**2. data_visualization**

- We load the dataset of the most-streamed songs by country.
- We select only the columns needed for visualizations: `track_name`, `artist_name`, `daily_rank`, `snapshot_date`, and `country`.
- We save it in Parquet format to `data/exploitation_zone/data_visualization`.

These two outputs form the basis for future predictive models and interactive visualizations.
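
To make the join behind **explicit_prediction** concrete, here is a small plain-Python sketch of the same idea on toy rows. It is not the Spark implementation used in the notebook: the function names and the sample rows are invented for illustration, and the only behaviour taken from the notebook is the removal of the " Lyrics" text from artist names and the exact match on artist and title.

```python
from collections import defaultdict


def clean_artist(name: str) -> str:
    """Remove the ' Lyrics' marker from an artist name and trim whitespace."""
    return name.replace(" Lyrics", "").strip()


def join_lyrics_with_explicit(lyrics_rows: list[dict], track_rows: list[dict]) -> list[dict]:
    """Inner join on (artist_name, title), keeping lyrics and the explicit flag."""
    explicit_by_key = defaultdict(list)
    for track in track_rows:
        explicit_by_key[(track["artist_name"], track["track_name"])].append(track["explicit"])

    joined = []
    for row in lyrics_rows:
        key = (clean_artist(row["artist_name"]), row["song_name"])
        # A lyrics row with no matching track is dropped (inner join).
        for explicit in explicit_by_key.get(key, []):
            joined.append({"song_lyrics": row["song_lyrics"], "explicit": explicit})
    return joined


# Toy rows, invented for this example.
toy_lyrics = [
    {"artist_name": "Artist A Lyrics", "song_name": "Song 1", "song_lyrics": "la la la"},
    {"artist_name": "Artist B Lyrics", "song_name": "Song 2", "song_lyrics": "na na na"},
]
toy_tracks = [
    {"artist_name": "Artist A", "track_name": "Song 1", "explicit": False},
    {"artist_name": "Artist C", "track_name": "Song 3", "explicit": True},
]

print(join_lyrics_with_explicit(toy_lyrics, toy_tracks))
# [{'song_lyrics': 'la la la', 'explicit': False}]
```

Because the match is an exact comparison, differences in case, spacing, or artist formatting between the two datasets prevent rows from being paired, so only songs whose artist and title agree exactly appear in the output.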

In [11]:

# # Read all Parquet files in the folders
# df_lyrics = spark.read.parquet("data/trusted_zone/songs-lyrics.parquet")
# rdd_lyrics = df_lyrics.withColumn("artist_name", trim(regexp_replace("artist_name", " Lyrics", ""))).rdd
# rdd_tracks = spark.read.parquet("data/trusted_zone/spotify-tracks-dataset.parquet").rdd

# rdd_lyrics = rdd_lyrics.map(lambda f: ((f['artist_name'], f['song_name']), f['song_lyrics']))
# rdd_tracks = rdd_tracks.map(lambda f: ((f['artist_name'], f['track_name']), f['explicit']))
# rdd_joined = rdd_lyrics.join(rdd_tracks).map(lambda f: f[1])
# df_explicit_prediction = spark.createDataFrame(rdd_joined)
# df_explicit_prediction.write.parquet("data/exploitation_zone/explicit_prediction.parquet")

# rdd_top_songs = spark.read.parquet("data/trusted_zone/top-spotify-songs-by-country.parquet").rdd
# rdd_top_songs = rdd_top_songs.map(lambda f: (f['track_name'], f['artist_name'], f['daily_rank'], f['snapshot_date'], f['country']))
# df_data_visualization = spark.createDataFrame(rdd_top_songs)
# df_data_visualization.write.parquet("data/exploitation_zone/data_visualization.parquet")


# Read the data using Spark
try:
    # Try reading from trusted_zone first
    df_lyrics = spark.read.parquet("data/trusted_zone/songs-lyrics")
    df_tracks = spark.read.parquet("data/trusted_zone/spotify-tracks-dataset")
    df_top_songs = spark.read.parquet("data/trusted_zone/top-spotify-songs-by-country")
    print("Datos cargados desde trusted_zone")
except Exception as e:
    print(f"Error al leer desde trusted_zone: {e}")
    print("Intentando leer desde landing_zone...")
    try:
        # If trusted_zone cannot be read, fall back to the landing zone files
        # (the data collector saves them as "<dataset_name>.parquet")
        df_lyrics = spark.read.parquet("data/landing_zone/songs-lyrics.parquet")
        df_tracks = spark.read.parquet("data/landing_zone/spotify-tracks-dataset.parquet")
        df_top_songs = spark.read.parquet("data/landing_zone/top-spotify-songs-by-country.parquet")
        print("Datos cargados desde landing_zone")
        
        # Adapt raw column names to the names used downstream
        if 'artist' in df_lyrics.columns:
            df_lyrics = df_lyrics.withColumnRenamed("artist", "artist_name")
            df_lyrics = df_lyrics.withColumnRenamed("lyrics", "song_lyrics")
        if 'artists' in df_tracks.columns:
            df_tracks = df_tracks.withColumnRenamed("artists", "artist_name")
        if 'name' in df_top_songs.columns:
            df_top_songs = df_top_songs.withColumnRenamed("name", "track_name")
        if 'artists' in df_top_songs.columns:
            df_top_songs = df_top_songs.withColumnRenamed("artists", "artist_name")
    except Exception as e2:
        print(f"Error al leer desde landing_zone: {e2}")
        print("No se pudieron cargar los datos.")
        raise

# 1. Explicit prediction
# Clean artist names if necessary
if 'artist_name' in df_lyrics.columns:
    df_lyrics = df_lyrics.withColumn(
        "artist_name", 
        regexp_replace(col("artist_name"), " Lyrics", "")
    )

# Join on artist_name and song_name/track_name
try:
    # Select the columns needed for the join
    lyrics_for_join = df_lyrics.select("artist_name", "song_name", "song_lyrics")
    tracks_for_join = df_tracks.select("artist_name", "track_name", "explicit")
    
    # Perform the join
    explicit_prediction = lyrics_for_join.join(
        tracks_for_join,
        (lyrics_for_join.artist_name == tracks_for_join.artist_name) & 
        (lyrics_for_join.song_name == tracks_for_join.track_name),
        "inner"
    ).select(
        lyrics_for_join.song_lyrics, 
        tracks_for_join.explicit
    )
    
    # Show the size of the resulting dataset
    print(f"Registros en explicit_prediction: {explicit_prediction.count()}")
    
    # Save the results
    explicit_prediction.write.mode("overwrite").parquet("data/exploitation_zone/explicit_prediction")
    print("Dataset explicit_prediction guardado correctamente")
except Exception as e:
    print(f"Error al crear explicit_prediction: {e}")

# 2. Data visualization
try:
    # Select the columns needed for visualization
    data_visualization = df_top_songs.select(
        "track_name", "artist_name", "daily_rank", "snapshot_date", "country"
    )
    
    # Show the size of the resulting dataset
    print(f"Registros en data_visualization: {data_visualization.count()}")
    
    # Save the results
    data_visualization.write.mode("overwrite").parquet("data/exploitation_zone/data_visualization")
    print("Dataset data_visualization guardado correctamente")
except Exception as e:
    print(f"Error al crear data_visualization: {e}")

# Stop the Spark session to free resources
spark.stop()

Datos cargados desde trusted_zone
Registros en explicit_prediction: 2258
Dataset explicit_prediction guardado correctamente
Registros en data_visualization: 1923107
Dataset data_visualization guardado correctamente


## The Data Analysis Pipelines