---
# INFO-H600 - Computing Foundations of Data Sciences

## Team 14 : 

Roman Lešický, Theo Abraham, Kevin Straatman, Lara Hansen, Grégoire Van den Eynde and Nicolas Roux

Python version: 3.11.14 | packaged by conda-forge

---

# Library:

###### The environment.txt file is available in the project's GitHub repository: https://github.com/RomanLesicky/Data_Science_Project_INFO_H600

### How paths are handled in this project:

In [None]:
from pathlib import Path  # We use the pathlib library for all paths

# First, we locate the project's root.
# Path.cwd() is the current working directory, so the notebook must be run from the project's root folder.
project_root = Path.cwd().resolve()

# Then we define the data directory path, which is shared by everyone in this project
# for steps 2 to 5 inclusive.

data_dir = project_root / "data"

# Simple print as a sanity check
print("Project root:", project_root)
print("Data dir:", data_dir)

Project root: C:\Users\roman\Desktop\Master - ULB\2nd year\Q1\Intro Data Sc\Data_Science_Project_INFO_H600
Data dir: C:\Users\roman\Desktop\Master - ULB\2nd year\Q1\Intro Data Sc\Data_Science_Project_INFO_H600\data
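
`Path.cwd()` returns whatever directory the notebook kernel was started in, so the cell above only finds the project root when the notebook is launched from that folder. Below is a minimal sketch of a more defensive lookup, assuming the root can be recognised by a marker entry. The helper name `find_project_root` and the `.git` marker are illustrative assumptions, not part of the project.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = ".git") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found.

    Hypothetical helper: falls back to `start` itself when no marker is found,
    which matches the behaviour of the cell above.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return start


project_root = find_project_root(Path.cwd())
data_dir = project_root / "data"
print("Project root:", project_root)
print("Data dir:", data_dir)
```

With this variant the notebook could also be started from a sub-folder of the repository, as long as the marker exists at the root.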


### Rest of the library:

In [None]:
from pyspark.sql import SparkSession, functions 

#! To be continued: further imports will be added here

---

# Overview of the project :

# `WIP`

---

# Step 1:

`TBD`

# DO NOT FORGET TO DO THE DOCUMENTATION 

# `Do not forget to justify with the course material + official documentation + rewrite a part`

### 1.0 Set-up of the SparkSession

In [None]:
# In this cell we initialise a SparkSession, which can be reused.
# An important detail is master("local[*]"): it tells Spark to run locally with as many worker threads as there are logical CPU cores.
# Users are therefore warned that running Spark-related cells may use their whole CPU.
# The reason for doing this is that it gives us parallelism without needing a proper cluster.
spark = (
    SparkSession.builder
    .appName("MillionPlaylistProject")
    .master("local[*]")
    .getOrCreate()
)

spark  # Just for posterity, we display the session
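
`local[*]` asks for one worker thread per logical core. Spark also accepts `local[K]` for a fixed number of threads, which is one way to leave some cores free for other work. Below is a minimal sketch; `local_master` and `reserve_cores` are hypothetical names introduced here for illustration only.

```python
import os


def local_master(reserve_cores: int = 1) -> str:
    """Build a Spark master URL of the form 'local[K]'.

    K is the number of logical cores minus `reserve_cores`, but never below 1.
    """
    total = os.cpu_count() or 1  # os.cpu_count() may return None
    usable = max(1, total - reserve_cores)
    return f"local[{usable}]"


print(local_master())  # e.g. 'local[7]' on a machine with 8 logical cores
```

The returned string could be passed to `.master(...)` in place of `"local[*]"`.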

### 1.1 Reading JSON slices into raw DataFrames using Spark

In [None]:


"""
This is an important part of the project which needs to be addressed.

The question "How many slice files do we want to read?" needs to be asked, since the answer determines the trade-off between scalability and practical runtime.
Essentially, we want enough data for our metrics to be meaningful without the computation taking too long.

As a group we have decided to hardcode the value 5 for demonstration purposes. This means that we only use the slices mpd.slice.0-999 to mpd.slice.4000-4999, so only about 5k playlists.
The sole reason for this specific value is that it is small enough to run very fast and yet demonstrates that the pipeline works.
Additionally, the user can adapt this number via the global variable `NUMBER_OF_SLICES`, but they should keep in mind that Spark uses all the cores of their CPU for this.

That being said, for practical reasons concerning Tasks 3 and 4 (and 5 too), we shall use a dataset that contains 50 slices, meaning 50 thousand playlists.
This value provides enough data to obtain stable aggregate statistics and similarity scores while keeping computation times manageable on a single machine.

Here we do not use a randomized method to choose the slices, since the data at hand is not ordered, nor are we worried about a certain bias, since we shall be using the 50k
version for the actual metric determination.
"""

from pathlib import Path  # Needed for the type annotation of slice_start_key below

# Change this global variable if a higher number of slices is desired.
NUMBER_OF_SLICES = 5

# This is the path to the original Million Playlist Dataset, used only in this task.
# The dataset will never be published to GitHub, since its folder is listed in the .gitignore file.
data_dir = project_root / "data"

def slice_start_key(path: Path) -> int:
    """
    Extract the numeric start index from filenames like 'mpd.slice.1000-1999.json'
    so that we sort slices in the correct numeric order.
    """
    name = path.name                       # e.g. 'mpd.slice.1000-1999.json'
    middle = name.split('.')[2]            # '1000-1999'
    start_str = middle.split('-')[0]       # '1000'
    return int(start_str)

#! Need to justify this via the documentation

# List and sort all slice files numerically
all_slices = sorted(data_dir.glob("mpd.slice.*.json"), key=slice_start_key)

# Effective number of slices we will use
num_slices = min(NUMBER_OF_SLICES, len(all_slices))

input_paths = [str(p) for p in all_slices[:num_slices]]

print(f"\nNUMBER_OF_SLICES = {NUMBER_OF_SLICES} → actually using {num_slices} slice file(s):")
for p in input_paths:
    print("  ", p)

# Read the selected slice files as a single DataFrame.
# Each file has the structure: {"info": {...}, "playlists": [ {...}, {...}, ... ]}
playlists_raw_df = (
    spark.read
    .option("multiLine", True)  # MPD JSON files are multi-line JSON, not line-delimited
    .json(input_paths)
)

print("\nSchema of the raw JSON DataFrame:")
playlists_raw_df.printSchema()

print("\nExample row (one JSON file):")
playlists_raw_df.show(1, truncate=False)


NameError: name 'Path' is not defined
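
The numeric sort key matters because a plain string sort compares file names character by character, so `mpd.slice.10000-10999.json` would come before `mpd.slice.2000-2999.json`. Below is a small self-contained sketch with made-up file names (no dataset needed). It repeats `slice_start_key` so that it runs on its own.

```python
from pathlib import Path


def slice_start_key(path: Path) -> int:
    """Return the numeric start index of a name like 'mpd.slice.1000-1999.json'."""
    return int(path.name.split(".")[2].split("-")[0])


# Illustrative file names only; nothing is read from disk.
names = [
    "mpd.slice.10000-10999.json",
    "mpd.slice.0-999.json",
    "mpd.slice.2000-2999.json",
    "mpd.slice.1000-1999.json",
]
paths = [Path(n) for n in names]

lexicographic = [p.name for p in sorted(paths, key=lambda p: p.name)]
numeric = [p.name for p in sorted(paths, key=slice_start_key)]

print("String order: ", lexicographic)
print("Numeric order:", numeric)
```

Only the numeric order guarantees that taking the first `NUMBER_OF_SLICES` files yields a contiguous range of playlists starting at 0.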

### 1.2 Flattening pipeline 

# Add more information here, since this is a bit meager

> Note: For `NUMBER_OF_SLICES` = 5 or 50 we rely on Spark's built-in schema inference. If we processed all 35 GB or ran on a cluster, we would define an explicit StructType schema to avoid the extra pass over the data that inference requires.

In [None]:

#! Change every F.col, since F is the default alias used by GPT

### 1.2 Flattening pipeline (slices → playlists → playlist–track rows)

# 1) Flatten `playlists`: one row per playlist
playlists_df = playlists_raw_df.select(
    functions.explode("playlists").alias("playlist")  # explode the array of playlists per file
)

# Extract playlist-level fields we care about.
# We keep the `tracks` array for the next step.
playlists_flat_df = playlists_df.select(
    functions.col("playlist.pid").alias("pid"),
    functions.col("playlist.name").alias("name"),
    functions.col("playlist.collaborative").alias("collaborative"),
    functions.col("playlist.modified_at").alias("modified_at"),
    functions.col("playlist.num_tracks").alias("num_tracks"),
    functions.col("playlist.num_albums").alias("num_albums"),
    functions.col("playlist.num_followers").alias("num_followers"),
    functions.col("playlist.duration_ms").alias("duration_ms"),
    functions.col("playlist.tracks").alias("tracks")  # still an array of track structs
)

print("Schema of playlist-level table:")
playlists_flat_df.printSchema()
print("\nExample playlists:")
playlists_flat_df.show(3, truncate=False)

# 2) Flatten `tracks`: one row per (playlist, track)
playlist_track_df = playlists_flat_df.select(
    functions.col("pid"),
    functions.col("name").alias("playlist_name"),
    functions.col("num_tracks"),
    functions.col("num_albums"),
    functions.col("num_followers"),
    functions.col("modified_at"),
    functions.col("duration_ms").alias("playlist_duration_ms"),
    functions.explode("tracks").alias("track")   # explode the tracks array
)

# Flatten the `track` struct into individual columns.
playlist_track_df = playlist_track_df.select(
    functions.col("pid"),
    functions.col("playlist_name"),
    functions.col("num_tracks"),
    functions.col("num_albums"),
    functions.col("num_followers"),
    functions.col("modified_at"),
    functions.col("playlist_duration_ms"),

    functions.col("track.pos").alias("track_pos"),
    functions.col("track.track_uri").alias("track_uri"),
    functions.col("track.track_name").alias("track_name"),
    functions.col("track.artist_uri").alias("artist_uri"),
    functions.col("track.artist_name").alias("artist_name"),
    functions.col("track.album_uri").alias("album_uri"),
    functions.col("track.album_name").alias("album_name"),
    functions.col("track.duration_ms").alias("track_duration_ms")
)


print("\nSchema of playlist–track table:")
playlist_track_df.printSchema()
print("\nExample playlist–track rows:")
playlist_track_df.show(5, truncate=False)
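
The two `explode` calls can be pictured as two nested loops. Below is a plain-Python sketch of the same slices → playlists → playlist–track flattening on a tiny hand-made record. The sample values are invented for illustration and only a few of the fields are shown.

```python
# A tiny stand-in for one slice file: {"info": {...}, "playlists": [...]}
slice_data = {
    "info": {"slice": "0-999"},
    "playlists": [
        {
            "pid": 0,
            "name": "Example A",
            "tracks": [
                {"pos": 0, "track_uri": "spotify:track:aaa"},
                {"pos": 1, "track_uri": "spotify:track:bbb"},
            ],
        },
        {
            "pid": 1,
            "name": "Example B",
            "tracks": [{"pos": 0, "track_uri": "spotify:track:aaa"}],
        },
    ],
}

# First explode: one entry per playlist.
playlists = list(slice_data["playlists"])

# Second explode: one row per (playlist, track) pair.
playlist_track_rows = [
    {
        "pid": playlist["pid"],
        "playlist_name": playlist["name"],
        "track_pos": track["pos"],
        "track_uri": track["track_uri"],
    }
    for playlist in playlists
    for track in playlist["tracks"]
]

for row in playlist_track_rows:
    print(row)
```

As with `explode`, a playlist whose `tracks` list is empty contributes no rows to the playlist–track table.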


### 1.3 Saving flattened DataFrames locally 

`Do not run this cell if you do not want to save the DataFrames locally`

Additionally, for this cell to work on Windows, the user needs to have winutils.exe and hadoop.dll installed locally. They can be found on this GitHub page:

- https://github.com/cdarlint/winutils

The version used for this project was hadoop-3.3.6.
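
On Windows, Hadoop typically looks for `winutils.exe` under `%HADOOP_HOME%\bin`. Below is a minimal sketch of a pre-flight check, assuming both files were placed in that `bin` folder. The helper name `missing_hadoop_files` is hypothetical, and the folder layout is an assumption about the local setup rather than something taken from this project.

```python
import os
from pathlib import Path


def missing_hadoop_files(env=None) -> list:
    """Return the names of the expected Hadoop helper files that are missing.

    Assumes the layout <HADOOP_HOME>/bin/{winutils.exe, hadoop.dll}.
    """
    env = os.environ if env is None else env
    required = ["winutils.exe", "hadoop.dll"]
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        return required  # HADOOP_HOME is not set, so nothing can be found
    bin_dir = Path(hadoop_home) / "bin"
    return [name for name in required if not (bin_dir / name).is_file()]


if os.name == "nt":  # only relevant on Windows
    missing = missing_hadoop_files()
    print("Missing Hadoop files:", missing or "none")
```

Running this before the save cell gives a clearer message than the error Spark raises when the write fails.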

In [None]:
### 1.3 Persist flattened DataFrames for later tasks (Task 2–5)

"""
We now persist the flattened tables to disk so that later tasks (2–5) do not
have to re-read and re-flatten the raw JSON.

The output folder name encodes the effective dataset size, based on the
`NUMBER_OF_SLICES` defined in Section 1.1.

Roughly, each slice contains ≈ 1000 playlists, so we name folders like:
- "5k_Playlists"   → NUMBER_OF_SLICES = 5   (≈ 5,000 playlists)
- "50k_Playlists"  → NUMBER_OF_SLICES = 50  (≈ 50,000 playlists)

If NUMBER_OF_SLICES is large enough to cover all slices (e.g. 1000),
we use the folder name "Full_Playlist".

This keeps the directory structure informative without hard-coding any mode.
"""

# NUMBER_OF_SLICES is defined in Section 1.1

# Decide folder name based on NUMBER_OF_SLICES
# (we approximate 1 slice ≈ 1000 playlists, hence "Nk_Playlists")
if NUMBER_OF_SLICES >= 1000:
    folder_name = "Full_Playlist"
else:
    folder_name = f"{NUMBER_OF_SLICES}k_Playlists"

post_task1_dir = project_root / "data_post_Task_1" / folder_name
post_task1_dir.mkdir(parents=True, exist_ok=True)

print("Post-Task-1 data dir:", post_task1_dir)

# ---- Separate playlist metadata and playlist–track interactions ----

# Playlist metadata only (no 'tracks' array)
playlists_meta_df = playlists_flat_df.drop("tracks")

playlists_out = post_task1_dir / "playlists_metadata"
playlist_track_out = post_task1_dir / "playlist_track"

# 1) Save playlist metadata (small, fast)
(
    playlists_meta_df
    .write
    .mode("overwrite")
    .parquet(str(playlists_out))
)

# 2) Save playlist–track table (large, but used in all later tasks)
(
    playlist_track_df
    .write
    .mode("overwrite")
    .parquet(str(playlist_track_out))
)

print("\nSaved playlist tables to:")
print("  ", playlists_out)
print("  ", playlist_track_out)
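
Later tasks need to find these folders again. One possible way is sketched below, under the assumption that they reuse the same naming rule as this cell. The helper name `post_task1_paths` is hypothetical.

```python
from pathlib import Path


def post_task1_paths(project_root: Path, number_of_slices: int) -> dict:
    """Rebuild the output paths written in Section 1.3 for a given slice count."""
    if number_of_slices >= 1000:
        folder_name = "Full_Playlist"
    else:
        folder_name = f"{number_of_slices}k_Playlists"
    base = project_root / "data_post_Task_1" / folder_name
    return {
        "playlists_metadata": base / "playlists_metadata",
        "playlist_track": base / "playlist_track",
    }


paths = post_task1_paths(Path.cwd(), 50)
for name, path in paths.items():
    print(name, "->", path, "(exists)" if path.exists() else "(not found)")
```

Each returned path could then be passed to `spark.read.parquet(str(path))` to load the saved table.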


---

# Step 2:

---