# PMA3: Preprocessing

Notebook that does all the calculation and preprocessing for PMA3.

Look over the config settings before running.
I change the settings to generate data for different kinds of models, mainly short, medium and long songs.  
The length ranges overlap somewhat, so that each model has a usable dataset to work with.

## Config

### Data Settings

#### Minimum Song Length
A song must be at least this long (in seconds) to be processed.  
If padding is **DISABLED**, songs are sliced into segments of this length. *Cannot be 0!*

In [None]:
min_song_length_seconds = 60

#### Maximum Song Length
A song must be no longer than this (in seconds) to be processed.

In [None]:
max_song_length_seconds = 540
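
Together, the two length settings act as a filter, and the minimum length also sets the slice size. A minimal sketch of that arithmetic, assuming whole-second durations. `is_length_acceptable` and `count_slices` are hypothetical helper names, not functions from this notebook:

```python
def is_length_acceptable(duration_seconds: int, min_seconds: int = 60, max_seconds: int = 540) -> bool:
    """Keep a song only if its duration lies within [min_seconds, max_seconds]."""
    return min_seconds <= duration_seconds <= max_seconds


def count_slices(duration_seconds: int, slice_seconds: int = 60) -> int:
    """Number of full, non-overlapping slices; any remainder at the end is dropped."""
    if slice_seconds <= 0:
        raise ValueError("slice_seconds cannot be 0")
    return duration_seconds // slice_seconds


print(is_length_acceptable(45))   # False, too short
print(count_slices(200))          # 3 full slices, the last 20 s are dropped
```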

#### Enable Padding / Whole Song Analysis
If a song is shorter than the max length, pad it with zeros up to the max length.  
This is desirable because it forces the model to train on the whole song rather than just learning from small segments.  
The downside is higher memory usage.  

In [None]:
padding = True
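
To get a feel for the memory cost, here is a rough estimate of the raw audio size of one padded song. It assumes librosa's default loading behaviour (22050 Hz mono, 32-bit floats). `padded_audio_bytes` is a hypothetical helper, not part of this notebook:

```python
def padded_audio_bytes(max_seconds: int, sample_rate: int = 22050, bytes_per_sample: int = 4) -> int:
    """Size in bytes of one song after zero-padding it to max_seconds."""
    return max_seconds * sample_rate * bytes_per_sample


size = padded_audio_bytes(540)
print(size)                   # 47628000
print(round(size / 1e6, 1))   # 47.6 (MB per padded song, before any features are computed)
```

Every song costs this much once padded, regardless of its real length, which is where the extra memory goes.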

#### Overwrite
If a song has already been processed, recompute it and overwrite the old data.  
This can be useful for songs whose ratings have been updated since the data was last written.  

In [None]:
overwrite = False

### Plex Settings

#### Plex URL
This should be the local IP address and port, as you would type it in your browser.  
Example: `http://192.168.X.X:32400`

In [None]:
plex_url = "http://192.168.X.X:32400"

#### Plex Token
Plex requires authentication. To find your token, read more [here](https://support.plex.tv/articles/204059436-finding-an-authentication-token-x-plex-token/).

In [None]:
plex_token = "TOKEN_HERE"
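
If you would rather not keep the token in the notebook itself, one option is to read it from an environment variable. This is only a sketch: the variable name `PLEX_TOKEN` and the helper `load_plex_token` are assumptions, not something Plex or this notebook defines:

```python
import os


def load_plex_token(env=os.environ, default: str = "TOKEN_HERE") -> str:
    """Read the token from the PLEX_TOKEN variable, falling back to the placeholder."""
    return env.get("PLEX_TOKEN", default)


plex_token = load_plex_token()
```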

#### Music Library Name
Library name as it would appear on the Plex website.  
Case sensitive.

In [None]:
plex_library_name = "Music"

### File Settings
If you are running this on a different machine / environment than Plex,  
the file path that Plex reports may differ from the path that is reachable  
from this machine / environment.  

These settings swap one path prefix for the other.  

#### Plex Path Prefix
What Plex sees.  
Leave it empty (`""`) if the path is just `/music/artist/...` for Plex.  
If it is like `/media/music/artist/...`, change it to `/media`.

In [None]:
path_prefix_plex = ""

#### Path Prefix Here
What this environment sees.  
Similar to Plex Path Prefix, just in reverse.  

In [None]:
path_prefix_here = ""
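
The prefix swap can be sketched as a small function that only touches the start of the path. `swap_prefix` and the example paths are made up for illustration:

```python
def swap_prefix(path: str, plex_prefix: str, local_prefix: str) -> str:
    """Replace plex_prefix with local_prefix, but only at the start of the path."""
    if not path.startswith(plex_prefix):
        return path  # Path is not under the Plex prefix, leave it untouched
    return local_prefix + path[len(plex_prefix):]


print(swap_prefix("/media/music/artist/song.flac", "/media", "/mnt/nas"))
# /mnt/nas/music/artist/song.flac
print(swap_prefix("/music/artist/song.flac", "", "/mnt/nas"))
# /mnt/nas/music/artist/song.flac
```

Matching only at the start avoids two pitfalls of `str.replace`: it would also rewrite a later occurrence of the prefix, and with an empty first argument it inserts the replacement between every character.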

## Imports

In [None]:
from plexapi.server import PlexServer
import ntpath
import random
from tqdm import tqdm
from IPython.display import clear_output

## Getting Songs from Plex

In [None]:
def get_plex_library(plex_client: PlexServer, plex_library_name: str = plex_library_name) -> PlexServer.library:
    """With a given client, return the desired library object"""
    return plex_client.library.section(plex_library_name)

In [None]:
def get_rated_songs_from_plex(plex_client: PlexServer, rating_out_of_ten: int) -> list:
    """For a given rating (1-10), return a list of (file_path, rating, duration_seconds, moods) tuples"""
    
    library = get_plex_library(plex_client)
    search = library.search(libtype="track", userRating=rating_out_of_ten)
    
    results = []
    
    for i in tqdm(search):
        file_path = i.media[0].parts[0].file
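        # Swap the Plex path prefix for the local one, only at the start of the path.
        # (str.replace with an empty Plex prefix would insert the local prefix between every character.)
        if file_path.startswith(path_prefix_plex):
            file_path = path_prefix_here + file_path[len(path_prefix_plex):]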
            
        file_path = ntpath.normpath(file_path)
                
        # Plex gives the rating as a float, but I want it as the string form of an int
        # It is just easier to translate
        rating_out_of_ten = str(int(i.userRating))
        
        moods = []
        for mood in i.moods:
            try:
                moods.append(mood.tag)
            except Exception:
                pass
        
        duration_seconds = int(i.duration / 1000)
        
        results.append(
            (file_path, rating_out_of_ten, duration_seconds, moods)
        )
    
    return results
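
Later cells read these tuples by position (`song[0]`, `song[2]`, ...). As an optional alternative, a named tuple gives the same fields readable names while still allowing positional access. The `Song` class and the sample values below are hypothetical:

```python
from typing import List, NamedTuple


class Song(NamedTuple):
    file_path: str
    rating: str            # "1" to "10", kept as a string like in the notebook
    duration_seconds: int
    moods: List[str]


song = Song("/music/artist/song.flac", "8", 215, ["Energetic"])
print(song.duration_seconds)   # 215
print(song[2])                 # 215, positional access still works
```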

In [None]:
print("Connecting to Plex Server...", end=" ")
plex_client = PlexServer(baseurl=plex_url, token=plex_token)
print("CONNECTED!")

In [None]:
# Just to see how many songs are in each rating
stats = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
number_of_songs_too_short = 0
number_of_songs_too_long = 0

# List of all info for each rated song on Plex
parsed_songs = []

for i in range(10):
    print("Getting songs " + str(i+1) + "/10", end="\r")
    rated_songs = get_rated_songs_from_plex(plex_client=plex_client, rating_out_of_ten=i+1)
        
    for song in rated_songs:
        
        # Check the length of the song, if not acceptable, skip
        if song[2] < min_song_length_seconds:
            number_of_songs_too_short += 1
            continue
        elif song[2] > max_song_length_seconds:
            number_of_songs_too_long += 1
            continue
            
        stats[i] += 1
        parsed_songs.append(song)
    clear_output()
        
# Shuffle for good measure
random.shuffle(parsed_songs)
        
print("Songs that are too short:\t", number_of_songs_too_short)
print("Songs that are too long:\t", number_of_songs_too_long)

print("\nTracks retrieved from Plex (ratings 1 - 10):")
print(stats)
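
The filtering and counting above can be tried without a Plex server. Below is a minimal sketch on made-up tuples, using `collections.Counter` instead of a hand-indexed list. The sample data is invented purely for illustration:

```python
from collections import Counter

# Made-up (file_path, rating, duration_seconds, moods) tuples for illustration
sample_songs = [
    ("a.flac", "8", 215, []),
    ("b.flac", "8", 30, []),     # too short
    ("c.flac", "10", 900, []),   # too long
    ("d.flac", "3", 61, []),
]

min_seconds, max_seconds = 60, 540
kept = [s for s in sample_songs if min_seconds <= s[2] <= max_seconds]
per_rating = Counter(int(s[1]) for s in kept)

# Same layout as the stats list: index 0 holds rating 1, index 9 holds rating 10
stats = [per_rating.get(r, 0) for r in range(1, 11)]
print(stats)   # [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
```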

## Audio Processing

In [None]:
import librosa
from librosa import ParameterError

In [None]:
def make_frame_mfccs(path: str):
    """
    Loads the audio file, splits it into slices (of the min song length)
    and then calculates the MFCCs for each slice
    """
    
    audio, sr = librosa.load(path)
    
    if padding:
        audio = librosa.util.fix_length(audio, size = max_song_length_seconds * sr)
        
    frames = librosa.util.frame(audio, frame_length = sr * min_song_length_seconds, hop_length = sr * min_song_length_seconds)
    
    frame_mfccs = []
    
    for i in range(frames.shape[1]):
        frame_mfccs.append(
            librosa.feature.mfcc(y=frames[:,i], sr = sr, n_mfcc = 13, hop_length = 512, n_fft = 2048)
        )
    
    return frame_mfccs
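
To sanity-check the output, the expected number of slices and the MFCC shape per slice can be worked out by hand. The sketch below assumes librosa's defaults (22050 Hz sample rate, centered frames) and audio that is at least one slice long. Both helper names are hypothetical:

```python
def slices_per_song(n_samples: int, slice_samples: int) -> int:
    """Non-overlapping slices (frame_length == hop_length); a partial tail is dropped.
    Assumes n_samples >= slice_samples."""
    return 1 + (n_samples - slice_samples) // slice_samples


def mfcc_frames(slice_samples: int, hop_length: int = 512) -> int:
    """MFCC time steps for one slice, assuming centered frames."""
    return 1 + slice_samples // hop_length


sr = 22050                       # librosa.load default, assumed here
slice_samples = 60 * sr          # min_song_length_seconds = 60
padded_samples = 540 * sr        # max_song_length_seconds = 540, padding enabled

print(slices_per_song(padded_samples, slice_samples))   # 9
print((13, mfcc_frames(slice_samples)))                 # (13, 2584)
```

With the settings above, a padded song would therefore yield 9 arrays of shape `(13, 2584)`. Note that the slicing in `make_frame_mfccs` uses the minimum song length whether or not padding is enabled.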

In [None]:
import hashlib
import os
import pickle

In [None]:
# Used to name the pkl files
# Hashing the song path gives unique filenames that should never conflict
def hash_string(string: str) -> str:
    sha256_hash = hashlib.sha256()
    sha256_hash.update(string.encode('utf-8'))
    return sha256_hash.hexdigest()
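
A quick illustration of why a hash works well as a filename: the same input always gives the same 64-character hex string, with no path separators or other awkward characters. The function below is a compact equivalent of `hash_string`, and the path is made up:

```python
import hashlib


def hash_string(string: str) -> str:
    return hashlib.sha256(string.encode("utf-8")).hexdigest()


digest = hash_string("/music/artist/song.flac")             # made-up path
print(len(digest))                                          # 64 hex characters
print(digest == hash_string("/music/artist/song.flac"))     # True, same input gives the same name
```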

In [None]:
# Create the data folder if it does not exist yet
data_dir = "./data/padding-" + str(padding) + "/min-" + str(min_song_length_seconds) + "_max-" + str(max_song_length_seconds)
os.makedirs(data_dir, exist_ok=True)
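
The folder layout keeps one dataset per combination of settings. The same idea can be sketched with `pathlib`. `dataset_dir` is a hypothetical helper, and a temporary directory stands in for `./data`:

```python
import tempfile
from pathlib import Path


def dataset_dir(root: Path, padding: bool, min_seconds: int, max_seconds: int) -> Path:
    """Build (and create if needed) the folder for one combination of settings."""
    folder = root / f"padding-{padding}" / f"min-{min_seconds}_max-{max_seconds}"
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# A temporary directory stands in for ./data so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    folder = dataset_dir(Path(tmp), padding=True, min_seconds=60, max_seconds=540)
    print(folder.relative_to(tmp).as_posix())   # padding-True/min-60_max-540
```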

In [None]:
songs_loaded = 0
songs_calced = 0

In [None]:
for song in tqdm(parsed_songs):
    
    try:
        print("Songs Loaded:\t", songs_loaded, "\t(%", round(((songs_loaded/(songs_loaded + songs_calced)) * 100), 3), ")")
        print("Songs Calced:\t", songs_calced, "\t(%", round(((songs_calced/(songs_loaded + songs_calced)) * 100), 3), ")")
    except ZeroDivisionError:
        print("Songs Loaded:\t0\t(%0)")
        print("Songs Calced:\t0\t(%0)")
    
    filename = hash_string(song[0].replace(path_prefix_here, ""))
    
    rating = song[1]
    
    try:
        
        if overwrite:
            raise Exception
        
        with open("./data/padding-" + str(padding) + "/min-" + str(min_song_length_seconds) + "_max-" + str(max_song_length_seconds) + "/" + filename + ".pkl", "rb") as file:
            json_pkl = pickle.load(file)
            
        songs_loaded += 1
        
    except Exception:
        # Not cached yet, overwrite requested, or the file could not be read: compute and save
        
        fmccs = make_frame_mfccs(song[0])
        
        with open("./data/padding-" + str(padding) + "/min-" + str(min_song_length_seconds) + "_max-" + str(max_song_length_seconds) + "/" + filename + ".pkl", "wb") as file:
            pickle.dump({"fmccs": fmccs, "duration": song[2], "rating": song[1]}, file)
        
        songs_calced += 1
        
    clear_output()
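
The loop above follows a load-or-compute pattern. Here it is as a standalone sketch, so the caching logic can be tested without any audio. `load_or_compute` and the stand-in feature function are hypothetical, not the notebook's own code:

```python
import pickle
import tempfile
from pathlib import Path
from typing import Callable


def load_or_compute(cache_file: Path, compute: Callable[[], dict], overwrite: bool = False) -> tuple:
    """Return (data, was_cached). Recomputes when overwrite is set or the file is missing or unreadable."""
    if not overwrite and cache_file.exists():
        try:
            with cache_file.open("rb") as file:
                return pickle.load(file), True
        except (pickle.UnpicklingError, EOFError):
            pass  # Corrupt or truncated file, fall through and recompute
    data = compute()
    with cache_file.open("wb") as file:
        pickle.dump(data, file)
    return data, False


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "example.pkl"
    fake_features = lambda: {"fmccs": [[0.0]], "duration": 215, "rating": "8"}  # stand-in for the MFCC step
    first = load_or_compute(cache_file, fake_features)
    second = load_or_compute(cache_file, fake_features)
    forced = load_or_compute(cache_file, fake_features, overwrite=True)
    print(first[1], second[1], forced[1])   # False True False
```

Checking the `overwrite` flag explicitly, rather than raising an exception to reach the compute branch, keeps the control flow easier to follow.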
