<a href="https://colab.research.google.com/github/GemmaGorey/Dissertation/blob/main/Dissertation_GG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Initial Colab setup below. Run the top two cells only once per session.**


In [1]:
!pip install -q condacolab
import condacolab
condacolab.install()
# condacolab installs Miniforge, which provides mamba for package management

⏬ Downloading https://github.com/jaimergp/miniforge/releases/download/24.11.2-1_colab/Miniforge3-colab-24.11.2-1_colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:11
🔁 Restarting kernel...


In [1]:
# Create the config file and build the environment.
yaml_content = """
name: dissertation
channels:
  - pytorch
  - conda-forge
dependencies:
  - python=3.11
  - pytorch=2.2.2
  - torchvision=0.17.2
  - torchaudio
  - librosa
  - numpy<2
  - pandas
  - jupyter
  - wandb
"""

# Write the YAML string to 'environment.yml'.
with open('environment.yml', 'w') as f:
    f.write(yaml_content)

print("environment.yml file created successfully.")

# Create the environment from environment.yml using mamba.
print("\n Creating environment")

!mamba env create -f environment.yml --quiet && echo -e "\n 'dissertation' environment is ready to use."

environment.yml file created successfully.

 Creating environment
Channels:
 - pytorch
 - conda-forge
Platform: linux-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... 

done

 'dissertation' environment is ready to use.
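
The cell above builds a separate environment named `dissertation`. As far as I understand condacolab, the notebook kernel may keep running in the base environment rather than in a newly created one, so it can be worth confirming which packages are actually importable from the current kernel. A minimal sketch, assuming the import names below correspond to the packages in `environment.yml` (the helper `find_missing_packages` is hypothetical and not part of the project):

```python
import importlib.util

def find_missing_packages(module_names):
    """Return the top-level module names that cannot be found from the current kernel."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Import names assumed to match the packages listed in environment.yml
required = ["torch", "torchvision", "torchaudio", "librosa", "numpy", "pandas", "wandb"]
missing = find_missing_packages(required)
print("Missing from this kernel:", missing if missing else "none")
```

If anything is reported as missing, the package is not visible to the kernel that runs the cells below, whatever the `dissertation` environment contains.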


In [2]:
# Clone the project repository, mount Google Drive, and import libraries

# Clone the project repository from GitHub
print("⏳ Cloning GitHub repository...")
!git clone https://github.com/GemmaGorey/Dissertation.git
print("Repository cloned.")

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Imports
import pandas as pd
import librosa
import os

⏳ Cloning GitHub repository...
Cloning into 'Dissertation'...
remote: Enumerating objects: 74, done.
remote: Counting objects: 100% (74/74), done.
remote: Compressing objects: 100% (65/65), done.
Receiving objects: 100% (74/74), 36.20 KiB | 5.17 MiB/s, done.
remote: Total 74 (delta 29), reused 5 (delta 1), pack-reused 0 (from 0)
Resolving deltas: 100% (29/29), done.
Repository cloned.
Mounted at /content/drive


In [3]:
# Load the MERGE Bimodal Complete dataset, which includes both lyrics and audio.

# Path to the dataset folder on Google Drive
data_path = '/content/drive/MyDrive/dissertation/MERGE_Bimodal_Complete/'

# load the files

print("Loading MERGE Metadata ")
merge_df = pd.read_csv(data_path + 'merge_bimodal_complete_metadata.csv')

print("\n Loading Valence-Arousal values")
av_df = pd.read_csv(data_path + 'merge_bimodal_complete_av_values.csv')


print("\n Datasets loaded successfully.")

# Inspect the files
print("\n First 5 rows of the data")
display(merge_df.head())

print("\n First 5 rows of the MERGE AV Values")
display(av_df.head())

Loading MERGE Metadata 

 Loading Valence-Arousal values

 Datasets loaded successfully.

 First 5 rows of the data


Unnamed: 0,Audio_Song,Lyric_Song,Quadrant,AllMusic Id,AllMusic Extraction Date,Artist,Title,Relevance,Year,LowestYear,...,ThemeWeights,Styles,StyleWeights,AppearancesTrackIDs,AppearancesAlbumIDs,Sample,SampleURL,ActualYear,num_Genres,num_MoodsAll
0,A001,L051,Q4,MT0000291374,New,Louis Armstrong,What a Wonderful World,,,,...,99.0,,,,,,,1968,,
1,A002,L052,Q4,MT0001577585,Old,Rod Stewart,Country Comfort,1.493585,1970-??-??,,...,55555555.0,"Adult Contemporary,Contemporary Pop/Rock",55.0,"MT0001577585,MT0002372349,MT0002706336,MT00029...","MW0000073575,MW0000100670,MW0000100670,MW00001...",1.0,http://rovimusic.rovicorp.com/playback.mp3?c=s...,1970,,
2,A003,L053,Q3,MT0008469560,New,Stevie Wonder,Lately,,,,...,78999.0,,,,,,,1980,,
3,A004,L054,Q3,MT0030326044,New,Johnny Cash,I'm So Lonesome I Could Cry,,,,...,,,,,,,,1960,,
4,A005,L055,Q1,MT0005204984,New,Prince,U Got the Look,,,,...,888999.0,,,,,,,1987,,



 First 5 rows of the MERGE AV Values


Unnamed: 0,Audio_Song,Lyric_Song,Arousal,Valence
0,A001,L051,0.29375,0.89375
1,A002,L052,0.3375,0.68125
2,A003,L053,0.25,0.225
3,A004,L054,0.2,0.18125
4,A005,L055,0.7875,0.6875


In [4]:
# Merge the two dataframes on the shared Audio_Song column.
# Lyric_Song exists in both frames, so pandas suffixes it as Lyric_Song_x / Lyric_Song_y.

final_df = pd.merge(merge_df, av_df, on='Audio_Song')

print(" DataFrames merged")

print("\n First 5 rows of your new MASTER DataFrame")
display(final_df.head())

 DataFrames merged

 First 5 rows of your new MASTER DataFrame


Unnamed: 0,Audio_Song,Lyric_Song_x,Quadrant,AllMusic Id,AllMusic Extraction Date,Artist,Title,Relevance,Year,LowestYear,...,AppearancesTrackIDs,AppearancesAlbumIDs,Sample,SampleURL,ActualYear,num_Genres,num_MoodsAll,Lyric_Song_y,Arousal,Valence
0,A001,L051,Q4,MT0000291374,New,Louis Armstrong,What a Wonderful World,,,,...,,,,,1968,,,L051,0.29375,0.89375
1,A002,L052,Q4,MT0001577585,Old,Rod Stewart,Country Comfort,1.493585,1970-??-??,,...,"MT0001577585,MT0002372349,MT0002706336,MT00029...","MW0000073575,MW0000100670,MW0000100670,MW00001...",1.0,http://rovimusic.rovicorp.com/playback.mp3?c=s...,1970,,,L052,0.3375,0.68125
2,A003,L053,Q3,MT0008469560,New,Stevie Wonder,Lately,,,,...,,,,,1980,,,L053,0.25,0.225
3,A004,L054,Q3,MT0030326044,New,Johnny Cash,I'm So Lonesome I Could Cry,,,,...,,,,,1960,,,L054,0.2,0.18125
4,A005,L055,Q1,MT0005204984,New,Prince,U Got the Look,,,,...,,,,,1987,,,L055,0.7875,0.6875
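
Because `Lyric_Song` appears in both source files, the merged frame carries two copies of it (`Lyric_Song_x` and `Lyric_Song_y`). A minimal sketch of how the merge could be sanity-checked, using toy frames built from the first rows shown above rather than the real CSV files. The `validate` argument is a standard `pd.merge` parameter; the variable names are illustrative only:

```python
import pandas as pd

# Toy frames with the same key columns as merge_df and av_df
# (values copied from the first three rows displayed above)
meta = pd.DataFrame({
    "Audio_Song": ["A001", "A002", "A003"],
    "Lyric_Song": ["L051", "L052", "L053"],
    "Quadrant": ["Q4", "Q4", "Q3"],
})
av = pd.DataFrame({
    "Audio_Song": ["A001", "A002", "A003"],
    "Lyric_Song": ["L051", "L052", "L053"],
    "Arousal": [0.29375, 0.3375, 0.25],
    "Valence": [0.89375, 0.68125, 0.225],
})

# validate="one_to_one" raises pandas.errors.MergeError if a key is duplicated on either side
merged = pd.merge(meta, av, on="Audio_Song", validate="one_to_one")

# The shared Lyric_Song column is suffixed _x / _y; check that the two copies agree
lyric_ids_agree = bool((merged["Lyric_Song_x"] == merged["Lyric_Song_y"]).all())
print("Lyric IDs agree:", lyric_ids_agree)
print("Rows:", len(merged))
```

If the two lyric ID columns always agree on the full dataset, one of them is redundant and could be dropped; that choice is left to the rest of the notebook.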


In [16]:
# Check how many songs fall in each quadrant
print(final_df['Quadrant'].value_counts())

Quadrant
Q2    673
Q1    525
Q4    518
Q3    500
Name: count, dtype: int64


In [17]:
# Check the summary statistics of the valence and arousal values
print(final_df[['Valence', 'Arousal']].describe())

           Valence      Arousal
count  2216.000000  2216.000000
mean      0.505027     0.482316
std       0.231149     0.139533
min       0.018750     0.062500
25%       0.293750     0.370625
50%       0.398750     0.506250
75%       0.738906     0.578750
max       0.987500     0.975000
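
The `Quadrant` labels and the valence/arousal values describe the same thing in two ways, so they can be cross-checked. A minimal sketch, assuming the usual circumplex convention (Q1 = high valence and high arousal, Q2 = low valence and high arousal, Q3 = low valence and low arousal, Q4 = high valence and low arousal) and assuming a split at 0.5 on both axes. Neither the threshold nor the helper name `quadrant_from_av` comes from the dataset documentation; the five rows are the ones displayed earlier:

```python
import pandas as pd

def quadrant_from_av(valence, arousal, threshold=0.5):
    """Map a (valence, arousal) pair to Q1..Q4, assuming a split at `threshold` on both axes."""
    if arousal >= threshold:
        return "Q1" if valence >= threshold else "Q2"
    return "Q4" if valence >= threshold else "Q3"

# First five rows of the merged data, as displayed earlier in the notebook
sample = pd.DataFrame({
    "Audio_Song": ["A001", "A002", "A003", "A004", "A005"],
    "Quadrant": ["Q4", "Q4", "Q3", "Q3", "Q1"],
    "Arousal": [0.29375, 0.3375, 0.25, 0.2, 0.7875],
    "Valence": [0.89375, 0.68125, 0.225, 0.18125, 0.6875],
})

# Derive a quadrant from the numeric values and compare it with the stored label
sample["Derived"] = [
    quadrant_from_av(v, a) for v, a in zip(sample["Valence"], sample["Arousal"])
]
mismatches = sample[sample["Quadrant"] != sample["Derived"]]
print(f"{len(mismatches)} of {len(sample)} rows disagree with the assumed mapping")
```

Running the same comparison on `final_df` would show whether the assumed threshold holds across all 2216 songs or only for these few rows.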


In [14]:
# Check that the key columns have no missing values
print(final_df[['Quadrant', 'Valence', 'Arousal']].info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2216 entries, 0 to 2215
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Quadrant  2216 non-null   object 
 1   Valence   2216 non-null   float64
 2   Arousal   2216 non-null   float64
dtypes: float64(2), object(1)
memory usage: 52.1+ KB
None


In [27]:
def load_song_data(song_id, lyric_id, quadrant):
    """
    Loads the audio waveform and lyrics text for one song.

    Returns (audio_waveform, lyrics_text), or (None, None) if loading fails.
    """
    print(f"Attempting to load song: {song_id}")
    try:
        # Root folder of the dataset on Google Drive
        base_path = '/content/drive/MyDrive/dissertation/MERGE_Bimodal_Complete'

        # Audio files are stored in one subfolder per emotion quadrant
        audio_path = os.path.join(base_path, 'audio', quadrant, f"{song_id}.mp3")

        # Lyric files are stored in a separate top-level folder, also split by quadrant
        lyrics_path = os.path.join(base_path, 'lyrics', quadrant, f"{lyric_id}.txt")

        # Load the audio; sr=None keeps the file's original sample rate.
        # Note: sample_rate is not returned, so callers cannot see it.
        audio_waveform, sample_rate = librosa.load(audio_path, sr=None)

        # Load the lyrics text
        with open(lyrics_path, 'r', encoding='utf-8') as f:
            lyrics_text = f.read()

        print(f" Successfully loaded {song_id}")
        return audio_waveform, lyrics_text

    except Exception as e:
        print(f" An error occurred for {song_id}: {e}")
        return None, None
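
Since `load_song_data` catches every exception and returns `(None, None)`, a missing file only shows up as a printed message. A minimal sketch of a cheaper pre-check that builds the same paths and reports which ones are absent before any audio is decoded. The helpers `build_song_paths` and `find_missing_files` are hypothetical, and the demonstration runs against a throwaway temporary directory rather than the real Drive folder:

```python
import os
import tempfile

def build_song_paths(base_path, song_id, lyric_id, quadrant):
    """Return (audio_path, lyrics_path) using the same folder layout as load_song_data."""
    audio_path = os.path.join(base_path, "audio", quadrant, f"{song_id}.mp3")
    lyrics_path = os.path.join(base_path, "lyrics", quadrant, f"{lyric_id}.txt")
    return audio_path, lyrics_path

def find_missing_files(rows, base_path):
    """Return expected paths that do not exist, for (song_id, lyric_id, quadrant) rows."""
    missing = []
    for song_id, lyric_id, quadrant in rows:
        for path in build_song_paths(base_path, song_id, lyric_id, quadrant):
            if not os.path.exists(path):
                missing.append(path)
    return missing

# Demonstration: create the audio file but deliberately omit the lyrics file
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "audio", "Q4"))
    os.makedirs(os.path.join(tmp, "lyrics", "Q4"))
    open(os.path.join(tmp, "audio", "Q4", "A001.mp3"), "wb").close()  # empty placeholder
    missing_demo = find_missing_files([("A001", "L051", "Q4")], tmp)
    missing_names = [os.path.basename(p) for p in missing_demo]

print("Missing:", missing_names)
```

With the real data, the rows could come from `final_df[['Audio_Song', 'Lyric_Song_x', 'Quadrant']].itertuples(index=False, name=None)`, and `base_path` would be the Drive folder used in `load_song_data`.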

In [29]:
# Get the ID and Quadrant from the first row of your dataframe
test_audio_id = final_df.iloc[0]['Audio_Song']
test_lyric_id = final_df.iloc[0]['Lyric_Song_x']
test_quadrant = final_df.iloc[0]['Quadrant']

print(test_audio_id)
print(test_lyric_id)
print(test_quadrant)

# Call your new function
audio, lyrics = load_song_data(test_audio_id, test_lyric_id, test_quadrant)

# Check the output to see if it worked
if audio is not None:
    print("\n--- Audio Data Sample ---")
    print(f"Shape: {audio.shape}") # This shows you the audio was loaded as an array
    print(audio[:5])

    print("\n--- Lyrics Sample ---")
    print(lyrics[:200]) # This prints the first 200 characters of the lyrics

A001
L051
Q4
Attempting to load song: A001
 Successfully loaded A001

--- Audio Data Sample ---
Shape: (961920,)
[0. 0. 0. 0. 0.]

--- Lyrics Sample ---
I see trees of green, red roses too
I see them bloom for me and you
And I think to myself what a wonderful world.

I see skies of blue and clouds of white
The bright blessed day, the dark sacred night
