## 1. Motivation:  Tidy Up OUR Beatles Metadata

- In this Notebook we will explore various ways to make our data 'Tidy', following Hadley Wickham's principles.
- We assume that you are already familiar with ways to 'clean' the data.
- Here we show 'split', 'explode' and 'melt'

#### NOTE:  Freedman already took care of `split()` for the Genre and Context Columns in Our Beatles Metadata.  

- After 'exploding' these, you will need to review the data and 'clean' it (of white space or other inconsistencies you want to correct), just as we did in Clean Data!

### 1a.  Load Libraries

In [1]:
# import libraries

import pandas as pd

# suppress warnings
import warnings
warnings.filterwarnings('ignore')

#### 1b. Load CSV Sheets


- NOTE THAT THESE LINKS HAVE BEEN CORRECTED AS OF 2/18!
- In the case of Our Clean Metadata, the file is now a PICKLE, which means it retains the correct datatypes for lists!
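
To see why a pickle matters for list columns, here is a minimal sketch on a toy dataframe. The data and the file names `toy.csv` and `toy.pkl` are invented for illustration and are not part of the class files. It writes the same frame both ways and compares what comes back.

```python
import os
import tempfile

import pandas as pd

# toy frame with a list-valued column (hypothetical data, not the class file)
toy = pd.DataFrame({
    "song": ["Help!", "Hey Jude"],
    "genres": [["pop", "rock"], ["piano rock"]],
})

# write and re-read the same frame as CSV and as pickle, in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "toy.csv")
    pkl_path = os.path.join(tmp, "toy.pkl")

    toy.to_csv(csv_path, index=False)
    toy.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

# CSV stores the list as text, so it comes back as a single string
csv_type = type(from_csv.loc[0, "genres"])
# pickle stores the Python object itself, so it comes back as a list
pkl_type = type(from_pkl.loc[0, "genres"])

print(csv_type, pkl_type)
```

If you only have the CSV version of a file like this, you would need to rebuild the lists before you could `explode()` them.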

In [10]:
#  Beatles Spotify Data (just clean, not tidy!)
beatles_spotify_csv = 'https://raw.githubusercontent.com/RichardFreedman/Encoding_Music/refs/heads/main/02_Lab_Data/Beatles/Beatles_Spotify_2026.csv'
beatles_spotify = pd.read_csv(beatles_spotify_csv)


# Beatles Billboard Data (clean not tidy!)
beatles_billboard_clean_csv = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vTvOhZSzoa8kwkmYOzwuAGq3Piaujeabu41wVgAhSPfS0IONz2zi_nG5Km_5GO8B8P6tor0r8CacyNF/pub?output=csv'
bb_clean = pd.read_csv(beatles_billboard_clean_csv)


# and perhaps OUR Cleaned Metadata too!  A pickle!
our_cleaned_beatles_metadata_pkl = 'https://github.com/RichardFreedman/Encoding_Music/raw/refs/heads/main/02_Lab_Data/Beatles/our_clean_beatles_data.pkl'
our_clean_beatles = pd.read_pickle(our_cleaned_beatles_metadata_pkl)


# and for final comparison, you can even load a completely pickled exploded and melted version of the billboard and spotify data
beatles_bb_spotify_final = 'https://github.com/RichardFreedman/Encoding_Music/raw/refs/heads/main/02_Lab_Data/Beatles/beatles_data_combined_clean_tidy.pkl'
beatles_bb_spotify_tidy = pd.read_pickle(beatles_bb_spotify_final)

# 2.  Implementation

### Here is the plan . . .

- a) Genre:  explode the lists
- b) Genre:  clean (again!)
- c) Combine:  merge the dfs
- d) Think about OUR data in relation to these

In [7]:
# a quick view of our df

our_clean_beatles.head()


Unnamed: 0,Song Title,Your Name,Your Graduating Class,Team/Group Name,Personal Rank Clean,Genre Tags,Most Likely Context(s),Least Likely Contexts,Spotify URL
0,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,"[soft rock, alternative rock]","[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
1,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,"[soft rock, alternative rock]","[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
2,Love You To,"Quasebarth, Grace",2024,Ohio Exists,0,"[modern bollywood, exotica, indian electronic]","[radio, most inconvenient moment imaginable]","[cvs, cedar point, giant on south 23rd street]",https://open.spotify.com/track/69c15XPo8sYUqmn...
3,Hey Jude,"Quasebarth, Grace",2024,Ohio Exists,6,"[slowcore, piano rock, ]","[bar in spain, cvs, cedar point, most incon...","[southwest airlines, radio]",https://open.spotify.com/track/3m7V717IKZqZLW5...
4,"Ob-La-Di, Ob-La-Da","Quasebarth, Grace",2024,Ohio Exists,4,"[soft rock, indie poptimism, mellow gold]","[southwest airlines, cvs, giant on south 23r...","[bar in spain, radio, cedar point]",https://open.spotify.com/track/1gFNm7cXfG1vSMc...


In [5]:
# single row from our metadata looks like this:
our_clean_beatles.iloc[0]


Song Title                                                A Day in the Life
Your Name                                                 Quasebarth, Grace
Your Graduating Class                                                  2024
Team/Group Name                                                 Ohio Exists
Personal Rank Clean                                                       5
Genre Tags                                   [soft rock,  alternative rock]
Most Likely Context(s)    [giant on south 23rd street,  cedar point,  ra...
Least Likely Contexts                                  [southwest airlines]
Spotify URL               https://open.spotify.com/track/0hKRSZhUGEhKU6a...
Name: 0, dtype: object

In [9]:
# confirm the data type of Genre Tags
type(our_clean_beatles.iloc[0]['Genre Tags'])

list

### 2a. Tidy Genres with Explode

#### Why No Need to Split?

- As you can see above, our `Genre Tags` column has *already been `split()`*  
- It contains a `list` of `strings`, and not a single long `string`!

This:

`['rock & roll', 'pop/rock']`

Not (as it was in Our ORIGINAL Messy Metadata):

`'rock & roll pop/rock'`

So we can go directly to `explode()`!
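
For reference, this is roughly what the earlier `split()` step looks like. It is only a sketch: the raw strings below are invented, and the comma separator is an assumption about how the original entries were delimited.

```python
import pandas as pd

# hypothetical raw entries, with tags separated by commas (an assumption)
raw = pd.Series(["soft rock, alternative rock", "slowcore"])

# .str.split() turns each long string into a list of shorter strings
as_lists = raw.str.split(",")

print(as_lists.tolist())
```

Notice that the second tag in the first list keeps its leading space. That kind of leftover white space is one reason the exploded data needs another round of cleaning.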

### Explode the lists!

- Humans can make sense of those lists.  But for various steps of grouping, filtering and making charts, it will be better to keep **one genre tag per row**.
- We can easily do this with a Pandas method called `explode()`.  Here is how:

```python
our_clean_beatles_exploded = our_clean_beatles.explode('Genre Tags').reset_index(drop=True)
```
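
If it helps to see the effect on something small first, here is a sketch with a two-row toy dataframe (the column names `song` and `genres` are made up for this example). Each list item gets its own row, and the other columns are repeated.

```python
import pandas as pd

# toy frame: one row per song, with a list of tags in each cell
toy = pd.DataFrame({
    "song": ["Love You To", "Help!"],
    "genres": [["exotica", "indian electronic"], ["plaintive pop"]],
})

# one tag per row; reset_index gives the new rows a clean 0, 1, 2 index
exploded = toy.explode("genres").reset_index(drop=True)

print(exploded)
```

Without `reset_index(drop=True)`, the exploded rows would keep the index of the row they came from, so the first two rows here would both be labeled `0`.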

In [15]:
# explode the data on the genre column, and save that as a new df
our_clean_beatles_exploded = our_clean_beatles.explode('Genre Tags').reset_index(drop=True)
our_clean_beatles_exploded.head()

Unnamed: 0,Song Title,Your Name,Your Graduating Class,Team/Group Name,Personal Rank Clean,Genre Tags,Most Likely Context(s),Least Likely Contexts,Spotify URL
0,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,soft rock,"[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
1,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,alternative rock,"[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
2,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,soft rock,"[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
3,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,alternative rock,"[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
4,Love You To,"Quasebarth, Grace",2024,Ohio Exists,0,modern bollywood,"[radio, most inconvenient moment imaginable]","[cvs, cedar point, giant on south 23rd street]",https://open.spotify.com/track/69c15XPo8sYUqmn...


### 2b. But now we need to Clean Again!

- Now that the genres have been split into separate rows, we need to check for all the usual problems:
  - white space
  - missing values (which might come from oddities in the original long strings)
  - inconsistent values
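
One possible way to count the first two kinds of problem before fixing anything is sketched below. The `tags` Series is invented for illustration; in the notebook you would use the exploded `Genre Tags` column instead.

```python
import pandas as pd

# hypothetical tags showing padding, a blank entry, and a missing value
tags = pd.Series(["  art rock ", "art rock", " ", None, "rock"])

# tags that change when stripped have leading or trailing white space
has_padding = tags.notna() & (tags != tags.str.strip())

# tags that are missing, or empty once stripped
is_blank = tags.isna() | (tags.str.strip() == "")

print(int(has_padding.sum()), int(is_blank.sum()))
```

Inconsistent values (such as `'rock'` versus `'rock-n-roll'`) are harder to detect automatically, which is why it still helps to read through the sorted list of unique values yourself.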

In [16]:
# consider values in exploded genres for clean up:
sorted(our_clean_beatles_exploded['Genre Tags'].fillna('unspecified').unique().tolist())

[' ',
 '  art rock',
 '  art rock ',
 '  big trip hype ',
 '  cold war nostalgia soft rock',
 '  easy listening',
 '  folk rock ',
 '  groove-music',
 '  indie',
 '  lonesome rock',
 '  plagiarism music',
 '  pop',
 '  pop rock ',
 '  rock',
 '  rock-n-roll',
 '  rubbish',
 ' alternative rock',
 " children's music",
 ' doo-wop',
 ' exotica',
 ' folk',
 ' high vibe',
 ' indian electronic',
 ' indie poptimism',
 ' melancholia',
 ' mellow gold',
 ' piano rock',
 ' rock',
 ' soft rock',
 'comedy rock',
 'comrade pop',
 'folk rock',
 'garbage',
 'god-tier',
 'indie',
 'modern bollywood',
 'penitential pop',
 'plaintive pop',
 'pop',
 'pop rock',
 'psychedelic rock',
 'rock',
 'sea shanties',
 'shanty',
 'slowcore',
 'soft rock',
 'world music rock of the past century',
 'zoomba tunes']

- Here is a **function** you can adapt for the job of cleaning!

- In this case our function takes in a COMPLETE COLUMN, not individual values in cells
- Each line applies a different transformation to the data.  You can remove the ones you don't need, or adapt!
- Don't forget the closing `)` at the end!


In [19]:
# cleaning function


def clean_genre(genre_series):
    """
    Clean and standardize genre names.
    
    Parameters:
    genre_series: pandas Series containing genre data
    
    Returns:
    pandas Series with cleaned genre data
    """
    return (genre_series
            .str.strip('[')   # remove any stray opening bracket
            .str.strip()      # remove leading and trailing white space
            # specific rules go before the general 'pop/rock' rule;
            # otherwise the general rule changes the text first and these never match
            .str.replace("folkpop/rock", "folk pop rock", regex=False)
            .str.replace("electronic pop/rock", "electronic pop rock", regex=False)
            .str.replace('pop/rock', 'pop rock', regex=False)
            .str.replace('r&b', 'rhythm and blues', regex=False)
            .str.replace('rock and roll', 'rock', regex=False)
            .str.replace('rock & roll', 'rock', regex=False)
            .str.replace('jazz-pop', 'jazz pop', regex=False)
            .str.replace('experimental music', 'experimental', regex=False)
            .str.replace("children's music", "children's", regex=False)
            .str.replace("stage&screen", "stage and screen", regex=False)
            .str.replace("avant-pop", "avant pop", regex=False)
        
    )
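
To try the same chaining pattern on something small, here is a shortened, hypothetical variant. The function name `tidy_tags`, the toy Series, and the choice to fold `'rock-n-roll'` into `'rock'` are all assumptions made for this sketch, not decisions made in the notebook.

```python
import pandas as pd


def tidy_tags(tag_series):
    """Hypothetical, shortened variant of a column-cleaning function."""
    return (tag_series
            .str.strip()                                      # trim white space
            .str.replace("rock-n-roll", "rock", regex=False)  # merge one variant
            )


# toy column with padding and one missing value
messy = pd.Series(["  art rock ", "  rock-n-roll", None])
cleaned = tidy_tags(messy)

print(cleaned.tolist())
```

Because the `.str` methods skip missing values, the missing entry stays missing rather than raising an error. Also note that with `regex=False` each rule is a plain substring replacement, so the order of the rules can matter.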

In [20]:

# our column as a series
our_clean_beatles_exploded['Genre Tags']

0             soft rock
1      alternative rock
2             soft rock
3      alternative rock
4      modern bollywood
            ...        
84                 rock
85              garbage
86              rubbish
87        plaintive pop
88          rock-n-roll
Name: Genre Tags, Length: 89, dtype: object

In [21]:
# designate the series (a column)
selected_series = our_clean_beatles_exploded['Genre Tags']

# pass it in to the function
our_clean_beatles_exploded['Genre Tags'] = clean_genre(selected_series)

# check results
our_clean_beatles_exploded.head()

Unnamed: 0,Song Title,Your Name,Your Graduating Class,Team/Group Name,Personal Rank Clean,Genre Tags,Most Likely Context(s),Least Likely Contexts,Spotify URL
0,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,soft rock,"[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
1,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,alternative rock,"[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
2,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,soft rock,"[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
3,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,alternative rock,"[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
4,Love You To,"Quasebarth, Grace",2024,Ohio Exists,0,modern bollywood,"[radio, most inconvenient moment imaginable]","[cvs, cedar point, giant on south 23rd street]",https://open.spotify.com/track/69c15XPo8sYUqmn...


In [23]:
# check distribution of genres
our_clean_beatles_exploded['Genre Tags'].value_counts().tail(20)

Genre Tags
indie poptimism                         1
mellow gold                             1
shanty                                  1
children's                              1
comrade pop                             1
melancholia                             1
lonesome rock                           1
penitential pop                         1
folk                                    1
comedy rock                             1
sea shanties                            1
cold war nostalgia soft rock            1
zoomba tunes                            1
easy listening                          1
big trip hype                           1
plagiarism music                        1
world music rock of the past century    1
garbage                                 1
rubbish                                 1
plaintive pop                           1
Name: count, dtype: int64

In [25]:
# save our exploded data locally as a pickle (this preserves the list columns)
our_clean_beatles_exploded.to_pickle('beatles_billboard_clean_explode.pkl')

# and here is how to reload it
our_data_exploded_re = pd.read_pickle('beatles_billboard_clean_explode.pkl')

In [26]:
our_data_exploded_re

Unnamed: 0,Song Title,Your Name,Your Graduating Class,Team/Group Name,Personal Rank Clean,Genre Tags,Most Likely Context(s),Least Likely Contexts,Spotify URL
0,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,soft rock,"[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
1,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,alternative rock,"[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
2,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,soft rock,"[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
3,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,alternative rock,"[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
4,Love You To,"Quasebarth, Grace",2024,Ohio Exists,0,modern bollywood,"[radio, most inconvenient moment imaginable]","[cvs, cedar point, giant on south 23rd street]",https://open.spotify.com/track/69c15XPo8sYUqmn...
...,...,...,...,...,...,...,...,...,...
84,Love You To,"Garozzo, Elizabeth",2030,Team/Group Name,6,rock,[sitar convention],[when i dont want to hear sitar],https://open.spotify.com/track/69c15XPo8sYUqmn...
85,Run for Your Life,"Jahiel, Jacob",2020,The Skiffles,6,garbage,"[while dead, while asleep]",[under duress],https://open.spotify.com/track/4gUYV3ktbaOeAK5...
86,Run for Your Life,"Jahiel, Jacob",2020,The Skiffles,6,rubbish,"[while dead, while asleep]",[under duress],https://open.spotify.com/track/4gUYV3ktbaOeAK5...
87,Help!,"Jahiel, Jacob",2020,The Skiffles,4,plaintive pop,"[in need, in the car]","[with a turtle named gerald, at a silent ret...",https://open.spotify.com/track/7DD7eSuYSC5xk2A...


## 2c. Merge Our Data

- We've done this before, but note that in this case the song column is called `song` in one df and `Song Title` in the other.
- `pd.merge` can handle this!
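
Here is a small sketch of `left_on` and `right_on` with two toy dataframes. The tempo values and tags are placeholders invented for the example.

```python
import pandas as pd

# toy 'audio features' table keyed on 'song' (values are placeholders)
features = pd.DataFrame({
    "song": ["Help!", "Yellow Submarine", "Girl"],
    "tempo": [95.0, 111.4, 96.5],
})

# toy 'our metadata' table keyed on 'Song Title', one tag per row
tags = pd.DataFrame({
    "Song Title": ["Help!", "Yellow Submarine", "Yellow Submarine"],
    "Genre Tags": ["plaintive pop", "shanty", "folk"],
})

# name the key column on each side; 'inner' keeps only songs found in both
merged = pd.merge(features, tags,
                  left_on="song",
                  right_on="Song Title",
                  how="inner")

print(merged)
```

Two things to notice: a song with no match on the other side ('Girl') drops out of an inner merge, and both key columns survive in the result, which is why one of them can be dropped afterwards.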

In [28]:
combined_beatles_data = pd.merge(beatles_bb_spotify_tidy, 
                                 our_data_exploded_re, 
                                 left_on='song', 
                                 right_on='Song Title', 
                                 how='inner')
combined_beatles_data

Unnamed: 0,song,key,mode,tempo,time_signature,duration_ms,audio_feature,value,year,album.debut.uk,...,top.50.billboard,Song Title,Your Name,Your Graduating Class,Team/Group Name,Personal Rank Clean,Genre Tags,Most Likely Context(s),Least Likely Contexts,Spotify URL
0,A Day in the Life,4,0,163.219,4,337413,danceability,0.364,1967,Sgt. Pepper's Lonely Hearts Club Band,...,0,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,soft rock,"[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
1,A Day in the Life,4,0,163.219,4,337413,danceability,0.364,1967,Sgt. Pepper's Lonely Hearts Club Band,...,0,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,alternative rock,"[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
2,A Day in the Life,4,0,163.219,4,337413,danceability,0.364,1967,Sgt. Pepper's Lonely Hearts Club Band,...,0,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,soft rock,"[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
3,A Day in the Life,4,0,163.219,4,337413,danceability,0.364,1967,Sgt. Pepper's Lonely Hearts Club Band,...,0,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,alternative rock,"[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
4,A Day in the Life,4,0,163.219,4,337413,danceability,0.364,1967,Sgt. Pepper's Lonely Hearts Club Band,...,0,A Day in the Life,"Tang, Patricia",2026,Team/Group Name,1,psychedelic rock,[car],[baptism],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1267,Yellow Submarine,1,1,111.398,4,158880,valence,0.696,1966,Revolver,...,25,Yellow Submarine,"Quasebarth, Grace",2027,Romantic Travellers,2,sea shanties,"[kids scuba diving class, karaoke]",[lifeguard training course],https://open.spotify.com/track/50xwQXPtfNZFKFe...
1268,Yellow Submarine,1,1,111.398,4,158880,valence,0.696,1966,Revolver,...,25,Yellow Submarine,"Quasebarth, Grace",2024,Ohio Exists,1,shanty,"[cedar point, southwest airlines]","[bar in spain, radio, cvs]",https://open.spotify.com/track/50xwQXPtfNZFKFe...
1269,Yellow Submarine,1,1,111.398,4,158880,valence,0.696,1966,Revolver,...,25,Yellow Submarine,"Quasebarth, Grace",2024,Ohio Exists,1,children's,"[cedar point, southwest airlines]","[bar in spain, radio, cvs]",https://open.spotify.com/track/50xwQXPtfNZFKFe...
1270,Yellow Submarine,1,1,111.398,4,158880,valence,0.696,1966,Revolver,...,25,Yellow Submarine,"Quasebarth, Grace",2024,Ohio Exists,1,folk,"[cedar point, southwest airlines]","[bar in spain, radio, cvs]",https://open.spotify.com/track/50xwQXPtfNZFKFe...


#### Final Clean Up of Columns

In [29]:
# force cols to lower case for simplicity
combined_beatles_data.rename(columns=str.lower, inplace=True)
# drop redundant columns
cols_to_drop = ['song']

# this is called 'list comprehension' and will create a list of those NOT in the exclusion list!
cols_to_keep = [col for col in combined_beatles_data.columns if col not in cols_to_drop]

# now a new df with only the columns we want
combined_beatles_data_brief = combined_beatles_data[cols_to_keep]
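
As an aside, the list comprehension can be compared with the built-in `drop()` method on a toy dataframe. This is only a sketch with invented column values; either approach should give the same columns.

```python
import pandas as pd

# toy frame with a redundant 'song' column
toy = pd.DataFrame({"song": ["Help!"], "song title": ["Help!"], "tempo": [95.0]})

cols_to_drop = ["song"]

# list comprehension: keep every column name that is NOT in the exclusion list
cols_to_keep = [col for col in toy.columns if col not in cols_to_drop]
via_comprehension = toy[cols_to_keep]

# DataFrame.drop removes the named columns in one step
via_drop = toy.drop(columns=cols_to_drop)

print(cols_to_keep)
print(via_comprehension.equals(via_drop))
```

The comprehension is handy when the rule for keeping a column is more than a fixed list, for example keeping only names that start with a certain word.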


In [30]:
combined_beatles_data_brief.head(5)

Unnamed: 0,key,mode,tempo,time_signature,duration_ms,audio_feature,value,year,album.debut.uk,album.debut.us,...,top.50.billboard,song title,your name,your graduating class,team/group name,personal rank clean,genre tags,most likely context(s),least likely contexts,spotify url
0,4,0,163.219,4,337413,danceability,0.364,1967,Sgt. Pepper's Lonely Hearts Club Band,,...,0,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,soft rock,"[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
1,4,0,163.219,4,337413,danceability,0.364,1967,Sgt. Pepper's Lonely Hearts Club Band,,...,0,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,alternative rock,"[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
2,4,0,163.219,4,337413,danceability,0.364,1967,Sgt. Pepper's Lonely Hearts Club Band,,...,0,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,soft rock,"[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
3,4,0,163.219,4,337413,danceability,0.364,1967,Sgt. Pepper's Lonely Hearts Club Band,,...,0,A Day in the Life,"Quasebarth, Grace",2024,Ohio Exists,5,alternative rock,"[giant on south 23rd street, cedar point, ra...",[southwest airlines],https://open.spotify.com/track/0hKRSZhUGEhKU6a...
4,4,0,163.219,4,337413,danceability,0.364,1967,Sgt. Pepper's Lonely Hearts Club Band,,...,0,A Day in the Life,"Tang, Patricia",2026,Team/Group Name,1,psychedelic rock,[car],[baptism],https://open.spotify.com/track/0hKRSZhUGEhKU6a...


In [34]:
# save the result as 'pickled' file, to preserve all data types!

combined_beatles_data_brief.to_pickle('beatles_combined_clean_exploded_brief.pkl')

# 3.  Interpretation:

- What did you learn?