# Data From Online

**We began our data collection by searching online for other projects that relate music to data. We came across an article written by Rosebud Anwuri in 2018, which pointed us to some original data and showed us how we could process our own. `https://towardsdatascience.com/billboard-hot-100-analytics-using-data-to-understand-the-shift-in-popular-music-in-the-last-60-ac3919d39b49`**

**Below we read in some CSV files provided by that open-source project. The data covers top 100 Billboard songs from 1950 to 2015.**

(1) music_df: All artists and their genres.

(2) feature_analysis: Consists of quantitative data describing songs (e.g. `danceability`).

(3) topics: The decade and the general topic of a song, represented as an integer. This data was not used in our study.

In [1]:
# Import libraries
import pickle
import pandas as pd
import numpy as np
import seaborn as sns
import random
%matplotlib inline

In [2]:
# Read in data
music_df = pd.read_csv("https://raw.githubusercontent.com/RosebudAnwuri/TheArtandScienceofData/master/The%20Making%20of%20Great%20Music/data/features_dataset.csv")
feature_analysis = pd.read_csv("https://raw.githubusercontent.com/RosebudAnwuri/TheArtandScienceofData/master/The%20Making%20of%20Great%20Music/data/music_df.csv")
topics = pd.read_csv("https://raw.githubusercontent.com/RosebudAnwuri/TheArtandScienceofData/master/The%20Making%20of%20Great%20Music/data/topic_dataset.csv")

In [4]:
music_df.head()

Unnamed: 0,artist,FeaturedArtists,main_genre,featured_genre
0,Nat King Cole,,soul,
1,Andrews Sisters,Gordon Jenkins,swing,
2,Patti Page,,jazz,
3,Tennessee Ernie Ford,Kay Starr,,swing
4,Patti Page,,jazz,


**Peeking at the `music_df` data frame, we can see that not all artists have a main genre.**

In [5]:
# Artists that do not have a genre
noGenreind = list(music_df[music_df.main_genre.isnull()].index)

**If an artist without a main genre has a featured genre, make this its main genre.**

In [6]:
for i in noGenreind:
    # Assign through a single .loc call; chained indexing may write to a copy
    music_df.loc[i, "main_genre"] = music_df.loc[i, "featured_genre"]
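
**The same fill can be written without a loop. The following is a minimal sketch on a small made-up frame (`toy` and its values are assumptions for illustration, not the real dataset), using `Series.fillna` to copy the featured genre wherever the main genre is missing.**

```python
import pandas as pd

# Toy stand-in for music_df; the rows are illustrative only.
toy = pd.DataFrame({
    "artist": ["Nat King Cole", "Tennessee Ernie Ford", "3t"],
    "main_genre": ["soul", None, None],
    "featured_genre": [None, "swing", None],
})

# Fill missing main genres from the featured genre in one vectorized step.
toy["main_genre"] = toy["main_genre"].fillna(toy["featured_genre"])

# Rows where both columns are missing remain missing.
print(toy)
```

**Artists with neither a main nor a featured genre are left untouched, which matches the behavior of the loop.**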

In [7]:
music_df.head()

Unnamed: 0,artist,FeaturedArtists,main_genre,featured_genre
0,Nat King Cole,,soul,
1,Andrews Sisters,Gordon Jenkins,swing,
2,Patti Page,,jazz,
3,Tennessee Ernie Ford,Kay Starr,swing,swing
4,Patti Page,,jazz,


**We remove `FeaturedArtists` and `featured_genre` because they are irrelevant to our study. Artists are not part of our analysis; they are only used to merge data frames and gather lyrics. We also only want to track a single main genre.**

In [8]:
music_df = (music_df.drop(columns=["FeaturedArtists", "featured_genre"])
            .reset_index(drop=True))

**Here we can see that some artists have more than one main genre.**

In [10]:
artists_genre = music_df.groupby("artist").main_genre.value_counts(dropna=False)
artists_genre.head(20)

artist               main_genre
10,000 Maniacs       rock           6
10cc                 rock          12
112                  hip-hop       35
2 Chainz             hip-hop       12
2Pac                 hip-hop       24
3 Doors Down         rock          48
3LW                  rnb            6
3t                   NaN            6
4 Non Blondes        rock           6
5 Seconds Of Summer  pop           12
50 Cent              hip-hop       56
69 Boyz              NaN           12
702                  rnb           12
                     soul          12
98 Degrees           NaN           24
A Flock Of Seagulls  pop            6
ABBA                 NaN           36
AWOLNATION           NaN            6
Aaliyah              hip-hop       66
                     pop           66
Name: main_genre, dtype: int64
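
**To gauge how many artists are affected, one could count the distinct genres per artist. The sketch below uses a small made-up frame (`toy`, `genres_per_artist`, `multi_genre`, and `share` are hypothetical names, and the rows are illustrative only).**

```python
import pandas as pd

# Toy stand-in for music_df after the featured columns are dropped.
toy = pd.DataFrame({
    "artist": ["702", "702", "Aaliyah", "Aaliyah", "ABBA", "10cc"],
    "main_genre": ["rnb", "soul", "hip-hop", "pop", None, "rock"],
})

# Number of distinct, non-missing genres recorded for each artist.
genres_per_artist = toy.groupby("artist")["main_genre"].nunique()

# Artists listed under more than one genre, and their share of all artists.
multi_genre = genres_per_artist[genres_per_artist > 1]
share = len(multi_genre) / len(genres_per_artist)

print(multi_genre)
print(f"{share:.0%} of artists have more than one genre")
```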

**At first we wanted to assign each artist the genre that occurred most often for that artist. This would simplify our analysis, and for artists with multiple genres there was no way of knowing which song was tied to which genre. However, this did not work because the genre counts for each artist were all the same. We therefore chose one genre per artist at random. Always taking the first listed genre instead would favor whichever genre comes first alphabetically. We decided that reducing multi-genre artists to a single genre would not significantly harm our data, because such artists tended to have similar genres (for example, Kanye West had both pop and hip-hop) and only a small number of artists had multiple genres.**

In [11]:
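# Pick one genre per artist at random (no seed is set, so results vary between runs)
artists_var = []
genre_var = []
artists = music_df.artist.unique()
for a in artists:
    # Genres recorded for this artist (may include NaN)
    genres = list(artists_genre.loc[a].index)
    artists_var.append(a)
    genre_var.append(random.choice(genres))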

In [12]:
new_artist_genre = pd.DataFrame({"artist":artists_var, "genre":genre_var})
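
**Because the random choice is unseeded, rerunning the notebook can assign different genres. One way to make the assignment repeatable is sketched below on made-up data. The helper `pick_genres` and its `seed` parameter are hypothetical, and unlike the cell above it skips missing genres whenever a real one is available, which is an assumption rather than the original behavior.**

```python
import random
import pandas as pd

# Toy stand-in for music_df; the rows are illustrative only.
toy = pd.DataFrame({
    "artist": ["702", "702", "Aaliyah", "Aaliyah", "10cc"],
    "main_genre": ["rnb", "soul", "hip-hop", "pop", "rock"],
})

def pick_genres(df, seed):
    """Pick one genre per artist using a dedicated, seeded generator."""
    rng = random.Random(seed)
    rows = []
    # groupby sorts the artists, so a given seed always yields the same picks.
    for artist, group in df.groupby("artist"):
        options = sorted(group["main_genre"].dropna().unique())
        genre = rng.choice(options) if options else None
        rows.append({"artist": artist, "genre": genre})
    return pd.DataFrame(rows)

picked = pick_genres(toy, seed=0)
print(picked)
```

**Calling `pick_genres` again with the same seed returns the same frame, so downstream results can be reproduced.**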

**Here we merge our artist-to-genre data frame with our quantitative song data frame. We merge the data frames on `artist`.**

In [13]:
music = new_artist_genre.merge(feature_analysis, on="artist")
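
**An inner merge silently drops any artist that appears in only one of the two frames. The sketch below, on made-up frames (`genres`, `songs`, and their rows are assumptions for illustration), uses the `validate` and `indicator` options of `DataFrame.merge` to check that each artist has a single genre row and to list rows that found no match.**

```python
import pandas as pd

# Toy stand-ins for new_artist_genre and feature_analysis.
genres = pd.DataFrame({
    "artist": ["Nat King Cole", "Patti Page"],
    "genre": ["jazz", "jazz"],
})
songs = pd.DataFrame({
    "artist": ["Nat King Cole", "Nat King Cole", "Unknown Artist"],
    "title": ["Mona Lisa", "Too Young", "Some Song"],
})

# validate raises a MergeError if an artist occurs more than once in genres;
# indicator adds a _merge column saying which side each row came from.
merged = genres.merge(songs, on="artist", how="outer",
                      validate="one_to_many", indicator=True)

# Rows present on only one side would be lost by an inner merge.
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["artist", "_merge"]])
```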

In [14]:
music.head()

Unnamed: 0,artist,genre,lyrics,num_syllables,pos,year,fog_index,flesch_index,num_words,num_lines,...,tempo,duration_ms,time_signature,uri,analysis_url,artist_with_features,year_bin,image,cluster,Gender
0,Nat King Cole,jazz,"Mona Lisa, Mona Lisa, men have named you\nYou'...",189.0,0.199,1950,5.2,88.74,145,17,...,86.198,207573.0,3,spotify:track:3k5ycyXX5qsCjLd7R2vphp,https://api.spotify.com/v1/audio-analysis/3k5y...,,50s,https://i.scdn.co/image/a4c0918f13b67aa8d9f4ea...,String Lover,male
1,Nat King Cole,jazz,They try to tell us we're too young\nToo young...,100.8,0.133,1951,4.8,106.67,92,13,...,92.091,205707.0,1,spotify:track:44ZAlAAn0bAEzUJgZky04H,https://api.spotify.com/v1/audio-analysis/44ZA...,,50s,https://i.scdn.co/image/a4c0918f13b67aa8d9f4ea...,String Lover,male
2,Nat King Cole,jazz,Pretend you're happy when you're blue\nIt isn'...,119.7,0.222,1953,4.4,99.23,99,15,...,83.265,165240.0,4,spotify:track:0nQq0gmq7gcOWyLGrtqu8Z,https://api.spotify.com/v1/audio-analysis/0nQq...,,50s,https://i.scdn.co/image/a4c0918f13b67aa8d9f4ea...,String Lover,male
3,Nat King Cole,jazz,"Answer me, oh, my love\nJust what sin have I b...",190.8,0.292,1954,4.4,99.23,158,25,...,81.766,176960.0,1,spotify:track:3baoVFZ8zcG1uH8qnm59Hf,https://api.spotify.com/v1/audio-analysis/3bao...,,50s,https://i.scdn.co/image/a4c0918f13b67aa8d9f4ea...,String Lover,male
4,Nat King Cole,jazz,A blossom fell from off a tree\nIt settled sof...,157.5,0.28,1955,4.0,100.24,132,24,...,84.242,223440.0,3,spotify:track:2dIhMS83JMFJvFNTmAWucX,https://api.spotify.com/v1/audio-analysis/2dIh...,,50s,https://i.scdn.co/image/a4c0918f13b67aa8d9f4ea...,String Lover,male


In [15]:
feature_analysis.head()

Unnamed: 0,lyrics,num_syllables,pos,year,fog_index,flesch_index,num_words,num_lines,title,f_k_grade,...,tempo,duration_ms,time_signature,uri,analysis_url,artist_with_features,year_bin,image,cluster,Gender
0,"Mona Lisa, Mona Lisa, men have named you\nYou'...",189.0,0.199,1950,5.2,88.74,145,17,Mona Lisa,2.9,...,86.198,207573.0,3,spotify:track:3k5ycyXX5qsCjLd7R2vphp,https://api.spotify.com/v1/audio-analysis/3k5y...,,50s,https://i.scdn.co/image/a4c0918f13b67aa8d9f4ea...,String Lover,male
1,I wanna be Loved\nBy Andrews Sisters\n\nOooo-o...,270.9,0.224,1950,4.4,82.31,189,31,I Wanna Be Loved,3.3,...,170.869,198027.0,5,spotify:track:4UY81WrDU3jTROGaKuz4uZ,https://api.spotify.com/v1/audio-analysis/4UY8...,Gordon Jenkins,50s,https://i.scdn.co/image/42e4dc3ab9b190056a1ca1...,String Lover,Group
2,I was dancing with my darling to the Tennessee...,174.6,0.351,1950,5.2,88.74,138,16,Tennessee Waltz,2.9,...,86.335,182733.0,3,spotify:track:6DKt9vMnMN0HmlnK3EAHRQ,https://api.spotify.com/v1/audio-analysis/6DKt...,,50s,https://i.scdn.co/image/353b05113b1a140d64d83d...,String Lover,female
3,Each time I hold someone new\nMy arms grow col...,135.9,0.231,1950,4.4,99.23,117,18,I'll Never Be Free,0.9,...,82.184,158000.0,3,spotify:track:0KnD456yC5JuweN932Ems3,https://api.spotify.com/v1/audio-analysis/0KnD...,Kay Starr,50s,https://i.scdn.co/image/4bd427bb9181914d0fa448...,String Lover,male
4,"Unfortunately, we are not licensed to display ...",46.8,0.079,1950,6.0,69.79,32,3,All My Love,6.0,...,123.314,190933.0,4,spotify:track:05sXHTLqIpwywbpui1JT4o,https://api.spotify.com/v1/audio-analysis/05sX...,,50s,https://i.scdn.co/image/353b05113b1a140d64d83d...,String Lover,female


**Here is the other data frame that was read in but was not used.**

In [16]:
topics.head()

Unnamed: 0,document,topic,gamma
0,60s,1,0.999997
1,50s,1,0.999993
2,70s,1,0.999917
3,80s,1,0.999868
4,00s,2,0.998575


In [22]:
# Use a context manager so the file is closed once the data is written
with open("cleanedData.pkl", "wb") as f:
    pickle.dump(music, f)
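
**A quick round trip can confirm that the saved file loads back unchanged. This is a minimal sketch on a made-up frame written to a temporary directory (`toy`, `path`, and `restored` are hypothetical names).**

```python
import os
import pickle
import tempfile
import pandas as pd

# Toy stand-in for the cleaned music data frame.
toy = pd.DataFrame({"artist": ["Nat King Cole"], "genre": ["jazz"]})

# Write to a temporary location so no existing file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "cleanedData.pkl")
with open(path, "wb") as f:
    pickle.dump(toy, f)

# Read the file back and compare it with the original frame.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.equals(toy))
```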