# Correlating Tags in The Music Tags Dataset

So far, we have scraped music tags from www.bensound.com and created a boolean DataFrame with over 900 columns and over 250 rows. A first step towards building a recommender system with this dataset is to find tags that are highly correlated. However, categorical variables can't be correlated the way continuous variables can; they can only be 'associated'. In practice, this means that we can't make linear predictions with our 'correlation'. We also won't be able to find negative relationships: r will lie in the interval [0, 1], not in [-1, 1]. Either a is associated with b or it isn't. It can't be associated with not-b, because that is the same as not being associated with b. For the purpose of a tagging recommender system, that's not an issue. No one is going to want to know what the worst possible tag for a piece of music would be. If we tag 'happy', we already know we shouldn't tag 'sad' as well.

## 1. Load The Dataset

In [14]:
import pandas as pd

In [15]:
directory = "C:/Users/maxhi/OneDrive/Uni & Work/Programming/Data Science/Music Tagging/Data"
filename = "music_tags_bool.csv"

In [16]:
music_tags = pd.read_csv("{directory}/{filename}".format(directory = directory,
                                                        filename = filename), index_col = "track_name")

In [17]:
music_tags.head()

track_name,ukulele,happy,funny,advertising,upbeat,kid,kids,positive,chidren,joy,...,shangai,koto,guzheng,erhu,dizi,voice,sfx,discover,geek,holiday
ukulele,True,True,True,True,True,True,True,True,True,True,...,False,False,False,False,False,False,False,False,False,False
creative minds,False,False,False,True,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
a new beginning,False,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
little idea,False,True,True,False,True,True,True,True,False,True,...,False,False,False,False,False,False,False,False,False,False
jazzy frenchy,False,True,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


## 2. Correlation

### 2.1 Measurement

Having said that correlation and association are different things, I will stick to the term correlation for simplicity.

The basic function we are going to use for correlating any two tags will look like this:

In [18]:
def correlation_dummy(a_and_b, 
                      a_not_b, 
                      b_not_a):
    
    # Formula for probability: positive_outcomes / possible_outcomes
    positive_outcomes = a_and_b
    possible_outcomes = a_and_b + a_not_b + b_not_a
    
    # r is the number of cases where both a and b are tagged relative to the number of cases where at least one of them is tagged
    r = positive_outcomes / possible_outcomes
    
    return r
    

In [19]:
correlation_dummy(a_and_b = 4,
                  a_not_b = 7, 
                  b_not_a = 3)

0.2857142857142857
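Written out as a formula, this ratio is what is commonly called the Jaccard index of the two sets of tagged tracks. The notation below is introduced here only for illustration: $A$ and $B$ are the sets of tracks tagged with a and b, and the $n$ terms correspond to the three function arguments.

```latex
r(a, b) = \frac{|A \cap B|}{|A \cup B|}
        = \frac{n_{a \wedge b}}{n_{a \wedge b} + n_{a \wedge \neg b} + n_{b \wedge \neg a}}
        = \frac{4}{4 + 7 + 3} = \frac{4}{14} \approx 0.2857
```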

We now need to calculate this kind of correlation for every pair of tags in the dataset. Let's write a function for that.

In [20]:
def correlation(df, tag_a, tag_b):
    
    # Find all rows where a AND b == True
    a_and_b = df[ (df[tag_a]) & (df[tag_b]) ]
    
    # Find all rows where a == True AND b != True
    a_not_b = df[ (df[tag_a]) & ~ (df[tag_b]) ]
    # Find all rows where b == True AND a != True
    b_not_a = df[ (df[tag_b]) & ~ (df[tag_a]) ]
    
    # Calculate the number of positive and possible outcomes using the shape attribute
    possible_outcomes = a_and_b.shape[0] + a_not_b.shape[0] + b_not_a.shape[0] # shape[0] returns the number of rows
    positive_outcomes = a_and_b.shape[0]
    
    # Calculate the final correlation coefficient
    r = positive_outcomes / possible_outcomes
    
    return r

Let's test this function.
Correlating "upbeat" and "happy" should give us the same result as correlating "happy" and "upbeat".

In [21]:
correlation(music_tags, "upbeat", "happy")

0.391304347826087

In [22]:
correlation(music_tags, "happy", "upbeat")

0.391304347826087

Correlating "happy" with itself should give us a correlation of 1, because "happy" shares all its occurrences with itself.

In [23]:
correlation(music_tags, "happy", "happy")

1.0

Very nice! It seems like our correlation function is working!
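The same properties can be checked on a small hand-made table where the expected values are easy to count by hand. This is only a sketch: the toy data and the helper name `jaccard` are hypothetical, and the helper counts "a or b" directly instead of adding up three groups, which should give the same ratio.

```python
import pandas as pd

# Hypothetical toy data: four tracks, three tags
toy = pd.DataFrame(
    {
        "happy":  [True, True, False, False],
        "upbeat": [True, False, True, False],
        "sad":    [False, False, False, True],
    },
    index=["track_1", "track_2", "track_3", "track_4"],
)

def jaccard(df, tag_a, tag_b):
    """Share of tracks carrying both tags among tracks carrying at least one."""
    both = (df[tag_a] & df[tag_b]).sum()    # tracks tagged with a AND b
    either = (df[tag_a] | df[tag_b]).sum()  # tracks tagged with a OR b
    return both / either

# One shared track (track_1) out of three tracks tagged with either tag
r_happy_upbeat = jaccard(toy, "happy", "upbeat")
# No shared track at all
r_happy_sad = jaccard(toy, "happy", "sad")
# A tag compared with itself shares all of its occurrences
r_happy_happy = jaccard(toy, "happy", "happy")
```

Note that the ratio is undefined for a pair of tags that never occur in the data, because the denominator would be zero.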

### 2.2 Calculate Correlation Matrix

First, we need to write a function that correlates a tag with all other tags.

In [26]:
def correlate_with_every_tag(df, tag_a, dict_mode = True): 
    
    unique_tags = list(df.columns)
    
    # In dict_mode, the results are stored in a dict, which is good for analyzing one tag
    # However, in order to transform the data into a df later, we need a list output
    if dict_mode:
        # Loop through every tag and store the correlation in the dict
        correlation_dict = {}
        for tag_b in unique_tags:
            correlation_dict[tag_b] = correlation(df, tag_a, tag_b)
        return correlation_dict
    else:
        # Loop through every tag and store the correlation in a list
        correlation_list = []
        for tag_b in unique_tags:
            correlation_list.append(correlation(df, tag_a, tag_b))
        return correlation_list

In [28]:
correlate_with_every_tag(music_tags, "ukulele", dict_mode = False)[:5] # display only the first 5 values

[1.0,
 0.1864406779661017,
 0.15384615384615385,
 0.08653846153846154,
 0.08695652173913043]

Next, we'll loop through all tags and call correlate_with_every_tag() on each of them.

In [30]:
unique_tags = list(music_tags.columns)

correlation_matrix_dict = {}

for tag_a in unique_tags:
    correlation_matrix_dict[tag_a] = correlate_with_every_tag(music_tags, tag_a, dict_mode = False)
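The loop above calls correlation() once for every ordered pair of tags (about 840,000 calls for 917 tags), and each call filters the DataFrame several times, so it may take a while. One possible alternative, sketched here under the assumption that every column is strictly boolean, derives all pairwise counts from a single matrix product. The name `jaccard_matrix` and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def jaccard_matrix(df):
    """Pairwise Jaccard index for all boolean columns of df (sketch)."""
    x = df.to_numpy(dtype=int)        # rows: tracks, columns: tags (0/1)
    both = x.T @ x                    # both[i, j] = tracks tagged with i AND j
    counts = np.diag(both)            # counts[i] = tracks tagged with i
    # Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
    either = counts[:, None] + counts[None, :] - both
    # Assumption: pairs of tags that never occur get r = 0 instead of a 0/0 error
    with np.errstate(divide="ignore", invalid="ignore"):
        r = np.where(either > 0, both / either, 0.0)
    return pd.DataFrame(r, index=df.columns, columns=df.columns)

# Hypothetical toy data: four tracks, three tags
toy = pd.DataFrame(
    {
        "happy":  [True, True, False, False],
        "upbeat": [True, False, True, False],
        "sad":    [False, False, False, True],
    }
)

toy_matrix = jaccard_matrix(toy)
```

A side effect of this approach is that the result is already labelled with the tag names on both axes.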

### 2.3 Store Correlations in a DataFrame

In [31]:
df_corr_matrix = pd.DataFrame(correlation_matrix_dict)

In [32]:
df_corr_matrix.shape

(917, 917)

In [33]:
df_corr_matrix.head()

,ukulele,happy,funny,advertising,upbeat,kid,kids,positive,chidren,joy,...,shangai,koto,guzheng,erhu,dizi,voice,sfx,discover,geek,holiday
0,1.0,0.186441,0.153846,0.086538,0.086957,0.2,0.321429,0.11236,0.083333,0.25,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333
1,0.186441,1.0,0.245902,0.37069,0.391304,0.103448,0.40678,0.464646,0.017241,0.266667,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017241
2,0.153846,0.245902,1.0,0.101852,0.142857,0.2,0.387097,0.09375,0.055556,0.090909,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.086538,0.37069,0.101852,1.0,0.177966,0.038835,0.188679,0.46875,0.009901,0.133333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.086957,0.391304,0.142857,0.177966,1.0,0.073171,0.26,0.315789,0.026316,0.272727,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.026316


Lastly, we need to replace the default integer index with the tag names, in the same order as they appear in the columns.

In [34]:
df_corr_matrix["index"] = unique_tags

In [35]:
df_corr_matrix = df_corr_matrix.set_index("index")
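As a possible shortcut, the row labels can also be passed to the DataFrame constructor through its `index` argument. The sketch below uses small hypothetical stand-ins for `unique_tags` and `correlation_matrix_dict` (the values are made up) and also shows how a labelled matrix could be queried for the tags most strongly associated with a given tag.

```python
import pandas as pd

# Hypothetical stand-ins for unique_tags and correlation_matrix_dict
tags = ["happy", "upbeat", "sad"]
matrix_dict = {
    "happy":  [1.0, 0.4, 0.0],
    "upbeat": [0.4, 1.0, 0.1],
    "sad":    [0.0, 0.1, 1.0],
}

# Passing index= labels the rows at construction time
corr = pd.DataFrame(matrix_dict, index=tags)

# Tags most strongly associated with "happy", excluding "happy" itself
top_for_happy = corr["happy"].drop("happy").sort_values(ascending=False)
```

With row labels in place, a single value can be looked up by name, for example `corr.loc["sad", "upbeat"]`.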

### 2.4 Export the DataFrame as .csv

In [36]:
df_corr_matrix.to_csv("music_tags_corr_matrix.csv")
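To confirm that the tag labels survive the export, the file can be read back with the index column named explicitly. This is a minimal sketch on a hypothetical two-tag matrix written to a temporary directory; the file name and values are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-tag matrix with a named index, mimicking df_corr_matrix
corr = pd.DataFrame(
    [[1.0, 0.25], [0.25, 1.0]],
    index=["happy", "upbeat"],
    columns=["happy", "upbeat"],
)
corr.index.name = "index"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "corr_matrix.csv")
    corr.to_csv(path)
    # index_col tells pandas which CSV column holds the row labels
    restored = pd.read_csv(path, index_col="index")
```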