# Analysis of Vectorized Genres

In the [genre_raw notebook](genre_raw.ipynb), I used the numeric columns of the `data_by_genre` dataset as a crude vector for each genre. For the `funk` genre, these were the closest genres according to those characteristics:

```
['funk',
 'folclore tucumano',
 'quiet storm',
 'lancaster pa indie',
 'freestyle',
 'second line',
 'danish metal',
 'disco',
 'hong kong indie',
 'hong kong rock']
```

As you can see, these genres have little in common beyond similar characteristic values. Using the newly vectorized genres from the [genre2vec notebook](genre2vec.ipynb), I hope to retrieve more closely related genres and enable richer functionality.


In [83]:
import pandas as pd
from scipy.spatial.distance import cosine
import numpy as np
from sklearn.decomposition import PCA
import plotly.express as px

In [84]:
weights = pd.read_csv('fit_vectors/vectors.tsv', sep='\t',names=[i for i in range(64)])
genres = pd.read_csv('fit_vectors/metadata.tsv', names=['genre'])
frequency = pd.read_csv('fit_vectors/frequency.csv', index_col=0)

In [85]:
df_genres = genres.merge(weights, left_index=True,right_index=True)
df_genres = df_genres.merge(frequency, left_on='genre', right_on='genre').set_index('genre')
df_genres.head()

| genre | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | occurances |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rock | 0.297401 | -0.065244 | 0.161445 | 0.028481 | -0.172059 | 0.051996 | -0.230878 | -0.15509 | -0.189513 | 0.214556 | ... | 0.165414 | 0.1263 | 0.24163 | 0.082258 | -0.168814 | 0.215936 | 0.178141 | -0.308114 | -0.141744 | 611 |
| pop | -0.047431 | -0.075726 | 0.20123 | -0.198277 | -0.02523 | 0.188105 | -0.145932 | 0.134658 | 0.108949 | -0.10868 | ... | 0.122657 | -0.069843 | -0.194077 | 0.149244 | 0.110303 | -0.222669 | 0.066243 | -0.086428 | -0.218115 | 593 |
| dance pop | -0.260151 | 0.14166 | 0.091752 | -0.264015 | 0.131204 | 0.1786 | -0.214659 | 0.033899 | 0.093616 | -0.247598 | ... | 0.06019 | 0.009739 | -0.264901 | 0.173026 | 0.166727 | -0.074361 | -0.117554 | 0.312756 | -0.128149 | 572 |
| rap | -0.144567 | 0.315087 | 0.203594 | -0.076298 | -0.229012 | 0.251945 | 0.321193 | -0.095935 | -0.334801 | -0.050728 | ... | 0.004749 | -0.004354 | -0.084426 | 0.223548 | 0.268649 | 0.076912 | -0.244264 | 0.321721 | 0.09305 | 516 |
| hip hop | -0.114314 | 0.366533 | 0.33652 | -0.049034 | -0.267068 | 0.087755 | 0.370791 | -0.130331 | -0.380851 | -0.080021 | ... | 0.192115 | 0.166349 | 0.042726 | 0.098826 | 0.088184 | 0.075849 | -0.30807 | 0.289119 | -0.053495 | 507 |


In [118]:
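# vector functions
def get_genre_vector(genre):
    """Return the 64-dimensional vector for `genre` as a NumPy array."""
    return np.array(list(df_genres.loc[genre, 0:63]))

def distance(vec1, vec2):
    """Cosine distance: 0 = same direction, 1 = orthogonal, 2 = opposite."""
    return cosine(vec1, vec2)

def genres_by_distance(vec):
    """Return every genre name, sorted from nearest to furthest from `vec`."""
    distances = {
        genre: distance(vec, get_genre_vector(genre))
        for genre in df_genres.index
    }
    return sorted(distances, key=lambda genre: distances[genre])

def closest_genres(vec, qty=10):
    """Return the `qty` nearest genres, nearest first."""
    return genres_by_distance(vec)[:qty]

def furthest_genres(vec, qty=10):
    """Return the `qty` furthest genres; the last element is the furthest."""
    return genres_by_distance(vec)[-qty:]

def closest_genre(vec):
    """Return the single nearest genre."""
    return closest_genres(vec)[0]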
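
The ranking functions above compute one cosine distance per genre in a Python loop. As a rough sketch of an alternative, the same ranking can be computed for all genres at once with a single matrix product. The names `toy_names`, `toy_vectors`, and `closest_by_matrix` are hypothetical stand-ins and not part of the notebook; with the real data, the matrix would be the 64 weight columns of `df_genres`.

```python
import numpy as np

# Hypothetical toy data standing in for the 64-dimensional genre vectors.
toy_names = ["a", "b", "c", "d"]
toy_vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

def closest_by_matrix(vec, names, vectors, qty=10):
    """Rank all rows of `vectors` by cosine distance to `vec`, nearest first."""
    vec = np.asarray(vec, dtype=float)
    # Cosine similarity of every row with `vec`, computed in one step.
    sims = vectors @ vec / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(vec))
    # Cosine distance is 1 - similarity; a stable sort keeps ties in input order.
    order = np.argsort(1.0 - sims, kind="stable")
    return [names[i] for i in order[:qty]]

result = closest_by_matrix([1.0, 0.0], toy_names, toy_vectors, qty=3)
print(result)
```

On a large genre table this avoids repeated `.loc` lookups, at the cost of holding the whole matrix in memory.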

In [87]:
# test
get_genre_vector('funk')[:5]

array([-0.23584053, -0.3386798 ,  0.3369154 , -0.30271327,  0.343173  ])

In [88]:
# test 2
funk_vec = get_genre_vector('funk')
rock_vec = get_genre_vector('rock')
distance(funk_vec,rock_vec)


1.092089863858045

In [89]:
# test 3
closest_genres(funk_vec)

['philly soul',
 'post-disco',
 'memphis soul',
 'classic soul',
 'soul',
 'southern soul',
 'quiet storm',
 'rare groove',
 'disco',
 'new jack swing']

These genres appear much more relevant, so these vectors seem to work better for associating similar genres!

## Exploring distances

I wanted to build some intuition for the distances between similar genres and very disparate genres. Similar genres seem to be separated by a cosine distance of ~0.2, while dissimilar genres are separated by a distance of ~1.2.
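
For reference, cosine distance is one minus the cosine of the angle between two vectors, so it runs from 0 (same direction) through 1 (orthogonal) to 2 (opposite directions). The sketch below reimplements that definition in NumPy on made-up 2D vectors. `cosine_distance` is an illustrative name, intended to match what `scipy.spatial.distance.cosine` computes.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up vectors chosen to hit the three landmark values.
same_direction = cosine_distance([1, 2], [2, 4])    # parallel: close to 0
orthogonal = cosine_distance([1, 0], [0, 1])        # right angle: close to 1
opposite = cosine_distance([1, 2], [-1, -2])        # anti-parallel: close to 2

print(same_direction, orthogonal, opposite)
```

On this scale, a distance of ~1.2 means the two vectors point slightly away from each other (cosine similarity of about -0.2), not merely in unrelated directions.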

In [90]:
furthest_genres(get_genre_vector('trance'))[-1]

'classic country pop'

In [91]:
distance(get_genre_vector('trance'), get_genre_vector('broadway'))

1.1878067320680745

In [92]:
closest_genre(get_genre_vector('trance'))

'progressive house'

In [93]:
distance(get_genre_vector('trance'), get_genre_vector('pop edm'))

0.16542969949260522

## Adding genres together to make new genres

I experimented with producing new genres by simply adding two genre vectors together. This appeared to work fairly well: rock + reggae placed ska in the top ten, and country + rap returned country rap as the closest match. I also attempted the process in reverse, subtracting rock from ska. This worked reasonably well too, with reggae and several reggae subgenres appearing in the top ten.
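
One caveat with raw addition is that cosine distance ignores vector length, yet the direction of a sum leans toward whichever input is longer. The sketch below uses made-up 2D vectors to show the effect and one possible remedy, normalizing each vector before adding. The helpers `unit` and `blend` are hypothetical and not part of the notebook, and this only matters if the genre vectors differ noticeably in length, which the notebook does not check.

```python
import numpy as np

def unit(vec):
    """Scale a vector to length 1."""
    vec = np.asarray(vec, dtype=float)
    return vec / np.linalg.norm(vec)

def blend(vec_a, vec_b):
    """Add two vectors after normalizing, so each contributes equally."""
    return unit(vec_a) + unit(vec_b)

# Made-up vectors: `long_vec` is ten times longer than `short_vec`.
long_vec = np.array([10.0, 0.0])
short_vec = np.array([0.0, 1.0])

raw_sum = long_vec + short_vec         # dominated by long_vec
balanced = blend(long_vec, short_vec)  # halfway between the two directions

# Angle of each result measured from the direction of long_vec, in degrees.
raw_angle = np.degrees(np.arctan2(raw_sum[1], raw_sum[0]))
balanced_angle = np.degrees(np.arctan2(balanced[1], balanced[0]))
print(round(raw_angle, 1), round(balanced_angle, 1))
```

The raw sum sits only a few degrees from the longer vector, while the normalized blend lands at 45 degrees, midway between the inputs.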

In [94]:
# ska should loosely be reggae + rock

closest_genres(get_genre_vector('reggae') + get_genre_vector('rock'))

['rock steady',
 'reggae',
 'roots reggae',
 'ska revival',
 'lovers rock',
 'old school dancehall',
 'ska',
 'dancehall',
 'modern reggae',
 'reggae fusion']

In [95]:
# perhaps rap and country will produce country rap?

closest_genres(get_genre_vector('rap') + get_genre_vector('country'),10)

['country rap',
 'country',
 'southern hip hop',
 'rap',
 'memphis hip hop',
 'dirty south rap',
 'atl trap',
 'underground hip hop',
 'modern country rock',
 'oklahoma country']

In [96]:
# Does subtraction work as well? Should hopefully get 'reggae'

closest_genres((get_genre_vector('ska') - get_genre_vector('rock')))

['ska revival',
 'melodic hardcore',
 'ska punk',
 'modern reggae',
 'old school dancehall',
 'reggae',
 'lovers rock',
 'dancehall',
 'roots reggae',
 'oi']

## PCA Visualization of genres + Count

To guide some more complex uses of my vector functions, I wanted a rough visual of how genres are positioned relative to each other. Obviously, I cannot conceptualize 64 dimensions, so a dimensionality reduction will have to suffice. Note: the resulting 3D scatter plot from plotly does not render in static environments. The color and size of each marker indicate the number of occurrences of the genre that the data point represents. An example of the 3D render is shown in the image below.

![pca screenshot](img/pca.png)
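
As an illustration of the reduction step, a minimal three-component PCA can be sketched in plain NumPy on random stand-in data. The array `toy_weights` and the function `pca_reduce` are hypothetical; in this notebook, scikit-learn's `PCA(n_components=3).fit_transform(...)` applied to the weight columns would presumably play the same role.

```python
import numpy as np

def pca_reduce(matrix, n_components=3):
    """Project the rows of `matrix` onto their top principal components."""
    matrix = np.asarray(matrix, dtype=float)
    centered = matrix - matrix.mean(axis=0)  # PCA works on mean-centered data
    # Rows of `vt` are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
toy_weights = rng.normal(size=(100, 64))  # stand-in for the 64-d genre vectors
reduced = pca_reduce(toy_weights)
print(reduced.shape)
```

The three resulting columns can then be used as the `x`, `y`, and `z` arguments of `px.scatter_3d`, with the occurrence count mapped to `color` and `size`.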

## Subtracting to find common differences

I attempted to reproduce similar changes within different "genre clusters" (e.g. rap, rock, etc.). However, this more complex use of the vectors did not really seem to work.
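
One detail that can obscure results in this kind of experiment is that the input genres themselves tend to rank near the top of the output. Below is a small sketch of an analogy helper that filters them out. The dictionary `toy_vectors`, the function `analogy`, and the genre names are all invented for illustration and do not come from the dataset.

```python
import numpy as np

# Invented 3-D vectors: axis 0 = cluster x, axis 1 = cluster y, axis 2 = "softness".
toy_vectors = {
    "x_hard": np.array([1.0, 0.0, 0.0]),
    "x_soft": np.array([1.0, 0.0, 1.0]),
    "y_hard": np.array([0.0, 1.0, 0.0]),
    "y_soft": np.array([0.0, 1.0, 1.0]),
    "z": np.array([-1.0, 0.0, 0.0]),
}

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(base, plus, minus, vectors, qty=10):
    """Rank genres near base + (plus - minus), skipping the three inputs."""
    target = vectors[base] + (vectors[plus] - vectors[minus])
    candidates = [name for name in vectors if name not in {base, plus, minus}]
    ranked = sorted(candidates, key=lambda name: cosine_distance(target, vectors[name]))
    return ranked[:qty]

# Apply the hard -> soft shift from cluster x to the hard genre in cluster y.
result = analogy("y_hard", "x_soft", "x_hard", toy_vectors)
print(result)
```

In this toy setup the shift transfers cleanly because "softness" is its own axis. Real embeddings rarely separate so neatly, which may be part of why the experiments below are less convincing.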

In [98]:
closest_genres(get_genre_vector('rap'))

['rock',
 'heartland rock',
 'country rock',
 'british blues',
 'british folk',
 'hard rock',
 'roots rock',
 'symphonic rock',
 'album rock',
 'folk rock']

In [99]:
# Within the "rap cluster", r&b is nearby. I would describe the contextual difference between rap
# and r&b as r&b being a "softer" genre within this cluster. The idea is to capture a "softness"
# difference in one large cluster and apply it to another to see which specific genres it recommends.
# The expression below takes the difference (soft rock - rock) and adds it to 'gangster rap',
# so hopefully the results will be "softer" rap genres.

closest_genres(get_genre_vector('gangster rap') + (get_genre_vector('soft rock') - get_genre_vector('rock')))

| | 0 | 1 | 2 | genre | occurances |
|---|---|---|---|---|---|
| 0 | -1.146586 | -0.453093 | -0.8051 | rock | 611 |
| 1 | 0.681294 | 0.320412 | 0.013808 | pop | 593 |
| 2 | 1.166946 | 0.322607 | -0.462125 | dance pop | 572 |


In [100]:
# Can the difference between metal and rock lead to "gangster rap" when added to rap?

closest_genres(get_genre_vector('rap') + (get_genre_vector('metal') - get_genre_vector('rock')))

In [129]:
distance(get_genre_vector('ska'), get_genre_vector('rock'))

0.7684954255819321

In [116]:
from sklearn.decomposition import PCA
import plotly.express as px

['trap',
 'hip hop',
 'southern hip hop',
 'atl trap',
 'houston rap',
 'dirty south rap',
 'gangster rap',
 'chicago rap',
 'memphis hip hop',
 'underground hip hop']

In [109]:
pca_genres = pd.DataFrame(reduced_weights)
pca_genres = pca_genres.merge(genres, left_index=True,right_index=True)
pca_genres = pca_genres.merge(frequency, left_on='genre', right_on='genre') #.set_index('genre')
pca_genres.head(3)


['atl trap',
 'southern hip hop',
 'houston rap',
 'deep underground hip hop',
 'underground hip hop',
 'chicago rap',
 'trap',
 'dirty south rap',
 'hip hop',
 'melodic rap']