# PCA Analysis

This notebook applies principal component analysis (PCA) to reduce the dimensionality of the song features and to extract a smaller set of derived features from them.

## Load the data

First, we load the data related to the songs:

In [1]:
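from pyspark import RDD
from pyspark import SparkContext

# `sc` is assumed to be the SparkContext already provided by the notebook session.
# Each pickle file holds (song key, structured NumPy record) pairs.
rdd_analysis_songs = sc.pickleFile("../data/analysis-songs")
rdd_musicbrainz_songs = sc.pickleFile("../data/musicbrainz-songs")
rdd_metadata_songs = sc.pickleFile("../data/metadata-songs")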

## Merge all relevant features into one RDD

The song features are spread across three RDDs. We join them on the song key so that each song ends up with a single record containing all of its features.

In [2]:
import numpy as np
import numpy.lib.recfunctions as rfn

def merge_array_parameters(row):
    return (row[0], rfn.merge_arrays([row[1][0], row[1][1]], flatten=True))

big_rdd = rdd_analysis_songs.join(rdd_metadata_songs).map(merge_array_parameters)
big_rdd = big_rdd.join(rdd_musicbrainz_songs).map(merge_array_parameters)
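
To make the merge step easier to follow, here is a minimal sketch that runs without Spark. It assumes that a joined row has the shape `(key, (left_record, right_record))`, which is what the `merge_array_parameters` function above unpacks. The key `"SONG_A"`, the two tiny records, and their values are hypothetical stand-ins for the real data.

```python
import numpy as np
import numpy.lib.recfunctions as rfn

def merge_array_parameters(row):
    # row is (key, (left_record, right_record)), the shape produced by a key-based join
    return (row[0], rfn.merge_arrays([row[1][0], row[1][1]], flatten=True))

# Hypothetical one-row structured arrays standing in for two of the datasets.
analysis = np.array([(120.5, 0.8)],
                    dtype=[("tempo", "f8"), ("energy", "f8")])
metadata = np.array([(0.6, 1234)],
                    dtype=[("artist_familiarity", "f8"), ("artist_7digitalid", "i8")])

# Simulate a single joined row and merge its two records into one flat record.
joined_row = ("SONG_A", (analysis, metadata))
key, merged = merge_array_parameters(joined_row)

print(key)
print(merged.dtype.names)
```

With `flatten=True`, the fields of both records end up side by side in one flat structured array, so a feature can later be looked up by name regardless of which dataset it came from.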

Below are the names of all available features that are directly linked to songs. For simplicity, we will not use the segments data; some features in the `analysis-songs` dataset are already derived from it.

In [3]:
big_rdd.first()[1].dtype.names

('analysis_sample_rate',
 'audio_md5',
 'danceability',
 'duration',
 'end_of_fade_in',
 'energy',
 'idx_bars_confidence',
 'idx_bars_start',
 'idx_beats_confidence',
 'idx_beats_start',
 'idx_sections_confidence',
 'idx_sections_start',
 'idx_segments_confidence',
 'idx_segments_loudness_max',
 'idx_segments_loudness_max_time',
 'idx_segments_loudness_start',
 'idx_segments_pitches',
 'idx_segments_start',
 'idx_segments_timbre',
 'idx_tatums_confidence',
 'idx_tatums_start',
 'key',
 'key_confidence',
 'loudness',
 'mode',
 'mode_confidence',
 'start_of_fade_out',
 'tempo',
 'time_signature',
 'time_signature_confidence',
 'track_id',
 'analyzer_version',
 'artist_7digitalid',
 'artist_familiarity',
 'artist_hotttnesss',
 'artist_id',
 'artist_latitude',
 'artist_location',
 'artist_longitude',
 'artist_mbid',
 'artist_name',
 'artist_playmeid',
 'genre',
 'idx_artist_terms',
 'idx_similar_artists',
 'release',
 'release_7digitalid',
 'song_hotttnesss',
 'song_id',
 'title',
 'track_

## Vectorize features

The next step is to convert the data into vectors that can be used for principal component analysis. We ignore all non-numeric features, because they are either invalid (`genre` is always an empty string) or unnecessary (e.g. `track_id` is represented by `track_7digitalid`). Fields ending in `hotttnesss` are also left out of the feature vector; `song_hotttnesss` is used as the label instead.

In [4]:
from pyspark.ml.feature import PCA
from pyspark.mllib.linalg import Vectors
from pyspark.sql.functions import isnan
import math

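def map_features_to_vectors(input_row):
    """Turn one (song key, record) pair into a (feature vector, label) pair."""
    feature_vec = input_row[1]

    # Take only numeric values and skip the hotttnesss fields
    # (song_hotttnesss is the label, not a feature).
    row_data = [feature_vec[x][0] for x in feature_vec.dtype.names
                if np.isreal(feature_vec[x][0]) and not x.endswith("hotttnesss")]
    # Replace NaN by 0; a list comprehension also works on Python 3,
    # where map() would return an iterator instead of a list.
    row_data = [0 if math.isnan(x) else x for x in row_data]

    label = feature_vec['song_hotttnesss'][0]

    return (Vectors.dense(row_data), float(label))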
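
The selection logic above can be tried on a single record without Spark. The sketch below is an illustration under assumptions, not the notebook's exact code: it checks each field's dtype with `np.issubdtype` instead of calling `np.isreal` on the value, and the helper name `numeric_features` and the sample record are hypothetical.

```python
import math
import numpy as np

def numeric_features(record, label_name="song_hotttnesss"):
    # Keep fields with a numeric dtype, skipping the hotttnesss fields.
    names = [n for n in record.dtype.names
             if np.issubdtype(record.dtype[n], np.number)
             and not n.endswith("hotttnesss")]
    values = [float(record[n][0]) for n in names]
    # Replace NaN by 0.0 so the vector contains only finite numbers.
    values = [0.0 if math.isnan(v) else v for v in values]
    return names, values, float(record[label_name][0])

# Hypothetical record: one string field, two numeric features, one label.
record = np.array(
    [(b"TRXXXX", 120.5, float("nan"), 0.7)],
    dtype=[("track_id", "S18"), ("tempo", "f8"),
           ("artist_latitude", "f8"), ("song_hotttnesss", "f8")])

names, values, label = numeric_features(record)
print(names, values, label)
```

Note that replacing NaN by 0 treats a missing latitude as the value 0, which is a real coordinate; this keeps the pipeline simple but may distort such features.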

In [5]:
data = sqlContext.createDataFrame(big_rdd.map(map_features_to_vectors), ["features", "labels"])
# Drop songs whose label (song_hotttnesss) is missing.
data = data.filter(~isnan(data.labels)).cache()

In [6]:
pca_ml = PCA(k=10, inputCol="features", outputCol="pcaFeatures")
model = pca_ml.fit(data)
transformed = model.transform(data)
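
As a rough illustration of what a PCA with `k` components computes, here is a NumPy sketch based on the singular value decomposition of the centered data. It shows the general technique only and is not a description of Spark's implementation; the function name `pca_project` and the random data are hypothetical.

```python
import numpy as np

def pca_project(X, k):
    # Center each column, then take the top-k right singular vectors
    # as the principal directions.
    X_centered = X - X.mean(axis=0)
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]
    # Variance of the data along each kept direction.
    explained_variance = (singular_values[:k] ** 2) / (X.shape[0] - 1)
    return X_centered.dot(components.T), components, explained_variance

# Hypothetical data: 200 samples, 5 features with very different spreads.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

projected, components, variance = pca_project(X, k=2)
print(projected.shape)
print(variance)
```

Because the directions are ranked by variance, features with large numeric ranges dominate the first components unless the columns are scaled beforehand.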

In [7]:
from pyspark.mllib.regression import LinearRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

results = []

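for k in range(2, data.first().features.size + 1, 2):
    print("k=%d" % k)

    # Project the features onto the first k principal components.
    pca = PCA(k=k, inputCol="features", outputCol="pcaFeatures").fit(data)
    reduced = pca.transform(data)

    # Build labeled points from the reduced features and hold out 20% for testing.
    points = reduced.rdd.map(lambda x: LabeledPoint(x.labels, x.pcaFeatures))
    dtrain, dtest = points.randomSplit([0.8, 0.2])

    # Train on the training split only, then evaluate on the held-out split.
    model_pca = LinearRegressionWithSGD.train(dtrain)

    valuesAndPreds = dtest.map(lambda p: (p.label, model_pca.predict(p.features)))
    MSE = valuesAndPreds \
        .map(lambda vp: (vp[0] - vp[1])**2) \
        .mean()
    results.append(MSE)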

k=2
k=4
k=6
k=8
k=10
k=12
k=14
k=16
k=18
k=20
k=22
k=24
k=26
k=28
k=30
k=32
k=34
k=36
k=38
k=40
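
The idea behind the loop over `k` can be prototyped locally before running it on the cluster. The sketch below is an assumption-laden stand-in: it uses NumPy's SVD for the projection and an ordinary least-squares fit instead of `LinearRegressionWithSGD`, and the helper `mse_for_k` and the synthetic data are hypothetical.

```python
import numpy as np

def mse_for_k(X_train, y_train, X_test, y_test, k):
    # Fit the projection on the training data only.
    mean = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    components = Vt[:k]
    Z_train = (X_train - mean).dot(components.T)
    Z_test = (X_test - mean).dot(components.T)
    # Append a column of ones so the model has an intercept.
    A_train = np.hstack([Z_train, np.ones((len(Z_train), 1))])
    A_test = np.hstack([Z_test, np.ones((len(Z_test), 1))])
    weights = np.linalg.lstsq(A_train, y_train, rcond=None)[0]
    return float(np.mean((A_test.dot(weights) - y_test) ** 2))

# Synthetic data: the label depends linearly on some of the 6 features.
rng = np.random.RandomState(0)
X = rng.normal(size=(400, 6))
y = X.dot(np.array([0.5, -0.2, 0.1, 0.0, 0.0, 0.3])) + rng.normal(scale=0.05, size=400)

split = 320  # 80% train, 20% test
results = [mse_for_k(X[:split], y[:split], X[split:], y[split:], k)
           for k in range(2, 7, 2)]
print(results)
```

Plotting such a list against `k` shows how much predictive information is lost at each level of reduction.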


In [9]:
print(results)

[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]
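
Every MSE above is `nan`. The notebook does not establish why; one plausible cause, stated here as an assumption, is that gradient descent diverges when the features have very different scales (for example a sample rate next to a confidence value). The NumPy sketch below reproduces that effect with a plain batch gradient descent on hypothetical data and shows that standardizing the columns keeps the result finite. The function `gd_mse` is illustrative and is not the algorithm used by `LinearRegressionWithSGD`.

```python
import numpy as np

def gd_mse(X, y, step=0.01, iterations=200):
    # Plain batch gradient descent for least squares (no intercept, for brevity).
    w = np.zeros(X.shape[1])
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(iterations):
            gradient = 2.0 * X.T.dot(X.dot(w) - y) / len(y)
            w = w - step * gradient
        return float(np.mean((X.dot(w) - y) ** 2))

# Hypothetical features on wildly different scales, with labels in [0, 1).
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 3)) * np.array([1.0, 1000.0, 44100.0])
y = rng.uniform(size=500)

raw_mse = gd_mse(X, y)

# Standardize each column to zero mean and unit variance before fitting.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_mse = gd_mse(X_scaled, y)

print(raw_mse, scaled_mse)
```

If this is indeed the cause, scaling the features before PCA and regression, or lowering the step size, would be the first things to try.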
