## Music Recommendation
### University of Virginia
### DS 7200: Distributed Computing
### Last Updated: August 20, 2023


---

#### Instructions

In this assignment, you will prepare data and build an ALS recommendation model from user listening data collected by Audioscrobbler.

The data consists of:
- user data (listeners)
- item data (artists)
- interaction data (play counts per user and artist, which are implicit feedback)

The code is outlined below. Make the requested modifications, run the code, and copy all answers to the **ANSWER SECTION** at the bottom of the notebook. Note that `None` appears throughout as a placeholder for code you need to supply.

**NOTE**: For a given userID, some or many recommendations may come back as `None`.  
This happens when a recommended artist ID has no matching name in the artist lookup.  
Filter these out with a list comprehension, as follows:

`print([x for x in recommendationsForUser if x is not None])`
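
In the lookup step, `dict.get` returns `None` for any key the dictionary does not hold. A minimal sketch with made-up IDs and names (`names_by_id` and `recommended_ids` are hypothetical stand-ins, not assignment data) that also truncates to `topk` after filtering:

```python
# Hypothetical lookup table and recommended IDs, for illustration only.
names_by_id = {1: "Artist A", 2: "Artist B", 4: "Artist D"}
recommended_ids = [1, 3, 2, 5, 4]
topk = 2

# dict.get returns None for IDs missing from the lookup (here 3 and 5).
names = [names_by_id.get(artist_id) for artist_id in recommended_ids]

# Drop the None entries, then keep the first topk names.
filtered = [x for x in names if x is not None][:topk]
print(filtered)
```

Requesting more results than needed and truncating after the filter is one way to end up with a fixed number of named recommendations.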

**TOTAL POINTS: 10**
***

In [1]:
# import modules
import os

from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.mllib import recommendation
from pyspark.mllib.recommendation import *
import pandas as pd

In [2]:
# set configurations
conf = SparkConf().setMaster("local").setAppName("autoscrobbler")

In [3]:
# set context
sc = SparkContext.getOrCreate(conf=conf)

In [4]:
# pathing and params
user_artist_data_file = 'user_artist_data.txt'
artist_data_file = 'artist_data.txt'
artist_alias_data_file  = 'artist_alias.txt'

numPartitions = 2
topk = 10

In [5]:
# read user_artist_data_file into an RDD (417 MB file, about 24 million records of users' artist plays)
# each row holds: userID, artistID, count (space-separated)
rawDataRDD = sc.textFile(user_artist_data_file, numPartitions)
rawDataRDD.cache()

user_artist_data.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

In [6]:
# inspect some records
rawDataRDD.take(2)

['1000002 1 55', '1000002 1000006 33']
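
Each record is a single space-delimited string. As a sketch, assuming every record holds exactly three integer fields (the helper name `parse_play` is hypothetical), one record can be split like this:

```python
def parse_play(line):
    """Split a 'userID artistID count' string into a tuple of three ints."""
    user_id, artist_id, count = (int(field) for field in line.split())
    return user_id, artist_id, count

print(parse_play('1000002 1 55'))
```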

In [9]:
# read artist_data_file using *textFile*
rawArtistRDD = sc.textFile(artist_data_file, numPartitions)
rawArtistRDD.cache()

artist_data.txt MapPartitionsRDD[4] at textFile at NativeMethodAccessorImpl.java:0

In [14]:
# inspect some records


In [13]:
# read artist_alias_data_file using *textFile*


In [12]:
# inspect some records


In [None]:
# 1) (1 PT) Print the first 10 records from rawDataRDD


In [2]:
# 2) (1 PT) Apply parseArtistIdNamePair to rawArtistRDD, and print the first 10 records, showing only artist names


In [7]:
def parseArtistIdNamePair(singlePair):
    """Parse a tab-separated 'artistID<TAB>name' line into [(id, name)]; return [] if malformed."""
    splitPair = singlePair.split('\t')
    # we expect exactly two items: the id and the name of the artist
    if len(splitPair) != 2:
        return []
    try:
        return [(int(splitPair[0]), splitPair[1])]
    except ValueError:
        # the id field is not an integer
        return []


In [10]:
artistByID = dict(rawArtistRDD.flatMap(lambda x: parseArtistIdNamePair(x)).collect())
artist_vals = artistByID.values()
list(artist_vals)[:10]

['06Crazy Life',
 'Pang Nakarin',
 'Terfel, Bartoli- Mozart: Don',
 'The Flaming Sidebur',
 'Bodenstandig 3000',
 'Jota Quest e Ivete Sangalo',
 'Toto_XX (1977',
 'U.S Bombs -',
 'artist formaly know as Mat',
 'Kassierer - Musik für beide Ohren']

---

In [38]:
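def parseArtistAlias(alias):
    """Parse a tab-separated 'aliasID<TAB>canonicalID' line into [(alias, canonical)]; return [] if malformed."""
    splitPair = alias.split('\t')
    # we expect exactly two ids in the list
    if len(splitPair) != 2:
        return []
    try:
        return [(int(splitPair[0]), int(splitPair[1]))]
    except ValueError:
        # one of the fields is not an integer
        return []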

In [39]:
artistAlias = rawAliasRDD.flatMap(lambda x: parseArtistAlias(x)).collectAsMap()

In [43]:
# turn the artistAlias into a broadcast variable.
# This will distribute it to worker nodes efficiently, so we save bandwidth.
artistAliasBroadcast = sc.broadcast( artistAlias )

In [49]:
artistAliasBroadcast.value.get(2097174)
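
The alias map sends a variant or misspelled artist ID to its canonical ID. A plain-Python sketch of the lookup with a fallback, using a made-up alias table (`alias_by_id`, its IDs, and the helper `canonical_id` are hypothetical):

```python
# Hypothetical alias table: variant ID -> canonical ID.
alias_by_id = {101: 7, 102: 7}

def canonical_id(artist_id, aliases):
    """Return the canonical ID, or the ID itself when no alias exists."""
    return aliases.get(artist_id, artist_id)

print(canonical_id(101, alias_by_id))
print(canonical_id(55, alias_by_id))
```

On a broadcast variable, the same dictionary is reached through `.value`.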

In [50]:
# Print the number of records from the largest RDD, rawDataRDD
print( rawDataRDD.count() )

24296858


In [53]:
# Sample 10% of rawDataRDD (to reduce runtime) using seed 314. Call it sample.
seed = 314
weights = None
sample, _ = rawDataRDD.randomSplit(weights, seed)
sample.cache()

PythonRDD[15] at RDD at PythonRDD.scala:53
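
`randomSplit` takes a list of weights and a seed and returns one RDD per weight; weights that do not sum to 1 are normalized. A plain-Python sketch of the idea, assuming each record is assigned independently by a seeded random draw (the function `random_split` is hypothetical and is not Spark's implementation):

```python
import random
from bisect import bisect_right
from itertools import accumulate

def random_split(records, weights, seed):
    """Assign each record to one bucket with probability proportional to its weight."""
    total = float(sum(weights))
    # Cumulative upper bound of each bucket on the interval [0, 1].
    bounds = list(accumulate(w / total for w in weights))
    rng = random.Random(seed)
    buckets = [[] for _ in weights]
    for record in records:
        # min() guards against float rounding in the last bound.
        index = min(bisect_right(bounds, rng.random()), len(weights) - 1)
        buckets[index].append(record)
    return buckets

small, large = random_split(range(1000), [1, 3], seed=314)
print(len(small), len(large))
```

Fixing the seed makes the split reproducible, which is why the assignment specifies one.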

In [55]:
# take the first 5 records from the sample. each row represents userID, artistID, count.
sample.take(5)

In [58]:
# Based on sampled data, build the matrix for model training
def mapSingleObservation(x):
    # Returns Rating object represented as (user, product, rating) tuple.
    
    # [add line of code here to split each record into userID, artistID, count]
    
    # given possible aliasing, get finalArtistID
    finalArtistID = artistAliasBroadcast.value.get(artistID)
    if finalArtistID is None:
        finalArtistID = artistID
    return Rating(userID, finalArtistID, count)

In [60]:
trainData = sample.map(lambda x: mapSingleObservation(x))
trainData.cache()

In [None]:
# 3) (1 PT) Print the first 5 records from trainData


In [25]:
# Train the ALS implicit model (since the measurements are activity and not ratings)
# using seed 314, rank 10, iterations 5, alpha 0.01
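
For context on the parameters: Spark's documentation cites the implicit-feedback formulation of Hu, Koren, and Volinsky as the basis for its implicit ALS. In that formulation, a raw play count $r_{ui}$ is turned into a binary preference $p_{ui}$ and a confidence $c_{ui}$, and the user factors $x_u$ and item factors $y_i$ (each of length equal to the rank) minimize a confidence-weighted squared error. The symbols below follow that paper; Spark's implementation details, such as how it scales the regularization term, may differ.

$$
p_{ui} = \begin{cases} 1 & \text{if } r_{ui} > 0 \\ 0 & \text{otherwise} \end{cases}
\qquad
c_{ui} = 1 + \alpha \, r_{ui}
$$

$$
\min_{x,\,y} \; \sum_{u,i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
$$

Here $\alpha$ controls how quickly confidence grows with play count, and $\lambda$ is the regularization strength.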


In [27]:
# Model Evaluation

# fetch artists for a test user
testUserID = 1000002

# broadcast artistByID for speed
artistByIDBroadcast = sc.broadcast( artistByID )

# from trainData, collect the artists for the test user. Call the object artistsForUser.
# hint: you will need to apply .value.get(x.product) to the broadcast artistByID, where x is a Rating record in the RDD.
# if you don't do this, you may see artist IDs; you want artist names.
artistsForUser = (trainData
                  .filter(lambda observation: observation.user == None)
                  .map(lambda observation: artistByIDBroadcast.value.get(observation.product))
                  .collect())

In [None]:
# 4) (1 PT) Print the artist listens for testUserID = 1000002


In [None]:
# 5) (2 PTS) Make 10 recommendations for testUserID = 1000002
num_recomm = 600 # this filters down to 10 after filtering Nones
recommendationsForUser = map(lambda observation: artistByID.get(observation.product), None.call("recommendProducts", None, None))

print(None)

In [30]:
# Train a second ALS model with seed 314, rank 20, iterations 5, lambda 0.01.


In [None]:
# 6) (2 PTS) Using the rank 20 model, make 10 recommendations for the same test user
recommendationsForUser_rank20 = map(lambda observation: artistByID.get(observation.product), model.call("recommendProducts", None, None))
print(None)

#### ANSWER SECTION (COPY ALL ANSWERS HERE)

In [None]:
# ANSWER 1 (1 PT)
# Print the first 10 records from rawDataRDD


In [4]:
# ANSWER 2 (1 PT)
# Apply parseArtistIdNamePair to rawArtistRDD and print the first 10 records, showing only artist names
    

In [None]:
# ANSWER 3 (1 PT)
# Print the first 5 records from trainData


In [None]:
# ANSWER 4 (1 PT)
# Print the artist listens for testUserID = 1000002


In [None]:
# ANSWER 5 (2 PTS)
# Make 10 recommendations for testUserID = 1000002


In [None]:
# ANSWER 6 (2 PTS)
# Using the rank 20 model, make 10 recommendations for testUserID = 1000002

In [None]:
# ANSWER 7 (2 PTS)
# How does the rank 10 model seem to perform versus the rank 20 model?
# The contents of artistsForUser may help answer the question.
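
One informal way to approach the comparison is to check how many recommended artists the user has already played. A sketch with made-up lists (all names below are hypothetical placeholders, and overlap is only a rough signal, not a formal evaluation metric):

```python
def overlap_count(recommended, listened):
    """Count recommended names that also appear in the user's listening history."""
    return len(set(recommended) & set(listened))

# Hypothetical data, for illustration only.
listened = ["Artist A", "Artist B", "Artist C"]
rank10_recs = ["Artist A", "Artist X", "Artist B"]
rank20_recs = ["Artist Y", "Artist C", "Artist Z"]

print(overlap_count(rank10_recs, listened))
print(overlap_count(rank20_recs, listened))
```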
