### Download dataset
<b>Dataset location: </b>http://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip <br /><br />
Given the number of times each user has listened to songs by an artist, recommend artists to that user.

In [21]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('Use Implicit Collaborative Filtering for band recommendations') \
    .getOrCreate()

rawData = spark.read\
               .format('csv')\
               .option('delimiter', '\t')\
               .option('header', 'true')\
               .load('../datasets/lastfm/user_artists.dat')
                
rawData.toPandas().head()

Unnamed: 0,userID,artistID,weight
0,2,51,13883
1,2,52,11690
2,2,53,11351
3,2,54,10300
4,2,55,8983


#### Extract all the columns and cast the values as int

In [22]:
from pyspark.sql.functions import col

dataset = rawData.select(col('userID').cast('int'), 
                         col('artistID').cast('int'), 
                         col('weight').cast('int')
                        )

dataset

DataFrame[userID: int, artistID: int, weight: int]

#### Examine the weight field
The weight is the number of times the user has listened to songs by that artist.

In [23]:
dataset.select('weight').describe().toPandas()

Unnamed: 0,summary,weight
0,count,92834.0
1,mean,745.2439300256372
2,stddev,3751.32208038768
3,min,1.0
4,max,352698.0


#### Standardize the weight column
* The raw weight column has a huge range and magnitude (min 1, max 352698), so we standardize it before training rather than feeding the raw counts to the model
* In order to get recommendations using implicit feedback (such as the number of times an artist has been listened to), we use the standardized weight as the rating column
* PySpark's built-in `StandardScaler` works on vector columns, not on scalar columns, which is why we standardize the column values on our own
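To make the arithmetic concrete, here is a minimal plain-Python sketch of the same standardization, assuming the sample standard deviation (n - 1 in the denominator), which is what Spark's `stddev` computes. The helper names `standardize` and `z_score` are illustrative and not part of the notebook; the mean and standard deviation are copied from the `describe()` output above.

```python
import statistics


def standardize(values):
    """Return z-scores for a list of numbers, using the sample standard
    deviation (n - 1 denominator), like Spark's `stddev`."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]


def z_score(value, mu, sigma):
    """Standardize a single value given a known mean and standard deviation."""
    return (value - mu) / sigma


# Mean and standard deviation of `weight`, as reported by describe() above
MEAN_WEIGHT = 745.2439300256372
STDDEV_WEIGHT = 3751.32208038768

# The largest weight of user 2 (13883) and the smallest weight in the data (1)
print(z_score(13883, MEAN_WEIGHT, STDDEV_WEIGHT))
print(z_score(1, MEAN_WEIGHT, STDDEV_WEIGHT))
```

Note that every weight below the mean (about 745 plays) ends up with a negative standardized value.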

In [24]:
from pyspark.sql.functions import stddev, mean, col

df = dataset.select(mean('weight').alias('mean_weight'), 
                    stddev('weight').alias('stddev_weight'))\
            .crossJoin(dataset)\
            .withColumn('weight_scaled',
                        (col('weight') - col('mean_weight')) / col('stddev_weight'))
        
df.toPandas().head()

Unnamed: 0,mean_weight,stddev_weight,userID,artistID,weight,weight_scaled
0,745.24393,3751.32208,2,51,13883,3.502167
1,745.24393,3751.32208,2,52,11690,2.917573
2,745.24393,3751.32208,2,53,11351,2.827205
3,745.24393,3751.32208,2,54,10300,2.547037
4,745.24393,3751.32208,2,55,8983,2.195961


**What the code in cell 24 does, step by step**

In [25]:
from pyspark.sql.functions import stddev, mean, col

df2 = dataset.select(mean('weight').alias('mean_weight'), 
                    stddev('weight').alias('stddev_weight'))

df2.toPandas().head()

Unnamed: 0,mean_weight,stddev_weight
0,745.24393,3751.32208


In [26]:
from pyspark.sql.functions import stddev, mean, col

df3 = dataset.select(mean('weight').alias('mean_weight'), 
                     stddev('weight').alias('stddev_weight'))\
             .crossJoin(dataset)

df3.toPandas().head()

Unnamed: 0,mean_weight,stddev_weight,userID,artistID,weight
0,745.24393,3751.32208,2,51,13883
1,745.24393,3751.32208,2,52,11690
2,745.24393,3751.32208,2,53,11351
3,745.24393,3751.32208,2,54,10300
4,745.24393,3751.32208,2,55,8983



#### Split the dataset into training and test sets

In [28]:
(trainingData, testData) = df.randomSplit([0.8, 0.2])

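#### Define the ALS model
The metrics typically used to evaluate ALS models trained on implicit feedback are:
* Mean Average Precision (MAP)
* Normalized Discounted Cumulative Gain (NDCG)

We do not compute them in this notebook (not covered in this course). Depending on your Spark version, the RDD-based `pyspark.mllib.evaluation.RankingMetrics` and, from Spark 3.0, `pyspark.ml.evaluation.RankingEvaluator` provide them; otherwise they need to be implemented by hand.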
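To give an idea of what the two metrics named above measure, here is a minimal plain-Python sketch. It assumes binary relevance (an item is either relevant to the user or not) and a ranked recommendation list without duplicates. The function names are illustrative and not a Spark API.

```python
import math


def average_precision(recommended, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(recs_by_user, relevant_by_user):
    """Mean of the per-user average precision."""
    scores = [average_precision(recs, relevant_by_user.get(user, ()))
              for user, recs in recs_by_user.items()]
    return sum(scores) / len(scores) if scores else 0.0


def ndcg_at_k(recommended, relevant, k):
    """NDCG over the top k recommendations, with binary gains."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(recommended[:k], start=1)
              if item in relevant)
    # Ideal ranking: all relevant items (at most k) placed at the top
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg


# Made-up example: items 1 and 3 are hits, item 6 was never recommended
print(average_precision([1, 2, 3, 4, 5], {1, 3, 6}))
print(ndcg_at_k([1, 2, 3, 4, 5], {1, 3, 6}, k=5))
```

Both metrics reward placing relevant artists near the top of the list, which is why they suit recommendation lists better than an error metric on the predicted value.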

In [29]:
from pyspark.ml.recommendation import ALS

als = ALS(maxIter=10, # same as explicit model
          regParam=0.1, # same as explicit model
          userCol='userID', # same as explicit model
          itemCol='artistID', # same as explicit model
          # `implicitPrefs` tells Spark that we are dealing with IMPLICIT
          # feedback and not EXPLICIT ratings
          implicitPrefs=True,
          ratingCol='weight_scaled', # Standardized weight column
          coldStartStrategy='drop')

model = als.fit(trainingData)
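
With `implicitPrefs=True`, the rating column is not treated as a score to reproduce but as evidence of a preference. Below is a minimal sketch of how a single value is interpreted. It assumes the formulation of Hu, Koren and Volinsky, with the confidence computed from the absolute value of the rating. This is our understanding of Spark's behaviour and should be checked against the documentation for your version. The function name and the `alpha=1.0` default are illustrative assumptions.

```python
def implicit_preference_and_confidence(rating, alpha=1.0):
    """Map one implicit-feedback value to (preference, confidence).

    Assumption: preference is 1 for positive values and 0 otherwise,
    and confidence grows linearly with the magnitude of the value.
    """
    preference = 1.0 if rating > 0 else 0.0
    confidence = 1.0 + alpha * abs(rating)
    return preference, confidence


# Two weight_scaled values taken from the tables in this notebook
print(implicit_preference_and_confidence(3.502167))   # well above the mean
print(implicit_preference_and_confidence(-0.178136))  # below the mean
```

Under this assumption, a standardized weight below zero (fewer plays than the mean) counts as "no preference" with low confidence, which is consistent with many predictions being close to 0.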

#### Get the predictions from our model on the test data

In [30]:
predictions = model.transform(testData)
predictions.toPandas().head()

Unnamed: 0,mean_weight,stddev_weight,userID,artistID,weight,weight_scaled,prediction
0,745.24393,3751.32208,1137,463,77,-0.178136,0.0
1,745.24393,3751.32208,850,463,784,0.010331,0.008961
2,745.24393,3751.32208,859,463,243,-0.133885,0.001999
3,745.24393,3751.32208,11,463,1235,0.130556,0.003363
4,745.24393,3751.32208,174,463,93,-0.17387,0.000695


#### Examine the distribution of the original weights and the predictions

In [31]:
predictionsPandas = predictions.select('weight_scaled', 
                                    'prediction').toPandas()
predictionsPandas.describe()

Unnamed: 0,weight_scaled,prediction
count,16154.0,16154.0
mean,0.014553,0.042319
std,1.056234,0.099612
min,-0.198395,-0.510118
25%,-0.168006,0.0
50%,-0.125887,0.002242
75%,-0.026456,0.03657
max,60.534327,1.18815


#### Load the Artist information from the artists.dat file
This will be used to map the artistID listed in the recommendation with the actual artist name

In [32]:
artistData = spark.read\
                  .format('csv')\
                  .option('delimiter', '\t')\
                  .option('header', 'true')\
                  .load('../datasets/lastfm/artists.dat')
                
artistData.toPandas().head()

Unnamed: 0,id,name,url,pictureURL
0,1,MALICE MIZER,http://www.last.fm/music/MALICE+MIZER,http://userserve-ak.last.fm/serve/252/10808.jpg
1,2,Diary of Dreams,http://www.last.fm/music/Diary+of+Dreams,http://userserve-ak.last.fm/serve/252/3052066.jpg
2,3,Carpathian Forest,http://www.last.fm/music/Carpathian+Forest,http://userserve-ak.last.fm/serve/252/40222717...
3,4,Moi dix Mois,http://www.last.fm/music/Moi+dix+Mois,http://userserve-ak.last.fm/serve/252/54697835...
4,5,Bella Morte,http://www.last.fm/music/Bella+Morte,http://userserve-ak.last.fm/serve/252/14789013...


#### Define a function to get the artist recommendations
* Similar to what was done in the last exercise for movie recommendations
* Note how the join of artistData and artistsDF is a little different: the ID column has a different name in each table (artistID vs id)

In [33]:
# A function that performs all the steps, as in the previous notebook
from pyspark.sql.types import IntegerType

def getRecommendationsForUser(userId, numRecs):
    
    usersDF = spark.\
    createDataFrame([userId], IntegerType()).\
    toDF('userId')
    
    userRecs = model.recommendForUserSubset(usersDF, numRecs)
    
    artistsList = userRecs.collect()[0].recommendations
    artistsDF = spark.createDataFrame(artistsList)
    
    # Joining on artistID
    recommendedArtists = artistData\
                        .join(artistsDF, 
                        artistData.id == artistsDF.artistID)\
                        .orderBy('rating', ascending=False)\
                        .select('name', 'url', 'rating')
    
    return recommendedArtists

#### Get full recommendations for a particular userID
* Users 939 and 2013 get recommended a lot of rock and metal bands
* User 2 gets recommended 80s/90s European pop bands
* User 107 gets recommended metal bands
* User 1726 gets pop recommendations

In [34]:
# User 1726 gets pop recommendations
getRecommendationsForUser(1726, 10).toPandas()

Unnamed: 0,name,url,rating
0,Britney Spears,http://www.last.fm/music/Britney+Spears,0.778817
1,Lady Gaga,http://www.last.fm/music/Lady+Gaga,0.508797
2,Christina Aguilera,http://www.last.fm/music/Christina+Aguilera,0.448109
3,Rihanna,http://www.last.fm/music/Rihanna,0.447334
4,Madonna,http://www.last.fm/music/Madonna,0.38932
5,Kylie Minogue,http://www.last.fm/music/Kylie+Minogue,0.342708
6,Beyoncé,http://www.last.fm/music/Beyonc%C3%A9,0.342571
7,Ke$ha,http://www.last.fm/music/Ke%24ha,0.312987
8,Katy Perry,http://www.last.fm/music/Katy+Perry,0.287431
9,Shakira,http://www.last.fm/music/Shakira,0.26204


##### Now, to see whether the recommendations for user 1726 make sense, we can check which artists this user has been listening to

In [36]:
userArtistRaw = dataset.filter(dataset.userID == 1726)

userArtistsInfo = artistData.join(userArtistRaw, 
                        artistData.id==userArtistRaw.artistID)\
                        .orderBy('weight', ascending=False)\
                        .select('name', 'weight')

userArtistsInfo.toPandas()

Unnamed: 0,name,weight
0,Britney Spears,13804
1,Christina Aguilera,1396
2,Rihanna,1056
3,Shakira,1027
4,Katy Perry,651
5,Beyoncé,544
6,Lady Gaga,517
7,Cheryl Cole,478
8,David Guetta,466
9,Ke$ha,446


**User 1726 really does listen to pop music, so we can recommend to that user all the artists that show up in the recommendations but that the user has not listened to yet**
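
As a small sketch of that last step, the snippet below removes the artists the user has already listened to from the recommendation list. It uses only the ten rows shown in each table above, so it assumes those rows are the user's complete listening history. The helper name `unseen_recommendations` is illustrative.

```python
# Recommendations for user 1726, in ranked order (from the table above)
recommended = ['Britney Spears', 'Lady Gaga', 'Christina Aguilera', 'Rihanna',
               'Madonna', 'Kylie Minogue', 'Beyoncé', 'Ke$ha', 'Katy Perry',
               'Shakira']

# Artists user 1726 has listened to (rows shown in the table above)
listened = ['Britney Spears', 'Christina Aguilera', 'Rihanna', 'Shakira',
            'Katy Perry', 'Beyoncé', 'Lady Gaga', 'Cheryl Cole',
            'David Guetta', 'Ke$ha']


def unseen_recommendations(recommended, listened):
    """Keep the recommended artists the user has not listened to yet,
    preserving the original ranking."""
    seen = set(listened)
    return [name for name in recommended if name not in seen]


print(unseen_recommendations(recommended, listened))
# → ['Madonna', 'Kylie Minogue']
```

In the notebook itself, the same filtering could be done with a join between the recommendations and `userArtistsInfo`.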