**Setting up the Data:**
- Dataset - published by Audioscrobbler.

In [1]:
from pyspark.sql import SparkSession
from pyspark import SparkConf
import os, sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

In [2]:
conf = SparkConf()
conf.set("spark.app.name","recommender")
conf.set("spark.master","local[*]")
conf.set("spark.driver.memory","4g")

<pyspark.conf.SparkConf at 0x25889031870>

In [3]:
spark = SparkSession.builder\
                    .config(conf=conf)\
                    .getOrCreate()

In [5]:
raw_user_artist_path = r"C:\Users\blais\Documents\ML\data\audioscrobbler_data\user_artist_data.txt"
raw_user_artist_data = spark.read.text(raw_user_artist_path)

Artist plays:

In [6]:
raw_user_artist_data.show(5)

+-------------------+
|              value|
+-------------------+
|       1000002 1 55|
| 1000002 1000006 33|
|  1000002 1000007 8|
|1000002 1000009 144|
|1000002 1000010 314|
+-------------------+
only showing top 5 rows



The dataset also gives the name of each artist, keyed by artist ID:

In [7]:
raw_artist_data = spark.read.text(r"C:\Users\blais\Documents\ML\data\audioscrobbler_data\artist_data.txt")
raw_artist_data.show(5)

+--------------------+
|               value|
+--------------------+
|1134999\t06Crazy ...|
|6821360\tPang Nak...|
|10113088\tTerfel,...|
|10151459\tThe Fla...|
|6826647\tBodensta...|
+--------------------+
only showing top 5 rows



In [8]:
raw_artist_alias = spark.read.text(r"C:\Users\blais\Documents\ML\data\audioscrobbler_data\artist_alias.txt")

In [9]:
raw_artist_alias.show(5)

+-----------------+
|            value|
+-----------------+
| 1092764\t1000311|
| 1095122\t1000557|
| 6708070\t1007267|
|10088054\t1042317|
| 1195917\t1042317|
+-----------------+
only showing top 5 rows



We now have a basic understanding of the three datasets:
- One has users, artists, and the number of times a user has played songs from an artist
- Another has artist IDs and names
- Another corrects wrongly named artists: it maps each nonstandard ID to the artist's real ID (a lookup table of sorts)
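To make the three line formats concrete, here is a minimal plain-Python sketch that parses one sample line of each kind, using lines shown in the outputs above. The function names are hypothetical, and this only illustrates the formats; it is not a replacement for the Spark parsing used in this notebook.

```python
def parse_play(line):
    """'user artist count', separated by single spaces."""
    user, artist, count = line.split(" ")
    return int(user), int(artist), int(count)

def parse_artist(line):
    """'id<TAB>name'; split once so that spaces inside the name survive."""
    artist_id, name = line.split(None, 1)
    return int(artist_id), name.strip()

def parse_alias(line):
    """'nonstandard_id<TAB>canonical_id'."""
    bad_id, good_id = line.split("\t")
    return int(bad_id), int(good_id)

play = parse_play("1000002 1 55")
artist = parse_artist("1134999\t06Crazy Life")
alias = parse_alias("1092764\t1000311")
print(play, artist, alias)
```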

**Our Requirements for a Recommender System:**
- We need to choose a recommender algorithm that suits our data. Here are our considerations:

1. Implicit feedback:
The data consists entirely of interactions between users and artists' songs. It contains no information about the users or about the artists. We need an algorithm that learns without access to user or artist attributes. These are called collaborative filtering algorithms.
2. Sparsity:
The dataset looks large, with tens of millions of play counts, but in another sense it is small, skimpy, and sparse: each user has played only a small fraction of all the artists.
3. Scalability and real-time predictions:
The algorithm should scale to a dataset of this size and make predictions quickly enough for real-time use.


A broad class of algorithms that may be suitable is latent factor models. They try to explain observed interactions between large numbers of users and items through a relatively small number of unobserved, underlying reasons. For example, consider a customer who has bought albums by the metal bands Megadeth and Pantera but also by the classical composer Mozart. It may be difficult to explain why exactly these albums were bought. However, they are probably a small window onto a much larger set of tastes. Maybe the customer likes a coherent spectrum of music.
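The latent factor idea can be sketched in a few lines of plain Python. Every number below is made up for illustration, and the two factors (loosely "heavy" and "orchestral") are an assumption; a real model learns its factors from the data and they rarely have such tidy meanings. The predicted affinity between a user and an artist is the dot product of their factor vectors.

```python
# Hypothetical 2-factor vectors: [heavy, orchestral]. Illustrative values only.
user_factors = {"customer": [0.9, 0.6]}
artist_factors = {
    "Megadeth": [0.8, 0.1],
    "Pantera": [0.9, 0.0],
    "Mozart": [0.1, 0.9],
    "Some pop act": [0.1, 0.1],
}

def score(user_vec, artist_vec):
    """Predicted affinity: dot product of the two factor vectors."""
    return sum(u * a for u, a in zip(user_vec, artist_vec))

scores = {
    name: score(user_factors["customer"], vec)
    for name, vec in artist_factors.items()
}
# Artists sorted from highest to lowest predicted affinity.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Two factors are enough here to give both the metal bands and Mozart a high score for the same customer, while the unrelated artist scores low.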

**Using collaborative filters and Matrix Factorisation:**



**Alternating Least Squares Algorithm:**


In [10]:
import pyspark.sql.functions as f 
from pyspark.sql.types import IntegerType, StringType

In [11]:
raw_user_artist_data.show(5)

+-------------------+
|              value|
+-------------------+
|       1000002 1 55|
| 1000002 1000006 33|
|  1000002 1000007 8|
|1000002 1000009 144|
|1000002 1000010 314|
+-------------------+
only showing top 5 rows



In [15]:
user_artist_df = raw_user_artist_data\
                            .withColumn('user', f.split(raw_user_artist_data['value'],' ').getItem(0).cast(IntegerType()))\
                            .withColumn('artist',f.split(raw_user_artist_data['value'],' ').getItem(1).cast(IntegerType()))\
                            .withColumn('count',f.split(raw_user_artist_data['value'],' ').getItem(2).cast(IntegerType()))\
                            .drop("value")

In [17]:
user_artist_df.show(3)

+-------+-------+-----+
|   user| artist|count|
+-------+-------+-----+
|1000002|      1|   55|
|1000002|1000006|   33|
|1000002|1000007|    8|
+-------+-------+-----+
only showing top 3 rows



In [18]:
user_artist_df.select(f.min("user"), f.max("user"), f.min("artist"), f.max("artist")).show()

+---------+---------+-----------+-----------+
|min(user)|max(user)|min(artist)|max(artist)|
+---------+---------+-----------+-----------+
|       90|  2443548|          1|   10794401|
+---------+---------+-----------+-----------+
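One reason to check the minimum and maximum IDs is that Spark's ALS expects user and item IDs to fit in the 32-bit integer range. A minimal sketch of that check, using the values printed above (the helper name `fits_int32` is hypothetical):

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

# (min, max) pairs copied from the output above.
observed = {"user": (90, 2443548), "artist": (1, 10794401)}

def fits_int32(lo, hi):
    """True if the whole [lo, hi] range is representable as a 32-bit int."""
    return INT32_MIN <= lo and hi <= INT32_MAX

checks = {col: fits_int32(lo, hi) for col, (lo, hi) in observed.items()}
print(checks)
```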



In [19]:
raw_artist_data.show(3)

+--------------------+
|               value|
+--------------------+
|1134999\t06Crazy ...|
|6821360\tPang Nak...|
|10113088\tTerfel,...|
+--------------------+
only showing top 3 rows



In [21]:
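# Raw strings keep the regex backslash from being read as a Python escape;
# the limit of 2 splits only on the first whitespace run, so names keep their spaces.
artist_by_id = raw_artist_data.withColumn('id', f.split(f.col('value'), r'\s+', 2).getItem(0).cast(IntegerType()))\
                              .withColumn('name', f.split(f.col('value'), r'\s+', 2).getItem(1).cast(StringType()))\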
                              .drop("value")

In [22]:
artist_by_id.show(3)

+--------+--------------------+
|      id|                name|
+--------+--------------------+
| 1134999|        06Crazy Life|
| 6821360|        Pang Nakarin|
|10113088|Terfel, Bartoli- ...|
+--------+--------------------+
only showing top 3 rows



The raw artist alias file maps artist IDs that may be misspelled or nonstandard to the ID of the artist's canonical name.

In [23]:
raw_artist_alias.show(3)

+----------------+
|           value|
+----------------+
|1092764\t1000311|
|1095122\t1000557|
|6708070\t1007267|
+----------------+
only showing top 3 rows



In [24]:
# Raw strings keep the regex backslash from being read as a Python escape.
artist_alias = raw_artist_alias.withColumn('artist', f.split(f.col('value'), r'\s+').getItem(0).cast(IntegerType()))\
                               .withColumn('alias', f.split(f.col('value'), r'\s+').getItem(1).cast(StringType()))\
                               .drop("value")

In [25]:
artist_alias.show(4)

+--------+-------+
|  artist|  alias|
+--------+-------+
| 1092764|1000311|
| 1095122|1000557|
| 6708070|1007267|
|10088054|1042317|
+--------+-------+
only showing top 4 rows



The first entry maps ID 1092764 to 1000311. We can look both up in artist_by_id:

In [28]:
artist_by_id.where("id IN (1092764, 1000311)").show()

+-------+--------------+
|     id|          name|
+-------+--------------+
|1000311| Steve Winwood|
|1092764|Winwood, Steve|
+-------+--------------+



**Building a First Model:**
- Although the dataset is nearly in the right form for use with Spark MLlib's ALS implementation, it requires a small extra transformation. The aliases should be applied to convert all artist IDs to a canonical ID, if a different canonical ID exists:

In [30]:
artist_alias.show(3)

+-------+-------+
| artist|  alias|
+-------+-------+
|1092764|1000311|
|1095122|1000557|
|6708070|1007267|
+-------+-------+
only showing top 3 rows



In [29]:
user_artist_df.show(3)

+-------+-------+-----+
|   user| artist|count|
+-------+-------+-----+
|1000002|      1|   55|
|1000002|1000006|   33|
|1000002|1000007|    8|
+-------+-------+-----+
only showing top 3 rows



In [31]:
train_data = user_artist_df.join(f.broadcast(artist_alias), 'artist', how='left')

In [32]:
train_data.show(3)

+-------+-------+-----+-----+
| artist|   user|count|alias|
+-------+-------+-----+-----+
|      1|1000002|   55| NULL|
|1000006|1000002|   33| NULL|
|1000007|1000002|    8| NULL|
+-------+-------+-----+-----+
only showing top 3 rows



In [34]:
train_data = train_data.withColumn('artist',
                                    f.when(f.col('alias').isNull(),f.col('artist')).otherwise(f.col('alias')))

In [35]:
train_data = train_data.withColumn('artist',f.col('artist').cast(IntegerType()))\
                       .drop('alias')
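The left join followed by the null check amounts to "use the alias if there is one, otherwise keep the original ID". A minimal plain-Python sketch of the same logic, using the first aliases shown above; the last play row is hypothetical and exists only to exercise the alias path:

```python
# Alias pairs copied from artist_alias above: nonstandard ID -> canonical ID.
alias = {1092764: 1000311, 1095122: 1000557, 6708070: 1007267}

# (user, artist, count) rows; the third row is made up for illustration.
plays = [
    (1000002, 1, 55),
    (1000002, 1000006, 33),
    (1000099, 1092764, 4),
]

# dict.get falls back to the original ID when no alias exists,
# which mirrors the NULL branch of the left join.
canonical = [(user, alias.get(artist, artist), count) for user, artist, count in plays]
print(canonical)
```

In PySpark, `f.coalesce(f.col('alias'), f.col('artist'))` should express the same fallback more compactly than `when`/`otherwise`, since `coalesce` returns the first non-null value.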

In [36]:
train_data.cache()

DataFrame[artist: int, user: int, count: int]

In [37]:
train_data.count()

24296858

With Spark, calling cache or persist only marks the DataFrame for caching. It is not fully cached until you trigger an action that goes through every record (e.g. count). With an action like show(1), only one partition will be cached.

Finally - build a model:

In [38]:
from pyspark.ml.recommendation import ALS

In [40]:
train_data.show(3)

+-------+-------+-----+
| artist|   user|count|
+-------+-------+-----+
|      1|1000002|   55|
|1000006|1000002|   33|
|1000007|1000002|    8|
+-------+-------+-----+
only showing top 3 rows



In [41]:
model = ALS(rank=10, seed=0, maxIter=5, regParam=0.1,
            implicitPrefs=True, alpha=1.0, userCol='user',
            itemCol='artist', ratingCol='count').fit(train_data)
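With `implicitPrefs=True`, the play counts are not treated as ratings to be predicted. A minimal sketch of the idea, assuming the implicit-feedback formulation of Hu, Koren and Volinsky that the Spark documentation cites: each count becomes a binary preference plus a confidence weight that grows with the count, scaled by `alpha`. The function name is hypothetical, and this shows only the transformation of a single count, not the ALS optimisation itself.

```python
def preference_and_confidence(count, alpha=1.0):
    """Map a raw play count to (preference, confidence).

    preference: 1.0 if the user played the artist at all, else 0.0
    confidence: 1 + alpha * count, so more plays mean more certainty
    """
    preference = 1.0 if count > 0 else 0.0
    confidence = 1.0 + alpha * count
    return preference, confidence

played_often = preference_and_confidence(55)   # count from the first row above
never_played = preference_and_confidence(0)
print(played_often, never_played)
```

Under this reading, a larger `alpha` makes the model trust observed plays more strongly relative to the unobserved user-artist pairs.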

To see some feature vectors, try the following, which displays just one row and does not truncate the wide display of the feature vector:

In [42]:
model.userFactors.show(1, truncate=False)

+---+-----------------------------------------------------------------------------------------------------------------------------+
|id |features                                                                                                                     |
+---+-----------------------------------------------------------------------------------------------------------------------------+
|90 |[0.16020626, 0.20717518, -0.17194685, 0.060384676, 0.0627277, 0.54658705, -0.40481892, 0.43657345, -0.10396776, -0.042728294]|
+---+-----------------------------------------------------------------------------------------------------------------------------+
only showing top 1 row



**Spot Checking Recommendations:**

In [43]:
user_id = 2093760

In [44]:
existing_artist_ids = train_data.filter(train_data.user == user_id).select("artist").collect()

In [45]:
existing_artist_ids = [i[0] for i in existing_artist_ids]

In [46]:
existing_artist_ids

[1180, 1255340, 378, 813, 942]

In [48]:
artist_by_id.filter(f.col('id').isin(existing_artist_ids)).show()

+-------+---------------+
|     id|           name|
+-------+---------------+
|   1180|     David Gray|
|    378|  Blackalicious|
|    813|     Jurassic 5|
|1255340|The Saw Doctors|
|    942|         Xzibit|
+-------+---------------+



In [49]:
user_subset = train_data.select('user').where(f.col('user')== user_id).distinct()

In [50]:
user_subset.show()

+-------+
|   user|
+-------+
|2093760|
+-------+



In [51]:
top_predictions = model.recommendForUserSubset(user_subset, 5)

In [52]:
top_predictions.show()

+-------+--------------------+
|   user|     recommendations|
+-------+--------------------+
|2093760|[{2814, 0.0294106...|
+-------+--------------------+



In [53]:
top_predictions_pandas = top_predictions.toPandas()

In [54]:
print(top_predictions_pandas)

      user                                    recommendations
0  2093760  [(2814, 0.029410677030682564), (1300642, 0.028...


In [55]:
recommended_artist_ids = [i[0] for i in top_predictions_pandas.recommendations[0]]

In [57]:
artist_by_id.filter(f.col('id').isin(recommended_artist_ids)).show()

+-------+----------+
|     id|      name|
+-------+----------+
|   2814|   50 Cent|
|   4605|Snoop Dogg|
|1007614|     Jay-Z|
|1001819|      2Pac|
|1300642|  The Game|
+-------+----------+
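Filtering with `isin` returns the names in whatever order artist_by_id holds them, so the ranking is lost. A minimal plain-Python sketch of keeping the rank order by looking each ID up in a dictionary. The names come from the output above; the scores are rounded placeholders, and the order of the last three entries is an assumption, since only the first two recommendations are visible in the printed output.

```python
# (artist_id, score) pairs in rank order. Scores are placeholders and the
# order of the last three entries is assumed for illustration.
recommendations = [
    (2814, 0.029),
    (1300642, 0.028),
    (4605, 0.027),
    (1007614, 0.026),
    (1001819, 0.025),
]

# ID -> name, copied from the artist_by_id output above.
names = {
    2814: "50 Cent",
    4605: "Snoop Dogg",
    1007614: "Jay-Z",
    1001819: "2Pac",
    1300642: "The Game",
}

# Walk the recommendations in order so the ranking is preserved.
ranked_names = [(names.get(artist_id, "<unknown>"), score)
                for artist_id, score in recommendations]
for rank, (name, score) in enumerate(ranked_names, start=1):
    print(rank, name, score)
```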

