# Music Recommendations at Scale with Spark
## Summary of Christopher Johnson's Talk (Spotify)

In his talk *"Music Recommendations at Scale with Spark"*, Christopher Johnson describes how Spotify uses large-scale recommender systems to help users discover new music from over 40 million songs.

Music streaming services rely on large-scale recommender systems to improve user experience and engagement by helping users discover new songs.

# Key points:
- Spotify combines **personalized recommendations**, **artist radios**, and **similar artist discovery**.
- They use both **manual curation** (expert tagging of musical attributes) and **automated analysis** (e.g., via The Echo Nest's audio content analysis).
- The core of Spotify’s large-scale recommendations relies on **collaborative filtering**, particularly **implicit matrix factorization**:
    - Instead of explicit ratings (like Netflix stars), Spotify uses **implicit feedback** (binary: streamed or not, or weighted by play count).
    - This feedback forms a large sparse matrix (users × songs).
- They solve for latent factors using **alternating least squares (ALS)**:
    - Fix item vectors → solve user vectors (ridge regression)
    - Fix user vectors → solve item vectors (the same ridge regression with the roles swapped)
    - Iterate until convergence
- Three implementation strategies were discussed:
    - 1️⃣ **Broadcast everything** (inefficient, lots of shuffling)
    - 2️⃣ **Full gridify** (better caching, still heavy network traffic)
    - 3️⃣ **Half gridify** (the best trade-off of the three between memory use and network traffic)
- Moving from Hadoop to Spark cut training time substantially:
    - Hadoop: 10 hours
    - Spark full gridify: 3.5 hours (about 2.9× faster)
    - Spark half gridify: 1.5 hours (about 6.7× faster)
- Using **Kryo serialization** instead of the default Java serialization improved performance.
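The alternating updates listed above can be sketched on a single machine with NumPy. This is a minimal illustration, not Spotify's or Spark's implementation: it assumes the common formulation in which a play count `r` becomes a binary preference (`1` if `r > 0`) weighted by a confidence `1 + alpha * r`, and the function names, the tiny play-count matrix, and the hyperparameter values are all hypothetical.

```python
import numpy as np

def objective(plays, users, items, alpha, reg):
    """Confidence-weighted squared error plus an L2 penalty on both factor sets."""
    pref = (plays > 0).astype(float)
    conf = 1.0 + alpha * plays
    err = pref - users @ items.T
    penalty = reg * ((users ** 2).sum() + (items ** 2).sum())
    return float((conf * err ** 2).sum() + penalty)

def solve_side(fixed, plays, alpha, reg):
    """Hold `fixed` constant and solve one weighted ridge regression per row of `plays`."""
    rank = fixed.shape[1]
    pref = (plays > 0).astype(float)
    conf = 1.0 + alpha * plays
    solved = np.empty((plays.shape[0], rank))
    for row in range(plays.shape[0]):
        c = conf[row]
        # Normal equations: (F^T C F + reg * I) x = F^T C p
        lhs = fixed.T @ (fixed * c[:, None]) + reg * np.eye(rank)
        rhs = fixed.T @ (c * pref[row])
        solved[row] = np.linalg.solve(lhs, rhs)
    return solved

def implicit_als(plays, rank=3, alpha=1.0, reg=0.1, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    users = rng.normal(scale=0.1, size=(plays.shape[0], rank))
    items = rng.normal(scale=0.1, size=(plays.shape[1], rank))
    history = [objective(plays, users, items, alpha, reg)]
    for _ in range(iters):
        users = solve_side(items, plays, alpha, reg)    # fix items, solve users
        items = solve_side(users, plays.T, alpha, reg)  # fix users, solve items
        history.append(objective(plays, users, items, alpha, reg))
    return users, items, history

# Hypothetical play counts: rows are users, columns are songs, 0 means never played.
plays = np.array([[3.0, 1.0, 2.0],
                  [4.0, 0.0, 1.0],
                  [0.0, 5.0, 3.0]])

users, items, history = implicit_als(plays)
scores = users @ items.T  # predicted preference for every (user, song) pair
print(np.round(scores, 2))
print("objective before/after:", round(history[0], 3), round(history[-1], 3))
```

Because each half-step solves its least-squares problem exactly, the objective should not increase from one iteration to the next. The distributed strategies from the talk differ in how the ratings and factor vectors are partitioned and moved between workers, not in this per-row solve.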

# Why Implicit Feedback?

In music streaming, users rarely provide explicit ratings (like stars). Instead, their listening behavior, such as how often they play a song, provides implicit feedback. Modeling implicit feedback lets the system learn preferences from abundant real usage data rather than depending on explicit ratings, which are sparse or missing for most user-song pairs.
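One common way to turn raw play counts into something a model can learn from is to split each count into a binary preference and a confidence weight. The sketch below assumes the formulation from Hu, Koren and Volinsky's implicit-feedback method (confidence `1 + alpha * play_count`); the function name and the `alpha` value are illustrative choices, not taken from the talk.

```python
def to_preference_and_confidence(play_count, alpha=1.0):
    """Map a play count to (preference, confidence).

    preference: 1.0 if the song was played at all, else 0.0
    confidence: grows linearly with the play count; unplayed pairs keep
                a baseline confidence of 1.0 rather than being ignored
    """
    preference = 1.0 if play_count > 0 else 0.0
    confidence = 1.0 + alpha * play_count
    return preference, confidence

examples = {count: to_preference_and_confidence(count) for count in (0, 3, 5)}
for count, (preference, confidence) in examples.items():
    print(f"plays={count} preference={preference} confidence={confidence}")
```

An unplayed song is treated as a weak negative (preference 0 with low confidence), while a heavily played song is a strong positive. That asymmetry is what lets the model use the whole matrix instead of only the observed entries.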


# Example: Implicit Matrix Factorization with PySpark ALS
The code below illustrates implicit matrix factorization on a toy dataset using Spark's built-in ALS implementation (`pyspark.ml.recommendation.ALS`).

In [56]:
# Import Spark libraries
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

# Start Spark session
# SparkSession is the entry point to programming Spark with the Dataset and DataFrame API.
spark = SparkSession.builder.appName("SpotifyALSExample").getOrCreate()

In [58]:
# Example dataset: userId, songId, implicit play count (e.g., number of streams)
data = [
    (0, 10, 3.0),
    (0, 20, 1.0),
    (0, 30, 2.0),
    (1, 10, 4.0),
    (1, 30, 1.0),
    (2, 20, 5.0),
    (2, 30, 3.0)
]

columns = ["userId", "songId", "playCount"]

# Create Spark DataFrame
ratings = spark.createDataFrame(data, columns)

ratings.show()

+------+------+---------+
|userId|songId|playCount|
+------+------+---------+
|     0|    10|      3.0|
|     0|    20|      1.0|
|     0|    30|      2.0|
|     1|    10|      4.0|
|     1|    30|      1.0|
|     2|    20|      5.0|
|     2|    30|      3.0|
+------+------+---------+



This table shows the user-song interactions used as training data. Each row represents how many times a user has played a specific song. For example, user 0 played song 10 three times.
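To connect these rows to the users × songs matrix that ALS factorizes, the same triples can be pivoted into a small dense matrix in plain Python. This is only an illustration for a toy dataset: at real scale the matrix is far too sparse to materialize densely, and the variable names here are hypothetical.

```python
# Same (userId, songId, playCount) triples as in the DataFrame above.
data = [
    (0, 10, 3.0), (0, 20, 1.0), (0, 30, 2.0),
    (1, 10, 4.0), (1, 30, 1.0),
    (2, 20, 5.0), (2, 30, 3.0),
]

user_ids = sorted({user for user, _, _ in data})
song_ids = sorted({song for _, song, _ in data})
row_of = {user: index for index, user in enumerate(user_ids)}
col_of = {song: index for index, song in enumerate(song_ids)}

# Missing (user, song) pairs stay at 0.0, meaning "never played".
matrix = [[0.0] * len(song_ids) for _ in user_ids]
for user, song, count in data:
    matrix[row_of[user]][col_of[song]] = count

density = len(data) / (len(user_ids) * len(song_ids))

for user, row in zip(user_ids, matrix):
    print(user, row)
print(f"observed entries: {len(data)} of {len(user_ids) * len(song_ids)} ({density:.0%})")
```

Here 7 of the 9 cells are observed. In a production catalog the observed fraction is tiny, which is why the ratings are stored and partitioned as triples.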

In [60]:
# Configure ALS for implicit feedback
# ALS (Alternating Least Squares) factorizes the user-item interaction matrix into latent factors.
als = ALS(
    userCol="userId",
    itemCol="songId",
    ratingCol="playCount",
    implicitPrefs=True,  # Treat playCount as implicit feedback, not an explicit rating
    alpha=1.0,           # Confidence scaling for implicit feedback (Spark's default value)
    rank=10,             # Number of latent factors
    maxIter=10,          # Number of ALS iterations
    regParam=0.1         # Regularization to prevent overfitting
)

# Fit the ALS model
model = als.fit(ratings)

In [62]:
# Generate top 3 song recommendations for each user
user_recommendations = model.recommendForAllUsers(3)

# Display recommendations
user_recommendations.show(truncate=False)

+------+-----------------------------------------------------+
|userId|recommendations                                      |
+------+-----------------------------------------------------+
|0     |[{30, 0.9973508}, {10, 0.95992213}, {20, 0.92250496}]|
|1     |[{10, 0.9859443}, {30, 0.9187319}, {20, 0.11632301}] |
|2     |[{20, 0.98867387}, {30, 0.9600925}, {10, 0.10747534}]|
+------+-----------------------------------------------------+



## About ALS Parameters
- `rank`: Number of latent factors used to represent each user and each song. A higher rank can capture more complex patterns but costs more to compute and may overfit.
- `maxIter`: Number of ALS iterations to run. Each iteration updates the user factors and then the item factors.
- `regParam`: Regularization parameter that limits overfitting by penalizing large factor values.
- `implicitPrefs`: When set to `True`, ALS models implicit feedback rather than explicit ratings. The input column (here `playCount`) is interpreted as a confidence level in an observed preference, not as a rating to be reproduced.
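For reference, the objective that implicit ALS is usually described as minimizing can be written as follows. It follows the implicit-feedback formulation of Hu, Koren and Volinsky, which Spark's documentation cites as the basis for its implicit mode. The notation is introduced here for illustration and is not taken from the talk: $r_{ui}$ is the play count, $p_{ui}$ the binary preference, $c_{ui}$ the confidence, $x_u$ and $y_i$ the user and song factor vectors of length `rank`, $\lambda$ corresponds to `regParam`, and $\alpha$ to the `alpha` parameter of `ALS`.

$$
\min_{X,\,Y} \; \sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2
\;+\; \lambda \Bigl(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\Bigr),
\qquad
p_{ui} = \begin{cases} 1 & \text{if } r_{ui} > 0 \\ 0 & \text{otherwise} \end{cases},
\qquad
c_{ui} = 1 + \alpha\, r_{ui}
$$

With $Y$ fixed, the problem separates into one ridge regression per user, and likewise per song with $X$ fixed; `maxIter` sets how many of these alternating sweeps are run. Spark's implementation may scale the regularization term differently, so treat this as a sketch of the idea rather than the exact loss Spark optimizes.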

## Interpreting Recommendations

Each user receives a ranked list of songs with predicted preference scores. A higher score means the model infers a stronger affinity between the user and the song. The scores are relative measures derived from implicit feedback, not explicit ratings or predicted play counts.

For example, user 0 is most strongly recommended song 30, followed by songs 10 and 20. The lists include songs a user has already played (user 0 has played all three), and in this toy run the songs a user has never played score lowest: song 20 for user 1 (about 0.12) and song 10 for user 2 (about 0.11).
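Since the goal is discovery, a natural post-processing step is to drop songs a user has already played. The sketch below does this in plain Python on the values printed above; the dictionaries are copied by hand from the output and the training data, and the variable names are hypothetical.

```python
# Songs each user has already played, taken from the training data.
played = {0: {10, 20, 30}, 1: {10, 30}, 2: {20, 30}}

# (songId, score) pairs copied from the recommendForAllUsers output.
recommendations = {
    0: [(30, 0.9973508), (10, 0.95992213), (20, 0.92250496)],
    1: [(10, 0.9859443), (30, 0.9187319), (20, 0.11632301)],
    2: [(20, 0.98867387), (30, 0.9600925), (10, 0.10747534)],
}

# Keep only songs the user has not played; the original ranking order is preserved.
unseen = {
    user: [(song, score) for song, score in recs if song not in played[user]]
    for user, recs in recommendations.items()
}

for user, recs in unseen.items():
    print(user, recs)
# 0 []
# 1 [(20, 0.11632301)]
# 2 [(10, 0.10747534)]
```

User 0 has heard every song in this tiny catalog, so nothing new remains to recommend. With a realistic catalog one would request more candidates than needed before filtering.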

In [64]:
# Stop Spark session
spark.stop()

## References
- Talk: Christopher Johnson (2014). *Music Recommendations at Scale with Spark (Spotify)*. [YouTube](http://www.youtube.com/watch?v=3LBgiFch4_g)
- Code inspired by: Christopher Johnson’s `implicit-mf` repository: [https://github.com/MrChrisJohnson/implicit-mf](https://github.com/MrChrisJohnson/implicit-mf)
