<a href="https://colab.research.google.com/github/Imraj/3D-Machine-Learning/blob/master/music_recommender_big_data_als.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Setup dev environment**

In [17]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


**Import Packages** 


In [19]:
from google.colab import drive
import os

import numpy as np
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StandardScaler, StringIndexer
from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession
# Note: this `max` is Spark's column aggregate and shadows Python's built-in max()
from pyspark.sql.functions import col, concat, lit, max

**Mount Google Drive to access the dataset from its directory**

In [20]:
drive.mount("/content/drive")
os.chdir('/content/drive/My Drive/collaborative filtering/dataset')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [21]:
!ls

ydata-ymusic-artist-names-v1_0.txt  ydata-ymusic-user-artist-ratings-v1_0.txt


## **Create Dataset**



1.   Load the data from the text files
2.   Normalize the ratings (0-1) by dividing each rating by the maximum rating
3.   Split the dataset into training and test sets
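
The three steps above can be sketched in plain Python on a handful of made-up rows. This is only an illustration under stated assumptions: the sample rows and the `parse_rows`, `normalize`, and `split` helpers are hypothetical, and the notebook itself performs these steps with Spark DataFrames.

```python
import random

# Hypothetical tab-separated rows in the (userId, artistId, rating) layout
# that the notebook assigns to the ratings file; the values are made up.
RAW_LINES = [
    "1\t1000001\t90",
    "1\t1000002\t100",
    "2\t1000001\t0",
    "2\t1000003\t50",
    "3\t1000002\t70",
]


def parse_rows(lines):
    """Step 1: split each line on tabs and convert the rating to a number."""
    rows = []
    for line in lines:
        user_id, artist_id, rating = line.split("\t")
        rows.append((user_id, artist_id, float(rating)))
    return rows


def normalize(rows):
    """Step 2: divide every rating by the numeric maximum, giving values in [0, 1]."""
    max_rating = max(rating for _, _, rating in rows)
    return [(u, a, r / max_rating) for u, a, r in rows]


def split(rows, train_fraction=0.8, seed=123):
    """Step 3: assign each row to train or test independently at random."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = normalize(parse_rows(RAW_LINES))
train_rows, test_rows = split(rows)
print(len(train_rows), len(test_rows))
```

Converting the rating to a number in step 1 matters: the maximum in step 2 is only meaningful when the values are compared numerically.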



In [23]:
# Create a Spark session
spark = SparkSession.builder.appName('MusicRecommender').getOrCreate()

# Load the artist names data
artist_names = spark.read.csv('ydata-ymusic-artist-names-v1_0.txt', header=None, sep='\t')

# Load the user ratings data
user_ratings = spark.read.csv('ydata-ymusic-user-artist-ratings-v1_0.txt', header=None, sep='\t')

In [24]:
# Rename the columns of the user ratings dataframe
user_ratings = user_ratings.selectExpr('_c0 as userId', '_c1 as artistId', '_c2 as rating')


In [25]:
# Rename the columns of the artist names dataframe
artist_names = artist_names.selectExpr('_c0 as artistId', '_c1 as artistName')

In [26]:
# Join the artist names dataframe with the user ratings dataframe
joined_data = artist_names.join(user_ratings, 'artistId')


**Normalize Ratings**

In [None]:

# Compute the maximum rating
max_rating = joined_data.agg(max(col('rating'))).collect()[0][0]

# Create a new column with normalized ratings
joined_data_norm = joined_data.withColumn('normalized_rating', col('rating') / max_rating)

In [27]:
joined_data_norm.sample(0.1, seed=123).show()

+--------+------------------+------+------+------------------+
|artistId|        artistName|userId|rating| normalized_rating|
+--------+------------------+------+------+------------------+
| 1045525|      Tupac Shakur|     1|   100|1.0101010101010102|
| 1047584|      Shadows Fall|     1|   100|1.0101010101010102|
| 1018143|    Mint Condition|     2|     0|               0.0|
| 1053438|             Musiq|     2|    90|0.9090909090909091|
| 1019512|   Nine Inch Nails|     4|    90|0.9090909090909091|
| 1030811|     Jack Off Jill|     4|    90|0.9090909090909091|
| 1042272| The White Stripes|     4|    90|0.9090909090909091|
| 1049897|        Norma Jean|     4|     0|               0.0|
| 1099394|  Coheed & Cambria|     4|     0|               0.0|
| 1014120|          R. Kelly|     5|   100|1.0101010101010102|
| 1014252|          Kid Rock|     5|     0|               0.0|
| 1022980|         The Roots|     5|    90|0.9090909090909091|
| 1026260|       Keith Sweat|     5|   100|1.0101010101
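
In the sample above a rating of 100 maps to 1.0101..., which is 100/99, so the divisor used was 99 rather than 100. The schema printed later in the notebook lists `rating` as a string, and the maximum of strings is taken lexicographically, which places `'99'` after `'100'`. The plain-Python sketch below reproduces the effect on made-up values. Casting the column to a numeric type before aggregating, for example with `col('rating').cast('double')`, is one way to avoid it in Spark.

```python
# Made-up ratings stored as strings, as they would be when a file is read
# without a schema.
ratings_as_text = ["0", "90", "99", "100"]

# String comparison goes character by character, so "99" sorts after "100".
lexicographic_max = max(ratings_as_text)

# Converting to numbers first gives the intended maximum.
numeric_max = max(float(r) for r in ratings_as_text)

print(lexicographic_max)                 # 99
print(numeric_max)                       # 100.0
print(100 / float(lexicographic_max))    # the 1.0101... ratio seen in the table
print(100 / numeric_max)                 # 1.0
```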

**Encode the categorical field**

In [None]:

# Encode user_id column
user_id_indexer = StringIndexer(inputCol='userId', outputCol='userId_index')
user_id_encoder = OneHotEncoder(inputCol='userId_index', outputCol='userId_vec')

# Encode artist_id column
artist_id_indexer = StringIndexer(inputCol='artistId', outputCol='artistId_index')
artist_id_encoder = OneHotEncoder(inputCol='artistId_index', outputCol='artistId_vec')

# Encode artist_name column
artist_name_indexer = StringIndexer(inputCol='artistName', outputCol='artistName_index')
artist_name_encoder = OneHotEncoder(inputCol='artistName_index', outputCol='artistName_vec')

# Chain the encoders together into a single pipeline
encoder_pipeline = Pipeline(stages=[user_id_indexer, user_id_encoder,
                                    artist_id_indexer, artist_id_encoder,
                                    artist_name_indexer, artist_name_encoder])

# Fit the encoder pipeline on the full joined dataset (before the train/test split) and transform it
encoded_training_data = encoder_pipeline.fit(joined_data_norm).transform(joined_data_norm)


In [None]:
encoded_training_data.sample(0.1, seed=123).show()

+--------+------------------+------+------+------------------+------------+--------------------+--------------+--------------------+----------------+--------------------+
|artistId|        artistName|userId|rating| normalized_rating|userId_index|          userId_vec|artistId_index|        artistId_vec|artistName_index|      artistName_vec|
+--------+------------------+------+------+------------------+------------+--------------------+--------------+--------------------+----------------+--------------------+
| 1045525|      Tupac Shakur|     1|   100|1.0101010101010102|     26421.0|(72844,[26421],[1...|          35.0|  (23182,[35],[1.0])|            35.0|  (22959,[35],[1.0])|
| 1047584|      Shadows Fall|     1|   100|1.0101010101010102|     26421.0|(72844,[26421],[1...|        1009.0|(23182,[1009],[1.0])|          1006.0|(22959,[1006],[1.0])|
| 1018143|    Mint Condition|     2|     0|               0.0|     54599.0|(72844,[54599],[1...|         667.0| (23182,[667],[1.0])|           66
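
Each `*_vec` entry above is a sparse vector printed as `(size, [indices], [values])`. A one-hot vector holds a single 1.0 at the position given by the matching `*_index` column. The size is one less than the number of distinct labels (72844 here, while the notebook later reports 72845 distinct users) because `OneHotEncoder` drops the last category by default (`dropLast=True`). A minimal sketch of that layout, using a hypothetical `one_hot` helper rather than Spark's implementation:

```python
def one_hot(index, num_labels, drop_last=True):
    """Return (size, indices, values) for a one-hot vector.

    With drop_last=True the vector has num_labels - 1 slots and the last
    label is encoded as the all-zeros vector.
    """
    size = num_labels - 1 if drop_last else num_labels
    if index >= size:
        return (size, [], [])
    return (size, [index], [1.0])


print(one_hot(26421, 72845))   # (72844, [26421], [1.0])
print(one_hot(72844, 72845))   # (72844, [], [])
```

The later `data_prep` selection keeps only the `*_index` columns, so these vectors are not what is passed on to training.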

In [30]:
encoded_training_data.printSchema()


root
 |-- artistId: string (nullable = true)
 |-- artistName: string (nullable = true)
 |-- userId: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- normalized_rating: double (nullable = true)
 |-- userId_index: double (nullable = false)
 |-- userId_vec: vector (nullable = true)
 |-- artistId_index: double (nullable = false)
 |-- artistId_vec: vector (nullable = true)
 |-- artistName_index: double (nullable = false)
 |-- artistName_vec: vector (nullable = true)
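
`StringIndexer` orders labels by frequency by default (`stringOrderType='frequencyDesc'`), so the most frequent label receives index 0.0. That is why `userId` 1 maps to 26421.0 rather than to a small number. The indices are stored as doubles, as the schema shows. A plain-Python sketch of the ordering on made-up labels follows; the `frequency_index` helper is hypothetical, and breaking ties alphabetically is an assumption here.

```python
from collections import Counter


def frequency_index(labels):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; ties are broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


user_ids = ["7", "1", "7", "3", "7", "3"]
mapping = frequency_index(user_ids)
print(mapping)   # {'7': 0.0, '3': 1.0, '1': 2.0}
```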



In [33]:
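# Select only the columns needed for training.
# Note: 'rating' is still a string column at this point (see the schema above),
# so it needs a cast to a numeric type before it can be used to fit ALS.
data_prep = encoded_training_data.select('userId_index', 'artistId_index', 'rating')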

# Drop any null values that may exist in the data
data_prep = data_prep.dropna()


In [34]:
data_prep.sample(0.1, seed=123).show()

+------------+--------------+------+
|userId_index|artistId_index|rating|
+------------+--------------+------+
|     26421.0|          35.0|   100|
|     26421.0|        1009.0|   100|
|     54599.0|         667.0|     0|
|     54599.0|         236.0|    90|
|     28233.0|         126.0|    90|
|     28233.0|        1754.0|    90|
|     28233.0|          47.0|    90|
|     28233.0|        2947.0|     0|
|     28233.0|         519.0|     0|
|     15619.0|          96.0|   100|
|     15619.0|          30.0|     0|
|     15619.0|         248.0|    90|
|     15619.0|          99.0|   100|
|     15619.0|          42.0|     0|
|     15619.0|         359.0|   100|
|     16121.0|          21.0|    70|
|     16121.0|         409.0|    50|
|     16121.0|           9.0|    50|
|     16121.0|          73.0|     0|
|     16121.0|          75.0|    70|
+------------+--------------+------+
only showing top 20 rows



**Split into training and test data**

In [35]:

# Split the data into training and test datasets
(train_data, test_data) = data_prep.randomSplit([0.8, 0.2], seed=123)

# Cache the training data in memory
train_data.cache()

DataFrame[userId_index: double, artistId_index: double, rating: string]

In [37]:
# Count the number of distinct users and artists in the full dataset (before the split)
num_users = encoded_training_data.select('userId').distinct().count()
num_artists = encoded_training_data.select('artistId').distinct().count()

# Print some statistics about the data
print('Number of training samples:', train_data.count())
print('Number of test samples:', test_data.count())
print('Number of distinct users:', num_users)
print('Number of distinct artists:', num_artists)

Number of training samples: 3503544
Number of test samples: 877510
Number of distinct users: 72845
Number of distinct artists: 23183
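
The counts above allow a quick sanity check. `randomSplit` assigns rows at random, so the 80/20 weights are met only approximately, and the user-artist matrix turns out to be very sparse. The arithmetic below uses only the printed numbers and assumes each (user, artist) pair is rated at most once.

```python
n_train = 3503544
n_test = 877510
n_users = 72845
n_artists = 23183

n_total = n_train + n_test
train_share = n_train / n_total
test_share = n_test / n_total

# Fraction of all possible (user, artist) pairs that actually have a rating.
density = n_total / (n_users * n_artists)

print(f"train share: {train_share:.4f}")
print(f"test share:  {test_share:.4f}")
print(f"density:     {density:.4%}")
```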


**Sample training data**

In [None]:
# Sample roughly 10% of the training data (without replacement) and display up to 20 rows
train_data.sample(False, 0.1, seed=123).show(20)

+--------+------------+------+------+----------+
|artistId|  artistName|userId|rating|rating_vec|
+--------+------------+------+------+----------+
| 1000004|'Til Tuesday| 18741|  20.0|    [20.0]|
| 1000004|'Til Tuesday| 18933|  60.0|    [60.0]|
| 1000004|'Til Tuesday| 19567|  19.0|    [19.0]|
| 1000004|'Til Tuesday| 19898|  60.0|    [60.0]|
| 1000004|'Til Tuesday| 21920|   0.0|     [0.0]|
| 1000004|'Til Tuesday| 24155|   0.0|     [0.0]|
| 1000004|'Til Tuesday| 24591|  90.0|    [90.0]|
| 1000004|'Til Tuesday|  2519|   0.0|     [0.0]|
| 1000004|'Til Tuesday|  2663|  90.0|    [90.0]|
| 1000004|'Til Tuesday|  2974|   0.0|     [0.0]|
| 1000004|'Til Tuesday| 30203|  50.0|    [50.0]|
| 1000004|'Til Tuesday| 32857|   0.0|     [0.0]|
| 1000004|'Til Tuesday| 34088|  50.0|    [50.0]|
| 1000004|'Til Tuesday|  4584|  30.0|    [30.0]|
| 1000004|'Til Tuesday|  5007|  30.0|    [30.0]|
| 1000004|'Til Tuesday|  7845|  80.0|    [80.0]|
| 1000006| .38 Special| 10294|  66.0|    [66.0]|
| 1000006| .38 Speci

**Sample test data**

In [38]:
# Sample roughly 10% of the test data (without replacement) and display up to 20 rows
test_data.sample(False, 0.1, seed=123).show(20)

+------------+--------------+------+
|userId_index|artistId_index|rating|
+------------+--------------+------+
|         0.0|         193.0|     0|
|         0.0|         195.0|     0|
|         0.0|         200.0|    30|
|         0.0|         205.0|    40|
|         0.0|         267.0|     0|
|         0.0|         307.0|    22|
|         0.0|         318.0|     0|
|         0.0|         374.0|    70|
|         0.0|         421.0|     0|
|         0.0|         503.0|    85|
|         0.0|         511.0|     0|
|         0.0|         566.0|    70|
|         0.0|         579.0|    12|
|         0.0|         762.0|     0|
|         0.0|         768.0|     0|
|         0.0|         797.0|    50|
|         0.0|         819.0|    50|
|         0.0|         829.0|    40|
|         0.0|         846.0|     0|
|         0.0|         872.0|     0|
+------------+--------------+------+
only showing top 20 rows



## Train model
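
The notebook imports `pyspark.ml.recommendation.ALS`, but no training cell is shown. As a hedged illustration of what alternating least squares does, the NumPy sketch below factorizes a tiny made-up rating matrix. It is a toy stand-in, not the Spark implementation: `als_fit`, the rank, the regularization value, and the data are all assumptions.

```python
import numpy as np


def als_fit(ratings, mask, rank=2, reg=0.1, iterations=20, seed=0):
    """Factorize ratings ~= U @ V.T using only the entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    eye = reg * np.eye(rank)

    for _ in range(iterations):
        # Fix V and solve a small ridge regression for each user.
        for u in range(n_users):
            seen = mask[u]
            if seen.any():
                Vs = V[seen]
                U[u] = np.linalg.solve(Vs.T @ Vs + eye, Vs.T @ ratings[u, seen])
        # Fix U and solve a small ridge regression for each item.
        for i in range(n_items):
            seen = mask[:, i]
            if seen.any():
                Us = U[seen]
                V[i] = np.linalg.solve(Us.T @ Us + eye, Us.T @ ratings[seen, i])
    return U, V


# Made-up normalized ratings (rows: users, columns: artists); NaN means unrated.
R = np.array([
    [1.0, 0.9, np.nan, 0.0],
    [0.9, np.nan, 0.1, 0.0],
    [0.0, 0.1, 1.0, np.nan],
    [np.nan, 0.0, 0.9, 1.0],
])
observed = ~np.isnan(R)
U, V = als_fit(np.nan_to_num(R), observed)

predictions = U @ V.T
train_rmse = np.sqrt(np.mean((predictions[observed] - R[observed]) ** 2))
print(round(float(train_rmse), 3))
```

Each half-step is an ordinary regularized least-squares problem, which is what makes the method easy to parallelize per user and per item.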

## Evaluate model
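
A common metric for explicit-rating recommenders is the root-mean-square error (RMSE) between predicted and held-out ratings. The sketch below computes it in plain Python on made-up numbers; the `rmse` helper and the values are hypothetical. In Spark, `pyspark.ml.evaluation.RegressionEvaluator` with `metricName='rmse'` offers the same metric.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up held-out ratings and model predictions on a 0-1 scale.
held_out = [1.0, 0.0, 0.5, 0.9]
predicted = [0.8, 0.1, 0.5, 1.0]
print(rmse(predicted, held_out))
```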

## New Recommendation
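
Once a model can score every artist for a user, producing recommendations amounts to picking the highest-scoring artists the user has not rated yet. The plain-Python sketch below shows that selection on made-up scores; `top_n` and the artist names are hypothetical. A fitted Spark `ALSModel` provides `recommendForAllUsers(numItems)` for the same purpose.

```python
import heapq


def top_n(scores, already_rated, n=3):
    """Return the n highest-scoring artists the user has not rated yet."""
    candidates = ((score, artist) for artist, score in scores.items()
                  if artist not in already_rated)
    return [artist for score, artist in heapq.nlargest(n, candidates)]


# Made-up predicted scores for one user, keyed by artist name.
predicted_scores = {
    "Artist A": 0.91,
    "Artist B": 0.40,
    "Artist C": 0.77,
    "Artist D": 0.95,
    "Artist E": 0.12,
}
rated = {"Artist D"}

print(top_n(predicted_scores, rated))   # ['Artist A', 'Artist C', 'Artist B']
```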