
1. Demonstrate how to load a dataset suitable for recommendation systems into a PySpark DataFrame.

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Recommendation_System').getOrCreate()
df = spark.read.json('movies.json')
df.show(5)

+-----------+----------+--------------------+--------------------+-----+--------------------+----------+--------------+
|helpfulness|product_id|        profile_name|              review|score|             summary|      time|       user_id|
+-----------+----------+--------------------+--------------------+-----+--------------------+----------+--------------+
|        7/7|B003AI2VGA|Brian E. Erland "...|Synopsis: On the ...|  3.0|"There Is So Much...|1182729600|A141HP4LYPWMSR|
|        4/4|B003AI2VGA|          Grady Harp|THE VIRGIN OF JUA...|  3.0|Worthwhile and Im...|1181952000|A328S9RN3U5M68|
|       8/10|B003AI2VGA|Chrissy K. McVay ...|The scenes in thi...|  5.0|This movie needed...|1164844800|A1I7QGUDP043DG|
|        1/1|B003AI2VGA|        golgotha.gov|THE VIRGIN OF JUA...|  3.0|distantly based o...|1197158400|A1M5405JH9THP9|
|        1/1|B003AI2VGA|KerrLines "&#34;M...|Informationally, ...|  3.0|"What's going on ...|1188345600| ATXL536YX71TR|
+-----------+----------+--------------------+--------------------+-----+--------------------+----------+--------------+
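
By default `spark.read.json` expects JSON Lines input, that is, one complete JSON object per line; a file holding a single multi-line JSON array needs the `multiLine` option instead. The sketch below uses only the Python standard library to show that layout. It assumes `movies.json` follows the default layout, uses a subset of the columns shown above with values copied from the first two output rows, and writes to a hypothetical file name `movies_sample.json`.

```python
import json
import os
import tempfile

# Two sample records using a subset of the columns shown above.
# Values are copied from the first two rows of the df.show(5) output.
sample_records = [
    {"helpfulness": "7/7", "product_id": "B003AI2VGA", "score": 3.0, "user_id": "A141HP4LYPWMSR"},
    {"helpfulness": "4/4", "product_id": "B003AI2VGA", "score": 3.0, "user_id": "A328S9RN3U5M68"},
]


def write_json_lines(records, path):
    """Write one JSON object per line, the layout spark.read.json reads by default."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_json_lines(path):
    """Parse a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical file name, written to a temporary directory for illustration.
sample_path = os.path.join(tempfile.mkdtemp(), "movies_sample.json")
write_json_lines(sample_records, sample_path)
loaded = read_json_lines(sample_path)
print(loaded[0]["user_id"], loaded[0]["score"])
```

A file written this way can be passed straight to `spark.read.json`, and Spark infers the column names and types from the keys and values.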

2. Implement a PySpark script that splits the data and trains a recommendation model.

In [13]:
from pyspark.sql.functions import col
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.feature import StringIndexer

# Keep only the columns ALS needs: user, item, and rating.
data = df.select(col('user_id'), col('product_id'), col('score'))
data.show(5)

# ALS requires numeric ids, so map the string ids to numeric indices.
user_indexer = StringIndexer(inputCol='user_id', outputCol='user_index').fit(data)
product_indexer = StringIndexer(inputCol='product_id', outputCol='product_index').fit(data)
print("Transformed dataframe")
data_indexed = user_indexer.transform(product_indexer.transform(data))
data_indexed.show(5)

# Rename the label column and hold out 20% of the rows for testing.
data_indexed = data_indexed.withColumnRenamed('score', 'rating')
train, test = data_indexed.randomSplit([0.8, 0.2])
print("train set")
train.show(5)
print("test set")
test.show(5)


+--------------+----------+-----+
|       user_id|product_id|score|
+--------------+----------+-----+
|A141HP4LYPWMSR|B003AI2VGA|  3.0|
|A328S9RN3U5M68|B003AI2VGA|  3.0|
|A1I7QGUDP043DG|B003AI2VGA|  5.0|
|A1M5405JH9THP9|B003AI2VGA|  3.0|
| ATXL536YX71TR|B003AI2VGA|  3.0|
+--------------+----------+-----+
only showing top 5 rows

Transformed dataframe
+--------------+----------+-----+-------------+----------+
|       user_id|product_id|score|product_index|user_index|
+--------------+----------+-----+-------------+----------+
|A141HP4LYPWMSR|B003AI2VGA|  3.0|        731.0|      32.0|
|A328S9RN3U5M68|B003AI2VGA|  3.0|        731.0|       3.0|
|A1I7QGUDP043DG|B003AI2VGA|  5.0|        731.0|     312.0|
|A1M5405JH9THP9|B003AI2VGA|  3.0|        731.0|   10917.0|
| ATXL536YX71TR|B003AI2VGA|  3.0|        731.0|     173.0|
+--------------+----------+-----+-------------+----------+
only showing top 5 rows

train set
+--------------+----------+------+-------------+----------+
|       user_id|produ
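
The `user_index` and `product_index` values in the transformed DataFrame come from `StringIndexer`, which by default orders labels by descending frequency, so the most frequent id receives index 0.0. The plain-Python sketch below approximates that ordering on hypothetical toy ids (not taken from `movies.json`). It is an illustration of the idea, not Spark's implementation, and the alphabetical tie-breaking is an assumption based on Spark 3.x behavior.

```python
from collections import Counter


def frequency_desc_index(labels):
    """Map each distinct label to a float index, most frequent label first.

    Ties are broken alphabetically (assumed to match Spark 3.x defaults).
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}


# Hypothetical toy ids, used only to illustrate the ordering.
user_ids = ["u_b", "u_a", "u_b", "u_c", "u_b", "u_a"]
user_index = frequency_desc_index(user_ids)
print(user_index)
```

Under this ordering a low index signals a frequently occurring id. In the output above, for example, `A328S9RN3U5M68` has `user_index` 3.0, which would place that reviewer among the most active in the dataset. Separately, `randomSplit` accepts an optional `seed` argument; passing one makes the train/test split repeatable across runs.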

3. Implement a PySpark script using the ALS algorithm for collaborative filtering.

In [14]:
# coldStartStrategy='drop' removes test rows whose user or item was not seen in training.
als = ALS(maxIter=5, regParam=0.01,
          userCol='user_index', itemCol='product_index', ratingCol='rating',
          coldStartStrategy='drop')
model = als.fit(train)
pred = model.transform(test)
print("Predictions")
pred.show(5)

Predictions
+--------------+----------+------+-------------+----------+-----------+
|       user_id|product_id|rating|product_index|user_index| prediction|
+--------------+----------+------+-------------+----------+-----------+
|A1C80B497LCYKA|6304239343|   5.0|        496.0|     586.0| -12.712129|
|A1WBXDI7LRPLXB|6304239343|   4.0|        496.0|    2868.0|-0.84937793|
|A2FS6OGMZMALTD|6304239343|   5.0|        496.0|    1430.0|  2.9538493|
|A1KF9NQLPWUF75|B003U6SJXQ|   5.0|        148.0|    1265.0|  0.8086522|
|A29XWXJ525HH7O|B003U6SJXQ|   2.0|        148.0|    3276.0| 0.59714407|
+--------------+----------+------+-------------+----------+-----------+
only showing top 5 rows
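
The `coldStartStrategy='drop'` setting matters because a random split can leave users or items in the test set that never appear in the training set. ALS has no learned factors for them, so their predictions would be NaN, and `'drop'` removes those rows from the output of `model.transform`. The plain-Python sketch below mimics that filtering on hypothetical `(user_index, product_index, rating)` triples. The helper name `split_cold_start` is illustrative, not part of PySpark.

```python
def split_cold_start(train_rows, test_rows):
    """Partition test rows by whether both user and item occur in the training data."""
    seen_users = {user for user, _item, _rating in train_rows}
    seen_items = {item for _user, item, _rating in train_rows}
    scorable, dropped = [], []
    for row in test_rows:
        user, item, _rating = row
        if user in seen_users and item in seen_items:
            scorable.append(row)  # the model has factors for both sides
        else:
            dropped.append(row)   # a cold-start row, removed under 'drop'
    return scorable, dropped


# Hypothetical (user_index, product_index, rating) triples.
train_rows = [(0, 10, 5.0), (0, 11, 3.0), (1, 10, 4.0)]
test_rows = [(1, 11, 2.0), (2, 10, 5.0), (0, 12, 1.0)]

scorable, dropped = split_cold_start(train_rows, test_rows)
print("scorable:", scorable)
print("dropped:", dropped)
```

Because dropped rows never reach the evaluator, a metric computed on the remaining predictions describes only users and items the model has already seen.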



4. Implement code to evaluate the performance of the recommendation model using appropriate metrics.

In [15]:
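# RMSE between the held-out ratings and the ALS predictions.
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating', predictionCol='prediction')
rmse = evaluator.evaluate(pred)
print(f"RMSE: {rmse}")

# Top-5 items for every user and top-5 users for every item (as indices, not original ids).
userRecs = model.recommendForAllUsers(5)
itemRecs = model.recommendForAllItems(5)

userRecs.show(5)
itemRecs.show(5)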

RMSE: 5.156665028572138
+----------+--------------------+
|user_index|     recommendations|
+----------+--------------------+
|         1|[{775, 19.085175}...|
|         3|[{858, 31.643036}...|
|         5|[{755, 30.92457},...|
|         6|[{557, 29.00822},...|
|         9|[{653, 33.269337}...|
+----------+--------------------+
only showing top 5 rows

+-------------+--------------------+
|product_index|     recommendations|
+-------------+--------------------+
|            1|[{159, 7.9197154}...|
|            3|[{174, 7.697837},...|
|            5|[{246, 12.1970005...|
|            6|[{395, 9.147764},...|
|            9|[{269, 9.744622},...|
+-------------+--------------------+
only showing top 5 rows
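
RMSE is the square root of the mean squared difference between each rating and its prediction, so a few predictions far outside the rating scale can dominate it. The sketch below recomputes RMSE by hand on the five prediction rows shown earlier, then again after clipping each prediction to a 1 to 5 range. Both the range and the clipping step are assumptions added for illustration: the notebook does not clip, and five rows are not the full test set, so neither number will match the reported RMSE.

```python
import math

# (rating, prediction) pairs copied from the five prediction rows shown earlier.
shown_rows = [
    (5.0, -12.712129),
    (4.0, -0.84937793),
    (5.0, 2.9538493),
    (5.0, 0.8086522),
    (2.0, 0.59714407),
]


def rmse(pairs):
    """Root mean squared error over (rating, prediction) pairs."""
    return math.sqrt(sum((rating - pred) ** 2 for rating, pred in pairs) / len(pairs))


def clip(value, low=1.0, high=5.0):
    """Clamp a prediction to the assumed valid rating range."""
    return max(low, min(high, value))


raw_rmse = rmse(shown_rows)
clipped_rmse = rmse([(rating, clip(pred)) for rating, pred in shown_rows])
print(f"raw RMSE on the five rows:     {raw_rmse:.3f}")
print(f"clipped RMSE on the five rows: {clipped_rmse:.3f}")
```

The gap between the two values shows how much of the error comes from out-of-range predictions such as -12.71. Predictions that far off usually suggest the model needs stronger regularization or more iterations, in addition to any post-processing.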

