# Association Rules on MovieLens 1M
This notebook mines association rules from the MovieLens 1M dataset.  
I keep only good ratings (> 3.0) and apply the FP-Growth algorithm (implemented by [PySpark](https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html#fp-growth)).  
The dataset is downloaded from https://grouplens.org/datasets/movielens/, extracted, and copied to the directory `/opt/spark/data`.

The resulting association rules are shown below.  

## Import libraries and initialize Spark context

In [1]:
import pyspark
import os
import socket

In [2]:
os.environ['PYSPARK_PYTHON'] = 'python3'
driver_host = socket.gethostbyname(socket.gethostname())

In [3]:
conf = pyspark.SparkConf()

conf.setMaster("k8s://https://kubernetes.default.svc.cluster.local:443") 

conf.set("spark.kubernetes.container.image", "gcr.io/spark-operator/spark-py:v2.4.5") 
conf.set("spark.kubernetes.authenticate.caCertFile", "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt")
conf.set("spark.kubernetes.authenticate.oauthTokenFile", "/var/run/secrets/kubernetes.io/serviceaccount/token")
conf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark") 
conf.set("spark.executor.instances", "2")
conf.set("spark.executor.memory", "1g")
conf.set("spark.kubernetes.pyspark.pythonVersion", "3")
conf.set("spark.driver.host", driver_host)
conf.set("spark.driver.port", "29413")

<pyspark.conf.SparkConf at 0x7fc6a5a19e90>

In [4]:
spark = pyspark.sql.SparkSession.builder.config(conf=conf).getOrCreate()

In [5]:
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import functions as F
from pyspark.sql import Row

## Prepare dataset

In [6]:
# Load rating data.
lines = spark.read.text("/opt/spark/data/ratings.dat").rdd
parts = lines.map(lambda row: row.value.split("::"))
ratingsRDD = parts.map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]),
                                     rating=float(p[2]), timestamp=int(p[3])))
ratings = spark.createDataFrame(ratingsRDD)
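
The parsing step above can be checked without a cluster. The following is a minimal plain-Python sketch of what the two `map` calls do to a single line. The `Rating` tuple and `parse_rating_line` are hypothetical names, and the raw input line is an assumption reconstructed from the first row of the output shown below.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    """One parsed row of ratings.dat (hypothetical helper type)."""
    user_id: int
    movie_id: int
    rating: float
    timestamp: int


def parse_rating_line(line: str) -> Rating:
    """Parse one 'UserID::MovieID::Rating::Timestamp' line."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))


# Assumed raw form of the first row: user 1 rated movie 1193 with 5 stars.
example = parse_rating_line("1::1193::5::978300760")
print(example)
```

Unpacking into exactly four names makes a malformed line fail loudly instead of producing a partially filled row.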

In [7]:
ratings.show(10)

+-------+------+---------+------+
|movieId|rating|timestamp|userId|
+-------+------+---------+------+
|   1193|   5.0|978300760|     1|
|    661|   3.0|978302109|     1|
|    914|   3.0|978301968|     1|
|   3408|   4.0|978300275|     1|
|   2355|   5.0|978824291|     1|
|   1197|   3.0|978302268|     1|
|   1287|   5.0|978302039|     1|
|   2804|   5.0|978300719|     1|
|    594|   4.0|978302268|     1|
|    919|   4.0|978301368|     1|
+-------+------+---------+------+
only showing top 10 rows



In [8]:
ratings.count()

1000209

In [9]:
# Keep only good ratings (> 3.0), then split them into training and test sets.
ratings = ratings.filter(ratings.rating > 3.0)
(training, test) = ratings.randomSplit([0.8, 0.2])
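
Because `randomSplit` is called without a seed here, every run produces a different split, so the counts and rules shown in this notebook will not reproduce exactly. `DataFrame.randomSplit` accepts an optional `seed` argument for that purpose. The sketch below illustrates the same filter-then-split idea in plain Python; `filter_and_split` and its parameters are hypothetical, and the sample rows are copied from the first ten ratings shown above.

```python
import random


def filter_and_split(rows, threshold=3.0, train_fraction=0.8, seed=42):
    """Keep rows rated above `threshold`, then split them reproducibly."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    good = [row for row in rows if row[2] > threshold]
    train, test = [], []
    for row in good:
        # Each row goes to the training set with probability `train_fraction`.
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


# (userId, movieId, rating) triples taken from the output above.
rows = [
    (1, 1193, 5.0), (1, 661, 3.0), (1, 914, 3.0), (1, 3408, 4.0),
    (1, 2355, 5.0), (1, 1197, 3.0), (1, 1287, 5.0), (1, 2804, 5.0),
    (1, 594, 4.0), (1, 919, 4.0),
]
train, test = filter_and_split(rows)
print(len(train), len(test))
```

Note that the split is made per rating, not per user, so the same user can appear in both sets with different movies.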

In [10]:
ratings.count()

575281

In [11]:
# Group the well-rated movies by user: one basket of movie IDs per user.
movies_rating = training.groupBy('userId').agg(F.collect_set('movieId').alias('movieIds'))
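
`collect_set` (rather than `collect_list`) matters here: as far as I know, Spark's FP-Growth rejects transactions that contain duplicate items. A minimal plain-Python sketch of the same grouping, with a hypothetical helper name and made-up input pairs, looks like this:

```python
from collections import defaultdict


def group_movies_by_user(pairs):
    """Build one set of movie IDs per user from (userId, movieId) pairs."""
    baskets = defaultdict(set)
    for user_id, movie_id in pairs:
        baskets[user_id].add(movie_id)  # a set silently drops duplicates
    return dict(baskets)


# Illustrative pairs; the repeated (1, 1193) is deduplicated.
pairs = [(1, 1193), (1, 3408), (2, 1193), (1, 1193)]
baskets = group_movies_by_user(pairs)
print(baskets)
```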

In [12]:
movies_rating.show(10)

+------+--------------------+
|userId|            movieIds|
+------+--------------------+
|    26|[356, 3152, 1678,...|
|    29|[610, 589, 1220, ...|
|   474|[466, 741, 785, 3...|
|   964|[1148, 356, 589, ...|
|  1677|[3005, 3615, 2882...|
|  1697|[2086, 1680, 3421...|
|  1806|[508, 356, 3450, ...|
|  1950|[2571, 589, 1199,...|
|  2040|[356, 2599, 1678,...|
|  2214|[356, 3101, 1191,...|
+------+--------------------+
only showing top 10 rows



In [13]:
movies_rating.count()

6038

## Fit an FP-Growth model

In [22]:
# Define FP-Growth model
fpGrowth = FPGrowth(itemsCol="movieIds", minSupport=0.05, minConfidence=0.5)
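
`minSupport` is a fraction of the transactions, so it helps to translate it into an absolute count. The sketch below assumes the threshold is rounded up, which is how I understand Spark's implementation to behave; `min_support_count` is a hypothetical helper, and 6038 is the number of user baskets counted above.

```python
import math


def min_support_count(min_support: float, num_transactions: int) -> int:
    """Smallest number of baskets an itemset needs to count as frequent."""
    return math.ceil(min_support * num_transactions)


# 5% of the 6038 training baskets.
threshold = min_support_count(0.05, 6038)
print(threshold)
```

Under that assumption an itemset must occur in at least 302 baskets, which is consistent with the frequencies reported by the fitted model in this run.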

In [23]:
# Fit the model
model = fpGrowth.fit(movies_rating)

In [24]:
# Display frequent itemsets.
model.freqItemsets.show()

# Display generated association rules.
model.associationRules.show()

+-----------------+----+
|            items|freq|
+-----------------+----+
|           [1246]| 517|
|           [2951]| 332|
|           [2858]|2262|
|            [260]|2115|
|      [260, 2858]| 850|
|           [2081]| 517|
|           [3360]| 332|
|           [1196]|2001|
|      [1196, 260]|1271|
|[1196, 260, 2858]| 538|
|     [1196, 2858]| 842|
|           [2908]| 517|
|     [2908, 2858]| 318|
|           [3082]| 330|
|           [2140]| 330|
|            [593]|1840|
|      [593, 1196]| 802|
| [593, 1196, 260]| 514|
|[593, 1196, 2858]| 437|
|       [593, 260]| 793|
+-----------------+----+
only showing top 20 rows

+-----------------+----------+------------------+------------------+
|       antecedent|consequent|        confidence|              lift|
+-----------------+----------+------------------+------------------+
|       [110, 858]|    [1221]|0.6028119507908611|  3.07933888229714|
|       [110, 858]|    [1196]|0.6467486818980668|1.9515584914045614|
|       [110, 858]|    [1198]

In [25]:
model.associationRules.count()

4163
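
The `confidence` and `lift` columns can be recomputed from the frequent-itemset counts. The sketch below uses the standard definitions, confidence = freq(A ∪ C) / freq(A) and lift = confidence / support(C), with frequencies copied from the `freqItemsets` output above and the 6038 training baskets. The helper names are hypothetical, and the numbers belong to this particular random split.

```python
def confidence(freq_both: int, freq_antecedent: int) -> float:
    """Share of baskets containing the antecedent that also contain the consequent."""
    return freq_both / freq_antecedent


def lift(conf: float, freq_consequent: int, num_transactions: int) -> float:
    """Confidence divided by the overall support of the consequent."""
    return conf / (freq_consequent / num_transactions)


# Candidate rule [1196] -> [260], using the frequencies shown above.
n = 6038
conf_rule = confidence(freq_both=1271, freq_antecedent=2001)
lift_rule = lift(conf_rule, freq_consequent=2115, num_transactions=n)
print(round(conf_rule, 3), round(lift_rule, 3))
```

A lift above 1 means the two movies are liked together more often than independence would predict.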

## Display association rules on the test set

In [26]:
# Group the test-set movies by user.
test_movies_rating = test.groupBy('userId').agg(F.collect_set('movieId').alias('movieIds'))

In [27]:
# transform() checks each user's items against all association rules and collects
# the consequents of the matching rules as the prediction.
test_association_rules = model.transform(test_movies_rating)
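
To make the matching logic concrete, here is a minimal plain-Python sketch of what `transform` is documented to do: a rule applies when the basket contains every antecedent item, and its consequents are added to the prediction unless the basket already contains them. `predict` is a hypothetical helper, not Spark's code. The first two rules are taken from the output above, and the third is made up for illustration.

```python
def predict(items, rules):
    """Collect consequents of all rules whose antecedent is a subset of `items`."""
    basket = set(items)
    prediction = []
    for antecedent, consequent in rules:
        if set(antecedent) <= basket:  # the rule is applicable
            for item in consequent:
                # Skip items the user already has and avoid duplicates.
                if item not in basket and item not in prediction:
                    prediction.append(item)
    return prediction


rules = [
    ([110, 858], [1221]),
    ([110, 858], [1196]),
    ([593], [260]),  # hypothetical rule, for illustration only
]
result = predict([110, 858, 1196], rules)
print(result)
```

In this example only movie 1221 is predicted: 1196 is already in the basket, and the third rule does not apply because 593 is missing.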

In [28]:
test_association_rules.show()

+------+--------------------+--------------------+
|userId|            movieIds|          prediction|
+------+--------------------+--------------------+
|    26|[1917, 45, 168, 2...|[260, 318, 2028, ...|
|    29|[288, 1036, 1799,...|[1198, 1240, 1291...|
|   474|[661, 2391, 2, 48...|[589, 1196, 260, ...|
|   964|[2599, 223, 597, ...|   [2858, 1196, 260]|
|  1677|[2683, 2716, 2706...|   [1196, 260, 2858]|
|  1697|[110, 356, 2571, ...|[1196, 1198, 1240...|
|  1806|[590, 3247, 1396,...|[318, 2028, 527, ...|
|  1950|[661, 799, 2916, ...|[260, 1198, 2571,...|
|  2040|[3481, 2797, 2881...|[1196, 1270, 260,...|
|  2214|[509, 249, 785, 9...|[2858, 1196, 1198...|
|  2250|[3481, 589, 2959,...|[2858, 2997, 260,...|
|  2453|[1953, 3783, 866,...|[1580, 589, 1196,...|
|  2509|[527, 1200, 2332,...|[589, 1198, 1196,...|
|  2529|[3398, 3479, 1018...|[260, 2858, 1198,...|
|  2927|   [590, 1259, 2671]|[260, 1196, 1198,...|
|  3091|[3499, 541, 1035,...|[1196, 260, 2571,...|
|  3506|[1079, 589, 2599,...|[2

In [29]:
# Load movies data
lines = spark.read.text("/opt/spark/data/movies.dat").rdd
parts = lines.map(lambda row: row.value.split("::"))
moviesRDD = parts.map(lambda p: Row(MovieID=int(p[0]), Title=p[1], Genres=p[2]))
movies = spark.createDataFrame(moviesRDD)

In [32]:
movies.show(10, truncate=False)

+----------------------------+-------+----------------------------------+
|Genres                      |MovieID|Title                             |
+----------------------------+-------+----------------------------------+
|Animation|Children's|Comedy |1      |Toy Story (1995)                  |
|Adventure|Children's|Fantasy|2      |Jumanji (1995)                    |
|Comedy|Romance              |3      |Grumpier Old Men (1995)           |
|Comedy|Drama                |4      |Waiting to Exhale (1995)          |
|Comedy                      |5      |Father of the Bride Part II (1995)|
|Action|Crime|Thriller       |6      |Heat (1995)                       |
|Comedy|Romance              |7      |Sabrina (1995)                    |
|Adventure|Children's        |8      |Tom and Huck (1995)               |
|Action                      |9      |Sudden Death (1995)               |
|Action|Adventure|Thriller   |10     |GoldenEye (1995)                  |
+----------------------------+-------+----------------------------------+

In [33]:
user_id = 474

In [34]:
recommendation = test_association_rules.filter(test_association_rules['userId'] == user_id).collect()
recommendation

[Row(userId=474, movieIds=[661, 2391, 2, 482, 2791, 2987, 2428, 3272, 2329, 3266, 3948, 551, 1517, 3543, 2499, 3255, 1049, 3256, 1884, 1732, 593, 492, 3555, 1836, 3535, 333, 1351, 2571], prediction=[589, 1196, 260, 2858, 296, 1198, 2028, 2762, 608, 318, 50, 1610, 110, 457, 480])]

In [38]:
print('User {}: Preferred movies:'.format(user_id))
movies.filter(movies.MovieID.isin(recommendation[0]['movieIds'])).show(30, truncate=False)

User 474: Preferred movies:
+-----------------------------+-------+-----------------------------------------------------+
|Genres                       |MovieID|Title                                                |
+-----------------------------+-------+-----------------------------------------------------+
|Adventure|Children's|Fantasy |2      |Jumanji (1995)                                       |
|Comedy                       |333    |Tommy Boy (1995)                                     |
|Thriller                     |482    |Killing Zoe (1994)                                   |
|Comedy|Mystery               |492    |Manhattan Murder Mystery (1993)                      |
|Children's|Comedy|Musical    |551    |Nightmare Before Christmas, The (1993)               |
|Drama|Thriller               |593    |Silence of the Lambs, The (1991)                     |
|Animation|Children's|Musical |661    |James and the Giant Peach (1996)                     |
|Action|Adventure             |

In [40]:
print('User {}: Recommended movies:'.format(user_id))
movies.filter(movies.MovieID.isin(recommendation[0]['prediction'])).show(truncate=False)

User 474: Recommended movies:
+---------------------------------+-------+-----------------------------------------------------+
|Genres                           |MovieID|Title                                                |
+---------------------------------+-------+-----------------------------------------------------+
|Crime|Thriller                   |50     |Usual Suspects, The (1995)                           |
|Action|Drama|War                 |110    |Braveheart (1995)                                    |
|Action|Adventure|Fantasy|Sci-Fi  |260    |Star Wars: Episode IV - A New Hope (1977)            |
|Crime|Drama                      |296    |Pulp Fiction (1994)                                  |
|Drama                            |318    |Shawshank Redemption, The (1994)                     |
|Action|Thriller                  |457    |Fugitive, The (1993)                                 |
|Action|Adventure|Sci-Fi          |480    |Jurassic Park (1993)                      
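
Filtering with `isin` returns the movies in the order of the `movies` DataFrame, not in the order of the `prediction` array. If the original order matters, one option is a dictionary lookup. The sketch below is plain Python; `titles_in_order` is a hypothetical helper, and the small title dictionary contains only titles that appear in this notebook's outputs.

```python
def titles_in_order(movie_ids, titles, missing="<title not listed>"):
    """Pair each movie ID with its title, keeping the order of `movie_ids`."""
    return [(movie_id, titles.get(movie_id, missing)) for movie_id in movie_ids]


# A few titles copied from the tables in this notebook.
titles = {
    260: "Star Wars: Episode IV - A New Hope (1977)",
    296: "Pulp Fiction (1994)",
    589: "Terminator 2: Judgment Day (1991)",
}

# First five predicted IDs for user 474, in their original order.
ordered = titles_in_order([589, 1196, 260, 2858, 296], titles)
for movie_id, title in ordered:
    print(movie_id, title)
```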

In [41]:
user_id = 1806

In [42]:
recommendation = test_association_rules.filter(test_association_rules['userId'] == user_id).collect()
recommendation

[Row(userId=1806, movieIds=[590, 3247, 1396, 2006, 3624, 329, 2384, 3452, 1042, 2762, 2987, 2683, 150, 252, 448, 986, 500, 3745, 2709, 553, 2273, 168, 648, 1411, 344, 802, 671, 163, 317, 368, 2393, 1610, 1784, 3301, 515, 587, 3354, 3528, 457], prediction=[318, 2028, 527, 593, 589, 1196, 1198, 260, 2571, 2858, 480, 110])]

In [44]:
print('User {}: Preferred movies:'.format(user_id))
movies.filter(movies.MovieID.isin(recommendation[0]['movieIds'])).show(40, truncate=False)

User 1806: Preferred movies:
+------------------------------+-------+----------------------------------------------+
|Genres                        |MovieID|Title                                         |
+------------------------------+-------+----------------------------------------------+
|Drama                         |150    |Apollo 13 (1995)                              |
|Action|Romance|Thriller       |163    |Desperado (1995)                              |
|Action|Adventure|Drama|Romance|168    |First Knight (1995)                           |
|Comedy|Romance                |252    |I.Q. (1994)                                   |
|Children's|Comedy|Fantasy     |317    |Santa Clause, The (1994)                      |
|Action|Adventure|Sci-Fi       |329    |Star Trek: Generations (1994)                 |
|Comedy                        |344    |Ace Ventura: Pet Detective (1994)             |
|Action|Comedy|Western         |368    |Maverick (1994)                               |
|D

In [45]:
print('User {}: Recommended movies:'.format(user_id))
movies.filter(movies.MovieID.isin(recommendation[0]['prediction'])).show(truncate=False)

User 1806: Recommended movies:
+---------------------------------+-------+-----------------------------------------------------+
|Genres                           |MovieID|Title                                                |
+---------------------------------+-------+-----------------------------------------------------+
|Action|Drama|War                 |110    |Braveheart (1995)                                    |
|Action|Adventure|Fantasy|Sci-Fi  |260    |Star Wars: Episode IV - A New Hope (1977)            |
|Drama                            |318    |Shawshank Redemption, The (1994)                     |
|Action|Adventure|Sci-Fi          |480    |Jurassic Park (1993)                                 |
|Drama|War                        |527    |Schindler's List (1993)                              |
|Action|Sci-Fi|Thriller           |589    |Terminator 2: Judgment Day (1991)                    |
|Drama|Thriller                   |593    |Silence of the Lambs, The (1991)         