<a href="https://colab.research.google.com/github/SandeepMLDLNPL/Machine_Learning_Models/blob/main/Recommender_System_using_ALS_pyspark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install pyspark

Dataset link:
https://sites.google.com/site/limkwanhui/datacode

In [None]:
!pip install findspark

In [112]:
import findspark
findspark.init()

In [113]:
import pyspark
from pyspark.sql import SparkSession

In [114]:
spark = SparkSession.builder.appName('Recommender_System').getOrCreate()

In [115]:
poi_df = spark.read.csv('/content/drive/MyDrive/data-sigir17/poiList-sigir17', header=True, inferSchema=True, sep=';')

In [116]:
poi_df.show(5)

+-----+--------------------+---------+-----------+------------+-------------+--------------+------+------+
|poiID|             poiName|      lat|       long|rideDuration|        theme|        theme2|theme3|theme4|
+-----+--------------------+---------+-----------+------------+-------------+--------------+------+------+
|    1| Gadget's Go Coaster|33.810259|-117.918438|         1.0|       Kiddie|Roller Coaster|  null|  null|
|    2|       Astro Orbitor|28.418532| -81.579153|         1.5|Spinning Ride|          null|  null|  null|
|    3|       Mad Tea Party|33.813458|-117.918289|         1.5|       Family| Spinning Ride|  null|  null|
|    4|Dumbo the Flying ...| 33.81368|-117.918928|        1.67|       Family| Spinning Ride|  null|  null|
|    5|Mr. Toad's Wild Ride|33.813311|-117.918697|         2.0|         Dark|          Ride|  null|Indoor|
+-----+--------------------+---------+-----------+------------+-------------+--------------+------+------+
only showing top 5 rows



In [117]:
visits_df = spark.read.csv('/content/drive/MyDrive/data-sigir17/userVisits-sigir17',header=True,inferSchema=True, sep=';')


In [118]:
visits_df.show(10)

+-----------+-------------+----------+-----+--------+-------+------------+-----+
|         id|         nsid| takenUnix|poiID|poiTheme|poiFreq|rideDuration|seqID|
+-----------+-------------+----------+-----+--------+-------+------------+-----+
| 5858403310| 10004778@N07|1308262550|    6|    Ride|   1665|       120.0|    1|
| 5857850631| 10004778@N07|1308270702|   26|  Family|  18710|       900.0|    1|
| 5858399220| 10004778@N07|1308631356|    6|    Ride|   1665|       120.0|    2|
| 8277294024| 10004778@N07|1355568624|   26|  Family|  18710|       900.0|    3|
| 9219062165| 10004778@N07|1373030964|   29|   Water|  10427|       900.0|    4|
| 5286317993| 10024109@N08|1283735402|   26|  Family|  18710|       900.0|    5|
| 5286320839| 10024109@N08|1283735452|   26|  Family|  18710|       900.0|    5|
| 5286923898| 10024109@N08|1283745187|   26|  Family|  18710|       900.0|    5|
| 5286326049| 10024109@N08|1283753756|   26|  Family|  18710|       900.0|    5|
|14979055621|100373287@N02|1

In [119]:
sample_df = visits_df.limit(1000).toPandas()
sample_df.describe()

          poiID      poiFreq  rideDuration      seqID
count    1000.0       1000.0        1000.0     1000.0
mean     20.785     7764.625      625.2606     50.349
std    8.138243  6233.964628    324.947216  20.027838
min         1.0        580.0          60.0        1.0
25%        15.0       2757.0         270.0       33.0
50%        23.0       4082.0         600.0       56.0
75%        28.0      16366.0         900.0       69.0
max        31.0      18710.0        1500.0       73.0


We need enough entries per user to have sufficient information to make
predictions for them. More importantly, we need to know whether users visit
several different attractions, not just how many entries they have in total.
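
The filtering idea above can be sketched in plain Python so that it runs without a Spark session. This is only an illustration: the toy visit pairs, the function name `users_with_min_distinct_pois`, and the threshold of 2 are assumptions, not part of the original notebook.

```python
from collections import defaultdict

# Toy (nsid, poiID) visit pairs; the values are made up for illustration.
toy_visits = [
    ("user_a", 6), ("user_a", 26), ("user_a", 6), ("user_a", 29),
    ("user_b", 26), ("user_b", 26),
    ("user_c", 1), ("user_c", 2),
]

def users_with_min_distinct_pois(visits, min_pois):
    """Return the users who visited at least `min_pois` different attractions."""
    pois_per_user = defaultdict(set)
    for user, poi in visits:
        pois_per_user[user].add(poi)  # repeated visits to one POI count once
    return sorted(u for u, pois in pois_per_user.items() if len(pois) >= min_pois)

active_users = users_with_min_distinct_pois(toy_visits, min_pois=2)
print(active_users)  # ['user_a', 'user_c']
```

Against the real data, the same filter could presumably be expressed in Spark SQL by grouping on nsid with a `having count(distinct poiID) >= 2` clause.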

In [120]:
poi_df.createOrReplaceTempView('points')
visits_df.createOrReplaceTempView('visits')

With the temporary views registered, we can run SQL queries, such as finding
the number of unique attractions:

In [121]:
spark.sql('select distinct poiID from visits').count()

32

In [122]:
spark.sql('select nsid,count(distinct poiID) as cnt from visits group by nsid ').describe().show()

+-------+--------------------+------------------+
|summary|                nsid|               cnt|
+-------+--------------------+------------------+
|  count|                8904|              8905|
|   mean|                null|4.8591802358225715|
| stddev|                null| 5.965359459316309|
|    min| (ii) then mapped...|                 0|
|    max|        99987318@N03|                31|
+-------+--------------------+------------------+



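The preceding SQL command finds the number of distinct attractions each user visits. The
describe operation then computes summary statistics over these per-user counts, which tell
us that, on average, users visit about five different locations (mean 4.86). The minimum
nsid value, `(ii) then mapped...`, and the minimum count of 0 suggest that the file also
contains a few non-data lines that would be worth filtering out.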

In [123]:
spark.sql('select nsid,poiID,count(*) from visits group by nsid,poiID ').describe().show()

+-------+--------------------+------------------+-----------------+
|summary|                nsid|             poiID|         count(1)|
+-------+--------------------+------------------+-----------------+
|  count|               43272|             43271|            43273|
|   mean|                null|14.920061935245315| 7.67492431770388|
| stddev|                null| 8.437883931275127|52.92985818263355|
|    min| (ii) then mapped...|                 1|                1|
|    max|        99987318@N03|                31|             4128|
+-------+--------------------+------------------+-----------------+



The SQL command counts the number of entries (photos) for each user and attraction, and
describe then summarizes those counts. On average, each user takes about eight pictures
(mean 7.67) at each location they visit.
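
One caveat about the average: in the summary above the standard deviation (about 53) is much larger than the mean, and the maximum is 4128, which suggests a heavy tail. The sketch below uses made-up counts (not taken from the dataset) to show how a single heavy photographer can pull the mean far above the median.

```python
import statistics

# Made-up photo counts per (user, attraction) pair: mostly small, one outlier.
toy_counts = [1, 1, 2, 2, 3, 3, 4, 200]

mean_count = statistics.mean(toy_counts)      # pulled upward by the outlier
median_count = statistics.median(toy_counts)  # robust to the outlier

print(mean_count, median_count)  # 27 2.5
```

On the real data, a median could be estimated with Spark SQL's `percentile_approx` function; whether the median is the better summary here is a judgment call.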


## Training the model


To train our model, we will construct a dataset that computes the number of photos taken
by each user at each location:

In [135]:
train_df = spark.sql('select hash(nsid) as user_hash_id, hash(poiID) as poi_hash_id, count(*) as \
pictures_taken from visits group by 1,2')


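We hash the user ID because the ALS trainer only supports numeric (integer) user and item IDs, and nsid is a string. poiID is already an integer, so hashing it is not strictly necessary.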
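
A hash is not reversible, and two different strings can in principle collide on the same 32-bit value. An alternative is to assign each distinct ID a consecutive integer and keep the reverse lookup. The sketch below shows that idea in plain Python; `build_index` is a hypothetical helper, and the three nsid values are taken from the sample output shown earlier. Spark ML offers `StringIndexer` for the same purpose on DataFrames.

```python
def build_index(values):
    """Map each distinct value to a consecutive integer, in first-seen order."""
    index = {}
    for value in values:
        if value not in index:
            index[value] = len(index)
    return index

# A few nsid values as they appear in the visits sample, with one repeat.
toy_nsids = ["10004778@N07", "10024109@N08", "10004778@N07", "100373287@N02"]

user_to_int = build_index(toy_nsids)
int_to_user = {i: nsid for nsid, i in user_to_int.items()}  # reverse lookup

print(user_to_int)
```

With the reverse lookup, integer IDs in the model output can be translated back to the original user IDs, which is not possible with a bare hash.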

To train the model, we simply construct an instance of ALS and provide the user column, the item column (in this case, the hashed attraction IDs), and the rating column (in this case, pictures_taken, used as a proxy for rating). coldStartStrategy is set to "drop" because we are not interested in predictions for users or attractions that are not present in the training data; such predictions are dropped rather than returned as NaN:

In [136]:
from pyspark.ml.recommendation import ALS
recommender = ALS(userCol="user_hash_id",
 itemCol="poi_hash_id",
 ratingCol="pictures_taken",
 coldStartStrategy="drop")
model = recommender.fit(train_df)
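
For reference, the objective that explicit-feedback ALS minimizes can be written roughly as follows. The notation is an assumption introduced here, not taken from the notebook: $r_{ui}$ is pictures_taken for user $u$ and attraction $i$, $\mathcal{K}$ is the set of observed (user, attraction) pairs, $x_u$ and $y_i$ are the latent factor vectors, $\lambda$ corresponds to the `regParam` parameter, and $n_u$, $n_i$ are the numbers of ratings for user $u$ and item $i$ (Spark's documentation describes scaling the regularization by these counts).

$$
\min_{X,\,Y} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - x_u^{\top} y_i \right)^2
+ \lambda \left( \sum_{u} n_u \lVert x_u \rVert^2 + \sum_{i} n_i \lVert y_i \rVert^2 \right)
$$

The "alternating" part refers to the solver: with $Y$ held fixed, each $x_u$ is the solution of a small regularized least-squares problem, and likewise for each $y_i$ with $X$ held fixed; the two steps are repeated for a fixed number of iterations.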


In [139]:
# show() returns None, so keep the DataFrame in its own variable first
recommendations = model.recommendForAllUsers(10)
recommendations.show(10)

+------------+--------------------+
|user_hash_id|     recommendations|
+------------+--------------------+
| -2147481344|[{-132918897, 3.9...|
| -2146859726|[{-132918897, 9.4...|
| -2144420948|[{-1721654386, 13...|
| -2144286583|[{-554124381, 0.9...|
| -2143637621|[{972445202, 10.1...|
| -2142919823|[{-768484170, 2.9...|
| -2142858516|[{972445202, 46.7...|
| -2142523578|[{-1223696181, 39...|
| -2142192636|[{-554124381, 32....|
| -2141088717|[{-768484170, 2.5...|
+------------+--------------------+
only showing top 10 rows



In [145]:
row_list = spark.sql('select distinct p.poiName, p.poiID from visits v join \
points p on (p.poiID=v.poiID) ').collect()
id_to_poi_name = dict(map(lambda x: (x.poiID, x.poiName), row_list))
row_list[1:5]


[Row(poiName='O Canada!', poiID=10),
 Row(poiName='Big Thunder Mountain Railroad', poiID=19),
 Row(poiName='Turtle Talk with Crush', poiID=22),
 Row(poiName='The Many Adventures of Winnie the Pooh', poiID=9)]

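Note that a poiID can match more than one name in the points table (poiID 10, for
example, appears with different names in the two outputs), in which case the dictionary
keeps only the last name collected. With that caveat, we construct a dictionary of IDs to
attraction names (point-of-interest names) by collecting the result of a query that finds
the name of each visited attraction in the points table: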

In [149]:
row_list = spark.sql('select distinct p.poiName,p.poiID from visits v join \
                       points p on (p.poiID = v.poiID)').collect()
id_to_poiName = dict(map(lambda x: (x.poiID,x.poiName),row_list))
id_to_poiName

{1: 'Test Track',
 10: 'Golden Zephyr',
 19: "Tarzan's Treehouse",
 22: 'Country Bear Jamboree',
 9: "Pinocchio's Daring Journey",
 21: 'Red Car Trolley & News Boys',
 13: 'Haunted Mansion',
 26: 'Sleeping Beauty Castle Walkthrough',
 8: 'The Great Movie Ride',
 12: "It's A Small World",
 20: 'Splash Mountain',
 29: 'Pirates of the Caribbean',
 16: 'Buzz Lightyear Astro Blasters',
 25: "It's A Small World",
 14: 'The Many Adventures of Winnie the Pooh',
 11: "California Screamin'",
 15: 'The Twilight Zone Tower of Terror',
 3: "Soarin'",
 4: 'Journey Into Imagination With Figment',
 24: 'Jungle Cruise',
 2: 'Astro Orbiter',
 5: 'Silly Symphony Swings',
 6: "Snow White's Scary Adventures",
 30: 'Mark Twain Riverboat',
 28: 'Main Street Cinema',
 7: 'Voyage of The Little Mermaid',
 23: 'Redwood Creek Challenge Trail',
 18: 'Tom Sawyer Island',
 27: 'Walt Disney World Railroad',
 17: 'Rose & Crown Pub Musician',
 31: 'Fantasmic!'}
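
Because the model was trained on hash(poiID), the recommendations contain poi_hash_id values rather than poiID values, so the dictionary above cannot be applied to them directly. The sketch below assumes a second lookup from hash to poiID is available (it could presumably be collected with a query such as `select distinct poiID, hash(poiID) as poi_hash_id from visits`). The function name, the hash values, and the scores are all hypothetical; only the two attraction names come from the dictionary above.

```python
def name_recommendations(recs, hash_to_poi_id, id_to_name):
    """Turn (poi_hash_id, score) pairs into (attraction name, score) pairs."""
    named = []
    for poi_hash_id, score in recs:
        poi_id = hash_to_poi_id.get(poi_hash_id)
        # Fall back to a placeholder when an ID cannot be resolved.
        name = id_to_name.get(poi_id, f"<unknown {poi_hash_id}>")
        named.append((name, score))
    return named

# Hypothetical values: the hashes and scores below are made up.
toy_hash_to_poi_id = {111: 20, 222: 29}
toy_id_to_name = {20: "Splash Mountain", 29: "Pirates of the Caribbean"}
toy_recs = [(111, 9.4), (222, 3.9), (333, 1.2)]

named = name_recommendations(toy_recs, toy_hash_to_poi_id, toy_id_to_name)
print(named)
```

In a real run, `recs` would come from one row of the `recommendations` column, where each element is a struct holding a poi_hash_id and a rating.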