## Movie Recommendation

## Motivation
### In this notebook, we use the Alternating Least Squares (ALS) algorithm with Spark APIs to predict ratings for the movies in the [MovieLens small dataset](https://grouplens.org/datasets/movielens/latest/).

## Step 1: Data ETL and Data Exploration Analysis
### Get to know the datasets and review their general information (columns, dtypes, and missing values).

## Step 2: Training the ALS recommendation Model
### Used cross-validation to find the best hyperparameters for the model, with RMSE as the metric for comparing candidates. Here are the metrics:
#### RMSE = 0.8756757167936443
#### Rank: 8
#### MaxIter: 15
#### RegParam: 0.1

## Step 3: Use the best model to recommend the top 10 movies, separately by user and by item.

## Step 4: Conclusion
### 1. Item-based recommendation is more robust than user-based recommendation, because users are more complex than items: a user can have varied tastes, and those tastes can change over time.
### 2. When recommending similar movies, I used cosine similarity on the ALS item features. It seems to work better than KNN, which tends to recommend the most popular movies to everyone.


In [None]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)
pd.set_option('max_colwidth',100)




In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
links = pd.read_csv("/content/drive/My Drive/Data/links.csv")
movies = pd.read_csv('/content/drive/My Drive/Data/movies.csv')
ratings = pd.read_csv('/content/drive/My Drive/Data/ratings.csv')
tags = pd.read_csv('/content/drive/My Drive/Data/tags.csv')

In [None]:
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [None]:
links.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   imdbId   9742 non-null   int64  
 2   tmdbId   9734 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 228.5 KB


In [None]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [None]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [None]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [None]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [None]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [None]:
tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


In [None]:
ratings['date'] = pd.to_datetime(ratings['timestamp'], unit = 's')
ratings['rating date'] = ratings['date'].dt.date

In [None]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,date,rating date
0,1,1,4.0,964982703,2000-07-30 18:45:03,2000-07-30
1,1,3,4.0,964981247,2000-07-30 18:20:47,2000-07-30
2,1,6,4.0,964982224,2000-07-30 18:37:04,2000-07-30
3,1,47,5.0,964983815,2000-07-30 19:03:35,2000-07-30
4,1,50,5.0,964982931,2000-07-30 18:48:51,2000-07-30
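The timestamp conversion above can be checked in isolation. The sketch below uses a toy frame (the name `sample` is hypothetical) with the first two timestamps from `ratings.head()`; `unit='s'` treats the integers as seconds since the Unix epoch in UTC.

```python
import pandas as pd

# Toy frame; the two timestamps are the first two values shown in ratings.head()
sample = pd.DataFrame({"timestamp": [964982703, 964981247]})

# unit="s" interprets the integers as seconds since 1970-01-01 00:00:00 UTC
sample["date"] = pd.to_datetime(sample["timestamp"], unit="s")
# .dt.date drops the time of day and keeps plain datetime.date objects
sample["rating date"] = sample["date"].dt.date

print(sample)
```

The resulting `date` column is timezone-naive, so any per-day grouping built on `rating date` follows UTC days rather than each user's local time.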


In [None]:
tags['date'] = pd.to_datetime(tags['timestamp'], unit = 's')
tags['tag date'] = tags['date'].dt.date

In [None]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp,date,tag date
0,2,60756,funny,1445714994,2015-10-24 19:29:54,2015-10-24
1,2,60756,Highly quotable,1445714996,2015-10-24 19:29:56,2015-10-24
2,2,60756,will ferrell,1445714992,2015-10-24 19:29:52,2015-10-24
3,2,89774,Boxing story,1445715207,2015-10-24 19:33:27,2015-10-24
4,2,89774,MMA,1445715200,2015-10-24 19:33:20,2015-10-24


In [None]:
ratings_movies = pd.merge(movies, ratings, how = 'left', on = 'movieId')
combined = pd.merge(ratings_movies, tags, how = 'left', on = ['userId', 'movieId'])

In [None]:
combined.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp_x,date_x,rating date,tag,timestamp_y,date_y,tag date
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,964982703.0,2000-07-30 18:45:03,2000-07-30,,,NaT,
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,847434962.0,1996-11-08 06:36:02,1996-11-08,,,NaT,
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7.0,4.5,1106635946.0,2005-01-25 06:52:26,2005-01-25,,,NaT,
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15.0,2.5,1510577970.0,2017-11-13 12:59:30,2017-11-13,,,NaT,
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17.0,4.5,1305696483.0,2011-05-18 05:28:03,2011-05-18,,,NaT,


# Part 1: Data ETL and Data Exploration

In [None]:
#combined = combined.sort_values(by = 'userId').reset_index()
data = combined.drop([ 'timestamp_x', 'date_x','timestamp_y', 'date_y'], axis = 1)

In [None]:
data.head()

Unnamed: 0,movieId,title,genres,userId,rating,rating date,tag,tag date
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,2000-07-30,,
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,1996-11-08,,
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7.0,4.5,2005-01-25,,
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15.0,2.5,2017-11-13,,
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17.0,4.5,2011-05-18,,


In [None]:
data.shape

(102695, 8)
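The 102695 rows above exceed the 100836 rows in `ratings`. One likely reason, which is an assumption here rather than something checked in the notebook, is that the left merge with `tags` repeats a rating row once for every tag the same user gave the same movie. The toy frames below (all names ending in `_toy` are hypothetical) reproduce the effect.

```python
import pandas as pd

# Two movies, one rating, and two tags by the same user on the same movie
movies_toy = pd.DataFrame({"movieId": [1, 2], "title": ["A", "B"]})
ratings_toy = pd.DataFrame({"userId": [10], "movieId": [1], "rating": [4.0]})
tags_toy = pd.DataFrame({"userId": [10, 10], "movieId": [1, 1],
                         "tag": ["funny", "classic"]})

# Same two-step left merge as in the notebook
ratings_movies_toy = pd.merge(movies_toy, ratings_toy, how="left", on="movieId")
combined_toy = pd.merge(ratings_movies_toy, tags_toy, how="left",
                        on=["userId", "movieId"])

# The single rating now appears twice, once per tag
print(combined_toy)

# Keeping one row per (userId, movieId) pair restores the rating-level view
deduped_toy = combined_toy.drop_duplicates(subset=["userId", "movieId"])
print(len(combined_toy), len(deduped_toy))
```

If rating-level statistics such as averages per movie are computed from the merged frame, deduplicating first (or working from `ratings` directly) avoids counting tagged ratings more than once.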

## Q1: The number of Users
## Q2: The number of Movies
## Q3: How many movies have been rated by users? List the movies that have never been rated

In [None]:
# Count distinct users, distinct movies, and movies that have no rating.
distinct_user = data['userId'].nunique()
distinct_movie = data['movieId'].nunique()
# After the left merge, each unrated movie appears once with a NaN rating
rated_missing = data['rating'].isna().sum()
print(f"Dataset Shape: {data.shape}")
print(f"Dataset has unique users: {distinct_user}")
print(f"Dataset has unique movies: {distinct_movie}")
print("Dataset has movies with no ratings: {}".format(rated_missing))
print('\n')
print('List of Movie with no rating')
# isna() already returns a boolean mask, so no comparison with True is needed
no_rating = data[data['rating'].isna()][['title','genres']]
print(no_rating.to_string(index = False))

Dataset Shape: (102695, 8)
Dataset has unique users: 610
Dataset has unique movies: 9742
Dataset has movies with no ratings: 18


List of Movie with no rating
                                        title                    genres
                        Innocents, The (1961)     Drama|Horror|Thriller
                               Niagara (1953)            Drama|Thriller
                       For All Mankind (1989)               Documentary
 Color of Paradise, The (Rang-e khoda) (1999)                     Drama
               I Know Where I'm Going! (1945)         Drama|Romance|War
                           Chosen, The (1981)                     Drama
  Road Home, The (Wo de fu qin mu qin) (1999)             Drama|Romance
                               Scrooge (1970)     Drama|Fantasy|Musical
                                 Proof (1991)      Comedy|Drama|Romance
                    Parallax View, The (1974)                  Thriller
                     This Gun for Hire (1942)  Cr

## Q4: List Movie Genres

In [None]:
def getGenresList(lst):
  genres_lst = []
  for i in range(len(lst)):
    for j in lst[i]:
      if j not in genres_lst:
        genres_lst.append(j)
      else:
        continue
  return genres_lst
genres = movies['genres'].str.split('|')
genres_lst = getGenresList(genres)

In [None]:
genres_lst

['Adventure',
 'Animation',
 'Children',
 'Comedy',
 'Fantasy',
 'Romance',
 'Drama',
 'Action',
 'Crime',
 'Thriller',
 'Horror',
 'Mystery',
 'Sci-Fi',
 'War',
 'Musical',
 'Documentary',
 'IMAX',
 'Western',
 'Film-Noir',
 '(no genres listed)']
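The same genre list can be built without explicit loops. The sketch below is an alternative under the assumption that pandas 0.25 or later is available (for `Series.explode`); it runs on a three-row toy frame copied from `movies.head()`, and the `_toy` names are hypothetical.

```python
import pandas as pd

movies_toy = pd.DataFrame({
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy",
               "Comedy|Romance"],
})

# One row per (movie, genre) pair
exploded_toy = movies_toy["genres"].str.split("|").explode()

# unique() keeps first-appearance order, like the loop above
genres_unique_toy = exploded_toy.unique().tolist()
# value_counts() gives the number of movies per genre
genres_count_toy = exploded_toy.value_counts()

print(genres_unique_toy)
print(genres_count_toy)
```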

## Q5: Movies in Each Genre

In [None]:
genres_dict = {}
for i in range(len(genres)):
  for j in genres[i]:
    if j not in genres_dict:
      genres_dict[j] = 1
    else:
      genres_dict[j] += 1

genres_count = pd.DataFrame.from_dict(genres_dict, orient = 'index', columns = ['Number'])
genres_count

Unnamed: 0,Number
Adventure,1263
Animation,611
Children,664
Comedy,3756
Fantasy,779
Romance,1596
Drama,4361
Action,1828
Crime,1199
Thriller,1894


In [None]:
movie_lsts = []
for i in genres_count.index:
  movie_lst = []
  for j in range(len(movies)):
    if i in movies['genres'][j]:
      movie_lst.append(movies['title'][j])
  movie_lsts.append(', '.join(movie_lst))

movie_lsts_df = pd.DataFrame(movie_lsts, index = genres_count.index, columns = ['Movie List'])

    


In [None]:
genres_summary = pd.concat([genres_count, movie_lsts_df], axis = 1)
genres_summary

Unnamed: 0,Number,Movie List
Adventure,1263,"Toy Story (1995), Jumanji (1995), Tom and Huck (1995), GoldenEye (1995), Balto (1995), Cutthroat..."
Animation,611,"Toy Story (1995), Balto (1995), Pocahontas (1995), Goofy Movie, A (1995), Swan Princess, The (19..."
Children,664,"Toy Story (1995), Jumanji (1995), Tom and Huck (1995), Balto (1995), Now and Then (1995), Babe (..."
Comedy,3756,"Toy Story (1995), Grumpier Old Men (1995), Waiting to Exhale (1995), Father of the Bride Part II..."
Fantasy,779,"Toy Story (1995), Jumanji (1995), City of Lost Children, The (Cité des enfants perdus, La) (1995..."
Romance,1596,"Grumpier Old Men (1995), Waiting to Exhale (1995), Sabrina (1995), American President, The (1995..."
Drama,4361,"Waiting to Exhale (1995), American President, The (1995), Nixon (1995), Casino (1995), Sense and..."
Action,1828,"Heat (1995), Sudden Death (1995), GoldenEye (1995), Cutthroat Island (1995), Money Train (1995),..."
Crime,1199,"Heat (1995), Casino (1995), Money Train (1995), Get Shorty (1995), Copycat (1995), Assassins (19..."
Thriller,1894,"Heat (1995), GoldenEye (1995), Money Train (1995), Get Shorty (1995), Copycat (1995), Assassins ..."
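The loop above checks `i in movies['genres'][j]`, which is a substring test on the raw pipe-separated string. That works for the current genre names, but it would misfire if one genre name were contained in another. A sketch that matches whole genre tokens and builds the same kind of summary with `groupby` is shown below, on a toy frame with hypothetical `_toy` names.

```python
import pandas as pd

movies_toy = pd.DataFrame({
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy",
               "Comedy|Romance"],
})

# Split first, then explode, so each row carries exactly one genre token
exploded_toy = movies_toy.assign(
    genre=movies_toy["genres"].str.split("|")
).explode("genre")

# sort=False keeps genres in first-appearance order
grouped_toy = exploded_toy.groupby("genre", sort=False)["title"]
summary_toy = pd.DataFrame({
    "Number": grouped_toy.count(),
    "Movie List": grouped_toy.agg(", ".join),
})

print(summary_toy)
```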


# Part 2: Spark ALS based approach for training model
## We will use Spark ML to predict the ratings, so let's reload "ratings.csv" as a Spark DataFrame with ``spark.read`` and keep the (userId, movieId, rating) columns.

In [None]:
! pip install pyspark

Successfully installed py4j-0.10.9 pyspark-3.0.0


In [None]:
from pyspark.sql import SparkSession
# Create (or reuse) the Spark session for this notebook
spark = SparkSession \
    .builder \
    .appName("movie analysis") \
    .getOrCreate()

In [None]:
from pyspark.sql import SQLContext

In [None]:
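# Note: this path is links.csv, so movies_df holds movieId, imdbId and tmdbId;
# use movies.csv here if titles and genres are needed in the join further down
movies_df = spark.read.load("/content/drive/My Drive/Data/links.csv", format='csv', header = True)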
ratings_df = spark.read.load("/content/drive/My Drive/Data/ratings.csv", format='csv', header = True)
links_df = spark.read.load("/content/drive/My Drive/Data/links.csv", format='csv', header = True)
tags_df = spark.read.load("/content/drive/My Drive/Data/tags.csv", format='csv', header = True)

In [None]:
ratings_df.show()

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
|     1|     70|   3.0|964982400|
|     1|    101|   5.0|964980868|
|     1|    110|   4.0|964982176|
|     1|    151|   5.0|964984041|
|     1|    157|   5.0|964984100|
|     1|    163|   5.0|964983650|
|     1|    216|   5.0|964981208|
|     1|    223|   3.0|964980985|
|     1|    231|   5.0|964981179|
|     1|    235|   4.0|964980908|
|     1|    260|   5.0|964981680|
|     1|    296|   3.0|964982967|
|     1|    316|   3.0|964982310|
|     1|    333|   5.0|964981179|
|     1|    349|   4.0|964982563|
+------+-------+------+---------+
only showing top 20 rows



In [None]:
movie_ratings = ratings_df.drop('timestamp')

In [None]:
# Convert column types: without a schema, the CSV reader loads every column as a string
from pyspark.sql.types import IntegerType, FloatType
movie_ratings = movie_ratings.withColumn("userId", movie_ratings["userId"].cast(IntegerType()))
movie_ratings = movie_ratings.withColumn("movieId", movie_ratings["movieId"].cast(IntegerType()))
movie_ratings = movie_ratings.withColumn("rating", movie_ratings["rating"].cast(FloatType()))

# ALS Model Selection and Evaluation
## With the ALS model, we can use a grid search to find the optimal hyperparameters.

In [None]:
# import package
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator,ParamGridBuilder

In [None]:
#Create test and train set
(training,test)=movie_ratings.randomSplit([0.8,0.2], seed = 33)

In [None]:
#Create ALS model
als = ALS(maxIter=5, regParam=0.01, rank = 8, userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop", nonnegative = True)
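To make the idea behind ALS concrete, here is a minimal NumPy sketch of alternating least squares on a tiny dense matrix. It is not Spark's implementation: it assumes plain L2 regularization, ignores the `nonnegative` constraint, and runs on a single machine. The function name `als_toy` and the toy matrix are hypothetical.

```python
import numpy as np

def als_toy(R, mask, rank=2, reg=0.1, iters=20, seed=0):
    """Factor R (users x items) as U @ V.T using only entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)
    for _ in range(iters):
        # Hold V fixed: each user's factors solve a small ridge regression
        for u in range(n_users):
            Vo = V[mask[u]]
            U[u] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ R[u, mask[u]])
        # Hold U fixed: each item's factors solve a small ridge regression
        for i in range(n_items):
            Uo = U[mask[:, i]]
            V[i] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ R[mask[:, i], i])
    return U, V

# Toy ratings; 0.0 marks a missing rating
R = np.array([[5.0, 4.0, 0.0, 1.0],
              [4.0, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [0.0, 1.0, 5.0, 4.0]])
mask = R > 0

U, V = als_toy(R, mask)
pred = U @ V.T
rmse_observed = np.sqrt(np.mean((pred[mask] - R[mask]) ** 2))
print(f"RMSE on observed entries: {rmse_observed:.3f}")
print(np.round(pred, 2))
```

The `rank` and `reg` arguments play the same role as `rank` and `regParam` in the Spark `ALS` constructor above, and `iters` corresponds to `maxIter`.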

In [None]:
#Tune model using ParamGridBuilder
paramGrid = ParamGridBuilder()\
            .addGrid(als.regParam, [0.001, 0.01, 0.05, 0.1])\
            .addGrid(als.maxIter, [5, 10, 15])\
            .addGrid(als.rank, [8,  12,  16,])\
            .build()

In [None]:
# Define evaluator as RMSE
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")

In [None]:
# Build Cross validation 
crossval = CrossValidator(estimator=als,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=5)
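It helps to estimate the cost of the search before fitting. The sketch below counts the grid points defined above with `itertools.product`; to my understanding, `CrossValidator` fits every combination once per fold and then refits the best one on the full training set, so the count is a rough lower bound on the work.

```python
from itertools import product

# Same values as in the ParamGridBuilder cell
reg_params = [0.001, 0.01, 0.05, 0.1]
max_iters = [5, 10, 15]
ranks = [8, 12, 16]
num_folds = 5

grid = list(product(reg_params, max_iters, ranks))
n_fits = len(grid) * num_folds

print(f"{len(grid)} parameter combinations x {num_folds} folds = {n_fits} ALS fits")
```

If the search takes too long, reducing `numFolds` or trimming the grid cuts the number of fits proportionally.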

In [None]:
#Fit ALS model to training data
model = crossval.fit(training)

In [None]:
#Extract best model from the tuning exercise using ParamGridBuilder
best_model = model.bestModel

# Model testing
## Finally, make predictions on the test set and check the test error.

In [None]:
#Generate predictions and evaluate using RMSE
predictions=best_model.transform(test)
rmse = evaluator.evaluate(predictions)

In [None]:
#Print evaluation metrics and model parameters
print ("RMSE = "+str(rmse))
print ("**Best Model**")
print (" Rank:" + str(best_model.rank))
print (" MaxIter:" + str(best_model._java_obj.parent().getMaxIter()))
print (" RegParam:" + str(best_model._java_obj.parent().getRegParam()))

RMSE = 0.8756757167936443
**Best Model**
 Rank:8
 MaxIter:15
 RegParam:0.1


In [None]:
predictions.show()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|    57|    471|   3.0| 3.4022155|
|   462|    471|   2.5| 2.7203116|
|   176|    471|   5.0| 3.3157814|
|   520|    471|   5.0| 3.0083802|
|   273|    471|   5.0|  3.664475|
|   216|    471|   3.0| 3.5301418|
|   287|    471|   4.5| 2.9268203|
|    32|    471|   3.0| 3.9161892|
|   541|    471|   3.0|  3.686296|
|   474|   1088|   3.5| 2.9751709|
|   169|   1088|   4.5| 3.9963353|
|    41|   1088|   1.5| 2.5871677|
|   583|   1088|   3.5| 3.3980017|
|   555|   1088|   4.0|  3.310546|
|   391|   1088|   1.0| 3.0393512|
|   509|   1088|   3.0| 2.9834957|
|   414|   1088|   3.0| 3.1559575|
|   525|   1088|   4.5| 3.3515925|
|   116|   1088|   4.5| 3.2281513|
|   600|   1088|   3.5| 2.5342603|
+------+-------+------+----------+
only showing top 20 rows
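For reference, RMSE is the square root of the mean squared difference between rating and prediction. The sketch below computes it by hand on the first five rows of the table above; it only illustrates the formula and says nothing about the full test set.

```python
import numpy as np

# First five (rating, prediction) pairs copied from the predictions.show() output
ratings_sample = np.array([3.0, 2.5, 5.0, 5.0, 5.0])
predictions_sample = np.array([3.4022155, 2.7203116, 3.3157814, 3.0083802, 3.664475])

errors = predictions_sample - ratings_sample
rmse_sample = np.sqrt(np.mean(errors ** 2))

print(f"RMSE on the five-row sample: {rmse_sample:.4f}")
```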



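# Apply the model to the full dataset and check the performance
## Note that this RMSE is computed on all ratings, including the 80% used for training, so it is lower than the test RMSE and is not an estimate of generalization error.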

In [None]:
alldata=best_model.transform(movie_ratings)
rmse = evaluator.evaluate(alldata)
print ("RMSE = "+str(rmse))

RMSE = 0.6601598879706201


In [None]:
# registerTempTable is deprecated; createOrReplaceTempView is its replacement
alldata.createOrReplaceTempView("alldata")
movies_df.createOrReplaceTempView("movies_py")

In [None]:
output=spark.sql("Select * from alldata")
output.show()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   191|    148|   5.0| 4.9142957|
|   133|    471|   4.0| 3.2198467|
|   597|    471|   2.0| 4.1746902|
|   385|    471|   4.0| 2.8706434|
|   436|    471|   3.0| 3.4148808|
|   602|    471|   4.0| 3.5163324|
|    91|    471|   1.0| 2.1001987|
|   409|    471|   3.0|  3.490967|
|   372|    471|   3.0|  2.846321|
|   599|    471|   2.5| 2.5546834|
|   603|    471|   4.0|  3.404143|
|   182|    471|   4.5| 4.1445665|
|   218|    471|   4.0| 3.5163882|
|   474|    471|   3.0|  3.274422|
|   500|    471|   1.0| 2.2210767|
|    57|    471|   3.0| 3.4022155|
|   462|    471|   2.5| 2.7203116|
|   387|    471|   3.0| 3.1333747|
|   610|    471|   4.0| 3.1766458|
|   217|    471|   2.0| 2.4428291|
+------+-------+------+----------+
only showing top 20 rows



In [None]:
output=spark.sql("Select * from movies_py join alldata on movies_py.movieId=alldata.movieId")
output.show()

+-------+-------+------+------+-------+------+----------+
|movieId| imdbId|tmdbId|userId|movieId|rating|prediction|
+-------+-------+------+------+-------+------+----------+
|    148|0112427| 22279|   191|    148|   5.0| 4.9142957|
|    471|0110074| 11934|   133|    471|   4.0| 3.2198467|
|    471|0110074| 11934|   597|    471|   2.0| 4.1746902|
|    471|0110074| 11934|   385|    471|   4.0| 2.8706434|
|    471|0110074| 11934|   436|    471|   3.0| 3.4148808|
|    471|0110074| 11934|   602|    471|   4.0| 3.5163324|
|    471|0110074| 11934|    91|    471|   1.0| 2.1001987|
|    471|0110074| 11934|   409|    471|   3.0|  3.490967|
|    471|0110074| 11934|   372|    471|   3.0|  2.846321|
|    471|0110074| 11934|   599|    471|   2.5| 2.5546834|
|    471|0110074| 11934|   603|    471|   4.0|  3.404143|
|    471|0110074| 11934|   182|    471|   4.5| 4.1445665|
|    471|0110074| 11934|   218|    471|   4.0| 3.5163882|
|    471|0110074| 11934|   474|    471|   3.0|  3.274422|
|    471|01100

# Recommend movies to users with id: 575, 232.
## You can choose other users to recommend movies to

In [None]:
#Generate top 10 movie recommendations for each user
user_Recs = best_model.recommendForAllUsers(10)

In [None]:
user575_rec = user_Recs.filter(user_Recs.userId==575)

## Recommendations for User 575

In [None]:
def get_rec_for_user(recs):
  recs = recs.select('recommendations.movieId', 'recommendations.rating')
  rec_user = recs.select('movieId').toPandas().iloc[0,0]
  ratings = recs.select('rating').toPandas().iloc[0,0]
  ratings_matrix = pd.DataFrame(rec_user, columns = ['movieId'])
  ratings_matrix['ratings'] = ratings
  ratings_matrix_ps = pd.merge(ratings_matrix, movies, how = 'left', on = 'movieId')
  return ratings_matrix_ps

In [None]:
print('Recommendation for user 575')
get_rec_for_user(user575_rec)


Recommendation for user 575


Unnamed: 0,movieId,ratings,title,genres
0,141718,5.369,Deathgasm (2015),Comedy|Horror
1,26326,5.297,"Holy Mountain, The (Montaña sagrada, La) (1973)",Drama
2,7025,5.239,"Midnight Clear, A (1992)",Drama|War
3,417,5.226,Barcelona (1994),Comedy|Romance
4,2131,5.131,Autumn Sonata (Höstsonaten) (1978),Drama
5,25947,5.127,Unfaithfully Yours (1948),Comedy
6,26810,5.102,Bad Boy Bubby (1993),Drama
7,116897,5.095,Wild Tales (2014),Comedy|Drama|Thriller
8,3379,5.089,On the Beach (1959),Drama
9,1217,5.073,Ran (1985),Drama|War
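Several predicted ratings above exceed 5. ALS predictions are dot products of latent factors, so nothing bounds them to the rating scale. A small sketch of clipping them for display follows, assuming the MovieLens scale of 0.5 to 5.0 stars; the array names are hypothetical.

```python
import numpy as np

# Predicted ratings copied from the first five rows of the user 575 table
predicted = np.array([5.369, 5.297, 5.239, 5.226, 5.131])

# Assumption: ratings live on the 0.5 to 5.0 star scale
clipped = np.clip(predicted, 0.5, 5.0)

print(clipped)
```

Clipping turns these five values into ties, so it is safer to rank on the raw predictions and clip only when showing the numbers.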


## Recommendations for User 232

In [None]:
print('List of Movies User 232 likes: Mostly Action/Drama Movies')
movie_232 = combined[(combined['userId'] == 232) & (combined['rating'] == 5)][['title', 'rating', 'genres']]
movie_232

List of Movies User 232 likes: Mostly Action/Drama Movies


Unnamed: 0,title,rating,genres
8091,Pulp Fiction (1994),5.0,Comedy|Crime|Drama|Thriller
9052,"Shawshank Redemption, The (1994)",5.0,Crime|Drama
26397,Star Wars: Episode VI - Return of the Jedi (1983),5.0,Action|Adventure|Sci-Fi
39235,Saving Private Ryan (1998),5.0,Action|Drama|War
42765,American History X (1998),5.0,Crime|Drama
48374,"Sixth Sense, The (1999)",5.0,Drama|Horror|Mystery
52626,"Green Mile, The (1999)",5.0,Crime|Drama
56360,Gladiator (2000),5.0,Action|Adventure|Drama
61550,Memento (2000),5.0,Mystery|Thriller
66785,We Were Soldiers (2002),5.0,Action|Drama|War


In [None]:
user232_rec = user_Recs.filter(user_Recs.userId==232)

In [None]:
get_rec_for_user(user232_rec)

Unnamed: 0,movieId,ratings,title,genres
0,3925,4.584,Stranger Than Paradise (1984),Comedy|Drama
1,86237,4.583,Connections (1978),Documentary
2,74226,4.583,"Dream of Light (a.k.a. Quince Tree Sun, The) (Sol del membrillo, El) (1992)",Documentary|Drama
3,138966,4.583,Nasu: Summer in Andalusia (2003),Animation
4,84273,4.583,Zeitgeist: Moving Forward (2011),Documentary
5,179135,4.583,Blue Planet II (2017),Documentary
6,134796,4.583,Bitter Lake (2015),Documentary
7,7071,4.583,"Woman Under the Influence, A (1974)",Drama
8,117531,4.583,Watermark (2014),Documentary
9,26928,4.583,"Summer's Tale, A (Conte d'été) (1996)",Comedy|Drama|Romance


# Find similar movies for the movies with id: 464, 471
## You can find similar movies based on the ALS results

In [None]:
als = ALS(maxIter=5, rank=16, regParam=0.1, userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop", nonnegative = True)
model_als = als.fit(training)

In [None]:
# Get the item factors and convert them to a pandas DataFrame
item_features = model_als.itemFactors
item_features_df = item_features.toPandas()

In [None]:
item_features.show()

+---+--------------------+
| id|            features|
+---+--------------------+
| 10|[0.21949627, 0.62...|
| 20|[0.21646729, 0.79...|
| 30|[0.75904375, 0.0,...|
| 40|[0.40602213, 0.63...|
| 50|[0.49712536, 0.70...|
| 60|[0.24679609, 0.30...|
| 70|[0.90697926, 0.97...|
| 80|[0.51616156, 0.58...|
|100|[0.23431821, 0.51...|
|110|[0.4571069, 0.755...|
|140|[0.27278, 0.39272...|
|150|[0.5104057, 0.639...|
|160|[0.27194703, 0.67...|
|170|[0.13369885, 0.83...|
|180|[0.5379878, 0.524...|
|190|[0.23518513, 0.14...|
|210|[0.10778442, 0.39...|
|220|[0.47154355, 0.31...|
|230|[0.0, 0.4111778, ...|
|240|[0.42990988, 8.93...|
+---+--------------------+
only showing top 20 rows



In [None]:
def recommend_similar_movie(movie_id, k):
  # Look up the latent factor vector of the query movie
  movie_feature = np.array(item_features_df.loc[item_features_df['id'] == movie_id, 'features'].iloc[0])
  similarities = []
  for other_id, feature in zip(item_features_df['id'], item_features_df['features']):
    if other_id == movie_id:
      continue  # skip the query movie itself
    item_feature = np.array(feature)
    denom = np.linalg.norm(movie_feature) * np.linalg.norm(item_feature)
    # Calculate the cosine similarity; all-zero vectors get a similarity of 0
    similarity = np.dot(movie_feature, item_feature) / denom if denom > 0 else 0.0
    # Store the movie id, not the row position, next to its similarity
    similarities.append((other_id, similarity))
  similarities.sort(key = lambda x: x[1], reverse=True)
  # Keep the movie ids of the top k matches
  rec_ids = [rec_id for rec_id, _ in similarities[:k]]
  # Index by movieId so that rows come back in similarity order
  similar_movie = movies.set_index('movieId').loc[rec_ids].reset_index()
  return similar_movie
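A vectorized sketch of cosine-similarity ranking over item factors is shown below. It is an illustration on toy factors, not the notebook's function: `top_k_cosine`, `toy_ids` and `toy_factors` are hypothetical names, and the function returns `(id, similarity)` pairs rather than a movies table.

```python
import numpy as np

def top_k_cosine(ids, factors, query_id, k):
    """Return the k ids most similar to query_id, with their cosine similarities."""
    ids = np.asarray(ids)
    factors = np.asarray(factors, dtype=float)
    norms = np.linalg.norm(factors, axis=1)
    norms[norms == 0] = 1.0           # avoid dividing all-zero vectors by zero
    unit = factors / norms[:, None]   # rows scaled to unit length
    query = unit[ids == query_id][0]
    sims = unit @ query               # cosine similarity to every item
    sims[ids == query_id] = -np.inf   # never recommend the query itself
    top = np.argsort(-sims)[:k]
    return [(int(ids[i]), float(sims[i])) for i in top]

toy_ids = [10, 20, 30, 40]
toy_factors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]

result = top_k_cosine(toy_ids, toy_factors, query_id=10, k=2)
print(result)
```

With the notebook's objects, a call would look like `top_k_cosine(item_features_df['id'], item_features_df['features'].tolist(), 471, 10)`, and the returned ids can then be looked up in `movies`.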


In [None]:
# Recommend similar movies for movieId: 464
id_rol = movies[movies.movieId==464].index
movie = movies.loc[movies.movieId == 464, 'title'].values[0]
print('Similar Movies to {} is:'.format(movie))
recommend_similar_movie(464,10)

Similar Movies to Hard Target (1993) is:


Unnamed: 0,movieId,title,genres
772,1014,Pollyanna (1960),Children|Comedy|Drama
1103,1432,Metro (1997),Action|Comedy|Crime|Drama|Thriller
2885,3859,"Eyes of Tammy Faye, The (2000)",Documentary
3174,4275,Krull (1983),Action|Adventure|Fantasy|Sci-Fi
3390,4613,K-9 (1989),Action|Comedy|Crime
5137,8236,While the City Sleeps (1956),Drama|Film-Noir


In [None]:
# Recommend similar movies for movieId: 471
id_rol = movies[movies.movieId==471].index
movie = movies.loc[movies.movieId == 471, 'title'].values[0]
print('Similar Movies to {} is:'.format(movie))
recommend_similar_movie(471,10)

Similar Movies to Hudsucker Proxy, The (1994) is:


Unnamed: 0,movieId,title,genres
2882,3855,"Affair of Love, An (Liaison pornographique, Une) (1999)",Drama|Romance
3518,4809,Silkwood (1983),Drama
3859,5423,Gangster No. 1 (2000),Action|Crime|Thriller
3917,5504,Spy Kids 2: The Island of Lost Dreams (2002),Adventure|Children
4100,5876,"Quiet American, The (2002)",Drama|Thriller|War
4617,6880,"Texas Chainsaw Massacre, The (2003)",Horror
4770,7101,Doc Hollywood (1991),Comedy|Romance


# Write the report
## Motivation
### Use the ALS algorithm to predict movie ratings and build a recommendation system that suggests movies to each user.


## Step 1: Data ETL and Data Exploration Analysis
### Get to know the datasets and review their general information (columns, dtypes, and missing values).

## Step 2: Training the ALS recommendation Model
### Used cross-validation to find the best hyperparameters for the model, with RMSE as the metric for comparing candidates. Here are the metrics:
#### RMSE = 0.8756757167936443
#### Rank: 8
#### MaxIter: 15
#### RegParam: 0.1

## Step 3: Use the best model to recommend the top 10 movies, separately by user and by item.

## Step 4: Conclusion
### 1. Item-based recommendation is more robust than user-based recommendation, because users are more complex than items: a user can have varied tastes, and those tastes can change over time.
### 2. When recommending similar movies, I used cosine similarity on the ALS item features. It seems to work better than KNN, which tends to recommend the most popular movies to everyone.
