<a href="https://colab.research.google.com/github/MandyZhangxy/movie-recommendation/blob/master/Movie_Recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Movie Recommendation Project**
In this project, I will use an Alternating Least Squares (ALS) algorithm with Spark APIs to predict the ratings for the movies in [MovieLens small dataset](https://grouplens.org/datasets/movielens/latest/)

## **Running Pyspark in Colab**

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

In [0]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

## **Setting up environment**

In [6]:
%matplotlib inline
import os

import numpy as np
import pandas as pd
import seaborn as sns

# from nba_utils import draw_3pt_piechart,plot_shot_chart

from IPython.core.display import display, HTML
from IPython.core.magic import register_cell_magic, register_line_cell_magic, register_line_magic
from matplotlib import pyplot as plt
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import array, col, count, mean, sum, udf, when
from pyspark.sql.types import DoubleType, IntegerType, StringType, Row
from pyspark.sql.functions import sum, col, udf

import warnings
warnings.filterwarnings("ignore")

sns.set_style("white")
sns.set_color_codes()

  import pandas.util.testing as tm


In [7]:
!ls

sample_data  spark-2.4.5-bin-hadoop2.7	spark-2.4.5-bin-hadoop2.7.tgz


In [0]:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("moive analysis") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

### **Data ETL and Data Exploration**

In [56]:
from google.colab import files
files.upload()

movies_df = spark.read.load("movies.csv", format="csv",header = True)
ratings_df = spark.read.load("ratings.csv", format = "csv", header = True)
links_df = spark.read.load("ratings.csv", format = "csv", header = True)
tags_df = spark.read.load("tags.csv", format = "csv", header = True)

Saving links.csv to links (3).csv
Saving movies.csv to movies (1).csv
Saving ratings.csv to ratings (1).csv


In [15]:
!ls

links.csv   ratings.csv  spark-2.4.5-bin-hadoop2.7	tags.csv
movies.csv  sample_data  spark-2.4.5-bin-hadoop2.7.tgz


In [17]:
movies_df.show(5)

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
+-------+--------------------+--------------------+
only showing top 5 rows



In [18]:
ratings_df.show(5)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
+------+-------+------+---------+
only showing top 5 rows



In [20]:
links_df.show(5)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
+------+-------+------+---------+
only showing top 5 rows



In [21]:
tags_df.show(5)

+------+-------+---------------+----------+
|userId|movieId|            tag| timestamp|
+------+-------+---------------+----------+
|     2|  60756|          funny|1445714994|
|     2|  60756|Highly quotable|1445714996|
|     2|  60756|   will ferrell|1445714992|
|     2|  89774|   Boxing story|1445715207|
|     2|  89774|            MMA|1445715200|
+------+-------+---------------+----------+
only showing top 5 rows



In [24]:
ratings_df.groupBy("userID").count().show()

+------+-----+
|userID|count|
+------+-----+
|   296|   27|
|   467|   22|
|   125|  360|
|   451|   34|
|     7|  152|
|    51|  359|
|   124|   50|
|   447|   78|
|   591|   54|
|   307|  975|
|   475|  155|
|   574|   23|
|   169|  269|
|   205|   27|
|   334|  154|
|   544|   22|
|   577|  161|
|   581|   40|
|   272|   31|
|   442|   20|
+------+-----+
only showing top 20 rows



In [25]:
tmp1 = ratings_df.groupBy("userID").count().toPandas()['count'].min()
tmp2 = ratings_df.groupBy("movieId").count().toPandas()['count'].min()
print('For the users that rated movies and the movies that were rated:')
print('Minimum number of ratings per user is {}'.format(tmp1))
print('Minimum number of ratings per movie is {}'.format(tmp2))

For the users that rated movies and the movies that were rated:
Minimum number of ratings per user is 20
Minimum number of ratings per movie is 1


In [38]:
tmp1 = ratings_df.groupBy("movieId").count().toPandas()['count']==1
tmp2 = ratings_df.select('movieId').distinct().count()
print('{} out of {} movies are rated by only one user'.format(tmp1.sum(), tmp2))

3446 out of 9724 movies are rated by only one user


## **Spark SQL and OLAP**

### Register the DataFrame as a local temporary view

In [0]:
movies_df.registerTempTable("movies")
ratings_df.registerTempTable("ratings")
links_df.registerTempTable("links")
tags_df.registerTempTable("tags")

### The number of Users


In [61]:
spark.sql(
"select count(distinct userId) as total_users from ratings"
).show()

+-----------+
|total_users|
+-----------+
|        610|
+-----------+



### The number of Movies

In [62]:
spark.sql(
    "select count(distinct movieId) as total_movies from movies"

).show()

+------------+
|total_movies|
+------------+
|        9742|
+------------+



### numbers of movies that are rated by users:

In [65]:
spark.sql(
    '''
    with t as 
    (select m.movieId, title, genres,userId, rating from 
    movies m left join ratings r
    on m.movieId = r.movieId)

    select count(distinct movieId) 
    from t
    where rating is not null
    '''
).show()

+-----------------------+
|count(DISTINCT movieId)|
+-----------------------+
|                   9724|
+-----------------------+



### Movies not rated before:

In [70]:
spark.sql(
    '''
    with t as 
    (select m.movieId, title, genres,userId, rating from 
    movies m left join ratings r
    on m.movieId = r.movieId)

    select movieId, title,genres
    from t
    where rating is null
    group by 1,2,3
    '''
).show()

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|   4194|I Know Where I'm ...|   Drama|Romance|War|
|   1076|Innocents, The (1...|Drama|Horror|Thri...|
|  30892|In the Realms of ...|Animation|Documen...|
|  26085|Mutiny on the Bou...|Adventure|Drama|R...|
|   5721|  Chosen, The (1981)|               Drama|
|  32160|Twentieth Century...|              Comedy|
|   2939|      Niagara (1953)|      Drama|Thriller|
|  25855|Roaring Twenties,...|Crime|Drama|Thriller|
|  32371|Call Northside 77...|Crime|Drama|Film-...|
|   6849|      Scrooge (1970)|Drama|Fantasy|Mus...|
|   3338|For All Mankind (...|         Documentary|
|  85565|  Chalet Girl (2011)|      Comedy|Romance|
|   7792|Parallax View, Th...|            Thriller|
|   6668|Road Home, The (W...|       Drama|Romance|
|   3456|Color of Paradise...|               Drama|
|   7020|        Proof (1991)|Comedy|Drama|Romance|
|   8765|Thi

### List All Movie Genres

In [74]:
spark.sql(
    """
    select movie_genres from movies
    lateral view explode(split(genres, '[|]')) as movie_genres 
    where movie_genres <> "(no genres listed)"
    group by 1
    order by 1
    """
).show()

+------------+
|movie_genres|
+------------+
|      Action|
|   Adventure|
|   Animation|
|    Children|
|      Comedy|
|       Crime|
| Documentary|
|       Drama|
|     Fantasy|
|   Film-Noir|
|      Horror|
|        IMAX|
|     Musical|
|     Mystery|
|     Romance|
|      Sci-Fi|
|    Thriller|
|         War|
|     Western|
+------------+



### Count Movie for Each Category

In [76]:
spark.sql(
    """
select movie_genres,count(*) as total_movies from movies
lateral view explode(split(genres,'[|]')) as movie_genres
group by 1
""").show()

+------------------+------------+
|      movie_genres|total_movies|
+------------------+------------+
|             Crime|        1199|
|           Romance|        1596|
|          Thriller|        1894|
|         Adventure|        1263|
|             Drama|        4361|
|               War|         382|
|       Documentary|         440|
|           Fantasy|         779|
|           Mystery|         573|
|           Musical|         334|
|         Animation|         611|
|         Film-Noir|          87|
|(no genres listed)|          34|
|              IMAX|         158|
|            Horror|         978|
|           Western|         167|
|            Comedy|        3756|
|          Children|         664|
|            Action|        1828|
|            Sci-Fi|         980|
+------------------+------------+



### List all movie names in each category

In [77]:
spark.sql(
    """
    select t1.movie_genres, concat_ws("|",collect_set(t1.title)) as list_of_movies
    from
    (
    select title,movie_genres from movies
    lateral view explode(split(genres, '[|]')) as movie_genres
    group by 1,2
    ) t1
    group by 1    
    """
).show()

+------------------+--------------------+
|      movie_genres|      list_of_movies|
+------------------+--------------------+
|             Crime|Stealing Rembrand...|
|           Romance|Vampire in Brookl...|
|          Thriller|Element of Crime,...|
|         Adventure|Ice Age: Collisio...|
|             Drama|Airport '77 (1977...|
|               War|General, The (192...|
|       Documentary|The Barkley Marat...|
|           Fantasy|Masters of the Un...|
|           Mystery|Before and After ...|
|           Musical|U2: Rattle and Hu...|
|         Animation|Ice Age: Collisio...|
|         Film-Noir|Rififi (Du rififi...|
|(no genres listed)|T2 3-D: Battle Ac...|
|              IMAX|Harry Potter and ...|
|            Horror|Sweeney Todd (200...|
|           Western|Man Who Shot Libe...|
|            Comedy|Hysteria (2011)|H...|
|          Children|Ice Age: Collisio...|
|            Action|Stealing Rembrand...|
|            Sci-Fi|Push (2009)|SORI:...|
+------------------+--------------