<img width="200" style="float:left" 
     src="https://upload.wikimedia.org/wikipedia/commons/f/f3/Apache_Spark_logo.svg" />

# Sections
* [Description](#0)
* [1. Setup](#1)
  * [1.1 Start Hadoop](#1.1)  
  * [1.2 Set Global Config](#1.2)
  * [1.3 Create SparkSession](#1.3)
* [2. Lab](#2)
  * [2.1 Check Movielens Tables](#2.1)
  * [2.2 List current databases](#2.2)
  * [2.3 Create Movie Pair Combinations](#2.3)
  * [2.4 Create Movie Recommendations](#2.4)
  * [2.5 Get Movie Recommendations](#2.5)
* [3. Tear Down](#3)
  * [3.1 Stop Hadoop](#3.1)

<a id='0'></a>
## Description
<p>
<div>The goals for this lab are:</div>
<ul>    
    <li>Get familiar with Spark DataFrames API and SQL API</li>
    <li>Use Spark to apply some analytics</li>
</ul>    
</p>

<a id='1'></a>
## 1. Setup

Since we are going to process data stored in HDFS, let's start the service first.

<a id='1.1'></a>
### 1.1 Start Hadoop

Start Hadoop <a href="http://localhost:2024/">here </a>

<p>
<img style="width:48px" src="https://cdn.iconscout.com/icon/free/png-256/free-hadoop-226007.png" /> 
</p>

<a id='1.2'></a>
### 1.2 Set Global Config

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

In [2]:
import numpy as np
np.bool = np.bool_

In [3]:
#current notebook name
notebook_name = __session__.replace('.ipynb','')[__session__.rfind('/')+1:] 

In [4]:
# HDFS base paths
hdfs_lakehouse_base_path = 'hdfs://localhost:9000/lakehouse/'
hdfs_warehouse_base_path = 'hdfs://localhost:9000/warehouse'

<a id='1.3'></a>
### 1.3 Create SparkSession

In [5]:
import os
dependencies = ["org.apache.spark:spark-avro_2.12:3.5.0",
                "io.delta:delta-iceberg_2.12:3.0.0"]
os.environ['PYSPARK_SUBMIT_ARGS']= f"--packages {','.join(dependencies)} pyspark-shell"
os.environ['PYARROW_IGNORE_TIMEZONE'] = 'true'

In [6]:
from pyspark.sql.session import SparkSession

spark = (SparkSession.builder
    .appName(notebook_name)
    .config("spark.log.level","ERROR")
    .config("spark.sql.warehouse.dir",hdfs_warehouse_base_path)
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .enableHiveSupport()
    .getOrCreate()
)

24/12/07 12:37:09 WARN Utils: Your hostname, osbdet resolves to a loopback address: 127.0.0.1; using 10.0.2.15 instead (on interface enp0s1)
24/12/07 12:37:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Ivy Default Cache set to: /home/osbdet/.ivy2/cache
The jars for the packages stored in: /home/osbdet/.ivy2/jars
org.apache.spark#spark-avro_2.12 added as a dependency
io.delta#delta-iceberg_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5788ee28-0e42-411e-8a0e-b0dd1a8897dd;1.0
	confs: [default]
	found org.apache.spark#spark-avro_2.12;3.5.0 in central
	found org.tukaani#xz;1.9 in central
	found io.delta#delta-iceberg_2.12;3.0.0 in central


:: loading settings :: url = jar:file:/home/osbdet/.jupyter_venv/lib/python3.11/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


	found io.delta#delta-spark_2.12;3.0.0 in central
	found io.delta#delta-storage;3.0.0 in central
	found org.antlr#antlr4-runtime;4.9.3 in central
	found org.scala-lang.modules#scala-collection-compat_2.12;2.1.1 in central
	found com.github.ben-manes.caffeine#caffeine;2.9.3 in central
	found org.checkerframework#checker-qual;3.19.0 in central
	found com.google.errorprone#error_prone_annotations;2.10.0 in central
:: resolution report :: resolve 210ms :: artifacts dl 10ms
	:: modules in use:
	com.github.ben-manes.caffeine#caffeine;2.9.3 from central in [default]
	com.google.errorprone#error_prone_annotations;2.10.0 from central in [default]
	io.delta#delta-iceberg_2.12;3.0.0 from central in [default]
	io.delta#delta-spark_2.12;3.0.0 from central in [default]
	io.delta#delta-storage;3.0.0 from central in [default]
	org.antlr#antlr4-runtime;4.9.3 from central in [default]
	org.apache.spark#spark-avro_2.12;3.5.0 from central in [default]
	org.checkerframework#checker-qual;3.19.0 from central

<a id='2'></a>
## 2. Lab

### Install Python dependencies
To fetch the movie posters shown at the end of the lab, we need a Python library called beautifulsoup4.
Run the following cell to install it (or run the same command, without the leading `!`, in a terminal):


In [7]:
!/home/osbdet/.jupyter_venv/bin/python3 -m pip install beautifulsoup4



<a id='2.1'></a>
### 2.1 Check Movielens Tables

To complete this lab you must first complete **'Movielens - Step 1 - BRZ to SLVR'**.


<a id='2.2'></a>
### 2.2 List current databases

In [8]:
spark.sql("show databases").show()

+---------+
|namespace|
+---------+
|  default|
|movielens|
|  pokemon|
+---------+



In [9]:
spark.catalog.listDatabases()

[Database(name='default', catalog='spark_catalog', description='Default Hive database', locationUri='hdfs://localhost:9000/warehouse'),
 Database(name='movielens', catalog='spark_catalog', description='', locationUri='hdfs://localhost:9000/warehouse/movielens.db'),
 Database(name='pokemon', catalog='spark_catalog', description='', locationUri='hdfs://localhost:9000/warehouse/pokemon.db')]

<a id='2.3'></a>
### 2.3 Create Movie Pair Combinations

We are going to create a temporary view that holds every combination of two movies rated by the same user. It is temporary because we only use it as an intermediate step towards the final result.

In [10]:
spark.sql(
"""
create or replace temporary view movie_pair_by_user as
select distinct
    l.userId,
    least(l.movieId,r.movieId) as movieA,
    greatest(l.movieId,r.movieId) as movieB,
    case when l.movieId=least(l.movieId,r.movieId) then l.rating else r.rating end as ratingA,
    case when l.movieId=greatest(l.movieId,r.movieId) then l.rating else r.rating end as ratingB
from movielens.ratings l join movielens.ratings r
on l.userId = r.userId 
where  l.movieId <> r.movieId
""")

DataFrame[]

In [11]:
spark.sql("select * from movie_pair_by_user limit 10").toPandas()

                                                                                

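|   | userId | movieA | movieB | ratingA | ratingB |
|---|--------|--------|--------|---------|---------|
| 0 | 1 | 3 | 3639 | 4.0 | 4.0 |
| 1 | 1 | 6 | 47 | 4.0 | 5.0 |
| 2 | 1 | 47 | 3273 | 5.0 | 5.0 |
| 3 | 1 | 47 | 2048 | 5.0 | 5.0 |
| 4 | 1 | 47 | 1282 | 5.0 | 5.0 |
| 5 | 1 | 50 | 2916 | 5.0 | 4.0 |
| 6 | 1 | 50 | 1954 | 5.0 | 5.0 |
| 7 | 1 | 50 | 1049 | 5.0 | 5.0 |
| 8 | 1 | 70 | 1258 | 3.0 | 3.0 |
| 9 | 1 | 101 | 231 | 5.0 | 5.0 |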
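
The pairing logic of the view can be mirrored in plain Python on toy data. This is only a sketch for intuition, not what Spark executes: the `ratings` list and the `movie_pairs_by_user` helper are hypothetical, and it assumes each user rates a movie at most once. Sorting the movie ids before combining them plays the role of `least`/`greatest`, so every pair appears once with the smaller id as `movieA`.

```python
from itertools import combinations

# Toy ratings (hypothetical data): (userId, movieId, rating)
ratings = [
    (1, 47, 5.0),
    (1, 6, 4.0),
    (1, 50, 5.0),
    (2, 6, 3.0),
    (2, 47, 4.5),
]

def movie_pairs_by_user(ratings):
    """Return sorted (userId, movieA, movieB, ratingA, ratingB) rows with movieA < movieB."""
    # Group the ratings of each user: {userId: {movieId: rating}}
    by_user = {}
    for user_id, movie_id, rating in ratings:
        by_user.setdefault(user_id, {})[movie_id] = rating

    pairs = []
    for user_id, rated in by_user.items():
        # combinations() over the sorted ids yields each unordered pair once,
        # smaller movie id first
        for movie_a, movie_b in combinations(sorted(rated), 2):
            pairs.append((user_id, movie_a, movie_b, rated[movie_a], rated[movie_b]))
    return sorted(pairs)

pairs = movie_pairs_by_user(ratings)
for row in pairs:
    print(row)
```

A user with *n* rated movies contributes *n(n-1)/2* rows, which is why the view grows much faster than the ratings table.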


Notice that, if we look it up in the catalog, it is indeed a temporary view.
Temporary means that it only exists for the duration of this Spark session and is not saved/persisted.
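
One way to check this is to look at the `isTemporary` flag of the entries returned by `spark.catalog.listTables()`. The helper below is a minimal sketch: `temporary_view_names` and the `FakeTable` stand-in rows are hypothetical names introduced here so the filter can be tried without a running session.

```python
from collections import namedtuple

def temporary_view_names(tables):
    """Return the names of the catalog entries flagged as temporary."""
    return [t.name for t in tables if t.isTemporary]

# In the notebook (needs the live session):
#   temporary_view_names(spark.catalog.listTables())

# Stand-in rows (hypothetical) with the attributes the helper relies on
FakeTable = namedtuple("FakeTable", ["name", "tableType", "isTemporary"])
sample = [
    FakeTable("movie_pair_by_user", "TEMPORARY", True),
    FakeTable("some_managed_table", "MANAGED", False),
]
print(temporary_view_names(sample))
```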

<a id='2.4'></a>
### 2.4 Create Movie Recommendations

Now it is time to calculate the actual recommendations.
Recommendations are all the previous movie pairs with a positive correlation and enough statistical significance; that is, the pair must have been rated by at least a minimum number of users (in this case, at least 35).
I chose this number based on how often the previous combinations occur.
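
To make the criterion concrete, here is a small pure-Python sketch of the Pearson correlation coefficient (the statistic Spark's `corr` aggregate computes) combined with the minimum-occurrences filter. The function names `pearson` and `keep_pair` and the toy ratings are hypothetical; returning `None` for an undefined correlation is a choice made for this sketch only.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equally long sequences; None if it is undefined."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant series has no spread, so the coefficient is undefined
        return None
    return cov / sqrt(var_x * var_y)

def keep_pair(ratings_a, ratings_b, min_occurrences=35):
    """Apply the same two conditions as the 'having' clause of the query."""
    r = pearson(ratings_a, ratings_b)
    return r is not None and r > 0 and len(ratings_a) >= min_occurrences

# Toy ratings (hypothetical): four users rated both movies of a pair
ratings_a = [5.0, 4.0, 3.0, 1.0]
ratings_b = [4.0, 4.0, 2.0, 1.0]

print(round(pearson(ratings_a, ratings_b), 3))
print(keep_pair(ratings_a, ratings_b))                     # too few occurrences
print(keep_pair(ratings_a, ratings_b, min_occurrences=4))  # relaxed threshold
```

With only four co-ratings the pair is rejected under the default threshold even though the correlation is high, which is exactly the kind of noisy pair the occurrence filter is meant to exclude.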

Because these recommendations are our final results, we are going to save them in HDFS. Since we processed data from the std layer (silver), we are going to persist our final results in the gold layer (notice the path used).

Here we are creating an external table.

In [12]:
spark.sql("drop database if exists movie_recommender cascade")
spark.sql("create database movie_recommender comment 'Movie Recommender' ")
recs = spark.sql(
"""
select 
    movieA,
    movieB,
    corr(ratingA,ratingB) as correlation, 
    count(*) as occurrences
from movie_pair_by_user
group by movieA,movieB
having correlation > 0 and occurrences >=35
""")

(recs.write
          .format("delta")
          .mode("overwrite")
          .option("path",f"{hdfs_lakehouse_base_path}gold/movie_recommender/recommendations/")
          .saveAsTable("movie_recommender.recommendations"))

DataFrame[]

DataFrame[]

                                                                                

From now on, we can query our recommendations table

In [13]:
spark.sql("select * from movie_recommender.recommendations limit 10").toPandas()

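|   | movieA | movieB | correlation | occurrences |
|---|--------|--------|-------------|-------------|
| 0 | 2268 | 3578 | 0.378771 | 35 |
| 1 | 2115 | 44191 | 0.114139 | 42 |
| 2 | 2683 | 8961 | 0.243803 | 53 |
| 3 | 6934 | 7143 | 0.187996 | 39 |
| 4 | 2858 | 2959 | 0.378292 | 134 |
| 5 | 1208 | 1247 | 0.423838 | 48 |
| 6 | 344 | 3793 | 0.045676 | 54 |
| 7 | 4306 | 5299 | 0.438145 | 50 |
| 8 | 1917 | 6377 | 0.481003 | 45 |
| 9 | 111 | 1193 | 0.333478 | 56 |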


<a id='2.5'></a>
### 2.5 Get Movie Recommendations

We are going to create a utility function to get recommendations for a movie.

This function queries the recommendations table for the movies most closely correlated with the one passed as an argument.

In [14]:
import pyspark.sql.functions as F

spark.catalog.cacheTable("movie_recommender.recommendations")

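def get_recs(movie,recs_number=5) :
    """Return the recs_number movies most correlated with the given movie id, joined with their links and trailer."""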
    query = f"""
            select r.correlation,m.*,l.imdbUrl,l.tmdbUrl,t.youtubeUrl
            from
            (select 
                case when movieA={movie} then movieB else movieA end as movieId,
                correlation
            from movie_recommender.recommendations 
            where movieA={movie} or movieB={movie}
            order by correlation desc
            limit {recs_number}) r
            left join movielens.movies m on r.movieId = m.movieId            
            left join movielens.links l on r.movieId = l.movieId            
            left join movielens.trailers t on r.movieId = t.movieId            
            """
    return spark.sql(query).collect()
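
The core of the subquery (pick the other movie of each pair, sort by correlation, keep the top N) can be sketched in plain Python. `top_correlated` and the toy `rows` are hypothetical and only illustrate the selection logic; the joins with the movies, links and trailers tables are left out.

```python
def top_correlated(recommendations, movie, recs_number=5):
    """recommendations: iterable of (movieA, movieB, correlation) tuples."""
    candidates = [
        # Same idea as the SQL 'case when': return the other movie of the pair
        (b if a == movie else a, corr)
        for a, b, corr in recommendations
        if movie in (a, b)
    ]
    # Highest correlation first, then keep only the requested number of rows
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:recs_number]

# Toy rows (hypothetical values)
rows = [
    (10, 20, 0.40),
    (20, 30, 0.75),
    (5, 20, 0.10),
    (10, 30, 0.90),
]
print(top_correlated(rows, 20, recs_number=2))
```

Because each pair is stored only once with the smaller id in `movieA`, the lookup has to check both columns, which is why the query filters on `movieA={movie} or movieB={movie}`.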

The following functions are to pretty print our recommendations ;)

In [15]:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

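def fetch_movie_poster(url):
    # Some sites reject requests that lack a browser-like User-Agent
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = Request(url,headers=hdr)
    page = urlopen(req)
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(page, "html.parser")
    # The poster URL is published in the page's Open Graph "og:image" meta tag
    for meta in soup.find_all("meta"):
        if 'property' in meta.attrs and meta.attrs['property'] == "og:image":
            return meta.attrs['content']
    return None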
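
To see what the function extracts without making a network request, the same `og:image` lookup can be tried on an inline HTML string. This sketch deliberately swaps BeautifulSoup for the standard library's `html.parser`, so it is not the notebook's implementation; `OgImageParser`, `extract_og_image` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

class OgImageParser(HTMLParser):
    """Remember the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and self.og_image is None:
            attributes = dict(attrs)
            if attributes.get("property") == "og:image":
                self.og_image = attributes.get("content")

def extract_og_image(html):
    """Return the og:image URL found in the HTML text, or None if there is none."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.og_image

# Hypothetical page fragment for illustration
sample_html = """
<html><head>
  <meta property="og:title" content="Some Movie" />
  <meta property="og:image" content="https://example.com/poster.jpg" />
</head><body></body></html>
"""
print(extract_og_image(sample_html))
```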

In [16]:
from IPython.display import HTML,IFrame

def display_recs(recs):
    for rec in recs:
        display(IFrame(src=f'{rec.youtubeUrl}',width='560', height='315')) 
        display(HTML('<a href="%s" target="_blank">%s (IMDB)</a>'% (rec.imdbUrl,rec.title) ))
        
def display_posters(recs):
    html = "<table><tr>"
    for rec in recs:
        html+=f'<td><img src="{fetch_movie_poster(rec.imdbUrl)}" width="100"/></td>'
    html+= "</tr></table>"    
    display(HTML(html)) 

Let's get some recommendations for "Lord of the Rings: The Fellowship of the Ring" (movie id 4993)

In [17]:
# 4993: "Lord of the Rings: The Fellowship of the Ring"
recs = get_recs(4993)
display_posters(recs)
display_recs(recs)

                                                                                

Let's get some recommendations for "Harry Potter and the Philosopher's Stone" (movie id 4896)

In [18]:
#4896,Harry Potter and the Philosopher's Stone
recs = get_recs(4896)
display_posters(recs)
display_recs(recs)

<a id='3'></a>
## 3. Tear Down

Once we complete the lab, we can stop all the services.

<a id='3.1'></a>
### 3.1 Stop Hadoop
Stop Hadoop <a href="http://localhost:2024/">here </a>
<p>
<img style="width:48px" src="https://cdn.iconscout.com/icon/free/png-256/free-hadoop-226007.png" /> 
</p>