### Notebook API Python Spark DataFrame - CNAM 2026


The aim of these exercises is to practise using Spark's DataFrame API in Python.
To consult the documentation, follow this link:
* https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html

See also the following files:
*   [beginners_python_cheat_sheet_pcc_all.pdf](https://drive.google.com/file/d/10Hvkf94dYT0Q3ZqlFUE6hqwU_4tOlclP/view?usp=sharing)
*   [PySpark_Cheat_Sheet_Python.pdf](https://drive.google.com/file/d/1DsTqOla0bmmwnpgMDOuSF-1QTNo0fjVR/view?usp=sharing)
*   [PySpark_SQL_Cheat_Sheet_Python.pdf](https://drive.google.com/file/d/15Q_EVC3yaDW1oZ-QFbcQORpOFB7AZq0o/view?usp=sharing)

## Reminder of some functions

|Expression |Action|
|:-------------:|:-------------:|
|ds = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/path/file.csv") |loads the content of file.csv into a DataFrame ds, indicating that the file contains a header and asking Spark to infer the schema |
|ds.printSchema() |shows the schema of ds |
|ds.show(truncate=False)|shows the first 20 rows without truncating the values |
|ds.describe().show()|computes and shows descriptive statistics (count, mean, stddev, min, max) of the numeric and string columns|
|ds.select("c1", "c2", ..., "cn")|projects ds on the columns c1, …, cn|
|ds.withColumnRenamed("c1","c2")|renames the column c1 to c2|
|ds.where(cond)|selects the rows satisfying cond|
|ds.groupBy("c1").agg(collect_list("c2").alias("values"))|groups the rows by column c1 and creates a new column values containing the list of c2 values associated with each value of c1|
|ds.groupBy("c1").agg(avg("c2"))|computes the average of c2 for each c1 |
|ds.withColumn("new", Exp)|creates a new column whose values are computed by Exp|
|ds1.crossJoin(ds2)|computes the cross product of ds1 and ds2|
|ds1.join(ds2, "c") |joins ds1 and ds2 on the column c|
|ds1.join(ds2, ["c1", ..., "cn"]) |generalizes the previous one to a list of columns c1, …, cn|

## Preparation
*   ***Check that computing resources*** are allocated to your notebook (see the RAM and disk indicators at the top right). If not, click on the Connect button to obtain resources.

*   ***Create the directory*** to store the necessary files on your Google Drive (give the notebook permission to access your drive when requested). *Adjust the name of your folder*: **MyDrive/ens/cnam/data/**

In [None]:
import os
from google.colab import drive
drive.mount("/content/drive", force_remount=True)

drive_dir = "/content/drive/MyDrive/ens/cnam/data/"
os.makedirs(drive_dir, exist_ok=True)
os.listdir(drive_dir)

***Install pyspark and findspark:***

In [None]:
!pip install -q pyspark
!pip install -q findspark

***Start the spark session:***

In [None]:
import os
os.environ["SPARK_HOME"] = "/usr/local/lib/python3.12/dist-packages/pyspark"
os.environ["JAVA_HOME"] = "/usr"

In [None]:
# Main imports
import findspark
from pyspark.sql import SparkSession
from pyspark import SparkConf

# for dataframe and udf
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import *
from datetime import *

# initialize environment variables for spark
findspark.init()

# Start spark session
# --------------------------
def start_spark():
  local = "local[*]"
  appName = "TP"
  configLocale = SparkConf().setAppName(appName).setMaster(local).\
  set("spark.executor.memory", "6G").\
  set("spark.driver.memory","6G").\
  set("spark.sql.catalogImplementation","in-memory")

  spark = SparkSession.builder.config(conf = configLocale).getOrCreate()
  sc = spark.sparkContext
  sc.setLogLevel("ERROR")

  spark.conf.set("spark.sql.autoBroadcastJoinThreshold","-1")

  # Adjust the query execution environment to the size of the cluster (4 cores)
  spark.conf.set("spark.sql.shuffle.partitions","4")
  print("session started, its id is ", sc.applicationId)
  return spark
spark = start_spark()

## Read a file and transform it into a DataFrame
  - read the movies.csv file
  - display the schema
  - display columns (attributes)
  - display content (3 films)
  - display the number of films
  - describe the idM column (describe() function)
  - display statistics on the movies DataFrame (summary function)

### Copy movies.csv and ratings.csv to the folder "/content/drive/MyDrive/ens/cnam/data"

In [None]:
#The folder containing the imported csv files:
DATASET_DIR="/content/drive/MyDrive/ens/cnam/data"

In [None]:
#See an excerpt of movies.csv
!head $DATASET_DIR/movies.csv

In [None]:
#See an excerpt of ratings.csv
!head $DATASET_DIR/ratings.csv

## Read the movies.csv file and create the movies DataFrame

In [None]:
# Create a movies DataFrame to store the movies in the movies.csv file
# its schema is as follows: idM INT, title STRING, g STRING
schema = """
          idM INT,
          title STRING,
          g STRING
        """
print("Reading the file: ", DATASET_DIR+"/movies.csv")
movies = spark.read.format("csv").option("header", "true").schema(schema) \
            .load(DATASET_DIR+"/movies.csv")
movies=movies.persist()

In [None]:
#Display the resulting schema
movies.printSchema()

Result:
```
root
 |-- idM: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- g: string (nullable = true)

```

In [None]:
#Display column names


Result:
```
['idM', 'title', 'g']

```

In [None]:
#Display 3 lines of the movies structure (use the show function)


Result:
```
+---+-----------------------+-------------------------------------------+
|idM|title                  |g                                          |
+---+-----------------------+-------------------------------------------+
|1  |Toy Story (1995)       |Adventure|Animation|Children|Comedy|Fantasy|
|2  |Jumanji (1995)         |Adventure|Children|Fantasy                 |
|3  |Grumpier Old Men (1995)|Comedy|Romance                             |
+---+-----------------------+-------------------------------------------+
only showing top 3 rows

```

In [None]:
#Display the number of movies (use the count function)

#result: 9125

In [None]:
#Describe (give statistics) the idM column of movies (use the describe function)


Result:
```
+-------+------------------+
|summary|               idM|
+-------+------------------+
|  count|              9125|
|   mean|31123.291835616437|
| stddev| 40782.63360397416|
|    min|                 1|
|    max|            164979|
+-------+------------------+

```

In [None]:
# attribute statistics (use the summary function)


Result:
```
+-------+------------------+--------------------+------------------+
|summary|               idM|               title|                 g|
+-------+------------------+--------------------+------------------+
|  count|              9125|                9125|              9125|
|   mean|31123.291835616437|                NULL|              NULL|
| stddev| 40782.63360397416|                NULL|              NULL|
|    min|                 1|"""Great Performa...|(no genres listed)|
|    25%|              2849|                NULL|              NULL|
|    50%|              6287|                NULL|              NULL|
|    75%|             56251|                NULL|              NULL|
|    max|            164979| İtirazım Var (2014)|           Western|
+-------+------------------+--------------------+------------------+

```

## Queries on movies
   - display 10 movie titles
   - display movie titles, movie ids (idM) incremented by 1 and genres
   - display movies with titles starting with 'Police', ordered by idM (startswith function)
   - create a new movies_g DataFrame with a single genre per movie (for a movie with n genres, there are n rows); use the explode function
   - display three rows of movies_g
   - display the number of distinct genres
   - display the number of movies per genre (groupBy)
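
To make the effect of split followed by explode concrete, here is a minimal plain-Python sketch on two toy rows. It does not use Spark and is not the exercise solution; the names `rows`, `with_arrays` and `exploded` are illustrative assumptions.

```python
# Toy rows shaped like movies: (idM, title, g), with genres separated by '|'
rows = [
    (1, "Toy Story (1995)", "Adventure|Animation|Children"),
    (3, "Grumpier Old Men (1995)", "Comedy|Romance"),
]

# "split": turn the genre string into a list of genres
with_arrays = [(idM, title, g.split("|")) for idM, title, g in rows]

# "explode": emit one output row per element of the list
exploded = [
    (title, genre, idM)
    for idM, title, genres in with_arrays
    for genre in genres
]

for row in exploded:
    print(row)
```

A movie with n genres yields n rows, which is the behaviour expected from explode on an array column.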

In [None]:
#Display 10 film titles


Result:
```
+--------------------+
|               title|
+--------------------+
|    Toy Story (1995)|
|      Jumanji (1995)|
|Grumpier Old Men ...|
|Waiting to Exhale...|
|Father of the Bri...|
|         Heat (1995)|
|      Sabrina (1995)|
| Tom and Huck (1995)|
| Sudden Death (1995)|
|    GoldenEye (1995)|
+--------------------+
only showing top 10 rows
```

In [None]:
#Display 3 movie titles, movie ids (idM) incremented by 1 and genres


Result:
```
+--------------------+---------+--------------------+
|               title|(idM + 1)|                   g|
+--------------------+---------+--------------------+
|    Toy Story (1995)|        2|Adventure|Animati...|
|      Jumanji (1995)|        3|Adventure|Childre...|
|Grumpier Old Men ...|        4|      Comedy|Romance|
+--------------------+---------+--------------------+
only showing top 3 rows
```

In [None]:
#Display movies with titles beginning with 'Police', ordered by idM (filter with startswith, orderBy, show)


Result:
```
+----+--------------------+------------+
| idM|               title|           g|
+----+--------------------+------------+
|2378|Police Academy (1...|Comedy|Crime|
|2379|Police Academy 2:...|Comedy|Crime|
|2380|Police Academy 3:...|Comedy|Crime|
+----+--------------------+------------+
only showing top 3 rows
```

In [None]:
#The genres of each film are currently stored in a single string; we will replace this string with an array of strings
#(for example, for a film with a column g containing 'Comedy|Romance' we'll get a column g containing ['Comedy', 'Romance']).
#- use the split function (split(col("g"), r"\|")); the pattern is a regular expression, so the | must be escaped

from pyspark.sql.functions import split, col


#Print the schema of the new movies DataFrame


Result:
```
root
 |-- idM: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- g: array (nullable = true)
 |    |-- element: string (containsNull = false)
```

In [None]:
#Display three lines of the new DataFrame movies


Result:
```
+---+--------------------+--------------------+
|idM|               title|                   g|
+---+--------------------+--------------------+
|  1|    Toy Story (1995)|[Adventure, Anima...|
|  2|      Jumanji (1995)|[Adventure, Child...|
|  3|Grumpier Old Men ...|   [Comedy, Romance]|
+---+--------------------+--------------------+
only showing top 3 rows
```

In [None]:
# For movies with no genre, the column g contains the string '(no genres listed)'
# Create a DataFrame tmp which contains only the movies whose column g DOES NOT contain '(no genres listed)' (use the array_contains function)


# display the number of movies in the Dataframe tmp (Result: 9107)



In [None]:
#Create a new DataFrame movies_g from tmp with a single genre per movie (for a movie with n genres, there will be n rows); use the explode function

#Display 3 lines of the new DataFrame


Result:
```
+----------------+---------+---+
|           title|    genre|idM|
+----------------+---------+---+
|Toy Story (1995)|Adventure|  1|
|Toy Story (1995)|Animation|  1|
|Toy Story (1995)| Children|  1|
+----------------+---------+---+
only showing top 3 rows
```

In [None]:
# Compute the number of distinct genres (distinct and count)

#Result: 19

In [None]:
# Display the number of films by genre (groupBy and count)


Result:
```
+-----------+-----+
|      genre|count|
+-----------+-----+
|   Children|  583|
|    Fantasy|  654|
|      Crime| 1100|
|     Horror|  877|
|  Adventure| 1117|
|      Drama| 4365|
|     Sci-Fi|  792|
|       IMAX|  153|
|    Musical|  394|
|    Western|  168|
|  Animation|  447|
|     Comedy| 3315|
|    Romance| 1545|
|   Thriller| 1729|
|    Mystery|  543|
|        War|  367|
|     Action| 1545|
|Documentary|  495|
|  Film-Noir|  133|
+-----------+-----+
```

## Read the ratings.csv file and create the ratings DataFrame

In [None]:
schema = """
          idU INT,
          idM INT,
          rating FLOAT,
          date INT
        """

print("Reading the file: ", DATASET_DIR+"/ratings.csv")
ratings =spark.read.format('csv').option('header', 'true').schema(schema).load(DATASET_DIR+"/ratings.csv")
ratings=ratings.persist()

# Display the schema of ratings
...
# Display 3 lines of ratings
...
# Count the number of lines in ratings
...

Result:
```
root
 |-- idU: integer (nullable = true)
 |-- idM: integer (nullable = true)
 |-- rating: float (nullable = true)
 |-- date: integer (nullable = true)

+---+----+------+----------+
|idU| idM|rating|      date|
+---+----+------+----------+
|  1|  31|   2.5|1260759144|
|  1|1029|   3.0|1260759179|
|  1|1061|   3.0|1260759182|
+---+----+------+----------+
only showing top 3 rows

100004
```

### *Extraction of the day, month and year from the date:*
In the ratings.csv file, the date on which a user rated a film is stored as a Unix epoch timestamp (number of seconds since 1970-01-01 00:00:00 UTC). We will extract the year, month and day from this date in two stages:
- create 3 user-defined functions (UDFs), each annotated with @udf('integer'), taking as parameter an integer representing the date to be converted and returning respectively the day, the month (between 1 and 12) and the year
- invoke these functions using the withColumn() method.
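
As a quick check of the conversion itself, the following plain-Python sketch (no Spark, no UDF) decodes the first timestamp of ratings.csv. The helper name `split_epoch` is a hypothetical name chosen for this illustration, and UTC is assumed as the reference time zone.

```python
from datetime import datetime, timezone

def split_epoch(ts):
    """Return (year, month, day) in UTC for a Unix timestamp given in seconds."""
    d = datetime.fromtimestamp(ts, tz=timezone.utc)
    return d.year, d.month, d.day

# 1260759144 is the date value of the first row of ratings
print(split_epoch(1260759144))  # (2009, 12, 14)
```

The three UDFs of the exercise each return one component of this triple.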

In [None]:
from datetime import *
from pyspark.sql.functions import udf

In [None]:
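#define the function that extracts the day (between 1 and 31) from the date
@udf('integer')
def getDay(v):
    # v is a Unix timestamp in seconds; datetime.utcfromtimestamp is deprecated since Python 3.12
    return datetime.fromtimestamp(v, tz=timezone.utc).day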

In [None]:
#test the previous function by applying it to the date column in ratings
...

Result:
```
+---+----+------+----------+---+
|idU| idM|rating|      date|day|
+---+----+------+----------+---+
|  1|  31|   2.5|1260759144| 14|
|  1|1029|   3.0|1260759179| 14|
|  1|1061|   3.0|1260759182| 14|
+---+----+------+----------+---+
only showing top 3 rows
```

In [None]:
#define the function that extracts the month (between 1 and 12) from the date
...

In [None]:
#define the function that extracts the year
...

In [None]:
# apply the previous 3 functions to the date column in ratings to build a new ratings DataFrame
# with columns idU, idM, rating, year, month, day
...

ratings = ratings.persist() #keep ratings in memory

#Display 3 lines of the new ratings DataFrame
...

#Count the number of lines from ratings
...

Result:
```
+---+----+------+----+-----+---+
|idU| idM|rating|year|month|day|
+---+----+------+----+-----+---+
|  1|  31|   2.5|2009|   12| 14|
|  1|1029|   3.0|2009|   12| 14|
|  1|1061|   3.0|2009|   12| 14|
+---+----+------+----+-----+---+
only showing top 3 rows

100004
```

## Querying ratings
  - display the number of distinct years
  - display the number of distinct dates (including year, month, day)
  - display the maximum, average and minimum rating
  - group ratings by movie id
  - display the average rating per movie
  - for each user
     - display the number of distinct rating values, and the maximum, minimum and average ratings
     - sort the results of the previous query by descending number of ratings and by user id
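
The per-user aggregation can be pictured on toy data with the plain-Python sketch below. It is only an illustration of the grouping logic, not the Spark solution, and the names `ratings_toy`, `by_user` and `stats` are assumptions. Run it in a fresh interpreter: the notebook's `from pyspark.sql.functions import *` replaces the built-ins `max`, `min` and `sum`.

```python
from collections import defaultdict

# Toy ratings shaped like the ratings DataFrame: (idU, idM, rating)
ratings_toy = [(1, 31, 2.5), (1, 1029, 3.0), (1, 1061, 3.0), (2, 10, 4.0), (2, 17, 5.0)]

# "groupBy idU": collect the ratings of each user
by_user = defaultdict(list)
for idU, idM, rating in ratings_toy:
    by_user[idU].append(rating)

# "agg": number of distinct rating values, max, min and average per user
stats = {
    idU: (len(set(rs)), max(rs), min(rs), sum(rs) / len(rs))
    for idU, rs in by_user.items()
}

# "orderBy": descending number of distinct ratings, then ascending user id
for idU, s in sorted(stats.items(), key=lambda kv: (-kv[1][0], kv[0])):
    print(idU, s)
```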

In [None]:
#Display the number of distinct years (countDistinct)
...

Result:
```
+--------------------+
|count(DISTINCT year)|
+--------------------+
|                  22|
+--------------------+
```

In [None]:
#Display the number of distinct dates (year, month, day)
...

Result:
```
+--------------------------------+
|count(DISTINCT year, month, day)|
+--------------------------------+
|                            3840|
+--------------------------------+
```

In [None]:
#Display the maximum, average and minimum rating (min, max, avg)
...

Result:
```
+-----------+-----------+-----------------+
|min(rating)|max(rating)|      avg(rating)|
+-----------+-----------+-----------------+
|        0.5|        5.0|3.543608255669773|
+-----------+-----------+-----------------+
```

In [None]:
#Group ratings by film number (groupBy) and store the result in a grouped_ratings dataframe.
grouped_ratings=...
#no results to display

In [None]:
#Display the average rating per movie using the grouped_ratings DataFrame (avg).
...

Result:
```
+----+------------------+
| idM|       avg(rating)|
+----+------------------+
|1029|3.7023809523809526|
|1129|            3.3125|
|1263|3.8645833333333335|
+----+------------------+
only showing top 3 rows
```

In [None]:
#Display average scores by film, sorted in descending order of score (orderBy with desc)
...

Result:
```
+---+-----------+
|idM|avg(rating)|
+---+-----------+
| 53|        5.0|
|183|        5.0|
|301|        5.0|
+---+-----------+
only showing top 3 rows
```

In [None]:
#Create a rating_user dataframe which groups ratings by user
#no results to display


In [None]:
# Create a Dataframe tmp which contains for each user the total number of different ratings, the maximum, minimum and average rating


In [None]:
#Sort the tmp dataframe by descending number of ratings and user number and display the result


Result:
```
+---+-----+---+---+------------------+
|idU|total|max|min|           moyenne|
+---+-----+---+---+------------------+
| 15|   10|5.0|0.5|2.6217647058823528|
| 17|   10|5.0|0.5| 3.743801652892562|
| 20|   10|5.0|0.5|3.2908163265306123|
+---+-----+---+---+------------------+
only showing top 3 rows

```

### **Films and ratings** joins
  - create a movies_ratings DataFrame containing the movies with their ratings (one line per rating)
  - display the number of ratings for the movie whose title contains the string 'Pocahontas'
  - display the title, number of ratings, average rating, maximum rating and minimum rating for each movie
  - the titles of movies that are not rated
  - for each genre, the users who have not rated any movies in that genre

In [None]:
#Print the schemas of movies and ratings DataFrames
...

Result:
```
root
 |-- idM: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- g: array (nullable = true)
 |    |-- element: string (containsNull = false)

root
 |-- idU: integer (nullable = true)
 |-- idM: integer (nullable = true)
 |-- rating: float (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
```

In [None]:
# Create a movies_ratings DataFrame containing the movies and their ratings (one line per rating) (join)
movies_ratings=...
#Display two lines
...

Result:
```
+---+----------------+--------------------+---+------+----+-----+---+
|idM|           title|                   g|idU|rating|year|month|day|
+---+----------------+--------------------+---+------+----+-----+---+
|  1|Toy Story (1995)|[Adventure, Anima...|  7|   3.0|1996|   12| 29|
|  1|Toy Story (1995)|[Adventure, Anima...|  9|   4.0|1999|    9| 29|
+---+----------------+--------------------+---+------+----+-----+---+
only showing top 2 rows
```

In [None]:
#Display the number of ratings for the movie whose title contains the string 'Pocahontas' (contains function)
#result: 61
...

In [None]:
#For each movie, display its title, number of ratings, average rating, maximum rating, minimum rating (groupBy and agg)
fn=...
#Display three lines
...

Result:
```
+--------------------+---------+---+---+------------------+
|               title|nbRatings|max|min|               avg|
+--------------------+---------+---+---+------------------+
|    Toy Story (1995)|        9|5.0|1.0|3.8724696356275303|
|      Jumanji (1995)|        8|5.0|1.5|3.4018691588785046|
|Grumpier Old Men ...|       10|5.0|0.5|3.1610169491525424|
+--------------------+---------+---+---+------------------+
only showing top 3 rows
```

### Outer joins and cross join

In [None]:
# Create a Dataframe movies1 which renames the idM attribute of films to idM1 (function withColumnRenamed)
movies1 = ...
#no result to display

In [None]:
# Join movies1 with ratings using a left outer join, which also keeps films without ratings; store the result in a movies2 dataframe.
movies2 = ...
#no result to display

In [None]:
# From movies2, display three movies without ratings (use isNull())
...

Result:
```
+--------------------+
|               title|
+--------------------+
|Wild Child, The (...|
|Iron Ladies, The ...|
|Scarlet Street (1...|
+--------------------+
only showing top 3 rows
```

In [None]:
# For each genre, users who have not rated any films in this genre
# Indications:

# - create a first DataFrame g_u which contains the pairs (genre, idU) where idU has rated at least one movie of that genre
# - create a second DataFrame g_u_all containing all possible (genre, idU) pairs (crossJoin)
# - use the two DataFrames to compute the (genre, idU) pairs where idU has not rated any movie of that genre (subtract)


#g_u.show()


#g_u_all.show()


#Show the user that did not rate movies in the 'Comedy' category


Result:
```
+------+---+
| genre|idU|
+------+---+
|Comedy|446|
+------+---+
```
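
The "cross product, then subtract" idea behind this last query can be sketched on toy data in plain Python, where sets stand in for DataFrames. All names and values below are illustrative assumptions, not the exercise data.

```python
from itertools import product

genres = {"Comedy", "Drama"}
users = {1, 2, 3}

# (genre, idU) pairs where the user rated at least one movie of the genre
g_u = {("Comedy", 1), ("Comedy", 2), ("Drama", 1), ("Drama", 3)}

# every possible (genre, idU) pair, i.e. the cross product
g_u_all = set(product(genres, users))

# subtract: pairs that never occur in g_u
missing = sorted(g_u_all - g_u)
print(missing)  # [('Comedy', 3), ('Drama', 2)]
```

The set difference plays the role of the DataFrame subtract method.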

# **Recommend films to users**

Apply collaborative filtering (the user-centred approach) to recommend to each user movies they have not yet viewed (we assume that a movie not rated by a user has not been viewed by that user). The approach is described in the Memory-based section of https://en.wikipedia.org/wiki/Collaborative_filtering.



### 1.  Compute the similarity between users (Jaccard similarity)

First, we'll calculate a similarity value for each pair of users, based on the films they've rated in common. For a user u, we need to know the set v of all the film ids they have rated. The similarity between users u1 and u2 will be calculated from the corresponding movie sets v1 and v2.

Jaccard similarity (see the description here: https://en.wikipedia.org/wiki/Jaccard_index):
 - the similarity between u1 and u2 is equal to the number of films rated in common by u1 and u2 divided by the total number of films rated by u1 or u2, that is, the cardinality of the intersection of v1 and v2 divided by the cardinality of their union. For example, if u1 has rated the films f1, f3 and f4 (v1=[f1, f3, f4]) and u2 has rated the films f3, f4, f5 and f6 (v2=[f3, f4, f5, f6]), their similarity is 2/5=0.4.
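
The worked example can be checked with a few lines of plain Python. This is a minimal sketch outside Spark; the function name `jaccard` is an illustrative assumption, and returning 0.0 for two empty sets is a convention chosen here.

```python
def jaccard(v1, v2):
    """Jaccard similarity of two collections: |intersection| / |union|."""
    s1, s2 = set(v1), set(v2)
    union = s1 | s2
    if not union:
        return 0.0  # convention assumed here: two empty sets have similarity 0
    return len(s1 & s2) / len(union)

v1 = ["f1", "f3", "f4"]
v2 = ["f3", "f4", "f5", "f6"]
print(jaccard(v1, v2))  # 0.4
```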
   
The similarity calculation is performed in several steps:

- *Step 1*: build for each user a list of the films they have rated and store them in the list_films DataFrame, which will have 2 columns: idU and l_films, which will contain an array of film numbers.

In [None]:


#print the schema of list_films


#show two lines


Result:
```
root
 |-- idU: integer (nullable = true)
 |-- l_films: array (nullable = false)
 |    |-- element: integer (containsNull = false)

+---+--------------------+
|idU|             l_films|
+---+--------------------+
|  1|[31, 1029, 1061, ...|
|  2|[10, 17, 39, 47, ...|
+---+--------------------+
only showing top 2 rows
```

- *Step 2*: Build all possible pairs of users with their respective movie lists and store them in the couples_u DataFrame, which will have columns idU1, idU2, l_films1, l_films2.
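
As an illustration of what "all possible pairs" means, the plain-Python sketch below enumerates the pairs for three toy users. It assumes that both (u1, u2) and (u2, u1) are kept, so n users give n(n-1) pairs; the names `list_films_toy` and `couples` are hypothetical.

```python
from itertools import permutations

# Toy version of list_films: idU -> list of rated movie ids
list_films_toy = {1: [31, 1029], 2: [10, 17], 3: [60, 110]}

# ordered pairs (u1, u2) with u1 != u2, each with its two movie lists
couples = [
    (u1, list_films_toy[u1], u2, list_films_toy[u2])
    for u1, u2 in permutations(list_films_toy, 2)
]

print(len(couples))  # 6
print(couples[0])    # (1, [31, 1029], 2, [10, 17])
```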

In [None]:
# Build an intermediate DataFrame t1(idU1, l_films1) from list_films by renaming idU -> idU1 and l_films -> l_films1


# Build an intermediate DataFrame t2(idU2, l_films2) in the same way as t1


# Build couples_u(idU1, idU2, l_films1, l_films2) from t1 and t2 (remove pairs where idU1=idU2)


Result:
```
+----+--------------------+
|idU1|            l_films1|
+----+--------------------+
|   1|[31, 1029, 1061, ...|
|   2|[10, 17, 39, 47, ...|
|   3|[60, 110, 247, 26...|
+----+--------------------+
only showing top 3 rows

+----+--------------------+
|idU2|            l_films2|
+----+--------------------+
|   1|[31, 1029, 1061, ...|
|   2|[10, 17, 39, 47, ...|
|   3|[60, 110, 247, 26...|
+----+--------------------+
only showing top 3 rows

+----+--------------------+----+--------------------+
|idU1|            l_films1|idU2|            l_films2|
+----+--------------------+----+--------------------+
|   1|[31, 1029, 1061, ...|   2|[10, 17, 39, 47, ...|
|   1|[31, 1029, 1061, ...|   3|[60, 110, 247, 26...|
|   1|[31, 1029, 1061, ...|   4|[10, 34, 112, 141...|
+----+--------------------+----+--------------------+
only showing top 3 rows
```

- *Step 3*: Define a sim_jaccard user function that computes a Jaccard similarity value from two lists specified as parameters

In [None]:
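@udf('float')
def sim_jaccard(l1, l2):
    # Jaccard similarity: |intersection| / |union| of the two movie lists
    set1 = set(l1)
    set2 = set(l2)
    union_size = len(set1.union(set2))
    # return a float (not the int 0) so that the value matches the declared 'float' type
    if union_size == 0:
        return 0.0
    return len(set1.intersection(set2)) / union_size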

- *Step 4*: Compute the similarity between each pair of users built in step 2 by applying the similarity function defined in step 3 to their respective movie lists. The similarity will be stored in the DataFrame sim_j(idU1, idU2, sim)

In [None]:
# Build a DataFrame sim_j(idU1, idU2, sim) by applying the withColumn method to the couples_u DataFrame. Keep only entries where sim != 0


sim_j.persist() #keep the dataframe in memory

#Display 3 lines
sim_j.show(3)

#Count the number of its lines


Result:
```
+----+----+-----------+
|idU1|idU2|        sim|
+----+----+-----------+
|   1|   4| 0.02283105|
|   1|   5|0.008403362|
|   1|   7|0.048543688|
+----+----+-----------+
only showing top 3 rows

395560
```

### 2. **Computation of recommendation scores for unrated films**

- Preparing the computation: remove the date information from ratings and store the result in u_seen_ratings(idU, idM, rating)

In [None]:


#Display 3 lines


#Count the number of its lines


Result:
```
+---+----+------+
|idU| idM|rating|
+---+----+------+
|  1|  31|   2.5|
|  1|1029|   3.0|
|  1|1061|   3.0|
+---+----+------+
only showing top 3 rows

100004
```

- *Step 1*: Build all possible (user, movie) pairs and store them in um_all(idU, idM). Then remove the pairs that already appear in u_seen_ratings to obtain the pairs where the user has not seen the movie.

In [None]:

#Count its number of lines (result: 6122875)


In [None]:
#Store the result in the u_not_seen(idU, idM) DataFrame, which will be kept in memory.


#Display 3 lines


#Count the number of its lines (result: 6022871)


Result:
```
+---+---+
|idU|idM|
+---+---+
|  1| 13|
|  1| 28|
|  1| 42|
+---+---+
only showing top 3 rows

6022871
```

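- *Step 2*: compute a DataFrame u_sim_ratings containing quintuples (idU1, idU2, idM, rating, sim), where rating is the rating given by user idU2 to movie idM and sim is the similarity between idU1 and idU2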

In [None]:


#Display 3 lines


#Count the number of its lines (result: 64389904)

Result:
```
+----+----+---+------+-----------+
|idU1|idU2|idM|rating|        sim|
+----+----+---+------+-----------+
|   2|  12|253|   3.0|0.007352941|
|   3|  12|253|   3.0|0.037037037|
|   4|  12|253|   3.0|0.027131783|
+----+----+---+------+-----------+
only showing top 3 rows

64389904
```

- *Step 3*: create a DataFrame u_recom that extends u_sim_ratings with a recom column containing the product sim*rating

In [None]:


#Display 3 lines


Result:
```
+----+----+---+------+-----------+-----------+
|idU1|idU2|idM|rating|        sim|      recom|
+----+----+---+------+-----------+-----------+
|   2|  12|253|   3.0|0.007352941|0.022058824|
|   3|  12|253|   3.0|0.037037037| 0.11111111|
|   4|  12|253|   3.0|0.027131783| 0.08139535|
+----+----+---+------+-----------+-----------+
only showing top 3 rows
```

- *Step 4*: Create a DataFrame u_recom2 containing the columns idU1, idM and avg_rec, where avg_rec is the average of the recommendation scores (recom) for each pair (idU1, idM). Display the result.
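
The arithmetic of steps 3 and 4 can be followed on toy data with the plain-Python sketch below. It does not use Spark; the rows and the names ending in `_toy` are illustrative assumptions, not values from the dataset.

```python
from collections import defaultdict
from statistics import mean

# Toy rows shaped like u_sim_ratings: (idU1, idU2, idM, rating, sim)
u_sim_ratings_toy = [
    (1, 2, 253, 4.0, 0.5),
    (1, 3, 253, 2.0, 0.25),
    (1, 2, 100, 3.0, 0.5),
]

# Step 3: recom = sim * rating
u_recom_toy = [(u1, u2, m, r, s, s * r) for u1, u2, m, r, s in u_sim_ratings_toy]

# Step 4: average of recom for each (idU1, idM) pair
scores = defaultdict(list)
for u1, _u2, m, _r, _s, recom in u_recom_toy:
    scores[(u1, m)].append(recom)
avg_rec = {key: mean(values) for key, values in scores.items()}

print(avg_rec)  # {(1, 253): 1.25, (1, 100): 1.5}
```

For user 1 and movie 253, the two scores 0.5*4.0=2.0 and 0.25*2.0=0.5 average to 1.25.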

In [None]:


#Count the number of its lines
#6019322

Result:
```
+----+---+-----------+
|idU1|idM|      recom|
+----+---+-----------+
|   2|253|0.022058824|
|   3|253| 0.11111111|
+----+---+-----------+
only showing top 2 rows

6019322
```

In [None]:
#Display the top 5 recommendations for user 100


Result:
```
+----+----+------------------+
|idU1| idM|           avg_rec|
+----+----+------------------+
| 100|1472|0.9180327653884888|
| 100| 667|0.5885736619432768|
| 100|  86|0.5710706852842122|
| 100| 694|0.5690423326566816|
| 100| 100|0.5613212545535394|
+----+----+------------------+
only showing top 5 rows
```

- *Step 5*: create a DataFrame u_not_seen_rec containing only the recommendations for unseen movies, and count its lines

Result:
```
5922381
```

In [None]:
#Display the top 5 recommendations for user 100


Result:
```
+---+----+------------------+
|idU| idM|           avg_rec|
+---+----+------------------+
|100|1472|0.9180327653884888|
|100| 667|0.5885736619432768|
|100| 694|0.5690423326566816|
|100| 100|0.5613212545535394|
|100| 631| 0.531534632253978|
+---+----+------------------+
only showing top 5 rows
```