# Introduction To PySpark

Today we will work through examples of using PySpark (the Python API for Spark) to handle data. In particular, we will learn how to create RDD objects and how to query data with data frames.

## Setup

Run the cell below to set up Spark in your Colab environment.

In [53]:
!pip install pyspark
!apt-get update
!apt install openjdk-17-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk-amd64"

Hit:1 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:2 https://cli.github.com/packages stable InRelease
Hit:3 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:4 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:5 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Get:9 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:10 https://r2u.stat.illinois.edu/ubuntu jammy/main all Packages [9,728 kB]
Get:11 https://r2u.stat.illinois.edu/ubuntu jammy/main amd64 Packages [2,895 kB]
Fetched 12.6 MB in 3s (4,082 kB/s)
Reading package lists... Done
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list ent

## Resilient Distributed Dataset (RDD)

Creating a resilient distributed dataset (RDD) is a common entry point for Spark users. In this section, we will create RDDs using the `SparkContext` class and learn some important RDD methods: `map()`, `reduce()`, `filter()`, and `collect()`.

In [72]:
# sc.stop()

In [74]:
from pyspark import SparkContext, SparkConf
import numpy as np
import os
import sys

# Ensure PySpark uses the correct Python version
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

# Use getOrCreate to safely get the context
conf = SparkConf().setMaster("local[4]").setAppName("RDD App")
sc = SparkContext.getOrCreate(conf=conf)

In [75]:
# Create an RDD with 500,000 random integers between 0 and 9.
lst = np.random.randint(0, 10, 500000)
print(lst)
# type(lst) # lst is a numpy array
A = sc.parallelize(lst) # A is an RDD dataset

[0 0 5 ... 0 4 3]


In [57]:
# A is an RDD object
type(A)
# print(A) # The content of an RDD object cannot be seen directly.

In [58]:
# Convert the data back to a regular list.
AA = A.collect()
print(type(AA))
len(AA)

<class 'list'>


500000

In [59]:
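# A is split into 4 partitions (one per core requested with local[4]).
# glom() groups the elements of each partition into a list.
print(len(A.glom().collect()[0]))  # size of the first partition
print(len(A.glom().collect()))     # number of partitions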


124928
4


In [60]:
# Let's see what happens if we assign 2 cores instead.

sc.stop()  # stop current spark environment

sc = SparkContext(master="local[2]")  # Assign 2 cores

A = sc.parallelize(lst)
print(len(A.glom().collect()[0]))
print(len(A.glom().collect()))


249856
2
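
The first partition holds 124928 elements with 4 partitions and 249856 with 2, rather than an even 125000 or 250000. One plausible explanation, offered here as an assumption about PySpark's internals rather than as documented behavior, is that `parallelize()` serializes the data in batches of up to 1024 elements and assigns whole batches to partitions. The plain-Python sketch below (the function name `partition_sizes` is hypothetical) reproduces both numbers under that assumption.

```python
def partition_sizes(n_items, n_slices, max_batch=1024):
    """Model of batch-wise slicing (an assumption, not Spark's actual code)."""
    # Assumed batch size: at most max_batch elements per serialized batch.
    batch = max(1, min(n_items // n_slices, max_batch))
    # Size of every batch; only the last one may be smaller.
    batches = [min(batch, n_items - start) for start in range(0, n_items, batch)]
    # Split the list of batches into n_slices contiguous, near-equal ranges.
    n = len(batches)
    bounds = [(i * n) // n_slices for i in range(n_slices + 1)]
    return [sum(batches[bounds[i]:bounds[i + 1]]) for i in range(n_slices)]


print(partition_sizes(500000, 4))  # first entry: 124928
print(partition_sizes(500000, 2))  # first entry: 249856
```

If this model is right, the partitions are only approximately equal in size, and the last one absorbs the leftover partial batch.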


The `map()` method applies a function to the values contained in one RDD and creates a new RDD with the results.

In [61]:
type(A)

In [62]:
# Let's go back to 4 cores.
sc.stop()  # stop current spark environment

sc = SparkContext(master="local[4]")  # Assign 4 cores

A = sc.parallelize(lst)
print(len(A.glom().collect()))

4


In [63]:
import time
# We can create a new RDD by squaring the values contained in A.
# map() is a lazy transformation: this call only records the operation.
start = time.time()
B = A.map(lambda x: x * x)  # the mapping is defined by a lambda expression
end = time.time()
print(end - start)

# collect() is an action: it triggers the computation and returns a list.
start = time.time()
AA = B.collect()
end = time.time()
print(end - start)
print(AA[0:2])

# B is not cached, so a second collect() recomputes it.
start = time.time()
AA = B.collect()
end = time.time()
print(end - start)
print(AA[0:2])

0.003634929656982422
4.358624458312988
[np.int64(64), np.int64(49)]
3.3976552486419678
[np.int64(64), np.int64(49)]
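
These timings are consistent with lazy evaluation. `map()` returned in milliseconds because it only records the transformation, while each `collect()` took seconds because that is when the work runs. The second `collect()` took about as long as the first, which suggests `B` was recomputed rather than reused. The toy class below is a plain-Python sketch of that behavior; `LazyDataset` is a hypothetical stand-in and not part of PySpark.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical; not part of PySpark)."""

    def __init__(self, source, funcs=()):
        self._source = source        # the original data
        self._funcs = tuple(funcs)   # recorded, not yet applied, transformations

    def map(self, f):
        # Transformation: returns a new object immediately; no data is touched.
        return LazyDataset(self._source, self._funcs + (f,))

    def collect(self):
        # Action: replay every recorded transformation over the data.
        out = []
        for x in self._source:
            for f in self._funcs:
                x = f(x)
            out.append(x)
        return out


calls = {"n": 0}  # counts how often the mapped function actually runs


def square(x):
    calls["n"] += 1
    return x * x


toy = LazyDataset([8, 7, 3])
squared = toy.map(square)
print(calls["n"])         # 0: map() alone computed nothing
print(squared.collect())  # [64, 49, 9]
print(calls["n"])         # 3
squared.collect()
print(calls["n"])         # 6: the second collect() recomputed everything
```

In Spark, an RDD that will be reused across several actions can be kept in memory with `cache()` to avoid this recomputation.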


In [64]:
print(len(B.glom().collect()))

4


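The `reduce()` method aggregates the values in an RDD into a single value by repeatedly applying a two-argument function. Unlike `map()`, it is an action: it returns the result to the driver rather than creating a new RDD.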
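
As a rough mental model, `reduce()` can be pictured as two steps: each partition is reduced on its own, and the per-partition results are then combined. The plain-Python sketch below illustrates this with made-up data; the helper `reduce_partitions` is hypothetical and only mimics the idea. This is also why the function passed to `reduce()` should be commutative and associative: the order in which values are combined is not fixed.

```python
from functools import reduce


def reduce_partitions(partitions, f):
    # Step 1: reduce each non-empty partition on its own
    # (in Spark this part runs in parallel on the workers).
    partials = [reduce(f, part) for part in partitions if part]
    # Step 2: combine the per-partition results into a single value.
    return reduce(f, partials)


partitions = [[0, 0, 5], [9, 2], [4, 3]]  # made-up data in 3 partitions
print(reduce_partitions(partitions, lambda x, y: x + y))              # 23
print(reduce_partitions(partitions, lambda x, y: x if x > y else y))  # 9
```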

In [71]:
#start time
start = time.time()
print(sum(A.collect()))
#stop time
end = time.time()
print(end - start)

start = time.time()
# Calculate the sum of the data.
print(A.reduce(lambda x, y: x + y))  # Pairs of values are repeatedly replaced by
                                     # their sum until one value remains.
#stop time
end = time.time()
print(end - start)


# np.sum(lst)

22497715
3.830104351043701
22497715
4.8359293937683105


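### `reduce()` was slower here, but it should scale better on larger data

In this run, collecting the data and summing it on the driver took about 3.8 s, while `reduce()` took about 4.8 s. With data too large to collect onto the driver, `reduce()` would be expected to be the better choice, because each partition is summed where it lives and only the partial sums are sent back.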

In [76]:
# Find the maximum value.
C = A.reduce(lambda x,y: x if x > y else y)

# np.max(lst)
print(type(C))
C

<class 'numpy.int64'>


np.int64(9)

In [77]:
# Find all integers in A that are divisible by 3.
C = A.filter(lambda x: x % 3 == 0)
print(type(C))
C.collect()

<class 'pyspark.core.rdd.PipelinedRDD'>


[np.int64(0),
 np.int64(0),
 np.int64(6),
 np.int64(0),
 np.int64(0),
 np.int64(3),
 np.int64(6),
 np.int64(3),
 np.int64(3),
 np.int64(6),
 np.int64(9),
 np.int64(3),
 np.int64(6),
 np.int64(9),
 np.int64(9),
 np.int64(6),
 np.int64(3),
 np.int64(6),
 np.int64(9),
 np.int64(9),
 np.int64(3),
 np.int64(0),
 np.int64(3),
 np.int64(9),
 np.int64(3),
 np.int64(3),
 np.int64(0),
 np.int64(0),
 np.int64(0),
 np.int64(6),
 np.int64(0),
 np.int64(0),
 np.int64(0),
 np.int64(0),
 np.int64(9),
 np.int64(9),
 np.int64(0),
 np.int64(0),
 np.int64(6),
 np.int64(3),
 np.int64(3),
 np.int64(3),
 np.int64(6),
 np.int64(9),
 np.int64(3),
 np.int64(0),
 np.int64(3),
 np.int64(3),
 np.int64(0),
 np.int64(6),
 np.int64(9),
 np.int64(9),
 np.int64(6),
 np.int64(3),
 np.int64(3),
 np.int64(6),
 np.int64(6),
 np.int64(3),
 np.int64(3),
 np.int64(9),
 np.int64(6),
 np.int64(3),
 np.int64(6),
 np.int64(9),
 np.int64(9),
 np.int64(3),
 np.int64(0),
 np.int64(3),
 np.int64(9),
 np.int64(3),
 np.int64(3),
 ...

In [78]:
C

PythonRDD[2] at collect at /tmp/ipython-input-3849992152.py:4

## Spark Data Frames
A **Spark DataFrame** is a distributed collection of rows organized into named columns. It is conceptually equivalent to a Pandas data frame, but it is built on top of RDDs so that it can handle large amounts of data efficiently.

Keep in mind that a Spark DataFrame is immutable: we can't change a data frame once it is created. Instead, applying a transformation to an existing data frame produces a new one.

In [79]:
# Download the movielens dataset
!wget "http://files.grouplens.org/datasets/movielens/ml-latest-small.zip"

--2026-02-11 01:28:18--  http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.96.204
Connecting to files.grouplens.org (files.grouplens.org)|128.101.96.204|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://files.grouplens.org/datasets/movielens/ml-latest-small.zip [following]
--2026-02-11 01:28:18--  https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Connecting to files.grouplens.org (files.grouplens.org)|128.101.96.204|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip.1’


2026-02-11 01:28:19 (1.65 MB/s) - ‘ml-latest-small.zip.1’ saved [978202/978202]



In [None]:
# Extract the CSV files.
# -p: do not fail if Data already exists; -o: overwrite files without prompting.
!mkdir -p Data
!unzip -o "ml-latest-small.zip" -d "Data"

In [42]:
!ls Data/ml-latest-small/

links.csv  movies.csv  ratings.csv  README.txt	tags.csv


In [43]:
from pyspark.sql import SparkSession
# Note: importing max from pyspark.sql.functions shadows Python's built-in max().
from pyspark.sql.functions import count, desc, col, max, struct, mean

`SparkSession` is the entry point introduced in Spark 2.0 for working with data frames. A session is obtained through `SparkSession.builder`, as in the next cell.

In [44]:
# Create a new spark session
spark = SparkSession.builder.appName('spark_app').getOrCreate()

In [45]:
# Import the ratings.csv file:
path = "Data/ml-latest-small/ratings.csv"
ratings = spark.read.format('csv')\
            .option('inferSchema', True)\
            .option('header', True)\
            .load(path)

# inferSchema: Let Spark infer the data type of each column
# header: Use the first row as column names
# load: Specify the source of data

# Equivalent statement for Pandas
# ratings = pd.read_csv(path)
# ratings.head()

ratings.show()

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
|     1|     70|   3.0|964982400|
|     1|    101|   5.0|964980868|
|     1|    110|   4.0|964982176|
|     1|    151|   5.0|964984041|
|     1|    157|   5.0|964984100|
|     1|    163|   5.0|964983650|
|     1|    216|   5.0|964981208|
|     1|    223|   3.0|964980985|
|     1|    231|   5.0|964981179|
|     1|    235|   4.0|964980908|
|     1|    260|   5.0|964981680|
|     1|    296|   3.0|964982967|
|     1|    316|   3.0|964982310|
|     1|    333|   5.0|964981179|
|     1|    349|   4.0|964982563|
+------+-------+------+---------+
only showing top 20 rows


In [46]:
# "Delete" the timestamp column
ratings = ratings.drop('timestamp')
ratings.show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   4.0|
|     1|      3|   4.0|
|     1|      6|   4.0|
|     1|     47|   5.0|
|     1|     50|   5.0|
|     1|     70|   3.0|
|     1|    101|   5.0|
|     1|    110|   4.0|
|     1|    151|   5.0|
|     1|    157|   5.0|
|     1|    163|   5.0|
|     1|    216|   5.0|
|     1|    223|   3.0|
|     1|    231|   5.0|
|     1|    235|   4.0|
|     1|    260|   5.0|
|     1|    296|   3.0|
|     1|    316|   3.0|
|     1|    333|   5.0|
|     1|    349|   4.0|
+------+-------+------+
only showing top 20 rows


In [47]:
type(ratings)

In [48]:
# Show data types
ratings.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)



In [49]:
# Shape of the data frame
print(ratings.count(), len(ratings.columns))

100836 3


In [50]:
# Query 1: Select all ratings of movie 1.
# q1 = ratings.select('*').filter(ratings.movieId == 1)
q1 = ratings.filter(ratings.movieId == 1)
q1.show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   4.0|
|     5|      1|   4.0|
|     7|      1|   4.5|
|    15|      1|   2.5|
|    17|      1|   4.5|
|    18|      1|   3.5|
|    19|      1|   4.0|
|    21|      1|   3.5|
|    27|      1|   3.0|
|    31|      1|   5.0|
|    32|      1|   3.0|
|    33|      1|   3.0|
|    40|      1|   5.0|
|    43|      1|   5.0|
|    44|      1|   3.0|
|    45|      1|   4.0|
|    46|      1|   5.0|
|    50|      1|   3.0|
|    54|      1|   3.0|
|    57|      1|   5.0|
+------+-------+------+
only showing top 20 rows


In [51]:
# Query 2: show most-rated movies
ratings_count = ratings.groupby('movieId').agg(count('*').alias('count'))
ratings_count.show()

+-------+-----+
|movieId|count|
+-------+-----+
|   1580|  165|
|   2366|   25|
|   3175|   75|
|   1088|   42|
|  32460|    4|
|  44022|   23|
|  96488|    4|
|   1238|    9|
|   1342|   11|
|   1591|   26|
|   1645|   51|
|   4519|    9|
|   2142|   10|
|    471|   40|
|   3997|   12|
|    833|    6|
|   3918|    9|
|   7982|    4|
|   1959|   15|
|  68135|   10|
+-------+-----+
only showing top 20 rows


In [52]:
q2 = ratings_count.orderBy(desc('count')).limit(10)
# q2 is small, so we can safely convert it to a Pandas data frame.
q2.toPandas()

   movieId  count
0      356    329
1      318    317
2      296    307
3      593    279
4     2571    278
5      260    251
6      480    238
7      110    237
8      589    224
9      527    220
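
To make the logic of Query 2 concrete, here is a plain-Python sketch of the same three steps (group, count, take the top entries) on a handful of made-up rows. It only illustrates what the query computes, not how Spark executes it.

```python
from collections import Counter

# A few made-up (userId, movieId) pairs standing in for rows of `ratings`.
rows = [(1, 356), (2, 356), (3, 356), (1, 318), (2, 318), (1, 296)]

# groupby('movieId') + count('*'): number of rows per movieId.
counts = Counter(movie_id for _, movie_id in rows)

# orderBy(desc('count')) + limit(2): the two most frequently rated movieIds.
top2 = counts.most_common(2)
print(top2)  # [(356, 3), (318, 2)]
```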


In [None]:
# Exercise:
# Query 3: find top 10 users with most ratings

q3 = ratings.groupBy("userId").agg(count("*").alias("count")).orderBy(desc('count')).limit(10)
q3.show()

+------+-----+
|userId|count|
+------+-----+
|   414| 2698|
|   599| 2478|
|   474| 2108|
|   448| 1864|
|   274| 1346|
|   610| 1302|
|    68| 1260|
|   380| 1218|
|   606| 1115|
|   288| 1055|
+------+-----+



Now, let's merge the rating data with the movies data.

In [None]:
path = 'Data/ml-latest-small/movies.csv'

# Load movies.csv as a Spark data frame named movies
movies = spark.read.format('csv')\
            .option('inferSchema', True)\
            .option('header', True)\
            .load(path)

movies.show(20, False) # Show 20 rows; truncate=False prints full column contents

+-------+-------------------------------------+-------------------------------------------+
|movieId|title                                |genres                                     |
+-------+-------------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                     |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Jumanji (1995)                       |Adventure|Children|Fantasy                 |
|3      |Grumpier Old Men (1995)              |Comedy|Romance                             |
|4      |Waiting to Exhale (1995)             |Comedy|Drama|Romance                       |
|5      |Father of the Bride Part II (1995)   |Comedy                                     |
|6      |Heat (1995)                          |Action|Crime|Thriller                      |
|7      |Sabrina (1995)                       |Comedy|Romance                             |
|8      |Tom and Huck (1995)                  |Adventure|Children               

In [None]:
# merge movies with ratings
data = ratings.join(movies, how='inner', on=['movieId'])
data.show()

+-------+------+------+--------------------+--------------------+
|movieId|userId|rating|               title|              genres|
+-------+------+------+--------------------+--------------------+
|      1|     1|   4.0|    Toy Story (1995)|Adventure|Animati...|
|      3|     1|   4.0|Grumpier Old Men ...|      Comedy|Romance|
|      6|     1|   4.0|         Heat (1995)|Action|Crime|Thri...|
|     47|     1|   5.0|Seven (a.k.a. Se7...|    Mystery|Thriller|
|     50|     1|   5.0|Usual Suspects, T...|Crime|Mystery|Thr...|
|     70|     1|   3.0|From Dusk Till Da...|Action|Comedy|Hor...|
|    101|     1|   5.0|Bottle Rocket (1996)|Adventure|Comedy|...|
|    110|     1|   4.0|   Braveheart (1995)|    Action|Drama|War|
|    151|     1|   5.0|      Rob Roy (1995)|Action|Drama|Roma...|
|    157|     1|   5.0|Canadian Bacon (1...|          Comedy|War|
|    163|     1|   5.0|    Desperado (1995)|Action|Romance|We...|
|    216|     1|   5.0|Billy Madison (1995)|              Comedy|
|    223| 
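
An inner join keeps only the rows whose key appears in both tables. The plain-Python sketch below mimics `ratings.join(movies, how='inner', on=['movieId'])` on a few rows taken from the tables above. The `movies_rows` sample is deliberately incomplete (it omits movieId 6) to show what an inner join drops.

```python
# (userId, movieId, rating) rows taken from the ratings table above.
ratings_rows = [(1, 1, 4.0), (1, 3, 4.0), (1, 6, 4.0)]

# (movieId, title) rows; a deliberately incomplete sample without movieId 6.
movies_rows = [(1, "Toy Story (1995)"),
               (2, "Jumanji (1995)"),
               (3, "Grumpier Old Men (1995)")]

# Build a lookup table keyed on the join column.
title_by_id = dict(movies_rows)

# Inner join: keep a rating only if its movieId appears in the lookup.
joined = [(movie_id, user_id, rating, title_by_id[movie_id])
          for user_id, movie_id, rating in ratings_rows
          if movie_id in title_by_id]

for row in joined:
    print(row)
```

The rating for movieId 6 is dropped because the sample has no matching movie, and movieId 2 is dropped because it has no rating. With `how='left'`, the unmatched rating would be kept with a null title instead.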

In [None]:
q2.show(3)

+-------+-----+
|movieId|count|
+-------+-----+
|    356|  329|
|    318|  317|
|    296|  307|
+-------+-----+
only showing top 3 rows



In [None]:
# Exercise:
# Merge q2 with movies to find the titles of the 10 most-rated movies

data2 = q2.join(movies, how="inner", on=["movieId"]).orderBy(desc("count"))
data2.show(20, False)

+-------+-----+-----------------------------------------+--------------------------------+
|movieId|count|title                                    |genres                          |
+-------+-----+-----------------------------------------+--------------------------------+
|356    |329  |Forrest Gump (1994)                      |Comedy|Drama|Romance|War        |
|318    |317  |Shawshank Redemption, The (1994)         |Crime|Drama                     |
|296    |307  |Pulp Fiction (1994)                      |Comedy|Crime|Drama|Thriller     |
|593    |279  |Silence of the Lambs, The (1991)         |Crime|Horror|Thriller           |
|2571   |278  |Matrix, The (1999)                       |Action|Sci-Fi|Thriller          |
|260    |251  |Star Wars: Episode IV - A New Hope (1977)|Action|Adventure|Sci-Fi         |
|480    |238  |Jurassic Park (1993)                     |Action|Adventure|Sci-Fi|Thriller|
|110    |237  |Braveheart (1995)                        |Action|Drama|War                |

In [None]:
# Query 4: Find the average rating of each movie
avg_ratings = data.select('rating', 'title')\
        .groupby('title')\
        .agg(mean('rating').alias('AvgRating'))
avg_ratings.show()

+--------------------+------------------+
|               title|         AvgRating|
+--------------------+------------------+
|       Psycho (1960)| 4.036144578313253|
|Men in Black (a.k...| 3.487878787878788|
|Gulliver's Travel...|               3.0|
|Heavenly Creature...|3.9285714285714284|
|    Elizabeth (1998)|3.6739130434782608|
|Before Night Fall...|               4.3|
|O Brother, Where ...|3.8085106382978724|
|Snow White and th...| 3.616883116883117|
| Three Wishes (1995)|               3.0|
|When We Were King...|               3.9|
|   Annie Hall (1977)|3.8706896551724137|
| If Lucy Fell (1996)|               2.5|
|First Blood (Ramb...|              3.55|
|Don't Tell Mom th...|2.3461538461538463|
| Nut Job, The (2014)| 4.333333333333333|
|22 Jump Street (2...|3.6842105263157894|
|   Deadpool 2 (2018)|             3.875|
|Starship Troopers...|               1.5|
|Voices from the L...|               4.3|
|Night of the Livi...|              3.75|
+--------------------+------------

In [None]:
avg_ratings = avg_ratings.orderBy(desc('AvgRating'))
avg_ratings.limit(10).show(20, False)

+------------------------------------------+---------+
|title                                     |AvgRating|
+------------------------------------------+---------+
|Martin Lawrence Live: Runteldat (2002)    |5.0      |
|Tickling Giants (2017)                    |5.0      |
|Bill Hicks: Revelations (1993)            |5.0      |
|English Vinglish (2012)                   |5.0      |
|National Lampoon's Bag Boy (2007)         |5.0      |
|Zeitgeist: Moving Forward (2011)          |5.0      |
|Reform School Girls (1986)                |5.0      |
|Shogun Assassin (1980)                    |5.0      |
|'Salem's Lot (2004)                       |5.0      |
|George Carlin: You Are All Diseased (1999)|5.0      |
+------------------------------------------+---------+



Is it strange that no one has heard of any of these top-rated movies?

In [None]:
# Query 5: Find the number of ratings for each movie

num_ratings = data.select('rating', 'title').groupby('title').agg(count('*').alias('NumRating'))
num_ratings.show()

+--------------------+---------+
|               title|NumRating|
+--------------------+---------+
|       Psycho (1960)|       83|
|Men in Black (a.k...|      165|
|Gulliver's Travel...|        3|
|Heavenly Creature...|       21|
|    Elizabeth (1998)|       23|
|Before Night Fall...|        5|
|O Brother, Where ...|       94|
|Snow White and th...|       77|
| Three Wishes (1995)|        1|
|When We Were King...|       10|
|   Annie Hall (1977)|       58|
| If Lucy Fell (1996)|        2|
|First Blood (Ramb...|       30|
|Don't Tell Mom th...|       13|
| Nut Job, The (2014)|        3|
|22 Jump Street (2...|       19|
|   Deadpool 2 (2018)|       12|
|Starship Troopers...|        2|
|Voices from the L...|        5|
|Night of the Livi...|       28|
+--------------------+---------+
only showing top 20 rows



In [None]:
df = avg_ratings.join(num_ratings, how='inner', on=['title'])
df.show()

+--------------------+------------------+---------+
|               title|         AvgRating|NumRating|
+--------------------+------------------+---------+
|       Psycho (1960)| 4.036144578313253|       83|
|Men in Black (a.k...| 3.487878787878788|      165|
|Gulliver's Travel...|               3.0|        3|
|Heavenly Creature...|3.9285714285714284|       21|
|    Elizabeth (1998)|3.6739130434782608|       23|
|Before Night Fall...|               4.3|        5|
|O Brother, Where ...|3.8085106382978724|       94|
|Snow White and th...| 3.616883116883117|       77|
| Three Wishes (1995)|               3.0|        1|
|When We Were King...|               3.9|       10|
|   Annie Hall (1977)|3.8706896551724137|       58|
| If Lucy Fell (1996)|               2.5|        2|
|First Blood (Ramb...|              3.55|       30|
|Don't Tell Mom th...|2.3461538461538463|       13|
| Nut Job, The (2014)| 4.333333333333333|        3|
|22 Jump Street (2...|3.6842105263157894|       19|
|   Deadpool

In [None]:
df.orderBy(desc('AvgRating')).limit(20).show()

+--------------------+---------+---------+
|               title|AvgRating|NumRating|
+--------------------+---------+---------+
|    Radio Day (2008)|      5.0|        1|
|Cosmic Scrat-tast...|      5.0|        1|
|         Rain (2001)|      5.0|        1|
|    Lady Jane (1986)|      5.0|        1|
|Stuart Little 3: ...|      5.0|        1|
|English Vinglish ...|      5.0|        1|
|Tom Segura: Mostl...|      5.0|        1|
|Bill Hicks: Revel...|      5.0|        1|
|Louis Theroux: La...|      5.0|        1|
|Shogun Assassin (...|      5.0|        1|
|In the blue sea, ...|      5.0|        1|
|George Carlin: Yo...|      5.0|        1|
|Vacations in Pros...|      5.0|        1|
|Awfully Big Adven...|      5.0|        1|
|Human Condition I...|      5.0|        1|
|Chinese Puzzle (C...|      5.0|        1|
|Paper Birds (Pája...|      5.0|        1|
|Sonatine (Sonachi...|      5.0|        1|
|        Black Mirror|      5.0|        1|
|Zeitgeist: Moving...|      5.0|        1|
+----------

In [None]:
# Select movies with more than 50 ratings
Rating50 = df.filter(df.NumRating > 50)
Rating50.show()

+--------------------+------------------+---------+
|               title|         AvgRating|NumRating|
+--------------------+------------------+---------+
|       Psycho (1960)| 4.036144578313253|       83|
|Men in Black (a.k...| 3.487878787878788|      165|
|O Brother, Where ...|3.8085106382978724|       94|
|Snow White and th...| 3.616883116883117|       77|
|   Annie Hall (1977)|3.8706896551724137|       58|
|         Hook (1991)| 3.358490566037736|       53|
|Kill Bill: Vol. 2...| 3.868181818181818|      110|
|Eternal Sunshine ...|4.1603053435114505|      131|
|Last Action Hero ...|2.9339622641509435|       53|
|Dumb & Dumber (Du...|3.0601503759398496|      133|
|City Slickers II:...|2.6454545454545455|       55|
|Indiana Jones and...| 3.638888888888889|      108|
|North by Northwes...| 4.184210526315789|       57|
|In the Line of Fi...| 3.692857142857143|       70|
|      Jumanji (1995)|3.4318181818181817|      110|
|        Ghost (1990)|3.4347826086956523|      115|
|The Hunger 

In [None]:
Rating50.count()

437

In [None]:
# Query 6: Find the 10 highest-rated movies with more than 50 ratings.

Rating50.orderBy(desc('AvgRating')).limit(10).show(10, False)

+---------------------------------------------------------------------------+-----------------+---------+
|title                                                                      |AvgRating        |NumRating|
+---------------------------------------------------------------------------+-----------------+---------+
|Shawshank Redemption, The (1994)                                           |4.429022082018927|317      |
|Godfather, The (1972)                                                      |4.2890625        |192      |
|Fight Club (1999)                                                          |4.272935779816514|218      |
|Cool Hand Luke (1967)                                                      |4.271929824561403|57       |
|Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)|4.268041237113402|97       |
|Rear Window (1954)                                                         |4.261904761904762|84       |
|Godfather: Part II, The (1974)               
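
Queries 4 to 6 show why an average alone is a poor ranking signal: a title with a single 5.0 rating outranks a widely rated favorite until a minimum number of ratings is required. The plain-Python sketch below reproduces that effect on made-up data; the titles and the `top_rated` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Made-up (title, rating) pairs; "Obscure Film" has a single 5.0 rating.
rows = [("Popular Film", 4.5), ("Popular Film", 4.0), ("Popular Film", 5.0),
        ("Obscure Film", 5.0),
        ("Average Film", 3.0), ("Average Film", 3.5), ("Average Film", 2.5)]

# Group the ratings by title.
by_title = defaultdict(list)
for title, rating in rows:
    by_title[title].append(rating)

# One (title, AvgRating, NumRating) record per title.
stats = [(t, mean(r), len(r)) for t, r in by_title.items()]


def top_rated(stats, min_ratings):
    # Keep titles with more than min_ratings ratings, then sort by average.
    kept = [s for s in stats if s[2] > min_ratings]
    return sorted(kept, key=lambda s: s[1], reverse=True)


print(top_rated(stats, 0)[0][0])  # Obscure Film
print(top_rated(stats, 2)[0][0])  # Popular Film
```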

In [None]:
# Create both NumRating and AvgRating
df2 = data.select('rating', 'title')\
        .groupby('title')\
        .agg(count('rating').alias('NumRating'), mean('rating').alias('AvgRating'))
df2.show()

+--------------------+---------+------------------+
|               title|NumRating|         AvgRating|
+--------------------+---------+------------------+
|       Psycho (1960)|       83| 4.036144578313253|
|Men in Black (a.k...|      165| 3.487878787878788|
|Gulliver's Travel...|        3|               3.0|
|Heavenly Creature...|       21|3.9285714285714284|
|    Elizabeth (1998)|       23|3.6739130434782608|
|Before Night Fall...|        5|               4.3|
|O Brother, Where ...|       94|3.8085106382978724|
|Snow White and th...|       77| 3.616883116883117|
| Three Wishes (1995)|        1|               3.0|
|When We Were King...|       10|               3.9|
|   Annie Hall (1977)|       58|3.8706896551724137|
| If Lucy Fell (1996)|        2|               2.5|
|First Blood (Ramb...|       30|              3.55|
|Don't Tell Mom th...|       13|2.3461538461538463|
| Nut Job, The (2014)|        3| 4.333333333333333|
|22 Jump Street (2...|       19|3.6842105263157894|
|   Deadpool