# Getting Started with Spark

### Analyzing IMDB Data

Public data available at https://datasets.imdbws.com.

First, we set up the SparkSession.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.appName("IMDBData").getOrCreate()

# Set the log level to ERROR to keep the notebook output quiet
spark.sparkContext.setLogLevel("ERROR")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/02/02 12:36:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Then, we define the schema for each table. With an explicit schema, Spark does not need an extra pass over the files to infer the column types, which makes reading noticeably faster.

In [2]:
schema_names = "nconst string, primaryName string, birthYear int, deathYear int, primaryProfession string, knownForTitles string"
schema_basics = """
tconst string, titleType string, primaryTitle string, originalTitle string, isAdult int, startYear int, endYear int,
runtimeMinutes double, genres string
"""
schema_crew = "tconst string, directors string, writers string"
schema_principals = "tconst string, ordering int, nconst string, category string, job string, characters string"
schema_ratings = "tconst string, averageRating double, numVotes int"
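
To see why a declared schema helps, recall that inferring column types from a delimited text file means reading the values once just to decide on the types. The plain-Python sketch below is only a toy model of that idea, not Spark's implementation; `infer_type` and `sample_rows` are hypothetical names, and the sample values are made up for illustration.

```python
# Illustrative only: a toy model of type inference, not Spark's implementation.
# `infer_type` and `sample_rows` are hypothetical; the values are made up.

def infer_type(values):
    """Return the narrowest of 'int', 'double', 'string' that fits every value."""
    def fits(cast):
        for v in values:
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if fits(int):
        return "int"
    if fits(float):
        return "double"
    return "string"


columns = ["tconst", "averageRating", "numVotes"]
sample_rows = [
    ["tt0000001", "5.7", "2000"],
    ["tt0000002", "5.8", "270"],
]

# Inference has to look at every value of every column: one full pass over
# the data before the actual read can start.
inferred = {
    name: infer_type([row[i] for row in sample_rows])
    for i, name in enumerate(columns)
}
print(inferred)
```

With a schema string such as `schema_ratings`, this scan is unnecessary because the types are already known.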

Reading the tables.

In [15]:
names = (
    spark
    .read
    .schema(schema_names)
    .options(header=True, delimiter="\t")
    .csv('data/imdb/names.tsv.gz')
)

basics = (
    spark
    .read
    .schema(schema_basics)
    .options(header=True, delimiter="\t")
    .csv('data/imdb/basics.tsv.gz')
)

crew = (
    spark
    .read
    .schema(schema_crew)
    .options(header=True, delimiter="\t")
    .csv('data/imdb/crew.tsv.gz')
)

principals = (
    spark
    .read
    .schema(schema_principals)
    .options(header=True, delimiter="\t")
    .csv('data/imdb/principals.tsv.gz')
)

ratings = (
    spark
    .read
    .schema(schema_ratings)
    .options(header=True, delimiter="\t")
    .csv('data/imdb/ratings.tsv.gz')
)

Now, we check that the schemas were applied correctly.

In [4]:
print("NAMES Schema")
names.printSchema()
print("BASICS Schema")
basics.printSchema()
print("CREW Schema")
crew.printSchema()
print("PRINCIPALS Schema")
principals.printSchema()
print("RATINGS Schema")
ratings.printSchema()

NAMES Schema
root
 |-- nconst: string (nullable = true)
 |-- primaryName: string (nullable = true)
 |-- birthYear: integer (nullable = true)
 |-- deathYear: integer (nullable = true)
 |-- primaryProfession: string (nullable = true)
 |-- knownForTitles: string (nullable = true)

BASICS Schema
root
 |-- tconst: string (nullable = true)
 |-- titleType: string (nullable = true)
 |-- primaryTitle: string (nullable = true)
 |-- originalTitle: string (nullable = true)
 |-- isAdult: integer (nullable = true)
 |-- startYear: integer (nullable = true)
 |-- endYear: integer (nullable = true)
 |-- runtimeMinutes: double (nullable = true)
 |-- genres: string (nullable = true)

CREW Schema
root
 |-- tconst: string (nullable = true)
 |-- directors: string (nullable = true)
 |-- writers: string (nullable = true)

PRINCIPALS Schema
root
 |-- tconst: string (nullable = true)
 |-- ordering: integer (nullable = true)
 |-- nconst: string (nullable = true)
 |-- category: string (nullable = true)
 |-- job: s

We take a look at the `names` table and filter for people with the name **Keanu** who are **actors**.

In [16]:
names.filter("primaryName LIKE 'Keanu%' AND primaryProfession LIKE '%actor%'").show()

+----------+---------------+---------+---------+--------------------+--------------------+
|    nconst|    primaryName|birthYear|deathYear|   primaryProfession|      knownForTitles|
+----------+---------------+---------+---------+--------------------+--------------------+
| nm0000206|   Keanu Reeves|     1964|     null|actor,producer,so...|tt0102685,tt02342...|
|nm10158822|    Keanu Anoai|     null|     null|               actor|           tt8174116|
|nm10263947|  Keanu Yamaoka|     null|     null|               actor|           tt9287504|
|nm10456419|    Keanu Parks|     null|     null|               actor|           tt3208026|
|nm10515123|     Keanu Blye|     null|     null|producer,actor,wr...|tt23133930,tt2240...|
|nm10670161|    Keanu Kulin|     null|     null|               actor|           tt9483838|
|nm10806151|   Keanu Peyran|     null|     null|               actor|tt14250484,tt1600...|
|nm10823300|          Keanu|     null|     null|               actor|                  \N|

                                                                                

The `knownForTitles` column holds several title IDs in a single comma-separated string. To join this table with the titles' `basics` table, we split the string into an array and explode it so that each title ID gets its own row.

In [17]:
names = names.select(
    'nconst', 'primaryName', 'birthYear', 'deathYear', 
    f.explode(f.split('knownForTitles', ',')).alias('knownForTitles')
)
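
As a rough mental model of what `f.split` followed by `f.explode` does to each row, here is a plain-Python sketch. It is illustrative only and does not use Spark; the sample values come from this notebook's own outputs. Note that `\N` is IMDb's marker for a missing value. Because the reader was not told to treat it as null, it stays a literal string and still produces one row.

```python
# Illustrative only: split + explode expressed with plain Python lists.
# Sample values are taken from the outputs shown in this notebook.
rows = [
    ("nm0000206", "Keanu Reeves", "tt0102685,tt0234215,tt0133093,tt0111257"),
    ("nm10823300", "Keanu", r"\N"),  # \N marks a missing value in IMDb files
]

exploded = [
    (nconst, name, title_id)
    for nconst, name, titles in rows
    for title_id in titles.split(",")  # split: one string -> list of IDs
]  # the nested loop plays the role of explode: one output row per ID

for row in exploded:
    print(row)
```

The first input row becomes four output rows, one per title ID, while the second stays a single row whose title is the literal `\N`.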

We need to do exactly the same with the `directors` column of the `crew` table.

In [19]:
crew.filter("directors LIKE '%,%'").show()

+---------+--------------------+-------------------+
|   tconst|           directors|            writers|
+---------+--------------------+-------------------+
|tt0000007| nm0005690,nm0374658|                 \N|
|tt0000012| nm0525908,nm0525910|                 \N|
|tt0000017| nm1587194,nm0804434|                 \N|
|tt0000030| nm0010291,nm0666972|                 \N|
|tt0000089| nm0525908,nm0698645|                 \N|
|tt0000093| nm0525908,nm0525910|                 \N|
|tt0000247|nm2156608,nm00056...|nm0000636,nm0002504|
|tt0000287| nm0085865,nm0807236|                 \N|
|tt0000335| nm0095714,nm0675140|                 \N|
|tt0000380| nm0634629,nm0954087|          nm0674518|
|tt0000387| nm0617588,nm0881616|                 \N|
|tt0000399| nm2092030,nm0692105|                 \N|
|tt0000420| nm0832948,nm0378408|nm0140902,nm0378408|
|tt0000436| nm0095816,nm0666972|                 \N|
|tt0000447| nm2092030,nm0692105|          nm0692105|
|tt0000498| nm0280432,nm0378408|          nm15

In [7]:
crew = crew.select(
    'tconst', f.explode(f.split('directors', ',')).alias('directors'), 'writers'
)

Now, let's list the titles Keanu Reeves is known for.

In [20]:
only_keanu = names.filter("primaryName = 'Keanu Reeves'")
only_keanu.show()

+---------+------------+---------+---------+--------------+
|   nconst| primaryName|birthYear|deathYear|knownForTitles|
+---------+------------+---------+---------+--------------+
|nm0000206|Keanu Reeves|     1964|     null|     tt0102685|
|nm0000206|Keanu Reeves|     1964|     null|     tt0234215|
|nm0000206|Keanu Reeves|     1964|     null|     tt0133093|
|nm0000206|Keanu Reeves|     1964|     null|     tt0111257|
+---------+------------+---------+---------+--------------+



                                                                                

In [23]:
keanus_movies = (
    basics.select('tconst', 'primaryTitle', 'startYear')
    .join(
        only_keanu.select('primaryName', 'knownForTitles'), 
        basics.tconst == only_keanu.knownForTitles, how='inner'
    )
)

In [24]:
keanus_movies.explain('formatted')

== Physical Plan ==
AdaptiveSparkPlan (11)
+- SortMergeJoin Inner (10)
   :- Sort (4)
   :  +- Exchange (3)
   :     +- Filter (2)
   :        +- Scan csv  (1)
   +- Sort (9)
      +- Exchange (8)
         +- Generate (7)
            +- Filter (6)
               +- Scan csv  (5)


(1) Scan csv 
Output [3]: [tconst#274, primaryTitle#276, startYear#279]
Batched: false
Location: InMemoryFileIndex [file:/Users/neylsoncrepalde/Bigdata-on-Kubernetes/Chapter 5/data/imdb/basics.tsv.gz]
PushedFilters: [IsNotNull(tconst)]
ReadSchema: struct<tconst:string,primaryTitle:string,startYear:int>

(2) Filter
Input [3]: [tconst#274, primaryTitle#276, startYear#279]
Condition : isnotnull(tconst#274)

(3) Exchange
Input [3]: [tconst#274, primaryTitle#276, startYear#279]
Arguments: hashpartitioning(tconst#274, 200), ENSURE_REQUIREMENTS, [plan_id=374]

(4) Sort
Input [3]: [tconst#274, primaryTitle#276, startYear#279]
Arguments: [tconst#274 ASC NULLS FIRST], false, 0

(5) Scan csv 
Output [2]: [primaryName#26

In [25]:
keanus_movies.show()

                                                                                

+---------+-------------------+---------+------------+--------------+
|   tconst|       primaryTitle|startYear| primaryName|knownForTitles|
+---------+-------------------+---------+------------+--------------+
|tt0111257|              Speed|     1994|Keanu Reeves|     tt0111257|
|tt0234215|The Matrix Reloaded|     2003|Keanu Reeves|     tt0234215|
|tt0102685|        Point Break|     1991|Keanu Reeves|     tt0102685|
|tt0133093|         The Matrix|     1999|Keanu Reeves|     tt0133093|
+---------+-------------------+---------+------------+--------------+
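
The plan printed by `explain` contains a `SortMergeJoin` with an `Exchange` and a `Sort` on each side. The plain-Python sketch below shows the core idea of such a join on in-memory lists. It is a simplified illustration, not Spark's implementation: `sort_merge_join` is a hypothetical helper, and the `tt9999999` row is a made-up placeholder for a title that has no match.

```python
# Illustrative only: a toy sort-merge inner join on in-memory lists.
# Spark additionally redistributes rows by key across partitions first
# (the Exchange steps in the plan); that part is not modeled here.

def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) pairs by sorting and merging."""
    left = sorted(left, key=lambda kv: kv[0])
    right = sorted(right, key=lambda kv: kv[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit every right row sharing this key, then advance the left side.
            k = j
            while k < len(right) and right[k][0] == lk:
                out.append((lk, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out


titles = [
    ("tt0111257", "Speed"),
    ("tt0234215", "The Matrix Reloaded"),
    ("tt9999999", "Placeholder Title"),  # hypothetical non-matching row
    ("tt0102685", "Point Break"),
    ("tt0133093", "The Matrix"),
]
known_for = [
    ("tt0102685", "Keanu Reeves"),
    ("tt0234215", "Keanu Reeves"),
    ("tt0133093", "Keanu Reeves"),
    ("tt0111257", "Keanu Reeves"),
]

joined = sort_merge_join(titles, known_for)
for row in joined:
    print(row)
```

Both inputs have to be sorted in full before the merge starts, which is why this strategy is costly when one side is as large as `basics`.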



The plan shows a sort-merge join: both sides are shuffled (the `Exchange` steps) and sorted before being merged. Since `only_keanu` holds only a handful of rows, let's try to reduce the processing time by hinting Spark to use a broadcast join instead.

In [26]:
keanus_movies2 = (
    basics.select(
        'tconst', 'primaryTitle', 'startYear'
    ).join(
        f.broadcast(only_keanu), 
        basics.tconst == only_keanu.knownForTitles, how='inner'
    )
)

In [27]:
keanus_movies2.explain('formatted')

== Physical Plan ==
AdaptiveSparkPlan (8)
+- BroadcastHashJoin Inner BuildRight (7)
   :- Filter (2)
   :  +- Scan csv  (1)
   +- BroadcastExchange (6)
      +- Generate (5)
         +- Filter (4)
            +- Scan csv  (3)


(1) Scan csv 
Output [3]: [tconst#274, primaryTitle#276, startYear#279]
Batched: false
Location: InMemoryFileIndex [file:/Users/neylsoncrepalde/Bigdata-on-Kubernetes/Chapter 5/data/imdb/basics.tsv.gz]
PushedFilters: [IsNotNull(tconst)]
ReadSchema: struct<tconst:string,primaryTitle:string,startYear:int>

(2) Filter
Input [3]: [tconst#274, primaryTitle#276, startYear#279]
Condition : isnotnull(tconst#274)

(3) Scan csv 
Output [5]: [nconst#262, primaryName#263, birthYear#264, deathYear#265, knownForTitles#267]
Batched: false
Location: InMemoryFileIndex [file:/Users/neylsoncrepalde/Bigdata-on-Kubernetes/Chapter 5/data/imdb/names.tsv.gz]
PushedFilters: [IsNotNull(primaryName), EqualTo(primaryName,Keanu Reeves)]
ReadSchema: struct<nconst:string,primaryName:string,bir

In [28]:
keanus_movies2.show()

+---------+-------------------+---------+---------+------------+---------+---------+--------------+
|   tconst|       primaryTitle|startYear|   nconst| primaryName|birthYear|deathYear|knownForTitles|
+---------+-------------------+---------+---------+------------+---------+---------+--------------+
|tt0102685|        Point Break|     1991|nm0000206|Keanu Reeves|     1964|     null|     tt0102685|
|tt0111257|              Speed|     1994|nm0000206|Keanu Reeves|     1964|     null|     tt0111257|
|tt0133093|         The Matrix|     1999|nm0000206|Keanu Reeves|     1964|     null|     tt0133093|
|tt0234215|The Matrix Reloaded|     2003|nm0000206|Keanu Reeves|     1964|     null|     tt0234215|
+---------+-------------------+---------+---------+------------+---------+---------+--------------+
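
The `BroadcastHashJoin` in the plan above avoids shuffling and sorting the large table. Conceptually, the small side is turned into a hash table that every worker receives, and the large side is streamed past it. The following plain-Python sketch illustrates that idea under simplifying assumptions (a single process, in-memory lists). `broadcast_hash_join` is a hypothetical helper, and the `tt9999999` row is a made-up placeholder for a non-matching title.

```python
# Illustrative only: the idea behind a broadcast hash join, in plain Python.
# In Spark the lookup table would be copied to every executor; here there is
# just one process, so "broadcasting" is simply building a dict.

def broadcast_hash_join(large, small):
    """Inner-join by hashing `small` and streaming over `large` once."""
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    return [
        (key, value, match)
        for key, value in large          # single pass, no sort, no shuffle
        for match in lookup.get(key, [])
    ]


titles = [
    ("tt0102685", "Point Break"),
    ("tt0111257", "Speed"),
    ("tt9999999", "Placeholder Title"),  # hypothetical non-matching row
    ("tt0133093", "The Matrix"),
    ("tt0234215", "The Matrix Reloaded"),
]
known_for = [
    ("tt0102685", "Keanu Reeves"),
    ("tt0234215", "Keanu Reeves"),
    ("tt0133093", "Keanu Reeves"),
    ("tt0111257", "Keanu Reeves"),
]

joined = broadcast_hash_join(titles, known_for)
for row in joined:
    print(row)
```

This only pays off when the hashed side is small enough to fit comfortably in memory, which is the case for the four rows of `only_keanu`.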



                                                                                

In [None]:
principals.filter("category = 'actor'").show()

In [None]:
names.filter("primaryName = 'Christian Bale'").show()

In [None]:
names.count()