3. Load the movies.json file from DBFS into a DataFrame, display the first 10 records, and print the DataFrame's schema. (2 marks)

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define an explicit schema instead of relying on schema inference
customSchema = StructType([
    StructField("actor_name", StringType(), True), 
    StructField("movie_title", StringType(), True), 
    StructField("produced_year", IntegerType(), True)
])

# Load JSON file
df = spark.read.schema(customSchema).json("dbfs:/FileStore/movies.json")

# Show schema
df.printSchema()

# Display first 10 records
df.show(10, truncate=False)


root
 |-- actor_name: string (nullable = true)
 |-- movie_title: string (nullable = true)
 |-- produced_year: integer (nullable = true)

+-----------------+---------------------------+-------------+
|actor_name       |movie_title                |produced_year|
+-----------------+---------------------------+-------------+
|McClure, Marc (I)|Coach Carter               |2005         |
|McClure, Marc (I)|Superman II                |1980         |
|McClure, Marc (I)|Apollo 13                  |1995         |
|McClure, Marc (I)|Superman                   |1978         |
|McClure, Marc (I)|Back to the Future         |1985         |
|McClure, Marc (I)|Back to the Future Part III|1990         |
|Cooper, Chris (I)|Me, Myself & Irene         |2000         |
|Cooper, Chris (I)|October Sky                |1999         |
|Cooper, Chris (I)|Capote                     |2005         |
|Cooper, Chris (I)|The Bourne Supremacy       |2004         |
+-----------------+---------------------------+-------------+

4. Use the ‘withColumn’ command to add a new column ‘decade’. Store the result in a new DataFrame and print the first 10 movies to verify the column was added. (2 marks)

In [0]:
from pyspark.sql.functions import floor

# Add decade column
df2 = df.withColumn("decade", floor(df.produced_year / 10) * 10)

# Show first 10 movies
df2.show(10, truncate=False)


+-----------------+---------------------------+-------------+------+
|actor_name       |movie_title                |produced_year|decade|
+-----------------+---------------------------+-------------+------+
|McClure, Marc (I)|Coach Carter               |2005         |2000  |
|McClure, Marc (I)|Superman II                |1980         |1980  |
|McClure, Marc (I)|Apollo 13                  |1995         |1990  |
|McClure, Marc (I)|Superman                   |1978         |1970  |
|McClure, Marc (I)|Back to the Future         |1985         |1980  |
|McClure, Marc (I)|Back to the Future Part III|1990         |1990  |
|Cooper, Chris (I)|Me, Myself & Irene         |2000         |2000  |
|Cooper, Chris (I)|October Sky                |1999         |1990  |
|Cooper, Chris (I)|Capote                     |2005         |2000  |
|Cooper, Chris (I)|The Bourne Supremacy       |2004         |2000  |
+-----------------+---------------------------+-------------+------+
only showing top 10 rows
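
The decade values can be sanity-checked outside Spark. The sketch below is plain Python, not PySpark: it applies integer division to the ten produced_year values displayed above, which should agree with floor(produced_year / 10) * 10 for these positive years. The helper name to_decade is an assumption made for illustration.

```python
# Plain-Python check of the decade arithmetic (not Spark code).
# `to_decade` is a hypothetical helper name used only for this sketch.
def to_decade(year: int) -> int:
    # Floor-divide by 10 to drop the last digit, then scale back up.
    return (year // 10) * 10

# produced_year values copied from the ten rows displayed above
years = [2005, 1980, 1995, 1978, 1985, 1990, 2000, 1999, 2005, 2004]
decades = [to_decade(y) for y in years]
print(decades)
```

The printed list can be compared line by line with the decade column in the Spark output.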



5. Use the ‘withColumnRenamed’ command to rename the first two columns to ‘actor’ and ‘title’. Store the result in a new dataframe and print the first 10 movies to verify the columns were renamed. (2 marks)

In [0]:
# Rename actor_name -> actor and movie_title -> title
df3 = df2.withColumnRenamed("actor_name", "actor").withColumnRenamed("movie_title", "title")

# Show first 10 movies to verify
df3.show(10, truncate=False)


+-----------------+---------------------------+-------------+------+
|actor            |title                      |produced_year|decade|
+-----------------+---------------------------+-------------+------+
|McClure, Marc (I)|Coach Carter               |2005         |2000  |
|McClure, Marc (I)|Superman II                |1980         |1980  |
|McClure, Marc (I)|Apollo 13                  |1995         |1990  |
|McClure, Marc (I)|Superman                   |1978         |1970  |
|McClure, Marc (I)|Back to the Future         |1985         |1980  |
|McClure, Marc (I)|Back to the Future Part III|1990         |1990  |
|Cooper, Chris (I)|Me, Myself & Irene         |2000         |2000  |
|Cooper, Chris (I)|October Sky                |1999         |1990  |
|Cooper, Chris (I)|Capote                     |2005         |2000  |
|Cooper, Chris (I)|The Bourne Supremacy       |2004         |2000  |
+-----------------+---------------------------+-------------+------+
only showing top 10 rows



6. Use the DataFrame API to determine which decade has the most movies. (4 marks)

In [0]:
# Count rows (one per actor-title record) per decade, largest count first
df_decade_count = df2.groupBy("decade").count().orderBy("count", ascending=False)

# Show the decade with the most movies
df_decade_count.show(1)

+------+-----+
|decade|count|
+------+-----+
|  2000|18622|
+------+-----+
only showing top 1 row
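
Note that groupBy("decade").count() counts rows, and each row here is one actor-title record, so a title credited to several actors is counted once per actor. If the question is read as asking for distinct movies, the titles would need to be deduplicated first (in PySpark, countDistinct from pyspark.sql.functions is one option). The plain-Python sketch below illustrates the difference on a few made-up rows; the sample data is hypothetical and is not taken from movies.json.

```python
from collections import Counter

# Hypothetical (actor, title, decade) records, NOT taken from movies.json.
rows = [
    ("Actor A", "Movie X", 2000),
    ("Actor B", "Movie X", 2000),
    ("Actor C", "Movie X", 2000),
    ("Actor A", "Movie Y", 1990),
    ("Actor B", "Movie Z", 1990),
]

# Row count per decade: this mirrors what a plain group-and-count measures.
rows_per_decade = Counter(decade for _, _, decade in rows)

# Distinct titles per decade: deduplicate (title, decade) pairs first.
unique_titles = {(title, decade) for _, title, decade in rows}
titles_per_decade = Counter(decade for _, decade in unique_titles)

print(rows_per_decade.most_common(1))    # decade with the most rows
print(titles_per_decade.most_common(1))  # decade with the most distinct titles
```

In this made-up sample the two readings pick different decades, which is why it may be worth checking both on the real data before reporting an answer.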



7. Compute the number of movies each actor was in. The output should have two columns: actor, count. The output should be ordered by the count in descending order. (4 marks)

In [0]:
# Count movies per actor
df5 = df3.groupBy("actor").count().sort("count", ascending=False)

# Show top 10 actors with the most movies
df5.show(10, truncate=False)


+------------------+-----+
|actor             |count|
+------------------+-----+
|Tatasciore, Fred  |38   |
|Welker, Frank     |38   |
|Jackson, Samuel L.|32   |
|Harnell, Jess     |31   |
|Damon, Matt       |27   |
|Willis, Bruce     |27   |
|Cummings, Jim (I) |26   |
|Hanks, Tom        |25   |
|Lynn, Sherry (I)  |25   |
|McGowan, Mickie   |25   |
+------------------+-----+
only showing top 10 rows
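
Several actors in the output above share the same count (for example 38 and 27), and sorting on count alone does not specify how such ties are ordered. One way to make the order reproducible is to add a second sort key. The sketch below shows the idea in plain Python on the ten rows displayed above; using the actor name as the tie-breaker is an assumption, not part of the assignment.

```python
# (actor, count) pairs copied from the output above, deliberately shuffled.
counts = [
    ("Willis, Bruce", 27),
    ("McGowan, Mickie", 25),
    ("Welker, Frank", 38),
    ("Hanks, Tom", 25),
    ("Jackson, Samuel L.", 32),
    ("Damon, Matt", 27),
    ("Lynn, Sherry (I)", 25),
    ("Tatasciore, Fred", 38),
    ("Cummings, Jim (I)", 26),
    ("Harnell, Jess", 31),
]

# Sort by count descending, then by actor name ascending to break ties.
ranked = sorted(counts, key=lambda row: (-row[1], row[0]))

for actor, n in ranked:
    print(f"{actor}: {n}")
```

In PySpark the same idea would be to sort on both columns, for example count descending followed by actor ascending, rather than on count alone.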

