# Lab #5: Spark DataFrames and Spark SQL

In [0]:
from pyspark.sql.functions import col, lit, concat, substring

tsv_dataframe = spark.read.option("sep", "\t").option("header", "false").csv("dbfs:/FileStore/tables/movie_ratings.tsv")
column_names = ["rating", "title", "year"]
tsv_dataframe = tsv_dataframe.toDF(*column_names)
tsv_dataframe = tsv_dataframe.withColumn("rating", col("rating").cast("double"))
df = spark.read.json("dbfs:/FileStore/movies.json")



Repeat Steps #1 - #4 from Lab #3. This creates a Spark DataFrame. Name this DataFrame `movies_with_decade`.

In [0]:
movies_with_decade = df.withColumn("Decade",concat(substring(df.produced_year.cast("String"), -2, 1), lit("0's")))
movies_with_decade.show(10)

+-----------------+--------------------+-------------+------+
|       actor_name|         movie_title|produced_year|Decade|
+-----------------+--------------------+-------------+------+
|McClure, Marc (I)|        Coach Carter|         2005|  00's|
|McClure, Marc (I)|         Superman II|         1980|  80's|
|McClure, Marc (I)|           Apollo 13|         1995|  90's|
|McClure, Marc (I)|            Superman|         1978|  70's|
|McClure, Marc (I)|  Back to the Future|         1985|  80's|
|McClure, Marc (I)|Back to the Futur...|         1990|  90's|
|Cooper, Chris (I)|  Me, Myself & Irene|         2000|  00's|
|Cooper, Chris (I)|         October Sky|         1999|  90's|
|Cooper, Chris (I)|              Capote|         2005|  00's|
|Cooper, Chris (I)|The Bourne Supremacy|         2004|  00's|
+-----------------+--------------------+-------------+------+
only showing top 10 rows
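
To make the `substring(..., -2, 1)` expression easier to follow, here is a plain-Python sketch of the same string logic. The helper name `decade_label` is hypothetical; it only mirrors the Column expression and does not run on Spark.

```python
def decade_label(produced_year: int) -> str:
    """Mirror concat(substring(cast(year as string), -2, 1), lit("0's")) in plain Python."""
    year_text = str(produced_year)
    # Spark's substring with position -2 and length 1 picks the second-to-last
    # character, which is the tens digit of a four-digit year.
    tens_digit = year_text[-2]
    return tens_digit + "0's"


for year in (1978, 1985, 2005):
    print(year, decade_label(year))
```

Note that this label drops the century, so 1905 and 2005 would both map to `00's`.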



1. Create a SQL view from the `movies_with_decade` DataFrame. Run a command to display all the views (you will need to look up this command). (2 marks)

In [0]:
# Register the DataFrame created earlier as a temporary view so it can be queried with SQL.
movies_with_decade.createOrReplaceTempView("movies_with_decade_temp_view")
spark.sql("SELECT * FROM movies_with_decade_temp_view").show()


+-----------------+--------------------+-------------+------+
|       actor_name|         movie_title|produced_year|Decade|
+-----------------+--------------------+-------------+------+
|McClure, Marc (I)|        Coach Carter|         2005|  00's|
|McClure, Marc (I)|         Superman II|         1980|  80's|
|McClure, Marc (I)|           Apollo 13|         1995|  90's|
|McClure, Marc (I)|            Superman|         1978|  70's|
|McClure, Marc (I)|  Back to the Future|         1985|  80's|
|McClure, Marc (I)|Back to the Futur...|         1990|  90's|
|Cooper, Chris (I)|  Me, Myself & Irene|         2000|  00's|
|Cooper, Chris (I)|         October Sky|         1999|  90's|
|Cooper, Chris (I)|              Capote|         2005|  00's|
|Cooper, Chris (I)|The Bourne Supremacy|         2004|  00's|
|Cooper, Chris (I)|         The Patriot|         2000|  00's|
|Cooper, Chris (I)|            The Town|         2010|  10's|
|Cooper, Chris (I)|          Seabiscuit|         2003|  00's|
|Cooper,
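
The question also asks for a command that displays all views, which the cell for this question does not run. One option, sketched here, is the catalog API: `spark.catalog.listTables()` returns tables and views, and each entry carries an `isTemporary` flag. The helper name `temp_view_names` is hypothetical, and `spark` is assumed to be the notebook's active `SparkSession`. On Spark 3.0 or later, `spark.sql("SHOW VIEWS").show()` should give a similar listing.

```python
def temp_view_names(spark):
    """Return the names of temporary views visible to the given SparkSession."""
    # listTables() covers the current database plus session-scoped temporary views.
    return [table.name for table in spark.catalog.listTables() if table.isTemporary]


# In the notebook, where `spark` already exists:
# print(temp_view_names(spark))
```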

2. Use the Spark session `sql` function to determine which decade has the largest number of movies. (4 marks)

In [0]:

spark.sql("Select Decade, count(Decade) as num_produced from movies_with_decade_temp_view group by Decade order by num_produced desc limit 1").show()

+------+------------+
|Decade|num_produced|
+------+------------+
|  00's|       18622|
+------+------------+
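
Because the view holds one row per actor per movie, `count(Decade)` counts actor appearances rather than distinct films. The sketch below shows the difference on a tiny made-up table. It uses Python's built-in `sqlite3` as a stand-in for Spark so it runs anywhere; the table name `movies` and its rows are hypothetical, and `COUNT(DISTINCT movie_title)` reflects an assumption about what "number of movies" is meant to measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (actor_name TEXT, movie_title TEXT, decade TEXT)")
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", [
    ("Actor A", "Film 1", "90's"),  # one film with three cast rows
    ("Actor B", "Film 1", "90's"),
    ("Actor C", "Film 1", "90's"),
    ("Actor A", "Film 2", "00's"),  # two films with one cast row each
    ("Actor B", "Film 3", "00's"),
])

# Counts rows, so a decade with large casts can win.
rows_per_decade = conn.execute(
    "SELECT decade, COUNT(decade) AS n FROM movies "
    "GROUP BY decade ORDER BY n DESC, decade LIMIT 1"
).fetchall()

# Counts each title once per decade.
films_per_decade = conn.execute(
    "SELECT decade, COUNT(DISTINCT movie_title) AS n FROM movies "
    "GROUP BY decade ORDER BY n DESC, decade LIMIT 1"
).fetchall()

print(rows_per_decade)
print(films_per_decade)
```

If distinct films are what the question intends, the same `COUNT(DISTINCT movie_title)` expression can be used in the Spark SQL query against `movies_with_decade_temp_view`.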



3. Use the Spark session `sql` function to compute the number of movies each actor was in. The output should have two columns: actor, count. The output should be ordered by the count in descending order. (4 marks)

In [0]:
spark.sql("select actor_name as actor, count(actor_name) as count from movies_with_decade_temp_view group by actor order by count desc ").show()

+-------------------+-----+
|              actor|count|
+-------------------+-----+
|   Tatasciore, Fred|   38|
|      Welker, Frank|   38|
| Jackson, Samuel L.|   32|
|      Harnell, Jess|   31|
|        Damon, Matt|   27|
|      Willis, Bruce|   27|
|  Cummings, Jim (I)|   26|
|         Hanks, Tom|   25|
|   Lynn, Sherry (I)|   25|
|    McGowan, Mickie|   25|
|    Bergen, Bob (I)|   25|
|      Proctor, Phil|   24|
|   Wilson, Owen (I)|   23|
|        Cruise, Tom|   23|
|         Pitt, Brad|   23|
|Freeman, Morgan (I)|   22|
|Williams, Robin (I)|   22|
|       Depp, Johnny|   22|
|     Morrison, Rana|   22|
|      Diaz, Cameron|   21|
+-------------------+-----+
only showing top 20 rows




4. Compute the highest-rated movie per year. The output should have only one movie per year, and it should contain three columns:
    - year
    - movie title
    - rating
    - One solution is to use a correlated subquery. (6 marks)

In [0]:
tsv_dataframe.createOrReplaceTempView("tsv_temp_view")
spark.sql("select year, title, rating from tsv_temp_view WHERE rating = (SELECT MAX(rating) FROM tsv_temp_view m2 WHERE m2.year = tsv_temp_view.year) order by year desc").show()

+----+--------------------+-------+
|year|               title| rating|
+----+--------------------+-------+
|2013|       The Wolverine|   12.5|
|2012|                1066|12.8205|
|2011|Ang babae sa sept...|14.1527|
|2010|           Beginners|14.2173|
|2009|Kimmy Dora: Kamba...|13.7234|
|2008|         Man on Wire|14.0356|
|2007|     Hostel: Part II|13.7432|
|2006|Love and Other Di...|13.7696|
|2005|             The Man|14.1976|
|2004|           Sleepover|14.2073|
|2003|               Gigli|14.1829|
|2002|         Extreme Ops|12.8821|
|2001|       Tortilla Soup|14.0009|
|2000|              Taxi 2|14.1178|
|1999|       Sofies verden|14.0421|
|1998|            My Giant| 14.015|
|1997|  Leave It to Beaver|13.7106|
|1996|          Sgt. Bilko|   14.1|
|1995|Something to Talk...|13.0879|
|1994|Änglagård - Andra...|14.1242|
+----+--------------------+-------+
only showing top 20 rows
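
The correlated subquery returns every movie that ties for a year's top rating, so it can yield more than one row for a year. A window function is one way to guarantee a single row per year. The sketch below compares both approaches on hypothetical data, with `sqlite3` standing in for Spark (window functions need SQLite 3.25 or later). Spark SQL also supports `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)`. The table name `ratings` and the tie-break on title are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (rating REAL, title TEXT, year INTEGER)")
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", [
    (9.0, "Alpha", 2000),
    (9.0, "Beta", 2000),   # ties with Alpha for the 2000 maximum
    (7.5, "Gamma", 2000),
    (8.0, "Delta", 2001),
])

# Correlated subquery: keeps every row whose rating equals its year's maximum.
correlated = conn.execute("""
    SELECT year, title, rating FROM ratings r1
    WHERE rating = (SELECT MAX(rating) FROM ratings r2 WHERE r2.year = r1.year)
    ORDER BY year DESC, title
""").fetchall()

# Window function: numbers rows within each year and keeps only the first.
ranked = conn.execute("""
    SELECT year, title, rating FROM (
        SELECT year, title, rating,
               ROW_NUMBER() OVER (PARTITION BY year ORDER BY rating DESC, title) AS rn
        FROM ratings
    ) AS t
    WHERE rn = 1
    ORDER BY year DESC
""").fetchall()

print(correlated)
print(ranked)
```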



5. Determine which pair of actors worked together the most.
    - Working together is defined as appearing in the same movie. The output should have three columns:
        - actor1
        - actor2
        - count
    - The output should be sorted by the count in descending order.
    - The solution to this question requires a self-join. (6 marks)

In [0]:
spark.sql("""
          select m1.actor_name as actor1, m2.actor_name as actor2, count(*) as count
          from movies_with_decade_temp_view m1 
          join movies_with_decade_temp_view m2 on m1.movie_title = m2.movie_title
          and m1.actor_name > m2.actor_name
          group by actor1, actor2
          order by count desc
          """).show()


+------------------+-----------------+-----+
|            actor1|           actor2|count|
+------------------+-----------------+-----+
|   McGowan, Mickie| Lynn, Sherry (I)|   23|
|   McGowan, Mickie|  Bergen, Bob (I)|   19|
|  Lynn, Sherry (I)|  Bergen, Bob (I)|   19|
|   McGowan, Mickie|  Angel, Jack (I)|   17|
|  Lynn, Sherry (I)|  Angel, Jack (I)|   17|
|       Rabson, Jan|  McGowan, Mickie|   16|
|       Rabson, Jan| Lynn, Sherry (I)|   16|
|   McGowan, Mickie|Darling, Jennifer|   15|
|Schneider, Rob (I)|Sandler, Adam (I)|   14|
|     Harnell, Jess|  Bergen, Bob (I)|   14|
|   McGowan, Mickie| Farmer, Bill (I)|   14|
|  Lynn, Sherry (I)|Darling, Jennifer|   14|
|   McGowan, Mickie|    Harnell, Jess|   14|
|       Rabson, Jan|  Bergen, Bob (I)|   14|
|   Bergen, Bob (I)|  Angel, Jack (I)|   13|
|  Lynn, Sherry (I)| Farmer, Bill (I)|   13|
|   Bumpass, Rodger|  Bergen, Bob (I)|   13|
|  Lynn, Sherry (I)|    Harnell, Jess|   13|
|  Farmer, Bill (I)|  Bergen, Bob (I)|   12|
| Sandler,
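
The `m1.actor_name > m2.actor_name` condition in the self-join keeps each pair from being counted twice (once as A, B and once as B, A) and stops an actor from pairing with themselves. Here is a small sketch on hypothetical rows, with `sqlite3` standing in for Spark; the table name `cast_list` and the actor names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cast_list (actor_name TEXT, movie_title TEXT)")
conn.executemany("INSERT INTO cast_list VALUES (?, ?)", [
    ("Ann", "Film 1"), ("Bob", "Film 1"), ("Cy", "Film 1"),
    ("Ann", "Film 2"), ("Bob", "Film 2"),
])

pairs = conn.execute("""
    SELECT m1.actor_name AS actor1, m2.actor_name AS actor2, COUNT(*) AS n
    FROM cast_list m1
    JOIN cast_list m2
      ON m1.movie_title = m2.movie_title
     AND m1.actor_name > m2.actor_name   -- one ordering per pair, no self-pairs
    GROUP BY m1.actor_name, m2.actor_name
    ORDER BY n DESC, actor1, actor2
""").fetchall()

print(pairs)
```

Joining on the title alone treats different films that share a title as the same movie. Adding `produced_year` to the join condition would likely separate them, assuming remakes in the data have different years.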