# Exploratory Data Analysis

In [34]:
copy.show()

+------+--------------------+--------------------+------------+--------------------+---------------------+-----------------+-----------+------------+--------------------+--------------------+----------+------------+----------+-------+--------------------+--------------------+--------------------+--------------------+---------+---------------+---------+
|    id|               title|             tagline|release_date|              genres|belongs_to_collection|original_language|budget_musd|revenue_musd|production_companies|production_countries|vote_count|vote_average|popularity|runtime|            overview|    spoken_languages|         poster_path|                cast|cast_size|       director|crew_size|
+------+--------------------+--------------------+------------+--------------------+---------------------+-----------------+-----------+------------+--------------------+--------------------+----------+------------+----------+-------+--------------------+--------------------+--------------

Identifying the Best and Worst Performing Movies based on key metrics

Creating a helper function to streamline ranking operations (a plain Python function built on the DataFrame API, not a registered Spark UDF)

In [35]:
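from pyspark.sql.functions import col


def rank_movies(copy, column, ascending=False, filter_condition=None, n=10):
    """
    Rank movies based on a column with optional filtering
    """
    # Compare against None explicitly: a PySpark Column cannot be evaluated as a bool
    if filter_condition is not None:
        filtered_data = copy.filter(filter_condition)
    else:
        filtered_data = copy

    # Sort by the chosen column in the requested direction
    if ascending:
        ranked_data = filtered_data.orderBy(col(column).asc())
    else:
        ranked_data = filtered_data.orderBy(col(column).desc())

    # Keep only the first n rows of the sorted result
    return ranked_data.limit(n)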
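
To make the ranking semantics concrete without a Spark session, here is a minimal plain-Python sketch that mirrors the filter, sort, and limit steps on a list of dictionaries. The function name `rank_rows` and the use of a callable as the filter are assumptions made for illustration; the four sample rows reuse revenue and budget figures that appear in the output tables of this notebook.

```python
# Sample rows; figures copied from the revenue and budget tables in this notebook.
movies = [
    {"title": "Avatar", "revenue_musd": 2923.706026, "budget_musd": 237.0},
    {"title": "Avengers: Endgame", "revenue_musd": 2799.4391, "budget_musd": 356.0},
    {"title": "The Lion King", "revenue_musd": 1662.020819, "budget_musd": 260.0},
    {"title": "The Avengers", "revenue_musd": 1518.815515, "budget_musd": 220.0},
]


def rank_rows(rows, column, ascending=False, filter_condition=None, n=10):
    """Hypothetical plain-Python analogue of rank_movies (filter, sort, limit)."""
    # Apply the optional filter; here it is a callable taking one row
    if filter_condition is not None:
        rows = [r for r in rows if filter_condition(r)]
    # Sort descending by default, ascending on request
    ranked = sorted(rows, key=lambda r: r[column], reverse=not ascending)
    # Keep only the first n rows
    return ranked[:n]


top_two = rank_rows(movies, "revenue_musd", n=2)
cheapest_first = rank_rows(
    movies,
    "budget_musd",
    ascending=True,
    filter_condition=lambda r: r["revenue_musd"] > 1600,
)
print([r["title"] for r in top_two])
print([r["title"] for r in cheapest_first])
```

The default `ascending=False` gives a "top n" list, while `ascending=True` gives a "bottom n" list, which is how the lowest-profit and lowest-rated cells use the helper.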

Top Rated Movies based on Key Performance Indicators (KPIs)

In [36]:
# Display the top 10 movies ranked by highest revenue
# Uses the rank_movies helper function defined above
highest_revenue = rank_movies(copy, "revenue_musd")
highest_revenue.select("title", "revenue_musd").show(10)

+--------------------+------------+
|               title|revenue_musd|
+--------------------+------------+
|              Avatar| 2923.706026|
|   Avengers: Endgame|   2799.4391|
|Star Wars: The Fo...| 2068.223624|
|Avengers: Infinit...| 2052.415039|
|      Jurassic World| 1671.537444|
|       The Lion King| 1662.020819|
|        The Avengers| 1518.815515|
|           Furious 7|      1515.4|
|           Frozen II| 1453.683476|
|Avengers: Age of ...| 1405.403694|
+--------------------+------------+



In [37]:
# Ranking movies based on Highest Budget
highest_budget = rank_movies(copy, "budget_musd", ascending=False)
highest_budget.select("title", "budget_musd").show(10)

+--------------------+-----------+
|               title|budget_musd|
+--------------------+-----------+
|Avengers: Age of ...|      365.0|
|   Avengers: Endgame|      356.0|
|Avengers: Infinit...|      300.0|
|       The Lion King|      260.0|
|Star Wars: The Fo...|      245.0|
|              Avatar|      237.0|
|        The Avengers|      220.0|
|       Incredibles 2|      200.0|
|Star Wars: The La...|      200.0|
|       Black Panther|      200.0|
+--------------------+-----------+



In [38]:
# Rank movies by highest profit, where profit = revenue - budget (both in millions of USD)
copy = copy.withColumn("profit_musd", col("revenue_musd") - col("budget_musd"))
highest_profit = rank_movies(copy, "profit_musd")
highest_profit.select("title", "profit_musd").show()

+--------------------+------------------+
|               title|       profit_musd|
+--------------------+------------------+
|              Avatar|       2686.706026|
|   Avengers: Endgame|         2443.4391|
|Star Wars: The Fo...|1823.2236240000002|
|Avengers: Infinit...|       1752.415039|
|      Jurassic World|       1521.537444|
|       The Lion King|       1402.020819|
|           Furious 7|            1325.4|
|           Frozen II|       1303.683476|
|        The Avengers|       1298.815515|
|Harry Potter and ...|       1216.511219|
+--------------------+------------------+
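
The value 1823.2236240000002 in the table above is a binary floating-point artifact of the subtraction, not extra precision in the data. A small sketch in plain Python, using the Star Wars figures from the revenue and budget tables, shows the raw difference next to a rounded version for display. Rounding to three decimals is an arbitrary choice made for this illustration.

```python
revenue_musd = 2068.223624  # "Star Wars: The Fo...", from the revenue table
budget_musd = 245.0         # same title, from the budget table

# Raw double-precision difference, as stored in profit_musd
profit_raw = revenue_musd - budget_musd

# Rounded copy intended for display only
profit_display = round(profit_raw, 3)

print(profit_raw)
print(profit_display)
```

In the notebook itself, a comparable effect could presumably be obtained by wrapping the profit expression in `pyspark.sql.functions.round` before calling `show()`.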



**Lowest Profit Movies:** Calling the *rank_movies* function with ascending=True sorts the profit column in ascending order and returns the first 10 rows, i.e. the 10 least profitable movies.

In [39]:
copy = copy.withColumn("profit_musd", col("revenue_musd") - col("budget_musd"))
lowest_profit = rank_movies(copy, "profit_musd", ascending=True)
lowest_profit.select("title", "profit_musd").show()

+--------------------+-----------+
|               title|profit_musd|
+--------------------+-----------+
|Avengers: Age of ...|1040.403694|
|       Incredibles 2|1042.805359|
|              Frozen|1124.219009|
|Star Wars: The La...| 1132.69883|
|Jurassic World: F...|1140.466296|
|       Black Panther|1149.926083|
|Harry Potter and ...|1216.511219|
|        The Avengers|1298.815515|
|           Frozen II|1303.683476|
|           Furious 7|     1325.4|
+--------------------+-----------+



ROI is calculated for each movie as revenue divided by budget. A filter on the budget_musd column restricts the calculation to movies with a budget of at least 10M.

In [42]:
# Highest ROI
(copy.filter(col("budget_musd") >= 10)
     .select(col("title"), (col("revenue_musd") / col("budget_musd")).alias("ROI"))
     .orderBy(col("ROI").desc())
     .show(5))

+--------------------+-----------------+
|               title|              ROI|
+--------------------+-----------------+
|              Avatar|12.33631234599156|
|      Jurassic World|      11.14358296|
|Harry Potter and ...|     10.732089752|
|           Frozen II|9.691223173333332|
|              Frozen|8.494793393333333|
+--------------------+-----------------+
only showing top 5 rows
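
Note that the ROI column above is a revenue multiple (revenue divided by budget), so a value of 1.0 would mean break-even. Another common convention reports net return, (revenue - budget) / budget, which is mathematically one less than the multiple. The sketch below computes both for Avatar using figures from the earlier tables; the helper names `revenue_multiple` and `net_roi` are hypothetical and not part of the notebook.

```python
def revenue_multiple(revenue_musd: float, budget_musd: float) -> float:
    """ROI as used in this notebook: revenue divided by budget."""
    return revenue_musd / budget_musd


def net_roi(revenue_musd: float, budget_musd: float) -> float:
    """Alternative convention (not used above): profit divided by budget."""
    return (revenue_musd - budget_musd) / budget_musd


# Avatar's revenue and budget, taken from the tables earlier in the notebook
avatar_revenue, avatar_budget = 2923.706026, 237.0

multiple = revenue_multiple(avatar_revenue, avatar_budget)
net = net_roi(avatar_revenue, avatar_budget)

print(round(multiple, 4), round(net, 4))
```

Either convention yields the same ordering of movies, so the rankings shown here would not change; only the reported values would shift by one.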



The same filter and ROI calculation from the 'Highest ROI' code block are applied here, but the ROI column is sorted in ascending order to show the lowest-ROI movies among those with Budget ≥ 10M.

In [44]:
# Lowest ROI
(copy.filter(col("budget_musd") >= 10)
     .select(col("title"), (col("revenue_musd") / col("budget_musd")).alias("ROI"))
     .orderBy(col("ROI").asc())
     .show(5))

+--------------------+------------------+
|               title|               ROI|
+--------------------+------------------+
|Avengers: Age of ...| 3.850421079452055|
|       Incredibles 2|       6.214026795|
|       The Lion King|6.3923877653846155|
|Star Wars: The La...|        6.66349415|
|       Black Panther|       6.749630415|
+--------------------+------------------+
only showing top 5 rows



**Most Voted Movies**

In [45]:
# Most Voted Movies
most_voted = rank_movies(copy, "vote_count")
most_voted.select("title", "vote_count").show()

+--------------------+----------+
|               title|vote_count|
+--------------------+----------+
|              Avatar|     32172|
|        The Avengers|     31648|
|Avengers: Infinit...|     30447|
|   Avengers: Endgame|     26262|
|Avengers: Age of ...|     23377|
|       Black Panther|     22517|
|Harry Potter and ...|     20983|
|      Jurassic World|     20655|
|Star Wars: The Fo...|     19701|
|              Frozen|     16823|
+--------------------+----------+



In [47]:
# Highest Rated Movies
highest_rated = rank_movies(copy, "vote_average", ascending=False)
highest_rated.select("title", "vote_average").show()

+--------------------+------------+
|               title|vote_average|
+--------------------+------------+
|   Avengers: Endgame|       8.237|
|Avengers: Infinit...|       8.235|
|Harry Potter and ...|       8.087|
|        The Avengers|       7.741|
|              Avatar|       7.588|
|       Incredibles 2|       7.454|
|       Black Panther|       7.373|
|Avengers: Age of ...|       7.271|
|Star Wars: The Fo...|       7.261|
|           Frozen II|       7.249|
+--------------------+------------+



In [52]:
# Lowest Rated movies
lowest_rated = rank_movies(copy, "vote_average", ascending=True)
lowest_rated.select("title", "vote_average").show()

+--------------------+------------+
|               title|vote_average|
+--------------------+------------+
|Jurassic World: F...|       6.538|
|      Jurassic World|       6.694|
|Star Wars: The La...|         6.8|
|       The Lion King|       7.111|
|           Furious 7|       7.226|
|              Frozen|       7.247|
|           Frozen II|       7.249|
|Star Wars: The Fo...|       7.261|
|Avengers: Age of ...|       7.271|
|       Black Panther|       7.373|
+--------------------+------------+



In [53]:
# Most Popular Movies
most_popular = rank_movies(copy, "popularity")
most_popular.select("title", "popularity").show()

+--------------------+----------+
|               title|popularity|
+--------------------+----------+
|Avengers: Infinit...|  122.0331|
|       The Lion King|   97.0871|
|Star Wars: The La...|   91.5889|
|       Black Panther|   87.9532|
|   Avengers: Endgame|   73.6149|
|Jurassic World: F...|   72.0926|
|        The Avengers|   33.3085|
|              Avatar|   25.0365|
|Avengers: Age of ...|    24.122|
|              Frozen|   20.4818|
+--------------------+----------+





---



# Advanced Movies Filtering and Search Queries

Filtering the dataset based on specific queries

* Search 1: Science Fiction Action movies with Bruce Willis in the cast, sorted by rating (highest to lowest)

In [58]:
sci_fi_action_bruce = copy.filter(col("genres").like("%Science Fiction%") &
                                  col("genres").like("%Action%") &
                                  col("cast").like("%Bruce Willis%")
                                  ).orderBy(col("vote_average").desc())
sci_fi_action_bruce.select("title", "vote_average", "genres", "cast").show(truncate=False)

+-----+------------+------+----+
|title|vote_average|genres|cast|
+-----+------------+------+----+
+-----+------------+------+----+



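* Search 2: Movies with Uma Thurman in the cast, directed by Quentin Tarantino, sorted by runtime (shortest to longest)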

In [59]:
uma_quentin_movies = copy.filter((col("cast").like("%Uma Thurman%")) & (col("director") == "Quentin Tarantino")).orderBy(col("runtime").asc())
uma_quentin_movies.select("title", "runtime").show()

+-----+-------+
|title|runtime|
+-----+-------+
+-----+-------+
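
Both searches above return no rows. One way to find out which predicate eliminates everything is to count matches for each condition separately before combining them. The sketch below does this in plain Python on invented rows: the rows, the `|` separator, and the helper `count_matches` are assumptions for illustration only and do not reflect the actual dataset. Substring tests with `in` stand in for the `like("%...%")` patterns used in the notebook.

```python
# Invented rows for illustration; not taken from the dataset.
rows = [
    {"title": "Movie A", "genres": "Action|Science Fiction", "cast": "Actor One|Actor Two"},
    {"title": "Movie B", "genres": "Drama", "cast": "Actor Two|Actor Three"},
    {"title": "Movie C", "genres": "Action|Adventure", "cast": "Actor Three"},
]

# One named predicate per search condition
conditions = {
    "genre has Science Fiction": lambda r: "Science Fiction" in r["genres"],
    "genre has Action": lambda r: "Action" in r["genres"],
    "cast has Actor Three": lambda r: "Actor Three" in r["cast"],
}


def count_matches(rows, predicate):
    """Count rows for which the predicate holds."""
    return sum(1 for r in rows if predicate(r))


# Matches for each condition on its own
per_condition = {name: count_matches(rows, cond) for name, cond in conditions.items()}

# Matches when every condition must hold at once
combined = count_matches(rows, lambda r: all(cond(r) for cond in conditions.values()))

print(per_condition)
print("all conditions:", combined)
```

In this invented data every condition matches at least one row on its own, yet no single row satisfies all three, so the combined result is empty. In Spark terms, the analogous check would be calling `.count()` on `copy.filter(...)` for each condition separately.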

