
In [None]:
! pip install pyspark

In [None]:
# ### **Exercise: PySpark Data Transformations on Movie Data**

# #### **Objective:**
# You have a dataset containing movie details. The goal is to use PySpark to apply data transformations to derive insights.

# #### **Dataset**:

# Here’s a sample dataset of movies (save this as a CSV file if necessary):

# ```csv
# movie_id,title,genre,rating,box_office,date
# 1,Inception,Sci-Fi,8.8,830000000,2010-07-16
# 2,The Dark Knight,Action,9.0,1004000000,2008-07-18
# 3,Interstellar,Sci-Fi,8.6,677000000,2014-11-07
# 4,Avengers: Endgame,Action,8.4,2797000000,2019-04-26
# 5,The Lion King,Animation,8.5,1657000000,1994-06-15
# 6,Toy Story 4,Animation,7.8,1073000000,2019-06-21
# 7,Frozen II,Animation,7.0,1450000000,2019-11-22
# 8,Joker,Drama,8.5,1074000000,2019-10-04
# 9,Parasite,Drama,8.6,258000000,2019-05-30
# ```
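
To follow along, the sample rows above need to exist as a CSV file first. This is a minimal sketch using only the standard library. The function name `write_movies_csv` and the default relative path `movies.csv` are assumptions for illustration (the notebook itself loads `/content/sample_data/movies.csv`), so adjust the path to wherever the file should be read from.

```python
import csv

# Column names and rows copied from the sample dataset in this exercise.
HEADER = ["movie_id", "title", "genre", "rating", "box_office", "date"]
ROWS = [
    [1, "Inception", "Sci-Fi", 8.8, 830000000, "2010-07-16"],
    [2, "The Dark Knight", "Action", 9.0, 1004000000, "2008-07-18"],
    [3, "Interstellar", "Sci-Fi", 8.6, 677000000, "2014-11-07"],
    [4, "Avengers: Endgame", "Action", 8.4, 2797000000, "2019-04-26"],
    [5, "The Lion King", "Animation", 8.5, 1657000000, "1994-06-15"],
    [6, "Toy Story 4", "Animation", 7.8, 1073000000, "2019-06-21"],
    [7, "Frozen II", "Animation", 7.0, 1450000000, "2019-11-22"],
    [8, "Joker", "Drama", 8.5, 1074000000, "2019-10-04"],
    [9, "Parasite", "Drama", 8.6, 258000000, "2019-05-30"],
]


def write_movies_csv(path="movies.csv"):
    """Write the header and sample rows to `path`, then return the path."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(ROWS)
    return path
```

In Colab, calling `write_movies_csv("/content/sample_data/movies.csv")` should place the file where the loading step expects it.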


In [None]:
# ### **Tasks**:

# 1. **Load the Dataset**:
#    - Read the CSV file into a PySpark DataFrame.

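from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Movies").getOrCreate()
csv_file_path = "/content/sample_data/movies.csv"
# inferSchema=true makes rating and box_office numeric; otherwise every column is read as a string.
df_movies = (spark.read.format("csv")
             .option("header", "true")
             .option("inferSchema", "true")
             .load(csv_file_path))
df_movies.show()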

# 2. **Filter Movies by Genre**:
#    - Find all movies in the "Sci-Fi" genre.

df_genre = df_movies.filter(df_movies.genre == "Sci-Fi")
df_genre.show()

# 3. **Top-Rated Movies**:
#    - Find the top 3 highest-rated movies.
# Cast so the ordering is numeric even if rating was loaded as a string.
df_top = df_movies.orderBy(df_movies.rating.cast("double").desc()).limit(3)
df_top.show()

# 4. **Movies Released After 2010**:
#    - Keep only the movies released after the year 2010.

# "After 2010" means 2011 onward; ISO-formatted dates compare correctly against this literal.
df_after_2010 = df_movies.filter(df_movies.date > "2010-12-31")
df_after_2010.show()

+--------+-----------------+---------+------+----------+----------+
|movie_id|            title|    genre|rating|box_office|      date|
+--------+-----------------+---------+------+----------+----------+
|       1|        Inception|   Sci-Fi|   8.8| 830000000|2010-07-16|
|       2|  The Dark Knight|   Action|   9.0|1004000000|2008-07-18|
|       3|     Interstellar|   Sci-Fi|   8.6| 677000000|2014-11-07|
|       4|Avengers: Endgame|   Action|   8.4|2797000000|2019-04-26|
|       5|    The Lion King|Animation|   8.5|1657000000|1994-06-15|
|       6|      Toy Story 4|Animation|   7.8|1073000000|2019-06-21|
|       7|        Frozen II|Animation|   7.0|1450000000|2019-11-22|
|       8|            Joker|    Drama|   8.5|1074000000|2019-10-04|
|       9|         Parasite|    Drama|   8.6| 258000000|2019-05-30|
+--------+-----------------+---------+------+----------+----------+

+--------+------------+------+------+----------+----------+
|movie_id|       title| genre|rating|box_office|      d
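
A note on the sorting tasks in this exercise: when a CSV is read without `inferSchema` or an explicit schema, Spark loads every column as a string, and strings are compared character by character. The sketch below is plain Python, not PySpark, and its variable names are illustrative; it shows the same effect on a few `box_office` values from the sample data.

```python
# box_office values as they appear in the CSV file, i.e. as text.
box_office_text = ["830000000", "1004000000", "677000000", "2797000000"]

# Text comparison reads characters left to right, so a value starting
# with "8" ranks above one starting with "2", regardless of its length.
sorted_as_text = sorted(box_office_text, reverse=True)

# Converting to integers first gives the intended numeric order.
sorted_as_numbers = sorted(box_office_text, key=int, reverse=True)

print(sorted_as_text)
print(sorted_as_numbers)
```

In PySpark, the usual remedy is to cast the column before ordering, for example `df_movies.box_office.cast("long")`, or to load the file with a schema.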

In [None]:
# 5. **Calculate Average Box Office Collection by Genre**:
#    - Group the movies by `genre` and calculate the average box office collection for each genre.

df_collection=df_movies.groupBy("genre").agg({"box_office":"avg"})
df_collection.show()

# 6. **Add a New Column for Box Office in Billions**:
#    - Add a new column that shows the box office collection in billions.

df_billions=df_movies.withColumn("box_office_in_billions",df_movies.box_office/1000000000)
df_billions.show()

# 7. **Sort Movies by Box Office Collection**:
#    - Sort the movies in descending order based on their box office collection.

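# Cast so the sort is numeric; string values would be ordered character by character.
df_sort_movies = df_movies.orderBy(df_movies.box_office.cast("long").desc())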
df_sort_movies.show()

# 8. **Count the Number of Movies per Genre**:
#    - Count the number of movies in each genre.

df_count=df_movies.groupBy("genre").count()
df_count.show()

+---------+--------------------+
|    genre|     avg(box_office)|
+---------+--------------------+
|    Drama|              6.66E8|
|Animation|1.3933333333333333E9|
|   Action|            1.9005E9|
|   Sci-Fi|             7.535E8|
+---------+--------------------+
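
The averages above can be cross-checked by hand. This plain-Python sketch recomputes them, along with the per-genre counts asked for in task 8, from the sample rows. It is only a sanity check, not part of the PySpark solution, and the names are illustrative.

```python
from collections import defaultdict

# (genre, box_office) pairs copied from the sample dataset.
movies = [
    ("Sci-Fi", 830000000), ("Action", 1004000000), ("Sci-Fi", 677000000),
    ("Action", 2797000000), ("Animation", 1657000000),
    ("Animation", 1073000000), ("Animation", 1450000000),
    ("Drama", 1074000000), ("Drama", 258000000),
]

# Accumulate a running [sum, count] per genre.
totals = defaultdict(lambda: [0, 0])
for genre, box_office in movies:
    totals[genre][0] += box_office
    totals[genre][1] += 1

averages = {genre: total / count for genre, (total, count) in totals.items()}
counts = {genre: count for genre, (_, count) in totals.items()}

for genre in averages:
    print(f"{genre}: avg={averages[genre]:.4e}, count={counts[genre]}")
```

The recomputed averages agree with the `avg(box_office)` column, for instance 6.66E8 for Drama and 1.9005E9 for Action.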

+--------+-----------------+---------+------+----------+----------+----------------------+
|movie_id|            title|    genre|rating|box_office|      date|box_office_in_billions|
+--------+-----------------+---------+------+----------+----------+----------------------+
|       1|        Inception|   Sci-Fi|   8.8| 830000000|2010-07-16|                  0.83|
|       2|  The Dark Knight|   Action|   9.0|1004000000|2008-07-18|                 1.004|
|       3|     Interstellar|   Sci-Fi|   8.6| 677000000|2014-11-07|                 0.677|
|       4|Avengers: Endgame|   Action|   8.4|2797000000|2019-04-26|                 2.797|
|       5|    The Lion King|Animation|   8.5|1657000000|1994-06-15|                 1.657|
|      