<a href="https://colab.research.google.com/github/SrijaG29/Movie-Data-Analysis-and-Recommendations-Using-PySpark/blob/main/Movies_ratings_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is a step-by-step PySpark project built on the movie and ratings datasets (`movies.csv` and `ratings.csv`), together with a set of questions the analysis can answer. It is meant as a portfolio piece that demonstrates PySpark and data analysis skills on a resume.

### Step 1: Data Exploration
1. **Load the datasets into PySpark**:
   - Load the movie dataset (`movies.csv`) and ratings dataset (`ratings.csv`) into PySpark DataFrames.

2. **Inspect the Data**:
   - Display the schema and first few rows of both DataFrames to understand the data structure.

### Step 2: Data Cleaning and Preparation
1. **Handle Missing Values**:
   - Check for any missing values and decide how to handle them (e.g., drop rows, fill with default values).

2. **Data Transformation**:
   - Extract the year from the movie title and create a new column `year`.
   - Split the `genres` column into an array of genres.

3. **Join the Datasets**:
   - Perform an inner join on the `movieId` column to combine the movie and ratings datasets.

### Step 3: Data Analysis
1. **Top-Rated Movies**:
   - Find the top 10 movies with the highest average rating.

2. **Popular Genres**:
   - Identify the most popular genres based on the number of ratings.

3. **User Behavior**:
   - Analyze the distribution of ratings by users. For example, find how many movies each user has rated.

4. **Yearly Trends**:
   - Determine how the average movie rating has changed over the years.

5. **Genre-Based Recommendations**:
   - For a given genre, list the top 5 movies based on average ratings.

6. **Movies with the Most Reviews**:
   - Identify the movies with the highest number of ratings.

### Step 4: Advanced Analysis
1. **User-Specific Recommendations**:
   - Build a basic recommendation system by suggesting top-rated movies that a user has not rated yet.

2. **Correlate Ratings and Release Year**:
   - Analyze if there’s any correlation between the release year and the average rating of movies.

3. **Genre Diversity in Top-Rated Movies**:
   - Examine the genre diversity among the top 100 highest-rated movies.

### Step 5: Performance Optimization
1. **Cache and Persist**:
   - Use caching and persistence in PySpark to optimize the performance of your queries.

2. **Partitioning**:
   - Apply partitioning to the data to improve the efficiency of operations, especially for large datasets.

### Step 6: Visualization
1. **Visualize Data**:
   - Use PySpark with an external library like Matplotlib, Seaborn, or even Power BI to create visualizations such as:
     - Distribution of ratings.
     - Average rating per genre.
     - Trends in movie ratings over the years.

### Conclusion
This project structure not only demonstrates your technical ability to handle and analyze data using PySpark but also your ability to derive meaningful insights from complex datasets. Once complete, you can showcase this project in your portfolio or resume, highlighting key aspects such as data cleaning, transformation, analysis, and visualization.

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.2.tar.gz (317.3 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.2-py2.py3-none-any.whl size=317812365 sha256=f369641de924ad36735292233cd80c06e0d4d73e90d057c60e706f42ca839103
  Stored in directory: /root/.cache/pip/wheels/34/34/bd/03944534c44b677cd5859f248090daa9fb27b3c8f8e5f49574
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.2


In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = (
    SparkSession
    .builder
    .appName("Movie ratings")
    .master("local[*]")
    .getOrCreate()
)

In [4]:
spark

**Step 1: Data Exploration**

**Load the datasets into the PySpark session `spark`:**

Load the movie dataset (`movies.csv`) and ratings dataset (`ratings.csv`) into PySpark DataFrames.

**Inspect the Data:**

Display the schema and first few rows of both DataFrames to understand the data structure.

In [81]:
movies_df = spark.read.format("csv").option("header",True).load("/content/movies.csv")
movies_df.show(truncate = False)

+-------+-------------------------------------+-------------------------------------------+
|movieId|title                                |genres                                     |
+-------+-------------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                     |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Jumanji (1995)                       |Adventure|Children|Fantasy                 |
|3      |Grumpier Old Men (1995)              |Comedy|Romance                             |
|4      |Waiting to Exhale (1995)             |Comedy|Drama|Romance                       |
|5      |Father of the Bride Part II (1995)   |Comedy                                     |
|6      |Heat (1995)                          |Action|Crime|Thriller                      |
|7      |Sabrina (1995)                       |Comedy|Romance                             |
|8      |Tom and Huck (1995)                  |Adventure|Children               

In [66]:
ratings_df = spark.read.format("csv").option("header",True).load("/content/ratings.csv")
ratings_df.show(truncate = False)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|1     |17     |4.0   |944249077|
|1     |25     |1.0   |944250228|
|1     |29     |2.0   |943230976|
|1     |30     |5.0   |944249077|
|1     |32     |5.0   |943228858|
|1     |34     |2.0   |943228491|
|1     |36     |1.0   |944249008|
|1     |80     |5.0   |944248943|
|1     |110    |3.0   |943231119|
|1     |111    |5.0   |944249008|
|1     |161    |1.0   |943231162|
|1     |166    |5.0   |943228442|
|1     |176    |4.0   |944079496|
|1     |223    |3.0   |944082810|
|1     |232    |5.0   |943228442|
|1     |260    |5.0   |943228696|
|1     |302    |4.0   |944253272|
|1     |306    |5.0   |944248888|
|1     |307    |5.0   |944253207|
|1     |322    |4.0   |944053801|
+------+-------+------+---------+
only showing top 20 rows



**Step 2: Data Cleaning and Preparation**

**Handle Missing Values:**

Check for any missing values and decide how to handle them (e.g., drop rows, fill with default values).

In [67]:
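# Note: importing `sum` from pyspark.sql.functions shadows Python's built-in sum() in this notebook.
from pyspark.sql.functions import sum,col,when,split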

In [68]:
print(movies_df.dtypes)

[('movieId', 'string'), ('title', 'string'), ('genres', 'string')]


In [69]:
print(ratings_df.dtypes)

[('userId', 'string'), ('movieId', 'string'), ('rating', 'string'), ('timestamp', 'string')]


In [70]:
# Create an empty list to hold column expressions for missing values
missing_value_expressions_movies = []

# Loop through each column in the DataFrame
for column in movies_df.columns:
    # Count the number of null values in the current column
    missing_expr = sum(when(col(column).isNull(), 1).otherwise(0)).alias(column)
    missing_value_expressions_movies.append(missing_expr)

# Aggregate the missing value counts
missing_movies_df = movies_df.agg(*missing_value_expressions_movies)

# Show the results
missing_movies_df.show()

+-------+-----+------+
|movieId|title|genres|
+-------+-----+------+
|      0|    0|     0|
+-------+-----+------+



In [71]:
# Create an empty list to hold column expressions for missing values
missing_value_expressions_ratings = []

# Loop through each column in the DataFrame
for column in ratings_df.columns:
    # Count the number of null values in the current column
    missing_expr = sum(when(col(column).isNull(), 1).otherwise(0)).alias(column)
    missing_value_expressions_ratings.append(missing_expr)

# Aggregate the missing value counts
missing_ratings_df = ratings_df.agg(*missing_value_expressions_ratings)

# Show the results
missing_ratings_df.show()

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     0|      0|     0|        0|
+------+-------+------+---------+



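**There are no null values in either dataset.** Missing genres are not stored as nulls, though: they appear as the placeholder string `(no genres listed)`, so this check does not catch them.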

**Data Transformation:**

Extract the year from the movie title and create a new column year. Split the genres column into an array of genres.

In [82]:
movies_df.show(truncate = False)

+-------+-------------------------------------+-------------------------------------------+
|movieId|title                                |genres                                     |
+-------+-------------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                     |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Jumanji (1995)                       |Adventure|Children|Fantasy                 |
|3      |Grumpier Old Men (1995)              |Comedy|Romance                             |
|4      |Waiting to Exhale (1995)             |Comedy|Drama|Romance                       |
|5      |Father of the Bride Part II (1995)   |Comedy                                     |
|6      |Heat (1995)                          |Action|Crime|Thriller                      |
|7      |Sabrina (1995)                       |Comedy|Romance                             |
|8      |Tom and Huck (1995)                  |Adventure|Children               

Extracting year from title.

In [13]:
from pyspark.sql.functions import expr,trim

In [86]:
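# movie_name: text before the first "(" (a trailing space is left in place).
# year: text after the last "(", e.g. "1995)", trimmed and cut to 4 characters; assumes the title ends with "(YYYY)".
movies_df = movies_df.withColumn("movie_name",split(col("title"),"\\(").getItem(0))\
                     .withColumn("year", expr("substring_index(title, '(', -1)"))\
                     .withColumn("year", trim(col("year")).substr(1, 4))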

movies_df.show(truncate=False)

+-------+-------------------------------------+-------------------------------------------+-------------------------------+----+
|movieId|title                                |genres                                     |movie_name                     |year|
+-------+-------------------------------------+-------------------------------------------+-------------------------------+----+
|1      |Toy Story (1995)                     |Adventure|Animation|Children|Comedy|Fantasy|Toy Story                      |1995|
|2      |Jumanji (1995)                       |Adventure|Children|Fantasy                 |Jumanji                        |1995|
|3      |Grumpier Old Men (1995)              |Comedy|Romance                             |Grumpier Old Men               |1995|
|4      |Waiting to Exhale (1995)             |Comedy|Drama|Romance                       |Waiting to Exhale              |1995|
|5      |Father of the Bride Part II (1995)   |Comedy                                     |Father
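The `substring_index` approach above takes whatever follows the last `(`, so a title that does not end in a four-digit year in parentheses would receive a meaningless `year`. A stricter alternative is to match the year explicitly with a regular expression. The sketch below shows such a pattern in plain Python; the helper name `extract_year` and the third sample title are illustrative assumptions. The same pattern could probably be passed to `pyspark.sql.functions.regexp_extract`, although that is not what this notebook does.

```python
import re

# Matches a four-digit year in parentheses at the very end of the title,
# allowing trailing whitespace.
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")

def extract_year(title):
    """Return the release year as an int, or None if there is no trailing '(YYYY)'."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None

sample_titles = [
    "Toy Story (1995)",
    "Away with Words (San tiao ren) (1999)",
    "A Title Without A Year",  # hypothetical title that lacks a year
]
years = [extract_year(t) for t in sample_titles]
print(years)  # [1995, 1999, None]
```

Returning `None` for unmatched titles makes the bad rows easy to count or drop before any per-year aggregation.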

In [87]:
# x = movies_df.filter(col('genres') == '(no genres listed)')
x = movies_df.filter(col('genres') == 'Comedy')
x.show()

+-------+--------------------+------+--------------------+----+
|movieId|               title|genres|          movie_name|year|
+-------+--------------------+------+--------------------+----+
|      5|Father of the Bri...|Comedy|Father of the Bri...|1995|
|     18|   Four Rooms (1995)|Comedy|         Four Rooms |1995|
|     19|Ace Ventura: When...|Comedy|Ace Ventura: When...|1995|
|     65|     Bio-Dome (1996)|Comedy|           Bio-Dome |1996|
|     69|       Friday (1995)|Comedy|             Friday |1995|
|     88|  Black Sheep (1996)|Comedy|        Black Sheep |1996|
|    102|    Mr. Wrong (1996)|Comedy|          Mr. Wrong |1996|
|    104|Happy Gilmore (1996)|Comedy|      Happy Gilmore |1996|
|    115|Happiness Is in t...|Comedy|Happiness Is in t...|1995|
|    119|Steal Big, Steal ...|Comedy|Steal Big, Steal ...|1995|
|    125|Flirting With Dis...|Comedy|Flirting With Dis...|1996|
|    135|Down Periscope (1...|Comedy|     Down Periscope |1996|
|    141|Birdcage, The (1996)|Comedy|   

In [89]:
x = movies_df.filter(col('genres').contains('(no genres listed)'))
x.show(truncate=False)

+-------+-----------------------------------------------+------------------+------------------------------------+----+
|movieId|title                                          |genres            |movie_name                          |year|
+-------+-----------------------------------------------+------------------+------------------------------------+----+
|83773  |Away with Words (San tiao ren) (1999)          |(no genres listed)|Away with Words                     |1999|
|84768  |Glitterbug (1994)                              |(no genres listed)|Glitterbug                          |1994|
|86493  |Age of the Earth, The (A Idade da Terra) (1980)|(no genres listed)|Age of the Earth, The               |1980|
|87061  |Trails (Veredas) (1978)                        |(no genres listed)|Trails                              |1978|
|91246  |Milky Way (Tejút) (2007)                       |(no genres listed)|Milky Way                           |2007|
|92435  |Dancing Hawk, The (Tanczacy jastrzab) (

In [94]:
# dropping the column title

movies_df = movies_df.drop(movies_df.title)
movies_df.show(truncate = False)

+-------+-------------------------------------------+-------------------------------+----+
|movieId|genres                                     |movie_name                     |year|
+-------+-------------------------------------------+-------------------------------+----+
|1      |Adventure|Animation|Children|Comedy|Fantasy|Toy Story                      |1995|
|2      |Adventure|Children|Fantasy                 |Jumanji                        |1995|
|3      |Comedy|Romance                             |Grumpier Old Men               |1995|
|4      |Comedy|Drama|Romance                       |Waiting to Exhale              |1995|
|5      |Comedy                                     |Father of the Bride Part II    |1995|
|6      |Action|Crime|Thriller                      |Heat                           |1995|
|7      |Comedy|Romance                             |Sabrina                        |1995|
|8      |Adventure|Children                         |Tom and Huck                   |1995|

In [17]:
genres_df = movies_df.select(movies_df.movieId,movies_df.genres)
genres_df.show(truncate = False)

+-------+-------------------------------------------+
|movieId|genres                                     |
+-------+-------------------------------------------+
|1      |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Adventure|Children|Fantasy                 |
|3      |Comedy|Romance                             |
|4      |Comedy|Drama|Romance                       |
|5      |Comedy                                     |
|6      |Action|Crime|Thriller                      |
|7      |Comedy|Romance                             |
|8      |Adventure|Children                         |
|9      |Action                                     |
|10     |Action|Adventure|Thriller                  |
|11     |Comedy|Drama|Romance                       |
|12     |Comedy|Horror                              |
|13     |Adventure|Animation|Children               |
|14     |Drama                                      |
|15     |Action|Adventure|Romance                   |
|16     |Crime|Drama        

In [18]:
genres_df = genres_df.withColumn("gener",split(genres_df.genres,r"\|"))
genres_df.show(truncate = False)

+-------+-------------------------------------------+-------------------------------------------------+
|movieId|genres                                     |gener                                            |
+-------+-------------------------------------------+-------------------------------------------------+
|1      |Adventure|Animation|Children|Comedy|Fantasy|[Adventure, Animation, Children, Comedy, Fantasy]|
|2      |Adventure|Children|Fantasy                 |[Adventure, Children, Fantasy]                   |
|3      |Comedy|Romance                             |[Comedy, Romance]                                |
|4      |Comedy|Drama|Romance                       |[Comedy, Drama, Romance]                         |
|5      |Comedy                                     |[Comedy]                                         |
|6      |Action|Crime|Thriller                      |[Action, Crime, Thriller]                        |
|7      |Comedy|Romance                             |[Comedy, Ro

In [19]:
genres_df = genres_df.drop(genres_df.genres)
genres_df.show(truncate = False)

+-------+-------------------------------------------------+
|movieId|gener                                            |
+-------+-------------------------------------------------+
|1      |[Adventure, Animation, Children, Comedy, Fantasy]|
|2      |[Adventure, Children, Fantasy]                   |
|3      |[Comedy, Romance]                                |
|4      |[Comedy, Drama, Romance]                         |
|5      |[Comedy]                                         |
|6      |[Action, Crime, Thriller]                        |
|7      |[Comedy, Romance]                                |
|8      |[Adventure, Children]                            |
|9      |[Action]                                         |
|10     |[Action, Adventure, Thriller]                    |
|11     |[Comedy, Drama, Romance]                         |
|12     |[Comedy, Horror]                                 |
|13     |[Adventure, Animation, Children]                 |
|14     |[Drama]                        

**Join the Datasets:**

Perform an inner join on the movieId column to combine the movie and ratings datasets.

In [20]:
movies_ratings_df = movies_df.join(ratings_df,on='movieId',how='inner')
movies_ratings_df.show(truncate = False)

+-------+--------------------------------------+-----------------------------------+----+------+------+---------+
|movieId|genres                                |movie_name                         |year|userId|rating|timestamp|
+-------+--------------------------------------+-----------------------------------+----+------+------+---------+
|17     |Drama|Romance                         |Sense and Sensibility              |1995|1     |4.0   |944249077|
|25     |Drama|Romance                         |Leaving Las Vegas                  |1995|1     |1.0   |944250228|
|29     |Adventure|Drama|Fantasy|Mystery|Sci-Fi|City of Lost Children, The         |1995|1     |2.0   |943230976|
|30     |Crime|Drama                           |Shanghai Triad                     |1995|1     |5.0   |944249077|
|32     |Mystery|Sci-Fi|Thriller               |Twelve Monkeys                     |1995|1     |5.0   |943228858|
|34     |Children|Drama                        |Babe                               |1995

**Step 3: Data Analysis**



**Top-Rated Movies:**

Find the top 10 movies with the highest average rating.

In [21]:
from pyspark.sql.functions import avg,desc,asc
top_ten_movies = movies_ratings_df.groupby('movieId','movie_name').agg(avg('rating').alias('Avg_rating'))
top_ten_movies = top_ten_movies.orderBy(top_ten_movies.Avg_rating.desc())
top_ten_movies.show(10,truncate = False)

+-------+----------------------------------+----------+
|movieId|movie_name                        |Avg_rating|
+-------+----------------------------------+----------+
|267940 |Silvery Moon                      |5.0       |
|284191 |The Red Suitcase                  |5.0       |
|148084 |Emmanuelle in Soho                |5.0       |
|166267 |Finnish Blood Swedish Heart       |5.0       |
|126959 |The Epic of Everest               |5.0       |
|141064 |Uomo e galantuomo                 |5.0       |
|287247 |The Beach Boys: Making Pet Sounds |5.0       |
|203060 |Worlds of Ursula K. Le Guin       |5.0       |
|244224 |Selfie                            |5.0       |
|111401 |Home and the World, The           |5.0       |
+-------+----------------------------------+----------+
only showing top 10 rows
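Every movie in this list has a perfect 5.0 average, which often indicates that a title has only a handful of ratings. A common remedy is to require a minimum number of ratings before a movie can be ranked. The sketch below illustrates the idea in plain Python on made-up data; the threshold `MIN_RATINGS` and the sample ratings are assumptions. In PySpark the equivalent would be to aggregate a count alongside the average and filter on it.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (movie_name, rating) pairs; not taken from the dataset.
ratings = [
    ("Obscure Film", 5.0),
    ("Popular Film", 4.5), ("Popular Film", 4.0), ("Popular Film", 5.0),
    ("Average Film", 3.0), ("Average Film", 3.5), ("Average Film", 2.5),
]
MIN_RATINGS = 3  # assumed threshold; tune it to the size of the dataset

# Collect all ratings per movie.
by_movie = defaultdict(list)
for name, rating in ratings:
    by_movie[name].append(rating)

# Keep only movies with enough ratings, then rank by average (highest first).
ranked = sorted(
    ((name, mean(vals), len(vals))
     for name, vals in by_movie.items() if len(vals) >= MIN_RATINGS),
    key=lambda row: row[1],
    reverse=True,
)
for name, avg_rating, n in ranked:
    print(f"{name}: avg={avg_rating:.2f} from {n} ratings")
```

With this threshold, the single-rating "Obscure Film" no longer outranks films that were rated several times.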



**Popular Genres:**

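Identify the most popular genres based on the number of ratings. Note that the query below groups by the full `genres` string, so it ranks genre combinations (for example `Comedy|Romance`) rather than individual genres.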


In [22]:
from pyspark.sql.functions import count

popular_genres = movies_ratings_df.groupBy('genres').agg(count('rating').alias('No_of_ratings'))
popular_genres = popular_genres.sort(popular_genres.No_of_ratings.desc())
popular_genres.show(truncate = False)

+--------------------------------+-------------+
|genres                          |No_of_ratings|
+--------------------------------+-------------+
|Drama                           |405611       |
|Comedy                          |340763       |
|Comedy|Romance                  |197856       |
|Drama|Romance                   |176116       |
|Comedy|Drama                    |167208       |
|Comedy|Drama|Romance            |158487       |
|Action|Adventure|Sci-Fi         |144021       |
|Crime|Drama                     |137683       |
|Action|Crime|Thriller           |87817        |
|Drama|Thriller                  |84616        |
|Action|Adventure|Sci-Fi|Thriller|79065        |
|Action|Adventure|Thriller       |75067        |
|Action|Sci-Fi|Thriller          |71351        |
|Crime|Drama|Thriller            |70518        |
|Drama|War                       |64149        |
|Action|Crime|Drama|Thriller     |59467        |
|Comedy|Crime                    |58213        |
|Action|Drama|War   
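The table above ranks combinations of genres because the grouping key is the full `genres` string. To rank individual genres, each combination has to be split on `|` and every genre credited with that combination's ratings. The sketch below shows the logic in plain Python using four rows copied from the output above; in PySpark a similar effect could be achieved by applying `explode(split(...))` before grouping.

```python
from collections import Counter

# (genres, number_of_ratings) rows copied from the output above.
combo_counts = [
    ("Drama", 405611),
    ("Comedy", 340763),
    ("Comedy|Romance", 197856),
    ("Drama|Romance", 176116),
]

# Credit each individual genre with the ratings of every combination it appears in.
genre_counts = Counter()
for combo, n_ratings in combo_counts:
    for genre in combo.split("|"):
        genre_counts[genre] += n_ratings

for genre, n_ratings in genre_counts.most_common():
    print(genre, n_ratings)
```

On these four rows alone, Drama totals 581727, Comedy 538619, and Romance 373972; the full dataset would of course give different totals.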

**User Behavior:**

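Analyze the distribution of ratings by users. For example, find how many movies each user has rated. Because `userId` was loaded as a string, the result below is sorted lexicographically (1, 10, 100, ...) rather than numerically.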


In [23]:
user_rating = movies_ratings_df.groupBy('userId').agg(count('userId').alias('No_of_ratings')).orderBy(movies_ratings_df.userId.asc())
user_rating.show(truncate = False)

+------+-------------+
|userId|No_of_ratings|
+------+-------------+
|1     |141          |
|10    |660          |
|100   |248          |
|1000  |27           |
|10000 |165          |
|10001 |25           |
|10002 |35           |
|10003 |447          |
|10004 |212          |
|10005 |420          |
|10006 |63           |
|10007 |89           |
|10008 |91           |
|10009 |59           |
|1001  |102          |
|10010 |137          |
|10011 |20           |
|10012 |64           |
|10013 |30           |
|10014 |88           |
+------+-------------+
only showing top 20 rows
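The user IDs above appear in the order 1, 10, 100, 1000 because `userId` was read from the CSV as a string, and strings sort character by character. The short plain-Python sketch below reproduces the effect and shows that an integer sort key restores numeric order. In PySpark the analogous step would be to cast the column before ordering, for example with `col('userId').cast('int')`.

```python
# A few user IDs as strings, the way they are loaded from the CSV.
user_ids = ["1", "10", "100", "1000", "2", "20", "3"]

# Sorting strings compares them character by character.
lexicographic = sorted(user_ids)

# Sorting with an integer key gives the numeric order most readers expect.
numeric = sorted(user_ids, key=int)

print(lexicographic)  # ['1', '10', '100', '1000', '2', '20', '3']
print(numeric)        # ['1', '2', '3', '10', '20', '100', '1000']
```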



**Yearly Trends:**

Determine how the average movie rating has changed over the years.


In [24]:
movies_ratings_df.select('year').show()

+----+
|year|
+----+
|1995|
|1995|
|1995|
|1995|
|1995|
|1995|
|1995|
|1995|
|1995|
|1976|
|1995|
|1995|
|1995|
|1994|
|1994|
|1977|
|1994|
|1994|
|1993|
|1995|
+----+
only showing top 20 rows



In [25]:
avg_rating_year = movies_ratings_df.groupBy('year').agg(avg('rating').alias('Avg_rating'))
avg_rating_year.show(truncate = False)

+----+------------------+
|year|Avg_rating        |
+----+------------------+
|1953|3.6956057541020453|
|1903|2.947058823529412 |
|1957|4.022665522665522 |
|1897|2.796875          |
|1987|3.5515874990669087|
|1956|3.718933333333333 |
|2016|3.544669946699467 |
|1936|3.848736325914749 |
|2012|3.534303460694162 |
|2020|3.3116364999464496|
|1958|3.8176854830847224|
|1910|2.7777777777777777|
|1943|3.667924528301887 |
|1915|3.0153846153846153|
|1972|3.965176268271711 |
|1931|3.9023941068139965|
|1988|3.5300534613766303|
|1938|3.889705882352941 |
|1926|3.846683354192741 |
|1911|3.6538461538461537|
+----+------------------+
only showing top 20 rows



**Genre-Based Recommendations:**

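For a given genre, list the top 5 movies based on average ratings. The filter below uses an exact match, so entering `Action` selects only movies whose `genres` value is exactly `Action`, not multi-genre movies such as `Action|Crime|Thriller`.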


In [26]:
genre_input = input("Enter the genre to filter movies: ")

top_five_movies = (
    movies_ratings_df.filter(col('genres') == genre_input)
    .groupBy('movie_name')
    .agg(avg('rating').alias('Avg_rating'))
    .orderBy(col('Avg_rating').desc())
    .limit(5)
)
top_five_movies.show(truncate = False)


Enter the genre to filter movies: Action
+----------------------------+----------+
|movie_name                  |Avg_rating|
+----------------------------+----------+
|Buffalo Girls               |5.0       |
|Dreadnaught                 |5.0       |
|The Division: Agent Origins |5.0       |
|Drunken Master 3            |5.0       |
|Reich                       |5.0       |
+----------------------------+----------+



In [27]:
# Filter on movie_name, since the title column was dropped earlier in the notebook.
x = movies_df.filter(col("movie_name").like("%Dreadnaught%"))
x.select(x.genres).show()

+------+
|genres|
+------+
|Action|
+------+



**Movies with the Most Reviews:**

Identify the movies with the highest number of ratings.

In [28]:
from pyspark.sql.functions import desc

In [29]:
most_reviews = movies_ratings_df.groupBy('movieId','movie_name')\
              .agg(count(col('rating')).alias('No_of_reviews'))\
              .orderBy(col('No_of_reviews').desc())\
              .limit(1)
most_reviews.show(truncate = False)

+-------+--------------------------+-------------+
|movieId|movie_name                |No_of_reviews|
+-------+--------------------------+-------------+
|318    |Shawshank Redemption, The |18549        |
+-------+--------------------------+-------------+



In [30]:
# Filter on movie_name, since the title column was dropped earlier in the notebook.
x = movies_df.filter(col("movie_name").like("Shawshank Redemption%"))
x.show(truncate = False)

+-------+-----------+--------------------------+----+
|movieId|genres     |movie_name                |year|
+-------+-----------+--------------------------+----+
|318    |Crime|Drama|Shawshank Redemption, The |1994|
+-------+-----------+--------------------------+----+



**Step 4: Advanced Analysis**


**User-Specific Recommendations:**

Build a basic recommendation system by suggesting top-rated movies that a user has not rated yet.


In [31]:
top_rated_movies = movies_ratings_df.groupBy('movieId', 'movie_name')\
    .agg(avg(col('rating')).alias('Avg_rating'))\
    .orderBy(col('Avg_rating').desc())\
    .limit(100)\
    .select('movieId','movie_name')

top_rated_movies.show(truncate = False)

+-------+--------------------------------------------------------------------------------+
|movieId|movie_name                                                                      |
+-------+--------------------------------------------------------------------------------+
|152916 |Scene from the Elevator Ascending Eiffel Tower                                  |
|234919 |How It Feels to Be Run Over                                                     |
|195369 |Reich                                                                           |
|111901 |Cinderella                                                                      |
|208090 |The Fallen of World War II                                                      |
|247252 |Caught by a Wave                                                                |
|203060 |Worlds of Ursula K. Le Guin                                                     |
|205593 |The Maiden Danced to Death                                                      |

In [32]:
total_users = movies_ratings_df.select('userId').distinct().count()
print(total_users)

36112


In [33]:
user_id_input = input('Enter user id')

user_ratings = movies_ratings_df.filter(col('userId') == user_id_input).select('movieId','movie_name')
user_ratings.show(truncate = False)

Enter user id100
+-------+-----------------------------------+
|movieId|movie_name                         |
+-------+-----------------------------------+
|1      |Toy Story                          |
|2      |Jumanji                            |
|5      |Father of the Bride Part II        |
|31     |Dangerous Minds                    |
|34     |Babe                               |
|47     |Seven                              |
|50     |Usual Suspects, The                |
|104    |Happy Gilmore                      |
|150    |Apollo 13                          |
|165    |Die Hard: With a Vengeance         |
|260    |Star Wars: Episode IV - A New Hope |
|293    |Léon: The Professional             |
|296    |Pulp Fiction                       |
|318    |Shawshank Redemption, The          |
|356    |Forrest Gump                       |
|357    |Four Weddings and a Funeral        |
|364    |Lion King, The                     |
|377    |Speed                              |
|480    |Jurassic


A **left anti join** in PySpark returns only the rows of the left DataFrame that have no match in the right DataFrame, and only the left DataFrame's columns.

In [34]:
recommended_movies_df = top_rated_movies.join(user_ratings, on='movieId', how='left_anti')
recommended_movies_df.show(truncate = False)

+-------+----------------------------------------------------------------------+
|movieId|movie_name                                                            |
+-------+----------------------------------------------------------------------+
|250648 |From Hare to Heir                                                     |
|222374 |The Colossus of Destiny: A Melvins Tale                               |
|216372 |Adoring                                                               |
|199710 |L'arroseur arrosé                                                     |
|80210  |End of Poverty, The                                                   |
|154606 |Pastorale                                                             |
|161818 |VeggieTales: Duke and the Great Pie War                               |
|267940 |Silvery Moon                                                          |
|83558  |On the Occasion of Remembering the Turning Gate                       |
|126959 |The Epic of Everest

The `left_anti` join keeps only the top-rated movies that do not appear in the user's list of rated movies.
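Conceptually, the anti join is a set difference on `movieId`: start from the top-rated movies and remove everything the user has already rated. The plain-Python sketch below mirrors that logic with a few IDs from the outputs above. Placing movieId 318 in the top-rated list is an illustrative assumption made only so that there is an overlap to remove.

```python
# Hypothetical top-rated list; including movieId 318 here is an assumption
# made only so that the example has an overlap to remove.
top_rated = {
    318: "Shawshank Redemption, The",
    267940: "Silvery Moon",
    126959: "The Epic of Everest",
}

# movieIds the user has already rated (a subset of user 100's list above).
user_rated_ids = {1, 2, 318}

# Anti join: keep top-rated entries whose movieId has no match in the user's ratings.
recommended = {
    movie_id: name
    for movie_id, name in top_rated.items()
    if movie_id not in user_rated_ids
}
print(recommended)  # {267940: 'Silvery Moon', 126959: 'The Epic of Everest'}
```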

**Correlate Ratings and Release Year:**

Analyze if there’s any correlation between the release year and the average rating of movies.


In [38]:
avg_rating_year = movies_ratings_df.groupBy('year').agg(avg(col('rating')).alias('Avg_rating'))
avg_rating_year.show(truncate = False)

+----+------------------+
|year|Avg_rating        |
+----+------------------+
|1953|3.6956057541020453|
|1903|2.947058823529412 |
|1957|4.022665522665522 |
|1897|2.796875          |
|1987|3.5515874990669087|
|1956|3.718933333333333 |
|2016|3.544669946699467 |
|1936|3.848736325914749 |
|2012|3.534303460694162 |
|2020|3.3116364999464496|
|1958|3.8176854830847224|
|1910|2.7777777777777777|
|1943|3.667924528301887 |
|1915|3.0153846153846153|
|1972|3.965176268271711 |
|1931|3.9023941068139965|
|1988|3.5300534613766303|
|1938|3.889705882352941 |
|1926|3.846683354192741 |
|1911|3.6538461538461537|
+----+------------------+
only showing top 20 rows



In [40]:
avg_rating_year_pd = avg_rating_year.toPandas()
correlation = avg_rating_year_pd['year'].astype(float).corr(avg_rating_year_pd['Avg_rating'])
print("Correlation between release year and average rating: ",correlation)


Correlation between release year and average rating:  0.5714955278992014


Here we calculate the Pearson correlation coefficient between the `year` and `Avg_rating` columns.

**Conclusion:** The coefficient of about 0.57 falls in the 0.5 to 0.7 range that is commonly described as moderate, so release year and average rating show a moderate positive linear relationship: more recent release years tend to have somewhat higher average ratings. Note that the coefficient is computed over one average per release year, so every year carries the same weight regardless of how many ratings it received.
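The pandas `corr` call used above computes the Pearson coefficient, that is, the covariance of the two columns divided by the product of their standard deviations. The sketch below spells that formula out in plain Python on five of the yearly averages from the table above (rounded to two decimals). It is only meant to illustrate the calculation: because it uses a small hand-picked subset of years, its result differs from the notebook's 0.57.

```python
from math import fsum, sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = fsum(xs) / n
    mean_y = fsum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = fsum((x - mean_x) ** 2 for x in xs)
    var_y = fsum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Five (year, Avg_rating) pairs from the table above, rounded to two decimals.
years = [1897, 1903, 1910, 1957, 2016]
avg_ratings = [2.80, 2.95, 2.78, 4.02, 3.54]

r = pearson(years, avg_ratings)
print(f"r = {r:.3f}")
```

The common `1/n` factors in the covariance and the variances cancel, which is why the function works with plain sums of deviations.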

**Genre Diversity in Top-Rated Movies:**

Examine the genre diversity among the top 100 highest-rated movies.

In [52]:
top_rated_movies = movies_ratings_df.groupBy('movieId', 'movie_name')\
    .agg(avg(col('rating')).alias('Avg_rating'))\
    .orderBy(col('Avg_rating').desc())\
    .limit(100)\
    .select('movieId','movie_name')

top_rated_movies.show(truncate = False)

+-------+--------------------------------------------------------------------------------+
|movieId|movie_name                                                                      |
+-------+--------------------------------------------------------------------------------+
|234919 |How It Feels to Be Run Over                                                     |
|152916 |Scene from the Elevator Ascending Eiffel Tower                                  |
|111901 |Cinderella                                                                      |
|195369 |Reich                                                                           |
|247252 |Caught by a Wave                                                                |
|208090 |The Fallen of World War II                                                      |
|205593 |The Maiden Danced to Death                                                      |
|203060 |Worlds of Ursula K. Le Guin                                                     |

In [43]:
from pyspark.sql.functions import explode,countDistinct

In [47]:
movies_df_aliased = movies_df.alias('m')
top_rated_movies_aliased = top_rated_movies.alias('t')


genres_split_df = movies_df_aliased.withColumn("genre", explode(split(col("genres"), "\\|"))) \
                                   .join(top_rated_movies_aliased, on="movieId") \
                                   .select('m.movie_name', 'genre')

genres_split_df.show(truncate=False)


+------------------------------------------------+-----------+
|movie_name                                      |genre      |
+------------------------------------------------+-----------+
|Death Takes a Holiday                           |Fantasy    |
|Death Takes a Holiday                           |Romance    |
|Ceremony, The                                   |Comedy     |
|Ceremony, The                                   |Drama      |
|Voyager                                         |Drama      |
|End of Poverty, The                             |Documentary|
|On the Occasion of Remembering the Turning Gate |Drama      |
|Time That Remains, The                          |Drama      |
|Unexpected Love, An                             |Drama      |
|Old Fashioned Way, The                          |Comedy     |
|Treed Murray                                    |Drama      |
|Human Scale, The                                |Documentary|
|Free Radicals:  A History of Experimental Film  |Docum

In [59]:
# Count distinct genres
genre_diversity = genres_split_df.groupBy("movie_name") \
                                .agg(countDistinct("genre").alias("distinct_genres"))
genre_diversity = genre_diversity.orderBy(col('distinct_genres').asc())
genre_diversity.show(truncate = False)

+-----------------------------------------------+---------------+
|movie_name                                     |distinct_genres|
+-----------------------------------------------+---------------+
|Rolli and the Golden Key                       |1              |
|Lil Rel: RELevent                              |1              |
|The Rashevski Tango                            |1              |
|Adoring                                        |1              |
|Treed Murray                                   |1              |
|L'arroseur arrosé                              |1              |
|Breaking Boundaries: The Science of Our Planet |1              |
|Selfie                                         |1              |
|Path of Blood                                  |1              |
|The Cookie Carnival                            |1              |
|Time That Remains, The                         |1              |
|First Contact                                  |1              |
|The Conne

In [60]:
x = genres_split_df.filter(col('genre') == '(no genres listed)')
x.show(truncate = False)

+-----------------------------------------------+------------------+
|movie_name                                     |genre             |
+-----------------------------------------------+------------------+
|Sally Hemings: An American Scandal             |(no genres listed)|
|Dangerous Child                                |(no genres listed)|
|Michael Jackson: Video Greatest Hits - HIStory |(no genres listed)|
|Lebedyne ozero-zona                            |(no genres listed)|
|Vintage Tomorrows                              |(no genres listed)|
|Lana Del Rey: The Greatest Story Never Told    |(no genres listed)|
|Stream of Love                                 |(no genres listed)|
|1968                                           |(no genres listed)|
|L'arroseur arrosé                              |(no genres listed)|
|The Maiden Danced to Death                     |(no genres listed)|
|Inside 'The Talented Mr. Ripley'               |(no genres listed)|
+---------------------------------

In [56]:
movies_ratings_df.filter(col('movie_name').like("%A Sister's Revenge%")).show(truncate = False)


+-------+----------------------+-------------------+----+------+------+----------+
|movieId|genres                |movie_name         |year|userId|rating|timestamp |
+-------+----------------------+-------------------+----+------+------+----------+
|137018 |Drama|Mystery|Thriller|A Sister's Revenge |2013|9785  |5.0   |1436142277|
+-------+----------------------+-------------------+----+------+------+----------+



**Total Genre Counts:**

Count how many times each genre appears across the top 100 movies.


In [95]:
total_count = genres_split_df.groupBy('genre').agg(count(col('genre')).alias('No_of_count'))
total_count = total_count.orderBy(col('No_of_count').desc())
total_count.show(truncate = False)

+------------------+-----------+
|genre             |No_of_count|
+------------------+-----------+
|Drama             |34         |
|Documentary       |22         |
|Comedy            |21         |
|Romance           |15         |
|(no genres listed)|11         |
|Fantasy           |8          |
|Animation         |8          |
|Thriller          |7          |
|Crime             |6          |
|Mystery           |6          |
|Horror            |5          |
|Children          |5          |
|Sci-Fi            |5          |
|Action            |4          |
|War               |3          |
|Adventure         |2          |
+------------------+-----------+



**Percentage Representation:**

Calculate the percentage representation of each genre among the top-rated movies to understand which genres dominate.
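
As a rough sketch of that calculation, the snippet below computes each genre's share in plain Python from the counts in the `total_count` output above. The helper `genre_percentages` is a hypothetical name. The shares are percentages of all genre tags, not of the 100 movies, because a single movie can carry several genres.

```python
# Genre counts copied from the total_count output above
genre_counts = {
    "Drama": 34, "Documentary": 22, "Comedy": 21, "Romance": 15,
    "(no genres listed)": 11, "Fantasy": 8, "Animation": 8, "Thriller": 7,
    "Crime": 6, "Mystery": 6, "Horror": 5, "Children": 5, "Sci-Fi": 5,
    "Action": 4, "War": 3, "Adventure": 2,
}

def genre_percentages(counts):
    """Return {genre: share of all genre tags in percent}, rounded to 2 decimals."""
    total = sum(counts.values())
    return {genre: round(100 * n / total, 2) for genre, n in counts.items()}

percentages = genre_percentages(genre_counts)
for genre, pct in percentages.items():
    print(f"{genre:<20}{pct:>6.2f}%")
```

In Spark, the same ratio could be obtained by dividing `No_of_count` by the sum of that column.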

In [103]:
from pyspark.sql.functions import col, avg

# Average rating for each (year, movie_name) pair
most_rated_year = movies_ratings_df.groupBy(col('year'),col('movie_name')) \
    .agg(avg(col('rating')).alias('Avg_rating'))

# most_rated_year= most_rated_year.orderBy(col('Avg_rating').desc())


most_rated_year.show(truncate=False)

+----+------------------------------------------+------------------+
|year|movie_name                                |Avg_rating        |
+----+------------------------------------------+------------------+
|1999|Liberty Heights                           |3.5559701492537314|
|2004|Phantom of the Opera, The                 |3.481216457960644 |
|2006|Eragon                                    |2.4310344827586206|
|2007|National Treasure: Book of Secrets        |3.251366120218579 |
|2012|Sinister                                  |3.57              |
|1995|Pride and Prejudice                       |3.9887152777777777|
|1989|Sea of Love                               |3.4524793388429753|
|2014|Blended                                   |3.3251121076233185|
|1994|Three Colors: White                       |3.856945722171113 |
|1997|Ice Storm, The                            |3.8108736059479553|
|1990|King of New York                          |3.484320557491289 |
|1998|Very Bad Things             

In [101]:
from pyspark.sql.functions import col, avg

# Average rating per release year, sorted from highest to lowest
most_rated_year = movies_ratings_df.groupBy(col('year')) \
    .agg(avg(col('rating')).alias('Avg_rating'))

most_rated_year = most_rated_year.orderBy(col('Avg_rating').desc())


most_rated_year.show(truncate=False)

+----+------------------+
|year|Avg_rating        |
+----+------------------+
|1957|4.022665522665522 |
|1954|4.004514672686231 |
|1946|3.9740790436691875|
|1972|3.965176268271711 |
|1942|3.961504640937958 |
|1962|3.940851449275362 |
|1944|3.9371643394199785|
|1941|3.9285191286838734|
|1934|3.920393559928444 |
|1931|3.9023941068139965|
|1974|3.8981538225831165|
|1964|3.894499120429801 |
|1952|3.893018018018018 |
|1975|3.892570281124498 |
|1938|3.889705882352941 |
|1951|3.882144284572342 |
|1948|3.8773248168326133|
|1939|3.8682272057619906|
|1927|3.864066852367688 |
|1949|3.855448053443907 |
+----+------------------+
only showing top 20 rows



### Step 5: Performance Optimization


**Cache and Persist:**

Cache or persist DataFrames that are reused across several queries (for example `movies_ratings_df`), so that Spark does not recompute them for every action.
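
A minimal sketch of how caching could be applied, assuming a Spark DataFrame such as `movies_ratings_df` is passed in. The helper names `cache_and_materialize` and `release` are hypothetical. They rely only on the standard `DataFrame.cache()`, `count()`, and `unpersist()` methods.

```python
def cache_and_materialize(df):
    """Mark a Spark DataFrame for caching and force it to be computed once."""
    cached = df.cache()  # lazy: only marks the DataFrame for caching
    cached.count()       # an action, so the cached data is actually populated
    return cached

def release(df):
    """Drop the cached data once the DataFrame is no longer needed."""
    return df.unpersist()
```

For example, `movies_ratings_df = cache_and_materialize(movies_ratings_df)` before the `groupBy` queries, followed by `release(movies_ratings_df)` at the end of the analysis.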


**Partitioning:**

Partition the data on the columns used for joins and aggregations (such as `movieId`) to improve the efficiency of those operations, especially for large datasets.
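
One possible way to do this is sketched below, assuming a Spark DataFrame is passed in. The helper name `repartition_by_key` and the partition count of 8 are placeholders to tune for the cluster and data size. `DataFrame.repartition` itself triggers a full shuffle, so it tends to pay off only when the result is reused by several joins or aggregations on the same key.

```python
def repartition_by_key(df, key="movieId", num_partitions=8):
    """Hash-partition a Spark DataFrame on `key` and report the partition counts."""
    before = df.rdd.getNumPartitions()
    repartitioned = df.repartition(num_partitions, key)  # rows with equal keys share a partition
    after = repartitioned.rdd.getNumPartitions()
    print(f"partitions: {before} -> {after}")
    return repartitioned
```

For example, `movies_ratings_df = repartition_by_key(movies_ratings_df, "movieId")` before grouping by `movieId`.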


### Step 6: Visualization

**Visualize Data:**

Use PySpark with an external library like Matplotlib, Seaborn, or even Power BI to create visualizations such as:
- Distribution of ratings.
- Average rating per genre.
- Trends in movie ratings over the years.