<a href="https://colab.research.google.com/github/SrijaG29/Movie-Data-Analysis-and-Recommendations-Using-PySpark/blob/main/Movies_ratings_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook analyzes the movie and ratings datasets with PySpark. The outline below lists each step and the questions it answers; working through it demonstrates PySpark and data analysis skills you can showcase on your resume.

### Step 1: Data Exploration
1. **Load the datasets into PySpark**:
   - Load the movie dataset (`movies.csv`) and ratings dataset (`ratings.csv`) into PySpark DataFrames.

2. **Inspect the Data**:
   - Display the schema and first few rows of both DataFrames to understand the data structure.

### Step 2: Data Cleaning and Preparation
1. **Handle Missing Values**:
   - Check for any missing values and decide how to handle them (e.g., drop rows, fill with default values).

2. **Data Transformation**:
   - Extract the year from the movie title and create a new column `year`.
   - Split the `genres` column into an array of genres.

3. **Join the Datasets**:
   - Perform an inner join on the `movieId` column to combine the movie and ratings datasets.

### Step 3: Data Analysis
1. **Top-Rated Movies**:
   - Find the top 10 movies with the highest average rating.

2. **Popular Genres**:
   - Identify the most popular genres based on the number of ratings.

3. **User Behavior**:
   - Analyze the distribution of ratings by users. For example, find how many movies each user has rated.

4. **Yearly Trends**:
   - Determine how the average movie rating has changed over the years.

5. **Genre-Based Recommendations**:
   - For a given genre, list the top 5 movies based on average ratings.

6. **Movies with the Most Reviews**:
   - Identify the movies with the highest number of ratings.

### Step 4: Advanced Analysis
1. **User-Specific Recommendations**:
   - Build a basic recommendation system by suggesting top-rated movies that a user has not rated yet.

2. **Correlate Ratings and Release Year**:
   - Analyze if there’s any correlation between the release year and the average rating of movies.

3. **Genre Diversity in Top-Rated Movies**:
   - Examine the genre diversity among the top 100 highest-rated movies.

### Step 5: Performance Optimization
1. **Cache and Persist**:
   - Use caching and persistence in PySpark to optimize the performance of your queries.

2. **Partitioning**:
   - Apply partitioning to the data to improve the efficiency of operations, especially for large datasets.

### Step 6: Visualization
1. **Visualize Data**:
   - Use PySpark with an external library like Matplotlib, Seaborn, or even Power BI to create visualizations such as:
     - Distribution of ratings.
     - Average rating per genre.
     - Trends in movie ratings over the years.

### Conclusion
This project structure not only demonstrates your technical ability to handle and analyze data using PySpark but also your ability to derive meaningful insights from complex datasets. Once complete, you can showcase this project in your portfolio or resume, highlighting key aspects such as data cleaning, transformation, analysis, and visualization.

In [46]:
!pip install pyspark



In [47]:
from pyspark.sql import SparkSession

In [48]:
spark = (
    SparkSession
    .builder
    .appName("Movie ratings")
    .master("local[*]")
    .getOrCreate()
)

In [49]:
spark

**Step 1: Data Exploration**

**Load the datasets into the PySpark session `spark`:**

Load the movie dataset (`movies.csv`) and ratings dataset (`ratings.csv`) into PySpark DataFrames.

**Inspect the Data:**

Display the schema and first few rows of both DataFrames to understand the data structure.

In [78]:
movies_df = spark.read.format("csv").option("header",True).load("/content/movies.csv")
movies_df.show(truncate = False)

+-------+-------------------------------------+-------------------------------------------+
|movieId|title                                |genres                                     |
+-------+-------------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                     |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Jumanji (1995)                       |Adventure|Children|Fantasy                 |
|3      |Grumpier Old Men (1995)              |Comedy|Romance                             |
|4      |Waiting to Exhale (1995)             |Comedy|Drama|Romance                       |
|5      |Father of the Bride Part II (1995)   |Comedy                                     |
|6      |Heat (1995)                          |Action|Crime|Thriller                      |
|7      |Sabrina (1995)                       |Comedy|Romance                             |
|8      |Tom and Huck (1995)                  |Adventure|Children               

In [51]:
ratings_df = spark.read.format("csv").option("header",True).load("/content/ratings.csv")
ratings_df.show(truncate = False)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|1     |17     |4.0   |944249077|
|1     |25     |1.0   |944250228|
|1     |29     |2.0   |943230976|
|1     |30     |5.0   |944249077|
|1     |32     |5.0   |943228858|
|1     |34     |2.0   |943228491|
|1     |36     |1.0   |944249008|
|1     |80     |5.0   |944248943|
|1     |110    |3.0   |943231119|
|1     |111    |5.0   |944249008|
|1     |161    |1.0   |943231162|
|1     |166    |5.0   |943228442|
|1     |176    |4.0   |944079496|
|1     |223    |3.0   |944082810|
|1     |232    |5.0   |943228442|
|1     |260    |5.0   |943228696|
|1     |302    |4.0   |944253272|
|1     |306    |5.0   |944248888|
|1     |307    |5.0   |944253207|
|1     |322    |4.0   |944053801|
+------+-------+------+---------+
only showing top 20 rows



**Step 2: Data Cleaning and Preparation**

**Handle Missing Values:**

Check for any missing values and decide how to handle them (e.g., drop rows, fill with default values).

In [52]:
# Note: this import shadows Python's built-in sum() for the rest of the notebook.
from pyspark.sql.functions import sum,col,when,split

In [8]:
print(movies_df.dtypes)

[('movieId', 'string'), ('title', 'string'), ('genres', 'string')]


In [9]:
print(ratings_df.dtypes)

[('userId', 'string'), ('movieId', 'string'), ('rating', 'string'), ('timestamp', 'string')]


In [10]:
# Create an empty list to hold column expressions for missing values
missing_value_expressions_movies = []

# Loop through each column in the DataFrame
for column in movies_df.columns:
    # Count the number of null values in the current column
    missing_expr = sum(when(col(column).isNull(), 1).otherwise(0)).alias(column)
    missing_value_expressions_movies.append(missing_expr)

# Aggregate the missing value counts
missing_movies_df = movies_df.agg(*missing_value_expressions_movies)

# Show the results
missing_movies_df.show()

+-------+-----+------+
|movieId|title|genres|
+-------+-----+------+
|      0|    0|     0|
+-------+-----+------+



In [11]:
# Create an empty list to hold column expressions for missing values
missing_value_expressions_ratings = []

# Loop through each column in the DataFrame
for column in ratings_df.columns:
    # Count the number of null values in the current column
    missing_expr = sum(when(col(column).isNull(), 1).otherwise(0)).alias(column)
    missing_value_expressions_ratings.append(missing_expr)

# Aggregate the missing value counts
missing_ratings_df = ratings_df.agg(*missing_value_expressions_ratings)

# Show the results
missing_ratings_df.show()

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     0|      0|     1|        1|
+------+-------+------+---------+



In [None]:
ratings_df = ratings_df.dropna()

In [104]:
null_count = ratings_df.filter(col("rating").isNull()).count()
print(null_count)

0


In [105]:
null_count = ratings_df.filter(col("timestamp").isNull()).count()
print(null_count)

0


**After dropping the incomplete row(s) from `ratings_df` (one null `rating` and one null `timestamp`), neither dataset contains null values.**

**Data Transformation:**

Extract the year from the movie title into a new column `year`, then split the `genres` column into an array of genres.

In [53]:
movies_df.show(truncate = False)

+-------+-------------------------------------+-------------------------------------------+
|movieId|title                                |genres                                     |
+-------+-------------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                     |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Jumanji (1995)                       |Adventure|Children|Fantasy                 |
|3      |Grumpier Old Men (1995)              |Comedy|Romance                             |
|4      |Waiting to Exhale (1995)             |Comedy|Drama|Romance                       |
|5      |Father of the Bride Part II (1995)   |Comedy                                     |
|6      |Heat (1995)                          |Action|Crime|Thriller                      |
|7      |Sabrina (1995)                       |Comedy|Romance                             |
|8      |Tom and Huck (1995)                  |Adventure|Children               

Extracting year from title.

In [79]:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def year_extract(x):
    """Return the year from a title ending in "(YYYY)", or None if parsing fails."""
    try:
        # Take the text after the last "(" and drop the closing ")"
        f = x.split('(')[-1].strip(')')
        return int(f)
    except (AttributeError, ValueError):
        # AttributeError: the title is None; ValueError: the text is not an integer
        return None

x_udf = udf(year_extract, IntegerType())

movies_df = movies_df.withColumn('year', x_udf(col('title')))
movies_df.show(truncate = False)

+-------+-------------------------------------+-------------------------------------------+----+
|movieId|title                                |genres                                     |year|
+-------+-------------------------------------+-------------------------------------------+----+
|1      |Toy Story (1995)                     |Adventure|Animation|Children|Comedy|Fantasy|1995|
|2      |Jumanji (1995)                       |Adventure|Children|Fantasy                 |1995|
|3      |Grumpier Old Men (1995)              |Comedy|Romance                             |1995|
|4      |Waiting to Exhale (1995)             |Comedy|Drama|Romance                       |1995|
|5      |Father of the Bride Part II (1995)   |Comedy                                     |1995|
|6      |Heat (1995)                          |Action|Crime|Thriller                      |1995|
|7      |Sabrina (1995)                       |Comedy|Romance                             |1995|
|8      |Tom and Huck (1995)  

A few rows have a null `year` because no year could be parsed from the title, so we will remove those rows.

In [80]:
null_count = movies_df.filter(col("year").isNull()).count()
print(null_count)

796


In [81]:
movies_df = movies_df.dropna(subset=['year'])
movies_df.show(truncate = False)

+-------+-------------------------------------+-------------------------------------------+----+
|movieId|title                                |genres                                     |year|
+-------+-------------------------------------+-------------------------------------------+----+
|1      |Toy Story (1995)                     |Adventure|Animation|Children|Comedy|Fantasy|1995|
|2      |Jumanji (1995)                       |Adventure|Children|Fantasy                 |1995|
|3      |Grumpier Old Men (1995)              |Comedy|Romance                             |1995|
|4      |Waiting to Exhale (1995)             |Comedy|Drama|Romance                       |1995|
|5      |Father of the Bride Part II (1995)   |Comedy                                     |1995|
|6      |Heat (1995)                          |Action|Crime|Thriller                      |1995|
|7      |Sabrina (1995)                       |Comedy|Romance                             |1995|
|8      |Tom and Huck (1995)  

In [82]:
null_count = movies_df.filter(col("year").isNull()).count()
print(null_count)

0
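
The extraction above only succeeds when the title ends exactly in `(YYYY)`, so a title with trailing whitespace would be counted among the nulls. As a minimal sketch of a more tolerant rule, the plain-Python helper below uses a regular expression; `extract_year` and `YEAR_PATTERN` are hypothetical names, not part of the notebook, and the sample titles are illustrative. The same pattern could probably be passed to PySpark's `regexp_extract`, which would also avoid a Python UDF, but that is untested here.

```python
import re

# Hypothetical helper: match a four-digit year in a trailing "(YYYY)" group,
# allowing whitespace after the closing parenthesis.
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")

def extract_year(title):
    """Return the release year as an int, or None if no trailing (YYYY) is found."""
    if title is None:
        return None
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None

samples = ["Toy Story (1995)", "Heat (1995) ", "Untitled Project", None]
print([extract_year(t) for t in samples])  # [1995, 1995, None, None]
```

Titles with no year at all still yield `None`, so the `dropna(subset=['year'])` step would remain necessary.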


In [83]:
genres_df = movies_df.select(movies_df.movieId,movies_df.genres)
genres_df.show(truncate = False)

+-------+-------------------------------------------+
|movieId|genres                                     |
+-------+-------------------------------------------+
|1      |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Adventure|Children|Fantasy                 |
|3      |Comedy|Romance                             |
|4      |Comedy|Drama|Romance                       |
|5      |Comedy                                     |
|6      |Action|Crime|Thriller                      |
|7      |Comedy|Romance                             |
|8      |Adventure|Children                         |
|9      |Action                                     |
|10     |Action|Adventure|Thriller                  |
|11     |Comedy|Drama|Romance                       |
|12     |Comedy|Horror                              |
|13     |Adventure|Animation|Children               |
|14     |Drama                                      |
|15     |Action|Adventure|Romance                   |
|16     |Crime|Drama        

In [84]:
# Raw string so the regex escape \| is not treated as an invalid Python string escape
genres_df = genres_df.withColumn("gener",split(genres_df.genres,r"\|"))
genres_df.show(truncate = False)

+-------+-------------------------------------------+-------------------------------------------------+
|movieId|genres                                     |gener                                            |
+-------+-------------------------------------------+-------------------------------------------------+
|1      |Adventure|Animation|Children|Comedy|Fantasy|[Adventure, Animation, Children, Comedy, Fantasy]|
|2      |Adventure|Children|Fantasy                 |[Adventure, Children, Fantasy]                   |
|3      |Comedy|Romance                             |[Comedy, Romance]                                |
|4      |Comedy|Drama|Romance                       |[Comedy, Drama, Romance]                         |
|5      |Comedy                                     |[Comedy]                                         |
|6      |Action|Crime|Thriller                      |[Action, Crime, Thriller]                        |
|7      |Comedy|Romance                             |[Comedy, Ro

In [85]:
genres_df = genres_df.drop(genres_df.genres)
genres_df.show(truncate = False)

+-------+-------------------------------------------------+
|movieId|gener                                            |
+-------+-------------------------------------------------+
|1      |[Adventure, Animation, Children, Comedy, Fantasy]|
|2      |[Adventure, Children, Fantasy]                   |
|3      |[Comedy, Romance]                                |
|4      |[Comedy, Drama, Romance]                         |
|5      |[Comedy]                                         |
|6      |[Action, Crime, Thriller]                        |
|7      |[Comedy, Romance]                                |
|8      |[Adventure, Children]                            |
|9      |[Action]                                         |
|10     |[Action, Adventure, Thriller]                    |
|11     |[Comedy, Drama, Romance]                         |
|12     |[Comedy, Horror]                                 |
|13     |[Adventure, Animation, Children]                 |
|14     |[Drama]                        

Some rows in `movies_df` have the placeholder value `(no genres listed)` in the `genres` column. We replace it with null and then remove those rows.

In [109]:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf, col

def clean_genre(x):
    if x != '(no genres listed)':
        return x
    else:
        return None

clean_genre_udf = udf(clean_genre, StringType())
movies_df = movies_df.withColumn('genres', clean_genre_udf(col('genres')))
movies_df.show(truncate=False)

+-------+-------------------------------------+-------------------------------------------+----+
|movieId|title                                |genres                                     |year|
+-------+-------------------------------------+-------------------------------------------+----+
|1      |Toy Story (1995)                     |Adventure|Animation|Children|Comedy|Fantasy|1995|
|2      |Jumanji (1995)                       |Adventure|Children|Fantasy                 |1995|
|3      |Grumpier Old Men (1995)              |Comedy|Romance                             |1995|
|4      |Waiting to Exhale (1995)             |Comedy|Drama|Romance                       |1995|
|5      |Father of the Bride Part II (1995)   |Comedy                                     |1995|
|6      |Heat (1995)                          |Action|Crime|Thriller                      |1995|
|7      |Sabrina (1995)                       |Comedy|Romance                             |1995|
|8      |Tom and Huck (1995)  

Checking for null values.

In [110]:
null_count = movies_df.filter(col("genres").isNull()).count()
print(null_count)

6702


In [111]:
movies_df = movies_df.filter(movies_df.genres.isNotNull())
movies_df.show(truncate=False)

+-------+-------------------------------------+-------------------------------------------+----+
|movieId|title                                |genres                                     |year|
+-------+-------------------------------------+-------------------------------------------+----+
|1      |Toy Story (1995)                     |Adventure|Animation|Children|Comedy|Fantasy|1995|
|2      |Jumanji (1995)                       |Adventure|Children|Fantasy                 |1995|
|3      |Grumpier Old Men (1995)              |Comedy|Romance                             |1995|
|4      |Waiting to Exhale (1995)             |Comedy|Drama|Romance                       |1995|
|5      |Father of the Bride Part II (1995)   |Comedy                                     |1995|
|6      |Heat (1995)                          |Action|Crime|Thriller                      |1995|
|7      |Sabrina (1995)                       |Comedy|Romance                             |1995|
|8      |Tom and Huck (1995)  

In [112]:
null_count = movies_df.filter(col("genres").isNull()).count()
print(null_count)

0


**Join the Datasets:**

Perform an inner join on the movieId column to combine the movie and ratings datasets.

In [86]:
movies_ratings_df = movies_df.join(ratings_df,on='movieId',how='inner')
movies_ratings_df.show(truncate = False)

+-------+---------------------------------------------------------------+--------------------------------------+----+------+------+---------+
|movieId|title                                                          |genres                                |year|userId|rating|timestamp|
+-------+---------------------------------------------------------------+--------------------------------------+----+------+------+---------+
|17     |Sense and Sensibility (1995)                                   |Drama|Romance                         |1995|1     |4.0   |944249077|
|25     |Leaving Las Vegas (1995)                                       |Drama|Romance                         |1995|1     |1.0   |944250228|
|29     |City of Lost Children, The (Cité des enfants perdus, La) (1995)|Adventure|Drama|Fantasy|Mystery|Sci-Fi|1995|1     |2.0   |943230976|
|30     |Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)           |Crime|Drama                           |1995|1     |5.0   |944249077|
|32   

**Step 3: Data Analysis**



**Top-Rated Movies:**

Find the top 10 movies with the highest average rating.

In [87]:
from pyspark.sql.functions import avg,desc,asc
top_ten_movies = movies_ratings_df.groupby('movieId','title').agg(avg('rating').alias('Avg_rating'))
top_ten_movies = top_ten_movies.orderBy(top_ten_movies.Avg_rating.desc())
top_ten_movies.show(10,truncate = False)

+-------+---------------------------------------+----------+
|movieId|title                                  |Avg_rating|
+-------+---------------------------------------+----------+
|232329 |Snow Trail (1947)                      |5.0       |
|189587 |The Idiot Cycle (2009)                 |5.0       |
|148084 |Emmanuelle in Soho (1981)              |5.0       |
|137032 |The Perfect Neighbor (2005)            |5.0       |
|152721 |Black Wind (1964)                      |5.0       |
|276277 |Exploited (2022)                       |5.0       |
|267948 |Christmas Cracker (1963)               |5.0       |
|212947 |Uppity: The Willy T. Ribbs Story (2020)|5.0       |
|226356 |Story of a Bad Boy (1999)              |5.0       |
|134346 |Samay: When Time Strikes (2003)        |5.0       |
+-------+---------------------------------------+----------+
only showing top 10 rows
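
Every movie in the list above averages exactly 5.0, which usually means each title has only a handful of ratings (the notebook does not show the counts, so this is an assumption). A common remedy is to require a minimum number of ratings before ranking. The plain-Python sketch below illustrates the idea on made-up data; `MIN_RATINGS` is a hypothetical threshold. In PySpark the equivalent would be adding `count('rating')` to the `agg` call and filtering on it.

```python
from collections import defaultdict

# Toy (title, rating) pairs standing in for the joined DataFrame; values are made up.
ratings = [
    ("Obscure Film", 5.0),
    ("Popular Film", 4.5), ("Popular Film", 4.0), ("Popular Film", 5.0),
    ("Average Film", 3.0), ("Average Film", 3.5), ("Average Film", 2.5),
]

MIN_RATINGS = 3  # hypothetical threshold; tune it to the dataset size

totals = defaultdict(lambda: [0.0, 0])  # title -> [running total, number of ratings]
for title, rating in ratings:
    totals[title][0] += rating
    totals[title][1] += 1

# Keep only titles with enough ratings, then rank by average rating.
ranked = sorted(
    ((title, total / n, n) for title, (total, n) in totals.items() if n >= MIN_RATINGS),
    key=lambda row: row[1],
    reverse=True,
)
for title, avg_rating, n in ranked:
    print(f"{title}: {avg_rating:.2f} from {n} ratings")
```

"Obscure Film" has a perfect average but only one rating, so it drops out of the ranking.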



**Popular Genres:**

Identify the most popular genres based on the number of ratings.


In [23]:
from pyspark.sql.functions import count

popular_genres = movies_ratings_df.groupBy('genres').agg(count('rating').alias('No_of_ratings'))
popular_genres = popular_genres.sort(popular_genres.No_of_ratings.desc())
popular_genres.show(truncate = False)

+--------------------------------+-------------+
|genres                          |No_of_ratings|
+--------------------------------+-------------+
|Drama                           |1900601      |
|Comedy                          |1589068      |
|Comedy|Romance                  |925186       |
|Drama|Romance                   |827297       |
|Comedy|Drama                    |777409       |
|Comedy|Drama|Romance            |738314       |
|Action|Adventure|Sci-Fi         |677323       |
|Crime|Drama                     |644541       |
|Action|Crime|Thriller           |409571       |
|Drama|Thriller                  |395474       |
|Action|Adventure|Sci-Fi|Thriller|371381       |
|Action|Adventure|Thriller       |351702       |
|Action|Sci-Fi|Thriller          |340168       |
|Crime|Drama|Thriller            |327702       |
|Drama|War                       |302972       |
|Action|Crime|Drama|Thriller     |277004       |
|Action|Drama|War                |275104       |
|Comedy|Crime       
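
Note that the grouping above treats each combined string, such as `Comedy|Romance`, as its own genre. To rank individual genres, each rating would need to count once for every genre the movie belongs to. The sketch below shows that logic in plain Python on made-up counts; in PySpark the same effect should be achievable with `explode(split(col("genres"), "\\|"))` before the `groupBy`, as the notebook does later for the genre diversity step.

```python
from collections import Counter

# Toy rows of (genres string, number of ratings); the numbers are made up.
rows = [
    ("Drama", 10),
    ("Comedy|Romance", 6),
    ("Comedy|Drama|Romance", 4),
]

per_genre = Counter()
for genres, n_ratings in rows:
    # Each rating counts once for every genre the movie belongs to.
    for genre in genres.split("|"):
        per_genre[genre] += n_ratings

for genre, n_ratings in per_genre.most_common():
    print(genre, n_ratings)
```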

**User Behavior:**

Analyze the distribution of ratings by users. For example, find how many movies each user has rated.


In [24]:
user_rating = movies_ratings_df.groupBy('userId').agg(count('userId').alias('No_of_ratings')).orderBy(movies_ratings_df.userId.asc())
user_rating.show(truncate = False)

+------+-------------+
|userId|No_of_ratings|
+------+-------------+
|1     |141          |
|10    |660          |
|100   |249          |
|1000  |27           |
|10000 |165          |
|100000|197          |
|100001|31           |
|100002|123          |
|100003|186          |
|100004|113          |
|100005|29           |
|100006|44           |
|100007|28           |
|100008|24           |
|100009|265          |
|10001 |25           |
|100010|288          |
|100011|300          |
|100012|102          |
|100013|28           |
+------+-------------+
only showing top 20 rows
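
Because the CSV files were read without a schema, `userId` is a string column (see the `dtypes` output earlier), which is why the listing above runs 1, 10, 100, 1000 rather than 1, 2, 3. The plain-Python sketch below shows the difference on made-up IDs. In PySpark, one option would be to order by `col('userId').cast('int')`, or to read the file with the `inferSchema` option enabled.

```python
user_ids = ["1", "10", "100", "1000", "2", "25"]

# Strings compare character by character, so "10" sorts before "2".
print(sorted(user_ids))           # ['1', '10', '100', '1000', '2', '25']

# Converting to int for the comparison gives the numeric order.
print(sorted(user_ids, key=int))  # ['1', '2', '10', '25', '100', '1000']
```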



**Yearly Trends:**

Determine how the average movie rating has changed over the years.


In [88]:
movies_ratings_df.select('year').show()

+----+
|year|
+----+
|1995|
|1995|
|1995|
|1995|
|1995|
|1995|
|1995|
|1995|
|1995|
|1976|
|1995|
|1995|
|1995|
|1994|
|1994|
|1977|
|1994|
|1994|
|1993|
|1995|
+----+
only showing top 20 rows



In [89]:
avg_rating_year = movies_ratings_df.groupBy('year').agg(avg('rating').alias('Avg_rating'))
avg_rating_year.show(truncate = False)

+----+------------------+
|year|Avg_rating        |
+----+------------------+
|1959|3.8124824072510273|
|1990|3.4511998418576675|
|1896|2.895             |
|1903|3.095622119815668 |
|1975|3.882580279336278 |
|1977|3.8249607945635127|
|1888|2.596774193548387 |
|1924|3.7569906790945407|
|2003|3.4766174307820195|
|2007|3.5071440157352596|
|1892|2.1931818181818183|
|2018|3.479722733282176 |
|1974|3.894832976797762 |
|2015|3.564475949876199 |
|2023|3.3467310972143265|
|1927|3.860412299091544 |
|1955|3.723592937893098 |
|1890|2.275             |
|2006|3.555251218560976 |
|1978|3.4619053148430763|
+----+------------------+
only showing top 20 rows



**Genre-Based Recommendations:**

For a given genre, list the top 5 movies based on average ratings.


In [90]:
genre_input = input("Enter the genre to filter movies: ")

# Exact string match: keeps only movies whose genres value is exactly genre_input
top_five_movies = (
    movies_ratings_df.filter(col('genres') == genre_input)
    .groupBy('title')
    .agg(avg('rating').alias('Avg_rating'))
    .orderBy(col('Avg_rating').desc())
    .limit(5)
)
top_five_movies.show(truncate = False)


Enter the genre to filter movies: Action
+-------------------------------+----------+
|title                          |Avg_rating|
+-------------------------------+----------+
|Street Level (2016)            |5.0       |
|Yasmine (2014)                 |5.0       |
|FB: Fighting Beat (2007)       |5.0       |
|Shanghai 13 (1984)             |5.0       |
|Angels Hard as They Come (1971)|5.0       |
+-------------------------------+----------+
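
The filter compares the whole `genres` string with the input, so entering `Action` returns only movies whose genres value is exactly `Action` and skips titles tagged `Action|Crime|Thriller`. The plain-Python sketch below contrasts the two rules on made-up rows. In PySpark, a membership test could likely be written with `array_contains(split(col('genres'), '\\|'), genre_input)`, though that variant is not run here.

```python
# Toy (title, genres) rows; the titles are made up.
movies = [
    ("Film A", "Action"),
    ("Film B", "Action|Crime|Thriller"),
    ("Film C", "Comedy|Romance"),
]

genre_input = "Action"

# Exact match on the full genres string (what the notebook's filter does).
exact = [title for title, genres in movies if genres == genre_input]

# Membership test on the individual genres.
member = [title for title, genres in movies if genre_input in genres.split("|")]

print(exact)   # ['Film A']
print(member)  # ['Film A', 'Film B']
```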



**Movies with the Most Reviews:**

Identify the movies with the highest number of ratings.

In [28]:
from pyspark.sql.functions import desc

In [91]:
most_reviews = movies_ratings_df.groupBy('movieId','title')\
              .agg(count(col('rating')).alias('No_of_reviews'))\
              .orderBy(col('No_of_reviews').desc())\
              .limit(1)
most_reviews.show(truncate = False)

+-------+--------------------------------+-------------+
|movieId|title                           |No_of_reviews|
+-------+--------------------------------+-------------+
|318    |Shawshank Redemption, The (1994)|87165        |
+-------+--------------------------------+-------------+



In [92]:
x = movies_df.filter(col("title").like("Shawshank Redemption%"))
x.show(truncate = False)

+-------+--------------------------------+-----------+----+
|movieId|title                           |genres     |year|
+-------+--------------------------------+-----------+----+
|318    |Shawshank Redemption, The (1994)|Crime|Drama|1994|
+-------+--------------------------------+-----------+----+



**Step 4: Advanced Analysis**


**User-Specific Recommendations:**

Build a basic recommendation system by suggesting top-rated movies that a user has not rated yet.


In [93]:
top_rated_movies = movies_ratings_df.groupBy('movieId', 'title')\
    .agg(avg(col('rating')).alias('Avg_rating'))\
    .orderBy(col('Avg_rating').desc())\
    .limit(100)\
    .select('movieId','title')

top_rated_movies.show(truncate = False)

+-------+-------------------------------------------------------+
|movieId|title                                                  |
+-------+-------------------------------------------------------+
|171279 |Robin Williams - Off the Wall (1978)                   |
|137030 |Dangerous Child (2001)                                 |
|248742 |Insignificant Details of the Accidental Episode (2011) |
|262795 |Kevin Pollak: The Littlest Suspect (2010)              |
|184887 |Meditation Park (2017)                                 |
|216372 |Adoring (2019)                                         |
|292047 |Bamboo Doll of Echizen (1963)                          |
|289251 |Cirque du Soleil: Saltimbanco (1997)                   |
|200016 |The Nagano Tapes (2018)                                |
|267948 |Christmas Cracker (1963)                               |
|228813 |The Hero's Journey: The World of Joseph Campbell (1987)|
|148084 |Emmanuelle in Soho (1981)                              |
|233251 |L

In [32]:
total_users = movies_ratings_df.select('userId').distinct().count()
print(total_users)

170081


In [94]:
user_id_input = input('Enter user id: ')

user_ratings = movies_ratings_df.filter(col('userId') == user_id_input).select('movieId','title')
user_ratings.show(truncate = False)

Enter user id: 100
+-------+--------------------------------------------------------------+
|movieId|title                                                         |
+-------+--------------------------------------------------------------+
|1      |Toy Story (1995)                                              |
|2      |Jumanji (1995)                                                |
|5      |Father of the Bride Part II (1995)                            |
|31     |Dangerous Minds (1995)                                        |
|34     |Babe (1995)                                                   |
|47     |Seven (a.k.a. Se7en) (1995)                                   |
|50     |Usual Suspects, The (1995)                                    |
|104    |Happy Gilmore (1996)                                          |
|150    |Apollo 13 (1995)                                              |
|165    |Die Hard: With a Vengeance (1995)                             |
|260    |Star Wars: Episode IV - A


A **left anti join** returns only the rows of the left DataFrame that have no match in the right DataFrame, and it keeps only the left DataFrame's columns.

In [96]:
recommended_movies_df = top_rated_movies.join(user_ratings, on='movieId', how='left_anti')
recommended_movies_df.show(truncate = False)

+-------+---------------------------------------+
|movieId|title                                  |
+-------+---------------------------------------+
|148114 |The Ties That Bind (2015)              |
|210559 |Black Harvest (1992)                   |
|254334 |Sweet Carolina (2021)                  |
|289775 |Toma (2021)                            |
|212947 |Uppity: The Willy T. Ribbs Story (2020)|
|246168 |The Christmas Edition (2020)           |
|268482 |'Tis the Season to be Merry (2021)     |
|268488 |Runaway Christmas Bride (2017)         |
|285265 |Love in Bloom (2022)                   |
|218095 |Most Valuable Players (2009)           |
|289861 |The Time of Secrets (2022)             |
|171279 |Robin Williams - Off the Wall (1978)   |
|281956 |ManFish (2022)                         |
|137052 |A Job to Kill For (2006)               |
|267940 |Silvery Moon (1933)                    |
|204034 |The Scoundrel (1988)                   |
|137032 |The Perfect Neighbor (2005)            |


Here the `left_anti` join keeps only the top-rated movies that are not in the user's list of rated movies.
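
The anti-join semantics can be sketched in plain Python as a filter on keys that are absent from the right side. The IDs and titles below are made up for illustration.

```python
# Toy data: movieId -> title for the top-rated list, plus the ids one user has rated.
top_rated = {101: "Film A", 102: "Film B", 103: "Film C"}
rated_by_user = {102, 999}

# Keep only left-side rows whose key has no match on the right;
# nothing from the right side appears in the result.
recommended = {
    movie_id: title
    for movie_id, title in top_rated.items()
    if movie_id not in rated_by_user
}
print(recommended)  # {101: 'Film A', 103: 'Film C'}
```

The unmatched right-side id (999) has no effect on the result, which mirrors how the join ignores movies the user rated that are outside the top-rated list.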

**Correlate Ratings and Release Year:**

Analyze if there’s any correlation between the release year and the average rating of movies.


In [95]:
avg_rating_year = movies_ratings_df.groupBy('year').agg(avg(col('rating')).alias('Avg_rating'))
avg_rating_year.show(truncate = False)

+----+------------------+
|year|Avg_rating        |
+----+------------------+
|1959|3.8124824072510273|
|1990|3.4511998418576675|
|1896|2.895             |
|1903|3.095622119815668 |
|1975|3.882580279336278 |
|1977|3.8249607945635127|
|1888|2.596774193548387 |
|1924|3.7569906790945407|
|2003|3.4766174307820195|
|2007|3.5071440157352596|
|1892|2.1931818181818183|
|2018|3.479722733282176 |
|1974|3.894832976797762 |
|2015|3.564475949876199 |
|2023|3.3467310972143265|
|1927|3.860412299091544 |
|1955|3.723592937893098 |
|1890|2.275             |
|2006|3.555251218560976 |
|1978|3.4619053148430763|
+----+------------------+
only showing top 20 rows



In [99]:
avg_rating_year_pd = avg_rating_year.toPandas()
correlation = avg_rating_year_pd['year'].astype(float).corr(avg_rating_year_pd['Avg_rating'])
print("Correlation between release year and average rating: ",correlation)


Correlation between release year and average rating:  0.46164532462049135


Here we calculate the Pearson correlation coefficient between the `year` and `Avg_rating` columns.

**Conclusion:** A Pearson coefficient of about 0.46 indicates a moderate positive correlation: more recent release years tend to have somewhat higher average ratings than older ones. Because the coefficient is computed over per-year averages, each year carries equal weight regardless of how many ratings it received.
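
For reference, the coefficient that pandas' `Series.corr` computes by default is the covariance of the two series divided by the product of their standard deviations. The sketch below implements that definition in plain Python on made-up values; `pearson` is a hypothetical helper, and it uses `math.fsum` rather than `sum` because this notebook imports PySpark's `sum` over the built-in.

```python
from math import fsum, sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = fsum(xs) / n
    mean_y = fsum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances up to the same constant factor).
    var_x = fsum((x - mean_x) ** 2 for x in xs)
    var_y = fsum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

years = [1990, 2000, 2010, 2020]
avg_ratings = [3.0, 3.2, 3.1, 3.5]  # made-up values, not taken from the dataset
print(round(pearson(years, avg_ratings), 3))
```

A value near +1 or -1 indicates a strong linear relationship and a value near 0 indicates little or none.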

**Genre Diversity in Top-Rated Movies:**

Examine the genre diversity among the top 100 highest-rated movies.

In [113]:
top_rated_movies = movies_ratings_df.groupBy('movieId', 'title')\
    .agg(avg(col('rating')).alias('Avg_rating'))\
    .orderBy(col('Avg_rating').desc())\
    .limit(100)\
    .select('movieId','title')

top_rated_movies.show(truncate = False)

+-------+-------------------------------------------------------+
|movieId|title                                                  |
+-------+-------------------------------------------------------+
|137030 |Dangerous Child (2001)                                 |
|171279 |Robin Williams - Off the Wall (1978)                   |
|262795 |Kevin Pollak: The Littlest Suspect (2010)              |
|248742 |Insignificant Details of the Accidental Episode (2011) |
|216372 |Adoring (2019)                                         |
|184887 |Meditation Park (2017)                                 |
|289251 |Cirque du Soleil: Saltimbanco (1997)                   |
|292047 |Bamboo Doll of Echizen (1963)                          |
|267948 |Christmas Cracker (1963)                               |
|200016 |The Nagano Tapes (2018)                                |
|148084 |Emmanuelle in Soho (1981)                              |
|228813 |The Hero's Journey: The World of Joseph Campbell (1987)|
|221334 |B

In [115]:
from pyspark.sql.functions import col, explode, split, countDistinct

In [116]:
movies_df_aliased = movies_df.alias('m')
top_rated_movies_aliased = top_rated_movies.alias('t')


genres_split_df = movies_df_aliased.withColumn("genre", explode(split(col("genres"), "\\|"))) \
                                   .join(top_rated_movies_aliased, on="movieId") \
                                   .select('m.title', 'genre')

genres_split_df.show(truncate=False)


+-------------------------------+-----------+
|title                          |genre      |
+-------------------------------+-----------+
|Fitzgerald (2002)              |Drama      |
|The Seven Deadly Sins (1952)   |Drama      |
|The Wrong Girl (1999)          |Drama      |
|The Wrong Girl (1999)          |Thriller   |
|The Perfect Neighbor (2005)    |Drama      |
|The Perfect Neighbor (2005)    |Thriller   |
|A Job to Kill For (2006)       |Drama      |
|A Job to Kill For (2006)       |Thriller   |
|Stranger in My House (1999)    |Thriller   |
|Duel of Hearts (1992)          |Drama      |
|Duel of Hearts (1992)          |Mystery    |
|Duel of Hearts (1992)          |Romance    |
|Unmatched (2010)               |Documentary|
|Who Killed Chea Vichea? (2010) |Documentary|
|Las Poquianchis (1976)         |Crime      |
|Las Poquianchis (1976)         |Drama      |
|Quax, der Bruchpilot (1941)    |Comedy     |
|In the Meantime, Darling (1944)|Comedy     |
|In the Meantime, Darling (1944)|W

In [119]:
# Count distinct genres
genre_diversity = genres_split_df.groupBy("title") \
                                .agg(countDistinct("genre").alias("distinct_genres"))
genre_diversity = genre_diversity.orderBy(col('distinct_genres').desc())
genre_diversity.show(truncate = False)

+-------------------------------------+---------------+
|title                                |distinct_genres|
+-------------------------------------+---------------+
|Eclectic Shorts by Eric Leiser (2006)|4              |
|The Light in the Forest (1958)       |4              |
|Hanukkah on Rye (2022)               |4              |
|Where Is Anne Frank (2021)           |4              |
|Duel of Hearts (1992)                |3              |
|Silvery Moon (1933)                  |3              |
|Christmas in the City (2013)         |3              |
|Paper Marriage (1988)                |3              |
|Pari (2018)                          |3              |
|In Merry Measure (2022)              |3              |
|The Thousand Faces of Dunjia (2017)  |3              |
|Jingle Bell Princess (2021)          |3              |
|Sweet Carolina (2021)                |3              |
|Stars Fell on Alabama (2021)         |3              |
|Ghosts of Christmas Always (2022)    |3        

**Total Genre Counts:**

You might want to count how many times each genre appears across all the top 100 movies.


In [120]:
total_count = genres_split_df.groupBy('genre').agg(count(col('genre')).alias('No_of_count'))
total_count = total_count.orderBy(col('No_of_count').desc())
total_count.show(truncate = False)

+-----------+-----------+
|genre      |No_of_count|
+-----------+-----------+
|Drama      |31         |
|Romance    |27         |
|Documentary|24         |
|Comedy     |24         |
|Children   |7          |
|Thriller   |5          |
|War        |5          |
|Fantasy    |5          |
|Animation  |4          |
|Horror     |4          |
|Action     |4          |
|Adventure  |3          |
|Mystery    |3          |
|Crime      |2          |
|Western    |1          |
+-----------+-----------+



**Percentage Representation:**

Calculate the percentage representation of each genre among the top-rated movies to understand which genres dominate.

In [126]:
from pyspark.sql.functions import round, col

# Share of the top-rated movies that carry each genre.
# A movie can carry several genres, so the percentages can sum to more than 100.
top_movie_count = top_rated_movies.count()
percentage_count = total_count.withColumn(
    'percentage_of_top_movies',
    round(col('No_of_count') / top_movie_count * 100, 2))
percentage_count.show(truncate=False)

+-----------+-----------+------------------------+
|genre      |No_of_count|percentage_of_top_movies|
+-----------+-----------+------------------------+
|Drama      |31         |31.0                    |
|Romance    |27         |27.0                    |
|Documentary|24         |24.0                    |
|Comedy     |24         |24.0                    |
|Children   |7          |7.0                     |
|Thriller   |5          |5.0                     |
|War        |5          |5.0                     |
|Fantasy    |5          |5.0                     |
|Animation  |4          |4.0                     |
|Horror     |4          |4.0                     |
|Action     |4          |4.0                     |
|Adventure  |3          |3.0                     |
|Mystery    |3          |3.0                     |
|Crime      |2          |2.0                     |
|Western    |1          |1.0                     |
+-----------+-----------+------------------------+



**Step 5: Performance Optimization**


**Cache and Persist:**

Cache or persist DataFrames that several queries reuse (for example `movies_ratings_df`) so that Spark does not recompute them from the source files on every action.


**Partitioning:**

Repartition the data on the columns used in joins and aggregations (for example `movieId`); on large datasets this can reduce the amount of shuffling later operations need.


**Step 6: Visualization**

**Visualize Data:**

Use PySpark with an external library such as Matplotlib or Seaborn, or a tool such as Power BI, to create visualizations such as:

- Distribution of ratings.
- Average rating per genre.
- Trends in movie ratings over the years.
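The yearly trend, for instance, could be plotted with Matplotlib along the lines below. This is a sketch under assumptions: the `year_avg` dictionary holds ten rows copied from the `avg_rating_year` output above and rounded to two decimals, standing in for the full result of `avg_rating_year.toPandas()`.

```python
import matplotlib.pyplot as plt

# Ten (year -> average rating) rows from the avg_rating_year output, rounded.
year_avg = {
    1959: 3.81, 1974: 3.89, 1975: 3.88, 1977: 3.82, 1978: 3.46,
    1990: 3.45, 2003: 3.48, 2007: 3.51, 2015: 3.56, 2018: 3.48,
}

# Sort by year so the line is drawn in chronological order.
years = sorted(year_avg)
ratings = [year_avg[y] for y in years]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(years, ratings, marker="o")
ax.set_xlabel("Release year")
ax.set_ylabel("Average rating")
ax.set_title("Average rating by release year (sample of years)")
fig.tight_layout()
```

Aggregating in Spark first and converting only the small result to pandas or plain Python keeps the plotting step cheap, even when the ratings table is large.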