
This project analyzes the movie and ratings datasets with PySpark. The step-by-step guide below lays out the workflow and the questions you can answer at each stage, so the finished project demonstrates your PySpark and data analysis skills on your resume.

### Step 1: Data Exploration
1. **Load the datasets into PySpark**:
   - Load the movie dataset (`movies.csv`) and ratings dataset (`ratings.csv`) into PySpark DataFrames.

2. **Inspect the Data**:
   - Display the schema and first few rows of both DataFrames to understand the data structure.

### Step 2: Data Cleaning and Preparation
1. **Handle Missing Values**:
   - Check for any missing values and decide how to handle them (e.g., drop rows, fill with default values).

2. **Data Transformation**:
   - Extract the year from the movie title and create a new column `year`.
   - Split the `genres` column into an array of genres.

3. **Join the Datasets**:
   - Perform an inner join on the `movieId` column to combine the movie and ratings datasets.

### Step 3: Data Analysis
1. **Top-Rated Movies**:
   - Find the top 10 movies with the highest average rating.

2. **Popular Genres**:
   - Identify the most popular genres based on the number of ratings.

3. **User Behavior**:
   - Analyze the distribution of ratings by users. For example, find how many movies each user has rated.

4. **Yearly Trends**:
   - Determine how the average movie rating has changed over the years.

5. **Genre-Based Recommendations**:
   - For a given genre, list the top 5 movies based on average ratings.

6. **Movies with the Most Reviews**:
   - Identify the movies with the highest number of ratings.

### Step 4: Advanced Analysis
1. **User-Specific Recommendations**:
   - Build a basic recommendation system by suggesting top-rated movies that a user has not rated yet.

2. **Correlate Ratings and Release Year**:
   - Analyze if there’s any correlation between the release year and the average rating of movies.

3. **Genre Diversity in Top-Rated Movies**:
   - Examine the genre diversity among the top 100 highest-rated movies.

### Step 5: Performance Optimization
1. **Cache and Persist**:
   - Use caching and persistence in PySpark to optimize the performance of your queries.

2. **Partitioning**:
   - Apply partitioning to the data to improve the efficiency of operations, especially for large datasets.

### Step 6: Visualization
1. **Visualize Data**:
   - Use PySpark with an external library like Matplotlib, Seaborn, or even Power BI to create visualizations such as:
     - Distribution of ratings.
     - Average rating per genre.
     - Trends in movie ratings over the years.

### Conclusion
This project structure not only demonstrates your technical ability to handle and analyze data using PySpark but also your ability to derive meaningful insights from complex datasets. Once complete, you can showcase this project in your portfolio or resume, highlighting key aspects such as data cleaning, transformation, analysis, and visualization.

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.2.tar.gz (317.3 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.2-py2.py3-none-any.whl size=317812365 sha256=f8096a052c068fd7887adb33f78a26ea9ad65c6b36e010f75f01fe9e8682f4e8
  Stored in directory: /root/.cache/pip/wheels/34/34/bd/03944534c44b677cd5859f248090daa9fb27b3c8f8e5f49574
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.2


In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = (
    SparkSession
    .builder
    .appName("Movie ratings")
    .master("local[*]")
    .getOrCreate()
)

In [40]:
spark

**Step 1: Data Exploration**

**Load the datasets into the PySpark session `spark`:**

Load the movie dataset (`movies.csv`) and ratings dataset (`ratings.csv`) into PySpark DataFrames.

**Inspect the Data:**

Display the schema and first few rows of both DataFrames to understand the data structure.

In [82]:
movies_df = spark.read.format("csv").option("header",True).load("/content/movies.csv")
movies_df.show(truncate = False)

+-------+-------------------------------------+-------------------------------------------+
|movieId|title                                |genres                                     |
+-------+-------------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                     |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Jumanji (1995)                       |Adventure|Children|Fantasy                 |
|3      |Grumpier Old Men (1995)              |Comedy|Romance                             |
|4      |Waiting to Exhale (1995)             |Comedy|Drama|Romance                       |
|5      |Father of the Bride Part II (1995)   |Comedy                                     |
|6      |Heat (1995)                          |Action|Crime|Thriller                      |
|7      |Sabrina (1995)                       |Comedy|Romance                             |
|8      |Tom and Huck (1995)                  |Adventure|Children               

In [5]:
ratings_df = spark.read.format("csv").option("header",True).load("/content/ratings.csv")
ratings_df.show(truncate = False)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|1     |17     |4.0   |944249077|
|1     |25     |1.0   |944250228|
|1     |29     |2.0   |943230976|
|1     |30     |5.0   |944249077|
|1     |32     |5.0   |943228858|
|1     |34     |2.0   |943228491|
|1     |36     |1.0   |944249008|
|1     |80     |5.0   |944248943|
|1     |110    |3.0   |943231119|
|1     |111    |5.0   |944249008|
|1     |161    |1.0   |943231162|
|1     |166    |5.0   |943228442|
|1     |176    |4.0   |944079496|
|1     |223    |3.0   |944082810|
|1     |232    |5.0   |943228442|
|1     |260    |5.0   |943228696|
|1     |302    |4.0   |944253272|
|1     |306    |5.0   |944248888|
|1     |307    |5.0   |944253207|
|1     |322    |4.0   |944053801|
+------+-------+------+---------+
only showing top 20 rows



**Step 2: Data Cleaning and Preparation**

**Handle Missing Values:**

Check for any missing values and decide how to handle them (e.g., drop rows, fill with default values).

In [6]:
# Note: this import shadows Python's built-in sum() for the rest of the notebook
from pyspark.sql.functions import sum,col,when,split

In [7]:
print(movies_df.dtypes)

[('movieId', 'string'), ('title', 'string'), ('genres', 'string')]


In [8]:
print(ratings_df.dtypes)

[('userId', 'string'), ('movieId', 'string'), ('rating', 'string'), ('timestamp', 'string')]


In [9]:
# Create an empty list to hold column expressions for missing values
missing_value_expressions_movies = []

# Loop through each column in the DataFrame
for column in movies_df.columns:
    # Count the number of null values in the current column
    missing_expr = sum(when(col(column).isNull(), 1).otherwise(0)).alias(column)
    missing_value_expressions_movies.append(missing_expr)

# Aggregate the missing value counts
missing_movies_df = movies_df.agg(*missing_value_expressions_movies)

# Show the results
missing_movies_df.show()

+-------+-----+------+
|movieId|title|genres|
+-------+-----+------+
|      0|    0|     0|
+-------+-----+------+



In [10]:
# Create an empty list to hold column expressions for missing values
missing_value_expressions_ratings = []

# Loop through each column in the DataFrame
for column in ratings_df.columns:
    # Count the number of null values in the current column
    missing_expr = sum(when(col(column).isNull(), 1).otherwise(0)).alias(column)
    missing_value_expressions_ratings.append(missing_expr)

# Aggregate the missing value counts
missing_ratings_df = ratings_df.agg(*missing_value_expressions_ratings)

# Show the results
missing_ratings_df.show()

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     0|      0|     0|        0|
+------+-------+------+---------+



**There are no null values in these datasets.**

**Data Transformation:**

Extract the year from the movie title and create a new column year. Split the genres column into an array of genres.

In [11]:
movies_df.show(truncate = False)

+-------+-------------------------------------+-------------------------------------------+
|movieId|title                                |genres                                     |
+-------+-------------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                     |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Jumanji (1995)                       |Adventure|Children|Fantasy                 |
|3      |Grumpier Old Men (1995)              |Comedy|Romance                             |
|4      |Waiting to Exhale (1995)             |Comedy|Drama|Romance                       |
|5      |Father of the Bride Part II (1995)   |Comedy                                     |
|6      |Heat (1995)                          |Action|Crime|Thriller                      |
|7      |Sabrina (1995)                       |Comedy|Romance                             |
|8      |Tom and Huck (1995)                  |Adventure|Children               

Extracting year from title.

In [80]:
from pyspark.sql.functions import expr,trim

In [83]:
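# movie_name: text before the first "(" (a trailing space remains; wrap in trim() to remove it)
# year: text after the last "(", trimmed, then its first four characters
movies_df = movies_df.withColumn("movie_name",split(col("title"),"\\(").getItem(0))\
                     .withColumn("year", expr("substring_index(title, '(', -1)"))\
                     .withColumn("year", trim(col("year")).substr(1, 4))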

movies_df.show(truncate=False)

+-------+-------------------------------------+-------------------------------------------+-------------------------------+----+
|movieId|title                                |genres                                     |movie_name                     |year|
+-------+-------------------------------------+-------------------------------------------+-------------------------------+----+
|1      |Toy Story (1995)                     |Adventure|Animation|Children|Comedy|Fantasy|Toy Story                      |1995|
|2      |Jumanji (1995)                       |Adventure|Children|Fantasy                 |Jumanji                        |1995|
|3      |Grumpier Old Men (1995)              |Comedy|Romance                             |Grumpier Old Men               |1995|
|4      |Waiting to Exhale (1995)             |Comedy|Drama|Romance                       |Waiting to Exhale              |1995|
|5      |Father of the Bride Part II (1995)   |Comedy                                     |Father

Some titles do not include a year, so the extracted value is not a four-digit number. I am dropping those rows.

In [85]:
movies_df = movies_df.filter(col("year").rlike("^[0-9]{4}$"))
movies_df.show()

+-------+--------------------+--------------------+--------------------+----+
|movieId|               title|              genres|          movie_name|year|
+-------+--------------------+--------------------+--------------------+----+
|      1|    Toy Story (1995)|Adventure|Animati...|          Toy Story |1995|
|      2|      Jumanji (1995)|Adventure|Childre...|            Jumanji |1995|
|      3|Grumpier Old Men ...|      Comedy|Romance|   Grumpier Old Men |1995|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|  Waiting to Exhale |1995|
|      5|Father of the Bri...|              Comedy|Father of the Bri...|1995|
|      6|         Heat (1995)|Action|Crime|Thri...|               Heat |1995|
|      7|      Sabrina (1995)|      Comedy|Romance|            Sabrina |1995|
|      8| Tom and Huck (1995)|  Adventure|Children|       Tom and Huck |1995|
|      9| Sudden Death (1995)|              Action|       Sudden Death |1995|
|     10|    GoldenEye (1995)|Action|Adventure|...|          Gol

In [86]:
# Drop the original title column; movie_name and year now carry its information

movies_df = movies_df.drop(movies_df.title)
movies_df.show(truncate = False)

+-------+-------------------------------------------+-------------------------------+----+
|movieId|genres                                     |movie_name                     |year|
+-------+-------------------------------------------+-------------------------------+----+
|1      |Adventure|Animation|Children|Comedy|Fantasy|Toy Story                      |1995|
|2      |Adventure|Children|Fantasy                 |Jumanji                        |1995|
|3      |Comedy|Romance                             |Grumpier Old Men               |1995|
|4      |Comedy|Drama|Romance                       |Waiting to Exhale              |1995|
|5      |Comedy                                     |Father of the Bride Part II    |1995|
|6      |Action|Crime|Thriller                      |Heat                           |1995|
|7      |Comedy|Romance                             |Sabrina                        |1995|
|8      |Adventure|Children                         |Tom and Huck                   |1995|

In [87]:
genres_df = movies_df.select(movies_df.movieId,movies_df.genres)
genres_df.show(truncate = False)

+-------+-------------------------------------------+
|movieId|genres                                     |
+-------+-------------------------------------------+
|1      |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Adventure|Children|Fantasy                 |
|3      |Comedy|Romance                             |
|4      |Comedy|Drama|Romance                       |
|5      |Comedy                                     |
|6      |Action|Crime|Thriller                      |
|7      |Comedy|Romance                             |
|8      |Adventure|Children                         |
|9      |Action                                     |
|10     |Action|Adventure|Thriller                  |
|11     |Comedy|Drama|Romance                       |
|12     |Comedy|Horror                              |
|13     |Adventure|Animation|Children               |
|14     |Drama                                      |
|15     |Action|Adventure|Romance                   |
|16     |Crime|Drama        

In [88]:
genres_df = genres_df.withColumn("gener",split(genres_df.genres,r"\|"))
genres_df.show(truncate = False)

+-------+-------------------------------------------+-------------------------------------------------+
|movieId|genres                                     |gener                                            |
+-------+-------------------------------------------+-------------------------------------------------+
|1      |Adventure|Animation|Children|Comedy|Fantasy|[Adventure, Animation, Children, Comedy, Fantasy]|
|2      |Adventure|Children|Fantasy                 |[Adventure, Children, Fantasy]                   |
|3      |Comedy|Romance                             |[Comedy, Romance]                                |
|4      |Comedy|Drama|Romance                       |[Comedy, Drama, Romance]                         |
|5      |Comedy                                     |[Comedy]                                         |
|6      |Action|Crime|Thriller                      |[Action, Crime, Thriller]                        |
|7      |Comedy|Romance                             |[Comedy, Ro

In [89]:
genres_df = genres_df.drop(genres_df.genres)
genres_df.show(truncate = False)

+-------+-------------------------------------------------+
|movieId|gener                                            |
+-------+-------------------------------------------------+
|1      |[Adventure, Animation, Children, Comedy, Fantasy]|
|2      |[Adventure, Children, Fantasy]                   |
|3      |[Comedy, Romance]                                |
|4      |[Comedy, Drama, Romance]                         |
|5      |[Comedy]                                         |
|6      |[Action, Crime, Thriller]                        |
|7      |[Comedy, Romance]                                |
|8      |[Adventure, Children]                            |
|9      |[Action]                                         |
|10     |[Action, Adventure, Thriller]                    |
|11     |[Comedy, Drama, Romance]                         |
|12     |[Comedy, Horror]                                 |
|13     |[Adventure, Animation, Children]                 |
|14     |[Drama]                        

**Join the Datasets:**

Perform an inner join on the movieId column to combine the movie and ratings datasets.

In [90]:
movies_ratings_df = movies_df.join(ratings_df,on='movieId',how='inner')
movies_ratings_df.show(truncate = False)

+-------+--------------------------------------+-----------------------------------+----+------+------+---------+
|movieId|genres                                |movie_name                         |year|userId|rating|timestamp|
+-------+--------------------------------------+-----------------------------------+----+------+------+---------+
|17     |Drama|Romance                         |Sense and Sensibility              |1995|1     |4.0   |944249077|
|25     |Drama|Romance                         |Leaving Las Vegas                  |1995|1     |1.0   |944250228|
|29     |Adventure|Drama|Fantasy|Mystery|Sci-Fi|City of Lost Children, The         |1995|1     |2.0   |943230976|
|30     |Crime|Drama                           |Shanghai Triad                     |1995|1     |5.0   |944249077|
|32     |Mystery|Sci-Fi|Thriller               |Twelve Monkeys                     |1995|1     |5.0   |943228858|
|34     |Children|Drama                        |Babe                               |1995

**Step 3: Data Analysis**



**Top-Rated Movies:**

Find the top 10 movies with the highest average rating.

In [91]:
from pyspark.sql.functions import avg,desc,asc
top_ten_movies = movies_ratings_df.groupby('movieId','movie_name').agg(avg('rating').alias('Avg_rating'))
top_ten_movies = top_ten_movies.orderBy(top_ten_movies.Avg_rating.desc())
top_ten_movies.show(10,truncate = False)

+-------+--------------------------------+----------+
|movieId|movie_name                      |Avg_rating|
+-------+--------------------------------+----------+
|143061 |Don't Look Down                 |5.0       |
|292047 |Bamboo Doll of Echizen          |5.0       |
|173815 |Carnival                        |5.0       |
|132150 |The 7 Grandmasters              |5.0       |
|144058 |Territory                       |5.0       |
|136878 |Paradh                          |5.0       |
|199572 |Love+Sling                      |5.0       |
|256477 |Justice League: Secret Origins  |5.0       |
|141870 |One-Two, Soldiers Were Going... |5.0       |
|203060 |Worlds of Ursula K. Le Guin     |5.0       |
+-------+--------------------------------+----------+
only showing top 10 rows



**Popular Genres:**

Identify the most popular genres based on the number of ratings.


In [92]:
from pyspark.sql.functions import count

popular_genres = movies_ratings_df.groupBy('genres').agg(count('rating').alias('No_of_ratings'))
popular_genres = popular_genres.sort(popular_genres.No_of_ratings.desc())
popular_genres.show(truncate = False)

+--------------------------------+-------------+
|genres                          |No_of_ratings|
+--------------------------------+-------------+
|Drama                           |233610       |
|Comedy                          |194932       |
|Comedy|Romance                  |112686       |
|Drama|Romance                   |101327       |
|Comedy|Drama                    |96340        |
|Comedy|Drama|Romance            |90816        |
|Action|Adventure|Sci-Fi         |82561        |
|Crime|Drama                     |79742        |
|Action|Crime|Thriller           |50769        |
|Drama|Thriller                  |48991        |
|Action|Adventure|Sci-Fi|Thriller|45753        |
|Action|Adventure|Thriller       |43291        |
|Action|Sci-Fi|Thriller          |41019        |
|Crime|Drama|Thriller            |40916        |
|Drama|War                       |36949        |
|Action|Crime|Drama|Thriller     |34356        |
|Action|Drama|War                |33745        |
|Comedy|Crime       

**User Behavior:**

Analyze the distribution of ratings by users. For example, find how many movies each user has rated.


In [93]:
# userId was loaded as a string, so this ascending sort is lexicographic (1, 10, 100, ...)
user_rating = movies_ratings_df.groupBy('userId').agg(count('userId').alias('No_of_ratings')).orderBy(movies_ratings_df.userId.asc())
user_rating.show(truncate = False)

+------+-------------+
|userId|No_of_ratings|
+------+-------------+
|1     |141          |
|10    |660          |
|100   |248          |
|1000  |27           |
|10000 |165          |
|10001 |25           |
|10002 |35           |
|10003 |447          |
|10004 |212          |
|10005 |420          |
|10006 |63           |
|10007 |89           |
|10008 |91           |
|10009 |59           |
|1001  |102          |
|10010 |137          |
|10011 |20           |
|10012 |64           |
|10013 |30           |
|10014 |88           |
+------+-------------+
only showing top 20 rows



**Yearly Trends:**

Determine how the average movie rating has changed over the years.


In [94]:
movies_ratings_df.select('year').show()

+----+
|year|
+----+
|1995|
|1995|
|1995|
|1995|
|1995|
|1995|
|1995|
|1995|
|1995|
|1976|
|1995|
|1995|
|1995|
|1994|
|1994|
|1977|
|1994|
|1994|
|1993|
|1995|
+----+
only showing top 20 rows



In [95]:
avg_rating_year = movies_ratings_df.groupBy('year').agg(avg('rating').alias('Avg_rating'))
avg_rating_year.show(truncate = False)

+----+------------------+
|year|Avg_rating        |
+----+------------------+
|1953|3.7073750991276766|
|1903|2.836206896551724 |
|1957|4.027836857250094 |
|1897|2.7857142857142856|
|1987|3.55497074751517  |
|1956|3.7286485218207415|
|2016|3.5318225764614333|
|1936|3.8399738732854343|
|2012|3.523314898029727 |
|2020|3.2318540977581773|
|1958|3.8233293657541236|
|1910|2.7               |
|1943|3.6489776046738074|
|1915|3.056338028169014 |
|1972|3.9569477434679334|
|1931|3.8992463308211027|
|1988|3.5281829017228303|
|1938|3.9044081228330856|
|1926|3.803921568627451 |
|1911|3.4285714285714284|
+----+------------------+
only showing top 20 rows



**Genre-Based Recommendations:**

For a given genre, list the top 5 movies based on average ratings.


In [98]:
genre_input = input("Enter the genre to filter movies: ")

# Exact match: only movies whose genres string equals the input are kept
top_five_movies = (
    movies_ratings_df.filter(col('genres') == genre_input)
    .groupBy('movie_name')
    .agg(avg('rating').alias('Avg_rating'))
    .orderBy(col('Avg_rating').desc())
    .limit(5)
)
top_five_movies.show(truncate = False)


Enter the genre to filter movies: Action
+----------------------------+----------+
|movie_name                  |Avg_rating|
+----------------------------+----------+
|Drunken Master 3            |5.0       |
|Dreadnaught                 |5.0       |
|Buffalo Girls               |5.0       |
|The 7 Grandmasters          |5.0       |
|The Division: Agent Origins |5.0       |
+----------------------------+----------+



In [117]:
x = movies_df.filter(col("movie_name").like("%Dreadnaught%"))
x.select(x.genres).show()

+------+
|genres|
+------+
|Action|
+------+



**Movies with the Most Reviews:**

Identify the movies with the highest number of ratings.

In [109]:
from pyspark.sql.functions import desc

In [112]:
most_reviews = movies_ratings_df.groupBy('movieId','movie_name')\
              .agg(count(col('rating')).alias('No_of_reviews'))\
              .orderBy(col('No_of_reviews').desc())\
              .limit(1)
most_reviews.show(truncate = False)

+-------+--------------------------+-------------+
|movieId|movie_name                |No_of_reviews|
+-------+--------------------------+-------------+
|318    |Shawshank Redemption, The |10749        |
+-------+--------------------------+-------------+



In [116]:
x = movies_df.filter(col("movie_name").like("Shawshank Redemption%"))
x.show(truncate = False)

+-------+-----------+--------------------------+----+
|movieId|genres     |movie_name                |year|
+-------+-----------+--------------------------+----+
|318    |Crime|Drama|Shawshank Redemption, The |1994|
+-------+-----------+--------------------------+----+



**Step 4: Advanced Analysis**


**User-Specific Recommendations:**

Build a basic recommendation system by suggesting top-rated movies that a user has not rated yet.


**Correlate Ratings and Release Year:**

Analyze if there’s any correlation between the release year and the average rating of movies.


**Genre Diversity in Top-Rated Movies:**

Examine the genre diversity among the top 100 highest-rated movies.

**Step 5: Performance Optimization**


**Cache and Persist:**

Use caching and persistence in PySpark to optimize the performance of your queries.


**Partitioning:**

Apply partitioning to the data to improve the efficiency of operations, especially for large datasets.


**Step 6: Visualization**

**Visualize Data:**

Use PySpark with an external library like Matplotlib, Seaborn, or even Power BI to create visualizations such as:
- Distribution of ratings.
- Average rating per genre.
- Trends in movie ratings over the years.