
In [2]:
!pip install pyspark



In [3]:
from pyspark.sql import SparkSession

In [4]:
spark = (
    SparkSession
    .builder
    .appName("Movie ratings")
    .master("local[*]")
    .getOrCreate()
)

**Step 1: Data Exploration**

**Load the datasets into the PySpark session "spark":**

Load the movie dataset (movies.csv) and the ratings dataset (ratings.csv) into PySpark DataFrames.

**Inspect the data:**

Display the schema and the first few rows of both DataFrames to understand the data structure.

In [63]:
movies_df = spark.read.format("csv").option("header",True).load("/content/movies.csv")
movies_df.show(truncate = False)

+-------+-------------------------------------+-------------------------------------------+
|movieId|title                                |genres                                     |
+-------+-------------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                     |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Jumanji (1995)                       |Adventure|Children|Fantasy                 |
|3      |Grumpier Old Men (1995)              |Comedy|Romance                             |
|4      |Waiting to Exhale (1995)             |Comedy|Drama|Romance                       |
|5      |Father of the Bride Part II (1995)   |Comedy                                     |
|6      |Heat (1995)                          |Action|Crime|Thriller                      |
|7      |Sabrina (1995)                       |Comedy|Romance                             |
|8      |Tom and Huck (1995)                  |Adventure|Children               

In [12]:
ratings_df = spark.read.format("csv").option("header",True).load("/content/ratings.csv")
ratings_df.show(truncate = False)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|1     |17     |4.0   |944249077|
|1     |25     |1.0   |944250228|
|1     |29     |2.0   |943230976|
|1     |30     |5.0   |944249077|
|1     |32     |5.0   |943228858|
|1     |34     |2.0   |943228491|
|1     |36     |1.0   |944249008|
|1     |80     |5.0   |944248943|
|1     |110    |3.0   |943231119|
|1     |111    |5.0   |944249008|
|1     |161    |1.0   |943231162|
|1     |166    |5.0   |943228442|
|1     |176    |4.0   |944079496|
|1     |223    |3.0   |944082810|
|1     |232    |5.0   |943228442|
|1     |260    |5.0   |943228696|
|1     |302    |4.0   |944253272|
|1     |306    |5.0   |944248888|
|1     |307    |5.0   |944253207|
|1     |322    |4.0   |944053801|
+------+-------+------+---------+
only showing top 20 rows



**Step 2: Data Cleaning and Preparation**

**Handle Missing Values:**

Check for any missing values and decide how to handle them (e.g., drop rows, fill with default values).

In [32]:
# Note: importing sum from pyspark.sql.functions shadows Python's built-in sum()
from pyspark.sql.functions import sum, col, when, split

In [19]:
print(movies_df.dtypes)

[('movieId', 'string'), ('title', 'string'), ('genres', 'string')]


In [20]:
print(ratings_df.dtypes)

[('userId', 'string'), ('movieId', 'string'), ('rating', 'string'), ('timestamp', 'string')]


In [27]:
# Create an empty list to hold column expressions for missing values
missing_value_expressions_movies = []

# Loop through each column in the DataFrame
for column in movies_df.columns:
    # Count the number of null values in the current column
    missing_expr = sum(when(col(column).isNull(), 1).otherwise(0)).alias(column)
    missing_value_expressions_movies.append(missing_expr)

# Aggregate the missing value counts
missing_movies_df = movies_df.agg(*missing_value_expressions_movies)

# Show the results
missing_movies_df.show()

+-------+-----+------+
|movieId|title|genres|
+-------+-----+------+
|      0|    0|     0|
+-------+-----+------+



In [29]:
# Create an empty list to hold column expressions for missing values
missing_value_expressions_ratings = []

# Loop through each column in the DataFrame
for column in ratings_df.columns:
    # Count the number of null values in the current column
    missing_expr = sum(when(col(column).isNull(), 1).otherwise(0)).alias(column)
    missing_value_expressions_ratings.append(missing_expr)

# Aggregate the missing value counts
missing_ratings_df = ratings_df.agg(*missing_value_expressions_ratings)

# Show the results
missing_ratings_df.show()

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     0|      0|     0|        0|
+------+-------+------+---------+



**There are no null values in either dataset.**

**Data Transformation:**

Extract the year from the movie title and create a new column year. Split the genres column into an array of genres.

In [42]:
movies_df.show(truncate = False)

+-------+-------------------------------------+-------------------------------------------+
|movieId|title                                |genres                                     |
+-------+-------------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                     |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Jumanji (1995)                       |Adventure|Children|Fantasy                 |
|3      |Grumpier Old Men (1995)              |Comedy|Romance                             |
|4      |Waiting to Exhale (1995)             |Comedy|Drama|Romance                       |
|5      |Father of the Bride Part II (1995)   |Comedy                                     |
|6      |Heat (1995)                          |Action|Crime|Thriller                      |
|7      |Sabrina (1995)                       |Comedy|Romance                             |
|8      |Tom and Huck (1995)                  |Adventure|Children               

Extract the movie name and the year from the title.

In [64]:
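# Split each title on "(": item 0 is the movie name (it keeps the trailing space),
# and the first four characters of item 1 are taken as the year.
title_parts = split(col("title"), "\\(")
movies_df = (
    movies_df
    .withColumn("movie_name", title_parts.getItem(0))
    .withColumn("year", title_parts.getItem(1).substr(1, 4))
)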

In [65]:
movies_df.show(truncate = False)

+-------+-------------------------------------+-------------------------------------------+-------------------------------+----+
|movieId|title                                |genres                                     |movie_name                     |year|
+-------+-------------------------------------+-------------------------------------------+-------------------------------+----+
|1      |Toy Story (1995)                     |Adventure|Animation|Children|Comedy|Fantasy|Toy Story                      |1995|
|2      |Jumanji (1995)                       |Adventure|Children|Fantasy                 |Jumanji                        |1995|
|3      |Grumpier Old Men (1995)              |Comedy|Romance                             |Grumpier Old Men               |1995|
|4      |Waiting to Exhale (1995)             |Comedy|Drama|Romance                       |Waiting to Exhale              |1995|
|5      |Father of the Bride Part II (1995)   |Comedy                                     |Father
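
Splitting on the first `(` assumes the year is the only parenthesised part of a title. A title that carries an alternate name in parentheses before the year would yield a wrong name and year. One possible alternative is a regular expression anchored at the end of the title. The plain-Python sketch below checks such a pattern on made-up titles; the helper `split_title` and the sample titles are illustrative and not part of the notebook. The same pattern could in principle be passed to PySpark's `regexp_extract`.

```python
import re

# Everything before a four-digit year in parentheses at the very end of the title
YEAR_AT_END = re.compile(r"^(.*?)\s*\((\d{4})\)\s*$")

def split_title(title):
    """Return (movie_name, year); year is None when no trailing '(YYYY)' is found."""
    match = YEAR_AT_END.match(title)
    if match is None:
        return title.strip(), None
    return match.group(1), match.group(2)

print(split_title("Toy Story (1995)"))
print(split_title("Example Film (Alternate Title) (1995)"))
print(split_title("Untitled Project"))
```

Returning `None` for titles without a year makes those rows easy to find afterwards, instead of silently storing four unrelated characters.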

In [66]:
# Drop the original title column now that movie_name and year exist

movies_df = movies_df.drop(movies_df.title)
movies_df.show(truncate = False)

+-------+-------------------------------------------+-------------------------------+----+
|movieId|genres                                     |movie_name                     |year|
+-------+-------------------------------------------+-------------------------------+----+
|1      |Adventure|Animation|Children|Comedy|Fantasy|Toy Story                      |1995|
|2      |Adventure|Children|Fantasy                 |Jumanji                        |1995|
|3      |Comedy|Romance                             |Grumpier Old Men               |1995|
|4      |Comedy|Drama|Romance                       |Waiting to Exhale              |1995|
|5      |Comedy                                     |Father of the Bride Part II    |1995|
|6      |Action|Crime|Thriller                      |Heat                           |1995|
|7      |Comedy|Romance                             |Sabrina                        |1995|
|8      |Adventure|Children                         |Tom and Huck                   |1995|

In [68]:
genres_df = movies_df.select(movies_df.movieId,movies_df.genres)
genres_df.show(truncate = False)

+-------+-------------------------------------------+
|movieId|genres                                     |
+-------+-------------------------------------------+
|1      |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Adventure|Children|Fantasy                 |
|3      |Comedy|Romance                             |
|4      |Comedy|Drama|Romance                       |
|5      |Comedy                                     |
|6      |Action|Crime|Thriller                      |
|7      |Comedy|Romance                             |
|8      |Adventure|Children                         |
|9      |Action                                     |
|10     |Action|Adventure|Thriller                  |
|11     |Comedy|Drama|Romance                       |
|12     |Comedy|Horror                              |
|13     |Adventure|Animation|Children               |
|14     |Drama                                      |
|15     |Action|Adventure|Romance                   |
|16     |Crime|Drama        

In [73]:
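# "|" is a regex metacharacter, so it is escaped; "\\|" avoids Python's invalid-escape warning for "\|"
genres_df = genres_df.withColumn("gener", split(genres_df.genres, "\\|"))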
genres_df.show(truncate = False)

+-------+-------------------------------------------+-------------------------------------------------+
|movieId|genres                                     |gener                                            |
+-------+-------------------------------------------+-------------------------------------------------+
|1      |Adventure|Animation|Children|Comedy|Fantasy|[Adventure, Animation, Children, Comedy, Fantasy]|
|2      |Adventure|Children|Fantasy                 |[Adventure, Children, Fantasy]                   |
|3      |Comedy|Romance                             |[Comedy, Romance]                                |
|4      |Comedy|Drama|Romance                       |[Comedy, Drama, Romance]                         |
|5      |Comedy                                     |[Comedy]                                         |
|6      |Action|Crime|Thriller                      |[Action, Crime, Thriller]                        |
|7      |Comedy|Romance                             |[Comedy, Ro

In [74]:
genres_df = genres_df.drop(genres_df.genres)
genres_df.show(truncate = False)

+-------+-------------------------------------------------+
|movieId|gener                                            |
+-------+-------------------------------------------------+
|1      |[Adventure, Animation, Children, Comedy, Fantasy]|
|2      |[Adventure, Children, Fantasy]                   |
|3      |[Comedy, Romance]                                |
|4      |[Comedy, Drama, Romance]                         |
|5      |[Comedy]                                         |
|6      |[Action, Crime, Thriller]                        |
|7      |[Comedy, Romance]                                |
|8      |[Adventure, Children]                            |
|9      |[Action]                                         |
|10     |[Action, Adventure, Thriller]                    |
|11     |[Comedy, Drama, Romance]                         |
|12     |[Comedy, Horror]                                 |
|13     |[Adventure, Animation, Children]                 |
|14     |[Drama]                        

**Join the Datasets:**

Perform an inner join on the movieId column to combine the movie and ratings datasets.