# 🎬 Movie Rating Analysis Using PySpark with Medallion Architecture

## ❇️ Objectives 


##### 🔹 To analyze movie ratings data using PySpark for efficient big data processing.

##### 🔹 To implement the Medallion Architecture (Bronze, Silver, Gold) for structured data transformation.

##### 🔹 To load, clean, and transform raw movie and rating data.

##### 🔹 To calculate average ratings for each movie.

##### 🔹 To identify the top 10 highest-rated movies.

##### 🔹 To analyze rating trends over movie release years.

##### 🔹 To create visualizations using Power BI for better interpretation of insights.

##### 🔹 To document and organize the entire process in a clean, reproducible format, enabling a better understanding of audience preferences and movie performance.


## 📁 Dataset Used

##### Dataset Name: MovieLens 100K Dataset

##### Source: GroupLens Research Project

##### Format: CSV

##### Files Used:

##### Movie_dataset_with_title_gener_relese_year.csv: Contains user ratings for movies, together with timestamp, release_year, and movie_title.

##### File Size: 9 KB

##### Key Columns:

##### user_Id – Unique identifier for each user

##### movie_Id – Unique identifier for each movie

##### rating – Rating given by the user (scale: 1.0 to 5.0)

##### timestamp – Time when the rating was given

##### movie_title – Movie title

##### genres – Genre(s) associated with the movie

##### release_year – Release year of the movie


## ⚙️ Technologies & Tools

##### 1. PySpark – Used for big data processing, data cleaning, and transformations across the Bronze, Silver, and Gold layers.

##### 2. Microsoft Fabric Notebook / Power BI – Used for storing, transforming, and visualizing the final insights.

##### 3. GitHub – Used for version control, sharing code snippets, and documentation.

### 🔧 How These Tools Are Used
#### PySpark:

##### Loads and processes large datasets efficiently.

##### Handles data transformations like missing value treatment, aggregations, and sorting.

##### Computes average ratings and top-rated movies.

#### Microsoft Fabric Notebook / Power BI:

##### Stores and manages processed data.

##### Uses DAX queries to create measures.

##### Creates interactive dashboards for visualization of rating trends and movie popularity.

##### Helps in deriving insights through charts and reports.

#### GitHub:

##### Maintains code versions and allows collaboration.

##### Stores PySpark scripts, datasets, and documentation.

##### Provides a structured workflow for project tracking and improvements.




## Data Processing in Fabric

## 🧱 Medallion Architecture Workflow

##### 🟫 Bronze Layer (Raw Data Ingestion)
##### Loaded raw CSV data into Fabric using Spark DataFrame.

##### Performed initial schema validation.
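
The Bronze step above can be sketched in plain Python so that it runs without a Spark session. In a Fabric notebook the same step would typically be a `spark.read.csv(path, header=True, schema=...)` call. The expected column list is taken from the Key Columns section; the function name `load_bronze` and the two sample rows are hypothetical and are not taken from the real file.

```python
import csv
import io

# Expected raw columns, copied from the "Key Columns" list in this README.
EXPECTED_COLUMNS = ["user_Id", "movie_Id", "rating", "timestamp",
                    "movie_title", "genres", "release_year"]

def load_bronze(csv_text):
    """Read raw CSV text unchanged and check only that the header is complete."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Bronze keeps values as raw strings; typing and cleaning happen in Silver.
    return list(reader)

# Hypothetical sample rows for illustration only.
SAMPLE = (
    "user_Id,movie_Id,rating,timestamp,movie_title,genres,release_year\n"
    "1,10,4.0,964982703,The Matrix,Action/Sci-Fi,1999\n"
    "2,10,5.0,964982931,The Matrix,Action/Sci-Fi,1999\n"
)

bronze_rows = load_bronze(SAMPLE)
print(len(bronze_rows), bronze_rows[0]["movie_title"])  # prints: 2 The Matrix
```

Checking column membership rather than column order keeps the validation tolerant of a reordered export while still failing fast on a missing field.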

##### 🪙 Silver Layer (Data Cleaning & Transformation)
##### Removed duplicate rows and rows with missing values.

##### Standardized column names.

##### Kept only movies with at least 10 ratings, so that averages are not driven by a handful of votes.
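
A minimal plain-Python sketch of the Silver rules listed above, assuming rows arrive as dictionaries of strings. In PySpark the equivalent operations would usually be `dropDuplicates()`, `dropna()`, and a `groupBy().count()` joined back as a filter. The function name `to_silver`, the lower-case naming rule, and the sample rows are assumptions made for illustration; the threshold is lowered to 2 in the usage lines only so that the tiny sample shows the effect.

```python
from collections import Counter

def to_silver(rows, min_ratings=10):
    """Standardize names, drop incomplete and duplicate rows,
    then keep only movies with at least `min_ratings` ratings."""
    cleaned, seen = [], set()
    for row in rows:
        # Standardize column names: trimmed and lower-case (user_Id -> user_id).
        row = {k.strip().lower(): v for k, v in row.items()}
        # Drop rows where any value is missing or blank.
        if any(v is None or str(v).strip() == "" for v in row.values()):
            continue
        # Drop exact duplicates of a row already kept.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    # Count ratings per movie after cleaning, then apply the threshold.
    counts = Counter(r["movie_id"] for r in cleaned)
    return [r for r in cleaned if counts[r["movie_id"]] >= min_ratings]

# Hypothetical sample rows for illustration only.
raw = [
    {"user_Id": "1", "movie_Id": "10", "rating": "4.0"},
    {"user_Id": "1", "movie_Id": "10", "rating": "4.0"},  # exact duplicate
    {"user_Id": "2", "movie_Id": "10", "rating": ""},     # missing rating
    {"user_Id": "3", "movie_Id": "10", "rating": "5.0"},
    {"user_Id": "4", "movie_Id": "20", "rating": "3.0"},  # movie with one rating
]

silver = to_silver(raw, min_ratings=2)
print([r["user_id"] for r in silver])  # prints: ['1', '3']
```

Counting ratings after deduplication matters here: counting first would let duplicate rows push a movie over the threshold.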

##### 🥇 Gold Layer (Aggregation & SQL Analysis)
##### Performed group-by operations and trend analysis using SQL queries, such as:

##### Top 10 Highest Rated Movies
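
The top-rated ranking can be illustrated with a short plain-Python sketch. In PySpark this would commonly be a `groupBy("movie_title").agg(...)` with `avg` and `count`, followed by `orderBy(...)` and `limit(10)`. The function name `top_rated`, the tie-breaking rule, and the sample rows are assumptions and do not reproduce the project's actual query.

```python
from collections import defaultdict

def top_rated(rows, n=10):
    """Return (title, average rating, rating count) for the n best-rated movies."""
    totals = defaultdict(lambda: [0.0, 0])  # title -> [sum of ratings, count]
    for row in rows:
        entry = totals[row["movie_title"]]
        entry[0] += float(row["rating"])
        entry[1] += 1
    averages = [(title, round(s / c, 2), c) for title, (s, c) in totals.items()]
    # Highest average first; ties broken by rating count, then by title.
    averages.sort(key=lambda t: (-t[1], -t[2], t[0]))
    return averages[:n]

# Hypothetical sample rows for illustration only.
sample = [
    {"movie_title": "Movie A", "rating": "5.0"},
    {"movie_title": "Movie A", "rating": "4.0"},
    {"movie_title": "Movie B", "rating": "3.0"},
    {"movie_title": "Movie C", "rating": "4.5"},
    {"movie_title": "Movie C", "rating": "4.5"},
]

print(top_rated(sample, n=2))
# prints: [('Movie A', 4.5, 2), ('Movie C', 4.5, 2)]
```

Returning the rating count next to the average makes it easier to judge how much weight each ranking position deserves.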


## ❇️❧ Power BI Visualizations

### 🔹 Visualizations Created
##### To analyze and interpret the movie dataset effectively, several visualizations were created in Power BI. These visuals help in understanding key insights such as rating trends, genre distribution, and movie performance.

### 🔹 Visualizations and Their Purpose

| Visualization | Purpose |
| --- | --- |
| Bar Chart | Identifies the best-performing movies based on audience ratings. |
| Line Chart | Shows rating trends over the years, allowing analysis of how movie ratings have changed over time. |
| Column Chart | Represents the distribution of ratings (1-5), providing an overview of how users rate movies. |
| Pie Chart | Illustrates the genre-wise movie distribution, showing the proportion of different movie genres in the dataset. |
| Table | Contains key movie details such as title, rating, rating count, release year, and genre for reference. |
| Card Visuals | Highlight KPI metrics including Total Movies, Average Rating, and Total Ratings, providing quick insights. |
| Scatter Plot | Shows the relationship between rating count and average rating, helping analyze whether higher ratings correlate with more reviews. |
| Donut Chart | Shows the sum of ratings by genre. |

##### These visualizations provide meaningful insights into the dataset, aiding in better decision-making and analysis.



## 💡 Key Insights

##### 1️⃣ Top-Rated Movies:

##### "The Matrix" has the highest average rating (3.4), followed by "The Godfather" (3.3) and "Finding Nemo" (3.2).

###### Average rating for movies is 3.8 

##### 2️⃣ Genre-Based Rating Trends:

##### Crime/Drama and Action/Sci-Fi dominate the ratings, with Crime/Drama having the most consistent ratings over the years.

##### Animation movies (e.g., Finding Nemo) also have strong ratings in recent years.

##### 3️⃣ Overall Movie Ratings Statistics:

##### Average movie rating across all genres is 3.27.

##### Total ratings count is 484.

##### Total unique movies analyzed: 742K, suggesting extensive movie data coverage.

##### 4️⃣ Ratings Over Time:

##### A decline in movie ratings is observed in recent years, possibly due to changing audience preferences or an increase in lower-rated movies.

##### The highest rating count occurred in the early 1980s, with a noticeable drop after 2000.

##### 5️⃣ Ratings Distribution:

##### The majority of movies have ratings between 4 and 5, with very few rated between 1 and 3.

##### The rating distribution indicates that most movies are rated positively, with fewer extreme highs or lows.

##### 6️⃣ User Preferences:

##### Genre popularity: Action/Sci-Fi and Crime/Drama dominate, but newer genres like Animation are catching up.

##### A user selection filter allows dynamic insight into how individual users rate different movies.




## 🛠️ Challenges & Solutions

#### 1️⃣ Difficulty Understanding the Medallion Architecture

##### Worked through the Microsoft Fabric documentation to understand the Bronze, Silver, and Gold layers.

#### 2️⃣ SQL Query Syntax Issues

##### "LIMIT" is not supported in the SQL Server (T-SQL) syntax used by Fabric; "TOP" is used instead.

##### Adjusting GROUP BY and HAVING clauses for accurate ranking of top-rated movies.
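
The dialect difference can be made concrete with a small helper that builds the same top-N query in both forms. This is a sketch under assumptions: the table name `gold_movie_ratings`, the function name `top_movies_query`, and the exact column list are hypothetical and are not taken from the project. The `HAVING` clause repeats `COUNT(*)` rather than using the alias, because a select-list alias cannot be referenced in `HAVING` in T-SQL.

```python
def top_movies_query(dialect, n=10, min_ratings=10, table="gold_movie_ratings"):
    """Build a top-N query using TOP for 'tsql' or LIMIT for 'spark'.

    `table` is assumed to be a trusted, hard-coded name (it is not escaped).
    """
    body = (
        "movie_title, AVG(rating) AS avg_rating, COUNT(*) AS rating_count\n"
        f"FROM {table}\n"
        "GROUP BY movie_title\n"
        f"HAVING COUNT(*) >= {int(min_ratings)}\n"
        "ORDER BY avg_rating DESC"
    )
    if dialect == "tsql":
        # T-SQL limits rows at the start of the SELECT list.
        return f"SELECT TOP ({int(n)}) {body}"
    if dialect == "spark":
        # Spark SQL limits rows at the end of the statement.
        return f"SELECT {body}\nLIMIT {int(n)}"
    raise ValueError(f"unknown dialect: {dialect}")

print(top_movies_query("tsql"))
print(top_movies_query("spark"))
```

Keeping the shared body in one place means the grouping and filtering logic cannot drift between the notebook query and the SQL endpoint query.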

#### 3️⃣ Visualization Challenges

##### Selecting the right charts to represent rating distribution effectively.

##### Ensuring Power BI filters worked correctly across different visuals.


#### 4️⃣ Data Transformation in Power BI

##### Grouping movies by release year while keeping average ratings consistent was tricky.

##### Creating DAX measures for dynamic insights required debugging.






## 📌 Conclusion & Future Scope

##### Successfully completed the Movie Rating Analysis.

##### Power BI dashboards provide deep insights into movie trends & ratings.

##### Future improvements:

##### 1. Implement collaborative filtering for personalized recommendations.

##### 2. Integrate with real-time streaming data.



# ⦾ Thank you ⦾

