# Exploratory Data Analysis (EDA) for Movie Recommendation System

In this notebook, we perform exploratory data analysis on the movies and ratings datasets to understand how ratings are distributed, how many ratings each movie receives, and how average ratings vary across movies.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the datasets
movies = pd.read_csv('../data/raw/movies.csv')
ratings = pd.read_csv('../data/raw/ratings.csv')

# Display the first few rows of the movies dataset
movies.head()

In [None]:
# Display the first few rows of the ratings dataset
ratings.head()
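
Before plotting, it can help to run a few basic sanity checks on the ratings table. The sketch below uses a small made-up DataFrame in place of `ratings.csv`; the `userId` column and all values are assumptions for illustration, and only `movieId` and `rating` are taken from the cells in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for ratings.csv. Only the movieId and rating
# columns are used elsewhere in this notebook; userId is an assumption.
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 10],
    "rating": [4.0, 3.5, 5.0, None, 2.0],
})

# Number of rows and columns
n_rows, n_cols = ratings.shape

# Missing values per column and count of exact duplicate rows
missing_per_column = ratings.isna().sum()
n_duplicates = int(ratings.duplicated().sum())

# Range of observed ratings (min/max skip NaN by default)
rating_min = ratings["rating"].min()
rating_max = ratings["rating"].max()

print(f"rows={n_rows}, columns={n_cols}, duplicates={n_duplicates}")
print(missing_per_column)
print(f"rating range: {rating_min} to {rating_max}")
```

On the real data, the same calls would be applied directly to the `ratings` DataFrame loaded above.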

In [None]:
# Visualize the distribution of ratings
plt.figure(figsize=(10, 6))
sns.histplot(ratings['rating'], bins=10, kde=True)
plt.title('Distribution of Movie Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Analyze the number of ratings per movie
ratings_per_movie = ratings.groupby('movieId').size().reset_index(name='num_ratings')

plt.figure(figsize=(10, 6))
sns.histplot(ratings_per_movie['num_ratings'], bins=30, kde=True)
plt.title('Number of Ratings per Movie')
plt.xlabel('Number of Ratings')
plt.ylabel('Number of Movies')
plt.show()
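
Counts of ratings per movie are often heavily skewed, so a histogram on a linear axis can be hard to read. As a complement, the sketch below summarizes the same counts numerically with quantiles and the share of sparsely rated movies. The data is made up, and the `min_ratings` threshold is a hypothetical value chosen only for illustration.

```python
import pandas as pd

# Made-up ratings: movie 1 has six ratings, movie 2 has three,
# movies 3 and 4 have one each.
ratings = pd.DataFrame({
    "movieId": [1] * 6 + [2] * 3 + [3] + [4],
    "rating": [4.0, 3.0, 5.0, 4.5, 4.0, 3.5, 2.0, 3.0, 2.5, 5.0, 1.0],
})

# Same aggregation as in the cell above
ratings_per_movie = (
    ratings.groupby("movieId").size().reset_index(name="num_ratings")
)

# Quantiles of the per-movie counts
quantiles = ratings_per_movie["num_ratings"].quantile([0.5, 0.9, 0.99])

# Share of movies with fewer than min_ratings ratings
min_ratings = 2  # hypothetical threshold, not taken from the notebook
share_sparse = (ratings_per_movie["num_ratings"] < min_ratings).mean()

print(quantiles)
print(f"share of movies with < {min_ratings} ratings: {share_sparse:.2f}")
```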

In [None]:
# Visualize the average rating per movie
average_ratings = ratings.groupby('movieId')['rating'].mean().reset_index()

plt.figure(figsize=(10, 6))
sns.histplot(average_ratings['rating'], bins=10, kde=True)
plt.title('Average Movie Ratings')
plt.xlabel('Average Rating')
plt.ylabel('Number of Movies')
plt.show()
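
A per-movie mean can be misleading when a movie has only one or two ratings, since a single vote then determines the average. One possible refinement, sketched here on made-up data, is to compute the count alongside the mean and keep only movies above a minimum count. The `min_ratings` value and the `movie_stats` and `reliable` names are assumptions for illustration.

```python
import pandas as pd

# Made-up ratings: movies 3 and 4 each have a single rating,
# which puts their averages at the extremes (5.0 and 1.0).
ratings = pd.DataFrame({
    "movieId": [1] * 6 + [2] * 3 + [3] + [4],
    "rating": [4.0, 3.0, 5.0, 4.5, 4.0, 3.5, 2.0, 3.0, 2.5, 5.0, 1.0],
})

# Count and mean per movie in one pass, using named aggregation
movie_stats = (
    ratings.groupby("movieId")["rating"]
    .agg(num_ratings="count", mean_rating="mean")
    .reset_index()
)

# Keep only movies with at least min_ratings ratings
min_ratings = 3  # hypothetical threshold, not taken from the notebook
reliable = movie_stats[movie_stats["num_ratings"] >= min_ratings]

print(movie_stats)
print(reliable)
```

Plotting `reliable['mean_rating']` instead of the unfiltered averages would show how much of the histogram's spread comes from sparsely rated movies.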

## Conclusion

In this EDA, we explored the distribution of movie ratings, the number of ratings per movie, and the average rating per movie. These distributions give us a clearer picture of the dataset and will inform our recommendation strategies.