# MovieLens Dataset - Exploratory Data Analysis

## 1. Dataset Overview
This section provides an overview of the MovieLens dataset, including its scale and structure.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
movies = pd.read_csv("../data/ml-latest/movies.csv")
movies.head()

In [None]:
ratings = pd.read_csv("../data/ml-latest/ratings.csv")
ratings.head()

In [None]:
tags = pd.read_csv("../data/ml-latest/tags.csv")
tags.head()

In [None]:
n_users = ratings["userId"].nunique()
n_movies = ratings["movieId"].nunique()
n_ratings = len(ratings)

print(f"Users: {n_users:,}")
print(f"Movies: {n_movies:,}")
print(f"Ratings: {n_ratings:,}")

In [None]:
ratings.info()

### Dataset Description

The MovieLens dataset contains **33.8 million explicit user–movie interactions** with no missing values.  
Each interaction includes a numeric rating (0.5–5.0) and a timestamp.

**Files used:**
- `movies.csv`: Movie metadata including title and genres
- `ratings.csv`: User ratings with timestamps
- `tags.csv`: User-generated tags describing movies

At this scale, a dense user–item matrix is impractical, and most user–movie pairs are unobserved (quantified in Section 5). This motivates matrix factorization, which works with low-dimensional latent representations instead of the full matrix.
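
The "no missing values" and 0.5–5.0 range claims above can be checked directly. The following is a minimal sketch on a small stand-in DataFrame; `toy_ratings` and its values are hypothetical, and only the column names mirror `ratings.csv`. The same two checks should apply unchanged to the real `ratings` frame.

```python
import pandas as pd

# Hypothetical stand-in for ratings.csv (same column names, made-up values)
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, 0.5, 5.0, 3.5],
    "timestamp": [1_000_000_000, 1_000_000_100, 1_000_000_200, 1_000_000_300],
})

# Count missing values per column; all zeros means there are no gaps
missing_per_column = toy_ratings.isna().sum()

# Check that every rating lies in the documented 0.5-5.0 range (inclusive)
ratings_in_range = toy_ratings["rating"].between(0.5, 5.0).all()

print(missing_per_column)
print(f"All ratings within 0.5-5.0: {ratings_in_range}")
```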

## 2. Ratings Distribution
We analyze how users assign ratings and whether the data is skewed.

In [None]:
ratings["rating"].value_counts().sort_index().plot(kind="bar")
plt.title("Ratings Distribution")
plt.xlabel("Rating")
plt.ylabel("Count")
plt.show()

### Ratings Distribution Insights

The ratings are skewed toward higher values, with **4.0 being the most common rating**.  
Lower ratings (0.5–1.5) are relatively rare, indicating a **positivity bias** in the data.

This distribution suggests:
- Popularity-based recommenders may over-recommend the same highly rated movies to every user.
- Matrix factorization (SVD) is well suited to learning **latent user preferences**, capturing taste signals that go beyond the obvious highly rated movies.
- Ranking metrics such as **Precision@K** should be reported alongside RMSE, since the quality of the top-K ranking matters more for recommendation than predicting exact ratings.
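
To make the Precision@K metric concrete, here is a minimal sketch for a single user. The function name `precision_at_k`, the movie IDs, and the "rated 4.0 or higher counts as relevant" rule are all assumptions for illustration, not definitions taken from this notebook.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant.

    Divides by k even if fewer than k items were recommended,
    which is one common convention (an assumption here).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Hypothetical movie IDs ranked by some model for one user, best first
recommended_ids = [50, 12, 7, 33, 91]
# Hypothetical held-out movies this user rated 4.0 or higher
relevant_ids = {12, 33, 64}

score = precision_at_k(recommended_ids, relevant_ids, k=5)
print(f"Precision@5: {score:.2f}")
```

In practice the per-user scores would be averaged over all evaluated users.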


## 3. User Activity Analysis
This section explores how active users are in providing ratings.

In [None]:
# Compute number of ratings per user
user_activity = ratings.groupby("userId").size()
user_activity.head()

In [None]:
print(user_activity.describe())

In [None]:
plt.figure(figsize=(10,5))
plt.hist(user_activity, bins=50, color='skyblue', edgecolor='black')
plt.yscale('log') 
plt.title("Number of Ratings per User")
plt.xlabel("Number of Ratings")
plt.ylabel("Number of Users (log scale)")
plt.show()

### User Activity Insights

Most users provide relatively few ratings, while a small number of users are extremely active (power users).
This long-tail distribution suggests:

- We may need to filter out users with very few ratings to reduce noise.  
- Because power users contribute a disproportionate share of ratings, popularity-based recommendations will be biased toward the movies they rate.  
- Matrix factorization learns more reliable latent preferences for active users, since it has more observations per user to fit.

**Next steps:** During preprocessing, we will choose a minimum number of ratings per user that balances coverage against model quality.
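
A minimum-activity filter of this kind could look like the sketch below. The threshold `MIN_RATINGS_PER_USER` and the toy data are hypothetical; the actual cutoff has not been chosen yet.

```python
import pandas as pd

# Hypothetical stand-in for the ratings frame
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 30],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

MIN_RATINGS_PER_USER = 2  # hypothetical threshold, to be tuned

# For each row, the number of ratings made by that row's user
ratings_per_user = toy_ratings.groupby("userId")["rating"].transform("count")

# Keep only rows belonging to users at or above the threshold
filtered = toy_ratings[ratings_per_user >= MIN_RATINGS_PER_USER]

users_kept = filtered["userId"].nunique()
users_total = toy_ratings["userId"].nunique()
print(f"Users kept: {users_kept} of {users_total}")
print(f"Ratings kept: {len(filtered)} of {len(toy_ratings)}")
```

Reporting both users kept and ratings kept makes the coverage cost of a given threshold visible before committing to it.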

## 4. Movie Popularity Analysis
We examine the long-tail distribution of movie ratings.

In [None]:
# Compute number of ratings per movie
movie_popularity = ratings.groupby("movieId").size()
movie_popularity.head()

In [None]:
print(movie_popularity.describe())

In [None]:
plt.figure(figsize=(10,5))
plt.hist(movie_popularity, bins=50, edgecolor='black')
plt.yscale("log")
plt.title("Number of Ratings per Movie")
plt.xlabel("Number of Ratings")
plt.ylabel("Number of Movies (log scale)")
plt.show()

### Movie Popularity Insights

Movie ratings follow a strong **long-tail distribution**: a small number of popular movies receive a large share of all ratings, while most movies are rated infrequently. This is why the histogram above uses a logarithmic y-axis.

This has important implications:
- Popularity-based recommenders tend to over-recommend a small set of widely rated movies.
- Niche movies with few ratings are underrepresented despite potential relevance to specific users.
- Matrix factorization can mitigate this bias by learning latent factors that generalize user preferences beyond the most widely rated movies.
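
One way to put a number on the long tail is the share of all ratings captured by the most-rated movies. The sketch below uses made-up counts; `toy_popularity` and `TOP_FRACTION` are hypothetical, and on the real data `movie_popularity` would take the place of the toy Series.

```python
import pandas as pd

# Hypothetical ratings-per-movie counts, indexed by movieId
toy_popularity = pd.Series(
    [500, 300, 100, 40, 20, 15, 10, 8, 5, 2],
    index=range(1, 11),
    name="n_ratings",
)

TOP_FRACTION = 0.2  # hypothetical cutoff: the most-rated 20% of movies

# Sort movies from most to least rated and take the head of the list
sorted_counts = toy_popularity.sort_values(ascending=False)
n_top = max(1, round(len(sorted_counts) * TOP_FRACTION))

# Share of all ratings that belong to the head
top_share = sorted_counts.iloc[:n_top].sum() / sorted_counts.sum()
print(f"Top {n_top} movies account for {top_share:.1%} of ratings")
```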


## 5. Sparsity & Implications
We quantify matrix sparsity and discuss its impact on model choice.

In [None]:
n_users = ratings["userId"].nunique()
n_movies = ratings["movieId"].nunique()
n_ratings = len(ratings)

total_possible = n_users * n_movies
sparsity = 1 - (n_ratings / total_possible)

print(f"Sparsity: {sparsity:.4%}")

### Sparsity Insights

The user–movie interaction matrix in the MovieLens dataset is extremely sparse. With over 33 million observed ratings spread across hundreds of thousands of users and tens of thousands of movies, approximately **99.88%** of all possible user–movie pairs are unobserved.

This high level of sparsity makes traditional similarity-based approaches challenging, as most users share very few commonly rated movies. Matrix factorization addresses this issue by learning low-dimensional latent representations of users and items, enabling meaningful recommendations even when direct rating overlap is limited.

These characteristics motivate the use of an SVD-based collaborative filtering approach over simpler popularity-based or similarity-based methods.
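
To illustrate how latent factors can be learned from observed ratings only, here is a minimal sketch of matrix factorization trained with stochastic gradient descent (the style often called "Funk SVD"). This is an illustrative stand-in, not necessarily the SVD variant used later in this project. The toy ratings, factor count, learning rate, and regularization strength are all assumptions.

```python
import numpy as np

# Hypothetical observed ratings as (user_index, movie_index, rating) triples
observed = [
    (0, 0, 5.0), (0, 1, 4.0),
    (1, 0, 4.5), (1, 2, 1.0),
    (2, 1, 4.0), (2, 2, 1.5),
    (3, 0, 1.0), (3, 2, 5.0),
]
n_users, n_movies, n_factors = 4, 3, 2

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, n_factors))   # user latent factors
Q = rng.normal(scale=0.1, size=(n_movies, n_factors))  # movie latent factors
global_mean = float(np.mean([r for _, _, r in observed]))


def rmse(P, Q):
    """Root-mean-square error over the observed ratings only."""
    errors = [r - (global_mean + P[u] @ Q[i]) for u, i, r in observed]
    return float(np.sqrt(np.mean(np.square(errors))))


rmse_before = rmse(P, Q)

learning_rate, reg = 0.05, 0.02  # hypothetical hyperparameters
for epoch in range(200):
    for u, i, r in observed:
        err = r - (global_mean + P[u] @ Q[i])
        p_u = P[u].copy()  # keep the old user vector for the movie update
        P[u] += learning_rate * (err * Q[i] - reg * P[u])
        Q[i] += learning_rate * (err * p_u - reg * Q[i])

rmse_after = rmse(P, Q)

# Predict a pair that was never observed: user 1, movie 1
predicted = global_mean + P[1] @ Q[1]
print(f"Training RMSE: {rmse_before:.3f} -> {rmse_after:.3f}")
print(f"Predicted rating for user 1, movie 1: {predicted:.2f}")
```

The loop only ever touches observed pairs, so the cost scales with the number of ratings rather than with the full user–movie matrix, which is what makes this family of methods practical at high sparsity.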
