# Exploratory Data Analysis: MovieLens Dataset

This notebook explores the MovieLens Small dataset to understand user rating patterns, movie popularity, and sparsity.

In [1]:
import sys
import os
# Add src to path
sys.path.append(os.path.abspath(os.path.join('..', 'src')))

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from data.loader import get_merged_data


In [2]:
# Load data
df = get_merged_data()
print(f"Total Interactions: {len(df)}")
print(f"Unique Users: {df['userId'].nunique()}")
print(f"Unique Movies: {df['movieId'].nunique()}")
df.head()

Total Interactions: 100836
Unique Users: 610
Unique Movies: 9724


,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller


## 1. Ratings Distribution
Which rating values are most common? For example, do users mostly give 4s and 5s?

In [3]:
rating_counts = df['rating'].value_counts().sort_index()
fig = px.bar(x=rating_counts.index, y=rating_counts.values, labels={'x': 'Rating', 'y': 'Count'}, title="Distribution of Ratings")
fig.show()
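The bar chart can be backed by a few summary numbers. The following is a minimal sketch, assuming only that the frame has a `rating` column; `sample` is a made-up stand-in so the snippet runs on its own, and in the notebook it would be replaced by `df`.

```python
import pandas as pd

# Hypothetical stand-in for the merged frame; use `df` in the notebook instead.
sample = pd.DataFrame({"rating": [4.0, 5.0, 4.0, 3.0, 0.5, 4.5, 4.0, 5.0]})

# Relative frequency of each rating value, ordered by rating.
rating_share = sample["rating"].value_counts(normalize=True).sort_index()

# Fraction of all ratings that are 4.0 or higher.
high_share = (sample["rating"] >= 4.0).mean()

print(rating_share)
print(f"Mean rating: {sample['rating'].mean():.2f}")
print(f"Share of ratings >= 4.0: {high_share:.1%}")
```

A large `high_share` would indicate that ratings skew positive, which is worth knowing before treating the raw rating as a neutral preference signal.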

## 2. Long Tail Distribution (Movie Popularity)
A small number of movies receive most of the ratings, while most movies have very few.

In [4]:
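# Group by movieId rather than title: titles are not guaranteed to be unique,
# so grouping by title could merge distinct movies into one count.
movie_counts = df.groupby('movieId').size().sort_values(ascending=False)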
fig = px.line(x=np.arange(len(movie_counts)), y=movie_counts.values, title="Long Tail of Movie Popularity")
fig.update_layout(xaxis_title="Movie Rank", yaxis_title="Number of Ratings")
fig.show()
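One way to put a number on the long tail is the cumulative share of ratings captured by the most-rated movies. This is a sketch under stated assumptions: `sample` is a made-up frame standing in for `df`, and the 25% cutoff is an arbitrary illustrative choice, not a value taken from the analysis above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the merged frame: 10 ratings spread over 4 movies.
sample = pd.DataFrame({"movieId": [1] * 6 + [2] * 2 + [3] + [4]})

# Ratings per movie, most-rated first.
counts = sample.groupby("movieId").size().sort_values(ascending=False)

# Cumulative fraction of all ratings covered as we move down the ranking.
cum_share = counts.cumsum() / counts.sum()

# Share of ratings held by the top 25% of movies (cutoff chosen for illustration).
top_n = max(1, int(np.ceil(0.25 * len(counts))))
top_share = cum_share.iloc[top_n - 1]

# Number of movies with exactly one rating (the far end of the tail).
single_rating_movies = int((counts == 1).sum())

print(f"Top {top_n} movie(s) hold {top_share:.1%} of ratings")
print(f"Movies with a single rating: {single_rating_movies}")
```

Plotting `cum_share` against movie rank gives a complementary view to the line chart: the faster the curve approaches 1, the heavier the concentration on popular titles.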

## 3. Sparsity
How sparse is the user-item matrix? Sparsity here is the fraction of all possible user-movie pairs that have no rating.

In [5]:
n_users = df['userId'].nunique()
n_items = df['movieId'].nunique()
n_ratings = len(df)
sparsity = 1 - (n_ratings / (n_users * n_items))
print(f"Matrix Sparsity: {sparsity:.4%}")

Matrix Sparsity: 98.3000%
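The reported figure can be checked by hand from the counts printed earlier in the notebook (100836 interactions, 610 users, 9724 movies). The sketch below redoes that arithmetic and derives the average number of ratings per user and per movie; the variable names are illustrative only.

```python
# Counts taken from the notebook output above.
n_ratings = 100836
n_users = 610
n_items = 9724

# Density is the filled fraction of the user-item matrix; sparsity is the rest.
density = n_ratings / (n_users * n_items)
sparsity = 1 - density

# Average activity per user and per movie, for context.
ratings_per_user = n_ratings / n_users
ratings_per_movie = n_ratings / n_items

print(f"Matrix cells: {n_users * n_items}")
print(f"Density: {density:.4%}")
print(f"Sparsity: {sparsity:.4%}")
print(f"Average ratings per user: {ratings_per_user:.1f}")
print(f"Average ratings per movie: {ratings_per_movie:.1f}")
```

Averages hide the skew shown in the long-tail section, so medians of the per-user and per-movie counts are usually a better summary of typical activity.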
