# Exploratory Data Analysis (EDA)
## Movie Recommender System

This notebook performs exploratory data analysis on the MovieLens dataset.

**Objectives:**
1. Understand the data structure
2. Analyze rating distributions
3. Identify the long tail effect
4. Calculate sparsity
5. Explore cold start challenges

In [None]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

from src.data.loader import MovieLensLoader
from src.config import Config

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

## 1. Load Data

In [None]:
# Load the dataset
loader = MovieLensLoader()
ratings, movies = loader.load_all()

print(f"Ratings shape: {ratings.shape}")
print(f"Movies shape: {movies.shape}")

In [None]:
# Display sample data
print("\n=== Ratings Sample ===")
display(ratings.head())

print("\n=== Movies Sample ===")
display(movies.head())

## 2. Basic Statistics

In [None]:
# Overall statistics
n_users = ratings['userId'].nunique()
n_movies = ratings['movieId'].nunique()  # movies with at least one rating
n_ratings = len(ratings)

print(f"Number of users: {n_users:,}")
print(f"Number of movies: {n_movies:,}")
print(f"Number of ratings: {n_ratings:,}")
print(f"\nAverage ratings per user: {n_ratings / n_users:.2f}")
print(f"Average ratings per movie: {n_ratings / n_movies:.2f}")

In [None]:
# Rating statistics
print("\n=== Rating Distribution ===")
print(ratings['rating'].describe())
print("\nRating value counts:")
print(ratings['rating'].value_counts().sort_index())

## 3. Sparsity Analysis

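**Sparsity** = 1 - (actual_ratings / possible_ratings), where possible_ratings = n_users × n_movies (every user rating every movie).

High sparsity is a fundamental challenge in recommendation systems: most user-movie pairs have no observed rating, so a model has very few observations per user and per movie to learn from.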

In [None]:
# Calculate sparsity
possible_ratings = n_users * n_movies
sparsity = 1 - (n_ratings / possible_ratings)

print(f"Possible ratings: {possible_ratings:,}")
print(f"Actual ratings: {n_ratings:,}")
print(f"\nSparsity: {sparsity:.4%}")
print(f"Density: {(1-sparsity):.4%}")
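
As a sanity check on the formula, the sketch below works it through on a hypothetical toy table (3 users, 4 movies, 5 ratings). The `toy_*` names and values are illustrative assumptions, not MovieLens data.

```python
import pandas as pd

# Hypothetical toy ratings: 3 users, 4 movies, 5 observed ratings
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 30, 40],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5],
})

toy_users = toy_ratings["userId"].nunique()    # 3 distinct users
toy_movies = toy_ratings["movieId"].nunique()  # 4 distinct movies
toy_possible = toy_users * toy_movies          # 12 cells in the user-item matrix
toy_sparsity = 1 - len(toy_ratings) / toy_possible  # 1 - 5/12

print(f"Sparsity: {toy_sparsity:.2%}")
```

Because the movie count is taken from `ratings`, movies that nobody has rated are left out of the matrix. Measured against the full catalog, sparsity could only be the same or higher.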

## 4. Rating Distribution Analysis

In [None]:
# Plot rating distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Histogram
axes[0].hist(ratings['rating'], bins=10, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Rating')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Rating Distribution')
axes[0].grid(True, alpha=0.3)

# Box plot
axes[1].boxplot(ratings['rating'])  # vertical is the default orientation
axes[1].set_ylabel('Rating')
axes[1].set_title('Rating Box Plot')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Are users generally positive or negative?
mean_rating = ratings['rating'].mean()
print(f"\nMean rating: {mean_rating:.2f}")
if mean_rating > 3:
    print("→ Users tend to be POSITIVE (mean > 3)")
else:
    print("→ Users tend to be NEGATIVE (mean ≤ 3)")
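
The global mean weights every rating equally, so prolific raters dominate it. A minimal sketch, on hypothetical toy data, of comparing it with the mean of per-user means, which weights every user equally:

```python
import pandas as pd

# Hypothetical toy data: user 1 rates often and generously,
# users 2 and 3 rate rarely and harshly
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 1, 1, 2, 2, 3],
    "rating": [5.0, 4.5, 5.0, 4.5, 2.0, 3.0, 1.0],
})

# Every rating counts once
global_mean = toy_ratings["rating"].mean()

# Every user counts once
user_means = toy_ratings.groupby("userId")["rating"].mean()
mean_of_user_means = user_means.mean()

print(f"Global mean:        {global_mean:.2f}")
print(f"Mean of user means: {mean_of_user_means:.2f}")
```

In this toy case the two numbers fall on opposite sides of 3, which suggests looking at the per-user distribution before labelling users as positive or negative.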

## 5. Long Tail Analysis

The **Long Tail** refers to the phenomenon where a few items (movies) are extremely popular, while the vast majority receive few ratings.

In [None]:
# Count ratings per movie
movie_rating_counts = ratings.groupby('movieId').size().reset_index(name='n_ratings')
movie_rating_counts = movie_rating_counts.sort_values('n_ratings', ascending=False)

# Plot long tail
plt.figure(figsize=(15, 6))
plt.plot(range(len(movie_rating_counts)), movie_rating_counts['n_ratings'].values)
plt.xlabel('Movie Rank')
plt.ylabel('Number of Ratings')
plt.title('Long Tail: Movie Popularity Distribution')
plt.yscale('log')
plt.grid(True, alpha=0.3)
plt.show()

# Statistics
top_20_pct = int(0.2 * len(movie_rating_counts))
top_20_ratings = movie_rating_counts.head(top_20_pct)['n_ratings'].sum()
total_ratings = movie_rating_counts['n_ratings'].sum()

print(f"\nTop 20% of movies account for {top_20_ratings / total_ratings:.2%} of all ratings")
print(f"Bottom 80% of movies account for {1 - (top_20_ratings / total_ratings):.2%} of all ratings")
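
The top-20% figure is one point on a curve. As a sketch, a small helper can report the share for any fraction. `top_share` is a hypothetical function, not part of this project, and the counts below are made-up toy values.

```python
import pandas as pd

def top_share(counts: pd.Series, fraction: float) -> float:
    """Share of all ratings captured by the top `fraction` of items."""
    ordered = counts.sort_values(ascending=False).to_numpy()
    k = int(fraction * len(ordered))  # number of items in the head
    return float(ordered[:k].sum() / ordered.sum())

# Hypothetical ratings-per-movie counts for 10 movies (total = 1000)
toy_counts = pd.Series([500, 200, 100, 50, 50, 40, 30, 15, 10, 5])

for frac in (0.1, 0.2, 0.5):
    print(f"Top {frac:.0%} of items: {top_share(toy_counts, frac):.0%} of ratings")
```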

## 6. Most Popular Movies

In [None]:
# Get most rated movies
top_movies = movie_rating_counts.head(20)
top_movies = top_movies.merge(movies[['movieId', 'title']], on='movieId')

print("\n=== Top 20 Most Rated Movies ===")
display(top_movies[['title', 'n_ratings']])

In [None]:
# Highest rated movies (with minimum ratings threshold)
min_ratings = 50
movie_stats = ratings.groupby('movieId').agg(
    avg_rating=('rating', 'mean'),
    n_ratings=('rating', 'count'),
).reset_index()

# Filter and sort
top_rated = movie_stats[movie_stats['n_ratings'] >= min_ratings]
top_rated = top_rated.sort_values('avg_rating', ascending=False).head(20)
top_rated = top_rated.merge(movies[['movieId', 'title']], on='movieId')

print(f"\n=== Top 20 Highest Rated Movies (min {min_ratings} ratings) ===")
display(top_rated[['title', 'avg_rating', 'n_ratings']])
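
A hard minimum-ratings threshold drops every movie below the cutoff. One common alternative, which this notebook does not use, is a damped mean that pulls each movie's average toward the global mean in proportion to how few ratings it has. The sketch below assumes a hypothetical `damping` strength and `global_mean`, applied to toy values.

```python
import pandas as pd

# Hypothetical per-movie statistics
toy_stats = pd.DataFrame({
    "movieId":    [1, 2, 3],
    "avg_rating": [5.0, 4.2, 3.0],
    "n_ratings":  [2, 400, 50],
})

global_mean = 3.5  # assumed global average rating
damping = 10       # assumed strength: acts like 10 extra ratings at the global mean

# damped = (n * avg + damping * global_mean) / (n + damping)
toy_stats["damped_rating"] = (
    toy_stats["n_ratings"] * toy_stats["avg_rating"] + damping * global_mean
) / (toy_stats["n_ratings"] + damping)

ranked = toy_stats.sort_values("damped_rating", ascending=False)
print(ranked[["movieId", "avg_rating", "n_ratings", "damped_rating"]])
```

Movie 1 has a perfect raw average from only two ratings, yet it ranks below movie 2 once damped. The choice of `damping` is a tuning decision, not something the data dictates.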

## 7. User Behavior Analysis

In [None]:
# Ratings per user distribution
user_rating_counts = ratings.groupby('userId').size().reset_index(name='n_ratings')

plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.hist(user_rating_counts['n_ratings'], bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Number of Ratings')
plt.ylabel('Number of Users')
plt.title('Distribution of Ratings per User')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.boxplot(user_rating_counts['n_ratings'])
plt.ylabel('Number of Ratings')
plt.title('Ratings per User (Box Plot)')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n=== User Rating Statistics ===")
print(user_rating_counts['n_ratings'].describe())

## 8. Cold Start Analysis

**Cold Start Problem**: How do we recommend to new users, or surface new movies, when there is little or no rating history to learn from?

In [None]:
# Users with very few ratings (new users)
cold_start_threshold = 5
cold_start_users = user_rating_counts[user_rating_counts['n_ratings'] <= cold_start_threshold]

print(f"\nUsers with ≤ {cold_start_threshold} ratings: {len(cold_start_users):,}")
print(f"Percentage of users: {len(cold_start_users) / len(user_rating_counts):.2%}")

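# Movies with very few ratings (new/obscure movies)
# Note: movie_rating_counts is built from `ratings`, so movies with zero ratings are not counted here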
cold_start_movies = movie_rating_counts[movie_rating_counts['n_ratings'] <= cold_start_threshold]

print(f"\nMovies with ≤ {cold_start_threshold} ratings: {len(cold_start_movies):,}")
print(f"Percentage of movies: {len(cold_start_movies) / len(movie_rating_counts):.2%}")
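
Counts derived from `ratings` cannot contain movies with zero ratings, which are the hardest cold-start items. A minimal sketch, on hypothetical toy tables, of counting against the full catalog with a left merge:

```python
import pandas as pd

# Hypothetical catalog of 4 movies and a ratings table that covers only 2 of them
toy_movies = pd.DataFrame({
    "movieId": [10, 20, 30, 40],
    "title":   ["A", "B", "C", "D"],
})
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 10],
})

counts = toy_ratings.groupby("movieId").size().reset_index(name="n_ratings")

# Left merge keeps every catalog movie; unrated movies get NaN, filled with 0
catalog_counts = toy_movies.merge(counts, on="movieId", how="left")
catalog_counts["n_ratings"] = catalog_counts["n_ratings"].fillna(0).astype(int)

unrated = catalog_counts[catalog_counts["n_ratings"] == 0]
print(f"Movies with no ratings: {len(unrated)} of {len(catalog_counts)}")
```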

## 9. Temporal Analysis

In [None]:
# Convert timestamp to datetime
ratings['datetime'] = pd.to_datetime(ratings['timestamp'], unit='s')
ratings['year'] = ratings['datetime'].dt.year
ratings['month'] = ratings['datetime'].dt.month

# Ratings over time
ratings_over_time = ratings.groupby('year').size().reset_index(name='n_ratings')

plt.figure(figsize=(15, 5))
plt.plot(ratings_over_time['year'], ratings_over_time['n_ratings'], marker='o')
plt.xlabel('Year')
plt.ylabel('Number of Ratings')
plt.title('Ratings Over Time')
plt.grid(True, alpha=0.3)
plt.show()
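
Yearly totals can mislead when the first or last year is only partly covered. A minimal sketch, on hypothetical Unix timestamps, of checking the covered date range and counting at monthly granularity:

```python
import pandas as pd

# Hypothetical Unix timestamps in seconds (UTC)
toy = pd.DataFrame({"timestamp": [946684800, 946771200, 978307200, 1009843200]})
toy["datetime"] = pd.to_datetime(toy["timestamp"], unit="s")

# Counts per calendar year and per calendar month
per_year = toy.groupby(toy["datetime"].dt.year).size()
per_month = toy.groupby(toy["datetime"].dt.to_period("M")).size()

# The first and last observation show whether boundary years are complete
print("Range:", toy["datetime"].min(), "to", toy["datetime"].max())
print(per_year)
print(per_month)
```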

## 10. Save Processed Data

In [None]:
# Save processed data for modeling
loader.save_processed(ratings, movies)
print("\n✓ Processed data saved successfully!")

## Key Insights

**Summary of findings:**

1. **Data characteristics**: [Add your observations]
2. **Rating bias**: Users are generally [positive/negative]
3. **Long tail**: A small number of movies dominate the ratings
4. **Sparsity**: The user-item matrix is highly sparse
5. **Cold start**: A significant portion of users/movies have few ratings

**Implications for modeling:**
- Need to handle sparsity effectively
- Cold start strategy is essential
- Popular items baseline is important
- Collaborative filtering has to work with a very sparse matrix, so its benefit over the popular items baseline should be verified rather than assumed
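
The popular items baseline mentioned above could look roughly like this. It is a sketch under assumptions: `recommend_popular` is a hypothetical helper, popularity is measured by rating count only, and the toy data is made up.

```python
import pandas as pd

def recommend_popular(ratings: pd.DataFrame, user_id: int, n: int = 3) -> list:
    """Return up to n of the most-rated movieIds that the user has not rated."""
    popularity = ratings["movieId"].value_counts()  # sorted by count, descending
    seen = set(ratings.loc[ratings["userId"] == user_id, "movieId"])
    return [int(m) for m in popularity.index if m not in seen][:n]

# Hypothetical toy ratings: movie 10 has 3 ratings, movie 20 has 2, movie 30 has 1
toy_ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 1, 2, 3],
    "movieId": [10, 10, 10, 20, 20, 30],
})

print(recommend_popular(toy_ratings, user_id=4))  # new user: nothing to exclude
print(recommend_popular(toy_ratings, user_id=3))  # has already rated 10 and 30
```

Because it needs no history for the target user, the same function also serves as a fallback for cold-start users.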