# Data Exploration

## Introduction & Purpose

This notebook provides an initial exploration of the processed movie recommendation data. The purpose is to:
- Load and inspect the processed datasets
- Understand the basic structure and characteristics of the data
- Identify data quality issues
- Analyze distributions of key features
- Summarize user demographics

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Load Processed Data from data/processed/

Load the cleaned and processed datasets for analysis.

In [None]:
# Load processed data files
# TODO: Update file paths based on actual processed data files

# Example:
# ratings_df = pd.read_csv('../data/processed/ratings_processed.csv')
# movies_df = pd.read_csv('../data/processed/movies_processed.csv')
# users_df = pd.read_csv('../data/processed/users_processed.csv')

# Uncomment once the read_csv calls above are active, so the message reflects a real load
# print('Data loaded successfully')
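
A minimal sketch of a loading step that fails loudly when a file is missing, so any confirmation message follows an actual read. The helper name `load_processed` and the `<name>_processed.csv` naming scheme are assumptions based on the example paths above, not confirmed project files.

```python
from pathlib import Path

import pandas as pd


def load_processed(data_dir, names=("ratings", "movies", "users")):
    """Load `<name>_processed.csv` for each name (hypothetical naming scheme)."""
    data_dir = Path(data_dir)
    frames = {}
    for name in names:
        path = data_dir / f"{name}_processed.csv"
        # Stop early with a clear message instead of continuing with missing data
        if not path.exists():
            raise FileNotFoundError(f"Expected processed file not found: {path}")
        frames[name] = pd.read_csv(path)
    return frames
```

Under these assumptions, `load_processed('../data/processed')` would return a dict of DataFrames keyed by `'ratings'`, `'movies'`, and `'users'`.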

## Basic Statistics

Examine dataset shapes, preview samples, and check for missing values.

In [None]:
# Dataset shape
# print('Ratings shape:', ratings_df.shape)
# print('Movies shape:', movies_df.shape)
# print('Users shape:', users_df.shape)

In [None]:
# Display first few rows
# ratings_df.head()

In [None]:
# Check for missing values
# print('\nMissing values in ratings:')
# print(ratings_df.isnull().sum())
# 
# print('\nMissing values in movies:')
# print(movies_df.isnull().sum())
# 
# print('\nMissing values in users:')
# print(users_df.isnull().sum())
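
The missing-value check can also be expressed as a small reusable helper that reports counts and percentages together. This is a sketch only: `missing_summary` is a hypothetical name, and the toy frame stands in for `users_df`, whose real columns may differ.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return per-column missing counts and percentages, worst columns first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        # max(..., 1) avoids dividing by zero on an empty frame
        "percent": (counts / max(len(df), 1) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Toy frame standing in for users_df; the values are illustrative only
toy_users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "age": [25, np.nan, 31, np.nan],
    "occupation": ["engineer", "artist", None, "student"],
})
print(missing_summary(toy_users))
```

The same helper could be applied to each of the three datasets in turn once they are loaded.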

## Distribution Visualizations

Visualize the distributions of key features including age, ratings, and genres.

### Age Distribution

In [None]:
# Plot age distribution
# plt.figure(figsize=(10, 6))
# plt.hist(users_df['age'], bins=30, edgecolor='black', alpha=0.7)
# plt.xlabel('Age')
# plt.ylabel('Frequency')
# plt.title('Distribution of User Ages')
# plt.show()

### Rating Distribution

In [None]:
# Plot rating distribution
# plt.figure(figsize=(10, 6))
# plt.hist(ratings_df['rating'], bins=20, edgecolor='black', alpha=0.7)
# plt.xlabel('Rating')
# plt.ylabel('Frequency')
# plt.title('Distribution of Movie Ratings')
# plt.show()
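
If ratings turn out to sit on a small discrete scale (an assumption; the actual scale is not stated here), a per-value count can be easier to read than a 20-bin histogram. The sketch below uses a toy Series in place of `ratings_df['rating']`.

```python
import pandas as pd

# Toy ratings standing in for ratings_df['rating']; values are illustrative only
toy_ratings = pd.Series([5, 4, 4, 3, 5, 1, 4, 2], name="rating")

# Count each distinct rating value, ordered by the rating itself
rating_counts = toy_ratings.value_counts().sort_index()

# Express each count as a share of all ratings
rating_share = (rating_counts / rating_counts.sum()).round(3)

print(pd.DataFrame({"count": rating_counts, "share": rating_share}))
```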

### Genre Distribution

In [None]:
# Plot genre distribution
# Note: Adjust based on actual genre column format
# genre_counts = movies_df['genre'].value_counts().head(10)
# plt.figure(figsize=(12, 6))
# genre_counts.plot(kind='bar')
# plt.xlabel('Genre')
# plt.ylabel('Count')
# plt.title('Top 10 Movie Genres')
# plt.xticks(rotation=45)
# plt.tight_layout()
# plt.show()
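
How genres should be counted depends on the column format. As one possibility, if each movie stores several genres in a single pipe-delimited string (an assumption, not confirmed for this dataset), the values need to be split before counting. The sketch below uses a toy frame in place of `movies_df`.

```python
import pandas as pd

# Toy frame standing in for movies_df; the pipe-delimited 'genre' format is an assumption
toy_movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": ["Action|Comedy", "Comedy", "Drama|Comedy|Action"],
})

genre_counts = (
    toy_movies["genre"]
    .str.split("|")   # one list of genres per movie
    .explode()        # one row per (movie, genre) pair
    .value_counts()   # number of movies tagged with each genre
)
print(genre_counts)
```

If the column already holds a single genre per row, `value_counts()` on the column alone is sufficient.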

## User Demographics Summary

Analyze and summarize user demographic information.

In [None]:
# User demographics summary
# print('User Demographics Summary:')
# print('=' * 50)
# print(f'Total users: {len(users_df)}')
# print('\nAge statistics:')
# print(users_df['age'].describe())
# 
# # Additional demographic analysis
# # print('\nGender distribution:')
# # print(users_df['gender'].value_counts())
# # 
# # print('\nOccupation distribution:')
# # print(users_df['occupation'].value_counts())
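
Beyond `describe()`, ages can be summarized by grouping them into bands. The sketch below uses `pd.cut` on a toy Series in place of `users_df['age']`; the band edges and labels are illustrative assumptions and should be adapted to the actual data.

```python
import pandas as pd

# Toy ages standing in for users_df['age']; values are illustrative only
toy_ages = pd.Series([17, 22, 34, 35, 51, 68], name="age")

# Hypothetical band edges; right=False makes each band include its lower edge
bins = [0, 18, 25, 35, 50, 120]
labels = ["<18", "18-24", "25-34", "35-49", "50+"]

age_bands = pd.cut(toy_ages, bins=bins, labels=labels, right=False)

# sort_index keeps the bands in their natural order rather than by frequency
band_counts = age_bands.value_counts().sort_index()
print(band_counts)
```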

In [None]:
# Visualize user demographics
# Note: Assumes 'gender' and 'occupation' columns exist; adjust to the actual schema
# fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# 
# gender_counts = users_df['gender'].value_counts()
# axes[0].pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%')
# axes[0].set_title('Gender Distribution')
# 
# users_df['occupation'].value_counts().head(10).plot(kind='barh', ax=axes[1])
# axes[1].set_title('Top 10 Occupations')
# axes[1].set_xlabel('Count')
# 
# plt.tight_layout()
# plt.show()

## Next Steps

- Perform deeper feature engineering
- Analyze correlations between features
- Prepare data for model training
- Develop recommendation algorithms