# Data Exploration Notebook

This notebook contains exploratory data analysis (EDA) for the movie recommendation dataset.

## Table of Contents

1. [Import Libraries](#1-import-libraries)
2. [Load Data](#2-load-data)
3. [Data Overview](#3-data-overview)
4. [Data Cleaning](#4-data-cleaning)
5. [Visualization](#5-visualization)
6. [Insights and Conclusions](#6-insights-and-conclusions)


## 1. Import Libraries

In [None]:
# 1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


## 2. Load Data

In [None]:
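# 2. Load Data

# MovieLens ratings and movie metadata (columns such as userId, movieId)
df = pd.read_csv('../data/raw/movielens/ratings.csv')
movies = pd.read_csv('../data/raw/movielens/movies.csv')

# Separate static data file (columns user_id, movie_id, rating),
# used by the cleaning and visualization cells below
data_path = '../data/raw/static_data.csv'
data = pd.read_csv(data_path)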



## 3. Data Overview

In [None]:
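# A notebook cell only renders its last expression,
# so the intermediate results are passed to display() explicitly
display(df.head())
df.info()  # prints its summary directly
display(df.describe())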


### Check for Missing Values

In [None]:
df.isnull().sum()
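
Raw missing-value counts are easier to judge next to the share of rows they affect. A minimal sketch of that idea follows; the frame `toy` and its values are hypothetical stand-ins for `df`, not data from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df; values are illustrative only
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, np.nan, 3.5, 5.0],
})

# Missing-value count per column, plus the percentage of rows affected
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})
print(missing)
```

The same two expressions should work on `df` unchanged, since they only rely on `isnull()`.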


## 4. Data Cleaning

In [None]:
# 4. Data Cleaning

# Drop duplicates
data = data.drop_duplicates()

# Handle missing values (if any)
data = data.dropna()

# Ensure correct data types
data['user_id'] = data['user_id'].astype(int)
data['movie_id'] = data['movie_id'].astype(int)
data['rating'] = data['rating'].astype(float)
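
The order of the steps above matters: `astype(int)` raises an error on a column that still contains NaN, so missing rows have to be dropped before the cast. The sketch below wraps the same steps in a helper and reports how many rows were removed. The name `clean_ratings` and the toy values are assumptions made for illustration; only the column names come from the cell above.

```python
import numpy as np
import pandas as pd


def clean_ratings(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the cleaning cell above."""
    out = frame.drop_duplicates().dropna().copy()
    # Casting to int is only safe once rows containing NaN are gone
    out["user_id"] = out["user_id"].astype(int)
    out["movie_id"] = out["movie_id"].astype(int)
    out["rating"] = out["rating"].astype(float)
    return out


# Illustrative toy data: one duplicated row and one row with a missing user_id
toy = pd.DataFrame({
    "user_id": [1.0, 1.0, 2.0, np.nan],
    "movie_id": [10.0, 10.0, 20.0, 30.0],
    "rating": [4.0, 4.0, 3.5, 5.0],
})

cleaned = clean_ratings(toy)
print(cleaned.dtypes)
print(f"dropped {len(toy) - len(cleaned)} of {len(toy)} rows")
```

Reporting the number of dropped rows is a cheap way to notice when `dropna()` removes more data than expected.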


## 5. Visualization

### Distribution of Ratings

In [None]:
sns.countplot(x='rating', data=df)
plt.title('Distribution of Movie Ratings')
plt.show()


In [None]:
# 5. Visualization

# Distribution of ratings
plt.figure(figsize=(8,6))
sns.countplot(x='rating', data=data)
plt.title('Distribution of Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()


In [None]:
# Number of ratings per user
ratings_per_user = data.groupby('user_id')['rating'].count()
plt.figure(figsize=(10,6))
sns.histplot(ratings_per_user, bins=50, kde=True)
plt.title('Number of Ratings per User')
plt.xlabel('Number of Ratings')
plt.ylabel('Count of Users')
plt.show()


In [None]:
# Number of ratings per movie
ratings_per_movie = data.groupby('movie_id')['rating'].count()
plt.figure(figsize=(10,6))
sns.histplot(ratings_per_movie, bins=50, kde=True)
plt.title('Number of Ratings per Movie')
plt.xlabel('Number of Ratings')
plt.ylabel('Count of Movies')
plt.show()


## 6. Insights and Conclusions

- The rating distribution is skewed toward the high end of the scale: most ratings given are high.
- The number of ratings per user and the number of ratings per movie both follow long-tail distributions.
- Popular movies receive a large number of ratings, while many movies have very few.
- Users show the same pattern: some rate many movies, while others rate only a few.
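
These observations are read off the plots, and they can also be stated as numbers. The following is a minimal sketch on made-up data; `toy`, its values, and the "high rating" threshold of 4 are assumptions, not results from this notebook.

```python
import pandas as pd

# Hypothetical toy ratings; the notebook itself would use `data`
toy = pd.DataFrame({
    "user_id":  [1, 1, 1, 1, 2, 2, 3, 4, 5, 5],
    "movie_id": [10, 20, 30, 40, 10, 20, 10, 10, 10, 50],
    "rating":   [5, 4, 4, 3, 5, 4, 4, 5, 3, 4],
})

ratings_per_movie = (
    toy.groupby("movie_id")["rating"].count().sort_values(ascending=False)
)

# Share of all ratings that go to the single most-rated movie
top_share = ratings_per_movie.iloc[0] / ratings_per_movie.sum()

# Fraction of movies with exactly one rating (the tail of the distribution)
single_rating_frac = (ratings_per_movie == 1).mean()

# Share of ratings at or above 4 (the threshold is an assumption)
high_share = (toy["rating"] >= 4).mean()

print(f"top movie share of ratings: {top_share:.2f}")
print(f"movies with a single rating: {single_rating_frac:.2f}")
print(f"ratings >= 4: {high_share:.2f}")
```

Replacing `toy` with `data` would give the corresponding figures for the real dataset.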


### User Activity

In [None]:
user_activity = df['userId'].value_counts()
sns.histplot(user_activity, bins=50)
plt.title('Number of Ratings per User')
plt.show()
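
A histogram of long-tailed counts tends to be dominated by its first bin, so a few quantiles can be a useful complement. This is a sketch only; `toy` is a hypothetical stand-in for `df`, and the chosen quantiles are arbitrary.

```python
import pandas as pd

# Hypothetical toy data: user 1 rated 6 movies, user 2 rated 3, the rest 1 each
toy = pd.DataFrame({"userId": [1] * 6 + [2] * 3 + [3, 4, 5]})

user_activity = toy["userId"].value_counts()

# Median, 90th percentile, and maximum number of ratings per user
summary = user_activity.quantile([0.5, 0.9, 1.0])
print(summary)
```

A large gap between the median and the maximum is one simple sign of a long tail.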


### Movie Popularity

In [None]:
movie_popularity = df['movieId'].value_counts()
sns.histplot(movie_popularity, bins=50)
plt.title('Number of Ratings per Movie')
plt.show()
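
Rating counts indexed by `movieId` are hard to interpret without titles. One possible way to attach them is a merge with the movie metadata, sketched below on toy frames. The `title` column, the frame names `ratings` and `titles`, and all values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy frames; the `title` column is an assumed metadata column
ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 1, 2, 1],
    "movieId": [10, 10, 10, 20, 20, 30],
})
titles = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title": ["Movie A", "Movie B", "Movie C"],
})

# Number of ratings per movie, with explicit names for the merge
movie_popularity = ratings["movieId"].value_counts().rename("n_ratings")

top_movies = (
    movie_popularity.rename_axis("movieId")
    .reset_index()
    # validate raises if a movieId appears more than once on either side
    .merge(titles, on="movieId", how="left", validate="one_to_one")
    .sort_values("n_ratings", ascending=False)
)
print(top_movies)
```

With `how="left"`, movies that have no metadata row are kept and get a missing title instead of being dropped.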
