## ***2 : DATA UNDERSTANDING***


### **2.1. Setup & Data Loading**

Before exploring the data, we import the Python libraries used to load, manipulate, analyze, and visualize the dataset, along with the Surprise library for building recommendation models.

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

# Surprise library for recommendation systems
from surprise import Dataset, Reader, accuracy
from surprise import KNNBasic, KNNWithMeans, SVD, NMF
from surprise.model_selection import cross_validate, GridSearchCV, train_test_split

# Set style for plots
sns.set_theme(style="whitegrid", palette="Set2")

#  inline plotting
%matplotlib inline

In [3]:
print("Loading  Datasets...\n")

movies  = pd.read_csv("../data/movies.csv")
ratings = pd.read_csv("../data/ratings.csv")
links   = pd.read_csv("../data/links.csv")
tags    = pd.read_csv("../data/tags.csv")

print(f"Movies: {len(movies)} records")
print(f"Ratings: {len(ratings)} records")
print(f"Links: {len(links)} records")
print(f"Tags: {len(tags)} records")


Loading  Datasets...

Movies: 9742 records
Ratings: 100836 records
Links: 9742 records
Tags: 3683 records


---
### **2.2. Initial Data Exploration**

Let's look at what our data contains.

In [4]:
# Movies dataset
print("Movies Dataset:")
print(f"Shape: {movies.shape}")
print(f"Columns: {movies.columns.tolist()}\n")
movies.head()

Movies Dataset:
Shape: (9742, 3)
Columns: ['movieId', 'title', 'genres']



| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [5]:
# Ratings dataset
print("Ratings Dataset:")
print(f"Shape: {ratings.shape}")
print(f"Columns: {ratings.columns.tolist()}\n")
ratings.head()

Ratings Dataset:
Shape: (100836, 4)
Columns: ['userId', 'movieId', 'rating', 'timestamp']



| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


In [6]:
# Tags dataset
print("Tags Dataset:")
print(f"Shape: {tags.shape}")
print(f"Columns: {tags.columns.tolist()}\n")
tags.head()

Tags Dataset:
Shape: (3683, 4)
Columns: ['userId', 'movieId', 'tag', 'timestamp']



| | userId | movieId | tag | timestamp |
|---|---|---|---|---|
| 0 | 2 | 60756 | funny | 1445714994 |
| 1 | 2 | 60756 | Highly quotable | 1445714996 |
| 2 | 2 | 60756 | will ferrell | 1445714992 |
| 3 | 2 | 89774 | Boxing story | 1445715207 |
| 4 | 2 | 89774 | MMA | 1445715200 |


In [7]:
# Links dataset
print("Links Dataset:")
print(f"Shape: {links.shape}")
print(f"Columns: {links.columns.tolist()}\n")
links.head()


Links Dataset:
Shape: (9742, 3)
Columns: ['movieId', 'imdbId', 'tmdbId']



| | movieId | imdbId | tmdbId |
|---|---|---|---|
| 0 | 1 | 114709 | 862.0 |
| 1 | 2 | 113497 | 8844.0 |
| 2 | 3 | 113228 | 15602.0 |
| 3 | 4 | 114885 | 31357.0 |
| 4 | 5 | 113041 | 11862.0 |


---
### **2.3. Data Quality Check**

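We check each table for missing values and duplicate records. For ratings, a duplicate means a repeated `userId`-`movieId` pair, regardless of the rating or timestamp.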

In [8]:
# Check for missing values
print("Missing Values:")
print(f"Movies: {movies.isnull().sum().sum()}")
print(f"Ratings: {ratings.isnull().sum().sum()}")
print(f"Tags: {tags.isnull().sum().sum()}")
print(f"Links: {links.isnull().sum().sum()}")


Missing Values:
Movies: 0
Ratings: 0
Tags: 0
Links: 8
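
The total of 8 for Links does not say which column is affected. The float-typed `tmdbId` values in the preview above hint at that column, but the output here does not confirm it. A minimal sketch of how a per-column count could locate the gaps, run on a toy frame rather than the real file (`missing_by_column` and `toy_links` are hypothetical names, and the `NaN` is placed by hand for illustration):

```python
import numpy as np
import pandas as pd


def missing_by_column(df: pd.DataFrame) -> pd.Series:
    """Return the missing-value count for each column that has at least one."""
    counts = df.isnull().sum()
    return counts[counts > 0]


# Toy stand-in for links.csv; the NaN is inserted by hand for illustration
toy_links = pd.DataFrame({
    "movieId": [1, 2, 3],
    "imdbId": [114709, 113497, 113228],
    "tmdbId": [862.0, np.nan, 15602.0],
})

missing = missing_by_column(toy_links)
print(missing)
```

Calling `missing_by_column(links)` on the loaded table should show which column accounts for the 8 missing entries.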


In [9]:
# Check for duplicates
print("\nDuplicate Records:")
print(f"Movies: {movies.duplicated().sum()}")
print(f"Ratings: {ratings.duplicated(subset=['userId', 'movieId']).sum()}")
print(f"Tags: {tags.duplicated().sum()}")
print(f"Links: {links.duplicated().sum()}")



Duplicate Records:
Movies: 0
Ratings: 0
Tags: 0
Links: 0


---
### **2.4. Basic Statistics**

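We summarize the data with a few key metrics: the distribution of ratings, the number of unique users and movies, and the sparsity of the user-movie matrix.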

In [10]:
# Rating statistics
print("Ratings Statistics:")
print(ratings['rating'].describe())

Ratings Statistics:
count    100836.000000
mean          3.501557
std           1.042529
min           0.500000
25%           3.000000
50%           3.500000
75%           4.000000
max           5.000000
Name: rating, dtype: float64


In [11]:
# User & Movie counts
print("\nUser & Movie Statistics:")
print(f"Unique Users: {ratings['userId'].nunique():,}")
print(f"Unique Movies: {ratings['movieId'].nunique():,}")


User & Movie Statistics:
Unique Users: 610
Unique Movies: 9,724
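
The ratings cover 9,724 distinct movies, while `movies.csv` lists 9,742. If every rated `movieId` also appears in `movies.csv`, that leaves 18 titles with no rating at all. A small sketch of how such titles could be identified, using toy frames in place of the real tables (`unrated_movie_ids`, `toy_movies`, and `toy_ratings` are hypothetical names):

```python
import pandas as pd


def unrated_movie_ids(movies: pd.DataFrame, ratings: pd.DataFrame) -> set:
    """Return the movieIds present in `movies` that never occur in `ratings`."""
    return set(movies["movieId"]) - set(ratings["movieId"])


# Toy stand-ins for movies.csv and ratings.csv (illustrative values only)
toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 1],
    "rating": [4.0, 5.0, 3.5],
})

unrated = unrated_movie_ids(toy_movies, toy_ratings)
print(unrated)  # movie 2 has no ratings in the toy data
```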


In [12]:
# Calculate sparsity
sparsity = (1 - len(ratings) / (ratings['userId'].nunique() * ratings['movieId'].nunique())) * 100
print(f"Sparsity: {sparsity:.2f}%")
print(f"\n💡 Sparsity shows that {sparsity:.1f}% of user-movie combinations have no rating.")

Sparsity: 98.30%

💡 Sparsity shows that 98.3% of user-movie combinations have no rating.
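
To make the arithmetic explicit: 610 users times 9,724 rated movies gives 5,931,640 possible user-movie pairs, of which only 100,836 carry a rating. The same calculation as a standalone function (`sparsity_pct` is a hypothetical helper name, not part of the notebook):

```python
def sparsity_pct(n_ratings: int, n_users: int, n_items: int) -> float:
    """Percentage of user-item pairs that have no rating."""
    return (1 - n_ratings / (n_users * n_items)) * 100


# Counts taken from the outputs above
value = sparsity_pct(n_ratings=100_836, n_users=610, n_items=9_724)
print(f"{value:.2f}%")  # 98.30%
```

Note that the denominator uses the 9,724 movies that appear in the ratings, not the 9,742 listed in `movies.csv`; using the larger catalog would raise the sparsity slightly.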




<span style="color:#ff6b6b; font-weight:bold;">
After the initial data exploration,
</span>
the dataset includes
<span style="color:#4dabf7; font-weight:bold;">
9,742 movies
</span>
(the ratings cover 9,724 distinct movies) and
<span style="color:#4dabf7; font-weight:bold;">
100,836 ratings
</span>
from
<span style="color:#4dabf7; font-weight:bold;">
610 users
</span>,
with a
<span style="color:#ffa726; font-weight:bold;">
sparsity of 98.3%.
</span>
The ratings contain no missing values or duplicate user-movie pairs and range from
<span style="color:#53cbf1; font-weight:bold;">
0.5 to 5.0
</span>,
averaging
<span style="color:#53cbf1; font-weight:bold;">
3.5
</span>.
The only missing values are 8 entries in the links table.
<span style="color:#ff6b6b; font-weight:bold;">
Because so few user-movie pairs have a rating, the recommendation algorithms must cope with very sparse data.
</span>

---


### **2.5. Data Integration**

The datasets will be combined into one unified DataFrame for analysis.
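
One plausible way to do this, sketched on toy frames, is a left join of the ratings onto the movie metadata by `movieId`, so that every rating row gains its title and genres. This is an assumption about the approach, not necessarily the notebook's actual integration step; `toy_movies`, `toy_ratings`, and `merged` are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (illustrative values only)
toy_movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy"],
})
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 2, 1],
    "rating": [4.0, 3.5, 5.0],
})

# Left join keeps every rating; validate raises if movieId is not unique in toy_movies
merged = toy_ratings.merge(
    toy_movies, on="movieId", how="left", validate="many_to_one"
)
print(merged.head())
```

A left join preserves the row count of the ratings table, and `validate="many_to_one"` guards against duplicated `movieId` values silently multiplying rows.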