# 01 - Data Collection: MovieLens Dataset

This notebook collects movie data from the MovieLens dataset.

## Objectives
- Download the MovieLens dataset
- Explore the data structure
- Verify the requirements: ≥2,000 items, ≥5 features
- Document the findings

## 1. Import Libraries

In [19]:
import sys
import os

# Add src to path
sys.path.append(os.path.abspath('../src'))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from data_processing.collector import MovieDataCollector

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully")

Libraries imported successfully


## 2. Download Dataset

In [20]:
# Initialize collector
collector = MovieDataCollector(data_dir='../data/raw')

# Download small dataset (good for development)
# Options: 'small', '25m', 'latest'
dataset_dir = collector.download_dataset('small')

print(f"\nDataset location: {dataset_dir}")

Dataset already exists at ..\data\raw\ml-latest-small.zip
Dataset already extracted at ..\data\raw\ml-latest-small

Dataset location: ..\data\raw\ml-latest-small
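
The internals of `MovieDataCollector.download_dataset` are not shown in this notebook. The "already exists" and "already extracted" messages suggest it skips work that is already done. A minimal sketch of that idea follows, assuming the archive URL below and assuming the zip holds a single top-level folder named after the file; `download_and_extract` and its `fetch` parameter are hypothetical names, not the collector's API.

```python
import zipfile
from pathlib import Path
from urllib.request import urlretrieve

# Assumed URL for the small MovieLens archive (not taken from this notebook).
SMALL_URL = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"


def download_and_extract(url, data_dir, fetch=urlretrieve):
    """Download a zip into data_dir and extract it, skipping finished steps."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    zip_path = data_dir / url.rsplit("/", 1)[-1]
    # Assumption: the archive contains one top-level folder named like the zip.
    extract_dir = data_dir / zip_path.stem

    if not zip_path.exists():
        fetch(url, zip_path)  # download only when the zip is missing
    if not extract_dir.exists():
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(data_dir)
    return extract_dir
```

Calling `download_and_extract(SMALL_URL, '../data/raw')` a second time would touch neither the network nor the disk, which matches the behavior seen in the output above.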


## 3. Load Data

In [21]:
# Load all data files
movies = collector.load_movies(dataset_dir)
ratings = collector.load_ratings(dataset_dir)
tags = collector.load_tags(dataset_dir)
links = collector.load_links(dataset_dir)

Loaded 9742 movies
Loaded 100836 ratings
Loaded 3683 tags
Loaded 9742 links
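
The `load_*` helpers presumably wrap `pd.read_csv`, though their code is not shown here. As a hedged sketch, the same kind of load with explicit dtypes could look like the block below. It uses two rows copied from the ratings preview in section 6 so that it runs without the dataset on disk; the dtype choices are an assumption, not what the collector does.

```python
import io

import pandas as pd

# Two rows copied from the ratings preview in section 6.
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,964982703\n"
    "1,3,4.0,964981247\n"
)

# Explicit dtypes avoid surprises from type inference and reduce memory use.
ratings_sample = pd.read_csv(
    sample_csv,
    dtype={
        "userId": "int32",
        "movieId": "int32",
        "rating": "float32",
        "timestamp": "int64",
    },
)
print(ratings_sample.dtypes)
```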


## 4. Dataset Overview

In [22]:
# Get dataset info
info = collector.get_dataset_info(dataset_dir)

print("="*50)
print("DATASET INFORMATION")
print("="*50)
for key, value in info.items():
    print(f"{key}: {value}")

Loaded 9742 movies
Loaded 100836 ratings
Loaded 3683 tags
Loaded 9742 links
DATASET INFORMATION
num_movies: 9742
num_ratings: 100836
num_users: 610
num_tags: 3683
movies_columns: ['movieId', 'title', 'genres']
ratings_columns: ['userId', 'movieId', 'rating', 'timestamp']
avg_rating: 3.501556983616962
rating_range: (np.float64(0.5), np.float64(5.0))
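
The counts above also show how sparse the user-item matrix is. The short calculation below uses only the numbers printed by `get_dataset_info`; the variable names are illustrative.

```python
# Figures taken from the dataset information printed above.
num_users = 610
num_movies = 9742
num_ratings = 100836

# Every user could in principle rate every movie.
possible_ratings = num_users * num_movies
density = num_ratings / possible_ratings
sparsity = 1 - density

print(f"Possible user-movie pairs: {possible_ratings}")
print(f"Density: {density:.2%}, sparsity: {sparsity:.2%}")
```

Roughly 1.7% of all possible user-movie pairs carry a rating, which is typical for recommender datasets and worth keeping in mind for later modelling.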


## 5. Explore Movies Data

In [23]:
# Display first few rows
print("\nMovies DataFrame:")
print(f"Shape: {movies.shape}")
print(f"Columns: {list(movies.columns)}")
movies.head(10)


Movies DataFrame:
Shape: (9742, 3)
Columns: ['movieId', 'title', 'genres']


index,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [24]:
# Data types and missing values
print("\nData Info:")
movies.info()


Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [25]:
# Basic statistics
print("\nBasic Statistics:")
print(f"Total movies: {len(movies)}")
print(f"Unique movie IDs: {movies['movieId'].nunique()}")
print(f"Missing values:\n{movies.isnull().sum()}")


Basic Statistics:
Total movies: 9742
Unique movie IDs: 9742
Missing values:
movieId    0
title      0
genres     0
dtype: int64


In [26]:
# Sample movies
print("\nSample movies:")
movies.sample(5)


Sample movies:


index,movieId,title,genres
2056,2735,"Golden Child, The (1986)",Action|Adventure|Comedy|Fantasy|Mystery
4889,7323,"Good bye, Lenin! (2003)",Comedy|Drama
5560,26717,Begotten (1990),Drama|Horror
1120,1460,That Darn Cat (1997),Children|Comedy|Mystery
6175,44759,Basic Instinct 2 (2006),Crime|Drama|Mystery|Thriller


## 6. Explore Ratings Data

In [27]:
print("\nRatings DataFrame:")
print(f"Shape: {ratings.shape}")
print(f"Columns: {list(ratings.columns)}")
ratings.head(10)


Ratings DataFrame:
Shape: (100836, 4)
Columns: ['userId', 'movieId', 'rating', 'timestamp']


index,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
5,1,70,3.0,964982400
6,1,101,5.0,964980868
7,1,110,4.0,964982176
8,1,151,5.0,964984041
9,1,157,5.0,964984100
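
The `timestamp` column holds Unix epoch seconds. A small sketch of converting it to a readable datetime, using three values copied from the preview above:

```python
import pandas as pd

# Timestamps copied from the first three rows of the ratings preview.
ts = pd.Series([964982703, 964981247, 964982224], name="timestamp")

# unit="s" interprets the integers as seconds since 1970-01-01 UTC.
rated_at = pd.to_datetime(ts, unit="s", utc=True)
print(rated_at)
```

The resulting column supports `.dt.year`, `.dt.month`, and similar accessors, which is convenient for time-based analysis in later notebooks.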


In [28]:
print("\nRatings Statistics:")
print(f"Total ratings: {len(ratings)}")
print(f"Unique users: {ratings['userId'].nunique()}")
print(f"Unique movies rated: {ratings['movieId'].nunique()}")
print(f"\nRating distribution:")
print(ratings['rating'].value_counts().sort_index())


Ratings Statistics:
Total ratings: 100836
Unique users: 610
Unique movies rated: 9724

Rating distribution:
rating
0.5     1370
1.0     2811
1.5     1791
2.0     7551
2.5     5550
3.0    20047
3.5    13136
4.0    26818
4.5     8551
5.0    13211
Name: count, dtype: int64


In [29]:
# Rating statistics
ratings['rating'].describe()

count    100836.000000
mean          3.501557
std           1.042529
min           0.500000
25%           3.000000
50%           3.500000
75%           4.000000
max           5.000000
Name: rating, dtype: float64
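
As a sanity check, the mean reported by `describe()` can be recomputed from the rating distribution printed earlier. The counts below are copied from that output; the share of ratings at 4.0 or above is an extra figure derived from the same counts.

```python
# Rating value -> count, copied from the rating distribution output.
rating_counts = {
    0.5: 1370, 1.0: 2811, 1.5: 1791, 2.0: 7551, 2.5: 5550,
    3.0: 20047, 3.5: 13136, 4.0: 26818, 4.5: 8551, 5.0: 13211,
}

total = sum(rating_counts.values())
# Weighted mean: each rating value weighted by how often it occurs.
mean_rating = sum(r * c for r, c in rating_counts.items()) / total
# Fraction of all ratings that are 4.0 or higher.
share_high = sum(c for r, c in rating_counts.items() if r >= 4.0) / total

print(f"Total: {total}, mean: {mean_rating:.6f}, share >= 4.0: {share_high:.2%}")
```

The total and mean agree with the 100836 ratings and the 3.501557 average reported above. The distribution is skewed toward high ratings.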

## 7. Explore Tags Data

In [30]:
if tags is not None:
    print("\nTags DataFrame:")
    print(f"Shape: {tags.shape}")
    print(f"Columns: {list(tags.columns)}")
    display(tags.head(10))
    
    print(f"\nTotal tags: {len(tags)}")
    print(f"Unique tags: {tags['tag'].nunique()}")
    print(f"\nMost common tags:")
    print(tags['tag'].value_counts().head(10))
else:
    print("No tags data available")


Tags DataFrame:
Shape: (3683, 4)
Columns: ['userId', 'movieId', 'tag', 'timestamp']


index,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200
5,2,89774,Tom Hardy,1445715205
6,2,106782,drugs,1445715054
7,2,106782,Leonardo DiCaprio,1445715051
8,2,106782,Martin Scorsese,1445715056
9,7,48516,way too long,1169687325



Total tags: 3683
Unique tags: 1589

Most common tags:
tag
In Netflix queue     131
atmospheric           36
thought-provoking     24
superhero             24
surreal               23
funny                 23
Disney                23
religion              22
quirky                21
sci-fi                21
Name: count, dtype: int64
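
Tags are free text, so the same tag may appear with different casing or stray whitespace, which can inflate the unique-tag count. The sketch below shows one possible normalization on a small made-up sample; the sample values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample with casing and whitespace variants.
tags_sample = pd.DataFrame(
    {"tag": ["funny", "Funny", " sci-fi", "Sci-Fi", "Tom Hardy"]}
)

# Strip surrounding whitespace and lowercase before counting.
tags_sample["tag_norm"] = tags_sample["tag"].str.strip().str.lower()

raw_unique = tags_sample["tag"].nunique()
norm_unique = tags_sample["tag_norm"].nunique()
print(f"Unique tags: {raw_unique} raw, {norm_unique} after normalization")
```

Whether lowercasing is appropriate depends on the downstream use. Proper names such as actor tags lose their capitalization.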


## 8. Explore Links Data

In [31]:
if links is not None:
    print("\nLinks DataFrame:")
    print(f"Shape: {links.shape}")
    print(f"Columns: {list(links.columns)}")
    display(links.head(10))
    
    print(f"\nIMDb IDs: {links['imdbId'].notna().sum()}")
    print(f"TMDB IDs: {links['tmdbId'].notna().sum()}")
else:
    print("No links data available")


Links DataFrame:
Shape: (9742, 3)
Columns: ['movieId', 'imdbId', 'tmdbId']


index,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
5,6,113277,949.0
6,7,114319,11860.0
7,8,112302,45325.0
8,9,114576,9091.0
9,10,113189,710.0



IMDb IDs: 9742
TMDB IDs: 9734
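
The counts show that 9742 - 9734 = 8 movies lack a TMDB ID, which is why `tmdbId` is loaded as a float. The sketch below shows one way to keep the IDs as integers with missing values, and to rebuild the zero-padded IMDb identifier. The third row is hypothetical, and the seven-digit `tt` padding is an assumption based on the common IMDb title ID format.

```python
import numpy as np
import pandas as pd

links_sample = pd.DataFrame(
    {
        "movieId": [1, 2, 999999],           # last row is hypothetical
        "imdbId": [114709, 113497, 1234567],
        "tmdbId": [862.0, 8844.0, np.nan],   # NaN forces a float column
    }
)

# Nullable integer dtype keeps missing values without turning IDs into floats.
links_sample["tmdbId"] = links_sample["tmdbId"].astype("Int64")

# Assumed format: "tt" followed by the ID zero-padded to 7 digits.
links_sample["imdb_tt"] = "tt" + links_sample["imdbId"].astype(str).str.zfill(7)
print(links_sample)
```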


## 9. Extract Features from Movies

Extract additional features: parse the release year from the title and split the genres.

In [32]:
# Extract year from title
movies['year'] = movies['title'].str.extract(r'\((\d{4})\)', expand=False)
movies['year'] = pd.to_numeric(movies['year'], errors='coerce')

# Clean title (remove year)
movies['title_clean'] = movies['title'].str.replace(r'\s*\(\d{4}\)', '', regex=True)

print("\nExtracted features:")
movies[['title', 'title_clean', 'year', 'genres']].head(10)


Extracted features:


index,title,title_clean,year,genres
0,Toy Story (1995),Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy
1,Jumanji (1995),Jumanji,1995.0,Adventure|Children|Fantasy
2,Grumpier Old Men (1995),Grumpier Old Men,1995.0,Comedy|Romance
3,Waiting to Exhale (1995),Waiting to Exhale,1995.0,Comedy|Drama|Romance
4,Father of the Bride Part II (1995),Father of the Bride Part II,1995.0,Comedy
5,Heat (1995),Heat,1995.0,Action|Crime|Thriller
6,Sabrina (1995),Sabrina,1995.0,Comedy|Romance
7,Tom and Huck (1995),Tom and Huck,1995.0,Adventure|Children
8,Sudden Death (1995),Sudden Death,1995.0,Action
9,GoldenEye (1995),GoldenEye,1995.0,Action|Adventure|Thriller
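
The `year` column shows up as `1995.0` because titles without a year produce `NaN`, which forces a float dtype. A slightly stricter variant is sketched below: it anchors the pattern to the end of the title and stores the year as a nullable integer. The last two sample titles are hypothetical edge cases, not rows from the dataset.

```python
import pandas as pd

titles = pd.Series(
    [
        "Toy Story (1995)",
        "Golden Child, The (1986)",
        "Good bye, Lenin! (2003) ",  # hypothetical trailing-space variant
        "Untitled Project",          # hypothetical title without a year
    ]
)

# Only accept a four-digit year in parentheses at the end of the title.
year = titles.str.extract(r"\((\d{4})\)\s*$", expand=False)
year = pd.to_numeric(year, errors="coerce").astype("Int64")

# Remove the trailing year and any leftover whitespace.
title_clean = titles.str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True).str.strip()

print(pd.DataFrame({"title_clean": title_clean, "year": year}))
```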


In [33]:
# Parse genres
print("\nGenre distribution:")
all_genres = movies['genres'].str.split('|').explode()
genre_counts = all_genres.value_counts()
print(genre_counts)


Genre distribution:
genres
Drama                 4361
Comedy                3756
Thriller              1894
Action                1828
Romance               1596
Adventure             1263
Crime                 1199
Sci-Fi                 980
Horror                 978
Fantasy                779
Children               664
Animation              611
Mystery                573
Documentary            440
War                    382
Musical                334
Western                167
IMAX                   158
Film-Noir               87
(no genres listed)      34
Name: count, dtype: int64
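
Because a movie can carry several genres, the counts above sum to more than 9742. For modelling, the pipe-separated string is often expanded into one indicator column per genre. A minimal sketch with `Series.str.get_dummies`, using three rows copied from the movies preview:

```python
import pandas as pd

# Rows copied from the movies preview in section 5.
movies_sample = pd.DataFrame(
    {
        "title": ["Toy Story (1995)", "Heat (1995)", "Sudden Death (1995)"],
        "genres": [
            "Adventure|Animation|Children|Comedy|Fantasy",
            "Action|Crime|Thriller",
            "Action",
        ],
    }
)

# One 0/1 column per genre; sep tells pandas how the labels are delimited.
genre_flags = movies_sample["genres"].str.get_dummies(sep="|")
print(genre_flags.sum().sort_values(ascending=False))
```

Summing each indicator column reproduces the per-genre counts that `explode()` and `value_counts()` give.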


## 10. Merge Data for Rich Features

In [34]:
# Merge movies with average ratings
movie_stats = ratings.groupby('movieId').agg({
    'rating': ['mean', 'count']
}).reset_index()

movie_stats.columns = ['movieId', 'avg_rating', 'num_ratings']

# Merge with movies
movies_enriched = movies.merge(movie_stats, on='movieId', how='left')

# Fill NaN ratings (movies with no ratings yet); note that 0 is a placeholder outside the valid 0.5-5.0 range
movies_enriched['avg_rating'] = movies_enriched['avg_rating'].fillna(0)
movies_enriched['num_ratings'] = movies_enriched['num_ratings'].fillna(0)

print("\nEnriched Movies DataFrame:")
print(f"Shape: {movies_enriched.shape}")
movies_enriched.head(10)


Enriched Movies DataFrame:
Shape: (9742, 7)


index,movieId,title,genres,year,title_clean,avg_rating,num_ratings
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995.0,Toy Story,3.92093,215.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995.0,Jumanji,3.431818,110.0
2,3,Grumpier Old Men (1995),Comedy|Romance,1995.0,Grumpier Old Men,3.259615,52.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995.0,Waiting to Exhale,2.357143,7.0
4,5,Father of the Bride Part II (1995),Comedy,1995.0,Father of the Bride Part II,3.071429,49.0
5,6,Heat (1995),Action|Crime|Thriller,1995.0,Heat,3.946078,102.0
6,7,Sabrina (1995),Comedy|Romance,1995.0,Sabrina,3.185185,54.0
7,8,Tom and Huck (1995),Adventure|Children,1995.0,Tom and Huck,2.875,8.0
8,9,Sudden Death (1995),Action,1995.0,Sudden Death,3.125,16.0
9,10,GoldenEye (1995),Action|Adventure|Thriller,1995.0,GoldenEye,3.496212,132.0
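
Only 9724 of the 9742 movies appear in the ratings, so 18 movies have no ratings at all. Giving them an average of 0 places them below the lowest possible rating of 0.5. One alternative, sketched here on a tiny hypothetical sample, is to leave `avg_rating` missing and add an explicit flag; `has_ratings` is an illustrative column name, not one used elsewhere in this project.

```python
import pandas as pd

# Hypothetical miniature tables; movie 3 has no ratings.
movies_s = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
ratings_s = pd.DataFrame({"movieId": [1, 1, 2], "rating": [4.0, 5.0, 3.0]})

# Named aggregation gives flat column names directly.
stats = (
    ratings_s.groupby("movieId")["rating"]
    .agg(avg_rating="mean", num_ratings="count")
    .reset_index()
)

enriched = movies_s.merge(stats, on="movieId", how="left")
# A count of 0 is meaningful, so only the count is filled.
enriched["num_ratings"] = enriched["num_ratings"].fillna(0).astype(int)
enriched["has_ratings"] = enriched["num_ratings"] > 0
# avg_rating stays NaN for unrated movies instead of becoming 0.
print(enriched)
```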


## 11. Verify Requirements

Check whether the dataset meets the requirements:
- At least 2,000 items
- At least 5 features

In [35]:
print("="*50)
print("REQUIREMENTS VERIFICATION")
print("="*50)

# Check the number of items
num_movies = len(movies_enriched)
items_ok = num_movies >= 2000
print(f"\n1. Number of items: {num_movies}")
if items_ok:
    print("   PASS: Dataset has ≥ 2,000 items")
else:
    print("   FAIL: Dataset has < 2,000 items")

# Check the number of features (count only columns that actually exist)
features = ['movieId', 'title_clean', 'genres', 'year', 'avg_rating', 'num_ratings']
features = [feat for feat in features if feat in movies_enriched.columns]
features_ok = len(features) >= 5
print(f"\n2. Available features ({len(features)}):")
for i, feat in enumerate(features, 1):
    print(f"   {i}. {feat}")

if features_ok:
    print("\n   PASS: Dataset has ≥ 5 features")
else:
    print("\n   FAIL: Dataset has < 5 features")

# Report overall success only when both checks pass
print("\n" + "="*50)
if items_ok and features_ok:
    print("ALL REQUIREMENTS MET!")
else:
    print("SOME REQUIREMENTS NOT MET")
print("="*50)

REQUIREMENTS VERIFICATION

1. Number of items: 9742
   PASS: Dataset has ≥ 2,000 items

2. Available features (6):
   1. movieId
   2. title_clean
   3. genres
   4. year
   5. avg_rating
   6. num_ratings

   PASS: Dataset has ≥ 5 features

ALL REQUIREMENTS MET!


## 12. Save Processed Data

In [36]:
# Create processed data directory
processed_dir = '../data/processed'
os.makedirs(processed_dir, exist_ok=True)

# Save enriched movies data
output_file = os.path.join(processed_dir, 'movies_enriched.csv')
movies_enriched.to_csv(output_file, index=False)
print(f"\nSaved enriched movies to: {output_file}")

# Save ratings
ratings_file = os.path.join(processed_dir, 'ratings.csv')
ratings.to_csv(ratings_file, index=False)
print(f"Saved ratings to: {ratings_file}")

print("\nData collection completed successfully!")


Saved enriched movies to: ../data/processed\movies_enriched.csv
Saved ratings to: ../data/processed\ratings.csv

Data collection completed successfully!
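
A quick way to confirm that a saved file is usable is to read it back and compare it with the in-memory frame. The sketch below does this on a small stand-in DataFrame in a temporary directory, so it does not touch `../data/processed`; the same comparison could be applied to `movies_enriched.csv`.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for movies_enriched (values copied from the preview above).
df = pd.DataFrame(
    {
        "movieId": [1, 2],
        "title_clean": ["Toy Story", "Jumanji"],
        "year": [1995.0, 1995.0],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movies_enriched.csv")
    df.to_csv(path, index=False)   # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == df.shape
same_columns = list(reloaded.columns) == list(df.columns)
print(f"Same shape: {same_shape}, same columns: {same_columns}")
```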


## 13. Summary

### Dataset Statistics
- **Total Movies:** 9,742
- **Total Ratings:** 100,836
- **Total Users:** 610
- **Features:** 6 verified (movieId, title_clean, genres, year, avg_rating, num_ratings)

### Next Steps
1. Data collection (this notebook): done
2. Data cleaning (notebook 02)
3. EDA & visualization (notebook 03)
4. Model building (notebook 04)
5. Model evaluation (notebook 05)