<a href="https://colab.research.google.com/github/RecSys-lab/MoVieFex/blob/main/examples/read_movieLens25M.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MoVieFex Framework - Load and Process `MovieLens-25M`**

🎬 Dataset MovieLens-25M: [link](https://grouplens.org/datasets/movielens/25m/)

🎬 Framework: [link](https://github.com/RecSys-lab/MoVieFex)


# [Step 1] - Load the Framework

Clone the framework into the Colab workspace (`/content`) and prepare it for experiments.

In [1]:
# Clone the repo
!git clone https://github.com/RecSys-lab/MoVieFex.git

# Install the required library
%cd MoVieFex
!pip install -e .

# Add the repository to the Python path
import sys
sys.path.append('/content/MoVieFex')

Cloning into 'MoVieFex'...
remote: Enumerating objects: 614, done.
remote: Counting objects: 100% (190/190), done.
remote: Compressing objects: 100% (129/129), done.
Receiving objects: 100% (614/614), 126.84 KiB | 2.88 MiB/s, done.
remote: Total 614 (delta 89), reused 154 (delta 55), pack-reused 424 (from 1)
Resolving deltas: 100% (306/306), done.
/content/MoVieFex
Obtaining file:///content/MoVieFex
  Preparing metadata (setup.py) ... done
Collecting pytube>=15.0 (from MoVieFex==1.0.0)
  Downloading pytube-15.0.0-py3-none-any.whl.metadata (5.0 kB)
Collecting scipy>=1.14.1 (from MoVieFex==1.0.0)
  Downloading scipy-1.15.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading pytube-15.0.0-py3-none-any.whl (57 kB)

# [Step 2] - Use the Framework 🚀

Import the framework and define some variables to work with it.

In [7]:
import os
import moviefex

# Similar to the `config.yml` file in the framework - section `datasets/text_dataset`
configs = {
    "name": "Movielens-25m",
    "need_download": True,
    "url": "https://files.grouplens.org/datasets/movielens/ml-25m.zip",
    "download_path": "/content/ML25"
}

# Variables
needDownloaded = configs['need_download']
datasetPath = os.path.normpath(configs['download_path'])

**Test I. Downloading the Dataset**

- ⚙️ Function: `downloadMovielens25m`

In [11]:
from moviefex.datasets.movielens.downloader import downloadMovielens25m

# Check whether the dataset needs to be downloaded
if needDownloaded:
  print(f"- The dataset needs to be downloaded! It will be downloaded in '{datasetPath}' ...")
  isDownloadSuccessful = downloadMovielens25m(configs['url'], datasetPath)
  if isDownloadSuccessful:
    # Go inside the downloaded folder
    datasetPath = os.path.join(datasetPath, "ml-25m")
    print(f"- The dataset files are available in '{datasetPath}'!")
  else:
    print('- Seems like there was a problem while downloading!')

- The dataset needs to be downloaded! It will be downloaded in '/content/ML25' ...
- Downloading the dataset from 'https://files.grouplens.org/datasets/movielens/ml-25m.zip' ...
- Download completed and the dataset is saved as a 'zip' file!
- Now, extracting the dataset files inside /content/ML25 ...
- Dataset extracted to '/content/ML25' successfully!
- Removing the zip file /content/ML25/ml-25m.zip ...
- Zip file removed successfully!
- The dataset files are available in '/content/ML25/ml-25m'!
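
The log above suggests a three-step sequence: download the archive, extract it, and remove the zip file. A minimal sketch of such a helper, using only the standard library, could look like the following. Note that `download_and_extract` is a hypothetical name used for illustration; this is not the framework's actual `downloadMovielens25m` implementation.

```python
import os
import urllib.request
import zipfile


def download_and_extract(url: str, target_dir: str) -> bool:
    """Hypothetical helper: download a zip archive, extract it, delete the archive.

    Returns True on success and False if downloading or extracting fails.
    """
    os.makedirs(target_dir, exist_ok=True)
    zip_path = os.path.join(target_dir, os.path.basename(url))
    try:
        # Step 1: download the archive to disk
        urllib.request.urlretrieve(url, zip_path)
        # Step 2: extract every member into the target directory
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(target_dir)
    except (OSError, zipfile.BadZipFile) as error:
        print(f"- Download or extraction failed: {error}")
        return False
    finally:
        # Step 3: remove the archive, whether or not extraction succeeded
        if os.path.exists(zip_path):
            os.remove(zip_path)
    return True
```

Returning a boolean instead of raising mirrors how the cell above checks `isDownloadSuccessful` before using the extracted folder.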


**Test II. Loading the Movies**

- 📁 File: `ML25/ml-25m/movies.csv`
- ⚙️ Function: `loadDataFromCSV`

In [13]:
from moviefex.utils import loadDataFromCSV

moviesDataFrame = loadDataFromCSV(os.path.join(datasetPath, "movies.csv"))
if moviesDataFrame is None:
  print("- The dataset files could not be found! Checking the inner folder ...")
  # Go inside the 'ml-25m' folder, if not yet
  datasetPath = os.path.join(datasetPath, "ml-25m")
  moviesDataFrame = loadDataFromCSV(os.path.join(datasetPath, "movies.csv"))
  if moviesDataFrame is None:
    # Stop here: the tests below cannot run without the movies data
    raise FileNotFoundError(f"- The dataset files could not be found in '{datasetPath}'!")

# Test#1 - Counting the number of movies
moviesCount = len(moviesDataFrame)
print(f"- The dataset contains {moviesCount} movies!")
# Test#2 - Some samples of the movies
print(f"- The structure of the movies data is as below:")
moviesDataFrame.head(3)

- The dataset contains 62423 movies!
- The structure of the movies data is as below:


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
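
The cell above relies on the loader returning `None` when the file cannot be read. As a rough sketch of that contract (assuming plain `pandas.read_csv` underneath, which the source does not confirm), a loader with the same behavior could be written as follows. The name `load_csv_or_none` is hypothetical and is not part of the framework.

```python
import os
from typing import Optional

import pandas as pd


def load_csv_or_none(path: str) -> Optional[pd.DataFrame]:
    """Hypothetical loader: return a DataFrame, or None if the file cannot be read."""
    # A missing file is reported as None instead of raising
    if not os.path.isfile(path):
        return None
    try:
        return pd.read_csv(path)
    except (pd.errors.EmptyDataError, pd.errors.ParserError):
        # An empty or malformed file is treated like a missing one
        return None
```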


**Test III. Fetching All Genres of the Movies**

- 📁 File: `ML25/ml-25m/movies.csv`
- ⚙️ Function: `fetchAllUniqueGenres`

In [18]:
from moviefex.datasets.movielens.helper_movies import fetchAllUniqueGenres

print(f"- Fetching all genres from the dataset ...")
allGenres = fetchAllUniqueGenres(moviesDataFrame)
print(f"- The dataset contains {len(allGenres)} genres, including: {allGenres}")

- Fetching all genres from the dataset ...
- The dataset contains 20 genres, including: ['Adventure' 'Animation' 'Children' 'Comedy' 'Fantasy' 'Romance' 'Drama'
 'Action' 'Crime' 'Thriller' 'Horror' 'Mystery' 'Sci-Fi' 'IMAX'
 'Documentary' 'War' 'Musical' 'Western' 'Film-Noir' '(no genres listed)']
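
Since `genres` is stored as a pipe-separated string, collecting the unique genres amounts to splitting, flattening, and deduplicating. The snippet below is a small sketch of that idea on three sample rows; it illustrates the technique with plain pandas and is not necessarily how `fetchAllUniqueGenres` is implemented.

```python
import pandas as pd

# Three sample rows copied from the movies preview above
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# Split each string into a list, flatten the lists with explode(),
# then keep the first occurrence of every genre
unique_genres = movies["genres"].str.split("|").explode().unique()
print(list(unique_genres))
```

`unique()` keeps genres in order of first appearance, which is consistent with the ordering seen in the output above.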


**Test IV. Fetching the Movies with a Specific Genre**

- 📁 File: `ML25/ml-25m/movies.csv`
- ⚙️ Function: `fetchMoviesByGenre`

In [17]:
from moviefex.datasets.movielens.helper_movies import fetchMoviesByGenre

# Variables
givenGenre = "Action"

print(f"- Fetching movies by a specific genre (given: {givenGenre}) ...")
moviesByGenre = fetchMoviesByGenre(moviesDataFrame, givenGenre)
print(f"- The dataset contains {len(moviesByGenre)} movies with the genre '{givenGenre}'!")
moviesByGenre.head(5)

- Fetching movies by a specific genre (given: Action) ...
- The dataset contains 7348 movies with the genre 'Action'!


Unnamed: 0,movieId,title,genres
5,6,Heat (1995),Action|Crime|Thriller
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller
14,15,Cutthroat Island (1995),Action|Adventure|Romance
19,20,Money Train (1995),Action|Comedy|Crime|Drama|Thriller
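
For illustration, a genre filter can be sketched by testing whether the requested genre appears as a whole token in the split `genres` string. The helper name `movies_with_genre` is hypothetical, and the framework's `fetchMoviesByGenre` may work differently.

```python
import pandas as pd


def movies_with_genre(movies: pd.DataFrame, genre: str) -> pd.DataFrame:
    """Hypothetical helper: keep rows whose `genres` contains `genre` as a whole token."""
    # Matching on split tokens avoids accidental substring hits
    mask = movies["genres"].str.split("|").apply(lambda tokens: genre in tokens)
    return movies[mask]


# Sample rows copied from the previews above
sample = pd.DataFrame({
    "movieId": [1, 6, 9],
    "title": ["Toy Story (1995)", "Heat (1995)", "Sudden Death (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Action|Crime|Thriller",
        "Action",
    ],
})
action_movies = movies_with_genre(sample, "Action")
print(action_movies["title"].tolist())
```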


**Test V. Fetching the Movies with the Main Genres**

The main genres are `Action`, `Comedy`, `Drama`, and `Horror`.

- 📁 File: `ML25/ml-25m/movies.csv`
- ⚙️ Function: `filterMoviesWithMainGenres`

In [20]:
from moviefex.datasets.movielens.helper_movies import filterMoviesWithMainGenres, mainGenres

print(f"- Fetching movies by the main genres {mainGenres} ...")
mainGenresMoviesDataFrame = filterMoviesWithMainGenres(moviesDataFrame)
print(f"- The dataset contains {len(mainGenresMoviesDataFrame)} movies with the main genres!")
print(f"- A sample of the movies with the main genres:")
mainGenresMoviesDataFrame.head(5)

- Fetching movies by the main genres ['Action', 'Comedy', 'Drama', 'Horror'] ...
- The dataset contains 45596 movies with the main genres!
- A sample of the movies with the main genres:


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
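
Keeping only movies that carry at least one main genre can be expressed as a set-overlap test. The sketch below assumes the four main genres printed above; `MAIN_GENRES` and `filter_main_genres` are illustrative names, not the framework's own `mainGenres` and `filterMoviesWithMainGenres`.

```python
import pandas as pd

# Assumed to match the list printed by the cell above
MAIN_GENRES = ["Action", "Comedy", "Drama", "Horror"]


def filter_main_genres(movies: pd.DataFrame, main_genres=MAIN_GENRES) -> pd.DataFrame:
    """Hypothetical helper: keep movies tagged with at least one main genre."""
    wanted = set(main_genres)
    # A movie is kept when its genre tokens overlap with the wanted set
    mask = movies["genres"].str.split("|").apply(
        lambda tokens: not wanted.isdisjoint(tokens)
    )
    return movies[mask]


sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})
kept = filter_main_genres(sample)
print(kept["title"].tolist())
```

In this sample, `Jumanji (1995)` is dropped because none of its genres is a main genre, which agrees with index `1` being absent from the preview above.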


**Test VI. Fetching the Movies with the Main Genres (Binarized)**

The main genres are `Action`, `Comedy`, `Drama`, and `Horror`.

- 📁 File: `ML25/ml-25m/movies.csv`
- ⚙️ Function: `binarizeMovieGenres`

In [21]:
from moviefex.datasets.movielens.helper_movies import binarizeMovieGenres

moviesDFBinarizedGenres = binarizeMovieGenres(moviesDataFrame)
print(f"- The movies data with binarized genres is as below:")
moviesDFBinarizedGenres.head(5)

- The movies data with binarized genres is as below:


Unnamed: 0,movieId,isAction,isComedy,isDrama,isHorror
0,1,0,1,0,0
1,2,0,0,0,0
2,3,0,1,0,0
3,4,0,1,1,0
4,5,0,1,0,0
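
One way to obtain such indicator columns with plain pandas is `Series.str.get_dummies`, restricted to the main genres. This is only a sketch of the idea under that assumption; `binarize_genres` is a hypothetical helper and may not match the internals of `binarizeMovieGenres`.

```python
import pandas as pd

MAIN_GENRES = ["Action", "Comedy", "Drama", "Horror"]


def binarize_genres(movies: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one 0/1 column per main genre, keyed by movieId."""
    flags = (
        movies["genres"].str.get_dummies(sep="|")    # one column per genre found
        .reindex(columns=MAIN_GENRES, fill_value=0)  # keep main genres, add absent ones as 0
        .add_prefix("is")                            # Action -> isAction, ...
    )
    return pd.concat([movies[["movieId"]], flags], axis=1)


sample = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
        "Comedy|Drama|Romance",
        "Comedy",
    ],
})
binarized = binarize_genres(sample)
print(binarized)
```

The `reindex` step matters: without it, a main genre that never occurs in the input (here `Action` and `Horror`) would have no column at all.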


**Test VII. Augmenting the Movies Genres with Binarized Representations**

Here, we fuse the original movies `DataFrame` with its binarized-genre `DataFrame` for a more informative representation.

- 📁 File: `ML25/ml-25m/movies.csv`
- ⚙️ Function: `augmentMoviesDFWithBinarizedGenres`

In [22]:
from moviefex.datasets.movielens.helper_movies import augmentMoviesDFWithBinarizedGenres

augmentedMoviesDataFrame = augmentMoviesDFWithBinarizedGenres(moviesDataFrame, moviesDFBinarizedGenres)
print(f"- The movie data augmented with binary genres is as below:")
augmentedMoviesDataFrame.head(5)

- The movie data augmented with binary genres is as below:


Unnamed: 0,movieId,title,genres,isAction,isComedy,isDrama,isHorror
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0,1,0,0
1,2,Jumanji (1995),Adventure|Children|Fantasy,0,0,0,0
2,3,Grumpier Old Men (1995),Comedy|Romance,0,1,0,0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,0,1,1,0
4,5,Father of the Bride Part II (1995),Comedy,0,1,0,0
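
Conceptually, this augmentation is a join on `movieId`. The sketch below shows the equivalent plain-pandas operation on two sample rows; it assumes `movieId` is unique in both frames and is not taken from the framework's source.

```python
import pandas as pd

movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy", "Adventure|Children|Fantasy"],
})
binarized = pd.DataFrame({
    "movieId": [1, 2],
    "isAction": [0, 0],
    "isComedy": [1, 0],
    "isDrama": [0, 0],
    "isHorror": [0, 0],
})

# A left join keeps every movie and appends its indicator columns
augmented = movies.merge(binarized, on="movieId", how="left")
print(augmented.columns.tolist())
```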


**Test VIII. Reading User-Driven Data (Ratings)**

- 📁 File: `ML25/ml-25m/ratings.csv`
- ⚙️ Function: `loadDataFromCSV`

In [23]:
from moviefex.utils import loadDataFromCSV

print(f"- Reading dataset's user-driven data and fetching them into a DataFrame ...")
ratingsDataFrame = loadDataFromCSV(os.path.join(datasetPath, "ratings.csv"))
# Test#9 - Counting the number of ratings
ratingsCount = len(ratingsDataFrame)
print(f"- The dataset contains {ratingsCount} ratings!")
# Test#10 - Some samples of the ratings
print(f"- The structure of the user-ratings data is as below:")
ratingsDataFrame.head(5)

- Reading dataset's user-driven data and fetching them into a DataFrame ...
- The dataset contains 25000095 ratings!
- The structure of the user-ratings data is as below:


Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510
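
The `timestamp` column holds Unix time in seconds (UTC), which is hard to read directly. As a small optional sketch, it can be converted with `pandas.to_datetime`; the column name `ratedAt` is an illustrative choice and does not exist in the dataset or the framework.

```python
import pandas as pd

# First three rows of the ratings preview above
ratings = pd.DataFrame({
    "userId": [1, 1, 1],
    "movieId": [296, 306, 307],
    "rating": [5.0, 3.5, 5.0],
    "timestamp": [1147880044, 1147868817, 1147868828],
})

# Interpret the integers as seconds since the Unix epoch, in UTC
ratings["ratedAt"] = pd.to_datetime(ratings["timestamp"], unit="s", utc=True)
print(ratings[["movieId", "rating", "ratedAt"]])
```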


**Test IX. Combining User-Driven and Item Data (Ratings and Movies)**

- 📁 Files: `ML25/ml-25m/movies.csv`, `ML25/ml-25m/ratings.csv`
- ⚙️ Function: `mergeMainGenreMoviesDFWithRatingsDF`

In [25]:
from moviefex.datasets.movielens.helper_ratings import mergeMainGenreMoviesDFWithRatingsDF

print(f"- Merging the movies (of the main genres) and ratings DataFrames ...")
mergedDataFrame = mergeMainGenreMoviesDFWithRatingsDF(augmentedMoviesDataFrame, ratingsDataFrame)
print(f"- The merged DataFrame has {len(mergedDataFrame)} items, such as:")
mergedDataFrame.head(5)

- Merging the movies (of the main genres) and ratings DataFrames ...
- The merged DataFrame has 22681557 items, such as:


Unnamed: 0,movieId,title,genres,isAction,isComedy,isDrama,isHorror,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0,1,0,0,2,3.5,1141415820
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0,1,0,0,3,4.0,1439472215
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0,1,0,0,4,3.0,1573944252
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0,1,0,0,5,4.0,858625949
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0,1,0,0,8,4.0,890492517
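
The merged frame has fewer rows (22681557) than the ratings file (25000095), which is consistent with ratings of movies outside the main genres being dropped. A minimal sketch of such a merge, assuming an inner join on `movieId` after filtering on the indicator columns, is shown below. This is an illustration of the idea, not the actual body of `mergeMainGenreMoviesDFWithRatingsDF`.

```python
import pandas as pd

movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "isAction": [0, 0],
    "isComedy": [1, 0],
    "isDrama": [0, 0],
    "isHorror": [0, 0],
})
# Illustrative ratings: the first two rows come from the preview above,
# the third (for movieId 2) is made up for this example
ratings = pd.DataFrame({
    "userId": [2, 3, 3],
    "movieId": [1, 1, 2],
    "rating": [3.5, 4.0, 3.0],
    "timestamp": [1141415820, 1439472215, 1439472355],
})

flag_columns = ["isAction", "isComedy", "isDrama", "isHorror"]
# Keep only movies with at least one main-genre flag set
main_genre_movies = movies[movies[flag_columns].any(axis=1)]
# The inner join drops ratings whose movie was filtered out
merged = main_genre_movies.merge(ratings, on="movieId", how="inner")
print(len(merged))
```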
