<a href="https://colab.research.google.com/github/RecSys-lab/movifex_dataset/blob/main/examples/dataset_statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MoViFex Dataset - Statistics**

🎬 Dataset: [MoViFex_Dataset on Hugging Face](https://huggingface.co/datasets/alitourani/MoViFex_Dataset/tree/main)

🎬 Framework: [MoViFex on GitHub](https://github.com/RecSys-lab/MoViFex)

## **[Step 1] Clone Dataset Management Toolkit**

Clone the framework into the Colab workspace (`/content`), install it, and add it to the Python path so it is ready for experiments.

In [1]:
# Clone the repo
!git clone https://github.com/RecSys-lab/MoViFex.git

# Install the required library
%cd MoViFex
!pip install -e .

# Add the repository to the Python path
import sys
sys.path.append('/content/MoViFex')

# Go back to the root
%cd ..

Cloning into 'MoViFex'...
remote: Enumerating objects: 792, done.
remote: Counting objects: 100% (368/368), done.
remote: Compressing objects: 100% (251/251), done.
remote: Total 792 (delta 197), reused 266 (delta 111), pack-reused 424 (from 1)
Receiving objects: 100% (792/792), 1.03 MiB | 4.20 MiB/s, done.
Resolving deltas: 100% (414/414), done.
/content/MoViFex
Obtaining file:///content/MoViFex
  Preparing metadata (setup.py) ... done
Collecting pytube>=15.0 (from MoViFex==1.0.0)
  Downloading pytube-15.0.0-py3-none-any.whl.metadata (5.0 kB)
Downloading pytube-15.0.0-py3-none-any.whl (57 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.6/57.6 kB 2.1 MB/s eta 0:00:00
Installing collected packages: pytube, MoViFex
  Running setup.py develop for MoViFex
Successfully installed MoViFex-1.0.0 pytube-15.0.0
/content


## 🚀 **[Step 2] Use the Dataset and its Framework**

### I. *Load the Dataset Metadata File*

In [20]:
import os
import json
import movifex
import pandas as pd
from movifex.utils import loadJsonFromUrl

# Variables
datasetName = "MoViFex-visual"
datasetMetadataUrl = "https://huggingface.co/datasets/alitourani/MoViFex_Dataset/resolve/main/stats.json"

# Fetch the metadata of the movie features dataset
print(f"- Fetching the dataset metadata from '{datasetMetadataUrl}' ...")
jsonData = loadJsonFromUrl(datasetMetadataUrl)
movifexDF = pd.DataFrame(jsonData)

print(f'- {datasetName} dataset is loaded into a DataFrame:')
movifexDF.head(5)

- Fetching the dataset metadata from 'https://huggingface.co/datasets/alitourani/MoViFex_Dataset/resolve/main/stats.json' ...
- MoViFex-visual dataset is loaded into a DataFrame:


| | id | title | year | genres |
|---|---|---|---|---|
| 0 | 6 | Heat | 1995 | [Action, Crime, Thriller] |
| 1 | 50 | Usual Suspects, The | 1995 | [Crime, Mystery, Thriller] |
| 2 | 111 | Taxi Driver | 1976 | [Crime, Drama, Thriller] |
| 3 | 150 | Apollo 13 | 1995 | [Adventure, Drama, IMAX] |
| 4 | 165 | Die Hard: With a Vengeance | 1995 | [Action, Crime, Thriller] |


### II. *Download MovieLens-25M*

In [21]:
from movifex.utils import loadDataFromCSV
from movifex.datasets.movielens.downloader import downloadMovielens25m

# Variables
movielenzVariant = "ml-25m"
datasetPath = "/content/ML25"
movielenzUrl = f"https://files.grouplens.org/datasets/movielens/{movielenzVariant}.zip"

# Download the MovieLens dataset
print(f"Downloading the '{movielenzVariant}' dataset from '{movielenzUrl}' ...")
isDownloadSuccessful = downloadMovielens25m(movielenzUrl, datasetPath)
if not isDownloadSuccessful:
  print('- It seems there was a problem while downloading the dataset!')
# The archive extracts into a sub-folder named after the variant
datasetPath = os.path.join(datasetPath, movielenzVariant)

# Load the Files
print(f"\nLoading '{movielenzVariant}' files from '{datasetPath}' ...")
mlMoviesDF = loadDataFromCSV(os.path.join(datasetPath, "movies.csv"))
mlRatingsDF = loadDataFromCSV(os.path.join(datasetPath, "ratings.csv"))
print(f"{len(mlMoviesDF)} movies and {len(mlRatingsDF)} ratings have been loaded!")
mlMoviesDF.head(5)

Downloading the 'ml-25m' dataset from 'https://files.grouplens.org/datasets/movielens/ml-25m.zip' ...
- Downloading the dataset from 'https://files.grouplens.org/datasets/movielens/ml-25m.zip' ...
- Download completed and the dataset is saved as a 'zip' file!
- Now, extracting the dataset files inside /content/ML25 ...
- Dataset extracted to '/content/ML25' successfully!
- Removing the zip file /content/ML25/ml-25m.zip ...
- Zip file removed successfully!

Loading 'ml-25m' files from '/content/ML25/ml-25m' ...
62423 movies and 25000095 ratings have been loaded!


| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
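
The two previews use different layouts: the MoViFex metadata keeps `title`, `year`, and a list of `genres` in separate fields, while MovieLens embeds the year in `title` and joins genres with `|`. If the two tables need to be compared column by column, a minimal sketch like the one below could bring MovieLens rows into the same shape. The `toyMoviesDF` frame and the regular expressions are illustrative assumptions, not part of the MoViFex framework, and a title without a trailing `(YYYY)` would simply end up with a missing year.

```python
import pandas as pd

# Toy rows copied from the MovieLens preview above (hypothetical frame name)
toyMoviesDF = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# Extract a trailing "(YYYY)" as the release year; non-matching titles become <NA>
yearText = toyMoviesDF["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)
toyMoviesDF["year"] = pd.to_numeric(yearText, errors="coerce").astype("Int64")

# Remove the year suffix from the title
toyMoviesDF["title"] = toyMoviesDF["title"].str.replace(
    r"\s*\(\d{4}\)\s*$", "", regex=True
)

# Turn the pipe-separated genre string into a list of genres
toyMoviesDF["genres"] = toyMoviesDF["genres"].str.split("|")

print(toyMoviesDF)
```

This only reshapes the columns; the join itself still relies on the numeric movie identifier, not on titles.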


## 📊 **[Step 3] Check Some Stats**

In [23]:
# Align the key column: both frames need an integer 'itemId' column to merge on
movifexDF = movifexDF.rename(columns={'id': 'itemId'})
mlRatingsDF = mlRatingsDF.rename(columns={'movieId': 'itemId'})
movifexDF['itemId'] = movifexDF['itemId'].astype(str).astype(int)

def printStatistics(data: pd.DataFrame):
  # Print the interaction count |R|, the user and item counts, and the density
  interactions = data.shape[0]
  uniqueUsers = data['userId'].nunique()
  uniqueItems = data['itemId'].nunique()
  print('------------------------')
  print(f" - Total interactions: {interactions}")
  print(f" - |U|: {uniqueUsers}")
  print(f" - |I|: {uniqueItems}")
  print(f" - |R|/|U|: {interactions / uniqueUsers:.2f}")
  print(f" - |R|/|I|: {interactions / uniqueItems:.2f}")
  print(f" - |R|/(|U|*|I|): {interactions / (uniqueUsers * uniqueItems):.10f}")
  print('------------------------')

# Keep only the ratings of movies that also appear in the MoViFex metadata (inner join)
mergedDF1 = pd.merge(mlRatingsDF, movifexDF, on='itemId', how='inner')
printStatistics(mergedDF1)

------------------------
 - Total interactions: 3189185
 - |U|: 158146
 - |I|: 274
 - |R|/|U|: 20.17
 - |R|/|I|: 11639.36
 - |R|/(|U|*|I|): 0.0735988347
------------------------
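
As a sanity check, the ratios printed above can be recomputed from the three reported counts, and the counts from Step 2 give a rough idea of how much of MovieLens-25M survives the merge. The following is a small sketch in plain Python with illustrative variable names. The coverage figures use the 62423 rows of `movies.csv` as the item denominator, which is an assumption, since not every listed movie necessarily has ratings.

```python
# Counts copied from the outputs above
interactions = 3189185   # ratings left after the inner merge, |R|
uniqueUsers = 158146     # |U|
uniqueItems = 274        # |I|
mlMovies = 62423         # rows in movies.csv
mlRatings = 25000095     # rows in ratings.csv

# Ratios reported by printStatistics
ratingsPerUser = interactions / uniqueUsers
ratingsPerItem = interactions / uniqueItems
density = interactions / (uniqueUsers * uniqueItems)

# Share of MovieLens-25M kept by the merge
itemCoverage = uniqueItems / mlMovies
ratingCoverage = interactions / mlRatings

print(f"|R|/|U|: {ratingsPerUser:.2f}")       # 20.17
print(f"|R|/|I|: {ratingsPerItem:.2f}")       # 11639.36
print(f"density: {density:.2%}")              # 7.36%
print(f"items kept: {itemCoverage:.2%}")      # 0.44%
print(f"ratings kept: {ratingCoverage:.2%}")  # 12.76%
```

Under these numbers, well under 1% of the listed movies account for roughly one eighth of all ratings, which suggests the movies retained in the merged subset are, on average, heavily rated ones.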
