<a href="https://colab.research.google.com/github/RecSys-lab/movifex_dataset/blob/main/examples/dataset_statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MoViFex Dataset - Statistics**

🎬 Dataset: [link](https://huggingface.co/datasets/alitourani/MoViFex_Dataset/tree/main)

🎬 Framework: [link](https://github.com/RecSys-lab/MoViFex)

## **[Step 1] Clone Dataset Management Toolkit**

Clone the framework into the Colab workspace (`/content`), install it in editable mode, and add it to the Python path so it can be imported in the following steps.

In [1]:
# Clone the repo
!git clone https://github.com/RecSys-lab/MoViFex.git

# Install the required library
%cd MoViFex
!pip install -e .

# Add the repository to the Python path
import sys
sys.path.append('/content/MoViFex')

# Go back to the root
%cd ..

Cloning into 'MoViFex'...
remote: Enumerating objects: 792, done.
remote: Counting objects: 100% (368/368), done.
remote: Compressing objects: 100% (251/251), done.
remote: Total 792 (delta 197), reused 266 (delta 111), pack-reused 424 (from 1)
Receiving objects: 100% (792/792), 1.03 MiB | 6.29 MiB/s, done.
Resolving deltas: 100% (414/414), done.
/content/MoViFex
Obtaining file:///content/MoViFex
  Preparing metadata (setup.py) ... done
Collecting pytube>=15.0 (from MoViFex==1.0.0)
  Downloading pytube-15.0.0-py3-none-any.whl.metadata (5.0 kB)
Downloading pytube-15.0.0-py3-none-any.whl (57 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.6/57.6 kB 533.3 kB/s eta 0:00:00
Installing collected packages: pytube, MoViFex
  Running setup.py develop for MoViFex
Successfully installed MoViFex-1.0.0 pytube-15.0.0
/content
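
Before moving on, it can help to confirm that the editable install is importable from the current kernel. The sketch below uses only the standard library; `isPackageAvailable` is a hypothetical helper name and not part of MoViFex.

```python
import importlib.util

def isPackageAvailable(name: str) -> bool:
    """Return True if the top-level package `name` can be resolved on the Python path."""
    return importlib.util.find_spec(name) is not None

# 'json' is a standard-library module used as a sanity check;
# 'movifex' is expected to resolve only after `pip install -e .` has succeeded
for pkg in ("json", "movifex"):
    status = "available" if isPackageAvailable(pkg) else "missing"
    print(f"{pkg}: {status}")
```

If `movifex` is reported as missing, rerunning the install cell or restarting the runtime is usually the first thing to try.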


## 🚀 **[Step 2] Use the Dataset and its Framework**

### I. *Load the Dataset Metadata File*

In [2]:
import os
import json
import movifex
import pandas as pd
from movifex.utils import loadJsonFromUrl

# Variables
datasetName = "MoViFex-visual"
datasetMetadataUrl = "https://huggingface.co/datasets/alitourani/MoViFex_Dataset/resolve/main/stats.json"

# Fetch the metadata of the movie features dataset
print(f"- Fetching the dataset metadata from '{datasetMetadataUrl}' ...")
jsonData = loadJsonFromUrl(datasetMetadataUrl)
movifexDF = pd.DataFrame(jsonData)

print(f'- {datasetName} dataset is loaded into a DataFrame:')
movifexDF.head(5)

- Fetching the dataset metadata from 'https://huggingface.co/datasets/alitourani/MoViFex_Dataset/resolve/main/stats.json' ...
- MoViFex-visual dataset is loaded into a DataFrame:


|   | id | title | year | genres |
|---|---|---|---|---|
| 0 | 6 | Heat | 1995 | [Action, Crime, Thriller] |
| 1 | 50 | Usual Suspects, The | 1995 | [Crime, Mystery, Thriller] |
| 2 | 111 | Taxi Driver | 1976 | [Crime, Drama, Thriller] |
| 3 | 150 | Apollo 13 | 1995 | [Adventure, Drama, IMAX] |
| 4 | 165 | Die Hard: With a Vengeance | 1995 | [Action, Crime, Thriller] |
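
Because `genres` holds a list per movie, per-genre counts need one row per (movie, genre) pair first. A minimal sketch, assuming the metadata frame has the shape of the five rows displayed above; the rows are hard-coded so the snippet runs without downloading anything, and `sampleDF` and `genreCounts` are illustrative names.

```python
import pandas as pd

# The five rows displayed by movifexDF.head(5), copied by hand
sampleDF = pd.DataFrame({
    "id": [6, 50, 111, 150, 165],
    "title": ["Heat", "Usual Suspects, The", "Taxi Driver",
              "Apollo 13", "Die Hard: With a Vengeance"],
    "year": [1995, 1995, 1976, 1995, 1995],
    "genres": [
        ["Action", "Crime", "Thriller"],
        ["Crime", "Mystery", "Thriller"],
        ["Crime", "Drama", "Thriller"],
        ["Adventure", "Drama", "IMAX"],
        ["Action", "Crime", "Thriller"],
    ],
})

# explode() turns each list element into its own row, then the genres are counted
genreCounts = sampleDF.explode("genres")["genres"].value_counts()
print(genreCounts)
```

The same two calls should apply to the full `movifexDF`, provided its `genres` column really contains Python lists rather than strings.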


### II. *Load MovieLens*

In [3]:
from movifex.utils import loadDataFromCSV
from movifex.datasets.movielens.downloader import downloadMovielens25m

# Variables
SEED = 42
Sampler = 0.2     # Fraction of the ML-25M ratings kept when sampling
movieLenzVariants = ['ml-1m', 'ml-25m']

for variant in movieLenzVariants:
  movielenzUrl = f"https://files.grouplens.org/datasets/movielens/{variant}.zip"
  # Download the MovieLens dataset
  datasetPath = f"/content/{'ML25' if variant == 'ml-25m' else 'ML1M'}"
  print(f"Downloading the '{variant}' dataset from '{movielenzUrl}' ...")
  isDownloadSuccessful = downloadMovielens25m(movielenzUrl, datasetPath)
  if not isDownloadSuccessful:
    print('- Seems like there was a problem while downloading!')
    continue
  datasetPath = os.path.join(datasetPath, variant)
  # Load the Files
  print(f"\nLoading '{variant}' files from '{datasetPath}' ...")
  if variant == 'ml-25m':
    mlMoviesDF_25M = loadDataFromCSV(os.path.join(datasetPath, "movies.csv"))
    mlRatingsDF_25M = loadDataFromCSV(os.path.join(datasetPath, "ratings.csv"))
    print(f"{len(mlMoviesDF_25M)} movies and {len(mlRatingsDF_25M)} ratings have been loaded!\n")
    # Apply sampling
    origLen = len(mlRatingsDF_25M)
    mlRatingsDF_25M_Sampled = mlRatingsDF_25M.sample(frac=Sampler, random_state=SEED).reset_index(drop=True)
    print(f"{len(mlMoviesDF_25M)} movies and {len(mlRatingsDF_25M_Sampled)} (after sampling) ratings have been loaded!\n")
  elif variant == 'ml-1m':
    # Load the Files
    delim, eng = '::', 'python'
    moviesFile = os.path.join(datasetPath, "movies.dat")
    ratingsFile = os.path.join(datasetPath, "ratings.dat")
    mlMoviesDF_1M = pd.read_csv(moviesFile, sep=delim,
                             names=['itemId', 'title', 'genres'],
                             engine=eng, header=None, encoding='latin-1')
    mlRatingsDF_1M = pd.read_csv(ratingsFile, sep=delim,
                              names=['userId', 'itemId', 'rating', 'timestamp'],
                              engine=eng, header=None, encoding='latin-1')
    print(f"{len(mlMoviesDF_1M)} movies and {len(mlRatingsDF_1M)} ratings have been loaded!\n")

mlRatingsDF_1M.head(5)

Downloading the 'ml-1m' dataset from 'https://files.grouplens.org/datasets/movielens/ml-1m.zip' ...
- Downloading the dataset from 'https://files.grouplens.org/datasets/movielens/ml-1m.zip' ...
- Download completed and the dataset is saved as a 'zip' file!
- Now, extracting the dataset files inside /content/ML1M ...
- Dataset extracted to '/content/ML1M' successfully!
- Removing the zip file /content/ML1M/ml-25m.zip ...
- Zip file removed successfully!

Loading 'ml-1m' files from '/content/ML1M/ml-1m' ...
3883 movies and 1000209 ratings have been loaded!

Downloading the 'ml-25m' dataset from 'https://files.grouplens.org/datasets/movielens/ml-25m.zip' ...
- Downloading the dataset from 'https://files.grouplens.org/datasets/movielens/ml-25m.zip' ...
- Download completed and the dataset is saved as a 'zip' file!
- Now, extracting the dataset files inside /content/ML25 ...
- Dataset extracted to '/content/ML25' successfully!
- Removing the zip file /content/ML25/ml-25m.zip ...
- Zip file 

|   | userId | itemId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |
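
The ML-1M `.dat` files separate fields with `::`, which is why the cell above passes `engine='python'`: pandas treats a multi-character separator as a regular expression, and its default C parser does not support that. The sketch below reads the five rating rows shown above from an in-memory string instead of the downloaded file; `rawRatings` and `ratingsDF` are illustrative names.

```python
import io
import pandas as pd

# Five rows in the ML-1M ratings.dat layout: userId::itemId::rating::timestamp
rawRatings = (
    "1::1193::5::978300760\n"
    "1::661::3::978302109\n"
    "1::914::3::978301968\n"
    "1::3408::4::978300275\n"
    "1::2355::5::978824291\n"
)

ratingsDF = pd.read_csv(
    io.StringIO(rawRatings),
    sep="::",
    engine="python",   # required for the multi-character separator
    header=None,
    names=["userId", "itemId", "rating", "timestamp"],
)

# MovieLens timestamps are seconds since the Unix epoch
ratingsDF["datetime"] = pd.to_datetime(ratingsDF["timestamp"], unit="s")
print(ratingsDF)
```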


### III. *Load MMTF-14K (Fused with Augmented Text)*

In [4]:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Variables
mmtfBase = "https://raw.githubusercontent.com/RecSys-lab/reproducibility_data/refs/heads/main/"
mmtfBaseAudio = mmtfBase + "fused_textual_audio/"
mmtfBaseVisual = mmtfBase + "fused_textual_visual/"
mmtfMapVisual = {
  "cnn":  "fused_llm_mmtf_avg.csv",
  "avf":  "fused_llm_mmtf_avf_avg.csv",
}
mmtfMapAudio = {
  "corr":   "fused_llm_mmtf_audio_correlation.csv",
  "delta":  "fused_llm_mmtf_audio_delta.csv",
  "log":    "fused_llm_mmtf_audio_log.csv",
  "spect":  "fused_llm_mmtf_audio_spectral.csv",
  "i_ivec": "i-vector/fused_llm_mmtf_audio_IVec_splitItem_fold_1_gmm_128_tvDim_20.csv",
}

# Parser Utility
def _parse_safe(s: str) -> np.ndarray:
  vec = np.fromstring(str(s).replace(',', ' '), sep=' ', dtype=np.float32)
  if not np.all(np.isfinite(vec)):
    vec = np.nan_to_num(vec, nan=0.0, posinf=0.0, neginf=0.0)
  return vec

parse = _parse_safe

# Loaders
def loadMMTFVisual(v: str, verbose=True):
  df = pd.read_csv(mmtfBaseVisual + mmtfMapVisual[v])
  df['visual'] = df.embedding.map(parse)
  if verbose:
    print(f"[Visual] Loaded {v:<18} items={len(df):,}")
  return df[['itemId','visual']]

def readMMTFAudioCSV(url):
  df = pd.read_csv(url, low_memory=False)
  df.drop(columns=['title', 'genres'], errors='ignore', inplace=True)
  df.rename(columns={'embedding': 'embeddings'}, inplace=True)
  df['embeddings'] = df['embeddings'].astype(str).str.replace(',', ' ')
  df['embeddings'] = df['embeddings'].apply(parse)
  return df[['itemId', 'embeddings']]

def loadMMTFAudio(variant: str, pca_ratio=0.95, verbose=True):
  # i-vector features are used as-is
  if variant == 'i_ivec':
    df = readMMTFAudioCSV(mmtfBaseAudio + mmtfMapAudio['i_ivec'])
    df.rename(columns={'embeddings': 'audio'}, inplace=True)
    if verbose:
      print(f"[Audio] i‑vector items = {len(df):,}")
    return df
  # BLF: concatenate the four BLF components, standardize, then reduce with PCA
  if variant == 'blf':
    dfs = []
    for key in ('corr', 'delta', 'log', 'spect'):
      dfs.append(readMMTFAudioCSV(mmtfBaseAudio + mmtfMapAudio[key]).rename(columns={'embeddings': f'{key}_emb'}))
    dfm = dfs[0]
    for d in dfs[1:]:
      dfm = dfm.merge(d, on='itemId', how='inner')
    dfm['concat'] = dfm.apply(lambda r: np.concatenate([r['corr_emb'], r['delta_emb'], r['log_emb'], r['spect_emb']]), axis=1)
    X = np.vstack(dfm['concat'].values)
    Xs = StandardScaler().fit_transform(X)
    # A float n_components keeps as many components as needed to explain that share of the variance
    pca = PCA(n_components=pca_ratio, svd_solver='full', random_state=SEED)
    Xp = pca.fit_transform(Xs).astype(np.float32)
    df_audio = pd.DataFrame({'itemId': dfm['itemId'], 'audio': list(Xp)})
    if verbose:
      print(f"[Audio] BLF‑concat PCA-95 dims={Xp.shape[1]:<4} var={pca.explained_variance_ratio_.sum():.2f}  items={len(df_audio):,}")
    return df_audio
  raise ValueError(f"Unknown audio variant: {variant}")

# Load
mmtfAudioDF_blf = loadMMTFAudio('blf')
mmtfAudioDF_ivec = loadMMTFAudio('i_ivec')
mmtfVisualDF_cnn = loadMMTFVisual('cnn')
mmtfVisualDF_avf = loadMMTFVisual('avf')

[Audio] BLF‑concat PCA-95 dims=347  var=0.95  items=1,807
[Audio] i‑vector items = 1,807
[Visual] Loaded cnn                items=1,807
[Visual] Loaded avf                items=1,807
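
Each embedding in these CSV files is stored as a delimited string, so the parser has to turn text into a float vector and neutralize non-finite values. The following is a sketch of the same idea written with `str.split` rather than `np.fromstring`. It assumes the strings contain only numbers separated by commas or whitespace (no surrounding brackets), and `parseEmbedding` is an illustrative name that does not appear in the notebook or in MoViFex.

```python
import numpy as np

def parseEmbedding(s) -> np.ndarray:
    """Parse a comma- or space-separated string into a finite float32 vector."""
    tokens = str(s).replace(",", " ").split()
    vec = np.array([float(t) for t in tokens], dtype=np.float32)
    # Replace NaN and +/-inf with 0.0 so downstream scaling and PCA do not fail
    return np.nan_to_num(vec, nan=0.0, posinf=0.0, neginf=0.0)

vec = parseEmbedding("0.5, -1.25 nan inf 2")
print(vec, vec.dtype)
```

Zero-filling non-finite entries mirrors the choice made in `_parse_safe`; dropping the affected items instead would be an equally defensible alternative.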


### IV. *Load LLM-Augmented Text*

In [5]:
# Variables
textAugBase = ("https://raw.githubusercontent.com/yasdel/Poison-RAG-Plus/"
                 "main/AttackData/Embeddings_from_Augmentation_Attack_Data/"
                 "ml-latest-small/")
textAugPrefix_aug = "enriched_description_part"
textAugPrefix_raw = "originalraw_combined_all_part"

def loadTextAugmented(model, augmented, max_parts=15, verbose=True):
  prefix = f'{model}_{textAugPrefix_aug}' if augmented else f'{model}_{textAugPrefix_raw}'
  dfs = []
  for i in range(1, max_parts+1):
    url = f"{textAugBase}{prefix}{i}.csv.gz"
    try:
      df = pd.read_csv(url, compression='gzip')
      df['text'] = df.embeddings.map(parse)
      dfs.append(df[['itemId','text']])
    except Exception:
      # Parts are numbered consecutively, so stop at the first one that cannot be loaded
      break
  if not dfs:
    raise ValueError(f"No embedding parts could be loaded for prefix '{prefix}'")
  out = pd.concat(dfs).drop_duplicates('itemId')
  if verbose:
    tag = 'AUG' if augmented else 'ORIG'
    print(f"[Text]  {tag} parts={len(dfs)} items={len(out):,}")
  return out

# Load
textAug_st_aug = loadTextAugmented('st', True)
textAug_st_raw = loadTextAugmented('st', False)
textAug_llama_aug = loadTextAugmented('llama', True)
textAug_llama_raw = loadTextAugmented('llama', False)
textAug_openai_aug = loadTextAugmented('openai', True)
textAug_openai_raw = loadTextAugmented('openai', False)

[Text]  AUG parts=1 items=1,606
[Text]  ORIG parts=1 items=1,606
[Text]  AUG parts=4 items=1,606
[Text]  ORIG parts=4 items=1,606
[Text]  AUG parts=3 items=1,606
[Text]  ORIG parts=3 items=1,606
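
The loader above relies on the embedding files being split into consecutively numbered parts and stops at the first part that cannot be read. That pattern can be sketched without pandas or network access; `collectParts` and the in-memory `fakeStore` are hypothetical stand-ins for `pd.read_csv` on the remote URLs.

```python
from typing import Callable, Dict, List

def collectParts(fetch: Callable[[int], List[str]], max_parts: int = 15) -> List[List[str]]:
    """Fetch parts 1..max_parts in order, stopping at the first missing one."""
    parts = []
    for i in range(1, max_parts + 1):
        try:
            parts.append(fetch(i))
        except KeyError:
            # Only the "part does not exist" error ends the loop;
            # any other exception propagates to the caller
            break
    return parts

# Hypothetical store with three parts; part 4 does not exist
fakeStore: Dict[int, List[str]] = {1: ["a", "b"], 2: ["c"], 3: ["d", "e"]}
parts = collectParts(lambda i: fakeStore[i])
items = sorted({x for p in parts for x in p})
print(f"parts={len(parts)} items={len(items)}")
```

Catching only the expected error type keeps unrelated failures, such as a malformed file, from being silently mistaken for the end of the sequence.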


## **📊 [Step 3] Check Some Stats**
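
For each merge, this step reports the number of interactions |R|, users |U|, and items |I|, plus three ratios; the last one, |R|/(|U|·|I|), is the density of the user-item matrix. As a worked check, the sketch below recomputes the ratios from the counts reported for the MoViFex-visual and MovieLens-1M merge (80,608 interactions, 5,916 users, 65 items). `ratingStats` is an illustrative helper, not a MoViFex function.

```python
def ratingStats(interactions: int, users: int, items: int) -> dict:
    """Compute the three ratios reported for each merged ratings table."""
    return {
        "perUser": interactions / users,            # |R|/|U|
        "perItem": interactions / items,            # |R|/|I|
        "density": interactions / (users * items),  # |R|/(|U|*|I|)
    }

# Counts taken from the MoViFex-visual x MovieLens-1M statistics in this notebook
stats = ratingStats(80608, 5916, 65)
print(f"|R|/|U|: {stats['perUser']:.2f}")
print(f"|R|/|I|: {stats['perItem']:.2f}")
print(f"|R|/(|U|*|I|): {stats['density']:.10f}")
```

A density near 0.21 is far higher than is typical for full MovieLens data, which reflects how few items (65) survive the merge rather than unusually active users.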

In [6]:
# Some preparations
movifexDF = movifexDF.rename(columns={'id': 'itemId'})
movifexDF['itemId'] = movifexDF['itemId'].astype(str).astype(int)

def printStatistics(data: pd.DataFrame):
  interactions = data.shape[0]
  uniqueUsers = data['userId'].nunique()
  uniqueItems = data['itemId'].nunique()
  print('------------------------')
  print(f" - Total interactions: {interactions}")
  print(f" - |U|: {uniqueUsers}")
  print(f" - |I|: {uniqueItems}")
  print(f" - |R|/|U|: {interactions / uniqueUsers:.2f}")
  print(f" - |R|/|I|: {interactions / uniqueItems:.2f}")
  print(f" - |R|/(|U|*|I|): {interactions / (uniqueUsers * uniqueItems):.10f}")
  print('------------------------\n')

for variant in movieLenzVariants:
  version = '25M' if variant == 'ml-25m' else '1M'
  if variant == 'ml-25m':
    mlRatingsDF = mlRatingsDF_25M
  elif variant == 'ml-1m':
    mlRatingsDF = mlRatingsDF_1M
  # Rename columns
  mlRatingsDF = mlRatingsDF.rename(columns={'movieId': 'itemId'})
  mlRatingsDFSampled = mlRatingsDF_25M_Sampled.rename(columns={'movieId': 'itemId'})

  # MoViFex merged with the full ratings of the current MovieLens variant
  print(f"Statistics of merging '{datasetName}' with 'MovieLenz-{version}' ratings:")
  mergedDF_ml = pd.merge(mlRatingsDF, movifexDF, on='itemId', how='inner')
  printStatistics(mergedDF_ml)

  # Also check the sampled ratings (MovieLens-25M only)
  if variant == 'ml-25m':
    print(f"Statistics of merging '{datasetName}' with 'MovieLenz-{version}' (sampled) ratings:")
    mergedDF_ml_sampled = pd.merge(mlRatingsDFSampled, movifexDF, on='itemId', how='inner')
    printStatistics(mergedDF_ml_sampled)

  # MMTF-14K
  print(f"Statistics of merging 'MMTF-14K' with 'MovieLenz-{version}' ratings:")
  mergedDF_ml_mmtfAudBLF = pd.merge(mlRatingsDF, mmtfAudioDF_blf, on='itemId', how='inner')
  printStatistics(mergedDF_ml_mmtfAudBLF)

  commonMovieFexMMTF = pd.merge(movifexDF[['itemId']], mmtfAudioDF_blf[['itemId']], on='itemId', how='inner')
  print(f"Statistics of merging '{datasetName}' with 'MMTK-14K' and 'MovieLenz-{version}' ratings:")
  mergedDF_mmtfAudBLF = pd.merge(mlRatingsDF, commonMovieFexMMTF, on='itemId', how='inner')
  printStatistics(mergedDF_mmtfAudBLF)

  # Textual Augmented (LLM)
  print(f"Statistics of merging 'Textual Augmented' with 'MovieLenz-{version}' ratings:")
  mergedDF_ml_textAugStRaw = pd.merge(mlRatingsDF, textAug_st_raw, on='itemId', how='inner')
  printStatistics(mergedDF_ml_textAugStRaw)

  commonMovieFexTextAug = pd.merge(movifexDF[['itemId']], textAug_st_raw[['itemId']], on='itemId', how='inner')
  print(f"Statistics of merging '{datasetName}' with 'Textual Augmented' and 'MovieLenz-{version}' ratings:")
  mergedDF_textAugStRaw = pd.merge(mlRatingsDF, commonMovieFexTextAug, on='itemId', how='inner')
  printStatistics(mergedDF_textAugStRaw)

  # Also check the sampled ratings against the Textual Augmented items (MovieLens-25M only)
  if variant == 'ml-25m':
    print(f"Statistics of merging '{datasetName}' with 'Textual Augmented' and 'MovieLenz-{version}' (sampled) ratings:")
    mergedDF_textAugStRaw_sample = pd.merge(mlRatingsDFSampled, commonMovieFexTextAug, on='itemId', how='inner')
    printStatistics(mergedDF_textAugStRaw_sample)

Statistics of merging 'MoViFex-visual' with 'MovieLenz-1M' ratings:
------------------------
 - Total interactions: 80608
 - |U|: 5916
 - |I|: 65
 - |R|/|U|: 13.63
 - |R|/|I|: 1240.12
 - |R|/(|U|*|I|): 0.2096218859
------------------------

Statistics of merging 'MMTF-14K' with 'MovieLenz-1M' ratings:
------------------------
 - Total interactions: 583376
 - |U|: 6040
 - |I|: 1139
 - |R|/|U|: 96.59
 - |R|/|I|: 512.18
 - |R|/(|U|*|I|): 0.0847984464
------------------------

Statistics of merging 'MoViFex-visual' with 'MMTK-14K' and 'MovieLenz-1M' ratings:
------------------------
 - Total interactions: 49232
 - |U|: 5734
 - |I|: 41
 - |R|/|U|: 8.59
 - |R|/|I|: 1200.78
 - |R|/(|U|*|I|): 0.2094141067
------------------------

Statistics of merging 'Textual Augmented' with 'MovieLenz-1M' ratings:
------------------------
 - Total interactions: 632397
 - |U|: 6040
 - |I|: 992
 - |R|/|U|: 104.70
 - |R|/|I|: 637.50
 - |R|/(|U|*|I|): 0.1055458569
------------------------

Statistics of merging