<a href="https://colab.research.google.com/github/RecSys-lab/Popcorn/blob/main/examples/colab/experiment_split.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **🍿 Popcorn Framework in Google Colab**
### **Train/Test Split Setups**

🎬 Popcorn Framework: [link](https://github.com/RecSys-lab/Popcorn)

## **[Step 1] Clone Popcorn Movie Recommender Tool**

Clone the framework into the Colab runtime (under `/content`) and install it so it is ready for experiments.

⚠️ You might see a *"Restart Session"* warning during the first run in Google Colab due to library version mismatches. This is expected! Accept the restart, re-run this cell, and continue!

In [1]:
# Clone the repo
!git clone https://github.com/RecSys-lab/Popcorn.git

# Install the package in editable mode
%cd Popcorn
!pip install -e .

# Add the repository to the Python path
import sys
sys.path.append('/content/Popcorn')

# Go back to the root
%cd ..

fatal: destination path 'Popcorn' already exists and is not an empty directory.
/content/Popcorn
Obtaining file:///content/Popcorn
  Preparing metadata (setup.py) ... done
Installing collected packages: Popcorn
  Attempting uninstall: Popcorn
    Found existing installation: Popcorn 1.6.0
    Uninstalling Popcorn-1.6.0:
      Successfully uninstalled Popcorn-1.6.0
  Running setup.py develop for Popcorn
Successfully installed Popcorn-1.6.0
/content


## 🚀 **[Step 2] Use the Framework**

### *1. Load Configurations and Imports*

In [2]:
import os
import json
import pandas as pd
from popcorn.utils import readConfigs

# Start the Framework
print("Welcome to 'Popcorn' 🍿! Starting the framework for your movie recommendation ...\n")

# Read the configuration file
configs = readConfigs("Popcorn/popcorn/config/config.yml")
# If the file could not be read, report the error
if not configs:
    print("Error reading the configuration file!")

Welcome to 'Popcorn' 🍿! Starting the framework for your movie recommendation ...

- Reading the framework's configuration file ...
- Configuration file loaded successfully!


### *2. Download the MovieLens Dataset*

In [3]:
from popcorn.datasets.movielens.downloader import downloadMovieLens

# Override (optional)
configs["datasets"]["unimodal"]["movielens"]["version"] = "1m" # '100k' | '1m' | '25m'
configs["datasets"]["unimodal"]["movielens"]["download_path"] = "/content/MovieLens"

# Variables
mlVersion = configs["datasets"]["unimodal"]["movielens"]["version"]
downloadPath = configs["datasets"]["unimodal"]["movielens"]["download_path"]

# Download MovieLens dataset
downloadMovieLens(mlVersion, downloadPath)


- Downloading the MovieLens-1m dataset ...
- Creating the download path '/content/MovieLens/ml-1m' ...
- Fetching data from 'https://files.grouplens.org/datasets/movielens/ml-1m.zip' ...
- Download completed and the dataset is saved as a 'zip' file!
- Extracting the dataset files inside '/content/MovieLens/ml-1m' ...
- Dataset extracted to '/content/MovieLens/ml-1m' successfully!
- Removing the zip file '/content/MovieLens/ml-1m/ml-1m.zip' ...
- Zip file removed successfully!


True

### *3. Load the MovieLens Dataset*

In [4]:
from popcorn.datasets.movielens.loader import loadMovieLens
from popcorn.datasets.utils import printTextualDatasetStats

# Load MovieLens
itemsDF, usersDF, ratingsDF = loadMovieLens(configs)
if itemsDF is None:
    print("Error in loading the MovieLens dataset! Exiting ...")
else:
    print(f"\n- ItemsDF (shape: {itemsDF.shape}): \n{itemsDF.head()}")
    print(f"\n- RatingsDF original row count: {len(ratingsDF):,}")
    printTextualDatasetStats(ratingsDF)


- Downloading the MovieLens-1m dataset ...
- The download path '/content/MovieLens/ml-1m' already exists! Skipping the download ...

- Loading 'MovieLens-1m' data from '/content/MovieLens/ml-1m/ml-1m' ...
- Items (movies) have been loaded. Number of rows: 3,883
- Users have been loaded. Number of rows: 6,040
- Ratings have been loaded. Number of rows: 1,000,209

- ItemsDF (shape: (3883, 3)): 
  item_id                               title  \
0       1                    Toy Story (1995)   
1       2                      Jumanji (1995)   
2       3             Grumpier Old Men (1995)   
3       4            Waiting to Exhale (1995)   
4       5  Father of the Bride Part II (1995)   

                             genres  
0   [Animation, Children's, Comedy]  
1  [Adventure, Children's, Fantasy]  
2                 [Comedy, Romance]  
3                   [Comedy, Drama]  
4                          [Comedy]  

- RatingsDF original row count: 1,000,209
--------------------------
- The Data

### *4. Apply K-Core 40*

In [8]:
from popcorn.datasets.utils import applyKcore

# Variables
K_CORE = configs["setup"]["k_core"] # Default k-core value from the config (10 in this run)

# Before filtering
print(f"- Before {K_CORE}-core filtering row count: {len(ratingsDF):,}")

# Override to K_CORE=40
K_CORE = 40

# Apply K-Core
ratingsDF_filtered = applyKcore(ratingsDF, K_CORE)
print(f"- After {K_CORE}-core filtering row count: {len(ratingsDF_filtered):,}")

- Before 10-core filtering row count: 1,000,209
- Applying 40-core filtering ...
- After 40-core filtering row count: 945,972
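
To make the filtering step more concrete, here is a minimal sketch of an iterative k-core filter in pandas. It is not Popcorn's `applyKcore`: the helper name `k_core_filter` is hypothetical, and it assumes that both users and items must keep at least `k` ratings, which the notebook does not confirm. The loop repeats because removing a sparse item can push a user below the threshold, and vice versa.

```python
import pandas as pd


def k_core_filter(ratings: pd.DataFrame, k: int) -> pd.DataFrame:
    """Hypothetical helper (not Popcorn's applyKcore): iterative k-core filter."""
    filtered = ratings
    while True:
        # Count the remaining ratings of each row's user and item
        user_counts = filtered["user_id"].map(filtered["user_id"].value_counts())
        item_counts = filtered["item_id"].map(filtered["item_id"].value_counts())
        keep = (user_counts >= k) & (item_counts >= k)
        if keep.all():
            # Nothing left to drop: every user and item has at least k ratings
            return filtered.reset_index(drop=True)
        filtered = filtered[keep]


# Toy data: item 12 has one rating, so user 3 drops below k=2 once it is removed
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "item_id": [10, 11, 10, 11, 10, 12],
})
core = k_core_filter(toy, k=2)
print(core)
```

On the toy data, only the four ratings of users 1 and 2 on items 10 and 11 survive.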


### *5. Split the Data ('random' - 20% Test Ratio)*

In [9]:
from popcorn.datasets.movielens.process import trainTestSplit

# Override
configs["setup"]["split"]["mode"] = "random"
configs["setup"]["split"]["test_ratio"] = 0.2

# Work on a copy so ratingsDF_filtered stays unchanged
ratingDFClone = ratingsDF_filtered.copy()

# Split
trainTestSplit(ratingDFClone, configs)


- Splitting the ratings DataFrame using 'random' mode and test ratio '0.2' ...
- Splitting finished! Train data: 756,778 - Test data: 189,194.


(        user_id  item_id  rating   timestamp
 0           550     3868       4   976164508
 1            95       10       4   977626958
 2          2899      216       2   971920349
 3          1194      499       3  1036724437
 4          4344      914       4   966453930
 ...         ...      ...     ...         ...
 756773      351     2692       5   976677837
 756774     4286     1096       1   965278495
 756775     3285      832       4   968118712
 756776     4250      227       4   965308627
 756777      937     2959       5  1030489689
 
 [756778 rows x 4 columns],
         user_id  item_id  rating  timestamp
 756778      834     1286       4  975363405
 756779     1899     2971       4  974694741
 756780     5786     2379       3  958268809
 756781     2088     2088       3  974658351
 756782      236     1198       5  976764786
 ...         ...      ...     ...        ...
 945967     1658     2528       4  974715626
 945968     2263     3868       3  974585128
 945969      
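
The random mode can be pictured as a shuffle followed by a single cut. The sketch below is an assumption about the general technique, not Popcorn's `trainTestSplit`; the helper `random_split` and its `seed` parameter are hypothetical. Rounding the test size down is also an assumption, although it is consistent with the sizes reported above (756,778 train and 189,194 test rows out of 945,972).

```python
import pandas as pd


def random_split(ratings: pd.DataFrame, test_ratio: float, seed: int = 42):
    """Hypothetical helper: shuffle all rows, then cut off the last test_ratio share."""
    shuffled = ratings.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = int(len(shuffled) * test_ratio)  # test size is rounded down
    n_train = len(shuffled) - n_test
    return shuffled.iloc[:n_train], shuffled.iloc[n_train:]


toy = pd.DataFrame({"user_id": range(10), "item_id": range(100, 110)})
train, test = random_split(toy, test_ratio=0.2)
print(len(train), len(test))

# Size check against the run above (945,972 rows after 40-core filtering)
n_test_expected = int(945_972 * 0.2)
n_train_expected = 945_972 - n_test_expected
print(n_train_expected, n_test_expected)
```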

### *6. Split the Data ('random' - 40% Test Ratio)*

In [10]:
from popcorn.datasets.movielens.process import trainTestSplit

# Override
configs["setup"]["split"]["mode"] = "random"
configs["setup"]["split"]["test_ratio"] = 0.4

# Work on a copy so ratingsDF_filtered stays unchanged
ratingDFClone = ratingsDF_filtered.copy()

# Split
trainTestSplit(ratingDFClone, configs)


- Splitting the ratings DataFrame using 'random' mode and test ratio '0.4' ...
- Splitting finished! Train data: 567,584 - Test data: 378,388.


(        user_id  item_id  rating   timestamp
 0           550     3868       4   976164508
 1            95       10       4   977626958
 2          2899      216       2   971920349
 3          1194      499       3  1036724437
 4          4344      914       4   966453930
 ...         ...      ...     ...         ...
 567579     3204     2858       4   968566899
 567580     3378     1240       4   967757388
 567581     5153     1839       3   961970007
 567582     4186     1466       2  1028160052
 567583     3483     2291       2   967069506
 
 [567584 rows x 4 columns],
         user_id  item_id  rating  timestamp
 567584     2825     3730       4  972610314
 567585     5401     3052       5  963267100
 567586     5518     2826       5  976418439
 567587     3808     1291       4  965963190
 567588     3762     1982       2  966434472
 ...         ...      ...     ...        ...
 945967     1658     2528       4  974715626
 945968     2263     3868       3  974585128
 945969      

### *7. Split the Data ('temporal' - 25% Test Ratio)*

In [11]:
from popcorn.datasets.movielens.process import trainTestSplit

# Override
configs["setup"]["split"]["mode"] = "temporal"
configs["setup"]["split"]["test_ratio"] = 0.25

# Work on a copy so ratingsDF_filtered stays unchanged
ratingDFClone = ratingsDF_filtered.copy()

# Split
trainTestSplit(ratingDFClone, configs)


- Splitting the ratings DataFrame using 'temporal' mode and test ratio '0.25' ...
- Splitting finished! Train data: 709,479 - Test data: 236,493.


(         user_id  item_id  rating  timestamp
 1000138     6040      858       4  956703932
 999873      6040      593       5  956703954
 1000153     6040     2384       4  956703954
 1000007     6040     1961       4  956703977
 1000192     6040     2019       5  956703977
 ...          ...      ...     ...        ...
 139007       889     1086       3  975248617
 137752       889      924       4  975248617
 138775       889     1964       5  975248617
 137816       889      942       4  975248617
 137655       889     2076       4  975248617
 
 [709479 rows x 4 columns],
         user_id  item_id  rating   timestamp
 139073      889     3788       5   975248617
 138736      889     1950       3   975248617
 137835      889     1625       4   975248642
 138660      889     3386       5   975248642
 138307      889      800       3   975248642
 ...         ...      ...     ...         ...
 825793     4958     2399       1  1046454338
 825438     4958     1407       5  1046454443
 825
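
A global temporal split can be sketched as a sort by `timestamp` followed by one cut, so that every training rating is at least as old as every test rating. This matches the ordering visible in the output above, but the helper `temporal_split` is hypothetical and only illustrates the idea; it is not taken from Popcorn's code.

```python
import pandas as pd


def temporal_split(ratings: pd.DataFrame, test_ratio: float):
    """Hypothetical helper: oldest ratings go to train, newest to test."""
    # Original index labels are kept, as in the output shown above
    ordered = ratings.sort_values("timestamp", kind="stable")
    n_test = int(len(ordered) * test_ratio)
    n_train = len(ordered) - n_test
    return ordered.iloc[:n_train], ordered.iloc[n_train:]


toy = pd.DataFrame({
    "user_id": [1, 2, 1, 3, 2, 3, 1, 2],
    "item_id": [10, 11, 12, 10, 13, 11, 14, 10],
    "timestamp": [50, 10, 80, 30, 70, 20, 60, 40],
})
train, test = temporal_split(toy, test_ratio=0.25)
print(test)
```

With a single global cut, users whose ratings are all old never appear in the test set, which is worth keeping in mind when comparing this mode with the per-user one.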

### *8. Split the Data ('per_user' - 30% Test Ratio)*

In [12]:
from popcorn.datasets.movielens.process import trainTestSplit

# Override
configs["setup"]["split"]["mode"] = "per_user"
configs["setup"]["split"]["test_ratio"] = 0.3

# Work on a copy so ratingsDF_filtered stays unchanged
ratingDFClone = ratingsDF_filtered.copy()

# Split
trainTestSplit(ratingDFClone, configs)


- Splitting the ratings DataFrame using 'per_user' mode and test ratio '0.3' ...
- Splitting finished! Train data: 941,270 - Test data: 4,702.


(        user_id  item_id  rating  timestamp
 0             1     3186       4  978300019
 1             1     1270       5  978300055
 2             1     1721       4  978300055
 3             1     1022       5  978300055
 4             1     2340       3  978300103
 ...         ...      ...     ...        ...
 941265     6040      232       5  997454398
 941266     6040     2917       4  997454429
 941267     6040     1921       4  997454464
 941268     6040     1784       3  997454464
 941269     6040      161       3  997454486
 
 [941270 rows x 4 columns],
          user_id  item_id  rating  timestamp
 25             1       48       5  978824351
 136            2     1917       3  978300174
 232            3     2081       4  978298504
 258            5      288       2  978246585
 475            6      597       5  978239019
 ...          ...      ...     ...        ...
 998435      6035     3146       1  956713640
 999251      6036     2643       1  956755196
 999684      603
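
One common reading of a per-user split is to hold out the most recent share of each user's ratings. The sketch below follows that reading and is an assumption, not Popcorn's implementation: the run above reports only 4,702 test rows for a ratio of 0.3, and the visible test rows each belong to a different user, so the framework's `per_user` mode may select its test rows differently. The helper `per_user_split` is hypothetical.

```python
import pandas as pd


def per_user_split(ratings: pd.DataFrame, test_ratio: float):
    """Hypothetical helper: hold out each user's latest ratings (at least one)."""
    ordered = ratings.sort_values(["user_id", "timestamp"], kind="stable")
    groups = ordered.groupby("user_id")
    # 0 marks the latest rating of a user, 1 the one before it, and so on
    rank_from_end = groups.cumcount(ascending=False)
    n_per_user = groups["item_id"].transform("size")
    n_holdout = (n_per_user * test_ratio).astype(int).clip(lower=1)
    is_test = rank_from_end < n_holdout
    return ordered[~is_test], ordered[is_test]


toy = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "item_id": [10, 11, 12, 13, 10, 14],
    "timestamp": [4, 1, 3, 2, 5, 6],
})
train, test = per_user_split(toy, test_ratio=0.5)
print(test)
```

Here user 1 contributes two test rows and user 2 contributes one, so every user is represented in both parts.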

### *9. Split the Data (Unsupported Variant)*

In [13]:
from popcorn.datasets.movielens.process import trainTestSplit

# Override
configs["setup"]["split"]["mode"] = "leave_one_out" # Unsupported

# Work on a copy so ratingsDF_filtered stays unchanged
ratingDFClone = ratingsDF_filtered.copy()

# Split
trainTestSplit(ratingDFClone, configs)


- Splitting the ratings DataFrame using 'leave_one_out' mode and test ratio '0.3' ...
- [Error] Unsupported split mode 'leave_one_out'! Exiting ...


### *10. Split the Data (Unsupported Split Ratio)*

In [15]:
from popcorn.datasets.movielens.process import trainTestSplit

# Override
configs["setup"]["split"]["mode"] = "per_user"
configs["setup"]["split"]["test_ratio"] = 1.5 # Unsupported

# Work on a copy so ratingsDF_filtered stays unchanged
ratingDFClone = ratingsDF_filtered.copy()

# Split
trainTestSplit(ratingDFClone, configs)


- Splitting the ratings DataFrame using 'per_user' mode and test ratio '1.5' ...
- [Warn] Test ratio should be in (0, 1)! Setting to 0.2 ...
- Splitting finished! Train data: 941,270 - Test data: 4,702.


(        user_id  item_id  rating  timestamp
 0             1     3186       4  978300019
 1             1     1270       5  978300055
 2             1     1721       4  978300055
 3             1     1022       5  978300055
 4             1     2340       3  978300103
 ...         ...      ...     ...        ...
 941265     6040      232       5  997454398
 941266     6040     2917       4  997454429
 941267     6040     1921       4  997454464
 941268     6040     1784       3  997454464
 941269     6040      161       3  997454486
 
 [941270 rows x 4 columns],
          user_id  item_id  rating  timestamp
 25             1       48       5  978824351
 136            2     1917       3  978300174
 232            3     2081       4  978298504
 258            5      288       2  978246585
 475            6      597       5  978239019
 ...          ...      ...     ...        ...
 998435      6035     3146       1  956713640
 999251      6036     2643       1  956755196
 999684      603
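
The warning in the log suggests that an out-of-range ratio is replaced by a default of 0.2 instead of raising an error. A fallback of that kind could look like the sketch below; the helper `resolve_test_ratio` is hypothetical and only mirrors the logged message, not the framework's actual code.

```python
def resolve_test_ratio(value, default: float = 0.2) -> float:
    """Hypothetical helper: fall back to a default when the ratio is outside (0, 1)."""
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if is_number and 0 < value < 1:
        return float(value)
    print(f"- [Warn] Test ratio should be in (0, 1)! Setting to {default} ...")
    return default


ratio_ok = resolve_test_ratio(0.3)
ratio_fixed = resolve_test_ratio(1.5)
print(ratio_ok, ratio_fixed)
```

A silent fallback keeps the notebook running, but it also means the split that is produced may not match the ratio written in the config, so it is worth checking the log before comparing results.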