Skip to content

Recommendation Engine powered by Matrix Factorization.

Notifications You must be signed in to change notification settings

SunilGolden/RecEngineMF

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

29 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RecEngineMF


Data

Command to Download and Extract Data

  1. Download MovieLens 20M Data

    wget --output-document=./ml-20m.zip  https://files.grouplens.org/datasets/movielens/ml-20m.zip
  2. Once the download is complete, extract the dataset

    unzip ml-20m.zip

Usage

Install Dependencies

  1. Install Python: Make sure Python is installed on your system. If not, you can download and install Python from the official Python website: https://www.python.org/downloads/

  2. Create a virtual environment:

    python -m venv myenv
  3. Activate the virtual environment

    For Windows CMD Users

    .\myenv\Scripts\Activate.bat

    For Windows Powershell Users

    .\myenv\Scripts\Activate.ps1

    For macOS/Linux Users

    source myenv/bin/activate
  4. Install the dependencies

    pip install -r requirements.txt
  5. Add wandb API key

    Sign in to https://wandb.ai and get your API key.
    Create a file secrets.json in the root directory and put your wandb API key.

    {
    	"WANDB_API_KEY": "YOUR_API_KEY"
    }

Train

python train.py --data_path DATA_PATH [--emb_size EMB_SIZE] [--random_seed RANDOM_SEED] 
                [--batch_size BATCH_SIZE] [--epochs EPOCHS] [--learning_rate LEARNING_RATE] 
                [--weight_decay WEIGHT_DECAY] [--step_size STEP_SIZE] [--gamma GAMMA] 
                [--patience PATIENCE] [--model_name MODEL_NAME] [--metrics_csv_name METRICS_CSV_NAME]
                [--silent] [--log_wandb]

Required Flag

  • --data_path: Path to the CSV file containing the ratings data.

Optional Flags

  • --emb_size: Size of the embedding for users and items. Default is 100.
  • --random_seed: Random seed for reproducibility. Default is 42.
  • --batch_size: Batch size for training. Default is 64000.
  • --epochs: Number of epochs for training. Default is 100.
  • --learning_rate: Learning rate for optimizer. Default is 0.001.
  • --weight_decay: Weight decay for optimizer. Default is 1e-5.
  • --step_size: Step size for learning rate scheduler. Default is 10.
  • --gamma: Gamma value for learning rate scheduler. Default is 0.1.
  • --patience: Patience for early stopping based on validation loss. Default is 3.
  • --model_name: Name of the trained model file to be saved. Default is 'mf_model.pth'.
  • --metrics_csv_name: Name of the CSV file to save the training metrics. Default is 'metrics.csv'.
  • --silent: Whether to hide verbose output during training.
  • --log_wandb: Whether to log metrics into weights and bias (wandb.ai).

Test

python test.py  --data_path DATA_PATH --model_path MODEL_PATH [--batch_size BATCH_SIZE] [--random_seed RANDOM_SEED]

Required Flags

  • --data_path: Path to the CSV file containing the ratings data.
  • --model_path: Path to the trained model file to be loaded for testing.

Optional Flags

  • --batch_size: Batch size for testing. Default is 64000.
  • --random_seed: Random seed for reproducibility. Default is 42.

Run Inference

python inference.py --data_path DATA_PATH --model_path MODEL_PATH --user_id USER_ID [--n_items N_ITEMS]

Required Flags

  • --data_path: Path to the CSV file containing the ratings data.
  • --model_path: Path to the trained model file to be loaded for testing.
  • --user_id: The id of the user for whom item is to be recommended.

Optional Flags

  • --n_items: The top n number of items to be recommended to the user. Default is 10.

Plot Curve

python plot.py --metrics_csv_path METRICS_CSV_PATH [--patience PATIENCE] [--file_name FILE_NAME]

Required Flags

  • --metrics_csv_path: Path to the CSV file containing the mertics data. [ CSV file with column names: 'Epoch', 'Train Loss', 'Val Loss' ]

Optional Flags

  • --patience: Patience for early stopping. Default is None.
  • --file_name: The name for saving the plot. Default is loss_curve.png.

References