This notebook compares several implementations of Factorization Machines (FM) and Field-aware Factorization Machines (FFM): LibFM, LibFFM, xLearn and River.

# I - Factorization Machines

The dataset used here is [MovieLens 100K](https://grouplens.org/datasets/movielens/).
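
For orientation, here is a minimal sketch of what a second-order factorization machine computes for one sample: a bias, a linear term, and a pairwise term where each pair of features interacts through the dot product of their latent vectors. It is written in plain Python for illustration only and is not the code of any library compared below. The function name `fm_predict`, the feature names and all the weights are hypothetical.

```python
def fm_predict(x, w0, w, v):
    """Second-order FM score for one sparse sample.

    x:  dict feature -> value (non-zero entries only)
    w0: global bias
    w:  dict feature -> linear weight
    v:  dict feature -> list of k latent factors
    """
    linear = sum(w.get(j, 0.0) * xj for j, xj in x.items())

    # Pairwise term via the usual identity:
    # sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * sum_f ((sum_j v_jf x_j)^2 - sum_j (v_jf x_j)^2)
    k = len(next(iter(v.values()))) if v else 0
    pairwise = 0.0
    for f in range(k):
        terms = [v[j][f] * xj for j, xj in x.items() if j in v]
        pairwise += 0.5 * (sum(terms) ** 2 - sum(t * t for t in terms))

    return w0 + linear + pairwise


# Hypothetical toy sample: one user and one item, both one-hot encoded
x = {'user_259': 1.0, 'item_255': 1.0}
w = {'user_259': 0.1, 'item_255': -0.2}
v = {'user_259': [1.0, 2.0], 'item_255': [0.5, -1.0]}

score = fm_predict(x, w0=3.5, w=w, v=v)
print(round(score, 4))
```

With only two active features the pairwise term reduces to a single dot product between the user and item latent vectors, which is why FMs behave like matrix factorization on this kind of data.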

In [1]:
%load_ext watermark
%watermark --python --machine --packages river,numpy,pandas,sklearn,xlearn --datename

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.22.0

river  : 0.7.0
numpy  : 1.20.2
pandas : 1.2.4
sklearn: 0.24.1
xlearn : 0.4.4

Compiler    : Clang 10.0.0 
OS          : Darwin
Release     : 20.3.0
Machine     : x86_64
Processor   : i386
CPU cores   : 8
Architecture: 64bit



## LibFM

Download and uncompress [`libfm`](http://www.libfm.org/) into the working directory.

In [2]:
import os
import shutil
import tarfile
import urllib.request

archive = 'libfm.tar.gz'
with urllib.request.urlopen('http://www.libfm.org/libfm-1.42.src.tar.gz') as r, open(archive, 'wb') as f:
    shutil.copyfileobj(r, f)

with tarfile.open(archive, 'r:gz') as tar:
    tar.extractall('.')

os.remove(archive)

libfm_dir = 'libfm-1.42.src'

Compile the tools.

In [3]:
%%bash -s "$libfm_dir"
cd $1
make all

cd src/libfm; make all
g++ -O3 -Wall -c libfm.cpp -o libfm.o
g++ -O3 -Wall libfm.o -o ../../bin/libFM
g++ -O3 -Wall -c tools/transpose.cpp -o tools/transpose.o
g++ -O3 tools/transpose.o -o ../../bin/transpose
g++ -O3 -Wall -c tools/convert.cpp -o tools/convert.o
g++ -O3 tools/convert.o -o ../../bin/convert


Let's convert our dataset to the [`libfm`](http://www.libfm.org/) format.

In [4]:
import pandas as pd

from river import datasets

def merge_X_y(x, y):
    x['y'] = y
    return x
    
ml_100k = [merge_X_y(x, y) for x, y in datasets.MovieLens100K()]
ml_100k = pd.DataFrame(ml_100k)
ml_100k = ml_100k[['user', 'item', 'gender', 'occupation', 'y']]

ml_100k.head()

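|     | user | item | gender | occupation | y   |
|----:|:-----|:-----|:-------|:-----------|:----|
| 0   | 259  | 255  | M      | student    | 4.0 |
| 1   | 259  | 286  | M      | student    | 4.0 |
| 2   | 259  | 298  | M      | student    | 4.0 |
| 3   | 259  | 185  | M      | student    | 4.0 |
| 4   | 259  | 173  | M      | student    | 4.0 |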


Perform an 80/20 train/test split and one-hot encode the categorical features.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X_train, X_test, y_train, y_test = train_test_split(ml_100k.drop(columns='y'), ml_100k[['y']], test_size=0.2, random_state=17)

ohe = OneHotEncoder(handle_unknown='ignore')

X_train = ohe.fit_transform(X_train)
X_test = ohe.transform(X_test)

Save the data to text files ready to use with [`libfm`](http://www.libfm.org/).

In [6]:
import numpy as np

from sklearn.datasets import dump_svmlight_file

train_file, test_file = 'libfm_train.txt', 'libfm_test.txt'

with open(train_file, 'wb') as f:
    dump_svmlight_file(X_train, y_train.values.squeeze(), f=f)
    
with open(test_file, 'wb') as f:
    dump_svmlight_file(X_test, np.zeros(len(y_test)), f=f)

pred_file = 'libfm_pred.txt'
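
To make the file layout concrete, the sketch below formats one sample by hand as a `target index:value index:value ...` line, which is the sparse text layout written above. The helper `to_libfm_line` and the column indices are hypothetical, and the exact number formatting produced by `dump_svmlight_file` may differ from this illustration.

```python
def to_libfm_line(target, features):
    """Format one sample as `target index:value index:value ...`.

    features: dict column index -> non-zero value
    """
    pairs = ' '.join(f'{i}:{v:g}' for i, v in sorted(features.items()))
    return f'{target:g} {pairs}'


# Hypothetical row: after one-hot encoding, each rating has four active
# columns (user, item, gender, occupation), all with value 1
line = to_libfm_line(4.0, {258: 1.0, 1197: 1.0, 2600: 1.0, 2619: 1.0})
print(line)
```

Only the non-zero entries are stored, so each line stays short even though the encoded matrix has thousands of columns.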

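Use [`libfm`](http://www.libfm.org/) to train a regression model (`-task r`) and predict the test set. The flags request a single SGD pass (`-method sgd -iter 1`) with a learning rate of 0.01, a bias term, linear weights and 10 latent factors (`-dim '1,1,10'`), latent factors initialized with a standard deviation of 0.1, and no regularization (`-regular '0,0,0'`).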

In [7]:
%%bash -s "$libfm_dir" "$train_file" "$test_file" "$pred_file"
cd $1
./bin/libFM -task r -dim '1,1,10' -method sgd -iter 1 -learn_rate 0.01 -init_stdev 0.1 -regular '0,0,0' -train ../$2 -test ../$3 -out ../$4

----------------------------------------------------------------------------
libFM
  Version: 1.4.2
  Author:  Steffen Rendle, srendle@libfm.org
  WWW:     http://www.libfm.org/
This program comes with ABSOLUTELY NO WARRANTY; for details see license.txt.
This is free software, and you are welcome to redistribute it under certain
conditions; for details see license.txt.
----------------------------------------------------------------------------
Loading train...	
has x = 1
has xt = 0
num_rows=80000	num_values=320000	num_features=2622	min_target=1	max_target=5
Loading test... 	
has x = 1
has xt = 0
num_rows=20000	num_values=79973	num_features=2622	min_target=0	max_target=0
#relations: 0
Loading meta data...	
learnrate=0.01
learnrates=0.01,0.01,0.01
#iterations=1
SGD: DON'T FORGET TO SHUFFLE THE ROWS IN TRAINING DATA TO GET THE BEST RESULTS.
#Iter=  0	Train=0.954638	Test=3.69095
Final	Train=0.954638	Test=3.69095


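Load [`libfm`](http://www.libfm.org/) predictions into memory and compute MAE and RMSE scores. The `Test` value printed by libFM above (3.69095) is not meaningful, because the test file was written with placeholder targets of zero, so the scores are computed here against the true ratings.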

In [8]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

libfm_pred = pd.read_csv(pred_file, names=['y'])

print(f'LibFM MAE: {mean_absolute_error(y_test, libfm_pred):.4f}')
print(f'LibFM RMSE: {mean_squared_error(y_test, libfm_pred) ** 0.5:.4f}')

LibFM MAE: 0.7619
LibFM RMSE: 0.9701
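
As a reminder of what these two scores measure, here is a small plain-Python sketch of both formulas on made-up numbers. The helpers `mae` and `rmse` are illustrative stand-ins and are not meant to replace the scikit-learn functions used above.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average of |y - y_hat|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average of (y - y_hat)^2."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up ratings and predictions
toy_true = [4.0, 3.0, 5.0, 1.0]
toy_pred = [3.5, 3.0, 4.0, 2.0]

print(f'MAE: {mae(toy_true, toy_pred):.4f}')
print(f'RMSE: {rmse(toy_true, toy_pred):.4f}')
```

RMSE penalizes large errors more than MAE does, which is why the two scores can rank models differently.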


Clean up the working directory.

In [9]:
to_remove = [train_file, test_file, pred_file]

for path in to_remove:
    os.remove(path)

shutil.rmtree(libfm_dir)

## River

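Let's do the same thing with [`river`](https://online-ml.github.io/index.html) now, using the same split and matching hyperparameters: 10 latent factors, SGD with a learning rate of 0.01, and no regularization.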

In [10]:
from river import facto
from river import meta
from river import optim
from river import stream

X_train, X_test, y_train, y_test = train_test_split(ml_100k.drop(columns='y'),
                                                    ml_100k[['y']],
                                                    test_size=0.2,
                                                    random_state=17)

fm_params = {
    'n_factors': 10,
    'weight_optimizer': optim.SGD(0.01),
    'latent_optimizer': optim.SGD(0.01),
    'l1_weight': 0.,
    'l2_weight': 0.,
    'l1_latent': 0.,
    'l2_latent': 0.,
    'intercept': 0.,
    'intercept_lr': 0.01,
    'weight_initializer': optim.initializers.Zeros(),
    'latent_initializer': optim.initializers.Normal(mu=0., sigma=0.1, seed=85),
}

model = meta.PredClipper(
    regressor=facto.FMRegressor(**fm_params),
    y_min=1,
    y_max=5
)

for x, y in stream.iter_pandas(X_train, y_train):
    model.learn_one(x, y['y'])

river_pred = [model.predict_one(x) for x, _ in stream.iter_pandas(X_test)]

print(f'River MAE: {mean_absolute_error(y_test, river_pred):.4f}')
print(f'River RMSE: {mean_squared_error(y_test, river_pred) ** 0.5:.4f}')

River MAE: 0.7598
River RMSE: 0.9727
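
The model above is wrapped in `meta.PredClipper` so that predictions stay within the valid rating range of 1 to 5. The idea can be sketched as follows. This is an illustration of the clipping step only, not River's implementation, and both `ClippedRegressor` and `ConstantRegressor` are hypothetical stand-ins.

```python
class ClippedRegressor:
    """Wrap a regressor and clamp its predictions to [y_min, y_max]."""

    def __init__(self, regressor, y_min, y_max):
        self.regressor = regressor
        self.y_min = y_min
        self.y_max = y_max

    def learn_one(self, x, y):
        # Learning is delegated unchanged to the wrapped model
        self.regressor.learn_one(x, y)
        return self

    def predict_one(self, x):
        y_pred = self.regressor.predict_one(x)
        return min(max(y_pred, self.y_min), self.y_max)


class ConstantRegressor:
    """Hypothetical stand-in model that always predicts the same value."""

    def __init__(self, value):
        self.value = value

    def learn_one(self, x, y):
        return self

    def predict_one(self, x):
        return self.value


too_high = ClippedRegressor(ConstantRegressor(5.7), y_min=1, y_max=5)
too_low = ClippedRegressor(ConstantRegressor(0.2), y_min=1, y_max=5)
in_range = ClippedRegressor(ConstantRegressor(3.3), y_min=1, y_max=5)

print(too_high.predict_one({}), too_low.predict_one({}), in_range.predict_one({}))
```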


## Results

| FM - MovieLens100K | MAE      | RMSE     |
|:-------------------|:--------:|:--------:|
| LibFM              |  0.7619  |  0.9701  |
| River              |  0.7598  |  0.9727  |

# II - Field-aware Factorization Machines

The dataset used here is a 1% subsample of [Criteo's challenge](https://www.kaggle.com/c/criteo-display-ad-challenge) data built by [`libffm`](https://github.com/ycjuan/libffm). Click [here](https://drive.google.com/uc?export=download&confirm=v1vT&id=1HZX7zSQJy26hY4_PxSlOWz4x7O-tbQjt) to download it from their Google Drive, then move it into the working directory. Let's uncompress it now.

In [11]:
import zipfile

archive = 'libffm_toy.zip'
with zipfile.ZipFile(archive, 'r') as zf:
    zf.extractall('.')

os.remove(archive)
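
Before moving on to the libraries, here is a minimal sketch of what makes a factorization machine field-aware: each feature keeps one latent vector per field, and when two features interact, each one uses the vector it holds for the other feature's field. The code covers only this pairwise term, in plain Python, and does not reproduce any of the libraries below. The function name `ffm_pairwise`, the feature names, the field names and the weights are all hypothetical.

```python
from itertools import combinations


def ffm_pairwise(x, fields, v):
    """Field-aware pairwise interaction term for one sparse sample.

    x:      dict feature -> value (non-zero entries only)
    fields: dict feature -> field the feature belongs to
    v:      dict (feature, field) -> list of k latent factors
    """
    total = 0.0
    for j1, j2 in combinations(x, 2):
        # j1 uses its vector for j2's field, and vice versa
        v1 = v[(j1, fields[j2])]
        v2 = v[(j2, fields[j1])]
        total += sum(a * b for a, b in zip(v1, v2)) * x[j1] * x[j2]
    return total


# Hypothetical toy sample with two features from two different fields
x = {'a': 1.0, 'b': 2.0}
fields = {'a': 'F1', 'b': 'F2'}
v = {
    ('a', 'F2'): [1.0, 0.5],
    ('b', 'F1'): [2.0, -1.0],
}

interaction = ffm_pairwise(x, fields, v)
print(interaction)
```

Compared with a plain FM, the number of latent vectors grows with the number of fields, which is the price paid for the extra flexibility.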

## LibFFM

Download and uncompress [`libffm`](https://github.com/ycjuan/libffm) into the working directory.

In [12]:
archive = 'libffm.zip'
with urllib.request.urlopen('https://github.com/ycjuan/libffm/archive/master.zip') as r, open(archive, 'wb') as f:
    shutil.copyfileobj(r, f)

with zipfile.ZipFile(archive, 'r') as zf:
    zf.extractall('.')

os.remove(archive)

libffm_dir = 'libffm-master'

Compile the tools.

In [13]:
%%bash -s "$libffm_dir"
cd $1
make

g++ -Wall -O3 -std=c++0x -march=native -DUSESSE -c -o timer.o timer.cpp
g++ -Wall -O3 -std=c++0x -march=native -DUSESSE -c -o ffm.o ffm.cpp
g++ -Wall -O3 -std=c++0x -march=native -DUSESSE -o ffm-train ffm-train.cpp ffm.o timer.o
g++ -Wall -O3 -std=c++0x -march=native -DUSESSE -o ffm-predict ffm-predict.cpp ffm.o timer.o


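Use [`libffm`](https://github.com/ycjuan/libffm) to train a model and predict the test set. The flags set no regularization (`-l 0.0`), 10 latent factors (`-k 10`), a single iteration (`-t 1`), a learning rate of 0.01 (`-r 0.01`) and 4 threads (`-s 4`).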

In [14]:
train_file = 'libffm_toy/criteo.tr.r100.gbdt0.ffm'
test_file = 'libffm_toy/criteo.va.r100.gbdt0.ffm'
model_file = 'libffm_model'
pred_file = 'libffm_pred'

In [15]:
%%bash -s "$train_file" "$test_file" "$model_file" "$pred_file"
cd libffm-master
./ffm-train -l 0.0 -k 10 -t 1 -r 0.01 -s 4 ../$1 ../$3
./ffm-predict ../$2 ../$3 ../$4

First check if the text file has already been converted to binary format (0.0 seconds)
Binary file NOT found. Convert text file to binary file (2.0 seconds)
iter   tr_logloss      tr_time
   1      0.61859          9.1
logloss = 0.52888


Load [`libffm`](https://github.com/ycjuan/libffm) predictions into memory and compute Accuracy, Log loss and ROC AUC scores.

In [16]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss
from sklearn.metrics import roc_auc_score

y_test = pd.read_csv(test_file, sep=' ', names=['y_true'] + [i for i in range(39)], usecols=['y_true'])
libffm_pred = pd.read_csv(pred_file, names=['y_hat'])

print(f'LibFFM Accuracy: {accuracy_score(y_test, libffm_pred > .5):.4f}')
print(f'LibFFM Log loss: {log_loss(y_test, libffm_pred):.4f}')
print(f'LibFFM ROC AUC: {roc_auc_score(y_test, libffm_pred):.4f}')

LibFFM Accuracy: 0.7485
LibFFM Log loss: 0.5289
LibFFM ROC AUC: 0.6910
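
The three scores answer different questions: accuracy looks at thresholded decisions, log loss at the quality of the probabilities, and ROC AUC at how well positives are ranked above negatives. The sketch below spells out each one in plain Python on made-up values. The `toy_*` helpers are hypothetical illustrations, and the pairwise AUC shown here is quadratic in the number of samples, so it is only suitable for tiny examples.

```python
import math


def toy_accuracy(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded probability matches the label."""
    hits = sum((p > threshold) == bool(t) for t, p in zip(y_true, y_prob))
    return hits / len(y_true)


def toy_log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= math.log(p) if t else math.log(1 - p)
    return total / len(y_true)


def toy_roc_auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for t, p in zip(y_true, y_prob) if t]
    neg = [p for t, p in zip(y_true, y_prob) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted click probabilities
toy_true = [1, 0, 1, 0]
toy_prob = [0.9, 0.2, 0.4, 0.6]

print(f'Accuracy: {toy_accuracy(toy_true, toy_prob):.4f}')
print(f'Log loss: {toy_log_loss(toy_true, toy_prob):.4f}')
print(f'ROC AUC: {toy_roc_auc(toy_true, toy_prob):.4f}')
```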


Clean up the working directory.

In [17]:
os.remove(model_file)
os.remove(pred_file)

shutil.rmtree(libffm_dir)

## xLearn

Use [`xlearn`](https://xlearn-doc.readthedocs.io/en/latest/index.html) to train a model and predict the test set.

In [18]:
import xlearn as xl

xlearn_model = xl.create_ffm()
xlearn_model.setSigmoid()
xlearn_model.setTrain(train_file)
xlearn_model.disableNorm() # Disable instance-wise normalization

xlearn_params = {
    'task': 'binary',
    'k': 10,
    'epoch': 1,
    'opt': 'sgd',
    'lr': 0.01,
    'lambda': 0.0,
    'nthread': 4
}

model_file = 'xlearn_model'
pred_file = 'xlearn_pred'

xlearn_model.fit(xlearn_params, model_file)

xlearn_model.setTest(test_file)
xlearn_model.predict(model_file, pred_file)

Load [`xlearn`](https://xlearn-doc.readthedocs.io/en/latest/index.html) predictions into memory and compute Accuracy, Log loss and ROC AUC scores.

In [19]:
xlearn_pred = pd.read_csv(pred_file, names=['y_hat'])

print(f'xLearn Accuracy: {accuracy_score(y_test, xlearn_pred > .5):.4f}')
print(f'xLearn Log loss: {log_loss(y_test, xlearn_pred):.4f}')
print(f'xLearn ROC AUC: {roc_auc_score(y_test, xlearn_pred):.4f}')

xLearn Accuracy: 0.7642
xLearn Log loss: 0.5247
xLearn ROC AUC: 0.7401


Clean up the working directory.

In [20]:
os.remove(model_file)
os.remove(pred_file)

## River

Format the data to be compatible with [`river`](https://online-ml.github.io/index.html).

In [21]:
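def load_criteo_data(filepath):
    # Each row is `label field:feature:value ...` with 39 fields (libffm text format)
    X = pd.read_csv(filepath, sep=' ', names=['y'] + [str(i) for i in range(39)])
    y = X[['y']].copy()
    # Keep only the feature index of each `field:feature:value` token, as a string
    X = X.drop(columns='y').applymap(lambda x: x.split(':')[1])
    return X, y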

X_train, y_train = load_criteo_data(train_file)
X_test, y_test = load_criteo_data(test_file)
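
To show what this conversion does to a single row, the sketch below parses one line by hand: the label comes first, then one `field:feature:value` token per field, and only the feature index is kept, keyed by its field. The helper `parse_ffm_line` and the sample line are made up for illustration, and the real files contain 39 tokens per row rather than three.

```python
def parse_ffm_line(line):
    """Turn `label field:feature:value ...` into (features dict, label)."""
    label, *tokens = line.split()
    x = {}
    for token in tokens:
        field, feature, _value = token.split(':')
        # Keep the feature index as a string so it is treated as categorical
        x[field] = feature
    return x, int(label)


# Made-up row with three fields
x, y = parse_ffm_line('1 0:194689:1 1:219069:1 2:319217:1')
print(x, y)
```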

Use [`river`](https://online-ml.github.io/index.html) to train a model and predict the test set.

In [22]:
ffm_params = {
    'n_factors': 10,
    'weight_optimizer': optim.SGD(0.01),
    'latent_optimizer': optim.SGD(0.01),
    'l1_weight': 0.,
    'l2_weight': 0.,
    'l1_latent': 0.,
    'l2_latent': 0.,
    'intercept': 0.,
    'intercept_lr': 0.01,
    'weight_initializer': optim.initializers.Zeros(),
    'latent_initializer': optim.initializers.Normal(mu=0., sigma=0.1, seed=85),
}

model = facto.FFMClassifier(**ffm_params)

for x, y in stream.iter_pandas(X_train, y_train):
    model.learn_one(x, y['y'])

river_pred = [model.predict_proba_one(x)[True] for x, _ in stream.iter_pandas(X_test)]
river_pred = pd.Series(river_pred)

print(f'River Accuracy: {accuracy_score(y_test, river_pred > .5):.4f}')
print(f'River Log loss: {log_loss(y_test, river_pred):.4f}')
print(f'River ROC AUC: {roc_auc_score(y_test, river_pred):.4f}')

River Accuracy: 0.7551
River Log loss: 0.5134
River ROC AUC: 0.7422


Clean up the working directory.

In [23]:
shutil.rmtree('libffm_toy')

## Results

| FFM - Criteo subsampled | Accuracy | Log loss | ROC AUC |
|:------------------------|:--------:|:--------:|:-------:|
| LibFFM                  |  0.7485  |  0.5289  |  0.6910 |
| xLearn                  |  0.7642  |  0.5247  |  0.7401 |
| River                   |  0.7551  |  0.5134  |  0.7422 |