This is the example from the front page, but with more realistic data.

In [1]:
import pandas as pd
import dias.rewriter
import urllib.request
import os

In [2]:
# Download the dataset. Source: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset
url = 'https://uofi.box.com/shared/static/5qi9jcuyn70k6t5z0e9208elqxr01qks.csv'
filename = 'tmdb_metadata.csv'
urllib.request.urlretrieve(url, filename)
assert os.path.isfile(filename)

In [3]:
df = pd.read_csv('tmdb_metadata.csv', low_memory=False)
# Replicate the dataset 50 times so that the timing difference is clearly visible.
df = pd.concat([df]*50, ignore_index=True)

In [4]:
m = 50
C = 5.6

## Original

A simple function of the kind that appears in recommender systems. This one comes from "Hands-On Recommendation Systems with Python" by Rounak Banik, Chapter "Building An IMDB Top 250 Clone With Pandas" > "The simple recommender".
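
For reference, the function in the next cell computes the following weighted rating. Here $v$ is `vote_count` and $R$ is `vote_average`. Reading $m$ as a minimum-vote threshold and $C$ as the mean vote over the whole dataset is an assumption based on the cited book; the notebook itself only sets them to the constants 50 and 5.6.

$$
W = \frac{v}{v+m}\,R + \frac{m}{v+m}\,C
$$

The two weights sum to 1, so $W$ always lies between $R$ and $C$: with $v = 0$ it equals $C$, and as $v$ grows it approaches $R$.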

In [5]:
%%time
# DIAS_DISABLE
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)
orig = df.apply(weighted_rating, axis=1)

CPU times: user 9.27 s, sys: 560 ms, total: 9.83 s
Wall time: 9.83 s


## With Dias

Dias rewrites the code so that it runs about **634x faster** in wall-clock time (15.5 ms versus 9.83 s).

In [6]:
%%time
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)
rewr = df.apply(weighted_rating, axis=1)

CPU times: user 89.8 ms, sys: 2.8 ms, total: 92.6 ms
Wall time: 15.5 ms
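
The notebook does not show the code that Dias generates. As a rough illustration of why avoiding a row-wise `apply` can help, the sketch below evaluates the same formula with whole-column arithmetic on a small made-up DataFrame. The helper `weighted_rating_vectorized` and the sample values are hypothetical and are not Dias's actual output.

```python
import pandas as pd

m = 50
C = 5.6

def weighted_rating(x, m=m, C=C):
    # Row-wise version from the notebook: x is a single row.
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

def weighted_rating_vectorized(df, m=m, C=C):
    # Hypothetical helper: the same formula applied to entire columns,
    # so the arithmetic runs once over arrays instead of once per row.
    v = df['vote_count']
    R = df['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

# Made-up sample data, for illustration only.
sample = pd.DataFrame({
    'vote_count': [10.0, 50.0, 1000.0],
    'vote_average': [8.0, 6.0, 7.5],
})

row_wise = sample.apply(weighted_rating, axis=1)
vectorized = weighted_rating_vectorized(sample)

# Both approaches should agree up to floating-point rounding.
print((row_wise - vectorized).abs().max())
```

Because the function body only uses arithmetic on two columns, it can be evaluated on the columns directly without changing the result.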


## Correctness check

We drop the NaNs before comparing because NaN never compares equal to anything, including another NaN, so any NaN entry would make the check fail.

In [7]:
assert (orig.dropna() == rewr.dropna()).all()
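
A minimal sketch of the NaN behavior behind this check, using two small made-up Series rather than the notebook's `orig` and `rewr`. It also shows `Series.equals`, which treats NaNs in the same position as equal and could serve as an alternative check.

```python
import numpy as np
import pandas as pd

# Made-up data: two Series with identical contents, including a NaN.
a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([1.0, np.nan, 3.0])

# NaN == NaN is False, so the element-wise check fails
# even though the two Series hold the same values.
naive_equal = (a == b).all()

# Dropping NaNs first compares only the defined values.
dropna_equal = (a.dropna() == b.dropna()).all()

# Series.equals considers NaNs in the same location to be equal.
equals_result = a.equals(b)

print(naive_equal, dropna_equal, equals_result)
```

The `dropna` approach compares only the non-NaN entries, while `equals` additionally requires the NaNs to sit at the same positions.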