
Surprise on multiple machines #373

Open

steventartakovsky-vungle opened this issue Nov 11, 2020 · 2 comments

Comments

@steventartakovsky-vungle

Where is the documentation on the dataset size limitations, and how do I scale Surprise to multiple machines?

Thanks - Steven

@DiegoCorrea

DiegoCorrea commented Dec 21, 2020

Nicolas, thanks for the Surprise lib.

I have a question that is likely related to the one above.
I have 21 to 51 nodes with 24 cores each.

I'm trying to use the Dask library to run in parallel across multiple nodes.
Here is what I want to do:

  1. For example, I want to grid search NMF (biased=True) with cv=3.
  2. If I give each of the 8 parameters 3 options (e.g. "n_factors": [50, 100, 150]), there are 3^8 = 6,561 candidate combinations; with 3 folds that is 6,561 * 3 = 19,683 fits in total (see the sketch after this list).
  3. I can allocate at least 21 nodes at the same time, with 24 cores each, so 21 * 24 = 504 fits can run concurrently.
  4. So I want to distribute the work 504 fits at a time across the 21 nodes, one fit per core.
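
To make the arithmetic in step 2 concrete, here is a minimal sketch; only "n_factors" comes from the grid above, and the other seven parameter names are Surprise NMF options filled in purely for illustration:

```python
# Counting grid-search work: 8 parameters x 3 options each, cv=3.
# Only "n_factors" is from the original comment; the rest are
# illustrative NMF parameter names with made-up value lists.
from math import prod

param_grid = {
    "n_factors": [50, 100, 150],
    "n_epochs": [25, 50, 75],
    "reg_pu": [0.02, 0.06, 0.1],
    "reg_qi": [0.02, 0.06, 0.1],
    "reg_bu": [0.005, 0.02, 0.05],
    "reg_bi": [0.005, 0.02, 0.05],
    "lr_bu": [0.002, 0.005, 0.01],
    "lr_bi": [0.002, 0.005, 0.01],
}

n_candidates = prod(len(v) for v in param_grid.values())  # 3**8 = 6561
n_fits = n_candidates * 3                                 # cv=3 -> 19683 fits
print(n_candidates, n_fits)
```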

Note: I am submitting this as a Slurm job.

```python
import joblib
import dask.distributed
from dask_jobqueue import SLURMCluster
from surprise.model_selection import GridSearchCV

N_CORES = 24  # cores per node on our cluster


def grid_search_instance(instance, params, dataset, measures, folds, label, n_jobs=N_CORES):
    """
    Grid search cross-validation to get the best params for the recommender algorithm.

    :param label: Recommender string name
    :param instance: Recommender algorithm class (e.g. NMF), not an instantiated object
    :param params: Recommender algorithm param grid
    :param dataset: A dataset loaded through the Surprise Reader class
    :param measures: A list of measure names (e.g. ["rmse"])
    :param folds: Number of cross-validation folds
    :param n_jobs: Number of parallel jobs
    :return: A fitted GridSearchCV instance
    """
    cluster = SLURMCluster(cores=24,
                           processes=2,
                           memory='64GB',
                           queue="nvidia_dev",
                           project="NMF",
                           name=label,
                           log_directory='logs/slurm',
                           walltime='00:15:00')
    # cluster.scale(2)
    cluster.adapt(minimum=1, maximum=360)
    client = dask.distributed.Client(cluster)
    print(client)
    print(cluster.job_script())
    gs = GridSearchCV(instance, params, measures=measures, cv=folds,
                      n_jobs=n_jobs, joblib_verbose=100)
    with joblib.parallel_backend("dask"):
        print(client)
        gs.fit(dataset)
    return gs
```
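
A hypothetical call to the function above; the grid and dataset are illustrative (the built-in ml-100k set), and it assumes a Slurm cluster where the queue and project named in the SLURMCluster call actually exist:

```python
# Hypothetical usage sketch for grid_search_instance; note that the
# first argument must be the algorithm class (NMF), not an instance.
from surprise import NMF, Dataset

data = Dataset.load_builtin("ml-100k")
params = {"n_factors": [50, 100, 150], "n_epochs": [25, 50]}
gs = grid_search_instance(NMF, params, data,
                          measures=["rmse", "mae"], folds=3, label="NMF")
print(gs.best_score["rmse"], gs.best_params["rmse"])
```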

And again, thanks for Surprise and for spending your time reading this question.

@NicolasHug
Owner

Hi all, sorry for the late reply. Surprise doesn't support multi-node training.
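
For what it's worth, Surprise's GridSearchCV does parallelize across the cores of a single machine through its joblib-backed n_jobs argument, even though multi-node training isn't supported. A minimal single-node sketch, assuming the built-in ml-100k dataset:

```python
# Single-machine parallel grid search; n_jobs=-1 uses all local cores.
from surprise import NMF, Dataset
from surprise.model_selection import GridSearchCV

data = Dataset.load_builtin("ml-100k")
params = {"n_factors": [50, 100, 150], "biased": [True]}
gs = GridSearchCV(NMF, params, measures=["rmse"], cv=3, n_jobs=-1)
gs.fit(data)
print(gs.best_params["rmse"])
```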
