microsoft/RServer-for-HDInsight-example-CriteoDataSet

RServer-for-HDInsight-example-CriteoDataSet

This repo contains a walkthrough of how to use RServer for HDInsight with large data sets like Criteo.

Running Instructions

Running the analysis on the Criteo data for day 14 through day 23 (420 GB) took about 10 hours on my cluster. To test your cluster and the programs first, use a subset of the data, e.g., day 14 alone (46 GB).
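As a rough planning aid, the figures above can be turned into a run-time estimate for a subset. This is only a sketch: it assumes run time scales linearly with input size on a cluster like the one described below, which ignores fixed startup costs. The helper name `estimate_hours` is made up for illustration.

```r
# Figures reported above for the full run.
full_size_gb <- 420   # day 14 through day 23
full_hours   <- 10    # approximate wall-clock time on the author's cluster

# Observed throughput. Assumption: run time scales linearly with input size.
throughput_gb_per_hour <- full_size_gb / full_hours

# Hypothetical helper: estimated hours for a subset of a given size in GB.
estimate_hours <- function(size_gb) size_gb / throughput_gb_per_hour

subset_hours <- estimate_hours(46)   # day 14 only
print(round(subset_hours, 1))        # prints 1.1
```

Under that assumption, a test run on day 14 alone would take on the order of an hour. Treat the number as a ballpark, not a guarantee.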

Deploy an HDInsight cluster

More information about how to deploy R Server for HDInsight can be found at the documentation site. Installing RStudio on the cluster, by following the instructions, is also recommended. The cluster I deployed was configured as follows:

Type          Cores  RAM (GB)  Nodes  Pricing Tier
Head Nodes       32       224      2  D14
Worker Nodes    960     6,720     60  D14

Get the Criteo data

Information on the data can be found at "Now Available on Azure ML – Criteo's 1TB Click Prediction Dataset". After downloading and extracting the data for day 14 through day 23, upload the files to a folder on your HDInsight cluster using a tool such as AzCopy.

Get the summary data

The summary data can be downloaded from an Azure blob. The summary covers the full 1 TB data set and includes frequency counts for the categorical variables and means for the integer variables. After downloading and extracting the summary data, upload it to your HDInsight cluster using a tool such as AzCopy.
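To make the two kinds of summary concrete, here is a small base R sketch on toy data. The data frame, its column names (`I1`, `I2`, `C1`), and the choice to ignore missing values are assumptions for illustration only; the format of the downloadable summary files is not described here.

```r
# Toy stand-in for a few rows of click data; column names are hypothetical.
toy <- data.frame(
  I1 = c(1, 5, NA, 2),          # integer variable with a missing value
  I2 = c(10, 20, 30, 40),       # integer variable
  C1 = c("a", "b", "a", "a"),   # categorical variable
  stringsAsFactors = FALSE
)

# Means for the integer variables (assumption: missing values are ignored).
int_means <- colMeans(toy[, c("I1", "I2")], na.rm = TRUE)

# Frequency counts for the categorical variable, most frequent level first.
cat_counts <- sort(table(toy$C1), decreasing = TRUE)

print(int_means)
print(cat_counts)
```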

Update the programs

SetComputeContext.R

  • Enter the nodename of your cluster and update the WASB address.
  • Replace the value of dataDir with the path where the data is saved. For example, I saved all data for my project in the folder "/lixun/CriteoAzure", so I assigned this path to dataDir.
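The two updates above might look roughly like the following. This is a sketch, not the contents of SetComputeContext.R: the variable names `myNameNode`, `myPort`, `hdfsFS`, and `sparkCC` are hypothetical, the WASB address is a placeholder, and using a Spark compute context (rather than, say, `RxHadoopMR`) is an assumption. The RevoScaleR calls run only where that package is installed.

```r
# Hypothetical values; replace with your own container and storage account.
myNameNode <- "wasb://<container>@<storageaccount>.blob.core.windows.net"
myPort     <- 0
dataDir    <- "/lixun/CriteoAzure"   # the author's folder; use your own path

# RevoScaleR ships with R Server; skip the cluster calls when it is absent.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)
  # File system object pointing at the cluster's storage.
  hdfsFS <- RxHdfsFileSystem(hostName = myNameNode, port = myPort)
  # Assumption: a Spark compute context; the repo's script may differ.
  sparkCC <- RxSpark(nameNode = myNameNode, consoleOutput = TRUE)
  rxSetComputeContext(sparkCC)
}
```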

CriteoMain.R

  • Update the paths to the raw Criteo data as well as summaries of categorical and integer variables.
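One way to keep these paths consistent is to build them all from dataDir. The folder layout (`raw`, `summary`) and the file names (`day_14` through `day_23`) below are assumptions for illustration; use whatever names you chose when uploading.

```r
dataDir <- "/lixun/CriteoAzure"   # the author's folder; use your own path

# Hypothetical layout: one raw file per day under "raw".
days     <- 14:23
rawPaths <- file.path(dataDir, "raw", paste0("day_", days))

# Hypothetical locations for the two summaries.
catSummaryPath <- file.path(dataDir, "summary", "categorical")
intSummaryPath <- file.path(dataDir, "summary", "integer")

# For a quick test run, keep only day 14.
testPaths <- rawPaths[days == 14]

print(rawPaths)
```

Switching between the full run and the day 14 test then means changing only which vector of paths is passed on.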

CriteoMainCall.R

  • Change the working directory to the folder where the programs are saved.

Run CriteoMainCall.R

For example, you can run the program from RStudio installed on the HDInsight cluster.
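From the RStudio console, starting the run could look like the sketch below. The folder `~/CriteoExample` is a hypothetical location for the scripts; the existence check is there only so the snippet fails gently if the path is wrong.

```r
# Hypothetical folder; replace with wherever you saved the programs.
programDir <- "~/CriteoExample"
mainScript <- file.path(programDir, "CriteoMainCall.R")

if (file.exists(mainScript)) {
  oldDir <- setwd(programDir)               # run from the program folder
  source("CriteoMainCall.R", echo = TRUE)   # echo each statement as it runs
  setwd(oldDir)                             # restore the previous directory
} else {
  message("Script not found: ", mainScript)
}
```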


This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
