
MLR

1. Algorithm Introduction

Model

MLR is a piece-wise linear model that is widely used in advertising CTR estimation. MLR adopts a divide-and-conquer strategy: it first divides the feature space into multiple local regions, then fits a linear model in each region, and the output is the weighted sum of the linear models. Both steps are learned jointly, with the goal of minimizing the overall loss function. For details, see the Large Scale Piece-wise Linear Model (LS-PLM). The MLR algorithm has three distinct advantages:

  1. Nonlinearity: with enough partitions, the MLR algorithm can fit arbitrarily complex nonlinear functions.
  2. Scalability: similar to the LR algorithm, the MLR algorithm scales well to massive samples and ultra-high dimensional models.
  3. Sparsity: the MLR algorithm achieves good sparsity with the \(L_1\) and \(L_{2,1}\) regular terms.

Formula

$$ p(y=1|x) = g\left(\sum_{j=1}^{m} \sigma(u_j^T x)\,\eta(w_j^T x)\right) $$

Note: \(\sigma(u_j^T x)\) is the partition function and \(\eta(w_j^T x)\) is the fitting function, with parameters \(u_j\) and \(w_j\). For a given sample \(x\), the prediction function has two parts: the first part divides the feature space into \(m\) regions, and the second part gives the predicted value for each region. The function \(g(\cdot)\) ensures that the model satisfies the definition of a probability function.

The MLR algorithm uses softmax as the partition function \(\sigma\), the sigmoid function as the fitting function \(\eta\), and \(g(z) = z\), which gives the MLR model as follows:

$$ p(y=1|x) = \sum_{i=1}^{m} \frac{e^{u_i^T x}}{\sum_{j=1}^{m} e^{u_j^T x}} \cdot \frac{1}{1 + e^{-w_i^T x}} $$

The schematic diagram of the MLR model is as follows.

This model can be understood from two perspectives:

  • The MLR can be regarded as a three-layer gated neural network: there are \(m\) sigmoid neurons in the hidden layer, the output of each neuron passes through a gate, and the corresponding softmax output is the switch of that gate.
  • The MLR can be regarded as an ensemble model composed of \(m\) simple sigmoid models, with the softmax outputs as the combination coefficients.

In many cases, a sub-model needs to be built on part of the data, and prediction then combines multiple models. MLR uses softmax to divide the data (a soft division) and predicts with one unified model. Another advantage of MLR is feature combination: some features are active for the sigmoid part while others are active for the softmax part, and multiplying the two is equivalent to performing feature combination at the lower layers.

Note: since the output of each sigmoid model is between 0 and 1, and the softmax outputs are between 0 and 1 and normalized (they sum to 1), the combined value is also between 0 and 1 (the maximum, 1, is reached when all sigmoid values are 1), so it can be regarded as a probability.
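To make the formula concrete, here is a minimal NumPy sketch of the MLR prediction function. The array names (`U` and `W` for the softmax and sigmoid parameter matrices) are illustrative, not Angel's actual API:

```python
import numpy as np

def mlr_predict(x, U, W):
    """MLR prediction: softmax partition weights times sigmoid fits.

    x: feature vector of shape (N,)
    U: softmax (partition) parameters, shape (m, N)
    W: sigmoid (fitting) parameters, shape (m, N)
    Returns p(y=1 | x), a scalar in (0, 1).
    """
    z = U @ x                              # partition scores, shape (m,)
    pi = np.exp(z - z.max())               # softmax, shifted for stability
    pi /= pi.sum()                         # partition weights, sum to 1
    eta = 1.0 / (1.0 + np.exp(-(W @ x)))   # per-region sigmoid fits
    return float(pi @ eta)                 # weighted sum of local models

# Toy usage: m = 3 regions, N = 4 features.
rng = np.random.default_rng(0)
U, W = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
x = rng.normal(size=4)
print(mlr_predict(x, U, W))  # a value in (0, 1)
```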

2. Distributed Implementation on Angel

Gradient descent method

For \(y \in \{1, -1\}\), the model can be written in a unified form:

$$ p(y|x) = \sum_{i=1}^{m} \frac{e^{u_i^T x}}{\sum_{j=1}^{m} e^{u_j^T x}} \cdot \frac{1}{1 + e^{-y\,w_i^T x}} $$

For the sample (x, y), the cross entropy loss function is:

$$ L(x, y) = -\log p(y|x) $$

Note: cross entropy is usually written as \(-[y\log \hat{p} + (1-y)\log(1-\hat{p})]\) with \(y \in \{0, 1\}\), where \(\hat{p}\) is the predicted probability of \(y = 1\) for the given \(x\). If \(p(y|x)\) instead denotes the probability of the observed label \(y\) given \(x\) (not only of \(y = 1\)), the cross entropy reduces to \(-\log p(y|x)\), as above.

Writing \(\pi_i = \frac{e^{u_i^T x}}{\sum_{j=1}^{m} e^{u_j^T x}}\), \(\eta_i = \frac{1}{1+e^{-y\,w_i^T x}}\) and \(p = p(y|x) = \sum_{i=1}^{m}\pi_i\eta_i\), the derivatives for a single sample are:

$$ \frac{\partial L}{\partial u_i} = -\frac{\pi_i\,(\eta_i - p)}{p}\,x, \qquad \frac{\partial L}{\partial w_i} = -\frac{\pi_i\,\eta_i(1-\eta_i)}{p}\,y\,x $$

Gradient: over a mini-batch \(B\), the gradient is the average of the per-sample derivatives:

$$ \nabla_{u_i} = \frac{1}{|B|}\sum_{(x,y)\in B}\frac{\partial L(x,y)}{\partial u_i}, \qquad \nabla_{w_i} = \frac{1}{|B|}\sum_{(x,y)\in B}\frac{\partial L(x,y)}{\partial w_i} $$
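A minimal sketch of the per-sample loss and gradients above, verified against a finite-difference check; as before, `U` and `W` are illustrative names:

```python
import numpy as np

def mlr_loss_and_grads(x, y, U, W):
    """Per-sample loss -log p(y|x) and its gradients w.r.t. U and W.

    y is +1 or -1; U, W have shape (m, N); x has shape (N,).
    """
    z = U @ x
    pi = np.exp(z - z.max()); pi /= pi.sum()         # softmax weights pi_i
    eta = 1.0 / (1.0 + np.exp(-y * (W @ x)))         # sigmoid of y * w_i.x
    p = pi @ eta                                     # p(y|x)
    loss = -np.log(p)
    dU = -np.outer(pi * (eta - p) / p, x)            # dL/du_i
    dW = -np.outer(pi * eta * (1 - eta) / p, y * x)  # dL/dw_i
    return loss, dU, dW

# Finite-difference check on one entry of U.
rng = np.random.default_rng(1)
U, W = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
x, y, eps = rng.normal(size=4), -1, 1e-6
loss, dU, _ = mlr_loss_and_grads(x, y, U, W)
Up = U.copy(); Up[0, 0] += eps
num = (mlr_loss_and_grads(x, y, Up, W)[0] - loss) / eps
print(abs(num - dU[0, 0]) < 1e-4)  # True: analytic matches numeric
```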

Implementation based on Angel

  • Model Storage:

    • The model parameters of the MLR algorithm are the softmax function parameters \(u_1, \dots, u_m\) and the sigmoid function parameters \(w_1, \dots, w_m\), where each \(u_i, w_i\) is an N-dimensional vector, N being the dimension of the data, that is, the number of features. Two m*N matrices represent the softmax matrix and the sigmoid matrix, respectively.
    • The intercept (truncation) values of the softmax and sigmoid functions are represented by two m*1 matrices.
  • Model Calculation:

    • The MLR model is trained by the gradient descent method, and the algorithm proceeds iteratively. At the beginning of each iteration, each worker pulls the latest model parameters from the PS, computes the gradient with its own training data, and pushes the gradient to the PS.
    • The PS receives all the gradient values pushed by the workers, averages them, and updates the PSModel.
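A toy, in-process sketch of this pull/compute/push cycle (real Angel workers and the PS are separate processes; `mlr_loss_and_grads` is the illustrative function from the sketch above):

```python
import numpy as np
# assumes mlr_loss_and_grads from the gradient sketch above

def train(workers_data, m, N, lr=0.1, iters=100):
    """Simulate PS-style training: each 'worker' computes a gradient
    on its own data shard; the 'PS' averages the gradients and updates."""
    rng = np.random.default_rng(2)
    U = rng.normal(scale=0.01, size=(m, N))   # PSModel: softmax matrix
    W = rng.normal(scale=0.01, size=(m, N))   # PSModel: sigmoid matrix
    for _ in range(iters):
        grads = []
        for shard in workers_data:            # each worker pulls U, W ...
            dU, dW = np.zeros_like(U), np.zeros_like(W)
            for x, y in shard:                # ... and averages its gradients
                _, gU, gW = mlr_loss_and_grads(x, y, U, W)
                dU += gU / len(shard); dW += gW / len(shard)
            grads.append((dU, dW))            # worker pushes its gradient
        dU = sum(g[0] for g in grads) / len(grads)  # PS averages ...
        dW = sum(g[1] for g in grads) / len(grads)
        U -= lr * dU; W -= lr * dW            # ... and updates the PSModel
    return U, W
```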

3. Operation

Input format

The data format is set by the "ml.data.type" parameter, and the number of data features, that is, the dimension of the feature vectors, is set by the "ml.feature.num" parameter.

MLR on Angel supports "libsvm" and "dummy" data formats as follows:

  • dummy format

Each line of text represents a sample in the format "y index1 index2 index3 ...", where each index is the ID of a nonzero feature; for training data, y is the category of the sample and takes the values 1 and -1; for prediction data, y is the ID of the sample. For example, a positive sample [2.0, 3.1, 0.0, 0.0, -1, 2.2] is expressed as "1 0 1 4 5", where "1" is the category and "0 1 4 5" means that dimensions 0, 1, 4 and 5 of the feature vector are nonzero. Similarly, a negative sample [2.0, 0.0, 0.1, 0.0, 0.0, 0.0] is represented as "-1 0 2". A parsing sketch follows below.
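A minimal sketch of parsing a dummy-format line into a dense vector (a hypothetical helper, not Angel's parser):

```python
def parse_dummy(line, num_features):
    """Parse 'y index1 index2 ...' into (label, dense 0/1 vector)."""
    parts = line.split()
    y = float(parts[0])
    x = [0.0] * num_features
    for idx in parts[1:]:
        x[int(idx)] = 1.0   # dummy format only marks nonzero dimensions
    return y, x

print(parse_dummy("1 0 1 4 5", 6))  # (1.0, [1.0, 1.0, 0.0, 0.0, 1.0, 1.0])
```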

  • libsvm format

Each line of text represents a sample in the format "y index1:value1 index2:value2 index3:value3 ...", where index is the feature ID and value is the corresponding feature value; for training data, y is the category of the sample and takes the values 1 and -1; for prediction data, y is the ID of the sample. For example, a positive sample [2.0, 3.1, 0.0, 0.0, -1, 2.2] is expressed as "1 0:2.0 1:3.1 4:-1 5:2.2", where "1" is the category and "0:2.0" means the value of feature 0 is 2.0. Similarly, a negative sample [2.0, 0.0, 0.1, 0.0, 0.0, 0.0] is represented as "-1 0:2.0 2:0.1". A parsing sketch follows below.
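The corresponding sketch for a libsvm-format line (again a hypothetical helper):

```python
def parse_libsvm(line, num_features):
    """Parse 'y index1:value1 index2:value2 ...' into (label, dense vector)."""
    parts = line.split()
    y = float(parts[0])
    x = [0.0] * num_features
    for kv in parts[1:]:
        idx, val = kv.split(":")
        x[int(idx)] = float(val)   # libsvm stores index:value pairs
    return y, x

print(parse_libsvm("-1 0:2.0 2:0.1", 6))  # (-1.0, [2.0, 0.0, 0.1, 0.0, 0.0, 0.0])
```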

Submitting script

Several steps must be completed before editing the submitting script and running.

  1. Confirm that Hadoop and Spark are ready in your environment
  2. Unzip sona-&lt;version&gt;-bin.zip to a local directory (SONA_HOME)
  3. Upload the sona-&lt;version&gt;-bin directory to HDFS (SONA_HDFS_HOME)
  4. Edit $SONA_HOME/bin/spark-on-angel-env.sh, setting SPARK_HOME, SONA_HOME, SONA_HDFS_HOME and ANGEL_VERSION

Here's an example of a submitting script; remember to adjust the parameters and fill in the paths according to your own task.

```bash
#test description
actionType=train # or predict
jsonFile=path-to-jsons/mixedlr.json
modelPath=path-to-save-model
predictPath=path-to-save-predict-results
input=path-to-data
queue=your-queue

HADOOP_HOME=my-hadoop-home
source ./bin/spark-on-angel-env.sh
export HADOOP_HOME=$HADOOP_HOME

$SPARK_HOME/bin/spark-submit \
  --master yarn-cluster \
  --conf spark.ps.jars=$SONA_ANGEL_JARS \
  --conf spark.ps.instances=10 \
  --conf spark.ps.cores=2 \
  --conf spark.ps.memory=10g \
  --jars $SONA_SPARK_JARS \
  --files $jsonFile \
  --driver-memory 20g \
  --num-executors 20 \
  --executor-cores 5 \
  --executor-memory 30g \
  --queue $queue \
  --class org.apache.spark.angel.examples.JsonRunnerExamples \
  ./lib/angelml-$SONA_VERSION.jar \
  jsonFile:./mixedlr.json \
  dataFormat:libsvm \
  data:$input \
  modelPath:$modelPath \
  predictPath:$predictPath \
  actionType:$actionType \
  numBatch:500 \
  maxIter:2 \
  lr:4.0 \
  numField:39
```