[docs] Edits for grammer and clarity (#1389)
* A nitpicky grammer edit with minor clarifications added.

* fix link

* strike s

* try a different optimal-split link, clarify experimental details

* smoothing the FAQ

* edit Features.rst

* several minor edits throughout docs

* historgram-based
zkurtz authored and StrikerRUS committed May 26, 2018
1 parent 4bb2f2f commit af40156
Showing 7 changed files with 111 additions and 122 deletions.
27 changes: 15 additions & 12 deletions docs/Advanced-Topics.rst
@@ -4,35 +4,38 @@ Advanced Topics
Missing Value Handle
--------------------

- LightGBM enables the missing value handle by default, you can disable it by set ``use_missing=false``.
- LightGBM enables the missing value handle by default. Disable it by setting ``use_missing=false``.

- LightGBM uses NA (NaN) to represent the missing value by default, you can change it to use zero by set ``zero_as_missing=true``.
- LightGBM uses NA (NaN) to represent missing values by default. Change it to use zero by setting ``zero_as_missing=true``.

- When ``zero_as_missing=false`` (default), the unshown value in sparse matrices (and LightSVM) is treated as zeros.
- When ``zero_as_missing=false`` (default), the unshown values in sparse matrices (and LightSVM) are treated as zeros.

- When ``zero_as_missing=true``, NA and zeros (including unshown value in sparse matrices (and LightSVM)) are treated as missing.
- When ``zero_as_missing=true``, NA and zeros (including unshown values in sparse matrices (and LightSVM)) are treated as missing.
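
As an illustration of the switches above, here is a minimal Python sketch (the data and parameter values are made up for this example)

::

    import numpy as np
    import lightgbm as lgb

    rng = np.random.RandomState(42)
    X = rng.rand(100, 5)
    X[rng.rand(100, 5) < 0.1] = np.nan  # inject some missing values
    y = rng.randint(0, 2, 100)

    # Defaults shown explicitly: NaN marks missing values, zeros are ordinary values.
    params = {"objective": "binary", "use_missing": True, "zero_as_missing": False}
    # To treat zeros (and unshown entries of sparse input) as missing instead:
    # params["zero_as_missing"] = True

    bst = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=10)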

Categorical Feature Support
---------------------------

- LightGBM can offer a good accuracy when using native categorical features. Not like simply one-hot encoding, LightGBM can find the optimal split of categorical features.
Such an optimal split can provide the much better accuracy than one-hot encoding solution.
- LightGBM offers good accuracy with integer-encoded categorical features. LightGBM applies
`Fisher (1958) <http://www.csiss.org/SPACE/workshops/2004/SAC/files/fisher.pdf>`_
to find the optimal split over categories as
`described here <./Features.rst#optimal-split-for-categorical-features>`_. This often performs better than one-hot encoding.

- Use ``categorical_feature`` to specify the categorical features.
Refer to the parameter ``categorical_feature`` in `Parameters <./Parameters.rst>`__.

- Converting to ``int`` type is needed first, and there is support for non-negative numbers only. Also, all values should be less than ``Int32.MaxValue`` (2147483647).
It is better to convert into continues ranges.
- Categorical features must be encoded as non-negative integers (``int``) less than ``Int32.MaxValue`` (2147483647).
It is best to use a contiguous range of integers.

- Use ``min_data_per_group``, ``cat_smooth`` to deal with over-fitting
(when ``#data`` is small or ``#category`` is large).
- Use ``min_data_per_group``, ``cat_smooth`` to deal with over-fitting (when ``#data`` is small or ``#category`` is large).

- For categorical features with high cardinality (``#category`` is large), it is better to convert it to numerical features.
- For a categorical feature with high cardinality (``#category`` is large), it often works best to
treat the feature as numeric, either by simply ignoring the categorical interpretation of the integers or
by embedding the categories in a low-dimensional numeric space.
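
To make the workflow above concrete, here is a hedged Python sketch (the column names and data are invented for this example)

::

    import numpy as np
    import pandas as pd
    import lightgbm as lgb

    df = pd.DataFrame({
        "color": ["red", "green", "blue", "green"] * 25,  # a categorical column
        "size": np.random.rand(100),                      # a numeric column
        "label": np.random.randint(0, 2, 100),
    })

    # Encode the categories as a contiguous range of non-negative integers.
    df["color"] = df["color"].astype("category").cat.codes

    train_data = lgb.Dataset(
        df[["color", "size"]],
        label=df["label"],
        categorical_feature=["color"],
    )
    bst = lgb.train({"objective": "binary"}, train_data, num_boost_round=10)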

LambdaRank
----------

- The label should be ``int`` type, and larger numbers represent the higher relevance (e.g. 0:bad, 1:fair, 2:good, 3:perfect).
- The label should be of type ``int``, such that larger numbers correspond to higher relevance (e.g. 0:bad, 1:fair, 2:good, 3:perfect).

- Use ``label_gain`` to set the gain (weight) of ``int`` labels.
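
A small Python sketch of these two points (the data are made up, and the ``label_gain`` values simply follow the usual 2^label - 1 pattern)

::

    import numpy as np
    import lightgbm as lgb

    # Toy ranking data: two queries with five documents each, integer relevance labels in 0-3.
    X = np.random.rand(10, 4)
    y = np.array([0, 1, 2, 3, 0, 1, 0, 2, 3, 1])
    group = [5, 5]  # number of rows belonging to each query, in order

    params = {
        "objective": "lambdarank",
        "label_gain": [0, 1, 3, 7],  # gain assigned to label values 0, 1, 2, 3
    }
    bst = lgb.train(params, lgb.Dataset(X, label=y, group=group), num_boost_round=10)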

32 changes: 15 additions & 17 deletions docs/Experiments.rst
@@ -28,7 +28,7 @@ We used 5 datasets to conduct our comparison experiments. Details of data are li
Environment
^^^^^^^^^^^

We used one Linux server as experiment platform, details are listed in the following table:
We ran all experiments on a single Linux server with the following specifications:

+------------------+-----------------+---------------------+
| OS | CPU | Memory |
@@ -46,7 +46,7 @@ Both xgboost and LightGBM were built with OpenMP support.
Settings
^^^^^^^^

We set up total 3 settings for experiments, the parameters of these settings are:
We set up a total of 3 settings for our experiments. The parameters of these settings are:

1. xgboost:

@@ -84,8 +84,8 @@ We set up total 3 settings for experiments, the parameters of these settings are
min_data_in_leaf = 0
min_sum_hessian_in_leaf = 100
xgboost grows tree depth-wise and controls model complexity by ``max_depth``.
LightGBM uses leaf-wise algorithm instead and controls model complexity by ``num_leaves``.
xgboost grows trees depth-wise and controls model complexity by ``max_depth``.
LightGBM uses a leaf-wise algorithm instead and controls model complexity by ``num_leaves``.
So we cannot compare them in the exact same model setting. As a tradeoff, we use xgboost with ``max_depth=8``, which allows at most 255 leaves, to compare with LightGBM with ``num_leaves=255``.

Other parameters are default values.
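
As a rough Python illustration of this pairing (not the full experiment configuration, which is partially collapsed in this view; the learning rates shown are arbitrary)

::

    # xgboost grows trees depth-wise, so model complexity is capped with max_depth.
    xgb_params = {"max_depth": 8, "eta": 0.1}

    # LightGBM grows trees leaf-wise, so model complexity is capped with num_leaves instead.
    lgb_params = {"num_leaves": 255, "learning_rate": 0.1}
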
@@ -96,7 +96,7 @@ Result
Speed
'''''

For speed comparison, we only run the training task, which was without any test or metric output. And we didn't count the time for IO.
We compared speed using only the training task without any test or metric output. We didn't count the time for IO.

The following table is the comparison of time cost:

@@ -114,12 +114,12 @@ The following table is the comparison of time cost:
| Allstate | 2867.22 s | 1355.71 s | **348.084475 s** |
+-----------+-----------+---------------+------------------+

We found LightGBM is faster than xgboost on all experiment data sets.
LightGBM ran faster than xgboost on all experiment data sets.

Accuracy
''''''''

For accuracy comparison, we used the accuracy on test data set to have a fair comparison.
We computed all accuracy metrics only on the test data set.

+-----------+-----------------+----------+---------------+----------+
| Data | Metric | xgboost | xgboost\_hist | LightGBM |
@@ -150,8 +150,8 @@ For accuracy comparison, we used the accuracy on test data set to have a fair co
Memory Consumption
''''''''''''''''''

We monitored RES while running training task. And we set ``two_round=true`` (will increase data-loading time,
but reduce peak memory usage, not affect training speed or accuracy) in LightGBM to reduce peak memory usage.
We monitored RES while running the training task. We set ``two_round=true`` in LightGBM (this increases data-loading time and
reduces peak memory usage, but does not affect training speed or accuracy) to reduce peak memory usage.

+-----------+---------+---------------+-------------+
| Data | xgboost | xgboost\_hist | LightGBM |
@@ -181,15 +181,15 @@ We used a terabyte click log dataset to conduct parallel experiments. Details ar
| Criteo | Binary classification | `link`_ | 1,700,000,000 | 67 |
+--------+-----------------------+---------+---------------+----------+

This data contains 13 integer features and 26 category features of 24 days click log.
We statisticized the CTR and count for these 26 category features from the first ten days,
then used next ten days' data, which had been replaced the category features by the corresponding CTR and count, as training data.
This data contains 13 integer features and 26 categorical features for 24 days of click logs.
We computed the clickthrough rate (CTR) and count for these 26 categorical features from the first ten days.
Then we used the next ten days' data, after replacing the categorical features with the corresponding CTR and count, as training data.
The processed training data have a total of 1.7 billion records and 67 features.

Environment
^^^^^^^^^^^

We used 16 Windows servers as experiment platform, details are listed in following table:
We ran our experiments on 16 Windows servers with the following specifications:

+---------------------+-----------------+---------------------+-------------------------------------------+
| OS | CPU | Memory | Network Adapter |
@@ -208,9 +208,7 @@ Settings
num_thread = 16
tree_learner = data
We used data parallel here, since this data is large in ``#data`` but small in ``#feature``.

Other parameters were default values.
We used data parallel here because this data is large in ``#data`` but small in ``#feature``. Other parameters were default values.

Results
^^^^^^^
@@ -229,7 +227,7 @@ Results
| 16 | 42 s | 11GB |
+----------+---------------+---------------------------+

From the results, we found that LightGBM performs linear speed up in parallel learning.
The results show that LightGBM achieves a linear speedup with parallel learning.

GPU Experiments
---------------
77 changes: 37 additions & 40 deletions docs/FAQ.rst
@@ -17,13 +17,21 @@ Contents
Critical
~~~~~~~~

You encountered a critical issue when using LightGBM (crash, prediction error, non sense outputs...). Who should you contact?
Please post an issue in `Microsoft/LightGBM repository <https://github.com/Microsoft/LightGBM/issues>`__ for any
LightGBM issues you encounter. For critical issues (crash, prediction error, nonsense outputs...), you may also ping a
member of the core team according to the relevant area of expertise by mentioning them with the arobase (@) symbol:

If your issue is not critical, just post an issue in `Microsoft/LightGBM repository <https://github.com/Microsoft/LightGBM/issues>`__.
- `@guolinke <https://github.com/guolinke>`__ (C++ code / R-package / Python-package)
- `@chivee <https://github.com/chivee>`__ (C++ code / Python-package)
- `@Laurae2 <https://github.com/Laurae2>`__ (R-package)
- `@wxchan <https://github.com/wxchan>`__ (Python-package)
- `@henry0312 <https://github.com/henry0312>`__ (Python-package)
- `@StrikerRUS <https://github.com/StrikerRUS>`__ (Python-package)
- `@huanzhang12 <https://github.com/huanzhang12>`__ (GPU support)

If it is a critical issue, identify first what error you have:
Please include as much of the following information as possible when submitting a critical issue:

- Do you think it is reproducible on CLI (command line interface), R, and/or Python?
- Is it reproducible on CLI (command line interface), R, and/or Python?

- Is it specific to a wrapper? (R or Python?)

@@ -33,19 +41,9 @@ If it is a critical issue, identify first what error you have:

- Are you able to reproduce this issue with a simple case?

- Are you able to (not) reproduce this issue after removing all optimization flags and compiling LightGBM in debug mode?

Depending on the answers, while opening your issue, feel free to ping (just mention them with the arobase (@) symbol) appropriately so we can attempt to solve your problem faster:

- `@guolinke <https://github.com/guolinke>`__ (C++ code / R-package / Python-package)
- `@chivee <https://github.com/chivee>`__ (C++ code / Python-package)
- `@Laurae2 <https://github.com/Laurae2>`__ (R-package)
- `@wxchan <https://github.com/wxchan>`__ (Python-package)
- `@henry0312 <https://github.com/henry0312>`__ (Python-package)
- `@StrikerRUS <https://github.com/StrikerRUS>`__ (Python-package)
- `@huanzhang12 <https://github.com/huanzhang12>`__ (GPU support)
- Does the issue persist after removing all optimization flags and compiling LightGBM in debug mode?

Remember this is a free/open community support. We may not be available 24/7 to provide support.
When submitting issues, please keep in mind that this is largely a volunteer effort, and we may not be available 24/7 to provide support.

--------------

@@ -54,64 +52,63 @@ LightGBM

- **Question 1**: Where do I find more details about LightGBM parameters?

- **Solution 1**: Take a look at `Parameters <./Parameters.rst>`__ and `Laurae++/Parameters <https://sites.google.com/view/lauraepp/parameters>`__ website.
- **Solution 1**: Take a look at `Parameters <./Parameters.rst>`__ and the `Laurae++/Parameters <https://sites.google.com/view/lauraepp/parameters>`__ website.

--------------

- **Question 2**: On datasets with million of features, training do not start (or starts after a very long time).
- **Question 2**: On datasets with millions of features, training does not start (or starts after a very long time).

- **Solution 2**: Use a smaller value for ``bin_construct_sample_cnt`` and a larger value for ``min_data``.
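
For example, in the Python package this could look like the following sketch (the exact values are hypothetical and should be tuned for your data)

::

    params = {
        "objective": "binary",
        "bin_construct_sample_cnt": 20000,  # sample fewer rows when choosing bin boundaries
        "min_data": 1000,                   # require more data per leaf
    }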

--------------

- **Question 3**: When running LightGBM on a large dataset, my computer runs out of RAM.

- **Solution 3**: Multiple solutions: set ``histogram_pool_size`` parameter to the MB you want to use for LightGBM (histogram\_pool\_size + dataset size = approximately RAM used),
- **Solution 3**: Multiple solutions: set the ``histogram_pool_size`` parameter to the MB you want to use for LightGBM (histogram\_pool\_size + dataset size = approximately RAM used),
lower ``num_leaves`` or lower ``max_bin`` (see `Microsoft/LightGBM#562 <https://github.com/Microsoft/LightGBM/issues/562>`__).
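
A hedged sketch of those memory-related parameters in the Python package (values are only examples)

::

    params = {
        "histogram_pool_size": 1024,  # cap the histogram cache at roughly 1024 MB
        "num_leaves": 63,             # fewer leaves means fewer histograms to keep
        "max_bin": 63,                # fewer bins means smaller histograms and a smaller dataset
    }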

--------------

- **Question 4**: I am using Windows. Should I use Visual Studio or MinGW for compiling LightGBM?

- **Solution 4**: It is recommended to `use Visual Studio <https://github.com/Microsoft/LightGBM/issues/542>`__ as its performance is higher for LightGBM.
- **Solution 4**: Visual Studio `performs best for LightGBM <https://github.com/Microsoft/LightGBM/issues/542>`__.

--------------

- **Question 5**: When using LightGBM GPU, I cannot reproduce results over several runs.

- **Solution 5**: It is a normal issue, there is nothing we/you can do about,
you may try to use ``gpu_use_dp = true`` for reproducibility (see `Microsoft/LightGBM#560 <https://github.com/Microsoft/LightGBM/pull/560#issuecomment-304561654>`__).
You may also use CPU version.
- **Solution 5**: This is normal and expected behaviour, but you may try to use ``gpu_use_dp = true`` for reproducibility
(see `Microsoft/LightGBM#560 <https://github.com/Microsoft/LightGBM/pull/560#issuecomment-304561654>`__).
You may also use the CPU version.
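
For instance, a minimal parameter sketch for the reproducibility workaround mentioned above

::

    params = {
        "device": "gpu",
        "gpu_use_dp": True,  # use double precision on the GPU; slower, but runs become reproducible
    }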

--------------

- **Question 6**: Bagging is not reproducible when changing the number of threads.

- **Solution 6**: As LightGBM bagging is running multithreaded, its output is dependent on the number of threads used.
- **Solution 6**: LightGBM bagging is multithreaded, so its output depends on the number of threads used.
There is `no workaround currently <https://github.com/Microsoft/LightGBM/issues/632>`__.

--------------

- **Question 7**: I tried to use Random Forest mode, and LightGBM crashes!

- **Solution 7**: It is by design.
You must use ``bagging_fraction`` and ``feature_fraction`` different from 1, along with a ``bagging_freq``.
See `this thread <https://github.com/Microsoft/LightGBM/issues/691>`__ as an example.
- **Solution 7**: This is expected behaviour for arbitrary parameters. To enable Random Forest,
you must use ``bagging_fraction`` and ``feature_fraction`` different from 1, along with a ``bagging_freq``.
`This thread <https://github.com/Microsoft/LightGBM/issues/691>`__ includes an example.
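
For reference, a hypothetical parameter set that satisfies these constraints

::

    params = {
        "boosting": "rf",         # random forest mode
        "bagging_freq": 1,        # perform bagging at every iteration
        "bagging_fraction": 0.8,  # must be different from 1
        "feature_fraction": 0.8,  # must be different from 1
    }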

--------------

- **Question 8**: CPU are not kept busy (like 10% CPU usage only) in Windows when using LightGBM on very large datasets with many core systems.
- **Question 8**: CPU usage is low (like 10%) on Windows when using LightGBM on very large datasets on many-core systems.

- **Solution 8**: Please use `Visual Studio <https://www.visualstudio.com/downloads/>`__
as it may be `10x faster than MinGW <https://github.com/Microsoft/LightGBM/issues/749>`__ especially for very large trees.

--------------

- **Question 9**: When I'm trying to specify some column as categorical by using ``categorical_feature`` parameter, I get segmentation fault in LightGBM.
- **Question 9**: When I'm trying to specify a categorical column with the ``categorical_feature`` parameter, I get a segmentation fault.

- **Solution 9**: Probably you're trying to pass via ``categorical_feature`` parameter a column with very large values. For instance, it can be some IDs.
In LightGBM categorical features are limited by int32 range, so you cannot pass values that are greater than ``Int32.MaxValue`` (2147483647) as categorical features
(see `Microsoft/LightGBM#1359 <https://github.com/Microsoft/LightGBM/issues/1359>`__). You should convert them into integer range from zero to number of categories first.
- **Solution 9**: The column you're trying to pass via ``categorical_feature`` likely contains very large values.
Categorical features in LightGBM are limited by int32 range, so you cannot pass values that are greater than ``Int32.MaxValue`` (2147483647) as categorical features (see `Microsoft/LightGBM#1359 <https://github.com/Microsoft/LightGBM/issues/1359>`__). You should convert them to integers ranging from zero to the number of categories first.
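
For instance, with pandas this conversion could look like the following sketch (the column name and IDs are invented)

::

    import pandas as pd

    df = pd.DataFrame({"user_id": [900000000001, 900000000002, 900000000001]})
    # Map arbitrary (possibly huge) IDs to codes in the range [0, number of categories).
    df["user_id"] = df["user_id"].astype("category").cat.codes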

--------------

@@ -156,7 +153,7 @@ Python-package

Cannot get/set label/weight/init_score/group/num_data/num_feature before construct dataset

but I've already constructed dataset by some code like
but I've already constructed a dataset by some code like

::

@@ -169,16 +166,16 @@ Python-package
Cannot set predictor/reference/categorical feature after freed raw data, set free_raw_data=False when construct Dataset to avoid this.

- **Solution 2**: Because LightGBM constructs bin mappers to build trees, and train and valid Datasets within one Booster share the same bin mappers,
categorical features and feature names etc., the Dataset objects are constructed when construct a Booster.
And if you set ``free_raw_data=True`` (default), the raw data (with Python data struct) will be freed.
categorical features and feature names etc., the Dataset objects are constructed when constructing a Booster.
If you set ``free_raw_data=True`` (default), the raw data (with Python data struct) will be freed.
So, if you want to:

- get label(or weight/init\_score/group) before construct dataset, it's same as get ``self.label``
- get label (or weight/init\_score/group) before constructing a dataset, it's the same as getting ``self.label``

- set label(or weight/init\_score/group) before construct dataset, it's same as ``self.label=some_label_array``
- set label (or weight/init\_score/group) before constructing a dataset, it's the same as ``self.label=some_label_array``

- get num\_data(or num\_feature) before construct dataset, you can get data with ``self.data``,
then if your data is ``numpy.ndarray``, use some code like ``self.data.shape``
- get num\_data (or num\_feature) before constructing a dataset, you can get data with ``self.data``.
Then, if your data is ``numpy.ndarray``, use some code like ``self.data.shape``

- set predictor(or reference/categorical feature) after construct dataset,
- set predictor (or reference/categorical feature) after constructing a dataset,
you should set ``free_raw_data=False`` or init a Dataset object with the same raw data
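
A minimal Python sketch of that last point (the data here are made up)

::

    import numpy as np
    import lightgbm as lgb

    X = np.random.rand(100, 5)
    y = np.random.randint(0, 2, 100)

    # Keep the raw data so that attributes can still be set after construction.
    train_data = lgb.Dataset(X, label=y, free_raw_data=False)
    train_data.construct()

    train_data.set_label(y)       # works because the raw data was not freed
    print(train_data.num_data())  # 100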
