[docs] Edits for grammer and clarity (#1389)
* A nitpicky grammer edit with minor clarifications added.

* fix link

* strike s

* try a different optimal-split link, clarify experimental details

* smoothing the FAQ

* edit Features.rst

* several minor edits throughout docs

* historgram-based
zkurtz authored and StrikerRUS committed May 26, 2018
1 parent 4bb2f2f commit af40156
Showing 7 changed files with 111 additions and 122 deletions.
27 changes: 15 additions & 12 deletions docs/Advanced-Topics.rst
@@ -4,35 +4,38 @@ Advanced Topics
Missing Value Handle
--------------------

- LightGBM enables the missing value handle by default, you can disable it by set ``use_missing=false``.
- LightGBM enables the missing value handle by default. Disable it by setting ``use_missing=false``.

- LightGBM uses NA (NaN) to represent the missing value by default, you can change it to use zero by set ``zero_as_missing=true``.
- LightGBM uses NA (NaN) to represent missing values by default. Change it to use zero by setting ``zero_as_missing=true``.

- When ``zero_as_missing=false`` (default), the unshown value in sparse matrices (and LightSVM) is treated as zeros.
- When ``zero_as_missing=false`` (default), the unshown values in sparse matrices (and LightSVM) are treated as zeros.

- When ``zero_as_missing=true``, NA and zeros (including unshown value in sparse matrices (and LightSVM)) are treated as missing.
- When ``zero_as_missing=true``, NA and zeros (including unshown values in sparse matrices (and LightSVM)) are treated as missing.
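
As an illustration of the switches above, here is a minimal Python sketch (the data and parameter values are made up for this example)

::

    import numpy as np
    import lightgbm as lgb

    rng = np.random.RandomState(42)
    X = rng.rand(100, 5)
    X[rng.rand(100, 5) < 0.1] = np.nan  # inject some missing values
    y = rng.randint(0, 2, 100)

    # Defaults shown explicitly: NaN marks missing values, zeros are ordinary values.
    params = {"objective": "binary", "use_missing": True, "zero_as_missing": False}
    # To treat zeros (and unshown entries of sparse input) as missing instead:
    # params["zero_as_missing"] = True

    bst = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=10)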

Categorical Feature Support
---------------------------

- LightGBM can offer a good accuracy when using native categorical features. Not like simply one-hot encoding, LightGBM can find the optimal split of categorical features.
Such an optimal split can provide the much better accuracy than one-hot encoding solution.
- LightGBM offers good accuracy with integer-encoded categorical features. LightGBM applies
`Fisher (1958) <http://www.csiss.org/SPACE/workshops/2004/SAC/files/fisher.pdf>`_
to find the optimal split over categories as
`described here <./Features.rst#optimal-split-for-categorical-features>`_. This often performs better than one-hot encoding.

- Use ``categorical_feature`` to specify the categorical features.
Refer to the parameter ``categorical_feature`` in `Parameters <./Parameters.rst>`__.

- Converting to ``int`` type is needed first, and there is support for non-negative numbers only. Also, all values should be less than ``Int32.MaxValue`` (2147483647).
It is better to convert into continues ranges.
- Categorical features must be encoded as non-negative integers (``int``) less than ``Int32.MaxValue`` (2147483647).
It is best to use a contiguous range of integers.

- Use ``min_data_per_group``, ``cat_smooth`` to deal with over-fitting
(when ``#data`` is small or ``#category`` is large).
- Use ``min_data_per_group``, ``cat_smooth`` to deal with over-fitting (when ``#data`` is small or ``#category`` is large).

- For categorical features with high cardinality (``#category`` is large), it is better to convert it to numerical features.
- For a categorical feature with high cardinality (``#category`` is large), it often works best to
treat the feature as numeric, either by simply ignoring the categorical interpretation of the integers or
by embedding the categories in a low-dimensional numeric space.
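
To make the workflow above concrete, here is a hedged Python sketch (the column names and data are invented for this example)

::

    import numpy as np
    import pandas as pd
    import lightgbm as lgb

    df = pd.DataFrame({
        "color": ["red", "green", "blue", "green"] * 25,  # a categorical column
        "size": np.random.rand(100),                      # a numeric column
        "label": np.random.randint(0, 2, 100),
    })

    # Encode the categories as a contiguous range of non-negative integers.
    df["color"] = df["color"].astype("category").cat.codes

    train_data = lgb.Dataset(
        df[["color", "size"]],
        label=df["label"],
        categorical_feature=["color"],
    )
    bst = lgb.train({"objective": "binary"}, train_data, num_boost_round=10)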

LambdaRank
----------

- The label should be ``int`` type, and larger numbers represent the higher relevance (e.g. 0:bad, 1:fair, 2:good, 3:perfect).
- The label should be of type ``int``, such that larger numbers correspond to higher relevance (e.g. 0:bad, 1:fair, 2:good, 3:perfect).

- Use ``label_gain`` to set the gain (weight) of ``int`` labels.
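
A small Python sketch of these two points (the data are made up, and the ``label_gain`` values simply follow the usual 2^label - 1 pattern)

::

    import numpy as np
    import lightgbm as lgb

    # Toy ranking data: two queries with five documents each, integer relevance labels in 0-3.
    X = np.random.rand(10, 4)
    y = np.array([0, 1, 2, 3, 0, 1, 0, 2, 3, 1])
    group = [5, 5]  # number of rows belonging to each query, in order

    params = {
        "objective": "lambdarank",
        "label_gain": [0, 1, 3, 7],  # gain assigned to label values 0, 1, 2, 3
    }
    bst = lgb.train(params, lgb.Dataset(X, label=y, group=group), num_boost_round=10)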

32 changes: 15 additions & 17 deletions docs/Experiments.rst
@@ -28,7 +28,7 @@ We used 5 datasets to conduct our comparison experiments. Details of data are li
Environment
^^^^^^^^^^^

We used one Linux server as experiment platform, details are listed in the following table:
We ran all experiments on a single Linux server with the following specifications:

+------------------+-----------------+---------------------+
| OS | CPU | Memory |
@@ -46,7 +46,7 @@ Both xgboost and LightGBM were built with OpenMP support.
Settings
^^^^^^^^

We set up total 3 settings for experiments, the parameters of these settings are:
We set up a total of 3 settings for our experiments. The parameters of these settings are:

1. xgboost:

@@ -84,8 +84,8 @@ We set up total 3 settings for experiments, the parameters of these settings are
min_data_in_leaf = 0
min_sum_hessian_in_leaf = 100
xgboost grows tree depth-wise and controls model complexity by ``max_depth``.
LightGBM uses leaf-wise algorithm instead and controls model complexity by ``num_leaves``.
xgboost grows trees depth-wise and controls model complexity by ``max_depth``.
LightGBM uses a leaf-wise algorithm instead and controls model complexity by ``num_leaves``.
So we cannot compare them in the exact same model setting. As a tradeoff, we use xgboost with ``max_depth=8``, which allows at most 255 leaves, to compare with LightGBM with ``num_leaves=255``.

Other parameters are default values.
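
As a rough Python illustration of this pairing (not the full experiment configuration, which is partially collapsed in this view; the learning rates shown are arbitrary)

::

    # xgboost grows trees depth-wise, so model complexity is capped with max_depth.
    xgb_params = {"max_depth": 8, "eta": 0.1}

    # LightGBM grows trees leaf-wise, so model complexity is capped with num_leaves instead.
    lgb_params = {"num_leaves": 255, "learning_rate": 0.1}
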
@@ -96,7 +96,7 @@ Result
Speed
'''''

For speed comparison, we only run the training task, which was without any test or metric output. And we didn't count the time for IO.
We compared speed using only the training task without any test or metric output. We didn't count the time for IO.

The following table is the comparison of time cost:

@@ -114,12 +114,12 @@ The following table is the comparison of time cost:
| Allstate | 2867.22 s | 1355.71 s | **348.084475 s** |
+-----------+-----------+---------------+------------------+

We found LightGBM is faster than xgboost on all experiment data sets.
LightGBM ran faster than xgboost on all experiment data sets.

Accuracy
''''''''

For accuracy comparison, we used the accuracy on test data set to have a fair comparison.
We computed all accuracy metrics only on the test data set.

+-----------+-----------------+----------+---------------+----------+
| Data | Metric | xgboost | xgboost\_hist | LightGBM |
@@ -150,8 +150,8 @@ For accuracy comparison, we used the accuracy on test data set to have a fair co
Memory Consumption
''''''''''''''''''

We monitored RES while running training task. And we set ``two_round=true`` (will increase data-loading time,
but reduce peak memory usage, not affect training speed or accuracy) in LightGBM to reduce peak memory usage.
We monitored RES while running the training task. We set ``two_round=true`` in LightGBM (this increases data-loading time and
reduces peak memory usage, but does not affect training speed or accuracy) to reduce peak memory usage.

+-----------+---------+---------------+-------------+
| Data | xgboost | xgboost\_hist | LightGBM |
@@ -181,15 +181,15 @@ We used a terabyte click log dataset to conduct parallel experiments. Details ar
| Criteo | Binary classification | `link`_ | 1,700,000,000 | 67 |
+--------+-----------------------+---------+---------------+----------+

This data contains 13 integer features and 26 category features of 24 days click log.
We statisticized the CTR and count for these 26 category features from the first ten days,
then used next ten days' data, which had been replaced the category features by the corresponding CTR and count, as training data.
This data contains 13 integer features and 26 categorical features for 24 days of click logs.
We computed the clickthrough rate (CTR) and count for these 26 categorical features from the first ten days.
Then we used the next ten days' data, after replacing the categorical features with the corresponding CTR and count, as training data.
The processed training data have a total of 1.7 billion records and 67 features.

Environment
^^^^^^^^^^^

We used 16 Windows servers as experiment platform, details are listed in following table:
We ran our experiments on 16 Windows servers with the following specifications:

+---------------------+-----------------+---------------------+-------------------------------------------+
| OS | CPU | Memory | Network Adapter |
@@ -208,9 +208,7 @@ Settings
num_thread = 16
tree_learner = data
We used data parallel here, since this data is large in ``#data`` but small in ``#feature``.

Other parameters were default values.
We used data parallel here because this data is large in ``#data`` but small in ``#feature``. Other parameters were default values.

Results
^^^^^^^
@@ -229,7 +227,7 @@ Results
| 16 | 42 s | 11GB |
+----------+---------------+---------------------------+

From the results, we found that LightGBM performs linear speed up in parallel learning.
The results show that LightGBM achieves a linear speedup with parallel learning.

GPU Experiments
---------------
77 changes: 37 additions & 40 deletions docs/FAQ.rst
@@ -17,13 +17,21 @@ Contents
Critical
~~~~~~~~

You encountered a critical issue when using LightGBM (crash, prediction error, non sense outputs...). Who should you contact?
Please post an issue in `Microsoft/LightGBM repository <https://github.com/Microsoft/LightGBM/issues>`__ for any
LightGBM issues you encounter. For critical issues (crash, prediction error, nonsense outputs...), you may also ping a
member of the core team according to the relevant area of expertise by mentioning them with the arobase (@) symbol:

If your issue is not critical, just post an issue in `Microsoft/LightGBM repository <https://github.com/Microsoft/LightGBM/issues>`__.
- `@guolinke <https://github.com/guolinke>`__ (C++ code / R-package / Python-package)
- `@chivee <https://github.com/chivee>`__ (C++ code / Python-package)
- `@Laurae2 <https://github.com/Laurae2>`__ (R-package)
- `@wxchan <https://github.com/wxchan>`__ (Python-package)
- `@henry0312 <https://github.com/henry0312>`__ (Python-package)
- `@StrikerRUS <https://github.com/StrikerRUS>`__ (Python-package)
- `@huanzhang12 <https://github.com/huanzhang12>`__ (GPU support)

If it is a critical issue, identify first what error you have:
Please include as much of the following information as possible when submitting a critical issue:

- Do you think it is reproducible on CLI (command line interface), R, and/or Python?
- Is it reproducible on CLI (command line interface), R, and/or Python?

- Is it specific to a wrapper? (R or Python?)

@@ -33,19 +41,9 @@ If it is a critical issue, identify first what error you have:

- Are you able to reproduce this issue with a simple case?

- Are you able to (not) reproduce this issue after removing all optimization flags and compiling LightGBM in debug mode?

Depending on the answers, while opening your issue, feel free to ping (just mention them with the arobase (@) symbol) appropriately so we can attempt to solve your problem faster:

- `@guolinke <https://github.com/guolinke>`__ (C++ code / R-package / Python-package)
- `@chivee <https://github.com/chivee>`__ (C++ code / Python-package)
- `@Laurae2 <https://github.com/Laurae2>`__ (R-package)
- `@wxchan <https://github.com/wxchan>`__ (Python-package)
- `@henry0312 <https://github.com/henry0312>`__ (Python-package)
- `@StrikerRUS <https://github.com/StrikerRUS>`__ (Python-package)
- `@huanzhang12 <https://github.com/huanzhang12>`__ (GPU support)
- Does the issue persist after removing all optimization flags and compiling LightGBM in debug mode?

Remember this is a free/open community support. We may not be available 24/7 to provide support.
When submitting issues, please keep in mind that this is largely a volunteer effort, and we may not be available 24/7 to provide support.

--------------

@@ -54,64 +52,63 @@ LightGBM

- **Question 1**: Where do I find more details about LightGBM parameters?

- **Solution 1**: Take a look at `Parameters <./Parameters.rst>`__ and `Laurae++/Parameters <https://sites.google.com/view/lauraepp/parameters>`__ website.
- **Solution 1**: Take a look at `Parameters <./Parameters.rst>`__ and the `Laurae++/Parameters <https://sites.google.com/view/lauraepp/parameters>`__ website.

--------------

- **Question 2**: On datasets with million of features, training do not start (or starts after a very long time).
- **Question 2**: On datasets with millions of features, training does not start (or starts after a very long time).

- **Solution 2**: Use a smaller value for ``bin_construct_sample_cnt`` and a larger value for ``min_data``.
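
For example, in the Python package this could look like the following sketch (the exact values are hypothetical and should be tuned for your data)

::

    params = {
        "objective": "binary",
        "bin_construct_sample_cnt": 20000,  # sample fewer rows when choosing bin boundaries
        "min_data": 1000,                   # require more data per leaf
    }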

--------------

- **Question 3**: When running LightGBM on a large dataset, my computer runs out of RAM.

- **Solution 3**: Multiple solutions: set ``histogram_pool_size`` parameter to the MB you want to use for LightGBM (histogram\_pool\_size + dataset size = approximately RAM used),
- **Solution 3**: Multiple solutions: set the ``histogram_pool_size`` parameter to the MB you want to use for LightGBM (histogram\_pool\_size + dataset size = approximately RAM used),
lower ``num_leaves`` or lower ``max_bin`` (see `Microsoft/LightGBM#562 <https://github.com/Microsoft/LightGBM/issues/562>`__).
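
A hedged sketch of those memory-related parameters in the Python package (values are only examples)

::

    params = {
        "histogram_pool_size": 1024,  # cap the histogram cache at roughly 1024 MB
        "num_leaves": 63,             # fewer leaves means fewer histograms to keep
        "max_bin": 63,                # fewer bins means smaller histograms and a smaller dataset
    }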

--------------

- **Question 4**: I am using Windows. Should I use Visual Studio or MinGW for compiling LightGBM?

- **Solution 4**: It is recommended to `use Visual Studio <https://github.com/Microsoft/LightGBM/issues/542>`__ as its performance is higher for LightGBM.
- **Solution 4**: Visual Studio `performs best for LightGBM <https://github.com/Microsoft/LightGBM/issues/542>`__.

--------------

- **Question 5**: When using LightGBM GPU, I cannot reproduce results over several runs.

- **Solution 5**: It is a normal issue, there is nothing we/you can do about,
you may try to use ``gpu_use_dp = true`` for reproducibility (see `Microsoft/LightGBM#560 <https://github.com/Microsoft/LightGBM/pull/560#issuecomment-304561654>`__).
You may also use CPU version.
- **Solution 5**: This is normal and expected behaviour, but you may try to use ``gpu_use_dp = true`` for reproducibility
(see `Microsoft/LightGBM#560 <https://github.com/Microsoft/LightGBM/pull/560#issuecomment-304561654>`__).
You may also use the CPU version.
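
For instance, a minimal parameter sketch for the reproducibility workaround mentioned above

::

    params = {
        "device": "gpu",
        "gpu_use_dp": True,  # use double precision on the GPU; slower, but runs become reproducible
    }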

--------------

- **Question 6**: Bagging is not reproducible when changing the number of threads.

- **Solution 6**: As LightGBM bagging is running multithreaded, its output is dependent on the number of threads used.
- **Solution 6**: LightGBM bagging is multithreaded, so its output depends on the number of threads used.
There is `no workaround currently <https://github.com/Microsoft/LightGBM/issues/632>`__.

--------------

- **Question 7**: I tried to use Random Forest mode, and LightGBM crashes!

- **Solution 7**: It is by design.
You must use ``bagging_fraction`` and ``feature_fraction`` different from 1, along with a ``bagging_freq``.
See `this thread <https://github.com/Microsoft/LightGBM/issues/691>`__ as an example.
- **Solution 7**: This is expected behaviour for arbitrary parameters. To enable Random Forest,
you must use ``bagging_fraction`` and ``feature_fraction`` different from 1, along with a ``bagging_freq``.
`This thread <https://github.com/Microsoft/LightGBM/issues/691>`__ includes an example.
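
For reference, a hypothetical parameter set that satisfies these constraints

::

    params = {
        "boosting": "rf",         # random forest mode
        "bagging_freq": 1,        # perform bagging at every iteration
        "bagging_fraction": 0.8,  # must be different from 1
        "feature_fraction": 0.8,  # must be different from 1
    }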

--------------

- **Question 8**: CPU are not kept busy (like 10% CPU usage only) in Windows when using LightGBM on very large datasets with many core systems.
- **Question 8**: CPU usage is low (like 10%) on Windows when using LightGBM on very large datasets on many-core systems.

- **Solution 8**: Please use `Visual Studio <https://www.visualstudio.com/downloads/>`__
as it may be `10x faster than MinGW <https://github.com/Microsoft/LightGBM/issues/749>`__ especially for very large trees.

--------------

- **Question 9**: When I'm trying to specify some column as categorical by using ``categorical_feature`` parameter, I get segmentation fault in LightGBM.
- **Question 9**: When I'm trying to specify a categorical column with the ``categorical_feature`` parameter, I get a segmentation fault.

- **Solution 9**: Probably you're trying to pass via ``categorical_feature`` parameter a column with very large values. For instance, it can be some IDs.
In LightGBM categorical features are limited by int32 range, so you cannot pass values that are greater than ``Int32.MaxValue`` (2147483647) as categorical features
(see `Microsoft/LightGBM#1359 <https://github.com/Microsoft/LightGBM/issues/1359>`__). You should convert them into integer range from zero to number of categories first.
- **Solution 9**: The column you're trying to pass via ``categorical_feature`` likely contains very large values.
Categorical features in LightGBM are limited by int32 range, so you cannot pass values that are greater than ``Int32.MaxValue`` (2147483647) as categorical features (see `Microsoft/LightGBM#1359 <https://github.com/Microsoft/LightGBM/issues/1359>`__). You should convert them to integers ranging from zero to the number of categories first.
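
For instance, with pandas this conversion could look like the following sketch (the column name and IDs are invented)

::

    import pandas as pd

    df = pd.DataFrame({"user_id": [900000000001, 900000000002, 900000000001]})
    # Map arbitrary (possibly huge) IDs to codes in the range [0, number of categories).
    df["user_id"] = df["user_id"].astype("category").cat.codes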

--------------

@@ -156,7 +153,7 @@ Python-package

Cannot get/set label/weight/init_score/group/num_data/num_feature before construct dataset

but I've already constructed dataset by some code like
but I've already constructed a dataset by some code like

::

@@ -169,16 +166,16 @@ Python-package
Cannot set predictor/reference/categorical feature after freed raw data, set free_raw_data=False when construct Dataset to avoid this.

- **Solution 2**: Because LightGBM constructs bin mappers to build trees, and train and valid Datasets within one Booster share the same bin mappers,
categorical features and feature names etc., the Dataset objects are constructed when construct a Booster.
And if you set ``free_raw_data=True`` (default), the raw data (with Python data struct) will be freed.
categorical features and feature names etc., the Dataset objects are constructed when constructing a Booster.
If you set ``free_raw_data=True`` (default), the raw data (with Python data struct) will be freed.
So, if you want to:

- get label(or weight/init\_score/group) before construct dataset, it's same as get ``self.label``
- get label (or weight/init\_score/group) before constructing a dataset, it's the same as getting ``self.label``

- set label(or weight/init\_score/group) before construct dataset, it's same as ``self.label=some_label_array``
- set label (or weight/init\_score/group) before constructing a dataset, it's the same as ``self.label=some_label_array``

- get num\_data(or num\_feature) before construct dataset, you can get data with ``self.data``,
then if your data is ``numpy.ndarray``, use some code like ``self.data.shape``
- get num\_data (or num\_feature) before constructing a dataset, you can get data with ``self.data``.
Then, if your data is ``numpy.ndarray``, use some code like ``self.data.shape``

- set predictor(or reference/categorical feature) after construct dataset,
- set predictor (or reference/categorical feature) after constructing a dataset,
you should set ``free_raw_data=False`` or init a Dataset object with the same raw data
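
A minimal Python sketch of that last point (the data here are made up)

::

    import numpy as np
    import lightgbm as lgb

    X = np.random.rand(100, 5)
    y = np.random.randint(0, 2, 100)

    # Keep the raw data so that attributes can still be set after construction.
    train_data = lgb.Dataset(X, label=y, free_raw_data=False)
    train_data.construct()

    train_data.set_label(y)       # works because the raw data was not freed
    print(train_data.num_data())  # 100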
