Merge pull request #41 from mlennert/v_class_mlR_enhancements
Significant enhancement of v.class.mlR
Merging so that it will be easier to test.
mlennert committed Oct 15, 2019
2 parents ce122fe + 3bf1c74 commit 3bd1d49
Showing 2 changed files with 412 additions and 137 deletions.
70 changes: 57 additions & 13 deletions grass7/vector/v.class.mlR/v.class.mlR.html
@@ -1,8 +1,8 @@
<h2>DESCRIPTION</h2>

<p>
- <em>v.class.mlR</em> is a wrapper script that uses the R caret package
- for machine learning in R to classify features using training features
+ <em>v.class.mlR</em> is a wrapper module that uses the R caret package
+ for machine learning in R to classify objects using training features
by supervised learning.

<p>The user provides a set of objects (or segments) to be classified, including
@@ -22,6 +22,10 @@ <h2>DESCRIPTION</h2>
of features, a text file (<em>classification_results</em>) or reclassed
raster maps (<em>classified_map</em>).

+ <p>When using text file input, the training data should not contain an id
+ column. The object data (i.e., the full set of data to be classified) should
+ have the ids in the first column.
+
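<p>For illustration, assuming comma-separated files (the column names here are
purely hypothetical, not prescribed by the module), the two inputs could look
like this:

<div class="code"><pre>
# training file: class column plus feature columns, no id column
class,mean_band1,mean_band2,compactness
1,0.21,0.35,1.8
2,0.55,0.12,2.4

# object data file: ids in the first column, then the same feature columns
id,mean_band1,mean_band2,compactness
101,0.23,0.31,1.9
102,0.51,0.15,2.2
</pre></div>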
<p>The user has to provide the name of the column in the training data
that contains the class values (<em>train_class_column</em>), the prefix
of the columns that will contain the final class after classification
@@ -30,11 +34,12 @@ <h2>DESCRIPTION</h2>
(<em>output_prob_column</em> - see below).

<p>Different classifiers are proposed <em>classifiers</em>:
- k-nearest neighbor (knn and knn1 for k=1), support vector machine
- with a radial kernel (svmRadial), random forest (rf) and recursive
- partitioning (rpart). Each of these classifiers is tuned automatically
- throught repeated cross-validation. caret will automatically determine
- a reasonable set of values for tuning. See the
+ k-nearest neighbor (knn), support vector machine with a radial kernel
+ (svmRadial), support vector machine with a linear kernel (svmLinear), random
+ forest (rf), C5.0 (C5.0) and XGBoost (xgbTree) decision trees and recursive
+ partitioning (rpart). Each of these classifiers is tuned automatically through
+ repeated cross-validation. Caret will automatically determine a reasonable set
+ of values for tuning. See the
<a href="http://topepo.github.io/caret/modelList.html">caret webpage</a>
for more information about the tuning parameters for each classifier, and
more generally for the information about how caret works. By default, the
@@ -56,6 +61,34 @@ <h2>DESCRIPTION</h2>
tunegrids="{'svmRadial': 'sigma=c(0.01,0.05,0.1), C=c(1,16,128)', 'rf': 'mtry=c(3,10,20)'}"
</pre></div>

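<p>As a rough illustration of what the module does internally, the following
minimal R sketch (variable names are ours, not the module's) tunes an svmRadial
model over the grid from the example above using repeated cross-validation:

<div class="code"><pre>
library(caret)

# repeated k-fold cross-validation for tuning
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 10)

# the svmRadial tuning grid from the tunegrids example above
grid <- expand.grid(sigma = c(0.01, 0.05, 0.1), C = c(1, 16, 128))

# training_features / training_classes are assumed to hold the training data
svm_model <- train(x = training_features, y = as.factor(training_classes),
                   method = "svmRadial", trControl = ctrl, tuneGrid = grid)
</pre></div>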
+ <p>Tuning is potentially very time consuming. Using only a subset of the training
+ data for tuning can thus speed up the process significantly, without losing
+ much quality in the tuning results. For training, depending on the number of
+ features used, some R functions can reach their capacity limit.
+ The user can, therefore, define a maximum size of samples per class both for
+ tuning (<em>tuning_sample_size</em>) and for training
+ (<em>training_sample_size</em>).
+
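<p>A minimal base-R sketch of such per-class capping (illustrative only; the
module's internal implementation may differ):

<div class="code"><pre>
# keep at most max_n randomly sampled rows per class
cap_per_class <- function(df, class_col, max_n) {
    groups <- split(df, df[[class_col]])
    capped <- lapply(groups,
                     function(g) g[sample(nrow(g), min(nrow(g), max_n)), , drop = FALSE])
    do.call(rbind, capped)
}

# e.g. tuning_sample_size=1000
tuning_data <- cap_per_class(training_data, "class", 1000)
</pre></div>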
+ <p>Classifying using too many features (i.e., variables describing the objects
+ to be classified) as input can have negative effects on classification accuracy
+ (Georganos et al., 2018). The module therefore provides the possibility to run a
+ feature selection algorithm on the training data in order to identify those
+ features that are the most efficient for classification. Using fewer features
+ also speeds up the tuning, training and classification processes. To activate
+ feature selection, the user has to set the <em>max_features</em> parameter to
+ the maximum number of features that the model should select. Often, fewer than
+ this maximum will be selected. The method used for feature selection is recursive
+ feature elimination based on a random forest model. Note that feature selection
+ might thus be sub-optimal for other classifiers, notably non-tree-based ones.
+
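<p>In caret terms, this corresponds roughly to the following sketch (again with
illustrative variable names; <em>max_features</em> mirrors the module parameter):

<div class="code"><pre>
library(caret)

# recursive feature elimination driven by a random forest model
rfe_ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
selection <- rfe(x = training_features, y = as.factor(training_classes),
                 sizes = 1:max_features, rfeControl = rfe_ctrl)

predictors(selection)  # the features retained by the elimination
</pre></div>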
- <p>The module can be run only for tuning and training a model, but without
+ <p>Optionally, the module can be run for tuning and training only,
+ i.e., no prediction (<em>-t flag</em>). Any trained model can be saved to a
+ file (<em>output_model_file</em>), which can then be read into the module
+ at a later stage for the prediction step (<em>input_model_file</em>). This can be
+ particularly useful for cluster computing approaches where a trained model
+ can be applied to different datasets in parallel.
+
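<p>In R, saving and re-loading a trained model can be as simple as the following
sketch (the module's actual serialization may differ):

<div class="code"><pre>
saveRDS(svm_model, "model.rds")    # after tuning/training, e.g. with the -t flag
svm_model <- readRDS("model.rds")  # later, on another node, before prediction
</pre></div>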
<p>The module can run the model tuning using parallel processing. In order
for this to work, the R package <em>doParallel</em> has to be installed. The
<em>processes</em> parameter allows choosing the number of processes to use.
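<p>A minimal sketch of how a doParallel backend is registered (the number of
workers corresponding to the <em>processes</em> parameter):

<div class="code"><pre>
library(doParallel)

cl <- makeCluster(4)      # e.g. processes=4
registerDoParallel(cl)
# caret::train() now distributes resampling across the workers
stopCluster(cl)
</pre></div>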
@@ -83,13 +116,19 @@ <h2>DESCRIPTION</h2>
<p>Optional outputs of the module include detailed information about the
different classifier models and their cross-validation results
<em>model_details</em> (for details of these results see the train,
- resamples and confusionMatrix.train functions in the caret package) a
+ resamples and confusionMatrix.train functions in the caret package), a
box-and-whisker plot indicating the resampling variance based on the
- cross-validation for each classifier (<em>bw_plot_file</em>) and a csv
+ cross-validation for each classifier (<em>bw_plot_file</em>), a csv
file containing accuracy measures (overall accuracy and kappa) for each
- classifier (<em>accuracy_file</em>). The user can also chose to write the
- R script constructed and used internally to a text file for study or further
- modification.
+ classifier (<em>accuracy_file</em>), and a file containing variable importance
+ as determined by the classifier (for those classifiers that allow such
+ calculation). When the <em>-p</em> flag is given,
+ the module also provides probabilities per class for each classifier (at least
+ for those where caret can calculate such probabilities). This allows evaluating
+ the confidence of the classification of each object.
+ The user can also choose to
+ write the R script constructed and used internally to a text file for study or
+ further modification.
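<p>For a single caret model, such per-class probabilities correspond to
something like the following sketch (names are illustrative):

<div class="code"><pre>
probs <- predict(svm_model, newdata = object_features, type = "prob")
head(probs)  # one column per class; the per-row maximum reflects confidence
</pre></div>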

<h2>NOTES</h2>

@@ -143,7 +182,12 @@ <h2>EXAMPLE</h2>

<h2>REFERENCES</h2>

- <p>Moreno-Seco, F. et al. (2006), Comparison of Classifier Fusion Methods for Classification in Pattern Recognition Tasks. In D.-Y. Yeung et al., eds. Structural, Syntactic, and Statistical Pattern Recognition. Lecture Notes in Computer Science. Springer Berlin Heidelberg, pp. 705–713, <a href="http://dx.doi.org/10.1007/11815921_77">http://dx.doi.org/10.1007/11815921_77</a>.
+ <p>
+ <ul>
+ <li>Moreno-Seco, F. et al. (2006), Comparison of Classifier Fusion Methods for Classification in Pattern Recognition Tasks. In D.-Y. Yeung et al., eds. Structural, Syntactic, and Statistical Pattern Recognition. Lecture Notes in Computer Science. Springer Berlin Heidelberg, pp. 705–713, <a href="http://dx.doi.org/10.1007/11815921_77">http://dx.doi.org/10.1007/11815921_77</a>.</li>
+ <li>Georganos, S. et al. (2018), Less is more: optimizing classification performance through feature selection in a very-high-resolution remote sensing object-based urban application, GIScience and Remote Sensing, 55:2, 221-242, <a href="https://doi.org/10.1080/15481603.2017.1408892">https://doi.org/10.1080/15481603.2017.1408892</a>.</li>
+ </ul>
+

<h2>SEE ALSO</h2>

