There is currently a wide availability of data mining algorithms in big data, however, there are no specific metrics focused on tackling complexity and redundancy problems in large datasets. Thus, we propose to answer the following question: do classification algorithms need so much data?
To address this objective, we propose two metrics and the implementation of several state of the art metrics, adapting them to the big data problem. This repository is a open-source package that includes complexity metrics. This package includes two original proposals for studying the density and complexity of large datasets:
- ND: Neighborhood Density, this metric returns the percentual difference of the euclidean distance, calculated with all available data, and with half of the randomly chosen data.
- DTP: Decision Tree Progression, this metric returns the accuracy percentage difference by training a Decision Tree with the totality of the data, and discarding half of them randomly.
It also includes some state-of-the-art metrics [1], which have been designed and developed to run large datasets.
- F1: Maximum Fisher's discriminant ratio
- F2: Volume of overlapping region
- F3: Maximum individual feature efficiency
- F4: Collective feature efficiency
- C1: Entropy of class portions
- C2: Imbalance ratio
If you want to know more about the metrics and experiments carried out and the conclusions obtained, please consult and cite the following reference
J. Maillo, I. Triguero and F. Herrera, "Redundancy and Complexity Metrics for Big Data Classification: Towards Smart Data," in IEEE Access, vol. 8, pp. 87918-87928 (2020) doi: 10.1109/ACCESS.2020.2991800.
[1] Lorena, A. C., Garcia, L. P., Lehmann, J., Souto, M. C., & Ho, T. K. (2019). How Complex Is Your Classification Problem?: A Survey on Measuring Classification Complexity. ACM Computing Surveys (CSUR), 52(5), 107.
The following software have to get installed:
- Scala. Version 2.11
- Spark. Version 2.3.2
- Maven. Version 3.5.2
- JVM. Java Virtual Machine. Version 1.8.0 because Scala run over it.
- Download source code: It is host on GitHub. To get the sources and compile them we will need the next git instruction.
git clone https://github.com/JMailloH/ComplexityMetrics.git
- Build jar file: Once we have the sources, we generate the .jar file of the project by the next maven instruction.
mvn package -Dmaven.test.skip=true.
Another alternative is download by spark-package at https://spark-packages.org/package/JMailloH/ComplexityMetrics
If you want to run the software and obtain the result of each one of the metrics, you can consult the example file: runMetrics.scala.
The run.sh file contains an example call for each algorithm. A generic sample of run could be:
spark-submit --master "URL" --class org.apache.spark.run.runMetrics ./target/ComplexityMetrics-1.0.jar "path-to-dataset" "path-to-output" "Metric" "number-of-maps"
--class org.apache.spark.run.runMetrics ./target/ComplexityMetrics-1.0.jar
Determine the jar file to be run."path-to-dataset"
Path from HDFS to dataset."path-to-output"
Path from HDFS to output."Metric"
ND, DTP or ClassicMetrics (to compute the rest of the metrics)."number-of-maps"
Number of map tasks.