Requires Jython and WEKA.
Uses the UCI Michalski and Chilausky soybean data set.
Originally developed for a class assignment.
- **setup.bat** Shows how to set up the classpath to use WEKA from Jython.
- **preprocess_soybeans.py** Pre-processes the soybean data set.
- **find_best_attributes.py** Finds the subset of attributes that gives the best classification accuracy for a given algorithm and data set.
- **arff.py** WEKA .arff file reader and writer.
- **split_data.py** Splits a WEKA .arff file so that the class distribution is preserved and the aggregate accuracy of a set of classifiers is maximized or minimized. The output is two WEKA .arff files.
- **find_soybean_split.bat / find_soybean_split.sh** Shows how to run split_data.py on a pre-processed soybean .arff file.
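
The idea behind split_data.py (a class-preserving split, chosen to maximize or minimize a score) can be illustrated with a small plain-Python sketch. This is not the code in split_data.py: the function names `stratified_split` and `best_split`, the `test_fraction` and `n_trials` parameters, and the random-restart search are all assumptions made for illustration, and the `score` callback stands in for whatever aggregate classifier accuracy the real script computes with WEKA.

```python
import random
from collections import defaultdict


def stratified_split(rows, class_index=-1, test_fraction=0.2, seed=0):
    """Split rows into (train, test), keeping each class's share roughly equal.

    Hypothetical helper: rows are sequences whose class label sits at
    class_index (the last column by default, as is common in .arff files).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[class_index]].append(row)

    train, test = [], []
    for label in sorted(by_class):
        group = by_class[label][:]
        rng.shuffle(group)
        # Take the same fraction of every class for the test set.
        n_test = int(round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def best_split(rows, score, n_trials=20, maximize=True, **split_args):
    """Try several stratified splits and keep the one with the best score.

    score(train, test) is supplied by the caller; in a real setup it might
    be the summed accuracy of several classifiers (an assumption here).
    Returns a (score, train, test) tuple.
    """
    best = None
    for seed in range(n_trials):
        train, test = stratified_split(rows, seed=seed, **split_args)
        s = score(train, test)
        if best is None or (s > best[0] if maximize else s < best[0]):
            best = (s, train, test)
    return best
```

Passing `maximize=False` would look for the hardest split instead of the easiest one, which matches the "maximize or minimize" wording above.
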
Results are in the data directory.
Example use of split_data.py
The batch and shell scripts find_soybean_split.bat and find_soybean_split.sh run split_data.py on soybean-large.data.missing.values.replaced.arff. This creates the training file soybean-large.data.missing.values.replaced.best.train.arff and the test file soybean-large.data.missing.values.replaced.best.test.arff. The classification results for this split are in soybean.split.results.txt, summarized below:
Classifier | Correct (out of 60) | Percentage Correct |
---|---|---|
NaiveBayes | 57 | 95 % |
J48 | 58 | 96.67 % |
BayesNet | 59 | 98.33 % |
RandomForest | 59 | 98.33 % |
JRip | 60 | 100 % |
KStar | 60 | 100 % |
SMO | 60 | 100 % |
MLP | 60 | 100 % |
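
The percentages in the table follow directly from the counts and the test-set size of 60. A short sketch that reproduces them (the `percent_correct` helper is hypothetical and not part of the repository):

```python
TEST_SIZE = 60  # number of test instances, from the table header

# Correct counts copied from the summary table.
correct_counts = [
    ("NaiveBayes", 57),
    ("J48", 58),
    ("BayesNet", 59),
    ("RandomForest", 59),
    ("JRip", 60),
    ("KStar", 60),
    ("SMO", 60),
    ("MLP", 60),
]


def percent_correct(n_correct, total=TEST_SIZE):
    """Return the percentage of correct predictions, rounded to 2 decimals."""
    return round(100.0 * n_correct / total, 2)


for name, n in correct_counts:
    print("%-12s %2d  %6.2f %%" % (name, n, percent_correct(n)))
```
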