data

Please visit this GitHub for more information.

ScalaTion Dataset Collection

This repository contains a collection of datasets used in ScalaTion. The datasets are organized into five subcategories:

  • analytics
  • graphanalytics
  • linalgebra
  • relalgebra
  • tableau

Datasets can be downloaded via the download.sh script in the main scalation data directory. See below for usage.

Installation

# If scalation is not installed
$ git clone https://github.com/scalation/scalation.git

# Go to the scalation data directory
$ cd scalation/data

# Execute the download script with no arguments to download all datasets
$ ./download.sh

# To download only a single category, pass it as an argument
$ ./download.sh linalgebra
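
After a download finishes, it can be useful to sanity-check a file against the #rows and #attrs columns in the table below. The following is a minimal sketch, not part of the repository: csv_shape is a hypothetical helper name, and it assumes a single-character delimiter (a comma unless one is passed as the second argument). Whether a given file has a header row, or counts the target column among its attributes, is not stated here, so treat the printed numbers as approximate.

```shell
# Hypothetical helper (not shipped with scalation/data):
# print "<non-empty lines> <fields on first line>" for a delimited text file.
# Assumes a comma delimiter unless a second argument is given.
csv_shape() {
    file=$1
    delim=${2:-,}
    # n counts non-empty lines; cols is the field count of the first such line.
    awk -F"$delim" 'NF > 0 { n++; if (n == 1) cols = NF } END { print n + 0, cols + 0 }' "$file"
}

# Example call, run from the scalation data directory after downloading:
# csv_shape analytics/regression/auto_mpg.csv
```

If the line count is one higher than the #rows value, the file probably has a header row. Compressed files (those ending in .gz) need to be decompressed before they can be checked this way.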

Regression Datasets

Name #rows #attrs Size Description Path
auto-mpg 392 8 0.02MB Auto-MPG Dataset from UCI analytics/regression/auto_mpg.csv
airfoil 1503 5 0.06MB Airfoil Self Noise Dataset from UCI(NASA) analytics/regression/airfoil/airfoil_self_noise.csv
concrete_compressive 1030 9 0.06MB Concrete Compressive Strength Dataset from UCI analytics/regression/concrete_compressive/Concrete_Data.csv
ccpp 9568 4 0.29MB Combined Cycle Power Plant Dataset from UCI analytics/regression/ccpp/Folds5x2_pp.csv
concrete_slump_1 103 10 0.004MB Concrete Slump Dataset from UCI (target: SLUMP) analytics/regression/concrete_slump/slump_test.csv
concrete_slump_2 103 10 0.004MB Concrete Slump Dataset from UCI (target: FLOW) analytics/regression/concrete_slump/slump_test.csv
concrete_slump_3 103 10 0.004MB Concrete Slump Dataset from UCI (target: Compressive Strength) analytics/regression/concrete_slump/slump_test.csv
nist_gauss_1 250 1 0.005MB NIST Gauss1 dataset. The data are two well-separated Gaussians on a decaying exponential baseline plus normally distributed zero-mean noise with variance = 6.25. analytics/regression/nist_gauss_1.csv
prostate 97 8 0.01MB R Prostate Cancer dataset analytics/regression/prostate.csv
kin8nm 8192 8 0.66MB kin8nm dataset from OpenML (https://www.openml.org/d/189) analytics/regression/dataset_2175_kin8nm.csv
computer_activity_1 8192 21 0.69MB Torgo Computer Activity Dataset analytics/regression/computer_activity/cpu_act.data
computer_activity_2 8192 12 0.43MB Torgo Computer Activity Dataset - Small version analytics/regression/computer_activity/cpu_small.data
wisconsin_breast 194 32 0.04MB Wisconsin Breast Cancer Dataset analytics/regression/wisconsin_breast_cancer/r_wpbc.data
auto_price 159 15 0.01MB Torgo Auto Price Dataset analytics/regression/auto_price/price.data
gym_crowdedness 62184 10 3.29MB Kaggle Campus Gym Crowdedness Dataset analytics/regression/gym_crowdedness.csv
forest_fire 517 12 0.02MB UCI Forest Fire Dataset analytics/regression/forest_fire/forestfires.csv
housing 506 13 0.04MB Boston Housing Dataset analytics/regression/housing/housing_fixed.csv
istanbul_stock 536 9 0.06MB UCI Istanbul Stock Exchange Dataset analytics/regression/data_akbilgic.csv
tecator_moisture 240 100 0.18MB OPENML Tecator Dataset(target: Moisture) analytics/regression/tecator/tecator_moisture.csv
tecator_fat 240 100 0.18MB OPENML Tecator Dataset(target: Fat) analytics/regression/tecator/tecator_fat.csv
tecator_protein 240 100 0.18MB OPENML Tecator Dataset(target: Protein) analytics/regression/tecator/tecator_protein.csv
bike_sharing_total_hour 17379 16 1.09MB UCI Bike Sharing Dataset Hourly Data Total Count
bike_sharing_total_day 731 15 0.05MB UCI Bike Sharing Dataset Daily Data Total Count
bng_breast 116640 9 6.32MB OPENML BNG Breast Tumor Dataset analytics/regression/BNG_breastTumor.csv
visualizing_soil 8641 4 0.20MB OPENML Visualizing Soil Dataset analytics/regression/visualizing_soil.csv
bank8fm 8192 8 0.59MB OPENML Customer Bank Selection Dataset analytics/regression/bank8fm.csv
abalone 4177 8 0.18MB Torgo Abalone Dataset analytics/regression/abalone/abalone.data
electricity_prices 37682 16 2.77MB OPENML ICON Electricity Challenge Dataset analytics/regression/electricity_prices/electricity_prices_nomissing.csv
casp 45730 9 3.37MB UCI Protein Tertiary Structure DataSet analytics/regression/CASP.csv
appliance_energy 19735 28 4.04MB UCI Appliance Energy DataSet analytics/regression/appliance_energy/energy_data_clean.csv
crime_norm 1993 100 0.90MB UCI Communities Crime(target: ViolentPerPop) DataSet analytics/regression/communities/communities.csv
parkinson_1 5875 18 0.78MB UCI Parkinson Telemonitoring Dataset(target: total) analytics/regression/parkinson/parkinsons_motor_updrs.csv
parkinson_2 5875 18 0.78MB UCI Parkinson Telemonitoring Dataset(target: motor) analytics/regression/parkinson/parkinsons_total_updrs.csv
servo 167 4 0.003MB UCI Servo Dataset analytics/regression/servo/servo.data.txt
student_1 395 29 0.04MB UCI Student Performance Dataset(target: mat) analytics/regression/student/student-mat.csv
student_2 649 29 0.06MB UCI Student Performance Dataset(target: por) analytics/regression/student/student-por.csv
yacht 308 6 0.01MB UCI Yacht Hydrodynamics Dataset analytics/regression/yacht_hydrodynamics.data
fb_metric_1 496 15 0.03MB UCI Facebook Metric Dataset(target: total) analytics/regression/fb/dataset_total.csv
fb_metric_2 496 15 0.03MB UCI Facebook Metric Dataset(target: like) analytics/regression/fb/dataset_like.csv
fb_metric_3 496 15 0.03MB UCI Facebook Metric Dataset(target: comment) analytics/regression/fb/dataset_comment.csv
fb_metric_4 496 15 0.03MB UCI Facebook Metric Dataset(target: share) analytics/regression/fb/dataset_share.csv
cars 1447 13 0.11MB Applied Predictive Modeling Cars Dataset(all) analytics/regression/cars/cars_all.csv
chick_weight 578 2 0.004MB R Caret Package Chick Weight Dataset analytics/regression/chick_weight.csv
life_cycle_savings 50 4 0.001MB R Caret Package Life Cycle Savings Dataset analytics/regression/life_cycle_savings.csv
hi 22272 11 1.05MB R Health Insurance Housewives Dataset analytics/regression/HI.csv
body_fat 252 17 0.02MB Bilkent Body Fat Dataset analytics/regression/body_fat.csv
fried 40768 10 2.55MB Bilkent Fried Dataset analytics/regression/fried.csv
plastic 1650 2 0.02MB Bilkent Plastic Dataset analytics/regression/plastic.csv
quake 2178 3 0.04MB Bilkent Quake Dataset analytics/regression/quake.csv
weather_1 1609 9 0.08MB Bilkent Weather Ankara Dataset analytics/regression/WA.dat
weather_2 1461 9 0.07MB Bilkent Weather Izmir Dataset analytics/regression/WI.dat
treasury 1049 15 0.09MB Bilkent Treasury Dataset analytics/regression/TR.dat
pwlinear 177147 10 5.58MB OPENML PWLinear Dataset analytics/regression/BNG_pwLinear.csv
puma32h 8192 32 2.40MB Torgo Puma32H Dataset analytics/regression/puma32H.csv
puma8nh 8192 8 0.66MB Torgo Puma8NH Dataset analytics/regression/puma8NH.csv
2dplanes 40768 10 1.25MB Torgo 2dplanes Dataset analytics/regression/2dplanes.csv
pol 15000 26 0.90MB OPENML Pole Telecom Dataset analytics/regression/pole_telecomm/pol_all.csv
solar 1066 9 0.02MB UCI Solar Flare Dataset analytics/regression/solar/flare.data2
qsar_47555 1158 51 0.12MB OPENML QSAR Dataset(47555) analytics/regression/qsar/qsar_47555.csv
qsar_31274 1189 132 0.31MB OPENML QSAR Dataset(31274) analytics/regression/qsar/clean_qsar_31274.csv
air 999249 18 11.55MB RITA Airline on-time Performance Dataset (1987 only) analytics/regression/air_1987_clean.csv.gz
buzz_toms 28179 96 1.53MB UCI Social Media Buzz Dataset - Toms Hardware analytics/regression/buzz/TomsHardware.data.gz
buzz_twitter 583250 77 31.76MB UCI Social Media Buzz Dataset - Twitter analytics/regression/buzz/Twitter.data.gz
qsar_47749 6003 610 7.02MB OPENML QSAR Dataset(47749) analytics/regression/qsar/qsar_47749.csv
olympic2000 66 11 0.00MB Olympic2000 Dataset from "Analyzing Categorical Data" analytics/regression/analcatdata_olympic2000.csv
qsar_191 4442 1023 8.71MB OPENML QSAR Dataset(191) analytics/regression/qsar/qsar_191.csv
qsar_33511 6003 420 4.85MB OPENML QSAR Dataset(33511) analytics/regression/qsar/clean_qsar_33511.csv
corn_m5spec_moisture 80 700 0.48MB NIR of Corn Samples for Standardization Benchmarking Dataset (Moisture) analytics/regression/corn/corn_m5spec_moisture.tsv
corn_m5spec_oil 80 700 0.48MB NIR of Corn Samples for Standardization Benchmarking Dataset (Oil) analytics/regression/corn/corn_m5spec_oil.tsv
corn_m5spec_protein 80 700 0.48MB NIR of Corn Samples for Standardization Benchmarking Dataset (Protein) analytics/regression/corn/corn_m5spec_protein.tsv
corn_m5spec_starch 80 700 0.48MB NIR of Corn Samples for Standardization Benchmarking Dataset (Starch) analytics/regression/corn/corn_m5spec_starch.tsv
qsar_12789 309 1024 0.62MB OPENML QSAR Dataset(12789) analytics/regression/qsar/qsar_12789.csv
energy_efficiency_1 768 9 0.04MB UCI Energy Efficiency Dataset(Heating Load) analytics/regression/energy_efficiency/ENB2012_data.csv
energy_efficiency_2 768 9 0.04MB UCI Energy Efficiency Dataset(Cooling Load) analytics/regression/energy_efficiency/ENB2012_data.csv
cbm_1 11934 14 1.17MB UCI CBM Dataset(Compressor) analytics/regression/cbm/data_compressor.csv
cbm_2 11934 14 1.17MB UCI CBM Dataset(Turbine) analytics/regression/cbm/data_turbine.csv
triazines 186 58 0.04MB Bilkent Triazines Dataset analytics/regression/TZ.dat
cars_kbb 804 17 0.04MB R Caret Package KBB Price Cars Dataset analytics/regression/cars_kbb.csv
chem 176 57 0.04MB Applied Predictive Modeling Chemical Manufacturing Dataset analytics/regression/chemical_manufacturing_process.csv
crime_unnorm_autoTheft 2211 102 1.14MB UCI Communities Crime(unnorm-autoTheft) DataSet analytics/regression/communities/unnorm/communities_autoTheft.csv
crime_unnorm_burgl 2211 102 1.14MB UCI Communities Crime(unnorm-burgl) DataSet analytics/regression/communities/unnorm/communities_burgl.csv
crime_unnorm_larc 2211 102 1.14MB UCI Communities Crime(unnorm-larc) DataSet analytics/regression/communities/unnorm/communities_larc.csv
crime_unnorm_nonViol 2117 102 1.09MB UCI Communities Crime(unnorm-nonViol) DataSet analytics/regression/communities/unnorm/communities_nonViol.csv
crime_unnorm_violent 1993 102 1.03MB UCI Communities Crime(unnorm-violent) DataSet analytics/regression/communities/unnorm/communities_violent.csv
crime_unnorm_total 1901 102 0.98MB UCI Communities Crime(unnorm-total) DataSet analytics/regression/communities/unnorm/communities_total.csv
crime_unnorm_arsons 2123 102 1.09MB UCI Communities Crime(unnorm-arsons) DataSet analytics/regression/communities/unnorm/communities_arsons.csv
crime_unnorm_assault 2201 102 1.13MB UCI Communities Crime(unnorm-assault) DataSet analytics/regression/communities/unnorm/communities_assault.csv
crime_unnorm_rapes 2006 102 1.03MB UCI Communities Crime(unnorm-rapes) DataSet analytics/regression/communities/unnorm/communities_rapes.csv
crime_unnorm_murd 2214 102 1.13MB UCI Communities Crime(unnorm-murd) DataSet analytics/regression/communities/unnorm/communities_murd.csv
crime_unnorm_robbb 2213 102 1.14MB UCI Communities Crime(unnorm-robbb) DataSet analytics/regression/communities/unnorm/communities_robbb.csv
ailerons 13750 40 2.31MB Ailerons Dataset analytics/regression/ailerons/ailerons_all.csv
elevators 16599 18 1.51MB Elevators Dataset analytics/regression/dataset_2202_elevators.csv
transcoding 68784 19 7.19MB UCI Video Transcoding Dataset analytics/regression/transcoding_measurement.csv
sol_1 1267 228 1.81MB Applied Predictive Modeling Solubility Dataset analytics/regression/solubility/sol.csv
sol_2 632 228 0.41MB Applied Predictive Modeling Solubility Dataset(trans) analytics/regression/solubility/solTrans.csv
blood_brain 208 127 0.18MB Applied Predictive Modeling Blood Brain Barrier Dataset analytics/regression/blood_brain.csv
aquatic_tox_1 322 23 0.03MB R QSARData Package Aquatic Toxicity Dataset(lcalc) analytics/regression/aquatic_tox/aquatic_tox_lcalc.csv
aquatic_tox_3 322 65 0.13MB R QSARData Package Aquatic Toxicity Dataset(moe3d) analytics/regression/aquatic_tox/aquatic_tox_moe3d.csv
aquatic_tox_4 319 48 0.07MB R QSARData Package Aquatic Toxicity Dataset(qprop) analytics/regression/aquatic_tox/aquatic_tox_qprop.csv
aquatic_tox_2 322 220 0.31MB R QSARData Package Aquatic Toxicity Dataset(moe2d) analytics/regression/aquatic_tox/aquatic_tox_moe2d.csv
cox2 462 205 0.49MB R Caret Package Cox2 Dataset analytics/regression/cox2.csv
melting_point 4401 203 6.97MB R QSARData Package Melting Point Dataset analytics/regression/melting_point.csv
aloi 108000 128 1.74MB OPENML Aloi Dataset analytics/regression/aloi.csv.gz
nci_60_90th_1 59 3489 1.20MB NCI-60 Dataset(target: KRT18) analytics/regression/nci-60/90th_1.csv
nci_60_90th_2 59 3489 1.20MB NCI-60 Dataset(target: KRT19) analytics/regression/nci-60/90th_2.csv
nci_60_90th_3 59 3489 1.20MB NCI-60 Dataset(target: KRT7) analytics/regression/nci-60/90th_3.csv
nci_60_90th_4 59 3489 1.20MB NCI-60 Dataset(target: TP53_26_GBL00064) analytics/regression/nci-60/90th_4.csv
nci_60_90th_5 59 3489 1.20MB NCI-60 Dataset(target: VASP) analytics/regression/nci-60/90th_5.csv
nci_60_90th_6 59 3489 1.20MB NCI-60 Dataset(target: MSN_4) analytics/regression/nci-60/90th_6.csv
nci_60_90th_7 59 3489 1.20MB NCI-60 Dataset(target: CDKN2A) analytics/regression/nci-60/90th_7.csv
nci_60_90th_8 59 3489 1.20MB NCI-60 Dataset(target: KRT8) analytics/regression/nci-60/90th_8.csv
nci_60_90th_9 59 3489 1.20MB NCI-60 Dataset(target: TP53_10_24342) analytics/regression/nci-60/90th_9.csv
qsar_36276 6003 39 0.88MB OPENML QSAR Dataset(36276) analytics/regression/qsar/qsar_36276.csv
qsar_47652 1731 83 0.29MB OPENML QSAR Dataset(47652) analytics/regression/qsar/qsar_47652.csv
blog_feedback 52397 142 21.81MB UCI Blog Feedback Dataset analytics/regression/blog_feedback/clean_blogData_train.csv
online_news_pop 39644 58 18.48MB UCI Mashable Online News Popularity Dataset analytics/regression/online_news_pop/OnlineNewsPopularity.csv
ct_slice 53500 379 16.88MB UCI CT Axis Prediction Dataset analytics/regression/ct_slice_localization_data.csv.gz
loan_default 105471 769 604.10MB Loan Default Prediction Dataset from Kaggle analytics/regression/loan_default_prediction/clean_train_v2_imputed.csv.gz