
Gump

an attempt to learn decision trees and random forests by implementing them myself.

explanation

basically, decision trees try to answer the question "which feature makes the biggest difference for predicting the target?"
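
one standard way to make "biggest difference" measurable is information gain: how much the entropy of the target drops after a split. here's a quick self-contained sketch (my own illustration, not necessarily how this repo does it):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """shannon entropy of a column of categorical target values."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, subsets):
    """parent entropy minus the size-weighted entropy of the subsets
    a candidate split would produce."""
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets)

# a perfectly separating split on a 50/50 target gains a full bit:
print(information_gain(["a", "a", "b", "b"], [["a", "a"], ["b", "b"]]))  # 1.0
```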

approach

my understanding of the algorithm(s): numbered items like 1. are necessary steps, while bullet points are different choices we could make.

  1. read the dataframe into a list of feature vectors and one target vector.
  2. look at each feature and measure the information gain we'd get by splitting on it.
  • the variable can be continuous or categorical, which changes the way we split.
    • on categorical variables we can split on every value or do a binary split.
    • on continuous variables we can find an optimal binary split value, or we can discretize the values, meaning we kinda get categorical buckets.
  3. find the feature with the highest-gain split, and perform that split.
  4. for each branch of that split (2 if binary, n if fully categorical) we get a filtered-down subset of the data above.
  5. repeat recursively (steps 2-5 are sketched in code below).
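
to make steps 2-5 concrete, here is a minimal sketch of the whole loop as i currently picture it. all the names (partition, best_split, build_tree, is_continuous) are made up for illustration; rows are plain lists of feature values, and leaves predict the majority class:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """same definition as in the earlier sketch."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def partition(rows, labels, feature, threshold):
    """steps 3-4: filter the data down into branches. a threshold gives a
    binary split on a continuous feature; threshold=None gives one branch
    per distinct value of a categorical feature."""
    branches = {}
    for row, label in zip(rows, labels):
        key = row[feature] if threshold is None else row[feature] <= threshold
        branches.setdefault(key, []).append((row, label))
    return branches

def best_split(rows, labels, is_continuous):
    """step 2: try every candidate split, keep the highest information gain."""
    n, base = len(labels), entropy(labels)
    best = (None, None, 0.0)  # (feature index, threshold, gain)
    for feature in range(len(rows[0])):
        if is_continuous[feature]:
            # candidate thresholds: midpoints between sorted unique values
            uniq = sorted({row[feature] for row in rows})
            thresholds = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
        else:
            thresholds = [None]  # a single multi-way split on every value
        for t in thresholds:
            branches = partition(rows, labels, feature, t)
            weighted = sum(len(b) / n * entropy([l for _, l in b])
                           for b in branches.values())
            if base - weighted > best[2]:
                best = (feature, t, base - weighted)
    return best

def build_tree(rows, labels, is_continuous):
    """step 5: split, then recurse on each branch's filtered-down subset.
    stops when a node is pure or no split improves it (see the note below
    on why you'd want extra stopping conditions)."""
    feature, threshold, gain = best_split(rows, labels, is_continuous)
    if feature is None or gain == 0.0:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    return {
        "feature": feature,
        "threshold": threshold,  # None marks a multi-way categorical split
        "branches": {
            key: build_tree([r for r, _ in b], [l for _, l in b], is_continuous)
            for key, b in partition(rows, labels, feature, threshold).items()
        },
    }
```

prediction would just walk the nested dicts with a row's feature values until it hits a leaf label.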

Note: make sure to have a stopping condition, so that splits don't result in too few datapoints.
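
those extra guards could look something like this (the default cutoffs are arbitrary picks, not recommendations):

```python
from collections import Counter

def should_stop(labels, depth, min_samples=5, max_depth=10):
    """extra stopping conditions to check before attempting a split."""
    return (
        len(labels) < min_samples   # too few datapoints to split meaningfully
        or depth >= max_depth       # cap the depth to limit overfitting
        or len(set(labels)) == 1    # node is already pure
    )

def leaf(labels):
    """majority vote among the datapoints that reached this node."""
    return Counter(labels).most_common(1)[0][0]
```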

open questions

  • does the target always have to be categorical?
  • binning continuous variables seems to be a bit of a choice, i.e. how fine-grained do we want to make the buckets
    • can this choice be optimized as well?
    • seems non-trivial, because a greedy algorithm might badly overfit at this step
