From 49061123ceb6d74f9a6d208ea9cb7f84c6d5c7d9 Mon Sep 17 00:00:00 2001 From: Okon Samuel <39421418+OkonSamuel@users.noreply.github.com> Date: Sun, 19 May 2024 21:12:48 +0100 Subject: [PATCH] Add documentation (#5) * add draft for RFE model * rename package * Add FeatureSelectore and some tests * fix current tests * complete RFE model and add tests * Update model docstring * fix code, Update readme and add more tests * Apply suggestions from code review Co-authored-by: Anthony Blaom, PhD * rename n_features_to_select to n_features * update readme with * Apply suggestions from code review Co-authored-by: Anthony Blaom, PhD * set max column limit to 92 in readme * add Aqua.jl tests and refactor code * update ci * Apply suggestions from code review Co-authored-by: Anthony Blaom, PhD * fix bug, add support for serialization and add more tests * Update ci.yml * Update ci.yml * Update ci.yml * Update ci.yml * Update ci.yml * add documentation * Disable julia Nighly tests --------- Co-authored-by: Anthony Blaom, PhD --- .github/workflows/ci_nightly.yml | 50 --------- Project.toml | 2 +- README.md | 99 +---------------- docs/.gitignore | 2 + docs/Project.toml | 11 ++ docs/make.jl | 34 ++++++ docs/src/api.md | 9 ++ docs/src/index.md | 185 +++++++++++++++++++++++++++++++ src/models/rfe.jl | 1 - 9 files changed, 243 insertions(+), 150 deletions(-) delete mode 100644 .github/workflows/ci_nightly.yml create mode 100644 docs/.gitignore create mode 100644 docs/Project.toml create mode 100644 docs/make.jl create mode 100644 docs/src/api.md create mode 100644 docs/src/index.md diff --git a/.github/workflows/ci_nightly.yml b/.github/workflows/ci_nightly.yml deleted file mode 100644 index dbb29cd..0000000 --- a/.github/workflows/ci_nightly.yml +++ /dev/null @@ -1,50 +0,0 @@ -name: CI (Julia nightly) -on: - pull_request: - branches: - - master - - dev - push: - branches: - - master - - dev - tags: '*' -env: - TEST_MLJBASE: "true" -jobs: - test: - name: Julia ${{ matrix.version }} - ${{ matrix.os }} - ${{ matrix.arch }} - ${{ github.event_name }} - runs-on: ${{ matrix.os }} - strategy: - fail-fast: false - matrix: - version: - - 'nightly' - os: - - ubuntu-latest - arch: - - x64 - steps: - - uses: actions/checkout@v2 - - uses: julia-actions/setup-julia@v1 - with: - version: ${{ matrix.version }} - arch: ${{ matrix.arch }} - - uses: actions/cache@v1 - env: - cache-name: cache-artifacts - with: - path: ~/.julia/artifacts - key: ${{ runner.os }}-test-${{ env.cache-name }}-${{ hashFiles('**/Project.toml') }} - restore-keys: | - ${{ runner.os }}-test-${{ env.cache-name }}- - ${{ runner.os }}-test- - ${{ runner.os }}- - - uses: julia-actions/julia-buildpkg@v1 - - uses: julia-actions/julia-runtest@v1 - env: - JULIA_NUM_THREADS: 2 - - uses: julia-actions/julia-processcoverage@v1 - - uses: codecov/codecov-action@v1 - with: - file: lcov.info diff --git a/Project.toml b/Project.toml index 7473f24..5303a45 100644 --- a/Project.toml +++ b/Project.toml @@ -1,6 +1,6 @@ name = "FeatureSelection" uuid = "33837fe5-dbff-4c9e-8c2f-c5612fe2b8b6" -authors = ["Anthony D. Blaom "] +authors = ["Anthony D. Blaom ", "Samuel Okon importance pairs -``` -We can view the important features used by our model by inspecting the `fitted_params` -object. 
-```julia
-p = fitted_params(mach)
-p.features_left == [:x1, :x2, :x3, :x4, :x5]
-```
-We can also call the `predict` method on the fitted machine, to predict using a
-random forest regressor trained using only the important features, or call the `transform`
-method, to select just those features from some new table including all the original
-features. For more info, type `?RecursiveFeatureElimination` on a Julia REPL.
-
-Okay, let's say that we didn't know that our synthetic dataset depends on only five
-columns from our feature table. We could apply cross fold validation
-`StratifiedCV(nfolds=5)` with our recursive feature elimination model to select the
-optimal value of `n_features` for our model. In this case we will use a simple Grid
-search with root mean square as the measure.
-```julia
-rfe = RecursiveFeatureElimination(model = forest)
-tuning_rfe_model = TunedModel(
-    model = rfe,
-    measure = rms,
-    tuning = Grid(rng=rng),
-    resampling = StratifiedCV(nfolds = 5),
-    range = range(
-        rfe, :n_features, values = 1:10
-    )
-)
-self_tuning_rfe_mach = machine(tuning_rfe_model, X, y)
-fit!(self_tuning_rfe_mach)
-```
-As before we can inspect the important features by inspecting the object returned by
-`fitted_params` or `feature_importances` as shown below.
-```julia
-fitted_params(self_tuning_rfe_mach).best_fitted_params.features_left == [:x1, :x2, :x3, :x4, :x5]
-feature_importances(self_tuning_rfe_mach) # returns dict of feature => importance pairs
-```
-and call `predict` on the tuned model machine as shown below
-```julia
-Xnew = MLJ.table(rand(rng, 50, 10)) # create test data
-predict(self_tuning_rfe_mach, Xnew)
-```
-In this case, prediction is done using the best recursive feature elimination model gotten
-from the tuning process above.
-
-For resampling methods different from cross-validation, and for other
- `TunedModel` options, such as parallelization, see the
- [Tuning Models](https://alan-turing-institute.github.io/MLJ.jl/dev/tuning_models/) section of the MLJ manual.
-[MLJ Documentation](https://alan-turing-institute.github.io/MLJ.jl/dev/)
\ No newline at end of file
+Repository housing feature selection algorithms for use with the machine learning toolbox [MLJ](https://juliaai.github.io/MLJ.jl/dev/).
diff --git a/docs/.gitignore b/docs/.gitignore
new file mode 100644
index 0000000..264c13f
--- /dev/null
+++ b/docs/.gitignore
@@ -0,0 +1,2 @@
+Manifest.toml
+build/
\ No newline at end of file
diff --git a/docs/Project.toml b/docs/Project.toml
new file mode 100644
index 0000000..ca94df0
--- /dev/null
+++ b/docs/Project.toml
@@ -0,0 +1,11 @@
+[deps]
+Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
+MLJ = "add582a8-e3ab-11e8-2d5e-e98b27df1bc7"
+FeatureSelection = "33837fe5-dbff-4c9e-8c2f-c5612fe2b8b6"
+StableRNGs = "860ef19b-820b-49d6-a774-d7a799459cd3"
+
+[compat]
+Documenter = "^1.4"
+MLJ = "^0.20"
+StableRNGs = "^1.0"
+julia = "^1.0"
\ No newline at end of file
diff --git a/docs/make.jl b/docs/make.jl
new file mode 100644
index 0000000..e3aa86f
--- /dev/null
+++ b/docs/make.jl
@@ -0,0 +1,34 @@
+using Documenter, FeatureSelection
+
+makedocs(;
+    authors = """
+    Anthony D. Blaom ,
+    Sebastian Vollmer ,
+    Okon Samuel
+    """,
+    format = Documenter.HTML(;
+        prettyurls = get(ENV, "CI", "false") == "true",
+        edit_link = "dev"
+    ),
+    modules = [FeatureSelection],
+    pages = [
+        "Home" => "index.md",
+        "API" => "api.md"
+    ],
+    doctest = false, # don't run doctests here; they are run automatically in CI.
+    repo = Remotes.GitHub("JuliaAI", "FeatureSelection.jl"),
+    sitename = "FeatureSelection.jl",
+)
+
+# By default Documenter does not deploy docs for PRs.
+# This causes issues with how we're doing things and ends
+# up choking the deployment of the docs, so here we
+# force the environment to ignore this so that Documenter
+# does indeed deploy the docs:
+#ENV["GITHUB_EVENT_NAME"] = "pull_request"
+
+deploydocs(;
+    deploy_config = Documenter.GitHubActions(),
+    repo = "github.com/JuliaAI/FeatureSelection.jl.git",
+    push_preview = true
+)
\ No newline at end of file
diff --git a/docs/src/api.md b/docs/src/api.md
new file mode 100644
index 0000000..0321ede
--- /dev/null
+++ b/docs/src/api.md
@@ -0,0 +1,9 @@
+```@meta
+CurrentModule = FeatureSelection
+```
+# API
+## Models
+```@docs
+FeatureSelector
+RecursiveFeatureElimination
+```
\ No newline at end of file
diff --git a/docs/src/index.md b/docs/src/index.md
new file mode 100644
index 0000000..0221261
--- /dev/null
+++ b/docs/src/index.md
@@ -0,0 +1,185 @@
+# FeatureSelection
+
+FeatureSelection is a Julia package containing implementations of feature selection
+algorithms for use with the machine learning toolbox
+[MLJ](https://juliaai.github.io/MLJ.jl/dev/).
+
+## Installation
+In a running Julia session (version 1.6 or later), run
+```julia
+import Pkg;
+Pkg.add("FeatureSelection")
+```
+
+## Example Usage
+Let's build a supervised recursive feature eliminator with `RandomForestRegressor`
+from [DecisionTree.jl](https://github.com/JuliaAI/DecisionTree.jl) as our base model.
+But first we need a dataset to train on. We shall create a synthetic dataset, popularly
+known in the R community as the Friedman #1 dataset. Notice how the target vector for
+this dataset depends on only the first five columns of the feature table, so we expect
+recursive feature elimination to return those first five columns as the important
+features.
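+For reference, the target constructed below follows the noiseless form of the Friedman #1
+benchmark (stated here as an aside; the generating code below is the authoritative
+definition):
+```math
+y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5
+```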
+```@meta
+DocTestSetup = quote
+    using MLJ, FeatureSelection, StableRNGs
+    rng = StableRNG(10)
+    A = rand(rng, 50, 10)
+    X = MLJ.table(A) # features
+    y = @views(
+        10 .* sin.(
+            pi .* A[:, 1] .* A[:, 2]
+        ) .+ 20 .* (A[:, 3] .- 0.5).^2 .+ 10 .* A[:, 4] .+ 5 * A[:, 5]
+    ) # target
+    RandomForestRegressor = @load RandomForestRegressor pkg=DecisionTree
+    forest = RandomForestRegressor(rng=rng)
+    rfe = RecursiveFeatureElimination(
+        model = forest, n_features=5, step=1
+    ) # see the docstring for a description of the defaults
+    mach = machine(rfe, X, y)
+    fit!(mach)
+
+    rfe = RecursiveFeatureElimination(model = forest)
+    tuning_rfe_model = TunedModel(
+        model = rfe,
+        measure = rms,
+        tuning = Grid(rng=rng),
+        resampling = StratifiedCV(nfolds = 5),
+        range = range(
+            rfe, :n_features, values = 1:10
+        )
+    )
+    self_tuning_rfe_mach = machine(tuning_rfe_model, X, y)
+    fit!(self_tuning_rfe_mach)
+end
+```
+```@example example1
+using MLJ, FeatureSelection, StableRNGs
+rng = StableRNG(10)
+A = rand(rng, 50, 10)
+X = MLJ.table(A) # features
+y = @views(
+    10 .* sin.(
+        pi .* A[:, 1] .* A[:, 2]
+    ) .+ 20 .* (A[:, 3] .- 0.5).^2 .+ 10 .* A[:, 4] .+ 5 * A[:, 5]
+) # target
+```
+Now that we have our data, we can create our recursive feature elimination model and
+train it on our dataset:
+```@example example1
+RandomForestRegressor = @load RandomForestRegressor pkg=DecisionTree
+forest = RandomForestRegressor(rng=rng)
+rfe = RecursiveFeatureElimination(
+    model = forest, n_features=5, step=1
+) # see the docstring for a description of the defaults
+mach = machine(rfe, X, y)
+fit!(mach)
+```
+We can inspect the feature importances in two ways:
+```jldoctest
+julia> report(mach).ranking
+10-element Vector{Int64}:
+ 1
+ 1
+ 1
+ 1
+ 1
+ 2
+ 3
+ 4
+ 5
+ 6
+
+julia> feature_importances(mach)
+10-element Vector{Pair{Symbol, Int64}}:
+ :x1 => 6
+ :x2 => 5
+ :x3 => 4
+ :x4 => 3
+ :x5 => 2
+ :x6 => 1
+ :x7 => 1
+ :x8 => 1
+ :x9 => 1
+ :x10 => 1
+```
+Note that the two conventions run in opposite directions: a variable with a *lower* rank
+is more significant, while a variable with a *higher* feature importance is more
+important.
+
+We can view the important features used by our model by inspecting the `fitted_params`
+object.
+```jldoctest
+julia> p = fitted_params(mach)
+(features_left = [:x1, :x2, :x3, :x4, :x5],
+ model_fitresult = (forest = Ensemble of Decision Trees
+Trees: 100
+Avg Leaves: 25.26
+Avg Depth: 8.36,),)
+
+julia> p.features_left
+5-element Vector{Symbol}:
+ :x1
+ :x2
+ :x3
+ :x4
+ :x5
+```
+We can also call the `predict` method on the fitted machine, to predict using a random
+forest regressor trained using only the important features, or call the `transform`
+method, to select just those features from some new table including all the original
+features. A minimal sketch of both calls is given below. For more information, type
+`?RecursiveFeatureElimination` in the Julia REPL.
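+The sketch below is illustrative only and is not executed as part of this manual; it
+assumes the fitted machine `mach` and the generator `rng` defined above:
+```julia
+Xnew = MLJ.table(rand(rng, 5, 10)) # a new table with all ten original features
+predict(mach, Xnew)   # predictions from the forest trained on the five features left
+transform(mach, Xnew) # the same table, restricted to the selected features
+```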
+Okay, let's say that we didn't know that our synthetic dataset depends on only five
+columns of our feature table. We could apply cross-validation, here
+`StratifiedCV(nfolds=5)`, with our recursive feature elimination model to select the
+optimal value of `n_features`. In this case we will use a simple grid search with root
+mean squared error as the measure.
+```@example example1
+rfe = RecursiveFeatureElimination(model = forest)
+tuning_rfe_model = TunedModel(
+    model = rfe,
+    measure = rms,
+    tuning = Grid(rng=rng),
+    resampling = StratifiedCV(nfolds = 5),
+    range = range(
+        rfe, :n_features, values = 1:10
+    )
+)
+self_tuning_rfe_mach = machine(tuning_rfe_model, X, y)
+fit!(self_tuning_rfe_mach)
+```
+As before, we can inspect the important features via the object returned by
+`fitted_params`, or via `feature_importances`, as shown below.
+```jldoctest
+julia> fitted_params(self_tuning_rfe_mach).best_fitted_params.features_left
+5-element Vector{Symbol}:
+ :x1
+ :x2
+ :x3
+ :x4
+ :x5
+
+julia> feature_importances(self_tuning_rfe_mach)
+10-element Vector{Pair{Symbol, Int64}}:
+ :x1 => 6
+ :x2 => 5
+ :x3 => 4
+ :x4 => 3
+ :x5 => 2
+ :x6 => 1
+ :x7 => 1
+ :x8 => 1
+ :x9 => 1
+ :x10 => 1
+```
+We can also call `predict` on the tuned model machine, as shown below:
+```@example example1
+Xnew = MLJ.table(rand(rng, 50, 10)) # create test data
+predict(self_tuning_rfe_mach, Xnew)
+```
+In this case, prediction is done using the best recursive feature elimination model
+obtained from the tuning process above.
+
+For resampling methods other than cross-validation, and for other `TunedModel` options,
+such as parallelization, see the
+[Tuning Models](https://juliaai.github.io/MLJ.jl/dev/tuning_models/) section of the MLJ
+manual. See also the [MLJ documentation](https://juliaai.github.io/MLJ.jl/dev/).
+```@meta
+DocTestSetup = nothing
+```
\ No newline at end of file
diff --git a/src/models/rfe.jl b/src/models/rfe.jl
index 1b10260..db636b9 100644
--- a/src/models/rfe.jl
+++ b/src/models/rfe.jl
@@ -139,7 +139,6 @@ Xnew = MLJ.table(rand(rng, 50, 10));
 
 predict(mach, Xnew)
 ```
-
 """
 function RecursiveFeatureElimination(
     args...;