FEA Add Information Gain and Information Gain Ratio feature selection functions #28905
Open: StefanieSenger wants to merge 48 commits into scikit-learn:main from StefanieSenger:information_gain
Diff: +515 −13, showing changes from 44 of the 48 commits
Commits:
12300fd  Added IG and IGR feature selection functions (vpekar)
2ce8c92  Fixed a broken test (vpekar)
a0ca2f9  Merge branch 'master' into ig-and-igr-feature-selection (vpekar)
1ba5b75  Added an extra return var to conform to other feature selection funct… (vpekar)
e576fe0  Removed the pvals return param from mi function (vpekar)
97744fe  Dealing with functions that don't return pvals (vpekar)
2cda7af  Removed unused import (vpekar)
d7701f2  Renamed vars, using __future__.division (vpekar)
56eb381  Moved __future__.division (vpekar)
b4f02f8  Fixed import error (vpekar)
39053da  Merge branch 'master' into ig-and-igr-feature-selection (vpekar)
201abc4  Fixing flake8 errors (vpekar)
d453dae  Merge branch 'ig-and-igr-feature-selection' of https://github.com/vpe… (vpekar)
a7b663f  Added support for dense arrays for ig and igr, added formulas (vpekar)
4a6a849  Removed unused import (vpekar)
1eb379a  Removed unused import (vpekar)
f4f0517  Corrected IGR formula (vpekar)
6d55cea  Updated docstrings (vpekar)
6ad6f7d  Added info_gain and info_gain_ratio examples (vpekar)
ef48e09  Fixed PyFlakes errors (vpekar)
1deb585  Code refactoring, using safe_sparse_dot on all matrix types (vpekar)
3684364  Reverted feature_selection.rst (vpekar)
fc01086  Using max as the default globalization strategy (vpekar)
a966d1e  Updated docstrings and rst documentation (vpekar)
738afc2  Merge branch 'master' into ig-and-igr-feature-selection (vpekar)
8c2a41c  Docstrings: links only on titles (vpekar)
676bbdc  Refactored to calculate IGR inside _info_gain; added tests against ma… (vpekar)
30ff737  Removed IGR tests (vpekar)
1b76234  Added an example comparing different univariate feature selection fun… (vpekar)
b21c655  Removed IG and IGR from two examples (vpekar)
6aa3bde  Fixed PEP errors (vpekar)
01c1f5c  Fixed more PEP errors (vpekar)
50158f5  Using CountVectorizer in feature selection example; added chart for p… (vpekar)
0b1b9fa  merge with main after 7 years (StefanieSenger)
325cc87  update example (StefanieSenger)
def38dc  update test (StefanieSenger)
c18b303  sparse containers for testing (StefanieSenger)
973caab  error corrected docstrings (StefanieSenger)
2b9b6bd  added testing for aggretate={'mean', 'sum'} (StefanieSenger)
e43c2c5  Merge branch 'main' into information_gain (StefanieSenger)
506855c  add test for equally distributed classes (StefanieSenger)
8f97e01  unfunctional code removed (StefanieSenger)
4bcbf46  Merge branch 'main' into information_gain (StefanieSenger)
b6d0481  update changelog (StefanieSenger)
6bc738c  Apply suggestions from code review (StefanieSenger)
4d4b368  Merge branch 'main' into information_gain (StefanieSenger)
3718599  resolve merge conflict (StefanieSenger)
b7d25ac  delete classes.rst again (StefanieSenger)
@@ -38,7 +38,7 @@ See :ref:`array_api` for more details.

**Classes:**

-
-

Changelog
---------

@@ -54,6 +54,13 @@ Changelog
    :pr:`123456` by :user:`Joe Bloggs <joeongithub>`.
    where 123455 is the *pull request* number, not the issue number.

:mod:`sklearn.feature_selection`
................................

- |Feature| :func:`~feature_selection.info_gain` and
  :func:`~feature_selection.info_gain_ratio` can now be used for
  univariate feature selection. :pr:`28905` by :user:`Viktor Pekar <vpekar>`.

Thanks to everyone who has contributed to the maintenance and improvement of
the project since version 1.5, including:
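For orientation, the two scores named in the changelog entry can be sketched with their common textbook definitions: information gain as the reduction in class entropy once a feature's value is known, and information gain ratio as that gain divided by the entropy of the feature itself. This is a minimal NumPy sketch for a single binary feature, not the implementation in this pull request; the function names `entropy` and `info_gain_sketch` are hypothetical, and the PR's handling of sparse input and per-class aggregation is not reproduced here.

```python
import numpy as np


def entropy(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]
    return float(-np.sum(nonzero * np.log2(nonzero)))


def info_gain_sketch(present, y):
    """Return (IG, IGR) for one binary feature against class labels y.

    Hypothetical helper for illustration only; it does not mirror the
    signature of the functions added in the pull request.
    """
    present = np.asarray(present, dtype=bool)
    y = np.asarray(y)
    classes = np.unique(y)

    # H(C): entropy of the class distribution.
    h_class = entropy([np.mean(y == c) for c in classes])

    # H(C | feature): class entropy within each feature value,
    # weighted by how often that value occurs.
    h_cond = 0.0
    for mask in (present, ~present):
        if mask.any():
            h_cond += mask.mean() * entropy([np.mean(y[mask] == c) for c in classes])

    gain = h_class - h_cond

    # Split information: entropy of the feature itself.
    h_feature = entropy([present.mean(), 1.0 - present.mean()])
    ratio = gain / h_feature if h_feature > 0 else 0.0
    return gain, ratio


# A feature that separates the two classes perfectly, and one that does not.
print(info_gain_sketch([1, 1, 0, 0], [0, 0, 1, 1]))
print(info_gain_sketch([1, 0, 1, 0], [0, 0, 1, 1]))
```

Dividing by the feature's own entropy penalizes features whose gain comes mainly from splitting the data into many or very uneven groups, which is the usual motivation for reporting the ratio alongside the raw gain.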
examples/feature_selection/plot_compare_feature_selection.py (115 additions & 0 deletions)
@@ -0,0 +1,115 @@
Review comment on line 1: We will probably avoid adding a new example; instead, we should edit an existing one.
"""
=========================================
Comparison of feature selection functions
=========================================

This example illustrates the performance of different univariate feature selection
functions on a text classification task (the 20 newsgroups dataset).

The plot shows the accuracy of a multinomial Naive Bayes classifier as a function of the
number of best features selected for training it using four methods: chi-square,
information gain, information gain ratio and F-test. Kraskov et al.'s mutual information
based on k-nearest neighbor distances is too slow for this example and is therefore
excluded.
"""

# %%
# Load data
# =========
from sklearn.datasets import fetch_20newsgroups

remove = ("headers", "footers", "quotes")
data_train = fetch_20newsgroups(
    subset="train", categories=None, shuffle=True, random_state=42, remove=remove
)
data_test = fetch_20newsgroups(
    subset="test", categories=None, shuffle=True, random_state=42, remove=remove
)

# %%
# Train-test split
# ================
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer

y_train, y_test = data_train.target, data_test.target
categories = data_train.target_names  # for case categories == None

vectorizer = CountVectorizer(max_df=0.5, stop_words="english")
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)
feature_names = vectorizer.get_feature_names_out()
cutoffs = [
    int(x) for x in np.logspace(np.log10(1000.0), np.log10(X_train.shape[1]), num=10)
]


# %%
# Calculate accuracy of Naive Bayes classifier
# ============================================
import time

from sklearn import metrics
from sklearn.feature_selection import (
    SelectKBest,
    chi2,
    f_classif,
    info_gain,
    info_gain_ratio,
)
from sklearn.naive_bayes import MultinomialNB

results = {}

clf = MultinomialNB(alpha=0.01)

for func in [chi2, info_gain, info_gain_ratio, f_classif]:

    results[func.__name__] = []

    for k in cutoffs:

        # apply feature selection
        t0 = time.time()
        selector = SelectKBest(func, k=k)
        X_train2 = selector.fit_transform(X_train, y_train)
        X_test2 = selector.transform(X_test)
        duration = time.time() - t0

        # keep selected feature names
        feature_names2 = [feature_names[i] for i in selector.get_support(indices=True)]
        feature_names2 = np.asarray(feature_names2)

        # train and evaluate a classifier
        clf.fit(X_train2, y_train)
        pred = clf.predict(X_test2)
        score = metrics.accuracy_score(y_test, pred)

        results[func.__name__].append((score, duration))

# %%
# Plot results
# ============
import matplotlib.pyplot as plt

f, (ax1, ax2) = plt.subplots(2, sharex=True, figsize=(12, 8))
ax1.set_title("20 newsgroups dataset")

ax1.set_xlabel("#Features")
ax1.set_ylabel("Accuracy")
ax2.set_ylabel("Time, secs")
colors = "bgrcmyk"
plt.ticklabel_format(useOffset=False)

# use a separate loop variable so the `results` dict is not shadowed
for i, (name, func_results) in enumerate(results.items()):
    scores, durations = zip(*func_results)
    ax1.plot(cutoffs, scores, color=colors[i], label=name)
    ax2.plot(cutoffs, durations, color=colors[i], label=name)

ax1.grid(True)
ax2.grid(True)
ax1.legend(loc="best")
ax2.legend(loc="best")

_ = plt.show()
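The example relies on `SelectKBest` accepting any callable as its score function. As documented for scikit-learn, the callable takes `X` and `y` and returns either a pair `(scores, pvalues)` or a single array of scores, which is relevant for functions that produce no p-values. The sketch below illustrates that contract with a toy scorer; `variance_score` and the data are hypothetical stand-ins and are unrelated to the `info_gain` functions proposed in this pull request.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest


def variance_score(X, y):
    """Hypothetical score function: per-feature variance, ignoring y.

    It returns scores only, with no p-values.
    """
    return np.asarray(X, dtype=float).var(axis=0)


# Toy data: the first column is constant, the third varies the most.
X = np.array(
    [
        [0.0, 1.0, 10.0],
        [0.0, 2.0, 20.0],
        [0.0, 3.0, 30.0],
        [0.0, 4.0, 40.0],
    ]
)
y = np.array([0, 0, 1, 1])

selector = SelectKBest(variance_score, k=2)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)

print(kept)                # indices of the two highest-scoring columns
print(selector.pvalues_)   # None, because the scorer returned scores only
```

Because the selector only ranks features by the returned scores, any scorer with this shape can be swapped into the loop above without changing the rest of the example.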
You can also add a full stop on the line before.