# Dimensionality Reduction

## Feature Selection

| **Category**             | **Methods**                                |
|---------------------------|--------------------------------------------|
| **Measures of Association** | Covariance                                |
|                           | Pearson’s Correlation Coefficient          |
|                           | Spearman's Rank Correlation Coefficient    |
|                           | Mutual Information                         |

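As a rough illustration of how the four measures in the table differ, the sketch below scores one linear and one non-monotonic relationship on synthetic data. The data, the variable names, and the choice of `scipy.stats` for the correlation coefficients are assumptions made for this example only.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.feature_selection import mutual_info_regression

# Synthetic data (hypothetical): one linear and one quadratic dependence on x
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=500)
targets = {
    "linear": 2 * x + rng.normal(scale=0.5, size=500),
    "quadratic": x ** 2 + rng.normal(scale=0.5, size=500),
}

results = {}
for name, y in targets.items():
    results[name] = {
        "cov": np.cov(x, y)[0, 1],          # covariance (scale-dependent)
        "pearson": pearsonr(x, y)[0],       # linear association
        "spearman": spearmanr(x, y)[0],     # monotonic association
        "mi": mutual_info_regression(       # any statistical dependence
            x.reshape(-1, 1), y, random_state=0
        )[0],
    }
    print(name, {k: round(float(v), 2) for k, v in results[name].items()})
```

On the quadratic target, covariance and both correlation coefficients should stay close to zero while mutual information remains clearly positive, since only the latter captures non-monotonic dependence.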


## Feature Projection

| **Category**             | **Methods**                                |
|---------------------------|--------------------------------------------|
| **Manifold Learning**     | t-SNE (t-Distributed Stochastic Neighbor Embedding) |
|                           | UMAP (Uniform Manifold Approximation and Projection) |
| **Decomposition Techniques** | Principal Component Analysis (PCA)       |
|                           | Independent Component Analysis (ICA)       |
|                           | Singular Value Decomposition (SVD)         |
| **Discriminant Analysis** | Linear Discriminant Analysis (LDA)         |
|                           | Quadratic Discriminant Analysis (QDA)      |
|                           | Generalized Discriminant Analysis (GDA)    |
| **Matrix Factorization**  | Non-negative Matrix Factorization (NMF)    |
|                           | Sequential Non-negative Matrix Factorization |

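A minimal sketch of feature projection, using PCA from the table above as the example. The breast cancer dataset, the standardization step, and the choice of two components are assumptions for illustration, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# PCA is sensitive to feature scale, so standardize first
X_scaled = StandardScaler().fit_transform(X)

# Project the 30 original features onto 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X.shape, "->", X_2d.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

Unlike feature selection, the two output columns are linear combinations of all original features rather than a subset of them.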

# sklearn.feature_selection Techniques

## General Feature Selection
| **Technique**               | **Description**                                                                 |
|------------------------------|---------------------------------------------------------------------------------|
| `GenericUnivariateSelect`    | Univariate feature selector with configurable strategy.                        |
| `SelectFromModel`            | Meta-transformer for selecting features based on importance weights.           |
| `VarianceThreshold`          | Removes all low-variance features.                                             |
| `SequentialFeatureSelector`  | Transformer that performs sequential feature selection.                        |
| `SelectorMixin`              | Transformer mixin that performs feature selection given a support mask.        |

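Two of the selectors in the table above can be sketched as follows. The toy matrix, the iris dataset, and the k-nearest-neighbors estimator are illustrative assumptions; any estimator compatible with cross-validation could take its place.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector, VarianceThreshold
from sklearn.neighbors import KNeighborsClassifier

# VarianceThreshold: columns 0 and 3 are constant, so they are removed
X_toy = np.array([[0, 2, 0, 3],
                  [0, 1, 4, 3],
                  [0, 1, 1, 3]])
X_reduced = VarianceThreshold(threshold=0.0).fit_transform(X_toy)
print(X_reduced.shape)

# SequentialFeatureSelector: greedily add features one at a time,
# keeping the feature that most improves the cross-validated score
X_iris, y_iris = load_iris(return_X_y=True)
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",
    cv=5,
)
sfs.fit(X_iris, y_iris)
print(sfs.get_support())  # boolean mask over the 4 iris features
```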
---

## Recursive Feature Elimination (RFE)
| **Technique**    | **Description**                                                                           |
|-------------------|-------------------------------------------------------------------------------------------|
| `RFE`            | Feature ranking with recursive feature elimination.                                       |
| `RFECV`          | Recursive feature elimination with cross-validation to select features.                   |

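A hedged sketch of `RFE` in use: the estimator is refit repeatedly and the weakest feature is dropped on each round until the requested number remains. The logistic regression estimator, the scaling step, and the target of 5 features are assumptions chosen for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)

# Drop one feature per iteration (step=1) until 5 remain
rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    step=1,
)
rfe.fit(X_scaled, data.target)

# support_ is a boolean mask; ranking_ is 1 for every selected feature
print(data.feature_names[rfe.support_])
print(rfe.ranking_)
```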
---

## Statistical Feature Selection
| **Technique**     | **Description**                                                                          |
|--------------------|------------------------------------------------------------------------------------------|
| `SelectKBest`      | Select features according to the k highest scores.                                       |
| `SelectPercentile` | Select features according to a percentile of the highest scores.                        |
| `SelectFpr`        | Select p-values below alpha based on a false positive rate (FPR) test.                   |
| `SelectFdr`        | Select p-values for an estimated false discovery rate (FDR).                            |
| `SelectFwe`        | Select p-values corresponding to a family-wise error rate (FWE).                        |

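The selectors above differ mainly in how the cutoff is expressed. The sketch below applies three of them to the same data with the same score function; the dataset, `f_classif`, and the specific values of `k`, `percentile`, and `alpha` are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFpr, SelectKBest, SelectPercentile, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Keep a fixed number of features
X_kbest = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Keep a fixed share of features (top 20% by score)
X_pct = SelectPercentile(score_func=f_classif, percentile=20).fit_transform(X, y)

# Keep every feature whose p-value is below alpha
X_fpr = SelectFpr(score_func=f_classif, alpha=0.05).fit_transform(X, y)

print("original:        ", X.shape)
print("SelectKBest:     ", X_kbest.shape)
print("SelectPercentile:", X_pct.shape)
print("SelectFpr:       ", X_fpr.shape)
```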
---

## Statistical Tests
| **Test**                 | **Description**                                                                   |
|---------------------------|-----------------------------------------------------------------------------------|
| `chi2`                   | Compute chi-squared stats between each non-negative feature and class.            |
| `f_classif`              | Compute the ANOVA F-value for the provided sample.                                |
| `mutual_info_classif`    | Estimate mutual information for a discrete target variable.                       |
| `r_regression`           | Compute Pearson's r (correlation coefficient) for each feature and the target.    |
| `f_regression`           | Univariate linear regression tests returning F-statistic and p-values.            |
| `mutual_info_regression` | Estimate mutual information for a continuous target variable.                     |


In [None]:
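import sklearn.feature_selection
from sklearn.datasets import load_breast_cancer

# Example data; chi2 requires non-negative feature values
X, y = load_breast_cancer(return_X_y=True)

# Compute chi-squared statistics and p-values
chi2_stats, p_values = sklearn.feature_selection.chi2(X, y)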

'''
The chi-squared statistic measures how strongly a non-negative feature (e.g. counts or frequencies) depends on the class label.
Higher values indicate a stronger association between the feature and the target.
The p-value is the probability of a statistic at least this large if the feature and the target were independent.

Common p-value thresholds:
- p-value < 0.05  : statistically significant at the 5% level.
- p-value < 0.01  : strong evidence against independence.
- p-value < 0.001 : very strong evidence against independence.
A small p-value indicates dependence; it does not by itself measure how useful the feature is to a model.
'''

# Compute F-statistics and p-values for classification tasks (using f_classif)
f_stats, p_values = sklearn.feature_selection.f_classif(X, y)

'''
The F-test compares the variance between group means to the variance within groups.
If the between-group variance is much larger, the group means are likely to differ significantly from each other.
Higher F-values indicate a stronger relationship between the feature and the target.
'''


In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import chi2
import pandas as pd

# Load the breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

# Compute the chi-squared statistics and p-values
chi2_stats, p_values = chi2(X, y)

feature_names = load_breast_cancer().feature_names  # Get feature names
result_df = pd.DataFrame({                          # Create a DataFrame to display the feature names along with their chi-squared stats and p-values
    'Feature': feature_names,
    'Chi-squared Statistic': chi2_stats,
    'P-value': p_values
})

# Sort the results by Chi-squared statistic in descending order
result_df_sorted = result_df.sort_values(by=['Chi-squared Statistic','P-value'], ascending=[False,True])
result_df_sorted

In [None]:
from sklearn.feature_selection import f_classif
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load dataset
X, y = load_breast_cancer(return_X_y=True)

# Perform ANOVA F-test
f_stats, p_values = f_classif(X, y)

feature_names = load_breast_cancer().feature_names  # Get feature names
result_df = pd.DataFrame({                          # Create a DataFrame to display the feature names along with their F-statistics and p-values
    'Feature': feature_names,
    'F-statistic': f_stats,
    'P-value': p_values
})

# Sort the results by F-statistic in descending order
result_df_sorted = result_df.sort_values(by=['F-statistic','P-value'], ascending=[False,True])
result_df_sorted

### General Feature Selection

In [None]:
import sklearn.feature_selection
from sklearn.feature_selection import f_classif
# Create a GenericUnivariateSelect object
selector = sklearn.feature_selection.GenericUnivariateSelect(
    score_func=f_classif,  # Measures the relationship between each feature and the target using ANOVA F-value.
                           # Possible values for score_func:
                           # - `f_classif`: For classification tasks with continuous features (ANOVA F-test).
                           # - `chi2`: For classification tasks with non-negative features (e.g. counts or frequencies).
                           # - `mutual_info_classif`: For classification tasks, measures mutual information (scores only, no p-values).
                           # - `f_regression`: For regression tasks with continuous features (univariate linear regression F-test).
                           # - `mutual_info_regression`: For regression tasks, measures mutual information (scores only, no p-values).
    mode='percentile',     # Determines the strategy for selecting features.
                           # - 'percentile': Retain features based on their percentile score.
                           # - 'k_best': Retain the top `k` features.
                           # - 'fpr': Retain features with p-values below the threshold.
                           # - 'fdr': Select using False Discovery Rate.
                           # - 'fwe': Select using Family-Wise Error rate.
    param=1.0              # Controls the extent of feature selection based on `mode`:
                           #   - For 'percentile': Value is a float (0-100), the percentage of features to retain (1.0 keeps the top 1%).
                           #   - For 'k_best': Value is an integer, the number of top features to keep (or 'all').
                           #   - For 'fpr', 'fdr', or 'fwe': Value is a float, the significance level alpha.
                           #       fpr: 0.01 to 0.05 (typically 0.05).
                           #       fdr: 0.01 to 0.05 (typically 0.05).
                           #       fwe: 0.01 to 0.05 (typically 0.01).
)


from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import GenericUnivariateSelect, chi2
X, y = load_breast_cancer(return_X_y=True)
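print(X.shape)      # Shape before selection
transformer = GenericUnivariateSelect(chi2, mode='fpr', param=0.005)  # Keep features whose chi2 p-value is below 0.005
X_new = transformer.fit_transform(X, y)
print(X_new.shape)  # Shape after selection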


In [None]:
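from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# SelectFromModel with its default parameter values spelled out.
# LogisticRegression is an illustrative choice; any estimator that exposes
# `coef_` or `feature_importances_` after fitting can be used.
SelectFromModel_selection = SelectFromModel(
    estimator=LogisticRegression(max_iter=1000),
    threshold=None,            # float, str (e.g. "mean", "median", "1.25*mean"), or None
    prefit=False,              # True if the estimator has already been fitted
    norm_order=1,              # Norm applied to coef_ when it is 2-dimensional
    max_features=None,         # int, callable, or None for no upper limit
    importance_getter="auto",  # "auto", an attribute name, or a callable
)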


In [None]:
# VarianceThreshold
# SequentialFeatureSelector
# SelectorMixin