# dimReduction.ipynb
- Combines data preprocessing and feature selection for several datasets, using a Random Forest Classifier together with two dimensionality reduction techniques (PCA and Gaussian Random Projection).
- The notebook identifies the principal components of each dataset with PCA, reduces the dataset dimensions with Gaussian Random Projection, and performs feature selection with Recursive Feature Elimination with Cross-Validation (RFECV) wrapped around a Random Forest Classifier.

Note:
- Both PCA and Gaussian Random Projection output entirely new attributes, and neither is guaranteed to improve the model.
- Feature selection with RFECV can be used to get the names of the most relevant attributes (but can take a while).

In [None]:
from sklearn.random_projection import GaussianRandomProjection  # type: ignore
from sklearn.preprocessing import PolynomialFeatures  # type: ignore
from sklearn.ensemble import RandomForestClassifier  # type: ignore
from sklearn.feature_selection import RFECV  # type: ignore
from sklearn.decomposition import PCA  # type: ignore
from evaluator import ModelEvaluator
import sys, warnings

sys.path.append("../")
from datasets.setCreator import SetCreator
from datasets.setModifier import SetModifier

Purpose:  
Silences warnings for cleaner output, then loads the datasets and the dataset modifier used in the analysis.

In [None]:
warnings.filterwarnings("ignore")

setModifier = SetModifier()
setCreator = SetCreator()
dataset1 = setCreator.getSetList1()
dataset2 = setCreator.getSetList2()

PCA

Purpose:
- Performs Principal Component Analysis ([PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)) on each dataset.
- PCA is used to reduce the dimensionality of the dataset and to identify the components that explain the most variance in the data.

Pseudocode:
- Loop over each dataset.
- Create a PCA instance.
- Fit it to the training data of the current dataset.
- Get the components and the explained variance ratio of each principal component. With no `n_components` argument PCA keeps every component, so the explained variance ratio is what shows which ones matter.

In [None]:
for currData in dataset1:
    pca = PCA()
    pca.fit_transform(currData["train"])

    print(f"[SUCCESS] identified {len(pca.components_)} relevant components")
    print(pca.components_)
    print(pca.explained_variance_ratio_)
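
One way to read the explained variance ratio printed above is to accumulate it and keep the smallest number of components that reaches a chosen share of the variance. The sketch below is a minimal illustration on synthetic data, not part of the original notebook: the array `X` stands in for `currData["train"]`, and the 95% threshold is an assumed value.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for currData["train"]: 200 rows, 10 columns,
# built from only 3 underlying signals plus a little noise.
rng = np.random.default_rng(0)
signals = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = signals @ mixing + 0.01 * rng.normal(size=(200, 10))

pca = PCA().fit(X)

# Cumulative share of variance explained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative share reaches the assumed 95% threshold.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"{n_keep} of {len(cumulative)} components reach 95% of the variance")
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` to make a similar selection during fitting.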

Gaussian Projection

Purpose:
- Applies [Gaussian Random Projection](https://scikit-learn.org/stable/modules/random_projection.html) to reduce the dimensionality of the training data for each dataset.

Pseudocode:
- Loop through the datasets.
- Replace missing values in the training data with 0.
- Create a GaussianRandomProjection instance with a random state of 0 and 25 components.
- Fit it to the training data of the current dataset and apply the projection to reduce the dimensionality.

Note:
- The n_components parameter sets the number of dimensions in the reduced dataset.

In [None]:
for currData in dataset2:
    currData["train"] = currData["train"].fillna(0)
    gaussian_rnd_proj = GaussianRandomProjection(random_state=0, n_components=25)
    X_reduced = gaussian_rnd_proj.fit_transform(currData["train"])

    print(X_reduced)
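
To make the effect of `n_components` concrete, the following sketch projects a synthetic matrix and inspects the shapes involved. It is an illustration under assumed sizes (100 rows, 200 columns), with `X` standing in for `currData["train"]`; it also shows why the output columns no longer correspond to named input attributes.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))  # stand-in for currData["train"]

grp = GaussianRandomProjection(n_components=25, random_state=0)
X_reduced = grp.fit_transform(X)

# The row count is unchanged; the column count becomes n_components.
print(X.shape, "->", X_reduced.shape)

# components_ has shape (n_components, n_original_features): each output
# column is a random linear mix of all input columns.
print(grp.components_.shape)
is_linear_mix = np.allclose(X_reduced, X @ grp.components_.T)
print(is_linear_mix)
```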

### [From the book](https://www.amazon.ca/Cleaning-Data-Effective-Science-command-line/dp/1801071292/ref=sr_1_1?keywords=data+cleaning&sr=8-1)

This code snippet helps determine the names of the most relevant attributes in a model. It is resource-demanding, however, so it has been adapted into the scripts below.

```
for currData in dataset1:
    trainningData = currData["train"]
    trainY = trainningData["ergot_present_in_q4"]
    trainX = setModifier.rmErgotPredictors(trainningData)

    poly = dict()
    X_poly = dict()

    # build the polynomial feature expansions for degrees 2 through 5
    for n in [2, 3, 4, 5]:
        poly[n] = PolynomialFeatures(n)
        X_poly[n] = poly[n].fit_transform(trainX)

    model = RandomForestClassifier(n_estimators=100, max_depth=5, n_jobs=4, random_state=2)
    rfecv = RFECV(estimator=model, n_jobs=1)  # apply feature elimination/cross-validation to model

    # only the degree-2 expansion is used for selection, so fit once outside the loop
    best_feat = rfecv.fit(X_poly[2], trainY)
    X_support = X_poly[2][:, best_feat.support_]  # X_support now holds the best subset

    print(X_support.shape)  # tells you the best dimensions to use
    print(X_support)
```
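
Part of the cost of the snippet above comes from the polynomial expansion itself, because the number of generated columns grows quickly with the degree. The sketch below counts the output columns for an assumed input width of 10 features (a hypothetical value, not the width of the notebook's datasets) and checks the count against the binomial coefficient that applies when the default bias column is included.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 10                 # hypothetical width of trainX
X = np.zeros((1, n_features))   # one dummy row is enough to count columns

counts = {}
for degree in [2, 3, 4, 5]:
    counts[degree] = PolynomialFeatures(degree).fit_transform(X).shape[1]
    # With the default bias column, the count is C(n_features + degree, degree).
    assert counts[degree] == comb(n_features + degree, degree)
    print(f"degree {degree}: {n_features} -> {counts[degree]} columns")
```

Since RFECV refits the estimator many times while eliminating features, every extra column adds to the running time.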

RFECV, run exactly once

Purpose:  
- Perform feature selection using Recursive Feature Elimination with Cross-Validation (RFECV) in combination with a Random Forest Classifier. 
- The goal is to identify the best subset of features that optimizes the classifier's performance.

Pseudocode:  
- Loop through the datasets.
- Retrieve the training data and extract the target variable from it.
- Create a RandomForestClassifier instance (100 trees, maximum depth of 5, fixed random state).
- Create an instance of the [RFECV](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html) class, which performs Recursive Feature Elimination with Cross-Validation using the given Random Forest Classifier model as the estimator.
- Fit the RFECV model to the training data and retrieve the subset of features that RFECV selected as the best subset.

Note:   
- The RFECV process is run exactly once in the snippet below.
- The n_jobs parameter specifies the number of CPU cores to use during cross-validation.

<br>

**Using the snippet for feature analysis**:  
- Change the dataset to whichever dataset you'd like to use in the following line:  
```for currData in dataset1:```  
- Set the target in the following line (give the name of the column):  
```trainY = trainningData["ergot_present_in_q4"]```  
- Choose whether to remove all Ergot features or only the other predictors in the following line:  
```trainX = setModifier.rmErgotPredictors(trainningData)``` which can also be replaced with ```trainX = setModifier.rmErgotFeatures(trainningData)```

In [None]:
for currData in dataset1:
    trainningData = currData["train"]
    trainY = trainningData["ergot_present_in_q4"]
    trainX = setModifier.rmErgotPredictors(trainningData)

    try:
        model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)

        # apply feature elimination/cross-validation to model
        rfecv = RFECV(estimator=model, n_jobs=1)

        best_feat = rfecv.fit(trainX, trainY)

        # X_support now automatically holds the best subset
        X_support = trainX.loc[:, best_feat.support_]

        print("[SUCCESS]")
        print(X_support.shape)  # tells you the best dimensions to use

        for col in X_support.columns.tolist():
            print(col)
    except Exception as e:
        # report the failure instead of silently skipping the dataset
        print(f"[FAILURE] {e}")
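
The single-pass selection above can be tried on a small synthetic problem to see what a fitted RFECV exposes. This is a hedged sketch, not the notebook's data: the column names `feat_0` to `feat_7`, the dataset from `make_classification`, and the reduced settings (`n_estimators=20`, `cv=3`) are all assumptions chosen so that it runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for trainX / trainY with hypothetical column names.
X_arr, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

model = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0)
rfecv = RFECV(estimator=model, cv=3, n_jobs=1).fit(X, y)

# support_ is a boolean mask over the input columns.
selected = X.columns[rfecv.support_].tolist()
print(rfecv.n_features_, selected)

# ranking_ assigns 1 to every selected feature; larger values were dropped earlier.
ranking = pd.Series(rfecv.ranking_, index=X.columns).sort_values()
print(ranking)
```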

Purpose:  
- Perform feature selection using Recursive Feature Elimination with Cross-Validation (RFECV) in combination with a Random Forest Classifier. 
- The goal is to identify the best subset of features that optimizes the classifier's performance.

Pseudocode:  
- Loop through the datasets.
- Retrieve the training data and extract the target variable from it.
- Create a RandomForestClassifier instance (100 trees, maximum depth of 5, fixed random state).
- Create an instance of the [RFECV](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html) class, which performs Recursive Feature Elimination with Cross-Validation using the given Random Forest Classifier model as the estimator.
- Fit the RFECV model to the training data and retrieve the subset of features that RFECV selected as the best subset.

Note:   
- The RFECV process is repeated until the selected feature set no longer changes between two iterations (this can take a while).
- The n_jobs parameter specifies the number of CPU cores to use during cross-validation.

<br>

**Using the snippet for feature analysis**:  
- Change the dataset to whichever dataset you'd like to use in the following line:  
```for currData in dataset2:```  
- Set the target in the following line (give the name of the column):  
```trainY = trainningData["ergot_present_in_q4"]```  
- Choose whether to remove all Ergot features or only the other predictors in the following line:  
```trainX = setModifier.rmErgotFeatures(trainningData)``` which can also be replaced with ```trainX = setModifier.rmErgotPredictors(trainningData)```

In [None]:
for currData in dataset2:
    trainningData = currData["train"]
    trainY = trainningData["ergot_present_in_q4"]
    trainX = setModifier.rmErgotFeatures(trainningData)

    # Controls the loop: starts as True, then is set by comparing the
    # selected feature subset against the current feature set.
    reducable = True

    try:
        while reducable:
            model = RandomForestClassifier(
                n_estimators=100, max_depth=5, random_state=0
            )

            # apply feature elimination/cross-validation to model
            rfecv = RFECV(estimator=model, n_jobs=1)

            best_feat = rfecv.fit(trainX, trainY)

            # X_support now automatically holds the best subset
            X_support = trainX.loc[:, best_feat.support_]

            # reduce the set to the subset proposed by the best features if we can
            if X_support.shape[1] < trainX.shape[1]:
                trainX = trainX[X_support.columns.tolist()]
            else:
                reducable = False

        print(f'[SUCCESS] reduced data in dataset: {currData["desc"]}')
        print(X_support.shape)  # tells you the best dimensions to use

        for col in X_support.columns.tolist():
            print(col)

        print()
    except Exception as e:
        # report the failure instead of silently skipping the dataset
        print(f"[FAILURE] {e}")
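
The repeat-until-stable loop above can also be written as a small helper, which makes the stopping condition explicit. The sketch below is an illustration on synthetic data rather than the notebook's method: `reduce_until_stable`, `make_selector`, the `max_rounds` cap, and the `feat_*` column names are hypothetical names introduced here, and the forest and cross-validation settings are reduced so that it runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV


def reduce_until_stable(X, y, make_selector, max_rounds=10):
    """Refit a selector on its own output until no more columns are dropped.

    `max_rounds` is a safety cap added for this sketch; the notebook loop has none.
    """
    for _ in range(max_rounds):
        if X.shape[1] == 1:
            break  # nothing left to eliminate
        selector = make_selector().fit(X, y)
        kept = X.columns[selector.support_]
        if len(kept) == X.shape[1]:
            break  # fixed point: the selector kept every column
        X = X[kept]
    return X.columns.tolist()


def make_selector():
    # A fresh, unfitted RFECV for each round.
    model = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0)
    return RFECV(estimator=model, cv=3, n_jobs=1)


# Synthetic stand-in for trainX / trainY with hypothetical column names.
X_arr, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

final_columns = reduce_until_stable(X, y, make_selector)
print(len(final_columns), final_columns)
```

Capping the number of rounds bounds the running time at the cost of possibly stopping before the feature set has settled.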