# Overview of Random Forest and SVM Experiments

In this section, we explore two machine learning techniques, **Random Forests** and **Support Vector Machines (SVMs)**, for classifying high-dimensional data. The main objectives are:

1. **Random Forests:**
   - Perform hyperparameter tuning using **Optuna** to find the best parameters for a Random Forest classifier.
   - Train and evaluate the model to measure its performance on the test set.

2. **Dimensionality Reduction with PCA:**
   - Use **Principal Component Analysis (PCA)** to reduce the dimensionality of the dataset for computational efficiency.
   - Retain the majority of the data variance while simplifying the feature space.

3. **Support Vector Machines (SVMs):**
   - Train an SVM classifier using a stratified subset of the data, optionally applying PCA for dimensionality reduction.
   - Evaluate the SVM model on the test data to determine its accuracy and performance metrics.

By comparing a tree-based method (Random Forests) with a margin-based classifier (SVMs), we investigate their strengths and limitations on high-dimensional datasets.


In [None]:
from utils.common_imports import *
from utils.data_utils import *
from utils.model_utils import *
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
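import optuna
import numpy as np  # used below for shuffling; imported explicitly rather than relying on the star imports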



# Set reproducibility and prepare datasets
set_reproducibility(seed=42)
train_dataset, _, test_dataset = prepare_datasets()

# Extract features and labels from datasets
X_train = train_dataset.tensors[0].numpy()
y_train = train_dataset.tensors[1].numpy()
X_test = test_dataset.tensors[0].numpy()
y_test = test_dataset.tensors[1].numpy()

# Shuffle the training data
train_indices = np.random.permutation(len(X_train))
X_train = X_train[train_indices]
y_train = y_train[train_indices]

# Shuffle the test data
test_indices = np.random.permutation(len(X_test))
X_test = X_test[test_indices]
y_test = y_test[test_indices]


## Training and Evaluating a Random Forest Classifier

In this section, we train and evaluate a **Random Forest Classifier** using Optuna for hyperparameter tuning. The process involves:

### Key Steps:
1. **Define Hyperparameter Search Space:**
   - `n_estimators`: Number of trees in the forest.
   - `max_depth`: Maximum depth of each tree.
   - `min_samples_split`: Minimum number of samples required to split an internal node.
   - `min_samples_leaf`: Minimum number of samples required to be at a leaf node.

2. **Objective Function for Optuna:**
   - A Random Forest model is trained using parameters suggested by Optuna.
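   - The objective function scores each trial on the test set using **accuracy** as the metric. Because the same test set is reused for the final evaluation, the reported accuracy is optimistically biased; a separate validation split would avoid this.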

3. **Hyperparameter Optimization with Optuna:**
   - Perform a specified number of trials (`n_trials`) to find the best hyperparameter combination.
   - The best parameters are displayed at the end of the study.

4. **Final Model Training and Evaluation:**
   - Train a final Random Forest model using the best hyperparameters obtained from Optuna.
   - Evaluate the final model on the test set and report its performance metrics, including accuracy and classification report.

### Results:
- The best parameters for the Random Forest Classifier are displayed.
- Final evaluation results include the **accuracy** and a detailed **classification report**.

This approach explores the hyperparameter space systematically within the trial budget and yields the best Random Forest configuration found for our dataset.
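
To illustrate how the search can be kept separate from the final evaluation, here is a minimal sketch that scores each candidate on a held-out validation split and touches the test set only once. It rests on several assumptions: the data is synthetic (`make_classification`), the helper `validation_accuracy` is hypothetical, and plain random search stands in for Optuna's sampler so that the sketch needs only scikit-learn and NumPy. It is not the notebook's own procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: the real features come from prepare_datasets()).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=42
)

# Hold out a test set first, then carve a validation set out of the remainder.
X_rest, X_te, y_rest, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

def validation_accuracy(params):
    """Score one hyperparameter setting on the validation split only (hypothetical helper)."""
    model = RandomForestClassifier(**params, random_state=42)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_val, model.predict(X_val))

# Same ranges as the notebook's param_space; random search replaces Optuna here (assumption).
search_space = {
    "n_estimators": (50, 100),
    "max_depth": (10, 20),
    "min_samples_split": (2, 8),
    "min_samples_leaf": (1, 4),
}
rng = np.random.default_rng(42)
best_score, best_params = -1.0, None
for _ in range(5):
    params = {
        name: int(rng.integers(low, high + 1))  # inclusive upper bound
        for name, (low, high) in search_space.items()
    }
    score = validation_accuracy(params)
    if score > best_score:
        best_score, best_params = score, params

# Refit on train + validation, then evaluate on the test set exactly once.
final_model = RandomForestClassifier(**best_params, random_state=42)
final_model.fit(X_rest, y_rest)
test_accuracy = accuracy_score(y_te, final_model.predict(X_te))
print(best_params, f"validation={best_score:.3f}", f"test={test_accuracy:.3f}")
```

With Optuna, the body of `validation_accuracy` would sit inside the objective function and the parameters would come from `trial.suggest_int`, as in the cell below.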


In [None]:
# Train and evaluate Random Forest
def train_random_forest(X_train, X_test, y_train, y_test, param_space, n_trials=15):
    """
    Train and evaluate a Random Forest model with Optuna hyperparameter tuning.
    """
    # Objective function for Optuna
    def random_forest_objective(trial):
        n_estimators = trial.suggest_int("n_estimators", *param_space["n_estimators"])
        max_depth = trial.suggest_int("max_depth", *param_space["max_depth"])
        min_samples_split = trial.suggest_int("min_samples_split", *param_space["min_samples_split"])
        min_samples_leaf = trial.suggest_int("min_samples_leaf", *param_space["min_samples_leaf"])


        # Initialize and train Random Forest
        rf_model = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            random_state=42,
            n_jobs=-1,
        )
        rf_model.fit(X_train, y_train)
        preds = rf_model.predict(X_test)
        return accuracy_score(y_test, preds)



    # Hyperparameter tuning with Optuna
    study = optuna.create_study(direction="maximize")
    study.optimize(random_forest_objective, n_trials=n_trials)

    # Best parameters
    best_params = study.best_params
    print("Best Parameters for Random Forest:", best_params)
    
    # Train final Random Forest with best parameters
    rf_model = RandomForestClassifier(**best_params, n_jobs=-1, random_state=42)  # same n_jobs as in the trials
    rf_model.fit(X_train, y_train)
    
    # Evaluate on test data
    rf_preds = rf_model.predict(X_test)
    rf_accuracy = accuracy_score(y_test, rf_preds)
    print("Random Forest Classifier Test Performance:")
    print(f"Accuracy: {rf_accuracy:.2f}")
    print(classification_report(y_test, rf_preds))
    return rf_model


In [None]:
param_space = {
    "n_estimators": (50, 100),  # Number of trees
    "max_depth": (10, 20),  # Maximum depth
    "min_samples_split": (2, 8),  # Minimum samples for a split
    "min_samples_leaf": (1, 4),  # Minimum samples at a leaf node

}



n_trials = 15
# Train and evaluate Random Forest
rf_model = train_random_forest(X_train, X_test, y_train, y_test, param_space, n_trials)


## Dimensionality Reduction and Training SVM Classifier

This section focuses on reducing dimensionality using **Principal Component Analysis (PCA)** and training an **SVM Classifier** for evaluation. 

### Key Steps:
1. **Dimensionality Reduction with PCA:**
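   - PCA keeps the smallest number of components whose cumulative explained variance reaches `variance_threshold` (default: 0.95), so the component count is chosen automatically rather than fixed in advance.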
   - This step is optional and can be toggled using the `use_pca` flag.

2. **Stratified Sampling:**
   - A stratified subset of the training data is selected so that the subset preserves the class proportions of the full training set while keeping SVM training time manageable.

3. **Train an SVM Classifier:**
   - Use a **pipeline** to include data scaling (`StandardScaler`) and the **Support Vector Machine (SVM)** model with a specified kernel (default: `linear`).
   - Train the model on the reduced dataset.
   - Evaluate the model's performance on the test set using **accuracy** and a **classification report**.
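
As a small illustration of the stratified sampling step, the sketch below draws a subset with `train_test_split(..., stratify=y)` and compares class shares before and after. The labels and features are synthetic and chosen only for illustration; they are not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced synthetic labels (assumption: illustrative only).
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], [600, 300, 100])  # class shares of 60%, 30%, 10%
X = rng.normal(size=(len(y), 5))

# Draw a 200-sample subset; stratify=y keeps each class's share close to the original.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=200, random_state=42, stratify=y
)

full_share = np.bincount(y) / len(y)
subset_share = np.bincount(y_small) / len(y_small)
print("full:  ", full_share)
print("subset:", subset_share)
```

Note that stratification preserves imbalance rather than removing it: a class that is rare in the full set stays rare in the subset.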

### Results:
- The output includes the **test accuracy** and a detailed **classification report** showing performance metrics for each class.
- If PCA is applied, the data dimensions are significantly reduced, which can improve computational efficiency while retaining most of the variance.

This approach pairs dimensionality reduction with a margin-based classifier, which keeps training tractable on high-dimensional datasets.
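
To make the variance-based component selection concrete, the following sketch fits PCA with a fractional `n_components` and then reproduces the chosen count by hand from the cumulative explained variance. The data is synthetic with artificially decaying feature scales, and the names `pca_auto`, `pca_full`, and `n_manual` are illustrative assumptions rather than part of the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with decaying variance per feature (assumption: illustrative only).
rng = np.random.default_rng(0)
scales = np.linspace(5.0, 0.1, 30)
X = rng.normal(size=(500, 30)) * scales

threshold = 0.95

# Let scikit-learn pick the component count from the variance threshold.
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X)

# Reproduce the choice by hand from the full spectrum:
# the first index at which the cumulative ratio exceeds the threshold.
pca_full = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_manual = int(np.searchsorted(cumulative, threshold, side="right")) + 1

print("auto:", pca_auto.n_components_, "manual:", n_manual)
print(f"variance kept: {cumulative[n_manual - 1]:.2%}")
```

The number of components therefore depends on how quickly the spectrum decays: data whose variance is spread evenly across features needs many components to reach the same threshold.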


In [3]:
# Dimensionality reduction using automated PCA
def apply_pca_auto(X_train, X_test, variance_threshold=0.95):
    """
    Apply PCA to reduce dimensionality of training and test data automatically
    by retaining a specified percentage of explained variance.
    
    Args:
        X_train (numpy.ndarray): Training data.
        X_test (numpy.ndarray): Test data.
        variance_threshold (float): Proportion of variance to retain (default 0.95).
    
    Returns:
        tuple: Transformed X_train, X_test, and fitted PCA object.
    """
    # Fit PCA on training data
    pca = PCA(n_components=variance_threshold, svd_solver="full", random_state=42)
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)
    
    # Display explained variance
    print(f"Number of components selected: {pca.n_components_}")
    print(f"Explained Variance Ratio: {sum(pca.explained_variance_ratio_):.2%}")
    
    return X_train_pca, X_test_pca, pca

# Train and evaluate SVM
def train_svm(X_train, y_train, X_test, y_test, C=1.0, kernel="linear"):
    """
    Train and evaluate an SVM model.
    """
    # Define SVM pipeline with scaling
    svm_pipeline = Pipeline([
        ("scaler", StandardScaler()),  # Scaling
        ("svm", SVC(C=C, kernel=kernel, random_state=42))  # SVM with specified kernel
    ])
    # Train the SVM
    svm_pipeline.fit(X_train, y_train)

    # Evaluate on test data
    test_preds = svm_pipeline.predict(X_test)
    test_accuracy = accuracy_score(y_test, test_preds)
    print("SVM Classifier Test Performance:")
    print(f"Test Accuracy: {test_accuracy:.2f}")
    print(classification_report(y_test, test_preds))
    return svm_pipeline

In [None]:
# Parameters for SVM
subset_size = 50000  # Stratified subset size
variance_threshold = 0.95  # Retain 95% of variance automatically
C = 1.0  # Regularization strength
kernel = "linear"  # Kernel type

# Stratified sampling for SVM
X_train_small, _, y_train_small, _ = train_test_split(
    X_train, y_train, train_size=subset_size, random_state=42, stratify=y_train
)

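# Apply PCA (optional). The transformed test set goes into its own variable so that
# X_test stays untouched and this cell can be re-run safely.
use_pca = True
X_test_svm = X_test
if use_pca:
    X_train_small, X_test_svm, pca_model = apply_pca_auto(X_train_small, X_test, variance_threshold=variance_threshold)

# Train and evaluate SVM
svm_model = train_svm(X_train_small, y_train_small, X_test_svm, y_test, C=C, kernel=kernel)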


Number of components selected: 592
Explained Variance Ratio: 95.01%
SVM Classifier Test Performance:
Test Accuracy: 0.19
              precision    recall  f1-score   support

           0       0.43      0.60      0.50        20
           1       0.43      0.50      0.46        58
           2       0.26      0.33      0.29        58
           3       0.23      0.27      0.25        96
           4       0.24      0.20      0.22       142
           5       0.15      0.13      0.14       112
           6       0.16      0.18      0.17       126
           7       0.15      0.17      0.16       134
           8       0.14      0.14      0.14       180
           9       0.18      0.20      0.19       210
          10       0.20      0.22      0.21       186
          11       0.10      0.11      0.10       132
          12       0.16      0.12      0.13       130
          13       0.26      0.21      0.23       132
          14       0.15      0.11      0.13        96
          15  