### **Hyperparameter Optimization: Random Forest using Randomized Search**

This notebook tunes the hyperparameters of a **Random Forest** classifier with **Randomized Search**. Instead of evaluating every combination in the search space, Randomized Search samples a fixed number of configurations and scores each one with cross-validation. The goal is to find a configuration that improves performance on the classification tasks listed below and to carry that configuration into subsequent analysis.

---

### **Workflow Overview**

1. **Hyperparameter Tuning for Random Forest**
   - Use **RandomizedSearchCV** to efficiently explore key hyperparameters:
     - `n_estimators`: Number of trees in the forest.
     - `max_depth`: Maximum depth of each tree.
     - `min_samples_split` and `min_samples_leaf`: Control tree complexity to prevent overfitting.
     - `class_weight`: Handle class imbalance effectively.
   - Perform tuning for:
     - **Multi-class classification**: Normal, Amplified, Deleted.
     - **Binary classification**: Normal vs Amplified.
   - Select the best combination of hyperparameters by cross-validated weighted F1-score, then evaluate it on a held-out test set.
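
The two tasks in the workflow differ only in which rows and labels are kept. The sketch below shows one way to derive both targets from a single `status` column. The toy DataFrame, the `feature_a` column, and the exact label strings (`'Normal'`, `'Amplified'`, `'Deleted'`) are assumptions for illustration; the real values in `combined_data` may differ.

```python
import pandas as pd

# Toy stand-in for the notebook's data; column names and label strings are assumed.
toy = pd.DataFrame({
    "feature_a": [0.1, 0.5, 0.9, 0.3, 0.7, 0.2],
    "status": ["Normal", "Amplified", "Deleted", "Normal", "Amplified", "Deleted"],
})

# Multi-class task: keep all three classes as they are.
X_multi = toy.drop(columns=["status"])
y_multi = toy["status"]

# Binary task: drop the 'Deleted' rows, then encode Amplified as 1 and Normal as 0.
binary = toy[toy["status"].isin(["Normal", "Amplified"])]
X_binary = binary.drop(columns=["status"])
y_binary = (binary["status"] == "Amplified").astype(int)

print(y_multi.value_counts())
print(y_binary.value_counts())
```

The same search setup can then be fitted once per task, passing either the multi-class or the binary feature/target pair.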

---

### Import libraries

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV, train_test_split

### Loading the Combined DataFrame

In [2]:
combined_data = pd.read_pickle('combined_data.pkl')
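
Because the search later relies on `class_weight` and a weighted F1 metric to cope with class imbalance, it can help to check how imbalanced the target actually is right after loading. The following is a minimal sketch on a toy frame; `class_balance` is a hypothetical helper, and the class counts are made up for illustration.

```python
import pandas as pd

def class_balance(frame: pd.DataFrame, target: str = "status") -> pd.DataFrame:
    """Return the count and fraction of rows per class in the target column."""
    counts = frame[target].value_counts()
    return pd.DataFrame({"count": counts, "fraction": counts / counts.sum()})

# Toy data with a deliberately skewed class distribution (not the real data).
toy = pd.DataFrame({"status": ["Normal"] * 6 + ["Amplified"] * 3 + ["Deleted"] * 1})

summary = class_balance(toy)
print(summary)
```

On the real data, the equivalent call would be `class_balance(combined_data)`, assuming the target column is named `status` as in the cells below.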

### Hyperparameter Tuning with RandomizedSearchCV

In [3]:
# Split data into features (X) and target (y)
X = combined_data.drop(columns=['status'])  # Features
y = combined_data['status']                # Target variable

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Define the reduced parameter distributions focused on avoiding overfitting
param_dist = {
    'n_estimators': [100, 200],                # Default = 100
    'max_depth': [None, 10, 20],              # Default = None
    'min_samples_split': [2, 5, 10],          # Default = 2
    'min_samples_leaf': [1, 2, 4],            # Default = 1
    'class_weight': ['balanced', 'balanced_subsample']  # Default = None (not included in this search)
}

# Initialize RandomForestClassifier
rf = RandomForestClassifier(random_state=42, n_jobs=-1)

# Set up RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=20,                # Sample 20 of the 108 possible combinations
    scoring='f1_weighted',    # Use F1-weighted metric for imbalanced data
    cv=5,                     # 5-fold cross-validation
    verbose=2,
    random_state=42,
    n_jobs=-1
)

# Fit RandomizedSearchCV to the training data
random_search.fit(X_train, y_train)

# Print the best parameters and their score
print("Best parameters found:")
print(random_search.best_params_)

print("\nBest cross-validated F1-weighted score:")
print(random_search.best_score_)

# Evaluate the best model on the test set
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)

print("\nClassification Report on Test Set:")
print(classification_report(y_test, y_pred))


Fitting 5 folds for each of 20 candidates, totalling 100 fits
[CV] END class_weight=balanced_subsample, max_depth=10, min_samples_leaf=1, min_samples_split=10, n_estimators=200; total time=24.9min
[CV] END class_weight=balanced_subsample, max_depth=10, min_samples_leaf=1, min_samples_split=10, n_estimators=200; total time=24.9min
[CV] END class_weight=balanced_subsample, max_depth=10, min_samples_leaf=1, min_samples_split=10, n_estimators=200; total time=24.9min
[CV] END class_weight=balanced_subsample, max_depth=10, min_samples_leaf=1, min_samples_split=10, n_estimators=200; total time=25.0min
[CV] END class_weight=balanced_subsample, max_depth=10, min_samples_leaf=1, min_samples_split=10, n_estimators=200; total time=25.0min
[CV] END class_weight=balanced, max_depth=None, min_samples_leaf=2, min_samples_split=10, n_estimators=100; total time=25.0min
[CV] END class_weight=balanced, max_depth=None, min_samples_leaf=2, min_samples_split=10, n_estimators=100; total time=25.1min
[CV] END 
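
The first log line ("Fitting 5 folds for each of 20 candidates, totalling 100 fits") follows directly from the search settings. The short sketch below reproduces that arithmetic and shows how much of the full grid the search covers. It copies `param_dist` from the cell above and uses scikit-learn's `ParameterGrid` only to count combinations; nothing is fitted.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as in the tuning cell above.
param_dist = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'class_weight': ['balanced', 'balanced_subsample'],
}

n_iter = 20  # candidates sampled by RandomizedSearchCV
cv = 5       # folds per candidate

n_combinations = len(ParameterGrid(param_dist))  # 2 * 3 * 3 * 3 * 2
total_fits = n_iter * cv
coverage = n_iter / n_combinations

print(f"Full grid size: {n_combinations}")  # 108
print(f"Cross-validation fits: {total_fits}")  # 100
print(f"Share of grid sampled: {coverage:.1%}")
```

Since every entry in `param_dist` is a list, `RandomizedSearchCV` samples candidates without replacement, so the 20 candidates are distinct combinations. An exhaustive grid search over the same space would need 540 fits instead of 100.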

### Result: the default parameters performed best for Random Forest in this search
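
One way to check a conclusion like this is to score an untuned `RandomForestClassifier` with the same cross-validation setup and compare it with the best score from the search. The sketch below does this on synthetic data from `make_classification`, so its numbers say nothing about the real dataset. The smaller search space, which adds `class_weight=None` so that the defaults are among the candidates, is an assumption made for illustration and differs from the space used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

# Synthetic, imbalanced 3-class problem standing in for the real data.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=42,
)

# Baseline: default hyperparameters, scored with the same metric and folds.
baseline = RandomForestClassifier(random_state=42)
baseline_f1 = cross_val_score(baseline, X, y, cv=3, scoring="f1_weighted").mean()

# Reduced, illustrative search space that includes the default values.
param_space = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 4],
    "class_weight": [None, "balanced"],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_space,
    n_iter=6,
    scoring="f1_weighted",
    cv=3,
    random_state=42,
)
search.fit(X, y)

print(f"Default parameters, CV F1-weighted: {baseline_f1:.3f}")
print(f"Best sampled candidate, CV F1-weighted: {search.best_score_:.3f}")
print("Best sampled parameters:", search.best_params_)
```

If the baseline score matches or exceeds `best_score_`, the defaults are at least as good as every sampled candidate. The per-candidate scores in `search.cv_results_` (for example `mean_test_score` and `std_test_score`) show whether the differences are larger than the fold-to-fold variation.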