# Task 3.8: Random Search for Random Forest Optimization

## Objective

The goal of this task is to implement **RandomizedSearchCV** for Random Forest with wider parameter ranges to find optimal hyperparameters. We will compare the optimized model's performance with the default parameters from Week 2 (Task 2.2) and document the performance improvements.

## Why RandomizedSearchCV over GridSearchCV?

| Aspect | GridSearchCV | RandomizedSearchCV |
|--------|--------------|--------------------|
| Search Strategy | Exhaustive (all combinations) | Random sampling |
| Computational Cost | High (grows exponentially with the number of parameters) | Lower (fixed by `n_iter`) |
| Parameter Space | Limited by time | Can explore wider ranges |
| Best for | Small parameter spaces | Large parameter spaces |
| Convergence | Guaranteed optimal in grid | May miss optimal, but often finds good solutions |

### Key Advantages of RandomizedSearchCV:

1. **Efficiency:** Can explore a much larger hyperparameter space in the same time
2. **Flexibility:** Supports continuous distributions, not just discrete values
3. **Scalability:** Number of iterations is independent of parameter space size
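4. **Often Sufficient:** Bergstra and Bengio (2012) showed that random search can match or beat grid search with fewer evaluations, because typically only a few hyperparameters strongly affect performance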
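
To make the cost difference concrete, the following sketch counts how many configurations an exhaustive grid would evaluate and contrasts that with a fixed random-sampling budget. It uses only the standard library; the grid values and the `n_iter` budget are illustrative assumptions, not the search space used later in this notebook.

```python
import math
import random

# Hypothetical grid, for illustration only
grid = {
    'n_estimators': [50, 100, 200, 300, 500],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10, 20],
    'max_features': ['sqrt', 'log2', None],
}

# Exhaustive search evaluates every combination: the product of the list lengths
grid_size = math.prod(len(values) for values in grid.values())

# Random search draws a fixed number of configurations, whatever the grid size
n_iter = 20
rng = random.Random(42)
samples = [{name: rng.choice(values) for name, values in grid.items()}
           for _ in range(n_iter)]

print(f"Grid search evaluations:   {grid_size}")
print(f"Random search evaluations: {len(samples)}")
```

Adding one more parameter with k candidate values multiplies the grid size by k, while the random-search budget stays at `n_iter`.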

## Understanding Random Forest Hyperparameters

We will tune the following hyperparameters:

1. **n_estimators:** Number of trees in the forest (more trees reduce variance with diminishing returns, at the cost of longer training and prediction)
2. **max_depth:** Maximum depth of each tree (controls complexity)
3. **min_samples_split:** Minimum samples required to split a node
4. **min_samples_leaf:** Minimum samples required at a leaf node
5. **max_features:** Number of features to consider for best split
6. **bootstrap:** Whether to use bootstrap samples
7. **criterion:** Function to measure split quality (gini vs entropy)

## Step 1: Environment Setup and Data Loading

We import the required libraries and load the preprocessed dataset.

### Libraries Used:
- **pandas & numpy:** Data manipulation and numerical operations
- **RandomForestClassifier:** The main algorithm from sklearn.ensemble
- **RandomizedSearchCV:** For random hyperparameter search
- **scipy.stats:** For defining continuous parameter distributions
- **sklearn.metrics:** For model evaluation
- **matplotlib & seaborn:** For visualizations
- **pickle:** For saving the optimized model

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, classification_report, confusion_matrix)
from sklearn.preprocessing import LabelEncoder
from scipy.stats import randint, uniform
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import time
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("Libraries imported successfully!")
print(f"Random State: {RANDOM_STATE}")

In [None]:
# Load data
X_train = pd.read_csv('../../data/processed/X_train_scaled.csv')
X_test = pd.read_csv('../../data/processed/X_test_scaled.csv')
y_train = pd.read_csv('../../data/processed/y_train.csv')
y_test = pd.read_csv('../../data/processed/y_test.csv')

# Remove ID columns if present
if 'id' in X_train.columns:
    X_train = X_train.drop('id', axis=1)
if 'id' in X_test.columns:
    X_test = X_test.drop('id', axis=1)
    
    
# Remove leaky features
leaky_features = [
    'price', 'price_normalized', 'price_per_person', 'price_per_bathroom',
    'price_per_bedroom', 'review_scores_rating', 'review_scores_value',
    'value_density', 'estimated_revenue_l365d'
]

cols_to_drop = [col for col in leaky_features if col in X_train.columns]
X_train = X_train.drop(columns=cols_to_drop)
X_test = X_test.drop(columns=cols_to_drop)

print(f"Dropped {len(cols_to_drop)} leaky features: {cols_to_drop}")
print(f"Remaining features: {X_train.shape[1]}")

# Encode target labels
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train['value_category'])
y_test_encoded = label_encoder.transform(y_test['value_category'])

print("="*60)
print("DATA SUMMARY")
print("="*60)
print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")
print(f"Number of features: {X_train.shape[1]}")
print(f"\nTarget Classes: {label_encoder.classes_}")
print(f"\nTarget distribution (Training):")
unique, counts = np.unique(y_train_encoded, return_counts=True)
for val, count in zip(unique, counts):
    category = label_encoder.classes_[val]
    print(f"  Class {val} ({category}): {count} samples ({count/len(y_train_encoded)*100:.2f}%)")
print("="*60)

## Step 2: Baseline Model (Default Parameters from Week 2)

First, let's establish our baseline by training a Random Forest with the default parameters used in Task 2.2. This gives us a reference point to measure improvement.

### Week 2 Default Parameters:
- n_estimators = 100
- max_depth = 8
- min_samples_split = 15
- min_samples_leaf = 8
- max_features = 'sqrt'

We'll train this model and record its performance metrics for comparison.

In [None]:
# Baseline model with Week 2 default parameters
print("Training Baseline Model (Week 2 Default Parameters)...")
print("-"*50)

baseline_params = {
    'n_estimators': 100,
    'max_depth': 8,          
    'min_samples_split': 15,  
    'min_samples_leaf': 8,    
    'max_features': 'sqrt',
    'random_state': RANDOM_STATE,
    'n_jobs': -1
}

print("Baseline Parameters:")
for param, value in baseline_params.items():
    print(f"  {param}: {value}")

# Train baseline model
start_time = time.time()
baseline_model = RandomForestClassifier(**baseline_params)
baseline_model.fit(X_train, y_train_encoded)
baseline_train_time = time.time() - start_time

# Evaluate baseline
y_train_pred_baseline = baseline_model.predict(X_train)
y_test_pred_baseline = baseline_model.predict(X_test)

baseline_metrics = {
    'train_accuracy': accuracy_score(y_train_encoded, y_train_pred_baseline),
    'test_accuracy': accuracy_score(y_test_encoded, y_test_pred_baseline),
    'precision': precision_score(y_test_encoded, y_test_pred_baseline, average='macro'),
    'recall': recall_score(y_test_encoded, y_test_pred_baseline, average='macro'),
    'f1_score': f1_score(y_test_encoded, y_test_pred_baseline, average='macro'),
    'training_time': baseline_train_time
}

print(f"\nBaseline Model Performance:")
print(f"  Training Accuracy: {baseline_metrics['train_accuracy']:.4f}")
print(f"  Testing Accuracy:  {baseline_metrics['test_accuracy']:.4f}")
print(f"  Precision (Macro): {baseline_metrics['precision']:.4f}")
print(f"  Recall (Macro):    {baseline_metrics['recall']:.4f}")
print(f"  F1-Score (Macro):  {baseline_metrics['f1_score']:.4f}")
print(f"  Training Time:     {baseline_metrics['training_time']:.2f} seconds")

## Step 3: Define Wide Parameter Ranges for RandomizedSearchCV

Now we define a much wider parameter space than what GridSearchCV could handle efficiently. RandomizedSearchCV allows us to explore this large space by randomly sampling combinations.

### Parameter Distributions Explained:

1. **n_estimators (50-300):**
   - More trees generally improve performance but increase computation
   - `randint(50, 301)` samples integers uniformly from 50 to 300 (the upper bound is exclusive)
   - The baseline uses 100; we explore a much wider range

2. **max_depth (3-12):**
   - Controls tree complexity; deeper trees can overfit
   - Sampled from the discrete list `[3, 4, 5, 6, 8, 10, 12]`; unlimited depth (`None`) is not included
   - The baseline uses 8; we explore both shallower and deeper trees

3. **min_samples_split (5-24):**
   - Minimum samples to split an internal node
   - Higher values prevent overfitting
   - `randint(5, 25)` samples integers from 5 to 24; the baseline uses 15

4. **min_samples_leaf (4-14):**
   - Minimum samples at a leaf node
   - Higher values create smoother decision boundaries
   - `randint(4, 15)` samples integers from 4 to 14; the baseline uses 8

5. **max_features ('sqrt', 'log2', 0.3, 0.5):**
   - Number of features to consider for best split
   - 'sqrt': sqrt(n_features), 'log2': log2(n_features)
   - A float is interpreted as a fraction of the features (here 30% or 50%)

6. **bootstrap (True):**
   - Whether to use bootstrap samples
   - True: sample with replacement (default)
   - Fixed at True in this search; False would train each tree on the entire dataset

7. **criterion ('gini', 'entropy'):**
   - Function to measure split quality
   - Gini: faster, Entropy: sometimes more accurate
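
One detail worth checking when defining integer ranges: `scipy.stats.randint(low, high)` excludes `high`, so the upper argument must be one more than the largest value you want sampled. The sketch below is a quick sanity check on the `randint(50, 301)` distribution used for `n_estimators` in this notebook; the sample size of 10,000 is an arbitrary choice.

```python
from scipy.stats import randint

# randint(low, high) draws integers from low to high - 1 (high is exclusive)
dist = randint(50, 301)
draws = dist.rvs(size=10_000, random_state=42)

print(f"Smallest draw: {draws.min()}")
print(f"Largest draw:  {draws.max()}")
```

Every draw falls between 50 and 300 inclusive, so the number of distinct values the parameter can take is 251, not 250.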

In [None]:
# Define wide parameter distributions
param_distributions = {
    'n_estimators': randint(50, 301),          
    'max_depth': [3, 4, 5, 6, 8, 10, 12],      
    'min_samples_split': randint(5, 25),        
    'min_samples_leaf': randint(4, 15),        
    'max_features': ['sqrt', 'log2', 0.3, 0.5], 
    'bootstrap': [True],                        
    'criterion': ['gini', 'entropy']
}

print("="*60)
print("PARAMETER SEARCH SPACE")
print("="*60)
print("\nParameter Distributions:")
print(f"  n_estimators:      randint(50, 301) - Uniform from 50 to 300")
print(f"  max_depth:         {param_distributions['max_depth']}")
print(f"  min_samples_split: randint(5, 25) - Uniform from 5 to 24")
print(f"  min_samples_leaf:  randint(4, 15) - Uniform from 4 to 14")
print(f"  max_features:      {param_distributions['max_features']}")
print(f"  bootstrap:         {param_distributions['bootstrap']}")
print(f"  criterion:         {param_distributions['criterion']}")

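# Calculate search space size as the product of the number of values per parameter:
# n_estimators (251) * max_depth (7) * min_samples_split (20) * min_samples_leaf (11)
# * max_features (4) * bootstrap (1) * criterion (2)
approx_space = 251 * 7 * 20 * 11 * 4 * 1 * 2
print(f"\nApproximate total combinations: {approx_space:,}")
print(f"GridSearchCV would need to evaluate all {approx_space:,} combinations!")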
print("RandomizedSearchCV will sample only a fraction of these.")
print("="*60)

## Step 4: Run RandomizedSearchCV

Now we run RandomizedSearchCV with the defined parameter distributions.

### Configuration:
- **n_iter=100:** Number of random parameter combinations to try
- **cv=5:** 5-fold cross-validation for robust evaluation
- **scoring='f1_macro':** Optimize for macro F1-score (balanced across classes)
- **n_jobs=-1:** Use all CPU cores for parallel processing
- **verbose=2:** Show progress during search
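
Macro averaging computes F1 separately for each class and then takes the unweighted mean, so a minority class counts as much as the majority class. The sketch below implements that definition in plain Python on made-up labels (the data and the `macro_f1` helper are illustrative, not part of the notebook) to show how macro F1 can diverge from accuracy.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical imbalanced labels: class 0 dominates
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Accuracy: {accuracy:.3f}")
print(f"Macro F1: {macro_f1(y_true, y_pred):.3f}")
```

A classifier that ignores the minority class still reaches 80% accuracy here, but its macro F1 is much lower because the F1 for class 1 is zero.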

### Why 100 iterations?
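- With 60 random draws there is roughly a 95% chance that at least one lands in the best 5% of the search space (1 - 0.95^60 ≈ 0.95)
- 100 iterations provide good coverage while remaining computationally feasible
- Each iteration involves 5-fold CV, so the search performs 500 model fits, plus one final refit of the best configuration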
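
The iteration budget can be reasoned about with a short calculation. If each random draw independently has a 5% chance of landing in the best 5% of the search space, the probability that at least one of n draws does so is 1 - 0.95^n. The sketch below evaluates this for a few budgets; the 5% target and the `prob_hit_top_fraction` helper are illustrative assumptions.

```python
def prob_hit_top_fraction(n_iter, top_fraction=0.05):
    """Probability that at least one of n_iter independent random draws
    lands in the best `top_fraction` of the search space."""
    return 1.0 - (1.0 - top_fraction) ** n_iter

for n in (10, 30, 60, 100):
    print(f"n_iter={n:>3}: P(at least one top-5% draw) = {prob_hit_top_fraction(n):.3f}")

# Fits during the search: one per (configuration, fold) pair
n_iter, cv_folds = 100, 5
search_fits = n_iter * cv_folds
print(f"Model fits during search: {search_fits}")
```

This bounds the chance of sampling a good region; it does not say how much better the top 5% of configurations are than the rest.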

In [None]:
# Initialize base model
rf_base = RandomForestClassifier(random_state=RANDOM_STATE, n_jobs=-1)

# Configure RandomizedSearchCV
n_iterations = 100  # Number of parameter combinations to try

random_search = RandomizedSearchCV(
    estimator=rf_base,
    param_distributions=param_distributions,
    n_iter=n_iterations,
    cv=5,
    scoring='f1_macro',
    n_jobs=-1,
    verbose=2,
    random_state=RANDOM_STATE,
    return_train_score=True
)

print("="*60)
print("STARTING RANDOMIZED SEARCH")
print("="*60)
print(f"Number of iterations: {n_iterations}")
print(f"Cross-validation folds: 5")
print(f"Total model fits: {n_iterations * 5}")
print(f"Scoring metric: F1-Score (Macro)")
print("="*60)

# Run the search
start_time = time.time()
random_search.fit(X_train, y_train_encoded)
search_time = time.time() - start_time

print(f"\nRandomized Search completed in {search_time:.2f} seconds ({search_time/60:.2f} minutes)")

## Step 5: Analyze Search Results

Let's examine the results of our RandomizedSearchCV to understand:
1. What are the best parameters found?
2. How do different parameter combinations perform?
3. Which parameters have the most impact on performance?

In [None]:
# Get best parameters and score
print("="*60)
print("BEST PARAMETERS FOUND")
print("="*60)

best_params = random_search.best_params_
best_cv_score = random_search.best_score_

print("\nOptimal Hyperparameters:")
for param, value in best_params.items():
    print(f"  {param}: {value}")

print(f"\nBest Cross-Validation F1-Score: {best_cv_score:.4f}")

# Compare with baseline parameters
print("\n" + "-"*60)
print("PARAMETER COMPARISON: Baseline vs Optimized")
print("-"*60)
print(f"{'Parameter':<25} {'Baseline':<15} {'Optimized':<15}")
print("-"*60)
print(f"{'n_estimators':<25} {baseline_params['n_estimators']:<15} {best_params['n_estimators']:<15}")
print(f"{'max_depth':<25} {baseline_params['max_depth']:<15} {str(best_params['max_depth']):<15}")
print(f"{'min_samples_split':<25} {baseline_params['min_samples_split']:<15} {best_params['min_samples_split']:<15}")
print(f"{'min_samples_leaf':<25} {baseline_params['min_samples_leaf']:<15} {best_params['min_samples_leaf']:<15}")
print(f"{'max_features':<25} {str(baseline_params['max_features']):<15} {str(best_params['max_features']):<15}")
print(f"{'bootstrap':<25} {'True':<15} {str(best_params['bootstrap']):<15}")
print(f"{'criterion':<25} {'gini':<15} {best_params['criterion']:<15}")
print("="*60)

In [None]:
# Create results DataFrame for analysis
cv_results = pd.DataFrame(random_search.cv_results_)

# Select relevant columns
results_summary = cv_results[[
    'param_n_estimators', 'param_max_depth', 'param_min_samples_split',
    'param_min_samples_leaf', 'param_max_features', 'param_bootstrap',
    'param_criterion', 'mean_test_score', 'std_test_score', 
    'mean_train_score', 'rank_test_score'
]].copy()

results_summary.columns = [
    'n_estimators', 'max_depth', 'min_samples_split', 'min_samples_leaf',
    'max_features', 'bootstrap', 'criterion', 'mean_cv_score', 
    'std_cv_score', 'mean_train_score', 'rank'
]

# Sort by rank
results_summary = results_summary.sort_values('rank')

print("\nTop 10 Parameter Combinations:")
print(results_summary.head(10).to_string(index=False))

# Save all results
results_summary.to_csv('../../outputs/rf_randomsearch_results.csv', index=False)
print("\nFull results saved to: ../../outputs/rf_randomsearch_results.csv")

## Step 6: Train Optimized Model and Evaluate

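`RandomizedSearchCV` refits the best configuration on the full training set by default (`refit=True`), so `best_estimator_` is already trained. We evaluate it on the held-out test set, which was not used during the search, to estimate performance on unseen data.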

In [None]:
# Get the best model (already trained)
optimized_model = random_search.best_estimator_

# Make predictions
y_train_pred_opt = optimized_model.predict(X_train)
y_test_pred_opt = optimized_model.predict(X_test)

# Calculate metrics for optimized model
optimized_metrics = {
    'train_accuracy': accuracy_score(y_train_encoded, y_train_pred_opt),
    'test_accuracy': accuracy_score(y_test_encoded, y_test_pred_opt),
    'precision': precision_score(y_test_encoded, y_test_pred_opt, average='macro'),
    'recall': recall_score(y_test_encoded, y_test_pred_opt, average='macro'),
    'f1_score': f1_score(y_test_encoded, y_test_pred_opt, average='macro')
}

print("="*60)
print("OPTIMIZED MODEL PERFORMANCE")
print("="*60)
print(f"  Training Accuracy: {optimized_metrics['train_accuracy']:.4f}")
print(f"  Testing Accuracy:  {optimized_metrics['test_accuracy']:.4f}")
print(f"  Precision (Macro): {optimized_metrics['precision']:.4f}")
print(f"  Recall (Macro):    {optimized_metrics['recall']:.4f}")
print(f"  F1-Score (Macro):  {optimized_metrics['f1_score']:.4f}")
print("="*60)

In [None]:
# Detailed classification report
print("\nDetailed Classification Report (Optimized Model):")
print("="*60)
print(classification_report(y_test_encoded, y_test_pred_opt, 
                            target_names=label_encoder.classes_))

## Step 7: Performance Improvement Analysis

Let's quantify the improvement achieved by RandomizedSearchCV optimization compared to the baseline model from Week 2.

In [None]:
# Calculate improvements
print("="*70)
print("PERFORMANCE IMPROVEMENT ANALYSIS: Baseline vs Optimized")
print("="*70)

metrics_comparison = {
    'Metric': ['Training Accuracy', 'Testing Accuracy', 'Precision (Macro)', 
               'Recall (Macro)', 'F1-Score (Macro)'],
    'Baseline': [
        baseline_metrics['train_accuracy'],
        baseline_metrics['test_accuracy'],
        baseline_metrics['precision'],
        baseline_metrics['recall'],
        baseline_metrics['f1_score']
    ],
    'Optimized': [
        optimized_metrics['train_accuracy'],
        optimized_metrics['test_accuracy'],
        optimized_metrics['precision'],
        optimized_metrics['recall'],
        optimized_metrics['f1_score']
    ]
}

comparison_df = pd.DataFrame(metrics_comparison)
comparison_df['Improvement'] = comparison_df['Optimized'] - comparison_df['Baseline']
comparison_df['Improvement (%)'] = (comparison_df['Improvement'] / comparison_df['Baseline'] * 100).round(2)

print("\n" + comparison_df.to_string(index=False))

# Summary
print("\n" + "="*70)
print("SUMMARY")
print("="*70)
test_acc_improvement = optimized_metrics['test_accuracy'] - baseline_metrics['test_accuracy']
f1_improvement = optimized_metrics['f1_score'] - baseline_metrics['f1_score']

print(f"\nTest Accuracy Improvement: {test_acc_improvement:.4f} ({test_acc_improvement/baseline_metrics['test_accuracy']*100:.2f}%)")
print(f"F1-Score Improvement:      {f1_improvement:.4f} ({f1_improvement/baseline_metrics['f1_score']*100:.2f}%)")
print(f"\nSearch Time: {search_time:.2f} seconds ({search_time/60:.2f} minutes)")
print(f"Iterations Evaluated: {n_iterations}")

# The search optimized macro F1, so judge success on that metric
if f1_improvement > 0:
    print("\n✓ RandomizedSearchCV improved test-set macro F1 over the baseline.")
else:
    print("\n✗ No test-set macro F1 gain; the baseline parameters may already be near-optimal for this dataset.")
print("="*70)

## Step 8: Visualizations

Let's create visualizations to better understand the optimization results and model performance.

In [None]:
# Create comprehensive visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Task 3.8: Random Forest RandomizedSearchCV Analysis', fontsize=16, fontweight='bold')

# 1. Performance Comparison Bar Chart
ax1 = axes[0, 0]
metrics_names = ['Test Accuracy', 'Precision', 'Recall', 'F1-Score']
baseline_values = [baseline_metrics['test_accuracy'], baseline_metrics['precision'], 
                   baseline_metrics['recall'], baseline_metrics['f1_score']]
optimized_values = [optimized_metrics['test_accuracy'], optimized_metrics['precision'],
                    optimized_metrics['recall'], optimized_metrics['f1_score']]

x = np.arange(len(metrics_names))
width = 0.35
bars1 = ax1.bar(x - width/2, baseline_values, width, label='Baseline (Week 2)', color='steelblue', alpha=0.8)
bars2 = ax1.bar(x + width/2, optimized_values, width, label='Optimized (RandomSearch)', color='darkorange', alpha=0.8)
ax1.set_ylabel('Score')
ax1.set_title('Performance Comparison')
ax1.set_xticks(x)
ax1.set_xticklabels(metrics_names, rotation=15)
ax1.legend()
ax1.set_ylim([min(baseline_values + optimized_values) - 0.05, 1.0])
for bar in bars1 + bars2:
    height = bar.get_height()
    ax1.annotate(f'{height:.3f}', xy=(bar.get_x() + bar.get_width()/2, height),
                 xytext=(0, 3), textcoords='offset points', ha='center', va='bottom', fontsize=8)

# 2. CV Score Distribution
ax2 = axes[0, 1]
ax2.hist(cv_results['mean_test_score'], bins=20, color='teal', alpha=0.7, edgecolor='black')
ax2.axvline(best_cv_score, color='red', linestyle='--', linewidth=2, label=f'Best: {best_cv_score:.4f}')
ax2.axvline(cv_results['mean_test_score'].mean(), color='orange', linestyle='--', linewidth=2, 
            label=f'Mean: {cv_results["mean_test_score"].mean():.4f}')
ax2.set_xlabel('CV F1-Score')
ax2.set_ylabel('Frequency')
ax2.set_title(f'Distribution of CV Scores ({n_iterations} iterations)')
ax2.legend()

# 3. n_estimators vs Score
ax3 = axes[0, 2]
scatter = ax3.scatter(cv_results['param_n_estimators'], cv_results['mean_test_score'], 
                      c=cv_results['mean_test_score'], cmap='viridis', alpha=0.6)
ax3.set_xlabel('n_estimators')
ax3.set_ylabel('CV F1-Score')
ax3.set_title('n_estimators vs Performance')
plt.colorbar(scatter, ax=ax3, label='CV Score')

# 4. max_depth vs Score
ax4 = axes[1, 0]
depth_scores = cv_results.groupby('param_max_depth')['mean_test_score'].mean()
depth_labels = [str(d) if d is not None else 'None' for d in depth_scores.index]
ax4.bar(range(len(depth_scores)), depth_scores.values, color='coral', alpha=0.8)
ax4.set_xticks(range(len(depth_scores)))
ax4.set_xticklabels(depth_labels, rotation=45)
ax4.set_xlabel('max_depth')
ax4.set_ylabel('Mean CV F1-Score')
ax4.set_title('max_depth Impact on Performance')

# 5. Confusion Matrix (Optimized Model)
ax5 = axes[1, 1]
cm = confusion_matrix(y_test_encoded, y_test_pred_opt)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax5,
            xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
ax5.set_xlabel('Predicted')
ax5.set_ylabel('Actual')
ax5.set_title('Confusion Matrix (Optimized Model)')

# 6. Feature Importance (Top 15)
ax6 = axes[1, 2]
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': optimized_model.feature_importances_
}).sort_values('importance', ascending=True).tail(15)

ax6.barh(feature_importance['feature'], feature_importance['importance'], color='seagreen', alpha=0.8)
ax6.set_xlabel('Importance')
ax6.set_title('Top 15 Feature Importances (Optimized)')

plt.tight_layout()
plt.savefig('../../outputs/figures/rf_randomsearch_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nVisualization saved to: ../../outputs/figures/rf_randomsearch_analysis.png")

## Step 9: Parameter Sensitivity Analysis

Let's analyze which parameters have the most impact on model performance. This helps understand which hyperparameters are most important to tune.

In [None]:
# Parameter sensitivity analysis
print("="*60)
print("PARAMETER SENSITIVITY ANALYSIS")
print("="*60)

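# Analyze each parameter's impact
param_impacts = {}

# Parameters drawn from randint take many distinct values, often one trial each,
# so bin them into quartiles before averaging scores per group
binned_params = ['param_n_estimators', 'param_min_samples_split', 'param_min_samples_leaf']

for param in ['param_n_estimators', 'param_max_depth', 'param_min_samples_split',
              'param_min_samples_leaf', 'param_max_features', 'param_bootstrap', 'param_criterion']:
    values = cv_results[param]
    if param in binned_params:
        values = pd.qcut(values.astype(int), q=4, duplicates='drop')
    # Group on the string form so mixed-type columns (e.g. max_features) group cleanly
    group_means = cv_results['mean_test_score'].groupby(values.astype(str)).mean()
    param_impacts[param.replace('param_', '')] = {
        'mean_range': group_means.max() - group_means.min(),
        'best_value': group_means.idxmax(),
        'best_score': group_means.max()
    }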

# Sort by impact
impact_df = pd.DataFrame(param_impacts).T
impact_df = impact_df.sort_values('mean_range', ascending=False)

print("\nParameter Impact Ranking (by score range):")
print("-"*60)
for idx, (param, row) in enumerate(impact_df.iterrows(), 1):
    print(f"{idx}. {param}")
    print(f"   Score Range: {row['mean_range']:.4f}")
    print(f"   Best Value: {row['best_value']}")
    print(f"   Best Mean Score: {row['best_score']:.4f}")
    print()

print("="*60)
print("\nInterpretation:")
print("- Higher 'Score Range' = More sensitive parameter (more important to tune)")
print("- Lower 'Score Range' = Less sensitive parameter (default may be fine)")
print("="*60)

## Step 10: Save Optimized Model and Results

Finally, we save the optimized model and all results for use in subsequent tasks (especially T3.13 SHAP Analysis).
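
A downstream task can restore the model with `pickle.load`. The round trip below is a sketch that uses a plain dictionary as a stand-in for the fitted model and a temporary directory instead of the notebook's `../../models/` path; both are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted estimator (hypothetical; the notebook saves optimized_model)
model_stand_in = {'name': 'rf_optimized_randomsearch', 'n_estimators': 100}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'rf_optimized_randomsearch.pkl')

    # Save, mirroring this step's save cell
    with open(model_path, 'wb') as f:
        pickle.dump(model_stand_in, f)

    # Load, as a later task would
    with open(model_path, 'rb') as f:
        restored = pickle.load(f)

print(restored == model_stand_in)
```

Pickled scikit-learn models should be loaded with the same scikit-learn version that produced them, and pickle files should only be loaded from trusted sources.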

In [None]:
# Save optimized model
with open('../../models/rf_optimized_randomsearch.pkl', 'wb') as f:
    pickle.dump(optimized_model, f)
print("Optimized model saved to: ../../models/rf_optimized_randomsearch.pkl")

# Save best parameters
best_params_df = pd.DataFrame([best_params])
best_params_df.to_csv('../../outputs/rf_best_params_randomsearch.csv', index=False)
print("Best parameters saved to: ../../outputs/rf_best_params_randomsearch.csv")

# Save performance comparison
comparison_df.to_csv('../../outputs/rf_performance_comparison.csv', index=False)
print("Performance comparison saved to: ../../outputs/rf_performance_comparison.csv")

# Save feature importance
full_feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': optimized_model.feature_importances_
}).sort_values('importance', ascending=False)
full_feature_importance.to_csv('../../outputs/rf_optimized_feature_importance.csv', index=False)
print("Feature importance saved to: ../../outputs/rf_optimized_feature_importance.csv")

# Save predictions
predictions_df = pd.DataFrame({
    'y_true': y_test_encoded,
    'y_pred': y_test_pred_opt,
    'y_true_label': label_encoder.inverse_transform(y_test_encoded),
    'y_pred_label': label_encoder.inverse_transform(y_test_pred_opt)
})
predictions_df.to_csv('../../outputs/rf_optimized_predictions.csv', index=False)
print("Predictions saved to: ../../outputs/rf_optimized_predictions.csv")

## Conclusion

### Summary of Task 3.8:

1. **Objective Achieved:** Successfully implemented RandomizedSearchCV for Random Forest with wider parameter ranges.

2. **Search Space:** Sampled 100 random configurations from a space of about 3.1 million possible combinations.

3. **Key Findings:**
   - Identified optimal hyperparameters through random search
   - Documented performance improvements over Week 2 default parameters
   - Analyzed parameter sensitivity to understand which hyperparameters matter most

4. **Files Generated:**
   - `rf_optimized_randomsearch.pkl` - Optimized model for T3.13 SHAP Analysis
   - `rf_randomsearch_results.csv` - All 100 iteration results
   - `rf_best_params_randomsearch.csv` - Best parameters found
   - `rf_performance_comparison.csv` - Baseline vs Optimized comparison
   - `rf_optimized_feature_importance.csv` - Feature importances
   - `rf_optimized_predictions.csv` - Model predictions
   - `rf_randomsearch_analysis.png` - Comprehensive visualization

