# Homework 5 RF Accuracy Improvement

This assignment is inspired by examples by Shan-Hung Wu of National Tsing Hua University.

Requirement: improve the accuracy per feature (mean cross-validated accuracy divided by the number of features used) of the following code from 0.03 to at least 0.45, while keeping the accuracy itself above 0.92.
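The arithmetic behind the requirement is worth spelling out, because it bounds how many features a solution can keep. The sketch below uses two hypothetical helper names, `accuracy_per_feature` and `max_features_for_target`, which are not part of the assignment code; the accuracy and feature count plugged in are the ones printed by the code further down.

```python
import math


def accuracy_per_feature(accuracy: float, n_features: int) -> float:
    """Ratio used in this assignment: accuracy divided by the feature count."""
    if n_features <= 0:
        raise ValueError("n_features must be positive")
    return accuracy / n_features


def max_features_for_target(target_ratio: float, best_accuracy: float = 1.0) -> int:
    """Largest feature count that could still reach target_ratio at best_accuracy."""
    return math.floor(best_accuracy / target_ratio)


# Values printed by the assignment code: accuracy 0.9613... with 18 features.
baseline = accuracy_per_feature(0.9613258810743673, 18)
print(round(baseline, 4))                        # 0.0534

# Accuracy cannot exceed 1.0, so a ratio of 0.45 allows at most 2 features.
print(max_features_for_target(0.45))             # 2

# With exactly 2 features, the accuracy floor of 0.92 corresponds to 0.46.
print(round(accuracy_per_feature(0.92, 2), 2))   # 0.46
```

In other words, any solution has to work with one or two features, whether selected or constructed.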

Here are three hints:

1. You can improve the ratio (accuracy per feature) by picking out or "creating" several features.
2. Tune the hyperparameters.
3. The ratio can be improved from 0.03 up to 0.47.
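As a rough illustration of the first hint, the sketch below compares one way of "picking out" features with one way of "creating" them, both reduced to two features. The specific choices (`SelectKBest` with `f_classif`, `PCA` after `StandardScaler`, and the constant `N_FEATURES = 2`) are assumptions made for this example and are not prescribed by the assignment; other selectors or transformations may work as well or better.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

N_FEATURES = 2  # assumption: small enough that accuracy / N_FEATURES can reach 0.45

candidates = {
    # "Picking out": keep the N_FEATURES columns with the highest ANOVA F-score.
    "pick (SelectKBest)": make_pipeline(
        SelectKBest(f_classif, k=N_FEATURES),
        RandomForestClassifier(n_estimators=100, random_state=42),
    ),
    # "Creating": project all 30 standardized columns onto N_FEATURES components.
    "create (PCA)": make_pipeline(
        StandardScaler(),
        PCA(n_components=N_FEATURES),
        RandomForestClassifier(n_estimators=100, random_state=42),
    ),
}

results = {}
for name, model in candidates.items():
    # The whole pipeline is refit inside each fold, so the selector or PCA
    # never sees the validation fold.
    acc = cross_val_score(model, X, y, cv=kf, scoring="accuracy").mean()
    results[name] = (acc, acc / N_FEATURES)
    print(f"{name}: accuracy={acc:.4f}, per feature={acc / N_FEATURES:.4f}")
```

Wrapping the feature step and the classifier in one pipeline keeps the comparison fair: each fold fits its own selector or projection on the training portion only.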

```python
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Load the breast_cancer dataset as a feature matrix and a label vector
X, y = load_breast_cancer(return_X_y=True)

print(X.shape)

# Feature selection: keep features whose random-forest importance is >= 0.01.
# SelectFromModel fits its own clone of the estimator, so no separate fit is needed.
# Note: selecting on the full dataset lets the selector see the validation folds.
rf_for_feature_selection = RandomForestClassifier(random_state=42)
sfm = SelectFromModel(rf_for_feature_selection, threshold=0.01)
sfm.fit(X, y)
X_selected = sfm.transform(X)
print("Selected features data shape:", X_selected.shape)

# 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Hyperparameter tuning
param_grid = {
    'n_estimators': [100, 150],  # Reduced range
    'max_depth': [None, 30],     # Reduced range
    'min_samples_split': [2],
    'criterion': ['gini']
}
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=kf,
    scoring='accuracy'
)

print("Starting grid search...")
grid_search.fit(X_selected, y)
print("Grid search completed.")

# Mean cross-validated accuracy of the best parameter combination
average_accuracy = grid_search.best_score_
print("Average accuracy:", average_accuracy)

# Average accuracy per feature: mean accuracy divided by the number of selected features
average_accuracy_per_feature = average_accuracy / X_selected.shape[1]
print("Average accuracy per feature:", average_accuracy_per_feature)
```

```
(569, 30)
Selected features data shape: (569, 18)
Starting grid search...
Grid search completed.
Average accuracy: 0.9613258810743673
Average accuracy per feature: 0.05340699339302041
```
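With 18 features kept, the ratio above stays far from the 0.45 target, so the number of features itself probably needs to be tuned alongside the forest's hyperparameters. The following is a minimal sketch of one way to do that, not the assignment's reference solution: the selector (`SelectKBest` with `f_classif`), the grid values, and the scorer `per_feature_accuracy` are all assumptions introduced for illustration. It relies on `GridSearchCV` accepting a dictionary of scorers, where a callable scorer has the signature `(estimator, X, y)`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Selection lives inside the pipeline, so it is refit on each training fold.
pipe = Pipeline([
    ("select", SelectKBest(f_classif)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
])


def per_feature_accuracy(estimator, X_val, y_val):
    """Hypothetical scorer: fold accuracy divided by the number of kept features."""
    n_kept = estimator.named_steps["select"].get_support().sum()
    return estimator.score(X_val, y_val) / n_kept


search = GridSearchCV(
    pipe,
    param_grid={"select__k": [1, 2, 3], "rf__max_depth": [None, 5]},
    cv=kf,
    scoring={"accuracy": "accuracy", "per_feature": per_feature_accuracy},
    refit=False,  # only the cross-validation table is needed here
)
search.fit(X, y)

acc = search.cv_results_["mean_test_accuracy"]
ratio = search.cv_results_["mean_test_per_feature"]

# Keep only candidates that satisfy the accuracy constraint, then maximize the ratio.
meets_constraint = acc > 0.92
if meets_constraint.any():
    best = int(np.argmax(np.where(meets_constraint, ratio, -np.inf)))
    print("Best params:", search.cv_results_["params"][best])
    print("Accuracy:", acc[best], "per feature:", ratio[best])
else:
    print("No candidate reached accuracy > 0.92; widen the grid.")
```

Reporting both metrics side by side makes the trade-off explicit: dropping to a single feature doubles the denominator's advantage but only helps if accuracy stays above the 0.92 floor.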
