# Boulevard of broken analyses

## Outlier detection

Initial model R² was ~86%. After outlier removal (a row was dropped when more than 10% of its features were flagged as outliers), R² dropped to between 45% and 65%.

In [2]:
import pandas as pd
from sklearn.ensemble import IsolationForest

def detect_outliers(data, contamination=0.05, random_state=None):
    """
    Detect outliers/anomalies in each feature of the data using Isolation Forest algorithm.
    
    Parameters:
        data (DataFrame): The input data for outlier detection.
        contamination (float, optional): The proportion of outliers/anomalies in the data.
                                         Defaults to 0.05.
        random_state (int, RandomState instance or None, optional): 
            Controls the random seed for reproducibility. Defaults to None.
    
    Returns:
        DataFrame: A DataFrame indicating whether each data point is an outlier/anomaly (-1) or not (0) for each feature.
    """
    # Configure the Isolation Forest; it is refit on each feature in the loop below
    model = IsolationForest(contamination=contamination, random_state=random_state)
    
    # Collect one Series of predictions per feature
    outlier_columns = []
    
    # Detect outliers for each feature
    for column in data.columns:
        # Fit the model and predict outliers for the current feature
        outliers = model.fit_predict(data[[column]].values)
        
        # Wrap the predictions in a Series named after the feature so column labels survive the concat
        outlier_series = pd.Series(outliers, index=data.index, name=column)
        
        # Append the outlier predictions to the list
        outlier_columns.append(outlier_series)
    
    # Concatenate the outlier predictions into a DataFrame
    outlier_df = pd.concat(outlier_columns, axis=1)
    
    # Map inliers (predicted as 1) to 0; outliers keep the value -1
    outlier_df[outlier_df != -1] = 0
    
    return outlier_df

In [None]:
outlier_df = detect_outliers(data, random_state=42)

n_i = data.shape[0]

# Because there are so many features, a row is removed when more than 10% of its features are flagged.
# detect_outliers already returns a DataFrame (-1 = outlier, 0 = inlier), so no conversion is needed.
mask = outlier_df

# Calculate the percentage of -1 values in each row
percentage_of_minus_1 = (mask == -1).sum(axis=1) / mask.shape[1]

# Filter rows where over 10% of the columns have -1
rows_to_remove = percentage_of_minus_1 > 0.10

# Remove rows from the original DataFrame
data = data[~rows_to_remove]

n_f = data.shape[0]

print(f"Removed a total of {n_i - n_f} rows flagged as outliers from the data.")
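
The row-removal rule can be checked on a toy mask. This is a minimal sketch, assuming the same encoding as `detect_outliers` (-1 for an outlier, 0 otherwise); the three rows and 20 features are hypothetical and not taken from the real data.

```python
import pandas as pd

# Hypothetical mask: 3 rows, 20 features, all inliers (0) to start with
n_features = 20
mask = pd.DataFrame(0, index=["a", "b", "c"], columns=range(n_features))
mask.loc["b", [0]] = -1        # 1 of 20 features flagged -> 5%
mask.loc["c", [0, 1, 2]] = -1  # 3 of 20 features flagged -> 15%

# Fraction of flagged features per row, then the strict 10% threshold
outlier_fraction = (mask == -1).mean(axis=1)
rows_to_remove = outlier_fraction > 0.10

print(rows_to_remove.tolist())  # [False, False, True]
```

Only row `c` is dropped. Because the comparison is strict, a row sitting exactly at 10% would be kept.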

## Random Search (HPO)

The random search did not work out as expected. The following response from ChatGPT estimates the size of the search space:

```
To find the total number of combinations, multiply these numbers:

251 (n_estimators) * ∞ (learning_rate) * 8 (max_depth) * 20 (min_samples_split) * 10 (min_samples_leaf) * ∞ (subsample) * 3 (max_features) * ∞ (alpha) * ∞ (tol) * 2 (warm_start)

However, it's important to note that learning_rate, subsample, alpha, and tol have continuous distributions, so technically, there are infinitely many possibilities within their defined ranges. Thus, the actual number of possible combinations is practically infinite for these parameters.

For the discrete parameters (n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features, warm_start), the total number of combinations is:

251 * 8 * 20 * 10 * 3 * 2 = 1,204,800 combinations.

However, considering the continuous parameters, the search space is effectively much larger, and exhaustive search across all combinations would be practically infeasible. This is one of the reasons why randomized search is preferred for hyperparameter optimization in such cases.
```
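
The counts in the quoted response can be checked with a little arithmetic. This is a minimal sketch, assuming the response refers to the `param_dist` defined in the cell that follows. `scipy.stats.randint(low, high)` excludes `high`, so each integer range has one value fewer than the response states, and the product of the quoted counts is not the figure the response gives.

```python
from math import prod

# Discrete hyperparameters, counted from the bounds used in param_dist.
# randint(low, high) draws from low..high-1, which matches range(low, high).
discrete_sizes = {
    "n_estimators": len(range(50, 300)),     # randint(50, 300)  -> 250 values
    "max_depth": len(range(3, 10)),          # randint(3, 10)    -> 7 values
    "min_samples_split": len(range(2, 21)),  # randint(2, 21)    -> 19 values
    "min_samples_leaf": len(range(1, 11)),   # randint(1, 11)    -> 10 values
    "max_features": 3,                       # [1.0, 'sqrt', 'log2']
    "warm_start": 2,                         # [True, False]
}

n_discrete = prod(discrete_sizes.values())
print(f"{n_discrete:,} discrete combinations")  # 1,995,000

# Product of the counts exactly as written in the quoted response
quoted_product = prod([251, 8, 20, 10, 3, 2])
print(f"{quoted_product:,} from the quoted counts")  # 2,409,600
```

Either way the conclusion of the response holds: the discrete part alone has about two million combinations, and the continuous parameters make an exhaustive search infeasible.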

In [None]:
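from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Define the parameter distributions for random search.
# randint(low, high) excludes high; uniform(loc, scale) samples from [loc, loc + scale].
param_dist = {
    'n_estimators': randint(50, 300),  # Number of boosting stages (50 to 299)
    'learning_rate': uniform(0.01, 0.2 - 0.01),  # Learning rate, sampled from [0.01, 0.2]
    'max_depth': randint(3, 10),  # Maximum depth of the individual estimators (3 to 9)
    'min_samples_split': randint(2, 21),  # Minimum number of samples required to split an internal node (2 to 20)
    'min_samples_leaf': randint(1, 11),  # Minimum number of samples required to be at a leaf node (1 to 10)
    'subsample': uniform(0.6, 0.4),  # Fraction of samples used to fit each base learner, sampled from [0.6, 1.0]
    'max_features': [1.0, 'sqrt', 'log2'],  # Number of features to consider at each split
    'alpha': uniform(0.0, 0.1),  # Quantile for the huber and quantile losses; not a regularization term, and only used with those losses
    'tol': uniform(1e-5, 1e-3),  # Early-stopping tolerance, sampled from [1e-5, 1.01e-3]; only used when n_iter_no_change is set
    'warm_start': [True, False]  # Whether to reuse the solution of the previous call to fit as initialization
}

# Initialize GradientBoostingRegressor
gb_reg = GradientBoostingRegressor()

# Initialize RandomizedSearchCV (GridSearchCV does not accept param_distributions, n_iter or random_state).
# n_iter=1000 with cv=5 means 5,000 model fits, plus one final refit on the full training set.
random_search = RandomizedSearchCV(gb_reg, param_distributions=param_dist, n_iter=1000, cv=5, scoring='r2', random_state=42)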

# Perform random search
random_search.fit(X_train, y_train)

# Get the best parameters
best_params = random_search.best_params_
print("Best parameters:", best_params)

# Evaluate the best model
best_model = random_search.best_estimator_
test_score = best_model.score(X_test, y_test)
print("Test R2 score of the best model:", test_score)