<font size='5'>**Introduction**</font>

**Title: Versatile Machine Learning Pipeline**

Description:
This script implements a comprehensive machine learning pipeline designed to handle both classification and regression tasks. It processes a dataset specified in a configuration file ("algoparams_from_ui.json"), performs feature engineering, trains a model, and evaluates its performance. The pipeline includes the following key stages:

1. **Data Loading and Preprocessing**:
   - Loads the dataset from a CSV file specified in the configuration.
   - Handles numerical and categorical features; categorical features are converted to numerical columns with `FeatureHasher`.

2. **Feature Engineering**:
   - Generates new features through linear, polynomial, and explicit pairwise interactions as defined in the configuration.
   - Applies `RobustScaler` to the generated linear-interaction feature so that outliers have less influence on its scale.

3. **Feature Reduction**:
   - Supports multiple feature reduction techniques (e.g., Correlation with Target, Tree-based, PCA) to reduce dimensionality while retaining the most important features.
   - The method and number of features to keep are specified in the configuration.

4. **Model Training**:
   - Supports a variety of algorithms (e.g., RandomForest, Gradient Boosting, Linear Models, SVM, XGBoost, etc.) for both regression and classification tasks.
   - Uses hyperparameter tuning with Grid Search if specified, and applies sample weighting for classification tasks if configured.
   - Trains the model on a split dataset (80% train, 20% test) using `train_test_split`.

5. **Evaluation**:
   - For classification tasks, evaluates the model using F1 Score, Accuracy, Recall, and Precision.
   - For regression tasks, computes Mean Squared Error (MSE) and R² Score.
   - Ensures robust evaluation by handling prediction types appropriately and providing clear metrics output.

The pipeline is highly configurable through the "algoparams_from_ui.json" file, allowing users to customize feature handling, model selection, hyperparameter tuning, and evaluation metrics. This script is designed to be flexible and reusable for various datasets and machine learning tasks.
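
As a rough orientation, the sketch below lists the sections of `design_state_data` that the cells in this notebook read, and checks that a loaded configuration contains the ones that are indexed directly. The helper name `missing_sections` and all sample values are hypothetical; the real file contains many more keys.

```python
# Sections the notebook indexes directly (a missing one raises KeyError).
REQUIRED_SECTIONS = (
    "session_info", "target", "feature_handling",
    "feature_generation", "feature_reduction", "algorithms",
)
# Sections read with .get(), so they may be absent.
OPTIONAL_SECTIONS = ("probability_calibration", "hyperparameters", "weighting_strategy")


def missing_sections(config):
    """Return the required sections absent from config['design_state_data']."""
    state = config.get("design_state_data", {})
    return [name for name in REQUIRED_SECTIONS if name not in state]


# Hypothetical, heavily trimmed configuration for illustration only.
example_config = {
    "design_state_data": {
        "session_info": {"dataset": "iris.csv"},
        "target": {"target": "petal_width", "prediction_type": "Regression"},
        "feature_handling": {},
        "feature_generation": {},
        "feature_reduction": {"feature_reduction_method": "No Reduction"},
    }
}
print(missing_sections(example_config))
```

Running a check like this right after `json.load` would surface a malformed configuration before any data is processed.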

<font size='5'>**Importing Dataset and Loading JSON File**</font>

### Loading JSON File

In [1]:
import warnings

In [2]:
import json
with open("algoparams_from_ui.json") as f:
    config=json.load(f)

### Importing Dataset

In [3]:
import pandas as pd
original_data = pd.read_csv(config['design_state_data']['session_info']['dataset'])
df=original_data

In [4]:
df.head(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


<font size='5'>**Preprocessing**</font>

### Feature Handling

In [5]:
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
df = pd.DataFrame(df)
for col in config['design_state_data']['feature_handling']:
    if config['design_state_data']['feature_handling'][col]['feature_variable_type']=='numerical':
        continue
    else:
        data_values = df[col].tolist()
        data_list = [[values] for values in data_values]
        n_features = df[col].value_counts().count()-1
        hasher = FeatureHasher(n_features=n_features, input_type='string')
        hashed_features = hasher.fit_transform(data_list)
        hashed_features_dense = hashed_features.toarray()
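        # Name the hashed columns after the column being encoded instead of hardcoding 'species'
        hashed_df = pd.DataFrame(hashed_features_dense, columns=[f'{col}_hashed_{i}' for i in range(n_features)], index=df.index)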
        df = df.drop(col, axis=1)
        df = pd.concat([df, hashed_df], axis=1)

In [6]:
df.sample(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species_hashed_0,species_hashed_1
99,5.7,2.8,4.1,1.3,-1.0,0.0
18,5.7,3.8,1.7,0.3,0.0,1.0
137,6.4,3.1,5.5,1.8,0.0,-1.0
94,5.6,2.7,4.2,1.3,-1.0,0.0
40,5.0,3.5,1.3,0.3,0.0,1.0
4,5.0,3.6,1.4,0.2,0.0,1.0
118,7.7,2.6,6.9,2.3,0.0,-1.0
39,5.1,3.4,1.5,0.2,0.0,1.0
134,6.1,2.6,5.6,1.4,0.0,-1.0
41,4.5,2.3,1.3,0.3,0.0,1.0
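
To make the encoding step more concrete, here is a small sketch of what `FeatureHasher` does to a single string column. The sample labels are taken from the dataset preview; the variable names are illustrative only.

```python
from sklearn.feature_extraction import FeatureHasher

labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

# Each sample is a list holding one string token, as in the cell above.
hasher = FeatureHasher(n_features=2, input_type="string")
hashed = hasher.fit_transform([[label] for label in labels]).toarray()

# Every row has a single non-zero entry of +1 or -1: the token is hashed
# to one of the n_features buckets and given a hash-derived sign.
for label, row in zip(labels, hashed):
    print(label, row)
```

Because the number of buckets is small, two different categories can in principle land in the same bucket with the same sign and become indistinguishable. In the sample output above the three species map to distinct patterns, but that is not guaranteed for other columns.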


<font size='5'>**Feature Engineering**</font>

### Feature Generation

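1) When an interaction refers to a categorical column that was hashed into several columns (here `species`), the hashed columns are summed into a helper column, 'species_sum', so that the interaction uses the whole encoded feature rather than a single hashed column.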
2) A warning is generated if an unknown feature generation technique is used that is not defined in the JSON file.

In [7]:
from sklearn.preprocessing import RobustScaler

for interaction in config['design_state_data']['feature_generation']:
    if interaction == 'linear_interactions':
        for columns in config['design_state_data']['feature_generation'][interaction]:
            feat1, feat2 = columns
            if feat2 not in df.columns:
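                # Collect the hashed columns of a categorical feature (e.g. species_hashed_0, species_hashed_1)
                matching_cols = [col for col in df.columns if col.startswith(feat2 + '_hashed_')]
                if len(matching_cols) > 1:
                    df[f'{feat2}_sum'] = df[matching_cols].sum(axis=1)
                    feat2 = f'{feat2}_sum'
                elif matching_cols:
                    feat2 = matching_cols[0]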
            df[interaction] = df[feat1] + df[feat2]
            scaler = RobustScaler()
            df[[interaction]] = scaler.fit_transform(df[[interaction]])

    elif interaction == 'polynomial_interactions':
        for columns in config['design_state_data']['feature_generation'][interaction]:
            feat1, feat2 = columns.split('/')
            if feat2 not in df.columns:
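                # Collect the hashed columns of a categorical feature (e.g. species_hashed_0, species_hashed_1)
                matching_cols = [col for col in df.columns if col.startswith(feat2 + '_hashed_')]
                if len(matching_cols) > 1:
                    df[f'{feat2}_sum'] = df[matching_cols].sum(axis=1)
                    feat2 = f'{feat2}_sum'
                elif matching_cols:
                    feat2 = matching_cols[0]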
            epsilon = 1e-6
            df[f'{feat1}_div_{feat2}'] = df[feat1] / (df[feat2] + epsilon)

    elif interaction == 'explicit_pairwise_interactions':
        for columns in config['design_state_data']['feature_generation'][interaction]:
            feat1, feat2 = columns.split('/')
            if feat2 not in df.columns:
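                # Collect the hashed columns of a categorical feature (e.g. species_hashed_0, species_hashed_1)
                matching_cols = [col for col in df.columns if col.startswith(feat2 + '_hashed_')]
                if len(matching_cols) > 1:
                    df[f'{feat2}_sum'] = df[matching_cols].sum(axis=1)
                    feat2 = f'{feat2}_sum'
                elif matching_cols:
                    feat2 = matching_cols[0]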
            epsilon = 1e-6
            df[f'{feat1}_div_{feat2}'] = df[feat1] / (df[feat2] + epsilon)

    else:
        warnings.warn(f"Unknown feature generation method: {interaction}")



In [8]:
df.sample(5)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species_hashed_0,species_hashed_1,linear_interactions,petal_length_div_sepal_width,species_sum,petal_width_div_species_sum,sepal_width_div_sepal_length,petal_width_div_sepal_length
112,6.8,3.0,5.5,2.1,0.0,-1.0,0.482143,1.833333,-1.0,-2.100002,0.441176,0.308823
4,5.0,3.6,1.4,0.2,0.0,1.0,-0.767857,0.388889,1.0,0.2,0.72,0.04
40,5.0,3.5,1.3,0.3,0.0,1.0,-0.839286,0.371428,1.0,0.3,0.7,0.06
132,6.4,2.8,5.6,2.2,0.0,-1.0,0.446429,1.999999,-1.0,-2.200002,0.4375,0.34375
77,6.7,3.0,5.0,1.7,-1.0,0.0,0.303571,1.666666,-1.0,-1.700002,0.447761,0.253731


### Feature Reduction 

1) The 'species_sum' column is removed from X, as it is only an intermediate column used during feature generation.
2) A warning is generated if an unknown feature reduction technique is used that is not defined in the JSON file.

In [9]:
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.utils.multiclass import type_of_target
import numpy as np
import warnings

target_column=config['design_state_data']['target']['target']
X = df.drop(columns=[target_column])

X = X.drop(columns=['species_sum'], errors='ignore')

y = df[target_column]

feature_reduction_cfg = config['design_state_data']['feature_reduction']
method = feature_reduction_cfg['feature_reduction_method']

X_reduced = X.copy()

target_type = type_of_target(y)

if method == 'No Reduction':
    pass  

elif method == 'Corr with Target':
    corr = X.corrwith(y).abs()
    top_features = corr.nlargest(int(feature_reduction_cfg['num_of_features_to_keep'])).index
    X_reduced = X[top_features]

elif method == 'Tree-based':
    if target_type in ['binary', 'multiclass']:
        model = RandomForestClassifier(n_estimators=int(feature_reduction_cfg['num_of_trees']),
                                      max_depth=int(feature_reduction_cfg['depth_of_trees']),
                                      random_state=42)
    else:
        model = RandomForestRegressor(n_estimators=int(feature_reduction_cfg['num_of_trees']),
                                      max_depth=int(feature_reduction_cfg['depth_of_trees']),
                                      random_state=42)

    model.fit(X, y)
    importances = model.feature_importances_

    top_features = X.columns[np.argsort(importances)[::-1][:int(feature_reduction_cfg['num_of_features_to_keep'])]]
    X_reduced = X[top_features]

elif method == 'PCA':
    pca = PCA(n_components=int(feature_reduction_cfg['num_of_features_to_keep']))
    X_pca = pca.fit_transform(X)
    # Keep X's index so the later concat with y aligns row by row
    X_reduced = pd.DataFrame(X_pca, columns=[f'PCA_{i+1}' for i in range(X_pca.shape[1])], index=X.index)

else:
    warnings.warn(f"Unknown reduction method: {method}")



df = pd.concat([X_reduced, y], axis=1)

In [10]:
df.sample(5)

Unnamed: 0,species_hashed_1,petal_width_div_species_sum,petal_length,petal_length_div_sepal_width,petal_width
81,0.0,-1.000001,3.7,1.541666,1.0
139,-1.0,-2.100002,5.4,1.741935,2.1
129,-1.0,-1.600002,5.8,1.933333,1.6
89,0.0,-1.300001,4.0,1.599999,1.3
121,-1.0,-2.000002,4.9,1.749999,2.0
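
The 'Corr with Target' option can be illustrated on a toy frame. This is a minimal sketch with made-up data; `top_k_by_correlation` is a hypothetical name for the two lines used in the cell above.

```python
import pandas as pd


def top_k_by_correlation(X, y, k):
    """Keep the k columns of X with the largest absolute Pearson correlation to y."""
    scores = X.corrwith(y).abs()
    return X[scores.nlargest(k).index]


X_demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],    # rises with y
    "b": [5.0, 4.0, 3.0, 2.0, 1.5],    # falls as y rises
    "c": [1.0, -1.0, 1.0, -1.0, 1.0],  # unrelated to y
})
y_demo = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])
print(top_k_by_correlation(X_demo, y_demo, k=2).columns.tolist())
```

Taking the absolute value means a strongly negative relationship (column `b`) ranks as high as a strongly positive one, so `a` and `b` are kept and `c` is dropped.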


<font size='5'>**Train Test Split**</font>

In [11]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=[target_column])
y = df[target_column]

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

<font size='5'>**Probability Calibration**</font>

In [12]:
probability_calibration = config['design_state_data'].get('probability_calibration', {})

# Helper function to wrap classifier with CalibratedClassifierCV if needed
def wrap_classifier_with_calibration(model, calibration_method):
    if calibration_method == "Sigmoid - Platt Scaling":
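        # Pass the estimator positionally: the keyword was renamed from
        # base_estimator to estimator in newer scikit-learn releases
        return CalibratedClassifierCV(model, method='sigmoid')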
    return model
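
For intuition, sigmoid (Platt) calibration maps a classifier's raw decision score to a probability through a fitted logistic curve. The sketch below shows only that mapping, assuming the coefficients `a` and `b` have already been estimated; the values used here are made up, and `CalibratedClassifierCV` fits them internally.

```python
import math


def platt_probability(score, a, b):
    """Map a raw decision score to a probability with Platt's sigmoid."""
    return 1.0 / (1.0 + math.exp(a * score + b))


# Hypothetical coefficients; in practice they are fitted on held-out data.
a, b = -2.0, 0.0
for score in (-2.0, 0.0, 2.0):
    print(score, round(platt_probability(score, a, b), 3))
```

With a negative `a`, larger scores map to higher probabilities, and a score of zero maps to 0.5 when `b` is zero.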

<font size='5'>**Model Selection**</font>

1) Lists are created for regression and classification tasks based on prediction_type to reduce redundancy.
2) A warning is generated if an unknown prediction_type is mentioned.

In [13]:
prediction_type=config['design_state_data']['target']['prediction_type']
if prediction_type=='Regression':
    algorithms=['RandomForestRegressor','GBTRegressor','LinearRegression','RidgeRegression','LassoRegression','ElasticNetRegression','DecisionTreeRegressor','xg_boost','SVM','KNN','SGD','extra_random_trees','neural_network']
    
elif prediction_type=='Classification':
    algorithms=['RandomForestClassifier','GBTClassifier','LogisticRegression','DecisionTreeClassifier','xg_boost','SVM','KNN','SGD','extra_random_trees','neural_network']

else:
    warnings.warn(f"Unknown prediction_type: {prediction_type}")
    

1) The first algorithm in the list whose `is_selected` flag is true is chosen for fitting.

In [14]:
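model_selected = None
for model in algorithms:
    if config['design_state_data']['algorithms'][model]['is_selected']:
        model_selected = model
        break
if model_selected is None:
    raise ValueError("No algorithm is marked as selected for this prediction_type.")
print(model_selected)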

RandomForestRegressor
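
The selection loop can also be written as a single expression. This is a small sketch on a hypothetical excerpt of the `algorithms` section; `first_selected` is an illustrative name, not part of the notebook.

```python
def first_selected(algorithms_cfg, candidates):
    """Return the first candidate whose 'is_selected' flag is true, else None."""
    return next(
        (name for name in candidates if algorithms_cfg.get(name, {}).get("is_selected")),
        None,
    )


# Hypothetical excerpt of config['design_state_data']['algorithms'].
algorithms_cfg = {
    "RandomForestRegressor": {"is_selected": True},
    "LinearRegression": {"is_selected": False},
    "xg_boost": {"is_selected": True},
}
candidates = ["LinearRegression", "RandomForestRegressor", "xg_boost"]
print(first_selected(algorithms_cfg, candidates))
```

As in the loop above, only the first selected algorithm in list order is used, even if several are flagged.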


<font size='5'>**Calibrated Models with JSON File**</font>

1) Calibrating the original models using model parameters from the JSON file.
2) Using parameters for hyperparameter tuning.
3) Performing probability calibration as well.

In [15]:
import numpy as np
from sklearn.ensemble import (
    RandomForestRegressor, RandomForestClassifier,
    GradientBoostingRegressor, GradientBoostingClassifier,
    ExtraTreesRegressor, ExtraTreesClassifier
)
from sklearn.linear_model import (
    LinearRegression, LogisticRegression,
    Ridge, Lasso, ElasticNet,
    SGDRegressor, SGDClassifier
)
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.svm import SVR, SVC
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.neural_network import MLPRegressor, MLPClassifier
from xgboost import XGBRegressor, XGBClassifier
from sklearn.calibration import CalibratedClassifierCV

# model_selected comes from the model-selection cell above; model_cfg holds its settings
model_cfg = config['design_state_data']['algorithms'][model_selected]


# Helper function to adjust n_jobs
def adjust_n_jobs(n_jobs):
    return 1 if n_jobs == 0 else n_jobs

model_mapping = {
    'RandomForestRegressor': RandomForestRegressor,
    'RandomForestClassifier': RandomForestClassifier,
    'GBTRegressor': GradientBoostingRegressor,
    'GBTClassifier': GradientBoostingClassifier,
    'LinearRegression': LinearRegression,
    'LogisticRegression': LogisticRegression,
    'RidgeRegression': Ridge,
    'LassoRegression': Lasso,
    'ElasticNetRegression': ElasticNet,
    'DecisionTreeRegressor': DecisionTreeRegressor,
    'DecisionTreeClassifier': DecisionTreeClassifier
}

if model_selected in ['RandomForestRegressor', 'RandomForestClassifier']:
    final_model = model_mapping[model_selected](
        n_estimators=np.random.randint(model_cfg['min_trees'], model_cfg['max_trees']),
        max_depth=np.random.randint(model_cfg['min_depth'], model_cfg['max_depth']),
        min_samples_leaf=np.random.randint(model_cfg['min_samples_per_leaf_min_value'], model_cfg['min_samples_per_leaf_max_value']),
        n_jobs=adjust_n_jobs(model_cfg['parallelism']),
        random_state=42
    )
    if model_selected == 'RandomForestClassifier' and probability_calibration.get('probability_calibration_method'):
        final_model = wrap_classifier_with_calibration(final_model, probability_calibration['probability_calibration_method'])
    param_grid = {
        'n_estimators': [model_cfg['min_trees'], model_cfg['max_trees']],
        'max_depth': [model_cfg['min_depth'], model_cfg['max_depth']],
        'min_samples_leaf': [model_cfg['min_samples_per_leaf_min_value'], model_cfg['min_samples_per_leaf_max_value']]
    }    

elif model_selected in ['GBTRegressor', 'GBTClassifier']:
    final_model = model_mapping[model_selected](
        n_estimators=np.random.randint(model_cfg['num_of_BoostingStages'][0], model_cfg['num_of_BoostingStages'][1]),
        learning_rate=np.random.uniform(model_cfg['min_stepsize'], model_cfg['max_stepsize']),
        subsample=np.random.uniform(model_cfg['min_subsample'], model_cfg['max_subsample']),
        max_depth=np.random.randint(model_cfg['min_depth'], model_cfg['max_depth']),
        random_state=42
    )
    if model_selected == 'GBTClassifier' and probability_calibration.get('probability_calibration_method'):
        final_model = wrap_classifier_with_calibration(final_model, probability_calibration['probability_calibration_method'])
    param_grid = {
        'n_estimators': [model_cfg['num_of_BoostingStages'][0], model_cfg['num_of_BoostingStages'][1]],
        'learning_rate': [model_cfg['min_stepsize'], model_cfg['max_stepsize']],
        'subsample': [model_cfg['min_subsample'], model_cfg['max_subsample']],
        'max_depth': [model_cfg['min_depth'], model_cfg['max_depth']]
    }
    
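elif model_selected == 'LinearRegression':
    # scikit-learn's LinearRegression is ordinary least squares: it accepts no
    # alpha, l1_ratio, max_iter, or random_state, so only n_jobs is passed.
    final_model = LinearRegression(n_jobs=adjust_n_jobs(model_cfg['parallelism']))
    param_grid = {}

elif model_selected == 'LogisticRegression':
    # Assumption: the config's regularization strength maps to C = 1 / regparam,
    # because LogisticRegression has no alpha parameter.
    regparam = np.random.uniform(model_cfg['min_regparam'], model_cfg['max_regparam'])
    final_model = LogisticRegression(
        penalty='elasticnet',
        solver='saga',  # the only solver that supports the elastic-net penalty
        C=1.0 / max(regparam, 1e-6),
        l1_ratio=np.random.uniform(model_cfg['min_elasticnet'], model_cfg['max_elasticnet']),
        max_iter=np.random.randint(model_cfg['min_iter'], model_cfg['max_iter']),
        n_jobs=adjust_n_jobs(model_cfg['parallelism']),
        random_state=42
    )
    if probability_calibration.get('probability_calibration_method'):
        final_model = wrap_classifier_with_calibration(final_model, probability_calibration['probability_calibration_method'])
    param_grid = {
        'C': [1.0 / max(model_cfg['max_regparam'], 1e-6), 1.0 / max(model_cfg['min_regparam'], 1e-6)],
        'l1_ratio': [model_cfg['min_elasticnet'], model_cfg['max_elasticnet']],
        'max_iter': [model_cfg['min_iter'], model_cfg['max_iter']]
    }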
elif model_selected in ['RidgeRegression', 'LassoRegression']:
    final_model = model_mapping[model_selected](
        alpha=np.random.uniform(model_cfg['min_regparam'], model_cfg['max_regparam']),
        max_iter=np.random.randint(model_cfg['min_iter'], model_cfg['max_iter']),
        random_state=42
    )
    param_grid = {
        'alpha': [model_cfg['min_regparam'], model_cfg['max_regparam']],
        'max_iter': [model_cfg['min_iter'], model_cfg['max_iter']]
    }

elif model_selected == 'ElasticNetRegression':
    final_model = model_mapping[model_selected](
        alpha=np.random.uniform(model_cfg['min_regparam'], model_cfg['max_regparam']),
        l1_ratio=np.random.uniform(model_cfg['min_elasticnet'], model_cfg['max_elasticnet']),
        max_iter=np.random.randint(model_cfg['min_iter'], model_cfg['max_iter']),
        random_state=42
    )
    param_grid = {
        'alpha': [model_cfg['min_regparam'], model_cfg['max_regparam']],
        'l1_ratio': [model_cfg['min_elasticnet'], model_cfg['max_elasticnet']],
        'max_iter': [model_cfg['min_iter'], model_cfg['max_iter']]
    }
    
elif model_selected in ['DecisionTreeRegressor', 'DecisionTreeClassifier']:
    tree_params = dict(
        max_depth=np.random.randint(model_cfg['min_depth'], model_cfg['max_depth'] + 1),
        min_samples_leaf=np.random.randint(model_cfg['min_samples_per_leaf'][0], model_cfg['min_samples_per_leaf'][1] + 1),
        random_state=42
    )
    param_grid = {
        'max_depth': [model_cfg['min_depth'], model_cfg['max_depth']],
        'min_samples_leaf': [model_cfg['min_samples_per_leaf'][0], model_cfg['min_samples_per_leaf'][1]]
    }
    if model_selected == 'DecisionTreeClassifier':
        # 'gini' and 'entropy' are classification criteria; the regressor keeps its default criterion
        tree_params['criterion'] = 'entropy' if model_cfg.get('use_entropy', False) else 'gini'
        param_grid['criterion'] = ['gini', 'entropy'] if model_cfg.get('use_entropy', False) else ['gini']
    final_model = model_mapping[model_selected](**tree_params)
    if model_selected == 'DecisionTreeClassifier' and probability_calibration.get('probability_calibration_method'):
        final_model = wrap_classifier_with_calibration(final_model, probability_calibration['probability_calibration_method'])
    
elif model_selected == 'xg_boost':
    model_class = XGBRegressor if prediction_type == 'Regression' else XGBClassifier
    final_model = model_class(
        booster='dart',
        n_jobs=adjust_n_jobs(model_cfg['parallelism']),
        max_depth=np.random.randint(model_cfg['max_depth_of_tree'][0], model_cfg['max_depth_of_tree'][1]),
        learning_rate=np.random.uniform(model_cfg['learningRate'][0], model_cfg['learningRate'][1]),
        reg_alpha=model_cfg['l1_regularization'][0],
        reg_lambda=model_cfg['l2_regularization'][0],
        gamma=model_cfg['gamma'][0],
        min_child_weight=model_cfg['min_child_weight'][0],
        subsample=model_cfg['sub_sample'][0],
        colsample_bytree=model_cfg['col_sample_by_tree'][0],
        random_state=42
    )
    if prediction_type == 'Classification' and probability_calibration.get('probability_calibration_method'):
        final_model = wrap_classifier_with_calibration(final_model, probability_calibration['probability_calibration_method'])
    param_grid = {
        'max_depth': [model_cfg['max_depth_of_tree'][0], model_cfg['max_depth_of_tree'][1]],
        'learning_rate': [model_cfg['learningRate'][0], model_cfg['learningRate'][1]],
        'reg_alpha': [model_cfg['l1_regularization'][0]],
        'reg_lambda': [model_cfg['l2_regularization'][0]],
        'gamma': [model_cfg['gamma'][0]],
        'min_child_weight': [model_cfg['min_child_weight'][0]],
        'subsample': [model_cfg['sub_sample'][0]],
        'colsample_bytree': [model_cfg['col_sample_by_tree'][0]]
    }
    
elif model_selected == 'SVM':
    model_class = SVR if prediction_type == 'Regression' else SVC
    final_model = model_class(
        C=model_cfg['c_value'],
        kernel='rbf',
        tol=model_cfg['tolerance'],
        max_iter=model_cfg['max_iter']
    )
    if prediction_type == 'Classification' and probability_calibration.get('probability_calibration_method'):
        final_model = wrap_classifier_with_calibration(final_model, probability_calibration['probability_calibration_method'])
    param_grid = {
        'C': [model_cfg['c_value']],
        'kernel': ['rbf'],
        'tol': [model_cfg['tolerance']],
        'max_iter': [model_cfg['max_iter']]
    }
    
elif model_selected == 'SGD':
    sgd_params = dict(
        alpha=np.random.uniform(model_cfg['alpha_value'][0], model_cfg['alpha_value'][1]),
        l1_ratio=0.15,
        tol=model_cfg['tolerance'],
        random_state=42
    )
    if prediction_type == 'Regression':
        # SGDRegressor has no n_jobs parameter and no logistic loss; its default loss is kept
        final_model = SGDRegressor(**sgd_params)
    else:
        # The logistic loss is named 'log_loss' in scikit-learn >= 1.1 ('log' in older releases)
        final_model = SGDClassifier(loss='log_loss', n_jobs=adjust_n_jobs(model_cfg['parallelism']), **sgd_params)
    if prediction_type == 'Classification' and probability_calibration.get('probability_calibration_method'):
        final_model = wrap_classifier_with_calibration(final_model, probability_calibration['probability_calibration_method'])
    param_grid = {
        'alpha': [model_cfg['alpha_value'][0], model_cfg['alpha_value'][1]],
        'l1_ratio': [0.15],
        'tol': [model_cfg['tolerance']]
    }


elif model_selected == 'KNN':
    model_class = KNeighborsRegressor if prediction_type == 'Regression' else KNeighborsClassifier
    final_model = model_class(
        n_neighbors=model_cfg['k_value'][0],
        weights='distance',
        algorithm='auto',
        p=model_cfg['p_value']
    )
    if prediction_type == 'Classification' and probability_calibration.get('probability_calibration_method'):
        final_model = wrap_classifier_with_calibration(final_model, probability_calibration['probability_calibration_method'])
    param_grid = {
        'n_neighbors': [model_cfg['k_value'][0]],
        'weights': ['distance'],
        'algorithm': ['auto'],
        'p': [model_cfg['p_value']]
    }
    
elif model_selected == 'extra_random_trees':
    model_class = ExtraTreesRegressor if prediction_type == 'Regression' else ExtraTreesClassifier
    final_model = model_class(
        n_estimators=np.random.randint(model_cfg['num_of_trees'][0], model_cfg['num_of_trees'][1]),
        max_depth=np.random.randint(model_cfg['max_depth'][0], model_cfg['max_depth'][1]),
        min_samples_leaf=np.random.randint(model_cfg['min_samples_per_leaf'][0], model_cfg['min_samples_per_leaf'][1]),
        max_features='sqrt', 
        n_jobs=adjust_n_jobs(model_cfg['parallelism']),
        random_state=42
    )
    # model_selected is 'extra_random_trees' in this branch, so test the prediction type instead
    if prediction_type == 'Classification' and probability_calibration.get('probability_calibration_method'):
        final_model = wrap_classifier_with_calibration(final_model, probability_calibration['probability_calibration_method'])
    param_grid = {
        'n_estimators': [model_cfg['num_of_trees'][0], model_cfg['num_of_trees'][1]],
        'max_depth': [model_cfg['max_depth'][0], model_cfg['max_depth'][1]],
        'min_samples_leaf': [model_cfg['min_samples_per_leaf'][0], model_cfg['min_samples_per_leaf'][1]]
    }
    
elif model_selected == 'neural_network':
    model_class = MLPRegressor if prediction_type == 'Regression' else MLPClassifier
    final_model = model_class(
        hidden_layer_sizes=(np.random.randint(model_cfg['hidden_layer_size'][0], model_cfg['hidden_layer_size'][1]),),
        alpha=np.random.uniform(model_cfg['min_alpha'], model_cfg['max_alpha']),
        early_stopping=model_cfg['early_stopping'],
        solver=model_cfg['solver'],
        shuffle=model_cfg['shuffle_data'],
        random_state=42
    )
    if prediction_type == 'Classification' and probability_calibration.get('probability_calibration_method'):
        final_model = wrap_classifier_with_calibration(final_model, probability_calibration['probability_calibration_method'])
    param_grid = {
        'hidden_layer_sizes': [(model_cfg['hidden_layer_size'][0],), (model_cfg['hidden_layer_size'][1],)],
        'alpha': [model_cfg['min_alpha'], model_cfg['max_alpha']]
    }    
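
Several branches above draw a starting value with `np.random.randint(low, high)`, which excludes `high` and raises a `ValueError` when the two bounds are equal. These calls are also unseeded, so the drawn values change between runs even though each estimator uses `random_state=42`. A possible alternative is sketched below; the helper names are hypothetical.

```python
import numpy as np

# A seeded generator makes the sampled hyperparameters reproducible.
rng = np.random.default_rng(42)


def sample_int(low, high):
    """Draw an integer from [low, high], both ends included."""
    return int(rng.integers(low, high, endpoint=True))


def sample_float(low, high):
    """Draw a float from [low, high); returns low when both bounds are equal."""
    return float(rng.uniform(low, high))


print(sample_int(10, 20), sample_int(5, 5), sample_float(0.1, 0.5))
```

With `endpoint=True` a configuration such as `min_depth == max_depth` simply yields that single value instead of failing.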

<font size='5'>**Hyper-parameter Tuning**</font>

In [16]:
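from sklearn.model_selection import GridSearchCV, KFold, StratifiedKFold, TimeSeriesSplit

def get_cv_strategy(prediction_type, cv_strategy, num_folds, shuffle, random_state, stratified):
    # Assumption: this helper was not defined in the notebook; the mapping from
    # config values to splitters below is a reconstruction, not the original logic.
    if 'time' in str(cv_strategy).lower():
        return TimeSeriesSplit(n_splits=num_folds)
    seed = random_state if shuffle else None  # random_state is only valid when shuffling
    if stratified and prediction_type == 'Classification':
        return StratifiedKFold(n_splits=num_folds, shuffle=shuffle, random_state=seed)
    return KFold(n_splits=num_folds, shuffle=shuffle, random_state=seed)

hyperparameter_strategy = config['design_state_data'].get('hyperparameters', {})
if hyperparameter_strategy.get('strategy') == "Grid Search":
    cv = get_cv_strategy(
        prediction_type=prediction_type,
        cv_strategy=hyperparameter_strategy.get('cross_validation_strategy', ''),
        num_folds=hyperparameter_strategy.get('num_of_folds', 5),
        shuffle=hyperparameter_strategy.get('shuffle_grid', False),
        random_state=hyperparameter_strategy.get('random_state', None),
        stratified=hyperparameter_strategy.get('stratified', False)
    )
    final_model = GridSearchCV(
        estimator=final_model,
        param_grid=param_grid,
        cv=cv,
        n_jobs=adjust_n_jobs(hyperparameter_strategy.get('parallelism', 1)),
        scoring=None  # default scoring for the model; GridSearchCV itself takes no random_state
    )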
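
Each `param_grid` built in the model cell lists only the minimum and maximum of a range, so grid search evaluates the corners of the range rather than values in between. The sketch below expands such a grid by hand to show how many candidates result; the numbers are hypothetical.

```python
from itertools import product

# Hypothetical two-point grid in the shape built by the model cell.
param_grid = {
    "n_estimators": [10, 30],
    "max_depth": [20, 30],
    "min_samples_leaf": [5, 50],
}
num_folds = 5

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
print(len(candidates))              # 8 candidates
print(len(candidates) * num_folds)  # 40 cross-validation fits
```

With three parameters and two values each there are 2 × 2 × 2 = 8 candidates, and each one is fitted once per fold.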

<font size='5'>**Weighting Strategy**</font>

In [17]:
weighting_strategy = config['design_state_data'].get('weighting_strategy', {})
if prediction_type == 'Classification' and weighting_strategy.get('weighting_strategy_method') == 'Sample weights':
    weight_variable = weighting_strategy.get('weighting_strategy_weight_variable')
    if weight_variable in X_train.columns:
        sample_weights = X_train[weight_variable].values
        final_model.fit(X_train, y_train, sample_weight=sample_weights)
    else:
        warnings.warn(f"Weight variable '{weight_variable}' not found in dataset columns; fitting without sample weights.")
        final_model.fit(X_train, y_train)
else:
    final_model.fit(X_train, y_train)


<font size='5'>**Model Prediction**</font>

In [18]:
y_pred=final_model.predict(X_test)

<font size='5'>**Model Evaluation**</font>

In [19]:
from sklearn.metrics import mean_squared_error, f1_score, accuracy_score, r2_score, recall_score, precision_score

if prediction_type == 'Classification':
    y_pred = final_model.predict(X_test)
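    # The default binary averaging fails on multiclass targets; use weighted averaging there
    average = 'binary' if y.nunique() <= 2 else 'weighted'
    f1 = f1_score(y_test, y_pred, average=average)
    accuracy = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred, average=average)
    precision = precision_score(y_test, y_pred, average=average)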
    print(f"F1 Score: {f1}")
    print(f"Accuracy: {accuracy}")
    print(f"Recall: {recall}")
    print(f"Precision: {precision}")
elif prediction_type == 'Regression':
    y_pred = final_model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"Mean Squared Error: {mse}")
    print(f"R2 Score: {r2}")
else:
    print("Unknown prediction type.")

Mean Squared Error: 0.004562253018305402
R2 Score: 0.99282275915245


<font size='5'>**Summary**</font>


1) This code can be applied to a range of tabular machine learning problems, supporting both classification and regression tasks with a wide range of algorithms.
2) It covers the main steps of an ML workflow: data loading, preprocessing with feature hashing and interaction generation, dimensionality reduction, model training with hyperparameter tuning, and evaluation using relevant metrics.
3) However, the provided JSON file must follow the same format as "algoparams_from_ui.json" for the pipeline to function correctly.
4) The file must be in **JSON Format**, not **Rich Format**; a **Rich Format** file must be converted to **JSON** first.

Conclusion: This pipeline offers a flexible, configuration-driven solution for classification and regression tasks, provided the configuration adheres to the specified JSON structure and is properly formatted.