I would break the problem down into two main tasks:
1. Predicting the numerical outcome
2. Predicting the time to reach that outcome

Let's extend the code by adding all the regression models you want to try and computing the evaluation metrics (MAE, MSE, RMSE, and R-squared) for each model on both tasks. We'll use the following regression models:

1. Linear Regression
2. Decision Tree Regressor
3. Random Forest Regressor
4. Gradient Boosting Regressor
5. Support Vector Regressor (SVR)
6. K-Nearest Neighbors Regressor (KNN)

In [33]:
import pandas as pd  # Data handling
import numpy as np  # Numerical operations
from sklearn.model_selection import train_test_split  # Splitting dataset
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error  # Evaluation metrics

In [34]:
# Regressor models

from sklearn.linear_model import LinearRegression  
# Linear Regression: Basic linear model, fits a linear equation to data.

from sklearn.tree import DecisionTreeRegressor  
# Decision Tree Regressor: Non-linear model that splits the data into regions based on feature values.

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor  
# Random Forest Regressor: Ensemble of decision trees, reduces overfitting by averaging multiple tree predictions.
# Gradient Boosting Regressor: Sequentially builds models, focusing on errors from the previous model to improve performance.

from sklearn.svm import SVR  
# Support Vector Regressor: Fits a function that ignores errors inside an epsilon margin; the default RBF kernel makes it non-linear.

from sklearn.neighbors import KNeighborsRegressor  
# K-Nearest Neighbors Regressor: Predicts the value of a point based on the average of its k-nearest neighbors.


In [35]:
np.random.seed(42)

np.random.seed(42): Setting the random seed makes NumPy's global random number generator produce the same sequence every time the code is run, so the synthetic data generated below is identical across runs. The choice of 42 is arbitrary. The train-test split and the tree-based models below are made repeatable separately, through their own random_state=42 arguments. Reproducible runs are essential for debugging and for comparing models fairly.
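
A quick way to see the effect is the minimal sketch below, which uses only NumPy. The variable names (first, second, third) and the second seed value 7 are made up for the illustration. Reseeding with the same value replays the same draws, while a different seed gives a different sequence.

```python
import numpy as np

np.random.seed(42)
first = np.random.rand(3)

np.random.seed(42)           # reseed with the same value
second = np.random.rand(3)   # replays the same three draws

np.random.seed(7)            # a different, equally arbitrary seed
third = np.random.rand(3)

print("same seed, same numbers:", np.array_equal(first, second))
print("different seed, same numbers:", np.array_equal(first, third))
```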

In [36]:
# Synthetic placeholder data is generated here;
# replace this with your actual dataset
n_samples = 100  # Number of rows
n_features = 5   # Number of feature columns

In [37]:
X = np.random.rand(n_samples, n_features)   # Generate random features and target data
y_outcome = np.random.rand(n_samples) * 100  # Outcome target
y_time = np.random.randint(1, 60, size=n_samples)  # Time target

In [38]:


# Create DataFrame from generated data
# X.shape[0] is the number of rows (100); X.shape[1] is the number of columns (5)
df = pd.DataFrame(X, columns=[f'Feature_{i+1}' for i in range(X.shape[1])])
df['Outcome'] = y_outcome
df['Time'] = y_time

# Display the first few rows of the DataFrame
print(df.head())


   Feature_1  Feature_2  Feature_3  Feature_4  Feature_5    Outcome  Time
0   0.374540   0.950714   0.731994   0.598658   0.156019  69.816171     5
1   0.155995   0.058084   0.866176   0.601115   0.708073  53.609637    18
2   0.020584   0.969910   0.832443   0.212339   0.181825  30.952762    28
3   0.183405   0.304242   0.524756   0.431945   0.291229  81.379502    42
4   0.611853   0.139494   0.292145   0.366362   0.456070  68.473117    22


In [39]:
df.tail()

    Feature_1  Feature_2  Feature_3  Feature_4  Feature_5    Outcome  Time
95   0.992965   0.073797   0.553854   0.969303   0.523098  47.396164    22
96   0.629399   0.695749   0.454541   0.627558   0.584314  66.755774    34
97   0.901158   0.045446   0.280963   0.950411   0.890264  17.231987    47
98   0.455657   0.620133   0.277381   0.188121   0.463698  19.228902     8
99   0.353352   0.583656   0.077735   0.974395   0.986211   4.086862    40


In [40]:
df.describe().T


           count       mean        std       min        25%        50%        75%        max
Feature_1  100.0   0.504711   0.297463  0.009197   0.279637   0.502326    0.77347   0.992965
Feature_2  100.0   0.525232   0.299905  0.011354   0.281719   0.556379   0.772236   0.990054
Feature_3  100.0   0.487698   0.294821  0.005522    0.19363   0.513164   0.746157   0.966655
Feature_4  100.0   0.521254    0.29322  0.005062   0.297294   0.532713    0.79842   0.974395
Feature_5  100.0   0.453913    0.30813  0.015457   0.167322   0.419329   0.726623   0.986887
Outcome    100.0  52.529974  29.712209  1.454467  25.552032  52.820203  81.114025  99.971767
Time       100.0      30.33  16.536495       1.0       17.0       27.5      45.25       59.0


In [41]:
df.shape


(100, 7)

In [42]:
df.isnull().sum()

Feature_1    0
Feature_2    0
Feature_3    0
Feature_4    0
Feature_5    0
Outcome      0
Time         0
dtype: int64

In [43]:
# Split data into training and testing sets
X_train, X_test, y_outcome_train, y_outcome_test, y_time_train, y_time_test = train_test_split(
    X, y_outcome, y_time, test_size=0.2, random_state=42)
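
Passing several arrays to train_test_split splits them all with the same row selection, so each test row keeps its own outcome and time values. The sketch below checks this on stand-in data. The names X_demo, y_a, y_b and row_ids are assumptions made for the illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)           # local generator, leaves the global seed alone
X_demo = rng.rand(100, 5)                # stand-in features
y_a = rng.rand(100) * 100                # stand-in outcome target
y_b = rng.randint(1, 60, size=100)       # stand-in time target
row_ids = np.arange(100)                 # original row positions, split alongside the data

X_tr, X_te, a_tr, a_te, b_tr, b_te, id_tr, id_te = train_test_split(
    X_demo, y_a, y_b, row_ids, test_size=0.2, random_state=42)

# Every array is cut at the same rows, so looking the test rows up
# in the original arrays reproduces the test arrays exactly.
aligned = (np.array_equal(X_demo[id_te], X_te)
           and np.array_equal(y_a[id_te], a_te)
           and np.array_equal(y_b[id_te], b_te))

print(X_tr.shape, X_te.shape)            # 80 training rows, 20 test rows
print("targets stay aligned with features:", aligned)
```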

In [44]:
# Initialize all regression models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    "Support Vector Regressor": SVR(),
    "K-Nearest Neighbors": KNeighborsRegressor()
}

No randomness is required:

Linear Regression: Linear Regression does not involve randomness in its training process. It computes the least-squares best-fit line directly, so it takes no random_state.

Support Vector Regressor: SVR solves a convex optimization problem to find the best fit. scikit-learn's SVR takes no random_state, and for the same data and settings it is expected to return the same fit each time.

K-Nearest Neighbors Regressor: KNN is a non-parametric method that predicts from the average of the nearest training points, so there is nothing random to seed.

Randomness is involved:

Decision Tree Regressor: scikit-learn shuffles the candidate features at each split. This matters when max_features limits how many features are considered or when several splits are equally good; random_state fixes the choice.

Random Forest Regressor: Random Forests build multiple decision trees on bootstrap samples of the data, optionally with a random subset of features considered at each split.

Gradient Boosting Regressor: Gradient Boosting builds models sequentially, where each model corrects the errors of the previous one. Randomness enters through the underlying trees and, when subsample is below 1.0, through the rows sampled for each stage.
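
The difference can be checked with a small experiment. This is a sketch on made-up data: fit_predict, X_demo and y_demo are hypothetical names, and n_estimators=20 is chosen only to keep it fast. A forest trained twice with the same random_state should give identical predictions, and KNN should repeat itself without any seed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X_demo = rng.rand(60, 5)     # made-up features
y_demo = rng.rand(60)        # made-up target

def fit_predict(model):
    """Fit on the demo data and predict the first five rows."""
    return model.fit(X_demo, y_demo).predict(X_demo[:5])

rf_a = fit_predict(RandomForestRegressor(n_estimators=20, random_state=1))
rf_b = fit_predict(RandomForestRegressor(n_estimators=20, random_state=1))
rf_c = fit_predict(RandomForestRegressor(n_estimators=20, random_state=2))
knn_a = fit_predict(KNeighborsRegressor())
knn_b = fit_predict(KNeighborsRegressor())

print("forest, same seed, identical:", np.allclose(rf_a, rf_b))
print("forest, other seed, identical:", np.allclose(rf_a, rf_c))
print("KNN, no seed, identical:", np.allclose(knn_a, knn_b))
```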





In [45]:
# Dictionaries to store evaluation metrics
outcome_metrics = {} 
time_metrics = {}

Why use a dictionary to store the metrics?
Dictionaries keyed by model name keep the results organized, easy to look up, and easy to extend when more models are added, which makes it straightforward to report the metrics for each model. For example (illustrative values, not real results):
outcome_metrics['Linear Regression'] = {'MAE': 5.4, 'MSE': 30.2, 'RMSE': 5.5}
print(outcome_metrics['Linear Regression']['MAE'])  # Output: 5.4
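
One more convenience of this layout is that a dictionary of dictionaries converts directly into a pandas table for side-by-side comparison. The sketch below uses placeholder models and numbers (demo_metrics, "Model A", "Model B"), all invented for the illustration.

```python
import pandas as pd

# Placeholder metrics, invented for the illustration
demo_metrics = {
    "Model A": {"MAE": 5.4, "MSE": 30.2, "RMSE": 5.5},
    "Model B": {"MAE": 4.9, "MSE": 27.0, "RMSE": 5.2},
}

# Outer keys become the row index, inner keys become the columns
table = pd.DataFrame.from_dict(demo_metrics, orient="index")
print(table.sort_values("RMSE"))

# Name of the model with the smallest RMSE
best = table["RMSE"].idxmin()
print("lowest RMSE:", best)
```

The same call should work on outcome_metrics and time_metrics once the evaluation loop has filled them.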



In [46]:
# Loop through each model and evaluate
for model_name, model in models.items():
    # Train and predict for outcome
    model.fit(X_train, y_outcome_train)
    y_outcome_pred = model.predict(X_test)
    # Refit the same estimator on the time target and predict
    # (this overwrites the outcome fit, so the outcome predictions are taken first)
    model.fit(X_train, y_time_train)
    y_time_pred = model.predict(X_test)
    # Evaluate outcome metrics
    mae_outcome = mean_absolute_error(y_outcome_test, y_outcome_pred)
    mse_outcome = mean_squared_error(y_outcome_test, y_outcome_pred)
    rmse_outcome = np.sqrt(mse_outcome)
    r2_outcome = r2_score(y_outcome_test, y_outcome_pred)
    # Evaluate time metrics
    mae_time = mean_absolute_error(y_time_test, y_time_pred)
    mse_time = mean_squared_error(y_time_test, y_time_pred)
    rmse_time = np.sqrt(mse_time)
    r2_time = r2_score(y_time_test, y_time_pred)
    # Store results in dictionaries
    outcome_metrics[model_name] = {"MAE": mae_outcome, "MSE": mse_outcome, "RMSE": rmse_outcome, "r2_score": r2_outcome}
    time_metrics[model_name] = {"MAE": mae_time, "MSE": mse_time, "RMSE": rmse_time, "r2_score": r2_time}
    
    


In [47]:
# MAE: Shows the average error size; simple and less sensitive to outliers than MSE.
# MSE: Highlights larger errors more because errors are squared; can be skewed by outliers.
# RMSE: Gives error magnitude in the same units as the target; interpretable but also sensitive to large errors.
# R-squared: Measures the proportion of the variance in the dependent variable that is predictable from
# the independent variables.
# Its maximum is 1 (a perfect fit); 0 matches always predicting the mean, and it is negative when the model does worse than that.
# Higher values indicate better performance.
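
To make the four metrics concrete, here is a small worked example computed by hand with NumPy. The arrays y_true, y_pred and bad_pred are made-up numbers, not taken from the dataset. The last part shows how R-squared drops below zero when predictions are worse than simply predicting the mean.

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0, 40.0])   # made-up targets
y_pred = np.array([12.0, 18.0, 33.0, 35.0])   # made-up predictions

errors = y_true - y_pred                      # [-2, 2, -3, 5]
mae = np.mean(np.abs(errors))                 # (2 + 2 + 3 + 5) / 4 = 3.0
mse = np.mean(errors ** 2)                    # (4 + 4 + 9 + 25) / 4 = 10.5
rmse = np.sqrt(mse)                           # square root of MSE, same units as the target

ss_res = np.sum(errors ** 2)                      # residual sum of squares = 42
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # spread around the mean = 500
r2 = 1 - ss_res / ss_tot                          # 1 - 42/500 = 0.916

# Predictions that are worse than the mean push R-squared below zero
bad_pred = np.array([40.0, 30.0, 20.0, 10.0])
r2_bad = 1 - np.sum((y_true - bad_pred) ** 2) / ss_tot   # 1 - 2000/500 = -3.0

print(f"MAE={mae:.2f} MSE={mse:.2f} RMSE={rmse:.2f} R2={r2:.3f} R2_bad={r2_bad:.1f}")
```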




In [48]:

# Print results for Outcome
print("Outcome Prediction Metrics:")
for model_name, metrics in outcome_metrics.items():
    print(f"\n{model_name}:")
    print(f"MAE: {metrics['MAE']:.2f}")
    print(f"MSE: {metrics['MSE']:.2f}")
    print(f"RMSE: {metrics['RMSE']:.2f}")
    print(f"r2_score: {metrics['r2_score']:.2f}")

print("\nTime Prediction Metrics:")
for model_name, metrics in time_metrics.items():
    print(f"\n{model_name}:")
    print(f"MAE: {metrics['MAE']:.2f}")
    print(f"MSE: {metrics['MSE']:.2f}")
    print(f"RMSE: {metrics['RMSE']:.2f}")
    print(f"r2_score: {metrics['r2_score']:.2f}")

Outcome Prediction Metrics:

Linear Regression:
MAE: 31.65
MSE: 1239.28
RMSE: 35.20
r2_score: -0.62

Decision Tree:
MAE: 33.80
MSE: 1546.33
RMSE: 39.32
r2_score: -1.02

Random Forest:
MAE: 34.42
MSE: 1510.97
RMSE: 38.87
r2_score: -0.98

Gradient Boosting:
MAE: 34.56
MSE: 1513.46
RMSE: 38.90
r2_score: -0.98

Support Vector Regressor:
MAE: 25.64
MSE: 904.22
RMSE: 30.07
r2_score: -0.18

K-Nearest Neighbors:
MAE: 28.53
MSE: 1095.49
RMSE: 33.10
r2_score: -0.43

Time Prediction Metrics:

Linear Regression:
MAE: 15.25
MSE: 307.23
RMSE: 17.53
r2_score: -0.10

Decision Tree:
MAE: 18.95
MSE: 512.25
RMSE: 22.63
r2_score: -0.84

Random Forest:
MAE: 14.56
MSE: 282.65
RMSE: 16.81
r2_score: -0.01

Gradient Boosting:
MAE: 15.14
MSE: 316.29
RMSE: 17.78
r2_score: -0.13

Support Vector Regressor:
MAE: 14.78
MSE: 280.03
RMSE: 16.73
r2_score: -0.00

K-Nearest Neighbors:
MAE: 15.84
MSE: 311.53
RMSE: 17.65
r2_score: -0.12


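Overall Observations
*Every model has an R-squared at or below zero on both targets, so none of them beats simply predicting the mean. This is expected here: the features and targets were generated independently at random, so there is no relationship to learn.
*SVR has the lowest MAE and RMSE for outcome prediction (25.64 and 30.07), with KNN second (28.53 and 33.10).
*For time prediction, SVR has the lowest RMSE (16.73) and Random Forest the lowest MAE (14.56); the two are nearly tied.
*Tree-based models (Decision Tree, Random Forest, Gradient Boosting) have the highest errors for outcome prediction, and the single Decision Tree has the highest RMSE on both targets.
*Linear Regression sits in the middle on both targets and is outperformed by SVR.
Recommendations
*For Outcome Prediction: SVR is the preferred model based on current metrics.
*For Time Prediction: SVR and Random Forest are effectively tied, with SVR slightly ahead on RMSE.
*Because the data is synthetic noise, these rankings should be rechecked on real data before choosing a model.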
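
The rankings can be checked directly from the printed numbers. The sketch below copies the MAE and RMSE values from the output above into two dictionaries and picks the model with the lowest value for each metric. The names outcome_results, time_results and best_by are invented for the illustration.

```python
# MAE and RMSE copied from the printed results above
outcome_results = {
    "Linear Regression": {"MAE": 31.65, "RMSE": 35.20},
    "Decision Tree": {"MAE": 33.80, "RMSE": 39.32},
    "Random Forest": {"MAE": 34.42, "RMSE": 38.87},
    "Gradient Boosting": {"MAE": 34.56, "RMSE": 38.90},
    "Support Vector Regressor": {"MAE": 25.64, "RMSE": 30.07},
    "K-Nearest Neighbors": {"MAE": 28.53, "RMSE": 33.10},
}
time_results = {
    "Linear Regression": {"MAE": 15.25, "RMSE": 17.53},
    "Decision Tree": {"MAE": 18.95, "RMSE": 22.63},
    "Random Forest": {"MAE": 14.56, "RMSE": 16.81},
    "Gradient Boosting": {"MAE": 15.14, "RMSE": 17.78},
    "Support Vector Regressor": {"MAE": 14.78, "RMSE": 16.73},
    "K-Nearest Neighbors": {"MAE": 15.84, "RMSE": 17.65},
}

def best_by(results, metric):
    """Return the model name with the lowest value of the given error metric."""
    return min(results, key=lambda name: results[name][metric])

for label, results in [("Outcome", outcome_results), ("Time", time_results)]:
    for metric in ("MAE", "RMSE"):
        print(f"{label}, lowest {metric}: {best_by(results, metric)}")
```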